Go and Sitecore Interlude

This is part of the same program I'm developing to generate, serialize and deserialize items, but it's a general helper method that I found very useful and that can be used in any Go program. It could be expanded to be more complete; I pretty much only covered the things I'm currently working with. You'll see what I mean.

The application is written as one application that does it all (generation, serialization, deserialization). So, it's looking for configuration settings for all aspects of those pieces of functionality. You generally don't want to serialize the same Sitecore paths that you want to generate code against. However, having the configuration in one file is not what I wanted. Here are the drawbacks.

If the configuration is in one file, you would have to update your configuration file in consecutive runs if you wanted to serialize, git fetch and merge, then deserialize. Your configuration would be committed and would be set for the next person that wants to run the program. You couldn't write bat files to run updates.

You could use the flag package to control the program pieces. Of course. But I set out to have multiple configs, for instance if you wanted to serialize from a shared database and then deserialize to your local database. You could also make the config file a flag and point it at different huge files that differ only by the connection string.

You could.

But then I wouldn't have this cool piece of code :)

Basically, when you run the program, you call it with a "-c" flag which takes a comma-separated list of config files. The program reads them in order and merges them, having configuration values later in the chain overwrite values in the previous versions. I do this using Go's reflect package, as follows:

func Join(destination interface{}, source interface{}) interface{} {
    if source == destination {
        return destination 
    }
    td := reflect.TypeOf(destination)
    ts := reflect.TypeOf(source)

    if td != ts || td.Kind() != reflect.Ptr {
        panic("Can't join different types OR non pointers")
    }

    tdValue := reflect.ValueOf(destination)
    tsValue := reflect.ValueOf(source)


    for i := 0; i < td.Elem().NumField(); i++ {
        fSource := tsValue.Elem().Field(i)
        fDest := tdValue.Elem().Field(i)

        if fDest.CanSet(){
            switch fSource.Kind() {
                case reflect.Int:
                    if fDest.Int() == 0 {
                        fDest.SetInt(fSource.Int())
                    }
                case reflect.Bool: 
                    if fDest.Bool() == false {
                        fDest.SetBool(fSource.Bool())    
                    }
                case reflect.String: 
                    if fDest.String() == "" && fSource.String() != "" {
                        fDest.SetString(fSource.String())
                    }
                case reflect.Slice:
                    fDest.Set(reflect.AppendSlice(fDest, fSource))
                case reflect.Map:
                    if fDest.IsNil(){
                        fDest.Set(reflect.MakeMap(fDest.Type()))
                    }
                    for _, key := range fSource.MapKeys() {
                        fDest.SetMapIndex(key, fSource.MapIndex(key))
                    }
                default:
                    fmt.Println(fSource.Kind())
            }
        } else {
            // tdValue is a pointer, so Field must be looked up on the element type
            fmt.Println("Can't set", td.Elem().Field(i).Name)
        }
    }

    return destination
}

So, you can see what I mean when I said it can be expanded. I'm only handling strings, bools, ints, slices and maps. The slice handling is different in that it adds values to the current slice. Map handling will add entries, or overwrite if the key exists. Strings will only be set if the existing string is blank and the source isn't blank, and ints and bools are likewise only set while the destination still holds its zero value. So that's probably different from how I described the code in the beginning :)
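To make those merge rules concrete, here is a minimal sketch. The Config struct and its field names are hypothetical, and merge is a trimmed copy of the Join logic above (strings, bools, slices and maps only), not the full function.

```go
package main

import (
	"fmt"
	"reflect"
)

// Config is a hypothetical configuration type used only for this demo.
type Config struct {
	ConnectionString string
	Serialize        bool
	BasePaths        []string
	FieldTypes       map[string]string
}

// merge is a trimmed copy of the Join logic: it fills blank strings and
// false bools, appends slices, and adds or overwrites map entries.
func merge(dst, src *Config) *Config {
	d := reflect.ValueOf(dst).Elem()
	s := reflect.ValueOf(src).Elem()
	for i := 0; i < d.NumField(); i++ {
		fd, fs := d.Field(i), s.Field(i)
		switch fs.Kind() {
		case reflect.String:
			if fd.String() == "" && fs.String() != "" {
				fd.SetString(fs.String())
			}
		case reflect.Bool:
			if !fd.Bool() {
				fd.SetBool(fs.Bool())
			}
		case reflect.Slice:
			fd.Set(reflect.AppendSlice(fd, fs))
		case reflect.Map:
			if fd.IsNil() {
				fd.Set(reflect.MakeMap(fd.Type()))
			}
			for _, k := range fs.MapKeys() {
				fd.SetMapIndex(k, fs.MapIndex(k))
			}
		}
	}
	return dst
}

func main() {
	base := &Config{ConnectionString: "server=shared", BasePaths: []string{"/sitecore/templates"}}
	next := &Config{
		ConnectionString: "server=local",
		Serialize:        true,
		BasePaths:        []string{"/sitecore/layout"},
		FieldTypes:       map[string]string{"Droptree": "Guid"},
	}
	// The connection string from base is kept because it was already set.
	fmt.Printf("%+v\n", *merge(base, next))
}
```

Note that the first non-blank string in the chain wins, so with these rules the order of the config files matters.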

Go is very useful. There's like, nothing you can't do :)

So the program is called like this:

scgen -c scgen.json,project.json,serialize.json

scgen.json will have the template IDs for "template" and "template field", stuff that's pretty OK to hard code. If Sitecore were to change those template IDs, I'm fairly positive there's a lot of existing code out there that would break.

project.json has the connection string, the field type map, serialization path (since it's used for serialization and deserialization), and base paths for serialization.

serialize.json, in this instance, only has { "serialize" : true }  as its entire contents. Files like "generate.json" have "generate": true  as well as the file mode, output path, the Go text template, and template paths to generate.

So these files can be combined in this way to build up an entire configuration. The bools like "serialize" and "generate" are used to control program execution. The settings can be set in separate files, different files can be set and used depending on the environment, like a continuous integration server, or in a project pre-build execution. I foresee this being used with bat files. Create a "generate.bat" file which calls with generate.json in the config paths, etc for each program mode. Or a bat file to serialize, git commit, git pull, and deserialize. Enjoy!
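Reading the "-c" flag could look roughly like the sketch below. The function name parseConfigPaths is an assumption for illustration; only the flag name and the comma-separated format come from the description above.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// parseConfigPaths reads the "-c" flag from args and splits it on commas,
// dropping empty entries. The function name is hypothetical.
func parseConfigPaths(args []string) ([]string, error) {
	fs := flag.NewFlagSet("scgen", flag.ContinueOnError)
	list := fs.String("c", "", "comma-separated list of config files")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	var paths []string
	for _, p := range strings.Split(*list, ",") {
		if p = strings.TrimSpace(p); p != "" {
			paths = append(paths, p)
		}
	}
	return paths, nil
}

func main() {
	paths, err := parseConfigPaths([]string{"-c", "scgen.json,project.json,serialize.json"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Each path would then be read, unmarshalled and merged in order.
	fmt.Println(paths)
}
```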

Go and Sitecore, Part 3

In parts 1 and 2, so far, we've covered code generation with Go against a Sitecore template tree, and serializing items from the database to disk. Part 3 takes that serialized form and updates the database with items and fields that are missing or different, then clears out any items or fields that were orphaned in the process.

It probably doesn't clear orphaned fields completely correctly, as it only clears fields where the item doesn't exist anymore. It won't clear fields that no longer belong to the new template if the item's template changed. That'll probably be an easy change though, as it could probably be done with a single (albeit advanced) query.

Deserializing involves the following steps.

  1. Load all items (already done at the beginning every time the program runs)
  2. Load all field values. This happens if you are serializing or deserializing.
  3. Read the contents from disk, mapping serialized items with items in the database.
  4. Compare items and fields.
    1. If an item exists on disk but not in the database, it needs an insert
    2. If an item exists on the database but not on disk, it needs a delete (and all fields and children and children's fields, all the way down its lineage)
    3. #2 works if an item was moved because delete happens after moves.
    4. Do the same thing for fields... update if it changed, delete if in the db but not on disk, insert if on disk but not in the db.
  5. This can, in some cases, cause thousands of inserts or updates, so we'll do batch updates concurrently.
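The comparison in step 4 can be sketched as a diff between two maps keyed by ID. The Item type here is a hypothetical, simplified stand-in for the real item and field types, and the sketch leaves out moves and child deletion.

```go
package main

import (
	"fmt"
	"sort"
)

// Item is a hypothetical, simplified item: just an ID and a name.
type Item struct {
	ID   string
	Name string
}

// diff compares what is on disk with what is in the database and returns
// the IDs that need an insert, an update or a delete.
func diff(disk, db map[string]Item) (inserts, updates, deletes []string) {
	for id, onDisk := range disk {
		inDB, ok := db[id]
		switch {
		case !ok:
			inserts = append(inserts, id) // on disk only
		case inDB != onDisk:
			updates = append(updates, id) // in both, but changed
		}
	}
	for id := range db {
		if _, ok := disk[id]; !ok {
			deletes = append(deletes, id) // in the database only
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(inserts)
	sort.Strings(updates)
	sort.Strings(deletes)
	return inserts, updates, deletes
}

func main() {
	disk := map[string]Item{"a": {"a", "Home"}, "b": {"b", "About us"}}
	db := map[string]Item{"b": {"b", "About"}, "c": {"c", "Old page"}}
	ins, upd, del := diff(disk, db)
	fmt.Println("insert:", ins, "update:", upd, "delete:", del)
}
```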

Deserialization code just involves 2 regular expressions, and filepath.Walk to get all serialized files. Read the files, build the list, map them to items where applicable, decide whether to insert / update / delete / ignore, and pass the whole list of updates to the data access layer to run the updates.

I love the path and filepath packages. Here's my filepath.Walk method.

func getItemsForDeserialization(cfg conf.Configuration) []data.DeserializedItem {
	list := []data.DeserializedItem{}
	filepath.Walk(cfg.SerializationPath, func(path string, info os.FileInfo, err error) error {
		if strings.HasSuffix(path, "."+cfg.SerializationExtension) {
			bytes, _ := ioutil.ReadFile(path)
			contents := string(bytes)
			if itemmatches := itemregex.FindAllStringSubmatch(contents, -1); len(itemmatches) == 1 {
				m := itemmatches[0]
				id := m[1]
				name := m[2]
				template := m[3]
				parent := m[4]
				master := m[5]

				item := data.DeserializedItem{ID: id, TemplateID: template, ParentID: parent, Name: name, MasterID: master, Fields: []data.DeserializedField{}}

				if fieldmatches := fieldregex.FindAllStringSubmatch(contents, -1); len(fieldmatches) > 0 {
					for _, m := range fieldmatches {
						id := m[1]
						name := m[2]
						version, _ := strconv.ParseInt(m[3], 10, 64)
						language := m[4]
						source := m[5]
						value := m[6]

						item.Fields = append(item.Fields, data.DeserializedField{ID: id, Name: name, Version: version, Language: language, Source: source, Value: value})
					}
				}
				list = append(list, item)
			}
		}

		return nil
	})

	return list
}
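The two regular expressions (itemregex and fieldregex) aren't shown here. As a rough sketch, assuming the serialized file format from Part 2 with \r\n line endings, they could look something like the following; the exact patterns are my assumption, not necessarily what the program uses.

```go
package main

import (
	"fmt"
	"regexp"
)

// itemRe matches the header at the very start of a serialized file.
// Groups: 1 ID, 2 Name, 3 TemplateID, 4 ParentID, 5 MasterID.
var itemRe = regexp.MustCompile(`^ID: ([^\r\n]*)\r\nName: ([^\r\n]*)\r\nTemplateID: ([^\r\n]*)\r\nParentID: ([^\r\n]*)\r\nMasterID: ([^\r\n]*)\r\n`)

// fieldRe matches one __FIELD__ block. (?s) lets the value span lines.
// Groups: 1 ID, 2 Name, 3 Version, 4 Language, 5 Source, 6 Value.
var fieldRe = regexp.MustCompile(`(?s)__FIELD__\r\nID: ([^\r\n]*)\r\nName: ([^\r\n]*)\r\nVersion: ([^\r\n]*)\r\nLanguage: ([^\r\n]*)\r\nSource: ([^\r\n]*)\r\n__VALUESTART__\r\n(.*?)\r\n___VALUEEND___`)

func main() {
	sample := "ID: 11111111-1111-1111-1111-111111111111\r\nName: sitecore\r\n" +
		"TemplateID: C6576836-910C-4A3D-BA03-C277DBD3B827\r\n" +
		"ParentID: 00000000-0000-0000-0000-000000000000\r\n" +
		"MasterID: 00000000-0000-0000-0000-000000000000\r\n\r\n" +
		"__FIELD__\r\nID: 577F1689-7DE4-4AD2-A15F-7FDC1759285F\r\nName: __Long description\r\n" +
		"Version: 1\r\nLanguage: en\r\nSource: UnversionedFields\r\n" +
		"__VALUESTART__\r\nThis is the root of the Sitecore content tree.\r\n___VALUEEND___\r\n\r\n"

	for _, m := range itemRe.FindAllStringSubmatch(sample, -1) {
		fmt.Println("item:", m[1], m[2])
	}
	for _, m := range fieldRe.FindAllStringSubmatch(sample, -1) {
		fmt.Printf("field: %s = %q\n", m[2], m[6])
	}
}
```

The non-greedy (.*?) stops at the first end marker, which is why the marker text must never appear inside a field value.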

I did a quick and crude "kick off a bunch of update processes to cut the time down" method.

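// Requires the "sync" and "sync/atomic" packages.
func update(cfg conf.Configuration, items []data.UpdateItem, fields []data.UpdateField) int64 {
	var updated int64
	var wg sync.WaitGroup
	itemGroupSize := len(items)/2 + 1
	fieldGroupSize := len(fields)/4 + 1

	// bounds clamps a group's start and end to the list length, so the
	// last groups cannot slice past the end of the list.
	bounds := func(i, size, length int) (int, int) {
		start, end := i*size, (i+1)*size
		if start > length {
			start = length
		}
		if end > length {
			end = length
		}
		return start, end
	}

	// items - 2 goroutines
	for i := 0; i < 2; i++ {
		start, end := bounds(i, itemGroupSize, len(items))
		grp := items[start:end]
		wg.Add(1)
		go func() {
			defer wg.Done()
			// atomic add, because all six goroutines share the counter
			atomic.AddInt64(&updated, updateItems(cfg, grp))
		}()
	}

	// fields - 4 goroutines
	for i := 0; i < 4; i++ {
		start, end := bounds(i, fieldGroupSize, len(fields))
		grp := fields[start:end]
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&updated, updateFields(cfg, grp))
		}()
	}

	wg.Wait()

	return updated
}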

Very unclever. Take all of the update items and fields, break them into a set number of chunks, and kick off six goroutines, allocating twice as many for fields as for items. Each call to the respective update methods opens its own connection to SQL Server. This can be done much better, but it does accomplish what I set out to accomplish: utilize Go's coroutines (goroutines) and, where something can be done concurrently, do it concurrently to try to cut down the time required. This is the only process that uses Go's concurrency constructs.

That's it for part 3!  Part 4 will come more quickly than part 3 did. I had some things going on, a year anniversary with my girlfriend, lots of stuff :)

Series:
Part 1 - Generation
Part 2 - Serialization
Part 3 - Deserialization

Go and Sitecore, Part 2

In part 1, I covered how I'm now generating code from Sitecore templates, to a limited degree. I won't share the whole process and the whole program until the end; I'll just go over the touch points until then.

For part 2, we'll cover Sitecore serialization. As for terminology, I'm not sure what TDS or other similar tools would call them, but I will refer to these acts as serialization (writing Sitecore contents to disk) and deserialization (reading Sitecore contents from disk and writing them to the database).

For Sitecore serialization, I would say step 1 is to decide which fields you DON'T want to bring over. In the past, I've had loads of issues with serializing things like workflow state and locks, so my approach is to ignore the existence of certain fields. Essentially, find all of the fields on "Standard template", decide which ones are essential or useful, and remove those from a global "ignored fields" list.

Then get your data. From part 1, we use the same tree of items. Building the tree produces a root node and an item map (map[string]*data.Item). For serialization we need the item map; the root is only useful for building paths, and after that we could most likely toss it. With the item map in hand, and a list of ignored fields, we can get the data.


        with FieldValues (ValueID, ItemID, FieldID, Value, Version, Language, Source)
        as
        (
            select
                ID, ItemId, FieldId, Value, 1, 'en', 'SharedFields'
            from SharedFields
            union
            select
                ID, ItemId, FieldId, Value, Version, Language, 'VersionedFields'
            from VersionedFields
            union
            select
                ID, ItemId, FieldId, Value, 1, Language, 'UnversionedFields'
            from UnversionedFields
        )

        select cast(fv.ValueID as varchar(100)) as ValueID, cast(fv.ItemID as varchar(100)) as ItemID, f.Name as FieldName, cast(fv.FieldID as varchar(100)) as FieldID, fv.Value, fv.Version, fv.Language, fv.Source
                from
                    FieldValues fv
                        join Items f
                            on fv.FieldID = f.ID
                where
                    f.Name not in (%[1]v)
                order by f.Name;
    

With SQL Server, we're able to use common table expressions (CTEs), which makes this a single query and pretty easy to read. We're getting all field values except for those ignored. We get version and language no matter what, and we get the source, meaning which table the value comes from. ValueID is just the ID of the row in its fields table, which could be useful as a unique identifier, but it's not actually used right now. We simply pull all of these values into another list, matching their ItemID with the item map to produce a new "serialized item" type, which will be serialized. SerializedItem only has a pointer to the Item and a list of field values. Field values have the field ID and name, the value, the version, the language, and the source (VersionedFields, UnversionedFields, SharedFields).
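The %[1]v placeholder in the query gets filled with the ignored field names. How that list is built isn't shown here, so the following is only a sketch of one way to do it; quoteList is a hypothetical helper and the query string is shortened.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteList turns field names into a SQL list of string literals, doubling
// single quotes so a name cannot terminate its literal early.
func quoteList(names []string) string {
	quoted := make([]string, len(names))
	for i, n := range names {
		quoted[i] = "'" + strings.Replace(n, "'", "''", -1) + "'"
	}
	return strings.Join(quoted, ", ")
}

func main() {
	// queryTemplate stands in for the full query; only the placeholder matters.
	const queryTemplate = "select ... where f.Name not in (%[1]v)"
	ignored := []string{"__Lock", "__Workflow state"}
	query := fmt.Sprintf(queryTemplate, quoteList(ignored))
	fmt.Println(query)
}
```

An empty list would produce "not in ()", which SQL Server rejects, so the filter would need to be dropped in that case.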

The item map is also trimmed down to items in paths that you specify, so you're not writing the entire tree. In SQL Server with the current database (12K items), the field value query with no field name filter takes 3 seconds and returns 190K values. That's a bit high for my liking, but when you're dealing with loads of data you have to be accepting of some longer load times.

The serialized file format is hard coded, versus being a text template. However I feel I could do the text template since I've found out how to remove surrounding whitespace (e.g.  {{- end }}, that left hyphen says remove whitespace to the left). However, putting it in a text template, as with code generation, implies that the format can be configured. But, this needs to be able to be read back in through deserialization, so should be less configurable, 100% predictable.
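Here's a small sketch of that whitespace trimming in text/template, comparing the same template with and without the left hyphen. The render helper is just for this demo.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render executes a template string against data and returns the output.
func render(text string, data interface{}) string {
	var buf bytes.Buffer
	t := template.Must(template.New("demo").Parse(text))
	if err := t.Execute(&buf, data); err != nil {
		return "error: " + err.Error()
	}
	return buf.String()
}

func main() {
	names := []string{"a", "b"}
	// Without trim markers, the spaces around each action are kept.
	fmt.Printf("%q\n", render("[{{ range . }} {{ . }} {{ end }}]", names))
	// "{{- " removes the whitespace to the left of the action.
	fmt.Printf("%q\n", render("[{{ range . }} {{- . }} {{- end }}]", names))
}
```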

func serializeItems(cfg conf.Configuration, list []*data.SerializedItem) error {
	// Clear out the previous serialization before writing the new one.
	if err := os.RemoveAll(cfg.SerializationPath); err != nil {
		return err
	}
	sepstart := "__VALUESTART__"
	sepend := "___VALUEEND___"

	for _, item := range list {
		// Sitecore paths use "/"; convert them to the OS path separator.
		dir := filepath.Join(cfg.SerializationPath, filepath.FromSlash(item.Item.Path))
		if err := os.MkdirAll(dir, os.ModePerm); err != nil {
			return err
		}

		// Item header first, then one __FIELD__ block per field value.
		d := fmt.Sprintf("ID: %v\r\nName: %v\r\nTemplateID: %v\r\nParentID: %v\r\nMasterID: %v\r\n\r\n", item.Item.ID, item.Item.Name, item.Item.TemplateID, item.Item.ParentID, item.Item.MasterID)
		for _, f := range item.Fields {
			d += fmt.Sprintf("__FIELD__\r\nID: %v\r\nName: %v\r\nVersion: %v\r\nLanguage: %v\r\nSource: %v\r\n%v\r\n%v\r\n%v\r\n\r\n", f.FieldID, f.Name, f.Version, f.Language, f.Source, sepstart, f.Value, sepend)
		}

		filename := filepath.Join(dir, item.Item.ID+"."+cfg.SerializationExtension)
		if err := ioutil.WriteFile(filename, []byte(d), os.ModePerm); err != nil {
			return err
		}
	}

	return nil
}

If you've looked into the TDS file format, you've noticed it adds the length of the value so that parsing the field value is "easier???" or something. However, it makes for git conflicts on occasion. Additionally, you can't just go in there, update the text and deserialize it. For instance, if you had to bulk update something that ends up in the value of many fields, like a domain name or URL in an external link field, with the TDS method you can't just do a find/replace (unless the length of the value doesn't change!). Without the length you could find/replace across the whole path of serialized objects. There are other future benefits to this. Imagine you need to generate a tree but you don't want to use the Sitecore API. You could generate this file structure and have it deserialize to Sitecore. The length doesn't help in that scenario; leaving it out just makes things a tiny bit less painful.

The idea for this was first, "common sense", but second, it's been working for HTTP and form posts for YEARS!! HTTP multipart forms just use the boundary property. My boundary isn't dynamic, it's just a marker. If that text were to show up in a Sitecore field, this program doesn't work. Most likely I'd replace the underscores with some other value. I could also generate a boundary at the start of serialization and put it in a file in the root of serialization, like ".sersettings" with "boundary: __FIELDVALUE90210__", chosen at the start of serialization to be unique and to have no occurrences in Sitecore field values. Anyway, I've gone on too long about this :)
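Picking a boundary that doesn't occur in any field value could be sketched like this. It isn't implemented in the program; chooseBoundary and the counter-based naming are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether any value contains s.
func containsAny(values []string, s string) bool {
	for _, v := range values {
		if strings.Contains(v, s) {
			return true
		}
	}
	return false
}

// chooseBoundary returns the first candidate marker that appears in none of
// the field values. It always terminates, because a finite set of values can
// only contain finitely many of the numbered candidates.
func chooseBoundary(values []string) string {
	for n := 0; ; n++ {
		candidate := fmt.Sprintf("__FIELDVALUE%d__", n)
		if !containsAny(values, candidate) {
			return candidate
		}
	}
}

func main() {
	values := []string{"plain text", "tricky __FIELDVALUE0__ content"}
	fmt.Println("boundary:", chooseBoundary(values))
}
```

The chosen boundary would then be written to the settings file and read back before deserializing.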

Also, the path and path/filepath packages in Go are the best. So helpful.

In this format, here is what the "sitecore" root node looks like serialized.

ID: 11111111-1111-1111-1111-111111111111
Name: sitecore
TemplateID: C6576836-910C-4A3D-BA03-C277DBD3B827
ParentID: 00000000-0000-0000-0000-000000000000
MasterID: 00000000-0000-0000-0000-000000000000

__FIELD__
ID: 56776EDF-261C-4ABC-9FE7-70C618795239
Name: __Help link
Version: 1
Language: en
Source: SharedFields
__VALUESTART__

___VALUEEND___

__FIELD__
ID: 577F1689-7DE4-4AD2-A15F-7FDC1759285F
Name: __Long description
Version: 1
Language: en
Source: UnversionedFields
__VALUESTART__
This is the root of the Sitecore content tree.
___VALUEEND___

__FIELD__
ID: 9541E67D-CE8C-4225-803D-33F7F29F09EF
Name: __Short description
Version: 1
Language: en
Source: UnversionedFields
__VALUESTART__
This is the root of the Sitecore content tree.
___VALUEEND___

In part 3, we'll be looking into deserializing these items.

Series:
Part 1 - Generation
Part 2 - Serialization
Part 3 - Deserialization

Go and Sitecore, Part 1

I will use this post to sell you on the merits of Go, since I am in love with it. :) In our dev shop, a C# MVC.NET etc. shop, we've been using Hedgehog's Team Development for Sitecore. While this product does what it says it does, it's also a few other things: slow, expensive, difficult to learn, does way too much, requires a website to be running, "clunky" (official term), and a few other things that make it undesirable. Fidgety, if that's a word. Sometimes unreliable. My boss has decided to try to head away from that direction and that product.

However. There are a few good things that it does provide. They are useful if the rest of the product has its flaws. (Of course there are other products out there as well, like Unicorn). Those features that I find most useful as a developer, and a developer on a team (hence the T in TDS), are

  1. Code generation from templates
  2. Serializing Sitecore content (templates, layouts, and other items) and committing those to git.
  3. Deserializing Sitecore content which others have committed to git  (or which you've serialized earlier and messed up)

Those features, if they could be separated out, are desirable to keep around.

Over the past month or two, I've had to do a lot of work dealing directly with the Sitecore database. You have Items, VersionedFields, UnversionedFields, and SharedFields. That's your data. Unless you are worried about the Media Library, in which case you might need to deal with the Blobs table. I haven't researched that fully, so I'm not sure which tables are required for media. So I felt comfortable getting in there and giving code generation a shot with my new-found knowledge of the Sitecore tables. Now I know them even more intimately.

Go and Sitecore

Go and Sitecore are a natural fit. The first thing you need is the SQL Server driver from "github.com/denisenkom/go-mssqldb". That thing works great. You just have to change your "data source" parameter to "server" in your connection string. Very mild annoyance :) In building this thing, it's probably best to just select all items. The Items table is a small set of data. The database I'm working on has 12,000+ items, which are all returned in milliseconds.

Select * from Items. In this case, though, I did a left join to the SharedFields table twice: once to get the base templates (__Base templates) and once to get the field type (Type). If an item is a template, it will sometimes have base templates; if it is a field, it will have a field type. I just hard coded those field IDs in there for now.

select 
            cast(Items.ID as varchar(100)) ID, Name, replace(replace(Name, ' ', ''), '-', '') as NameNoSpaces, cast(TemplateID as varchar(100)) TemplateID, cast(ParentID as varchar(100)) ParentID, cast(MasterID as varchar(100)) as MasterID, Items.Created, Items.Updated, isnull(sf.Value, '') as Type, isnull(Replace(Replace(b.Value, '}',''), '{', ''), '') as BaseTemplates
        from
            Items
                left join SharedFields sf
                    on Items.ID = sf.ItemId
                        and sf.FieldId = 'AB162CC0-DC80-4ABF-8871-998EE5D7BA32'
                left join SharedFields b
                    on Items.ID = b.ItemID
                        and b.FieldId = '12C33F3F-86C5-43A5-AEB4-5598CEC45116'

The next part is to use that item data to rebuild the Sitecore tree in memory.


func buildTree(items []*data.Item) (root *data.Item, itemMap map[string]*data.Item, err error) {
	itemMap = make(map[string]*data.Item)
	for _, item := range items {
		itemMap[item.ID] = item
	}

	root = nil
	for _, item := range itemMap {
		if p, ok := itemMap[item.ParentID]; ok {
			p.Children = append(p.Children, item)
			item.Parent = p
		} else if item.ParentID == "00000000-0000-0000-0000-000000000000" {
			root = item
		}
	}

	if root != nil {
		root.Path = "/" + root.Name
		assignPaths(root)
		assignBaseTemplates(itemMap)
	}
	return root, itemMap, nil
}

Since I select all Items, I know the "sitecore" item will be in there, with ParentID equal to all 0s. This is the root. After the tree is built, I assign paths based on the tree. And then assign all of the base templates. Base templates are of course crucial if you're generating code. You will want to provide implemented interfaces to each interface when you generate Glass Mapper interfaces, for instance.

assignPaths of course couldn't be any easier. For each item in the children, set the path equal to the root path + / + the item name. Then recursively set it for all of that item's children. That process takes no time in Go, even across 12000 items. Now your tree is built.
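assignPaths isn't shown here, but based on that description a minimal sketch would be the following. The Item struct is a hypothetical, trimmed-down version of data.Item.

```go
package main

import "fmt"

// Item is a hypothetical, trimmed-down version of data.Item.
type Item struct {
	Name     string
	Path     string
	Children []*Item
}

// assignPaths sets each child's path to the parent's path plus "/" plus the
// child's name, then recurses into that child's own children.
func assignPaths(parent *Item) {
	for _, child := range parent.Children {
		child.Path = parent.Path + "/" + child.Name
		assignPaths(child)
	}
}

func main() {
	home := &Item{Name: "Home"}
	content := &Item{Name: "content", Children: []*Item{home}}
	root := &Item{Name: "sitecore", Path: "/sitecore", Children: []*Item{content}}

	assignPaths(root)
	fmt.Println(content.Path)
	fmt.Println(home.Path)
}
```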

The path is important for determining a namespace for your interfaces. If your path is at /sitecore/templates/User Defined/MySite and there's a template under MySite with the relative path of "Components/CalendarData", you'd want the namespace to be some configured base namespace  (Site.Data) plus the relative path to that template, to get something like Site.Data.Components   and then your class or interface would be e.g. ICalendarData.
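As a sketch of that namespace rule: namespaceFor and clean are hypothetical helpers, and clean mirrors the NameNoSpaces column from the item query by stripping spaces and hyphens.

```go
package main

import (
	"fmt"
	"strings"
)

// clean strips spaces and hyphens from a name.
func clean(s string) string {
	return strings.NewReplacer(" ", "", "-", "").Replace(s)
}

// namespaceFor derives a namespace and an interface name for a template,
// from its path relative to the configured base path.
func namespaceFor(basePath, baseNamespace, templatePath string) (namespace, iface string) {
	rel := strings.Trim(strings.TrimPrefix(templatePath, basePath), "/")
	parts := strings.Split(rel, "/")
	namespace = baseNamespace
	// Every segment except the last becomes part of the namespace.
	for _, p := range parts[:len(parts)-1] {
		namespace += "." + clean(p)
	}
	// The last segment is the template itself.
	return namespace, "I" + clean(parts[len(parts)-1])
}

func main() {
	ns, iface := namespaceFor(
		"/sitecore/templates/User Defined/MySite",
		"Site.Data",
		"/sitecore/templates/User Defined/MySite/Components/CalendarData",
	)
	fmt.Println(ns, iface)
}
```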

So after you determine all of that for every item, you just have to filter the itemMap by templates and their children (template section, template field), create a list of templates (the Template struct is another type in my Go code, to better map templates, namespaces, fields, field types, etc.), and run that list of templates against a "text/template" Template in Go. It's only a list of templates at this point, rather than a hierarchy, because you don't need the hierarchy once you've determined the path. And templates just have fields, once you break it down. They aren't template sections and then fields. They are just Template, Field, and Base Templates.

My text template is this:

{{ range $index, $template := .Templates }}
namespace {{ .Namespace }} {
    public class {{ $template.Item.CleanName }}Constants {
        public const string TemplateIdString = "{{ $template.Item.ID }}";
        public static readonly ID TemplateId = new ID(TemplateIdString);
    }

    [SitecoreType(TemplateId={{ $template.Item.CleanName }}Constants.TemplateIdString)]
    public partial interface I{{ $template.Item.CleanName }} : ISitecoreItem{{ if gt (len $template.BaseTemplates) 0 }}{{ range $bindex, $baseTemplate := $template.BaseTemplates }}, global::{{ $baseTemplate.Namespace}}.I{{ $baseTemplate.Item.CleanName}}{{end}}{{end}} {
        {{ if gt (len $template.Fields) 0 }}
        {{ range $findex, $field := $template.Fields }}
            //{{$field.Item.ID}}
            [SitecoreField("{{$field.Name}}")]
            {{ $field.CodeType}} {{ $field.CleanName }}{{ $field.Suffix }} { get; set; }{{end}}{{end}}
    }
}{{end}}

A few other notes. In configuration, I provide a CodeType mapping. For each field type, you can specify what the type of that property should be. This is of course useful if you have a custom type for "Treelist" that you want to use, and also if you're generating code that's not in C#. I didn't want to hard code those values. Also useful if you provide custom field types in Sitecore.

Field suffix is for something like, if you have a Droptree, you can select one item, and that value would be a uniqueidentifier in the database. This is represented in C# as a "Guid" type. For these fields, I like to add an "ID" to the end of the property name to signify this. Then in the accompanying partial class that isn't overwritten every time you generate, provide a field for the actual item that it would reference, like  IReferencedItem ReferencedItem {get;set;}  where the generated code would yield a Guid ReferencedItemID {get;set;}  You get the picture ;)
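The CodeType mapping and suffix could be modelled roughly like this. The FieldType struct, the property function and the "string" fallback are all assumptions for illustration; the real configuration may be shaped differently.

```go
package main

import "fmt"

// FieldType is a hypothetical config entry: the property type to emit for a
// Sitecore field type, plus an optional suffix for the property name.
type FieldType struct {
	CodeType string
	Suffix   string
}

// property renders a C# property declaration for one field.
func property(types map[string]FieldType, fieldType, cleanName string) string {
	ft, ok := types[fieldType]
	if !ok {
		ft = FieldType{CodeType: "string"} // assumed fallback for unmapped types
	}
	return fmt.Sprintf("%s %s%s { get; set; }", ft.CodeType, cleanName, ft.Suffix)
}

func main() {
	types := map[string]FieldType{
		"Droptree":         {CodeType: "Guid", Suffix: "ID"},
		"Single-Line Text": {CodeType: "string"},
	}
	fmt.Println(property(types, "Droptree", "ReferencedItem"))
	fmt.Println(property(types, "Single-Line Text", "Title"))
}
```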

For each field, template, and base template, you have a pointer to the Item from which it was created, so you have access to all of the item data. Like CleanName, ID, Created and Updated times if you need them for some reason. The path. Everything.

The best part of this tool is that it does all of those things that I require. Code generation, serialization, and deserialization. And it does it in a neat way. Currently I'm not doing any concurrency so it might take 20 seconds to do all of it, but that's still way faster than TDS!

This is Part 1 of this small series. In the future I'll go over how I did, and improved, item serialization and deserialization, as well as some neat things I did to accomplish not running all of those processes each time you run the tool. So you can run it to just generate, or just deserialize, or generate and serialize. Not many permutations on that, but the important thing is you don't have to wait 8 seconds for it to also serialize if all you want to do is generate, which takes less than a second. 800 ms to be exact.

800 ms!!

Tune in for Part 2 soon! Serialization.

Series:
Part 1 - Generation
Part 2 - Serialization
Part 3 - Deserialization