Go and Sitecore Interlude

This is part of the same program I'm developing to generate, serialize and deserialize items, but it's a general helper method that I found very useful, and can be used in any Go program. It can be expanded to be more complete, I pretty much did it for things that I am currently working with. You'll see what I mean.

The application is written as one application that does it all (generation, serialization, deserialization). So, it's looking for configuration settings for all aspects of those pieces of functionality. You generally don't want to serialize the same Sitecore paths that you want to generate code against. However, having the configuration in one file is not what I wanted. Here are the drawbacks.

If the configuration is in one file, you would have to update your configuration file in consecutive runs if you wanted to serialize, git fetch and merge, then deserialize. Your configuration would be committed and would be set for the next person that wants to run the program. You couldn't write bat files to run updates.

You could use the flag package to control the program pieces. Of course. But I set out to have multiple configs. For instance, if you wanted to serialize from a shared database then serialize to your local database. You could also make the config file a flag and set it to different huge files that each only differ by the connection string.

You could.

But then I wouldn't have this cool piece of code :)

Basically, when you run the program, you call it with a "-c" flag which takes a csv list of config files. The program reads them in order and merges them, having configuration values later in the chain overwrite values in the previous versions. I do this using Go's reflect package. As follows:

func Join(destination interface{}, source interface{}) interface{} {
    if source == destination {
        return destination 
    td := reflect.TypeOf(destination)
    ts := reflect.TypeOf(source)

    if td != ts || td.Kind() != reflect.Ptr {
        panic("Can't join different types OR non pointers")

    tdValue := reflect.ValueOf(destination)
    tsValue := reflect.ValueOf(source)

    for i := 0; i < td.Elem().NumField(); i++ {
        fSource := tsValue.Elem().Field(i)
        fDest := tdValue.Elem().Field(i)

        if fDest.CanSet(){
            switch fSource.Kind() {
                case reflect.Int:
                    if fDest.Int() == 0 {
                case reflect.Bool: 
                    if fDest.Bool() == false {
                case reflect.String: 
                    if fDest.String() == "" && fSource.String() != "" {
                case reflect.Slice:
                    fDest.Set(reflect.AppendSlice(fDest, fSource))
                case reflect.Map:
                    if fDest.IsNil(){
                    for _, key := range fSource.MapKeys() {
                        fDest.SetMapIndex(key, fSource.MapIndex(key))
        } else {
            fmt.Println("Can't set", tdValue.Field(i))

    return destination

So, you can see what I mean when I said it can be expanded. I'm only doing strings, bools, ints, slices and maps. The slice handling is different in that it adds values to the current slice. Map handling will add entries or overwrite if the key exists. Strings will only overwrite if the existing string is blank and the source isn't blank. So that's probably different from how I described the code in the beginning :)

Go is very useful. There's like, nothing you can't do :)

So the program is called like this:

scgen -c scgen.json,project.json,serialize.json

scgen.json will have the template ids for "template" and "template field", stuff that's pretty ok if it's hard coded. If sitecore were to change those template IDs, I'm fairly positive there's a lot of existing code out there that will break.

project.json has the connection string, the field type map, serialization path (since it's used for serialization and deserialization), and base paths for serialization.

serialize.json, in this instance, only has { "serialize" : true }  as its entire contents. Files like "generate.json" have "generate": true  as well as the file mode, output path, the Go text template, and template paths to generate.

So these files can be combined in this way to build up an entire configuration. The bools like "serialize" and "generate" are used to control program execution. The settings can be set in separate files, different files can be set and used depending on the environment, like a continuous integration server, or in a project pre-build execution. I foresee this being used with bat files. Create a "generate.bat" file which calls with generate.json in the config paths, etc for each program mode. Or a bat file to serialize, git commit, git pull, and deserialize. Enjoy!

Go and Sitecore, Part 3

In parts 1 and 2, so far, we've covered code generation with Go against a Sitecore template tree, and serializing items from the database to disk. Part 3 takes that serialized form and updates the database with items and fields that are missing or different, then clears out any items or fields that were orphaned in the process.

It probably doesn't do the clearing orphaned fields completely correctly, as I will only clear fields where the item doesn't exist anymore. It won't clear fields that no longer belong to the new template if the item's template changed. That'll probably be an easy change though, as it could probably be done with a single (albeit, advanced) query.

Deserializing involves the following steps.

  1. Load all items (already done at the beginning every time the program runs)
  2. Load all field values. This happens if you are serializing or deserializing.
  3. Read the contents from disk, mapping serialized items with items in the database.
  4. Compare items and fields.
    1. If an item exists on disk but not in the database, it needs an insert
    2. If an item exists on the database but not on disk, it needs a delete (and all fields and children and children's fields, all the way down its lineage)
    3. #2 works if an item was moved because delete happens after moves.
    4. Do the same thing for fields... update if it changed, delete if in the db but not on disk, insert if on disk but not in the db.
  5. This can, in some cases, cause thousands of inserts or updates, so we'll do batch updates concurrently.

Deserialization code just involves 2 regular expressions, and filepath.Walk to get all serialized files. Read the files, build the list, map them to items where applicable, decide whether to insert / update / delete / ignore, and pass the whole list of updates to the data access layer to run the updates.

I love the path and filepath packages. Here's my filepath.Walk method.

func getItemsForDeserialization(cfg conf.Configuration) []data.DeserializedItem {
	list := []data.DeserializedItem{}
	filepath.Walk(cfg.SerializationPath, func(path string, info os.FileInfo, err error) error {
		if strings.HasSuffix(path, "."+cfg.SerializationExtension) {
			bytes, _ := ioutil.ReadFile(path)
			contents := string(bytes)
			if itemmatches := itemregex.FindAllStringSubmatch(contents, -1); len(itemmatches) == 1 {
				m := itemmatches[0]
				id := m[1]
				name := m[2]
				template := m[3]
				parent := m[4]
				master := m[5]

				item := data.DeserializedItem{ID: id, TemplateID: template, ParentID: parent, Name: name, MasterID: master, Fields: []data.DeserializedField{}}

				if fieldmatches := fieldregex.FindAllStringSubmatch(contents, -1); len(fieldmatches) > 0 {
					for _, m := range fieldmatches {
						id := m[1]
						name := m[2]
						version, _ := strconv.ParseInt(m[3], 10, 64)
						language := m[4]
						source := m[5]
						value := m[6]

						item.Fields = append(item.Fields, data.DeserializedField{ID: id, Name: name, Version: version, Language: language, Source: source, Value: value})
				list = append(list, item)

		return nil

	return list

I did a quick and crude "kick off a bunch of update processes to cut the time down" method.

func update(cfg conf.Configuration, items []data.UpdateItem, fields []data.UpdateField) int64 {
	var updated int64 = 0
	var wg sync.WaitGroup
	itemGroupSize := len(items)/2 + 1
	fieldGroupSize := len(fields)/4 + 1

	// items - 2 processes
	for i := 0; i < 2; i++ {
		grp := items[i*itemGroupSize : (i+1)*itemGroupSize]
		go func() {
			updated += updateItems(cfg, grp)

	// fields - 4 processes
	for i := 0; i < 4; i++ {
		grp := fields[i*fieldGroupSize : (i+1)*fieldGroupSize]
		go func() {
			updated += updateFields(cfg, grp)


	return updated

Very unclever. Take all of the update items and fields, break them into a set number of chunks, kick off six processes, allocating twice as many for fields than for items. Each call to the respective update methods opens its own connection to SQL Server. This can be done much better but it does accomplish what I set out to accomplish. Utilize Go's coroutines (goroutines) and where something can be done concurrently, do it concurrently to try to cut down the time required. This is the only process that uses Go's concurrent constructs.

That's it for part 3!  Part 4 will come more quickly than part 3 did. I had some things going on, a year anniversary with my girlfriend, lots of stuff :)

Part 1 - Generation
Part 2 - Serialization
Part 3 - Deserialization

Go and Sitecore, Part 2

In part 1, I covered how I'm now generating code from Sitecore templates, to a limited degree. I won't share the whole process and the whole program until the end, but just going over touch points until then.

For part 2, we'll cover Sitecore serialization. For the terminology, I'm not sure what TDS or other similar tools would refer to them as, but I will refer to these acts as serialization (writing Sitecore contents to disk) and deserialization (reading Sitecore contents from disk and writing to the database)

For Sitecore serialization, I would say step 1 is to decide which fields you DON'T want to bring over. In the past, I've had loads of issues with serializing things like Workflow state. And locks. So my approach is to ignore the existence of certain fields. Essentially, find out all of the fields on "Standard template", and decide which ones are essential or useful. Remove those from a global list of "ignored fields" list. Then get your data. For the data, from part 1 we use the same tree of items. When we build the tree, it gets a root node tree and an item map  (map[string]*data.Item). For serialization we need the item map. The root is only useful for building paths, after that we could most likely toss it. With the item map in hand, and a list of ignored fields, we can get the data.

        with FieldValues (ValueID, ItemID, FieldID, Value, Version, Language, Source)
                ID, ItemId, FieldId, Value, 1, 'en', 'SharedFields'
            from SharedFields
                ID, ItemId, FieldId, Value, Version, Language, 'VersionedFields'
            from VersionedFields
                ID, ItemId, FieldId, Value, 1, Language, 'UnversionedFields'
            from UnversionedFields

        select cast(fv.ValueID as varchar(100)) as ValueID, cast(fv.ItemID as varchar(100)) as ItemID, f.Name as FieldName, cast(fv.FieldID as varchar(100)) as FieldID, fv.Value, fv.Version, fv.Language, fv.Source
                    FieldValues fv
                        join Items f
                            on fv.FieldID = f.ID
                    f.Name not in (%[1]v)
                order by f.Name;

With SQL Server, we're able to do common table expressions (CTEs) which makes this a single query and pretty easy to read. We're getting all field values except for those ignored. We get version and language no matter what, and we get the source, which table the value comes from. ValueID is just the Fields table ID which could be useful as a unique identifier, but it's not actually used right now.  We simply pull all of these values into another list of serialize items, matching their ItemID with the item map to produce a new "serialized item" type, which will be serialized. SerializedItem only has a pointer to the Item, and a list of field values. Field values have Field ID and Name, the Value, the version, the language, and the source (VersionedFields, UnversionedFields, SharedFields).

The item map is also trimmed down to items in paths that you specify, so you're not writing the entire tree. In SQL Server with the current database (12K items), the field value query with no field name filter takes 3 seconds and returns 190K values. That's a bit high for my liking, but when you're dealing with loads of data you have to be accepting of some longer load times.

The serialized file format is hard coded, versus being a text template. However I feel I could do the text template since I've found out how to remove surrounding whitespace (e.g.  {{- end }}, that left hyphen says remove whitespace to the left). However, putting it in a text template, as with code generation, implies that the format can be configured. But, this needs to be able to be read back in through deserialization, so should be less configurable, 100% predictable.

func serializeItems(cfg conf.Configuration, list []*data.SerializedItem) error {
	sepstart := "__VALUESTART__"
	sepend := "___VALUEEND___"

	for _, item := range list {
		path := item.Item.Path
		path = strings.Replace(path, "/", "\\", -1)
		dir := filepath.Join(cfg.SerializationPath, path)

		if err := os.MkdirAll(dir, os.ModePerm); err == nil {
			d := fmt.Sprintf("ID: %v\r\nName: %v\r\nTemplateID: %v\r\nParentID: %v\r\nMasterID: %v\r\n\r\n", item.Item.ID, item.Item.Name, item.Item.TemplateID, item.Item.ParentID, item.Item.MasterID)
			for _, f := range item.Fields {
				d += fmt.Sprintf("__FIELD__\r\nID: %v\r\nName: %v\r\nVersion: %v\r\nLanguage: %v\r\nSource: %v\r\n%v\r\n%v\r\n%v\r\n\r\n", f.FieldID, f.Name, f.Version, f.Language, f.Source, sepstart, f.Value, sepend)

			filename := filepath.Join(dir, item.Item.ID+"."+cfg.SerializationExtension)
			ioutil.WriteFile(filename, []byte(d), os.ModePerm)

	return nil

If you've looked into the TDS file format, you've noticed it adds the length of the value so that parsing the field value is "easier???" or something. However, it makes for git conflicts on occasion. Additionally, you can't just go in there and update the text and deserialize it.  For instance, if you had to bulk update a path that would end up in the value for each item, like a domain name or url in an external link field which is the value for many fields, with the TDS method you can't just do a find replace (unless the length of the value doesn't change!). Without the length you could find/replace across the whole path of serialized objects. There are other future benefits to this. Imagine you need to generate a tree but you don't want to use Sitecore API. You could generate this file structure and have it deserialize to Sitecore. The length doesn't help that scenario though, it just makes it a tiny less painful.

The idea for this was first, "common sense", but second, it's been working for HTTP and form posts for YEARS!! HTTP multipart forms just use the boundary property. My boundary isn't dynamic, it's just a marker. If that text were to show up in a Sitecore field, this program doesn't work. Most likely I'd replace underscores with some other value. I could generate a boundary at the start of serialization, and put it in a file in the root of serialization, like ".sersettings" with "boundary: __FIELDVALUE90210__" which would be determined at the start of serialization to be unique and having no occurrences in sitecore field values. Anyway, I've gone on too long about this :)

Also, the path and path/filepath packages in Go are the best. So helpful.

In this format, here is what the "sitecore" root node looks like serialized.

ID: 11111111-1111-1111-1111-111111111111
Name: sitecore
TemplateID: C6576836-910C-4A3D-BA03-C277DBD3B827
ParentID: 00000000-0000-0000-0000-000000000000
MasterID: 00000000-0000-0000-0000-000000000000

ID: 56776EDF-261C-4ABC-9FE7-70C618795239
Name: __Help link
Version: 1
Language: en
Source: SharedFields


ID: 577F1689-7DE4-4AD2-A15F-7FDC1759285F
Name: __Long description
Version: 1
Language: en
Source: UnversionedFields
This is the root of the Sitecore content tree.

ID: 9541E67D-CE8C-4225-803D-33F7F29F09EF
Name: __Short description
Version: 1
Language: en
Source: UnversionedFields
This is the root of the Sitecore content tree.

In part 3, we'll be looking into deserializing these items.


Part 1 - Generation
Part 2 - Serialization
Part 3 - Deserialization

Go and Sitecore, Part 1

I will use this post to sell you on the merits of Go, since I am in love with it. :)  In our dev shop, a C# MVC.NET etc shop, we've been using Hedgehog's Team Development for Sitecore. While this product does what it says it does, it's a few things. Slow, expensive, difficult to learn, does way too much, requires a website to be running, "clunky" (official term), and a few other things that make it undesirable. Fidgetty, if that's a word. Sometimes unreliable. My boss has decided to try to head away from that direction and that product.

However. There are a few good things that it does provide. They are useful if the rest of the product has its flaws. (Of course there are other products out there as well, like Unicorn). Those features that I find most useful as a developer, and a developer on a team (hence the T in TDS), are

  1. Code generation from templates
  2. Serializing Sitecore content (templates, layouts, and other items) and committing those to git.
  3. Deserializing Sitecore content which others have committed to git  (or which you've serialized earlier and messed up)

Those features, if they could be separated out, are desirable to keep around.

Over the past month or two, I've had to do a lot of work dealing directly with the sitecore database. You have Items, VersionedFields, UnversionedFields, and SharedFields. That's your data. Unless you are worried about the Media Library, then you might need to deal with the Blobs table. I haven't researched that fully, so I'm not sure of which tables are required for media. So I felt comfortable getting in there and giving code generation a shot with my new-found knowledge of the Sitecore table. Now I know them even more intimately.

Go and Sitecore

Go and Sitecore are a natural fit. First thing you need is the sql server library from "github.com/denisenkom/go-mssqldb". That thing works great. You just have to change your "data source" parameter to "server" in your connection string. Very mild annoyance :) In building this thing, it's probably best to just select all items. Items table is a small set of data. The database I'm working on is 12,000+ items, which are all returned in milliseconds.

Select * from Items. In this case, though, I did a left joint to the SharedFields table two times, one to get the base templates (__Base templates) and one to get the field type (Type). If they are a Template, they'll have base templates sometimes, if they are a field, they'll have a field type. I just hard coded those field ids in there for now.

            cast(Items.ID as varchar(100)) ID, Name, replace(replace(Name, ' ', ''), '-', '') as NameNoSpaces, cast(TemplateID as varchar(100)) TemplateID, cast(ParentID as varchar(100)) ParentID, cast(MasterID as varchar(100)) as MasterID, Items.Created, Items.Updated, isnull(sf.Value, '') as Type, isnull(Replace(Replace(b.Value, '}',''), '{', ''), '') as BaseTemplates
                left join SharedFields sf
                    on Items.ID = sf.ItemId
                        and sf.FieldId = 'AB162CC0-DC80-4ABF-8871-998EE5D7BA32'
                left join SharedFields b
                    on Items.ID = b.ItemID
                        and b.FieldId = '12C33F3F-86C5-43A5-AEB4-5598CEC45116'

The next part is to use that item data to rebuild the sitecore tree in memory.

func buildTree(items []*data.Item) (root *data.Item, itemMap map[string]*data.Item, err error) {
	itemMap = make(map[string]*data.Item)
	for _, item := range items {
		itemMap[item.ID] = item

	root = nil
	for _, item := range itemMap {
		if p, ok := itemMap[item.ParentID]; ok {
			p.Children = append(p.Children, item)
			item.Parent = p
		} else if item.ParentID == "00000000-0000-0000-0000-000000000000" {
			root = item

	if root != nil {
		root.Path = "/" + root.Name
	return root, itemMap, nil

Since I select all Items, I know the "sitecore" item will be in there, with ParentID equal to all 0s. This is the root. After the tree is built, I assign paths based on the tree. And then assign all of the base templates. Base templates are of course crucial if you're generating code. You will want to provide implemented interfaces to each interface when you generate Glass Mapper interfaces, for instance.

assignPaths of course couldn't be any easier. For each item in the children, set the path equal to the root path + / + the item name. Then recursively set it for all of that item's children. That process takes no time in Go, even across 12000 items. Now your tree is built.

The path is important for determining a namespace for your interfaces. If your path is at /sitecore/templates/User Defined/MySite and there's a template under MySite with the relative path of "Components/CalendarData", you'd want the namespace to be some configured base namespace  (Site.Data) plus the relative path to that template, to get something like Site.Data.Components   and then your class or interface would be e.g. ICalendarData.

So after you determine all of that for every item, you just have to filter the itemMap by templates and their children (template section, template field), create a list of templates  ( Template struct is another type in my Go code, to better map templates, namespaces, fields, field types, etc), and run that list of templates against a "text/template" Template in Go. It's only a list of templates at this time, vs keeping the hierarchy, because you don't need hierarchy once you've determined the path. And Templates just have fields, once you break it down. They aren't template sections and then fields. They are just Template, Field, and Base Templates.

My text template is this:

{{ range $index, $template := .Templates }}
namespace {{ .Namespace }} {
    public class {{ $template.Item.CleanName }}Constants {
        public const string TemplateIdString = "{{ $template.Item.ID }}";
        public static readonly ID TemplateId = new ID(TemplateIdString);

    [SitecoreType(TemplateId={{ $template.Item.CleanName }}Constants.TemplateIdString)]
    public partial interface I{{ $template.Item.CleanName }} : ISitecoreItem{{ if (len $template.BaseTemplates) gt 0 }}{{ range $bindex, $baseTemplate := $template.BaseTemplates }}, global::{{ $baseTemplate.Namespace}}.I{{ $baseTemplate.Item.CleanName}}{{end}}{{end}} {
        {{ if (len $template.Fields) gt 0 }}
        {{ range $findex, $field := $template.Fields }}
            {{ $field.CodeType}} {{ $field.CleanName }}{{ $field.Suffix }} { get; set; }{{end}}{{end}}

A few other notes. In configuration, I provide a CodeType mapping. For each field type, you can specify what the type of that property should be. This is of course useful if you have a custom type for "Treelist" that you want to use, and also if you're generating code that's not in C#. I didn't want to hard code those values. Also useful if you provide custom field types in Sitecore.

Field suffix is for something like, if you have a Droptree, you can select one item, and that value would be a uniqueidentifier in the database. This is represented in C# as a "Guid" type. For these fields, I like to add an "ID" to the end of the property name to signify this. Then in the accompanying partial class that isn't overwritten every time you generate, provide a field for the actual item that it would reference, like  IReferencedItem ReferencedItem {get;set;}  where the generated code would yield a Guid ReferencedItemID {get;set;}  You get the picture ;)

For each field, template, and base template, you have a pointer to the Item from which it was created, so you have access to all of the item data. Like CleanName, ID, Created and Update times if you need them for some reason. The path. Everything.

The best part of this tool is that it does all of those things that I require. Code generation, serialization, and deserialization. And it does it in a neat way. Currently I'm not doing any concurrency so it might take 20 seconds to do all of it, but that's still way faster than TDS!

This is Part 1 of this small series. In the future I'll go over how I did, and improved, item serialization and deserialization, as well as some neat things I did to accomplish not running all of those processes each time you run the tool. So you can run it to just generate, or just deserialize, or generate and serialize. Not many permutations on that, but the important thing is you don't have to wait 8 seconds for it to also serialize if all you want to do is generate, which takes less than a second. 800 ms to be exact.

800 ms!!

Tune in for Part 2 soon! Serialization.


Part 1 - Generation
Part 2 - Serialization
Part 3 - Deserialization

Advent of Code 2016

 I've been doing the Advent of Code 2016 this year, I've been a little late to the game due to a week off of doing anything with computers and generally just not being very attentive to it.

There have been a few favorites. Day 11, 13 and 17 with their tree searching have been a good refresher. Day 15 literally took me longer to understand the problem than to code it up.

Day 19 took me through a few different paths. Day 19 is the White Elephant with 3 million elves. The elf takes the presents from the elf to their left and the next elf that has presents goes next. And it repeats until one elf has all the presents. Part 2, which you find out later, is that it changes it to take the presents from the elf across from them.

The first attempt at this solution was an array of elves. 3 million odd elves. Looping through would be fast but finding the next elf who has presents would be looping a ton. Solving the 5 elf example worked but moving it to 3 million+ pointed out how terribly inefficient this would be.

So. Linked list to the rescue  (I have always said "to the rescue" when picking a tool to solve the job, but ever since seeing "The Martian", now my internal voice is replaced by Matt Damon's, when he determines that hexadecimal is the answer to his problems).

For part one, just create 3 million elves and have a pointer to the next elf. The last elf's next elf is the first elf. Done.

for elf.Next != elf

      elf.Presents += elf.Next.Presents

       elf.Next = elf.Next.Next

      elf = elf.Next // move to the next elf

This solves it very quickly. But then the bombshell of part 2 comes in. Find the elf across from the current elf in a Linked List?! What?

First attempt was for each elf, use a counter to determine the one across. So every time an elf is removed from the circle, decrement the counter. To determine the across elf, loop to across.Next (count / 2) times. This works. But then it proved very slow as well. Each time you're looping it loops 1.5 million-ish times (half). This proved to be correct though, with my test with the five elves.

Then I remembered the ability to just have as many pointers as you want! This is computers, you can do whatever you want. There are no rules :)   So, instead I created an "across" pointer and initialized it with my count/2 loop. This would be the only time it would need to loop. Just determine it's starting location.

I would still have to keep a count though. The count was useful for the case where, for instance in the 5 elf circle...  1 takes presents from 3, but there are 2 across from 1, so you choose the left side. However, removing 3 then, and moving to 2, 2 is across from 5, but if we were to just increment the across elf to across.Next, it would point to 4. So if there are an even number of elves left, add another move to across.Next after setting it to across.Next (move it two forward in the list).  This proved to work and was very fast :)

Check the solution here - https://github.com/jasontconnell/advent/blob/develop/2016/19.go

Go CrawlFarm Code

I have added the code to Github

Please check it out. I will be making bug fixes and updating code to be more consistent and better. It's a pretty awesome project that I'm pretty proud of :)

Key Golang features, if you want to view it for example code:

  1. Gob
  2. Concurrent For / Select on Channels
  3. Logging using log package
  4. net.Conn, Server and Client, Dial and Accept
  5. sync.Mutex
  6. encoding/json
  7. time.NewTicker
  8. Synchronous and Asynchronous channels
  9. net/http.Client, CheckRedirect
  10. Regex to find href/src attributes

Feel free to fork and create pull requests :)

New Go Project: Distributed Crawler

In work, we created a site for a client. While the development work was pretty straightforward (although I did do a few neat things), it was very, extremely content heavy. We're talking 5000 or more pages. While the development also was pretty straightforward, it was ongoing until the end, and doing that while content entry is going on proves to be a bit of a challenge. If the way something works changes, and content is not present, things break.

There are tools like that yelling amphibian (you know the one) to help crawl sites and report on errors, like 404s, 500s, etc. However, it takes a long time to run, costs money, doesn't do specifically what I need, and did I mention it takes forever?

One of the reasons it takes a long time to run is that it's probably doing more in there than it needs to do. I don't know, since I didn't write it. So, the benefit of being a software developer, besides getting to do awesome things all the time, is that I can write code to replace software that I don't know how it works, don't know what it's doing that makes it slow, and it is free* to replace (* takes my time but I learn stuff, so it pays for itself :)

So I wrote a crawler. A parallel crawler with go routines. Go and get the main site, parse out any URLs (image sources and anchor hrefs), add any that are unique to a map, crawl the next one. It was a fun exercise in getting familiar with Go and concurrent processing again.

But it takes 24 minutes!  Unacceptable across 5000 pages. So my idea was to create a distributed crawler, using the same core code, modified for distributed workload across multiple machines. However many you want, just start up the client on a new computer and it will dial the server and begin getting URLs that haven't been processed yet. It's done. Pretty awesome thing.  I'm not going to post the code here now, because it's on a different computer, and I haven't figured out a good project structure and put it in git yet.

But here's the general idea. You can just imagine the code :)

Server has a "job" config with the site that it wants to crawl. It starts up and immediately goes into listen mode, just waiting for clients. It does no crawling of its own (although the way the code is shared, that code is compiled into the server executable as well).

Client connects. To the first client that connects, the server sends the first URL. The client does the parsing, and sends back a status code as well as the original URL, and the list of URLs that it found.

The server records the status along with the URL, and adds any unique URLs found by the client. These URLs go into the "UnprocessedLinks" channel of type Link ( Url string, Referrer string). Referrer is kept so we know where the page was that lead to the 404 or whatever.

The server then continues to send any links on the UnprocessedLinks channel to any client that connects (The first link is sent to the first client, it is just added to the UnprocessedLinks channel when the server starts)

There were a few challenges that I overcame while writing this. First, learning Gobs. Next, ensuring that packets should be sent and received first, are indeed, sent and received first. The other thing is determining, in Go, when a client disconnects or the server shuts down. That was really all, the rest of the logic was written when I wrote the parallel crawler, so that was all figured out.

There's another thing I have to figure out. At any one time, a client can have a few hundred URLs in its process queue, because the server just sends those across all clients when it gets them (uniquely giving 1 URL to 1 client and not duplicating any work). However, if a client disconnects and has loads of URLs left in its queue, those URLs are currently lost forever. So on the server, I have to keep track of which URLs were sent, and which ones were received back as processed. Then if it disconnects, I can pump the remaining URLs that were sent and not processed back into the UnprocessedLinks channel, for future processing by future or current clients.

That's it. I haven't ran it yet across multiple computers, but it does a pretty good job of doing what it's supposed to do. I can't wait to test it soon :)

Transactional Data from Investment Account

So.... there are impending law-related proceedings that I'm going through where I have to provide certain values of financial assets that I hold, what their values were over the years, etc.  I have an investment account at what's now called CapitalOne Investing (was ShareBuilder when I signed up).  I was told that I could call them up and tell them a date or two and they could send me the value of my brokerage account on those dates.

They could not do this...

However, they do provide data on every transaction that you performed or which was performed on your behalf over the course of those 8 or so years I've been investing. This data is in CSV format.... Programming to the rescue!!

I used Go of course.

The data is in this format:


With the important data being Date, Action, Security, Price, Quantity (for what I care about).

The other part of this is historical stock price data which was easy enough to find on a site like MarketWatch.

The design of my program is like this.

  1. Read in the data.
  2. Pick a Start and End date for iteration
  3. Build a Map of Transaction list by Date. So in go this would be map[time.Time]data.LineList     (with a single item in data.LineList just holding a CSV line)
  4. Iterate over the dates, adding 1 to day on each iteration, running through the transactions for that day if they exist, otherwise move the the next one.
  5. On 5 separate transaction types, perform an action
    1. Buy, ReinvDiv - Peform a buy with (Date, Stock Symbol, Quantity, Price)
    2. Sell - Perform a sell with (Date, Stock Symbol, Quantity, Price)
    3. ShrsOut - Perform a SharesOut call. This is the first part of a reverse stock split.
    4. ShrsIn - Perform a SharesIn call. This is the second part of a reverse stock split.  (My transaction history only had reverse stock splits, not regular stock splits)
  6. Over the course of the years, keep track of shares owned, and on specific dates, output the value of the entire portfolio.

I was able to run this, then tabulate how many stocks I owned on the dates in question, then look up their price in the historical data on MarketWatch. Here some of the code, which will show how it works and make the data structure seem pretty apparent...

func (port *Portfolio) Buy(date time.Time, symbol string, quantity, price float64){
    if _,ok := port.Holdings[symbol]; ok {
        h := port.Holdings[symbol]
        h.Quantity += quantity
        h.CostBasis += quantity * price
        port.Holdings[symbol] = h
    } else {
        h := Holding { Symbol: symbol, Quantity: quantity, CostBasis: quantity * price }
        port.Holdings[symbol] = h

func (port *Portfolio) Sell(date time.Time, symbol string, quantity, price float64){
    if _,ok := port.Holdings[symbol]; ok {
        h := port.Holdings[symbol]
        h.Quantity += quantity
        h.CostBasis += quantity * price
        port.Holdings[symbol] = h

// stock reverse / splits
func (port *Portfolio) SharesOut(date time.Time, symbol string, quantity float64){
    if _,ok := port.Holdings[symbol]; ok {
        h := port.Holdings[symbol]
        h.Quantity += quantity
        port.Holdings[symbol] = h

func (port *Portfolio) SharesIn(date time.Time, symbol string, quantity float64){
    if _,ok := port.Holdings[symbol]; ok {
        h := port.Holdings[symbol]
        h.Quantity += quantity
        port.Holdings[symbol] = h

Quantity was negative on the Sell so the code could have been the same, almost.

Here's the process data function. As usual, I am a student of Go, if there's a better way to do any of this, I will be happy to learn about it!!

func processData(allData data.LineList, start time.Time, dates ...time.Time) *data.Portfolio {
    port := data.NewPortfolio()

    dateMap := getDateMap(allData)

    dt := start
    end := time.Now()

    for dt.Unix() < end.Unix() {
        impDate := false
        for _,d := range dates {
            if d == dt {
                impDate = true

        for _, lineItem := range dateMap[dt] {
            switch lineItem.Action {
                case "Buy", "ReinvDiv":
                    port.Buy(lineItem.Date, lineItem.Security, lineItem.Quantity, lineItem.Price)
                case "Sell":
                    port.Sell(lineItem.Date, lineItem.Security, lineItem.Quantity, lineItem.Price)
                case "ShrsOut":
                    port.SharesOut(lineItem.Date, lineItem.Security, lineItem.Quantity)
                case "ShrsIn":
                    port.SharesIn(lineItem.Date, lineItem.Security, lineItem.Quantity)

        if impDate {
            fmt.Println(dt.Format("Jan 2, 2006"), port)

        dt = dt.AddDate(0, 0, 1)

    return port

Advent Of Code - Day 22

I finished Day 22 at 7pm on December 29th. This wasn't due to 7 days of rigorous coding, failing, rethinking, mind you. I didn't do any work on it Christmas Eve, Christmas Day, the 26th, 27th or 28th. Long stories there. On the 23rd and 24th I took a break from 22 because it was clear I was not going about it the right way, and did days 23 and 24. I don't know how much coding I actually put into it. The leaderboard filled up at about 4am on the 22nd, I wouldn't have been on it if I did all of the coding at one time starting at midnight, I know that much. It might have been 8 hours for me.

There were a few little things that made this puzzle very fun but also a pain! Little subtle word clues that I initially just glossed over, turned out to lead to flaws in the logic that prevented me from solving this puzzle.

The Puzzle

The puzzle returns to day 21, where Little Henry Case is playing the same role playing game, but instead with swords and armor, you're now a magician. You have 5 spells at your disposal, each with varying effects. You are tasked with defeating the boss, who's hit points and damage are the puzzle's input. Mine happened to be 58 HP and 9 DMG. You must find the "path" to defeat the boss using as little mana as possible.

The Spells

These are the spells at your disposal:

  1. Magic Missile - Costs 53. Instant damage of 4.
  2. Drain - Costs 73. Instant damage of 2. Instant heal of 2.
  3. Shield - Costs 113. Initializes an effect that lasts 6 turns and adds 7 defense to you over those 6 turns.
  4. Poison - Costs 173. Lasts 6 turns. At the start of each turn, deals 3 damage
  5. Recharge - Costs 229. Lasts 5 turns. At the start of each turn, adds 101 mana to player.


The subtle wording comes in the rules and how the spells are interpreted.

  • On each of your turns, you must select one of your spells to cast. If you cannot afford to cast any spell, you lose.
  • Effects apply at the start of both the player's turns and the boss' turns.
  • Effects are created with a timer
  • At the start of each turn, after they apply any effect they have, their timer is decreased by one
  • You cannot cast a spell that would start an effect which is already active
  • However, effects can be started on the same turn they end.
  • Do not include mana recharge effects as "spending" negative mana

The Code!


  • Instant is a property on Effect because Instant effects are processed at the same time as reoccurring effects, and the rule states that you can't start an effect which is already active. However, if you have poison active, which causes Damage, you should be able to cast the Magic Missile effect, which also causes Damage, because it's an instant effect.
  • OneTime is a property on Effect to make Shield work correctly. Shield occurs immediately, and remains, but its effects aren't processed every turn. Just once it's cast, and removed when the timer counts down.
  • I use random generation to get the next spell. It will pick one based on how much mana is left on the player, and if that effect is not already active, and also checking the Instant flag on Effect, because we want to be able to cast Magic Missile even though Poison may be active.
  • I was a little inconsistent in how the functions work. For instance, ApplyEffects applies the Boss and Player hit points and mana directly (use pointers), while Attack just returns the damage to apply. I'm ok with this.
  • The Story struct is fun to read through, when it works, and you're verifying everything, and you say to yourself, "This might actually work!" :)

I had a lot of fun on these puzzles. I think I want to post one for Day 25 too, so check back soon for that one!

Advent of Code - Day 15

I've been going through the Advent of Code, and it's been pretty straight forward day to day. I got to day 15, and it took me a little bit to think of how to get an array of 4 digits that also adds up to 100. After some thought, I then gave into the easy, brute force solution. There was a previous puzzle for "incrementing" a password and I thought I could solve this one the same way.

In other words, I'd start with [0,0,0,0] and always increment the right-most digit (array with index = len(array)-1) and when it hit 100, it would increment the digit at position len(array)-2 and reset the current one back to 0. This would take F@#%inG FOREVER. But it would loop through all possible combinations.  Before calculating the "cookie recipe" based on the array, I would first check if they added up to 100, if not I would just increment until it did add up to 100. One quick brilliant optimization I made was to start with the initial array of [0,0,0,99] :)  (I kid on the brilliance of it... it saved 99 iterations in what could be a hundred million).

So then I thought again about it. There were two other previous puzzles where you could brute force the solution by first getting all possible permutations of the elements, and looping over all permutations, applying the logic, and sorting them to tell you the highest / lowest combined value.  I thought, this seems similar in a way to that, moreso than it did to the password incrementing.

So you'd start with inserting 100 into a blank array, then recursively call a method that would give you permutations of the remaining value, in this case would be 0. So in the end it would return [100,0,0,0]

If I inserted 50 into the array first, it would then give me permutations of the remaining 50.

So, [50,25,25,0], [50,25,0,25], [50,26,0,24] etc, all the way to [50,0,0,50], [50,50,0,0], [50,0,50,0].  Totally exhaustive but also, 100% correct and I never have to check if the sum of the array adds up to 100, because it's guaranteed to! :)

Here's that method in Go

func Perms(total, num int) [][]int {
    ret := [][]int{}
    if num == 2 {
        for i := 0; i < total/2 + 1; i++ {
            ret = append(ret, []int{ total-i, i })
            if i != total - i {
                ret = append(ret, []int{ i, total-i })
    } else {
        for i := 0; i <= total; i++ {
            perms := Perms(total-i, num-1)
            for _, p := range perms {
                q := append([]int{ i }, p...)
                ret = append(ret, q)
    return ret

perms := Perms(5, 3)

And the output

[[0 5 0] [0 0 5] [0 4 1] [0 1 4] [0 3 2] [0 2 3] [1 4 0] [1 0 4] [1 3 1] [1 1 3]
[1 2 2] [2 3 0] [2 0 3] [2 2 1] [2 1 2] [3 2 0] [3 0 2] [3 1 1] [4 1 0] [4 0 1]
[5 0 0]]

YES!!! I thought about that one for a while. I knew there had to be a way to do it :)

More importantly, for the Perms(100,4) input in the puzzle, it returns only 176,851 possible permutations, and my original brute force method could have looped 100,000,000 (one hundred million) times!!  Quite the improvement.