New Go Project: Distributed Crawler

In work, we created a site for a client. While the development work was pretty straightforward (although I did do a few neat things), it was very, extremely content heavy. We're talking 5000 or more pages. While the development also was pretty straightforward, it was ongoing until the end, and doing that while content entry is going on proves to be a bit of a challenge. If the way something works changes, and content is not present, things break.

There are tools like that yelling amphibian (you know the one) to help crawl sites and report on errors, like 404s, 500s, etc. However, it takes a long time to run, costs money, doesn't do specifically what I need, and did I mention it takes forever?

One of the reasons it takes a long time to run is that it's probably doing more in there than it needs to do. I don't know, since I didn't write it. So, the benefit of being a software developer, besides getting to do awesome things all the time, is that I can write code to replace software that I don't know how it works, don't know what it's doing that makes it slow, and it is free* to replace (* takes my time but I learn stuff, so it pays for itself :)

So I wrote a crawler. A parallel crawler with go routines. Go and get the main site, parse out any URLs (image sources and anchor hrefs), add any that are unique to a map, crawl the next one. It was a fun exercise in getting familiar with Go and concurrent processing again.

But it takes 24 minutes!  Unacceptable across 5000 pages. So my idea was to create a distributed crawler, using the same core code, modified for distributed workload across multiple machines. However many you want, just start up the client on a new computer and it will dial the server and begin getting URLs that haven't been processed yet. It's done. Pretty awesome thing.  I'm not going to post the code here now, because it's on a different computer, and I haven't figured out a good project structure and put it in git yet.

But here's the general idea. You can just imagine the code :)

Server has a "job" config with the site that it wants to crawl. It starts up and immediately goes into listen mode, just waiting for clients. It does no crawling of its own (although the way the code is shared, that code is compiled into the server executable as well).

Client connects. To the first client that connects, the server sends the first URL. The client does the parsing, and sends back a status code as well as the original URL, and the list of URLs that it found.

The server records the status along with the URL, and adds any unique URLs found by the client. These URLs go into the "UnprocessedLinks" channel of type Link ( Url string, Referrer string). Referrer is kept so we know where the page was that lead to the 404 or whatever.

The server then continues to send any links on the UnprocessedLinks channel to any client that connects (The first link is sent to the first client, it is just added to the UnprocessedLinks channel when the server starts)

There were a few challenges that I overcame while writing this. First, learning Gobs. Next, ensuring that packets should be sent and received first, are indeed, sent and received first. The other thing is determining, in Go, when a client disconnects or the server shuts down. That was really all, the rest of the logic was written when I wrote the parallel crawler, so that was all figured out.

There's another thing I have to figure out. At any one time, a client can have a few hundred URLs in its process queue, because the server just sends those across all clients when it gets them (uniquely giving 1 URL to 1 client and not duplicating any work). However, if a client disconnects and has loads of URLs left in its queue, those URLs are currently lost forever. So on the server, I have to keep track of which URLs were sent, and which ones were received back as processed. Then if it disconnects, I can pump the remaining URLs that were sent and not processed back into the UnprocessedLinks channel, for future processing by future or current clients.

That's it. I haven't ran it yet across multiple computers, but it does a pretty good job of doing what it's supposed to do. I can't wait to test it soon :)

Transactional Data from Investment Account

So.... there are impending law-related proceedings that I'm going through where I have to provide certain values of financial assets that I hold, what their values were over the years, etc.  I have an investment account at what's now called CapitalOne Investing (was ShareBuilder when I signed up).  I was told that I could call them up and tell them a date or two and they could send me the value of my brokerage account on those dates.

They could not do this...

However, they do provide data on every transaction that you performed or which was performed on your behalf over the course of those 8 or so years I've been investing. This data is in CSV format.... Programming to the rescue!!

I used Go of course.

The data is in this format:


With the important data being Date, Action, Security, Price, Quantity (for what I care about).

The other part of this is historical stock price data which was easy enough to find on a site like MarketWatch.

The design of my program is like this.

  1. Read in the data.
  2. Pick a Start and End date for iteration
  3. Build a Map of Transaction list by Date. So in go this would be map[time.Time]data.LineList     (with a single item in data.LineList just holding a CSV line)
  4. Iterate over the dates, adding 1 to day on each iteration, running through the transactions for that day if they exist, otherwise move the the next one.
  5. On 5 separate transaction types, perform an action
    1. Buy, ReinvDiv - Peform a buy with (Date, Stock Symbol, Quantity, Price)
    2. Sell - Perform a sell with (Date, Stock Symbol, Quantity, Price)
    3. ShrsOut - Perform a SharesOut call. This is the first part of a reverse stock split.
    4. ShrsIn - Perform a SharesIn call. This is the second part of a reverse stock split.  (My transaction history only had reverse stock splits, not regular stock splits)
  6. Over the course of the years, keep track of shares owned, and on specific dates, output the value of the entire portfolio.

I was able to run this, then tabulate how many stocks I owned on the dates in question, then look up their price in the historical data on MarketWatch. Here some of the code, which will show how it works and make the data structure seem pretty apparent...

func (port *Portfolio) Buy(date time.Time, symbol string, quantity, price float64){
    if _,ok := port.Holdings[symbol]; ok {
        h := port.Holdings[symbol]
        h.Quantity += quantity
        h.CostBasis += quantity * price
        port.Holdings[symbol] = h
    } else {
        h := Holding { Symbol: symbol, Quantity: quantity, CostBasis: quantity * price }
        port.Holdings[symbol] = h

func (port *Portfolio) Sell(date time.Time, symbol string, quantity, price float64){
    if _,ok := port.Holdings[symbol]; ok {
        h := port.Holdings[symbol]
        h.Quantity += quantity
        h.CostBasis += quantity * price
        port.Holdings[symbol] = h

// stock reverse / splits
func (port *Portfolio) SharesOut(date time.Time, symbol string, quantity float64){
    if _,ok := port.Holdings[symbol]; ok {
        h := port.Holdings[symbol]
        h.Quantity += quantity
        port.Holdings[symbol] = h

func (port *Portfolio) SharesIn(date time.Time, symbol string, quantity float64){
    if _,ok := port.Holdings[symbol]; ok {
        h := port.Holdings[symbol]
        h.Quantity += quantity
        port.Holdings[symbol] = h

Quantity was negative on the Sell so the code could have been the same, almost.

Here's the process data function. As usual, I am a student of Go, if there's a better way to do any of this, I will be happy to learn about it!!

func processData(allData data.LineList, start time.Time, dates ...time.Time) *data.Portfolio {
    port := data.NewPortfolio()

    dateMap := getDateMap(allData)

    dt := start
    end := time.Now()

    for dt.Unix() < end.Unix() {
        impDate := false
        for _,d := range dates {
            if d == dt {
                impDate = true

        for _, lineItem := range dateMap[dt] {
            switch lineItem.Action {
                case "Buy", "ReinvDiv":
                    port.Buy(lineItem.Date, lineItem.Security, lineItem.Quantity, lineItem.Price)
                case "Sell":
                    port.Sell(lineItem.Date, lineItem.Security, lineItem.Quantity, lineItem.Price)
                case "ShrsOut":
                    port.SharesOut(lineItem.Date, lineItem.Security, lineItem.Quantity)
                case "ShrsIn":
                    port.SharesIn(lineItem.Date, lineItem.Security, lineItem.Quantity)

        if impDate {
            fmt.Println(dt.Format("Jan 2, 2006"), port)

        dt = dt.AddDate(0, 0, 1)

    return port

Mortgage Account Finances

I've come up with a brilliant idea. So I started the refinance process on my mortgage to bring down my monthly payments. I'm pretty happy with where they'll be for the next 30 years. But I've come up with a concept, at least, according to my outside influence, I've come up with it on my own, but it could be a concept already.

If I'm generally tight on finances, even with considering my mortgage payment is a lot less than it was, if I saved up in a dedicated account, enough for one mortgage payment, it would be like, after months of saving, it would be a huge release of pressure on the burden of paying mortgage that month.

If my mortgage payment is (fictitious amount here) $1200 a month, which includes all taxes and escrow, etc. Let's say I want that magical burden-lifting event to happen twice a year. I guess logically (although I came to the number in a different way) you'd multiply that mortgage by 2, to get $2400. Then divide that by 52 weeks in the year. I simply did $1200 / 26. You'd get 46.16, rounded up to the nearest penny.

$46.16 is the amount that you have to put away each week in order to have $1200 in a savings account after 6 months, that you can transfer over to your checking and use to pay your mortgage that month. Then it's like getting a cash injection. As a bonus you'd also have some extra cash in there due to interest accruing on that account over 6 months!

Obviously this isn't a long term savings plan. You'd be doing those as well on the side. I've got betterment for that. This would just be an extra account in my bank that I auto transfer $46.16 to every Friday (in my case). But I'm also transferring money to my investment accounts at the same time. It might be the reason I'm a little tight in finances, but in 6 months, that will be amazing to have that extra injection of cash!  :)  I don't do finance, just computers, so I don't know the terms for this type of financial strategy, but there probably is one.

Happy Saving!


I'm definitely tweeting these too. It felt like the perfect combination.

The Unmatched Welding of Politics and Flavor - #supertacotuesday

Townhalls and Tortillas - #supertacotuesday

Voters, Villains, Saviors, and V-Shaped Savories - #supertacotuesday

Those are fun :)  #supertacotuesday

Note To Self

For setting up a single page application (SPA) on NGINX (Engine-X, [I have to write it out since I always wrongly say "en-jinx"]), with HTML 5 style URLs

     server {
    listen 80;
    server_name serverinfo;{
    location  /go/ {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header Host $host;
        proxy_pass http://localhost:8234/;

    location / {
        try_files = $uri /index.html;
        root c:/users/jason/documents/go/src/serverinfo/site;

This will try the file and write out index.html (rewrite) if it can't find it. All front end is static, and anything that starts with /go/ (for now, may change) will proxy pass to the service running, which is serving up data, written in Go and accessing MongoDB.

Also, I'm starting to use TypeScript and learn Angular2, so I will probably be posting stuff on that relatively soon.

Advent Of Code - Day 22

I finished Day 22 at 7pm on December 29th. This wasn't due to 7 days of rigorous coding, failing, rethinking, mind you. I didn't do any work on it Christmas Eve, Christmas Day, the 26th, 27th or 28th. Long stories there. On the 23rd and 24th I took a break from 22 because it was clear I was not going about it the right way, and did days 23 and 24. I don't know how much coding I actually put into it. The leaderboard filled up at about 4am on the 22nd, I wouldn't have been on it if I did all of the coding at one time starting at midnight, I know that much. It might have been 8 hours for me.

There were a few little things that made this puzzle very fun but also a pain! Little subtle word clues that I initially just glossed over, turned out to lead to flaws in the logic that prevented me from solving this puzzle.

The Puzzle

The puzzle returns to day 21, where Little Henry Case is playing the same role playing game, but instead with swords and armor, you're now a magician. You have 5 spells at your disposal, each with varying effects. You are tasked with defeating the boss, who's hit points and damage are the puzzle's input. Mine happened to be 58 HP and 9 DMG. You must find the "path" to defeat the boss using as little mana as possible.

The Spells

These are the spells at your disposal:

  1. Magic Missile - Costs 53. Instant damage of 4.
  2. Drain - Costs 73. Instant damage of 2. Instant heal of 2.
  3. Shield - Costs 113. Initializes an effect that lasts 6 turns and adds 7 defense to you over those 6 turns.
  4. Poison - Costs 173. Lasts 6 turns. At the start of each turn, deals 3 damage
  5. Recharge - Costs 229. Lasts 5 turns. At the start of each turn, adds 101 mana to player.


The subtle wording comes in the rules and how the spells are interpreted.

  • On each of your turns, you must select one of your spells to cast. If you cannot afford to cast any spell, you lose.
  • Effects apply at the start of both the player's turns and the boss' turns.
  • Effects are created with a timer
  • At the start of each turn, after they apply any effect they have, their timer is decreased by one
  • You cannot cast a spell that would start an effect which is already active
  • However, effects can be started on the same turn they end.
  • Do not include mana recharge effects as "spending" negative mana

The Code!


  • Instant is a property on Effect because Instant effects are processed at the same time as reoccurring effects, and the rule states that you can't start an effect which is already active. However, if you have poison active, which causes Damage, you should be able to cast the Magic Missile effect, which also causes Damage, because it's an instant effect.
  • OneTime is a property on Effect to make Shield work correctly. Shield occurs immediately, and remains, but its effects aren't processed every turn. Just once it's cast, and removed when the timer counts down.
  • I use random generation to get the next spell. It will pick one based on how much mana is left on the player, and if that effect is not already active, and also checking the Instant flag on Effect, because we want to be able to cast Magic Missile even though Poison may be active.
  • I was a little inconsistent in how the functions work. For instance, ApplyEffects applies the Boss and Player hit points and mana directly (use pointers), while Attack just returns the damage to apply. I'm ok with this.
  • The Story struct is fun to read through, when it works, and you're verifying everything, and you say to yourself, "This might actually work!" :)

I had a lot of fun on these puzzles. I think I want to post one for Day 25 too, so check back soon for that one!

Advent of Code - Day 15

I've been going through the Advent of Code, and it's been pretty straight forward day to day. I got to day 15, and it took me a little bit to think of how to get an array of 4 digits that also adds up to 100. After some thought, I then gave into the easy, brute force solution. There was a previous puzzle for "incrementing" a password and I thought I could solve this one the same way.

In other words, I'd start with [0,0,0,0] and always increment the right-most digit (array with index = len(array)-1) and when it hit 100, it would increment the digit at position len(array)-2 and reset the current one back to 0. This would take F@#%inG FOREVER. But it would loop through all possible combinations.  Before calculating the "cookie recipe" based on the array, I would first check if they added up to 100, if not I would just increment until it did add up to 100. One quick brilliant optimization I made was to start with the initial array of [0,0,0,99] :)  (I kid on the brilliance of it... it saved 99 iterations in what could be a hundred million).

So then I thought again about it. There were two other previous puzzles where you could brute force the solution by first getting all possible permutations of the elements, and looping over all permutations, applying the logic, and sorting them to tell you the highest / lowest combined value.  I thought, this seems similar in a way to that, moreso than it did to the password incrementing.

So you'd start with inserting 100 into a blank array, then recursively call a method that would give you permutations of the remaining value, in this case would be 0. So in the end it would return [100,0,0,0]

If I inserted 50 into the array first, it would then give me permutations of the remaining 50.

So, [50,25,25,0], [50,25,0,25], [50,26,0,24] etc, all the way to [50,0,0,50], [50,50,0,0], [50,0,50,0].  Totally exhaustive but also, 100% correct and I never have to check if the sum of the array adds up to 100, because it's guaranteed to! :)

Here's that method in Go

func Perms(total, num int) [][]int {
    ret := [][]int{}
    if num == 2 {
        for i := 0; i < total/2 + 1; i++ {
            ret = append(ret, []int{ total-i, i })
            if i != total - i {
                ret = append(ret, []int{ i, total-i })
    } else {
        for i := 0; i <= total; i++ {
            perms := Perms(total-i, num-1)
            for _, p := range perms {
                q := append([]int{ i }, p...)
                ret = append(ret, q)
    return ret

perms := Perms(5, 3)

And the output

[[0 5 0] [0 0 5] [0 4 1] [0 1 4] [0 3 2] [0 2 3] [1 4 0] [1 0 4] [1 3 1] [1 1 3]
[1 2 2] [2 3 0] [2 0 3] [2 2 1] [2 1 2] [3 2 0] [3 0 2] [3 1 1] [4 1 0] [4 0 1]
[5 0 0]]

YES!!! I thought about that one for a while. I knew there had to be a way to do it :)

More importantly, for the Perms(100,4) input in the puzzle, it returns only 176,851 possible permutations, and my original brute force method could have looped 100,000,000 (one hundred million) times!!  Quite the improvement.

Sorry, I gave Rust a try

There's a new language over at Mozilla (new as in I've only heard about it in the past year), called Rust.  I tried it. I don't like it.

I can't judge it based on the fact that I couldn't get what I thought was a simple program to compile. I also can't judge it that I tried it after trying (and falling in love with) Go.

It just doesn't sit well with me. I don't like strict languages. The thing that made me give it a shot even, is that it has generics, where Go doesn't.  But oh how I hate the syntax!!

Ths is all my opinion and personal preference, obviously. Give Rust a shot if you are reading this and are on the fence. It's interesting. It'll make you think. Just not what I want to think about.

I liked the way writing code in Node.js (and I wrote a fuck-tonne of it) made you think in terms of asynchronous, event driven code. It worked for me for a little while, until I hated not compiling code and finding simple problems at runtime. That's when I decided to give Go a shot. I love Go. I really really love the syntax. It made me think differently, in terms of concurrency, goroutines, and channels. That was an amazingly fun learning experience (I do have my degree in computer science, concurrency and parallel programming aren't new to me, but the Go concurrent model was not covered, or at least not well or not hammered into my brain). I just found two issues... No ability yet to dynamically link libraries, as everything created is a static executable or a static library. And no generics which it tried to solve with code generation.

It lead me to a discussion with a coworker. For all intents and purposes, syntax is arbitrary.  What I mean is that there are constructs that are in every language. There are packages, structures, loops, condition statements, evaluation, expressions, basic types, reference types, pointers, method overloading, etc...  What if you could try a "new language", or really what I mean is a compilation environment, or even runtime environment for interpreted languages, with the syntax from your favorite language?

It's easy if your favorite language is assembly, then you just write everything in assembly. But if you love Go and goroutines, you could use that instead of thread::spawn(), for example.

Ok, looking through Rust docs for what you would put goroutines in place of, I hate the syntax even more. Calling ".unwrap()" on everything, namespaces are no more than like 4 characters (std, sync, iter, str, cell, ffi, etc). There's double colons everywhere, &mut (wtf is that :P .. I know what it is, it's a flaw in that you have to tell Rust that you need mutability on the borrowed reference.

Maybe I have to get a book and hope a good set of examples comes along, rather than the standard fare bullcrap example apps that you'll find on a worthless Rust by example site. I hate those base barebones example sites. Show me real world problems and solutions. Then maybe I'll see what your language is good for...


.NET Concurrency Model

I tried to replicate the Go concurrency model as demonstrated by my previous post, Software Design and Go. However, the outcome is not as pretty. I say Go concurrency model as it was made most obvious to me from a talk I watched given by Rob Pike, one of the designers of Go. Obviously as a language design decision, Go has channels for communicating between concurrent processes. .NET, as far as I'm aware, does not  have this exactly. What it does have is a set of specialized concurrent collections that you can use for passing data between threads. This seems to work.

The one thing you might be saying is ".NET has async and await now! There's no use for threads anymore!"  I can't begin to describe how wrong that is.  Async and await are more to act like Javascript and Node.js, non-blocking threads of execution that don't necessarily have to be coordinated (concurrent). Threads and Concurrent collections best simulate what's going on with Go's goroutines and channels. Particularly the ConcurrentQueue class.

But, as you might imagine, the syntax is much more verbose. 208 lines vs. 171 lines. 6,946 bytes vs. 3,652 bytes (C# vs Go respectively).

Here is a taste of what the C# program looks like. Obviously the bulk of it has been trimmed, but the meat is here:

Software Design and Go

Go has really flipped the concept of software design upside down for me. It's made it more interesting and fun. I will take you through some iterations of a recursive "merge verifier" that I wrote which was a necessity due to subversion's bad merge outcomes. Where after merging 3-4 branches into a master, the result would be broken. So alternatively, it can just be used to find different files among two identical folder structures.  It's not going to tell you missing files, but that's just a shortcoming of it right now, it will be an added feature later.

So to design this application, there's the linear way. Read in a file from the "left directory", and find the same file in the "right directory". Get an MD5 hash of their contents. I know MD5 is "broked" but in what case in my use would they ever collide on two files that are diffent?  Anyway, compare the MD5s. If they're not equal report them, otherwise, repeat, recurse, and recurse, and repeat. Repeatedly. Until the folder structures have been fully recursed.

Then there's the Go way. Although I'm still trying to determine what that is, evidenced by the application has gone through 2 redesigns at this point. One when learning Go initially, and now, after learning Go and putting it down for a good few months. However, I have a much better idea of how they're supposed to be designed, and the first time I didn't try to do it with the Go concurrency model.

So this new way...  Get a file path and the fact that it exists, and send this along a channel. On a select statement, listen for these newly discovered files. In a go subroutine, get an MD5 of their contents ignoring white space.   This is the blocking call, so we want to do this as separately as possible. Once the file contents are hashed, hang onto it until its twin (the same file in the other directory structure) gets its contents hashed. Once you have both, create a FilePair object and set its left content to the left file, and right content to the right file, and send it along another channel. Receive these on another select statement, and just do reporting. Compare the MD5, if they're different, spit out a message with the relative path to the file.

Going with the new way, I've cut the program down from 15 seconds to 7 seconds, on a 1.48 GB directory structure. So that's two directories so nearly 3GB of data.

The program does include a way to exclude directories (like .svn or .git) and specify only files that we care about (.cs, .css, .js, .ascx, .cshtml, etc), so it's not reading the entire 3 GB of data. But to cut it down in HALF is pretty good, just by reordering the steps and grouping IO together separate from other things, allowing it to read the files and still do work on the side.

Here's the meat of the code: