SQL Server, A Million Updates, Multithreading and Queues

I've pretty much figured out how to run thousands of updates efficiently against SQL Server. As with most things, the answer is: many at the same time.

In the past month or two at work, I've had two projects that involved massive updates of data: pulling data from a source, processing it, and, coincidentally in both cases, updating SQL Server. I've learned a lot.

First, SQL Server does not respond well to multiple threads each running thousands of updates. I did not know this. I've seen the error message "Transaction (Process ID XX) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction." more times than I'd like to admit. I've run SQL updates from multiple threads many times before, but I guess never with tens of thousands of rows.
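
For the record, the pattern that kept getting picked as the victim looked more or less like this. It's a simplified sketch, not my actual code: the row type, the dbo.UpdateWidget proc, and the names are all made up for illustration.

        using System.Collections.Generic;
        using System.Data.SqlClient;
        using System.Threading.Tasks;

        public static class NaiveUpdater
        {
            // Every worker fires its updates directly, all at the same time.
            // With tens of thousands of rows spread across the threads,
            // SQL Server starts choosing deadlock victims.
            public static void UpdateEverything(
                IEnumerable<(int Id, string Value)> rows, string connectionString)
            {
                Parallel.ForEach(rows, row =>
                {
                    using (var connection = new SqlConnection(connectionString))
                    using (var command = new SqlCommand(
                        "EXEC dbo.UpdateWidget @Id, @Value", connection))
                    {
                        command.Parameters.AddWithValue("@Id", row.Id);
                        command.Parameters.AddWithValue("@Value", row.Value);
                        connection.Open();
                        command.ExecuteNonQuery(); // thousands of these per thread -> error 1205
                    }
                });
            }
        }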

I wrote two apps, each processing hundreds of thousands of rows of data. One was written in C#, the other in Python. I'm not quite as adept in Python, but I've learned some tricks.

The approach I took in each language was much the same. Both involved creating a shared queue to hold all of the SQL statements that need to run; the SQL statements are just stored procedure calls. A single process then works through the queue, batches the statements into 15-30 statement chunks, and executes each batch.

The two solutions, Python and C#, were slightly different, though. In Python, the multiple threads would add to the queue, and only after all the threads were done processing would the queue be processed. The C# solution involved creating a singleton object (one per connection string) that held the queue and contained its own thread, which constantly processed the queue. But since it was just one thread, it wasn't overwhelming SQL Server in any way. Here's a little bit of code. In each language I used the built-in queue provided by the standard library, although in C# I used the ConcurrentQueue.

C# pseudo code



        multiple threads collecting data
        {
              call service
              add all data to be updated to the service
        }

        service data collection
        {
              get sql for object (orm)
              add to the shared queue
        }


        sql execution thread - run
        {
              while (true)
              {
                    open connection
                    while queue has stuff
                    {
                          create a string builder, add 15 or so commands to the batch, separated by ;
                          build a database command and execute it
                    }
                    close connection
                    sleep for a second
              }
        }
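
To make that concrete, here's roughly what the C# version looks like as actual code. Treat it as a minimal sketch rather than the real thing: the SqlBatcher class and its method names are made up for illustration, and it assumes System.Data.SqlClient and that each queued string is a complete stored procedure call.

        using System;
        using System.Collections.Concurrent;
        using System.Data.SqlClient;
        using System.Text;
        using System.Threading;

        public class SqlBatcher
        {
            // One instance per connection string, shared by all the worker threads.
            private static readonly ConcurrentDictionary<string, SqlBatcher> Instances =
                new ConcurrentDictionary<string, SqlBatcher>();

            public static SqlBatcher For(string connectionString) =>
                Instances.GetOrAdd(connectionString, cs => new SqlBatcher(cs));

            private readonly ConcurrentQueue<string> _queue = new ConcurrentQueue<string>();
            private readonly string _connectionString;

            private SqlBatcher(string connectionString)
            {
                _connectionString = connectionString;
                // The one and only SQL execution thread for this connection string.
                new Thread(Run) { IsBackground = true }.Start();
            }

            // The data collection threads only ever call this.
            public void Enqueue(string procCall) => _queue.Enqueue(procCall);

            private void Run()
            {
                while (true)
                {
                    using (var connection = new SqlConnection(_connectionString))
                    {
                        connection.Open();
                        while (!_queue.IsEmpty)
                        {
                            // Pull ~15 statements and join them into one batch, separated by ;
                            var batch = new StringBuilder();
                            var count = 0;
                            string sql;
                            while (count < 15 && _queue.TryDequeue(out sql))
                            {
                                batch.Append(sql).Append(';');
                                count++;
                            }
                            if (batch.Length == 0) break;
                            using (var command = new SqlCommand(batch.ToString(), connection))
                                command.ExecuteNonQuery();
                        }
                    }
                    Thread.Sleep(1000); // queue is empty; nap, then check again
                }
            }
        }

A worker thread just calls SqlBatcher.For(connectionString).Enqueue(sql) and gets back to collecting data.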

Python pseudo code


        multiple threads collecting data
        {
             download data (in this case, just downloading CSVs)
             process data
             get sql statement
             add sql statement to the shared queue
        }

        main thread
        {
             collect all data (fill the queue) across multiple threads
             process the queue in a single thread, executing it in batches (35 statements per batch, in this case)
        }
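
Fleshed out, the Python side looks something like this. Again, a rough sketch and not the real app: download_and_parse, csv_urls, and dbo.UpdateWidget are made-up stand-ins, and I'm assuming pyodbc as the driver.

        import queue
        import threading

        import pyodbc  # assuming pyodbc; any DB-API driver behaves the same here

        BATCH_SIZE = 35
        sql_queue = queue.Queue()

        def collect(url):
            # Worker thread: download a CSV, turn each row into a stored proc call.
            for row_id, value in download_and_parse(url):  # stand-in for the real work
                sql_queue.put("EXEC dbo.UpdateWidget %d, '%s'" % (row_id, value))

        def drain(connection_string):
            # Single thread: pull statements off the queue and run them in batches.
            connection = pyodbc.connect(connection_string)
            cursor = connection.cursor()
            batch = []
            while not sql_queue.empty():
                batch.append(sql_queue.get())
                if len(batch) == BATCH_SIZE:
                    cursor.execute(";".join(batch))
                    connection.commit()
                    batch = []
            if batch:  # whatever was left over
                cursor.execute(";".join(batch))
                connection.commit()
            connection.close()

        # Main thread: fill the queue across multiple threads, then process it.
        threads = [threading.Thread(target=collect, args=(url,))
                   for url in csv_urls]  # csv_urls: the list of files to pull (stand-in)
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        drain("DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...")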

So as you can see, the C# version processes the queue while the data is being collected, and the Python version waits until the end and then processes the queue. I think the C# approach is better; as I said, I'm a bit more adept with C#, so I'm more comfortable doing things like that there. Hopefully this helps someone out there with processing loads of data.

Of course, in Go I would have just used a channel! Just kidding, there'd be the same amount of complexity in Go, but the end result would definitely be a lot better looking!

Happy Coding!
