Got you

Last month, I had a massive bill from Rackspace for my servers. Apparently the server had picked up a trojan that was constantly downloading new versions of itself. Rackspace charges $0.12 per GB of transfer. My server, after slowing to a crawl for a few days while it pumped out data, pushed about 3 TB of traffic in December (roughly 3,000 GB at $0.12 is about $360 in bandwidth alone), bringing my charge for that month to over $400.

After I found it, I moved all of my databases and everything else over to my backup next-gen server and reimaged the infected one. It was clean. For a month.

Yesterday, I found the same virus running on both of my servers. This time, determined to track it down and with the time to do it, I found the culprit. More importantly, I found out how to get rid of it, though I still haven't figured out exactly how it got onto my servers in the first place, or how to protect against it.

It is the Linux/DDoS trojan, and it has an embedded rootkit. It is nearly impossible to find by googling "Linux virus" as part of your search, because the only results that come back are articles about how Linux doesn't get viruses...

The article I found describing it wasn't written at the time of the previous infection, and luckily I came across it this time. Otherwise, I would have had to reimage both servers, causing lots of downtime. That was a pain, but at least now I know what to look for if that bastard finds his way onto my servers again.

Fixed!

The latest change, running the webserver with nohup, seems to have fixed my server issues. I did get one crash soon after changing the startup script, and I couldn't track it down, but the server has been running non-stop for a few days now.

Finally, I can put this to rest. As a side benefit, I also got more efficient tracking, more detailed error messages in some cases, and less detailed ones in others (for files not found, for instance, it now just logs 'Not found: ' and the file name instead of a whole stack trace).
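For the curious, the 'Not found' change boils down to something like this. It's a minimal sketch, not the actual handler; serveFile, the root directory, and the exact messages are stand-ins, and it skips sanitizing req.url:

    var fs = require('fs');
    var path = require('path');

    // Sketch of the terse logging: a one-line message for missing files,
    // full detail for anything else.
    function serveFile(req, res, rootDir) {
      var filePath = path.join(rootDir, req.url);
      fs.readFile(filePath, function (err, data) {
        if (err && err.code === 'ENOENT') {
          console.error('Not found: ' + filePath);   // no stack trace
          res.writeHead(404, { 'Content-Type': 'text/plain' });
          res.end('Not found: ' + req.url);
        } else if (err) {
          console.error(err.stack || err);           // other errors keep the detail
          res.writeHead(500, { 'Content-Type': 'text/plain' });
          res.end('Internal Server Error');
        } else {
          res.writeHead(200);
          res.end(data);
        }
      });
    }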

Moment of Truth #3

After adding better logging to the webserver, I determined that no error was causing the server to close. I had been starting the service with node webserver &, but I think I still needed nohup.

We shall see. My nohup'ed webserver is now running, and that's how I'll start it from now on, whether it turns out to be the cause of the random deaths or not. It's just the better way to start it.
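For reference, the difference is that & only puts the process in the background; it is still attached to the terminal session and can be killed by the hangup signal when that session goes away, while nohup starts it ignoring that signal. So the start command now looks roughly like this (the log redirection is just illustrative):

    # & alone only backgrounds the process; nohup makes it ignore SIGHUP
    # so it keeps running after the SSH session ends.
    nohup node webserver > webserver.log 2>&1 &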

Moment of Truth #2

After upgrading to Node.js 0.10.21 and implementing domains, I was still getting errors. Since then, I've added code to listen for the response 'close' event, as well as the server 'clientError' event, in an effort to track down the issue. I think I may be on to something. At the very least, I'll be able to take the issue to Stack Overflow or GitHub in hopes that it gets fixed, or the act of listening for those events and logging the errors will lead me to the fix myself. Here's hoping.
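The listeners themselves are nothing fancy; the gist is something like this (a sketch, with a trivial handler and port standing in for the real ones):

    var http = require('http');

    var server = http.createServer(function (req, res) {
      // 'close' on the response means the underlying connection was torn down
      // before res.end() was called.
      res.on('close', function () {
        console.error('response closed early: ' + req.method + ' ' + req.url);
      });

      res.writeHead(200, { 'Content-Type': 'text/plain' });
      res.end('ok'); // stand-in for the real request handling
    });

    // Errors on client connections are forwarded here so they show up in the logs.
    server.on('clientError', function (err) {
      console.error('clientError: ' + (err.stack || err));
    });

    server.listen(8080); // port is illustrative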

The Moment of Truth

Every morning I get an email from Pingdom telling me my site went down overnight. I upgraded to Node.js 0.10.20 a few weeks ago to take advantage of some other bug fixes and optimizations, and it shows in the Google Analytics page load times (they all report as 0 seconds).

I have had no luck tracking down the cause, but I read about how to prevent a server error from bringing down the whole site. I've implemented the suggested solution using Node.js domains, and we'll see what happens tomorrow.
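The suggested approach wraps each request in a domain so an uncaught error takes down just that request instead of the whole process. Roughly (a sketch of the pattern from the Node docs, with a placeholder handler and port):

    var domain = require('domain');
    var http = require('http');

    var server = http.createServer(function (req, res) {
      var d = domain.create();

      // Any uncaught error thrown while handling this request lands here
      // instead of crashing the whole process.
      d.on('error', function (err) {
        console.error('request error: ' + (err.stack || err));
        try {
          res.writeHead(500, { 'Content-Type': 'text/plain' });
          res.end('Internal Server Error');
        } catch (err2) {
          console.error('error sending 500: ' + (err2.stack || err2));
        }
      });

      // Tie the request and response to the domain, then run the handler inside it.
      d.add(req);
      d.add(res);
      d.run(function () {
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end('ok'); // stand-in for the real request handling
      });
    });

    server.listen(8080); // port is illustrative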

It might be the same thing, because when I go into the server, Node is still running my webserver; it just appears that the socket was destroyed... We shall see. Wish me luck!