‘Tiny’ is the name of the sheep.
No, but seriously. It is not often that the backend tyrannosaur feels tiny, insignificant and overwhelmed by the scale of things. Usually it’s the rest of the world bowing before the unstoppable, world-shaking, hamburger-powered, sex-machine might of the tyrannosaur.
But it does happen. The last three weeks (yes, three. Count ’em. Three goddamn weeks) I’ve been facing a monster of epic proportions. A beast that could eat 32 GB of RAM in 36 hours. Now, a gigabyte is 1,073,741,824 bytes and a ram is a sheep that can weigh up to 150 kg, so this monster could eat 5,153,960,755,200 kg (a bit more than FIVE THOUSAND MILLION TONS) of raw sheep in just about a day and a half. That’s a lot of baaa, folks. Even with ketchup, I don’t think I could do it.
The beast has a name… The Memory Leak. And it was forcing us to reboot the servers about once a day, which was preventing you, the players, from slaughtering each other. And we can’t have that.
It is an unfortunate reality of life that computer memory is not endless, which is why memory management has always been a basic programming necessity. However, since most of the H&G backend is written in .Net, we usually don’t have to worry all that much about it. .Net has this nifty feature called ‘Garbage collection’, which means that the programmer does not need to handle the allocation and deallocation of memory. A background process, the Garbage collector (or GC for short) simply checks every now and then if any memory needs to be cleaned up, and then does so.
So memory leaks in .Net are rare, and usually easily fixable. If we were doing this in C++ (for example), memory leaks would be a far more everyday occurrence. Programming in a non-garbage collected language means that the programmer has to remember to manually release every single byte he allocates, so it’s easy to miss one here and there. And when you do discover that memory is being allocated but not freed, the memory in question can just be floating aimlessly in space, with nothing left in the program pointing at it.
Whereas a memory leak in .Net typically requires that some piece of the runtime still be holding a reference to the memory. It might not be using it anymore, which is what makes this a ‘leak’, but it IS explicitly holding on to it. Thus finding the memory leak is pretty much just looking at the code and saying ‘alright, which one of you idiots is keeping stuff around that you don’t need? hmmm?’.
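To make the ‘somebody is still holding a reference’ idea concrete, here is a minimal sketch. It is in Python rather than .Net, purely because the principle is the same in any garbage collected runtime, and every name in it (`Packet`, `handled_packets`, `handle`) is made up for illustration and has nothing to do with the actual H&G code.

```python
import gc
import weakref


class Packet:
    """Stand-in for some object the server allocates per request."""

    def __init__(self, payload):
        self.payload = payload


# Hypothetical forgotten collection: nothing reads from it any more,
# but it still holds a strong reference to every packet ever handled.
handled_packets = []


def handle(packet):
    handled_packets.append(packet)  # the leak: kept around, never used again


packet = Packet(b"x" * 1024)
probe = weakref.ref(packet)  # lets us observe whether the object is still alive

handle(packet)
del packet  # our own reference is gone...
gc.collect()
alive_while_referenced = probe() is not None  # ...but the list keeps it alive

handled_packets.clear()  # drop the stray reference
gc.collect()
alive_after_clear = probe() is not None  # nothing references it, so it is reclaimed

print(alive_while_referenced, alive_after_clear)
```

The collector is doing its job the whole time. The object only goes away once the last reference to it does, which is why hunting this kind of leak mostly means hunting for whoever is still holding on.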
That approach didn’t work this time. I scoured the code for the place where something was being kept that wasn’t needed, and I did in fact find some places where everything was not perfect. But nothing that would explain the hugeness of the problem we were seeing. So, I started using some more exotic tools to look for the problem, which told me some pretty weird things: The server wasn’t actually using all that much memory. It was just reserving huge amounts of memory for some reason. Usually, you’d expect the memory usage of a .Net process to look something like this:
- Objects : <Big Number> bytes of Memory
- Free : <small number> bytes of Memory
Most of the memory should be in use, and some small part of memory should be free and ready for new allocations. ‘Free’ meaning that the runtime has asked the OS for the memory, but isn’t actually using it yet. But what we found here was more like this:
- Objects : <Big Number> bytes of memory
- Free : <HUUUUUGGE NUMBER> bytes of memory
So, the number of objects was fine. We weren’t keeping stuff around that we didn’t need, but the garbage collector seemed to be fucking up somehow. How does that happen? There are two possibilities:
- Either the folks at MS who wrote that Garbage collector are utter nincompoops, and the fact that the GC works on millions of computers around the world every second is just luck.
- Or maybe we were screwing something up in an unusual way.
Option 2 being far more likely, that’s the one we went with. After a bit of further digging the term ‘Memory Fragmentation’ came up. In order to understand what that even means, we’re going to have to understand how the Garbage collector works.
As a .Net process runs, it allocates a lot of objects. Every little number or string or other class instance gets created and shoved on top of the current pile of memory. Every now and then, the GC strolls by, finds objects that aren’t being used anymore and frees up that memory. So let’s say we’ve allocated objects A, B, C and D. Memory looks like this:
[A][B][C][D]
The GC comes by, notices that we’re not using B anymore and frees it. Now memory looks like this:
[A][ ][C][D]
That’s silly, so the GC compacts the memory, like, say, the mighty taloned foot of a Tyrannosaur squashing an unworthy editor <Hey! Cut it out! -editor>. So now memory looks like this:
[A][C][D][ ]
Which is fine, and the way we want it to be. But what if, for some reason, the GC wasn’t able to move an object? If the GC wasn’t allowed to move C, for example, then it couldn’t compact the memory, and the chunk of free memory would just stay there, pointlessly taking up space. That’s memory fragmentation.
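The A, B, C, D example can be played through with a toy model. This is a sketch in Python and emphatically not how the real .Net collector is implemented: the heap is just a list of slots, and `collect`, `render` and the `pinned` parameter are names invented for this illustration. The only rule it models is the one that matters here: surviving objects slide towards the start of the heap, except that a pinned object stays put and nothing slides past it.

```python
def collect(heap, live, pinned=frozenset()):
    """Toy sweep-and-compact pass over a list of slots.

    heap   -- list of object names, with None meaning a free slot
    live   -- names still in use; everything else is garbage
    pinned -- names the collector is not allowed to move
    Returns a new heap of the same length.
    """
    result = [None] * len(heap)
    next_free = 0  # lowest slot a movable survivor may slide into
    for index, obj in enumerate(heap):
        if obj is None or obj not in live:
            continue  # free slot or garbage: nothing to keep
        if obj in pinned:
            result[index] = obj  # must stay exactly where it is
            next_free = index + 1  # later objects can only pack in after it
        else:
            result[next_free] = obj  # slide down, preserving order
            next_free += 1
    return result


def render(heap):
    return "".join(f"[{obj or ' '}]" for obj in heap)


heap = ["A", "B", "C", "D"]
live = {"A", "C", "D"}  # B is no longer used

compacted = collect(heap, live)
fragmented = collect(heap, live, pinned={"C"})

print(render(compacted))   # [A][C][D][ ]
print(render(fragmented))  # [A][ ][C][D]
```

In the first run the free slot ends up at the end, where the next allocation can use it. In the second run the hole is trapped in front of C. One trapped slot is harmless, but repeat that often enough with enough pinned objects and the ‘Free’ number starts looking like the HUUUUUGGE one above.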
As it turns out, it IS possible to pin an object in memory, preventing the GC from moving it. Typically this happens because that chunk of memory is being accessed from outside the .Net world, by something that isn’t aware of the GC and would be rather WTF if the memory it was trying to use was suddenly moved to a different location. That something, in our case, turned out to be the Sockets API. (A note for any poor programmer who stumbles on this post because he googled “memory fragmentation”: You’re not using SocketAsyncEventArgs correctly. You might think you are, but you aren’t. Be told.)
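For that poor googling programmer, a sketch of the usual advice: allocate one big block of buffer memory up front and hand out fixed slices of it to your socket operations, so that whatever gets pinned sits together in one place instead of being sprinkled all over the heap. Microsoft’s own SocketAsyncEventArgs documentation shows a buffer manager along these lines. The version below is for illustration only and is not the H&G code. It is written in Python, which does not pin memory at all, so it only shows the layout idea, and `BufferPool`, `rent` and `give_back` are hypothetical names.

```python
class BufferPool:
    """Hands out fixed-size slices of one preallocated block."""

    def __init__(self, buffer_size, count):
        self._block = bytearray(buffer_size * count)  # the single big allocation
        self._view = memoryview(self._block)
        self._buffer_size = buffer_size
        self._free_offsets = [i * buffer_size for i in range(count)]

    def rent(self):
        """Return (offset, writable slice) for one unused buffer."""
        if not self._free_offsets:
            raise RuntimeError("buffer pool exhausted")
        offset = self._free_offsets.pop()
        return offset, self._view[offset:offset + self._buffer_size]

    def give_back(self, offset):
        """Mark the buffer starting at this offset as reusable."""
        self._free_offsets.append(offset)


pool = BufferPool(buffer_size=4096, count=3)

offset, buf = pool.rent()
buf[:5] = b"hello"       # writes straight into the shared block, no new allocation
pool.give_back(offset)

offset2, buf2 = pool.rent()  # the same slice gets reused for the next operation
```

The point of the design is that the number and location of buffers is fixed at startup. No matter how many connections come and go, the memory handed to the socket layer never moves and never multiplies.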
Once we had realized the nature of the problem, a rewrite of some central parts of the network code fixed it. And on the 22nd day, the Tyrannosaur rested, and all was well.
Now, if you’ll excuse me, I have a meeting with some interns that I’m going to