Today I will tell you a bit about the system we use to manage the server setup for Heroes & Generals.
A sample graph of the CPU load of one of the servers. You can see that the server load is higher from noon till midnight, but there is still load during the other hours, as players from all over the world are playing H&G 24/7.
We use an open source tool called Nagios to monitor the game servers. It’s free, widely used by server administrators all over the world, and can do a whole lot of things that are very useful when running a bunch of servers like we do. It can send warnings and text messages to administrators, show service load, system I/O, network info and other stuff… even monitor the water temperature in some of the fish tanks we have at the office. :o)
I love tools that work like that. It gives me the freedom to choose what we monitor: hardware and/or services such as web services, memory load on a box, the server temperature in a rack in the US, etc.
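To give an idea of why Nagios can monitor almost anything: a check is just a small program that prints one status line and reports its result as an exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). Below is a minimal sketch of such a check in Python. It is not one of our actual checks; the names `evaluate`, `format_line` and `main`, the `LOAD` label and the threshold values are all made up for illustration.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a Nagios state (a higher value is worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_line(name, value, warn, crit):
    """Build the status line Nagios reads; performance data follows the '|'."""
    state = evaluate(value, warn, crit)
    if state == UNKNOWN:
        return state, f"{name} UNKNOWN - no reading"
    text = f"{name} {LABELS[state]} - {value}"
    perfdata = f"{name.lower()}={value};{warn};{crit}"
    return state, f"{text} | {perfdata}"


def main():
    # Hypothetical reading; a real check would query the OS or a sensor here.
    state, line = format_line("LOAD", 3.2, warn=4, crit=8)
    print(line)
    return state  # a real plugin hands this to sys.exit()
```

Saved as a script that ends with `sys.exit(main())`, something like this could be hooked into Nagios as a command. The part after the `|` is the performance data that graphing add-ons can pick up.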
In this setup it is possible to control how important each server is and to raise alarms accordingly. Do I need to fix a problem right now, day or night, or can it wait until the next day? For instance, low disk space might not be urgent on some of the servers, so it is not prioritized as high as a total crash of another server.
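The "right now or tomorrow?" decision can be sketched in a few lines of Python. In Nagios itself this kind of rule lives in the configuration (for example in notification periods), so the code below only illustrates the logic. The importance levels, the problem names and the 09:00 to 17:00 office hours are assumptions made up for this example.

```python
from datetime import datetime

# Hypothetical set of problems that justify waking somebody up.
URGENT_PROBLEMS = {"server_down"}


def should_alert_now(importance, problem, now):
    """Return True if an administrator should be alerted immediately."""
    if importance == "high" and problem in URGENT_PROBLEMS:
        return True  # alert at once, day or night
    # Everything else waits for office hours (assumed 09:00-17:00, Mon-Fri).
    return now.weekday() < 5 and 9 <= now.hour < 17
```

With these rules, `should_alert_now("high", "server_down", datetime.now())` is always true, while a low disk space warning on a less important box only reaches somebody during the working day.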
We create graphs that show the history of servers and services. By looking at these graphs, I can tune the setup over time to handle more game sessions per server. And if some of the servers are having problems handling certain requests, we are ready to take action right away, before it becomes a problem for you.
At least this is our goal. While we’re in closed beta, occasional bugs also sneak into the setup of the Nagios system, but we’re getting better and better at handling them. :)