One of the things that I see cropping up pretty often is this thing called the “working set”. After recently chipping in on another StackOverflow question on the subject of “What does it mean to fit ‘working set’ in RAM?”, I thought it was a good subject for a blog post. This is really just a copy and extension of my input on that question; it focuses in parts on MongoDB, but is just as relevant to other databases.
“Working set” is basically the amount of data and indexes that will be active/in use by your system at any given time.
Why is it important to keep your working set in RAM?
Accessing RAM is quick. Accessing disk is slow. When querying your data store, if all the data and indexes typically accessed are in RAM, then performance is blisteringly quick. If it’s not in RAM, then disk access is required and that is when performance suffers. Hence it is important to ensure you have enough RAM to hold your working set. The moment your working set exceeds the amount of RAM you have, you will start to notice the performance degradation as the system has to pull data back off disk, so it’s important to monitor the situation and react.
Crude Example
Suppose you have 1 year’s worth of data. For simplicity, each month relates to 1GB of data giving 12GB in total, and to cover each month’s worth of data you have 1GB worth of indexes again totalling 12GB for the year.
If you are always accessing the last 12 months’ worth of data, then your working set is: 12GB (data) + 12GB (indexes) = 24GB.
However, if you actually only access the last 3 months’ worth of data, then your working set is: 3GB (data) + 3GB (indexes) = 6GB.
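To make that arithmetic concrete, here’s a tiny Python sketch of the same back-of-envelope calculation. The 1GB-per-month figures are just the made-up numbers from the example above, not measurements:

```python
# Back-of-envelope working set estimate for the crude example above.
# The 1GB-per-month figures are the example's made-up numbers.
DATA_PER_MONTH_GB = 1
INDEX_PER_MONTH_GB = 1

def working_set_gb(months_actively_accessed):
    """Working set = actively accessed data + the indexes covering it."""
    return months_actively_accessed * (DATA_PER_MONTH_GB + INDEX_PER_MONTH_GB)

print(working_set_gb(12))  # 24 GB if the whole year is hot
print(working_set_gb(3))   # 6 GB if only the last quarter is hot
```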
You need to understand your data set, scenario and usage patterns in order to work out a ballpark estimate of your working set. Don’t expect a black-and-white answer about what your working set is in your environment from someone who doesn’t know these things.
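If you’re on MongoDB, one useful input is the total data and index size of a database: it isn’t the working set itself, but it is a ceiling the working set can’t exceed. Here’s a minimal sketch using pymongo and the dbStats command; the connection string and the “mydb” database name are just placeholders for illustration:

```python
from pymongo import MongoClient

# Placeholders: point these at your own deployment and database.
client = MongoClient("mongodb://localhost:27017")
stats = client["mydb"].command("dbstats")

data_gb = stats["dataSize"] / 1024 ** 3
index_gb = stats["indexSize"] / 1024 ** 3
print(f"data: {data_gb:.2f}GB, indexes: {index_gb:.2f}GB, "
      f"upper bound on working set: {data_gb + index_gb:.2f}GB")
```

From there, apply what you know about your access patterns (e.g. “only the last 3 months are ever touched”) to scale that ceiling down to a realistic estimate.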
What if my working set sky-rockets?
Add more RAM. This can be a case of adding more to your existing node(s). Or, if you need non-trivial increases, you can make use of sharding to split the data over a number of nodes and just bring more nodes online as you need them. This provides incredible potential to scale out your workload.
The key point is to ask yourself: do I have enough RAM for my working set? If the answer is: “I don’t know”, then get yourself to the position of knowing.
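A trivial first step towards “knowing” is simply comparing your estimate against the RAM on the box. The sketch below assumes Linux and an estimate you’ve already worked out; proper monitoring of page faults and resident memory over time is what tells you how things actually behave under load:

```python
import os

ESTIMATED_WORKING_SET_GB = 6  # plug in your own estimate

# Total physical RAM (Linux-specific sysconf names).
total_ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024 ** 3

print(f"RAM: {total_ram_gb:.1f}GB, estimated working set: {ESTIMATED_WORKING_SET_GB}GB")
print("fits in RAM" if ESTIMATED_WORKING_SET_GB <= total_ram_gb else "does NOT fit in RAM")
```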