Scaling Mongo - Sharding

If you are interested in using MongoDB (Mongo) you are probably also interested in scaling and performance. From a developer's perspective the rich feature set and ease of use make Mongo a good option. And when it comes to scaling Mongo covers distributed load and data via Sharding. But the question is what happens when things get really big. Perhaps you don't have time to watch Optimizing your Mongodb Dataset or perhaps you want a simple answer, here you go.

To begin, we must first understand the concept I will call Data Reachability (Reachability) and how it will be effected by Sharding. Reachability will change how Mongo can efficiently fetch all of your data for a query. If your query doesn't respect Reachability, then your Mongo will not be efficient. In order to achieve high performance, your queries and your data must work together. According to Optimizing your Mongodb Dataset if a minor 1% of your queries result in table scans the performance of your whole system will disproportionately more effected (because of the cascading effect of wasted resources). So imagine if your queries are causing this more frequently and across all the nodes of your cluster, bad news.

Achieving Reachability may not be as hard as you think. I have compiled a few things you can do which will constrain your system while still allowing the use of many Mongo features:

  • Only query data collections on indexed fields find({_id:...})
  • ensureIndex({_id:”hashed”}) to cause data to be distributed evenly across your cluster
  • Create meta records in a non-shared (or other wise logically partitioned) collection which can be queried to retrieve the _id(s) for your data collections.
    • If you query frequently by day you could ensureIndex({“dateInt”:”hashed”}) and user that as your shard key.
    • Thus your metadata will all be reachable, grouped, yet distributed.
  • To use this structure:
    • Query the appropriate meta collection using the appropriate index fields
    • Issue queries against the data collection using the _id values returned by the meta records

Perhaps your data all fits on one server or this seems too cumbersome. Either you aren't really big yet and need to plan ahead or you need to become comfortable with a little more structure. Even though Mongo will let you get away with less structure, the demands of a truly large system will not.

No Comments Yet.

Leave a comment