Friday, 25 February 2011

Thoughts on MongoDB from a SQL Server Dev

As someone with a background in SQL Server development dating back to 2000, I've had the whole NoSQL jazz on the radar for a while, but it never made it under the microscope, so to speak. Partly because SQL Server is my comfort zone and there's still plenty to learn there; partly because I'd had my career mapped out in my head and NoSQL didn't feature much in that vision; and partly because, until something real-world drives a push into a technology, it tends to remain a distant dot on the radar.

Back at QCon London last year, I had the opportunity to hear a bit more about the NoSQL world from some great speakers - I think the CAP theorem cropped up in most sessions I attended. The dot on the radar moved a bit closer.

Back To The Future
Fast forward a bit and the real-world driver to take a look into NoSQL technologies appeared. So for a few months, I've spent some time doing research - seeing what the different options out there are, what they offer, what they don't, what compromises you have to make etc etc. Cassandra, CouchDB, MongoDB, HBase, RavenDB...to name just a few. Now I'm not going to go into a comparison of each of those - that in itself would be a whole blog post. If you ask me which is the "best"...I'll say "It Depends". What I do know, is which one feels like a good fit for my environment.

Cinderella's Glass Slipper
MongoDB appeared as a real potential candidate pretty early on and after a bit more of a deep dive investigation into it, it became more and more obvious that it was just the right fit for certain use cases compared to the other technologies. This is not me saying "MongoDB is better than Cassandra" or "I see your RavenDB, and I raise you a MongoDB". Just that in my current scenario MongoDB feels like the best fit.

NoSQL with a capital NOT ONLY
The whole "NoSQL" term has been done to death. Yes, it's not the best term. Happily, from my experience, the majority of what I've seen has been around the principle of "Not Only SQL". And this is very much where I sit. It does not replace SQL Server; it adds value for a number of use cases and for a number of reasons. Choosing the right tool for the right job.

The point of this post is really just to summarise some of the likes and dislikes I have as someone with a heavy background in SQL Server development - what I think is pretty cool, and what is not so cool.

Thumbs up
In no particular order...
  • Cost
    It's free, open source. Can haz more scale? Just add hardware. Licensing costs need not apply - and it runs happily on Linux, so no OS licence bills either.
  • Schema-less
    If you need to support a flexible schema, MongoDB's document storage is a big plus. It doesn't mean you don't need to think about schema at all; it just means you can easily model data of this nature and cope with changes without headaches (there's a quick shell sketch after this list).
  • Quick start & fast learning
    Getting started with MongoDB was quick and easy, with virtually no barrier to entry, and picking up the more advanced stuff by hacking around was a pretty painless exercise too. Within a relatively short time I was able to start answering MongoDB questions on StackOverflow. Using the C# driver has been a largely positive and intuitive experience.
  • Replica sets
    Configuring is simple, making scaling reads and failover pretty effortless. Want more redundancy or read scaling? Fire up another machine, add it to the set and away you go - there's a shell sketch after this list. You do need to give careful thought to the oplog size, though.
  • Auto Sharding
    Again, configuring is simple, but you do need to give very careful consideration up front to which keys you want to shard on. Once you've done that, sharding just does its stuff.
  • Community
    It has a good community behind it and that IMHO is very important. I don't like sitting in a cupboard on my own with the lights off. I like being a part of a bigger community - to learn from, work through issues with and to contribute back to.
  • Rapidly evolving
    MongoDB is rapidly changing, and it's good to see bugs being tracked and fixed in good time. There's also a fast-flowing feature enhancement pipeline, so you typically don't have to wait long for something.
  • Choose your consistency
    If you want a stronger level of consistency, you can choose to have data replicated to a configurable number of replicas before a write returns. It depends on what value you put on particular bits of data, but the choice is yours: you can trade off performance for consistency.
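
As a quick illustration of the schema-less point, here's a rough sketch from the mongo shell (collection and field names are made up for the example):

// Two documents in the same collection with different shapes - no ALTER TABLE required
db.people.insert({ firstName: "Ann", lastName: "Smith" });
db.people.insert({ firstName: "Bob", lastName: "Jones", nicknames: ["Bobby", "Jonesy"] });

// Queries work across both shapes; documents without the field simply don't match
db.people.find({ nicknames: "Bobby" });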
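
And the replica set, sharding and consistency points look something like the following from the shell - a minimal sketch with made-up host names and a database called "shop", so check the docs for your version before copying:

// Initiate a 3-member replica set (run on one member)
rs.initiate({
    _id: "ExampleSet1",
    members: [
        { _id: 0, host: "mongo1:27017" },
        { _id: 1, host: "mongo2:27017" },
        { _id: 2, host: "mongo3:27017" }
    ]
});

// Want more redundancy or read scaling later? Just add another member
rs.add("mongo4:27017");

// Sharding: enable for a database, then shard a collection on your chosen key
// (run via a mongos)
var admin = db.getSisterDB("admin");
admin.runCommand({ enablesharding: "shop" });
admin.runCommand({ shardcollection: "shop.orders", key: { customerId: 1 } });

// Choose your consistency: block after a write until it has reached
// at least 2 members, or 5 seconds pass
db.orders.insert({ customerId: 123, total: 9.99 });
db.getLastError(2, 5000);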

Thumbs down
  • Rapidly evolving
    OK, so I listed this as a "thumbs up", but it can also be a negative. You'll often find older (but not "old") blog posts/answers/articles on a given matter that are no longer true because things have moved on since they were written. It can be frustrating trying to pin down the current, correct information when the product changes this quickly.
  • Information / Documentation
    The level of information on MongoDB pales in comparison to what there is for SQL Server. An obvious point maybe - but when you're used to MSDN, all the bloggers, the activity on StackOverflow/AskSSC and #sqlhelp, it can be quite a fall back down to Earth.
  • Map/Reduce performance
    Doesn't seem to be great (a simple example follows this list). There are other options if you need more performance, such as Hadoop, but that adds complexity and another technology into the mix.
  • Tooling
    There isn't the wealth of tooling that there is for SQL Server. Hopefully this will change over time.
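
For context on the map/reduce point, here's what a simple "count orders per customer" looks like in the shell (collection and field names made up; as I understand it, the server runs the JavaScript single-threaded, which is a big part of the performance story):

var map = function () { emit(this.customerId, 1); };
var reduce = function (key, values) {
    var total = 0;
    values.forEach(function (v) { total += v; });
    return total;
};

// Results are written to a collection named "order_counts"
db.orders.mapReduce(map, reduce, { out: "order_counts" });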

I'm sure I've missed a few things, but I think I've covered the main points. My brain is eventually consistent, so of course I'll update if I remember any more of note :)

Summary
Of course, using a NoSQL technology like MongoDB involves trade-offs and a different mindset vs a traditional RDBMS. The important thing for me is that it gives another option - another feather in my CAP (theorem, *badoomtish*). I don't see it as a replacement for SQL Server, because it definitely isn't; it's just another tool that can be used to achieve an end goal. I'm looking forward to having the two side by side in harmony.

Hopefully, this gives a bit of insight into MongoDB from the point of view of someone coming from a SQL Server background.

You can find my blog posts on MongoDB thus far here.

Thursday, 24 February 2011

MongoDB 1.8.0 Release Candidate is out

MongoDB 1.8.0-rc0 is now available; you can download it from here and find the release notes here.

The main changes over the current 1.6 production release relate to:
  • durability, a much talked about topic (!), with support now for write-ahead journaling (docs)
  • sparse and covered indexes (on my radar to try out - see the shell sketch below)
  • support for more Map/Reduce functionality (note there are some breaking changes in this area)
The full info on the changes is in the Release Notes document.
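
From my reading of the release notes, the index changes look something like this in the shell (I've not tested against the RC yet, so treat it as a sketch):

// Sparse index: only documents that actually contain the field are indexed
db.people.ensureIndex({ nickname: 1 }, { sparse: true });

// Covered query: everything requested lives in the index, so no document fetch.
// Note _id has to be excluded from the results, as it isn't in the index.
db.people.ensureIndex({ lastName: 1, firstName: 1 });
db.people.find({ lastName: "Smith" }, { _id: 0, lastName: 1, firstName: 1 });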

Wednesday, 23 February 2011

MongoUK 2011 - Upcoming MongoDB event

There's a one-day conference on MongoDB coming up, being held at the Skills Matter eXchange in London on Monday 21st March 2011. If you're interested in MongoDB, or even just curious as to what it's all about, it's worth registering based on what's on the agenda - it's going to be tough to choose between the sessions, which include ones on:
  • schema design
  • storage engine internals
  • indexing and query optimiser
  • scalability
  • map/reduce & geo indexing
It will also be a good chance to hear from speakers outside of 10gen such as Thoughtworks, Guardian and Boxed Ice. So I for one am looking forward to it.

For full details and to register, see the MongoUK 2011 page.

Update 25/02/2011:

Also check out the SkillsMatter event page.

Monday, 21 February 2011

SQLSoton UserGroup Room Change from March

Due to the success of the SQL Southampton User Group and based on the feedback/opinions of those who have attended, the room the group is held in is changing. It's still in the same venue, Avenue St Andrews United Reformed Church, but instead of being in the Upper Room as per the last few meets, it will be in St Andrew's Hall. This gives a bigger, more flexible room.

The now-famous #SQLSoton signs will be in place to point the way as usual - still round via the back car park, but through a different entrance now.

Green marker = new entrance.
Red marker = old entrance.


(Embedded map: View SQLSoton UG in a larger map)

The next meet is on Wednesday 9th March, with Gavin Payne (Blog | Twitter) coming down to talk replication so if you haven't registered yet, head over to SQLServerFAQ to see the details and register.

See you there!

MongoDB replication - oplogSize

Background
An oplog is a write operation log that MongoDB uses to store data modifications that need to be replicated out to the other nodes in the configured set. The oplog is a capped collection, which means it will never grow beyond a certain point - once it reaches its maximum size, old data drops off the end as new data is added, cycling round continuously. The size of the oplog essentially determines how long a secondary node can be down and still be able to catch up when it comes back online: the bigger the oplog, the longer an outage you can cope with, as the oplog can hold more operations. If a secondary node is down for too long and the operations required to bring it back up to date are no longer available in the master's oplog, you'll start seeing "Error RS102". See "Resynching a Very Stale Replica Set Member".

oplogSize
Configuring the size of the oplog is an important decision to make up front, so give it careful consideration, allowing for future write loads. You can set it via the --oplogSize command line argument for mongod (given in MB).

e.g. to set 10GB oplog size:
mongod --oplogSize 10240 --replSet ExampleSet1

Personally, I'd err well on the side of caution and set it larger rather than smaller, to minimise the risk of the oplog not being big enough for secondary nodes to catch up after falling behind or being offline for a longer period.
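
Once a set is up, you can sanity-check the oplog size you actually got, and the time window it currently covers, from the mongo shell (the output below just illustrates the shape, not real values):

db.printReplicationInfo();
// configured oplog size:   10240MB
// log length start to end: 86400secs (24hrs)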

1.6.5 Bug
Setting this argument appears not to work on Windows 7 64-bit in 1.6.5 (not sure about Linux) - you end up with a seemingly random size being created instead. You may also encounter an error depending on the value you specify; multiples of 2048 appear to produce the following:
Assertion failure cmdLine.oplogSize > 0 db \db.cpp 962
Due to the nature of the bug, it appears the maximum size you could end up with is 2047MB in that version. A bug case has been raised here; however, this has been fixed in 1.7.x (not a production release at the time of writing) and I can confirm it now works as expected.

Update 23/02/2011:
Kristina Chodorow (Twitter | Blog) has written a blog post on "Resizing Your Oplog" - well worth a read if you're in the boat where you have a system already up and running and realise you need a bigger oplog. You'll need 1.7 for this.

Sunday, 13 February 2011

OS CodePoint Data Geography Load Update

Following on from my previous post on loading the Ordnance Survey Code-Point data for GB post codes to SQL Server and converting to the GEOGRAPHY data type, I've made a few tweaks to the importer app that is up on GitHub:
  1. The schema of the SQL Server table generated has changed - postcodes are now split into 2 distinct columns: OutwardCode and InwardCode
  2. The importer now calculates a basic average for each postcode district (e.g. AB12, aka the OutwardCode) and sector (e.g. AB12 3). This provides a central geographical position for each district/sector allowing less granular usage.
I've also started paving the way for it to import other Ordnance Survey datasets, such as boundary data, if and when I get time. It turns out the boundary data is a lot less straightforward. For example, loading the shape data as supplied does not work for all boundaries, as some items form invalid GEOGRAPHY instances when loaded into SQL Server. Correcting those items is possible via various means, but it's not top of my priority list at the moment.

Monday, 7 February 2011

MongoDB - Does My Data Look Big In This?

You have an existing relational database containing x amount of data, and you decide to migrate that data, for whatever reason, into MongoDB. You may have a preconceived belief that, as your relational database is x GB in size, your NoSQL database will end up round about x GB too - after all, you're loading exactly the same data from one to the other, right?

Wrong

For simplicity's sake, assume you are doing a straight load of the data, creating one document in MongoDB for each row in your relational database, with each column value represented as a key-value pair in the document. What you need to take into account is that:
  1. each document you insert into MongoDB will be assigned an ObjectId (_id) if you don't supply one yourself - see the shell snippet after this list. This is a 12-byte value consisting of 4 parts:
    • timestamp (4 bytes)
    • machine identifier (3 bytes)
    • process id (2 bytes)
    • increment (3 bytes)
  2. in MongoDB, each document contains not only the values but also the key names (~"column names" in relational db-speak). So if you have the keys "FirstName" and "LastName", those names alone add 17 bytes to every document.
  3. MongoDB automatically adds some padding to each document, to reduce the need for a document to be moved around on disk if it later grows in size
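
As a quick illustration of point 1 in the shell (the getTimestamp() helper may need a reasonably recent shell build):

var id = ObjectId();   // 12 bytes: timestamp, machine id, process id, increment
id.getTimestamp();     // the creation time, recovered from the first 4 bytes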

Considerations

It's worth considering using shortened key names in your documents, as the space saving can quickly mount up. And if a column doesn't have a value in your relational database, consider not including that item in the corresponding document at all. The shell sketch below shows the effect of both.
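
The shell's Object.bsonsize() function is a handy way to see the effect of both suggestions - the documents below are purely illustrative:

// Full key names vs shortened key names
Object.bsonsize({ FirstName: "Ann", LastName: "Smith" });
Object.bsonsize({ fn: "Ann", ln: "Smith" });   // same data, noticeably smaller

// A NULL column needn't become a key at all - just leave it out
Object.bsonsize({ fn: "Ann", ln: "Smith", middleName: null });
Object.bsonsize({ fn: "Ann", ln: "Smith" });   // smaller again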

SQL Server vs. MongoDB

As an example, I loaded approximately 105 million rows of data from a table in SQL Server into a collection in MongoDB. The resulting data sizes were:
  • SQL Server : 14060MB
  • MongoDB : 19629MB (total storage size inc. allocated but unused space: 22570MB)
That's ~39% more space being used for the same data in MongoDB (based on data size stats, with ObjectIDs being assigned by MongoDB).

After switching to 2-character coded key names instead of the full-length names, this dropped to 15439MB (18266MB total storage size) - about 10% more space than SQL Server. Had I had columns without values in SQL Server, ensuring those weren't included as items in the MongoDB documents would have saved more space.