Thursday, 15 September 2011

OS CodePoint Data Geography Update Sept 2011

History:
GB Post Code Geographic Data Load to SQL Server using .NET
OS CodePoint Data Geography Load Update

Following a comment on my original post listed above, it appears the structure of the CodePoint data file has changed, which means the importer will fail when attempting to import the latest data files supplied by Ordnance Survey because the column layout is different.

I've pushed an update to the project on GitHub to address this - it now takes a more sensible approach, so the column index positions are no longer hard-coded. Instead, the CodePoint importer takes an extra argument: the path to the CSV file that contains the column header listing, as supplied in the supporting docs in the download. The column indices we are interested in (for Easting, Northing and PostCode) are worked out from this. That makes it a bit more future-proof - if the structure changes again, the importer should pick up on it automatically.
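To illustrate the idea (this is a sketch only, not the actual importer code - the class name, method names and the exact column headings "Postcode", "Eastings" and "Northings" are assumptions here), resolving the indices from the header CSV boils down to something like this:

using System;
using System.IO;
using System.Linq;

// Sketch only: works out the column positions from the header CSV supplied
// in the Code-Point download's supporting docs. Heading names are assumed.
public class CodePointColumnIndexes
{
    public int PostCode { get; private set; }
    public int Easting { get; private set; }
    public int Northing { get; private set; }

    public static CodePointColumnIndexes FromHeaderFile(string headerCsvPath)
    {
        // First row of the header file lists the column headings in file order
        string[] headers = File.ReadLines(headerCsvPath)
            .First()
            .Split(',')
            .Select(h => h.Trim().Trim('"'))
            .ToArray();

        return new CodePointColumnIndexes
        {
            PostCode = IndexOf(headers, "Postcode"),
            Easting = IndexOf(headers, "Eastings"),
            Northing = IndexOf(headers, "Northings")
        };
    }

    private static int IndexOf(string[] headers, string name)
    {
        int idx = Array.FindIndex(headers,
            h => h.Equals(name, StringComparison.OrdinalIgnoreCase));
        if (idx < 0)
            throw new InvalidOperationException("Column not found in header file: " + name);
        return idx;
    }
}

If Ordnance Survey reorder or add columns again, the indices simply move with the headings rather than needing a code change.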

This is an update to the CodePoint import only at the moment as I haven't had time to check the Scale Gazetteer import as yet. See the project/readme on GitHub for more info.

Thursday, 18 August 2011

Getting Started with Riak and .NET

Earlier in the year, I started playing around with MongoDB using .NET and wrote up a Getting Started guide. Now it's Riak's turn.

Nutshell
Riak is a distributed, schemaless, data-type agnostic key-value "NoSQL" database written primarily in Erlang.

Check out the "What is Riak?" page on Basho's wiki.

Riak does not run on Windows, so you'll need to pick a supported OS to install it on. Having used Ubuntu before for MongoDB, I went down that route, so my notes here are oriented that way. So, after getting an Ubuntu VM set up...

Step 1
Install the latest custom .deb package, as per the Basho guide.

Step 2
Before you start Riak up, you need to tweak the configuration in Riak's app.config file. Follow the Basic Cluster Setup guide, but note that when you update the IP address for the HTTP interface per those instructions, also make the same change to the pb_ip config value - this is for the Protocol Buffers interface. For performance, I wanted to use the Protocol Buffers interface instead of the HTTP REST API and had some initial problems because pb_ip was still set to 127.0.0.1.

You then need to edit vm.args, per the setup guide.

Step 3
Now you're ready to fire up Riak. Run:
$ riak start
Then check it's up and running using:
$ riak ping
If all is well, it will respond with "pong".

Step 4
Now, back in the comfort of Windows (my Linux-fu is seriously lacking), we're ready to get going. There is no official .NET client library at the time of writing, but Basho list a few community-contributed projects. When I first looked, only two were listed and they appeared to be inactive, with no updates since last year. It was a bit disappointing to see the lack of support for those of us in the .NET world. I then discovered CorrugatedIron, very hot off the press, under development by OJ Reeves (Twitter | Blog) and Jeremiah Peschka (Twitter | Blog).

The simplest way to get everything you need to get going is via NuGet.

Nu-what?

NuGet. It's a Visual Studio extension that you can download from here. Once installed, fire up VS and create a new .NET 4.0 Framework project. Then in the Solution Explorer, right click the project and select "Manage NuGet Packages...".


In the dialog, search online packages for: CorrugatedIron.



Click Install, and it will download all the assemblies/dependencies you need. When it's finished, you'll see you have references to CorrugatedIron, Newtonsoft.Json and protobuf-net.

Open up app.config and you'll see some initial example config for CorrugatedIron. Configure it to point at your running Riak node. For example, my Riak node is running on a VM called "Riak1", with the REST interface listening on port 8098 and the Protocol Buffers interface listening on port 8087:
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <configSections>
    <section name="riakConfig" type="CorrugatedIron.Config.RiakClusterConfiguration, CorrugatedIron" />
  </configSections>
  <riakConfig nodePollTime="5000" defaultRetryWaitTime="200" defaultRetryCount="3">
    <nodes>
      <node name="Riak1" hostAddress="Riak1" pbcPort="8087" restScheme="http" restPort="8098" poolSize="20" />      
    </nodes>
  </riakConfig>
</configuration>

Step 5
A quick check to make sure you can communicate with your Riak node:
static void Main(string[] args)
{
    IRiakCluster cluster = RiakCluster.FromConfig("riakConfig");
    IRiakClient client = cluster.CreateClient();
    RiakResult result = client.Ping();
    Console.WriteLine(result.IsSuccess);
}

Fingers crossed, and you will get a successful response! From here, you can start playing around with Riak from .NET and find your way round the client.
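As a taster of where to go next, a basic store-and-fetch looked roughly like this with the build of CorrugatedIron I was playing with - treat the RiakObject constructor and method shapes as assumptions (the API was moving quickly at the time), and the bucket/key/value here are made up:

// Rough sketch against an early CorrugatedIron build - API details may have changed.
IRiakCluster cluster = RiakCluster.FromConfig("riakConfig");
IRiakClient client = cluster.CreateClient();

// Store a value in bucket "people" under key "ada" (names made up for the example)
var putResult = client.Put(new RiakObject("people", "ada", "{\"name\":\"Ada\"}"));
Console.WriteLine("Put succeeded: " + putResult.IsSuccess);

// Fetch it back
var getResult = client.Get("people", "ada");
Console.WriteLine("Get succeeded: " + getResult.IsSuccess);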

CorrugatedIron is still very new, so keep an eye on the site and/or on the GitHub repository. Give it a whirl, and be sure to let the guys know how you get on - I'm sure they'd be keen to hear from those using it and the project deserves support from those of us in the .NET world.

Useful tool
It's useful to have some form of GUI over Riak so you can actually "see stuff". I've been using Rekon, which I found useful to get up and running with Riak. See the download/install instructions on its GitHub repository page.

Hopefully, this post will help anyone else wanting to give Riak a try from a .NET perspective!

Monday, 27 June 2011

sp_executesql change between 2005 and 2008

Today I tripped over what turned out to be a difference in the way sp_executesql behaves between SQL Server 2005 and 2008 when executing a string containing a parameterised stored procedure call.

Take this simplified example:
DECLARE @SQL NVARCHAR(256)
SET @SQL = 'sp_help @obj'
EXECUTE sp_executesql @SQL, N'@obj NVARCHAR(100)', 'sp_help'
In SQL Server 2008 (10.0.4000.0), the above executes successfully.
In SQL Server 2005 (9.00.1399.06), it throws the following exception:
Msg 102, Level 15, State 1, Line 1
Incorrect syntax near 'sp_help'.
Adding the "EXEC(UTE)" before the stored procedure name in @SQL resolves the issue in 2005. As standard, I (usually) standardise on ensuring all stored procedure calls are made with "EXEC(UTE)" even when it is the only statement in the batch. Obviously, in this case it was overlooked and tripped me up!


Friday, 24 June 2011

The importance of "Working Set"

One of the things that I see cropping up pretty often is this thing called "working set". After recently chipping in on another StackOverflow question on the subject of "What does it mean to fit 'working set' in RAM?", I thought it was a good subject for a blog post. This is really just a copy and extension of my input on that question; it's focused in parts on MongoDB, but is just as relevant to other databases.

"Working set" is basically the amount of data and indexes that will be active/in use by your system at any given time.

Why is it important to keep your working set in RAM?

Accessing RAM is quick. Accessing disk is slow. When querying your data store, if all the data and indexes typically accessed are in RAM, then performance is blisteringly quick. If they're not in RAM, then disk access is required and that is when performance suffers. Hence it is important to ensure you have enough RAM to hold your working set. The moment your working set exceeds the amount of RAM you have, you will start to notice the performance degradation as data has to be pulled back off disk, so it's important to monitor the situation and react.

Crude Example

Suppose you have 1 year's worth of data. For simplicity, each month relates to 1GB of data giving 12GB in total, and to cover each month's worth of data you have 1GB worth of indexes again totalling 12GB for the year.

If you are always accessing the last 12 months' worth of data, then your working set is: 12GB (data) + 12GB (indexes) = 24GB.

However, if you actually only access the last 3 months' worth of data, then your working set is: 3GB (data) + 3GB (indexes) = 6GB.

You need to understand your data set, your scenario and the usage patterns in order to work out a ballpark estimate of your working set. Don't expect a black and white answer for what your working set is in your environment from someone who doesn't know these things.

What if my working set sky-rockets?

Add more RAM. This can be a case of adding more into your existing node(s). Or, if you need non-trivial increases, make use of sharding to split the data over a number of nodes, and just bring more nodes online as you need them. This provides incredible potential to scale out your workload.

The key point is to ask yourself: do I have enough RAM for my working set? If the answer is: "I don't know", then get yourself to the position of knowing.

Thursday, 23 June 2011

Upcoming SQLSoton UG - 6th July 2011

The next SQLSoton (twitter) UserGroup meet is coming up on Wednesday 6th July. Matt Whitfield (twitter | blog) will be presenting some "notes from the field" and Alex Whittles (twitter | blog) will be giving a talk on automating cube documentation with SSRS, DMV & spatial data.

Don't forget to register!

Oh, and did I forget to mention? There's an XBox up for grabs. See you there? Thought so :)

Wednesday, 22 June 2011

If you can't make it, untake it

There are some great free events out there; SQLBits community day and the upcoming SQL In The City day by Red Gate are just two examples. With the high demand for spaces at events like these, it is well worth getting in there ASAP to reserve your space before they all go and you end up on a waiting list.

But if you find you can no longer make it, make someone else's day by cancelling your place so that it can be freed up for those on the waiting list. You may not personally be losing out by not taking the time to cancel your place, but someone else will be.

So...."if you can't make it, untake it"

P.S. Yes, I know, not my greatest blog title ever...

Tuesday, 3 May 2011

MongoDB ASP.NET Session State Store Provider

Please note: v1.1.0 of this Session State Provider is now available. See blog post: http://www.adathedev.co.uk/2013/03/mongodb-aspnet-session-store-provider.html
I've pushed up an initial drop of a custom ASP.NET session state store, backed by MongoDB, to my GitHub repository. After a quick test it seems OK, though it could do with a real hammering before being deemed production worthy! So if you want to use it, just make sure you give it a good test first and, most importantly, let me know how it goes! It's based on the sample custom session state provider in this MSDN article, basically just with the MongoDB-specific bits swapped in.

Session state is stored in a "Sessions" collection within a "SessionState" database. Example session document:
{
    "_id" : "bh54lskss4ycwpreet21dr1h",
    "ApplicationName" : "/",
    "Created" : ISODate("2011-04-29T21:41:41.953Z"),
    "Expires" : ISODate("2011-04-29T22:01:41.953Z"),
    "LockDate" : ISODate("2011-04-29T21:42:02.016Z"),
    "LockId" : 1,
    "Timeout" : 20,
    "Locked" : true,
    "SessionItems" : "AQAAAP////8EVGVzdAgAAAABBkFkcmlhbg==",
    "Flags" : 0
}

In line with the MSDN reference ("ODBCSessionStateStore" changed appropriately to "MongoSessionStateStore"):
If the provider encounters an exception when working with the data source, it writes the details of the exception to the Application Event Log instead of returning the exception to the ASP.NET application. This is done as a security measure to avoid private information about the data source from being exposed in the ASP.NET application.

The sample provider specifies an event Source property value of "MongoSessionStateStore". Before your ASP.NET application will be able to write to the Application Event Log successfully, you will need to create the following registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Eventlog\Application\MongoSessionStateStore

If you do not want the sample provider to write exceptions to the event log, then you can set the custom writeExceptionsToEventLog attribute to false in the Web.config file.

The session-state store provider does not support the Session_OnEnd event, and it does not automatically clean up expired session-item data. You should have a job that periodically deletes expired session information (i.e. where the Expires date is in the past) from the data store, e.g.:
db.Sessions.remove({"Expires" : {$lt : new Date() }})
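If you'd rather run that cleanup from .NET (say, a small scheduled console app) instead of the mongo shell, a minimal sketch using the 1.x C# driver might look like the following - the connection string is a placeholder, and the database/collection names match those used by the provider above:

using System;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

class SessionCleanup
{
    static void Main()
    {
        // Placeholder connection string - point this at your own server
        MongoServer server = MongoServer.Create("mongodb://localhost");
        MongoDatabase database = server.GetDatabase("SessionState");
        MongoCollection<BsonDocument> sessions = database.GetCollection<BsonDocument>("Sessions");

        // Remove any session document whose Expires date is already in the past
        sessions.Remove(Query.LT("Expires", DateTime.UtcNow));
    }
}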

Example web.config settings
<configuration>
  <connectionStrings>
    <add name="MongoSessionServices" connectionString="mongodb://localhost" />
  </connectionStrings>
  <system.web>
    <sessionState
        mode="Custom"
        customProvider="MongoSessionStateProvider">
      <providers>
        <add name="MongoSessionStateProvider"
             type="MongoSessionStateStore.MongoSessionStateStore"
             connectionStringName="MongoSessionServices"
             writeExceptionsToEventLog="false"
             fsync="false"
             replicasToWrite="0" />
      </providers>
    </sessionState>
  </system.web>
</configuration>

A few points to note, to give some control over fault tolerance and consistency:
  • if you want updates to the MongoDB store to be fsync'd before returning, set fsync="true"
  • if you have a replica set and want updates to be persisted to multiple nodes before returning, set the replicasToWrite setting to the appropriate number.
SafeMode is currently always enabled.

If you give it a whirl, please let me know how you get on.

Friday, 29 April 2011

Microsoft Community Contributor Award 2011

For a number of years now (since 2002 to be exact), I've been chipping in with my 2 pence on various online Q&A sites. As I mentioned in one of my first blog posts from just over a year ago, "Ways to improve technical skills", and again back in August in "The importance of being earnest", participating in sites like StackOverflow and the MSDN forums is a great way to keep you on your toes technically. I've learnt a lot from picking through other people's questions and the challenges they are facing, and attempting to provide some input. It gives exposure to real-world problems beyond the ones I face in my own job. I've found it really does broaden my horizons and helps me learn more about my favourite technologies - SQL Server, C# and, more recently, MongoDB have been the main focus for me. Not only that, but I actually really enjoy it.

Bearing in mind these reasons for me taking part in online communities like this and what I've personally got out of doing so, I found it quite humbling to have received the Microsoft Community Contributor Award 2011. It came as a surprise; I certainly wasn't expecting it. From what I understand, one way to be selected to receive the award is to be recommended by an MVP. If that was how I came to be selected (which I'm not sure it was), then my thanks go to whoever did that for me - it was a nice surprise to open my emails to!

Monday, 4 April 2011

5 Tips For Attending Conferences

With SQLBits 8 being this week in Brighton and it being set to be the biggest and best one yet, it got me thinking.

If I was a first timer to a conference like this, what should I expect?

So here's my top 5 tips:
  • Talk.
    It's easy to be daunted by a room full of people who all seem to be in little groups chatting away. You think everyone else there knows someone, except you. Wrong. Of course there are people that know each other, but there will be others who don't. Don't be afraid to try and strike up a conversation with people. See it as an investment - the more people you chat to, the more people you will know at future conferences. On twitter? Play "Tweep Bingo" - try and hunt down the people you follow - that way you've got an instant "foot in the door" so to speak (unless you ignore / don't read any of their tweets in which case why are you following them?!).
  • Listen.
    Well, wouldn't be worth talking if you didn't listen would it? The main point here is, when you're in a session, focus on listening. Don't try to keep notes of everything the speaker says because:
    a) you'll get cramp in your hand, and you'll get it bad (personal experience, QCon 2010)
    b) you'll be focusing on hearing and writing as opposed to listening and learning
    There is an amazing amount of content, let your brain try to sponge it up, not your notepad. That's not to say don't make any notes, just don't fall into the trap of trying to get everything.
    The videos will more than likely be posted online anyway for you to watch back at your leisure (they are for SQLBits).
  • Bearings.
    Find out where the rooms are that you want to attend sessions in, and make sure you know where you are going for your next session to give yourself time to get there. You don't want to be sitting in a room at the start of one session hearing about Squirrels when you actually wanted the talk on Meerkats on the other side of the building (no, I haven't been to an animal conference before, but yes, I have found myself in the wrong room - personal experience #2 from QCon 2010). Plus, it pays to be prepared to get to the very popular sessions early - I remember Conor Cunningham's session at SQLBits 6 and the room was packed to the rafters.
  • Etiquette.
    Switch off your phone or stick it in silent mode. Don't run the risk of annoying the rest of the room by having it go off mid-session.
  • Socialise.
    A lot of conferences have after parties or evening events. Go along to them if you can, they're a great way to meet up and have some banter with others in an often very different environment. There's often free food and drink to be had if that sways you!

Sunday, 27 March 2011

OS Data Importer Supports Scale Gazetteer Dataset

I've just pushed some updates to GitHub for the Ordnance Survey Data Importer .NET app I've been working on every now and then (see my previous posts: GB Post Code Geographic Data Load to SQL Server using .NET and OS CodePoint Data Geography Load Update).

Aside from a little re-jigging to potentially pave the way to load more Ordnance Survey data files to SQL Server in the future, it now supports importing the 1:50000 Scale Gazetteer data file to SQL Server (available to download from here). To quote the Ordnance Survey site:
The 1:50 000 Scale Gazetteer is a reference tool or location finder, similar to the index in a road atlas. It can be used as a simple list to find out relevant coordinates and six-figure grid references for a town or area.
At the time of this post this data file contains 259,113 locations taken from the OS Landranger Map series.

Example Use
To load the 1:50000 scale gazetteer data file that has been downloaded and extracted to C:\OSGazetteerData\50kgaz2010.txt into the dbo.Gazetteer table in MyDatabase on server SQLServerA, loading county codes into the County table and feature codes into the Feature table, run the following from the command prompt:
OSCodePointDataImport.exe GAZETTEER SQLServerA MyDatabase dbo Gazetteer County Feature "C:\OSGazetteerData\50kgaz2010.txt"
The Gazetteer database table will consist of the following columns:
SeqNo INTEGER PRIMARY KEY
PlaceName VARCHAR(60)
CountyCode CHAR(2) FK to County.Code
FeatureCode VARCHAR(3) FK to Feature.Code
Longitude FLOAT
Latitude FLOAT
GeoLocation GEOGRAPHY
The County lookup table will consist of the following columns:
Code CHAR(2) PRIMARY KEY
Name VARCHAR(60)
The Feature lookup table will consist of the following columns:
Code VARCHAR(3) PRIMARY KEY
Description VARCHAR(50)

The source is on GitHub here: https://github.com/AdaTheDev/Ordnance-Survey-Code-Point-Data-Importer

Friday, 18 March 2011

MongoDB Journaling Performance - Single Server Durability

Note: Relates to v1.8.0 release.

v1.8 of MongoDB (the production release, hot off the press and available from here) brings single server durability via support for write-ahead journaling. The following summarises the key points from the journaling documentation, the administration guide and from my initial playing around.
  • you have to specify the --journal switch for mongod in order to enable journaling (was previously --dur prior to production release)
  • journal files are created within a journal subfolder of your dbpath, up to 1GB each in size
  • in the event of a crash, when mongod restarts it will replay the journal files to recover automatically before the server goes online
  • if mongod shuts down cleanly, the journal files are cleared down. Also note that you shouldn't end up with a lot of journal files, as they are rotated out once they are no longer needed
  • batch commits are performed approximately every 100ms (note the mention in the docs that this will be more frequent in future)
  • preallocation of journal files may be done when mongod starts up. If it decides this is worth doing, it will fully provision 3 x 1GB files BEFORE accepting any connections - so there could be a couple of minutes' delay after starting up before you can connect. These preallocated files are not cleared down when mongod is shut down - so it won't have to create them each time.
  • when running with journal files on a SATA drive, I found that it chose not to preallocate them. When I set up a junction to map the journal folder onto a separate 15K SAS drive, it did then choose to preallocate the files.

Durability vs. performance - what's the cost?
I wanted to get an idea of how much this journaling costs in terms of performance. Obviously there's going to be a hit; I just like to get a rough feel for how much. So I took a test 295MB CSV data file containing 20 million rows of data with 2 columns - _id (integer) and val (random 5 character string) - and loaded it into a fresh database, with and without journaling enabled.

Tests were run on a single machine, Intel Xeon W3520 Quad Core, 10GB RAM, Disk 1=7200RPM SATA, Disk 2=15K SAS, Win 7 64Bit. MongoDB data is on Disk 1 (slower but larger disk).

Journaling? | Journal location            | Files preallocated? | Import time (s) | Avg. Import Rate/s | Performance
No          | n/a                         | n/a                 | 291             | 68729              | (baseline)
Yes         | Disk 1 (same as data)       | No                  | 442             | 45249              | -34%
Yes         | Disk 1 (same as data)       | Yes (manually*)     | 422             | 47393              | -31%
Yes         | Disk 2 (separate from data) | Yes                 | 448             | 44643              | -35%

(*) mongod always chose not to preallocate when the journal directory was on the slower disk (same as the data) so I had to manually preallocate the files by copying the ones that were created when the faster disk was used.

I couldn't run the test with the journals on disk 2 without them being preallocated because they were always preallocated and you can't delete them all while mongod is running.

Summing up
In my tests, I found:
  • using --journal resulted in about a 30-35% drop in throughput for my mongoimport job (just under 70K docs/s down to less than 50K docs/s)
  • preallocation of journal files (as you'd expect) helps, as mongod doesn't have to create the files as it goes along
  • setting up a junction on Windows to map the journal directory onto a separate (and faster) disk from the data resulted in slower performance, presumably due to the overhead of the junction redirects. If there's another/better way of doing this I'd be interested to know. Also, I haven't run this on Linux, so maybe there would be less of a hit there. Personally, I'd like to see support for the journal folder to be explicitly configurable so you can point it at a separate disk.
Based on these results, for me, the decision on whether to use journaling or not would come down to how much I actually need single server durability. Is it critical for my specific use? Could I live without it and just use replica sets? What is the value / importance of it for my data vs. raw performance? Let's not forget this is a new feature in MongoDB. As stated in the docs, there are a number of cases in 1.8.1 for performance improvements with regard to journaling. Definitely something to keep an eye on.

Thursday, 17 March 2011

Upcoming DevConnections London conference


What: IT&DevCONNECTIONS Conference
Where: ExCel Conference Centre, London
When: 13th - 15th June 2011

Here's the summary on the event, which gives a good overview of what you can expect from it. Hold on to your hat for some very impressive name-dropping!
Penton, publishers of SharePointPro Connections, DevPro Connections, Windows IT Pro and SQL Server Magazines are excited to announce the IT&DevCONNECTIONS powered by Microsoft conference to be held at the ExCeL conference centre in London. This technical conference and pre con workshop brings together experts from Microsoft and leading independent industry presenters to deliver in-depth presentations and cutting edge sessions to help you succeed with the Microsoft technology stack – SharePoint, Windows, SQL Server, Silverlight, ASP.Net, Virtualisation, Cloud and Azure. Learn about new product features and the future direction for Microsoft delivered by MSVP’s Scott Guthrie, Dave Mendlen and Steve Fox. Dive deep into the latest Microsoft products and technologies with no-hype in-depth sessions delivered by top independent speakers Dan Holme, Kim Tripp, Paul Randal and Scot Hillier. Get your questions answered, your strategies defined, network with the experts and build relationships with your peers at this premier UK event! Register today to take advantage of the early bird offers : www.devconnections.com/uk

The super early bird offer ends on 31st March, with the early bird offer then ending on 13th May, so get registering soon to make the biggest savings. You can save an extra 20% by quoting discount code UG230 during the registration process, and if you're looking for hotels nearby you can get preferential rates here.

Pre-cons and conference sessions
Info on the pre-cons (13th June) can be found here. In terms of the SQL Server space, Simon Sabin (recently joined the elite ranks of becoming a Microsoft Certified Master - Twitter | Blog) is doing one on Advanced Reporting Services, and Paul Randal (Twitter | Blog) and Kimberly Tripp (Twitter | Blog) are doing one on SQL Server Performance Tuning from A to Z.

Info on the conference sessions (14th & 15th June) can be found here. (SQL Server track, SqlServerCentral.com track)

Why should I go?
The real question is, why not?! It looks like an awesome line-up. Having attended the SQL Server MasterClass by Kimberly Tripp and Paul Randal last year, I've experienced at first hand the amount and quality of information that they have to offer so that makes the SQL Server track extremely appealing. And that's before I mention Simon Sabin's sessions from past SqlBits conferences. Need I say more?

Thursday, 10 March 2011

SQLSoton UG March 2011

Last night saw another gathering of the SQL Server Southampton UserGroup, with Gavin Payne (Twitter | Blog) making the trip down to give a talk on replication, his first since joining Coeo.

As a developer, I've had very little exposure to replication and have seen it as some form of dark art; I've never scratched much past the surface. Over the past couple of months of getting up to speed with MongoDB, I've dived far deeper into replication there than I ever have in SQL Server in the 10+ years I've been using it - quite a strange feeling when I think about it like that. So I was interested in learning more about replication in the world of SQL Server. Gavin did not disappoint - excellent session, with very clear and concise explanations. When something "just makes sense", you know it's a sign of a great talk. Thanks to Gavin for coming down and presenting a blinder! Check out Gavin's write-up of the talk on his blog here, where you can also download the presentation.

We were in a new room this month - with much more space and scope for better layout. Hopefully everyone agrees it's a good space to have the UG in.

The next SQL Server Southampton UserGroup is on Weds 13th April, the week after SqlBits 8, when there is another first for the UserGroup - 2 speakers. How's that for value for no money?! We welcome Neil Hambly (Twitter | Blog) to kick off proceedings with the first talk of the evening (interested in studying towards MCM? Follow his progress with MCM training notes), followed by Pete Carter-Greenan. As if that wasn't enough, no doubt we will have more swag. Don't miss it, get yourself registered over on SQLServerFaq.

MongoDB 1.8.0 Release Candidate 2 Available

Fast on the heels of my previous blog post about MongoDB 1.8.0 Release Candidate 1 becoming available, I've just seen 1.8.0 RC2 is now available which includes a handful of changes (Download | Change Log).

Did I mention, MongoDB is rapidly evolving?!

Thursday, 3 March 2011

Upcoming SQLSoton UG - 9th March 2011

Quick reminder that the next SQLSoton (twitter) UserGroup meet is fast approaching on Wednesday 9th March. Gavin Payne (twitter | blog) will be making his way down to Southampton to give us a talk on replication:
For those who have never had the chance to use SQL Server’s replication features, or perhaps only manage an existing implementation of it, this session will give you an overview of the different replication services SQL Server offers, some background into the primary and secondary design goals each, along with some considerations for deployment and in-life management. The session aims to reach but not exceed an MCITP depth of detail.
As usual, please register via SQLServerFAQ. Also, please note we will be in a different room this time but still at the same venue.

Friday, 25 February 2011

Thoughts on MongoDB from a SQL Server Dev

As someone with a background in SQL Server development dating back to 2000, the whole NoSQL jazz has been on the radar for a while but never made it under the microscope, so to speak. Partly because SQL Server is my comfort zone and there's plenty more still to learn about that; partly because I've had my career mapped out in my head and NoSQL didn't feature much in that vision; and partly because, until something real-world drives a push into a technology, it tends to remain a distant dot on the radar.

Back at QCon London last year, I had the opportunity to hear a bit more about the NoSQL world from some great speakers - I think the CAP theorem cropped up in most sessions I attended. The dot on the radar moved a bit closer.

Back To The Future
Fast forward a bit and the real-world driver to take a look into NoSQL technologies appeared. So for a few months, I've spent some time doing research - seeing what the different options out there are, what they offer, what they don't, what compromises you have to make etc etc. Cassandra, CouchDB, MongoDB, HBase, RavenDB...to name just a few. Now I'm not going to go into a comparison of each of those - that in itself would be a whole blog post. If you ask me which is the "best"...I'll say "It Depends". What I do know, is which one feels like a good fit for my environment.

Cinderella's Glass Slipper
MongoDB appeared as a real potential candidate pretty early on and after a bit more of a deep dive investigation into it, it became more and more obvious that it was just the right fit for certain use cases compared to the other technologies. This is not me saying "MongoDB is better than Cassandra" or "I see your RavenDB, and I raise you a MongoDB". Just that in my current scenario MongoDB feels like the best fit.

NoSQL with a capital NOT ONLY
The whole "NoSQL" term has been done to death. Yes, it's not the best term. Happily, from my experience, the majority of what I've seen has been around the principle of "Not Only SQL". And this is very much where I sit. It does not replace SQL Server; it adds value for a number of use cases and for a number of reasons. Choosing the right tool for the right job.

The point of this post is really to just summarise some of the likes and dislikes I have, as someone with a heavy background in SQL Server development. What I think is pretty cool and what is not so cool.

Thumbs up
In no particular order...
  • Cost
    It's free, open source. Can haz more scale? Just add hardware. Licensing costs need not apply (can run on Linux).
  • Schema-less
    If you need to support a flexible schema, MongoDB's document storage is a big plus. It doesn't mean you don't need to think about schema at all, it just means you can very easily model data of this nature and cope with changes without headaches.
  • Quick start & fast learning
    Getting started with MongoDB was quick and easy. There was no entry barrier. I can't fault how quick and easy it was to get up and running with the basics. Hacking around to pick up the more advanced stuff was also a pretty painless exercise too. Within a relatively short period of time, I started to be able to provide answers to questions on StackOverflow. Using the C# driver has been a largely very positive and intuitive experience.
  • Replica sets
    Configuring is simple, making scaling reads and failover pretty effortless. Want more redundancy or scaling of reads? Fire up another machine, add to the set and away you go. You do need to be careful to give thought to the oplog though.
  • Auto Sharding
    Again, configuring is simple. You do need to give very careful consideration up front when deciding on what keys you want to shard on. Once you've done that, sharding "just does its stuff".
  • Community
    It has a good community behind it and that IMHO is very important. I don't like sitting in a cupboard on my own with the lights off. I like being a part of a bigger community - to learn from, work through issues with and to contribute back to.
  • Rapidly evolving
    MongoDB is rapidly changing and it's good to see bugs are being tracked and fixed in good time. There is also a fast flowing feature enhancement pipeline too, so you typically don't have to wait for a long time to get something.
  • Choose your consistency
    You can choose to have data replicated to a configurable number of replicas before returning if you wish to have stronger level of consistency. Depends on what value you put on certain bits of data, but the choice is yours. So you can trade off performance for consistency.

Thumbs down
  • Rapidly evolving
    OK, so I listed this as a "thumbs up" but it can also be a negative. You'll often find older (but not "old") blog posts/answers/articles on a given matter that are no longer true as things have changed since then. So it can be frustrating trying to find the current, correct information as things are changing quickly.
  • Information / Documentation
    The level of information on MongoDB pales in comparison to what there is for SQL Server. Obvious point maybe - but when you're used to MSDN, all the bloggers, activity on StackOverflow/AskSSC etc and #sqlhelp it can be quite a fall back down to Earth.
  • Map/Reduce performance
    Doesn't seem to be great. There are other options if you need greater performance, such as Hadoop, but that adds complexity and another technology into the mix.
  • Tooling
    Not the wealth of tooling that there is for SQL Server. Hopefully this will change over time.

I'm sure I've missed some, but I think I've covered the main points. My brain is eventually consistent, so of course I will update if I remember any more of note :)

Summary
Of course, using a NoSQL technology like MongoDB involves some trade-offs and a different mindset vs a traditional RDBMS. The important thing for me is that it gives another option and is another feather in my CAP (theorem, *badoomtish*). I don't see it as a replacement - it is definitely not; it's just another tool that can be used to achieve an end goal. I'm looking forward to having the two side by side in harmony.

Hopefully, this gives a bit of insight into MongoDB from the point of view of someone coming from a SQL Server background.

You can find my blog posts on MongoDB thus far, here.

Thursday, 24 February 2011

MongoDB 1.8.0 Release Candidate is out

MongoDB 1.8.0-rc0 is now available, you can download from here and find the release notes here.

The main changes over the current 1.6 production release relate to:
  • durability, a much talked about topic (!), with support now for write-ahead journaling (docs)
  • sparse and covered indexes (on my radar to try out)
  • support for more Map/Reduce functionality (note there are some breaking changes in this area)
The full info on the changes is in the Release Notes document.

Wednesday, 23 February 2011

MongoUK 2011 - Upcoming MongoDB event

There's a one day conference on MongoDB coming up, being held at the Skills Matter eXchange in London on Monday 21st March 2011. If you're interested in MongoDB or even just curious as to what it's all about, then it's worth registering for it based on what's on the agenda - it's going to be tough to choose between the different sessions, which include ones on:
  • schema design
  • storage engine internals
  • indexing and query optimiser
  • scalability
  • map/reduce & geo indexing
It will also be a good chance to hear from speakers outside of 10gen such as Thoughtworks, Guardian and Boxed Ice. So I for one am looking forward to it.

For full details and to register, see the MongoUK 2011 page.

Update 25/02/2011:

Also check out the SkillsMatter event page.

Monday, 21 February 2011

SQLSoton UserGroup Room Change from March

Due to the success of the SQL Southampton User Group and based on the feedback/opinions of those who have attended, the room the group is held in is changing. It's still in the same venue, Avenue St Andrews United Reformed Church, but instead of being in the Upper Room as per the last few meets, it will be in St Andrew's Hall. This gives a bigger, more flexible room.

The now famous #SQLSoton signs will be in place to direct as per usual - still round via the back carpark but a different entrance now.

Green marker = new entrance.
Red marker = old entrance.


View SQLSoton UG in a larger map

The next meet is on Wednesday 9th March, with Gavin Payne (Blog | Twitter) coming down to talk replication so if you haven't registered yet, head over to SQLServerFAQ to see the details and register.

See you there!

MongoDB replication - oplogSize

Background
An oplog is a write operation log that MongoDB uses to store data modifications that need to be replicated out to other nodes in the configured set. The oplog is a capped collection, which means it will never grow beyond a certain size - once it reaches its max size, old data drops off the end as new data is added, so it keeps cycling round. The size of the oplog basically determines how long a secondary node can be down for and still be able to catch up when it comes back online. The bigger the oplog, the longer you can deal with a secondary node being down, as the oplog can hold more operations. If a secondary node is down for too long and all the operations required to bring it back up to date are no longer available in the master oplog, then you'll start seeing "Error RS102". See "Resynching a Very Stale Replica Set Member".

oplogSize
Configuring the size of the oplog is an important decision to make up front and so you should give it careful consideration to allow for future write loads. You can configure the oplogSize via the --oplogSize command line argument for mongod (given in MB).

e.g. to set 10GB oplog size:
mongod --oplogSize 10240 --replSet ExampleSet1

Personally, I know I'd err well on the side of caution and set it larger rather than smaller to really minimise the risk of the oplog not being big enough to allow secondary nodes to catch up after falling behind/being offline for longer periods of time.

1.6.5 Bug
Setting this argument appears to not work on Windows 7 64Bit in 1.6.5 (not sure about Linux) and you end up with a seemingly random size being created instead. Also, you may encounter an error depending on the value you specify. It appears multiples of 2048 will produce the following error:
Assertion failure cmdLine.oplogSize > 0 db \db.cpp 962
Due to the nature of the bug, it appears as though the max size you could end up with is 2047MB in that version. A bug case has been raised here; however, this has been fixed in 1.7.x (not a production release at the time of writing) and I can confirm I have it now working as expected.

Update 23/02/2011:
Kristina Chodorow (Twitter | Blog) has written a blog post on "Resizing Your Oplog" - well worth a read if you fall into the boat where you have a system already up and running and realise that you need a bigger oplog. You'll need 1.7 for this.

Sunday, 13 February 2011

OS CodePoint Data Geography Load Update

Following on from my previous post on loading the Ordnance Survey Code-Point data for GB post codes to SQL Server and converting to the GEOGRAPHY data type, I've made a few tweaks to the importer app that is up on GitHub:
  1. The schema of the SQL Server table generated has changed - postcodes are now split into 2 distinct columns: OutwardCode and InwardCode
  2. The importer now calculates a basic average for each postcode district (e.g. AB12, aka the OutwardCode) and sector (e.g. AB12 3). This provides a central geographical position for each district/sector, allowing less granular usage (see the sketch below).
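To give an idea of what that averaging amounts to (an illustration only - the PostCodeLocation type and property names are made up for this sketch, not taken from the importer), it's essentially a group-and-average over the converted coordinates:

using System.Collections.Generic;
using System.Linq;

// Illustration only - type and property names are invented for the example.
public class PostCodeLocation
{
    public string OutwardCode { get; set; }   // district, e.g. "AB12"
    public double Latitude { get; set; }
    public double Longitude { get; set; }
}

public static class DistrictAverager
{
    // Basic average position per district (OutwardCode)
    public static IEnumerable<PostCodeLocation> AverageByDistrict(IEnumerable<PostCodeLocation> locations)
    {
        return locations
            .GroupBy(l => l.OutwardCode)
            .Select(g => new PostCodeLocation
            {
                OutwardCode = g.Key,
                Latitude = g.Average(l => l.Latitude),
                Longitude = g.Average(l => l.Longitude)
            });
    }
}

The sector-level average follows the same pattern, just grouping on the outward code plus the first character of the inward code.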
I've also started paving the way for it to be able to import other Ordnance Survey datasets, such as boundary data, in the future if and when I get time. It turns out the boundary data is a lot less straightforward. For example, loading the shape data supplied as-is does not work for all boundaries, as some data items form invalid GEOGRAPHY instances when attempting to load them into SQL Server. Correcting those items is possible via various means, but is not top of my priority list at the moment.

Monday, 7 February 2011

MongoDB - Does My Data Look Big In This?

You have an existing relational database containing x amount of data and you decide to migrate that data, for whatever reason, into MongoDB. You may have a preconceived belief that as your relational database is x GB in size, after loading that data into MongoDB your NoSQL database size will be about x GB too - after all, you're loading exactly the same data from one to the other, right?

Wrong

For simplicity's sake, assume you are doing a straight load of data, creating 1 document in MongoDB for each row in your relational database and each column value is represented as a key-value pair in the document. What you need to take into account is that:
  1. each document you insert into MongoDB will be assigned an ObjectId (_id) if you don't supply one yourself. This is a 12-byte value consisting of 4 parts:
    • timestamp (4 bytes)
    • machine identifier (3 bytes)
    • process id (2 bytes)
    • increment (3 bytes)
  2. in MongoDB, each document will contain not only the values, but the key names (~"column names" in relational db-speak) too. So if you have keys: "FirstName" and "LastName", then those alone will add 17 bytes to each document.
  3. MongoDB automatically adds some padding to the documents to allow for documents to grow in size to try and reduce the need for documents to be moved around if they do grow

Considerations

It's worth considering using shortened key names in your documents, as this space saving can quickly mount up. Also, if a column doesn't have a value in your relational database, then consider not including that data item in that particular document.
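If you're mapping your own classes with the C# driver, one way to keep readable property names in code while storing shortened key names is the driver's serialization attributes - a quick sketch (the class and the short names are made up for illustration):

using MongoDB.Bson.Serialization.Attributes;

// Readable property names in code, short key names in the stored documents.
public class Person
{
    [BsonElement("fn")]
    public string FirstName { get; set; }

    [BsonElement("ln")]
    public string LastName { get; set; }
}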

SQL Server vs. MongoDB

As an example, I loaded approximately 105 million rows of data from a table in SQL Server into a collection in MongoDB. The resulting data sizes were:
  • SQL Server : 14060MB
  • MongoDB : 19629MB (total storage size inc. allocated but unused space: 22570MB)
That's ~39% more space being used for the same data in MongoDB (based on data size stats, with ObjectIDs being assigned by MongoDB).

After switching to use 2 character coded key names instead of the full length names, this dropped to 15439MB (18266MB total storage size) - about 10% more space than SQL Server. Had I had columns without values in SQL Server, then ensuring those weren't included as items in MongoDB documents would have saved more space.

Monday, 24 January 2011

GB Post Code Geographic Data Load to SQL Server using .NET

Ordnance Survey now make a number of mapping/geographic datasets available to download for free, allowing unrestricted commercial and non-commercial use. One of these is the Code-Point Open dataset, which gives a precise geographic location for each post code in Great Britain as a CSV file.

This file contains the Easting and Northing coordinates for each postcode. You may want to convert these to Latitude/Longitude coordinates to then load into a GEOGRAPHY column in SQL Server (as of 2008) and do all kinds of spatial wizardry in the database. But how?

If at first you don't find, search, search and search again

I did quite a bit of googling and found various complex calculations to convert from Eastings/Northings to Lat/Lon. I also tried some conversion apps that didn't convert into the decimal degrees format I wanted. I finally found the exact kind of .NET library code I was looking for, available under the GNU General Public License v3: geocoordconversion. I downloaded the DLL and from there it only took a matter of minutes of playing about to work out the API to convert the Eastings/Northings from the Code-Point data files into Lat/Lon coordinates (in the WGS84 coordinate system).

Conversion snippet

After referencing the geocoordconversion.dll in your .NET project, converting coordinates is as simple as:
long easting, northing;
// assume easting & northing are set.
GridReference gridRef = new GridReference(easting, northing);
PolarGeoCoordinate polarCoord = GridReference.ChangeToPolarGeo(gridRef);
polarCoord = PolarGeoCoordinate.ChangeCoordinateSystem(polarCoord, 
    CoordinateSystems.WGS84);
// Pull Lat/Lon coordinates from polarCoord.Lat & polarCoord.Lon
End to end import process

Using this, I knocked up a C# console app to:
  • read in all the Code-Point data from the .csv files downloaded from the Ordnance Survey (approx 1.7 million rows of data)
  • convert the Eastings/Northings to WGS84 Lat/Lon coordinates
  • bulk load all the data to a new table in SQL Server
  • populate a GEOGRAPHY column with the point data for each post code (see the sketch just after this list)
  • create a spatial index on that column
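For the last two steps in that list, the T-SQL the console app executes boils down to something like this (a sketch only - it assumes the dbo.PostCodeData table has Latitude/Longitude FLOAT columns alongside the GeoLocation GEOGRAPHY column, and the index name is illustrative):

// Sketch of steps 4 and 5 (System.Data.SqlClient) - connection string is a placeholder.
using (SqlConnection conn = new SqlConnection("{connection string}"))
{
    conn.Open();

    // 4. Populate the GEOGRAPHY column from the converted Lat/Lon values
    //    (4326 = SRID for the WGS84 coordinate system)
    using (SqlCommand cmd = new SqlCommand(
        @"UPDATE dbo.PostCodeData
          SET GeoLocation = geography::Point(Latitude, Longitude, 4326);", conn))
    {
        cmd.CommandTimeout = 0;   // ~1.7 million rows, so don't time out
        cmd.ExecuteNonQuery();
    }

    // 5. Create a spatial index on that column
    using (SqlCommand cmd = new SqlCommand(
        @"CREATE SPATIAL INDEX IX_PostCodeData_GeoLocation
          ON dbo.PostCodeData (GeoLocation);", conn))
    {
        cmd.CommandTimeout = 0;
        cmd.ExecuteNonQuery();
    }
}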
If you'd like a copy of/source code for this app, just ping me and I'll be happy to share. I plan to put the code up somewhere when I get around to it. The code is now available on GitHub.

Example query

Here's a quick, very basic demo of querying the geography data, which finds all post codes within about 5 miles of a given postcode.

DECLARE @home GEOGRAPHY
SELECT @home = GeoLocation FROM dbo.PostCodeData WHERE OutwardCode = 'AB12' AND InwardCode = '3CD'

SELECT OutwardCode, InwardCode
FROM dbo.PostCodeData
WHERE GeoLocation.STDistance(@home) <= (5 * 1609) -- 1609 = approx metres in 1 mile
This returned me 8354 rows in around 500ms, with a spatial index in place on the GeoLocation column.

UPDATE (14 March 2012)
See my recent post for information on the accuracy of the GeoCoordConversion third party library I'm using to convert Eastings/Northings to Latitude/Longitude.

Tuesday, 18 January 2011

Running MongoDB as a Windows Service

Following on from my opening "Getting started with MongoDB and .NET" post, one of the next logical steps was to get MongoDB running as a Windows service.

From a command prompt, in the mongodb bin directory run:
mongod --install --serviceName "MongoDB" --dbpath C:\mongodb\data\db --logpath C:\mongodb\logs\mongolog.txt --logappend
Obviously you can tweak the arguments for your environment - I opted to use a different dbpath to the default \data\db

Starting the service is then just a case of running:
net start mongodb
or starting via Control Panel - Administrative Tools - Services. You can of course set it to start up automatically on Windows startup.

Word of caution
Make sure the arguments you give to mongod are valid. I initially forgot to specify the --dbpath argument, and because the default location didn't exist on my machine, the service got stuck in a constant start/stop cycle when I tried to start it. The mongo log file was growing and growing each time. In the end, renaming the logpath directory on my machine was enough to kill the service completely and let me correct the mistake.

There is an open issue logged on mongodb.org for this: Error during service startup causes server to start over and over and over




Wednesday, 12 January 2011

SqlBulkCopy to SQL Server in Parallel

In an earlier post last year, I blogged about high performance bulk loading to SQL Server from .NET using SqlBulkCopy. That post highlighted the performance gain that SqlBulkCopy gives over another batched insert approach using an SqlDataAdapter. But is it possible to squeeze more performance out? Oh yes.

First, a quick recap. For optimal performance:
  • load into a heap table (with no indexes - add any indexes you need AFTER you've loaded the data)
  • specify the TableLock SqlBulkCopy option. This results in bulk update locks being taken on the destination table instead of row locks. This is key for what I'm going to focus on in this post as I'll go on to explain. Relevant reading:
    - SqlBulkCopyOptions Enumeration MSDN documentation
  • pave the way for minimal logging by setting the database to the Bulk-Logged recovery model (or Simple). In the Full recovery model all inserts are fully logged in the transaction log, whereas in Bulk-Logged, certain bulk operations can be minimally logged, reducing the growth of the transaction log - and a SqlBulkCopy load is one such operation. Relevant reading:
    - Prerequisites For Minimal Logging in Bulk Import MSDN reference
    - Operations That Can Be Minimally Logged MSDN reference
    - SQL Server Myth A Day : Bulk-Logged Recovery Model by Paul Randal (Twitter)
Get squeezing
How can we squeeze more throughput in bulk inserts using SqlBulkCopy?

Multiple bulk update locks on the table from different processes can be taken out at the same time without blocking each other, which means you can scale out the bulk loading across multiple clients in parallel. There will be limitations of course, such as the network connection and the speed of the disks behind the database, but performance gains are very much possible.

Instead of loading data from a single machine, split the data over a number of machines, each one loading its own chunk of the data to the database using SqlBulkCopy as previously outlined. Finding an optimal distribution of data/number of clients may take a bit of investigation, but the rewards can pay dividends.

Pimp My Test Environment
Unfortunately I don't have a server rack at home housing multiple servers with a fast IO subsystem - hey, call me a cheap-skate. What I do have is a laptop with an i5-560M (2.67GHz dual core, 4 threads), 8GB RAM and a 7200RPM HDD. That's how I roll.

What I've done is to knock up a quick command-line app in .NET 4.0 using the new Task Parallel Library (TPL) to split the loading of data to SQL Server across 4 threads in parallel, each using a separate database connection and SqlBulkCopy instance. To quote MSDN:
The TPL scales the degree of concurrency dynamically to most efficiently use all the processors that are available. In addition, the TPL handles the partitioning of the work, the scheduling of threads on the ThreadPool, cancellation support, state management, and other low-level details. By using TPL, you can maximize the performance of your code while focusing on the work that your program is designed to accomplish.

Starting with the .NET Framework 4, the TPL is the preferred way to write multithreaded and parallel code.

I loaded 20 million rows of data from an in-memory DataTable to a heap table in SQL Server 2008 with 2 columns: FieldA VARCHAR(10), FieldB INTEGER.
FieldA is just a random sequence of 10 characters, FieldB is an incrementing number.

I ran 2 tests, recreating the database fresh before each run:
Test 1: 1 DataTable of 20 million rows, loaded single threaded. LOAD TIME: 15.4088814s
Test 2: 4 DataTables of 5 million rows, loaded in parallel. LOAD TIME: 12.3377056s

Given that this was all running on a single machine (a laptop at that), a 20% reduction in loading time is not too shabby at all! On production quality hardware, with multiple clients separate from the database server, the concept of loading in parallel is well worth bearing in mind and could speed up your data loading process.

Snippets
To give a brief taster of the .NET approach I used, here's the code of the key parts.
public void ParallelLoadTest()
{
    // GetTestData() is just a basic method that returns a List<> of 4 DataTables.
    // Each DataTable is populated with 5 million rows.
    List<DataTable> tables = GetTestData(); 

    // For each of the 4 DataTables, kick off a bulk load task to run in parallel
    Parallel.ForEach(tables, table =>
        {
            BulkLoadData(table);
        }
    );
}

public void BulkLoadData(DataTable dt)
{
     using(SqlConnection conn = new SqlConnection("{removed connectionstring}"))
     using (SqlBulkCopy bulkCopy = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null))
     {
          bulkCopy.DestinationTableName = "Dataload";
          bulkCopy.BulkCopyTimeout = 60;
          bulkCopy.ColumnMappings.Add("FieldA", "FieldA");
          bulkCopy.ColumnMappings.Add("FieldB", "FieldB");
          conn.Open();
          bulkCopy.WriteToServer(dt);
          bulkCopy.Close();
     }
}

Monday, 10 January 2011

GROUPING SETS in SQL Server

Something I personally haven't seen a lot of out there in the SQL Server world is the use of GROUPING SETS - an operator that can be applied in a GROUP BY clause. So what does it do? How would you use it?

Take the AdventureWorks sample database as an example playground. Suppose you want to query the sales data to find the following:
1. total sales for each product
2. total sales for each product category
3. total sales

There are a number of ways you could do this. You could of course write 3 separate queries and then handle the 3 resultsets returned. Or, if you wanted a single resultset, you could use one of these approaches:

1. UNION ALL approach
SELECT d.ProductID, NULL AS ProductCategoryId, SUM(d.LineTotal) AS Total -- Total sales by product
FROM SalesLT.SalesOrderDetail d
GROUP BY d.ProductID
UNION ALL 
SELECT NULL, p.ProductCategoryID, SUM(d.LineTotal) -- Total sales by category
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY p.ProductCategoryID
UNION ALL
SELECT NULL, NULL, SUM(d.LineTotal) -- Total of all sales
FROM SalesLT.SalesOrderDetail d

2. WITH CUBE approach
SELECT ProductID, ProductCategoryID, Total
FROM
(
SELECT d.[ProductID], p.ProductCategoryID, SUM(d.LineTotal) AS Total
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY d.ProductID, p.ProductCategoryID
WITH CUBE
) x
WHERE x.ProductID IS NULL OR x.ProductCategoryID IS NULL

3. GROUPING SETS approach
SELECT d.[ProductID], p.ProductCategoryID, SUM(d.LineTotal) AS Total
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY GROUPING SETS((d.ProductID),(p.ProductCategoryID), ())

If you haven't used GROUPING SETS before, this GROUP BY clause may look a bit odd at first. So what's it doing? We're just defining the different things we want to group the results by, in one CSV delimited form in the GROUPING SETS operator. In this scenario, we're saying:
"Return the totals grouped by ProductId, and also the totals grouped by ProductCategoryID and then also the Grand Total (indicated by the final pair of empty brackets)."
You can define a grouping involving combinations of fields too. For example:
GROUP BY GROUPING SETS((d.ProductID),(p.ProductCategoryID), 
(d.ProductID, p.ProductCategoryID),())
which is equivalent to the WITH CUBE subquery part of the SQL given in approach 2 above.

As always, you should check the performance of the various approaches for your scenario - just because you can write a query one way, with less code, doesn't automatically mean it's going to be the best in terms of performance. Here are the stats for each of the 3 approaches above, based on a cold cache each time.
Approach 1 (UNION ALL): CPU=16, Reads=146, Duration=25
Approach 2 (WITH CUBE): CPU=32, Reads=422, Duration=31
Approach 3 (GROUPING SETS): CPU=31, Reads=422, Duration=28

UPDATE:
Thanks to Simon Sabin (Twitter | Blog) for the suggestion of trying these out on the full AdventureWorks2008 database to show how it performs on a larger dataset. The AdventureWorks2008LT database I used above is a simplified, smaller version of the full sample database e.g. SalesOrderDetail contains 542 rows in the "LT" version of the db compared to 121,317 rows in the full version.

So I ran the same set of tests again on AdventureWorks2008 - I had to make a few cosmetic changes to the queries due to the slight differences in table schemas, but the structure of the queries remained exactly the same. Here are the results:

Approach 1 (UNION ALL): CPU=312, Reads=3760, Duration=589
Approach 2 (WITH CUBE): CPU=141, Reads=1833, Duration=339
Approach 3 (GROUPING SETS): CPU=141, Reads=1831, Duration=331

Again, the WITH CUBE & GROUPING SETS approaches are all but equal. However, now the UNION ALL approach is the one that is least performant.

Focusing on the UNION ALL vs the GROUPING SETS approach, when running the two queries in a single batch and displaying the execution plans, the query cost is split as follows:
UNION ALL 70% / GROUPING SETS 30%. The UNION ALL approach is hit by having to query data from the SalesOrderDetail table multiple times - so with a larger table, this cost gets more expensive. Whereas the GROUPING SETS approach uses Table Spool (Eager Spool) operators to fulfil the query. To quote MSDN:
The Table Spool operator scans the input and places a copy of each row in a hidden spool table that is stored in the tempdb database and existing only for the lifetime of the query. If the operator is rewound (for example, by a Nested Loops operator) but no rebinding is needed, the spooled data is used instead of rescanning the input.

This is a great demonstration of why, in the real world, you should always check how a query will perform against production data volumes.

.NET Admin GUI for MongoDB

Having just started out fiddling around with MongoDB from .NET, finding a GUI over the top of MongoDB to help me out while I find my feet was one of the first things I looked for. And I've been using the first tool I found: MongoVUE - there are undoubtedly others out there, but at the moment I've found no reason to continue searching.

Can it do everything I want it to?

Well, that would involve me knowing all the things I would want it to do - as I'm just finding my feet with MongoDB, I honestly don't know :) But so far, so good. It's proved to be a very handy tool.

It's free and just requires .NET Framework 2.0 SP1 (as of time of writing this). If you're in the same boat/dinghy as me, then I recommend at least checking out this tool to evaluate for yourself.

If you know of other MongoDB-related tools that you have used and can recommend, please feel free to ping me via a comment / twitter / email.

Friday, 7 January 2011

Getting Started with MongoDB and .NET

Getting up and running with MongoDB using C# is a very simple process. It literally takes a couple of minutes, including the time to download MongoDB itself and the C# drivers - so if you fancy having a play around with it to see what it's all about, there's no barrier stopping you.

Nutshell
To quote MongoDB.org:
MongoDB (from "humongous") is a scalable, high-performance, open source, document-oriented database.

Check out the MongoDB homepage for a highlight of the key features.

Step 1 : Download the latest MongoDB release from here
Extract the .zip file to somewhere memorable. It's a good idea to rename the folder to just "mongodb" unless you have an amazing memory as the folder name will by default include all kinds of version info. e.g. C:\mongodb\ is as good a place as any

Step 2 : Download the latest MongoDB C# Driver from GitHub
Available as .msi or .zip file. So install or unzip as you see fit.

Step 3 : Create a data directory
By default, MongoDB stores data in \data\db - you need to create those directories yourself. Personally, I chose to put the directory under the mongodb folder - C:\mongodb\data\db\

Step 4 : Fire up the MongoDB server
In a cmd prompt, navigate to the mongodb\bin directory.
If you're using the default data directory, you run:
mongod
If you're using a different data directory, you need to specify that as an argument:
mongod --dbpath C:\mongodb\data\db\
So you should see something like this:
You now have MongoDB running.

Step 5 : Start coding
Create a new .NET project and add references to MongoDB.BSON.dll and MongoDB.Driver.dll from the C# driver installed in Step 2. By default, these are in program files under: MongoDB\CSharpDriver {VersionNumber}\

From this point, the C# driver tutorial is a great way to get started. Scan your eye over that and you'll have code interacting with MongoDB in no time at all. It really is quick and easy to get going with the basics. Within minutes, you can have written code to insert, find and delete documents as well as becoming comfortable with the API and seeing the differences to that of working with a RDBMS like SQL Server.
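To give a flavour of that first bit of code against the 1.x C# driver, here's a minimal sketch (the database, collection and document here are made up - the tutorial linked above is the definitive reference):

using System;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

class Program
{
    static void Main()
    {
        // Connect to the local mongod started in Step 4
        MongoServer server = MongoServer.Create("mongodb://localhost");
        MongoDatabase database = server.GetDatabase("test");   // made-up database name
        MongoCollection<BsonDocument> people = database.GetCollection<BsonDocument>("people");

        // Insert a document
        people.Insert(new BsonDocument { { "FirstName", "Ada" }, { "LastName", "Lovelace" } });

        // Find it again
        BsonDocument doc = people.FindOne(Query.EQ("FirstName", "Ada"));
        Console.WriteLine(doc);

        // And delete it
        people.Remove(Query.EQ("FirstName", "Ada"));
    }
}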

So that's it. A very quick, whistle-stop guide to the basics of getting started with MongoDB from a .NET point of view.