Sunday, 27 March 2011

OS Data Importer Supports Scale Gazetteer Dataset

I've just pushed some updates to GitHub for the Ordnance Survey Data Importer .NET app I've been working on every now and then (see my previous posts: GB Post Code Geographic Data Load to SQL Server using .NET and OS CodePoint Data Geography Load Update).

Aside from a little re-jigging to potentially pave the way for loading more Ordnance Survey data files into SQL Server in the future, it now supports importing the 1:50000 Scale Gazetteer data file to SQL Server (available to download from here). To quote the Ordnance Survey site:
The 1:50 000 Scale Gazetteer is a reference tool or location finder, similar to the index in a road atlas. It can be used as a simple list to find out relevant coordinates and six-figure grid references for a town or area.
At the time of this post this data file contains 259,113 locations taken from the OS Landranger Map series.

Example Use
To load the 1:50000 scale gazetteer data file (downloaded and extracted to C:\OSGazetteerData\50kgaz2010.txt) into the dbo.Gazetteer table in MyDatabase on server SQLServerA, with county codes loaded into a County table and feature codes into a Feature table, run the following from the command prompt:
OSCodePointDataImport.exe GAZETTEER SQLServerA MyDatabase dbo Gazetteer County Feature "C:\OSGazetteerData\50kgaz2010.txt"
The Gazetteer database table will consist of the following columns:
SeqNo INTEGER PRIMARY KEY
PlaceName VARCHAR(60)
CountyCode CHAR(2) FK to County.Code
FeatureCode VARCHAR(3) FK to Feature.Code
Longitude FLOAT
Latitude FLOAT
GeoLocation GEOGRAPHY
The County lookup table will consist of the following columns:
Code CHAR(2) PRIMARY KEY
Name VARCHAR(60)
The Feature lookup table will consist of the following columns:
Code VARCHAR(3) PRIMARY KEY
Description VARCHAR(50)
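To show the shape of the data the importer produces, here's a minimal Python sketch of the lookup-table normalisation described above: county and feature codes live in their own lookup tables, and gazetteer rows reference them by code. The sample codes and names below are made up for illustration - they're not real OS data.

```python
# Toy model of the importer's output: gazetteer rows reference
# separate county and feature lookup tables by code.
# Sample codes/names are illustrative, not actual OS data.

counties = {"HA": "Hampshire", "DO": "Dorset"}   # Code -> Name
features = {"T": "Town", "W": "Water feature"}   # Code -> Description

gazetteer = [
    # (SeqNo, PlaceName, CountyCode, FeatureCode, Longitude, Latitude)
    (1, "Southampton", "HA", "T", -1.4044, 50.9097),
    (2, "River Test",  "HA", "W", -1.5000, 51.0000),
]

def describe(row):
    """Resolve a gazetteer row's codes via the lookup tables."""
    seq, place, county_code, feature_code, lon, lat = row
    return f"{place} ({features[feature_code]}) in {counties[county_code]}"

print(describe(gazetteer[0]))  # -> Southampton (Town) in Hampshire
```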

The source is on GitHub here: https://github.com/AdaTheDev/Ordnance-Survey-Code-Point-Data-Importer

Friday, 18 March 2011

MongoDB Journaling Performance - Single Server Durability

Note: Relates to v1.8.0 release.

v1.8 of MongoDB (production release hot off the press, available from here) brings single server durability to MongoDB via its support for write-ahead journaling. The following summarises the key points from the journaling documentation, administration guide and from my initial playing around.
  • you have to specify the --journal switch for mongod in order to enable journaling (this was --dur prior to the production release)
  • journal files are created within a journal subfolder of your dbpath, up to 1GB each in size
  • in the event of a crash, when mongod restarts it will replay the journal files to recover automatically before the server goes online
  • if mongod shuts down cleanly, the journal files are cleared down. You also shouldn't end up with a lot of journal files, as they are rotated out once they are no longer needed
  • batch commits are performed approximately every 100ms (note the mention in the docs that this will be more frequent in future)
  • preallocation of journal files may be done when mongod starts up. If it decides this is worth doing, it will fully provision 3 x 1GB files BEFORE accepting any connections - so there could be a couple of minutes' delay after starting up before you can connect. These preallocated files are not cleared down when mongod is shut down, so it won't have to create them each time.
  • when running with journal files on a SATA drive, I found that it chose not to preallocate them. When I set up a junction to map the journal folder onto a separate 15K SAS drive, it did then choose to preallocate the files.
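To make the ~100ms batch commit concrete, here's a toy Python sketch of the group-commit idea: writes are buffered in memory and flushed to the journal in batches on a timer. This is only an illustration of the batching behaviour - it's not MongoDB's actual implementation or journal format.

```python
import time

# Toy group-commit journal: writes are buffered and flushed in
# batches roughly every commit_interval seconds. Illustrative only -
# not MongoDB's real journal.
class ToyJournal:
    def __init__(self, commit_interval=0.1):   # ~100ms, as in 1.8.0
        self.commit_interval = commit_interval
        self.buffer = []
        self.committed = []                    # stands in for the on-disk journal
        self.last_flush = time.monotonic()

    def write(self, op):
        self.buffer.append(op)
        if time.monotonic() - self.last_flush >= self.commit_interval:
            self.flush()

    def flush(self):
        # The real thing would do a synced append to a journal file here.
        if self.buffer:
            self.committed.append(list(self.buffer))
            self.buffer.clear()
        self.last_flush = time.monotonic()

journal = ToyJournal()
for i in range(1000):
    journal.write({"_id": i, "val": "abcde"})
journal.flush()  # final flush, as on a clean shutdown
print(sum(len(batch) for batch in journal.committed))  # 1000 - nothing lost
```

The durability trade-off in the benchmark below comes from exactly this extra write path: every operation is written to the journal as well as the data files.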

Durability vs. performance - what's the cost?
I wanted to get an idea of how much this journaling costs in terms of performance. Obviously there's going to be a hit; I just wanted a rough feel for how much. So I took a test 295MB CSV data file containing 20 million rows of data with 2 columns - _id (integer) and val (a random 5 character string) - and loaded it into a fresh database, with and without journaling enabled.

Tests were run on a single machine, Intel Xeon W3520 Quad Core, 10GB RAM, Disk 1=7200RPM SATA, Disk 2=15K SAS, Win 7 64Bit. MongoDB data is on Disk 1 (slower but larger disk).

Journaling | Journal location            | Files preallocated? | Import time (s) | Avg. import rate/s | Performance
-----------|-----------------------------|---------------------|-----------------|--------------------|------------
No         | n/a                         | n/a                 | 291             | 68,729             | baseline
Yes        | Disk 1 (same as data)       | No                  | 442             | 45,249             | -34%
Yes        | Disk 1 (same as data)       | Yes (manually*)     | 422             | 47,393             | -31%
Yes        | Disk 2 (separate from data) | Yes                 | 448             | 44,643             | -35%

(*) mongod always chose not to preallocate when the journal directory was on the slower disk (same as the data) so I had to manually preallocate the files by copying the ones that were created when the faster disk was used.

I couldn't run the disk 2 test without preallocated journal files: mongod always chose to preallocate on that disk, and you can't delete the files while mongod is running.
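The rates and percentages in the table above follow directly from the import times and the 20 million row count; a quick Python check:

```python
# Recompute the table's throughput and relative-performance figures
# from the raw import times and the 20 million row count.
rows = 20_000_000

runs = {
    "no journal":                    291,
    "journal, disk 1, no prealloc":  442,
    "journal, disk 1, prealloc":     422,
    "journal, disk 2, prealloc":     448,
}

baseline_rate = rows / runs["no journal"]
for name, seconds in runs.items():
    rate = rows / seconds
    delta = (rate - baseline_rate) / baseline_rate * 100
    print(f"{name}: {rate:,.0f} docs/s ({delta:+.0f}%)")
```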

Summing up
In my tests, I found:
  • using --journal resulted in about a 30-35% drop in throughput for my mongoimport job (just under 70K docs/s down to less than 50K docs/s)
  • preallocating journal files helps (as you'd expect), since mongod doesn't have to create the files as it goes along
  • setting up a junction on Windows to map the journal directory onto a separate (and faster) disk from the data actually resulted in slower performance, presumably due to the overhead of the junction redirects. If there's another/better way of doing this, I'd be interested to know. I also haven't run this on Linux, so maybe the hit would be smaller there. Personally, I'd like to see the journal folder made explicitly configurable so you can point it at a separate disk.
Based on these results, for me, the decision on whether to use journaling would come down to how much I actually need single server durability. Is it critical for my specific use? Could I live without it and just use replica sets? What is the value of it for my data vs. raw performance? Let's not forget this is a new feature in MongoDB; as stated in the docs, a number of journaling performance improvements are slated for 1.8.1. Definitely something to keep an eye on.

Thursday, 17 March 2011

Upcoming DevConnections London conference


What: IT&DevCONNECTIONS Conference
Where: ExCel Conference Centre, London
When: 13th - 15th June 2011

Here's the summary on the event, which gives a good overview of what you can expect from it. Hold on to your hat for some very impressive name-dropping!
Penton, publishers of SharePointPro Connections, DevPro Connections, Windows IT Pro and SQL Server Magazines are excited to announce the IT&DevCONNECTIONS powered by Microsoft conference to be held at the ExCeL conference centre in London. This technical conference and pre con workshop brings together experts from Microsoft and leading independent industry presenters to deliver in-depth presentations and cutting edge sessions to help you succeed with the Microsoft technology stack – SharePoint, Windows, SQL Server, Silverlight, ASP.Net, Virtualisation, Cloud and Azure. Learn about new product features and the future direction for Microsoft delivered by MSVP’s Scott Guthrie, Dave Mendlen and Steve Fox. Dive deep into the latest Microsoft products and technologies with no-hype in-depth sessions delivered by top independent speakers Dan Holme, Kim Tripp, Paul Randal and Scot Hillier. Get your questions answered, your strategies defined, network with the experts and build relationships with your peers at this premier UK event! Register today to take advantage of the early bird offers : www.devconnections.com/uk

The super early bird offer ends on 31st March, with the early bird offer then ending on 13th May, so register soon to make the biggest savings. You can save an extra 20% by quoting discount code UG230 during registration, and if you're looking for hotels nearby you can get preferential rates here.

Pre-cons and conference sessions
Info on the pre-cons (13th June) can be found here. In the SQL Server space, Simon Sabin (who recently joined the elite ranks of Microsoft Certified Masters - Twitter | Blog) is doing one on Advanced Reporting Services, and Paul Randal (Twitter | Blog) and Kimberly Tripp (Twitter | Blog) are doing one on SQL Server Performance Tuning from A to Z.

Info on the conference sessions (14th & 15th June) can be found here. (SQL Server track, SqlServerCentral.com track)

Why should I go?
The real question is, why not?! It looks like an awesome line-up. Having attended the SQL Server MasterClass by Kimberly Tripp and Paul Randal last year, I've experienced first-hand the amount and quality of information they have to offer, which makes the SQL Server track extremely appealing. And that's before I mention Simon Sabin's sessions from past SqlBits conferences. Need I say more?

Thursday, 10 March 2011

SQLSoton UG March 2011

Last night saw another gathering of the SQL Server Southampton UserGroup, with Gavin Payne (Twitter | Blog) making the trip down to give a talk on replication, his first since joining Coeo.

As a developer, I've had very little exposure to replication and have seen it as something of a dark art; I've never scratched much past the surface. Over the past couple of months of getting up to speed with MongoDB, I've dived far deeper into replication there than I ever have in SQL Server in the 10+ years I've been using it - quite a strange feeling, thinking about it like that. So I was interested in learning more about replication in the world of SQL Server. Gavin did not disappoint - an excellent session, with very clear and concise explanations. When something "just makes sense", you know it's a sign of a great talk. Thanks to Gavin for coming down and presenting a blinder! Check out Gavin's write-up of the talk on his blog here, where you can also download the presentation.

We were in a new room this month - with much more space and scope for better layout. Hopefully everyone agrees it's a good space to have the UG in.

The next SQL Server Southampton UserGroup is on Weds 13th April, the week after SqlBits 8, when there is another first for the UserGroup - 2 speakers. How's that for value for no money?! We welcome Neil Hambly (Twitter | Blog) to kick off proceedings with the first talk of the evening (interested in studying towards MCM? Follow his progress with MCM training notes), followed by Pete Carter-Greenan. As if that wasn't enough, no doubt we will have more swag. Don't miss it, get yourself registered over on SQLServerFaq.

MongoDB 1.8.0 Release Candidate 2 Available

Fast on the heels of my previous blog post about MongoDB 1.8.0 Release Candidate 1 becoming available, I've just seen 1.8.0 RC2 is now available which includes a handful of changes (Download | Change Log).

Did I mention, MongoDB is rapidly evolving?!

Thursday, 3 March 2011

Upcoming SQLSoton UG - 9th March 2011

Quick reminder that the next SQLSoton (twitter) UserGroup meet is fast approaching on Wednesday 9th March. Gavin Payne (twitter | blog) will be making his way down to Southampton to give us a talk on replication:
For those who have never had the chance to use SQL Server’s replication features, or perhaps only manage an existing implementation of it, this session will give you an overview of the different replication services SQL Server offers, some background into the primary and secondary design goals of each, along with some considerations for deployment and in-life management. The session aims to reach but not exceed an MCITP depth of detail.
As usual, please register via SQLServerFAQ. Also, please note we will be in a different room this time but still at the same venue.