Monday, 24 January 2011

GB Post Code Geographic Data Load to SQL Server using .NET

Ordnance Survey now make a number of mapping/geographic datasets available to download for free, allowing unrestricted commercial and non-commercial use. One of these is the Code-Point Open dataset, which gives a precise geographic location for each postcode in Great Britain as a set of CSV files.

These files contain the Easting and Northing coordinates for each postcode. You may want to convert these to Latitude/Longitude coordinates so you can load them into a GEOGRAPHY column in SQL Server (available since SQL Server 2008) and do all kinds of spatial wizardry in the database. But how?

If at first you don't find, search, search and search again

I did quite a bit of googling and found various complex calculations to convert from Eastings/Northings to Lat/Lon. I also tried some conversion apps that didn't produce the decimal degrees format I wanted. I finally found the exact kind of .NET library code I was looking for, available under the GNU General Public License v3: geocoordconversion. I downloaded the DLL and from there it took only a matter of minutes of playing about to work out the API for converting the Eastings/Northings from the Code-Point data files into Lat/Lon coordinates (in the WGS84 coordinate system).

Conversion snippet

After referencing the geocoordconversion.dll in your .NET project, converting coordinates is as simple as:
long easting, northing;
// assume easting & northing have been set to the values to convert
GridReference gridRef = new GridReference(easting, northing);
PolarGeoCoordinate polarCoord = GridReference.ChangeToPolarGeo(gridRef);
polarCoord = PolarGeoCoordinate.ChangeCoordinateSystem(polarCoord, 
    CoordinateSystems.WGS84);
// Pull the Lat/Lon coordinates from polarCoord.Lat & polarCoord.Lon

End-to-end import process

Using this, I knocked up a C# console app to:
  • read in all the Code-Point data from the .csv files downloaded from the Ordnance Survey (approx 1.7 million rows of data)
  • convert the Eastings/Northings to WGS84 Lat/Lon coordinates
  • bulk load all the data to a new table in SQL Server
  • populate a GEOGRAPHY column with the point data for each post code
  • create a spatial index on that column
If you'd like a copy of the source code for this app, just ping me and I'll be happy to share - I plan to put the code up somewhere when I get around to it. Update: the code is now available on GitHub.
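
To give a flavour of the approach, here's a cut-down sketch of the key steps. This is not the exact code from the app - the destination table/column names (dbo.PostCodeData, Latitude, Longitude), the CSV column positions and the conversion library's namespace are assumptions here, so treat it as an outline only.

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using GeoCoordConversion; // namespace assumed for geocoordconversion.dll - adjust to suit

class CodePointLoader
{
    static void Main()
    {
        string connectionString = "{your connection string}";

        // Staging DataTable matching the destination heap table
        DataTable table = new DataTable();
        table.Columns.Add("OutwardCode", typeof(string));
        table.Columns.Add("InwardCode", typeof(string));
        table.Columns.Add("Latitude", typeof(double));
        table.Columns.Add("Longitude", typeof(double));

        foreach (string file in Directory.GetFiles(@"C:\CodePointOpen", "*.csv"))
        {
            foreach (string line in File.ReadAllLines(file))
            {
                string[] fields = line.Split(',');

                // Column positions are placeholders - check the Code-Point Open documentation
                string postcode = fields[0].Trim('"').Replace(" ", "");
                long easting = long.Parse(fields[2]);
                long northing = long.Parse(fields[3]);

                // Convert Eastings/Northings to WGS84 Lat/Lon (as per the snippet above)
                GridReference gridRef = new GridReference(easting, northing);
                PolarGeoCoordinate polarCoord = GridReference.ChangeToPolarGeo(gridRef);
                polarCoord = PolarGeoCoordinate.ChangeCoordinateSystem(polarCoord,
                    CoordinateSystems.WGS84);

                // The inward code is always the last 3 characters of a postcode
                table.Rows.Add(postcode.Substring(0, postcode.Length - 3),
                    postcode.Substring(postcode.Length - 3),
                    polarCoord.Lat, polarCoord.Lon);
            }
        }

        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Bulk load all the rows into the heap table
            using (SqlBulkCopy bulkCopy = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null))
            {
                bulkCopy.DestinationTableName = "dbo.PostCodeData";
                bulkCopy.ColumnMappings.Add("OutwardCode", "OutwardCode");
                bulkCopy.ColumnMappings.Add("InwardCode", "InwardCode");
                bulkCopy.ColumnMappings.Add("Latitude", "Latitude");
                bulkCopy.ColumnMappings.Add("Longitude", "Longitude");
                bulkCopy.WriteToServer(table);
            }

            using (SqlCommand cmd = conn.CreateCommand())
            {
                cmd.CommandTimeout = 0; // these can take a while over ~1.7 million rows

                // Populate the GEOGRAPHY column (SRID 4326 = WGS84)
                cmd.CommandText = "UPDATE dbo.PostCodeData SET GeoLocation = geography::Point(Latitude, Longitude, 4326)";
                cmd.ExecuteNonQuery();

                // Create a spatial index on the GEOGRAPHY column
                cmd.CommandText = "CREATE SPATIAL INDEX IX_PostCodeData_GeoLocation ON dbo.PostCodeData(GeoLocation)";
                cmd.ExecuteNonQuery();
            }
        }
    }
}

The GEOGRAPHY column is populated using geography::Point(Latitude, Longitude, 4326) - 4326 being the SRID for WGS84 - which is what makes the spatial index and queries like the one below possible.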

Example query

Here's a quick, very basic demo of querying the geography data, which finds all postcodes within about 5 miles of a given postcode.

DECLARE @home GEOGRAPHY
SELECT @home = GeoLocation FROM dbo.PostCodeData WHERE OutwardCode = 'AB12' AND InwardCode = '3CD'

SELECT OutwardCode, InwardCode
FROM dbo.PostCodeData
WHERE GeoLocation.STDistance(@home) <= (5 * 1609) -- 1609 = approx metres in 1 mile

This returned me 8354 rows in around 500ms, with a spatial index in place on the GeoLocation column.

UPDATE (14 March 2012)
See my recent post for information on the accuracy of the GeoCoordConversion third-party library I'm using to convert Eastings/Northings to Latitude/Longitude.

Tuesday, 18 January 2011

Running MongoDB as a Windows Service

Following on from my opening "Getting started with MongoDB and .NET" post, one of the next logical steps was to get MongoDB running as a Windows service.

From a command prompt, in the mongodb bin directory run:
mongod --install --serviceName "MongoDB" --dbpath C:\mongodb\data\db --logpath C:\mongodb\logs\mongolog.txt --logappend
Obviously you can tweak the arguments for your environment - I opted to use a different dbpath to the default \data\db.

Starting the service is then just a case of running:
net start mongodb
or starting it via Control Panel - Administrative Tools - Services. You can of course set the service to start automatically when Windows starts.
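
To switch it to automatic startup from the command line, something like this should do the trick (sc is the built-in Windows service configuration tool; note the space after "start=" is required):
sc config MongoDB start= auto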

Word of caution
Make sure the arguments you give to mongod are valid. I initially forgot to specify the --dbpath argument, and because I wasn't using the default location, \data\db didn't exist on my machine and the service got stuck in a constant start/stop cycle when I tried to start it, with the mongo log file growing and growing each time. In the end, renaming the logpath directory on my machine was enough to kill the service completely and let me correct the mistake.

There is an open issue logged on mongodb.org for this: Error during service startup causes server to start over and over and over




Wednesday, 12 January 2011

SqlBulkCopy to SQL Server in Parallel

In an earlier post last year, I blogged about high performance bulk loading to SQL Server from .NET using SqlBulkCopy. That post highlighted the performance gain that SqlBulkCopy gives over another batched insert approach using an SqlDataAdapter. But is it possible to squeeze more performance out? Oh yes.

First, a quick recap. For optimal performance:
  • load into a heap table (with no indexes - add any indexes you need AFTER you've loaded the data)
  • specify the TableLock SqlBulkCopy option. This results in bulk update locks being taken on the destination table instead of row locks. This is key to what I'm going to focus on in this post, as I'll go on to explain. Relevant reading:
    - SqlBulkCopyOptions Enumeration MSDN documentation
  • pave the way for minimal logging by setting the database to the Bulk-Logged recovery model (or Simple). In the Full recovery model, all inserts are fully logged in the transaction log, whereas in Bulk-Logged certain bulk operations can be minimally logged, reducing the growth of the transaction log - SqlBulkCopy is one of the operations that can be. Relevant reading:
    - Prerequisites For Minimal Logging in Bulk Import MSDN reference
    - Operations That Can Be Minimally Logged MSDN reference
    - SQL Server Myth A Day : Bulk-Logged Recovery Model by Paul Randal (Twitter)
Get squeezing
How can we squeeze more throughput out of bulk inserts using SqlBulkCopy?

Multiple bulk update locks on the table from different processes can be taken out at the same time without blocking each other, which means you can scale out the bulk loading across multiple clients in parallel. There will be limitations of course, such as the network connection and the speed of the disks behind the database, but performance gains are very much possible.

Instead of loading data from a single machine, split the data over a number of machines, each one loading its own chunk of the data to the database using SqlBulkCopy as previously outlined. Finding an optimal distribution of data/number of clients may take a bit of investigation, but the rewards can pay dividends.

Pimp My Test Environment
Unfortunately I don't have a server rack at home housing multiple servers with a fast IO subsystem - hey, call me a cheap-skate. What I do have is a laptop with an i5-560M (2.67GHz dual core, 4 threads), 8GB RAM and a 7200RPM HDD. That's how I roll.

What I've done is to knock up a quick command-line app in .NET 4.0 using the new Task Parallel Library (TPL) to split the loading of data to SQL Server across 4 threads in parallel, each using a separate database connection and SqlBulkCopy instance. To quote MSDN:
The TPL scales the degree of concurrency dynamically to most efficiently use all the processors that are available. In addition, the TPL handles the partitioning of the work, the scheduling of threads on the ThreadPool, cancellation support, state management, and other low-level details. By using TPL, you can maximize the performance of your code while focusing on the work that your program is designed to accomplish.

Starting with the .NET Framework 4, the TPL is the preferred way to write multithreaded and parallel code.

I loaded 20 million rows of data from an in-memory DataTable to a heap table in SQL Server 2008 with 2 columns: FieldA VARCHAR(10), FieldB INTEGER.
FieldA is just a random sequence of 10 characters, FieldB is an incrementing number.

I ran 2 tests, recreating the database fresh before each run:
Test 1: 1 DataTable of 20 million rows, loaded single threaded. LOAD TIME: 15.4088814s
Test 2: 4 DataTables of 5 million rows, loaded in parallel. LOAD TIME: 12.3377056s

Given that this was all running on a single machine (a laptop at that), a 20% reduction in loading time is not too shabby at all! On production-quality hardware, with multiple clients separate from the database server, the concept of loading in parallel is well worth bearing in mind and could speed up your data loading process.

Snippets
To give a brief taster of the .NET approach I used, here's the code of the key parts.
// Requires: using System.Collections.Generic; using System.Data;
//           using System.Data.SqlClient; using System.Threading.Tasks;

public void ParallelLoadTest()
{
    // GetTestData() is just a basic method that returns a List<> of 4 DataTables.
    // Each DataTable is populated with 5 million rows.
    List<DataTable> tables = GetTestData();

    // For each of the 4 DataTables, kick off a bulk load task to run in parallel
    Parallel.ForEach(tables, table =>
        {
            BulkLoadData(table);
        }
    );
}

public void BulkLoadData(DataTable dt)
{
    using (SqlConnection conn = new SqlConnection("{removed connectionstring}"))
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null))
    {
        bulkCopy.DestinationTableName = "Dataload";
        bulkCopy.BulkCopyTimeout = 60;
        bulkCopy.ColumnMappings.Add("FieldA", "FieldA");
        bulkCopy.ColumnMappings.Add("FieldB", "FieldB");
        conn.Open();
        bulkCopy.WriteToServer(dt);
    }
}
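
For completeness, GetTestData() isn't shown above - it's nothing special. Here's a rough sketch of what it could look like (the row counts and column definitions just mirror the test setup described earlier):

private List<DataTable> GetTestData()
{
    Random random = new Random();
    List<DataTable> tables = new List<DataTable>();
    int counter = 0;

    for (int i = 0; i < 4; i++)
    {
        DataTable dt = new DataTable();
        dt.Columns.Add("FieldA", typeof(string));
        dt.Columns.Add("FieldB", typeof(int));

        // 5 million rows per table: FieldA = random 10 character string, FieldB = incrementing number
        for (int row = 0; row < 5000000; row++)
        {
            char[] chars = new char[10];
            for (int c = 0; c < chars.Length; c++)
            {
                chars[c] = (char)random.Next('A', 'Z' + 1);
            }
            dt.Rows.Add(new string(chars), counter++);
        }
        tables.Add(dt);
    }
    return tables;
}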

Monday, 10 January 2011

GROUPING SETS in SQL Server

Something I personally haven't seen a lot of out there in the SQL Server world is the use of GROUPING SETS - an operator that can be applied in a GROUP BY clause. So what does it do? How would you use it?

Take the AdventureWorks sample database as an example playground. Suppose you want to query the sales data to find the following:
1. total sales for each product
2. total sales for each product category
3. total sales

There are a number of ways you could do this. You could of course write 3 separate queries and then handle the 3 resultsets returned. Or, if you wanted a single resultset, you could use one of these approaches:

1. UNION ALL approach
SELECT d.ProductID, NULL AS ProductCategoryId, SUM(d.LineTotal) AS Total -- Total sales by product
FROM SalesLT.SalesOrderDetail d
GROUP BY d.ProductID
UNION ALL 
SELECT NULL, p.ProductCategoryID, SUM(d.LineTotal) -- Total sales by category
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY p.ProductCategoryID
UNION ALL
SELECT NULL, NULL, SUM(d.LineTotal) -- Total of all sales
FROM SalesLT.SalesOrderDetail d

2. WITH CUBE approach
SELECT ProductID, ProductCategoryID, Total
FROM
(
SELECT d.[ProductID], p.ProductCategoryID, SUM(d.LineTotal) AS Total
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY d.ProductID, p.ProductCategoryID
WITH CUBE
) x
WHERE x.ProductID IS NULL OR x.ProductCategoryID IS NULL

3. GROUPING SETS approach
SELECT d.[ProductID], p.ProductCategoryID, SUM(d.LineTotal) AS Total
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY GROUPING SETS((d.ProductID),(p.ProductCategoryID), ())

If you haven't used GROUPING SETS before, this GROUP BY clause may look a bit odd at first. So what's it doing? We're just defining the different groupings we want the results aggregated by, as a comma-separated list inside the GROUPING SETS operator. In this scenario, we're saying:
"Return the totals grouped by ProductId, and also the totals grouped by ProductCategoryID and then also the Grand Total (indicated by the final pair of empty brackets)."
You can define a grouping involving combinations of fields too. For example:
GROUP BY GROUPING SETS((d.ProductID),(p.ProductCategoryID), 
(d.ProductID, p.ProductCategoryID),())
which is equivalent to the WITH CUBE subquery part of the SQL given in approach 2 above.

As always, you should check the performance of the various approaches for your scenario - just because you can write a query one way with less code, it doesn't automatically mean it's going to be the best in terms of performance. Here are the stats for each of the 3 approaches above, based on a cold cache each time.
Approach 1 (UNION ALL): CPU=16, Reads=146, Duration=25
Approach 2 (WITH CUBE): CPU=32, Reads=422, Duration=31
Approach 3 (GROUPING SETS): CPU=31, Reads=422, Duration=28

UPDATE:
Thanks to Simon Sabin (Twitter | Blog) for the suggestion of trying these out on the full AdventureWorks2008 database to show how they perform on a larger dataset. The AdventureWorks2008LT database I used above is a simplified, smaller version of the full sample database - e.g. SalesOrderDetail contains 542 rows in the "LT" version of the db, compared to 121,317 rows in the full version.

So I ran the same set of tests again on AdventureWorks2008 - I had to make a few cosmetic changes to the queries due to slight differences in the table schemas, but the structure of the queries remained exactly the same. Here are the results:

Approach 1 (UNION ALL): CPU=312, Reads=3760, Duration=589
Approach 2 (WITH CUBE): CPU=141, Reads=1833, Duration=339
Approach 3 (GROUPING SETS): CPU=141, Reads=1831, Duration=331

Again, the WITH CUBE and GROUPING SETS approaches are all but equal. However, now the UNION ALL approach is the least performant.

Focusing on the UNION ALL vs the GROUPING SETS approach, when running the two queries in a single batch and displaying the execution plans, the query cost is split as follows:
UNION ALL 70% / GROUPING SETS 30%. The UNION ALL approach is hit by having to query data from the SalesOrderDetail table multiple times - so on a larger table, this cost grows. The GROUPING SETS approach, on the other hand, uses Table Spool (Eager Spool) operators to fulfil the query. To quote MSDN:
The Table Spool operator scans the input and places a copy of each row in a hidden spool table that is stored in the tempdb database and existing only for the lifetime of the query. If the operator is rewound (for example, by a Nested Loops operator) but no rebinding is needed, the spooled data is used instead of rescanning the input.

This is a great demonstration of why, in the real world, you should always check how a query will perform against production data volumes.

.NET Admin GUI for MongoDB

Having just started out fiddling around with MongoDB from .NET, finding a GUI over the top of MongoDB to help me out while I find my feet was one of the first things I looked for. I've been using the first tool I found: MongoVUE - there are undoubtedly others out there, but at the moment I've found no reason to continue searching.

Can it do everything I want it to?

Well, that would involve me knowing all the things I would want it to do - as I'm just finding my feet with MongoDB, I honestly don't know :) But so far, so good. It's proved to be a very handy tool.

It's free and just requires .NET Framework 2.0 SP1 (at the time of writing). If you're in the same boat/dinghy as me, then I recommend at least checking this tool out and evaluating it for yourself.

If you know of other MongoDB-related tools that you have used and can recommend, please feel free to ping me via a comment / twitter / email.

Friday, 7 January 2011

Getting Started with MongoDB and .NET

Getting up and running with MongoDB using C# is a very simple process. It literally takes a couple of minutes, including the time to download MongoDB itself and the C# drivers - so if you fancy having a play around with it to see what it's all about, there's no barrier stopping you.

Nutshell
To quote MongoDB.org:
MongoDB (from "humongous") is a scalable, high-performance, open source, document-oriented database.

Check out the MongoDB homepage for a highlight of the key features.

Step 1 : Download the latest MongoDB release from here
Extract the .zip file to somewhere memorable - C:\mongodb\ is as good a place as any. It's a good idea to rename the folder to just "mongodb" (unless you have an amazing memory), as the folder name will by default include all kinds of version info.

Step 2 : Download the latest MongoDB C# Driver from GitHub
Available as .msi or .zip file. So install or unzip as you see fit.

Step 3 : Create a data directory
By default, MongoDB stores data in \data\db - you need to create those directories yourself. Personally, I chose to put the directory under the mongodb folder - C:\mongodb\data\db\

Step 4 : Fire up the MongoDB server
In a cmd prompt, navigate to the mongodb\bin directory.
If you're using the default data directory, you run:
mongod
If you're using a different data directory, you need to specify that as an argument:
mongod --dbpath C:\mongodb\data\db\
You should see the server start up and log that it's waiting for connections - you now have MongoDB running.

Step 5 : Start coding
Create a new .NET project and add references to MongoDB.BSON.dll and MongoDB.Driver.dll from the C# driver installed in Step 2. By default, these are in program files under: MongoDB\CSharpDriver {VersionNumber}\

From this point, the C# driver tutorial is a great way to get started. Scan your eye over that and you'll have code interacting with MongoDB in no time at all. It really is quick and easy to get going with the basics. Within minutes, you can have written code to insert, find and delete documents, as well as getting comfortable with the API and seeing how it differs from working with an RDBMS like SQL Server.
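
To give a flavour, here's a minimal sketch of the kind of code you'll end up with, based on my recollection of the 1.x C# driver API - the "test" database and "people" collection names are just placeholders, and the tutorial is the definitive reference for the exact calls:

using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

class MongoHelloWorld
{
    static void Main()
    {
        // Connect to the local mongod started in Step 4 (default port 27017)
        MongoServer server = MongoServer.Create("mongodb://localhost:27017");
        MongoDatabase database = server.GetDatabase("test");
        MongoCollection<BsonDocument> people = database.GetCollection<BsonDocument>("people");

        // Insert a document
        BsonDocument person = new BsonDocument { { "Name", "Fred" }, { "Town", "Bedrock" } };
        people.Insert(person);

        // Find it again by name
        BsonDocument found = people.FindOne(Query.EQ("Name", "Fred"));

        // And delete it
        people.Remove(Query.EQ("Name", "Fred"));
    }
}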

So that's it. A very quick, whistle-stop guide to the basics of getting started with MongoDB from a .NET point of view.