Wednesday, 14 March 2012

Ordnance Survey Data Importer Coordinate Conversion Accuracy

Update 13 March 2013: Please see my latest blog post on the fix for this here.

Thanks to a comment on my original post about my project that imports Ordnance Survey CodePoint data into SQL Server, I was made aware of a potential issue with the (awesome) third-party GeoCoordConversion DLL I use to convert the Eastings/Northings coordinates supplied in the Ordnance Survey data files into Latitude/Longitude coordinates. The issue relates to an inaccuracy in the conversion process: specifically, some divisions are performed as integer divisions when they should be performed as double (floating-point) divisions.
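To illustrate the general class of problem (this is not the library's actual code, and the numbers are made up purely for illustration), here's a quick T-SQL sketch of how an integer division silently truncates where a floating-point division would not:

DECLARE @eastings INT = 651409;

-- Integer division: the fractional part is lost, so anything
-- calculated from this value inherits the truncation error.
SELECT @eastings / 7 AS TruncatedResult;       -- 93058

-- Forcing a floating-point division keeps the full precision.
SELECT @eastings / 7.0 AS FullPrecisionResult; -- 93058.428571...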

This blog post documents my analysis of that issue and tries to quantify what impact, if any, it has on the accuracy of the Eastings/Northings to Latitude/Longitude conversion. What I've done is:

  1. Imported the Ordnance Survey data set using the released version (1.0.0.0) of the GeoCoordConversion library (i.e. the version my project uses)
  2. Downloaded the source code for GeoCoordConversion and corrected the parts of the calculations where the raised issue noted an inaccuracy
  3. Reran the data import into a separate table using the "corrected" conversion code
  4. Ran the following query to try and quantify how much of a difference it actually makes:
;WITH CTEDiff AS 
(
    SELECT orig.OutwardCode, orig.InwardCode,
        orig.GeoLocation.STDistance(new.GeoLocation) Dist
    FROM PostCodeData orig
        JOIN PostCodeDataWithFix new ON orig.InwardCode = new.InwardCode
            AND orig.OutwardCode = new.OutwardCode
)
SELECT MIN(Dist) AS MinDifferenceInMetres, 
    MAX(Dist) AS MaxDifferenceInMetres, 
    AVG(Dist) AS AverageDifferenceInMetres, 
    SUM(CASE WHEN Dist = 0 THEN 0 ELSE 1 END) AS NumberOfDifferences
FROM CTEDiff
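For reference, STDistance on the geography type returns the distance in the linear unit of the instance's SRID, which for SRID 4326 is metres, so the figures below can be read directly as metres on the ground. A minimal standalone sketch (the coordinates are made up for illustration, not actual postcode locations):

DECLARE @before geography = geography::Point(52.4514, 1.7324, 4326);
DECLARE @after  geography = geography::Point(52.4525, 1.7324, 4326);

-- Distance in metres between the two points (roughly 120m here)
SELECT @before.STDistance(@after) AS DistanceInMetres;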

Results

The MAXIMUM distance between the 2 conversions for a given postcode is: 124.9 metres.
The MINIMUM distance between the 2 conversions for a given postcode is: 0 metres.
The AVERAGE distance between the 2 conversions for a given postcode is: 2.5 metres.
The TOTAL NUMBER of differences was 1,652,458 (out of 1,705,177 conversions).

This map shows the case that had the biggest difference (postcode: NR34 8JW). Plotted points are as follows:
(A) = position from conversion process before fix
(B) = position from conversion process with fix
(C) = position Google Maps gives for location
Note (B) and (C) are in the same spot, which is why you can't see both!



Summary

For me, and for what I see as the likely uses of this data, I personally don't see this as much of an issue, especially with an average difference of 2.5 metres. I will add a comment to the GeoCoordConversion project issue page referencing this post. There doesn't appear to be any activity on the project, though, so whether a fix will be applied remains to be seen. That's my preferred route at present: to see if the author of that project is around to apply a fix, as I don't currently have full knowledge of what changes I can make to the source code and what would be acceptable in order to distribute the amended source code as part of my project (my project is released under a different, but compatible, license... but things might get murkier when it comes to amending and distributing those amendments). In a nutshell, I just don't have time to try to make head or tail of what I can legitimately do (and I don't want to change the license for my project)!

All feedback welcome.

Monday, 5 March 2012

Quick win - check your table variable use

Quick wins are awesome. Making a change that takes minimal effort and yields a significant performance improvement is very satisfying.

This particular potential quick win relates to the use of table variables vs. temporary tables. Do you have a non-trivial stored procedure that produces some intermediate results, stores them in a table variable, and then uses that table variable later on in the procedure? Consider evaluating a switch to a temporary table instead. This very quick change can give a great performance boost for very little effort, due to the points outlined on MSDN (quoted below):

table variables are not supported in the SQL Server optimizer's cost-based reasoning model. Therefore, they should not be used when cost-based choices are required to achieve an efficient query plan. Temporary tables are preferred when cost-based choices are required. This typically includes queries with joins, parallelism decisions, and index selection choices.
Queries that modify table variables do not generate parallel query execution plans. Performance can be affected when very large table variables, or table variables in complex queries, are modified. In these situations, consider using temporary tables instead. ... Queries that read table variables without modifying them can still be parallelized.
Indexes cannot be created explicitly on table variables, and no statistics are kept on table variables. In some cases, performance may improve by using temporary tables instead, which support indexes and statistics.
I've seen cases where this alone, with no other modifications, results in noticeably better performance. This can then be built on further by adding supporting indexes. Of course, you should also consider whether you need temporary tables or table variables at all in your stored procedure; if you can remove the need entirely, then look to do that.
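As a minimal sketch of the switch (table and column names are made up for illustration):

-- Before: intermediate results held in a table variable
DECLARE @Totals TABLE (CustomerId INT, OrderTotal MONEY);

-- After: the same intermediate results in a temporary table, which
-- gets statistics and can be indexed to support later joins
CREATE TABLE #Totals (CustomerId INT, OrderTotal MONEY);

CREATE INDEX IX_Totals_CustomerId ON #Totals (CustomerId);

-- ... populate #Totals and join to it exactly as before ...

DROP TABLE #Totals;

Because the temporary table has statistics, the optimizer can make cost-based decisions (join order, parallelism, index usage) that it simply cannot make for the table variable.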

Further reading: