Thursday, 30 December 2010

Round up of 2010

2010 was my first year as a blogger, after finally pulling my finger out back in February. I learnt an early lesson in naming/tagging posts appropriately when my first search engine hits came in via the term "pull my finger". Not exactly the kind of search term I was hoping would bring people to my site - things have improved since then and I've headed in the right direction, with a focus on SQL Server and .NET related posts.

Top of the Pops

The top 3 posts in terms of views have been:
1. SQL Server 2008 - Table Valued Parameters vs XML vs CSV
2. High performance bulk loading to SQL Server using SqlBulkCopy
3. Sorting a DataTable - LINQ performance

In terms of feedback/comments, the top post was: Rise of the SQL Server DevBA. I particularly enjoyed writing that one as it's something I feel strongly about, so it was fantastic to receive the feedback on it.

I Came, I Saw, I Learnt

2010 was also a year where I set myself a resolution to get out there and attend more conferences and community events, and what a great year it was in that respect.

March: QCon London
April: SQLBits 6
June: SQL Server Master Class
December: First SQL Server Southampton UserGroup

Add to that a handful of other SQL Server/.NET usergroup attendances in London, Reading and Southampton and it all adds up to a successful year for that particular goal. Through these various events, I've had the pleasure of putting a number of faces to names. It hasn't always been easy - trying to play "spot the tweep" while hovering next to a breakfast buffet table was at times challenging: a) because there was food at hand to focus on and b) because trying to recognise someone you follow on Twitter based on their profile pic can be difficult. Being recognised myself for the first time was a nice surprise - thanks to Dave Ballantyne (Twitter | Blog) for being my first "spotter" at the SQL Server Master Class! (Sorry, no prize)

Who's The Daddy!

Aside from professional goals, 2010 was also an amazing year on a personal note as I became a dad for the first time to a beautiful baby girl. That is my excuse for a decline in blog posting rate in the latter part of this year!

Group hug

Community, community, community. That's where it's at. I've learnt a lot from the community and been lucky enough to be a part of some awesome events, something I very much look forward to more of in 2011. One of my goals will definitely be to put more faces to names. Thanks to all who have viewed, commented on or RT'd my blog posts and of course thanks to all those out there who have provided first class blog posts which I have enjoyed following (too many to name!). Happy New Year!

Thursday, 23 December 2010

SqlBits 8 - Upcoming SQL Server event

So SQLBits 8 has been announced and is going to be beside the seaside (so maybe it should be called SeaQLBits) in that there Brighton at The Grand, from Thursday 7th April to Saturday 9th April 2011. If you're reading my blog, then there's a pretty strong chance you use SQL Server in which case without a doubt, SQLBits will have a lot to offer you. Chalk it down in your calendar.

Not that you should need any convincing, but check out my post on SQLBits 6 from April this year, and of course the SQLBits site for full details on the upcoming one. Unfortunately I was unable to make SQLBits 7 due to circumstances, so no post from me on that, but the session videos/slides etc. are available so check those out too.

Sunday, 19 December 2010

SQLSoton UserGroup Jan 2011

After the success of the first ever SQL Server Southampton UserGroup this month, the next is planned for Wednesday 19th January 2011, with Christian Bolton (Blog | Twitter) doing a talk on "Exploring SQL Server's Architecture", followed by SQL Supper - a chance to eat pizza and discuss anything SQL Server related. Got a niggle at work you'd like 2nd/3rd/nth opinions on? Perfect opportunity. Just want to network and have a casual chat? Ditto.

The full agenda and location info can be found over on SqlServerFAQ.com where you can register for the meetup. Even if you can't make it this time, please spread the word!

Thursday, 9 December 2010

SQL Server Southampton UserGroup

Last night was the first SQL Server Southampton UserGroup (#sqlsoton for those on twitter), organised by Mark Pryce-Maher (Twitter | Blog). It's a tough job getting a local usergroup up and running, especially when it's hard to reach out to local professionals and get a firm idea of who's going to come, so credit to Mark for making it happen.

Cap'n, we have no power

Despite projector/power lead problems (and there being a million and one kettles in the hired room, none with a suitable lead!), it was a great first gathering. Matt Whitfield (Twitter | Blog) kicked off with a session on "SQL CLR Demystified", a look into CLR integration within SQL Server and how it can be used to implement CLR types, stored procedures, functions, aggregates and triggers, with examples of how it can offer better performance than pure TSQL. A great start to what will hopefully become a regular local meetup. Job well done.
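
To give a flavour of what that looks like on the SQL Server side (this isn't from Matt's talk - just a minimal sketch, with a hypothetical assembly path, class and method name), a compiled .NET assembly is registered and then surfaced as a regular T-SQL object:
-- Enable CLR integration on the instance (it's off by default)
EXEC sp_configure 'clr enabled', 1
RECONFIGURE
GO
-- Register the compiled .NET assembly (path and name are hypothetical)
CREATE ASSEMBLY MyClrLibrary
FROM 'C:\Assemblies\MyClrLibrary.dll'
WITH PERMISSION_SET = SAFE
GO
-- Expose a static method from that assembly as a normal scalar function
CREATE FUNCTION dbo.RegexIsMatch (@Input NVARCHAR(MAX), @Pattern NVARCHAR(4000))
RETURNS BIT
AS EXTERNAL NAME MyClrLibrary.[MyClrLibrary.StringFunctions].RegexIsMatch
GO
-- From here it's called like any other function
SELECT dbo.RegexIsMatch(N'SQLSoton 2010', N'[0-9]{4}')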

It didn't form part of his talk, but having recently tried out one of his free SQL Server related tools (Schema Surf), I think it's worth a shout-out to the tools available from Atlantis Interactive. A number of them are free, so give them a try - I've only had time to try out Schema Surf so far, but found it to be a very handy tool that can really improve productivity, especially if you're working with an unfamiliar database. Good stuff.

Pizza supplies arrived - an essential part of any usergroup - followed by informal discussions, including one, initiated by Mark, on some nice looking features of SQL Server Denali.

Support your local groups

For me, it was good to put a real face to those I know through reading blogs and twitter and who I haven't met in person before, namely Christian Bolton (Twitter | Blog), Andrew Fryer (Blog | Twitter) and Mark Pryce-Maher.

Usergroups like this are a great way to learn new things and get involved with the community around you. Personally, I really enjoy the informal nature of it and the chance just to meet up and have some banter as much as the technical session aspect. It's always interesting to see how other people are using SQL Server, to what extent, and to hear their experiences. The benefit of this kind of get-together with like-minded people was well and truly demonstrated by the fact the usergroup overran by around 35-40 minutes - that's how immersed we became. A big thumbs up in my book.

So, if you don't already attend a local usergroup, why not give it a try?

Thursday, 21 October 2010

SQLBits 6 videos available

First tweet I saw today was a great one - the SQLBits 6 videos are now available. Good news from Simon Sabin (Blog | Twitter) - thanks once again to him and the whole SQLBits team.

If you were there, then no doubt you're like me and keen to either catch the sessions you didn't attend or re-watch some sessions you did attend.

If you weren't there, then even more reason to check them out (list of sessions). You might also want to check out my blog post about the conference from back in April.

Wednesday, 20 October 2010

SSMS Tools Pack

Today I downloaded and installed the SSMS Tools Pack, a free add-in for SQL Server Management Studio developed by Mladen Prajdić (Blog | Twitter).

Late to the party...and not even fashionably late

Yes I know, I know - everyone uses the SSMS Tools Pack and I'm well aware I'm late to the party! Fail whale on my part. It's one of those things I bookmarked a long time ago and just never got round to actually trying out. Lesson being: if something looks good, spend two minutes giving it a quick try as soon as you first find out about it. If it does turn out to be good (and this add-in is awesome), that time will repay itself in greater productivity - and will prevent the "I should have done this earlier" facepalm.

So if you are even later to the party than I am, check out the feature list and then head on to download it. Personally, the support for snippets and window connection colouring has had an immediate impact for me.

Cheers Mladen - awesome work!

Monday, 18 October 2010

Long time no post

It's been a while since my last blog post, with good reason as September saw the arrival of my first child - a beautiful baby girl. It did mean I had to miss SQLBits 7 up in York, but of course that pales into insignificance compared to becoming a Dad for the first time! Besides, there's always SQLBits 8...(*cough* bring it to the south coast! *cough*). Now I'm back at work after paternity leave and starting to get back into the swing of things.

Southampton SQL Server User Group, interested?

Last week, I saw a tweet by @tsqltidy trying to gauge interest in an SQL Server UserGroup in Hampshire. This is something I thought about earlier this year, so I'm very keen to show my support as I think this area is calling out for something more local. I have travelled up to the Microsoft Campus in Reading to attend a UG there, but it would be great to have one closer to home. If you're in Hampshire and think you'd be interested, please show your interest! Either ping me via email/twitter (@AdaTheDev) or, if you're on LinkedIn and in the "UK SQL Server User Group" group, there's a discussion in there started by Neil Hambly (@Neil_Hambly). I for one am hoping that the birth of my daughter will be followed by the birth of a new local SQL Server User Group!

Tuesday, 10 August 2010

Passing a TABLE variable into dynamic SQL

A question popped up on StackOverflow today, asking how to pass a TABLE variable into a dynamic SQL statement in SQL Server 2008. I've previously blogged about table-valued parameters, comparing the approach of passing in a TABLE of values to a stored procedure to the techniques you'd have had to use in earlier versions of SQL Server (e.g. CSV or XML), but this specific question is worth a quick follow-up.

Passing a TABLE variable into a dynamic SQL statement is straightforward:
-- First define a custom TABLE type
CREATE TYPE IntegerTableType AS TABLE (ID INTEGER)
GO

-- Fill a var of that type with some test data
DECLARE @MyTable IntegerTableType
INSERT @MyTable VALUES (1),(2),(3)

-- Now this is how you pass that var into the dynamic statement
EXECUTE sp_executesql N'SELECT * FROM @MyTable', 
    N'@MyTable IntegerTableType READONLY', 
    @MyTable

Job done.

Thursday, 5 August 2010

The importance of being earnest

I've been learning and trying out a lot of new things recently and it got me to thinking just how important it is in this business, to continue pushing yourself and adding more strings to your bow. It can be easy to stay within your comfort zone and go with the flow, but there (usually) comes a time when you have to branch out for one reason or another.

Stand-ups, Tomatoes and a date(-abase) with Cassandra
Over the past week we've started doing daily stand-ups at work - still in the early stages, but it already feels like I've been doing it for longer. Then there's the Pomodoro Technique, a time management approach I've started trying. Again, early days, but my initial experience has been very positive - I'm planning on swotting up for some SQL Server exams when I have time, and I think this will come in very handy for that, so it's not just something I'll be using at work.

Something that has taken a lot of my time recently has been research into "NoSQL" technologies (or better described, "NoACID"). This is something that is outside of what I'd call my comfort zone - SQL Server. I've spent my time trying to learn more about all the different options out there to see what could provide a good fit to our scenario/requirements - Cassandra, MongoDB, CouchDB, RavenDB, HyperTable, HBase....to name but a few.

But what I've found is that the time spent learning about these options is also pushing me to learn more about SQL Server. There's always something more to learn in the world of SQL Server, and the community around it is awesome - so I'm using the challenge of researching NoSQL as a way to challenge my knowledge of SQL Server, as that is a key part of my skillset that I'm always looking to maximise.

Questions, questions, questions
Answering questions, originally mainly on Experts Exchange and now mainly on StackOverflow, not only gives me the opportunity to help others but is also a great way to learn more myself. Another important factor for me is that it serves as a great record of what I have to offer technically. Everything is there for all to see - the good answers AND the not-so-good answers that I've gone on to learn from. I like the "honesty" of having it there, unfiltered.

What is my point?
This is a bit of a rambling, off-the-cuff blog post, but there is a point. Being earnest, working hard at your career and trying to contribute to the wider community are important aspects of being in this line of work. I'm a firm believer in trying to maximise the value I can offer to an employer. Trying to make yourself stand out from the crowd, especially in the recent economic climate, can stand you in good stead. I'm sure everyone has their own dream job at a dream company - so maximise your chance of getting there. It may take a lot of effort and a lot of time, and the dream role may not exist or the opportunity may never arise, but it'd at least be an educational journey to go on, and you can help others along the way.

It takes time to build up a reputation and get yourself known, so start early and stick with it.

Monday, 5 July 2010

Getting started with Cassandra and .NET

Over the past couple of days, I've started playing around with Cassandra in an effort to satisfy my curiosity and see what it's all about. In terms of databases, SQL Server is my main skill - I first started using it nearly 10 years ago, and will always remember the first book I read cover to cover on it: Inside Microsoft SQL Server 6.5 by Ron Soukup. So Cassandra is a step outside of my comfort zone; something new and a little alien to me....a challenge. The aim of this post is mainly to dump my early thoughts, key points and useful links - basically an overflow area for my brain; the things I've found to be important/helpful resources.

Along with SQL Server, my other core technology is C#. So what I'm trying to do is get up and running with Cassandra and knock up some C# to talk to it. Just something basic will do - a "Hello Cassandra" app. This isn't about the pros/cons of NoSQL, or SQL Server vs Cassandra vs anything else, and it's definitely not a guide on best practices for the subject (I'm hardly an expert after 2 days). End of disclaimer.

First things first, what is Cassandra?
It's a highly scalable, eventually consistent, distributed, structured key-value store.
Check out the WIKI on apache.org.
Also a great reference here on the architecture overview.

Installing
You can install it on any OS for which there is a Java runtime. Windows is what I'm comfortable with, so that's the host OS I used (XP to be exact - yes, my laptop is old-school, cue violins). Future steps will be to get more comfortable with Linux and have it running on a number of VMs.

Great guide for installing Cassandra by Nick Berardi (Twitter): Cassandra Jump Start For The Windows Developer. Gives instructions for Windows and Linux.

Data model - key terminology
Cluster, ColumnFamily, Keyspace, SuperColumn - sounds strange. A read through of the DataModel WIKI page is a must.

Thrift
From the WIKI:
Thrift is a software framework for scalable cross-language services development. Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages.
You can write your code directly against the Thrift API to Cassandra, but for beginners (like me) it's definitely worth downloading a high-level client. There are plenty available for a wide variety of languages including C# / .NET, Python, Ruby, Java and more.

For .NET, I originally went to try Fluent Cassandra (Github), which is being developed by Nick Berardi - after all, his site gave me the guide to get up and running, plus he has a number of useful Cassandra articles from a .NET perspective. However, I currently don't have .NET 4.0/VS2010, so to get something going quickly I looked at Aquiles (Codeplex), which requires .NET 3.5 or above.

That got me to the point where I am now - able to connect to Cassandra from .NET. Nothing shiny, nothing flash. But the main point is the links I've collated in this post are the ones I've used to get going with the basics.

Thursday, 1 July 2010

Validating an SQL query programmatically, with gotcha

If you want to check the validity of a TSQL statement programmatically, you can make use of the SET NOEXEC statement. As this MSDN reference states, SET NOEXEC ON will compile the query but won't actually execute it. This is ideal (well, nearly....) if perhaps you have a dynamically generated statement that you want to check is valid before executing it.

Example time
SET NOEXEC ON 
GO
SELECT TOP * FROM sys.tables
GO
SET NOEXEC OFF
GO
This results in the following error, as I've failed to specify the number of rows to return for the TOP:
Msg 102, Level 15, State 1, Line 1
Incorrect syntax near '*'.
So if we correct the statement, then try again:
SET NOEXEC ON
GO
SELECT TOP 10 * FROM sys.tables
GO
SET NOEXEC OFF
GO
This time, as we'd expect, we get:
Command(s) completed successfully.
Gotcha
However, it appears that in SQL Server 2008, SET NOEXEC does not work exactly as documented. The documentation states it will validate the syntax and object names. As shown above, it did validate the syntax, but I found it didn't actually raise an error when invalid objects were referenced.
e.g.
SET NOEXEC ON
GO
SELECT * FROM NonExistentTable
GO
SET NOEXEC OFF
GO
This comes back as successful, despite the table not existing.
SQL Server 2005 DOES behave as documented, and throws error 208 - "Invalid object name".
I've raised a connect bug report - if you can reproduce it in SQL Server 2008, please add your input to that bug report. If anyone is using SQL Server 2008 R2, I'd be interested to know the behaviour in that - drop me an email or add as a comment below.

Another statement worth looking at is SET FMTONLY which, as per MSDN, returns metadata for the query without actually running it. This does throw error 208 when referencing an invalid table name. If the query is valid, it returns the metadata of the resultset, which is handy if you want to test the response format without executing the statement.
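
A minimal sketch of that, reusing the same NonExistentTable example from above:
SET FMTONLY ON
GO
-- Invalid object name: this does throw error 208, unlike SET NOEXEC ON in SQL Server 2008
SELECT * FROM NonExistentTable
GO
-- Valid query: returns an empty resultset containing just the column metadata
SELECT name, object_id FROM sys.tables
GO
SET FMTONLY OFF
GO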


Friday, 18 June 2010

The SQL Server MasterClass Experience

SQL Server MasterClass 2010 was a very high quality, one day event presented by Paul S. Randal (Twitter | Blog) and Kimberly L. Tripp (Twitter | Blog) - not that they need any introduction of course. In fact if you don't know of them, then perhaps you should close down your SSMS window and switch off your machine - after reading my blog of course :). Maybe uninstall SQL Server too...

The A-Team
With two of the best SQL Server trainers in the industry offering a packed schedule of sessions, it was an event I'd been looking forward to attending since first hearing about it and it certainly lived up to the expectations.

That DevOps Thing
The keynote on "Bridging the Gap Between Development and Production" was a great start, reminding me of the "Dev and Ops" track from QCon London earlier this year (check out Sam Newman's talk: From Dev to Production Through Build Pipelines and Teamwork). Developers should appreciate DBA perspsectives, and DBA's should appreciate developer perspectives. Bring the two sides together, work more closely, share knowledge and think and behave as a team. It was a captivating session, and one that really struck a chord with what led me to blog about the "Rise of the SQL Server DevBA" back in March. Prevent pain later, by thinking ahead about how things will perform in a production environment with production-sized data volumes is a must. But don't just "think" about it, test it - and no, testing it out on a development database with a small amount of data does not qualify as testing!

Who ya gonna call? Mythbusters!
A session on busting SQL Server myths followed. I recommend checking out the "misconceptions" category of posts in Paul's blog here. Go on, you may just find you have a misconception about an aspect of SQL Server! It was clear long before this point in the day, that this event was extremely good value for money and the quality was truly top-notch.

After a couple of platinum partner presentations and lunch, the next session was on database recovery techniques. For me, as a developer, this was one I expected to be less relevant. However, I was surprised by just how engrossed I was in the topic and the demonstrations. That's a great credit to Paul and Kimberly's effortless and engaging presenting style.

The GUID, The Bad and The Ugly
The penultimate talk of the day was on GUIDs and how you really need to understand the implications of using them as a primary key, especially if it's a clustered key. Random keys like this are just screaming out to cause performance problems. Think about it - every time you insert a new record with a randomly generated key, the position of that row in the clustered index will be random; it won't just be added on to the end like it would if you had a monotonically increasing value like an IDENTITY. So where that record needs to go, there might not be space, meaning SQL Server has to create space by performing page splits. This can add a significant amount of extra work and load, degrading performance markedly. Then there's the fact that the GUID key will be present in each and every nonclustered index too - and it's hardly the smallest of keys (16 bytes for a GUID compared to 4 bytes for an integer, for example). This topic was something I investigated a while ago, when I wanted to demonstrate the issues you can get with random keys - I spent some time monitoring the Page Splits performance counters as one way of visualising the impact. So this was another session that really struck a chord.
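
As a rough illustration of the comparison (not from the session - just a minimal sketch with hypothetical table names), the two contrasting key choices look like this, and one quick way to see the effect of the resulting page splits after loading data is to compare index fragmentation:
-- Random 16-byte clustered key: new rows land at random positions, causing page splits
CREATE TABLE GuidKeyed
(
    ID UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    Payload CHAR(200) NOT NULL,
    CONSTRAINT PK_GuidKeyed PRIMARY KEY CLUSTERED (ID)
)

-- Monotonically increasing 4-byte clustered key: new rows are simply appended to the end
CREATE TABLE IntKeyed
(
    ID INTEGER IDENTITY(1,1) NOT NULL,
    Payload CHAR(200) NOT NULL,
    CONSTRAINT PK_IntKeyed PRIMARY KEY CLUSTERED (ID)
)

-- After bulk inserting the same volume of rows into each, compare the fragmentation
-- that the page splits have caused
SELECT OBJECT_NAME(ips.object_id) AS TableName, ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
WHERE ips.object_id IN (OBJECT_ID('GuidKeyed'), OBJECT_ID('IntKeyed'))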

A final session on Essential Database Maintenance closed off a fantastic MasterClass from the best in the business. To pick out just one nugget, check out Paul's blog post on I/O error 825 - if you're not monitoring for this read-retry error already, then make sure you do as it can flag up an impending I/O subsystem failure earlier, giving you time to do something about it!

A few of us headed down to the bar for drinks afterwards (very well deserved for Paul and Kimberly I should add!), and it was a pleasure to meet them both and give my thanks for a superb day. I really enjoyed chatting to everyone; it topped off the day perfectly.

I would jump at the chance to attend future talks by Paul and Kimberly - if you get the chance, I cannot recommend it highly enough. Awe-inspiring knowledge, shared in an effortless and enjoyable way, with the two of them bouncing off each other. It was also a great opportunity to put faces to some of the names of other SQL tweeps - even if I did suck at playing the "try to recognise people from their Twitter pic" game.

Monday, 14 June 2010

SQL Server XML datatype with CDATA

So today I learnt something new - it turns out the XML datatype in SQL Server does not preserve CDATA sections.
e.g.
DECLARE @XML XML
SET @XML = '<Test><NodeA><![CDATA[Testing cdata section <woop!>]]></NodeA></Test>'
SELECT @XML

Results
------------------------------------------------------------------
<Test><NodeA>Testing cdata section &lt;woop!&gt;</NodeA></Test>
After a quick dig around, I found this MS Connect case. Personally, I'd like to see it accept whatever you pass in without silently altering it - as long as it's well-formed XML, of course.
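
If preserving the CDATA section byte-for-byte really matters, one workaround I'd consider (my own suggestion, not something from the Connect case) is to store the raw document as NVARCHAR(MAX) and only cast it to XML at the point you need to query it:
DECLARE @RawXml NVARCHAR(MAX)
SET @RawXml = '<Test><NodeA><![CDATA[Testing cdata section <woop!>]]></NodeA></Test>'

SELECT @RawXml                -- CDATA section preserved exactly as supplied
SELECT CAST(@RawXml AS XML)   -- CDATA stripped and special characters entitised, as shown above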

Wednesday, 26 May 2010

Optimising wildcard prefixed LIKE conditions

Suppose you want to write a query to find all products in your database that have a name beginning with "Long-Sleeve". You'd more than likely use something like below (examples based on AdventureWorks LT sample database):
SELECT name 
FROM SalesLT.Product
WHERE name LIKE 'Long-Sleeve%'
This produces a good execution plan, performing an index seek on the nonclustered index that exists on the name column. If you look at the seek operation, you'll see the query optimiser has done a good job of ensuring an index can be used by looking at the seek predicate:



You'll see that what it's actually done is convert the LIKE condition into a ">=" AND "<" range query, searching for product names where the name >= "Long-Sleeve" and is < "Long-SleevF".

Now, suppose you want to find all products where the name ends with "Tire Tube". The immediate query that springs to mind is:
SELECT name
FROM SalesLT.Product
WHERE name LIKE '%Tire Tube'
The execution plan shows an index scan like below. Unfortunately, it's not ideal as it's not making good use of the index.


A way to optimise this scenario is to flip the LIKE condition on its head, so the wildcard comes at the end of the string being matched on. If you reverse the product name, and reverse the string being matched on, you get the same results.
i.e.
SELECT name 
FROM SalesLT.Product
WHERE name LIKE '%Tire Tube'
gives the same results as:
SELECT name
FROM SalesLT.Product
WHERE REVERSE(name) LIKE REVERSE('%Tire Tube') -- or in other words, LIKE 'ebuT eriT%'
Now, as it stands, that isn't any better as it still can't use an index. But add a computed column to hold the reversed name, and then add an index on that, and you can soon see the benefits.
-- Add a NameReversed column that is the Name column, but reversed
ALTER TABLE SalesLT.Product
ADD NameReversed AS REVERSE(Name)

-- Now index that reversed representation of the product name, and INCLUDE the original Name column
CREATE INDEX IX_Product_NameReversed ON SalesLT.Product(NameReversed) INCLUDE (Name) 
Run 2 queries side by side to compare like below:
SELECT Name FROM SalesLT.Product WHERE Name LIKE '%Tire Tube'

SELECT Name FROM SalesLT.Product WHERE NameReversed LIKE REVERSE('%Tire Tube')
Note I'm keeping the use of REVERSE just to make the condition a bit more readable for demonstration purposes.

The optimised, indexed computed column approach is using an index seek, compared to the original way that uses the index scan as shown below.


Trying this on a slightly bigger table (the AdventureWorks sample database Product table has only about 800 rows) highlights the difference further. In a table with approximately 50,000 rows, the cost of the original attempt (relative to the batch) showed as 98%, leaving the relative cost of the computed column approach at 2%. The execution stats showed:
Original Approach
Cold cache: CPU=62, Reads=138, Duration=95
Hot cache: CPU=63, Reads=124, Duration=65

Indexed Computed Column Approach
Cold cache: CPU=0, Reads=28, Duration=28
Hot cache: CPU=0, Reads=2, Duration=0

Now imagine how this optimisation would scale up on a larger table with millions of rows...

As per usual, you need to consider the specifics of your situation, as adding extra indexes will add to database size and mean there is more work to do when performing INSERTs/UPDATEs. But if you identify a performance problem with a similar scenario, this approach is at least something to consider.

Monday, 24 May 2010

QCon London 2010 Videos

It's been a couple of months now since QCon London (check out my previous blog post on The QCon London 2010 Experience) and some session videos are starting to make their way out on InfoQ. With a lot of great content, they're well worth checking out.

I've re-watched the sessions I attended and enjoyed them as much the second time round as I did the first time.
The State of the Art on .NET by Josh Graham and Amanda Laucher.
Code Leaders and Beautiful Teams by Roy Osherove.
Testing C# and ASP.NET Applications Using Ruby by Ben Hall.

One session I didn't get to attend, as I chose to stay on a different track, was From Dev To Production Through Build Pipelines and Teamwork by Sam Newman. He has a style of talk that I really enjoy, so in hindsight I would have liked to have attended that one.

Update (more session videos available):
Scale at Facebook by Aditya Agarwal.
Bad Code, Craftmanship, Engineering and Certification by Robert C. Martin.
The Wizardry of Scaling by Oren Eini (aka Ayende Rahien).
Save the Day with Noda Time by Jon Skeet.

So, check'em out, share them with your team, and keep an eye out for the others as they are made available. Chop chop!

Monday, 19 April 2010

The SQLBits VI Experience

I had high expectations of SQLBits VI and it certainly lived up to them! Not only did it have great speakers, who really know their stuff and have an effortless way of getting the information across, but it was also free. So it goes without saying that it was a popular event. There were 4 sessions running at any one time, with a theme of performance and scalability and plenty of choice across the DBA, Dev and BI areas. It did mean some tough decisions had to be made about which ones to attend!

Monitoring and Tuning Parallel Query Execution

The first session I attended was held by Ramesh Meyyappan on the DBA track (you can see some of his webcasts here). I chose it because, out of the 4 opening sessions, it seemed the most relevant to my day job. If I'm honest, I wasn't expecting to get much from the talk beforehand; not because I thought I knew it all - quite the opposite in fact, as I didn't understand some of the terms in the overview (CXPACKET, SOS_SCHEDULER_YIELD...you what? what's a doobry-wotsit-dicombobulater?). Nor did I appreciate the significance of the topic. These were terms I knew of before, but didn't actually know about - probably because I felt they were too far outside my comfort zone, so when I did come across them in the past, I didn't dive down into them to really find out what they meant.

What a great session it was. Ramesh certainly knows his stuff and I'm glad I opted for it. What was great was the way he shared his knowledge - his ability to explain things so clearly to people like me who were learning new things was superb, especially given the time constraints. I now have a greater appreciation of MAXDOP, CXPACKET, SLEEP_TASK and SOS_SCHEDULER_YIELD - seeing live demonstrations gets a big thumbs up from me, and seeing real examples of where running queries in parallel can be slower was very beneficial. CDs were given out with all the examples on, so I'll be firing those up soon!

Inside the SQL Server Query Optimizer

Conor Cunningham, Principal Architect at Microsoft, was one of the must-see speakers for me and one I expected to draw a big crowd. He's a natural speaker with an informal style, and gave an entertaining talk which I rate extremely highly. "DMVs basically exist because I was p*ssed off" was a particularly entertaining moment. Hearing his insight into how the Query Optimizer works definitely made it a session well worth attending, as proved by the size of the audience - the room was packed. "Snug" was already becoming a theme word for the day!

Note to self: get on with reading "Microsoft SQL Server 2008 Internals" which is currently sitting on my desk.

Performance Monitoring and Tuning - Tips and tricks

Another DBA session by Ramesh Meyyappan that didn't disappoint, and another one on my list to watch again when the recordings are available.

At this point in the day, 3 out of 3 sessions I'd attended had been on the DBA track. This just further emphasised something I previously blogged about: Rise of the SQL Server DevBA. As an SQL Server developer, I found so much value in these sessions. If you want to build performant, scalable database systems, you can't be blind to these topic areas.

Time to switch to Dev sessions

After lunch, I attended 2 Dev sessions: "Denormalisation - Having your cake and eating it" by Mark Whitehorn & Yasmeen Ahmed, followed by "Improving Database Performance by removing the Database" by Simon Munro. Denormalisation is a topic I'm comfortable with and the session was all theory; the other session covered a theme I'd followed at QCon recently (as I previously blogged about). So I personally didn't take as much away from these sessions, but that was something I should have anticipated from the session abstracts. I did enjoy some of Simon's slides in particular, which prompted a few LOLs.

Icing on the cake

The final session of the day for me, was on the BI track: "Designing and Tuning High Speed Data Loading", held by Thomas Kejser from the SQL Customer Advisory Team. I really enjoyed his talk on the various techniques for quickly loading data into SQL Server and on minimally logged operations. Another session I want to watch back again as it topped off a brilliant day.

SQLBits - you know it makes sense

This was the first SQLBits I've attended and if SQL Server is your thing, I completely recommend you try to make it to future ones. I know I wouldn't hesitate to attend again if I have the chance. It's a great opportunity to sponge up more knowledge and meet up with other SQL Server professionals. I had a number of good chats with other attendees, in particular SQL Server tweeps John Samson (Twitter) and Vivekanand Serou (Twitter).

Overall, this kind of event is great for broadening your horizons. Never be satisfied just with what you already know - go out there and push to learn more. Sometimes the most valuable nuggets of information take you by surprise!

Keeping my eyes peeled for the next event to attend!

Wednesday, 14 April 2010

SqlBulkCopy ColumnMappings Mismatch

"The given ColumnMapping does not match up with any column in the source or destination"

This error caused me more head-scratching than it should have done today. I was using SqlBulkCopy to bulk insert some data into an SQL Server table from a .NET DataTable - something I've done before, it's not rocket science. But I hit this error when calling WriteToServer and it took me a while (longer than it should have, really!) to track down the problem. The ColumnMappings I'd specified, to map the data columns in my DataTable to the underlying database table columns, all seemed to tally up - each column did exist in the source DataTable and the destination SQL Server table. I was expecting to find I'd made a stupid mistake and just couldn't see the wood for the trees.

ColumnMappings are case-sensitive

The problem was that the destination column name I'd given for one of the column mappings did not match the case of that column's name in the database. Once I'd eventually identified the problem (an "i" character should have been "I" to exactly match the database schema), it was a bit of a face-palm moment. I don't see any reason why it should be case sensitive and believe it shouldn't matter. A bit more digging led me to this open Microsoft Connect bug report from some years ago.

I won't be tripping over this again!

Tuesday, 6 April 2010

SQL CAST to VARCHAR without size

What's the result of this, in SQL Server?
SELECT CAST(12345 AS VARCHAR)
As you expect, it's "12345". How about this?
SELECT CAST(1234567890123456789012345678901 AS VARCHAR)
You may be expecting it to return "1234567890123456789012345678901" but it will in fact throw an error:
Msg 8115, Level 16, State 5, Line 1
Arithmetic overflow error converting numeric to data type varchar.
It's all because a length was not given for the VARCHAR to convert the value to. By default, as documented on MSDN, if no length is specified it will default to 30 when using CAST or CONVERT - hence the error.

This actually tripped me up a few years back, and I think it could easily trip others up. The way to avoid it is to make sure you explicitly specify the length of a VARCHAR (or CHAR) everywhere - this is of course good practice anyway, but sometimes it's too easy to quickly write a CAST or CONVERT without giving proper thought to the size.

It's also worth noting that when declaring a VARCHAR variable without defining the length, it will default to 1. This is a more dangerous scenario as it can go unspotted, because no error is thrown. e.g.
DECLARE @Data VARCHAR
SET @Data = '123'
SELECT @Data 
Returns: "1"

Just in case you find yourself trying a quick CAST or CONVERT to a string without specifying the size, bear this in mind!

Sunday, 14 March 2010

The QCon London 2010 Experience

QCon London 2010 - what a conference!

Awesome...with a capital AWE

So many inspiring speakers, a packed schedule of talks over 3 days across a number of tracks, and a great bunch of people. This was my first QCon, and I cannot recommend the experience enough. The only thing I didn't enjoy was having to decide which talks to attend - but that's the true sign of a good conference. I'm finding it difficult to write a blog post as my mind is still buzzing and there was a lot to soak up. In all honesty, I won't do justice to the talks if I try to paraphrase them too much, so I'm just going to summarise my experience and some key points I've left with.

Getting the ball rolling

The opening keynote by "Uncle Bob" Martin (twitter) drew some mixed comments on twitter from what I saw, but the main things I took from the talk were about pride of workmanship. After all, if you're not proud of the code you've written and you hope noone finds out it was you who wrote it, then you should probably think about improving it so you can take pride in what you've done. He's definitely a seasoned, engaging speaker, regardless of what your own personal opinions are on the topics at hand, and I enjoyed the talk.

Day 1 - Architectures you've always wondered about

The sessions that particularly grabbed me were the ones by Facebook's director of engineering (Aditya Agarwal) and Skype's architecture team lead (Andres Kütt). The level of scale talked about was mind-blowing. To give some idea, here are some of the points mentioned during Facebook's talk:
  • people spend 8 billion minutes per day on the site
  • 5 billion pieces of shared content per month
  • 3 billion photos uploaded per month
  • tens of TB of RAM in use, across thousands of servers containing cached data
  • tweaked memcached to make it even more performant
  • developed HipHop for PHP, a source code transformer that turns PHP into highly optimised C++
Pretty jaw-dropping stuff. You can check out the facebook engineering blog here.

I would have liked to have heard more technical detail in some of the architecture talks, but I guess they have their competitive advantages to keep so they're hardly going to give away all their secrets!

The conference party finished off the first day, in a nearby pub. A chance to grab a beer, take in the wealth of information from day 1, and mingle with other attendees. Cosy, is probably the best way to describe it - the ratio of attendees to pub space was slightly uneven! But a good time nonetheless.

Day 2 - AlphaGeeks on .NET

Being a C# developer, I found this track a natural choice and I wasn't disappointed. It was kicked off by Ben Hall (twitter), who gave a great talk on BDD (Behaviour Driven Development) as opposed to TDD (Test Driven Development), and added IronRuby to the list of technologies I should look at.

This was followed by a great talk by Ayende Rahien (Oren Eini) (twitter), who talked about how to scale an application with a "divide and conquer" approach. What I particularly liked about his session was that he worked through a real-world example and gave some live metrics on performance improvements. The talk was one of a number to mention the CAP theorem, whereby a distributed system can only satisfy 2 out of 3 of the following, not all 3:
  • Consistency
  • Availability
  • Partition tolerance
It also added so many technologies to my list to look at: RhinoPHT/RhinoDHT, Rhino Mocks, RhinoESB and RavenDB.

Jon Skeet's (twitter) style of presentation was familiar to me from the StackOverflow DevDay last year, and he didn't disappoint in his talk on Noda Time, a .NET port of the Joda Time date/time library from Java. And yes, Tony the Pony did make an appearance on the stand.

The most entertaining talk of the whole conference has to go to Roy Osherove (twitter). A cracking talk on "Beautiful teams and leaders" was topped off with a sublime solo guitar song performance. Brilliant.

The day ended with a talk by Josh Graham and Amanda Laucher on "12 months of things to learn" in .NET. I found my TODO list rapidly growing. F#, M, Guerilla SOA, MEF....... Another really good talk, with some good banter.

Day 2 ended with a number of usergroup events, and I went along to the NNUG (Norwegian .NET User Group) / alt.net beers event at a pub in Soho. It was somewhat surreal for me, having drinks and chatting with the kind of experts I can only strive to be (to be very geeky: heroes): Ayende Rahien (Oren Eini), Jon Skeet, Roy Osherove and Udi Dahan, to name but a few. It was fantastic to see a number of other .NET developers show up who hadn't attended QCon, and it was good chatting to them. This was my first usergroup event - the lesson being, I really should have attended one before now. Take that as another TODO for my list.

Day 3 - Pick and mix

On the final day I mixed it up a bit, jumping between the "SOA 2010", "The Concurrency Challenge" and "Browser as a platform" tracks. Udi Dahan (twitter) kicked it off with a talk on how to avoid a failed SOA, talking about combining EDA & SOA (Event Driven & Service Oriented Architecture) and about only using a Request/Response model if you cannot get an event driven approach working to meet your requirements. A great speaker - very informative and enlightening.

Justin Sheehy (twitter) followed this on the concurrency track with a talk on "embracing concurrency at scale" - eventual consistency was a term coming more and more to the forefront of my mind, as was the point that ACID does not work for a distributed, scaled system. Instead you have a trade-off with BASE:
  • Basically
  • Available
  • Soft state
  • Eventually consistent
which gives you Availability and Partition tolerance (from the CAP theorem).

Summary

Some of the key points I'm walking from the conference with:
  • keep things simple - complicated doesn't work. If you find your design is complicated, then the chances are you're probably doing something wrong!
  • focus on optimising for an individual task, rather than one-size-fits-all approach
  • ACID doesn't go well with scaling
  • eventual consistency - does it matter if your data is persisted everywhere immediately? Probably not. As long as the system is eventually consistent, that's usually all that is required, and it allows you to scale better
  • asynchronous messaging
  • rules of thumb do not always apply - use existing patterns as a starting point for discussion; just because you've done something one way before doesn't mean you should automatically do it the same way in future

Check out the QCon slides.

I would love to go back to QCon next year, if I'm fortunate enough to have the opportunity again. Once I get the chance to fully absorb the whole experience and put things into practice, I will be a better developer because of it.

Thursday, 4 March 2010

Rise of the SQL Server DevBA

Something that has got me thinking recently is the distinction between an SQL Server Developer and a DBA. I imagine most people would describe themselves as one or the other exclusively. My CV, for example, says I'm a developer - that is what I am. I would never market myself as a DBA; aside from the fact I just don't have the full skillset of a DBA, it would be an insult to the real DBAs out there whose knowledge and experience in that arena far outweigh mine. Ask me to set up RAID, a backup/disaster recovery strategy or a SAN, and I will give you a blank stare. More than likely, I'd make a speedy exit through the nearest door/window/cat flap.

I'm surrounded!

I signed up for the recent SQL Server Dynamic Management View training offered by Quest Software - a free, one day online conference. This was a very popular event - after all not only was it free, but there were 3 great speakers presenting a range of sessions at levels ranging from beginner to expert. The notes, slides and links can be found here. Midway through, I started wondering how many other "Database Developers" had signed up, as I was surrounded (albeit virtually) by DBAs, and very proficient ones at that. This naturally got me thinking...

Why am I here?

It turns out the answer is pretty simple - to make myself a better Database Developer. Performance and scalability are very important, and are always at the forefront of my mind. As a developer, I want to design and develop a database that supports the system I'm working on and that will perform and scale well. For me this involves thinking about the bigger picture: not just how it performs at the time of development, with relatively low volumes of data on development servers, but how it would perform with current live data volumes and the volumes expected in the future. Thinking about things from a DBA's perspective, understanding the issues a DBA has to deal with and the skills they use to keep a database server running smoothly is, in my opinion, a valuable asset to have. I don't mean that I should be able to do even half of what a DBA can do - that would have been a different career path had I wanted to head down that route. But some DBA skills are very beneficial for a developer to have - at least I find they are in my role.

Rise of the "DevBA"?

An SQL Server Developer with a bit of DBA thrown in. Knowing how to diagnose performance issues, make use of DMVs, and pinpoint potential bottlenecks has been extremely valuable to me. It's led to me broadening my horizons and fine-tuning my skills as a developer. For those of you who perhaps work for a small company without a dedicated DBA role, where pretty much everyone mucks in, I think you're in prime DevBA territory.
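
To give one concrete example of the kind of DMV usage I mean (a minimal sketch, not tied to any particular system), this lists the top cached statements by total CPU - a handy first stop when hunting for bottlenecks:
SELECT TOP 10
    qs.total_worker_time / 1000 AS TotalCpuMs,
    qs.execution_count,
    qs.total_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS StatementText
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC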

Another carrot being dangled

The upcoming SQLBits event I recently blogged about is another opportunity I'm looking forward to making the most of. Sessions are currently being submitted across the Dev, DBA and BI disciplines. The Dev ones are obviously of main interest to me, but there are also some DBA sessions I'd like to attend. I can't actually recall the point at which I started becoming more aware of DBA-oriented topics, but it's something that is happening more and more.

Am I a wanna-be DBA? No.

Could I double-up as a DBA? No.

Does being a "DevBA" make me a better developer and increase my technical ability? Yes.

Monday, 1 March 2010

Queue table processing in SQL Server

Implementing SQL Server queue table processing logic is something I keep meaning to blog about, and finally I've got round to it thanks to my memory being jogged by StackOverflow questions I've recently participated in, including this one. The scenario: you queue up records in a database table, each representing a piece of work that needs to be done. You then have processes that periodically poll this table, pick up the next item of work from the queue and process it.
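
For reference, the snippets below assume a queue table along these lines (a minimal sketch - only the ID and IsBeingProcessed columns are actually referenced; everything else is illustrative):
CREATE TABLE QueueTable
(
    ID INTEGER IDENTITY(1,1) NOT NULL,
    IsBeingProcessed BIT NOT NULL DEFAULT 0,
    Payload VARCHAR(500) NOT NULL,                 -- whatever describes the work item
    QueuedAt DATETIME NOT NULL DEFAULT GETDATE(),
    CONSTRAINT PK_QueueTable PRIMARY KEY CLUSTERED (ID)
)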

What you want to avoid
  • Multiple processes picking up the same queue item. You want each item in the queue to be processed once after all.
  • Blocking. If multiple processes are polling the queue and they are blocking each other, then scalability will be limited.

Solution
DECLARE @NextId INTEGER
BEGIN TRANSACTION

-- Find next available item available
SELECT TOP 1 @NextId = ID
FROM QueueTable WITH (UPDLOCK, READPAST)
WHERE IsBeingProcessed = 0
ORDER BY ID ASC

-- If found, flag it to prevent being picked up again
IF (@NextId IS NOT NULL)
    BEGIN
        UPDATE QueueTable
        SET IsBeingProcessed = 1
        WHERE ID = @NextId
    END

COMMIT TRANSACTION

-- Now return the queue item, if we have one
IF (@NextId IS NOT NULL)
    SELECT * FROM QueueTable WHERE ID = @NextId

It's all about the table hints

UPDLOCK
This grabs an update lock until the transaction is completed and prevents another process from picking up the same queue item.

READPAST
If a process encounters a row that is currently locked by another, this hint will make it skip over that locked row and allow it to move on to find the next available one.

This was a topic I investigated some time ago and spent some time working through to end up at this approach, finding this MSDN reference on table hints a valuable resource. I then found this article on MSSQLTips which demonstrates the same approach - if only I'd found that at the start, as the READPAST hint was the one I wasn't initially aware of!

Changing primary key index structure

Changing the structure of a primary key constraint's index from nonclustered to clustered (or from clustered to nonclustered) is not necessarily as straightforward as it first seems. The process of changing it over involves the constraint being dropped and then recreated. This could potentially cause a problem if you're making the change on a table while there could be activity against it.

Example
TableX was originally created as below:
CREATE TABLE [TableX]
(
    FieldA INTEGER CONSTRAINT PK_TableX PRIMARY KEY NONCLUSTERED,
    FieldB DATETIME NOT NULL,
    FieldC VARCHAR(50)
)
GO
CREATE CLUSTERED INDEX IX_TableX_FieldB ON TableX(FieldB)
GO
After a period of time, it becomes clear that performance would be better if the primary key was made clustered instead, and the existing clustered index switched to nonclustered. The following script demonstrates how to make the switch in a manner that prevents the primary key being violated while it is switched over, by creating a temporary unique constraint.
-- 1) Drop the existing CLUSTERED index
DROP INDEX TableX.IX_TableX_FieldB

-- 2) Create a (temporary) UNIQUE constraint on the unique fields referenced in the primary key. This will enforce the uniqueness when we drop the PK.
ALTER TABLE TableX
ADD CONSTRAINT UQ_TableX UNIQUE(FieldA)

-- 3) Drop the existing nonclustered PRIMARY KEY constraint.
ALTER TABLE TableX
DROP CONSTRAINT PK_TableX

-- 4) Recreate the PRIMARY KEY as CLUSTERED
ALTER TABLE TableX
ADD CONSTRAINT PK_TableX PRIMARY KEY CLUSTERED(FieldA)

-- 5) Drop the temporary UNIQUE constraint
ALTER TABLE TableX
DROP CONSTRAINT UQ_TableX

-- 6) Add the IX_TableX_FieldB index back on as NONCLUSTERED
CREATE NONCLUSTERED INDEX IX_TableX_FieldB ON TableX(FieldB)
This scenario is a good reason why I always avoid SQL Server's autogenerated constraint names!

Thursday, 25 February 2010

SQLBits VI - Upcoming SQL Server event

SQL Bits VI is scheduled in my calendar for Friday 16th April - and I'm expecting great things! It's a 1 day, SQL Server event being held in London (Westminster to be semi-precise), and the theme is performance and scalability which, as I mentioned in an earlier blog post, is particularly high on my agenda at the moment.

Not only does it offer high quality speakers, but it's a free event too. What more could you want? Although I haven't attended an SQL Bits event before, it promises to be well worth the journey up to London. As soon as registration is open I know I'll be signing up pretty quickly. The list of sessions that have been submitted so far can be found here and there looks to be a good mix already.

As a developer, I'm particularly interested in the Dev sessions, but I also find myself very interested in the DBA sessions that have been submitted. I always market myself as an SQL Server Developer, not a DBA - that's what I am. However, I increasingly find myself straying into the realms of the DBA, especially recently. I like to think that adds to the quality and scalability of the solutions I work on.

So if SQL Server is your thing and you can make it to London for it, you should register!

Tuesday, 23 February 2010

Autogenerated SQL constraint names - why to avoid them

Autogenerated constraint names in SQL Server - are they convenient, or are they a hindrance? For me, they are a hindrance.

Why should I care that SQL Server autogenerates a constraint name when I don't explicitly specify one? Why does it matter if it names my PRIMARY KEY constraint "PK__Constrai__3214EC271FCDBCEB", or my DEFAULT constraint "DF__Constraint__ColumnA__21B6055D"?

Why should I name each constraint myself and not rely on autogenerated constraint names?

Because it makes maintenance and database update deployments easier, simpler and in my opinion, safer.

If you want to change a constraint in the future, knowing its name makes life a whole lot easier. But if you didn't specify a name when you created the constraint originally, then you need to query the metadata views to find out what it is (sys.check_constraints, sys.default_constraints, sys.key_constraints and sys.foreign_keys in SQL Server 2005 and later). This isn't particularly difficult, but I find it an unnecessary extra step. And often when introducing extra steps into a process, there's an element of extra risk.
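
As an example of that extra step (a minimal sketch - the table and column names are hypothetical), this is what dropping an unnamed DEFAULT constraint ends up looking like:
-- Look up the autogenerated name of the DEFAULT constraint on dbo.MyTable.ColumnA
DECLARE @ConstraintName SYSNAME

SELECT @ConstraintName = dc.name
FROM sys.default_constraints dc
JOIN sys.columns c ON c.object_id = dc.parent_object_id
    AND c.column_id = dc.parent_column_id
WHERE dc.parent_object_id = OBJECT_ID('dbo.MyTable')
    AND c.name = 'ColumnA'

-- Drop it via dynamic SQL, since the name isn't known until runtime
DECLARE @Sql NVARCHAR(500)
SET @Sql = N'ALTER TABLE dbo.MyTable DROP CONSTRAINT ' + QUOTENAME(@ConstraintName)
EXECUTE sp_executesql @Sql

-- Compare that with a constraint you named yourself:
-- ALTER TABLE dbo.MyTable DROP CONSTRAINT DF_MyTable_ColumnA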

"PK_MyTable" makes more sense and is more memorable than "PK__Constrai__3214EC271FCDBCEB". It does what it says on the tin.

If you deploy database updates by loading up SQL Server Management Studio and running through a list of changes manually, one by one, via the UI, then you probably still don't see the true benefit. However, if you create SQL scripts to roll out database updates, which can be tested on local systems before going to the production database server, then the benefit should be obvious. In an ideal world, you have a continuous integration server automatically building and testing code changes against a local test environment.

I like making life easier for myself, which is why one of the best practice approaches I follow is to make sure I avoid SQL Server's autogenerated constraint names.

Sunday, 21 February 2010

Optimising date filtered SQL queries

How you structure your SQL queries is very important and choosing the wrong approach can have big effects on the performance of the query.

One of the key things that should flag up the potential need for optimisation is the use of a function/calculation within a WHERE clause. These can lead to inefficient execution plans, preventing optimal index use.

A classic example to demonstrate this is querying the data for a specific month. Here's the test setup: creating a table to hold customer orders, populating it with 100,000 records spread randomly throughout 2009, and creating an index on the OrderDate column that includes the "Total" data (SQL Server 2005 and later support including non-key columns in nonclustered indexes via the INCLUDE syntax):
CREATE TABLE [OrderHead]
(
    ID INTEGER IDENTITY(1,1) PRIMARY KEY,
    OrderDate DATETIME,
    CustomerID INTEGER,
    Total MONEY
)

DECLARE @Counter INTEGER
SET @Counter = 1

WHILE (@Counter <= 100000)
  BEGIN
    INSERT [OrderHead] (OrderDate, CustomerID, Total)
    SELECT DATEADD(dd, CAST(RAND() * 365 AS INTEGER), '20090101'), 
        CAST(RAND() * 100 AS INTEGER),
        CAST(RAND() * 100 AS DECIMAL(5,2))
  
    SET @Counter = @Counter + 1
  END
 
CREATE NONCLUSTERED INDEX IX_OrderHead_OrderDate 
ON OrderHead(OrderDate) INCLUDE(TOTAL)
Say you want to find the total value of sales made within June 2009. The key thing to do is to get optimal usage of the index on the OrderDate field.

Optimal Approach
SELECT SUM(Total) AS MonthlyTotal
FROM [OrderHead]
WHERE OrderDate >= '20090601'
  AND OrderDate < '20090701'
Average stats over 3 runs:
Duration: 45ms, CPU: 0, Reads: 29

Execution Plan:
This is a good execution plan, with an optimal index seek.

How does that compare to the alternative approaches?

Alternative #1
SELECT SUM(Total) AS MonthlyTotal
FROM [OrderHead]
WHERE MONTH(OrderDate) = 6
    AND YEAR(OrderDate) = 2009
Average stats over 3 runs:
Duration: 153ms, CPU: 36, Reads: 324

Execution Plan:
 

See how an index scan is now being used, not a seek. This is due to the use of the MONTH and YEAR functions within the WHERE clause preventing the most optimal use of the index.

Alternative #2
SELECT SUM(Total) AS MonthlyTotal
FROM [OrderHead]
WHERE DATEDIFF(mm, OrderDate, '20090601') = 0
Average stats over 3 runs:
Duration: 186ms, CPU: 78, Reads: 324

Execution Plan:
This is the same as Alternative #1.

In this scenario, the optimal approach reduced duration by up to 75%, took CPU down to 0 and reduced reads by about 91%.

For such a simple tweak, you can see how much of a difference it can make to performance. When you want a high performance, scalable system, it is important to optimise these scenarios in your SQL queries.
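The same principle applies if the month arrives as a parameter rather than as hard-coded dates - calculate the boundary dates up front and keep OrderDate free of any functions. A minimal sketch (the variable names are just illustrative):
DECLARE @AnyDateInMonth DATETIME
SET @AnyDateInMonth = '20090615' -- any date within the month of interest

DECLARE @MonthStart DATETIME, @NextMonthStart DATETIME
SET @MonthStart = DATEADD(mm, DATEDIFF(mm, 0, @AnyDateInMonth), 0) -- first day of that month
SET @NextMonthStart = DATEADD(mm, 1, @MonthStart)                  -- first day of the following month

SELECT SUM(Total) AS MonthlyTotal
FROM [OrderHead]
WHERE OrderDate >= @MonthStart
  AND OrderDate < @NextMonthStart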

Wednesday, 17 February 2010

Using the SP:StmtCompleted SQL Profiler trace event class

One question I've seen popping up a few times recently is how to check which individual statement(s) within an SQL Server stored procedure are taking the most time.

The scenario is that there is a stored procedure that consists of a number of SQL statements. The stored procedure has been flagged up as running slower than perhaps it was expected to, but as there are a number of statements/queries within it, it's not immediately clear where the time is being taken.

Now the immediate solution that springs to mind is to check the execution plan as that will show where the time is being spent. Job done.

But in this instance, the question related to how to extract a list of execution times/stats for each statement within a stored procedure and the focus was away from just looking at the execution plan.

One approach mentioned was to record the start and end time either side of each statement and log out to an audit table to then be reviewed. I think this is a bit clunky, and means you have to go in and add potentially a lot of temporary code that needs to be stripped out again later.

Instead, what I think is a better approach is to use SQL Profiler. When you run a standard/default trace it doesn't capture the execution of each individual statement, just the overall execution of the stored procedure. What you need to do is set up the trace to monitor the SP:StmtCompleted event so that what you end up with is a trace of every statement executed within the stored procedure call. The trace can then of course be saved to file or to a database table if needed.
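If you'd rather script it than click through the Profiler UI, the same thing can be set up as a server-side trace. Here's a minimal sketch (the output file path is just an example, and I've only wired up a handful of columns) using the SP:StmtCompleted event class, which is event ID 45:
DECLARE @TraceID INT, @MaxFileSize BIGINT, @On BIT
SET @MaxFileSize = 5 -- MB
SET @On = 1

-- Create the trace, writing to a file (example path)
EXEC sp_trace_create @TraceID OUTPUT, 0, N'C:\Traces\SPStmtTrace', @MaxFileSize, NULL

-- Capture TextData, SPID, Duration, Reads, Writes and CPU for SP:StmtCompleted (event 45)
EXEC sp_trace_setevent @TraceID, 45, 1, @On   -- TextData
EXEC sp_trace_setevent @TraceID, 45, 12, @On  -- SPID
EXEC sp_trace_setevent @TraceID, 45, 13, @On  -- Duration
EXEC sp_trace_setevent @TraceID, 45, 16, @On  -- Reads
EXEC sp_trace_setevent @TraceID, 45, 17, @On  -- Writes
EXEC sp_trace_setevent @TraceID, 45, 18, @On  -- CPU

-- Start the trace (set status 0 to stop it, then 2 to close it when you're done)
EXEC sp_trace_setstatus @TraceID, 1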

It's simple, but effective. I think this particular event class often slips under the radar as it's (rightly so) not traced by the standard template that most people (including me) tend to use in SQL Profiler. And unless you've looked through all the event classes, you may not know it's there or what it actually does.

Monday, 15 February 2010

High performance bulk loading to SQL Server using SqlBulkCopy

If you ever want to bulk load data into an SQL Server database as quickly as possible, the SqlBulkCopy class is your friend (in the System.Data.SqlClient namespace). Since being introduced in .NET 2.0, it has provided an extremely efficient way to bulk load data into SQL Server, and is one of the classes I see as a "must know about". A typical scenario is where you want to dump some data into the database to then do some processing on, or where you want to temporarily load some data in so you can query it for some stats.

To give an idea of how well it performs, I'll compare it to an SqlDataAdapter approach.

SqlBulkCopy approach
// Instantiate SqlBulkCopy with default options,
// supplying an open SqlConnection to the database
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
{
    // The table I'm loading the data to
    bulkCopy.DestinationTableName = "BulkLoadTable";
    // How many records to send to the database in one go (all of them)
    bulkCopy.BatchSize = myDataTable.Rows.Count;

    // Load the data to the database
    bulkCopy.WriteToServer(myDataTable);
    
    // Close up          
    bulkCopy.Close();
}        

SqlDataAdapter approach
using (SqlCommand insertCommand = new SqlCommand(
    "INSERT BulkLoadTable(FieldA, FieldB) VALUES (@FieldA, @FieldB)", connection))
{
    insertCommand.Parameters.Add("@FieldA", SqlDbType.VarChar, 10, "FieldA");
    insertCommand.Parameters.Add("@FieldB", SqlDbType.Int, 4, "FieldB");
    // Setting UpdatedRowSource is important if you want to batch up the inserts
    insertCommand.UpdatedRowSource = UpdateRowSource.None;
    using (SqlDataAdapter insertAdapter = new SqlDataAdapter())
    {
        insertAdapter.InsertCommand = insertCommand;
        // How many records to send to the database in one go (all of them)
        insertAdapter.UpdateBatchSize = myDataTable.Rows.Count;
        
        // Send the inserts to the database
        insertAdapter.Update(myDataTable);                   
    }
}

I've left out the boring stuff like the instantiation and opening of the SqlConnection (connection). Plus I haven't shown the generation of the DataTable (myDataTable) as that's not particularly exciting either.

For this test, I'm loading 2 columns of data to the table named "BulkLoadTable" in my database. The columns in myDataTable match those in the table, which keeps it nice and simple. If they didn't match, then when using SqlBulkCopy you need to specify which columns in the DataTable map to which columns in the destination table, using the ColumnMappings collection on the SqlBulkCopy instance. All straightforward.
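For example, if the source column names differed from the destination ones, a couple of mappings along these lines (added before the WriteToServer call in the snippet above; the column names are purely illustrative) would do it:
// Map source DataTable columns to differently named destination table columns
bulkCopy.ColumnMappings.Add("SourceFieldA", "FieldA");
bulkCopy.ColumnMappings.Add("SourceFieldB", "FieldB");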

So how does SqlBulkCopy perform?

I populated the DataTable with 100,000 records containing some random data. FieldA is just a GUID, FieldB is an incrementing number. The times recorded are average measurements of the actual time to persist the data (i.e. the times to complete bulkCopy.WriteToServer and insertAdapter.Update respectively):

SqlBulkCopy:1.5885s
SqlDataAdapter:25.0729s

That's about a 93% drop in time taken when using the SqlBulkCopy class over the SqlDataAdapter approach.

It gets better.

One of the SqlBulkCopy constructor overloads provides the ability to specify SqlBulkCopyOptions. For maximum performance, specify the TableLock option:
SqlBulkCopy bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null);
This will obtain a bulk update lock on the destination table during the bulk copy process. By default, it would otherwise obtain row level locks. This further reduces the time to load 100,000 records:

SqlBulkCopy with TableLock Option:0.8229s

Of course, this lock can block out other processes until the loading is complete. Usually, in my experience, this is not an issue. For optimal performance I recommend:
  • bulk loading into a new, empty table (a staging table)
  • adding suitable indexes to the staging table after the data has been bulk loaded, to help with subsequent processing/querying - adding them at the start just unnecessarily hinders the insert speed
  • using the TableLock option to maximise throughput

The data source doesn't have to be a DataTable either. The WriteToServer method will also accept a DataRow array, an IDataReader, or a DataTable with a DataRowState (it will persist only the DataTable rows matching the supplied DataRowState).
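This means, for example, you could stream rows straight from one source into the destination table without materialising a DataTable at all. A minimal sketch, assuming sourceCommand is a SqlCommand (on a separate source connection) selecting the data to load, and connection is the open destination connection as before:
using (IDataReader reader = sourceCommand.ExecuteReader())
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
{
    bulkCopy.DestinationTableName = "BulkLoadTable";
    bulkCopy.WriteToServer(reader);
}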

So, "can I use it for bulk updates?" I hear you ask. No. It's a mechanism for doing bulk inserts only. You can't have everything I suppose.

Update (12 Jan 2011):
Want to squeeze more performance out? Check out my followup post on SqlBulkCopy to SQL Server in Parallel

Saturday, 13 February 2010

Ways to improve technical skills

It's nearly the end of my first week as a blogger. Already I'm feeling the benefits of finally getting round to starting up my blog. It got me thinking about what the best ways are to continue developing your skills and learning new things.

For me, there are 3 particular ways that I find are invaluable if you want to keep yourself on your toes technically:
  1. Learn by doing - often you learn by making mistakes. Want to know what LINQ is all about? What can SQL CLR do for you? Hit those syntax errors then fix them. Think it's not performing well? Investigate, find out why, look for alternatives. Just try it out - play around.
  2. Learn by participating - I'm a StackOverflow enthusiast. Before that I spent years on another similar site (the one with the unfortunate name and conveniently placed hyphen). Why? Because seeing what problems other people are encountering and helping them to find a solution is a great way to broaden your technical horizons. You'll often find there are other ways to go about things that you may not have thought about, but which other people suggest.
  3. Learn by sharing - Now I'm a blogger, and already finding that a great benefit. It gives me the drive to investigate areas I find interesting. Being able to understand something is one thing; being able to share it in a way that makes sense and provides value to others is another. It's a learning curve and I'll be feeling my way for some time.

Of course there are books, blogs, webinars, conferences and other online references that are extremely valuable resources for learning, and the above list by no means overshadows these. The DevDays London conference opened my eyes to new areas and, as per an earlier blog post, I'm expecting great things from QCon London. My intention is to feed back into this pool of resources with this blog.

I encourage any feedback as it will only improve the content I produce. In return, I will promise to try and keep bad jokes to a minimum. Although in my experience, telling bad jokes is part and parcel of being a software developer.

Friday, 12 February 2010

Sorting a DataTable - LINQ performance

Whenever there are a number of ways to achieve the same goal, I'm always inquisitive as to what the performance difference is between them. I'm a firm believer in thinking about scalability from the start - if you make your best effort to prepare for scale, then you can save yourself time and effort further down the line. I like to try and avoid pain - it doesn't agree with me.

In this instance, I became curious about sorting an in-memory DataTable, using C# / .NET 3.5. So I decided to run a few tests using the common approaches. Now, you may be thinking "Why not just order the results from the database using an ORDER BY clause?". Well yes, you could do that. But what about when you want to cache some data once in memory, and then use that cached copy for subsequent purposes to avoid round-tripping back to / re-hitting the database server? Or what if the data isn't actually coming from a database but some other source?

So I knocked up a quick test harness. For each method, I tested sorting a DataTable containing between 100 and 2 million rows of data. The DataTable contained 2 columns:
ColumnA - integer, just an incrementing number
ColumnB - string, in the format {Character}{Row Number}, where {Character} loops round from A-Z just to mix the records up a bit and create a need for ordering.

Method 1 - DataView.Sort
DataView vw = dt.DefaultView;
vw.Sort = "ColumnB ASC";

Method 2 - DataTable.Select
DataRow[] rows = dt.Select("", "ColumnB ASC");

Method 3 - LINQ to DataSet
var rows = (from r in dt.AsEnumerable()
            orderby r["ColumnB"] ascending
            select r).ToArray();
Note: the .ToArray() call in the LINQ above is important - it makes the query execute immediately. Without it, all you are really doing is defining a query object; it does not execute until you try to request data from it - this is known as deferred execution. So, in this example, without the enclosing brackets and the subsequent .ToArray(), the data wouldn't actually be sorted at this point.

Here's the results:
No. of Rows    Method 1 - DataView.Sort    Method 2 - DataTable.Select    Method 3 - LINQ
100            0.0625s                     0.0520s                        0.0475s
1,000          0.0781s                     0.0573s                        0.0573s
10,000         0.1618s                     0.1094s                        0.0989s
100,000        1.4793s                     0.8959s                        0.7084s
1,000,000      16.1318s                    9.8290s                        8.4534s
2,000,000      35.094s                     21.5995s                       18.3420s

As you can see from my tests, LINQ to DataSet came out tops. With a smaller DataTable the difference is, as you'd expect, minimal. Though as the volume of rows increases, LINQ seems to keep out-performing the other two approaches, being nearly 100% quicker than a DataView Sort as you get to the level of hundreds of thousands of rows, and about 14% quicker than a DataTable.Select.

Thursday, 11 February 2010

SQL Server 2008 - Table Valued Parameters vs XML vs CSV

The scenario is, you want to create a stored procedure that returns the records relating to a finite set of keys/IDs that you pass in to it. How do you do it? Well you could use dynamic SQL, but for the purpose of this article I'm going to ignore that option.

Test scenario:
-- Create basic customer table
CREATE TABLE [dbo].[Customer]
(
Id INTEGER IDENTITY(1,1) PRIMARY KEY,
Name NVARCHAR(50) NOT NULL
);
GO

-- Populate Customer table with 100,000 sample records
DECLARE @Counter INTEGER
SET @Counter = 1
WHILE (@Counter <= 100000)
BEGIN
INSERT Customer (Name)
VALUES ('Test Customer #' + CAST(@Counter AS VARCHAR(10)))
SET @Counter = @Counter + 1
END

Option 1 - CSV list of keys
In SQL Server 2000, you would most likely have ended up passing in a CSV list of keys, splitting that out into a table, and then joining that on to your real data table to match the records to return. dbo.fnSplit is a user defined function that iterates through the string and splits each value out into a TABLE variable which is then returned. To save space, I haven't supplied the definition of that, but there's plenty of examples of this function out there via a quick Google.
CREATE PROCEDURE [dbo].[uspGetCustomersCSV]
@CustomerIDs VARCHAR(8000)
AS
BEGIN
SELECT c.Id, c.Name
FROM [dbo].[Customer] c
JOIN dbo.fnSplit(@CustomerIDs, ',') t ON c.Id = t.item
END
GO

Example Use:
EXECUTE [dbo].[uspGetCustomersCSV] '1,10,100'
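For reference, here's a minimal sketch of the kind of fnSplit implementation I mean - treat it as illustrative rather than the definitive version, as there are plenty of variants around:
CREATE FUNCTION dbo.fnSplit (@List VARCHAR(8000), @Delimiter CHAR(1))
RETURNS @Items TABLE (item INTEGER)
AS
BEGIN
    DECLARE @Pos INTEGER
    SET @List = @List + @Delimiter
    SET @Pos = CHARINDEX(@Delimiter, @List)
    WHILE (@Pos > 0)
    BEGIN
        -- Take everything up to the next delimiter as one item
        INSERT @Items (item)
        SELECT CAST(LEFT(@List, @Pos - 1) AS INTEGER)
        -- Chop off the item just processed and find the next delimiter
        SET @List = SUBSTRING(@List, @Pos + 1, 8000)
        SET @Pos = CHARINDEX(@Delimiter, @List)
    END
    RETURN
END
GO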


Option 2 - XML
SQL Server 2005 added the XML datatype, which allows you to pass in an XML blob containing the keys and use the built-in XML support to incorporate it into a JOIN.
CREATE PROCEDURE [dbo].[uspGetCustomersXML]
@CustomerIDs XML
AS
BEGIN
SELECT c.ID, c.Name
FROM [dbo].[Customer] c
JOIN @CustomerIDs.nodes('IDList/ID') AS x(Item) ON c.ID = x.Item.value('.', 'int')
END
GO

Example Use:
EXECUTE [dbo].[uspGetCustomersXML] '<IDList><ID>1</ID><ID>10</ID><ID>100</ID></IDList>'


Option 3 - Table Valued Parameters
Now in SQL Server 2008, there's a new kid on the block: Table Valued Parameters. In a nutshell, this is the ability to pass a TABLE type in as a parameter, which previously was not allowed. The beauty of this is that it allows you to focus on writing the query in a more natural manner - no string parsing/manipulation, no XML functionality. Plain, simple SQL.

First you create a new TYPE, defining the TABLE structure.
CREATE TYPE CustomerIDTableType AS TABLE (ID INTEGER PRIMARY KEY);
GO

Next you create your stored procedure, taking that new type as a parameter.
CREATE PROCEDURE [dbo].[uspGetCustomersTable]
@CustomerIDs CustomerIDTableType READONLY
AS
BEGIN
SELECT c.ID, c.Name
FROM [dbo].[Customer] c
JOIN @CustomerIDs t ON c.Id = t.Id
END
GO

Example Use:
DECLARE @Ids CustomerIDTableType
INSERT @Ids VALUES (1)
INSERT @Ids VALUES (10)
INSERT @Ids VALUES (100)
EXECUTE [dbo].[uspGetCustomersTable] @Ids


If you want to call this from .NET, it's quite simple. You define the parameter as SqlDbType.Structured, and the value can be set to a DataTable, a DbDataReader or an IEnumerable<SqlDataRecord>!
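A minimal sketch of what that looks like (connection handling omitted as before; idsTable is just an illustrative name):
// Build a DataTable with the same shape as the CustomerIDTableType table type
DataTable idsTable = new DataTable();
idsTable.Columns.Add("ID", typeof(int));
idsTable.Rows.Add(1);
idsTable.Rows.Add(10);
idsTable.Rows.Add(100);

using (SqlCommand command = new SqlCommand("dbo.uspGetCustomersTable", connection))
{
    command.CommandType = CommandType.StoredProcedure;

    // Structured = table valued parameter
    SqlParameter tvpParameter = command.Parameters.AddWithValue("@CustomerIDs", idsTable);
    tvpParameter.SqlDbType = SqlDbType.Structured;

    using (SqlDataReader reader = command.ExecuteReader())
    {
        // ... read the matching Customer rows ...
    }
}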

The following stats are the average figures over 3 runs, against the 100,000 row table, retrieving just 7 rows of data for keys: 1, 10, 200, 3000, 40000, 50000 and 90000. Each time, the cache was fully cleared to allow fair comparisons.

Option 1 (CSV): Duration=123ms, Reads=107, Writes=1, CPU=15
Option 2 (XML): Duration=80ms, Reads=69, Writes=0, CPU=5
Option 3 (TVP): Duration=69ms, Reads=27, Writes=0, CPU=0

This was quite a modest test, and by no means an absolute stress test. But the results look very promising for Table Valued Parameters. And not only that - I like them because they let you get on with writing straightforward, maintainable queries.

Wednesday, 10 February 2010

QCon London 2010

So QCon London is fast approaching, starting with 2 days of tutorials on the 8th and 9th of March, then the 3 day conference from 10th-12th. This will be my first time at QCon and I'm really looking forward to it. Not having been to many conferences in the past, the level of expectation I have is based solely on the StackOverflow DevDay in London last October. That was a great day with some very interesting (and humorous) talks, so I'm expecting even better things from QCon as its reputation precedes it!

I've been keeping an eye on the schedule as it has evolved over the past few months and am starting to wish I could split myself into a number of mini-me's to fan out to a number of the tracks. Also because it would allow me to maximise the collection of freebies... after all, isn't that what conferences are all about - the free stuff? Guaranteed to turn even the most restrained people into kleptomaniacs. It's all about the free pens.

At the moment, I think I'm pretty much decided on which tracks to attend for the Wednesday and Thursday; still wavering on Friday:

Wednesday: Architectures you've always wondered about
Thursday: AlphaGeeks on .NET
Friday: SOA 2010 or Browser as a Platform

Performance and scalability have been the flavour of 2010 so far for me - I've spent a lot of time thinking about how to scale up in the world of .NET and SQL Server. So I'm particularly interested in talks like "Scaling applications: Complexity, Performance, Maintainability" - keen to open my eyes to the likes of Rhino Service Bus, NHibernate and Windsor.

If you're going to QCon, what are you looking forward to? Or have you been before and got any tips for a newbie? Feel free to drop me an email.

Tuesday, 9 February 2010

Would you like SQL cache with that?

One of the things I keep badgering on about in the world of SQL Server and query tuning is being fair and consistent when comparing the different possible variants. If you don't level the playing field, you can end up thinking a particular query out-performs another when in fact the opposite is true. It's a bit like comparing 2 paper boys to see which one can complete a given round the quickest - if one of them has been doing the round for the past 2 weeks but the other has never done that round before, are you going to be able to reliably tell which one is actually quickest?

No.

One of them knows the route, the houses, and which houses have the dog that will chew their hand off given half the chance, but the other paper boy has no such knowledge. This puts him at an unfair disadvantage and almost certainly he will be slower. So until you compare them fair and square, from the same starting point, then you won't truly know who is quicker.

This is true when query tuning. So how do I do fair comparisons of queries? Well, nothing ground-breaking:

Clear the data cache and execution plan cache.
The first time you run a query an execution plan will be generated and stored in the execution plan cache. The data will be held in SQL Server's data cache/buffer. If you then run it for a second time, there is already an execution plan to reuse and the data is in the cache meaning it doesn't need to be re-read from disk. This can give vast improvements in query performance.


CHECKPOINT -- force dirty pages in the buffer to be written to disk
DBCC DROPCLEANBUFFERS -- clear the data cache
DBCC FREEPROCCACHE -- clear the execution plan cache

So, if you don't clear the cache down between tests, then it means you will be getting skewed results. A second different query may appear to be more performant, when in fact it's benefiting unfairly from the primed cache created by the first query.

The data cache is usually the biggest contributor to the performance improvement. If you want to get an idea of what is in the data cache, here's a basic query to run in your database that will show the number of cached data pages by table. This is a slightly adapted version of a query available on MSDN (it strips out the figures for system tables), for SQL Server 2005 and above:

SELECT t.name AS TableName, COUNT(*) AS NumberOfCachedPages
FROM sys.dm_os_buffer_descriptors AS bd
INNER JOIN
(
    SELECT object_id, allocation_unit_id
    FROM sys.allocation_units AS au
    INNER JOIN sys.partitions AS p
        ON au.container_id = p.hobt_id
        AND (au.type = 1 OR au.type = 3)
    UNION ALL
    SELECT object_id, allocation_unit_id
    FROM sys.allocation_units AS au
    INNER JOIN sys.partitions AS p
        ON au.container_id = p.partition_id
        AND au.type = 2
) AS obj
    ON bd.allocation_unit_id = obj.allocation_unit_id
INNER JOIN sys.tables t ON obj.object_id = t.object_id
WHERE bd.database_id = db_id()
GROUP BY t.name
ORDER BY NumberOfCachedPages DESC;

Running this after clearing the cache, and then after running a query, will soon give you at least some idea of what is being cached. I recommend having a look at sys.dm_os_buffer_descriptors.

So, this is why whenever I want to try out different versions of a query for performance, I like to follow a standard approach:
1) start a trace in SQL Profiler, filtered down to just my process (by user name)
2) clear the cache
3) run query #1
4) record the reads, writes, CPU and duration from profiler
5) repeat steps 2 to 4, at least once more
6) run query #1 again but without clearing the cache - this is because I like to see how the query runs with a hot/primed cache
7) repeat steps 2 to 6 for query #2
8) compare the stats

I find it good to try each query a number of times, to get an average performance. Note up until now I've been talking about performance in terms of raw query duration. I also monitor reads/writes/CPU as they are good to keep an eye on too. Where two queries run in about the same time, there may be a more sizeable difference in the other stats to help choose a winner. Of course, you may want to optimise for reads or CPU at the expense of overall speed, but generally it's duration that is of the most interest.

So when I see questions asking "why does query X run faster than query Y?", I feel it's worth checking whether the tests have been done from a level playing field if it's not stated. Usually the answer is yes, but on the occasion where they haven't been done in a fair manner it's well worth the gentle nudge about clearing the cache.