Thursday, 25 February 2010

SQLBits VI - Upcoming SQL Server event

SQL Bits VI is scheduled in my calendar for Friday 16th April - and I'm expecting great things! It's a 1 day, SQL Server event being held in London (Westminster to be semi-precise), and the theme is performance and scalability which, as I mentioned in an earlier blog post, is particularly high on my agenda at the moment.

Not only does it offer high quality speakers, but it's a free event too. What more could you want? Although I haven't attended an SQL Bits event before, it promises to be well worth the journey up to London. As soon as registration is open I know I'll be signing up pretty quickly. The list of sessions that have been submitted so far can be found here and there looks to be a good mix already.

As a developer, I'm particularly interested in the Dev sessions but also, I find myself very interested in the DBA sessions that have been submitted. I always market myself as an SQL Server Developer, not a DBA - that's what I am. However, increasingly I find myself straying into the realms of the DBA, especially recently. I like to think that adds to the quality and scalability of solutions I work on.

So if SQL Server is your thing and you can make it to London for it, you should register!

Tuesday, 23 February 2010

Autogenerated SQL constraint names - why to avoid them

Autogenerated constraint names in SQL Server - are they convenient, or are they a hindrance? For me, they are a hindrance.

Why should I care that SQL Server autogenerates a constraint name when I don't explicitly specify one? Why does it matter if it names my PRIMARY KEY constraint "PK__Constrai__3214EC271FCDBCEB", or my DEFAULT constraint "DF__Constraint__ColumnA__21B6055D"?

Why should I name each constraint myself and not rely on autogenerated constraint names?

Because it makes maintenance and database update deployments easier, simpler and in my opinion, safer.

If you want to change a constraint in the future, knowing its name makes life a whole lot easier. But if you didn't specify a name when you originally created the constraint, then you need to query the metadata views to find out what it is (sys.check_constraints, sys.default_constraints, sys.key_constraints and sys.foreign_keys in SQL Server 2005 and later). This isn't particularly difficult, but I find it's an unnecessary step to have to take. And introducing extra steps into a process often brings extra risk.
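For example, tracking down the autogenerated name of a DEFAULT constraint means a lookup something like the following (the table and column names here are purely illustrative):

```sql
-- The extra lookup step needed before you can ALTER/DROP the constraint
SELECT dc.name
FROM sys.default_constraints dc
INNER JOIN sys.columns c
    ON dc.parent_object_id = c.object_id
    AND dc.parent_column_id = c.column_id
WHERE dc.parent_object_id = OBJECT_ID('dbo.MyTable')
    AND c.name = 'ColumnA';
```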

"PK_MyTable" makes more sense and is more memorable than "PK__Constrai__3214EC271FCDBCEB". It does what it says on the tin.
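By contrast, explicitly naming the constraints up front costs almost nothing. As a sketch (table and column names are just for illustration):

```sql
CREATE TABLE dbo.MyTable
(
    Id INTEGER NOT NULL CONSTRAINT PK_MyTable PRIMARY KEY,
    ColumnA INTEGER NOT NULL CONSTRAINT DF_MyTable_ColumnA DEFAULT (0)
);

-- A later deployment script can then reference the constraint directly,
-- with no metadata lookup required
ALTER TABLE dbo.MyTable DROP CONSTRAINT DF_MyTable_ColumnA;
```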

If you deploy database updates by loading up SQL Server Management Studio, and running through a list of changes manually one by one via the UI, then you probably still don't see the true benefit. However, if you create SQL scripts to roll out database updates, that can be tested on local systems before going to the production database server, then the benefit should be obvious. In an ideal world, you have a continuous integration server automatically building and testing code changes to a local test environment.

I like making life easier for myself, which is why one of the best-practice approaches I follow is to avoid SQL Server's autogenerated constraint names.

Sunday, 21 February 2010

Optimising date filtered SQL queries

How you structure your SQL queries is very important and choosing the wrong approach can have big effects on the performance of the query.

One of the key things that should flag up the potential need for optimisation is the use of a function/calculation within a WHERE clause. These can lead to inefficient execution plans, preventing optimal index use.

A classic example to demonstrate this is when querying the data for a specific month. Here's the test setup, creating a table to hold customer orders, populating it with 100,000 records spread out randomly throughout the year of 2009, and creating an index on the OrderDate column that includes the "Total" data (SQL Server 2005 and later support the ability to include other non-key columns in nonclustered indexes, via the INCLUDE syntax):
CREATE TABLE [OrderHead]
(
    OrderDate DATETIME,
    CustomerID INTEGER,
    Total MONEY
)

DECLARE @Counter INTEGER
SET @Counter = 1

WHILE (@Counter <= 100000)
BEGIN
    INSERT [OrderHead] (OrderDate, CustomerID, Total)
    SELECT DATEADD(dd, CAST(RAND() * 365 AS INTEGER), '20090101'), 
        CAST(RAND() * 100 AS INTEGER),
        CAST(RAND() * 100 AS DECIMAL(5,2))
    SET @Counter = @Counter + 1
END

-- Index name is illustrative
CREATE NONCLUSTERED INDEX IX_OrderHead_OrderDate
ON OrderHead(OrderDate) INCLUDE(Total)
Say you want to find the total value of sales made within June 2009. The key thing to do is to get optimal usage of the index on the OrderDate field.

Optimal Approach
SELECT SUM(Total) AS MonthlyTotal
FROM [OrderHead]
WHERE OrderDate >= '20090601'
  AND OrderDate < '20090701'
Average stats over 3 runs:
Duration: 45ms, CPU: 0, Reads: 29

Execution Plan:
This is a good execution plan, with an optimal index seek.

How does that compare to the alternative approaches?

Alternative #1
SELECT SUM(Total) AS MonthlyTotal
FROM [OrderHead]
WHERE MONTH(OrderDate) = 6
    AND YEAR(OrderDate) = 2009
Average stats over 3 runs:
Duration: 153ms, CPU: 36, Reads: 324

Execution Plan:

See how an index scan is now being used, not a seek. This is due to the use of the MONTH and YEAR functions within the WHERE clause preventing the most optimal use of the index.

Alternative #2
SELECT SUM(Total) AS MonthlyTotal
FROM [OrderHead]
WHERE DATEDIFF(mm, OrderDate, '20090601') = 0
Average stats over 3 runs:
Duration: 186ms, CPU: 78, Reads: 324

Execution Plan:
This is the same as Alternative #1.

In this scenario, the optimal approach reduced duration by up to 75%, took CPU down to 0 and reduced reads by about 91%.

For such a simple tweak, you can see how much of a difference it makes to performance. When you want a high-performance, scalable system, it is important to optimise these scenarios in your SQL queries.

Wednesday, 17 February 2010

Using the SP:StmtCompleted SQL Profiler trace event class

One question I've seen popping up a few times recently, is how to check what individual statement(s) within an SQL Server stored procedure are taking the most time.

The scenario is that there is a stored procedure that consists of a number of SQL statements. The stored procedure has been flagged up as running slower than perhaps it was expected to, but as there are a number of statements/queries within it, it's not immediately clear where the time is being taken.

Now the immediate solution that springs to mind is to check the execution plan as that will show where the time is being spent. Job done.

But in this instance, the question related to how to extract a list of execution times/stats for each statement within a stored procedure and the focus was away from just looking at the execution plan.

One approach mentioned was to record the start and end time either side of each statement and log out to an audit table to then be reviewed. I think this is a bit clunky, and means you have to go in and add potentially a lot of temporary code that needs to be stripped out again later.

Instead, what I think is a better approach is to use SQL Profiler. When you run a standard/default trace it doesn't capture the execution of each individual statement, just the overall execution of the stored procedure. What you need to do is set up the trace to monitor the SP:StmtCompleted event so that what you end up with is a trace of every statement executed within the stored procedure call. The trace can then of course be saved to file or to a database table if needed.
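For a server-side version of that trace, a minimal sketch might look like the following (event ID 45 is SP:StmtCompleted; the file path is just for illustration):

```sql
DECLARE @TraceID INT
DECLARE @on BIT
SET @on = 1

-- Create the trace, writing to file (path is illustrative)
EXEC sp_trace_create @TraceID OUTPUT, 0, N'C:\Traces\StmtTrace'

-- Capture key columns for event 45 (SP:StmtCompleted)
EXEC sp_trace_setevent @TraceID, 45, 1, @on   -- TextData
EXEC sp_trace_setevent @TraceID, 45, 12, @on  -- SPID
EXEC sp_trace_setevent @TraceID, 45, 13, @on  -- Duration
EXEC sp_trace_setevent @TraceID, 45, 16, @on  -- Reads
EXEC sp_trace_setevent @TraceID, 45, 18, @on  -- CPU

-- Start the trace
EXEC sp_trace_setstatus @TraceID, 1
```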

It's simple, but effective. I think this particular event class often slips under the radar as it's (rightly so) not traced by the standard template that most people (including me) tend to use when using SQL Profiler. And unless you've looked through all the event classes, you may not know that this was there or what it actually does.

Monday, 15 February 2010

High performance bulk loading to SQL Server using SqlBulkCopy

If you ever want to bulk load data into an SQL Server database as quickly as possible, the SqlBulkCopy class is your friend (in the System.Data.SqlClient namespace). Since being introduced in .NET 2.0, it has provided an extremely efficient way to bulk load data into SQL Server, and is one of the classes that I see as a "must know about". A usual scenario is where you want to dump some data into the database to then do some processing on. Or you want to temporarily load some data in to then query to get some stats from.

To give an idea of how well it performs, I'll compare it to an SqlDataAdapter approach.

SqlBulkCopy approach
// Instantiate SqlBulkCopy with default options,
// supplying an open SqlConnection to the database
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
{
    // The table I'm loading the data to
    bulkCopy.DestinationTableName = "BulkLoadTable";
    // How many records to send to the database in one go (all of them)
    bulkCopy.BatchSize = myDataTable.Rows.Count;

    // Load the data to the database
    bulkCopy.WriteToServer(myDataTable);

    // Close up
    bulkCopy.Close();
}

SqlDataAdapter approach
using (SqlCommand insertCommand = new SqlCommand(
    "INSERT BulkLoadTable(FieldA, FieldB) VALUES (@FieldA, @FieldB)", connection))
{
    insertCommand.Parameters.Add("@FieldA", SqlDbType.VarChar, 10, "FieldA");
    insertCommand.Parameters.Add("@FieldB", SqlDbType.Int, 4, "FieldB");
    // Setting UpdatedRowSource is important if you want to batch up the inserts
    insertCommand.UpdatedRowSource = UpdateRowSource.None;
    using (SqlDataAdapter insertAdapter = new SqlDataAdapter())
    {
        insertAdapter.InsertCommand = insertCommand;
        // How many records to send to the database in one go (all of them)
        insertAdapter.UpdateBatchSize = myDataTable.Rows.Count;
        // Send the inserts to the database
        insertAdapter.Update(myDataTable);
    }
}
I've left out the boring stuff like the instantiation and opening of the SqlConnection (connection). Plus I haven't shown the generation of the DataTable (myDataTable) as that's not particularly exciting either.

For this test, I'm loading 2 columns of data to the table named "BulkLoadTable" in my database. The columns in myDataTable match those in the table, which keeps it nice and simple. If they didn't match, you'd need to map which columns in the DataTable correspond to which columns in the destination table, using the ColumnMappings collection on the SqlBulkCopy instance. All straightforward.

So how does SqlBulkCopy perform?

I populated the DataTable with 100,000 records containing some random data. FieldA is just a GUID, FieldB is an incrementing number. The times recorded are average measurements of the actual time to persist the data (i.e. the times to complete bulkCopy.WriteToServer and insertAdapter.Update respectively):


That's about a 93% drop in time taken when using the SqlBulkCopy class over the SqlDataAdapter approach.

It gets better.

One of the SqlBulkCopy constructor overloads provides the ability to specify SqlBulkCopyOptions. For maximum performance, specify the TableLock option:
SqlBulkCopy bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null);
This will obtain a bulk update lock on the destination table during the bulk copy process. By default, it would otherwise obtain row level locks. This further reduces the time to load 100,000 records:

SqlBulkCopy with TableLock option: 0.8229s

Of course, this lock can block out other processes until the loading is complete. Usually in my experience, this is not an issue. For optimal performance I recommend:
  • bulk loading into a new, empty table (staging table)
  • add suitable indexes to the staging table after the data has been bulk loaded - ones that will help with the subsequent processing/querying. If you add indexes at the start, you will just unnecessarily hinder the insert speed
  • use the TableLock option to maximise throughput
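To sketch that staging pattern in SQL (table, column and index names here are purely illustrative):

```sql
-- 1. A new, empty staging table - deliberately no indexes yet
CREATE TABLE dbo.BulkLoadStaging
(
    FieldA VARCHAR(10),
    FieldB INT
);

-- 2. ...bulk load via SqlBulkCopy (with TableLock) happens here...

-- 3. Only once the data is in, add the indexes needed for
--    the subsequent processing/querying
CREATE NONCLUSTERED INDEX IX_BulkLoadStaging_FieldB
    ON dbo.BulkLoadStaging (FieldB);
```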

The data source doesn't have to be a DataTable either. The WriteToServer method will also accept a DataRow array, an IDataReader, or a DataTable with a DataRowState (it will persist only the DataTable rows matching the supplied DataRowState).

So, "can I use it for bulk updates?" I hear you ask. No. It's a mechanism for doing bulk inserts only. You can't have everything I suppose.

Update (12 Jan 2011):
Want to squeeze more performance out? Check out my followup post on SqlBulkCopy to SQL Server in Parallel

Saturday, 13 February 2010

Ways to improve technical skills

It's nearly the end of my first week as a blogger. Already I'm feeling the benefits of finally getting round to starting up my blog. It got me thinking about what the best ways are to continue developing your skills and learning new things.

For me, there are 3 particular ways that I find are invaluable if you want to keep yourself on your toes technically:
  1. Learn by doing - often you learn by making mistakes. Want to know what LINQ is all about? What can SQL CLR do for you? Hit those syntax errors then fix them. Think it's not performing well? Investigate, find out why, look for alternatives. Just try it out - play around.
  2. Learn by participating - I'm a StackOverflow enthusiast. Before that I spent years on another similar site (the one with the unfortunate name and conveniently placed hyphen). Why? Because seeing what problems other people are encountering and helping them to find a solution is a great way to broaden your technical horizons. You'll often find there are other ways to go about things that you may not have thought about, but which other people suggest.
  3. Learn by sharing - Now I'm a blogger, and already finding that a great benefit. It gives me the drive to investigate areas I find interesting. Being able to understand something is one thing; being able to share it in a way that makes sense and provides value to others is another. It's a learning curve and I'll be feeling my way for some time.

Of course there are books, blogs, webinars, conferences and other online references that are extremely valuable resources for learning, and the above list by no means overshadows these. The DevDays London conference opened my eyes to new areas and, as per an earlier blog post, I'm expecting great things from QCon London. My intention is to feed back into this pool of resources with this blog.

I encourage any feedback as it will only improve the content I produce. In return, I will promise to try and keep bad jokes to a minimum. Although in my experience, telling bad jokes is part and parcel of being a software developer.

Friday, 12 February 2010

Sorting a DataTable - LINQ performance

Whenever there are a number of ways to achieve the same goal, I'm always inquisitive as what the performance difference is between them. I'm a firm believer in thinking about scalability from the start - if you can make your best effort to prepare for scale, then you can save yourself time and effort further down the line. I like to try and avoid pain - it doesn't agree with me.

In this instance, I became curious about sorting an in-memory DataTable, using C# / .NET 3.5. So I decided to run a few tests using the common approaches. Now, you may be thinking "Why not just order the results from the database using an ORDER BY clause?". Well yes, you could do that. But what about when you want to cache some data once in memory, and then use that cached copy for subsequent purposes to prevent round-tripping back to/re-hitting the database server? Or, what if the data isn't actually coming from a database but some other source?

So I knocked up a quick test harness. For each method, I tested sorting a DataTable containing between 100 and 2 million rows of data. The DataTable contained 2 columns:
ColumnA - integer, just an incrementing number
ColumnB - string, in the format {Character}{Row Number} where {Character} just loops round from A-Z just to mix the records up a bit and give the need for ordering.

Method 1 - DataView.Sort
DataView vw = dt.DefaultView;
vw.Sort = "ColumnB ASC";

Method 2 - DataTable.Select
DataRow[] rows = dt.Select("", "ColumnB ASC");

Method 3 - LINQ to DataSet
var rows = (from r in dt.AsEnumerable()
orderby r["ColumnB"] ascending
select r).ToArray();
Note: the .ToArray() call in the LINQ above is important - it makes the query execute immediately. Without it, you are really just defining a query object; it does not execute until you request data from it - this is known as deferred execution. So, in this example, without the enclosing brackets and the subsequent .ToArray(), the data wouldn't actually be sorted at this point.
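To illustrate the deferred execution point with a minimal sketch (variable names are just for illustration):

```csharp
// At this point nothing has been sorted - "query" is only a definition
var query = from r in dt.AsEnumerable()
            orderby r["ColumnB"] ascending
            select r;

// The sort actually happens here, when the results are materialised
DataRow[] sortedRows = query.ToArray();
```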

Here's the results:
No. Of Rows | Method 1 - DataView.Sort | Method 2 - DataTable.Select | Method 3 - LINQ
100         | 0.0625s                  | 0.0520s                     | 0.0475s
1,000       | 0.0781s                  | 0.0573s                     | 0.0573s
10,000      | 0.1618s                  | 0.1094s                     | 0.0989s
100,000     | 1.4793s                  | 0.8959s                     | 0.7084s
1,000,000   | 16.1318s                 | 9.8290s                     | 8.4534s
2,000,000   | 35.0940s                 | 21.5995s                    | 18.3420s

As you can see from my tests, LINQ to DataSet came out tops. With a smaller DataTable the difference is, as you'd expect, minimal. Though as the volume of rows increases, LINQ seems to keep out-performing the other two approaches, being nearly 100% quicker than a DataView Sort as you get to the level of hundreds of thousands of rows, and about 14% quicker than a DataTable.Select.

Thursday, 11 February 2010

SQL Server 2008 - Table Valued Parameters vs XML vs CSV

The scenario is, you want to create a stored procedure that returns the records relating to a finite set of keys/IDs that you pass in to it. How do you do it? Well you could use dynamic SQL, but for the purpose of this article I'm going to ignore that option.

Test scenario:
-- Create basic customer table
CREATE TABLE [dbo].[Customer]
(
    Id INTEGER IDENTITY(1,1) PRIMARY KEY,
    Name VARCHAR(50)
)

-- Populate Customer table with 100,000 sample records
DECLARE @Counter INTEGER
SET @Counter = 1
WHILE (@Counter <= 100000)
BEGIN
    INSERT Customer (Name)
    VALUES ('Test Customer #' + CAST(@Counter AS VARCHAR(10)))
    SET @Counter = @Counter + 1
END

Option 1 - CSV list of keys
In SQL Server 2000, you would most likely have ended up passing in a CSV list of keys, splitting that out into a table, and then joining that on to your real data table to match the records to return. dbo.fnSplit is a user defined function that iterates through the string and splits each value out into a TABLE variable which is then returned. To save space, I haven't supplied the definition of that, but there's plenty of examples of this function out there via a quick Google.
CREATE PROCEDURE [dbo].[uspGetCustomersCSV]
    @CustomerIDs VARCHAR(8000)
AS
SELECT c.Id, c.Name
FROM [dbo].[Customer] c
JOIN dbo.fnSplit(@CustomerIDs, ',') t ON c.Id = t.item

Example Use:
EXECUTE [dbo].[uspGetCustomersCSV] '1,10,100'

Option 2 - XML
SQL Server 2005 added the XML datatype, which allows you to pass in an XML blob containing the keys and using the new built-in XML support to incorporate it into a JOIN.
CREATE PROCEDURE [dbo].[uspGetCustomersXML]
    @CustomerIDs XML
AS
SELECT c.ID, c.Name
FROM [dbo].[Customer] c
JOIN @CustomerIDs.nodes('IDList/ID') AS x(Item) ON c.ID = Item.value('.', 'int')

Example Use:
EXECUTE [dbo].[uspGetCustomersXML] '<IDList><ID>1</ID><ID>10</ID><ID>100</ID></IDList>'

Option 3 - Table Valued Parameters
Now in SQL Server 2008, there's a new kid on the block. Table Valued Parameters. In a nutshell, this is the ability to pass a TABLE type in as a parameter, which previously was not allowed. The beauty of this, is it allows you to focus on writing the query in a more natural manner - no string parsing/manipulation, no XML functionality. Plain, simple, SQL.

First you create a new TYPE, defining the TABLE structure.
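The type definition itself isn't shown above; a minimal sketch consistent with the procedure that follows (the single Id column is inferred from the join) would be:

```sql
-- Sketch only: a table type holding the set of customer IDs to match
CREATE TYPE CustomerIDTableType AS TABLE
(
    Id INT PRIMARY KEY
);
```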

Next you create your stored procedure, taking that new type as a parameter.
CREATE PROCEDURE [dbo].[uspGetCustomersTable]
    @CustomerIDs CustomerIDTableType READONLY
AS
SELECT c.ID, c.Name
FROM [dbo].[Customer] c
JOIN @CustomerIDs t ON c.Id = t.Id

Example Use:
DECLARE @Ids CustomerIDTableType
INSERT @Ids VALUES (1), (10), (100)
EXECUTE [dbo].[uspGetCustomersTable] @Ids

If you want to call this from .NET, it's quite simple. You define the parameter as SqlDbType.Structured, and the value can be set to a DataTable, a DbDataReader or an IEnumerable&lt;SqlDataRecord&gt;.
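As a sketch of what that .NET call might look like (an open SqlConnection "connection" and a DataTable "idsTable" with a single int column are assumed; the parameter and type names mirror the examples above):

```csharp
using (SqlCommand cmd = new SqlCommand("dbo.uspGetCustomersTable", connection))
{
    cmd.CommandType = CommandType.StoredProcedure;

    // Structured = Table Valued Parameter
    SqlParameter p = cmd.Parameters.AddWithValue("@CustomerIDs", idsTable);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.CustomerIDTableType";

    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // process each matched customer row
        }
    }
}
```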

The following stats are the average figures over 3 runs, against the 100,000 row table, retrieving just 7 rows of data for keys 1, 10, 200, 3000, 40000, 50000 and 90000. Each time, the cache was fully cleared to allow fair comparisons.

Option 1 (CSV): Duration=123ms, Reads=107, Writes=1, CPU=15
Option 2 (XML): Duration=80ms, Reads=69, Writes=0, CPU=5
Option 3 (TVP): Duration=69ms, Reads=27, Writes=0, CPU=0

This was quite a modest test, and by no means an absolute stress test. But the results look very promising for Table Valued Parameters. But not only that. I like them, because they let you get on with writing straightforward, maintainable queries.

Wednesday, 10 February 2010

QCon London 2010

So QCon London is fast approaching, starting with 2 days of tutorials on the 8th and 9th of March, then the 3 day conference from 10th-12th. This will be my first time at QCon and I'm really looking forward to it. Having been to few conferences in the past, the level of expectation I have is solely based on the StackOverflow DevDay in London last October. That was a great day with some very interesting (and humorous) talks, so I'm expecting even better things from QCon as its reputation precedes it!

I've been keeping an eye on the schedule as it has evolved over the past few months and am starting to wish I could split myself into a number of mini-me's to fan out to a number of the tracks. Also because it would allow me to maximise the collection of freebies....after all, isn't that what conferences are all about - the free stuff? Guaranteed to turn even the most restrained people into kleptomaniacs. It's all about the free pens.

At the moment, I think I'm pretty much decided on what tracks to attend for the Wednesday and Thursday; still undecided on Friday:

Wednesday: Architectures you've always wondered about
Thursday: AlphaGeeks on .NET
Friday: SOA 2010 or Browser as a Platform

Performance and scalability have been the flavour of 2010 so far for me, spending a lot of time thinking about how to scale up in the world of .NET and SQL Server. So I'm particularly interested in talks like "Scaling applications: Complexity, Performance, Maintainability" - keen to open my eyes to the likes of Rhino Service Bus, NHibernate and Windsor.

If you're going to QCon, what are you looking forward to? Or have you been before and got any tips for a newbie? Feel free to drop me an email.

Tuesday, 9 February 2010

Would you like SQL cache with that?

One of the things I feel I keep badgering on about in the world of SQL Server and query tuning is to be fair and consistent when comparing the different possible variants. If you don't level the playing field, then potentially it will lead to you thinking a particular query out-performs another when in fact the opposite could be true. It's a bit like comparing 2 paper boys to see which one can complete a given round the quickest - if one of them has been doing the round for the past 2 weeks but the other has never done that round before, then are you going to be able to reliably tell which one is actually quickest?


One of them knows the route, the houses, and which houses have the dog that will chew their hand off given half the chance, but the other paper boy has no such knowledge. This puts him at an unfair disadvantage and almost certainly he will be slower. So until you compare them fair and square, from the same starting point, then you won't truly know who is quicker.

This is true when query tuning. So how do I do fair comparisons of queries? Well, nothing ground-breaking:

Clear the data cache and execution plan cache.
The first time you run a query an execution plan will be generated and stored in the execution plan cache. The data will be held in SQL Server's data cache/buffer. If you then run it for a second time, there is already an execution plan to reuse and the data is in the cache meaning it doesn't need to be re-read from disk. This can give vast improvements in query performance.

CHECKPOINT -- force dirty pages in the buffer to be written to disk
DBCC DROPCLEANBUFFERS -- clear the data cache
DBCC FREEPROCCACHE -- clear the execution plan cache

So, if you don't clear the cache down between tests, then it means you will be getting skewed results. A second different query may appear to be more performant, when in fact it's benefiting unfairly from the primed cache created by the first query.

The data cache is usually the biggest contributor to the performance improvement. If you want to get an idea for what is in the data cache, here's a basic query to run in your database that will show the number of cached data pages by table. This is a slightly adapted version of a query available here in MSDN (strips out the sys table figures) for SQL Server 2005 and above:

SELECT t.name AS TableName, COUNT(*) AS NumberOfCachedPages
FROM sys.dm_os_buffer_descriptors AS bd
INNER JOIN
(
    SELECT object_id, allocation_unit_id
    FROM sys.allocation_units AS au
    INNER JOIN sys.partitions AS p
        ON au.container_id = p.hobt_id
        AND (au.type = 1 OR au.type = 3)
    UNION ALL
    SELECT object_id, allocation_unit_id
    FROM sys.allocation_units AS au
    INNER JOIN sys.partitions AS p
        ON au.container_id = p.partition_id
        AND au.type = 2
) AS obj
    ON bd.allocation_unit_id = obj.allocation_unit_id
INNER JOIN sys.tables t ON obj.object_id = t.object_id
WHERE database_id = DB_ID()
GROUP BY t.name
ORDER BY NumberOfCachedPages DESC;

Running this after clearing the cache, and then after running a query, will soon give you at least some idea of what is being cached. I recommend having a look at sys.dm_os_buffer_descriptors.

So, this is why whenever I want to try out different versions of a query for performance, I like to follow a standard approach:
1) start a trace in SQL Profiler, filtered down to just my process (by user name)
2) clear the cache
3) run query #1
4) record the reads, writes, CPU and duration from profiler
5) repeat steps 2 to 4, at least once more
6) run query #1 again but without clearing the cache - this is because I like to see how the query runs with a hot/primed cache
7) repeat steps 2 to 6 for query #2
8) compare the stats

I find it good to try each query a number of times, to get an average performance. Note up until now I've been talking about performance in terms of raw query duration. I also monitor reads/writes/CPU as they are good to keep an eye on too. Where two queries run in about the same time, there may be a more sizeable difference in the other stats to help choose a winner. Of course, you may want to optimise for reads or CPU at the expense of overall speed, but generally it's duration that is of the most interest.

So when I see questions asking "why does query X run faster than query Y?", I feel it's worth checking whether the tests have been done from a level playing field if it's not stated. Usually the answer is yes, but on the occasion where they haven't been done in a fair manner it's well worth the gentle nudge about clearing the cache.

Pulling my finger out

Starting up a blog is something I've been threatening to do for some time now. There's always a good excuse not to get round to it, even though it's something I've really wanted to dip my toe into, just to see if I could cut the mustard.

As it's the start of what I hope will be a successful foray into blogging, I'm afraid this is indeed a "what do you do and where do you come from?" type post.

C# Developer, SQL Server Developer, StackOverflow enthusiast.


I work for a software development house on the South Coast of England, predominantly on backend solutions using C# and SQL Server. So a lot of what I do is the type of stuff that if it works well, the end user shouldn't know it's there - it's invisible to them; it's where the magic happens, 24/7 quietly plugging away. However, if it doesn't work so well, then everyone knows about it. There's no hiding.

It's the challenge of developing robust, scalable, high performance solutions that I get a kick out of as well being pushed technically to overcome hurdles and broaden my horizons. There's always something to learn in this profession.

Which leads on nicely to why I'm dipping my toes into the world of blogging. It's going to serve as a place to share snippets of information, things that niggle or challenge me and basically anything that there's a chance other people may find useful (isn't that the whole point of a blog?!). I forget things, so at least if I blog about them I'll have a record of them.

So consider me well and truly pulling my finger out.