Wednesday, 12 January 2011

SqlBulkCopy to SQL Server in Parallel

In an earlier post last year, I blogged about high performance bulk loading to SQL Server from .NET using SqlBulkCopy. That post highlighted the performance gain that SqlBulkCopy gives over another batched insert approach using an SqlDataAdapter. But is it possible to squeeze more performance out? Oh yes.

First, a quick recap. For optimal performance:
  • load into a heap table (with no indexes - add any indexes you need AFTER you've loaded the data)
  • specify the TableLock SqlBulkCopy option. This results in bulk update locks being taken on the destination table instead of row locks. It's key to what I'm going to focus on in this post, as I'll go on to explain. Relevant reading:
    - SqlBulkCopyOptions Enumeration MSDN documentation
  • pave the way for minimal logging by setting the database to the Bulk-Logged recovery model (or Simple). Under the Full recovery model all inserts are fully logged in the transaction log, whereas under Bulk-Logged, certain bulk operations can be minimally logged, reducing transaction log growth. SqlBulkCopy is one such operation. Relevant reading:
    - Prerequisites For Minimal Logging in Bulk Import MSDN reference
    - Operations That Can Be Minimally Logged MSDN reference
    - SQL Server Myth A Day : Bulk-Logged Recovery Model by Paul Randal (Twitter)
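For reference, switching the recovery model is a one-line change. A minimal sketch (the database name here is just a placeholder - substitute your own):

```sql
-- Switch to Bulk-Logged recovery so bulk operations can be minimally logged.
-- If you normally run Full recovery, switch back and take a log backup afterwards.
ALTER DATABASE MyLoadTestDb
SET RECOVERY BULK_LOGGED;
```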
Get squeezing
How can we squeeze more throughput in bulk inserts using SqlBulkCopy?

Multiple bulk update locks on the table from different processes can be taken out at the same time without blocking each other, which means you can scale out the bulk loading across multiple clients in parallel. There will be limitations of course, such as the network connection and the speed of the disks behind the database, but performance gains are very much possible.

Instead of loading data from a single machine, split the data over a number of machines, each one loading its own chunk of the data to the database using SqlBulkCopy as previously outlined. Finding the optimal distribution of data/number of clients may take a bit of investigation, but the rewards can pay dividends.

Pimp My Test Environment
Unfortunately I don't have a server rack at home housing multiple servers with a fast IO subsystem - hey, call me a cheapskate. What I do have is a laptop with an i5-560M (2.67GHz dual core, 4 threads), 8GB RAM and a 7200RPM HDD. That's how I roll.

What I've done is to knock up a quick command-line app in .NET 4.0 using the new Task Parallel Library (TPL) to split the loading of data to SQL Server across 4 threads in parallel, each using a separate database connection and SqlBulkCopy instance. To quote MSDN:
The TPL scales the degree of concurrency dynamically to most efficiently use all the processors that are available. In addition, the TPL handles the partitioning of the work, the scheduling of threads on the ThreadPool, cancellation support, state management, and other low-level details. By using TPL, you can maximize the performance of your code while focusing on the work that your program is designed to accomplish.

Starting with the .NET Framework 4, the TPL is the preferred way to write multithreaded and parallel code.

I loaded 20 million rows of data from an in-memory DataTable to a heap table in SQL Server 2008 with 2 columns: FieldA VARCHAR(10), FieldB INTEGER.
FieldA is just a random sequence of 10 characters, FieldB is an incrementing number.
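GetTestData() isn't shown in the snippets below, but here's a sketch of how one of those DataTables could be built. Note this is just my illustrative version - the column names match the Dataload table, but the random string generation details are an assumption on my part:

```csharp
using System;
using System.Data;

public static class TestData
{
    private static readonly Random Rand = new Random();
    private const string Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Builds a DataTable shaped like the Dataload table:
    // FieldA VARCHAR(10) - a random sequence of 10 characters
    // FieldB INTEGER     - an incrementing number
    public static DataTable BuildTable(int rowCount, int startAt)
    {
        DataTable table = new DataTable();
        table.Columns.Add("FieldA", typeof(string));
        table.Columns.Add("FieldB", typeof(int));

        char[] buffer = new char[10];
        for (int i = 0; i < rowCount; i++)
        {
            for (int j = 0; j < buffer.Length; j++)
            {
                buffer[j] = Chars[Rand.Next(Chars.Length)];
            }
            table.Rows.Add(new string(buffer), startAt + i);
        }
        return table;
    }
}
```

For the parallel test, this would be called 4 times with 5 million rows each, offsetting startAt so FieldB keeps incrementing across the chunks.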

I ran 2 tests, recreating the database fresh before each run:
Test 1: 1 DataTable of 20 million rows, loaded single threaded. LOAD TIME: 15.4088814s
Test 2: 4 DataTables of 5 million rows, loaded in parallel. LOAD TIME: 12.3377056s

Given that this was all running on a single machine (a laptop at that), a 20% reduction in loading time is not too shabby at all! On production-quality hardware, with multiple clients separate from the database server, loading in parallel is well worth bearing in mind and could speed up your data loading process.

Snippets
To give a brief taster of the .NET approach I used, here are the key parts of the code.
public void ParallelLoadTest()
{
    // GetTestData() is just a basic method that returns a List<> of 4 DataTables.
    // Each DataTable is populated with 5 million rows.
    List<DataTable> tables = GetTestData();

    // For each of the 4 DataTables, kick off a bulk load task to run in parallel
    Parallel.ForEach(tables, table =>
        {
            BulkLoadData(table);
        }
    );
}

public void BulkLoadData(DataTable dt)
{
    using (SqlConnection conn = new SqlConnection("{removed connectionstring}"))
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null))
    {
        bulkCopy.DestinationTableName = "Dataload";
        bulkCopy.BulkCopyTimeout = 60;
        bulkCopy.ColumnMappings.Add("FieldA", "FieldA");
        bulkCopy.ColumnMappings.Add("FieldB", "FieldB");
        conn.Open();
        bulkCopy.WriteToServer(dt);
    }
}

4 comments:

  1. Hi,

    I tried using this approach with multiple threads running in parallel, but seem to be getting a fair bit of blocking. Any tips?

    ReplyDelete
  2. Hi Mark,

    Saw your tweet where you said you had an index on the PK column. This could be the cause of the blocking, as for optimal performance/least blocking you should load into a table without indexes and add them afterwards. If you're appending into an existing table, is it viable to remove the index beforehand, do the load, and then re-add it afterwards? I can't say whether that would be a good idea without knowing e.g. the size of the table before loading, typical load volumes each time, etc.

    You could try loading into new partitions, though I don't have a post that covers this (yet).

    ReplyDelete
  3. Hi, thanks for the article.

    I'm bulk loading against a SQL Azure database and these tables are all required to have a clustered index, so I can't drop and re-create them. I'm getting blocking and deadlocks similar to Mark.

    I assume I could try something like temp tables for each db connection, but as I think about it, the data all has to flow at some point back to the main table, correct? And so I might just be delaying the problem.

    The weird part is that I'm only loading perhaps 5000 rows at a time, using row locking (not table locks), so I wouldn't anticipate deadlocks.

    Thanks again for the article and for any insights.

    ReplyDelete
    Replies
    1. @Chris - there's a similar question to this on the MSDN forums - see: http://social.msdn.microsoft.com/Forums/en-US/sqldataaccess/thread/932cd26c-53fc-49c0-b082-e7f5f05a9801
      and the referenced MSDN article: http://msdn.microsoft.com/en-us/library/ms186341(v=sql.100).aspx which states:

      "When indexes exist on a table, you cannot perform a parallel load operation by using the TABLOCK option."

      So to gain the most performance, you would need to remove the indexes first, I'm afraid. Re: row locking, note that when a statement takes roughly 5,000 or more row locks, they can be escalated straight up to a table lock. So perhaps try lowering the batch size to stay comfortably under that threshold.

      Hope this helps!

      Delete