Friday, 12 February 2010

Sorting a DataTable - LINQ performance

Whenever there are a number of ways to achieve the same goal, I'm always inquisitive as what the performance difference is between them. I'm a firm believer in thinking about scalability from the start - if you can make your best effort to prepare for scale, then you can save yourself time and effort further down the line. I like to try and avoid pain - it doesn't agree with me.

In this instance, I become curious about sorting an in-memory DataTable, using C# / .NET 3.5. So I decided to run a few tests using the common approaches. Now, you may be thinking "Why not just order the results from the database using an ORDER BY clause?". Well yes, you could do that. But what about when you want to cache some data once in memory, and then use that cached copy for subsequent purposes to prevent round-tripping back/re-hitting the database server? Or, what if the data isn't actually coming from a database but some other source?

So I knocked up a quick test harness. For each method, I tested sorting a DataTable containing between 100 and 2 million rows of data. The DataTable contained 2 columns:
ColumnA - integer, just an incrementing number
ColumnB - string, in the format {Character}{Row Number} where {Character} just loops round from A-Z just to mix the records up a bit and give the need for ordering.

Method 1 - DataView.Sort
DataView vw = dt.DefaultView;
vw.Sort = "ColumnB ASC";

Method 2 - DataTable.Select
DataRow[] rows = dt.Select("", "ColumnB ASC");

Method 3 - LINQ to DataSet
var rows = (from r in dt.AsEnumerable()
orderby r["ColumnB"] ascending
select r).ToArray();
Note: the .ToArray() bit in the LINQ above is important - this makes the execution of the query immediate. Without it, what you are actually really doing is just defining a query object. It does not execute until you try to request data from the query object - this is known as deferred execution. So, in this example, without the enclosing brackets and the subsequent .ToArray(), the data wouldn't actually be being sorted at this point.

Here's the results:
No. Of RowsMethod 1 - DataView.SortMethod 2 - DataTable.SelectMethod 3 - LINQ
100 0.0625s 0.0520s 0.0475s
1000 0.0781s 0.0573s 0.0573s
10,000 0.1618s 0.1094s 0.0989s
100,000 1.4793s 0.8959s 0.7084s
1,000,000 16.1318s 9.8290s 8.4534s
2,000,000 35.094s 21.5995s 18.3420s

As you can see from my tests, LINQ to DataSet came out tops. With a smaller DataTable the difference is, as you'd expect, minimal. Though as the volume of rows increases, LINQ seems to keep out-performing the other two approaches, being nearly 100% quicker than a DataView Sort as you get to the level of hundreds of thousands of rows, and about 14% quicker than a DataTable.Select.

11 comments:

  1. Nice comparison buddy :) I am already in love with linq for its sheer elegance and speed. However, I am not a very big fan of SQL like syntax of LINQ. The extension method style is more appropriate for me.

    ReplyDelete
  2. Interesting idea to do the comparison and to be honest the results surprised me a bit. Can you also mention how you generated the data that you used for testing (the same for all 3 methods or each time randomly generated), if the numbers in the table are from 1 test only or averages, and how you measured the execution time for the sorting methods
    Thanks :)

    ReplyDelete
  3. @Dejan - the data was the same each time, that way I knew I was comparing like for like.

    The stats are averages taken from 3 runs, where I was just capturing DateTime.Now before and immediately after the sort process.

    ReplyDelete
  4. Hi

    I tried the above on 500,000 rows x 13 cols and actually found "Method 3 - LINQ to DataSet" to be the SLOWEST consistantly.

    Method 1 was the fastest where the first call was around 3 secs and subsequent calls down to 150 ms.

    The Linq calls came in around 3 secs and subsequent calls stayed around 3 secs....

    ReplyDelete
  5. @Anonymous

    Thanks for the comment. I'd be interested to know what stats you see for fewer columns (i.e. I tried just 2). I will try and repeat my tests with a DataTable containing more columns in, to see what the effect of having more data columns has.

    ReplyDelete
  6. the timings are roughly the same with 2 columns (with same number of rows).

    All the columns in my DataTable are "String" types and have not added any indexes etc.

    ReplyDelete
  7. @Anonymous
    I've tried the same test again, but this time with 10 columns of data instead of 2. I consistently get the same results with LINQ coming out just ahead of DataTable.Select followed by DataView.Sort.

    ReplyDelete
  8. AdaTheDev,
    You saved my day, thanks to you. Just the few lines of LINQ and the comparison chart, I was completely sold on it. Best possible search and solution. Thanks once again.

    LotusShiv

    ReplyDelete
  9. @Anonymous
    I get similar performance to Adrian.
    A combination of Threading and LINQ has made me the
    'Wizard of Speed' here at work on some particularly sticky data models with 'reqs' of , "We want to be able to spread the grid across three monitors." ... Not to mention the custom Vertical grid I created and nested in a DataRepeater ...

    Here is a 'link' to some more LINQ testing you may find interesting...
    http://msdn.microsoft.com/en-us/library/dd364983.aspx

    ReplyDelete
  10. awesome comparison..exactly what i was looking out for.Thanks alot.

    ReplyDelete
  11. In your examples, you Method 1 and Method 2 are backwards based on their title and function.

    ReplyDelete