In-Memory Queues: Performance comparison between the CCR and the .NET Framework’s ConcurrentQueue

01/05/2010


 

Update: See the comments below for an important point about the 50,000-byte / 100,000-message case.

Code for this post is available here.  See below for the link to the download containing the CCR DLL.

So here’s the problem: you need an in-memory queue in your application that can handle a large number (several thousand) of messages per second, using several enqueuing and dequeuing threads.  If you’re using .NET Framework 4.0, you would immediately go for the new ConcurrentQueue class available in the System.Collections.Concurrent namespace.  But what if you’re not using 4.0?  Well, you could download the Reactive Extensions for .NET and get a back-ported (and unsupported) version of Parallel Extensions for .NET 3.5 that contains this class.  But there are other options.
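
For reference, here is a minimal sketch (not taken from this post’s benchmark code) of what multi-threaded ConcurrentQueue use looks like – one producer task and one consumer task sharing the queue with no explicit locking:

```csharp
// Minimal sketch of multi-threaded ConcurrentQueue use (.NET 4.0 or the
// back-ported Parallel Extensions). Not taken from this post's benchmark code.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ConcurrentQueueSketch
{
    static void Main()
    {
        var queue = new ConcurrentQueue<string>();
        const int total = 1000;

        // One task enqueues...
        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < total; i++)
                queue.Enqueue("message " + i);          // thread-safe enqueue
        });

        // ...while another dequeues at the same time, with no explicit locking.
        var consumer = Task.Factory.StartNew(() =>
        {
            int received = 0;
            string message;
            while (received < total)
                if (queue.TryDequeue(out message))      // non-blocking; returns false when empty
                    received++;
            Console.WriteLine("Received {0} messages", received);
        });

        Task.WaitAll(producer, consumer);
    }
}
```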

 

Enter the CCR

The Concurrency & Coordination Runtime (CCR) is a little-known gem contained in the Microsoft Robotics Toolkit (available here as a downloadable express version).  It supplies the developer with a rich set of abstractions that make it possible to write concurrent and asynchronous programs (the kind that drive robots with a bunch of sensors) in a coherent manner.  One of these abstractions is Port<T> – the CCR’s version of the thread-safe queue.  The port is the essential building block of the CCR – everything you do eventually boils down to reading and writing messages on a port, so naturally it has to be fast.  A complete description of the various possibilities and capabilities of the CCR is outside the scope of this post – but it’s well worth your while to read the material here and Nick’s series here.  If there is sufficient interest, I might do some more posts on it.
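
To give a feel for the Port<T> API, here is a minimal sketch of posting messages and attaching a persistent receiver.  The types used (Dispatcher, DispatcherQueue, Port<T>, Arbiter) are the standard ones from Microsoft.Ccr.Core; the example itself is illustrative and is not part of the benchmark code:

```csharp
// Minimal sketch of a CCR Port<T>: post messages and attach a persistent receiver.
// Assumes a reference to Microsoft.Ccr.Core.dll from the Robotics download linked above;
// the example is illustrative, not part of the benchmark code.
using System;
using Microsoft.Ccr.Core;

class PortSketch
{
    static void Main()
    {
        using (var dispatcher = new Dispatcher(0, "demo"))            // 0 = one thread per core
        using (var taskQueue = new DispatcherQueue("demoQueue", dispatcher))
        {
            var port = new Port<int>();

            // Persistent receiver: the handler runs once for every message posted to the port.
            Arbiter.Activate(taskQueue,
                Arbiter.Receive(true, port, msg => Console.WriteLine("Got {0}", msg)));

            for (int i = 0; i < 10; i++)
                port.Post(i);                                         // enqueue a message

            Console.ReadLine();                                       // wait so the handlers can run
        }
    }
}
```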

 

Back to Business

OK, so we’re using .NET 4.0, including all the cool new concurrent collections and parallel task stuff.  But what we want is SPEED.  Should we even bother with the CCR if we don’t need any of the coordination primitives?  This is what I set out to discover. 

The enclosed code (minus the CCR reference – see the download link above) simply adds a number of messages to either a Port or a ConcurrentQueue and measures how long it takes to remove them, using multiple threads for both the enqueue and the dequeue operations.  Notice that it doesn’t do any fancy processing – just a simple ‘in and out’ for timing the raw performance of the two data structures.  I ran the code on my three-year-old desktop machine, with 2GB of memory and a 2.4GHz Intel Q6600 quad-core processor.  Each message size / total number of messages combination was run for 30 iterations.  Then, the maximum and minimum values were removed and the average was calculated.
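
To make the setup concrete, here is a rough sketch of the ConcurrentQueue half of such a harness – not the actual enclosed code, and the message size and count below are just example values:

```csharp
// Rough sketch of the ConcurrentQueue half of such a harness: fill the queue from
// multiple threads, then time how long multiple threads take to drain it.
// Not the actual enclosed code; message size and count are example values only.
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

class BenchmarkSketch
{
    static void Main()
    {
        const int messageCount = 100000;
        const int messageSize = 500;                   // bytes per message

        var queue = new ConcurrentQueue<byte[]>();

        // Enqueue phase: Parallel.For blocks until all messages have been added.
        Parallel.For(0, messageCount, i => queue.Enqueue(new byte[messageSize]));

        // Dequeue phase: time the removal of every message using multiple threads.
        var stopwatch = Stopwatch.StartNew();
        Parallel.For(0, messageCount, i =>
        {
            byte[] message;
            while (!queue.TryDequeue(out message)) { } // retry until an item is available
        });
        stopwatch.Stop();

        Console.WriteLine("Dequeued {0} x {1}-byte messages in {2:F3} ms",
                          messageCount, messageSize, stopwatch.Elapsed.TotalMilliseconds);
    }
}
```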

The raw data is as follows: (Columns denote the total number of messages, rows the size in bytes of each message.  Times are in milliseconds.)

 

CCR Performance:

                    100 msgs    1,000 msgs    10,000 msgs    100,000 msgs
500 bytes           3.044       4.395         27.853         285.702
5,000 bytes         2.742       17.889        201.673        2,473.820
50,000 bytes        7.910       110.756       1,192.207      12,637.490

ConcurrentQueue Performance:

 

                    100 msgs    1,000 msgs    10,000 msgs    100,000 msgs
500 bytes           0.123       0.349         4.431          202.330
5,000 bytes         0.263       5.638         166.374        1,819.114
50,000 bytes        1.845       135.225       1,419.670      *

 

* At this point, available memory on my machine was exhausted and the machine started thrashing to disk.  There was no sense in continuing the experiment.

The same data in graphical form:

 

[Three charts plotting the data above – omitted]

 

Interesting points observed during the experiment:

  • The CCR benchmarks consistently drove the CPU to 100% usage, whereas the ConcurrentQueue benchmarks held it somewhere between 40% and 60%, with the higher figures appearing at the larger message sizes.
  • The ConcurrentQueue quite obviously outperforms the CCR ports at smaller message sizes and smaller message counts – yet at a message size of 50,000 bytes, somewhere between 100 and 1,000 total messages, the trend reverses and the CCR takes the lead.

 

Conclusion

Since I did not complete the experiment (I’m still missing a data point), I can’t really be certain.  However, it would seem that the choice depends on the total number of expected messages and the individual message size.  If your message size is below 50K, the answer is definitely ‘use ConcurrentQueue’.  I’ll have to dig deeper into the 100 – 1,000 total-message and 5,000 – 50,000-byte message-size ranges for more definitive information.

 

Code for this post is available here.


Comments

  1. bnaya, 01/05/2010 at 19:04

    Very nice.
    Some of the diagrams seem cut off at the right end.

    Reply
  2. nhanks, 02/05/2010 at 16:24

    Good stuff indeed – I would also wonder what impact the GC had on this. In 4.0 you now have the background concurrent GC, and this is not available with the CCR in 3.5.

    Reply
  3. Vish, 03/05/2010 at 18:54

    Is the CCR still under active development?

    Thank You,
    Vish

    Reply
  4. yuvmaz, 04/05/2010 at 18:13

    @nhanks: Thanks, that’s a good point. I’ll have to look into that.

    @Vish: Look here for your answer – http://social.msdn.microsoft.com/Forums/en-US/roboticsccr/thread/de3f596a-e6fa-4431-b6ae-57afe1d051b1

    Y.

    Reply
  5. Cindy Song, 06/05/2010 at 18:50

    Hi Yuval, the benchmarks of ConcurrentQueue and CCR are not really equivalent. In the ConcurrentQueue code you add all the items first (Parallel.For is blocking) and then remove them, while in the CCR code the adding and removing happen at the same time.

    This explains the memory behavior of ConcurrentQueue when the buffer size is 50,000 bytes, because it is indeed eating up memory.

    Reply
  6. yuvmaz, 07/05/2010 at 21:22

    Hi Cindy,
    You’re right, of course. And this after I specifically used Parallel.For to make sure that both queues would be filled in the same way…
    The proper fix is to activate the CCR Arbiter only after the Parallel.For call in the CCR benchmark. After running the experiment again, I indeed get thrashing for the 50,000 / 100,000 combination in both cases. However, the rest of the results are pretty much the same.
    Thanks for the good catch!

    Reply