In-Memory Queues: Performance comparison between the CCR and the .NET Framework’s ConcurrentQueue
Update: See the notes for an important point about the 50,000 bytes / 100,000 messages case.
Code for this post is available here. See below for the link to the download containing the CCR DLL.
So here’s the problem: you need an in-memory queue in your application that can handle processing a large number (several thousands) of messages per second, using several enqueuing and dequeuing threads. If you’re using .NET Framework 4.0 then you would immediately go for the new ConcurrentQueue class available in the System.Collections.Concrurent namespace. But what if you’re not using 4.0? Well, you could download the Reactive Extensions for .NET and get a back-ported (and unsupported) version of Parallel Extensions for .NET 3.5 that contains this class. But there are other options.
Enter the CCR
The Concurrency & Coordination Runtime (CCR) is a little-known gem contained in the Microsoft Robotics Toolkit (available here as a downloadable express version). It supplies the developer with a rich set of abstractions that make it possible to write concurrent and asynchronous programs (the kind that drive robots with a bunch of sensors) in a coherent manner. Among one of these abstractions is Port<T> – the CCR’s version of the thread-safe queue. The port is the essential building block of the CCR – everything you do eventually boils down to reading and writing messages on a port, so naturally it has to be fast. A complete description of the various possibilities and capabilities of the CCR is outside the scope of this post – but it’s well worth your while to read the material here and Nick’s series here. If there is sufficient interest, I might do some more posts on it.
Back to Business
OK, so we’re using .NET 4.0, including all the cool new concurrent collections and parallel task stuff. But what we want is SPEED. Should we even bother with the CCR if we don’t need any of the coordination primitives? This is what I set out to discover.
The enclosed code (minus the CCR reference – see the download link above) simply adds a number of messages to either a Port or a ConcurrentQueue and measures how long it takes to remove them, using multiple threads both during the enqueue and the dequeue operation. Notice that it doesn’t do any fancy processing – just a simple ‘in and out’ for timing the raw performance of the two data structures. I ran the code on my 3-year old desktop machine, with 2GB memory and an 2.4GHz Intel Q6600 quad-core processor. Each message size/total number of messages combination was run for 30 iterations. Then, the maximum and minimum values were removed and the average was calculated.
The raw data is as follows: (Columns denote the total number of messages, rows the size in bytes of each message. Times are in milliseconds.)
CCR Performance:
| | 100 | 1000 | 10000 | 100000 | | 500 | 3.044 | 4.395 | 27.853 | 285.702 | | 5000 | 2.742 | 17.889 | 201.673 | 2,473.820 | | 50000 | 7.910 | 110.756 | 1,192.207 | 12,637.490 | |
|
|
|
|
|
|
|
|
|
|
|
ConcurrentQueue Performance:
| | 100 | 1000 | 10000 | 100000 |
| 500 | 0.123 | 0.349 | 4.431 | 202.330 |
| 5000 | 0.263 | 5.638 | 166.374 | 1,819.114 |
| 50000 | 1.845 | 135.225 | 1,419.670 | * |
* At this point, available memory on my machine was exhausted and the machine started thrashing to disk. There was no sense in continuing the experiment.
The same data in graphical form:


Interesting points observed during the experiment:
- The CCR benchmarks would consistently take the CPU to 100% use, whereas the ConcurrentQueue benchmarks would hold the CPU somewhere between 40% and 60% use, with the larger number appearing with the larger memory sizes.
- The ConcurrentQueue quite obviously outperforms the CCR ports at smaller message sizes and at a smaller number of messages – yet with a message size of 50,000 bytes, with a total message count of somewhere between 100 and 1,000 the trend reverses and the CCR takes the lead.
Conclusion
Seeing as how I did not conclude the experiment (I’m still missing a data point), I can’t really be certain. However, it would seem that the choice depends on the number of total expected messages and the individual message size. If you’re below a message size of 50K – the answer is definitely ‘use ConcurrentQueue’. I’ll have to dig in deeper into the 100 – 1,000 total message and the 5,000 – 50,000 bytes message size ranges for more definitive information.
Code for this post is available here.