C# Vectorization with Microsoft.Bcl.Simd

Tuesday, April 22, 2014

tl;dr: A couple of weeks ago at Build, the .NET/CLR team announced a preview release of a library, Microsoft.Bcl.Simd, that exposes a set of JIT intrinsics on top of CPU vector instructions (a.k.a. SIMD). This library relies on RyuJIT, another preview technology that aims to replace the existing JIT compiler. When using Microsoft.Bcl.Simd, you program against a vector abstraction that is then translated at runtime to the appropriate SIMD instructions your processor supports. In this post, I'd like to take a look at what exactly this SIMD support is about, and show you some examples of what kind of...
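A minimal sketch of the programming model the post describes, written against the System.Numerics.Vector&lt;T&gt; API that this preview eventually shipped as (the preview's package and namespace details differed slightly, so treat the exact names as assumptions):

```csharp
using System;
using System.Numerics;

class SimdAdd
{
    // Adds two float arrays element-wise, Vector<float>.Count lanes at a time.
    // Vector<T> is the abstraction from the post: one source program,
    // JIT-compiled to SSE or AVX instructions depending on the host CPU.
    static void Add(float[] a, float[] b, float[] result)
    {
        int lanes = Vector<float>.Count;   // e.g. 4 with SSE, 8 with AVX
        int i = 0;
        for (; i <= a.Length - lanes; i += lanes)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(result, i);
        }
        for (; i < a.Length; i++)          // scalar tail for the leftovers
            result[i] = a[i] + b[i];
    }

    static void Main()
    {
        var a = new float[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        var b = new float[] { 9, 8, 7, 6, 5, 4, 3, 2, 1 };
        var r = new float[9];
        Add(a, b, r);
        Console.WriteLine(string.Join(",", r)); // 10,10,10,10,10,10,10,10,10
    }
}
```

The same source runs correctly on any width: only `Vector<float>.Count` changes, which is exactly the portability argument for the vector abstraction.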
no comments

Workshops at Sela Developer Practice, December 2013: Improving .NET Performance and .NET/C++ Interop Crash Course

Thursday, December 19, 2013

In addition to my three breakout sessions, I've also had the pleasure of delivering two workshops at the Sela Developer Practice: Improving .NET Performance and .NET/C++ Interop Crash Course. Although these workshops are quite time-tested, I always try to add new materials and tools to make them more interesting for both myself and the audience. There's also constant interest in these topics -- I had 110 people registered for the performance workshop and more than 40 people at the interop course. In the performance workshop, we cover various performance measurement tools. I always try to squeeze in new tools in...
no comments

Uneven Work Distribution and Oversubscription

Wednesday, October 23, 2013

A few days ago I was teaching our Win32 Concurrent Programming course and showed students an experiment with the std::thread class introduced in C++11. The experiment is designed to demonstrate how to partition work across multiple threads and coordinate their execution; the work to partition is simply counting the number of primes in a certain interval. You can find the whole benchmark here. The heart of the code is the parallelize_count function, below:

void parallelize_count(unsigned nthreads, unsigned begin, unsigned end) {
    std::vector<std::thread> threads;
    unsigned...
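The original benchmark is C++; here is a C# analogue of the same static partitioning scheme (helper names are mine, not the post's). It also hints at the "uneven work distribution" in the title: with equal-sized ranges, the thread holding the largest candidates does the most work, because trial division gets more expensive as n grows.

```csharp
using System;
using System.Threading;

class PrimeCount
{
    static bool IsPrime(uint n)
    {
        if (n < 2) return false;
        for (uint d = 2; d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // Static partitioning: each thread gets an equal-sized sub-range of
    // [begin, end). Equal ranges != equal work for primality testing.
    static int ParallelizeCount(int nthreads, uint begin, uint end)
    {
        var counts = new int[nthreads];
        var threads = new Thread[nthreads];
        uint chunk = (end - begin) / (uint)nthreads;
        for (int t = 0; t < nthreads; t++)
        {
            int id = t;                              // capture a copy, not the loop variable
            uint lo = begin + (uint)id * chunk;
            uint hi = id == nthreads - 1 ? end : lo + chunk;
            threads[id] = new Thread(() =>
            {
                int local = 0;
                for (uint n = lo; n < hi; n++)
                    if (IsPrime(n)) local++;
                counts[id] = local;                  // no sharing until the join
            });
            threads[id].Start();
        }
        foreach (var th in threads) th.Join();
        int total = 0;
        foreach (var c in counts) total += c;
        return total;
    }

    static void Main()
    {
        Console.WriteLine(ParallelizeCount(4, 2, 100)); // 25 primes below 100
    }
}
```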
no comments

On ‘stackalloc’ Performance and The Large Object Heap

Thursday, October 17, 2013

An interesting blog post is making the rounds on Twitter, 10 Things You Maybe Didn’t Know About C#.  There are some nice points in there, such as using the FieldOffset attribute to create unions or specifying custom add/remove accessors for events. However, item #4 on the list claims that using stackalloc is not faster than allocating a standard array. The proof is given in the form of a benchmark program that allocates 10,000-element arrays – so far so good – and then proceeds to store values in them. The values are obtained by using Math.Pow. The benchmark results...
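The methodological point can be sketched with a Math.Pow-free variant (sizes and iteration counts are arbitrary choices of mine): make the per-element work cheap, so allocation cost is actually visible rather than drowned out by an expensive math call.

```csharp
using System;
using System.Diagnostics;

class StackallocBench
{
    const int Size = 1024;
    const int Iterations = 100_000;

    // Heap version: a fresh array per call, so the GC sees 100,000 allocations.
    static long FillAndSumHeap()
    {
        int[] arr = new int[Size];
        for (int i = 0; i < Size; i++) arr[i] = i;
        long s = 0;
        for (int i = 0; i < Size; i++) s += arr[i];
        return s;
    }

    // Stack version: stackalloc memory lives until the method returns, which
    // is why the allocation sits in a helper method rather than a loop body.
    static long FillAndSumStack()
    {
        Span<int> arr = stackalloc int[Size];
        for (int i = 0; i < Size; i++) arr[i] = i;
        long s = 0;
        for (int i = 0; i < Size; i++) s += arr[i];
        return s;
    }

    static void Main()
    {
        long heap = 0, stack = 0;
        var sw = Stopwatch.StartNew();
        for (int it = 0; it < Iterations; it++) heap += FillAndSumHeap();
        long heapMs = sw.ElapsedMilliseconds;
        sw.Restart();
        for (int it = 0; it < Iterations; it++) stack += FillAndSumStack();
        long stackMs = sw.ElapsedMilliseconds;

        Console.WriteLine(heap == stack);  // True: identical work either way
        Console.WriteLine(heap);           // 100,000 * (0 + 1 + ... + 1023)
        Console.WriteLine($"heap: {heapMs} ms, stackalloc: {stackMs} ms");
    }
}
```

The timing line is illustrative only; the point is that once the filler work is cheap, the two variants measure allocation strategy rather than Math.Pow.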
2 comments

Talks from DevConnections 2013: Advanced Debugging with WinDbg and SOS, Task and Data Parallelism, and Garbage Collection Performance Tips

Thursday, October 10, 2013

I'm falling behind in documenting all my travels this fall :-) In the beginning of the month I flew out to Vegas for IT/DevConnections, which was my second Las Vegas conference this year. I was there for just 48 hours, but it was enough time to deliver three talks, meet fellow speakers, and even have a few meaningful chats with attendees about the future of .NET and production debugging techniques. You can find my presentations below -- the last couple of slides of each presentation have some additional references and books that might be useful if you want to expand...

Lock vs. Mutex

Friday, July 12, 2013

Here’s a quick brainteaser for you. Suppose you really want to find all the prime numbers in a certain range, and store them in a List<uint>. And also suppose that you want to parallelize that calculation to make it as quick as possible. You then need to synchronize access to the list so that it’s not corrupted by add operations performed in multiple threads. Would it be better to use a C# lock (CLR Monitor) or a Windows mutex to protect the list of primes? Parallel.For(2, 400000, n => { ...
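The brainteaser can be framed as a microbenchmark (my numbers and loop counts, not the post's): uncontended acquire/release of each primitive in a loop. A CLR Monitor stays in user mode on the fast path, while every Mutex operation is a kernel transition.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class LockVsMutex
{
    const int Iterations = 100_000;

    static void Main()
    {
        var gate = new object();
        long counter = 0;

        // lock: the CLR Monitor -- user-mode fast path, kernel involvement
        // only under contention.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
            lock (gate) { counter++; }
        Console.WriteLine($"lock:  {sw.ElapsedMilliseconds} ms");

        // Mutex: a kernel synchronization object -- each acquire/release
        // is a system call, far slower even with no contention at all.
        using (var mutex = new Mutex())
        {
            sw.Restart();
            for (int i = 0; i < Iterations; i++)
            {
                mutex.WaitOne();
                counter++;
                mutex.ReleaseMutex();
            }
            Console.WriteLine($"mutex: {sw.ElapsedMilliseconds} ms");
        }

        Console.WriteLine(counter); // both loops ran: 200000
    }
}
```

Both primitives protect the list correctly; the question the post explores is what you pay for that protection.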
15 comments

Introduction to Performance Measurement Session

Wednesday, July 3, 2013

I delivered a short two-hour session today introducing performance measurement tools. We covered performance counters – including a demo of custom performance counters – the Visual Studio profiler (sampling, instrumentation, allocations, and concurrency), and finally capturing ETW information using PerfView.

Introduction to .NET Performance Measurement from Sasha Goldshtein

The slides and demos are available here. In the Allocations folder you’ll find an app that allocates memory rapidly because it uses string concatenation instead of StringBuilder. In the Leak folder you’ll find a classic memory leak. In the Concurrency folder you’ll find a naïve parallelization attempt...
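The Allocations demo described above can be approximated with a small sketch (names hypothetical, not the demo's actual code): concatenation in a loop re-copies the whole string on every iteration, while StringBuilder appends into a growable buffer.

```csharp
using System;
using System.Text;

class ConcatDemo
{
    // Each += allocates a brand-new string and copies everything so far,
    // so total bytes allocated grow roughly quadratically with n.
    static string BuildWithConcat(int n)
    {
        string s = "";
        for (int i = 0; i < n; i++) s += i + ";";
        return s;
    }

    // StringBuilder amortizes growth of one internal buffer -- roughly
    // linear allocation for the same output.
    static string BuildWithBuilder(int n)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.Append(i).Append(';');
        return sb.ToString();
    }

    static void Main()
    {
        string a = BuildWithConcat(1000);
        string b = BuildWithBuilder(1000);
        Console.WriteLine(a == b);   // True: same output, very different cost
        Console.WriteLine(a.Length);
    }
}
```

Watching either an allocation profiler or the GC performance counters while these two run makes the difference obvious, which is the point of the demo.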
no comments

Wishes for the CLR JIT in the 2020s

Sunday, March 3, 2013

There have been some very interesting discussions at the MVP Summit concerning the CLR JIT, what we expect of it, and how to evolve it forward. I obviously can't disclose any NDA materials, but what I can do is share my hopes and dreams for the JIT, going forward. This is not a terribly popular subject, but there are some UserVoice suggestions around the JIT, such as adding SIMD support to C#. The state of the JIT today is that it's a fairly quick compiler that does a fairly bad job at optimization. There are some tricks it employs that...
3 comments

Windows Performance Analyzer

Wednesday, February 6, 2013

In 2008, I blogged about the just-released Windows Performance Toolkit, and the xperf tool that collects ETW events (including stack traces) and displays them in a form that allows basic analysis. Since then, ETW generation and collection have taken a huge leap forward. Microsoft has released a great library for creating ETW providers, and a set of tools (PerfMonitor, PerfView) for analyzing ETW traces in .NET apps. With the release of the Windows 8 SDK, xperf has been superseded by two new tools: WPR (Windows Performance Recorder), which enables ETW providers and captures traces, and WPA (Windows Performance...
2 comments

DevReach 2012: Task and Data Parallelism

Friday, October 5, 2012

Thanks for attending my DevReach session on task and data parallelism! We discussed the APIs available to you in the Task Parallel Library and how to avoid common pitfalls and squeeze performance from seemingly difficult to parallelize algorithms. Among the topics we covered:

- Measuring concurrency using the Visual Studio Concurrency Visualizer
- Extracting parallelism from recursive algorithms
- Symmetric data processing and uneven work distribution
- Dependency management with continuations
- Synchronization avoidance with aggregation and creative solutions...
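One of those topics, extracting parallelism from recursive algorithms, can be illustrated with a divide-and-conquer sketch using tasks (the cutoff value is an arbitrary choice of mine, not from the talk):

```csharp
using System;
using System.Threading.Tasks;

class RecursiveSum
{
    // Divide-and-conquer sum: each half of the range can run as its own
    // task, until ranges get small enough that task overhead dominates.
    static long Sum(int[] data, int lo, int hi)
    {
        if (hi - lo < 10_000)                 // sequential cutoff
        {
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) / 2;
        var left = Task.Run(() => Sum(data, lo, mid));
        long right = Sum(data, mid, hi);      // reuse the current thread
        return left.Result + right;
    }

    static void Main()
    {
        var data = new int[100_000];
        for (int i = 0; i < data.Length; i++) data[i] = 1;
        Console.WriteLine(Sum(data, 0, data.Length)); // 100000
    }
}
```

Keeping one half of each split on the current thread avoids spawning a task per node, a common pitfall when parallelizing recursion.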
no comments