The Case of The Async Log4Net Appender

October 16, 2013

4 comments

It was a bad day. I was coding away happily as we started getting alerts that one of our production farms is down. Our service seemed to be stuck on each one of our servers. Requests are coming in, but no responses appear. Restarting the service helped, only to get stuck again after a while. The logs were unhelpful, and we had to pull out windbg to figure out what was going on. When your day starts with windbg, you know it won’t be a good one.

The command ~*e!clrstack (=view stack trace of all managed threads) showed us that most of the threads are deadlocked in log4net’s AppenderSkeleton class. It looks something like this:

   1: public void DoAppend(LoggingEvent loggingEvent) 

   2: {

   3:     lock(this)

   4:     {

   5:         //bla bla bla... do some stuff

   6:         Append(loggingEvent); //method in child class

   7:         //bla bla bla... do more stuff                

   8:     }

   9: }

Lo and behold! There’s some locking here. I wasn’t even aware that log4net is locking on each call to Append, although it makes sense of course when your usual task is writing to files (one could suggest that there are better ways to approach that as well, and one would probably be right). Our own custom appender isn’t really writing to files, it sends log messages to Splunk (a pretty awesome system, albeit very expensive).

   1: public class SplunkAppender : AppenderSkeleton

   2:    {

   3:        protected override async void Append(log4net.Core.LoggingEvent loggingEvent)

   4:        {

   5:            var message = new SplunkMessage(loggingEvent.Level, loggingEvent.ExceptionObject);

   6:            var success = await _splunkLogger.Report(message);

   7:            //do something with the success status, not really relevant

   8:        }

   9:    }

I know what you’re thinking. Why is your append method async? Well, since logging to Splunk is a networked operation, we don’t want it to be done on the current thread. So our own SplunkLogger just puts messages in a queue, and a dedicated thread will send them to Splunk. But we also want to know if and when the message was succesfully sent. Hence, an async task that only completes when the message is actually sent.

You’ve probably figured out right now what happended to our threads by now. If for whatever reason messages are stuck in the queue and not being sent, then the lock in DoAppend is never released! The Monitor.Exit() call won’t happen until the task returns, as the async/await system rightfully makes sure of that. Log calls from other threads will never be able to complete.

Once we figured this out, the solution was simple. Stop locking! Our SplunkLogger is thread-safe anyway, so this wasn’t required at all in the first place. Unfortunately this means we had to leave the conveinence of AppenderSkeleton and directly implement IAppender. You lose some filters and formatting support there, but luckily for us we weren’t really using these features in this appender.

Less luckily for us, there were more problems with this code. More on that in the next post, stay tuned.

Add comment
facebook linkedin twitter email

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*

4 comments

  1. Ken EgoziOctober 16, 2013 ב 21:17

    The outgoing queue to Splunk is implied in that code. You could have used an actual queue, have the append method only enqueue the message and use a separate thread (or timer) to dequeue and send to splunk.

    Another (perhaps better) option is not to await on call to splunk, but instead to a ContinueWith on the task, and have the Append method complete as soon as.

    Reply
  2. doronyOctober 17, 2013 ב 2:18

    Regarding your first suggestion, that’s exactly what my SplunkLogger does. I didn’t show the code for the sake of brevity, probably a mistake on my part 🙂

    Regarding the second – a good idea. Although in this case, I prefer not to have a lock in the first place – SplunkLogger uses a ConcurrentQueue to queue the messages and it is thread safe anyways.

    Reply
  3. Boris ModylevskyOctober 24, 2013 ב 2:54

    Ouuch, that appender! I hope everything is Ok with the production now. How come it occurred only now, we used this Appender for long time.

    Reply
  4. doronyOctober 24, 2013 ב 11:22

    Boris! Good to hear from you. I think problems with Splunk triggered it.

    Reply