The ABC of Blue-Screen Dump Analysis

June 28, 2008


Kernel-mode crash dump analysis, affectionately called “Blue-Screen Analysis” thanks to the way kernel-mode crashes manifest in Windows, is an extremely complicated topic to master.  Analyzing user-mode crash dumps is hard enough: it is plagued by missing information, mismatched symbols, dump corruption and the inability to reproduce the live problem.  Kernel-mode crash dumps add a new dimension of complexity because multiple components interact (drivers, user-mode processes, Windows core services and components), and that interaction is often the root cause of the crash.  Additionally, analyzing a dump of significant complexity requires a great deal of knowledge about Windows system mechanisms and kernel-mode programming in general (interrupts, DPCs, APCs, thread scheduling and many other areas well covered by our Windows Internals course).

I’ve looked into kernel-mode crash analysis in the past, as part of the voluminous “Debugging and Investigation Tools” post, where I demonstrated isolating and pinpointing a faulty driver with Driver Verifier.  For now, though, I would like to focus on the ABC of Blue-Screen Dump Analysis: the steps any of us can take at home to determine why our favorite laptop is giving us the blue-screen goodness with every meal.

Step A – Send Your Error Reports to Microsoft

The easiest way of getting your problem diagnosed and, if at all possible, resolved is to send the error report to Microsoft Online Crash Analysis.  After the system recovers from a blue-screen, it will ask you to send the information to Microsoft, and you should do so.  More often than not, a solution for your problem becomes available, either shortly afterwards or within a few days or weeks.

If this kind of automatic diagnosis is not enough for your needs; if you’re not getting a prompt solution to the problem; if you’re curious what happens behind the scenes of a crash dump… then read on.

Last week one of my acquaintances was kind enough to give me the exact material necessary for this kind of ABC post – a collection of 18 blue-screen crash dumps from his laptop, collected across a period of 3 months.  To begin with, where do you actually find this kind of information?

Step B – Dumps Live at %SYSTEMROOT%\Minidump

Try looking at the %SYSTEMROOT%\Minidump folder right now to find out if you’ve had any blue-screens lately.  On my laptop, all I have from the last 1.5 years are a measly 5 dumps.

A minidump is small, so a kernel crash dump is something you can easily send over the Internet to a curious colleague or, as we have already seen, to… Microsoft Online Crash Analysis.  And of course you can open it yourself to see what’s lurking inside.
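If you would rather not click through Explorer, the folder can be inspected from a script.  This is only a minimal sketch: the helper names list_minidumps and default_minidump_folder are my own, and the fallback to C:\Windows when SYSTEMROOT is not set is an assumption.

```python
import os
from datetime import datetime
from pathlib import Path


def list_minidumps(folder):
    """Return (name, size_in_bytes, modified_time) for each .dmp file, oldest first."""
    dumps = []
    for path in Path(folder).glob("*.dmp"):
        info = path.stat()
        dumps.append((path.name, info.st_size, datetime.fromtimestamp(info.st_mtime)))
    return sorted(dumps, key=lambda entry: entry[2])


def default_minidump_folder():
    """Build %SYSTEMROOT%\\Minidump; assumes C:\\Windows if the variable is unset."""
    return Path(os.environ.get("SYSTEMROOT", r"C:\Windows")) / "Minidump"


if __name__ == "__main__":
    folder = default_minidump_folder()
    if folder.is_dir():
        for name, size, modified in list_minidumps(folder):
            print(f"{modified:%Y-%m-%d %H:%M}  {size:>9,} bytes  {name}")
    else:
        print(f"No minidump folder found at {folder}")
```

Reading the folder may require an elevated prompt, so an empty listing does not necessarily mean a crash-free machine.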

Step C – Bugs Fear WinDbg The Most

The single best tool for diagnosing kernel-mode crash dumps is WinDbg, part of the Debugging Tools for Windows package which I have extensively covered in the past.  It’s a free download from Microsoft, and its facilities for analyzing kernel-mode and user-mode problems are truly endless.

All you need to do to get meaningful information out of a blue-screen dump is configure symbols (File -> Symbol File Path, then enter srv*C:\SymbolCache*http://msdl.microsoft.com/download/symbols) and open the dump (File -> Open Crash Dump).  After a delay while WinDbg downloads symbols from the web, you will see output closely resembling the following:

Loading Dump File [D:\Temp\Dumps\Mini032308-01.dmp]

Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: srv*D:\Symbols*http://msdl.microsoft.com/download/symbols

Executable search path is:

Windows Kernel Version 6001 (Service Pack 1) MP (2 procs) Free x86 compatible

Product: WinNt, suite: TerminalServer SingleUserTS

Built by: 6001.18000.x86fre.longhorn_rtm.080118-1840

Kernel base = 0x81c1a000 PsLoadedModuleList = 0x81d31c70

Debug session time: Sun Mar 23 18:29:59.623 2008 (GMT+3)

System Uptime: 0 days 0:09:01.883

Loading Kernel Symbols

……………………………………………………………………………………………………………………………………

Loading User Symbols

Loading unloaded module list

…..

Use !analyze -v to get detailed debugging information.

BugCheck 1A, {4000, 8655d188, 80000000, 17e05c}

 

Probably caused by : memory_corruption ( nt!MiDeleteVirtualAddresses+7ef )

Followup: MachineOwner

---------
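The same session can be started without the GUI, which is handy when there are many dumps to open.  The sketch below assumes that kd.exe, the console debugger from the same Debugging Tools for Windows package, is on the PATH; the function names are hypothetical, and the -z, -y and -c switches are the debugger's options for the dump file, the symbol path and the startup commands.

```python
import subprocess

MS_SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir):
    """Build a srv*<local cache>*<server> symbol path like the one used above."""
    return f"srv*{cache_dir}*{MS_SYMBOL_SERVER}"


def kd_command(dump_file, cache_dir, debugger="kd.exe"):
    """Return the argument list that opens a dump, runs !analyze -v and quits."""
    return [
        debugger,
        "-z", str(dump_file),          # -z: crash dump file to open
        "-y", symbol_path(cache_dir),  # -y: symbol search path
        "-c", "!analyze -v; q",        # -c: commands to run once the dump is loaded
    ]


def analyze(dump_file, cache_dir):
    """Run the debugger and return everything it printed (assumes kd.exe is on PATH)."""
    result = subprocess.run(
        kd_command(dump_file, cache_dir),
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

Calling analyze(r"D:\Temp\Dumps\Mini032308-01.dmp", r"D:\Symbols") should produce the same text as the interactive session, with the same delay for the symbol download on the first run.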

In the debugger’s opening output, the interesting parts are the machine information (Vista SP1, 32-bit, 2 CPUs), the system uptime (just 9 minutes!) and the probable cause, which is right in front of us.  The debugger thinks it’s a memory corruption, and suggests that we use the !analyze -v command for more detailed information.  Let’s have a look:

1: kd> !analyze -v

MEMORY_MANAGEMENT (1a)

    # Any other values for parameter 1 must be individually examined.

Arguments:

Arg1: 00004000, The subtype of the bugcheck.

Arg2: 8655d188

Arg3: 80000000

Arg4: 0017e05c

 

Debugging Details:

------------------

BUGCHECK_STR:  0x1a_4000

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  svchost.exe

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from 81c584cf to 81ce7163

 

STACK_TEXT: 

a9e7baa4 81c584cf 0000001a 00004000 8655d188 nt!KeBugCheckEx+0x1e

a9e7bbd8 81cab82c 0e430002 0f586fff 8ddec810 nt!MiDeleteVirtualAddresses+0x7ef

a9e7bca8 81caadc5 8ddec810 84751ad8 84574d78 nt!MiRemoveMappedView+0x4aa

a9e7bcd0 81e3eb9d 84574d78 00000000 ffffffff nt!MiRemoveVadAndView+0xe3

a9e7bd34 81e3ecee 8ddec810 0e430000 00000000 nt!MiUnmapViewOfSection+0x265

a9e7bd54 81c71a7a ffffffff 0e430000 043eed4c nt!NtUnmapViewOfSection+0x55

a9e7bd54 77909a94 ffffffff 0e430000 043eed4c nt!KiFastCallEntry+0x12a

WARNING: Frame IP not in any known module. Following frames may be wrong.

043eed4c 00000000 00000000 00000000 00000000 0x77909a94

 

STACK_COMMAND:  kb

 

FOLLOWUP_IP:

nt!MiDeleteVirtualAddresses+7ef

81c584cf cc              int    3

 

SYMBOL_STACK_INDEX:  1

SYMBOL_NAME:  nt!MiDeleteVirtualAddresses+7ef

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: nt

DEBUG_FLR_IMAGE_TIMESTAMP:  47918b12

IMAGE_NAME:  memory_corruption

FAILURE_BUCKET_ID:  0x1a_4000_nt!MiDeleteVirtualAddresses+7ef

BUCKET_ID:  0x1a_4000_nt!MiDeleteVirtualAddresses+7ef

Followup: MachineOwner

---------

Note that we have no specifics regarding the user-mode stack that caused the crash because it’s a kernel-only minidump (no user-mode information was captured).  However, we see that the memory_corruption indication is pretty consistent.  Looking this up on the web we see multiple recommendations:

  • Run some memory diagnostic tools
  • Use tools like DebugWiz to further diagnose the problem
  • Send the hardware to the manufacturer for inspection
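With a whole pile of dumps to go through, it can help to pull the key facts out of each debugger log instead of reading every one by eye.  The following is a rough sketch that relies only on the three line formats visible in the output above (BugCheck, Probably caused by, PROCESS_NAME); the function name summarize and the dictionary keys are my own invention.

```python
import re

BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")
CAUSE_RE = re.compile(r"Probably caused by\s*:\s*(\S+)\s*\(\s*([^)]*?)\s*\)")
PROCESS_RE = re.compile(r"PROCESS_NAME:\s*(\S+)")


def summarize(debugger_output):
    """Extract bugcheck code, arguments, blamed image/symbol and process from a log."""
    summary = {}
    match = BUGCHECK_RE.search(debugger_output)
    if match:
        summary["code"] = int(match.group(1), 16)
        summary["args"] = [int(arg.strip(), 16) for arg in match.group(2).split(",")]
    match = CAUSE_RE.search(debugger_output)
    if match:
        summary["image"], summary["symbol"] = match.group(1), match.group(2)
    match = PROCESS_RE.search(debugger_output)
    if match:
        summary["process"] = match.group(1)
    return summary


# Lines copied from the first dump above.
sample = """\
BugCheck 1A, {4000, 8655d188, 80000000, 17e05c}
Probably caused by : memory_corruption ( nt!MiDeleteVirtualAddresses+7ef )
PROCESS_NAME:  svchost.exe
"""
print(summarize(sample))
```

Fields that are missing from a log are simply left out of the result, so the same function works on the short summary and on the full !analyze -v output.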

Let’s take a look at another dump (we have 18 of them, so no need to use them sparingly):

BugCheck 1000008E, {c0000005, 81e63829, aea91860, 0}

Probably caused by : ntkrpamp.exe ( nt!PfGetCompletedTrace+138 )

Followup: MachineOwner

---------

 

1: kd> !analyze -v

 

KERNEL_MODE_EXCEPTION_NOT_HANDLED_M (1000008e)

Arguments:

Arg1: c0000005, The exception code that was not handled

Arg2: 81e63829, The address that the exception occurred at

Arg3: aea91860, Trap Frame

Arg4: 00000000

 

Debugging Details:

------------------

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

 

FAULTING_IP:

nt!PfGetCompletedTrace+138

81e63829 894804          mov    dword ptr [eax+4],ecx

 

TRAP_FRAME:  aea91860 -- (.trap 0xffffffffaea91860)

ErrCode = 00000002

eax=00000000 ebx=00000001 ecx=81d341a4 edx=da84a000 esi=81d341c0 edi=81d341b4

eip=81e63829 esp=aea918d4 ebp=aea91928 iopl=0        nv up ei ng nz na pe cy

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000            efl=00010287

nt!PfGetCompletedTrace+0x138:

81e63829 894804          mov    dword ptr [eax+4],ecx ds:0023:00000004=????????

Resetting default scope

 

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

BUGCHECK_STR:  0x8E

PROCESS_NAME:  svchost.exe

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from 81e62c63 to 81e63829

 

STACK_TEXT: 

aea91928 81e62c63 01240000 00004000 aea91d30 nt!PfGetCompletedTrace+0x138

aea919a0 81e6e0ca 00000000 adb85501 aea91d30 nt!PfQuerySuperfetchInformation+0x204

aea91d4c 81c8ca7a 0000004f 012bf370 00000014 nt!NtQuerySystemInformation+0x2201

aea91d4c 77629a94 0000004f 012bf370 00000014 nt!KiFastCallEntry+0x12a

WARNING: Frame IP not in any known module. Following frames may be wrong.

012bf598 00000000 00000000 00000000 00000000 0x77629a94

 

STACK_COMMAND:  kb

FOLLOWUP_IP:

nt!PfGetCompletedTrace+138

81e63829 894804          mov    dword ptr [eax+4],ecx

 

SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  nt!PfGetCompletedTrace+138

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: nt

IMAGE_NAME:  ntkrpamp.exe

DEBUG_FLR_IMAGE_TIMESTAMP:  47918b12

FAILURE_BUCKET_ID:  0x8E_nt!PfGetCompletedTrace+138

BUCKET_ID:  0x8E_nt!PfGetCompletedTrace+138

Followup: MachineOwner

---------
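Two numbers in the trap frame of this dump are worth decoding by hand: the ErrCode value and the address the faulting instruction tried to write.  The sketch below uses the standard meaning of the low three bits of an x86 page-fault error code; the helper name decode_page_fault is hypothetical, and the register values are copied from the output above.

```python
def decode_page_fault(err_code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if err_code & 0x1 else "page not present",
        "access": "write" if err_code & 0x2 else "read",
        "mode": "user" if err_code & 0x4 else "kernel",
    }


# Values copied from the trap frame: ErrCode = 00000002, eax = 00000000,
# faulting instruction: mov dword ptr [eax+4],ecx
err_code = 0x00000002
eax = 0x00000000
target = (eax + 4) & 0xFFFFFFFF  # effective address of [eax+4] on 32-bit x86

print(decode_page_fault(err_code))
print(f"write target: 0x{target:08x}")
```

The computed target agrees with the ds:0023:00000004 operand the debugger printed: kernel code tried to write through a base pointer that was zero.  That tells us what failed, not why the pointer was zero in the first place.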

This one sure looks different.  This time the module that takes the blame is not the generic memory_corruption, but the very specific ntkrpamp.exe, which is the Windows kernel itself!  The stack trace looks innocent enough: a call path related to SuperFetch, the memory caching and preloading feature built into the OS, ends in an access violation.  A genuine bug in this code is possible but unlikely, especially since we have seen traces of memory corruption in the previous dump, and SuperFetch is one of those services that access memory quite heavily.  Let’s take a look at another one:

BugCheck 50, {fb400428, 1, 81e71d60, 0}

Probably caused by : win32k.sys ( win32k!vSolidFillRect1+107 )

Followup: MachineOwner

---------

 

0: kd> !analyze -v

 

PAGE_FAULT_IN_NONPAGED_AREA (50)

Arguments:

Arg1: fb400428, memory referenced.

Arg2: 00000001, value 0 = read operation, 1 = write operation.

Arg3: 81e71d60, If non-zero, the instruction address which referenced the bad memory address.

Arg4: 00000000, (reserved)

 

Debugging Details:

------------------

FAULTING_IP:

nt!RtlFillMemoryUlong+10

81e71d60 f3ab            rep stos dword ptr es:[edi]

 

MM_INTERNAL_CODE:  0

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

BUGCHECK_STR:  0x50

PROCESS_NAME:  devenv.exe

CURRENT_IRQL:  0

 

TRAP_FRAME:  8e00f840 -- (.trap 0xffffffff8e00f840)

ErrCode = 00000002

eax=00f0f0f0 ebx=00000202 ecx=00000011 edx=00000011 esi=fb200008 edi=fb400428

eip=81e71d60 esp=8e00f8b4 ebp=8e00f8e8 iopl=0        nv up ei pl nz na pe nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000            efl=00010206

nt!RtlFillMemoryUlong+0x10:

81e71d60 f3ab            rep stos dword ptr es:[edi]  es:0023:fb400428=????????

Resetting default scope

 

LAST_CONTROL_TRANSFER:  from 81e78bb4 to 81ec3155

 

STACK_TEXT: 

8e00f828 81e78bb4 00000001 fb400428 00000000 nt!MmAccessFault+0x10a

8e00f828 81e71d60 00000001 fb400428 00000000 nt!KiTrap0E+0xdc

8e00f8b4 961106f7 fb400428 00000044 00f0f0f0 nt!RtlFillMemoryUlong+0x10

8e00f8e8 9610bcc7 8e00fb44 00000001 fb200008 win32k!vSolidFillRect1+0x107

8e00fa88 9610b8b7 961105f0 8e00fb44 fda2dac8 win32k!vDIBSolidBlt+0x102

8e00faf4 960ded53 ffa81008 00000000 00000000 win32k!EngBitBlt+0x18e

8e00fb60 9609947b fda2da5c fda2dac8 181f35b1 win32k!ExtTextOutRect+0x1cf

8e00fbc8 960f8775 8e00fd0c 7ffdf2e4 006ce26c win32k!GreBatchTextOutRect+0xcb

8e00fd34 81e75a1c 00000099 0020ee6c 0020ee90 win32k!NtGdiFlushUserBatch+0x134

8e00fd44 77309a94 badb0d00 0020ee6c 00000000 nt!KiFastCallEntry+0xcc

WARNING: Frame IP not in any known module. Following frames may be wrong.

8e00fd48 badb0d00 0020ee6c 00000000 00000000 0x77309a94

8e00fd4c 0020ee6c 00000000 00000000 00000000 0xbadb0d00

8e00fd50 00000000 00000000 00000000 00000000 0x20ee6c

 

STACK_COMMAND:  kb

 

FOLLOWUP_IP:

win32k!vSolidFillRect1+107

961106f7 8b55f4          mov    edx,dword ptr [ebp-0Ch]

 

SYMBOL_STACK_INDEX:  3

SYMBOL_NAME:  win32k!vSolidFillRect1+107

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: win32k

IMAGE_NAME:  win32k.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  47c78851

FAILURE_BUCKET_ID:  0x50_W_win32k!vSolidFillRect1+107

BUCKET_ID:  0x50_W_win32k!vSolidFillRect1+107

Followup: MachineOwner

---------
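One detail in this output can be checked with simple arithmetic: how far the fill got before it faulted.  The sketch assumes that the three values on the nt!RtlFillMemoryUlong stack line are that function's documented arguments (destination, length in bytes, fill pattern), and that ecx in the trap frame is the number of dwords rep stos still had to write; the variable names are mine.

```python
# Values copied from the third dump (all hexadecimal in the debugger output).
destination = 0xFB400428   # first value on the RtlFillMemoryUlong stack line
length_bytes = 0x44        # second value: assumed to be the length in bytes
pattern = 0x00F0F0F0       # third value: fill pattern, also the value in eax
edi = 0xFB400428           # current write pointer in the trap frame
ecx = 0x11                 # dwords still to be written in the trap frame

total_dwords = length_bytes // 4     # rep stos dword writes 4 bytes per step
dwords_written = total_dwords - ecx
bytes_advanced = edi - destination

print(f"dwords requested: {total_dwords}, written before the fault: {dwords_written}")
print(f"pointer advanced by {bytes_advanced} bytes")
```

If those assumptions hold, the very first store faulted, which would mean the destination pointer was already bad when the fill started rather than the fill running off the end of a valid buffer.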

This time, it’s win32k.sys (the built-in windowing and graphics driver) taking the blame for the crash, in code that appears to be filling a block of memory.  The originating process this time is devenv.exe (Visual Studio itself).  Again, it’s highly unlikely that the win32k code is really at fault here: either it’s a physical memory corruption, or some faulty driver is overwriting memory it does not own.  Let’s take a look at a final, fourth dump before we start coming up with action items:

BugCheck 1A, {4000, 8d6a3678, 80000000, 17dfed}

Probably caused by : memory_corruption ( nt!MiDeleteVirtualAddresses+7ef )

Followup: MachineOwner

---------

 

0: kd> !analyze -v

 

Debugging Details:

------------------

BUGCHECK_STR:  0x1a_4000

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  iexplore.exe

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from 81e8e4cf to 81f1d163

 

STACK_TEXT: 

c2da4b5c 81e8e4cf 0000001a 00004000 8d6a3678 nt!KeBugCheckEx+0x1e

c2da4c94 81ee236e 0e770000 0ed41fff 07a4b321 nt!MiDeleteVirtualAddresses+0x7ef

c2da4d2c 81ea7a7a ffffffff 0e33ee50 0e33ee44 nt!NtFreeVirtualMemory+0x652

c2da4d2c 77469a94 ffffffff 0e33ee50 0e33ee44 nt!KiFastCallEntry+0x12a

WARNING: Frame IP not in any known module. Following frames may be wrong.

0e33ed9c 00000000 00000000 00000000 00000000 0x77469a94

 

STACK_COMMAND:  kb

FOLLOWUP_IP:

nt!MiDeleteVirtualAddresses+7ef

81e8e4cf cc              int    3

 

SYMBOL_STACK_INDEX:  1

SYMBOL_NAME:  nt!MiDeleteVirtualAddresses+7ef

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: nt

DEBUG_FLR_IMAGE_TIMESTAMP:  47918b12

IMAGE_NAME:  memory_corruption

FAILURE_BUCKET_ID:  0x1a_4000_nt!MiDeleteVirtualAddresses+7ef

BUCKET_ID:  0x1a_4000_nt!MiDeleteVirtualAddresses+7ef

Followup: MachineOwner

---------

Ah, it’s our friend memory_corruption again, this time with iexplore.exe (Internet Explorer) as the current process.  Time to wrap it up.

Conclusion: We are looking at either a hardware problem (defective physical memory, overclocked memory or some other hardware fault) or a misbehaving driver that randomly corrupts memory as part of its normal operation.  In the former case, we can run memory diagnostic tools and send the machine to the manufacturer for replacement.  In the latter case, we are in for a long story: downloading the latest versions of all drivers, ensuring that no rogue or irrelevant drivers are installed, enabling Driver Verifier on suspect drivers, and then waiting for the problem to reproduce so we can catch the faulty component in the act.
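The reasoning behind this conclusion is that the four dumps disagree with each other about who is to blame.  Here is a small sketch of that tally, using only the bugcheck codes, blamed images and process names shown above; the function looks_like_random_corruption and its threshold are a rule of thumb of my own, not an established diagnostic.

```python
from collections import Counter

# (bugcheck code, blamed image, current process) for the four dumps shown above.
dumps = [
    (0x1A, "memory_corruption", "svchost.exe"),
    (0x1000008E, "ntkrpamp.exe", "svchost.exe"),
    (0x50, "win32k.sys", "devenv.exe"),
    (0x1A, "memory_corruption", "iexplore.exe"),
]


def looks_like_random_corruption(records, min_distinct=2):
    """Heuristic (an assumption, not a rule): crashes scattered over several
    bugcheck codes, blamed images and processes hint at corruption coming
    from outside the blamed code."""
    codes = {code for code, _, _ in records}
    images = {image for _, image, _ in records}
    processes = {process for _, _, process in records}
    return all(len(group) >= min_distinct for group in (codes, images, processes))


for image, count in Counter(image for _, image, _ in dumps).most_common():
    print(f"{count} x {image}")
print("scattered failures:", looks_like_random_corruption(dumps))
```

A single driver showing up in every dump would point the other way, towards that driver, which is exactly the situation Driver Verifier is meant to expose.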

2 comments

  1. rgviva, November 2, 2010 at 4:03 PM

    I had a very similar BSOD to one of your samples involving win32k.sys:

    a7bc0e20 8052036a 00000050 bdc083bc 00000000 nt!KeBugCheckEx+0x1b
    a7bc0e88 80544578 00000000 bdc083bc 00000000 nt!MmAccessFault+0x9a8
    a7bc0e88 bf82f038 00000000 bdc083bc 00000000 nt!KiTrap0E+0xd0
    a7bc0f34 bf82ee6c a7bc1160 bdc083bc bd9c0008 win32k!vSolidFillRect1+0xb0
    a7bc10b4 bf82bd63 bf82ef2e a7bc1160 e27af5a4 win32k!vDIBSolidBlt+0x19b
    a7bc1120 bf82863b e1bf5008 00000000 00000000 win32k!EngBitBlt+0xe1
    a7bc117c bf812e1c e27af5a4 e27af538 00000000 win32k!ExtTextOutRect+0x1d1

    It also happened from devenv.exe, like in your case. I do not agree that it is due to physical memory errors or whatnot. It is more likely a Windows bug.

  2. loada, May 7, 2011 at 11:41 AM

    Am also experiencing a win32k culprit right now. After a clean install the bsod begins right after my first windows update with nothing else installed. Taking either of my 2 memory modules out makes the problem go away. It appears a recent windows update can’t work with more than 2GB ram installed on my system.

    Windows 7 Kernel Version 7600 MP (2 procs) Free x64
