The ABC of Blue-Screen Dump Analysis
Kernel-mode crash dump analysis, affectionately called "Blue-Screen Analysis" thanks to the way kernel-mode crashes manifest in Windows, is an extremely complicated topic to master. Analyzing user-mode crash dumps is hard enough, plagued as it is by missing information, mismatched symbols, dump corruption and the inability to reproduce the live problem. Kernel-mode crash dumps add a new dimension of complexity because the interaction of multiple components (drivers, user-mode processes, Windows core services and components) is often the root cause of the crash. Additionally, analyzing a dump of significant complexity requires a great deal of knowledge about Windows system mechanisms and kernel-mode programming in general (interrupts, DPCs, APCs, thread scheduling and many other areas well covered by our Windows Internals course).
I've looked into kernel-mode crash analysis in the past, as part of the voluminous "Debugging and Investigation Tools" post, where I demonstrate isolating and pinpointing a faulty driver through the use of Driver Verifier. For now, though, I would like to focus on the ABC of Blue-Screen Dump Analysis - the steps any of us can take at home to determine why our favorite laptop is giving us the blue-screen goodness with every meal.
Step A - Send Your Error Reports to Microsoft
The easiest way to get your problem diagnosed and, if at all possible, resolved is to send the error report to Microsoft Online Crash Analysis. After the system recovers from a blue-screen, it will ask you to send the information to Microsoft, and you should do so. More often than not, a solution for your problem becomes available shortly afterwards, or within a few days or weeks.
If this kind of automatic diagnosis is not enough for your needs; if you're not getting a prompt solution to the problem; if you're curious what happens behind the scenes of a crash dump... then read on.
Last week one of my acquaintances was kind enough to give me the exact material necessary for this kind of ABC post - a collection of 18 blue-screen crash dumps from his laptop, collected across a period of 3 months. To begin with, where do you actually find this kind of information?
Step B - Dumps Live at %SYSTEMROOT%\Minidump
Try looking at the %SYSTEMROOT%\Minidump folder right now to find out if you've had any blue-screens lately. On my laptop, the last 1.5 years have left behind a measly 5 dumps.
Minidumps are small files, so a kernel crash dump is something you can easily send over the Internet to a curious colleague or, as we have already seen, to... Microsoft Online Crash Analysis. And of course you can open it yourself to see what's lurking inside.
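If you would rather not browse the folder by hand, here is a minimal sketch that lists whatever dumps are there. It assumes the default location described above and falls back to C:\Windows when the SYSTEMROOT variable is not set; the function name list_minidumps is made up for this example. Reading the folder may require an elevated prompt on some systems.

```python
import os
from datetime import datetime
from pathlib import Path


def list_minidumps(folder):
    """Return (name, size_in_bytes, modified_time) for each .dmp file, oldest first."""
    folder = Path(folder)
    if not folder.is_dir():
        return []  # a missing folder usually means no minidump was ever written
    entries = []
    for path in folder.iterdir():
        if path.is_file() and path.suffix.lower() == ".dmp":
            info = path.stat()
            entries.append((path.name, info.st_size, datetime.fromtimestamp(info.st_mtime)))
    return sorted(entries, key=lambda entry: entry[2])


if __name__ == "__main__":
    # Assumption: use C:\Windows when SYSTEMROOT is not defined.
    system_root = os.environ.get("SYSTEMROOT", r"C:\Windows")
    for name, size, modified in list_minidumps(Path(system_root) / "Minidump"):
        print(f"{modified:%Y-%m-%d %H:%M}  {size / 1024:8.0f} KB  {name}")
```

The modification time of each file is a quick way to see how often the machine has been crashing.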
Step C - Bugs Fear WinDbg The Most
The single best tool for diagnosing kernel-mode crash dumps is WinDbg, part of the Debugging Tools for Windows package which I have extensively covered in the past. It's a free download from Microsoft, and its facilities for analyzing kernel-mode and user-mode problems are truly endless.
All you need to do with a blue-screen dump to get some meaningful information from WinDbg is configure symbols (File -> Symbol File Path -> srv*C:\SymbolCache*http://msdl.microsoft.com/download/symbols) and then choose File -> Open Crash Dump. After a delay, while WinDbg downloads symbols from the web, you will see something that closely resembles the following:
Loading Dump File [D:\Temp\Dumps\Mini032308-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available
Symbol search path is: srv*D:\Symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Kernel Version 6001 (Service Pack 1) MP (2 procs) Free x86 compatible
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 6001.18000.x86fre.longhorn_rtm.080118-1840
Kernel base = 0x81c1a000 PsLoadedModuleList = 0x81d31c70
Debug session time: Sun Mar 23 18:29:59.623 2008 (GMT+3)
System Uptime: 0 days 0:09:01.883
Loading Kernel Symbols
......................................................................................................................................................
Loading User Symbols
Loading unloaded module list
.....
Use !analyze -v to get detailed debugging information.
BugCheck 1A, {4000, 8655d188, 80000000, 17e05c}
Probably caused by : memory_corruption ( nt!MiDeleteVirtualAddresses+7ef )
Followup: MachineOwner
---------
The interesting parts are the machine information (Vista SP1, 32-bit, 2 CPUs), the system uptime (just 9 minutes!) and the probable cause, which is right in front of us. The debugger thinks it's a memory corruption, and suggests that we use the !analyze -v command for more detailed information. Let's have a look:
1: kd> !analyze -v
MEMORY_MANAGEMENT (1a)
# Any other values for parameter 1 must be individually examined.
Arguments:
Arg1: 00004000, The subtype of the bugcheck.
Arg2: 8655d188
Arg3: 80000000
Arg4: 0017e05c
Debugging Details:
------------------
BUGCHECK_STR: 0x1a_4000
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
PROCESS_NAME: svchost.exe
CURRENT_IRQL: 0
LAST_CONTROL_TRANSFER: from 81c584cf to 81ce7163
STACK_TEXT:
a9e7baa4 81c584cf 0000001a 00004000 8655d188 nt!KeBugCheckEx+0x1e
a9e7bbd8 81cab82c 0e430002 0f586fff 8ddec810 nt!MiDeleteVirtualAddresses+0x7ef
a9e7bca8 81caadc5 8ddec810 84751ad8 84574d78 nt!MiRemoveMappedView+0x4aa
a9e7bcd0 81e3eb9d 84574d78 00000000 ffffffff nt!MiRemoveVadAndView+0xe3
a9e7bd34 81e3ecee 8ddec810 0e430000 00000000 nt!MiUnmapViewOfSection+0x265
a9e7bd54 81c71a7a ffffffff 0e430000 043eed4c nt!NtUnmapViewOfSection+0x55
a9e7bd54 77909a94 ffffffff 0e430000 043eed4c nt!KiFastCallEntry+0x12a
WARNING: Frame IP not in any known module. Following frames may be wrong.
043eed4c 00000000 00000000 00000000 00000000 0x77909a94
STACK_COMMAND: kb
FOLLOWUP_IP:
nt!MiDeleteVirtualAddresses+7ef
81c584cf cc int 3
SYMBOL_STACK_INDEX: 1
SYMBOL_NAME: nt!MiDeleteVirtualAddresses+7ef
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: nt
DEBUG_FLR_IMAGE_TIMESTAMP: 47918b12
IMAGE_NAME: memory_corruption
FAILURE_BUCKET_ID: 0x1a_4000_nt!MiDeleteVirtualAddresses+7ef
BUCKET_ID: 0x1a_4000_nt!MiDeleteVirtualAddresses+7ef
Followup: MachineOwner
---------
Note that we have no specifics regarding the user-mode stack that led to the crash because it's a kernel-only minidump (no user-mode information was captured), which is why the stack ends in a frame that WinDbg cannot attribute to any module. However, we see that the memory_corruption indication is pretty consistent. Looking this up on the web we see multiple recommendations:
- Run some memory diagnostic tools
- Use tools like DebugWiz to further diagnose the problem
- Send the hardware to the manufacturer for inspection
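With 18 dumps to go through, it helps to pull the two summary lines WinDbg prints (the BugCheck line and the "Probably caused by" line) into something structured. The following is a rough sketch that assumes the output format shown above; the function name parse_summary and the dictionary keys are made up for this example.

```python
import re

BUGCHECK_RE = re.compile(r"^BugCheck ([0-9A-Fa-f]+), \{([^}]*)\}", re.MULTILINE)
CAUSE_RE = re.compile(r"^Probably caused by : (\S+) \( (.+?) \)", re.MULTILINE)


def parse_summary(text):
    """Extract the bugcheck code, its arguments and the blamed image from debugger output."""
    bugcheck = BUGCHECK_RE.search(text)
    cause = CAUSE_RE.search(text)
    if not bugcheck or not cause:
        return None
    return {
        "code": int(bugcheck.group(1), 16),
        "args": [int(arg, 16) for arg in bugcheck.group(2).split(",")],
        "image": cause.group(1),
        "symbol": cause.group(2),
    }


# The two summary lines of the first dump, copied from the output above.
sample = """\
BugCheck 1A, {4000, 8655d188, 80000000, 17e05c}
Probably caused by : memory_corruption ( nt!MiDeleteVirtualAddresses+7ef )
"""
summary = parse_summary(sample)
print(hex(summary["code"]), summary["image"], summary["symbol"])
# prints: 0x1a memory_corruption nt!MiDeleteVirtualAddresses+7ef
```

The first argument (0x4000 here) is the subtype that !analyze -v reports as Arg1, so it is worth keeping alongside the code itself.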
Let's take a look at another dump (we have 18 of them, so no need to use them sparingly):
BugCheck 1000008E, {c0000005, 81e63829, aea91860, 0}
Probably caused by : ntkrpamp.exe ( nt!PfGetCompletedTrace+138 )
Followup: MachineOwner
---------
1: kd> !analyze -v
KERNEL_MODE_EXCEPTION_NOT_HANDLED_M (1000008e)
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: 81e63829, The address that the exception occurred at
Arg3: aea91860, Trap Frame
Arg4: 00000000
Debugging Details:
------------------
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
FAULTING_IP:
nt!PfGetCompletedTrace+138
81e63829 894804 mov dword ptr [eax+4],ecx
TRAP_FRAME: aea91860 -- (.trap 0xffffffffaea91860)
ErrCode = 00000002
eax=00000000 ebx=00000001 ecx=81d341a4 edx=da84a000 esi=81d341c0 edi=81d341b4
eip=81e63829 esp=aea918d4 ebp=aea91928 iopl=0 nv up ei ng nz na pe cy
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010287
nt!PfGetCompletedTrace+0x138:
81e63829 894804 mov dword ptr [eax+4],ecx ds:0023:00000004=????????
Resetting default scope
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
BUGCHECK_STR: 0x8E
PROCESS_NAME: svchost.exe
CURRENT_IRQL: 0
LAST_CONTROL_TRANSFER: from 81e62c63 to 81e63829
STACK_TEXT:
aea91928 81e62c63 01240000 00004000 aea91d30 nt!PfGetCompletedTrace+0x138
aea919a0 81e6e0ca 00000000 adb85501 aea91d30 nt!PfQuerySuperfetchInformation+0x204
aea91d4c 81c8ca7a 0000004f 012bf370 00000014 nt!NtQuerySystemInformation+0x2201
aea91d4c 77629a94 0000004f 012bf370 00000014 nt!KiFastCallEntry+0x12a
WARNING: Frame IP not in any known module. Following frames may be wrong.
012bf598 00000000 00000000 00000000 00000000 0x77629a94
STACK_COMMAND: kb
FOLLOWUP_IP:
nt!PfGetCompletedTrace+138
81e63829 894804 mov dword ptr [eax+4],ecx
SYMBOL_STACK_INDEX: 0
SYMBOL_NAME: nt!PfGetCompletedTrace+138
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: nt
IMAGE_NAME: ntkrpamp.exe
DEBUG_FLR_IMAGE_TIMESTAMP: 47918b12
FAILURE_BUCKET_ID: 0x8E_nt!PfGetCompletedTrace+138
BUCKET_ID: 0x8E_nt!PfGetCompletedTrace+138
Followup: MachineOwner
---------
This one sure looks different. This time the module that takes the blame is not the generic memory_corruption, but the very specific ntkrpamp.exe, which is the Windows kernel itself! Examining the stack trace, it looks like a very innocent stack related to SuperFetch, the memory caching and preloading feature built into the OS, which triggers an access violation while writing to address 0x00000004 (eax is zero in the trap frame). A random write bug in the kernel itself is possible but unlikely, especially since we have seen traces of memory corruption in the previous dump, and SuperFetch is one of those services accessing memory quite heavily. Let's take a look at another one:
BugCheck 50, {fb400428, 1, 81e71d60, 0}
Probably caused by : win32k.sys ( win32k!vSolidFillRect1+107 )
Followup: MachineOwner
---------
0: kd> !analyze -v
PAGE_FAULT_IN_NONPAGED_AREA (50)
Arguments:
Arg1: fb400428, memory referenced.
Arg2: 00000001, value 0 = read operation, 1 = write operation.
Arg3: 81e71d60, If non-zero, the instruction address which referenced the bad memory address.
Arg4: 00000000, (reserved)
Debugging Details:
------------------
FAULTING_IP:
nt!RtlFillMemoryUlong+10
81e71d60 f3ab rep stos dword ptr es:[edi]
MM_INTERNAL_CODE: 0
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
BUGCHECK_STR: 0x50
PROCESS_NAME: devenv.exe
CURRENT_IRQL: 0
TRAP_FRAME: 8e00f840 -- (.trap 0xffffffff8e00f840)
ErrCode = 00000002
eax=00f0f0f0 ebx=00000202 ecx=00000011 edx=00000011 esi=fb200008 edi=fb400428
eip=81e71d60 esp=8e00f8b4 ebp=8e00f8e8 iopl=0 nv up ei pl nz na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010206
nt!RtlFillMemoryUlong+0x10:
81e71d60 f3ab rep stos dword ptr es:[edi] es:0023:fb400428=????????
Resetting default scope
LAST_CONTROL_TRANSFER: from 81e78bb4 to 81ec3155
STACK_TEXT:
8e00f828 81e78bb4 00000001 fb400428 00000000 nt!MmAccessFault+0x10a
8e00f828 81e71d60 00000001 fb400428 00000000 nt!KiTrap0E+0xdc
8e00f8b4 961106f7 fb400428 00000044 00f0f0f0 nt!RtlFillMemoryUlong+0x10
8e00f8e8 9610bcc7 8e00fb44 00000001 fb200008 win32k!vSolidFillRect1+0x107
8e00fa88 9610b8b7 961105f0 8e00fb44 fda2dac8 win32k!vDIBSolidBlt+0x102
8e00faf4 960ded53 ffa81008 00000000 00000000 win32k!EngBitBlt+0x18e
8e00fb60 9609947b fda2da5c fda2dac8 181f35b1 win32k!ExtTextOutRect+0x1cf
8e00fbc8 960f8775 8e00fd0c 7ffdf2e4 006ce26c win32k!GreBatchTextOutRect+0xcb
8e00fd34 81e75a1c 00000099 0020ee6c 0020ee90 win32k!NtGdiFlushUserBatch+0x134
8e00fd44 77309a94 badb0d00 0020ee6c 00000000 nt!KiFastCallEntry+0xcc
WARNING: Frame IP not in any known module. Following frames may be wrong.
8e00fd48 badb0d00 0020ee6c 00000000 00000000 0x77309a94
8e00fd4c 0020ee6c 00000000 00000000 00000000 0xbadb0d00
8e00fd50 00000000 00000000 00000000 00000000 0x20ee6c
STACK_COMMAND: kb
FOLLOWUP_IP:
win32k!vSolidFillRect1+107
961106f7 8b55f4 mov edx,dword ptr [ebp-0Ch]
SYMBOL_STACK_INDEX: 3
SYMBOL_NAME: win32k!vSolidFillRect1+107
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: win32k
IMAGE_NAME: win32k.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 47c78851
FAILURE_BUCKET_ID: 0x50_W_win32k!vSolidFillRect1+107
BUCKET_ID: 0x50_W_win32k!vSolidFillRect1+107
Followup: MachineOwner
---------
This time, it's win32k.sys (the built-in windowing and graphics driver) taking the blame for the crash, in code that appears to be filling a block of memory. The originating process this time is devenv.exe (Visual Studio itself). Again, it's highly unlikely that the win32k code is indeed at fault here - either it's a physical memory corruption, or some faulty driver is overwriting memory. Let's take a look at a final, fourth dump before we start coming up with action items:
BugCheck 1A, {4000, 8d6a3678, 80000000, 17dfed}
Probably caused by : memory_corruption ( nt!MiDeleteVirtualAddresses+7ef )
Followup: MachineOwner
---------
0: kd> !analyze -v
Debugging Details:
------------------
BUGCHECK_STR: 0x1a_4000
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
PROCESS_NAME: iexplore.exe
CURRENT_IRQL: 0
LAST_CONTROL_TRANSFER: from 81e8e4cf to 81f1d163
STACK_TEXT:
c2da4b5c 81e8e4cf 0000001a 00004000 8d6a3678 nt!KeBugCheckEx+0x1e
c2da4c94 81ee236e 0e770000 0ed41fff 07a4b321 nt!MiDeleteVirtualAddresses+0x7ef
c2da4d2c 81ea7a7a ffffffff 0e33ee50 0e33ee44 nt!NtFreeVirtualMemory+0x652
c2da4d2c 77469a94 ffffffff 0e33ee50 0e33ee44 nt!KiFastCallEntry+0x12a
WARNING: Frame IP not in any known module. Following frames may be wrong.
0e33ed9c 00000000 00000000 00000000 00000000 0x77469a94
STACK_COMMAND: kb
FOLLOWUP_IP:
nt!MiDeleteVirtualAddresses+7ef
81e8e4cf cc int 3
SYMBOL_STACK_INDEX: 1
SYMBOL_NAME: nt!MiDeleteVirtualAddresses+7ef
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: nt
DEBUG_FLR_IMAGE_TIMESTAMP: 47918b12
IMAGE_NAME: memory_corruption
FAILURE_BUCKET_ID: 0x1a_4000_nt!MiDeleteVirtualAddresses+7ef
BUCKET_ID: 0x1a_4000_nt!MiDeleteVirtualAddresses+7ef
Followup: MachineOwner
---------
Ah, it's our friend memory_corruption again, this time with iexplore.exe (Internet Explorer) as the current process at the time of the crash. Time to wrap it up.
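Before drawing conclusions, it helps to lay the four dumps side by side. The sketch below simply tallies the bugcheck code, the blamed image and the current process, all copied from the output above. The OS_IMAGES set is my own shorthand for "things that ship with Windows", not something the debugger reports.

```python
from collections import Counter

# (bugcheck code, blamed image, current process), copied from the four dumps above
dumps = [
    (0x1A, "memory_corruption", "svchost.exe"),
    (0x1000008E, "ntkrpamp.exe", "svchost.exe"),
    (0x50, "win32k.sys", "devenv.exe"),
    (0x1A, "memory_corruption", "iexplore.exe"),
]

by_code = Counter(code for code, _, _ in dumps)
by_image = Counter(image for _, image, _ in dumps)
by_process = Counter(process for _, _, process in dumps)

for title, counts in (("bugcheck", by_code), ("image", by_image), ("process", by_process)):
    print(title)
    for key, count in counts.most_common():
        label = f"0x{key:X}" if isinstance(key, int) else key
        print(f"  {label:<20} {count}")

# Assumption: memory_corruption is a debugger label rather than a real image, and the
# other two ship with Windows. Any other name would be a Driver Verifier candidate.
OS_IMAGES = {"memory_corruption", "ntkrpamp.exe", "win32k.sys"}
third_party = sorted(set(by_image) - OS_IMAGES)
print("third-party images blamed:", third_party or "none")
```

No third-party driver is ever named, and the crashes are spread over three processes and three bugcheck codes, so the dumps alone do not single out a culprit.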
Conclusion: The crashes hit three different processes and several unrelated kernel code paths, which points away from any single program. We are either looking at a machine with defective physical memory, overclocked physical memory or some other kind of hardware problem, or at a misbehaving driver that is randomly corrupting memory as part of its normal operation. In the former case, we can run memory diagnostic tools and send the machine to the manufacturer for replacement; in the latter case, we are looking at a long story of downloading the latest versions of all drivers, ensuring that no rogue or irrelevant drivers are installed, enabling Driver Verifier on suspect drivers and waiting to reproduce the problem and catch the faulty component in the act.
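Opening 18 dumps one at a time in WinDbg gets old quickly. The same Debugging Tools for Windows package includes kd.exe, a console debugger that accepts a dump file (-z), a symbol path (-y) and a command to run at startup (-c). Here is a rough sketch of a batch run. It assumes kd.exe is on your PATH and reuses the symbol path from Step C; the script name analyze_dumps.py and the function names are made up for this example.

```python
import shutil
import subprocess
import sys
from pathlib import Path

SYMBOL_PATH = r"srv*C:\SymbolCache*http://msdl.microsoft.com/download/symbols"


def build_kd_command(dump_file, symbol_path=SYMBOL_PATH, debugger="kd"):
    """Command line that opens one dump, runs !analyze -v and quits."""
    return [debugger, "-z", str(dump_file), "-y", symbol_path, "-c", "!analyze -v; q"]


def analyze_folder(folder, debugger="kd"):
    """Run the debugger over every .dmp file in folder and yield (name, output)."""
    for dump_file in sorted(Path(folder).glob("*.dmp")):
        result = subprocess.run(
            build_kd_command(dump_file, debugger=debugger),
            capture_output=True, text=True, check=False,
        )
        yield dump_file.name, result.stdout


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: analyze_dumps.py <folder-with-dmp-files>")
    elif shutil.which("kd") is None:
        print("kd.exe was not found on PATH; add the Debugging Tools for Windows folder first")
    else:
        for name, output in analyze_folder(sys.argv[1]):
            # Keep only the one-line verdict; the full report is in `output`.
            verdict = [line for line in output.splitlines() if line.startswith("Probably caused by")]
            print(name, "->", verdict[0] if verdict else "no verdict found")
```

Copy the dumps out of %SYSTEMROOT%\Minidump into a working folder first, and expect the first run to be slow while symbols are downloaded into the cache.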