Channel: Notes from a dark corner

Random and unexpected EXCEPTION_FLT_DIVIDE_BY_ZERO and EXCEPTION_FLT_INVALID_OPERATION


Your application is running fine, then one day it starts to fail with EXCEPTION_FLT_DIVIDE_BY_ZERO (0xC000008E) or EXCEPTION_FLT_INVALID_OPERATION (0xC0000090) exceptions at seemingly random places.

Why?

One reason might be the current value of the floating point control word (fpcw). This is a bit mask used to control whether Intel 8087 and later CPUs raise exceptions or not when certain types of floating point errors occur. There is a good article about this over on the openwatcom.org site.

In Windows applications the usual value for fpcw is 027F. For example, just fire up notepad, attach WinDBG and do the following:

0:001> ~*e r@fpcw
fpcw=0000027f
fpcw=0000027f

The value of this register can be set in many ways: for example, the functions _control87, _controlfp, __control87_2, _clear87, _clearfp, _status87, _statusfp and _statusfp2 can all modify it. But if it is modified, it fundamentally changes the ground rules for how some floating point operations will behave on that thread until the original value is restored.

As 027f is the "normal" value on Windows, almost all Microsoft and third party applications, components, frameworks and libraries are written and tested with the expectation that this register will have this value and that floating point operations will happen in a particular way either raising or not raising exceptions. Therefore any code that needs to modify this register for some reason has a duty to change it back again when finished unless it is running on its own private thread. If not, mayhem will result.
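The save-and-restore discipline this implies can be illustrated with Python's decimal module, whose per-thread context plays a role analogous to fpcw. This is an analogy only, not the x87 control word itself: the context's trap flags decide whether an operation raises an exception, just as the fpcw mask bits do for the FPU.

```python
from decimal import Decimal, DivisionByZero, getcontext, localcontext

# Raising on division by zero is the default in this context, much as
# 027F (all exceptions masked) is the default for fpcw on Windows.
assert getcontext().traps[DivisionByZero]

# Well-behaved code scopes any change and restores the previous state:
with localcontext() as ctx:
    ctx.traps[DivisionByZero] = False
    print(Decimal(1) / Decimal(0))   # prints Infinity, no exception raised

# Outside the scope the original trap setting is back in force:
try:
    Decimal(1) / Decimal(0)
except DivisionByZero:
    print("trap restored")
```

Code that changes the real fpcw (via _controlfp and friends) owes its host the same courtesy: capture the old value, make the change, and put the old value back before returning.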

And that is exactly what we see sometimes here at Microsoft support.

I had a case not so long ago from a systems integrator. They were delivering a solution to an end customer using a system developed by another company which in turn used applications from different vendors. After an update to one of the components was deployed these applications began to fail with floating point exceptions. Troubleshooting was particularly difficult because the end customer was in an isolated environment without access to the Internet so remote access was out of the question. Every time we wanted to do some troubleshooting someone from one of the vendors had to go on site and we'd have a phone conference where I would talk them through a series of debug steps "blind" (you develop good visualisation skills in my job).

From the outset I suspected something was modifying the fpcw so the first thing I had them check was the value of fpcw on all threads at the point in time where the exceptions had started to happen but the process was still up (unhandled, these exceptions will take down a process). Sure enough, somehow the "normal" value had been changed on thread 0, the main UI thread of the application:

0:000>  ~*e rfpcw
fpcw=00001372
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f

So this was the cause of the exceptions but the harder question to answer was who was changing this?
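Incidentally, decoding the low six mask bits of the control word shows why this particular value produced exactly these exceptions. In the x87 control word a set mask bit means the exception is handled silently; a clear bit makes the FPU raise it. A small sketch:

```python
# Decode the x87 control-word exception mask bits (bit set = masked,
# i.e. handled silently; bit clear = the FPU raises the exception).
MASK_BITS = [
    ("invalid operation", 0x01),
    ("denormal operand",  0x02),
    ("zero divide",       0x04),
    ("overflow",          0x08),
    ("underflow",         0x10),
    ("precision",         0x20),
]

def unmasked_exceptions(fpcw):
    return [name for name, bit in MASK_BITS if not fpcw & bit]

print(unmasked_exceptions(0x027F))  # [] -- the Windows default masks everything
print(unmasked_exceptions(0x1372))
```

The 1372 value leaves invalid operation, zero divide and overflow unmasked, which lines up with the EXCEPTION_FLT_INVALID_OPERATION and EXCEPTION_FLT_DIVIDE_BY_ZERO exceptions the application was hitting.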

The lack of direct access to the system in question limited the complexity of the debug steps I could use, as I would be talking someone else (who was not familiar with debugging) through them. I decided to start by running the process under the debugger from the beginning, setting the debugger to break any time a module loaded into the process, dump out the value of fpcw on every thread and then continue execution. All this output would then be captured into a log file which could be brought back from the onsite visit. This was on the assumption that whatever DLL was making the change was likely to be doing it when it first loaded into the process. To do this we used the following command just after launching the process under the debugger:


0:001> .logopen c:\debug_session.log
Closing open log file \debug_1318_2008-09-30_15-06-38-527.log
Opened log file 'c:\debug_session.log'

0:001> sxe -c "~*e @fpcw;g" ld

(The @ before the register name tells the debugger that it is a register and should not be evaluated via symbol resolution. Doing this can make debug sessions a bit more responsive.)

The output showed us something like the following:

ModLoad: 053a0000 053b1000   C:\libraries\thirdpart.dll  <<< start of loading of third party module
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
ModLoad: 77760000 778cc000   C:\WINDOWS\system32\shdocvw.dll  <<< start of loading of shdocvw.dll (a Windows component)
fpcw=00001372 <<< incorrect value of fpcw on thread 0

fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f

Sometime between the start of loading thirdpart.dll and the start of loading the next DLL into the process, the wrong fpcw value was set. Therefore we can now say with a reasonable degree of certainty that this module was responsible.

After discussions between all parties involved we eventually established that this module was being injected into the process to fulfil a hooking/monitoring function. Unfortunately the changing of the fpcw value appeared to be a side effect of the non-Microsoft compiler the DLL was compiled with. Certain compilers seem to generate code that does this, possibly as a legacy of targeting non-Microsoft operating systems in the past. The vendor was not in a position to recompile this module, so in the end they had to redesign things to avoid using it.

[A little tip, based on my experience, for spotting components compiled with certain non-Microsoft compilers: a clue lies in the timestamp reported for the module (do lmvm thirdpart in the debugger):

  2A425E19 time date stamp Fri Jun 19 23:22:17 1992

Now I remember that when I joined Microsoft Developer Support in 1995 we were in the beta of Windows 95 and, although there was a thing called Win32s that gave some kind of 32-bit implementation on the 16-bit Windows platform, we were only just at the beginning of 32-bit computing. So I was fairly sure this component was not really compiled in 1992. I've seen this 1992 thing a few times now and I think it has usually been with PE binaries produced by a non-Microsoft compiler.]

I've also seen cases where modification of the mxcsr register has led to very unexpected errors:

Microsoft VBScript error 800a000b
Division by Zero

This was occurring on this line of ASP code:

Response.Write(5/3)

Imagine how confusing that was!

In that case a third party ASP.NET charting component was changing the mxcsr register to 00001fa0 or 00001fa4 instead of its "normal" Windows value of 00001f80. (The customer was hosting ASP.NET and classic ASP applications in the same application pool.)

In another case we saw a customer getting a VBScript error 6, overflow on this line:

x = 1 + 2.0

Again, confusion reigned. This time it was caused by a component that was using MMX/SSE2/SSE3 instructions.

I'm not against code altering the fpcw or mxcsr registers. But if you are a library component that is going to be used by arbitrary threads in some foreign host process then your documentation needs to have a big red warning sticker on it and you certainly shouldn't go around injecting yourself into other processes and changing the way the CPU behaves. That's just bad manners!

HTH

Doug


A message to my readers...

0:001> .foreach ( greeting {s -[1]u 0 L?0xffffffff "Merry"} ) {.printf "%mu" , greeting }
Merry Christmas and a Happy New Year!

Christmas brain teaser: based on this, what can we infer about the most likely bitness of the process and the operating system I was debugging on, and why?

Cheers!

Doug

Answer to the Christmas brain teaser...


So at Christmas I posed a brain teaser.

"0:001> .foreach ( greeting {s -[1]u 0 L?0xffffffff "Merry"} ) {.printf "%mu" , greeting }
Merry Christmas and a Happy New Year!

Christmas brain teaser: based on this, what can we infer about the most likely bitness of the process and the operating system I was debugging on, and why?"

First of all, the "Christmas message" was Command window output from WinDBG, which is part of the great and free Debugging Tools for Windows package. The command uses the built-in .foreach meta-command to iterate over the output of the given search command ("s") and pass each token to the .printf command. "greeting" is the placeholder variable used in the .foreach command and represents the memory address of each search hit. The format string passed to the .printf command ("%mu") tells the debugger to interpret whatever is at that address as a null-terminated Unicode string and print it out.

The search command (s -[1]u 0 L?0xffffffff "Merry") searches the specified address range for the Unicode string (the "u") "Merry", but only prints out the address at which it is found (the "-[1]"). The address range specified here is from 0 to 4Gb ("L?0xffffffff" – more cryptic WinDBG syntax). The address range is really the clue as to the bitness of the operating system I'm doing this on. Whilst this could be a 64-bit debuggee, it's unlikely we would want to search just the first 4Gb of a possible 16Tb address range if we were looking for something. So it is more likely we are debugging a 32-bit debuggee. In which case the question is: in what circumstances do we have a 4Gb address range in a 32-bit process? The only case where that applies is a 32-bit process on a 64-bit version of Windows.
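For illustration, the effect of the whole one-liner can be sketched in Python: scan a buffer for the UTF-16LE bytes of "Merry", then print the null-terminated Unicode string at each hit, as .printf "%mu" would. The buffer below is a made-up stand-in for the process address space.

```python
# A made-up buffer standing in for the 4Gb address space being searched.
buf = (b"\x00\x00"
       + "Merry Christmas and a Happy New Year!".encode("utf-16-le")
       + b"\x00\x00")

def read_wide_sz(buf, pos):
    """Read a null-terminated UTF-16LE string starting at pos (the %mu part)."""
    end = pos
    while buf[end:end + 2] != b"\x00\x00":
        end += 2
    return buf[pos:end].decode("utf-16-le")

# The 's -[1]u' part: find each hit of the Unicode needle; the .foreach
# part: loop over the hit addresses, printing the string at each one.
needle = "Merry".encode("utf-16-le")
hit = buf.find(needle)
while hit != -1:
    print(read_wide_sz(buf, hit))
    hit = buf.find(needle, hit + 1)
```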

So in summary: we are searching the entire 4Gb virtual address space of a 32-bit process on 64-bit Windows for the word "Merry" and finding one instance of it, at the start of the null-terminated Unicode string "Merry Christmas and a Happy New Year!", which we then print out.

And once again- Happy New Year!

Don't know about where you are, but here it is a very cold one (by UK standards). Here's the Wunderground weather station nearest to me.

Doug

Escaping entries in ADPlus config files


This is a useful point I saw discussed recently on some internal email. So I thought I would blog it before I lose it.

If you use ADPlus at all and make use of the (very useful) configuration file option you may have run into the situation where it complains about certain characters.

For example, suppose you want to include a custom action in the handling of a particular exception. You would probably try something like this:

<CustomActions1> .if (0n10 < ParamXYZ ) </CustomActions1>

But then you will get an error something like this:

C:\Program Files\Debugging Tools for Windows>adplus -crash -pn MyTestApplication.exe -c  MyConfigFile.cfg

*** ERROR ***

No matching closing tag in XML string [CustomActions1]. Check missing closing tag or use of invalid characters like '<' or '>' inside the tag.

 

Well, at least the error tells you what to look for, and sure enough we are using a '<' inside a tag. But what to do about it?

The answer is straightforward once you know it - just use standard XML escaping techniques:

<CustomActions1> .if (0n10 &lt; ParamXYZ ) </CustomActions1>
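The escaping itself is standard XML: & becomes &amp;amp;, < becomes &amp;lt; and > becomes &amp;gt;. If you ever generate such config entries programmatically, any XML escaping routine produces the same transformation; for example, Python's xml.sax.saxutils.escape:

```python
from xml.sax.saxutils import escape

# escape() converts the characters ADPlus objects to (&, <, >) into
# XML entities, leaving everything else alone.
print(escape(".if (0n10 < ParamXYZ )"))
```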

HTH

Doug

“Failed to load data access DLL, 0x80004005” – OR – What is mscordacwks.dll?

Ever seen this error in a WinDBG/CDB debug session?

Failed to load data access DLL, 0x80004005
Verify that 1) you have a recent build of the debugger (6.2.14 or newer)
            2) the file mscordacwks.dll that matches your version of mscorwks.dll is
                in the version directory
            3) or, if you are debugging a dump file, verify that the file
                mscordacwks_<arch>_<arch>_<version>.dll is on your symbol path.
            4) you are debugging on the same architecture as the dump file.
                For example, an IA64 dump file must be debugged on an IA64
                machine.

You can also run the debugger command .cordll to control the debugger's
load of mscordacwks.dll.  .cordll -ve -u -l will do a verbose reload.
If that succeeds, the SOS command should work on retry.

If you are debugging a minidump, you need to make sure that your executable
path is pointing to mscorwks.dll as well.

This error message often faces people trying to debug dumps of .NET 2.0 applications with WinDBG/CDB and the SOS debugger extension. I’ve been having more than my fair share of issues with it lately and I thought it justified a bit of explanation.

What is mscordacwks.dll?

The Common Language Runtime (CLR) is the core engine of the Microsoft .NET Framework that executes managed code. In simple terms it does this by taking the intermediate language and metadata in a managed assembly, JIT compiling the code on demand, building in-memory representations of the types the assembly defines and uses, and ensuring the resulting code is safe, secure and verifiable and gets executed when it is meant to. This engine is itself implemented in native code. When we want to debug a .NET application using a native debugger like CDB or WinDBG (which we currently do a lot of if we want to debug it using post-mortem memory dump files) we have to use a “bridge” between the native debugger and the managed world, because the native debugger does not inherently understand managed code.

To provide this bridge, the CLR helpfully ships with a debugger extension – SOS.DLL. This understands the internals of the CLR and so allows us to do things like outputting managed call stacks, dumping the managed heap and so on.

But from time to time these internal data structures and details of the CLR change, so it is useful to abstract the interface the debugger extension needs from the actual internal implementation of the CLR that makes .NET applications work. Enter mscordacwks.dll. This provides the Data Access Component (DAC) that allows the SOS.DLL debugger extension to interpret the in-memory data structures that maintain the state of a .NET application.

If you look in your framework folder you should always see a matching set of these 3 DLLs:

[Screenshot: the matching mscorwks.dll, mscordacwks.dll and sos.dll files in the framework folder]

If you work with 64-bit you should also see a matching DLL set in the Framework64 folder.

 

What does this error message mean?

It means that the SOS.DLL debugger extension has not been able to find the matching mscordacwks.dll that it needs to be able to debug the dump file you are trying to debug.

How do you know I am debugging a dump file?

Because if you were debugging a live application the debugger extension would automatically find and load the mscordacwks.dll from the framework directory.

When am I likely to get this error message and when will I not get it?

You will get it if you are debugging a dump file from an application that was using a different build (e.g. a different installed service pack or hotfix) of the CLR from the one installed on your local system, or if the .NET Framework was installed in a different location to where it is installed on your system, and the correct mscordacwks.dll is not discoverable by the debugger by some other means.

What “other means”?

Because having the matching mscordacwks.dll is so important for SOS.DLL to work correctly, SOS has a number of tricks up its sleeve to find it. In particular, provided the correct indexing to the symbol server has occurred the debugger will load it from there. The debugger will also look for it in your debuggers directory provided it has been renamed in a special way (see below).

So how do I fix it?

Most of the time, if you have your symbol path set up correctly (which you will need to anyway to make any headway at all with debugging anything, let alone managed applications) then the debugger should be able to get the correct mscordacwks.dll from the symbol server automatically:

!sym noisy
.symfix c:\mylocalsymcache
.cordll -ve -u -l

What if that doesn’t work?

The simplest thing is to ask the person that gave you the dump file to give you a copy of the matching mscordacwks.dll. Once you have it, check its file properties for the version number. It should be something like 2.0.50727.xxxx. Then rename it to

mscordacwks_AAA_AAA_2.0.50727.xxxx.dll

where xxxx is the appropriate part of the version number and AAA is either x86 or AMD64 depending on whether you are dealing with a 32-bit or a 64-bit application dump. (AMD64 is a legacy name from before we referred to x64.) Then put this renamed copy into your debuggers directory (the one where WinDBG is installed). Then, as per the error message, tell the debugger to try again:

.cordll -ve -u -l
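For reference, the renaming convention can be expressed as a tiny helper. This is just a sketch of the naming scheme; the version number below is only an example – use the one from your DLL's file properties.

```python
def dac_filename(arch, version):
    """Build the renamed DAC filename the debugger looks for in the
    debuggers directory. arch is "x86" or "AMD64"."""
    return f"mscordacwks_{arch}_{arch}_{version}.dll"

# Example only -- substitute your DLL's actual file version:
print(dac_filename("x86", "2.0.50727.3053"))
```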

Although we try to ensure that every build of the CLR that is released (as a service pack, a hotfix or whatever) has its mscordacwks.dll indexed on the public symbol server, unfortunately it sometimes does not happen. But since it always ships as part of the CLR you always have the option of getting it from the machine the dump came from.

I tried the verbose logging option and it seems to be confused about whether it wants x86 or x64. Now what?

So you ran .cordll –ve –u –l as instructed and got a message something like this:

CLR DLL status: ERROR: Unable to load DLL mscordacwks_AMD64_x86_2.0.50727.3053.dll, Win32 error 0n87

What this means is that you most likely took a dump of a 32-bit process (running under WoW64) on a 64-bit system using a 64-bit debugger, and are now also trying to analyse the dump using a 64-bit debugger. That's why the message references AMD64 and then x86. This is not going to work: because the SOS.DLL extension actually makes use of the framework while debugging, the bitnesses need to match. I strongly recommend always generating the dump using a debugger of the same bitness as the process (so an x86 debugger for WoW64 processes, even though the system is an x64 system) and analysing the dump with the same bitness of debugger that generated it. That means, of course, that you cannot debug a 64-bit dump on a 32-bit system. It also means you have to have the framework installed to debug a managed application dump.

Now it’s telling me it suffered an “init failure”?

You might see this:

0:018> .cordll -ve -u -l
CLRDLL: ERROR: DLL C:\Windows\Microsoft.NET\Framework\v2.0.50727\mscordacwks.dll init failure, Win32 error 0n87
CLR DLL status: ERROR: DLL C:\Windows\Microsoft.NET\Framework\v2.0.50727\mscordacwks.dll init failure, Win32 error 0n87

This also points to a bitness mix up. I've seen this when using a 32-bit debugger to analyse a dump of a WoW64 process generated with a 64-bit debugger.

Isn’t this problem as old as the hills?

I am certainly not the first person to blog about this and won’t be the last either but I thought it was worth me attempting to explain some of these mysteries since it still continues to get people confused. Here are some other posts about this:

Failed to load data access DLL, 0x80004005 – hm
"Failed to start stack walk: 80004005", "Following frames may be wrong" and other errors you may see in windbg
Production Debugging for Hung ASP.Net 2 applications – a crash course
Loading CLR DAC dll from a different path

HTH

Doug

Must know info on PDB files…


I don’t think I’ve ever come across anything written by John Robbins that is not worth reading. But in a recent post he talks about a subject that rings particularly true for me – the importance of debug symbols. Symbols are like the Rosetta stone of debugging – they are the key to interpreting debugger output and relating it back to the source code of the application or component. Without them, much of the debugger output is meaningless. Not meaning-free. Just meaningless. It is not impossible to debug without symbols. It is just much harder. Maybe 10 times harder, maybe 100 times harder. And harder means more time, and more time means more money.

I have lost count of the number of situations where I have assisted customers with applications in production that are dying on their feet due to some crash or memory leak, and the conversation goes something like this:

Me: Your application is failing/leaking memory from modulename!RandomSoundingFunctionName+0xVERYBIGHEXOFFSET

Customer: My development team doesn’t recognise that function

Me: That’s because it is simply the nearest exported function name that the debugger has found, plus an offset

Customer: Can we get the actual function name?

Me: Unfortunately not without matching symbols

Customer: My development team have just rebuilt the component with symbols and I’ve uploaded the PDB to you

Me: I can’t use that because it doesn’t match the binary that was on the server when the dump file was generated

Customer: My development team says they haven’t changed the source code much since the original deployment

Me: Unfortunately it won’t work. Even with no code changes a PDB can change from one compilation to the next, depending on the compiler. They will need to recompile and then you will have to redeploy, reproduce the issue, gather more dump files and upload them

Customer: That will take a week of dev, a week of test and a week of change control approval

Me: Ok, well, let’s see what we can do in the meantime

 

So my recommendation is – read John’s post and heed his wisdom.

HTH

Doug

Debugging and Profiling API enhancements in CLR 4.0


“THIS DUMP FILE IS PARTIALLY CORRUPT”.


I was investigating an issue today and needed to create a kernel dump on demand in my repro machine, a copy of Windows Server 2003 SP2 hosted in Hyper-V. I successfully blue-screened it (!):

[Screenshot: the blue screen]

and left it to get on with generating the dump and rebooting. Because I wasn’t watching it I never saw it actually get to 100%.

However when I opened the dump I got a whole load of messages I didn’t like the look of:

Kernel Complete Dump File: Full address space is available
************************************************************
WARNING: Dump file has been truncated. Data may be missing.
**************************************************************************
THIS DUMP FILE IS PARTIALLY CORRUPT.
KdDebuggerDataBlock is not present or unreadable.
**************************************************************************
*********************************************************************
Unable to read PsLoadedModuleList
**************************************************************************
THIS DUMP FILE IS PARTIALLY CORRUPT.
KdDebuggerDataBlock is not present or unreadable.
**************************************************************************
Unable to read selector for PCR for processor 0
GetContextState failed, 0x80070026
GetContextState failed, 0x80070026
Unable to get current machine context, Win32 error 0n38
GetContextState failed, 0x80070026
Unable to get current machine context, Win32 error 0n38
GetContextState failed, 0x80070026

and so it went on, and on, and on, page after page of scary looking error messages.

So I tried again a few times but always the same result.

Then I got to thinking about why the dump might not be getting fully written and I thought “pagefile”.

The pagefile settings in the pre-configured virtual image I was using looked something like this:

[Screenshot: the virtual memory / pagefile settings dialog]

And sure enough if you try to set such settings you get a warning like this:

[Screenshot: warning about the pagefile size]

 

So I set it back to “system managed size”:

 

[Screenshot: pagefile set back to "System managed size"]

and rebooted then crashed the system again and this time the dump was fine with no corruption reports.

Now, I had always thought (and it had been my experience) that if your pagefile was too small you simply didn’t get a dump file created when the crash occurred. I had not realised that you might end up with a “partially corrupt” dump file.

HTH

Doug


On memory leaks…


If you have a rusty old bucket and it is overflowing because you put too much water in it, then you should certainly think about buying a nice new plastic one at some point. But the new one will still overflow if you put too much water in it.

Object doesn't support this property or method


This is a short post about a very strange support case I had. I should start by mentioning, however, that you can see the same error for other reasons – not least because you’ve mistyped the property or method name :-).

In this particular case the customer had a classic ASP web application that used VBScript. It would work for a long period of time (days) but then begin to experience errors like the following:


Microsoft VBScript runtime error '800a01b6'
Object doesn't support this property or method: '<method name>'


These errors would become more and more frequent and the only solution my customer had found was to restart the process hosting the script. In his case that meant an IISRESET.

After much investigation and debugging we eventually figured out that he was hitting a variant of this issue:

You receive an error message if you try to start the VBScript engine from a Microsoft C++-based program on a Windows Server 2003 Service Pack 1-based computer

The underlying issue is specific to 32-bit processes that have more than a 2Gb address space available to them. Therefore this can occur both on 32-bit versions of Windows booted with the /3GB switch in BOOT.INI and on 64-bit versions of Windows where VBScript is hosted in a 32-bit process. (This is because 32-bit processes running on 64-bit systems have a 4Gb address space by default.)


The same underlying problem is also known to cause problems for the Scriptor component of Commerce Server:


FIX: You may receive a Scriptor component error message when you add the /3GB switch to the Boot.ini file on a Windows Server 2003 SP1-based computer that is running Commerce Server 2002 or Commerce Server 2007

HTH

Doug

.NET Type internals

Failed to CoCreate profiler


Sometimes support cases are like buses. You see none for ages, and then two or three of the same kind come along all at once. Recently one of my team mates asked me about an error his customer was getting in their event log:

Source: CLR
Category: None
Event ID: 0

Description: The description for Event ID (0) in Source (CLR) cannot be found. The local computer may not have the necessary registry information or message DLL files...
The following information is part of the event: Failed to CoCreate profiler..

I’d not come across it or at least if I had then I had forgotten it. So I had no quick answer to what might be causing it. Then the next day, a different colleague asked me about the same error for a different customer.

What is this error telling us exactly?

The error is coming from the CLR (a.k.a. the Common Language Runtime, the core engine of the .NET Framework). CoCreate here refers to the CoCreateInstance or CoCreateInstanceEx APIs which are the way in COM of creating an object instance (Remember COM, that funny thing we used to use before the days of .NET? not sure? never heard of it? you could always grab yourself a copy of “Mr Bunny’s guide to ActiveX” or perhaps Don Box’s classic “Essential COM”).

The profiler it is referring to here is any registered CLR profiler. The CLR will check to see if a profiler is registered and load it if it is. It does this check when the process starts. The method of registration is quite simple – it looks for two environment variables COR_ENABLE_PROFILING and COR_PROFILER. If COR_ENABLE_PROFILING is set to a non-zero value then we are saying that profiling is enabled. If it is, the CLR reads the COR_PROFILER variable to find out the COM GUID of the CoClass of the profiler. You can read more here.
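In outline, the check the CLR performs at process start amounts to something like this sketch (the GUID here is a made-up placeholder, not a real profiler's CLSID):

```python
import os

def profiler_to_load(env):
    """Sketch of the CLR's start-up check: a non-zero COR_ENABLE_PROFILING
    means 'CoCreate the profiler whose CLSID is in COR_PROFILER'."""
    if env.get("COR_ENABLE_PROFILING", "0") != "0":
        return env.get("COR_PROFILER")
    return None

# Simulate an environment in which profiling has been enabled:
env = dict(os.environ,
           COR_ENABLE_PROFILING="1",
           COR_PROFILER="{11111111-2222-3333-4444-555555555555}")
print(profiler_to_load(env))
```

If profiler_to_load returns a CLSID, the CLR attempts the CoCreate; if that attempt fails for any of the reasons discussed below, you get the event log entry.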

But if you are a non-interactive process, such as the w3wp.exe that hosts ASP.NET applications, where are these environment variables defined? One place is among the system environment variables. That had been checked – typing SET at a command prompt shows you the amalgamation of the user-specific environment variables for the currently logged-on user and the system environment variables (which apply to all processes, including non-interactive processes and processes running under other user identities).

One way you can check if a process has got a profiler active is from a memory dump taken at some time after it has started. In the !peb output you should clearly see the environment variables in effect for this process, coming from all places in which they might be defined.

One other place an environment variable can be defined for w3wp.exe is in the environment of the Windows service that spawns the process. In this case the customer had them defined in two places:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W3SVC
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\IISADMIN

Both keys had a REG_MULTI_SZ value called “Environment” defined, in which the two COR* environment variables were present.

Removing these (just the two variables in question, not the whole ‘Environment’ value – and carefully please: if you remove the whole value, IIS may fail to start!) fixed the problem for the customer.

Note that the reasons the specified profiler may have failed to CoCreate could be many. It could be that the profiler was uninstalled but the environment variables somehow got left behind. It could be that the profiler DLL associated with the CoClass was not registered correctly. It could be that one of the dependencies of the profiler DLL was missing (try doing a REGSVR32 on it to find out). Or perhaps there was a permissions issue accessing the profiler DLL or one of its dependencies. Process Monitor would always be a good way to troubleshoot this kind of issue.

I’ve seen “forgotten” profilers cause other issues over the years. For example, I’ve seen them cause slowdowns on production systems because customers had either forgotten, or never knew, that someone had installed a profiler. I’ve seen them interfere with line-number alignment when performing interactive debugging with Visual Studio in test environments. All sorts. Profilers are very useful and have their place, but you have to remember where you have installed and enabled them.

By the way, searching for this error on the web only pops up a handful of hits so it would not appear to be very common:

Forum posting about “Failed to CoCreate profiler”
RedGate Software article about this error in relation to their profiler

So maybe it was just coincidence that I came across this problem two days in a row!

HTH

Doug

Generating Application Verifier logs for web applications


Application Verifier is a useful tool for capturing application bugs (mostly of the non-.NET variety) – things like heap corruption, resource leaks, invalid handle usage and the ignoring of exceptions by exception handlers, to name a few. It has a reasonably friendly GUI for configuring which applications you want to monitor and what you want to monitor them for, and you can also view the log files that have been created from there as well:

[Screenshot: the Application Verifier GUI]

Note: the Application Verifier page linked to above refers you to the download link for the tool on the Microsoft Download Center. At the time of writing that will get you version 4.0.665 of Application Verifier. However that is NOT the latest version. To get the latest version (4.0.0917 at the time of writing) you have to install the Windows 7 SDK. This is obviously a much bigger download, but with the web-based setup you can be quite selective about which bits of the SDK you install, and Application Verifier is identified clearly in the custom setup options:

[Screenshot: Application Verifier in the Windows 7 SDK custom setup options]

 

If you are a web developer there are a couple of “gotchas” relating to the logging that you should be aware of.

By default, logs are written to a location within the user profile path of the user account the application under test runs as. So for an interactive desktop application like Notepad it will write to something like this:

"C:\Users\John\AppVerifierLogs\notepad.exe.0.dat"

However, if the process is running as a special account, such as the NETWORK SERVICE account used for most IIS w3wp.exe application pool processes, the profile-based location may correspond to somewhere the account does not have write access to:

"C:\Windows\System32\config\systemprofile\AppVerifierLogs\w3wp.exe.0.dat"

(or for a 32-bit process on a 64-bit system, C:\Windows\SysWow64\config\systemprofile\AppVerifierLogs).
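As an aside, the pattern of the default log path is easy to predict once you know which profile directory is in play. Here is a small illustrative sketch (the function name and layout are my own, not part of Application Verifier):

```python
import ntpath  # Windows-style path joining; part of the standard library on any OS

def default_log_path(profile_dir: str, exe_name: str, run: int = 0) -> str:
    # Application Verifier writes logs as <profile>\AppVerifierLogs\<exe>.<n>.dat
    return ntpath.join(profile_dir, "AppVerifierLogs", f"{exe_name}.{run}.dat")

# An interactive user running Notepad:
print(default_log_path(r"C:\Users\John", "notepad.exe"))
# NETWORK SERVICE (e.g. a 64-bit w3wp.exe) resolves to the system profile:
print(default_log_path(r"C:\Windows\System32\config\systemprofile", "w3wp.exe"))
```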

Application Verifier does have a feature to cope with this; you can specify a log file path to be used by such “protected processes”:

AppVerif -sppath C:\MyLogsLocation

Unfortunately, as I found out recently during a customer support case, there is currently a bug in the logic for deciding that the alternate location should be used. The issue is this: if the “\config\systemprofile\AppVerifierLogs” location does not exist, Application Verifier (or more specifically vrfcore.dll, the component of Application Verifier that runs inside the process being monitored) tries to create it. If the creation of that folder fails, it then reads the registry looking for this value:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\{ApplicationVerifierGlobalSettings} | ProtectedProcessLogPath (REG_SZ)

which it interprets as the alternate log location.

Unfortunately, if the “\config\systemprofile\AppVerifierLogs” folder DOES exist but the process does not have write access within it (to create log files), then vrfcore.dll does not check for the alternate location. The result is that you just don’t get any log files. The best solution is to delete the “\config\systemprofile\AppVerifierLogs” folder (having first taken a copy of any existing log files you want to keep) and then follow the procedure for specifying an alternate location.
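To make the failure mode concrete, here is a small Python simulation of the decision logic as I understand it. This is my reading of the observed behaviour, not the actual vrfcore.dll source; the function and its inputs are purely illustrative:

```python
def choose_log_dir(default_exists, can_create_default, can_write_default, registry_alt_path):
    """Simulates the reported vrfcore.dll behaviour when picking a log folder.
    Returns the folder that would be used, or None when no log gets written."""
    if not default_exists:
        if can_create_default:
            return "default"        # created \config\systemprofile\AppVerifierLogs
        return registry_alt_path    # creation failed -> ProtectedProcessLogPath is consulted
    if can_write_default:
        return "default"
    return None                     # the bug: folder exists but is unwritable -> no fallback

# Folder missing and uncreatable: the -sppath location is honoured.
print(choose_log_dir(False, False, False, r"C:\MyLogsLocation"))  # C:\MyLogsLocation
# Folder present but no write access: no log files at all.
print(choose_log_dir(True, False, False, r"C:\MyLogsLocation"))   # None
```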

Once you have specified an alternate location using -sppath and ensured that Application Verifier does actually try to use it, you still need to ensure that the process you are monitoring has write access to the specified log folder.

You can ensure that w3wp.exe will have write access to the specified location by giving full control to one of three SIDs (security IDs):

MACHINENAME\IIS_IUSRS – this would give permission for any application pool [IIS7 and above or IIS_WPG on IIS6]

MACHINENAME\NETWORK SERVICE – this should give permission for any application pool running under the default identity

IIS APPPOOL\applicationpoolname – this would give permission for the named application pool only  (Note: to give permission for this SID, you have to set MACHINENAME as the location when you are in the ‘Select Users or Groups’ dialog and then specify “IIS APPPOOL\applicationpoolname” as the user name ) [IIS7 and above]

A couple of reference sites useful for understanding application pool identities and accounts:
http://learn.iis.net/page.aspx/140/understanding-the-built-in-user-and-group-accounts-in-iis-70/
http://adopenstatic.com/cs/blogs/ken/archive/2008/01/29/15759.aspx

 

Note: in a situation where the logs do end up in the alternate location these logs do NOT appear listed in the Application Verifier GUI.

For that to happen, you need to manually move the log file from the alternate location to the location that would normally be used if you were doing an application verifier test on an interactive application running under your own login, e.g. C:\Users\John\AppVerifierLogs

HTH

Doug

Tool to turn Hyper-V VM state into a memory dump – vm2dmp tool


Here is a very clever idea made reality. Take a virtual machine that you want to do some kernel-level spelunking on. Rather than going into the guest and generating a kernel dump by one of the usual methods, take the saved state of the virtual machine and use this new tool to make a memory dump.

HTH

Doug

An index to Maoni's blog posts about the GC


Finding the .NET version in a debug session


An interesting little question came up on one of our internal discussion groups today.

“How can I find in a debug session the version of the .NET runtime being used in the debuggee?”

[in an automated/scripted fashion and without using debugger extensions or symbols]

Here is what I came up with:

0:029> !for_each_module .if ( ($sicmp( "@#ModuleName" , "mscorwks") = 0) | ($sicmp( "@#ModuleName" , "mscorsvr") = 0) | ($sicmp( "@#ModuleName" , "clr") = 0)) {.echo @#ProductVersion}

2.0.50727.3607

HTH

Doug

Debug Analyzer.NET–a tool to really rev up your debugging


One of my colleagues, Sukesh, has just released a beta of a tool he has written called Debug Analyzer.NET. I’ve not had the chance to give it a test drive yet but it looks to be a very powerful and rich framework/tool to help you automate the debugging experience. Very useful if you do a lot of dump analysis.

HTH

Doug

Enabling Gflags or AppVerifier options for a particular service or COM+ package.


I saw the following tip go around on email. Certain debugging options such as PageHeap and AppVerifier settings are set per process image name (because they use the Image File Execution Options key in the registry). That can make it tricky to set them for processes that host other things, such as dllhost.exe, svchost.exe and w3wp.exe.

For services hosted in svchost.exe you can try the following:

1) Make a copy of svchost.exe in the System32 directory and call the copy “Mysvchost.exe”.
2) Using regedit, open HKLM\System\CurrentControlSet\Services\MyService.
3) Edit the value “ImagePath”, which will be something like “%SystemRoot%\system32\svchost.exe -k myservice”, and change svchost.exe to “Mysvchost.exe”.
4) Add “Mysvchost.exe” to the AppVerifier list and configure the settings you want, or use gflags to set the options you wish for Mysvchost.exe.
5) Reboot (if troubleshooting something that goes wrong at startup).
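Step 3 amounts to a one-word edit of the ImagePath string. If you were scripting the change, the string edit itself might look like this hypothetical helper (illustrative only; you would still make the actual registry change with regedit or reg.exe):

```python
def rewrite_image_path(image_path: str, new_host: str = "Mysvchost.exe") -> str:
    """Swap the svchost.exe host executable in a service ImagePath for a
    renamed copy, preserving the '-k <group>' argument. Purely illustrative."""
    exe = "svchost.exe"
    idx = image_path.lower().find(exe)  # registry paths are case-insensitive
    if idx == -1:
        raise ValueError("ImagePath does not reference svchost.exe")
    return image_path[:idx] + new_host + image_path[idx + len(exe):]

print(rewrite_image_path(r"%SystemRoot%\system32\svchost.exe -k myservice"))
# -> %SystemRoot%\system32\Mysvchost.exe -k myservice
```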

 

For things hosted in dllhost.exe (COM+ applications) there is a trick you can use. On the Advanced tab of the COM+ package properties page there is a setting called “Enable 3GB support”:

[Screenshot: the “Enable 3GB support” setting on the COM+ package Advanced tab]

When you set this (which is not often set), your components get loaded into a process called dllhst3g.exe. You can then set your gflags or AppVerifier settings for that image name.

I don’t know of a way to do it for one particular w3wp.exe. The best I can think of is to set the required Image File Execution Options setting, recycle the application pool, ensure the new w3wp.exe for the application pool has come back up, and then revert the setting immediately (these settings are read at process startup).

HTH

Doug

You spent HOW much on our new server and the app slowed DOWN??!!


I had a support case recently where the customer had moved their server farm onto brand new hardware, each server with lots of CPUs. At the same time they had taken the operating system from Windows Server 2003 to Windows Server 2008 R2. I forget how many CPUs they had but let’s just say that their task manager looked something like this:

[Screenshot: Task Manager showing a very large number of CPU usage graphs]

(This is not a screenshot from my desktop machine at work I hasten to add. Oh, if only!)

They were understandably disappointed to find that the response time of a key web service had degraded from about 0.2ms to 0.4ms.

Now what they had noticed was that CPU#0 was getting more than its fair share of the work, peaking near 100% a lot of the time. The other curious thing they had found was that if they disabled CPU#0 for the w3wp.exe hosting the application while the process was running, the problem resolved and performance increased. [You can do this by right-clicking on the process in the list on the Processes tab in Task Manager and selecting “Set affinity”. This is something I would strongly recommend against doing in the normal course of events, but in this case it was a useful diagnostic step.] But they also found that if they permanently disabled use of CPU#0 for that application pool by setting the processor affinity mask in the application pool advanced properties, the high CPU just shifted onto CPU#1.

The other thing that had been observed was that the “.NET CLR Memory\% Time in GC” performance counter was now around 40%, whereas on the old servers it had been around 2%. Not good.

Anyway, in the end we got the debugger attached (of course!). Now the .NET Garbage Collector (GC), when running in server mode (which is what ASP.NET uses on multi-processor machines), creates a dedicated thread per CPU. As most real world applications tend to allocate objects quite liberally and leave them for the GC to clear up, it is not that uncommon to see the GC threads at the top of the list in !runaway output. For example, on a machine with 4 logical CPUs it might look like this:

0:000> !runaway
User Mode Time
  Thread       Time
  26:56c       0 days 0:05:10.328
  38:488       0 days 0:05:10.750
  37:be4       0 days 0:05:07.328
  39:dc8       0 days 0:04:37.796
  48:acc       0 days 0:00:27.484
  31:1144      0 days 0:00:22.156

You can see the top 4 threads have used up considerably more user mode time than subsequent threads. This is normal and reasonable.

In my customer’s case however it looked more like this:

0:000> !runaway
User Mode Time
  Thread       Time
  26:56c       0 days 0:07:10.328
  38:488       0 days 0:00:10.750
  37:be4       0 days 0:00:07.328
  39:dc8       0 days 0:00:37.796
  48:acc       0 days 0:00:27.484
  31:1144      0 days 0:00:22.156
…

The top thread (which was indeed a GC thread) was chewing up much more time than any other.
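If you find yourself eyeballing a lot of these listings, the comparison is easy to script. This hypothetical Python sketch parses !runaway-style output and flags a single thread that dwarfs the runner-up (the function names and the threshold are my own, not part of any debugger tooling):

```python
import re

def parse_runaway(text):
    """Parse '!runaway' lines like '26:56c   0 days 0:05:10.328' into
    (thread-id, seconds) pairs, sorted with the biggest consumer first."""
    rows = []
    for m in re.finditer(r"(\d+:[0-9a-f]+)\s+0 days (\d+):(\d+):(\d+\.\d+)", text):
        tid, hours, mins, secs = m.groups()
        rows.append((tid, int(hours) * 3600 + int(mins) * 60 + float(secs)))
    return sorted(rows, key=lambda r: r[1], reverse=True)

def dominant_thread(rows, factor=5.0):
    """Return the top thread id if it has burned `factor` times more user-mode
    CPU than the next thread -- the pattern seen with one overloaded GC thread."""
    if len(rows) >= 2 and rows[0][1] > factor * rows[1][1]:
        return rows[0][0]
    return None

healthy = "38:488 0 days 0:05:10.750\n37:be4 0 days 0:05:07.328\n"
sick    = "26:56c 0 days 0:07:10.328\n38:488 0 days 0:00:10.750\n"
print(dominant_thread(parse_runaway(healthy)))  # None   (work evenly spread)
print(dominant_thread(parse_runaway(sick)))     # 26:56c (one thread dominates)
```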

This rang some bells at the back of my mind and I remembered this fix:

A hotfix is available that resolves the System.InsufficientMemoryException exception and enhances the heap balancing on a computer that has over 8 processors for the .NET Framework 2.0 Service Pack 2

Don’t be fooled by the title. If you look at the article you’ll see the hotfix release addresses two issues at the same time. [Hotfixes are cumulative, so when you install a fix you are always getting lots of fixes anyway; it is just that usually each hotfix release adds one new fix to the mix.] Here is the description:

Issue 2

You run a .NET Framework 2.0-based application on a computer that has more than 8 logical processors. The computer uses the server garbage collector. In this case, you may experience a memory issue caused by an unbalanced workload in different processors. For example, the application runs slower than when you run the application on a computer that has 8 logical processors.

That certainly seemed to fit the bill.

We installed the fix on the customer’s server and the issue was resolved!

HTH

Doug

New developer tools features in IE9


Good blog post here detailing the new developer tools features in Internet Explorer 9.

HTH

Doug
