Tuesday, January 8, 2008

Capturing PSSDIAG/SQLDump, Right In Time

    Usually when you contact Microsoft PSS for SQL Server related issue they will ask you to run the PSSDIAG if its performance issue or if its SQL Server hung/restart issue they will ask you to capture memory dumps. If the issue is known to occur at certain time or if you can reprooduce the issue its fine you can trigger PSSDIAG or SQLDump well in time. But usually that is not the case. The issues are totally random and unknown in that kind of situation as a DBA you will face one problem - how will you run PSSDIAG or SQLDump right at the time of issue occurance.

    Understand the symtoms: If you know the symptoms like heavy blocking, slow down, unresponsiveness etc you can write a VBScript or combination of DOS and OSQL to detect the symptoms. So lets take un-responsiveness as an example. If this is the case I would write an Vscript and ping the SQL Server if the connection time is beyound 3-4 seconds then I will trigger the PSSDIAG or SQLDump whatever MS Support engineer has asked me to capture. This would totally automate the process and no human intervention is required.

    In one of my cases what happened was there was an issue with SQL Server and it used to restart automatically. This behaviour was very much random was not sure when would the next re-start would occur. After initial investigation we found that the MS Clustering service was causing the SQL Server to restart. The way MS Clustering service works is - if the resource (in this case SQL service) is not responding (failure of isAlive check) then it will attempt to restart it. After restart attempts exceeds the configured thershold value Clustering service will attempt to failover. Which means the SQL Service will stop & has to start from the other node and hence I used to see the restart event occuring. So what I did was, I whipped out a simple VBScript to detect the SQL Server's situation and trigger the memory dump. Along with this I also changed the threshold timer and number of attempts in cluadmin so that Clustering service will not failover before my script does its job.

    Note: Pls evaluate the Pros/Cons or risks before making any change in Cluster Configurations.

    So this is how I was successful in capturing the SQLDump right in time and sent it to PSS for analysis. We found that the SQL Server was running out of "worker thread" and due to which applications were failing (new connections) to connect to SQL Server. And even Clustering service was unable to do its "isAlive" checks and it used to interpret this as serice failure.

    Similarly like what I explained above you can execute PSSDIAG or any other utility of choice if required.

No comments:

About Me

By profession I am a Database Administrator (DBA) with total 13 yrs. of experience in the field of Information Technology, out of that 9 yrs as SQL DBA and last 3 years in IBM System i/iSeries and DB2 LUW 9. I have handled Developer, plus production support roles, and I like both the roles. I love and live information technology hence the name "Techonologyyogi" Apart from that I am a small, retail investor, with small investments in India and United States in the form of Equity holdings via common stocks. Don't ask me if I have made money, I have been loosing money in stocks.