Got A Better Name? Please Let Me Know!

The document discusses SQLCallStackResolver, a tool that helps troubleshoot SQL Server issues by resolving call stacks from SQL dumps, XEL files, and error logs into readable function names. It provides an overview of debugging SQL Server, call stacks, and how SQLCallStackResolver addresses challenges like missing module information in XEL files. The document demonstrates how SQLCallStackResolver works and how to collect call stacks, resolve symbols, and use it to analyze a case study on spinlock contention.

Uploaded by

Asep Wijaya
ADVANCED SQL SERVER TROUBLESHOOTING WITH SQLCALLSTACKRESOLVER
Got a better name? Please let me know!
Arvind Shyamsundar
Principal Program Manager | Microsoft Azure CAT (a.k.a. SQLCAT)
Twitter: @arvisam
Blog: http://aka.ms/arvindsh
LET’S LOOK AT…
 Target SQL scenarios
 Debugging basics
 A lap inside the SQLServer.exe process
 Call stacks and PDB symbols
 SQLCallStackResolver basics
 Installation and UI
 Matching PDB symbols
 Resolving various avatars of call stacks
 Module + offset format
 Hex address only
 XEL files
 Recap / Wrap-up
WHY SQLCALLSTACKRESOLVER?
Efficient diagnosis of difficult problems
 Spinlock contention
 Troubleshooting waits – overloaded wait types / “strange” wait types
 Asserts / AVs
 Unexplained memory allocations inside SQL
Challenges that SQLCallStackResolver addresses
 Dumps are not always available
 Dumps are invasive and disrupt production
 XEL files (when viewed in SSMS) do not display module+offset notation (we will detail this soon!)
 Scale of analysis (using WinDbg and resolving stacks manually is time-consuming)
WHAT’S OUR OBJECTIVE?
We want to go from here… … to here!
sqldk.dll+0x0000000000047645 sqldk!XeSosPkg::wait_info::Publish
sqldk.dll+0x0000000000001960 sqldk!SOS_Scheduler::UpdateWaitTimeStats
sqldk.dll+0x00000000000012DF sqldk!SOS_Task::PostWait
sqlmin.dll+0x000000000000187C sqlmin!EventInternal<SuspendQueueSLock>::Wait
sqlmin.dll+0x00000000000361FE sqlmin!LatchBase::Suspend
sqlmin.dll+0x00000000000083BE sqlmin!BUF::AcquireLatch
sqlmin.dll+0x000000000000861C sqlmin!BPool::Get
sqlmin.dll+0x000000000000A3D8 sqlmin!IndexPageManager::GetPageWithKey
sqlmin.dll+0x0000000000009E96 sqlmin!GetRowForKeyValue
sqlmin.dll+0x000000000000C314 sqlmin!IndexRowScanner::EstablishInitialKeyOrderPosition
sqlmin.dll+0x000000000000934D sqlmin!IndexDataSetSession::GetNextRowValuesInternal
sqlmin.dll+0x00000000000149D0 sqlmin!RowsetNewSS::GetNextRows
sqlmin.dll+0x00000000000DCAE2 sqlmin!CMEDScan::DeleteNCRow
sqlmin.dll+0x00000000000DC84B sqlmin!CMEDScan::DeleteRow
sqlmin.dll+0x000000000018C41E sqlmin!CMEDCatKatmaiIndex::DropRowset
sqlmin.dll+0x000000000018B388 sqlmin!VisibleHoBt::DropHoBt
sqlmin.dll+0x000000000018BF6B sqlmin!SEDropRowsetInternal
sqlmin.dll+0x000000000018B1A1 sqlmin!DDLAgent::SEDropRowsets
sqllang.dll+0x00000000002C1FFE sqllang!CIndexDDL::DropRowset
sqllang.dll+0x00000000002C1F5A sqllang!CIndexDDL::DropAllRowsets
sqllang.dll+0x0000000000243917 sqllang!DropAllRowsetsForTable
sqllang.dll+0x00000000002436B7 sqllang!DropObject
sqllang.dll+0x00000000002C1CD2 sqllang!FDropTempWithNolog
sqllang.dll+0x00000000002C1A80 sqllang!TmpObject::Release
COLLECTING CALL STACKS
Basic XEvent callstack action
CREATE EVENT SESSION [OOM] ON SERVER
ADD EVENT sqlserver.error_reported(
    ACTION(package0.callstack) WHERE (error_number = 701))
ADD TARGET package0.ring_buffer WITH (STARTUP_STATE=OFF)

Viewing data
SELECT event_session_address, target_name, execution_count,
    CAST (target_data AS XML) AS CallStack
FROM sys.dm_xe_session_targets AS xst
INNER JOIN sys.dm_xe_sessions AS xs
    ON (xst.event_session_address = xs.address)
WHERE xs.name = 'OOM';

“Bucketizing” call stacks
CREATE EVENT SESSION XESpins ON SERVER
ADD EVENT sqlos.spinlock_backoff
(
    ACTION (package0.callstack)
    WHERE type = 151
)
ADD TARGET package0.histogram
(SET source_type = 1,
    source = N'package0.callstack')
WITH (MAX_MEMORY = 32768 KB,
    EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
    MAX_DISPATCH_LATENCY = 5 SECONDS,
    MAX_EVENT_SIZE = 0 KB,
    MEMORY_PARTITION_MODE = PER_CPU,
    TRACK_CAUSALITY = OFF,
    STARTUP_STATE = OFF);

Other sources for call stacks: TXT files generated whenever a dump occurs, and the SQL Errorlog
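As an offline illustration of what “bucketizing” means, the following Python sketch groups identical call stacks and counts them, which is conceptually what a histogram target keyed on the callstack action does on the server. The function name bucketize and the input shape (a list of frame lists) are assumptions made for this sketch, not part of SQL Server or SQLCallStackResolver.

```python
from collections import Counter


def bucketize(stacks):
    """Count identical call stacks.

    Each stack is a sequence of frame strings (for example
    'sqldk.dll+0x0000000000047645'). Returns (stack, count) pairs,
    most frequent first.
    """
    counts = Counter(tuple(stack) for stack in stacks)
    return counts.most_common()
```

The most frequent buckets are usually the ones worth resolving first, since they represent the hottest code paths for the event being captured.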
MAKING SENSE OF CALL STACKS
Tools to look deeper
 Process Explorer (https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer)
 Debugging Tools for Windows / WinDbg (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/)
 Trace flag 3656. For more information, Diagnosing and Resolving Spinlock Contention on SQL Server is highly recommended reading.
Important DLLs you might notice in call stacks
 sqllang.dll; sqlmin.dll; sqldk.dll
 hkengine.dll; hkruntime.dll; hkcompile.dll
To get a deeper understanding of the code / functions that are running, we need SYMBOLS!
WHAT ARE (.PDB) SYMBOLS?
 PDB (Program Database) files are produced by the linker at the time a DLL or EXE binary is built
 These map the offsets of executable machine code in the binary to original source information (function name)
 Microsoft internally stores “private PDBs”, which contain the full set of information including source line number information
 Publicly, Microsoft “strips” out the sensitive information and makes available a “public PDB” with function name mappings and some other essential information.
 PDBs are meant to be matched with their corresponding binary file – DLL or EXE. This matching is tracked via a unique GUID (“signature”) and an “age” number.
 Garbage-in; Garbage-out (“GIGO”): mismatched PDBs mean incorrect mapping to readable function names
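To make the offset-to-name mapping concrete, here is a minimal Python sketch. The symbol table is entirely hypothetical (made-up offsets and names for a module called "demo"); a real resolver reads this information from the matched PDB rather than from a hard-coded list. The lookup simply finds the last symbol whose starting offset is at or below the offset in the frame.

```python
import bisect

# Hypothetical symbol table for one module: (start offset, function name),
# sorted by offset. Real values would come from the matched public PDB.
SYMBOLS = [
    (0x1000, "demo!FuncA"),
    (0x1200, "demo!FuncB"),
    (0x1900, "demo!FuncC"),
]
OFFSETS = [start for start, _ in SYMBOLS]


def resolve_offset(offset):
    """Return 'name+0xdelta' for the symbol that starts at or before offset."""
    idx = bisect.bisect_right(OFFSETS, offset) - 1
    if idx < 0:
        return None  # offset lies before the first known symbol
    start, name = SYMBOLS[idx]
    return f"{name}+0x{offset - start:X}"
```

This also shows why GIGO applies: with a PDB from a different build the table contents shift, so the same offset would silently land on a different (wrong) function name.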
SQLCALLSTACKRESOLVER: INSTALLATION
 Go to https://aka.ms/SQLStack
 Current release is under Releases
 You need Windows and .NET Framework 4.5.2 (or greater)
 Files you need to obtain manually:
 MSDIA140.DLL (if you have Visual Studio 2015, you are all set, else you need the 64-bit version from https://my.visualstudio.com)
 Microsoft.SqlServer.XE.Core.dll and Microsoft.SqlServer.XEvent.Linq.dll (SQL Server 2017)
DEMO: SQLCALLSTACKRESOLVER UI
GETTING MATCHED PDB FILES
 Go to the Wiki section under http://aka.ms/SQLStack
 Locate the page corresponding to the major version of SQL Server you are troubleshooting
 Most likely there’s already a script to download PDB files for the specific version (including minor version for SP / CU)
 Use that PowerShell script and modify the <somepath> placeholders; execute it to download the PDB files
 What if there is no entry for the specific version you are using? The next demo will show you how!
DEMO: CREATING PDB DOWNLOAD SCRIPTS
.XEL FILES, SSMS AND CALL STACKS
 The module+offset notation for call stacks is only available if you view the call stack information using a server-side method like dumping the ring buffer.
 When viewed with SSMS (offline) this module+offset notation is not preserved.
 SQLCallStackResolver helps in this situation!
Getting base addresses
First, run the query below within SSMS, and make sure you use Results to Grid:
SELECT name, base_address
FROM sys.dm_os_loaded_modules
Then, select ALL the results, and copy into clipboard.
Click “Enter base addresses” in the SQLCallStackResolver tool, and OVERWRITE what’s pre-populated with the values you copied.
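The reason base addresses are needed can be sketched in a few lines of Python: a raw hex address from an XEL file only becomes module+offset once you know where each module was loaded. This is an illustrative sketch, not the tool's actual code. The function names are hypothetical, and because the query above returns no module sizes, the sketch assumes an address belongs to the module with the nearest base address at or below it.

```python
def parse_base_addresses(pasted_text):
    """Parse 'name  base_address' rows as copied from the results grid."""
    modules = []
    for line in pasted_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        # The path may contain spaces, so the base address is the last token.
        path, base = " ".join(parts[:-1]), parts[-1]
        try:
            base_value = int(base, 16)
        except ValueError:
            continue  # skips a header row such as 'name base_address'
        file_name = path.replace("\\", "/").rsplit("/", 1)[-1]
        modules.append((base_value, file_name))
    return sorted(modules)


def to_module_offset(address, modules):
    """Convert an absolute address to 'module+0xoffset' notation.

    Assumption: the owning module is the one with the highest base
    address that does not exceed the given address.
    """
    candidates = [(base, name) for base, name in modules if base <= address]
    if not candidates:
        return None
    base, name = max(candidates)
    return f"{name}+0x{address - base:016X}"
```

Base addresses can change whenever the SQL Server process restarts, so the pasted values have to come from the same run of the process that produced the XEL file.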
DEMO: WORKING WITH XEL FILES
CASE STUDY: SPINLOCK CONTENTION
 A customer testing SQL Server 2017 with a highly concurrent workload was seeing 100% CPU consumption on a 144-CPU box
 They were using memory-optimized, non-durable tables with a natively compiled procedure and an interop stored procedure on top
 The requirement was to understand the reason for the high CPU usage, and look at options for either lowering the CPU usage or increasing throughput (measured by SQL Server: Batch Requests / sec.)
Key metrics:
Azure VM with 128 vCPU
~ 250 K Batch req/sec
100% CPU usage
LOCK_HASH spinlock having a huge amount of backoffs
SIDEBAR: XPERF
 Get these tools from the Windows SDK: https://developer.microsoft.com/en-US/windows/downloads/windows-10-sdk
 Once downloaded on a desktop, you can copy-paste the redistributable .MSI or the folder to a server (see Notes for details)
 XPerf is a command line tool for capturing traces. Capturing a basic XPerf trace for investigating high CPU usage is simple:
xperf -on Base
 After a few seconds of capturing data, stop the trace and write the data out by running:
xperf -d <path to a .ETL file>
WINDOWS PERF ANALYZER (WPA)
 WPA is also installed with the Windows Performance Toolkit
 To analyze the ETL file that was saved by Xperf, open the trace in Windows Performance Analyzer (WPA)
 Use the Computation view (right click and Add to Analysis view) to examine CPU usage by module and function (un-select the “Stack” option)
 Make sure symbol paths are correctly set up in WPA, and the option to “Load Symbols” is turned on
LOOKING AT THE SQL 2016 XPERF TRACE
 Almost 70% of the time is spent in spins for spinlock # 143
 Spinlock # 143 in SQL 2016 SP1 is LOCK_HASH (you can check this with the following query)
select * from sys.dm_xe_map_values
where map_key = 143 and name = 'spinlock_types'
 We still can’t tell exactly what the LOCK_HASH contention is due to – which is where we will look at call stacks!
DEMO: DIAGNOSING LOCK_HASH SPINLOCK CONTENTION
FOR THE RECORD: SQL 2016 CALL STACK
sqldk!XeSosPkg::spinlock_backoff::Publish
sqldk!SpinlockBase::Sleep
sqlmin!Spinlock<143,7,1>::SpinToAcquireWithExponentialBackoff
sqlmin!lck_lockInternal
sqlmin!MDL::LockGenericLocal
sqlmin!MDL::LockGenericIdsLocal
sqlmin!CMEDCacheEntryFactory::GetProxiedCacheEntryById
sqlmin!CMEDProxyDatabase::GetOwnerByOwnerId
sqllang!CSECAccessAuditBase::SetSecurable
sqllang!CSECManager::_AccessCheck
sqllang!CSECManager::AccessCheck
sqllang!FHasEntityPermissionsWithAuditState
sqllang!FHasEntityPermissions

CASE STUDY REVIEW
 Symptoms: the system was experiencing 100% CPU usage, and we wanted to know what was responsible for the high CPU usage.
 DMVs showed a huge amount of LOCK_HASH spinlock contention
 Using Xperf we proved that a significant % of CPU usage was attributed to the backoffs for this spinlock, hence it was worth investigating the spinlock further
 To know exactly what code paths resulted in the spinlock contention (LOCK_HASH is ubiquitous / overloaded) we collected call stacks using Extended Events
 We used SQLCallStackResolver to resolve those call stacks and found some likely clues:
sqlmin!CMEDProxyDatabase; sqlmin!MDL::LockGenericLocal; sqlmin!lck_lockInternal
The root cause: hash table contention for metadata (MD) locks
 MD locks were not partitioned prior to SQL 2017 and hence did not scale
 The SQL MD team then turned on MD lock partitioning by default in SQL 2017
 This helped push the workload to much higher throughput in SQL 2017
 Sidebar: LINQPad (http://www.linqpad.net) is a great tool for quick stress testing scripts!
THE ‘AFTER’ PICTURE
Key metrics:
Azure VM with 128 vCPU
~ 465 K Batch req/sec
Still 100% CPU usage ☺
LOCK_HASH is gone!
FOR THE RECORD: SQL 2017 XPERF TRACE
 We still see ~ 23% of CPU time spent in spinlock # 276
 Spinlock # 276 in SQL 2016 RTM (verify using the below query)
select * from sys.dm_xe_map_values
where map_key = 276 and name = 'spinlock_types'
 Good news is that it’s not LOCK_HASH, so the fix was effective
 Interestingly, the next potential bottleneck has come up (spinlock # 276 is SOS_CACHESTORE). This is a common situation in these kinds of systems, where fixing one bottleneck reveals the next one.
CASE STUDY: PAGELATCH_EX CONTENTION
 PAGELATCH_EX contention is common with customers. In many cases this is a product of poor index selection, but in rare cases it can actually be due to sub-optimal code on the SQL engine side
 One such case was observed when customers suddenly saw increased PAGELATCH_EX contention on system tables post upgrade to SQL Server 2016
 Collecting callstacks for PAGELATCH_EX wait, bucketizing them into a histogram target, and then resolving them using SQLCallStackResolver clearly showed the increased waits were due to additional code introduced in SQL Server 2016
 Without SQLCallStackResolver, the customer would have had to collect a dump – clearly something not acceptable given the already degraded state of performance on the box!
 With the ease of isolation of the problem using SQLCallStackResolver, the SQL engineering team was able to optimize that additional code in SQL 2016 SP1 CU2 / SQL 2016 RTM CU6 (KB article: https://support.microsoft.com/kb/4013999)
DEMO: PAGELATCH_EX CONTENTION
IN SUMMARY
 SQLCallStackResolver is a time-saving tool to help you diagnose issues accurately
 Download and install from http://aka.ms/SQLStack
 Matched PDB files are the key to a successful diagnosis
 For Linux, the PDBs are exactly the same as the Windows ones for the same build
 For XEL files you should also have the output of sys.dm_os_loaded_modules (name, base_address) from the time the XEL was collected
 The Windows Performance Toolkit tools (XPerf / WPA) are really useful for diagnosing high CPU usage. Correlation between SQL and system usage is really important!
 SQLCallStackResolver has been used successfully in real-world cases – you saw one example where we introduced MD lock partitioning in SQL 2017 directly based on a diagnosis made with the tool! There are many other examples in the GitHub repo.
THANK YOU!
Arvind Shyamsundar (@arvisam)
http://aka.ms/SQLStack
EXTRAS
DEBUGGING SQL MEMORY
OOM / 701 errors
 701 errors (“There is insufficient system memory”… ) are difficult to diagnose
 Collecting XE for error_reported events with the callstack action can be very useful to shed further light on what might be happening
 This, in conjunction with DBCC MEMORYSTATUS and the sys.dm_os_memory_* DMVs, can be a comprehensive approach
Strange memory ‘leaks’
 These can be very tricky to narrow down in real-world servers
 One approach we have used with real-world customer issues is to collect XE for sqlos.page_allocated and sqlos.page_freed events along with the callstack action for each, and bucketize them in a histogram target
 Resolving the stacks and then manually ‘diff-ing’ them pointed us in the right direction.
DEMO: INVESTIGATING ERROR 701
DEMO: TRACKING DOWN A MEMOBJ LEAK
