SQL SERVER TROUBLESHOOTING WITH SQLCALLSTACKRESOLVER
Got a better name? Please let me know!
Arvind Shyamsundar
Principal Program Manager | Microsoft Azure CAT (a.k.a. SQLCAT)
Twitter: @arvisam
Blog: http://aka.ms/arvindsh
LET’S LOOK AT…
Target SQL scenarios
Debugging basics
A lap inside the SQLServer.exe process
Call stacks and PDB symbols
SQLCallStackResolver basics
Installation and UI
Matching PDB symbols
Recap / Wrap-up
WHY SQLCALLSTACKRESOLVER?
Efficient diagnosis of difficult problems
Spinlock contention
Troubleshooting waits – overloaded wait types / “strange” wait types
Asserts / AVs
Unexplained memory allocations inside SQL
sqldk.dll+0x0000000000047645 sqldk!XeSosPkg::wait_info::Publish
sqldk.dll+0x0000000000001960 sqldk!SOS_Scheduler::UpdateWaitTimeStats
sqldk.dll+0x00000000000012DF sqldk!SOS_Task::PostWait
sqlmin.dll+0x000000000000187C sqlmin!EventInternal<SuspendQueueSLock>::Wait
sqlmin.dll+0x00000000000361FE sqlmin!LatchBase::Suspend
Other sources for call stacks: TXT files generated whenever a dump occurs & SQL Errorlog
MAKING SENSE OF CALL STACKS
Tools to look deeper
Process Explorer (https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer)
Debugging Tools for Windows / WinDbg (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/)
Trace flag 3656; for more information, the whitepaper Diagnosing and Resolving Spinlock Contention on SQL Server is highly recommended reading.
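For example, a minimal sketch of enabling the trace flag globally while capturing call stacks (enable it only while actively troubleshooting, and turn it off afterwards):
DBCC TRACEON (3656, -1);
-- ... capture the relevant Extended Events / reproduce the issue ...
DBCC TRACEOFF (3656, -1);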
Click the “Enter base addresses” button in the SQLCallStackResolver tool, and OVERWRITE what’s pre-populated with the values you copied (a query to obtain these values is sketched below).
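As a sketch, the base address values can be obtained with a query such as the one below (the LIKE filter is just illustrative; the important columns are name and base_address):
select name, base_address
from sys.dm_os_loaded_modules
where name like N'%sql%';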
DEMO: WORKING WITH XEL FILES
CASE STUDY: SPINLOCK CONTENTION
A customer testing SQL Server 2017 with a highly concurrent workload was seeing 100% CPU consumption on a 144-CPU box
They were using memory-optimized, non-durable tables with a natively compiled procedure and an interop stored procedure on top
The requirement was to understand the reason for the high CPU usage, and to look at options for either lowering the CPU usage or increasing throughput (measured by SQL Server: Batch Requests / sec.)
Key metrics:
Azure VM with 128 vCPU
~ 250 K Batch req/sec
100% CPU usage
LOCK_HASH spinlock having a huge number of backoffs
SIDEBAR: XPERF
Get these tools from the Windows SDK:
https://developer.microsoft.com/en-US/windows/downloads/windows-10-sdk
Once downloaded on a desktop, you can copy the redistributable .MSI or the folder to a server (see Notes for details)
XPerf is a command line tool for capturing traces. Capturing a basic XPerf trace for investigating high CPU usage is simple:
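A typical invocation for a CPU sampling trace looks like the following (illustrative only – the exact command and output file name on the original slide may differ):
xperf -on base -stackwalk profile
(reproduce the high-CPU condition for 30-60 seconds)
xperf -d highcpu.etl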
The root cause: hash table contention for metadata (MD) locks
MD locks were not partitioned prior to SQL 2017 and hence did not scale
The SQL MD team then turned on MD lock partitioning by default in SQL 2017
This helped push the workload to much higher throughput in SQL 2017
Sidebar: LINQPad (http://www.linqpad.net) is a great tool for quick stress testing scripts!
THE ‘AFTER’ PICTURE
Key metrics:
LOCK_HASH is gone!
FOR THE RECORD: SQL 2017 XPERF TRACE
We still see ~ 23% of CPU time spent in spinlock # 276
Spinlock # 276 in SQL 2016 RTM (verify using the below query):
select * from sys.dm_xe_map_values
where map_key = 276 and name = 'spinlock_types'
Good news is that it’s not LOCK_HASH, so the fix was effective
Interestingly, the next potential bottleneck has come up (spinlock # 276 is SOS_CACHESTORE). This is a common situation in these kinds of systems, where fixing one bottleneck reveals the next one.
CASE STUDY: PAGELATCH_EX CONTENTION
PAGELATCH_EX contention is common with customers. In many cases this is a product of poor index selection, but in rare cases it can actually be due to sub-optimal code on the SQL engine side
One such case was observed when customers suddenly saw increased PAGELATCH_EX contention on system tables after upgrading to SQL Server 2016
Collecting call stacks for the PAGELATCH_EX wait type, bucketizing them into a histogram target, and then resolving them using SQLCallStackResolver clearly showed that the increased waits were due to additional code introduced in SQL Server 2016 (see the sketch after this slide)
Without SQLCallStackResolver, the customer would have had to collect a dump – clearly something not acceptable given the already degraded state of performance on the box!
With the ease of isolating the problem using SQLCallStackResolver, the SQL engineering team was able to optimize that additional code in SQL 2016 SP1 CU2 / SQL 2016 RTM CU6 (KB article: https://support.microsoft.com/kb/4013999)
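A minimal sketch of such an Extended Events session (the session name is illustrative, and the numeric map key for PAGELATCH_EX must be looked up for the specific build, since it varies between builds):
-- Look up the map key for PAGELATCH_EX on this build first:
-- select map_key from sys.dm_xe_map_values
-- where name = 'wait_types' and map_value = 'PAGELATCH_EX';
CREATE EVENT SESSION PageLatchEXStacks ON SERVER
ADD EVENT sqlos.wait_info (
    ACTION (package0.callstack)
    WHERE wait_type = 66   -- replace 66 with the map_key returned above
)
ADD TARGET package0.histogram (
    SET source = N'package0.callstack', source_type = 1   -- 1 = bucket on an action
)
WITH (MAX_MEMORY = 4096 KB, EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS);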
DEMO: PAGELATCH_EX CONTENTION
IN SUMMARY
SQLCallStackResolver is a time-saving tool to help you diagnose issues accurately
Download and install from http://aka.ms/SQLStack
Matched PDB files are the key to a successful diagnosis
For Linux, the PDBs are exactly the same as the Windows ones for the same build
For XEL files you should also have the output of sys.dm_os_loaded_modules (name, base_address) from the time the XEL was collected
The Windows Performance Toolkit (XPerf / WPA) is really useful for diagnosing high CPU usage. Correlation between SQL and system usage is really important!
SQLCallStackResolver has been used successfully in real-world cases – you saw one example where we introduced MD lock partitioning in SQL 2017 directly based on diagnosis done with the tool! There are many other examples in the GitHub repo.
THANK YOU!
Arvind Shyamsundar (@arvisam)
http://aka.ms/SQLStack
EXTRAS
DEBUGGING SQL MEMORY
OOM / 701 errors
701 errors (“There is insufficient system memory”…) are difficult to diagnose
Collecting XE for error_reported events with the callstack action can be very useful to shed further light on what might be happening (see the sketch below)
This, in conjunction with DBCC MEMORYSTATUS and the sys.dm_os_memory_* DMVs, can be a comprehensive approach
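A minimal sketch of such a session (the session name and target file name are illustrative):
CREATE EVENT SESSION OOMStacks ON SERVER
ADD EVENT sqlserver.error_reported (
    ACTION (package0.callstack)
    WHERE error_number = 701   -- “There is insufficient system memory…”
)
ADD TARGET package0.event_file (SET filename = N'OOMStacks.xel')
WITH (EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS);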