I attended the SQL intersection conference in November 2014 in Las Vegas, Nevada. I am including my notes from the sessions in this blog, primarily for my own reference later. For anyone that comes across these notes, please take them with a grain of salt – I may very easily have misheard something or have written it down incorrectly.
Session Title: Top Customer Problems: How to Avoid Calling Bob
Instructor: Bob Ward
- Biggest issues (in order)
- Performance
- Is it a “running” problem (high CPU) vs a “waiting” problem (IO bottleneck, blocking, latch waits, other waits)?
- Blocking – usually a problem within the app
- IO bottleneck (PAGEIOLATCH or WRITELOG)
- Latch waits, hotspots (PAGELATCH_XX)
- RESOURCE_SEMAPHORE waits
- ASYNC_NETWORK_IO can be a network problem, but usually a problem with client result processing
- Other waits, check DMV
- If the problem is current: Performance Dashboard Reports and DMVs
- Performance Dashboard Reports
- DMVs only
- Create SP in master
- 2005 version (use for 2005, 2008, 2008R2)
- 2012 version (user for 2012, 2014)
- Check for other process using lots of resources
- High kernel time (i.e., anti-virus)
- Within SQL Server
- Background/system tasks
- High recompiles
- Query
- Usually about cardinality estimation
- Many root causes, see table
- Tables variables (estimate = 1)
- TF 2453 (2012 SP2+)
- Data type mismatch (cannot make estimate, change data types)
- Poor/missing indexes
- Query design (lots of scans)
- Lots of parallelism (check DOP)
- Usually about cardinality estimation
- Post-mortem problems: PSSDiag, SQLNexus
- Performance Dashboard Reports
- Backup/Restore
- Backup
- OS error 64 – Network problems (backup started, but connection dropped)
- OS error 1450 – OS memory problems (backing up to a remote drive)
- Transaction log growth (full recovery with no t-log backups)
- VDI backup failure (3rd party app problems)
- Restore
- Disk bottleneck
- Long-running recovery
- VLF fragmentation
- CHECKSUM failures
- Use WITH CONTINUE_AFTER_ERROR to override
- Consider restoring to a different DB vs replace (if error while restoring, still have fallback)
- Consistency Issues
- Message 824 error / DBCC CHECKDB
- Checksum errors
- Suspect DB
- No “magic tricks” to fix
- Almost always IO path error (try SQLIOSIM)
- Can also be memory corruption
- Constant page sniffing – thread that randomly check memory pages
- Could be a SQL bug (replay the log)
- Session id of -3 = recovery (deferred)
- Page-level recovery
- Stackdumps/Exceptions
- Verify that the faulting module is SQL
- Obscure errors
- Message 1204 – Deadlock
- Message 833 – IO delay warning (15 sec IO delay)
- Message 17883 – Non-yielding scheduler (almost always a SQL bug)
- Message 17884, 17888 – usually a long blocking chain
- Message 844, 845, 846, 847 – latch time out
- Use error_reported XEvent to track down a specific error
- Startup problems
- Permissions problem (may have changed service account followed by a hard restart)
- Config (i.e., max server memory)
- Damaged files
- Keep file copies of master (shut down SQL, copy file off) and binaries
- Backup tail of model causes restart problems
- Memory issues
- Out of memory
- Memory grown (OK if below max server memory)
- Hidden gems
- Sytem health session persisted (2014)
- NUMA guidance
- System Center Advisor
- Backup
- Is it a “running” problem (high CPU) vs a “waiting” problem (IO bottleneck, blocking, latch waits, other waits)?
- Performance
tempdb bulk operation optimized (fix for 2012, built in to 2014)