Tim Helton, one of Microsoft's top Support Escalation Engineers, has written some great tips that he uses to troubleshoot console performance issues in large environments. I took the liberty of posting his notes below:
========
These are some steps that have helped me troubleshoot console and general performance issues in OpsMgr SP1. If you are not on SP1, this should be your first course of action. The information below is primarily meant for large environments although some of these tips may still be useful for smaller installation (Less than 1000 agents).
SQL Performance: Most console performance issues stem from poor SQL performance. The below steps are basic SQL troubleshooting but if you're not familiar with working with QL server then your resident QL Specialist should be engaged.
The OperationsManager Database should be on its own dedicated server - No other databases should live on this SQL server. You should also try to maintain at most a 50GB OperationsManager DB for best performance and I'd recommend 30-40GB.
SQL Disk Configuration: Disk performance issues make up the bulk of SQL performance problems. With that in mind here are some of my recommendations:
Some Performance counters to look at:
If these numbers look bad should take a look at SQL memory before pointing fingers at your SAN vendor.
SQL Memory Configuration: Not enough memory, or improperly using SQL memory, could cause SQL to perform extra work that makes it look like the disk cant keep up. Make sure SQL has enough memory. SQL 2005 SP2 64 bit Enterprise Edition should be running on a 64 bit server for best performance. If we are not running Enterprise Edition, Online index operations are not available.
Some rough Estimates of memory needed on the SQL server:
Some performance counters to look at:
Memory/Available Mbytes - If this number is under 200 MB then SQL has been configurated to take all the memory in the system. This is often not the most efficient for SQL. Try to determine how much memory is needed by other components on the system (OS/Sql Agents/Antivirus/Backup/Misc) and subtract that from the total system memory giving SQL the rest.
Here is an Example using an 8-way 64-bit server with 32GB of ram:
So now our max server memory setting is: 32-2-2-3 = 25GB.
Here is the SQL command to limit the server to 25GB:
sp_configure 'max server memory', 25600 RECONFIGURE GO
SQLServer:Buffer Manager/Page Life expectancy - This counter lets us know how long SQL expects a page to live inside its memory. This number should be greater than 300. The Page Life Expectancy counter tells you, on average, how long data pages are staying in the buffer. If this value gets below 300 seconds then this is a potential indication that your SQL Server could use more memory in order to boost performance. If this number falls well below 300 seconds it can be an indication that SQL may have benefited from more memory during that time period, although just because this number falls below 300 does not mean SQL needs more memory. We could have incorrect or out of date indexes. See the next Section.
SQLServer:Buffer Manager/Lazy Writes/sec - If this number increases during the time when Avg Disk Sec/Transfer increases, it is a good indication SQL needs more memory. Lazy writer is a process that moves dirty pages from the buffer to disk in order to free up buffer space. Lazy writer/sec counter tracks how many times a second the Lazy writer process is moving dirty pages. Generally speaking this should not be a high value, say more than 20 per second or so. Ideally it should be close to zero. If it is zero, this indicates that your SQL Server's buffer cache is plenty big and SQL Server doesn't have to free up dirty pages, instead waiting for this to occur during regular checkpoints. If this value is high then a need for more memory is indicated
Temp DB: The Temp DB should be moved to a High Performance Disk, preferably a dedicated SAN LUN in large environments.
SQL Indexes: Out of date or incorrect indexes can cause additional data to be read from the disk. This can cause our Page Life Expectancy to fall and increase our disk I/O. Pssdiag can be useful in identifying which Tables contain the most date and which indexes are out of date.
Identifying tables that are fragmented:
DBCC SHOWCONTIG WITH FAST Here is some sample output: ---------------------------------------------------------------------------------------------- DBCC SHOWCONTIG scanning 'Alert' table... Table: 'Alert' (1771153355); index ID: 1, database ID: 5 TABLE level scan performed. - Pages Scanned................................: 936 - Extent Switches..............................: 427 - Scan Density [Best Count:Actual Count].......: 27.34% [117:428] - Logical Scan Fragmentation ..................: 60.90% ----------------------------------------------------------------------------------------------
In general, we would like the "Scan density" to be high (Above 80%) and the "Logical Scan Fragmentation" to be low (below 30%). What you might find is that *some* of the tables are more fragmented than others, because our built-in maintenance does not reindex all tables. This is especially true with tables like the raw perf, event, and localizedtext tables.
Reindexing: This can be ran as a SQL agent job or manually and can have a huge effect on peformance:
USE OperationsManager go SET ANSI_NULLS ON SET ANSI_PADDING ON SET ANSI_WARNINGS ON SET ARITHABORT ON SET CONCAT_NULL_YIELDS_NULL ON SET QUOTED_IDENTIFIER ON SET NUMERIC_ROUNDABORT OFF EXEC SP_MSForEachTable "Print 'Reindexing '+'?' DBCC DBREINDEX ('?')"
Maintenance built into OpsMgr: There is a rule in the System Center Internal Library called "Optimize Indexes". This rule runs every night at 2:30am on the RMS and calls p_OptimizeIndexes. Make sure any standard maintenance you perform on the OpsMgr DB does not interfere with this job.
See this blog post for more information: http://blogs.technet.com/kevinholman/archive/2008/04/12/what-sql-maintenance-should-i-perform-on-my-opsmgr-databases.aspx
Tuning OpsMgr: NOTE: The Event and Performance Data tables are updated in realtime and are not indexed. Opening these views can have a significant performance impact so when testing console performance all efforts should be taken to avoid these views.
Data Retention: Data Retention can be a significant source of data in the OpsMgr DB. Reducing the amount of data in the database will offer significant performance increases. In a large environment set the Data Retention to 1 or 2 days for every category. After grooming and reindexing takes place you should see a console performance increase. This is accessed from Administration -> Settings -> grooming.
Open and Closed Alerts: We should try to maintain less than 2,000 open and closed alerts in our database. If you see more than this then Management Packs may have to be tuned to disable noisy rules.
Tuning Management Packs: In a large OpsMgr environment, every management pack added needs to be examined, tuned and tested. One way to do this is to remove all of the management packs and add them back one by one. As you add them back, identify what the management packs do and disable rules/discoveries that are unneeded. Tune each management pack for a couple of days , identify noisy rules, excessive perf data etc. Also make sure the newest Management packs are used. Especially with the complicated MP such as Exchange, big performance issues have been addressed in newer versions. Research what needs to be done with each management pack that will be installed.
These next steps are best performed on a clean working system and repeated everytime a new management pack is being tested. See:
http://blogs.technet.com/kevinholman/archive/2007/10/18/useful-operations-manager-2007-sql-queries.aspx
Tuning Alerts: Run the two queries below. Identify what is causing alerts and either fix the issues or disable those rules:
Most Common Alerts in an Operational Database, by Alert Count
SELECT AlertStringName, AlertStringDescription, AlertParams, Name, SUM(1) AS AlertCount, SUM(RepeatCount+1) AS AlertCountWithRepeatCount FROM Alertview WITH (NOLOCK) GROUP BY AlertStringName, AlertStringDescription, AlertParams, Name ORDER BY AlertCount DESC
Most Common Alerts in an Operational Database, by Repeat Count
SELECT AlertStringName, AlertStringDescription, AlertParams, Name, SUM(1) AS AlertCount, SUM(RepeatCount+1) AS AlertCountWithRepeatCount FROM Alertview WITH (NOLOCK) GROUP BY AlertStringName, AlertStringDescription, AlertParams, Name ORDER BY AlertCountWithRepeatCount DESC
Tuning events: Run the following queries to determine which events are prevalent in the database. Find the event collection rules collecting these events and disable them. If an especially noisy event rule is absolutely needed, please be aware this could cause performance issues. It may also be beneficial to tune the application causing the events.
All Events by count by day, with the total for entire database (this tells us how many events per day we are inserting, and helps us look for too many events, event storms, and the result after tuning rules that generate too many events):
SELECT CASE WHEN(GROUPING(CONVERT(VARCHAR(20), TimeAdded, 101)) = 1) THEN 'All Days' ELSE CONVERT(VARCHAR(20), TimeAdded, 101) END AS DayAdded, COUNT(*) AS NumEventsPerDay FROM EventAllView GROUP BY CONVERT(VARCHAR(20), TimeAdded, 101) WITH ROLLUP ORDER BY DayAdded DESC
Most Common Events by event number: (This helps us know which event ID's are the most common in the database)
SELECT top 50 Number, COUNT(*) AS TotalEvents FROM EventView GROUP BY Number ORDER BY TotalEvents DESC
Most common events by event number and event publishername (This gives us the event publisher name to help see what is raising these events):
SELECT top 50 Number, Publishername, COUNT(*) AS TotalEvents FROM EventAllView GROUP BY Number, Publishername ORDER BY TotalEvents DESC
Computers generating the most events (This shows us which computers create the most event traffic and use the most database space):
SELECT top 50 LoggingComputer, COUNT(*) AS TotalEvents FROM EventallView GROUP BY LoggingComputer ORDER BY TotalEvents DESC
Performance Turning: Use the following queries to identify which performance rules are taking the bulk of the database. Tune these rules to collect performance data at a larger interval, or disable the rule all together.
Performance insertions per day:
SELECT CASE WHEN(GROUPING(CONVERT(VARCHAR(20), TimeSampled, 101)) = 1) THEN 'All Days' ELSE CONVERT(VARCHAR(20), TimeSampled, 101) END AS DaySampled, COUNT(*) AS NumPerfPerDay FROM PerformanceDataAllView GROUP BY CONVERT(VARCHAR(20), TimeSampled, 101) WITH ROLLUP ORDER BY DaySampled DESC
Most common performance insertions by perf counter name:
select pcv.countername, count (pcv.countername) as total from performancedataallview as pdv, performancecounterview as pcv where (pdv.performancesourceinternalid = pcv.performancesourceinternalid) group by pcv.countername order by count (pcv.countername) desc
Most common performance insertions by perf object and counter name (This is the most interesting - tells us specifically which perf insertions are the noisiest):
select pcv.objectname, pcv.countername, count (pcv.countername) as total from performancedataallview as pdv, performancecounterview as pcv where (pdv.performancesourceinternalid = pcv.performancesourceinternalid) group by pcv.objectname, pcv.countername order by count (pcv.countername) desc
Here are some examples of noisy performance rules that are enabled by default in OpsMgr:
Microsoft.Windows.Server.2000 Collection Rule for the Average Disk Queue Length Microsoft.Windows.Server.2000 Collection Rule for Average Disk Seconds Per Read Microsoft.Windows.Server.2000 Collection Rule for Average Disk Seconds Per Transfer Microsoft.Windows.Server.2000 Collection Rule for Average Disk Seconds Per Write Microsoft.Windows.Server.2000 Collection rule for Current Disk Queue Length Microsoft.Windows.Server.2000 Collection Rule for Disk Bytes Per Second Microsoft.Windows.Server.2000 Collection Rule for Disk Reads Per Second Microsoft.Windows.Server.2000 Collection Rule for Disk Writes Per Second Microsoft.Windows.Server.2000 Logical Disk Free Megabytes Microsoft.Windows.Server.2000 % Logical Disk Free Space Microsoft.Windows.Server.2000 Network Adapter Bytes Received per Second Microsoft.Windows.Server.2000 Network Adapter Bytes Sent per Second Microsoft.Windows.Server.2000 Network Adapter Bytes Total per Second Microsoft.Windows.Server.2000 Memory Available Megabytes Microsoft.Windows.Server.2000 Memory Page Reads per Second Microsoft.Windows.Server.2000 Memory Pages per Second Microsoft.Windows.Server.2000 Memory Page Writes per Second Microsoft.Windows.Server.2000 Memory % Committed Bytes in Use Microsoft.Windows.Server.2000 Page File Percentage Use Microsoft.Windows.Server.2000 System Context Switches per Second Microsoft.Windows.Server.2000 Processor % Processor Time Total Microsoft.Windows.Server.2003 Collection Rule for the Average Disk Queue Length Microsoft.Windows.Server.2003 Collection Rule for Average Disk Seconds Per Read Microsoft.Windows.Server.2003 Collection Rule for Average Disk Seconds Per Transfer Microsoft.Windows.Server.2003 Collection Rule for Average Disk Seconds Per Write Microsoft.Windows.Server.2003 Collection rule for Current Disk Queue Length Microsoft.Windows.Server.2003 Collection Rule for Disk Bytes Per Second Microsoft.Windows.Server.2003 Collection Rule for Disk Reads Per Second Microsoft.Windows.Server.2003 Collection Rule for Disk Writes Per Second Microsoft.Windows.Server.2003 Logical Disk Free Megabytes Microsoft.Windows.Server.2003 % Logical Disk Free Space Microsoft.Windows.Server.2003 Network Adapter Bytes Received per Second Microsoft.Windows.Server.2003 Network Adapter Bytes Sent per Second Microsoft.Windows.Server.2003 Network Adapter Bytes Total per Second Microsoft.Windows.Server.2003 Memory Available Megabytes Microsoft.Windows.Server.2003 Memory Page Reads per Second Microsoft.Windows.Server.2003 Memory Pages per Second Microsoft.Windows.Server.2003 Memory Page Writes per Second Microsoft.Windows.Server.2003 Memory % Committed Bytes in Use Microsoft.Windows.Server.2003 Page File Percentage Use Microsoft.Windows.Server.2003 System Context Switches per Second Microsoft.Windows.Server.2003 Processor % Processor Time Total Microsoft.Windows.Server.2003 Processor % Processor Time
All of these rules run on a 5 minute interval and can cause a significant amount of data to be inserted into the database. These rules, along with any rules you determine with the above queries, should be disabled or tuned to run at a larger interval (say, every 15-30 minutes).
Console Registry keys: Automatically Polling the OpsMgr DB every 15 seconds from multiple consoles can negatively impact performance. For best performance in large environments, turning off Polling or increasing it to at least level 5 can help:
HKCU\Software\Microsoft\Microsoft Operations Manager\3.0\console\CacheParameters\PollingInterval
0 – Turn off console polling and only refresh when manually pressing F5 1 – Console will poll every 15 seconds (default) 2 – Console will poll ever 30 seconds 10 – This is the maximum value – anything above 10 will be treated as 10
Also note that:
HKCU\Software\Microsoft\Microsoft Operations Manager\3.0\console\CacheParameters\EnableContextMenuTasks
defaults to 0 in SP1. If this is re-enabled it could cause performance issues.
Console Machine Performance: The server or workstation the console runs on should be in good shape. Disks should have ample free space (especially where the profile is located) and not be fragmented. We should have at least 1GB of free memory to run the console at top speed and antivirus should be configured to exclude Local console cache path:
Documents and Settings\<username>\local settings\Application Data\Microsoft\Microsoft.MOM.UI.Console
Limit the number of open consoles to the bare minimum required.
RMS Performance: On the Root Management Server, the most critical resource is RAM followed by CPU. Memory (see Memory/Available Mbytes in perfmon) should never go below 200 mb). In large environments the RMS should have fast disks, especially where the Health Service Store is located). Disks should not be fragmented.
No agents should report directly to the RMS. No Gateway servers should report directly to the RMS.
When configuring read caching vs. write caching on disk controllers for the Root Management Server, allocating at least 50% of the cache to write caching is recommended.
Antivirus software should be configured to excluded the Health Service Store folder (C:\Program Files\System Center Operations Manager\Health Service Store, by default. Located on the shared disk in a cluster).
The console should not be run on the RMS, and for best performance the console machine should be physically close to the RMS, preferably on the same subnet).
Agents: Depending on your hardware and the management packs you are running, you should not have more than 5,000 agents in a management group although this is not a hard limit. See the Operations Manager 2007 Performance and Scalability White Paper:
http://download.microsoft.com/download/d/3/6/d3633fa3-ce15-4071-be51-5e036a36f965/OM2007_PerfScal.doc
Networking: The SQL server hosting the Operations Manager DB and Root Management Server should be on the same subnet. Gigabit networking is preferred.
Hotfix 948496 should be applied to the SQL server and the RMS: KB948496: An update to turn off default SNP features is available for Windows Server 2003-based and Small Business Server 2003-based computers: http://support.microsoft.com/?kbid=948496.
Additional Links: OpsMgr 2007 Capacity Planner Tool: http://www.microsoft.com/downloads/details.aspx?FamilyID=6fec1f12-a62c-4e8d-8a19-56879192adc3&displaylang=en.
OpsMgr 2007 Hardware Guidance (Satya Vel's Blog): http://blogs.technet.com/momteam/archive/2008/04/10/opsmgr-2007-hardware-guidance-what-hardware-do-i-buy.aspx