Case Study: Resolving High CPU Usage on Oracle Servers
Author: Hector Pujol, Consulting Technical Advisor, Center of Expertise, Oracle USA
Skill Level Rating for this Case Study: Intermediate
About Oracle Case Studies
Oracle Case Studies are intended as learning tools and for sharing information or
knowledge related to a complex event, process, procedure, or to a series of related events.
Each case study is written based upon the experience that the writer(s) encountered.
Each Case Study contains a skill level rating. The rating provides an indication of what
skill level the reader should have as it relates to the information in the case study. Ratings
are:
• Expert: significant experience with the subject matter
• Intermediate: some experience with the subject matter
• Beginner: little experience with the subject matter
Case Study Abstract
This case study demonstrates a top-down tuning methodology that was applied to a
performance problem involving high CPU usage on a database server. The diagnostics that
were collected and analyzed for this problem include operating system (OS) data such as
vmstat and top, statspack reports, and extended SQL trace (event 10046) data with TKProf output. The conclusion from
analyzing the collected data was that a specific query was consuming most of the CPU and
needed to be tuned. We present various ways of tuning the query as well as verification that
the recommended solution solved the problem.
Case History
A large university reported severe performance problems with their email application and
other applications that relied upon an Oracle RDBMS server for authentication services.
The Oracle database was used solely for user lookups, authentication, and other assorted
LDAP queries. It was hosted on a 2-CPU Linux server that also hosted Oracle’s
Application Server (iAS) and related daemons.
The email application was installed approximately 6 months prior and ran without incident
until some application security patches were installed. The day after the patches were
installed, the performance problems started occurring.
The DBAs noticed that this authentication server was often seeing load averages over 20
processes (as shown in uptime or top). Normally, prior to the patch, the load averages
were seldom above 1 or 2 processes. The performance seemed to be much worse when
“mass mailing” occurred. These mass-mailing jobs sent out email to thousands of recipients
and relied upon the authentication/lookup services provided by the database.
The main symptoms reported by users when there was a performance problem were that
they were unable to log in to their email accounts and they experienced timeouts in the
application (visible as browser timeouts or client error messages, depending on the mail
client used).
The database server version was 9.0.1.5; it had been upgraded from 9.0.1.4 during the same
outage period when the application security patches were installed.
Analysis
Summary
Our strategy in diagnosing this problem was to take a “top-down” approach and follow the
trail of highest CPU consumption down to the offending processes, and eventually the
offending sessions and query. The top-down approach can be summarized as follows
1
:
1. Verify CPU Consumption on the server
• Verify the observation of high CPU consumption on the affected server and collect
top and vmstat data for a period of time (from 30 minutes up to a few hours)
• Find which processes are using the most CPU
• If the top CPU consumers are Oracle processes, then find out which database
sessions correspond to these processes[2]
2. Collect Extended SQL Trace Data (Event 10046)
• Gather SID, SERIAL#, and other information from V$SESSION about these
sessions
• Trace the sessions using an extended SQL trace via event 10046, level 12
• Produce a TKProf report and look for the queries having the longest elapsed time
[1] The steps have been categorized according to the Oracle Diagnostic Method (ODM); see MetaLink Doc ID 312789.1.
[2] If the processes were not Oracle RDBMS processes, then we would have to stop and find out what kind of processes they are and how to diagnose them.
3. Build a Test Case Script for the Problem Query
• Extract the high-impact queries with their bind values and create a test script that
will allow us to benchmark changes to the query and verify our progress in tuning it.
• Run the test script on the production system (with the 10046 event set) to validate
that it reflects what was seen in step two.
4. Obtain Statspack Snapshots
Collect statspack snapshots during the same time period as the OS data was collected
and look for correlations with the session-level data collected in step two. This
could easily be done before steps 1 through 3 as well.
5. Tune the Query
• Tune the query and verify performance improvements using the test script
• Implement the changes in production and verify that the solution worked.
• Verify the performance has improved and meets the customer’s requirements
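To illustrate step 1, the mapping from top CPU-consuming OS processes to database sessions uses the standard join between V$PROCESS.SPID and V$SESSION.PADDR. Below is a small Python helper (hypothetical, not from the case study) that builds that lookup query for a list of PIDs taken from top:

```python
def session_lookup_sql(pids):
    """Build the usual Oracle query mapping OS PIDs (from top) to
    database sessions via the V$PROCESS.SPID -> V$SESSION.PADDR join."""
    spids = ", ".join(f"'{p}'" for p in pids)
    return (
        "SELECT s.sid, s.serial#, s.username, s.sql_hash_value\n"
        "FROM v$session s, v$process p\n"
        "WHERE p.addr = s.paddr\n"
        f"  AND p.spid IN ({spids})"
    )

# Example: PIDs 12345 and 12346 observed at the top of the top output.
print(session_lookup_sql([12345, 12346]))
```

The SID and SERIAL# returned by this query are exactly what step 2 needs in order to attach the extended SQL trace to the right sessions.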
In steps 1 and 2 above, one can see that the overall effort focuses attention down
to a particular session, with the ultimate goal of collecting the extended SQL trace
information. We are interested in the extended SQL trace because it reports
CPU times, wait times, elapsed times, and bind values[3]. By having the CPU, wait, and
elapsed times, we can account for how the time was spent during each query's execution.
The bind values allow us to put together a reproducible test case using actual values;
they will be extremely useful later when we tune the query.
[3] See Cary Millsap's book Optimizing Oracle Performance for an in-depth look at using event 10046.
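The time-accounting idea can be sketched in a few lines of Python. This is an illustration only; the row values and field names are invented, but the arithmetic mirrors how a TKProf profile is read: elapsed time should be roughly CPU time plus instrumented wait time, and a large remainder often means time spent queued for a CPU:

```python
# Hypothetical TKProf-style summaries: elapsed/cpu/wait in seconds.
# The names and numbers are illustrative, not from the case study.

def rank_by_elapsed(rows):
    """Sort trace summaries by elapsed time, highest first, as one does
    when scanning a TKProf report for the worst offenders."""
    return sorted(rows, key=lambda r: r["elapsed"], reverse=True)

def unaccounted(row):
    """Time not explained by CPU plus instrumented waits; a large
    remainder usually points at CPU-run-queue delay."""
    return row["elapsed"] - (row["cpu"] + row["wait"])

rows = [
    {"sql": "lookup", "elapsed": 12.4, "cpu": 11.9, "wait": 0.2},
    {"sql": "insert", "elapsed": 0.8, "cpu": 0.1, "wait": 0.6},
]
top = rank_by_elapsed(rows)[0]
print(top["sql"], round(unaccounted(top), 2))  # → lookup 0.3
```

In the invented sample, the "lookup" query dominates elapsed time and nearly all of it is CPU, which is the same shape the real trace data in this case study exhibits.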
Detailed Analysis
1. Verify CPU Consumption on the server
Before we can start looking at why this problem is occurring we must first understand the
problem and verify what the customer is reporting (its symptoms). This can be done by
collecting OS statistics during a recent spike in activity due to a mass mailing. The data
was collected using Oracle Support's OSWatcher tool[4].
The vmstat output during this time looked like this:
(Column key: r = number of processes in the run queue; us = CPU user mode; sy = CPU system mode; id = CPU idle.)
zzz ***Thu Mar 9 18:30:00 PST 2006
cpu procs memory swap io system
r b w swpd free buff cache si so bi bo in cs us sy id
3 0 0 920 8596 342908 4088164 0 0 10 514 501 1790 43 7 50
5 0 0 920 8844 343024 4089564 0 0 8 79 319 1751 29 4 67
21 0 1 920 5332 343160 4093964 0 0 18 320 390 1690 97 3 0
22 0 0 920 22864 343304 4057500 0 0 11 149 370 2563 97 3 0
25 0 3 920 9728 343404 4059732 0 0 77 412 400 2412 95 5 0
23 0 0 920 8416 343688 4044128 0 0 2 2 254 1308 97 3 0
21 0 1 920 6760 343520 4045256 0 0 56 2428 704 2382 89 10 1
24 0 0 920 14856 343812 4045968 0 0 54 66 278 2785 95 5 0
21 0 2 920 17600 343956 4048296 0 0 17 70 249 1677 96 4 0
20 0 0 920 6028 344080 4050368 0 0 8 52 265 1461 94 6 0
22 0 1 920 27172 344184 4053608 0 0 25 49 263 2513 96 4 0
14 0 2 920 58264 344364 4057036 0 0 73 125 614 3821 79 20 1
1 1 0 920 65756 344476 4070080 0 0 358 1964 410 2118 12 4 84
0 0 0 920 48628 344580 4080864 0 0 17 106 229 1190 9 2 88
Each line represents a sample collected at 1-minute intervals.
Notice how the run queue[5] rapidly increases from 3 to 25 as the CPU utilization
increases from about 50 percent to 100 percent. This shows the importance of knowing the
length of the run queue as well as the CPU utilization. When the CPU is pegged at 100%
utilization, the severity of the CPU starvation won't be reflected in the percentage of CPU
utilization (it's pegged at 100%), but the run queue will clearly show the impact. Similarly,
knowing only the run queue will not tell you the exact CPU usage, nor how the busy time
is split between system and user modes.
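The relationship between run queue and idle CPU can be checked mechanically. The following is a minimal Python sketch (our own helper, not part of OSWatcher; the saturation thresholds are assumptions to be tuned per host) that parses vmstat data lines like those above and flags samples where the server appears CPU-starved:

```python
def parse_vmstat(lines):
    """Extract run queue (r) and CPU us/sy/id from Linux vmstat data lines.

    Skips headers and timestamps; assumes the layout shown above, where
    r is the first field and us/sy/id are the last three."""
    samples = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or not all(f.isdigit() for f in fields):
            continue  # header, blank, or OSWatcher timestamp line
        samples.append({
            "r": int(fields[0]),
            "us": int(fields[-3]),
            "sy": int(fields[-2]),
            "id": int(fields[-1]),
        })
    return samples

def cpu_starved(sample, ncpus=2, idle_pct=5):
    """Heuristic: almost no idle CPU *and* more runnable processes than CPUs."""
    return sample["id"] <= idle_pct and sample["r"] > ncpus

data = """\
zzz ***Thu Mar 9 18:30:00 PST 2006
r b w swpd free buff cache si so bi bo in cs us sy id
3 0 0 920 8596 342908 4088164 0 0 10 514 501 1790 43 7 50
21 0 1 920 5332 343160 4093964 0 0 18 320 390 1690 97 3 0
""".splitlines()

samples = parse_vmstat(data)
print([cpu_starved(s) for s in samples])  # → [False, True]
```

On the two embedded sample lines, only the second (run queue 21 at 0% idle) is flagged, which matches the visual reading of the vmstat output above.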
CPU utilization is classified into three types: system (sy), user (us), and idle (id). System
mode CPU utilization occurs whenever a process requires resources from the system; e.g.,
I/O or memory allocation calls. In typical OLTP systems, the percentage of system mode
CPU utilization is often less than 10 percent; in data warehouses, more I/O calls are
performed and a higher percentage of system mode CPU utilization is common. User mode
CPU utilization accounts for the rest of the time the CPU is busy and not running in system
mode. Idle CPU is essentially the time that the CPU is not busy and is waiting for work to do.

[4] OSWatcher may be downloaded from MetaLink by viewing Doc ID 301137.1.
[5] The run queue is a queue of processes that are ready to run but must wait for their turn on a CPU; a run queue of 20 means that 20 processes are currently waiting to execute.
In this case, we can quickly see that the percentage of CPU utilization in system mode is
usually less than 10 percent. This indicates that we are probably not suffering from memory
shortages or doing excessive I/Os[6]. We can further dismiss memory shortages by looking at
the paging and swapping statistics in vmstat (only swap-ins (si) and swap-outs (so) are
shown here). If no pages are being swapped out during peak time, then there can't possibly
be a memory shortage.
If we were seeing system-mode CPU utilization higher than 15 percent and we weren’t
seeing any memory problems, we might begin to suspect excessive I/O calls, very frequent
database connections/disconnections, or some anomaly (we would keep this in mind as we
collect more data for individual processes; e.g., using the truss command or while looking
at RDBMS metrics like logons cumulative per second).
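This triage logic can be encoded as a small helper. The thresholds below are our own rough encoding of the percentages discussed in the text, not Oracle guidance, and should be adjusted per workload:

```python
def classify_cpu(us, sy, idle, sys_threshold=15):
    """Rough triage of one CPU sample (percentages from vmstat).

    Thresholds are illustrative: >15% system mode suggests I/O calls,
    paging, or connect storms; otherwise a busy box is user-mode bound
    and the application (here, Oracle sessions) is the place to look."""
    if idle > 20:
        return "not CPU-bound"
    if sy > sys_threshold:
        return "system-mode heavy: suspect I/O calls, paging, or connect storms"
    return "user-mode heavy: look inside the application (Oracle sessions)"

# The saturated samples in the vmstat output above (e.g., us=97, sy=3, id=0)
# classify as user-mode heavy, matching the analysis in the text.
print(classify_cpu(97, 3, 0))
```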
The fact that most CPU utilization is occurring in user mode tells us that whatever is
happening is happening within the application code the process is running, not as part of a
system call or kernel action. The question now is: which processes are using the CPU?
To answer this question, we can look at the output of the top command. This command is
useful because it shows at a glance the overall CPU statistics and the processes consuming
the most CPU (ranked in descending order of CPU usage). The following shows a typical
output of the top command during the performance problem:
[6] This is just a casual observation; we would need to look at the process in detail (at least at the statspack level for the database) to conclusively decide that excessive I/O is not being performed.