Manage OEM12c
Lap Nguyen,
Chevron
Andrei Dumitru,
CERN
Copyright 2014, Oracle and/or its affiliates. All rights reserved. |
Agile, Automated
Optimized, Efficient
Superior Enterprise-Grade
Management
Scalable, Secure
Program Agenda
1
Architecture Overview
Overall Architecture and Components
Critical Subsystems
1
Loader Subsystem
Job Subsystem
Console Subsystem
Agent Subsystem
Notification Subsystem
Loader Subsystem
Backlog
OMS
Agents
Synchronous
Upload of data
Repository
Contact Oracle Support if the loader is consistently running at more than 85% of its utilization capacity
Deviation tolerance: 10-20%
General Advice
It is normal for some agents to be backed off
Keep an eye on a consistently growing number of backed-off agents
Job Subsystem
Anything
User
OMS
WORKER THREADS
JOB DISPATCHER
STEP SCHEDULER
Repository
Agents
Problem trend analysis of the Job Step Backlog and Overall Job Steps per Second metrics
If Job Step Backlog and Overall Jobs per Second both show an increasing trend, the workload is high and the job engine resources are not able to keep up with the inflow. Increase the resources.
If Job Step Backlog is increasing but Overall Jobs per Second is not, it indicates abnormal processing of a specific job.
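The two trend rules above can be sketched as a small classifier. This is only an illustration: the half-versus-half trend test, the 5% tolerance, and the function names are assumptions, not part of Enterprise Manager, and the metric samples would have to be collected separately.

```python
from statistics import mean


def trend(samples, tolerance=0.05):
    """Classify a series as 'increasing', 'decreasing' or 'flat'.

    Assumption: compares the mean of the second half of the samples
    with the mean of the first half, using a relative tolerance.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    half = len(samples) // 2
    first, second = mean(samples[:half]), mean(samples[half:])
    if second > first * (1 + tolerance):
        return "increasing"
    if second < first * (1 - tolerance):
        return "decreasing"
    return "flat"


def diagnose_job_system(backlog, rate):
    """Apply the two rules to Job Step Backlog and throughput samples."""
    backlog_trend = trend(backlog)
    rate_trend = trend(rate)
    if backlog_trend == "increasing" and rate_trend == "increasing":
        return "high workload: increase job engine resources"
    if backlog_trend == "increasing":
        return "abnormal processing of a specific job"
    return "no action needed"


if __name__ == "__main__":
    # Hypothetical samples: backlog grows while throughput stays flat.
    print(diagnose_job_system([10, 12, 30, 40], [5, 5, 5, 5]))
```

A rising backlog with flat throughput points at a specific job rather than at capacity, which is why the second rule does not recommend adding resources.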
Monitor thread pool utilization if the inflow of work is high and the backlog is consistently high
If the Avg. Steps Dispatched/Min is HIGH and Avg. Threads Available is less than 50% of Configured Threads for a specific pool, increase the thread pool size on each OMS. Refer to the Appendix for sizing recommendations of thread pool size.
If the Avg. Steps Dispatched/Min is LOW and Avg. Threads Available is also LOW, this typically means that threads are stuck or hung. Contact Support for triaging stuck threads.
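As a rough illustration of the thread pool rules above, the check can be written as a single function. The slides do not define numeric cut-offs for HIGH and LOW dispatch rates, so `high_steps_per_min` is a caller-supplied assumption; reusing the 50% figure as the "LOW threads available" cut-off is also an assumption.

```python
def assess_thread_pool(steps_per_min, threads_available, configured_threads,
                       high_steps_per_min, low_fraction=0.5):
    """Return a recommendation for one job thread pool.

    high_steps_per_min: assumed threshold separating HIGH from LOW dispatch.
    low_fraction: share of configured threads below which availability is
                  treated as low (50% in the guidance for the HIGH case).
    """
    few_free = threads_available < low_fraction * configured_threads
    if not few_free:
        return "ok"
    if steps_per_min >= high_steps_per_min:
        # Busy pool with little headroom: more threads may help.
        return "increase pool size"
    # Little work dispatched yet few threads free: threads may be stuck.
    return "possible stuck threads"


if __name__ == "__main__":
    # Hypothetical values for a pool configured with 50 threads.
    print(assess_thread_pool(600, 10, 50, high_steps_per_min=300))
    print(assess_thread_pool(5, 3, 50, high_steps_per_min=300))
```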
OMS
Oracle_emd_proxy status
Host status
Partner
Agent
Agent
oracle_emd_proxy
Host1
Agent Push
Monitored
Agent
Host2
SCENARIO
TARGETS
STATUS OF TARGETS
Agent is shutdown
gracefully and not
under blackout
AGENT
HOST OMS
Down
MONITORED
TARGET
Agent Down
AGENT
Agent Unreachable
HOST
Up (Unmonitored)
MONITORED
TARGET
Agent Unreachable
If Partner Agent is
AGENT
not available(Host or
HOST
Agent is down)
MONITORED
TARGET
Agent Unreachable
Up (Unmonitored)
Agent Unreachable
Agent Unreachable
Troubleshooting Tips
Down
Agent Down
Up (Unmonitored): Currently this substatus is set only for host targets, through real-time partner agent detection. The host is up but its agent is shut down.
Cannot Write to File System: Check that the OS user who owns the agent process has write access to the agent instance directory.
Collections Disabled: Check that the agent can upload to the OMS with emctl upload. Check the loader statistics report for loader health.
Disk Full: Check that the agent can upload to the OMS with emctl upload. Recheck the count of pending files using the command emctl status agent to verify that it has gone down.
Post Blackout: The agent is unreachable because its first severity has not yet arrived after the blackout ended.
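For runbook automation, the substatus tips above could be kept in a simple lookup table. The sketch below is hypothetical: the dictionary and the `tip_for` helper are illustrative names, and the tip texts are paraphrased from the slide.

```python
# Substatus -> first troubleshooting step, paraphrased from the tips above.
TIPS = {
    "up (unmonitored)": "Host is up but its agent is shut down; start the agent.",
    "cannot write to file system": "Check that the agent's OS user has write access "
                                   "to the agent instance directory.",
    "collections disabled": "Check agent upload with emctl upload and review "
                            "the loader statistics report.",
    "disk full": "Check agent upload with emctl upload, then recheck pending "
                 "files with emctl status agent.",
    "post blackout": "Wait for the first severity after the blackout ended.",
}

DEFAULT_TIP = "Follow the general troubleshooting steps for Agent Unreachable."


def tip_for(substatus):
    """Return the tip for a substatus, ignoring case and outer whitespace."""
    return TIPS.get(substatus.strip().lower(), DEFAULT_TIP)


if __name__ == "__main__":
    print(tip_for("Disk Full"))
```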
Troubleshooting Tips
Blocked Manually
Blocked (Plug-in
Mismatch)
Blocked (Bounce
Counter Mismatch)
Agent Misconfigured
Communication
Broken
Under Migration
Note: Refer to Appendix for General Troubleshooting steps for Agent Unreachable
A backlog in notifications can cause a delay in alerts being sent, or a missed alert altogether
If notifications are not getting delivered
Check your external systems that are configured to receive notifications
For email/pager - Is the email gateway configured and working?
For OS Command and PL/SQL, check the external systems that they may connect to
Contact Oracle Support if external systems are not working as expected.
Find the specific events in the Incident Manager console for non-informational events
If an event is not found, it is likely an event publishing issue.
If found in Incident Manager, verify the rule definition
If Pending Notifications Count remains high over a period of time (such as an hour), check Notifications Processed (Last Hour)
If it is making good progress, this is likely temporary load that will resolve itself soon
If it is not making good progress, there could be stuck queues in the notification system or out-of-date incident rules. Contact Oracle Support
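The notification backlog check above can be sketched as follows. The slides give no numeric values for "high" or "good progress", so both thresholds are caller-supplied assumptions, and the function name is illustrative only.

```python
def assess_notification_backlog(pending_counts, processed_last_hour,
                                high_pending, min_processed):
    """Interpret Pending Notifications Count samples taken over about an hour.

    high_pending:  assumed level above which the backlog counts as high.
    min_processed: assumed Notifications Processed (Last Hour) value that
                   counts as good progress.
    """
    # The backlog only matters if it stayed high for the whole period.
    if not pending_counts or min(pending_counts) < high_pending:
        return "ok"
    if processed_last_hour >= min_processed:
        return "temporary load, should resolve itself"
    return "possible stuck queues or out-of-date incident rules: contact Oracle Support"


if __name__ == "__main__":
    # Hypothetical samples: backlog stays high and little is processed.
    print(assess_notification_backlog([900, 950, 1000], 20,
                                      high_pending=500, min_processed=300))
```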
Events Subsystem
Repository Health
High Availability
Critical components in Enterprise Manager
infrastructure are:
Repository - Persistent store for all Enterprise Manager
data
OMS - Central application accessed by Agents and end users
Software Library - Filesystem repository used to store
software entities
High Availability
Repository
OMS
Repository
High Availability
OMS
Disaster Recovery
Protects applications from catastrophic failures
Primary
Site
Standby
Site
Disaster Recovery
Repository
Primary Site
(active)
Standby Site
(passive)
Data Guard
Physical Standby
Disaster Recovery
OMS
Deprecated
Primary Site
DNS/
GTM
Standby Site
Console
/EMCLI
PASSIVE
ACTIVE
DNS
Lookup
Primary
OMS
Additional
OMS1
Storage
OMS Share
OMS1 Share
EM
Repository
Physical Standby
Swlib Share
SWITCHOVER
Storage
Primary
OMS
Additional
OMS1
OMS Share
OMS1 Share
Swlib Share
EM
Repository
Physical Standby
ACFS Replication
This does not provide failover if one of the BIP instances fails or is otherwise stopped (to be fixed in a future release).
Recommendations:
Load Balancer
OMS1
BIP1
BIP2
OMS2
WebLogic
WebLogic
Server 1
Server 2
Appendix
Architecture Overview
Oracle Management Service (OMS)
OMS
Console
PBS
Architecture Overview
Repository
Repository
Architecture Overview
Agents
Collect monitoring and configuration data from the targets and store it locally in XML files
Collected data is uploaded at scheduled intervals to the Management Service using HTTP/HTTPS
XML files are purged once the data has been uploaded
Agents
Sizing recommendations of pool size for Large configuration with 2 or 4 OMS nodes
Default pool size configuration for Small and Medium configurations
In case of major resource issues, contact Oracle Support for guidance on adding additional threads
Parameter                                       Value
oracle.sysman.core.jobs.shortPoolSize           50
oracle.sysman.core.jobs.longPoolSize            24
oracle.sysman.core.jobs.longSystemPoolSize      20
oracle.sysman.core.jobs.systemPoolSize          50
oracle.sysman.core.conn.maxConnForJobWorkers    144
Responsible for processing the events published by different components in the system
Key metrics to check event backlog: Total Events Pending and Total Events Processed (Last Hour)
If Total Events Pending remains high (for over an hour), check the Total Events Processed (Last Hour) metric
If it is making good progress (count is high), this is likely temporary load and can be ignored
While the pending count remains high, processing should sustain a minimum of 1000 events every 10 minutes
If it is not making good progress, there could be stuck queues in the event system
Check the queue statistics in the Event Status metric group to detect problems in AQ
Contact Support for triaging issues in AQ queues
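The minimum processing rate above can be turned into a quick arithmetic check against Total Events Processed (Last Hour). Scaling the 10-minute figure to an hourly floor (1000 x 6 = 6000) is an interpretation, and the function names are illustrative assumptions.

```python
MIN_EVENTS_PER_10_MIN = 1000                # figure quoted in the guidance
INTERVALS_PER_HOUR = 60 // 10               # six 10-minute intervals
MIN_EVENTS_PER_HOUR = MIN_EVENTS_PER_10_MIN * INTERVALS_PER_HOUR  # 6000


def event_progress_ok(processed_last_hour):
    """True if hourly processing meets the assumed 6000 events/hour floor."""
    return processed_last_hour >= MIN_EVENTS_PER_HOUR


def assess_event_backlog(pending_high_for_an_hour, processed_last_hour):
    """Combine the pending-backlog observation with the processing rate."""
    if not pending_high_for_an_hour:
        return "ok"
    if event_progress_ok(processed_last_hour):
        return "temporary load, ignore"
    return "check queue statistics in the Event Status metric group"


if __name__ == "__main__":
    # Hypothetical reading: backlog high for an hour, 2500 events processed.
    print(assess_event_backlog(True, 2500))
```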
FLASHBACK Mode ON
Refer to Oracle Database High Availability
Guidelines
Aggregation mechanism: Both hourly and daily rollups are done from the raw data
directly
Look out for consistently growing backlogs or prolonged execution time span
Configure additional rollup worker threads using the Configure option in the Metric Rollup Performance chart
If the repository database is on RAC, create a database service and set affinity to it so that the rollup job runs on only one RAC node. This avoids RAC contention negating the gain from the additional threads.
Create the database service rollup and set one of the RAC instances as the primary instance with -r:
srvctl add service -d <dbname> -s rollup -r <primary instance> -a <the other instances> -y automatic
srvctl start service -d <dbname> -s rollup
srvctl status service -d <dbname>
As the sys user, execute:
DBMS_SCHEDULER.create_job_class(job_class_name => 'ROLLUP', service => 'rollup')
GRANT EXECUTE ON sys.ROLLUP TO sysman;
As the sysman user, execute:
DBMS_SCHEDULER.SET_ATTRIBUTE(name => 'EM_ROLLUP_SCHED_JOB', attribute => 'job_class', value => 'ROLLUP')
GC_SCHED_JOB_REGISTRAR.SET_JOB_CLASS('EM_ROLLUP_SCHED_JOB', 'ROLLUP')
Post Blackout
Down
Agent Down
Up Unmonitored
Agent Misconfigured
Under Migration
Communication Broken
Status Pending
Collections Disabled
Disk Full
Blocked Manually
Ensure the OMS is reachable from the agent host and the agent is reachable from the OMS host
Check emctl status for various configurations, e.g. that the agent is communicating with the correct OMS
Check agent upload with emctl upload
Contact Oracle Support with these logs:
gcagent.log from the agent home
emoms_pbs.log, emoms.log
This document is intended for use by Chevron for presentation at the October 2, 2014 Oracle Open World Conference, for
posting on the Oracle Open World Conference website and for handouts to Oracle Open World Conference attendees. No
portion of this document may be copied, displayed, reproduced, published, sold, licensed, downloaded or used in any other
way, unless the use has been specifically authorized by Chevron in writing.
2014 Chevron U.S.A. Inc. All Rights Reserved
Agenda
Our company
Overview of Oracle EM HA
Tips and tricks to reduce downtime during a switchover or failover to disaster recovery (DR)
Benefits of using a storage replication HA solution
180+ countries in
which we operate
Chevron
Corporation
Headquarters
18 refineries and
asphalt plants
(includes Global
Upstream & Gas
and Downstream
headquarters)
35 chemical
manufacturing
facilities
Chevron
Refining
No Operations
Chemicals
3 retail brands
(Chevron, Texaco and
Caltex)
22,000+ retail outlets
Repository
Local HA: Data Guard fast start failover with maximum availability
protection mode
DR: Data Guard
F5, F5 Networks, and BIG-IP are trademarks or registered trademarks of F5 Networks, Inc. in the U.S. and in certain other countries.
Did you know that you can incorporate the standby database into the OMS configuration using a one-time configuration on the primary OMS?
Command:
emctl config oms -store_repos_details -repos_conndesc "(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=host1)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=host2)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=host3)(PORT=1521)))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=SID_DG)))" -repos_user sysman -repos_pwd password
Where:
host1 and host2 are the local primary database hosts (HA, fast-start failover)
host3 is the remote database host (DR)
Note: For RAC, host1 and host2 would be replaced by the SCAN IP
Benefits:
Reconfiguring the OMS to point to the new repository (after the Data
Guard switchover/failover) is not required.
Reduces downtime, human error, and manual work when a switchover/failover occurs.
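Because the connect descriptor is long and easy to mistype, it can help to generate it. The sketch below builds the same descriptor shape as the command above from a host list; the function names are hypothetical helpers, and the password option is deliberately left out of the generated argument list.

```python
def build_connect_descriptor(hosts, service_name, port=1521):
    """Build a multi-address connect descriptor for the given hosts."""
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in hosts
    )
    return (
        f"(DESCRIPTION=(ADDRESS_LIST={addresses})"
        f"(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME={service_name})))"
    )


def build_emctl_args(hosts, service_name, repos_user="sysman"):
    """Return the argument list for the one-time OMS configuration command."""
    return [
        "emctl", "config", "oms", "-store_repos_details",
        "-repos_conndesc", build_connect_descriptor(hosts, service_name),
        "-repos_user", repos_user,
    ]


if __name__ == "__main__":
    # host1/host2: local primary database hosts, host3: remote DR host.
    print(build_connect_descriptor(["host1", "host2", "host3"], "SID_DG"))
```

Listing the local hosts first and the DR host last mirrors the order used in the command above.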
CERN
European Organization for Nuclear Research
Founded in 1954
Research: Finding answers to questions about the
Universe
Technology, International collaboration, Education
21 Member States
7 Observer States
European Commission, USA, Russian Federation, India, Japan, Turkey, UNESCO
Associate State
Serbia
Candidate State
Romania
People
2500 Staff, 560 Fellows, 500 Students, 10600 Users, Grand Total ~ 15000
Thousands of
superconducting
magnets
Ultra vacuum:
10x emptier
than on the Moon
Coldest place
in the Universe:
-271C/1.9K/-456F
Deployment
Agents version: 12.1.0.4
Linux x86-64
Secure agent upload
AD accounts for user login
Agents
Users(https)
Failover VIP
RAC nodes
Cold Failover Cluster
Databases
200 Oracle Database Instances
80 RAC Databases
Middleware
370 WebLogic Servers
340 Java Virtual Machines
over 1000 App Deployments
Apache Tomcat & HTTP Servers
Hosts
270 Red Hat Enterprise Linux 5 & 6
Total
5200 targets
Before
Case Study:
OMS Troubleshooting
1. Launched Agent Upgrade job
2. Started receiving many alerts
Agent is unable to communicate with OMS
Agents not yet upgraded to R4:
no enhanced agent health status available
Case Study:
OMS Troubleshooting
4. Diagnosing the Repository Database
High Load on the OMR host
Row lock contention in SYSMAN
schema caused by Agent Upgrade
5. Oracle Support provided patch
6. OMR issue fixed
Throughput rate back to normal
Everything working
After
Agents Overview
Control agent
Properties
Partner Agent
Agents monitor one another
Faster downtime detection
Separate alerts
Agent down event:
Host=hostname.cern.ch
Target type=Agent
Target name=hostname.cern.ch:1234
Message=Agent has stopped monitoring. The following errors are reported : agent shutdown.
Host down event:
Host=hostname.cern.ch
Target type=Host
Target name=hostname.cern.ch
Message=Host Down - Detected by Partner Agent
Repository
Out of the box checks
based on OMS size
Repository Metrics