
Manage the Manager: Tips on How to Best Manage Oracle Enterprise Manager 12c

Angeline Janet Dhanarani, Product Management, Oracle
Lap Nguyen, Chevron
Andrei Dumitru, CERN
Copyright 2014, Oracle and/or its affiliates. All rights reserved. |

Total Cloud Control

Expanded Cloud Stack Management
Complete Cloud Lifecycle Management
Agile, Automated
Optimized, Efficient
Superior Enterprise-Grade Management
Scalable, Secure

Safe Harbor Statement


The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle's products remains at the sole discretion of Oracle.

Program Agenda

1. Architecture Overview of Enterprise Manager
2. Critical Subsystems and Their Monitoring with Self-Monitoring Features
3. High Availability and Disaster Recovery

Architecture Overview
Overall Architecture and Components


CRITICAL SUBSYSTEMS AND THEIR MONITORING WITH SELF-MONITORING FEATURES

Critical Subsystems

Loader Subsystem
Job Subsystem
Console Subsystem
Agent Subsystem
Notification Subsystem

Loader Subsystem

Responsible for processing the data collected by the agent and uploading it to the Repository
Its efficiency greatly impacts the performance and health of the overall system
Uploads data synchronously
Under heavy load, the OMS prioritizes uploading of data
  Preference is given to agents with higher agent priorities, such as Mission Critical and Production
  Agents with lower priorities are asked by the OMS to back off for a specific time period
  Backlog accumulates at the agents

[Diagram: Agents upload data synchronously to the OMS, which loads it into the Repository; backlog builds up at the Agents]
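The prioritization described above can be pictured with a small sketch. This is not the OMS implementation: the capacity budget, the `UploadRequest` type, the `triage_uploads` function and every priority label other than Mission Critical and Production are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical ranking; only "Mission Critical" and "Production" come from the slide.
PRIORITY_RANK = {
    "Mission Critical": 0,
    "Production": 1,
    "Staging": 2,
    "Test": 3,
    "Development": 4,
}


@dataclass
class UploadRequest:
    agent: str
    priority: str    # agent priority label
    size_mb: float   # amount of data the agent wants to upload


def triage_uploads(requests, capacity_mb):
    """Return (accepted, backed_off) agent names for an assumed loader budget."""
    accepted, backed_off = [], []
    remaining = capacity_mb
    # sorted() is stable, so agents with equal priority keep their arrival order.
    ordered = sorted(
        requests, key=lambda r: PRIORITY_RANK.get(r.priority, len(PRIORITY_RANK))
    )
    for req in ordered:
        if req.size_mb <= remaining:
            accepted.append(req.agent)
            remaining -= req.size_mb
        else:
            # Lower-priority agents are told to retry later; their backlog grows locally.
            backed_off.append(req.agent)
    return accepted, backed_off
```

With a 90 MB budget, a Mission Critical agent (50 MB) and a Production agent (30 MB) are served, while a Development agent (40 MB) is backed off.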

Loader Subsystem Monitoring
Checking Loader Performance

Monitor the Loader performance charts in Setup > Manage Cloud Control > Management Servers

Loader processing time
  Look for a consistent increase over a period of time
Current loader CPU utilization
  A lower value indicates that loader throughput is efficient
  Contact Oracle Support if the loader is consistently running at more than 85% utilization
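A minimal sketch of how the two checks above could be automated against exported chart samples. The function names, the sample window sizes and the reading of "consistently" as "every one of the last N samples" are assumptions; only the 85% figure comes from the slide.

```python
def loader_needs_escalation(cpu_samples, threshold=85.0, min_consecutive=6):
    """True if the last `min_consecutive` utilization samples all exceed `threshold` percent."""
    if len(cpu_samples) < min_consecutive:
        return False
    return all(sample > threshold for sample in cpu_samples[-min_consecutive:])


def is_steadily_increasing(values, min_points=4):
    """True if each of the last `min_points` values is larger than the one before it."""
    recent = values[-min_points:]
    if len(recent) < min_points:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

`loader_needs_escalation` models the Contact Oracle Support rule, and `is_steadily_increasing` models the "consistent increase" check on loader processing time.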

Loader Subsystem Monitoring
Checking Agent Backlog

Monitor the Upload Backlog and Backoff charts in Setup > Manage Cloud Control > Health Overview
  Overall Back-off Requests in the Last 10 Mins
  Overall Upload Backlog (MB) and (Files)
  Overall Upload Rate (MB/sec)

In case of a consistent increase in back-off requests or backlog
  Check that load is evenly distributed across all OMSs with the Loader Statistics Report (Reports / Information Publisher)

Loader Subsystem Monitoring
Checking Agent Backlog

Uneven load on a specific Management Server: check that the SLB configuration is set to the Round-Robin algorithm
Permitted deviation tolerance: 10-20%

General Advice
It is normal to have some agents being backed off
Watch for a large and consistently growing number of backed-off agents
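The deviation tolerance can be checked with a few lines of arithmetic. The sketch below assumes that deviation is measured against the mean load across all OMSs and uses the upper bound of the 10-20% range as the default; `uneven_oms_load` and its input format are hypothetical.

```python
def uneven_oms_load(load_by_oms, tolerance_pct=20.0):
    """Return {oms: deviation_pct} for every OMS whose load deviates from the mean by more than the tolerance."""
    if not load_by_oms:
        return {}
    mean = sum(load_by_oms.values()) / len(load_by_oms)
    if mean == 0:
        return {}
    flagged = {}
    for oms, load in load_by_oms.items():
        deviation = (load - mean) / mean * 100.0
        if abs(deviation) > tolerance_pct:
            flagged[oms] = round(deviation, 1)
    return flagged
```

For loads of 160, 70 and 70 files the mean is 100, so the first OMS is 60% above the mean and the other two are 30% below it; all three fall outside the tolerance, which points to the SLB not distributing uploads evenly.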

Job Subsystem

Anything that is scheduled and automated uses the job subsystem, e.g. scheduling blackouts and template apply operations
A very crucial sub-component
Critical processes in the Job System
  Step Scheduler: processes the job steps that are ready to run and marks them Ready
  Job Dispatcher: picks the steps marked Ready for execution and dispatches them to the job worker threads
  Worker threads: take work from the Job Dispatcher and send it to the appropriate agent; there are different thread pools for different job types

[Diagram: User, OMS (Step Scheduler, Job Dispatcher, Worker Threads), Repository and Agents]
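The three processes above form a simple pipeline, sketched here as a toy model. The step dictionaries, status names, pool names and function names are all assumptions for illustration; the real job system keeps this state in the Repository.

```python
from concurrent.futures import ThreadPoolExecutor


def schedule_ready_steps(steps, now):
    """Step Scheduler: mark steps whose scheduled time has arrived as READY."""
    for step in steps:
        if step["status"] == "SCHEDULED" and step["run_at"] <= now:
            step["status"] = "READY"
    return steps


def send_to_agent(step):
    """Worker thread: stand-in for sending the step to its agent."""
    step["status"] = "SENT"
    return f'{step["name"]} -> {step["agent"]}'


def dispatch_ready_steps(steps, pools):
    """Job Dispatcher: hand READY steps to the worker pool for their job type."""
    futures = []
    for step in steps:
        if step["status"] == "READY":
            step["status"] = "DISPATCHED"
            futures.append(pools[step["pool"]].submit(send_to_agent, step))
    # Wait for the workers so the caller sees the final result of each step.
    return [future.result() for future in futures]
```

Using one `ThreadPoolExecutor` per job type mirrors the "different thread pools for different job types" point: a burst of long-running steps cannot starve the short ones.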

Job Subsystem Monitoring
Setup > Manage Cloud Control > Health Overview

Monitor Jobs Backlog (Steps)
  Indicates the number of job steps that are past their scheduled execution time
  If this number is high and has not decreased for a long period, the job system is not functioning normally
  Either the job engine resources are unable to keep up with the inflow, or specific jobs are being processed abnormally because they are stuck for unusually long periods
The rate of change of the backlog is more important than the absolute backlog numbers

Job Subsystem Monitoring
Setup > Manage Cloud Control > Health Overview > Monitoring > All Metrics > Repository Job Scheduler Performance

Problem trend analysis of the Job Step Backlog and Overall Job Steps per Second metrics

[Charts: Job Step Backlog with an increasing trend over a prolonged period, shown against Overall Job Steps per Second with either an increasing trend or a decreasing/constant trend over a prolonged period]

If both Job Step Backlog and Overall Job Steps per Second show an increasing trend, the workload is high and the job engine resources cannot keep up with the inflow. Increase the resources
If Job Step Backlog is increasing but Overall Job Steps per Second is not, a specific job is being processed abnormally
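The two trend rules above can be expressed as a small decision function. This sketch assumes the metric history has been exported as plain lists of equally spaced samples and uses a least-squares slope as the trend measure; the function names and the returned messages are illustrative, not part of Enterprise Manager.

```python
def slope(values):
    """Least-squares slope of values against their sample index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def diagnose_job_system(backlog, steps_per_sec, eps=1e-9):
    """Apply the two trend rules to Job Step Backlog and Overall Job Steps per Second."""
    backlog_up = slope(backlog) > eps
    throughput_up = slope(steps_per_sec) > eps
    if backlog_up and throughput_up:
        return "high workload: add job engine resources"
    if backlog_up:
        return "abnormal processing: look for a stuck job"
    return "healthy"
```

The slope is the rate of change of the backlog, which the earlier slide singles out as more important than the absolute backlog numbers.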

Job Subsystem Monitoring
Setup > Manage Cloud Control > Management Services > Job System (More Details) > Job Dispatcher details

Monitor thread pool utilization if the inflow of work is high and the backlog is consistently high
If Avg. Steps Dispatched/Min is HIGH and Avg. Threads Available is less than 50% of Configured Threads for a specific pool, increase the thread pool size on each OMS
  Refer to the Appendix for sizing recommendations for the thread pool size
If Avg. Steps Dispatched/Min is LOW and Avg. Threads Available is also LOW, this typically means that a thread is stuck or hung
  Contact Support for triaging stuck threads
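A hedged sketch of the thread pool rules above. The slide does not define HIGH and LOW, so `high_dispatch_rate` is a caller-supplied assumption, and "LOW threads available" is assumed to mean the same under-50% condition used in the first rule.

```python
def diagnose_pool(avg_steps_per_min, avg_threads_available, configured_threads, high_dispatch_rate):
    """Classify one job thread pool from its Job Dispatcher statistics."""
    threads_scarce = avg_threads_available / configured_threads < 0.5
    if not threads_scarce:
        return "ok"
    if avg_steps_per_min >= high_dispatch_rate:
        # Busy and short of threads: the pool is too small for the inflow.
        return "increase pool size"
    # Little is being dispatched, yet threads are still occupied.
    return "possible stuck thread"
```

For a pool of 50 configured threads and an assumed high-rate cutoff of 200 steps/min, 500 steps/min with 10 threads free suggests a larger pool, while 5 steps/min with 2 threads free suggests a stuck thread.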

Console Subsystem Monitoring
Setup > Manage Cloud Control > Health Overview > OMS and Repository > Monitoring > Page Performance

Monitoring console performance

General Advisories
Proactively check that page access and session load are evenly distributed across the OMSs
  If not, check the SLB configuration
Check for the Symptom Analysis icon in the Overall tab and use this feature to narrow down the cause of slow-performing pages
  The icon appears only when the Page Processing Time (sec) metric exceeds its threshold
  Symptom analysis can be done on overall page processing and on individual pages
  The breakdown of processing time by layer helps narrow down the issue

Agent Subsystem Monitoring With Partner Agent

A partner agent is an agent which, in addition to all of its regular functions, monitors the status of its assigned Management Agent and its host

[Diagram: OMS, Partner Agent (Host1) and Monitored Agent (Host2). Labels: oracle_emd_proxy status and host status reported to the OMS; monitoring with a proprietary protocol; agent push; signals to partner agent]

Algorithm for automatic partner agent assignment by the OMS
  The partner agent should be pingable from the agent it is going to monitor
  Preference is given to agents belonging to the same subnet
  The agent should be a 12.1.0.4 Agent
  The agent should be monitoring fewer than 10 (configurable) agents
The partnership can be changed explicitly with emcli manage_agent_partnership
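The assignment criteria above can be read as a filter followed by a preference order. The sketch below is only an illustration of that reading: the `Agent` type, the `pick_partner` function, the `reachable` callback, the /24 subnet size and the tie-break on least-loaded candidate are all assumptions, not the OMS algorithm.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    ip: str
    version: str
    partners_monitored: int = 0  # how many agents this one already monitors


def pick_partner(target, candidates, reachable, subnet_prefix=24,
                 max_monitored=10, required_version="12.1.0.4"):
    """Pick a partner for `target`; `reachable(candidate, target)` is a caller-supplied ping check."""
    target_net = ipaddress.ip_network(f"{target.ip}/{subnet_prefix}", strict=False)
    eligible = [
        c for c in candidates
        if c.name != target.name
        and c.version == required_version
        and c.partners_monitored < max_monitored
        and reachable(c, target)
    ]
    if not eligible:
        return None
    # False sorts before True, so same-subnet candidates come first;
    # ties go to the candidate that monitors the fewest agents (an assumed tie-break).
    eligible.sort(
        key=lambda c: (ipaddress.ip_address(c.ip) not in target_net, c.partners_monitored)
    )
    return eligible[0]
```

A candidate on another subnet is still chosen when no same-subnet agent passes the filters, which matches "preference" rather than "requirement".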

Agent Subsystem Monitoring With Partner Agent

Target statuses with a partner agent

SCENARIO | TARGET | STATUS OF TARGET
Agent is shut down gracefully and not under blackout | Agent | Down
Agent is shut down gracefully and not under blackout | Host | Up (Unmonitored)
Agent is shut down gracefully and not under blackout | Monitored target | Agent Down
Agent goes down unexpectedly and host is up (and not under blackout) | Agent | Agent Unreachable
Agent goes down unexpectedly and host is up (and not under blackout) | Host | Up (Unmonitored)
Agent goes down unexpectedly and host is up (and not under blackout) | Monitored target | Agent Unreachable
Partner Agent is not available (host or agent is down) | Agent | Agent Unreachable
Partner Agent is not available (host or agent is down) | Host | Agent Unreachable
Partner Agent is not available (host or agent is down) | Monitored target | Agent Unreachable

The partner agent accesses the monitored agent and host with a proprietary protocol
  It can convey to the OMS whether the monitored agent goes DOWN
  It can determine whether the host of the monitored agent is UP or DOWN
Agent status detection is immediate (a few seconds)
Host status change detection, when the agent is down, is done every minute

Agent Subsystem - Agent Unreachable Troubleshooting

Sub-statuses added to provide more diagnostic details
Each entry gives the common scenario, the sub-status description and troubleshooting tips

Down / Agent Down
  Description: The agent was brought down in error or as part of planned maintenance.
  Tips: If the agent was brought down in error, restart it from the agent home page. If it was brought down as part of planned maintenance, consider creating a blackout on the agent.

Up Unmonitored
  Description: Currently this sub-status is set only for host targets with real-time partner agent detection. The host is up but its agent is shut down.
  Tips: If the agent is down, run emctl start agent. To triage the agent issue, go to its agent home page and run the Symptom Analysis tool located next to the Status field.

Cannot Write to File System
  Description: The agent cannot write to the file system due to a permission issue.
  Tips: Check that the OS user who owns the agent process has write access to the agent instance directory.

Collections Disabled
  Description: Agent collections have been disabled. The agent will no longer collect any metrics for the managed targets.
  Tips: Check that the agent can upload to the OMS with emctl upload. Check the Loader Statistics report for loader health.

Disk Full
  Description: The agent file system is full.
  Tips: Check that the agent can upload to the OMS with emctl upload. Recheck the count of pending files with emctl status agent to verify that it has gone down.

Post Blackout
  Description: The agent is unreachable because its first severity has not yet arrived after the blackout ended.
  Tips: To triage the agent issue, go to its agent home page and run the Symptom Analysis tool located next to the Status field.

Agent Subsystem - Agent Unreachable/Pending Diagnosis

Sub-statuses added to provide more diagnostic details
Each entry gives the common scenario, the sub-status description and troubleshooting tips

Blocked Manually
  Description: The agent has been blocked manually.
  Tips: Unblock the agent from the console: Setup > Manage Cloud Control > Agents

Blocked (Plug-in Mismatch)
  Description: The agent has been blocked from communicating with the OMS due to a plug-in mismatch.
  Tips: If the agent has been restored from a backup, perform an agent resync with emcli resyncAgent.

Blocked (Bounce Counter Mismatch)
  Description: The agent has been blocked from communicating with the OMS due to a bounce counter mismatch.
  Tips: If the agent has been restored from a backup, perform an agent resync with emcli resyncAgent.

Agent Misconfigured
  Description: The agent is configured to communicate with another OMS, an OMS-agent time skew has been noticed, or there have been consecutive metadata/severity upload failures.
  Tips: Check the agent configuration to ensure the agent is communicating with the correct OMS. Re-secure the agent with emctl secure agent.

Communication Broken
  Description: The agent is unreachable due to a communication break between the agent and the OMS.
  Tips: Address the network latency, blocked port or proxy-related issue.

Under Migration
  Description: The agent is unreachable because it is under migration (2-system upgrade) from pre-12c to 12c.
  Tips: Migrate the agent and then start the agent.

Note: Refer to the Appendix for general troubleshooting steps for Agent Unreachable

Notification Subsystem Monitoring

The notification system allows you to notify Enterprise Manager administrators when specific incidents, events, or problems arise
A backlog in notifications can cause a delay in alerts being sent, or cause alerts to be missed altogether
If notifications are not getting delivered
  Check the external systems that are configured to receive notifications
    For email/pager: is the email gateway configured and working?
    For OS Command and PL/SQL, check the external systems that they may connect to
    Contact Oracle Support if external systems are not working as expected
  Find the specific events in the Incident Manager console (for non-informational events)
    If an event is not found, it is likely to be an event publishing issue
    If it is found in Incident Manager, verify the rule definition

Notification Subsystem Monitoring
Setup > Manage Cloud Control > Health Overview

Check the notification delivery backlog
  Look for a consistent increase
Key metrics to monitor
  Notifications Processed (Last Hour)
  Pending Notifications Count
If Pending Notifications Count remains high over a period of time (such as an hour), check Notifications Processed (Last Hour)
  If it is making good progress, the load is probably temporary and will resolve itself soon
  If it is not making good progress, there could be stuck queues in the notification system or out-of-date incident rules. Contact Oracle Support

Few Other Critical Subsystems (Appendix)

Events Subsystem
Repository Metrics Collection Jobs
Repository Health
Repository Scheduler Jobs
Metric Rollup Jobs

HIGH AVAILABILITY AND DISASTER RECOVERY

High Availability

Critical components in the Enterprise Manager infrastructure are:
  Repository: persistent store for all Enterprise Manager data
  OMS: central application accessed by Agents and end users
  Software Library: file system repository used to store software entities
All of the above should be configured for High Availability if the availability of Enterprise Manager is critical

High Availability
Repository

Oracle RAC provides a standard HA solution for the EM repository
  Best Practice: Configure RAC prior to EM installation
  Best Practice: Use SCAN and role-based DB services for OMS-to-Repository connect strings
Advantage of role-based database services with Oracle RAC
  The startup of database services can be controlled automatically by assigning a database role: PRIMARY / PHYSICAL_STANDBY / LOGICAL_STANDBY / SNAPSHOT_STANDBY
Refer to the white paper Best Practices for Highly Available Oracle Databases for details

[Diagram: OMS connecting to the Repository]

High Availability
OMS

Additional OMSs can be deployed behind a Server Load Balancer (SLB) for OMS High Availability
Agents and users communicate with the OMS via the load balancer

High Availability End-To-End Topology

All OMS, Repository and Software Library components are active within the same data center
The Software Library must be accessible read/write from all active OMSs
The Software Library should be deployed on highly available storage
Not a Disaster Recovery (DR) solution

Disaster Recovery

Protects applications from catastrophic failures
Keeps data on the primary site synchronized with a standby site
Allows applications to fail over to the standby site

[Diagram: Primary Site and Standby Site]

Disaster Recovery
Repository

A Data Guard Physical Standby Database provides the Disaster Recovery solution for the Repository
Use Data Guard Broker to manage switchover and failover of the database
Best Practice: Configure the OMS connect descriptor with the SCAN names and role-based services of the primary and standby data centers
(DESCRIPTION_LIST=
(LOAD_BALANCE=off) (FAILOVER=on)
(DESCRIPTION= (CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)
(ADDRESS_LIST= (LOAD_BALANCE=on)
(ADDRESS=(PROTOCOL=TCP)(HOST=PRIM_SCAN)(PORT=1521)))
(CONNECT_DATA=(SERVICE_NAME=DB_ROLE_SERVICE)))
(DESCRIPTION= (CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)
(ADDRESS_LIST= (LOAD_BALANCE=on)
(ADDRESS=(PROTOCOL=TCP)(HOST= STBY_SCAN)(PORT=1521)))
(CONNECT_DATA=(SERVICE_NAME=DB_ROLE_SERVICE))))
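Hand-editing nested descriptors like the one above is error-prone, so a small generator can help. This is a sketch only: `build_descriptor` and `parens_balanced` are hypothetical helpers that reproduce the structure shown on the slide, with the timeout and retry values taken from it as defaults.

```python
def build_descriptor(scan_hosts, service_name, port=1521,
                     connect_timeout=5, transport_timeout=3, retry_count=3):
    """Build a DESCRIPTION_LIST with one DESCRIPTION per SCAN name, in failover order."""
    descriptions = []
    for host in scan_hosts:
        descriptions.append(
            f"(DESCRIPTION=(CONNECT_TIMEOUT={connect_timeout})"
            f"(TRANSPORT_CONNECT_TIMEOUT={transport_timeout})(RETRY_COUNT={retry_count})"
            f"(ADDRESS_LIST=(LOAD_BALANCE=on)"
            f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port})))"
            f"(CONNECT_DATA=(SERVICE_NAME={service_name})))"
        )
    return "(DESCRIPTION_LIST=(LOAD_BALANCE=off)(FAILOVER=on)" + "".join(descriptions) + ")"


def parens_balanced(text):
    """Cheap sanity check: every '(' has a matching ')' and none closes early."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

Calling `build_descriptor(["PRIM_SCAN", "STBY_SCAN"], "DB_ROLE_SERVICE")` yields a single-line equivalent of the descriptor above. The primary SCAN is listed first, so it is tried first.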

[Diagram: Primary Site (active) protected by a Data Guard Physical Standby at the Standby Site (passive)]

Disaster Recovery
OMS

Deploy Standby (Passive) OMSs on the Standby Site
  Standby OMS using Standby WebLogic Domain (Deprecated)
  Standby OMS using Storage Replication
Use DNS / Global Traffic Manager to redirect Agent and Console/EMCLI traffic
Best Practice: Storage Replication is the recommended method
  No manual application of plug-ins or OMS patches at the Standby Site
  No rebuild of the Standby Site needed after upgrades

[Diagram: Agent and Console/EMCLI traffic passes through DNS/GTM to the Primary Site or the Standby Site]
Enterprise Manager High Availability Level 4 Solution
Recommended Solution for High Availability and Disaster Recovery with Storage Replication

[Diagram: a DNS lookup directs traffic to the Server Load Balancer of the primary data center (ACTIVE) or, after SWITCHOVER, to the Server Load Balancer of the standby data center (PASSIVE). Each site has a Primary OMS and an Additional OMS1 on storage holding the OMS Share, OMS1 Share and Swlib Share. Storage is replicated continuously, and the EM Repository is replicated with Data Guard from the primary to a physical standby]

ACFS Replication

An alternative to using external storage appliances
ACFS storage replication requires Grid Infrastructure to be installed for a cluster
An ACFS file system is created on the ACFS server for the OMS install and the software library, and is exported using NFS
The file system is mounted on another node (the OMS server), and EM is installed there
A similar setup on a second ACFS server provides another ACFS file system to be used as a standby
ACFS replication is established between the primary and standby ACFS servers
Refer to Section 18.7 of the Advanced Installation Guide for configuration details

BI Publisher High Availability

With EM12c R4, BI Publisher (BIP) is bundled and installed by default
  BIP needs to be configured using the configureBIP script
BIP supports Enterprise Manager HA scale-out
  BIP can be configured on all OMS nodes to increase reporting capacity
  This does not provide failover in case one of the BIP instances fails or is otherwise stopped [to be fixed in a future release]
Recommendations:
  Configure BIP on the first OMS node before cloning it
  Always configure BIP on all OMS nodes, and ensure that BIP is always up when that node's OMS is up
  BI Publisher is supported only with the storage-replication-based solution for Disaster Recovery. It is not functional with the Standby OMS using Standby WebLogic Domain method

[Diagram: a Load Balancer in front of WebLogic Server 1 (OMS1, BIP1) and WebLogic Server 2 (OMS2, BIP2)]

Appendix


Architecture Overview
Oracle Management Service (OMS)

Central Enterprise Manager application
  Source of truth for all management
  Receives and processes data from Agents
  Uses the repository as a persistent store for information
Comprises 2 WebLogic Server application deployments
  Console: provides the UI and target-specific management
  Platform Background Services (PBS): a set of background services critical for monitoring and management
Verify that the status of Console and PBS is marked as UP for each Management Service in Setup > Manage Cloud Control > Management Services

Architecture Overview
Repository

Most critical part of the EM system
  Deploy with performance and availability in mind
Persistent store for data collected from the managed targets
  Performance and availability metrics
  Configuration and compliance information
Used to store a variety of Enterprise Manager configuration information, such as:
  users and privileges
  job definitions

Architecture Overview
Agents

Collect monitoring and configuration data from the targets and store it locally in XML files
  Collected data is uploaded at scheduled intervals to the Management Service using HTTP/HTTPS
  XML files are purged once the data has been uploaded
Execute tasks on behalf of Enterprise Manager users
  Real-time data collections
  Jobs
  Deployment Procedures

Job Subsystem Monitoring
Sizing Recommendations for Thread Pool Size

Use the default pool size configuration for Small and Medium configurations
In case of major resource issues, contact Oracle Support for guidance on adding additional threads

Sizing recommendations of pool size for a Large configuration with 2 or 4 OMS nodes

Parameter | Value
oracle.sysman.core.jobs.shortPoolSize | 50
oracle.sysman.core.jobs.longPoolSize | 24
oracle.sysman.core.jobs.longSystemPoolSize | 20
oracle.sysman.core.jobs.systemPoolSize | 50
oracle.sysman.core.conn.maxConnForJobWorkers | 144
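To apply a table like this consistently on each OMS, the values can be turned into commands by a short script. The sketch assumes the properties are set with `emctl set property -name <name> -value <value>`; check the exact syntax and any required options for your release before running the output. `emctl_commands` is a hypothetical helper.

```python
# Values copied from the Large configuration table above.
LARGE_CONFIG_POOLS = {
    "oracle.sysman.core.jobs.shortPoolSize": 50,
    "oracle.sysman.core.jobs.longPoolSize": 24,
    "oracle.sysman.core.jobs.longSystemPoolSize": 20,
    "oracle.sysman.core.jobs.systemPoolSize": 50,
    "oracle.sysman.core.conn.maxConnForJobWorkers": 144,
}


def emctl_commands(properties):
    """Render one command line per OMS property (assumed emctl syntax, not executed here)."""
    return [
        f"emctl set property -name {name} -value {value}"
        for name, value in properties.items()
    ]
```

Printing `"\n".join(emctl_commands(LARGE_CONFIG_POOLS))` gives a reviewable script rather than changing anything directly, which suits settings that Oracle Support may want to adjust.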

Event Subsystem Monitoring
Setup > Manage Cloud Control > Health Overview > OMS and Repository menu > Monitoring > All Metrics

Responsible for processing the events published by different components in the system
Key metrics for checking the event backlog: Total Events Pending and Total Events Processed (Last Hour)
If Total Events Pending remains high (for over an hour), check the Total Events Processed (Last Hour) metric
  If it is making good progress (the count is high), the load is probably temporary and can be ignored
    While the pending count remains high, the system should sustain a minimum processing rate of 1000 events every 10 minutes
  If it is not making good progress, there could be stuck queues in the event system
    Check the queue statistics in the Event Status metric group to detect problems in AQ
    Contact Support for triaging issues in AQ / queues
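The event backlog rule above reduces to a small check. In this sketch the `high_pending` cutoff is an assumption (the slide does not define "high"), and the 1000 events per 10 minutes figure is scaled to 6000 per hour so it can be compared with Total Events Processed (Last Hour).

```python
def event_backlog_status(pending_samples, processed_last_hour, high_pending,
                         min_per_10_min=1000):
    """Classify the event backlog from about an hour of Total Events Pending samples."""
    # The backlog only matters if it stayed high for the whole window.
    if not pending_samples or min(pending_samples) < high_pending:
        return "ok"
    required_per_hour = min_per_10_min * 6
    if processed_last_hour >= required_per_hour:
        return "temporary load"
    return "possible stuck queue"
```

A backlog that stays above the cutoff while 7200 events were processed in the last hour is treated as temporary load; the same backlog with only 1200 processed points to a stuck queue worth checking in the Event Status metric group.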

Repository Metric Collection Jobs Monitoring

Repository metric jobs are subdivided into long-running and short-running tasks
  Some collection workers (default 1) process the short tasks and some (default 1) process the long tasks
Key indicators of performance
  Repository Collection Performance chart
  Repository collection performance metrics
    Key metrics
      Average Collection Duration (seconds)
      Collections Processed
    Repository Collection Task Performance
      Run Duration (Seconds)

Repository Metric Collection Job Monitoring

Average Collection Duration (seconds): an indicator of the load on the repository collection subsystem
  Two possible reasons for an increase: the number of collections has increased, or some of the metrics are taking a long time to complete
Check the Run Duration (Seconds) metric
  It identifies which metrics take more than 2 minutes (the default; a threshold can be set) to execute
  If any metric is taking an unusually long time, disable that specific metric to unblock the others
Check the Collections Processed metric
  If it is consistently high and the backlog is continuous
    Enable Collection Manager for one-off cases
    Configure more worker threads if backlogs are generally high
    The maximum number of workers is 5
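The Run Duration check can be sketched as a filter over exported metric values. The input format (a mapping of collection name to run duration in seconds) and the `slow_collections` helper are assumptions; the 120-second default mirrors the 2-minute threshold above.

```python
def slow_collections(run_durations, threshold_seconds=120):
    """Return the names of collections over the threshold, slowest first."""
    slow = [name for name, seconds in run_durations.items() if seconds > threshold_seconds]
    return sorted(slow, key=lambda name: run_durations[name], reverse=True)
```

The slowest entries are the first candidates to disable when one metric is blocking the collection workers.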

Repository Health Monitoring

General guidelines for Maximum Availability to check in the repository
  Regular backups
  ARCHIVELOG mode ON
  FLASHBACK mode ON
  Refer to the Oracle Database High Availability guidelines
Compliance with the Repository database settings as per the sizing guidelines

Repository Scheduler Jobs Monitoring
Setup > Manage Cloud Control > Repository

Monitor the status and processing time of the Repository Scheduler Jobs
Tips for troubleshooting if the status of these jobs is Down
  For the repository jobs to run, DBMS_SCHEDULER must be enabled
  Start these jobs with the PL/SQL command exec emd_maintenance.submit_em_dbms_jobs

Repository Scheduler Jobs Monitoring

If a specific job is down, i.e. in a broken state
  Query the mgmt_performance_names table as the repository owner for the dbms_jobname, and fetch the job ID from all_jobs
  Look for ORA-12012 messages for this job ID in the database alert log and trace files to find the problem to fix, then restart the job from the console
  Contact Oracle Support if the fix cannot be easily identified
Key metrics to gauge performance
  Throughput per second
  Processing Time (% of Last Hour)
  If Processing Time is large and Throughput is low, check for errors in the database alert log

Metric Rollup Jobs
Setup > Manage Cloud Control > Repository

Aggregation mechanism: both hourly and daily rollups are done directly from the raw data
Look out for consistently growing backlogs or prolonged execution times
Configure additional rollup worker threads using the Configure option in the Metric Rollup Performance chart
If RAC is configured for the database, avoid RAC contention negating the gain from the additional threads
  Create a database service and set affinity to it so that the rollup job runs on only one RAC node

Metric Rollup Jobs Monitoring
Setting affinity with RAC Configuration

Create a database service and set affinity to it so that the rollup job runs on only one RAC node
Create the database service rollup and set one of the RAC instances as the primary instance with -r
  srvctl add service -d <dbname> -s rollup -r <primary instance> -a <the other instances> -y automatic
  srvctl start service -d <dbname> -s rollup
  srvctl status service -d <dbname>
As the sys user, execute
  DBMS_SCHEDULER.create_job_class(job_class_name => 'ROLLUP', service => 'rollup')
  GRANT EXECUTE ON sys.ROLLUP TO sysman;
As the sysman user, execute
  DBMS_SCHEDULER.SET_ATTRIBUTE(name => 'EM_ROLLUP_SCHED_JOB', attribute => 'job_class', value => 'ROLLUP')
  GC_SCHED_JOB_REGISTRAR.SET_JOB_CLASS('EM_ROLLUP_SCHED_JOB', 'ROLLUP')

Agent Subsystem - New Agent Unreachable Sub-statuses

Sub-statuses added to provide more diagnostic details

Agent Unreachable sub-statuses
  Down
  Agent Down
  Up Unmonitored
  Cannot Write to File System
  Collections Disabled
  Disk Full
  Blocked Manually
  Post Blackout
  Blocked (Plug-in Mismatch)
  Blocked (Bounce Counter Mismatch)
  Agent Misconfigured
  Communication Broken
  Under Migration
Status Pending sub-statuses
  Target Addition in Progress
  Status Pending (Post Blackout)
  Status Pending (Post Metric Error)

General Troubleshooting Steps for Agent Unreachable
Setup > Manage Cloud Control > Agents

Target Status Diagnostics Report: Agent-based targets (Information Publisher report)
  Check the Promote Status column and Broken Reason in Target Information
  Check for the latest Clean Heartbeat UTC time in the Agent Ping Status table in the report
Ensure that the OMS is reachable from the agent host and the agent is reachable from the OMS host
Check emctl status for various configurations, e.g. that the agent is communicating with the correct OMS
Check agent upload with emctl upload
Contact Oracle Support with these logs
  gcagent.log from the agent home
  emoms_pbs.log, emoms.log

Safe Harbor Statement


The preceding is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle's products remains at the sole discretion of Oracle.

Enterprise Manager (EM) High Availability (HA) Architecture

Lap Nguyen, Database Analyst
Oracle Open World Conference
October 2, 2014

This document is intended for use by Chevron for presentation at the October 2, 2014 Oracle Open World Conference, for
posting on the Oracle Open World Conference website and for handouts to Oracle Open World Conference attendees. No
portion of this document may be copied, displayed, reproduced, published, sold, licensed, downloaded or used in any other
way, unless the use has been specifically authorized by Chevron in writing.
2014 Chevron U.S.A. Inc. All Rights Reserved

Agenda

Our company
Overview of Oracle EM HA
Tips and tricks to reduce downtime during a switchover or failover to disaster recovery (DR)
Benefits of using a storage replication HA solution

Oracle is a registered trademark of Oracle and/or its affiliates.

Chevron is One of the Largest Integrated Energy Companies in the World

2nd largest integrated energy company in the United States
12th largest company in the world
64,500+ employees worldwide (includes service station personnel)
2.59 million net barrels of oil per day in 2012
$21.4 billion net income in 2013
$39.8 billion capital and exploratory budget for 2014

A Global Company Operating on Six Continents

180+ countries in which we operate
30+ countries with exploration and production activities
18 refineries and asphalt plants
35 chemical manufacturing facilities
3 retail brands (Chevron, Texaco and Caltex)
22,000+ retail outlets

[Map legend: Chevron Corporation Headquarters (includes Global Upstream & Gas and Downstream headquarters), Exploration & Production, Refining, Chemicals, No Operations. In some cases, one dot designates multiple locations]

Overview of Oracle EM HA Architecture Components

Repository
  Local HA: Data Guard fast-start failover with maximum availability protection mode
  DR: Data Guard
Oracle Management Service (OMS) redundancy: two primary OMSs and two DR OMSs
Network attached storage (NAS) replication: application software bits and the Oracle Enterprise Manager Software Library
Global Traffic Manager and F5 Networks BIG-IP
Oracle Access Manager (OAM) single sign-on (SSO): two primary servers and two local DR servers (required due to Kerberos SSO)

F5, F5 Networks, and BIG-IP are trademarks or registered trademarks of F5 Networks, Inc. in the U.S. and in certain other countries.

Oracle EM HA Architecture Diagram


Tips and Tricks to Simplify the DR Configuration

Did you know that you can make the OMS aware of the standby database with a one-time configuration on the primary OMS?
Command:
emctl config oms -store_repos_details -repos_conndesc "(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=host1)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=host2)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=host3)(PORT=1521)))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=SID_DG)))" -repos_user sysman -repos_pwd password
Where:
host1 and host2 are the local primary database hosts (HA, fast-start failover)
host3 is the remote database host (DR)
Note: For RAC, host1 and host2 would be replaced by the SCAN IP

Benefits:
Reconfiguring the OMS to point to the new repository after a Data Guard switchover/failover is not required.
Reduces downtime, human error, and manual work when a switchover/failover occurs.
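
The multi-address connect descriptor is easy to mistype by hand, so it can help to generate it. The following is a minimal sketch, assuming the host names and service name from the slide; `build_conndesc` and `store_repos_details_args` are hypothetical helper names, not part of any Oracle tool. The password is deliberately left out of the argument list so that it does not end up in source control or shell history.

```python
def build_conndesc(hosts, service_name, port=1521):
    """Return an Oracle Net connect descriptor listing every host in order."""
    if not hosts:
        raise ValueError("at least one host is required")
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in hosts
    )
    return (
        f"(DESCRIPTION=(ADDRESS_LIST={addresses})"
        f"(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME={service_name})))"
    )


def store_repos_details_args(conndesc, repos_user="sysman"):
    """Argument list for the emctl call shown on the slide (password omitted)."""
    return [
        "emctl", "config", "oms", "-store_repos_details",
        "-repos_conndesc", conndesc,
        "-repos_user", repos_user,
    ]


# host1/host2: local HA pair, host3: remote DR host (names taken from the slide)
conndesc = build_conndesc(["host1", "host2", "host3"], "SID_DG")
print(conndesc)
```

Generating the string keeps the parentheses balanced and makes it simple to swap host1 and host2 for a single SCAN entry on RAC.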

Benefits of Setting Up Multiple Database Connections when Using Switchover or Failover to DR

The OMS is set up to connect to multiple database hosts, so a switchover or failover to DR needs no OMS configuration changes.
Six simple steps to switch over to DR without having to run configuration changes:
1. Stop the OMS
2. Switch over the database to DR
3. Disable F5 on the primary site and enable F5 on the DR site
4. Break the NAS mirror and set the DR site to RW
5. Start up the OMS
6. Configure the OMS to support Chevron logon standards

Note: Step 6 is NOT required if OAM/Kerberos SSO is not in place
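
The six steps above can be captured as an ordered runbook so that nobody skips or reorders one under pressure. This is a hedged sketch that only prints a plan and executes nothing. `emctl stop oms`, `emctl start oms` and the DGMGRL `SWITCHOVER TO` statement are standard commands, while the F5, NAS and logon steps are site-specific and appear only as placeholders; the database name `emrep_dr` is a made-up example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int
    description: str
    command: str  # printed for the operator; this sketch never runs it


def switchover_plan(dr_db_unique_name, sso_in_place=True):
    """Return the ordered switchover steps from the slide."""
    steps = [
        Step(1, "Stop the OMS", "emctl stop oms"),
        Step(2, "Switch over the database to DR",
             f"dgmgrl> SWITCHOVER TO '{dr_db_unique_name}'"),
        Step(3, "Disable F5 on the primary site and enable F5 on the DR site",
             "<site-specific load balancer procedure>"),
        Step(4, "Break the NAS mirror and set the DR site to RW",
             "<site-specific storage procedure>"),
        Step(5, "Start up the OMS", "emctl start oms"),
    ]
    if sso_in_place:
        # Step 6 applies only when OAM/Kerberos SSO is in place
        steps.append(Step(6, "Configure the OMS for the site logon standards",
                          "<site-specific SSO procedure>"))
    return steps


# 'emrep_dr' is a hypothetical DR database unique name
for step in switchover_plan("emrep_dr"):
    print(f"{step.number}. {step.description}: {step.command}")
```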


Benefits of a DR Solution with NAS Replication

Patching/Plugin Deployment steps without NAS replication:
1. Stop the OMS
2. Patch or deploy new plugins on Primary
3. Switch the database to Standby
4. Start up the OMS on Standby
5. Patch or deploy new plugins to the OMS on Standby
6. Switch the database back to primary
7. Start up the OMS

Patching/Plugin Deployment steps with NAS replication:
1. Stop the OMS
2. Patch or deploy new plugins on the primary OMS
3. Start up the OMS

Benefits:

Without storage replication, the standby OMS must be recreated when upgrading

Reduces downtime by half or more when patching or deploying plugins

Reduces human error and simplifies the EM infrastructure
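
To make the comparison concrete, the two procedures can be written down as data and summarized. This is only an illustrative sketch: step counts are a rough proxy for effort, not a downtime measurement, and `summarize` is a hypothetical helper.

```python
WITHOUT_NAS = [
    "Stop the OMS",
    "Patch or deploy new plugins on Primary",
    "Switch the database to Standby",
    "Start up the OMS on Standby",
    "Patch or deploy new plugins to the OMS on Standby",
    "Switch the database back to primary",
    "Start up the OMS",
]

WITH_NAS = [
    "Stop the OMS",
    "Patch or deploy new plugins on the primary OMS",
    "Start up the OMS",
]


def summarize(steps):
    """Count total steps, patch rounds and database role switches."""
    return {
        "total": len(steps),
        "patch_rounds": sum(s.startswith("Patch") for s in steps),
        "db_switches": sum(s.startswith("Switch") for s in steps),
    }


before = summarize(WITHOUT_NAS)
after = summarize(WITH_NAS)
print(before)
print(after)
```

With NAS replication the second patch round and both database role switches disappear, which is where the slide's "half or more" claim comes from.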


Enterprise Manager at CERN


Andrei Dumitru
IT Department / Database Services / openlab

CERN
European Organization for Nuclear Research
Founded in 1954
Research: Finding answers to questions about the
Universe
Technology, International collaboration, Education

21 Member States
7 Observer States
European Commission, USA, Russian Federation, India, Japan, Turkey, UNESCO

Associate State
Serbia

Candidate State
Romania

People
2500 Staff, 560 Fellows, 500 Students, 10600 Users
Grand Total ~ 15000

CERN Member States

The largest particle accelerators & detectors


27 km (17 miles) long tunnel
Thousands of superconducting magnets
Ultra vacuum: 10x emptier than on the Moon
Coldest place in the Universe: -271 °C / 1.9 K / -456 °F

Credit: Mariusz Piorkowski

Deployment
Agents version: 12.1.0.4
Linux x86-64
Secure agent upload
AD accounts for user login

2-node RAC OMS+OMR


OMS version: 12.1.0.4
Linux RHEL 5 x86-64
8 CPU @ 2.53GHz
48 GB RAM

Agents
Users (https)
Failover VIP

RAC nodes
Cold Failover Cluster

RDBMS version 11.2.0.4


Size: ~200GB
NAS storage
Shared storage (OMS & OMR)

Databases
200 Oracle Database Instances
80 RAC Databases
Middleware
370 WebLogic Servers
340 Java Virtual Machines
over 1000 App Deployments
Apache Tomcat & HTTP Servers
Hosts
270 Red Hat Enterprise Linux 5 & 6
Total
5200 targets

Before

Case Study:
OMS Troubleshooting
1. Launched Agent Upgrade job
2. Started receiving many alerts
Agent is unable to communicate with OMS
Agents not yet upgraded to R4: no enhanced agent health status available
3. Looking into the new self-monitoring features (Loader):
Throughput was dropping
Backoff (number of files rejected) increasing
Utilized Capacity increasing
over 5000 files in backoff

Case Study:
OMS Troubleshooting
4. Diagnosing the Repository Database
High Load on the OMR host
Row lock contention in SYSMAN
schema caused by Agent Upgrade
5. Oracle Support provided patch
6. OMR issue fixed
Throughput rate back to normal
Everything working
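
The Loader symptoms from step 3 (throughput falling while backoff and utilized capacity rise) form a pattern that can be checked mechanically. The sketch below assumes the three metrics have already been exported as plain lists of samples; the numbers are invented for illustration and are not CERN measurements, and `trend` and `loader_under_pressure` are hypothetical helpers rather than EM features.

```python
def trend(samples):
    """Classify a series as 'rising', 'falling', 'flat' or 'mixed'."""
    diffs = [later - earlier for earlier, later in zip(samples, samples[1:])]
    if not diffs or all(d == 0 for d in diffs):
        return "flat"
    if all(d >= 0 for d in diffs):
        return "rising"
    if all(d <= 0 for d in diffs):
        return "falling"
    return "mixed"


def loader_under_pressure(throughput, backoff, utilized_capacity):
    """True when all three symptoms from the case study occur together."""
    return (
        trend(throughput) == "falling"
        and trend(backoff) == "rising"
        and trend(utilized_capacity) == "rising"
    )


# Invented sample values, for illustration only
incident = loader_under_pressure(
    throughput=[120, 90, 40],
    backoff=[0, 1800, 5200],
    utilized_capacity=[35, 70, 95],
)
print(incident)
```

A check like this only says that the Loader is falling behind; as in the case study, the root cause (here, row lock contention in the repository) still has to be found on the OMR side.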

After

Agents Overview

Check agent status


Symptom Analysis

Control agent
Properties

Partner Agent
Agents monitor one another
Faster downtime detection
Separate alerts

agent down event
Host=hostname.cern.ch
Target type=Agent
Target name=hostname.cern.ch:1234
Message=Agent has stopped monitoring. The following errors are reported : agent shutdown.

host down event
Host=hostname.cern.ch
Target type=Host
Target name=hostname.cern.ch
Message=Host Down - Detected by Partner Agent
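
Because the two alerts arrive separately, a notification handler can tell them apart from the key=value fields alone. This is a minimal sketch that assumes the alert text has the field layout shown on the slide; `parse_event` and `classify` are hypothetical helpers, not an EM API.

```python
def parse_event(text):
    """Turn 'Key=value' lines into a dict, ignoring lines without '='."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def classify(fields):
    """Distinguish an agent down event from a partner-detected host down event."""
    target_type = fields.get("Target type")
    message = fields.get("Message", "")
    if target_type == "Host" and "Detected by Partner Agent" in message:
        return "host down (partner agent)"
    if target_type == "Agent":
        return "agent down"
    return "other"


agent_event = parse_event(
    "Host=hostname.cern.ch\n"
    "Target type=Agent\n"
    "Target name=hostname.cern.ch:1234\n"
    "Message=Agent has stopped monitoring. "
    "The following errors are reported : agent shutdown."
)
host_event = parse_event(
    "Host=hostname.cern.ch\n"
    "Target type=Host\n"
    "Target name=hostname.cern.ch\n"
    "Message=Host Down - Detected by Partner Agent"
)
print(classify(agent_event), "|", classify(host_event))
```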

Repository
Out of the box checks
based on OMS size

Change the schedule for Repository Jobs

Repository Metrics

Page Performance Analysis

Advantages of new EM12c R4 self-monitoring features

Quickly spot infrastructure problems


Fast host down detection - partner agent
New agent health sub statuses
Change schedule for repository jobs from UI
Performance diagnosis of UI pages
Detailed diagnosis of different sub-systems
Repository recommendations and checks
