Oracle Exadata Database - Boas Praticas - Document 1067527.1

Uploaded by

joao henrique

29/10/2019 Document 1067527.

Copyright (c) 2019, Oracle. All rights reserved. Oracle Confidential.

Oracle Exadata Database Machine Setup/Configuration Best Practices (Doc ID 1274318.1)

APPLIES TO:

Oracle Database Backup Service - Version N/A and later
Oracle Exadata Express Cloud Service - Version N/A and later
Oracle Exadata Storage Server Software - Version 11.2.1.2.0 to 12.2.1.1.2 [Release 11.2 to 12.2]
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Cloud Schema Service - Version N/A and later
Information in this document applies to any platform.

Document Details:
Type: BULLETIN
Status: PUBLISHED
Last Major Update: 04-Aug-2018
Last Update: 18-Sep-2019
Language: English

PURPOSE

The purpose of this document is to present the best practices for deploying the Sun Oracle Database Machine V2 / X2-2 / X2-8 / X3-2 / X3-8 / X4-2 / X4-8 / X5-2 in the area of setup and configuration.

SCOPE

This document is intended for those working with Oracle Exadata X2-2 / X2-8 / X3-2 / X3-8 / X4-2 / X4-8 / X5-2 / X6-2 / X6-8 / X7-2 / X-8.

DETAILS

Primary and standby databases should NOT reside on the same IB Fabric
Use hostname and domain name in lower case
Verify ILOM Power Up Configuration
Verify Hardware and Firmware on Database and Storage Servers
Verify InfiniBand Cable Connection Quality
Verify Ethernet Cable Connection Quality
Verify InfiniBand Fabric Topology (verify-topology)
Verify key InfiniBand fabric error counters are not present
Verify InfiniBand switch software version is 1.3.3-2 or higher
Verify InfiniBand subnet manager is running on an InfiniBand switch
Disable Infiniband subnet manager service where subnet manager master should never run
Verify key parameters in the InfiniBand switch /etc/opensm/opensm.conf file
Verify There Are No Memory (ECC) Errors
Verify celldisk configuration on disk drives
Verify celldisk configuration on flash memory devices
Verify there are no griddisks configured on flash memory devices
Verify griddisk count matches across all storage servers where a given prefix name exists
Verify griddisk ASM status
Verify that griddisks are distributed as expected across celldisks
Verify the percent of available celldisk space used by the griddisks
Verify Database Server ZFS RAID Configuration
Verify InfiniBand is the Private Network for Oracle Clusterware Communication
Verify InfiniBand Address Resolution Protocol (ARP) Configuration on Database Servers
Verify Oracle RAC Databases use RDS Protocol over InfiniBand Network.
Verify Database and ASM instances use same SPFILE
Verify Berkeley Database location for Cloned GI homes
Configure Storage Server alerts to be sent via email
Configure NTP and Timezone on the InfiniBand switches
Configure NTP slew_always settings as SMF property for Solaris
Verify NUMA Configuration
Enable Xeon Turbo Boost
Verify Exadata Smart Flash Log is Created
Verify Exadata Smart Flash Cache is Created
Verify Exadata Smart Flash Cache status is "normal"
Verify Master (Rack) Serial Number is Set
Verify Management Network Interface (eth0) is on a Separate Subnet
Verify RAID disk controller CacheVault capacitor condition
Verify RAID Disk Controller Battery Condition
Verify Ambient Air Temperature
Verify operating system hugepages count satisfies total SGA requirements
Verify MaxStartups 100 in /etc/ssh/sshd_config on all database servers
Verify all datafiles have AUTOEXTEND attribute ON
Verify all BIGFILE tablespaces have non-default MAXBYTES values set
Ensure Temporary Tablespace is correctly defined
Enable portmap service if app requires it
Enable proper services on database nodes to use NFS
Be Careful when Combining the InfiniBand Network across Clusters and Database Machines
Set fast_start_mttr_target=300 to optimize run time performance of writes
Enable auditd on database servers
Verify AUD$ and FGA_LOG$ tables use Automatic Segment Space Management
Use dbca templates provided for current best practices
Updating database node OEL packages to match the cell
Disable cell level flash caching for grid disks that don't need it when using Write Back Flash Cache
Gather system statistics in Exadata mode if needed
Verify Hidden Database Initialization Parameter Usage
Verify BDB location for Cloned GI homes
Verify Shared Servers do not perform serial full table scans
Verify Write Back Flash Cache minimum version requirements
Verify bundle patch version installed matches bundle patch version registered in database
Verify database server file systems have "Maximum mount count" = "-1"
Verify database server file system have "Check interval" = "0"
Verify Automated Service Request (ASR) configuration
Verify ZFS File System User and Group Quotas are configured
Verify the file /.updfrm_exact does not exist
Verify the vm.min_free_kbytes configuration
Validate key sysctl.conf parameters on database servers
Remove "fix_control=32" from dbfs mount options
Set Linux kernel log buffer size to 1MB
Verify IP routing configuration on DB nodes
Set SQLNET.EXPIRE_TIME=10 in DB Home
Verify there are no .fuse_hidden files under the dbfs mount
Verify that the SDP over IB option "sdp_apm_enable(d)" is set to "0"
Verify /etc/oratab
Verify consistent software and configuration across nodes
Verify all database and storage servers time server configuration
Verify Sar files have read permissions for non-root user
Verify that the patch for bug 16618055 is applied
Verify the Name Service Cache Daemon (NSCD) is Running
Verify kernels and initrd in /boot/grub/grub.conf are available on the system
Verify basic Logical Volume(LVM) system devices configuration
Ensure db_unique_name is unique across the enterprise
Verify average ping times to DNS nameserver
Verify Running-config and Startup-config are the same on the Cisco switch
Validate SSH is installed and configured on Cisco management switch
Verify Database Memory Allocation is not Greater than Physical Memory Installed on Database node
Verify Cluster Verification Utility(CVU) Output Directory Contents Consume < 500MB of Disk Space
Verify active system values match those defined in configuration file "cell.conf"
Verify that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED"
Verify TCP Segmentation Offload (TSO) is set to off
Check alerthistory for stateful alerts not cleared
Check alerthistory for non-test open stateless alerts
Verify clusterware state is "Normal"
Verify the grid Infrastructure management database (MGMTDB) does not use hugepages
Verify the "localhost" alias is pingable
Verify bundle patch version installed matches bundle patch version registered in database
Verify database is not in DST upgrade state
Verify there are no failed diskgroup rebalance operations
Verify the CRS_HOME is properly locked
Verify storage server data (non-system) disks have no partitions
Verify db_unique_name is used in I/O Resource Management (IORM) interdatabase plans
Verify Datafiles are Placed on Diskgroups consisting of griddisks with cachingPolicy = DEFAULT
Verify all datafiles are placed on griddisks that are cached on flash disks
Validate key sysctl.conf parameters on database servers
Detect duplicate files in /etc/*init* directories
Verify Database Server Quorum Disks configuration
Verify Oracle Clusterware files are placed appropriately
Verify "_reconnect_to_cell_attempts=9" on database servers which access X6 storage servers
Verify passwordless SSH connectivity for Enterpise Manager (EM) agent owner userid to target component userids
Check /EXAVMIMAGES on dom0s for possible over allocation by sparse files
Verify active kernel version matches expected version for installed Exadata Image
Verify Storage Server user "CELLDIAG" exists
Verify installed rpm(s) kernel type match the active kernel version
Verify Flex ASM Cardinality is set to "ALL"
Verify "downdelay" is set correctly for bonded client interfaces
Verify ExaWatcher is executing
Verify non-Default services are created for all Pluggable Databases
Verify Grid Infrastructure Management Database (MGMTDB) configuration
Verify Automatic Storage Management Cluster File System (ACFS) file systems do not contain critical database files
Verify the ownership and permissions of the "oradism" file
Verify the SYSTEM, SYSAUX, USERS and TEMP tablespaces are of type bigfile
Verify the storage servers in use configuration matches across the cluster
Verify "asm_power_limit" is greater than zero
Verify the recommended patches for Adaptive features are installed
Verify initialization parameter cluster_database_instances is at the default value
Verify the database server NVME device configuration
Verify that Automatic Storage Management Cluster File System (ACFS) uses 4K metadata block size
Evaluate Automated Maintenance Tasks configuration
Verify proper ACFS drivers are installed for Spectre v2 mitigation
Verify Exafusion Memory Lock Configuration
Verify there are no unhealthy InfiniBand switch sensors
Refer to MOS 1682501.1 if non-Exadata components are in use on the InfiniBand fabric
Verify the ib_sdp module is not loaded into the kernel
Verify all voting disks are online
Verify available ksplice fixes are installed
Archived Best Practices
Revision History

Primary and standby databases should NOT reside on the same IB Fabric

Priority   Added   Machine Type                   OS Type   Exadata Version   Oracle Version
Critical   N/A     X2-2(4170), X2-2, X2-8, X4-2   Linux     11.2.x +          11.2.x +

Benefit / Impact:

To properly protect the primary databases residing on the "primary" Exadata Database Machine, the physical standby database requires fault isolation from IB switch maintenance issues, IB switch failures and software issues, RDS bugs and timeouts, or any issue resulting from a complete IB fabric failure. To protect the standby from these failures that impact the primary's availability, we highly recommend that at least one viable standby database resides on a separate IB fabric.

Risk:

If the primary and standby reside on the same IB fabric, both primary and standby systems can become unavailable due to a bug that causes an IB fabric failure.

Action / Repair:

The primary and at least one viable standby database must not reside on the same inter-racked Exadata Database Machine. The communication between the primary and standby Exadata Database Machines must use GigE or 10GigE. The trade-off is lower network bandwidth. Higher network bandwidth is desirable for standby database instantiation (which should only need to be done the first time), but that requirement is eliminated for post-failover operations when flashback database is enabled.

Use hostname and domain name in lower case

Priority   Added   Machine Type                   OS Type          Exadata Version   Oracle Version
Critical   N/A     X2-2(4170), X2-2, X2-8, X4-2   Linux, Solaris   11.2.x +          11.2.x +

Benefit / Impact:

Using lowercase will avoid known deployment time issues.

Risk:

OneCommand deployment will fail in step 16 if this is not done. This will abort the installation with:

"ERROR: unable to locate file to check for string 'Configure Oracle Grid Infrastructure for a Cluster ... succeeded' #Step 16#"

Action / Repair:

As a best practice, use lower case for hostnames and domain names.
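
The lower case requirement can be checked before deployment starts. The following is a minimal sketch, not part of OneCommand; the function names `is_lower_case_name` and `check_host_name` are hypothetical and exist only for illustration.

```shell
# Sketch: report whether a host or domain name is entirely lower case.
# Both function names are hypothetical; they are not Oracle-supplied tools.
is_lower_case_name() {
    local name="$1" lowered
    # Lower-case the name and compare it with the original.
    lowered=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
    [ "$name" = "$lowered" ]
}

check_host_name() {
    if is_lower_case_name "$1"; then
        echo "SUCCESS: $1 is lower case"
    else
        echo "FAILURE: $1 contains upper case characters"
    fi
}
```

On Linux, running check_host_name "$(hostname -f)" reports on the fully qualified name of the local server.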

Verify ILOM Power Up Configuration

Priority   Alert Level   Date       Owner    Status       Scope          Bug(s)
Critical   FAIL          11/11/12   <Name>   Production   Exadata, SSC   14281920-exachk

DB Version   DB Role   Engineered System                          Exadata Version   OS & Version                        Validation Tool Version   TBD
N/A          N/A       X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2   11.2.2.2.0+       Solaris - 11, Linux x86-64 UEK5.8   exachk 2.2.2

Benefit / Impact:

Verifying the ILOM power up configuration helps to ensure that servers are booted up as quickly as possible after a power interruption.

Risk:

Not verifying the ILOM power up configuration may result in unexpected server boot behavior after a power interruption.

Action / Repair:

To verify the ILOM power up configuration, as the root userid enter the following command on each database and storage
server:

if [ -x /usr/bin/ipmitool ]
then
#Linux
ipmitool sunoem cli force "show /SP/policy" | grep -i power
else
#Solaris
/opt/ipmitool/bin/ipmitool sunoem cli force "show /SP/policy" | grep -i power
fi;

The expected output depends on the Exadata software version. For Exadata software version 11.2.3.2.1 or higher:

HOST_AUTO_POWER_ON=disabled
HOST_LAST_POWER_STATE=enabled

Exadata software version 11.2.3.2.0 or lower:

HOST_AUTO_POWER_ON=enabled
HOST_LAST_POWER_STATE=disabled

If the output is not as expected, as the root userid use the ipmitool "set /SP/policy" command. For example:

# ipmitool sunoem cli force "set /SP/policy HOST_AUTO_POWER_ON=enabled"

Connected. Use ^D to exit.
-> set /SP/policy HOST_AUTO_POWER_ON=enabled
Set 'HOST_AUTO_POWER_ON' to 'enabled'
-> Session closed
Disconnected
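
The version-dependent comparison above can be scripted. The following is a sketch under stated assumptions: `check_ilom_policy` is a hypothetical helper, the Exadata software version is supplied by the operator as the first argument, and the policy lines are assumed to appear as NAME=value with optional spaces around the equals sign. It relies on GNU `sort -V` for the version comparison.

```shell
# Sketch: compare "show /SP/policy" output with the values listed above.
# check_ilom_policy is a hypothetical helper, not an Oracle-supplied tool.
# $1 = Exadata software version, stdin = ipmitool policy output.
check_ilom_policy() {
    local version="$1" auto last lowest policy
    # sort -V (GNU coreutils) orders version strings; the lowest comes first.
    lowest=$(printf '%s\n' "$version" "11.2.3.2.1" | sort -V | head -n 1)
    if [ "$lowest" = "11.2.3.2.1" ]; then
        auto=disabled; last=enabled      # 11.2.3.2.1 or higher
    else
        auto=enabled; last=disabled      # 11.2.3.2.0 or lower
    fi
    policy=$(cat)
    # Assumption: spaces around "=" are optional in the policy output.
    if echo "$policy" | grep -q "HOST_AUTO_POWER_ON *= *$auto" &&
       echo "$policy" | grep -q "HOST_LAST_POWER_STATE *= *$last"; then
        echo "SUCCESS: ILOM power up configuration is as expected"
    else
        echo "FAILURE: expected HOST_AUTO_POWER_ON=$auto and HOST_LAST_POWER_STATE=$last"
    fi
}
```

The output of the ipmitool command shown earlier can be piped into the function, for example with 12.1.2.1.0 as the version argument.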

Verify Hardware and Firmware on Database and Storage Servers

Priority   Added   Machine Type                   OS Type   Exadata Version   Oracle Version
Critical   N/A     X2-2(4170), X2-2, X2-8, X4-2   Linux     11.2.x +          11.2.x +

Benefit / Impact:

The Oracle Exadata Database Machine is tightly integrated, and verifying the hardware and firmware before the Oracle Exadata Database Machine is placed into or returned to production status can avoid problems related to hardware or firmware modifications.

The impact for these verification steps is minimal.

Risk:

If the hardware and firmware are not validated, inconsistencies between database and storage servers can lead to problems and
outages.

Action / Repair:

To verify the hardware and firmware configuration for a database server, execute the following command as the "root" userid:

/opt/oracle.SupportTools/CheckHWnFWProfile

The output will contain a line similar to the following:

[SUCCESS] The hardware and firmware profile matches one of the supported profile

If any result other than "SUCCESS" is returned, investigate and correct the condition.

To verify the hardware and firmware configuration for a storage server, execute the following "cellcli" command as the
"cellmonitor" userid:

CellCLI> alter cell validate configuration

The output will be similar to:

Cell <cell> successfully altered

If any result other than "successfully altered" is returned, investigate and correct the condition.

NOTE: CheckHWnFWProfile is also executed at each boot of a database server.

NOTE: "alter cell validate configuration" is also executed once a day on a storage server by the MS process and
the result is written into the storage server alert history.

Verify InfiniBand Cable Connection Quality

Priority   Added   Machine Type                   OS Type   Exadata Version   Oracle Version
Critical   N/A     X2-2(4170), X2-2, X2-8, X4-2   Linux     11.2.x +          11.2.x +

Benefit / Impact:

InfiniBand cables require proper connections for optimal efficiency. Verifying the InfiniBand cable connection quality helps to
ensure that the InfiniBand network operates at optimal efficiency.

Verifying InfiniBand cable connection quality has minimal impact.

Risk:

InfiniBand cables that are not properly connected may negotiate to a lower speed, work intermittently, or fail.

Action / Repair:

Linux

Execute the following command as the "root" userid on all database and storage servers:

for ib_cable in `ls /sys/class/net | grep ^ib`; do printf "$ib_cable: "; cat /sys/class/net/$ib_cable/carrier; done

The output should look similar to:

ib0: 1
ib1: 1

If anything other than "1" is reported, investigate that cable connection.

Solaris

Execute the following command as the "root" userid on all database servers:

dladm show-ib | grep -v LINK | sed -e 's/  */ /g' -e 's/^ *//' | awk '{print $1":", $5}' | sort

The output should be similar to:

ib0: up
ib1: up

If anything other than "up" is reported, investigate that cable connection.

NOTE: Storage servers should report 2 connections. X2-2(4170) and X2-2 database servers should report 2
connections. X2-8 database servers should report 8 connections.
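
The expected connection counts in the note can be checked automatically. This is a minimal sketch: `count_ib_links` is a hypothetical helper that reads lines such as "ib0: 1" or "ib0: up" on standard input and compares the number of healthy links with an expected count passed as the first argument.

```shell
# Sketch: count InfiniBand links that report "1" (Linux) or "up" (Solaris)
# and compare with the number expected for the server type.
# count_ib_links is a hypothetical helper; $1 = expected count,
# stdin = lines such as "ib0: 1" produced by the commands above.
count_ib_links() {
    local expected="$1" up
    up=$(awk '$2 == "1" || $2 == "up" { n++ } END { print n + 0 }')
    if [ "$up" -eq "$expected" ]; then
        echo "SUCCESS: $up of $expected InfiniBand links are up"
    else
        echo "FAILURE: $up of $expected InfiniBand links are up"
    fi
}
```

On a storage server or an X2-2 database server the expected count is 2; on an X2-8 database server it is 8.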

Verify Ethernet Cable Connection Quality

Priority   Added   Machine Type                   OS Type   Exadata Version   Oracle Version
Critical   N/A     X2-2(4170), X2-2, X2-8, X4-2   Linux     11.2.x +          11.2.x +

Benefit / Impact:

Ethernet cables require proper connections for optimal efficiency. Verifying the Ethernet cable connection quality helps to ensure
that the Ethernet network operates at optimal efficiency.

Verifying Ethernet cable connection quality has minimal impact.

Risk:

Ethernet cables that are not properly connected may negotiate to a lower speed, work intermittently, or fail.

Action / Repair:

Execute the following command as the root userid on all database and storage servers:

for cable in `ls /sys/class/net | grep ^eth`; do printf "$cable: "; cat /sys/class/net/$cable/carrier; done

The output should look similar to:

eth0: 1
eth1: cat: /sys/class/net/eth1/carrier: Invalid argument
eth2: cat: /sys/class/net/eth2/carrier: Invalid argument
eth3: cat: /sys/class/net/eth3/carrier: Invalid argument
eth4: 1
eth5: 1

"Invalid argument" usually indicates the device has not been configured and is not in use. If a device reports "0", investigate
that cable connection.

NOTE: Within machine types, the output of this command will vary by customer depending on how the customer
chooses to configure the available ethernet cards.
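
The three cases described above (link up, no carrier, device not configured) can be told apart explicitly. The following sketch assumes the Linux sysfs layout used by the command above; `report_eth_carrier` is a hypothetical helper, and its optional directory argument exists only so the function can be exercised against a test directory.

```shell
# Sketch: classify each eth* interface by its carrier file.
# report_eth_carrier is a hypothetical helper; $1 is the sysfs directory
# (default /sys/class/net).
report_eth_carrier() {
    local root="${1:-/sys/class/net}" dev state
    for dev in "$root"/eth*; do
        [ -e "$dev" ] || continue
        # Reading carrier fails (for example "Invalid argument") when the
        # device has not been configured.
        if state=$(cat "$dev/carrier" 2>/dev/null); then
            case "$state" in
                1) echo "${dev##*/}: connected" ;;
                *) echo "${dev##*/}: NO CARRIER, investigate cable" ;;
            esac
        else
            echo "${dev##*/}: not configured"
        fi
    done
}
```

Called without arguments as the root userid, the function reports on the interfaces of the local server.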

Verify the InfiniBand Fabric Topology (verify-topology)

Priority   Alert Level   Date       Owner    Status       Engineered System                                        Engineered System Platform   Bug(s)
Critical   WARN          09/05/18   <Name>   Production   Exadata - Physical, Exadata - Management Domain, RA      ALL                          20144798 - exachk

DB Version   DB Type   DB Role   DB Mode   Exadata Version   OS & Version   Validation Tool Version   MAA Scorecard Section
N/A          N/A       N/A       N/A       ALL               Linux          exachk 18.4.0             N/A

Benefit / Impact:

Verifying that the InfiniBand network is configured with the correct topology for an Oracle Exadata Database Machine helps to
ensure that the InfiniBand network operates at maximum efficiency.

Risk:

An incorrect InfiniBand topology will cause the InfiniBand network to operate at degraded efficiency, intermittently, or fail to
operate.

Action / Repair:

To verify the InfiniBand Fabric Topology, execute the following code set as the "root" userid on one database server in the
Exadata environment:

unset VT_ERRORS
unset VT_WARNINGS
VT_OUTPUT=$(/opt/oracle.SupportTools/ibdiagtools/verify-topology)
VT_WARNINGS=$(echo "$VT_OUTPUT" | egrep WARNING)
VT_ERRORS=$(echo "$VT_OUTPUT" | egrep ERROR)
if [ -n "$VT_ERRORS" ]
then
echo -e "FAILURE: verify-topology returned one or more errors (and perhaps warnings).\nDETAILS:\n$VT_OUTPUT"
elif [ -n "$VT_WARNINGS" ]
then
echo -e "WARNING: verify-topology returned one or more warnings.\nDETAILS:\n$VT_OUTPUT"
else
echo -e "SUCCESS: verify-topology returned no errors or warnings."
fi

The expected output is:

SUCCESS: verify-topology returned no errors or warnings.

An example of a "FAILURE:" message:

FAILURE: verify-topology returned one or more errors (and perhaps warnings).

DETAILS:
[ DB Machine Infiniband Cabling Topology Verification Tool. ]
Every node is connected to two leaf switches in a single rack....................................................
[ERROR]
Node randomcel06 (Guid: 21280001f00464 ) is connected to just one leaf switch randomsw-ib2(Guid: 2128f57723a0a0 )
Error found in following rack
<output truncated>

An example of a "WARNING:" message:

WARNING: verify-topology returned one or more warnings.

DETAILS:

[ DB Machine Infiniband Cabling Topology Verification Tool ]

[Version IBD VER 2.b ]

[WARNING] - Non-Exadata nodes detected! Please ensure this is OK

Approximating classification into cells and db hosts

Software UPGRADE required for the tool to be accurate

Looking at 1 rack(s).....
<output truncated>

If anything other than "SUCCESS:" is reported, investigate and correct the underlying fault(s).

Verify key InfiniBand fabric error counters are not present

Priority   Alert Level   Date       Owner    Status       Engineered System                                            Bug(s)
Critical   WARN          09/28/16   <Name>   Production   Exadata-Management Domain, Exadata-Physical, SSC, Exalogic

DB Version   DB Role   Engineered System Platform                                             Exadata Version   OS & Version                 Validation Tool Version   TBD
N/A          N/A       X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X6-2, X6-8       11.2.x+           Linux x86-64, Solaris - 11   exachk 12.1.0.2.8

Benefit / Impact:

Verifying key InfiniBand fabric error counters are not present helps to maintain the InfiniBand fabric at peak efficiency.

The impact of verifying key InfiniBand fabric error counters are not present is minimal. The impact of correcting key InfiniBand
fabric error counters varies depending upon the root cause of the specific error counter present, and cannot be estimated here.

Risk:

If key InfiniBand fabric error counters are present, the fabric may be running in degraded condition or lack redundancy.

NOTE: Uncorrected symbol errors increase the risk of node evictions and application outages.

Action / Repair:

To verify key InfiniBand fabric error counters are not present, execute the following command set as the "root" userid on one
database server:

NOTE: This will not work in the user domain of a virtualized environment.

if [[ -d /proc/xen && ! -f /proc/xen/capabilities ]]
then
echo -e "\nThis check will not run in a user domain of a virtualized environment. Execute this check in the management domain.\n"
else
RAW_DATA=$(ibqueryerrors | egrep 'SymbolError|LinkDowned|RcvErrors|RcvRemotePhys|LinkIntegrityErrors');
CRITICAL_DATA=$(echo "$RAW_DATA" | egrep 'SymbolError|RcvErrors');
WARNING_DATA=$(echo "$RAW_DATA" | egrep -v 'SymbolError|RcvErrors');
if [ -z "$RAW_DATA" ]
then
echo -e "SUCCESS: Key InfiniBand fabric error counters were not found"
else
# Symbol errors and receive errors are treated as critical; any other key counter is a warning.
if [ $(echo "$RAW_DATA" | egrep 'SymbolError|RcvErrors' | wc -l) -gt 0 ]
then
echo -e "FAILURE: receive errors or symbol errors or both were found:\n\nCounters found:\n"
echo -e "CRITICAL DATA:\n\n$CRITICAL_DATA\n\n\nWARNING DATA:\n\n$WARNING_DATA"
else
echo -e "WARNING: Key InfiniBand fabric error counters were found\n\nCounters Found:\n";
echo -e "CRITICAL DATA:\n\n$CRITICAL_DATA\n\n\nWARNING DATA:\n\n$WARNING_DATA"
fi;
fi;
fi;

The expected output should be:

SUCCESS: Key InfiniBand fabric error counters were not found

- OR -

This check will not run in a user domain of a virtualized environment. Execute this check in the management domain.

Example of a FAILURE result:

FAILURE: receive errors or symbol errors or both were found:

Counters found:

CRITICAL DATA:

GUID 0x10e00001451161 port 1: [SymbolErrorCounter == 1367] [PortRcvErrors == 1367]

GUID 0x10e08027b8a0a0 port ALL: [SymbolErrorCounter == 54679] [LinkErrorRecoveryCounter == 76]
<output truncated>

WARNING DATA:

GUID 0x21280001fca219 port 1: [LinkDownedCounter == 1]

GUID 0x21280001fca21a port 2: [LinkDownedCounter == 1]
<output truncated>

Example of a WARNING result:

WARNING: Key InfiniBand fabric error counters were found

CRITICAL DATA:

WARNING DATA:

GUID 0x10e00001886289 port 1: [LinkDownedCounter == 1] [PortXmitDiscards == 272] [PortXmitWait == 2021116]

GUID 0x10e0802617a0a0 port ALL: [LinkErrorRecoveryCounter == 63]
GUID 0x10e0802617a0a0 port 1: [LinkErrorRecoveryCounter == 10]
GUID 0x10e0802617a0a0 port 2: [LinkErrorRecoveryCounter == 11]
<output truncated>

In general, if the output is not "SUCCESS...", follow the diagnostic guidance in the following documents:

InfiniBand Network Troubleshooting Guidelines and Methodologies.

"Gathering Troubleshooting Information for the Infiniband Network in Engineered Systems (Doc ID 1538237.1)".
The "Exadata InfiniBand Issues" section of "Exadata Diagnostic Collection Guide (Doc ID 1353073.2)".

Special Notes on Symbol errors:

Symbol errors create a much higher risk of node evictions if the error rate is too high. On the InfiniBand switches, there is a mechanism that will automatically down a port if the error rate becomes too high. On the database and storage servers, there is no such mechanism at this time, so it is recommended to examine the Symbol error rate manually, using ExaWatcher data.

NOTE: In the following example, all data pertaining to InfiniBand switches has been filtered out for brevity.

As the "root" userid, the following example demonstrates how to examine the Symbol error rate using ExaWatcher.

1) From the manual output, make note of the GUIDs with SymbolErrorCounter present:

FAILURE: receive errors or symbol errors or both were found:

Counters found:

GUID 0x10e00001451161 port 1: [SymbolErrorCounter == 1123] [PortRcvErrors == 1123] [PortXmitWait == 230121020]

2) Use the following command to identify the server with the symbol errors present:

[root@randomadm01 ~]# ibqueryerrors -G 0x10e00001451161 | head -1
Errors for "randomadm02 S 192.168.8.16,192.168.8.17 HCA-4"

3) Log onto the database server identified in the command above, randomadm02.

4) Change to the ExaWatcher directory for IB hca information (the default is in use here):

# cd /opt/oracle.ExaWatcher/archive/IBCardInfo.ExaWatcher

5) Using the port identification provided in 1), use the following command to condense all relevant available ExaWatcher data (it removes "0" entries):

[root@randomadm02 IBCardInfo.ExaWatcher]# cat <(bzcat *.bz2) <(cat *.dat) | egrep "port 1" -A23 | egrep SymbolError | grep -v '0[[:blank:][:cntrl:]]*$' | sort -k1.2,10 -k2.1,8

[09/13/2016 02:38:18] SymbolErrorCounter 999 1

[09/13/2016 02:43:20] SymbolErrorCounter 1030 31

[09/13/2016 17:28:56] SymbolErrorCounter 1062 1

[09/13/2016 17:39:00] SymbolErrorCounter 1085 23

[09/13/2016 17:59:10] SymbolErrorCounter 1100 5

6) Calculate the symbol error rate per minute. By default, ExaWatcher data intervals are 5 minutes, but that can be changed.
Using these two lines:

[09/13/2016 17:28:56] SymbolErrorCounter 1062 1

[09/13/2016 17:39:00] SymbolErrorCounter 1085 23

The delta between 17:28 and 17:39 is "23". The time interval is 10 minutes, so 23 / 10 is 2.3 symbol errors per minute.

NOTE ESPECIALLY!! If the symbol error rate is consistently greater than 2 per minute, investigate for root cause
and take corrective action!
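
The rate calculation in step 6 can be sketched as a small function. `symbol_error_rate` is a hypothetical helper, and it assumes both samples were taken on the same day and follow the line format shown in step 5. It divides the change in the cumulative counter by the exact elapsed time, so for the two sample lines it uses 604 seconds and returns a value slightly below the rounded 2.3 computed above.

```shell
# Sketch: symbol errors per minute between two ExaWatcher samples.
# symbol_error_rate is a hypothetical helper. It assumes both samples were
# taken on the same day and have the form
#   [MM/DD/YYYY HH:MM:SS] SymbolErrorCounter <total> <delta>
symbol_error_rate() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            gsub(/\[|\]/, "")               # drop the brackets around the timestamp
            split($2, t, ":")               # $2 is now HH:MM:SS
            secs[NR]  = t[1] * 3600 + t[2] * 60 + t[3]
            total[NR] = $4                  # cumulative counter value
        }
        END {
            minutes = (secs[2] - secs[1]) / 60
            if (minutes <= 0) { print "invalid interval"; exit 1 }
            printf "%.2f\n", (total[2] - total[1]) / minutes
        }'
}
```

A result that stays above 2 across consecutive intervals corresponds to the threshold in the note above.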

NOTE: The InfiniBand fabric error counters should be cleared and validated after any maintenance activity.

NOTE: The InfiniBand fabric error counters are cumulative and the errors may have occurred at any time in the past. This check
is the result at one point in time, and cannot advise anything about history or an error rate.

NOTE: This check should not be considered complete validation of the InfiniBand fabric. Even if this check indicates success, there may still be issues on the InfiniBand fabric caused by other, rarer InfiniBand fabric error counters being present. If there are or appear to be issues with the InfiniBand fabric while this check passes, perform a full evaluation of the "ibqueryerrors" command output and the output of other commands such as "ibdiagnet".

NOTE: Depending upon the Exadata version, the key InfiniBand fabric error counters have different names. In the following
list, the older version of the counter name is shown in square brackets.

Key InfiniBand fabric error counters list:

SymbolErrorCounter [SymbolErrors]
LinkErrorRecoveryCounter [LinkRecovers]
LinkDownedCounter [LinkDowned]
PortRcvErrors [RcvErrors]
PortRcvRemotePhysicalErrors [RcvRemotePhysErrors]
LocalLinkIntegrityErrors [LinkIntegrityErrors]
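
When output from different Exadata versions has to be compared, the older names can be translated to the current ones. The following is a sketch based only on the list above; `normalize_counter_name` is a hypothetical helper.

```shell
# Sketch: map a legacy counter name to its current name, per the list above.
# normalize_counter_name is a hypothetical helper for illustration only.
normalize_counter_name() {
    case "$1" in
        SymbolErrors)        echo "SymbolErrorCounter" ;;
        LinkRecovers)        echo "LinkErrorRecoveryCounter" ;;
        LinkDowned)          echo "LinkDownedCounter" ;;
        RcvErrors)           echo "PortRcvErrors" ;;
        RcvRemotePhysErrors) echo "PortRcvRemotePhysicalErrors" ;;
        LinkIntegrityErrors) echo "LocalLinkIntegrityErrors" ;;
        *)                   echo "$1" ;;   # already current, or not a key counter
    esac
}
```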

NOTE: Some InfiniBand fabric error counters (for example, "SymbolErrorCounter [SymbolErrors]", "PortRcvErrors [RcvErrors]") can increment when nodes are rebooted. Small values for these InfiniBand fabric error counters which are less than the "LinkDownedCounter [LinkDowned]" counters are generally not a problem. The "LinkDownedCounter [LinkDowned]" counters indicate the number of times the port has gone down (usually for valid reasons, such as a node reboot) and are not typically an error indicator by themselves.

NOTE: Links reporting high, persistent error rates (especially "SymbolErrorCounter [SymbolErrors]", "LinkErrorRecoveryCounter
[LinkRecovers]", "PortRcvErrors [RcvErrors]", "LocalLinkIntegrityErrors [LinkIntegrityErrors]") often indicate a bad or loose cable
or port issues.

Verify InfiniBand switch software version is 1.3.3-2 or higher

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 11/01/11 X2-2(4170), X2-2, X2-8, X4-2 Linux, Solaris 11.2.x + 11.2.x +

Benefit / Impact:

The impact of verifying that the InfiniBand switch software is at version 1.3.3-2 or higher is minimal. The impact of upgrading
the InfiniBand switch(es) to 1.3.3-2 varies depending upon the upgrade method chosen and your current InfiniBand switch
software level.

Risk:

InfiniBand switch software version 1.3.3-2 fixes several potential InfiniBand fabric stability issues. Remaining on an InfiniBand
switch software version below 1.3.3-2 raises the risk of experiencing a potential outage.

Action / Repair:

To verify the InfiniBand switch software version, log onto the InfiniBand switch and execute the following command as the
"root" userid:

version | head -1 | cut -d" " -f5

The output should be similar to:

1.3.3-2

If the output is not 1.3.3-2 or higher, upgrade the InfiniBand switch software to at least version 1.3.3-2.

NOTE: Patch 12373676 provides InfiniBand software version 1.3.3-2 and instructions.

NOTE: Upgrading to 1.3.3-2 may be performed as a rolling upgrade without an outage. The InfiniBand switch
software is not dependent upon any other components in the Oracle Exadata Database Machine
and may be upgraded at any time.

NOTE: If your InfiniBand switch is at software version 1.0.1-1, it will need to first be upgraded to 1.1.3-1 or
1.1.3-2 before it can be upgraded to 1.3.3-2. The InfiniBand switch software cannot be upgraded
directly from 1.0.1-1 to 1.3.3-2.
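The "1.3.3-2 or higher" comparison can be sketched as follows. This is a minimal sketch, not part of the Oracle procedure: the function name "version_at_least" is hypothetical, and it assumes a host with GNU coreutils, because it relies on "sort -V". The switch itself may not provide that option, so the version string may need to be captured on the switch and compared elsewhere.

```shell
#!/bin/bash
# Hypothetical helper: succeed when version $1 is at least version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_at_least() {
    local current=$1
    local minimum=$2
    local lowest
    # The lower of the two versions sorts first; if that is the minimum,
    # the current version is equal to or higher than the minimum.
    lowest=$(printf '%s\n' "$current" "$minimum" | sort -V | head -n 1)
    [ "$lowest" = "$minimum" ]
}
```

For example, "version_at_least 1.1.3-2 1.3.3-2" returns a non-zero status, which indicates that an upgrade is needed.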

Verify the Master Subnet Manager is running on an InfiniBand switch

Priority  Alert Level  Date      Owner   Status       Engineered System                                Engineered System Platform  Bug(s)
Critical  FAIL         11/28/18  <Name>  Development  Exadata - Physical, Exadata - Management Domain  ALL                         28862740 - exachk

DB/GI Version  DB Type  DB Role  DB Mode  Exadata Version  OS & Version  Validation Tool Version  MAA Scorecard Section
N/A            N/A      N/A      N/A      ALL              Linux         exachk 18.4.0            N/A

Benefit / Impact:

Having the Master Subnet Manager reside in the correct location improves the stability, availability and performance of the
InfiniBand fabric. The impact of verifying the Master Subnet Manager is running on an InfiniBand switch is minimal. The impact
of moving the Master Subnet Manager varies depending upon where it is currently executing and to where it will be relocated.

Risk:

If the Master Subnet Manager is not running on an InfiniBand switch, the InfiniBand fabric may crash during certain fabric
management transitions.

Action / Repair:

To verify the Master Subnet Manager is located on an InfiniBand switch, execute the following command set as the "root" userid
on a database server:

SUBNET_MGR_MSTR_OUTPUT=$(sminfo)
IBSWITCHES_OUTPUT=$(ibswitches)
SUBNET_MGR_MSTR_GID=$(echo "$SUBNET_MGR_MSTR_OUTPUT" | cut -d" " -f7 | cut -c3-16)
SUBNET_MGR_MSTR_LOC_RESULT=1
for IB_NODE_GID in $(echo "$IBSWITCHES_OUTPUT" | cut -c14-27)
do
    if [ "$SUBNET_MGR_MSTR_GID" = "$IB_NODE_GID" ]
    then
        SUBNET_MGR_MSTR_LOC_RESULT=0
        SUBNET_MGR_MSTR_LOC_SWITCH=$(echo "$IBSWITCHES_OUTPUT" | grep "$IB_NODE_GID")
    fi
done
if [ $SUBNET_MGR_MSTR_LOC_RESULT -eq 0 ]
then
    echo -e "SUCCESS: the Master Subnet Manager is executing on InfiniBand switch:\n$(echo "$SUBNET_MGR_MSTR_LOC_SW
else
    echo -e "FAILURE: the Master Subnet Manager does not appear to be executing on an InfiniBand switch:\n$(echo "$
fi

The output should be similar to:

SUCCESS: the Master Subnet Manager is executing on InfiniBand switch:

Switch : 0x002128469b03a0a9 ports 36 "SUN DCS 36P QDR randomsw-iba0 <IP>" enhanced port 0 lid 1 lmc 0

Example of a "FAILURE" result:

FAILURE: the Master Subnet Manager does not appear to be executing on an InfiniBand switch:
sminfo: sm lid 3 sm guid 0x10e0cdce81a0a9, activity count 3362634 priority 8 state 3 SMINFO_MASTER

If the result is "FAILURE", investigate the guid provided, relocate the Master Subnet Manager to a correct
InfiniBand switch, and prevent the Subnet Manager from starting on the component where the Master Subnet
Manager was found to be executing.
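The GUID comparison at the heart of this check can be sketched as a standalone function. This is a minimal sketch under stated assumptions, not the Oracle command set: the function name "master_sm_on_switch" is hypothetical, it takes previously captured "sminfo" and "ibswitches" output as arguments, and it matches on the "sm guid 0x..." and "Switch : 0x..." patterns shown in the examples above instead of fixed character positions.

```shell
#!/bin/bash
# Hypothetical helper: report whether the master SM GUID found in "sminfo"
# output belongs to one of the switches listed in "ibswitches" output.
#   $1 = captured output of "sminfo"
#   $2 = captured output of "ibswitches"
# Returns 0 on a match, 1 on no match, 2 if no SM GUID could be parsed.
master_sm_on_switch() {
    local sminfo_output=$1
    local ibswitches_output=$2
    local sm_guid switch_guid
    # Take the hex digits after "sm guid 0x" and drop leading zeros so that
    # 0x0010e0... and 0x10e0... compare equal.
    sm_guid=$(printf '%s\n' "$sminfo_output" |
        sed -n 's/.*sm guid 0x0*\([0-9a-fA-F]*\).*/\1/p')
    [ -n "$sm_guid" ] || return 2
    for switch_guid in $(printf '%s\n' "$ibswitches_output" |
        sed -n 's/^Switch[[:space:]]*:[[:space:]]*0x0*\([0-9a-fA-F]*\).*/\1/p')
    do
        if [ "$sm_guid" = "$switch_guid" ]; then
            return 0
        fi
    done
    return 1
}
```

On a database server it could be called as: master_sm_on_switch "$(sminfo)" "$(ibswitches)" && echo SUCCESS || echo FAILURE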

NOTES:

1. The InfiniBand network can have more than one Subnet Manager, but only one Subnet Manager is active at
a time. The active Subnet Manager is the Master Subnet Manager. The other Subnet Managers are the
Standby Subnet Managers. If a Master Subnet Manager is shut down or fails, then a Standby Subnet
Manager automatically becomes the Master Subnet Manager.
2. There are typically several Standby Subnet Managers waiting to take over should the current Master Subnet
Manager either fail or be manually moved to some other component with an available Standby Subnet
Manager. Only run Subnet Managers on the InfiniBand switches specified for use in Oracle Exadata
Database Machine, Oracle Exalogic Elastic Cloud, Oracle Big Data Appliance, and Oracle SuperCluster.
Running Subnet Manager on any other device is not supported.
3. For pure multirack Exadata deployments with fewer than 4 racks, the Subnet Manager should run on all spine
and leaf InfiniBand switches. For deployments with 4 or more Exadata racks, the Subnet Manager should
run only on spine InfiniBand switches. For additional configuration information, please see section "4.6.7
Understanding the Network Subnet Manager Master" of the "Exadata Database Machine Maintenance
Guide".
4. For InfiniBand fabric configurations that involve a mix of different Oracle Engineered Systems, please refer
to: MOS note 1682501.1
5. Moving the Master Subnet Manager is sometimes required during maintenance and patching operations. For
additional guidance on maintaining the Master Subnet Manager, please see section "4.6 Maintaining the
InfiniBand Network" of the "Exadata Database Machine Maintenance Guide".

Verify the Subnet Manager is properly disabled

Priority  Alert Level  Date      Owner   Status       Engineered System                                Engineered System Platform  Bug(s)
Critical  FAIL         11/28/18  <Name>  Development  Exadata - Physical, Exadata - Management Domain  ALL                         28768896 - exachk, 14534296 - exachk, 16270663 - exachk, 16795289 - exachk

DB/GI Version  DB Type  DB Role  DB Mode  Exadata Version  OS & Version  Validation Tool Version  MAA Scorecard Section
N/A            N/A      N/A      N/A      ALL              Linux         exachk 18.4.0            N/A

NOTE: The Subnet Manager should only execute on InfiniBand switches. It should be disabled on any other
component attached to an InfiniBand fabric.

Benefit / Impact:

Having the Subnet Manager executing in the correct locations improves the stability, availability and performance
of the InfiniBand fabric. The impact of verifying the Subnet Manager is disabled on components where the Master
Subnet Manager should never reside is minimal. The impact of disabling the Subnet Manager varies depending
upon the component type where it is found to be incorrectly executing, and whether or not the Master Subnet
Manager is incorrectly executing on that component.

Risk:

Unexpected behavior, such as connectivity or performance loss, can occur if the Subnet Manager is executing on an unexpected
component in the InfiniBand fabric.

Action / Repair:

To verify the Subnet Manager is disabled on components where the Master Subnet Manager should never reside, execute the
following command set as the "root" userid on all database and storage servers:

unset COMMAND_OUTPUT
COMMAND_OUTPUT=$(ps -ef | grep -i '[o]pensm')
if [ -n "$COMMAND_OUTPUT" ]
then
    echo -e "FAILURE: the Subnet Manager is executing.\nDETAILS:\n$COMMAND_OUTPUT"
else
    echo -e "SUCCESS: the Subnet Manager is not executing."
fi

The expected output is:

SUCCESS: the Subnet Manager is not executing.

Example of a "FAILURE" output:

FAILURE: the Subnet Manager is executing.
DETAILS:
root 2627 1 0 Mar24 ? 12:14:31 /usr/sbin/opensm --daemon

If the result is "FAILURE", investigate why the Subnet Manager is executing, relocate the Master Subnet Manager
if necessary, and prevent the Subnet Manager from starting in the future.

NOTES:

1. The command set provided is for Oracle Exadata Database Machines only. If there are non-Exadata
components residing on the InfiniBand fabric (e.g., a media server), refer to the provided documentation
for that component.
2. There are typically several Standby Subnet Managers waiting to take over should the current Master Subnet
Manager either fail or be manually moved to some other component with an available Standby Subnet
Manager. Only run Subnet Managers on the InfiniBand switches specified for use in Oracle Exadata
Database Machine, Oracle Exalogic Elastic Cloud, Oracle Big Data Appliance, and Oracle SuperCluster.
Running Subnet Manager on any other device is not supported.
3. For pure multirack Exadata deployments with fewer than 4 racks, the Subnet Manager should run on all spine
and leaf InfiniBand switches. For deployments with 4 or more Exadata racks, the Subnet Manager should
run only on spine InfiniBand switches. For additional configuration information, please see section "4.6.7
Understanding the Network Subnet Manager Master" of the "Exadata Database Machine Maintenance
Guide".
4. For InfiniBand fabric configurations that involve a mix of different Oracle Engineered Systems, please refer
to: MOS note 1682501.1
5. Moving the Master Subnet Manager is sometimes required during maintenance and patching operations. For
additional guidance on maintaining the Master Subnet Manager, please see section "4.6 Maintaining the
InfiniBand Network" of the "Exadata Database Machine Maintenance Guide".

Verify There Are No Memory (ECC) Errors

Priority  Alert Level  Date      Owner   Status      Engineered System                                     Bug(s)
Critical  FAIL         11/16/16  <Name>  Production  Exadata - Physical, Exadata - Management Domain, SSC

DB Version  DB Role  Engineered System Platform                                              Exadata Version  OS & Version                Validation Tool Version          TBD
N/A         N/A      X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8  11.2.2.2.0+      Solaris - 11, Linux x86-64  EXAchk 12.2.0.1.2, exachk 2.2.4

Benefit / Impact:

Memory modules that have corrected memory (ECC) errors can show degraded performance, IPMI driver timeouts, and BMC
error messages in the /var/log/messages file.

Correcting the condition restores optimal performance.

The impact of checking for memory ECC errors is slight. Correction will likely require node downtime for hardware diagnostics or
repair.

Risk:

If not corrected, the faulty memory will lead to performance degradation and other errors.

Action / Repair:

To verify there are no memory (ECC) errors, run the following commands as the "root" userid on all database and storage
servers:

if [ -x /usr/bin/ipmitool ]
then
    # Linux
    IPMI_COMMAND=ipmitool
else
    # Solaris
    IPMI_COMMAND=/opt/ipmitool/bin/ipmitool
fi
# Keep only the System Event Log entries that mention both Memory and ECC.
ECC_OUTPUT=$($IPMI_COMMAND sel list | grep Memory | grep ECC)
if [ -z "$ECC_OUTPUT" ]
then
    echo -e "SUCCESS: No memory ECC errors were found.\nECC list:\n\n$ECC_OUTPUT"
else
    echo -e "FAILURE: Memory ECC errors were found.\nECC list:\n\n$ECC_OUTPUT"
fi

The expected output should be:

SUCCESS: No memory ECC errors were found.
ECC list:

Example of a FAILURE result:

If any errors are reported, take the following corrective actions in order:
1) Reseat the DIMM.
2) Open an SR for hardware replacement.

Verify celldisk configuration on disk drives

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 12/06/11 X2-2(4170), X2-2, X2-8, X4-2 Linux, Solaris 11.2.x + 11.2.x +

Benefit / Impact:

The definition and maintenance of storage server celldisks is critical for optimal performance and outage avoidance.

The impact of verifying the basic storage server celldisk configuration is minimal. Correcting any abnormalities is dependent
upon the reason for the anomaly, so the impact cannot be estimated here.

Risk:

If the basic storage server celldisk configuration is not verified, poor performance or unexpected outages may occur.

Action / Repair:

To verify the basic storage server celldisk configuration on disk drives, execute the following command as the "celladmin" user
on each storage server:

cellcli -e "list celldisk where disktype=harddisk and status=normal" | wc -l

The output should be:

12

If the output is not as expected, investigate the condition and take corrective action based upon the root cause of
the unexpected result.

NOTE: On a storage server configured according to Oracle best practices, there should be 12 celldisks on disk
drives with a status of "normal".


Verify celldisk configuration on flash memory devices

Priority Alert Date Owner Status Engineered

Level System
Critical FAIL 11/15/2017 <Name> Production Exadata

DB DB Role Engineered System Exadata OS & Validation To

Version Version Version Version
N/A N/A X2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6- 11.2+ Linux x86- exachk 12.2.0.1
2, X6-8, X7-2, X7-8 64

Benefit / Impact:

The definition and maintenance of storage server celldisks is critical for optimal performance and outage avoidance. The number
of celldisks configured on flash memory devices varies by hardware version. Each celldisk configured on flash memory devices
should have a status of "normal".

The impact of verifying the celldisk configuration on flash memory devices is minimal. The impact of correcting any anomalies is
dependent upon the reason for the anomaly and cannot be estimated here.

Risk:

If the celldisk configuration on flash memory devices is not verified, poor performance or unexpected outages may occur.

Action / Repair:

To verify the celldisk configuration on flash memory devices, execute the following command as the "root" userid on each
storage server:

cellcli -e "list celldisk where disktype=flashdisk and status=normal" | wc -l

The output should be similar to the following and match one of the rows in the "Celldisk on Flash Memory Devices Mapping
Table":

Celldisk on Flash Memory Devices Mapping Table

System Description Common Name Disk Type Number of Devices

X4275 X2-2(4170) MIXED 16
X4270 M2 X2-2, X2-8 MIXED 16
X4270 M3 X3-2, X3-8 MIXED 16
X4270 M3 EIGHTH MIXED 8
X4-2L X4-2 MIXED 16
X4-2L EIGHTH MIXED 8
X5-2L X5-2, X5-8 MIXED 4
X5-2L X5-2, X5-8 FLASH 8
X5-2L EIGHTH MIXED 2
X5-2L EIGHTH FLASH 4
X6-2L X6-2, X6-8 MIXED 4
X6-2L X6-2, X6-8 FLASH 8
X6-2L EIGHTH MIXED 2
X6-2L EIGHTH FLASH 4
X7-2L X7-2, X7-8 MIXED 4
X7-2L X7-2, X7-8 FLASH 8
X7-2L EIGHTH MIXED 2
X7-2L EIGHTH FLASH 4
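The mapping table can also be expressed as a lookup, which makes it easier to compare the "wc -l" result against the expected value in a script. This is only a sketch of the table above: the function name "expected_flash_celldisks" and the label "STANDARD" for the rows that are not eighth racks are hypothetical.

```shell
#!/bin/bash
# Hypothetical lookup built from the "Celldisk on Flash Memory Devices
# Mapping Table".
#   $1 = system description (for example "X4270 M2" or "X6-2L")
#   $2 = EIGHTH for eighth-rack rows, STANDARD (hypothetical label) otherwise
#   $3 = disk type, MIXED or FLASH
# Prints the expected number of flash celldisks, or returns 1 if unknown.
expected_flash_celldisks() {
    case "$1/$2/$3" in
        "X4275/STANDARD/MIXED"|"X4270 M2/STANDARD/MIXED") echo 16 ;;
        "X4270 M3/STANDARD/MIXED"|"X4-2L/STANDARD/MIXED") echo 16 ;;
        "X4270 M3/EIGHTH/MIXED"|"X4-2L/EIGHTH/MIXED")     echo 8 ;;
        X[567]-2L/STANDARD/MIXED)                         echo 4 ;;
        X[567]-2L/STANDARD/FLASH)                         echo 8 ;;
        X[567]-2L/EIGHTH/MIXED)                           echo 2 ;;
        X[567]-2L/EIGHTH/FLASH)                           echo 4 ;;
        *)                                                return 1 ;;
    esac
}
```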

If the output is not as expected, execute the following command as the "root" userid:

cellcli -e "list celldisk where disktype=flashdisk and status!=normal"

Perform your root cause analysis and corrective actions based upon the key words returned in the "status" field. For additional
information, please reference the following:

The "Maintaining Flash Disks" section of "Oracle® Exadata Database Machine, Owner's Guide 11g Release 2 (11.2), E13874-24"

Troubleshooting guide for Sick or underperforming storage cell/Performance Issue (Doc ID 1348736.1)

Troubleshooting guide for Underperforming FlashDisks (Doc ID 1348938.1)

Verify there are no griddisks configured on flash memory devices

Priority  Alert Level  Date      Owner   Status      Engineered System                                                                      Bug(s)
Critical  FAIL         12/08/15  <Name>  Production  Exadata - Physical, Exadata - Management Domain, BDA, Exalogic, Exalytics, SSC, ZDLRA

DB Version  DB Role  Engineered System Platform                                  Exadata Version  OS & Version  Validation Tool Version  TBD
N/A         N/A      X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8  11.2.2.2.0+      Linux x86-64  exachk 12.1.0.2.6

Benefit / Impact:

The definition and maintenance of storage server griddisks is critical for optimal performance and outage avoidance.

The impact of verifying the storage server griddisk configuration is minimal. Correcting any abnormalities is dependent upon the
reason for the anomaly, so the impact cannot be estimated here.

Risk:

If the storage server griddisk configuration is not verified, poor performance or unexpected outages may occur.

Action / Repair:

To verify there are no storage server griddisks configured on flash memory devices, execute the following command as the
"celladmin" user on each storage server:

cellcli -e "list griddisk where disktype=flashdisk" | wc -l

The output should be:

0

If the output is not as expected, investigate the condition and take corrective action based upon the root cause of the
unexpected result.

NOTE:
Experience has shown that the Oracle recommended Best Practice of using all available flash device space for Smart Flash Log
and Smart Flash Cache provides the highest overall performance benefit with lowest maintenance overhead for an Oracle
Exadata Database Machine.

In some very rare cases for certain highly write-intensive applications, there may be some performance benefit to configuring
grid disks onto the flash devices for datafile writes only. With the release of the Smart Flash Log feature in 11.2.2.4,
redo logs should never be placed on flash grid disks. Smart Flash Log leverages both hard disks and flash devices
with intelligent caching to achieve the fastest possible redo write performance, optimizations which are lost if
redo logs are simply placed on flash grid disks.

The space available to Smart Flash Cache and Smart Flash Log is reduced by the amount of space allocated to the grid disks
deployed on flash devices. The usable space in the flash grid disk group is either half or one-third of the space allocated for grid
disks on flash devices, depending on whether the flash grid disks are configured with ASM normal or high redundancy.

If after thorough performance and recovery testing, a customer chooses to deploy grid disks on flash devices, it would be a
supported, but not Best Practice, configuration.
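The "half or one-third" arithmetic in the note above can be sketched as follows. The function name "usable_flash_griddisk_gb" is hypothetical, sizes are assumed to be whole gigabytes, and the result ignores any additional ASM overhead; it is meant only to show how much flash space is given up for a given amount of usable space.

```shell
#!/bin/bash
# Hypothetical helper: usable space of a disk group built on flash grid disks.
#   $1 = space allocated to grid disks on flash devices, in whole GB
#   $2 = ASM redundancy of the disk group: normal or high
usable_flash_griddisk_gb() {
    local allocated_gb=$1
    local redundancy=$2
    case "$redundancy" in
        normal) echo $(( allocated_gb / 2 )) ;;   # two copies of each extent
        high)   echo $(( allocated_gb / 3 )) ;;   # three copies of each extent
        *)      return 1 ;;
    esac
}
```

For example, allocating 600 GB of flash to grid disks yields roughly 300 GB usable with normal redundancy or 200 GB with high redundancy, while reducing the space available to Smart Flash Cache and Smart Flash Log by the full 600 GB.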

Verify griddisk count matches across all storage servers where a given prefix name exists

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 12/06/11 X2-2(4170), X2-2, X2-8, X4-2 Linux, Solaris 11.2.x + 11.2.x +

Benefit / Impact:

The definition and maintenance of storage server griddisks is critical for optimal performance and outage avoidance.

The impact of verifying the basic storage server griddisk configuration is minimal. Correcting any abnormalities is dependent
upon the reason for the anomaly, so the impact cannot be estimated here.

Risk:

If the storage server griddisk configuration as designed is not verified, poor performance or unexpected outages may occur.

Action / Repair:

To verify the storage server griddisk count matches across all storage servers where a given prefix name exists, execute the
following command as the "root" userid on the database server from which the onecommand script was executed during
initial deployment:

for GD_PREFIX in `dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "cellcli -e list griddisk attributes name" | cut -d" "
do
GD_PREFIX_RESULT=`dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "cellcli -e list griddisk where name like \'$GD_PREFIX\_
if [ $GD_PREFIX_RESULT = 1 ]
then
echo -e "$GD_PREFIX: SUCCESS"
else
echo -e "$GD_PREFIX: FAILURE"

dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "cellcli -e list griddisk where name like \'$GD_PREFIX\_.*\' | wc -l";
fi
done

The output should be similar to:

DATA_SLCC16: SUCCESS
DBFS_DG: SUCCESS
RECO_SLCC16: SUCCESS

If the output is not as expected, investigate the condition and take corrective action based upon the root cause of
the unexpected result.

NOTE: On a storage server configured according to Oracle best practices, the total number of griddisks per storage
server for a given prefix name (e.g: DATA) should match across all storage servers
where the given prefix name exists.
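The idea behind this check can be sketched independently of the command set above. This is a minimal sketch, not the Oracle script: the function name "check_griddisk_prefix_counts" is hypothetical, the input is assumed to be "cellname: griddiskname" lines such as dcli prints, and the prefix is assumed to be everything before "_CD_" or "_FD_" in the griddisk name. A prefix passes when every storage server that has it reports the same number of griddisks.

```shell
#!/bin/bash
# Hypothetical helper: read "cellname: griddiskname" lines on stdin and print
# "<prefix>: SUCCESS" when all cells that use the prefix have the same number
# of griddisks, "<prefix>: FAILURE" otherwise.
check_griddisk_prefix_counts() {
    awk '
    {
        cell = $1
        sub(/:$/, "", cell)              # strip the colon dcli adds
        prefix = $2
        sub(/_(CD|FD)_.*/, "", prefix)   # DATA_X_CD_00_cell01 -> DATA_X
        count[prefix SUBSEP cell]++
        prefixes[prefix] = 1
    }
    END {
        for (p in prefixes) {
            distinct = 0
            split("", seen)              # portable way to empty an array
            for (key in count) {
                split(key, parts, SUBSEP)
                if (parts[1] == p && !(count[key] in seen)) {
                    seen[count[key]] = 1
                    distinct++
                }
            }
            printf "%s: %s\n", p, (distinct == 1 ? "SUCCESS" : "FAILURE")
        }
    }' | sort
}
```

Storage servers that do not use a given prefix contribute no lines for it and so do not affect the result, which matches the rule that not all storage servers must have all prefix names in use.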

NOTE: Not all storage servers are required to have all prefix names in use. This is possible where for security
reasons a customer has segregated the storage servers, is using a data lifecycle management methodology,
or an Oracle Storage Expansion Rack is in use. For example, when an Oracle Storage Expansion Rack is in use for
data lifecycle management, those storage servers will likely have griddisks with unique names that
differ from the griddisk names used on the storage servers that contain real time data, yet all griddisks are visible
to the same cluster.

NOTE: This command requires that SSH equivalence exists for the "root" userid from the database server upon
which it is executed to all storage servers in use by the cluster.

Verify griddisk ASM status

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 12/06/11 X2-2(4170), X2-2, X2-8, X4-2 Linux, Solaris 11.2.x + 11.2.x +

Benefit / Impact:

The definition and maintenance of storage server griddisks is critical for optimal performance and outage avoidance.

The impact of verifying the storage server griddisk configuration is minimal. Correcting any abnormalities is dependent upon the
reason for the anomaly, so the impact cannot be estimated here.

Risk:

If the storage server griddisk configuration as designed is not verified, poor performance or unexpected outages may occur.

Action / Repair:

To verify the storage server griddisk ASM status, execute the following command as the "celladmin" user on each storage
server:

ASM_STAT_RESLT=`cellcli -e "list griddisk attributes name,status, asmmodestatus,asmdeactivationoutcome" | egrep -v ".*\<active\>.*\<ONLIN

if [ $ASM_STAT_RESLT = 0 ]
then
echo -e "\nSUCCESS\n"
else
echo -e "\nFAILURE:";
cellcli -e "list griddisk attributes name,status, asmmodestatus,asmdeactivationoutcome" | egrep -v ".*\<active\>.*\<ONLINE\>.*\<Yes\>";
echo -e "\n";
fi;

The output should be:

SUCCESS

If the output is not as expected, investigate the condition and take corrective action based upon the root cause of
the unexpected result.

NOTE: On a storage server configured according to Oracle best practices, all griddisks should have "status" of
"active", "asmmodestatus" of "ONLINE" and "asmdeactivationoutcome" of "Yes".
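The filter can be illustrated on captured output. This sketch swaps the egrep pattern for an awk field comparison, which is an assumption of this example and not the original method; the function name "nonconforming_griddisks" is hypothetical, and the input is assumed to have the four attributes in the order name, status, asmmodestatus, asmdeactivationoutcome.

```shell
#!/bin/bash
# Hypothetical helper: read "name status asmmodestatus asmdeactivationoutcome"
# lines on stdin and print only the griddisks that are not
# active / ONLINE / Yes. No output means every griddisk conforms.
nonconforming_griddisks() {
    awk 'NF && !($2 == "active" && $3 == "ONLINE" && $4 == "Yes")'
}
```

It could be fed directly from the storage server, for example: cellcli -e "list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome" | nonconforming_griddisks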

Verify that griddisks are distributed as expected across celldisks

Priority  Alert Level  Date      Owner   Status      Engineered System                                Engineered System Platform  B
Critical  FAIL         10/11/17  <Name>  Production  Exadata - Physical, Exadata - Management Domain  ALL

DB/GI Version  DB Type  DB Role  DB Mode  Exadata Version  OS & Version  Validation Tool Version  M
N/A            N/A      N/A      N/A      ALL              Linux         exachk 12.2.0.1.4

Benefit / Impact:

The definition and maintenance of storage server griddisks is critical for optimal performance and outage avoidance.

The impact of verifying the storage server griddisk configuration is minimal. Correcting any abnormalities is dependent upon the
reason for the anomaly, so the impact cannot be estimated here.

Risk:

If the storage server griddisk configuration as designed is not verified, poor performance or unexpected outages may occur.

Action / Repair:

NOTE: The recommended best practice is to have each griddisk distributed across all celldisks. For older versions of Exadata
storage server software and hardware, the griddisks "SYSTEM" or "DBFS_DG" had a slightly different distribution, and the code
below correctly accounts for those cases.

To verify that griddisks are distributed as expected across celldisks, execute the following command as the "root" userid on each
storage server:

RAW_CELLDISK=$(cellcli -e "list celldisk attributes name" | sed -e 's/^[ \t]*//')
RAW_GRIDDISK=$(cellcli -e "list griddisk attributes name" | sed -e 's/^[ \t]*//')
if [ `echo -e $RAW_CELLDISK | grep CD | wc -l` -ge 1 ]
then
    PARSED_CELLDISK=$(echo -e "$RAW_CELLDISK" | grep CD)
else
    PARSED_CELLDISK=$(echo -e "$RAW_CELLDISK")
fi
CELLDISK_COUNT=$(echo -e "$PARSED_CELLDISK" | wc -l)
if [ `echo -e $RAW_GRIDDISK | grep CD | wc -l` -ge 1 ]
then
    SHORT_GD_NAME_ARRAY=$(echo -e "$RAW_GRIDDISK" | awk -F "_CD_" '{print $1}' | sort -u)
else
    SHORT_GD_NAME_ARRAY=$(echo -e "$RAW_GRIDDISK" | awk -F "_FD_" '{print $1}' | sort -u)
fi
RETURN_RESULT=0
for GD_SHORT_NAME in $SHORT_GD_NAME_ARRAY
do
    if [[ $GD_SHORT_NAME = "SYSTEM" || $GD_SHORT_NAME = "DBFS_DG" || $GD_SHORT_NAME = "CATALOG" ]]
    then
        GD_COUNT=$(expr `echo "$RAW_GRIDDISK" | grep $GD_SHORT_NAME | wc -l` + 2)
    else
        GD_COUNT=$(echo "$RAW_GRIDDISK" | grep $GD_SHORT_NAME | wc -l)
    fi
    if [ $GD_COUNT -eq $CELLDISK_COUNT ]
    then
        :
    else
        OUTPUT_ARRAY+=`echo -e "\n$GD_SHORT_NAME: FAILURE:\tGriddisk count: $GD_COUNT\tCelldisk count: $CELLDISK_CO
        RETURN_RESULT=1
    fi
done
if [ $RETURN_RESULT -eq 0 ]
then
    echo -e "SUCCESS: All griddisks are distributed as expected across celldisks."
else
    echo -e -n "FAILURE: One or more griddisks are not distributed as expected across celldisks. Details:"
    echo -e "${OUTPUT_ARRAY[@]}"
fi

The expected output should be:

SUCCESS: All griddisks are distributed as expected across celldisks.

Example of a "FAILURE" result:

FAILURE: One or more griddisks are not distributed as expected across celldisks. Details:
C_DATA: FAILURE: Griddisk count: 7 Celldisk count: 8

If the output is not as expected, investigate the condition and take corrective action based upon the root cause of the
unexpected result.
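The counting rule used by this check can be sketched as a function that works on captured name lists. This is a simplified illustration, not the Oracle command set: the function name "check_griddisk_distribution" is hypothetical, it takes the celldisk and griddisk names as arguments so it can be tried without cellcli, and it anchors the name match at the start of the line, which is a deliberate change from the unanchored grep in the command set.

```shell
#!/bin/bash
# Hypothetical helper: compare the number of griddisks per short name with
# the number of celldisks.
#   $1 = newline-separated celldisk names
#   $2 = newline-separated griddisk names
# Prints one line per mismatch and returns 1 if any mismatch was found.
check_griddisk_distribution() {
    local celldisks=$1
    local griddisks=$2
    local celldisk_count short_name gd_count result=0
    # As in the command set: if hard disk celldisks (CD) exist, count only those.
    if printf '%s\n' "$celldisks" | grep -q 'CD'; then
        celldisks=$(printf '%s\n' "$celldisks" | grep 'CD')
    fi
    celldisk_count=$(printf '%s\n' "$celldisks" | grep -c .)
    for short_name in $(printf '%s\n' "$griddisks" | sed 's/_[CF]D_.*//' | sort -u)
    do
        gd_count=$(printf '%s\n' "$griddisks" | grep -c "^${short_name}_[CF]D_")
        case "$short_name" in
            # These griddisks skip two celldisks on older configurations.
            SYSTEM|DBFS_DG|CATALOG) gd_count=$(( gd_count + 2 )) ;;
        esac
        if [ "$gd_count" -ne "$celldisk_count" ]; then
            echo "$short_name: FAILURE: Griddisk count: $gd_count Celldisk count: $celldisk_count"
            result=1
        fi
    done
    return $result
}
```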

Verify the percent of available celldisk space used by the griddisks

Priority  Alert Level  Date      Owner   Status      Engineered System
Critical  INFO         11/09/16  <Name>  Production  Exadata - Physical, Exadata - Management Domain

DB Version  DB Role  Engineered System Platform                                              Exadata Version  OS & Version  Validation Tool Version
N/A         N/A      X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8  11.2+            Linux x86-64  exachk 12.2.0.1.2

Benefit / Impact:

The impact of verifying the percent of available celldisk space used by the griddisks is minimal.

Risk:

If the percent of available celldisk space used by the griddisks is not verified, an unexpected configuration change may be
missed.

Action / Repair:

To verify the percent of available celldisk space used by the griddisks, execute the following command set as the "root" userid
on each storage server:

ALLFLASHCELL=$(cellcli -e "list cell attributes makemodel"|egrep -ic 'ALLFLASH|EXTREME_FLASH');

RAW_GRIDDISK_SIZE=$(cellcli -e "list griddisk attributes size");
TOTAL_GRIDDISK_SIZE=$(echo "$RAW_GRIDDISK_SIZE" | sed 's/\s//g'|awk '/G$/ { print $0 } /T$/ { size=substr($0, 0,
if [ $ALLFLASHCELL -eq 0 ]
then
RAW_CELLDISK_SIZE=$(cellcli -e "list celldisk attributes size where disktype=harddisk");
else
RAW_CELLDISK_SIZE=$(cellcli -e "list celldisk attributes size where disktype=flashdisk");
fi;
TOTAL_CELLDISK_SIZE=$(echo "$RAW_CELLDISK_SIZE" | sed 's/\s//g'|awk '/G$/ { print $0 } /T$/ { size=substr($0, 0,
GRIDDISK_CELLDISK_PCT=$(echo $TOTAL_GRIDDISK_SIZE $TOTAL_CELLDISK_SIZE | awk '{ printf("%d", ($1/$2)*100) }');
echo -e "INFO: The percent of available celldisk space used by the griddisks is: $GRIDDISK_CELLDISK_PCT\nThe tot

The expected output will be similar to:

INFO: The percent of available celldisk space used by the griddisks is: 99
The total griddisk size found is: 87818.7
The total celldisk size found is: 87819.3

If the output is not as expected for a given known configuration, investigate and take corrective action based upon the root
cause of the unexpected result.

NOTE: On a storage server not in an Oracle Virtual Machine environment configured according to Oracle best
practices, the percent utilization will typically be >= 99 for spinning disk and between 94 and 95 for Extreme Flash.
The lower percentage of utilization for Extreme Flash is because the griddisks, Flash Log, and Flash Cache are all
built on the same flash hardware.

NOTE: In an Oracle Virtual Machine environment, it is not unusual for the percentage of available celldisk space
used by the griddisks to be in the middle 60 range. This is due in part to the fact the DBFS griddisk is not created
by default, and user requirements to reserve free space for future use. For example:

INFO: The percent of available celldisk space used by the griddisks is: 63
The total griddisk size found is: 4236
The total celldisk size found is: 6636.06
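The percentage itself is simple arithmetic and can be sketched as below. This is not the Oracle command set: the function names "sum_sizes_gb" and "percent_used" are hypothetical, and the sketch assumes that cellcli reports sizes with a trailing G or T and that 1T equals 1024G.

```shell
#!/bin/bash
# Hypothetical helper: sum size values such as "557.859375G" or "1.8T" read
# on stdin and print the total in GB (assumes 1T = 1024G).
sum_sizes_gb() {
    awk '
    { gsub(/[[:space:]]/, "") }
    /G$/ { total += substr($0, 1, length($0) - 1) }
    /T$/ { total += substr($0, 1, length($0) - 1) * 1024 }
    END  { print total + 0 }'
}

# Hypothetical helper: integer percentage of $2 (celldisk total) that is
# used by $1 (griddisk total), truncated like printf("%d").
percent_used() {
    awk -v used="$1" -v avail="$2" 'BEGIN { printf "%d\n", (used / avail) * 100 }'
}
```

With the totals from the Oracle Virtual Machine example above, "percent_used 4236 6636.06" prints 63.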

Verify Database Server ZFS RAID Configuration

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 01/27/12 X2-2, X2-8, X4-2 Solaris 11.2.x + 11.2.x +

Benefit / Impact:

For a database server running Solaris deployed according to Oracle standards, there will be two ZFS RAID-1 pools, named
"rpool" and "data". Each mirror in the pool contains two disk drives. For an X2-2,
there is one mirror for each name. For an X2-8, there is one mirror for "rpool" and 3 for "data". Verifying the database server
ZFS RAID configuration helps to avoid a possible performance impact, or an outage.

The impact of validating the ZFS RAID configuration is minimal. The impact of corrective actions will vary depending on the
specific issue uncovered, and may range from simple reconfiguration to an outage.

Risk:

Not verifying the ZFS RAID configuration increases the chance of a performance degradation or an outage.

Action / Repair:

To verify the database server ZFS RAID configuration, execute the following command as the "root" userid:

/opt/oracle.SupportTools/disks_map.pl | ggrep mirror -A3

The output will be similar to:

------------------- mirror-0 ---------------------

16:5 c1t2d0s0 rpool
16:4 c1t1d0s0 rpool
--------------------------------------------------
------------------- mirror-0 ---------------------
16:6 c1t3d0 data
16:7 c1t4d0 data
--------------------------------------------------
------------------- mirror-2 ---------------------
16:0 c1t5d0 data
16:2 c1t6d0 data
--------------------------------------------------
------------------- mirror-1 ---------------------
16:3 c1t0d0 data
16:1 c1t7d0 data
--------------------------------------------------

For an X2-2, the expected output is one pool named "rpool", and one named "data", each comprised of 1 mirror with 2 disk
drives.

For an X2-8, the expected output is one pool named "rpool", comprised of 1 mirror with 2 disk drives, and one pool named
"data" comprised of 3 mirrors each with 2 disk drives.

If the reported output differs, investigate and correct the condition.

Verify InfiniBand is the Private Network for Oracle Clusterware Communication

Priority  Added  Machine Type                  OS Type  Exadata Version  Oracle Version
Critical  N/A    X2-2(4170), X2-2, X2-8, X4-2  Linux    11.2.x +         11.2.x +

Benefit / Impact:

The InfiniBand network in an Oracle Exadata Database Machine provides superior performance and throughput characteristics
that allow Oracle Clusterware to operate at optimal efficiency.

The overhead for these verification steps is minimal.

Risk:

If the InfiniBand network is not used for Oracle Clusterware communication, performance will be sub-optimal.

Action / Repair:

The InfiniBand network is preconfigured on the storage servers. Perform the following on the database servers:

Verify the InfiniBand network is the private network used for Oracle Clusterware communication with the following command:

$GI_HOME/bin/oifcfg getif -type cluster_interconnect

For X2-2 the output should be similar to:

bondib0 192.168.8.0 global cluster_interconnect

For X2-8 the output should be similar to:

bondib0 192.168.8.0 global cluster_interconnect
bondib1 192.168.8.0 global cluster_interconnect
bondib2 192.168.8.0 global cluster_interconnect
bondib3 192.168.8.0 global cluster_interconnect
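
The comparison against the expected output can also be scripted. The following is a minimal sketch, assuming the output format shown above (interface name in the first column) and assuming that InfiniBand interfaces are named ib<N> or bondib<N>. The function name non_ib_interconnects is illustrative and is not part of any Oracle tool.

```shell
# Hypothetical helper: reads "oifcfg getif -type cluster_interconnect" output
# on stdin and prints every interconnect interface whose name does not look
# like an InfiniBand interface (ib<N> or bondib<N>).
# No output means every cluster interconnect is on InfiniBand.
non_ib_interconnects() {
  awk 'NF > 0 && $1 !~ /^(bond)?ib[0-9]+$/ { print $1 }'
}

# Example usage on a database server (GI_HOME must point at the Grid home):
#   "$GI_HOME"/bin/oifcfg getif -type cluster_interconnect | non_ib_interconnects
```

Any interface name printed by the helper is a candidate for correction using the MOS note referenced in this check.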

If the InfiniBand network is not the private network used for Oracle Clusterware communication, configure it following the
instructions in MOS Note 283684.1, "How to Modify Private Network Interface in 11.2 Grid Infrastructure".

NOTE: It is important to ensure that your public interface is properly marked as public and not private. This can be
checked with the oifcfg getif command. If it is inadvertently marked private, you can get errors such as "OS system
dependent operation:bind failed with status" and "OS failure message: Cannot assign requested address". It can be
corrected with a command like oifcfg setif -global eth0/<public IP address>:public

In each database, verify that it is using the private IB interconnect with the following query:

SQL> select name,ip_address from v$cluster_interconnects;

NAME IP_ADDRESS
--------------- ----------------
bondib0 192.168.40.25

Or in the database alert log you can look for the following message:

Cluster communication is configured to use the following interface(s) for this instance
192.168.40.26

Verify InfiniBand Address Resolution Protocol (ARP) Configuration on Database Servers

Priority Alert Date Owner Status Engineered Sy

Level
Critical FAIL 7/13/16 <Name Production Exadata - Phys
Status Exadata - Manag
Domain,
Exadata - User D
DB DB Role Engineered System Platform Exadata OS Version Validation T
Version Version Version
N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, 11.2.2.2.0+ Linux x86-64 exachk 12.1.0
X5-8, X6-2, X6-8

Benefit / Impact: There are specific ARP configurations required for Real Application Clusters (RAC) to work correctly that vary
between an active/passive or active/active configuration.

For an active/passive configuration, the settings for all IB interfaces should be:

accept_local = 1
rp_filter = 0

For an active/active configuration, the settings for all IB interfaces should be:

accept_local = 1
rp_filter = 0

- AND the three single attributes -

net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.accept_local = 1

The impact of verifying the ARP configuration is minimal. Correcting a configuration requires editing "/etc/sysctl.conf" and
restarting the interface(s).

Risk:

Incorrect ARP configurations may prevent RAC from starting, or result in dropped packets and inconsistent RAC operation.

Action / Repair:

To verify the InfiniBand interface ARP settings for a database server, use the following command as the "root" userid:

RAW_OUTPUT=$(sysctl -a)
RF_OUTPUT=$(echo "$RAW_OUTPUT" | egrep -i "\.ib|bondib" | egrep -i "\.rp_filter")
AL_OUTPUT=$(echo "$RAW_OUTPUT" | egrep -i "\.ib|bondib" | egrep -i "\.accept_local")
if [ `echo "$RAW_OUTPUT" | grep -i bondib | wc -l` -ge 1 ]
then #active/passive case
  if [[ `echo "$AL_OUTPUT" | cut -d" " -f3 | sort -u | wc -l` -eq 1 && `echo "$AL_OUTPUT" | cut -d" " -f3 | sort -u | head -1` -eq 1 ]]
  then
    AL_RSLT=0 #all AL same value and value is 1
  else
    AL_RSLT=1
  fi;
  if [[ `echo "$RF_OUTPUT" | cut -d" " -f3 | sort -u | wc -l` -eq 1 && `echo "$RF_OUTPUT" | cut -d" " -f3 | sort -u | head -1` -eq 0 ]]
  then
    RF_RSLT=0 #all RF same value and value is 0
  else
    RF_RSLT=1
  fi;
  if [[ $AL_RSLT -eq 0 && $RF_RSLT -eq 0 ]]
  then
    echo -e "Success: The active/passive ARP configuration is as recommended:\n"
  else
    echo -e "Failure: The active/passive ARP configuration is not as recommended:\n"
  fi;
  echo -e "$AL_OUTPUT\n\n$RF_OUTPUT"
else #active/active case
  NICARF_OUTPUT=$(echo "$RAW_OUTPUT" | egrep -i "net.ipv4.conf.all.rp_filter")
  NICDRF_OUTPUT=$(echo "$RAW_OUTPUT" | egrep -i "net.ipv4.conf.default.rp_filter")
  NICAAL_OUTPUT=$(echo "$RAW_OUTPUT" | egrep -i "net.ipv4.conf.all.accept_local")
  NICARF_RSLT=$(echo "$NICARF_OUTPUT" | cut -d" " -f3)
  NICDRF_RSLT=$(echo "$NICDRF_OUTPUT" | cut -d" " -f3)
  NICAAL_RSLT=$(echo "$NICAAL_OUTPUT" | cut -d" " -f3)
  IB_INTRFCE_CNT=$(echo "$RAW_OUTPUT" | egrep "\.ib.\." | cut -d"." -f4 | sort -u | wc -l)
  if [[ `echo "$AL_OUTPUT" | cut -d" " -f3 | sort -u | wc -l` -eq 1 && `echo "$AL_OUTPUT" | cut -d" " -f3 | sort -u | head -1` -eq 1 ]]
  then
    AL_RSLT=0 #all AL same value and value is 1
  else
    AL_RSLT=1
  fi;
  if [[ `echo "$RF_OUTPUT" | cut -d" " -f3 | sort -u | wc -l` -eq 1 && `echo "$RF_OUTPUT" | cut -d" " -f3 | sort -u | head -1` -eq 0 ]]
  then
    RF_RSLT=0 #all RF same value and value is 0
  else
    RF_RSLT=1
  fi;
  if [ $IB_INTRFCE_CNT -eq 2 ] # 2 socket case
  then
    if [[ $AL_RSLT -eq 0 && $RF_RSLT -eq 0 && $NICARF_RSLT -eq 0 && $NICDRF_RSLT -eq 0 && $NICAAL_RSLT -eq 1 ]]
    then
      echo -e "Success: The active/active ARP configuration is as recommended:\n"
    else
      echo -e "Failure: The active/active ARP configuration is not as recommended:\n"
    fi;
    echo -e "$AL_OUTPUT\n\n$RF_OUTPUT\n\n$NICARF_OUTPUT\n$NICDRF_OUTPUT\n$NICAAL_OUTPUT"
  else # 8 socket case
    NICIAA_OUTPUT=$(echo "$RAW_OUTPUT" | egrep "\.ib.\." | egrep arp_announce)
    if [[ `echo "$NICIAA_OUTPUT" | cut -d" " -f3 | sort -u | wc -l` -eq 1 && `echo "$NICIAA_OUTPUT" | cut -d" " -f3 | sort -u | head -1` -eq 2 ]]
    then
      NICIAA_RSLT=0 #all arp_announce same value and value is 2
    else
      NICIAA_RSLT=1
    fi;
    if [[ $AL_RSLT -eq 0 && $RF_RSLT -eq 0 && $NICIAA_RSLT -eq 0 && $NICARF_RSLT -eq 0 && $NICDRF_RSLT -eq 0 && $NICAAL_RSLT -eq 1 ]]
    then
      echo -e "Success: The active/active ARP configuration is as recommended:\n"
    else
      echo -e "Failure: The active/active ARP configuration is not as recommended:\n"
    fi;
    echo -e "$AL_OUTPUT\n\n$RF_OUTPUT\n\n$NICIAA_OUTPUT\n\n$NICARF_OUTPUT\n$NICDRF_OUTPUT\n$NICAAL_OUTPUT"
  fi;
fi;

The expected output should be similar to:

Success: The active/passive ARP configuration is as recommended:

net.ipv4.conf.ib0.accept_local = 1
net.ipv4.conf.ib1.accept_local = 1
net.ipv4.conf.bondib0.accept_local = 1

net.ipv4.conf.ib0.rp_filter = 0
net.ipv4.conf.ib1.rp_filter = 0
net.ipv4.conf.bondib0.rp_filter = 0

- OR -

Success: The active/active ARP configuration is as recommended:

net.ipv4.conf.ib0.accept_local = 1
net.ipv4.conf.ib1.accept_local = 1

net.ipv4.conf.ib0.rp_filter = 0
net.ipv4.conf.ib1.rp_filter = 0

net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.accept_local = 1

- OR -

Success: The active/active ARP configuration is as recommended:

net.ipv4.conf.ib0.accept_local = 1
<output truncated>
net.ipv4.conf.ib7.accept_local = 1

net.ipv4.conf.ib0.rp_filter = 0
<output truncated>
net.ipv4.conf.ib7.rp_filter = 0

net.ipv4.conf.ib0.arp_announce = 2
<output truncated>
net.ipv4.conf.ib7.arp_announce = 2

net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.accept_local = 1

If a "Failure: ..." message appears, investigate the root cause, make the necessary edits to "/etc/sysctl.conf", and restart the
interface(s).

NOTE: These recommendations are for the InfiniBand interfaces on database servers only! They do not apply to the Ethernet
interfaces on the database servers. No changes are permitted on the storage servers.

Verify Oracle RAC Databases use RDS Protocol over InfiniBand Network.

Priority Alert Date Owner Status Eng

Level
Critical FAIL 03/01/2017 <Name> Production Exa
E

DB DB Role Engineered System Platform Exadata OS & Va

Version Version Version
11.2.0.2+ Primary, X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, All Linux x86- ex
Standby, X6-8, SL6 64,
ASM Sparc Linux

Benefit / Impact:

The RDS protocol over InfiniBand provides superior performance because it avoids additional memory buffering operations when
moving data from process memory to the network interface for IO operations.
This includes both IO operations between the Oracle instance and the storage servers, as well as instance to instance block
transfers via Cache Fusion.

There is minimal impact to verify that the RDS protocol is in use. Implementing the RDS protocol over InfiniBand requires an
outage to relink the Oracle software.

Risk:

If the Oracle RAC databases do not use RDS protocol over the InfiniBand network, IO operations will be sub-optimal.

Action / Repair:

To verify the RDS protocol is in use by a given Oracle instance, set the ORACLE_HOME and LD_LIBRARY_PATH variables
properly for the instance and execute the following command as the oracle userid on each database server where the
instance is running:

$ORACLE_HOME/bin/skgxpinfo

The output should be:

rds

Note: For Oracle software versions below 11.2.0.2, the skgxpinfo command is not present. For 11.2.0.1, you can copy
skgxpinfo from an available 11.2.0.2 environment to the proper path in your 11.2.0.1 environment and execute it
against the 11.2.0.1 database home(s) using the provided command.

Note: An alternative check (regardless of Oracle software version) is to scan each instance's alert log (which must
contain a startup sequence) for the following line:

Cluster communication is configured to use the following interface(s)for this instance 192.168.20.21 cluster interconn

If the instance is not using the RDS protocol over InfiniBand, relink the Oracle binary using the following commands (with
variables properly defined for each home being linked):

(as oracle) Shut down any processes using the Oracle binary
If and only if relinking the Grid Infrastructure home, then (as root) GRID_HOME/crs/install/rootcrs.pl -unlock
(as oracle) cd $ORACLE_HOME/rdbms/lib
(as oracle) make -f ins_rdbms.mk ipc_rds ioracle
If and only if relinking the Grid Infrastructure home, then (as root) GRID_HOME/crs/install/rootcrs.pl -patch

Note: Avoid using the relink all command due to various issues. Use the make commands provided.

Verify Database and ASM instances use same SPFILE

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical March 2013 All All All All

Benefit / Impact:

All instances for a particular database or ASM cluster should be using the same spfile. Making changes to databases and ASM
instances needs to be done in a reliable and consistent way across all instances.

Risk:

Multiple 'sources of truth' can cause confusion and possibly unintended values being set.

Action / Repair:

Verify which spfile is used across all instances of a particular ASM or database cluster. If multiple spfiles are found for one
database, the recommendation is to consolidate them into one.

The scope includes all machine types, OS types, and database versions.

SQL> select name, value from gv$parameter where name = 'spfile';

NAME VALUE
------------------------------ ------------------------------------------------------------
spfile +DATA/racone/spfileracone.ora

The value for pfile should be empty:

SQL> select name, value from gv$parameter where name = 'pfile';

no rows selected
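
The "one spfile per cluster" rule can be checked mechanically once the spfile values have been collected. The following is a minimal sketch, assuming the values have been extracted one per line (one line per instance); the function name single_spfile and the sqlplus invocation in the comment are illustrative assumptions, not part of the original note.

```shell
# Hypothetical helper: reads one spfile value per line on stdin (one line per
# instance) and succeeds only if exactly one distinct, non-empty value exists.
single_spfile() {
  count=$(grep -v '^[[:space:]]*$' | sort -u | wc -l)
  [ "$count" -eq 1 ]
}

# Example usage (assumes OS authentication as a SYSDBA user):
#   sqlplus -s / as sysdba <<'EOF' | single_spfile && echo "one spfile in use"
#   set heading off feedback off pagesize 0
#   select value from gv$parameter where name = 'spfile';
#   EOF
```

A non-zero return status means that either no spfile value was reported or more than one distinct spfile is in use.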

Verify Berkeley Database location for Cloned GI homes

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical March 2013 X2-2(4170), X2-2, X2-8, X4-2 Linux, Solaris 11.2.x + 11.2.x +

Benefit / Impact:

After cloning a Grid Home, the Berkeley Database configuration file ($GI_HOME/crf/admin/crf<node>.ora) in the new home
should not point to the previous GI home from which it was cloned. During previous patch set updates, Berkeley Database
configuration files were found still pointing to the 'before' (previously cloned from) home. This was caused by an invalid
cloning procedure in which the Berkeley Database location of the 'new home' was not updated during the out-of-place
bundle patching procedure.

Risk:

Berkeley Database configurations still pointing to the old GI home will cause GI upgrades to 11.2.0.3 to fail. Error messages
appear in the $GRID_HOME/log/crflogd/crflogdOUT.log logfile.

Action / Repair:

Detect:

# cat $GI_HOME/crf/admin/crf`hostname -s`.ora | grep CRFHOME | grep $GI_HOME | wc -l

# cat $GI_HOME/crf/admin/crf`hostname -s`.ora | grep BDBLOC | egrep "default|$GI_HOME" | wc -l

For each of the above commands, if '1' is not returned, the CRFHOME or BDBLOC entry in the crf<node>.ora file has the
wrong reference to the GI_HOME.

To solve this, manually edit $GI_HOME/crf/admin/crf<node>.ora in the cloned Grid Infrastructure home and change the values
for BDBLOC and CRFHOME so that neither points to the previous GI_HOME and both point to the current home. The same
change needs to be made on all nodes in the cluster. It is recommended to set BDBLOC to "default". This needs to be done
prior to the upgrade.

Reference: 1485970.1 / 14168708
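
The two detection commands can be combined into one reusable check. The following is a minimal sketch under stated assumptions: the function name check_crf_file is hypothetical, the file is assumed to contain lines mentioning CRFHOME and BDBLOC as the detection commands imply, and the GI home is matched as a fixed string rather than as a regular expression so that dots in a version path are not treated as wildcards.

```shell
# Hypothetical helper: check_crf_file FILE GI_HOME
# Reports whether the CRFHOME and BDBLOC entries in a crf<node>.ora file
# reference the given GI home (BDBLOC may also be "default").
# Returns non-zero if either entry looks wrong.
check_crf_file() {
  crf_file=$1
  crf_gi_home=$2
  crf_rc=0
  if grep CRFHOME "$crf_file" | grep -qF -- "$crf_gi_home"; then
    echo "CRFHOME OK"
  else
    echo "CRFHOME WRONG"
    crf_rc=1
  fi
  crf_bdb=$(grep BDBLOC "$crf_file")
  if printf '%s\n' "$crf_bdb" | grep -q default ||
     printf '%s\n' "$crf_bdb" | grep -qF -- "$crf_gi_home"; then
    echo "BDBLOC OK"
  else
    echo "BDBLOC WRONG"
    crf_rc=1
  fi
  return $crf_rc
}

# Example usage on each node of the cluster:
#   check_crf_file "$GI_HOME/crf/admin/crf$(hostname -s).ora" "$GI_HOME"
```

Unlike the detection commands, the sketch does not insist on exactly one matching line; it only reports whether a matching entry exists.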

Configure Storage Server alerts to be sent via email

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical N/A X2-2(4170), X2-2, X2-8, X4-2 Linux 11.2.x + 11.2.x +

Benefit / Impact:

Oracle Exadata Storage Servers can send various levels of alerts and clear messages via email, SNMP, or both. Sending these
messages via email at a minimum helps to ensure that a problem is detected and corrected.

There is little impact to storage server operation to send these messages via email.

Risk:

If the storage servers are not configured to send alerts and clear messages via email at a minimum, there is an increased risk of
a problem not being detected in a timely manner.

Action / Repair:

Use the following cellcli command to validate the email configuration by sending a test email:

alter cell validate mail;

The output will be similar to:

Cell slcc09cel01 successfully altered

If the output is not successful, configure a storage server to send email alerts using the following cellcli command (tailored to
your environment):

ALTER CELL smtpServer='mailserver.maildomain.com', -
smtpFromAddr='[email protected]', -
smtpToAddr='[email protected]', -
smtpFrom='Exadata cell', -
smtpPort='<port for mail server>', -
smtpUseSSL='TRUE', -
notificationPolicy='critical,warning,clear', -
notificationMethod='mail';

NOTE: The recommended best practice to monitor an Oracle Exadata Database Machine is with Oracle Enterprise
Manager (OEM) and the suite of OEM plugins developed for the Oracle Exadata Database Machine.
Please reference My Oracle Support (MOS) Note 1110675.1 for details.

Configure NTP and Timezone on the InfiniBand switches

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical N/A X2-2(4170), X2-2, X2-8, X4-2 Linux 11.2.x + 11.2.x +

Benefit / Impact:

Synchronized timestamps are important to switch operation and message logging, both within an InfiniBand switch and between
the InfiniBand switches. There is little impact to correctly configure the switches.

Risk:

If the InfiniBand switches are not correctly configured, there is a risk of improper operation and disjoint message timestamping.

The InfiniBand switches should be properly configured during the initial deployment process. If for some reason they were
not, please consult the "Configuring Sun Datacenter InfiniBand Switch 36 Switch"
section of the "Oracle® Exadata Database Machine Owner's Guide, 11g Release 2 (11.2)".

Configure NTP slew_always settings as SMF property for Solaris

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical N/A X2-2, X2-8, X4-2 Solaris 11.2.x + 11.2.0.2 +

Benefit / Impact:

Configuring NTP slew settings as an SMF property ensures that time is managed consistently on all systems, which
prevents timing issues that may impact availability. It also helps in problem analysis and prevents error messages in the
system log about an incorrect NTP setting.

Risk:

Not having a working NTP configuration using SMF will result in different time settings on the nodes. This may impact stability
and makes problem analysis difficult.

"syntax error in /etc/inet/ntp.conf line 95, ignored"

Action / Repair:

As a best practice, the NTP configuration setting slew_always should be configured as an SMF setting. After setting
slew_always in SMF, the other setting 'disable pll' is no longer required.
On Solaris 11 Express and Solaris 11, neither setting should exist in ntp.conf.
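
The following is a minimal sketch of how the leftover entries in ntp.conf might be detected. It rests on assumptions that are not stated in this note: the helper name ntp_conf_leftovers is hypothetical, the legacy directive is assumed to be spelled "slewalways" (the pattern also accepts "slew_always"), and the SMF commands shown in the comments use the property name config/slew_always on the service svc:/network/ntp:default, which should be confirmed on the target Solaris release before use.

```shell
# Hypothetical helper: prints any active (non-comment) lines in an
# ntp.conf-style file that still set slew-always or "disable pll".
# Exit status is non-zero when no such lines are found.
ntp_conf_leftovers() {
  grep -Ei '^[[:space:]]*(slew_?always|disable[[:space:]]+pll)' "${1:-/etc/inet/ntp.conf}"
}

# Assumed SMF commands (run as root on Solaris 11; verify before use):
#   svccfg -s svc:/network/ntp:default setprop config/slew_always = true
#   svcadm refresh svc:/network/ntp:default
#   svcadm restart svc:/network/ntp:default
# Assumed verification command:
#   svcprop -p config/slew_always svc:/network/ntp:default
```

Lines reported by the helper are candidates for removal from ntp.conf once the SMF property is in place.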

Enable Xeon Turbo Boost

Priority Alert Level Date Owner Status Scope Bug(s)

Important WARN 2014-Jan-17 <Name> Production Exadata 17898503

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version TBD
N/A N/A X4-2, X4-8 11.2.3.3.0+ All exachk TBD

Benefit / Impact:

Xeon Turbo Boost automatically allows processor cores to run faster than their rated frequency if operating below power,
current, and temperature specification limits, which may result in better performance for some applications. Turbo Boost is
supported on X4 systems only.

Action / Repair:

Verify your system is using X4-based hardware using the dmidecode command:

# dmidecode -s system-product-name

The output on an X4-based database server is "SUN SERVER X4-2". The output on an X4-based storage server is "SUN SERVER
X4-2L".

Verify Turbo Boost is enabled on X4 database and storage servers using the following command:

# ubiosconfig export all -E | fgrep Turbo_Mode

Turbo Boost is enabled if the output is the following:

<Turbo_Mode>Enabled</Turbo_Mode>

Turbo Boost is disabled if the output is the following:

<Turbo_Mode>Disabled</Turbo_Mode>

If Turbo Boost is disabled, then enable it (on X4 systems only) by following the instructions in MOS Document 1487339.1, Issue
1.6 - Enable the Xeon Turbo Boost mode for X4 storage and database servers.

NOTE: Although it is possible to enable Turbo Boost on X3-based Exadata hardware, it is not supported.
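
The Turbo Boost verification can be reduced to a single value for scripting. The following is a minimal sketch, assuming the ubiosconfig output contains the Turbo_Mode element exactly as shown above; the function name turbo_mode is illustrative only.

```shell
# Hypothetical helper: reads ubiosconfig XML on stdin and prints the value of
# the Turbo_Mode element ("Enabled" or "Disabled" in the examples above).
turbo_mode() {
  sed -n 's:.*<Turbo_Mode>\(.*\)</Turbo_Mode>.*:\1:p'
}

# Example usage on an X4 database or storage server:
#   if [ "$(ubiosconfig export all -E | turbo_mode)" = "Enabled" ]; then
#     echo "Turbo Boost is enabled"
#   fi
```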

Verify NUMA Configuration

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical N/A X2-2(4170), X2-2, X2-8, X4-2 Linux 11.2.x + 11.2.x +

Benefit / Impact:

X2-2 Database servers in Oracle Exadata Database Machine by default are booted with operating system NUMA support enabled.
Commands that manipulate large files without using direct I/O on ext3 file systems
will cause low memory conditions on the NUMA node (Xeon 5500 processor) currently running the process.

By turning NUMA off, a potential local node low memory condition and subsequent performance drop is avoided.

The impact of turning NUMA off is minimal.

Risk:

Once local node memory is depleted, system performance as a whole will be severely impacted.

Action / Repair:

Follow the instructions in MOS Note 1053332.1 to turn NUMA off in the kernel for database servers.

NOTE: NUMA is configured to be off in the storage servers and should not be changed.

Verify Exadata Smart Flash Log is Created

Priority Alert Level Date Owner Status Scope

Critical FAIL 03/05/2013 <Name> Production Exadata, SSC, Exalogic

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version

N/A N/A X2-2(4170), X2-2, X2-8, X3-2, EIGHTH, X3-8, X4-2 11.2.2.2.0+ Solaris - 11, Linux x86-64 UEK5.8 exachk 2.2.1

Benefit / Impact:

When created, Exadata Smart Flash Log uses 512MB of flash memory per storage server by default to help minimize redo log
write latency.

The impact of verifying that Exadata Smart Flash Log is created is minimal.

Risk:

Without Exadata Smart Flash Log, the LGWR process may be delayed causing longer "log file parallel write" and "log file sync"
waits.

Action / Repair:

To verify that Exadata Smart Flash Log is created, execute the following cellcli command as the "celladmin" user on each storage
server:

list flashlog attributes size,status

The output should be similar to:

512M normal

If the size is not as expected, Exadata Smart Flash Log may not be created, or there may be a hardware issue, or there may be
a configuration issue.
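
The comparison against the expected output can be scripted. The following is a minimal sketch, assuming the command prints the size and status as two whitespace-separated fields as in the sample output, and assuming the default size of 512M; the function name check_flashlog is hypothetical.

```shell
# Hypothetical helper: reads "list flashlog attributes size,status" output on
# stdin and succeeds only if a line reports size 512M with status normal.
check_flashlog() {
  awk '$1 == "512M" && $2 == "normal" { ok = 1 } END { exit ok ? 0 : 1 }'
}

# Example usage on a storage server:
#   cellcli -e "list flashlog attributes size,status" | check_flashlog \
#     && echo "SUCCESS: flash log is 512M and normal" \
#     || echo "FAILURE: flash log size or status is not as expected"
```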

It is extremely important that the root cause for the size not being as expected is understood before attempting corrective
action. Because Smart Flash Log and Smart Flash Cache share the same physical memory structure on the storage servers, both
are likely to be impacted by hardware failures, for example. Corrective action is also impacted by whether or not Write Back
Flash Cache is in use, and solutions for the same root cause may vary if Write Back Flash Cache is in use.

After determining the root cause, refer to the Database Machine Owner's Guide and the Exadata Software User's Guide for the
appropriate corrective action steps.

Because they share the same storage server physical flash memory, there is a space usage relationship between Exadata Smart
Flash Log and Exadata Smart Flash Cache. Exadata Smart Flash Log should be created before Exadata Smart Flash Cache,
because the default configuration for Exadata Smart Flash Cache will use all available storage server flash memory. If Exadata
Smart Flash Cache already exists, a subsequent attempt to create Exadata Smart Flash Log will fail because all the available
storage server flash memory is in use.

NOTE: Exadata Smart Flash Log is created by default with Exadata Storage Server Software version 11.2.2.4.0 and
above.

NOTE: Exadata Smart Flash Log will be used by Oracle software 11.2.0.2 Bundle Patch 9 (or higher) or 11.2.0.3.0.
The recommended Oracle software version levels are 11.2.0.2 Bundle Patch 11 (or higher) or 11.2.0.3 Bundle
Patch 1 (or higher).

NOTE: The default Exadata Smart Flash Log size of 512MB is the recommended value.

NOTE: See also "Configure Storage Server Flash Memory as Exadata Smart Flash Cache"

Verify Exadata Smart Flash Cache is Created

Priority Alert Level Date Owner Status Scope Bug(s)

Critical FAIL updated 10/11/17 <Name> Production Exadata - Physical, Exadata - Management Domain, SSC, Exalogic <26637216>-exachk, <24514430>-exachk, <23063691>-exachk, <22344656>-exachk, <18691846>-exachk

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version TBD

N/A N/A ALL 11.2.3.2.0+ Linux x86-64 exachk 12.2.0.1.4

Benefit / Impact:

For the vast majority of situations, maximum performance is achieved by configuring the storage server flash memory as cache,
allowing the Exadata software to determine the content of the cache.

The impact of configuring storage server flash memory as cache at initial deployment is minimal. If there are already grid disks
configured in the flash memory, consideration must be given as to the relocation of the data when converting the flash memory
back to cache.

Risk:

Not configuring the storage server flash memory as cache may result in a degradation of overall performance.

Action / Repair:

To confirm all storage server flash memory is configured as smart flash cache, execute the command shown below:

cellcli -e "list flashcache detail" | grep size

The output will be similar to:

size: 5.82122802734375T

Starting with Exadata software version 11.2.3.2.0, for an environment deployed according to Oracle standards, with the storage
server "flashlog" feature in use at the default size of 512M, the size of the storage server "flashcache" should match one of the
entries in this table:

Smart Flash Cache Expected Size Table

System Description Common Name  Cache Size with      Cache Size without   Cache Size with                     Cache Size with
                                Smart Flash Log      Smart Flash Log      flashCacheCompression               flashCacheCompression
                                                                          and Smart Flash Log                 and no Smart Flash Log
X4275              X2-2(4170)   0.356201171875T      0.356689453125T      FCC not available on this hardware  FCC not available on this hardware
                                364.75G              365.25G
X4270 M2           X2-2, X2-8   0.356201171875T      0.356689453125T      FCC not available on this hardware  FCC not available on this hardware
                                364.75G              365.25G
X4270 M3           X3-2, X3-8   1.453857421875T      1.454345703125T      2.908935546875T                     2.909423828125T
                                1488.75G             1489.25G             2978.75G                            2979.25G
X4270 M3           EIGHTH       0.7266845703125T     0.7271728515625T     1.4542236328125T                    1.4547119140625T
                                744.125G             744.625G             1489.125G                           1489.625G
X4-2L              X4-2         2.908935546875T      2.909423828125T      5.8193359375T                       5.81982421875T
                                2978.75G             2979.25G             5959G                               5959.5G
X4-2L              EIGHTH       1.4542236328125T     1.4547119140625T     2.909423828125T                     2.909912109375T
                                1489.125G            1489.625G            2979.25G                            2979.75G
X5-2L              X5-2         5.82122802734375T    5.82171630859375T    FCC not available on this hardware  FCC not available on this hardware
X5-2L              EIGHTH       2.910369873046875T   2.910858154296875T   FCC not available on this hardware  FCC not available on this hardware
X6-2L              X6-2, X6-8   11.64312744140625T   11.64361572265625T   FCC not available on this hardware  FCC not available on this hardware
X6-2L              EIGHTH       5.821319580078125T   5.821807861328125T   FCC not available on this hardware  FCC not available on this hardware
X7-2L              X7-2         23.28692626953125T   23.28741455078125T   FCC-NA                              FCC-NA
X7-2L (all flash)  X7-2         2.3287353515625T     2.3287353515625T     FCC-NA                              FCC-NA

If the size is not as expected, some of the storage server flash memory may be configured as grid disks, or there may be a
hardware issue, or there may be a configuration issue.
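
The T and G figures in the expected size table describe the same quantity in different units (1T = 1024G), and the "with" and "without" Smart Flash Log columns differ by 0.5G, which is the 512M default flash log size. The following is a minimal sketch of the unit conversion, useful when comparing a reported T value with a G value from the table; the function name tb_to_gb is illustrative only.

```shell
# Hypothetical helper: converts a size such as "0.356201171875T" to gigabytes
# by multiplying by 1024 and prints the result with a "G" suffix.
tb_to_gb() {
  awk -v t="${1%T}" 'BEGIN { printf "%.10gG\n", t * 1024 }'
}

# Example usage on a storage server:
#   size=$(cellcli -e "list flashcache detail" | grep size | awk '{ print $2 }')
#   tb_to_gb "$size"
```

For the X4275 row, tb_to_gb 0.356201171875T yields 364.75G and tb_to_gb 0.356689453125T yields 365.25G, matching the table.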

After determining the root cause, refer to the Database Machine Owner's Guide and the Exadata Software User's Guide for the
appropriate corrective action steps.

NOTE: While not configuring the Exadata Smart Flash Log is permitted, it is recommended that the Exadata Smart Flash Log be
configured. If a decision is made not to create the Exadata Smart Flash Log, the expected size for the Smart Flash Cache is
shown in columns "Cache Size without Smart Flash Log" and "Cache Size with flashCacheCompression and no Smart Flash Log".

NOTE: On storage servers that use only flash memory devices (no spinning disks), the Exadata Smart Flash Cache size is the
same whether or not Exadata Smart Flash Log is created. Therefore, the order in which Exadata Smart Flash Log and Exadata
Smart Flash Cache are created does not matter.

NOTE: See also "Verify Exadata Smart Flash Log is Created".

Verify Exadata Smart Flash Cache status is "normal"

Priority Alert Date Owner Status Engine

Level
Exad
Critical FAIL 10/13/15 <Name> Production Exadata-Ma
SS
DB Exadata
DB Role Engineered System Platform OS & Version Validati
Version Version
X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2,X5- Linux x86-64 el5uek,
N/A N/A ALL exac
8 Linux x86-64 el6uek

Benefit / Impact:

Verifying that the Exadata Smart Flash Cache status is "normal" helps to avoid a performance degradation.

The impact of verifying that the Exadata Smart Flash Cache status is "normal" is minimal. The impact of restoring the Exadata
Smart Flash Cache status to "normal" varies, depending upon the reason for the abnormality, and cannot be estimated here.

Risk:

If the Exadata Smart Flash Cache status is not "normal", a performance degradation is likely.

Action / Repair:

To verify that the Exadata Smart Flash Cache status is "normal", as the root userid on each storage server, execute the following
command set:

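unset CACHE_STATE;
# Capture the status and strip the leading/trailing whitespace that cellcli prints
CACHE_STATE=$(cellcli -e "list flashcache attributes status" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//');
# Quote the variable so a multi-word status such as "normal - flushed" compares cleanly
if [ "$CACHE_STATE" = "normal" ]
then
  echo "SUCCESS: the Exadata Smart Flash Cache state is: $CACHE_STATE";
else
  echo "FAILURE: the Exadata Smart Flash Cache state is: $CACHE_STATE";
fi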

The expected output is:

SUCCESS: the Exadata Smart Flash Cache state is: normal

If the output is not as expected, investigate for root cause and correct the discovered cause.

NOTE: If the word "degraded" appears in the output, investigate the hardware condition as a memory module may have failed.

NOTE: If the word "flushed" appears in the output, a cache flush command was issued and was not subsequently cancelled. For
example:

FAILURE: the Exadata Smart Flash Cache state is: normal - flushed

In this condition, the Exadata Smart Flash Cache is not in use for cache operations of any type!

To cancel a flash cache flush operation, as the root userid on the storage server with the issue, execute the following command:

cellcli -e "alter flashcache all cancel flush"

The output should be:

Flash cache randomcel05_FLASHCACHE altered successfully

Verify Master (Rack) Serial Number is Set

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 03/02/11 X2-2(4170), X2-2, X2-8, X4-2 Linux 11.2.x + 11.2.x +

Benefit/Impact

Setting the Master Serial Number (MSN) (aka Rack Serial Number) assists Oracle Support Services to resolve entitlement issues
which may arise. The MSN is listed on a label on the front and the rear of the chassis
but is not electronically readable unless this value is set.

The impact to set the MSN is minimal.

Risk

Not having the MSN set for the system may hinder entitlement when opening Service Requests.

Action/Repair

Use the following command as the "root" userid to verify that all the MSN's are set correctly and match on all servers:

ipmitool sunoem cli "show /SP system_identifier" | grep "system_identifier ="

The output should resemble one of the following:

For X2-2(4170):

system_identifier = Sun Oracle Database Machine xxxxAKyyyy

For X2-2:

system_identifier = Exadata Database Machine X2-2 xxxxAKyyyy

For X2-8:

system_identifier = Exadata Database Machine X2-8 xxxxAKyyyy

(MSNs have one of two formats: either 4 numbers, the letters 'AK', and then 4 more numbers or letters A-F; or the letters 'AK'
followed by 8 numbers or letters A-F)
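
The format rule above can be expressed as a regular expression. The following is a minimal sketch, assuming upper-case letters as stated in the rule; the function name valid_msn is hypothetical and the check covers only the format, not whether the value matches the rack label.

```shell
# Hypothetical helper: succeeds if the argument matches one of the two MSN
# formats: NNNNAKxxxx or AKxxxxxxxx, where N is a digit and x is 0-9 or A-F.
valid_msn() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{4}AK[0-9A-F]{4}|AK[0-9A-F]{8})$'
}

# Example usage: validate the last word of the system_identifier line
#   id=$(ipmitool sunoem cli "show /SP system_identifier" | grep "system_identifier =")
#   valid_msn "${id##* }" && echo "MSN format OK" || echo "MSN format not recognized"
```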

On any server where the MSN is not set correctly, use the following command as the "root" userid to set it:

ipmitool sunoem cli 'set /SP system_identifier="text_identifier_string serial_number"'

Where "text_identifier_string" is one of:

For X2-2(4170): "Sun Oracle Database Machine"

For X2-2: "Exadata Database Machine X2-2"

For X2-8: "Exadata Database Machine X2-8"

and "serial_number" is the MSN from the label attached to the rack.

NOTE: The label with the Master Serial Number is located on the top left side wall (viewed from rear) inside the rack on the
rear of the chassis.

NOTE: In the command to set the Master Serial Number there is a space between the "text_identifier_string" and the
"serial_number".

Verify Management Network Interface (eth0) is on a Separate Subnet

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 03/02/11 X2-2(4170), X2-2, X2-8, X4-2 Linux 11.2.x + 11.2.x +

Benefit/Impact:

It is a requirement that the management network be on a different, non-overlapping subnet than the InfiniBand network and
the client access network. This is necessary for better network security, better client access bandwidth, and for Auto
Service Request (ASR) to work correctly.

The management network consists of the eth0 network interface in the database and storage servers, the ILOM network
interfaces of the database and storage servers, and the Ethernet management interfaces of the InfiniBand switches and
PDUs.

Risk:

Having the management network on the same subnet as the client access network will reduce network security, potentially
restrict the client access bandwidth to/from the Database Machine to a single 1GbE link,
and will prevent ASR from working correctly.

Action/Repair:

To verify that the management network interface (eth0) is on a separate network from other network interfaces, execute the
following command as the "root" userid on both storage and database servers:

grep -i network /etc/sysconfig/network-scripts/ifcfg* | cut -f5 -d"/" | grep -v "#"

The output will be similar to:

ifcfg-bondeth0:NETWORK=10.204.77.0
ifcfg-bondib0:NETWORK=192.168.76.0
ifcfg-eth0:NETWORK=10.204.78.0
ifcfg-lo:NETWORK=127.0.0.0

The expected result is that the network values are different. If they are not, investigate and correct the condition.
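
The comparison of the NETWORK values can also be scripted. The following is a minimal sketch, not part of the original check: the function name find_duplicate_networks is hypothetical, and it assumes its input is in the "ifcfg-name:NETWORK=value" form shown above. It only detects identical NETWORK values and does not evaluate netmasks.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not from the original note): reads lines
# such as "ifcfg-eth0:NETWORK=10.204.78.0" on stdin and prints every NETWORK
# value that is shared by more than one interface.
find_duplicate_networks() {
  awk -F'NETWORK=' 'NF == 2 { print $2 }' | sort | uniq -d
}

# Sample input copied from the expected output above. All four values differ,
# so nothing is printed.
printf '%s\n' \
  'ifcfg-bondeth0:NETWORK=10.204.77.0' \
  'ifcfg-bondib0:NETWORK=192.168.76.0' \
  'ifcfg-eth0:NETWORK=10.204.78.0' \
  'ifcfg-lo:NETWORK=127.0.0.0' | find_duplicate_networks
```

On a server, the output of the grep command above could be piped into the function; any line printed would identify a NETWORK value to investigate.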

Verify RAID disk controller CacheVault capacitor condition

Priority Alert Level Date Owner Status Engineered System Engineered System
Platform
Critical FAIL 08/08/18 Production SSC, Exadata - Physical, X5-2, X5-8, X6-2, X6-8, X7-2 284
Exadata - Management Domain 2749
229
DB Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version MAA
N/A N/A N/A N/A 18.1.0 or higher Linux exachk 18.4.0

Benefit/Impact:

The CacheVault capacitor loses its ability to support cache over time. Verifying the CacheVault capacitor condition helps to
reasonably time proactive replacement.

The impact of verifying the CacheVault capacitor condition is minimal. Replacing the CacheVault will require downtime for the
impacted server.

Risk:

A failed CacheVault capacitor will put the RAID controller into WriteThrough mode which significantly impacts write I/O
performance.

Action/Repair:

NOTE: This check is not applicable to Extreme Flash Oracle Exadata Storage Servers nor X7-8 Oracle Exadata
Database Servers as they contain no conventional disk drives!

Execute the following command as the "root" userid on all storage and database servers:

RAW_OUTPUT=$(/opt/MegaRAID/storcli/storcli64 /c0/cv show all)

if [[ $(echo "$RAW_OUTPUT" | egrep -i "^state" | egrep -ic optimal) -eq 1 ]]
then
echo -e "SUCCESS: raid controller CacheVault condition is optimal."
else
echo -e "FAILURE: raid controller CacheVault condition is not optimal. Details:\n\n$RAW_OUTPUT"
fi

The expected output should be:

SUCCESS: raid controller CacheVault condition is optimal.

If the output is a "FAILURE" message, upload the detailed information provided into a hardware service request for component
replacement.

Verify RAID Disk Controller Battery Condition

Priority Alert Date Owner Status Engineered System Engineered System

Level Platform
Critical FAIL 08/01/18 <Name> Production SSC, Exadata - Physical, X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5
Exadata - Management X7-2
Domain
DB DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version
Version
N/A N/A N/A N/A 18.1.0 or Linux exachk 18.4.0
higher

Benefit/Impact:

Maintaining optimal condition maximizes RAID controller battery life.

The impact of verifying RAID controller battery condition is minimal. The impact of correcting a non-optimal condition varies, and
may include a server shutdown to replace batteries.

Risk:

A non-optimal battery condition may place the RAID controller into WriteThrough mode which significantly impacts write I/O
performance.

Action/Repair:

NOTE: This check is not applicable to Extreme Flash Oracle Exadata Storage Servers nor X7-8 Oracle Exadata
Database Servers as they contain no conventional disk drives!

To verify the RAID controller battery condition, execute the following command as the "root" userid on all database and storage
servers:

RAW_OUTPUT=$(/opt/MegaRAID/storcli/storcli64 /c0/bbu show all)

if [[ $(echo "$RAW_OUTPUT" | egrep -i "battery state" | egrep -ic optimal) -eq 1 ]]
then
echo -e "SUCCESS: raid controller battery condition is optimal."
else
echo -e "FAILURE: raid controller battery condition is not optimal. Details:\n\n$RAW_OUTPUT"
fi

The expected output should be similar to:

SUCCESS: raid controller battery condition is optimal.

If the output is a "FAILURE" message, upload the detailed information provided into a hardware service request for component
replacement.

Verify Ambient Air Temperature

Date Owner Status Engineered System Bug(s)

Alert
Level

Critical Fail 03/16/16 <Name> Production Exadata - Physical,

Exadata - Management
Domain
DB DB Role Engineered System Platform Exadata OS & Validation Tool TBD
Version Version Version Version

Benefit / Impact:

Maintaining ambient air temperature conditions within design specification for an Oracle Exadata Database Machine helps to
achieve maximum efficiency and targeted component service lifetimes.

The impact of verifying the ambient air temperature is minimal. The impact of correcting ambient air temperatures outside of
design specification range will vary depending upon the root cause of the issue.

Risk:

Ambient air temperatures outside the design specification range affect all components within the chassis of an Oracle Exadata
Database Machine, possibly manifesting performance problems and shortened service lifetimes.

Action / Repair:

To verify the ambient air temperature, execute the following command set as the "root" userid on each storage and database
server:

unset AMBIENT_TEMP;
AMBIENT_TEMP=$(ipmitool sunoem cli "show /SYS/T_AMB" | grep value | sed -e 's/^[ \t]*//;s/[ \t]*$//' | cut -d" " -f3);
if [[ "${AMBIENT_TEMP//./}" -ge 5000 && "${AMBIENT_TEMP//./}" -le 32000 ]]
then
echo "SUCCESS: Ambient air temperature is within the range of 5 to 32 degrees Centigrade: $AMBIENT_TEMP";
else
echo -e "FAILURE: Ambient air temperature is outside the range of 5 to 32 degrees Centigrade: $AMBIENT_TEMP";
fi;

The output should be similar to:

SUCCESS: Ambient air temperature is within the range of 5 to 32 degrees Centigrade: 27.250

If the ambient air temperature is not within the recommended range, investigate for root cause and take appropriate corrective
action.

NOTE: Since there is no one sensor in the physical rack for overall ambient temperature of the data center air, this check reads
the ambient temperature from each storage and database server.

Verify Platform Configuration and Initialization Parameters for Consolidation

Platform Consolidation Considerations

Consolidation Parameters Reference Table

Critical, 08/02/11

Benefit / Impact: Experience and testing have shown that certain database initialization parameter settings should use the
following formulas for platform consolidation. By using these formulas as recommended, known
problems may be avoided and performance maximized.

The performance related settings provide guidance to maintain highest stability without sacrificing performance. Changing the
default performance settings can be done after careful performance evaluation and clear
understanding of the performance impact.

Risk: If the operating system and database parameters are not set as recommended, a variety of issues may be encountered
that can lead to system and database instability.

Action / Repair: To verify the database initialization parameters, use the following guidance:

The following are important platform level considerations in a consolidated environment.

Operating System Configuration Recommendations

Hugepages, when set, should equal the sum of shared memory from all databases. See MOS Note 401749.1 for
precise computations and MOS Note 361323.1 for a description of Hugepages.
Hugepages is generally required if "PageTables" in /proc/meminfo is > 2% of physical memory.
Benefits: Memory savings. Prevents the paging and swapping that can occur when Hugepages is not configured.
Tradeoffs: Hugepages must be set correctly and adjusted whenever an instance is added or dropped or
when SGA sizes change.
As of 11.2.0.2, to disable hugepages on an instance, set parameter "use_large_pages=false".
Note that as of the onecommand version that supports 11.2.0.2 BP9, hugepages is automatically configured upon
deployment. The vm.nr_hugepages value may need to be adjusted if an instance's memory parameters are
changed after initial deployment.
Amount of locked memory - 75% of physical memory
Number of Shared Memory Identifiers - set greater than the number of databases
Size of Shared Memory Segments - OS setting for max size = 85% of physical memory
Number of semaphores - the sum of processes cannot exceed the maximum number of semaphores. On Linux, the max
can be obtained with cat /proc/sys/kernel/sem | awk '{print $2}'.
The number of semaphores on the system should not be so high that maximizing the number of running Oracle
processes causes performance problems.

Number of semaphores in a semaphore set: The number of semaphores in a semaphore set must be at least as
high as the largest value for the processes parameter in all databases. On Linux, the number of semaphore
sets can be obtained with cat /proc/sys/kernel/sem | awk '{print $4}'
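
The two kernel values referenced in this list can be read with a short script. This is a sketch under stated assumptions: the function names pagetables_pct and show_sem_limits are hypothetical, and the optional file argument exists only so the functions can be exercised against sample data. The field names printed for /proc/sys/kernel/sem (SEMMSL, SEMMNS, SEMOPM, SEMMNI) are the standard Linux names for the four fields in that file.

```shell
#!/bin/bash
# Hypothetical helpers (assumptions, not from the original note).

# Print PageTables as a percentage of MemTotal (both are reported in kB).
pagetables_pct() {
  awk '/^MemTotal:/   { total = $2 }
       /^PageTables:/ { pt = $2 }
       END { if (total > 0) printf "%.2f\n", pt * 100 / total }' "${1:-/proc/meminfo}"
}

# Print the four kernel.sem fields with their conventional names:
# field 2 (SEMMNS) is the system-wide maximum number of semaphores,
# field 4 (SEMMNI) is the maximum number of semaphore sets.
show_sem_limits() {
  awk '{ printf "SEMMSL=%s SEMMNS=%s SEMOPM=%s SEMMNI=%s\n", $1, $2, $3, $4 }' \
    "${1:-/proc/sys/kernel/sem}"
}

# Only report when running on a Linux host that exposes both files.
if [ -r /proc/meminfo ] && [ -r /proc/sys/kernel/sem ]; then
  echo "PageTables is $(pagetables_pct)% of physical memory"
  show_sem_limits
fi
```

A PageTables percentage above 2 would match the condition under which the list above says Hugepages is generally required.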

Applications with similar SLA requirements are best suited to co-exist in a consolidated environment. Do not mix
mission critical applications with non mission critical applications in the same consolidated environment. Do not mix
production and test/dev databases in the same environment.
It is possible to over-subscribe an application's resource requirements in a consolidated environment as long as the
other applications under-subscribe at that time. The exception to this is mission critical applications. Do not
over-subscribe in a consolidated environment that contains mission critical applications.
Oracle Resource Manager can be used to manage varying degrees of IO and CPU requirements within one database and
across databases. Within one database, Oracle Resource Manager can also manage parallel query processing.

Consolidation Parameters Reference Table

The performance related recommendations provide guidance to maintain highest stability without sacrificing performance.
Changing these performance settings can be done after careful performance
evaluation and clear understanding of the performance impact.

This parameter consolidation health check table is a general reference for environments. This is not a hard prerequisite for a
consolidated environment, rather a guideline used to establish
the formulas, maximum values, and notes below. It should suffice for most customers, but if you do not qualify for this formula,
the table below can be used as a reference solely for important
parameters that must be considered. These values are per node.

Parameter: sga_target / pga_aggregate_target (11.2*), sga_target / pga_aggregate_limit (12.1*)
Formula:
11.2, OLTP: Sum of all sga_target and pga_aggregate_target for all databases < 75% of physical memory
11.2, DW/BI: Sum of sga_target + (pga_aggregate_target x 3) < 75% of physical memory
12.*, both OLTP and DW/BI: Sum of sga_target + pga_aggregate_limit < 75% of physical memory
Max: 75% of total memory
Notes:
Check aforementioned formula. Excee...
usage can potentially cause performan...
also ensure that the value computed f...
the application using the associated d...
Pga_aggregate_target setting does no...
usage. For some data warehouse and...
target has been observed. For OLTP a...
much less. The 25% room provides in...
spill over and for non-SGA/PGA memo...
and non-memory allocations can add...
some cases. Monitoring application an...
required to ensure there's sufficient m...
workload/business cycles. Oracle reco...
free at all times.
In 12c, new parameter pga_aggregate...
enforces a maximum PGA usage so th...
should be used in calculations. pga_ag...
pga_aggregate_target and defaults to...
the pga_aggregate_target setting.

DBM Machine Type: Memory Available

DBM V2 | 72 GB | 54 GB
X2-2 | 96 GB | 60.8 GB can be expand...
X2-8 | 1 TB | 768 GB
X3-2 | 256 GB | 192 GB
X3-8 | 2 TB | 1536 GB
X4-2 | 512 GB | 384 GB
X4-8 | 6 TB | 4608 GB
X5-2 | 1 TB | 768 GB

Parameter: cpu_count
Formula:
For mission critical applications: Sum of cpu_count of all databases <= 75% X Total CPUs
Alternatively: For light-weight CPU usage applications, sum(CPU_COUNT) <= 3 X CPUs, and for CPU intensive
applications, sum(CPU_COUNT) <= Total CPUs
Max: Refer to the formulas in the previous column
Notes:
Rules of thumb:
1. Leverage CPU_COUNT and instance...
consolidation (e.g. managing multiple...
DBM). They are particularly helpful in...
from over-consuming target CPU reso...
2. Most light weight applications are id...
3. Large reporting/DW/BI and some O...
intensive applications) can easily cons...
to be bounded with instance caging a...
4. For consolidating mission critical ap...
over-subscribing CPU resources to ma...
performance consistency.
For additional guidance and precautio...
1362445.1>

Exadata DBM | # Cores | # CPUs

DBM V2 | 8 CPUs | 16 CPUs
X2-2 | 12 CPUs | 24 CPUs
X2-8 | 64 CPUs | 128 CPUs

Parameter: resource_manager_plan
Formula: NA
Max: NA
Notes:
Ensure this is enabled. A good starting...

Parameter: processes
Formula: Sum of processes of all databases < max
Max: Number of semaphores on the system
Notes:
Check formula. Alert if > max
Alert if # Active Processes > 4 X CPUs
Sum (all processes for all instances) <...

Parameter: Parallel parameters
Formula / Max: Automatic
Notes:
Adjusting CPU_COUNT parameter for...
resource management will automatica...
PARALLEL_MAX_SERVERS and PARAL...
parameter values provided these are n...
parameter file.

Parameter: db_recovery_file_dest_size
Formula: Sum of db_recovery_file_dest_size <= Fast Recovery Area
Max: Size of Usable Fast Recovery Area
Notes:
Check formula; Usable FRA space sub...
other files such as online log files in th...
high redundancy diskgroups
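
As an illustration of the 11.2 OLTP memory formula in the table above, the following sketch sums sga_target and pga_aggregate_target across databases and compares the total with 75% of physical memory. The function name check_memory_budget, its argument format, and the per-database sizes are all hypothetical and chosen only for the example; real values would come from each instance's parameters.

```shell
#!/bin/bash
# Hypothetical sketch (an assumption, not from the original note) of the
# 11.2 OLTP formula:
#   sum(sga_target + pga_aggregate_target) < 75% of physical memory
# Arguments: physical memory in GB, then one "sga:pga" pair (in GB) per database.
check_memory_budget() {
  local phys_gb=$1; shift
  local total=0 pair
  for pair in "$@"; do
    total=$(( total + ${pair%%:*} + ${pair##*:} ))
  done
  # Compare total*100 with phys*75 so the check stays in integer arithmetic.
  if (( total * 100 < phys_gb * 75 )); then
    echo "OK: ${total} GB allocated, under 75% of ${phys_gb} GB"
  else
    echo "OVER: ${total} GB allocated, not under 75% of ${phys_gb} GB"
  fi
}

# Illustrative values only: a node with 256 GB of memory and three databases.
check_memory_budget 256 64:16 48:12 32:8
```

For a DW/BI workload on 11.2 the same structure would apply with the pga_aggregate_target term multiplied by 3, and for 12c with pga_aggregate_limit in place of pga_aggregate_target, per the formulas in the table.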

Verify operating system hugepages count satisfies total SGA requirements

Priority Alert Date Owner Status Engineered System Bug(s)

Level

Critical FAIL 10/23/15 <Name> Production Exadata - Physical

Exadata - Management
Domain

DB DB Role Engineered System Platform Exadata OS & Version Validation Tool TBD
Version Version Version

N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, ALL Linux x86-64 exachk 12.1.0.2.6
X4-8, X5-2 el5uek
Linux x86-64
el6uek

Benefit / Impact:

Properly configuring operating system hugepages on Linux and setting the database initialization parameter "use_large_pages"
to "only" results in more efficient use of memory and reduced paging.

The impact of validating that the total current hugepages are greater than or equal to estimated requirements for all currently
active SGAs is minimal. The impact of corrective actions will vary depending on the specific configuration, and because the
hugepages pool must be contiguous, it is recommended to reboot the database server.

Risk:

The risk of not correctly configuring operating system hugepages in advance of setting the database initialization parameter
"use_large_pages" to "only" is that if not enough huge pages are configured, some databases will not start after you have set
the parameter.

Action / Repair:

PREREQUISITE: All database instances that are supposed to run concurrently on a database server must be up
and running for this check to be accurate.

To verify that the total number of configured hugepages is greater than or equal to the estimated requirements of all currently
active SGAs using large pages, as the root user copy the following block of commands to a shell script (e.g.,
/tmp/hugepages_calculation.sh) and execute it.

#!/bin/bash

# Total number of hugepages currently configured in the running kernel
TOTAL_HUGEPAGES=$(grep HugePages_Total /proc/meminfo | cut -d":" -f 2 | sed -e 's/^[ \t]*//;s/[ \t]*$//')

# Size of one hugepage in kB
HPG_SZ=$(grep Hugepagesize /proc/meminfo | awk '{print $2}')
NUM_PG=0
# Shared memory segments mapped by the mdb_pmon_ process are excluded from the estimate
MGMT_PID=$(/usr/bin/pgrep -f mdb_pmon_)
if [ $? -eq 0 ]; then
    MGMT_SEGIDS=$(grep SYSV /proc/${MGMT_PID}/maps | awk '{print $5}' | uniq)
else
    MGMT_PID=0
fi
# One "shmid:bytes" entry per shared memory segment
IPCARR=($(ipcs -m | grep "^0x" | awk '{ print $2":"$5}'))
for SEGIDBYTES in "${IPCARR[@]}"
do
    SEG_ID=${SEGIDBYTES%:*}
    SEG_BYTES=${SEGIDBYTES##*:}
    if [[ $MGMT_PID -eq 0 || ! "$MGMT_SEGIDS" =~ "$SEG_ID" ]]; then
        # Whole hugepages needed by this segment, plus one for any remainder
        MIN_PG=$(echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q)
        if [ $MIN_PG -gt 0 ]; then
            NUM_PG=$(echo "$NUM_PG+$MIN_PG+1" | bc -q)
        fi
    fi
done
if [ $TOTAL_HUGEPAGES -ge $NUM_PG ]
then echo -e "\nSUCCESS: Total current hugepages ($TOTAL_HUGEPAGES) are greater than or equal to"
echo -e " estimated requirements for all currently active SGAs ($NUM_PG).\n"
else echo -e "\nFAILURE: Total current hugepages ($TOTAL_HUGEPAGES) should be greater than or equal to"
echo -e " estimated requirements for all currently active SGAs ($NUM_PG).\n"
fi

The output should be similar to:

SUCCESS: Total current hugepages (13004) are greater than or equal to

estimated requirements for all currently active SGAs (632).

If the output is not "SUCCESS", investigate and correct the condition.

NOTE: Please refer to My Oracle Support notes MOS 401749.1, 361323.1, and 1392497.1 for additional details on configuring
hugepages.

NOTE: If you have not reviewed notes 401749.1, 361323.1, and 1392497.1 and followed their guidance BEFORE using the
database parameter "use_large_pages=only", this check will pass the environment but you will still not be able to start instances
once the configured pool of operating system hugepages has been consumed by instance startups. If that should happen, you
will need to change the "use_large_pages" initialization parameter to one of the other values, restart the instance, and follow the
instructions in notes 401749.1 and 361323.1. The brute force alternative is to increase the huge page count until the newest
instance will start, and then adjust the huge page count after you can see the estimated requirements for all currently active
SGAs.

NOTE: While it is possible to modify the number of hugepages in active memory in the running kernel, it is not recommended for
two reasons:
1) The hugepages pool must be contiguous, and it may not be possible to find enough contiguous pages to meet a request in
the running kernel active memory.
2) Setting the value in the kernel configuration files and rebooting ensures the expected number of hugepages is properly
configured and available. Misconfigurations in this area can impact server availability so following this operational best practice
prevents an unexpected outage caused by user error.

Verify "MaxStartups 100" in /etc/ssh/sshd_config on all database servers

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 03/21/12 X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux 11.2.0.3+ 11.2.0.3 +

Benefit / Impact:

Configuring "MaxStartups 100" helps to avoid the risk of certain cluster operations failing for clusters containing more than 10
database servers.
Cluster operations examples include installing or upgrading the grid infrastructure, and adding a cluster node.

The impact of verifying "MaxStartups 100" is minimal. The impact of correcting the setting is moderate, requiring a restart of the
sshd service.

Risk:

With "MaxStartups" configured at the default value (10), certain cluster operations for clusters containing more than 10 database
servers may fail.
For example, if the Oracle Universal Installer (OUI) calls the Cluster Verification Utility (CVU) and CVU starts ssh sessions across
all nodes concurrently, the operation fails because more than 10 concurrent ssh connections are required.

Action / Repair:

To verify that "MaxStartups 100" is set in /etc/ssh/sshd_config file, execute the following command as the "root" userid on the
node where deploy112.sh was executed:

dcli -g /opt/oracle.SupportTools/onecommand/dbs_group -l root "egrep -i maxstartups /etc/ssh/sshd_config"

The output should be similar to:

randomdb01: MaxStartups 100

<output truncated>
randomdb16: MaxStartups 100

If the output is not as expected, as the root userid on each database server, edit the sshd_config file to include "MaxStartups
100" and restart the ssh service with the "service sshd restart" command.

Verify all datafiles have "AUTOEXTEND" attribute "ON"

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux, Solaris 11.2.x + 11.2.x +

Benefit / Impact

The benefit of having "AUTOEXTEND" on is that applications may avoid out of space errors.

The impact of verifying that the "AUTOEXTEND" attribute is "ON" is minimal. The impact of setting "AUTOEXTEND" to "ON"
varies depending upon if it is done during database creation, file addition to a tablespace, or added to an existing file.

Risk

The risk of running out of space in either the tablespace or diskgroup varies by application and cannot be quantified here. A
tablespace that runs out of space will interfere with an application, and a diskgroup running out of space could impact the
entire database as well as ASM operations (e.g., rebalance operations).

Action / Repair

To obtain a list of tablespaces that are not set to "AUTOEXTEND", enter the following sqlplus command logged into the database
as sysdba:

select file_id, file_name, tablespace_name from dba_data_files where autoextensible <>'YES'

union
select file_id, file_name, tablespace_name from dba_temp_files where autoextensible <> 'YES';

The output should be:

no rows selected

If any rows are returned, investigate and correct the condition.

NOTE: Configuring "AUTOEXTEND" to "ON" requires comparing space utilization growth projections at the
tablespace level to space available in the diskgroups to permit the expected
projected growth while retaining sufficient storage space in reserve to account for ASM rebalance operations that
occur either as a result of planned operations or component failure.
The resulting growth targets are implemented with the "MAXSIZE" attribute that should always be used in
conjunction with the "AUTOEXTEND" attribute. The "MAXSIZE" settings should
allow for projected growth while minimizing the prospect of depleting a disk group. The "MAXSIZE" settings will
vary by customer and a blanket recommendation cannot be given here.

NOTE: When configuring a file for "AUTOEXTEND" to "ON", the size specified for the "NEXT" attribute should cover
all disks in the diskgroup to optimize balance. For example,
with a 4MB AU size and 168 disks, the size of the "NEXT" attribute should be a multiple of 672M (4*168).
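
The arithmetic in the note above can be expressed as a one-line calculation. This is a minimal sketch; the function name next_base_mb is hypothetical, and the inputs are the AU size in MB and the number of disks in the diskgroup.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not from the original note): smallest
# NEXT size, in MB, that covers every disk in the diskgroup with one
# allocation unit. NEXT should be a multiple of this value.
next_base_mb() {
  local au_mb=$1 disks=$2
  echo $(( au_mb * disks ))
}

# The example from the note: 4 MB AU size and 168 disks.
next_base_mb 4 168   # prints 672
```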

Enable portmap service if app requires it

By default, the portmap service is not enabled on the database nodes and it is required for things such as NFS. If needed,
enable and start it using the following with dcli across required nodes:
chkconfig --level 345 portmap on

service portmap start

Enable proper services on database nodes to use NFS

In addition to the portmap service previously explained, the nfslock service must also be enabled and running to use NFS on
database nodes. Below is a working example, showing the errors that will be encountered with
various utilities if it is not set up correctly. MOS Note 359515.1 can also be referenced.

SQL> create tablespace nfs_test_on_nfs datafile '/shared/dscbbg02/users/user/nfs_test/nfs_test_on_nfs.dbf' size 16M;

create tablespace nfs_test_on_nfs datafile '/shared/dscbbg02/users/user/nfs_test/nfs_test_on_nfs.dbf' size 16M

ERROR at line 1:

ORA-01119: error in creating database file

'/shared/dscbbg02/users/user/nfs_test/nfs_test_on_nfs.dbf'

ORA-27086: unable to lock file - already in use

Linux-x86_64 Error: 37: No locks available

Additional information: 10

Elapsed: 00:00:30.08

SQL> create tablespace nfs_test datafile '+D/user/datafile/nfs_test.dbf' size 16M;

Tablespace created.

SQL> create table nfs_test(n not null) tablespace nfs_test as select rownum from dual connect by rownum < 1e5 + 1;

Table created.

SQL> alter tablespace nfs_test read only;

Tablespace altered.

SQL> create directory nfs_test as '/shared/dscbbg02/users/user/nfs_test';

Directory created.

SQL> create table nfs_test_x organization external(type oracle_datapump default directory nfs_test location('nfs_test.dp')) as
select * from nfs_test;

create table nfs_test_x organization external(type oracle_datapump default directory nfs_test location('nfs_test.dp')) as select *
from nfs_test

ERROR at line 1:

ORA-29913: error in executing ODCIEXTTABLEPOPULATE callout

ORA-31641: unable to create dump file

"/shared/dscbbg02/users/user/nfs_test/nfs_test.dp"

ORA-27086: unable to lock file - already in use

Linux-x86_64 Error: 37: No locks available

Additional information: 10

Elapsed: 00:00:31.17

$ expdp userid=scott/tiger parfile=nfs_test.par

Export: Release 11.2.0.1.0 - Production on Wed Jun 2 10:44:51 2010

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,

Data Mining and Real Application Testing options

ORA-39001: invalid argument value

ORA-39000: bad dump file specification

ORA-31641: unable to create dump file "/shared/dscbbg02/users/user/nfs_test/nfs_test.dmp"

ORA-27086: unable to lock file - already in use

Additional information: 10

RMAN works:

$ rman target=/

Recovery Manager: Release 11.2.0.1.0 - Production on Wed Jun 2 10:46:40 2010

connected to target database: USER (DBID=3710096878)

RMAN> backup as copy datafile '+D/user/datafile/nfs_test.dbf' format '/shared/dscbbg02/users/user/nfs_test/nfs_test.dbf';

Starting backup at 20100602104700

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=204 device type=DISK

channel ORA_DISK_1: starting datafile copy

input datafile file number=00007 name=+D/user/datafile/nfs_test.dbf

output file name=/shared/dscbbg02/users/user/nfs_test/nfs_test.dbf tag=TAG

channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:01

Finished backup at 20100602104702

The solution is to ensure that the nfslock service (aka rpc.statd) is running:

# service nfslock status

rpc.statd (pid 10795) is running...

Of course, you'd want to enable the service via chkconfig too.

Be Careful when Combining the InfiniBand Network across Clusters and Database Machines

Priority Added Machine Type OS Type Exadata Version Oracle Version

N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux 11.2.x + 11.2.x +

If you want multiple database machines to run as separate environments yet be connected through the InfiniBand network,
please be aware of the following items especially when the database machines
were deployed as separate environments.

The cell name, cell disk name, grid disk name, ASM diskgroup name, and ASM failgroup name should be unique to help avoid
accidental damage during maintenance operations. For example do not have
diskgroup DATA on both database machines, call them DATA_DM01 and DATA_DM02.

IP Addresses

Priority Added Machine Type OS Type Exadata Version Oracle Version

N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux 11.2.x + 11.2.x +

All nodes on the InfiniBand network must have a unique IP address. When an Oracle Database Machine is deployed, the default
InfiniBand network is 192.168.10.x and we start with 192.168.10.1.
If you used the default IP address on each Database Machine, you will have duplicate IP addresses. You must modify the IP
addresses on one of the machines before re-configuring the InfiniBand Network.
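
Before the InfiniBand networks are combined, duplicate addresses can be found by comparing the address lists of the machines involved. The sketch below is illustrative only: the function name find_duplicate_ips is hypothetical, and it assumes each input file lists one IP address per line. How those lists are produced is site specific and is not covered here.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not from the original note): given files
# that each list one IP address per line, print every address that occurs
# more than once across all of them.
find_duplicate_ips() {
  cat "$@" | sort | uniq -d
}

# Illustrative data: both machines were deployed with the default 192.168.10.x
# network, so 192.168.10.1 appears in both lists.
dm01=$(mktemp)
dm02=$(mktemp)
printf '192.168.10.1\n192.168.10.2\n' > "$dm01"
printf '192.168.10.1\n192.168.10.3\n' > "$dm02"
find_duplicate_ips "$dm01" "$dm02"   # prints 192.168.10.1
rm -f "$dm01" "$dm02"
```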

Ensure any additional equipment ordered from Oracle is marked for an Oracle Exadata Database Machine and the hardware
engineer is using the correct Multi-rack Cabling when the physical InfiniBand network is modified.

After the hardware engineer has modified the network, ensure that network is working correctly by running verify topology and
infinicheck. Infinicheck will create load on the system and should not be run when
there is active workload on the system. Note: Infinicheck will need an input file of all IP addresses on the network.

For example, create a temporary file in /tmp that contains all cells for both database machines. Pass this file to the infinicheck
command using the -c option. Also pass the -b option.

#cd /opt/oracle.SupportTools/ibdiagtools

#./verify-topology -t fattree

#./infinicheck -c /tmp/combined_cellip.ora -b

CELLIP.ORA

Priority Added Machine Type OS Type Exadata Version Oracle Version

N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux,Solaris 11.2.x + 11.2.x +

The cellip.ora file in each database node of each cluster should only reference cells in use by that respective cluster.

Set fast_start_mttr_target=300 to optimize run time performance of writes

Priority Added Machine Type OS Type Exadata Version Oracle Version

N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8 Linux 11.2.x + 11.2.x +

The deployment default for fast_start_mttr_target as of 12/22/2010 is 60. To optimize run time performance for write/redo
generation intensive workloads, increase fast_start_mttr_target to 300.
This will reduce checkpoint writes from DBWR processes, making more room for LGWR IO. The trade-off is that instance
recovery will run longer, so if instance recovery is more important than performance,
then keep fast_start_mttr_target low. Also keep in mind that an application with inadequately sized redo logs will likely not see
an effect from this change due to frequent log switches.

Considerations for direct writes in a data warehouse type of application: Even though direct operations do not use the buffer
cache, fast_start_mttr_target is very effective at controlling crash recovery time because
it ensures adequate checkpointing for the few buffers that are resident (e.g., undo segment headers). fast_start_mttr_target
should be set to the desired RTO (Recovery Time Objective) while still maintaining performance SLAs.

Enable auditd on database servers

Priority Added Machine Type OS Type Exadata Version Oracle Version

N/A X2-2(4170), X2-2, X2-8, X4-2 Linux 11.2.x + 11.2.x +

On database servers, when auditing is configured, as is done automatically by applying convenience pack 11.2.2.2.0 or higher,
the audit records are logged in /var/log/messages if the auditd service is not running.
Logging these records to /var/log/messages may cause more frequent rotation of the messages file, which may result in
losing historical data more quickly than necessary or desired. When auditd is enabled, audit records
are sent to /var/log/audit/audit.log, which is rotated and managed separately using settings in /etc/audit/auditd.conf.

The best practice is to run the auditd service whenever auditing is configured during kernel bootup by setting audit=1 on the
kernel line in /boot/grub/grub.conf, as shown here:

title Trying_LABEL_DBSYS
root (hd0,0)
kernel /vmlinuz-2.6.18-194.3.1.0.2.el5 root=LABEL=DBSYS ro bootarea=dbsys loglevel=7 panic=60 debug rhgb audit=1 numa=off console=ttyS0,1
initrd /initrd-2.6.18-194.3.1.0.2.el5.img

To configure auditd to be enabled, run the following commands as root on each database server:

chkconfig auditd on
chkconfig --list auditd
auditd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
service auditd start
service auditd status
auditd (pid 32582) is running...

Verify AUD$ and FGA_LOG$ tables use Automatic Segment Space Management

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 02/27/2012 X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux, Solaris 11.2.x + 11.2.x +

Benefit / Impact:

With AUDIT_TRAIL set for database (AUDIT_TRAIL=db), and the AUD$ and FGA_LOG$ tables located in a dictionary segment
space managed SYSTEM tablespace, "gc" wait events are sometimes observed
during heavy periods of database logon activity. Testing has shown that under such conditions, placing the AUD$ and FGA_LOG$
tables in the SYSAUX tablespace, which uses automatic segment space management,
reduces the space related wait events.

The impact of verifying that the AUD$ and FGA_LOG$ tables are in the SYSAUX tablespace is low. Moving them if they are not
located in SYSAUX does not require an outage, but should be done during a
scheduled maintenance period or a window of slow audit record generation.

Risk:

If the AUD$ and FGA_LOG$ tables are not verified to use automatic segment space management, there is a risk of a performance
slowdown during periods of high database login activity.

Action / Repair:

To verify the segment space management policy currently in use by the AUD$ and FGA_LOG$ tables, use the following SQL*Plus
command:

select t.table_name,ts.segment_space_management from dba_tables t, dba_tablespaces ts where ts.tablespace_name = t.tablespace_name and t


The output should be:

TABLE_NAME SEGMEN
------------------------------ ------
FGA_LOG$ AUTO
AUD$ AUTO

If one or both of the AUD$ or FGA_LOG$ tables return "MANUAL", use the DBMS_AUDIT_MGMT package to move them to the
SYSAUX tablespace:

BEGIN
DBMS_AUDIT_MGMT.set_audit_trail_location(audit_trail_type => DBMS_AUDIT_MGMT.AUDIT_TRAIL_AUD_STD,--this moves tab
BEGIN
DBMS_AUDIT_MGMT.set_audit_trail_location(audit_trail_type => DBMS_AUDIT_MGMT.AUDIT_TRAIL_FGA_STD,--this moves tab
END;
/

The output should be similar to:

PL/SQL procedure successfully completed.

If the output is not as above, investigate and correct the condition.

NOTE: The "DBMS_AUDIT_MGMT.set_audit_trail_location" command should be executed as part of the dbca template post
processing scripts. For existing databases the command can also be executed,
but because it moves the AUD$ and FGA_LOG$ tables using an "alter table ... move" command, it should be run at a
"quiet" time.

Use dbca templates provided for current best practices

Priority Added Machine Type OS Type Exadata Version Oracle Version

N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux, Solaris 11.2.x + 11.2.x +

Benefit / Impact:

Starting with onecommand version 11.2.2.3.1, dbca templates with built-in best practices are provided at deployment time for OLTP,
DW/BI, and DBFS.
The database created at deployment time uses one of these templates. If other databases are created, the templates should be
used to ensure that current database configuration best practices are implemented. If custom scripts are used to create databases, the templates
can be used as a reference for those custom scripts.

Risk:

Not adhering to best practices can lead to unnecessary outages and performance problems.

Action / Repair:

Run a health check to assess differences from current best practices. Check the configuration assistant logs for template use.

Updating database node OEL packages to match the cell

MOS Note 1284070.1 provides a working example of updating the db host OEL packages to match those on the cell.

Disable cell level flash caching for grid disks that don't need it when using Write Back Flash Cache

Priority Added Machine Type OS Type Exadata Version Oracle Version

n/a August 2012 X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux 11.2.3.2+ 11.2.x +

Benefit / Impact

When using Write Back Flash Cache, disabling caching for grid disks that don't need it frees up cache space for more important
objects. The classic use case for this is grid disks in the RECO diskgroup. Note that Exadata already has the intelligence to not cache objects
that don't need it, but this extends that to the grid disk level in a Write Back Flash Cache configuration.

Risk:

Cache pollution (less caching benefit) leading to performance impact.

Action / Repair:

The following cellcli command displays the cell caching mode. It should be "WriteBack" for this best practice.

list cell attributes flashCacheMode

The following cellcli command displays the caching mode for all grid disks on a cell. A cachingPolicy of "none" indicates caching
is turned off for that particular grid disk.

list griddisk attributes name,cachingPolicy

To disable caching for a particular grid disk, first flush the cache data for that grid disk, and then set the cachingPolicy attribute to
"none", as illustrated in the cellcli commands below:

alter griddisk <grid disk name> flush

alter griddisk <grid disk name> cachingPolicy="none"

If caching needs to be enabled again after these steps, first cancel the prior flush, and then set the cachingPolicy attribute back
to "default", as illustrated in the cellcli commands below:

alter griddisk <grid disk name> cancel flush

alter griddisk <grid disk name> cachingPolicy="default"
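
To find grid disks that are still being cached, the listing from "list griddisk attributes name,cachingPolicy" can be filtered by name prefix. The following is a minimal sketch under stated assumptions: the function name griddisks_still_cached is hypothetical, the listing is assumed to be whitespace-separated name and policy columns, and the RECO prefix and disk names in the usage line are only illustrative.

```shell
#!/bin/bash
# Hypothetical helper: read "name cachingPolicy" pairs on stdin and print the
# names of grid disks that start with the given prefix and whose caching
# policy is anything other than "none". Optional double quotes around the
# policy value are stripped before comparing.
griddisks_still_cached() {
    local prefix="$1"
    awk -v p="$prefix" '{
        pol = $2
        gsub(/"/, "", pol)
        if (index($1, p) == 1 && pol != "none") print $1
    }'
}

# Example usage on a storage server (assumes cellcli is on the PATH):
# cellcli -e "list griddisk attributes name,cachingPolicy" | griddisks_still_cached RECO
```

Each name printed is a candidate for the flush and cachingPolicy steps described above.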

Gather system statistics in Exadata mode if needed

Priority Added Machine Type OS Type Exadata Version Oracle Version

n/a August 2012 X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux 11.2.x + 11.2.0.2 BP18 and 11.2.0.3 BP8

Benefit / Impact

Gathering Exadata specific system statistics ensures the optimizer is aware of Exadata scan speed. Accurately accounting for the
speed of scan operations ensures that the optimizer chooses an optimal execution plan in an Exadata environment. The following
command gathers Exadata specific system statistics:

exec dbms_stats.gather_system_stats('EXADATA');

Note that this best practice is not a general recommendation to gather system statistics in Exadata mode for all Exadata
environments.
For existing customers who have acceptable performance with their current execution plans, do not gather system statistics in Exadata mode.
For existing customers whose cardinality estimates are accurate, but who suffer from the optimizer overestimating the cost of a full table scan where the full scan performs
better, gather system statistics in Exadata mode.
For new applications where the impact can be assessed from the beginning, and dealt with easily if
there is a problem, gather system statistics in Exadata mode.

Risk:

Lack of Exadata specific stats can lead to less performant optimizer plans.

Action / Repair:

To see if Exadata specific optimizer stats have been gathered, run the following query on a system with at least 11.2.0.2 BP18 or
11.2.0.3 BP8 Oracle software. If PVAL1 returns null or is not set, Exadata specific stats have not been gathered.

select pname, PVAL1 from aux_stats$ where pname='MBRC';

Verify Hidden Database Initialization Parameter Usage

Priority Alert Level Date Owner Status Engineered System
Critical FAIL 08/01/18 <Name> Production Exadata - Physical, Exadata - User Domain

DB Version DB Role Engineered System Platform Exadata Version OS & Version Validation Tool Version
11.2.x, 12.1.x Primary, Standby, ASM ALL 11.2.3.+ Linux x86-64 exachk 12.2.0.1.4, 18.3.0

Benefit / Impact

Hidden database initialization parameters are typically set as a workaround to solve a specific problem, and should be removed
once a system has been upgraded to a version level that contains the fix for the specific problem. Often they are not removed
during the upgrade process to the version level that contains the correct fix. Verifying the hidden database initialization
parameter usage helps avoid hidden parameters being used any longer than necessary.

Risk:

Use of hidden ASM or database initialization parameters not recommended by Oracle development in an Exadata environment
can cause instability, performance problems, corruptions, and crashes.

Action / Repair:

To verify hidden database initialization parameter usage on each ASM and database instance,
run the following sqlplus command as the owner of the respective home, with the environment properly set to
access the instance:

select name,value from v$parameter where substr(name,1,1) = '_';

NOTE: v$parameter contains only hidden parameters that have been changed from the default, which are the ones of
interest here.

The expected output should be a list of all hidden parameters in use that have been changed from the default value,
similar to:

_enable_NUMA_support FALSE

There should be no hidden parameters in use that are not shown in the "Table of generally acceptable hidden parameters":

Table of generally acceptable hidden parameters

Parameter Name: _file_size_increase_increment
Value: 2143289344
Oracle Version: <= 11.2.0.3 BP11
Exadata Version: ALL
Instance Type: Database
Notes: Allows RMAN backup allocation sizes that perform better.

Parameter Name: _enable_NUMA_support
Value: Set _enable_NUMA_support = TRUE for all 8-socket database servers (all hardware generations). (Note - applies to non-OVM only - OVM is not supported on 8-socket servers.) Set _enable_NUMA_support = TRUE for X5 and later 2-socket database servers deployed as non-OVM. In all other cases, do not explicitly set _enable_NUMA_support.
Oracle Version: < 12.1.0.2.6
Exadata Version: ALL
Instance Type: Database
Notes: For any Exadata system using Database 12.1.0.2.6 or higher, do not explicitly set _enable_NUMA_support (this includes all hardware generations, 2-socket, 8-socket, non-OVM, and OVM). The _enable_NUMA_support setting is configured automatically by the database. For any Exadata system using Database 12.1.0.2.5 or lower, see the recommended setting in the Value field of this entry.

Parameter Name: _asm_resyncckpt
Value: 00
Oracle Version: 12.1.0.1 ONLY
Exadata Version: ALL
Instance Type: ASM
Notes: Disables the resync checkpoint.

Parameter Name: _smm_auto_max_io_size
Value: 1024
Oracle Version: 12.1 and lower
Exadata Version: ALL
Instance Type: Database
Notes: This enables 1 MB I/O for hash joins that spill to disk, which can increase performance by up to 40% due to the higher throughput. These performance gains may avoid the need to move TEMP to flash. Internal only note: this will no longer be needed once bug 20925115 is fixed.

Parameter Name: _parallel_adaptive_max_users
Value: 2
Oracle Version: 12.1 and higher
Exadata Version: ALL
Instance Type: Database
Notes: Check that the value is no higher than the recommended value. Setting this value above the recommendation can exhaust memory and impact performance.

The PARALLEL_MAX_SERVERS parameter is evaluated using the calculation below:

parallel_threads_per_cpu * cpu_count * concurrent_parallel_users * 5

The PARALLEL_SERVERS_TARGET parameter is evaluated using the calculation below:

parallel_threads_per_cpu * cpu_count * concurrent_parallel_users * 2

_PARALLEL_ADAPTIVE_MAX_USERS supplies the value of concurrent_parallel_users in the calculation. The value of this parameter is set to 4 in most cases, which would result in a maximum number of parallel servers higher than recommended, so the recommended value is 2.

PARALLEL_MAX_SERVERS would be calculated as below, assuming cpu_count is set to all available CPUs:

X2-2: 1 * 24 * 2 * 5 = 240
X6-2: 1 * 88 * 2 * 5 = 880
X2-8: 1 * 128 * 2 * 5 = 1280
X6-8: 1 * 288 * 2 * 5 = 2880

Parameter Name: _assm_segment_repair_bg
Value: FALSE
Oracle Version: 12.2 and higher
Exadata Version: ALL
Instance Type: Database
Notes: Workaround for bug 23734075.

Parameter Name: _asm_max_connected_clients
Value: Changes dynamically
Oracle Version: 12.2 and 18.1 ONLY
Exadata Version: ALL
Instance Type: ASM
Notes: Used internally; removed in release 19c.

Parameter Name: _backup_disk_bufcnt
Value: 64
Oracle Version: 12.1 and lower
Exadata Version: ALL
Instance Type: Database
Notes: Only when ZFS based backups are in use.

Parameter Name: _backup_disk_bufsz
Value: 1048576
Oracle Version: 12.1 and lower
Exadata Version: ALL
Instance Type: Database
Notes: Only when ZFS based backups are in use.

Parameter Name: _backup_file_bufcnt
Value: 64
Oracle Version: 12.1 and lower
Exadata Version: ALL
Instance Type: Database
Notes: Only when ZFS based backups are in use.

Parameter Name: _backup_file_bufsz
Value: 1048576
Oracle Version: 12.1 and lower
Exadata Version: ALL
Instance Type: Database
Notes: Only when ZFS based backups are in use.
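
The PARALLEL_MAX_SERVERS and PARALLEL_SERVERS_TARGET calculations given in the _parallel_adaptive_max_users notes can be reproduced with plain shell arithmetic. This is only an illustrative sketch: the function names are hypothetical, and the arguments are, in order, parallel_threads_per_cpu, cpu_count, and concurrent_parallel_users.

```shell
#!/bin/bash
# Hypothetical helpers implementing the two formulas from the table notes:
#   parallel_threads_per_cpu * cpu_count * concurrent_parallel_users * 5
#   parallel_threads_per_cpu * cpu_count * concurrent_parallel_users * 2
parallel_max_servers()    { echo $(( $1 * $2 * $3 * 5 )); }
parallel_servers_target() { echo $(( $1 * $2 * $3 * 2 )); }

# X2-2 example (cpu_count = 24) with the recommended value of 2:
parallel_max_servers 1 24 2      # prints 240
# Same server with the value of 4 mentioned in the notes:
parallel_max_servers 1 24 4      # prints 480
```

Comparing the two results shows why the note recommends 2: a value of 4 doubles the computed maximum number of parallel servers.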

NOTES:

1) For additional ZFS based backup configuration information, please see: Oracle ZFS Storage: FAQ: Exadata
RMAN Backup with The Oracle ZFS Storage Appliance (Doc ID 1354980.1)
2) This best practice check does not include any application specific hidden parameters. If an application in use
requires hidden parameters that are failed by this best practice, refer to the proper documentation for the
application version in use. If the extra hidden parameters are correct, then ignore the failures reported for those
specific parameters.

For Oracle E-Business Suite, please see: Database Initialization Parameters for Oracle E-Business Suite Release 12
(Doc ID 396009.1)
For Siebel CRM Application, please see: Performance Tuning Guidelines for Siebel CRM Application on Oracle
Database (Doc ID 2077227.2)

Verify BDB location for Cloned GI homes

Priority Added Machine Type OS Type Exadata Version Oracle Version

n/a August 2012 X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux, Solaris 11.2.x + 11.2.x +

Benefit / Impact

After cloning a Grid Home, the $GI_HOME/crf/admin/crf<node>.ora configuration file in the new home still has the BDB location
pointing to the GI home it was cloned from.

Risk:

GI upgrade to 11.2.0.3 from 11.2.0.1 or 11.2.0.2 can fail, with error messages in the $GRID_HOME/log/crflogd/crflogdOUT.log logfile.

Action / Repair:

Manually edit $GI_HOME/crf/admin/crf<node>.ora in the cloned Grid Infrastructure Home and change the values for BDBLOC
and CRFHOME.
The same change needs to be made to this file, if it exists, on all nodes in the cluster. Reference: 1485970.1 /
14168708
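
Before editing, stale entries can be located with a short filter. This is a minimal sketch, not an Oracle-supplied tool: the function name crf_stale_entries is hypothetical, and it assumes the crf<node>.ora file holds simple KEY=VALUE lines, which this note does not state.

```shell
#!/bin/bash
# Hypothetical helper: print every BDBLOC or CRFHOME line whose value does
# not begin with the expected Grid Infrastructure home path.
# Assumption: the file consists of simple KEY=VALUE lines.
crf_stale_entries() {
    local crf_file="$1" gi_home="$2"
    awk -F= -v home="$gi_home" '
        ($1 == "BDBLOC" || $1 == "CRFHOME") && index($2, home) != 1 { print }
    ' "$crf_file"
}

# Example usage (the node name is taken from the short hostname, an assumption):
# crf_stale_entries "$GI_HOME/crf/admin/crf$(hostname -s).ora" "$GI_HOME"
```

No output means both entries already point into the given home; any line printed still references another location.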

Verify Shared Servers do not perform serial full table scans

Priority Added Machine Type OS Type Exadata Version Oracle Version

Warn September 2012 X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux 11.2.x + 11.2.x +

Benefit / Impact

As an Oracle kernel design decision, shared servers are intended to perform quick transactions and therefore do not issue serial
(non PQ) direct reads. Consequently, shared servers do not perform serial (non PQ) Exadata smart scans.

The impact of verifying that shared servers are not doing serial full table scans is minimal. Modifying the shared server
environment to avoid shared server serial full table scans varies by configuration and application behavior, so the impact cannot
be estimated here.

Risk:

Shared servers doing serial full table scans in an Exadata environment lead to a performance impact due to the loss of Exadata
smart scans.

Action / Repair:

To verify shared servers are not in use, execute the following SQL query as the "oracle" userid:

SQL> select NAME,value from v$parameter where name='shared_servers';

The expected output is:

NAME VALUE
--------------- ------------------------------
shared_servers 0

If the output is not "0", run the following command as the "oracle" userid with properly defined environment variables and check
the output for "SHARED" configurations:

$ORACLE_HOME/bin/lsnrctl service

If shared servers are confirmed to be present, check for serial full table scans performed by them. If shared servers performing
serial full table scans are found, the shared server environment and application behavior should be modified to favor the normal Oracle
foreground processes so that serial direct reads and Exadata smart scans can be used.

Verify Write Back Flash Cache minimum version requirements

Priority Alert Level Date Owner Status Scope Bug(s)
Critical FAIL 02/06/13 <Name> Development Exadata, SSC 16012455-exachk

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version TBD
11.2.0.3 BP9+ ASM X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 (all engineered systems with Exadata Storage) 11.2.3.2.1+ Solaris - 11, Linux x86-64 exachk 2.2.1

Benefit / Impact

Oracle Write Back Flash Cache requires Oracle version 11.2.0.3 Bundle Patch 9 (BP9) in the Grid Infrastructure ORACLE_HOME
or higher and Exadata version 11.2.3.2.1 or higher.

Oracle 11.2.0.3 BP9 or higher in the Grid Infrastructure ORACLE_HOME enables the resilvering feature, which drastically reduces
the time required to restore redundancy after a flash disk (FDOM) failure.

Exadata software 11.2.3.2.1 has critical optimizations and fixes (e.g. fix for bug 16232581) to fully take advantage of Exadata
Write Back Flash Cache.

Risk:

Without 11.2.0.3 BP9 in the Grid Infrastructure ORACLE_HOME, disks cached by the failed FDOM will be dropped and added,
which significantly extends the repair time.

Without the fixes in Exadata cell 11.2.3.2.1, IO errors and possible data corruptions may appear for very large IO intensive
workloads when using Write Back Flash Cache.

Action / Repair:

To check if Write Back Flash Cache is in use, run the following cellcli command on all storage servers and check for "WriteBack":

CellCLI> list cell attributes flashCacheMode
WriteBack

To check the Grid Infrastructure ORACLE_HOME for BP9 or above, run the following command from the Grid Infrastructure
ORACLE_HOME as the oracle userid:

$ $ORACLE_HOME/OPatch/opatch lspatches

The output should be similar to:

14307915;DISKMON PATCH FOR EXADATA (NOV 2012 - 11.2.0.3.12) : (14307915)

14275572;CRS PATCH FOR EXADATA (NOV 2012 - 11.2.0.3.12) : (14275572)
14662263;DATABASE PATCH FOR EXADATA (NOV 2012 - 11.2.0.3.12) : (14662263)

In this case, patch 14275572 is applied, which is 11.2.0.3 BP12, and therefore the proper fixes are in place.

If the Oracle version is less than 11.2.0.3 BP9, upgrade to 11.2.0.3 BP9 or higher.

To check the Exadata software version, execute the following command as the root userid on all storage servers:

imageinfo -version

The output should be similar to:

11.2.3.2.1.130109

If the Exadata software version is less than 11.2.3.2.1, upgrade to 11.2.3.2.1 or higher.
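
The minimum-version comparison can be scripted rather than read by eye. The following is a minimal sketch, assuming GNU sort with the -V (version sort) option is available; the function name version_ge is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed (return 0) if version $1 is greater than or
# equal to version $2. Relies on GNU sort -V to order dotted version strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example usage on a storage server:
# version_ge "$(imageinfo -version)" 11.2.3.2.1 || echo "upgrade to 11.2.3.2.1 or higher"
```

The check works by sorting the two strings and testing whether the required minimum comes first, so a longer build string such as 11.2.3.2.1.130109 still compares as at least 11.2.3.2.1.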

Verify bundle patch version installed matches bundle patch version registered in database

Priority Alert Level Date Owner Status Scope
Critical FAIL 11/04/15 <Name> Production Exadata, Exalo

DB Version DB Role Engineered System Exadata Version OS & Version Validation Too
>= 12.1.0.2 ALL X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2, X5-8 11.2.x + Linux, Solaris exachk 12.1

Benefit / Impact:

first manner where the SQL portion of the bundle patch is not installed inside the database until the primary and all standby
software homes have the same version installed, then this crosscheck is expected to fail until both the binary and SQL portion of
the bundle patch application is fully installed.

Risk:

Incomplete bug fixes, software instability, and unexpected behavior

Action / Repair:

To verify that the bundle patch version installed matches bundle patch version registered in database, as the oracle home owner
for the primary database, and with ORACLE_SID and ORACLE_HOME properly set, execute the following command:

# Patch id of the first DATABASE (non-JavaVM) patch listed in the software home
opatch_bp=$($ORACLE_HOME/OPatch/opatch lspatches 2>/dev/null | grep -iwv javavm | grep -wi database | head -1 | awk -F';' '{print $1}');
# Most recent action and status recorded in the database for that patch id
database_bp_status=$(echo -e "set heading off feedback off timing off \n select ACTION, STATUS from (select * from dba_registry_sqlpatch where PATCH_ID = $opatch_bp order by action_time desc) where rownum=1;" | $ORACLE_HOME/bin/sqlplus -s " / as sysdba" | sed -e '/^ *$/d');
# Collapse repeated whitespace so the value compares cleanly
database_bp_status=$(echo $database_bp_status);
if [ "$database_bp_status" == "APPLY SUCCESS" ];
then
    echo "SUCCESS: Bundle patch installed in the database matches the software home and is installed successfully.";
else
    echo "FAILURE: Bundle patch installed in the database does not match the software home, or is installed with errors.";
fi;

The output should be similar to:

SUCCESS: Bundle patch installed in the database matches the software home and is installed successfully.

If FAILURE is reported, then investigate and correct the discrepancy.
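
The first step of the check, picking the database bundle patch id out of the "opatch lspatches" listing, can be tried in isolation on saved output. This is only a sketch: the function name first_database_patch_id is hypothetical, and it simply wraps the same grep and awk filters used in the command above.

```shell
#!/bin/bash
# Hypothetical helper: read "opatch lspatches" output on stdin, drop JavaVM
# patches, keep the first line mentioning DATABASE, and print the patch id
# that precedes the ';' separator.
first_database_patch_id() {
    grep -iwv javavm | grep -wi database | head -1 | awk -F';' '{print $1}'
}

# Example usage:
# $ORACLE_HOME/OPatch/opatch lspatches 2>/dev/null | first_database_patch_id
```

The id printed is the value that the check then looks up as PATCH_ID in dba_registry_sqlpatch.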

NOTE: For versions less than 12.1.0.2, please see this archived best practice:
Verify bundle patch version installed matches bundle patch version registered in database (ARCHIVE)

Verify database server file systems have "Maximum mount count" = "-1"

Priority Alert Date Owner Status Engin

Level
Critical FAIL 03/16/16 <Name> Production Exada
Exadata
D
Exadata
DB DB Role Engineered System Exadata OS & Validatio
Version Version Version
N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, EIGHTH, X4-2, X4-8, X5-2, 11.2.2.2.0+ Linux x86-64 exache
X5-8

Benefit / Impact:

A filesystem will be checked for consistency (fsck) after the number of times it is mounted exceeds the "Maximum mount count"
setting, typically at reboot time. On a database server, the "Maximum mount count" is set to "-1" by default.

Verifying that the database server file systems all have "Maximum mount count" set to "-1" helps to avoid an unexpectedly long
reboot sequence as an fsck of the file system completes. The impact of verifying the database server file systems' "Maximum
mount count" is minimal. The impact of changing the "Maximum mount count" value is minimal as it can be changed
dynamically.

Risk:

A database server reboot may take an unexpectedly long time as an fsck operation completes, potentially extending an outage
or maintenance window.

Action / Repair:

To verify the database server disk devices maximum mount count configuration, execute the following command as the "root"
userid on all database servers:

LVM_IN_USE=$(parted -ls 2>/dev/null | egrep -i lvm | wc -l);
if [ $LVM_IN_USE -ge 1 ]
then
    if test -f /proc/xen/capabilities && grep -q "control_d" /proc/xen/capabilities
    then
        FS_COMMAND=tune4fs # dom0 case
    else
        FS_COMMAND=tune2fs # physical, domU case
    fi;
    LOGICAL_VOLUME_ARRAY=$(lvscan | cut -d"'" -f2);
    for INDIVIDUAL_LOGICAL_VOLUME in $LOGICAL_VOLUME_ARRAY
    do
        if [ $(file -sL $INDIVIDUAL_LOGICAL_VOLUME | egrep -wc "ext3|ext4" 2> /dev/null) -eq 1 ]
        then
            FILESYSTEM_ARRAY+="$INDIVIDUAL_LOGICAL_VOLUME "
        fi;
    done;
    MNT_CNT_CHK_RSLT=0;
    for INDIVIDUAL_LOGICAL_VOLUME in $(echo ${FILESYSTEM_ARRAY[@]})
    do
        if [ "$($FS_COMMAND -l $INDIVIDUAL_LOGICAL_VOLUME | egrep "^Maximum mount" | cut -d ":" -f 2 | sed -e 's/^[ \t]*//')" -ne "-1" ]
        then MNT_CNT_CHK_RSLT=1;
        fi;
    done;
    if [ "$MNT_CNT_CHK_RSLT" -eq "0" ]
    then
        echo -e "\nSUCCESS: All database server logical volumes found with filesystems had \"Maximum mount count\" equal to -1";
    else
        echo -e "\nFAILURE: One or more database server logical volumes found with filesystems had \"Maximum mount count\" not equal to -1";
    fi;
    for INDIVIDUAL_LOGICAL_VOLUME in $(echo ${FILESYSTEM_ARRAY[@]})
    do
        echo "$INDIVIDUAL_LOGICAL_VOLUME: $($FS_COMMAND -l $INDIVIDUAL_LOGICAL_VOLUME | egrep "^Maximum mount" | cut -d ":" -f 2 | sed -e 's/^[ \t]*//')";
    done;
else
    export SWAP_DEVICE=$(swapon -s | grep -v Filename | cut -d" " -f1)
    export PARTITIONED_DEVICE_ARRAY=$(fdisk -l 2>/dev/null | egrep ^/dev | egrep -v $SWAP_DEVICE | cut -d" " -f1);
    export MNT_CNT_CHK_RSLT=0;
    for INDIVIDUAL_PARTITIONED_DEVICE in $PARTITIONED_DEVICE_ARRAY
    do
        if [ "$(tune2fs -l $INDIVIDUAL_PARTITIONED_DEVICE | egrep "^Maximum mount" | cut -d ":" -f 2 | sed -e 's/^[ \t]*//')" -ne "-1" ]
        then MNT_CNT_CHK_RSLT=1;
        fi;
    done;
    if [ "$MNT_CNT_CHK_RSLT" -eq "0" ]
    then
        echo -e "\nSUCCESS: All database server partitioned devices (other than swap) found had \"Maximum mount count\" equal to -1";
    else
        echo -e "\nFAILURE: One or more database partitioned devices (other than swap) found had \"Maximum mount count\" not equal to -1";
    fi;
    for INDIVIDUAL_PARTITIONED_DEVICE in $PARTITIONED_DEVICE_ARRAY
    do
        echo "$INDIVIDUAL_PARTITIONED_DEVICE: $(tune2fs -l $INDIVIDUAL_PARTITIONED_DEVICE | egrep "^Maximum mount" | cut -d ":" -f 2 | sed -e 's/^[ \t]*//')";
    done;
fi;

The output should be similar to:

SUCCESS: All database server logical volumes found (other than swap) and the boot device had "Maximum mount count
Boot Device /dev/sda1: -1
/dev/VGExaDb/LVDbSys1: -1
/dev/VGExaDb/LVDbOra1: -1
/dev/VGExaDb/LVDbSys2: -1

- OR -

SUCCESS: All database server partitioned devices (other than swap) found had "Maximum mount count" equal to -1
/dev/sda1: -1
/dev/sda3: -1

If the output is not as expected, you can change the "Maximum mount count" value as the "root" userid using the appropriate
command for your environment ("tune2fs" or "tune4fs") on the database server for either partitioned or logical volume devices.
Only the device name portion of the command differs. For example, if the appropriate command for your environment is
"tune2fs":

# tune2fs -c -1 /dev/mapper/VGExaDb-LVDbOra1
tune2fs 1.39 (29-May-2006)
Setting maximal mount count to -1

NOTE: fsck should be periodically executed as part of the regular maintenance schedule for an Oracle Exadata Database
Machine, where the timing is controlled by the customer. This check only verifies that the timing of the run should be controlled
and not unexpected.

NOTE: In Exadata versions 11.2.3.2.0, 11.2.3.2.1, and 11.2.3.2.2, the database server may reset "Maximum mount count" to 27
and "Check interval" to 15552000 for some devices upon reboot. This is due to a change introduced in bug 14223777. The
recommended fix is to upgrade to 11.2.3.3.0 or higher.

Verify database server file systems have "Check interval" = "0"


Priority Alert Level Date Owner Status Engineered System Bug(s)
Critical FAIL 03/16/16 <Name> Production Exadata - Physical, Exadata - Management Domain, Exadata - User Domain

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version TBD
N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, EIGHTH, X4-2, X4-8, X5-2, X5-8 11.2.2.2.0+ Linux x86-64 exachk 12.1.0.2.7

Benefit / Impact:

A filesystem will be checked for consistency (fsck) after the elapsed time from the last fsck run exceeds the "Check interval"
setting, typically at reboot time. On a database server, the "Check interval" is set to "0" by default.

Verifying that the database server file systems all have the "Check interval" set to "0" helps to avoid an unexpectedly long reboot
sequence as an fsck of the file system completes. The impact of verifying the database server file system "Check interval" is
minimal. The impact of changing the file system "Check interval" value is minimal as it can be changed dynamically.

Risk:

A database server reboot may take an unexpectedly long time as an fsck operation completes, potentially extending an outage
or maintenance window.

Action / Repair:

To verify the database server disk devices check interval configuration, execute the following command as the "root" userid on
all database servers:

LVM_IN_USE=$(parted -ls 2>/dev/null | egrep -i lvm | wc -l);
if [ $LVM_IN_USE -ge 1 ]
then
    if test -f /proc/xen/capabilities && grep -q "control_d" /proc/xen/capabilities
    then
        FS_COMMAND=tune4fs # dom0 case
    else
        FS_COMMAND=tune2fs # physical, domU case
    fi;
    LOGICAL_VOLUME_ARRAY=$(lvscan | cut -d"'" -f2);
    for INDIVIDUAL_LOGICAL_VOLUME in $LOGICAL_VOLUME_ARRAY
    do
        if [ $(file -sL $INDIVIDUAL_LOGICAL_VOLUME | egrep -wc "ext3|ext4" 2> /dev/null) -eq 1 ]
        then
            FILESYSTEM_ARRAY+="$INDIVIDUAL_LOGICAL_VOLUME "
        fi;
    done;
    LVM_CHECK_INTERVAL_RSLT=0;
    for INDIVIDUAL_LOGICAL_VOLUME in $(echo ${FILESYSTEM_ARRAY[@]})
    do
        if [ "$($FS_COMMAND -l $INDIVIDUAL_LOGICAL_VOLUME | grep "Check interval:" | awk '{print $3}')" -ne "0" ]
        then LVM_CHECK_INTERVAL_RSLT=1;
        fi;
    done;
    if [ "$LVM_CHECK_INTERVAL_RSLT" -eq "0" ]
    then
        echo -e "\nSUCCESS: All database server logical volumes found with filesystems had \"Check interval\" equal to 0"
    else
        echo -e "\nFAILURE: One or more database server logical volumes found with filesystems had \"Check interval\" not equal to 0"
    fi;
    for INDIVIDUAL_LOGICAL_VOLUME in $(echo ${FILESYSTEM_ARRAY[@]})
    do
        echo "$INDIVIDUAL_LOGICAL_VOLUME: $($FS_COMMAND -l $INDIVIDUAL_LOGICAL_VOLUME | grep "Check interval:" | awk '{print $3}')"
    done;
else
    export SWAP_DEVICE=$(swapon -s | grep -v Filename | cut -d" " -f1)
    export PARTITIONED_DEVICE_ARRAY=$(fdisk -l 2>/dev/null | egrep ^/dev | egrep -v $SWAP_DEVICE | cut -d" " -f1);
    export PRTN_CHECK_INTERVAL_RSLT=0;
    for INDIVIDUAL_PARTITIONED_DEVICE in $PARTITIONED_DEVICE_ARRAY
    do
        if [ "$(tune2fs -l $INDIVIDUAL_PARTITIONED_DEVICE | grep "Check interval:" | awk '{print $3}')" -ne "0" ]
        then PRTN_CHECK_INTERVAL_RSLT=1;
        fi;
    done;
    if [ "$PRTN_CHECK_INTERVAL_RSLT" -eq "0" ]
    then
        echo -e "\nSUCCESS: All database server partitioned devices (other than swap) found had \"Check interval\" equal to 0"
    else
        echo -e "\nFAILURE: One or more database partitioned devices (other than swap) found had \"Check interval\" not equal to 0"
    fi;
    for INDIVIDUAL_PARTITIONED_DEVICE in $PARTITIONED_DEVICE_ARRAY
    do
        echo "$INDIVIDUAL_PARTITIONED_DEVICE: $(tune2fs -l $INDIVIDUAL_PARTITIONED_DEVICE | grep "Check interval:" | awk '{print $3}')"
    done;
fi;

The output should be similar to:

SUCCESS: All database server disk devices found (other than swap) and the boot device had "Check interval" equal
Boot Device /dev/sda1: 0
/dev/VGExaDb/LVDbSys1: 0
/dev/VGExaDb/LVDbOra1: 0
/dev/VGExaDb/LVDbSys2: 0

- OR -

SUCCESS: All database server partitioned devices (other than swap) found had "Check interval" equal to 0
/dev/cciss/c0d0p1: 0
/dev/cciss/c0d0p3: 0

If the output is not as expected, you can change the "Check interval" value as the "root" userid using the appropriate command
for your environment ("tune2fs" or "tune4fs") on the database server for either partitioned or logical volume devices. Only the
device name portion of the command differs. For example, if the appropriate command for your environment is "tune2fs":

# tune2fs -i 0 /dev/VGExaDb/LVDbOra1
tune2fs 1.39 (29-May-2006)
Setting interval between checks to 0 seconds

NOTE: fsck should be periodically executed as part of the regular maintenance schedule for an Oracle Exadata
Database Machine, where the timing is controlled by the customer. This check only verifies that the timing of the
run should be controlled and not unexpected.

NOTE: In Exadata versions 11.2.3.2.0, 11.2.3.2.1, and 11.2.3.2.2, the database server may reset "Maximum
mount count" to 27 and "Check interval" to 15552000 for some devices upon reboot. This is due to a change
introduced in bug 14223777. The recommended fix is to upgrade to 11.2.3.3.0 or higher.
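
Both fsck triggers can be read from a single "tune2fs -l" listing. The following is a minimal sketch under stated assumptions: the function name fsck_triggers_disabled is hypothetical, and the listing is assumed to contain lines beginning with "Maximum mount count:" and "Check interval:", which are the fields the checks above search for.

```shell
#!/bin/bash
# Hypothetical helper: read "tune2fs -l" (or "tune4fs -l") output on stdin
# and succeed only if "Maximum mount count" is -1 and "Check interval" is 0.
fsck_triggers_disabled() {
    awk -F: '
        /^Maximum mount count:/ { gsub(/[ \t]/, "", $2); mounts = $2 }
        /^Check interval:/      { split($2, parts, " "); interval = parts[1] }
        END { exit !(mounts == "-1" && interval == "0") }
    '
}

# Example usage for one device (run as root):
# tune2fs -l /dev/VGExaDb/LVDbOra1 | fsck_triggers_disabled || echo "fsck may run at reboot"
```

A missing field counts as a failure, so a device that does not report both values is flagged rather than silently passed.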

Verify Automated Service Request (ASR) configuration

Priority Alert Level Date Owner Status Scope
Critical FAIL 11/11/12 <Name> Development Exadata, SSC, Exalogic

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version
N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 11.2.2.2.0+ Solaris - 11, Linux x86-64 exachk 2.1.6

Benefit / Impact:

Verifying the Automated Service Request (ASR) configuration is necessary to ensure that an Oracle Exadata Database Machine can
automatically open an Oracle Support Service Request when a qualifying condition is detected.

The impact of verifying the ASR configuration is minimal. The impact of correcting deficiencies found varies by the corrective
action required, and cannot be estimated here.

Risk:

If the ASR configuration is not correct, service requests will not be opened automatically when a qualifying
condition is detected, leading to delays in correcting the qualifying condition.

Action / Repair:

There are two methods to verify that the ASR configuration is correct:

1) Read and follow the instructions in My Oracle Support Doc ID 1450112.1, which provides the asrexacheck script to verify the
ASR configuration.

2) Download and execute the latest exachk from My Oracle Support Doc ID 1070954.1, which includes the asrexacheck script.

Refer to the output of the asrexacheck script, or the "Systemwide Automatic Service request (ASR) healthcheck" section of the
exachk HTML report, for findings and corrective actions.

Verify ZFS File System User and Group Quotas are configured

Priority | Alert Level | Date | Owner | Status | Scope
Critical | WARN | 3/1/2013 | <Name> | Review | Exadata, SSC

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X3-2, X4-2 | 11.2.1.0.0 + | Solaris - 11 | exachk 2.2.0

Benefit / Impact:

Filesystem quotas control how much filesystem space users and groups can consume. Especially on systems where the grid infrastructure
and RDBMS software are managed through separate OS users, restrictions on space consumption help ensure that
system stability and application availability are maximized.

Risk:

Without quotas, filesystems can fill up and application availability can be impacted. When quotas are used, soft limits trigger
warnings as usage approaches the quota, and hard limits keep the filesystem from filling so that the system remains
stable.

Action / Repair:

To verify that ZFS file system user and group quotas are configured, as the "root" userid on all Solaris database servers, execute the following
commands:

# zfs get userquota@oracle data/u01
NAME PROPERTY VALUE SOURCE
data/u01 userquota@oracle none local
# zfs get gro

NOTE: This procedure applies only to Solaris database servers in an Exadata Database Machine. No changes are
permitted on Exadata storage cells. For instructions on how to implement ZFS quotas on Exadata, refer to
Chapter 7 of the Database Machine Owner's Guide, "Resetting the Quota of a ZFS Storage Pool File System".

Verify the file /.updfrm_exact does not exist

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 04/02/2014 | <Name> | Production | Exadata, SSC, Exalogic | 18746642 - exachk

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | All | exachk 2.2.5

Benefit / Impact:

To work around a firmware patching issue in an earlier Exadata release, the file /.updfrm_exact had to be created manually.
This file should be created only temporarily during patching, at the direction of Oracle Support, and should be removed
immediately after patching is complete.

The impact of verifying the existence of the file /.updfrm_exact and removing it is minimal.

Risk:

If /.updfrm_exact exists, a manual firmware upgrade may be inadvertently rolled back when the server is next rebooted.

Action / Repair:

To verify that the file /.updfrm_exact does not exist, as the root userid on all database and storage servers, execute the
following command:

bash -c '[ -f /.updfrm_exact ] && echo "FAIL: /.updfrm_exact exists"'

The output should be empty.

If the output is similar to the following:

randomdb01: FAIL: /.updfrm_exact exists

then remove the file /.updfrm_exact with the following command executed as the root userid:

rm -f /.updfrm_exact

Verify the vm.min_free_kbytes configuration

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 04/10/19 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, Exadata - User Domain, RA | ALL | 29604454 - exachk, 27679610 - exachk, 26308040 - exachk, 17251233, 16984594, 17200041

DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 19.3.0 | N/A

Benefit / Impact:

Maintaining vm.min_free_kbytes as recommended helps a Linux system to reclaim memory faster. For a database server with 1
NUMA node, the minimum value is 524288 KB (512 MB). For database servers with more than 1 NUMA node, the minimum value is
the_number_of_NUMA_nodes multiplied by 524288 KB.
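
The arithmetic above can be sketched as a small shell function. This is only an illustration and not part of exachk: the function name min_free_kbytes_minimum is hypothetical, and the per-node constant of 524288 (in KB) is the one used by the verification script in this section.

```shell
#!/bin/bash
# Hypothetical helper: minimum vm.min_free_kbytes for a given NUMA node count.
# Assumes 524288 KB per NUMA node, the constant used by the verification script.
min_free_kbytes_minimum() {
    local numa_nodes=$1
    echo $(( numa_nodes * 524288 ))
}

# Print the minimum for the node counts shown in the sample outputs.
for nodes in 1 2 8; do
    echo "NUMA nodes: $nodes -> minimum vm.min_free_kbytes: $(min_free_kbytes_minimum "$nodes")"
done
```

The values for 1, 2 and 8 nodes correspond to the "minimum size" lines in the sample outputs of the verification script.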

The impact of verifying the vm.min_free_kbytes configuration is minimal. Adjusting vm.min_free_kbytes should
include a reboot to verify that the setting is correct and is retained across the boot cycle.

NOTE: It is possible, but NOT recommended, especially for a system already under memory pressure, to modify
the setting interactively.

Risk:

Exposure to unexpected node eviction and reboot.

Action / Repair:

To verify the vm.min_free_kbytes configuration, as the "root" userid on each database server, execute the following command
set:

MIN_FREE_KBYTES_SYSCTL=$(egrep ^vm.min_free_kbytes /etc/sysctl.conf | awk '{print $3}');

MIN_FREE_KBYTES_MEMORY=$(cat /proc/sys/vm/min_free_kbytes);
RAW_NUMA_DATA=$(numactl -s | egrep ^cpubind | awk '{$1=$1;print}')
FIELD=$(expr $(echo "$RAW_NUMA_DATA" | tr -cd ' ' | wc -c) + 1)
NUMA_NODE_COUNT=$(expr $(echo "$RAW_NUMA_DATA" | cut -d " " -f$FIELD) + 1)
if [[ $NUMA_NODE_COUNT = 1 ]]
then
MINIMUM_SIZE=524288
else

MINIMUM_SIZE=$(expr $NUMA_NODE_COUNT '*' 524288)
fi
DETAIL=$(
echo -e "NUMA node count: $NUMA_NODE_COUNT";
echo -e "minimum size: $MINIMUM_SIZE";
echo -e "in sysctl.conf: $MIN_FREE_KBYTES_SYSCTL";
echo -e "in active memory: $MIN_FREE_KBYTES_MEMORY";
)
if [[ $MIN_FREE_KBYTES_SYSCTL -eq $MIN_FREE_KBYTES_MEMORY && $MIN_FREE_KBYTES_SYSCTL -ge $MINIMUM_SIZE ]]
then
echo -e "\nSUCCESS: vm.min_free_kbytes is set as recommended:\n$DETAIL";
else
echo -e "\nFAILURE: vm.min_free_kbytes is not set as recommended:\n$DETAIL";
fi;

The output should be similar to:

SUCCESS: vm.min_free_kbytes is set as recommended:

NUMA node count: 8
minimum size: 4194304
in sysctl.conf: 4194304
in active memory: 4194304

-- OR --

SUCCESS: vm.min_free_kbytes is set as recommended:

NUMA node count: 2
minimum size: 1048576
in sysctl.conf: 1048576
in active memory: 1048576

-- OR --

SUCCESS: vm.min_free_kbytes is set as recommended:

NUMA node count: 1
minimum size: 524288
in sysctl.conf: 524288
in active memory: 524288

Example of a "FAILURE" result:

FAILURE: vm.min_free_kbytes is not set as recommended:

NUMA node count: 8
minimum size: 4194304
in sysctl.conf: 1048576
in active memory: 2097152

NOTE: In the above "FAILURE" example, it appears the sysctl.conf file setting is too low, and then the active
kernel setting was expanded but still too low, and neither is close to the recommended minimum value.

If the output is a "FAILURE" result, investigate and take corrective action. Corrective action should include setting the minimum
recommended vm.min_free_kbytes value for the given NUMA configuration in sysctl.conf and rebooting the database server.
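
The verification script reads the configured value with awk '{print $3}', which assumes the entry is written with spaces around "=". A minimal sketch of a more tolerant lookup follows; the function name sysctl_conf_value and the temporary demonstration file are hypothetical and are not taken from exachk.

```shell
#!/bin/bash
# Hypothetical helper: print the last value set for a key in a sysctl.conf-style file.
# Splitting on "=" accepts both "key=value" and "key = value".
sysctl_conf_value() {
    local key=$1 file=$2
    awk -F= -v key="$key" '
        /^[ \t]*[#;]/ { next }                      # skip comment lines
        {
            k = $1; gsub(/[ \t]/, "", k)            # key with whitespace removed
            if (k == key) { v = $2; gsub(/[ \t]/, "", v); found = v }
        }
        END { if (found != "") print found }
    ' "$file"
}

# Demonstration against a temporary file rather than /etc/sysctl.conf.
demo_conf=$(mktemp)
printf '%s\n' '# sample entry' 'vm.min_free_kbytes=4194304' > "$demo_conf"
echo "configured value: $(sysctl_conf_value vm.min_free_kbytes "$demo_conf")"
rm -f "$demo_conf"
```

On a database server the second argument would be /etc/sysctl.conf; the result can then be compared with /proc/sys/vm/min_free_kbytes as the verification script does.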

Validate key sysctl.conf parameters on database servers

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 5/8/13 | <Name> | Design | Exadata |

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | Linux |

Benefit / Impact:

Kernel parameter settings in /etc/sysctl.conf are applied to the kernel automatically at boot time and manually via the sysctl
utility at runtime. The semantics of each kernel parameter are known only to the kernel, so the sysctl utility passes all values
directly to the kernel with minimal processing and validation. Invalid values can be misinterpreted by the kernel, leading to
unexpected results. For certain key parameters, such invalid values can have an immediate and critical impact on the system.
Invalid values stored in /etc/sysctl.conf at boot time can prevent the system from booting, making it difficult to identify and
correct the problem. Validating the format of some key parameters periodically or after changes to sysctl.conf can prevent
unexpected outages due to human error.

Risk:

Applying improperly formatted values to kernel parameters can render a system unusable.

Action / Repair: Run the command "awk -f check_sysctl.awk /etc/sysctl.conf" and correct any parameters reported to be
formatted incorrectly. The contents of check_sysctl.awk are shown below:

#########################################################################
# Notes:
#
# - The purpose of this script is to check certain kernel parameters in
# /etc/sysctl.conf that could prevent the server from booting if set
# incorrectly.
# - This script is only capable of checking the validity of the *syntax*
# of these parameters, but is not capable of assessing whether the
# values themselves are correct or optimal.
# - This script does not attempt to check all parameters in sysctl.conf.
# It only checks parameters which have been observed to cause severe
# impact on server stability.
#
# Revision history:
# 08-May-2014 - initial version
# 28-May-2014 - vm.nr_hugepages must be < 100% of physical memory

# 24-Jun-2014 - add corrective action guidance
#
#########################################################################

BEGIN {
errcnt = 0
BEGIN_memtotal_bytes()
}

END {
if( !errcnt ) { print "All sysctl.conf formatting checks succeeded" }
exit errcnt
}

function BEGIN_memtotal_bytes() {
if( NR )
{
exit -1
}

cmd = "grep MemTotal /proc/meminfo"

if( 1 != cmd | getline )
{
close( cmd )
exit -1
}
else if( 3 != NF || $3 != "kB" )
{
print "Unexpected /proc/meminfo format"
exit -1
}
close( cmd )
memtotal_bytes = $2 * 1024

cmd = "grep Hugepagesize /proc/meminfo"

if( 1 != cmd | getline )
{
hugepage_size = 2048 * 1024
}
else if( 3 != NF || $3 != "kB" )
{
print "Unexpected /proc/meminfo format"
exit -1
}
else
{
hugepage_size = $2 * 1024;
}
close( cmd )

memtotal_hugepages = memtotal_bytes / hugepage_size

}

# This function extracts the value portion of the setting with whitespace
# before and after trimmed, as sysctl does
function extract_value( localval ) {
localval = gensub( /^[^=]*=[[:space:]]*/, "", 1 )
localval = gensub( /[[:space:]]*$/, "", 1, localval)
return localval;
}

# This function verifies that the specified value consists entirely of

# numeric digits 0-9
function check_decimal_int( v ) {
if( v !~ /^[[:digit:]]*$/ ) { return 0 }
return 1;
}

# Check for comments first and skip to the next line if found
/^[[:space:]]*[#;]/ {
next
}

/vm\.nr_hugepages/ {
valstr = extract_value()
if( !check_decimal_int(valstr) )
{
errcnt++
print "Invalid hugepages line: '" $0 "'"
print "ACTION: A valid hugepages line should look similar to the following example,"
print " with no additional comments or other characters:"
print ""
print " vm.nr_hugepages = 10000"
print ""
next
}

# Add 0 to valstr to force it to numeric type. Otherwise

# subsequent comparisons will use string comparisons,
# which won't yield expected results
valnum = 0 + valstr
if( valnum >= memtotal_hugepages )
{
errcnt++
print "Hugepages value '" valnum "' is larger than physical memory"
print "ACTION: Reduce the hugepages value to something much less than the total size of"
print " physical RAM in the server. For this server, a value of " memtotal_hugepages
print " would consume all of physical RAM, and would prevent the server from"
print " booting. Please refer to MOS Note 401749.1 for guidance on choosing"
print " an appropriate value for this server."
next
}
}
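
The kind of format error the awk script reports can be illustrated with a short shell sketch. It is an assumption-laden illustration, not a replacement for check_sysctl.awk: the function name valid_hugepages_line is hypothetical, and unlike the awk script it requires at least one digit and does not compare the value with physical memory.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the line has the form
# "vm.nr_hugepages = <digits>" with no trailing comment or other characters.
valid_hugepages_line() {
    local re='^[[:space:]]*vm\.nr_hugepages[[:space:]]*=[[:space:]]*[0-9]+[[:space:]]*$'
    [[ $1 =~ $re ]]
}

# Try one well-formed line and two malformed ones.
for line in 'vm.nr_hugepages = 10000' 'vm.nr_hugepages = 10000 # tuned' 'vm.nr_hugepages = 10k'; do
    if valid_hugepages_line "$line"; then
        echo "OK:      $line"
    else
        echo "INVALID: $line"
    fi
done
```

A trailing comment on the same line is flagged because sysctl passes the whole value, comment included, to the kernel.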

Remove "fix_control=32" from dbfs mount options

Priority | Alert Level | Date | Owner | Status | Scope
Critical | None | 5/2/2013 | <Name> | | Exadata
DB DB Role Engineered System Exadata OS & Version Validation Tool
Version Version Version
X2-2(4170), X2-2, X2-8, X3-2, X3-8, Linux x86-64 UEK5.8, SPARC
11.2.3.2.1+ All All
SSC, X4-2 Solaris 11

Benefit / Impact:

DBFS is designed to use an asynchronous statfs call to obtain filesystem information. To address a timeout issue, bug 13340960
added an extra mount option, "fix_control=32", which allowed statfs to be done asynchronously. If the patch for bug 13340960 is
already applied, it is recommended to remove "fix_control=32". Bug 13340960 is fixed in 11.2.0.3 BP5 and higher.

Risk:

If the mount option "fix_control=32" is not removed, the statfs behavior is changed.

Action / Repair:

1) Check on Exadata compute node(s) if DBFS is mounted with "fix_control=32";

On Linux:

# ps -ef | grep -E 'dbfs_client' | grep -E 'fix_control'

On Solaris:

# ps -ef | grep dbfs_client

# pargs <pid>     (where <pid> is the dbfs_client process ID from the previous command)

2) Check whether the patch for bug 13340960 is installed, or 11.2.0.3 BP5 or higher is applied, in the RDBMS Oracle home:

$RDBMS/OPatch/opatch lspatches

3) Make sure you are using the latest mount-dbfs.sh script from the note "Configuring DBFS on Oracle Database Machine" [ID
1054431.1].

Set Linux kernel log buffer size to 1MB

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | WARN | 7/31/13 | <Name> | | Exadata | 17250965

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | Linux |

Benefit / Impact:

Set the kernel command line parameter "log_buf_len=1m" in /boot/grub/grub.conf to increase the size of the kernel's internal
log buffer. This will help ensure all messages from the kernel's boot sequence can be captured to /var/log/messages by
syslogd/klogd.

This is primarily a concern only on larger servers like the Sun Server X2-8, where the large number of hardware components
causes the kernel to produce a larger volume of messages than the internal log buffer can hold during the boot sequence.

Risk:

The default size of the kernel's internal log buffer is not large enough to hold all messages from the entire boot sequence on
some large hardware models. Without this change, some messages from the kernel's boot sequence may be lost before they can be
captured to /var/log/messages, which may make it difficult to diagnose some system issues.

Action / Repair:

Edit /boot/grub/grub.conf and add "log_buf_len=1m" (excluding quotes) to each kernel command line entry, as in the following
example:

title Oracle Linux Server (2.6.32-400.21.1.el5uek)

root (hd0,0)
kernel /vmlinuz-2.6.32-400.21.1.el5uek root=LABEL=DBSYS ro bootarea=dbsys loglevel=7 panic=60 debug rhgb console=
initrd /initrd-2.6.32-400.21.1.el5uek.img
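
After editing, the kernel entries can be checked for the parameter. The following is a minimal sketch, assuming a GRUB legacy layout in which each boot entry has one line beginning with "kernel"; the function name and the sample file contents are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list kernel lines in a grub.conf-style file that lack log_buf_len=1m.
# Returns 0 if at least one such line is found, non-zero otherwise.
kernel_lines_missing_log_buf_len() {
    local grub_conf=$1
    grep -E '^[[:space:]]*kernel[[:space:]]' "$grub_conf" | grep -v -w 'log_buf_len=1m'
}

# Demonstration against a temporary file rather than /boot/grub/grub.conf.
demo=$(mktemp)
cat > "$demo" <<'EOF'
title Sample entry A
        kernel /vmlinuz-sample-a ro log_buf_len=1m
title Sample entry B
        kernel /vmlinuz-sample-b ro
EOF
if missing=$(kernel_lines_missing_log_buf_len "$demo"); then
    echo "Kernel entries without log_buf_len=1m:"
    echo "$missing"
else
    echo "All kernel entries include log_buf_len=1m"
fi
rm -f "$demo"
```

On a database server the argument would be /boot/grub/grub.conf. The change takes effect only after the next reboot.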

Verify IP routing configuration on DB nodes

Priority Alert Level Date Owner Status Engineered System Engineered Syst
Platform
Critical WARN 05/31/17 <Name> Production RA, Exadata - Physical, ALL
Exadata - Management Domain
DB Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Ve
N/A N/A N/A N/A N/A Linux exachk 12.2.0.1

Benefit / Impact:

The default IP routing configuration on Exadata database nodes has changed over time so that the latest configuration works
well in all environments, but due to a kernel bug in kernels pre-2.6.31, the older configurations only worked in some cases. Since
the configurations aren't changed during Exadata software upgrades, legacy configurations should be updated to avoid issues
during future upgrades.

Risk:

If the Linux routing configuration is not updated before the kernel is upgraded from pre-2.6.31 (Exadata pre-11.2.3.2.0) to
Exadata software version 11.2.3.2.0 or later, it is likely that routing/network issues will surface following the upgrade. The
required changes (or potential changes) are outlined in MOS note 1306154.1.

Action / Repair:

To verify whether the routing configuration requires updating, execute the following as any userid on a database server:

cd /etc/sysconfig/network-scripts
. ./network-functions
# find all the interfaces besides loopback. ignore aliases, alternative configurations, and editor backup files
interfaces=$(ls ifcfg* | grep -v -e ifcfg-ib -e ifcfg-bondib | LANG=C sed -e "$__sed_discard_ignored_files" -e '/

for i in $interfaces
do
unset SLAVE
unset IPADDR
unset NETWORK
unset CNT
unset NETMASK
unset RNT
unset IPV6ADDR

. /etc/sysconfig/network-scripts/ifcfg-$i
AGREE=`/bin/grep ^SLAVE= ifcfg-$i | /bin/cut -d= -f2`
if [ [$AGREE] == [yes] ]
then echo " NOTICE: Slave Interfaces ($i) do not have rule or route files"
else
# IPv4 check
if [ -z $IPADDR ]
then echo " NOTICE: $i is not configured for IPv4"
else
if [ -z $NETWORK ]
then NETWORK=`/bin/ipcalc $IPADDR $NETMASK -n | /bin/cut -d= -f2`
fi
# check the rule file exists and has the two rules that apply (to and from)
if [ ! -f rule-$i ]
then echo "FAILURE: Need to create the rule configuration for rule-$i per 1306154.1"
else
CNT=`/sbin/ip rule list | /bin/grep -e $NETWORK -e $IPADDR -e GATEWAY | wc -l`
if [ $CNT -lt 2 ]
then echo "FAILURE: Need to update rule configuration for rule-$i per 1306154.1"
else echo " PASS: rule-$i is configured with rules."
fi
fi
# check the route file exists and have the proper route
if [ ! -f route-$i ]
then echo "FAILURE: Need to create the route configuration for route-$i per 1306154.1"
else
RNT=`/sbin/ip route list table all | /bin/grep $NETWORK | grep -v local | wc -l`
if [ $RNT -lt 2 ]
then echo "FAILURE: Need to update route configuration for route-$i per 1306154.1"
else echo " PASS: route-$i is configured with routes."
fi
fi
fi
# IPv6 check
if [ -z $IPV6ADDR ]
then echo " NOTICE: $i is not configured for IPv6"
else
if [ -z $NETWORK ]
then
NETWORK=`echo $IPV6ADDR | /bin/cut -d: -f1,2,3,4`
NETWORK=$NETWORK:
# check the rule file exists and has the two rules that apply (to and from)
if [ ! -f rule6-$i ]
then echo "FAILURE: Need to create the rule configuration for rule6-$i per 1306154.1"
else
CNT=`/sbin/ip -6 rule list | /bin/grep $NETWORK | wc -l`
if [ $CNT -lt 2 ]
then echo "FAILURE: Need to update rule configuration for rule6-$i per 1306154.1"
else echo " PASS: rule6-$i is configured with rules."
fi
fi
# check the route file exists and have the proper route
if [ ! -f route6-$i ]
then echo "FAILURE: Need to create the route configuration for route6-$i per 1306154.1"
else
RNT=`/sbin/ip -6 route list table all | /bin/grep $NETWORK | grep -v local | grep table | wc -l
if [ $RNT -lt 2 ]
then echo "FAILURE: Need to update route configuration for route6-$i per 1306154.1"

The expected result will be similar to:

PASS: rule-bondeth0 is configured with rules.

PASS: route-bondeth0 is configured with routes.
NOTICE: bondeth0 is not configured for IPv6
PASS: rule-eth0 is configured with rules.
PASS: route-eth0 is configured with routes.
NOTICE: eth0 is not configured for IPv6
NOTICE: eth1 is not configured for IPv4
NOTICE: eth1 is not configured for IPv6
NOTICE: eth2 is not configured for IPv4
NOTICE: eth2 is not configured for IPv6
NOTICE: eth3 is not configured for IPv4
NOTICE: eth3 is not configured for IPv6
NOTICE: Slave Interfaces (eth4) do not have rule or route files
NOTICE: Slave Interfaces (eth5) do not have rule or route files

Example of a "FAILURE" result:

PASS: rule-bondeth0 is configured with rules.

FAILURE: Need to create the route configuration for route-bondeth0 per 1306154.1
NOTICE: bondeth0 is not configured for IPv6
PASS: rule-eth0 is configured with rules.
PASS: route-eth0 is configured with routes.
NOTICE: eth0 is not configured for IPv6
NOTICE: eth1 is not configured for IPv4
NOTICE: eth1 is not configured for IPv6
NOTICE: eth2 is not configured for IPv4
NOTICE: eth2 is not configured for IPv6
NOTICE: eth3 is not configured for IPv4
NOTICE: eth3 is not configured for IPv6
NOTICE: Slave Interfaces (eth4) do not have rule or route files
NOTICE: Slave Interfaces (eth5) do not have rule or route files

NOTE: If any "FAILURE:" results are returned, follow the guidance provided in the message.

Set SQLNET.EXPIRE_TIME=10 in DB Home

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | WARNING | 12/4/2013 | <Name> | Production | Exadata, SSC, Exalogic | 17159324

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8 | 11.2.3.3.0+ | Solaris - 11, Linux x86-64 UEK5.8 |

Benefit / Impact:

Setting SQLNET.EXPIRE_TIME=10 in the DB home prevents a SQL*Net connection from timing out.

Risk:

If this parameter is not set, the SQL*Net connection held by RMAN can time out while the database is backed up over the HTTP protocol.

Action / Repair:

To verify that the parameter is set, look in ${ORACLE_HOME}/network/admin/sqlnet.ora.

The file should contain a line similar to:

SQLNET.EXPIRE_TIME=10
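
The lookup can be scripted. The sketch below is illustrative only: the function name sqlnet_expire_time is hypothetical, and it assumes the parameter appears on a single line of its own in sqlnet.ora.

```shell
#!/bin/bash
# Hypothetical helper: print the SQLNET.EXPIRE_TIME value from a sqlnet.ora-style file.
# Prints nothing if the parameter is not set.
sqlnet_expire_time() {
    local sqlnet_ora=$1
    grep -i -E '^[[:space:]]*SQLNET\.EXPIRE_TIME[[:space:]]*=' "$sqlnet_ora" \
        | tail -n 1 | cut -d= -f2 | tr -d '[:space:]'
}

# Demonstration against a temporary file rather than a real Oracle home.
demo=$(mktemp)
printf '%s\n' '# sample sqlnet.ora' 'SQLNET.EXPIRE_TIME = 10' > "$demo"
value=$(sqlnet_expire_time "$demo")
if [ "$value" = "10" ]; then
    echo "SQLNET.EXPIRE_TIME is set to 10"
else
    echo "SQLNET.EXPIRE_TIME is '${value:-not set}', expected 10"
fi
rm -f "$demo"
```

In practice the argument would be ${ORACLE_HOME}/network/admin/sqlnet.ora for each DB home.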

Verify there are no .fuse_hidden files under the dbfs mount

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Important | N/A | 12/10/13 | <Name> | Production | Exadata |

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | OEL5 | exachk TBD


Benefit / Impact:

Verifying whether .fuse_hidden files exist under the dbfs mount point will positively identify the need for a recommended bug fix.
The impact of verifying the existence of these files is minimal.

This problem is specific to fuse on OEL5 (which uses a 2.7.4-based version).

Risk:

When a file is opened under the dbfs mount and later removed while a process still holds the file descriptor, the fuse library
may not unlink it correctly, leaving .fuse_hidden files under the dbfs mount. These files can accumulate, causing slow performance
for simple filesystem commands such as "ls". Also, the number of files can grow quite large, taking up unnecessary space.

Action / Repair:

It is recommended to perform these actions during your next planned maintenance window, as dbfs will need to be restarted.

These instructions apply to environments that configured DBFS using MOS note 1054431.1.

1) While dbfs is mounted, manually delete any existing .fuse_hidden files under the dbfs mount, as the patch does not clear
these.

2) Stop and unmount dbfs:

$GI/bin/crsctl stop res <dbfs_mount>

3) Obtain and install the new fuse rpms related to bug:17401424 from Oracle's public Yum Server

4) Verify the new rpm is installed <fuse-libs-2.7.4-8.0.1.1.el5>:

# rpm -qa|grep fuse

fuse-devel-2.7.4-8.0.1.1.el5
fuse-2.7.4-8.0.1.1.el5
fuse-libs-2.7.4-8.0.1.1.el5

5) Start and remount dbfs:

$GI/bin/crsctl start res <dbfs_mount>
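
Before and after the steps above, the dbfs mount can be checked for leftover files. The following is a minimal sketch using find; the function name count_fuse_hidden is hypothetical, the temporary directory stands in for the real mount point, and a full tree walk of a large dbfs mount may take a while.

```shell
#!/bin/bash
# Hypothetical helper: count .fuse_hidden* files under a directory tree.
count_fuse_hidden() {
    local mount_point=$1
    find "$mount_point" -type f -name '.fuse_hidden*' | wc -l | tr -d '[:space:]'
}

# Demonstration against a temporary directory rather than a real dbfs mount.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/staging"
touch "$demo_dir/staging/.fuse_hidden0000000100000001" "$demo_dir/staging/data.dmp"
echo "fuse_hidden files found: $(count_fuse_hidden "$demo_dir")"
rm -rf "$demo_dir"
```

A count of 0 after the fuse rpms are updated and dbfs is remounted indicates that no leftover files remain.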

Verify that the SDP over IB option "sdp_apm_enable(d)" is set to "0"

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 06/03/15 | <Name> | Production | Exadata-Physical, Exadata-Management Domain, Exadata-user Domain, SSC, Exalogic |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2 | <=11.2.3.3.0 -or- <=12.1.1.1.0 | Linux x86-64 el5uek, Linux x86-64 el6uek | exachk 12.1.0.2.4

Benefit / Impact:

The impact of verifying that the SDP over IB option "sdp_apm_enable" is set to "0" is minimal. To set the option, a reboot is
recommended to make sure the configuration file syntax is correct.

Risk:

If the SDP over IB option "sdp_apm_enable" is not set to "0" on all Exadata database servers and clients that communicate
with each other using SDP, either the client or database server side of the connection request will eventually hang.

NOTE: While the original issue was reported in environments where Exalogic application servers were accessing an Oracle
Exadata Database Machine using SDP, ANY client requesting a connection using SDP with Automatic Path Migration (APM)
enabled to an Oracle Exadata Database Machine will cause the connection to hang on the database server. exachk cannot tell
from querying an Oracle Exadata Database Machine if there is, or ever will be, an end user application accessing the database
servers via SDP. The Best Practice recommendation for stability is therefore to turn off APM on all Oracle Exadata Database
Machines and any clients that may seek to establish an SDP connection with them.

Action / Repair:

To verify that the SDP over IB option "sdp_apm_enable" is set to "0" in the proper configuration file and the running kernel,
execute the following command as the "root" userid on all database servers.

unset IB_SDP_OUTPUT_FILE;
unset IB_SDP_OUTPUT_KERNEL_RSLT;
unset IB_SDP_FILE;
unset KERNEL_TYPE;
MODULE=ib_sdp;
OPTION=sdp_apm_enable
if ! /sbin/lsmod | grep -q "^${MODULE}[[:space:]]"; then
echo "Module ${MODULE} is not loaded, so ${OPTION} will not be checked";
else
echo "Module ${MODULE} is loaded, so ${OPTION} will be checked";
KERNEL_TYPE=$(uname -r | cut -d"." -f6);
if [ $KERNEL_TYPE = "el5uek" ]
then
IB_SDP_FILE="/etc/modprobe.conf";
elif [ $KERNEL_TYPE = "el6uek" ]
then
IB_SDP_FILE="/etc/modprobe.d/ib_sdp.conf";
else
echo -e "ERROR: unable to determine IB_SDP_FILE: $KERNEL_TYPE";
fi;
IB_SDP_OUTPUT_FILE=$(egrep "ib_sdp" $IB_SDP_FILE);
if [ -s /sys/module/ib_sdp/parameters/sdp_apm_enable ]
then
IB_SDP_OUTPUT_KERNEL_RSLT=$(cat /sys/module/ib_sdp/parameters/sdp_apm_enable);
else
IB_SDP_OUTPUT_KERNEL_RSLT="/sys/module/ib_sdp/parameters/sdp_apm_enable not found";
fi;
if [[ `echo "$IB_SDP_OUTPUT_FILE" | egrep "sdp_apm_enable*.=0" | wc -l | sed -e 's/^[ \t]*//'` = 1 && `echo "$IB_SDP_OUTPUT_FILE" | wc -l | sed -e 's/^[ \t]*//'` = 1 ]]
then
IB_SDP_OUTPUT_FILE_RSLT=0;
fi;
if [[ "$IB_SDP_OUTPUT_FILE_RSLT" = 0 && "$IB_SDP_OUTPUT_KERNEL_RSLT" = 0 ]]
then
echo -e "SUCCESS: sdp_apm_enable is set to 0 in $IB_SDP_FILE and running kernel.";
echo -e "$IB_SDP_FILE: $IB_SDP_OUTPUT_FILE";
echo -e "Running Kernel: $IB_SDP_OUTPUT_KERNEL_RSLT";
else
echo -e "FAILURE: sdp_apm_enable should be set to 0 in $IB_SDP_FILE and running kernel.";
echo -e "$IB_SDP_FILE: $IB_SDP_OUTPUT_FILE";
echo -e "Running Kernel: $IB_SDP_OUTPUT_KERNEL_RSLT";
fi;
fi;

The output should be similar to:

Module ib_sdp is not loaded, so sdp_apm_enable will not be checked

- OR -

Module ib_sdp is loaded, so sdp_apm_enable will be checked

SUCCESS: sdp_apm_enable is set to 0 in /etc/modprobe.conf and running kernel.
/etc/modprobe.conf: options ib_sdp sdp_zcopy_thresh=0 recv_poll=0 sdp_apm_enable=0
Running Kernel: 0

If the output is not as expected, investigate the configuration for root cause and make appropriate corrections.
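
When investigating, the options line from the configuration file can be tested on its own. The sketch below is an illustration only; the function name sdp_apm_disabled_in_line is hypothetical, and the sample line is the one from the expected output above.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a modprobe "options" line sets sdp_apm_enable=0.
sdp_apm_disabled_in_line() {
    local re='(^|[[:space:]])sdp_apm_enable=0([[:space:]]|$)'
    [[ $1 =~ $re ]]
}

line='options ib_sdp sdp_zcopy_thresh=0 recv_poll=0 sdp_apm_enable=0'
if sdp_apm_disabled_in_line "$line"; then
    echo "sdp_apm_enable=0 is present in: $line"
else
    echo "sdp_apm_enable=0 is missing from: $line"
fi
```

This covers only the configuration file; the running kernel value is read from /sys/module/ib_sdp/parameters/sdp_apm_enable, as in the verification script.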

NOTE: The 11.x and 12.x series are separate code lines, which is why there are two entries under "Exadata Version". Above the
versions listed in "Exadata Version", APM is off by default in the Linux kernel, but it can still be manually activated.

NOTE: For additional guidance on configuring sdp_apm_enable, please see "SDP Connection in inter-connected Exalogic and
Exadata stopped working (Doc ID 1588546.1)"

Verify /etc/oratab

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Important | WARN | 02/06/14 | <Name> | Production | Exadata |

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | ALL | Solaris - 11, Linux x86-64 UEK5.8 | exachk 2.2.5

Benefit / Impact:

Risk:

Stale or invalid entries in oratab remove the ability to automate tasks such as relinking Oracle homes.

Action / Repair:

Verify that:

all directories point to real locations with the $ORACLE_HOME/bin/oracle binary in place
only one GI home exists
one and only one +ASM entry exists
the +ASM entry matches the GI home containing the $ORACLE_HOME/bin/crsd.bin binary

A quick script with 5 basic checks is made available here. The script was written quickly and serves only as an example of what we
are trying to accomplish.
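
Some of the checks listed above can be sketched in shell as follows. This is not the script referred to in this section. The function name check_oratab is hypothetical, the sketch assumes the usual SID:ORACLE_HOME:flag line format, and it does not verify that only one GI home exists.

```shell
#!/bin/bash
# Hypothetical helper: basic sanity checks on an oratab-style file.
# Returns 0 if every check passes, 1 otherwise.
check_oratab() {
    local oratab=$1 sid home flag asm_count=0 rc=0
    while IFS=: read -r sid home flag; do
        case $sid in ''|'#'*) continue ;; esac      # skip blank and comment lines
        [ -x "$home/bin/oracle" ] || { echo "FAIL: $sid: no executable $home/bin/oracle"; rc=1; }
        if [[ $sid == +ASM* ]]; then
            asm_count=$(( asm_count + 1 ))
            [ -x "$home/bin/crsd.bin" ] || { echo "FAIL: $sid: no executable $home/bin/crsd.bin"; rc=1; }
        fi
    done < "$oratab"
    [ "$asm_count" -eq 1 ] || { echo "FAIL: expected exactly one +ASM entry, found $asm_count"; rc=1; }
    return $rc
}

# Demonstration with placeholder homes in a temporary directory.
demo=$(mktemp -d)
mkdir -p "$demo/grid/bin" "$demo/db/bin"
touch "$demo/grid/bin/oracle" "$demo/grid/bin/crsd.bin" "$demo/db/bin/oracle"
chmod +x "$demo/grid/bin/oracle" "$demo/grid/bin/crsd.bin" "$demo/db/bin/oracle"
printf '+ASM1:%s:N\norcl1:%s:N\n' "$demo/grid" "$demo/db" > "$demo/oratab"
if check_oratab "$demo/oratab"; then echo "PASS: oratab entries look consistent"; fi
rm -rf "$demo"
```

On a database server the argument would be /etc/oratab.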

Verify consistent software and configuration across nodes

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Important | WARN | 02/6/2014 | <Name> | Production | Exadata | See bug list in linked section below.

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | All | exachk various

Benefit / Impact:

Consistent software and configuration across nodes increases stability and performance, and facilitates problem diagnosis.

Risk:

Inconsistent software and configuration across nodes can cause crashes and performance degradation, and can make problem
diagnosis difficult.

Action / Repair:

Recommended consistency checks are provided at the following location:

Exadata Best Practices Cross Node Consistency

Verify all database and storage servers time server configuration

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | CRITICAL | 05/01/19 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, Exadata - User Domain, SSC | ALL | 29605287 - exachk, 29031050 - exachk, 27262264 - exachk, 24696447 - exachk

DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux, Sparc | exachk 19.3.0 | N/A

Benefit / Impact:

Verifying all database and storage servers time server configurations are as expected can help avoid issues such as impaired
performance or node eviction.

The impact of verifying all database and storage servers time server configuration is minimal. The impact of making corrections
varies depending upon the root cause of the difference.

Risk:

Significant time drift on database and storage servers may cause unexpected storage server crashes or database server node
evictions.

Action / Repair:

NOTE: This check will only pass if the following are all true on each database or storage server:
1) There are one or more time servers specified in the configuration file (/etc/chrony.conf or /etc/ntp.conf).
2) Each storage or database server is synched with one of the set of available time sources in the configuration
file.
3) The maximum time drift for each storage or database server from the synched time source reported is less than
or equal to 1 second.

To verify all database and storage servers time server configuration, run exachk and review the provided report.

In the "Cluster Wide" section of the report, the overall result should be "PASS":

PASS All database and storage servers time server configuration is as expected Cluster Wide View

In the "View" detail section of the report for this check the expected output should be similar to:

Status on Cluster Wide:

PASS => Time services are properly configured

DATA FROM RANDOM05ADM05 - VERIFY ALL DATABASE AND STORAGE SERVERS TIME SERVER CONFIGURATION

SUCCESS: time services are properly configured.

In the "View" detail section of the report for this check a "FAILURE" example will be similar to:

FAILURE: time services are not properly configured. Details:

randomadm05: FAILURE: server count: 1 synched server in conf: 1 timedrift: 2

randomceladm07: FAILURE: server count: 0 synched server in conf: 1 timedrift: 0
randomceladm08: FAILURE: server count: 1 synched server in conf: 0 timedrift: 0

NOTE: A "FAILURE" result prints the gathered data from the cluster to help identify the issue.
NOTE: This configuration failed because
1) randomadm05 timedrift is too high.
2) randomceladm07 has no servers defined in the configuration file.
3) randomceladm08 is not synchronized to a server defined in the configuration file.
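
The three pass conditions can be expressed as a small shell function applied to the values shown in a "FAILURE" line. This is only a sketch of the logic described in this section, not the exachk implementation. The function name time_check_result is hypothetical, and it assumes the drift is reported in whole seconds and that a "synched server in conf" value of 1 or more means the server is synchronized to a configured source.

```shell
#!/bin/bash
# Hypothetical helper: apply the three pass conditions to the values gathered for one server.
#   $1 = number of time servers in the configuration file
#   $2 = number of configured servers the host is synchronized with
#   $3 = time drift in seconds
time_check_result() {
    local server_count=$1 synched_in_conf=$2 timedrift=$3
    if [ "$server_count" -ge 1 ] && [ "$synched_in_conf" -ge 1 ] && [ "$timedrift" -le 1 ]; then
        echo "SUCCESS"
    else
        echo "FAILURE"
    fi
}

# Values taken from the "FAILURE" example above, plus one passing case.
echo "randomadm05:    $(time_check_result 1 1 2)"
echo "randomceladm07: $(time_check_result 0 1 0)"
echo "randomceladm08: $(time_check_result 1 0 0)"
echo "passing case:   $(time_check_result 1 1 0)"
```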

If the result is not as expected, investigate for root cause and take appropriate corrective action.

NOTE: If after corrective actions are completed, you wish to run this one check without a full exachk run execute
the following command as the "root" userid in the directory in which exachk was installed:

./exachk -check 85C96EAB566F8F13E053D498EB0AE6F1,85C9BA643125E253E053D598EB0A6D07,85CEDB9B0FBF1262E053D298E

Verify Sar files have read permissions for non-root user

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 1/24/2013 | <Name> | Draft | Exadata |

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | N/A | Solaris - 11, Linux x86-64 UEK5.8 | exachkx

Benefit / Impact:

Ability for non-root users including EM to monitor System Activity Report (sar) files.

Risk:

Inability for non-root users including EM to monitor sar files.

Action / Repair:

To verify that read permissions are set for the sar files, execute the script below.

##### begin script

#!/bin/bash

if [ `stat -c %A /var/log/sa/sa* | awk 'END{print}' | sed 's/.......\(.\).\+/\1/'` != "r" ]

then
echo "Sar files does not have the proper read permission set for non-root users. To correct, issue this command
else
echo "Sar file permissions are correct and no further action is needed."
fi
#### end script
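
The script above inspects only the last file matched by /var/log/sa/sa*. As a complementary sketch, the following hypothetical function (sar_files_missing_read is not part of exachk) lists every sar file that lacks the read bit for "other" users. It assumes GNU stat, as used in the script above.

```shell
#!/bin/bash
# Hypothetical helper: print every sar data file under the given directory
# whose "other" read bit is not set. Assumes GNU stat (stat -c %A).
sar_files_missing_read() {
    local dir="${1:-/var/log/sa}" f
    for f in "$dir"/sa*; do
        [ -f "$f" ] || continue
        # In a mode string such as "-rw-r--r--", character 8 is the read
        # bit for users other than the owner and the group.
        if [ "$(stat -c %A "$f" | cut -c8)" != "r" ]; then
            echo "$f"
        fi
    done
}
```

Called with no argument, the function inspects /var/log/sa. An empty result means every file it found has the "other" read bit set.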

Verify that the patch for bug 16618055 is applied

Important Warn 05/29/14 <Name> Production Exadata

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version TBD

>= 11.2.0.4 and < 11.2.0.4.8 N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 N/A N/A exachk 2.2.5

Benefit / Impact:

Applying the patch for bug 16618055 allows recovery to utilize ASYNC I/O, providing greater recovery performance and a shorter
Recovery Time Objective.

The impact of verifying that the patch for bug 16618055 is applied is minimal. The impact of applying the patch for bug
16618055 varies by method.

Risk:

Without the patch for bug 16618055 applied, recovery uses SYNC I/O for all log and block read operations which causes slower
recovery slave performance and a longer Recovery Time Objective.

Action / Repair:

To verify that the patch for bug 16618055 is applied, as the owner of each RDBMS home, with the environment properly
configured, execute the following command for each RDBMS home:

$ORACLE_HOME/OPatch/opatch lsinventory -bugs_fixed|egrep -w '^16618055|^Bug|Patch'|grep -v Installer

The output should be similar to:

Bug Fixed by Installed at Description

Patch
16618055 18642122 Fri Jun 13 11:32:22 PDT 2014 SLOW REDO APPLY ON EXADATA DUE TO SYNC IOS

If the appropriate patch is not already applied, and the database software version is 11.2.0.4 and the Bundle Patch applied is
less than Bundle Patch 8 then you must apply the patch for bug 16618055 to the appropriate database home.
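
The decision rule in the preceding paragraph can be sketched as a small shell function. This is an illustration only: the function name patch_16618055_required is hypothetical, and it assumes the version is supplied as a five-part string whose fifth component is the Bundle Patch level (for example 11.2.0.4.5), with the opatch output saved to a file beforehand.

```shell
#!/bin/bash
# Hypothetical helper: decide whether the patch for bug 16618055 still needs
# to be applied.
#   $1 = database version including Bundle Patch level, e.g. 11.2.0.4.5
#   $2 = file holding the output of "opatch lsinventory -bugs_fixed"
# Returns 0 when the patch is required, 1 otherwise.
patch_16618055_required() {
    local version="$1" inventory="$2" base bundle
    base=$(echo "$version" | cut -d. -f1-4)
    bundle=$(echo "$version" | cut -d. -f5)
    # Only 11.2.0.4 below Bundle Patch 8 is in scope for this check.
    [ "$base" = "11.2.0.4" ] || return 1
    [ "${bundle:-0}" -lt 8 ] || return 1
    # The bug is already fixed if the inventory lists it.
    if grep -qw '^16618055' "$inventory"; then
        return 1
    fi
    return 0
}
```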

NOTE: For additional detail, please see My Oracle Support note "ASYNC IO In Exadata Not Working (Doc ID
1642088.1)".

Verify the Name Service Cache Daemon (NSCD) is Running

Priority Alert Level Date Owner Status Engineered System Engineered System Platform
Critical FAIL 07/17/17 <Name> Production Exadata - Physical, Exadata - Management Domain, Exadata - User Domain ALL
DB/GI Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version
N/A N/A N/A N/A ALL Linux exachk 12.2.0.1.4

Benefit / Impact

Verifying the NSCD configuration ensures the correct configuration when providing cache for the most common name service
requests, like passwords, groups, hosts.

The impact of verifying the NSCD configuration is minimal. While configuring and starting the NSCD can be done without a
reboot, a reboot is recommended to prove the configuration is correct and survives a boot procedure.

NOTE: The recommended NSCD attribute values vary depending upon whether or not the System Security
Services Daemon (SSSD) is also in use.

Risk:

When the NSCD and SSSD daemons are running together, an incorrect configuration could cause processes to use the incorrect
cache service. Typical problems include CRS start failure due to an invalid password and new connections to the database
suddenly failing with invalid password errors (ORA-1031, ORA-1017), among others.

Action / Repair:

To verify the NSCD is properly configured, as the root userid on each database server, execute the following code:

NSCD_SERVICE_DATA=$(service nscd status 2>&1)

SSSD_SERVICE_DATA=$(service sssd status 2>&1)
NSCD_AUTOSTART_DATA=$(chkconfig --list nscd 2>&1 | sed -e 's/ */ /g' -e 's/ *//')
NSCD_AUTOSTART_CONFIGURED=$(echo $NSCD_AUTOSTART_DATA |awk '{if ($0 ~ /3:on/ || $0 ~ /5:on/) {print "1";exit 1}el
if [ -r /etc/nscd.conf ]
then
NSCD_FILE_DATA=$(egrep "enable-cache" /etc/nscd.conf | grep -v "#" | awk '{print $2 ": " $3}')
else
NSCD_FILE_DATA=$(ls -l /etc/nscd.conf 2>&1)
fi
NSCD_MEMORY_DATA=$(for CACHE_NAME in passwd group hosts services netgroup; do echo -e "$CACHE_NAME: `nscd -g 2>/d
NSCD_SERVICE_STATUS=$(echo $NSCD_SERVICE_DATA | grep running | wc -l)
SSSD_SERVICE_STATUS=$(echo $SSSD_SERVICE_DATA | grep running | wc -l)
NSCD_FILE_DATA_SHORT=$(echo "$NSCD_FILE_DATA" | awk '{print $2}' | tr -d " \t\n\r")
NSCD_MEMORY_DATA_SHORT=$(echo "$NSCD_MEMORY_DATA" | awk '{print $2}' | tr -d " \t\n\r")
if [ $NSCD_FILE_DATA_SHORT = $NSCD_MEMORY_DATA_SHORT ] 2>/dev/null
then
MEMORY_MATCHES_FILE=1
else
MEMORY_MATCHES_FILE=0
fi
if [ $SSSD_SERVICE_STATUS -eq 0 ] # only NSCD
then
if [ "$NSCD_FILE_DATA_SHORT" == "yesyesyesyesno" ]
then
NSCD_ATTRIBUTES_CORRECT=1
else
NSCD_ATTRIBUTES_CORRECT=0
fi
else # NSCD and SSSD
if [ "$NSCD_FILE_DATA_SHORT" == "yesnononono" ]
then
NSCD_ATTRIBUTES_CORRECT=1
else
NSCD_ATTRIBUTES_CORRECT=0
fi
fi
if [ $NSCD_SERVICE_STATUS -eq 1 ] && [ $NSCD_AUTOSTART_CONFIGURED -eq 1 ] && [ $MEMORY_MATCHES_FILE -eq 1 ] && [
then
echo -e "SUCCESS: The Name Service Cache Daemon (NSCD) configuration is correct:\n"
echo -e "NSCD service data: $NSCD_SERVICE_DATA\n"
echo -e "SSSD service data: $SSSD_SERVICE_DATA\n"
echo -e "NSCD autostart data: $NSCD_AUTOSTART_DATA\n"
echo -e "NSCD file data:\n$NSCD_FILE_DATA\n"
echo -e "NSCD memory data:\n$NSCD_MEMORY_DATA\n"
else
echo -e "FAILURE: The Name Service Cache Daemon (NSCD) configuration is not correct:\n"
echo -e "NSCD service data: $NSCD_SERVICE_DATA\n"
echo -e "SSSD service data: $SSSD_SERVICE_DATA\n"
echo -e "NSCD autostart data: $NSCD_AUTOSTART_DATA\n"
echo -e "NSCD file data:\n$NSCD_FILE_DATA\n"
echo -e "NSCD memory data:\n$NSCD_MEMORY_DATA\n"
fi

The expected output should be similar to:

SUCCESS: The Name Service Cache Daemon (NSCD) configuration is correct:

NSCD service data: nscd (pid 69150) is running...

SSSD service data: sssd: unrecognized service
NSCD autostart data: nscd 0:off 1:off 2:on 3:on 4:on 5:on 6:off

NSCD file data:

passwd: yes
group: yes
hosts: yes
services: yes
netgroup: no

NSCD memory data:

passwd: yes
group: yes
hosts: yes
services: yes
netgroup: no

-- OR --

SUCCESS: The Name Service Cache Daemon (NSCD) configuration is correct:

NSCD service data: nscd (pid 69150) is running...

SSSD service data: sssd (pid 91505) is running...

NSCD autostart data: nscd 0:off 1:off 2:on 3:on 4:on 5:on 6:off

NSCD file data:

passwd: yes
group: no
hosts: no
services: no
netgroup: no

NSCD memory data:

passwd: yes
group: no
hosts: no
services: no
netgroup: no

If the output is not as expected take the following actions as the root userid:

1) If the NSCD is not set for autostart, enable the NSCD to autostart on reboots:

chkconfig --level 35 nscd on

NOTE: The autostart levels vary by Exadata Storage Server Software version; at least levels 3 and 5 should be set.

2) The entries for the /etc/nscd.conf file depend upon whether or not SSSD is in use with NSCD. For NSCD without SSSD, the
following entries should be present in the /etc/nscd.conf file:

enable-cache passwd yes

enable-cache group yes
enable-cache hosts yes
enable-cache services yes
enable-cache netgroup no

For NSCD with SSSD, the following entries should be present in the /etc/nscd.conf file:

enable-cache passwd yes

enable-cache group no
enable-cache hosts no
enable-cache services no
enable-cache netgroup no

If the values are not as expected, modify the /etc/nscd.conf file.

NOTE: the /etc/nscd.conf file can be edited with the "vi" editor.
NOTE: these attributes are spread throughout the /etc/nscd.conf file, at the head of other attributes that pertain to
each cache. They are not grouped together. For example:
enable-cache services yes
positive-time-to-live services 28800
negative-time-to-live services 20
suggested-size services 211
check-files services yes
persistent services yes
shared services yes
max-db-size services 33554432
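
Because the enable-cache attributes are scattered through the file, it can help to collect them in one place. The following is a minimal sketch with hypothetical function names (nscd_cache_flags and nscd_expected_flags). It produces the same concatenated form, such as "yesyesyesyesno", that the verification code above compares against.

```shell
#!/bin/bash
# Hypothetical helper: concatenate the enable-cache values from an
# nscd.conf-style file in the fixed order passwd, group, hosts, services,
# netgroup. Commented-out entries are ignored because their first field is
# not exactly "enable-cache".
nscd_cache_flags() {
    local conf="${1:-/etc/nscd.conf}" cache
    for cache in passwd group hosts services netgroup; do
        awk -v c="$cache" '$1 == "enable-cache" && $2 == c { print $3 }' "$conf"
    done | tr -d '\n'
}

# Expected string for the two supported cases.
#   $1 = 1 when SSSD is running alongside NSCD, 0 when only NSCD is in use
nscd_expected_flags() {
    if [ "$1" -eq 1 ]; then
        echo "yesnononono"
    else
        echo "yesyesyesyesno"
    fi
}
```

Comparing the output of nscd_cache_flags with nscd_expected_flags shows at a glance whether /etc/nscd.conf needs to be edited.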

3) It is recommended to reboot the database server to ensure that the configuration is correct and is persistent across the
reboot process.

4) If a reboot is not immediately possible, as a workaround, the service may be started or restarted manually:

service nscd start

Starting nscd: [ OK ]

- OR -

service nscd restart

Stopping nscd: [ OK ]
Starting nscd: [ OK ]

For additional guidance on NSCD, please see:

Oracle® Grid Infrastructure Installation Guide 11g Release 2 (11.2) for Linux

Oracle® Grid Infrastructure Installation Guide 12c Release 1 (12.1) for Linux

Verify kernels and initrd in /boot/grub/grub.conf are available on the system

Priority Alert Level Date Owner Status Scope Bug(s)

Critical FAIL 8/29/14 <Name> Production Exadata,

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version TBD

N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 11.2.2.2.0+ Linux x86-64

Benefit / Impact:

The impact of verifying that the kernel and initrd listed in grub.conf are actually available on the system is minimal. When the
kernel or initrd file is unavailable, the user should either remove the corresponding entry from grub.conf (if possible) or install the
appropriate files in the correct location (recommended).

Risk:

If entries in grub.conf exist that refer to kernel and initrd files not installed on the system, the next reboot may fail. The system
will 'hang' in the bootloader.

Action / Repair:

To verify that the entries in grub.conf match what is installed, the following approach can be used (shown in pseudocode):

for each 'title' in /boot/grub/grub.conf

do
get the value for 'kernel' without other arguments; check if the file is found on disk in /boot; raise an alert
get the value for 'initrd' without other arguments; check if the file is found on disk in /boot; raise an alert
done
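
The pseudocode above could be sketched in shell as follows. This is an illustration, not the exachk implementation: the function name check_grub_files is hypothetical, and it assumes a legacy GRUB file in which each kernel or initrd line starts with the keyword followed by a path relative to the /boot file system, with no "(hdX,Y)" device prefix.

```shell
#!/bin/bash
# Hypothetical helper: report kernel and initrd files that are referenced in
# a legacy GRUB configuration file but are missing on disk.
#   $1 = path to grub.conf (default /boot/grub/grub.conf)
#   $2 = directory that holds the boot files (default /boot)
# Returns 0 when every referenced file exists, 1 otherwise.
check_grub_files() {
    local conf="${1:-/boot/grub/grub.conf}" boot="${2:-/boot}"
    local keyword path rest rc=0
    # Only the first two fields matter: the keyword and the file path.
    # Remaining kernel arguments land in "rest" and are ignored.
    while read -r keyword path rest; do
        case "$keyword" in
            kernel|initrd)
                if [ ! -f "$boot/${path#/}" ]; then
                    echo "ALERT: $keyword file $path not found under $boot"
                    rc=1
                fi
                ;;
        esac
    done < "$conf"
    return $rc
}
```

Run as root with no arguments, the function checks /boot/grub/grub.conf against /boot and prints one ALERT line per missing file.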

Verify basic Logical Volume(LVM) system devices configuration

Priority Alert Level Date Owner Status Engineered System Bug(s)

Critical FAIL 12/09/15 <Name> Production Exadata - Physical, Exadata - Management Domain

DB Version DB Role Engineered System Platform Exadata Version OS & Version Validation Tool Version TBD

N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8 11.2.2.2.0+ Linux x86-64 exachk 12.1.0.2.6

Benefit / Impact:

The impact of verifying that the basic Logical Volume(LVM) system devices configuration is correct is minimal. The impact of
correcting any abnormalities depends upon the specific abnormality.

Risk:

If the basic Logical Volume(LVM) system devices configuration is not correct, there may be risk of patching interruption or
unexpected downtime.

Action / Repair:

The basic Logical Volume(LVM) system devices configuration varies by Exadata software version level and hardware type.
exachk runs the appropriate checks based on Exadata software version levels and hardware type. To validate the basic Logical
Volume(LVM) system devices configuration, run exachk and review the provided report.

The expected output in the exachk report should be as follows:

In the "Findings Passed" summary section of the report, the overall result should be "PASS":

PASS OS Check Basic Logical Volume(LVM) system devices configuration meets recommendations. All Database Servers View

In the "View" detail section of the report for each individual database server:

(*) PASS: This is an LV (Logical Volume) enabled system

(*) PASS: LVDbSys1 should reside in Volume Group (VG) VGExaDb.
(*) PASS: LVDbSys2 should reside in Volume Group (VG) VGExaDb.
(*) PASS: Minimum number of LVDbSys LV's
(*) PASS: Maximum number of LVDbSys LV's
(*) PASS: LVDbSys LV minimum size of /dev/mapper/VGExaDb-LVDbSys2
(*) PASS: LVDbSys LV size
(*) PASS: LVDbSys inactive LV minimum size of /dev/mapper/VGExaDb-LVDbSys1
(*) PASS: Inactive LVDbSys LV's not mounted
(*) PASS: Enough free space found for snapshot
(*) PASS: No filesystem label issues for DBSYS
(*) PASS: No reclaimdisk issues found
(*) PASS: No active lvm snapshots found

If the items reported are not all "PASS", investigate the root cause and take appropriate corrective action.

Ensure db_unique_name is unique across the enterprise

Priority Alert Level Date Owner Status Scope Bug(s)

Critical FAIL 02/25/2015 <Name> Production Exadata

11.2+ All X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 N/A Solaris - 11 Linux x86-64 UEK5.8 exachk 12.1.0.2.2

Benefit / Impact:

db_unique_name is used extensively in many Clusterware, RDBMS, and Exadata code layers. Uniqueness is enforced within
clusters but not across clusters. Ensuring db_unique_name is unique across clusters, especially those that are sharing the same
Exadata storage, ensures that all code layers that use it work properly.

Risk:

Having databases with the same db_unique_name across different Real Application Clusters that share the same Exadata
storage causes unexpected behavior such as database isolation, crashes, or failures to start.

Action / Repair:

The following is an example of a sqlplus command checking whether db_unique_name has been explicitly set:

SQL> select isdefault from v$parameter where name ='db_unique_name';

ISDEFAULT
---------
FALSE

If the output is "FALSE", then someone has explicitly set db_unique_name and not let it default to the value of db_name.

If the output is "TRUE", then db_unique_name is set to its default value, i.e., the same as db_name.

Oracle recommends that db_unique_name is unique across a customer's Oracle enterprise. exachk running on a given Real
Application Cluster cannot check all values across a customer's enterprise. This exachk check assumes that "FALSE" means
specific care has been taken to ensure uniqueness across the customer's enterprise and is considered the "PASS" condition.
"TRUE" is assumed to imply that enterprise uniqueness may not have been considered and is the "FAIL" condition.

NOTE: the corrective action is to ensure all databases have a unique name across the customer's Oracle enterprise, especially
those accessing the same Exadata storage. If every database is confirmed to have a unique name without setting
db_unique_name universally, then this exachk check may be disabled or ignored.

Verify average ping times to DNS nameserver

Priority Alert Level Date Owner Status Scope Bug(s)

Critical WARN 01/14/2015 <Name> Production Exadata

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version TBD

N/A N/A X2-2(4170), X2-2, X2-8, X3-2, EIGHTH, X3-8, X4-2 11.2.3.2.0+ Solaris - 11 Linux x86-64 UEK5.8 exachk 12.1.0.2.2

Benefit / Impact:

Secure Shell (SSH) remote login procedures require communication between the remote target device and the DNS nameserver.
Minimal average ping times to the DNS nameserver improve SSH login times and help to avoid problems such as timeouts or
failed connection attempts.

The impact of verifying average ping times to the DNS nameserver is minimal. The impact required to minimize average ping
times to the DNS nameserver varies by configuration and cannot be estimated here.

Risk:

Long ping times between remote SSH targets and the active DNS server may cause remote login failures, performance issues, or
dropped application connections.

Action / Repair:

To verify average ping times to DNS nameserver, enter the following command set as the "root" userid on each database server,
storage server, and InfiniBand switch:

HOST_NAME=$(hostname);
if [ -s /usr/local/bin/version ]
then
DNS_SERVER=$(grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' /etc/resolv.conf | head -1);
else
DNS_SERVER=$(nslookup $HOST_NAME | head -1 | cut -d: -f2 | sed -e 's/^[ \t]*//');
fi;

OS_TYPE=$(uname);
if [ $OS_TYPE = "Linux" ]
then
PING_COMM="ping -c10 $DNS_SERVER";
else
PING_COMM="ping -s $DNS_SERVER 56 5";
fi;
AVG_PING_TIME=$($PING_COMM | egrep avg | cut -d"/" -f5);
TRNC_AVG_PING_TIME=$(echo $AVG_PING_TIME | cut -d"." -f1);
if [ "$TRNC_AVG_PING_TIME" -le "3" ];
then
echo -e "SUCCESS: Average ping times to DNS nameserver should not be negatively impacting SSH operations: $AVG_P
echo -e "Active DNS Server IP: $DNS_SERVER\n";
else
echo -e "WARNING: Average ping times to DNS nameserver MAY be negatively impacting SSH operations: $AVG_PING_TIM
echo -e "Active DNS Server IP: $DNS_SERVER\n";
fi;

The output should be similar to the following:

SUCCESS: Average ping times to DNS nameserver should not be negatively impacting SSH operations: 3.255
Active DNS Server IP: 111.222.333.444

If the result is a "WARNING", first repeat the command set several times at different intervals to determine if the results are
consistent. The command set is one spot check for ten pings. The environment could normally have a short delay and an
execution just happened to catch a period of poor response, or it could normally have a long delay and an execution just
happened to catch a period of good response. If the results are consistent, determine the root cause and take appropriate
corrective action.
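
When interpreting repeated samples, it helps to know exactly how the command set reaches its verdict. The following sketch isolates that arithmetic in two hypothetical functions (avg_from_ping_output and classify_avg_ping). It assumes the Linux ping summary format, in which the average is the fifth "/"-separated field.

```shell
#!/bin/bash
# Hypothetical helpers mirroring the arithmetic of the command set above.

# Extract the average round-trip time from Linux ping output, whose summary
# line looks like: rtt min/avg/max/mdev = 0.201/3.255/9.870/1.234 ms
avg_from_ping_output() {
    printf '%s\n' "$1" | grep avg | cut -d"/" -f5
}

# Classify an average (in milliseconds) the same way the command set does:
# drop the fractional part, then compare against a threshold of 3.
classify_avg_ping() {
    local avg="$1" threshold="${2:-3}" whole
    whole=${avg%%.*}
    if [ "$whole" -le "$threshold" ]; then
        echo "SUCCESS"
    else
        echo "WARNING"
    fi
}
```

Because the fractional part is dropped before the comparison, any average below 4 ms is reported as SUCCESS; an average of 3.255 ms passes for that reason.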

NOTE: The result of this command set is a reflection of how DNS is implemented in the environment and not
evidence in itself of a defect in the Oracle Exadata Database Machine.

NOTE: A "WARNING" result does not prove that a delay is causing SSH connectivity problems in the environment.
A "WARNING" result should always be evaluated in conjunction with a review of SSH connectivity issues in the
environment. If there are other SSH connectivity issues present, evaluate if reducing or stabilizing the average
ping times to the DNS nameserver may correct the issues.

NOTE: As with many other network performance metrics, the average ping times to DNS nameserver should be
"minimal". However, it is possible that any given environment may return a result that exceeds the threshold used
in this command set, yet it is satisfactory given the overall environment characteristics and lack of other related
problems. IF NO OTHER PROBLEMS related to DNS exist other than this command set returning a "WARNING",
and the numbers reported are acceptable after a "baseline" for the given environment has been established by
repeated sampling, then the documented procedures for bypassing this check in exachk may be implemented.

NOTE: Due to the differences in available commands for the InfiniBand switch, the command set assumes the first
"nameserver" in /etc/resolv.conf is the "active" DNS server.

NOTE: The use of the Name Service Cache Daemon (NSCD) may also mitigate the effects of long average ping
times to DNS nameserver. For more information see: Verify the Name Service Cache Daemon (NSCD) is Running

Verify Running-config and Startup-config are the same on the Cisco switch

Priority Alert Level Date Owner Status Scope Bug(s)

Medium WARN 12/01/14 <Name> Production Exadata, SSC, Exalogic

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version TBD

N/A N/A X2-2, X2-8, X3-2, X3-8, X4-2 ALL cat4500-IPBASEK9-M, Version 15.0(2)SG8 N/A

Benefit / Impact:

To keep the switch running the same configuration after it reboots, it is a best practice to have the running-config the same as
the startup-config.

Risk:

Potential management network issues if the startup-config contains pre-install defaults, or other customizations made by the
Customer.

Action / Repair:

Compare the startup-config and the running-config. The simplest way to do this is to capture the output from the switch and run
diff on the capture files.

To capture the output of an ssh session, use the tee command to create the log file:

unixhost ~ > ssh admin@randomsw-adm0 2>&1 | tee /tmp/running.out

From that new connection, go into enable, set the terminal so it does not pause its output, and show the running configuration
(not the "all")

randomsw-adm0> enable
randomsw-adm0# terminal length 0 <- this causes the output to not pause
randomsw-adm0# show running-config all
randomsw-adm0# exit

Now, do the same to check the startup configuration:

unixhost ~ >ssh admin@randomsw-adm0 2>&1 | tee /tmp/start.out

From that new connection, go into enable, set the terminal so it does not pause its output, and show the startup configuration
(not the "all")

randomsw-adm0> enable
randomsw-adm0# terminal length 0 <- this causes the output to not pause
randomsw-adm0# show startup-config all
randomsw-adm0# exit

Modify the two files by removing the lines before the version number:

Version 15.0

and the last entry from the show command:

end

This modification will make the file format more suited for the diff command:

unixhost ~ > diff /tmp/start.out /tmp/running.out > /tmp/diff.out

The two files should have identical parameters. In my first attempt to validate the two configs, I saved running to startup and
then output both. The diff was:

194c194
< spanning-tree uplinkfast max-update-rate 444318408
---
> spanning-tree uplinkfast max-update-rate 444318920

It seems that no matter how many times I copy running to startup, this value still differs. Your results might be the same, so
examine the diff.out file to determine whether the differences are significant.
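
The manual trimming of the two capture files can also be scripted. The following is a sketch under stated assumptions: the function name trim_config is hypothetical, it keeps everything from the first line whose first word is "version" through the first line consisting only of "end", and it strips the carriage returns that a captured terminal session typically contains.

```shell
#!/bin/bash
# Hypothetical helper: reduce a captured switch session to the configuration
# itself so that two captures can be compared with diff.
#   $1 = capture file created with tee
trim_config() {
    tr -d '\r' < "$1" | awk '
        # Start printing at the first line that begins with "version".
        !started && tolower($1) == "version" { started = 1 }
        started { print }
        # Stop after the line that contains only "end".
        started && NF == 1 && $1 == "end" { exit }
    '
}
```

With both files trimmed, for example into /tmp/start.trim and /tmp/running.trim (hypothetical file names), diff can be run on the results without the login prompts and command echoes getting in the way.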

To make running and startup the same, go into the switch and then into the enable mode:

randomsw-adm0> enable
randomsw-adm0# copy running-config startup-config all
Destination filename [startup-config]?
Compressed configuration from 75923 bytes to 22210 bytes[OK]

To protect this setup, you should also copy the new config to a backup on the switch itself and to an external tftp server:

randomsw-adm0# copy running-config bootflash:cisco4948-ip-confg-before

Destination filename [cisco4948-ip-confg-before]?

13815 bytes copied in 1.376 secs (10040 bytes/sec)

Now to the external tftp server:

randomsw-adm0#copy running-config tftp

Address or name of remote host []? random-tftp-1
Destination filename [randomsw-adm0-confg]? cisco4948-ip-confg-before
!!
13815 bytes copied in 1.564 secs (8833 bytes/sec)

Validate SSH is installed and configured on Cisco management switch

Priority Alert Level Date Owner Status Scope Bug(s)

N/A FAIL 12/03/14 <Name> Production Exadata, SSC, Exalogic N/A

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version TBD

N/A N/A X2-2, X2-8, X3-2, X3-8, X4-2 N/A N/A N/A

Benefit / Impact: Telnet has no security and should be avoided. Early versions of the Cisco Internetwork Operating System
(IOS) for the Catalyst 4948 only had telnet available. Note 1415044.1 describes how to get the version of the IOS and how to
configure SSH. This is a check to validate that SSH is enabled; it also describes how to configure SSH and restrict the number of
simultaneous sessions into the switch.

Risk:

By using telnet, one risks a network sniffer obtaining the administrative and enable passwords. Once these passwords are
obtained, it is trivial to breach the switch and disable administrative access to Exadata. Depending on how this switch is
integrated into a Customer's network, infiltration into the Customer's network becomes a possibility.

The versions that do not contain SSH are those with the IP Base image, which have only IPBASE in the image name. For
instance, Cat4500-IPBASE-M will not have SSH, while Cat4500-IPBASEK9-M will have SSH in it.

Action / Repair:

The following was done on this version of the Cisco IOS which contains SSH (as it is cat4500-IPBASEK9-M).

Cisco IOS Software, Catalyst 4500 L3 Switch Software (cat4500-IPBASEK9-M), Version 15.0(2)SG8, RELEASE SOFTWARE (

One must first start a session into the switch. Once there, go into "enable" mode. One will notice the prompt change from ">" to
"#" to represent the enable session.

randomsw-adm0>enable
randomsw-adm0#

Find if SSH is enabled on the switch.

randomsw-adm0#show ip ssh
SSH Enabled - version 2.0
Authentication timeout: 60 secs; Authentication retries: 3

Validate the SSH configuration:

randomsw-adm0#show running-config all | include transport

no destination transport-method http
destination transport-method email
transport preferred none
transport preferred telnet
transport input telnet
transport output telnet
transport preferred none
transport input none
transport output none

In this case SSH is not listed so it is not configured to be used. A configuration that passes looks like this:

randomsw-adm0#show running-config all | include transport

no destination transport-method http
destination transport-method email
transport preferred none
transport preferred ssh
transport input ssh
transport output ssh
transport preferred none
transport input none
transport output none
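
If the "show running-config all | include transport" output has been captured to a file, the comparison between the failing and passing listings can be automated. The following is a minimal sketch; the function name telnet_transport_lines is hypothetical, and the capture file is assumed to contain the switch output exactly as shown above.

```shell
#!/bin/bash
# Hypothetical helper: print the transport lines in a captured configuration
# that still allow telnet. No output means no telnet transport is configured.
#   $1 = file containing the captured "include transport" output
telnet_transport_lines() {
    grep -E '^[[:space:]]*transport (preferred|input|output) .*telnet' "$1"
}
```

Any line printed by the function identifies a vty setting that still needs to be changed from telnet to ssh.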

Validate that the startup configuration is the same as the running.

randomsw-adm0#show startup-config | include transport

In this case they match. If further validation is needed, one will have to capture the running configuration and the startup
configuration and compare them.

If SSH is not enabled and there still are telnet entries in the above output, then the system needs to be configured for SSH. The
first step is to discover how many simultaneous sessions are available.

randomsw-adm0#show line
Tty Typ Tx/Rx A Modem Roty AccO AccI Uses Noise Overruns Int
0 CTY - - - - - 0 0 0/0 -
1 VTY - - - - - 66 0 0/0 -
2 VTY - - - - - 20 0 0/0 -
3 VTY - - - - - 6 0 0/0 -
4 VTY - - - - - 0 0 0/0 -
5 VTY - - - - - 0 0 0/0 -

There can be up to 16 VTY lines in this version of the IOS, so the list you see might be longer. This will allow up to 16
telnet/SSH sessions in the switch at the same time. Normally this is not a good idea, so in this document we will assume only
five total sessions are needed and will disable the rest. So below we will configure vty 1 up to vty 4. We will disable vty 5
through 16. The vty 0 is the serial port in the back of the switch.

randomsw-adm0#configure terminal
Enter configuration commands, one per line. End with CNTL/Z.

randomsw-adm0(config)#
randomsw-adm0(config)#line vty 1 4
randomsw-adm0(config-line)#transport preferred ssh
randomsw-adm0(config-line)#transport input none
randomsw-adm0(config-line)#transport input ssh
randomsw-adm0(config-line)#transport output none
randomsw-adm0(config-line)#transport output ssh
randomsw-adm0(config-line)#exit
randomsw-adm0(config)#line vty 5 16
randomsw-adm0(config-line)#transport preferred none
randomsw-adm0(config-line)#transport input none
randomsw-adm0(config-line)#transport output none
randomsw-adm0(config-line)#exit
randomsw-adm0(config)#exit

randomsw-adm0#show line vty 0 | include transport

Allowed input transports are ssh.
Allowed output transports are ssh.
Preferred transport is ssh.
randomsw-adm0#show line vty 1 | include transport
Allowed input transports are ssh.
Allowed output transports are ssh.
Preferred transport is ssh.
randomsw-adm0#show line vty 2 | include transport
Allowed input transports are ssh.
Allowed output transports are ssh.
Preferred transport is ssh.
randomsw-adm0#show line vty 3 | include transport
Allowed input transports are ssh.
Allowed output transports are ssh.
Preferred transport is ssh.
randomsw-adm0#show line vty 4 | include transport
Allowed input transports are ssh.
Allowed output transports are ssh.
Preferred transport is ssh.
randomsw-adm0#show line vty 5 | include transport
Allowed input transports are none.
Allowed output transports are none.
Preferred transport is none.

The rest of the "show line vty #" commands will show all transport options set to none. Because they are set to none, you will
only be able to have up to five SSH sessions. You will also not be able to get a telnet session on any of the vty's. We will test
this in later steps.

We now need to save the running configuration to the startup configuration so these changes are retained after a reboot.

randomsw-adm0#copy running-config startup-config all

Destination filename [startup-config]?
randomsw-adm0#exit

Now that you have exited from the session to the switch, it is time to test that it is really working. First try telnetting to the switch:

user@host ~ >telnet randomsw-adm0

Trying 111.222.333.444...
telnet: connect to address 111.222.333.444: Connection refused
telnet: Unable to connect to remote host: Connection refused

Now try SSH:

user@host ~ >ssh admin@randomsw-adm0

Password:
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.

To test the simultaneous connection restriction, keep opening SSH sessions (without exiting from them) until you get a Connection
refused error. Once you get that error, you have discovered how many simultaneous SSH sessions are possible. At this
point, while keeping those SSH sessions open, telnet into the switch. If you do not get a Connection refused error, the switch is
still open to telnet, so the configuration above needs to be troubleshot.

Verify Database Memory Allocation is not Greater than Physical Memory Installed on Database node

Priority Alert Level Date Owner Status Scope Bug(s)

Warn WARN 14/11/04 <Name> Production Exadata

DB Version DB Role Engineered System Exadata Version OS & Version Validation Tool Version TBD

ALL ALL ALL ALL ALL

Benefit / Impact:

Database memory allocation should never be greater than the physical memory installed on a database node. Over allocating
memory can cause memory swapping which will negatively impact performance.

Risk:

Database performance can be significantly impacted by over allocating memory.

Action / Repair:

Generate a collection of all of the running databases in the environment. This must be done on a per-node basis as databases
may not have instances running on all nodes. In a loop, connect to each database and query gv$parameter and ensure all
database instances are using USE_LARGE_PAGES = ONLY.

If any instance does not have USE_LARGE_PAGES = ONLY set, FAIL with a message similar to the following and stop
processing:

It is highly recommended that you use hugepages in the Linux environment (link to BP for USE_LARGE_PAGES). We have found
at least one instance without USE_LARGE_PAGES = ONLY and thus cannot with absolute accuracy calculate actual memory
utilization.
If all instances PASS the previous check, calculate PGA memory allocation in use by each database instance (this includes ASM
and MGMTDB instances).

When accessing the ASM instance, at this time PGA_AGGREGATE_LIMIT is not used, so in all cases for ASM retrieve the
PGA_AGGREGATE_TARGET
SQL> select value*3 from v$parameter where name='pga_aggregate_target';
VALUE*3
----------
1258291200

If the database version is 12.1.0.1 or higher, retrieve the PGA_AGGREGATE_LIMIT and add to PGA total. Note that in 12c,
PGA_AGGREGATE_LIMIT is derived from PGA_AGGREGATE_TARGET and defaults to the greater of 2 GB or 2 times the setting of
PGA_AGGREGATE_TARGET.
SQL> select value from v$parameter where name='pga_aggregate_limit';
VALUE
--------------------------------------------------------------------------------
3221225472

If the database version is earlier than 12.1.0.1, retrieve the PGA_AGGREGATE_TARGET * 3 and add to PGA total. Note
that PGA_AGGREGATE_TARGET can actually consume memory up to 3 times the setting for the parameter
SQL> select value*3 from v$parameter where name='pga_aggregate_target';
VALUE*3
--------------------------------------------------------------------------------
4831838208
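
The per-instance rule described above (ASM and pre-12.1.0.1 databases count PGA_AGGREGATE_TARGET * 3, databases at 12.1.0.1 or higher count PGA_AGGREGATE_LIMIT) can be sketched as a small shell helper. This is an illustration only, not part of exachk: the function names version_ge and pga_contribution and the "asm"/"db" labels are hypothetical, the parameter values are assumed to have been fetched already with the queries shown above (the target values are back-computed from the example outputs), and the version comparison assumes GNU "sort -V" is available.

```shell
# version_ge A B: succeeds when dotted version A >= B.
# Assumption: GNU coreutils "sort -V" (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# pga_contribution TYPE VERSION PGA_TARGET PGA_LIMIT
# TYPE is "asm" or "db" (hypothetical labels); values are in bytes.
pga_contribution() {
    if [ "$1" != "asm" ] && version_ge "$2" "12.1.0.1"; then
        echo "$4"                 # 12.1.0.1 or higher: count PGA_AGGREGATE_LIMIT
    else
        echo $(( $3 * 3 ))        # ASM or pre-12.1.0.1: count target * 3
    fi
}

# Sum the three example instances shown above.
pga_total=0
for contribution in \
    "$(pga_contribution asm 12.1.0.2 419430400 0)" \
    "$(pga_contribution db 12.1.0.2 1610612736 3221225472)" \
    "$(pga_contribution db 11.2.0.4 1610612736 0)"
do
    pga_total=$(( pga_total + contribution ))
done
echo "PGA total: $pga_total bytes"
```

Keeping the rule in one function means the ASM exception and the version cut-off are stated once, rather than repeated in each per-database loop iteration.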

Determine the amount of memory being used by HugePages

$ cat /proc/meminfo|grep Huge
HugePages_Total: 256000
HugePages_Free: 234587
HugePages_Rsvd: 67
HugePages_Surp: 0
Hugepagesize: 2048 kB

Memory being used by HugePages is HugePages_Total * Hugepagesize

$ bc -q
2048*1024*256000
536870912000
quit
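
The same figure can be derived directly from /proc/meminfo-style input instead of typing the numbers into bc. A minimal sketch, assuming the field names shown in the output above; the function name hugepages_bytes is hypothetical, and the sample input reuses the example values rather than reading a live system.

```shell
# hugepages_bytes: read /proc/meminfo-format text on stdin and print
# HugePages_Total * Hugepagesize in bytes.
hugepages_bytes() {
    awk '/^HugePages_Total:/ { total = $2 }
         /^Hugepagesize:/    { size_kb = $2 }
         END { printf "%.0f\n", total * size_kb * 1024 }'
}

# Sample input taken from the example output above. On a live node the
# function could instead be fed with: hugepages_bytes < /proc/meminfo
hugepages_bytes <<'EOF'
HugePages_Total: 256000
HugePages_Free: 234587
HugePages_Rsvd: 67
HugePages_Surp: 0
Hugepagesize: 2048 kB
EOF
```

With the sample values this prints 536870912000, the amount subtracted from MemTotal in the following step.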

Determine the memory available on the node for PGA

$ cat /proc/meminfo |grep MemTotal |awk '{print $2 * 1024}'
1083965984768

Subtract the memory allocated for HugePages (gathered above)

$ bc -q
1083965984768 - 536870912000
547095072768
quit

If the PGA database instance memory total is greater than the memory available on the node for PGA, provide a FAILURE message
similar to: "Database PGA allocation of <PGA memory total> is greater than the memory available for PGA <memory
available on the node for PGA> on this node. Please change memory allocations by reducing PGA_AGGREGATE_TARGET as
appropriate in one or more databases until PGA memory allocation is less than memory available for PGA."

This last item should be scripted so that we can provide it as part of the best practices page for customers to run outside of
exachk.
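
As a starting point for such a script, the final comparison could look like the following. This is a hedged sketch rather than the exachk implementation: the function names pga_available_bytes and check_pga_allocation are hypothetical, the SUCCESS wording is invented for illustration, and the inputs are assumed to have been gathered with the commands shown earlier in this check.

```shell
# pga_available_bytes MEMTOTAL_BYTES HUGEPAGES_TOTAL HUGEPAGESIZE_KB
# Memory left for PGA once the HugePages pool is set aside.
pga_available_bytes() {
    echo $(( $1 - $2 * $3 * 1024 ))
}

# check_pga_allocation PGA_TOTAL_BYTES AVAILABLE_BYTES
# Prints FAILURE (and returns 1) when the PGA total exceeds what is available.
check_pga_allocation() {
    if [ "$1" -gt "$2" ]; then
        echo "FAILURE: Database PGA allocation of $1 is greater than the memory available for PGA $2 on this node."
        return 1
    fi
    echo "SUCCESS: Database PGA allocation of $1 does not exceed the memory available for PGA $2 on this node."
}

# Example values from this check: MemTotal in bytes, HugePages_Total,
# Hugepagesize in kB, and the three example PGA figures.
available=$(pga_available_bytes 1083965984768 256000 2048)
pga_total=$(( 1258291200 + 3221225472 + 4831838208 ))
check_pga_allocation "$pga_total" "$available"
```

Separating the arithmetic from the comparison keeps each piece easy to test with known numbers before it is wired to live queries.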

Verify Cluster Verification Utility (CVU) Output Directory Contents Consume < 500MB of Disk Space

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | WARN | 03/20/15 | <Name> | Production | Exadata, SSC |

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
12.1.0.2+ | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5 | 11.2.3.3.1+ | Solaris - 11, Linux x86-64 UEK5.8 | N/A |

Benefit / Impact:

Beginning with Oracle version 12.1.0.2, the CVU is configured by default to run and generate an XML output file every 6 hours
(360 minutes). These files, and occasionally CVU command text output files, are stored in the output directory. If not monitored,
the files in the CVU output directory could eventually exhaust the available disk space. Currently, there is no effective purging of
these files, but this is expected to be addressed in a future release of CVU.

The benefit of verifying that the CVU output directory contents consume < 500MB of disk space is that an outage due to
depleted disk space is avoided. The impact of the verification is small; the impact of reducing disk space consumption depends
upon the chosen remediation strategy.

Risk:

Not verifying that the CVU output directory contents consume < 500MB of disk space increases the risk of a cluster instance
crash or other failures related to a file system running out of space.

Action / Repair:

To verify that the CVU output directory contents consume < 500MB of disk space, as the RDBMS home owner, and with the
environment properly set, execute the following command set on each database server:

DEFAULT_LOCATION="/u01/app/oracle/crsdata/@global/cvu/baseline/cvures"
if [ -r $DEFAULT_LOCATION ]
then
CVU_SPACE_USED=$(du -sm $DEFAULT_LOCATION | awk '{ print $1}')
if [ $CVU_SPACE_USED -le "500" ]
then echo -e "SUCCESS: Automated CVU check output consumes <= 500MB of disk space: "$CVU_SPACE_USED"MB"
else echo -e "WARNING: Automated CVU check output consumes > 500MB of disk space: "$CVU_SPACE_USED"MB"
fi
else
echo -e "WARNING: There seems to be some issue with $DEFAULT_LOCATION"
fi

The expected output should be similar to:

SUCCESS: Automated CVU check output consumes <= 500MB of disk space: 224MB

If the output is "WARNING", these are the recommended corrective options:

1) Manually purge the accumulated files from all database servers on a schedule that suits your retention and space usage
requirements. Do not just delete all files.

2) Lengthen the interval at which the automated CVU check executes:

As the RDBMS home owner, with the environment properly set, and with CVU enabled and running, execute the following
command set on a database server:

[oracle@randomadm03 ~]$ srvctl modify cvu -checkinterval 720

[oracle@randomadm03 ~]$ srvctl config cvu
CVU is configured to run once every 720 minutes
CVU is enabled.
CVU is individually enabled on nodes:
CVU is individually disabled on nodes:

NOTE: the "modify" command does not return any output confirmation. Follow up with the "config" command.
NOTE: The interval change takes effect without restarting the CVU.
NOTE: The CVU process only runs on one database server, but the files accumulate on all database servers.

For additional information see: "Oracle® Real Application Clusters Administration and Deployment Guide 12c
Release 1 (12.1) E48838-10"
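
For option 1, one way to honor a retention period rather than deleting everything is to select only files older than a chosen age. The following is a sketch under stated assumptions: the function name list_cvu_purge_candidates and the 30-day retention are hypothetical, the demonstration runs against a scratch directory rather than the real CVU output directory, and the function only lists candidates so the result can be reviewed before anything is removed.

```shell
# list_cvu_purge_candidates DIR DAYS
# Print regular files under DIR last modified more than DAYS days ago.
list_cvu_purge_candidates() {
    find "$1" -type f -mtime +"$2" -print
}

# Demonstration against a scratch directory (not the real CVU location).
demo_dir=$(mktemp -d)
touch "$demo_dir/recent.xml"
touch -t 200001010000 "$demo_dir/old.xml"   # back-date one file to 1 Jan 2000

# Only the back-dated file is listed as a purge candidate.
list_cvu_purge_candidates "$demo_dir" 30
```

Listing first and deleting as a separate, deliberate step fits the warning above against simply deleting all files.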

Verify active system values match those defined in configuration file "cell.conf"

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
N/A | WARN | 03/01/15 | <Name> | Production | BDA, Exadata, Exalogic, Exalytics, SSC, ZDLRA |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8 | 11.2.2.4.2+ | Linux x86-64 | exachk 12.1.0.2.4 |

Benefit / Impact:

The Impact of verifying that active system values match those defined in configuration file "cell.conf" is minimal.

Changing the options defined in configuration file "cell.conf" directly in the active kernel may impact availability.

Risk:

A run time kernel configuration that does not match the values defined in the configuration file "cell.conf" may result in an
outage or unexpected issues during the next boot.

Action / Repair:

Note: Modifications to the Oracle Exadata Storage Server hardware or software are not supported. Only the
documented network interfaces on the Oracle Exadata Storage Server should be used for all connectivity including
management and storage traffic. Additional network interfaces should not be used.

NOTE: Always follow the recommended procedures to make changes on an Exadata system, and use a reboot to
verify that the changes are persistent in order to avoid unexpected issues during a reboot.

NOTE: ipconf validation restarts the cellwall service, which resets the storage server to the default configuration. If
manual changes have been made, even though such configuration is not permitted, the manual configuration will
be lost when the cellwall service is restarted.

NOTE: The "ipconf" command performs a number of cross-checks. The length of time to execute varies by Exadata
version, environment complexity, and system load. Newer versions of Exadata software have longer execution
times due to more cross checking, as do more complex environments. Internal testing has taken up to 60 seconds.
Please make sure the command is truly stuck before terminating.

To verify that active system values match those defined in configuration file "cell.conf", as the "root" userid execute the
following command set only on each storage server:

IPCONF_RAW_OUTPUT=$(/opt/oracle.cellos/ipconf -verify -semantic -at-runtime -check-consistency -verbose 2>/dev/nu

IPCONF_RESULT=$(echo "$IPCONF_RAW_OUTPUT" | egrep "Consistency check PASSED" | wc -l);
IPCONF_SUMMARY=$(echo "$IPCONF_RAW_OUTPUT" | tail -1);
if [ $IPCONF_RESULT = "1" ]
then
echo -e "SUCCESS: $IPCONF_SUMMARY"
else
echo -e "FAILURE: $IPCONF_SUMMARY\n"
echo -e "`echo -e "$IPCONF_RAW_OUTPUT" | grep FAILED`"
fi;

The expected output is:

SUCCESS: [Info]: Consistency check PASSED

If the result is not as expected, the detailed output data will be echoed back after the "FAILURE" message. For example:

FAILURE: Info. Consistency check FAILED

ILOM timezone 00:21:28:A5:1B:BC found in /usr/share/zoneinfo : FAILED

ILOM timezone America/Denver matches 00:21:28:A5:1B:BC from Exadata configuration file : FAILED
Info. Consistency check FAILED

Review the data and take corrective action based upon the specific configuration items that did not pass.

Verify that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED"

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | WARN | 06/09/15 | <Name> | Production | Exadata-User Domain, Exadata-Physical, SSC, Exalogic |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
12.1.0.2+ | CRS | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2 | 11.2+ | Solaris - 11, Linux x86-64 el5uek, Linux x86-64 el6uek | exachk 12.1.0.2.4 |

Benefit / Impact:

Verifying that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED" avoids node eviction and potential cluster crashes
due to insufficient resources, and it helps avoid a possible denial of service attack.

The impact of verifying that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED" is minimal. The impact of
correcting CRS_LIMIT_NPROC should include a restart of the clusterware to ensure the setting is as expected after a restart.

Risk:

Without verifying that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED" there is a risk of node eviction and
potential cluster crashes due to insufficient resources, and a possible denial of service attack avenue.

Action / Repair:

To verify that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED", execute the following command set as the grid
owner userid with the environment properly set on each of the database servers or each user domain of a virtualized
environment:

unset CONFG_CRS_LIMIT_NPROC;
export MIN_VAL=65535;
CONFG_CRS_LIMIT_NPROC=$(grep -w CRS_LIMIT_NPROC $CRS_HOME/crs/install/s_crsconfig_`hostname -s`_env.txt|grep -v ^
if [ `echo $CONFG_CRS_LIMIT_NPROC | tr -s '[:upper:]' '[:lower:]'` = "unlimited" ]
then
echo "WARNING: CRS_LIMIT_NPROC should be set to a value greater than or equal to $MIN_VAL, but not \"UNLIMITED\"
elif [ $CONFG_CRS_LIMIT_NPROC -ge $MIN_VAL ]
then
echo "SUCCESS: CRS_LIMIT_NPROC is set to a value greater than or equal to $MIN_VAL, but not \"UNLIMITED\": $CONF
else
echo "FAILURE: CRS_LIMIT_NPROC is set to a value less than $MIN_VAL: $CONFG_CRS_LIMIT_NPROC.";
fi;

The expected output should be:

SUCCESS: CRS_LIMIT_NPROC is set to a value greater than or equal to 65535, but not "UNLIMITED": 65536.

Example of a FAILURE result:

FAILURE: CRS_LIMIT_NPROC is set to a value less than 65535: 16384.

Example of a WARNING result:

WARNING: CRS_LIMIT_NPROC should be set to a value greater than or equal to 65535, but not "UNLIMITED": UnliMITed.

If the result is not "SUCCESS", determine the root cause and correct the cause.
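
The three outcomes above come down to a small classification of the configured value. A minimal sketch of that logic, assuming the value has already been extracted from the environment file; the function name classify_crs_limit_nproc is hypothetical, and the extra branch for empty or non-numeric values is an addition that is not present in the command set above.

```shell
MIN_VAL=65535

# classify_crs_limit_nproc VALUE: print SUCCESS, FAILURE, or WARNING.
classify_crs_limit_nproc() {
    value_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$value_lc" in
        unlimited)   echo "WARNING" ;;   # any letter case of UNLIMITED
        ''|*[!0-9]*) echo "FAILURE" ;;   # empty or non-numeric (added guard)
        *) if [ "$1" -ge "$MIN_VAL" ]; then
               echo "SUCCESS"
           else
               echo "FAILURE"
           fi ;;
    esac
}

# The three example values shown above.
for v in 65536 16384 UnliMITed; do
    echo "$v -> $(classify_crs_limit_nproc "$v")"
done
```

Checking for "unlimited" before attempting a numeric comparison avoids the integer-expression error that a non-numeric value would otherwise cause in the test command.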

For example, to correct the "FAILURE" example provided, as the owner userid of the grid infrastructure on the database server
or user domain that produced the failure, edit with the "vi" editor the file
$CRS_HOME/crs/install/s_crsconfig_`hostname -s`_env.txt and add this line:

CRS_LIMIT_NPROC=65535

as a minimum acceptable value. The limit name is typically in upper case. If thorough testing indicates a larger value should be
used, the value can be set to any value within the recommended range. After you have closed the file and verified the value,
restart the clusterware.

Verify TCP Segmentation Offload (TSO) is set to off

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 05/21/15 | <Name> | Production | Exadata-Physical, Exadata-User Domain, Exalogic |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2 | 12.1.2.1.0, 12.1.2.1.1 | Linux x86-64 el6uek | exachk 12.1.0.2.4 |

Benefit / Impact:

The Impact of verifying that the TSO option for IB and bonded IB interfaces is set to "off" is minimal. With the chosen
implementation (of updating a configuration file) to make the setting effective a reboot is required.

Risk:

If the TSO option is not set to "off" cluster node evictions can occur.

NOTE: Starting with 12.1.2.1.2, the TSO function is disabled by the kernel. This check does not apply to Exadata
releases other than those mentioned.

Action / Repair:

To verify that the TSO option is set to "off" in the run time configuration, execute the following command as the "root" userid on
all database servers where the exadata image version is >= 12.1.2.1.0 and <= 12.1.2.1.1 on Exadata physical and
domU deployments (not dom0)

get_ib_interfaces ()
{
local -i ret_val=0
local interface_list=''
if [ ! -e /opt/oracle.cellos/ORACLE_CELL_NODE ]; then
ActiveInterfaces=$(/sbin/ip link show up | awk '/[\t ]+bondib/ {print $2}' | sed -e 's/:$//' | grep -v eth | sor
ActiveInterfaces1=$(/sbin/ip link show up | awk '/[\t ]+ib/ {print $2}' | sed -e 's/:$//' | grep -v eth | sort)
ActiveInterfaces2=$(/sbin/ip link show up | awk '/[\t ]+bond/ {print $2}' | sed -e 's/:$//' | grep -v eth | sort
for Interface in ${ActiveInterfaces} ${ActiveInterfaces1} ${ActiveInterfaces2}; do
interface_list="${Interface} ${interface_list}"
done
fi
interface_list=`echo $interface_list| xargs -n1 | sort -u | xargs`

echo "$interface_list"
}
gettso ()
{
local tso=UNDEFINED
local -i ret_val=0
for Interface in `get_ib_interfaces | tail -1`; do
if [ -z "$Interface" ]; then
echo "`date '+%F %T %z'` [INFO] No ib interfaces need this work around."
else
tso=$(/sbin/ethtool --show-offload $Interface | awk '(/tcp-segmentation-offload:/){print $NF}')
if [ $tso == 'off' ]; then
echo -e "SUCCESS: ${Interface}: tcp-segmentation-offload: set to off"
else
echo -e "FAILURE: ${Interface}: tcp-segmentation-offload: not set to off"
ret_val=1
fi
fi
done
return $ret_val
}

gettso

The output should be similar to:

SUCCESS: bondib0: tcp-segmentation-offload: set to off

SUCCESS: ib0: tcp-segmentation-offload: set to off
SUCCESS: ib1: tcp-segmentation-offload: set to off

- OR -

2015-05-27 10:49:01 -0500 [INFO] No ib interfaces need this work around.

If the output is not as expected, add the option ETHTOOL_OPTS="-K <ibdev> tso off" to the configuration files. Shut down the
stack, then as root execute "ifdown <ibdev>" followed by "ifup <ibdev>" (where <ibdev> is ib0, ib1 or
bondib0). Then restart the stack. For the majority of two socket database servers, these files are:

/etc/sysconfig/network-scripts/ifcfg-bondib0
/etc/sysconfig/network-scripts/ifcfg-ib0
/etc/sysconfig/network-scripts/ifcfg-ib1

NOTE: For older compute nodes, the file is: /etc/sysconfig/network-scripts/ifcfg-bond0

NOTE: Eight socket database servers may have additional bonded interfaces in use, with additional configuration
files.
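
After editing, it can be useful to confirm that each interface configuration file actually carries the option. A sketch, assuming the option appears on a single line in the form given above; the function name has_tso_off is hypothetical, and the demonstration uses a scratch file with sample contents rather than a real ifcfg file.

```shell
# has_tso_off FILE IBDEV: succeeds if FILE contains the line
# ETHTOOL_OPTS="-K IBDEV tso off" (leading/trailing blanks allowed).
has_tso_off() {
    grep -Eq -- "^[[:space:]]*ETHTOOL_OPTS=\"-K $2 tso off\"[[:space:]]*$" "$1"
}

# Scratch file standing in for /etc/sysconfig/network-scripts/ifcfg-ib0.
demo_file=$(mktemp)
printf 'DEVICE=ib0\nONBOOT=yes\nETHTOOL_OPTS="-K ib0 tso off"\n' > "$demo_file"

if has_tso_off "$demo_file" ib0; then
    echo "SUCCESS: ib0: ETHTOOL_OPTS tso off is present"
else
    echo "FAILURE: ib0: ETHTOOL_OPTS tso off is missing"
fi
```

This only checks the persistent configuration; the run time state is still verified with the gettso function above.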

Check alerthistory for stateful alerts not cleared

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 06/19/19 | <Name> | Production | Exadata - Physical, Exadata - Management Domain | ALL | 27848031 - exachk, 26651210 - exachk, 21299782 - exachk

DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 19.3.0 | N/A

Benefit / Impact:

There are two types of alerts maintained in the alerthistory of a storage or database server, stateful and stateless.

A stateful alert is usually associated with a transient condition, often hardware related, and it will clear itself after that transient
condition is corrected. These alerts age out of the alerthistory after 7 days (default time) once they are set to clear.

The benefit of checking for stateful alerts that have not been cleared is faster problem resolution. The impact of correcting any
stateful alert that has not been cleared depends upon each individual alert.

Risk:

Failure to investigate a stateful alert that has not been cleared may result in significant impact, which varies by the particular
alert.

Action / Repair:

To verify there are no stateful alerts that have not been cleared, as the root userid on each storage and database server execute
the following commands:

unset IMAGE_VERSION
unset NODE_TYPE
unset COMMAND_NAME
unset NAME_ARRAY
unset INDIVIDUAL_NAME
unset SID
unset SEVERITY
unset MESSAGE
unset ACTION
unset OUTPUT_ARRAY

if [ $(egrep -i node.type /opt/oracle.cellos/cell.conf | grep -i db | wc -l) -eq 1 ]
then NODE_TYPE=db
else
NODE_TYPE=cell
fi
IMAGE_VERSION=$(imageinfo -version |tr -d '.'|cut -c1-6)
if [ $NODE_TYPE = "cell" ]
then
COMMAND_NAME=cellcli
else
if [ $IMAGE_VERSION -ge 121211 ]
then COMMAND_NAME=dbmcli
fi;
fi;
if [ -n "$COMMAND_NAME" ]
then
NAME_ARRAY=$($COMMAND_NAME -e "list alerthistory attributes name where alerttype=stateful and endtime=null" | s
if [ -z "$NAME_ARRAY" ]
then
echo -e "SUCCESS: there are no stateful alerts that have not been cleared."
else
for INDIVIDUAL_NAME in $NAME_ARRAY
do
NAME_RECORD=$($COMMAND_NAME -e "list alerthistory attributes alertsequenceid,severity,alertMessage,alertAct
SID=$(echo "$NAME_RECORD" | cut -f2 | tr -s " " | sed -e 's/^[[:space:]]*//')
SEVERITY=$(echo "$NAME_RECORD" | cut -f3 | tr -s " " | sed -e 's/^[[:space:]]*//')
MESSAGE=$(echo "$NAME_RECORD" | cut -f4 | tr -s " " | sed -e 's/^[[:space:]]*//')
ACTION=$(echo "$NAME_RECORD" | cut -f5 | tr -s " " | sed -e 's/^[[:space:]]*//')
OUTPUT_ARRAY+=$(echo -e "\n";echo -e "SID:\t\t$SID";echo -e "NAME:\t\t$INDIVIDUAL_NAME";echo -e "SEVERITY:\
done
echo -e -n "FAILURE: there are one or more stateful alerts that have not been cleared. Details:"
echo -e "${OUTPUT_ARRAY[@]}"
fi
else
echo "alerthistory is not available on database servers at image versions below 12.1.2.1.1: $NODE_TYPE $IMAGE_V
fi

The output should be similar to:

SUCCESS: there are no stateful alerts that have not been cleared.

- OR -

alerthistory is not available on database servers at image versions below 12.1.2.1.1: db 112322
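
The number at the end of that message (112322 here) is the image version with the dots removed and truncated to six characters, which is the value the command set compares against 121211. A short sketch of that transformation follows; the function name image_version_key and the full version strings are illustrative assumptions, and the comparison is only meaningful when the leading version components have the same digit widths as in 12.1.2.1.1.

```shell
# image_version_key VERSION: strip dots and keep the first six characters,
# mirroring "imageinfo -version | tr -d '.' | cut -c1-6" from the command set.
image_version_key() {
    printf '%s\n' "$1" | tr -d '.' | cut -c1-6
}

# Hypothetical full version strings for illustration.
for v in 11.2.3.2.2.130109 12.1.2.1.1.150316.2; do
    key=$(image_version_key "$v")
    if [ "$key" -ge 121211 ]; then
        echo "$v -> $key: dbmcli is used on a database server"
    else
        echo "$v -> $key: below 12.1.2.1.1, alerthistory not available on a database server"
    fi
done
```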

Example of a FAILURE result:

FAILURE: there are one or more stateful alerts that have not been cleared. Details:

SID: 1
NAME: 1_2
SEVERITY: critical
MESSAGE: A IO subsystem component is suspected of causing a fault with a 100% certainty. Component Name :
ACTION: For additional information, please refer to http://www.sun.com/msg/SPX86-8003-QH This alert occur

SID: 2
NAME: 2_1
SEVERITY: critical
MESSAGE: A processor component is suspected of causing a fault with a 100% certainty. Component Name : /SY
ACTION: For additional information, please refer to http://www.sun.com/msg/SPX86-8003-K5 This alert occur

If the output is not as expected, examine the full details for each alert that has not been cleared and follow the
recommendations.

Check alerthistory for non-test open stateless alerts

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 06/19/19 | Vern Wagman | Production | Exadata - Physical, Exadata - Management Domain | ALL | 27848031 - exachk, 26651210 - exachk, 21299794 - exachk

DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 19.3.0 | N/A

Benefit / Impact:

There are two types of alerts maintained in the alerthistory of a storage or database server, stateful and stateless.

A stateless alert is not cleared automatically. It will not age out of the alerthistory until the alert is manually investigated and
the "examinedby" field is manually set to a non-null value, typically the name of the person who reviewed the stateless alert and
corrected or otherwise acted upon the information provided.

The benefit of checking for non-test open stateless alerts is faster problem resolution. The impact of correcting any stateless
alert that has not been cleared depends upon each individual alert.

Risk:

Failure to investigate a stateless non-test alert that has not been cleared may result in significant impact, which varies by the
particular alert.

Action / Repair:

To verify there are no non-test open stateless alerts, as the root userid on each storage and database server execute the
following commands:

unset IMAGE_VERSION
unset NODE_TYPE
unset COMMAND_NAME
unset NAME_ARRAY
unset INDIVIDUAL_NAME
unset SID
unset SEVERITY
unset MESSAGE
unset ACTION
unset OUTPUT_ARRAY
if [ $(egrep -i node.type /opt/oracle.cellos/cell.conf | grep -i db | wc -l) -eq 1 ]
then NODE_TYPE=db
else
NODE_TYPE=cell
fi
IMAGE_VERSION=$(imageinfo -version |tr -d '.'|cut -c1-6)
if [ $NODE_TYPE = "cell" ]
then
COMMAND_NAME=cellcli
else
if [ $IMAGE_VERSION -ge 121211 ]
then COMMAND_NAME=dbmcli
fi;
fi;
if [ -n "$COMMAND_NAME" ]
then
NAME_ARRAY=$($COMMAND_NAME -e list alerthistory attributes name where alerttype=stateless and examinedby=\'\' |
if [ -z "$NAME_ARRAY" ]
then
echo -e "SUCCESS: there are no non-test open stateless alerts."
else
for INDIVIDUAL_NAME in $NAME_ARRAY
do
NAME_RECORD=$($COMMAND_NAME -e "list alerthistory attributes alertsequenceid,severity,alertMessage,alertAct
SID=$(echo "$NAME_RECORD" | cut -f2 | tr -s " " | sed -e 's/^[[:space:]]*//')
SEVERITY=$(echo "$NAME_RECORD" | cut -f3 | tr -s " " | sed -e 's/^[[:space:]]*//')
MESSAGE=$(echo "$NAME_RECORD" | cut -f4 | tr -s " " | sed -e 's/^[[:space:]]*//')
ACTION=$(echo "$NAME_RECORD" | cut -f5 | tr -s " " | sed -e 's/^[[:space:]]*//')
OUTPUT_ARRAY+=$(echo -e "\n";echo -e "SID:\t\t$SID";echo -e "NAME:\t\t$INDIVIDUAL_NAME";echo -e "SEVERITY:\
done
echo -e -n "FAILURE: there are one or more non-test open stateless alerts that have not been cleared. Details
echo -e "${OUTPUT_ARRAY[@]}"
fi
else
echo "alerthistory is not available on database servers at image versions below 12.1.2.1.1: $NODE_TYPE $IMAGE_V
fi

The output should be similar to:

SUCCESS: there are no non-test open stateless alerts.

- OR -

alerthistory is not available on database servers at image versions below 12.1.2.1.1: db 112322

If the output is not as expected, examine the full details for each name that has not been cleared and follow the
recommendations.

Example of a FAILURE result:

FAILURE: there are one or more non-test open stateless alerts that have not been cleared. Details:

SID: 1
NAME: 1
SEVERITY: critical
MESSAGE: Critical interrupt detected: . Power cycle forced.
ACTION: Informational. Diagnostic package is attached. It is also accessible at /opt/oracle/dbserver/dbms

When the underlying issue for a given name is resolved, manually set the "examinedby" field with a command similar to the
following (command name is either cellcli or dbmcli, depending upon whether a storage or database server is involved):

CellCLI> alter alerthistory 1 examinedby="jdoe"

Alert 1 successfully altered

Where jdoe is the name of the person who verified the cause of the stateless alert no longer exists, and the number is the name
of the stateless alert. Note that double quotes are used around the value to be set, but not the name of the stateless alert.
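
When several alerts have been reviewed, the same command can be generated for each alert name. The sketch below only prints the commands for review rather than running them; the function name print_examinedby_commands is hypothetical, and the command form is taken from the example above together with the -e option used in the command set.

```shell
# print_examinedby_commands CLI REVIEWER NAME...
# CLI is cellcli or dbmcli; each NAME is the name of a stateless alert.
print_examinedby_commands() {
    cli=$1
    reviewer=$2
    shift 2
    for name in "$@"; do
        # Double quotes around the reviewer, none around the alert name.
        printf "%s -e 'alter alerthistory %s examinedby=\"%s\"'\n" "$cli" "$name" "$reviewer"
    done
}

# Example: alerts 1 and 4 (illustrative names) reviewed by jdoe on a storage server.
print_examinedby_commands cellcli jdoe 1 4
```

Printing instead of executing leaves a deliberate manual step between review and marking an alert as examined.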

Verify clusterware state is "Normal"

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 07/29/15 | <Name> | Production | Exadata-Physical, Exadata-User Domain, SSC, ZDLRA |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
11.2.+, 12.1.+ | ASM | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2 | ALL | N/A | exachk 12.1.0.2.5 |

Benefit / Impact:

The impact of verifying that the clusterware state is "Normal" is minimal. The impact of returning the clusterware state to
normal varies depending upon the clusterware state found, and the root cause that led to the found clusterware state.

NOTE: The clusterware state, unless an upgrade or patching exercise is in progress, should always be "Normal".

Risk:

Outside of an active upgrade or patching exercise, having cluster nodes with clusterware states other than "Normal" can lead to
problems with disk rebalances, dropping griddisks, and other maintenance operations.

NOTE: The following operations cannot be performed while the clusterware is in some form of "Rolling" state:

User invoked disk operations (ex: add, drop, replace, online, offline, undrop, resize, expel)
Create/Drop Diskgroup
Rebalance
Voting File Creation/Deletion
Advancing compatibility
SP file parameter add/change/remove
Create/Drop ADVM volume

NOTE: Outside of an active upgrade or patching exercise, having different cluster nodes report a mix of states, particularly
"In Rolling Patch" and "In Rolling Upgrade" is an indication of an incomplete or incorrect upgrade or patching exercise!

Action / Repair:

To verify the clusterware state, execute the following command set as the owner of the clusterware home with the environment
properly set to access the ASM instance on each database server:

unset CLUSTER_STATE;
CLUSTER_STATE=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set head off lines 80 feedback off timing off serveroutput on
SELECT SYS_CONTEXT('SYS_CLUSTER_PROPERTIES', 'CLUSTER_STATE') FROM DUAL;
exit
EOF
);
if [ `echo $CLUSTER_STATE | wc -w` = 1 ]
then
if [ $CLUSTER_STATE = "Normal" ]
then
echo -e SUCCESS: the clusterware state is: $CLUSTER_STATE;
else
echo -e FAILURE: the clusterware state is: $CLUSTER_STATE;
fi;
else
echo -e FAILURE: the clusterware state is: $CLUSTER_STATE;
fi;

The expected output should be:

SUCCESS: the clusterware state is: Normal

If the output is not as expected, investigate the root cause and correct the condition.
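
Because a mix of states across nodes points to an incomplete upgrade or patching exercise, it can help to compare the results from every database server side by side. A minimal sketch, assuming the per-node states have already been gathered into lines of the form "<node> <state>" (a hypothetical format); the function name summarize_cluster_states and the node names are illustrative.

```shell
# summarize_cluster_states: read "<node> <state>" lines on stdin and report
# whether every node is in the Normal state.
summarize_cluster_states() {
    awk '{
             sub(/^[^ ]+ +/, "")          # drop the node name, keep the state
             total++
             if ($0 == "Normal") normal++
             else other = other " [" $0 "]"
         }
         END {
             if (total > 0 && normal == total)
                 print "SUCCESS: all " total " nodes report Normal"
             else
                 print "FAILURE: non-Normal states reported:" other
         }'
}

# Illustrative input: one node still in a rolling patch state.
summarize_cluster_states <<'EOF'
dbnode01 Normal
dbnode02 In Rolling Patch
EOF
```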

Verify the grid Infrastructure management database (MGMTDB) does not use hugepages

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 11/02/15 | <Name> | Production | Exadata - Physical, Exadata - User Domain, SSC |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
>= 12.1 | MGMTDB | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8 | 11.2+ | Linux x86-64 el5uek, Linux x86-64 el6uek | exachk 12.1.0.2.6-ish |

Benefit / Impact:

MGMTDB can start on any node within the cluster which makes the configuration and allocation of hugepages more difficult.
Verifying that MGMTDB doesn't use hugepages helps to avoid instance start failures because not enough huge pages are
available.

The impact of verifying MGMTDB does not use hugepages is minimal. Configuring MGMTDB to not use hugepages requires an
instance restart.

Risk:

If MGMTDB is configured to use hugepages and it starts on a database server where MGMTDB's use of hugepages has not been
considered, other database instances may fail to start because not enough hugepages are available, or MGMTDB itself may not
acquire hugepages when it fails over to a different database server.

Action / Repair:

To verify MGMTDB does not use hugepages, as the root userid on the database server where MGMTDB is running,
execute the following command set:

#!/bin/bash
# Main
v_pmon_pid=$(ps -ef | grep pmon | grep '\-MGMTDB' | awk ' { print $2 } ')
# If we have a value continue, else exit - MGMTDB may not be running here.
if [ "${v_pmon_pid}" != '' ]
then
# Check value we found is a number
expr ${v_pmon_pid} + 1 > /dev/null 2>&1
if [ $? -eq 0 ]
then
v_hugep_count=$(grep -a -s huge /proc/${v_pmon_pid}/numa_maps 2>/dev/null | grep -a -s dirty | wc -l)
if [ ${v_hugep_count} -gt 0 ]
then
v_logger_msg="MGMTDB should not be running with hugepages"
echo -e "\nFAILURE: ${v_logger_msg}"
else
v_logger_msg="MGMTDB is not running with hugepages"
echo -e "\nSUCCESS: ${v_logger_msg}"
fi
else
v_logger_msg="Unable to find pmon pid for MGMTDB unable to detect if MGMTDB runs with hugepages or not"
echo -e "\nFAILURE: ${v_logger_msg}"
fi
fi

The expected output will be similar to:

SUCCESS: MGMTDB is not running with hugepages

If the output is 'FAILURE', execute the following steps to deconfigure hugepages for MGMTDB as the owner of the Grid
Infrastructure, with ORACLE_HOME set to the Grid home and ORACLE_SID set to -MGMTDB:

[oracle@dbm01 ~]$ sqlplus / as sysdba

SQL> alter system set use_large_pages=FALSE scope=spfile;
[oracle@dbm01 ~]$ srvctl stop mgmtdb -o immediate
[oracle@dbm01 ~]$ srvctl start mgmtdb

Verify the "localhost" alias is pingable

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 11/02/15 | <Name> | Production | Exadata - Physical, Exadata - User Domain, Exadata - Management Domain, SSC |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8 | 11.2+ | Linux x86-64 el5uek, Linux x86-64 el6uek, Solaris 11 | exachk 12.1.0.2.6-ish |

Benefit / Impact:

Many scripts and programs, including patching utilities rely on the "localhost" alias. Verifying the "localhost" alias is pingable
helps avoid operational issues or incorrect patch applications.

The impact of verifying the "localhost" alias is pingable is minimal. Changing the "localhost" alias definition does not require a
reboot or network restart.

Risk:

If the "localhost" alias is not pingable operational issues or incorrect patch applications may result.

Action / Repair:

To verify the "localhost" alias is pingable, as the "root" userid on each storage server, database server, and InfiniBand switch,
execute the following command set (IPv4 or IPv6 compatible):

#!/bin/bash

v_cmd[0]="ping -c1 localhost"

v_index=0

v_netw_ipv6=$(grep ^NETWORKING_IPV6 /etc/sysconfig/network | awk -F "=" ' { print $2 } ')

if [ "${v_netw_ipv6}" == "yes" ]
then
# ipv6 detected also check for ip6-localhost
v_cmd[1]="ping6 -c1 ip6-localhost"
fi

# Main

while [ $v_index -lt ${#v_cmd[*]} ]

do
v_localhostname=$(echo ${v_cmd[$v_index]} | awk ' { print $3 } ')

${v_cmd[$v_index]} > /dev/null 2>&1

if [ $? != 0 ]
then
v_logger_msg="${v_localhostname} is not pingable by name"
echo -e "\nFAILURE: ${v_logger_msg}"
else
v_logger_msg="${v_localhostname} is pingable by name"
echo -e "\nSUCCESS: ${v_logger_msg}"
fi
v_index=$((v_index+1))
done

The expected output should be similar to:

SUCCESS: localhost is pingable by name

- OR -

SUCCESS: ip6-localhost is pingable by name

If the output is 'FAILURE' then manually edit /etc/hosts and test to make sure the "localhost" alias definition is a
valid entry.

IPv4 example:

127.0.0.1 localhost.localdomain localhost

IPv6 example:

127.0.0.1 localhost.localdomain localhost

::1 ip6-localhost.localdomain ip6-localhost
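
A quick way to confirm the entry before re-running the ping test is to look for the alias in the hosts file itself. A sketch, assuming the usual hosts file layout of an address followed by one or more names; the function name has_host_alias is hypothetical, and the demonstration reads a scratch file populated with the IPv6 example above rather than /etc/hosts.

```shell
# has_host_alias FILE ADDRESS ALIAS
# Succeeds if FILE maps ADDRESS to ALIAS on a non-comment line.
has_host_alias() {
    awk -v addr="$2" -v alias="$3" '
        /^[ \t]*#/ { next }                   # skip comment lines
        $1 == addr {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break          # stop at a trailing comment
                if ($i == alias) found = 1
            }
        }
        END { exit(found ? 0 : 1) }' "$1"
}

# Scratch file standing in for /etc/hosts.
demo_hosts=$(mktemp)
printf '127.0.0.1 localhost.localdomain localhost\n::1 ip6-localhost.localdomain ip6-localhost\n' > "$demo_hosts"

for pair in "127.0.0.1 localhost" "::1 ip6-localhost"; do
    set -- $pair
    if has_host_alias "$demo_hosts" "$1" "$2"; then
        echo "SUCCESS: $2 is defined for $1"
    else
        echo "FAILURE: $2 is not defined for $1"
    fi
done
```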

Verify bundle patch version installed matches bundle patch version registered in database

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 11/04/15 | <Name> | Production | Exadata, Exalogic, SSC |

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
>= 12.1.0.2 | ALL | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2, X5-8 | 11.2.x + | Linux, Solaris | exachk 12.1.0.2.6 |

Benefit / Impact:

Crosschecking the software bundle patch version installed with the bundle patch registered in the database to make sure they
match ensures software correctness and stability. If a bundle patch is being installed in a Data Guard configuration in a standby-
first manner where the SQL portion of the bundle patch is not installed inside the database until the primary and all standby
software homes have the same version installed, then this crosscheck is expected to fail until both the binary and SQL portion of
the bundle patch application is fully installed.

Risk:

Incomplete bug fixes, software instability, and unexpected behavior

Action / Repair:

opatch_bp=$($ORACLE_HOME/OPatch/opatch lspatches 2>/dev/null|grep -iwv javavm|grep -wi database|head -1|awk -F';' '{print $1}');
database_bp_status=$(echo -e "set heading off feedback off timing off \n select ACTION, STATUS from (select * from dba_registry_sqlpatch where PATCH_ID = $opatch_bp order by action_time desc) where rownum=1;"|$ORACLE_HOME/bin/sqlplus -s " / as sysdba" | sed -e '/^ *$/d');
# collapse whitespace so the result compares as a single "ACTION STATUS" string
database_bp_status=$(echo $database_bp_status);
if [ "$database_bp_status" == "APPLY SUCCESS" ];
then
echo "SUCCESS: Bundle patch installed in the database matches the software home and is installed successfully.";
else
echo "FAILURE: Bundle patch installed in the database does not match the software home, or is installed with errors.";
fi;

The output should be similar to:

SUCCESS: Bundle patch installed in the database matches the software home and is installed successfully.

If FAILURE is reported, then investigate and correct the discrepancy.

NOTE: For versions less than 12.1.0.2, please see this archived best practice: Verify bundle patch version installed matches
bundle patch version registered in database
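The first step of the check above extracts a patch ID from the "opatch lspatches" listing. The following sketch isolates that step so it can be tried on captured output. It assumes one "patch_id;description" entry per line, which is what the awk -F';' call above implies. The function names are hypothetical, and the match is on the description field rather than on whole words as in the original grep -w calls.

```shell
#!/bin/bash
# first_database_patch_id  (hypothetical helper)
# Reads "<patch_id>;<description>" lines on stdin and prints the first patch ID
# whose description mentions "database" but not "javavm".
first_database_patch_id() {
    awk -F';' '
        tolower($2) ~ /javavm/   { next }
        tolower($2) ~ /database/ { print $1; exit }
    '
}

# is_numeric_patch_id VALUE  (hypothetical helper)
# Guards against an empty or malformed ID before it is placed into the
# dba_registry_sqlpatch query.
is_numeric_patch_id() {
    [[ "$1" =~ ^[0-9]+$ ]]
}
```

For example, "$ORACLE_HOME/OPatch/opatch lspatches | first_database_patch_id" prints the ID that the check would use. If is_numeric_patch_id rejects the value, the query would fail for a reason unrelated to the patch status.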

Verify database is not in DST upgrade state

Priority  Alert Level  Date        Owner   Status  Engineered System
Critical  FAIL         10/19/2015  <Name>  Review  Exadata - Physical, Exadata - User Domain,
                                                   Exadata - Management Domain, SSC, ZDLRA

DB Version  DB Role  Engineered System Platform             Exadata Version  OS & Version         Validation Tool Version
11.2+       All      X2-2(4170), X2-2, X2-8, X3-2, X3-8,    11.2.x +         Linux x86-64 el5uek  exachk 12.1.0.2.6
                     X4-2, X5-2, X5-8                                        Linux x86-64 el6uek
                                                                             Solaris 11

Benefit / Impact:

When the DB timezone is in upgrade mode or inconsistent mode, I/Os issued from DB nodes to cell nodes will not go through
smart scan and hence block I/O or passthru will take place instead. This results in cell nodes shipping all blocks rather than
blocks of interest (filtered) to the database for qualified scans.

Risk:

Smart scan will be disabled or do passthru and can cause potential performance issues. If the I/O size is huge it might saturate
the RDS traffic and impact the RDA service times along with database performance.

Action / Repair:

To check whether database DST_UPGRADE_STATE is set to anything other than the normal value NONE, as the owner of the
oracle home for a given database and with the environment set to access that database, execute the following command set:

unset DST_UPGRADE_STATE_VALUE;
DST_UPGRADE_STATE_VALUE=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set head off lines 80 feedback off timing off serveroutput on
select upper(property_value) from sys.database_properties where property_name = 'DST_UPGRADE_STATE';
exit
EOF
);
if [ $DST_UPGRADE_STATE_VALUE = "NONE" ]
then
echo -e "SUCCESS: DB is not in DST upgrade state. \"DST_UPGRADE_STATE\" column value = "$DST_UPGRADE_STATE_VALUE""
else
echo -e "FAILURE: DB is in DST upgrade state. \"DST_UPGRADE_STATE\" column value = "$DST_UPGRADE_STATE_VALUE""
fi;

The expected output should be similar to:

SUCCESS: DB is not in DST upgrade state. "DST_UPGRADE_STATE" column value = NONE

NOTE: Oracle recommends that database should not be in DST upgrade state under normal operations. Refer to MOS Doc ID
1583297.1 for fixing or closing the DST upgrade state. If DST_UPGRADE_STATE is UPGRADE, PREPARE or DATAPUMP then
possibly a prepare or upgrade window or an on-demand or datapump-job loading of a secondary time zone data file is in an
active state. A failed or terminated Datapump job can also cause DST_UPGRADE_STATE value to be Datapump(1) which should
be fixed. This check could fail if there is an active Datapump job loading a secondary timezone file at the same time.
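The check above depends on the value returned by sqlplus comparing equal to NONE, and sqlplus output can carry blank lines or padding around the value. The following sketch shows one way to normalize the value before comparing it. The function names are hypothetical and the message wording is only modelled on the check above.

```shell
#!/bin/bash
# normalize_value VALUE  (hypothetical helper)
# Removes all whitespace, including blank lines emitted around a sqlplus
# result, and prints the value in upper case.
normalize_value() {
    printf '%s' "$1" | tr -d '[:space:]' | tr '[:lower:]' '[:upper:]'
}

# dst_state_report VALUE  (hypothetical helper)
# Prints a SUCCESS or FAILURE line for a DST_UPGRADE_STATE value.
dst_state_report() {
    local state
    state=$(normalize_value "$1")
    if [ "$state" = "NONE" ]; then
        echo "SUCCESS: DB is not in DST upgrade state (value = $state)"
    else
        echo "FAILURE: DB is in DST upgrade state or value is unreadable (value = ${state:-<empty>})"
    fi
}
```

An empty value is reported as a failure on purpose: it usually means the query did not run, which should not be mistaken for a healthy state.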

Verify there are no failed diskgroup rebalance operations

Priority Alert Date Owner Status Engineered System Bug(s)

Level

Critical FAIL 09/16/15 <Name> Production Exadata - Physical, Exadata -
User Domain, SSC

DB DB Engineered System Platform Exadata OS & Validation Tool Version TBD

Version Role Version Version

11.2.0.3+ ASM X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4- ALL Linux x86-64 exachk 12.1.0.2.6
2, X4-8, X5-2, X5-8 el5uek
Linux x86-64
el6uek
Solaris 11

Benefit / Impact:

Verifying there are no failed diskgroup rebalance operations helps to ensure that all diskgroups have the chosen redundancy.
The impact of correcting any failed diskgroup rebalance operations depends upon the error responsible for the failure.

Risk:

A failed diskgroup rebalance operation could leave the diskgroup without the proper redundancy, exposing the diskgroup to a
loss of data if another partner disk fails.

Action / Repair:

To verify there are no failed diskgroup rebalance operations, as the owner of the grid home and with the environment set to
access one ASM instance, execute the following command set:

#!/bin/bash
unset REBALANCE_ERROR;
REBALANCE_ERROR=$($ORACLE_HOME/bin/sqlplus -s "/ as sysasm" << EOF
set head off pagesize 0 timing off serveroutput on feedback off
select group_number,error_code from gv\$asm_operation where error_code is not null and upper(state) not in ('DONE','WAIT','RUN');
exit
EOF
);
if [ -z "$(echo $REBALANCE_ERROR | tr -d ' \t\n\r\f')" ]
then
echo -e "\nSUCCESS: There were no failed rebalance operations found.\n"
else
echo -e "\nFAILURE: Failed rebalance operations were found:\n"
echo -e "REBALANCE_ERROR:\n$REBALANCE_ERROR\n"
fi;

The output should be similar to:

SUCCESS: There were no failed rebalance operations found.

If the output is not "SUCCESS...", investigate the reported errors and correct appropriately.

Verify the CRS_HOME is properly locked

Priority Alert Date Owner Status Engineered Bug(s)

Level System

Critical WARN 11/10/15 <Name> Production Exadata - Physical,

Exadata - User
Domain,
ZDLRA

DB DB Role Engineered System Platform Exadata OS & Validation Tool TBD

Version Version Version Version

12.1+ ASM X2-2(4170), X2-2, X2-8, X3-2, X3-8, EIGHTH, X4-2, 11.2.2.2.0+ Linux x86- exachk 12.1.0.2.6
X4-8, X5-2, X5-8 64

Benefit / Impact:

The CRS_HOME should be locked properly after patching.

Risk:

The CRS_HOME not being locked properly may result in permissions being wrongly set as well as files not being instantiated.

Action / Repair:

To verify the CRS_HOME is properly locked, as the "root" userid on each database server execute the following command set:

export CRS_HOME=$(awk -F: '/^\+ASM[0-9].*/{printf "%s\n", $2}' /etc/oratab)

CRS_CHECK=$(stat -c %U $CRS_HOME);
if [[ $CRS_CHECK == "root" ]];
then echo -e "SUCCESS:CRS Home is locked.";
else echo -e "WARN:CRS Home is NOT locked."
fi;

The expected output should be:

SUCCESS:CRS Home is locked.

If the output is not "SUCCESS...", open an SR and work with Oracle Support to determine the root cause and proper corrective
action.
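The check above has two parts: locating the Grid home from the +ASM entry in /etc/oratab, and reading the owner of that directory. The following sketch separates the two so that each can be tested on its own. The function names are hypothetical, and the oratab layout is assumed to be colon-separated fields with "#" comment lines.

```shell
#!/bin/bash
# grid_home_from_oratab [FILE]  (hypothetical helper)
# Prints the home directory of the first +ASM entry in an oratab-style file.
grid_home_from_oratab() {
    awk -F: '
        /^[[:space:]]*#/     { next }              # skip comments
        $1 ~ /^\+ASM[0-9]*$/ { print $2; exit }    # first ASM entry only
    ' "${1:-/etc/oratab}"
}

# path_owner PATH  (hypothetical helper)
# Prints the owning user of PATH using GNU stat, as the check above does.
path_owner() {
    stat -c %U "$1"
}
```

With these helpers, the comparison in the check corresponds to testing whether path_owner "$(grid_home_from_oratab)" prints root. An empty result from grid_home_from_oratab means no +ASM entry was found, which is worth reporting separately from an unlocked home.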

Verify storage server data (non-system) disks have no partitions

Priority Alert Date Owner Status Engineered System Bug(s)

Level

Critical FAIL 01/27/2016 <Name> Production Exadata - Physical,

Exadata - Management
Domain,
SSC, Exalogic, Exalytics,
BDA, ZDLRA

DB DB Role Engineered System Platform Exadata OS & Validation Tool Version TBD
Version Version Version

N/A N/A X2-2(4170), X2-2, X2-8, X3-2, EIGHTH, X3-8, 11.2.3.2.0+ Linux x86- exachk 12.1.0.2.6
X4-2, X5-2 64

Benefit / Impact:

Verifying that storage server data (non-system) disks have no partitions helps avoid an outage or data loss.

The impact of verifying that storage server data (non-system) disks have no partitions is minimal. The impact of correcting
storage server data (non-system) disks that have partitions varies according to the reason for the partitions and the state of the
device, and cannot be estimated here.

Risk:

During a storage server reboot, for storage server data (non-system) disks that have partitions, the partitions may become not
visible to the operating system, and therefore unusable.

Action / Repair:

To verify that storage server data (non-system) disks have no partitions, as the "root" userid, execute the following command
set on each storage server:

unset report_command
OSS_SCRIPTS_HOME=/opt/oracle/cell/cellsrv/deploy/scripts
DISK_DEV="$OSS_SCRIPTS_HOME/unix/hwadapter/diskadp/get_disk_devices.pl"
SYS_DISKS=`cellcli -x -e "list lun attributes deviceName where isSystemLun = TRUE"`
SYS_DISK0=`echo $SYS_DISKS|cut -f1 -d' '`
SYS_DISK1=`echo $SYS_DISKS|cut -f2 -d' '`
if [ -z "$OSS_SCRIPTS_HOME" ]; then
report_command=$(echo "$report_command\nEnvironment variable OSS_SCRIPTS_HOME is not defined")
status=1
fi
if [ ! -f $DISK_DEV ]; then
report_command=$(echo "$report_command\nFile $DISK_DEV does not exist")
status=1
else
DATA_DISKS=`$DISK_DEV 2 |grep -v $SYS_DISK0 |grep -v $SYS_DISK1`
failDiskCount=0
for disk in $DATA_DISKS; do
size=${#disk}
if [ $size -eq 9 ]; then
disk="${disk%?}"
fi
parted -s $disk print 1>&2 >/dev/null
if [ $? -eq 0 ]; then
disks[$failDiskCount]=$disk
failDiskCount=`expr $failDiskCount + 1`
fi
done
if [ $failDiskCount -eq 0 ]; then
report_command=$(echo "$report_command\nAll data disks have no partitions")
status=0
else
report_command=$(echo "$report_command\nThe following disks have partitions:")
report_command=$(echo "$report_command\n ${disks[@]}")
report_command=$(echo "$report_command\nAssociated griddisks needs to be removed from diskgroups")
report_command=$(echo "$report_command\nRebalance should complete before replacing/reformatting this device.")
status=1
fi
fi
echo -e "$report_command"

The expected output should be:

All data disks have no partitions

If data disks with partitions are discovered, they will be echoed back. If the output is not as expected, investigate for root cause
and take appropriate corrective action.

NOTE: For additional information, please see: Exadata: Problems introduced when replacing a physical disk having a foreign
partition table (Doc ID 1965314.1).

Verify db_unique_name is used in I/O Resource Management (IORM) interdatabase plans

Priority Alert Level Date Owner Status Engineered Bug(s)

System

Critical WARN 02/24/2016 <Name> Production Exadata - Physical,

Exadata - User
Domain,
SSC, Exalogic,
Exalytics,
BDA, ZDLRA

DB DB Role Engineered System Platform Exadata OS & Validation Tool TBD

Version Version Version Version

11g, 12c Primary X2-2(4170), X2-2, X2-8, X3-2, EIGHTH, X3-8, 11.2.3.2.0+ Solaris - 11 exachk 12.1.0.2.6
Physical X4-2, X5-2 Linux x86-
Standby 64

Benefit / Impact:

Starting with Oracle Exadata Storage Server software version 12.1.2.3.0, IORM will no longer support using "db_name" in the
inter-database IORM plan directive if the directive does not contain the "role" attribute. Existing customers who may be using
"db_name" need to be alerted to this change.

NOTE: even though the effective version level for this change is 12.1.2.3.0, this check should be performed on versions prior to
that, and the situation resolved, to avoid any issues immediately after the upgrade.

Risk:

If the inter-database IORM plan is not updated to use "db_unique_name", IORM may not manage that database as defined in
the plan since the mapping will not be correct. DB, PDB and CG metrics for that database will also be impacted.

Action / Repair:

To determine if an existing IORM interdatabase plan requires modification, repeat the following process for all databases:

As the "root" userid on one storage server accessed by the target database, check if an interdatabase plan has been configured.
If the count is non-zero, an interdatabase plan has been configured.

cellcli -e "list iormplan attributes dbplan detail" | grep "name=" | wc -l

NOTE: If no IORM interdatabase plan is configured, no further checking is required.

As the database home owner userid, execute the following to determine if "db_name" is distinct from "db_unique_name":

$ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF

set head off lines 80 feedback off timing off serveroutput on
select VALUE from v\$parameter where name = 'db_name' and VALUE != (select VALUE from v\$parameter where name =
'db_unique_name');
exit
EOF

NOTE: if no rows are returned, "db_name" is not distinct and no further checking is required.

If an IORM interdatabase plan is configured and the "db_name" is distinct, as the "root" userid on one storage server accessed
by the target database, execute the following (correctly substituting the target database db_name value) to query the IORM plan
and check if it contains any directive using the "db_name" without the "role" attribute:

cellcli -e "list iormplan attributes dbplan detail" | grep -i "name=<target database db_name value>" | grep -v "role=" | wc -l

If the number of lines returned is non-zero, the interdatabase IORM plan directive needs to be updated to use the target
database "db_unique_name" value.

NOTE: Also review "Ensure db_unique_name is unique across the enterprise".
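The final grep pipeline in this check matches on a substring, so a db_name of "prod" would also match a directive for "prod2". The following sketch counts only exact name matches that have no "role=" attribute. It assumes that directives appear as comma-separated key=value pairs such as "name=prod,level=1", which is what the grep patterns above imply. The function name is hypothetical.

```shell
#!/bin/bash
# count_roleless_directives DB_NAME  (hypothetical helper)
# Reads IORM dbplan text on stdin and prints the number of lines that contain
# an exact "name=DB_NAME" pair (case-insensitive) and no "role=" attribute.
count_roleless_directives() {
    awk -v db="$1" '
        BEGIN { want = "name=" tolower(db) }
        {
            line = tolower($0)
            n = split(line, parts, /[ ,\t]+/)     # split into key=value tokens
            has_name = 0
            for (i = 1; i <= n; i++)
                if (parts[i] == want) has_name = 1
            if (has_name && line !~ /role=/) count++
        }
        END { print count + 0 }
    '
}
```

A typical call would pipe the cellcli listing used above into the function, for example: cellcli -e "list iormplan attributes dbplan detail" | count_roleless_directives mydb, where mydb stands in for the target database db_name value.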

Verify Datafiles are Placed on Diskgroups consisting of griddisks with cachingPolicy = DEFAULT

Priority Alert Date Owner Status Engineered System Bug(s)

Level

Critical WARNING 08/04/2015 <Name> Production Exadata, AVM

DB DB Role Engineered System Platform Exadata OS & Version Validation Tool TBD
Version Version Version

11.2.x+ N/A V2, X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, 11.2.3.2+ Linux x86-64
X5-2, X5-8 UEK5.8

Benefit / Impact:

Datafiles should be placed in diskgroups consisting of griddisks with their cachingPolicy set to DEFAULT. The cachingPolicy
attribute determines if flashcache is used for blocks stored on the griddisk. When cachingPolicy is set to DEFAULT, then
flashcache is used; when cachingPolicy is set to NONE, then flashcache will not be used for any blocks stored on the griddisk.
Per Oracle best practices, Exadata is configured with cachingPolicy set to NONE for griddisks in the RECO diskgroup and set to
DEFAULT (to use flashcache) for the DATA diskgroup. Oracle does not recommend storing datafiles in the RECO diskgroup or
any other diskgroup that has its cachingPolicy set to NONE.

Risk:

You will not get the benefit of flashcache and may see greater I/O and related waits and/or higher hard disk utilization than is
expected.

Action / Repair:

First, determine if you have placed datafiles onto a diskgroup that has its cachingPolicy set to NONE. Do this by creating a small
script as follows and set its execute permission; in this example the script is called "check_cp.sh" :

#!/bin/bash
#
# $1 = cell to check
#
CELL=$1
CELL_CP=$( ssh root@$CELL cellcli -e list griddisk attributes name,cachingpolicy,asmDiskGroupName where cachingpolicy=NONE | awk '{print $3}' | sort -u )
if [ -n "$CELL_CP" ]; then
for i in $( echo -e $CELL_CP )
do
file_part=$( echo -e "+$i%" )
RETVAL1=`sqlplus -silent / as sysdba <<EOF
set linesize 250 pagesize 10000 feedback off heading off echo off show off verify off
set serveroutput on
var vDG varchar2(2000)
begin
:vDG := '$file_part';
end;
/
select count(1) from v\\\$datafile where name like :vDG;
exit;
EOF`
RETVAL="$(echo $RETVAL1 |tr '\n' ' ')"
if [ "$RETVAL" -gt "0" ]; then
echo "FAIL : There are $RETVAL datafiles stored on griddisks in the ${i} diskgroup with cachingPolicy=none"
exit 1
else
echo "SUCCESS : There are NO datafiles stored on griddisks with cachingPolicy=none "
fi
done
else
echo "SUCCESS : There are NO griddisks with cachingPolicy=none "
fi

Set your shell environment (ORACLE_HOME, ORACLE_SID, etc.) to allow sqlplus to log on, and then run the script against a
single cell by calling it like this:

$ ./check_cp.sh exacel01

The expected output should be similar to:

SUCCESS : There are NO datafiles stored on griddisks with cachingPolicy=none

If any cell has an output of "FAIL ...", the corrective action is to review which files are on the diskgroup reported by the script
and ensure their placement in that diskgroup was intentional. The following query will show the specific datafiles:

select name from v$datafile where name like '<DISKGROUP LISTED IN THE COMMAND OUTPUT>%';

For example, if the command returned the following:

FAIL : There are 2 datafiles stored on griddisks in the RECOC1 diskgroup with cachingPolicy=none

the diskgroup to use in the query is +RECOC1, and the query would be:

select name from v$datafile where name like '+RECOC1%';

The script should be executed across all cells and repeated for each database instance you're interested in checking. If you have
a list of cells stored in a file such as /home/oracle/cell_group, you can check all of the cells like this:

for c in $( cat /home/oracle/cell_group );

do
echo "Now checking cell $c ...";
./check_cp.sh $c;
done
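When many cells are checked in a loop like the one above, a single FAIL line is easy to miss in the scrolling output. The following sketch wraps the loop in a function that counts the cells reporting a failure and returns a non-zero status if there are any. The function name run_check_on_cells is hypothetical, and it assumes the per-cell script prints lines starting with "FAIL" on failure, as the scripts in this document do.

```shell
#!/bin/bash
# run_check_on_cells CELL_FILE COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND once per cell listed in CELL_FILE (one name per line; blank
# lines and "#" comments are ignored), prints each cell's output, and returns
# non-zero if any cell produced a line starting with "FAIL".
run_check_on_cells() {
    local cell_file="$1"; shift
    local cell output failed=0
    while IFS= read -r cell || [ -n "$cell" ]; do
        cell="${cell%%#*}"                                   # strip comments
        cell="$(printf '%s' "$cell" | tr -d '[:space:]')"    # strip padding
        [ -z "$cell" ] && continue
        echo "Now checking cell $cell ..."
        # </dev/null keeps ssh or sqlplus from consuming the rest of the cell list
        output=$("$@" "$cell" </dev/null 2>&1) || true
        echo "$output"
        case "$output" in
            FAIL*|*$'\n'FAIL*) failed=$((failed + 1)) ;;
        esac
    done < "$cell_file"
    echo "Cells with FAIL results: $failed"
    [ "$failed" -eq 0 ]
}
```

For example, run_check_on_cells /home/oracle/cell_group ./check_cp.sh runs the script against every listed cell and ends with a one-line summary.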

Verify all datafiles are placed on griddisks that are cached on flash disks

Priority Alert Date Owner Status Engineered System

Level
Critical WARNING 02/18/2016 <Name> Production Exadata - Physical,
Exadata - User Domain,
ZDLRA
DB DB Role Engineered System Platform Exadata OS & Validation Tool Version
Version Version Version
11.2.x+ N/A V2, X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5- 11.2.3.2+ Linux x86-64 exachk 12.1.0.2.6
2

Benefit / Impact:

Datafiles should be placed in diskgroups consisting of griddisks whose cachedBy attribute is set to a list of flash disks.
The cachedBy attribute determines if flashcache is used for blocks stored on the griddisk. When cachedBy is set to a list of flash
disks, then flashcache is used; when cachedBy is not set, then flashcache will not be used for any blocks stored on the griddisk.
Per Oracle best practices, Exadata is configured with cachedBy set to NULL for griddisks in the RECO diskgroup and set to the
list of flash disks (to use flashcache) for the DATA diskgroup. Oracle does not recommend storing datafiles in the RECO
diskgroup or any other diskgroup that has one or more of its griddisks with cachedBy unset.

Risk:

You will not get the benefit of flashcache and may see greater I/O and related waits and/or higher hard disk utilization than is
expected.

Action / Repair:

First, determine if you have placed datafiles onto a diskgroup that has cachedBy unset. Do this by creating a small script as
follows and set its execute permission; in this example the script is called "check_cby.sh" :

#!/bin/bash
#
# $1 = cell to check
#
CELL=$1
FLASH_MODE=$( ssh root@$CELL cellcli -e 'list cell attributes flashCacheMode' | grep -i -c writeback )
CELL_CBY=$( ssh root@$CELL cellcli -e 'list griddisk attributes name,cachedby,asmDiskGroupName where cachedby\=\"

if [ -n "$CELL_CBY" ] && [ "$FLASH_MODE" -eq "1" ]; then

for i in $( echo -e $CELL_CBY )

do
echo "Diskgroup ${i} has griddisks with unset CachedBy attributes....checking if any datafiles are present.

file_part=$( echo -e "+$i%" )

RETVAL1=`sqlplus -silent / as sysdba <<EOF

set linesize 250 pagesize 10000 feedback off heading off echo off show off verify off
set serveroutput on

var vDG varchar2(2000)

begin
:vDG := '$file_part';
end;
/

select count(1) from v\\\$datafile where name like :vDG;

exit;
EOF`
RETVAL="$(echo $RETVAL1 |tr '\n' ' ')"

if [ "$RETVAL" -gt "0" ]; then

echo "FAIL : There are $RETVAL datafiles stored on griddisks in the ${i} diskgroup that are not cached b
else
echo "SUCCESS : There are NO datafiles stored on griddisks with cachedBy unset in the ${i} diskgroup "
fi

done

else
if [ "$FLASH_MODE" -eq "1" ]; then
echo "SUCCESS : There are NO datafiles stored on griddisks with cachedBy unset "
else
echo "SUCCESS : Cell is in WRITETHROUGH flashcache mode - test does not apply."
fi
fi

Set your shell environment to the ORACLE_HOME, ORACLE_SID, etc to allow sqlplus to log on and then run the
script against a single cell by calling it like this:

$ ./check_cby.sh exacel01

The expected output when a cell is in WriteBack flashcache mode should be:
SUCCESS : There are NO datafiles stored on griddisks with cachedBy unset

The expected output when a cell is in WriteThrough flashcache mode should be:
SUCCESS : Cell is in WRITETHROUGH flashcache mode - test does not apply.

If any cell has an output of "FAIL ...", the corrective action is to review which files are on the diskgroup reported
by the script and ensure their placement in that diskgroup was intentional. The following query will show the
specific datafiles:
select name from v$datafile where name like '<DISKGROUP LISTED IN THE COMMAND OUTPUT>%';

For example, if command returned the following:

FAIL : There are 3 datafiles stored on griddisks in the RECOC1 diskgroup that are not cached by flash (have cach

the diskgroup to use in the query is +RECOC1, and the query would be:

select name from v$datafile where name like '+RECOC1%';

The script should be executed across all cells and repeated for each database instance you're interested in
checking. If you have a list of cells stored in a file such as /home/oracle/cell_group, you can check all of the cells
like this:

for c in $( cat /home/oracle/cell_group ); do echo "Now checking cell $c ..."; ./check_cby.sh $c; done

Validate key sysctl.conf parameters on database servers

Priority Alert Date Owner Status Engineered System

Level

Critical FAIL 2/10/16 <Name> Production Exadata - Physical,

Exadata - Management
Domain,
Exadata - User Domain

DB DB Role Engineered System Platform Exadata OS & Validation Tool Version

Version Version Version

N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, All Linux
X5-2, X5-8

Benefit / Impact:

Risk:

Applying improperly formatted or incorrect value settings to kernel parameters can render a system unusable.

Action / Repair:

Key sysctl.conf parameters on database servers vary by Exadata software version level, hardware type, and whether or not
virtualization is used. exachk runs the appropriate checks based upon the discovered environment configuration. To validate key
sysctl.conf parameters on database servers, run exachk and review the provided report.

The expected output in the exachk report should be as follows:

In the "Findings Passed" summary section of the report, the overall result should be "PASS":

PASS OS Check sysctl.conf parameters on database servers are configured as recommended All Database Servers View
In the "View" detail section of the report for each individual database server:

Status on randomadm01:
PASS => sysctl.conf parameters on database servers are configured as recommended
DATA FROM RANDOMADM01 FOR VALIDATE KEY SYSCTL.CONF PARAMETERS ON DATABASE SERVERS
All sysctl.conf formatting checks succeeded
If there are issues discovered, the overall result will be "FAIL" and more information will be listed in the "View" detail section.
Investigate the reported issues for root cause and take appropriate corrective action.

NOTE: If after corrective actions are completed, you wish to run just this review manually without a full exachk run, as the
"root" userid in the directory in which exachk was installed, execute the following:

./exachk -check 018D274D1212689AE05313C0E50AB893

Detect duplicate files in /etc/init directories

Priority Alert Date Owner Status Engineered System Bug(s)
Level
Critical Warning 04/06/16 <Name> Production Exadata - Physical,
Exadata - User Domain,
Exadata - Management
Domain
DB DB Role Engineered System Platform Exadata OS & Validation Tool TBD

Version Version Version Version

n/a n/a X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5- All Linux x86- exachk 12.1.0.2.7
2, X5-8, X6-2, X6-8 64

Benefit / Impact:

Administrators sometimes back up the contents of /etc/init before updating a database node.

Directories with names such as /etc/init122_old can be created with duplicate startup files in them - files that already exist in
/etc/init.

Making sure no duplicate startup files exist helps prevent boot failures.
The impact of verifying /etc/*init* contents is minimal. The impact of correcting the duplicate contents is zero.

Risk:

At boot time, the operating system traverses all directories in /etc whose names start with the word "init" to execute startup
scripts. Duplicate files can cause startup scripts to be executed multiple times, which causes the boot process to fail.

Action / Repair:

Execute the following command as the "root" userid on all database servers:

v_dupe_cnt=$(find /etc/*init* -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 " | wc -l);
if [ $v_dupe_cnt -gt 0 ]
then
echo -e "FAILURE: Duplicate content found in /etc/init* directories";
else
echo -e "SUCCESS: No duplicate content found in /etc/init* directories";
fi;

The expected output should be:

SUCCESS: No duplicate content found in /etc/init* directories

A "FAILURE" message would be as follows:

FAILURE: Duplicate content found in /etc/init* directories

If output is a "FAILURE" message, run the following command to identify the duplicate files. Remove (or move) the duplicate files
found in the /etc/*init* directories to another location (out of /etc):

find /etc/*init* -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 "
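The command above lists only the duplicated file names and their counts. To decide which copies to move, it helps to see the full path of each copy. The following sketch groups the paths by file name. The function name is hypothetical, and file names containing newlines are not handled.

```shell
#!/bin/bash
# list_duplicate_basenames DIR [DIR...]  (hypothetical helper)
# Prints the full path of every regular file whose base name occurs more than
# once across the given directories, grouped by base name.
list_duplicate_basenames() {
    find "$@" -type f 2>/dev/null |
    awk -F/ '
        # $NF is the base name; remember every path seen for it
        { count[$NF]++; paths[$NF] = paths[$NF] "  " $0 "\n" }
        END {
            for (name in count)
                if (count[name] > 1)
                    printf "%s (%d copies):\n%s", name, count[name], paths[name]
        }
    '
}
```

Running list_duplicate_basenames /etc/*init* as the "root" userid covers the same directories as the check. No output means no duplicates were found.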

Verify Database Server Quorum Disks configuration

Priority Alert Date Owner Status Engineered System Engineered Bug(s)

Level System
Platform
Critical FAIL 05/29/19 <Name> Production Exadata - Physical, ALL 28496580 - exachk
Exadata - User 27274882 - exachk
Domain, SSC 25306232 - exachk
23065735 - exachk
27067655 - OEDA
DB/GI Version DB Type DB Role DB Exadata OS & Version Validation Tool MAA Scorecard
Mode Version Version Section
12.1.0.2.160119 ASM N/A N/A ALL Linux, Sparc exachk 19.3.0 N/A
and UP

Benefit / Impact:

The configuration of Quorum Disks for any High Redundancy diskgroup using fewer than five failgroups provides the following
benefits:

When storing voting disks, quorum disks protect the Grid Infrastructure in the event of a double partner storage failure, or in
the event of an Exadata storage server being offline due to planned maintenance followed by a partner storage failure.

Diskgroups that are expanded to use a higher number of failgroups and subsequently shrunk to use fewer than five failgroups
avoid being dismounted during planned or unplanned maintenance. This is due to changes introduced in bug 26199003.

Risk:

Without this feature, voting files get stored in a normal redundancy diskgroup on Exadata racks with less than 5 storage
servers which makes the Grid Infrastructure vulnerable to a cluster outage if multiple vote disks are inaccessible.

Diskgroups used on a flex configuration (expanding/shrinking) are exposed to be dismounted during planned or
unplanned maintenance.

Action / Repair:

NOTE: This check will only pass if the following are all true:
1) /opt/oracle.SupportTools/quorumdiskmgr exists on the db nodes
2) The GI BP version is above 12.1.0.2.160119
3) At least one HIGH redundancy diskgroup exists
4) Quorum disks on DB nodes are implemented when there are less than 5 storage cells in the high redundancy
disk group.
5) All HIGH redundancy diskgroups contain quorum disks
6) If the number of cells is greater than or equal to 5, all the voting files are in the cells

NOTE WELL: For a complete picture, please also reference: Verify all voting disks are online

To verify the database server quorum disks configuration, run exachk and review the provided report.

The expected output in the exachk report should be as follows:

The overall result should be "PASS" or "WARNING" or "FAIL":

In the "View" detail section of the report for this check the expected output should be similar to:

Voting File redundancy check Passed

## STATE File Universal Id File Name Disk group

-- ----- ----------------- --------- ---------
1. ONLINE 11ccca4125424fb1bfec2180a22e24cb (/dev/exadata_quorum/QD_DATAC1_SCAQAE05ADM01VM01) [DATAC1]
2. ONLINE 5da7f33dc5f64f64bfb2b756787a6b48 (o/192.168.221.137;192.168.221.138/DATAC1_FD_05_scaqae05celadm03) [
3. ONLINE 1eefa3ec1ebc4fd3bf8933ca0c587e13 (o/192.168.221.133;192.168.221.134/DATAC1_FD_04_scaqae05celadm01) [
4. ONLINE 6d65ea6de3eb4fcebf3e7984d62d51b9 (/dev/exadata_quorum/QD_DATAC1_SCAQAE05ADM02VM01) [DATAC1]
5. ONLINE de0d94da4fc94f57bf2a12dbc46a3603 (o/192.168.221.135;192.168.221.136/DATAC1_FD_04_scaqae05celadm02) [
Located 5 voting disk(s).

In the "View" detail section of the report for this check a "WARNING" example will be similar to:

A database server quorum disk configuration is not applicable to this system because no high redundancy diskgroup
High redundancy is a MAA best practice.
For details, see http://www.oracle.com/technetwork/database/features/availability/exadata-maa-131903.pdf

## STATE File Universal Id File Name Disk group

-- ----- ----------------- --------- ---------
1. ONLINE 5da7f33dc5f64f64bfb43434787a6b48 (o/192.168.221.137;192.168.221.138/RECOC1_FD_05_scaqae05celadm03) [
2. ONLINE 1eefa3ec1dffe4d3bf8933ca0c587e13 (o/192.168.221.133;192.168.221.134/RECOC1_FD_04_scaqae05celadm01) [
3. ONLINE de0d94da4fc94f52332dr2dbc46a3603 (o/192.168.221.135;192.168.221.136/RECOC1_FD_04_scaqae05celadm02) [
Located 3 voting disk(s).

In the "View" detail section of the report for this check a "FAILURE" example will be similar to:

A database server quorum disk configuration is applicable to this system.

But an optimal Quorum disk setup is not found as seen below.
An optimal quorum disk setup should include 2 quorum disks along with 5 voting files, with 2 of the voting files

## STATE File Universal Id File Name Disk group

If the result is a "FAILURE..." message, follow the steps provided to add database server quorum disks in the "Adding Quorum
Disks to Database Servers" section of the "Oracle® Exadata Database Machine Maintenance Guide"
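When reading a voting file listing like the ones shown above, it helps to count how many ONLINE voting files sit on database server quorum disks and how many on storage cells. The following sketch does that for text in the format of the sample listings. It assumes quorum disk paths start with "(/dev/exadata_quorum/" and cell paths start with "(o/", as in those samples. The function name is hypothetical.

```shell
#!/bin/bash
# summarize_voting_files  (hypothetical helper)
# Reads a voting file listing on stdin and prints three numbers for the
# ONLINE entries: total, on quorum disks, on storage cells.
summarize_voting_files() {
    awk '
        $2 == "ONLINE" {
            total++
            if ($0 ~ /\(\/dev\/exadata_quorum\//) quorum++
            else if ($0 ~ /\(o\//)                cell++
        }
        END { printf "%d %d %d\n", total, quorum, cell }
    '
}
```

For the passing example above, the summary would be five voting files in total, two of them on quorum disks, which can be compared with the setup described in the "FAILURE" message.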

Verify Oracle Clusterware files are placed appropriately

Priority Alert Date Owner Status Engineered

Level System

Critical Fail 05/25/16 <Name> Development Exadata -

Physical,
Exadata - User
Domain

DB Version DB Role Engineered System Platform Exadata OS & Validation Tool

Version Version Version

Any supported GRID X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5- Any supported Linux x86-64 exachk
version 2, X5-8, X6-2, X6-8 version 12.1.0.2.7

Benefit / Impact:

Oracle Clusterware files should always be placed in a high redundancy diskgroup with the exception of voting files for the
following cases.

i) For environments with less than 5 storage cells and running any Exadata software release prior to 12.1.2.3.0, the voting files
need to be placed in a normal redundancy diskgroup.

ii) For environments with less than 5 storage cells, running any Exadata software release 12.1.2.3.0 or above and running any
Oracle Grid Infrastructure version prior to 12.1.0.2.160119, the voting files need to be placed in a normal redundancy diskgroup.
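The two exceptions above reduce to a small decision rule. The sketch below is illustrative only: the function names version_lt and expected_voting_redundancy are hypothetical, and the version comparison assumes GNU "sort -V" is available.

```shell
#!/bin/bash
# version_lt A B: succeeds when dotted version A sorts strictly before B.
# Relies on GNU `sort -V` (version sort).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# expected_voting_redundancy CELLS EXADATA_VERSION GI_VERSION
# Prints NORMAL when exception i) or ii) applies, otherwise HIGH.
expected_voting_redundancy() {
    local cells=$1 exadata=$2 gi=$3
    if [ "$cells" -lt 5 ] &&
        { version_lt "$exadata" 12.1.2.3.0 || version_lt "$gi" 12.1.0.2.160119; }
    then
        echo NORMAL
    else
        echo HIGH
    fi
}

# Example: expected_voting_redundancy 3 12.1.2.2.0 12.1.0.2.160119
```

With fewer than 5 cells, either an Exadata release older than 12.1.2.3.0 or a Grid Infrastructure version older than 12.1.0.2.160119 is enough to select NORMAL.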

Risk:

Oracle Clusterware files placed on a normal redundancy diskgroup are exposed to the risk of being lost in the event of
diskgroup failures due to a double partner storage failure. Having the clusterware files on a high redundancy diskgroup mitigates
this risk. The voting files are the only Clusterware files that are mandated to be stored in a normal redundancy diskgroup under
the 2 conditions mentioned above. However, even if we lose the voting files due to a double partner storage failure under the
above 2 conditions, they can be easily recreated unlike all other Clusterware files which require restore from backups.

Action / Repair:

Execute the script provided below as the Grid Infrastructure owner to check if the Clusterware files are placed appropriately.

#!/bin/bash

#################################################################
#                                                               #
# Purpose: Check the placement of Oracle Clusterware files     #
#                                                               #
#################################################################

## Function declarations

export GRID_HOME=$(grep ^"+ASM" /etc/oratab|awk -F ":" '{print $2}')
export ORACLE_HOME=$GRID_HOME
export ORACLE_SID=$(grep ^"+ASM" /etc/oratab|awk -F ":" '{print $1}')

usage()
{
    echo "Usage: CheckGIFiles.sh [-o check|report] [-h]";
}

# Sets HighRedExists=1 when at least one high redundancy diskgroup exists
checkDGRedundancy()
{
    HighRedExists=$($GRID_HOME/bin/asmcmd lsdg --suppressheader|awk '{print $2}'|grep -q HIGH && echo "1")

    if [ "$HighRedExists"x == "x" ]
    then
        HighRedExists=0
    else
        HighRedExists=1
    fi
}

# Sets OCRHighRedundancy=1 when the OCR is in a high redundancy diskgroup
checkOCR()
{
    OCRdgName=$($GRID_HOME/bin/ocrcheck|grep "Device/File Name"|awk -F":" '{print $2}'|awk -F"+" '{print $2}')
    OCRDGRedundancy=$($GRID_HOME/bin/asmcmd lsdg --suppressheader $OCRdgName|awk '{print $2}'|grep -q HIGH && echo "1")

    if [ "$OCRDGRedundancy"x == "x" ]
    then
        OCRHighRedundancy=0
        OCRRec="Please relocate the OCR to a high redundancy diskgroup using $GRID_HOME/bin/ocrconfig as described in the link below\n"
        OCRRecLink="http://docs.oracle.com/database/121/CWADD/votocr.htm#BABEIEJI\n"
    else
        OCRHighRedundancy=1
    fi
}

# Sets ASMspfileHighRed=1 when the ASM spfile is in a high redundancy diskgroup
checkASMspfile()
{
    ASMspfiledgName=$($GRID_HOME/bin/asmcmd spget|awk -F"/" '{print $1}'|awk -F"+" '{print $2}')
    ASMspfileDGRed=$($GRID_HOME/bin/asmcmd lsdg --suppressheader $ASMspfiledgName|awk '{print $2}'|grep -q HIGH && echo "1")

    if [ "$ASMspfileDGRed"x == "x" ]
    then
        ASMspfileHighRed=0
        ASMspRec="Please relocate the ASM spfile to a high redundancy diskgroup using '$GRID_HOME/bin/asmcmd spcopy -u' as described in the link below.\nAfter relocating the spfile, if possible restart the Grid Infrastructure in a rolling manner.\nIf a rolling grid infrastructure restart is not permitted, repeat the steps for relocating the spfile to the high redundancy diskgroup every time an initialization parameter modification to the ASM spfile is required until the Grid Infrastructure is restarted in a rolling manner.\n"
        ASMspRecLinks="http://docs.oracle.com/database/121/OSTMG/GUID-528363BF-F4C8-4F05-BB61-DF7A6863E5B8.htm#OSTMG94420\n"
    else
        ASMspfileHighRed=1
    fi
}

# Sets ASMpwfileHighRed=1 when the ASM password file is in a high redundancy diskgroup
checkASMpwfile()
{
    ASMpwfiledgName=$($GRID_HOME/bin/srvctl config asm|grep "Password"|awk -F":" '{print $2}'|awk -F"/" '{print $1}'|awk -F"+" '{print $2}')
    ASMpwfileDGRed=$($GRID_HOME/bin/asmcmd lsdg --suppressheader $ASMpwfiledgName|awk '{print $2}'|grep -q HIGH && echo "1")

    if [ "$ASMpwfileDGRed"x == "x" ]
    then
        ASMpwfileHighRed=0
        ASMpwRec="Please relocate the ASM passwordfile to a high redundancy diskgroup using '$GRID_HOME/bin/asmcmd pwmove' as described in the link below.\n"
        ASMpwRecLink="http://docs.oracle.com/database/121/OSTMG/GUID-6DFC9F42-A949-412F-B9F3-D947C1A620B8.htm#OSTMG95378\n"
    else
        ASMpwfileHighRed=1
    fi
}

check_main()
{
    checkDGRedundancy
    checkOCR
    checkASMspfile
    checkASMpwfile

    Dgsfound=$($GRID_HOME/bin/asmcmd lsdg |awk -F"/" '{print $1}'|awk '{print $13,$2}')

    # Report lines common to every outcome
    repCmdOutput0="The Diskgroups found are \n============================\n $Dgsfound\n"
    repCmdOutput1="OCR is stored in : $OCRdgName\n"
    repCmdOutput2="ASM spfile is stored in : $ASMspfiledgName\n"
    repCmdOutput3="ASM password file is stored in : $ASMpwfiledgName\n"

    # The check fails only when a high redundancy diskgroup exists and
    # at least one Clusterware file is stored outside of one
    if [[ $HighRedExists -eq 1 ]] && { [[ $OCRHighRedundancy -eq 0 ]] || [[ $ASMspfileHighRed -eq 0 ]] || [[ $ASMpwfileHighRed -eq 0 ]]; }
    then
        repText="\nClusterware files placement check failed. \nThe clusterware files are not all placed in a high redundancy diskgroup.\n"
        exit_code=1
        ALVL=1
    else
        repText="\nClusterware files placement check passed\n"
        exit_code=0
    fi
}

print_result()
{
    echo $exit_code
}

print_report()
{
    echo -e "$repText"
    echo -e "$repCmdOutput0"
    echo -e "$repCmdOutput1"
    echo -e "$repCmdOutput2"
    echo -e "$repCmdOutput3"
    if [ $exit_code -ne 0 ]
    then
        [ -z "$OCRRec" ] || echo -e "$OCRRec\n$OCRRecLink"
        [ -z "$ASMspRec" ] || echo -e "$ASMspRec\n$ASMspRecLinks"
        [ -z "$ASMpwRec" ] || echo -e "$ASMpwRec\n$ASMpwRecLink"
    fi
}

NumArgs=$#

if [ $NumArgs -lt 1 ]
then
    echo "Invalid or missing command line arguments..."
    usage;
    exit 1
fi

while getopts "o:h" opt;
do
    case "${opt}" in
        h) usage;
           exit 0
           ;;
        o)
           swch=${OPTARG};
           ;;
        *) echo "Invalid or missing command line arguments..."
           usage;
           exit 1
           ;;
    esac
done

if [ "$swch" == "check" ]
then
    check_main;
    print_result;
elif [ "$swch" == "report" ]
then
    check_main;
    print_report;
else
    echo "Invalid or missing command line arguments..."
    usage;
    exit 1
fi

The expected output is:

SUCCESS: Clusterware files placement check passed

- OR -

WARNING: Clusterware files placement check failed. The clusterware files are not all placed in a high redundancy diskgroup.
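The script reads the diskgroup name from column 13 of the "asmcmd lsdg" output. A hedged alternative, in case a release prints additional columns, is to take the name from the last field instead. The sketch below is an assumption-based illustration, not part of the original script; the function name list_diskgroup_redundancy is hypothetical.

```shell
#!/bin/bash
# Sketch: print "<diskgroup> <redundancy>" for each row of `asmcmd lsdg`
# output read from stdin. Assumptions: the redundancy type is the second
# column and the diskgroup name (with a trailing "/") is the last column.
list_diskgroup_redundancy() {
    awk '
        $1 != "State" && NF >= 2 {
            name = $NF
            sub(/\/$/, "", name)    # drop the trailing slash from the name
            print name, $2
        }
    '
}

# Example invocation (as the Grid Infrastructure owner):
#   $GRID_HOME/bin/asmcmd lsdg | list_diskgroup_redundancy
```

Taking the last field keeps the parsing independent of how many numeric columns precede the name.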

Verify "_reconnect_to_cell_attempts=9" on database servers which access X6 storage servers

Priority Alert Date Owner Status Eng

Level

Critical FAIL 06/29/16 <Name> Draft Exada

Exada
SSC

DB Version DB Role Engineered System Exadata OS & V

Version Version

< 12.1.0.2 OCT ALL X2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6- 11.2+ Linux x86-64, EX
BP 2, X6-8 Solaris
-or-
< 12.2.0.1

Benefit / Impact:

For optimal high availability, the cellinit.ora parameter file on database servers which access X6 storage servers must contain
"_reconnect_to_cell_attempts=9".

The impact of verifying this setting is minimal. The impact of adding the parameter to the cellinit.ora file on the database
servers is minimal, but after including the parameter on the database side, the cell server process (CELLSRV) on each X6 storage
server must be restarted to activate the change.

Risk:

If the cellinit.ora parameter file on database servers which access X6 storage servers does not contain
"_reconnect_to_cell_attempts=9", brownout duration may be lengthened.

EXAchk runs the appropriate validation based upon the discovered environment configuration. Run EXAchk and review the
provided report.

The expected output in the EXAchk report should be as follows:

In the "Findings Passed" summary section of the report, the overall result should be "PASS":

PASS OS Check _reconnect_to_cell_attempts parameter in cellinit.ora is set to recommended value All Database Servers View

In the "View" detail section of the report for each individual database server:

Status on randomadm01:
PASS => _reconnect_to_cell_attempts parameter in cellinit.ora is set to recommended value

DATA FROM RANDOMADM01 - VERIFY "_RECONNECT_TO_CELL_ATTEMPTS=9" ON DATABASE SERVERS WHICH ACCESS X6

STORAGE SERVERS

ipaddress4=192.172.23.4/26
ipaddress3=192.172.23.3/26
ipaddress2=192.172.23.2/26
ipaddress1=192.172.23.1/26
_reconnect_to_cell_attempts=9

If the parameter is not set as expected, the overall result will be "FAIL" and more information will be listed in the "View" detail
section.
To correct a "FAIL" result, do:
1) As the "root" userid on each database server that requires correction, edit the cellinit.ora file with vi and add
"_reconnect_to_cell_attempts=9".
2) As the "root" userid on each storage server that communicates with the database servers in 1), restart the cell server process.

NOTE: If after corrective actions are completed, you wish to run just this verification without a full EXAchk run, as the "root"
userid in the directory in which EXAchk was installed, execute the following:

./exachk -check 39E9CC7370B42BF6E0530E98EB0AC7A5
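For a quick manual look at a single database server, the parameter can also be checked directly in the file. This is a minimal sketch rather than the EXAchk logic; the function name check_reconnect_attempts is hypothetical, and the cellinit.ora path in the example comment is an assumption about a typical installation.

```shell
#!/bin/bash
# Sketch: report whether a cellinit.ora file contains
# _reconnect_to_cell_attempts=9 on a line of its own.
check_reconnect_attempts() {
    local file=$1
    if grep -Eq '^[[:space:]]*_reconnect_to_cell_attempts[[:space:]]*=[[:space:]]*9[[:space:]]*$' "$file"
    then
        echo "PASS: _reconnect_to_cell_attempts=9 found in $file"
    else
        echo "FAIL: _reconnect_to_cell_attempts=9 not found in $file"
        return 1
    fi
}

# Example (path is an assumption, adjust to the local installation):
#   check_reconnect_attempts /etc/oracle/cell/network-config/cellinit.ora
```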

Verify passwordless SSH connectivity for Enterprise Manager (EM) agent owner userid to target component
userids

Priority Alert Date Owner Status Engineered System Bug(s)

Level

Critical FAIL 08/24/16 <Name> Development Exadata - Physical,

Exadata -
Management
Domain,
Exadata - User
Domain

DB DB Role Engineered System Platform Exadata OS & Validation Tool TBD

Version Version Version Version

N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, 11.2.2.2.0+ Linux x86-64 exachk 12.2.0.1.1
X5-2, X5-8, X6-2, X6-8

Benefit / Impact:

EM agent monitoring requires passwordless SSH connectivity between the userid running the EM agent on each database server
where an EM agent is running and specific userids for each target component that particular EM agent is monitoring. Component
replacement or other maintenance work may destroy the passwordless SSH configuration and cause monitoring to fail.

Risk:

Users would not be notified if there are issues on the EM target components.

Action / Repair:

To verify that the necessary passwordless SSH exists, do the following on each database server where an EM agent is running:

1. Determine which database servers have EM agents installed using the EM console.

2. For each EM agent, determine the components it is responsible for monitoring in the agent home page of the EM
console.

3. Login to each database server where an EM agent is running as the operating system userid that launched the EM agent and
execute the following for each monitored component determined in 1) and 2):

For a database server EM target:

ssh -o 'PreferredAuthentications=publickey' <AGENT OS USERID>@<Database_Server_Name> "echo Success"

For a storage server EM target:

ssh -o 'PreferredAuthentications=publickey' cellmonitor@<Storage_Server_Name> "echo Success"

For an InfiniBand switch EM target:

ssh -o 'PreferredAuthentications=publickey' nm2user@<IB_Switch_Name> "echo Success"

For a Cisco switch EM target:

ssh -o 'PreferredAuthentications=publickey' admin@<Cisco_Switch_Name> "echo Success"

For each component, the expected output should be:

Success

If "Permission denied (publickey,gssapi-with-mic,password)" is returned then the ssh configuration is not correct.
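When an agent monitors many components, the individual ssh tests above can be looped. The following is a sketch under stated assumptions: the function name check_passwordless_ssh and the SSH_CMD override variable are hypothetical (the override exists only so the loop can be exercised without real hosts), and BatchMode=yes is added so that ssh fails instead of prompting for a password.

```shell
#!/bin/bash
# Sketch: run the "echo Success" test against each user@host argument.
# SSH_CMD is a hypothetical override hook; it defaults to the ssh client.
check_passwordless_ssh() {
    local target out rc=0
    for target in "$@"; do
        out=$("${SSH_CMD:-ssh}" -o BatchMode=yes -o PreferredAuthentications=publickey \
            "$target" "echo Success" 2>/dev/null < /dev/null)
        if [ "$out" = "Success" ]; then
            echo "OK $target"
        else
            echo "FAIL $target"
            rc=1
        fi
    done
    return $rc
}

# Example (host names are placeholders):
#   check_passwordless_ssh cellmonitor@cel01 nm2user@ibsw01 admin@cisco01
```

For any target reported as FAIL, rerun the corresponding ssh command from the steps above by hand to see the exact error message.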

CORRECTIVE ACTIONS:

For a database server:

4. To correct the "Permission denied..." case:
a. Check to see if /home/oracle/.ssh/id_dsa and id_dsa.pub files exist on the affected agent host. If either file does not exist
follow the steps in: Enterprise Manager Oracle Exadata Database Machine Getting Started Guide, Chapter 8: Troubleshooting the
Exadata Plug-in, Section: Establish SSH Connectivity
b. If so, append the contents of /home/oracle/.ssh/id_dsa.pub on the agent host to /home/oracle/.ssh/authorized_keys on the affected
database server(s).
c. Ensure the permission on /home/oracle/.ssh/authorized_keys is set to 600 and owned by the oracle user

For a storage server:

4. To correct the "Permission denied..." case:
a. Check to see if /home/oracle/.ssh/id_dsa and id_dsa.pub files exist on the affected agent host. If either file does not exist
follow the steps in: Enterprise Manager Oracle Exadata Database Machine Getting Started Guide, Chapter 8: Troubleshooting the
Exadata Plug-in, Section: Establish SSH Connectivity
b. If so, append the contents of /home/oracle/.ssh/id_dsa.pub on the agent host to /home/cellmonitor/.ssh/authorized_keys on the affected
storage server(s).
c. Ensure the permission on /home/cellmonitor/.ssh/authorized_keys is set to 600 and owned by the cellmonitor user

For an InfiniBand switch:

4. To correct the "Permission denied..." case:
a. Check to see if /home/oracle/.ssh/id_dsa and id_dsa.pub files exist on the affected agent host. If either file does not exist
follow the steps in: Enterprise Manager Oracle Exadata Database Machine Getting Started Guide, Chapter 8: Troubleshooting the
Exadata Plug-in, Section: Establish SSH Connectivity
b. If so, append the contents of /home/oracle/.ssh/id_dsa.pub on the agent host to /home/nm2user/.ssh/authorized_keys on
the affected IB switch(es).
c. Ensure the permission on /home/nm2user/.ssh/authorized_keys is set to 600 and owned by the nm2user user
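Steps b and c for the database server, storage server, and InfiniBand switch cases follow the same pattern: append the public key once, then restrict the file mode. A minimal sketch of that pattern, to be run on the target host as the target userid after the public key file has been copied there, is shown below. The function name append_public_key and the example paths are hypothetical.

```shell
#!/bin/bash
# Sketch: append a public key to an authorized_keys file only if it is not
# already present, then set the file mode to 600.
append_public_key() {
    local pubkey_file=$1 auth_file=$2 key
    key=$(cat "$pubkey_file") || return 1
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -F fixed string, -x whole-line match, -q quiet
    grep -Fxq -- "$key" "$auth_file" || printf '%s\n' "$key" >> "$auth_file"
    chmod 600 "$auth_file"
}

# Example (paths are placeholders):
#   append_public_key /tmp/id_dsa.pub /home/cellmonitor/.ssh/authorized_keys
```

Checking for an existing entry first keeps authorized_keys free of duplicates when the step is repeated after maintenance work.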

For the Cisco switch:

Switch hostname>enable
Switch hostname#configure terminal
Switch hostname(config)#ip ssh pubkey-chain
Switch hostname(conf-ssh-pubkey)#username admin
Switch hostname(conf-ssh-pubkey-user)#key-string
Switch hostname(conf-ssh-pubkey-data)#< Enter your keyfile contents here >
Switch hostname(conf-ssh-pubkey-data)#< Enter your keyfile contents here >

** The key may need to be entered on multiple lines as the maximum line length is 254 characters.

Now exit the switch

Switch hostname(conf-ssh-pubkey-data)#exit
Switch hostname(conf-ssh-pubkey-user)#exit
Switch hostname(conf-ssh-pubkey)#exit
Switch hostname(config)#exit
Switch hostname#exit

5. Repeat step 3 and verify connectivity

If some message other than "Success" or "Permission denied...." is returned, investigate for root cause based on the message
keywords and take corrective action.

Check /EXAVMIMAGES on dom0s for possible over allocation by sparse files

Priority Alert Level Date Owner Status Engineered System Engineere

Platf

Critical WARN 03/15/17 <Name> Production Exadata - Management Domain A

DB Version DB Type DB Role DB Mode Exadata Version OS & Version Validation T

N/A N/A N/A N/A Linux Linux exachk 1

Benefit:

To use dom0 disk space efficiently, two space saving techniques are used for disk image files in /EXAVMIMAGES, sparse files and
reflinks. Sparse files do not allocate blocks on disk for empty space. OCFS2 reflinks allow disk image copies to share blocks on
disk until one of the copies changes, at which time a new block on disk is allocated. The result of these space saving features is
the amount of disk space consumed is less than the apparent size of the user domain disk image files reported by the "du -sS
--apparent-size" command. However, as a user domain is used and files are changed, created, and removed, the disk space
consumed from the /EXAVMIMAGES file system will continually grow while the actual space used by disk image files could
remain the same. This check warns when the total apparent size of all files in /EXAVMIMAGES exceeds the size of the file system.

Impact: The impact of this check is minimal

Risk:

A failure does not occur when the apparent size exceeds the size of the /EXAVMIMAGES file system. It may be normal in many
environments that benefit from sparse files and reflinks heavily. However, over time as changes are made to user domain disks
(e.g. by applying Exadata, Grid Infrastructure, or Database patches), allocated space in the /EXAVMIMAGES file system
increases. If the allocated space reaches /EXAVMIMAGES file system size in dom0, then an out of space error will occur within
the user domain, even though df output within the user domain shows there is available space. This can cause unpredictable
behavior, such as unbootable user domains, or corrupted files that were being changed at the time the out of space error
occurred.
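The comparison this check describes, total apparent size against file system size, can be approximated by hand. The sketch below is not the exachk implementation; the function names is_overallocated and check_apparent_size are hypothetical, and it assumes the GNU versions of du and df.

```shell
#!/bin/bash
# is_overallocated APPARENT_BYTES FS_SIZE_BYTES
# Succeeds when the apparent size exceeds the file system size.
is_overallocated() {
    [ "$1" -gt "$2" ]
}

# check_apparent_size DIR
# Compares the apparent size of everything under DIR with the size of the
# file system that holds DIR (both in bytes).
check_apparent_size() {
    local dir=$1 apparent fs_size
    apparent=$(du -s --apparent-size -B1 "$dir" | awk '{print $1}')
    fs_size=$(df -P -B1 "$dir" | awk 'NR==2 {print $2}')
    if is_overallocated "$apparent" "$fs_size"; then
        echo "WARN: apparent size $apparent bytes exceeds file system size $fs_size bytes"
    else
        echo "PASS: apparent size $apparent bytes is within file system size $fs_size bytes"
    fi
}

# Example (as root on a dom0):
#   check_apparent_size /EXAVMIMAGES
```

A WARN result is not itself an error, as explained above; it indicates that allocated space should be watched with "df" as user domain disks change.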

Action/Repair: Execute the script as root on a dom0.

To validate /EXAVMIMAGES on dom0s for possible over allocation by sparse files, run exachk and review the provided report.

The expected output in the exachk report should be as follows:

In the "Findings Passed" summary section of the report, the overall result should be "PASS":

PASS OS Check /EXAVMIMAGES on dom0s has enough free space All Database Servers View

In the "View" detail section of the report for each individual database server:

Status on randomadm01:
PASS => /EXAVMIMAGES on dom0s has enough free space

DATA FROM RANDOMADM01 FOR CHECK /EXAVMIMAGES ON DOM0S FOR POSSIBLE OVER ALLOCATION BY SPARSE FILES

/EXAVMIMAGES space has not been over allocated and the space usage is under the threshold.

If there are issues discovered, the overall result will be "FAIL" and more information will be listed in the "View" detail section.
Investigate the reported issues for root cause and take appropriate corrective action.

Verify active kernel version matches expected version for installed Exadata Image

Priority: Critical
Alert Level: FAIL
Date: 11/28/18
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - Management Domain, Exadata - User Domain
Engineered System Platform: ALL
Bug(s): 28826182 - exachk, 26337714 - exachk

DB/GI Version: N/A
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: 12.1.2.3.0 or higher
OS & Version: Linux
Validation Tool Version: exachk 18.1.4
MAA Scorecard Section: N/A

Benefit / Impact:

Beginning with Exadata version 12.1.2.3.0, the "imageinfo" command includes data on the active kernel version and the
expected kernel version for the installed version of the Exadata image. The active and expected kernel versions should match.

Risk:

Having an active kernel version that does not match the expected version could adversely impact upgrade operations.

Action / Repair:

To verify active kernel version matches expected version for the installed Exadata image, as the "root" userid on each database
server, execute the following command set:

RAW_DATA=$(imageinfo | egrep "Kernel|kernel")

# Extract the two version strings and strip all whitespace
ACTIVE_KERNEL_VERSION=$(echo "$RAW_DATA" | egrep "Kernel" | cut -d":" -f2 | cut -d"#" -f1 | tr -d '[:space:]')
EXPECTED_KERNEL_VERSION=$(echo "$RAW_DATA" | egrep "kernel" | cut -d":" -f2 | tr -d '[:space:]')
# Keep only the portion of each version that precedes ".el"
AKV_OFFSET=$(echo "$ACTIVE_KERNEL_VERSION" | egrep -b -o "\.el" | cut -d":" -f1)
EKV_OFFSET=$(echo "$EXPECTED_KERNEL_VERSION" | egrep -b -o "\.el" | cut -d":" -f1)
ACTIVE_KERNEL_VERSION_SHORT=$(echo "$ACTIVE_KERNEL_VERSION" | cut -c 1-$AKV_OFFSET)
EXPECTED_KERNEL_VERSION_SHORT=$(echo "$EXPECTED_KERNEL_VERSION" | cut -c 1-$EKV_OFFSET)
if [ "$ACTIVE_KERNEL_VERSION_SHORT" = "$EXPECTED_KERNEL_VERSION_SHORT" ]
then
    echo -e "SUCCESS: The kernel versions match:\n"
    echo -e "Active kernel version:\t\t$ACTIVE_KERNEL_VERSION_SHORT"
    echo -e "Expected kernel version:\t$EXPECTED_KERNEL_VERSION_SHORT"
else
    echo -e "FAILURE: The kernel versions should match:\n"
    echo -e "Active kernel version:\t\t$ACTIVE_KERNEL_VERSION_SHORT"
    echo -e "Expected kernel version:\t$EXPECTED_KERNEL_VERSION_SHORT"
fi;

The expected output should be similar to:

SUCCESS: The kernel versions match:

Active kernel version: 2.6.39-400.284.1

Expected kernel version: 2.6.39-400.284.1

Example of a "FAILURE" message:

FAILURE: The kernel versions should match:

Active kernel version: 2.6.39-400.284.1

Expected kernel version: 2.6.39-500.284.1

If a "FAILURE: ..." message appears, corrective actions will depend upon the kernel versions and the reasons for
which the mismatch was introduced. Please open an SR for diagnostic and corrective assistance.
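The command set trims each version string at the first ".el" by computing a byte offset. The same trimming can be sketched with bash parameter expansion, which avoids the offset arithmetic. The function names kernel_version_short and kernels_match are hypothetical and are shown only to illustrate the comparison.

```shell
#!/bin/bash
# kernel_version_short VERSION
# Prints VERSION with everything from the first ".el" onward removed.
kernel_version_short() {
    printf '%s\n' "${1%%.el*}"
}

# kernels_match ACTIVE EXPECTED
# Succeeds when both versions are equal after trimming.
kernels_match() {
    [ "$(kernel_version_short "$1")" = "$(kernel_version_short "$2")" ]
}

# Example:
#   kernel_version_short 2.6.39-400.284.1.el6uek.x86_64
```

A version string that contains no ".el" is returned unchanged, so the comparison then falls back to the full strings.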

Verify Storage Server user "CELLDIAG" exists

Priority Alert Date Owner Status

Level
Critical FAIL 10/26/16 <Name> Production

DB DB Role Engineered System Platform Exadata OS &

Version Version Version
N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, 12.1.2.2.0+ Linux x86-64
X6-8

Benefit / Impact:

Beginning with Exadata Storage Server Software version 12.1.2.2.0, the storage server user "CELLDIAG" is created during
deployment which allows access to diagnostics without using a more privileged user. The benefit of creating and using the
"CELLDIAG" user is improved security. The impact of verifying that the "CELLDIAG" user is created is minimal, as is the impact of
creating the user if it does not exist.

Risk:

Not creating and using the storage server user "CELLDIAG" fails to utilize a security improvement.

Action / Repair:

To verify the storage server user "CELLDIAG" exists, as the "root" userid on each storage server, execute the following command set:

#!/bin/bash
# A separate variable name avoids overwriting the shell's own $USER
CELLDIAG_USER=$(cellcli -e list user where name = 'CELLDIAG')
RET=$?
if [ $RET -eq 0 -a -n "$CELLDIAG_USER" ];
then
    echo "SUCCESS: CELLDIAG user exists"
else
    echo "FAILURE: CELLDIAG user does not exist"
fi

The expected output should be similar to:

SUCCESS: CELLDIAG user exists

Example of a "FAILURE" message (there is no output from the command--the absence of the CELLDIAG output is the failure
condition):

FAILURE: CELLDIAG user does not exist

If a "FAILURE: ..." message appears, create the user and role on each cell in cellcli using commands like these:

create user CELLDIAG password="SomeGood42Password";

create role celldiagrole;
grant privilege create on diagpack to role celldiagrole;
grant privilege list on diagpack to role celldiagrole;
grant privilege download on diagpack to role celldiagrole;
grant role celldiagrole to user CELLDIAG;

NOTE: The "CELLDIAG" user is created during the Exadata Storage Server Software version 12.1.2.2.0 or higher
deployment process. It is not created during an upgrade from an older release.

NOTE: the user detail for a properly configured "CELLDIAG" userid should look like:

CellCLI> list user CELLDIAG detail

name: CELLDIAG
roles: role=celldiagrole
Privileges:
object=diagpack, verb=create, attributes=all attributes, options=all optio
object=diagpack, verb=download, attributes=all attributes, options=all opt
object=diagpack, verb=list, attributes=all attributes, options=all options


NOTE: Creation of the "CELLDIAG" storage server user is not mandatory. The automatic diagnostic gathering
process continues to function without it and the packaged diagnostics are accessed using one of the other storage
server users.
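To check several storage servers in one pass, the same cellcli query can be sent with dcli and the combined output compared with the list of cells. This is a sketch under assumptions: the function name cells_missing_celldiag and the cell_group file name are hypothetical, and it assumes dcli prefixes every output line with "<cell name>:" and prints nothing for a cell whose query returns no rows.

```shell
#!/bin/bash
# cells_missing_celldiag GROUP_FILE
# Reads dcli output on stdin and prints each cell listed in GROUP_FILE
# (one name per line) for which no CELLDIAG line was returned.
cells_missing_celldiag() {
    local group_file=$1 dcli_output cell
    dcli_output=$(cat)
    while read -r cell; do
        [ -n "$cell" ] || continue
        printf '%s\n' "$dcli_output" | grep -q "^${cell}:.*CELLDIAG" || echo "$cell"
    done < "$group_file"
}

# Example (cell_group is a placeholder for a file listing the cells):
#   dcli -g cell_group -l root "cellcli -e list user where name = 'CELLDIAG'" \
#       | cells_missing_celldiag cell_group
```

Any cell name printed by the function is a candidate for the create user and grant commands shown above.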

Verify installed rpm(s) kernel type match the active kernel version

Priority: Critical
Alert Level: WARN
Date: 11/28/18
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Engineered System Platform: ALL
Bug(s): 28740049 - exachk, 26396389 - exachk

DB/GI Version: N/A
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 18.4.0
MAA Scorecard Section: N/A

Benefit / Impact:

Verifying installed rpm(s) kernel type match the active kernel version helps avoid update failures due to dependency conflicts
between older rpm versions and newer versions being installed. The impact of verifying that installed rpm(s) kernel type match
the active kernel version is minimal. The impact of correction depends upon why the mismatched rpm(s) was/were installed and
cannot be estimated here.

Risk:

If installed rpm(s) kernel type do not match the active kernel, there may be update interruptions caused by dependency conflicts
between older rpm versions and newer versions being installed.

Action/Repair:

To verify the installed rpm(s) kernel type match the active kernel version, execute the following code as the "root" userid on
each database server:

unset ERROR_MESSAGE
UNAME_DATA=$(uname -r)
START=$(echo "$UNAME_DATA" | awk 'END{print index($0,"el")}')
END=$(expr $START + 2)
KERNEL_TYPE=$(echo "$UNAME_DATA" | cut -c$START-$END)
case "$KERNEL_TYPE" in
el7)
    MISMATCHED_RPMS=$(rpm -aq | egrep "\.el5|\.el6")
    ;;
el6)
    # egrep is required here: plain grep treats "|" as a literal character
    MISMATCHED_RPMS=$(rpm -aq | egrep "\.el5|\.el7")
    ;;
el5)
    MISMATCHED_RPMS=$(rpm -aq | egrep "\.el6|\.el7")
    ;;
*)
    ERROR_MESSAGE=$(echo "Unrecognized kernel type: $KERNEL_TYPE")
    ;;
esac
if [ -n "$ERROR_MESSAGE" ]
then
echo -e "\nFAILURE: $ERROR_MESSAGE"
else
if [ -n "$MISMATCHED_RPMS" ]
then
MISMATCH_COUNT=$(echo "$MISMATCHED_RPMS" | wc -l)
else MISMATCH_COUNT=0
fi
if [ -z "$MISMATCHED_RPMS" ]
then
echo -e "\nSUCCESS: There were no mismatched rpms found.\n\nKernel type:\t\t$KERNEL_TYPE\nMismatch count:\t\
else
echo -e "\nFAILURE: One or more mismatched rpms were found.\n\nKernel type:\t\t$KERNEL_TYPE\nMismatch count:
fi
fi

The expected output should be similar to:

SUCCESS: There were no mismatched rpms found.

Kernel type: el6

Mismatch count: 0

Examples of "FAILURE" results:

FAILURE: One or more mismatched rpms were found.

Kernel type: el5

Mismatch count: 37
Mismatched rpms:
gdb-7.2-83.el6.x86_64
basesystem-10.0-4.0.1.el6.noarch
strace-4.8-10.el6.x86_64
<output truncated>

FAILURE: Unrecognized kernel type: 25.el


If the output is not "SUCCESS", investigate for root cause and take corrective action based on root cause
findings.
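The kernel type extraction and the rpm filter can also be sketched with a regular expression match, which looks specifically for ".el" followed by digits rather than the first "el" substring. This is an illustrative variant, not the exachk code: the function names kernel_type and mismatched_rpms are hypothetical, and the filter generalizes the el5/el6/el7 cases to any ".elN" tag.

```shell
#!/bin/bash
# kernel_type UNAME_R
# Prints the elN tag (for example el6) found in a `uname -r` string.
kernel_type() {
    if [[ $1 =~ \.(el[0-9]+) ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# mismatched_rpms KERNEL_TYPE
# Reads rpm names on stdin and prints those carrying an .elN tag that
# differs from KERNEL_TYPE.
mismatched_rpms() {
    grep -E '\.el[0-9]+' | { grep -Ev "\.$1([^0-9]|\$)" || true; }
}

# Example (as root on a database server):
#   rpm -aq | mismatched_rpms "$(kernel_type "$(uname -r)")"
```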

Verify Flex ASM Cardinality is set to "ALL"

Priority Alert Date Owner Status Engineered Bug(s)

Level System

Critical FAIL 11/23/16 <Name> production Exadata - Physical, -

Exadata - User exachk
Domain

DB DB Role Engineered System Platform Exadata OS & Validation Tool TBD

Version Version Version Version

12.2.0.1+ ASM X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, 11.2.2.2.0+ Linux x86- exachk 12.2.0.1.2
X5-8, X6-2, X6-8 64

Benefit / Impact:

By default, Flex ASM cardinality is set to 3. The impact of verifying that Flex ASM Cardinality is set to "ALL" is minimal. The
impact of setting the Flex ASM cardinality to "ALL" from a lower value is minimal and can be done online; ASM will bring up the
additional instances required to fulfill the cardinality setting.

Risk:

Not having Flex ASM cardinality set to "ALL" could result in a higher number of client (DB) connections on some ASM instances
and may result in longer client reconnection times should an ASM instance crash.

Action / Repair:

To verify Flex ASM Cardinality is set to "ALL", as the Oracle home owner userid with the environment properly set, execute the
following command set on one database server in the cluster where an ASM instance is executing:

RAW_DATA=$($ORACLE_HOME/bin/srvctl config asm -detail)

FLEX_MODE=$($ORACLE_HOME/bin/asmcmd showclustermode | cut -d" " -f6)
if [ "$FLEX_MODE" = "disabled" ]
then
echo -e "INFO: ASM is not in Flex mode: $FLEX_MODE, check not executed."
else
CARDINALITY=$(echo "$RAW_DATA" | grep count | cut -d" " -f4)
if [ "$CARDINALITY" = "ALL" ];
then
echo -e "SUCCESS: Flex ASM cardinality is set to: $CARDINALITY."
else
echo -e "FAILURE: Flex ASM cardinality is set to: $CARDINALITY.\n\n$RAW_DATA"
fi
fi

The expected output should be:

SUCCESS: Flex ASM cardinality is set to: ALL.

-- OR --

INFO: ASM is not in Flex mode: disabled, check not executed.

Example of a "FAILURE" message:

FAILURE: Flex ASM cardinality is set to: 3.

ASM home: <CRS home>

Password file: +DBFS_DG/orapwASM
Backup of Password file:
ASM listener: LISTENER
ASM is enabled.
ASM is individually enabled on nodes:
ASM is individually disabled on nodes:
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM

If a "FAILURE: ..." message appears, adjust the Flex ASM cardinality to "ALL" using the following command:

srvctl modify asm -count ALL

After making the change to ASM cardinality, verify that each node has an ASM instance running using the following command:

$ srvctl status asm -detail | grep "is running"

ASM is running on exadb06,exadb05,exadb08,exadb07,exadb02,exadb01,exadb04,exadb03
ASM instance +ASM2 is running on node exadb02
ASM instance +ASM1 is running on node exadb01
ASM instance +ASM4 is running on node exadb04
ASM instance +ASM3 is running on node exadb03
ASM instance +ASM5 is running on node exadb05
ASM instance +ASM6 is running on node exadb06
ASM instance +ASM7 is running on node exadb07
ASM instance +ASM8 is running on node exadb08
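The cardinality value can be pulled out of the "srvctl config asm -detail" output by splitting the "ASM instance count" line on its colon rather than counting space-separated fields. A minimal sketch follows; the function name asm_instance_count is hypothetical.

```shell
#!/bin/bash
# asm_instance_count
# Reads `srvctl config asm -detail` output on stdin and prints the value
# of the "ASM instance count" line (for example 3 or ALL).
asm_instance_count() {
    awk -F': *' '/^ASM instance count/ { print $2; exit }'
}

# Example (as the Oracle home owner with the environment set):
#   $ORACLE_HOME/bin/srvctl config asm -detail | asm_instance_count
```

After "srvctl modify asm -count ALL" has been run, the function is expected to print ALL.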

Verify "downdelay" is correctly set for bonded client interfaces

Critical FAIL 02/08/17 <Name> Production

DB DB Role Engineered System Platform Exadata OS & Va

Version Version Version

N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, 12.1.2.2.0+ Linux x86-64
X6-8

Benefit / Impact:

When using the default "downdelay" settings, an undesired VIP failover or brownout may be seen depending upon the timing of
a single client network interface failure. To avoid this possibility, the "downdelay" parameter of the client network interface
should be set to 2000 when using active-backup mode bonding and to 200 when using LACP mode bonding.
The impact of verifying "downdelay" attributes for bonded client interfaces is minimal. The recommended corrective action
includes a reboot.

Risk:

Not verifying "downdelay" attributes for bonded client interfaces increases the risk of unwanted VIP failover or brownouts in the
event of a single client network interface failure.

Action/Repair:

To verify "downdelay" attributes for bonded client interfaces, as the root userid execute the script below on each database
server:

#!/bin/bash

#############################################################################################
#                                                                                           #
# Purpose: Check downdelay is set appropriately for the bonded interfaces                  #
#                                                                                           #
#############################################################################################

## Variable declarations

exit_code=0

## Function Definitions

usage()
{
    echo "Usage: check_downdelay.sh [-o check|report] [-h]";
}

check_downdelay()
{
    downDelayActiveBackup=2000
    downDelayLACP=200

    while read bonintf
    do
        bondingType=$(grep "^Bonding Mode:" $bonintf|awk -F ":" '{print $2}')
        downdelaySet=$(grep "^Down Delay (ms):" $bonintf|awk '{print $4}')

        if [ "${bondingType}" == " fault-tolerance (active-backup)" ]
        then
            if [ $downdelaySet -ne $downDelayActiveBackup ]
            then
                exit_code=1
                # The interface name is the last component of the /proc path
                downdelayFailMsgTmp="Down delay not set to 2000 for the active-backup bonded interface $(echo $bonintf|awk -F"/" '{print $NF}')"
                downdelayFailMsg=$(printf "$downdelayFailMsgTmp\n$downdelayFailMsg")
            fi
        # Mode string reported by the Linux bonding driver for LACP
        elif [ "${bondingType}" == " IEEE 802.3ad Dynamic link aggregation" ]
        then
            if [ $downdelaySet -ne $downDelayLACP ]
            then
                exit_code=1
                downdelayFailMsgTmp="Down delay not set to 200 for the LACP bonded interface $(echo $bonintf|awk -F"/" '{print $NF}')"
                downdelayFailMsg=$(printf "$downdelayFailMsgTmp\n$downdelayFailMsg")
            fi
        fi

        if [ $exit_code -eq 0 ]
        then
            downdelayPassMsg="Down delay correctly set to correct value(s) for all bonded interfaces"
        fi
    done << EOF
$(ls -1 /proc/net/bonding/bondeth*)
EOF
}

check_main()
{
    check_downdelay
}

print_result()
{
    echo $exit_code
}

print_report()
{
    if [ $exit_code -eq 0 ]
    then
        printf "\n$downdelayPassMsg\n"
    else
        printf "\n$downdelayFailMsg\n"
    fi
}

NumArgs=$#

if [ $NumArgs -lt 1 ]
then
    echo "Invalid or missing command line arguments..."
    usage;
    exit 1
fi

while getopts "o:h" opt;
do
    case "${opt}" in
        h) usage;
           exit 0
           ;;
        o)
           swch=${OPTARG};
           ;;
        *) echo "Invalid or missing command line arguments..."
           usage;
           exit 1
           ;;
    esac
done

if [ "$swch" == "check" ]
then
    check_main;
    print_result;
elif [ "$swch" == "report" ]
then
    check_main;
    print_report;
else
    echo "Invalid or missing command line arguments..."
    usage;
    exit 1
fi

The expected output should be:

Down delay correctly set to correct value(s) for all bonded interfaces

Example of a failure:

Down delay not set to 2000 for the active-backup bonded interface bondeth0

If failures are reported, as the root userid on the database server which has the failure, execute the following command followed
by a reboot:

For active-backup mode - sed -i 's/downdelay=<existing value>/downdelay=2000/' /etc/sysconfig/network-scripts/ifc

For LACP mode - sed -i 's/downdelay=<existing value>/downdelay=200/' /etc/sysconfig/network-scripts/ifcfg-<client

NOTE: It is possible to temporarily set the value in the active kernel as the root userid using this command:

echo 2000 > /sys/class/net/<client network interface name>/bonding/downdelay - For active-backup bonding
echo 200 > /sys/class/net/<client network interface name>/bonding/downdelay - For LACP bonding

However, this will not survive a reboot. The "sed" command followed by a reboot is the preferred method.
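
The "sed" commands above require the existing value to be typed in. A minimal sketch of a variant that matches any numeric value follows. The helper name set_downdelay and the sample BONDING_OPTS line are illustrative assumptions, not part of the Oracle-supplied procedure, and the demonstration edits a scratch file rather than a live ifcfg file.

```shell
# Hypothetical helper (not an Oracle-supplied tool): replace whatever numeric
# downdelay value is present in a bonding configuration file.
set_downdelay() {
    cfg_file=$1     # file containing a downdelay=<n> setting
    new_value=$2    # 2000 for active-backup, 200 for LACP
    # Keep a backup copy first, then rewrite the value in place (GNU sed).
    cp -p "$cfg_file" "$cfg_file.bak" &&
        sed -i "s/downdelay=[0-9][0-9]*/downdelay=${new_value}/" "$cfg_file"
}

# Demonstration on a scratch file; the BONDING_OPTS line is an assumed sample.
demo_cfg=$(mktemp)
echo 'BONDING_OPTS="mode=active-backup miimon=100 downdelay=5000 updelay=5000"' > "$demo_cfg"
set_downdelay "$demo_cfg" 2000
grep -o 'downdelay=[0-9]*' "$demo_cfg"
```

As with the documented command, a change made this way to the real interface file takes effect only after the reboot described above.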

Verify ExaWatcher is executing

Priority Alert Level Date Owner Status Engine
Critical FAIL 02/15/17 <Name> Production Exada Exa
DB Version DB Role Engineered System Platform Exadata Version OS & Version Vali
11.2.0.2+ N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, SL6 11.2.3.3.0+ Linux x86-64, Sparc Linux exac

Benefit / Impact:

ExaWatcher collects data on key metrics for both database and storage servers, which can be used for both troubleshooting and
performance analysis. There is minimal impact to verify that ExaWatcher is executing, or from starting ExaWatcher if it is not
executing.

Risk:

If ExaWatcher is not executing, valuable data for analysis is not collected.

Action / Repair:

To verify that ExaWatcher is executing, as the "root" userid execute the following command set on each database and storage
server in the cluster:

NUM_OF_EXAWATCHERS=$(ps -ef | grep -i exawatcher | grep -v grep | wc -l)

if [[ $NUM_OF_EXAWATCHERS -gt 0 ]]
then
echo -e "SUCCESS: ExaWatcher is executing. Number of processes: $NUM_OF_EXAWATCHERS"
else
echo -e "FAILURE: ExaWatcher is not executing. Number of processes: $NUM_OF_EXAWATCHERS"
fi

The output should be similar to:

SUCCESS: ExaWatcher is executing. Number of processes: 15

NOTE: The number of processes may vary depending upon the site-specific configuration.

If ExaWatcher is not executing, please refer to the "System Diagnostics Data Gathering with sosreports and Oracle ExaWatcher"
section of the "Oracle® Exadata Storage Server Software User's Guide" that is for your specific installed version of Oracle
Exadata Storage Server software.

Verify non-Default services are created for all Pluggable Databases

Priority Alert Level Date Owner Status Engineered System Engineered System Platform Bu
Critical WARN 02/15/17 Frank Kobylanski Production Exadata - Physical, Exadata - Management Domain, Exadata - User Domain ALL
DB Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version M
12.1.0.1+ CDB Primary Open ALL ALL exachk 12.2.0.1.3

Benefit / Impact:

Oracle recommends that non-default services should be created for application and end user access to pluggable databases
(PDBs). This provides access control along with automated opening of the PDB as part of container database (CDB) startup.

Risk:

PDBs may not open automatically at instance startup and applications and users may have access to PDBs through default
services at inappropriate times.

Action / Repair:

Note that only PDBs that are open and not in MIGRATE/UPGRADE mode will be checked. Since a PDB may not be open on all
instances the following script should be executed on each instance of each CDB.

To verify that all PDBs in a CDB have at least one non-default service created for them, as the CDB ownerid on each database
server:

1. Set your environment for a CDB

2. Run the script below

Repeat steps 1 and 2 for each CDB running on the database server, then move onto the next database server.

unset PDB_SERVICES;
PDB_SERVICES=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set head off lines 80 feedback off timing off serveroutput on
select name from v\$pdbs p
where p.name not in ('PDB\$SEED','CDB\$ROOT')
and p.open_mode not in ('MOUNTED','MIGRATE')
and p.name not in (select s.pdb from containers(service\$) s
where bitand(s.flags,128) != 128
and deletion_date is null
and s.name != ('SYS.SCHEDULER\$_EVENT_QUEUE')
and s.name not like ('SYS\$%'));
exit
EOF
);
if [ `echo $PDB_SERVICES| grep ORA- | wc -w` = 0 ]
then
if [ `echo $PDB_SERVICES| wc -w` = 0 ]
then
echo -e SUCCESS: all open PDBs have non-default services defined or there are no open PDBs;
else
echo -e WARNING: the following open PDBs do not have non-default services defined: $PDB_SERVICES;
fi;
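else
# Assumed handling: the original else branch is not present in this copy
echo -e "FAILURE: SQL*Plus returned an error: $PDB_SERVICES";
fi;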

If all PDBs that can be checked have non-default services defined, the following will be returned:

SUCCESS: all open PDBs have non-default services defined or there are no open PDBs

If there are PDBs found that do not have non-default services defined for them, a message similar to the following will be
returned.

WARNING: the following open PDBs do not have non-default services defined: TESTPDB4 TESTPDB2 TESTPDB3 TESTPDB5
TESTPDB1

To resolve the warning, create services for these PDBs using either:

1. srvctl in Grid Infrastructure or Oracle Restart based environments

2. The DBMS_SERVICE.CREATE_SERVICE procedure in environments where srvctl is not available.
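
For the srvctl route, the sketch below builds the command line for one PDB without running it, so it can be reviewed first. The helper name print_add_service_cmd and all database, PDB, service, and instance names are hypothetical examples; the long option names shown are the 12c srvctl form.

```shell
# Hypothetical dry-run helper: prints, but does not run, the srvctl command
# that would add one application service for one PDB.
print_add_service_cmd() {
    db_unique_name=$1   # CDB name registered with Grid Infrastructure
    pdb_name=$2         # PDB the service belongs to
    service_name=$3     # new non-default service name
    instances=$4        # comma-separated list of preferred instances
    echo "srvctl add service -db ${db_unique_name} -service ${service_name} -pdb ${pdb_name} -preferred ${instances}"
}

# Example values only; substitute site-specific names.
print_add_service_cmd cdb1 TESTPDB1 testpdb1_app cdb11,cdb12
```

A newly added service is not running until it is started, for example with srvctl start service for the same database and service name.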

Verify Automatic Storage Management Cluster File System (ACFS) file systems do not contain critical database
files

Priority Alert Level Date Owner Status Engineered System Engineered System Platform Bug(s)
Critical FAIL 08/14/19 Irfan Alvi Production Exadata - Physical, Exadata - User Domain ALL 29411526 - exachk, 26268345 - exachk, 26143661 - OEDA
DB/GI Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version MAA Scorecard Section
12.1.0.2.0 or higher ASM N/A N/A ALL Linux X86-64 exachk 19.3.0 N/A

Benefit / Impact:

ACFS disk groups created on Exadata should not contain any critical database files to isolate operational maintenance and
configuration changes.

The impact of verifying (ACFS) file systems do not contain critical database files is minimal and can be done online.

The impact of moving critical database files out of ACFS disk groups varies by the type of file involved, and cannot be estimated
here.

NOTE: For more information on ACFS use cases and recommended disk group attributes on Exadata, please
see: Oracle ACFS Support on Oracle Exadata Database Machine (Linux only) (Doc ID 1929629.1)

Risk:

Any ACFS maintenance or configuration change could potentially impact the availability of database files residing on the same
disk group as ACFS.

Action / Repair:

To verify (ACFS) file systems do not contain critical database files, run exachk and review the provided report.

The expected output in the exachk report should be as follows:

In the "View" detail section of the report for this check the expected output should be similar to:

Example of a "FAILURE" message: Output in the exachk report

In the "View" detail section of the report for this check a "FAILURE" example will be similar to:

If a "FAILURE: ..." message appears, either relocate ACFS to a new dedicated disk group following How to Relocate an ACFS
Filesystem to Another Diskgroup in Exadata (Doc ID 2133396.1) MOS note or move the database files out of the ACFS disk
group.

NOTE: If after corrective actions are completed, you wish to run just this check manually without a full exachk run,
as the "root" userid in the directory where exachk was installed, execute the following:

Verify the ownership and permissions of the "oradism" file

Priority Alert Level Date Owner Status Engineered System Engineered System Platform
Critical FAIL 07/12/17 <Name> Production Exadata - Physical, Exadata - User Domain ALL
DB/GI Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version
11.2+ N/A N/A N/A ALL Linux exachk 12.2.0.1.4

Benefit / Impact:

Maintaining the correct ownership and permissions of the "oradism" file is essential for the proper operation of Direct NFS and
achieving the highest possible throughput. The file should be owned by the "root" userid and have the setuid bit enabled in the
permissions mask. The impact of validating file ownership and permission is minimal. Changing the file ownership and
permissions requires a restart of the Oracle stack running out of the adjusted $ORACLE_HOME.

Risk:

If the ownership and permissions of the "oradism" file are not correct, the performance of Direct NFS will be severely impacted.

Action / Repair:

To verify the ownership and permissions of the "oradism" file, as the appropriate oracle home owner userid on each database
server, execute the following command set on each $ORACLE_HOME:

OWNER_USERID=$(ls -l $ORACLE_HOME/bin/oradism |awk '{print $3}')

SETUID_BIT=$(ls -l $ORACLE_HOME/bin/oradism | cut -c4)
DETAIL=$(echo -e "owner userid:\t$OWNER_USERID\nsetuid bit:\t$SETUID_BIT")
if [[ $OWNER_USERID = "root" && $SETUID_BIT = "s" ]]
then
echo -e "SUCCESS: \"oradism\" file is correctly configured:\n$DETAIL"
else
echo -e "FAILURE: \"oradism\" file is not correctly configured:\n$DETAIL"
fi

The output should be similar to:

SUCCESS: "oradism" file is correctly configured:

owner userid: root
setuid bit: s

Examples of "FAILURE" results:

FAILURE: "oradism" file is not correctly configured:

owner userid: root
setuid bit: x

FAILURE: "oradism" file is not correctly configured:

owner userid: oracle
setuid bit: x

If the output is a "FAILURE" result, investigate and take corrective action.
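
The check above reads the fourth character of the "ls -l" output. As a hedged alternative, the sketch below uses the shell's own setuid test ([ -u file ]) together with "stat" from GNU coreutils. The helper name check_setuid_owner is hypothetical, and the demonstration runs against a scratch file owned by the current user, not against "oradism".

```shell
# Hypothetical helper: report whether a file has the expected owner and the
# setuid bit set. Relies on GNU stat for the owner name and mode string.
check_setuid_owner() {
    target=$1
    expected_owner=$2
    actual_owner=$(stat -c '%U' "$target") || return 2
    if [ "$actual_owner" = "$expected_owner" ] && [ -u "$target" ]; then
        echo "SUCCESS: $target is owned by $actual_owner and has the setuid bit"
    else
        echo "FAILURE: $target owner=$actual_owner mode=$(stat -c '%A' "$target")"
        return 1
    fi
}

# Demonstration on a scratch file; the mode is chosen only for the demo.
demo_file=$(mktemp)
chmod 4750 "$demo_file"
check_setuid_owner "$demo_file" "$(stat -c '%U' "$demo_file")"
```

On a database server the equivalent call would be check_setuid_owner "$ORACLE_HOME/bin/oradism" root.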

Verify the SYSTEM, SYSAUX, USERS and TEMP tablespaces are of type bigfile

Benefit / Impact:

Configuring the SYSTEM, SYSAUX, USERS, and TEMP tablespaces to be of type bigfile simplifies maintenance and operations
which involve these tablespaces. The impact of verifying the SYSTEM, SYSAUX, USERS, and TEMP tablespaces are of type bigfile
is minimal.

Risk:

If the SYSTEM, SYSAUX, USERS, and TEMP tablespaces are not of type bigfile, maintenance operations are more complicated
and a tablespace is more likely to run out of free space.

Action / Repair:

To verify the SYSTEM, SYSAUX, USERS, and TEMP tablespaces are of type bigfile, as the ORACLE_HOME owner userid on one
database server in the cluster, execute the following command set once for each database running out of a given
ORACLE_HOME, with the environment properly configured to access each given database:

BIGFILE_DATA=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF

set newpage none lines 80 feedback off timing off serveroutput on
SELECT tablespace_name, bigfile FROM dba_tablespaces
WHERE tablespace_name in ('SYSTEM', 'SYSAUX', 'USERS', 'TEMP');
exit
EOF
)
if [ `echo "$BIGFILE_DATA" | grep -ic "NO"` -gt 0 ]
then
echo -e "FAILURE: One or more of SYSTEM, SYSAUX, USERS, TEMP tablespaces are not of type bigfile:\n\n$BIGFIL
else
echo -e "SUCCESS: SYSTEM, SYSAUX, USERS, TEMP tablespaces are of type bigfile:\n\n$BIGFILE_DATA"
fi

The output should be similar to:

SUCCESS: SYSTEM, SYSAUX, USERS, TEMP tablespaces are of type bigfile:

TABLESPACE_NAME BIG
------------------------------ ---
SYSTEM YES
SYSAUX YES
TEMP YES
USERS YES

Example of a "FAILURE" result:

FAILURE: One or more of SYSTEM, SYSAUX, USERS, TEMP tablespaces are not of type bigfile:

TABLESPACE_NAME BIG
------------------------------ ---
SYSTEM NO
SYSAUX NO
TEMP NO
USERS NO

If the output is a "FAILURE" result, investigate and take corrective action.
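
When investigating a "FAILURE" result it helps to know exactly which tablespaces are affected. The sketch below filters query output of the shape shown above and prints only the names whose BIGFILE column is NO. The helper name list_smallfile_tablespaces is hypothetical, and the sample rows are adapted from the example output rather than taken from a real database.

```shell
# Hypothetical helper: read "TABLESPACE_NAME BIGFILE" rows on stdin and print
# the names whose second column is NO. Header and separator rows do not match.
list_smallfile_tablespaces() {
    awk 'toupper($2) == "NO" { print $1 }'
}

# Sample rows adapted from the FAILURE example, with one row changed to YES.
printf '%s\n' 'TABLESPACE_NAME BIG' 'SYSTEM NO' 'SYSAUX YES' 'TEMP NO' 'USERS NO' |
    list_smallfile_tablespaces
```

In the check script the same filter could be fed with echo "$BIGFILE_DATA".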

Verify the storage servers in use configuration matches across the cluster

Priority Alert Level Date Owner Status Engineered System Engineered System Platform Bug(s)
Critical FAIL 12/19/18 <Name> Production Exadata - Physical, Exadata - User Domain ALL Bug 29061438 - exachk, Bug 27541151 - exachk, Bug 26365216 - exachk
DB/GI Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version MAA Scorecard Section
N/A N/A N/A N/A ALL Linux exachk 18.5.0 N/A

Benefit / Impact:

Verifying the storage servers in use configuration matches across the cluster can prevent potential issues ranging from impaired
performance to a node eviction.

The impact of verifying the storage servers in use configuration matches across the cluster is minimal. The impact of making corrections
varies depending upon the root cause of the difference.

Risk:

If the storage servers in use configuration does not match across the cluster, there is risk of impaired performance, node
eviction, and perhaps data loss with multiple hardware failures over time.

Action / Repair:

NOTE: This check will only pass if the following are both true:
1) For each database server, the md5sum for the cellip.ora file matches the md5sum from the list of storage
servers accessed by kfod.
2) The md5sum from 1) matches across the cluster.

To verify the storage servers in use configuration matches across the cluster, run exachk and review the provided report.

The expected output in the exachk report should be as follows:

In the "Cluster Wide" section of the report, the overall result should be "PASS":

PASS Cluster Wide Check The storage servers in use configuration matches across the cluster Cluster Wide

In the "View" detail section of the report for this check the expected output should be similar to:

SUCCESS: The storage servers in use configuration matches:

-
DBSRVR: <Host Name>
DBSRVR_CELLIP_MD5SUM: d2144e88f4249a5d267691b85ed2ae49
DBSRVR_KFOD_MD5SUM: d2144e88f4249a5d267691b85ed2ae49
DBSRVR_BASE_MD5SUM: d2144e88f4249a5d267691b85ed2ae49
-
DBSRVR: <Host Name>
DBSRVR_CELLIP_MD5SUM: d2144e88f4249a5d267691b85ed2ae49
DBSRVR_KFOD_MD5SUM: d2144e88f4249a5d267691b85ed2ae49
DBSRVR_BASE_MD5SUM: d2144e88f4249a5d267691b85ed2ae49

A "FAILURE" example:

In the "Cluster Wide" section of the report, the overall result will be "FAIL":

FAIL Cluster Wide Check The storage servers in use configuration should match across the cluster Cluster W

In the "View" detail section of the report for this check the expected output should be similar to:

FAILURE: The storage servers in use configuration does not match:

-
DBSRVR: randomadm01vm01
DBSRVR_CELLIP_MD5SUM: acd6ad6d153ea1ec1ecf9a5aa19cf4a7
DBSRVR_KFOD_MD5SUM: d41d8cd98f00b204e9800998ecf8427e
DBSRVR_BASE_MD5SUM: acd6ad6d153ea1ec1ecf9a5aa19cf4a7
-
DBSRVR: randomadm02vm01
DBSRVR_CELLIP_MD5SUM: acd6ad6d153ea1ec1ecf9a5aa19cf4a7
DBSRVR_KFOD_MD5SUM: d41d8cd98f00b204e9800998ecf8427e
DBSRVR_BASE_MD5SUM: acd6ad6d153ea1ec1ecf9a5aa19cf4a7

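NOTE: In the "FAILURE:" example, the md5sum for the results reported from kfod on the running system does
not match the cellip.ora md5sum. The kfod value shown, d41d8cd98f00b204e9800998ecf8427e, is the MD5 checksum
of empty input, which indicates that no storage server list was obtained from kfod.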

If the result is not as expected, investigate for root cause and take appropriate corrective action.
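
The comparison described in the NOTE above can be illustrated with plain md5sum. This is a sketch only: how exachk builds the two lists is not shown in this document, so the helper list_md5, the sorting step, and the sample addresses are all assumptions. It also shows that an empty list produces the checksum seen for DBSRVR_KFOD_MD5SUM in the "FAILURE" example.

```shell
# Hypothetical helper: print the MD5 checksum of a list after sorting it, so
# that two lists with the same entries in a different order compare equal.
list_md5() {
    sort "$1" | md5sum | awk '{print $1}'
}

# Scratch files standing in for the two lists; the addresses are made up.
configured=$(mktemp); discovered=$(mktemp); empty=$(mktemp)
printf '%s\n' 192.0.2.11 192.0.2.12 192.0.2.13 > "$configured"
printf '%s\n' 192.0.2.13 192.0.2.11 192.0.2.12 > "$discovered"

[ "$(list_md5 "$configured")" = "$(list_md5 "$discovered")" ] && echo "lists match"
[ "$(list_md5 "$configured")" = "$(list_md5 "$empty")" ] || echo "lists differ"
list_md5 "$empty"    # prints d41d8cd98f00b204e9800998ecf8427e
```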

NOTE: If after corrective actions are completed, you wish to run this one check without a full exachk run execute
the following command as the "root" userid in the directory in which exachk was installed:
./exachk -check 5D6AC87BF4669BF2E053D498EB0AFC19,5D691B1A8146F67CE053D398EB0A8822

Verify "asm_power_limit" is greater than zero

Priority Alert Level Date Owner Status Engineered System Engineered System Platform
Critical CRITICAL 07/26/17 <Name> Production Exadata - Physical, Exadata - User Domain ALL
DB/GI Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version
11.2+ ASM N/A N/A ALL Linux exachk 12.2.0.1.4

Benefit / Impact:

Setting "asm_power_limit=0" disables rebalance operations. Verifying that "asm_power_limit" is greater than zero confirms that
rebalance operations are enabled. The impact of verifying that "asm_power_limit" is greater than zero is minimal, as is the
impact of setting it to a value greater than zero.

NOTE: Changing the default value via the initialization parameter "asm_power_limit" is not the same as changing
the power for an actively running rebalance operation.

Risk:

"asm_power_limit=0" disables rebalance operations, which can lead to data loss in the event of multiple hardware failures over
time.

Action / Repair:

To verify "asm_power_limit" is greater than zero, as the grid home owner userid, execute the following command set once for
each ASM instance with the environment properly configured to access that given instance:

NOTE: This code will not execute properly if executed on a database server in a flex ASM environment where an
ASM instance is not running.

ASMPL_PARAM_DATA=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF

set newpage none heading off lines 80 feedback off timing off serveroutput on
select value from v\$parameter where name = 'asm_power_limit';
exit
EOF
)
ASMPL_QUEUE_DATA=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set newpage none heading off lines 80 feedback off timing off serveroutput on
select count(*) from gv\$asm_operation where power=0 or actual=0;
exit
EOF
)
if [[ $ASMPL_PARAM_DATA -gt 0 && $ASMPL_QUEUE_DATA -eq 0 ]]
then
echo -e "SUCCESS: \"asm_power_limit\" is set to $ASMPL_PARAM_DATA and there are no rebalance operations in gv\$
else
echo -e "FAILURE:"
if [ $ASMPL_PARAM_DATA -eq 0 ]
then
echo -e "The initialization parameter \"asm_power_limit\" is set to zero"
fi
if [ $ASMPL_QUEUE_DATA -gt 0 ]
then
echo -e "There are rebalance operation(s) in gv\$asm_operation with the attribute POWER or ACTUAL = 0"
fi
fi;

The output should be similar to:

SUCCESS: "asm_power_limit" is set to 32 and there are no rebalance operations in gv$asm_operation with the attrib

Examples of "FAILURE" results:

FAILURE:
The initialization parameter "asm_power_limit" is set to zero
There are rebalance operation(s) in gv$asm_operation with the attribute POWER or ACTUAL = 0

FAILURE:
The initialization parameter "asm_power_limit" is set to zero

FAILURE:
There are rebalance operation(s) in gv$asm_operation with the attribute POWER or ACTUAL = 0

If the output is a "FAILURE" result, investigate and take corrective action.
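
The numeric tests in the script above assume that SQL*Plus returned a number. If it returns an error instead (for example when no ASM instance is running on the node), the comparison itself fails. The sketch below shows one way to validate the value first. The helper name classify_power_limit and its messages are hypothetical and are not part of the Oracle-supplied check.

```shell
# Hypothetical helper: classify the text returned for asm_power_limit.
classify_power_limit() {
    # Strip the whitespace and newlines that SQL*Plus places around the value.
    value=$(printf '%s' "$1" | tr -d '[:space:]')
    case "$value" in
        ''|*[!0-9]*)
            # Empty, or contains something other than digits (e.g. an ORA- error)
            echo "UNKNOWN: expected a number, got \"$value\""
            return 0 ;;
    esac
    if [ "$value" -gt 0 ]; then
        echo "SUCCESS: asm_power_limit is $value"
    else
        echo "FAILURE: asm_power_limit is zero"
    fi
}

classify_power_limit " 32"
classify_power_limit "0"
classify_power_limit "ORA-01034: ORACLE not available"
```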

Verify the recommended patches for Adaptive features are installed

Priority Alert Level Date Owner Status Engineered System Engineered System Platform Bug(s)
Critical INFO 06/05/19 <Name> Production Exadata - Physical, Exadata - User Domain Exadata 29849595 - exachk, 26681554 - exachk
DB/GI Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version MAA Scorecard Section
12.1.0.2 only Normal, CDB, PDB Primary, Physical Standby Open ALL Linux exachk 19.3.0 N/A

Benefit / Impact:

Adaptive features are a set of capabilities that enable the optimizer to make run-time adjustments to execution plans and to
adjust plans for future executions based on the results of previous executions. For Oracle version 12.1.0.2 only, to maximize
performance and reliability it is recommended that the default configuration for 12.2.x be used. Installing patches 22652097 and
21171382 configures those defaults.

Risk:

Without patches 22652097 and 21171382 Oracle version 12.1.0.2 may experience poor performance and potential instability.

Action / Repair:

To verify the recommended patches for Adaptive features are installed, as the owner userid of a given Oracle home, and with
the environment set to access that Oracle home on each database server, execute the following code set:

opatch_return_code=$($ORACLE_HOME/OPatch/opatch lsinventory -oh $ORACLE_HOME -local >/dev/null 2>&1;echo $?)

if [ $opatch_return_code -eq 0 ]
then
RAW_LSPATCHES=$($ORACLE_HOME/OPatch/opatch lsinventory -oh $ORACLE_HOME -local -bugs_fixed 2>&1)
else
RAW_LSPATCHES=$(cat $ORACLE_HOME/inventory/ContentsXML/comps.xml);
fi
IS_22652097_PRESENT=$(echo "$RAW_LSPATCHES" | grep -wc 22652097)
IS_21171382_PRESENT=$(echo "$RAW_LSPATCHES" | grep -wc 21171382)
if [[ $IS_22652097_PRESENT -eq 1 && $IS_21171382_PRESENT -eq 1 ]]
then
echo -e "SUCCESS: patches 22652097 and 21171382 are installed in $ORACLE_HOME"
else
echo -e "INFO: patches 22652097 and 21171382 are not installed in $ORACLE_HOME"
fi

The expected output should be:

SUCCESS: patches 22652097 and 21171382 are installed in /u01/app/oracle/product/12.1.0.2/dbhome_1

Example of an "INFO:" result:

INFO: patches 22652097 and 21171382 are not installed in /u01/app/oracle/product/12.1.0.2/dbhome_1

If the output is not as expected, install the recommended patches.
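
The check relies on whole-word matching (grep -w) so that a patch number is not found inside a longer number. The sketch below isolates that behaviour. The helper name has_patch and the sample inventory text are hypothetical; the sample is not real opatch output.

```shell
# Hypothetical helper: succeed when a number appears as a whole word in the
# supplied inventory text.
has_patch() {
    # $1 = inventory text, $2 = patch or bug number
    printf '%s\n' "$1" | grep -qw -- "$2"
}

# Made-up sample: the second entry only contains 21171382 as a prefix.
sample_inventory='Patch 22652097 : applied
Patch 211713820 : applied'

for patch in 22652097 21171382; do
    if has_patch "$sample_inventory" "$patch"; then
        echo "$patch present"
    else
        echo "$patch missing"
    fi
done
```

With the sample text this reports 22652097 as present and 21171382 as missing, because 211713820 is a different number.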

Verify initialization parameter cluster_database_instances is at the default value

Priority Alert Level Date Owner Status Engineered System Engineered System Platform
Critical FAIL 11/08/2017 <Name> Production Exadata - Physical, Exadata - User Domain ALL
GI/DB Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version
< 19.1 ALL ALL OPEN ALL Linux exachk 12.2.0.1.4

Benefit / Impact:

cluster_database_instances should not be changed from the default value for performance and stability. The impact of verifying
initialization parameter cluster_database_instances is at the default value is minimal. The impact of removing a set value should
include a database restart to make sure the change survives database shutdown and startup.

Risk:

If cluster_database_instances is modified from the default, dynamic remastering can be impacted potentially causing poor
performance or stability.

Action / Repair:

To verify cluster_database_instances is at the default value, as the owner of the oracle home for a given database and with the
environment set to access that database, execute the following command set:

unset ISDEFAULT_VALUE
ISDEFAULT_VALUE=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set head off lines 80 feedback off timing off serveroutput on
select upper(isdefault) from v\$parameter where name ='cluster_database_instances';
exit
EOF
)
if [ $ISDEFAULT_VALUE = "TRUE" ]
then
echo -e "SUCCESS: cluster_database_instances is at the default value"
else
echo -e "FAILURE: cluster_database_instances should be at the default value: \"isdefault\" column value = "$
fi;

The expected output should be:

SUCCESS: cluster_database_instances is at the default value

Example of a "FAILURE" result:

FAILURE: cluster_database_instances should be at the default value: "isdefault" column value = FALSE

To correct a failure condition, with the environment properly set to access the target database, reset the
cluster_database_instances initialization parameter using:

SQL> alter system reset cluster_database_instances scope=spfile sid='*';

Restart the instance and verify the change survives startup and shutdown.

Verify the database server NVME device configuration

Priority Alert Level Date Owner Status Engineered System Engineered System P

Critical FAIL 11/29/2017 <Name> Production X7-8 Exadata

DB/GI Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Versio

N/A N/A N/A N/A ALL Linux exachk 12.2.0.1.

Benefit / Impact:

Proper configuration of NVME devices is necessary for reliable and efficient operation of a database server. The impact of
verifying the database server NVME device configuration is minimal. The impact of making any required corrections or
adjustment varies depending upon the root issue, and cannot be estimated here.

Risk:

An improper NVME device configuration could lead to unreliable operation, poor performance, or impact upgrade operations.

Action / Repair:

NOTE: This check will pass on a database server only if the following are both true:
1) There are four NVME devices discovered.
2) Every device has a status of "normal".

To verify the database server NVME device configuration, as the "root" userid, execute the following code set on each database
server:

RAW_OUTPUT=$(dbmcli -e "list physicaldisk attributes name,status")

# Is count correct?
if [ $(echo "$RAW_OUTPUT" | wc -l) -eq 4 ]
then
COUNT_CORRECT=1
else
COUNT_CORRECT=0
fi
# Is the status normal?
if [ $(echo "$RAW_OUTPUT" | awk '{print $2}' | grep -icv normal) -eq 0 ]
then
STATUS_NORMAL=1
else
STATUS_NORMAL=0
fi
# Analyze:
if [[ $(echo $COUNT_CORRECT) -eq 1 && $(echo $STATUS_NORMAL) -eq 1 ]]
then
echo "SUCCESS: The NVME device configuration is correct."
else
echo -e "FAILURE: The NVME device configuration is not correct.\nDetails:\n$RAW_OUTPUT"
fi

The expected output should be:

SUCCESS: The NVME device configuration is correct.

Example of a "FAILURE:" result:

FAILURE: The NVME device configuration is not correct.

Details:
FLASH_15_1 failed - dropped for replacement
FLASH_15_2 failed - dropped for replacement
FLASH_1_1 normal
FLASH_1_2 normal

NOTE: The "FAILURE:" example is such because two devices have failed and been dropped.

If the output is not as expected, determine the root cause and take appropriate corrective action.
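
When a "FAILURE" is reported, a one-line summary of the device list can speed up triage. The sketch below counts devices and those whose status is not "normal", using the rows from the "FAILURE:" example above as input. The helper name summarize_disks is hypothetical, and the input is assumed to have the device name in the first column and the status starting in the second.

```shell
# Hypothetical helper: read "name status..." rows on stdin and print how many
# devices were listed and how many have a status other than "normal".
summarize_disks() {
    awk '
        NF > 0 { total++; if (tolower($2) != "normal") bad++ }
        END    { printf "devices=%d not_normal=%d\n", total, bad }
    '
}

# Rows taken from the FAILURE example in this section.
printf '%s\n' \
    'FLASH_15_1 failed - dropped for replacement' \
    'FLASH_15_2 failed - dropped for replacement' \
    'FLASH_1_1 normal' \
    'FLASH_1_2 normal' | summarize_disks
```

For the example rows this prints devices=4 not_normal=2.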

Verify that the Automatic Storage Management Cluster File System (ACFS) uses 4K metadata block size

Priority Alert Level Date Owner Status Engineered System Engineered System Platform Bug
Critical FAIL 01/17/18 <Name> Production Exadata - Physical, Exadata - User Domain ALL Bug 2729863, Bug 2740305
DB/GI Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version MAA Scorecard Section
12.2.0.1 or higher ASM N/A N/A ALL Linux exachk 18.2.0 N/

Benefit / Impact:

Starting with Grid Infrastructure 12.2.0.1, Oracle ACFS supports I/O requests in multiples of 4K logical sector sizes as well as
continued support for 512-byte logical sector size I/O requests. The size of the metadata blocks is not set directly, but derived
from the logical sector size. Using a 4k metadata block size helps improve performance and stability.

Risk:

On ACFS file systems where the metadata block size is not 4k, applications that frequently access large numbers of files stored
on the ACFS file system can experience severely degraded performance, and possibly a storage server outage.

Action / Repair:

To verify that the Automatic Storage Management Cluster File System (ACFS) uses 4K metadata block size, on one database
server in the cluster as the owner userid of the Grid home, and with the environment set to access the ASM instance on that
database server, execute the following code:

#!/bin/bash
# acfs check metadata block size
# ORACLE_HOME should be the Grid Infrastructure ORACLE_HOME
CRS_HOME=$ORACLE_HOME
NO4KMETABLK=0
ACFSNO4K=()
isacfsused=$(asmcmd volinfo --all|sed -e 's/ //g'|head -n 1)

if [ $isacfsused = 'novolumesfound' ] ; then

echo -e "ACFS is not used"
exit 1
fi

version=$(acfsutil info fs|grep 'ACFS Version'|sort -u|awk -F: '{print $2}'|awk -F. '{print $1$2}')

if [ $version -lt 122 ] ; then

echo -e "WARNING: This check only is valid when GI version is 12.2 or higher"
exit 1
fi

for vol in $(acfsutil info fs|egrep 'metadata block size|primary volume'|awk -F: '{print $1"="$2}'|sed -e 's/ //
do

attr=$(echo $vol |awk -F= '{print $1}')

attrval=$(echo $vol |awk -F= '{print $2}')

if [ $attr = 'metadatablocksize' ] ; then

if [ $attrval -eq 512 ] ; then
NO4KMETABLK=1
else NO4KMETABLK=0
fi
elif [ $attr = 'primaryvolume' ] && [ $NO4KMETABLK -eq 1 ] ; then
ACFSNO4K=(${ACFSNO4K[@]} $attrval)
NO4KMETABLK=0
fi
done

if [ ${#ACFSNO4K[@]} -eq 0 ] ; then

printf "%s \n" "SUCCESS: ALL the ACFS filesystem are using metadata block size 4096"
else
printf "%s \n" "WARNING: There are ACFS filesystem NOT USING metadata block size 4096"
printf "\t %s \n" "The list of the primary volume is: "
printf "\t %s \n" "${ACFSNO4K[@]}"
printf "\t %s \n" "To get the complete details of each filesystem, please execute command acfsutil info fs"
fi

The expected output should be:

SUCCESS: ALL the ACFS filesystem are using metadata block size 4096

Example of a "FAILURE" result:

WARNING: There are ACFS filesystem NOT USING metadata block size 4096
The list of the primary volume is:
/dev/asm/volume1-399
/dev/asm/volume2-399
/dev/asm/volume3-399
To get the complete details of each filesystem, please execute command acfsutil info fs

An ACFS file system created using Grid Infrastructure 12.2.0.1 or higher uses a 4k metadata block size by default.
An ACFS file system created using Grid Infrastructure before 12.2.0.1 requires reformatting the ACFS volume, following these
steps:

1. Create a backup of the file system
2. Deregister (if required) the file system using the acfsutil registry -d command
3. Dismount the file system
4. Remove the file system using the acfsutil rmfs command
5. Reformat the volume using the mkfs -t acfs -i 4096 <dev path> command
6. Mount the file system
7. Restore the files
8. Optionally register the file system using the acfsutil registry command.
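
The command-line portion of the steps above can be sketched as a dry run that prints the commands for review instead of executing them. The helper name print_reformat_plan and the mount point /acfs01 are hypothetical, the device name is borrowed from the example output in this section, and the backup and restore steps are omitted because the tool used for them is site specific.

```shell
# Hypothetical dry-run helper: prints the reformat command sequence for one
# ACFS volume. Nothing is executed; review and run each command manually.
print_reformat_plan() {
    device=$1        # ACFS volume device
    mount_point=$2   # directory where the file system is mounted
    cat <<EOF
acfsutil registry -d ${device}
umount ${mount_point}
acfsutil rmfs ${device}
mkfs -t acfs -i 4096 ${device}
mount -t acfs ${device} ${mount_point}
acfsutil registry -a ${device} ${mount_point}
EOF
}

# Example values only.
print_reformat_plan /dev/asm/volume1-399 /acfs01
```

Reformatting destroys the contents of the volume, so the backup must be verified before any of the printed commands are run.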

Evaluate Automated Maintenance Tasks configuration

Priority Alert Level Date Owner Status Engineered Systems Engineered System Platfor
Critical WARN 01/31/18 <Name> Development SSC, Exadata - Physical, Exadata - User Domain ALL
DB Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version
11.2 or higher ALL ALL ALL ALL Linux, Solaris exachk 18.2.0

Benefit / Impact:

Some automated maintenance tasks are enabled by default with default settings at database creation time. It is recommended
that these automated tasks be allowed to run, but that they are reviewed and adjusted if necessary to provide the most benefit
for a given environment's workload. Benefits are provided by improving the overall efficiency of an environment, and also from
not having the automated maintenance tasks themselves negatively impact the environment's specific workload.

Risk:

Leaving automated maintenance tasks at their default values, or disabling them completely may significantly impact a given
environment's specific workload performance.

Action / Repair:

To see basic information on automated maintenance tasks, as the owner of the oracle home for a given database and with the
environment set to access that database, execute the following command set:

FORMATTED_OUTPUT=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF

set newpage none head off lines 80 feedback off timing off serveroutput on
select client_name,status from DBA_AUTOTASK_CLIENT;
exit
EOF
)
LINE_COUNT=$(echo "$FORMATTED_OUTPUT" | wc -l)
ENABLED_COUNT=$(echo "$FORMATTED_OUTPUT" | egrep -ic enabled)
if [ $LINE_COUNT -eq $ENABLED_COUNT ]
then
echo -e "INFO: all automated maintenance tasks are enabled."
echo -e "Please review configuration appropriateness for this environment."
else
echo -e "WARNING: one or more automated maintenance tasks are not enabled."
echo -e "Please enable all and review configuration appropriateness for this environment.\nDetails:\n$FORMATTED
fi;

The expected output should be similar to:

INFO: all automated maintenance tasks are enabled.

Please review configuration appropriateness for this environment.

Example of a "WARNING" result:

WARNING: one or more automated maintenance tasks are not enabled.

Please enable all and review configuration appropriateness for this environment.
Details:
sql tuning advisor ENABLED
auto optimizer stats collection ENABLED
auto space advisor DISABLED

NOTE:

Oracle recommends that Oracle supplied automated maintenance tasks be utilized and tuned for each individual
database and its associated workload.
For more information, please see:
Database Administrator's Guide, 11g Release 2, Managing Automated Database Maintenance Tasks
Database Administrator's Guide, 12c Release 1, Managing Automated Database Maintenance Tasks
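
The comparison made by the check above can be sketched on its own, without a database connection. This is a minimal illustration that assumes rows in the "client_name status" layout of the sample output; the function name autotask_summary and the sample rows are hypothetical, and the match is anchored to the status at the end of each row.

```shell
#!/bin/bash
# Sketch: classify "client_name status" rows such as those returned by
# DBA_AUTOTASK_CLIENT. The function name and sample rows are illustrative.
autotask_summary() {
    local rows="$1" total enabled
    # Drop empty lines so stray blank output does not inflate the row count.
    rows=$(printf '%s\n' "$rows" | sed '/^[[:space:]]*$/d')
    total=$(printf '%s\n' "$rows" | grep -c .)
    # Count rows whose last word is ENABLED.
    enabled=$(printf '%s\n' "$rows" | grep -Eic '[[:space:]]ENABLED[[:space:]]*$')
    if [ "$total" -gt 0 ] && [ "$total" -eq "$enabled" ]; then
        echo "INFO: all $total automated maintenance tasks are enabled."
    else
        echo "WARNING: $((total - enabled)) of $total automated maintenance tasks are not enabled."
    fi
}

SAMPLE='sql tuning advisor ENABLED
auto optimizer stats collection ENABLED
auto space advisor DISABLED'
autotask_summary "$SAMPLE"
```

Treating an empty result as a WARNING is a design choice of this sketch, so that a failed query is not reported as healthy.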

Verify proper ACFS drivers are installed for Spectre v2 mitigation


Priority Alert Level Date Owner Status Engineered System Engineered Syst
Exadata - Physical,
Critical FAIL 05/08/2018 <Name> Production ALL
Exadata - Management Domain
DB Version DB Type DB Role DB Mode Exadata Version OS & Version Validation To
N/A N/A N/A N/A all Linux exachk 1

Benefit / Impact:

On Exadata database servers that have an Exadata version installed that provides mitigation for Spectre v2 vulnerability, proper
ACFS drivers or other customer-installed kernel drivers must be installed in order for the proper Spectre v2 mitigation to be
used.

The impact of verification is minimal. Installing proper ACFS drivers requires a Clusterware restart. The impact of installing proper
customer-installed kernel drivers cannot be estimated here.

Risk:

Not using the proper ACFS drivers or other customer-installed kernel drivers can prevent the desired Spectre v2 mitigation,
which can lead to reduced performance.

Action / Repair:

To verify proper ACFS drivers are installed for Spectre v2 mitigation, execute the following command set as the "root" userid on
all database servers:

#!/bin/bash
# CPU model numbers (/proc/cpuinfo)
# V2:26 X2-2:44 X2-8:46 X2-8M2:47 X3:45 X4:62 X5:63 X6:79 X7:85
modelsUseRetpoline='26|44|46|47|45|62|63|79'
thisModel=$(egrep "^model[[:space:]]*:" /proc/cpuinfo | sort -u | awk '{print $NF}')
# kernels without spectrev2 mitigation will not have this file
if [[ ! -e /sys/devices/system/cpu/vulnerabilities/spectre_v2 ]]; then
echo "WARNING: System is not capable of Spectre v2 mitigation. See minimum version requirements in MOS document
else
v2mitigation=$(</sys/devices/system/cpu/vulnerabilities/spectre_v2)
# dom0 should use retpoline for all hardware
# X6 and older should use retpoline
wantRetpoline=no
if ( [[ -e /proc/xen/capabilities ]] && grep -q 'control_d' /proc/xen/capabilities ) || \
echo "$thisModel" | egrep -q "$modelsUseRetpoline"; then
wantRetpoline=yes
fi
if [[ $wantRetpoline == yes ]]; then
if ! echo $v2mitigation | grep -qi retpoline; then
echo "FAIL: Spectre v2 mitigation is expected to be retpoline, but is not."
if dmesg | grep -q 'Disabling Spectre v2 mitigation retpoline'; then
echo "Spectre v2 mitigation retpoline was disabled after system boot."
# look for modules not compiled with retpoline
badmodules=$(dmesg | grep 'loading module not compiled with retpoline compiler' | awk -F '[]:]' '{print $2}'
echo "Modules loaded not compiled with retpoline compiler: $badmodules. These modules must be updated."
if [[ $badmodules =~ oracleoks ]]; then
echo "oracleoks module will be updated by installing updated ACFS drivers. See MOS document 2356385.1."
fi
fi
else
echo "SUCCESS: Spectre v2 mitigation is using $v2mitigation"
fi
else
echo "SUCCESS: Spectre v2 mitigation is using $v2mitigation"
fi
fi

The expected output is:

SUCCESS: Spectre v2 mitigation is using Mitigation: Full generic retpoline, IBRS_FW, IBPB

-OR-

SUCCESS: Spectre v2 mitigation is using Mitigation: IBRS, IBRS_FW, IBPB

Example of a "WARNING" result:

WARNING: System is not capable of Spectre v2 mitigation. See minimum version requirements in MOS document 235638

In the above "WARNING" example, the system should be upgraded per the MOS note.

Example of a "FAIL" result:

FAIL: Spectre v2 mitigation is expected to be retpoline, but is not. Spectre v2 mitigation retpoline was disable

In the above FAIL, the system was expected to be using retpoline mitigation for Spectre v2, but was not. The system initially
booted with retpoline mitigation, but it was disabled when an improper kernel module was loaded that caused retpoline
mitigation to be disabled.
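
The two decisions in the check above (whether retpoline is expected, and which modules disabled it) can be sketched separately. This is an illustration only: the function names are hypothetical, the model numbers are those listed in the script comment, and the kernel log line is assumed to carry a bracketed timestamp as the original awk expression implies. The model is matched exactly here, whereas the original uses an unanchored pattern.

```shell
#!/bin/bash
# Sketch: decide whether retpoline is the expected Spectre v2 mitigation.
# Model numbers come from the comment in the check above (V2 through X6);
# the function names are illustrative.
wants_retpoline() {
    local model="$1" is_dom0="${2:-no}"
    # dom0 is expected to use retpoline on all hardware.
    if [ "$is_dom0" = yes ]; then
        echo yes
        return
    fi
    # Exact match on the model number, so that e.g. 126 is not treated as 26.
    case "$model" in
        26|44|46|47|45|62|63|79) echo yes ;;
        *) echo no ;;
    esac
}

# Extract module names from kernel log lines of the assumed form
# "[   12.345678] oracleoks: loading module not compiled with retpoline compiler"
modules_without_retpoline() {
    grep 'loading module not compiled with retpoline compiler' |
        awk -F '[]:]' '{ gsub(/^[ \t]+/, "", $2); print $2 }' | sort -u
}

wants_retpoline 79        # X6
wants_retpoline 85        # X7
wants_retpoline 85 yes    # X7 running as dom0
```

On a live system the second function would be fed from the kernel log, for example with "dmesg | modules_without_retpoline".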

Verify Exafusion Memory Lock Configuration

Priority Alert Level Date Owner Status Engineered System
DB Version DB Role Engineered System Platform Exadata Version OS & Version Validation Tool Version
ALL N/A ALL ALL Linux X86-64

Benefit / Impact:

Having memlock set correctly is required for a successful upgrade to releases 12.2 and higher, and also to prevent ORA- errors
associated with IPC context initialization. The impact of verifying the Exafusion memory lock configuration is minimal. Following
any modifications to the limits.conf settings, a logout/login is required for the OS user to ensure the changes take effect.

NOTE: The memlock settings should be correct according to script recommendations regardless of whether
Exafusion is actually being used or not (it is enabled by default in 12.2).

Risk:

Instance startup will fail, and/or clients will fail to connect if memlock settings are insufficient.

Action / Repair:

To verify Exafusion memory lock configuration, on each database server, as the owner userid of each unique RDBMS home,
place the following code into a script and execute it.

#!/bin/sh

# DESCRIPTION
# Parse limits settings under /etc/security and produce a FAILURE if the
# required memlock settings for Exadata are missing.
# If non-standard settings are found, produce a FAILURE if the configured
# limits are below the minimum requirement, else produce a WARNING.
#
# MODIFIED (MM/DD/YY)
# amorimur 04/09/18 - Creation

LIMITSFILE=/etc/security/limits.conf
LIMITSDDIR=/etc/security/limits.d
RDBMS_OWNER=$(whoami)
MINLIMIT=32768
TMPFILE=$(mktemp)
SUCCESS=1
DEBUG=0

#
# Parse the given memlock setting string and see if it is satisfactory
#
check_memlock () {
local L=$*

# Error if we don't see the correct format

if [ $(echo "$L" | wc -w) -ne 5 ] ; then
SUCCESS=0
echo "FAILURE: Invalid entry found ($L)"
else

# The oracle user must have an unlimited limit

local LUSR=$(echo "$L" | sed 's/\*/all_users/g' | awk '{print $1}')
if [ $LUSR = $RDBMS_OWNER ] ; then
local LVAL=$(echo "$L" | awk '{print $4}')
if [ $LVAL != 'unlimited' ] ; then
SUCCESS=0
echo "FAILURE: $RDBMS_OWNER must have an unlimited setting ($L)"
fi

# All others must have the minimum limit

# Even if the limit settings are satisfactory, print a warning for all of these non-standard entries
else
local LVAL=$(echo "$L" | awk '{print $4}' | sed "s/unlimited/$MINLIMIT/g")

if [ $LVAL -lt $MINLIMIT ] ; then

SUCCESS=0
echo "FAILURE: Found the following entry with memlock limit less than $MINLIMIT ($L)"
else
SUCCESS=0
echo "WARNING: Found a non-standard memlock limit entry ($L)"
fi
fi
fi
}

# Check the limits.conf file

# See if the file exists & is readable
if [ -r $LIMITSFILE ] ; then

# Generate a reference file

REFFILE_BASE=$(mktemp)
cat <<! >> $REFFILE_BASE
* soft memlock $MINLIMIT
* hard memlock $MINLIMIT
$RDBMS_OWNER soft memlock unlimited
$RDBMS_OWNER hard memlock unlimited
!


# Sort the contents

REFFILE=$(mktemp)
sort $REFFILE_BASE > $REFFILE

# Extract the limits.conf settings on this system, exclude comments, and sort (duplicates are ok)
MYFILE=$(mktemp)
grep memlock $LIMITSFILE | egrep 'soft|hard' | awk '{print $1, $2, $3, $4}' | grep -v ^# | sort | uniq > $MYFIL

# Find settings missing on this system; missing settings will produce a FAILURE
comm -23 $REFFILE $MYFILE > $TMPFILE
if [ -s $TMPFILE ] ; then
SUCCESS=0
echo "FAILURE: the following required memlock settings are missing in $LIMITSFILE"
echo "------"
cat $TMPFILE
echo "------"
fi

# Find non-standard settings on this system

# A FAILURE is raised when the memlock setting is below the minimum requirement, otherwise a WARNING is raised
comm -13 $REFFILE $MYFILE > $TMPFILE
if [ -s $TMPFILE ] ; then

# Parse results one by one

while read L ; do
check_memlock "$L file:$LIMITSFILE"
done < $TMPFILE
fi

# Debug
if [ $DEBUG -eq 1 -a $SUCCESS -ne 1 ] ; then
echo "-----"
echo "Debug: reference file"
cat $REFFILE
echo "-----"
echo "Debug: local file"
cat $MYFILE
echo "-----"
fi

else
SUCCESS=0
echo "FAILURE: Unable to open $LIMITSFILE for reading"
fi

#
# Check for memlock settings under limits.d
#
for F in $(grep -rl memlock $LIMITSDDIR/*) ; do
grep memlock $F | egrep 'soft|hard' | awk '{print $1, $2, $3, $4}' | grep -v ^# > $TMPFILE

if [ -s $TMPFILE ] ; then

# Parse results one by one

while read L ; do
check_memlock $L file:$F
done < $TMPFILE
fi
done

# Clean up
rm -rf $MYFILE $REFFILE $TMPFILE $REFFILE_BASE

# Success
if [ $SUCCESS -eq 1 ] ; then
echo "SUCCESS: Memlock settings meet the Oracle best practices"
fi

The expected output is:

SUCCESS: Memlock settings meet the Oracle best practices

Example of a "FAILURE" result:

FAILURE: the following required memlock settings are missing in /etc/security/limits.conf

------
* hard memlock 32768
oracle hard memlock unlimited
oracle soft memlock unlimited
* soft memlock 32768
------
WARNING: Found a non-standard memlock limit entry (grid hard memlock 237778560 file:/etc/security/limits.conf)
WARNING: Found a non-standard memlock limit entry (grid soft memlock 237778560 file:/etc/security/limits.conf)
FAILURE: oracle must have an unlimited setting (oracle hard memlock 237778560 file:/etc/security/limits.conf)
FAILURE: oracle must have an unlimited setting (oracle soft memlock 237778560 file:/etc/security/limits.conf)

If a "FAILURE" or "WARNING" message appears, make the necessary edits to "/etc/security/limits.conf" and files under
"/etc/security/limits.d/" as directed.

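The "missing settings" part of the script above relies on comparing a sorted reference list with the sorted entries found on the system. A minimal sketch of that comparison follows, assuming a limits file in the usual "domain type item value" layout; the function name missing_memlock and the sample file are hypothetical, and 32768 is the minimum used by the script above.

```shell
#!/bin/bash
# Sketch: list required memlock entries that are absent from a limits file.
# Function name and arguments are illustrative.
missing_memlock() {
    local limits_file="$1" owner="$2" min="${3:-32768}"
    local ref mine
    ref=$(mktemp) && mine=$(mktemp) || return 1
    # Reference entries, sorted so that comm can compare them.
    printf '%s\n' \
        "* soft memlock $min" \
        "* hard memlock $min" \
        "$owner soft memlock unlimited" \
        "$owner hard memlock unlimited" | LC_ALL=C sort > "$ref"
    # Active (non-comment) memlock entries with whitespace normalized.
    grep -v '^[[:space:]]*#' "$limits_file" |
        awk '$3 == "memlock" && ($2 == "soft" || $2 == "hard") { print $1, $2, $3, $4 }' |
        LC_ALL=C sort -u > "$mine"
    # Lines present only in the reference file are the missing settings.
    LC_ALL=C comm -23 "$ref" "$mine"
    rm -f "$ref" "$mine"
}

# Demonstration against a throwaway sample file.
demo=$(mktemp)
cat > "$demo" <<'EOF'
# sample limits file
*      soft memlock 32768
oracle soft memlock unlimited
oracle hard memlock 237778560
EOF
missing_memlock "$demo" oracle
rm -f "$demo"
```

Both inputs are sorted under the same locale because comm expects identically ordered files.
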
Verify there are no unhealthy InfiniBand switch sensors

Priority Alert Date Owner Status Engineered System Engineered System Bug(s)
Level Platform
Critical FAIL 08/08/18 <Name> Production Exadata - Physical, ALL Bug 28279223 - exachk

Exadata - Management
Domain,
RA
DB DB Type DB Role DB Mode Exadata OS & Version Validation Tool MAA Scorecard Section
Version Version Version
N/A N/A N/A N/A ALL Linux exachk 18.4.0 N/A

Benefit / Impact:

For maximum functionality and alert notifications, all InfiniBand switch sensors should be functioning properly. The impact of
verifying there are no unhealthy InfiniBand switch sensors is minimal. The impact of correcting failed sensors varies by failed
component.

Risk:

InfiniBand switch functionality may be reduced depending upon which components have failed.

Action / Repair:

To verify there are no unhealthy InfiniBand switch sensors, as the "root" userid on each InfiniBand switch execute the following
code set:

RAW_OUTPUT=$(/usr/local/bin/showunhealthy)
if [ $(echo "$RAW_OUTPUT" | egrep -ic "WARNING|FAILURE") -eq 0 ]
then
echo -e "SUCCESS: there are no unhealthy InfiniBand switch sensors"
else
echo -e "FAILURE: there are one or more unhealthy InfiniBand switch sensors. Details:\n\n$RAW_OUTPUT"
fi

The expected output is the following:

SUCCESS: there are no unhealthy InfiniBand switch sensors

Example of a FAIL result:

FAILURE: there are one or more unhealthy InfiniBand switch sensors. Details:

WARNING PSU 1 present AC Loss

FAILURE - 1 sensors NOT OK

Corrective actions vary depending upon the failed component. Refer to the appropriate switch documentation, and if necessary
open an SR for assistance.

Refer to MOS 1682501.1 if non-Exadata components are in use on the InfiniBand fabric

Priority Alert Date Owner Status Engineered System Engineered System Bug(s)
Level Platform
Critical WARN 09/05/18 <Name> Production Exadata - Physical, ALL Bug 28108851 -
Exadata - Management exachk
Domain,
RA
DB DB Type DB Role DB Mode Exadata OS & Version Validation Tool MAA Scorecard Section
Version Version Version
N/A N/A N/A N/A ALL Linux exachk 18.4.0 N/A

Benefit / Impact:

If non-Exadata components are in use on the same InfiniBand fabric as an Exadata environment, then there are additional
configuration considerations between the components. Verifying these additional considerations helps to ensure the InfiniBand
fabric is stable and performs well.

Risk:

Not referring to MOS 1682501.1 can result in potential InfiniBand fabric instability and poor performance which may cause
components in the Exadata environment to crash. Problems during patching can also occur.

Action / Repair:

To determine if non-Exadata components are discovered on the InfiniBand fabric execute the following code set as the "root"
userid on one database server in the Exadata environment:

unset NONEXADATA_OUTPUT
VT_OUTPUT=$(/opt/oracle.SupportTools/ibdiagtools/verify-topology)
DETECTED_LINE_NUMBER=$(echo "$VT_OUTPUT" | egrep -ni "detected and ignored" | cut -d":" -f1)
ENDING_LINE_NUMBER=$(echo "$VT_OUTPUT" | wc -l)
SPAN=$(expr $ENDING_LINE_NUMBER - $DETECTED_LINE_NUMBER)
NONEXADATA_OUTPUT=$(echo "$VT_OUTPUT" | egrep -i "detected and ignored" -A $SPAN | grep -v "^Detected")
if [ -z "$NONEXADATA_OUTPUT" ]
then
echo -e "SUCCESS: There were no non-Exadata InfiniBand components discovered."
else
echo -e "WARNING: One or more non-Exadata InfiniBand components were discovered:\n\n$NONEXADATA_OUTPUT"
fi

The expected output is the following:

SUCCESS: There were no non-Exadata InfiniBand components discovered.

Example of a "WARNING" result:

WARNING: One or more non-Exadata IB components were discovered:

Ca : 0x0010e0605308c000 ports 2 "SUN IB QDR GW switch <host>-sw-ib2 Bridge 0"

Ca : 0x0010e0605308c040 ports 2 "SUN IB QDR GW switch <host>-sw-ib2 Bridge 1"
Ca : 0x0010e00001757140 ports 2 "<host>-bda10-adm BDA xx.xx.xx.200 HCA-1"
Ca : 0x0010e0000178e640 ports 2 "<host>-bda09-adm BDA xx.xx.xx.199 HCA-1"
Ca : 0x0010e0000187b6e8 ports 2 "<host>-bda12 BDA 192.168.43.12 HCA-1"
Ca : 0x0010e00001757ad0 ports 2 "<host>-bda11-adm BDA xx.xx.xx.201 HCA-1"
Ca : 0x0010e00001878808 ports 2 "<host>-bda13 BDA 192.168.43.13 HCA-1"
Ca : 0x0010e000017723d0 ports 2 "<host>-bda14 BDA 192.168.43.14 HCA-1"
Ca : 0x0010e0000187a638 ports 2 "<host>-bda15 BDA 192.168.43.15 HCA-1"
Ca : 0x0010e00001757050 ports 2 "<host>-bda16 BDA 192.168.43.16 HCA-1"
Ca : 0x0010e00001757090 ports 2 "<host>-bda17 BDA 192.168.43.17 HCA-1"
Ca : 0x0010e0000178e5f0 ports 2 "<host>-bda18-adm BDA xx.xx.xx.152 HCA-1"
Ca : 0x0010e0000178e600 ports 2 "<host>-bda08-adm BDA xx.xx.xx.198 HCA-1"
Ca : 0x0010e00001756fa0 ports 2 "<host>-bda07-adm BDA 192.168.43.7 HCA-1"
Ca : 0x0010e00001757070 ports 2 "<host>-bda05-adm BDA xx.xx.xx.195 HCA-1"
Ca : 0x0010e000017573d0 ports 2 "<host>-bda06-adm BDA xx.xx.xx.196 HCA-1"
Ca : 0x0010e000017572a0 ports 2 "<host>-bda03-adm BDA xx.xx.xx.193 HCA-1"
Ca : 0x0010e00001756fb0 ports 2 "<host>-bda04-adm BDA xx.xx.xx.194 HCA-1"
Ca : 0x0010e0000174f0e0 ports 2 "<host>-bda01-adm BDA xx.xx.xx.191 HCA-1"
Ca : 0x0010e0000174e170 ports 2 "<host>-bda02-adm BDA xx.xx.xx.192 HCA-1"
Ca : 0x0010e0602e08c000 ports 2 "SUN IB QDR GW switch <host>-sw-ib3 Bridge 0"
Ca : 0x0010e0602e08c040 ports 2 "SUN IB QDR GW switch <host>-sw-ib3 Bridge 1"
If a "WARNING" result is returned, please refer to: Setting up the Subnet Manager in a multi-rack cabling configu
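
The extraction step in the code set above computes line numbers and a span for grep. An alternative sketch is shown here that prints everything following the marker line in a single pass, which also behaves sensibly when the marker is absent. It assumes only the "detected and ignored" marker text used above; the function names and sample input are hypothetical.

```shell
#!/bin/bash
# Sketch: print the non-empty lines that follow the "detected and ignored"
# marker in verify-topology output. Function names are illustrative.
nonexadata_components() {
    awk 'found && NF { print }
         tolower($0) ~ /detected and ignored/ { found = 1 }'
}

report_nonexadata() {
    local found
    found=$(nonexadata_components)
    if [ -z "$found" ]; then
        echo "SUCCESS: There were no non-Exadata InfiniBand components discovered."
    else
        printf 'WARNING: One or more non-Exadata InfiniBand components were discovered:\n\n%s\n' "$found"
    fi
}

# Demonstration with input that contains no marker line.
printf '%s\n' 'sample line before the marker' | report_nonexadata
```

On a database server the input would come from the verify-topology command shown above, piped into report_nonexadata.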

Verify the ib_sdp module is not loaded into the kernel

Priority Alert Level Date Owner Status Engineered System Engineered System Bug(s)
Platform
Critical FAIL 02/20/19 <Name> Production Exadata - Physical, ALL Bug 29157366 - exachk
Exadata - Management Domain,
RA
DB Version DB Type DB Role DB Mode Exadata Version OS & Version Validation Tool Version MAA Scorecard Section
N/A N/A N/A N/A ALL Linux exachk 19.1.0 N/A

Benefit / Impact:

The Sockets Direct Protocol (SDP) developed by the OpenFabrics Enterprise Distribution (OFED) group Mellanox is no longer
supported. There are open issues with SDP and operating system stability that will not be resolved.

For performance and stability, the ib_sdp module should not be loaded into the kernel. The impact of verifying the ib_sdp
module is not loaded into the kernel is minimal. Modifying a system to not load the ib_sdp module requires a reboot.

NOTE: for Exadata versions 12.2.0.0.0 or greater, the ib_sdp module should not be loaded into the kernel.

NOTE: for Exadata versions 12.1.x.x.x or lower, it is recommended the ib_sdp module not be loaded into the
kernel. However, if the ib_sdp module is loaded against this recommendation, then the option "sdp_apm_enable"
must be set to "0". While the original Automatic Path Migration (APM) issue was reported when Exalogic
application servers were accessing an Oracle Exadata Database Machine using SDP, ANY client requesting a
connection using SDP with APM enabled to an Oracle Exadata Database Machine will eventually cause the
connection to hang on the database server.

Risk:

System instability, poor performance, and potential node evictions are likely if the ib_sdp module is loaded into the kernel.

Action / Repair:

To verify the ib_sdp module is not loaded, as the "root" userid on each database server execute the following code set:

EXADATA_VERSION=$(imageinfo -version | cut -d"." -f1-5 | tr -d .)
LSMOD_DATA=$(/sbin/lsmod | egrep -i ^ib_sdp)
if [[ $EXADATA_VERSION -ge 122000 ]]
then
if [ -z "$LSMOD_DATA" ]
then
echo -e "SUCCESS: The ib_sdp module is not loaded into the kernel"
else
echo -e "FAILURE: The ib_sdp module is loaded into the kernel. Details:\n$LSMOD_DATA"
fi
else
if [ -z "$LSMOD_DATA" ]
then
echo "SUCCESS: The ib_sdp module is not loaded into the kernel"
else
CODE_LINE=$(echo $EXADATA_VERSION | cut -c1-2)
KERNEL_TYPE=$(uname -r | cut -d"." -f6)
if [ $KERNEL_TYPE = "el5uek" ]
then
IB_SDP_FILE="/etc/modprobe.conf"
elif [ $KERNEL_TYPE = "el6uek" ]
then
IB_SDP_FILE="/etc/modprobe.d/ib_sdp.conf"
else
echo -e "ERROR: unable to determine IB_SDP_FILE: $KERNEL_TYPE"
fi
IB_SDP_FILE_OUTPUT=$(egrep "ib_sdp" $IB_SDP_FILE)
if [ -s /sys/module/ib_sdp/parameters/sdp_apm_enable ]
then
IB_SDP_KERNEL_OUTPUT_RSLT=$(cat /sys/module/ib_sdp/parameters/sdp_apm_enable)
else
IB_SDP_KERNEL_OUTPUT_RSLT="/sys/module/ib_sdp/parameters/sdp_apm_enable not found"
fi
if [[ $CODE_LINE -eq 11 && $EXADATA_VERSION -lt 112331 || $CODE_LINE -eq 12 && $EXADATA_VERSION -lt 121111 ]]
then
if [ $(echo "$IB_SDP_FILE_OUTPUT" | egrep "sdp_apm_enable*.=0" | wc -l) -eq 1 ]
then
IB_SDP_FILE_OUTPUT_RSLT=0
fi
if [[ "$IB_SDP_FILE_OUTPUT_RSLT" = 0 && "$IB_SDP_KERNEL_OUTPUT_RSLT" = 0 ]]
then
echo -e "SUCCESS: ib_sdp is loaded and sdp_apm_enable is set to 0 in $IB_SDP_FILE and running kernel."
echo -e "$IB_SDP_FILE: $IB_SDP_FILE_OUTPUT"
echo -e "Running Kernel: $IB_SDP_KERNEL_OUTPUT_RSLT"
else
echo -e "FAILURE: ib_sdp is loaded and sdp_apm_enable should be set to 0 in $IB_SDP_FILE and running kern
echo -e "$IB_SDP_FILE: $IB_SDP_FILE_OUTPUT"
echo -e "Running Kernel: $IB_SDP_KERNEL_OUTPUT_RSLT"
fi
else
if [ $(echo "$IB_SDP_FILE_OUTPUT" | egrep "sdp_apm_enable*.=0" | wc -l) -eq 0 ]
then
IB_SDP_FILE_OUTPUT_RSLT=0
fi
if [[ "$IB_SDP_FILE_OUTPUT_RSLT" = 0 && "$IB_SDP_KERNEL_OUTPUT_RSLT" = 0 ]]
then
echo -e "SUCCESS: ib_sdp is loaded and sdp_apm_enable is not set in $IB_SDP_FILE and is set to "0" in the
echo -e "$IB_SDP_FILE: $IB_SDP_FILE_OUTPUT"
echo -e "Running Kernel: $IB_SDP_KERNEL_OUTPUT_RSLT"
else
echo -e "FAILURE: ib_sdp is loaded and sdp_apm_enable should not be set in $IB_SDP_FILE and should be "0"
echo -e "$IB_SDP_FILE: $IB_SDP_FILE_OUTPUT"
echo -e "Running Kernel: $IB_SDP_KERNEL_OUTPUT_RSLT"
fi
fi
fi
fi

The expected output is the following:

SUCCESS: The ib_sdp module is not loaded into the kernel

Example of a FAIL result:

FAILURE: ib_sdp is loaded and sdp_apm_enable should be set to 0 in /etc/modprobe.conf and running kernel.
/etc/modprobe.conf:
Running Kernel: 0

NOTE: To correct a "FAILURE" result, place the text "SDP_LOAD=no" into the file "/etc/rdma/rdma.conf" and
reboot the database server.
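
The version threshold in the notes above (12.2.0.0.0 or greater versus 12.1.x.x.x or lower) can also be tested by comparing dotted versions component by component. The sketch below assumes GNU sort with the -V option is available; the function name version_ge is hypothetical. It is offered as an alternative to removing the dots and comparing the digits as one number, which can misorder versions when a component has more than one digit.

```shell
#!/bin/bash
# Sketch: compare dotted Exadata versions component by component.
# Relies on GNU sort -V (version sort); the function name is illustrative.
version_ge() {
    # True when $1 >= $2: the smaller of the two sorts first.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

for v in 12.1.2.3.6 12.2.1.1.4 19.2.10.0.0; do
    if version_ge "$v" 12.2.0.0.0; then
        echo "$v: ib_sdp should not be loaded"
    else
        echo "$v: ib_sdp not recommended; if loaded, sdp_apm_enable must be 0"
    fi
done
```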

Verify all voting disks are online

Priority Alert Date Owner Status Engineered Engineered System Bug(s)
Level System Platform
Critical FAIL 05/29/19 Vern Production Exadata - Physical, ALL 29779386 - exachk
Wagman Exadata - User
Domain
DB/GI DB Type DB Role DB Mode Exadata OS & Version Validation Tool MAA Scorecard
Version Version Version Section
11.2.0.4 or ASM N/A N/A N/A Linux exachk 19.3.0 N/A
higher

Benefit / Impact:

Voting disks help ensure a stable cluster. The impact of verifying all voting disks are online is minimal. The impact of bringing a
given voting disk back online depends upon the reason why it went offline, and cannot be estimated here.

Risk:

Not having all expected voting disks online increases the risk of node eviction or cluster crash.

Action / Repair:

To verify all voting disks are online, as the grid home owner userid, and with CRS_HOME and SID set to access the ASM
instance, execute the following code on one database server in the cluster:

VOTEDISK_OUTPUT=$($CRS_HOME/bin/crsctl query css votedisk)

LOCATED_COUNT=$(echo "$VOTEDISK_OUTPUT" | egrep "^Located" | cut -d" " -f2)
ONLINE_COUNT=$(echo "$VOTEDISK_OUTPUT" | egrep -c ONLINE)
if [ "$LOCATED_COUNT" -eq "$ONLINE_COUNT" ]
then
echo -e "SUCCESS: all voting disks are online."
else
echo -e "FAILURE: not all voting disks are online.\nDETAILS:\n$VOTEDISK_OUTPUT"
fi

The expected output should be:

SUCCESS: all voting disks are online.

Example of a "FAILURE" case:

FAILURE: not all voting disks are online.

DETAILS:
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE a07c741f08194f71bf7f4d14c7d67a15 (/dev/exadata_quorum/QD_DATAC1_RANDOM05ADM05) [DATAC1]
2. ONLINE d1327820402f4f2fbffca97cbdef72d7 (/dev/exadata_quorum/QD_DATAC1_RANDOM05ADM06) [DATAC1]
3. ONLINE 748b53cfb1a64f6cbff0f71de2de89b3 (o/192.168.22.171;192.168.22.172/DATAC1_FD_05_random05celadm07) [DA
4. ONLINE 5fbc672724094f82bfcd4ea220ab824a (o/192.168.22.173;192.168.22.174/DATAC1_FD_05_random05celadm08) [DA
5. OFFLINE e9efd3be40ad4f64bfd034233f3e37d3 (o/192.168.22.175;192.168.22.176/DATAC1_FD_05_random05celadm09) [D

If a "FAILURE" result is returned, investigate to determine root cause and take appropriate corrective action.
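
When investigating a "FAILURE" result, it can help to count voting disks by the STATE column rather than by searching the whole output for the word ONLINE. The following is a minimal sketch that assumes the row layout shown in the example above (an index such as "1." followed by the state); the function name and the placeholder row are hypothetical.

```shell
#!/bin/bash
# Sketch: count voting disks by state from "crsctl query css votedisk" output.
# The function name is illustrative.
votedisk_states() {
    # Data rows start with an index such as "1." followed by the state.
    awk '$1 ~ /^[0-9]+\.$/ { total++; if ($2 == "ONLINE") online++ }
         END { printf "total=%d online=%d\n", total, online }'
}

# Demonstration: the second row is a placeholder, not real output.
SAMPLE=' 1. ONLINE   a07c741f08194f71bf7f4d14c7d67a15 (/dev/exadata_quorum/QD_DATAC1_RANDOM05ADM05) [DATAC1]
 2. OFFLINE  <file-universal-id> (<file-name>) [DATAC1]'
printf '%s\n' "$SAMPLE" | votedisk_states
```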

Verify available ksplice fixes are installed

Priority Alert Date Owner Status Engineered System Engineered Bug(s)
Level System
Platform
Critical FAIL 08/14/19 Doug Production Exadata - Physical, ALL 30185190 - exachk
Utzig Exadata - Management
Domain,
Exadata - User Domain,
RA
DB DB Type DB Role DB Exadata Version OS & Version Validation Tool MAA Scorecard
Version Mode Version Section
ALL ALL ALL ALL >=12.2.1.1.4, Linux exachk 19.3.0 N/A
>=18.1.2.0.0

Benefit / Impact:

On Exadata systems some Oracle Linux operating system updates are delivered via ksplice. All available ksplice updates should
be installed to ensure issues fixed in the installed Exadata release are not encountered.

Risk:

Not having all available ksplice updates installed can lead to unexpected behavior caused by encountering issues that are
expected to be fixed in the installed Exadata release. The risk of checking that all available ksplice updates are installed is
minimal.

Action / Repair:

To verify all available ksplice updates are installed run the following command set as the root user on each storage and database
server in the cluster:

The expected output is the following:

-- OR --

Example of a FAIL result:

If there are available ksplice updates not installed then run uptrack-install as the root user, as follows:

Benefit/Impact:

The Flash 20 card supports ESM lifetime to enable proactive replacement before failure.

The impact of verifying that the ESM lifetime is within specification is minimal. Replacing an ESM requires a storage server
outage. The database and application may remain available if the appropriate grid disks are properly inactivated before and
activated after the storage server outage. Refer to MOS Note 1188080.1 and "Shutting Down Exadata Storage Server" in Chapter
7 of "Oracle® Exadata Database Machine Owner's Guide 11g Release 2 (11.2) E13874-14" for additional details.

Risk:

Failure of the ESM will put the Flash 20 card in WriteThrough mode which has a high impact on performance.

Action/Repair:

To verify the ESM lifetime value, use the following command on the storage servers:

for RISER in RISER1/PCIE1 RISER1/PCIE4 RISER2/PCIE2 RISER2/PCIE5; do ipmitool sunoem cli "show /SYS/MB/$RISER/F20

value = 3382.350 Hours

upper_nonrecov_threshold = 17500.000 Hours
upper_critical_threshold = 17200.000 Hours
upper_noncritical_threshold = 16800.000 Hours
lower_noncritical_threshold = N/A
-- <output truncated>

If the "value" reported exceeds the "upper_noncritical_threshold" reported, schedule a replacement of the relevant ESM.
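
The comparison described above can be sketched as a small filter over the sensor output. This assumes lines in the "name = number Hours" layout shown in the sample; the function name and the status words it prints are hypothetical. A threshold of "N/A", as reported for F20 M2 cards, is treated as "no threshold".

```shell
#!/bin/bash
# Sketch: compare the reported ESM "value" with "upper_noncritical_threshold".
# Input lines follow the "name = number Hours" layout shown above; the
# function name and status words are illustrative.
esm_lifetime_status() {
    awk -F '=' '
        { key = $1; gsub(/[ \t]/, "", key) }
        key == "value"                       { value = $2 + 0; have_value = 1 }
        key == "upper_noncritical_threshold" { limit = $2; gsub(/[ \t]/, "", limit) }
        END {
            if (!have_value)              print "UNKNOWN: no value reported"
            else if (limit !~ /^[0-9]/)   print "N/A: no threshold reported"
            else if (value > (limit + 0)) print "REPLACE: " value " hours exceeds " (limit + 0)
            else                          print "OK: " value " hours within " (limit + 0)
        }'
}

printf '%s\n' 'value = 3382.350 Hours' \
              'upper_noncritical_threshold = 16800.000 Hours' | esm_lifetime_status
```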

NOTE: There is a bug in ILOM firmware version 3.0.9.19.a which may report "Invalid target..." for
"RISER1/PCIE4". If that happens, consult your site maintenance records to verify the age of the ESM module.

NOTE: For Aura II (F20 M2) cards, the CPLD reports the End of Life indication on the F20 M2 cards, so the
thresholds for UPTIME sensor are not needed. The threshold values are replaced with "N/A". The ILOM will fault
the system when it's time to replace the F20 M2's ESM. Beginning with 2.1.3, exachk does not execute this check
on F20 M2 cards. Beginning with 2.1.5, exachk posts a message in the html report detail that the card is an F20M2
model and the check is not applicable.

Verify Database Server Disk Controller Configuration (ARCHIVE)

Archive Date: 10/01/12

Archive Reason: Beginning with 11.2.3.2.0 the configuration of the database server disk drives was changed to
have all available disk drives in a RAID-5 configuration with no hot spare.

Priority  Added  Machine Type            OS Type  Exadata Version  Oracle Version
Critical  N/A    X2-2(4170), X2-2, X2-8  Linux    11.2.x +         11.2.x +

Benefit / Impact:

For X2-2, there are 4 disk drives in a database server controlled by an LSI MegaRAID SAS 9261-8i disk controller. The disks are
configured RAID-5 with 3 disks in the RAID set and 1 disk as a hot spare. There is 1 virtual drive created across the RAID set.
Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.

For X2-8, there are 8 disk drives in a database server controlled by an LSI MegaRAID SAS 9261-8i disk controller. The disks are
configured RAID-5 with 7 disks in the RAID set and 1 disk as a hot spare. There is 1 virtual drive created across the RAID set.
Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.

The impact of validating the RAID devices is minimal. The impact of corrective actions will vary depending on the specific issue
uncovered, and may range from simple reconfiguration to an outage.

Risk:

Not verifying the RAID devices increases the chance of a performance degradation or an outage.

Action / Repair:

To verify the database server disk controller configuration, use the following command:

/opt/MegaRAID/MegaCli/MegaCli64 AdpAllInfo -aALL | grep "Device Present" -A 8

For X2-2, the output will be similar to:

Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 5
Disks : 4
Critical Disks : 0
Failed Disks : 0

The expected output is 1 virtual drive, none degraded or offline, 5 physical devices (controller + 4 disks), 4 disks, and no critical
or failed disks.

For X2-8, the output will be similar to:

Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 11
Disks : 8
Critical Disks : 0
Failed Disks : 0

The expected output is 1 virtual drive, none degraded or offline, 11 physical devices (1 controller + 8 disks + 2 SAS2 expansion
ports), 8 disks, and no critical or failed disks.

On X2-8, there is a SAS2 expander on each NEM, which takes in the 8 ports from the Niwot REM and expands it out to both the
8 physical drive slots through the midplane and the 2 SAS2 expansion ports external on each NEM. See output below from the
MegaRAID FW event log.

If the reported output differs, investigate and correct the condition.
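
The expected values described above can be checked mechanically. The following sketch reads "Device Present" output in the "name : number" layout shown in the X2-2 sample and compares it with the expected counts (5 physical devices and 4 disks for X2-2, 11 and 8 for X2-8). The function name and its arguments are hypothetical.

```shell
#!/bin/bash
# Sketch: compare "Device Present" counters with the expected values.
# $1: expected physical devices, $2: expected disks. Names are illustrative.
check_device_present() {
    awk -F ':' -v pd="$1" -v disks="$2" '
        function expect(k, want) {
            if (!seen[k] || val[k] != want) {
                bad = 1
                print "unexpected: " k " = " (seen[k] ? val[k] : "missing")
            }
        }
        NF == 2 {
            key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key)
            val[key] = $2 + 0
            seen[key] = 1
        }
        END {
            expect("Virtual Drives", 1); expect("Degraded", 0); expect("Offline", 0)
            expect("Physical Devices", pd); expect("Disks", disks)
            expect("Critical Disks", 0); expect("Failed Disks", 0)
            print (bad ? "FAILURE" : "SUCCESS")
        }'
}

X22_SAMPLE='Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 5
Disks : 4
Critical Disks : 0
Failed Disks : 0'
printf '%s\n' "$X22_SAMPLE" | check_device_present 5 4
```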

Verify Database Server Virtual Drive Configuration (ARCHIVE)

Archive Date: 10/01/12

Archive Reason: Beginning with 11.2.3.2.0 the configuration of the database server disk drives was changed to
have all available disk drives in a RAID-5 configuration with no hot spare.

Priority  Added  Machine Type            OS Type  Exadata Version  Oracle Version
Critical  N/A    X2-2(4170), X2-2, X2-8  Linux    11.2.x +         11.2.x +

Benefit / Impact:

The impact of validating the virtual drives is minimal. The impact of corrective actions will vary depending on the specific issue
uncovered, and may range from simple reconfiguration to an outage.

Risk:

Not verifying the virtual drives increases the chance of a performance degradation or an outage.

Action / Repair:

To verify the database server virtual drive configuration, use the following command:

/opt/MegaRAID/MegaCli/MegaCli64 CfgDsply -aALL | grep "Virtual Drive:";/opt/MegaRAID/MegaCli/MegaCli64 CfgDsply -

For X2-2 the output should be similar to:

Virtual Drive: 0 (Target Id: 0)

Number Of Drives : 3
State : Optimal

The expected result is that the virtual device has 3 drives and a state of optimal.

For X2-8, the output should be similar to:

Virtual Drive: 0 (Target Id: 0)

Number Of Drives : 7
State : Optimal

The expected result is that the virtual device has 7 drives and a state of optimal.

If the reported output differs, investigate and correct the condition.

NOTE: The virtual device number reported may vary depending upon configuration and version levels.

NOTE: If a bare metal restore procedure is performed on a database server without using the "dualboot=no"
configuration, that database server may be left with three virtual devices for X2-2 and 7 for X2-8. Please see My
Oracle Support note 1323309.1 for additional information and correction instructions.

Verify Database Server Physical Drive Configuration (ARCHIVE)

Archive Date: 10/01/12
Archive Reason: Beginning with 11.2.3.2.0 the configuration of the database server disk drives was changed to
have all available disk drives in a RAID-5 configuration with no hot spare.

Priority  Added  Machine Type            OS Type  Exadata Version  Oracle Version
Critical  N/A    X2-2(4170), X2-2, X2-8  Linux    11.2.x +         11.2.x +

Benefit / Impact:

The impact of validating the physical drives is minimal. The impact of corrective actions will vary depending on the specific issue
uncovered, and may range from simple reconfiguration to an outage.

Risk:

Not verifying the physical drives increases the chance of a performance degradation or an outage.

Action / Repair:

To verify the database server physical drive configuration, use the following command:

/opt/MegaRAID/MegaCli/MegaCli64 PDList -aALL | grep "Firmware state"

The output for X2-2 will be similar to:

Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Hotspare, Spun down

There should be three lines of output showing a state of "Online, Spun Up", and one line showing a state of "Hotspare, Spun
down". The ordering of the output lines is not significant and may vary based upon a given database server's physical drive
replacement history.

The output for X2-8 will be similar to:

Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Hotspare, Spun down

There should be seven lines of output showing a state of "Online, Spun Up", and one line showing a state of "Hotspare, Spun
down". The ordering of the output lines is not significant and may vary based upon a given database server's physical drive
replacement history.

If the reported output differs, investigate and correct the condition.

NOTE: Modified 03/21/12

For additional information, please reference My Oracle Support note "Exadata: Hot Spares Not Spinning Down (Doc
ID 1403613.1)"

Verify Peripheral Component Interconnect (PCI) Bridges are Configured for Generation II on Storage Servers
(ARCHIVE)

Archive Date: 10/24/12

Archive Reason: Beginning with the X4270 M3 storage servers shipped with the X3-2 and X3-8 database
machines, there is a different PCI architecture and this issue is not relevant to the new hardware.

Priority  Added     Machine Type            OS Type         Exadata Version  Oracle Version
Critical  09/13/11  X2-2(4170), X2-2, X2-8  Linux, Solaris  11.2.x +         11.2.x +

Benefit / Impact:

The storage server PCI bridges (19:0.0 and 27:0.0) should be configured for generation II for maximum performance.

There is minimal impact to verify the PCI Bridges configuration.

Risk:

If the PCI bridges are not configured for generation II, performance will be sub-optimal.

Action / Repair:

To verify the current PCI bridges configuration, execute the following command as the root userid on all storage servers:

for BUS_NUM in 19:0.0 27:0.0; do echo $BUS_NUM `lspci -xxx -s $BUS_NUM | grep ^50 | cut -d" " -f4`; done

The output should be similar to:

19:0.0 82
27:0.0 82

If any of the storage server PCI bridges do not return "82", there are three possible corrective actions:

If the value returned is "81", you may upgrade to Exadata storage server software version 11.2.2.4.0 or greater, or refer to MOS
note 1351559.1.

If neither the value "81" nor "82" is returned, contact Oracle Support for further assistance.

NOTE: PCI Bridge generation I will return the value "81".

[NOTE: INTERNAL ONLY - manual instructions are also listed in exachk bug 12756149.]
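
The decision logic above can be sketched as a small shell function. This is only an illustration: the function name classify_pci_bridge is hypothetical and not part of the original note. It takes the value printed by the lspci loop for one bridge and echoes the corresponding action.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original note): map the value reported
# by the lspci loop for one PCI bridge to the action described in this section.
classify_pci_bridge() {
    case "$1" in
        82) echo "OK: generation II" ;;
        81) echo "ACTION: generation I; upgrade to 11.2.2.4.0 or greater, or see MOS note 1351559.1" ;;
        *)  echo "ACTION: unexpected value; contact Oracle Support" ;;
    esac
}

# Demo using the sample output shown in this section.
printf '19:0.0 82\n27:0.0 82\n' | while read -r bus value; do
    echo "$bus $(classify_pci_bridge "$value")"
done
```

On a storage server, the output of the verification command could be piped into the same "while read" loop in place of the sample lines.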

Verify Database Server Disk Controller Configuration (ARCHIVE)

Archive Date: 03/06/13

Archive Reason: Beginning with the Exadata software version 11.2.3.2.1, the reclamation of the hotspare device
mandated in 11.2.3.2.0, was made optional for those customers upgrading from a version below 11.2.3.2.0
directly to 11.2.3.2.1 or higher.

Priority  Added      Machine Type                        OS Type  Exadata Version  Oracle Version
Critical  10/1/2012  X2-2(4170), X2-2, X2-8, X3-2, X3-8  Linux    11.2.3.2.0 +     11.2.x +

Benefit / Impact:

An X3-2 or X2-2 database server contains 4 disk drives in a RAID-5 configuration. An X3-8 or X2-8 database server contains 8
disk drives in a RAID-5 configuration. There is 1 virtual drive created across the RAID set. Verifying the status of the database
server RAID devices helps to avoid a possible performance impact, or an outage.

The impact of validating the RAID devices is minimal. The impact of corrective actions will vary depending on the specific issue
uncovered, and may range from simple reconfiguration to an outage.

Risk:

Not verifying the RAID devices increases the chance of a performance degradation or an outage.

Action / Repair:

To verify the database server disk controller configuration, use the following command:

/opt/MegaRAID/MegaCli/MegaCli64 AdpAllInfo -aALL | grep "Device Present" -A 8

For an X3-2 or X2-2 database server, the output will be similar to:

Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 5
Disks : 4
Critical Disks : 0
Failed Disks : 0

The expected output is 1 virtual drive, none degraded or offline, 5 physical devices (controller + 4 disks), 4 disks, and no critical
or failed disks.

For an X3-8 or X2-8 database server, the output will be similar to:

Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 11
Disks : 8
Critical Disks : 0
Failed Disks : 0

The expected output is 1 virtual drive, none degraded or offline, 11 physical devices (1 controller + 2 SAS2 expansion ports + 8
disks), 8 disks, and no critical or failed disks.

If the reported output differs, investigate and correct the condition.

NOTE: If additional virtual drives or a "hot spare" is present, it may be that the reclaimdisks procedure was not
executed at deployment time or that a bare metal restore procedure was performed without using the
"dualboot=no" qualifier. Please refer to the "Reclaiming Disks for the Linux Operating System" section of "Oracle®
Exadata Database Machine Owner's Guide, 11g Release 2 (11.2)".
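
The comparison of the "Device Present" block against the expected values can be sketched as follows. This is a minimal sketch, not part of the original note: the function name check_device_present is hypothetical, and it assumes the "key : value" layout shown in the sample output above. The expected physical device and disk counts are passed as arguments (5 and 4, or 11 and 8).

```shell
#!/bin/bash
# Hypothetical helper: read the "Device Present" block on stdin and compare it
# with the expected counts. $1 = expected physical devices, $2 = expected disks.
check_device_present() {
    awk -F':' -v phys="$1" -v disks="$2" '
        # True only when key k was seen with a numeric value equal to v.
        function is(k, v) { return (k in seen) && seen[k] ~ /^[0-9]+$/ && seen[k] + 0 == v + 0 }
        NF == 2 {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim blanks around the key
            gsub(/[ \t]/, "", val)             # remove blanks from the value
            seen[key] = val
        }
        END {
            ok = is("Virtual Drives", 1) && is("Degraded", 0) && is("Offline", 0) &&
                 is("Physical Devices", phys) && is("Disks", disks) &&
                 is("Critical Disks", 0) && is("Failed Disks", 0)
            print (ok ? "PASS" : "FAIL")
        }'
}

# Demo with the X3-8 / X2-8 sample output from this section.
check_device_present 11 8 <<'EOF'
Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices :11
Disks : 8
Critical Disks : 0
Failed Disks : 0
EOF
```

A missing line is treated as a failure, so truncated output does not pass silently.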

Verify Database Server Virtual Drive Configuration (ARCHIVE)

Archive Date: 03/06/13

Priority  Added      Machine Type                        OS Type  Exadata Version  Oracle Version
Critical  10/1/2012  X2-2(4170), X2-2, X2-8, X3-2, X3-8  Linux    11.2.3.2.0 +     11.2.x +

Benefit / Impact:

The impact of validating the virtual drives is minimal. The impact of corrective actions will vary depending on the specific issue
uncovered, and may range from simple reconfiguration to an outage.

Risk:

Not verifying the virtual drives increases the chance of a performance degradation or an outage.

Action / Repair:

To verify the database server virtual drive configuration, use the following command:

/opt/MegaRAID/MegaCli/MegaCli64 CfgDsply -aALL | grep "Virtual Drive:";/opt/MegaRAID/MegaCli/MegaCli64 CfgDsply -

For an X3-2 or X2-2 database server, the output will be similar to:

Virtual Drive: 0 (Target Id: 0)

Number Of Drives : 4
State : Optimal

The expected result is that the virtual device has 4 drives and a state of optimal.

For an X3-8 or X2-8 database server, the output will be similar to:

Virtual Drive: 0 (Target Id: 0)

Number Of Drives : 8
State : Optimal

The expected result is that the virtual device has 8 drives and a state of optimal.

NOTE: The virtual device number reported may vary depending upon configuration and version levels.

NOTE: If the database server was upgraded to 11.2.3.2.0 or higher, this check may fail because the reported
number of drives is "3" or "7". Please see the "Known Issues" #5 "Hotspare removed for compute nodes" in My
Oracle Support note 1468877.1 for corrective action.
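
The expected-result test for a virtual drive can be sketched as a small filter. This is an illustration under assumptions: the function name check_virtual_drive is hypothetical, the input is assumed to contain the "Number Of Drives" and "State" lines in the form shown above, and only a single virtual drive is handled (with several, the last one read wins).

```shell
#!/bin/bash
# Hypothetical helper: read virtual drive details on stdin and check the
# drive count and state. $1 = expected number of drives (4 or 8 here).
check_virtual_drive() {
    awk -F':' -v want="$1" '
        /Number Of Drives/ { gsub(/[ \t]/, "", $2); drives = $2 }
        /^[ \t]*State/     { gsub(/[ \t]/, "", $2); state = $2 }
        END { print ((drives == want && state == "Optimal") ? "PASS" : "FAIL") }'
}

# Demo with the X3-2 / X2-2 sample output from this section.
check_virtual_drive 4 <<'EOF'
Virtual Drive: 0 (Target Id: 0)

Number Of Drives : 4
State : Optimal
EOF
```

A server upgraded to 11.2.3.2.0 that still reports 3 or 7 drives, as described in the NOTE above, would produce FAIL.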

Verify Database Server Physical Drive Configuration (ARCHIVE)

Archive Date: 03/06/13

Priority  Added      Machine Type                        OS Type  Exadata Version  Oracle Version
Critical  10/1/2012  X2-2(4170), X2-2, X2-8, X3-2, X3-8  Linux    11.2.3.2.0 +     11.2.x +

Benefit / Impact:

The impact of validating the physical drives is minimal. The impact of corrective actions will vary depending on the specific issue
uncovered, and may range from simple reconfiguration to an outage.

Risk:

Not verifying the physical drives increases the chance of a performance degradation or an outage.

Action / Repair:

To verify the database server physical drive configuration, use the following command:

/opt/MegaRAID/MegaCli/MegaCli64 PDList -aALL | grep "Firmware state"

For an X3-2 or X2-2 database server, the output will be similar to:

Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up

There should be 4 lines of output showing a state of "Online, Spun Up".

For an X3-8 or X2-8 database server, the output will be similar to:

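Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up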

There should be 8 lines of output showing a state of "Online, Spun Up".

If the reported output differs, investigate and correct the condition.

NOTE: If the database server was upgraded to 11.2.3.2.0 or higher, this check may fail because one of the devices
shows a state of: "Unconfigured(good), Spun Up". Please see the "Known Issues" #5 "Hotspare removed for
compute nodes" in My Oracle Support note 1468877.1 for corrective action.
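
Counting the "Firmware state" lines by hand is error-prone on 8-drive servers, so the check can be sketched as a filter. This is a minimal sketch: the function name check_physical_drives is hypothetical, and it assumes the input is the grep output shown above.

```shell
#!/bin/bash
# Hypothetical helper: count "Firmware state" lines read from stdin.
# $1 = number of drives expected to be "Online, Spun Up" (4 or 8 here).
check_physical_drives() {
    local expected="$1" input total online
    input=$(cat)
    total=$(printf '%s\n' "$input" | grep -c 'Firmware state' || true)
    online=$(printf '%s\n' "$input" | grep -c 'Firmware state: Online, Spun Up' || true)
    if [ "$total" -eq "$expected" ] && [ "$online" -eq "$expected" ]; then
        echo "PASS"
    else
        echo "FAIL: $online of $total drives online, expected $expected"
    fi
}

# Demo: the upgrade case from the NOTE above, with one drive left unconfigured.
{
    echo 'Firmware state: Online, Spun Up'
    echo 'Firmware state: Online, Spun Up'
    echo 'Firmware state: Online, Spun Up'
    echo 'Firmware state: Unconfigured(good), Spun Up'
} | check_physical_drives 4
```

On a database server, the output of the PDList command above would be piped into check_physical_drives with 4 or 8 as the argument.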

Verify processor.max_cstate=1 on database servers

Archive Date: 03/13/13

Archive Reason: Beginning with the Exadata software version fresh install 11.2.2.2.0 or upgrade to 11.2.2.4.0,
the ILOM version went to 3.0.16.10 and this issue was resolved. This also does not apply to the current X3 series
hardware.

Priority  Alert Level  Date      Owner       Status      Scope    Bug(s)
Critical  FAIL         04/17/12  Dan Norris  Production  Exadata  14153949 - exachk

DB Version  DB Role  Engineered System  Exadata Version             OS & Version                       Validation Tool Version  TBD
N/A         N/A      X2-2               11.2.x+ (ILOM < 3.0.16.10)  Solaris - 11, Linux x86-64 UEK5.8  exachk 2.2.2

Benefit / Impact:

The benefit of these settings is avoiding uncorrectable memory errors related to the deep C state features on Nehalem
processors.

NOTE: Fresh images 11.2.2.2.0 or higher automatically include these fixes. Systems upgraded from older original
images should be manually upgraded by following the 11.2.2.2.0 upgrade notes.

Risk:

Without the proper configuration settings, memory errors may be reported.

Action / Repair:

If the database server has been upgraded to version 11.2.2.4.0 or higher, it should be running ILOM version 3.0.16.10, which
includes the fix for CR 7036024. Once that fix is installed, the kernel parameter is no longer required because the ILOM/BIOS
incorporates the fix directly. For that reason, the check should be made against the ILOM version rather than the image version.

To verify that processor.max_cstate=1 is set where required, as the "root" userid execute the following code on each database server:

##### begin script

#!/bin/bash
UNAME_S=`/bin/uname -s`
DMIDECODE=`/usr/sbin/dmidecode -s system-product-name`
### this fixes weirdness with the way dmidecode returns its data
DMIDECODE=`echo $DMIDECODE`
TARGET_ILOM_VER_X4170=03001610
### check basic requirements
if [ "$UNAME_S" = "Linux" -a "$DMIDECODE" = "SUN FIRE X4170 M2 SERVER" ]; then
### verify the ILOM version - if 3.0.16.10 or newer, can exit
ILOM_VER=`ipmitool sunoem cli version | grep firmware | egrep -v 'build number|date:' | awk '{print $3}'`
ILOM_VER1=`echo $ILOM_VER | awk -F. '{print $1}'`
ILOM_VER2=`echo $ILOM_VER | awk -F. '{print $2}'`
ILOM_VER3=`echo $ILOM_VER | awk -F. '{print $3}'`
ILOM_VER4=`echo $ILOM_VER | awk -F. '{print $4}'`
if [ "$ILOM_VER1" -le 9 ]; then ILOM_VER1="0$ILOM_VER1"; fi
if [ "$ILOM_VER2" -le 9 ]; then ILOM_VER2="0$ILOM_VER2"; fi
if [ "$ILOM_VER3" -le 9 ]; then ILOM_VER3="0$ILOM_VER3"; fi
if [ "$ILOM_VER4" -le 9 ]; then ILOM_VER4="0$ILOM_VER4"; fi
LOCALVER="${ILOM_VER1}${ILOM_VER2}${ILOM_VER3}${ILOM_VER4}"
if [ $TARGET_ILOM_VER_X4170 -gt $LOCALVER ]; then
### now we need to check for the parameter in /proc/cmdline
PARAM_PRESENT=`grep processor.max_cstate=1 /proc/cmdline | wc -l `
if [ $PARAM_PRESENT -eq 1 ]; then
### don't have fix via ILOM version, but have cmdline param
echo "PASSED due to cmdline param"
else ### don't have fix via ILOM, don't have fix via kernel cmdline param, failed check
echo "FAILED"
fi
else
### already have the minimum ILOM version, so passed the check
echo "PASSED due to minimum ILOM version"
fi
else
echo "This check is only for Linux-based X4170 M2 database servers, exiting"
fi
#### end script

The expected output is not "FAILED".

To correct a "FAILED" condition:

1) Upgrade to newer versions of Exadata Software not impacted by this issue.

2) If an upgrade is not possible, to configure the proper settings, the kernel boot option "processor.max_cstate=1" should be
added to the /boot/grub/grub.conf file on the "kernel" line so that it looks like this:

kernel /vmlinuz-2.6.18-274.18.1.0.1.el5 root=LABEL=DBSYS ro bootarea=dbsys loglevel=7 panic=60 debug rhgb numa=of

After this change, a system reboot is required to pick up the new setting.
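
The script in this section compares ILOM versions by zero-padding each component to two digits and comparing the concatenated strings as integers. As an alternative sketch (not the original method), the components can be compared one at a time, which avoids the padding step. The function name ilom_at_least and the sample version strings other than 3.0.16.10 are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) when dotted version $1 >= dotted version $2.
ilom_at_least() {
    local -a cur min
    local i c m
    IFS=. read -r -a cur <<< "$1"
    IFS=. read -r -a min <<< "$2"
    for i in 0 1 2 3; do
        # 10# forces base 10 so a component such as "08" is not read as octal.
        c=$(( 10#${cur[i]:-0} ))
        m=$(( 10#${min[i]:-0} ))
        if [ "$c" -gt "$m" ]; then return 0; fi
        if [ "$c" -lt "$m" ]; then return 1; fi
    done
    return 0
}

# Demo with illustrative version strings against the 3.0.16.10 minimum.
for v in 3.0.9.27 3.0.16.10 3.1.2.10; do
    if ilom_at_least "$v" 3.0.16.10; then
        echo "$v: minimum ILOM version met"
    else
        echo "$v: older than 3.0.16.10, check processor.max_cstate=1"
    fi
done
```

The version string itself would still be obtained with the ipmitool command used in the script above.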

Verify Software on Storage Servers (CheckSWProfile.sh) (ARCHIVE)

Archive Date: 06/26/13

Archive Reason: Beginning with the Exadata software version fresh install 11.2.3.3.0 or upgrade to 11.2.3.3.0,
CheckSWProfile.sh has been desupported by Exadata development.

Priority  Added  Machine Type            OS Type         Exadata Version  Oracle Version
Critical  N/A    X2-2(4170), X2-2, X2-8  Linux, Solaris  11.2.x +         11.2.x +

Benefit / Impact:

Verifying the software configuration after initial deployment, upgrades, or patching and before the Oracle Exadata Database
Machine is placed into or returned to production status can avoid problems related to the software modifications.

The overhead for these verification steps is minimal.

Risk:

If the software is not validated, inconsistencies can lead to problems and outages.

Action / Repair:

To verify the storage server software configuration execute the following command as the root userid:

/opt/oracle.SupportTools/CheckSWProfile.sh -c

The output will be similar to:

[INFO] SUCCESS: Meets requirements of operating platform and installed software for
[INFO] below listed releases and patches of Exadata and of corresponding Database.
[INFO] Check does NOT verify correctness of configuration for installed software.

[The_ExadataAndDatabaseReleases]
Exadata: 11.2.2.1.0 OracleDatabase: 11.2.0.2+Patches

If any result other than "SUCCESS" is returned, investigate and correct the condition.

Review:

ravindra.dani: This is not correct for database hosts all the time. SW checker is only useful on fresh imaged db nodes. Also this
check is going to be retired by 11.2.3.1.0. This check should not be run on the cells and, though not folded into cellcli (say, at
validate config), it should be.

Verify Software on InfiniBand Switches (CheckSWProfile.sh)(ARCHIVE)

Archive Date: 06/26/13

Archive Reason: Beginning with the Exadata software version fresh install 11.2.3.3.0 or upgrade to 11.2.3.3.0,
CheckSWProfile.sh has been desupported by Exadata development.

Priority  Added  Machine Type            OS Type         Exadata Version  Oracle Version
Critical  N/A    X2-2(4170), X2-2, X2-8  Linux, Solaris  11.2.x +         11.2.x +

Benefit / Impact:

The overhead for these verification steps is minimal.

Risk:

If the software is not validated, problems may occur when the machine is utilized.

Action / Repair:

The commands required to verify the InfiniBand switches software configuration vary slightly by the physical configuration of the
Oracle Exadata Database Machine. The key difference is whether or not the physical configuration includes a designated spine
switch.

To verify the InfiniBand switches software configuration for a X2-8, a full rack Oracle Exadata Database Machine X2-2 or a late
production model half rack Oracle Exadata Database Machine X2-2, with a designated spine switch properly configured per the
"Oracle Exadata Database Machine Owner's Guide 11g Release 2 (11.2) E13874-15" with "sm_priority=8", and the name
"RanDomsw-ib1", execute the following command as the "root" userid on one of the database servers:

/opt/oracle.SupportTools/CheckSWProfile.sh -I IS_SPINERanDomsw-ib1,RanDomsw-ib3,RanDomsw-ib2

Where "RanDomsw-ib1, RanDomsw-ib3, and RanDomsw-ib2" are the switch names returned by the "ibswitches" command.

NOTE: There is no space between the "IS_SPINE" qualifier and the name of the designated spine switch.

The output will be similar to:

Checking if switch RanDomsw-ib1 is pingable...
Checking if switch RanDomsw-ib3 is pingable...
Checking if switch RanDomsw-ib2 is pingable...
Use the default password for all switches? (y/n) [n]: y
[INFO] SUCCESS Switch RanDomsw-ib1 has correct software and firmware version:
SWVer: 1.3.3-2
[INFO] SUCCESS Switch RanDomsw-ib1 has correct opensm configuration:
controlled_handover=TRUE polling_retry_number=5 routing_engine=ftree sminfo_polling_timeout=1000 sm_priority=8

[INFO] SUCCESS Switch RanDomsw-ib3 has correct software and firmware version:
SWVer: 1.3.3-2
[INFO] SUCCESS Switch RanDomsw-ib3 has correct opensm configuration:
controlled_handover=TRUE polling_retry_number=5 routing_engine=ftree sminfo_polling_timeout=1000 sm_priority=5

[INFO] SUCCESS Switch RanDomsw-ib2 has correct software and firmware version:
SWVer: 1.3.3-2
[INFO] SUCCESS Switch RanDomsw-ib2 has correct opensm configuration:
controlled_handover=TRUE polling_retry_number=5 routing_engine=ftree sminfo_polling_timeout=1000 sm_priority=5

[INFO] SUCCESS All switches have correct software and firmware version:
SWVer: 1.3.3-2
[INFO] SUCCESS All switches have correct opensm configuration:
controlled_handover=TRUE polling_retry_number=5 routing_engine=ftree sminfo_polling_timeout=1000 sm_priority=5 f

To verify the InfiniBand switches software configuration for an early production model half rack Oracle Exadata Database
Machine X2-2 (may not have shipped with a designated spine switch), or a quarter rack Oracle Exadata Database Machine X2-2
properly configured per the "Oracle Exadata Database Machine Owner's Guide 11g Release 2 (11.2) E13874-15", execute the
following command as the "root" userid on one of the database servers:

/opt/oracle.SupportTools/CheckSWProfile.sh -I RanDomsw-ib3,RanDomsw-ib2

Where "RanDomsw-ib3 and RanDomsw-ib2" are the switch names returned by the "ibswitches" command.

The output will be similar to the output for the first command, but there will be no references to a spine switch and all switches
will have "sm_priority" of 5.

In either command case, the expected output is to return "SUCCESS". If anything else is returned, investigate and correct the
condition.

Verify storage server network configuration with ipconf (ARCHIVE)

Archive Date: 05/13/15

Archive Reason: This storage-server-only check was replaced by "Verify active system values match those
defined in configuration file cell.conf", which executes on both storage and database servers with broader scope.

Priority  Alert Level  Date         Owner       Status      Scope         Bug(s)
Critical  FAIL         05-Mar-2013  Doug Utzig  Production  Exadata, SSC

DB Version  DB Role  Engineered System                         Exadata Version  OS & Version  Validation Tool Version  TBD
N/A         N/A      X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2  All              n/a

Benefit / Impact:

Exadata Storage Server network configuration is maintained in both operating system level configuration files and in Exadata-
specific configuration files. The configuration defined in the two sets of files must match. To ensure proper configuration and
consistency, network configuration changes to an Exadata Storage Server must be performed with the ipconf utility, as
documented in the Oracle Exadata Storage Server Software User's Guide.

The impact of verifying that storage server network configuration is correct and consistent is minimal.

Risk:

If operating system level configuration files and Exadata-specific configuration files are inconsistent, then maintenance activities
like software patching may fail, or previous configuration may be restored without warning.

Action / Repair:

To verify operating system level configuration files and Exadata-specific configuration files are consistent, run the following
ipconf command on storage servers:

# /usr/local/bin/ipconf -verify -semantic

The output should be similar to:

Verifying of Exadata configuration file /opt/oracle.cellos/cell.conf Done. Configuration file /opt/oracle.cellos/

If the output reports FAILED for any check, investigate to find the root cause, and then use only the ipconf utility to make the
necessary corrections to the storage server network configuration. Refer to the Oracle Exadata Storage Server Software User's
Guide for details of the ipconf utility.

11.2.0.2 ASM Instance Initialization Parameters (ARCHIVE)

Archive Date: 05/08/15

Archive Reason: 11.2.0.2 is fully desupported. Please see: "Release Schedule of Current Database Releases (Doc
ID 742060.1)"

Priority: Critical

Benefit / Impact: Experience and testing have shown that certain ASM initialization parameters should be set at specific values.
These are the best practice values set at deployment time. By setting these ASM initialization parameters as recommended,
known problems may be avoided and performance maximized. The parameters are specific to the ASM instances. Unless
otherwise specified, the value is for both 2 socket and 8 socket Database Machines. The impact of setting these parameters is
minimal.

Risk: If the ASM initialization parameters are not set as recommended, a variety of issues may be encountered, depending upon
which initialization parameter is not set as recommended, and the actual set value.

Action / Repair: To verify the ASM initialization parameters, compare the values in your environment against the table
below (* = default value):

Parameter              Recommended Value                               Priority  Notes

cluster_interconnects  Bondib0 IP address for 2 socket servers         1         This is used to avoid the Clusterware HAIP a
                       Colon delimited Bondib* IP addresses for 8                supported on Exadata (the only exception b
                       socket servers

asm_power_limit        4                                               1         This is Exadata default to mitigate applicatio
                                                                                 ASM rebalance. Please evaluate application
                                                                                 a higher ASM_POWER_LIMIT.

memory_target          1040M                                           1         This avoids issues with 11.2.0.1 to 11.2.0.2
                                                                                 setting for Exadata.

processes              For < 10 instances per node:                    1         This avoids issues observed when ASM hits
                       50 * (DB instances per node + 1)                          NOTE: "instances" means "non-ASM" instan
                       For >= 10 instances per node:                             [Internal] Note that bug 11842806 can caus
                       {50 * MIN(db_instances_per_node + 1, 11)} +               even a properly configured processes param
                       {10 * MAX(db_instances_per_node - 10, 0)}                 should be applied

Correct any Priority 1 parameter that is not set as recommended. Evaluate and correct any Priority 2 parameter that is not set as
recommended.
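
The recommended value for the ASM "processes" parameter in the table above is a piecewise formula, and a short worked sketch may help. The function name asm_processes is hypothetical. It uses only the second form of the formula, because for fewer than 10 instances MIN(n + 1, 11) is n + 1 and MAX(n - 10, 0) is 0, which reduces to 50 * (n + 1).

```shell
#!/bin/bash
# Hypothetical helper: compute the recommended ASM "processes" value for
# $1 non-ASM database instances per node, using the formula from the table.
asm_processes() {
    local n="$1" capped extra
    capped=$(( n + 1 < 11 ? n + 1 : 11 ))   # MIN(n + 1, 11)
    extra=$(( n - 10 > 0 ? n - 10 : 0 ))    # MAX(n - 10, 0)
    echo $(( 50 * capped + 10 * extra ))
}

# Demo for a few instance counts.
for n in 1 4 9 10 12; do
    echo "$n instances per node -> processes = $(asm_processes "$n")"
done
```

For example, 4 instances per node gives 50 * 5 = 250, and 12 instances per node gives 50 * 11 + 10 * 2 = 570.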

Verify Common Instance Database Initialization Parameters (ARCHIVE)

Archive Date: 08/22/12

Archive Reason: This section was created to account for database initialization parameters that become
deprecated at various release levels.

Benefit / Impact: Experience and testing have shown that certain database initialization parameters should be set at specific
values. These are the best practice values set at deployment time. By setting these database initialization parameters as
recommended, known problems may be avoided and performance maximized. The parameters are common to all database
instances. The impact of setting these parameters is minimal. The performance related settings provide guidance to maintain
highest stability without sacrificing performance. Changing the default performance settings can be done after careful
performance evaluation and clear understanding of the performance impact.

Risk: If the database initialization parameters are not set as recommended, a variety of issues may be encountered, depending
upon which initialization parameter is not set as recommended, and the actual set value.

Action / Repair: To verify the database initialization parameters, compare the values in your environment against the table
below (* = default value):

Parameter                  Recommended Value  Priority  Notes

_lm_rcvr_hang_allow_time   140                1         This parameter protects from corner case tim
                                                        prevents instance evictions
                                                        Archive Reason: Deprecated with 11.2.0.3
                                                        14526144

_kill_diagnostics_timeout  140                1         This parameter protects from corner case tim
                                                        prevents instance evictions
                                                        Archive Reason: Deprecated with 11.2.0.3
                                                        14526155

Verify RAID Controller Battery Condition (ARCHIVE)

Archive Date: 04/06/16

Archive Reason: This check became obsolete with the release of X5 series hardware

Priority  Added     Machine Type                                    OS Type         Exadata Version  Oracle Version
Critical  03/02/11  X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8  Linux, Solaris  11.2.x +         11.2.x +

[Bug(s): 11828407 (Storage Server), 11832924 (EM Storage Server Plugin), 11832981 (EM Agent)]

Benefit/Impact:

The RAID controller battery loses its ability to support cache over time. Verifying the battery charge and condition allows
proactive battery replacement.

The impact of verifying the RAID controller battery condition is minimal.

Risk:

A failed RAID controller battery will put the RAID controller into WriteThrough mode which significantly impacts write I/O
performance.

Action/Repair:

Execute the following command as the "root" userid on all servers:

The output will be similar to:

BatteryType: iBBU08
Full Charge Capacity: 1272 mAh
Max Error: 0 %

Proactive battery replacement should be performed within 60 days for any batteries that meet the following criteria:

1) "Full Charge Capacity" less than or equal to 800 mAh and "Max Error" less than 10%.

Immediately replace any batteries that meet either of the following criteria:

1) "Max Error" is 10% or greater (battery deemed unreliable regardless of "Full Charge Capacity" reading)

2) "Full Charge Capacity" less than 674 mAh regardless of "Max Error" reading

[NOTE: The complete reference guide for LSI disk controller batteries used in Exadata can be found in MOS 1329989.1
(INTERNAL ONLY)]
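
The replacement criteria above can be sketched as a small classifier. This is an illustration only: the function name classify_battery and the result labels are hypothetical, and the two values are assumed to have been extracted from output in the form shown in the sample above.

```shell
#!/bin/bash
# Hypothetical helper: apply the replacement criteria from this section.
# $1 = "Full Charge Capacity" in mAh, $2 = "Max Error" in percent.
classify_battery() {
    local fcc="$1" max_error="$2"
    if [ "$max_error" -ge 10 ] || [ "$fcc" -lt 674 ]; then
        echo "REPLACE_NOW"                 # unreliable reading or capacity too low
    elif [ "$fcc" -le 800 ]; then
        echo "REPLACE_WITHIN_60_DAYS"      # proactive replacement window
    else
        echo "OK"
    fi
}

# Demo: extract both values from the sample output shown in this section.
sample='BatteryType: iBBU08
Full Charge Capacity: 1272 mAh
Max Error: 0 %'
fcc=$(printf '%s\n' "$sample" | awk -F': *' '/Full Charge Capacity/ {print $2 + 0}')
err=$(printf '%s\n' "$sample" | awk -F': *' '/Max Error/ {print $2 + 0}')
echo "Full Charge Capacity=$fcc mAh, Max Error=$err% -> $(classify_battery "$fcc" "$err")"
```

The immediate-replacement test is evaluated first, so a battery with 700 mAh and a Max Error of 12% is reported as REPLACE_NOW rather than falling into the 60-day window.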

Verify all "BIGFILE" tablespaces have non-default "MAXBYTES" values set (ARCHIVE)

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 11-Nov-2011 X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux, [WIP:VW]Solaris 11.2.x + 11.2.x +

Benefit / Impact

"MAXBYTES" is the SQL attribute that expresses the "MAXSIZE" value that is used in the DDL command to set "AUTOEXTEND" to
"ON". By default, for a bigfile tablespace, the value is "3.5184E+13", or "35184372064256". The benefit of having "MAXBYTES"
set at a non-default value for "BIGFILE" tablespaces is that a runaway operation or heavy simultaneous use (e.g., temp
tablespace) cannot take up all the space in a diskgroup.

The impact of verifying that "MAXBYTES" is set to a non-default value is minimal. The impact of setting the "MAXSIZE" attribute
to a non-default value varies depending upon whether it is done during database creation, file addition to a tablespace, or
added to an existing file.

Risk

The risk of running out of space in a diskgroup varies by application and cannot be quantified here. A diskgroup running out of
space may impact the entire database as well as ASM operations (e.g., rebalance operations).

Action / Repair

To obtain a list of file numbers and bigfile tablespaces that have the "MAXBYTES" attribute at the default value, enter the
following sqlplus command logged into the database as sysdba:

select file_id, a.tablespace_name, autoextensible, maxbytes

from (select file_id, tablespace_name, autoextensible, maxbytes from dba_data_files where autoextensible='YES' an
(select tablespace_name from dba_tablespaces where bigfile='YES') b
where a.tablespace_name = b.tablespace_name
union
select file_id,a.tablespace_name, autoextensible, maxbytes
from (select file_id, tablespace_name, autoextensible, maxbytes from dba_temp_files where autoextensible='YES' an
(select tablespace_name from dba_tablespaces where bigfile='YES') b
where a.tablespace_name = b.tablespace_name;

The output should be:

no rows returned

If you see output similar to:

FILE_ID TABLESPACE_NAME AUT MAXBYTES

---------- ------------------------------ --- ----------
1 TEMP YES 3.5184E+13
3 UNDOTBS1 YES 3.5184E+13
4 UNDOTBS2 YES 3.5184E+13

Investigate and correct the condition.

Ensure Temporary Tablespace is correctly defined (ARCHIVE)

Priority Added Machine Type OS Type Exadata Version Oracle Version

N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux 11.2.x + 11.2.x +

The temporary tablespace should be

1. A BigFile Tablespace
2. Located in DATA or RECO, whichever one is not HIGH redundancy
3. Sized 32GB Initially
4. Configured with AutoExtend on at 4GB
5. Configured with a Max Size defined to limit out of control growth.

Verify "diagsnap.pl" is not executing

Priority  Alert Level  Date      Owner                           Status      Engineered System  Engineered System Platform                     Bug(s)
Critical  WARN         04/26/17  Dib Chatterjee, Jaime Figueroa  Production  ALL                RA, Exadata - Physical, Exadata - User Domain  bug 27376516 - Exachk
                                                                                                                                               bug 25960055 - OEDA
                                                                                                                                               bug 25955127 - Exachk

DB/GI DB DB Role DB Mode Exadata OS & Version OS & Version MAA Scorecard

Benefit / Impact:
Starting with version 12.2.0.1, the Cluster Health Monitor (CHM) framework by default continuously executes the file
"/u01/app/12.2.0.1/grid/bin/diagsnap.pl". Under certain conditions, this script executes the "pstack" command against key grid
infrastructure processes. The output of "pstack" can be useful for diagnosing grid infrastructure issues, but the "pstack"
command execution and locking can lead these key grid infrastructure processes to hang (especially ocssd) which can trigger
node reboots. It is recommended that "diagsnap.pl" not execute continuously, and that the "pstack" command is only used when
other diagnostics indicate a benefit.

The impact of verifying that "diagsnap.pl" is not executing is minimal, as is the impact of stopping its execution.

Risk:

Continuously executing "diagsnap.pl" may lead to node reboots that might have otherwise been avoided.

Action / Repair:

To verify that "diagsnap.pl" is not executing, as the owner userid of the grid home, and with the environment properly set to
access the grid home, execute the following code set on each database server:

CRS_HOME=$ORACLE_HOME
unset DIAGSNAP_OUTPUT
unset DIAGSNAP_EXECUTING

function chkdiagsnap
{
if [ $DIAGSNAP_EXECUTING -gt 0 ]
then
echo -e "WARNING: \"diagsnap.pl\" is executing on this database server. Recommendation is to stop the pr
repair
else
echo -e "SUCCESS: \"diagsnap.pl\" is not executing on this database server.\n"
fi
exit 0
}

function repair
{
$CRS_HOME/bin/oclumon manage -disable diagsnap
}

DIAGSNAP_OUTPUT=$(ps -ef | grep $CRS_HOME | grep diagsnap | grep -v grep)

DIAGSNAP_EXECUTING=$(echo "$DIAGSNAP_OUTPUT" | grep -c diagsnap)

chkdiagsnap

The expected output is:

SUCCESS: "diagsnap.pl" is not executing on this database server.

Example of a "WARNING:" result:

WARNING: "diagsnap.pl" is executing on this database server:

Details: root 386456 378366 0 Apr03 ? 00:30:17 /u01/app/12.2.0.1/grid/perl/bin/perl /u01/app/12.2.0.1

NOTE: If a "WARNING:" result is returned, to stop the file "diagsnap.pl" from executing, as the owner userid of the
grid home, and with the environment variables properly set, execute the following command:
$CRS_HOME/bin/oclumon manage -disable diagsnap

Verify memlock is 90% of phys ram when huge pages are enabled

Priority  Alert Level  Date     Owner           Status      Scope     Bug(s)
Critical  FAIL         8/29/14  Rene Kundersma  Production  Exadata,

DB Version  DB Role  Engineered System                         Exadata Version  OS & Version  Validation Tool Version  TBD
N/A         N/A      X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2  11.2.2.2.0+      Linux x86-64

Benefit / Impact:

Oracle recommends that the maximum locked memory be at least 90 percent of the installed physical memory when huge pages
are enabled. Refer to the operating system documentation or issue the command "man limits.conf" for details. The impact of
verifying this value is minimal. Also see http://docs.oracle.com/database/121/LADBI/usr_grps.htm#LADBI7674

Risk:

Incorrect resource settings can cause instability and performance problems.

Action / Repair:

Obtain the hard and soft values for memlock from /etc/security/limits.conf and verify that each is at least 90% of physical
memory. When huge pages are configured (which should be the case) and either value is less than 90% of physical memory, the
check should print a warning and recommend updating the values.
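The comparison described above can be sketched as follows. This is a minimal illustration, not the exachk implementation: the function names memlock_min_kb and check_memlock are hypothetical, and the sample numbers stand in for values that would normally come from /proc/meminfo and /etc/security/limits.conf (both expressed in KB).

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, sample values are illustrative.

# Print the minimum acceptable memlock value in KB: 90% of physical memory.
# $1 = physical memory in KB (MemTotal from /proc/meminfo)
memlock_min_kb() {
    local memtotal_kb=$1
    echo $(( memtotal_kb * 90 / 100 ))
}

# Compare one configured memlock value against the 90% threshold.
# $1 = configured value from limits.conf (KB, or the word "unlimited")
# $2 = physical memory in KB
check_memlock() {
    local configured=$1 memtotal_kb=$2 minimum
    minimum=$(memlock_min_kb "$memtotal_kb")
    if [ "$configured" = "unlimited" ] || [ "$configured" -ge "$minimum" ]; then
        echo "SUCCESS: memlock $configured is at least $minimum KB"
    else
        echo "WARNING: memlock $configured is below $minimum KB; update /etc/security/limits.conf"
    fi
}

# Example with sample numbers for a server with 256 GB of physical memory.
check_memlock 200000000 268435456
check_memlock unlimited 268435456
```

On a live server, the physical memory figure could be read with awk '/MemTotal/ {print $2}' /proc/meminfo, and the check is only relevant when HugePages_Total in /proc/meminfo is greater than zero. The hard and soft values would each be passed through check_memlock separately.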

Verify RAID Controller Battery Temperature

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 03/02/11 X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux, Solaris 11.2.x + 11.2.x +

Benefit/Impact:

Maintaining proper temperature ranges maximizes RAID controller battery life.

The impact of verifying RAID controller battery temperature is minimal.

Risk:

A reported temperature of 60C or higher causes the battery to suspend charging until the temperature drops. It also shortens the
service life of the battery, causing it to fail prematurely and put the RAID controller into
WriteThrough mode, which significantly impacts write I/O performance.

Action/Repair:

To verify the RAID controller battery temperature, execute the following command as the "root" userid on all servers:

if [ -x /opt/MegaRAID/MegaCli/MegaCli64 ]
then
#Linux
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0 -nolog| grep BatteryType;
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0 -nolog | grep -i temper;
else
#Solaris
/opt/MegaRAID/MegaCli -AdpBbuCmd -a0 -nolog| grep BatteryType;
/opt/MegaRAID/MegaCli -AdpBbuCmd -a0 -nolog| grep -i temper;
fi;

The output will be similar to:

BatteryType: iBBU08
Temperature: 38 C
Temperature : OK
Over Temperature : No

If the battery temperature is equal to or greater than 55C, investigate and correct the environmental conditions.

NOTE: Replace the battery module after a 3-year service life, assuming the battery temperature has not exceeded 55C. If the
temperature has exceeded 55C (battery temperature shall not exceed 60C), replace the battery every 2 years.
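The thresholds above (55C to investigate, 60C as the point where charging is suspended) can be applied to the "Temperature:" line with a small helper. This is a sketch under assumptions: the function name classify_bbu_temperature is hypothetical, and the input format is taken from the sample output shown in this section.

```shell
#!/bin/bash
# Sketch only: classify one "Temperature: NN C" line against the documented thresholds.
# $1 = a single line of MegaCli battery output, for example "Temperature: 38 C"
classify_bbu_temperature() {
    local celsius
    # Extract the numeric value; lines such as "Temperature : OK" yield nothing.
    celsius=$(echo "$1" | sed -n 's/^ *Temperature *: *\([0-9][0-9]*\) *C.*$/\1/p')
    if [ -z "$celsius" ]; then
        echo "UNKNOWN: no numeric temperature found"
    elif [ "$celsius" -ge 60 ]; then
        echo "FAILURE: ${celsius}C, battery suspends charging at 60C or higher"
    elif [ "$celsius" -ge 55 ]; then
        echo "WARNING: ${celsius}C, investigate environmental conditions"
    else
        echo "SUCCESS: ${celsius}C is below 55C"
    fi
}

# Example using the sample reading from this section.
classify_bbu_temperature "Temperature: 38 C"
```

Only the line carrying a numeric reading is classified; status lines such as "Temperature : OK" are reported as UNKNOWN rather than guessed at.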

[NOTE: The complete reference guide for LSI disk controller batteries used in Exadata can be found in MOS Unpublished Note
1329989.1 (INTERNAL ONLY)]

Verify Database Server Disk Controller Configuration

Priority                    Critical
Alert Level                 FAIL
Date                        03/17/18
Owner                       Dib Chatterjee
Status                      Production
Engineered System           Exadata - Physical, Exadata - Management Domain
Engineered System Platform  X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, X7-2
Bug(s)                      Bug 27525145-exachk, Bug 26775963-exachk, Bug 24533088-exachk, Bug 20557656-exachk
DB Version                  N/A
DB Type                     N/A
DB Role                     N/A
DB Mode                     N/A
Exadata Version             ALL
OS & Version                Linux
Validation Tool Version     exachk 18.2.0
MAA Scorecard Section       N/A

The recommended configuration for a newly deployed (or upgraded from 11.2.3.2.0) database server varies according to the
hardware type and Exadata software version. Verifying the status of the database server RAID devices helps to avoid a possible
performance impact, or an outage.

The impact of verifying the database server disk controller configuration is minimal. The impact of corrective actions will vary
depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.

Risk:

Not verifying the database server disk controller configuration increases the chance of a performance degradation or an outage.

Action / Repair:

exachk contains all the logic necessary to identify the various correct configurations. To verify the database server disk controller
configuration, run exachk and evaluate the results.

To manually verify the database server disk controller configuration, execute the following command set as the "root" userid on
each database server or the management domain of a virtualized environment:

NOTE: This check is not applicable to X7-8 Oracle Exadata Database Servers as they contain no conventional disk
drives!

if [[ -d /proc/xen && ! -f /proc/xen/capabilities ]]

then
echo -e "\nThis check will not run in a user domain of a virtualized environment. Execute this check in the ma
else
if [ -x /opt/MegaRAID/storcli/storcli64 ]
then
export CMD=/opt/MegaRAID/storcli/storcli64
else
export CMD=/opt/MegaRAID/MegaCli/MegaCli64
fi
RAW_OUTPUT=$($CMD AdpAllInfo -aALL -nolog | grep "Device Present" -A 8);
echo -e "The database server disk controller configuration found is:\n\n$RAW_OUTPUT";
fi;

The output will be similar to:

Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 5
Disks : 4
Critical Disks : 0
Failed Disks : 0

The output should match one of the combinations of entries in this table:

Database Server Disk Controller Configurations

Engineered System                               Virtual  Degraded  Offline  Physical  Disks  Critical  Failed  Exadata
                                                Drives                      Devices          Disks     Disks   Version
X2-2(4170), X2-2                                1        0         0        5         4      0         0       < 11.2.3.2.0
X2-8                                            1        0         0        11        8      0         0       < 11.2.3.2.0
X2-2(4170), X2-2, X3-2, X4-2, X5-2, X6-2, X7-2  1        0         0        5         4      0         0       >= 11.2.3.2.0
X5-2, X6-2, X7-2 (Disk Expansion Kit)           1        0         0        9         8      0         0       >= 11.2.3.2.0
X2-8, X3-8                                      1        0         0        11        8      0         0       >= 11.2.3.2.0
X4-8                                            1        0         0        8         7      0         0       >= 11.2.3.2.0
X5-8, X6-8                                      1        0         0        9         8      0         0       >= 11.2.3.2.0

NOTE: The Disk Expansion Kit is only applicable to X5-2, X6-2, and X7-2 database servers.
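Comparing the "Device Present" output with a table row can be done by reducing the output to its seven counts, in the same column order as the table. The following is a sketch, not part of exachk: the function name device_present_counts is hypothetical, the sample text is the output shown in this section, and the expected row is the one for an X2-2 through X7-2 two-socket server at 11.2.3.2.0 or later.

```shell
#!/bin/bash
# Sketch only: reduce "Device Present" output (read on stdin) to the seven counts
# in table column order: virtual drives, degraded, offline, physical devices,
# disks, critical disks, failed disks.
device_present_counts() {
    awk -F':' '
        /Virtual Drives|Degraded|Offline|Physical Devices|Disks/ {
            gsub(/[ \t]/, "", $2)          # strip padding around the count
            printf "%s%s", sep, $2
            sep = " "
        }
        END { print "" }'
}

# Sample output copied from this section.
SAMPLE_OUTPUT='Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 5
Disks : 4
Critical Disks : 0
Failed Disks : 0'

FOUND=$(echo "$SAMPLE_OUTPUT" | device_present_counts)
EXPECTED="1 0 0 5 4 0 0"    # table row chosen for illustration

if [ "$FOUND" = "$EXPECTED" ]; then
    echo "SUCCESS: controller configuration matches ($FOUND)"
else
    echo "FAILURE: found ($FOUND), expected ($EXPECTED)"
fi
```

The expected string has to be chosen by hand from the table according to hardware type and Exadata version; the sketch does not detect either.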

Verify Database Server Virtual Drive Configuration

Priority                    Critical
Alert Level                 FAIL
Date                        03/07/18
Owner                       Dib Chatterjee
Status                      Production
Engineered System           Exadata - Management Domain, Exadata - Physical
Engineered System Platform  X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, X7-2
Bug(s)                      Bug 27533289-exachk, Bug 26775963-exachk, Bug 24533222-exachk, Bug 20557656-exachk
DB Version                  N/A
DB Type                     N/A
DB Role                     N/A
DB Mode                     N/A
Exadata Version             ALL
OS & Version                Linux
Validation Tool Version     exachk 18.2.0
MAA Scorecard Section       N/A

Benefit / Impact:

The impact of verifying the database server virtual drive configuration is minimal. The impact of corrective actions will vary
depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.

Risk:

Not verifying the virtual drives increases the chance of a performance degradation or an outage.

Action / Repair:

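exachk contains all the logic necessary to identify the various correct configurations. To verify the database server virtual drive
configuration, run exachk and evaluate the results.

To manually verify the database server virtual drive configuration, execute the following command set as the "root" userid on
each database server or the management domain of a virtualized environment: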

NOTE: This check is not applicable to X7-8 Oracle Exadata Database Servers as they contain no conventional disk
drives!

if [[ -d /proc/xen && ! -f /proc/xen/capabilities ]]

then
echo -e "\nThis check will not run in a user domain of a virtualized environment. Execute this check in the ma
else
if [ -x /opt/MegaRAID/storcli/storcli64 ]
then
export CMD=/opt/MegaRAID/storcli/storcli64
else
export CMD=/opt/MegaRAID/MegaCli/MegaCli64
fi
RAW_OUTPUT=$($CMD CfgDsply -aALL -nolog | egrep "Virtual Drive:|Number Of Drives|^State");
echo -e "The database server virtual drive configuration found is:\n\n$RAW_OUTPUT";
fi;

The output will be similar to:

Virtual Drive: 0 (Target Id: 0)

Number Of Drives : 4
State : Optimal

The output should match one of the combinations of entries in this table:

Database Server Virtual Drive Configurations

Engineered System                               Number of       State    Number of        Exadata
                                                Virtual Drives           Physical Drives  Version
X2-2(4170), X2-2                                1               Optimal  3                < 11.2.3.2.0
X2-8                                            1               Optimal  7                < 11.2.3.2.0
X2-2(4170), X2-2, X3-2, X4-2, X5-2, X6-2, X7-2  1               Optimal  4                >= 11.2.3.2.0
X5-2, X6-2, X7-2 (Disk Expansion Kit)           1               Optimal  8                >= 11.2.3.2.0
X2-8, X3-8                                      1               Optimal  8                >= 11.2.3.2.0
X4-8                                            1               Optimal  7                >= 11.2.3.2.0
X5-8, X6-8                                      1               Optimal  8                >= 11.2.3.2.0

NOTE: The virtual device number reported may vary depending upon configuration and version levels.

NOTE: The Disk Expansion Kit is only applicable to X5-2, X6-2, and X7-2 database servers.

NOTE: If additional virtual drives are present, it may be that the reclaimdisks procedure was not executed at
deployment time or that a bare metal restore procedure was performed without using the "dualboot=no" qualifier.
Please refer to the "Reclaiming Disks for the Linux Operating System" section of "Oracle® Exadata Database
Machine Owner's Guide, 11g Release 2 (11.2)". See also "Verify Database Server Physical Drive Configuration".

NOTE: If the database server was upgraded to 11.2.3.2.0, this check may fail because the reported number of
drives is "3" or "7". Please see the "Known Issues" #5 "Hotspare removed for compute nodes" in My Oracle
Support note 1468877.1 for corrective action.
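The three values compared against the virtual drive table (number of virtual drives, state, number of physical drives) can be pulled out of the filtered output in one pass. This is a sketch only: the function name virtual_drive_summary is hypothetical, and it assumes the single-virtual-drive layout shown in the sample output; with more than one virtual drive it reports the state and drive count of the last one.

```shell
#!/bin/bash
# Sketch only: summarize filtered CfgDsply output (read on stdin) as
# "<number of virtual drives> <state> <number of drives>".
virtual_drive_summary() {
    awk -F':' '
        /^Virtual Drive:/   { count++ }
        /^Number Of Drives/ { gsub(/[ \t]/, "", $2); drives = $2 }
        /^State/            { gsub(/[ \t]/, "", $2); state = $2 }
        END { print count + 0, state, drives }'
}

# Sample output copied from this section.
SAMPLE_OUTPUT='Virtual Drive: 0 (Target Id: 0)
Number Of Drives : 4
State : Optimal'

FOUND=$(echo "$SAMPLE_OUTPUT" | virtual_drive_summary)
EXPECTED="1 Optimal 4"    # table row chosen for illustration

if [ "$FOUND" = "$EXPECTED" ]; then
    echo "SUCCESS: virtual drive configuration matches ($FOUND)"
else
    echo "FAILURE: found ($FOUND), expected ($EXPECTED)"
fi
```

A count greater than 1 in the first field corresponds to the "additional virtual drives" case described in the notes above.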

Verify Database Server Physical Drive Configuration

Priority                    Critical
Alert Level                 FAIL
Date                        03/07/2018
Owner                       Dib Chatterjee
Status                      Production
Engineered System           Exadata - Physical, Exadata - Management Domain
Engineered System Platform  X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, X7-2
Bug(s)                      Bug 27533421-exachk, Bug 26775963-exachk, Bug 24533293-exachk, Bug 20557656-exachk
DB Version                  N/A
DB Type                     N/A
DB Role                     N/A
DB Mode                     N/A
Exadata Version             ALL
OS & Version                Linux
Validation Tool Version     exachk 18.2.0
MAA Scorecard Section       N/A

Benefit / Impact:

The impact of verifying the database server physical drive configuration is minimal. The impact of corrective actions will vary
depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.

Risk:

Not verifying the physical drives increases the chance of a performance degradation or an outage.

Action / Repair:

exachk contains all the logic necessary to identify the various correct configurations. To verify the database server physical drive
configuration, run exachk and evaluate the results.

To manually verify the database server physical drive configuration, execute the following command set as the "root" userid on
each database server or the management domain of a virtualized environment:

NOTE: This check is not applicable to X7-8 Oracle Exadata Database Servers as they contain no conventional disk
drives!

if [[ -d /proc/xen && ! -f /proc/xen/capabilities ]]

then
echo -e "\nThis check will not run in a user domain of a virtualized environment. Execute this check in the ma
else
if [ -x /opt/MegaRAID/storcli/storcli64 ]
then
export CMD=/opt/MegaRAID/storcli/storcli64
else
export CMD=/opt/MegaRAID/MegaCli/MegaCli64
fi
RAW_OUTPUT=$($CMD PDList -aALL -nolog | grep "Firmware state");
echo -e "The database server physical drive configuration found is:\n\n$RAW_OUTPUT";
fi;

The output will be similar to:

Recommended Configuration

The database server physical drive configuration found is:

Firmware state: Online, Spun Up

<output truncated for brevity>
Firmware state: Online, Spun Up

The output should match one of the combinations of entries in this table:

Database Server Physical Drive Configurations

Engineered System                               Online  Spun Up  Hotspare  Spun Down  Exadata Version
X2-2(4170), X2-2                                3       3        1         1          < 11.2.3.2.0
X2-8                                            7       7        1         1          < 11.2.3.2.0
X2-2(4170), X2-2, X3-2, X4-2, X5-2, X6-2, X7-2  4       4        0         0          >= 11.2.3.2.0
X5-2, X6-2, X7-2 (Disk Expansion Kit)           8       8        0         0          >= 11.2.3.2.0
X2-8, X3-8                                      8       8        0         0          >= 11.2.3.2.0
X4-8                                            7       7        0         0          >= 11.2.3.2.0
X5-8, X6-8                                      8       8        0         0          >= 11.2.3.2.0

If the reported output differs, investigate and correct the condition.

NOTE: The Disk Expansion Kit is only applicable to X5-2, X6-2, and X7-2 database servers.
NOTE: If the database server was upgraded to 11.2.3.2.0, this check may fail because one of the devices shows a
state of: "Unconfigured(good), Spun Up". Please see the "Known Issues" #5 "Hotspare removed for compute
nodes" in My Oracle Support note 1468877.1 for corrective action.

Alternate Configuration

For an X2-2(4170), X2-2, or X2-8 database server which is running an Exadata software version lower than 11.2.3.2.0 that is
being upgraded to an Exadata software version of 11.2.3.2.1 or higher, an alternate configuration is permitted. The alternate
configuration for an X2-2(4170) or X2-2 uses 3 disks in the RAID set with 1 disk as a hot spare. The alternate configuration for
an X2-8 uses 7 disks in the RAID set with 1 disk as a hot spare.

The output should be similar to:

Firmware state: Online, Spun Up

<output truncated for brevity>
Firmware state: Hotspare, Spun down

For an X2-2(4170) or X2-2, the expected output should contain three lines of output showing a state of "Online, Spun Up", and
one line showing a state of "Hotspare, Spun down". For an X2-8, the expected output should contain seven lines of output
showing a state of "Online, Spun Up", and one line showing a state of "Hotspare, Spun down". In either case, the ordering of
the output lines is not significant and may vary based upon a given database server's physical drive replacement history.

If the reported output differs, investigate and correct the condition.
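Because the ordering of the "Firmware state" lines is not significant, counting them is enough to tell the recommended and alternate configurations apart. The following sketch assumes the line formats shown in this section; the function name firmware_state_counts and the sample data are illustrative only.

```shell
#!/bin/bash
# Sketch only: count "Firmware state" lines read on stdin.
# Prints "<online, spun up> <hotspare>".
firmware_state_counts() {
    awk '
        /Firmware state: Online, Spun Up/ { online++ }
        /Firmware state: Hotspare/        { hotspare++ }
        END { print online + 0, hotspare + 0 }'
}

# Sample data: the alternate configuration for an X2-2 (3 online, 1 hot spare).
SAMPLE_OUTPUT='Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Hotspare, Spun down'

COUNTS=$(echo "$SAMPLE_OUTPUT" | firmware_state_counts)
case "$COUNTS" in
    "4 0") echo "Recommended configuration (4 online, no hot spare)" ;;
    "3 1") echo "Alternate configuration (3 online, 1 hot spare)" ;;
    *)     echo "Unexpected drive states: $COUNTS" ;;
esac
```

The two patterns in the case statement cover only the X2-2 rows; other platforms would need the counts from their own table rows. Any other state, such as "Unconfigured(good), Spun Up", lowers both counts and falls through to the last branch.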

NOTE: Modified 03/21/12

Occasionally in normal operation, the "Hotspare" physical drive may be brought to a state of "Online, Spun Up".
Thirty minutes (default) after the operation that brought the drive to "Online, Spun Up" has completed, the drive
should spin down due to the power-saving feature. There is no harm in the drive being "Online, Spun Up" if there
are no other errors reported in the disk drive configuration checks.

For additional information, please reference My Oracle Support note "Exadata: Hot Spares Not Spinning Down (Doc
ID 1403613.1)"

Verify database server disk controllers use writeback cache

Priority                    Critical
Alert Level                 FAIL
Date                        03/07/18
Owner
Status                      Production
Engineered System           Exadata - Physical, Exadata - Management Domain
Engineered System Platform  X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, X7-2
Bug(s)                      Bug 27523948 - exachk
DB Version                  N/A
DB Type                     N/A
DB Role                     N/A
DB Mode                     N/A
Exadata Version             ALL
OS & Version                Linux
Validation Tool Version     exachk 18.2.0
MAA Scorecard Section       N/A

Benefit / Impact:

Database servers use an internal RAID controller with a battery-backed cache to host local filesystems. For maximum
performance when writing I/O to local disks, the battery-backed cache should be in "WriteBack" mode.

The impact of configuring the battery-backed cache in "WriteBack" mode is minimal.

Risk:

Not configuring the battery-backed cache in "WriteBack" mode will result in degraded performance when writing I/O to the local
database server disks.

Action / Repair:

To verify that the disk controller battery-backed cache is in "WriteBack" mode, run the following set of commands as the "root"
userid on all database servers:

NOTE: This check is not applicable to X7-8 Oracle Exadata Database Servers as they contain no conventional disk
drives!

unset NON_WRITEBACK
if [ -x /opt/MegaRAID/storcli/storcli64 ]
then
export CMD=/opt/MegaRAID/storcli/storcli64
else
export CMD=/opt/MegaRAID/MegaCli/MegaCli64
fi
RAW_OUTPUT=$($CMD -CfgDsply -a0 -nolog | egrep -i "Virtual Drive:|Current Cache Policy:" | grep -v Number | sed '
NON_WRITEBACK=$(echo -n "$RAW_OUTPUT" | grep -vi writeback)
if [ -z "$NON_WRITEBACK" ]
then
echo -e "SUCCESS: All virtual drives have \"Current Cache Policy\" set to \"WriteBack\"."
else
echo -e "FAILURE: One or more virtual drives do not have \"Current Cache Policy\" set to \"WriteBack\". Detail
fi

The output should be:

SUCCESS: All virtual drives have "Current Cache Policy" set to "WriteBack".

Example of a "FAILURE:" result:

FAILURE: One or more virtual drives do not have "Current Cache Policy" set to "WriteBack". Details:

Virtual Drive: 0 (Target Id: 1) Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad
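The check works by joining each "Virtual Drive:" line with its "Current Cache Policy:" line and then keeping only the joined lines that do not mention WriteBack. The sketch below illustrates that idea on sample text. It is an assumption-laden stand-in: the awk join is not the original expression used in the command set, the function names are hypothetical, and the sample policy strings are shortened.

```shell
#!/bin/bash
# Sketch only: pair each virtual drive with its cache policy, then filter.

# Join "Virtual Drive:" and "Current Cache Policy:" lines (read on stdin)
# into one line per virtual drive.
join_drive_policy() {
    awk '
        /Virtual Drive:/        { drive = $0 }
        /Current Cache Policy:/ { print drive " " $0 }'
}

# Keep only the joined lines that are not in WriteBack mode.
# "|| true" keeps the exit status clean when every drive is in WriteBack mode.
non_writeback() {
    grep -vi 'writeback' || true
}

# Illustrative sample: drive 0 is in WriteBack mode, drive 1 is not.
SAMPLE_OUTPUT='Virtual Drive: 0 (Target Id: 0)
Current Cache Policy: WriteBack, ReadAheadNone, Direct
Virtual Drive: 1 (Target Id: 1)
Current Cache Policy: WriteThrough, ReadAheadNone, Direct'

NON_WRITEBACK=$(echo "$SAMPLE_OUTPUT" | join_drive_policy | non_writeback)
if [ -z "$NON_WRITEBACK" ]; then
    echo "SUCCESS: all virtual drives are in WriteBack mode"
else
    echo "FAILURE: not in WriteBack mode: $NON_WRITEBACK"
fi
```

Joining before filtering is what lets the failure message name the virtual drive rather than only the offending policy string.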

If the battery-backed cache is not in "WriteBack" mode, run these commands as the "root" userid on the affected database
server to place the battery-backed cache into "WriteBack" mode:

if [ -x /opt/MegaRAID/storcli/storcli64 ]
then
export CMD=/opt/MegaRAID/storcli/storcli64
else
export CMD=/opt/MegaRAID/MegaCli/MegaCli64
fi
$CMD -LDSetProp WB -Lall -a0 -nolog
$CMD -LDSetProp NoCachedBadBBU -Lall -a0 -nolog
$CMD -LDSetProp NORA -Lall -a0 -nolog
$CMD -LDSetProp Direct -Lall -a0 -nolog

NOTE: No settings should be modified on Exadata storage cells. The mode described above applies only to
database servers in an Exadata database machine.

Verify that "Disk Cache Policy" is set to "Disabled"

Priority Added Machine Type OS Type Exadata Version Oracle Version

Critical 06/13/11 X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 Linux 11.2.x + 11.2.x +

"Disk Cache Policy" is set to "Disabled" by default at imaging time and should not be changed because the cache created by
setting "Disk Cache Policy" to "Enabled" is not battery backed. It is possible that a replacement drive
has the disk cache policy enabled, so it is a good idea to check this setting after replacing a drive.

The impact of verifying that "Disk Cache Policy" is set to "Disabled" is minimal. The impact of suddenly losing power with "Disk
Cache Policy" set to anything other than "Disabled" will vary according to each specific case,
and cannot be estimated here.

Risk:

If the "Disk Cache Policy" is not "Disabled", there is a risk of data loss in the event of a sudden power loss because the cache
created by "Disk Cache Policy" is not backed up by a battery.

Action / Repair:

To verify that "Disk Cache Policy" is set to "Disabled" on all servers, use the following command as the "root" userid on the first
database server in the cluster:

unset TMP_RSLT;
# Count "Disk Cache Policy" lines across all servers that are not "Disabled".
TMP_RSLT=$(dcli -g /opt/oracle.SupportTools/onecommand/all_group -l root \
"if [ -x /opt/MegaRAID/MegaCli/MegaCli64 ]; then /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aALL -nolog; else /opt/MegaRAID/MegaCli -LdPdInfo -aALL -nolog; fi;" \
| grep -i 'Disk Cache Policy' | grep -v Disabled | wc -l)
if [ "$TMP_RSLT" = 0 ]
then
echo -e "\nSUCCESS\n"
else
echo -e "\nFAILURE:";
# Show the offending lines, prefixed by dcli with the server name.
dcli -g /opt/oracle.SupportTools/onecommand/all_group -l root \
"if [ -x /opt/MegaRAID/MegaCli/MegaCli64 ]; then /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aALL -nolog; else /opt/MegaRAID/MegaCli -LdPdInfo -aALL -nolog; fi;" \
| grep -i 'Disk Cache Policy' | grep -v Disabled;
echo -e "\n";
fi;

The output should be:

SUCCESS

If anything other than "SUCCESS" is returned, identify the LUN(s) in question and reset the "Disk Cache Policy" to "Disabled"
using the following commands as the "root" userid on the server that reported the issue (where Lx= the lun in question, for
example: L2):

if [ -x /opt/MegaRAID/MegaCli/MegaCli64 ]
then
#Linux
export TMP_CMD=/opt/MegaRAID/MegaCli/MegaCli64
else
#Solaris
export TMP_CMD=/opt/MegaRAID/MegaCli
fi;
$TMP_CMD -LDSetProp -DisDskCache -Lx -a0 -nolog

Note: The "Disk Cache Policy" is completely separate from the disk controller caching mode of "WriteBack". Do not
confuse the two. The cache created by "WriteBack" cache mode is battery-backed; the cache created by "Disk Cache Policy" is
not!
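When the check reports a failure, the dcli output carries the server name as a prefix on each line, so the servers that need attention can be listed directly. The following is a sketch under assumptions: the function name servers_with_disk_cache is hypothetical, and the host names and policy values in the sample are made up for illustration.

```shell
#!/bin/bash
# Sketch only: list the servers whose dcli-prefixed output (read on stdin)
# shows a "Disk Cache Policy" other than "Disabled".
servers_with_disk_cache() {
    grep -i 'Disk Cache Policy' | grep -v 'Disabled' | cut -d: -f1 | sort -u
}

# Illustrative dcli-style output; host names are hypothetical.
SAMPLE_OUTPUT='exampledb01: Disk Cache Policy   : Disabled
exampledb02: Disk Cache Policy   : Enabled
examplecel01: Disk Cache Policy   : Disabled
exampledb02: Disk Cache Policy   : Enabled'

echo "$SAMPLE_OUTPUT" | servers_with_disk_cache
```

The LUN in question still has to be identified on each listed server before the repair command is run there.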

Verify service exachkcfg autostart status

Priority Alert Date Owner Status Scope Bug(s)

Level
Critical FAIL 05/14/2014 <Name> Production Exadata, SSC, Exalogic 18735585-
exachk
DB DB Role Engineered System Exadata OS & Validation Tool TBD
Version Version Version Version
N/A N/A X2-2(4170), X2-2, X2-8, X3-2, X3-8, 11.2.2.2.0+ Linux x86-64 exachk 2.2.5
X4-2

Benefit / Impact:

Verifying the exachkcfg service autostart status helps to avoid an unexpected modification attempt and possibly lengthened boot
sequence. The impact of verifying the exachkcfg service autostart status is minimal.

Risk:

On either a database or storage server, a required maintenance operation or an incorrect configuration change might be missed.

Action / Repair:

To verify the exachkcfg service autostart status, execute the following command as the "root" userid on all storage and database
servers:

chkconfig --list exachkcfg;

The output should be similar to:

exachkcfg 0:off 1:off 2:off 3:on 4:off 5:off 6:off

For either a database or storage server, run level 3 should be "on" (3:on).

It should be rare to find this not set as expected. Should a correction be required, as the root userid, use the "chkconfig --level"
command. For example, to set run level "3" for exachkcfg to "on" for a database server with Exadata image version >=
11.2.3.3.0:

[root@randomdb03 ~]# chkconfig --level 3 exachkcfg on

As another example, to set run level "3" for exachkcfg to "off" for a database server with Exadata image version <
11.2.3.3.0:

[root@randomdb03 ~]# chkconfig --level 3 exachkcfg off

NOTE: At Exadata image versions below 11.2.3.3.0, on a database server all run levels should be set to "off", and
on a storage server, at least one run level should be set to "on" (number varies by Exadata software version).
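Reading a single run level out of the chkconfig listing can be scripted so the result is compared with the expected state instead of being checked by eye. This is a minimal sketch; the function names runlevel_state and check_runlevel are hypothetical, and the sample line is the output shown in this section.

```shell
#!/bin/bash
# Sketch only: extract and check one run level from "chkconfig --list" output.

# Print the state ("on" or "off") of one run level.
# $1 = one line of "chkconfig --list exachkcfg" output, $2 = run level number
runlevel_state() {
    echo "$1" | awk -v level="$2" '{
        for (i = 2; i <= NF; i++) {
            split($i, kv, ":")
            if (kv[1] == level) print kv[2]
        }
    }'
}

# Compare one run level with the expected state.
# $1 = chkconfig output line, $2 = run level, $3 = expected state
check_runlevel() {
    local found
    found=$(runlevel_state "$1" "$2")
    if [ "$found" = "$3" ]; then
        echo "SUCCESS: run level $2 is $found"
    else
        echo "FAILURE: run level $2 is ${found:-missing}, expected $3"
    fi
}

# Example using the sample output from this section.
check_runlevel "exachkcfg 0:off 1:off 2:off 3:on 4:off 5:off 6:off" 3 on
```

The expected state is passed in rather than hard-coded because, as the note above explains, it depends on the server type and the Exadata image version.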

Check alerthistory for test open stateless alerts

Priority                    Critical
Alert Level                 FAIL
Date                        11/01/2017
Owner                       <Name>
Status                      Production
Engineered System           Exadata - Physical, Exadata - Management Domain
Engineered System Platform  ALL
Bug(s)                      26651210 - exachk, 21299854 - exachk
GI/DB Version               N/A
DB Type                     N/A
DB Role                     N/A
DB Mode                     N/A
Exadata Version             ALL
OS & Version                Linux
Validation Tool Version     exachk 12.2.0.1.4
MAA Scorecard Section       N/A

Benefit / Impact:

There are two types of alerts maintained in the alerthistory of a storage or database server, stateful and stateless.

The benefit of checking for test open stateless alerts is a less cluttered alerthistory. The impact of acknowledging any test
open stateless alert is minimal.

Risk:

Unnecessary test alerts maintained in the alerthistory.

Action / Repair:

To verify there are no test open stateless alerts, as the root userid on each storage and database server execute the following
commands:

unset IMAGE_VERSION
unset NODE_TYPE
unset COMMAND_NAME
unset NAME_ARRAY
unset INDIVIDUAL_NAME
unset SID
unset SEVERITY
unset MESSAGE
unset ACTION
unset OUTPUT_ARRAY
if [ `egrep -i node.type /opt/oracle.cellos/cell.conf | grep -i db | wc -l` -eq 1 ]
then NODE_TYPE=db
else
NODE_TYPE=cell
fi
IMAGE_VERSION=$(imageinfo -version |tr -d '.'|cut -c1-6)
if [ $NODE_TYPE = "cell" ]
then
COMMAND_NAME=cellcli
else
if [ $IMAGE_VERSION -ge 121211 ]
then COMMAND_NAME=dbmcli
fi;
fi;
if [ -n "$COMMAND_NAME" ]
then
NAME_ARRAY=$($COMMAND_NAME -e list alerthistory attributes name,alertmessage where alerttype=stateless and exam
if [ -z "$NAME_ARRAY" ]
then
echo -e "SUCCESS: there are no test open stateless alerts."
else
for INDIVIDUAL_NAME in $NAME_ARRAY
do
NAME_RECORD=$($COMMAND_NAME -e "list alerthistory attributes alertsequenceid,severity,alertMessage,alertAct
SID=$(echo "$NAME_RECORD" | cut -d" " -f1)
SEVERITY=$(echo "$NAME_RECORD" | cut -d" " -f2)
MESSAGE=$(echo "$NAME_RECORD" | cut -d'"' -f2)
ACTION=$(echo "$NAME_RECORD" | cut -d'"' -f4)
OUTPUT_ARRAY+=$(echo -e "\n";echo -e "SID:\t\t$SID";echo -e "NAME:\t\t$INDIVIDUAL_NAME";echo -e "SEVERITY:\

done
echo -e -n "FAILURE: there are one or more test open stateless alerts that have not been cleared. Details:"
echo -e "${OUTPUT_ARRAY[@]}"
fi
else
echo "alerthistory is not available on database servers at image versions below 12.1.2.1.1: $NODE_TYPE $IMAGE_V
fi

The output should be similar to:

SUCCESS: there are no test open stateless alerts.

- OR -

alerthistory is not available on database servers at image versions below 12.1.2.1.1: db 112322

If the output is not as expected, examine the full details for each name that has not been cleared and follow the
recommendations.

Example of a FAILURE result:

FAILURE: there are one or more test open stateless alerts that have not been cleared. Details:
SID: 2
NAME: 2
SEVERITY: info
MESSAGE: "This is a test trap"
ACTION:

To acknowledge a test open stateless alert, manually set the "examinedby" field with a command similar to the following
(command name is either cellcli or dbmcli, depending upon whether a storage or database server is involved):

CellCLI> alter alerthistory 2 examinedby="jdoe"

Alert 2 successfully altered

Where jdoe is the name of the person who verified the test open stateless alert, and the number is the name of the stateless
alert. Note that double quotes are used around the value to be set, but not the name of the stateless alert.
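The command set in this section picks cellcli or dbmcli by reducing the image version to its first six digits and comparing the result with 121211. The sketch below isolates that step so it can be tried on sample version strings; the function names image_version_key and alerthistory_cli are hypothetical, and the version strings are illustrative.

```shell
#!/bin/bash
# Sketch only: reproduce the version-to-command selection on sample inputs.

# Reduce a dotted image version to its first six digits,
# in the same way as "imageinfo -version | tr -d '.' | cut -c1-6".
image_version_key() {
    echo "$1" | tr -d '.' | cut -c1-6
}

# Print the CLI that provides alerthistory, or nothing if none applies.
# $1 = node type ("cell" or "db"), $2 = dotted image version
alerthistory_cli() {
    local key
    key=$(image_version_key "$2")
    if [ "$1" = "cell" ]; then
        echo "cellcli"
    elif [ "$key" -ge 121211 ]; then
        echo "dbmcli"
    fi
}

# Examples with illustrative version strings.
echo "cell 12.1.2.1.1 -> $(alerthistory_cli cell 12.1.2.1.1)"
echo "db   12.1.2.1.1 -> $(alerthistory_cli db 12.1.2.1.1)"
echo "db   11.2.3.2.2 -> $(alerthistory_cli db 11.2.3.2.2)"
```

The last example prints an empty command name, which corresponds to the "alerthistory is not available" message. One limitation of the six-digit key is worth keeping in mind: it assumes single-digit components after the first, so a version with a two-digit component in the middle would shift the later digits and could compare incorrectly.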

Revision History

Date               Change

Nov 23 2016        Hidden Parameters Table MAA Nov 23 2016

Nov 16 2016        MAA Nov 16 2016 Verify There Are No Memory (ECC) Errors

Oct 5 2016         Check /EXAVMIMAGES on dom0s for possible over allocation by sparse files

Oct 5 2016         Verify InfiniBand Address Resolution Protocol (ARP) Configuration on Database Servers

Aug 22 2016        Verify "_reconnect_to_cell_attempts=9" on database servers which access X6 storage servers

April 6 2016       Detect duplicate files in /etc/init directories
                   Verify Initialization parameters and diskgroup attributes
                   Verify RAID disk controller CacheVault Capacitor condition
                   Verify Exadata Smart Flash Cache is created

March 23 2016      Verify Ambient Air Temperature – improved existing section

March 16 2016      Verify database server file systems have "Maximum mount count" = "-1"
                   Verify database server file systems have "Check interval" = "0"

February 17 2016   Verify Datafiles are Placed on Diskgroups consisting of griddisks with unset cachedBy
                   attribute – updated to only check when flashcache in WriteBack mode

February 5 2016    Adding:
                   Verify storage server data (non-system) disks have no partitions
                   Verify db_unique_name is used in I/O Resource Management (IORM) interdatabase plans
                   Verify Datafiles are Placed on Diskgroups consisting of griddisks with cachingPolicy = DEFAULT
                   Verify Datafiles are Placed on Diskgroups consisting of griddisks with unset cachedBy attribute

January 26 2016    Adding X4-8/X5-2 to the list of supported platforms

Feb 10 2017        Consolidation Parameters Reference Table – updates to parallel parameters row, and removal
                   of unneeded Exadata platform specific resource references
Mar 01 2017        Verify "downdelay" is correctly set for bonded client interfaces – improved with more checks
                   Verify Storage Server user "CELLDIAG" exists – improved with prompt for password

Mar 14 2017        (1) Verify RDS Protocol over InfiniBand Network is used – existing section improved
                   (2) Verify all Database and Storage Servers are synchronized with the same NTP server –
                       existing section improved

Mar 27 2017        (1) Check /EXAVMIMAGES on dom0s for possible over allocation by sparse files –
                       converted to new style using exachk -check

Apr 4 2017         (1) Verify ExaWatcher is executing
                   (2) Verify non-Default services are created for all Pluggable Databases

Apr 27 2017        (1) Verify "diagsnap.pl" is not executing

Jun 7 2017         (1) Verify Hidden Initialization Parameter Usage – updated version for
                       _parallel_adaptive_max_users to include 12.2
                   (2) Verify IP routing configuration on database servers
                   (3) Verify Grid Infrastructure Management Database (MGMTDB) configuration

Jun 29 2017        (1) Verify Automatic Storage Management Cluster File System (ACFS) is on a separate Disk Group

July 12 2017       (1) Ensure Temporary Tablespace is correctly defined (ARCHIVE) – archived; confirmed
                       deployment template has key attributes that are still valid today
                   (2) Verify ASM Diskgroup Attributes for 12.2.0.x – new
                   (3) Verify the SYSTEM, SYSAUX, USERS and TEMP tablespaces are of type bigfile – new
                   (4) Verify ASM Diskgroup Attributes for 12.1.0.x – updated to have ">=" for repair timers
                   (5) Verify the ownership and permissions of the "oradism" file – updated to execute as
                       software owner instead of root
                   (6) Verify all "BIGFILE" tablespaces have non-default "MAXBYTES" values set (ARCHIVE) –
                       archived; relying on other tools like EM to handle the problem this was originally
                       created to solve


July 19 2017
July 26 2017
Sep 9 2017         (1) Verify Hidden Initialization Parameter Usage – added _asm_max_connected_clients as
                       acceptable in 12.2.0.1

Oct 10 2017        (1) Verify the recommended patches for Adaptive features are installed
                   (2) Verify that griddisks are distributed as expected across celldisks
                   (3) Verify Exadata Smart Flash Cache is Created
                   (4) Verify Database Server Disk Controller Configuration
                   (5) Verify Database Server Virtual Drive Configuration

Oct 28 2017        (1) Verify Database Server Physical Drive Configuration
                   (2) Verify Grid Infrastructure Management Database (MGMTDB) configuration (ARCHIVE) – archived

Nov 22 2017        (1) Verify that griddisks are distributed as expected across celldisks – update; added
                       exception for griddisk RA prefix "CATALOG"
                   (2) Check alerthistory for non-test open stateless alerts & Check alerthistory for test
                       open stateless alerts – update; improved formatting

Dec 1 2017         (1) Verify that griddisks are distributed as expected across celldisks – update; added
                       exception for griddisk RA prefix "CATALOG"
                   (2) Check alerthistory for non-test open stateless alerts & Check alerthistory for test
                       open stateless alerts – update; improved formatting
                   (3) Verify initialization parameter cluster_database_instances is at the default value
                   (4) Verify the database server NVME device configuration
                   (5) Verify celldisk configuration on flash memory devices

Jan 25 2018        (1) Verify "diagsnap.pl" is not executing - update; added repair operation
                   (2) Verify all Database and Storage Servers are synchronized with the same NTP server –
                       update; retrofitted for exachk
                   (3) Verify that Automatic Storage Management Cluster File System (ACFS) uses 4K metadata
                       block size
                   (4) Verify database server quorum disks configuration

Mar 08 2018 (1) Verify RAID disk controller CacheVault

capacitor condition

(2) Modified - Verify the storage servers in use

configuration matches across the cluster

(3) Modified - Verify database server disk

controllers use writeback cache

(4) Verify Database Server Virtual Drive

Configuration

(5) Verify Database Server Physical Drive

Configuration

Mar 21 2018 (1) Check cell BIOS state for restore pending
status (ARCHIVE) – archived

Apr 21 2018 (1) Evaluate Automated Maintenance Tasks

configuration -new BP added.

May 15 2018 (1) Verify proper ACFS drivers are installed for
Spectre v2 mitigation

Jun 7 2018 (1) Verify "diagsnap.pl" is not executing

(ARCHIVE) – archived; we have coverage in
critical issue DB41

(2) Verify memlock is 90% of phys ram when

huge pages are enabled (ARCHIVE) – archived;
orachk will retain memlock check for hugepages

Jun 28 2018 (1) Verify Exafusion Memory Lock Configuration

Jul 13 2018 (1) included release 18c for

_asm_max_connected_clients

Aug 14 2018 (1) Verify Hidden Initialization Parameter Usage

- update; consolidated all recommendations
around hidden parameters into one section

(2) Verify there are no unhealthy InfiniBand

switch sensors

(3) Verify RAID disk controller CacheVault

capacitor condition

(4) Verify RAID Disk Controller Battery Condition

(5) Verify RAID Controller Battery Temperature

(ARCHIVE)

Sep 26 2018 (1) Verify the InfiniBand Fabric Topology (verify-topology)

(2) Refer to MOS 1682501.1 if non-Exadata components are in use on the InfiniBand fabric

(3) Verify Database Server Disk Controller Configuration (ARCHIVE) - will not run in 18.1 and higher

(4) Verify Database Server Virtual Drive Configuration (ARCHIVE) - will not run in 18.1 and higher

(5) Verify Database Server Physical Drive Configuration (ARCHIVE) - will not run in 18.1 and higher

(6) Verify Common Instance Database Initialization Parameters for 12.1.0.x & Verify Common Instance Database Initialization Parameters for 12.2.0.1 - expand existing audit_trail and control_files checks

Sep 27 2018 (1) Verify database server disk controllers use "WriteBack" cache (ARCHIVE) - no longer needed in Exadata 18.1 and higher

(2) Verify that "Disk Cache Policy" is set to "Disabled" (ARCHIVE) - no longer needed in Exadata 18.1 and higher

(3) Verify service exachkcfg autostart status (ARCHIVE) - no longer needed in Exadata 19.1 and higher

Oct 3 2018 (1) Verify Hidden Initialization Parameter Usage - update; adjusted _backup_disk_bufcnt, _backup_disk_bufsz, _backup_file_bufcnt,

Dec 18 2018 (1) Verify active kernel version matches expected version for installed Exadata Image -- OL7 support added

(2) Verify installed rpm(s) kernel type match the active kernel version -- OL7 support added

(3) Verify the Master Subnet Manager is running on an InfiniBand switch -- OL7 support added

(4) Verify the Subnet Manager is properly disabled -- OL7 support disabled.

Feb 13 2018 (1) Verify the storage servers in use configuration matches across the cluster

Apr 20 2019 (1) Verify the ib_sdp module is not loaded into the kernel

May 03 2019 (1) Verify Hidden Initialization Parameter Usage - update; improved wording for _enable_numa_support

(2) Verify the vm.min_free_kbytes configuration - update; improved logic, making it NUMA-aware and increasing the value accordingly

(3) Verify all database and storage servers time server configuration - update to cover mixed ntp/chrony case

Jul 11 2019 (1) Verify all voting disks are online & Verify database server quorum disks configuration - improved existing sections

(2) Verify all database and storage servers time server configuration - update to cover mixed ntp/chrony case

(3) Verify Automatic Storage Management Cluster File System (ACFS) file systems do not contain critical database files - improved existing section

(4) Verify the recommended patches for Adaptive features are installed - improved existing section

(5) Check alerthistory for stateful alerts not cleared - improved existing section & Check alerthistory for non-test open stateless alerts - improved existing section

(6) Check alerthistory for test open stateless alerts (ARCHIVE)

Sep 18 2019 (1) Verify available ksplice fixes are installed

(2) Verify Automatic Storage Management Cluster File System (ACFS) file systems do not contain critical database files - improved the existing section

REFERENCES

NOTE:1351559.1 - IDT switch on the PCI riser has a problem resulting in occasional loss of connectivity to pair of flash cards on the cells
NOTE:401749.1 - Oracle Linux: Shell Script to Calculate Values Recommended Linux HugePages / HugeTLB Configuration
NOTE:1284070.1 - Updating key software components on database hosts to match those on the cells
NOTE:1298957.1 - Manage Audit File Directory Growth with cron
NOTE:1286796.1 - rp_filter for multiple private interconnects and Linux Kernel 2.6.32+
NOTE:359515.1 - Mount Options for Oracle files for RAC databases and Clusterware when used with NFS on NAS devices
NOTE:1351036.1 - How to Validate and Fix Proper ASM Failure Group Configuration on Oracle Exadata Database Machine
NOTE:1188080.1 - Steps to shut down or reboot an Exadata storage cell without affecting ASM


Attachments
script to check oratab (1.74 KB)