Data Domain - Historical Database Pruning Failed After Upgrading To DDOS 6.1 or 6.2 - Dell US
Data Domain: Historical database pruning failed after upgrading to DDOS 6.1 or 6.2
Summary: How to address the historical database failing to prune after certain DDOS upgrades, which eventually results in the
DD OS /ddr/ partition becoming full and causing downtime
Article Content
Symptoms
Data Domain Restorers (DDRs) use a historical database to record events (such as various performance metrics) over time.
This database is a SQLite database file held under /ddr/hd, for example:
/ddr/hd:
-rw-r--r-- 1 root root 1490051072 Jun 8 07:23 dd_hd.db
Note that the /ddr/hd directory sits on the /ddr file system:
Once a day, cron starts a 'prune' job that removes old or unnecessary data from the historical database to prevent it from growing too large.
Under certain circumstances, however, this job can fail to run (particularly after some upgrades to 6.2.x code). When this happens,
corresponding messages are logged in messages.engineering:
Feb 4 12:10:00 dd2500-01 hdc: INFO: SQLITE: rc=1, no such table: hd_perf_clgrp
Feb 4 12:10:00 dd2500-01 hdc: INFO: Error: <5146: Internal error accessing historical database. Diagnostics:
Operation(prepare stmt), Reported by(hdal), Location(hdal_db_sqlite3.c, 237, _get_object_meta_info), Database
error(sqlite3, 1, no such table: hd_perf_clgrp), ddr/lib/dd_hd_rdb_sqlite.c, dd_hd_rdb_report_error_sqlite, 3825>
Feb 4 12:12:02 dd2500-01 dd_hd_rdb_tool: INFO: Event posted: p0-19 (11000013:285212691): EVT-HD-RDB-0004: Historical
database pruning failed.
If pruning fails repeatedly, old or unnecessary data cannot be removed from the historical database, so it
slowly grows in size.
Ultimately this can cause the /ddr file system to become 100% full (either transiently during attempted pruning or permanently
if the historical database grows large enough). This causes further issues, such as the system being unable
to update registry files, and can lead to system instability or unexpected DDFS restarts.
Cause
Pruning the historical database generally requires as much free space in the /ddr file system as the current size of the historical
database. If /ddr/ has become nearly full, the pruning job may fail, which only adds to disk
space consumption on /ddr/ over time.
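The free-space requirement described above can be checked with a short shell sketch. This is only an illustration, assuming a Bash-like shell with the standard wc, df and awk utilities; the function name has_room_to_prune is hypothetical and is not part of DD OS.

```shell
# Sketch (hypothetical helper, not a DD OS tool): succeed when a file system
# has at least as much free space as the historical database occupies.
# Usage: has_room_to_prune DB_FILE [DIRECTORY]
has_room_to_prune() {
    db_file=$1
    # By default, check the file system that holds the database itself.
    check_dir=${2:-$(dirname "$db_file")}
    # Database size in bytes, rounded up to whole KiB.
    db_bytes=$(wc -c < "$db_file") || return 2
    db_kib=$(( (db_bytes + 1023) / 1024 ))
    # Available KiB; -P keeps each file system on one line, -k reports KiB.
    free_kib=$(df -Pk "$check_dir" | awk 'NR==2 {print $4}')
    echo "db=${db_kib}KiB free=${free_kib}KiB"
    [ "$free_kib" -ge "$db_kib" ]
}
```

For example, has_room_to_prune /ddr/hd/dd_hd.db checks /ddr itself, while passing a second argument checks a different directory, such as a candidate destination for a copy of the database.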
In this case, however, the focus is on pruning of the historical database failing for reasons other than a prior lack of space. Something in
the sequence of DDOS upgrades, or problems encountered during those upgrades, may have left a table missing from the historical DB.
The SQLite validation check does not catch this missing table. The missing table is typically "hd_perf_clgrp" or
"hd_space_fmig_runs_detailed", as seen below:
Feb 4 12:10:00 dd2500-01 hdc: INFO: Error: <5146: Internal error accessing historical database.
Diagnostics: Operation(prepare stmt), Reported by(hdal), Location(hdal_db_sqlite3.c, 237,
_get_object_meta_info), Database error(sqlite3, 1, no such table: hd_perf_clgrp),
ddr/lib/dd_hd_rdb_sqlite.c, dd_hd_rdb_report_error_sqlite, 3825>
If this error, or a similar error in messages.engineering for another missing table, keeps appearing, or if alerts about the historical database
not pruning daily are raised, contact your contracted support provider as soon as possible to have the underlying issue resolved.
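Whether one of the tables named above is present can be checked by querying the SQLite schema table. The sketch below is an assumption-laden illustration: it uses the plain sqlite3 command-line shell (which may need different options on a DD), the function name table_exists is hypothetical, and it should only be pointed at a copy of the database.

```shell
# Sketch (hypothetical helper): succeed when TABLE_NAME exists in DB_FILE.
# Usage: table_exists DB_FILE TABLE_NAME
# TABLE_NAME is pasted into the SQL text, so pass only trusted, fixed names.
table_exists() {
    # sqlite_master lists every table, index and trigger in the database.
    found=$(sqlite3 "$1" \
        "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='$2';") \
        || return 2
    [ "$found" -eq 1 ]
}
```

For example, table_exists /ddr/var/tools/dd_hd.db hd_perf_clgrp returns a non-zero status when that table is missing from a copy held under /ddr/var/tools.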
Resolution
Data Domain Support will most likely need a remote session to troubleshoot the root cause of the
historical database not pruning. If the pruning issue is left unresolved, the disk space used on the smaller /ddr/
partition keeps increasing until the partition fills up, potentially making the DD unavailable and causing downtime.
Unless the pruning problem is the result of /ddr/ already being nearly full, the resolution involves fixing the historical database
structure and, in some cases, running the pruning action off the small /ddr/ partition so that the process does not run out of space. Neither
of these actions requires downtime or affects running backups or other DD activities. The only downside is that some
historical and performance entries may be dropped while the database is being made consistent or pruned on a
separate, larger partition.
Note that the code issue behind the trigger that prevents the historical database upgrade (and hence causes the daily pruning failures)
was fixed in the code for the following releases:
Hence, any customer planning to upgrade to DDOS 6.1 or DDOS 6.2 for the first time is strongly advised to upgrade directly to one of the
fixed releases mentioned above.
Note: A DD OS upgrade will not resolve the issue if the Data Domain already has historical database pruning errors.
Additional Information
https://downloads.dell.com/TranslatedPDF/PT-BR_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/ZH-CN_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/ES_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/DE_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/FR_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/IT_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/JA_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/NL_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/KO_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/RU_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/PT_KB531069.pdf
https://downloads.dell.com/TranslatedPDF/SV_KB531069.pdf
Errors raised while pruning the historical database are logged in "messages.engineering", so "log view debug/messages.engineering" can
be used to search for them, bearing in mind that the pruning job is scheduled to start at 12:12 PM local DD time every day. It is useful to
provide DD Support with any matching errors, or a full SUB, up front for analysis.
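Where a shell and a copy of the log file are available, the matching errors can be collected with grep. This is a minimal sketch: the function name find_prune_errors is hypothetical, and the patterns are taken only from the log excerpts quoted in this article, so other failure messages may not be matched.

```shell
# Sketch (hypothetical helper): print log lines that indicate a failed prune
# of the historical database.
# Usage: find_prune_errors LOG_FILE
find_prune_errors() {
    # -E enables alternation between the three patterns.
    grep -E 'EVT-HD-RDB-0004|no such table: hd_|Internal error accessing historical database' "$1"
}
```

For example, find_prune_errors /ddr/var/log/debug/messages.engineering prints the matching lines, which can then be sent to DD Support. grep exits with a non-zero status when nothing matches.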
Partner Notes
If any of the actions above fail because the /ddr/ partition is too full, and there is no way to free enough space to complete those
actions, the DD may even enter a limited session with the write error: No space left on device
In that case the historical database has to be copied from /ddr/hd/ to a partition with more disk space and the actions performed on the copy.
Once the copy is consistent, upgraded and pruned, it is copied back to replace the original one. The summarized
steps are as follows:
Make a copy of the large historical database somewhere else after verifying that the destination partition has enough extra space:
# cp /ddr/hd/dd_hd.db /ddr/var/tools/
Perform any necessary actions on the copy of the database itself (such as dropping the offending trigger):
# sqlite3 -vfs /ddr/var/tools/dd_hd.db
sqlite> DROP TRIGGER IF EXISTS hd_ddboost_optdup_prune;
sqlite> .quit
Run the historical database upgrade tool on the copy of the DB:
# dd_hd_rdb_tool -u -d /ddr/var/tools/dd_hd.db
You can check progress in a duplicate Bash session or wait for the above command to finish:
# tail -f /ddr/var/log/debug/messages.engineering | grep "HD RDB"
Sep 25 13:30:34 DD9500 dd_hd_rdb_tool: INFO: hd_upgrade: HD RDB Upgrade /ddr/var/tools/dd_hd.db begin
Sep 25 13:31:04 DD9500 dd_hd_rdb_tool: INFO: hd_upgrade: HD RDB Upgrade end
Finally, replace the historical database in place with the upgraded, fixed one, and make sure to restart the "hdc" process:
# mv /ddr/var/tools/dd_hd.db /ddr/hd/dd_hd.db
# kill `pidof hdc`
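Before and after dropping a trigger on the copy, it can help to list the triggers the database actually contains. The sketch below assumes the plain sqlite3 command-line shell (options may differ on a DD); the function name list_triggers is hypothetical.

```shell
# Sketch (hypothetical helper): list each trigger in DB_FILE together with
# the table it is attached to, one "trigger ON table" entry per line.
# Usage: list_triggers DB_FILE
list_triggers() {
    sqlite3 "$1" \
        "SELECT name || ' ON ' || tbl_name FROM sqlite_master WHERE type='trigger' ORDER BY name;"
}
```

For example, running list_triggers /ddr/var/tools/dd_hd.db before and after the DROP TRIGGER step shows whether hd_ddboost_optdup_prune was present and whether it has been removed.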
This has been seen a few times, for example in Bug 234768 and Bug 245390. The fundamental reason for the problem (an existing trigger
preventing the historical database upgrade) is described and fixed in Bug 234998.
Article Properties
Affected Product
Data Domain
Product
Data Domain
21 Dec 2020
Version
3
Article Type
Solution