Crunching NetWorker Deduplication Stats
October 31, 2021 Preston de Guise 1 Comment
If you use NetWorker with Data Domain, you’ve probably sometimes wanted to know
which of your clients have the best deduplication – or perhaps more correctly, you’ve
probably wanted to occasionally drill into clients delivering lower deduplication levels.
There are NMC and DPA reports that’ll pull this data for you, but you can also get deduplication details for Data Domain Boost devices directly from mminfo.
You’ll find the information you’re after within the output of mminfo -S. The -S option in
mminfo provides a veritable treasure trove of extended details about individual
savesets. To give you an example, let’s look at just one saveset in -S format. First,
below, you’ll see the mminfo command to identify a saveset ID, then the -S option
invoked against a single saveset.
$ mminfo -q "ssid=3665074337" -S
ssid=3665074337 savetime=24/10/21 10:20:00 (1635031200) orilla.turbamentis.int:/nsr/scripts
clientid=c6fb4ece-00000004-5fdabaf1-5fdabaf0-00019ed8-a41317f3
group: NAS_PMdG;
There’s a support article that explains how to review this information, here. The line
you’re looking for in particular though is the “*ss data domain dedup statistics” – outlined
in that aforementioned support article.
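To make those statistics concrete: each dedup statistics value the script below works with is a colon-separated record of the form `v1:<cloneid>:<original>:<precomp>:<postcomp>`. Here’s a minimal Python sketch of pulling a reduction ratio out of one such value — the field meanings are assumptions based on how the script uses them, not authoritative documentation:

```python
# Hedged sketch: parse a "v1:<cloneid>:<original>:<precomp>:<postcomp>"
# dedup statistics value (the format matched by the script's /^v1:.../
# regex) and compute the overall reduction ratio as original:postcomp.
def parse_dedupe_stat(stat: str):
    parts = stat.split(":")
    if len(parts) != 5 or parts[0] != "v1":
        raise ValueError(f"unexpected dedup statistics format: {stat!r}")
    clone_id, original, precomp, postcomp = (int(x) for x in parts[1:])
    # Reduction expressed as N:1, i.e. pre-dedupe bytes per stored byte.
    reduction = original / postcomp if postcomp else 0.0
    return {"clone_id": clone_id, "original": original,
            "precomp": precomp, "postcomp": postcomp,
            "reduction": reduction}
```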
I recently wanted to review some deduplication details, and while DPA and NMC are
options, I needed to analyse some data over the weekend 1 and as such didn’t have
access to the live DPA or NMC host for a particular environment. So, since I’m not
doing anything social while Melbourne’s COVID case numbers remain so high, I tasked
myself with writing a script to analyse deduplication statistics from raw mminfo
-S output.
The script I wrote (in Perl, of course), can either be run on a NetWorker server via a
query option, or against the saved output of mminfo -S. The usage syntax is this:
Usage Options for dedupe-analysis
The script pulls the data and outputs at least three key data files, all in CSV format: a per-saveset “all” file, a by-client rollup, and a by-backup-type rollup.
Since a saveset, via cloning, can live on more than one Data Domain, the volume IDs
are included in the output – and the latter two files provide their breakdowns
first by volume ID, so you can see stats on a per-Data Domain basis. There’s also an
option to anonymise the host names in the output; if you invoke it, a CSV will
also be written containing the original-hostname to anonymised-hostname conversion 2.
I.e., this would allow you to send the anonymised version to someone for discussion,
but privately look up the real host in any subsequent discussion.
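The anonymisation approach the script takes is simple: hand out stable pseudonyms in first-seen order (the script uses a `host-%08d` format), and keep the forward map so the real names can be looked up privately later. A minimal Python sketch of the same idea:

```python
# Hedged sketch of the anonymisation mapping described above: stable
# "host-XXXXXXXX" pseudonyms assigned in first-seen order, with the
# forward map retained so real names can be recovered privately.
def make_anonymiser(prefix: str = "host", width: int = 8):
    mapping = {}
    def anonymise(name: str) -> str:
        # Reuse an existing pseudonym, or mint the next one in sequence.
        if name not in mapping:
            mapping[name] = f"{prefix}-{len(mapping):0{width}d}"
        return mapping[name]
    return anonymise, mapping
```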
There’s an additional output option too, which can be handy if you’re analysing millions
of savesets: it’s an option to create one output file per client. You still get the rollup data,
but you’ll also get a per-client CSV file so you can deep-drill into individual client results
with a better chance of avoiding Excel’s 1,048,576-row limit.
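The per-client option is essentially a fan-out: each output row lands in a CSV named for its client, so no single file approaches the Excel row ceiling. A hedged Python sketch of that pattern (column names and the output directory here are illustrative, not the script’s own):

```python
import csv
import os

# Hedged sketch of the per-client output option: fan rows out into one
# CSV per client. Column names come from each row's keys; the output
# directory is created if needed.
def write_per_client(rows, outdir):
    os.makedirs(outdir, exist_ok=True)
    handles = {}
    try:
        for row in rows:
            client = row["client"]
            if client not in handles:
                fh = open(os.path.join(outdir, f"{client}.csv"),
                          "w", newline="")
                writer = csv.DictWriter(fh, fieldnames=list(row))
                writer.writeheader()
                handles[client] = (fh, writer)
            handles[client][1].writerow(row)
    finally:
        for fh, _ in handles.values():
            fh.close()
```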
...processed 0 savesets
Writing PrestonLab-all.csv
Writing PrestonLab-type.csv
So what sort of output does it generate? For some output examples, I’ve imported the
generated CSV files into Excel and converted them to a table. Here’s an example of the
“per-saveset” data:
Now, they’re just lab environments there — and while accurate, they’re hardly edifying.
What I was working towards was the analysis of a production environment. With host
anonymisation turned on, here’s an example of that output:
Deduplication data by Volume ID, Client and Backup Type from a Production
Environment
The only additional thing I’ve done in the example output there is to set number
formatting on the Original/Post-Comp/Average Reduction columns.
If you’re interested in being able to run this against your own environment (or mminfo -S
output, in general), here’s the script:
#!/usr/bin/perl -w
###########################################################################
# Modules
###########################################################################
use strict;
use File::Basename;
use Getopt::Std;
use Sys::Hostname;
###########################################################################
# Subroutines
###########################################################################
sub in_list {
	my $element = $_[0];
	return 0 if (!defined($element));
	shift @_;
	my @list = @_;
	my $foundCount = 0;
	my $e = quotemeta($element);
	# Loop reconstructed: count occurrences of the element in the list.
	foreach my $item (@list) {
		my $i = quotemeta($item);
		$foundCount++ if ($i eq $e);
	}
	return $foundCount;
}
sub show_help {
	my $self = basename($0);
	# Usage text partially reconstructed from the option handling below.
	print <<EOF;
Usage: $self [-v] [-a] [-d] [-i] (-q query | -f file) -o file

Where:
	-q query	mminfo query to run against the NetWorker server.
	-f file		Previously saved "mminfo -S" output to analyse.
	-o file		File to write results data to. Do NOT include a file extension.
	-a		Anonymise hostnames (and VM names) in the output.
	-i		Additionally write one CSV file per client.
	-d		Enable debug output.
	-v		Print the version.

Results are written on an overall, per-client and workload type-basis.
EOF
	if (@_+0 > 0) {
		# Loop reconstructed: print any supplied error messages.
		foreach my $message (@_) {
			my $tmp = $message;
			chomp $tmp;
			print "$tmp\n";
		}
	}
	die "\n";
}
sub get_backup_type {
	my $saveset = $_[0];
	my $backupType = "";
	# Several match conditions were lost in transcription; patterns marked
	# "assumed" are reconstructions you should adjust to your environment.
	if ($saveset =~ /^RMAN/) {
		$backupType = "Oracle";
	} elsif ($saveset =~ /^backint/) {	# assumed
		$backupType = "SAP";
	} elsif ($saveset =~ /^MSSQL/) {	# assumed
		$backupType = "MSSQL";
	} elsif ($saveset =~ /^\// ||
		$saveset =~ /^\<\d+\>\//) {
		$backupType = "Unix Filesystem";
	} elsif ($saveset =~ /^APPLICATIONS/) {	# assumed
		$backupType = "Exchange";
	} elsif ($saveset =~ /SYSTEM/ ||
		$saveset =~ /DISASTER/ ||
		$saveset =~ /\\VOLUME\{/) {
		$backupType = "NetWorker";
	} else {
		$backupType = "Other";
	}
	return $backupType;
}
###########################################################################
my %opts = ();
my $version = "1.1";
my $method = "query"; # Default method is to seek a query from the command line.
my $query = "";
my $file = "";
my $outFile = "";
my $baseOut = "";
my $byClientOut = "";
my $byTypeOut = "";
my $anonOut = "";
my $anonHosts = 0;
my $individualFile = 0;
my $debug = 0;
my %hostMap = ();	# Original hostname to anonymised hostname map.
my $hostCount = 0;
my %vmMap = ();		# Original VM name to anonymised VM name map.
my $vmCount = 0;
my %clientElems = (); # If we're doing individual file out, use this to speedily iterate through gathered data.
# Capture command line arguments.
# Option handling reconstructed where transcription dropped lines; error
# texts and the derived output filename suffixes for the by-client and
# anonymisation-map files are assumed ("-all" and "-type" match the sample output).
if (getopts('h?vq:f:o:adi',\%opts)) {
	show_help() if (defined($opts{'h'}) || defined($opts{'?'}));
	if (defined($opts{'v'})) {
		print basename($0) . " v$version\n";
		exit 0;
	}
	if (defined($opts{'q'})) {
		$query = $opts{'q'};
	} elsif (defined($opts{'f'})) {
		$method = "file";
		$file = $opts{'f'};
		if (! -f $file) {
			show_help("File $file does not exist.");
		}
	} else {
		show_help("You must supply either a query (-q) or a file (-f).");
	}
	$anonHosts = 1 if (defined($opts{'a'}));
	$individualFile = 1 if (defined($opts{'i'}));
	$debug = 1 if (defined($opts{'d'}));
	if (defined($opts{'o'})) {
		$outFile = $opts{'o'};
		if ($outFile =~ /\./) {
			show_help("Do not include a file extension in the output file name.");
		}
	} else {
		show_help("You must supply an output file name (-o).");
	}
} else {
	show_help();
}
$baseOut = $outFile . "-all.csv";
$byClientOut = $outFile . "-client.csv";
$byTypeOut = $outFile . "-type.csv";
$anonOut = $outFile . "-anonmap.csv";
my %dataSets = ();
my @rawData = ();
my $ssidCount = 0;
my $ddSavesets = 0;
my %volumeIDsbyCloneID = ();
if ($method eq "query") {
	# Pipe-open reconstructed: run mminfo directly with the supplied query.
	open(MMI,"mminfo -q \"$query\" -S |") || die "Unable to run mminfo: $!\n";
	while (<MMI>) {
		my $line = $_;
		chomp $line;
		push(@rawData,$line);
		# Data Domain saveset counting reconstructed: count savesets that
		# carry the dedup statistics attribute.
		$ddSavesets++ if ($line =~ /data domain dedup statistics/);
		if ($line =~ /^ssid=\d+/) {
			$ssidCount++;
			if ($ssidCount % 100000 == 0) {
				print "...read $ssidCount saveset details ($ddSavesets on Data Domain)\n";
			}
		}
	}
	close(MMI);
} else {
	if (open(FILE,$file)) {
		while (<FILE>) {
			my $line = $_;
			chomp $line;
			push(@rawData,$line);
			$ddSavesets++ if ($line =~ /data domain dedup statistics/);
			if ($line =~ /^ssid=\d+/) {
				$ssidCount++;
				if ($ssidCount % 100000 == 0) {
					print "...read $ssidCount saveset details ($ddSavesets on Data Domain)\n";
				}
			}
		}
		close(FILE);
	} else {
		die "Unable to open $file for reading: $!\n";
	}
}
if ($ssidCount == 0) {
	die ("Did not find any savesets in an expected format in the $method.\n");
}
if ($ddSavesets == 0) {
	die ("Did not find any savesets on a Data Domain in the $method.\n");
}
# Now step through and discard all savesets that aren't on Data Domain devices.
# Segmentation loop reconstructed around the surviving per-line logic.
my $count = 0;
my @dataSeg = ();
my $foundADDBoostSS = 0;
my $index = 0;
foreach my $line (@rawData) {
	if ($line =~ /^ssid/) {
		if ($index != 0) {
			if ($foundADDBoostSS) {
				$dataSets{$count} = join("\n",@dataSeg);
				$count++;
			}
			@dataSeg = ();
			$foundADDBoostSS = 0;
		}
		@dataSeg = ($line);
	} else {
		push (@dataSeg,$line);
		$foundADDBoostSS = 1 if ($line =~ /data domain dedup statistics/);
	}
	$index++;
}
# Capture the final saveset segment.
if ($foundADDBoostSS && @dataSeg) {
	$dataSets{$count} = join("\n",@dataSeg);
	$count++;
}
# Explicitly free up the initially read data. This is useful if you have >1M savesets
# in reducing the runtime footprint. (E.g., on a sample of 1.7M savesets it uses 1/3 less
# memory.)
undef @rawData;
my %procData = ();
my $datum = 0;
my $procCount = 0;
# Per-saveset processing loop reconstructed around the surviving variable setup.
foreach my $key (sort {$a <=> $b} keys %dataSets) {
	my $block = $dataSets{$key};
	my @block = split(/\n/,$block);
	my $ssid = "";
	my $savetime = "";
	my $nsavetime = "";
	my $client = "";
	my $saveset = "";
	my $level = "";
	my $ssflags = "";
	my $totalsize = 0;
	my $retention = "";
	my $dedupeStats = "";
	my $vmname = "";
	if ($procCount % 100000 == 0) {
		print "...processed $procCount savesets\n";
	}
	$procCount++;
	my $vm = 0;
	my $blockIndex = 0;
	foreach my $item (@block) {
		# Header pattern reconstructed from the sample
		# "ssid=... savetime=... (nsavetime) client:saveset" line.
		if ($item =~ /^ssid=(\d+)\s+savetime=(.*?)\s+\((\d+)\)\s+(.*?):(.*)$/) {
			$ssid = $1;
			$savetime = $2;
			$nsavetime = $3;
			$client = $4;
			$saveset = $5;
			$procData{$datum}{ssid} = $ssid;
			$procData{$datum}{savetime} = $savetime;
			$procData{$datum}{client} = $client;
			$procData{$datum}{nsavetime} = $nsavetime;
			$procData{$datum}{saveset} = $saveset;
		}
		if ($anonHosts) {
			if (defined($hostMap{$client})) {
				# Nothing to do here.
			} else {
				$hostMap{$client} = sprintf("host-%08d",$hostCount);
				$hostCount++;
			}
		}
		# Clone-ID pattern reconstructed; the volume ID is found on the
		# following line, and is recorded as 0 if the clone instead went
		# to tape.
		if ($item =~ /clone=\s*(\d+)/) {
			my $cloneID = $1;
			my $volID = 0;
			my $nextLine = $block[$blockIndex+1];
			if (!defined($nextLine)) {
				$volID = 0;
			} else {
				if ($nextLine =~ /.*volid=\s*(\d+) /) {
					$volID = $1;
				} else {
					$volID = 0;
				}
			}
			$volumeIDsbyCloneID{$ssid}{$cloneID} = $volID;
		}
		# VM-name capture reconstructed from the vcenter_hostname
		# extended attribute.
		if ($item =~ /vcenter_hostname/ && $item =~ /"([^"]+)"/) {
			$vm = 1;
			$vmname = $1;
			if ($anonHosts) {
				if (defined($vmMap{$vmname})) {
					$procData{$datum}{vmname} = $vmMap{$vmname};
				} else {
					$vmMap{$vmname} = sprintf("vm-%09d",$vmCount);
					$vmCount++;
					$procData{$datum}{vmname} = $vmMap{$vmname};
				}
			} else {
				$procData{$datum}{vmname} = $vmname;
			}
		}
		if ($item =~ /^\s+level=([^\s]*)\s+sflags=([^\s]*)\s+size=(\d+).*/) {
			$level = $1;
			$ssflags = $2;
			$totalsize = $3;
			$procData{$datum}{level} = $level;
			$procData{$datum}{ssflags} = $ssflags;
			$procData{$datum}{totalsize} = $totalsize;
		}
		if ($item =~ /create=.* complete=.* browse=.* retent=(.*)$/) {
			$retention = $1;
			$procData{$datum}{retention} = $retention;
		}
		# Dedup-statistics pattern reconstructed from the
		# "*ss data domain dedup statistics" attribute; the lookahead
		# gathers continuation lines (ending in ',') until the
		# terminating ';'. The transcription had duplicated this block.
		if ($item =~ /data domain dedup statistics:\s*"(.*)"(\,|;)\s*$/) {
			$dedupeStats = $1;
			my $eol = $2;
			if ($eol eq ",") {
				my $stop = 0;
				my $lookahead = $blockIndex + 1;
				while (!$stop) {
					my $tmpLine = $block[$lookahead];
					if (defined($tmpLine) && $tmpLine =~ /^\s*"(.*)"(\,|;)\s*$/) {
						my $tmpstat = $1;
						my $tmpsep = $2;
						$dedupeStats .= "\n" . $tmpstat;
						$stop = 1 if ($tmpsep eq ";");
					} else {
						$stop = 1;
					}
					$lookahead++;
				}
			}
			$dedupeStats =~ s/^\n(.*)/$1/s;
			$procData{$datum}{dedupe_stats} = $dedupeStats;
		}
		$blockIndex++;
	}
	$datum++;
}
undef %dataSets;
# Anonymise index saveset names too (outer loop reconstructed).
if ($anonHosts) {
	foreach my $elem (keys %procData) {
		if ($procData{$elem}{saveset} =~ /^index:(.*)/) {
			my $indexHostname = $1;
			if (defined($hostMap{$indexHostname})) {
				$procData{$elem}{saveset} = "index:" . $hostMap{$indexHostname};
			} else {
				$hostMap{$indexHostname} = sprintf("host-%08d",$hostCount);
				$hostCount++;
				$procData{$elem}{saveset} = "index:" . $hostMap{$indexHostname};
			}
		}
	}
}
my %clientData = ();
# As we write out the base file, assemble the client rollup data.
my @clients = ();
print ("\n\nWriting $baseOut\n");
my $countOut = 0;
if (open(OUTP,">$baseOut")) {
	print OUTP ("SSID,CloneID,VolumeID,Client,Savetime,Level,Name,OriginalMB,PreLCompMB,PostLCompMB,Reduction(:1)\n");
	# Per-saveset loop reconstructed.
	foreach my $elem (sort {$a <=> $b} keys %procData) {
		my $dedupeStats = $procData{$elem}{dedupe_stats};
		my @dedupeStats = split("\n",$dedupeStats);
		if ($countOut % 100000 == 0) {
			print "...written $countOut savesets\n";
		}
		$countOut++;
		# Per-statistic loop reconstructed.
		foreach my $stat (@dedupeStats) {
			if ($stat =~ /^v1:(\d+):(\d+):(\d+):(\d+)/) {
				my $clID = $1;
				my $original = $2;
				my $precomp = $3;
				my $postcomp = $4;
				my $client = $procData{$elem}{client};
				my $saveset = $procData{$elem}{saveset};
				my $savetime = $procData{$elem}{savetime};
				my $ssid = $procData{$elem}{ssid};
				my $level = $procData{$elem}{level};
				my $vmName = (defined($procData{$elem}{vmname})) ? $procData{$elem}{vmname} : "";
				my $volumeID = $volumeIDsbyCloneID{$ssid}{$clID};
				# If we're writing individual files, remember which elements belong to
				# individual clients.
				if ($individualFile) {
					if (in_list($client,@clients)) {
						push(@{$clientElems{$client}},$elem);
					} else {
						push (@clients,$client);
						@{$clientElems{$client}} = ($elem);
					}
				}
				if ($vmName ne "") {
					$saveset = "VM:$vmName";
				}
				my $backupType = get_backup_type($saveset);
				$procData{$elem}{backuptype} = $backupType; # Store this in case we need it for individuals.
				if (%clientData &&
				    defined($clientData{by_client}) &&
				    defined($clientData{by_client}{$volumeID}) &&
				    defined($clientData{by_client}{$volumeID}{$client})) {
					$clientData{by_client}{$volumeID}{$client}{original} += $original / 1024;
					$clientData{by_client}{$volumeID}{$client}{postcomp} += $postcomp / 1024;
					$clientData{by_client}{$volumeID}{$client}{count}++;
				} else {
					$clientData{by_client}{$volumeID}{$client}{original} = $original / 1024; # Store in GB
					$clientData{by_client}{$volumeID}{$client}{postcomp} = $postcomp / 1024; # Store in GB
					$clientData{by_client}{$volumeID}{$client}{count} = 1;
				}
				if (%clientData &&
				    defined($clientData{by_type}) &&
				    defined($clientData{by_type}{$volumeID}) &&
				    defined($clientData{by_type}{$volumeID}{$client}) &&
				    defined($clientData{by_type}{$volumeID}{$client}{$backupType})) {
					$clientData{by_type}{$volumeID}{$client}{$backupType}{original} += $original / 1024; # Store in GB
					$clientData{by_type}{$volumeID}{$client}{$backupType}{postcomp} += $postcomp / 1024; # Store in GB
					$clientData{by_type}{$volumeID}{$client}{$backupType}{count}++;
				} else {
					$clientData{by_type}{$volumeID}{$client}{$backupType}{original} = $original / 1024; # Store in GB
					$clientData{by_type}{$volumeID}{$client}{$backupType}{postcomp} = $postcomp / 1024; # Store in GB
					$clientData{by_type}{$volumeID}{$client}{$backupType}{count} = 1;
				}
				my $finalClientName = ($anonHosts == 1) ? $hostMap{$client} : $client;
				# Row output reconstructed: sizes converted to MB.
				printf OUTP ("%s,%s,%s,%s,%s,%s,\"%s\",%.3f,%.3f,%.3f,%.3f\n",
					$ssid, $clID, $volumeID, $finalClientName, $savetime, $level, $saveset,
					$original / 1024 / 1024, $precomp / 1024 / 1024, $postcomp / 1024 / 1024,
					($postcomp > 0 ? $original / $postcomp : 0));
				if ($debug) {
					print "Processed ssid $ssid for $finalClientName\n";
				}
			}
		}
	}
	close(OUTP);
}
if ($individualFile) {
	mkdir("$outFile");
	if (-d $outFile) {
		# Per-client output loop reconstructed.
		foreach my $client (@clients) {
			my $filename = $client;
			# Override here if we're anonymising.
			if ($anonHosts) {
				$filename = $hostMap{$client};
			}
			$filename =~ s/\./_/g;
			next if (!open(OUTP,">$outFile/$filename.csv"));
			print OUTP ("SSID,CloneID,VolumeID,Client,VMName,Savetime,Level,Name,OriginalMB,PreLCompMB,PostLCompMB,Reduction(:1)\n");
			my @elemList = @{$clientElems{$client}};
			foreach my $elem (@elemList) {
				next if ($procData{$elem}{client} ne $client);
				# Else...
				my $dedupeStats = $procData{$elem}{dedupe_stats};
				my @dedupeStats = split("\n",$dedupeStats);
				foreach my $stat (@dedupeStats) {
					if ($stat =~ /^v1:(\d+):(\d+):(\d+):(\d+)/) {
						my $clID = $1;
						my $original = $2;
						my $precomp = $3;
						my $postcomp = $4;
						my $saveset = $procData{$elem}{saveset};
						my $savetime = $procData{$elem}{savetime};
						my $ssid = $procData{$elem}{ssid};
						my $level = $procData{$elem}{level};
						my $vmName = (defined($procData{$elem}{vmname})) ? $procData{$elem}{vmname} : "";
						my $volumeID = $volumeIDsbyCloneID{$ssid}{$clID};
						push (@clients,$client) if (!in_list($client,@clients));
						if ($vmName ne "") {
							$saveset = "VM:$vmName";
						}
						$original = $original / 1024 / 1024; #MB
						$precomp = $precomp / 1024 / 1024; #MB
						$postcomp = $postcomp / 1024 / 1024; #MB
						my $reduction = $original / $postcomp; #:1
						my $backupType = $procData{$elem}{backuptype};
						my $finalClient = ($anonHosts == 1) ? $hostMap{$client} : $client;
						my $finalSaveset = "";
						if ($anonHosts) {
							if ($vmName ne "") {
								$finalSaveset = $vmName;
							} else {
								$finalSaveset = $saveset;
							}
						} else {
							$finalSaveset = $saveset;
						}
						# Row output reconstructed.
						printf OUTP ("%s,%s,%s,%s,\"%s\",%s,%s,\"%s\",%.3f,%.3f,%.3f,%.3f\n",
							$ssid, $clID, $volumeID, $finalClient, $vmName, $savetime,
							$level, $finalSaveset, $original, $precomp, $postcomp, $reduction);
					}
				}
			}
			close(OUTP);
		}
	} else {
		print "Unable to create directory $outFile for individual client files.\n";
	}
}
print("\n\nWriting $byClientOut\n");
if (open(OUTP,">$byClientOut")) {
	# Rollup loops and header reconstructed.
	print OUTP ("VolumeID,Client,OriginalGB,PostCompGB,Reduction(:1)\n");
	foreach my $volumeID (sort keys %{$clientData{by_client}}) {
		foreach my $client (sort keys %{$clientData{by_client}{$volumeID}}) {
			my $finalClientName = ($anonHosts == 1) ? $hostMap{$client} : $client;
			printf OUTP ("%s,%s,%.3f,%.3f,%.3f\n",
				$volumeID,
				$finalClientName,
				$clientData{by_client}{$volumeID}{$client}{original},
				$clientData{by_client}{$volumeID}{$client}{postcomp},
				$clientData{by_client}{$volumeID}{$client}{original} / $clientData{by_client}{$volumeID}{$client}{postcomp});
		}
	}
	close(OUTP);
}
print("\n\nWriting $byTypeOut\n");
if (open(OUTP,">$byTypeOut")) {
	# Rollup loops and header reconstructed.
	print OUTP ("VolumeID,Client,Type,OriginalGB,PostCompGB,Reduction(:1)\n");
	foreach my $volumeID (sort keys %{$clientData{by_type}}) {
		foreach my $client (sort keys %{$clientData{by_type}{$volumeID}}) {
			my $finalClientName = ($anonHosts == 1) ? $hostMap{$client} : $client;
			foreach my $type (sort keys %{$clientData{by_type}{$volumeID}{$client}}) {
				printf OUTP ("%s,%s,%s,%.3f,%.3f,%.3f\n",
					$volumeID,
					$finalClientName,
					$type,
					$clientData{by_type}{$volumeID}{$client}{$type}{original},
					$clientData{by_type}{$volumeID}{$client}{$type}{postcomp},
					$clientData{by_type}{$volumeID}{$client}{$type}{original} / $clientData{by_type}{$volumeID}{$client}{$type}{postcomp});
			}
		}
	}
	close(OUTP);
}
# If we're anonymising hosts, write data that can be kept private to map
# anonymised names back to the originals. (Body reconstructed.)
if ($anonHosts) {
	if (open(ANONMAP,">$anonOut")) {
		print ANONMAP ("OriginalHostname,AnonymisedHostname\n");
		foreach my $host (sort keys %hostMap) {
			print ANONMAP ("$host,$hostMap{$host}\n");
		}
		close (ANONMAP);
	}
}
One thing to note in the script — the breakdown of backup types depends on the
saveset information I had available to me. So while it covers things like Windows and
Unix filesystems, Oracle, SAP and MSSQL, it doesn’t include coverage for identifying
Lotus Notes, DB2, and so on. (There’s a subroutine, get_backup_type, that interprets
the backup type from the saveset name, which you’d need to modify if you wanted
additional types.)
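If you’re more comfortable prototyping outside Perl, the same classification idea can be sketched as an ordered rule table, so adding a type (say, DB2) is one line. This is a hedged Python sketch: the first three patterns mirror ones visible in the script (RMAN, `<N>/` Unix paths, SYSTEM/DISASTER/\VOLUME{); anything else you add is a guess you’d adapt to your own saveset names.

```python
import re

# Ordered rules: first match wins, mirroring an if/elsif chain. Patterns
# beyond the three shown are yours to supply for additional workloads.
BACKUP_TYPE_RULES = [
    (re.compile(r"^RMAN"), "Oracle"),
    (re.compile(r"^<\d+>/"), "Unix Filesystem"),
    (re.compile(r"SYSTEM|DISASTER|\\VOLUME\{"), "NetWorker"),
    # e.g. (re.compile(r"^DB2"), "DB2"),  # illustrative addition
]

def get_backup_type(saveset: str) -> str:
    for pattern, backup_type in BACKUP_TYPE_RULES:
        if pattern.search(saveset):
            return backup_type
    return "Other"
```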
Footnotes
1. There are, unfortunately, only so many hours in a busy work-day.
2. The “all” output does not attempt to anonymise saveset names – though it will anonymise virtual machine names.