Architecting Splunk For High Availability and Disaster Recovery

About me
• Member of Splunk Tech Services
• Large-scale deployments
• Cloud and Big Data
• Fifth .conf

AGENDA
• Disaster Recovery: recover in the event of a disaster
• High Availability: maintain an acceptable level of continuous service
  - Data Collection
  - Indexing & Searching
• Top Takeaways

Disaster Recovery (DR)

DR: What is Disaster Recovery?

DR: Disaster Recovery Steps
1. Backup necessary data
2. Restore

DR: Backup
1a. Configurations: $SPLUNK_HOME/etc/*
1b. Indexes: buckets (Hot*, Warm, Cold, Frozen)

DR: Backup Configurations
Splunk instance: $SPLUNK_HOME/etc/*
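
As a rough illustration of the configuration backup step, the sketch below archives the `etc` directory of a Splunk instance into a timestamped tarball. The function name, the archive naming scheme, and the idea of keeping archives in a separate backup directory are assumptions made for this example, not part of the presentation.

```python
import tarfile
import time
from pathlib import Path


def backup_splunk_etc(splunk_home, backup_dir):
    """Archive <splunk_home>/etc into a timestamped .tar.gz and return its path."""
    etc_dir = Path(splunk_home) / "etc"
    if not etc_dir.is_dir():
        raise FileNotFoundError(f"no etc directory under {splunk_home}")

    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme: one archive per run, named by local time.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = backup_dir / f"splunk-etc-{stamp}.tar.gz"

    with tarfile.open(archive_path, "w:gz") as archive:
        # Store entries relative to SPLUNK_HOME so they restore as etc/...
        archive.add(etc_dir, arcname="etc")
    return archive_path
```

Storing paths relative to `$SPLUNK_HOME` keeps the archive usable on a replacement host whose install location differs.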

DR: Backup: Bucket Lifecycle
[Diagram: Events → ... → Frozen Path → Thawed Path, or Deleted]

DR: Backup Data
[Table columns: Bucket Type | State | Can Backup?]
*Unless using a snapshot-aware file system (VSS, ZFS) or rolling to warm first (which introduces a performance penalty).
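
A backup script has to tell hot buckets apart from warm and cold ones before copying anything. The sketch below does this by directory name. It assumes the common on-disk convention in which hot buckets live in directories starting with `hot_` while warm and cold buckets start with `db_` (or `rb_` for replicated copies in a cluster); verify this against your own index directories before relying on it.

```python
def is_backup_safe(bucket_dir_name):
    """True for warm/cold bucket directories, False for hot buckets and anything else."""
    # Assumed layout: hot buckets are "hot_v1_<id>"; warm and cold buckets are
    # "db_<newest>_<oldest>_<id>" (replicated copies in a cluster use "rb_").
    return bucket_dir_name.startswith(("db_", "rb_"))


def split_buckets(bucket_dir_names):
    """Partition directory names into (safe_to_copy, skipped)."""
    safe, skipped = [], []
    for name in bucket_dir_names:
        (safe if is_backup_safe(name) else skipped).append(name)
    return safe, skipped
```

Anything that does not match the warm/cold pattern is skipped rather than copied, which errs on the side of not backing up a bucket that is still being written.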

DR: Restore Configurations
[Diagram: $SPLUNK_HOME/etc/* → $SPLUNK_HOME/etc/*]

DR: Restore Data
[Diagram: $Indexes_Location ($SPLUNK_HOME/var/lib/splunk) → $Indexes_Location ($SPLUNK_HOME/var/lib/splunk)]
Splunk advises restoring fully from a backup rather than restoring on top of a partially corrupted datastore.
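
One way to honor the "restore fully" advice in a script is to refuse to extract into a directory that already holds data. The sketch below is a minimal illustration of that guard; the function name and the tar archive format are assumptions of this example, and it should only be used with archives you created yourself.

```python
import tarfile
from pathlib import Path


def restore_into_clean_dir(archive_path, target_dir):
    """Extract a backup archive, refusing to write on top of existing data."""
    target = Path(target_dir)
    if target.exists() and any(target.iterdir()):
        raise FileExistsError(f"{target} is not empty; move it aside before restoring")
    target.mkdir(parents=True, exist_ok=True)

    with tarfile.open(archive_path) as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters; use the safe one.
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
    return target
```

Moving the damaged datastore aside first, rather than deleting it, leaves it available for later inspection.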

DR: Backup Clustered Data
• Option 1: Backup all data on each node
  - Will also result in backups of duplicate data
• Option 2: Identify one copy of each bucket on the cluster and backup only those (requires scripting)
  - Decide whether or not you need to also backup index files
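
The scripting that Option 2 calls for could start from something like the sketch below, which picks a single copy of each bucket from a listing of bucket directories per peer. It assumes clustered bucket directories are named `<db|rb>_<newest>_<oldest>_<localid>_<guid>`, so that the index name, local id, and origin GUID together identify a bucket regardless of which peer holds the copy. The input structure and function names are hypothetical.

```python
def bucket_key(index_name, bucket_dir_name):
    """Identity shared by all copies of one bucket: (index, local id, origin GUID)."""
    parts = bucket_dir_name.split("_")
    if len(parts) != 5 or parts[0] not in ("db", "rb"):
        raise ValueError(f"unexpected bucket directory name: {bucket_dir_name}")
    _prefix, _newest, _oldest, local_id, origin_guid = parts
    return (index_name, local_id, origin_guid)


def pick_one_copy(buckets_by_peer):
    """Map each bucket identity to the first (peer, directory) that holds a copy.

    buckets_by_peer maps a peer name to a list of (index_name, bucket_dir_name).
    """
    chosen = {}
    for peer in sorted(buckets_by_peer):
        for index_name, dir_name in buckets_by_peer[peer]:
            chosen.setdefault(bucket_key(index_name, dir_name), (peer, dir_name))
    return chosen
```

This version simply takes the first copy it sees. A real script would also need to decide whether to prefer copies that include index files, as the slide notes.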

DR: Putting Restore Together
2. Restore
   a. Configurations
   b. Data/Indexes

DR: Things to think about
• Recovery Time and Tolerable Loss vs. Complexity and Cost
• Other custom factors in your environment
  - Ex. job artifacts, DM, Collections if DR'ing a Search Head

High Availability (HA)

HA: What is High Availability?

HA: Splunk High Availability
1. Data Collection/Reception
2. Searching
3. Indexing

HA: Data Collection
[Diagram: multiple forwarders send data to indexers A and B.]
outputs.conf:
    [tcpout]
    defaultGroup = mygroup

    [tcpout:mygroup]
    server = A:9997, B:9997
    autoLB = true
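
To make the effect of the `outputs.conf` stanza concrete, the sketch below reads the same settings with Python's `configparser` and walks the server list until it finds an indexer that is up. This is a toy model only: the helper names are invented for the example, and the simple rotation stands in for whatever selection the forwarder actually performs.

```python
import configparser

OUTPUTS_CONF = """
[tcpout]
defaultGroup = mygroup

[tcpout:mygroup]
server = A:9997, B:9997
autoLB = true
"""


def target_servers(conf_text):
    """Return the indexers listed for the default tcpout group."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the setting names' original case
    parser.read_string(conf_text)
    group = parser["tcpout"]["defaultGroup"].strip()
    servers = parser[f"tcpout:{group}"]["server"]
    return [s.strip() for s in servers.split(",") if s.strip()]


def next_target(servers, is_up, start=0):
    """Return the first server at or after `start` (wrapping) that is up, else None."""
    for offset in range(len(servers)):
        candidate = servers[(start + offset) % len(servers)]
        if is_up(candidate):
            return candidate
    return None
```

With two indexers in the group, losing A still leaves B as a valid target, which is the availability property the slide is after.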

HA: Searching
2a. Search Head Clustering (SHC)

HA: Searching: Typical Search Hierarchy

HA: Search Head Pooling (SHP)
NFS used to sync:
• SH configurations
• Job artifacts
• SH schedulers
[Diagram: search heads A and B connected through NFS.]

HA: Search Head Clustering (SHC)
• Improved horizontal scaling
• Improved high availability
• No single point of failure

HA: SHC vs. SHP
• SHC: NFS-less
• SHP: uses NFS

HA: SHC
Replication protocol syncs:
  - Configurations
  - Job artifacts
[Diagram: members A, B, and C replicate among themselves; the Deployer supplies configurations; one member acts as Captain.]

HA: SHC Operation - High Level
• Deployer ensures all SHC members have identical baseline configurations
  - Subsequent UI changes are propagated using an internal replication mechanism
• Job scheduler gets disabled on all members but the Captain
• Captain selects members to run scheduled jobs based on load
  - Selection is based on load statistics, which ensures better load distribution vs. SHP
• Captain orchestrates job artifact replication to selected members/candidates of the cluster
• Transparent job artifact proxying (and eventual replication) if the artifact is not present on the user's SH
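
The load-based delegation described above can be pictured with a small toy model: for each scheduled job, the captain hands the job to whichever member currently has the lowest load. The load metric, the per-job cost, and the tie-breaking rule below are all assumptions of this sketch and say nothing about how the captain really measures load.

```python
def pick_member_for_job(load_by_member):
    """Return the member with the lowest current load (ties broken by name)."""
    if not load_by_member:
        raise ValueError("cluster has no members")
    return min(sorted(load_by_member), key=lambda member: load_by_member[member])


def dispatch(jobs, load_by_member, cost=1):
    """Assign each job to the least-loaded member, updating load as we go."""
    load = dict(load_by_member)  # work on a copy; the caller's dict is untouched
    assignment = {}
    for job in jobs:
        member = pick_member_for_job(load)
        assignment[job] = member
        load[member] += cost
    return assignment
```

Because every assignment updates the load picture, a burst of jobs spreads across members instead of piling onto one search head.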

HA: Deploying SHC
• Same SH version and high-speed network (LAN)
  - More storage required vs. stand-alone SHs. Linux/Solaris only
• Needs a load balancer (LB) and a Deployer instance (a DS or MN can also be used to fulfill this role)
• Select RF per your HA/DR requirements
• Configure the Deployer first with a secret key
• Initialize each instance, point them to the Deployer, then bootstrap one of them to become the cluster captain
• More details on Splunk Docs

HA: Indexing
3. Indexer Clustering

HA: Index Replication
• Cluster = a group of search peers (indexers) that replicate each other's buckets
• Data Availability
  - Availability for ingestion and searching
• Data Fidelity
  - Forwarder acknowledgement, assurance
• Disaster Recovery
  - Site awareness
• Search Affinity
  - Local search preference vs. remote
Trade-offs:
• Extra storage
• Slightly increased processing load

HA: Cluster Components
• Master Node
  - Orchestrates the replication/remedial process. Informs the SH where to find searchable data. Helps manage peer configurations.
• Peer Nodes
  - Receive and index data. Replicate data to/from other peers. Number of peer nodes ≥ RF
• Search Head(s)
  - Must use one to search across the cluster.
• Forwarders
  - Use with auto-LB and indexer acknowledgement

HA: Single Site Cluster Architecture

HA: Replication Factor (RF)
• Number of copies of data in the cluster. Default RF=3
• Cluster can tolerate RF-1 node failures
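
The RF-1 rule can be checked by brute force on a small model: place RF copies of a bucket on distinct peers, then try every possible set of failed peers. The sketch below does exactly that. The peer names and function names are illustrative only, and the model ignores the bucket fix-up the cluster performs after a failure.

```python
from itertools import combinations


def survives(copy_holders, failed_peers):
    """A bucket survives if at least one peer holding a copy is still up."""
    return bool(set(copy_holders) - set(failed_peers))


def max_tolerated_failures(peers, copy_holders):
    """Largest n such that the bucket survives every possible set of n failed peers."""
    tolerated = 0
    for n in range(1, len(peers) + 1):
        if all(survives(copy_holders, failed) for failed in combinations(peers, n)):
            tolerated = n
        else:
            break
    return tolerated
```

For a five-peer cluster with copies on three distinct peers, the worst case is losing exactly those three peers, so the guaranteed tolerance is two failures, that is, RF-1.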

HA: Search Factor (SF)
• Number of searchable copies of data in the cluster. Default SF=2
• Requires more storage
• Replicated vs. searchable bucket

HA: Clustered Indexing
• Originating peer node streams copies of data to other clustered peers.
  - Receiving peers store those copies.
• Master determines the destination of replicated data.
  - Instructs peers which peers to stream data to. Does not sit on the data path.
• Master manages all peer-to-peer interactions and coordinates remedial activities.
• Master keeps track of which peers have searchable data.
  - Ensures that there are always SF copies of searchable data available.

HA: Clustered Searching
• Search head coordinates all searches in the cluster
• SH relies on the master to tell it who its peers are.
  - The master keeps track of which peers have searchable data
• Only one copy of each replicated bucket, a.k.a. the primary, is searched
  - i.e., searches occur over primary buckets only.
• Primary buckets may change over time
  - Peers know their status and therefore know where to search
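
The idea of a single primary copy per bucket, and of primaries moving when a peer goes down, can be sketched with a toy assignment function. The data structures and the "first live peer wins" rule are assumptions of this example and do not describe the master's actual selection logic.

```python
def assign_primaries(searchable_copies, up_peers):
    """Choose one primary per bucket from searchable copies on peers that are up.

    searchable_copies maps a bucket id to the list of peers holding a searchable copy.
    Returns bucket id -> primary peer; buckets with no live searchable copy are omitted.
    """
    primaries = {}
    for bucket, peers in searchable_copies.items():
        live = [peer for peer in peers if peer in up_peers]
        if live:
            primaries[bucket] = live[0]
    return primaries
```

Re-running the function after removing a peer from `up_peers` shows the primary moving to a surviving searchable copy. This is why SF=2 or more lets searches continue through a peer failure.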

Multisite Clustering
• Site awareness introduced in Splunk 6.1
• Improved disaster recovery
  - Multisite clusters provide site failover capability
• Search Affinity
  - Search heads will scope searches to the local site whenever possible
  - Ability to turn off: better throughput vs. cross-site bandwidth

Multisite Cluster Architecture

Multisite Clustering Cont'd
• Each node belongs to an assigned site, except for the Master Node, which controls all sites but is not logically a member of any
• Replication of bucket copies occurs in a site-aware manner.
  - Multisite replication determines the number of copies on each site. Ex. 3-site cluster:
    site_replication_factor = origin:2, site1:1, site2:1, site3:1, total:4
• Bucket-fixing activities respect site boundaries when applicable
• Searches are fulfilled by local peers whenever possible (a.k.a. search affinity)
  - Each site must have at least a full set of searchable data
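
To see how the example `site_replication_factor` value translates into copies per site, the sketch below parses the setting and works out the distribution for a bucket that originated at a given site. It assumes that when the origin site is also listed explicitly, the larger of the two values applies; that reading makes the slide's example add up to `total:4`, but check it against the Splunk documentation. The function names are hypothetical.

```python
def parse_site_factor(value):
    """Parse 'origin:2, site1:1, total:4' into {'origin': 2, 'site1': 1, 'total': 4}."""
    factor = {}
    for term in value.split(","):
        name, _, count = term.strip().partition(":")
        factor[name] = int(count)
    return factor


def copies_per_site(factor, origin_site):
    """Copies each listed site holds for a bucket that originated at origin_site."""
    sites = {name: n for name, n in factor.items() if name not in ("origin", "total")}
    # Assumption: when the origin site is also listed explicitly, the larger value wins.
    sites[origin_site] = max(sites.get(origin_site, 0), factor.get("origin", 0))
    if sum(sites.values()) > factor["total"]:
        raise ValueError("per-site copies exceed total")
    return sites
```

With the slide's value and a bucket originating at `site2`, this yields two copies at `site2` and one each at `site1` and `site3`, four in all.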

Putting it Together
[Diagram: Deployer, Master, ..., Search Head Clustering]

Top Takeaways
• DR: process of backing up and restoring service in case of disaster
  - Configuration files: copy of the $SPLUNK_HOME/etc/ folder
  - Indexed data: backup and restore buckets
    - Hot, warm, cold, frozen
    - Can't backup hot (without snapshots) but can safely backup warm and cold
• HA: continuously operational system bounded by a set of tolerances
  - Data collection
    - Autolb from forwarders to multiple indexers
    - Use Indexer Acknowledgement to protect in-flight data
  - Searching
    - Search Head Clustering (SHC)
  - Indexing
    - Use Index Replication

Copyright © 2015 Splunk Inc.

Q & A
THANK YOU
Feedback:
[email protected]