Architecting Splunk For High Availability and Disaster Recovery

This document discusses architecting Splunk for high availability and disaster recovery. It covers key aspects of setting up Splunk for high availability including data collection, indexing, and searching across multiple nodes; and disaster recovery processes like backing up configurations, indexes and restoring data after a disaster. The document emphasizes having processes to maintain continuous service during failures and to recover from disasters within a tolerable time period and loss of data.

Uploaded by

ronaldo.panuelos


Copyright

© 2015 Splunk Inc.

Architecting Splunk for High Availability and Disaster Recovery

Dritan Bitincka
Splunk Technical Services

Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future
events or the expected performance of the company. We caution you that such statements reflect our
current expectations and estimates based on factors currently known to us and that actual events or
results could differ materially. For important factors that may cause actual results to differ from those
contained in our forward-looking statements, please review our filings with the SEC. The
forward-looking statements made in this presentation are being made as of the time and date of its live
presentation. If reviewed after its live presentation, this presentation may not contain current or
accurate information. We do not assume any obligation to update any forward-looking statements we
may make.

In addition, any information about our roadmap outlines our general product direction and is subject to
change at any time without notice. It is for informational purposes only and shall not be incorporated
into any contract or other commitment. Splunk undertakes no obligation either to develop the features
or functionality described or to include any such feature or functionality in a future release.

About me

• Member of Splunk Tech Services
• Large scale deployments
• Cloud and Big Data
• Fifth .conf

AGENDA

Disaster Recovery: Recover in the event of a disaster

High Availability: Maintain an acceptable level of continuous service
• Data Collection
• Indexing & Searching

Top Takeaways

Disaster Recovery (DR)
DR What is Disaster Recovery?

Set of processes necessary to ensure recovery of service after a disaster

DR Disaster Recovery Steps

1. Back up necessary data
   – Back up to a medium at least as resilient as the source
   – Local Backup vs. Remote

2. Restore
   – Ensure this works
   – Backup is worthless without restore

DR Backup

1a. Configurations
    $SPLUNK_HOME/etc/*

1b. Indexes
    Buckets: Hot*, Warm, Cold, Frozen

DR Backup Configurations

Splunk Instance: $SPLUNK_HOME/etc/*

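A minimal sketch of the configuration backup, assuming a plain tar.gz archive is an acceptable backup medium. The function names `backup_splunk_etc` and `list_archive`, the archive naming scheme, and the backup directory are illustrative assumptions, not part of Splunk.

```python
import tarfile
import time
from pathlib import Path


def backup_splunk_etc(splunk_home, backup_dir):
    """Archive $SPLUNK_HOME/etc into a timestamped tar.gz (hypothetical helper)."""
    etc_dir = Path(splunk_home) / "etc"
    if not etc_dir.is_dir():
        raise FileNotFoundError(f"no etc directory under {splunk_home}")

    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # The file name pattern is an assumption made for this sketch.
    stamp = time.strftime("%Y%m%dT%H%M%S")
    archive = backup_dir / f"splunk-etc-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to "etc" so the archive restores cleanly
        # into another instance's $SPLUNK_HOME.
        tar.add(etc_dir, arcname="etc")
    return archive


def list_archive(archive):
    """Return the member names of a backup, as a cheap readability check."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()
```

Listing the archive only confirms that it can be read. It does not replace a real restore test on a new instance.
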
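DR Backup: Bucket Lifecycle

Events → Hot → Warm → Cold → Frozen → Thawed

Hot → Warm: [Hot Bucket is Full]
Warm → Cold: [Out of volume space or too many warms]
Cold → Frozen: [Out of Space or Bucket is Old]
Frozen → Thawed: [Explicit User Action]

Hot, Warm: $ Home Path
Cold: $ Cold Path [Cheaper Storage]
Frozen: $ Frozen Path or Deleted
Thawed: $ Thawed Path
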
DR Backup Data

Bucket Type   State          Can Backup?
Hot           Read + Write   No*
Warm          Read Only      Yes
Cold          Read Only      Yes

*Unless using a snapshot-aware FS (VSS, ZFS) or rolling to warm first (which introduces a performance penalty).

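The rule in the table can be sketched as a directory filter. This assumes the default index layout, with warm buckets under `<index>/db`, cold buckets under `<index>/colddb`, and hot bucket directories whose names start with `hot_`. Custom `homePath` or `coldPath` settings would need different paths. The function name `backup_candidates` is hypothetical.

```python
from pathlib import Path


def backup_candidates(index_dir):
    """Return warm and cold bucket directories that are safe to copy.

    Assumes the default layout (<index>/db and <index>/colddb). Hot buckets
    are skipped because they are still being written to.
    """
    candidates = []
    for sub in ("db", "colddb"):
        base = Path(index_dir) / sub
        if not base.is_dir():
            continue
        for entry in sorted(base.iterdir()):
            # Warm and cold buckets are named db_* (or rb_* for replicated
            # copies); anything else, including hot_*, is left out.
            if entry.is_dir() and entry.name.startswith(("db_", "rb_")):
                candidates.append(entry)
    return candidates
```
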
DR Restore Configurations

Backup $SPLUNK_HOME/etc/* → New Splunk Instance $SPLUNK_HOME/etc/*

DR Restore Data

Backup $Indexes_Location ($SPLUNK_HOME/var/lib/splunk) → New Splunk Instance $Indexes_Location ($SPLUNK_HOME/var/lib/splunk)

Splunk advises restoring fully from a backup rather than restoring on top of a partially corrupted datastore.

DR Backup Clustered Data
• Option 1: Back up all data on each node
  – Will also result in backups of duplicate data
• Option 2: Identify one copy of each bucket on the cluster and back up only those (requires scripting)
  – Decide whether or not you need to also back up index files

Bucket naming conventions

Non-clustered buckets: db_<newest_time>_<oldest_time>_<localid>
Clustered original bucket: db_<newest_time>_<oldest_time>_<localid>_<guid>
Clustered replicated bucket copies: rb_<newest_time>_<oldest_time>_<localid>_<guid>

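The scripting that Option 2 calls for could start from the naming conventions above. The sketch below parses bucket directory names and keeps one copy per bucket, preferring the original `db_` copy over an `rb_` replica. It assumes that copies of the same bucket share `<localid>` and `<guid>` and that names are compared within a single index. The regular expression and all function names are illustrative.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the slide's naming conventions; the guid part is
# optional because non-clustered buckets do not carry one.
BUCKET_RE = re.compile(
    r"^(?P<kind>db|rb)_(?P<newest>\d+)_(?P<oldest>\d+)_(?P<localid>\d+)"
    r"(?:_(?P<guid>[0-9A-Fa-f-]+))?$"
)


class Bucket(NamedTuple):
    kind: str            # "db" (original) or "rb" (replicated copy)
    newest: int          # newest event time in the bucket (epoch seconds)
    oldest: int          # oldest event time in the bucket (epoch seconds)
    local_id: int
    guid: Optional[str]  # None for non-clustered buckets


def parse_bucket(name: str) -> Optional[Bucket]:
    """Parse a bucket directory name, or return None if it does not match."""
    m = BUCKET_RE.match(name)
    if m is None:
        return None
    return Bucket(m["kind"], int(m["newest"]), int(m["oldest"]),
                  int(m["localid"]), m["guid"])


def one_copy_per_bucket(names):
    """Pick a single directory name per bucket across all peers of one index."""
    chosen = {}
    for name in names:
        bucket = parse_bucket(name)
        if bucket is None:
            continue  # hot buckets and unrelated files are ignored
        key = (bucket.local_id, bucket.guid)
        current = chosen.get(key)
        # Prefer the original (db_) copy when both kinds are present.
        if current is None or (bucket.kind == "db" and current.kind == "rb"):
            chosen[key] = bucket
    return sorted(
        f"{b.kind}_{b.newest}_{b.oldest}_{b.local_id}" + (f"_{b.guid}" if b.guid else "")
        for b in chosen.values()
    )
```
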
DR Putting Restore Together

2a. New Splunk Instance
2b. Configurations
2c. Data/Indexes

DR Things to think about:

Recovery Time and Tolerable Loss
vs.
Complexity and Cost

• Other custom factors in your environment
  – Ex. Job artifacts, DM, Collections if DR'ing a Search Head

High Availability (HA)
HA What is High Availability?

A design methodology whereby a system is continuously operational, bounded by a set of
predetermined tolerances.
Note: "high availability" != "complete availability"

HA Splunk High Availability

1. Data Collection/Reception
2. Searching
3. Indexing

HA Data Collection

Forwarders → Indexers A, B, ...

outputs.conf:
[tcpout]
defaultGroup = mygroup

[tcpout:mygroup]
server = A:9997, B:9997
autoLB = true

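One way to sanity-check a forwarder's outputs.conf for this setup is to confirm that the default target group lists more than one indexer. The sketch below parses the stanza from the slide with Python's standard `configparser`. It assumes a single default group and ignores Splunk's configuration layering. The helper name `forwarder_targets` is hypothetical.

```python
import configparser


def forwarder_targets(outputs_conf_text):
    """Return the indexers listed in the default tcpout group."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    # Keep key case as written; Splunk setting names such as defaultGroup
    # are case-sensitive, and configparser lowercases keys by default.
    parser.optionxform = str
    parser.read_string(outputs_conf_text)

    group = parser["tcpout"]["defaultGroup"].strip()
    servers = parser[f"tcpout:{group}"]["server"]
    return [s.strip() for s in servers.split(",") if s.strip()]


SLIDE_CONF = """
[tcpout]
defaultGroup = mygroup

[tcpout:mygroup]
server = A:9997, B:9997
autoLB = true
"""

targets = forwarder_targets(SLIDE_CONF)
if len(targets) < 2:
    print("warning: only one indexer configured, no failover target")
```
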
HA Searching

2a. Search Head Clustering (SHC)
2b. Search Head Pooling (SHP)

HA Searching

Typical Search Hierarchy

Indexer A   Indexer B   . . .   Indexer N

HA Search Head Pooling

NFS-based Search Head Pooling has been deprecated*

*Still works and is supported in the current Splunk version, but plan for its eventual removal.

HA SHP
NFS used to sync:
• SH Configurations
• Job Artifacts
• SH Schedulers

Search Heads A and B (shared NFS)

Indexer A   Indexer B   . . .   Indexer N

HA Search Head Clustering (SHC)

• Improved horizontal scaling
• Improved high availability
• No single point of failure

HA SHC vs. SHP

SHC: NFS-less
SHP: Uses NFS; single point of failure; performance issues

HA SHC
Replication protocol syncs:
- Configurations
- Job Artifacts

Search Heads A, B, C (one acts as Captain)
Deployer → Configurations

Deployer ensures identical deployed configurations.
Captain plays a special role in cluster orchestration and job scheduling.

Indexer A   Indexer B   Indexer C   . . .   Indexer N

HA SHC Operation - High Level
• Deployer ensures all SHC members have identical baseline configurations
  – Subsequent UI changes propagated using an internal replication mechanism
• Job Scheduler gets disabled on all members but the Captain
• Captain selects members to run scheduled jobs based on load
  – Selection based on load statistics. Ensures better load distribution vs. SHP
• Captain orchestrates job artifact replication to selected members/candidates of the cluster.
• Transparent job artifact proxying (and eventual replication) if artifact not present on user's SH.

HA Deploying SHC
• Same SH version and high speed network (LAN)
  – More storage required vs. stand-alone SHs. Linux/Solaris only
• Needs LB and a Deployer instance (DS or MN can also be used to fulfill this role)
• Select RF per your HA/DR requirements
• Configure Deployer first with a secret key
• Initialize each instance, point them to Deployer, then bootstrap one of them to become the cluster captain
• More details on Splunk Docs

HA Indexing

3. Indexer Clustering

HA Index Replication
• Cluster = a group of search peers (indexers) that replicate each other's buckets
• Data Availability
  – Availability for ingestion and searching
• Data Fidelity
  – Forwarder Acknowledgement, assurance
• Disaster Recovery
  – Site awareness
• Search Affinity
  – Local search preference vs. remote

Trade-offs
• Extra storage
• Slightly increased processing load

HA Cluster Components
• Master Node
  – Orchestrates replication/remedial process. Informs the SH where to find searchable data. Helps manage peer configurations.
• Peer Nodes
  – Receive and index data. Replicate data to/from other peers. Number of peer nodes ≥ RF
• Search Head(s)
  – Must use one to search across the cluster.
• Forwarders
  – Use with auto-lb and indexer acknowledgement

HA Single Site Cluster Architecture

Credit: Splunk Docs Team

HA Replication Factor (RF)
• Number of copies of data in the cluster. Default RF=3
• Cluster can tolerate RF-1 node failures

Credit: Splunk Docs Team

HA Search Factor (SF)
• Number of searchable copies of data in the cluster. Default SF=2
• Requires more storage
• Replicated vs. Searchable Bucket

Credit: Splunk Docs Team

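The arithmetic behind RF and SF can be sketched as a small planning check. It uses the defaults from the slides (RF=3, SF=2), the RF-1 failure tolerance, and the requirement that the number of peer nodes be at least RF. The constraint that SF cannot exceed RF is added here as an assumption about cluster configuration, and the `ClusterPlan` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPlan:
    peers: int
    replication_factor: int = 3  # default RF from the slide
    search_factor: int = 2       # default SF from the slide

    def tolerable_failures(self) -> int:
        """Peer failures the cluster can absorb without losing data (RF-1)."""
        return self.replication_factor - 1

    def problems(self) -> list:
        """Return human-readable issues with the plan; empty if none found."""
        issues = []
        if self.peers < self.replication_factor:
            issues.append("number of peer nodes must be >= RF")
        if self.search_factor > self.replication_factor:
            issues.append("SF cannot exceed RF")
        if self.search_factor < 1:
            issues.append("SF must be at least 1")
        return issues
```

For example, `ClusterPlan(peers=3)` with the defaults reports no problems and tolerates two peer failures, while `ClusterPlan(peers=2)` is flagged because it has fewer peers than RF.
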
HA Clustered Indexing
• Originating peer node streams copies of data to other clustered peers.
  – Receiving peers store those copies.
• Master determines replicated data destination.
  – Tells each peer which peers to stream data to. Does not sit on the data path.
• Master manages all peer-to-peer interactions and coordinates remedial activities.
• Master keeps track of which peers have searchable data.
  – Ensures that there are always SF copies of searchable data available.

HA Clustered Searching
• Search head coordinates all searches in the cluster
• SH relies on master to tell it who its peers are.
  – The master keeps track of which peers have searchable data
• Only one searchable copy of each bucket takes part in searches, a.k.a. the primary
  – i.e., searches occur over primary buckets only.
• Primary buckets may change over time
  – Peers know their status and therefore know where to search

Multisite Clustering
• Site awareness introduced in Splunk 6.1
• Improved disaster recovery
  – Multisite clusters provide site failover capability
• Search Affinity
  – Search heads will scope searches to local site, whenever possible
  – Ability to turn off for better throughput vs. cross-site bandwidth

Multisite Cluster Architecture

Differences vs. single site

• Assign a site to each node
• Specify RF and SF on a site-by-site basis

Credit: Splunk Docs Team

Multisite Clustering Cont'd
• Each node belongs to an assigned site, except for the Master Node, which controls all sites but is not logically a member of any
• Replication of bucket copies occurs in a site-aware manner.
  – Multisite replication determines # copies on each site. Ex. 3 site cluster:
    site_replication_factor = origin:2, site1:1, site2:1, site3:1, total:4
• Bucket-fixing activities respect site boundaries when applicable
• Searches are fulfilled by local peers whenever possible (a.k.a. search affinity)
  – Each site must have at least a full set of searchable data

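To make the `site_replication_factor` example concrete, the sketch below works out how many copies each site would hold for a bucket that originates on a given site. It assumes that the origin site receives the larger of the `origin` value and its own explicit value, and that any copies left over from `total` may be placed on any site. Both rules are my reading of the setting and should be checked against the Splunk documentation. The function names are hypothetical.

```python
def parse_site_replication_factor(value):
    """Turn 'origin:2, site1:1, total:4' into {'origin': 2, 'site1': 1, 'total': 4}."""
    spec = {}
    for part in value.split(","):
        key, _, count = part.strip().partition(":")
        spec[key] = int(count)
    return spec


def copies_per_site(value, origin_site):
    """Return (copies per site, copies not pinned to a site) for one bucket."""
    spec = parse_site_replication_factor(value)
    total = spec.pop("total")
    origin = spec.pop("origin")

    copies = dict(spec)
    # Assumption: the originating site gets the larger of the two values.
    copies[origin_site] = max(copies.get(origin_site, 0), origin)

    unassigned = total - sum(copies.values())
    if unassigned < 0:
        raise ValueError("total is smaller than the per-site minimums")
    return copies, unassigned


SLIDE_VALUE = "origin:2, site1:1, site2:1, site3:1, total:4"
per_site, spare = copies_per_site(SLIDE_VALUE, "site1")
```

Under these assumptions, a bucket originating on site1 ends up with two copies on site1 and one each on site2 and site3, which accounts for all four copies in `total`.
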
Putting it Together

Search Head Clustering (Deployer)
Indexer Clustering (Master)
Forwarding Layer – autoLB

END Top Takeaways
• DR – Process of backing up and restoring service in case of disaster
  – Configuration files – copy of $SPLUNK_HOME/etc/ folder
  – Indexed data – back up and restore buckets
    ◦ Hot, warm, cold, frozen
    ◦ Can't back up hot (without snapshots) but can safely back up warm and cold
• HA – continuously operational system bounded by a set of tolerances
  – Data collection
    ◦ Autolb from forwarders to multiple indexers
    ◦ Use Indexer Acknowledgement to protect in-flight data
  – Searching
    ◦ Search Head Clustering (SHC)
  – Indexing
    ◦ Use Index Replication
