0% found this document useful (0 votes)
7 views19 pages

Es Lab Final

The document outlines a lab course on Distributed Information Systems focusing on Elasticsearch, detailing the setup of a three-node cluster and the management of data through indices, shards, and replicas. It includes instructions for creating virtual machines, installing Ubuntu and Elasticsearch, and performing CRUD operations using RESTful APIs. The course aims to provide hands-on experience with Elasticsearch's architecture and functionality in a distributed environment.

Uploaded by

springlee
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views19 pages

Es Lab Final

The document outlines a lab course on Distributed Information Systems focusing on Elasticsearch, detailing the setup of a three-node cluster and the management of data through indices, shards, and replicas. It includes instructions for creating virtual machines, installing Ubuntu and Elasticsearch, and performing CRUD operations using RESTful APIs. The course aims to provide hands-on experience with Elasticsearch's architecture and functionality in a distributed environment.

Uploaded by

springlee
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

CSCI 5533 - Distributed Information Systems

Distributed Clusters in Elasticsearch


University of Houston - Clear Lake
Spencer Riner
January 26, 2020
Contents
1 Introduction 2
1.1 Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Shards and Replicas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 A Simple Shard Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Node Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 RESTful APIs and curl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Lab Instructions 5
2.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Downloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Part 1 - Cluster Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Virtual Machine Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.3 Install Ubuntu 18.04 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.4 Install Elasticsearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.5 Initial Elasticsearch config . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.6 Clone es-master-a . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.7 Configure Elasticsearch on the new nodes . . . . . . . . . . . . . . . . . . . 8
2.2.8 Submission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Part 2 - Inserting and Deleting Elasticsearch Data . . . . . . . . . . . . . . . . . . 11
2.3.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Create a New Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Add Documents to the car index . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.4 Delete a Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.5 Unexpected Node Shutdown . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.6 Lab Submission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Conclusion 17

4 Further Study 17

1
1 Introduction
Elasticsearch is a distributed and open source search engine used to index various types of un-
structured data [1]. Each independent machine running an instance of Elasticsearch is referred to
in this lab as a server. Many servers who coordinate with one another comprise a cluster. In this
lab, you will create a three-node cluster that is able to recover from an unexpected outage without
any user intervention. The Elasticsearch documentation is extensive and will prove useful during
the lab. You can find it here: https://www.elastic.co/guide/en/elasticsearch/reference/
current/index.html

1.1 Indices
The essential components of an Elasticsearch cluster are indices, shards, and replicas. Indices
are a logical namespace which map to one or more primary shards and have zero or more replica
shards [2]. An index, in some cases, can be thought of similarly to a relational database. See the
table below for a correlation of the terms used in each. The advantage that Elasticsearch provides
is in the sharding of indices. Depending on your architecture, shards can be distributed across
multiple servers and replicated.

Table 1: Comparison between Elasticsearch and relational database terms.


Structured DBMS Elasticsearch
Database Index
Tables Types
Rows Documents
Attributes Properties

Figure 1: The nested structure of documents in an Elasticsearch index.

2
1.2 Shards and Replicas
The concept of shards and replicas is important to understand. A shard is a self-contained index
that contains a subset of that index’s documents [3]. This is how Elasticsearch distributes its data
across multiple physical nodes. There are two types of shards: primary and replica.
Primary shards are the original copy of the shard. Primary shards are then copied to other
machines as replicas. When a server is powered off or loses its network connection, the replica
shards are used to create new primary shards to ensure the availability of the index data.
In this lab, each index will have three primary shards (one for each node) and one replica of each
of the primary shards. See Figure 2.

Figure 2: An Elasticsearch cluster with three nodes.

The squares labelled P0, P1, and P2 all represent primary shards. These are the original copies
of the documents contained in the index. The squares labelled R0, R1, and R2 are the respec-
tive replicas of each primary shard. You can see that if any one of the nodes were to go down
unexpectedly, any of the shards could be reliably replicated to recreate the 0 through 2 shards.

1.2.1 A Simple Shard Example


A car dealership has decided to use Elasticsearch to manage their inventory. An index called
car is created. Within the car index, there exists a type for each car manufacturer. Within each
type, there are multiple documents that represent the available models. The documents have
properties such as VIN number, license plate number, and color.
Similar indices for trucks, SUVs, and vans may also be created. After the indices have been created
and populated, shards and replicas are created according to the configuration set in the cluster.
In this example, each index has three shards and one replica of each shard. The cluster has three
nodes to distribute the shards across. See Figure 3.

3
Figure 3: The car index distributed across three nodes.

Each shard contains its own set of documents from the index, in this case multiple models of cars
available at the dealership. It’s important to know that the shards do not overlap, and each shard
is required for the complete index. Because of the replicas, any of the three nodes could go down
unexpectedly and the entire index (via Shards 1, 2, and 3) would still be available. This is one of
the advantages to using a multi-node cluster with Elasticsearch.

1.3 Node Roles


There are many types of Elasticsearch nodes, but we are only going to use two: Master-eligible
nodes and Data nodes.
Master-eligible nodes are responsible for index creation and deletion, as well as deciding which
shards are allocated to which servers [4]. In the case of an unexpected outage on a Master node, the
cluster has an election process that will designate a new master from the remaining master-eligible
nodes. This presents the potential for split brain [5] within the cluster. Because of this, a special
role called “Voting-only master-eligible” exists. This terminology can be confusing, because the
node is actually not eligible to become the master, but exists only to resolve election conflicts.
See the Elasticsearch docs for more info.
Data nodes store the shards and perform resource-intensive operations requested by the Master
node.

1.4 RESTful APIs and curl


Elasticsearch has a RESTful API available for performing create, read, update, and delete (CRUD)
actions on its documents. REST stands for Representational State Transfer [6] and is easily
accessed using the command line tool curl. Curl does Internet transfers for resources specified
as URLs using Internet protocols [7] and will be the primary method of creating and deleting
Elasticsearch indices in this lab. See the example below for creating a document in the car index.

4
curl -XPOST "http://localhost:9200/car/toyota" -d’
{
"model": "Camry"
"year": "2009"
"color": "green"
}’

2 Lab Instructions
This lab will be separated into two separate portions: installing and configuring Elasticsearch,
and inserting and deleting data in the cluster.
The Elasticsearch cluster for this lab will consist of three nodes, each running on an independent
virtual machine with its own IP address (your IP addresses may differ):

Table 2: Elasticsearch cluster overview.


Host Name Role IP Address
es-master-a Master-Eligible Node 192.168.128.4
es-master-b Master-Eligible Node 192.168.128.7
es-data-a Data Node 192.168.128.8

In production clusters, the master nodes generally do not do any data processing. However, for
the purposes of this lab, the master nodes will process data using the node.data: true option
in /etc/elasticsearch/elasticsearch.yml.

2.1 Prerequisites
This lab will use Oracle VirtualBox as a hypervisor for three virtual machines. You are free to
use different software if you prefer. You will also need 60 GB of free space on your drive and at
least 8GB of memory.

2.1.1 Downloads
You will need Oracle VirtualBox installed on your computer.
https://www.virtualbox.org/wiki/Downloads

You also need an image of Ubuntu 18.04 LTS Server.


https://ubuntu.com/download/server

5
2.2 Part 1 - Cluster Setup
See Figure 4 for the cluster architecture.

Figure 4: Elasticsearch cluster architecture.

2.2.1 Learning Objectives


There are many reasons an organization may want to employ Elasticsearch. Many use what
is known as the Elastic Stack, which is a collection of three separate software packages called
Logstash, Elasticsearch, and Kibana, to aggregate log information from client machines. This
project will introduce you to the concepts of virtualization, installing a Linux operating system,
and installing software required to configure an Elasticsearch cluster.

2.2.2 Virtual Machine Creation


1. Create an initial virtual machine with these attributes (leave Create a virtual hard disk
now selected):

• Name: es-master-a
• Type: Linux
• Version: Ubuntu (64-bit)
• Memory Size: 2560 MB
• File Size: 20.0 GB

2. Open the Preferences window found in the File menu and select Network on the sidebar.
Click the icon to add a NAT 1 Network.

3. Set the Network Name to dis and the Network CIDR to 192.168.128.0/24.
1
Network Address Translation

6
4. Open the Settings window for your newly created VM.

5. Choose Network on the sidebar and select NAT Network from the Attached to: dropdown
menu. Choose the dis NAT Network.

2.2.3 Install Ubuntu 18.04


1. Start the es-master-a virtual machine with the Start button.

2. Choose the Ubuntu image file you downloaded previously when prompted.

3. Choose the default options during the installer. When you get to the Profile setup dialog,
enter these details:

• Your name: dis


• Your server’s name: es-master-a
• Pick a username: dis
• Choose a password: dislab2020

4. Do not enable ssh or install any suggested server snaps.

5. Reboot when prompted.

2.2.4 Install Elasticsearch


1. Log in to the machine using the username and password you set during OS installation (dis
and dislab2020).

2. Update the apt repository: sudo apt update

3. Install the apt-transport-https package necessary to access a repository over HTTPS:


sudo apt install apt-transport-https

4. Install OpenJDK 8: sudo apt install openjdk-8-jdk


2
5. Import the Elasticsearch’s repository’s GNU Privacy Guard (GPG) key with wget:
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key
add -
You should see an OK message.

6. Add the Elasticsearch repository:


sudo sh -c ’echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main"
> /etc/apt/sources.list.d/elastic-7.x.list’

7. Use cat to verify the string deb https://artifacts.elastic.co/packages/7.x/apt stable


main is present in the file /etc/apt/sources.list.d/elastic-7.x.list.
cat /etc/apt/sources.list.d/elastic-7.x.list
2
GnuPG is a complete and free implementation of the OpenPGP standard as defined by RFC4880 (also known
as PGP) [8]

7
8. Update the repository information again with sudo apt update and then install Elastic-
search with sudo apt install elasticsearch

2.2.5 Initial Elasticsearch config


1. Open the Elasticsearch config file using the text editor of your choice.
sudo vim /etc/elasticsearch/elasticsearch.yml

2. Note that most of the lines in this file are commented. Find the lines listed below, uncomment
them if needed, and change their values as shown. The lines after the line break beginning
with node. will need to be added yourself.

cluster.name: discluster
node.name: es-master-a
# You need to run ifconfig to find your node’s IP address
network.host: 192.168.128.4
discovery.seed_hosts: ["192.168.128.4", "192.168.128.7", "192.168.128.8"]
cluster.initial_master_nodes: ["192.168.128.4", "192.168.128.8"]

# These lines must be added


node.master: true
node.voting_only: false
node.data: true
node.ingest: true
node.ml: false
xpack.ml.enabled: false
cluster.remote.connect: false

2.2.6 Clone es-master-a


1. Power off es-master-a.

2. Right click es-master-a in the sidebar.

3. Name the clones es-master-b and es-data-a.

4. Use the Generate new MAC addresses for all network adapters option when cloning.

5. Power on es-master-a, es-master-b, and es-data-a.

2.2.7 Configure Elasticsearch on the new nodes


1. You will notice the host name on both clones is still es-master-a. Run sudo hostnamectl
set-hostname es-master-b and sudo hostnamectl set-hostname es-data-a.

2. Run sudo truncate -s 0 /etc/machine-id on both clones and reboot.

8
3. Make a note of the IP addresses on the clones after the reboot. Go to the cluster array lines
in the Elasticsearch configuration file and make sure the IP addresses match the three VMs.

4. The configuration file will still be present on the clones, but you must change the following
lines to be unique for each node.

# es-master-b
node.name: es-master-b
# Use ifconfig to find IP address
network.host: 192.168.128.7

# es-data-a
node.name: es-data-a
# Use ifconfig to find IP address
network.host: 192.168.128.8
node.voting_only: true

5. Start and enable the Elasticsearch service on each of the virtual machines:

sudo systemctl enable elasticsearch


sudo systemctl start elasticsearch

6. If there is no output, the service started successfully. You can check with systemctl status
elasticsearch.

2.2.8 Submission
1. On each node, take screenshots with your name and student ID visible of the following
commands to verify the cluster formation was successful. You will need to use the host’s IP
address on each VM, e.g. 192.168.128.4 on es-master-a, etc. See example below. Send these
screenshots to your TA.

curl -XGET "http://192.168.128.4:9200/_cluster/health?pretty"


curl -XGET "http://192.168.128.4:9200/_cat/nodes?pretty"

9
Figure 5: Example of lab submission screenshot - es-master-a.

Figure 6: Example of lab submission screenshot - es-master-b.

10
Figure 7: Example of lab submission screenshot - es-data-a.

2. Demonstrate the running cluster to your TA by running the following commands with suc-
cessful return codes:

sudo systemctl status elasticsearch


curl -XGET "http://192.168.128.4:9200/_cluster/health?pretty"
curl -XGET "http://192.168.128.4:9200/_cat/nodes?pretty"

2.3 Part 2 - Inserting and Deleting Elasticsearch Data


2.3.1 Learning Objectives
Resources
https://www.w3resource.com/mongodb/nosql.php

The benefits of using a NoSQL database such as Elasticsearch are being free to store unstructured
data. In this section, we will use the curl tool to perform HTTP requests such as GET and POST
on our Elasticsearch cluster. We will also observe how nodes react when a member of the cluster
is powered off unexpectedly.

11
2.3.2 Create a New Index
Resources
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.
html

To create the car index, send a file containing JSON data to Elasticsearch using curl. Create a
car.json file in the text editor of your choice with the following contents:

{
"settings" : {
"index" : {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
}

Create the index using curl:

curl -X PUT "192.168.128.4:9200/car?pretty" \


-H ’Content-Type: application/json’ -d @car.json

Verify the index was created on es-master-b using curl:

curl -X GET "http://192.168.128.8:9200/_cat/indices

You should see an output line including the index name (car), the health state of the index (green),
the shard count (3) and the document count (1). See Figure 8. As new indices are created on
single nodes in the cluster, they are replicated to the other members of the cluster automatically.

Figure 8: Listing all indices in the cluster.

2.3.3 Add Documents to the car index


Resources
https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index.
html

Create a file named inventory.json with the following contents:

12
{"index":{"_id":"1"}}
{"make":"Toyota","model":"Camry","year":"1990","color":"green"}
{"index":{"_id":"2"}}
{"make":"Toyota","model":"Corolla","year":"2012","color":"blue"}
{"index":{"_id":"3"}}
{"make":"Toyota","model":"Celica","year":"2003","color":"white"}
{"index":{"_id":"4"}}
{"make":"Toyota","model":"Prius","year":"2016","color":"grey"}
{"index":{"_id":"5"}}
{"make":"Toyota","model":"Corolla","year":"2016","color":"white"}
{"index":{"_id":"6"}}
{"make":"Toyota","model":"Supra","year":"1994","color":"red"}
{"index":{"_id":"7"}}
{"make":"Toyota","model":"Yaris","year":"2014","color":"blue"}
{"index":{"_id":"8"}}
{"make":"Toyota","model":"Camry","year":"2017","color":"grey"}
{"index":{"_id":"9"}}
{"make":"Toyota","model":"Prius","year":"2014","color":"black"}

You can also download this file using wget in the terminal:

wget https://gitlab.com/spencerriner/dis_lab/-/raw/master/inventory.json

Perform a bulk upload of the inventory items with curl:

curl -H "Content-Type: application/json" \


-X POST "http://192.168.128.4:9200/car/_bulk?pretty&refresh" \
--data-binary @inventory.json}

Search for one of the documents on es-data-a using curl and the id (1-9) to verify successful
document creation. See Figure 9 for the expected output.

curl -X GET "http://192.168.128.7:9200/car/_doc/9?pretty"

13
Figure 9: Listing document with id 9.

As shown, as documents are added, they are immediately replicated to the other nodes in the
cluster. You can also use curl to search documents based on their properties. Use this curl
command to find documents with the color:white property:

curl -X GET "http://192.168.128.4:9200/car/_search/?q=color:white&pretty"

Figure 10: Listing documents with property color ”white”.

14
2.3.4 Delete a Document
Resources
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html

You can also delete documents in an index with curl using their id property:

curl -X DELETE "http://192.168.128.4:9200/car/_doc/1?pretty"

You should see a confirmation as shown in Figure 11.

Figure 11: Deleting document 1.

2.3.5 Unexpected Node Shutdown


One of the benefits of Elasticsearch is its ability to redistribute nodes if one of the cluster members
is powered off unexpectedly. Find the current master node with curl:

curl -XGET "http://192.168.128.4:9200/_cat/master?pretty"

Figure 12: Showing the master node of the cluster.

Issue a sudo poweroff command to the current master node (in this case, es-master-b) then
list all nodes on the other master-eligible node to verify the newly elected master node (it should
have an asterisk next to its name). Then verify the cluster health to make sure that all 6 shards
have been reallocated to the remaining nodes. See Figure 13.

15
curl -XGET "http://192.168.128.4:9200/_cat/nodes?pretty"
curl -XGET "http://192.168.128.4:9200/_cluster/health?pretty"

Figure 13: Showing the master node of the cluster.

Power on the old master and run the cluster health command again to verify number of nodes is
back to 3.

2.3.6 Lab Submission


Take screenshots of the results of these queries and submit to the TA.

1. List all cars made in 2014

2. List all Camry models

3. List all grey cars

4. Create a new car document with properties of your choice

5. Delete the document with id: 4

Present a demo to the TA demonstrating the following:

1. Show a documents properties by querying its id

2. Search for a document using one of its properties

3. Power off the master node, show the new master and a cluster health state of green

16
3 Conclusion
You have now successfully created a distributed Elasticsearch cluster from scratch and demon-
strated how it provides a high level of availability by tolerating sporadic node loss. Elasticsearch
is highly scalable, as more nodes can be added to distribute the storage and compute needs of a
growing dataset.

4 Further Study
For more experience with Elasticsearch, you may try these additional projects.

1. Install the Elastic Stack (Elasticsearch, Logstash, Kibana) on a cluster of VMs and
visualize data using the Kibana web interface.
Tutorial: https://www.digitalocean.com/community/tutorials/
how-to-install-elasticsearch-logstash-and-kibana-elastic-stack-on-ubuntu-18-04

2. Send syslog messages from a Linux host to Logstash using rsyslog.


Tutorial: https://www.digitalocean.com/community/tutorials/
how-to-centralize-logs-with-rsyslog-logstash-and-elasticsearch-on-ubuntu-14-04

3. After installing the Elastic Stack, install Filebeat on the nodes to ingest data to the cluster.
Tutorial: https://www.elastic.co/guide/en/beats/filebeat/current/
filebeat-getting-started.html

References
[1] Elastic, “What is elasticsearch?,” January 2020. [Online]. Available: https://www.elastic.
co/what-is/elasticsearch. [Accessed January 28, 2020].

[2] Zachary Tong, “What is an elasticsearch index?,” February 2013. [Online]. Available: https:
//www.elastic.co/blog/what-is-an-elasticsearch-index. [Accessed January 26, 2020].

[3] Elastic, “Scalability and resilience: clusters, nodes, and shards,” December 2019.
[Online]. Available: https://www.elastic.co/guide/en/elasticsearch/reference/
current/scalability.html. [Accessed January 28, 2020].

[4] Elastic, “Node,” January 2020. [Online]. Available: https://www.elastic.co/guide/en/


elasticsearch/reference/current/modules-node.html. [Accessed January 26, 2020].

[5] Adam Vanderbush, “Avoiding the split brain problem in elasticsearch,” June 2017. [On-
line]. Available: https://qbox.io/blog/split-brain-problem-elasticsearch. [Accessed
February 2, 2020].

[6] “Rest api tutorial,” February 2020. [Online]. Available: https://restfulapi.net/. [Ac-
cessed February 4, 2020].

[7] Daniel Stenberg, Everything curl. February 2020. [Online]. Available: https://ec.haxx.se/.
[Accessed February 4, 2020].

17
[8] “Gnupg,” January 2020. [Online]. Available: https://gnupg.org. [Accessed February 8,
2020].

[9] Elastic, “Adding nodes to your cluster,” December 2019. [Online]. Avail-
able: https://www.elastic.co/guide/en/elasticsearch/reference/current/
add-elasticsearch-nodes.html. [Accessed January 26, 2020].

[10] Elastic, “Index some documents,” December 2019. [Online]. Available: https://www.
elastic.co/guide/en/elasticsearch/reference/current/getting-started-index.
html. [Accessed February 9, 2020].

18

You might also like