Hand Book: Ahmedabad Institute of Technology
CE & IT Department
Hand Book
BIG DATA ANALYTICS (2180710)
Year: 2020-21
Big data analytics helps organizations analyze their data in depth so that better decisions and strategies can be taken for the development of the organization.
Big Data includes huge volume, high velocity, and extensible variety of data.
The data in it will be of three types.
Structured data − Relational data.
Semi Structured data − XML data.
Unstructured data − Word, PDF, Text, Media Logs.
1) Structured
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
Over time, computer-science talent has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it.
However, the size of such data is now growing to a huge extent, with typical sizes in the range of multiple zettabytes.
2) Unstructured
Any data with unknown form or the structure is classified as unstructured data.
In addition to the size being huge, un-structured data poses multiple challenges
in terms of its processing for deriving value out of it.
Typical example of unstructured data is, a heterogeneous data source containing
a combination of simple text files, images, videos etc.
Nowadays organizations have a wealth of data available with them, but unfortunately they don't know how to derive value out of it, since this data is in its raw, unstructured form.
Examples of unstructured data: Typical human-generated unstructured data includes:
Text files: Word processing documents, spreadsheets, presentations, email, logs.
Email: Email has some internal structure thanks to its metadata, and we
sometimes refer to it as semi-structured. However, its message field is
unstructured and traditional analytics tools cannot parse it.
Social Media: Data from Facebook, Twitter, LinkedIn.
Website: YouTube, Instagram, photo sharing sites.
Mobile data: Text messages, locations.
Communications: Chat, IM, phone recordings, collaboration software.
Media: MP3, digital photos, audio and video files.
Business applications: MS Office documents, productivity applications.
Typical machine-generated unstructured data includes: satellite imagery, scientific data, sensor data, digital surveillance footage, and machine or application log files.
Volume
The sheer scale of the information processed helps define big data systems.
These datasets can be orders of magnitude larger than traditional datasets, which
demands more thought at each stage of the processing and storage life cycle.
Often, because the work requirements exceed the capabilities of a single
computer, this becomes a challenge of pooling, allocating, and coordinating
resources from groups of computers.
Cluster management and algorithms capable of breaking tasks into smaller
pieces become increasingly important.
Velocity
Another way in which big data differs significantly from other data systems is
the speed that information moves through the system.
Data is frequently flowing into the system from multiple sources and is often
expected to be processed in real time to gain insights and update the current
understanding of the system.
This focus on near instant feedback has driven many big data practitioners away
from a batch-oriented approach and closer to a real-time streaming system.
Variety
Big data problems are often unique because of the wide range of both the sources
being processed and their relative quality.
Data can be ingested from internal systems like application and server logs, from
social media feeds and other external APIs, from physical device sensors, and
from other providers.
Big data seeks to handle potentially useful data regardless of where it's coming
from by consolidating all information into a single system.
The formats and types of media can vary significantly as well. Rich media like images, video files, and audio recordings are ingested alongside text files, structured logs, etc.
While more traditional data processing systems might expect data to enter the
pipeline already labeled, formatted, and organized, big data systems usually
accept and store data closer to its raw state.
Ideally, any transformations or changes to the raw data will happen in memory
at the time of processing.
Features of HDFS
1) It is suitable for the distributed storage and processing.
2) Hadoop provides a command interface to interact with HDFS.
3) The built-in servers of namenode and datanode help users to easily check the
status of cluster.
4) Streaming access to file system data.
5) HDFS provides file permissions and authentication.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
1) Namenode
The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software.
It is a software that can be run on commodity hardware.
The system having the namenode acts as the master server and it does the following tasks: it manages the file system namespace, regulates clients' access to files, and executes file system operations such as renaming, closing, and opening files and directories.
2) Datanode
The datanode is a commodity hardware having the GNU/Linux operating
system and datanode software. For every node (Commodity hardware/System)
in a cluster, there will be a datanode.
These nodes manage the data storage of their system. Datanodes perform read-
write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
3) Block
The file in a file system is divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks.
Applications of Big Data
2) Banking
In banking, big data is used to manage large volumes of financial data.
The SEC (Securities and Exchange Commission) uses big data to monitor market and finance-related data of banks, and network analytics to track illegal activities in finance.
Big data is also used in the trading sector for trade analytics and decision support
analytics.
3) Healthcare
Big data is used in the healthcare sector in order to manage the large amount of data related to patients, doctors and other staff members.
It helps to eliminate failures like errors, invalid or inappropriate data, or any system fault that arise while using the system, and provides benefits like managing customer, staff and doctor information related to healthcare (Bughin et al. 2010).
According to (Gartner 2013), 43% of the healthcare industries have invested in
Big data.
5) Education
A major challenge in the education industry is to incorporate big data from
different sources and vendors and to utilize it on platforms that were not
designed for the varying data.
Big data is used quite significantly in higher education. For example, the University of Tasmania, an Australian university with over 26,000 students, has deployed a Learning and Management System that tracks, among other things, when a student logs onto the system, how much time is spent on different pages in the system, as well as the overall progress of a student over time.
In a different use case of the use of big data in education, it is also used to
measure teacher’s effectiveness to ensure a good experience for both students
and teachers.
6) Manufacturing
Similarly, large volumes of data from the manufacturing industry remain untapped. The underutilization of this information prevents improved product quality, energy efficiency, reliability, and better profit margins.
7) Government
In governments the biggest challenges are the integration and interoperability of
big data across different government departments and affiliated organizations.
In public services, big data has a very wide range of applications including:
energy exploration, financial market analysis, fraud detection, health related
research and environmental protection.
Some more specific examples are as follows:
Big data is being used in the analysis of large amounts of social disability claims,
made to the Social Security Administration (SSA), that arrive in the form of
unstructured data. The analytics are used to process medical information rapidly
and efficiently for faster decision making and to detect suspicious or fraudulent
claims.
The Food and Drug Administration (FDA) is using big data to detect and study patterns of food-related illnesses and diseases. This allows for faster responses, which have led to faster treatment and fewer deaths.
Why MapReduce?
Traditional Enterprise Systems normally have a centralized server to store and
process data.
The following illustration depicts a schematic view of a traditional enterprise
system.
The traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers.
Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
MapReduce divides a task into small parts and assigns them to many computers.
The results are collected at one place and integrated to form the result dataset.
How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
Input Phase − Here we have a Record Reader that translates each record in an
input file and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs
and processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from
the map phase into identifiable sets. It takes the intermediate keys from the
mapper as input and applies a user-defined code to aggregate the values in a
small scope of one mapper. It is not a part of the main MapReduce algorithm; it
is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the
Reducer is running. The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys together so that their
values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and
runs a Reducer function on each one of them. Here, the data can be aggregated,
filtered, and combined in a number of ways, and it requires a wide range of
processing. Once the execution is over, it gives zero or more key-value pairs to
the final step.
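The flow above can be made concrete with the classic word-count job. The following is a minimal sketch using the org.apache.hadoop.mapreduce API; the class names and whitespace tokenization are illustrative choices, and the Job driver setup is omitted.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: for every token in the input line, emit the intermediate
    // key-value pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }
    // Reduce phase: the shuffle/sort step has already grouped the values by
    // key, so the reducer just sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}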
MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce.
Twitter receives around 500 million tweets per day, which is nearly 6,000 tweets per second.
The following illustration shows how Twitter manages its tweets with the help of MapReduce.
Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data.
MapReduce implements sorting algorithm to automatically sort the output key-
value pairs from the mapper by their keys.
Sorting methods are implemented in the mapper class itself.
In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the
Context class (user-defined class) collects the matching valued keys as a
collection.
To collect similar key-value pairs (intermediate keys), the Mapper class takes the
help of RawComparator class to sort the key-value pairs.
The set of intermediate key-value pairs for a given Reducer is automatically
sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are presented
to the Reducer.
The Map phase processes each input file and provides the employee data in key-value
pairs (<k, v> : <emp name, salary>). See the following illustration.
The combiner phase (searching technique) will accept the input from the Map
phase as a key-value pair with employee name and salary. Using searching
technique, the combiner will check all the employee salary to find the highest
salaried employee in each file. See the following snippet.
<k: employee name, v: salary>
Max = the salary of the first employee (treated as the max salary so far)
if (v(next employee).salary > Max)
{
Max = v(salary);
}
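The same search can be written out in plain Java (outside Hadoop) to make the logic of the snippet above explicit; the employee names and salaries used here are hypothetical sample data.
import java.util.LinkedHashMap;
import java.util.Map;

public class MaxSalarySearch {
    public static void main(String[] args) {
        // Hypothetical (name, salary) pairs coming out of one mapper.
        Map<String, Integer> salaries = new LinkedHashMap<>();
        salaries.put("satish", 26000);
        salaries.put("gopal", 50000);
        salaries.put("kiran", 45000);

        String maxName = null;
        int max = Integer.MIN_VALUE;      // "Max = the salary of the first employee", generalized
        for (Map.Entry<String, Integer> e : salaries.entrySet()) {
            if (e.getValue() > max) {     // if (salary > Max) { Max = salary; }
                max = e.getValue();
                maxName = e.getKey();
            }
        }
        System.out.println(maxName + " " + max);  // highest-salaried employee in this split
    }
}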
Indexing
Normally, indexing is used to point to particular data and its address. It performs batch indexing on the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as
inverted index. Search engines like Google and Bing use inverted indexing
technique. Let us try to understand how Indexing works with the help of a
simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their contents are in double quotes.
T[0] = "it is what it is" T[1] = "what is it" T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2}
implies the term "is" appears in the files T[0], T[1], and T[2].
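A plain-Java sketch (not a MapReduce job) that reproduces this inverted index for the three small files above; it maps each term to the set of file indexes in which it appears.
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndex {
    public static void main(String[] args) {
        String[] docs = {"it is what it is", "what is it", "it is a banana"};  // T[0], T[1], T[2]
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int d = 0; d < docs.length; d++) {
            for (String term : docs[d].split("\\s+")) {
                // Add this document id to the posting list of the term.
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(d);
            }
        }
        // Prints: "a": [2], "banana": [2], "is": [0, 1, 2], "it": [0, 1, 2], "what": [0, 1]
        index.forEach((term, ids) -> System.out.println("\"" + term + "\": " + ids));
    }
}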
TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency −
Inverse Document Frequency. It is one of the common web analysis algorithms.
Here, the term 'frequency' refers to the number of times a term appears in a
document.
Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is
calculated by the number of times a word appears in a document divided by the
total number of words in that document.
Example
Consider a document containing 1000 words, wherein the word hive appears 50
times. The TF for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1000
of these. Then, the IDF is calculated as log(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
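The arithmetic above can be checked with a few lines of Java; base-10 logarithms are assumed, matching the worked numbers.
public class TfIdfExample {
    // TF = (occurrences of the term in the document) / (total terms in the document)
    static double tf(long termCount, long totalTerms) {
        return (double) termCount / totalTerms;
    }
    // IDF = log10(total documents / documents containing the term)
    static double idf(long totalDocs, long docsWithTerm) {
        return Math.log10((double) totalDocs / docsWithTerm);
    }
    public static void main(String[] args) {
        double tf = tf(50, 1000);            // 0.05
        double idf = idf(10_000_000, 1000);  // log10(10,000) = 4.0
        System.out.println(tf * idf);        // 0.2
    }
}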
Apache Hadoop is the most important framework for working with Big Data. Hadoop's biggest strength is scalability: it scales seamlessly from working on a single node to thousands of nodes without any issue.
Hadoop Architecture
The architecture can be broken down into two branches:
1) Hadoop core components 2) Hadoop ecosystem
IDEMPOTENCE
Idempotent operation produces the same result no matter how many times it’s
executed.
In a relational database the inserts typically aren't idempotent, because executing them multiple times doesn't produce the same resulting database state.
Alternatively, updates often are idempotent, because they'll produce the same end result.
Any time data is being written, idempotence should be a consideration, and data ingress and egress in Hadoop is no different.
How well do distributed log collection frameworks deal with data retransmissions?
How do you ensure idempotent behavior in a MapReduce job where multiple
tasks are inserting into a database in parallel?
AGGREGATION
The data aggregation process combines multiple data elements.
In the context of data ingress this can be useful because moving large quantities of small files into HDFS potentially translates into NameNode memory woes, as well as slow MapReduce execution times.
Having the ability to aggregate files or data together mitigates this problem, and is a feature to consider.
RECOVERABILITY
Recoverability allows an ingress or egress tool to retry in the event of a failed operation.
Because it's unlikely that any data source, sink, or Hadoop itself can be 100 percent available, it's important that an ingress or egress action can be retried in the event of failure.
CORRECTNESS
In the context of data transportation, checking for correctness is how you verify that no data corruption occurred as the data was in transit.
When users work with heterogeneous systems such as Hadoop data ingress and egress tools, the fact that data is being transported across different hosts, networks, and protocols only increases the potential for problems during data transfer.
Common methods for checking correctness of raw data, such as on storage devices, include Cyclic Redundancy Checks (CRCs), which are what HDFS uses internally to maintain block-level integrity.
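As a small illustration of the idea (not HDFS's own per-chunk checksumming), a CRC computed before and after transfer can be compared with java.util.zip.CRC32; the payload here is a placeholder.
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumCheck {
    static long crc(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }
    public static void main(String[] args) {
        byte[] payload = "some ingested record".getBytes(StandardCharsets.UTF_8);
        long before = crc(payload);   // computed at the source
        // ... payload is transported into or out of Hadoop ...
        long after = crc(payload);    // recomputed at the destination
        System.out.println(before == after ? "intact" : "corrupted in transit");
    }
}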
Moving Data into Hadoop
Making data available to Hadoop is the first step while working with data in Hadoop.
There are two primary approaches: 1) HDFS level 2) MapReduce level.
Comparing Flume, Chukwa and Scribe
Flume
In the basic architecture of Flume, data generators (such as Facebook or Twitter) generate data which gets collected by individual Flume agents running on them.
Thereafter, a data collector (which is also an agent) collects the data from the agents, aggregates it, and pushes it into a centralized store such as HDFS or HBase.
Flume Event
An event is the basic unit of the data transported inside Flume.
Flume Agent
An agent is an independent daemon process (JVM) in Flume.
It receives the data (events) from clients or other agents and forwards it to its
next destination (sink or agent).
Flume may have more than one agent. Following diagram represents a Flume
Agent.
As shown in the diagram a Flume Agent contains three main components namely,
source, channel, and sink.
Source
A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume events.
Apache Flume supports several types of sources and each source receives events
from a specified data generator.
Example − Avro source, Thrift source, Twitter 1% source, etc.
Channel
A channel is a transient store which receives the events from the source and buffers them until they are consumed by sinks.
Agents and Adaptors
Chukwa agents do not collect some particular fixed set of data. Rather, they support dynamically starting and stopping Adaptors, which are small, dynamically controllable modules that run inside the Agent process and are responsible for the actual collection of data.
These dynamically controllable data sources are called adaptors, since they generally are wrapping some other data source, such as a file or a Unix command-line tool.
HICC
HICC, the Hadoop Infrastructure Care Center, is a web-portal-style interface for displaying data. Data is fetched from a MySQL database, which in turn is populated by a MapReduce job that runs on the collected data, after Demux.
Scribe
Scribe was a server for aggregating log data streamed in real-time from a large
number of servers. It was designed to be scalable, extensible without client-side
modification, and robust to failure of the network or any specific machine.
Scribe was developed at Facebook and released in 2008 as open source.
Scribe servers are arranged in a directed graph, with each server knowing only
about the next server in the graph. This network topology allows for adding
extra layers of fan-in as a system grows, and batching messages before sending
them between data centers, without having any code that explicitly needs to
understand data center topology, only a simple configuration.
Scribe was designed with reliability in mind, but without requiring heavyweight protocols and expansive disk usage. Scribe spools data to disk on any node to handle intermittent connectivity or node failure, but doesn't sync a log file for every message.
This creates a possibility of a small amount of data loss in the event of a crash or catastrophic hardware failure. However, this degree of reliability is often suitable for most Facebook use cases.
Features of HDFS
It is suitable for the distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the
status of cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software.
It is software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks: it manages the file system namespace and regulates clients' access to files.
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of
commodity hardware, failure of components is frequent. Therefore HDFS should
have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
Hardware at data − A requested task can be done efficiently, when the
computation takes place near the data. Especially where huge datasets are
involved, it reduces the network traffic and increases the throughput.
Hadoop Installation
This text describes the installation and configuration of Hadoop cluster backed
by the Hadoop Distributed File System, running on Ubuntu Linux.
Hadoop 1.2.1 (stable release)
Ubuntu 14.04 LTS
Java 7 (Open JDK for Linux)
Part 1: Prerequisites
Configuring Ubuntu:-
For these steps I will be creating 2-node cluster of Hadoop. Both machines have
Ubuntu 14.04 LTS installed with all latest updates from default repositories.
Networking:
Both machines must be able to reach each other over the network. The easiest is
to put both machines in the same network with regard to hardware and software
configuration.
When using virtual machines in VirtualBox:
These steps are only to be performed if configuring networking in Oracle VirtualBox. In the settings of each virtual machine, go to Network and select Bridged adapter to put the virtual machines on the same network as the host, so that all VMs are in the same network.
Inside the running VM, configure the IP address manually to allocate an address like 192.168.0.xxx. For example, the IP addresses are as follows:
master 192.168.0.1
slave 192.168.0.2
Now edit the /etc/hosts file to let the system know about other systems on the
network. Edit the file on all machines to look like this.
127.0.0.1    localhost
#127.0.1.1   hostname
192.168.0.1  master
192.168.0.2  slave
Disabling IPv6:
IPv6 should be disabled as a precaution, because Hadoop uses the IP address 0.0.0.0 for internal options and it would be bound to an IPv6 address by default in Ubuntu. To disable it, add the following lines to /etc/sysctl.conf. The machines should be restarted for the changes to take effect.
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Configuring Java: -
Java JDK is required to execute the Hadoop code, so it must be installed and
configured properly on each machine on the cluster.
Installation:
Run the following commands from a terminal
sudo apt-get install openjdk-7-jre openjdk-7-jdk
The installed Java files will be placed in /usr/lib/jvm/java-7-openjdk-i386
Installation:
To install ssh in Ubuntu: sudo apt-get install ssh
Generate keys: Execute this step on the master as well as all the slaves
ssh-keygen -t rsa -P ""
Adding this key to the authorized keys in ssh: Perform this step on all machines in the cluster to copy the above generated key to the trusted keys for the user.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
core-site.xml:
Insert the following code snippet in between <configuration></configuration> tags.
<property><name>fs.default.name</name><value>hdfs://master:54310</value>
<description>The name of the default file system.</description></property>
mapred-site.xml:
Insert the following code snippet in between
<configuration></configuration>tags.
<property><name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local",
then jobs are run in-process as a single map and reduce task.
</description></property>
hdfs-site.xml
<property><name>dfs.replication</name> <value>2</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description></property>
On slave machines
2404 DataNode
3102 SecondaryNameNode
2532 TaskTracker
3325 jps
This command copies file temp.txt from the local filesystem to HDFS.
HDFS Commands
1) touchz: HDFS command to create a file in HDFS with file size 0 bytes.
Usage: hdfs dfs -touchz /directory/filename
Command: hdfs dfs -touchz /new_edureka/sample
2) text
HDFS command that takes a source file and outputs the file in text format.
Usage: hdfs dfs -text /directory/filename
Command: hdfs dfs -text /new_edureka/test
3) cat
Architecture of Hive
The following component diagram depicts the architecture of Hive:
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
Step 1. Execute Query − The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Step 2. Get Plan − The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
Step 3. Get Metadata − The compiler sends a metadata request to the Metastore (any database).
Step 4. Send Metadata − The Metastore sends the metadata as a response to the compiler.
Step 5. Send Plan − The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
Step 6. Execute Plan − The driver sends the execute plan to the execution engine.
Step 7. Execute Job − Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
Step 7 (contd.). Metadata Ops − Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
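The Driver in step 1 can also be reached programmatically. The following is a minimal sketch of submitting a HiveQL query to HiveServer2 over JDBC; the connection URL, credentials and table name are assumptions for a local setup, and the hive-jdbc library must be on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");          // HiveServer2 JDBC driver
        String url = "jdbc:hive2://localhost:10000/default";       // assumed local HiveServer2
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT eid, name FROM employee LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
            }
        }
    }
}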
The following command is used to verify the download and extract the hive
archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
Apache Hive
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
The following command is used to execute ~/.bashrc file:
$ source ~/.bashrc
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration>
and </configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create =
true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Now set them in HDFS before verifying Hive. Use the following commands:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>
i. Embedded Metastore
In Hive by default, metastore service runs in the same JVM as the Hive service. It
uses embedded derby database stored on the local file system in this mode. Thus
both metastore service and hive service runs in the same JVM by using
embedded Derby Database. But this mode also has a limitation: as only one embedded Derby database can access the database files on disk at any one time, only one Hive session can be open at a time.
ii. Local Metastore
This configuration is called as local metastore because metastore service still runs
in the same process as the Hive. But it connects to a database running in a
separate process, either on the same machine or on a remote machine. Before
starting Apache Hive client, add the JDBC / ODBC driver libraries to the Hive lib
folder.
iii. Remote Metastore
Moving further, another metastore configuration called Remote Metastore. In
this mode, metastore runs on its own separate JVM, not in the Hive service JVM.
If other processes want to communicate with the metastore server, they can communicate using Thrift network APIs. We can also run more than one metastore server in this case to provide higher availability.
This also brings better manageability/security because the database tier can be completely firewalled off, and the clients no longer need to share database credentials with each Hive user to access the metastore database.
HiveQL
HiveQL
Hive's SQL language is known as HiveQL; it is a combination of SQL-92, Oracle's SQL language and MySQL.
HiveQL provides some improved features over previous SQL standards, like analytics functions from SQL:2003.
It also has some Hive-specific extensions like multi-table inserts, TRANSFORM, MAP and REDUCE.
Hive is an open source data warehouse system used for querying and analyzing
large datasets. Data in Apache Hive can be categorized into Table,Partition, and
Bucket. The table in Hive is logically made up of the data being stored. Hive has
two types of tables which are as follows: Managed Table (Internal Table) External
Table
Hive Managed Tables-
It is also known as an internal table. When we create a table in Hive, it by default manages the data. This means that Hive moves the data into its warehouse directory.
Hive External Tables-
We can also create an external table. It tells Hive to refer to data that is at an existing location outside the warehouse directory.
Here we are going to cover the comparison between Hive Internal tables vs
External tables on the basis of different features. Let’s discuss them one by one-
Now, with the EXTERNAL keyword, Apache Hive knows that it is not managing
the data. So it doesn’t move data to its warehouse directory. It does not even
check whether the external location exists at the time it is defined. This is a very useful feature because it means we can create the data lazily after creating the table.
The important thing to notice is that when we drop an external table, Hive will
leave the data untouched and only delete the metadata.
ii. Security
Managed Tables –Hive solely controls the Managed table security. Within Hive,
security needs to be managed; probably at the schema level (depends on
organization).
External Tables –These tables’ files are accessible to anyone who has access to
HDFS file structure. So, it needs to manage security at the HDFS file/folder level.
iii. When to use Managed and external table
Use Managed table when –
We want Hive to completely manage the lifecycle of the data and table.
Data is temporary
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, destination String);
The following query executes JOIN on the CUSTOMER and ORDER tables, and
retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN
ORDERS o ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
LEFT OUTER JOIN
The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if
there are no matches in the right table. This means, if the ON clause matches 0
(zero) records in the right table, the JOIN still returns a row in the result, but
with NULL in each column from the right table.
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+
RIGHT OUTER JOIN
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches in the left table. The same query written with a RIGHT OUTER JOIN gives the following result:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik |1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
FULL OUTER JOIN
The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables that fulfil the JOIN condition, placing NULL where there is no match. On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
Hbase Concept
Introduction of HBase
HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.
HBase is a data model that is similar to Google’s big table designed to provide
quick random access to huge amounts of structured data. It leverages the fault
tolerance provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop File System.
Features of HBase
HBase is linearly scalable.
It has automatic failure support.
It provides consistent read and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy Java API for clients.
It provides data replication across clusters.
Zookeeper
Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.
Zookeeper has ephemeral nodes representing different region servers. Master
servers use these nodes to discover available servers.
Benefits of Distributed Applications
Reliability − Failure of a single or a few systems does not make the whole system fail.
Scalability − Performance can be increased as and when needed by adding more
machines with minor change in the configuration of the application with no
downtime.
Transparency − Hides the complexity of the system and shows itself as a single
entity / application.
Challenges of Distributed Applications
Race condition − Two or more machines trying to perform a particular task, which actually needs to be done only by a single machine at any given time.
ZooKeeper terminology:
Server − One of the nodes in the ZooKeeper ensemble. It provides all the services to clients and gives an acknowledgement to the client to inform it that the server is alive.
Ensemble − Group of ZooKeeper servers. The minimum number of nodes required to form an ensemble is 3.
Leader − Server node which performs automatic recovery if any of the connected nodes fails. Leaders are elected on service startup.
Follower − Server node which follows the leader's instructions.
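The ephemeral-node mechanism mentioned above can be sketched with the ZooKeeper Java client; the connect string, znode path and payload are illustrative, and a production client would wait for the connection event before creating nodes.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNodeDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        // An ephemeral znode disappears automatically when this session ends,
        // which is how region servers advertise their liveness to the master.
        String path = zk.create("/rs-demo", "host:port".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("Registered at " + path);
        zk.close();
    }
}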
Pig Architecture
Pig consists of two components:
1. Pig Latin, which is a language
2. A runtime environment, for running PigLatin programs.
A Pig Latin program consists of a series of operations or transformations which
are applied to the input data to produce output. These operations describe a data
flow which is translated into an executable representation, by Pig execution
environment. Underneath, results of these transformations are series of
MapReduce jobs which a programmer is unaware of. So, in a way, Pig allows the
programmer to focus on data rather than the nature of execution.
Hive vs Pig:
Hive is used for data analysis; Pig is used for data and programs.
Hive works on structured data; Pig works on semi-structured data.
Hive has HiveQL; Pig has Pig Latin.
Hive is used for creating reports; Pig is used for programming.
Hive works on the server side; Pig works on the client side.
Hive does not support Avro; Pig supports Avro.
Spark Stack
Spark SQL
Spark SQL is Spark’s package for working with structured data. It allows
querying data via SQL as well as the Apache Hive variant of SQL—called the
Hive Query Language (HQL)—and it supports many sources of data, including
Hive tables, Parquet, and JSON.
Beyond providing a SQL interface to Spark, Spark SQL allows developers to
intermix SQL queries with the programmatic data manipulations supported by
RDDs in Python, Java, and Scala, all within a single application, thus combining
SQL with complex analytics.
This tight integration with the rich computing environment provided by Spark
makes Spark SQL unlike any other open source data warehouse tool. Spark SQL
was added to Spark in version 1.0.
Shark was an older SQL-on-Spark project out of the University of California,
Berkeley, that modified Apache Hive to run on Spark. It has now been replaced
by Spark SQL to provide better integration with the Spark engine and language
APIs.
Spark Streaming
Spark Streaming is a Spark component that enables processing of live streams of
data.
Examples of data streams include logfiles generated by production web servers,
or queues of messages containing status updates posted by users of a web service.
Spark Streaming provides an API for manipulating data streams that closely
matches the Spark Core’s RDD API, making it easy for programmers to learn the
project and move between applications that manipulate data stored in memory,
on disk, or arriving in real time.
Underneath its API, Spark Streaming was designed to provide the same degree
of fault tolerance, throughput, and scalability as Spark Core.
MLlib
Spark comes with a library containing common machine learning (ML)
functionality, called MLlib. MLlib provides multiple types of machine learning
algorithms, including classification, regression, clustering, and collaborative
filtering, as well as supporting functionality such as model evaluation and data
import.
It also provides some lower-level ML primitives, including a generic gradient
descent optimization algorithm. All of these methods are designed to scale out
across a cluster.
GraphX
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph)
and performing graph-parallel computations.
Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API,
allowing us to create a directed graph with arbitrary properties attached to each
vertex and edge.
GraphX also provides various operators for manipulating graphs (e.g., subgraph
and mapVertices) and a library of common graph algorithms (e.g., PageRank and
triangle counting).
Data Analysis with Spark
Data Science Tasks
Data science, a discipline that has been emerging over the past few years, centers
on analyzing data. While there is no standard definition, for our purposes a data
scientist is somebody whose main task is to analyze and model data.
Data scientists may have experience with SQL, statistics, predictive modeling
(machine learning), and programming, usually in Python, Matlab, or R.
Features of RDD
Resilient: RDDs track data lineage information to recover the lost data,
automatically on failure. It is also called Fault tolerance.
Distributed: Data present in the RDD resides on multiple nodes. It is distributed
across different nodes of a cluster.
Lazy Evaluation: Data does not get loaded in the RDD even if we define it.
Transformations are actually computed when you call an action, like count or
collect, or save the output to a file system.
Immutability: Data stored in the RDD is read-only; you cannot edit the data which is present in the RDD. But you can create new RDDs by performing transformations on the existing RDDs.
In-memory Computation: RDD stores any intermediate data that is generated in memory (RAM) rather than on disk, so that it provides faster access.
Partitioning: Partitions can be done on any existing RDDs to create logical parts
that are mutable. You can achieve this by applying transformations on existing
partitions.
Transformations: These are functions which accept existing RDDs as the input
and outputs one or more RDDs. The data in the existing RDDs does not change
as it is immutable. Some of the transformation operations are shown in the table
given below:
map() − Returns a new RDD by applying the function on each data element.
filter() − Returns a new RDD formed by selecting those elements of the source on which the function returns true.
reduceByKey() − Used to aggregate the values of a key using a function.
groupByKey() − Used to convert a (key, value) pair to a (key, <iterable value>) pair.
These transformations are evaluated lazily. Every time transformations are applied, a new RDD is created.
Actions: Actions in Spark are functions which return the end result of RDD
computations. It uses a lineage graph to load the data onto the RDD in a
particular order. After all transformations are done, actions return the final result
to the Spark Driver. Actions are operations which provide non-RDD values.
Some of the common actions used in Spark are:
count() − Gets the number of data elements in an RDD.
collect() − Gets all the data elements in the RDD as an array.
reduce() − Aggregates the data elements of the RDD by taking two arguments and returning one.
take(n) − Used to fetch the first n elements of the RDD.
foreach(operation) − Used to execute the operation for each data element in the RDD.
first() − Retrieves the first data element of the RDD.
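A short Java sketch tying the two tables together: the transformations are only materialized when an action such as count() or reduce() runs. Running locally with master local[*] is an assumption for illustration.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasics {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformations are lazy: nothing executes yet.
        JavaRDD<Integer> squares = nums.map(x -> x * x);
        JavaRDD<Integer> evens = squares.filter(x -> x % 2 == 0);

        // Actions trigger the computation and return results to the driver.
        long count = evens.count();                // 2
        int sum = evens.reduce((a, b) -> a + b);   // 4 + 16 = 20
        System.out.println("count=" + count + ", sum=" + sum);

        sc.stop();
    }
}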
Motivation
Spark provides special types of operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network.
Java doesn't have a built-in tuple type, so Spark's Java API has users create tuples using the scala.Tuple2 class. Java users can construct a new tuple by writing new Tuple2(elem1, elem2) and can then access its relevant elements with the _1() and _2() methods.
Java users also need to call special versions of Spark's functions when creating pair RDDs. For instance, the mapToPair() function should be used in place of the basic map() function.
Creating a pair RDD using the first word as the key in a Java program:
PairFunction<String, String, String> keyData =
  new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String x) {
      return new Tuple2<String, String>(x.split(" ")[0], x);
    }
  };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
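Continuing the sketch above (and assuming the Java 8 lambda syntax is available), the pair RDD can now be aggregated by key; lines and pairs are the RDDs from the snippet.
// Count how many lines start with each first word.
JavaPairRDD<String, Integer> ones = pairs.mapValues(v -> 1);
JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);
counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));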
In [2]: company_df = sqlContext.read.format('com.databricks.spark.csv') \
            .options(header='true', inferschema='true') \
            .load('C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv')
        company_df.take(1)
You can choose the number of rows you want to view while displaying the data
of a dataframe. I have displayed the first row only.
Output: Out[2]: [Row(Rank=1, Title='Walmart', Website='http://www.walmart.com', Employees=2300000, Sector='retailing')]
Data exploration:
To check the datatype of every column of a dataframe and print the schema of
the dataframe in a tree format, you can use the following commands respectively.
Input: In [3]: company_df.cache()
               company_df.printSchema()
Output: Out[3]: DataFrame[Rank: int, Title: string, Website: string, Employees: int, Sector: string]
root
|– Rank: integer (nullable = true)
|– Title: string (nullable = true)
|– Website:string (nullable = true)
|– Employees: integer (nullable = true)
|– Sector: string (nullable = true)
Performing Descriptive Analysis:
Input: In [4]: company_df.describe().toPandas().transpose()
Output: Out[4]:
              0       1          2                   3               4
summary       count   mean       stddev              min             max
Rank          5       3.0        1.581138830084      1               5
Title         5       None       None                Apple           Walmart
Website       5       None       None                www.apple.com   www.walmart.com
Employees     5       584880.0   966714.2168190142   68000           2300000
Sector        5       None       None                Energy          Wholesalers
Machine learning in Industry
Computer systems with the ability to predict, learn from given data and improve themselves without having to be reprogrammed used to be only a dream, but in recent years this has been made possible using machine learning.
Machine learning is now one of the most used branches of artificial intelligence and is being adopted by big industries in order to benefit their businesses.
Following are some of the organisations where machine learning has various use
cases: PayPal:PayPal uses machine learning to detect suspicious activity.
IBM: There is a machine learning technology patented by IBM which helps to
decide when to handover the control of self-driving vehicle between a vehicle
control processor and a human driver.
Google:Machine learning is used to gather information from the users which
further is used to improve their search engine results.
Walmart: Machine learning in Walmart is used to benefit their efficiency
Amazon:Machine learning is used to design and implement personalised
product recommendations.
Facebook:Machine learning is used to filter out poor quality content.
Session Store
Managing session data using a relational database is very difficult, especially in cases where applications have grown very large.
In such cases the right approach is to use a global session store, which manages session information for every user who visits the site.
NoSQL is suitable for storing such web application session information, which is very large in size.
Since the session data is unstructured in form, it is easy to store it in schema-less documents rather than in relational database records.
Mobile Applications
Since smartphone users are increasing very rapidly, mobile applications face problems related to growth and volume.
Using a NoSQL database, mobile application development can be started with a small size and easily expanded as the number of users increases, which is very difficult if you consider relational databases.
Since NoSQL databases store data in a schema-less way, the application developer can update the apps without having to make major modifications to the database.
Mobile app companies like Kobo and Playtika use NoSQL and serve millions of users across the world.
Internet of Things
Today, billions of devices are connected to the internet, such as smartphones, tablets, home appliances, systems installed in hospitals, cars and warehouses. These devices generate a large volume and variety of data, and keep on generating it.
Relational databases are unable to store such data. NoSQL permits organizations to expand concurrent access to data from the billions of connected devices and systems, store huge amounts of data and meet the required performance.
E-Commerce
E-commerce companies use NoSQL to store huge volumes of data and handle a large number of requests from users.
Social Gaming
Advantages of NoSQL
NoSQL provides a high level of scalability.
It is used in distributed computing environments.
Implementation is less costly.
It provides storage for semi-structured data and also provides flexibility in schema.
Relationships are less complicated.
The advantages of NOSQL also include being able to handle :
Large volumes of structured, semi-structured and unstructured data.
Object-oriented algorithms permit implementations in order to achieve the
maximum availability over multiple data centers.
Eventual-consistency based systems scale update workloads better than
traditional OLAP RDBMS, while also scaling to very large datasets.
Programming that is easy to use and flexible. Efficient, scale-out
architecture instead of expensive, monolithic architecture
Differences between SQL and NoSQL database:
1) SQL databases are categorized as Relational Database Management Systems (RDBMS). NoSQL databases are categorized as non-relational or distributed database systems.
2) SQL databases have a fixed, static or predefined schema. NoSQL databases have a dynamic schema.
3) SQL databases display data in the form of tables, so they are known as table-based databases. NoSQL databases display data as collections of key-value pairs, documents, graph databases or wide-column stores.
4) SQL databases are vertically scalable. NoSQL databases are horizontally scalable.
5) SQL databases use a powerful language, "Structured Query Language", to define and manipulate the data. In NoSQL databases, collections of documents are used to query the data; this is also called an unstructured query language and it varies from database to database.
6) SQL databases are best suited for complex queries. NoSQL databases are not so good for complex queries because they are not as powerful as SQL queries.
7) SQL databases are not best suited for hierarchical data storage. NoSQL databases are best suited for hierarchical data storage.
8) MySQL, Oracle, SQLite, PostgreSQL and MS-SQL are examples of SQL databases. MongoDB, BigTable, Redis, RavenDB, Cassandra, HBase, Neo4j and CouchDB are examples of NoSQL databases.
NewSQL
NewSQL is a class of modern relational database management systems that seek
to provide the same scalable performance of NoSQL systems for online
transaction processing (OLTP) read-write workloads while still maintaining the
ACID guarantees of a traditional database system.
Although NewSQL systems vary greatly in their internal architectures, the two distinguishing features common amongst them are that they all support the relational data model and use SQL as their primary interface.
The applications typically targeted by these NewSQL systems are characterized as OLTP, that is, having a large number of transactions that:
(1) are short-lived (i.e., no user stalls), (2) touch a small subset of data using index lookups (i.e., no full table scans or large distributed joins), and (3) are repetitive (i.e., execute the same queries with different inputs).
However, some of the NewSQL databases are also HTAP systems, therefore,
supporting hybrid transactional/analytical workloads.
These NewSQL systems achieve high performance and scalability by eschewing
much of the legacy architecture of the original IBM System R design, such as
heavyweight recovery or concurrency control algorithms.
One of the first known NewSQL systems is the H-Store parallel database system.
Advantages of NoSQL
Large volumes of structured, semi-structured, and unstructured data
Agile sprints, quick iteration, and frequent code pushes
Object-oriented programming that is easy to use and flexible
Efficient, scale-out architecture instead of expensive, monolithic architecture
Secondary Indexes
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
Replication
Replication is the process of synchronizing data across multiple servers.
Replication provides redundancy and increases data availability with multiple
copies of data on different database servers.
Replication protects a database from the loss of a single server.
Replication also allows you to recover from hardware failure and service
interruptions. With additional copies of the data, you can dedicate one to disaster
recovery, reporting, or backup.
Why Replication?
To keep your data safe
High (24*7) availability of data
Disaster recovery
No downtime for maintenance (like backups, index rebuilds, compaction)
Read scaling (extra copies to read from)
Replica set is transparent to the application
Load Balancer
A load balancer is a device that distributes network or application traffic across a
cluster of servers. Load balancing improves responsiveness and increases
availability of applications.
A load balancer sits between the client and the server farm accepting incoming
network and application traffic and distributing the traffic across multiple
backend servers using various methods.
By balancing application requests across multiple servers, a load balancer
reduces individual server load and prevents any one application server from
becoming a single point of failure, thus improving overall application availability
and responsiveness.
Scaling Up Vs Scaling Out
Scaling out is considered more important as commodity hardware is cheaper
compared to cost of special configuration hardware (super computer).
Scalable Architecture
Application architecture is scalable if each layer in multi layered architecture is
scalable (scale out). For example :– As shown in diagram below we should be
able linearly scale by add additional box in Application Layer or Database
Laye
MongoDB's Tools
MongoDB Tools consists of the following:
JavaScript shell
Database drivers
Command-line tools.
Defining Variables
The first place to begin within JavaScript is defining variables. Variables are a
means to name data so that you can use that name to temporarily store and
access data from your JavaScript files. Variables can point to simple data types
such as numbers or strings, or they can point to more complex data types such as
objects.
To define a variable in JavaScript, you use the var keyword and then give the
variable a name, as in this example:
var myData;
You can also assign a value to the variable in the same line. For example, the
following line of code creates a variable named myString and assigns it the value
of "Some Text":var myString = "Some Text";
It works as well as this code:
var myString;
myString = "Some Text";
After you have declared the variable, you can use the name to assign the variable a
value and access the value of the variable. For example, the following code stores a
string into the myString variable and then uses it when assigning the value to the
newString variable:
var myString = "Some Text";
var newString = myString + " Some More Text";
Your variable names should describe the data stored in them so that you can easily
use them later in your program. The only rules for creating variable names is that
they must begin with a letter, $, or _ and they cannot contain spaces. Also remember
that variable names are case sensitive, so myString is different from MyString.
Data Type
JavaScript uses data types to determine how to handle data that is assigned to a
variable. The variable type determines what operations you can perform on the
variable, such as looping or executing. The following list describes the most common
types of variables you will be working with through the book:
Array: An indexed array is a series of separate distinct data items all stored
under a single variable name. Items in the array can be accessed by their zero-
based index using array[index]. The following is an example of creating a simple
array and then accessing the first element, which is at index 0.
var arr = ["one", "two", "three"];
var first = arr[0];
Object literal: JavaScript supports the capability to create and use object literals.
When you use an object literal, you can access values and functions in the object
using object.property syntax. The following example shows how to create and
access properties with an object literal:
var obj = {"name":"Brad", "occupation":"Hacker", "age":"Unknown"};
var name = obj.name;
Null: Sometimes you do not have a value to store in a variable either because it
hasn’t been created or you are no longer using it. At this time, you can set a
variable to null. Using null is better than assigning the variable a value of 0 or an
empty string "" because those might be valid values for the variable. Assigning
the variable null lets you assign no value and check against null inside your code.
var newVar = null;
Inserts and queries
The insertOne() method inserts a document into a collection. It has the following syntax:
db.collection.insertOne(
   <document>,
   { writeConcern: <document> }
)
The update() method has the following syntax:
db.collection.update(
   <query>, <update>,
   { upsert: <boolean>, multi: <boolean>, writeConcern: <document>,
     collation: <document>, arrayFilters: [ <filterdocument1>, ... ] }
)
To insert the document you can use db.post.save(document) also. If you don't
specify _id in the document then save() method will work same as insert()
method. If you specify _id then it will replace whole data of document
containing _id as specified in save() method.
Example
>db.mycol.ensureIndex({"title":1})
In ensureIndex() method you can pass multiple fields, to create index on multiple
fields.
>db.mycol.ensureIndex({"title":1,"description":-1})
The ensureIndex() method also accepts a list of options (which are optional).
background (Boolean) − Builds the index in the background so that building an index does not block other database activities. Specify true to build in the background. The default value is false.
unique (Boolean) − Creates a unique index so that the collection will not accept insertion of documents where the index key or keys match an existing value in the index. Specify true to create a unique index. The default value is false.
name (string) − The name of the index. If unspecified, MongoDB generates an index name by concatenating the names of the indexed fields and the sort order.
dropDups (Boolean) − Creates a unique index on a field that may have duplicates. MongoDB indexes only the first occurrence of a key and removes all documents from the collection that contain subsequent occurrences of that key. Specify true to create a unique index. The default value is false.
sparse (Boolean) − If true, the index only references documents with the specified field. These indexes use less space but behave differently in some situations (particularly sorts). The default value is false.
expireAfterSeconds (integer) − Specifies a value, in seconds, as a TTL to control how long MongoDB retains documents in this collection.
v (index version) − The index version number. The default index version depends on the version of MongoDB running when creating the index.
weights (document) − The weight is a number ranging from 1 to 99,999 and denotes the significance of the field relative to the other indexed fields in terms of the score.
default_language (string) − For a text index, the language that determines the list of stop words and the rules for the stemmer and tokenizer. The default value is english.
language_override (string) − For a text index, specify the name of the field in the document that contains the language to override the default language. The default value is language.
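With the modern Java driver (com.mongodb.client, 3.7 or later), the shell's ensureIndex() corresponds to createIndex(), and the options in the list above map onto IndexOptions. A minimal sketch, with the connection string and names assumed:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class IndexExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> col =
                    client.getDatabase("myDb").getCollection("mycol");
            // Equivalent of db.mycol.ensureIndex({"title": 1}) with unique and
            // background options, as described in the list above.
            String indexName = col.createIndex(Indexes.ascending("title"),
                    new IndexOptions().unique(true).background(true));
            System.out.println("Created index: " + indexName);
        }
    }
}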
Connect to Database
To connect database, you need to specify the database name, if the database
doesn't exist then MongoDB creates it automatically.
Following is the code snippet to connect to the database −
// Creating Credentials
MongoCredential credential;
credential = MongoCredential.createCredential("sampleUser", "myDb",
"password".toCharArray());
System.out.println("Connected to the database successfully");
//Creating a collection
database.createCollection("sampleCollection");
System.out.println("Collection created successfully");
}
}
On compiling, the above program gives you the following result −
Connected to the database successfully
Collection created successfully
Getting/Selecting a Collection
To get/select a collection from the database, getCollection() method of
com.mongodb.client.MongoDatabase class is used.
Following is the program to get/select a collection −
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;
// Creating Credentials
MongoCredential credential;
credential = MongoCredential.createCredential("sampleUser", "myDb",
"password".toCharArray());
// Creating a collection
System.out.println("Collection created successfully");
// Retrieving a collection
MongoCollection<Document> collection = database.getCollection("myCollection");
System.out.println("Collection myCollection selected successfully");
}
}
Insert a Document
To insert a document into MongoDB, the insertOne() method of the com.mongodb.client.MongoCollection class is used.
// Creating Credentials
MongoCredential credential;
credential = MongoCredential.createCredential("sampleUser", "myDb",
"password".toCharArray());
System.out.println("Connected to the database successfully");
// Retrieving a collection
MongoCollection<Document> collection =
database.getCollection("sampleCollection");
System.out.println("Collection sampleCollection selected successfully");
import java.util.Iterator;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;
// Creating Credentials
MongoCredential credential;
credential = MongoCredential.createCredential("sampleUser", "myDb",
"password".toCharArray());
System.out.println("Connected to the database successfully");
// Retrieving a collection
MongoCollection<Document> collection =
database.getCollection("sampleCollection");
System.out.println("Collection sampleCollection selected successfully");
while (it.hasNext()) {
System.out.println(it.next());
i++;
}
}
}
Update Document
To update a document in the collection, the updateOne() method of the com.mongodb.client.MongoCollection class is used.
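The driver snippets above are fragments of a larger program; the following is a self-contained sketch of inserting, updating and reading back a document with the modern com.mongodb.client API (connection string, database and collection names are assumptions):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class UpdateExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("myDb");
            MongoCollection<Document> coll = db.getCollection("sampleCollection");

            coll.insertOne(new Document("title", "MongoDB").append("likes", 100));

            // Update the first document whose title is "MongoDB".
            coll.updateOne(Filters.eq("title", "MongoDB"), Updates.set("likes", 150));

            Document doc = coll.find(Filters.eq("title", "MongoDB")).first();
            System.out.println(doc.toJson());
        }
    }
}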
Name Description
$eq Matches values that are equal to a specified value.
$gt Matches values that are greater than a specified value.
$gte Matches values that are greater than or equal to a specified value.
$in Matches any of the values specified in an array.
$lt Matches values that are less than a specified value.
$lte Matches values that are less than or equal to a specified value.
$ne Matches all values that are not equal to a specified value.
$nin Matches none of the values specified in an array.
Logical
Name Description
$and Joins query clauses with a logical AND returns all documents that match the
conditions of both clauses.
$not Inverts the effect of a query expression and returns documents that do not match
the query expression.
$nor Joins query clauses with a logical NOR; returns all documents that fail to match both clauses.
Bitwise
Name Description
$bitsAllClear Matches numeric or binary values in which a set of bit positions all have
a value of 0.
$bitsAllSet Matches numeric or binary values in which a set of bit positions all have
a value of 1.
$bitsAnyClear Matches numeric or binary values in which any bit from a set of bit
positions has a value of 0.
$bitsAnySet Matches numeric or binary values in which any bit from a set of bit
positions has a value of 1.
Projection Operators
Name Description
$ Projects the first element in an array that matches the query condition.
$elemMatch Projects the first element in an array that matches the specified $elemMatch
condition.
$meta Projects the document’s score assigned during $text operation.
$slice Limits the number of elements projected from an array. Supports skip and
limit slices.
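As an illustration of $slice and field projection from the Java driver (again assuming the com.mongodb.client API and hypothetical database, collection and field names):
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;
import static com.mongodb.client.model.Projections.slice;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class ProjectionExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts =
                    client.getDatabase("myDb").getCollection("posts");
            // $slice: return only the first 3 elements of the "comments" array,
            // together with the title, for the matching document.
            Document doc = posts.find(eq("title", "MongoDB"))
                    .projection(fields(include("title"), slice("comments", 3)))
                    .first();
            System.out.println(doc == null ? "no match" : doc.toJson());
        }
    }
}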