Big Data Unit 3
NoSQL vs. Relational Databases
• NoSQL databases have no fixed schema; relational databases have a fixed schema.
• NoSQL databases are typically only eventually consistent; relational databases follow the ACID properties (Atomicity, Consistency, Isolation, Durability).
• NoSQL databases support only simple transactions; relational databases also support complex transactions with joins.
• NoSQL databases are used to handle data arriving at high velocity; relational databases handle data arriving at low velocity.
• NoSQL data arrives from many locations; data in a relational database arrives from one or a few locations.
• NoSQL databases can manage structured, semi-structured, and unstructured data; relational databases manage only structured data.
• NoSQL databases have no single point of failure; relational databases have a single point of failure, mitigated by failover.
• NoSQL databases can handle big data at very high volume; relational databases are used to handle moderate volumes of data.
• NoSQL databases have a decentralized structure; relational databases have a centralized structure.
Brief History of NoSQL Databases
• 1998 - Carlo Strozzi used the term NoSQL for his lightweight, open-source relational database
• 2000 - Graph database Neo4j is launched
• 2004 - Google BigTable is launched
• 2005 - CouchDB is launched
• 2007 - The research paper on Amazon Dynamo is released
• 2008 - Facebook open-sources the Cassandra project
• 2009 - The term NoSQL is reintroduced
Features of NoSQL
Non-relational
• NoSQL databases never follow the relational model
• They never provide tables with flat, fixed-column records
• They work with self-contained aggregates or BLOBs
• They don't require object-relational mapping or data normalization
• They have no complex features such as query languages, query planners, referential-integrity joins, or ACID guarantees
Schema-free
[Figure: example records in a schema-free store: {Id: 1, Name: Alice, Age: 18} and {Id: 3, Name: Chess, Type: Group}; records in the same store need not share the same fields]
Column-Store Databases
• Column stores are excellent at compression and are therefore efficient in terms of storage: you can reduce disk resources while holding massive amounts of information in a single column.
• Since all of the values for a column are stored together, aggregation queries are quite fast, which is important for projects that must answer large numbers of queries in a small amount of time.
• Scalability is excellent with column-store databases. They can be expanded nearly infinitely and are often spread across large clusters of machines, sometimes numbering in the thousands, which also makes them well suited to massively parallel processing (MPP).
• Load times are similarly excellent: you can load a billion-row table in a few seconds, so you can load and query nearly instantly.
• They offer a large amount of flexibility, as columns do not have to resemble one another: you can add new and different columns without disrupting the whole database. That said, inserting a completely new record requires writing to every column, so writes are comparatively expensive.
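A minimal Python sketch of the difference between row-oriented and column-oriented layouts, and why per-column aggregation is cheap (the table and field names are invented for illustration):

# Row-oriented layout: all fields of one record live together.
rows = [
    {"id": 1, "name": "Alice", "age": 18},
    {"id": 2, "name": "Bob", "age": 21},
    {"id": 3, "name": "Carol", "age": 25},
]

# Column-oriented layout: all values of one field live together.
columns = {
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age": [18, 21, 25],
}

# A row store must touch every record to aggregate one field...
avg_age_row = sum(r["age"] for r in rows) / len(rows)

# ...while a column store scans one contiguous (and compressible) list.
avg_age_col = sum(columns["age"]) / len(columns["age"])

assert avg_age_row == avg_age_col  # same result, very different I/O pattern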
The CAP Theorem
The limitations of distributed databases can be described by the so-called CAP theorem:
Consistency: every node always sees the same data at any given instant (i.e., strict consistency)
Availability: every request to a non-failing node receives a response
Partition tolerance: the system continues to operate despite network partitions
CAP theorem: any distributed database with shared data can have at most two of the three desirable properties C, A, and P.
The CAP Theorem (Cont'd)
Let us assume two nodes on opposite sides of a network partition:
[Figure: an update to Webpage-A is applied on one side of the partition but cannot propagate to the other, so the two nodes hold different versions of the page]
Eventual Consistency: A Main Challenge
But what if the client accesses the data from different replicas?
[Figure: after an update to Webpage-A, some replicas hold the new version and others still hold the old one, so different clients may read different copies until replication converges]
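A minimal Python sketch of the problem (the replica names and the last-writer-wins reconciliation are assumptions for illustration):

# Two replicas of Webpage-A, currently in sync.
replicas = {"node_a": "Webpage-A v1", "node_b": "Webpage-A v1"}

# An update reaches node_a while a partition blocks replication to node_b.
replicas["node_a"] = "Webpage-A v2"

# A client now sees different data depending on which replica it reads.
print(replicas["node_a"])  # Webpage-A v2
print(replicas["node_b"])  # Webpage-A v1 (stale)

# Once the partition heals, replication converges the copies
# (modeled here as last-writer-wins).
replicas["node_b"] = replicas["node_a"]
assert replicas["node_a"] == replicas["node_b"]  # eventually consistent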
Q/A
Which of the following is a characteristic of NoSQL databases?
a) They use a fixed schema for data storage.
b) They are only suitable for small-scale applications.
c) They provide ACID (Atomicity, Consistency, Isolation, Durability) transactions.
d) They offer flexible and scalable data models.
Q/A
Which type of data model is commonly used in NoSQL databases?
a) Relational model
b) Document model
c) Entity-relationship model
d) Hierarchical model
Q/A
Which NoSQL database is optimized for handling large graphs and
complex relationships?
a) Cassandra
b) Redis
c) CouchDB
d) Neo4j
Ans: d) Neo4j
Q/A
Which of the following is an example of a NoSQL database?
a) MySQL
b) PostgreSQL
c) MongoDB
d) Oracle Database
Ans: c) MongoDB
Q/A
Which NoSQL database is suitable for storing semi-structured or
unstructured data, such as JSON documents?
a) MongoDB
b) Cassandra
c) CouchDB
d) Redis
Q/A
Which type of NoSQL database is suitable for storing hierarchical
data?
a) Document database
b) Columnar database
c) Key-value store
d) Graph database
Q/A
Which property of NoSQL databases enables them to handle large
amounts of data and high traffic loads?
a) ACID compliance
b) Strong data consistency
c) Horizontal scalability
d) Schema flexibility
Q/A
Which property of NoSQL databases allows for distributed data
storage across multiple servers?
a) ACID compliance
b) Strong data consistency
c) Horizontal scalability
d) Schema flexibility
HBase vs. RDBMS
• HBase has wide, sparsely populated tables; an RDBMS contains thin tables.
• HBase is well suited for OLAP systems; an RDBMS is well suited for OLTP systems.
• HBase reads only the relevant data from the database; an RDBMS retrieves one row at a time and so may read unnecessary data when only part of a row is required.
• Structured and semi-structured data can be stored and processed using HBase; only structured data can be stored and processed using an RDBMS.
• HBase enables aggregation over many rows and columns; in an RDBMS, aggregation is an expensive operation.
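For instance, wide, sparse tables can be created and queried from the HBase shell as sketched below (the table, column family, and values are invented for illustration):

create 'users', 'info'                       # table with one column family
put 'users', 'row1', 'info:name', 'Alice'
put 'users', 'row1', 'info:age', '18'
put 'users', 'row2', 'info:name', 'Bob'      # row2 has no 'info:age' cell: rows may be sparse
get 'users', 'row1'                          # random read of a single row
scan 'users'                                 # scan the whole table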
HBase vs. Hive
[Table: feature-by-feature comparison of HBase and Hive; rows not recovered]
Q/A
Which of the following is NOT a characteristic of HBase?
a) High scalability
b) Strong data consistency
c) Fault tolerance
d) Fast random read and write access
Q/A
What is the primary data model used in HBase?
a) Key-value model
b) Document model
c) Relational model
d) Graph model
Q/A
Which component is responsible for handling read and write requests
from clients in HBase?
a) HMaster
b) RegionServer
c) ZooKeeper
d) DataNode
Ans: b) RegionServer
Q/A
What is the role of ZooKeeper in HBase?
a) Managing the Hadoop Distributed File System (HDFS)
b) Coordinating and synchronizing distributed processes in HBase
c) Serving client requests and managing tables
d) Storing and serving data in HBase
Q/A
Which component in the HBase architecture provides distributed
coordination and synchronization services?
a) HMaster
b) RegionServer
c) ZooKeeper
d) DataNode
Ans: c) ZooKeeper
Q/A
What is a Region in HBase?
Ans: A Region is a contiguous, sorted range of a table's rows, served by a single RegionServer.
Q/A
Which consistency model is followed by ZooKeeper?
a) Eventual consistency
b) Strong consistency
c) Sequential consistency
d) Event-driven consistency
Q/A
Which of the following operations can be performed on a znode in
ZooKeeper?
a) Read and write data
b) Query with SQL-like queries
c) Create and delete tables
d) Execute distributed data processing tasks
Flume Sink
• The sink consumes the data (events) from the channels and delivers it to its destination.
• The destination of the sink might be another agent or a centralized store.
• Finally, the sink stores the data in centralized stores such as HBase and HDFS.
• Example: the HDFS sink. Flume supports the following sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, and MorphlineSolr.
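As an illustration, a minimal Flume agent configuration that wires a source through a channel into an HDFS sink might look like this (the agent name, source type, and path are assumptions):

# flume.conf: agent a1 = netcat source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.channel = c1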
Apache Sqoop
• Sqoop is a tool for transferring data between RDBMSs (such as MySQL and Oracle) and Hadoop (Hive, HDFS, HBase, etc.).
• It is used to import data from an RDBMS into Hadoop and to export data from Hadoop back to an RDBMS.
• Sqoop is one of the top projects of the Apache Software Foundation and works brilliantly with relational databases such as Teradata, Netezza, Oracle, MySQL, and Postgres.
• In Sqoop, developers just need to specify the source and the destination; the rest of the work is done by the Sqoop tool.
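For example, an import from MySQL into HDFS might look like the following (the host, database, table, credentials, and target directory are placeholders):

$ sqoop import \
    --connect jdbc:mysql://dbhost/salesdb \
    --username dbuser -P \
    --table customers \
    --target-dir /user/hadoop/customers \
    -m 4

Here -P prompts for the database password and -m sets the number of parallel map tasks.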
Sqoop Import
• The Sqoop import tool imports each table of the RDBMS into Hadoop; each row of the table is treated as a record in HDFS.
• All records are stored as text data in text files or as binary data in Avro and Sequence files.
Sqoop Export
• The Sqoop export tool exports HDFS files back to RDBMS tables; the records in the HDFS files become the rows of a table.
• They are read, parsed into a set of records, and delimited with a user-specified delimiter.
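A matching export back to the database might look like this (again, the connection string, table, and directory are placeholders):

$ sqoop export \
    --connect jdbc:mysql://dbhost/salesdb \
    --username dbuser -P \
    --table customer_summary \
    --export-dir /user/hadoop/summary \
    --input-fields-terminated-by ','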
Q/A
Which of the following databases can be used as a source or target for
data transfer with Sqoop?
a) MySQL
b) Oracle
c) PostgreSQL
d) All of the above
Q/A
Which command is used to import data from a relational database
into Hadoop using Sqoop?
a) sqoop export
b) sqoop import
c) sqoop connect
d) sqoop load
Apache Pig
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
• Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs.
• Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
– Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprising multiple interrelated data transformations are explicitly encoded as data-flow sequences, making them easy to write, understand, and maintain.
– Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
– Extensibility. Users can create their own functions to do special-purpose processing.
Running Pig
• You can execute Pig Latin statements:
– Using the Grunt shell or the command line:
$ pig ... - Connecting to ...
grunt> A = load 'data';
grunt> B = ... ;
– In local mode or Hadoop MapReduce mode:
$ pig myscript.pig (command line, batch, MapReduce mode)
$ pig -x local myscript.pig (command line, batch, local mode)
– Either interactively or in batch
Program/flow organization
• A LOAD statement reads data from the file system.
• A series of "transformation" statements process the data.
• A STORE statement writes output to the file system; or, a DUMP
statement displays output to the screen.
Interpretation
• In general, Pig processes Pig Latin statements as follows:
– First, Pig validates the syntax and semantics of all statements.
– Next, if Pig encounters a DUMP or STORE, Pig will execute the
statements.
Simple Examples
-- Load three-column data, keep rows with x > 5, print, project, store:
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
-----------------------------------------------------------------------------
-- The same pipeline, storing both the intermediate and the final result:
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
STORE B INTO 'output1';
C = FOREACH B GENERATE y, z;
STORE C INTO 'output2';
Analysing using MapReduce
Limitations of MapReduce
• Analysis typically needs to be written in Java.
• Joins need to be written in Java, which makes them longer and more error-prone.
• Projections and filters need custom code, which makes the whole process slower.
• A job is divided into many stages when using MapReduce, which makes it difficult to manage.
Question: analyzing how many tweets are stored per user in the given tweet tables
Steps
• STEP 1 - First of all, Twitter imports the Twitter tables (i.e., the user table and the tweet table) into HDFS.
• STEP 2 - Then Apache Pig loads (LOAD) the tables into the Apache Pig framework.
• STEP 3 - Then it joins and groups the tweet table and the user table using the COGROUP command. This produces the inner bag data type.
Examples of inner bags produced:
(1,{(1,Jay,xyz),(1,Jay,pqr),(1,Jay,lmn)})
(2,{(2,Ellie,abc),(2,Ellie,vxy)})
(3,{(3,Sam,stu)})
• STEP 4 - Then the tweets are counted per user using the COUNT command, so that the total number of tweets per user can be calculated easily.
Examples of tuples produced as (id, tweet count):
(1, 3)
(2, 2)
(3, 1)
• STEP 5 - Then the result is joined with the user table to attach the user name to the computed count.
Examples of tuples produced as (id, name, tweet count):
(1, Jay, 3)
(2, Ellie, 2)
(3, Sam, 1)
• STEP 6 - Finally, this result is stored back in HDFS. A Pig Latin sketch of the whole pipeline follows.
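Putting the steps together, a minimal Pig Latin sketch of the pipeline (the file names, schemas, and relation names are assumptions for illustration):

-- STEP 1-2: load the two tables from HDFS
users = LOAD 'users' AS (uid:int, name:chararray);
tweets = LOAD 'tweets' AS (uid:int, tweet:chararray);

-- STEP 3: group each user with the bag of their tweets
grouped = COGROUP users BY uid, tweets BY uid;

-- STEP 4-5: count each user's tweets and attach the user name
counts = FOREACH grouped GENERATE group AS uid,
    FLATTEN(users.name) AS name,
    COUNT(tweets) AS tweet_count;

-- STEP 6: store the result back into HDFS
STORE counts INTO 'tweets_per_user';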
More examples from Cloudera
• http://www.cloudera.com/wp-content/uploads/2010/01/IntroToPig.pdf
Q/A
What is the primary purpose of Pig Latin, the language used in Apache
Pig?
a) Real-time data processing
b) Data storage and retrieval
c) Data integration and ETL (Extract, Transform, Load)
d) Data visualization and reporting
Q/A
What is the key concept in Pig Latin for representing and manipulating
data?
a) Tables
b) Relations
c) Schemas
d) Dataflows
Q/A
Which of the following is NOT a basic data type in Pig Latin?
a) Integer
b) Float
c) Boolean
d) Character
Q/A
Which programming language is commonly used to write Hive
queries?
a) Python
b) Java
c) SQL
d) C++
Q/A
What is the primary purpose of Hive?
a) Real-time data processing
b) Data storage and retrieval
c) Data integration and ETL (Extract, Transform, Load)
d) Data visualization and reporting
Q/A
Which programming language is commonly used for developing Spark
applications?
a) Java
b) Python
c) C++
d) JavaScript
Q/A
Which of the following is NOT a component of the Apache Spark
architecture?
a) Spark Core
b) Spark SQL
c) Spark Streaming
d) Spark Machine Learning
Q/A
Which of the following is a supported data source in Spark?
a) Hadoop Distributed File System (HDFS)
b) MySQL
c) Amazon S3
d) All of the above
Q/A
Which of the following is NOT a machine learning library available in
Spark?
a) Spark MLlib
b) Spark GraphX
c) Spark ML
d) Spark TensorFlow
References:
• https://www.edureka.co/blog/big-data-tutorial
• https://www.coursera.org/learn/big-data-introduction?specialization=big-data2
• https://www.coursera.org/learn/fundamentals-of-big-data
• Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial Services, Dreamtech Press
• Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley Publications
• Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R, Seifedine Kadry, Amir H. Gandomi, Wiley Publications
THANK YOU