
UNIT-3

INTRODUCTION TO MONGODB
• MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability. MongoDB works on the concepts of collections and documents.
• Database: Database is a physical container for
collections. Each database gets its own set of files
on the file system. A single MongoDB server
typically has multiple databases.
• Collection: Collection is a group of MongoDB
documents. It is the equivalent of an RDBMS table.
A collection exists within a single database.
• Document
• A document is a set of key-value pairs.
Documents have dynamic schema.
• Dynamic schema means that documents in the
same collection do not need to have the same
set of fields or structure, and common fields in a
collection's documents may hold different types
of data.
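• For example (a minimal sketch; the users collection and its fields are illustrative, not from the slides), both of the following inserts are valid in the same collection even though the documents have different fields:
db.users.insertOne({ name: "Alice", age: 25 });
db.users.insertOne({ name: "Bob", email: "bob@example.com", skills: ["java", "python"] });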
• RDBMS → MongoDB
• Database → Database
• Table → Collection
• Tuple/Row → Document
• Column → Field
• Table Join → Embedded Documents
• Primary Key → Primary Key (default key _id)
{
   _id: ObjectId(7df78ad8902c),
   title: 'MongoDB Overview',
   description: 'MongoDB is no sql database',
   by: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100,
   comments: [
      {
         user: 'user1',
         message: 'My first comment',
         dateCreated: new Date(2011,1,20,2,15),
         like: 0
      },
      {
         user: 'user2',
         message: 'My second comments',
         dateCreated: new Date(2011,1,25,7,45),
         like: 5
      }
   ]
}
• Advantages of MongoDB over RDBMS
• Schema less − MongoDB is a document database in which one collection holds different documents. The number of fields, content and size of the documents can differ from one document to another.
• Structure of a single object is clear.
• No complex joins.
• Deep query-ability. MongoDB supports dynamic queries on documents using a document-based query language that's nearly as powerful as SQL.
• Tuning.
• Ease of scale-out − MongoDB is easy to scale.
• Conversion/mapping of application objects to database objects is not needed.
• Uses internal memory for storing the (windowed) working set, enabling faster access of data.
• Why Use MongoDB?
• Document Oriented Storage − Data is stored in the form of JSON-style documents.
• Index on any attribute
• Replication and high availability
• Auto-Sharding
• Rich queries
• Fast in-place updates
• Professional support by MongoDB
• Where to Use MongoDB?
• Big Data
• Content Management and Delivery
• Mobile and Social Infrastructure
• User Data Management
• Data Hub
• MongoDB supports many datatypes. Some of them are −
• String − This is the most commonly used datatype to store data. Strings in MongoDB must be valid UTF-8.
• Integer − This type is used to store a numerical value. Integers can be 32 bit or 64 bit depending upon your server.
• Boolean − This type is used to store a boolean (true/false) value.
• Double − This type is used to store floating point values.
• Min/Max keys − This type is used to compare a value against the lowest and highest BSON elements.
• Arrays − This type is used to store arrays, lists or multiple values in one key.
• Timestamp − This can be handy for recording when a document has been modified or added.
• Object − This datatype is used for embedded documents.
• Null − This type is used to store a Null value.
• Symbol − This datatype is used identically to a string; however, it's generally reserved for languages that use a specific symbol type.
• Date − This datatype is used to store the current date or time in UNIX time format. You can specify your own date and time by creating an object of Date and passing day, month, year into it.
• Object ID − This datatype is used to store the document's ID (the default _id field).
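• As an illustration (a hypothetical document assembled for this list; field names are not from the slides), a single document can combine several of these datatypes:
{
   name: "sensor-01",               // String
   reading: 42,                     // Integer
   calibrated: true,                // Boolean
   temperature: 36.6,               // Double
   tags: ["iot", "lab"],            // Array
   metadata: { unit: "celsius" },   // Object (embedded document)
   lastError: null,                 // Null
   recordedAt: new Date(),          // Date
   _id: ObjectId()                  // Object ID (generated automatically if omitted)
}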
• Creation of a Collection:
1. Automatic Collection Creation (default behavior): MongoDB automatically creates a collection when you insert a document into it.
EX:- db.users.insertOne({ name: "Alice", age: 25 });
2. Explicitly Creating a Collection:
• If you want to create a collection before inserting documents, use the createCollection() method.
EX:- db.createCollection("employees");
• This creates an empty collection named employees.
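• createCollection() can also take an options document; for instance, a capped (fixed-size) collection can be created as below (a sketch; the collection name and sizes are illustrative):
EX:- db.createCollection("logs", { capped: true, size: 1048576, max: 1000 });
// capped: fixed-size collection, size: maximum size in bytes, max: maximum number of documents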
2. Insert Methods (Adding Data)
• db.collection.insert([doc1, doc2, doc3]) − Inserts one or more documents (legacy method)
• db.collection.insertOne(doc) − Inserts one document
• db.collection.insertMany([doc1, doc2]) − Inserts multiple documents
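• A quick sketch of the two preferred methods (the documents shown are illustrative):
db.mycol.insertOne({ title: "Single document", likes: 1 });
db.mycol.insertMany([
   { title: "First document", likes: 2 },
   { title: "Second document", likes: 3 }
]);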
> db.mycol.insert([
   {
      title: "MongoDB Overview",
      description: "MongoDB is no SQL database",
      by: "tutorials point",
      url: "http://www.tutorialspoint.com",
      tags: ["mongodb", "database", "NoSQL"],
      likes: 100
   },
   {
      title: "NoSQL Database",
      description: "NoSQL database doesn't have tables",
      by: "tutorials point",
      url: "http://www.tutorialspoint.com",
      tags: ["mongodb", "database", "NoSQL"],
      likes: 20,
      comments: [
         {
            user: "user1",
            message: "My first comment",
            dateCreated: ISODate("2013-12-09T21:05:00Z"),
            like: 0
         }
      ]
   }
])
• The find() Method
• To query data from a MongoDB collection, you need to use MongoDB's find() method.
• Syntax:
• The basic syntax of the find() method is as follows −
• >db.COLLECTION_NAME.find()
• The find() method displays all the documents in a non-structured way.
• db.mycol.find()
{ "_id" : ObjectId("5dd4e2cc0821d3b44607534c"), "title" : "MongoDB Overview", "description" : "MongoDB is no SQL database", "by" : "tutorials point", "url" : "http://www.tutorialspoint.com", "tags" : [ "mongodb", "database", "NoSQL" ], "likes" : 100 }
{ "_id" : ObjectId("5dd4e2cc0821d3b44607534d"), "title" : "NoSQL Database", "description" : "NoSQL database doesn't have tables", "by" : "tutorials point", "url" : "http://www.tutorialspoint.com", "tags" : [ "mongodb", "database", "NoSQL" ], "likes" : 20, "comments" : [ { "user" : "user1", "message" : "My first comment", "dateCreated" : ISODate("2013-12-09T21:05:00Z"), "like" : 0 } ] }
• The pretty() Method:
• To display the results in a formatted way, you can use the pretty() method.
• Syntax:
• >db.COLLECTION_NAME.find().pretty()
• The findOne() Method:
• Apart from the find() method, there is the findOne() method, which returns only one document.
• Syntax:
• >db.COLLECTION_NAME.findOne()
• EX:> db.mycol.findOne({title: "MongoDB Overview"})
1.MongoDB in Aeronautical Engineering ✈️
• Use Cases:
• Flight Data Analysis: Storing and analyzing real-time
aircraft sensor data.
• Air Traffic Management: Managing large datasets from
multiple aircraft and airports.
• Satellite Communication: Storing and processing
telemetry data from satellites.
• Maintenance Logs & Predictive Maintenance: Storing
maintenance history and predicting failures.
• Why MongoDB?
• Handles real-time data streams.
• Supports geo-spatial queries for tracking aircraft.
• Stores complex aircraft telemetry data efficiently.
{
   "flight_id": "AI203",
   "aircraft": "Boeing 787",
   "altitude": 35000,
   "speed": 870,
   "location": { "lat": 40.7128, "long": -74.0060 },
   "engine_status": "Optimal",
   "timestamp": ISODate("2025-03-05T10:00:00Z")
}
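• To support the geo-spatial queries mentioned above, the location field can be indexed; the sketch below assumes the location is stored in GeoJSON form (type/coordinates) rather than the lat/long pair of the sample document, and the flights collection name is illustrative:
db.flights.createIndex({ location: "2dsphere" });
db.flights.find({
   location: {
      $near: {
         $geometry: { type: "Point", coordinates: [-74.0060, 40.7128] },
         $maxDistance: 50000   // metres
      }
   }
});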
2.MongoDB in Electrical Engineering ⚡
• Use Cases:
• Smart Grid Systems: Storing and processing
energy consumption data from smart meters.
• IoT-Based Monitoring: Collecting sensor data
from electrical devices.
• Renewable Energy Management: Analyzing
solar and wind power generation.
• Storing Smart Meter Data
{
   "meter_id": "SM12345",
   "location": "New York",
   "voltage": 220,
   "current": 15.5,
   "power_consumption_kWh": 3.2,
   "timestamp": ISODate("2025-03-05T12:30:00Z")
}
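• A sketch of how such readings could be summarised with the aggregation pipeline (the smart_meters collection name is assumed; field names follow the sample document above):
db.smart_meters.aggregate([
   { $group: { _id: "$meter_id", avg_kWh: { $avg: "$power_consumption_kWh" }, readings: { $sum: 1 } } },
   { $sort: { avg_kWh: -1 } }
]);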
UNIT-4

INTRODUCTION TO HIVE AND PIG
• The term ‘Big Data’ is used for collections of large
datasets that include huge volume, high velocity,
and a variety of data that is increasing day by day.
Using traditional data management systems, it is
difficult to process Big Data. Therefore, the Apache
Software Foundation introduced a framework
called Hadoop to solve Big Data management and
processing challenges.
• Hadoop
• Hadoop is an open-source framework to store and process Big Data in a distributed environment.
• It contains two modules, one is MapReduce and the other is the Hadoop Distributed File System (HDFS).
• MapReduce: It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
• HDFS: Hadoop Distributed File System is a part of the Hadoop framework, used to store and process the datasets. It provides a fault-tolerant file system to run on commodity hardware.
• The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used to help the Hadoop modules.
• Sqoop: It is used to import and export data between HDFS and RDBMS.
• Pig: It is a procedural language platform used to develop scripts for MapReduce operations.
• Hive: It is a platform used to develop SQL-type scripts to do MapReduce operations.
What is Hive?

• Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
• Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
• Features of Hive
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
• Architecture of Hive
• User Interface: Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).
• Meta Store: Hive chooses the respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
• HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
• Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
• HDFS or HBASE: Hadoop Distributed File System or HBASE are the data storage techniques used to store data in the file system.
• Working of Hive
• The following steps depict the workflow between Hive and Hadoop:
• 1 Execute Query: The Hive interface such as Command Line or Web UI sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
• 2 Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
• 3 Get Metadata: The compiler sends a metadata request to the Metastore (any database).
• 4 Send Metadata: The Metastore sends the metadata as a response to the compiler.
• 5 Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
• 6 Execute Plan: The driver sends the execute plan to the execution engine.
• 7 Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
• 7.1 Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
• 8 Fetch Result: The execution engine receives the results from the Data nodes.
• 9 Send Results: The execution engine sends those resultant values to the driver.
• 10 Send Results: The driver sends the results to the Hive interfaces.
• Commonly used File Formats −
• 1. TextFile format
• Suitable for sharing data with other tools
• Can be viewed/edited manually
• 2. SequenceFile
• Flat files that store binary key-value pairs
• SequenceFile offers Reader, Writer, and Sorter classes for reading, writing, and sorting respectively
• Supports Uncompressed, Record compressed (only the value is compressed) and Block compressed (both key and value compressed) formats
• 3. RCFile
• RCFile stores columns of a table in a record columnar way
• 4. ORC (Optimized Row Columnar)
• 5. AVRO (row-based storage format)
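• The storage format is chosen per table with the STORED AS clause; a minimal HiveQL sketch (table and column names are illustrative):
hive> CREATE TABLE logs_text (id INT, msg STRING) STORED AS TEXTFILE;
hive> CREATE TABLE logs_seq (id INT, msg STRING) STORED AS SEQUENCEFILE;
hive> CREATE TABLE logs_orc (id INT, msg STRING) STORED AS ORC;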
• Hive Commands
• Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and User Defined Functions (UDF).
• Hive DDL Commands
• create database
• drop database
• create table
• drop table
• alter table
• create index
• create view
• Hive DML Commands
• Select
• Where
• Group By
• Order By
• Load Data
• Join:
• o Inner Join
• o Left Outer Join
• o Right Outer Join
• o Full Outer Join
• Hive DDL Commands
1.Create Database Statement:
• A database in Hive is a namespace or a collection of
tables.
(a) hive> CREATE SCHEMA userdb;
(b) hive> SHOW DATABASES;
2. Drop database:
hive> DROP DATABASE IF EXISTS userdb;
3. Creating Hive Tables:
hive> CREATE TABLE students(id INT, name STRING);
4. Browse the tables
hive> SHOW TABLES;
5.Altering and Dropping Tables
• 1. hive> ALTER TABLE Student RENAME TO
Kafka;
• 2. hive> ALTER TABLE Kafka ADD COLUMNS (col
INT);
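• The remaining DDL commands from the list above follow the same pattern; for example (the view definition below is an illustrative sketch, not from the slides):
hive> DROP TABLE IF EXISTS students;
hive> CREATE VIEW us_employees AS SELECT * FROM Employee WHERE Address = 'US';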
• Hive DML Commands: 1) SELECTS and FILTERS
hive> SELECT E.EMP_ID FROM Employee E WHERE E.Address='US';
• 2) GROUP BY
hive> SELECT E.Address, count(*) FROM Employee E GROUP BY E.Address;
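• The Load Data and Join entries from the DML list can be sketched as follows (the file path and the Department table are hypothetical):
hive> LOAD DATA LOCAL INPATH '/home/user/students.txt' INTO TABLE students;
hive> SELECT E.EMP_ID, D.dept_name FROM Employee E JOIN Department D ON (E.dept_id = D.dept_id);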
• What is Apache Pig
• Apache Pig is a high-level data flow platform for
executing MapReduce programs of Hadoop. The
language used for Pig is Pig Latin.
• Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the corresponding results in the Hadoop Distributed File System (HDFS).
• Features of Apache Pig
• Let's see the various features of Pig technology.
1) Ease of programming: Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes this process easy. In Pig, the queries are converted to MapReduce internally.
2) Optimization opportunities: The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
3) Extensibility: Users can write their own functions (UDFs) containing the logic to execute over the data set.
4) Flexible: It can easily handle structured as well as unstructured data.
5) In-built operators: It contains various types of operators such as sort, filter and join.
• Pig Latin:
Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.
• Pig Latin Statements:
• Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as input and generates another relation as output.
• o It can span multiple lines.
• o Each statement must end with a semi-colon.
• o It may include expressions and schemas.
• o By default, these statements are processed using multi-query execution.
• Apache Pig Execution Modes
• You can run Apache Pig in two modes, namely, Local
Mode and HDFS mode.
• Local Mode
• In this mode, all the files are installed and run from your
local host and local file system. There is no need of
Hadoop or HDFS. This mode is generally used for testing
purpose.
• MapReduce Mode
• MapReduce mode is where we load or process the data
that exists in the Hadoop File System (HDFS) using
Apache Pig. In this mode, whenever we execute the Pig
Latin statements to process the data, a MapReduce job is
invoked in the back-end to perform a particular
operation on the data that exists in the HDFS.
• Apache Pig Execution Mechanisms: Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.
• Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using the Dump operator).
• Batch Mode (Script) − You can run Apache Pig in batch mode by writing the Pig Latin script in a single file with a .pig extension.
• Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.
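• In practice, the mode and mechanism are selected when launching the pig command (the script name below is illustrative):
$ pig -x local                       # Grunt shell in local mode
$ pig -x mapreduce                   # Grunt shell in MapReduce mode (the default)
$ pig -x mapreduce wordcount.pig     # batch mode: run a Pig Latin script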
• Pig Commands:
1. load: Reads data from the file system
2. store: Writes data to the file system
3. foreach: Applies expressions to each record and outputs one or more records
4. filter: Applies a predicate and removes records that do not return true
5. group/cogroup: Collects records with the same key from one or more inputs
6. join: Joins two or more inputs based on a key
7. order: Sorts records based on a key
8. distinct: Removes duplicate records
• EX:
• lines = LOAD '/user/hadoop/HDFS_File.txt' AS
(line:chararray);
• words = FOREACH lines GENERATE
FLATTEN(TOKENIZE(line)) as word;
• grouped = GROUP words BY word;
• wordcount = FOREACH grouped GENERATE
group, COUNT(words);
• DUMP wordcount;
1. Loading and Displaying Data:
• data = LOAD 'hdfs://path/to/data.txt' USING
PigStorage(',') AS (id:int, name:chararray,
age:int);
DUMP data;
2.Filtering Data:
• filtered_data = FILTER data BY age > 25;
DUMP filtered_data;
3. Grouping Data:
grouped_data = GROUP data BY age;
DUMP grouped_data;
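4. Ordering, De-duplicating and Storing Data (a sketch continuing from the data relation above; the output path is illustrative):
ordered_data = ORDER data BY age DESC;
names = FOREACH data GENERATE name;
unique_names = DISTINCT names;
STORE ordered_data INTO 'hdfs://path/to/output' USING PigStorage(',');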
