Module 1

BIG DATA ANALYTICS

(DSE 3264)
Faculty Name:
Shavantrevva S Bilakeri (SSB)
Assistant Professor, Dept of DSCA, MIT.
Phone No: +91 8217228519 E-mail: [email protected]
Course Objectives
 To be familiar with an overview of the Apache Hadoop ecosystem.

 To understand the storage mechanisms, architecture, features, and execution modes of big data tools (Pig, Hive, HBase, Spark).

 To be proficient in data analysis and its implications for structured, unstructured, and semi-structured data.

 To be proficient with Big Data frameworks and use cases.

Course Outcomes

 Identify Big Data and its business implications.

 Explore, manage, and analyze job executions in local and cluster-based Hadoop environments.

 Apply and perform machine learning techniques using Scala and Python.
Syllabus
Introduction to Big Data: evolution, structuring elements, big data analytics, distributed and parallel computing for big data, life cycle of big data, cloud computing and big data, in-memory computing technology for big data, Big Data stack, layer structure, Big Data layout.

Hadoop: ecosystem, Hadoop Distributed File System (HDFS), MapReduce: MapReduce framework, optimizing MapReduce jobs, MapReduce applications, understanding YARN architecture.

Big Data Tools: “PIG”: history, features, architecture, components, data models, operators, running and execution modes, analysing data with Pig, Pig libraries, processing structured data using Pig.
Syllabus
Big Data Tools: “HBASE”: history, characteristics, features, architecture, storage mechanism, HDFS versus HBase, HBase query writing.

Big Data Tools: “Hive”: brief history of Hive, data types in Hive, execution modes, writing and executing Hive queries.

Big Data Tools: “Apache Spark”: Spark architecture, components, features, Spark vs Hadoop, RDD, need for RDD, Spark memory management and fault tolerance, Spark’s Python and Scala shells, programming with RDD: RDD operations, passing functions to Spark, common transformations and actions.
Contents
 Introduction to Big Data
 Types of Big Data
 Big Data characteristics
 Challenges
 Data Generators (Fields of Big Data)
 Traditional vs Big Data approach
 Life Cycle
 Case Study.
Big Data Analytics
[Process flow: Data analytics → Insight into data → Using tools and processes → Representations, visualization, decision making]
• Big Data is a massive amount of datasets that cannot be stored, processed, or analyzed using traditional tools.
 The term big data was first used to refer to increasing data volumes in the mid-1990s.
 In 2001, Doug Laney, then an analyst at consultancy Meta Group Inc., expanded the definition of big data:
• Volume of data being stored and used by organizations.
• Variety of data being generated by organizations.
• Velocity, or speed, at which that data was being created and updated.
Why Big Data Analytics
Small Data vs. Big Data
1. Data type: Small data is structured; big data can be structured, unstructured, or semi-structured.
2. Scale: Small data is measured in megabytes (1 MB = 1,024 KB, approximately 1 million bytes), gigabytes (1 GB = 1,024 MB, approximately 1 billion bytes), and terabytes (1 TB = 1,024 GB, approximately 1 trillion bytes). Big data is measured in petabytes (1 PB = 1,024 TB, or 1 million gigabytes), exabytes (1 EB = 1,024 PB, or 1 billion gigabytes), zettabytes (1 ZB = 1,024 EB, or 1 trillion gigabytes), and yottabytes (1 YB = 1,024 ZB, or 1 quadrillion gigabytes).
3. Growth: Small data increases gradually (slowly); big data increases exponentially/rapidly.
4. Presence: Small data is locally present; big data is globally present.
5. Architecture: Small data is centralized; big data is distributed.
6. Frameworks: Small data uses Oracle, SQL Server; big data uses Hadoop, Spark, Cassandra.
7. Nodes: Small data runs on a single node; big data runs on multiple nodes.
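As a quick reference for the storage units in row 2 above, here is a small illustrative Python sketch (not part of the original slides) that prints each unit in bytes using powers of 1,024:

```python
# Each storage unit is 1,024 times the previous one.
units = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

for power, unit in enumerate(units, start=1):
    size_in_bytes = 1024 ** power
    print(f"1 {unit} = {size_in_bytes:,} bytes")

# Sample output line: 1 PB = 1,125,899,906,842,624 bytes (roughly 1 million GB)
```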


TYPES OF DATA
Structured
•Data that resides in a fixed field within a record.
•It is the type of data most familiar from everyday life, e.g., a birthday or an address.
•A certain schema binds it, so all the data has the same set of properties. Structured data is also called relational data.
•It is split into multiple tables to enhance the integrity of the data by creating a single record to depict an entity.
•Relationships are enforced by the application of table constraints.
Unstructured
•Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of rules.
•Its arrangement is unplanned and haphazard.
•Photos, videos, text documents, and log files can generally be considered unstructured data.
•Even though the metadata accompanying an image or a video may be semi-structured, the actual data being dealt with is unstructured.
•Additionally, unstructured data is also known as “dark data” because it cannot be analyzed without the proper software tools.
Semi-Structured
•Semi-structured data is not bound by any rigid schema for data storage and handling.
•The data is not in the relational format and is not neatly organized into rows and columns like that in a spreadsheet.
•However, there are some features like key-value pairs that help in discerning the different entities from each other.
•Since semi-structured data doesn’t need a structured query language, it is commonly called NoSQL data.
•This type of information typically comes from external sources such as social media platforms or other web-based data feeds.
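To make the contrast concrete, here is a minimal illustrative Python sketch (the record and field names are hypothetical, not from the slides) showing the same entity once as a fixed-schema row and once as a semi-structured, key-value style document:

```python
import json

# Structured: every record follows one fixed schema (id, name, birthday, city).
structured_row = ("C001", "Aditi", "2001-05-14", "Manipal")

# Semi-structured: key-value pairs with nesting and optional fields; no rigid schema.
semi_structured_doc = {
    "id": "C001",
    "name": "Aditi",
    "contacts": {"email": "aditi@example.com"},  # nested object
    "interests": ["cricket", "music"],           # optional list, may be absent elsewhere
}

print(structured_row)
print(json.dumps(semi_structured_doc, indent=2))
```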
Data Generated on the Internet – Per Minute
Examples of Big Data
Big data is the clustered management of different forms of data:
 generated by various devices (Android, iOS, etc.),
 applications (music apps, web apps, game apps, etc.),
 actions (searching through a search engine, navigating through similar types of web pages, etc.).
Challenges of Big Data
 Rapid data growth: Data growing at such a high velocity makes it difficult to extract insights from it, and there is no 100% efficient way to filter out the relevant data.
 Storage: Generating such a massive amount of data needs space for storage, and organizations face challenges in handling such extensive data without suitable tools and technologies.
 Unreliable data: It cannot be guaranteed that the big data collected and analyzed is totally (100%) accurate; redundant, contradictory, or incomplete data are challenges that remain within it.
 Data security: Firms and organizations storing such massive (user) data can become a target for cybercriminals, and there is a risk of the data being stolen. Hence, encrypting such colossal data is also a challenge for firms and organizations.
Fields of data that come under the umbrella
of Big Data: (Generators of Data)
 Black box data: Data collected from private and government helicopters, airplanes, and jets. It includes captures of flight crew voices and separate recordings of the microphones and earphones, etc.
 Stock exchange data: Information about the 'buy' and 'sell' decisions made on shares of different companies.
 Social media data: Information about social media activities, including posts submitted by millions of people worldwide.
 Transport data: Vehicle models, capacity, distance (from source to destination), and the availability of different vehicles.
 Search engine data: A wide variety of unprocessed information retrieved from search engine databases.
How can we classify any data as Big Data?
 Based on the 5 V's.
Big Data Characteristics
• Volume: relates to the size of the data.
• Variety: the data comprises a variety of types and formats.
• Velocity: refers to the speed at which data is generated.
• Veracity: the quality of the data captured, which can vary greatly, affecting accurate analysis.
• Value: the worth that can be extracted from the data.
Traditional Approach: Data storage and processing
How do we store and process this Big Data?
Store – break the file into smaller blocks: 300 MB = 128 MB, 128 MB, 44 MB.
Store the blocks on different nodes – when a node breaks down, it is still easy to fetch the file.
Processing – MapReduce.
Data analysis.
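A minimal sketch of the block splitting described above, assuming the common 128 MB HDFS block size (illustrative only):

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the list of block sizes a file of 'file_size_mb' would be cut into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

# A 300 MB file becomes three blocks, each stored on a different node.
print(split_into_blocks(300))  # [128, 128, 44]
```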
Use case of Big Data Analysis – Gaming

• Designers analyze gaming data to see at which stage customers pause, restart, quit, etc.
• This analysis helps designers enhance the storyline of games.
• Improve the “user experience”.
• Reduce the churn rate.


Use case of Big Data Analysis: Predict Hurricane’s landfall

• Data is processed and analyzed accurately to predict the hurricane's landfall.
Big data Technologies/ Classes
Big Data?
Big Data Layout

• Apache Hadoop – Apache
• MapReduce – Google
• HDFS (Hadoop Distributed File System) – Apache
• Hive – Facebook
• Pig – Yahoo
Figure: Big Data layout
Apache Hadoop
Apache Hadoop is one of the main supportive elements in Big Data technologies.
It simplifies the processing of large amounts of structured or unstructured data in an inexpensive manner.
Hadoop is an open-source project from Apache that has been continuously improved over the years.
Hadoop is basically a set of software libraries and frameworks for managing and processing big amounts of data, from a single server to thousands of machines.
It provides an efficient and powerful error-detection mechanism based on the application layer rather than relying upon hardware.
In December 2011, Apache released Hadoop 1.0.0; more information and an installation guide can be found in the Apache Hadoop documentation.
 Hadoop is not a single project but includes a number of other technologies in it.
Hadoop
Main components:
1. HDFS (Hadoop Distributed File System)
2. YARN (Yet Another Resource Negotiator)
3. MapReduce
Includes several additional modules: Hive, Pig, and HBase.

Key features:
• Distributed storage
• Scalability
• Fault tolerance
• Data locality
• High availability
• Flexible data processing
• Data integrity
• Data compression
HDFS (Hadoop Distributed File System)

 HDFS is a Java-based file system used to store structured or unstructured data over large clusters of distributed servers.
 The data stored in HDFS has no restriction or rule applied to it; the data can be either fully unstructured or purely structured.
 In HDFS, the work of making data meaningful is done by the developer's code only.
 The Hadoop Distributed File System provides a highly fault-tolerant environment with deployment on low-cost hardware machines.
 HDFS is now a part of the Apache Hadoop project; more information and an installation guide can be found in the Apache HDFS documentation.
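To get a feel for storing data in HDFS, here is a minimal sketch that drives the standard `hdfs dfs` command-line tool from Python (the paths and file name are hypothetical, and a working Hadoop installation on the PATH is assumed):

```python
import subprocess

# Create a directory in HDFS and copy a local file into it (example paths only).
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/student/input"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local_data.txt", "/user/student/input/"], check=True)

# List the directory; HDFS itself splits large files into blocks and
# replicates them across DataNodes for fault tolerance.
subprocess.run(["hdfs", "dfs", "-ls", "/user/student/input"], check=True)
```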
MapReduce

MapReduce was introduced by Google and has mapper and reducer modules.
It helps analyze web logs and create large web search indexes.
It is basically a framework (programming model) for writing applications that process a large amount of structured or unstructured data over the web.
 MapReduce takes the query and breaks it into parts to run on multiple nodes.
 Through distributed query processing, it makes it easy to maintain a large amount of data by dividing the data across several different machines.
 Hadoop MapReduce is a software framework for easily writing applications that manage large data sets in a highly fault-tolerant manner.
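As an illustration of the mapper/reducer split, here is a classic word-count sketch in Python, in the style used with Hadoop Streaming (a hypothetical example, not from the slides; in practice the two functions would live in separate mapper and reducer scripts passed to the streaming jar via its -mapper and -reducer options):

```python
import sys

def mapper():
    """Map phase: emit (word, 1) for every word read from standard input."""
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    """Reduce phase: input arrives sorted by word, so counts can be summed per key."""
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.strip().split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")
```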
HIVE
Hive was originally developed by Facebook; it has since been made open source.
 Hive works something like a bridge between SQL and Hadoop; it is basically used to run SQL-style queries on Hadoop clusters.
Apache Hive is basically a data warehouse that provides ad-hoc queries, data summarization and analysis of huge data sets stored in Hadoop-compatible file systems.
Hive provides an SQL-like query language called HiveQL for working with the huge amounts of data stored in Hadoop clusters.
In January 2013, Apache released Hive 0.10.0.
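A small illustrative sketch of HiveQL in action (the `page_views` table and its columns are hypothetical), launched from Python through the Hive CLI's `-e` option:

```python
import subprocess

# HiveQL looks just like SQL; Hive translates it into jobs on the Hadoop cluster.
query = """
SELECT country, COUNT(*) AS visits
FROM page_views
GROUP BY country
ORDER BY visits DESC
LIMIT 10;
"""

# 'hive -e' runs a quoted query string directly from the command line.
subprocess.run(["hive", "-e", query], check=True)
```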
Pig
Pig was introduced by Yahoo and was later made fully open source.
It also provides a bridge to query data over Hadoop clusters, but unlike Hive it uses a scripting approach (Pig Latin) to make Hadoop data accessible to developers and business users.
Apache Pig provides a high-level programming platform for developers to process and analyse Big Data using user-defined functions and programming effort.
In January 2013, Apache released Pig 0.10.1, which is defined for use with Hadoop 0.10.1 or later releases.
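A minimal Pig Latin sketch (the log file and field names are hypothetical), written out from Python and run in Pig's local execution mode with the `-x local` flag:

```python
import subprocess

# Pig Latin script: load a comma-separated log, group by user, count hits per user.
pig_script = """
logs    = LOAD 'access_log.txt' USING PigStorage(',') AS (user:chararray, url:chararray);
by_user = GROUP logs BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(logs) AS hits;
DUMP counts;
"""

with open("count_hits.pig", "w") as f:
    f.write(pig_script)

# '-x local' executes against the local file system instead of a Hadoop cluster.
subprocess.run(["pig", "-x", "local", "count_hits.pig"], check=True)
```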
Traditional vs Big Data
Schema / hard-coded / SQL / small data:
• Online transactions and quick updates.
• Schema-based DB (hard-coded schema attachment).
• Structured data.
• Uses SQL for data processing.
• Maintains relationships between elements.

Schema-less / NoSQL / big data:
• Migration is easy.
• Schema-less DB.
• Stores unstructured, semi-structured, or even fully structured data.
• Stores a huge amount of data and does not maintain relationships between elements.
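The difference can be illustrated with a small self-contained Python sketch (table and field names are made up): the same record stored once in a fixed-schema SQL table and once as a schema-less document that can carry extra fields without any migration:

```python
import sqlite3, json

# Schema-based: the table structure is declared up front and every row must fit it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?, ?)", ("U1", "Ravi", "Udupi"))
print(conn.execute("SELECT * FROM users").fetchall())

# Schema-less: each document carries its own keys; new fields need no schema change.
document = {"id": "U1", "name": "Ravi", "city": "Udupi", "devices": ["android", "web"]}
print(json.dumps(document))
```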
Traditional BI Vs. Big Data Analytics
• Working on live incoming data, which can be input from an ever-changing scenario, cannot be handled by the traditional approach.
• In big data analytics, the live flow of data is captured and the analysis is done on it.
• Efficiency increases when the data to be analyzed is large.
Traditional vs Big data Approaches
What is Changing in the Realms of Big Data?
Life Cycle of Data
1) The data is analyzed by knowledge experts, and their expertise is applied to the development of an application.
2) After the analysis and the application, the data is streamed and a data log is created for the acquisition of data.
3) The data is mapped and clustered together on the data log.
4) The clustered data from the data acquisition is then aggregated by applying various aggregation algorithms.
5) The integrated data again goes for analysis.
6) The complete steps are repeated until the desired, expected output is produced.
Big Data Technology Stack
Design of logical layers in data processing
Various Data Storage and Usage, Tools
Various Data Storage and Usage, Tools
Components Classification in Hadoop
Components Classification
Hadoop Ecosystem
Hadoop 1 vs Hadoop 2 [MRV1 vs MRV2]
Hadoop 1 vs Hadoop 2 [MRV1 vs MRV2]
Hadoop V1 vs Hadoop V2
Hadoop Technology
Why Use Hadoop?
MapReduce:
MapReduce
Phases of MapReduce: Word Count
MapReduce: Case Study (Paper Correction and Identifying the Topper)
Map: parallelism; Reduce: grouping
Total = 20 papers; Time = 1 min/paper
Without MPP (massively parallel processing):
 Correcting 20 papers = 20 minutes
With MPP/MapReduce:
 Divide the task into 4 groups / 4 districts (D1, D2, D3, D4) [dividing the task – Mapper class]
 Assign 5 papers to each district.
 Fetch the “district topper”: (D1 = T1, D2 = T2, D3 = T3, D4 = T4) = 4 toppers; 5 minutes
 Fetch the “state topper” from T1, T2, T3, T4 = topper; 4 minutes [aggregation – Reducer class]
 Total job “fetch the topper” = 9 minutes

 Summary/technicality of MapReduce: time is reduced drastically in the case of an MPP application.
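A tiny self-contained Python simulation of this case study (the marks are made up for illustration): the map step finds each district's topper independently, and the reduce step aggregates the four district toppers into the state topper.

```python
# Hypothetical marks for 4 districts (the mapper groups), 5 papers each.
papers = {
    "D1": {"s11": 72, "s12": 88, "s13": 65, "s14": 91, "s15": 79},
    "D2": {"s21": 84, "s22": 90, "s23": 58, "s24": 76, "s25": 69},
    "D3": {"s31": 95, "s32": 61, "s33": 73, "s34": 80, "s35": 67},
    "D4": {"s41": 70, "s42": 82, "s43": 89, "s44": 77, "s45": 93},
}

# Map phase (parallelism): each district independently finds its own topper.
district_toppers = {d: max(scores.items(), key=lambda kv: kv[1])
                    for d, scores in papers.items()}
print(district_toppers)               # e.g. D1 -> ('s14', 91), D3 -> ('s31', 95), ...

# Reduce phase (grouping/aggregation): combine the 4 district toppers into one result.
state_topper = max(district_toppers.values(), key=lambda kv: kv[1])
print("State topper:", state_topper)  # ('s31', 95)
```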
