BDA Module-1
HADOOP
1. Introduction to big data
2. What is Big Data?
3. Characteristics of Big Data (V's in Big Data)
4. Big Data analytics
5. Hadoop architecture / ecosystem
6. Challenges in Big Data
7. CAP theorem
8. Web analytics
9. Industry applications of Big Data
10. Benefits of Big Data analytics
11. Tools used in Big Data analytics
1. INTRODUCTION TO BIG DATA
• The following are selected key terms and their meanings, which are essential for
understanding the topics of Big Data:
h) Table: Refers to a presentation of data arranged in rows and columns (row fields and
column fields).
j) Name-Value Pair: Refers to a construct in which a field consists of a name and the
corresponding value that follows it.
k) Key-Value Pair: Refers to a construct in which a field is the key, which pairs with the
corresponding value or values that follow the key (see the sketch after this list).
l) Database Administration (DBA): Refers to the function of regularly managing and
maintaining Database Management System (DBMS) software.
m) Data Warehouse: Refers to sharable data, data stores and databases in an enterprise.
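A minimal sketch in Python of the two constructs above, with made-up field names and values for illustration: a name-value pair is a record that explicitly carries the field name alongside its value, while a key-value pair maps a key to one or more values:

# Name-value pairs: each field explicitly carries its own name and its value.
name_value_pairs = [
    {"name": "city", "value": "Bengaluru"},
    {"name": "population", "value": 13600000},
]

# Key-value pairs: the key identifies the field and maps to a value
# (or to several values, as in the second entry).
key_value_pairs = {
    "city": "Bengaluru",
    "pincodes": ["560001", "560002", "560003"],  # one key, many values
}

for pair in name_value_pairs:
    print(pair["name"], "=", pair["value"])

for key, value in key_value_pairs.items():
    print(key, "->", value)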
2. WHAT IS BIG DATA?
• Definition of data:
“Data is information, usually in the form of facts or statistics that one can analyze or use
for further calculations.”
Example: Data generated from applications like Snapchat, Instagram, Facebook, etc.
• Definition of web data:
“Web data is the data present on web servers in the form of text, images, videos, audios
and multimedia files for web users.”
• Definitions of Big Data:
“Big Data is high volume, high velocity and/or high-variety information asset that requires new
forms of processing for enhanced decision making, insight discovery and process optimization.”
“Big Data is a collection of data sets so large or complex that traditional data processing
applications are inadequate.”
“Big Data is data of a very large size, typically to the extent that its manipulation and
management present significant logistical challenges.”
“Big Data refers to data sets whose size is beyond the ability of typical database software tools to
capture, store, manage and analyze.”
3. CHARACTERISTICS OF BIG DATA (V’S IN BIG DATA)
• Volume: The term volume refers to the amount of data generated and stored, i.e., the
scale of the data.
• Velocity: The term velocity refers to the speed of generation of data; in simple terms,
how fast the data is generated and processed.
• Variety: Big Data comprises a variety of data, since data is generated from multiple
sources in a system.
4. BIG DATA ANALYTICS
• Big data analytics applies familiar statistical analysis techniques, such as clustering and
regression, to more extensive datasets with the help of newer tools.
5. HADOOP ARCHITECTURE / ECOSYSTEM
• Hadoop is an open-source framework from Apache that is used to store, process and analyze
data of very large volume.
1. HDFS (Hadoop Distributed File System):
✔ HDFS has two major components: the NameNode (master node, which stores the
file-system metadata) and the DataNodes (slave nodes, which store the actual data blocks).
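As a hedged illustration of this master/slave split, the sketch below uses the third-party Python package hdfs (HdfsCLI) to talk to the NameNode over WebHDFS; the NameNode URL, user name and paths are illustrative assumptions, not details from the slides:

# Minimal sketch of accessing HDFS from Python with the `hdfs` package.
# The client talks to the NameNode's WebHDFS endpoint; the NameNode holds
# the metadata and directs reads/writes to the DataNodes that store the
# actual blocks. URL, user and paths below are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small file (the NameNode records metadata, DataNodes store blocks).
client.write("/user/hadoop/demo.txt", data=b"hello hdfs", overwrite=True)

# List the directory and read the file back.
print(client.list("/user/hadoop"))
with client.read("/user/hadoop/demo.txt") as reader:
    print(reader.read())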
2. YARN (Yet Another Resource Negotiator):
✔ YARN is the resource-management layer of Hadoop; it schedules jobs and allocates
cluster resources (CPU and memory) to running applications.
3. MapReduce (Data Processing):
✔ MapReduce is a programming model for processing data in parallel: a map phase turns
input records into intermediate key-value pairs, and a reduce phase aggregates the values
for each key (a word-count sketch follows below).
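A word-count sketch of the two phases, written as plain Python in the style of Hadoop Streaming (where a mapper and a reducer exchange key-value pairs); the sample text is made up, and the sort step below stands in for the framework's shuffle between the phases:

from itertools import groupby

def mapper(lines):
    # Map phase: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: sum the counts for each word (input must be sorted by key).
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data needs big storage", "hadoop processes big data"]
    shuffled = sorted(mapper(text))   # stands in for the shuffle/sort step
    for word, total in reducer(shuffled):
        print(word, total)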
Sqoop and Flume (Data Collection and Ingestion)
• Sqoop is used to transfer data between Hadoop and external data stores such as
relational databases and enterprise data warehouses (very high-end servers).
• Flume is a distributed service for collecting, aggregating and moving large amounts
of log data.
Pig (Scripting Language) and Hive (SQL Queries)
• Pig provides a high-level scripting language, Pig Latin, for analyzing large datasets;
Pig scripts are internally compiled into MapReduce jobs.
• Hive facilitates reading, writing and managing large datasets residing in distributed
storage using an SQL-like language (Hive Query Language, HiveQL).
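One common way to run HiveQL from Python is the PyHive client; the sketch below assumes a HiveServer2 endpoint on localhost:10000 and a table named web_logs, both of which are made-up illustrations rather than anything from the slides:

# Minimal sketch of running a HiveQL query from Python with PyHive.
# Host, port, username, database and table name are illustrative assumptions.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, username="hadoop",
                    database="default")
cursor = conn.cursor()

# HiveQL looks like SQL but runs over data stored in distributed storage.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")

for page, hits in cursor.fetchall():
    print(page, hits)

cursor.close()
conn.close()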
Spark (Real-time data analysis)
• Spark is a distributed, in-memory data-processing engine that supports batch processing,
real-time stream processing, machine learning and interactive queries.
• It is written in Scala.
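A minimal PySpark sketch of the kind of in-memory processing described above; the input path and column names are illustrative assumptions:

# Load a CSV, aggregate it in memory and show the result with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

events = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

# Count events per page and keep only pages seen more than 100 times.
top_pages = (events.groupBy("page")
                   .agg(F.count("*").alias("hits"))
                   .filter(F.col("hits") > 100)
                   .orderBy(F.col("hits").desc()))

top_pages.show(10)
spark.stop()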
Mahout (Machine Learning)
• Mahout provides a library of scalable machine-learning algorithms (for example clustering,
classification and collaborative filtering/recommendation) that run on top of Hadoop.
Apache Ambari (Management and Monitoring)
• Ambari is a web-based tool for provisioning, managing and monitoring Hadoop clusters.
Kafka and Apache Storm (Streaming)
• Kafka is a distributed publish-subscribe messaging system used to ingest high-throughput
streams of data and pass them between systems.
• Storm is a processing engine that processes real-time streaming data at a very high
speed.
• It is written in Clojure.
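A hedged sketch of publishing a small stream of events to Kafka with the kafka-python client (a Storm or Spark job would typically consume such a topic downstream); the broker address, topic name and event fields are illustrative assumptions:

# Minimal sketch of publishing events to a Kafka topic with kafka-python.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

for i in range(5):
    event = {"user_id": i, "action": "page_view", "ts": time.time()}
    producer.send("clickstream", value=event)   # send() is asynchronous

producer.flush()   # wait until all buffered events are delivered
producer.close()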
Apache Ranger and Apache Knox (Security)
• Ranger is a framework to enable, monitor and manage data security across the
Hadoop platform.
• Knox is an application gateway for interacting with the REST APIs and UIs of
Hadoop deployments.
Oozie (Workflow system)
• Oozie is a workflow scheduler that defines, manages and chains Hadoop jobs (for example
MapReduce, Hive and Pig jobs) into a single logical unit of work.
6. CHALLENGES IN BIG DATA
• The following are the challenges in big data,
8. Organizational resistance.
7. CAP THEOREM
• The CAP Theorem comprises three components (hence its name) as they relate
to distributed data stores,
a) Consistency: Every read receives the most recent write, or an error.
b) Availability: All reads contain data, but it might not be the most recent.
c) Partition tolerance: The system continues to operate despite network failures (i.e.,
dropped partitions, slow network connections or unavailable network connections between
nodes).
7.1. CONSISTENCY IN DATABASES
• Consistent databases should be used when the value of the information returned
needs to be accurate.
• Financial data is a good example. When a user logs in to their banking institution,
they do not want to see an error that no data is returned, or that the value is higher or
lower than it actually is. Banking apps should return the exact value of a user’s account
information. In this case, banks would rely on consistent databases.
7.2. AVAILABILITY IN DATABASES
• Availability databases should be used when the service is more important than the
information.
8. WEB ANALYTICS
• Web analytics is the measurement and analysis of data to inform an understanding
of user behavior across web pages.
• Analytics platforms measure activity and behavior on a website, for example: how
many users visit, how long they stay, how many pages they visit, which pages they
visit and whether they arrive by following a link or not.
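To make these measurements concrete, the small sketch below computes a few common metrics (visits, pages per visit and bounce rate) from made-up per-session page counts; the numbers are illustrative assumptions, not real analytics data:

# Basic web-analytics metrics from per-session pageview counts (made-up data).
sessions = [3, 1, 5, 1, 2, 1, 4]   # pages viewed in each visit (session)

visits = len(sessions)
total_pageviews = sum(sessions)
pages_per_visit = total_pageviews / visits

# Bounce rate: share of visits that viewed only a single page.
bounces = sum(1 for pages in sessions if pages == 1)
bounce_rate = bounces / visits

print("visits:", visits)
print("pages per visit:", round(pages_per_visit, 2))
print("bounce rate: {:.0%}".format(bounce_rate))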
WHY IS WEB ANALYTICS IMPORTANT?
• Website analytics provide insights and data that can be used to create a better user
experience for website visitors.
• For example, web analytics will show you the most popular pages on your website
and the most popular paths to purchase.
• With website analytics, you can also accurately track the effectiveness of your online
marketing campaigns to help inform future efforts.
SAMPLE WEB DATA ANALYTICS DATA
1. Audience data:
2. Audience behavior:
• Bounce rate.
3. Campaign data:
COMMONLY USED WEB DATA ANALYTICS TOOLS
• The following are the most commonly used web data analytics tools,
1. Google Analytics
2. Piwik
3. Adobe Analytics
4. Kissmetrics
5. Mixpanel
6. Parse.ly
7. CrazyEgg
9. INDUSTRY APPLICATIONS OF BIG DATA
10. BENEFITS OF BIG DATA ANALYTICS
11. TOOLS USED IN BIG DATA ANALYTICS