
MODULE-1

BIG DATA ANALYTICS


(21CS63)
MODULE-1

INTRODUCTION TO BIG DATA AND HADOOP

CONTENTS

1. Introduction to big data

2. What is Big Data?

3. Characteristics of big data (V’s in big data)

4. Big data analytics

5. Hadoop architecture / ecosystem

6. Challenges in big data

7. CAP theorem

8. Web analytics

9. Industry applications of big data

10. Benefits of Big Data Analytics

11. Tools used in Big Data Analytics


1. INTRODUCTION TO BIG DATA

Fig: Evolution of Big Data and their characteristics

• The following are selected key terms and their meanings, which are essential to understanding the topics of Big Data:

a) Application: Collection of software components.

b) Application Programming Interface (API): Software component which enables users to access an application, service, etc.

c) Data Model: Map or schema, which represents inherent properties of data.

d) Data Repositories: Collection of data.

e) Data Store: Data repository of a set of objects.


f) Distributed Data Store: Refers to a data store distributed over multiple nodes (Ex: Apache Cassandra).

g) Database (DB): Refers to a grouping of tables for the collection of data.

h) Table: Refers to a representation of data consisting of row fields and column fields.

i) CSV File: Refers to a file with comma-separated values.

j) Name-Value Pair: Refers to a construct in which a field consists of a name followed by its corresponding value.

k) Key-Value Pair: Refers to a construct in which a field is a key that pairs with the corresponding value or values that follow it (a small sketch follows this list).
l) Database Administration (DBA): Refers to the function of managing and maintaining
Database Management System (DBMS) software regularly.

m) Data Warehouse: Refers to sharable data, data stores and databases in an enterprise.
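
• As a small illustration of terms (j) and (k), the sketch below builds a name-value pair and key-value pairs in Java; the field names and values are invented for the example.

```
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairExample {
    public static void main(String[] args) {
        // Name-value pair: a field name followed by its corresponding value.
        String nameValuePair = "city=Bengaluru";

        // Key-value pairs: each key pairs with the value (or values) that follow it.
        Map<String, Integer> marks = new HashMap<>();
        marks.put("bigdata", 92);                             // key -> single value

        Map<String, List<String>> modules = new HashMap<>();
        modules.put("21CS63", List.of("Big Data", "Hadoop")); // key -> multiple values

        System.out.println(nameValuePair);
        System.out.println(marks + " " + modules);
    }
}
```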

2. WHAT IS BIG DATA?
• Definitions of data:

“Data is information, usually in the form of facts or statistics that one can analyze or use
for further calculations.”

“Data is information that can be stored and used by a computer program.”

“Data is information presented in numbers, letters, or other form.”

“Data is information from a series of observations, measurements or facts.”

“Data is information from a series of behavioral observations, measurements or facts.”

Example: Data generated from applications like Snapchat, Instagram, Facebook, etc.
• Definition of web data:

“Web data is the data present on web servers in the form of text, images, videos, audios
and multimedia files for web users.”

Example: YouTube, Instagram, Wikipedia, etc.


Fig: Classification of Data


• Definitions of Big Data:

“Big Data is high volume, high velocity and/or high-variety information asset that requires new
forms of processing for enhanced decision making, insight discovery and process optimization.”

“Big Data is a collection of data sets so large or complex that traditional data processing
applications are inadequate."

“Big Data is data of a very large size, typically to the extent that its manipulation and
management present significant logistical challenges."

"Big Data refers to data sets whose size is beyond the ability of typical database software tool to
capture, store, manage and analyze."
3. CHARACTERISTICS OF BIG DATA (V’S IN BIG DATA)

Fig: Characteristics of Big Data


• Volume: The term 'big' in Big Data relates to the size of the data, and hence this characteristic. Volume is the amount or quantity of data generated from an application or applications.

• Velocity: The term velocity refers to the speed of generation of data; in simple terms, how fast the data is generated and processed.

• Variety: Big Data comprises a variety of data, generated from multiple sources in a system.

• Veracity: An important characteristic that accounts for the quality of the data captured, which can vary greatly and affects accurate analysis.
• Value: Refers to the benefits that big data can provide, and it relates directly to what organizations can do with the collected data.

(OR)

• Value: The ability to turn data into useful insights.

4. BIG DATA ANALYTICS

Fig: Big Data Analytics


• Big data analytics describes the process of uncovering trends, patterns and
correlations in large amounts of raw data to help make data-informed decisions.

• These processes use familiar statistical analysis techniques like clustering and
regression and apply them to more extensive datasets with the help of newer tools.

5. HADOOP ARCHITECTURE / ECOSYSTEM
• Hadoop is an open-source framework from Apache used to store, process and analyze data that is very huge in volume.

• Hadoop is written in Java and is used for batch/offline processing; it is not OLAP (Online Analytical Processing).

Fig: Hadoop Architecture / Ecosystem


• The modules of Hadoop are:

1. HDFS:

✔ Hadoop Distributed File System.

✔ In HDFS, files are broken into blocks and stored on nodes across the distributed architecture.

✔ HDFS has two major components: NameNode (Master) and DataNode (Slave).
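
• A minimal sketch, assuming a reachable HDFS cluster whose configuration (core-site.xml / hdfs-site.xml) is on the classpath, of writing and reading a file through the HDFS Java API; the path /user/demo/sample.txt is a placeholder. The client gets metadata and block locations from the NameNode, while the file contents are streamed to and from the DataNodes.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads the cluster settings on the classpath
        FileSystem fs = FileSystem.get(conf);          // client for the configured file system
        Path path = new Path("/user/demo/sample.txt"); // placeholder HDFS path

        // Write: HDFS splits the file into blocks and replicates them across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the NameNode supplies block locations; data is streamed from DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}
```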

2. YARN (Yet Another Resource Negotiator):

✔ Hadoop YARN is a cluster resource manager.

✔ It handles the cluster of nodes and keeps track of the resources on each node (for example, where storage and RAM are located).

3. MapReduce (Data Processing):

✔ MapReduce processes large volumes of data in a parallel, distributed manner.
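
• To make this concrete, below is a minimal word-count sketch against the Hadoop MapReduce Java API: map tasks run in parallel on the input blocks and emit (word, 1) pairs, and reduce tasks sum the counts per word. The driver class that configures and submits the Job, plus the input/output paths, are omitted here and would be supplied when the job is run.

```
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: runs in parallel on blocks of the input, emitting (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives all counts for a given word and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```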

Sqoop and Flume (Data Collection and Ingestion)

• Sqoop is used to transfer data between Hadoop and external data stores such as relational databases and enterprise data warehouses (very high-end servers).


• Flume is a distributed service for collecting, aggregating and moving large amounts of log data.

Pig (Scripting Language) and Hive (SQL Queries)

• Pig is used to analyze data in Hadoop.

• It provides a high-level data processing language to perform numerous operations on the data.


• Hive facilitates reading, writing and managing large datasets residing in distributed storage using an SQL-like language (Hive Query Language, HiveQL).
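
• A minimal sketch of running a HiveQL query from Java over JDBC, assuming a HiveServer2 instance at localhost:10000, a hypothetical table named web_logs, and the hive-jdbc driver on the classpath.

```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL -- host, port, database and credentials are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = con.createStatement();
             // HiveQL reads like SQL; the web_logs table is hypothetical.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
            }
        }
    }
}
```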

Spark (Real-time data analysis)

• Spark is an open-source distributed computing engine for processing and analyzing huge volumes of real-time data.

• It is written in Scala.
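
• Spark also exposes Java, Python and R APIs. Below is a minimal word-count sketch using the Spark Java API in local mode; the input path input.txt is a placeholder and spark-core must be on the classpath.

```
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Local mode for illustration; on a cluster the master URL would differ.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("input.txt");            // placeholder path
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        sc.stop();
    }
}
```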

Mahout (Machine Learning)

• Mahout is used to create scalable and distributed machine learning algorithms.

Apache Ambari (Management and Monitoring)

• Ambari is an open-source tool responsible for keeping track of running applications and their statuses.

Kafka and Apache Storm (Streaming)

• Kafka is a distributed streaming platform to store and process streams of records.
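
• A minimal sketch of publishing records to Kafka from Java, assuming a broker at localhost:9092, a hypothetical topic named web-clicks, and the kafka-clients library on the classpath.

```
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is appended to the "web-clicks" topic (hypothetical name).
            producer.send(new ProducerRecord<>("web-clicks", "user42", "clicked /home"));
        }
    }
}
```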


• Storm is a processing engine that processes real-time streaming data at a very high
speed.

• It is written in Clojure.

Apache Ranger and Apache Knox (Security)

• Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.


• Knox is an application gateway for interacting with the REST APIs and UIs of Hadoop deployments.

Oozie (Workflow system)

• Oozie is a workflow scheduler system to manage Hadoop jobs.

6. CHALLENGES IN BIG DATA
• The following are the challenges in big data:

1. Managing massive amounts of data.

2. Integrating data from multiple sources.

3. Ensuring data quality.

4. Keeping data secure.

5. Selecting the right big data tools.

6. Scaling systems and costs efficiently.

7. Lack of skilled data professionals.

8. Organizational resistance.
7. CAP THEOREM

• The CAP Theorem comprises three components (hence its name) as they relate to distributed data stores:

a) Consistency: All reads receive the most recent write or an error.

b) Availability: All reads contain data, but it might not be the most recent.

c) Partition tolerance: The system continues to operate despite network failures (i.e., dropped partitions, slow network connections, or unavailable network connections between nodes).

7.1. CONSISTENCY IN DATABASES
• Consistent databases should be used when the value of the information returned
needs to be accurate.

• Financial data is a good example. When a user logs in to their banking institution,
they do not want to see an error that no data is returned, or that the value is higher or
lower than it actually is. Banking apps should return the exact value of a user’s account
information. In this case, banks would rely on consistent databases.

• Examples of a consistent database include: Bank account balances, Text messages.

• Database options for consistency: MongoDB, Redis, HBase.

7.2. AVAILABILITY IN DATABASES
• Availability databases should be used when the service is more important than the
information.

• An example of a highly available database can be seen in e-commerce businesses. Online stores want to make their store and shopping cart functions available 24/7 so shoppers can make purchases exactly when they need to.

• Database options for availability: Cassandra, DynamoDB, Cosmos DB.

8. WEB ANALYTICS
• Web analytics is the measurement and analysis of data to inform an understanding
of user behavior across web pages.

• Analytics platforms measure activity and behavior on a website, for example: how
many users visit, how long they stay, how many pages they visit, which pages they
visit and whether they arrive by following a link or not.

• Businesses use web analytics platforms to measure and benchmark site performance and to look at key performance indicators that drive their business, such as purchase conversion rate.

WHY IS WEB ANALYTICS IMPORTANT?

• Website analytics provide insights and data that can be used to create a better user
experience for website visitors.

• Understanding customer behavior is also key to optimizing a website for key conversion metrics.

• For example, web analytics will show you the most popular pages on your website
and the most popular paths to purchase.

• With website analytics, you can also accurately track the effectiveness of your online
marketing campaigns to help inform future efforts.
SAMPLE WEB ANALYTICS DATA

1. Audience data:

• Number of visits, number of unique visitors.

• New vs returning visitor ratio.

• What country are they from?

• What browser or device are they on (desktop vs mobile)?


2. Audience behavior:

• Common landing pages.

• Common exit page.

• Frequently visited pages.

• Length of time spent per visit.

• Number of pages per visit.

• Bounce rate.
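
• As a worked example for two of these behavior metrics, the sketch below computes bounce rate and pages per visit from made-up session figures, assuming the common definitions bounce rate = single-page visits / total visits and pages per visit = total page views / total visits.

```
public class WebMetrics {
    public static void main(String[] args) {
        // Illustrative (made-up) figures for one reporting period.
        long totalVisits = 10_000;
        long singlePageVisits = 4_200;   // visits that viewed only one page
        long totalPageViews = 36_500;

        double bounceRate = 100.0 * singlePageVisits / totalVisits;   // in percent
        double pagesPerVisit = (double) totalPageViews / totalVisits;

        System.out.printf("Bounce rate: %.1f%%%n", bounceRate);       // 42.0%
        System.out.printf("Pages per visit: %.2f%n", pagesPerVisit);  // 3.65
    }
}
```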

3. Campaign data:

• Which campaigns drove the most traffic?

• Which websites referred the most traffic?

• Which keyword searches resulted in a visit?

• Campaign medium breakdown, such as email vs social media.

COMMONLY USED WEB DATA ANALYTICS TOOLS

• The following are the most commonly used web data analytics tools,

1. Google Analytics

2. Piwik

3. Adobe Analytics

4. Kissmetrics

5. Mixpanel

6. Parse.ly

7. CrazyEgg
9. INDUSTRY APPLICATIONS OF BIG DATA

Fig: Applications of Big Data


10. BENEFITS OF BIG DATA ANALYTICS

11. TOOLS USED IN BIG DATA ANALYTICS

Fig: Tools used in Big Data Analytics


THANK YOU
