BDA Module-1
HADOOP
1. Introduction to big data
2. What is Big Data?
3. Characteristics of Big Data (V's in Big Data)
4. Big Data analytics
5. Hadoop architecture / ecosystem
6. Challenges in Big Data
7. CAP theorem
8. Web analytics
9. Industry applications of Big Data
10. Benefits of Big Data analytics
11. Tools used in Big Data analytics
1. INTRODUCTION TO BIG DATA
• The following are selected key terms and their meanings, which are essential for
understanding the topics of Big Data:
h) Table: Refers to a presentation of data arranged in rows and columns (row fields and
column fields).
j) Name-Value Pair: Refers to a construct in which a field consists of a name and the
corresponding value that follows it.
k) Key-Value Pair: Refers to a construct in which a field is the key, which pairs with the
corresponding value or values that follow the key (see the sketch after this list).
l) Database Administration (DBA): Refers to the function of regularly managing and
maintaining Database Management System (DBMS) software.
m) Data Warehouse: Refers to sharable data, data stores and databases in an enterprise.
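A minimal sketch in Python of the two constructs above, with made-up field names and values for illustration: a name-value pair is a record that explicitly carries the field name alongside its value, while a key-value pair maps a key to one or more values:

# Name-value pairs: each field explicitly carries its own name and its value.
name_value_pairs = [
    {"name": "city", "value": "Bengaluru"},
    {"name": "population", "value": 13600000},
]

# Key-value pairs: the key identifies the field and maps to a value
# (or to several values, as in the second entry).
key_value_pairs = {
    "city": "Bengaluru",
    "pincodes": ["560001", "560002", "560003"],  # one key, many values
}

for pair in name_value_pairs:
    print(pair["name"], "=", pair["value"])

for key, value in key_value_pairs.items():
    print(key, "->", value)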
2. WHAT IS BIG DATA?
• Definition of data:
“Data is information, usually in the form of facts or statistics that one can analyze or use
for further calculations.”
Example: Data generated from applications like Snapchat, Instagram, Facebook, etc.
• Definition of web data:
“Web data is the data present on web servers in the form of text, images, videos, audios
and multimedia files for web users.”
• Definitions of Big Data:
“Big Data is high volume, high velocity and/or high-variety information asset that requires new
forms of processing for enhanced decision making, insight discovery and process optimization.”
“Big Data is a collection of data sets so large or complex that traditional data processing
applications are inadequate.”
“Big Data is data of a very large size, typically to the extent that its manipulation and
management present significant logistical challenges.”
“Big Data refers to data sets whose size is beyond the ability of typical database software tools to
capture, store, manage and analyze.”
3. CHARACTERISTICS OF BIG DATA (V’S IN BIG DATA)
• Volume: The term volume refers to the amount of data generated and stored, i.e., the
scale of the data.
• Velocity: The term velocity refers to the speed of generation of data; in simple terms,
how fast the data is generated and processed.
• Variety: Big Data comprises a variety of data, since data is generated from multiple
sources in a system.
4. BIG DATA ANALYTICS
• Big data analytics applies familiar statistical analysis techniques, such as clustering and
regression, to more extensive datasets with the help of newer tools.
5. HADOOP ARCHITECTURE / ECOSYSTEM
• Hadoop is an open-source framework from Apache that is used to store, process and analyze
data of very large volume.
1. HDFS (Hadoop Distributed File System):
✔ HDFS has two major components: the NameNode (master node, which stores the
file-system metadata) and the DataNodes (slave nodes, which store the actual data blocks).
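As a hedged illustration of this master/slave split, the sketch below uses the third-party Python package hdfs (HdfsCLI) to talk to the NameNode over WebHDFS; the NameNode URL, user name and paths are illustrative assumptions, not details from the slides:

# Minimal sketch of accessing HDFS from Python with the `hdfs` package.
# The client talks to the NameNode's WebHDFS endpoint; the NameNode holds
# the metadata and directs reads/writes to the DataNodes that store the
# actual blocks. URL, user and paths below are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small file (the NameNode records metadata, DataNodes store blocks).
client.write("/user/hadoop/demo.txt", data=b"hello hdfs", overwrite=True)

# List the directory and read the file back.
print(client.list("/user/hadoop"))
with client.read("/user/hadoop/demo.txt") as reader:
    print(reader.read())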
2. YARN (Yet Another Resource Negotiator):
✔ YARN is the resource-management layer of Hadoop; it schedules jobs and allocates
cluster resources (CPU and memory) to running applications.
3. MapReduce (Data Processing):
✔ MapReduce is a programming model for processing data in parallel: a map phase turns
input records into intermediate key-value pairs, and a reduce phase aggregates the values
for each key (a word-count sketch follows below).
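A word-count sketch of the two phases, written as plain Python in the style of Hadoop Streaming (where a mapper and a reducer exchange key-value pairs); the sample text is made up, and the sort step below stands in for the framework's shuffle between the phases:

from itertools import groupby

def mapper(lines):
    # Map phase: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: sum the counts for each word (input must be sorted by key).
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data needs big storage", "hadoop processes big data"]
    shuffled = sorted(mapper(text))   # stands in for the shuffle/sort step
    for word, total in reducer(shuffled):
        print(word, total)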
Sqoop and Flume (Data Collection and Ingestion)
• Sqoop is used to transfer data between Hadoop and external data stores such as
relational databases and enterprise data warehouses (very high-end servers).
• Flume is a distributed service for collecting, aggregating and moving large amounts
of log data.
Pig (Scripting Language) and Hive (SQL Queries)
• Pig provides a high-level scripting language, Pig Latin, for analyzing large datasets;
Pig scripts are internally compiled into MapReduce jobs.
• Hive facilitates reading, writing and managing large datasets residing in distributed
storage using an SQL-like language (Hive Query Language, HiveQL).
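One common way to run HiveQL from Python is the PyHive client; the sketch below assumes a HiveServer2 endpoint on localhost:10000 and a table named web_logs, both of which are made-up illustrations rather than anything from the slides:

# Minimal sketch of running a HiveQL query from Python with PyHive.
# Host, port, username, database and table name are illustrative assumptions.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, username="hadoop",
                    database="default")
cursor = conn.cursor()

# HiveQL looks like SQL but runs over data stored in distributed storage.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")

for page, hits in cursor.fetchall():
    print(page, hits)

cursor.close()
conn.close()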
Spark (Real-time data analysis)
• Spark is a distributed, in-memory data-processing engine that supports batch processing,
real-time stream processing, machine learning and interactive queries.
• It is written in Scala.
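A minimal PySpark sketch of the kind of in-memory processing described above; the input path and column names are illustrative assumptions:

# Load a CSV, aggregate it in memory and show the result with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

events = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

# Count events per page and keep only pages seen more than 100 times.
top_pages = (events.groupBy("page")
                   .agg(F.count("*").alias("hits"))
                   .filter(F.col("hits") > 100)
                   .orderBy(F.col("hits").desc()))

top_pages.show(10)
spark.stop()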
Mahout (Machine Learning)
• Mahout provides a library of scalable machine-learning algorithms (for example clustering,
classification and collaborative filtering/recommendation) that run on top of Hadoop.
Apache Ambari (Management and Monitoring)
• Ambari is a web-based tool for provisioning, managing and monitoring Hadoop clusters.
Kafka and Apache Storm (Streaming)
• Kafka is a distributed publish-subscribe messaging system used to ingest high-throughput
streams of data and pass them between systems.
• Storm is a processing engine that processes real-time streaming data at a very high
speed.
• It is written in Clojure.
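A hedged sketch of publishing a small stream of events to Kafka with the kafka-python client (a Storm or Spark job would typically consume such a topic downstream); the broker address, topic name and event fields are illustrative assumptions:

# Minimal sketch of publishing events to a Kafka topic with kafka-python.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

for i in range(5):
    event = {"user_id": i, "action": "page_view", "ts": time.time()}
    producer.send("clickstream", value=event)   # send() is asynchronous

producer.flush()   # wait until all buffered events are delivered
producer.close()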
Apache Ranger and Apache Knox (Security)
• Ranger is a framework to enable, monitor and manage data security across the
Hadoop platform.
• Knox is an application gateway for interacting with the REST APIs and UIs of
Hadoop deployments.
Oozie (Workflow system)
• Oozie is a workflow scheduler that defines, manages and chains Hadoop jobs (for example
MapReduce, Hive and Pig jobs) into a single logical unit of work.
6. CHALLENGES IN BIG DATA
• The following are the challenges in big data,
8. Organizational resistance.
7. CAP THEOREM
• The CAP Theorem comprises three components (hence its name) as they relate
to distributed data stores,
a) Consistency: Every read receives the most recent write, or an error.
b) Availability: All reads contain data, but it might not be the most recent.
c) Partition tolerance: The system continues to operate despite network failures (i.e.,
dropped partitions, slow network connections or unavailable network connections between
nodes).
7.1. CONSISTENCY IN DATABASES
• Consistent databases should be used when the value of the information returned
needs to be accurate.
• Financial data is a good example. When a user logs in to their banking institution,
they do not want to see an error that no data is returned, or that the value is higher or
lower than it actually is. Banking apps should return the exact value of a user’s account
information. In this case, banks would rely on consistent databases.
7.2. AVAILABILITY IN DATABASES
• Availability databases should be used when the service is more important than the
information.
8. WEB ANALYTICS
• Web analytics is the measurement and analysis of data to inform an understanding
of user behavior across web pages.
• Analytics platforms measure activity and behavior on a website, for example: how
many users visit, how long they stay, how many pages they visit, which pages they
visit and whether they arrive by following a link or not.
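To make these measurements concrete, the small sketch below computes a few common metrics (visits, pages per visit and bounce rate) from made-up per-session page counts; the numbers are illustrative assumptions, not real analytics data:

# Basic web-analytics metrics from per-session pageview counts (made-up data).
sessions = [3, 1, 5, 1, 2, 1, 4]   # pages viewed in each visit (session)

visits = len(sessions)
total_pageviews = sum(sessions)
pages_per_visit = total_pageviews / visits

# Bounce rate: share of visits that viewed only a single page.
bounces = sum(1 for pages in sessions if pages == 1)
bounce_rate = bounces / visits

print("visits:", visits)
print("pages per visit:", round(pages_per_visit, 2))
print("bounce rate: {:.0%}".format(bounce_rate))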
WHY IS WEB ANALYTICS IMPORTANT?
• Website analytics provide insights and data that can be used to create a better user
experience for website visitors.
• For example, web analytics will show you the most popular pages on your website
and the most popular paths to purchase.
• With website analytics, you can also accurately track the effectiveness of your online
marketing campaigns to help inform future efforts.
SAMPLE WEB DATA ANALYTICS DATA
1. Audience data:
2. Audience behavior:
• Bounce rate.
3. Campaign data:
COMMONLY USED WEB DATA ANALYTICS TOOLS
• The following are the most commonly used web data analytics tools,
1. Google Analytics
2. Piwik
3. Adobe Analytics
4. Kissmetrics
5. Mixpanel
6. Parse.ly
7. CrazyEgg
9. INDUSTRY APPLICATIONS OF BIG DATA
10. BENEFITS OF BIG DATA ANALYTICS
11. TOOLS USED IN BIG DATA ANALYTICS