Module 1
(DSE 3264)
Faculty Name:
Shavantrevva S Bilakeri (SSB)
Assistant Professor, Dept of DSCA, MIT.
Phone No: +91 8217228519 E-mail: [email protected]
Course Objectives
To be familiar with an overview of the Apache Hadoop Ecosystem
Big Data Tools – “Hive”: Brief History of Hive, Data Types in Hive, Execution
Modes, Writing & Executing Hive Queries.
Centralized vs. Distributed
•It is the type of data most familiar from everyday life. Ex: birthday,
address
• A certain schema binds it, so all the data has the same set of
properties. Structured data is also called relational data.
•It is split into multiple tables to enhance the integrity of the data
by creating a single record to depict an entity.
•However, features such as key-value pairs help in discerning
the different entities from each other.
MapReduce / Processing / Data Analysis
Use case of Big Data Analysis – Gaming
• Processed
• Analyzed Accurately
Big data Technologies/ Classes
Big Data?
Big Data Layout
HDFS is a Java-based file system used to store structured or unstructured
data over large clusters of distributed servers.
No restriction or rule is applied to the data stored in HDFS; the data can be
either fully unstructured or purely structured.
In HDFS, making the data meaningful is the job of the developer's code alone.
The Hadoop Distributed File System provides a highly fault-tolerant
environment and can be deployed on low-cost hardware machines.
HDFS is now part of the Apache Hadoop project; more information and an
installation guide can be found in the Apache HDFS documentation.
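The idea above can be sketched in a few lines of plain Python: a file is split into fixed-size blocks and each block is copied to several data nodes. This is an illustrative toy, not real HDFS; the block size, node names, and round-robin placement are made-up stand-ins (real HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware placement).

```python
# Illustrative sketch (NOT real HDFS): split a file into fixed-size blocks
# and assign each block to several data nodes for fault tolerance.
BLOCK_SIZE = 8          # bytes here; stand-in for HDFS's 128 MB default
REPLICATION = 3         # HDFS replicates each block 3 times by default
NODES = ["node1", "node2", "node3", "node4"]   # hypothetical data nodes

def place_blocks(data: bytes):
    """Split data into blocks and pick REPLICATION nodes for each block."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        # simple round-robin placement; real HDFS also considers rack topology
        targets = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        placement[idx] = (block, targets)
    return placement

layout = place_blocks(b"hello hadoop distributed file system")
for idx, (block, targets) in layout.items():
    print(idx, block, targets)
```

If one node fails, every block it held still exists on two other nodes, which is the property that lets HDFS run on low-cost commodity hardware.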
MapReduce
MapReduce was introduced by Google; it consists of mapper and reducer modules.
It helps analyze web logs and build large web search indexes.
It is basically a framework (a programming model) for writing applications that
process large amounts of structured or unstructured data over the web.
MapReduce takes a query and breaks it into parts to run on multiple nodes.
Through distributed query processing it makes large amounts of data easy to
manage by dividing the data across several different machines.
Hadoop MapReduce is a software framework for easily writing applications that
process large data sets in a highly fault-tolerant manner.
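The map–shuffle–reduce pattern described above can be sketched in plain Python, with the classic word-count example (no Hadoop cluster involved; the function names `mapper`, `reducer`, and `map_reduce` are illustrative, not Hadoop APIs):

```python
# Minimal sketch of the MapReduce programming model in plain Python:
# map emits (key, value) pairs, the framework shuffles pairs by key,
# and reduce aggregates each group of values.
from collections import defaultdict

def mapper(line):
    # emit (word, 1) for every word in one input line
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # aggregate all intermediate counts for one key
    return (word, sum(counts))

def map_reduce(lines):
    # shuffle phase: group intermediate values by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

print(map_reduce(["big data is big", "data is everywhere"]))
# -> {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Hadoop runs exactly this model, but the mappers and reducers execute in parallel on different nodes of the cluster, which is where the scalability comes from.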
HIVE
Hive was originally developed by Facebook and has since been made open source.
Hive works as a bridge between SQL and Hadoop; it is basically used to run
SQL-like queries on Hadoop clusters.
Apache Hive is basically a data warehouse that provides ad-hoc querying, data
summarization and analysis of huge data sets stored in Hadoop-compatible file
systems.
Hive provides a SQL-like query language called HiveQL for working with the
huge amounts of data stored in Hadoop clusters.
In January 2013, Apache released Hive 0.10.0.
Pig
Pig was introduced by Yahoo and was later made fully open source.
It also provides a bridge to query data over Hadoop clusters, but unlike Hive it
uses a scripting approach to make Hadoop data accessible to developers and
business users.
Apache Pig provides a high-level programming platform for developers to
process and analyze Big Data using user-defined functions with little
programming effort.
In January 2013, Apache released Pig 0.10.1, which is defined for use with
Hadoop 0.10.1 or later releases.
Traditional vs Big Data
Traditional approach:
• Hard-coded schema / SQL / small data
• Online transactions and quick updates
• Schema-based DB (hard-coded schema attachment)
• Structured data
• Uses SQL for data processing
• Maintains relationships between elements
Big Data approach:
• The live flow of data is captured and the analysis is done on it
• Efficiency increases when the data to be analyzed is large
Traditional vs Big Data Approaches
What is Changing in the Realm of
Big Data?
Life Cycle of Data
1) The data is analyzed by knowledge experts, and their
expertise is applied to the development of an
application.
2) After the analysis, the application streams the data and
a data log is created for the acquisition of data.
3) The data is mapped and clustered together on the data
log.
4) The clustered data from the data acquisition is then
aggregated by applying various aggregation
algorithms.
5) The integrated data again goes through analysis.
6) All the steps are repeated until the desired and
expected output is produced.
Big Data Technology Stack
Design of logical layers
in a data processing pipeline
Various Data Storage and Usage, Tools
Components Classification in Hadoop
Hadoop Ecosystem
Hadoop 1 vs Hadoop 2 [MRV1 vs MRV2]
Hadoop V1 vs Hadoop V2
Hadoop Technology
Why Use Hadoop?
MapReduce:
Phases of MapReduce: Word Count
MapReduce: Case Study (Paper Correction and Identifying the Topper)
Map: Parallelism; Reduce: Grouping
Total = 20 papers
Time = 1 min/paper
Without MPP:
20 paper corrections = 20 minutes
With MPP/MapReduce:
Divide the task into 4 groups / 4 districts (D1, D2, D3, D4) [Divide tasks: Mapper class]
Assign 5 papers to each district
To fetch the “District Topper” = (D1 = T1, D2 = T2, D3 = T3, D4 = T4) = 4 toppers; 5 minutes
To fetch the “State Topper” from T1, T2, T3, T4 = topper; 4 minutes [Aggregation: Reducer class]
Total job “To Fetch the Topper” = 9 minutes
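The case study above maps cleanly onto a few lines of Python: each "district" plays the role of a mapper that finds its local topper in parallel, and a single reducer aggregates the four district toppers into the state topper. The student names and marks below are made up for illustration.

```python
# Sketch of the paper-correction case study: mappers work per district
# in parallel (Map: parallelism), one reducer aggregates the district
# results (Reduce: grouping). Names and marks are hypothetical.
districts = {
    "D1": {"Asha": 78, "Ravi": 91, "Meena": 85, "Kiran": 66, "Tara": 73},
    "D2": {"Vijay": 88, "Nisha": 92, "Arun": 81, "Leela": 79, "Sam": 70},
    "D3": {"Dev": 95, "Priya": 84, "Mohan": 77, "Rita": 89, "Anil": 90},
    "D4": {"Gita": 82, "Hari": 76, "Usha": 93, "Raj": 80, "Lina": 87},
}

def district_topper(papers):
    """Mapper: correct one district's 5 papers and emit its topper."""
    return max(papers.items(), key=lambda kv: kv[1])

def state_topper(toppers):
    """Reducer: aggregate the district toppers into the state topper."""
    return max(toppers, key=lambda kv: kv[1])

toppers = [district_topper(p) for p in districts.values()]  # map phase
winner = state_topper(toppers)                              # reduce phase
print(winner)  # -> ('Dev', 95)
```

Because the four mappers can run at the same time, correcting 5 papers each costs only 5 minutes of wall-clock time instead of 20, and the final reduce over 4 candidates adds 4 minutes, giving the 9-minute total from the slide.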