INT312: BIG DATA FUNDAMENTALS
Lecture #0
Course details
L-T-P: 0-0-4 [four practicals per week] [BYOD]
CA Category: A0304
Course Orientation: RESEARCH, SOFTWARE SKILL
Weightages: ATT: 5 CA: 25 MTT: 20 ETT: 50
TEXT BOOKS
No Textbook for this course.
REFERENCE BOOKS
1. BIG DATA by ANIL MAHESHWARI, MCGRAW HILL EDUCATION
2. BIG DATA AND ANALYTICS by SEEMA ACHARYA, SUBHASHINI CHELLAPPAN, WILEY
3. UNDERSTANDING BIG DATA: ANALYTICS FOR ENTERPRISE CLASS HADOOP AND
STREAMING DATA by PAUL C. ZIKOPOULOS, CHRIS EATON, IBM,
MCGRAW HILL
4. ORACLE BIG DATA HANDBOOK by TOM PLUNKETT, BRIAN MACDONALD, BRUCE
NELSON, MARK HORNICK, HELEN SUN, KHADER MOHIUDDIN, DEBRA HARDING,
GOKULA MISHRA, ROBERT STACKOWIAK, KEIT, MCGRAW HILL
5. PROFESSIONAL HADOOP SOLUTIONS by BORIS LUBLINSKY, KEVIN T. SMITH, ALEXEY
YAKUBOVICH, WILEY
Course Objectives
- Recognize the need for and importance of the fundamental concepts and principles of Big Data
- Examine the internal functioning of the different modules of Big Data and Hadoop
- Conceptualize the Big Data ecosystem and appreciate its key components
What you will learn?
Big Data Fundamentals covers the following topics:
Introduction to Big Data
Introduction to Hadoop
Installation of Hadoop
Hadoop Architecture
Hadoop Ecosystem
HIVE and HBASE
Course Prerequisite
Prerequisite:
Java Programming / C++
Database basics
What's Big Data?
There is no single definition; here is one from Wikipedia:
Big data is the term for a collection of data sets so
large and complex that it becomes difficult to process
using on-hand database management tools or
traditional data processing applications.
The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and visualization.
Big Data: 3Vs
Volume (Scale)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes (ZB) to 35 ZB
Data volume is increasing exponentially
Exponential increase in
collected/generated data
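The growth figures above can be checked with a little arithmetic. The sketch below is only an illustration: it takes the two volumes quoted on the slide (0.8 ZB in 2009, 35 ZB in 2020) and assumes a constant yearly growth rate, which the slide does not state. The function names are hypothetical.

```python
# Figures quoted on the slide: 0.8 ZB in 2009 growing to 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
start_year, end_year = 2009, 2020


def growth_factor(start, end):
    """Overall multiplication factor between two volumes."""
    return end / start


def annual_growth_rate(start, end, years):
    """Constant yearly rate that compounds from start to end over `years` years.

    Assumption: growth is treated as smooth compound growth.
    """
    return (end / start) ** (1.0 / years) - 1.0


factor = growth_factor(start_zb, end_zb)
rate = annual_growth_rate(start_zb, end_zb, end_year - start_year)

print(f"Overall growth: {factor:.2f}x")
print(f"Implied compound annual growth: {rate:.1%}")
```

The overall factor comes out just under 44x, matching the slide, and the implied yearly growth is roughly 41% under the constant-rate assumption.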
- 12+ TBs of tweet data every day
- 25+ TBs of log data every day
- ? TBs of data every day
- 30 billion RFID tags today (1.3B in 2005)
- 4.6 billion camera phones worldwide
- 100s of millions of GPS-enabled devices sold annually
- 2+ billion people on the Web by end of 2011
- 76 million smart meters in 2009, 200M by 2014
CERN's Large Hadron Collider (LHC) generates 15 PB a year
(Image credit: Maximilien Brice, CERN)
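To put the LHC figure on the same scale as per-day volumes, the yearly number can be converted to a daily and a per-second rate. This is a rough sketch that assumes decimal units (1 PB = 1000 TB = 10^15 bytes), a 365-day year, and data produced evenly through the year; none of these assumptions come from the slide.

```python
# Figure quoted on the slide: the LHC generates 15 PB a year.
PB_PER_YEAR = 15

# Assumptions (not from the slide): decimal units and a 365-day year.
TB_PER_PB = 1000
BYTES_PER_PB = 10**15
DAYS_PER_YEAR = 365
SECONDS_PER_DAY = 24 * 60 * 60

tb_per_day = PB_PER_YEAR * TB_PER_PB / DAYS_PER_YEAR
bytes_per_second = PB_PER_YEAR * BYTES_PER_PB / (DAYS_PER_YEAR * SECONDS_PER_DAY)
mb_per_second = bytes_per_second / 10**6

print(f"{tb_per_day:.1f} TB per day")
print(f"{mb_per_second:.0f} MB per second (averaged over the year)")
```

Under these assumptions the average works out to roughly 41 TB per day.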
Variety (Complexity)
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF)
Streaming Data
You can only scan the data once
A single application can be generating/collecting
many types of data
Big Public Data (online, weather, finance, etc)
To extract knowledge, all these types of
data need to be linked together
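The note that streaming data can only be scanned once has a practical consequence: statistics must be updated incrementally as each value arrives. A minimal sketch of one common way to do this is Welford's online algorithm for the mean and variance. The `RunningStats` class and the sample readings are hypothetical and only illustrate the single-pass idea.

```python
class RunningStats:
    """Mean and population variance computed in one pass (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


def readings():
    """Hypothetical stream: a generator can be consumed only once."""
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield value


stats = RunningStats()
for value in readings():
    stats.update(value)  # each value is seen once and never stored

print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant however long the stream runs, because only three numbers are kept.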
A Single View to the Customer
[Figure: the customer at the center, linked to Social Media, Banking and Finance, Gaming, Entertain, Purchase, and Our Known History]
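Linking data types into a single customer view can be pictured as merging records from several sources on a shared key. The sketch below is an assumption-laden illustration: the three sources, their field names, and the `customer_id` key are all hypothetical, and a real system would also have to match customers who lack a common identifier.

```python
from collections import defaultdict

# Hypothetical records from three separate sources, keyed by customer_id.
banking = [
    {"customer_id": 1, "balance": 2500.0},
    {"customer_id": 2, "balance": 90.0},
]
purchases = [
    {"customer_id": 1, "item": "laptop"},
    {"customer_id": 1, "item": "headphones"},
    {"customer_id": 2, "item": "book"},
]
social = [{"customer_id": 2, "follows": ["bookstore"]}]


def build_customer_view(banking, purchases, social):
    """Merge the per-source records into one profile per customer."""
    view = defaultdict(lambda: {"balance": None, "items": [], "follows": []})
    for row in banking:
        view[row["customer_id"]]["balance"] = row["balance"]
    for row in purchases:
        view[row["customer_id"]]["items"].append(row["item"])
    for row in social:
        view[row["customer_id"]]["follows"].extend(row["follows"])
    return dict(view)


customers = build_customer_view(banking, purchases, social)
for customer_id, profile in sorted(customers.items()):
    print(customer_id, profile)
```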
Velocity (Speed)
Data is being generated fast and needs to be processed
fast
Online Data Analytics
Late decisions mean missed opportunities
Examples
E-Promotions: based on your current location, your purchase history, and what
you like, send promotions right now for the store next to you
Healthcare monitoring: sensors monitor your activities and body;
any abnormal measurements require immediate reaction
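The healthcare example above can be sketched as a check applied to each reading the moment it arrives, rather than in a later batch job. Everything specific here is an assumption made for illustration: the heart-rate thresholds, the `(timestamp, bpm)` record shape, and the `monitor` function are hypothetical and are not clinical guidance.

```python
# Assumed acceptable range for a heart-rate reading, in beats per minute.
LOW_BPM, HIGH_BPM = 50, 120


def monitor(readings, low=LOW_BPM, high=HIGH_BPM):
    """Yield an alert for each out-of-range reading as soon as it arrives."""
    for timestamp, bpm in readings:
        if bpm < low or bpm > high:
            yield (timestamp, bpm)


# Hypothetical sensor stream of (seconds since start, beats per minute).
sample = [(0, 72), (1, 75), (2, 131), (3, 74), (4, 45)]

alerts = list(monitor(sample))
for timestamp, bpm in alerts:
    print(f"t={timestamp}s: abnormal heart rate {bpm} bpm")
```

Because `monitor` is a generator, an alert is produced while later readings are still arriving, which is the point of the velocity requirement.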
Real-time/Fast Data
- Mobile devices (tracking all objects all the time)
- Social media and networks (all of us are generating data)
- Scientific instruments (collecting all sorts of data)
- Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
Some Make it 4Vs
Harnessing Big Data
OLTP: Online Transaction Processing (DBMSs)
OLAP: Online Analytical Processing (Data Warehousing)
RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
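The difference between the first two workloads can be illustrated on a small scale. The sketch below uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in; the `orders` table and its rows are hypothetical. Real OLTP and OLAP systems are separate products tuned for each workload, and RTAP is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style work: many small writes, each committed as its own transaction.
new_orders = [("north", 120.0), ("south", 80.0), ("north", 40.0), ("east", 200.0)]
for region, amount in new_orders:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)",
            (region, amount),
        )

# OLAP-style work: one read-only query that aggregates over the whole table.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(totals)
conn.close()
```

The transactional part touches one row at a time, while the analytical query scans every row to produce a summary.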
The Model Has Changed
The Model of Generating/Consuming Data has Changed
Old Model: a few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
What's driving Big Data?
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big Data Technology