01 - Introduction To Big Data Analytics PDF

The document outlines a course on big data analytics that includes 9 topics: 1. Introduction to big data analytics 2. Hadoop Ecosystem 3. MapReduce (Distributed processing) 4. Hadoop DB 5. Spark (Big data processing) 6. Pig (HLL for Data Processing) 7. Hive (Data warehouse system) 8. Hbase (Distributed database) 9. Big data use cases

Uploaded by

elamin004

Course Outlines
1. Introduction to big data analytics
2. Hadoop Ecosystem
3. MapReduce (Distributed processing)
4. Hadoop DB
5. Spark (Big data processing)
6. Pig (HLL for Data Processing)
7. Hive (Data warehouse system)
8. Hbase (Distributed database)
9. Big data use cases
Source: IBM Big Data & Analytics Course, Level (1) & (2)

THE WORLD OF DATA
[Infographic; the figures did not survive extraction. Panels: number of
emails sent every second; data consumed by households each day; video
uploaded to YouTube every minute; data processed by Google per day;
tweets sent per day; total minutes spent on Facebook each month; data
sent and received by mobile internet users; products ordered on Amazon
per second.]

Big Data Issues
• Big Data Analytics: data mining and machine learning
Large-scale machine learning, data mining and data visualization
• Big Data Computing: data center support for Analytics
Big data collection and transformation, integration and distributed
data management and computing
• Big Data Theory, Privacy & Security issues on Analytics
Big data sampling and statistical theory, Big data security and
privacy
• Big Data Science: 4th Paradigm – Analytics for Science and
Engineering
Big Data and Multi-disciplines (Bio, Chemistry, Engineering,
Social)

Characteristics of Big Data
The main characteristic of big data is its huge
volume, collected from various sources. We are
used to measuring data in gigabytes or terabytes.
However, according to various studies, the volume of big
data created so far is measured in zettabytes; one
zettabyte is equivalent to a trillion gigabytes.
Tabular Representation of various data Sizes
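The size relationship stated above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, where each unit is 1000 times the previous one; the names `UNITS`, `to_bytes` and `convert` are illustrative and not from the course material.

```python
# Decimal (SI) storage units; each one is 1000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value, unit):
    """Convert a value expressed in `unit` to bytes."""
    return value * 1000 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert a value between any two units in UNITS."""
    return to_bytes(value, from_unit) / to_bytes(1, to_unit)


# One zettabyte expressed in gigabytes: 10**21 / 10**9 = 10**12 (a trillion).
gigabytes_per_zettabyte = convert(1, "ZB", "GB")
print(f"1 ZB = {gigabytes_per_zettabyte:,.0f} GB")
```

If binary units (1024-based) were intended instead, the factor in `to_bytes` would change accordingly.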
Big data is collected and created in various
formats and from various sources. It includes structured
data as well as unstructured data such as text,
multimedia, social media and business reports.
Structured data, such as bank records, demographic data,
inventory databases, business data and product data feeds,
has a defined structure and can be stored and analyzed
using traditional data management and analysis methods.
Unstructured data includes captured content such as images,
tweets or Facebook status updates, instant messenger
conversations, blogs, video uploads, voice recordings and
sensor data. These types of data do not have any defined
pattern.
Note:
• Unstructured data is, most of the time, a reflection of human
thoughts, emotions and feelings, which are sometimes
difficult to express in exact words.
• One of the main objectives of big data is to collect all this
unstructured data and analyze it using the appropriate
technology. Data crawling, also known as web crawling, is a
popular technique that includes data mining algorithms designed
to reach the maximum depth of a page and extract useful data
worth analyzing.
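The crawling idea in the note above can be sketched as a breadth-first traversal with a depth limit. This is only an illustration under stated assumptions: `PAGES` is a hypothetical in-memory site standing in for real HTTP fetches, and `crawl` and `LinkAndTextParser` are made-up names. Only the standard-library `html.parser` module is used.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site" (URL -> HTML). A real crawler would fetch
# pages over HTTP; this stand-in keeps the sketch runnable offline.
PAGES = {
    "/": '<p>Home</p><a href="/products">Products</a><a href="/blog">Blog</a>',
    "/products": '<p>Product list</p><a href="/products/1">Item 1</a>',
    "/products/1": "<p>Item 1 details</p>",
    "/blog": '<p>Latest posts</p><a href="/">Home</a>',
}


class LinkAndTextParser(HTMLParser):
    """Collects link targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(start, max_depth):
    """Visit pages breadth-first, following links at most `max_depth` hops."""
    seen = {start}
    queue = deque([(start, 0)])
    extracted = {}
    while queue:
        url, depth = queue.popleft()
        parser = LinkAndTextParser()
        parser.feed(PAGES.get(url, ""))
        extracted[url] = parser.text  # the "useful data" kept for analysis
        if depth < max_depth:
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return extracted


for url, text in crawl("/", max_depth=2).items():
    print(url, text)
```

The `seen` set prevents revisiting pages that link back to each other, and `max_depth` bounds how far the traversal goes from the start page.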
In today's fast-paced world, speed is one of the key
drivers of business success, as time is
equivalent to money.
Expectations of quick results and quick deliverables are
pressing.

In big data, velocity is the speed or frequency at which data is
collected in various forms and from different sources for
processing.

Big data technology allows you to process real-time data,
sometimes without even capturing it in a database.

Streams of data are processed and databases are updated in
real time, using parallel processing of live streams of data.
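Processing data as it arrives, without storing the raw events first, can be illustrated with a generator that updates a running average per source. This is a single-process sketch, not the parallel processing a real streaming platform would provide; `sensor_stream` and its readings are hypothetical.

```python
def sensor_stream():
    """Hypothetical live source that yields (sensor_id, reading) events."""
    events = [("s1", 20.0), ("s2", 30.0), ("s1", 22.0), ("s2", 34.0), ("s1", 24.0)]
    for event in events:
        yield event


def running_averages(stream):
    """Update a per-sensor average on every event, keeping no raw data."""
    counts, means = {}, {}
    for sensor_id, reading in stream:
        counts[sensor_id] = counts.get(sensor_id, 0) + 1
        previous = means.get(sensor_id, 0.0)
        # Incremental mean: move the old mean toward the new reading.
        means[sensor_id] = previous + (reading - previous) / counts[sensor_id]
        yield sensor_id, means[sensor_id]


for sensor_id, mean in running_averages(sensor_stream()):
    print(sensor_id, round(mean, 2))
```

Only the count and current mean are kept per sensor, so memory use does not grow with the length of the stream.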
Data veracity refers to the quality of the data that is to be
analyzed. The quality of data depends on factors such as
where the data was collected from, how it was collected,
and how it will be analyzed.

The last V in the 5 V's of big data is value. This refers to
the value that big data can provide, and it relates
directly to what organizations can do with the collected
data.
Types of Big Data

• Structured
• Semi-structured
• Unstructured
Structured Data
Any data that can be stored, accessed and processed in a
fixed format is termed 'structured' data.

Example of Structured Data:

An 'Employee' table in a database is an example of structured data.
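An 'Employee' table of this kind can be sketched with Python's built-in `sqlite3` module. The column names and rows below are made up for illustration. The fixed schema is what lets a traditional query (here, an average per department) run directly on the data.

```python
import sqlite3

# In-memory database; the schema fixes the name and type of every column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " department TEXT NOT NULL,"
    " salary REAL NOT NULL)"
)

# Hypothetical rows, each matching the fixed format above.
rows = [
    (1, "Asha", "Finance", 5200.0),
    (2, "Omar", "Engineering", 6100.0),
    (3, "Lina", "Engineering", 5900.0),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)

# Because the structure is known in advance, standard SQL analysis applies.
query = "SELECT department, AVG(salary) FROM employee GROUP BY department"
avg_by_department = dict(conn.execute(query))
print(avg_by_department)
```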
Unstructured Data
Any data with an unknown form or structure is classified as
unstructured data.
A typical example of unstructured data is a heterogeneous
data source containing a combination of simple text files,
images, videos, etc.
• Example of Unstructured Data
The output returned by 'Google Search'
Semi-structured Data
Semi-structured data can contain both forms of data.
Semi-structured data looks structured in form, but it
is not actually defined with a table.

An example of semi-structured data is data represented in an
XML file.
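The XML case can be made concrete with a short sketch using the standard-library `xml.etree.ElementTree` module. The document below is hypothetical. Both records are tagged consistently, but they do not share the same fields, so they would not fit a single fixed table without changes.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: tagged (structured in form), but each record
# carries a different set of fields, so there is no fixed table layout.
XML_DOC = """
<employees>
  <employee id="1">
    <name>Asha</name>
    <email>asha@example.com</email>
  </employee>
  <employee id="2">
    <name>Omar</name>
    <phone>555-0100</phone>
    <skills><skill>SQL</skill><skill>Spark</skill></skills>
  </employee>
</employees>
"""

root = ET.fromstring(XML_DOC)
records = []
for employee in root.findall("employee"):
    record = {"id": employee.get("id")}
    for child in employee:
        if len(child):  # element with nested children, e.g. <skills>
            record[child.tag] = [item.text for item in child]
        else:
            record[child.tag] = child.text
    records.append(record)

for record in records:
    print(record)
```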
Four Main Types of Data Structures
[Figure: examples of each type. Recoverable labels: Structured Data;
Unstructured Data (The Red Wheelbarrow, by William Carlos Williams);
Semi-Structured Data]
Traditional vs. Big Data approaches to using data

Source: IBM
Data Processing
- Batch-based stored data processing
- Real-time data-stream processing
Batch Based Stored Data Processing
• Process large volumes of data
• Can be periodic or one-time processing
• Batch results are produced after data is collected,
entered and processed
• Separate techniques or programs for input,
processing and output
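The points above can be sketched as a tiny batch job whose input, processing and output steps are separate functions that run only after the data has been collected. The CSV content and function names are hypothetical, chosen only to illustrate the pattern.

```python
import csv
import io

# Hypothetical batch input: a day's sales, collected before processing starts.
RAW_CSV = """store,amount
north,120.50
south,80.00
north,40.25
south,19.75
"""


def read_input(text):
    """Input step: parse the collected records."""
    return list(csv.DictReader(io.StringIO(text)))


def process(records):
    """Processing step: total the amounts per store."""
    totals = {}
    for record in records:
        store = record["store"]
        totals[store] = totals.get(store, 0.0) + float(record["amount"])
    return totals


def write_output(totals):
    """Output step: render the batch result as a short report."""
    lines = [f"{store}: {total:.2f}" for store, total in sorted(totals.items())]
    return "\n".join(lines)


report = write_output(process(read_input(RAW_CSV)))
print(report)
```

The same job could be run periodically (for example once per day) or once, which matches the second bullet above.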
Real Time Data Processing (Streaming Data)
• Real-time data (RTD) refers to information
that is processed, consumed, and/or acted
upon immediately after it's generated.
• Examples: wearable devices, stock markets, weather
forecasting, monitoring and safety systems,
etc.
Tools and Techniques for Analyzing Big Data
The choice of tools is mostly driven by:
• Who is going to use the data
• The business requirement for a particular
scenario
Where to store data?
How to get data in and out?
How to manage access of data?
How do I process the data?
How do I execute machine learning from the data?
How do I tell people my analytics results?
Apache (HTTP server): one of the oldest and most widely used web servers,
available on most Linux distributions and on macOS.

It serves the files that reside in its HTTP root directory as web pages.

Case Study: Social Media Analytics

People's history on the internet (what they buy, what they search for) gives a
rough view of their attitude toward a product.
Moreover, this output can be used to study:
customer satisfaction, churn prediction, financial performance, stock performance.
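A very rough version of this idea is to count positive and negative words in posts about a product. This sketch is only illustrative: the word lists and posts are hypothetical, and real social media analytics would normally rely on trained models or a curated sentiment lexicon rather than a handful of keywords.

```python
# Hypothetical word lists, used only to illustrate the scoring idea.
POSITIVE = {"love", "great", "good", "fast"}
NEGATIVE = {"hate", "bad", "slow", "broken"}


def attitude_score(post):
    """Return (number of positive words) - (number of negative words)."""
    words = [word.strip(".,!?").lower() for word in post.split()]
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def product_attitude(posts):
    """Average score over all posts; 0.0 when there are no posts."""
    scores = [attitude_score(post) for post in posts]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical posts mentioning one product.
posts = [
    "I love this phone, great camera!",
    "Battery is bad and charging is slow.",
    "Good value.",
]
print(round(product_attitude(posts), 2))
```

A score above zero suggests a mostly positive attitude, and a score below zero a mostly negative one. Such a number could then feed the customer satisfaction or churn studies listed above.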
PREPARE YOURSELF TO SURF THE DATA ERA!
