01 - Introduction To Big Data Analytics PDF

The document outlines a course on big data analytics that includes 9 topics: 1. Introduction to big data analytics 2. Hadoop Ecosystem 3. MapReduce (Distributed processing) 4. Hadoop DB 5. Spark (Big data processing) 6. Pig (HLL for Data Processing) 7. Hive (Data warehouse system) 8. Hbase (Distributed database) 9. Big data use cases

Uploaded by

elamin004
Copyright
© © All Rights Reserved
Course Outlines
1. Introduction to big data analytics
2. Hadoop Ecosystem
3. MapReduce (Distributed processing)
4. Hadoop DB
5. Spark (Big data processing)
6. Pig (HLL for Data Processing)
7. Hive (Data warehouse system)
8. Hbase (Distributed database)
9. Big data use cases
Source: IBM Big Data & Analytics Course, Level (1) & (2)

[Figure: The World of Data. Infographic panels: number of emails sent
every second; data consumed by households each day; video uploaded to
YouTube every minute; data processed by Google per day; tweets sent per
day; total minutes spent on Facebook each month; data sent and received
by mobile internet users; products ordered on Amazon per second.]


Big Data Issues
• Big Data Analytics: data mining and machine learning
Large-scale machine learning, data mining and data visualization
• Big Data Computing: data center support for Analytics
Big data collection and transformation, integration and distributed
data management and computing
• Big Data Theory, Privacy & Security issues on Analytics
Big data sampling and statistical theory, Big data security and
privacy
• Big Data Science: 4th Paradigm – Analytics for Science and
Engineering
Big Data and Multi-disciplines (Bio, Chemistry, Engineering,
Social)

Characteristics of Big Data
Volume: The main characteristic of big data is the huge
volume collected from various sources. We are
used to measuring data in gigabytes or terabytes.
However, according to various studies, the volume of big data
created so far is measured in zettabytes; one zettabyte is
equivalent to a trillion gigabytes.
Tabular Representation of various data Sizes
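The data sizes mentioned above can be tabulated with a short script. This is a minimal sketch that assumes decimal (SI) units, where each step is a factor of 1,000; the names `UNITS` and `convert` are illustrative and not part of the course material.

```python
# Decimal (SI) storage units expressed in bytes (assumption: powers of 1000).
UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "PB": 10**15, "EB": 10**18, "ZB": 10**21,
}

def convert(value, from_unit, to_unit):
    """Convert a quantity between two of the units listed in UNITS."""
    return value * UNITS[from_unit] / UNITS[to_unit]

# Print each unit with its size in bytes.
for unit, size_in_bytes in UNITS.items():
    print(f"1 {unit} = 10^{len(str(size_in_bytes)) - 1} bytes")

# One zettabyte expressed in gigabytes (a trillion, i.e. 10^12).
gb_per_zb = convert(1, "ZB", "GB")
print(f"1 ZB = {gb_per_zb:,.0f} GB")
```

With binary units (powers of 1,024) the figures would differ slightly; the decimal convention is used here because it matches the "trillion gigabytes" statement.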
Variety: Big data is collected and created in various
formats and from various sources. It includes structured
data as well as unstructured data such as text,
multimedia, social media, business reports, etc.
Structured data such as bank records, demographic data,
inventory databases, business data, and product data feeds
has a defined structure and can be stored and analyzed
using traditional data management and analysis methods.
Unstructured data includes captured content such as images, tweets
or Facebook status updates, instant messenger
conversations, blogs, video uploads, voice recordings, and
sensor data. These types of data do not have any defined
pattern.
Note:
• Unstructured data is, most of the time, a reflection of human
thoughts, emotions and feelings, which can be
difficult to express in exact words.
• One of the main objectives of big data is to collect all this
unstructured data and analyze it using the appropriate
technology. Data crawling, also known as web crawling, is a
popular technology that uses data mining algorithms designed to
reach the maximum depth of a page and extract useful data
worth analyzing.
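The extraction step of crawling can be illustrated with Python's standard `html.parser` module. This is only a sketch under stated assumptions: `SAMPLE_PAGE` is a made-up page, the class name `LinkAndTextExtractor` is hypothetical, and a real crawler would also fetch pages over the network and follow the collected links to deeper levels.

```python
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    """Collect hyperlinks and visible text from a single HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []  # href values of <a> tags, candidates for the next crawl level
        self.text = []   # non-empty text fragments worth analyzing

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)

# Hypothetical page content used only for illustration.
SAMPLE_PAGE = """
<html><body>
  <h1>Product review</h1>
  <p>Great battery life. <a href="/reviews/2">Next review</a></p>
</body></html>
"""

extractor = LinkAndTextExtractor()
extractor.feed(SAMPLE_PAGE)
print(extractor.links)
print(extractor.text)
```

The collected text fragments are the unstructured data; the collected links tell the crawler where to go next.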
Velocity: In today's fast-paced world, speed is one of the key
drivers for success in business, as time is
equivalent to money.
Expectations of quick results and quick deliverables are
pressing.

In big data, velocity is the speed or frequency at which data is
collected in various forms and from different sources for
processing.

Big data technology allows you to process real-time data,
sometimes without even capturing it in a database.

Streams of data are processed and databases are updated in
real time, using parallel processing of live streams of data.
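Processing data as it arrives, without storing it first, can be sketched with a Python generator. The example assumes a hypothetical `sensor_feed` source and runs in a single thread; production stream processors distribute this work across many machines in parallel.

```python
def running_mean(stream):
    """Yield the mean of all readings seen so far, one value per reading.

    Only a count and a running total are kept, so no reading is stored.
    """
    count = 0
    total = 0.0
    for reading in stream:
        count += 1
        total += reading
        yield total / count

def sensor_feed():
    """Hypothetical live source; a real one would read from a socket or queue."""
    for value in [20.0, 22.0, 21.0, 25.0]:
        yield value

# Each mean is available as soon as its reading arrives.
means = list(running_mean(sensor_feed()))
print(means)
```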
Veracity: Data veracity refers to the quality of the data that is to be
analyzed. The quality of data depends on certain
factors, such as where the data has been collected from,
how it was collected, and how it will be analyzed.

Value: The last V in the 5 V's of big data is value. This refers to
the value that big data can provide, and it relates
directly to what organizations can do with that collected
data.
Types of Big Data

•Structured
•Semi-structured
•Unstructured
Structured Data
Any data that can be stored, accessed and processed in a
fixed format is termed 'structured' data.

Examples of Structured Data:


An 'Employee' table in a database is an example of Structured Data
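An 'Employee' table of this kind can be sketched with Python's built-in `sqlite3` module. The column names and rows below are invented for illustration; the point is that every row follows the same fixed schema, so it can be queried with ordinary SQL.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "employee_id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "department TEXT NOT NULL, "
    "salary REAL NOT NULL)"
)

# Hypothetical rows: each one has exactly the columns the schema defines.
rows = [
    (1, "Alice", "Finance", 65000.0),
    (2, "Bob", "Admin", 50000.0),
    (3, "Carol", "Finance", 70000.0),
]
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?)", rows)

# The fixed format makes filtering and sorting straightforward.
finance = conn.execute(
    "SELECT name, salary FROM Employee "
    "WHERE department = ? ORDER BY salary DESC",
    ("Finance",),
).fetchall()
print(finance)
conn.close()
```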
Unstructured Data
Any data whose form or structure is unknown is classified as
unstructured data.
A typical example of unstructured data is a heterogeneous
data source containing a combination of simple text files,
images, videos, etc.
• Examples of Unstructured Data
The output returned by 'Google Search'
Semi-structured Data
Semi-structured data can contain both forms of data.
Semi-structured data appears structured in form, but it
is not actually defined by a table.

An example of semi-structured data is data represented in an
XML file.
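The following sketch parses a small XML document with Python's standard `xml.etree.ElementTree` module. The document and its tag names are made up for illustration. Note that the two records carry different fields, which a fixed table schema would not allow.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: tags give structure, but records need not share fields.
XML_DOC = """
<employees>
  <employee id="1">
    <name>Alice</name>
    <department>Finance</department>
  </employee>
  <employee id="2">
    <name>Bob</name>
    <phone>555-0100</phone>
  </employee>
</employees>
""".strip()

root = ET.fromstring(XML_DOC)
records = []
for emp in root.findall("employee"):
    record = {"id": emp.get("id")}
    # Each child element becomes a field; the set of fields varies per record.
    for child in emp:
        record[child.tag] = child.text
    records.append(record)

print(records)
```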
Four Main Types of Data Structures
Structured Data
Unstructured Data (example: "The Red Wheelbarrow", by William Carlos Williams)

Semi-Structured Data
Traditional vs. Big Data approaches to using data

Source: IBM
Data Processing
- Batch-based stored data processing
- Real-time data-stream processing
Batch Based Stored Data
Processing
• Process large volumes of data
• Can be periodic or one-time processing
• Batch results are produced after data is collected,
entered and processed
• Separate techniques or programs for input,
processing and output
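The separation of input, processing and output in a batch job can be sketched as three functions run in sequence over data that has already been collected. The CSV content and function names are hypothetical; a real batch job would read from and write to files or a database.

```python
import csv
import io

def read_input(source):
    """Input stage: load the whole collected data set before processing."""
    return list(csv.DictReader(source))

def process(records):
    """Processing stage: total the quantity ordered per product."""
    totals = {}
    for record in records:
        product = record["product"]
        totals[product] = totals.get(product, 0) + int(record["quantity"])
    return totals

def write_output(totals):
    """Output stage: format the results as CSV lines, sorted by product."""
    return "\n".join(f"{product},{qty}" for product, qty in sorted(totals.items()))

# Hypothetical data collected earlier, standing in for a stored file.
COLLECTED = io.StringIO("product,quantity\nbook,2\npen,5\nbook,1\n")

report = write_output(process(read_input(COLLECTED)))
print(report)
```

Results appear only after the full data set has passed through all three stages, which is the defining trait of batch processing.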
Real Time Data Processing
(Streaming Data)
• Real-time data (RTD) refers to information
that is processed, consumed, and/or acted
upon immediately after it is generated.
• Examples: wearable devices, stock markets, weather
forecasting, monitoring and safety systems,
etc.
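Acting on data immediately after it is generated can be sketched for the wearable-device case. The readings, the threshold of 120 bpm and the function name `monitor` are assumptions made for illustration only.

```python
def monitor(readings, limit):
    """Yield an alert as soon as a reading exceeds the limit."""
    for timestamp, bpm in readings:
        if bpm > limit:
            yield f"t={timestamp}s: heart rate {bpm} bpm exceeds {limit}"

# Hypothetical (seconds, beats per minute) pairs from a wearable device.
READINGS = [(0, 72), (1, 75), (2, 131), (3, 80)]

# Each reading is checked when it arrives; nothing waits for the full data set.
alerts = list(monitor(iter(READINGS), limit=120))
print(alerts)
```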
Tools and Techniques for analyzing
big Data
The choice of tools is mostly driven by:
• Who is going to use the data
• The business requirement for a particular
scenario
Where to store data?
How to get data in and out?
How to manage access of data?
How do I process the data?
How do I execute machine learning from the data?
How do I tell people my analytics results?
Apache (HTTP server): one of the oldest and most widely used web servers,
available on most Linux distributions and on macOS.

It serves web pages from the files that reside in its HTTP document root directory.


Case Study: Social Media Analytics

People's history on the internet, such as what they buy and what they search for,
gives a rough view of their attitude toward a product.
Moreover, this output can be used to study
customer satisfaction, churn prediction, financial performance, and stock performance.
PREPARE YOURSELF TO SURF THE DATA ERA!
