Lesson 3 Big Data Overview

The document provides an introduction to big data and data science. It discusses the characteristics of big data, including the huge volumes, complex data types and structures, and speed of data creation. It differentiates common data structures like structured, semi-structured, quasi-structured and unstructured data. The document also examines common data repositories used by data scientists and the state of analytics practice in organizations. It provides examples of how different industries like credit cards, mobile phones and social media analyze large amounts of user data.

Uploaded by

Neerom Baldemoro

Introduction to Data Science
BIG DATA OVERVIEW
Module Objectives
At the end of this module, students must be able to:
1. discuss Big Data and its characteristics;
2. differentiate the data structures;
3. distinguish the different repositories used by data scientists;
4. examine the state of the practice of analytics;
5. differentiate between business intelligence and data science; and
6. examine the current analytical architecture and its problems.
Big Data
 Data is created constantly, and at an ever-increasing rate. Mobile phones, social
media, imaging technologies to determine a medical diagnosis—all these and
more create new data, and that must be stored somewhere for some purpose.

 Merely keeping up with this huge influx of data is difficult, but substantially more
challenging is analyzing vast amounts of it, especially when it does not conform
to traditional notions of data structure, to identify meaningful patterns and extract
useful information.

 These challenges of the data deluge present the opportunity to transform
business, government, science, and everyday life.

*Text taken from Data Science and Big Data Analytics by EMC Education Services
Big Data
Several industries have led the way in developing their ability to gather and exploit data:
 Credit card companies monitor every purchase their customers make and can
identify fraudulent purchases with a high degree of accuracy using rules derived by
processing billions of transactions.
 Mobile phone companies analyze subscribers’ calling patterns to determine, for
example, whether a caller’s frequent contacts are on a rival network. If that rival
network is offering an attractive promotion that might cause the subscriber to defect,
the mobile phone company can proactively offer the subscriber an incentive to
remain in her contract.
 For companies such as LinkedIn and Facebook, data itself is their primary product.
The valuations of these companies are heavily derived from the data they gather and
host, which contains more and more intrinsic value as the data grows.

Big Data
Three attributes stand out as defining Big Data characteristics:

 Huge volume of data: Rather than thousands or millions of rows, Big Data can be
billions of rows and millions of columns.

 Complexity of data types and structures: Big Data reflects the variety of new
data sources, formats, and structures, including digital traces being left on the web
and other digital repositories for subsequent analysis.

 Speed of new data creation and growth: Big Data can describe high-velocity
data, with rapid data ingestion and near-real-time analysis.
Big Data
Although the volume of Big Data tends to attract the most attention, generally
the variety and velocity of the data provide a more apt definition of Big Data.

Due to its size or structure, Big Data cannot be efficiently analyzed using only
traditional databases or methods. Big Data problems require new tools and
technologies to store, manage, and realize the business benefit. These new
tools and technologies enable creation, manipulation, and management of large
datasets and the storage environments that house them.

Big Data
Big Data Definition:
“Big Data is data whose scale, distribution, diversity, and/or timeliness require
the use of new technical architectures and analytics to enable insights that
unlock new sources of business value.”
-McKinsey Global Report, 2011

McKinsey’s definition of Big Data implies that organizations will need new data
architectures and analytic sandboxes, new tools, new analytical methods, and
an integration of multiple skills into the new role of the data scientist.

Big Data
 Social media and genetic sequencing are among the fastest-growing sources
of Big Data and examples of untraditional sources of data being used for
analysis.

 For example, in 2012 Facebook users posted 700 status updates per second
worldwide, which can be leveraged to deduce latent interests or political views
of users and show relevant ads. For instance, an update in which a woman
changes her relationship status from “single” to “engaged” would trigger ads
on bridal dresses, wedding planning, or name-changing services.

Big Data

 Another example comes from genomics. Genetic sequencing and human genome mapping
provide a detailed understanding of genetic makeup and lineage.

 The health care industry is looking toward these advances to help predict which
illnesses a person is likely to get in his lifetime and take steps to avoid these
maladies or reduce their impact through the use of personalized medicine and treatment.
Data Structures
 Big data can come in multiple forms, including structured and non-structured
data such as financial data, text files, multimedia files, and genetic mappings.

 Contrary to much of the traditional data analysis performed by organizations,
most of the Big Data is unstructured or semi-structured in nature, which
requires different techniques and tools to process and analyze.

 Distributed computing environments and massively parallel processing (MPP)
architectures that enable parallelized data ingest and analysis are the
preferred approach to process such complex data.
Data Structures
 The following shows four types of data structures, with 80–90% of future data
growth coming from non-structured data types.

 Although analyzing structured data tends to be the most familiar technique, a
different technique is required to meet the challenges to analyze semi-structured
data (shown as XML), quasi-structured (shown as a clickstream), and unstructured data.
Structured Data
 Structured data: Data containing a defined data type, format, and structure (for
example, transaction data, online analytical processing [OLAP] data cubes, traditional
RDBMS, CSV files, and even simple spreadsheets).
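
As a minimal sketch of what a defined data type, format, and structure can look like in practice, the Python snippet below reads a small CSV of transactions and converts every column to one declared type. The column names, the schema, and the sample rows are hypothetical and chosen only for illustration.

```python
import csv
import io
from datetime import date

# Hypothetical transaction data; the columns and values are illustrative only.
RAW_CSV = """transaction_id,customer_id,amount,purchase_date
1001,C-17,25.50,2014-03-01
1002,C-17,310.00,2014-03-02
1003,C-42,12.75,2014-03-02
"""

# Each column has exactly one declared type, which is what makes the data "structured".
SCHEMA = {
    "transaction_id": int,
    "customer_id": str,
    "amount": float,
    "purchase_date": date.fromisoformat,
}


def load_transactions(text):
    """Parse CSV text into a list of dicts whose values follow SCHEMA."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: SCHEMA[name](value) for name, value in row.items()}
        for row in reader
    ]


transactions = load_transactions(RAW_CSV)
total = sum(t["amount"] for t in transactions)
print(len(transactions), total)
```

Because every row has the same fields and types, simple operations such as summing a column need no per-row special cases.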

Semi-structured Data
 Semi-structured data: Textual data files with a discernible pattern that enables
parsing (such as Extensible Markup Language [XML] data files that are self-describing
and defined by an XML schema).
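
The following Python sketch illustrates how the self-describing tags of an XML file make parsing possible even when records do not all carry the same fields. The document, tag names, and attribute names are hypothetical. The sketch uses the standard library's `xml.etree.ElementTree`, which parses XML but does not validate it against an XML schema, so schema validation is outside what is shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; tag and attribute names are illustrative only.
RAW_XML = """
<orders>
  <order id="1" currency="USD">
    <customer>Ana</customer>
    <total>19.99</total>
  </order>
  <order id="2" currency="USD">
    <customer>Ben</customer>
    <total>5.00</total>
    <note>Gift wrap</note>
  </order>
</orders>
"""


def parse_orders(xml_text):
    """Turn each <order> element into a dict, tolerating optional elements."""
    root = ET.fromstring(xml_text.strip())
    orders = []
    for order in root.findall("order"):
        orders.append({
            "id": int(order.get("id")),
            "customer": order.findtext("customer"),
            "total": float(order.findtext("total")),
            # Optional element: findtext returns None when <note> is absent.
            "note": order.findtext("note"),
        })
    return orders


orders = parse_orders(RAW_XML)
print(orders)
```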

Quasi-structured Data
 Quasi-structured data: Textual data with erratic data formats that can be formatted
with effort, tools, and time (for instance, web clickstream data that may contain
inconsistencies in data values and formats).

 Example: Visiting three websites connected to a keyword produces clickstream data
that can be parsed and mined by data scientists to discover usage patterns and
relationships pertaining to areas of interest.
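
To illustrate the effort involved, the Python sketch below normalizes a few clickstream entries whose formats are inconsistent (mixed casing, a missing scheme, different parameter names for the search keyword). The URLs, the list of keyword parameter names, and the `normalize` helper are all hypothetical; real clickstream logs will need rules specific to their sources.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical clickstream entries with inconsistent casing, schemes, and parameters.
CLICKSTREAM = [
    "https://search.example.com/search?q=Data+Science",
    "HTTP://Search.Example.com/results?query=data%20science&page=2",
    "www.example.org/wiki/Data_science",
    "https://video.example.net/watch?v=abc123",
]

# Query parameter names assumed to carry a search keyword.
KEYWORD_PARAMS = ("q", "query")


def normalize(entry):
    """Return (host, path, keyword) for one raw clickstream entry."""
    # Entries without a scheme would otherwise be parsed as a bare path.
    if "://" not in entry:
        entry = "http://" + entry
    parts = urlparse(entry)
    params = parse_qs(parts.query)
    keyword = None
    for name in KEYWORD_PARAMS:
        if name in params:
            keyword = params[name][0].lower()
            break
    return parts.netloc.lower(), parts.path, keyword


visits = [normalize(entry) for entry in CLICKSTREAM]
for visit in visits:
    print(visit)
```

After normalization, the first two entries share the same host and keyword, which is the kind of consistency needed before usage patterns can be mined.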

Unstructured Data
 Unstructured data: Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
Data Repositories

 As data needs grew, so did more scalable data warehousing solutions. These
technologies enabled data to be managed centrally, providing benefits of security,
failover, and a single repository where users could rely on getting an “official”
source of data for financial reporting or other mission-critical tasks.

 The next table summarizes the characteristics of the data repositories:

State of Practice in Analytics
 Current business problems provide many opportunities for organizations to
become more analytical and data driven, as shown in the following table:
State of Practice in Analytics
 The previous table outlines four categories of common business problems that organizations
contend with where they have an opportunity to leverage advanced analytics to create
competitive advantage. Rather than only performing standard reporting on these areas,
organizations can apply advanced analytical techniques to optimize processes and derive
more value from these common tasks.
 The first three examples do not represent new problems. Organizations have been trying to reduce
customer churn, increase sales, and cross-sell customers for many years. What is new is the opportunity
to fuse advanced analytical techniques with Big Data to produce more impactful analyses for these
traditional problems.
 The last example portrays emerging regulatory requirements. Many compliance and
regulatory laws have been in existence for decades, but additional requirements are
added every year, which represent additional complexity and data requirements for
organizations. Laws related to anti-money laundering (AML) and fraud prevention
require advanced analytical techniques to comply with and manage properly.
Business Intelligence vs Data Science
 Although much is written generally about analytics, it is important to distinguish
between Business Intelligence (BI) and Data Science. As shown in the figure on the
right, there are several ways to compare these groups of analytical techniques.

 One way to evaluate the type of analysis being performed is to examine the time
horizon and the kind of analytical approaches being used.
Business Intelligence vs Data Science
 BI tends to provide reports, dashboards, and queries on business questions
for the current period or in the past. BI systems make it easy to answer
questions related to quarter-to-date revenue and progress toward quarterly
targets, and to understand how much of a given product was sold in a prior
quarter or year.

 These questions tend to be closed-ended and explain current or past
behavior, typically by aggregating historical data and grouping it in some way.
BI provides hindsight and some insight and generally answers questions
related to “when” and “where” events occurred.
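
A BI-style question of this kind can be sketched in a few lines of Python: historical records are grouped by quarter and product, then summed. The sales records and the function name below are hypothetical and serve only to illustrate the aggregate-and-group pattern.

```python
from collections import defaultdict

# Hypothetical historical sales records: (quarter, product, units_sold).
SALES = [
    ("2013-Q4", "widget", 120),
    ("2013-Q4", "gadget", 80),
    ("2014-Q1", "widget", 150),
    ("2014-Q1", "widget", 30),
    ("2014-Q1", "gadget", 95),
]


def units_by_quarter_and_product(records):
    """Aggregate units sold for every (quarter, product) group."""
    totals = defaultdict(int)
    for quarter, product, units in records:
        totals[(quarter, product)] += units
    return dict(totals)


report = units_by_quarter_and_product(SALES)
# Closed-ended question: how many widgets were sold in 2014-Q1?
print(report[("2014-Q1", "widget")])
```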

Business Intelligence vs Data Science
 By comparison, Data Science tends to use disaggregated data in a more
forward-looking, exploratory way, focusing on analyzing the present and
enabling informed decisions about the future.

 Rather than aggregating historical data to look at how many of a given
product sold in the previous quarter, a team may employ Data Science
techniques such as time series analysis to forecast future product sales and
revenue more accurately than extending a simple trend line.

 In addition, Data Science tends to be more exploratory in nature and may use
scenario optimization to deal with more open-ended questions. This approach
provides insight into current activity and foresight into future events, while
generally focusing on questions related to “how” and “why” events occur.
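
The contrast between extending a trend line and using a time series technique can be sketched as follows. The quarterly sales series is synthetic (linear growth plus a repeating seasonal pattern), and the seasonal adjustment shown is a deliberately simple additive approach chosen for illustration; it is only one of many time series methods and is not taken from the source text.

```python
# Synthetic quarterly unit sales: linear growth plus a repeating seasonal pattern.
SEASONAL = [20, -10, -30, 20]
history = [100 + 10 * t + SEASONAL[t % 4] for t in range(8)]
actual_next = 100 + 10 * 8 + SEASONAL[8 % 4]  # the value we try to forecast


def fit_trend(values):
    """Ordinary least-squares line through (0, y0), (1, y1), ..."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope  # (intercept, slope)


def trend_forecast(values, step):
    """Forecast by extending the fitted trend line to time index `step`."""
    intercept, slope = fit_trend(values)
    return intercept + slope * step


def seasonal_forecast(values, step, period=4):
    """Trend line plus the average residual seen in the same season."""
    intercept, slope = fit_trend(values)
    residuals = [y - (intercept + slope * t) for t, y in enumerate(values)]
    same_season = residuals[step % period::period]
    return intercept + slope * step + sum(same_season) / len(same_season)


t_next = len(history)
trend_error = abs(trend_forecast(history, t_next) - actual_next)
seasonal_error = abs(seasonal_forecast(history, t_next) - actual_next)
print(round(trend_error, 2), round(seasonal_error, 2))
```

On this synthetic series the seasonally adjusted forecast lands closer to the actual value than the plain trend line, because the trend line ignores the recurring quarterly pattern.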
Business Intelligence vs Data Science
 Where BI problems tend to require highly structured data organized in rows
and columns for accurate reporting, Data Science projects tend to use many
types of data sources, including large or unconventional datasets.

 Depending on an organization’s goals, it may choose to embark on a BI
project if it is doing reporting, creating dashboards, or performing simple
visualizations, or it may choose Data Science projects if it needs to do a more
sophisticated analysis with disaggregated or varied datasets.
Current Analytical Architecture
 Data Science projects need workspaces that are purpose-built for
experimenting with data, with flexible and agile data architectures.

 Most organizations still have data warehouses that provide excellent support
for traditional reporting and simple data analysis activities but unfortunately
have a more difficult time supporting more robust analyses.
Current Analytical Architecture
 The following figure shows a typical data architecture and several of the
challenges it presents to data scientists and others trying to do advanced analytics.

 We will examine the data flow to the Data Scientist and how this individual fits
into the process of getting data to analyze on projects.
Current Analytical Architecture
1. Data sources are loaded into the data warehouse, where data needs to be well
understood, structured, and normalized with the appropriate data type definitions.
This kind of centralization enables security, backup, and failover of highly
critical data.

2. Additional local systems in the form of departmental warehouses and local data
marts are created by business users to accommodate their need for flexible analysis.
Current Analytical Architecture
3. Once in the data warehouse, data is read by additional applications across the
enterprise for BI and reporting purposes. These are high-priority operational
processes getting critical data feeds from the data warehouses and repositories.

4. At the end of this workflow, analysts get data provisioned for their downstream
analytics.
Current Analytical Architecture

 The typical data architectures just described are designed for storing and
processing mission-critical data, supporting enterprise applications, and
enabling corporate reporting activities.

 Although reports and dashboards are still important for organizations, most
traditional data architectures inhibit data exploration and more sophisticated
analysis.

Problems in Traditional Data Architecture
 High-value data is hard to reach and leverage, and predictive analytics and data mining
activities are last in line for data. Because enterprise data warehouses (EDWs) are designed
for central data management and reporting, those wanting data for analysis are generally
prioritized only after operational processes.

 Data moves in batches from the EDW to local analytical tools. This workflow means that
data scientists are limited to performing in-memory analytics, which restricts the size
of the datasets they can use. As such, analysis may be subject to constraints of
sampling, which can skew model accuracy.

 Data Science projects will remain isolated and ad hoc, rather than centrally managed. The
implication of this isolation is that the organization can never harness the power of advanced
analytics in a scalable way, and Data Science projects will exist as nonstandard initiatives,
which are frequently not aligned with corporate business goals or strategy.
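
The sampling constraint described above can be illustrated with a small Python sketch. It assumes a hypothetical batch extract ordered by time in which fraudulent transactions appear only in recent records, and a hypothetical memory limit on the local tool. The data, the limit, and the names are all invented for illustration.

```python
# Hypothetical extract of 10,000 transactions ordered by time; label 1 marks fraud,
# and all 100 fraudulent records fall in the most recent part of the extract.
FULL = [0] * 9000 + [1] * 100 + [0] * 900

MEMORY_LIMIT = 2000  # assumed number of rows the local in-memory tool can hold


def fraud_rate(labels):
    """Fraction of records labeled as fraud."""
    return sum(labels) / len(labels)


# Loading only the first rows that fit in memory is convenient but biased.
head_sample = FULL[:MEMORY_LIMIT]

print(fraud_rate(FULL), fraud_rate(head_sample))
```

In this contrived case the sample contains no fraud at all, so any model built from it would have nothing to learn about the pattern that matters most.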