
BIG DATA

Introduction

Lecturer: Lucrezia Noli


Lesson 1
LUCREZIA NOLI

▪ Università Commerciale Luigi Bocconi
  ▪ Bachelor in Finance
  ▪ Master in Economics of Innovation & Technology

▪ Master Thesis: «Machine Learning Techniques to Investigate the ALS Disease»
  ▪ Won second prize of the PRISLA competition

▪ Current roles
  ▪ Big Data Scientist at Dataskills
  ▪ Lecturer SEDIN - Università Bocconi
  ▪ Lecturer Overnet

▪ Previous roles
  ▪ Business Development Manager at Metail London
The Data Science process

The process runs from Data through Model to Predictions:

• Data: quality, integration, transformation & creation
• Model: supervised vs. unsupervised; rule-based or black box
• Predictions: parameter optimization, model performance, reports

[Diagram: the data layer is fed by OLAP and several Data Marts]
The foundation of Data Analytics

[Diagram: the BI stack, bottom to top: Data Quality → Data Sources Analysis → Business Rules → ETL → Data warehouse → BI Tools & Reporting]

Business Intelligence is central for any company wanting to use data in an innovative way. This is because BI certifies that the data to be analyzed are correct.
What is Business Intelligence?

[Diagram: the BI pipeline, left to right]

• DATA SOURCES: internal (sensors/PLC, CRM, ERP, production DB) and external (master data management, data quality), plus BIG DATA feeds
• CLEANSING & STRUCTURE: ETL moves data into a STAGING AREA, where it is cleaned and structured
• DATA WAREHOUSE: a second ETL step loads the certified store, organized into Data Marts with an OLAP layer for analysis
• CLIENT TOOLS: analysis, reporting tools, and KPIs
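To make the ETL step concrete, here is a minimal sketch in Python: an in-memory CSV stands in for an operational source, SQLite for the warehouse, and one simple business rule for the cleansing step. All names, data, and the rule itself are illustrative assumptions, not part of the original slides.

```python
import csv
import io
import sqlite3

# Minimal ETL sketch. A small in-memory CSV stands in for an
# operational source; an in-memory SQLite table stands in for
# the data warehouse. Column names and the business rule are
# illustrative assumptions.

RAW = """customer,amount
C-102,14.50
C-871,89.00
C-102,-3.00
"""

def extract(source):
    # Extract: read rows from the source system.
    return csv.DictReader(io.StringIO(source))

def transform(rows):
    # Transform: apply a business rule (drop non-positive amounts)
    # and cast types so the warehouse receives clean data.
    for row in rows:
        if float(row["amount"]) > 0:
            yield row["customer"], float(row["amount"])

def load(rows):
    # Load: write the cleansed rows into the warehouse table.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    print(con.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone())

load(transform(extract(RAW)))  # -> (2, 103.5): the bad row was filtered out
```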
Machine learning

GENERALITY

Through Machine Learning a computer can solve various tasks without being given all the parameters that characterize any of them specifically.

MACHINE LEARNING: the process through which a machine learns how to complete a task without being specifically programmed to do so.

LEARNING

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
Tom M. Mitchell
MACHINE LEARNING - Algorithms

In 2012 DeepMind created an artificial intelligence called Deep Q Learner that can play any game of the Atari package, famous in the 70s, through ML techniques called «Deep Reinforcement Learning».

The algorithm doesn't specify the characteristics of the game being played, but simply enables the machine to learn how to play by itself.

LEARNING HOW TO PLAY FLAPPY BIRD
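Deep Reinforcement Learning pairs this trial-and-error learning with a neural network that estimates the value of each action. As a hedged illustration, here is the underlying tabular Q-learning rule on a toy one-dimensional environment (not Atari; every detail of the environment is invented for the sketch):

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch: the update rule that Deep Q-Learning
# approximates with a neural network. The environment is a toy
# 1-D walk over states 0..4 with a reward at the right edge.

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # learning rate, discount, exploration
ACTIONS = (-1, +1)                       # step left, step right
Q = defaultdict(float)                   # Q[(state, action)] -> value

def step(state, action):
    nxt = min(max(state + action, 0), 4)
    return nxt, (1.0 if nxt == 4 else 0.0)

for episode in range(500):
    s = 0
    while s != 4:
        # Epsilon-greedy: mostly exploit current estimates,
        # sometimes explore a random action.
        a = random.choice(ACTIONS) if random.random() < EPSILON \
            else max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        # Core update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        best_next = max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda x: Q[(0, x)]))  # learned first move: +1
```

Nothing about the game is hard-coded beyond its reward signal, which is exactly the point the slide makes.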


PREDICTIVE ANALYTICS
Learning from the data

TRAINING: through machine learning, a MODEL is learned from historical INPUT DATA and their known TARGET.

Years | Fuel   | Doors | Alarm | Price ($) ← target
1     | Diesel | 5     | Yes   | 14,000
1     | Petrol | 5     | Yes   | 12,500
2     | Petrol | 3     | No    | 11,000
2     | Diesel | 5     | Yes   | 13,000
3     | Petrol | 5     | No    |  9,000
4     | Petrol | 3     | No    |  8,500

PREDICTION: a new input row (3, Diesel, 5, No) is passed to the trained model, which predicts the unknown price (?).
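As a hedged illustration of this training/prediction loop, the same table can be fed to a scikit-learn decision tree. The 0/1 encoding of fuel and alarm, and the choice of a tree over any other model, are assumptions made for the sketch:

```python
from sklearn.tree import DecisionTreeRegressor

# Supervised learning on the toy car table: inputs are
# [age_years, fuel (0=petrol, 1=diesel), doors, alarm (0=no, 1=yes)],
# the target is the price in dollars.
X = [
    [1, 1, 5, 1],
    [1, 0, 5, 1],
    [2, 0, 3, 0],
    [2, 1, 5, 1],
    [3, 0, 5, 0],
    [4, 0, 3, 0],
]
y = [14000, 12500, 11000, 13000, 9000, 8500]

model = DecisionTreeRegressor(max_depth=3).fit(X, y)  # TRAINING

# PREDICTION for the new input row from the slide:
# 3 years old, diesel, 5 doors, no alarm.
print(model.predict([[3, 1, 5, 0]]))
```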
Predictive Analytics

PREDICTIVE ANALYTICS: the extraction of information and knowledge from data, making use of Machine Learning techniques.

It requires the EXISTENCE OF HISTORICAL DATA, which are studied in order to understand which interactions between input variables have generated a specific output, with the AIM TO PREDICT the evolution of data in the future.
Predictive Analytics

DATABASE → PREDICTIVE ENGINE → PREDICTION & BUSINESS INSIGHT
Predictive Analysis - steps

1. REPRESENTING THE PROBLEM
   • Supervised
   • Unsupervised
   • Hybrid

2. EVALUATING THE PERFORMANCE
   • Comparing real & expected values
   • Trial & error

3. OPTIMIZING THE PARAMETERS
   • Minimization of a cost function (see the sketch below)
   • Search for an optimum
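Step 3 is usually carried out numerically. A minimal sketch of cost-function minimization by gradient descent, for a one-parameter linear model y = w·x on made-up data:

```python
# Gradient descent on a mean-squared-error cost for y = w * x.
# Data and learning rate are illustrative.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x

w, lr = 0.0, 0.01
for _ in range(200):
    # dCost/dw for MSE: (2/n) * sum((w*x - y) * x)
    grad = 2 / len(xs) * sum((w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad           # move against the gradient
print(round(w, 3))           # converges near the optimum, w ≈ 2
```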
Predictive Analytics – representing the problem

Predictive Analytics - applications
Predictive Analytics – energy demand forecast

WHAT WILL THE ELECTRICAL EXPENSE BE IN THE NEXT HOUR?

▪ Time series of electric consumption
▪ Exogenous data (weather forecast)

➢ Better estimation of needs and costs
➢ Ability to fix energy price
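A hedged sketch of how such a forecast might be framed: regress the next hourly value on its recent lags plus one exogenous variable. The synthetic data, the lag count, and the linear model are all assumptions made for illustration:

```python
from sklearn.linear_model import LinearRegression

# Next-hour consumption forecast from lagged values + weather.
series = [50, 52, 55, 53, 58, 60, 62, 59, 61, 64]   # hourly kWh (invented)
temps  = [20, 21, 22, 22, 24, 25, 26, 25, 25, 27]   # forecast °C (invented)

LAGS = 3
X, y = [], []
for t in range(LAGS, len(series)):
    X.append(series[t - LAGS:t] + [temps[t]])  # last 3 hours + weather
    y.append(series[t])

model = LinearRegression().fit(X, y)

# Predict the next hour from the latest window and the next forecast.
print(model.predict([series[-LAGS:] + [28]]))
```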
PREDICTIVE ANALYTICS – Advanced Client Segmentation

CAN WE IDENTIFY HOMOGENEOUS GROUPS WITHIN OUR CUSTOMER BASE?

▪ Purchasing behavior
▪ Demographic information

➢ Ad-hoc marketing
➢ Price-setting strategies
➢ New products & services
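A minimal sketch of such segmentation with k-means clustering, using two invented features; real segmentations would combine many more behavioral and demographic variables:

```python
from sklearn.cluster import KMeans

# Cluster customers on [annual spend, age] into homogeneous groups.
customers = [
    [200, 23], [250, 25], [220, 30],   # resemble low spenders
    [900, 45], [950, 50], [880, 48],   # resemble high spenders
]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)           # group assignment per customer
print(km.cluster_centers_)  # group profiles usable for ad-hoc marketing
```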
PREDICTIVE ANALYTICS – Churn Analysis

WHICH OF OUR CLIENTS ARE LIKELY TO LEAVE US FOR OUR COMPETITORS?

▪ Time series data of demand
▪ Exogenous data (e.g. promotions, sales, holiday seasons)

➢ Ad-hoc marketing
➢ Promo activities
PREDICTIVE ANALYTICS – Sentiment Analysis

WHAT KIND OF FEEDBACK DO PEOPLE LEAVE ONLINE ABOUT OUR FIRM?

▪ Social media posts
▪ Customer reviews

➢ Various aims: recommendation engines, advanced clustering, propensity analysis
PREDICTIVE ANALYTICS – Propensity Analysis

HOW LIKELY IS IT THAT A PROSPECT WILL BUY?

▪ Cross-referencing of purchase data & data on marketing & adv campaigns
▪ Analysis of online consumer behavior

➢ Ad-hoc marketing
➢ Price-setting strategies
➢ Promo activities
PREDICTIVE ANALYTICS – Price Prediction

CAN WE PREDICT HOW PRICES WILL EVOLVE IN OUR MARKETS OF INTEREST?

▪ Time series of prices
▪ Time series of exogenous data (e.g. sales, promotions, holiday seasons)

➢ Competitors' strategy
➢ Promotions & offers
PREDICTIVE ANALYTICS – Demand Forecast

HOW MUCH DEMAND WILL WE HAVE FOR OUR GOODS & SERVICES IN THE NEXT HOUR/DAY?

▪ Time series of demand
▪ Exogenous data (e.g. promotions, sales, holiday seasons)

➢ Inventory
➢ Resource allocation
➢ Ad-hoc marketing
➢ Promotions
PREDICTIVE ANALYTICS – Fraud Detection

CAN WE IDENTIFY FRAUDULENT TRANSACTIONS BEFORE THEY ARE CARRIED OUT?

▪ Time series of transactions with a «fraud tag»

➢ Fraud detection (sketch below)
➢ Fraud classification
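One hedged way to approach this is unsupervised anomaly detection, sketched below with scikit-learn's IsolationForest on invented two-feature transactions; a real system would use far richer features and could exploit the «fraud tag» with a supervised model instead:

```python
from sklearn.ensemble import IsolationForest

# Historical "normal" transactions: [amount, hour of day] (invented).
history = [[25, 10], [40, 12], [30, 14], [35, 11],
           [28, 15], [32, 13], [45, 16], [38, 12]]

detector = IsolationForest(contamination=0.1, random_state=0).fit(history)

# Score an incoming transaction before it completes:
# -1 means it looks anomalous (candidate fraud), 1 means normal.
print(detector.predict([[5000, 3]]))
```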
PREDICTIVE ANALYTICS – Resource Allocation

CAN WE PREDICT GEOGRAPHIC AREAS WITH HIGHER DEMAND FOR OUR SERVICE?

▪ Time series of demand
▪ Exogenous data (e.g. fairs, events, strikes)

➢ Optimal resource allocation
➢ Demand/revenue forecasting
PREDICTIVE ANALYTICS – Document Classification

CAN WE AUTOMATICALLY CLASSIFY OUR DOCUMENTS BASED ON THEIR CONTENT?

▪ Sample of documents to classify

➢ Automatic document classification
➢ Error identification
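A minimal sketch of automatic document classification, using TF-IDF features and a Naive Bayes classifier on an invented labeled sample:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled sample of documents (texts and labels invented).
docs = [
    "invoice payment due amount",       # finance
    "total balance wire transfer",      # finance
    "employment agreement signature",   # legal
    "contract clause liability terms",  # legal
]
labels = ["finance", "finance", "legal", "legal"]

# TF-IDF turns each text into a weighted word-count vector;
# Naive Bayes learns which words signal which class.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(docs, labels)
print(clf.predict(["please settle the invoice balance"]))  # -> ['finance']
```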
Data Sources
“Operational” sources picturing a firm’s daily activities are:

• Production-related devices
• Sales-related devices
• Tools to track orders and deliveries
• Accounting tools
• HR tools
• Client-management tools
• Back-office tools
• The product
• Production line (production plant): data from machines' sensors
• Orders & deliveries
• Inventory
• Suppliers' data
• Customers' feedback: call center, emails, returns data
Data types

Structured vs. unstructured data:

• Structured data → table-like data with columns and rows → e.g. transactions: each row is a transaction, each column is a characteristic of the transaction: when it was made, by whom, the amount, etc. (see the small example below)

• Unstructured (semi-structured) data → images, videos, emails, any feedback from social networks…
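For instance, a structured transactions table in pandas, where every column carries a well-defined type (all values invented):

```python
import pandas as pd

# Each row is a transaction; each column is one of its attributes.
transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 09:14", "2024-01-05 09:20"]),
    "customer":  ["C-102", "C-871"],
    "amount":    [14.50, 89.00],
})
print(transactions.dtypes)  # well-defined type per column
```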
Introducing Big data
Big data - definitions
Oxford English Dictionary: "Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges."

1997, Cox & Ellsworth (NASA): "…data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data."

2001, Doug Laney (Gartner): Big Data described by the 3Vs: Volume, Velocity, Variety.

2011, McKinsey: "datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze."

2013, Mayer-Schönberger & Cukier: "…things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value."

Wikipedia: "An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications."
3 Vs definition
• Volume
• Huge amounts of data
• Variety
• Variety of structures, data sources, types of data
• Complexity of structures
• Unstructured data
• Velocity
• Rapidity with which data is produced
3 Vs + 2
• Value
• Ability to extract value from this huge amount of data

• Veracity
• Not all data we have at hand are actually of “good quality”
BUT…

• Big data are not JUST data in big volume (they need to have other characteristics too in order to be defined as big data)

• They come from new sources (e.g. social networks), but also traditional ones! (think about data coming from a sensor placed on a machine in a production plant)

• In many statistics published on the web or shared on TV, the numbers are either exaggerated or imprecise → most of the time, the data we'll actually have to analyze are much less than the initial number
So there’s an alternative definition

Big data are also:

• Data which cannot be analyzed by a single machine, and shouldn't be analyzed with traditional hardware or software technologies

• They might require particularly sophisticated analytical tools, but not necessarily

• Even when we face unstructured data, these too have to be turned into structured data before they can be analyzed
What does this all mean?

It means that when dealing with big data we still use the tools, and apply
the main concepts, that we also encounter when dealing with traditional
BI and Predictive Analysis.

The difference is in the use of specific technologies required for bigger and
unstructured/non-traditional data, such as:

• Hardware and Software devices to extract data directly from sensors placed on
smart objects/devices
• Tools to extract data in semi-real time
• Tools for parallel computing
Why do we have so many?
• Web 2.0 → user-generated content
  • Facebook
  • Twitter
  • Instagram
  • YouTube
  • Blogs
  • Videos
  • Photos
  • Posts
  • …

• IoT (Internet of Things)


Big data - chart

[Chart: examples of big data sources arranged from LOW to MID to HIGH COMPLEXITY]

[Chart: data sources classified along two axes: SOURCE (man vs. machine) and DATA STRUCTURE (structured vs. unstructured)]
Big data - cases

Case | Characteristic | Example
Sensors & DCS | Velocity & volume | Predictive maintenance
Radio Frequency Identification (RFID) | Velocity & volume | Analysis of the path of consumers within a shop, or goods delivered within a geographic area
Stock markets | Velocity & volume | Yields' analysis, optimal portfolio analysis, risk analysis
Scientific instruments data | Velocity & volume | Pattern recognition & simulations
Weather forecast info | Volume | Weather data
Healthcare info | Volume & variety | Monitoring of diseases
Fiscal & bank data | Volume | Information on accounts & transactions – information cross-checking
Big data - cases

Case | Characteristic | Example
Social networks | Volume, velocity, variety | Sentiment analysis
Blogs, forums | Volume, velocity, variety | Sentiment analysis
Web server logs | Volume | Web server traffic and users' behavior
Router traffic logs | Volume, velocity | Usage analysis by providers
Surveillance | Volume, velocity, variety | Anomaly identification
Documents | Volume, variety | Automatic document classification
Geographic data | Volume, velocity | Resource allocation (e.g. car-sharing)

How to generate value from Big Data?

• We receive information at very high frequency

• We can store & analyze more detailed information (because we have the hardware & software to do so)

• Enable more advanced analysis

• Micro-segmentation & Ad-hoc offering


This leads to …

• More sophisticated analysis, leading to more efficient decision-making

• Possibility to create new products/services

Software tools
• Data ingestion
• Data storing
• Data organization
• Computation/analysis
• Integration/enrichment
Hardware architectures
• Symmetric multiprocessing (SMP):
  • Two or more processors connected to a single, shared RAM
  • Each processor has full access to I/O devices; only one instance of the operating system runs
• Massively Parallel Processing (MPP):
  • Shared-nothing architecture: each processor has its own RAM and I/O devices
  • No resource is shared
  • An efficient communication layer enables collaboration between nodes
SMP Architecture

[Diagram: four CPUs connected through a shared BUS to a single RAM and shared disks]
MPP Architecture

[Diagram: four nodes, each with its own CPU, RAM, and disks, connected by a communication layer]
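The shared-nothing idea can be sketched in miniature with Python's multiprocessing: each worker process owns its own partition of the data and computes a partial result, and a final step combines them. This illustrates the principle only; it is not how an MPP database is actually implemented:

```python
from multiprocessing import Pool

def partial_sum(partition):
    # Runs in a separate process with its own memory space:
    # no RAM is shared with the other workers.
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    partitions = [data[i::4] for i in range(4)]  # one slice per "node"
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, partitions)
    print(sum(partials))  # combine the partial results
```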


Big data Hardware
• MPP is obviously more suited to working with huge amounts of data
• The limit of this architecture lies in the number of nodes we can add to the system, and in their cost!
• Examples:
  • Oracle Exadata – max 18 racks (× 672 TB) = 11 PB
  • Microsoft APS – max 7 nodes = 6.2 PB
  • Teradata – max 4096 nodes = 186 PB!
Now let’s set up KNIME
