BIG DATA
Introduction
Lecturer: Lucrezia Noli
Lesson 1
LUCREZIA NOLI
▪ Università Commerciale Luigi Bocconi
▪ Bachelor in Finance
▪ Master in Economics of Innovation & Technology
▪ Master Thesis: «Machine Learning Techniques to
Investigate the ALS Disease»
▪ Won second prize in the PRISLA competition
▪ Current roles
▪ Big Data Scientist at Dataskills
▪ Lecturer SEDIN - Università Bocconi
▪ Lecturer Overnet
▪ Previous roles
▪ Business Development Manager at Metail London
The Data Science process
Data
• Quality
• Integration
• Transformation & creation

Model
• Supervised vs unsupervised
• Rule-based or black box

Predictions
• Parameter optimization
• Model performance
• Reports

[Diagram: OLAP data marts feed the Data → Model → Predictions pipeline]
The foundation of Data Analytics
Business Intelligence is central for any company wanting to use data in an innovative way. This is because BI certifies that the data to be analyzed are correct.

[Diagram: the BI stack, from Data Sources through ETL and the Data Warehouse up to BI Tools & Reporting, governed by Business Rules, Analysis, and Data Quality]
What is Business Intelligence?
DATA SOURCES → CLEANSING & STRUCTURE → DATA WAREHOUSE → CLIENT TOOLS

• Data sources: internal (sensors/PLC, CRM, ERP, production DB), external, big data
• Cleansing & structure: staging area, ETL, master data management, data quality
• Data warehouse: OLAP, data marts
• Client tools: analysis, reporting tools, KPIs
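As a concrete illustration of the ETL step in the pipeline above, here is a minimal sketch in Python, assuming pandas is available; the table name, columns, and cleaning rules are illustrative assumptions, not part of the original architecture.

```python
# A minimal ETL sketch; "sales" and its columns are invented for illustration.
import sqlite3

import pandas as pd

# Extract: raw operational data (inlined here to keep the example self-contained)
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["10.5", "20.0", "20.0", None],
})

# Transform: apply business rules and data-quality checks
clean = (raw.drop_duplicates(subset="order_id")   # remove duplicate orders
            .dropna(subset=["amount"]))           # discard incomplete rows
clean["amount"] = clean["amount"].astype(float)   # enforce types

# Load: write the certified data into the warehouse (SQLite as a stand-in)
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="replace", index=False)
```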
Machine learning
GENERALITY: through Machine Learning a computer can solve various tasks without being given all the parameters that characterize each of them specifically.

MACHINE LEARNING: the process through which a machine learns how to complete a task without being specifically programmed to do so.

LEARNING: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
Tom M. Mitchell
MACHINE LEARNING - Algorithms
In 2012 DeepMind created an artificial intelligence called the Deep Q Learner that can play any game in the Atari suite, famous in the '70s, through ML techniques called «Deep Reinforcement Learning».
The algorithm doesn't specify the characteristics of the game being played; it simply enables the machine to learn how to play by itself.
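The deep network version is out of scope here, but the underlying idea can be shown with a tiny tabular Q-learning sketch; the corridor environment, rewards, and hyperparameters below are illustrative assumptions, not DeepMind's setup.

```python
# Tabular Q-learning: the agent learns to walk right along a 5-cell corridor
# to reach a reward. Environment details are invented for illustration.
import random

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    s = 0                            # start at the leftmost cell
    while s != n_states - 1:         # rightmost cell is the goal
        # epsilon-greedy choice: mostly exploit, sometimes explore
        a = random.randrange(n_actions) if random.random() < epsilon \
            else max(range(n_actions), key=lambda act: Q[s][act])
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == n_states - 1 else 0.0
        # Update: move Q[s][a] toward reward + discounted best future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([max(q) for q in Q])           # learned state values grow toward the goal
```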
LEARNING HOW TO PLAY FLAPPY BIRD
PREDICTIVE ANALYTICS
Learning from the data
TRAINING: through machine learning, the INPUT DATA and the TARGET are used to build a MODEL.

Years | Fuel   | Doors | Anti-theft | Price ($)  ← TARGET
  1   | Diesel |   5   |    Yes     |  14,000
  1   | Petrol |   5   |    Yes     |  12,500
  2   | Petrol |   3   |    No      |  11,000
  2   | Diesel |   5   |    Yes     |  13,000
  3   | Petrol |   5   |    No      |   9,000
  4   | Petrol |   3   |    No      |   8,500

New input data:
  3   | Diesel |   5   |    No      |     ?      → PREDICTION
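A minimal sketch of this training-and-prediction loop, assuming scikit-learn is available; the numeric encoding of the fuel and anti-theft columns is an illustrative choice.

```python
# Fit a linear model on the table above and predict the unknown price.
from sklearn.linear_model import LinearRegression

# Input data: [years, fuel (1 = diesel, 0 = petrol), doors, anti-theft (1 = yes)]
X = [[1, 1, 5, 1], [1, 0, 5, 1], [2, 0, 3, 0],
     [2, 1, 5, 1], [3, 0, 5, 0], [4, 0, 3, 0]]
y = [14000, 12500, 11000, 13000, 9000, 8500]    # target: price ($)

model = LinearRegression().fit(X, y)            # training
new_input = [[3, 1, 5, 0]]                      # the row with the unknown price
print(model.predict(new_input))                 # prediction
```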
Predictive Analytics
PREDICTIVE ANALYTICS: extraction of information and knowledge from data, making use of Machine Learning techniques.

• EXISTENCE OF HISTORICAL DATA, which are studied in order to understand which interactions between input variables have generated a specific output
• AIM TO PREDICT the evolution of the data in the future
Predictive Analytics
DATABASE → PREDICTIVE ENGINE → PREDICTION & BUSINESS INSIGHT
Predictive Analysis - steps
1. REPRESENTING THE PROBLEM
   • Supervised
   • Unsupervised
   • Hybrid

2. EVALUATING THE PERFORMANCE
   • Comparing real & expected values
   • Trial & error

3. OPTIMIZING THE PARAMETERS
   • Minimization of a cost function
   • Search for an optimum
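The three steps can be sketched on synthetic data as follows, assuming scikit-learn is available; the ridge model and its parameter grid are illustrative choices, not the course's prescribed method.

```python
# The three steps of a predictive analysis on synthetic regression data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Representing the problem: a supervised regression task on historical data
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 3. Optimizing the parameters: grid search minimizes the cost function
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 2. Evaluating the performance: compare real and expected values on unseen data
print(mean_squared_error(y_test, search.predict(X_test)))
```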
Predictive Analytics – representing the problem
Predictive Analytics - applications
Predictive Analytics – energy demand forecast
WHAT WILL THE ELECTRICAL
EXPENSE BE IN THE NEXT
HOUR?
▪ Time series of electric
consumption
▪ Exogenous data (weather forecast)
➢ Better estimation of
needs and costs
➢ Ability to fix energy price
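A minimal sketch of such a forecast, with synthetic consumption and temperature series standing in for real data (NumPy and scikit-learn assumed available); the lag-plus-weather feature design is an illustrative assumption.

```python
# Next-hour electric consumption from the previous hour's load + weather.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
temp = rng.normal(20, 5, 200)                    # exogenous data: temperature
load = 100 + 2 * temp + rng.normal(0, 3, 200)    # electric consumption series

# Features: last hour's consumption (lag) plus the weather forecast
X = np.column_stack([load[:-1], temp[1:]])
y = load[1:]                                     # consumption one hour ahead

model = LinearRegression().fit(X[:-1], y[:-1])
print(model.predict(X[-1:]))                     # next-hour estimate
```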
PREDICTIVE ANALYTICS – Advanced Client Segmentation
CAN WE IDENTIFY
HOMOGENEOUS GROUPS
WITHIN OUR CUSTOMER
BASE?
▪ Purchasing behavior
▪ Demographic information
➢ Ad-hoc marketing
➢ Price-setting strategies
➢ New products & services
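A sketch of such segmentation with k-means clustering, assuming scikit-learn is available; the two features (annual spend, age) and the cluster count are illustrative assumptions.

```python
# Group customers into homogeneous segments by purchasing/demographic features.
from sklearn.cluster import KMeans

customers = [[200, 25], [220, 30], [1500, 45],
             [1400, 50], [600, 35], [650, 38]]   # [annual spend, age]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # homogeneous group assigned to each customer
```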
PREDICTIVE ANALYTICS – Churn Analysis
WHICH OF OUR CLIENTS ARE
LIKELY TO LEAVE US FOR
OUR COMPETITORS?
▪ Time series data of
demand
▪ Exogenous data (e.g. promotions, sales, holiday seasons)
➢ Ad-hoc Marketing
➢ Promo activities
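A churn-scoring sketch along these lines, assuming scikit-learn is available; the features (monthly spend, support calls, tenure in months) and labels are invented for illustration.

```python
# Estimate the probability that a client leaves for a competitor.
from sklearn.linear_model import LogisticRegression

X = [[50, 0, 24], [20, 5, 3], [80, 1, 36],
     [15, 4, 2], [60, 0, 30], [25, 6, 5]]
y = [0, 1, 0, 1, 0, 1]                  # 1 = client left for a competitor

model = LogisticRegression().fit(X, y)
# Churn probability for a new client with the given behavior
print(model.predict_proba([[22, 3, 4]])[0][1])
```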
PREDICTIVE ANALYTICS – Sentiment Analysis
WHAT KIND OF FEEDBACK DO
PEOPLE LEAVE ONLINE
ABOUT OUR FIRM?
▪ Social media posts
▪ Customer reviews
➢ Various aims:
recommendation
engines, advanced
clustering, propensity
analysis
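A tiny sentiment-classification sketch, assuming scikit-learn is available; the labeled reviews are invented for illustration.

```python
# Classify text feedback as positive or negative with TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great service, very happy", "terrible support, never again",
           "excellent product", "awful experience", "love it", "hate it"]
labels = [1, 0, 1, 0, 1, 0]             # 1 = positive feedback

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)
print(model.predict(["the support team was great"]))
```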
PREDICTIVE ANALYTICS – Propensity Analysis
HOW LIKELY IS IT THAT A
PROSPECT WILL BUY?
▪ Cross-referencing of purchase
data & data on marketing &
adv campaigns
▪ Analysis of online consumer
behavior
➢ Ad-hoc Marketing
➢ Price-setting strategies
➢ Promo activities
PREDICTIVE ANALYTICS – Price Prediction
CAN WE PREDICT HOW PRICES
WILL EVOLVE IN OUR MARKETS
OF INTEREST?
▪ Time series of prices
▪ Time series of exogenous
data (e.g. sales, promotions,
holiday seasons)
➢ Competitors’ strategy
➢ Promotions & offers
PREDICTIVE ANALYTICS – Demand Forecast
HOW MUCH DEMAND WILL
WE HAVE FOR OUR GOOD &
SERVICE IN THE NEXT
HOUR/DAY?
▪ Time series of demand
▪ Exogenous data (e.g.
promotions, sales, holiday
seasons)
➢ Inventory
➢ Resource allocation
➢ Ad-hoc marketing
➢ Promotions
PREDICTIVE ANALYTICS – Fraud Detection
CAN WE IDENTIFY
FRAUDULENT
TRANSACTIONS BEFORE
THEY ARE CARRIED OUT?
▪ Time series of transactions
with «fraud tag»
➢ Fraud detection
➢ Fraud classification
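The slide suggests supervised learning from the «fraud tag»; as a complementary sketch, here is an unsupervised anomaly detector, assuming scikit-learn is available, with invented transaction amounts.

```python
# Flag transactions whose amounts deviate from the usual pattern.
from sklearn.ensemble import IsolationForest

transactions = [[50], [60], [55], [52], [58], [5000]]   # amounts ($)

detector = IsolationForest(contamination=0.2, random_state=0).fit(transactions)
print(detector.predict(transactions))   # -1 marks a suspicious transaction
```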
PREDICTIVE ANALYTICS – Resource Allocation
CAN WE PREDICT GEOGRAPHIC
AREAS WITH HIGHER DEMAND
FOR OUR SERVICE?
▪ Time series of demand
▪ Exogenous data (e.g. fairs,
events, strikes)
➢ Optimal resource
allocation
➢ Demand/revenue
forecasting
PREDICTIVE ANALYTICS – Document Classification
CAN WE AUTOMATICALLY
CLASSIFY OUR DOCUMENTS
BASED ON THEIR CONTENT?
▪ Sample of documents to
classify
➢ Automatic document
classification
➢ Error identification
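A document-classification sketch, assuming scikit-learn is available; the categories and sample documents are invented for illustration.

```python
# Route documents into categories based on their textual content.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["invoice number and total amount due",
        "contract signed by both parties",
        "payment received for invoice",
        "terms and conditions of the contract"]
labels = ["invoice", "contract", "invoice", "contract"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["please find the attached invoice"]))
```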
Data Sources
“Operational” sources capturing a firm’s daily activities include:
• Production-related devices
• Sales-related devices
• Tools to track orders and deliveries
• Accounting tools
• HR tools
• Client-management tools
• Back-office tools
• The product
• Production line (production plant) – data from machine sensors
• Orders & deliveries
• Inventory
• Supplier data
• Customers’ feedback: call center, emails, returns data
Data types
Structured vs. unstructured data:
• Structured data → table-like data with columns and rows → e.g.
transactions: each row is a transaction, each column is a
characteristic of the transaction: when it was made, by whom, the
amount, etc.
• Unstructured (semi-structured) data → images, videos, emails, any
feedback from social networks…
Introducing Big data
Big data - definitions
1997 – Cox & Ellsworth (NASA): “…data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data.”

2001 – Doug Laney (Gartner): Big Data described by the 3Vs: Volume, Velocity, Variety.

2011 – McKinsey: “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”

2013 – Mayer-Schönberger & Cukier: “…things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.”

Oxford English Dictionary: “Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”

Wikipedia: “An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.”
3 Vs definition
• Volume
• Huge amounts of data
• Variety
• Variety of structures, data sources, types of data
• Complexity of structures
• Unstructured data
• Velocity
• Rapidity with which data is produced
3 Vs + 2
• Value
• Ability to extract value from this huge amount of data
• Veracity
• Not all data we have at hand are actually of “good quality”
BUT…
• Big data are not JUST data in big volume (they need to have
other characteristics too in order to be defined as big
data)
• They come from both new sources (e.g. social networks)
and traditional ones! (think about data coming from a
sensor put on a machine in a production plant)
• In many statistics published on the web or shared on TV,
the numbers are either exaggerated or imprecise → most of
the time, the data we’ll actually have to analyze are much
less than the initial number
So there’s an alternative definition
Big data are also:
• Data which cannot be analyzed by a single machine, and shouldn’t
be analyzed with traditional hardware or software technologies.
• They might require particularly sophisticated analytical tools, but not
necessarily.
• Even when we face unstructured data, these too have to be turned
into structured data before they can be analyzed.
What does this all mean?
It means that when dealing with big data we still use the tools, and apply
the main concepts, that we also encounter when dealing with traditional
BI and Predictive Analysis.
The difference is in the use of specific technologies required for bigger and
unstructured/non-traditional data, such as:
• Hardware and Software devices to extract data directly from sensors placed on
smart objects/devices
• Tools to extract data in semi-real time
• Tools for parallel computing
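As a tiny illustration of the last point, here is a parallel-computing sketch using only Python's standard library; splitting a sum across four worker processes is an illustrative stand-in for real distributed engines.

```python
# Each worker processes its own chunk, mimicking a shared-nothing split.
from multiprocessing import Pool

def partial_sum(chunk):
    return sum(chunk)                            # work done independently per node

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]      # distribute data to 4 workers
    with Pool(4) as pool:
        print(sum(pool.map(partial_sum, chunks)))  # combine partial results
```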
Why do we have so much data?
• Web 2.0 -> user generated content
• Facebook
• Twitter
• Instagram
• YouTube
• Blogs
• Videos
• Photos
• Posts
• …
• IoT (Internet of Things)
Big data - chart
[Chart: examples of big data arranged from LOW to MID to HIGH COMPLEXITY]
Big data - chart
[Chart: data classified by SOURCE (man vs. machine) and DATA STRUCTURE (structured vs. unstructured)]
Big data - cases
Case                                   | Characteristic    | Example
Sensors & DCS                          | Velocity & volume | Predictive maintenance
Radio Frequency Identification (RFID)  | Velocity & volume | Analysis of the path of consumers within a shop, or goods delivered within a geographic area
Stock markets                          | Velocity & volume | Yield analysis, optimal portfolio analysis, risk analysis
Scientific instruments data            | Velocity & volume | Pattern recognition & simulations
Weather forecast info                  | Volume            | Weather data
Healthcare info                        | Volume & variety  | Monitoring of diseases
Fiscal & bank data                     | Volume            | Information on accounts & transactions – information cross-checking
Big data - cases
Case                | Characteristic             | Example
Social networks     | Volume, velocity, variety  | Sentiment analysis
Blogs, forums       | Volume, velocity, variety  | Sentiment analysis
Web server logs     | Volume                     | Web server traffic and users’ behavior
Router traffic logs | Volume, velocity           | Usage analysis by providers
Surveillance        | Volume, velocity, variety  | Anomaly identification
Documents           | Volume, variety            | Automatic document classification
Geographic data     | Volume, velocity           | Resource allocation (e.g. car sharing)
How to generate value from Big Data?
• We receive information at very high frequency
• We can store & analyze more detailed information (because we have
the hw & sw to do so)
• Enable more advanced analysis
• Micro-segmentation & Ad-hoc offering
This leads to …
• More sophisticated analysis leading to more
efficient decision-making
• Possibility to create new products/services
Software tools
• Data ingestion
• Data storing
• Data organization
• Computation/Analysis
• Integration/Enrichment
Hardware architectures
• Symmetric multiprocessing (SMP):
• Two or more processors connected to a single, shared RAM.
• Each processor has full access to I/O devices. Only one instance of the
operating system runs
• Massively Parallel Processing (MPP)
• Shared nothing architecture: each processor has its own RAM and I/O
devices
• No resource is shared
• An efficient communication layer enables collaboration between nodes
SMP Architecture
[Diagram: CPUs 1–4 connected by a shared BUS to a single RAM and shared disks]
MPP Architecture
[Diagram: four nodes, each with its own CPU, RAM, and disks, connected by a communication layer]
Big data Hardware
• MPP is obviously more suited to working with huge amounts of data
• The limit of this architecture is in the number of nodes we can add to
the system, and their cost!
• Examples:
• Oracle Exadata – max 18 racks (× 672 TB) = 11 PB
• Microsoft APS – max 7 nodes = 6.2 PB
• Teradata – max 4,096 nodes = 186 PB!
Now let’s set up KNIME