
DATA SCIENCE AND BIG DATA Analytics

INTRODUCTION: DATA SCIENCE AND BIG DATA

Unit Objectives
1. To introduce the basic need for Big Data and Data Science to handle huge amounts of data.
2. To understand the applications and impact of Big Data.

Unit outcomes:
1. To understand Big Data primitives.
2. To learn different programming platforms for big data analytics.
Outcome Mapping: PEO: I,V , PEO c, e CO: 1, 2, PSO: 3,4
Books :
1. Krish Krishnan, Data Warehousing in the Age of Big Data, Elsevier, ISBN: 9780124058910, 1st Edition
INTRODUCTION: DATA SCIENCE AND BIG DATA

• Data science and Big Data


• Defining Data science and Big Data
• Big Data examples
• Data explosion
• Data volume
• Data Velocity
• Big data infrastructure and challenges
• Big Data Processing Architectures
• Data Warehouse
• Re-Engineering the Data Warehouse
• Shared everything and shared nothing Architecture
• Big data learning approaches.
INTRODUCTION: DATA SCIENCE AND BIG DATA

• Data science is a process that examines:

 - where the information can be taken from
 - what it signifies
 - how it can be converted into a useful resource in the creation of
 business & IT strategies.

• Goal: extract value from data in all its forms.

• Draws on the disciplines of mathematics, statistics, & computer science.

• Includes methods like machine learning, cluster analysis, data
 mining & visualization.
INTRODUCTION: DATA SCIENCE AND BIG DATA

• With the help of mining huge quantities of structured &
unstructured data, organizations can:
 - reduce costs
 - raise efficiencies
 - identify new market opportunities
 - enhance the organization's competitive advantage.
INTRODUCTION: DATA SCIENCE AND BIG DATA
Fig: Data Science (the intersection of the scientific method, math, statistics, data engineering, domain expertise, advanced computing, a hacker mindset, and visualization)


Data Scientists
 Convert the organization's raw data into useful information.

 Manage & understand large amounts of data.

 Create data visualization models that facilitate
 demonstrating the business value of digital information.

 Can illustrate digital information easily with the help of
 smartphones, Internet of Things devices, and social media.
Fig: The Data Science Pipeline
Data and its structure

• Data comes in many forms, but at a high level, it falls into three
categories: structured, semi-structured, and unstructured.
• Structured data :
- highly organized data
- exists within a repository such as a database (or a comma-
separated values [CSV] file).
- easily accessible.
- format of the data makes it appropriate for queries and
computation (by using languages such as Structured Query
Language (SQL)).
• Unstructured data : lacks any content structure at all (for example, an
audio stream or natural language text).
• Semi-structured data: includes metadata or data that can be more easily
processed than unstructured data by using semantic tagging.
Data and its structure

Figure : Models of data


Data engineering
Data wrangling:
• Process of manipulating raw data to make it useful for data
analytics or to train a machine learning model.

• Includes:
 - sourcing the data from one or more data sets (in addition to
 reducing the set to the required data),
 - normalizing the data so that data merged from multiple data
 sets is consistent,
 - parsing data into some structure or storage for further use.

• The process by which you identify, collect, merge, and preprocess one
or more data sets in preparation for data cleansing (see the sketch below).
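As a minimal illustration (not from the original slides), the following Python sketch sources two hypothetical data sets, normalizes a column name so the merged data is consistent, reduces the result to the required columns, and stores it for further use. The data, file names, and column names are assumptions.

import pandas as pd

# Two hypothetical source data sets (names and columns are assumptions)
sales = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [120.0, 80.0, 45.5]})
customers = pd.DataFrame({"CustomerID": [1, 2], "region": ["West", "East"]})

# Normalize column names so data merged from multiple sets is consistent
customers = customers.rename(columns={"CustomerID": "customer_id"})

# Merge the sets and reduce to the required columns
wrangled = sales.merge(customers, on="customer_id", how="inner")
wrangled = wrangled[["customer_id", "region", "amount"]]

# Parse/persist into storage for further use (cleansing comes next)
wrangled.to_csv("wrangled_sales.csv", index=False)
print(wrangled)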
Data cleansing

• After you have collected and merged your data set, the next step
is cleansing.

• Data sets in the wild are typically messy and infected with any
number of common issues.

• Common issues include missing values (or too many values),
bad or incorrect delimiters (which segregate the data),
inconsistent records, or insufficient parameters.

• When the data set is syntactically correct, the next step is to ensure
that it is semantically correct.
Data preparation/preprocessing

• The final step in data engineering.

• This step assumes that you have a cleansed data set that might not yet
be ready for processing by a machine learning algorithm.

• Using normalization, you transform an input feature to distribute
the data evenly into an acceptable range for the machine learning
algorithm (a minimal sketch follows).
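A minimal sketch of min-max normalization, assuming a NumPy feature array; scikit-learn's MinMaxScaler provides equivalent behavior.

import numpy as np

def min_max_normalize(feature):
    # Scale a 1-D feature array into the range [0, 1]
    lo, hi = feature.min(), feature.max()
    return (feature - lo) / (hi - lo)

ages = np.array([18.0, 25.0, 40.0, 65.0])
print(min_max_normalize(ages))   # all values now fall between 0 and 1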
Machine learning

• Create and validate a machine learning model.

• Sometimes, the machine learning model is the product, which is


deployed in the context of an application to provide some
capability (such as classification or prediction).

• In other cases, the product isn’t the trained machine learning


algorithm but rather the data that it produces.
Model learning

• In one model, the algorithm processes the data & creates a new
data product as the result.

• But, in a production sense, the machine learning model is the
product itself, deployed to provide insight or add value (such
as the deployment of a neural network to provide prediction
capabilities for an insurance market).
Machine learning approaches

Machine learning approaches:


• Supervised learning
• Unsupervised learning
• Reinforcement learning
1. Supervised learning:
-algorithm is trained to produce the correct class and alter the model
when it fails to do so.
- The model is trained until it reaches some level of accuracy.
2. Unsupervised learning:
- has no class labels; instead, it inspects the data and groups it based on some
structure that is hidden within the data.
- these types of algorithms can be used in recommendation systems by
grouping customers based on their viewing or purchasing history.
3. Reinforcement learning:
- is a semi-supervised learning approach.
- provides a reward after the model makes some number of
decisions that lead to a satisfactory result. (An illustrative sketch of the first two approaches follows.)
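As an illustrative sketch only (not part of the slides), the snippet below contrasts a supervised classifier with unsupervised clustering on scikit-learn's built-in iris data; the choice of models is an assumption.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the algorithm is trained with the correct class labels
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))

# Unsupervised: no labels; the data is grouped by hidden structure
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:10])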

Model validation
• Used to understand how the model will behave in production after it
is trained.
• For that purpose, a small amount of the available training data is
reserved to be tested against the final model (called test data).
• Training data is used to train the machine learning model.
• Test data is used when the model is complete to validate how well
it generalizes to unseen data (see the sketch below).
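A minimal sketch of this split using scikit-learn (illustrative, with an assumed 80/20 split and model choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Reserve a small amount of the data (here 20%) as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Validate how well the model generalizes to unseen data
print("test accuracy:", model.score(X_test, y_test))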
Operations:
• The end goal of the data science pipeline.
• Creating a visualization for the data product.
• Deploying the machine learning model in a production environment to
operate on unseen data to provide prediction or classification.
Model deployment:
• When the product of the machine learning phase is a model, it
will be deployed into some production environment to apply to
new data.
• This model could be a prediction system, for example:
 Input: historical financial data (e.g., sales & revenue)
 Output: classification of whether a company is a reasonable acquisition target.
Model visualization:
• In smaller-scale data science, the product is data, instead of a model
produced in the machine learning phase.
• The data product answers some questions about the original data set.
• Options for visualization are vast and can be produced, for example, from the R
programming language.
Summary - Definitions of Data Science

• Is a field of Big Data which aims to provide meaningful
information from huge amounts of complex data.

• Is a system used for retrieving information in different forms,
either structured or unstructured.

• It combines different fields of work in statistics & computation in
order to understand the data for the purpose of decision making.
Introduction to Big Data

• Is a huge amount of data.

• Organizations use data generated through various sources to run


their business.

• They analyze the data to understand & interpret market trends,


study customer behavior & take financial decisions.

• Consists of large datasets that cannot be managed efficiently by the


common DBMS.

• These datasets range from terabytes to exabytes.


Introduction to Big Data
• Mobile phones, credit cards, Radio Frequency Identification
(RFID) devices & Social Networking platforms create huge
amounts of data that may reside unutilized at unknown servers for
many years.

• With the evolution of Big data, this data can be accessed &
analyzed on a regular basis to generate useful information.

• The Sheer Volume, Variety, Velocity & Veracity of data is signified


by the term “ Big Data “

• Is structured, Unstructured, Semi-structured or heterogeneous in


nature.
Introduction to Big Data
• Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools
or traditional data processing applications.
• The challenges include capture, curation, storage, search, sharing,
transfer, analysis, and visualization.
• The trend to larger data sets is due to the additional information derivable
from analysis of a single large set of related data, as compared to separate
smaller sets with the same total amount of data, allowing correlations to
be found to "spot business trends, determine quality of research, prevent
diseases, link legal citations, combat crime, and determine real-time
roadway traffic conditions.”
Introduction to Big Data
• Traditional DBMS, warehousing & analysis systems fail to analyze
huge amounts of data.

• Big data is stored in distributed-architecture file systems.

• Hadoop by Apache is widely used for storing & managing Big Data.

• Sorting, organizing & analyzing this critical data in a systematic manner is
at the core of Big Data.

• The process of capturing or collecting Big Data is known as
"datafication".

• Big Data is datafied so that it can be used productively.

Facts and Figures
• Walmart handles 1 million customer transactions/hour.
• Facebook handles 40 billion photos from its user base!
• Facebook inserts 500 terabytes of new data every day.
• Facebook stores, accesses, and analyzes 30+ Petabytes of user
generated data.
• A flight generates 240 terabytes of flight data in 6-8 hours of flight.
• More than 5 billion people are calling, texting, tweeting and
browsing on mobile phones worldwide.
• Decoding the human genome originally took 10 years to process;
now it can be achieved in one week.
• The largest AT&T database boasts titles including the largest volume
of data in one unique database (312 terabytes) and the second largest
number of rows in a unique database (1.9 trillion), which comprises AT&T’s
extensive calling records.
An Insight
• Byte: one grain of rice
• KB (10^3): one cup of rice
• MB (10^6): 8 bags of rice (Desktop)
• GB (10^9): 3 semi trucks of rice
• TB (10^12): 2 container ships of rice (Internet)
• PB (10^15): blankets half of Jaipur
• Exabyte (10^18): blankets the west coast, or 1/4th of India (Big Data)
• Zettabyte (10^21): fills the Pacific Ocean (Future)
• Yottabyte (10^24): an earth-sized rice bowl
• Brontobyte (10^27): astronomical size
Source of Data Generation
Where is the Problem?
• Traditional RDBMS queries aren't sufficient to get
useful information out of the huge volume of data.
• Searching it with traditional tools to find out
if a particular topic was trending would take so
long that the result would be meaningless by the
time it was computed.
• Big Data comes up with a solution to store this data in
novel ways in order to make it accessible, and
also to come up with methods of performing
analysis on it.
What are the Challenges?
IBM considers 3 V's: Volume, Velocity (speed), and Variety (complexity).
Beyond the 3 V's (+1, and N more): Veracity, Validity.
Introduction to Big Data

• Big data can be made useful by:


- organizing it
- determining what we can do with it.

Big Data

Big Data:
• Is a new data challenge that requires leveraging existing systems differently.
• Is classified in terms of 4 Vs: Volume, Variety, Velocity, Veracity.
• Is usually unstructured & qualitative in nature.
Real-world Examples of Big Data

• Consumer product companies & retail organizations are observing
data on social media websites such as Facebook & Twitter. These sites
help them analyze customer behavior, preferences & product
perception. Accordingly, the companies can line up their
upcoming products to gain profits; this is called social media analytics.

• Sports teams are using data for tracking ticket sales & even for
tracking team strategies.
Big Data
• Big data is a pool of huge amounts of data of all types , shapes and
formats collected from different sources.

Evolution of Big Data
 Big Data is the new term of data evolution, directed by the velocity,
variety & volume of data.

 Velocity: implies the speed with which data flows into an
organization.

 Variety: the varied forms of data, such as structured, semi-structured, and
unstructured.

 Volume: the amount of data an organization has to deal with.


Big Data
Structuring Big data
• Arranging the available data in a manner such that it becomes easy to
study, analyze & derive conclusion from it.

• Structuring data helps in understanding user behaviors, requirements


& preferences to make personalized recommendations for every
individual.

• Eg: when a user regularly visits or purchases from online shopping
sites, each time he logs in, the system can present a recommended
list of products that may interest the user on the basis of his earlier
purchases or searches.

• Different types of data (e.g. images, text, audio) can be structured
only if they are sorted & organized in some logical pattern.
Big Data

Fig: Concepts of Big Data (analysis, distributed systems, data storage, data science, parallel processing, data mining, and artificial intelligence)


Big Data

Fig: Types of Data (Structured Data + Unstructured Data + Semi-Structured Data = Big Data)


Elements of Big Data

• Volume
• Velocity
• Variety
• Veracity
Volume

• The amount of data generated by organizations.

• The volume of data in most organizations runs into exabytes.

• Organizations are doing their best to handle this ever-increasing volume of
data.

• The Internet alone generates a huge amount of data.

• Eg: the Internet has around 14.3 trillion live pages; 48 billion web pages
are indexed by Google Inc. and 14 billion web pages are indexed by
Microsoft Bing.
Velocity

• Rate at which data is generated, captured & shared.

• Is flow of data from various sources such as networks, human resources,


social media etc.

• The data can be huge & flow in a continuous manner.

• Enterprises can capitalize on data only if it is captured & shared in real


time.

• Information processing systems such as CRM & ERP face problems


associated with data, which keeps adding up but cannot be processed
quickly.
Velocity

• These systems are able to attend to data in batches every few hours.

• Eg: eBay analyzes around 5 million transactions per day in real


time to detect & prevent frauds arising from the use of PayPal.

• Sources of high velocity data includes:


 IT devices: routers , switches, firewalls.
 Social media: Facebook posts , tweets etc.
 Portable devices: Mobile

• Examples of data velocity :


Amazon, FB, Yahoo, Google, Sensor data, Mobile networks etc.
Variety

• Data generated from different types of sources such as Internal,


External, Social etc.

• Data comes in different formats(images, text, videos)

• A single source can generate data in varied formats.

• Eg: GPS & Social networking sites such as Facebook produce data
of all types, including text, images, videos.
Veracity

• Refers to the uncertainty of data; that is whether the obtained data


is correct or consistent.

• Out of huge amount of data, correct & consistent data can be used
for further analysis.

• Unstructured & semi-structured data take a lot of effort to clean
& make suitable for analysis.
Data Explosion

• Is the rapid growth of data.

• One reason for this explosion is innovation.

• Innovation has transformed the way we engage in business,
provide services, and the associated measurement of value and
profitability.

• 3 basic trends build up the data:


 Business model transformation
 Globalization
 Personalization of services
Business model transformation

• Modern companies have moved toward service-oriented
models rather than product-oriented ones.

• In a service-oriented model, the value of the organization from the customer's
point of view is measured by how effective the service is instead of how
useful the product is.

• The amount of data produced & consumed by every organization today
exceeds what the same organization produced prior to the business
transformation.

• Higher-priority data are kept at the center, & the supporting data that was
required but not available or accessible previously can now be made available
& accessible with the help of multiple channels.
Globalization
• Is a key trend that has radically changed the commerce of the
world, starting from manufacturing to customer service.

Personalization of services

• A business transformation's maturity index is measured by the extent
of personalization of services and the value perceived by
customers from such transformation.
Big Data Processing Architectures

• Big data architecture is designed to handle the processing &


analysis of data that is too large or complex for traditional DBMS.
Fig: Components of Big data architecture:
Big Data Processing Architectures

Data sources.
• All big data solutions start with one or more data sources.
• Examples include:
• Application data stores, such as relational databases.
• Static files produced by applications, such as web server log files.
• Real-time data sources, such as IoT devices.

Data storage.
• Data for batch processing operations is typically stored in a
distributed file store that can hold high volumes of large files in
various formats.
• This kind of store is often called a data lake.
Big Data Processing Architectures

Batch processing.
• data files are processed using long-running batch jobs to filter,
aggregate, and otherwise prepare the data for analysis.
• Usually these jobs involve reading source files, processing them,
and writing the output to new files.
Real-time message ingestion.
• If the solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream
processing.
Stream processing.
After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data
for analysis. The processed stream data is then written to an
output sink.
Big Data Processing Architectures

Analytical data store.


Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical
tools.
Analysis and reporting.
The goal of most big data solutions is to provide insights into the data
through analysis and reporting.
Orchestration:
• Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data
between multiple sources and sinks, load the processed data into an
analytical data store, or push the results straight to a report or dashboard.
• To automate these workflows, you can use an orchestration technology
such as Azure Data Factory or Apache Oozie and Sqoop.
Data processing challenges

Storage
• The first & major problem to big data is storage.

• As Big data is increased rapidly, there is need to process this huge data as
well as to store it.

• We need the additional 0.5 times storage to process & store the
intermediate result set.

• Storage has been a problem in the world of transaction processing and


data warehousing.

• Due to the design of the underlying software, we do not consume all the
storage that is available on a disk.

• Another problem with storage is the cost per byte.


Data processing infrastructure challenges
Transportation

• One of the biggest issues is moving data between different
systems and then storing it or loading it into memory for
manipulation.

• This continuous movement of data has been one of the reasons


that structured data processing evolved to be restrictive in nature,
(where the data had to be transported between the compute and
storage layers. )

• Network technologies facilitated the bandwidth of the transport


layers to be much bigger and more scalable.
Data processing infrastructure challenges

Processing
• Is to combine some form of logical and mathematical
calculations together in one cycle of operation.

• Divided into the 3 main areas:

1. CPU or processor.
2. Memory
3. Software
Data processing infrastructure challenges
CPU or processor.
• With each generation:
- the computing speed and processing power have increased
-leading to more processing capabilities
- access to wider memory.
- architecture evolution within the software layers.
Memory.
• The storage of data to disk for offline processing proved the need
for storage evolution and data management.

• As processor evolution improved the capability of the processor,
memory has become cheaper and faster in terms of speed.

• The way a process resides within a system, according to the memory
allocated to it, has changed significantly.
Data processing infrastructure challenges

Software
• Main component of data processing.

• used to develop the programs to transform and process the data.

• Software across different layers from operating systems to


programming languages has evolved generationally.

• Translates sequenced instruction sets into machine language that is


used to process data with the infrastructure layers of CPU +
memory + storage.
Data processing infrastructure challenges

Speed or throughput

• The biggest continuing challenge.

• Speed is a combination of various architecture layers: hardware,


software, networking, and storage.
Big Data Processing Architectures
Centralized Processing Architecture
– All the data is collected to a single centralized storage area and
processed by a single computer

– Evolved with transaction processing and are well suited for


small organizations with one location of service.

Advantages :
requires minimal resources both from people & system
perspectives.

Centralized processing is very successful when the


collection and consumption of data occurs at the same
location.
Big Data Processing Architectures

Distributed Processing Architecture


Data and its processing are distributed across geographies or data centers
Types:
1. Client–Server Architecture
 Client: collection and presentation
 Server: processing and management
2. Three-tier architecture
 Client, Server, Middle tier
 Middle tier: processing logic
3. n-tier Architecture
 Clients, middleware, applications, and servers are isolated
 into tiers.
 Any tier can be scaled independently.
Big Data Processing Architectures

4. Cluster architecture.
• Machines are connected in a network architecture .
•Both software or hardware work together to process data or
compute requirements in parallel.
• Each machine in a cluster is associated with a task that is
processed locally and the result sets are collected to a
master server that returns it back to the user.
5. Peer-to-peer architecture.
• No dedicated servers and clients; instead, all the processing
responsibilities are allocated among all machines, known as
peers.
• Each machine can perform the role of a client or server or
just process data.
Big Data Processing Architectures

Distributed processing advantages :


– Scalability of systems and resources can be achieved
based on isolated needs.
– Processing and management of information
can be architected based on desired unit of
operation.
– Parallel processing of data reducing time latencies.
Distributed processing Disadvantages:
– Data redundancy
– Process redundancy
– Resource overhead
– Volumes
Big Data Processing Architectures

• The lambda architecture, first proposed by Nathan Marz,


addresses this problem by creating two paths for data flow.

• All data coming into the system goes through these two paths:

• A batch layer (cold path) stores all of the incoming data in its
raw form and performs batch processing on the data. The result
of this processing is stored as a batch view.

• A speed layer (hot path) analyzes data in real time. This layer is
designed for low latency, at the expense of accuracy.
Big Data Processing Architectures

Lambda Architecture

• The batch layer feeds into a serving
layer that indexes the batch view for
efficient querying.

• The speed layer updates the serving
layer with incremental updates
based on the most recent data.
Big Data Processing Architectures
• The hot and cold paths converge at the analytics client
application.

• If the client needs to display timely, yet potentially less


accurate data in real time, it will acquire its result from the hot
path.

• Otherwise, it will select results from the cold path to display


less timely but more accurate data.

• In other words, the hot path has data for a relatively small
window of time, after which the results can be updated with
more accurate data from the cold path.
Big Data Processing Architectures
• The raw data stored at the batch layer is immutable.

• Incoming data is always appended to the existing data, and the


previous data is never overwritten.

• Any changes to the value of a particular data are stored as a


new timestamped event record.

• This allows for recomputation at any point in time across the


history of the data collected.

• The ability to recompute the batch view from the original raw
data is important, because it allows for new views to be
created as the system evolves.
Big Data Processing Architectures

Lambda Architecture
Batch Layer (Cold Path)
 Stores all incoming data & performs batch processing
 Manages all historical data
 Recomputes results, e.g., using a machine learning model
 Results come at high latency due to computational cost
 Data can only be appended, not updated or deleted
 Data is stored using in-memory databases or long-term
 persistent stores such as NoSQL storage
 Uses MapReduce
Speed Layer
 Provides low-latency results
 Data is processed in real time
 Incremental algorithms
 Creating and deleting data sets is possible
Big Data Processing Architectures

Lambda Architecture
Serving Layer:
 The user fires queries against this layer
Applications:
 Ad-hoc queries
 Netflix, Twitter, Yahoo
Pros:
 Batch layer manages historical data, so errors are low when the system crashes
 Good speed, reliability
 Fault tolerance and scalable processing
Cons:
 Caching overhead, complexity, duplicate computation
 Difficult to migrate or reorganize
Big Data Processing Architectures

Kappa Architecture
• A drawback to the lambda architecture is its complexity.
Processing logic appears in two different places — the cold and hot
paths — using different frameworks. This leads to duplicate
computation logic and the complexity of managing the architecture
for both paths.

• The kappa architecture was proposed by Jay Kreps as an


alternative to the lambda architecture.

• It has the same basic goals as the lambda architecture, but with an
important distinction: All data flows through a single path, using a
stream processing system.
Big Data Processing Architectures

• Kappa Architecture
Big Data Processing Architectures

• There are some similarities to the lambda architecture's batch layer, in


that the event data is immutable and all of it is collected, instead of a
subset.

• The data is ingested as a stream of events into a distributed and fault


tolerant unified log.

• These events are ordered, and the current state of an event is changed
only by a new event being appended.

• Similar to a lambda architecture's speed layer, all event processing is


performed on the input stream and persisted as a real-time view.

• If you need to recompute the entire data set (equivalent to what the
batch layer does in lambda), you simply replay the stream, typically
using parallelism to complete the computation in a timely fashion.
Big Data Processing Architectures

Kappa Architecture
• A simplified lambda architecture with the batch layer removed
• The speed layer is capable of handling both real-time and batch data
• Only two layers: stream processing and serving
• All event processing is performed on the input stream and
persisted as a real-time view
• The speed layer is designed using Apache Storm or Spark
Big Data Processing Architectures

Zeta architecture
• This is the next generation Enterprise architecture cultivated
by Jim Scott.

• This is a pluggable architecture which consists of Distributed


file system, Real-time data storage, Pluggable compute
model/execution engine, Deployment/container management
system, Solution architecture, Enterprise applications and
Dynamic and global resource management.
Big Data Processing Architectures

Zeta architecture diagram


Big Data Processing Architectures

The Zeta Architecture is a high-level enterprise architectural


construct not unlike the Lambda architecture which enables
simplified business processes and defines a scalable way to
increase the speed of integrating data into the business.
The result? A powerful, data-centric enterprise.
Big Data Processing Architectures
There are seven pluggable components of the Zeta Architecture which work together,
reducing system-level complexity while radically increasing resource utilization and
efficiency.
Distributed File System - all applications read and write to a common, scalable
solution, which dramatically simplifies the system architecture.
Real-time Data Storage - supports the need for high-speed business applications
through the use of real-time databases.
Pluggable Compute Model / Execution Engine -delivers different processing
engines and models in order to meet the needs of diverse business applications and
users in an organization.
Deployment / Container Management System - provides a standardized approach
for deploying software. All resource consumers are isolated and deployed in a standard
way.
Solution Architecture - focuses on solving specific business problems, and combines
one or more applications built to deliver the complete solution. These solution
architectures encompass a higher-level interaction among common algorithms or
libraries, software components and business workflows.
Enterprise Applications - brings simplicity and reusability by delivering the
components necessary to realize all of the business goals defined for an application.
Dynamic and Global Resource Management - allows dynamic allocation of
resources so that you can accommodate whatever task is the most important for that
day.
Big Data Processing Architectures

Zeta architecture
There are several benefits to implementing a Zeta Architecture in your
organization
•Reduce time and costs of deploying and maintaining applications
•Fewer moving parts with simplifications such as using a distributed file
system
•Less data movement and duplication - transforming and moving data
around will no longer be required unless a specific use case calls for it
•Simplified testing, troubleshooting, and systems management
•Better resource utilization to lower data center costs
The Traditional Research Approach

 Query-driven (lazy, on-demand)

Fig: Clients send queries to an integration system (with metadata), which pulls data from each source through a wrapper.
The Traditional Research Approach: Disadvantages

 Delay in query processing


 Slow or unavailable information sources
 Complex filtering and integration
 Inefficient and potentially expensive for
frequent queries
 Competes with local processing at sources
 Hasn’t caught on in industry
Data Warehouse
The Warehousing Approach

 Information integrated in advance
 Stored in the warehouse for direct querying and analysis

Fig: Extractors/monitors feed data from each source into an integration system (with metadata), which clients query directly.
The Warehousing Approach

• A technique for assembling and
managing data from various
sources for the purpose of
answering business questions,
thus making decisions that were
not previously possible.
• A decision support database
maintained separately from the
organization's operational
database.
The Warehousing Approach
Definition: A single, complete and consistent store of data
obtained from a variety of different sources made available
to end users in a way they can understand and use in a
business context. [Barry Devlin]

• By comparison: an OLTP (on-line transaction processor) or


operational system is used to deal with the everyday
running of one aspect of an enterprise.

• OLTP systems are usually designed independently of each


other and it is difficult for them to share information.
Characteristics of Data Warehouse

• Subject oriented. Data are organized based on how


the users refer to them.
• Integrated. All inconsistencies regarding naming
convention and value representations are removed.
• Nonvolatile. Data are stored in read-only format and
do not change over time.
• Time variant. Data are not current but normally time
series.
Characteristics of Data Warehouse

• Summarized. Operational data are mapped into a
decision-usable format.
• Large volume. Time series data sets are normally
quite large.
• Not normalized. DW data can be, and often are,
redundant.
• Metadata. Data about data are stored.
• Data sources. Data come from internal and external
unintegrated operational systems.
Need of Data Warehouses

• Consolidation of information resources


• Improved query performance
• Separate research and decision support
functions from the operational systems
• Foundation for data mining, data
visualization, advanced reporting and OLAP
tools
Data Warehouse

A subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision-making
process is called a data warehouse.

Subject-oriented:
 The DWH is organized around the major subjects of the enterprise (e.g.
customer, product, sales) rather than application areas (customer
invoicing, stock control, product sale).
Integrated: Data comes from enterprise-wide applications in different
formats.
Time-variant:
 The DWH holds data for different time intervals (historical, time-series snapshots).
Non-volatile:
 New data is always added to the existing data rather than
replacing it.
Merits Data Warehouse

1. Delivers enhanced Business Intelligence


2. Ensures data quality and consistency
 The DWH supports data conversion into a common & standard
 format
 No discrepancies
3. Saves time and money
 Saves users' time since data is in one place
 DWH execution does not require IT support & a higher number of
 channels
4. Tracks historically intelligent data
 Updates about changing trends
5. Generates high revenue
Advantages of Warehousing Approach

• High query performance


– But not necessarily most current information
• Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
• Has caught on in industry
Typical business questions:
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• Knowledge discovery
– Making consolidated reports
– Finding relationships and correlations
– Data mining
– Examples
• Banks identifying credit risks
• Insurance companies searching for fraud
• Medical research
Comparison Chart of Database Types

Data warehouse vs. Big Data:
• Sources: a data warehouse extracts data from a variety of SQL-based data
sources (relational databases) & helps generate analytic reports; Big Data
handles huge data coming from various heterogeneous sources, including
social media.
• Data types: a data warehouse mainly handles structured data; Big Data can
handle structured, unstructured & semi-structured data.
• Analytics: a data warehouse supports analytics on informed (specific)
information; Big Data holds a lot of data, so analytics extracts useful
information from that data.
• File system: a data warehouse doesn't use a distributed file system; Big Data
uses a distributed file system.
• History: both never erase previous data when new data is added, but Big Data
sometimes processes real-time data streams.
• Fetch time: fetching data simultaneously takes more time in a data warehouse;
in Big Data it is small, using the Hadoop File System.
Reengineering the Data Warehouse
Reengineering the Data Warehouse

Enterprise data warehouse platform


There are several layers of infrastructure that make the platform for the EDW:
1. The hardware platform:
● Database server:
– Processor
– Memory
– BUS architecture
● Storage server:
– Class of disk
– Controller
● Network
2. Operating system
3. Application software:
● Database
● Utilities
Reengineering the Data Warehouse

Fig: Data distribution in a data warehouse

Fig: Operational data store
Reengineering the Data Warehouse
Choices for reengineering the data warehouse
Replatforming
• Replatform the data warehouse to a new platform including all
hardware & infrastructure.

• There are several new technology options and depending on the


requirement of the organization, any of these technologies can be
deployed.

• The choices include data warehouse appliances, commodity


platforms, tiered storage, private cloud, and in- memory
technologies.
Reengineering the Data Warehouse
Benefits:
● Moves the data warehouse to a scalable and reliable platform.

● The underlying infrastructure and the associated application
software layers can be architected to provide security, lower
maintenance, and increased reliability.

● Optimizes the application and database code.

● Provides some additional opportunities to use new functionality.

● Makes it possible to rearchitect things in a different/better way,
which is almost impossible to do in an existing setup.
Reengineering the Data Warehouse
Disadvantages:
● Takes a long cycle time to complete, leading to disruption of
business activities.
● Replatforming often means reverse engineering complex business
processes and rules that may be undocumented or custom
developed in the current platform.
● May not be feasible for certain aspects of data processing or there
may be complex calculations that need to be rewritten if they
cannot be directly supported by the functionality of the new
platform.
● Replatforming is not economical in environments that have large
legacy platforms, as it consumes too many business process cycles
to reverse engineer logic and documenting the same.
Data warehouse platform.
Platform engineering

• Modify parts of the infrastructure and get great gains in scalability and
performance.

• Used in the automotive industry where the focus was on improving


quality, reducing costs, and delivering services and products to end
users in a highly cost-efficient manner.

• Applied to the data warehouse can translate to:


● Reduce the cost of the data warehouse.
● Increase efficiencies of processing.
● Simplify the complexities in the acquisition, processing, and delivery of
data.
● Reduce redundancies.
Platform engineering
• Platform reengineering can be done at multiple layers:
 Storage level: storage layer of the data is engineered to process data at
very high speeds for high or low volumes.
 Server reengineering: hardware and its components can be replaced
with more modern components that can be supported in the
configuration
 Network reengineering: In this approach the network layout and the
infrastructure are reengineered.
 Data warehouse appliances: In this approach the entire data
warehouse or datamart can be ported to the data warehouse appliance
.The data warehouse appliance is an integrated stack of hardware,
software, and network components designed and engineered to handle
data warehouse rigors.
 Application server: In this approach the application server is
customized to process reports and analytic layers across a clustered
architecture.
Platform engineering
Data engineering
• The data structures are reengineered to create better performance.

•The data model developed as a part of the initial data warehouse is


often scrubbed and new additions are made to the data model.

Typical changes include:

 Partitioning—a table can be vertically partitioned depending on the


usage of columns, thus reducing the span of I/O operations. Another
partition technique is horizontal partitioning where the table is
partitioned by date or numeric ranges into smaller slices.

 Colocation—a table and all its associated tables can be colocated in the
same storage region.
Platform engineering

 Distribution—a large table can be broken into a distributed set of


smaller tables and used.

 New data types—several new data types like geospatial and


temporal data can be used in the data architecture and current
workarounds for such data can be retired. This will provide a
significant performance boost.

 New database functions—several new databases provide native


functions like scalar tables and indexed views, and can be utilized
to create performance boosts.
Architectures
Shared-everything architecture

• Is a system architecture where all resources are shared including


storage, memory, and the processer.

• Two variations of shared-everything architecture are:

 Symmetric multiprocessing (SMP)

 Distributed shared memory (DSM).


Symmetric multiprocessing (SMP):
• Processors share a single pool of memory for read–write access
concurrently and uniformly without latency.
• Referred to as uniform memory access (UMA) architecture.
• Drawback: when multiple processors are present & share a single system
bus, the bandwidth for simultaneous memory access is choked; therefore,
the scalability of such a system is very limited.

Distributed shared memory (DSM):
• Addresses the scalability problem by providing multiple pools of memory
for processors to use.
• Referred to as non-uniform memory access (NUMA) architecture.
• Latency to access memory depends on the relative distances of the
processors and their dedicated memory pools.
Shared-everything architecture

• Both SMP and DSM architectures have been deployed for many
transaction processing systems, where the transactional data is
small in size and has a short burst cycle of resource
requirements.

• Data warehouses have been deployed on the shared-everything


architecture for many years.

• Due to the intrinsic architecture limitations, the direct impact has


been on cost and performance.

• Analytical applications and Big Data cannot be processed on a


shared-everything architecture.
Fig: Shared-everything architecture
Shared-nothing architecture

• Is a distributed computing architecture where multiple systems (called


nodes) are networked to form a scalable system.

• Each node has its own private memory, disks, and storage devices
independent of any other node in the configuration.

• None of the nodes share memory or disk storage.

• Each processor has its own local memory & local disk.

• An intercommunication channel is used by the processors to
communicate.

• Processors can independently act as a server to serve the data of local


disk.
Fig: Shared-nothing architecture
Shared-nothing architecture

• The flexibility of the architecture is its scalability.

• This is the underlying architecture for data warehouse appliances


and large data processing.

• The extensibility and infinite scalability of this architecture makes it


the platform architecture for Internet & web applications.

• The key feature is that the operating system not the application
server owns responsibility for controlling and sharing hardware
resources.

• A system can assign dedicated applications or partition its data


among the different nodes to handle a particular task.
Shared-nothing architecture
Advantages of Shared-nothing architecture
• Scalable.
• When node gets added transmission capacity increases.
• Failure is local.(failure of one node cannot affect to other
node)

Disadvantages of Shared-nothing architecture


• Cost of communication is higher than shared memory
architecture.
• Data sending involves the software interaction.
• More coordination is required.
What is big data?

"Big Data is any thing which is crash Excel."
"Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM."

Or, in other words, Big Data is data
in volumes too great to process by
traditional methods.

https://twitter.com/devops_borat

Data accumulation

• Today, data is accumulating at tremendous


rates
– click streams from web visitors
– supermarket transactions
– sensor readings
– video camera footage
– GPS trails
– social media interactions
– ...
• It really is becoming a challenge to store
and process it all in a meaningful way

From WWW to VVV

• Volume
– data volumes are becoming unmanageable
• Variety
– data complexity is growing
– more types of data captured than previously
• Velocity
– some data is arriving so rapidly that it must either
be processed instantly, or lost
– this is a whole subfield called “stream processing”

The promise of Big Data

• Data contains information of great


business value
• If you can extract those insights you can
make far better decisions
• ...but is data really that valuable?
"quadrupling the average cow's milk production ... parents were born"

"When Freddie [as he is known] had no daughter records our
equations predicted from his DNA that he would be the best bull,"
USDA research geneticist Paul VanRaden emailed me with a
detectable hint of pride. "Now he is the best progeny tested bull (as
predicted)."
Some more examples

• Sports
– basketball increasingly driven by data analytics
– soccer beginning to follow
• Entertainment
– House of Cards designed based on data analysis
– increasing use of similar tools in Hollywood
• “Visa Says Big Data Identifies Billions of
Dollars in Fraud”
– new Big Data analytics platform on Hadoop
• “Facebook is about to launch Big Data
play”
– starting to connect Facebook with real life

Ok, ok, but ... does it apply to our
customers?
• Norwegian Food Safety Authority
– accumulates data on all farm animals
– birth, death, movements, medication, samples, ...
• Hafslund
– time series from hydroelectric dams, power prices,
meters of individual customers, ...
• Social Security Administration
– data on individual cases, actions taken, outcomes...
• Statoil
– massive amounts of data from oil exploration,
operations, logistics, engineering, ...
• Retailers
– see the Target example above
– also, connections between what people buy, weather
forecast, logistics, ...
How to extract insight from data?

Fig: Monthly Retail Sales in New South Wales (NSW) Retail Department Stores
Types of algorithms

• Clustering
• Association learning
• Parameter estimation
• Recommendation engines
• Classification
• Similarity matching
• Neural networks
• Bayesian networks
• Genetic algorithms

Basically, it's all maths...

• Linear algebra
• Calculus
• Probability theory
• Graph theory
• ...

"Only 10% in devops are know how of work with Big Data. Only 1% are realize they are need 2 Big Data for fault tolerance."

https://twitter.com/devops_borat
Big data skills gap

• Hardly anyone knows this stuff


• It’s a big field, with lots and lots of theory
• And it’s all maths, so it’s tricky to learn

http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap
http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap
Two orthogonal aspects

• Analytics / machine learning


– learning insights from data
• Big data
– handling massive data volumes
• Can be combined, or used separately

Data science?

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
How to process Big Data?

• If relational databases are not enough,


what is?

"Mining of Big Data is problem solve in 2013 with zgrep."

https://twitter.com/devops_borat
MapReduce

• A framework for writing massively parallel


code
• Simple, straightforward model
• Based on “map” and “reduce” functions
from functional programming (LISP)
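A minimal sketch of the map/reduce idea in plain Python rather than Hadoop (illustrative only): a word count where the map step emits (word, 1) pairs and the reduce step sums the counts per key. Function names are assumptions.

from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    # Reduce: sum the counts for each word (the shuffle step groups by key)
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data is big", "data science and big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))
# {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'and': 1}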

NoSQL and Big Data

• Not really that relevant


• Traditional databases handle big data sets,
too
• NoSQL databases have poor analytics
• MapReduce often works from text files
– can obviously work from SQL and NoSQL, too
• NoSQL is more for high throughput
– basically, AP from the CAP theorem, instead of CP
• In practice, really Big Data is likely to be a
mix
– text files, NoSQL, and SQL
The 4th V:Veracity
“The greatest enemy of knowledge is not
ignorance, it is the illusion of knowledge.”
Daniel Borstin, in The Discoverers (1983)

"... when is clean Big Data is get Little Data"

https://twitter.com/devops_borat
Data quality

• A huge problem in practice


– any manually entered data is suspect
– most data sets are in practice deeply problematic
• Even automatically gathered data can be a
problem
– systematic problems with sensors
– errors causing data loss
– incorrect metadata about the sensor
• Never, never, never trust the data without
checking it!
– garbage in, garbage out, etc

Approaches to learning

• Supervised
– we have training data with correct answers
– use training data to prepare the algorithm
– then apply it to data without a correct answer
• Unsupervised
– no training data
– throw data into the algorithm, hope it makes some
kind of sense out of the data

Approaches to learning

• Prediction
– predicting a variable from data
• Classification
– assigning records to predefined groups
• Clustering
– splitting records into groups based on similarity
• Association learning
– seeing what often appears together with what

Issues

• Data is usually noisy in some way


– imprecise input values
– hidden/latent input values
• Inductive bias
– basically, the shape of the algorithm we choose
– may not fit the data at all
– may induce underfitting or overfitting
• Machine learning without inductive bias is
not possible

Underfitting

• Using an algorithm that cannot capture the


full complexity of the data

Overfitting

• Tuning the algorithm so carefully it starts


matching the noise in the training data

“What if the knowledge and data we have are
not sufficient to completely determine the
correct classifier? Then we run the risk of just
hallucinating a classifier (or parts of it) that is
not grounded in reality, and is simply
encoding random quirks in the data.This
problem is called overfitting, and is the
bugbear of machine learning. When your
learner outputs a classifier that is 100%
accurate on the training data but only 50%
accurate on test data, when in fact it could
have output one that is 75% accurate on both,
it has overfit.”

http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Testing

• When doing this for real, testing is crucial


• Testing means splitting your data set
– training data (used as input to algorithm)
– test data (used for evaluation only)
• Need to compute some measure of
performance
– precision/recall
– root mean square error
• A huge field of theory here
– will not go into it in this course
– very important in practice

Missing values

• Usually, there are missing values in the


data set
– that is, some records have some NULL values
• These cause problems for many machine
learning algorithms
• Need to solve somehow
– remove all records with NULLs
– use a default value
– estimate a replacement value
– ...
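A minimal pandas sketch (illustrative only) of the three options above: removing records with NULLs, using a default value, and estimating a replacement value (here the column mean). The column names and data are assumptions.

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 33],
                   "city": ["Pune", "Mumbai", None, "Delhi"]})

dropped = df.dropna()                              # remove all records with NULLs
defaulted = df.fillna({"city": "unknown"})         # use a default value
estimated = df.fillna({"age": df["age"].mean()})   # estimate a replacement value

print(dropped, defaulted, estimated, sep="\n\n")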

Terminology

• Vector
– one-dimensional array
• Matrix
– two-dimensional array
• Linear algebra
– algebra with vectors and matrices
– addition, multiplication, transposition, ...

Top 10 algorithms

Top 10 machine learning algs

1. C4.5
2. k-means clustering
3. Support vector machines
4. the Apriori algorithm
5. the EM algorithm
6. PageRank
7. AdaBoost
8. k-nearest neighbours class.
9. Naïve Bayes
10. CART
From a survey at the IEEE International Conference on Data Mining (ICDM) in December 2006:
"Top 10 algorithms in data mining", by X. Wu et al.
C4.5

• Algorithm for building decision trees


– basically trees of boolean expressions
– each node split the data set in two
– leaves assign items to classes
• Decision trees are useful not just for
classification
– they can also teach you something about the
classes
• C4.5 is a bit involved to learn
– the ID3 algorithm is much simpler
• CART (#10) is another algorithm for
learning decision trees
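As an illustrative sketch (not from the slides), scikit-learn's DecisionTreeClassifier builds such a tree of threshold tests; it implements CART (#10 above) rather than C4.5, but the idea of nodes splitting the data and leaves assigning classes is the same. The data set and depth limit are assumptions.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Each internal node splits the data on one feature threshold;
# the leaves assign items to classes.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))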
Support Vector Machines

• A way to do binary classification on


matrices
• Support vectors are the data points nearest
to the hyperplane that divides the classes
• SVMs maximize the distance between SVs
and the boundary
• Particularly valuable because of “the kernel
trick”
– using a transformation to a higher dimension to
handle more complex class boundaries
• A bit of work to learn, but manageable
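A small scikit-learn sketch of binary classification with an SVM, using the RBF kernel as an instance of "the kernel trick"; the data set and parameters are illustrative assumptions.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A two-class data set with a non-linear boundary
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps points into a higher dimension ("the kernel trick")
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
print("support vectors per class:", svm.n_support_)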
Apriori

• An algorithm for “frequent itemsets”


– basically, working out which items frequently
appear together
– for example, what goods are often bought
together in the supermarket?
– used for Amazon’s “customers who bought this...”
• Can also be used to find association rules
– that is, “people who buy X often buyY” or similar
• Apriori is slow
– a faster, further development is FP-growth
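A toy sketch of the frequent-itemset idea (which item pairs appear together across baskets); this is the brute-force counting that Apriori prunes to stay fast, and the basket data is made up.

from collections import Counter
from itertools import combinations

baskets = [{"bread", "milk"},
           {"bread", "butter", "milk"},
           {"beer", "bread"},
           {"milk", "butter"}]

# Count how often each pair of items is bought together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# "Frequent" pairs: those appearing in at least 2 baskets
print([pair for pair, n in pair_counts.items() if n >= 2])
# [('bread', 'milk'), ('butter', 'milk')]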

http://www.dssresources.com/newsletters/66.php
Expectation Maximization

• A deeply interesting algorithm I’ve seen


used in a number of contexts
– very hard to understand what it does
– very heavy on the maths
• Essentially an iterative algorithm
– skips between “expectation” step and
“maximization” step
– tries to optimize the output of a function
• Can be used for
– clustering
– a number of more specialized examples, too

PageRank

• Basically a graph analysis algorithm


– identifies the most prominent nodes
– used for weighting search results on Google
• Can be applied to any graph
– for example an RDF data set
• Basically works by simulating random walk
– estimating the likelihood that a walker would be
on a given node at a given time
– actual implementation is linear algebra
• The basic algorithm has some issues
– “spider traps”
– graph must be connected
– straightforward solutions to these exist
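A toy power-iteration sketch in numpy (a made-up 3-page graph, not any real implementation):

from numpy import mat, ones

# column j says where page j links to; every column sums to 1
M = mat([[0.0, 0.5, 0.5],
         [0.5, 0.0, 0.5],
         [0.5, 0.5, 0.0]])

d = 0.85                      # damping factor; the "teleport" part handles spider traps
r = mat(ones((3, 1))) / 3.0   # start the random walker with equal probability everywhere

for _ in range(50):           # power iteration = repeated linear algebra
    r = (1 - d) / 3.0 + d * (M * r)

print(r)                      # approximate PageRank of the three pages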

162
AdaBoost

• Algorithm for “ensemble learning”


• That is, for combining several algorithms
– and training them on the same data
• Combining more algorithms can be very
effective
– usually better than a single algorithm
• AdaBoost basically weights training
samples
– giving the most weight to those which are
classified the worst

163
Naïve Bayes

164
Bayes’s Theorem

• Basically a theorem for combining


probabilities
– I’ve observed A, which indicates H is true with
probability 70%
– I’ve also observed B, which indicates H is true with
probability 85%
– what should I conclude?
• Naïve Bayes is basically using this theorem
– with the assumption that A and B are independent
– this assumption is nearly always false, hence
“naïve”
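Spelled out for the A/B example above, under that independence assumption (this is the same combination rule the compute_bayes function later in these slides implements):

p_a, p_b = 0.70, 0.85
p_h = (p_a * p_b) / (p_a * p_b + (1 - p_a) * (1 - p_b))
print(p_h)   # about 0.93: the two observations reinforce each other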

165
Simple example

• Is the coin fair or not?


– we throw it 10 times, get 9 heads and one tail

– we try again, get 8 heads and two tails

• What do we know now?


– can combine data and recompute
– or just use Bayes’s Theorem directly
>>> compute_bayes([0.92, 0.84])
0.9837067209775967
http://www.bbc.co.uk/news/magazine-22310186
68
Ways I’ve used Bayes

• Duke
– record deduplication engine
– estimate probability of duplicate for each property
– combine probabilities with Bayes
• Whazzup
– news aggregator that finds relevant news
– works essentially like spam classifier on next slide
• Tine recommendation prototype
– recommends recipes based on previous choices
– also like spam classifier
• Classifying expenses
– using export from my bank
– also like spam classifier

69
Bayes against spam

• Take a set of emails, divide it into spam and


non-spam (ham)
– count the number of times a feature appears in
each of the two sets
– a feature can be a word or anything you please
• To classify an email, for each feature in it
– consider the probability of email being spam given
that feature to be (spam count) / (spam count +
ham count)
– ie: if “viagra” appears 99 times in spam and 1 in
ham, the probability is 0.99
• Then combine the probabilities with Bayes
http://www.paulgraham.com/spam.html
70
Running the script

• I pass it
– 1000 emails from my Bouvet folder
– 1000 emails from my Spam folder
• Then I feed it
– 1 email from another Bouvet folder
– 1 email from another Spam folder

71
Code
# scan spam
for spam in glob.glob(spamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(spam):
        corpus.spam(token)

# scan ham
for ham in glob.glob(hamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(ham):
        corpus.ham(token)

# compute probability
for email in sys.argv[3 : ]:
    print email
    p = classify(email)
    if p < 0.2:
        print ' Spam', p
    else:
        print ' Ham', p

https://github.com/larsga/py-snippets/tree/master/machine-learning/spam

72
Classify
class Feature:
    def __init__(self, token):
        self._token = token
        self._spam = 0
        self._ham = 0

    def spam(self):
        self._spam += 1

    def ham(self):
        self._ham += 1

    def spam_probability(self):
        return (self._spam + PADDING) / float(self._spam + self._ham + (PADDING * 2))

def compute_bayes(probs):
    product = reduce(operator.mul, probs)
    lastpart = reduce(operator.mul, map(lambda x: 1-x, probs))
    if product + lastpart == 0:
        return 0 # happens rarely, but happens
    else:
        return product / (product + lastpart)

def classify(email):
    return compute_bayes([corpus.spam_probability(f) for f in featurize(email)])

171
Ham output
Ham 1.0
(So, clearly most of the spam is from March 2013...)
Received:2013 0.00342935528121
Date:2013 0.00624219725343
<br 0.0291715285881
background-color: 0.03125
background-color: 0.03125
background-color: 0.03125
background-color: 0.03125
background-color: 0.03125
Received:Mar 0.0332667997339
Date:Mar 0.0362756952842
...
Postboks 0.998107494322
Postboks 0.998107494322
Postboks 0.998107494322
+47 0.99787414966
+47 0.99787414966
+47 0.99787414966
+47 0.99787414966
Lars 0.996863237139
Lars 0.996863237139
23 0.995381062356

172
Spam output
(...and the ham from October 2012)

Spam 2.92798502037e-16
Received:-0400 0.0115646258503
Received:-0400 0.0115646258503
Received-SPF:(ontopia.virtual.vps-host.net: 0.0135823429542
Received-SPF:receiver=ontopia.virtual.vps-host.net; 0.0135823429542
Received:<[email protected]>; 0.0139318885449
Received:<[email protected]>; 0.0139318885449
Received:ontopia.virtual.vps-host.net 0.0170863309353
Received:(8.13.1/8.13.1) 0.0170863309353
Received:ontopia.virtual.vps-host.net 0.0170863309353
Received:(8.13.1/8.13.1) 0.0170863309353
...
Received:2012 0.986111111111
Received:2012 0.986111111111
$ 0.983193277311
Received:Oct 0.968152866242
Received:Oct 0.968152866242
Date:2012 0.959459459459
20 0.938864628821
+ 0.936526946108
+ 0.936526946108
+ 0.936526946108

173
More solid testing

• Using the SpamAssassin public corpus


• Training with 500 emails from
– spam
– easy_ham (2002)
• Test results
– spam_2: 1128 spam, 269 misclassified as ham
– easy_ham 2003: 2283 ham, 217 misclassified as spam
• Results are pretty good for 30 minutes of
effort...

76
http://spamassassin.apache.org/publiccorpus/
Linear regression
Linear regression

• Let’s say we have a number of numerical


parameters for an object
• We want to use these to predict some
other value
• Examples
– estimating real estate prices
– predicting the rating of a beer
– ...
Estimating real estate prices

• Take parameters
– x1 square meters
– x2 number of rooms
– x3 number of floors
– x4 energy cost per year
– x5 meters to nearest subway station
– x6 years since built
– x7 years since last refurbished
– ...
• a x1 + b x2 + c x3 + ... = price
– strip out the x-es and you have a vector
– collect N samples of real flats with prices = matrix
– welcome to the world of linear algebra
Our data set: beer ratings

• Ratebeer.com
– a web site for rating beer
– scale of 0.5 to 5.0
• For each beer we know
– alcohol %
– country of origin
– brewery
– beer style (IPA, pilsener, stout, ...)
• But ... only one attribute is numeric!
– how to solve?
Example

ABV   .se   .nl   .us   .uk   IIPA  Black IPA  Pale ale  Bitter  Rating
8.5   1.0   0.0   0.0   0.0   1.0   0.0        0.0       0.0     3.5
8.0   0.0   1.0   0.0   0.0   0.0   1.0        0.0       0.0     3.7
6.2   0.0   0.0   1.0   0.0   0.0   0.0        1.0       0.0     3.2
4.4   0.0   0.0   0.0   1.0   0.0   0.0        0.0       1.0     3.2
...   ...   ...   ...   ...   ...   ...        ...       ...     ...

Basically, we turn each category into a column of 0.0 or 1.0 values.


Normalization

• If some columns have much bigger values than


the others they will automatically dominate
predictions
• We solve this by normalization
• Basically, all values get resized into the 0.0-1.0
range
• For ABV we set a ceiling of 15%
– compute with min(15.0, abv) / 15.0
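The same rescaling as a tiny helper (a sketch; low/high are per-column choices like the 15% ABV ceiling):

def normalize(value, low, high):
    value = min(high, max(low, value))        # clip into [low, high]
    return (value - low) / float(high - low)  # rescale into the 0.0-1.0 range

print(normalize(8.5, 0.0, 15.0))    # same as min(15.0, abv) / 15.0
print(normalize(17.0, 0.0, 15.0))   # capped at 1.0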
Adding more data

• To get a bit more data, I added manually a


description of each beer style
• Each beer style got a 0.0-1.0 rating on
– colour (pale/dark)
– sweetness
– hoppiness
– sourness
• These ratings are kind of coarse because all
beers of the same style get the same value
Making predictions

• We’re looking for a formula


– a * abv + b * .se + c * .nl + d * .us + ... = rating
• We have n examples
– a * 8.5 + b * 1.0 + c * 0.0 + d * 0.0 + ... = 3.5
• We have one unknown per column
– as long as we have more rows than columns we can
solve the equation
• Interestingly, matrix operations can be used to
solve this easily
Matrix formulation

• Let’s say
– x is our data matrix
– y is a vector with the ratings and
– w is a vector with the a, b, c, ... values
• That is: x * w = y
– this is the same as the original equation
– a x1 + b x2 + c x3 + ... = rating
• If we solve this, we get w = (x^T x)^-1 (x^T y) (the normal equation)
Enter Numpy

• Numpy is a Python library for matrix


operations
• It has built-in types for vectors and matrices
• Means you can very easily work with matrices
in Python
• Why matrices?
– much easier to express what we want to do
– library written in C and very fast
– takes care of rounding errors, etc
Quick Numpy example
>>> from numpy import *
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> [range(10)] * 10
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5,
6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1,
2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8,
9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
>>> m = mat([range(10)] * 10)
>>> m
matrix([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> m.T
matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
[6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
[7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
[8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
[9, 9, 9, 9, 9, 9, 9, 9, 9, 9]])
Numpy solution

• We load the data into


– a list: scores
– a list of lists: parameters
• Then:
x_mat = mat(parameters)
y_mat = mat(scores).T
x_tx = x_mat.T * x_mat

assert linalg.det(x_tx)

ws = x_tx.I * (x_mat.T * y_mat)
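Once ws is known, a prediction is just a matrix product; a small sanity-check sketch (continuing the variables above):

prediction = float(x_mat[0] * ws)   # predicted rating for the first training beer
print(prediction)
print(scores[0])                    # compare with the actual rating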


Does it work?

• We only have very rough information about


each beer (abv, country, style)
– so very detailed prediction isn’t possible
– but we should get some indication
• Here are the results based on my ratings
– 10% imperial stout from US 3.9
– 4.5% pale lager from Ukraine 2.8
– 5.2% German schwarzbier 3.1
– 7.0% German doppelbock 3.5

http://www.ratebeer.com/user/15206/ratings/

89
Beyond prediction

• We can use this for more than just prediction


• We can also use it to see which columns
contribute the most to the rating
– that is, which aspects of a beer best predict the rating
• If we look at the w vector we see the following
– Aspect       LMG    grove
– ABV          0.56   1.1
– colour       0.46   0.42
– sweetness    0.25   0.51
– hoppiness    0.45   0.41
– sourness     0.29   0.87
• Could also use correlation
188
Did we underfit?

• Who says the relationship between ABV


and the rating is linear?
– perhaps very low and very high ABV are both
negative?
– we cannot capture that with linear regression
• Solution
– add computed columns for parameters raised to
higher powers
– abv2, abv3, abv4, ...
– beware of overfitting...
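One way to add those computed columns (a sketch; rows here are the normalized parameter vectors, with ABV assumed to be column 0):

def add_powers(row, column_ix=0, max_power=3):
    value = row[column_ix]
    return row + [value ** p for p in range(2, max_power + 1)]

print(add_powers([0.57, 1.0, 0.0]))   # appends abv^2 and abv^3 as new columns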

189
Scatter plot

[Scatter plot: rating vs. ABV (%); the outliers at very high ABV are freeze-distilled Brewdog beers. Code in Github, requires matplotlib]


92
Trying again

191
Matrix factorization

• Another way to do recommendations is


matrix factorization
– basically, make a user/item matrix with ratings
– try to find two smaller matrices that, when
multiplied together, give you the original matrix
– that is, original with missing values filled in
• Why that works?
– I don’t know
– I tried it, couldn’t get it to work
– therefore we’re not covering it
– known to be a very good method, however

192
Clustering

193
Clustering

• Basically, take a set of objects and sort


them into groups
– objects that are similar go into the same group
• The groups are not defined beforehand
• Sometimes the number of groups to create
is input to the algorithm
• Many, many different algorithms for this

194
Sample data

• Our sample data set is data about aircraft from


DBpedia
• For each aircraft model we have
– name
– length (m)
– height (m)
– wingspan (m)
– number of crew members
– operational ceiling, or max height (m)
– max speed (km/h)
– empty weight (kg)
• We use a subset of the data
– 149 aircraft models which all have values for all of these
properties
• Also, all values normalized to the 0.0-1.0 range
195
Distance

• All clustering algorithms require a distance


function
– that is, a measure of similarity between two objects
• Any kind of distance function can be used
– generally, lower values mean more similar
• Examples of distance functions
– metric distance
– vector cosine
– RMSE
– ...

196
k-means clustering

• Input: the number of clusters to create


(k)
• Pick k objects
– these are your initial clusters
• For all objects, find nearest cluster
– assign the object to that cluster
• For each cluster, compute mean of all
properties
– use these mean values to compute distance to
clusters
– the mean is often referred to as a “centroid”
– go back to previous step
• Continue until no objects change cluster
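A compact sketch of exactly this loop (plain Python, RMSE as the distance; the real aircraft code in the Github sample may differ):

import math, random

def rmse_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / float(len(a)))

def mean_of(objects):
    # centroid: the mean of every property
    return [sum(col) / float(len(objects)) for col in zip(*objects)]

def kmeans(objects, k, iterations=100):
    centroids = random.sample(objects, k)      # pick k objects as initial clusters
    assignment = {}
    for _ in range(iterations):
        changed = False
        clusters = [[] for _ in range(k)]
        for ix, obj in enumerate(objects):
            nearest = min(range(k), key=lambda c: rmse_distance(obj, centroids[c]))
            if assignment.get(ix) != nearest:
                changed = True
            assignment[ix] = nearest
            clusters[nearest].append(obj)
        centroids = [mean_of(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if not changed:                         # no objects changed cluster: done
            break
    return clusters

print(kmeans([[0.1, 0.2], [0.15, 0.22], [0.8, 0.9], [0.82, 0.88]], k=2))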
197
First attempt at aircraft

• We leave out name and number built when


doing comparison
• We use RMSE as the distance measure
• We set k = 5
• What happens?
– first iteration: all 149 assigned to a cluster
– second: 11 models change cluster
– third: 7 change
– fourth: 5 change
– fifth: 5 change
– sixth: 2
– seventh: 1
– eighth: 0

198
cluster5, 4 models

Cluster 5: 3 jet bombers, one propeller. Not too bad.

ceiling : 13400.0
maxspeed : 1149.7
crew : 7.5
length : 47.275
height : 11.65
emptyweight : 69357.5
wingspan : 47.18

The Myasishchev M-50 was a Soviet prototype four-engine supersonic bomber which never attained service
The Myasishchev M-4 Molot is a four-engined strategic bomber
The Convair B-36 "Peacemaker" was a strategic bomber built by Convair and operated solely by the United States Air Force (USAF) from 1949 to 1959
The Tupolev Tu-16 was a twin-engine jet bomber used by the Soviet Union.
199
cluster4, 56 models
Cluster 4: Small, slow propeller aircraft. Not too bad.

ceiling : 5898.2
maxspeed : 259.8
crew : 2.2
length : 10.0
height : 3.3
emptyweight : 2202.5
wingspan : 13.8

The Avia B.135 was a Czechoslovak cantilever monoplane fighter aircraft
The Yakovlev UT-1 was a single-seater trainer aircraft
The Siebel Fh 104 Hallore was a small German twin-engined transport, communications and liaison aircraft
The Yakovlev UT-2 was a single-seater trainer aircraft
The North American B-25 Mitchell was an American twin-engined medium bomber
The Airco DH.2 was a single-seat biplane "pusher" aircraft
The Messerschmitt Bf 108 Taifun was a German single-engine sports and touring aircraft

102
cluster3, 12 models
Cluster 3: Small, very fast jet planes. Pretty good.

ceiling : 16921.1
maxspeed : 2456.9
crew : 2.67
length : 17.2
height : 4.92
emptyweight : 9941
wingspan : 10.1

The Mikoyan MiG-29 is a fourth-generation jet fighter aircraft
The English Electric Lightning is a supersonic jet fighter aircraft of the Cold War era, noted for its great speed.
The Northrop T-38 Talon is a two-seat, twin-engine supersonic jet trainer
The Vought F-8 Crusader was a single-engine, supersonic [fighter] aircraft
The Dassault Mirage 5 is a supersonic attack aircraft
The Mikoyan MiG-35 is a further development of the MiG-29
103
cluster2, 27 models
Cluster 2: Biggish, kind of slow planes. Some oddballs in this group.

ceiling : 6447.5
maxspeed : 435
crew : 5.4
length : 24.4
height : 6.7
emptyweight : 16894
wingspan : 32.8

The Bartini Beriev VVA-14 (vertical take-off amphibious aircraft)
The Fokker 50 is a turboprop-powered airliner
The Junkers Ju 89 was a heavy bomber
The Aviation Traders ATL-98 Carvair was a large piston-engine transport aircraft.
The PB2Y Coronado was a large flying boat patrol bomber
The Beriev Be-200 Altair is a multipurpose amphibious aircraft
The Junkers Ju 290 was a long-range transport, maritime patrol aircraft and heavy bomber

104
cluster1, 50 models

Cluster 1: Small, fast planes. Mostly good, though the Canberra is a poor fit.

ceiling : 11612
maxspeed : 726.4
crew : 1.6
length : 11.9
height : 3.8
emptyweight : 5303
wingspan : 13

The Adam A700 AdamJet was a proposed six-seat civil utility aircraft
The Curtiss P-36 Hawk was an American-designed and built fighter aircraft
The English Electric Canberra is a first-generation jet-powered light bomber
The Heinkel He 100 was a German pre-World War II fighter aircraft
The Kawasaki Ki-61 Hien was a Japanese World War II fighter aircraft
The Learjet 23 is a ... twin-engine, high-speed business jet
The Learjet 24 is a ... twin-engine, high-speed business jet
The Grumman F3F was the last American biplane fighter aircraft

105
Clusters, summarizing

• Cluster 1: small, fast aircraft (750 km/h)


• Cluster 2: big, slow aircraft (450 km/h)
• Cluster 3: small, very fast jets (2500 km/h)
• Cluster 4: small, very slow planes (250 km/h)
• Cluster 5: big, fast jet planes (1150 km/h)

For a first attempt to sort through the data,


this is not bad at all

https://github.com/larsga/py-snippets/tree/master/machine-learning/aircraft

106
Agglomerative clustering

• Put all objects in a pile


• Make a cluster of the two objects closest to
one another
– from here on, treat clusters like objects
• Repeat second step until satisfied

There is code for this, too, in the Github sample
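For comparison, a hedged scipy sketch of the same bottom-up idea (assuming scipy is available; made-up feature rows, not the aircraft data):

from scipy.cluster.hierarchy import linkage, fcluster

data = [[0.1, 0.2, 0.1], [0.12, 0.22, 0.15],
        [0.8, 0.9, 0.7], [0.82, 0.88, 0.75]]

merges = linkage(data, method="average")            # repeatedly merge the two closest clusters
print(fcluster(merges, t=2, criterion="maxclust"))  # cut the tree into 2 clusters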

107
Principal
component analysis

206
PCA

• Basically, using eigenvalue analysis to find


out which variables contain the most
information
– the maths are pretty involved
– and I’ve forgotten how it works
– and I’ve thrown out my linear algebra book
– and ordering a new one from Amazon takes too
long
– ...so we’re going to do this intuitively

207
An example data set

• Two variables
• Three classes
• What’s the longest line we could draw
through the data?
• That line is a vector in two dimensions
• What dimension dominates?
– that’s right: the horizontal
– this implies the horizontal contains most of the
information in the data set
• PCA identifies the most significant
variables

110
Dimensionality reduction

• After PCA we know which dimensions


matter
– based on that information we can decide to throw
out less important dimensions
• Result
– smaller data set
– faster computations
– easier to understand

209
Trying out PCA

• Let’s try it on the Ratebeer data


• We know ABV has the most information
– because it’s the only value specified for each
individual beer
• We also include a new column: alcohol
– this is the amount of alcohol in a pint glass of the
beer, measured in centiliters
– this column basically contains no information at
all; it’s computed from the abv column

210
Complete code
import rblib
from numpy import *

def eigenvalues(data, columns):
    covariance = cov(data - mean(data, axis = 0), rowvar = 0)
    eigvals = linalg.eig(mat(covariance))[0]
    indices = list(argsort(eigvals))
    indices.reverse() # so we get most significant first
    return [(columns[ix], float(eigvals[ix])) for ix in indices]

(scores, parameters, columns) = rblib.load_as_matrix('ratings.txt')

for (col, ev) in eigenvalues(parameters, columns):
    print "%40s %s" % (col, float(ev))

211
Output

abv             0.184770392185
colour          0.13154093951
sweet           0.121781685354
hoppy           0.102241100597
sour            0.0961537687655
alcohol         0.0893502031589
United States   0.0677552513387
....
Eisbock         -3.73028421245e-18
Belarus         -3.73028421245e-18
Vietnam         -1.68514561515e-17

212
MapReduce

213
University pre-lecture, 1991

• My first meeting with university was Open


University Day, in 1991
• Professor Bjørn Kirkerud gave the computer
science talk
• His subject
– some day processors will stop becoming faster
– we’re already building machines with many processors
– what we need is a way to parallelize software
– preferably automatically, by feeding in normal source
code and getting it parallelized back
• MapReduce is basically the state of the art on
that today
214
MapReduce

• A framework for writing massively parallel


code
• Simple, straightforward model
• Based on “map” and “reduce” functions
from functional programming (LISP)

215
http://research.google.com/archive/mapreduce.html

Appeared in:
OSDI'04: Sixth Symposium on Operating System Design and
Implementation,
San Francisco, CA, December, 2004.
216
map and reduce

>>> "1 2 3 4 5 6 7 8".split()


['1', '2', '3', '4', '5', '6', '7', '8']
>>> l = map(int, "1 2 3 4 5 6 7 8".split())
>>> l
[1, 2, 3, 4, 5, 6, 7, 8]
>>> import operator
>>> reduce(operator.add, l)
36

217
MapReduce

1. Split data into fragments


2. Create a Map task for each fragment
– the task outputs a set of (key, value) pairs
3. Group the pairs by key
4. Call Reduce once for each key
– all pairs with same key passed in together
– reduce outputs new (key, value) pairs

Tasks get spread out over worker nodes


Master node keeps track of completed/failed tasks
Failed tasks are restarted
Failed nodes are detected and avoided
Also scheduling tricks to deal with slow nodes
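The model itself is easy to mimic on a single machine; a small Python sketch of steps 1-4 (no Hadoop, no distribution, just the shape of the computation):

from collections import defaultdict

def run_mapreduce(fragments, map_fn, reduce_fn):
    # steps 1-2: run Map over every fragment, collecting (key, value) pairs
    pairs = [pair for fragment in fragments for pair in map_fn(fragment)]
    # step 3: group the pairs by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # step 4: call Reduce once per key
    return dict(reduce_fn(key, values) for key, values in groups.items())

# word count expressed in this model
wc_map = lambda text: [(word, 1) for word in text.split()]
wc_reduce = lambda word, counts: (word, sum(counts))
print(run_mapreduce(["to be or", "not to be"], wc_map, wc_reduce))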

120
Communications

• HDFS
– Hadoop Distributed File System
– input data, temporary results, and results are
stored as files here
– Hadoop takes care of making files available to
nodes
• Hadoop RPC
– how Hadoop communicates between nodes
– used for scheduling tasks, heartbeat etc
• Most of this is in practice hidden from the
developer

219
Does anyone need MapReduce?

• I tried to do book recommendations with


linear algebra
• Basically, doing matrix multiplication to
produce the full user/item matrix with
blanks filled in
• My Mac wound up freezing
• 185,973 books x 77,805 users =
14,469,629,265
– assuming 2 bytes per float = 28 GB of RAM
• So it doesn’t necessarily take that much to
have some use for MapReduce
220
The word count example

• Classic example of using MapReduce


• Takes an input directory of text files
• Processes them to produce word
frequency counts
• To start up, copy data into HDFS
– bin/hadoop dfs -mkdir <hdfs-dir>
– bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>

221
WordCount – the mapper
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

222
WordCount – the reducer

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values)
            sum += val.get();
        context.write(key, new IntWritable(sum));
    }
}

223
The Hadoop ecosystem

• Pig
– dataflow language for setting up MR jobs
• HBase
– NoSQL database to store MR input in
• Hive
– SQL-like query language on top of Hadoop
• Mahout
– machine learning library on top of Hadoop
• Hadoop Streaming
– utility for writing mappers and reducers as
command-line tools in other languages

224
Word count in HiveQL
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

-- temporary table to hold words...
CREATE TABLE words (word STRING);

add file splitter.py;

INSERT OVERWRITE TABLE words
SELECT TRANSFORM(line)
USING 'python splitter.py' AS word
FROM input;

SELECT word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable as word
GROUP BY word;

225
Word count in Pig
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

226
Applications of MapReduce

• Linear algebra operations


– easily mapreducible
• SQL queries over heterogeneous data
– basically requires only a mapping to tables
– relational algebra easy to do in MapReduce
• PageRank
– basically one big set of matrix multiplications
– the original application of MapReduce
• Recommendation engines
– the SON algorithm
• ...

227
Apache Mahout

• Has three main application areas


– others are welcome, but this is mainly what’s there
now
• Recommendation engines
– several different similarity measures
– collaborative filtering
– Slope-one algorithm
• Clustering
– k-means and fuzzy k-means
– Latent Dirichlet Allocation
• Classification
– stochastic gradient descent
– Support Vector Machines
– Naïve Bayes
130
SQL to relational algebra

select lives.person_name, city
from works, lives
where company_name = 'FBC'
  and works.person_name = lives.person_name

229
Translation to MapReduce

• σ(company_name=‘FBC’, works)
– map: for each record r in works, verify the condition,
and pass (r, r) if it matches
– reduce: receive (r, r) and pass it on unchanged
• π(person_name, σ(...))
– map: for each record r in input, produce a new record r’
with only wanted columns, pass (r’, r’)
– reduce: receive (r’, [r’, r’, r’ ...]), output (r’, r’)
• ⋈(π(...), lives)
– map:
• for each record r in π(...), output (person_name, r)
• for each record r in lives, output (person_name, r)
– reduce: receive (key, [record, record, ...]), and perform
the actual join
• ...
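A single-machine sketch of the reduce-side join for this query (made-up records; key = person_name, with the selection on company_name pushed into the map step):

from collections import defaultdict

works = [("joe", "FBC", 1000), ("ann", "FBC", 1200), ("bob", "IBM", 900)]
lives = [("joe", "Oslo"), ("ann", "Bergen"), ("bob", "London")]

# map: tag each record with its key and source relation; filter works on company_name
pairs = [(r[0], ("works", r)) for r in works if r[1] == "FBC"] + \
        [(r[0], ("lives", r)) for r in lives]

# shuffle: group by key
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# reduce: join the records sharing a key, projecting person_name and city
for person, records in groups.items():
    works_side = [r for tag, r in records if tag == "works"]
    lives_side = [r for tag, r in records if tag == "lives"]
    for _ in works_side:
        for (name, city) in lives_side:
            print((name, city))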
230
Lots of SQL-on-MapReduce tools

• Tenzing Google
• Hive Apache Hadoop
• YSmart Ohio State
• SQL-MR AsterData
• HadoopDB Hadapt
• Polybase Microsoft
• RainStor RainStor Inc.
• ParAccel ParAccel Inc.
• Impala Cloudera
• ...

231
Thank You….!

232
