
Managing and Analyzing

Large DataSets
Author : Rajdeep Dua
Twitter : @rajdeepdua

Agenda
Introduction, Historical Perspective
SQL and its Limitations
Data Pipelines
Cloud and Big Data
Storing and Serializing Data
Map Reduce, Hadoop Ecosystem
Hadoop Components : MR, Pig, Hive
NoSQL
Big Data in the Cloud
Machine Learning Introduction
Machine Learning Demos

Introduction
Large Data is being Generated
Mobility
Internet of Things
Social Data
Need to
Store Data
Analyze Data
Visualize data

Analysing Data
Data analysis has been done for ages
Traditionally called Data Mining / Machine Learning
Data Science is the new term
Need for New Age Skills
Learn new tools
Learn how to handle large data sets
Learn the domain and how to extract meaning from a LOT of noise

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Historical Perspective
When Data became Important :
Apple II
Windows 3.1, Rise of Desktop Tools like Spreadsheets and SPSS

Inherent Limitations in Desktop Processing and Analysis of Data :
Limited Processing Speed, RAM and Storage

SQL Databases
SQL Databases have evolved from System R [1974]
First Commercial Database by Oracle in 1979
Rise of SQL, the Structured Query Language : SQL 92, SQL 99
Codd's Rules : Rules 0 to 12, 13 rules which define the characteristics of a relational Database

http://en.wikipedia.org/wiki/Relational_database_management_system

Limitations of RDBMS
Fixed Schema
Cannot Scale to Large Clusters
Very Expensive Licensing
Querying Large Data Sets using SQL on an RDBMS is very time consuming

Era of Big Data Tradeoffs

New Approaches to handle large data sets where traditional RDBMSs were failing
Create topologies with 1000s of servers in a single processing unit
Compromise on Consistency
Table entities with no fixed schema
De-normalized data
Store data as key-value or in JSON format
New Programming paradigms
Map Reduce
NoSQL

Four Big Data Rules

1. No single Database or architecture can handle all the use cases
2. Evaluate and choose the right architecture based on the type of data
   Structured, Semi-Structured or Unstructured
   Streaming or Batch?
   Type of Data analysis required
   Sources of Data
   How will the Data be presented?
3. Expose appropriate APIs to extract and process data
4. Build a Data Pipeline

Analysing Data by Building Data Pipelines

Jim Gray (Turing Award Winner) :
Capture -> Curate -> Communicate

Ben Fry (Visualization Expert) :
Acquire -> Parse -> Filter -> Mine -> Represent -> Refine -> Interact

Jeff Hammerbacher (Facebook, Cloudera) :
Identify Problem -> Instrument Data Sources -> Collect Data -> Evaluate -> Build Model -> Communicate Results


Data Driven Development : Traditional Approach

Spreadsheets
Databases
Data Warehousing Tools

Special Tools
SPSS, SAS
Spreadsheets

Limited by the amount of data the tools can handle
Real-time processing was very limited

Data Driven Development : Large DataSets

Batch Jobs
Map Reduce
Pig, Hive

Stream Processing
Apache Storm
Spark

Machine Learning
Apache Mahout
Apache Spark
R
Python Based Tools

Components of a Big Data Pipeline

[Diagram: Sources (Source1, Source2) feed a Data Ingestion pipeline into a Data Dump on HDFS; Map Reduce jobs (Job1, Job2) form the Processed Data pipeline that produces Curated Data; a Data Analysis pipeline (Map Reduce, Machine Learning) exposes results via APIs to a Persistence Layer (SQL/NoSQL) and a Dashboard.]

Role of Cloud in Big Data

Provides Infrastructure and Platform choices for Storing and Processing Data
Pay-as-you-go model changes CAPEX to OPEX
Substantial cost savings achieved as data management and analysis move to the cloud : Public or Private

Increased speed of execution has a positive impact on lifetime cost models.
Cost is reduced over the lifetime of a product or service as the depreciation cost of purchased assets decreases and as efficiencies are introduced.
The speed of cost reduction can be much higher using cloud computing than traditional investment and divestment of IT assets.

Role of Cloud in Big Data

Storage as a Service
Object storage
SQL Engines as a Service
NoSQL as a Service
Map Reduce as a Service
Hadoop, Spark, BigQuery
Machine Learning as a Service

Key Cloud Players in the Big Data Space

Amazon Web Services
Microsoft Azure
Google Compute Engine
Altiscale
Qubole

Storing Data

Traditionally data has been stored in structured schemas in relational databases

Questions which cannot be answered by this approach
How to store
Large Character or Binary data, a.k.a. CLOBs and BLOBs?
Data which does not have a fixed schema?
Streaming data?

Storing Data -
The approach varies depending on the following factors
Privacy Requirements of the data
Type of Organization
Preference for Private or Public Cloud
Public Cloud : Store files on AWS S3, Google Storage or Azure Storage
Private Cloud : Store files on EMC ViPR, OpenStack Swift

Some Organizations use the Hadoop Backend HDFS as a Storage Dump

Storing Data - Data Formats

Store it as
Comma Separated Values (CSV)
Tab Separated Values (TSV)
JSON
XML

How is Data Encoded

ASCII
UTF-8 / UTF-16 (Unicode encodings to take care of non-ASCII characters)
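
For illustration, a minimal Python sketch (the record and field names are invented) writing the same record as UTF-8 encoded JSON and as CSV :

import csv, io, json

record = {"name": "José", "city": "São Paulo", "score": 42}

# JSON: ensure_ascii=False keeps non-ASCII characters, then encode to UTF-8 bytes
json_bytes = json.dumps(record, ensure_ascii=False).encode("utf-8")

# CSV: the csv module writes delimited text; switch delimiter to "\t" for TSV
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "city", "score"])
writer.writeheader()
writer.writerow(record)

print(json_bytes.decode("utf-8"))
print(buf.getvalue())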

Serializing Data
Need to send data in an efficient binary format vs a text format
Binary Format is much smaller in size
Serialization and Deserialization overhead is much smaller
Apache Thrift
Define your service definition in a Thrift file
Generate bindings for Specific Languages : Java, Python, CPP etc.
Create the Server using those bindings. The Server accepts a TCP connection
Create a client which serializes data using these bindings

Protocol Buffers

Serializing Data
Protocol Buffers
Developed by Google
Used extensively by Google for all its services for communicating with each other
Runtime Performance
Big Data Platforms in the Market

Cloudera
MapR
Hortonworks
Pivotal
AWS
Teradata
IBM
Microsoft

Big Data Hadoop : Players Positioning

Map Reduce
A new Programming framework to process very large data sets (sometimes petabytes) over large clusters of Servers
First implemented at Google to build the Search Index and Process incoming Ads
Open Source Version Implemented at Yahoo, Called Hadoop
Commercial distributions available from Cloudera, Hortonworks, MapR, Greenplum

Map Reduce Programming Model

Input & Output : each a set of key/value pairs
Programmer specifies two functions (see the sketch below) :

map(in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair
Produces a set of intermediate pairs

reduce(out_key, list(intermediate_value)) -> list(out_value)
Combines all intermediate values for a particular key
Produces a set of merged output values (usually just one)

Inspired by similar primitives in LISP and other languages
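
A minimal word-count sketch of this model in plain Python (no Hadoop involved; the names and the tiny shuffle driver are illustrative) :

from collections import defaultdict

def map_fn(key, value):
    # key: document id (unused here), value: one line of text
    # emits (word, 1) intermediate pairs
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word, values: all intermediate counts for that word
    yield sum(values)

# A tiny driver that shuffles intermediate pairs by key, as the framework would
lines = ["the quick brown fox", "the lazy dog"]
intermediate = defaultdict(list)
for i, line in enumerate(lines):
    for k, v in map_fn(i, line):
        intermediate[k].append(v)

for word, counts in intermediate.items():
    for total in reduce_fn(word, counts):
        print(word, total)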

Hadoop Ecosystem

[Diagram: the Hadoop ecosystem stack]
Apache Oozie : Workflow
Hive : DW System | Pig Latin : Data Analysis | Mahout : Machine Learning
Map Reduce Framework | Other YARN frameworks : MPI, Giraph
YARN
HDFS (with HBase alongside)
Ingestion : Flume (Unstructured Data), Sqoop (Structured Data)

Hadoop 2.x Core Components

[Diagram: Hadoop 2.x core components]
Storage : HDFS - Master : NameNode (plus Secondary NameNode) ; Slave : DataNode
Processing : YARN - Master : Resource Manager ; Slave : Node Manager

Hadoop 2.x Core Components

[Diagram: cluster view - YARN : one Resource Manager coordinating multiple Node Managers ; HDFS : one NameNode coordinating multiple DataNodes]

HDFS Federation

HDFS has two main layers :
Namespace - managed by the Namenode (namespace and block management)
Block Storage - the Block Storage Service provided by the Datanodes

Hive

What is Hive
Data warehousing package built on top of Hadoop
Used for data analysis
Targeted towards users comfortable with SQL
Abstracts complexity of Hadoop
No need to learn Java and Hadoop APIs
Developed by Facebook and maintained by the Community

What is Hive?
Defines a SQL-like Query Language : QL
Used for Data Warehousing
Allows Programmers to plug in custom mappers and reducers
Provides tools to enable ETL

Hive Applications
Data Mining | Log Processing | BI | Predictive Modelling

Hive Architecture

Hive runs as a separate service which talks to Hadoop using a Driver
Hive can be accessed using a Command Line Interface or the Hive Web Interface (HWI), which needs to be enabled

[Diagram: JDBC and ODBC clients connect through the Thrift Server; the CLI and HWI connect directly; all requests go through the Driver (compiles, optimizes, executes), which uses the Metastore and submits work to Hadoop (JobTracker, Name Node, DFS on the Master)]

Limitations of Hive
Not designed for online transaction processing
Does not offer real-time queries and row-level updates
Latency for a Hive query is generally very high (minutes)
Provides acceptable (not optimal) latency for interactive data browsing

Abilities of HiveQL
Filter rows from a table using a 'where' clause
Store the results of a query into another table
Manage tables and partitions (create, drop and alter)
Store results of a query in an HDFS directory
Do equi-joins between two tables

Differences with traditional RDBMS

Schema on Read vs Schema on Write
Hive does not verify data when it is loaded, but rather when a query is issued
This makes for a very fast initial load

No Updates, Transactions and Indexes

Hive Data Models

Tables
Analogous to tables in a relational database
Each table has a corresponding directory in HDFS
Example
For table Student
Directory - /hive/warehouse/Student/

Partitions
Analogous to dense indexes on partition columns
Nested subdirectories in HDFS for each combination of partition column values

Hive Data Models

Example : Partition columns - college and branch

HDFS sub-directory for college=IITD and branch=cse
/hive/warehouse/Student/college=IITD/branch=cse

HDFS sub-directory for college=IITD and branch=ece
/hive/warehouse/Student/college=IITD/branch=ece

Example Use Case : Employees Data with Complex Types

Steps : Create Table -> Load Data into HDFS -> Load Data into HIVE -> Query Data

Example Use Case : Data

John Doe    100000.0  Mary Smith,Todd Jones  Federal Taxes-.2,State Taxes-.05,Insurance-.1    1~Michigan Ave~Chicago~IL~60600
Mary Smith  80000.0   Bill King              Federal Taxes-.2,State Taxes-.05,Insurance-.1    100~Ontario St.~Chicago~IL~60601
Todd Jones  70000.0   Mary Smith             Federal Taxes-.15,State Taxes-.03,Insurance-.1   200~Chicago Ave.~Oak Park~IL~60700
Bill King   60000.0   Todd Jones             Federal Taxes-.15,State Taxes-.03,Insurance-.1   300~Obscure Dr.~Obscuria~IL~60100

Field Separator         : \t
Array Delimiter         : ,
Map Key-Value Separator : -
Struct Separator        : ~

Example Use Case : Create Table

CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '-'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Example Use Case :

Copy Data into HDFS
$HADOOP_HOME/bin/hadoop fs -put /home/ubuntu/work/hive-book/ch03/types-data-2 /emp_data/employees.txt

Load Data into Hive
LOAD DATA INPATH '/emp_data/employees.txt' OVERWRITE INTO TABLE employees;

Example Use Case : Query Data

hive> select * from employees;
OK
John Doe    100000.0  ["Mary Smith","Todd Jones"]  {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}   {"street":"1~Michigan Ave~Chicago~IL~60600","city":null,"state":null,"zip":null}
Mary Smith  80000.0   ["Bill King"]                {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}   {"street":"100~Ontario St.~Chicago~IL~60601","city":null,"state":null,"zip":null}
Todd Jones  70000.0   ["Mary Smith"]               {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}  {"street":"200~Chicago Ave.~Oak Park~IL~60700","city":null,"state":null,"zip":null}
Bill King   60000.0   ["Todd Jones"]               {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}  {"street":"300~Obscure Dr.~Obscuria~IL~60100","city":null,"state":null,"zip":null}
Time taken: 0.453 seconds, Fetched: 4 row(s)

Internal Tables
Also called managed tables, because Hive controls the lifecycle of their data
When we drop an internal table, Hive deletes the data in the table
Less convenient for sharing with other tools

External Tables
Unlike Internal tables, Hive does not own the data in an external table
Dropping the table does not delete the data; only the metadata for the table is deleted

Why Pig? Contd...

Provides common data operations : filters, joins, ordering etc.
Provides nested data types : tuples, bags and maps
Open source and actively supported by a community of developers

Where to use Pig?

Pig runs on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently
Time-sensitive data loads
Processing many data sources
Analytic insight through sampling

Where not to use Pig?

Complex unstructured data
video, audio, raw human-readable text
Where perfectly implemented MapReduce code can execute jobs faster than well-written Pig code
Where you would like to have more power to optimize your code

Use cases
Processing of Web logs
Data processing for search platforms
Support for ad hoc queries across large datasets
Quick prototyping of algorithms for processing large datasets

Basic Program Structure

Script
Pig can run a script file that contains Pig commands

Grunt
An interactive shell for running Pig commands
Also possible to run Pig scripts within Grunt using the run and exec commands

Embedded
Can run Pig programs from Java, like you can use JDBC to run SQL programs from Java

Components
Pig is made up of two components
Pig Latin
Used to express Data Flows

Execution Environments
Distributed execution on a Hadoop Cluster
Local execution in a single JVM

Pig Latin
A Pig Latin program is made up of operations or transformations that are applied to the input data to produce output.

NoSQL Databases

Need for NoSQL Databases

Scalability : Ability to easily scale up or down
Can handle and store data depending on the source
Flexible to No Schema

Types of NoSQL Databases

Key-Value Databases
Document Based Databases
Column Family Databases
Graph Databases

What is NoSQL
NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability, and agility.


Need for NoSQL Databases
Ability to easily scale up to very large clusters
Can handle and store data depending on the source
Flexible to No Schema

Key-Value Data Stores

Store data in the form of Keys and Values
Use Cases : Web Crawler Data
Example : Amazon DynamoDB, BigTable, Redis

Column Family Store

A sparse matrix which uses Row and Column as a Key
Used where consistency can be relaxed
Example : Apache Cassandra, HBase

NoSQL Case Study

Application Layers - RDBMS vs NoSQL

ACID vs BASE

ACID : Atomicity, Consistency, Isolation, Durability
BASE : Basic Availability, Soft State, Eventual Consistency

Key Value Stores

A key-value store is a simple database that, when presented with a simple string (the key), returns an arbitrarily large BLOB of data (the value).
Key-value stores have no query language
They provide a way to add and remove key-value pairs
Example
Data Store : Amazon S3 : Key and Binary Data
The Google Web Crawler stores data in Google BigTable
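
A minimal sketch of the key-value model using the redis-py client (assumes a Redis server running on localhost; the keys and values are invented) :

import redis

# Connect to a local Redis server (assumption: default host and port)
r = redis.Redis(host="localhost", port=6379, db=0)

# The entire "query language" is add/get/remove on keys
r.set("crawl:example.com:/index.html", b"<html>...page bytes...</html>")
page = r.get("crawl:example.com:/index.html")   # returns the BLOB, or None
r.delete("crawl:example.com:/index.html")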

Column Family Stores

Column family systems are important NoSQL data architecture patterns because they can scale to manage large volumes of data. Column family stores use row and column identifiers as general-purpose keys for data lookup.
They're sometimes referred to as data stores rather than databases, since they lack features you may expect to find in traditional databases.
For example, they lack typed columns, secondary indexes, triggers, and query languages
Almost all column family stores have been heavily influenced by the original Google Bigtable paper.
HBase, Hypertable, and Cassandra are good examples of systems that have Bigtable-like interfaces, although how they're implemented varies.

Apache Cassandra
Started at Facebook, Open Sourced in 2008.
Top Level Apache Project
Used by Netflix, Twitter and Rackspace
Why Cassandra
Scale
Operations
Data Model

Each Key can be a combination of row and column names, leading to a table with millions of columns

Apache Cassandra - Composite Keys

create table Bite (
  partkey varchar,
  score bigint,
  id varchar,
  data varchar,
  PRIMARY KEY (partkey, score, id)
) with clustering order by (score desc);

select * from Bite;

 partkey | score | id    | data
---------+-------+-------+----------------------
 feed0   |   102 | bite3 | { id : bite3, ...
 feed0   |   101 | bite2 | { id : bite2, ...
 feed0   |   100 | bite1 | { id : bite1, ...

MongoDB Overview

Documents
At the heart of MongoDB is the document : an ordered set of keys with associated values.
The representation of a document varies by programming language, but most languages have a data structure that is a natural fit, such as a map, hash, or dictionary.
In JavaScript, for example, documents are represented as objects :
{"greeting" : "Hello, world!"}
{"greeting" : "Hello, world!", "foo" : 3}

Collections
A collection is a group of documents. A collection is like a table in a relational database
Dynamic Schemas
Collections have dynamic schemas. This means that the documents within a single collection can have any number of different shapes.
For example, both of the following documents could be stored in a single collection :
{"greeting" : "Hello, world!"}
{"foo" : 3}

Databases
Collections are grouped into Databases

[Diagram: a Database contains multiple Collections, and each Collection contains multiple Documents]
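
A short pymongo sketch of this hierarchy (assumes a MongoDB server on localhost; the database and collection names are illustrative), storing two differently shaped documents in one collection :

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumption: local MongoDB
db = client["analytics"]       # a database groups collections
greetings = db["greetings"]    # a collection groups documents

# Dynamic schema: documents in the same collection may have different shapes
greetings.insert_one({"greeting": "Hello, world!"})
greetings.insert_one({"greeting": "Hello, world!", "foo": 3})

for doc in greetings.find():
    print(doc)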

Big Data and Cloud

Big Data in the Cloud

Cloud Computing provides an elastic fabric for hosting Big Data Services
Use Cases for the Cloud
NoSQL Databases as a Service
Object storage
Hadoop and Spark as a Service
Machine Learning as a Service
Streaming Data Support

Amazon Web Services : Handling Large DataSets

Amazon DynamoDB is a NoSQL database as a service
Object Store to store Very Large Objects and global keys to find them
AWS offers Hadoop as a Service in the form of Elastic Map Reduce (EMR)
Uses Open Source Hadoop with Initial Data pulled in from S3
AWS provides its own implementation of Machine Learning Algorithms.

Amazon Web Services EMR

Introduction
Amazon EMR is an AWS service that allows users to launch and use resizable Hadoop clusters inside of Amazon's infrastructure
Can be used to analyze large data sets
Greatly simplifies setup and management of the cluster of Hadoop and MapReduce components
EMR instances use Amazon's prebuilt and customized EC2 instances
Can take full advantage of other AWS services

Introduction contd...
EC2 instances are invoked when we start a new Job Flow to form an EMR cluster
A Job Flow is Amazon's term for the complete data processing that occurs through a number of compute steps
A Job Flow is specified by the MapReduce application and its input and output parameters

Architecture

AWS EMR at Runtime

1. The Master Instance Group controls the cluster; runs the NameNode and Job Tracker; stores cluster info in MySQL
2. The Core Instance Group is created for the life of the cluster
3. Core instances run the DataNode and TaskTracker
4. Optional Task Instances can be added or subtracted based on load
5. S3 is the underlying file system for initial data and final results

AWS EMR at Runtime

1. The Master Node manages distribution of work and manages cluster state
2. Core and Task Instance Groups read from and write to S3

Accessing EMR
Using the Management Console
Using the Command Line Interface
Amazon EMR SDKs : Java, PHP, Python, .NET etc.
Choosing the Instance Type depends on the Use Case

EMR leverages Spot Instances to reduce cost

EMR Case Study : Log Processing at Yelp

Use Case : Yelp uses an automated review filter to identify suspicious content and minimize exposure to the consumer. The site also offers a wide range of other features that help people discover new businesses (lists, special offers, and events), and communicate with each other.
Technology : Yelp moved from RAID disks and a single Hadoop installation to using S3 and EMR.
Map Reduce jobs are written in Python using MRJob (see the sketch below). The Boto client APIs are used to configure EMR.
Savings : Approx. $55,000 per year
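
A minimal MRJob-style job of the kind mentioned above (a sketch only: the class name, input format and field layout are assumptions; on EMR it would be launched with the -r emr runner) :

from mrjob.job import MRJob

class MRLogLineCount(MRJob):
    """Counts log lines per status code - a toy stand-in for Yelp-style log processing."""

    def mapper(self, _, line):
        # Assumption: the status code is the last whitespace-separated field
        fields = line.split()
        if fields:
            yield fields[-1], 1

    def reducer(self, status, counts):
        yield status, sum(counts)

if __name__ == "__main__":
    MRLogLineCount.run()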

Machine Learning
Extract useful information from the data by designing models
Use Cases
Clustering
Classification
Decision Trees
Regression

Clustering Techniques
Centroid Based Clustering : k-means

Clustering Techniques
Distribution Based Clustering :
The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects belonging most likely to the same distribution.
One prominent method is known as Gaussian mixture models (using the expectation-maximization algorithm).
The data set is modelled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to fit the data set better.
Example Algorithm : Gaussian Mixture Models using Expectation Maximization
Real World example : Classifying genes to a cluster using Gaussian Mixture Models
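
A small scikit-learn sketch of fitting a Gaussian mixture with EM (synthetic data; the component count and parameters are illustrative) :

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Fitting a 2-component mixture runs the EM algorithm internally
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gmm.predict(X)            # hard cluster assignments
probs = gmm.predict_proba(X[:3])   # soft (E-step style) responsibilities
print(gmm.means_)                  # learned cluster means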

Clustering : DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester et al. in 1996.
It is a density-based clustering algorithm : given a set of points in some space, it
Groups together points that are closely packed together (points with many nearby neighbors),
Marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).

DBSCAN is one of the most common clustering algorithms and also one of the most cited in scientific literature.
It needs two parameters, epsilon and minPts, to calculate the clusters
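
A minimal scikit-learn DBSCAN sketch (synthetic data; the eps and min_samples values are illustrative) :

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered outliers
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2)),
               rng.uniform(-2, 6, (5, 2))])

# eps is the neighborhood radius (epsilon), min_samples is minPts
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print(set(db.labels_))  # cluster ids; -1 marks points labelled as noise/outliers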

Machine Learning : K-Means

k-means clustering is a method of vector quantization that is popular for cluster analysis in data mining.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (<= n) sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS)
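
A short scikit-learn sketch (synthetic data; k is illustrative); note that inertia_ is exactly the WCSS being minimized :

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # the k means (cluster prototypes)
print(km.labels_[:10])      # nearest-mean assignment per observation
print(km.inertia_)          # within-cluster sum of squares (WCSS)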

Machine Learning : K-Means Customer Segmentation

Problem : Who is shopping at my stores and how can I market to them?
Need : Targeted Marketing
Approach : Convert a Real Data Set into Insights

http://www.slideshare.net/jonsedar/customer-clustering-for-marketing?related=1

Process Followed

Analysis
Converted Features into Principal Components
Clustering Technique : K-means

EM / Gaussian Mixture Models applied to Computational Biology for Gene Clustering

Many probabilistic models in computational biology include latent variables.
In some cases, these latent variables are present due to missing or corrupted data; in most applications of expectation maximization to computational biology, however, the latent factors are intentionally included, and parameter learning itself provides a mechanism for knowledge discovery.
In gene expression clustering, we are given microarray gene expression measurements for thousands of genes under varying conditions, and our goal is to group the observed expression vectors into distinct clusters of related genes.

EM / Gaussian Mixture Models applied to Computational Biology for Gene Clustering

One approach is to model the vector of expression measurements for each gene as being sampled from a multivariate Gaussian distribution (a generalization of a standard Gaussian distribution to multiple correlated variables) associated with that gene's cluster.
In this case,
the observed data x correspond to microarray measurements,
the unobserved latent factors z are the assignments of genes to clusters,
the parameters theta include the means and covariance matrices of the multivariate Gaussian distributions representing the expression patterns for each cluster.

For parameter learning, the expectation maximization algorithm alternates between computing probabilities for assignments of each gene to each cluster (E-step) and updating the cluster means and covariances based on the set of genes predominantly belonging to that cluster (M-step).

Classification Algorithms
Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
Example :
assigning a given email to the "spam" or "non-spam" class, or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).

List of Algorithms
Classifiers

Linear Classifiers
Quadratic Classifiers
Support Vector Machines
Decision Trees
Neural Networks

Regression

Linear Regression
Logistic Regression
Polynomial Regression
Generalized Linear Model

Linear Classifier
A linear classifier determines the class of an object by making a classification decision based on the value of a linear combination of its characteristics.
An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector.
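
A tiny numpy sketch of the idea (the weights and feature values are invented) : the class is decided by the sign of the linear combination w.x + b of the feature vector :

import numpy as np

w = np.array([0.8, -0.4, 0.3])   # learned weights (illustrative values)
b = -0.2                         # bias term
x = np.array([1.0, 2.0, 0.5])    # feature vector for one object

score = np.dot(w, x) + b         # linear combination of the characteristics
label = 1 if score > 0 else 0    # classification decision
print(score, label)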

Linear Regression
Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.
The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression

Analyzing the Impact of Price Changes

Linear regression can also be used to analyze the effect of pricing on consumer behavior. For instance, if a company changes the price on a certain product several times, it can record the quantity it sells for each price level and then perform a linear regression with quantity sold as the dependent variable and price as the explanatory variable. The result would be a line that depicts the extent to which consumers reduce their consumption of the product as prices increase, which could help guide future pricing decisions.
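
A small scikit-learn sketch of that price example (the price and quantity numbers are invented for illustration) :

import numpy as np
from sklearn.linear_model import LinearRegression

# Price (explanatory variable) vs quantity sold (dependent variable)
price = np.array([[1.0], [1.5], [2.0], [2.5], [3.0]])
quantity = np.array([100, 88, 80, 70, 62])

model = LinearRegression().fit(price, quantity)

print(model.coef_[0])          # slope: change in quantity per unit price increase
print(model.intercept_)        # quantity the line predicts at price 0
print(model.predict([[2.2]]))  # expected quantity at a new price point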

Assessing Risk
Linear regression can be used to analyze risk. For example, a health insurance company might conduct a linear regression plotting the number of claims per customer against age and discover that older customers tend to make more health insurance claims. The results of such an analysis might guide important business decisions made to account for risk.

Logistic Regression
Logistic regression can be binomial or multinomial.
Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types (for example, "dead" vs. "alive" or "win" vs. "loss").
Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C").
Logistic regression is used to predict the odds of being a case based on the values of the independent variables (predictors). The odds are defined as the probability that a particular outcome is a case divided by the probability that it is a noncase.
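
A minimal scikit-learn sketch of binary logistic regression (toy data; the feature and labels are invented); predict_proba gives the case probability behind the odds :

import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature (e.g., a score) and a binary outcome (0 = "loss", 1 = "win")
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

p = clf.predict_proba([[3.5]])[0, 1]  # probability of being a case
odds = p / (1 - p)                    # odds = P(case) / P(noncase)
print(p, odds)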

Decision Trees
A decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path.
Find the entropy at each level and choose the attribute with the lowest entropy (highest information gain) for splitting the data
Can be used for Classification and Regression
Other types of trees :
Random Forest
Boosted Trees
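
A short scikit-learn sketch using entropy as the splitting criterion (toy data, invented for illustration) :

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: two features, binary label
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

# criterion="entropy" picks splits by information gain, as described above
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)

print(tree.predict([[2, 1], [0, 0]]))  # class reached by each path through the tree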

Machine Learning with Big Data Platforms

Hadoop has a Machine Learning module called Mahout which provides implementations of various algorithms like Clustering and Classification
Apache Spark provides an implementation where ML can be done In-Memory
Python Based toolkits like SciKit-Learn are quite popular as well.
R is popular with Analysts and Data Scientists for initial modelling and quick Proofs of Concept
AWS and Microsoft provide their own implementations of algorithms

Appendix

Demos
MemSQL : http://fast.wistia.net/embed/iframe/yi5kwa94uk?popover=true
MetaMarkets : Analytics on Programmatic Advertising :
http://fast.wistia.net/embed/iframe/yi5kwa94uk?popover=true
