Bigdata Overview PDF

This document discusses managing and analyzing large datasets. It provides an agenda covering topics like SQL and its limitations, data pipelines, Hadoop ecosystem components, NoSQL, machine learning, and the role of cloud computing. It describes how the amount of data being generated is increasing and new approaches are needed to store, analyze and visualize large datasets beyond what is possible with traditional SQL databases and desktop tools.


Managing and Analyzing Large Datasets
Author : Rajdeep Dua
Twitter : @rajdeepdua

Agenda
Introduction, Historical Perspective
SQL and its Limitations
Data Pipelines
Cloud and Big Data
Storing and Serializing Data
Map Reduce, Hadoop Ecosystem
Hadoop Components : MR, Pig, Hive
NoSQL
Big Data in the Cloud
Machine Learning Introduction
Machine Learning Demos

Introduction
Large Data is being Generated
Mobility
Internet of Things
Social Data
Need to
Store Data
Analyze Data
Visualize data

Analysing Data
Data analysis has been done for ages
Traditional terms: Data Mining / Machine Learning
Data science is the new term
Need for New Age Skills
Learn new tools
Learn how to handle large data sets
Learn the domain and how to extract meaning from a LOT of noise

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Historical Perspective
When Data became Important :
Apple II
Windows 3.1, Rise of Desktop Tools like Spreadsheets and SPSS

Inherent Limitation in Desktop Processing and Analysing of Data

Limited Processing Speed, RAM and Storage

SQL Databases
SQL Databases have evolved from System R [1974]
First commercial database by Oracle in 1979
Rise of SQL (Structured Query Language) : SQL 92, SQL 99
Codd's rules : Rules 0 to 12, 13 rules which define the characteristics of a relational database

http://en.wikipedia.org/wiki/Relational_database_management_system

Limitations of RDBMS
Fixed Schema
Cannot Scale to Large Clusters
Very Expensive Licensing
Querying large data sets using SQL on an RDBMS is very time consuming

Era of Big Data : Tradeoff

New approaches to handle large data sets where traditional RDBMSs were failing
Create topologies with 1000s of servers in a single processing unit
Compromise on Consistency
Table entities with no fixed schema
De-normalized data
Store data as key-value or in JSON format
New Programming paradigms
Map Reduce
NoSQL

Four Big Data Rules

No single Database or architecture can handle all the use cases
Evaluate and choose the right architecture based on type of data
Structured, Semi Structured or Unstructured
Streaming or Batch?
Type of Data analysis required
Sources of Data
How is the Data to be presented?

Expose appropriate APIs to extract and process data

Build a Data Pipeline

Analysing Data by Building Data Pipelines

Capture, Curate, Communicate
Jim Gray, Turing Award Winner

Acquire, Parse, Filter, Mine, Represent, Refine, Interact
Ben Fry, Visualization Expert

Identify Problem, Instrument Data Sources, Collect Data, Evaluate, Build Model, Communicate Results
Jeff Hammerbacher, Facebook, Cloudera


Data Driven Development : Traditional Approach

Spreadsheets
Databases
Data Warehousing Tools

Special Tools
SPSS, SAS
Spreadsheets

Limited by the amount of data the tools can handle
Real time processing was very limited

Data Driven Development : Large DataSets

Batch Jobs
Map Reduce
Pig, Hive

Apache Storm
Spark

Apache Mahout
Apache Spark
R
Python Based Tools

Components of a Big Data Data Pipeline

[Figure: pipeline diagram. Stages: Data Ingestion pipeline, Processed Data pipeline, Data Analysis pipeline (Map Reduce, Machine Learning). Elements shown: Source1, Source2, Data Dump, Curated Data, API, Job1, Job2, HDFS, API, Persistence Layer (SQL/NoSQL), Dashboard]

Role of Cloud in Big Data

Provides Infrastructure and Platform choices for Storing and Processing Data
Pay as you go model changes CAPEX to OPEX
Substantial cost savings achieved as data management and analysis moves to the cloud : Public or Private

Increased speed of execution has a positive impact on lifetime cost models.

Cost is reduced over the lifetime of a product or service as the depreciation cost of purchased assets decreases and as efficiencies are introduced.
The speed of cost reduction can be much higher using cloud computing than traditional investment and divestment of IT assets.

Role of Cloud in Big Data

Storage as a Service
Object storage
SQL Engines as a Service
NoSQL as a Service

Map Reduce as a Service

Hadoop, Spark, BigQuery

Machine Learning as a
Service

Key Cloud Players in Big Data Space

Amazon Web Services
Microsoft Azure
Google Compute Engine
Altiscale
Qubole

Storing Data

Traditionally data has been stored in structured schema in relational databases

Questions which cannot be answered by this approach
How to store
Large Character or Binary data a.k.a. CLOBs and BLOBs?
Data which does not have a fixed schema?
Streaming data?

Storing Data
Approach varies depending on the following factors
Privacy Requirements of the data
Type of Organization
Preference for Private or Public Cloud
Public Cloud : store files on AWS S3, Google Storage or Azure Storage
Private Cloud : store files on EMC ViPR, OpenStack Swift

Some Organizations use the Hadoop backend HDFS as a Storage Dump

Storing Data : Data Formats

Store it as
Comma Separated Values
Tab Separated Values
JSON
XML

How is Data Encoded

ASCII
UTF-8 / UTF-16 (Unicode encodings to take care of non-ASCII characters)
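
The formats and encodings listed above can be compared with a small sketch. The record, its field names and the helper `to_delimited` are made up for illustration; only Python's standard `csv` and `json` modules are used.

```python
import csv
import io
import json

def to_delimited(row, delimiter):
    """Render one row as a delimited text line (CSV or TSV)."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerow(row)
    return buf.getvalue()

# Hypothetical record containing one non-ASCII character.
record = {"name": "Zoë", "city": "Oak Park", "salary": 80000.0}
row = list(record.values())

csv_line = to_delimited(row, ",")
tsv_line = to_delimited(row, "\t")
json_line = json.dumps(record, ensure_ascii=False)

# The same text needs a different number of bytes in each encoding.
utf8_size = len(csv_line.encode("utf-8"))
utf16_size = len(csv_line.encode("utf-16-le"))
print(repr(csv_line), utf8_size, utf16_size)

# Plain ASCII cannot represent the non-ASCII character at all.
try:
    csv_line.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("representable in ASCII:", ascii_ok)
```

For mostly ASCII text, UTF-8 tends to be the more compact of the two Unicode encodings, which is one reason it is a common default for CSV and JSON files.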

Serializing Data
Need to send data in an efficient binary format vs text format
Binary Format is much smaller in size.
Serialization and Deserialization is much faster
Apache Thrift
Define your service definition using a Thrift file
Generate bindings for specific languages : Java, Python, C++ etc.
Create the Server using those bindings. Server accepts a TCP connection
Create a client which serializes data using these bindings

Protocol Buffers

Serializing Data
Protocol Buffers
Developed by Google
Used extensively by Google for all its services for communicating with each other

Runtime Performance
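
The size difference between a text format and a binary format can be sketched with the standard library alone. This is not Thrift or Protocol Buffers: `struct` stands in for a generated binding, and the record layout (id, score, count) is a hypothetical example.

```python
import json
import struct

# Hypothetical fixed layout: 32-bit int id, 64-bit float score, 32-bit int count.
# "<" means little endian with no padding, so every record is 16 bytes.
RECORD = struct.Struct("<idi")

def to_text(rec_id, score, count):
    """Serialize the record as UTF-8 encoded JSON text."""
    return json.dumps({"id": rec_id, "score": score, "count": count}).encode("utf-8")

def to_binary(rec_id, score, count):
    """Serialize the record into the fixed binary layout."""
    return RECORD.pack(rec_id, score, count)

def from_binary(payload):
    """Deserialize a binary payload back into (id, score, count)."""
    return RECORD.unpack(payload)

text_payload = to_text(123456, 0.015625, 42)
binary_payload = to_binary(123456, 0.015625, 42)
print(len(text_payload), len(binary_payload))
print(from_binary(binary_payload))
```

Real serialization frameworks add schema evolution and cross-language bindings on top of this idea; the sketch only shows why a binary layout is usually smaller than its text equivalent.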

Big Data Platforms in the Market

Cloudera
MapR
Hortonworks
Pivotal
AWS
Teradata
IBM
Microsoft

Big Data Hadoop : Players Positioning

Map Reduce
A new programming framework to process very large data (sometimes petabytes) over a large cluster of servers
First implemented at Google to build the Search Index and process incoming Ads
Open source version implemented at Yahoo, called Hadoop
Commercial distributions available from Cloudera, Hortonworks, MapR, Greenplum

Map Reduce Programming Model

Input & Output: each a set of key/value pairs
Programmer specifies two functions:

map (in_key, in_value) -> list(out_key, intermediate_value)
Processes input key/value pair
Produces set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list(out_value)
Combines all intermediate values for a particular key
Produces a set of merged output values (usually just one)

Inspired by similar primitives in LISP and other languages
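
The two functions above can be illustrated with a single-process word count. This is a minimal sketch of the programming model only, not Hadoop: the driver `run_map_reduce` and the sample documents are assumptions made for the example, and a real framework would distribute the map, shuffle and reduce phases across a cluster.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """map(in_key, in_value) -> list of (out_key, intermediate_value)."""
    return [(word.lower(), 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """reduce(out_key, list(intermediate_value)) -> list(out_value)."""
    return [sum(intermediate_values)]

def run_map_reduce(records, mapper, reducer):
    # Map + shuffle: group every intermediate value under its key.
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in mapper(in_key, in_value):
            groups[out_key].append(value)
    # Reduce: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

documents = [
    ("doc1", "big data needs big clusters"),
    ("doc2", "data pipelines move data"),
]
counts = run_map_reduce(documents, map_fn, reduce_fn)
print(counts)
```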

Hadoop Ecosystem
Apache Oozie : Workflow
Hive : DW System
Pig Latin : Data Analysis
Mahout : Machine Learning
Map Reduce Framework
Other YARN frameworks : MPI, Giraph
YARN
HDFS
Flume : Unstructured Data
Sqoop : Structured Data
HBase

Hadoop 2.x Core Components

          Storage (HDFS)                  Processing (YARN)
Master    NameNode, Secondary NameNode    Resource Manager
Slave     DataNode                        Node Manager

Hadoop 2.x Core Components

YARN

Resource Manager

Node Manager

HDFS

NameNode

DataNode

HDFS Federation

HDFS has two main layers:
Namespace (NS), managed by the Namenode
Block Storage Service : Block Management (in the Namenode) and Storage (in the Datanodes)

Hive

What is Hive
Data warehousing package built on top of Hadoop
Used for data analysis
Targeted towards users comfortable with SQL
Abstracts complexity of Hadoop
No need to learn Java and Hadoop APIs
Developed by Facebook and maintained by the Community

What is Hive?
Defines a SQL-like query language : QL
Used for Data Warehousing
Allows programmers to plug in custom mappers and reducers
Provides tools to enable ETL

Hive Applications
Data Mining
Log Processing
Predictive Modelling

Hive Architecture

Hive runs as a separate service which talks to Hadoop using a Driver
Hive can be accessed using a Command Line Interface or a Hive Web Interface (HWI), which needs to be enabled

[Figure: Hive architecture. Access paths: JDBC, ODBC, CLI, HWI, Thrift Server. Hive: Driver (compiles, optimizes, executes), Metastore. Hadoop: Master, JobTracker, Name Node, DFS]

Limitations of Hive
Not designed for online transaction processing
Does not offer real-time queries or row-level updates
Latency for a Hive query is generally very high (minutes)
Provides acceptable (not optimal) latency for interactive data browsing

Abilities of HiveQL
Filter rows from a table using a 'where' clause
Store the results of a query into another table
Manage tables and partitions (create, drop and alter)
Store results of a query in a Hadoop DFS directory
Do equi-joins between two tables

Differences from traditional RDBMS

Schema on Read vs Schema on Write
Hive does not verify data when it is loaded, but rather when a query is issued
This makes for a very fast initial load

No Updates, Transactions and Indexes

Hive Data Models

Tables

Analogous to tables in a relational database

Each table has a corresponding directory in HDFS
Example
For table Student
Directory - /hive/warehouse/Student/

Partitions

Analogous to dense indexes on partition columns

Nested subdirectories in HDFS for each combination of partition column values

Hive Data Models

Example : Partition columns - college, branch

HDFS sub-directory for college=IITD and branch = cse

/hive/warehouse/Student/college=IITD/branch=cse

HDFS sub-directory for college=IITD and branch = ece

/hive/warehouse/Student/college=IITD/branch=ece
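
The directory naming pattern in these examples can be sketched as a small helper. The function `partition_path` is hypothetical and only mirrors the layout shown above; it is not part of Hive.

```python
def partition_path(warehouse_dir, table, partition_values):
    """Build the nested directory for one combination of partition column values.

    partition_values is an ordered list of (column, value) pairs; each pair
    becomes one "column=value" subdirectory level.
    """
    levels = [f"{column}={value}" for column, value in partition_values]
    return "/".join([warehouse_dir.rstrip("/"), table] + levels)

cse_dir = partition_path("/hive/warehouse", "Student", [("college", "IITD"), ("branch", "cse")])
ece_dir = partition_path("/hive/warehouse", "Student", [("college", "IITD"), ("branch", "ece")])
print(cse_dir)
print(ece_dir)
```

Because each partition value maps to its own directory, a query that filters on a partition column only needs to read the matching subdirectories.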

Example Use Case : Employees Data with Complex Types

Create Table
Load Data in HDFS
Load Data into Hive
Query Data

Example Use Case : Data

name        salary    subordinates           deductions                                       address
John Doe    100000.0  Mary Smith,Todd Jones  Federal Taxes-.2,State Taxes-.05,Insurance-.1    A1~Michigan Ave~Chicago~IL~B60600
Mary Smith  80000.0   Bill King              Federal Taxes-.2,State Taxes-.05,Insurance-.1    100~Ontario St.~Chicago~IL~60601
Todd Jones  70000.0   Mary Smith             Federal Taxes-.15,State Taxes-.03,Insurance-.1   200~Chicago Ave.~Oak Park~IL~B60700
Bill King   60000.0   Todd Jones             Federal Taxes-.15,State Taxes-.03,Insurance-.1   300~Obscure Dr.~Obscuria~IL~60100

Field Separator         : \t
Array Delimiter         : ,
Map Key Value Separator : -
Struct Separator        : ~
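
How the four delimiters split one line into typed fields can be sketched in plain Python. This is an illustration of the delimiter roles, not Hive's own parser. The sample line is hypothetical: its address has exactly four struct fields (the street is written as "100 Ontario St."), which is an assumption made so the struct splits cleanly.

```python
FIELD_SEP, ITEM_SEP, MAP_KV_SEP, STRUCT_SEP = "\t", ",", "-", "~"

def parse_employee(line):
    """Split one delimited line into name, salary, array, map and struct fields."""
    name, salary, subordinates, deductions, address = line.rstrip("\n").split(FIELD_SEP)
    street, city, state, zip_code = address.split(STRUCT_SEP)
    return {
        "name": name,
        "salary": float(salary),
        # Array: items separated by ","
        "subordinates": subordinates.split(ITEM_SEP),
        # Map: entries separated by ",", key and value separated by "-"
        "deductions": {
            key: float(value)
            for key, value in (item.split(MAP_KV_SEP) for item in deductions.split(ITEM_SEP))
        },
        # Struct: fields separated by "~"
        "address": {"street": street, "city": city, "state": state, "zip": zip_code},
    }

# Hypothetical input line, modelled on the sample data.
line = ("Mary Smith\t80000.0\tBill King\t"
        "Federal Taxes-.2,State Taxes-.05,Insurance-.1\t"
        "100 Ontario St.~Chicago~IL~60601")
employee = parse_employee(line)
print(employee)
```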

Example Use Case : Create Table

CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '-'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Example Use Case : Copy Data into HDFS
$HADOOP_HOME/bin/hadoop fs -put /home/ubuntu/work/hive-book/ch03/types-data-2 /emp_data/employees.txt

Example Use Case : Load Data into Hive
LOAD DATA INPATH '/emp_data/employees.txt' OVERWRITE INTO TABLE employees;

Example Use Case :

Query Data
hive> select * from employees;
OK
John Doe 100000.0 ["Mary Smith","Todd Jones"]
{"Federal Taxes":0.2,"State Taxes .05":null}
{"street":"Insurance-.1","city":null,"state":null,"zip":null}
Mary Smith
80000.0 ["Bill King"]
{"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
{"street":"100~Ontario St.~Chicago~IL~60601","city":null,"state":null,"zip":null}
Todd Jones
70000.0 ["Mary Smith"]
{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
{"street":"200~Chicago Ave.~Oak Park~IL~B60700","city":null,"state":null,"zip":null}
Bill King
60000.0 ["Todd Jones"]
{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
{"street":"300~Obscure Dr.~Obscuria~IL~60100","city":null,"state":null,"zip":null}
Time taken: 0.453 seconds, Fetched: 4 row(s)

Internal Tables
Also called managed tables, because Hive controls the lifecycle of their data
When we drop an internal table, Hive deletes the data in the table
Less convenient for sharing with other tools

External Tables
Unlike internal tables, Hive does not own the data in an external table
Dropping the table does not delete the data; only the metadata for the table is deleted

Why Pig? Contd...

Provides common data operations : filters, joins, ordering etc.
Provides nested data types : tuples, bags and maps
Open source and actively supported by a community of developers

Where to use Pig?

Pig runs on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently
Time sensitive data loads
Processing many data sources
Analytic insight through sampling

Where not to use Pig?

Complex unstructured data : video, audio, raw human readable text
Where perfectly implemented MapReduce code can execute jobs faster than well-written Pig code
Where you would like to have more power to optimize your code

Use cases
Processing of Web logs
Data processing for search platforms
Support for ad hoc queries across large datasets
Quick prototyping of algorithms for processing large datasets

Basic Program Structure

Script
Pig can run a script file that contains Pig commands

Grunt
An interactive shell for running Pig commands
Also possible to run Pig scripts within Grunt using run and exec commands

Embedded
Can run Pig programs from Java, like you can use JDBC to run SQL programs from Java

Components
Pig is made up of two components
Pig Latin
Used to express Data Flows

Execution Environments
Distributed execution on a Hadoop Cluster
Local execution in a single JVM

Pig Latin
A Pig Latin program is made up of operations or transformations that are applied to the input data to produce output.

NoSQL Databases

Need for NoSQL Databases

Scalability
Ability to easily scale up or Down
Can handle and store data depending on the source
Flexible to No Schema

Types of NoSQL Databases

Key Value Databases
Document Based Databases
Column Family Databases
Graph Databases

What is NoSQL
NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability, and agility.

Need for NoSQL Databases
Ability to easily scale up to very large clusters
Can handle and store data depending on the source
Flexible to No Schema

Key-Value Data Stores

Store data in the form of Keys and Values
Use Cases : Web Crawler Data.
Example : Amazon DynamoDB, Bigtable, Redis

Column Family Store

Sparse Matrix which uses Row and Column as a Key.
Used where consistency can be relaxed
Example : Apache Cassandra, HBase

NoSQL Case Study

Application Layers - RDBMS vs NoSQL

ACID vs BASE
Atomicity
Consistency
Isolation
Durability

Basically Available
Soft State
Eventual Consistency

Key Value Stores

A key-value store is a simple database that, when presented with a simple string (the key), returns an arbitrarily large BLOB of data (the value).
Key-value stores have no query language
They provide a way to add and remove key-value pairs
Example
Data Store : Amazon S3 : Key and Binary Data
Google Web Crawler stores data in Google Bigtable
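
The interface described above (a string key in, an opaque BLOB out, with add and remove but no query language) can be sketched as an in-memory class. The class name and method names are illustrative and do not correspond to any particular product's API.

```python
class KeyValueStore:
    """Toy in-memory key-value store: string keys, opaque bytes values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Add or overwrite the value stored under key."""
        if not isinstance(value, bytes):
            raise TypeError("value must be bytes (an opaque BLOB)")
        self._data[key] = value

    def get(self, key):
        """Return the BLOB for key, or None if the key is absent."""
        return self._data.get(key)

    def delete(self, key):
        """Remove the key if present; missing keys are ignored."""
        self._data.pop(key, None)

store = KeyValueStore()
store.put("pages/index.html", b"<html>...</html>")
print(store.get("pages/index.html"))
store.delete("pages/index.html")
print(store.get("pages/index.html"))
```

The store never inspects the value, which is why lookups by anything other than the key are not possible in this model.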

Column Family Stores

Column family systems are important NoSQL data architecture patterns because they can scale to manage large volumes of data. Column family stores use row and column identifiers as general purpose keys for data lookup.
They're sometimes referred to as data stores rather than databases, since they lack features you may expect to find in traditional databases.
For example, they lack typed columns, secondary indexes, triggers, and query languages
Almost all column family stores have been heavily influenced by the original Google Bigtable paper.
HBase, Hypertable, and Cassandra are good examples of systems that have Bigtable-like interfaces, although how they're implemented varies.

Apache Cassandra
Started at Facebook, Open Sourced in 2008.
Top Level Apache Project
Used by Netflix, Twitter and Rackspace
Why Cassandra
Scale
Operations
Data Model

Each Key can be a combination of row and column names, leading to a table with millions of columns

Apache Cassandra Composite Keys

create table Bite (
partkey varchar,
score bigint,
id varchar,
data varchar,
PRIMARY KEY (partkey, score, id)
) with clustering order by (score desc);
select * from Bite;
partkey | score | id | data
----------+---------+-------+----------------------
feed0 | 102 | bite3 | { id : bite2, ...
feed0 | 101 | bite2 | { id : bite3, ...
feed0 | 100 | bite1 | { id : bite1, ...
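
The effect of the composite key can be sketched with a toy model: the first key column (partkey) selects a partition, and the remaining columns order rows inside it, with score descending. The class `BiteTable` is a hypothetical stand-in written for this example, not Cassandra client code, and the inserted rows are sample values.

```python
from collections import defaultdict

class BiteTable:
    """Toy model of PRIMARY KEY (partkey, score, id) with clustering order (score DESC)."""

    def __init__(self):
        # partkey -> {(score, id): data}
        self._partitions = defaultdict(dict)

    def insert(self, partkey, score, id_, data):
        # A row is identified by the full composite key; re-inserting overwrites it.
        self._partitions[partkey][(score, id_)] = data

    def select(self, partkey):
        """Return the rows of one partition, highest score first, then id ascending."""
        rows = self._partitions.get(partkey, {})
        ordered = sorted(rows, key=lambda key: (-key[0], key[1]))
        return [(partkey, score, id_, rows[(score, id_)]) for score, id_ in ordered]

table = BiteTable()
table.insert("feed0", 100, "bite1", "{ id : bite1 }")
table.insert("feed0", 102, "bite3", "{ id : bite3 }")
table.insert("feed0", 101, "bite2", "{ id : bite2 }")
for row in table.select("feed0"):
    print(row)
```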

MongoDB Overview

Documents
At the heart of MongoDB is the document: an ordered set of keys with associated values.
The representation of a document varies by programming language, but most languages have a data structure that is a natural fit, such as a map, hash, or dictionary.
In JavaScript, for example, documents are represented as objects:
{"greeting" : "Hello, world!"}
{"greeting" : "Hello, world!", "foo" : 3}

Collections
A collection is a group of documents. A collection is like a table in a relational database, and a document is like a row
Dynamic Schemas
Collections have dynamic schemas. This means that the documents within a single collection can have any number of different shapes.
For example, both of the following documents could be stored in a single collection:
{"greeting" : "Hello, world!"}

{"foo" : 3}

Databases
Collections are grouped into Databases

[Figure: a Database contains Collections; each Collection contains Documents]

Big Data and Cloud

Big Data in the Cloud

Cloud Computing provides an elastic fabric for hosting Big Data Services
Use Cases for the Cloud
NoSQL Databases as a Service
Object storage
Hadoop and Spark as a Service
Machine Learning as a Service
Streaming Data Support

Amazon Web Services : Handling Large Datasets
Amazon DynamoDB is a NoSQL database as a service
Object Store to store very large objects, with global keys to find them
AWS offers Hadoop as a Service in the form of Elastic MapReduce
Uses open source Hadoop with initial data pulled in from S3

AWS provides its own implementation of Machine Learning algorithms.

Amazon Web Service EMR

Introduction
Amazon EMR is an AWS service that allows users to launch and use resizable Hadoop clusters inside of Amazon's infrastructure
Can be used to analyze large data sets
Greatly simplifies setup and management of the cluster of Hadoop and MapReduce components
EMR instances use Amazon's prebuilt and customized EC2 instances
Can take full advantage of other AWS services

Introduction contd...
EC2 instances are invoked when we start a new Job Flow to form an EMR cluster
A Job Flow is Amazon's term for the complete data processing that occurs through a number of compute steps
A Job Flow is specified by the MapReduce application and its input and output parameters

Architecture

AWS EMR at Runtime

1. Master Instance Group controls the cluster
Runs NameNode and JobTracker
Stores cluster info in MySQL
2. Core Instance Group is created for the life of the cluster
3. Core instances run DataNode and TaskTracker
4. Optional Task Instances can be added or removed based on load
5. S3 is the underlying file system for initial data and final results definition

AWS EMR at Runtime

1. Master Node manages distribution of work and manages cluster state
2. Core and Task Instance Groups read from and write to S3

Accessing EMR
Using the Management Console
Using Command Line Interface
Amazon EMR SDKs Java, PHP, Python, .Net etc
Choosing the Instance Type Depends on the Use Case

EMR Leverage Spot Instances to Reduce Cost

EMR Case Study : Log Processing at Yelp

Use Case : Yelp uses an automated review filter to identify suspicious content and minimize exposure to the consumer. The site also features a wide range of other features that help people discover new businesses (lists, special offers, and events), and communicate with each other.
Technology : Yelp moved from RAID disks and a single Hadoop installation to S3 and EMR.
Map Reduce jobs written in Python using MRJob. Boto client APIs used to configure EMR
Savings : Approx $55000 per year

Machine Learning
Extract useful information from the data by designing models
Use Cases
Clustering
Classification
Decision Trees
Regression

Clustering Techniques
Centroid Based Clustering : k-means

Clustering Techniques
Distribution Based Clustering :
The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects belonging most likely to the same distribution.
One prominent method is known as Gaussian mixture models (using the expectation-maximization algorithm).
The data set is modelled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to fit the data set better.
Example algorithm : Gaussian Mixture Models using Expectation Maximization
Real world example : Classifying genes to a cluster using Gaussian Mixture Models

Clustering : DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester et al. in 1996.
It is a density-based clustering algorithm: given a set of points in some space, it
Groups together points that are closely packed together (points with many nearby neighbors),
Marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).

DBSCAN is one of the most common clustering algorithms and also one of the most cited in scientific literature.
It needs two parameters, ε (epsilon) and minPts, to calculate the clusters
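
The role of the two parameters can be seen in a compact pure-Python sketch of DBSCAN for 2-D points. It follows the textbook formulation (brute-force neighbor search, noise label -1); the sample points and the values eps=1.5 and min_pts=3 are made up for illustration.

```python
NOISE = -1

def region_query(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    px, py = points[index]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        labels[i] = cluster_id           # i is a core point: start a new cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not given up front; it follows from how the dense regions connect under the chosen eps and min_pts.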

Machine Learning : K Means

k-means clustering is a method of vector quantization that is popular for cluster analysis in data mining.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS)
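
A minimal pure-Python sketch of the standard iterative procedure (Lloyd's algorithm) shows the assignment and update steps and the WCSS being minimized. The sample points and the starting centroids are assumptions chosen for the example; practical implementations usually pick the starting centroids randomly or with a seeding heuristic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(component) / len(points) for component in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point goes to the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each mean; an empty cluster keeps its old centroid.
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Within-cluster sum of squares for the final partition.
    wcss = sum(squared_distance(p, centroids[i])
               for i, cluster in enumerate(clusters) for p in cluster)
    return centroids, clusters, wcss

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters, wcss = k_means(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(wcss)
```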

Machine Learning : K Means Customer Segmentation
Problem : Who is shopping at my stores and how can I market to them?
Need : Targeted Marketing
Approach : Convert Real Data Set into Insights

http://www.slideshare.net/jonsedar/customer-clustering-for-marketing?related=1

Process Followed

Analysis
Converted Features into Principal Components
Clustering Technique : K means

EM / Gaussian Mixture Models application to Computational Biology for Gene Clustering
Many probabilistic models in computational biology include latent variables.
In some cases, these latent variables are present due to missing or corrupted data; in most applications of expectation maximization to computational biology, however, the latent factors are intentionally included, and parameter learning itself provides a mechanism for knowledge discovery.
In gene expression clustering, we are given microarray gene expression measurements for thousands of genes under varying conditions, and our goal is to group the observed expression vectors into distinct clusters of related genes.

EM / Gaussian Mixture Models application to Computational Biology for Gene Clustering
One approach is to model the vector of expression measurements for each gene as being sampled from a multivariate Gaussian distribution (a generalization of a standard Gaussian distribution to multiple correlated variables) associated with that gene's cluster.
In this case,
the observed data x correspond to microarray measurements,
the unobserved latent factors z are the assignments of genes to clusters,
the parameters theta include the means and covariance matrices of the multivariate Gaussian distributions representing the expression patterns for each cluster.

For parameter learning, the expectation maximization algorithm alternates between computing probabilities for assignments of each gene to each cluster (E-step) and updating the cluster means and covariances based on the set of genes predominantly belonging to that cluster (M-step).

Classification Algorithms
Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
Example :
assigning a given email into "spam" or "non-spam" classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).

List of Algorithms
Classifiers

Linear Classifiers
Quadratic Classifiers
Support Vector Machines
Decision Trees
Neural Networks

Regression

Linear Regression
Logistic Regression
Polynomial Regression
Generalized Linear Model

Linear Classifier
A linear classifier determines the class of an object by making a classification decision based on the value of a linear combination of the characteristics.
An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector.
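
The decision rule can be sketched in a few lines: compute a weighted sum of the feature vector and compare it with a threshold. The weights, bias and feature values below are hypothetical; in practice they would be learned from training data.

```python
def linear_score(feature_vector, weights, bias):
    """Linear combination of the feature values: w . x + b."""
    return bias + sum(w * x for w, x in zip(weights, feature_vector))

def linear_classify(feature_vector, weights, bias):
    """Return class 1 if the linear score is non-negative, otherwise class 0."""
    return 1 if linear_score(feature_vector, weights, bias) >= 0 else 0

# Hypothetical model with two features.
weights = [2.0, -1.0]
bias = -1.0

print(linear_classify([3.0, 1.0], weights, bias))  # score = 2*3 - 1*1 - 1 = 4
print(linear_classify([0.0, 2.0], weights, bias))  # score = 0 - 2 - 1 = -3
```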

Linear Regression
Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.
The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression
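
Simple linear regression has a closed-form least-squares solution, sketched below in pure Python. The price and quantity figures are invented for illustration, with quantity sold as the dependent variable and price as the explanatory variable.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: price charged and quantity sold at that price.
prices = [1.0, 2.0, 3.0, 4.0]
quantities = [90.0, 80.0, 70.0, 60.0]

slope, intercept = fit_simple_linear_regression(prices, quantities)
print(slope, intercept)

# Predicted quantity at a price that was not observed.
predicted = intercept + slope * 2.5
print(predicted)
```

A negative slope here means the fitted line predicts lower sales as the price rises.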

Analyzing the Impact of Price Changes

Linear regression can also be used to analyze the effect of pricing on consumer behavior. For instance, if a company changes the price on a certain product several times, it can record the quantity it sells for each price level and then perform a linear regression with quantity sold as the dependent variable and price as the explanatory variable. The result would be a line that depicts the extent to which consumers reduce their consumption of the product as prices increase, which could help guide future pricing decisions.

Assessing Risk
Linear regression can be used to analyze risk. For example, a health insurance company might conduct a linear regression plotting number of claims per customer against age and discover that older customers tend to make more health insurance claims. The results of such an analysis might guide important business decisions made to account for risk.

Logistic Regression
Logistic regression can be binomial or multinomial.
Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types (for example, "dead" vs. "alive" or "win" vs. "loss").
Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C").
Logistic regression is used to predict the odds of being a case based on the values of the independent variables (predictors). The odds are defined as the probability that a particular outcome is a case divided by the probability that it is a noncase.
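
The link between probability and odds in the binary case can be sketched as follows. The weights and bias are hypothetical placeholders for a fitted model; the point is only that the odds of the predicted probability equal the exponential of the linear score.

```python
import math

def logistic(z):
    """Map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Probability of being a case divided by probability of being a noncase."""
    return p / (1.0 - p)

def predict_probability(features, weights, bias):
    """Binary logistic regression prediction for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical fitted model with a single predictor.
p = predict_probability([2.0], weights=[0.5], bias=0.0)   # linear score z = 1.0
print(p)
print(odds(p))          # equals e**z for a logistic model
print(odds(0.8))        # an 80% probability corresponds to odds of about 4 to 1
```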

Decision Trees
A decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path.
Find the entropy at each level and choose the attribute whose split gives the lowest entropy for splitting the data
Can be used for Classification and Regression
Other types of trees :
Random Forest
Boosted Trees
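
The entropy-based choice of a split can be sketched in pure Python. The tiny data set, its attribute names and the helper functions are all made up for illustration; a full decision tree would apply the same choice recursively to each resulting partition.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def split_entropy(rows, attribute, label_key):
    """Weighted average entropy of the labels after partitioning rows by attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[label_key])
    return sum(len(part) / len(rows) * entropy(part) for part in partitions.values())

def best_attribute(rows, attributes, label_key):
    """Pick the attribute whose split leaves the lowest entropy."""
    return min(attributes, key=lambda a: split_entropy(rows, a, label_key))

# Hypothetical training rows: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": True,  "play": "no"},
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "rainy", "windy": True,  "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
]
print(split_entropy(rows, "outlook", "play"))
print(split_entropy(rows, "windy", "play"))
print(best_attribute(rows, ["outlook", "windy"], "play"))
```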

Machine Learning with Big Data Platforms

Hadoop has a Machine Learning module called Mahout which provides implementations of various algorithms like Clustering and Classification
Apache Spark provides an implementation where ML can be done in memory
Python based toolkits like scikit-learn are quite popular as well.
R is popular with Analysts and Data Scientists for initial modelling and quick Proofs of Concept
AWS and Microsoft provide their own implementations of algorithms

Appendix

Demos
MemSQL : http://fast.wistia.net/embed/iframe/yi5kwa94uk?popover=true
MetaMarkets : Analytics on Programmatic Advertising : http://fast.wistia.net/embed/iframe/yi5kwa94uk?popover=true

No ratings yet
Tamil Morphological Analysis
18 pages
FINAL - Oblicon Finals Reviewer
No ratings yet
FINAL - Oblicon Finals Reviewer
116 pages
POstmenopausal Bleeding
No ratings yet
POstmenopausal Bleeding
62 pages
Wibes Html5 & Css3: Class 1
No ratings yet
Wibes Html5 & Css3: Class 1
5 pages
Summer Intern Project Submitted By: Manindra Konda IIM Calcutta (PGP 2014 - 2016)
No ratings yet
Summer Intern Project Submitted By: Manindra Konda IIM Calcutta (PGP 2014 - 2016)
23 pages
Family
No ratings yet
Family
6 pages
Excursions in Geometry
80% (10)
Excursions in Geometry
185 pages
Circular3OrientationProgramXI202526pdf_202504040417_0
No ratings yet
Circular3OrientationProgramXI202526pdf_202504040417_0
2 pages
Activity No. 6: Tissues, Glands and Membranes
No ratings yet
Activity No. 6: Tissues, Glands and Membranes
3 pages
Adv Unit1 Answerkey
No ratings yet
Adv Unit1 Answerkey
2 pages
Studying Abroad Vs Locally
No ratings yet
Studying Abroad Vs Locally
3 pages
English L - 4 & 6
No ratings yet
English L - 4 & 6
2 pages
CFR Project
No ratings yet
CFR Project
1 page
Screenshot 2024-10-13 at 4.59.02 AM
No ratings yet
Screenshot 2024-10-13 at 4.59.02 AM
1 page
Subtraction Board Game
No ratings yet
Subtraction Board Game
1 page
Easy Ways To Make The Switch To Organic Foods
No ratings yet
Easy Ways To Make The Switch To Organic Foods
1 page
Leadership Material
No ratings yet
Leadership Material
15 pages
Dell Case Analysis
No ratings yet
Dell Case Analysis
5 pages
Essential Oils Guide
100% (12)
Essential Oils Guide
119 pages
ACR - Career Guidance - G6
100% (1)
ACR - Career Guidance - G6
5 pages
Mastering Apache Iceberg: Managing Big Data in a Modern Data Lake
From Everand
Mastering Apache Iceberg: Managing Big Data in a Modern Data Lake
Robert Johnson
No ratings yet
PySpark Essentials: A Practical Guide to Distributed Computing
From Everand
PySpark Essentials: A Practical Guide to Distributed Computing
Robert Johnson
No ratings yet
Practical Data Strategies and Recipes
From Everand
Practical Data Strategies and Recipes
Tom Henricksen
No ratings yet

Managing and Analyzing

Inherent Limitation in Desktop Processing and Analysing of Data

Era of Big Data Tradeoff

Four Big Data Rules

Expose appropriate APIs to extract and process data

Analysing Data by Building Data Pipelines

Analysing Data by Building Data Pipelines

Data Driven Development : Traditional

Limited by amount of data

Data Driven Development : Large DataSets

Components of a Big Data Data Pipeline

Data Analysis pipeline

Role of Cloud in Big Data

Increased speed of execution has a positive impact on lifetime cost models.

Role of Cloud in Big Data

Map Reduce as a Service

Key Cloud Players in Big Data Space

Traditionally data has been stored in structured schema in relational

Some Organizations use Hadoop Backend HDFS as a Storage Dump

Storing Data : Data Formats

How is Data Encoded

Big Data Platforms in the Market

Big Data Hadoop : Players Positioning

Map Reduce Programming Model

Processes input key/value pair

Map Reduce Framework

Hadoop 2.x Core Components

Hadoop 2.x Core Components

HDFS has two main layers:

Used for Data

Hive runs as a separate service

Differences with traditional RDBMS

No Updates, Transactions and Indexes

Hive Data Models

Analogous to tables in relational database

Analogous to dense indexes on partition column

Hive Data Models

Example Use Case Employees Data with

Example Use Case : Data

Example Use Case : Create Table

Example Use Case :

Copy Data into HDFS

Example Use Case :

Why Pig? Contd...

Where to use Pig?

Where not to use Pig?

Where a perfectly implemented MapReduce code can execute jobs

Basic Program Structure

Need for NoSQL Databases

Types of NoSQL Databases

Key-Value Data Stores

Column Family Store

Application Layers - RDBMS vs NoSQL

Key Value Stores

Column Family Stores

Each Key can be a combination of row and column names, leading to

Apache Cassandra Composite Keys

Big Data and Cloud

Big Data in the Cloud

Amazon Web Services Handling Large

AWS provides its own implementation of Machine Learning

Amazon Web Service EMR

AWS EMR at Runtime

AWS EMR at Runtime

EMR Leverage Spot Instances to Reduce Cost

EMR Case study : Log Processing at Yelp

DBSCAN is one of the most common clustering algorithms

Machine Learning K Means

Machine Learning KMeans Customer

EM / Gaussian Mixture Models application to

EM / Gaussian Mixture Models application to

Analyzing the Impact of Price Changes

Machine Learning with Big Data Platforms
