DSBDA Unit 1
Unit Objectives
1. To introduce the basic need of Big Data and Data Science to handle huge amounts of data.
2. To understand the application and impact of Big Data.
Unit outcomes:
1. To understand Big Data primitives.
2. To learn different programming platforms for big data analytics.
Outcome Mapping: PEO: I, V; PEO: c, e; CO: 1, 2; PSO: 3, 4
Books:
1. Krish Krishnan, Data Warehousing in the Age of Big Data, Elsevier, 1st Edition, ISBN: 9780124058910.
INTRODUCTION: DATA SCIENCE AND BIG DATA
[Figure: Data Science shown at the intersection of Statistics, Domain Expertise, Advanced Computing, Hacker Mindset, and Visualization]
• Data comes in many forms, but at a high level, it falls into three
categories: structured, semi-structured, and unstructured.
• Structured data:
- highly organized data
- exists within a repository such as a database (or a comma-separated values [CSV] file)
- easily accessible
- the format of the data makes it appropriate for queries and computation (by using languages such as Structured Query Language (SQL))
• Unstructured data: lacks any content structure at all (for example, an audio stream or natural language text).
• Semi-structured data: includes metadata, or data that can be more easily processed than unstructured data by using semantic tagging.
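To make the three categories concrete, here is a small illustrative Python sketch (the records and field names are made up, not from the slides):

import csv, json, io

# Structured: rows and columns, directly queryable
structured = io.StringIO("id,name,city\n1,Asha,Pune\n2,Ravi,Mumbai\n")
rows = list(csv.DictReader(structured))
print(rows[0]["city"])                         # -> Pune

# Semi-structured: self-describing keys (metadata), flexible shape
semi = '{"id": 1, "name": "Asha", "address": {"city": "Pune"}}'
print(json.loads(semi)["address"]["city"])     # -> Pune

# Unstructured: free text; any structure must be inferred by parsing
unstructured = "Asha moved to Pune last year and loves the city."
print("Pune" in unstructured)                  # -> True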
Data and its structure
• This stage includes:
- sourcing the data from one or more data sets (in addition to reducing the set to the required data)
- normalizing the data so that data merged from multiple data sets is consistent
- parsing data into some structure or storage for further use
• After you have collected and merged your data set, the next step is cleansing.
• Data sets in the wild are typically messy and infected with any number of common issues.
• The machine learning step that follows assumes a cleansed data set; raw data might not be ready for processing by a machine learning algorithm.
• In one model, the algorithm processes the data and creates a new data product as the result.
Model validation
• Used to understand how the model behaves in production, after the model is trained.
• For that purpose, a small amount of the available training data is reserved to be tested against the final model (called test data).
• Training data is used to train the machine learning model.
• Test data is used when the model is complete, to validate how well it generalizes to unseen data (see the sketch below).
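A minimal sketch of this split, assuming scikit-learn (the slides do not name a library):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# hold back 20% of the data as test data; train only on the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on unseen test data:", model.score(X_test, y_test))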
Operations:
• The end goal of the data science pipeline.
• Creating a visualization for the data product.
• Deploying the machine learning model in a production environment to operate on unseen data, providing predictions or classifications.
Model deployment:
• When the product of the machine learning phase is a model, it will be deployed into some production environment to apply to new data.
• This model could be a prediction system.
Example: I/P: historical financial data (e.g. sales & revenue) -> Prediction System -> O/P: classification of whether a company is a reasonable acquisition target.
Model visualization:
• In smaller-scale data science, the product is data, instead of a model produced in the machine learning phase.
• The data product answers some questions about the original data set.
• Options for visualization are vast and can be produced with, for example, the R programming language.
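The slide mentions R; as an illustrative alternative, a small Python/matplotlib sketch (with made-up counts) that turns a data product into a chart:

import matplotlib.pyplot as plt

# hypothetical data product: record counts per data category
categories = ["structured", "semi-structured", "unstructured"]
counts = [120, 45, 230]

plt.bar(categories, counts)
plt.title("Records by data category")
plt.ylabel("count")
plt.show()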
Summary- Definitions of Data Science
• With the evolution of Big Data, this data can be accessed & analyzed on a regular basis to generate useful information.
• Sorting, organizing, and analyzing this critical data in a systematic manner is what Big Data is about.
Big Data
• Sports teams are using data for tracking ticket sales & even for
tracking team strategies.
Big Data
• Big data is a pool of huge amounts of data of all types , shapes and
formats collected from different sources.
Evolution of Bigdata
Big Data is the new term of data evolution, driven by the velocity, variety & volume of data.
[Figure: Big Data and Data Science drawing on analysis, distributed systems, data storage, parallel processing, data mining, and artificial intelligence]
Structured Data + Unstructured Data + Semi-Structured Data = Big Data
Elements of Big Data
• Volume
• Velocity
• Variety
• Veracity
Volume
• E.g.: the Internet has around 14.3 trillion live pages; 48 billion web pages are indexed by Google Inc. and 14 billion web pages are indexed by Microsoft Bing.
Velocity
• These systems are able to handle data in batches every few hours.
• E.g.: GPS & social networking sites such as Facebook produce data of all types, including text, images, and videos.
Veracity
• Out of the huge amount of data, only correct & consistent data can be used for further analysis.
• Higher-priority data is kept at the center, & supporting data which was required but not available or accessible previously can now be made available & accessible with the help of multiple channels.
Globalization
• Globalization is a key trend that has radically changed the commerce of the world, from manufacturing to customer service.
Personalization of services
Data sources.
• All big data solutions start with one or more data sources.
• Examples include:
• Application data stores, such as relational databases.
• Static files produced by applications, such as web server log files.
• Real-time data sources, such as IoT devices.
Data storage.
• Data for batch processing operations is typically stored in a
distributed file store that can hold high volumes of large files in
various formats.
• This kind of store is often called a data lake.
Big Data Processing Architectures
Batch processing.
• data files are processed using long-running batch jobs to filter,
aggregate, and otherwise prepare the data for analysis.
• Usually these jobs involve reading source files, processing them,
and writing the output to new files.
Real-time message ingestion.
• If the solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream
processing.
Stream processing.
After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data
for analysis. The processed stream data is then written to an
output sink.
Big Data Processing Architectures
Storage
• The first & major challenge of Big Data is storage.
• As Big Data grows rapidly, there is a need to process this huge data as well as to store it.
• We need an additional 0.5 times the storage to process & store the intermediate result sets.
• Due to the design of the underlying software, we do not consume all the storage that is available on a disk.
Processing
• Processing is to combine some form of logical and mathematical calculations together in one cycle of operation.
1. CPU or processor.
2. Memory
3. Software
Data processing infrastructure challenges
CPU or processor.
• With each generation:
- the computing speed and processing power have increased
- leading to more processing capabilities
- access to wider memory
- architecture evolution within the software layers
Memory.
• The storage of data to disk for offline processing proved the need for storage evolution and data management.
Software
• Main component of data processing.
Speed or throughput
Advantages:
• Requires minimal resources from both people & system perspectives.
4. Cluster architecture.
• Machines are connected in a network architecture.
• Both software and hardware work together to process data or compute requirements in parallel.
• Each machine in a cluster is associated with a task that is processed locally, and the result sets are collected to a master server that returns them back to the user.
5. Peer-to-peer architecture.
• No dedicated servers and clients; instead, all the processing
responsibilities are allocated among all machines, known as
peers.
• Each machine can perform the role of a client or server or
just process data.
Big Data Processing Architectures
• All data coming into the system goes through these two paths:
• A batch layer (cold path) stores all of the incoming data in its
raw form and performs batch processing on the data. The result
of this processing is stored as a batch view.
• A speed layer (hot path) analyzes data in real time. This layer is
designed for low latency, at the expense of accuracy.
Big Data Processing Architectures
• In other words, the hot path has data for a relatively small
window of time, after which the results can be updated with
more accurate data from the cold path.
Big Data Processing Architectures
• The raw data stored at the batch layer is immutable.
• The ability to recompute the batch view from the original raw
data is important, because it allows for new views to be
created as the system evolves.
Big Data Processing Architectures
Lambda Architecture
Batch Layer (Cold Path)
Stores all incoming data & performs batch processing
Manages all historical data
Recomputes the results, e.g. using machine learning models
Results come at high latency due to computational cost
Data can only be appended, not updated or deleted
Data is stored using in-memory databases or long-term persistent stores such as NoSQL storage
Uses MapReduce
Speed Layer
Provides low-latency results
Data is processed in real time
Incremental algorithms
Creating and deleting data sets is possible
Big Data Processing Architectures
Lambda Architecture
Serving Layer:
Users fire queries against the batch and real-time views
Applications:
Ad-hoc queries
Netflix, Twitter, Yahoo
Pros:
The batch layer manages historical data, so there is low error when the system crashes
Good speed and reliability
Fault tolerance and scalable processing
Cons:
Caching overhead, complexity, duplicate computation
Difficult to migrate or reorganize
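A toy, single-process Python sketch of how the layers fit together (names and numbers are illustrative, not from the slides): the batch view is recomputed from all raw events, the speed view is updated incrementally, and the serving layer merges the two to answer a query.

from collections import Counter

# batch view: recomputed from ALL raw (immutable, append-only) events, e.g. nightly
raw_events = ["page_a", "page_b", "page_a", "page_c"]
batch_view = Counter(raw_events)

# speed view: incremental counts for events that arrived after the last batch run
speed_view = Counter()
for event in ["page_a", "page_c"]:      # real-time stream
    speed_view[event] += 1              # low-latency, incremental update

# serving layer: merge both views to answer a query
def page_views(page):
    return batch_view[page] + speed_view[page]

print(page_views("page_a"))   # 3 = 2 (batch) + 1 (speed)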
Big Data Processing Architectures
Kappa Architecture
• A drawback to the lambda architecture is its complexity.
Processing logic appears in two different places — the cold and hot
paths — using different frameworks. This leads to duplicate
computation logic and the complexity of managing the architecture
for both paths.
• It has the same basic goals as the lambda architecture, but with an
important distinction: All data flows through a single path, using a
stream processing system.
Big Data Processing Architectures
Kappa Architecture
Big Data Processing Architectures
• These events are ordered, and the current state of an event is changed
only by a new event being appended.
• If you need to recompute the entire data set (equivalent to what the
batch layer does in lambda), you simply replay the stream, typically
using parallelism to complete the computation in a timely fashion.
Big Data Processing Architectures
Kappa Architecture
• A simple lambda architecture with the batch layer removed
• The speed layer is capable of handling both real-time and batch data
• Only two layers: stream processing and serving
• All event processing is performed on the input stream and persisted as a real-time view
• The speed layer is designed using Apache Storm or Spark (see the sketch below)
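A toy Python sketch of the kappa idea (illustrative only): state is a fold over an append-only event log, and recomputation simply means replaying the stream.

# append-only event log (ordered); state is never updated in place
event_log = [("deposit", 100), ("deposit", 50), ("withdraw", 30)]

def apply_event(balance, event):
    kind, amount = event
    return balance + amount if kind == "deposit" else balance - amount

def replay(log):
    # recomputing the view = replaying the whole stream from the start
    balance = 0
    for event in log:
        balance = apply_event(balance, event)
    return balance

print(replay(event_log))              # 120
event_log.append(("withdraw", 20))    # new events are only appended
print(replay(event_log))              # 100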
Big Data Processing Architectures
Zeta architecture
• This is the next generation Enterprise architecture cultivated
by Jim Scott.
Zeta architecture
There are several benefits to implementing a Zeta Architecture in your organization:
• Reduce time and costs of deploying and maintaining applications
• Fewer moving parts, with simplifications such as using a distributed file system
• Less data movement and duplication: transforming and moving data around will no longer be required unless a specific use case calls for it
• Simplified testing, troubleshooting, and systems management
• Better resource utilization to lower data center costs
The Traditional Research Approach
[Figure: multiple data sources, queried directly]
Big Data Processing Architectures
The Warehousing Approach
[Figure: information from multiple sources is integrated in advance by an integration system (with metadata) and stored in a warehouse for direct querying and analysis by clients]
Subject-oriented
DWH is organized around the major subjects of the enterprise (e.g. customer, product, sales) rather than application areas (customer invoicing, stock control, product sales).
Integrated
Data comes from enterprise-wide applications in different formats.
Time-Variant
DWH behaves differently at different time intervals.
Non-Volatile
New data is always added to the existing data rather than replacing it.
Merits
• Data Warehouse: never erases previous data when new data is added.
  Big Data: also never erases previous data when new data is added, but sometimes real-time data streams are processed.
• Data Warehouse: the time to fetch data simultaneously is high.
  Big Data: the time to fetch data simultaneously is small, using the Hadoop File System.
Reengineering the Data Warehouse
• Modify parts of the infrastructure and get great gains in scalability and performance.
• Colocation: a table and all its associated tables can be colocated in the same storage region.
Platform engineering
• Both SMP and DSM architectures have been deployed for many transaction processing systems, where the transactional data is small in size and has a short burst cycle of resource requirements.
• Each node has its own private memory, disks, and storage devices independent of any other node in the configuration.
• Each processor has its own local memory & local disk.
• An intercommunication channel is used by the processors to communicate.
• The key feature is that the operating system, not the application server, owns responsibility for controlling and sharing hardware resources.
https://twitter.com/devops_borat
Data accumulation
From WWW to VVV
• Volume
– data volumes are becoming unmanageable
• Variety
– data complexity is growing
– more types of data captured than previously
• Velocity
– some data is arriving so rapidly that it must either
be processed instantly, or lost
– this is a whole subfield called “stream processing”
The promise of Big Data
"... production equations predicted from his DNA he would be the best bull," USDA research geneticist Paul VanRaden emailed me with a detectable hint of pride. "Now he is the best progeny tested bull (as predicted)."
Some more examples
• Sports
– basketball increasingly driven by data analytics
– soccer beginning to follow
• Entertainment
– House of Cards designed based on data analysis
– increasing use of similar tools in Hollywood
• “Visa Says Big Data Identifies Billions of
Dollars in Fraud”
– new Big Data analytics platform on Hadoop
• “Facebook is about to launch Big Data
play”
– starting to connect Facebook with real life
Ok, ok, but ... does it apply to our
customers?
• Norwegian Food Safety Authority
– accumulates data on all farm animals
– birth, death, movements, medication, samples, ...
• Hafslund
– time series from hydroelectric dams, power prices,
meters of individual customers, ...
• Social Security Administration
– data on individual cases, actions taken, outcomes...
• Statoil
– massive amounts of data from oil exploration,
operations, logistics, engineering, ...
• Retailers
– see Target example above
– also, connection between what people buy, weather
forecast, logistics, ...
How to extract insight from data?
• Clustering
• Association learning
• Parameter estimation
• Recommendation engines
• Classification
• Similarity matching
• Neural networks
• Bayesian networks
• Genetic algorithms
Basically, it's all maths...
• Linear algebra
• Calculus
• Probability theory
• Graph theory
• ...
"Only 10% in devops are know how of work with Big Data. Only 1% are realize they are need 2 Big Data for fault tolerance."
https://twitter.com/devops_borat
Big data skills gap
Two orthogonal aspects
Data science?
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
How to process Big Data?
"Mining of Big Data is problem solve in 2013 with zgrep."
https://twitter.com/devops_borat
MapReduce
NoSQL and Big Data
https://twitter.com/devops_borat
Data quality
Approaches to learning
• Supervised
– we have training data with correct answers
– use training data to prepare the algorithm
– then apply it to data without a correct answer
• Unsupervised
– no training data
– throw data into the algorithm, hope it makes some
kind of sense out of the data
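An illustrative scikit-learn sketch of the two approaches on the same toy data (features and labels are made up):

from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# supervised: training data with correct answers
X_train = [[150, 1], [170, 0], [160, 1], [180, 0]]   # toy features, e.g. [height, long_hair]
y_train = ["child", "adult", "child", "adult"]        # the correct answers

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[165, 0]]))     # apply it to data without a correct answer

# unsupervised: same data, no answers; hope the algorithm finds structure
km = KMeans(n_clusters=2, n_init=10).fit(X_train)
print(km.labels_)                  # the groupings it discovered on its own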
Approaches to learning
• Prediction
– predicting a variable from data
• Classification
– assigning records to predefined groups
• Clustering
– splitting records into groups based on similarity
• Association learning
– seeing what often appears together with what
Issues
Underfitting
Overfitting
“What if the knowledge and data we have are
not sufficient to completely determine the
correct classifier? Then we run the risk of just
hallucinating a classifier (or parts of it) that is
not grounded in reality, and is simply
encoding random quirks in the data. This
problem is called overfitting, and is the
bugbear of machine learning. When your
learner outputs a classifier that is 100%
accurate on the training data but only 50%
accurate on test data, when in fact it could
have output one that is 75% accurate on both,
it has overfit.”
http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Testing
Missing values
Terminology
• Vector
– one-dimensional array
• Matrix
– two-dimensional array
• Linear algebra
– algebra with vectors and matrices
– addition, multiplication, transposition, ...
Top 10 algorithms
Top 10 machine learning algs
1. C4.5
2. k-means clustering
3. Support vector machines
4. the Apriori algorithm
5. the EM algorithm
6. PageRank
7. AdaBoost
8. k-nearest neighbours class.
9. Naïve Bayes
10. CART
From a survey at the IEEE International Conference on Data Mining (ICDM) in December 2006: "Top 10 algorithms in data mining", by X. Wu et al.
C4.5
http://www.dssresources.com/newsletters/66.php
Expectation Maximization
PageRank
AdaBoost
Naïve Bayes
Bayes’s Theorem
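For reference, Bayes's theorem states

P(A|B) = P(B|A) * P(A) / P(B)

so, in the spam setting below, the probability that an email is spam given that it contains a certain token can be computed from how often that token occurs in spam versus ham, together with the overall proportions of spam and ham.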
Simple example
• Duke
– record deduplication engine
– estimate probability of duplicate for each property
– combine probabilities with Bayes
• Whazzup
– news aggregator that finds relevant news
– works essentially like spam classifier on next slide
• Tine recommendation prototype
– recommends recipes based on previous choices
– also like spam classifier
• Classifying expenses
– using export from my bank
– also like spam classifier
Bayes against spam
• I pass it
– 1000 emails from my Bouvet folder
– 1000 emails from my Spam folder
• Then I feed it
– 1 email from another Bouvet folder
– 1 email from another Spam folder
Code
# scan spam
for spam in glob.glob(spamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(spam):
        corpus.spam(token)

# scan ham
for ham in glob.glob(hamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(ham):
        corpus.ham(token)

# compute probability
for email in sys.argv[3 : ]:
    print email
    p = classify(email)
    if p < 0.2:
        print '  Spam', p
    else:
        print '  Ham', p
https://github.com/larsga/py-snippets/tree/master/machine-learning/spam
Classify
class Feature:
    def __init__(self, token):
        self._token = token
        self._spam = 0
        self._ham = 0

    def spam(self):
        self._spam += 1

    def ham(self):
        self._ham += 1

    def spam_probability(self):
        return (self._spam + PADDING) / float(self._spam + self._ham + (PADDING * 2))

def compute_bayes(probs):
    product = reduce(operator.mul, probs)
    lastpart = reduce(operator.mul, map(lambda x: 1-x, probs))
    if product + lastpart == 0:
        return 0 # happens rarely, but happens
    else:
        return product / (product + lastpart)

def classify(email):
    return compute_bayes([corpus.spam_probability(f) for f in featurize(email)])
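As a worked example with made-up numbers: if two tokens have spam probabilities 0.9 and 0.8, then product = 0.9 * 0.8 = 0.72 and lastpart = 0.1 * 0.2 = 0.02, so the combined spam probability is 0.72 / (0.72 + 0.02) ≈ 0.97; individually suspicious tokens reinforce each other.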
Ham output
So, clearly most of the spam is from March 2013...
Ham 1.0
Received:2013 0.00342935528121
Date:2013 0.00624219725343
<br 0.0291715285881
background-color: 0.03125
background-color: 0.03125
background-color: 0.03125
background-color: 0.03125
background-color: 0.03125
Received:Mar 0.0332667997339
Date:Mar 0.0362756952842
...
Postboks 0.998107494322
Postboks 0.998107494322
Postboks 0.998107494322
+47 0.99787414966
+47 0.99787414966
+47 0.99787414966
+47 0.99787414966
Lars 0.996863237139
Lars 0.996863237139
23 0.995381062356
Spam output
...and the ham from October 2012
Spam 2.92798502037e-16
Received:-0400 0.0115646258503
Received:-0400 0.0115646258503
Received-SPF:(ontopia.virtual.vps-host.net: 0.0135823429542
Received-SPF:receiver=ontopia.virtual.vps-host.net; 0.0135823429542
Received:<[email protected]>; 0.0139318885449
Received:<[email protected]>; 0.0139318885449
Received:ontopia.virtual.vps-host.net 0.0170863309353
Received:(8.13.1/8.13.1) 0.0170863309353
Received:ontopia.virtual.vps-host.net 0.0170863309353
Received:(8.13.1/8.13.1) 0.0170863309353
...
Received:2012 0.986111111111
Received:2012 0.986111111111
$ 0.983193277311
Received:Oct 0.968152866242
Received:Oct 0.968152866242
Date:2012 0.959459459459
20 0.938864628821
+ 0.936526946108
+ 0.936526946108
+ 0.936526946108
More solid testing
http://spamassassin.apache.org/publiccorpus/
Linear regression
Linear regression
• Take parameters
– x1 square meters
– x2 number of rooms
– x3 number of floors
– x4 energy cost per year
– x5 meters to nearest subway station
– x6 years since built
– x7 years since last refurbished
– ...
• a x1 + b x2 + c x3 + ... = price
– strip out the x-es and you have a vector
– collect N samples of real flats with prices = matrix
– welcome to the world of linear algebra
Our data set: beer ratings
• Ratebeer.com
– a web site for rating beer
– scale of 0.5 to 5.0
• For each beer we know
– alcohol %
– country of origin
– brewery
– beer style (IPA, pilsener, stout, ...)
• But ... only one attribute is numeric!
– how to solve?
Example
ABV   .se   .nl   .us   .uk   IIPA   Black IPA   Pale ale   Bitter   Rating
8.5   1.0   0.0   0.0   0.0   1.0    0.0         0.0        0.0      3.5
8.0   0.0   1.0   0.0   0.0   0.0    1.0         0.0        0.0      3.7
6.2   0.0   0.0   1.0   0.0   0.0    0.0         1.0        0.0      3.2
4.4   0.0   0.0   0.0   1.0   0.0    0.0         0.0        1.0      3.2
...   ...   ...   ...   ...   ...    ...         ...        ...      ...
• Let’s say
– x is our data matrix
– y is a vector with the ratings and
– w is a vector with the a, b, c, ... values
• That is: x * w = y
– this is the same as the original equation
– a x1 + b x2 + c x3 + ... = rating
• If we solve this, we get
Enter Numpy
assert linalg.det(x_tx)
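The rest of the solution is only hinted at above; a minimal sketch of what it could look like with the normal equations in numpy, using a toy stand-in for the beer data matrix x and ratings vector y:

from numpy import array, dot, linalg

# toy stand-ins for the data matrix x and the ratings vector y
x = array([[8.5, 1.0, 0.0], [8.0, 0.0, 1.0], [6.2, 1.0, 0.0], [4.4, 0.0, 1.0]])
y = array([3.5, 3.7, 3.2, 3.2])

x_t = x.T                                # transpose of the data matrix
x_tx = dot(x_t, x)                       # x^T x
assert linalg.det(x_tx) != 0             # must be invertible to solve
w = dot(dot(linalg.inv(x_tx), x_t), y)   # w = (x^T x)^-1 x^T y
print(w)                                 # one coefficient per column of x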
http://www.ratebeer.com/user/15206/ratings/
Beyond prediction
Scatter plot
Matrix factorization
Clustering
Clustering
Sample data
k-means clustering
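A small illustrative pure-Python sketch of the k-means loop on made-up 2-D points (not from the slides):

import random

def kmeans(points, k, iterations=10):
    # 1. pick k random points as the initial centroids
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # 2. assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # 3. move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
print(kmeans(points, k=2)[0])    # two centroids, one near each group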
cluster5, 4 models
• The Myasishchev M-50 was a Soviet prototype four-engine supersonic bomber which never attained service.
• The Myasishchev M-4 Molot is a four-engined strategic bomber.
• The Convair B-36 "Peacemaker" was a ...
• The Learjet 23 is a ... twin-engine, high-speed business jet.
• The Heinkel He 100 was a German pre-World War II fighter aircraft.
• The Kawasaki Ki-61 Hien was a Japanese World War II fighter aircraft.
Not too poor a fit.
https://github.com/larsga/py-snippets/tree/master/machine-learning/aircraft
Agglomerative clustering
Principal component analysis
PCA
An example data set
• Two variables
• Three classes
• What’s the longest line we could draw
through the data?
• That line is a vector in two dimensions
• What dimension dominates?
– that’s right: the horizontal
– this implies the horizontal contains most of the
information in the data set
• PCA identifies the most significant
variables
Dimensionality reduction
Trying out PCA
Complete code
import rblib
from numpy import *

def eigenvalues(data, columns):
    covariance = cov(data - mean(data, axis = 0), rowvar = 0)
    eigvals = linalg.eig(mat(covariance))[0]
    indices = list(argsort(eigvals))
    indices.reverse() # so we get most significant first
    return [(columns[ix], float(eigvals[ix])) for ix in indices]

(scores, parameters, columns) = rblib.load_as_matrix('ratings.txt')
for (col, ev) in eigenvalues(parameters, columns):
    print "%40s %s" % (col, float(ev))
Output
abv              0.184770392185
colour           0.13154093951
sweet            0.121781685354
hoppy            0.102241100597
sour             0.0961537687655
alcohol          0.0893502031589
                 0.0677552513387
....
United States   -3.73028421245e-18
Eisbock         -1.68514561515e-17
Belarus
Vietnam
MapReduce
University pre-lecture, 1991
http://research.google.com/archive/mapreduce.html
Appeared in:
OSDI'04: Sixth Symposium on Operating System Design and
Implementation,
San Francisco, CA, December, 2004.
map and reduce
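The model takes its name from two functional-programming primitives; a tiny Python illustration of the two (not Hadoop code):

from functools import reduce

words = ["big", "data", "big", "ideas"]

# map: apply a function to every element independently (easy to parallelize)
pairs = list(map(lambda w: (w, 1), words))          # [('big', 1), ('data', 1), ...]

# reduce: combine the intermediate values into a final result
total = reduce(lambda acc, pair: acc + pair[1], pairs, 0)
print(total)                                        # 4 words in total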
MapReduce
Communications
• HDFS
– Hadoop Distributed File System
– input data, temporary results, and results are
stored as files here
– Hadoop takes care of making files available to
nodes
• Hadoop RPC
– how Hadoop communicates between nodes
– used for scheduling tasks, heartbeat etc
• Most of this is in practice hidden from the
developer
Does anyone need MapReduce?
WordCount – the mapper
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
WordCount – the reducer
The Hadoop ecosystem
• Pig
– dataflow language for setting up MR jobs
• HBase
– NoSQL database to store MR input in
• Hive
– SQL-like query language on top of Hadoop
• Mahout
– machine learning library on top of Hadoop
• Hadoop Streaming
– utility for writing mappers and reducers as
command-line tools in other languages
Word count in HiveQL
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

-- temporary table to hold words...
CREATE TABLE words (word STRING);

add file splitter.py;

INSERT OVERWRITE TABLE words
SELECT TRANSFORM(text)
USING 'python splitter.py' AS word
FROM input;

SELECT word, COUNT(*)
FROM input
LATERAL VIEW explode(split(text, ' ')) lTable as word
GROUP BY word;
Word count in Pig
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
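For comparison, a rough sketch of the same word count as two small Python scripts for Hadoop Streaming (the utility mentioned in the ecosystem list above); file names are illustrative:

# mapper.py - reads lines from stdin, emits one "word<TAB>1" pair per word
import sys
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py - Hadoop Streaming sorts the mapper output by key before this runs
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(word + "\t" + str(sum(int(count) for _, count in group)))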
Applications of MapReduce
Apache Mahout
Translation to MapReduce
• σ(company_name=‘FBC’, works)
– map: for each record r in works, verify the condition,
and pass (r, r) if it matches
– reduce: receive (r, r) and pass it on unchanged
• π(person_name, σ(...))
– map: for each record r in input, produce a new record r’
with only wanted columns, pass (r’, r’)
– reduce: receive (r’, [r’, r’, r’ ...]), output (r’, r’)
• ⋈(π(...), lives)
– map:
• for each record r in π(...), output (person_name, r)
• for each record r in lives, output (person_name, r)
– reduce: receive (key, [record, record, ...]), and perform the actual join (see the sketch below)
• ...
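A toy, single-process Python sketch of the join step above (illustrative records; a real MapReduce job would distribute the map, shuffle, and reduce phases across nodes):

from collections import defaultdict

works = [("ann", "FBC"), ("bob", "FBC"), ("cat", "ACME")]     # (person_name, company_name)
lives = [("ann", "Oslo"), ("bob", "Pune"), ("cat", "Tokyo")]  # (person_name, city)

# map phase: tag every record with its join key (person_name)
mapped = [(person, ("works", company)) for person, company in works] + \
         [(person, ("lives", city)) for person, city in lives]

# shuffle: group all records by key
groups = defaultdict(list)
for key, record in mapped:
    groups[key].append(record)

# reduce phase: perform the actual join inside each group
for person, records in groups.items():
    companies = [value for tag, value in records if tag == "works"]
    cities = [value for tag, value in records if tag == "lives"]
    for company in companies:
        for city in cities:
            print(person, company, city)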
Lots of SQL-on-MapReduce tools
• Tenzing Google
• Hive Apache Hadoop
• YSmart Ohio State
• SQL-MR AsterData
• HadoopDB Hadapt
• Polybase Microsoft
• RainStor RainStor Inc.
• ParAccel ParAccel Inc.
• Impala Cloudera
• ...
Thank You….!