Data Mining With Hadoop and Hive Introduction To Architecture

Dr. Wlodek Zadrozny, Dr. Srinivas Akella
Data Science
[Figure: data science Venn diagram; labels: Hadoop & Hive]
Source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/
©2015-2025. Reproduction or usage prohibited without permission of authors (Dr. Hansen or Dr. Zadrozny). DSBA6100 Big Data Analytics for Competitive Advantage
MapReduce and Hadoop
MapReduce
• MapReduce: a programming paradigm for clusters of commodity PCs
• Map computation across many inputs
• Fault-tolerant
• Scalable
• Machine independent programming model

• Permits programming at abstract level
• Runtime system handles scheduling, load balancing
• First Google server, ~1999 (Source: Computer History Museum)

Data Centers
[Figures: Yahoo data center; Google data center layout. Source: Harper's, 3/2008]
Motivation: Large Scale Data Processing
• Many tasks: Process lots of data to produce other data
• Want to use hundreds or thousands of CPUs
... but this needs to be easy
• MapReduce provides:
– Automatic parallelization and distribution
– Fault-tolerance
– I/O scheduling
– Status and monitoring
Example Tasks
• Finding all occurrences of a string on the web
• Finding all pages that point to a given page
• Data analysis of website access log files
• Clustering web pages
Functional Programming
• MapReduce: Based on Functional Programming paradigm that treats computation as
evaluation of math functions
• Map
• map result-type function sequence &rest more-sequences
• The function must take as many arguments as there are sequences provided; at least one
sequence must be provided. The result of map is a sequence such that element j is the
result of applying function to element j of each of the argument sequences.
• Example: (map 'list #'- '(1 2 3 4)) => (-1 -2 -3 -4)
• Reduce
• reduce function sequence &key :from-end :start :end :initial-value
• The reduce function combines all the elements of a sequence using a binary operation; for
example, using + one can add up all the elements.
• Example: (reduce #'+ '(1 2 3 4)) => 10
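The two Lisp examples above have close analogues in most languages. A minimal sketch in Python, using only the standard library (the variable names are illustrative, not from the slides):

```python
from functools import reduce
import operator

numbers = [1, 2, 3, 4]

# map: apply a function to every element of a sequence.
# Mirrors (map 'list #'- '(1 2 3 4)), which negates each element.
negated = list(map(operator.neg, numbers))

# map over several sequences: the function takes one argument per sequence.
pairwise = list(map(operator.sub, [10, 20, 30], [1, 2, 3]))

# reduce: combine all elements with a binary operation.
# Mirrors (reduce #'+ '(1 2 3 4)).
total = reduce(operator.add, numbers)

print(negated)   # [-1, -2, -3, -4]
print(pairwise)  # [9, 18, 27]
print(total)     # 10
```

Neither function mutates its input, which is what makes it safe to run the per-element work on many machines at once.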
MapReduce Programming Model
• Input and Output: each a set of key/value pairs
• Programmer specifies two functions:
• map (in_key, in_value) -> list(out_key, intermediate_value)

– Processes input key/value pair
– Produces set of intermediate pairs
• reduce (out_key, list(intermediate_value)) -> list(out_value)

– Combines all intermediate values for a particular key
– Produces a set of merged output values (usually just one)
• Inspired by similar primitives in LISP and other languages
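The model above can be sketched as a single-process driver that groups intermediate values by key between the two phases. This is an illustration only, not Hadoop's or Google's implementation; `run_mapreduce`, `map_fn`, `reduce_fn` and the temperature data are hypothetical names chosen for the example.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map over (in_key, in_value) pairs, group by out_key, then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        # map (in_key, in_value) -> list(out_key, intermediate_value)
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    # reduce (out_key, list(intermediate_value)) -> list(out_value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def map_fn(station, reading):
    """Emit (city, temperature) for one station reading."""
    city, temperature = reading
    yield city, temperature


def reduce_fn(city, temperatures):
    """Merge all temperatures for a city into a single maximum."""
    return [max(temperatures)]


readings = [("s1", ("Oslo", 3)), ("s2", ("Cairo", 31)), ("s3", ("Oslo", 7))]
result = run_mapreduce(readings, map_fn, reduce_fn)
print(result)  # {'Cairo': [31], 'Oslo': [7]}
```

In a real cluster the grouping step is a distributed shuffle and the map and reduce calls run on different machines; the programmer still writes only the two functions.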

Example: Count word occurrences
MapReduce Operations
• Conceptual:
– Map: (K1, V1) -> list(K2, V2)
– Reduce: (K2, list(V2)) -> list(K3, V3)
• WordCount example:
– Map: (doc, contents) -> list(word_i, 1)
– Reduce:
(word_i, list(1,1,…)) -> list(word_i, count_i)
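The WordCount signatures above can be written out as a small, single-process sketch. The function names and the sample documents are assumptions made for illustration; splitting on whitespace and lowercasing is a simplification of real tokenization.

```python
from collections import defaultdict


def map_words(doc, contents):
    """Map: (doc, contents) -> list(word_i, 1)."""
    for word in contents.lower().split():
        yield word, 1


def reduce_counts(word, ones):
    """Reduce: (word_i, list(1, 1, ...)) -> list(word_i, count_i)."""
    return [(word, sum(ones))]


def word_count(documents):
    """Group the map output by word, then reduce each group."""
    grouped = defaultdict(list)
    for doc, contents in documents.items():
        for word, one in map_words(doc, contents):
            grouped[word].append(one)
    output = []
    for word in sorted(grouped):
        output.extend(reduce_counts(word, grouped[word]))
    return output


docs = {"d1": "the quick fox", "d2": "the lazy dog the end"}
counts = dict(word_count(docs))
print(counts["the"])  # 3
```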
Execution Overview
Dean and Ghemawat, 2008
Parallel Execution
• 200,000 map tasks and 5,000 reduce tasks on 2,000 machines (Dean and Ghemawat, 2004)
• Over 1 million MapReduce jobs per day at Facebook last year
Model has Broad Applicability
MapReduce Programs In Google Source Tree
Example uses:
• distributed grep
• distributed sort
• web link-graph reversal
• term-vector per host
• web access log stats
• inverted index construction
• document clustering
• machine learning
• statistical machine translation
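Inverted index construction, one of the example uses listed above, fits the model directly: map emits (word, document id) and reduce merges the ids into a posting list. A minimal single-process sketch, with hypothetical function names and sample pages:

```python
from collections import defaultdict


def map_postings(doc_id, text):
    """Map: emit (word, doc_id) once per distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_postings(word, doc_ids):
    """Reduce: merge all document ids for one word into a sorted posting list."""
    return sorted(set(doc_ids))


def build_inverted_index(docs):
    """Group map output by word, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, posting in map_postings(doc_id, text):
            grouped[word].append(posting)
    return {word: reduce_postings(word, ids) for word, ids in grouped.items()}


pages = {"a": "hadoop stores data", "b": "hive queries data"}
index = build_inverted_index(pages)
print(index["data"])  # ['a', 'b']
```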
Usage at Google
Hadoop
• Open Source Apache project
• Written in Java; runs on Linux, Windows, OS X, Solaris
• Hadoop includes:
– MapReduce: distributes applications
– HDFS: distributes data
Hadoop Design Goals
• Storage of large data sets
• Running jobs in parallel
• Maximizing disk I/O
• Batch processing
Job Distribution
• Users submit MapReduce jobs to the jobtracker
• Jobtracker puts jobs in a queue and executes them on a first-come, first-served basis
• Jobtracker manages assignment of map and reduce tasks to tasktrackers
• Tasktrackers execute tasks upon instruction from the jobtracker, and handle data transfer between map and reduce phases
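The first-come, first-served behaviour described above can be sketched as a toy scheduler. This is not Hadoop code: `schedule_fifo` is a hypothetical function, and handing tasks to tasktrackers in round-robin order is an assumption made to keep the example short.

```python
from collections import deque


def schedule_fifo(jobs, tasktrackers):
    """Assign tasks job by job, in submission order.

    jobs: list of (job_name, num_tasks) in the order they were submitted.
    tasktrackers: list of tasktracker names.
    Returns a list of (job_name, task_id, tasktracker) assignments.
    """
    queue = deque(jobs)  # the jobtracker's job queue
    assignments = []
    next_tracker = 0
    while queue:
        job_name, num_tasks = queue.popleft()  # oldest job first
        for task_id in range(num_tasks):
            # Assumed policy: hand tasks out to tasktrackers in rotation.
            tracker = tasktrackers[next_tracker % len(tasktrackers)]
            assignments.append((job_name, task_id, tracker))
            next_tracker += 1
    return assignments


assignments = schedule_fifo([("jobA", 3), ("jobB", 2)], ["tt1", "tt2"])
for assignment in assignments:
    print(assignment)
```

All of jobA's tasks are assigned before any of jobB's, which is the defining property of a first-come, first-served queue.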
Hadoop MapReduce
Data Distribution
• Data transfer handled implicitly by HDFS
• Move computation to where data is: data locality
• Map tasks are scheduled on the same node that the input data resides on
• If lots of data is on the same node, nearby nodes will map instead
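The locality preference above can be sketched as a node-selection function. The function name, the slot counts and the rack map are hypothetical, and treating "nearby" as "on the same rack" is an assumption of this sketch rather than something the slides state.

```python
def pick_node(block_locations, free_slots, rack_of):
    """Choose where to run a map task for one input block.

    block_locations: nodes that hold a copy of the block.
    free_slots: dict of node -> number of free map slots.
    rack_of: dict of node -> rack name.
    """
    # 1. Prefer a node that already stores the data.
    for node in block_locations:
        if free_slots.get(node, 0) > 0:
            return node, "node-local"
    # 2. Otherwise use a nearby node (assumed here: same rack as a copy).
    data_racks = {rack_of[node] for node in block_locations}
    for node, slots in free_slots.items():
        if slots > 0 and rack_of[node] in data_racks:
            return node, "rack-local"
    # 3. Fall back to any node with capacity.
    for node, slots in free_slots.items():
        if slots > 0:
            return node, "off-rack"
    return None, "wait"


racks = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(pick_node(["n1"], {"n1": 1, "n2": 1, "n3": 1}, racks))  # ('n1', 'node-local')
print(pick_node(["n1"], {"n1": 0, "n2": 1, "n3": 1}, racks))  # ('n2', 'rack-local')
```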
Hadoop DFS (HDFS)
Map Reduce and HDFS
Source: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
Data Access
• CPU speed, transfer speed, RAM size, and disk size double every 18-24 months
• Disk seek time is nearly constant (improves only ~5% per year)
• Time to read an entire disk is growing
• Scalable computing should not be limited by disk seek time
• Throughput is more important than latency
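A short calculation shows why large sequential reads keep seek time from dominating. The 10 ms seek time, the 100 MB/s transfer rate and the two read sizes are assumed round numbers for illustration, not measurements from the slides.

```python
def seek_overhead(read_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of total read time spent seeking, for one seek plus one read."""
    transfer_ms = read_mb / transfer_mb_per_s * 1000.0
    return seek_ms / (seek_ms + transfer_ms)


small = seek_overhead(0.004)  # one seek per 4 KB read
large = seek_overhead(64)     # one seek per 64 MB read

print(f"4 KB reads:  {small:.1%} of the time is seeking")
print(f"64 MB reads: {large:.1%} of the time is seeking")
```

Under these assumptions, small random reads spend almost all of their time seeking, while a large sequential read spends under 2% of its time on the seek, so throughput is set by the transfer rate.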
Original Google Storage
Source: Computer History Museum

HDFS
• Inspired by Google File System (GFS)
• Follows master/slave architecture
• HDFS installation has one Namenode and one or more Datanodes (one per
node in cluster)
• Namenode: Manages filesystem namespace and regulates file access by
clients. Makes filesystem namespace operations (open/close/rename of files
and directories) available via RPC
• Datanode: Responsible for serving read/write requests from filesystem
clients. Also performs block creation/deletion/replication (upon instruction
from Namenode)
HDFS Design Goals
• Very large files:
– Files may be GB or TB in size
• Streaming data access:
– Write once, read many times
– Throughput more important than latency
• Commodity hardware
– Node failure may occur
HDFS
• Files are broken into blocks of 64MB (but can be user
specified)
• Default replication factor is 3x
• Block placement algorithm is rack-aware
• Dynamic control of replication factor
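The block size and replication factor above determine how many blocks a file occupies and how much raw disk it consumes. A minimal sketch of that arithmetic; `hdfs_footprint` is a hypothetical helper, and it assumes the final block only uses as much space as the data it holds.

```python
import math


def hdfs_footprint(file_mb, block_mb=64, replication=3):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    num_blocks = math.ceil(file_mb / block_mb)
    num_replicas = num_blocks * replication
    raw_mb = file_mb * replication  # last block is not padded to full size
    return num_blocks, num_replicas, raw_mb


print(hdfs_footprint(1024))  # (16, 48, 3072)
print(hdfs_footprint(100))   # (2, 6, 300)
```

A 1 GB file therefore becomes 16 blocks and 48 stored replicas, which is also the upper bound on how many map tasks can read it in parallel at one task per block.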

HDFS
Source: http://lucene.apache.org/hadoop/hdfs_design.html
Example HDFS Installation
Facebook, 2010 (largest HDFS installation at the time):
• 2000 machines, 22,400 cores
• 24 TB / machine (21 PB total)
• Writing 12 TB/day
• Reading 800 TB/day
• 25K MapReduce jobs/day
• 65 million HDFS files
• 30K simultaneous clients
Facebook, 2014:
• Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day
• Hive is Facebook's data warehouse, with 300 petabytes of data in 800,000 tables
• More at: https://research.facebook.com/blog/1522692927972019/facebook-s-top-open-data-problems/
Companies to first use MapReduce/Hadoop
• Google
• Yahoo
• Microsoft
• Amazon
• Facebook
• Twitter
• Fox Interactive Media
• AOL
• Hulu
• LinkedIn
• Disney
• NY Times
• eBay
• Quantcast
• Veoh
• Joost
• Last.fm
Any company with enough data
Hadoop Vendors
• Cloudera
• Hortonworks
• MapR
• IBM BigInsights
Hive
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis.
• Developed at Facebook to enable analysts to query Hadoop data
• MapReduce for computation, HDFS for storage, RDBMS for metadata
• Can use Hive to perform SQL style queries on Hadoop data

• Most Hive queries generate MapReduce jobs
• Can perform parallel queries over massive data sets
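To see why a SQL-style query can run as a MapReduce job, consider a grouped count. The sketch below is conceptual only: the query in the comment, the `access_log` rows and the function names are hypothetical, and Hive's real execution plans are more elaborate than this.

```python
from collections import defaultdict

# Hypothetical query being illustrated:
#   SELECT page, COUNT(*) FROM access_log GROUP BY page;

access_log = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/about"},
    {"user": "u3", "page": "/home"},
]


def map_row(row):
    """Map: emit the GROUP BY column as the key, with a count of 1."""
    yield row["page"], 1


def reduce_group(page, ones):
    """Reduce: COUNT(*) for one group is the sum of its 1s."""
    return page, sum(ones)


def run_group_by_count(rows):
    """Group map output by key (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for row in rows:
        for key, one in map_row(row):
            grouped[key].append(one)
    return dict(reduce_group(key, ones) for key, ones in grouped.items())


page_counts = run_group_by_count(access_log)
print(page_counts)  # {'/home': 2, '/about': 1}
```

The GROUP BY column becomes the intermediate key, so each group's rows end up at the same reducer, and different groups can be aggregated in parallel.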
Apache Pig
• Pig: High-level platform for creating MapReduce programs
• Developed at Yahoo
• Pig Latin: Procedural language for Pig
• Ability to use user code at any point in the pipeline makes it good for pipeline development
Traditional Data Enterprise Architecture
Hortonworks, 2013
Emerging Big Data Architecture
1. Collect data
2. Clean and process using Hadoop
3. Push data to Data Warehouse, or use directly in Enterprise applications
Hortonworks, 2013
• Data flow of meter done manually
Awadallah and Graham

• Automatic update of meter readings every 15 mins

DW and Hadoop Use Cases
