Data Mining With Hadoop and Hive
Introduction to Architecture
Dr. Wlodek Zadrozny, Dr. Srinivas Akella
Data Science
[Figure: the Data Science Venn diagram]
Source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/
Hadoop & Hive
• Fault-tolerant
• Scalable
Source: Harper's, 3/2008
Motivation: Large Scale Data Processing
• Many tasks: Process lots of data to produce other data
• Want to use hundreds or thousands of CPUs
... but this needs to be easy
• MapReduce provides:
– Automatic parallelization and distribution
– Fault-tolerance
– I/O scheduling
– Status and monitoring
Example Tasks
• Finding all occurrences of a string on the web
• Finding all pages that point to a given page
• Data analysis of website access log files
• Clustering web pages
Functional Programming
• MapReduce is based on the functional programming paradigm, which treats computation as the evaluation of mathematical functions
• Map
• map result-type function sequence &rest more-sequences
• The function must take as many arguments as there are sequences provided; at least one
sequence must be provided. The result of map is a sequence such that element j is the
result of applying function to element j of each of the argument sequences.
• Example: (map 'list #'- '(1 2 3 4)) => (-1 -2 -3 -4)
• Reduce
• reduce function sequence &key :from-end :start :end :initial-value
• The reduce function combines all the elements of a sequence using a binary operation; for
example, using + one can add up all the elements.
• Example: (reduce #'+ '(1 2 3 4)) => 10
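The same two primitives exist in Python; a minimal sketch of the Lisp examples above (illustrative, not part of the original slides):

    # Python analogues of the Lisp map/reduce examples (illustrative sketch).
    from functools import reduce

    # like (map 'list #'- '(1 2 3 4)) => (-1 -2 -3 -4)
    negated = list(map(lambda x: -x, [1, 2, 3, 4]))

    # like (reduce #'+ '(1 2 3 4)) => 10
    total = reduce(lambda a, b: a + b, [1, 2, 3, 4])

    print(negated, total)  # [-1, -2, -3, -4] 10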
MapReduce Programming Model
• Input and Output: each a set of key/value pairs
• WordCount example:
– Map: (doc, contents) -> list(word_i, 1)
– Reduce: (word_i, list(1,1,…)) -> list(word_i, count_i) (sketched in Python below)
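A minimal, self-contained sketch of this WordCount model in Python (illustrative only; a real Hadoop job is written against the Hadoop API or Hadoop Streaming, and the document IDs and text below are made up):

    # WordCount sketch: map, shuffle/group-by-key, and reduce in one process.
    from collections import defaultdict

    def map_doc(doc_id, contents):
        # Map: (doc, contents) -> list of (word, 1)
        return [(word, 1) for word in contents.split()]

    def reduce_word(word, ones):
        # Reduce: (word, [1, 1, ...]) -> (word, count)
        return (word, sum(ones))

    docs = {"d1": "the quick brown fox", "d2": "the lazy dog jumps the fence"}
    grouped = defaultdict(list)
    for doc_id, contents in docs.items():
        for word, one in map_doc(doc_id, contents):
            grouped[word].append(one)      # stands in for the shuffle phase
    print(sorted(reduce_word(w, ones) for w, ones in grouped.items()))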
Execution Overview
[Figure: MapReduce execution overview (Dean and Ghemawat, 2008)]
Parallel Execution
• 200,000 map / 5,000 reduce tasks on 2,000 machines (Dean and Ghemawat, 2004)
• Over 1 million map-reduce jobs per day at Facebook (2014)
Model has Broad Applicability
MapReduce Programs In Google Source Tree
Example uses:
• distributed grep
• distributed sort
• web link-graph reversal
• term-vector per host
• web access log stats
• inverted index construction
• document clustering
• machine learning
• statistical machine translation
Usage at Google
Hadoop
• Open Source Apache project
• Written in Java; runs on Linux, Windows, OS X, Solaris
• Hadoop includes:
– MapReduce: distributes the computation
– HDFS: distributes the data (basic shell usage sketched below)
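As a quick illustration (not from the slides) of working with the storage side, a sketch that drives the standard hadoop fs shell commands from Python; the /data path and local.txt file are hypothetical:

    # Illustrative sketch: invoke the standard 'hadoop fs' shell from Python.
    # The /data directory and local.txt file are hypothetical examples.
    import subprocess

    subprocess.run(["hadoop", "fs", "-mkdir", "-p", "/data"], check=True)
    subprocess.run(["hadoop", "fs", "-put", "local.txt", "/data/local.txt"],
                   check=True)
    listing = subprocess.run(["hadoop", "fs", "-ls", "/data"],
                             capture_output=True, text=True, check=True)
    print(listing.stdout)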
Hadoop Design Goals
• Storage of large data sets
• Running jobs in parallel
• Maximizing disk I/O throughput (large sequential reads and writes)
• Batch processing
Job Distribution
• Users submit MapReduce jobs to the jobtracker
• The jobtracker puts jobs in a queue and executes them on a first-come, first-served basis
• The jobtracker manages the assignment of map and reduce tasks to tasktrackers
• Tasktrackers execute tasks upon instruction from the jobtracker, and handle data transfer between the map and reduce phases
Hadoop MapReduce
Data Distribution
• Data transfer handled implicitly by HDFS
• Move computation to where data is: data locality
• Map tasks are scheduled on the same node that their input data resides on
• If lots of data is on the same node, nearby nodes map instead
Hadoop DFS (HDFS)
Map Reduce and HDFS
Source: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
Data Access
• CPU speed, transfer speed, RAM size, and disk capacity double every 18-24 months
• Disk seek time improves only slowly (~5% per year)
• Time to read an entire disk is growing, since capacity outpaces transfer rate in practice (a quick calculation below makes this concrete)
Source: http://lucene.apache.org/hadoop/hdfs_design.html
Example HDFS Installation
• Facebook, 2010 (largest HDFS installation at the time): 2000 machines, 22,400 cores
• 2014: Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day
Hadoop distributions:
• Hortonworks
• MapR
• IBM BigInsights
Hive
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis.
• Developed at Facebook to enable analysts to query Hadoop data
• MapReduce for computation, HDFS for storage, an RDBMS for metadata (a query sketch follows below)
Hortonworks, 2013
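For a feel of what this looks like in practice, a hedged sketch of a word-count query in HiveQL, submitted from Python via the third-party pyhive package; the docs(line STRING) table, host, and port are assumptions, not details from the slides:

    # Illustrative sketch only: requires the third-party 'pyhive' package and
    # a running HiveServer2; the docs table and connection details are
    # hypothetical.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)
    cur = conn.cursor()
    # Hive compiles this SQL into MapReduce jobs over files stored in HDFS.
    cur.execute("""
        SELECT word, COUNT(*) AS cnt
        FROM docs LATERAL VIEW explode(split(line, ' ')) w AS word
        GROUP BY word
        ORDER BY cnt DESC
    """)
    for word, cnt in cur.fetchall():
        print(word, cnt)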
Emerging Big Data Architecture
1. Collect data
2. Clean and process using Hadoop
3. Push data to Data Warehouse, or use directly in Enterprise applications
Hortonworks, 2013