Real World Hadoop Use Cases: Jfokus 2013, Stockholm
Big Data?
Big Data
Increased volumes of data
Increased speed of incoming data
Increased variety of data types
Challenges
Stress on traditional systems
Process more data within same time window
ETL / cleansing of exponentially growing ingest volumes and new data types
Inflexible data models when the questions change
Siloed data and organizations prevent extracting the most value
Hadoop Distributed File System (HDFS)
MapReduce: A scalable data processing framework
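As a rough illustration of the programming model named above, the following single-process sketch walks through the map, shuffle, and reduce phases in plain Java. It does not use the Hadoop API; the class name `MiniMapReduce` and the word-counting task are hypothetical choices made for this example. On a real cluster the same three phases run in parallel across many nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of the map -> shuffle -> reduce phases.
// Illustrative only: it does not use the Hadoop API.
class MiniMapReduce {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values for one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run all three phases over the input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big cluster")));
    }
}
```

The map and reduce functions never share state, which is what lets a framework distribute them; only the shuffle step needs to move data between workers.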
REAL WORLD EXAMPLE #1
Analyzing Twitter
SQL
Filter on industry
Aggregate tweets by original poster and count retweets
Sort
Complex data
Deeply nested
Variable schema
Size of data set
Hadoop!
Flume
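// Enclosing method assumed to be twitter4j's StatusListener.onStatus callback
public void onStatus(Status status) {
  // Record the tweet's creation time in the Flume event headers
  headers.put("timestamp", String.valueOf(
      status.getCreatedAt().getTime()));
  // Wrap the raw tweet JSON in a Flume event and pass it to the channel
  Event event = EventBuilder.withBody(
      DataObjectFactory.getRawJSON(status).getBytes(), headers);
  channel.processEvent(event);
}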
HiveQL
SQL-like interface
Hive interpreter converts HiveQL to MapReduce code
Returns results to the client
Schema on read
Scalar types (int, float, double, boolean, string)
Complex types (struct, map, array)
Metastore contains table definitions
Allows queries to be data agnostic
Stored in a relational database
Similar to catalog tables in other DBs
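To make "schema on read" and the struct type a little more concrete, here is a conceptual sketch in plain Java. It is not how Hive is implemented: the class `SchemaOnRead`, the method `project`, and the use of nested `Map` objects to stand in for parsed JSON records are all assumptions made for illustration. The point is that records are stored as they arrive, and a field path is only resolved when a query reads them, so records lacking a field simply yield null.

```java
import java.util.Map;

// Conceptual sketch: structure is applied when a record is read,
// not when it is written. Nested maps stand in for parsed JSON.
class SchemaOnRead {

    // Resolve a dotted path such as "retweeted_status.user.screen_name"
    // against a nested record. Returns null when any level is missing.
    static Object project(Map<String, Object> record, String path) {
        Object current = record;
        for (String field : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(field);
        }
        return current;
    }

    public static void main(String[] args) {
        // A retweet carries a nested retweeted_status struct.
        Map<String, Object> retweet = Map.of(
            "text", "RT hello",
            "retweeted_status", Map.of("user", Map.of("screen_name", "alice")));
        // An ordinary tweet has no such field at all.
        Map<String, Object> plain = Map.of("text", "hello");

        System.out.println(project(retweet, "retweeted_status.user.screen_name"));
        System.out.println(project(plain, "retweeted_status.user.screen_name"));
    }
}
```

Both records can sit in the same data set even though their shapes differ, which is the property that suits deeply nested, variable-schema data such as tweets.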
Hive Serializers and Deserializers (SerDe)
Hive Strengths:
Flexible in the data model
Extendable format support
Query through your favorite SQL tool
Custom Flume Source -> Sink to HDFS -> JSON SerDe Parses Data
Flume -> HDFS -> Hive
SELECT
  t.retweeted_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweeted_screen_name,
        retweeted_status.text,
        max(retweet_count) AS retweets
      FROM tweets
      GROUP BY
        retweeted_status.user.screen_name,
        retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
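For readers who want to check what the query above computes, the same logic can be sketched in memory with plain Java collections. This is only an illustration under stated assumptions: the class `RetweetRanking`, its nested `Retweet` and `Row` types, and the sample data are hypothetical, and every input row is assumed to be a retweet. Hive would compile the query into MapReduce jobs rather than run anything like this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the retweet ranking query (illustrative names).
class RetweetRanking {

    // One observed retweet: original poster, original text, and the
    // retweet count reported at the time it was seen.
    static final class Retweet {
        final String screenName;
        final String text;
        final int retweetCount;

        Retweet(String screenName, String text, int retweetCount) {
            this.screenName = screenName;
            this.text = text;
            this.retweetCount = retweetCount;
        }
    }

    // One output row: poster, total retweets, number of distinct tweets.
    static final class Row {
        final String screenName;
        final long totalRetweets;
        final long tweetCount;

        Row(String screenName, long totalRetweets, long tweetCount) {
            this.screenName = screenName;
            this.totalRetweets = totalRetweets;
            this.tweetCount = tweetCount;
        }
    }

    static List<Row> top(List<Retweet> tweets, int limit) {
        // Inner query: GROUP BY (screen_name, text), keep max(retweet_count).
        Map<List<String>, Integer> maxPerTweet = new LinkedHashMap<>();
        for (Retweet t : tweets) {
            maxPerTweet.merge(List.of(t.screenName, t.text), t.retweetCount, Integer::max);
        }

        // Outer query: GROUP BY screen_name with sum(retweets) and count(*).
        Map<String, long[]> perUser = new TreeMap<>();
        for (Map.Entry<List<String>, Integer> e : maxPerTweet.entrySet()) {
            long[] acc = perUser.computeIfAbsent(e.getKey().get(0), k -> new long[2]);
            acc[0] += e.getValue(); // total_retweets
            acc[1] += 1;            // tweet_count
        }

        // ORDER BY total_retweets DESC, then LIMIT.
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, long[]> e : perUser.entrySet()) {
            rows.add(new Row(e.getKey(), e.getValue()[0], e.getValue()[1]));
        }
        rows.sort(Comparator.comparingLong((Row r) -> r.totalRetweets).reversed());
        return new ArrayList<>(rows.subList(0, Math.min(limit, rows.size())));
    }

    public static void main(String[] args) {
        List<Retweet> sample = List.of(
            new Retweet("alice", "hello", 5),
            new Retweet("alice", "hello", 9),
            new Retweet("alice", "bye", 1),
            new Retweet("bob", "yo", 20));
        for (Row r : top(sample, 10)) {
            System.out.println(r.screenName + " " + r.totalRetweets + " " + r.tweetCount);
        }
    }
}
```

The inner grouping takes the maximum because the same original tweet shows up once per retweet, each time with a growing counter; summing those rows directly would overcount.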
Try it out yourself?
Lars Sjödin
© 2012 Cloudera, Inc.
Analyzing Twitter data with Hadoop
EXTRA SLIDES
NOTE: Hive is not a database
                     RDBMS                    Hive
Language             Generally >= SQL-92      Subset of SQL-92 plus Hive specific extensions
Update Capabilities  INSERT, UPDATE, DELETE   INSERT OVERWRITE; no UPDATE, DELETE
Transactions         Yes                      No
Latency              Sub-second               Minutes
Indexes              Yes                      Yes
Data size            Terabytes                Petabytes
Cloudera Manager
https://ccp.cloudera.com/display/SUPPORT/Downloads
Free up to 50 nodes
JSON INTERLUDE
OOZIE: AUTOMATION