Real World Hadoop Use Cases

JFokus 2013, Stockholm

Eva Andreasson, Cloudera Inc.

Lars Sjödin, King.com

© 2012 Cloudera, Inc.

Agenda

Recap of Big Data and Hadoop

Analyzing Twitter feeds with Hadoop
Real world Hadoop use case Featuring King.com
Q&A

Big Data?

Big Data
Increased volumes of data
Increased speed of incoming data
Increased variety of data types
Challenges
Stress on traditional systems
Process more data within same time window
ETL / Cleansing of exponential ingest amounts and new data types
Inflexible models for when questions change
Siloed data / organizations preventing most value

Hadoop Distributed File System (HDFS)

MapReduce: A scalable data processing framework

REAL WORLD EXAMPLE #1

ANALYZING TWITTER DATA WITH HADOOP

Analyzing Twitter

Social media popular with marketing teams

Twitter is an effective tool for promotion

But how do we find out who is most influential:

Who is influential and has the most followers?
Which Twitter user gets the most retweets?
Who is influential in our industry?


Techniques

SQL
Filter on industry
Aggregate tweets by original poster and count retweets
Sort

Complex data
Deeply nested
Variable schema
Size of data set
Hadoop!

Flume

Streaming data flow (like Twitter)

Sources
Push or pull
Sinks
Event based


Pulling data From Twitter

Custom source, using twitter4j

Source will process data as discrete events
Filter on key words
Sink writes to files in HDFS

Loading data into HDFS

HDFS Sink comes stock with Flume

Easily separate files by creation time
hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
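
The escape sequences in the path above stand for year, month, day and hour, so each hour of tweets lands in its own directory. The following is a minimal sketch of that mapping, not Flume's own code: the class and method names are hypothetical, and it assumes the event time is an epoch-millisecond timestamp interpreted in UTC.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that mimics how %Y/%m/%d/%H expands for one event.
final class TweetPathSketch {
    // yyyy/MM/dd/HH corresponds to %Y/%m/%d/%H (24-hour clock).
    private static final DateTimeFormatter BUCKET =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    /** Returns base + "YYYY/MM/DD/HH/" for an epoch-millisecond timestamp. */
    static String bucketPath(String base, long timestampMillis, ZoneId zone) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(zone);
        return base + BUCKET.format(t) + "/";
    }

    public static void main(String[] args) {
        String base = "hdfs://hadoop1:8020/user/flume/tweets/";
        // Example timestamp chosen for illustration only.
        long ts = Instant.parse("2013-02-05T14:30:00Z").toEpochMilli();
        // prints hdfs://hadoop1:8020/user/flume/tweets/2013/02/05/14/
        System.out.println(bucketPath(base, ts, ZoneOffset.UTC));
    }
}
```

One directory per hour keeps files small enough to manage and lines up with adding one Hive partition per hour later in the deck.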

Outline of Flume Source for Tweets

public class TwitterSource extends AbstractSource
    implements EventDrivenSource, Configurable {
  ...
  // The initialization method for the Source. The context contains all
  // the Flume configuration info
  @Override
  public void configure(Context context) {
    ...
  }
  ...
  // Start processing events. Uses the Twitter Streaming API to sample
  // Twitter, and process tweets.
  @Override
  public void start() {
    ...
  }
  ...
  // Stops Source's event processing and shuts down the Twitter stream.
  @Override
  public void stop() {
    ...
  }
}


Twitter API

Callback mechanism for catching new tweets

/** The actual Twitter stream. It's set up to collect raw JSON data */
private final TwitterStream twitterStream = new TwitterStreamFactory(
    new ConfigurationBuilder().setJSONStoreEnabled(true).build())
    .getInstance();
...
// The StatusListener is a twitter4j API that can be added to a stream,
// and will call a method every time a message is sent to the stream.
StatusListener listener = new StatusListener() {
  // The onStatus method is executed every time a new tweet comes in.
  public void onStatus(Status status) {
    ...
  }
};
...
// Set up the stream's listener (defined above), and set any necessary
// security information.
twitterStream.addListener(listener);
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
AccessToken token = new AccessToken(accessToken, accessTokenSecret);
twitterStream.setOAuthAccessToken(token);


JSON data

JSON data is processed as an event and written to HDFS

public void onStatus(Status status) {
  // The EventBuilder is used to build an event using the headers and
  // the raw JSON of a tweet
  headers.put("timestamp", String.valueOf(
      status.getCreatedAt().getTime()));
  Event event = EventBuilder.withBody(
      DataObjectFactory.getRawJSON(status).getBytes(), headers);

  channel.processEvent(event);
}


What is Hive?

HiveQL
SQL-like interface
Hive interpreter converts HiveQL to MapReduce code
Returns results to the client


Hive details

Schema on read
Scalar types (int, float, double, boolean, string)
Complex types (struct, map, array)
Metastore contains table definitions
Allows queries to be data agnostic
Stored in a relational database
Similar to catalog tables in other DBs

Hive Serializers and Deserializers (SerDe)

Instructs Hive on how to interpret data

JSONSerDe

Hive Strengths:
Flexible in the data model
Extendable format support


Analyzing Twitter data with Hadoop

PUTTING IT ALL TOGETHER


Architecture

[Diagram: Flume -> HDFS -> Hive]
Flume: Custom Flume Source
HDFS: Sink to HDFS
Hive: JSON SerDe Parses Data
Query through your favorite SQL tool


Now We Can Start Asking Bigger Questions

SELECT
  t.retweet_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweet_screen_name,
        retweeted_status.text,
        max(retweet_count) AS retweets
      FROM tweets
      GROUP BY
        retweeted_status.user.screen_name,
        retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
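
To make the two aggregation levels of this query easier to follow, here is a small in-memory sketch of the same logic in plain Java. It is only an illustration, not how Hive executes the query: the class names and the sample rows are made up, and it assumes every input row is a retweet with a non-null poster and text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory version of the retweet ranking query.
final class RetweetRanking {
    /** One observed retweet: original poster, original text, count seen at that moment. */
    static final class Retweet {
        final String screenName; final String text; final long retweetCount;
        Retweet(String screenName, String text, long retweetCount) {
            this.screenName = screenName; this.text = text; this.retweetCount = retweetCount;
        }
    }

    /** One result row: poster, summed retweets, number of distinct tweets. */
    static final class Row {
        final String screenName; final long totalRetweets; final long tweetCount;
        Row(String screenName, long totalRetweets, long tweetCount) {
            this.screenName = screenName; this.totalRetweets = totalRetweets; this.tweetCount = tweetCount;
        }
    }

    static List<Row> top(List<Retweet> rows, int limit) {
        // Inner query: max(retweet_count) per (screen_name, text).
        Map<List<String>, Long> maxPerTweet = new LinkedHashMap<>();
        for (Retweet r : rows) {
            maxPerTweet.merge(List.of(r.screenName, r.text), r.retweetCount, Long::max);
        }
        // Outer query: sum(retweets) and count(*) per screen_name.
        Map<String, long[]> perUser = new LinkedHashMap<>();
        for (Map.Entry<List<String>, Long> e : maxPerTweet.entrySet()) {
            long[] acc = perUser.computeIfAbsent(e.getKey().get(0), k -> new long[2]);
            acc[0] += e.getValue();
            acc[1] += 1;
        }
        // ORDER BY total_retweets DESC LIMIT n.
        List<Row> out = new ArrayList<>();
        perUser.forEach((name, acc) -> out.add(new Row(name, acc[0], acc[1])));
        out.sort(Comparator.comparingLong((Row r) -> r.totalRetweets).reversed());
        return out.subList(0, Math.min(limit, out.size()));
    }

    public static void main(String[] args) {
        List<Retweet> sample = List.of(          // made-up sample data
            new Retweet("alice", "t1", 5), new Retweet("alice", "t1", 9),
            new Retweet("alice", "t2", 3), new Retweet("bob", "t3", 20));
        for (Row r : top(sample, 10)) {
            System.out.println(r.screenName + " " + r.totalRetweets + " " + r.tweetCount);
        }
    }
}
```

The inner step takes the maximum because the same original tweet shows up once per retweet, each time with a higher running count; summing those rows directly would overcount.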


Analyzing Twitter data with Hadoop

TEASER: FASTER HIVE? GO IMPALA!

Try it out yourself?

Cloudera provides demo VMs
https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
More info and examples
http://blog.cloudera.com/

Beyond Big and Data
Prelude to a Philosophy of the BI Future

Lars Sjödin

Analyzing Twitter data with Hadoop

EXTRA SLIDES

NOTE: Hive is not a database

                     RDBMS                    Hive
Language             Generally >= SQL-92      Subset of SQL-92 plus Hive specific extensions
Update Capabilities  INSERT, UPDATE, DELETE   INSERT OVERWRITE; no UPDATE, DELETE
Transactions         Yes                      No
Latency              Sub-second               Minutes
Indexes              Yes                      Yes
Data size            Terabytes                Petabytes


My personal preference to reduce complexity

Cloudera Manager
https://ccp.cloudera.com/display/SUPPORT/Downloads
Free up to 50 nodes

Analyzing Twitter data with Hadoop

JSON INTERLUDE


What is JSON?

Complex, semi-structured data

Based on JavaScript's data syntax
Rich, nested data types:
number
string
array
object
true, false
null


What is JSON?
{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        {
          "text": "Crowdsourcing",
          "indices": [0, 14]
        },
        {
          "text": "bigdata",
          "indices": [129, 137]
        }
      ],
      "user_mentions": []
    }
  }
}


Analyzing Twitter data with Hadoop

OOZIE: AUTOMATION


Oozie: everything in its right place

Oozie for partition management

Once an hour, add a partition

Takes advantage of advanced Hive functionality
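
The slides do not show the workflow itself, so the following is only a sketch of the kind of statement an hourly job could send to Hive. The partition column name `datehour`, the base directory and the class name are assumptions for illustration. In a real Oozie workflow the statement would more likely live in a parameterized Hive script than in Java.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical builder for an hourly ADD PARTITION statement.
final class HourlyPartitionSketch {
    // Partition key such as 2013020514, and matching directory 2013/02/05/14.
    private static final DateTimeFormatter KEY = DateTimeFormatter.ofPattern("yyyyMMddHH");
    private static final DateTimeFormatter DIR = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    /** Builds the HiveQL that registers one hour of files as a partition. */
    static String addPartitionStatement(String table, String baseDir, LocalDateTime hour) {
        return "ALTER TABLE " + table
            + " ADD IF NOT EXISTS PARTITION (datehour = " + KEY.format(hour) + ")"
            + " LOCATION '" + baseDir + DIR.format(hour) + "'";
    }

    public static void main(String[] args) {
        // Example hour chosen for illustration only.
        System.out.println(addPartitionStatement(
            "tweets", "/user/flume/tweets/", LocalDateTime.of(2013, 2, 5, 14, 0)));
    }
}
```

IF NOT EXISTS makes the statement safe to rerun if the hourly job is retried.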
