Mining Public Datasets

using Apache Zeppelin (incubating),

Apache Spark and Juju
by Alexander Bezzubov

NFLabs for ApacheCon ’16 NA

Alexander Bezzubov

Software Engineer at NFLabs, Seoul,

South Korea

Co-organizer of SeoulTech Society

Committer and PPMC member of

Apache Zeppelin (Incubating)

github.com/bzz

@seoul_engineer
Graduated in Mathematics from St. Petersburg State University, Russia
PUBLIC DATASETS: Number, Size & Growth

Web Crawls
Structured data (RDF, micro-formats, tables)
Hacker News / Reddit / Twitter / StackOverflow / Wikipedia
Reviews (movies, restaurants, beer, wine)
Emails (Enron, ASF public ML archives)
(order of TBs)
Census Data (US, UK, UN, Japan, etc.)
Transportation (Taxi, Flights, Bicycles)
Genome
AWS Public Datasets https://aws.amazon.com/public-data-sets/
Yahoo Webscope https://webscope.sandbox.yahoo.com/
Stanford Network Analysis Project (SNAP) http://snap.stanford.edu/data/

Physics Research http://opendata.cern.ch (order of PBs)

PUBLIC DATA = OPPORTUNITY
I. Tools
II. Data
TOOLS TO PURSUE THE OPPORTUNITY:
Overview of the Big Data ecosystem

… …
TOOLS TO PURSUE THE OPPORTUNITY:
Today's choice: Zeppelin, Spark, Juju

Apache Spark
Scala, Python, R

Apache Zeppelin
Modern Web GUI, plays nicely with Spark, Flink,
Elasticsearch, etc.

Warcbase
Spark library for saved crawl data (WARC)

Juju
Scales, integration with Spark, Zeppelin, AWS, GCE

APACHE ZEPPELIN: Overview

Zeppelin: Brief history

http://zeppelin.incubator.apache.org

12.2012 Commercial App using AMP Lab Shark 0.5
08.2013 NFLabs Internal project Hive/Shark
10.2013 Prototype Hive/Shark
12.2014 Enters ASF Incubation
01.2016 3 major releases
05.2016 Graduation vote passed
Interactive Visualization
APACHE SPARK

http://spark.apache.org

From Berkeley AMPLab, since 2010

Apache top-level project since 2014

1000+ contributors

REPL + Java, Scala, Python, R APIs

JUJU

https://jujucharms.com/

Service modelling at scale

Deployment/configuration automation

+ Integration with Spark, Zeppelin, Ganglia, etc

+ AWS, GCE, Azure, LXC, etc

JUJU

http://bigdata.juju.solutions/getstarted

# install on Ubuntu/Debian
$ apt-get install juju-core juju-quickstart

# or, on macOS with Homebrew
$ brew install juju juju-quickstart
$ juju generate-config
# supported environments: LXC, AWS, GCE, Azure, VMware, OpenStack

$ juju bootstrap
$ juju quickstart apache-hadoop-spark-zeppelin
$ juju expose spark zeppelin
$ juju add-unit -n4 slave
JUJU

http://bigdata.juju.solutions/getstarted

7 node cluster designed to scale out

APPROACH: local, small cluster, big cluster

1 core: Prototype (your laptop)

10s of PCs: Estimate the cost (AWS spot instances)

1000 instances: Scale out (deployment automation)

I. Tools
II. Data
DATA: GitHub http://githubarchive.org

• 300 GB compressed

• Collaboration between Google and GitHub engineers

• Events on PRs, repos, issues, comments, etc. in JSON

http://www.commitlogsfromlastnight.com/
http://sideeffect.kr/popularconvention/
https://www.gitlive.net/
http://zoom.it/kCsU
DATA PRODUCT: Get notified when
project goes Open Source
DATA PRODUCT: Exploration
DATA PRODUCT: Sketch

We are going to build a Notebook that sends you a digest email:
DATA PRODUCT: pieces (flow-chart)

We are going to build a Notebook that:

• Downloads the latest data from GitHub Archive

• Reads & explores the dataset

• Imports the events and filters the PublicEvent ones

• Joins the logs with more data from GitHub API calls

• Shows an HTML template to visualise the list

• Sends email notifications

• Does all of the above automatically, once a day
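
The download, filter and digest steps above can be sketched in plain Python before moving them into a Spark notebook. This is a minimal sketch, not the notebook from the talk: the hourly URL layout in ARCHIVE_URL and the event fields (type, repo.name, actor.login, created_at) are assumptions about the GitHub Archive dumps, and all function names are hypothetical.

```python
import gzip
import json

# ASSUMPTION: hourly dumps are published as <day>-<hour>.json.gz with a
# non-padded hour; this layout is not stated on the slides.
ARCHIVE_URL = "http://data.githubarchive.org/{day}-{hour}.json.gz"


def hourly_urls(day):
    """Return the 24 assumed hourly dump URLs for one date (YYYY-MM-DD)."""
    return [ARCHIVE_URL.format(day=day, hour=h) for h in range(24)]


def read_gz_lines(path):
    """Yield text lines from a locally downloaded .json.gz dump."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line


def public_events(lines):
    """Yield (repo, actor, timestamp) for every PublicEvent in JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # ASSUMPTION: one JSON object per line, GitHub events API field names.
        if event.get("type") == "PublicEvent":
            yield (event["repo"]["name"],
                   event["actor"]["login"],
                   event["created_at"])


def digest(events):
    """Render a plain-text digest body, one open-sourced repository per line."""
    rows = sorted(set(events))
    return "\n".join(f"{ts}  {repo} (opened by {actor})"
                     for repo, actor, ts in rows)
```

Usage: feed read_gz_lines("<downloaded dump>") into public_events, then pass the result to digest to get the email body. Joining with the GitHub API and the daily schedule are left out of this sketch.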

DATA PRODUCT: Full impl
I. Tools
II. Data
DATA: Common Crawl

https://commoncrawl.org

Nonprofit, by Factual

On AWS S3, in WARC and WAT formats

Since 2013, monthly: ~150 TB compressed, 2+ bln URLs

URL Index by Ilya Kreymer of @webrecorder_io

http://index.commoncrawl.org/
https://about.commonsearch.org
DATA: CommonCrawl - Data Product

Measuring the impact of Google Analytics

Objective: estimate % of pages/domains that use Google Analytics/Facebook

Existing research from 2013
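
The objective above boils down to two ratios: the share of pages and the share of domains whose HTML references the tracker. A minimal sketch of that computation follows. It is not the talk's implementation: detection here is a plain substring match on the two well-known Google Analytics script URLs, the function names are hypothetical, and a real run would read (url, html) pairs from WARC records instead of a list.

```python
from urllib.parse import urlparse

# Script URLs used by the classic (ga.js) and universal (analytics.js)
# Google Analytics snippets. Substring matching is a simplification.
GA_MARKERS = (
    "google-analytics.com/ga.js",
    "google-analytics.com/analytics.js",
)


def uses_ga(html):
    """Return True if the page source references a Google Analytics script."""
    body = html.lower()
    return any(marker in body for marker in GA_MARKERS)


def ga_share(pages):
    """Compute (% of pages, % of domains) that use Google Analytics.

    pages is an iterable of (url, html) pairs; a domain counts as a user
    if at least one of its pages matches.
    """
    total_pages = 0
    ga_pages = 0
    domains = {}
    for url, html in pages:
        hit = uses_ga(html)
        total_pages += 1
        ga_pages += int(hit)
        host = urlparse(url).netloc.lower()
        domains[host] = domains.get(host, False) or hit
    if total_pages == 0:
        return 0.0, 0.0
    page_pct = 100.0 * ga_pages / total_pages
    domain_pct = 100.0 * sum(domains.values()) / len(domains)
    return page_pct, domain_pct
```

Counting per domain as well as per page matters because a few large sites can dominate the page count.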

DATA: CommonCrawl - Data Product

Measuring the impact of Google Analytics

Copy to HDFS vs read from S3
Verify using grep
hadoop jar hadoop-examples.jar grep /grep-data/ \
/grep-output/ '[Bb]ig [Dd]ata is ([a-zA-Z]{5,})' 
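
A small local cross-check can confirm what the Hadoop grep job should report on a sample of the data. The sketch below is an assumption about how one might verify the output, not part of the talk: it applies the same regular expression with Python's re module and counts each matched string; grep_counts is a hypothetical helper name.

```python
import re
from collections import Counter

# Same pattern as in the hadoop-examples grep command above.
PATTERN = re.compile(r"[Bb]ig [Dd]ata is ([a-zA-Z]{5,})")


def grep_counts(lines):
    """Count every full match of PATTERN, most frequent first."""
    counts = Counter()
    for line in lines:
        for match in PATTERN.finditer(line):
            counts[match.group(0)] += 1
    return counts.most_common()
```

Running it over a few hundred lines copied from /grep-data/ and comparing with the matching rows in /grep-output/ is a cheap sanity check before scaling out.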


DATA: CommonCrawl - Data Product

Feb 2016 Crawl:

- 48 TB compressed
- 100 segments (directories on S3)
- 30,000 files, ~1 GB each
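
With the crawl size known, a back-of-the-envelope estimate of scan time and cost is simple arithmetic. The sketch below assumes the job is purely IO-bound and scales linearly with the number of instances; the throughput and price used in the usage note are hypothetical placeholders, not measurements from the talk.

```python
def scan_hours(total_tb, instances, mb_per_sec):
    """Hours needed to read total_tb once, given per-instance throughput.

    Uses decimal units (1 TB = 1,000,000 MB) and assumes perfect
    linear scaling across instances.
    """
    total_mb = total_tb * 1_000_000
    seconds = total_mb / (instances * mb_per_sec)
    return seconds / 3600


def scan_cost(total_tb, instances, mb_per_sec, price_per_instance_hour):
    """Total cost of one full scan at a given hourly price per instance."""
    hours = scan_hours(total_tb, instances, mb_per_sec)
    return hours * instances * price_per_instance_hour
```

For example, scan_hours(48, 100, 100) models 100 instances each reading 100 MB/s. Under the linear-scaling assumption the cost does not depend on the instance count, so adding instances only buys time; in practice the spot price and network throughput of the chosen instance type decide the bill.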
DATA: CommonCrawl - Data Product

AWS optimisations:
- pick spot instance prices
- pick instance type (net throughput)
- use Juju instead of EMR (2x $$ savings!)
Spark optimisations:
- IO-bound, so increase spark.executor.cores, spark.executor.memory
DATA: CommonCrawl - Data Product
Zeppelin Viewer

Community service for sharing example notebooks

http://zeppelinhub.com/viewer
TAKEAWAY

There are plenty of free tools out there

To crunch the data for fun and profit

They are easy (not simple) to learn and generic enough

Questions?

Alexander Bezzubov
@seoul_engineer

github.com/bzz
Thank you
Alexander Bezzubov
NFLabs, Seoul (we are hiring!)
