Mining Public Datasets


Mining Public Datasets
using Apache Zeppelin (incubating), Apache Spark and Juju

by Alexander Bezzubov
NFLabs for ApacheCon ’16 NA


Alexander Bezzubov

Software Engineer at NFLabs, Seoul, South Korea
Co-organizer of SeoulTech Society
Committer and PPMC member of Apache Zeppelin (Incubating)
Graduated in Mathematics from St. Petersburg State University, Russia

github.com/bzz
@seoul_engineer
PUBLIC DATASETS: Number, Size & Growth

Web Crawls
Structured data (RDF, microformats, tables)
Hacker News / Reddit / Twitter / StackOverflow / Wikipedia
Reviews (movies, restaurants, beer, wine)
Emails (Enron, ASF public mailing list archives)
Census Data (US, UK, UN, Japan, etc.)
Transportation (Taxi, Flights, Bicycles)
Climate
Genome
Size: on the order of TBs

AWS Public Datasets https://aws.amazon.com/public-data-sets/
Yahoo Webscope https://webscope.sandbox.yahoo.com/
Stanford Network Analysis Project http://snap.stanford.edu/data/
Physics Research http://opendata.cern.ch (on the order of PBs)

PUBLIC DATA = OPPORTUNITY
I. Tools
II. Data
TOOLS TO PURSUE THE OPPORTUNITY:
Overview of the Big Data ecosystem

TOOLS TO PURSUE THE OPPORTUNITY:
Today's choice: Zeppelin, Spark, Juju

Apache Spark
Scala, Python, R

Apache Zeppelin
Modern Web GUI, plays nicely with Spark, Flink,
Elasticsearch, etc.

Warcbase
Spark library for saved crawl data (WARC)

Juju
Scales, integration with Spark, Zeppelin, AWS, GCE

APACHE ZEPPELIN: Overview


Zeppelin: Brief history

http://zeppelin.incubator.apache.org

12.2012 Commercial App using AMP Lab Shark 0.5


10.2013 Prototype Hive/Shark
08.2013 NFLabs Internal project Hive/Shark
12.2014 Enters ASF Incubation
01.2016 3 major releases
05.2016 Graduation vote passed
Interactive Visualization
APACHE SPARK

http://spark.apache.org

From Berkeley AMPLab, since 2010

Part of Apache since 2014

1000+ contributors

REPL + Java, Scala, Python, R APIs


JUJU

https://jujucharms.com/

Service modelling at scale

Deployment/configuration automation

+ Integration with Spark, Zeppelin, Ganglia, etc

+ AWS, GCE, Azure, LXC, etc

JUJU

http://bigdata.juju.solutions/getstarted

$ apt-get install juju-core juju-quickstart
# or
$ brew install juju juju-quickstart

$ juju generate-config
# LXC, AWS, GCE, Azure, VMWare, OpenStack

$ juju bootstrap
$ juju quickstart apache-hadoop-spark-zeppelin
$ juju expose spark zeppelin
$ juju add-unit -n4 slave
JUJU

http://bigdata.juju.solutions/getstarted

7 node cluster designed to scale out


APPROACH: local, small cluster, big cluster

Scale            Goal                Where / how
1 core           Prototype           Your laptop
10s of PCs       Estimate the cost   AWS spot instances
1000 instances   Scale out           Deployment automation
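
The middle step, estimating the cost on a small cluster before scaling out, comes down to simple arithmetic. A minimal Python sketch, assuming throughput scales roughly linearly with the number of instances; the function name and every number below are hypothetical placeholders, not measurements from the talk:

```python
def estimate_full_run(sample_gb, sample_minutes, sample_instances,
                      total_gb, target_instances, price_per_instance_hour):
    """Extrapolate wall-clock hours and cost from a small trial run.

    Assumes linear scaling with instance count, which is optimistic
    for IO-bound jobs that share network bandwidth.
    """
    # GB processed per instance per hour during the trial run
    gb_per_instance_hour = sample_gb / (sample_minutes / 60.0) / sample_instances
    # Total instance-hours needed for the full dataset
    instance_hours = total_gb / gb_per_instance_hour
    hours = instance_hours / target_instances
    cost = instance_hours * price_per_instance_hour
    return hours, cost


# Hypothetical trial: 10 instances chew through 100 GB in 30 minutes.
hours, cost = estimate_full_run(
    sample_gb=100, sample_minutes=30, sample_instances=10,
    total_gb=48_000, target_instances=100, price_per_instance_hour=0.10)
print(f"~{hours:.1f} h wall-clock, ~${cost:.0f}")
```

Under the linear-scaling assumption the instance count changes the wall-clock time but not the total cost, so the spot price matters more than the cluster size.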


I. Tools
II. Data
DATA: GitHub http://githubarchive.org

• 300 GB compressed

• Collaboration between Google and GitHub engineers

• Events on PRs, repos, issues, comments, etc. in JSON
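
To get a feel for the event data, a first exploration step could count events by type. The sketch below is a minimal Python example, assuming each archive file is gzip-compressed with one JSON event per line and that every event carries a "type" field; the sample events are made up and stand in for a downloaded file:

```python
import gzip
import io
import json
from collections import Counter


def count_event_types(fileobj):
    """Count events by their "type" field in a gzipped JSON-lines stream."""
    counts = Counter()
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # skip blank lines
                counts[json.loads(line)["type"]] += 1
    return counts


# Hypothetical sample standing in for one downloaded archive file.
sample_events = [
    {"type": "PushEvent", "repo": {"name": "octo/demo"}},
    {"type": "PublicEvent", "repo": {"name": "octo/newly-open"}},
    {"type": "PushEvent", "repo": {"name": "octo/other"}},
]
raw = gzip.compress(
    "\n".join(json.dumps(e) for e in sample_events).encode("utf-8"))

event_counts = count_event_types(io.BytesIO(raw))
print(event_counts.most_common())
```

In a notebook the same counting would more likely be done with Spark over many files at once; the plain-Python version only illustrates the record layout.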


http://www.commitlogsfromlastnight.com/
http://sideeffect.kr/popularconvention/
https://www.gitlive.net/
http://zoom.it/kCsU
DATA PRODUCT: Get notified when a project goes Open Source
DATA PRODUCT: Exploration
DATA PRODUCT: Sketch

We are going to build a Notebook that sends you a digest email:
DATA PRODUCT: pieces (flow-chart)

We are going to build a Notebook that:

• Downloads the latest data from GitHub Archive

• Reads & explores the dataset

• Imports and filters the PublicEvent records

• Joins the logs with more data from GitHub API calls

• Shows an HTML template to visualise the list

• Sends email notifications

• Does all of the above automatically, once a day
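
The filtering and notification steps above can be sketched in plain Python. This is an illustration only, not the notebook from the talk: it assumes events are already parsed into dicts with "type" and "repo.name" fields, skips the GitHub API join, and builds the email without sending it. The function names and addresses are hypothetical.

```python
from email.message import EmailMessage
from html import escape


def public_repos(events):
    """Return sorted names of repositories seen in PublicEvent records."""
    return sorted({e["repo"]["name"] for e in events
                   if e.get("type") == "PublicEvent"})


def build_digest(repo_names, sender, recipient):
    """Build (but do not send) an HTML digest email listing the repositories."""
    items = "".join(
        f'<li><a href="https://github.com/{escape(n)}">{escape(n)}</a></li>'
        for n in repo_names)
    msg = EmailMessage()
    msg["Subject"] = f"Newly public repositories: {len(repo_names)}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(repo_names))  # plain-text fallback
    msg.add_alternative(f"<ul>{items}</ul>", subtype="html")
    return msg


# Hypothetical parsed events.
events = [
    {"type": "PushEvent", "repo": {"name": "octo/demo"}},
    {"type": "PublicEvent", "repo": {"name": "octo/newly-open"}},
]
digest = build_digest(public_repos(events),
                      "notebook@example.com", "me@example.com")
print(digest["Subject"])
```

Sending the message is left out on purpose; one option would be `smtplib.SMTP(...).send_message(digest)` against a mail server you control, triggered by the notebook's daily schedule.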


DATA PRODUCT: Full implementation
I. Tools
II. Data
DATA: Common Crawl

https://commoncrawl.org

Nonprofit, by Factual

On AWS S3 in WARC, WAT and WET formats

Since 2013, monthly: ~150 TB compressed, 2+ billion URLs

URL Index by Ilya Kreymer of @webrecorder_io

http://index.commoncrawl.org/
https://about.commonsearch.org
DATA: CommonCrawl - Data Product

Measuring the impact of Google Analytics

Objective: estimate % of pages/domains that use Google Analytics/Facebook
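
As a rough illustration of the objective, the share of pages can be estimated by matching tracker script URLs in the page HTML. The sketch below is not the method used in the talk: the regular expressions are assumed markers for the two trackers, the helper name is hypothetical, and the sample pages are made up.

```python
import re

# Assumed markers: script hosts commonly associated with the two trackers.
TRACKERS = {
    "google_analytics": re.compile(
        r"google-analytics\.com/(?:ga|analytics|urchin)\.js", re.I),
    "facebook": re.compile(r"connect\.facebook\.net/", re.I),
}


def tracker_share(pages):
    """Return the fraction of pages matching each tracker pattern."""
    pages = list(pages)
    hits = {name: 0 for name in TRACKERS}
    for html in pages:
        for name, pattern in TRACKERS.items():
            if pattern.search(html):
                hits[name] += 1
    total = len(pages) or 1  # avoid division by zero on an empty sample
    return {name: n / total for name, n in hits.items()}


# Made-up sample pages.
sample_pages = [
    '<script src="//www.google-analytics.com/analytics.js"></script>',
    '<script src="https://connect.facebook.net/en_US/sdk.js"></script>',
    "<p>no trackers here</p>",
    '<script src="http://www.google-analytics.com/ga.js"></script>',
]
shares = tracker_share(sample_pages)
print(shares)
```

On the real crawl the same predicate would run over the response bodies stored in WARC files, for example inside a Spark job, and a per-domain figure would need grouping by host first.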

Existing research from 2013


DATA: CommonCrawl - Data Product

Measuring the impact of Google Analytics


Copy to HDFS vs read from S3
Verify using grep
hadoop jar hadoop-examples.jar grep /grep-data/ \
/grep-output/ '[Bb]ig [Dd]ata is ([a-zA-Z]{5,})'




DATA: CommonCrawl - Data Product

Feb 2016 Crawl:

- 48 TB compressed
- 100 segments (directories on S3)
- 30,000 files, ~1 GB each
DATA: CommonCrawl - Data Product

AWS optimisations:
- pick spot instance prices
- pick instance type (net throughput)
- use Juju instead of EMR (2x $$ savings!)
Spark optimisations:
- IO-bound, so increase spark.executor.cores, spark.executor.memory
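
One way to pass these two Spark properties is as `--conf` options to `spark-submit`. The small Python helper below only assembles such a command line; the helper name, the application file `job.py` and the values are hypothetical and would need tuning against the chosen instance type:

```python
def spark_submit_args(app, settings):
    """Build a spark-submit command line with one --conf per setting."""
    args = ["spark-submit"]
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args + [app]


# Hypothetical values, not the ones used in the talk.
command = spark_submit_args("job.py", {
    "spark.executor.cores": 8,
    "spark.executor.memory": "4g",
})
print(" ".join(command))
```

In a Zeppelin setup the same properties would more typically be set in the Spark interpreter settings rather than on a command line.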
DATA: CommonCrawl - Data Product
Zeppelin Viewer

Community service for sharing example notebooks


http://zeppelinhub.com/viewer
TAKEAWAY

There are plenty of free tools out there

To crunch the data for fun and profit

They are easy (not simple) to learn and generic enough


Questions?

Alexander Bezzubov
@seoul_engineer

github.com/bzz
Thank you
Alexander Bezzubov
NFLabs, Seoul (we are hiring!)
