Mining Public Datasets
github.com/bzz
@seoul_engineer
Graduated in Mathematics from St. Petersburg State
University, Russia
PUBLIC DATASETS: Number, Size & Growth
Web Crawls
Structured data (RDF, micro-formats, tables)
Hacker News / Reddit / Twitter / Stack Overflow / Wikipedia
Reviews (movies, restaurants, beer, wine)
Emails (Enron, ASF public mailing list archives)
Census Data (US, UK, UN, Japan, etc.)
Transportation (Taxi, Flights, Bicycles)
Climate
Genome
Size: on the order of terabytes
AWS Public Datasets https://aws.amazon.com/public-data-sets/
Yahoo Webscope https://webscope.sandbox.yahoo.com/
Stanford Network Analysis Project (SNAP) http://snap.stanford.edu/data/
… …
TOOLS TO PURSUE THE OPPORTUNITY:
Overview of the Big Data ecosystem
Today's choice: Zeppelin, Spark, Juju
Apache Spark
Scala, Python, R
Apache Zeppelin
Modern Web GUI, plays nicely with Spark, Flink,
Elasticsearch, etc.
Warcbase
Spark library for saved crawl data (WARC)
Juju
Scales; integrates with Spark, Zeppelin, AWS, GCE
http://zeppelin.incubator.apache.org
http://spark.apache.org
1000+ contributors
https://jujucharms.com/
Deployment / configuration automation
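To make "saved crawl data (WARC)" more concrete, here is a minimal sketch of how one record in a WARC file is laid out: a version line, named header fields, a blank line, then a payload of Content-Length bytes. This is not Warcbase's API; the function name and the sample record are hypothetical and written by hand for illustration, and real files are usually gzip-compressed per record.

```python
import io


def read_warc_record(stream):
    """Read one WARC record from a binary stream.

    Returns (version, headers, payload), or None at end of stream.
    Hypothetical helper for illustration only, not part of Warcbase.
    """
    # Skip the blank lines that separate consecutive records.
    line = stream.readline()
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None

    version = line.strip().decode("utf-8")

    # Header fields are "Name: value" lines, terminated by a blank line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The payload is exactly Content-Length bytes long.
    length = int(headers.get("Content-Length", 0))
    payload = stream.read(length)
    return version, headers, payload


# Hand-written sample record (an assumption, not taken from a real crawl).
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

stream = io.BytesIO(sample)
version, headers, payload = read_warc_record(stream)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

In practice a Spark library such as Warcbase takes care of this parsing at scale; the sketch only shows what each record contains.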
JUJU
http://bigdata.juju.solutions/getstarted
$ juju bootstrap
$ juju quickstart apache-hadoop-spark-zeppelin
$ juju expose spark zeppelin
$ juju add-unit -n4 slave
1 core Prototype
Your laptop
• 300 GB compressed
https://commoncrawl.org
Nonprofit, by Factual
http://index.commoncrawl.org/
https://about.commonsearch.org
DATA: CommonCrawl - Data Product
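A small sketch of how the index at index.commoncrawl.org can be used before touching any bulk data: build a lookup URL for a URL pattern, then turn one returned entry into an HTTP Range header that fetches only that record from its WARC file. The collection name `CC-MAIN-2015-48`, the helper names, and the sample index line are assumptions for illustration; check the index page for the collections actually available.

```python
import json
from urllib.parse import urlencode

INDEX_SERVER = "http://index.commoncrawl.org"


def index_query_url(collection, url_pattern):
    """Build a lookup URL for one crawl collection (hypothetical helper)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{collection}-index?{params}"


def byte_range(entry):
    """Turn an index entry's offset/length into an HTTP Range header value."""
    start = int(entry["offset"])
    end = start + int(entry["length"]) - 1  # Range ends are inclusive
    return f"bytes={start}-{end}"


# One hand-written line in the shape of a JSON index response
# (values are made up for this sketch).
sample_line = (
    '{"url": "http://example.com/", '
    '"filename": "crawl-data/example.warc.gz", '
    '"offset": "1000", "length": "250"}'
)
entry = json.loads(sample_line)

query = index_query_url("CC-MAIN-2015-48", "example.com/*")
print(query)
print(entry["filename"], byte_range(entry))
```

Fetching a few records by range like this fits the single-core laptop prototype; the full crawl is where the cluster comes in.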
AWS optimisations:
- pick spot instance prices
- pick instance type (net throughput)
- use Juju instead of EMR (2x $$ savings!)
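One way to read "pick instance type (net throughput)" for a download-heavy job is to compare spot price per unit of network bandwidth rather than per core. The sketch below does that arithmetic; the instance names, prices, and throughput figures are entirely hypothetical placeholders, not real AWS numbers.

```python
from dataclasses import dataclass


@dataclass
class InstanceOption:
    name: str
    spot_price_per_hour: float  # USD, hypothetical
    network_gbps: float         # nominal network throughput, hypothetical


def cost_per_gbps(option):
    """Hourly spot price divided by network throughput."""
    return option.spot_price_per_hour / option.network_gbps


def cheapest_for_throughput(options):
    """Pick the option with the lowest price per Gbit/s."""
    return min(options, key=cost_per_gbps)


# Placeholder figures for illustration only.
options = [
    InstanceOption("type-a", 0.10, 1.0),
    InstanceOption("type-b", 0.30, 10.0),
    InstanceOption("type-c", 0.50, 10.0),
]

best = cheapest_for_throughput(options)
print(best.name, round(cost_per_gbps(best), 3))
```

With real spot prices plugged in, the cheapest instance per hour is often not the cheapest per gigabit moved, which is the point of the bullet above.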
Spark optimisations:
- IO-bound, so increase spark.executor.cores
spark.executor.memory
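The two Spark properties named above can be passed on the command line with `--conf key=value`. The sketch below only assembles such a command; the values `8` and `4g` and the application name `crawl_job.py` are assumptions chosen for illustration, and the right settings depend on the instance type and workload.

```python
def spark_submit_args(app, conf):
    """Assemble a spark-submit command line from a dict of Spark properties."""
    args = ["spark-submit"]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


# Hypothetical values: more concurrent tasks per executor for an IO-bound job.
conf = {
    "spark.executor.cores": 8,
    "spark.executor.memory": "4g",
}

command = " ".join(spark_submit_args("crawl_job.py", conf))
print(command)
```

The same properties can also be set in `spark-defaults.conf` or in the Spark interpreter settings in Zeppelin.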
DATA: CommonCrawl - Data Product
Zeppelin Viewer
Thank you
Alexander Bezzubov
NFLabs, Seoul (we are hiring!)
@seoul_engineer
github.com/bzz