003 This Course 1

This document discusses the evolution of data analysis tools and abstractions over time. The commercial databases and open-source tools of the pre-2004 era were followed by the introduction of MapReduce in 2004 and the Hadoop platform starting in 2008; relational query layers were then built on top, including Pig and Hive on Hadoop and DryadLINQ on a Hadoop-like system. The document also argues that simply downloading large datasets will not scale to the sizes of data now being collected, and that databases and parallel/distributed systems are needed to index and analyze petabytes of data. It cites a McKinsey report that the US faces shortages of people with deep analytical skills and of managers able to use big data to make effective decisions.


4/28/13 Bill Howe, UW eScience 1

This Course
[Course map: the course is laid out along four axes: tools vs. abstractions, desktop vs. cloud, structures vs. statistics, and hackers vs. analysts.]
4/28/13 Bill Howe, UW 2
[Course map position: tools / abstractions]
What goes around comes around
Pre-2004: commercial RDBMS, some open source
2004: Dean et al., MapReduce
2008: Hadoop 0.17 release
2008: Olston et al., Pig: Relational Algebra on Hadoop
2008: DryadLINQ: Relational Algebra in a Hadoop-like system
2009: Thusoo et al., Hive: SQL on Hadoop
2009: HBase: Indexing for Hadoop
2010: Dietrich et al., Schemas and Indexing for Hadoop
2012: Transactions in HBase (plus VoltDB and other NewSQL systems)
But also some permanent contributions:
Fault tolerance
Schema-on-read
User-defined functions that don't suck
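
Since everything else in this timeline layers on the 2004 MapReduce model, a tiny illustration may help. Below is a minimal, single-process sketch of the map/reduce programming model (word count) in plain Python; it is not Hadoop's or Google's actual API, and the function names are only illustrative.

from itertools import groupby
from operator import itemgetter

def map_fn(_key, line):
    # map: emit a (word, 1) pair for every word in one input record
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce: sum all the counts emitted for one word
    yield (word, sum(counts))

def mapreduce(records, mapper, reducer):
    # "shuffle": gather every intermediate pair, group by key, reduce each group
    intermediate = sorted(
        pair for key, value in records for pair in mapper(key, value)
    )
    output = []
    for word, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(word, (count for _, count in group)))
    return output

docs = [(0, "the quick brown fox"), (1, "the lazy dog jumps over the fox")]
print(mapreduce(docs, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('fox', 2), ('jumps', 1), ('lazy', 1), ('over', 1), ('quick', 1), ('the', 3)]

Pig and Hive, listed above, compile relational-style queries into chains of jobs of roughly this shape.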
4/28/13 Bill Howe, UW 3
What are the abstractions of data science?
[Course map position: tools / abstractions]
Data Jujitsu
Data Wrangling
Data Munging

Translation: We have no idea what this is all about.
4/28/13 Bill Howe, UW 4
What are the abstractions of data science?
[Course map position: tools / abstractions]
Matrices and linear algebra?
Relations and relational algebra?
Objects and methods?
Files and scripts?
Data frames and functions?
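
To make the question concrete, here is a small sketch in plain Python (standard library only; the toy dataset and function names are made up) of the same task, average score per group, expressed against two of the candidate abstractions: relations with relational-algebra-style operators, and files with line-oriented scripts.

from collections import defaultdict

rows = [
    {"group": "a", "score": 1.0},
    {"group": "a", "score": 3.0},
    {"group": "b", "score": 2.0},
]

def group_avg_relational(relation):
    # relations + relational algebra: scan a set of tuples, group, aggregate
    groups = defaultdict(list)
    for row in relation:
        groups[row["group"]].append(row["score"])
    return {g: sum(scores) / len(scores) for g, scores in groups.items()}

def group_avg_lines(lines):
    # files + scripts: the same logic phrased as line-oriented text processing
    sums, counts = defaultdict(float), defaultdict(int)
    for line in lines:  # e.g. lines of a CSV file in "group,score" form
        g, s = line.strip().split(",")
        sums[g] += float(s)
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}

print(group_avg_relational(rows))                    # {'a': 2.0, 'b': 2.0}
print(group_avg_lines(["a,1.0", "a,3.0", "b,2.0"]))  # {'a': 2.0, 'b': 2.0}

The answer is the same either way; what differs is how much of the work (optimization, parallelism, error checking) the abstraction lets the system take over.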
5
Data Access Hitting a Wall
[Course map position: desktop / cloud]
Current practice is based on data download (FTP/GREP), which will not scale to the datasets of tomorrow:
You can GREP 1 MB in a second, and FTP it in 1 second.
You can GREP 1 GB in a minute, and FTP it in a minute (~$1).
You can GREP 1 TB in 2 days, and FTP it in 2 days and $1K.
You can GREP 1 PB in 3 years, and FTP it in 3 years and $1M.
Oh, and 1 PB is ~5,000 disks.
At some point you need indices to limit search, and parallel data search and analysis. This is where databases can help.
[slide src: Jim Gray]
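
The arithmetic behind those numbers is easy to reproduce. The short sketch below, in plain Python, assumes a single disk scanning at roughly 10 MB/s (an assumed early-2000s sequential rate, not a figure from the slide); at that rate a petabyte takes on the order of three years to scan, matching the slide.

ASSUMED_SCAN_RATE = 10 * 10**6  # bytes per second; an assumption, not from the slide

sizes = {"1 MB": 10**6, "1 GB": 10**9, "1 TB": 10**12, "1 PB": 10**15}

for label, nbytes in sizes.items():
    seconds = nbytes / ASSUMED_SCAN_RATE
    days = seconds / 86400
    years = days / 365
    print(f"{label}: {seconds:,.0f} s (~{days:,.1f} days, ~{years:,.1f} years)")

# 1 MB: 0 s (~0.0 days, ~0.0 years)
# 1 GB: 100 s (~0.0 days, ~0.0 years)
# 1 TB: 100,000 s (~1.2 days, ~0.0 years)
# 1 PB: 100,000,000 s (~1,157.4 days, ~3.2 years)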
"US faces shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions."
-- McKinsey Global Institute
4/28/13 Bill Howe, UW 6
[Course map position: hackers / analysts]
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
THEN x.end_bp - x.start_bp + 1
WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
THEN x.end_bp - w.start_bp + 1
WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
THEN w.end_bp - x.start_bp + 1
END AS len_overlap

FROM [[email protected]].[hotspots_deserts.tab] x
INNER JOIN [[email protected]].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Biologists are beginning to write very complex queries (rather than relying on staff programmers).

Example: computing the overlaps of two sets of BLAST results. We see thousands of queries written by non-programmers.
[Course map position: hackers / analysts]
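
For readers not fluent in SQL, the heart of the query above is an interval-overlap computation. The plain-Python sketch below (hypothetical helper name) shows the usual single-expression way to compute the length of the overlap of two closed base-pair intervals, the quantity the query's CASE expression spells out branch by branch.

def overlap_len(x_start, x_end, w_start, w_end):
    # length, in bp, of the intersection of [x_start, x_end] and [w_start, w_end];
    # None if the intervals do not overlap
    length = min(x_end, w_end) - max(x_start, w_start) + 1
    return length if length > 0 else None

print(overlap_len(10, 20, 15, 30))  # 6 (positions 15..20 are shared)
print(overlap_len(10, 20, 25, 30))  # None (no overlap)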
