003 This Course 1

This document discusses the evolution of data analysis tools and abstractions. Commercial relational databases and some open source tools dominated before 2004; MapReduce followed in 2004 and the Hadoop 0.17 release in 2008. Relational query tools were then built on Hadoop and Hadoop-like systems, including Pig, DryadLINQ, and Hive. The document also argues that downloading large datasets for analysis will not scale to the volumes now being collected, and that indices and parallel search, which databases provide, are needed at terabyte and petabyte scale. It cites a McKinsey Global Institute estimate that the US faces a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts able to use big data analysis to make effective decisions.

Uploaded by

Mauricio Micoski


4/28/13 Bill Howe, UW eScience 1

This Course
[Figure: four dimensions of the course: tools vs. abstractions, desktop vs. cloud, structures vs. statistics, hackers vs. analysts]
[Dimension: tools vs. abstractions]
What goes around comes around
Pre-2004: commercial RDBMS, some open source
2004 Dean et al. MapReduce
2008 Hadoop 0.17 release
2008 Olston et al. Pig: Relational Algebra on Hadoop
2008 DryadLINQ: Relational Algebra in a Hadoop-like system
2009 Thusoo et al. Hive: SQL on Hadoop
2009 HBase: Indexing for Hadoop
2010 Dittrich et al. Schemas and Indexing for Hadoop
2012 Transactions in HBase (plus VoltDB, other NewSQL systems)
But also some permanent contributions:
Fault tolerance
Schema-on-Read
User-defined functions that don't suck
What are the abstractions of data science?
[Dimension: tools vs. abstractions]
Data Jujitsu
Data Wrangling
Data Munging
Translation: We have no idea what this is all about
What are the abstractions of data science?
matrices and linear algebra?
relations and relational algebra?
objects and methods?
files and scripts?
data frames and functions?
[Dimension: tools vs. abstractions]
Data Access Hitting a Wall
Current practice is based on data download (FTP/GREP)
Will not scale to the datasets of tomorrow
You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years
Oh, and 1 PB is ~5,000 disks
You can FTP 1 MB in 1 sec
You can FTP 1 GB in a minute (~$1)
You can FTP 1 TB in 2 days (~$1K)
You can FTP 1 PB in 3 years (~$1M)
At some point you need
  indices to limit search
  parallel data search and analysis
This is where databases can help
[Dimension: desktop vs. cloud]
[slide src: Jim Gray]
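The scan times on this slide are order-of-magnitude figures, and they can be roughly reproduced with simple arithmetic. The sketch below is an illustration only: the sustained read rate of 10 MB/s per disk, the decimal units, and the function names `scan_seconds` and `humanize` are assumptions made here, not values taken from the slide. It also shows why parallel search matters: spreading the same petabyte over the ~5,000 disks mentioned above (about 200 GB each) divides the scan time by the number of disks, assuming each disk can be read independently.

```python
# Back-of-the-envelope sequential scan times.
# ASSUMPTION: 10 MB/s sustained read rate per disk and decimal units are
# illustrative choices, not figures from the slide.
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scan_seconds(size_bytes, bytes_per_second=10 * MB, workers=1):
    """Seconds to read size_bytes once, split evenly across `workers` disks."""
    return size_bytes / (bytes_per_second * workers)


def humanize(seconds):
    """Render a duration in the largest unit that gives a value >= 1."""
    units = (
        ("years", SECONDS_PER_YEAR),
        ("days", SECONDS_PER_DAY),
        ("hours", 3600),
        ("minutes", 60),
    )
    for unit, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


for label, size in (("1 MB", MB), ("1 GB", GB), ("1 TB", TB), ("1 PB", PB)):
    print(f"scan {label} on one disk: {humanize(scan_seconds(size))}")

# The same petabyte scanned by 5,000 disks in parallel.
print(f"scan 1 PB on 5,000 disks: {humanize(scan_seconds(PB, workers=5000))}")
```

Under these assumptions a single-disk scan of 1 TB takes a bit over a day and 1 PB a little over three years, in line with the slide, while the parallel scan finishes in hours. An index goes further by avoiding most of the reads altogether.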
US faces shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
--McKinsey Global Institute
[Dimension: hackers vs. analysts]
SELECT x.strain, x.chr, x.region AS snp_region, x.start_bp AS snp_start_bp
     , x.end_bp AS snp_end_bp, w.start_bp AS nc_start_bp, w.end_bp AS nc_end_bp
     , w.category AS nc_category
     , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
            THEN x.end_bp - x.start_bp + 1
            WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
            THEN x.end_bp - w.start_bp + 1
            WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
            THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [[email protected]].[hotspots_deserts.tab] x
INNER JOIN [[email protected]].[table_noncoding_positions.tab] w
   ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Biologists are beginning to write very complex queries (rather than relying on staff programmers).
Example: Computing the overlaps of two sets of BLAST results.
We see thousands of queries written by non-programmers.
[Dimension: hackers vs. analysts]
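The CASE expression in the query computes the length of the overlap between two intervals on the same chromosome. A compact way to state the same idea, assuming inclusive start and end coordinates as the `+ 1` in the query suggests, is "smaller end minus larger start, plus one". The function name `overlap_len` below is hypothetical and the sketch is not the query author's method. One difference is worth noting: when the noncoding interval lies strictly inside the SNP interval, the second WHEN branch of the query as shown returns `x.end_bp - w.start_bp + 1`, which is longer than the shared span, whereas the min/max form needs no case split.

```python
def overlap_len(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# x = SNP region, w = noncoding region; coordinates are made-up examples.
examples = {
    "x inside w": (10, 20, 5, 30),
    "w starts inside x, ends after": (5, 15, 10, 30),
    "w ends inside x, starts before": (20, 40, 10, 25),
    "w strictly inside x": (1, 100, 10, 20),
    "disjoint": (1, 5, 10, 20),
}

for name, (x_start, x_end, w_start, w_end) in examples.items():
    print(f"{name}: {overlap_len(x_start, x_end, w_start, w_end)}")
```

In SQL dialects that offer scalar two-argument minimum and maximum functions, the same expression could replace the three-branch CASE; whether the system behind this query supports them is not stated in the slides.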
