004 This Course 2
[Figure: the course positioned along four axes: tools vs. abstractions, desktop vs. cloud, structures vs. statistics, hackers vs. analysts]
This Course
4/28/13 Bill Howe, UW 2
80% of analytics is sums and averages
-- Aaron Kimball, WibiData
Three types of tasks:
1) Preparing to run a model
2) Running the model
3) Interpreting the results
Gathering, cleaning, integrating, restructuring,
transforming, loading, filtering, deleting, combining,
merging, verifying, extracting, shaping, massaging
80% of the work
-- Aaron Kimball
The other 80% of the work?
no greater barrier to effective data management will
exist than the variety of incompatible data formats,
non-aligned data structures, and inconsistent data
semantics.
-- Doug Laney, "3-D Data Management: Controlling Data Volume, Velocity and Variety," Gartner, 2001
Problem
How much time do you spend handling
data as opposed to doing science?
Mode answer: 90%
src: Christian Grant, MADSkills
src: Christian Grant, MADSkills
(Sparse) Matrix Multiply in SQL
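The slide's own query is not reproduced in this text, so the following is only a minimal sketch of one common way to express a sparse matrix product in SQL. It assumes each matrix is stored as a table of (row_num, col_num, value) triples holding only the nonzero entries; the table names `a` and `b`, the column names, and the helper `sparse_matmul` are illustrative, and SQLite is used through Python only to keep the example runnable.

```python
import sqlite3


def sparse_matmul(a_entries, b_entries):
    """Multiply two sparse matrices given as (row, col, value) triples.

    Hypothetical helper: loads both matrices into an in-memory SQLite
    database and lets a join plus GROUP BY compute the product.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (row_num INTEGER, col_num INTEGER, value REAL)")
    conn.execute("CREATE TABLE b (row_num INTEGER, col_num INTEGER, value REAL)")
    conn.executemany("INSERT INTO a VALUES (?, ?, ?)", a_entries)
    conn.executemany("INSERT INTO b VALUES (?, ?, ?)", b_entries)

    # Join on the shared inner dimension (columns of A = rows of B),
    # then sum the pairwise products for each output cell.
    rows = conn.execute(
        """
        SELECT a.row_num, b.col_num, SUM(a.value * b.value) AS value
        FROM a JOIN b ON a.col_num = b.row_num
        GROUP BY a.row_num, b.col_num
        ORDER BY a.row_num, b.col_num
        """
    ).fetchall()
    conn.close()
    return rows


# A = [[1, 0], [0, 2]] and B = [[0, 3], [4, 0]], nonzero entries only.
A = [(0, 0, 1.0), (1, 1, 2.0)]
B = [(0, 1, 3.0), (1, 0, 4.0)]
product = sparse_matmul(A, B)
print(product)
```

Because zero entries are never stored, the join only touches pairs that can contribute to the result, which is what makes the relational formulation a natural fit for sparse data.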
Aside: Schema-on-Write vs. Schema-on-Read
A schema* is a shared consensus about some universe of discourse.
At the frontier of research, this shared consensus does not exist, by definition.
Any schema that does emerge will change frequently, by definition.
Data found in the wild will typically not conform to any schema, by definition.
But this doesn't mean we have to live with ad hoc scripts and files.
A good approach: Schema-later! Schemas are important, but not a prerequisite to processing.
* ontology/metadata standard/controlled vocabulary/etc.
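One way to picture "schema-later" is a reader that keeps the raw records untouched and applies a schema only when the data is read. The sketch below is an illustration under stated assumptions, not a method from the slides: the sample records, the `SCHEMA` mapping, and the `read_with_schema` function are all hypothetical.

```python
import json

# Raw records as found "in the wild": inconsistent keys and types.
RAW_RECORDS = [
    '{"station": "A1", "temp_c": "12.5"}',
    '{"station": "B2", "temperature": 14}',
    '{"station": "C3"}',
]

# Hypothetical schema written after the data was collected:
# output field -> (accepted raw keys, converter).
SCHEMA = {
    "station": (("station",), str),
    "temp_c": (("temp_c", "temperature"), float),
}


def read_with_schema(raw_lines, schema):
    """Apply the schema at read time; the raw lines are never modified."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (aliases, convert) in schema.items():
            # Take the first alias present in the record, if any.
            raw_value = next((record[k] for k in aliases if k in record), None)
            row[field] = convert(raw_value) if raw_value is not None else None
        yield row


rows = list(read_with_schema(RAW_RECORDS, SCHEMA))
print(rows)
```

When the schema changes, only `SCHEMA` is edited and the same raw records are re-read; nothing has to be reloaded or rewritten up front.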