004 This Course 2

This document discusses challenges in data analytics. It quotes Aaron Kimball's claims that 80% of analytics is sums and averages and that preparing to run a model (gathering, cleaning, transforming, and similar tasks) is 80% of the work, and it asks what makes up "the other 80% of the work". It cites Doug Laney on incompatible data formats, non-aligned data structures, and inconsistent data semantics as the greatest barrier to effective data management. It reports that the mode answer to "how much time do you spend handling data as opposed to doing science?" is 90%. Finally, it contrasts schema-on-write with schema-on-read and recommends a "schema-later" approach: schemas are important, but not a prerequisite to processing.

Uploaded by

Teja Kamal
Copyright
© © All Rights Reserved

4/28/13 Bill Howe, UW eScience 1

This Course
[Figure: course scope along four axes: tools vs. abstr., desk vs. cloud, structs vs. stats, hackers vs. analysts]
"80% of analytics is sums and averages"
-- Aaron Kimball, WibiData
Three types of tasks:
1) Preparing to run a model
   Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging
   "80% of the work" -- Aaron Kimball
2) Running the model
3) Interpreting the results
   "The other 80% of the work"?
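
To make the first task type concrete, here is a minimal sketch of a few of the preparation verbs (cleaning, filtering, merging, verifying) followed by a trivial "model". The data, the field names, and the helper functions `clean` and `prepare` are all hypothetical and are not taken from the slides.

```python
from statistics import mean

# Hypothetical raw input with inconsistent formatting (an assumption for
# illustration only).
readings = [
    {"site": " A ", "temp_c": "21.5"},
    {"site": "b", "temp_c": "n/a"},   # unparseable value, dropped below
    {"site": "B", "temp_c": "19.0"},
    {"site": "a", "temp_c": "22.5"},
]
sites = {"A": "north", "B": "south"}  # hypothetical lookup table to merge in


def clean(record):
    """Normalize the site key and parse the temperature; None if unusable."""
    site = record["site"].strip().upper()
    try:
        temp = float(record["temp_c"])
    except ValueError:
        return None
    return {"site": site, "temp_c": temp}


def prepare(raw, lookup):
    """Clean, filter, merge, and verify: the 'preparing to run a model' step."""
    cleaned = (clean(r) for r in raw)
    kept = [r for r in cleaned if r is not None]                  # filtering
    merged = [{**r, "region": lookup[r["site"]]} for r in kept]   # merging
    assert all(r["region"] for r in merged)                       # verifying
    return merged


prepared = prepare(readings, sites)

# "Running the model": here nothing more than an average per region.
by_region = {}
for row in prepared:
    by_region.setdefault(row["region"], []).append(row["temp_c"])
averages = {region: mean(values) for region, values in by_region.items()}
print(averages)
```

Even in this toy case most of the code is preparation; the model itself is two lines.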
"no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics."
-- Doug Laney, "3-D Data Management: Controlling Data Volume, Velocity and Variety", Gartner, 2001
Problem
How much time do you spend handling data as opposed to doing science?
Mode answer: 90%
src: Christian Grant, MADSkills
src: Christian Grant, MADSkills
(Sparse) Matrix Multiply in SQL
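
The slide's figure did not survive extraction, so the following is only a sketch of the usual way to express a sparse matrix product in SQL, not necessarily the query shown on the slide. It assumes each matrix is stored as (row, column, value) triples with zero entries omitted; the table names `A` and `B` and the column names are hypothetical. SQLite is used through Python's standard `sqlite3` module so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (row_num INTEGER, col_num INTEGER, val REAL);
    CREATE TABLE B (row_num INTEGER, col_num INTEGER, val REAL);
""")

# Only nonzero entries are stored.
# A = [[1, 2],      B = [[4, 0],
#      [0, 3]]           [5, 6]]
a_entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
b_entries = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
conn.executemany("INSERT INTO A VALUES (?, ?, ?)", a_entries)
conn.executemany("INSERT INTO B VALUES (?, ?, ?)", b_entries)

# C[i][j] = sum over k of A[i][k] * B[k][j].
# The join pairs entries that share the inner index k; the GROUP BY
# sums the products for each output cell (i, j).
query = """
    SELECT A.row_num, B.col_num, SUM(A.val * B.val) AS val
    FROM A JOIN B ON A.col_num = B.row_num
    GROUP BY A.row_num, B.col_num
    ORDER BY A.row_num, B.col_num
"""
product = conn.execute(query).fetchall()
for row_num, col_num, val in product:
    print(row_num, col_num, val)
conn.close()
```

Because zero entries never appear in either table, the join touches only pairs of nonzero entries, which is what makes this formulation suited to sparse matrices.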
Aside: Schema-on-Write vs. Schema-on-Read
A schema* is a shared consensus about some universe of discourse.
At the frontier of research, this shared consensus does not exist, by definition.
Any schema that does emerge will change frequently, by definition.
Data found in the wild will typically not conform to any schema, by definition.
But this doesn't mean we have to live with ad hoc scripts and files.
A good approach: Schema-later! Schemas are important, but not a prerequisite to processing.
* ontology/metadata standard/controlled vocabulary/etc.
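
One possible reading of "schema-later" is sketched below: load the data as found, start processing immediately, and derive a description of the observed fields afterwards. The records, the field names, and the helper `observed_schema` are hypothetical; the slides do not prescribe a specific mechanism.

```python
import json
from collections import defaultdict

# Hypothetical "data found in the wild": records that do not agree on
# fields or types (an assumption for illustration only).
raw_lines = [
    '{"id": 1, "species": "salmon", "length_cm": 71.5}',
    '{"id": 2, "species": "trout", "depth_m": 12}',
    '{"id": "3", "species": "salmon", "length_cm": "68"}',
]

# Step 1: load without declaring a schema up front.
records = [json.loads(line) for line in raw_lines]

# Step 2: process right away, tolerating missing or oddly typed fields.
lengths = [float(r["length_cm"]) for r in records if "length_cm" in r]


# Step 3: later, summarize which fields and types were actually observed.
# This summary can seed a real schema once consensus emerges.
def observed_schema(rows):
    fields = defaultdict(set)
    for row in rows:
        for key, value in row.items():
            fields[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in fields.items()}


schema = observed_schema(records)
print(lengths)
print(schema)
```

A schema-on-write system would have rejected the second and third records at load time; here they are kept, and the type conflicts show up in the derived summary instead.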