Ds Intro KK
Science
Dr. Kalpakis, Fall 2017
What is Data Science?
• Data scientist: “The Sexiest Job of the 21st Century” (Davenport and Patil, Harvard Business Review, 2012)
• Much of the data science explosion is coming from the tech world
• What does Data Science mean?
• Is it the science of Big Data?
• What is Big Data anyway?
• Who does Data Science and where?
• What existed before Data Science came along?
• Is it simply a rebranding of statistics and machine learning?
• “Anything that has to call itself a science isn’t.”
• Hype increases the noise-to-signal ratio in how we perceive reality and makes it harder to focus on the gems
• Why and how to hire a data scientist? http://goo.gl/F4K4hE
Why now?
• massive amounts of data about many aspects of our lives, covering both online and offline activities, real-time as well as past
• Datafication = “taking all aspects of life and turning them into data”
• “Once we datafy things, we can transform their purpose and turn the information into new
forms of value.”
• abundance of inexpensive computing power and communication capacity
• proliferation of small footprint low-power sensors (IoT)
• feedback loop between our behavior, environment, and data products
Data Science take I
“Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and
espresso-inspired statistics.
And data science is not merely statistics, because when statisticians finish
theorizing the perfect model, few could read a tab-delimited file into R if their
job depended on it.
Data science is the civil engineering of data. Its acolytes possess a practical
knowledge of tools and materials, coupled with a theoretical understanding of
what’s possible.”
Mike Driscoll (CEO of Metamarkets)
Figure: Drew Conway’s Venn diagram of data science
• Many posers: “It’s not enough to just know how to run a black box algorithm. You actually need to know how and why it works, so that when it doesn’t work, you can adjust.” (Cathy O’Neil)
Data Science team
• Individual data scientist profiles are merged to make a data science team
Data science: skills and actors
Clustering and visualization of data science subfields based on a survey of data science practitioners (Analyzing the Analyzers by Harlan Harris, Sean Murphy, and Marck Vaisman, 2012)
• Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have PhDs, and their creative applications of mathematical tools yield valuable insights and products.
Types of Data Scientists
• Machine Learning Scientist
• Statistician
• Software Programming Analyst
• Data Engineer
• Actuarial Scientist
• Business Analytic Practitioner
• Quality Analyst
• Spatial Data Scientist
• Mathematician
• Digital Analytic Consultant
What do data scientists do?
• “define what data science is by what data scientists get paid to do” (O’Neil and Schutt)
• In academia, a data scientist is trained in some discipline, works with large amounts of data,
grapples with computational problems posed by the structure, size, messiness, and the
complexity and nature of the data, and solves real-world problems.
Data Science Process
Figure: Data science process flowchart (O’Neil and Schutt)
Figure: CRISP-DM (Cross Industry Standard Process for Data Mining)
CRISP-DM Phases, tasks, outputs
Each phase lists its tasks; the outputs of each task follow the colon.

Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report

Data Preparation
• Outputs of the phase as a whole: Data Set; Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data

Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision

Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring & Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
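
To make the Data Preparation tasks (select, clean, construct, integrate, format) concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not part of CRISP-DM itself: the two tab-delimited tables, their column names, the `is_senior` derived attribute, and the helper names `read_tsv` and `prepare` are all hypothetical.

```python
import csv
import io

# Hypothetical tab-delimited inputs; in practice these would be files on disk.
CUSTOMERS_TSV = "customer_id\tname\tage\n1\tAna\t34\n2\tBo\t\n3\tCy\t51\n"
ORDERS_TSV = "customer_id\tamount\n1\t20.0\n1\t5.5\n3\t12.0\n"


def read_tsv(text):
    """Parse tab-delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def prepare(customers_text, orders_text):
    """Run a toy version of the Data Preparation tasks on two small tables."""
    customers = read_tsv(customers_text)
    orders = read_tsv(orders_text)

    # Select Data / Clean Data: exclude customers whose age is missing.
    cleaned = [row for row in customers if row["age"].strip()]

    # Integrate Data: sum the order amounts per customer so they can be merged in.
    totals = {}
    for order in orders:
        cid = order["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + float(order["amount"])

    prepared = []
    for row in cleaned:
        age = int(row["age"])
        prepared.append({
            # Format Data: convert string fields to proper numeric types.
            "customer_id": int(row["customer_id"]),
            "name": row["name"],
            "age": age,
            # Construct Data: a derived attribute (the threshold is an assumption).
            "is_senior": age >= 50,
            # Merged attribute coming from the orders table.
            "total_spent": totals.get(row["customer_id"], 0.0),
        })

    # Format Data: put records in a fixed order for the modeling step.
    prepared.sort(key=lambda record: record["customer_id"])
    return prepared


if __name__ == "__main__":
    for record in prepare(CUSTOMERS_TSV, ORDERS_TSV):
        print(record)
```

In a real project each step would also produce the corresponding CRISP-DM output, for example a short data cleaning report stating which rows were excluded and why.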