Ds Intro KK

Data science involves extracting insights from large amounts of structured and unstructured data using scientific methods. It is an interdisciplinary field that utilizes tools and techniques from statistics, machine learning, and software engineering. Data scientists clean, analyze, visualize, and model data to discover patterns and build algorithms in order to solve real-world problems. The CRISP-DM process is commonly used, involving business understanding, data understanding, data preparation, modeling, evaluation, and deployment phases to iteratively extract knowledge from data.

Uploaded by

zaheer zubair

Introduction to Data Science
Dr. Kalpakis, Fall 2017

What is Data Science?
• Data scientists, "The Sexiest Job of the 21st Century" (Davenport and Patil, Harvard Business
Review, 2012)
• Much of the data science explosion is coming from the tech-world
• What does Data Science mean?
• Is it the science of Big Data?
• What is Big Data anyway?
• Who does Data Science and where?
• What existed before Data Science came along?
• Is it simply a rebranding of statistics and machine learning?
• “Anything that has to call itself a science isn’t.”
• Hype increases noise-to-signal ratio in perceiving reality and makes it harder to focus on the
gems
• Why and how to hire a data scientist? http://goo.gl/F4K4hE

Why now?
• massive amounts of data about many aspects of our lives, covering online and offline activities, both real-time and historical
• Datafication = “taking all aspects of life and turning them into data”
• “Once we datafy things, we can transform their purpose and turn the information into new
forms of value.”
• abundance of inexpensive computing power and communication capacity
• proliferation of small footprint low-power sensors (IoT)
• feedback loop between our behavior, environment, and data products

Data Science take I
“Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.

But data science is not merely hacking, because when hackers finish debugging their Bash one-liners and Pig scripts, few of them care about non-Euclidean distance metrics.

And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a tab-delimited file into R if their job depended on it.

Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what’s possible.”
Mike Driscoll (CEO of Metamarkets)

[Figure: Drew Conway’s Venn diagram of data science]

Many posers: “It’s not enough to just know how to run a black box algorithm. You actually need to know how and why it works, so that when it doesn’t work, you can adjust.” (Cathy O’Neil)

Data Science team
• individual data scientist profiles are merged to make a data science team
• team profile should align with the profile of the data problems to tackle
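
One possible reading of "merging" profiles can be sketched in Python. Everything here is an assumption made for illustration: the skill names, the 0 to 5 scale, the member names, and the choice to take the per-skill maximum as the team's level are hypothetical and do not come from the slides.

```python
def merge_profiles(members):
    """Combine individual skill profiles into one team profile.

    Assumption: the team's level in a skill is the best level
    held by any single member (per-skill maximum).
    """
    team = {}
    for profile in members.values():
        for skill, level in profile.items():
            team[skill] = max(team.get(skill, 0), level)
    return team


def skill_gaps(team, problem):
    """Return the skills where the team falls short of what the problem needs."""
    return {
        skill: needed - team.get(skill, 0)
        for skill, needed in problem.items()
        if team.get(skill, 0) < needed
    }


# Hypothetical individual profiles on a 0-5 scale.
members = {
    "ana": {"statistics": 5, "machine learning": 3, "communication": 2},
    "ben": {"software engineering": 5, "machine learning": 4, "statistics": 2},
}

# Hypothetical profile of the data problem to tackle.
problem = {"statistics": 4, "machine learning": 4, "data visualization": 3}

team = merge_profiles(members)
gaps = skill_gaps(team, problem)
print("team profile:", team)
print("gaps to fill:", gaps)
```

A non-empty gap dictionary suggests the team profile does not yet align with the problem profile, for example because nobody on this hypothetical team covers data visualization.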

Data science: skills and actors
Clustering and visualization of data science subfields based on a survey of data science practitioners (Analyzing the Analyzers by Harlan Harris, Sean Murphy, and Marck Vaisman, 2012)

• Data Businesspeople are the product and profit-focused data scientists. They’re leaders, managers, and entrepreneurs, but with a technical bent. A common educational path is an engineering degree paired with an MBA.

• Data Creatives are eclectic jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open source technologies.

• Data Developers are focused on writing software to do analytic, statistical, and machine learning tasks, often in production environments. They often have computer science degrees, and often work with so-called “big data”.

• Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have PhDs, and their creative applications of mathematical tools yield valuable insights and products.

Types of Data Scientists
• Machine Learning Scientist
• Statistician
• Software Programming Analyst
• Data Engineer
• Actuarial Scientist
• Business Analytic Practitioner
• Quality Analyst
• Spatial Data Scientist
• Mathematician
• Digital Analytic Consultant

What do data scientists do?
• “define what data science is by what data scientists get paid to do” (O’Neil and Schutt)

• In academia, a data scientist is trained in some discipline, works with large amounts of data, grapples with computational problems posed by the structure, size, messiness, complexity, and nature of the data, and solves real-world problems.

• In industry, a data scientist
• knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human.
• spends a lot of effort collecting, cleaning, and munging data, utilizing statistics and software engineering skills.
• performs exploratory data analysis, finds patterns, and builds models and algorithms.
• communicates the findings in clear language and with data visualizations, so that even colleagues unfamiliar with the data can understand the implications.

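The industry workflow above (clean, explore, model, communicate) can be sketched at toy scale in Python using only the standard library. The data set, the column names hours and score, and the choice of a straight-line model are all hypothetical, picked only to keep the example small; real projects would involve far messier data and more careful modeling.

```python
import csv
import io
import statistics

# Hypothetical tab-delimited data with typical messiness:
# stray whitespace, a missing value, and a non-numeric entry.
RAW = "hours\tscore\n1\t52\n2\t 55\n3\t\n4\t61\nfive\t70\n5\t64\n"


def clean(raw_text):
    """Parse tab-delimited text, keeping only rows where both fields are numeric."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text), delimiter="\t"):
        try:
            rows.append((float(record["hours"]), float(record["score"])))
        except (TypeError, ValueError):
            continue  # drop rows that cannot be parsed as numbers
    return rows


def explore(rows):
    """Compute a few summary statistics as a first look at the data."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    return {
        "n": len(rows),
        "mean_hours": statistics.mean(xs),
        "mean_score": statistics.mean(ys),
    }


def fit_line(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x = statistics.mean(x for x, _ in rows)
    mean_y = statistics.mean(y for _, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rows = clean(RAW)
summary = explore(rows)
slope, intercept = fit_line(rows)

# Communicate the finding in plain language.
print(f"Kept {summary['n']} usable rows.")
print(f"Each extra hour is associated with about {slope:.1f} more points.")
```

Note that the cleaning step silently drops two of the six rows here; in practice that decision would itself need to be reported along with the findings.
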
Data Science take II
• “Data science, also known as data-driven science, is an interdisciplinary field about scientific
methods, processes, and systems to extract knowledge or insights from data in various forms,
either structured or unstructured, similar to data mining.” (Wikipedia)
• The 4th paradigm of science (theoretical, empirical, computational, and data-driven) (Jim Gray)

Data Science Process
[Figure: Data science process flowchart (O’Neil and Schutt)]
[Figure: CRISP-DM (Cross Industry Standard Process for Data Mining)]

CRISP-DM Phases, tasks, outputs
Each phase lists its tasks, with the outputs of each task after the colon.

Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report

Data Preparation
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data

Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision

Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring & Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
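
The phase and task structure above can be written down as a small Python data structure, together with a loop that illustrates the iterative nature of the process. This is only a sketch under stated assumptions: the `Phase` class, the `run_project` function, and the `model_approved` callback are hypothetical names, and the rule "return to Business Understanding whenever Evaluation rejects the model" is a simplification of how projects actually move between phases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    tasks: tuple


# Phases and tasks as listed on the slide, in order.
CRISP_DM = (
    Phase("Business Understanding", (
        "Determine Business Objectives", "Situation Assessment",
        "Determine Data Mining Goal", "Produce Project Plan")),
    Phase("Data Understanding", (
        "Collect Initial Data", "Describe Data",
        "Explore Data", "Verify Data Quality")),
    Phase("Data Preparation", (
        "Select Data", "Clean Data", "Construct Data",
        "Integrate Data", "Format Data")),
    Phase("Modeling", (
        "Select Modeling Technique", "Generate Test Design",
        "Build Model", "Assess Model")),
    Phase("Evaluation", (
        "Evaluate Results", "Review Process", "Determine Next Steps")),
    Phase("Deployment", (
        "Plan Deployment", "Plan Monitoring and Maintenance",
        "Produce Final Report", "Review Project")),
)


def run_project(model_approved, max_iterations=3):
    """Walk the phases in order, repeating until Evaluation approves the model.

    model_approved is a hypothetical callback that takes the iteration
    number and returns True when the model meets the business criteria.
    Returns a log of (iteration, phase name) pairs.
    """
    log = []
    for iteration in range(1, max_iterations + 1):
        # Everything up to and including Evaluation.
        for phase in CRISP_DM[:-1]:
            log.append((iteration, phase.name))
        if model_approved(iteration):
            log.append((iteration, CRISP_DM[-1].name))
            break
    return log


# Example: the model is rejected once, then approved on the second pass.
log = run_project(lambda iteration: iteration == 2)
for iteration, name in log:
    print(f"iteration {iteration}: {name}")
```

If the model is never approved within `max_iterations`, the log simply ends at Evaluation and Deployment is never reached.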
