
Intro to Data Science

WELCOME TO GA
GENERAL ASSEMBLY
Travis Huang (He/Him)
Data Science Part-Time Lead Instructor

● Technical Program Manager


● Stats Nerd
● Casual Gamer

[email protected]

https://www.linkedin.com/in/huangtravis/
2 | © 2018 General Assembly
What is General Assembly?
General Assembly is a pioneer in education and
career transformation, specializing in today's most
in-demand skills. We foster a flourishing community
of professionals pursuing careers they love.
What We Teach

● Coding
● UX & Design
● Data
● Marketing
● Business
● Career Development


Our Agreement

● Turn off or silence your devices.


● Be present — engage in active learning,
collaborate, and ask questions.
● Be curious.

You’ll receive digital copies of these slides after class has ended and you’ve filled out the survey.

6 | © 2021 General Assembly


Agenda

Defining Data Science

The Data Science Workflow

Crafting Good Questions

Supervised Learning

Decision Trees



Our Goals For Today

Define data science.

Identify the Data Science Workflow and explain the value it adds to solving
a business challenge.

Construct a good data science question.

Observe how decision trees are used in data science.



Big-Picture Goal

This workshop represents the first step toward improving your data science literacy.


Intro to Data Science

Defining Data Science

What Is Data Science, Anyway?



Discussion:
Data Science Careers

How would you define data science?



Real Cases:
Data Science on the Job

How They’re Using Data Science:

● Prioritizes listings in popular areas, making desirable Airbnbs easier for users to find.

● Basketball hoop rim sensors track real-time data to better predict court placement for successful shots.

● Optimizes package drop-off and delivery transport using machine learning and AI to predict delivery obstacles (e.g., weather, traffic).



The ability to take data — to be able to understand it,
to process it, to extract value from it, to visualize it, to
communicate it — that’s going to be a hugely
important skill in the next decades.
Hal Varian, chief economist at Google | UC Berkeley professor



Data Science is the Extraction of Knowledge From Data.



Real Cases:
Data Science on the Job

Consider these three products and services:

● How do they utilize data science?
● What kinds of data do you think they use?
● How might they leverage data science in other parts of their business?


Discussion:
Data Science Careers

What skills and competencies do you think are most important for data scientists?


Makeup of a Data Scientist

Tech: SAS, R, Python, Perl, Excel, SQL, Hadoop, JavaScript, IoT.

Soft Skills: Influencing, critical thinking, systems thinking, visual thinking, design.

Math Methods: Statistics techniques, quantitative and qualitative methods.

Domain Knowledge: Industry knowledge, workflows, data operations, analytics.


Intro to Data Science

The Data Science Workflow

Overcoming Challenges With Data Science

Going from answering...
“Let’s optimize our sales funnel to improve our conversion rates.”

To...
“Here are actionable recommendations drawn from data-driven insights.”


Why Does It Matter?

Think of the steps in the Data Science Workflow as problem-solving guidelines.


Steps in the Workflow
Iterative: repeat as needed!

● Frame: Develop hypothesis-driven questions for your analysis.
● Prepare: Select, import, and clean relevant data.
● Analyze: Structure, visualize, and complete your analysis.
● Interpret: Make business decisions based on data.
● Communicate: Present data-driven insights to your audience.


Frame: “What Is the Challenge?”



Intro to Data Science

A Closer Look at
the Data Science Workflow


Asking the right questions is what separates data
scientists that know ‘why’ from folks that only know
‘what’ (tools and technologies).
Kayode Ayankoya, MBA, PhD | clinical data scientist



Discussion:
Frame Problems With Good Questions

What makes a question “good”?



Asking Good Questions...

Establishes the basis for reproducibility.

Enables collaboration through clear goals.

Produces actionable recommendations and strategies for stakeholders.



Some Good Questions

● “Which ad distribution channels would yield the greatest volume at the lowest cost of acquisition?”

● “Which markets are most attractive in terms of profit potential?”

● “The past three quarters have seen a year-over-year decline of 5%. What are the top five changes in competitive dynamics?”


Some Not-So-Good Questions

● “What is the best way to attract more users?”

● “Which markets should we enter?”

● “What is causing the decline in sales?”


Discussion:
Spot the Differences

Good Questions:

● “Which ad distribution channels would yield the greatest volume at the lowest cost of acquisition?”
● “Which markets are most attractive in terms of profit potential?”
● “The past three quarters have seen a year-over-year decline of 5%. What are the top five changes in competitive dynamics?”

Not-So-Good Questions:

● “What is the best way to attract more users?”
● “Which markets should we enter?”
● “What is causing the decline in sales?”


Group Exercise:
Restructure the Question

Consider the wording of this question:

“What is going to happen with my stock?”

How could you rephrase the question to make it stronger?



Real Cases:
Data Science In Action: Survival Prediction

On April 15, 1912, the RMS Titanic sank after colliding with an iceberg.

The sinking resulted in 1,502 fatalities out of 2,224 passengers and crew members.

Some groups were more likely to survive than others, such as women, children, and members of the upper class.
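
As a rough aside, the two totals quoted on this slide imply a baseline survival rate. The sketch below only does that arithmetic; the variable names are made up for illustration and nothing here comes from a passenger-level data set.

```python
# Totals quoted on the slide.
total_people = 2224
fatalities = 1502

# Everyone who was not a fatality is counted as a survivor.
survivors = total_people - fatalities
survival_rate = 100 * survivors / total_people
fatality_rate = 100 * fatalities / total_people

print(survivors)                 # 722
print(round(survival_rate, 1))   # 32.5
print(round(fatality_rate, 1))   # 67.5
```

On these totals, a naive guess of "nobody survived" would already be right about two-thirds of the time, so a useful survival model should do better than that.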



Real Cases:
Data Science In Action: Survival Prediction (Cont.)

If we wanted to explore which groups of people were likely to survive, we could apply machine learning tools to predict which passengers survived the tragedy, examining the passenger attributes associated with higher survival rates.


Discussion:
Data Science In Action: Framing Survival Prediction

What sorts of questions would you ask to identify attributes of passengers with higher survival rates?


Prepare: “What’s Needed?”



Why Bother Cleaning and Preparing Data?



Cleaning and Preparing Data...

Ensures that data is defined and structured.

Helps to check and polish data formatting.

Preprocesses data into a format that’s interpretable by machine learning frameworks.

Examples of preprocessing for machine learning frameworks:

● Text (string data such as tweets or product reviews) into a form suited to natural language processing.
● Categorical data into binary dummies (1/0).
● Images into multi-dimensional NumPy arrays.
● Timestamps into datetime format.
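
Three of the conversions listed above can be sketched with the Python standard library alone. This is a minimal illustration, not how any particular framework does it: the helper names (`one_hot`, `parse_timestamp`, `tokenize`) and the sample values are hypothetical, and the image-to-array case is left out because it needs a third-party library.

```python
from datetime import datetime

def one_hot(values):
    """Turn a list of category labels into binary (1/0) dummy columns."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

def parse_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a timestamp string into a datetime object."""
    return datetime.strptime(text, fmt)

def tokenize(text):
    """Lowercase a string and split it into word tokens (deliberately naive)."""
    return [word.strip(".,!?") for word in text.lower().split()]

# Hypothetical raw values, invented for illustration.
ports = ["S", "C", "S", "Q"]
print(one_hot(ports)[0])                            # {'is_C': 0, 'is_Q': 0, 'is_S': 1}
print(parse_timestamp("1912-04-15 02:20:00").year)  # 1912
print(tokenize("Great product, would buy again!"))
```

In practice a data library would usually handle these steps, but the idea is the same: every column ends up as numbers or well-typed values before modeling.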
How Do You Prepare Data?

Often, we’re given secondary data, or data that was collected previously.

In these cases, we have to learn as much as possible about our data using tools like data dictionaries to determine how the set was gathered.


Real Cases:
Warby Parker

● Used an open-source project to generate its data dictionary.


● Needed all business units to agree on terms of the dictionary.
● Secured approval from the co-CEOs to implement a data dictionary
sign-off date.
● Top-down support proved to be valuable to its data teams.



Using Data Dictionaries

Data Dictionary: A list of key terms and metrics with definitions. Ensures that all stakeholders are on the same page with the meanings of all variables.

Here’s an example:

Variable    Description                       Variable Type
survival    Fate of passenger                 Binary
pclass      Ticket class                      Discrete
age         Age in years of passenger         Continuous
fare        Price of ticket (1912 dollars)    Continuous
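
One way to put a data dictionary to work is to store it as a plain Python dict and check incoming records against it. The sketch below is only an illustration: the variable names, descriptions, and types come from the table above, while the validation rules, the function names, and the sample records are assumptions.

```python
# Data dictionary from the slide, stored as a plain dict.
DATA_DICTIONARY = {
    "survival": {"description": "Fate of passenger", "type": "binary"},
    "pclass": {"description": "Ticket class", "type": "discrete"},
    "age": {"description": "Age in years of passenger", "type": "continuous"},
    "fare": {"description": "Price of ticket (1912 dollars)", "type": "continuous"},
}

def check_value(variable, value):
    """Return True if value is plausible for the variable's declared type.

    These rules are illustrative assumptions, not part of the slide.
    """
    kind = DATA_DICTIONARY[variable]["type"]
    if kind == "binary":
        return value in (0, 1)
    if kind == "discrete":
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "continuous":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return False

def validate(record):
    """List the variables in a record that are undefined or badly typed."""
    problems = []
    for variable, value in record.items():
        if variable not in DATA_DICTIONARY:
            problems.append(f"{variable}: not in data dictionary")
        elif not check_value(variable, value):
            problems.append(f"{variable}: unexpected value {value!r}")
    return problems

# Hypothetical passenger records, invented for illustration.
print(validate({"survival": 1, "pclass": 3, "age": 22.0, "fare": 7.25}))  # []
print(validate({"survival": "yes", "cabin": "C85"}))
```

The second call reports both problems: a value that does not fit the declared type and a variable nobody has defined yet.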



Variable Types

● Binary data is discrete data that can only be in one of two categories: either yes or no, 1 or 0, off or on, etc. It can be thought of as ordinal, nominal, count, or interval data. Example: survival.

● Discrete data can’t be measured, but it can be counted. Example: pclass.

● Continuous data represents measurements. Its values can’t be counted, but they can be measured. Examples: age and fare.


Discussion:
Data Science In Action: Framing

What sort of features would you want to see in this data set that are necessary for determining survival rates?


Analyze: “What Happened?”



Digging Deeper With Data Analysis

After you’ve collected the right data to answer your questions, it’s time to start data analysis.


Digging Deeper With Data Analysis (Cont.)

This step, the initial analysis of trends, correlations, variations, and outliers in your data, helps you to...

● Focus your data analysis on answering your initial questions in better ways.
● Address any objections others might have.


Common Stats

Data scientists often check the mean, standard deviation, or specific frequency counts of their data.
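
A minimal sketch of those three checks, using only Python's standard library. The five-passenger sample is invented for illustration and is not taken from any real data set.

```python
import statistics
from collections import Counter

# Hypothetical sample of five passengers, invented for illustration.
ages = [22.0, 38.0, 26.0, 35.0, 54.0]
pclass = [3, 1, 3, 1, 2]

mean_age = statistics.mean(ages)
std_age = statistics.stdev(ages)  # sample standard deviation

# Frequency of each ticket class as a percentage of all rows.
counts = Counter(pclass)
frequencies = {k: 100 * v / len(pclass) for k, v in sorted(counts.items())}

print(f"mean age: {mean_age:.2f}")   # mean age: 35.00
print(f"std dev:  {std_age:.2f}")    # std dev:  12.45
print(frequencies)                   # {1: 40.0, 2: 20.0, 3: 40.0}
```

A mean suits continuous variables such as age, while frequency counts suit discrete ones such as ticket class.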



Real Cases:
Data Science In Action: Survival Prediction Statistics

Revisiting our Titanic example from earlier…

The following are statistics we might compute for the survival variables:

Variable    Mean or Frequency (%)
survival    38.38%
pclass      1: 24.24%, 2: 20.65%, 3: 55.11%
age         29.70 years
fare        $32.20


Interpret: “Why and How Did This Happen?”



Interpreting Your Data

Now that you’ve analyzed your data, you can begin to interpret the results.


Discussion:

Keys to Interpreting Data

What factors should you keep in mind when interpreting data?


Questions to Form a Hypothesis

1. Does the data answer your original question?
2. Does the data help you defend against any objections?
3. Are there any limitations on your conclusions?

If your interpretation of the data holds up under all of these questions and considerations, then you have likely come to a productive conclusion.


Forming Conclusions

Now that you have a hypothesis, what are some things you should check?

Can you convert your findings into a conclusion or next step?


Communicate: “How Do We Share This?”



Show (and Explain) The Results

You’ve framed the problem, and you’ve prepared, analyzed, and interpreted the data to develop a solution.

Now, you need to distill that into something that can be clearly communicated to an audience.


Discussion:
Presenting Data Effectively

What are some key factors to consider when presenting data science findings and conclusions?


Capture Their Attention

Presentations are a critical part of your analysis.

The most basic form of a data science presentation should describe your results in the simplest and most engaging way for your audience.


A Good Story: The Key to Effective Data Presentations

Set the scene for your listeners, relating the problem to your audience's interests.

Focus on your hypothesis/solution. Help your audience see what you’re proposing.

Highlight your methodology. How did you come to your conclusion? Be concise —
present steps at a high level.

Feature contributions made and results. Highlight how your results made an impact.



Real Cases:
Static Presentation of Data

● The PollEverywhere team wanted to look for opportunities to improve employee benefits packages.
● The author of this data presentation highlighted the main takeaways for the audience, explaining the meaning of each axis.
● This data helped to clearly illustrate next steps for the company in crafting benefits packages.


Real Cases:
Interactive Presentation of Data

Data science presentations can also be far more complex and exciting, like some of the
research presented by Nate Silver's FiveThirtyEight blog.



Do Your
Thing



Let’s Recap

● The Data Science Workflow is used to iteratively develop solutions.

● Crafting good questions is key.

● Cleaning and preparing your data is crucial.

● Analyzing data helps answer outstanding questions.

● Interpreting data leads you to form a hypothesis/solution.

● Clearly communicating findings creates relevancy for your audience.


Intro to Data Science

Decision Trees

Imagine a flow chart where each level is a question
with a yes or no answer, eventually leading to a
solution to the original question.

That’s a decision tree.



What Are Decision Trees?

Decision trees are a machine learning model for regression and classification. They help break complex data science challenges into a series of simpler questions.
Back to the Workflow...
When are decision trees used?

● Frame: Develop hypothesis-driven questions for your analysis.
● Prepare: Select, import, and clean relevant data.
● Analyze: Structure, visualize, and complete your analysis.
● Interpret: Make business decisions based on data.
● Communicate: Present data-driven insights to your audience.

Where you’d use decision trees.


Real Cases:
Non-Data Science Decision Tree

● This tree models a set of sequential, hierarchical decisions that ultimately lead to some final result.

● Decisions remain “high level” to keep the tree small and achieve a higher level of accuracy.

Alone or with friends?
    Alone: Weather outside?
        Sunny: Video games
        Rainy: Video games
    Friends: Weather outside?
        Sunny: Soccer
        Rainy: Movie
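
The same tree can be written as ordinary if/else logic, which is one way to see that a decision tree is just a sequence of nested questions. The function name and the lowercase string values below are assumptions made for this sketch.

```python
def choose_activity(company, weather):
    """Walk the slide's tree: ask about company first, then about weather."""
    if weather not in ("sunny", "rainy"):
        raise ValueError(f"unknown weather: {weather!r}")
    if company == "alone":
        # Both weather branches lead to the same leaf on the slide.
        return "video games"
    if company == "friends":
        return "soccer" if weather == "sunny" else "movie"
    raise ValueError(f"unknown company: {company!r}")

print(choose_activity("alone", "sunny"))    # video games
print(choose_activity("friends", "sunny"))  # soccer
print(choose_activity("friends", "rainy"))  # movie
```

Note that the weather question adds nothing on the "alone" side, since both answers give the same result. A learned tree would normally not keep a split like that.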



Discussion:
Decision Tree Questions

Let’s say we’re using a data set consisting of animals with lots of different characteristics, and we want to classify them as mammals, birds, or fish.

What might be a good decision tree question to start predicting their classification?

Does the animal breathe air?
    Yes: (not yet classified)
    No: Fish


Discussion:
Decision Tree Questions (Cont.)

What’s a second question that could further determine their class?

Does the animal breathe air?
    No: Fish
    Yes: Does the animal lay eggs?
        Yes: Bird
        No: Mammal
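
A tree like this one can also be stored as data rather than as code, so that a single function can walk any tree. The sketch below uses nested dicts; the feature names (`breathes_air`, `lays_eggs`) and the dict layout are assumptions for illustration.

```python
# Internal nodes ask a yes/no question; leaves are plain strings.
ANIMAL_TREE = {
    "question": "breathes_air",
    "yes": {"question": "lays_eggs", "yes": "bird", "no": "mammal"},
    "no": "fish",
}

def classify(tree, animal):
    """Follow yes/no answers from the root until a leaf (a string) is reached."""
    node = tree
    while isinstance(node, dict):
        answer = "yes" if animal[node["question"]] else "no"
        node = node[answer]
    return node

# Hypothetical animals described by the two features above.
print(classify(ANIMAL_TREE, {"breathes_air": True, "lays_eggs": True}))    # bird
print(classify(ANIMAL_TREE, {"breathes_air": True, "lays_eggs": False}))   # mammal
print(classify(ANIMAL_TREE, {"breathes_air": False, "lays_eggs": True}))   # fish
```

Keeping the tree as data is what lets an algorithm build it automatically instead of a person writing the questions by hand.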



Decision Trees

In data science, the creation of decision tree rules is governed by an algorithm that learns which questions to ask by analyzing an entire data set.


The “knowledge” learned by a
decision tree is directly formulated
into a hierarchical structure, which
is determined by what yes/no rules
will predict the outcome variable.



Decision Trees (Cont.)

These yes/no rules appear as a tree with several branching paths, or splits.

❗ Adding too many splits makes decision trees overly complex and not adaptable to new data.

Decision Trees (Cont.)

● Root: The starting point of a decision tree.
● Nodes: The subsequent points.
● Branches: The splits resulting from nodes.
● Leaves: Nodes that do not split further.


Splitting a Decision Tree

Two metrics are commonly used to decide how to split a tree:

● Gini impurity: A measurement of how likely a randomly chosen instance in a node is to be classified incorrectly if it is labeled at random according to the class mix in that node.

● Entropy: The measure of impurity (or uncertainty) in a node’s class labels. This affects how a decision tree draws its boundaries.
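
Both metrics can be computed directly from the class labels in a node. The sketch below uses the standard textbook formulas (Gini = 1 - sum of squared class proportions; entropy = -sum of p * log2(p)). The label lists are invented for illustration, and `weighted_impurity` is a hypothetical helper name.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_impurity(left, right, measure=gini):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

# Hypothetical labels (1 = survived, 0 = did not), invented for illustration.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

print(gini(parent))                    # 0.46875
print(entropy([0, 1]))                 # 1.0
print(weighted_impurity(left, right))  # 0.1875
```

A split is attractive when the weighted impurity of the children is lower than the impurity of the parent, as it is here (0.1875 versus 0.46875).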



Group Exercise:
Knowledge Check

Let’s take a look at this example of a decision tree.

● Which is the node, which is a branch, and which are leaves?

● Why is each subsequent level of branches wider?


Cognitive
Load Break



Back to Titanic Survival Prediction

Let’s see how data sets and a decision tree model could be used to predict Titanic passenger survival rates.


Group Exercise:
Titanic Survival Prediction

In this example, our decision (leaves) will be survival (0 = no; 1 = yes).

Features will be the following conditions (nodes):


● sex: Sex (0 = female; 1 = male)
● pclass: Passenger class (1 = first; 2 = second; 3 = third)
● fare: Passenger fare (in 1912 dollars)
● age: Age (in years)



Group Exercise:
Titanic Survival Prediction (Cont.)

Each condition (node) represents a feature.

In this case, this would be either a category such as male or female or a range
of numbers (greater than or equal to age 10).

For variables that have more than two categories (cabin class, for example), you would make another branch off of a condition.*

*Within those that are NOT Class 3 and also NOT Class 2.
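
To make the idea of conditions concrete, here is a minimal sketch of how an algorithm might compare candidate conditions and pick one for the root. It scores each "feature >= threshold" condition by weighted Gini impurity and keeps the lowest. The six rows, the candidate list, and the function names are all invented for illustration; this is not real passenger data and not any library's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(rows, feature, threshold, target="survival"):
    """Weighted Gini impurity after splitting rows on feature >= threshold."""
    yes = [r[target] for r in rows if r[feature] >= threshold]
    no = [r[target] for r in rows if r[feature] < threshold]
    if not yes or not no:
        return None  # everything landed on one side, so the split is useless
    n = len(rows)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

def best_split(rows, candidates):
    """Return (feature, threshold, score) for the lowest weighted impurity."""
    scored = [(split_score(rows, f, t), f, t) for f, t in candidates]
    scored = [s for s in scored if s[0] is not None]
    score, feature, threshold = min(scored)
    return feature, threshold, score

# Hypothetical rows, invented for illustration (not real passenger data).
rows = [
    {"sex": 0, "pclass": 1, "age": 29, "survival": 1},
    {"sex": 0, "pclass": 3, "age": 4, "survival": 1},
    {"sex": 0, "pclass": 2, "age": 35, "survival": 1},
    {"sex": 1, "pclass": 1, "age": 40, "survival": 0},
    {"sex": 1, "pclass": 3, "age": 22, "survival": 0},
    {"sex": 1, "pclass": 3, "age": 8, "survival": 1},
]
candidates = [("sex", 1), ("pclass", 2), ("pclass", 3), ("age", 10)]

feature, threshold, score = best_split(rows, candidates)
print(feature, threshold, round(score, 4))  # sex 1 0.2222
```

On this toy sample the condition on sex separates survivors from non-survivors most cleanly, so it would become the root. Repeating the same search inside each branch grows the rest of the tree.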



Group Exercise:
Titanic Passenger Survival Prediction

They’re most likely going to survive.


They probably won’t.



Group Exercise:
Titanic Passenger Survival Prediction (Cont.)

1. Given that the root node is sex (male = 1), why would you think that this is the best way to predict if someone died when the Titanic sank?
2. What is the probability of death, given you are a male in second or third class?
3. What is the survival rate of a female in first or second class who paid more than $32?
4. If you were a 7-year-old boy in third class, would you be more likely to survive than a 7-year-old boy in first class? What's the difference in your chances of survival?


Decision Trees Visually Explained



Today We’ve...

● Defined data science.


● Outlined the Data Science Workflow.
● Defined supervised and unsupervised learning.
● Explored decision trees.
● Examined the differences between regression and classification.



AMA: Ask Me Anything!



What’s Next?

Let us know what you liked about this class and what we can improve.

Complete a quick survey at: ga.co/introclass

This survey is mobile- and laptop-friendly.


Want to Learn More?
Career-Changing Courses: bit.ly/fulltimeclasses

10–12-week immersive courses developed to help you make a career pivot.

Skill-Building Courses: bit.ly/parttimeclasses

8–10-week part-time or 1-week accelerated courses developed to help you advance your career.

Short-Form Workshops: bit.ly/galaworkshops

Learn a skill in as little as two hours, or tackle something in more depth for 1–2 days.



Thank You!
