Introduction
(Adapted from material developed by Profs. Milind Kulkarni, Stanley Chan, Chris
Brinton, David Inouye, and Qiang Qiu)
what is data?
lots of different definitions
humans have used data forever
why do we use data?
what has changed?
a parable of Purdue professors
• Prof. Philip E. Paré (ECE): develops models and algorithms for predicting and mitigating viral spread in networks using data
• Prof. Mahsa Ghasemi (ECE): studies efficient and reliable use of data in sequential decision-making problems
[Diagram labels: collecting/organizing data, analyzing data, interpreting data]
data science is a lot of things
• making predictions from data
• identifying patterns in data
• collecting/organizing data
• analyzing data
• interpreting data
what industries has it impacted?
• Hard to think of one that is not being impacted
by data science!
• Medicine: Analytics from wearable trackers,
studying disease patterns, …
• Retail: Analyzing consumer behavior, predicting
customer satisfaction, …
• Transportation: Assisted/autonomous
navigation, predicting equipment failures, …
• Education: Tracking student engagement,
personalizing learning content, …
what about Python?
• General purpose programming language,
first appeared in the 90s
• Easily recognized by its use of whitespace
indentation, rather than { } braces, to delimit
code blocks and enhance readability
• Becoming the industry standard for data
science (displacing R?)
• Many useful, open-source libraries: numpy,
pandas, matplotlib, pytorch
• Standard control-flow constructs (e.g., loops)
familiar from lower-level languages help
structure programs
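As a small illustration of the points above, the sketch below shows how indentation marks the body of a function, a loop, and a conditional. The GPA values and the function name `count_at_least` are made up for this example.

```python
# Hypothetical GPA values, made up for illustration.
gpas = [3.9, 3.2, 3.6, 2.8, 3.7]

def count_at_least(values, threshold):
    """Count how many values are greater than or equal to threshold."""
    count = 0
    for value in values:          # the loop body is marked by indentation
        if value >= threshold:    # a nested block is indented one more level
            count += 1
    return count

high = count_at_least(gpas, 3.5)
print(f"{high} of {len(gpas)} GPAs are 3.5 or higher")
```

No braces or `end` keywords are needed: the block ends where the indentation returns to the previous level.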
landscape
• This is an introductory programming and
statistics course that emphasizes data
science problems with some math
• Other data science courses in ECE, e.g.,
• ECE 30010 - Introduction to Machine
Learning and Pattern Recognition
• ECE 47300 - Introduction to Artificial
Intelligence
• ECE 57000 - Artificial Intelligence
• ECE 59500 - Machine Learning I
• But data science is a Purdue-wide initiative!
syllabus break!
some data analysis examples
data analysis in “practice”
• Let’s say we have a data set of applicants to Purdue
descriptive statistics
• Which students come from which states?
• What is the distribution of GPAs? SAT scores?
• Can build histograms, e.g., for the GPAs
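A GPA histogram can be sketched with the standard library alone. The data below is synthetic (drawn from a normal distribution and clipped to the range 2.0 to 4.0), and the bin width of 0.25 is an arbitrary choice for illustration, not a property of any real applicant pool.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic GPAs (an assumption, not real applicant data), clipped to [2.0, 4.0].
gpas = [min(4.0, max(2.0, random.gauss(3.6, 0.3))) for _ in range(200)]

def histogram(values, low=2.0, width=0.25, n_bins=8):
    """Count how many values fall in each equal-width bin starting at low."""
    counts = Counter()
    for v in values:
        # A value on the top edge (4.0) is placed in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return [counts[i] for i in range(n_bins)]

bin_counts = histogram(gpas)
for i, c in enumerate(bin_counts):
    left = 2.0 + 0.25 * i
    print(f"{left:.2f}-{left + 0.25:.2f} | {'#' * c}")
```

In practice a plotting library would usually draw this; for example, matplotlib's `plt.hist(gpas, bins=8)` produces a comparable chart.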
reasoning about data
• How do Purdue applicants compare to the national average?
• Mean GPA of applicants: 3.6
• Is this high or low?
• Can sample the GPAs of high school students nationwide
• Suppose we collect 1000 GPAs and find a mean of 3.4
• Does this mean Purdue applicants have a higher GPA on average?
• Need more information! In particular …
• Was the sampling method we used unbiased?
• What is the variance of the sample collected (i.e., the spread of GPAs)?
• What confidence interval can be built for the population mean (i.e., what is the likely range of
the true mean GPA)?
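The variance and confidence-interval questions above can be made concrete with a short sketch. The 1000 GPAs here are synthetic stand-ins (normal with mean 3.4 and standard deviation 0.5, clipped to 0.0 to 4.0), and the interval uses the normal approximation with z = 1.96, which is an assumption that is reasonable for a sample this large.

```python
import math
import random
import statistics

random.seed(1)

# Synthetic stand-in for 1000 sampled high school GPAs (not real data).
sample = [min(4.0, max(0.0, random.gauss(3.4, 0.5))) for _ in range(1000)]

n = len(sample)
mean = statistics.mean(sample)
variance = statistics.variance(sample)   # sample variance (divides by n - 1)
std_error = math.sqrt(variance / n)      # standard error of the mean

# 95% interval via the normal approximation (z = 1.96).
z = 1.96
ci_low, ci_high = mean - z * std_error, mean + z * std_error

print(f"sample mean = {mean:.3f}, sample variance = {variance:.3f}")
print(f"95% CI for the population mean: ({ci_low:.3f}, {ci_high:.3f})")
print("3.6 lies above the interval:", 3.6 > ci_high)
```

The interval only quantifies sampling noise. It says nothing about whether the sampling method was biased, and the applicants' mean of 3.6 carries its own uncertainty.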
making predictions
• Can we predict how successful a particular applicant
might be at Purdue?
• How do we define success? GPA?
• Idea: Look at the application statistics of the current
seniors and see if there is a relationship between these
statistics and their current GPA
• One way to find a relationship is using linear regression
• Might tell you something like: “a Purdue student’s
GPA can be predicted mostly by their high school
GPA, with their SAT score having a lighter influence”
• Many other prediction algorithms exist too
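A minimal sketch of linear regression, simplified to a single predictor (high school GPA) so that the closed-form least-squares solution fits in a few lines. The data is synthetic, and the generating rule (college GPA = 0.5 + 0.8 × high school GPA + noise) is invented for illustration. Adding SAT score as a second predictor would require multiple regression, which this sketch does not cover.

```python
import random

random.seed(2)

# Synthetic (high school GPA, college GPA) pairs. The generating rule
# college = 0.5 + 0.8 * hs + noise is invented purely for illustration.
hs_gpa = [random.uniform(2.5, 4.0) for _ in range(300)]
college_gpa = [0.5 + 0.8 * x + random.gauss(0, 0.2) for x in hs_gpa]

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(hs_gpa, college_gpa)
predicted = intercept + slope * 3.6  # prediction for a hypothetical applicant
print(f"college GPA ~ {intercept:.2f} + {slope:.2f} * high school GPA")
print(f"predicted GPA for a 3.6 applicant: {predicted:.2f}")
```

The fitted slope should land near the 0.8 used to generate the data, which is how one can sanity-check a regression routine before applying it to real records.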
classification
• Can we make admissions decisions quicker through
automation?
• Idea: Compare each applicant’s statistics to past applicants
that were admitted, and to those that were rejected
• Train a classifier on these past applicants so that it
predicts, as accurately as possible, whether a student would be
admitted or rejected
• For example, a k-nearest neighbor classifier would assess
whether a given applicant is more similar to the pool of
admitted applicants or to the rejected applicants
• Why might we run into trouble here?
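The k-nearest neighbor idea can be sketched as follows. The past applicants, the two features (GPA and SAT), the scaling by 4.0 and 1600, and k = 3 are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical past applicants: (high school GPA, SAT score, decision).
# All values are made up for illustration.
past = [
    (3.9, 1500, "admitted"), (3.8, 1420, "admitted"),
    (3.7, 1380, "admitted"), (3.6, 1450, "admitted"),
    (3.0, 1100, "rejected"), (2.8, 1200, "rejected"),
    (3.1, 1050, "rejected"), (2.9, 1150, "rejected"),
]

def scale(gpa, sat):
    """Put both features on a roughly 0-1 scale so SAT does not dominate."""
    return (gpa / 4.0, sat / 1600.0)

def knn_predict(gpa, sat, k=3):
    """Majority vote among the k past applicants closest to (gpa, sat)."""
    query = scale(gpa, sat)
    by_distance = sorted(
        past, key=lambda row: math.dist(query, scale(row[0], row[1]))
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict(3.75, 1400))
print(knn_predict(2.95, 1120))
```

One possible source of trouble is visible even in this toy version: the classifier can only imitate past decisions, so any bias in those decisions is carried forward.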
clustering
• What if we want to identify groups of students
beyond “admitted” vs. “rejected”?
• Idea: See if students cluster together according
to some measure of distance
• Some students look more like “nearby”
students than students that are “far away”
• Important question: What features of students
should be considered for the clustering?
• E.g., maybe don’t consider something like hair
color!
• With k-means clustering, k groups of students
would be extracted based on “closeness”
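A minimal sketch of k-means (Lloyd's algorithm), assuming Euclidean distance as the measure of "closeness". The student feature vectors are hypothetical, already scaled to a 0-1 range, and deliberately form two well-separated groups.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # start from k of the data points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        for i, group in enumerate(groups):
            if group:                    # keep the old center if a group is empty
                centers[i] = tuple(sum(c) / len(group) for c in zip(*group))
    return centers, groups               # groups come from the final assignment step

# Hypothetical students as (scaled GPA, scaled SAT) pairs, made up for illustration.
students = [(0.95, 0.90), (0.92, 0.88), (0.97, 0.93), (0.90, 0.91),
            (0.72, 0.65), (0.70, 0.68), (0.75, 0.66), (0.74, 0.70)]

centers, groups = k_means(students, k=2)
for center, group in zip(centers, groups):
    print(f"center {center[0]:.3f}, {center[1]:.3f}: {len(group)} students")
```

The result depends on which features go into each tuple, which is why the choice of features matters: adding an irrelevant feature changes the distances and therefore the groups.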