Introduction


ECE 20875

Python for Data Science


Qiang Qiu, Yi Ding and Aristides Carrillo

(Adapted from material developed by Profs. Milind Kulkarni, Stanley Chan, Chris
Brinton, David Inouye, and Qiang Qiu)
what is data?
lots of different definitions

3
humans have used data forever

The oldest known mathematical artifacts


(tally stick or lunar calendar?)

4
why do we use data?

• Analyzing data helps us make decisions and take actions

5
what has changed?

• There’s a lot more data


• Machines can also collect (and in turn
use) it
• And we’re trying to do more with it

• Google processes 3.5 billion search queries per day.


• Instagram users post 54,000 photos each minute.
• Twitter users post 3,000 tweets every second.

6
a parable of Purdue professors
• Prof. Philip E. Paré (ECE) develops models and algorithms for predicting and mitigating viral spread in networks using data
• Prof. Mahsa Ghasemi (ECE) studies efficient and reliable use of data in sequential decision-making problems
• Prof. Qiang Qiu (ECE) studies computer vision and machine learning
• Prof. Murat Kocaoglu (ECE) develops algorithms for learning causal structures to derive actionable insights from data
• Jennifer Neville (CS) builds new machine learning tools to study graphs and networks
• Prof. Chris Brinton (ECE) develops algorithms for optimizing social and communication networks from data
• Prof. Milind Kulkarni (ECE) builds systems to make data analyses run faster
7
what is data science?
• Collecting data from a wide variety of sources and putting
them into a consistent format?
• Making observations about patterns in data?
• Visualizing trends in data?
• Identifying similarities between data points?
• Making predictions about what will happen in the future?
• Prescribing courses of action to take based on forecasts?
• Developing new machine learning and data mining
algorithms?
• Accelerating analysis algorithms?
8
data science is a lot of things
• making predictions from data
• identifying patterns in data
• building systems for data analysis
• dealing with privacy concerns
• visualizing data
• analyzing data
• collecting/organizing data
• interpreting data
• ethics
• writing data analyses

9
what industries has it impacted?
• Hard to think of one that is not being impacted
by data science!
• Medicine: Analytics from wearable trackers,
studying disease patterns, …
• Retail: Analyzing consumer behavior, predicting
customer satisfaction, …
• Transportation: Assisted/autonomous
navigation, predicting equipment failures, …
• Education: Tracking student engagement,
personalizing learning content, …
11
what about Python?
• General purpose programming language,
first appeared in the 90s
• Easily recognized by its use of whitespace indentation, rather than { } braces, to delimit blocks and enhance readability
• Becoming the industry standard for data
science (displacing R?)
• Many useful, open-source libraries: numpy,
pandas, matplotlib, pytorch
• And standard control structures (e.g., loops) familiar from lower-level languages to help structure programs
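A small sketch of what indentation-based structure looks like in practice. The GPA values are the five visible rows of the applicant table that appears later in these slides; the function name count_at_least is only an illustration, not something from the course material.

```python
# GPA values copied from the applicant table shown later in these slides.
gpas = [4.7, 3.5, 3.0, 4.2, 3.8]

def count_at_least(values, threshold):
    """Count how many values are at or above the threshold."""
    count = 0
    for v in values:          # the loop body is defined by its indentation
        if v >= threshold:    # so is the body of the if statement
            count += 1
    return count              # back at function level: the loop has ended

above_4 = count_at_least(gpas, 4.0)
print(above_4)
```

Note that no braces or "end" keywords appear: moving a line left or right changes which block it belongs to.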
12
landscape
• This is an introductory programming and
statistics course that emphasizes data
science problems with some math
• Other data science courses in ECE, e.g.,
• ECE 30010 - Introduction to Machine
Learning and Pattern Recognition
• ECE 47300 - Introduction to Artificial
Intelligence
• ECE 57000 - Artificial Intelligence
• ECE 59500 - Machine Learning I
• But data science is a Purdue-wide initiative!
14
syllabus break!

15
some data analysis examples

16
data analysis in “practice”
• Let’s say we have a data set of applicants to Purdue

Name          High school GPA   SAT Math   SAT R/W   Residence
Jane Doe      4.7               760        700       Indiana
Purdue Pete   3.5               680        620       Indiana
B. O. Iler    3.0               800        650       Michigan
Engy Neer     4.2               750        590       North Carolina
Mark Faller   3.8               780        550       New Jersey
…             …                 …          …         …

• What might we want to learn about them?

17
descriptive statistics
• Which students come from which states?
• What is the distribution of GPAs? SAT scores?
• GPAs may need to be normalized to a consistent range across all schools
• Can build histograms, e.g., for the GPAs
• But how do we know how big to make the buckets?

[Figure: histogram of GPAs with buckets 2.5–3.0, 3.0–3.5, 3.5–4.0, 4.0+; vertical axis from 0 to 50]
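One way these questions might be answered in code, as a minimal sketch using only the Python standard library. It uses the five visible rows of the applicant table, so the counts are tiny. The bucket edges follow the labels on the slide's histogram, but treating each bucket as closed on the left (so 3.5 falls in 3.5-4.0) is an assumption, as is the catch-all "<2.5" bucket.

```python
from collections import Counter

# The five visible rows of the applicant table (the full data set is larger).
applicants = [
    ("Jane Doe", 4.7, "Indiana"),
    ("Purdue Pete", 3.5, "Indiana"),
    ("B. O. Iler", 3.0, "Michigan"),
    ("Engy Neer", 4.2, "North Carolina"),
    ("Mark Faller", 3.8, "New Jersey"),
]

# Which students come from which states?
state_counts = Counter(state for _, _, state in applicants)

def bucket(gpa):
    """Map a GPA to a histogram bucket (left-closed edges are an assumption)."""
    if gpa >= 4.0:
        return "4.0+"
    if gpa >= 3.5:
        return "3.5-4.0"
    if gpa >= 3.0:
        return "3.0-3.5"
    if gpa >= 2.5:
        return "2.5-3.0"
    return "<2.5"

# What is the distribution of GPAs?
histogram = Counter(bucket(gpa) for _, gpa, _ in applicants)

for label in ["<2.5", "2.5-3.0", "3.0-3.5", "3.5-4.0", "4.0+"]:
    print(f"{label:>8} | {'#' * histogram[label]}")
```

Changing the edges in bucket changes the shape of the histogram, which is exactly why the bucket-size question matters.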

18
reasoning about data
• How do Purdue applicants compare to the national average?
• Mean GPA of applicants: 3.6
• Is this high or low?
• Can sample the GPAs of high school students nationwide
• Suppose we collect 1000 GPAs and find a mean of 3.4
• Does this mean Purdue applicants have a higher GPA on average?
• Need more information! In particular …
• Was the sampling method we used unbiased?
• What is the variance of the sample collected (i.e., the spread of GPAs)?
• What confidence interval can be built for the population mean (i.e., what is the likely range of
the true mean GPA)?
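A rough sketch of the confidence-interval question, using the normal approximation (mean ± z · s / √n). The sample size and sample mean come from the slide; the sample standard deviation of 0.5 is NOT given anywhere in the slides and is purely an assumed value for illustration. The function name is also hypothetical.

```python
from math import sqrt
from statistics import NormalDist

n = 1000            # sample size from the slide
sample_mean = 3.4   # sample mean from the slide
sample_std = 0.5    # ASSUMPTION: the slides do not give the spread

def mean_confidence_interval(mean, std, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    margin = z * std / sqrt(n)                      # z times the standard error
    return mean - margin, mean + margin

low, high = mean_confidence_interval(sample_mean, sample_std, n)
print(f"95% CI for the national mean GPA: ({low:.3f}, {high:.3f})")
print("Applicant mean 3.6 inside the interval?", low <= 3.6 <= high)
```

Under the assumed spread, the interval is narrow because n is large. A larger variance or a smaller sample would widen it, and none of this helps if the sampling method was biased.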

19
making predictions
• Can we predict how successful a particular applicant
might be at Purdue?
• How do we define success? GPA?
• Idea: Look at the application statistics of the current
seniors and see if there is a relationship between these
statistics and their current GPA
• One way to find a relationship is using linear regression
• Might tell you something like: “a Purdue student’s
GPA can be predicted mostly by their high school
GPA, with their SAT score having a lighter influence”
• Many other prediction algorithms exist too
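The regression idea can be sketched with ordinary least squares on a single predictor. The numbers below are HYPOTHETICAL toy data made up for illustration (the slides give no Purdue GPAs), and using only high school GPA is a simplification: the slide's example also involves SAT scores, which would call for multiple regression.

```python
from statistics import mean

# HYPOTHETICAL toy data for six seniors (not from the slides).
hs_gpa = [3.0, 3.4, 3.6, 3.8, 4.0, 4.2]       # high school GPA at application
purdue_gpa = [2.7, 3.0, 3.1, 3.4, 3.4, 3.7]   # current Purdue GPA

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = fit_line(hs_gpa, purdue_gpa)
predicted = intercept + slope * 3.5   # prediction for a new applicant
print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction={predicted:.2f}")
```

A positive slope here only reflects the made-up data. With real data, the size of the slope relative to the noise is what would indicate how strongly high school GPA relates to Purdue GPA.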
20
classification
• Can we make admissions decisions quicker through
automation?
• Idea: Compare each applicant’s statistics to past applicants
that were admitted, and to those that were rejected
• Train a classifier to analyze these past applicants and
maximize the ability to predict whether a student would be
accepted or not
• For example, a k-nearest neighbor classifier would assess
whether a given applicant is more similar to the pool of
admitted applicants or to the rejected applicants
• Why might we run into trouble here?
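A minimal k-nearest neighbor sketch of the idea above. The past applicants and their decisions are HYPOTHETICAL, invented for illustration. Dividing GPA by 5.0 and SAT Math by 800 is also an assumption, made so that the much larger SAT numbers do not dominate the distance.

```python
from collections import Counter
from math import dist

# HYPOTHETICAL past applicants: (high school GPA, SAT Math) and the decision.
past = [
    ((4.5, 780), "admitted"),
    ((4.1, 740), "admitted"),
    ((3.9, 760), "admitted"),
    ((3.2, 600), "rejected"),
    ((3.0, 650), "rejected"),
    ((3.4, 580), "rejected"),
]

def scale(features):
    """Put both features on a roughly 0-1 range (the maxima are assumptions)."""
    gpa, sat_math = features
    return (gpa / 5.0, sat_math / 800.0)

def knn_predict(query, labeled, k=3):
    """Majority vote among the k past applicants closest to the query."""
    q = scale(query)
    nearest = sorted(labeled, key=lambda item: dist(q, scale(item[0])))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((4.2, 750), past))   # GPA and SAT Math of one table row
```

Note that a classifier like this can only imitate the past decisions it is given, which is one place to start when thinking about the question above.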

21
clustering
• What if we want to identify groups of students
beyond “admitted” vs. “rejected”?
• Idea: See if students cluster together according
to some measure of distance
• Some students look more like “nearby”
students than students that are “far away”
• Important question: What features of students
should be considered for the clustering?
• E.g., maybe don’t consider something like hair
color!
• With k-means clustering, k groups of students
would be extracted based on “closeness”
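A bare-bones sketch of k-means, alternating between assigning each student to the nearest centroid and moving each centroid to the mean of its members. The points are HYPOTHETICAL, already-scaled feature pairs invented for illustration. Starting from the first k points is a naive assumption; practical implementations usually use a more careful initialization.

```python
from math import dist

# HYPOTHETICAL scaled features per student, e.g. (GPA / 5.0, SAT Math / 800).
points = [(0.90, 0.95), (0.60, 0.70), (0.85, 0.90),
          (0.62, 0.75), (0.88, 0.92), (0.58, 0.72)]

def k_means(points, k, iterations=10):
    """Return (centroids, assignment) after a fixed number of iterations."""
    centroids = list(points[:k])   # ASSUMPTION: naive initialization
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignment = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:            # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, assignment

centroids, assignment = k_means(points, k=2)
print(assignment)
print(centroids)
```

The notion of "closeness" is entirely determined by which features go into each point, which is why the choice of features is the important question.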

22
