0% found this document useful (0 votes)
20 views

Syllabus PracticalDataScience

Uploaded by

donrafee2.0
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views

Syllabus PracticalDataScience

Uploaded by

donrafee2.0
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Nicholas Eubank IDS 720

Assistant Research Professor Fall 2022


Office Hours: Friday, 2:30-3:30
https://duke.zoom.us/my/nickeubank

Practical Data Science:


Wrestling with Data & Answering Questions

1 Course Description
Data Science is an intrinsically applied field, and yet all too often students are taught the advanced
math and statistics behind data science tools, but are left to fend for themselves when it comes
to learning the tools we use to do data science on a day-to-day basis or how to manage actual
projects. This course is designed to fill that gap.
Practical Data Science is a flipped-classroom, exercise and project-focused course. It is designed
to give students practical experience manipulating and analyzing manipulating real (often messy,
error ridden, and poorly documented) data using the full range of bread-and-butter Python data
science tools (like the command line, git, python (especially numpy and pandas), jupyter note-
books, and more). By the end of the course, students will be able to:
• Manipulate and analyze data in any format, including cleaning, merging, and summarizing
all standard tabular formats and levels of cleanliness, as well as large datasets and GIS data,
• Identify and resolve data issues using defensive programming practices,
• Setup and manage a data science programming environment on their own computers, in-
cluding installing Python, managing packages with pip and conda, setting PATH variables,
and working with VS Code,
• Collaborate with colleagues effectively using git and github,
• Plan and execute a full data science project from planning data manipulations through
analysis and presentation of findings.

2 For Whom Is This Course Meant?


2.1 Pre-Requisites
This course is primarily designed for incoming Masters in Data Science (MIDS) students. As such,
the only pre-requisites are the three things taught in the MIDS student boot-camp:
• A familiarity with basic python
• A familiarity with basic statistics (i.e. what you’d get from an intro stats course)
• A familiarity with git and github
MIDS students:
The only knowledge assumed is the portion of these topics covered in boot-camp, so unless you
skipped (the mandatory) bootcamp, you should be good.
MIDS students should also be aware that while many of the topics in the course schedule may
seem like things you covered over the summer, we will be exploring them in much more depth, and
practicing techniques you’ve learned with much messier real-world data. Your experience from
DataCamp will make these sections easier, but it does not obviate the need for this class.
For non-MIDS students:
By “basic python” I mean a familiarity with the core Python programming language, including
concepts like variables, loops, lists, dictionaries, and defining functions. Unfortunately, if you
haven’t worked with Python before, I’m afraid you will likely find this course hard to follow.
Git and github are a lot easier to learn than Python, though, so if you know Python but not git
and github, talk to me and we can figure something out.
Much of this course will focus on learn about and getting experience working with the Python
packages numpy and pandas. While familiarity with these packages is not an explicit
pre-requisite for this course, you should be aware that incoming MIDS students have
been exposed to these packages through DataCamp tutorials they completed over
the summer. As a result, if you come into this course without ever having seen those
packages, you may have to do some extra work since you’ll be seeing these packages
for the first time. This should not prevent you from being able to succeed in this
course, but if you are in this position please talk to me after class to make a plan.

2.2 For Whom Would This Course Be Inappropriate?


If you were a computer science major as an undergraduate or worked in a job that made intense
use of Python for Data Science applications, please speak to me after class, as the first portion
of this course might be somewhat boring for you. With that said, even students who have taken
computer science courses may find that this class offers a very different perspective on familiar
tools. CS programs tend to be oriented towards a style of programming best suited for software
development which can differ substantially from the tools and style used in data science. Moreover,
project design should be new to anyone who hasn’t worked in the data science field.

3 What Do You Mean By Data Science?


There are, broadly speaking, two branches of what is often referred to as Data Science, which I
will term Software Development Data Science and Data Analysis Data Science.
In Software Development Data Science, programmers write programs that gets bundled up in
software and distributed widely, or gets run on the cloud for millions of people. For example,
software development data scientists wrote the recommendation engine that lets Netflix tell you
what movies you might enjoy, or what people might be your friends on Facebook. As a result,
they generally write generalizable code that is designed to run on data with a known structure.
In Data Analysis Data Science, the data scientist is generally employed to answer a single, specific
question. For example, a Data Analysis Data Scientist may be hired to figure out how to reduce

2
anti-biotic resistant infections in a hospital, or to identify what campaign promises are most likely
to convince voters to support a politician. As a result, Data Analysis Data Scientists are generally
writing code that is only meant to be used for their specific project. Moreover, Data Analysis Data
Scientists don’t generally have the luxury of working with data with a known structure – where
a Netflix Data Scientist may get data from a company database that’s clean and well organized,
a Data Analysis Data Scientist may have to work with data that has come from lots of different
sources and which no one has cleaned and organized (e.g. notes from nurses, or voting data from
different states compiled by hand by minimum wage government employees).
To be clear, these branches are not completely distinct. Most data scientists do things that fall
into both categories (for example, even a Software Developer will likely do some ad hoc analyses
before developing a fully deployable tool). But these two types of data science do emphasize
different skills. Software Development Data Scientists, for example, are well served by traditional
computer science curricula, and need a much deeper understanding of concepts like object-oriented
programming, and software deployment. By contrast, Data Analysis Data Scientists need to be
comfortable working with data in different formats, and to understand how to clean and fit together
datasets that were never actually built to be integrated.
The focus of this course will be on the skills of Data Analysis Data Science: cleaning and merg-
ing data, data exploration, and designing projects to answer very specific questions. If you’re
interested in policy analysis, or health-sector analysis, or applied empirical research, this course is
for you; if you’re interested in developing programs you can deploy in an iPhone app to improve
recommendations, then while there will be material that will be of use to you (the Python data
science stack, working at the command line, git and github), the emphasis of the material won’t
quite be what you’re looking for.

4 Python
In this class we will primarily be working with Python.
Why Python? Because it’s currently one of the two most-used programs in data science (the other
being R, which you’ll be working with in other classes), which means there is a good chance you’ll
be called upon to use it when working in teams.
It is worth emphasizing that we’re not learning Python because it is necessarily the “the best”
language. The reality is that there are lots of tools for statistical programming, and each has its
own strengths and weaknesses (e.g. R, Stata, SPSS, Python, Julia, Matlab, etc., etc.). People
often develop strong opinions about which language is best, and sometimes pass judgement on
people who use other languages. Every programming language has its strengths and weaknesses,
and what is “best” depends on your use-case (the types of things you are using the language to
do). This is true not only because languages themselves have strengths and weaknesses, but also
because the tools and packages that have been created for use in different languages differ (e.g.
people just haven’t made a good package for doing geo-spatial work in Julia yet, for example).
And if you’re working on teams, you’ll also have to make decisions based on the backgrounds of
your tool sets. All of which is to say: there is no single best language for all purposes. But Python
is a very popular, strong, general purpose language, so will serve as a great starting point.
As a result, over the course of your career you may find yourself gravitating to one tool or another as

3
required by your research. But in providing you with a firm foundation in a very popular language
like Python, you will not only be learning a tool that will allow you to do most everything you’ll
want to do in graduate school, but you will also be providing yourself with a solid foundation in
generalizable skills that you will find useful if you later change platforms.

5 Class Organization
Data science is an applied discipline, and so this will be an intensely applied class with lots of
hands-on exercises.
To make it possible for us to work through problems together as they arise, we will dedicate most
of our class time to completing these exercises in small groups. That means that students will
be required to read instructional material before every class so they will be ready to do these
exercises. This is what is referred to as “flipping the classroom.”
In order to make this class organization work, it will be critically important that students do their
assigned readings before every class, and as discussed below, this will be reflected in how grades
are assigned in this class. Students who do not complete their assigned readings and tutorials
before each class should not expect to receive good grades, regardless of performance on project
assignments.
This class is organized around having two (synchronous) class sessions every week. While the plan
is for most of these will be in person, some classes will inevitably end up needing to be held online.
Synchronous attendance, whether classes online or in person, is required unless you
are unable to participate synchronously due to extenuating circumstances (such as
an internet connection that will not support synchronous participation (for online
classes) or illness (for in person classes)).
With that said, everyone’s health and safety is of course our first priority, so while it is very
important you attend class whenever possible, you should never hesitate to stay home if you’re
not feeling well. If you are not feeling well and need to miss class – or need to miss class for covid
related reasons (e.g. quarantine) – please reach out to me so that we can make a plan to make
sure you’re fully supported!

6 Assignments & Grading


6.1 Participation (25% of Grade)
Note that a major component of good participation is good preparation. Because we will mostly
reserve class time for hands-on exercises, it is absolutely critical that students do their assigned
readings before every class. Students who do not work through the instructional materials they
have been assigned before class will not only get very little out of the hands-on exercises designed
to reinforce the assigned materials, but they will also undermine the learning of the students they
are asked to work with. With that in mind, students who do not complete their assigned readings
before every class should expect to see this reflected in their participation grades.

4
Participation will be graded as follows:1
A range. You are fully and consistently engaged in class discussion and exercises. You both listen
and contribute actively. You are well-prepared for class. Having done more than merely read the
material, you have spent time thinking carefully and deeply about the material’s relationship to
other materials and ideas presented in previous classes. You are not only able to answer questions
about the material, but also come to class with thoughtful questions. When working in teams, you
work with your partner. If your partner is struggling with an exercise, you help them understand
the material rather than just completing the material on your own. If you are struggling with
material, you ask for help (both from the instructor and your fellow students) and do not simply
lean on your partner to complete the exercise.

B range. You are engaged in class discussion and exercises. You listen and contribute regu-
larly. You come well-prepared to class having read the material and your contributions show your
familiarity, but your level of engagement lacks the depth accumulated through extra time spent
thinking about the material. When working in teams, you work with your partner when they
have a similar level of understanding, but do not always invest in helping a struggling partner to
understand the material. You often ask for help when you are struggling, but other times you let
your partner just complete the exercise.

C range. You have met the minimum requirements of participation. You are usually, but not
always prepared. You participate sometimes, but not regularly. The comments that you offer
show a basic familiarity with the materials, but do not help to build a coherent or productive
discussion. When working in teams, you only sometimes work with your partner. When your
partner is struggling, you often just do the exercise yourself. If you are struggling, you often do
not ask for help and allow your partner to take over the exercise.

D range. You have not met the minimum requirements of participation. You are unprepared
for class. You have not read with the material with sufficient engagement to know even the most
basic elements. When working in teams, you do not attempt to work with your partner. When
your partner is struggling, you just do the exercise yourself. If you are struggling, you do not ask
for help and allow your partner to take over the exercise.

As should be clear from this rubric, above all it is important to emphasize that partic-
ipation is evaluated on the basis of quality and consistently, not quantity. Moreover,
when completing in-class exercises, good participation is not about finishing first or
without ever asking for help; good participation in in-class exercises is about helping
your partner understand the material, and asking for help when you need it.

6.2 Quizzes (20%)


In order to ensure students are doing their readings in advance of class, from time to time we will
start class with short quizzes. These quizzes are designed to be relatively straightforward if you
1
This rubric is adapted from that of Duke Political Science Professor Adriane Fresh.

5
did the readings—they won’t be full of gotcha questions—but will require you to have done the
readings.

6.3 Interim Assignments (30% of Grade)


Over the course of the semester, students will be asked to complete a number of small assignments
as homework. These assignments will, in total, be worth 30% of student grades.

6.4 Team Data Science Project. 25%


Around mid-semester, students will be assigned a large team Data Science Project. The goal and
general framework for this team project will be provided to students, but the project will require
students to complete the analysis component of a full data science project, including gathering
data, cleaning and merging that data, analyzing the data, and presenting results.

6.5 Late Assignments, Make Up Exams and Extra Credit


Grading
All assignments will be given a numerical score on a 0-1 scale. These scores will be multiplied by
the value of the assignment (see above) and the following scale will be used to assign a final letter
grade.

Late Assignment
All students get one “freebie” – they may submit one assignment up to two days late without
penalty.
Freebie’s may be used for team assignments, but only if all team members have a freebie to use,
and all agree to use their freebie for the team assignment.
After that, because of the difficulty associated with managing late assignment in large classes, all
late assignments will be penalized 10% per late day (up to a maximum deduction of 50%). Excep-
tions may be made for students dealing with exceptional circumstances (illness for themselves or
family, etc.) – if you are dealing with a difficult situation, please feel free to contact me to discuss
your situation.

7 Texts
We will rely on two primary texts for this course (both of which, thankfully, are reasonably priced):
• Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas.
Referred to in the syllabus as JVP.
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Second
Edition by Wes McKinney. Referred to in the syllabus as WM.
Make sure to buy the Second Edition!.

6
We will also do some readings from Code: The Hidden Language of Computer Hardware and
Software by Petzold, Charles. It’s a fun book and not very expensive, but we won’t use it a lot so
copies of relevant chapters will be provided if you don’t want to buy it.

8 Course Schedule
Because one aim of this course is to ensure that all MIDS students have a solid foundation
for their time at Duke, the exact organization of this course is likely to change regularly as
the course proceeds. Students will therefore be expected to regularly (i.e. before every class)
check on the updated course schedule (which will include assignments for the next class) at
www.practicaldatascience.org.

9 Honor Policy
Duke University is a community dedicated to scholarship, leadership, and service and to the
principles of honesty, fairness, respect, and accountability. Citizens of this community commit
to reflect upon and uphold these principles in all academic and nonacademic endeavors, and to
protect and promote a culture of integrity.
Remember the Duke Community Standard that you have agreed to abide by:
• I will not lie, cheat, or steal in my academic endeavors;
• I will conduct myself honorably in all my endeavors; and
• I will act if the Standard is compromised.
Cheating on exams or plagiarism on homework assignments, lying about an illness or absence and
other forms of academic dishonesty are a breach of trust with classmates and faculty, violate the
Duke Community Standard, and will not be tolerated. Such incidences will result in a 0 grade
for all parties involved. Additionally, there may be penalties to your final class grade along with
being reported to the MIDS program directors.

10 Disability Statement
In an effort to prevent students with disabilities from having to explain and justify their condition
separately to each of their various instructors, Duke has centralized disability management in the
Student Disabilities Access Office. If you think there is a possibility you may need an accommo-
dation during this course, please reach out to their office as soon as possible (processing can take
a little time).
Medical information shared with the SDAO are strictly confidential, and if SDAO determines an
accommodation is appropriate, faculty members will simply be informed of the accommodation
they are required to provide, not the underlying medical reason for the accommodation.
If you have any problems with SDAO, please let me know as soon as possible.

7
11 Student Signature
I, the undersigned, confirm I have read and understand the expectations of this class.
Name: ______________
Signature: ______________
Date: ______________

You might also like