Syllabus PracticalDataScience
1 Course Description
Data Science is an intrinsically applied field, and yet all too often students are taught the advanced
math and statistics behind data science tools, but are left to fend for themselves when it comes
to learning the tools we use to do data science on a day-to-day basis or how to manage actual
projects. This course is designed to fill that gap.
Practical Data Science is a flipped-classroom, exercise- and project-focused course. It is designed
to give students practical experience manipulating and analyzing real (often messy,
error-ridden, and poorly documented) data using the full range of bread-and-butter Python data
science tools (such as the command line, git, Python (especially numpy and pandas), Jupyter
notebooks, and more). By the end of the course, students will be able to:
• Manipulate and analyze data in any format, including cleaning, merging, and summarizing
all standard tabular formats and levels of cleanliness, as well as large datasets and GIS data,
• Identify and resolve data issues using defensive programming practices,
• Set up and manage a data science programming environment on their own computers,
including installing Python, managing packages with pip and conda, setting PATH variables,
and working with VS Code,
• Collaborate with colleagues effectively using git and github,
• Plan and execute a full data science project from planning data manipulations through
analysis and presentation of findings.
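As a rough illustration of what "defensive programming" means in the list above, the sketch below merges two small tables and fails loudly if the data violates our assumptions. It uses only the Python standard library so that it is self-contained; in the course itself this kind of work would normally be done with pandas. The table contents, the column names, and the helper merge_one_to_one are all hypothetical and invented for this example.

```python
# Hypothetical example data: all names and values are invented for illustration.
patients = [
    {"patient_id": 1, "ward": "A"},
    {"patient_id": 2, "ward": "B"},
]
infections = [
    {"patient_id": 1, "resistant": True},
    {"patient_id": 2, "resistant": False},
]


def merge_one_to_one(left, right, key):
    """Merge two lists of records on `key`, failing loudly on unexpected data."""
    left_keys = [row[key] for row in left]
    right_keys = [row[key] for row in right]

    # Defensive checks: write down what we believe about the data,
    # so a violation stops the analysis instead of silently corrupting it.
    assert len(left_keys) == len(set(left_keys)), f"duplicate {key} in left table"
    assert len(right_keys) == len(set(right_keys)), f"duplicate {key} in right table"
    assert set(left_keys) == set(right_keys), f"{key} values differ across tables"

    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


merged = merge_one_to_one(patients, infections, "patient_id")
```

If a duplicate patient_id slipped into either table, the merge would raise an AssertionError immediately rather than quietly producing extra or missing rows.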
antibiotic-resistant infections in a hospital, or to identify which campaign promises are most likely
to convince voters to support a politician. As a result, Data Analysis Data Scientists are generally
writing code that is only meant to be used for their specific project. Moreover, Data Analysis Data
Scientists don’t generally have the luxury of working with data with a known structure – where
a Netflix Data Scientist may get data from a company database that’s clean and well organized,
a Data Analysis Data Scientist may have to work with data that has come from lots of different
sources and which no one has cleaned and organized (e.g. notes from nurses, or voting data from
different states compiled by hand by minimum wage government employees).
To be clear, these branches are not completely distinct. Most data scientists do things that fall
into both categories (for example, even a Software Developer will likely do some ad hoc analyses
before developing a fully deployable tool). But these two types of data science do emphasize
different skills. Software Development Data Scientists, for example, are well served by traditional
computer science curricula, and need a much deeper understanding of concepts like object-oriented
programming and software deployment. By contrast, Data Analysis Data Scientists need to be
comfortable working with data in different formats, and to understand how to clean and fit together
datasets that were never actually built to be integrated.
The focus of this course will be on the skills of Data Analysis Data Science: cleaning and merg-
ing data, data exploration, and designing projects to answer very specific questions. If you’re
interested in policy analysis, or health-sector analysis, or applied empirical research, this course is
for you; if you’re interested in developing programs you can deploy in an iPhone app to improve
recommendations, then while there will be material that will be of use to you (the Python data
science stack, working at the command line, git and github), the emphasis of the material won’t
quite be what you’re looking for.
4 Python
In this class we will primarily be working with Python.
Why Python? Because it’s currently one of the two most-used languages in data science (the other
being R, which you’ll be working with in other classes), which means there is a good chance you’ll
be called upon to use it when working in teams.
It is worth emphasizing that we’re not learning Python because it is necessarily “the best”
language. The reality is that there are lots of tools for statistical programming (e.g. R, Stata,
SPSS, Python, Julia, Matlab), and each has its own strengths and weaknesses. People
often develop strong opinions about which language is best, and sometimes pass judgement on
people who use other languages. But what is “best” depends on your use-case (the types of things
you are using the language to do). This is true not only because languages themselves have
strengths and weaknesses, but also because the tools and packages that have been created for use
in different languages differ (people just haven’t made a good package for doing geo-spatial work
in Julia yet, for example). And if you’re working on teams, you’ll also have to make decisions
based on the backgrounds and tool sets of your teammates. All of which is to say: there is no
single best language for all purposes. But Python is a very popular, strong, general-purpose
language, so it will serve as a great starting point.
As a result, over the course of your career you may find yourself gravitating to one tool or another as
required by your research. But in providing you with a firm foundation in a very popular language
like Python, you will not only be learning a tool that will allow you to do most everything you’ll
want to do in graduate school, but you will also be providing yourself with a solid foundation in
generalizable skills that you will find useful if you later change platforms.
5 Class Organization
Data science is an applied discipline, and so this will be an intensely applied class with lots of
hands-on exercises.
To make it possible for us to work through problems together as they arise, we will dedicate most
of our class time to completing these exercises in small groups. That means that students will
be required to read instructional material before every class so they will be ready to do these
exercises. This is what is referred to as “flipping the classroom.”
In order to make this class organization work, it will be critically important that students do their
assigned readings before every class, and as discussed below, this will be reflected in how grades
are assigned in this class. Students who do not complete their assigned readings and tutorials
before each class should not expect to receive good grades, regardless of performance on project
assignments.
This class is organized around having two (synchronous) class sessions every week. While the plan
is for most of these to be in person, some classes will inevitably end up needing to be held online.
Synchronous attendance, whether classes are online or in person, is required unless you
are unable to participate synchronously due to extenuating circumstances, such as
an internet connection that will not support synchronous participation (for online
classes) or illness (for in-person classes).
With that said, everyone’s health and safety is of course our first priority, so while it is very
important you attend class whenever possible, you should never hesitate to stay home if you’re
not feeling well. If you are not feeling well and need to miss class – or need to miss class for
COVID-related reasons (e.g. quarantine) – please reach out to me so that we can make a plan to make
sure you’re fully supported!
Participation will be graded as follows:
A range. You are fully and consistently engaged in class discussion and exercises. You both listen
and contribute actively. You are well-prepared for class. Having done more than merely read the
material, you have spent time thinking carefully and deeply about the material’s relationship to
other materials and ideas presented in previous classes. You are not only able to answer questions
about the material, but also come to class with thoughtful questions. When working in teams, you
work with your partner. If your partner is struggling with an exercise, you help them understand
the material rather than just completing the exercise on your own. If you are struggling with
material, you ask for help (both from the instructor and your fellow students) and do not simply
lean on your partner to complete the exercise.
B range. You are engaged in class discussion and exercises. You listen and contribute regu-
larly. You come well-prepared to class having read the material and your contributions show your
familiarity, but your level of engagement lacks the depth accumulated through extra time spent
thinking about the material. When working in teams, you work with your partner when they
have a similar level of understanding, but do not always invest in helping a struggling partner to
understand the material. You often ask for help when you are struggling, but other times you let
your partner just complete the exercise.
C range. You have met the minimum requirements of participation. You are usually, but not
always prepared. You participate sometimes, but not regularly. The comments that you offer
show a basic familiarity with the materials, but do not help to build a coherent or productive
discussion. When working in teams, you only sometimes work with your partner. When your
partner is struggling, you often just do the exercise yourself. If you are struggling, you often do
not ask for help and allow your partner to take over the exercise.
D range. You have not met the minimum requirements of participation. You are unprepared
for class. You have not read the material with sufficient engagement to know even the most
basic elements. When working in teams, you do not attempt to work with your partner. When
your partner is struggling, you just do the exercise yourself. If you are struggling, you do not ask
for help and allow your partner to take over the exercise.
As should be clear from this rubric, above all it is important to emphasize that participation
is evaluated on the basis of quality and consistency, not quantity. Moreover,
when completing in-class exercises, good participation is not about finishing first or
without ever asking for help; good participation in in-class exercises is about helping
your partner understand the material, and asking for help when you need it.
did the readings—they won’t be full of gotcha questions—but will require you to have done the
readings.
Late Assignments
All students get one “freebie” – they may submit one assignment up to two days late without
penalty.
Freebies may be used for team assignments, but only if all team members have a freebie to use,
and all agree to use their freebie for the team assignment.
After that, because of the difficulty associated with managing late assignments in large classes, all
late assignments will be penalized 10% per late day (up to a maximum deduction of 50%). Excep-
tions may be made for students dealing with exceptional circumstances (illness for themselves or
family, etc.) – if you are dealing with a difficult situation, please feel free to contact me to discuss
your situation.
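To make the arithmetic of the late policy concrete, here is a minimal sketch of how the deduction could be computed. The function name late_penalty_percent and its parameters are hypothetical. The sketch also makes two assumptions that the policy above does not spell out: lateness is counted in whole days, and a freebie only helps if the assignment is no more than two days late (otherwise the normal penalty applies).

```python
def late_penalty_percent(days_late, use_freebie=False):
    """Return the percentage deducted from an assignment's score.

    Assumptions (not stated in the syllabus): `days_late` is a whole
    number of days, and a freebie has no effect beyond two days late.
    """
    if days_late <= 0:
        return 0
    # A freebie covers one assignment submitted up to two days late.
    if use_freebie and days_late <= 2:
        return 0
    # 10% per late day, up to a maximum deduction of 50%.
    return min(10 * days_late, 50)


print(late_penalty_percent(3))                    # three days late, no freebie
print(late_penalty_percent(2, use_freebie=True))  # two days late, freebie used
```

Under these assumptions, an assignment three days late loses 30%, and the deduction stops growing once it reaches 50% at five days late.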
7 Texts
We will rely on two primary texts for this course (both of which, thankfully, are reasonably priced):
• Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas.
Referred to in the syllabus as JVP.
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Second
Edition by Wes McKinney. Referred to in the syllabus as WM.
Make sure to buy the Second Edition!
We will also do some readings from Code: The Hidden Language of Computer Hardware and
Software by Charles Petzold. It’s a fun book and not very expensive, but we won’t use it a lot so
copies of relevant chapters will be provided if you don’t want to buy it.
8 Course Schedule
Because one aim of this course is to ensure that all MIDS students have a solid foundation
for their time at Duke, the exact organization of this course is likely to change regularly as
the course proceeds. Students will therefore be expected to regularly (i.e. before every class)
check on the updated course schedule (which will include assignments for the next class) at
www.practicaldatascience.org.
9 Honor Policy
Duke University is a community dedicated to scholarship, leadership, and service and to the
principles of honesty, fairness, respect, and accountability. Citizens of this community commit
to reflect upon and uphold these principles in all academic and nonacademic endeavors, and to
protect and promote a culture of integrity.
Remember the Duke Community Standard that you have agreed to abide by:
• I will not lie, cheat, or steal in my academic endeavors;
• I will conduct myself honorably in all my endeavors; and
• I will act if the Standard is compromised.
Cheating on exams, plagiarism on homework assignments, lying about an illness or absence, and
other forms of academic dishonesty are a breach of trust with classmates and faculty, violate the
Duke Community Standard, and will not be tolerated. Such incidents will result in a 0 grade
for all parties involved. Additionally, there may be penalties to your final class grade, along with
a report to the MIDS program directors.
10 Disability Statement
In an effort to prevent students with disabilities from having to explain and justify their condition
separately to each of their various instructors, Duke has centralized disability management in the
Student Disabilities Access Office. If you think there is a possibility you may need an accommo-
dation during this course, please reach out to their office as soon as possible (processing can take
a little time).
Medical information shared with the SDAO is strictly confidential, and if SDAO determines an
accommodation is appropriate, faculty members will simply be informed of the accommodation
they are required to provide, not the underlying medical reason for the accommodation.
If you have any problems with SDAO, please let me know as soon as possible.
11 Student Signature
I, the undersigned, confirm I have read and understand the expectations of this class.
Name: ______________
Signature: ______________
Date: ______________