Machine Learning Methods - Python
Meetings:
Class: Locations:
Section 01 - MW 10:30-11:50am Keller 0007
Section 02 - MW 1:30-2:50pm Keller 0001
Course Description
It’s an exciting time to study machine learning and data science more generally! We live in a digital era where
many of our decisions and actions are tracked. Information is being produced and recorded at a stifling pace.
While this may not seem novel to those who were born and have grown up in the Information Age, the amount of
data available to researchers and policymakers is much more than what existed even a decade ago. Coupled with
cheap computing power and expanded data storage, recent developments across statistics, computer science,
and data-driven social sciences allow us to use all this data in a myriad of interesting ways. But what questions
will we seek to answer with this newly available big data and these newly developed machine learning tools?
While these tools are already being used extensively in marketing, finance, and business, their application
to public policy is in its infancy (despite the techniques being the same across disciplines). Early examples of
questions with policy implications include: can we predict unavailable data we take for granted in the developed
world from available information in a developing world context? Is it possible to improve the accuracy of
judges’ bail decisions that hinge on whether the accused will commit additional crimes? Or can we inform
doctors about the trade-offs inherent in prescribing potentially addictive opioids to patients for short-term pain
relief by predicting who is likely to develop an addiction in the long run?
In order to ask and inform questions like these, this class will introduce you to ways to detect patterns in
data, then use what you have learned to predict important outcomes or describe the salient relationships among
inputs. While this requires an understanding of how and why these tools work, we will emphasize the intuition
and application of these techniques over their theoretical underpinnings. We will do so by exploring nascent,
policy-relevant applications of these methods, but, ultimately, the full impact of how these machine learning
techniques inform and influence policy has yet to be determined. That’s up to you!
1 TA office hour times and locations are listed on the Canvas calendar.
Learning Objectives: “What’s My Incentive for Taking This Course?”
Specifically, the purpose of the course is to introduce you to a wide array of the fundamental methods in modern
machine learning. Each week, we will learn about and discuss a different set of techniques and their applications
to public policy during lecture sections. During lab sessions, you will gain experience with those techniques by
coding their implementation in Python.
Along the way you can expect to:
• Understand how the machine learning approach, which focuses on prediction, differs from the approach
to fundamental statistical and/or causal inference you learned in Harris’ core statistics classes.
• Gain an appreciation of why the bias-variance trade-off makes prediction inherently difficult.
• Recognize the different ways “long” and “wide” big data allow us to improve our predictions.
• Continue developing your coding skills in Python as you learn new tools.
• Visualize, interpret, and convey your findings to audiences of different levels of technical sophistication.
The overall course objective is for you to be able to use machine learning tools to inform better policy and make
the world a better place, as well as to become an informed and critical consumer of policy recommendations
based on machine learning techniques. Additionally, the course will allow you to market your newly gained
machine learning knowledge and skills when applying for jobs.
Prerequisites
• PPHA 30537 Data and Programming for Public Policy I - Python Programming and
• PPHA 30538 Data and Programming for Public Policy II - Python Programming.
This course is the third installment of the three-quarter core sequence of the Certificate in Data Analytics
(https://harris.uchicago.edu/academics/design-your-path/certificates/certificate-data-analytics) at Harris.
Students at Harris and from other parts of the University may enroll without having taken previous courses in the
sequence (after students who have taken the prerequisites have had a chance to enroll). However, MPP students
must take the full sequence to meet the requirements of the Certificate in Data Analytics.
Considering the Course without the Prerequisites? For anyone who has not taken the prerequisites and
is considering taking this course, first, thanks for your interest in my class! This course introduces machine
learning techniques, then has students practice and apply them via Python coding-based labs, problem sets, and
mini-projects. So while the class doesn’t directly follow the prerequisites (which teach general coding skills
in Python), you will be responsible for knowledge of the material covered in those classes. I allow students
to waive the prerequisites if they have sufficient experience coding in Python and are aware that they may be
at a bit of a disadvantage relative to the majority of the students in the class who have taken the prerequisites.
If you are considering taking the class out of sequence, I would recommend looking over the syllabi for the
prerequisite classes and making sure that you’re comfortable with the topics and techniques that are covered
before making your decision on whether or not to enroll. It may also be helpful to take a look at the textbook,
which is available online (for free; see the “Materials” section of this syllabus for details on how to access it).
Evaluation
Your final grade in this course will be related to performance in several areas. The weight placed on each
component will be as follows:
Problem Sets and Mini-Projects There are four problem sets and four mini-projects in this class. All
assignments will be submitted on Canvas via the Gradescope option. You may submit assignments up to
24 hours after the due date, with a deduction of four percentage points per hour late.2 I will drop the lowest grade
among these assignments when calculating your grade.
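To illustrate the late policy, here is a minimal Python sketch of how a late-adjusted score might be computed. It rests on several assumptions that the syllabus does not state verbatim: a partial hour is rounded up to a full hour (one reading of the note that deductions are not fractional), scores are out of 100, and an adjusted score cannot fall below zero. The function names `late_adjusted_score` and `drop_lowest` are hypothetical and are not part of Canvas or Gradescope.

```python
import math

POINTS_PER_HOUR = 4      # four percentage points per hour late
LATE_WINDOW_HOURS = 24   # late submissions are accepted for up to 24 hours


def late_adjusted_score(raw_score: float, seconds_late: float) -> float:
    """Return a score out of 100 after applying the late deduction.

    Assumptions: any partial hour counts as a full hour, and the
    adjusted score is floored at zero.
    """
    if seconds_late <= 0:
        return raw_score  # on time: no deduction
    hours_late = math.ceil(seconds_late / 3600)
    if hours_late > LATE_WINDOW_HOURS:
        return 0.0  # past the late window: grade of zero
    return max(0.0, raw_score - POINTS_PER_HOUR * hours_late)


def drop_lowest(scores: list[float]) -> list[float]:
    """Return a copy of the scores with one instance of the lowest removed."""
    remaining = list(scores)
    if remaining:
        remaining.remove(min(remaining))
    return remaining
```

Under these assumptions, a submission that earns 90 and arrives one second late is scored as `late_adjusted_score(90, 1)`, which gives 86.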
Problem sets will consist of more structured questions (primarily) from the textbook. They are designed to help
students cement their understanding of the conceptual material covered in lecture and get practice both applying
the tools we learn and with coding.
Mini-projects are designed to apply the machine learning concepts and tools covered in class to policy-relevant
questions. As such, they are less structured, based on real-world data, and emphasize application to public
policy over statistical concepts.
You are welcome (and encouraged) to form study groups of no more than 2 students to work on the problem sets
and mini-projects together. But you must write your own code and your own solutions. Please be sure to include
the names of those in your group on your submission. Please also be sure to follow the good coding practices
you learned in the Data and Programming classes: comment your code, cite any sources you consult, etc.3
Class Participation Class participation points will be based on your level of active, attentive, inquisitive
participation during in-class discussions and/or on the discussion board. For in-class participation, note that
regular class attendance is generally a necessary (but not sufficient) component of earning in-class participation
points. Additionally, to earn credit, you must record each instance of your participation (e.g., when you ask
a question, provide an answer, contribute to a class discussion, etc.) using the submission form linked on the
main Canvas course page.4 Please submit a separate entry each time you participate. You only need a brief
description of your question/answer/etc. (enough to jog my memory) and you should record all participation
within 24 hours after class ends. You do not need to record participation via the discussion board - just your
in-class participation!
2 These deductions are not fractional (e.g. turning an assignment in one second or 59 minutes and 59 seconds late will result in an
developing and demonstrating your ability to apply those techniques. Part of both doing and demonstrating that requires using good
coding style (in part because it makes it easier for the graders to see that you understand what you’re doing). So while good coding
style is secondary to applying the ML techniques, we may take points off if the code is hard to follow.
4 You will have to be logged into your UChicago Google account to submit a response.
We will supplement in-class participation with the Ed Discussion discussion board on Canvas. Please use
the discussion board to post questions, discuss the material covered in the lectures or on the assignments, and
answer questions posed by your peers. As being a good colleague is both an important way to have social impact
and is valued by employers, participation points can be earned by making posts that are helpful to your peers.5
While this can take many forms, points will primarily be awarded for answering classmates’ questions on the
discussion board. In doing so, you may not explicitly share code, provide step-by-step solution algorithms (e.g.,
pseudo code), or direct solutions. You may clarify ambiguities in the assignments, discuss conceptual aspects of
lectures or problems, show output and error messages, and provide general guidance on how to correct errors in
understanding or code.6 Additionally, you may post brief summaries of news articles that describe applications
of machine learning techniques to public policy relevant issues.7
Grades
This class requires a 60% or above to pass and is not curved. All passing letter grades will be determined based
on the following intervals used in the Data Science Certificate sequence:
A [95% − 102%] A- [90% − 95%) B+ [85% − 90%) B [80% − 85%) B- [60% − 80%)
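For illustration, the intervals above can be expressed as a small Python lookup. This is a sketch only: the function name `letter_grade` is hypothetical, and returning `None` for a percentage below the 60% passing threshold is an assumption about how to represent a non-passing result.

```python
from typing import Optional

# (letter, lower bound in percent); every interval is closed on the left.
GRADE_FLOORS = [("A", 95.0), ("A-", 90.0), ("B+", 85.0), ("B", 80.0), ("B-", 60.0)]
MAX_PERCENT = 102.0  # upper end of the A interval as listed above


def letter_grade(percent: float) -> Optional[str]:
    """Map a final percentage to a passing letter grade, or None if below 60%."""
    if percent > MAX_PERCENT:
        raise ValueError("percentage exceeds the top of the listed scale")
    for letter, floor in GRADE_FLOORS:
        if percent >= floor:
            return letter
    return None  # below the passing threshold
```

Because each interval includes its lower bound and excludes its upper bound, a final percentage of exactly 90 maps to A- rather than B+.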
Pass/Fail (P/F), Withdrawal, and Incomplete grade requests will be handled in accordance with University and
Harris policy. Students who wish to take the course pass/fail rather than for a letter grade must use the Harris P/F
request form (https://harris.uchicago.edu/form/pass-fail) and must meet the Harris deadline, which is generally
9am on the Monday of the 5th week of courses. To earn a P grade, students taking the course P/F must: submit
at least seven of the eight assignments and earn a grade that is overall equivalent to at least a C- letter grade.
Materials
Textbooks
Required: An Introduction to Statistical Learning with Applications in Python, 1st Edition, by Gareth James,
Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. (ISBN-10: 3031387465)
• You can download a free PDF of the book from the authors’ website:
https://www.statlearning.com/.
• There is also an analog of the book with coding examples written in R.
to participate in class via all possible modes of communication, although you are welcome to. There are multiple ways to participate
because I want to give students as many opportunities to earn credit as possible, not because I want you to feel overwhelmed.
Data Analysis and Statistical Software
We will use Python software in this class, and you are required to code all assignments in Python.8 Python is
free and open-source software that can be installed on all operating systems. Please install Anaconda Python
version 3.x from www.anaconda.com/distribution. The text editor or integrated development environment (IDE)
you use for writing code for assignments is up to you.
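After installing, one quick way to confirm that your environment runs Python 3.x is a short check like the sketch below. It uses only the standard library; the helper name `is_python3` is hypothetical, and the check says nothing about which Anaconda packages are installed.

```python
import sys


def is_python3() -> bool:
    """Return True when the running interpreter is a Python 3.x release."""
    return sys.version_info.major == 3


if __name__ == "__main__":
    # Print the interpreter version, then whether it is a 3.x release.
    print("Python version:", sys.version.split()[0])
    print("Python 3.x detected:", is_python3())
```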
My (and the TAs’) office hours for this class are listed on the first page of the syllabus. The intent of office
hours is part of the “hidden curriculum” for some students, so I want to clarify my expectations about how
students should view and approach office hours. Those hours are for you, so please make use of them! You
do not need to make an appointment to see me during my office hours; just drop by (be it with questions about
course material, to discuss ideas, or just to chat). I will be available during those times. If a sufficient number
of students attend at the same time we may move to a bigger room, and I will leave a note on my door to let
everyone know where we are.
Please make your best effort to attend during the posted times, but if you have a conflict or want to talk with
me one-on-one, you are welcome to make an appointment for another time. I am happy to meet with students
outside of office hours.9
Harris offers 10 hours of free tutoring support for coding in Python, Stata, and R. Tutoring will be available
to Harris students starting Week 3 of the quarter. You can read more about the program on the Harris Student
Handbook Canvas site (https://canvas.uchicago.edu/courses/42004/pages/harris-tutoring-program). Any
questions should be directed to your academic advisor or [email protected].
Course Policies
General
• The class will be taught in-person with recordings available for students needing temporary accommodations
for short-term absences. Should changing COVID-19 conditions necessitate it, we will switch to
holding class remotely according to University policy and my discretion. Student input will be welcomed
in making this determination.
8 Note that there is an analog of this class that is taught by another instructor with R software (PPHA 30545).
9 I only ask that you do your best to attend the regularly scheduled office hours since I have many students and there are economies
of scale in the production of knowledge. Also, if you know in advance that you cannot make a scheduled appointment, please email me
to let me know.
• There is no attendance requirement, but regular attendance is necessary (but not sufficient) to do well in
the class.
– That said, if you are experiencing COVID-19 symptoms or illness more generally, please do
not attend class in person! I will record classes on Zoom in order to make this easier.
– If you need a more-permanent remote learning accommodation, please contact the Dean of Students,
Kate Biddle ([email protected]). Per Harris policy, all such requests can only be approved
centrally, not by individual instructors.
– More generally, if you get sick, are caring for a sick relative, or face anything else that becomes an
obstacle to your coursework, please inform me and your advisor as soon as you are able. We will
all work together to develop appropriate accommodations.
• The class webpage is available through the Canvas portal. I will use it to post announcements, assignments,
and grades. Please check it regularly.
• Email, Canvas postings, and the discussion board are the official means of communication for out-of-class
messaging. In other words, you are expected to check your UChicago email account and the Canvas site
regularly.
• Email is inefficient. If you have a question about the class or the material, others probably do too!
Questions and answers (knowledge) are public goods, so post your question to the discussion board, and
feel free to answer questions your classmates ask. The TAs and I will monitor and respond as well.
• If you have a question or concern about something you don’t want to discuss publicly, feel free to email
me. I will respond to email within 2 business days (Monday-Friday, 9:00am-5:00pm). I teach multiple
classes, so please include “ML:” as a prefix to your subject.
• Any and all results of in-class and out-of-class assignments and examinations are data sources for research
and may be used in published research. All such use will always be anonymous.
Recording
• I will record all lectures and post them only to Canvas in accordance with University and Family Educational
Rights and Privacy Act (FERPA) guidelines.
– The University has developed specific policies and procedures regarding the use of video/audio
recordings (https://teachingremotely.uchicago.edu/recording-policy/).
– FERPA is a federal statute that, broadly speaking, guarantees privacy over certain aspects of your educational
records. You can view the details of the policy on the registrar’s website
(https://registrar.uchicago.edu/records/ferpa/).
• If you record a class, discussion section, office hours, or meeting without permission, or if you share any
of the recorded videos without permission, you may be violating eavesdropping laws, copyright laws, or
the FERPA statute. So do not post or share any such videos outside of Canvas. This also applies to any
manipulated video.
Assignments
• The goal of the assignments in this course is not just to demonstrate that you can write code and answer
questions based on the output, but to help you develop an understanding of complex concepts and associated
critical thinking skills that can only come from grappling with the material (both alone and in
discussion with peers). Because the use of artificial intelligence (AI) tools such as large language models
(LLMs) inhibits the development of these skills, students are not allowed to use any such tools (e.g.,
ChatGPT) in this course. If you are unclear whether something is an AI tool, please check with me.
• No assignments will be accepted after the 24-hour late period for any reason, valid or otherwise.10 Not
turning in an assignment, handing it in more than 24 hours late, or failing to turn it in before the link
expires will result in a grade of zero.11 I understand that students sometimes have legitimate reasons for
being unable to complete assignments on time or give their full effort, so your lowest assignment grade
will be dropped. Dropping the lowest assignment grade is intended to cover ordinary illness and other
emergencies. Only long-term issues of sufficient magnitude that warrant involving the Academic and
Student Affairs team in the discussion can qualify for an exception to this policy.
Academic Integrity12
Having served on the Student Government Judicial Branch as an undergraduate and having been a graduate student at a
university where any non-trivial act of lying, cheating, or stealing results in expulsion, I take the Harris Academic
Honesty and Plagiarism Policies (https://harris.uchicago.edu/student-life/dean-of-students-office/policies) very
seriously. All students suspected of academic dishonesty will be reported to the Harris Dean of Students for
investigation and adjudication. The disciplinary process can result in sanctions up to and including suspension
or expulsion from the University. In addition, if, in my judgment, the preponderance of the evidence indicates
that a student has committed an honor violation on an assignment, that student will receive an immediate
grade of zero for that assignment and cannot earn a grade higher than a B- in the course, regardless of their
performance on other assignments. This applies regardless of the outcome of the disciplinary process. I trust every
student in this course to fully comply with all of the provisions of UChicago’s and Harris’ integrity policies. Here
are specific expectations:
• On assignments, it is expected that you will neither receive nor give aid, nor access any material other
than items explicitly outlined in the instructions.
• For other assignments, you may (and should!) work with other students, but it is expected that you will
collaborate on all parts of the assignment (as opposed to the “divide and conquer” method).
• During the entire quarter, it is expected that you will not access old problem sets, projects, answer keys,
or any other class material at any time. This includes websites that post solutions under the guise of
tutoring. (These sites both facilitate cheating and steal the intellectual property of the author.) This does
not include the textbook authors’ websites, Python documentation, or StackOverflow.
• During the entire quarter and thereafter, it is expected that you will neither post any class material on
the internet nor share any class materials with other students through any other means. Furthermore, if
you become aware that this has occurred, you are obligated to let me know immediately.
10 Reasons include, but are not limited to: illnesses, athletic competitions, work trips, job fairs, job interviews, travel reservations,
relative illnesses, relative funerals, out-of-town weddings, car accidents, car trouble, scooter trouble, tickets to see Billy Joel in concert,
and emergency visits to the veterinarian with your dog.
11 This is both because we post answer keys after the late period and because students from advantaged social backgrounds are more
likely to make requests for extensions, so being responsive to these requests can increase educational inequality.
12 I apologize for the heavy handed tone of this section. It is intended to protect the many honest students who take my class and
Americans With Disabilities Act
Students with disabilities needing an academic accommodation should contact UChicago’s Student Disability
Services (SDS). Please see their webpage for contact information (https://disabilities.uchicago.edu). If SDS
determines a disability accommodation is appropriate, you should inform the Harris Dean of Students office by
the end of the first week of class. The Harris Dean of Students office will work with the student and instructor
to coordinate the implementation of the student’s accommodations. Harris students are not required to submit
their accommodations letter to the instructor, but please feel free to come talk to me if you are comfortable
doing so. I’m happy to support your learning however I can.
Students differ in how much they know about mental health services. Your use of UChicago’s Student Health
and Counseling Services (SHCS) is free, confidential, and not linked to your academic file. If you find yourself
suffering in silence, please do not hesitate to make use of the services provided by SHCS. Please see SHCS’
mental health webpage for services and contact information (https://wellness.uchicago.edu/mental-health/).
And if you are having serious mental, physical, or other problems, immediately contact the urgent medical care
line at (773) 702-3625 (available 24 hours a day, 7 days a week).
UChicago is committed to diversity and rigorous inquiry that arises from multiple perspectives, and Harris
encourages thought-provoking discourse that involves not only speaking freely about all issues but also listening
carefully and respectfully to the views of others. I concur with this commitment and view the diversity that
students bring to my class as a valuable resource and a benefit to learning. I expect to maintain a productive
learning environment based on open communication, mutual respect, and non-discrimination. I strive to present
materials in a way that is respectful of diverse student backgrounds. As there can always be a gap between intent
and execution, suggestions for promoting a positive and open environment are welcomed. Please feel free to
correct me on your preferred name and gender pronouns if necessary.
All University of Chicago faculty and TAs are classified as “Responsible Employees.” As such, they are required
to report any discussions of sexual misconduct, dating violence, domestic violence or stalking to the Title IX
Coordinator for the University. This includes the identities of the student making the complaint and alleged
perpetrator. You will receive an email once a report is filed, but you are not obligated to meet with anyone or
engage in the process. Alternatively, there are “Confidential Resource” employees at the University who do not
have an obligation to share identifying information. For more information, including phone numbers, see the
UChicago UMatter website (https://umatter.uchicago.edu/find-support/).
Except for changes that substantially affect implementation of the evaluation (grading) statement, this syllabus
is a guide for the course and is subject to change with advance notice.
Tentative Course Outline
The weekly coverage might change as it depends on the progress of the class. The “ISL” in the “Reading”
column that follows indicates the chapter in the “An Introduction to Statistical Learning” textbook that corre-
sponds to the topic we’re covering in class that day. “PS” is an abbreviation for “Problem Set,” and “MP” is an
abbreviation for “Mini-Project.”
Please note that we lose two days of instruction on Mondays this quarter because the academic calendar starts
on a Tuesday (January 2nd) rather than a Monday and there is no class on Monday, January 15 due to the
Martin Luther King, Jr. Day holiday. Harris administration has indicated that all courses should offer a full
nine weeks’ worth of content, regardless of differences in the academic calendar across days of the week. Thus,
I want to flag a few “nonlinearities” in the schedule (shown in bold) that result from following
Harris policy:
• I will lecture during the regularly scheduled Monday/Wednesday class times on Friday the first week of
classes. There will be no lab sections that Friday.
• I will record an asynchronous lecture that will cover two “Classification” topics: the linear probability
and logit models. Previous students have told me that this material is review.