MEAP Edition
Manning Early Access Program
Software Engineering for Data Scientists
Version 2
Lastly, Part 4 will teach you how to effectively monitor your code in production. This is especially
relevant when you deploy a machine learning model to make predictions on a recurring or
automated basis. We’ll cover logging, automated reporting, and how to build dashboards with
Python.
In addition to the direct topics we cover in the book, you’ll also get hands-on experience with the code examples.
The code examples in the book are meant to be runnable on your own with downloadable datasets, and you’ll find
corresponding files available in the Github repository. Besides the examples laid out in the book, you’ll also find
Practice on your own sections at the end of most chapters so that you can delve further into the material in a
practical way.
The book covers an extensive set of topics, and I hope you find it helpful in your technical journey. If you have any
questions, comments, or suggestions, please share them in Manning’s liveBook Discussion forum.
— Andrew Treadway
Suppose you’re working on a project with several others (could be data scientists, software
engineers, etc.). How do you handle modifying the same code files? What about testing out new
features or modeling techniques? What’s the best way to track these experiments or to revert
changes? Oftentimes, data scientists will use tools like Jupyter Notebook (a software tool
for writing code and viewing its results in a single integrated environment). Because Jupyter
Notebook allows for easy viewing of code results (such as showing charts or other visuals, for
example), it’s a popular tool for data scientists. However, working on these notebook files can
often get messy quickly (you may hear the term spaghetti code). This is generally because part of
being a data scientist is experimentation and exploration - trying out various ideas, creating
visualizations, and searching for answers in data. Applying software engineering principles, such
as reducing redundancies, making code more readable, or using object-oriented programming (a
topic we’ll introduce later), can vastly improve your workflow even if your direct involvement
with software engineers is limited. These principles make your code easier to maintain and to
understand (especially if you’re looking at it some length of time after you’ve written it).
Additionally, being able to pass your code to someone else (like a software engineer, or even
another data scientist) can be very important when it comes to getting your code or model to be
used by others. Having messy code spread across Jupyter Notebook files, or perhaps scattered
across several programming languages or tools makes transitioning a code base to someone else
much more painful and frustrating. It can also cost additional time and resources to rewrite a
data scientist’s code into something that is more readable, maintainable, and ready to be put into
production. Having a greater ability to think like a software engineer can greatly help a data
scientist minimize these frustrations.
The diagram in Figure 1.1 shows an example of a codebase where a data scientist’s code may be
scattered across several notebooks, potentially in multiple languages. This lack of a cohesive
structure makes it much more difficult to integrate the code (for example, a model) into another
codebase. We’ll revisit this diagram later in the chapter within an updated, improved codebase
example.
Figure 1.1 For data scientists, the state of their codebase is often a scattered collection of files. These
might be across multiple languages, like Python or R. The code within each file might also have little
structure, forming what is commonly known as spaghetti code.
Next, let’s delve into what data scientists need to know about software engineering.
Better-structured code to minimize errors (both in terms of bugs in the code and in terms
of inputs to code functions, such as the features being fed into a model)
Collaboration among co-workers is a key part of any data science team. Software
engineering principles can and should be applied to make collaboration and working
together on the same code base seamless and effective
Scaling code to be able to process large datasets efficiently and effectively is also very
important in modern data science.
Putting models into production (as mentioned above)
Effectively testing your code to reduce future issues
We’ll cover each of these points in more detail in the next section. First, let’s briefly discuss the
intersection of data science and software engineering.
Josh Wills (a former director of data engineering at Slack) once said that to be a data scientist,
you need to be better at statistics than a software engineer, and better at software engineering
than a statistician. One thing is for certain - the set of skills that a data scientist is expected to
have has grown considerably over the years. This greater skillset is needed as both technology and business needs
have evolved. For example, there’s no recommending posts or videos to users if there’s no
internet. Platforms, like Facebook, TikTok, Spotify, etc. also benefit from advanced technology
and hardware available in modern times that allow them to process and train models on massive
datasets in relatively short periods of time. Models like predicting customer churn or potential
fraud are much more prevalent nowadays because more data is being collected, and more
companies are looking to data scientists to provide solutions to these problems.
Additionally, if you’re interviewing for a data scientist position nowadays, chances are you’ll run
into questions around programming, deploying models, and model monitoring. These are in
addition to being tested on more traditional statistics and machine learning questions. This makes
the interview process more challenging, but also creates more opportunities for those
knowledgeable in both data science and software engineering. Data scientists need to have a
solid knowledge of several areas in software engineering in their day-to-day work. For example,
one key area where software engineering comes into play is around implementation. The
example models we’ve mentioned so far, like recommendation systems, customer churn
prediction, or fraud models all need to be implemented in production in order to provide value.
Otherwise, those models just sit in a data scientist’s code files, never making
predictions on new data.
Before we delve more deeply into the engineering principles mentioned above, let’s walk through
a few more examples of common issues in a data scientist’s work where software engineering
can help!
We covered this point already earlier in the chapter, so we’ll just briefly rehash that improving
the structure of your code can greatly help for several key reasons, including sharing your code
with others, and integrating the code into other applications.
Extending on the earlier scenario, collaborating on the same code base with others is crucial
across almost any data science organization or team. Applying software engineering principles
through source control allows you to track code changes, revert to previous code file versions,
and (importantly) allows for multiple people to easily change the same code files without threat
of losing someone else’s changes. These benefits of source control can also be useful even if
you’re working on a project alone because it makes it much easier to keep track of the changes
or experiments that you may have tried.
Figure 1.2 Collaboration is an important part of coding in many companies. Data scientists, data
engineers, and software engineers are three common roles that often interact with each other, and
share code with each other. Working effectively with a shared codebase across multiple (or many) users
is a topic we’ll delve into in the next chapter.
Another common scenario involves scaling. Scaling involves improving your code’s ability to
handle larger amounts of data, from both a memory perspective and an efficiency point of
view. Scaling can come up in many different scenarios. For instance, even reading in a large file
might take precious time when your compute resources are constrained. Data processing and
cleaning, such as merging datasets together, transforming variables, etc. can also take up a lot of
time and potentially memory. Luckily, there exist many techniques for handling these problems
on a large scale, which we’ll cover in more detail later in this book. It is quite common for data
science code to be initially written inefficiently. Again, this is often because data scientists spend
a lot of time exploring data and experimentally trying out new features, models, etc. By applying
software engineering concepts and tools available, you can transform your code to run more
robustly, scaling to larger numbers of observations much more efficiently.
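As a small illustration of the memory side of scaling, here is a minimal sketch (assuming a hypothetical large CSV file named transactions.csv with an amount column) that processes the file in chunks with pandas rather than reading it into memory all at once:

import pandas as pd

total_rows = 0
total_amount = 0.0

# Read the file 100,000 rows at a time so only one chunk is held in memory
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    total_rows += len(chunk)
    total_amount += chunk["amount"].sum()

print(total_rows, total_amount)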
Once you’ve developed a model, you need to put it to use in order to provide value for your team
or company. This is where putting your model into production comes into play, which, as
mentioned above, essentially means scheduling your model to make predictions on a recurring
(or potentially real-time) basis. Putting code in production has long been a software engineering
task. It’s the heart of software engineering work at many different companies. For example,
software engineers are heavily involved in creating and putting apps on your phone into
production, which allows you to use the apps in the first place. The same principles behind doing
this can also be applied to data science in order to put models into production, making them
usable by others, operating anywhere from a small number of users to billions (like predicting
credit card fraud), depending on the use case.
So you’ve created a model and it’s soon going to be making new predictions. Maybe it’s a model
to predict which customers are going to churn in the next month. Maybe it’s predicting how
much new insurance claims will cost. Whatever it is, how do you know the model will continue
to perform adequately? Even before that, how can you ensure the code base extracting,
processing, and inputting data into the model doesn’t fail at some point? Or how can you
mitigate the results of the code base failing? Variations of these issues are constantly faced by
software engineers with respect to code they’re writing. For example, an engineer developing a
new app for your phone is going to be concerned with making sure the app is rigorously tested
so that it performs as it is supposed to. Similarly, by thinking like a software engineer, you
can develop tests for your data science code to make sure it runs effectively and is able to handle
potential errors.
In Figure 1.3, we show an example of inputting features into a customer churn prediction model.
Here, we might add tests to ensure the inputs to the model, like customer age or number of
transactions over the last 30 days, are valid values within pre-defined ranges.
Figure 1.3 This snapshot shows a sample of tests that could be performed when deploying a customer
churn model. For example, if the model has two inputs - customer age and number of transactions over
the last 30 days, we could perform checks to make sure those input values are within set ranges prior
to inputting them into the model.
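To make this concrete, here is a minimal sketch of the kind of checks described in Figure 1.3 (the feature names and valid ranges below are hypothetical, chosen purely for illustration):

def validate_inputs(customer_age, num_transactions_30d):
    """Check model inputs against pre-defined valid ranges before scoring."""
    if not (18 <= customer_age <= 120):
        raise ValueError(f"customer_age out of range: {customer_age}")
    if not (0 <= num_transactions_30d <= 10_000):
        raise ValueError(f"num_transactions_30d out of range: {num_transactions_30d}")

# Example usage: validate the features before passing them to the churn model
validate_inputs(customer_age=35, num_transactions_30d=12)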
Engineering principle: Advantages

Better structured code: Makes code easier to integrate, easier to maintain, and helps improve coding collaboration.

Improving code collaboration: In addition to better structured code, we can use additional tools, such as source control, to make it easy to work together on the same codebase across many users or teams.

Scaling your code: Making your code robust enough to handle large volumes of data and generalizable enough to deal with new variations of inputs, data, or various errors that may occur.

Deploying models into production: Making the application of your code usable or accessible by others. This can be anything from a customer churn model making predictions to an app on your phone tracking your fitness goals.

Effective testing: Any model or application that will be handling data or being used by others needs to be rigorously tested to ensure it can handle potential issues.
Next, let’s go through a sample data science workflow using a specific example. This will help to
tie the above scenarios to a common data science use case.
Gather data
Data can be collected in several ways:
Writing SQL to extract data from various tables / databases (MySQL, SQL Server, etc.).
Scraping data from documents (e.g. CSV, Excel files, or even PDF / Word documents in
some cases)
Extracting data from webpages or through web APIs
Exploratory data analysis (EDA) / data validation
EDA involves steps like checking the distributions of key variables, investigating missing
values, looking at correlation plots, and checking descriptive statistics (such as the
median values for numeric variables or most common value for categorical features).
Data validation might involve checking the results of EDA against domain knowledge or
across multiple data sources to ensure that the data is accurate and reasonable to use for
analysis and modeling.
Data cleaning
Data cleaning involves minimizing the number of issues in a dataset, including the
following:
Replacing missing values
Removing highly correlated variables
Treating highly skewed variables (potentially transforming certain features)
Dealing with an imbalanced dataset (think about predicting ad clicks, for example, where
the vast majority of users never click on an ad)
Feature engineering
Feature engineering is the process of developing new features from existing variables.
This is generally done in an effort to improve model performance. For example, certain
machine learning models will perform better when the inputs follow a normal
distribution, so there are existing techniques that make these transformations. In other
instances, feature engineering is absolutely necessary to get anything useful out of a
variable. This is typical of date variables, for instance. For example, in the credit fraud
use case, transaction date could be a raw variable, but cannot be input directly into a
machine learning model. Instead, we parse out new features, such as the day of the week,
hour of the day, month, etc.
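As a brief sketch of this kind of date-based feature engineering (assuming a pandas DataFrame with a hypothetical transaction_date column):

import pandas as pd

# Hypothetical transactions data with a raw date column
df = pd.DataFrame({"transaction_date": ["2023-01-15 08:30:00", "2023-01-16 21:05:00"]})
df["transaction_date"] = pd.to_datetime(df["transaction_date"])

# Parse out new features that can be input directly into a model
df["day_of_week"] = df["transaction_date"].dt.dayofweek
df["hour_of_day"] = df["transaction_date"].dt.hour
df["month"] = df["transaction_date"].dt.month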
Model training
Model training involves inputting data into machine learning algorithms like logistic
regression, random forests, etc. so that the algorithm can learn the patterns in the data in
order to be able to make predictions on new data. This process may involve testing out
several different models, fine-tuning the parameters of the models, and
perhaps selecting a subset of the more important features relevant to a model.
Model evaluation
Model evaluation is a key component where data scientists need to check how well the
model performs. This is usually done by evaluating the model on a fresh, or hold-out,
dataset that was not used for model development. There are a variety of metrics that may
be used to assess performance, such as accuracy (number of examples where the model
was correct / total number of predictions) or correlation score (sometimes used to
evaluate models outputting a continuous prediction).
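A minimal sketch of the training and evaluation steps above, using scikit-learn with a synthetic (hypothetical) dataset in place of real features and churn labels:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and binary labels standing in for real data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = rng.integers(0, 2, size=500)

# Hold out a portion of the data that is not used for model development
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train a candidate model on the training data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the hold-out set: accuracy = correct predictions / total predictions
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))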
Figure 1.4 A sample data science workflow, as described above. Data science workflows involve key
steps like gathering data, exploring and cleaning the data, feature engineering, and model
development. The model evaluation piece is also a highly important step that we will cover in detail
when we discuss model monitoring later in this book.
How do we allow easier coding collaboration between data scientists, data engineers,
software engineers, and whomever else may be working on the project?
While data scientists ultimately use data to perform analysis and modeling, data
engineers are more heavily involved in creating new tables, managing databases, and
developing workflows that bring data from some source (like raw logging on a website)
to a more easily ingestible place where data scientists can query a table - or small number
of tables - in order to get the data needed for modeling or analysis. Software engineers, as
mentioned above, help to create reliable applications that might be used either by internal
employees or external consumers. These three roles often work closely together, and their
exact responsibilities may overlap depending on the company.
How do we use the developed model to predict churn for customers on an ongoing basis?
We can think of this question as an extension of the last point in our data science
workflow concerning model evaluation. Once we’re satisfied with a model’s
performance, how do we go from a trained model to one that is making predictions on a
regular cadence? This heavily involves one of the main topics of this book, which is
putting models into production.
How can we have fresh data ready for the model to use in production?
This question is partially related to the first piece of the data science workflow -
gathering data. Essentially, ensuring fresh data is available in production involves
automating the gathering data process and hardening it to reduce the possibility of errors
or data issues.
How do we handle invalid inputs into the model or other errors?
Handling invalid inputs into the model is often needed once the model is developed and
ready to be deployed into production. In a production environment with new data coming
in, we may need to create checks like making sure the data type of each feature input is
correct (for example, no character inputs when a numeric value is expected) or that
numeric values are in an expected range (such as avoiding negative values when a
positive number is expected). Other types of errors may also occur in the workflow above
when we are automating those steps for fresh incoming data. For instance, whatever code
is being used to extract the new data may fail for some reason (such as the server hosting a
database going down), so you could develop logic to retry running the code after a few
minutes.
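Here is a minimal sketch of that kind of retry logic (the fetch_new_data function passed in, the number of attempts, and the wait time are all hypothetical placeholders):

import time

def fetch_with_retries(fetch_new_data, max_attempts=3, wait_seconds=300):
    """Call a data-fetching function, retrying after a delay if it fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_new_data()
        except Exception as error:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            print(f"Attempt {attempt} failed ({error}); retrying in {wait_seconds} seconds")
            time.sleep(wait_seconds)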
How can we effectively monitor the model’s performance once it’s making predictions
on a recurring basis?
Model monitoring can be thought of as an extension of the model evaluation step in our
data science workflow above. Effective monitoring of a model is necessary to ensure
confidence in a model’s performance over time. The exact way we monitor a model will
depend upon what the model is actually doing. In our example above, we would know
after 30 days whether a customer actually churned. We could use this to then chart the
accuracy of the model over time, along with other metrics like precision and recall
(which may actually be more important depending on the balance of the dataset). These
metrics can be tracked using a dashboard. Additionally, we might monitor information
like the distributions of the features in the model. This can help to debug potential issues,
such as dips in model performance. We will discuss Python tools for introductory web
development and dashboarding later in this book.
Do we need to re-train the model on an ongoing schedule?
Re-training a model on a recurring cadence is tied to the performance of the model over
time. The short response to this question is…it depends. If the model performance is
dropping a week after you’ve trained it, then you might need to re-train the model
frequently (or consider different features). If the model performance is stable over time,
you might not need to re-train the model very often, though it’s a must to monitor the
model to ensure it is meeting the standards expected.
How can we scale the model to millions of users?
Scaling can be a fairly in-depth topic, but we can broadly think of it in terms of memory
and efficiency. Many machine learning applications can be extremely intensive in terms
of both CPU (or GPU) cycles, as well as memory. Enabling ML and data workflows to
handle larger-scale data is an important task, and one that is likely to grow in importance
as datasets get larger and more diverse. Scaling code is heavily a software engineering
topic. When applied to data science, it can involve areas like parallelizing code or using
advanced data structures.
We will dive into more detailed approaches to each of these concerns in later chapters, but for
now let’s introduce a few summary points. Broadly speaking, software engineering provides
solutions to these issues. Software engineering helps to fortify our modeling code to reduce
errors and increase reliability. It can integrate a model trained in a data scientist’s development
environment into a robust application making predictions on millions of observations.
There are a few software engineering concepts we can apply to make our data science workflow
more robust and to handle these issues.
Source control
Source control (sometimes called version control) refers to a set of practices to manage
and monitor changes to a collection of code files.
Exception handling
Exception handling involves developing logic to handle errors or cases where a piece of
code may fail. For instance, this would come up in the example mentioned earlier where
a section of code retrieving data from a database might fail and exception handling logic
could be implemented to retry retrieving the data after several minutes. A few examples
of exception handling can be seen in Figure 1.5.
Figure 1.5 Exception handling can take many forms. A few examples are shown in this figure. For
example, we may need to query data on a regular basis for model training (for instance, updating the
customer churn model). What happens if the query returns zero rows one day? There could be multiple
solutions, but one could be sending an alert/email to an oncall data scientist (or engineer) about the
problem. Or what if you’re scraping data from a collection of webpages and a request fails due to a
non-existing webpage? We might want to skip over the webpage without ending the program in error.
We’ll delve into exception handling more fully in Chapter Three.
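As a small illustration of the web-scraping case from Figure 1.5, the sketch below (the list of URLs is hypothetical) skips over a failing webpage rather than ending the program in error:

import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # hypothetical pages
pages = []

for url in urls:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
        pages.append(response.text)
    except requests.RequestException as error:
        # Skip the problematic page rather than stopping the whole run
        print(f"Skipping {url}: {error}")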
In the workplace, data pipelines are often created by data engineers. However, depending on a
company’s structure, data scientists may also write data pipelines. In general, if you’re a data
scientist, it’s recommended to know about data pipelines and understand the source of the data
you’re using for your modeling or analysis projects. This is important for several reasons, but
especially because it helps you to build confidence in knowing where the data you’re using is
coming from. Software engineers can also be involved in data pipelines. For example, at tech
companies, it is common for software engineers to write the code that logs data from a particular
web application (a simple example would be logging whether someone clicks on an ad).
Data pipelines are necessary because they ensure that data is served reliably for a variety of data
science applications. Even if you’re working on an insight analysis, rather than a model, for
example, you need to be able to retrieve reliable data. Data pipelines are used in these cases to
bring data together from different tables and sources. They can also be used for logging data,
such as storing predictions from a model.
Sometimes you may see data pipelines referred to in the context of reporting and analytics, as well,
in addition to being used for machine learning models. The main summary point to keep in mind
is that data pipelines are ultimately used to flow data from a source (or collection of sources) into
finalized outputs, usually in the form of structured tables with collections of rows and columns.
The exact application of ultimately using the data can vary. Let’s give a real-world example of a
data pipeline.
The data points above may be derived from several different tables. Let’s list those out below:
In order to train a model in the first place, we need to have a method of combining and
aggregating the data inputs we need from the above tables. This process might involve multiple
sources of data and in some cases, may involve using various languages or frameworks. A
common language used in extracting data from tabular sources is SQL. SQL, or Structured
Query Language, is basically a programming language designed to extract data from databases.
The final dataset may include additional pre-processing and feature engineering prior to the
actual model development. Depending on the setup, some of these components may be handled
in the data pipeline or in the modeling code component, which we’ll cover next. For instance,
suppose that a person’s age is being used as a feature in predicting customer churn. Rather than
inputting age directly into a model, we might want to apply a transformation to age, like
bucketing it into different groups (for instance, under 18, 18 - 24, 25 - 30, 31 - 40, etc.). This
bucketing could be handled directly in a data pipeline by creating a table that has a column (
bucketed_age) with those categories, or it could be handled after the data (including the age
variable) has been passed to the model, with some pre-processing code to create that bucketed
feature prior to inputting it into a model.
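As a quick sketch of the pre-processing version of this bucketing (the age values and bucket edges below are purely illustrative), using pandas:

import pandas as pd

df = pd.DataFrame({"age": [17, 22, 28, 35, 47]})  # hypothetical customer ages

# Bucket raw age into groups; the right edge of each bin is inclusive here
df["bucketed_age"] = pd.cut(
    df["age"],
    bins=[0, 17, 24, 30, 40, 120],
    labels=["under 18", "18 - 24", "25 - 30", "31 - 40", "over 40"],
)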
Figure 1.6 An example flow of data from several source tables to a final output dataset used for
predicting customer churn.
Now that we’ve covered the flow of data from several potential sources to a finalized dataset,
what happens to the data? This is where machine learning pipelines come into the picture, as we
will discuss next.
Figure 1.7 A sample workflow of a machine learning pipeline extending from a data pipeline. This
diagram shows an outline of a typical machine learning pipeline. Keep in mind that the starting point is
data. Data is always needed for any machine learning pipeline. The final part is monitoring the model to
validate its performance on an ongoing basis. These and the in-between components are covered in
detail below.
Now, let’s dive into the details of the ML pipeline components displayed above.
1.4.2 Pre-processing
Pre-processing, broadly speaking, involves performing any additional cleaning or feature
engineering prior to model training. For example, the cleaning component might involve
replacing missing values or capping outliers. As stated earlier, feature engineering could be
implemented in the data pipeline by creating the new features directly into table columns, which
are ingested by a model. Alternatively, feature engineering could be done as part of the
pre-processing step after the data has already been extracted. When the pre-processing step is
complete, the next component involves training a model on the data.
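A minimal sketch of the cleaning side of pre-processing described above (the column name, fill strategy, and capping threshold are hypothetical choices):

import pandas as pd

df = pd.DataFrame({"total_day_minutes": [120.5, None, 980.0, 210.3]})  # hypothetical data

# Replace missing values with the median of the column
df["total_day_minutes"] = df["total_day_minutes"].fillna(df["total_day_minutes"].median())

# Cap extreme outliers at the 99th percentile
cap = df["total_day_minutes"].quantile(0.99)
df["total_day_minutes"] = df["total_day_minutes"].clip(upper=cap)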
Train a model once, and use the static model for fetching predictions on an ongoing basis.
This type of model structure is useful in cases where the data does not change very
rapidly. For instance, building a model to predict insurance claim cost from the time a
claim is filed might be an example where the underlying data doesn’t change from week
to week (the relationship of the variables in the model vs. actual claim cost could be
relatively static for shorter durations), but rather it might only need to be updated every
6-12 months.
Re-train the model on a periodic basis, such as monthly, weekly, or even more frequently
if needed. This setup is useful when the performance of the model drops by a
high-enough margin over shorter time periods to warrant updating it on an ongoing basis.
This is related to the topic of model monitoring, which we’ll cover in a later chapter. An
example of this might be a recommendation system, for instance, where new and more
varied types of content are frequently being made available, and the relationships
between the inputs and the target that is being predicted changes fairly rapidly, requiring
more frequent re-training.
After a model is trained, we need to evaluate the performance of the model. The main reason for
this is that we need to be confident that the model is performing adequately relative to our standards,
which we’ll discuss next.
Once we are confident in a model’s performance, it is time for putting the model into production
in order to make predictions on an ongoing basis.
Real-time models
Real-time models return predictions in real-time. Predicting whether a credit card
transaction is fraudulent is a high-impact example where a real-time model can be
critical. These predictions are often fetched via an API call, which we’ll discuss in more
detail later in this book.
Offline models
Offline models make predictions on a scheduled basis, such as daily, weekly, hourly, etc.
Our example of predicting customer churn could potentially be an offline model, running
daily.
After the model is deployed and able to make predictions on new data, it is important to monitor
the model’s performance on an ongoing basis. This allows us to be alerted if there are any issues
that arise in terms of the model’s performance or changes in the data being used for the model.
Before we go into model monitoring, let’s summarize a few examples based on real-time vs.
offline prediction and recurring training vs. single (or infrequent) model training.
Before we close out the chapter, let’s revisit Figure 1.1. Since we now know about data pipelines
and ML pipelines, we can think about restructuring the scattered notebooks/files shown in Figure
1.1 into a more sequential collection of files. Ideally, these would be standardized across languages
as much as possible (for example, keeping code files only in Python rather than across Python,
R, Julia, etc.). For example, this might look like the following table:
This table shows a few sample code files corresponding to different steps in the ML pipeline.
There could be additional code files as well. For example - feature engineering might be divided
into several components in different files. We’ll go further into the possible structure of files in
the data science workflow when we get to Chapter Three. Now, let’s close out this chapter with a
summary of what we’ve covered.
1.5 Summary
In this chapter, we discussed data pipelines, ML pipelines, and provided an overview of putting
models into production.
Data scientists need software engineering concepts in order to make code easier to
maintain, allow for more seamless collaboration, implement models into production,
scale their code, and rigorously test it.
Data pipelines are used for merging, aggregating, and extracting data from some
underlying source into a final output (typically a table or collection of tables).
Machine learning (ML) pipelines are used to deploy machine learning models to make
predictions on an automated basis. This might be making predictions in real-time or
recurring on a schedule, such as daily or weekly predictions.
Key software engineering concepts that are used in fortifying data pipelines and ML
pipelines include source control, object-oriented programming, scale, and exception
handling.
Putting a model in production requires fortification of code like handling errors /
exceptions, scaling as necessary, and structuring code to be more readable and allow for
easier collaboration.
In the last chapter, we introduced several key software engineering concepts that will improve
your life as a data scientist. These included:
Source control
Exception handling
Putting a model into production
Object-oriented programming (OOP)
Automated testing
Scale
In this chapter, we’re going to delve deeper into the first of these - namely, source control.
Source control (also called version control) is basically a way of tracking changes to a codebase.
As the number and size of codebases have grown immensely over the years, the need for
monitoring code changes and making it easier for various developers to collaborate has
become absolutely crucial. Because software engineering has existed longer than modern data science, source
control has been a software engineering concept longer than a data science one. However, as
we’ll demonstrate in this chapter, source control is an important tool to learn for any data
scientist. Before we delve into using source control for data science, however, let’s discuss how
source control fits into the picture of applying software engineering to data science. Recall in the
last chapter, we discussed several key concepts of software engineering, such as better structured
code, object-oriented programming, exception handling, etc.
Source control can (and should) be used from the very beginning and throughout a project. That
project could be a purely software engineering application, a data science project, or some
combination. For example, the project could be developing a new machine learning model, like
our example in the last chapter around predicting customer churn. It could also be a purely
engineering project, like code to create a new app for your phone. There is a consistent theme
between source control, better structured code, and object-oriented programming (explained in a
later chapter) in that each of these software engineering concepts make collaboration between
developers (or data scientists) much easier. This chapter will focus specifically on common
software for using source control.
Next, let’s explore how source control will help you in your projects. Going back to the customer
churn example from the first chapter, suppose you and a colleague are working on a data science
project together to predict whether a customer will churn. To make this concrete, we will use the
customer churn dataset available from Kaggle here:
www.kaggle.com/competitions/customer-churn-prediction-2020/data. This dataset involves
predicting whether a customer from a Telecom company will churn. The workflow of this project
can be structured similarly to the data science life cycle that we discussed in the first chapter. As
a refresher, we’ve repeated the chapter one diagram showing the combined data pipeline/ML
pipeline view in Figure 2.1.
Figure 2.1 Adding to what we covered in the previous chapter, we can use source control at almost any
step of a data pipeline or ML pipeline. This is especially useful when multiple coworkers are
collaborating together on the same codebase, but can also be helpful for tracking changes in a
codebase even if you are the only one working on it.
In the following steps, we tie the pipeline components to our specific Kaggle dataset:
Gathering data
Fetch the data needed to build the customer churn model. In the Kaggle dataset we’ll be
using, this includes information like length of account (account age), number of day calls,
area code, etc.
Exploratory data analysis (EDA) / data validation
Analyzing the data for patterns, distributions, correlations, missing values, etc. For
example, what’s the proportion of churn to non-churn? What’s the association with total
day minutes used (total_day_minutes) and churn? Etc.
Data cleaning
Handle issues, such as missing values, outliers, dirty data, etc.
Feature engineering
Create new features for the models you both will build. For example, you could create a
new feature based on the average total day charge for the state the user resides in.
Model training
Develop models, such as logistic regression or random forest to predict churn.
Model evaluation.
Evaluate the performance of the models. Since churn vs. non-churn is a classification
problem, we might use metrics like precision or recall here.
Model deployment
Make the finalized model accessible to others. For this case, that might mean prioritizing
individuals predicted to churn for marketing or lower price incentives.
Each of these steps might involve collaboration between you and your colleague. Consider a few
example scenarios:
When gathering data, you might write code to extract data from the last year. Your
colleague, wishing to get more data, modifies the codebase to extract data from the last
three years.
Your colleague adds code that creates several visualizations of the data, like the
distribution of daytime calls. You find a bug in your colleague’s code and want to correct
it, so you modify the shared codebase.
The two of you are working on cleaning the data. One colleague adds code to replace
missing values and handle outliers. The other one realizes that one of the fields has a mix
of numeric and string values, and writes code to clean this issue. Both sets of changes go
into the same file.
One of you works on developing a logistic regression model, while the other wants to
try out a random forest. But you both want to share code with each other and potentially
make modifications (like using different parameters or features).
Source control makes these types of collaborations much easier and trackable. Next, let’s dive
into how source control would help you and your colleague in this situation, or in general for any
data science project you may work on.
Figure 2.2 Source control enables easier coding collaboration between different developers of the same
codebase, often stored on a shared remote repository
An alternative to using source control would be to simply use a shared network directory where
different developers could add or modify files to a centralized location. However, this has several
key problems:
How do we track changes made by different users? In other words, how can I easily tell
who made what change? With the methodology above, this is very difficult.
Difficult to revert changes. This can be especially important in cases where a new change
causes an issue, like an app to crash, for instance.
Easy to overwrite others' changes.
Merging changes from different users working on the same code file is a challenge
Taking our customer churn dataset example, suppose you write code that creates a
collection of new features for the churn model. Your colleague wants to modify the same
code file(s) that you created. This process is called merging. Let’s look at an example
below.
Merging example:

p50_account_length_by_state = train.groupby("state").\
    median()["account_length"]

p50_account_length_by_state = train.groupby("state").\
    median()["account_length"]
p50_day_minutes_by_state = train.groupby("state").\
    median()["total_day_minutes"]
In the first code snippet, DS #1 creates a code file that calculates median account length values
by state. Another data scientist (DS #2) changes the code to calculate median total daytime
minutes by state (adding an extra line of code). DS #2 can now merge the changes made into a
shared repository, from which DS #1 can download DS #2’s contributions. Now, let’s move into
how source control helps with the above-mentioned problems.
Tracking who changed what. Source control makes it easy to track who created or
modified any file in a repository.
Undoing changes to the repository is straightforward. Depending on the specific software
you’re using for source control, this may even be as simple as executing one line of code
to revert the changes.
Provides a system for merging changes together from multiple users.
Provides a backup for the codebase. This can be useful even if you’re the only person
contributing code. It’s all too easy to accidentally overwrite a file or potentially lose work
if a system crashes. Source control helps to mitigate these issues, in addition to delivering
the benefits listed above.
Code consistency. Too often, different members of a team may follow various styles or
have different sets of functions, which may perform overlapping or similar actions.
Source control allows for easier code-sharing among team members, enabling greater
consistency in the codebase used across team members.
The benefits of using source control can be applied to any codebase, regardless of the
application, environment, company, etc. Though source control has long been used by software
engineers, it is also an important tool for data scientists to learn, as we’ll explain below.
Easily track any experimental changes done by either an individual or a group of data
scientists working on the same project
Enables data scientists, software engineers, data engineers, etc. to modify the same code
base in parallel, without fear of overwriting anyone’s changes
Allows for easier packaging of the code base and hand-off to software engineers.
This can be very important for putting models, data pipelines, etc. into production
Makes it easier to merge changes to the same underlying files
To make the concept of source control less abstract, let’s use a concrete example of a version
control software called Git.
Git can be used via the command line (terminal) or through a UI, like Github’s web UI
(SourceTree and GitKraken are other UIs for Git). We can think of a typical Git workflow as
shown in the below diagram.
Figure 2.3 This diagram is an extension of the one shown earlier. Developers push and pull code to and
from, respectively, a remote repository. Each user must first commit his or her code to a local repository
before pushing the code to the remote repo.
Next, let’s get hands-on experience using Git so that this workflow will become more clear.
If you’re using a Mac, Git comes pre-installed. On Windows, you’ll need to install Git. A
common way to get Git set up on Windows is to go to Git’s website (git-scm.com/) and download
the installer for Windows.
Once you have Git installed, you can get started with it by opening a terminal. Any Git command
you use will consist of git followed by some keyword or other parameters. Our first
example of this is using Git to download a remote repo for the first time.
Downloading a repository
To download an existing remote repository, we can use the git clone command, like this:
git clone [repository URL]
In this command, we just need to write git clone followed by the URL of the repository we want
to download. For example, suppose you want to download the popular Python requests library
repository from Github. On Github, you can find the link you need to use by going to the repo’s main page.
Clicking on Code should bring up a view like below, where you can see and copy the HTTPS
URL (github.com/psf/requests.git in this case).
Next, you can download the contents of the repo like this:
git clone https://github.com/psf/requests.git
Using git clone here to download (or clone) a repository from Github
A benefit of using git clone is that it automatically sets up the downloaded repo to be tracked
by git for you. For example, running the above command will download a folder (in this case,
called requests) from the Github repository. But now, any changes you make in this folder will
be automatically tracked. For example, let’s suppose we create a new file within the downloaded
directory called test_file.txt. Then, running git status in the terminal shows the following
message:
Figure 2.6 Running git status in the terminal shows any untracked files and whether the local repo is up-to-date with the remote repository
From this message we can see that there is one file (the one we just created) that is currently
untracked. Additionally, we can see our local repository is up-to-date with the remote repository
on Github. This means no one else has made any changes to the remote repository since we’ve
downloaded the local repository.
git clone is a useful command when there’s an already-existing remote repository that you’re
planning to modify. However, what if a remote repository doesn’t exist? That’s where git init
comes in handy, which we’ll discuss next.
Now that we’ve walked through how to clone a sample repo, let’s download the
repo for this book!
Using git clone here to download the repository for this book
Now, you should have all the files from the book’s repository in whatever directory you selected
on your computer.
What if you want to version control a local collection of files? For instance, let’s go through how
we created the book repository in the first place. To start with, we will use the git init command.
If you enter git init in the terminal, a new local repository will be created in the current working directory.
To create a repository in another folder, you just need to specify the name of that directory,
like in the example below, where we create a repository in /some/other/folder.
Listing 2.7 Use git init [directory name] to create a new repository in another folder
git init /some/other/folder
When you use the git init command, git will create several hidden files in the input directory. If
you’re using a Mac or Linux, you can see this by running ls -a in the terminal. Similarly, you can
use the same command in Bash on Windows, as well.
Figure 2.7 Running ls -a in the terminal will show the newly created hidden git files
git init can be run in a directory either before or after you’ve created files that you want to
back up via version control. Let’s suppose you’ve already set up a collection of sub-directories
and files corresponding to what you can see in this book’s remote repository.
ch2/…
ch3/…
ch4/…
Etc.
Next, we need to tell git to track the files we’ve created. We can do that easily enough by
running git add . The period at the end of this line tells git to track all of the files within the
directory.
Listing 2.8 Use git add . to tell Git to track all files that were changed
git add .
Now, we can commit our changes. This means that we want to save a snapshot of our changes.
We do this by writing git commit followed by a message that is associated with the commit. To
specify the message, you need to type the parameter -m followed by the message in quotes.
Listing 2.9 git commit saves the changes you’ve made as a snapshot of the repo. The
parameter -m is used to include a message / description for the commit.
git commit -m "upload initial set of files"
Running this git commit command will save the changes we’ve made to the repo
(in this case, uploading the initial set of files)
Now that we’ve committed our changes, we need to create a remote repository in order to upload
our committed changes. This remote repository will be where other users can download or view
your changes.
Figure 2.8 To create a new repo on Github, look for the plus sign (top right corner of the webpage) to
click and create a new repo.
Give a name to your new repository. It’s recommended that this name matches the name
of the folder you’re storing the local copies of the files you’re dealing with. For example,
if sample_data_science_project is the name of your local folder, then you could also
name your repository sample_data_science_project. For the book’s codebase, our
repository name is software_engineering_for_data_scientists.
Next, you’re ready to push the local repository to the remote one you just created.
To actually push your local repository to the one on Github, you can run a command similar to
this:
Listing 2.10 Use git remote add origin [remote repo URL] to enable pushing your changes
to a remote repository.
git remote add origin https://github.com/USERNAME/sample_data_science_project.git
Running this command will allow us to push our changes to the remote repo at this
URL: github.com/USERNAME/sample_data_science_project.git
Next, run the line below to tell git that you want to use the main branch. Usually, the main
branch is called master or main. This branch should be considered the source of truth. In other
words, there might be other branches that deal with experimental code (for instance, feature_x
branch), but the main branch should contain code that is closer to being production-ready, or at
least ready to be passed off to software engineers for production. In some cases, if the codebase is small, or
there are only a few contributors, you might decide to just use a single main branch. However, as
codebases get larger, and more people get involved in contributing code to the repository, then
creating separate branches can help keep the main branch clean from messy spaghetti code that
frequently changes.
Listing 2.11 git branch -M main tells Git that you want to use the main branch when you
push or pull changes.
git branch -M main
Lastly, you can run git push -u origin main to push your local repo changes to the remote
repository. After running this line, you should be able to see the changes in the remote
repository.
Listing 2.12 git push -u origin main will push your local changes to the remote repo.
git push -u origin main
Push the changes in the local repo to the main remote repository
As you and your potential colleagues make changes to the same repo, it is convenient to be able
to easily tell who made which changes. Fortunately, there’s a straightforward way to do this,
which we’ll cover next.
Let’s suppose now that we want to add a new file to our working directory called
1_process_data.py. This file could be a Python script that reads in data from a database and
performs basic processing / cleaning of the data. As a naming convention, we might add "1_" to
the front of the script name in order to convey that this is the first script in a collection of
potential files that needs to be run. For now, though, let’s suppose we’ve just created this
1_process_data.py file. In order to back up our new file via git’s source control system, we’ll
need to commit the file (more on this in just a moment). First, however, let’s run git status. This
command, as described above, is a simple way of checking what files have been modified or
created, but are not currently being tracked in git’s version control system.
Listing 2.13 Use git status to check the status of the current directory, which will show any
files that have been created or modified, but not yet committed.
git status
Running git status will tell you what files have been created or modified, but not
yet committed
Figure 2.9 Running git status shows what files, if any, are not currently being tracked via git
Next, let’s tell Git to track our new file, 1_process_data.py. Again, we can do that easily enough
by running git add, either with the specific file name or with a . to add all changed/new files for tracking.
Listing 2.14 Use git add [file_name] to tell git to track a specific file. In this case, we will
track our new file, 1_process_data.py.
git add 1_process_data.py
You can also add multiple files by running the git add command separately for each file. For example, after
running the command above, you could run git add 2_generate_features.py and Git will prepare
2_generate_features.py to be committed. This sample file is available in the Git repo, along with
1_process_data.py, so that you can practice on your own.
Now, we can commit our new file. The -m parameter adds a message in the commit providing a
short description.
Listing 2.15 git commit saves the changes you’ve made as a snapshot of the repo. The
parameter -m is used to include a message / description for the commit.
git commit -m "create 1_process_data.py for processing data"
Running this git commit command will save the changes we’ve made to the repo
(in this case, creating the 1_process_data.py file)
Listing 2.16 git push -u origin main will push your local changes to the remote repo.
git push -u origin main
Push the changes in the local repo to the main remote repository
Next, let’s take a look at how we can see who made commits.
Listing 2.17 git log will print out what commits have been made in the repo.
git log
Figure 2.10 Running git log will show the commit history for the repo
The long combination of digits and characters after commit is the commit hash ID. It’s a unique
identifier for a specific commit. You can also see the contents of the commit (the actual code
change) by using git show, followed by the commit hash ID.
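For example, using a shortened, hypothetical commit hash:

git show 1a2b3c4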
Figure 2.11 Running git show will show the contents of a specific commit, as can be seen in this
example snapshot.
In this case, git show doesn’t display any code because the sample file we created was empty.
But, let’s suppose we have created and committed another file for feature creation. If we run git
show for this other commit, we might get something like the snapshot below showing lines of
code that have been added to a file.
Figure 2.12 This time git show displays the code changes to a specific file
In addition to committing your own changes, it is important to be able to get the latest updates
from the remote repository. Let’s dive into how to get the latest changes from the remote
repository next!
Listing 2.19 Use the git pull command to get the latest changes from a repo
git pull origin
In the above command, origin refers to the remote repository (you can think of it as the original
repo). Alternatively, you can just run git pull, which will pull the changes from the same remote
repository by default.
Listing 2.20 We can also omit the "origin" snippet in our previous git pull command.
git pull
Conflict example
Suppose the 1_process_data.py file mentioned previously contains a simple Python function, like
this:
Listing 2.21 Sample Python function to demonstrate what happens when two users update
the same line of code
def read_data(file_name):
df = pd.read_csv(file_name)
return df
Let’s say that you change the name of the function to read_file. However, your colleague
changed the name to read_info in a local repo and then pushed those changes to the remote
repository that both of you are working with. In this case, running git pull will result in an error
message that looks like this:
Listing 2.22 Run git diff to show the conflict differences between your repo vs. the remote
one.
Automatic merge failed; fix conflicts and then commit the result.
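When this happens, Git marks the conflicting lines in the affected file with conflict markers. A rough sketch of what you might see in 1_process_data.py is shown below (the branch labels vary depending on your setup); you resolve the conflict by editing the file to keep the version you want, removing the markers, and committing the result:

<<<<<<< HEAD
def read_file(file_name):
=======
def read_info(file_name):
>>>>>>> origin/main
    df = pd.read_csv(file_name)
    return df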