Software Engineering For Data Scientists Chap5


O'REILLY

Chapter 5. Documentation

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form (the author's raw and unedited content as they write), so you can take advantage of these technologies long before the official release of these titles.

This will be the 10th chapter of the final book. Please note that the GitHub repo will be made active later on.

If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the author at catherine.nelson1@gmail.com.

Documentation is an often overlooked aspect of data science. It is commonly left until the end of a project, but by then you're excited to move on to the next project, and the documentation is rushed or omitted completely. However, as discussed in "Readability", documentation is a crucial part of making your code reproducible. If you want other people to use your code, or if you want to come back to your code in the future, it needs good documentation. It's impossible to remember all your thoughts from when you originally wrote the code or initially carried out the experiments, so they need to be recorded.

Good documentation communicates ideas well. Your reader needs to understand what you want them to understand, so first, it's important to consider who you're writing the documentation for. Are you recording your experiments for another data scientist who might take over your project in the future? Are you documenting a piece of code that you think might be useful for other people on your team? Or are you recording your own thoughts so that you can come back to them in 6 months' time? Pick your level of detail and the language you use so that it is appropriate for your expected reader.

Other aspects of good documentation include being up to date: documentation is not useful if it is not maintained.
Documentation should be updated at the same time as code changes are made. Good documentation should also be well structured. The most important information should be easiest to find, so put it at the start or make it obvious where to find it.

Good documentation can save you a huge amount of time and reduce the complexity of a data science project. If you know what work has already been done, you reduce the chance of repeating the same work. You can also get up to speed much faster on a new project, or easily remember what you were working on a year ago.

All of the above statements are relevant to the documentation for any project, but there are some considerations that are more specific to data science projects. In a data science project, it's common to try out several potential solutions to a problem before settling on the one that works best. Because of this, it's good practice to document the thought process that goes into your experimentation and decision making. This is useful if you are asked questions about it later, or if you need to go back and revisit the project in the future. Try to answer questions like these in your documentation:

* Why did you select the data you used in this project?
* What are the assumptions that you are making about your data?
* Why did you choose this analysis method rather than another?
* Are there circumstances where this analysis method does not work?
* What (if any) shortcuts did you take that could be improved later?
* What are some other avenues for future experimentation that you would suggest to anyone who works on this project in the future?
* What are the lessons you learned from this project?

In this chapter, I'll discuss different types of documentation, and best practices for writing it.

Documentation within the codebase

As discussed in Chapter 1, good code is readable! A readable codebase should contain text as well as code, in the form of comments, docstrings, and longer documents.
The code itself should be readable, and good names are the key to this.

Names

Every time you write code, you need to choose a lot of names. Variables, functions, notebooks, projects: all of them need a name. Good names are an important part of making your code readable. If someone else wants to use your code, they will read through it before making any changes. The names you use will communicate what you want your code to do. For example, a function named download_and_clean_monthly_data communicates much more than a function named process_data. Your Jupyter Notebook should never be named untitled1.ipynb, because this communicates nothing about what is in the notebook. If you need something from that notebook in the future, you won't be able to find it without a good name.

Good names are expressive, an appropriate length, and easy to read. Good names also use language that is relevant to the project you're working on, and to your company or organization. For example, if your company has a particular shorthand for a customer name, use that in your code. Units are a great thing to include in variable names: distance_km is much more informative than distance. Even if you don't pick a good name at first, don't be afraid to update it later: your IDE will make it easy to do this by updating all the instances of a name at the same time.

So what is an appropriate length for a name? Variable and function names shouldn't be too short, because a name that is too short increases the mental load for the person reading your code. They will need to translate from the name to a meaning. For example, image_id is much more informative than im_id, and clean_df is better than cl_df. Full words are much easier to read, and they are also easier to search for in your IDE if you need to look up their usage or alter them later.
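As a minimal sketch of these naming guidelines, the snippet below contrasts a bare single-letter variable with names that state what the value is and which unit it uses. The function, constant, and variable names are invented for illustration and do not come from the book's repository.

```python
# Conversion factor from miles to kilometres (exact by definition).
KM_PER_MILE = 1.609344


def convert_miles_to_km(distance_miles: float) -> float:
    """Convert a distance in miles to kilometres."""
    return distance_miles * KM_PER_MILE


# Hard to read: what does d measure, and in which unit?
d = 26.2

# Easier to read: the quantity and its unit are part of the name.
marathon_distance_miles = 26.2
marathon_distance_km = convert_miles_to_km(marathon_distance_miles)

print(f"{marathon_distance_km:.1f} km")
```

Because the unit is in each name, a line such as `marathon_distance_km = marathon_distance_miles` would look wrong at a glance, which is the kind of mistake a bare `d` hides.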
Single-letter names illustrate the cost of brevity. The following code snippet needs a comment to explain what is happening, because single letters are used for the variables:

    # calculate the accuracy of the predictions compared to the test data
    a = sum(x == y for x, y in zip(p, t))/len(p)

Choosing better names makes the code much easier to read:

    accuracy = sum(x == y for x, y in zip(predictions, test_data))/len(predictions)

The variables x and y have been left as short names because they are only used within the call to sum(), and as such are only temporary. Similarly, a convention in Python is to use the single letters i and j as counters, as in the following example:

    i = 0
    while i < len(processed_results):
        # do something
        i += 1

Other commonly used conventions are df for a Pandas dataframe, and fig and ax for figures and axes when using Matplotlib. It's OK to use short names that your readers will recognize.

Names also shouldn't be too similar to each other. I always have to look up the documentation for the common Python datetime functions strptime and strftime to remember the difference between them, because they are so similar. Again, this means your reader needs to hold additional knowledge to use the code.

Additionally, you can make the names in your code readable by using Python formatting conventions. Variables and functions should use snake_case, where all the words are lowercase and joined with underscores. It's easier to read variable names if there is an underscore between each word: x_train_array is clearer than xtrainarray. Class definitions should use CamelCase, where the initial letter of each word is capitalised. Constants or global variables should use ALL_CAPS.

WARNING

Don't use names of Python built-in functions as variable names, otherwise you won't be able to use the original functions.
Here's an example:

    list = [0, 2, 4]

This will cause the following line of code to raise an error, instead of creating an empty list:

    empty_list = list()

Taken together, all of these will make your code much more readable, and make it easier for other people to use your code.

Comments

Comments are one of the most useful forms of documentation within the codebase, but you need to take care to use them well. Comments can summarize, explain or add caveats to your code, or mark places where you need to come back and change things later. They can also be a useful way to start writing a function: you can start with pseudocode in the form of comments, then fill in the real code, which makes it easier to structure the function.

A comment in Python is designated with a # symbol:

    # This is a comment.

Comments should not repeat the information that is already in the code. This doesn't help your reader, and you will also need to change the comment if the code changes. This violates the "Don't Repeat Yourself" principle from Chapter 1. This is an example of a bad comment:

    # Train the classification model
    classifier.fit(X_train, y_train)

A good comment adds caveats, summarizes information, or explains something that isn't already in the code. This is an example of a useful comment:

    from statistics import mode

    # If there are two modes in the data, the first one found is returned.
    mode(my_data_array)

Comments should be easy for your reader to understand. It's best to use full words and sentences instead of abbreviations. It's also a good idea to write the comments at the same time as you are writing the code, rather than adding in explanations later on. This gives you an opportunity to add all the extra thoughts you are having while writing the code.

Comments should always be professional, without offensive slang or curse words. But comments can be lighthearted and fun, if that fits in with your company's culture.
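The pseudocode-first approach mentioned earlier in this section can be sketched as follows. This is an illustration under stated assumptions rather than code from the book: the function name `fraction_missing_per_column` and the sample rows are hypothetical. The three "Step" comments were written first as the outline, and the code beneath each one was filled in afterwards.

```python
def fraction_missing_per_column(rows: list[dict]) -> dict[str, float]:
    """Return the fraction of rows in which each column is missing or None."""
    # Step 1: collect every column name that appears in any row.
    column_names = sorted({name for row in rows for name in row})

    # Step 2: count the rows where each column is absent or set to None.
    missing_counts = {name: 0 for name in column_names}
    for row in rows:
        for name in column_names:
            if row.get(name) is None:
                missing_counts[name] += 1

    # Step 3: convert counts to fractions. An empty input has no columns,
    # so the result is an empty dict and no division takes place.
    return {name: count / len(rows) for name, count in missing_counts.items()}


# Hypothetical sample data: one missing age and one missing city.
survey_rows = [
    {"age": 34, "city": "Leeds"},
    {"age": None, "city": "York"},
    {"age": 51},
]
missing_fractions = fraction_missing_per_column(survey_rows)
print(missing_fractions)
```

Once the function is complete, it is worth rereading the outline comments and keeping only those that still add something the code does not say, such as the note about empty input in step 3.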
The Apollo 11 source code from NASA has some great examples of lighthearted comments:

[Figure 5-1. Comments in the Apollo 11 source code, including "ASTRONAUT: PLEASE CRANK THE SILLY THING AROUND", "SEE IF HE'S LYING", and "OFF TO SEE THE WIZARD"]

Source: https://github.com/chrislgarry/Apollo-11

Docstrings

Python docstrings are a formalized, longer version of comments that are commonly included at the start of a function or class definition, or at the top of the file. They give your reader an overall view of what the function or script should be doing. Docstrings are a crucial part of making your code easily readable to another person, because you can provide more detail on the purpose of a function than you can communicate just by the name of that function.

A function docstring should describe what the expected inputs and outputs of that function are, including their types. This is the Python standard for the documentation for a function, and it means that the text you enter can be returned by calling the help function in a Python interpreter. Additionally, there are automated documentation solutions such as Sphinx which will pick up the text from the docstrings and generate web documentation from them.

Here's a great example of a docstring from the Pandas codebase. The head() method displays the first n rows of a Pandas dataframe. It's standard practice to enclose the docstring in triple double quotes (""").

    def head(self: NDFrameT, n: int = 5) -> NDFrameT:
        """
        Return the first `n` rows. (1)

        This function returns the first `n` rows for the object based
        on position. It is useful for quickly testing if your object
        has the right type of data in it.

        For negative values of `n`, this function returns all rows except
        the last `|n|` rows, equivalent to ``df[:n]``. (2)

        If n is larger than the number of rows, this function returns all rows. (3)

        Parameters
        ----------
        n : int, default 5 (4)
            Number of rows to select.

        Returns
        -------
        same type as caller (5)
            The first `n` rows of the caller object.
        """

(1) This gives us an overall description of the function.
(2) This is a caveat that tells us that the behavior may not be what we expect.
(3) This is an edge case.
(4) The input parameter with the default and the expected type.
(5) The output, which can be one of many types. Common ones for this function would be a Pandas dataframe or series.

Source: https://github.com/pandas-dev/pandas/blob/v1.5.0/pandas/core/generic.py#L5479-L5552

You can get the same documentation from help(df.head) in a Python interpreter. This docstring is also picked up by the autogenerated documentation for Pandas.

There are three main templates for docstrings: Google docstrings, NumPy docstrings and reStructuredText docstrings. I recommend picking one of these and sticking to it, because the standardization makes the docstrings easy to read: the format is familiar. You can configure a code editor to automatically generate a docstring template in your preferred format.

Readmes, tutorials, and other longer documents

Longer documents help give your readers the overall context of your project, and they advertise the work that you have done. As with everything else in this chapter, it is essential that they are kept up to date. There are very few things more frustrating than trying to follow a tutorial example, then finding that the underlying code has changed and the example doesn't run. Automated tests for the examples in your tutorials prevent this.

Every code repository should contain a README.md file that gives an overall introduction to the project.
The README should also describe how to run the project, what the requirements are, and any other relevant information that someone would need to know if they are using your project or planning to continue your work.

The combination of good names, useful comments, completed docstrings and an overall introduction will ensure that your code is easy to run, maintain, and work on in the future.

Documentation in Jupyter Notebooks

Jupyter notebooks can be very informal, and they are often used for the initial stages of a project. But even if your notebook is a blind alley, where you try something out and it doesn't work, the code in that notebook could very well be useful in the future. So it's worthwhile to be able to find that code again, and know what is in the notebook. As discussed in "Names", it's important that your notebook has a descriptive name. Additionally, it's a good idea to give a description of what's in the notebook at the very start.

Within a notebook, you can add text to give the notebook structure and add explanatory notes. Again, don't repeat the information that is in the code. Use the text to add summaries, caveats and explanations. And you should always update the text when you update the code.

You can add text to Jupyter notebooks using markdown. To convert a cell from code to markdown, either press the m hotkey while the cell is selected but not being edited (command mode), or use the dropdown menu at the top of the notebook.

The following example shows a good mix of text and code. The text adds information to the notebook without duplicating the code.

[Figure 5-2. Mixing code and text in a Jupyter Notebook]

Source: https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/preprocessing.ipynb

You can also use markdown in a notebook to add headings, which will help your readers navigate through the notebook.
Headings use the # symbol as follows:

    # This is a top level heading
    ## This is a second level heading
    ### This is a third level heading

Giving a result like this:

[Figure 5-3. Using headings in a Jupyter Notebook. The resulting table of contents shows a top-level "Preprocess" heading with three sections beneath it: "NLP" (Tokenize, Pad, Truncation, Build tensors), "Audio" (Resample, Feature extractor, Pad and truncate), and "Vision" (Feature extractor, Data augmentation).]

Source: https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/preprocessing.ipynb

Documenting experiments

Experiments need to be documented in a structured manner, to ensure you are being rigorous. To do this, you'll need to ensure that you are tracking all the variables that change in each iteration of your experiment. This will ensure that you can reproduce your experiment in the future, or that someone else can pick up the project and know what variables have been tested. You should also record the hypotheses that you are testing, and any assumptions that you are making along the way.

This is especially useful for machine learning projects, when you might want to try out a large number of different parameters. Consider recording the following:

* The data that you used to train the model.
* The training/evaluation/test split.
* The feature engineering choices that you made.
* The model hyperparameters (such as the regularization in a logistic regression model, or the learning rate for a neural network).
* The metrics that you are evaluating your model on, such as accuracy, precision and recall.

Weights and Biases is a very useful tool for tracking machine learning experiments. It easily integrates with scikit-learn, TensorFlow and PyTorch, and logs your training parameters to a web dashboard as shown below:

[Figure 5-4. Tracking experiments in Weights and Biases]

Source: /guides/track/app

Other experiment tracking solutions include the open source package Sacred, and SageMaker Experiments from AWS.

Key Takeaways

Good documentation is crucial for getting your code used by other people.
Time spent writing documentation will repay itself in the future. It'll make it so much easier for other people to get started on your project, and for you to understand your code in the future.

The following points will help you write good documentation:

* Names: Names of variables, functions and files should be informative, an appropriate length, and easy to read.
* Comments: Your comments should add extra information that isn't contained in the code, such as a summary or a caveat.
* Docstrings: Your functions should always have a docstring that describes the inputs and outputs of the function, as well as the purpose of that function.
* Readmes: Every repository or project should have an introduction that advertises your code and lets other people know why they should use it.
* Jupyter Notebooks: Your notebooks will be much easier to read if you give them good names, give them a structure, and intersperse text and code.
* Experiment Tracking: Experiments, especially in machine learning projects, should be tracked in a structured way.

All the techniques in this chapter will make your code clear to read, advertise what it does, and make it easy for other people to use it.
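To make the docstring and experiment-tracking takeaways concrete, here is a minimal sketch that uses only the Python standard library. It is not the Weights and Biases, Sacred, or SageMaker API: the `log_experiment` function, the record fields, and the file name are assumptions chosen for illustration, and the metric values are placeholders rather than real results. The docstring follows the Google template.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_experiment(log_path: Path, data_version: str, hyperparameters: dict,
                   metrics: dict, notes: str = "") -> dict:
    """Append one experiment record to a JSON Lines file.

    Args:
        log_path: File to append to. It is created if it does not exist.
        data_version: Label for the training data that was used.
        hyperparameters: Model settings that changed in this run.
        metrics: Evaluation results, for example accuracy or recall.
        notes: The hypothesis or assumptions behind this run.

    Returns:
        The record that was written, including a UTC timestamp.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_version": data_version,
        "hyperparameters": hyperparameters,
        "metrics": metrics,
        "notes": notes,
    }
    # Append mode keeps earlier runs, so the file becomes a history.
    with log_path.open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")
    return record


# Usage: log two runs to a temporary file, then read them back.
# The accuracy values are placeholders, not measured results.
experiment_log = Path(tempfile.mkdtemp()) / "experiments.jsonl"
log_experiment(experiment_log, "sample-data-v1", {"C": 1.0},
               {"accuracy": 0.0}, notes="Baseline logistic regression")
log_experiment(experiment_log, "sample-data-v1", {"C": 0.1},
               {"accuracy": 0.0}, notes="Stronger regularization")

logged_runs = [json.loads(line)
               for line in experiment_log.read_text(encoding="utf-8").splitlines()]
print(len(logged_runs))
```

A plain file like this is enough to answer "what has already been tried?" on a small project; a dedicated tracking tool adds dashboards and comparison views on top of the same kind of record.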
