Principles of Data Science WEB 2
EXAMPLE 1.5
Problem
A dataset has a list of keywords that were searched on a web search engine in the past week. Is this dataset
structured or unstructured?
Solution
The dataset is unstructured since each entry in the dataset can be freeform text: a single word,
multiple words, or even multiple sentences.
EXAMPLE 1.6
Problem
The dataset from the previous example is processed so that now each search record is summarized as up to
three words, along with the timestamp (i.e., when the search occurred). Is this dataset structured or
unstructured?
Solution
It is a structured dataset since every entry has the same structure with two attributes: a
short keyword and the timestamp.
Table 1.3 summarizes the advantages and disadvantages of CSV, JSON, and XML dataset formats. Each of these
is described in more detail below.
22 1 • What Are Data and Data Science?
EXPLORING FURTHER
Data.gov (https://openstax.org/r/datagov)
Kaggle (https://openstax.org/r/kaggle1)
Statista (https://openstax.org/r/statista)
There is some flexibility on how to end a line with CSV files. It is acceptable to end with or without commas,
as some software or programming languages automatically add a comma when generating a CSV dataset.
CSV files can be opened with spreadsheet software such as MS Excel and Google Sheets. The spreadsheet
software visualizes CSV files more intuitively in the form of a table (see Figure 1.5). We will cover the basic use
of Python for analyzing CSV files in Data Science with Python.
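As a minimal sketch of how a program sees a CSV file, the example below parses a few comma-separated lines with Python's built-in csv module. The column names and all values are made up for illustration, and the last line deliberately ends with a trailing comma to show the extra empty field it produces.

```python
import csv
import io

# Hypothetical CSV text; the column names and values are invented.
# The last data line ends with a trailing comma on purpose.
csv_text = (
    "Semester,Instructor,Class Size,Rating\n"
    "Fall 2020,A,100,4.5\n"
    "Spring 2021,B,80,4.2,\n"
)

rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

# A trailing comma yields one extra empty field; drop it if present.
cleaned = [
    r[:-1] if len(r) > len(header) and r[-1] == "" else r
    for r in records
]

print(header)
print(cleaned)
```

Spreadsheet software and libraries such as Pandas handle these details for you, which is why the rest of this text does not parse CSV files by hand.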
Figure 1.5 ch1-courseEvaluations.csv Opened with Microsoft Excel (Used with permission from Microsoft)
HOW TO DOWNLOAD AND OPEN A DATASET FROM THE CH1-DATA SPREADSHEET IN THIS TEXT
A spreadsheet file accompanies each chapter of this textbook. Each file includes multiple tabs, and each tab
corresponds to a single dataset in the chapter. For example, the spreadsheet file for this chapter
(https://openstax.org/r/spreadsheet4) is shown in Figure 1.6. Notice that it includes multiple tabs with the names
“ch1-courseEvaluations.csv,” “ch1-cancerdoc.csv,” and “ch1-iris.csv,” which are the names of the datasets.
Figure 1.6 The dataset spreadsheet file for Chapter 1 (Used with permission from Microsoft)
To save each dataset as a separate CSV file, choose the tab you are interested in and select File > Save As ... > CSV
File Format. This saves only the current tab as a CSV file. Make sure the file name is set correctly; it may
default to the name of the spreadsheet file (“ch1-data.xlsx” in this case). You should name the generated
CSV file after the corresponding tab. For example, if you have generated a CSV file for the first tab
of “ch1-data.xlsx,” make sure the generated file name is “ch1-courseEvaluations.csv.” This will prevent future
confusion when following instructions in this textbook.
Figure 1.7 provides an example of the JSON representation of the same dataset depicted in Figure 1.6.
Notice that the JSON format starts and ends with a pair of curly braces ({}). Inside, there are multiple pairs of
two fields that are separated by a colon (:). The two fields placed on the left and right of the colon
are called a key and a value, respectively (key : value). For example, the dataset in Figure 1.7 has six
key-value pairs that use the key "Semester", namely "Semester": "Fall 2020", "Semester": "Spring 2021",
"Semester": "Fall 2021", "Semester": "Spring 2022", "Semester": "Fall 2022", and "Semester": "Spring 2023".
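To make the key-value idea concrete, here is a small sketch that reads JSON text with Python's built-in json module. Only the "Semester" key comes from the description above; the outer "Members" key, the "Rating" key, and the rating values are assumptions made for illustration and may not match the actual file.

```python
import json

# Hypothetical JSON text. "Members" and "Rating" are assumed names,
# and the rating values are invented.
json_text = """
{
  "Members": [
    {"Semester": "Fall 2020", "Rating": 4.5},
    {"Semester": "Spring 2021", "Rating": 4.2}
  ]
}
"""

dataset = json.loads(json_text)  # curly braces become a Python dict

# Look up the value stored under the key "Semester" in every item.
semesters = [item["Semester"] for item in dataset["Members"]]
print(semesters)
```

Each key maps to exactly one value inside an item, so the same key can safely reappear in every item of the list.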
CourseEvaluations.xml below only includes the first three items in the original dataset.
Figure 1.8 ch1-courseEvaluations.xml with the First Three Entries Only, Opened with Visual Studio Code
CourseEvaluations.xml lists each item of the dataset between a pair of tags, <members> and </members>.
Under <members>, each item is defined between <evaluation> and </evaluation>. Since the dataset in
Figure 1.8 has three items, we can see three blocks of <evaluation> ... </evaluation>. Each item has four
attributes, and they are defined as different XML tags as well—<semester>, <instructor>, <classsize>,
and <rating>. They are also followed by closing tags such as </semester>, </instructor>, </classsize>,
and </rating>.
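The nesting of tags described above can be explored with Python's built-in xml.etree.ElementTree module. The sketch below uses the tag names from the text, but the values inside the tags are invented for illustration, and only two items are included.

```python
import xml.etree.ElementTree as ET

# Tag names follow the description above; the values are hypothetical.
xml_text = """
<members>
  <evaluation>
    <semester>Fall 2020</semester>
    <instructor>A</instructor>
    <classsize>100</classsize>
    <rating>4.5</rating>
  </evaluation>
  <evaluation>
    <semester>Spring 2021</semester>
    <instructor>B</instructor>
    <classsize>80</classsize>
    <rating>4.2</rating>
  </evaluation>
</members>
""".strip()

root = ET.fromstring(xml_text)

# Turn each <evaluation> block into a dict of tag name -> text.
items = [
    {child.tag: child.text for child in evaluation}
    for evaluation in root.findall("evaluation")
]

print(root.tag, len(items))
print(items[0])
```

Note that every value is read as text, so a number such as the class size would need to be converted with int() or float() before doing arithmetic.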
PubMed datasets (https://openstax.org/r/pubmed1) provide a list of articles that are published in the National
Library of Medicine in XML format. Click Annual Baseline and download/open any .xml file. Note that the
.xml files are so big that they are compressed to .gz files. However, on many systems, once you download one and
open it by double-clicking, the file is automatically decompressed and opened. You will see many XML
tags along with information about numerous publications, such as publication venue, title, and publication date.
Figure 1.9 The Directory Structure of the Small Traffic Light Dataset
The annotation directories have a list of XML files, each of which corresponds to an image file with the same
filename inside the corresponding image directory (Figure 1.10). Figure 1.11 shows that the first XML file in the
Annotations directory includes information about the .jpg file with the same filename.
Figure 1.10 List of XML Files under the Annotations Directory in the Small Traffic Light Dataset
(source: “Small Traffic Light Dataset,” https://www.kaggle.com/datasets/sovitrath/small-traffic-light-dataset-xml-format)
Figure 1.11 2020-03-30 11_30_03.690871079.xml, an Example XML file within the Small Traffic Light Dataset (Source: “Small Traffic
Light Dataset,” https://www.kaggle.com/datasets/sovitrath/small-traffic-light-dataset-xml-format)
The Face Mask Detection (https://openstax.org/r/andrewmvd) dataset has a set of images of human faces with
masks on. It follows a similar structure: the dataset consists of two directories, annotations and
images. The files in the former are in XML format. Each XML file includes a text description of the
image with the same filename. For example, “maksssksksss0.xml” includes information on
“maksssksksss0.png.”
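Because an annotation file and its image share a filename and differ only in extension, they can be matched programmatically. The following is a rough sketch using Python's built-in pathlib module; the directory names follow the description above, and the second filename is an assumed example.

```python
from pathlib import PurePosixPath

# Assumed file listings; "maksssksksss1" is a hypothetical second entry.
annotation_files = [
    "annotations/maksssksksss0.xml",
    "annotations/maksssksksss1.xml",
]
image_files = [
    "images/maksssksksss1.png",
    "images/maksssksksss0.png",
]

# Index the images by filename without the extension (the "stem").
images_by_stem = {PurePosixPath(p).stem: p for p in image_files}

# Pair each annotation with the image that has the same stem.
pairs = {
    a: images_by_stem.get(PurePosixPath(a).stem)
    for a in annotation_files
}
print(pairs)
```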
EXAMPLE 1.7
Problem
The Iris Flower dataset (ch1-iris.csv (https://openstax.org/r/filed)) is a classic dataset in the field of data
analysis.¹ Download this dataset and open it with a code editor (e.g., Sublime Text, Xcode, Visual Studio
Code). (We recommend that if you do not have any code editor installed, you install one. All three of these
editors are quick and easy to install.) Now, answer these questions:
Solution
There are 151 rows in the dataset with the header row at the top, totaling 150 items. There are five
attributes listed across columns: sepal_length, sepal_width, petal_length, petal_width, species.
The second attribute is sepal_width.
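The same kind of count can be done in code. The sketch below applies the idea to a three-row excerpt embedded in the program rather than the full file, so the numbers it prints describe the excerpt only; the header line uses the attribute names listed above.

```python
import csv
import io

# A short excerpt standing in for the full ch1-iris.csv file.
iris_excerpt = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

rows = list(csv.reader(io.StringIO(iris_excerpt)))
header = rows[0]

num_items = len(rows) - 1       # every row except the header is an item
num_attributes = len(header)    # one attribute per column
second_attribute = header[1]    # Python indices start at 0

print(num_items, num_attributes, second_attribute)
```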
EXAMPLE 1.8
Problem
Solution
There are 409 items in the dataset, and each item has seven attributes: “category”, “air_date”,
“question”, “value”, “answer”, “round”, and “show_number”. The third item is located at index 2 of the first
list as shown in Figure 1.14.
1 The Iris Flower dataset was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper “The Use of Multiple
Measurements in Taxonomic Problems.” This work became a landmark study in the use of multivariate data in classification
problems, and the dataset frequently appears in data science as a convenient test case for machine learning and neural network
algorithms. It is often used as a beginner's dataset to demonstrate techniques such as classification, and it is
commonly distributed in CSV format.
Technology empowers data analysts, researchers, and organizations to leverage data, extract actionable
insights, and make decisions that optimize processes and improve outcomes in many areas of our lives.
Specifically, technology provides the tools, platforms, and algorithms that enable users to efficiently process
and analyze data—especially complex datasets. The choice of technology used for a data science project will
vary depending on the goals of the project, the size of the datasets, and the kind of analysis required.
Spreadsheet Programs
Spreadsheet programs such as Excel and Google Sheets are software applications consisting of electronic
worksheets with rows and columns where data can be entered, manipulated, and calculated. Spreadsheet
programs offer a variety of functions for data manipulation and can be used to easily create charts and tables.
Excel is one of the most widely used spreadsheet programs, and as part of Microsoft Office, it integrates well
with other Office products. Excel was first released by Microsoft in 1985 for the Macintosh, with the first
Windows version following in 1987, and it has become one of the most popular choices for loading and
analyzing tabular data in a spreadsheet format. You are likely to have used Excel in some form or other,
perhaps to organize the possible roommate combinations in your dorm room or
to plan a party, or in some instructional context. We refer to the use of Excel to manipulate data in some of the
examples of this text because sometimes a spreadsheet is simply the easiest way to work with certain
datasets. (See Appendix A: Review of Excel for Data Science for a review of Excel functionality.)
Google Sheets is a cloud-based spreadsheet program provided by Google as part of the Google Workspace.
Because it is cloud-based, it is possible to access spreadsheets from any device with an internet connection.
This accessibility allows for collaboration and real-time updates among multiple users, enhancing
communication within a team and making it ideal for team projects or data sharing among colleagues. Users
can leave comments, track changes, and communicate within the spreadsheet itself.
The user interfaces for both Excel and Google Sheets make these programs very user-friendly for many
applications. But these programs have some limitations when it comes to large databases or complex
analyses. In these instances, data scientists will often turn to a programming language such as Python or R,
or to a statistical software package such as SPSS.
Programming Languages
A programming language is a formal language that consists of a set of instructions or commands used to
communicate with a computer and to instruct it to perform specific tasks that may include data manipulation,
computation, and input/output operations. Programming languages allow developers to write algorithms,
create software applications, and automate tasks and are better suited than spreadsheet programs to handle
complex analyses.
Python and R are two of the most commonly used programming languages today. Both are open-source
programming languages. While Python started as a general-purpose language that covers various types of
tasks (e.g., numerical computation, data analysis, image processing, and web development), R is more
specifically designed for statistical computing and graphics. Both use simple syntax compared to conventional
programming languages such as Java or C/C++. They offer a broad collection of packages for data
manipulation, statistical analysis, visualization, machine learning, and complex data modeling tasks.
We have chosen to focus on Python in this text because of its straightforward and intuitive syntax, which
makes it especially easy for beginners to learn and apply. Also, Python skills can apply to a wide range of
computing work. Python also offers a vast ecosystem of libraries and frameworks specifically designed for
data analysis, machine learning, and scientific computing. As we’ll see, Python libraries such as NumPy
(https://openstax.org/r/nump), Pandas (https://openstax.org/r/panda), Matplotlib
(https://openstax.org/r/matplot), and Seaborn (https://openstax.org/r/seabo) provide powerful tools for data manipulation,
visualization, and machine learning tasks, making Python a versatile choice for handling datasets. You’ll find
the basics of R and various R code examples in Appendix B: Review of R Studio for Data Science. If you want to
learn how to use R, you may want to practice with RStudio (https://openstax.org/r/posit1), a commonly used
software application to edit/run R programs.
PYTHON IN EXCEL
Microsoft recently launched a new feature for Excel named “Python in Excel.” This feature allows a user to
run Python code to analyze data directly in Excel. This textbook does not cover this feature, but instead
presents Python separately, as it is also crucial for you to know how to use each tool in its more commonly
used environment. If interested, refer to Microsoft’s announcement (https://openstax.org/r/youtu).
EXPLORING FURTHER
In general, data visualization aims to make complex data more understandable and usable. The tools and
technology described in this section offer a variety of ways to go about creating visualizations that are most
accessible. Refer to Visualizing Data for a deeper discussion of the types of visualizations—charts, graphs,
boxplots, histograms, etc.—that can be generated to help find the meaning in data.
EXPLORING FURTHER
Multiple tools are available for writing and executing Python programs. Jupyter Notebook is one convenient
and user-friendly tool. The next section explains how to set up the Jupyter Notebook environment using
Google Colaboratory (Colab) and then provides the basics of two open-source Python libraries named Pandas
and Matplotlib. These libraries are specialized for data analysis and data visualization, respectively.
EXPLORING FURTHER
Python Programming
In the discussion below, we assume you are familiar with basic Python syntax and know how to write a
simple program using Python. If you need a refresher on the basics, please refer to Das, U., Lawson, A.,
Mayfield, C., & Norouzi, N. (2024). Introduction to Python Programming. OpenStax. https://openstax.org/
books/introduction-python-programming/pages/1-introduction (https://openstax.org/r/page1).
Setting up Colab is simple. On your Google Drive, click New > More. If Colab has already been installed on
your Google Drive, you will see Colaboratory under More. If not, click “Connect more apps” and install Colab by
searching for “Colaboratory” in the app store (Figure 1.15). For further information, see the Google Colaboratory
Ecosystem (https://openstax.org/r/1pp) animation.
Now click New > More > Google Colaboratory. A new, empty Jupyter Notebook will show up as in Figure 1.16.
The gray area with the play button is called a cell. A cell is a block where you can type either code or plain text.
Notice that there are two buttons on top of the first cell—“+ Code” and “+ Text.” These two buttons add a code
or text cell, respectively. A code cell is for the code you want to run; a text cell is to add any text description or
note.
Let’s run a Python program on Colab. Type the following code in a code cell.
PYTHON CODE
print("hello world!")
hello world!
You can write a Python program across multiple cells and put text cells in between. Colab treats all the
code cells as part of a single program, running from the top to the bottom of the current Jupyter Notebook. For
example, the two code cells below run as if they were a single program.
When running one cell at a time from the top, we see the following outputs under each cell.
PYTHON CODE
a = 1
print ("The a value in the first cell:", a)
PYTHON CODE
b = 3
print ("a in the second cell:", a)
print ("b in the second cell:", b)
a + b
While a conventional Python program needs print() to write something to the program console,
Jupyter Notebook does not always require it. In a Jupyter Notebook cell, the line a+b by itself
also displays the value of a+b as an output, just as print(a+b) would. But keep in mind that if several
lines in a cell are bare expressions like this, only the value of the last line is shown.
You can also run multiple cells in bulk. Click Runtime on the menu, and you will see there are multiple ways of
running multiple cells at once (Figure 1.17). The two commonly used ones are “Run all” and “Run before.” “Run
all” runs all the cells in order from the top; “Run before” runs all the cells before the currently selected one.
One thing to keep in mind is that being able to split a long program into multiple blocks and run one block at a
time raises the chance of user error. Let’s look at a modified version of the code from the previous example.
PYTHON CODE
a = 1
print ("the value in the first cell:", a)
PYTHON CODE
b = 3
print ("a in the second cell:", a)
print ("b in the second cell:", b)
a + b
PYTHON CODE
a = 2
a + b
The modified code has an additional cell at the end, updating a from 1 to 2. Notice that now a+b returns 5 as a
has been changed to 2. Now suppose you need to run the second cell for some reason, so you run the second
cell again.
PYTHON CODE
a = 1
print ("the a value in the first cell:", a)
PYTHON CODE
b = 3
print ("a in the second cell:", a)
print ("b in the second cell:", b)
a + b
PYTHON CODE
a = 2
a + b
The value of a has changed to 2. This shows that the execution order of the cells matters! If you have run the
third cell before the second cell, a keeps the value assigned in the third cell even though the third cell
is located below the second cell. Therefore, it is recommended to use “Run all” or “Run before” after you make
changes across multiple cells of code. This way your code is guaranteed to run sequentially from the top.
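The effect of execution order can be reproduced outside a notebook. The sketch below is only a simulation: each string stands in for one of the three cells discussed above (with the print() lines left out), and the helper run_cells() is a hypothetical function written for this illustration, not part of Colab or Jupyter.

```python
# Each string stands in for one notebook code cell.
cells = {
    1: "a = 1",
    2: "b = 3\nresult = a + b",
    3: "a = 2\nresult = a + b",
}

def run_cells(order):
    """Run the given cells in order, sharing one set of variables."""
    namespace = {}
    for number in order:
        exec(cells[number], namespace)
    return namespace["result"]

# Second cell after a clean run from the top.
fresh_run = run_cells([1, 2])

# Second cell rerun after the third cell has changed a.
after_rerun = run_cells([1, 2, 3, 2])

print(fresh_run, after_rerun)
```

The second cell contains the same code in both runs, yet it produces different results because the variable a was left in a different state.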
Python Pandas
One of the strengths of Python is that it includes a variety of free, open-source libraries. A library is a set of
already-implemented methods that a programmer can call, allowing a programmer to avoid building
common functions from scratch.
Pandas is a Python library specialized for data manipulation and analysis, and it is very commonly used among
data scientists. It offers a variety of ready-made methods that data scientists can apply directly to data analysis.
You will learn how to analyze data using Pandas throughout this textbook.
Colab already has Pandas installed, so you just need to import Pandas and you are set to use all the methods
in Pandas. Note that it is convention to abbreviate pandas to pd so that when you call a method from Pandas,
you can do so by using pd instead of having to type out Pandas every time. It offers a bit of convenience for a
programmer!
PYTHON CODE
import pandas as pd
EXPLORING FURTHER
Open the Notebook and allow it to access files in your Google Drive by following these steps:
First, click the Files icon on the side tab (Figure 1.18).
Then click the Mount Drive icon (Figure 1.19) and select “Connect to Google Drive” on the pop-up window.
Notice that a new cell has been inserted on the Notebook as a result (Figure 1.20).
Connect your Google Drive by running the cell, and now your Notebook file can access all the files under
content/drive. Navigate folders under drive to find your Notebook and ch1-movieprofit.csv
(https://openstax.org/r/filed) files. Then click “…” > Copy Path (Figure 1.21).
Figure 1.21 Copying the Path of a CSV File Located in a Google Drive Folder
Now replace [Path] with the copied path in the code below. Run the code and you will see that the dataset has been
loaded as a table and stored in the Python variable data.
PYTHON CODE
data = pd.read_csv("[Path]")
data
The read_csv() method in Pandas loads a CSV file and stores it as a DataFrame. A DataFrame is a data type
that Pandas uses to store multi-column tabular data. Therefore, the variable data holds the table in
ch1-movieprofit.csv (https://openstax.org/r/filed) in the form of a Pandas DataFrame.
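If you want to try read_csv() without a file on Google Drive, the sketch below feeds it CSV text held in memory instead. The two rows are a tiny excerpt with only three of the dataset's columns, so this is an illustration of the call rather than a substitute for loading the real file.

```python
import io

import pandas as pd

# A two-row excerpt standing in for a CSV file on disk.
csv_text = (
    "Title,Year,US_Gross_Million\n"
    "Avatar,2009,760.51\n"
    "Titanic,1997,659.33\n"
)

# read_csv() accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(type(data).__name__)  # the loaded table is a DataFrame
print(data.shape)           # (number of rows, number of columns)
```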
Pandas defines two data types for tabular data—DataFrame and Series. While DataFrame is used for multi-
column tabular data, Series is used for single-column data. Many methods in Pandas support both
DataFrame and Series, but some are only for one or the other. It is always good to check if the method you
are using works as you expect. For more information, refer to the Pandas documentation
(https://openstax.org/r/docs) or Das, U., Lawson, A., Mayfield, C., & Norouzi, N. (2024). Introduction to
Python Programming. OpenStax. https://openstax.org/books/introduction-python-programming/pages/
1-introduction (https://openstax.org/r/page1).
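A quick way to see the difference between the two data types is to check what Pandas returns for different selections. The sketch below builds a small table by hand, assuming only two columns and two rows for brevity.

```python
import pandas as pd

# A small hand-built table for illustration.
data = pd.DataFrame({
    "Title": ["Avatar", "Titanic"],
    "US_Gross_Million": [760.51, 659.33],
})

# A single column name in brackets gives a Series.
one_column = data["US_Gross_Million"]

# A list of column names gives a DataFrame, even with one column.
as_table = data[["US_Gross_Million"]]

print(type(one_column).__name__, type(as_table).__name__)
```

Checking the type this way is a simple habit when a method behaves differently from what you expect.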
EXAMPLE 1.9
Problem
Remember the Iris dataset we used in Data and Datasets? Load the dataset ch1-iris.csv
(https://openstax.org/r/filed) to a Python program using Pandas.
Solution
The following code loads the ch1-iris.csv (https://openstax.org/r/filed) that is stored in a Google Drive. Make
sure to replace the path with the actual path to ch1-iris.csv (https://openstax.org/r/filed) on your Google
Drive.
PYTHON CODE
import pandas as pd
data = pd.read_csv("[Path]")  # replace [Path] with the actual path to ch1-iris.csv
EXPLORING FURTHER
Can I load a file that is uploaded to someone else’s Google Drive and shared with me?
Yes! This is useful especially when your Google Drive runs out of space. Simply add the shortcut of the
shared file to your own drive. Right-click > Organize > Add Shortcut will let you select where to store the
shortcut. Once done, you can call pd.read_csv() using the path of the shortcut.
PYTHON CODE
like this:
describe() returns a table whose columns are a subset of the columns in the entire dataset (by default, the
numeric ones) and whose rows are different statistics. The statistics include the number of non-missing values in a
column (count), mean (mean), standard deviation (std), minimum and maximum values (min/max), and the quartiles
(25%/50%/75%), which you will learn about in Measures of Variation. Using this representation, you can
compute such statistics of different columns easily.
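The following sketch shows these points on a tiny hand-built table. The values are chosen for illustration, and one entry is deliberately missing so that count differs from the number of rows.

```python
import pandas as pd

# Four rows; one sepal_width value is missing on purpose.
data = pd.DataFrame({
    "sepal_width": [3.5, 3.0, 3.2, None],
    "species": ["Iris-setosa"] * 4,
})

summary = data.describe()

print(list(summary.columns))                # only the numeric column
print(list(summary.index))                  # the statistics, one per row
print(summary.loc["count", "sepal_width"])  # non-missing values only
```

The text column species is left out of the summary, and count reflects the three available values rather than all four rows.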
EXAMPLE 1.10
Problem
Summarize the IRIS dataset using describe() of ch1-iris.csv (https://openstax.org/r/filed) you loaded in
the previous example.
Solution
The following code in a new cell returns the summary of the dataset.
PYTHON CODE
data.describe()
PYTHON CODE
data["US_Gross_Million"]
like this:
0 760.51
1 858.37
2 659.33
3 936.66
4 678.82
...
961 77.22
962 177.20
963 102.31
964 106.89
965 75.47
Name: US_Gross_Million, Length: 966, dtype: float64
DataFrame.iloc[] enables a more powerful selection—it lets a programmer select by both column and row,
using column and row indices. Let’s look at some code examples below.
PYTHON CODE
0 2009
1 2019
2 1997
3 2015
4 2018
...
961 2010
962 1982
963 1993
964 1999
965 2017
Name: Year, Length: 966, dtype: object
PYTHON CODE
Unnamed: 0 3
Title Titanic
Year 1997
Genre Drama
Rating 7.9
Duration 194
US_Gross_Million 659.33
Worldwide_Gross_Million 2201.65
Votes 1,162,142
Name: 2, dtype: object
To pinpoint a specific value within the “US_Gross_Million” column, you can use an index number.
PYTHON CODE
760.51
659.33
You can also use DataFrame.iloc[] to select a specific group of cells on the table. The example code below
shows different ways of using iloc[]. There are multiple ways of using iloc[], but this chapter introduces a
couple of common ones. You will learn more techniques for working with data throughout this textbook.
PYTHON CODE
0 Avatar
1 Avengers: Endgame
2 Titanic
3 Star Wars: Episode VII - The Force Awakens
4 Avengers: Infinity War
...
961 The A-Team
962 Tootsie
963 In the Line of Fire
964 Analyze This
965 The Hitman's Bodyguard
Name: Title, Length: 966, dtype: object
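As a compact reference, the sketch below collects a few common iloc[] patterns on a three-row table built by hand. It keeps only three columns, so the column positions here differ from those in the full movie profit dataset.

```python
import pandas as pd

# A three-row excerpt with only three columns.
data = pd.DataFrame({
    "Title": ["Avatar", "Avengers: Endgame", "Titanic"],
    "Year": [2009, 2019, 1997],
    "US_Gross_Million": [760.51, 858.37, 659.33],
})

# iloc[rows, columns] uses positions starting at 0; ":" means "all".
year_column = data.iloc[:, 1]         # every row, second column
third_row = data.iloc[2, :]           # third row, every column
single_cell = data.iloc[0, 2]         # first row, third column
top_left_block = data.iloc[0:2, 0:2]  # first two rows and columns

print(single_cell)
print(top_left_block)
```

As with Python lists, the end of a slice such as 0:2 is excluded, so it selects positions 0 and 1.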
PYTHON CODE
EXAMPLE 1.11
Problem
Select the “sepal_width” column of the IRIS dataset using the column name.
Solution
PYTHON CODE
data["sepal_width"]
0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
...
145 3.0
146 2.5
147 3.0
148 3.4
149 3.0
Name: sepal_width, Length: 150, dtype: float64
EXAMPLE 1.12
Problem
Solution
PYTHON CODE
data.iloc[:, 2]
0 1.4
1 1.4
2 1.3
3 1.5
4 1.4
...
145 5.2
146 5.0
147 5.2
148 5.4
149 5.1
Name: petal_length, Length: 150, dtype: float64
PYTHON CODE
data.loc[data['Genre'] == 'Comedy']
EXAMPLE 1.13
Problem
Using DataFrame.loc[], search for all the items of Iris-virginica species in the IRIS dataset.
Solution
The following code returns a filtered DataFrame containing only the rows whose species is Iris-virginica. All
such rows show up as an output.
PYTHON CODE
data.loc[data['species'] == 'Iris-virginica']
EXAMPLE 1.14
Problem
This time, search for all the items whose species is Iris-virginica and whose sepal width is wider than 3.2.
Solution
You can use a Boolean expression—in other words, an expression that evaluates as either True or
False—inside data.loc[].
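One possible way to combine the two conditions is sketched below on a four-row table with illustrative values. It assumes the column names sepal_width and species used earlier, and it joins the two Boolean expressions with the & operator, which Pandas uses for an element-wise "and".

```python
import pandas as pd

# Four illustrative rows; the real dataset has 150.
data = pd.DataFrame({
    "sepal_width": [3.5, 3.3, 2.7, 3.6],
    "species": [
        "Iris-setosa",
        "Iris-virginica",
        "Iris-virginica",
        "Iris-virginica",
    ],
})

# Each comparison yields one True/False value per row.
is_virginica = data["species"] == "Iris-virginica"
is_wide = data["sepal_width"] > 3.2

# Keep only the rows where both conditions are True.
selected = data.loc[is_virginica & is_wide]
print(selected)
```

If you write both conditions on one line, wrap each one in parentheses, as in data.loc[(data['species'] == 'Iris-virginica') & (data['sepal_width'] > 3.2)].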
PYTHON CODE
Type the following import statement in a new cell. Note it is convention to denote matplotlib.pyplot with
plt, similarly to denoting Pandas with pd.
PYTHON CODE
import matplotlib.pyplot as plt
Matplotlib offers a method for each type of plot, and you will learn the Matplotlib methods for all of the
commonly used types throughout this textbook. In this chapter, however, let’s briefly look at how to draw a plot
using Matplotlib in general.
Suppose you want to draw a scatterplot of “US_Gross_Million” and “Worldwide_Gross_Million” from the
movie profit dataset (ch1-movieprofit.csv (https://openstax.org/r/filed)). You will investigate scatterplots in
more detail in Correlation and Linear Regression Analysis. The example code below draws such a scatterplot
using the method scatter(). scatter() takes the two columns of interest, data["US_Gross_Million"]
and data["Worldwide_Gross_Million"], as inputs and assigns them to the x- and y-axes, respectively.
PYTHON CODE
Notice that it simply has a set of dots on a white plane. The plot itself does not show what each axis
represents, what this plot is about, etc. Without them, it is difficult to capture what the plot shows. You can set
these with the following code. The resulting plot below indicates that there is a positive correlation between
domestic gross and worldwide gross.
PYTHON CODE
# draw a scatterplot
plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
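A self-contained sketch of a labeled scatterplot is shown below. It uses three illustrative rows instead of the full dataset, and the title and axis label strings are assumptions chosen for this example; plt.title(), plt.xlabel(), and plt.ylabel() are the Matplotlib functions that set them.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Three illustrative rows standing in for the movie profit dataset.
data = pd.DataFrame({
    "US_Gross_Million": [760.51, 858.37, 659.33],
    "Worldwide_Gross_Million": [2847.25, 2797.50, 2201.65],
})

# draw a scatterplot
plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])

# describe what the plot and each axis represent (assumed wording)
plt.title("Domestic vs. Worldwide Gross")
plt.xlabel("Domestic Gross (in million USD)")
plt.ylabel("Worldwide Gross (in million USD)")
```

In a notebook, the figure appears under the cell once the cell finishes running.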
You can also change the range of numbers along the x- and y-axes with plt.xlim() and plt.ylim(). Add
the following two lines of code to the cell in the previous Python code example, which plots the scatterplot.
PYTHON CODE
# draw a scatterplot
plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
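The sketch below shows plt.xlim() and plt.ylim() in a complete, runnable form. The three rows are illustrative, and the limits of 0 to 1000 and 0 to 3000 are arbitrary choices for this small example rather than values taken from the text.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Three illustrative rows standing in for the movie profit dataset.
data = pd.DataFrame({
    "US_Gross_Million": [760.51, 858.37, 659.33],
    "Worldwide_Gross_Million": [2847.25, 2797.50, 2201.65],
})

# draw a scatterplot
plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])

# set the visible range of each axis (arbitrary example limits)
plt.xlim(0, 1000)
plt.ylim(0, 3000)
```

Points that fall outside the chosen ranges are simply not shown, so it is worth checking the minimum and maximum of each column before narrowing the axes.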