Principles of Data Science WEB 2

The document discusses the classification of datasets as structured or unstructured, providing examples of each. It also covers common dataset formats such as CSV, JSON, and XML, highlighting their pros and cons, typical uses, and how to access public datasets. Additionally, it emphasizes the importance of technology in data analysis and introduces spreadsheet programs like Excel for data manipulation.

Uploaded by

pihak21291
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
37 views30 pages

Principles of Data Science WEB 2

The document discusses the classification of datasets as structured or unstructured, providing examples of each. It also covers common dataset formats such as CSV, JSON, and XML, highlighting their pros and cons, typical uses, and how to access public datasets. Additionally, it emphasizes the importance of technology in data analysis and introduces spreadsheet programs like Excel for data manipulation.

Uploaded by

pihak21291
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 30

1.3 • Data and Datasets

EXAMPLE 1.5

Problem

A dataset has a list of keywords that were searched on a web search engine in the past week. Is this dataset
structured or unstructured?

Solution

The dataset is an unstructured dataset since each entry in the dataset can be a freeform text: a single word,
multiple words, or even multiple sentences.

EXAMPLE 1.6

Problem

The dataset from the previous example is processed so that now each search record is summarized as up to
three words, along with the timestamp (i.e., when the search occurred). Is this dataset structured or
unstructured?

Solution

It is a structured dataset since every entry of this dataset has the same structure with two attributes: a
short keyword along with the timestamp.

Dataset Formats and Structures (CSV, JSON, XML)


Datasets can be stored in different formats, and it’s important to be able to recognize the most commonly
used formats. This section covers three of the most often used formats for structured datasets—comma-
separated values (CSV), JavaScript Object Notation (JSON), and Extensible Markup Language (XML). While
CSV is the most intuitive way of encoding a tabular dataset, much of the data we collect from the web
(e.g., websites, mobile applications) is stored in the JSON or XML format. The reason is that JSON is well
suited to exchanging data between a user and a server, and XML is well suited to complex datasets due to
its hierarchy-friendly nature. Since all three store data as plain text, you can open them with any typical
text editor such as Notepad, Visual Studio Code, Sublime Text, or the VI editor.

Table 1.3 summarizes the advantages and disadvantages of CSV, JSON, and XML dataset formats. Each of these
is described in more detail below.

CSV
  Pros: Simple
  Cons: Difficult to add metadata; difficult to parse if there are special characters; flat structure
  Typical use: Tabular data

JSON
  Pros: Simple; compatible with many languages; easy to parse
  Cons: Difficult to add metadata; cannot leave comments
  Typical use: Data that needs to be exchanged between a user and a server

XML
  Pros: Structured (so more readable); possible to add metadata
  Cons: Verbose; complex structure with tags
  Typical use: Hierarchical data structures
Table 1.3 Summary of the CSV, JSON, and XML formats
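To make the differences in Table 1.3 concrete, the following sketch serializes the same tiny dataset in all three formats using only Python's standard library. The two records here are hypothetical stand-ins, not taken from the chapter's files.

```python
# Serialize the same two records as CSV, JSON, and XML.
import csv
import io
import json
import xml.etree.ElementTree as ET

records = [{"Semester": "Fall 2020", "Rating": "Good"},
           {"Semester": "Spring 2021", "Rating": "Excellent"}]

# CSV: one header line, then one comma-separated line per record
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Semester", "Rating"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: a top-level object holding a list of key-value objects
json_text = json.dumps({"Members": records}, indent=2)

# XML: each record wrapped in its own pair of tags
root = ET.Element("members")
for rec in records:
    item = ET.SubElement(root, "evaluation")
    for key, value in rec.items():
        ET.SubElement(item, key.lower()).text = value
xml_text = ET.tostring(root, encoding="unicode")

print(csv_text)
print(json_text)
print(xml_text)
```

Notice how the CSV version is the most compact, while the XML version spends the most characters on structure (tags), matching the pros and cons summarized above.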

EXPLORING FURTHER

Popular and Reliable Databases to Search for Public Datasets


Multiple online databases offer public datasets for free. When you want to look for a dataset of interest, the
following sources can be your initial go-to.

Government data sources include:

Data.gov (https://openstax.org/r/datagov)

Bureau of Labor Statistics (BLS) (https://openstax.org/r/blsgov1)

National Oceanic and Atmospheric Administration (NOAA) (https://openstax.org/r/noaagov)

World Health Organization (WHO) (https://openstax.org/r/who)

Some reputable nongovernment data sources are:

Kaggle (https://openstax.org/r/kaggle1)

Statista (https://openstax.org/r/statista)

Pew Research Center (https://openstax.org/r/pewresearch)

Comma-Separated Values (CSV)


A CSV file stores each item in the dataset on a single line. Variable values for each item are listed in one
line, separated by commas (“,”). The previous example about signing up for a course can be stored as a CSV
file. Figure 1.4 shows how the dataset looks when opened with a text editor (e.g., Notepad, TextEdit) or
with a code editor (e.g., Sublime Text, Visual Studio Code, Xcode). Notice that commas are used to separate
the attribute values within a single line (see Figure 1.4).

Access for free at openstax.org



Figure 1.4 ch1-courseEvaluations.csv Opened with Visual Studio Code

COMMA FOR NEW LINE?

There is some flexibility in how to end a line in a CSV file. It is acceptable to end a line with or without a
trailing comma, as some software and programming languages automatically add one when generating a CSV dataset.

CSV files can be opened with spreadsheet software such as MS Excel and Google Sheets. The spreadsheet
software visualizes CSV files more intuitively in the form of a table (see Figure 1.5). We will cover the basic use
of Python for analyzing CSV files in Data Science with Python.
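Before turning to spreadsheets or Pandas, it can help to see that a CSV file is easy to read with Python's built-in csv module alone. The following sketch uses an in-memory stand-in for a file such as ch1-courseEvaluations.csv; the rows shown are assumptions for illustration.

```python
# A minimal sketch of reading CSV data with the standard csv module.
import csv
import io

# Stand-in text for a small CSV file (hypothetical rows)
csv_text = """Semester,Instructor,Class Size,Rating
Fall 2020,A,100,Not recommended at all
Spring 2021,B,50,Highly recommended
"""

# DictReader maps each line to a dict keyed by the header row
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)
print(len(rows))              # number of items
print(rows[0]["Instructor"])  # value of one attribute of the first item
```

With a real file you would pass `open("ch1-courseEvaluations.csv")` to `csv.DictReader` instead of the `io.StringIO` stand-in.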

Figure 1.5 ch1-courseEvaluations.csv Opened with Microsoft Excel (Used with permission from Microsoft)

HOW TO DOWNLOAD AND OPEN A DATASET FROM THE CH1-DATA SPREADSHEET IN THIS TEXT

A spreadsheet file accompanies each chapter of this textbook. The files include multiple tabs, each
corresponding to a single dataset in the chapter. For example, the spreadsheet file for this chapter
(https://openstax.org/r/spreadsheet4) is shown in Figure 1.6. Notice that it includes multiple tabs with the
names “ch1-courseEvaluations.csv,” “ch1-cancerdoc.csv,” and “ch1-iris.csv,” which are the names of each dataset.

Figure 1.6 The dataset spreadsheet file for Chapter 1 (Used with permission from Microsoft)

To save each dataset as a separate CSV file, choose the tab of your interest and select File > Save As ... > CSV
File Format. This will only save the current tab as a CSV file. Make sure the file name is set correctly; it may
have used the name of the spreadsheet file—“ch1-data.xlsx” in this case. You should name the generated
CSV file as the name of the corresponding tab. For example, if you have generated a CSV file for the first tab
of “ch1-data.xlsx,” make sure the generated file name is “ch1-courseEvaluations.csv.” This will prevent future
confusion when following instructions in this textbook.

JavaScript Object Notation (JSON)


JSON uses the syntax of a programming language named JavaScript. Specifically, it follows JavaScript’s object
syntax. Don’t worry, though! You do not need to know JavaScript to understand the JSON format.

Figure 1.7 provides an example of the JSON representation of the same dataset depicted in Figure 1.6.

Figure 1.7 CourseEvaluations.json Opened with Visual Studio Code

Notice that the JSON format starts and ends with a pair of curly braces ({}). Inside, there are multiple pairs of
two fields that are separated by a colon (:). The fields placed on the left and right of the colon
are called a key and a value, respectively—key : value. For example, the dataset in Figure 1.7 has six pairs of
key-values with the key "Semester": "Semester": "Fall 2020", "Semester": "Spring 2021", "Semester": "Fall
2021", "Semester": "Spring 2022", "Semester": "Fall 2022", and "Semester": "Spring 2023".

CourseEvaluations.json (https://openstax.org/r/filed1v) has one key-value pair at the highest level:
"Members": [...]. You can see that each item of the dataset is listed in the form of an array or list under the
key "Members". Inside the array, each item is also bound by curly braces and has a list of key-value pairs,
separated by commas. Keys are used to describe attributes in the dataset, and values are used to define the
corresponding values. For example, the first item in the JSON dataset above has four keys, each of which maps
to an attribute—Semester, Instructor, Class Size, and Rating. Their values are "Fall 2020", "A", 100, and
"Not recommended at all", respectively.

Extensible Markup Language (XML)


The XML format is like JSON, but it lists each item of the dataset using different symbols named tags. An XML
tag is any block of text that consists of a pair of angle brackets (< >) with some text inside. Let’s look at the
example XML representation of the same dataset in Figure 1.8. Note that the screenshot of


CourseEvaluations.xml below only includes the first three items in the original dataset.

Figure 1.8 ch1-courseEvaluations.xml with the First Three Entries Only, Opened with Visual Studio Code

CourseEvaluations.xml lists each item of the dataset between a pair of tags, <members> and </members>.
Under <members>, each item is defined between <evaluation> and </evaluation>. Since the dataset in
Figure 1.8 has three items, we can see three blocks of <evaluation> ... </evaluation>. Each item has four
attributes, and they are defined as different XML tags as well—<semester>, <instructor>, <classsize>,
and <rating>. They are also followed by closing tags such as </semester>, </instructor>, </classsize>,
and </rating>.
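The same tag structure can be parsed with Python's standard xml.etree.ElementTree module. The XML below is retyped from the description of Figure 1.8, so the exact values are an assumption.

```python
# A sketch of parsing XML with the standard xml.etree.ElementTree module.
import xml.etree.ElementTree as ET

xml_text = """
<members>
  <evaluation>
    <semester>Fall 2020</semester>
    <instructor>A</instructor>
    <classsize>100</classsize>
    <rating>Not recommended at all</rating>
  </evaluation>
  <evaluation>
    <semester>Spring 2021</semester>
    <instructor>B</instructor>
    <classsize>50</classsize>
    <rating>Highly recommended</rating>
  </evaluation>
</members>
"""

root = ET.fromstring(xml_text)            # the <members> element
evaluations = root.findall("evaluation")  # one element per item
print(len(evaluations))                   # number of items
print(evaluations[0].find("semester").text)  # value inside one tag
```

Unlike JSON, where values keep their types (100 is a number), everything between XML tags comes back as text, so `classsize` would need an explicit `int(...)` conversion.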

PubMed datasets (https://openstax.org/r/pubmed1) provide lists of articles published by the National
Library of Medicine in XML format. Click Annual Baseline and download/open any .xml file. Note that the
.xml files are so large that they are compressed into .gz files. However, once you download one and attempt to
open it by double-clicking, the file will automatically be decompressed and opened. You will see many XML
tags along with information about numerous publications, such as publication venue, title, publication date, etc.

XML and Image Data


The XML format is also commonly used as an attachment to some image data. It is used to note
supplementary information about the image. For example, the Small Traffic Light Dataset
(https://openstax.org/r/traffic) in Figure 1.9 comes with a set of traffic light images, placed in one of the three
directories: JPEGImages, train_images, and valid_images. Each image directory is accompanied by
another directory just for annotations, such as Annotations, train_annotations, and
valid_annotations.

Figure 1.9 The Directory Structure of the Small Traffic Light Dataset

The annotation directories have a list of XML files, each of which corresponds to an image file with the same
filename inside the corresponding image directory (Figure 1.10). Figure 1.11 shows that the first XML file in the
Annotations directory includes information about the .jpg file with the same filename.
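Because each annotation file shares its filename stem with an image, pairing the two is a small path manipulation. The sketch below uses hypothetical paths modeled on the Small Traffic Light Dataset layout.

```python
# A sketch of pairing annotation XML files with images that share a
# filename stem. The directory and file names are hypothetical.
from pathlib import Path

image_dir = Path("JPEGImages")
annotation_dir = Path("Annotations")

def annotation_for(image_path: Path, annotation_dir: Path) -> Path:
    """Return the annotation file path that corresponds to an image file."""
    # .stem drops the extension; the annotation keeps the stem plus ".xml"
    return annotation_dir / (image_path.stem + ".xml")

example = annotation_for(image_dir / "2020-03-30 11_30_03.690871079.jpg",
                         annotation_dir)
print(example)
```

This same stem-matching convention appears in many image datasets that ship XML annotations alongside images.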

Figure 1.10 List of XML Files under the Annotations Directory in the Small Traffic Light Dataset
(source: “Small Traffic Light Dataset,” https://www.kaggle.com/datasets/sovitrath/small-traffic-light-dataset-xml-format)

Figure 1.11 2020-03-30 11_30_03.690871079.xml, an Example XML file within the Small Traffic Light Dataset (Source: “Small Traffic
Light Dataset,” https://www.kaggle.com/datasets/sovitrath/small-traffic-light-dataset-xml-format)


JSON and XML Dataset Descriptions


Both JSON and XML files often include a description of the dataset itself as well (known as metadata),
included as a separate entry in the file ({} or < >). In Figure 1.12 and Figure 1.13, the actual data
entries are listed inside “itemData” and <data>, respectively. The rest is used to provide background
information on the dataset. For example:

• “creationDateTime”: describes when the dataset was created.


• <name> is used to write the name of this dataset.
• <metadata> is used to describe each column name of the dataset along with its data type.

Figure 1.12 An Example JSON File with Metadata

Figure 1.13 An Example XML Dataset with Metadata
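A metadata-carrying JSON file of this kind can be read like any other JSON: the metadata entries sit next to the data entry and are accessed by key. The layout below is modeled loosely on Figure 1.12; field names other than "creationDateTime" and "itemData" are assumptions.

```python
# A sketch of a JSON dataset that carries metadata alongside its records.
import json

json_text = """
{
  "creationDateTime": "2024-01-15T09:30:00",
  "name": "courseEvaluations",
  "itemData": [
    {"Semester": "Fall 2020", "Rating": "Not recommended at all"},
    {"Semester": "Spring 2021", "Rating": "Highly recommended"}
  ]
}
"""

dataset = json.loads(json_text)
# Metadata entries describe the dataset; only "itemData" holds the records.
print(dataset["creationDateTime"])
print(len(dataset["itemData"]))
```

Keeping metadata at the top level like this is exactly what CSV makes difficult: a CSV file has nowhere natural to put a creation timestamp without disturbing the table itself.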

The Face Mask Detection (https://openstax.org/r/andrewmvd) dataset has a set of images of human faces with
masks on. It follows a similar structure as well. The dataset consists of two directories—annotations and
images. The former contains XML files, each of which includes a text description of the image with the
same filename. For example, “maksssksksss0.xml” includes information on “maksssksksss0.png.”

EXAMPLE 1.7

Problem

The Iris Flower dataset (ch1-iris.csv (https://openstax.org/r/filed)) is a classic dataset in the field of data
analysis.¹ Download this dataset and open it with a code editor (e.g., Sublime Text, Xcode, Visual Studio
Code). (If you do not have a code editor installed, we recommend that you install one. All three of these
editors are quick and easy to install.) Now, answer these questions:

• How many items are there in the dataset?


• How many attributes are there in the dataset?
• What is the second attribute in the dataset?

Solution

There are 151 rows in the dataset with the header row at the top, totaling 150 items. There are five
attributes listed across columns: sepal_length, sepal_width, petal_length, petal_width, species.
The second attribute is sepal_width.
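Counts like these can also be checked programmatically with the csv module. The sketch below uses a two-row stand-in for the file, since the download path varies; the real ch1-iris.csv has 150 data rows.

```python
# A sketch of counting rows and attributes in an Iris-style CSV.
import csv
import io

# Stand-in for ch1-iris.csv (only two of the 150 data rows shown)
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
"""

rows = list(csv.reader(io.StringIO(csv_text)))
header, items = rows[0], rows[1:]
print(len(header))   # number of attributes
print(header[1])     # the second attribute
print(len(items))    # number of items in this stand-in
```

Run against the real file (`open("ch1-iris.csv")` instead of the stand-in), the item count would come out to 150.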

EXAMPLE 1.8

Problem

The Jeopardy dataset (ch1-jeopardy.json (https://openstax.org/r/filed15)) is formatted in JSON. Download
and open it with a code editor (e.g., Notepad, Sublime Text, Xcode, Visual Studio Code).

• How many items are there in the dataset?


• How many attributes are there in the dataset?
• What is the third item in the dataset?

Solution

There are 409 items in the dataset, and each item has seven attributes: “category”, “air_date”,
“question”, “value”, “answer”, “round”, and “show_number”. The third item is located at index 2 of the first
list as shown in Figure 1.14.

Figure 1.14 The Third Item in the Jeopardy Dataset
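The same questions can be answered in code with the json module. The three records below are hypothetical stand-ins with the seven attributes named in the solution; the real ch1-jeopardy.json has 409 items.

```python
# A sketch of answering Example 1.8's questions programmatically.
import json

json_text = """
[
  {"category": "HISTORY", "air_date": "2004-12-31", "question": "q1",
   "value": "$200", "answer": "a1", "round": "Jeopardy!",
   "show_number": "4680"},
  {"category": "SCIENCE", "air_date": "2004-12-31", "question": "q2",
   "value": "$400", "answer": "a2", "round": "Jeopardy!",
   "show_number": "4680"},
  {"category": "SPORTS", "air_date": "2004-12-31", "question": "q3",
   "value": "$600", "answer": "a3", "round": "Jeopardy!",
   "show_number": "4680"}
]
"""

items = json.loads(json_text)
print(len(items))            # number of items
print(len(items[0]))         # number of attributes per item
print(items[2]["category"])  # the third item sits at index 2
```

Note that Python lists are zero-indexed, which is why the third item is at index 2, exactly as in the solution above.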

1 The Iris Flower dataset was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper “The Use of Multiple
Measurements in Taxonomic Problems.” This work became a landmark study in the use of multivariate data in classification
problems and frequently makes an appearance in data science as a convenient test case for machine learning and neural network
algorithms. The Iris Flower dataset is often used as a beginner's dataset to demonstrate various techniques, such as
classification algorithms, and is distributed in CSV format.


1.4 Using Technology for Data Science


Learning Outcomes
By the end of this section, you should be able to:
• 1.4.1 Explain how statistical software can help with data analysis.
• 1.4.2 Explain the uses of different programs and programming languages for data manipulation,
analysis, and visualizations.
• 1.4.3 Explain the uses of various data analysis tools used in data science applications.

Technology empowers data analysts, researchers, and organizations to leverage data, extract actionable
insights, and make decisions that optimize processes and improve outcomes in many areas of our lives.
Specifically, technology provides the tools, platforms, and algorithms that enable users to efficiently process
and analyze data—especially complex datasets. The choice of technology used for a data science project will
vary depending on the goals of the project, the size of the datasets, and the kind of analysis required.

Spreadsheet Programs
Spreadsheet programs such as Excel and Google Sheets are software applications consisting of electronic
worksheets with rows and columns where data can be entered, manipulated, and calculated. Spreadsheet
programs offer a variety of functions for data manipulation and can be used to easily create charts and tables.
Excel is one of the most widely used spreadsheet programs, and as part of Microsoft Office, it integrates well
with other Office products. Excel was first released by Microsoft in 1987, and it has become one of the most
popular choices for loading and analyzing tabular data in a spreadsheet format. You are likely to have used
Excel in some form or other—perhaps to organize the possible roommate combinations in your dorm room or
to plan a party, or in some instructional context. We refer to the use of Excel to manipulate data in some of the
examples of this text because sometimes a spreadsheet is simply the easiest way to work with certain
datasets. (See Appendix A: Review of Excel for Data Science for a review of Excel functionality.)

Google Sheets is a cloud-based spreadsheet program provided by Google as part of the Google Workspace.
Because it is cloud-based, it is possible to access spreadsheets from any device with an internet connection.
This accessibility allows for collaboration and real-time updates among multiple users, enhancing
communication within a team and making it ideal for team projects or data sharing among colleagues. Users
can leave comments, track changes, and communicate within the spreadsheet itself.

The user interfaces for both Excel and Google Sheets make these programs very user-friendly for many
applications. But these programs have some limitations when it comes to large databases or complex
analyses. In these instances, data scientists will often turn to a programming language or statistical package
such as Python, R, or SPSS.

Programming Languages
A programming language is a formal language that consists of a set of instructions or commands used to
communicate with a computer and to instruct it to perform specific tasks that may include data manipulation,
computation, and input/output operations. Programming languages allow developers to write algorithms,
create software applications, and automate tasks and are better suited than spreadsheet programs to handle
complex analyses.

Python and R are two of the most commonly used programming languages today. Both are open-source
programming languages. While Python started as a general-purpose language that covers various types of
tasks (e.g., numerical computation, data analysis, image processing, and web development), R is more
specifically designed for statistical computing and graphics. Both use simple syntax compared to conventional
programming languages such as Java or C/C++. They offer a broad collection of packages for data
manipulation, statistical analysis, visualization, machine learning, and complex data modeling tasks.

We have chosen to focus on Python in this text because of its straightforward and intuitive syntax, which
makes it especially easy for beginners to learn and apply. Also, Python skills can apply to a wide range of
computing work. Python also contains a vast network of libraries and frameworks specifically designed for
data analysis, machine learning, and scientific computing. As we’ll see, Python libraries such as NumPy
(https://openstax.org/r/nump), Pandas (https://openstax.org/r/panda), Matplotlib (https://openstax.org/r/
matplot), and Seaborn (https://openstax.org/r/seabo) provide powerful tools for data manipulation,
visualization, and machine learning tasks, making Python a versatile choice for handling datasets. You’ll find
the basics of R and various R code examples in Appendix B: Review of R Studio for Data Science. If you want to
learn how to use R, you may want to practice with RStudio (https://openstax.org/r/posit1), a commonly used
software application to edit/run R programs.

PYTHON IN EXCEL

Microsoft recently launched a new feature for Excel named “Python in Excel.” This feature allows a user to
run Python code to analyze data directly in Excel. This textbook does not cover this feature, but instead
presents Python separately, as it is also crucial for you to know how to use each tool in its more commonly
used environment. If interested, refer to Microsoft’s announcement (https://openstax.org/r/youtu).

EXPLORING FURTHER

Specialized Programming Languages


Other programming languages are more specialized for a particular task with data. These include SQL,
Scala, and Julia, among others, as briefly described in this article on “The Nine Top Programming Languages
for Data Science (https://openstax.org/r/9top).”

Other Data Analysis/Visualization Tools


There are a few other data analysis tools that are strong in data visualization. Tableau (https://openstax.org/r/
tableau1) and PowerBI (https://openstax.org/r/micro) are user-friendly applications for data visualization. They
are known for offering more sophisticated, interactive visualizations for high-dimensional data. They also offer
a relatively easy user interface for compiling an analysis dashboard. Both allow users to run a simple data
analysis as well before visualizing the results, similar to Excel.

In general, data visualization aims to make complex data more understandable and usable. The tools and
technology described in this section offer a variety of ways to go about creating visualizations that are most
accessible. Refer to Visualizing Data for a deeper discussion of the types of visualizations—charts, graphs,
boxplots, histograms, etc.—that can be generated to help find the meaning in data.

EXPLORING FURTHER

Evolving Professional Standards


Data science is a field that is changing daily; the introduction of artificial intelligence (AI) has increased this
pace. Technological, social, and ethical challenges with AI are discussed in Natural Language Processing,
and ethical issues associated with the whole data science process, including the use of machine learning
and artificial intelligence, are covered in Ethics Throughout the Data Science Cycle. A variety of data science
professional organizations are working to define and update process and ethical standards on an ongoing
basis. Useful references may include the following:


• Initiative for Analytics and Data Science Standards (IADSS) (https://openstax.org/r/iadss1)


• Data Science Association (DSA) (https://openstax.org/r/datascienceassn1)
• Association of Data Scientists (ADaSci) (https://openstax.org/r/adasci1)

1.5 Data Science with Python


Learning Outcomes
By the end of this section, you should be able to
• 1.5.1 Load data to Python.
• 1.5.2 Perform basic data analysis using Python.
• 1.5.3 Use visualization principles to graphically plot data using Python.

Multiple tools are available for writing and executing Python programs. Jupyter Notebook is one convenient
and user-friendly tool. The next section explains how to set up the Jupyter Notebook environment using
Google Colaboratory (Colab) and then provides the basics of two open-source Python libraries named Pandas
and Matplotlib. These libraries are specialized for data analysis and data visualization, respectively.

EXPLORING FURTHER

Python Programming
In the discussion below, we assume you are familiar with basic Python syntax and know how to write a
simple program using Python. If you need a refresher on the basics, please refer to Das, U., Lawson, A.,
Mayfield, C., & Norouzi, N. (2024). Introduction to Python Programming. OpenStax. https://openstax.org/
books/introduction-python-programming/pages/1-introduction (https://openstax.org/r/page1).

Jupyter Notebook on Google Colaboratory


Jupyter Notebook is a web-based environment that allows you to run a Python program more interactively,
combining programming code, math equations, visualizations, and plain text. There are multiple web applications
or software you could use to edit a Jupyter Notebook, but in this textbook we will use Google’s free application
named Google Colaboratory (Colab) (https://openstax.org/r/colab1), often abbreviated as Colab. It is a cloud-
based platform, which means that you can open, edit, run, and save a Jupyter Notebook on your Google Drive.

Setting up Colab is simple. On your Google Drive, click New > More. If Colab has already been installed in
your Google Drive, you will see Colaboratory under More. If not, click “Connect more apps” and install Colab by
searching for “Colaboratory” in the app store (Figure 1.15). For further information, see the Google Colaboratory
Ecosystem (https://openstax.org/r/1pp) animation.

Figure 1.15 Install Google Colaboratory (Colab)

Now click New > More > Google Colaboratory. A new, empty Jupyter Notebook will show up as in Figure 1.16.

Figure 1.16 Google Colaboratory Notebook

The gray area with the play button is called a cell. A cell is a block where you can type either code or plain text.
Notice that there are two buttons on top of the first cell—“+ Code” and “+ Text.” These two buttons add a code
or text cell, respectively. A code cell is for the code you want to run; a text cell is to add any text description or
note.

Let’s run a Python program on Colab. Type the following code in a code cell.

PYTHON CODE

print ("hello world!")

The resulting output will look like this:

hello world!

You can write a Python program across multiple cells and put text cells in between. Colab treats all the
code cells as part of a single program, running from the top to the bottom of the current Jupyter Notebook. For
example, the two code cells below run as if they were a single program.

When running one cell at a time from the top, we see the following outputs under each cell.

PYTHON CODE


a = 1
print ("The a value in the first cell:", a)

The resulting output will look like this:

The a value in the first cell: 1

PYTHON CODE

b = 3
print ("a in the second cell:", a)
print ("b in the second cell:", b)
a + b

The resulting output will look like this:

a in the second cell: 1


b in the second cell: 3
4

CONVENTIONAL PYTHON VERSUS JUPYTER NOTEBOOK SYNTAX

While conventional Python syntax requires print() to print something to the program console,
Jupyter Notebook does not require print(). In a Jupyter Notebook, the line a+b instead of print(a+b)
also prints the value of a+b as an output. But keep in mind that only the value of the expression on the
last line of a cell is displayed this way; earlier lines without print() will not show their values.

You can also run multiple cells in bulk. Click Runtime on the menu, and you will see there are multiple ways of
running multiple cells at once (Figure 1.17). The two commonly used ones are “Run all” and “Run before.” “Run
all” runs all the cells in order from the top; “Run before” runs all the cells before the currently selected one.

Figure 1.17 Multiple Ways of Running Cells on Colab



One thing to keep in mind is that splitting a long program into multiple blocks and running one block at a
time increases the chance of user error. Let’s look at a modified version of the code from the previous example.

PYTHON CODE

a = 1
print ("the value in the first cell:", a)

The resulting output will look like this:

the value in the first cell: 1

PYTHON CODE

b = 3
print ("a in the second cell:", a)
print ("b in the second cell:", b)
a + b

The resulting output will look like this:

a in the second cell: 1


b in the second cell: 3
4

PYTHON CODE

a = 2
a + b

The resulting output will look like this:

5

The modified code has an additional cell at the end, updating a from 1 to 2. Notice that now a+b returns 5 as a
has been changed to 2. Now suppose you need to run the second cell for some reason, so you run the second
cell again.


PYTHON CODE

a = 1
print ("the a value in the first cell:", a)

The resulting output will look like this:

the a value in the first cell: 1

PYTHON CODE

b = 3
print ("a in the second cell:", a)
print ("b in the second cell:", b)
a + b

The resulting output will look like this:

a in the second cell: 2


b in the second cell: 3
5

PYTHON CODE

a = 2
a + b

The resulting output will look like this:

5

The value of a has changed to 2. This implies that the execution order of the cells matters! If you run the
third cell before the second cell, a keeps the value assigned in the third cell even though the third cell
is located below the second one. Therefore, it is recommended to use “Run all” or “Run before” after you make
changes across multiple cells of code. This way your code is guaranteed to run sequentially from the top.

Python Pandas
One of the strengths of Python is that it includes a variety of free, open-source libraries. A library is a set of
already-implemented methods that a programmer can call, avoiding the need to build common functions
from scratch.

Pandas is a Python library specialized for data manipulation and analysis, and it is very commonly used among
data scientists. It offers a variety of methods that data scientists can quickly apply to data analysis.
You will learn how to analyze data using Pandas throughout this textbook.

Colab already has Pandas installed, so you just need to import Pandas and you are set to use all the methods
in Pandas. Note that it is convention to abbreviate pandas to pd so that when you call a method from Pandas,
you can do so by using pd instead of having to type out Pandas every time. It offers a bit of convenience for a
programmer!

PYTHON CODE

# import Pandas and assign an abbreviated identifier "pd"


import pandas as pd

EXPLORING FURTHER

Installing Pandas on Your Computer


If you wish to install Pandas on your own computer, refer to the installation page of the Pandas website
(https://openstax.org/r/pyd).

Load Data Using Python Pandas


The first step of data analysis is to load the data of interest into your Notebook. Let’s create a folder on
Google Drive where you can keep a CSV file for the dataset and a Notebook for data analysis. Download a
public dataset, ch1-movieprofit.csv (https://openstax.org/r/filed), and store it in a Google Drive folder. Then
open a new Notebook in that folder by entering that folder and clicking New > More > Google Colaboratory.

Open the Notebook and allow it to access files in your Google Drive by following these steps:

First, click the Files icon on the side tab (Figure 1.18).

Figure 1.18 Side Tab of Colab

Then click the Mount Drive icon (Figure 1.19) and select “Connect to Google Drive” on the pop-up window.


Figure 1.19 Features under Files on Colab

Notice that a new cell has been inserted on the Notebook as a result (Figure 1.20).

Figure 1.20 An Inserted Cell to Mount Your Google Drive

Connect your Google Drive by running the cell, and now your Notebook can access all the files under
/content/drive. Navigate the folders under drive to find your Notebook and ch1-movieprofit.csv
(https://openstax.org/r/filed) files. Then click “…” > Copy Path (Figure 1.21).

Figure 1.21 Copying the Path of a CSV File Located in a Google Drive Folder

Now replace [Path] with the copied path in the below code. Run the code and you will see the dataset has been
loaded as a table and stored as a Python variable data.

PYTHON CODE

# import Pandas and assign an abbreviated identifier "pd"


import pandas as pd

data = pd.read_csv("[Path]")
data

The resulting output will look like this:

The read_csv() method in Pandas loads a CSV file and stores it as a DataFrame. A DataFrame is a data type
that Pandas uses to store multi-column tabular data. Therefore, the variable data holds the table in
ch1-movieprofit.csv (https://openstax.org/r/filed) in the form of a Pandas DataFrame.

DATAFRAME VERSUS SERIES

Pandas defines two data types for tabular data—DataFrame and Series. While DataFrame is used for multi-
column tabular data, Series is used for single-column data. Many methods in Pandas support both
DataFrame and Series, but some are only for one or the other. It is always good to check if the method you
are using works as you expect. For more information, refer to the Pandas documentation
(https://openstax.org/r/docs) or Das, U., Lawson, A., Mayfield, C., & Norouzi, N. (2024). Introduction to
Python Programming. OpenStax. https://openstax.org/books/introduction-python-programming/pages/
1-introduction (https://openstax.org/r/page1).
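As a quick, runnable illustration of the distinction, the following sketch builds a small hand-made table rather than loading a CSV; the column names and values are invented for the example.

```python
import pandas as pd

# a tiny hand-made table standing in for a loaded CSV
df = pd.DataFrame({
    "Title": ["Avatar", "Titanic", "Tootsie"],
    "Year": [2009, 1997, 1982],
})

# the whole table is a DataFrame; one column of it is a Series
print(type(df).__name__)          # DataFrame
print(type(df["Year"]).__name__)  # Series
```

Many Pandas methods behave slightly differently on the two types, so a quick `type()` check like this is a handy habit when a method does not behave as expected.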

EXAMPLE 1.9

Problem

Remember the Iris dataset we used in Data and Datasets? Load the dataset ch1-iris.csv
(https://openstax.org/r/filed) to a Python program using Pandas.

Solution

The following code loads the ch1-iris.csv (https://openstax.org/r/filed) that is stored in a Google Drive. Make
sure to replace the path with the actual path to ch1-iris.csv (https://openstax.org/r/filed) on your Google
Drive.


PYTHON CODE

import pandas as pd

data = pd.read_csv("[Path to ch1-iris.csv]") # Replace the path


data

The resulting output will look like this:

EXPLORING FURTHER

Can I load a file that is uploaded to someone else’s Google Drive and shared with me?
Yes! This is useful especially when your Google Drive runs out of space. Simply add the shortcut of the
shared file to your own drive. Right-click > Organize > Add Shortcut will let you select where to store the
shortcut. Once done, you can call pd.read_csv() using the path of the shortcut.

Summarize Data Using Python Pandas


You can compute basic statistics for data quite quickly by using the DataFrame.describe() method. Add and
run the following code in a new cell. It calls the describe() method upon data, the DataFrame we defined
earlier with ch1-movieprofit.csv (https://openstax.org/r/filed).
40 1 • What Are Data and Data Science?

PYTHON CODE

data = pd.read_csv("[Path to ch1-movieprofit.csv]")


data.describe()

The resulting output will look like this:

describe() returns a table whose columns are the numeric columns of the dataset and whose rows
are different statistics. The statistics include the number of non-null values in a column (count), mean (mean),
standard deviation (std), minimum and maximum values (min/max), and the quartiles
(25%/50%/75%), which you will learn about in Measures of Variation. Using this representation, you can
read off these statistics for different columns easily.
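To make the meaning of each summary row concrete, here is a sketch with a single invented numeric column, small enough to verify the statistics by hand:

```python
import pandas as pd

df = pd.DataFrame({"Rating": [7.0, 8.0, 9.0, 8.0]})
summary = df.describe()

# the summary rows are indexed by the statistic's name
print(summary.loc["count", "Rating"])  # 4.0 (number of non-null values)
print(summary.loc["mean", "Rating"])   # 8.0
print(summary.loc["50%", "Rating"])    # 8.0 (the median)
```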

EXAMPLE 1.10

Problem

Summarize the Iris dataset from ch1-iris.csv (https://openstax.org/r/filed), which you loaded in the
previous example, using describe().

Solution

The following code in a new cell returns the summary of the dataset.

PYTHON CODE

data = pd.read_csv("[Path to ch1-iris.csv]")


data.describe()

The resulting output will look like this:


Select Data Using Python Pandas


The Pandas DataFrame allows a programmer to use the column name itself when selecting a column. For
example, the following code prints all the values in the “US_Gross_Million” column in the form of a Series
(remember the data from a single column is stored in the Series type in Pandas).

PYTHON CODE

data = pd.read_csv("[Path to ch1-movieprofit.csv]")

data["US_Gross_Million"]

The resulting output will look like this:

0 760.51
1 858.37
2 659.33
3 936.66
4 678.82
...
961 77.22
962 177.20
963 102.31
964 106.89
965 75.47
Name: US_Gross_Million, Length: 966, dtype: float64
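A related point worth knowing: selecting with a list of column names (double brackets) returns a multi-column DataFrame rather than a Series. A sketch on a hand-made table with invented values:

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Avatar", "Titanic"],
    "Year": [2009, 1997],
    "Rating": [7.9, 7.9],
})

one = df["Year"]             # a single name -> Series
two = df[["Title", "Year"]]  # a list of names -> DataFrame

print(type(one).__name__)  # Series
print(type(two).__name__)  # DataFrame
```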

DataFrame.iloc[] enables a more powerful selection—it lets a programmer select by both column and row,
using column and row indices. Let’s look at some code examples below.

PYTHON CODE

data.iloc[:, 2] # select all values in the column at index 2 (the third column)

The resulting output will look like this:

0 2009
1 2019
2 1997
3 2015
4 2018
...
961 2010
962 1982
963 1993
964 1999
965 2017
Name: Year, Length: 966, dtype: object

PYTHON CODE

data.iloc[2, :] # select all values in the row at index 2 (the third row)

The resulting output will look like this:

Unnamed: 0 3
Title Titanic
Year 1997
Genre Drama
Rating 7.9
Duration 194
US_Gross_Million 659.33
Worldwide_Gross_Million 2201.65
Votes 1,162,142
Name: 2, dtype: object

To pinpoint a specific value within the “US_Gross_Million” column, you can use an index number.

PYTHON CODE


print (data["US_Gross_Million"][0]) # index 0 refers to the top row


print (data["US_Gross_Million"][2]) # index 2 refers to the third row

The resulting output will look like this:

760.51
659.33
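Chained indexing such as data["US_Gross_Million"][0] is fine for reading a value, but a single .loc[row, column] lookup does the same job in one step and is generally the more idiomatic choice. A sketch with invented numbers:

```python
import pandas as pd

df = pd.DataFrame({"US_Gross_Million": [760.51, 858.37, 659.33]})

# two equivalent ways to read the value in the top row
chained = df["US_Gross_Million"][0]
direct = df.loc[0, "US_Gross_Million"]

print(chained, direct)  # 760.51 760.51
```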

You can also use DataFrame.iloc[] to select a specific group of cells in the table. There are many ways of
using iloc[]; the example code below introduces a couple of common ones. You will learn more techniques for
working with data throughout this textbook.

PYTHON CODE

data.iloc[:, 1] # select all values in the second column (index 1)

The resulting output will look like this:

0 Avatar
1 Avengers: Endgame
2 Titanic
3 Star Wars: Episode VII - The Force Awakens
4 Avengers: Infinity War
...
961 The A-Team
962 Tootsie
963 In the Line of Fire
964 Analyze This
965 The Hitman's Bodyguard
Name: Title, Length: 966, dtype: object

PYTHON CODE

data.iloc[[1, 3], [2, 3]]


# select the rows at index 1 and 3, the columns at index 2 and 3

The resulting output will look like this:
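The same iloc[] patterns can be tried end to end on a small hand-made table; the column names and values below are invented for the sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Avatar", "Endgame", "Titanic", "Tootsie"],
    "Year": [2009, 2019, 1997, 1982],
    "Rating": [7.9, 8.4, 7.9, 7.4],
})

col = df.iloc[:, 1]              # all rows, column at index 1 ("Year")
row = df.iloc[2, :]              # row at index 2, all columns
block = df.iloc[[1, 3], [0, 2]]  # rows 1 and 3, columns 0 and 2

print(list(col))     # [2009, 2019, 1997, 1982]
print(row["Title"])  # Titanic
print(block.shape)   # (2, 2)
```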



EXAMPLE 1.11

Problem

Select a “sepal_width” column of the IRIS dataset using the column name.

Solution

The following code in a new cell returns the “sepal_width” column.

PYTHON CODE

data = pd.read_csv("[Path to ch1-iris.csv]")

data["sepal_width"]

The resulting output will look like this:

0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
...
145 3.0
146 2.5
147 3.0
148 3.4
149 3.0
Name: sepal_width, Length: 150, dtype: float64

EXAMPLE 1.12

Problem

Select a “petal_length” column of the IRIS dataset using iloc[].


Solution

The following code in a new cell returns the “petal_length” column.

PYTHON CODE

data.iloc[:, 2]

The resulting output will look like this:

0 1.4
1 1.4
2 1.3
3 1.5
4 1.4
...
145 5.2
146 5.0
147 5.2
148 5.4
149 5.1
Name: petal_length, Length: 150, dtype: float64

Search Data Using Python Pandas


To search for the data entries that fulfill specific criteria (i.e., to filter), you can use DataFrame.loc[] of
Pandas. When you state the filtering criteria inside the brackets, [], the output returns the filtered rows
within the DataFrame. For example, the code below selects the rows whose genre is comedy. Notice that the
output only has 307 of the full 966 rows. You can check the output on your own, and you will see their
Genre values are all "Comedy."

PYTHON CODE

data = pd.read_csv("[Path to ch1-movieprofit.csv]")

data.loc[data['Genre'] == 'Comedy']

The resulting output will look like this:
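The filtering pattern is easy to verify on a small hand-made table; the titles and genres below are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Genre": ["Comedy", "Drama", "Comedy", "Action"],
})

# the comparison produces a True/False mask; loc[] keeps the True rows
comedies = df.loc[df["Genre"] == "Comedy"]

print(len(comedies))            # 2
print(list(comedies["Title"]))  # ['A', 'C']
```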



EXAMPLE 1.13

Problem

Using DataFrame.loc[], search for all the items of Iris-virginica species in the IRIS dataset.

Solution

The following code returns a filtered DataFrame whose species are Iris-virginica. All such rows show up as
an output.

PYTHON CODE

data = pd.read_csv("[Path to ch1-iris.csv]")

data.loc[data['species'] == 'Iris-virginica']

The resulting figure will look like this:


(Rows 109 through 149 not shown.)

EXAMPLE 1.14

Problem

This time, search for all the items whose species is Iris-virginica and whose sepal width is wider than 3.2.

Solution

You can use a Boolean expression—in other words, an expression that evaluates as either True or
False—inside data.loc[].

PYTHON CODE

data.loc[(data['species'] == 'Iris-virginica') & (data['sepal_width'] > 3.2)]

The resulting output will look like this:
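Conditions also combine with | (or) and ~ (not), and each condition must sit in its own parentheses. A sketch with invented measurements:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Iris-setosa", "Iris-virginica",
                "Iris-virginica", "Iris-versicolor"],
    "sepal_width": [3.5, 3.3, 3.0, 2.8],
})

# Iris-virginica OR sepal width above 3.4
either = df.loc[(df["species"] == "Iris-virginica") | (df["sepal_width"] > 3.4)]

# everything that is NOT Iris-virginica
rest = df.loc[~(df["species"] == "Iris-virginica")]

print(len(either))  # 3
print(len(rest))    # 2
```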



Visualize Data Using Python Matplotlib


There are multiple ways to draw plots of data in Python. The most common and straightforward way is to
import another library, Matplotlib, which is specialized for data visualization. Matplotlib is a huge library,
and to draw the plots you only need to import a submodule named pyplot.

Type the following import statement in a new cell. Note that it is conventional to refer to
matplotlib.pyplot as plt, just as Pandas is referred to as pd.

PYTHON CODE

import matplotlib.pyplot as plt

Matplotlib offers a method for each type of plot, and you will learn the Matplotlib methods for all of the
commonly used types throughout this textbook. In this chapter, however, let’s briefly look at how to draw a plot
using Matplotlib in general.

Suppose you want to draw a scatterplot between “US_Gross_Million” and “Worldwide_Gross_Million” of the
movie profit dataset (ch1-movieprofit.csv (https://openstax.org/r/filed)). You will investigate scatterplots in
more detail in Correlation and Linear Regression Analysis. The example code below draws such a scatterplot
using the method scatter(). scatter() takes the two columns of your interest—data["US_Gross_Million"]
and data["Worldwide_Gross_Million"]—as the inputs and assigns them for the x- and y-axes, respectively.

PYTHON CODE

data = pd.read_csv("[Path to ch1-movieprofit.csv]")


# draw a scatterplot using matplotlib’s scatter()


plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])

The resulting output will look like this:

Notice that it simply has a set of dots on a white plane. The plot itself does not show what each axis
represents, what the plot is about, and so on, which makes it difficult to tell what the plot shows. You can set
a title and axis labels with the following code. The resulting plot below indicates that there is a positive
correlation between domestic gross and worldwide gross.

PYTHON CODE

# draw a scatterplot
plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])

# set the title


plt.title("Domestic vs. Worldwide Gross")

# set the x-axis label


plt.xlabel("Domestic")

# set the y-axis label


plt.ylabel("Worldwide")

The resulting output will look like this:



You can also change the range of numbers along the x- and y-axes with plt.xlim() and plt.ylim(). Add
the following two lines of code to the cell in the previous Python code example, which plots the scatterplot.

PYTHON CODE

# draw a scatterplot
plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])

# set the title


plt.title("Domestic vs. Worldwide Gross")

# set the x-axis label


plt.xlabel("Domestic")

# set the y-axis label


plt.ylabel("Worldwide")

# set the range of values of the x- and y-axes


plt.xlim(1*10**2, 3*10**2) # x axis: 100 to 300
plt.ylim(1*10**2, 1*10**3) # y axis: 100 to 1,000

The resulting output will look like this:
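The whole recipe can also be run outside a notebook on synthetic data. The numbers below are invented, and plt.savefig() stands in for the inline display that Colab provides:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe outside notebooks
import matplotlib.pyplot as plt

# invented gross figures standing in for the movie dataset
domestic = [120, 150, 210, 260, 290]
worldwide = [300, 380, 520, 700, 810]

plt.scatter(domestic, worldwide)
plt.title("Domestic vs. Worldwide Gross")
plt.xlabel("Domestic")
plt.ylabel("Worldwide")
plt.xlim(100, 300)   # x-axis: 100 to 300
plt.ylim(100, 1000)  # y-axis: 100 to 1,000
plt.savefig("scatter.png")  # write the plot to a file
```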
