Ultimate Pandas for Data Manipulation and Visualization: Efficiently Process and Visualize Data with Python's Most Popular Data Manipulation Library (English Edition)
Ultimate Pandas for Data Manipulation and Visualization - Tahera Firdose
CHAPTER 1
Introduction to Pandas and Data Analysis
Introduction
In today’s data-driven era, organizations of all sizes and across various industries are faced with the challenge of extracting meaningful information from the vast amounts of data available to them. Making sense of this data requires powerful tools and techniques that enable efficient data manipulation, pre-processing, and exploration. This is where pandas truly shines.
We will dive deep into the capabilities of pandas, exploring its many functionalities for data manipulation, exploration, and analysis. We will start with the basics, learning how to load data into pandas from various sources, handle missing values, and clean messy datasets. From there, we will progress to more advanced techniques, such as reshaping and pivoting data, merging and joining datasets, and applying statistical computations.
Structure
In this chapter, we will cover the following essential topics that form the foundation of pandas and data analysis:
Overview of Pandas and Its Role in Data Analysis
Installation and Setup of Pandas
Introduction to IPython Notebooks and How They Integrate with Pandas
Understanding the Two Core Pandas Objects: Series and DataFrame
Understanding Data Types
Loading Data from Files and the Web
Overview of Pandas and Its Role in Data Analysis
Pandas, an open-source Python library, was first developed by Wes McKinney in 2008 while working at AQR Capital Management. Wes created pandas to address the limitations he encountered while working with data in Python, aiming to provide a powerful and efficient tool specifically designed for data manipulation and analysis.
Initially, pandas was primarily used in the financial industry, where it quickly gained traction due to its ability to handle large and complex datasets. Its intuitive data structures and comprehensive set of functionalities made it a game-changer for quantitative analysts, traders, and researchers who needed to process and analyze vast amounts of financial data efficiently.
Over time, pandas expanded beyond the financial sector and gained popularity across various domains and industries. Today, it is widely used in academia, scientific research, marketing, social sciences, healthcare, and more. Any field that deals with data analysis, exploration, and pre-processing can benefit from pandas’ capabilities.
Pandas Popularity
The popularity of pandas can be attributed to several factors. First, its user-friendly interface and intuitive syntax make it accessible to both novice and experienced Python users. The DataFrame and Series data structures mimic the tabular structure of data, resembling what users are already familiar with in spreadsheets or SQL tables.
Furthermore, pandas’ rich set of functions and methods for data manipulation, cleaning, and analysis streamline the workflow of data professionals. It provides concise and efficient ways to handle common data tasks, allowing users to focus on the analysis itself rather than the intricacies of data manipulation.
The community support surrounding pandas has also contributed to its popularity. The open-source nature of the library has encouraged contributions from a vast number of developers worldwide. This has led to the rapid development of new features, bug fixes, and enhancements, ensuring that pandas stays up-to-date with the evolving needs of data analysts and scientists.
Moreover, the seamless integration of pandas with other popular libraries in the Python ecosystem, such as NumPy, Matplotlib, and scikit-learn, has further propelled its popularity. This integration allows users to combine the strengths of different libraries, enabling powerful data analysis, visualization, and machine-learning workflows.
Advantages of Pandas over Traditional Data Analysis Methods
Here are the advantages of Pandas over traditional data analysis methods:
Efficient Data Handling: Pandas provides highly efficient data structures, such as DataFrames and Series, which are optimized for handling large datasets. These structures allow for fast data manipulation operations, such as filtering, aggregation, and sorting, resulting in improved performance compared to traditional methods like manual looping or using spreadsheets.
Broad Data Format Support: Unlike traditional methods that often rely on specific data formats, Pandas supports a wide range of data formats, including CSV, Excel, SQL databases, and JSON. This versatility enables seamless integration and analysis of data from various sources, eliminating the need for manual data conversion or preprocessing.
Advanced Data Manipulation: Pandas offers a rich set of functions and methods for data manipulation, transformation, and cleaning. It provides easy-to-use functionalities for handling missing values, reshaping data, merging datasets, and performing complex operations, reducing the complexity and time required for data preprocessing.
Time Series Analysis: Pandas provides specialized tools and functions for working with time series data. It offers built-in support for time-based indexing, resampling, and time shifting operations, making it particularly well-suited for analyzing and modelling time-dependent data.
Integration with the Python Ecosystem: Pandas seamlessly integrates with other popular libraries in the Python ecosystem, such as NumPy, Matplotlib, and scikit-learn. This integration allows for efficient data exchange and collaboration between different tools, enhancing the capabilities and flexibility of data analysis workflows.
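To make the time-series advantage above concrete, here is a small, hypothetical sketch: the hourly sensor readings are made up for illustration, but the time-based indexing and resampling shown are standard pandas features.

```python
import pandas as pd

# Hypothetical data: 48 hourly readings starting on 2023-01-01
idx = pd.date_range("2023-01-01", periods=48, freq="h")
readings = pd.Series(range(48), index=idx)

# Resample the hourly series down to one mean value per day
daily_mean = readings.resample("D").mean()
print(daily_mean)
```

Because the index is a DatetimeIndex, `resample("D")` groups the readings by calendar day with no manual bucketing, which is exactly the kind of work that is tedious in spreadsheets or hand-written loops.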
Installation and Setup
Pandas requires Python 3.7 or a later version to run properly. It is recommended to use the latest stable version of Python available at the time of installation. Python 2.x is no longer supported by pandas (or actively maintained by the Python core team), so be sure to use Python 3.x.
Before installing Pandas, ensure that you have Python installed on your system. You can check the Python version by opening a command prompt or terminal and running the following command:
python --version
Figure 1.1: Python version
If you have Python installed and the version displayed is 3.7 or later, you meet the Python requirement to run Pandas. If you don’t have Python installed or have an older version, you can download and install the latest version of Python from the official Python website (https://www.python.org).
Once you have Python installed, you can proceed with installing Pandas using the appropriate method, such as pip or Anaconda.
Installing Pandas on Windows
To install Pandas on Windows, follow these steps:
Using pip:
Open the command prompt by pressing Win + R and typing cmd.
Enter the following command to install Pandas:
pip install pandas
Using Anaconda:
Download Anaconda from the official website (https://www.anaconda.com/products/individual) and run the installer.
Follow the installation instructions, selecting the desired options.
Open Anaconda Prompt from the Start menu.
Enter the following command to install Pandas:
conda install pandas
Installing Pandas on macOS
To install Pandas on macOS, follow these steps:
Using pip:
Open the terminal by going to "Applications >
Utilities >
Terminal".
Enter the following command to install Pandas:
pip install pandas
Installing Pandas on Linux
To install Pandas on Linux, follow these steps:
Using pip:
Open the terminal.
Enter the following command to install Pandas:
pip install pandas
If you’re using Pandas and it is already installed, but you want to update it to the latest version, use the following command:
pip install --upgrade pandas
IPython Notebooks and Their Integration with Pandas
IPython Notebooks, now known as Jupyter Notebooks, provide an interactive computing environment for creating and sharing documents that combine code, visualizations, and explanatory text. Jupyter Notebooks have become immensely popular in the data science community and seamlessly integrate with Pandas, a powerful data analysis library in Python.
Overview of IPython/Jupyter Notebooks:
Jupyter Notebooks are web-based environments that allow you to create and execute code, visualize data, and document your analysis in a single document.
The notebooks are organized into cells, each of which can contain code (Python, in this case), markdown text, or raw text.
Code cells can be executed independently, allowing for an interactive and iterative data analysis process.
Notebooks provide a rich interface that supports the inclusion of charts, tables, mathematical equations, images, and more.
Jupyter Notebooks foster reproducibility by combining code, visualizations, and explanations in a shareable format.
Installing Jupyter Notebooks
To install Jupyter Notebooks, you can follow these steps:
Ensure that you have Python installed on your system. You can download Python from the official website (https://www.python.org) and follow the installation instructions.
Open a command prompt or terminal.
Install Jupyter Notebooks using pip, which is a package manager for Python. Enter the following command:
pip install jupyter
Wait for the installation to complete. Jupyter Notebooks and its dependencies will be installed in your Python environment.
To check if Jupyter Notebook is already installed on your system, you can follow these steps:
Open a command prompt or terminal.
Type the following command and press Enter
jupyter notebook --version
If Jupyter Notebook is installed, the command will display the version number. For example, you might see something like this:
6.4.0
Let’s run Jupyter notebook, assuming you already have installed Anaconda.
Open the Anaconda Navigator application. You can typically find it in your system’s application launcher or start menu. Once opened, the Anaconda Navigator window will appear.
In the Anaconda Navigator window, you will see several tools and environments. Click the "Launch" button under the Jupyter Notebook tile. This action will open a new window or tab in your default web browser.
Figure 1.2: Anaconda navigator
The web browser will display the Jupyter Notebook interface. It will show a file browser on the left side and the list of available notebooks in the selected directory.
Figure 1.3: Jupyter Notebook
To create a new notebook, click the "New" button located at the top-right corner of the interface. From the drop-down menu, select "Python 3" to create a new Python notebook.
Figure 1.4: Create new Python file
The notebook dashboard will appear, showing the newly created notebook. It will have the file extension .ipynb. You can see the notebook’s name at the top, and it can be renamed by clicking the title.
Figure 1.5: New Notebook
In the notebook, you will find an empty cell where you can write and execute Python code.
To add a new cell, click the "+" button in the toolbar, or press the keyboard shortcut B (in command mode) to insert a cell below the currently selected cell.
You can change the cell type from "Code" to "Markdown" by selecting the appropriate option from the drop-down menu in the toolbar. Markdown cells allow you to include formatted text, headings, bullet points, and more.
You can write Python code in the cell and execute it by pressing Shift+Enter or by clicking the "Run" button in the toolbar.
To save the notebook, click the floppy disk icon in the toolbar or go to "File > Save and Checkpoint".
To exit the notebook, close the browser tab containing the notebook interface or go to "File > Close and Halt".
Understanding Pandas Objects: Series and DataFrame
In this section, we will explore the two core Pandas objects: Series and DataFrame. These are powerful tools for working with data in one or two dimensions, with labels and types. We will show you how to create them using Python.
Before we can work with Series and DataFrame, we need to import pandas, which is a library of useful functions and methods for data analysis. We can do this by typing: import pandas as pd. This will give us a shortcut to use pandas by typing pd before any pandas function or method.
import pandas as pd
Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, and more). It consists of two main components: the data and the index.
Data: The data component of a Series represents the values or elements that the Series holds. These values can be of any data type, such as numbers, text, or even more complex objects. The data can be provided using a NumPy array, a Python list, or a scalar value.
Index: It is a sequence of labels which identifies each element in the Series. By default, the index starts from 0 and increments by 1, but you can customize it.
Example 1: We will start with a basic example using a Python list. Suppose you have a list of weekly temperatures: [25, 28, 30, 26, 29, 31, 27]. Pandas offers a data structure called a Series, which is ideal for storing and working with this type of data.
temperatures = [25, 28, 30, 26, 29, 31, 27]
series = pd.Series(temperatures)
print(series)
Output:
Figure 1.6: Series output
Example 2: In this example, we are using a scalar value. Suppose you want to create a Series with the same value repeated multiple times. Let’s say you want a Series with the value 10 repeated 5 times.
value = 10
series = pd.Series(value, index=[0, 1, 2, 3, 4])
print(series)
Output:
Figure 1.7: Output: creating a series with repeated scalar value
This example demonstrates that the data component of the Series is the scalar value 10, which is repeated 5 times.
Index: The index component of a Series represents the labels or names assigned to each element in the Series. It helps to identify and access specific elements of the Series. By default, the index starts from 0 and increments by 1 for each element, but you can customize it to any sequence of labels.
Example 1: Using default index
Let’s consider the previous example of the temperature Series. The default index labels are assigned automatically when we create the Series.
temperatures = [25, 28, 30, 26, 29, 31, 27]
series = pd.Series(temperatures)
print(series)
Output:
Figure 1.8: Series with default index labels
In this example, the default index labels are 0, 1, 2, 3, 4, 5, and 6.
Example 2: Using custom index
Suppose you have a Series representing the ages of different people, and you want to assign custom labels to each age.
ages = [25, 30, 35, 28, 32]
index_labels = ['John', 'Jane', 'Mike', 'Emily', 'Alex']
series = pd.Series(ages, index=index_labels)
print(series)
Output:
Figure 1.9: Series with custom index labels
In this example, we assigned custom index labels (names) to each age in the Series, making it easier to identify the age of each person.
The data and index components together form a Series, where each element has both a value and a corresponding label. This makes it convenient to work with and access specific elements in the Series based on their labels.
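Since each element carries a label, we can look elements up by label directly. A short sketch, reusing the ages Series from the example above:

```python
import pandas as pd

ages = [25, 30, 35, 28, 32]
index_labels = ['John', 'Jane', 'Mike', 'Emily', 'Alex']
series = pd.Series(ages, index=index_labels)

# Access a single element by its label
print(series['John'])

# Access several elements at once by passing a list of labels
print(series[['Jane', 'Alex']])
```

Label-based lookup works whether the index is the default 0, 1, 2, … or a custom sequence of names, which is what makes Series more flexible than a plain Python list.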
DataFrame
A DataFrame in Pandas is a two-dimensional labeled data structure that can hold multiple columns. It can be thought of as a table or spreadsheet where each column represents a variable or attribute, and each row represents a specific observation or record.
A DataFrame consists of three main components: data, index, and columns.
Data: The data component of a DataFrame represents the actual values in the table. It can be created from various data structures, such as Python dictionaries, NumPy arrays, or other DataFrames.
Example 1: Creating a DataFrame from a Python dictionary:
data = {'Name': ['John', 'Jane', 'Mike'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)
Output:
Figure 1.10: Output: dataFrame created from a Python dictionary
In this example, we create a DataFrame named "df" from a Python dictionary. The dictionary keys represent column names (‘Name’, ‘Age’, ‘City’), and the corresponding values represent the data for each column. The resulting DataFrame has three columns: ‘Name’, ‘Age’, and ‘City’, and each row represents a person’s information.
Index: The index component of a DataFrame represents the labels assigned to each row. It helps to uniquely identify and access specific rows in the DataFrame. By default, Pandas assigns a numeric index starting from 0, but you can customize it with your own labels.
Example 2: Customizing the index labels of a DataFrame:
data = {'Name': ['John', 'Jane', 'Mike'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data, index=[‘A’, ‘B’, ‘C’])
print(df)
Output:
Figure 1.11: Customizing the index labels of a DataFrame
In this example, we create a DataFrame named "df" with custom index labels (‘A’, ‘B’, ‘C’). Now each row in the DataFrame has a unique identifier based on the assigned index labels.
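Custom index labels pay off when selecting rows. A brief sketch, reusing the DataFrame above: `.loc` selects rows by index label, while `.iloc` selects them by integer position.

```python
import pandas as pd

data = {'Name': ['John', 'Jane', 'Mike'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data, index=['A', 'B', 'C'])

# Select the row labeled 'B' by its index label
print(df.loc['B'])

# Select the first row by integer position (the row labeled 'A')
print(df.iloc[0])
```

Both return a Series holding that row's values, with the column names as its index.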
Datatypes of Pandas
Pandas data structures, Series and DataFrame, can store different types of data, such as numbers, strings, booleans, and dates. In this section, we will learn how to use pandas datatypes in Series and DataFrame.
Defining Datatypes
Datatypes are the categories of data that tell us how the data is stored and what operations can be performed on it. For example, integers are a datatype that can store whole numbers and can be added, subtracted, multiplied, and so on. Strings are a datatype that can store text and can be concatenated, sliced, searched, and more.
Python has several built-in datatypes, such as int, float, str, bool, and so on. However, pandas borrows its datatypes from another Python library called NumPy, which is a library for scientific computing. NumPy has more datatypes than Python, such as int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, complex64, complex128, and so on. These datatypes allow us to specify the size and precision of the data.
Pandas also has some datatypes that are specific to pandas, such as datetime64, timedelta64, and category. These datatypes allow us to work with dates and times and categorical data.
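A small sketch of two of these pandas-specific datatypes: `pd.to_datetime` parses date strings into `datetime64` timestamps, and `dtype='category'` stores repeated labels compactly.

```python
import pandas as pd

# datetime64: parse strings into proper timestamps
dates = pd.to_datetime(pd.Series(['2023-01-01', '2023-06-15']))
print(dates.dtype)   # datetime64[ns]

# category: efficient storage for repeated labels
sizes = pd.Series(['small', 'large', 'small'], dtype='category')
print(sizes.dtype)   # category
```

Once a column is `datetime64`, date arithmetic and the time-series tools mentioned earlier (resampling, shifting) become available on it.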
Using the Datatypes of Pandas in Series and DataFrame
Pandas will automatically assign a suitable datatype to each column or Series based on the values in it. We can also specify our own datatype by using the dtype argument in the constructor.
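A quick sketch of both behaviors: pandas infers `int64` for whole numbers, and the `dtype` argument overrides that inference.

```python
import pandas as pd

# Without dtype, pandas infers int64 for whole numbers
s_default = pd.Series([1, 2, 3])
print(s_default.dtype)   # int64

# With dtype, we force 64-bit floats instead
s_float = pd.Series([1, 2, 3], dtype='float64')
print(s_float.dtype)     # float64
```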
Here are some examples of how to create and use different datatypes in pandas:
Object
The object datatype is used to store any type of data that is not numeric or boolean. It can store strings, mixed types, or Python objects. The object datatype is also used when pandas cannot infer a specific datatype for a column or Series.
For example:
# Create a Series of strings
s = pd.Series(['apple', 'banana', 'cherry'])
# Check the datatype of the Series
print(s.dtype)
Output:
Figure 1.12: Series with datatype object
We can also create a DataFrame with object columns by using a dictionary of lists or Series. For example:
# Create a DataFrame with object columns
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'gender': ['F', 'M', 'M'],
                   'hobby': ['reading', 'gaming', 'cooking']})
# Check the datatypes of all the columns
print(df.dtypes)
Output:
Figure 1.13: Dataframe with datatype object
Int64
The int64 datatype is used to store 64-bit integers. It can store whole numbers from -9223372036854775808 to 9223372036854775807. It is the default datatype for numeric columns or Series that do not have decimal points or missing values.
For example:
# Create a Series of integers
s = pd.Series([1, 2, 3, 4])
# Check the datatype of the Series
print(s.dtype)
Output:
Figure 1.14: Series with datatype integer64
We can also create a DataFrame with int64 columns by using a list of lists or a dictionary of lists or Series. For example:
# Create a DataFrame with int64 columns
df = pd.DataFrame({'id': [1, 2, 3],
                   'age': [25, 30, 35],
                   'score': [80, 90, 100]})
# Check the datatypes of all the columns
print(df.dtypes)
Output:
Figure 1.15: DataFrame with datatype integer64
Float64
The float64 datatype is used to store 64-bit floating-point numbers. It can store decimal numbers with up to 15 digits of precision. It is the default datatype for numeric columns or Series that have decimal points or missing values.
For example:
# Create a Series of floats
s = pd.Series([1.0, 2.5, 3.2])
# Check the datatype of the Series
print(s.dtype)
Output:
Figure 1.16: Series with datatype float64
We can also create a DataFrame with float64 columns by using a list of lists or a dictionary of lists or Series. For example:
# Create a DataFrame with float64 columns (np.nan marks missing values)
import numpy as np
df = pd.DataFrame({'price': [10.0, np.nan, 15.0],
                   'discount': [0.1, np.nan, np.nan],
                   'final_price': [9.0, np.nan, np.nan]})
# Check the datatypes of all the columns
print(df.dtypes)
Output:
Figure 1.17: DataFrame with datatype float64
Boolean
The boolean datatype is used to store True or False values. It can be used to represent logical conditions or binary choices. It is the default datatype for columns or Series that contain only True or False values.
For example:
# Create a Series of booleans
s = pd.Series([True, False, True])
# Check the datatype of the Series
print(s.dtype)
Output:
Figure 1.18: Series with datatype boolean
We can also create a DataFrame with bool columns by using a list of lists or a dictionary of lists or Series. For example,
# Create a DataFrame with bool columns
df = pd.DataFrame({'is_even': [True, False, True],
                   'is_positive': [True, True, False],
                   'is_prime': [False, True, False]})
# Check the datatypes of all the columns
print(df.dtypes)
Output:
Figure 1.19: DataFrame with datatype boolean
Loading Data from Files and the Web for Pandas
One of the most common tasks in data analysis is loading data from various sources, such as files and the web. Pandas provides several functions and methods to help you read and write data in different formats, such as CSV, Excel, JSON, HTML, and SQL.
In this section, we will explore the most common ways to load data using Pandas. Specifically, we will learn how to use the read_csv and read_excel functions to load data from CSV and Excel files, respectively. Additionally, we will learn how to use the read_html function to load data from web pages.
Loading Data from CSV Files Using pandas.read_csv()
Comma-Separated Values (CSV) is a common file format for storing tabular data. A CSV file consists of rows and columns separated by commas or other delimiters. Pandas provides the pandas.read_csv() function to read data from CSV files into a DataFrame object. A DataFrame is a two-dimensional table of data with rows and columns.
To use pandas.read_csv(), you need to pass the file path or file-like object as the first argument. You can also specify other optional arguments to customize the behavior of the function.
Here are some of the most commonly used parameters:
filepath_or_buffer: This parameter specifies the path of the CSV file to be read.
sep: This parameter specifies the delimiter used in the CSV file. The default value is ‘,’.
header: This parameter specifies which row of the CSV file should be used as the column names. The default value is 0.
index_col: This parameter specifies which column of the CSV file should be used as the index. The default value is None.
usecols: This parameter specifies which columns of the CSV file should be read into the DataFrame. The default value is None, which means all columns are read.
dtype: This parameter specifies the data type of each column in the DataFrame. The default value is None, which means pandas will try to infer the data types automatically.
skiprows: This parameter specifies how many rows should be skipped from the beginning of the CSV file. The default value is 0.
nrows: This parameter specifies how many rows should be read from the CSV file. The default value is None, which means all rows are read.
Here is an example:
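The sketch below is hypothetical: the CSV content is made up, and `io.StringIO` stands in for a real file path so the snippet runs on its own. The `sep`, `header`, `index_col`, and `nrows` arguments work exactly the same when the first argument is a path to a file on disk.

```python
import pandas as pd
from io import StringIO

# Hypothetical CSV content; StringIO stands in for a file on disk
csv_text = "id,name,score\n1,Alice,80\n2,Bob,90\n3,Charlie,100\n"

# Read only the first two rows, using the 'id' column as the index
df = pd.read_csv(StringIO(csv_text), sep=',', header=0,
                 index_col='id', nrows=2)
print(df)
```

With `index_col='id'`, the id values become the row labels, and `nrows=2` stops reading after the first two data rows, which is handy for peeking at a large file.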