Python by Example Book 2 (Data Manipulation and Analysis)
Book 2
August 2023
Disclaimer
This book has been created using various tools, including AI tools,
development tools, and other services. While the book's development
has involved the utilization of these tools, it is important to note that
the content has been planned and organized by the author.
Data manipulation is the process of transforming raw data into a more structured
and usable format, making it easier to extract meaningful insights and derive
valuable information. It encompasses a wide range of operations, including cleaning,
filtering, sorting, aggregating, and transforming data, modifying the structure or
content of data to meet specific requirements so that it is suitable for analysis and
interpretation. Data manipulation plays a vital role in the entire data analysis
workflow, from data preprocessing and cleaning to advanced analytics and modeling.
Data Cleaning: Real-world datasets are often noisy and may contain missing
or inconsistent values. Data manipulation allows us to clean and preprocess
the data, ensuring its accuracy and reliability.
Data Integration: In many scenarios, data is collected from multiple sources.
Data manipulation helps in integrating and merging data from different
sources to create a unified dataset for analysis.
Feature Engineering: Data manipulation allows us to create new features
from existing data, which can significantly improve the performance of
machine learning models.
One of the key reasons behind Python's popularity is its rich set of data
structures, which enable efficient and organized data manipulation. In this section,
we will explore Python's essential data structures, including lists, tuples, sets, and
dictionaries. Through coding examples, we will demonstrate the versatility and
power of these data structures in various scenarios.
Lists: Lists are one of the most fundamental data structures in Python, allowing
us to store collections of items in a sequential order. Lists are versatile, as they can
hold elements of different data types and can be modified after creation.
# Creating a list (values chosen so the slice matches the output below)
numbers = [1, 2, 10, 4, 5]
# List slicing
subset = numbers[1:4]
print(subset) # Output: [2, 10, 4]
Brief Description: The slice `numbers[1:4]` extracts the elements at indices 1
through 3 (the stop index is exclusive), producing the sublist `[2, 10, 4]`.
Tuples: Tuples are similar to lists, but they are immutable, meaning their
elements cannot be modified after creation. Tuples are used to represent fixed
collections of items that should not change throughout the program's execution.
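For example:
# Creating a tuple
colors = ('red', 'green', 'blue')
# Accessing elements
print(colors[0])   # Output: red
print(colors[-1])  # Output: blue
# Attempting to modify a tuple raises an error
colors[0] = 'yellow'  # TypeError: 'tuple' object does not support item assignment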
Brief Description:
1. Creating a Tuple: The "colors" tuple is created using parentheses and contains
the elements 'red', 'green', and 'blue'.
2. Accessing Elements: The code demonstrates how to access specific elements
within the tuple using indexing. `colors[0]` is used to access the first element,
which returns the output 'red', and `colors[-1]` accesses the last element,
returning 'blue'.
3. Tuple Immutability: The code showcases the immutability of tuples by
attempting to modify the element at index 0 using `colors[0] = 'yellow'`. Since
tuples cannot be changed after creation, this operation raises an error.
Sets: Sets are unordered collections of unique elements. They are useful for
performing mathematical operations like union, intersection, and difference
efficiently.
# Creating sets
set1 = {1, 2, 3, 4, 5}
set2 = {4, 5, 6, 7, 8}
# Union of sets
union_set = set1.union(set2)
print(union_set) # Output: {1, 2, 3, 4, 5, 6, 7, 8}
# Intersection of sets
intersection_set = set1.intersection(set2)
print(intersection_set) # Output: {4, 5}
# Difference of sets
difference_set = set1.difference(set2)
print(difference_set) # Output: {1, 2, 3}
Brief Description:
1. Creating Sets: Two sets, "set1" and "set2," are created using curly braces
and contain unique elements. "set1" includes elements 1, 2, 3, 4, and 5,
while "set2" includes elements 4, 5, 6, 7, and 8.
2. Union of Sets: The code showcases the union operation using the
`union()` method. The union of "set1" and "set2" combines all unique
elements from both sets, resulting in the output `{1, 2, 3, 4, 5, 6, 7, 8}`.
3. Intersection of Sets: The code demonstrates the intersection operation
using the `intersection()` method. The intersection of "set1" and "set2"
identifies the common elements present in both sets, yielding the output
`{4, 5}`.
4. Difference of Sets: The code showcases the difference operation using the
`difference()` method. The difference of "set1" and "set2" identifies the
elements that are present in "set1" but not in "set2," resulting in the
output `{1, 2, 3}`.
Dictionaries: Dictionaries store data as key-value pairs, allowing fast lookups by key.
# Creating a dictionary
student = {'name': 'John Doe', 'age': 26, 'grade': 'A'}
print(student)
# Output: {'name': 'John Doe', 'age': 26, 'grade': 'A'}
Brief Description: The "student" dictionary maps the keys 'name', 'age', and 'grade'
to their values, and printing it displays all of its key-value pairs.
Data structures are only useful if we can read and update their contents, and Python
handles these tasks efficiently. In this section, we will explore how to access and
modify data elements in lists, tuples, sets, and dictionaries.
Accessing and Modifying Elements in Lists: Lists are mutable data structures that
allow us to store a collection of items in a sequential order. Accessing and modifying
elements within a list is a straightforward process in Python.
# Creating a list
fruits = ['apple', 'banana', 'cherry', 'date']
# Accessing elements
print(fruits[0])   # Output: apple
print(fruits[-1])  # Output: date
# Modifying an element
fruits[1] = 'grape'
# List slicing
subset = fruits[1:4]
print(subset) # Output: ['grape', 'cherry', 'date']
Brief Description:
1. Creating a List: The "fruits" list is created using square brackets and contains
the elements 'apple', 'banana', 'cherry', and 'date'.
2. Accessing Elements: The code demonstrates how to access specific elements
within the list using indexing. For instance, `fruits[0]` retrieves the first
element, which is 'apple', and `fruits[-1]` retrieves the last element, which is
'date'.
Accessing Elements in Tuples: Tuples, unlike lists, are immutable, meaning their
elements cannot be changed after creation. Accessing elements within a tuple is
similar to accessing elements in a list.
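A minimal sketch (the tuple contents are illustrative):
# Creating a tuple
dimensions = (1920, 1080)
# Accessing elements by index
print(dimensions[0])   # Output: 1920
print(dimensions[-1])  # Output: 1080
# dimensions[0] = 1280 would raise a TypeError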
Brief Description:
Since tuples are immutable, trying to modify elements in a tuple will raise an
error.
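Adding and Removing Elements in Sets: Sets do not support indexing, but elements
can be added and removed with dedicated methods. A sketch consistent with the
description below:
# Creating a set
prime_numbers = {2, 3, 5, 7, 11}
# Adding an element
prime_numbers.add(13)
print(prime_numbers)  # Output: {2, 3, 5, 7, 11, 13}
# Removing an element
prime_numbers.remove(5)
print(prime_numbers)  # Output: {2, 3, 7, 11, 13}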
Brief Description:
1. Creating a Set: The "prime_numbers" set is created using curly braces and
contains the elements 2, 3, 5, 7, and 11. Since sets only store unique
elements, duplicate values are automatically removed.
2. Adding Elements: The code demonstrates how to add a new element to the
set using the `add()` method. The value 13 is added to the "prime_numbers"
set, resulting in `{2, 3, 5, 7, 11, 13}`.
3. Removing Elements: The code showcases how to remove a specific element
from the set using the `remove()` method. In this case, the element 5 is
removed from the "prime_numbers" set, resulting in `{2, 3, 7, 11, 13}`.
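Accessing and Modifying Values in Dictionaries: Dictionary values are read and
written through their keys. A sketch consistent with the description below
(Charlie's original score of 78 is an assumption):
# Creating a dictionary
student_scores = {'Alice': 85, 'Bob': 90, 'Charlie': 78, 'David': 92}
# Accessing a value by key
print(student_scores['Alice'])  # Output: 85
# Modifying a value
student_scores['Charlie'] = 82
# Adding a new key-value pair
student_scores['Eve'] = 88
print(student_scores)
# Output: {'Alice': 85, 'Bob': 90, 'Charlie': 82, 'David': 92, 'Eve': 88}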
Brief Description:
1. Creating a Dictionary: The "student_scores" dictionary is created using curly
braces, mapping each student's name to a score.
2. Accessing Values: A value is read through its key; for example,
`student_scores['Alice']` returns 85.
3. Modifying Values: The code updates an existing value by assigning to its key
with `student_scores['Charlie'] = 82`. After the modification, the dictionary
becomes `{'Alice': 85, 'Bob': 90, 'Charlie': 82, 'David': 92}`.
4. Adding New Key-Value Pairs: The code showcases how to add a new key-
value pair to the dictionary. A new key 'Eve' with the value 88 is added to the
"student_scores" dictionary using `student_scores['Eve'] = 88`. After the
addition, the dictionary becomes `{'Alice': 85, 'Bob': 90, 'Charlie': 82, 'David':
92, 'Eve': 88}`.
The "for" Loop: The "for" loop is commonly used for iterating over elements in
data structures like lists, tuples, sets, and dictionaries. It iterates through each item
in the collection and executes the associated block of code until all items have been
processed.
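For example:
# Creating a list
numbers = [1, 2, 3, 4, 5]
# Iterating with a "for" loop
for num in numbers:
    print(num)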
Brief Description:
1. Creating a List: The "numbers" list is created using square brackets and
contains the elements 1, 2, 3, 4, and 5.
2. Iterating with "for" Loop: The code uses a "for" loop to iterate over each
element in the "numbers" list. The "for" loop syntax is as follows: `for
element in list`. In this case, the loop iterates through the "numbers" list, and
the variable "num" takes on the value of each element during each iteration.
3. Printing the Elements: Inside the "for" loop, the code uses the `print()`
function to output each element to the console. The output will be the
numbers 1, 2, 3, 4, and 5, each printed on a new line.
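Dictionaries can be iterated in the same way; a sketch consistent with the
description below (the variable names are illustrative):
# Iterating over key-value pairs
scores = {'Alice': 85, 'Bob': 90, 'Charlie': 78}
for name, score in scores.items():
    print(f"{name} scored {score}")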
Brief Description: Iterating over a dictionary with `.items()` yields each key-value
pair in turn, and the loop prints each name with its corresponding score, like
"Alice scored 85," "Bob scored 90," and "Charlie scored 78."
The "while" Loop: The "while" loop executes a block of code repeatedly as long
as a specified condition is true. It is useful when the number of iterations is
uncertain, and the loop continues until the condition becomes false.
# Collecting the first 5 even numbers (a reconstruction consistent with the description below)
even_numbers = []
num = 1
while len(even_numbers) < 5:
    if num % 2 == 0:
        even_numbers.append(num)
    num += 1
print(even_numbers)  # Output: [2, 4, 6, 8, 10]
Brief Description:
1. Initialization: An empty list, "even_numbers," and a counter, "num," are created
before the loop starts.
2. Loop Condition: The "while" loop keeps running as long as fewer than 5 even
numbers have been collected (`len(even_numbers) < 5`).
3. Checking for Even Numbers: On each pass, the code tests `num % 2 == 0` and
appends even values to the list.
4. Advancing the Counter: "num" is incremented on every iteration so the loop
eventually terminates.
5. Printing the Result: Once the while loop exits (when 5 even numbers are found),
the "even_numbers" list is printed, displaying the first 5 even numbers.
Loop Control Statements: Python provides loop control statements like "break"
and "continue" to alter the flow of loops. "break" is used to exit the loop
prematurely, while "continue" skips the current iteration and moves to the next.
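A sketch matching the description below:
# Searching for a target value with "break" and a for-else clause
numbers = [10, 25, 5, 18, 30, 12]
target = 30
for num in numbers:
    if num == target:
        print("Target value found")
        break
else:
    print("Target value not found")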
Brief Description:
1. List and Target Value: The code creates a list named "numbers" containing
elements 10, 25, 5, 18, 30, and 12. It also sets the variable "target" to 30,
representing the value we want to find in the list.
2. "for" Loop: The code uses a "for" loop to iterate through each element in the
"numbers" list. During each iteration, the variable "num" takes on the value
of the current element in the list.
3. Comparing with Target: Inside the "for" loop, the code compares the value of
"num" with the "target" value using the condition `if num == target`. If a
match is found (the target value is equal to an element in the list), the code
prints a message indicating that the "target value" has been found and then
exits the loop using `break`.
4. "else" Block: If the "for" loop completes without finding the target value, the
code executes the "else" block, which prints the message "Target value not
found."
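Another common pattern is building a filtered list; a sketch consistent with the
description below (the list contents are illustrative):
# Selecting even numbers from a list
numbers = [10, 25, 5, 18, 30, 12]
even_numbers = []
for num in numbers:
    if num % 2 == 0:
        even_numbers.append(num)
print(even_numbers)  # Output: [10, 18, 30, 12]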
Brief Description:
1. Creating a List: The "numbers" list holds the values to be filtered.
2. Preparing the Result List: An empty list, "even_numbers," is created to collect
the even values.
3. "for" Loop: The code uses a "for" loop to iterate through each element in the
"numbers" list. During each iteration, the variable "num" takes on the value
of the current element in the list.
4. Checking for Even Numbers: Inside the "for" loop, the code checks if the
current element (num) is even using the condition `if num % 2 == 0`. If the
number is even (the remainder of the division by 2 is 0), it is appended to the
"even_numbers" list using `even_numbers.append(num)`.
5. Printing the Result: After the "for" loop completes, the "even_numbers" list is
printed, displaying the even numbers selected from the original "numbers"
list.
Working with Nested Lists: Lists can contain other lists; a nested list is a natural
way to represent two-dimensional data such as a matrix.
# Nested list
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
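# Flattening the matrix with nested loops (a sketch; compare the comprehension version below)
flattened_list = []
for row in matrix:
    for num in row:
        flattened_list.append(num)
print(flattened_list)  # Output: [1, 2, 3, 4, 5, 6, 7, 8, 9]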
Brief Description:
1. Nested List (2D List): The code creates a nested list named "matrix"
containing three sublists, each representing a row in the 2D matrix. Each
sublist contains three elements, forming a 3x3 matrix.
With a nested list comprehension, the same operation can be performed more
succinctly:
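flattened_list = [num for row in matrix for num in row]
print(flattened_list)  # Output: [1, 2, 3, 4, 5, 6, 7, 8, 9]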
Brief Description:
1. Nested List (2D List): The code creates a nested list named "matrix"
containing three sublists, each representing a row in the 2D matrix.
2. List Comprehension: The code uses a nested list comprehension to create the
"flattened_list." The list comprehension syntax is `[expression for item in
iterable for item2 in iterable2]`, where the "expression" is evaluated for each
combination of "item" and "item2" from the specified iterables. In this case,
the "expression" is simply `num`, which represents each element in the
nested "matrix," and the nested for loops iterate through each "row" in the
"matrix" and each "num" in the "row."
3. Flattening the List: The nested list comprehension iterates through each row
of the "matrix" using the first "for" loop (`for row in matrix`), and for each
"row," it iterates through each element "num" using the second "for" loop
(`for num in row`). The "num" variable represents each individual element in
the "matrix," and these elements are directly included in the "flattened_list."
4. Printing the Result: After the nested list comprehension is complete, the
"flattened_list" is printed, displaying the flattened 1D list containing all the
elements from the original 2D "matrix."
In addition to list comprehensions, Python provides two more powerful tools for
data transformations: dictionary comprehensions and set comprehensions.
Dictionary comprehensions allow us to create dictionaries with concise syntax, while
set comprehensions enable the creation of sets with unique elements effortlessly.
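As a sketch (the data here is illustrative), a dictionary comprehension can map each
number to its square, while a set comprehension collects unique results:
numbers = [1, 2, 3, 4, 5]
# Dictionary comprehension
squared_dict = {num: num ** 2 for num in numbers}
print(squared_dict)  # Output: {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
# Set comprehension
unique_squares = {num ** 2 for num in numbers}
print(unique_squares)  # Output: the unique squares (set order is arbitrary)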
Comprehensions can also include a condition. The following builds a dictionary that
keeps only the even numbers and maps each one to its square:
even_squared_dict = {
    num: num ** 2
    for num in numbers
    if num % 2 == 0
}
print(even_squared_dict)  # Output: {2: 4, 4: 16}
Brief Description: The comprehension iterates over "numbers," keeps only the values
for which `num % 2 == 0`, and maps each kept value to its square, producing
`{2: 4, 4: 16}`.
NumPy is one of the most important libraries for numerical computing in Python, and
several of its features make it especially well suited to data manipulation.
Fast and Efficient Operations: NumPy is built on top of highly optimized C and
Fortran libraries, enabling it to perform array operations much faster than standard
Python lists. These operations are implemented as low-level routines, making them
highly efficient and suitable for handling large datasets. The ability to perform
element-wise operations and array broadcasting allows for concise and expressive
code that operates on entire arrays at once, reducing the need for explicit loops and
improving performance.
Array Indexing and Slicing: NumPy provides flexible and powerful indexing and
slicing capabilities for accessing elements or subsets of an array. The indexing starts
from 0, similar to Python lists, and supports various slicing techniques, including
using slices, integer arrays, boolean arrays, and even fancy indexing. These features
make it easy to extract specific elements or subsets of data from large arrays,
enabling efficient data manipulations.
NumPy provides a powerful array object called "ndarray" that enables us to work
with multi-dimensional data efficiently. In this section, we will explore different
methods to create NumPy arrays and understand their flexibility and usefulness in
numerical computing.
Creating Arrays from Python Lists: One of the simplest ways to create a NumPy
array is by converting a Python list into an ndarray using the `numpy.array()`
function.
import numpy as np
# Converting a Python list into a NumPy array
data_list = [1, 2, 3, 4, 5]
numpy_array = np.array(data_list)
print(numpy_array)  # Output: [1 2 3 4 5]
In this example, we import NumPy as `np` for brevity. We then create a Python
list called `data_list` containing elements 1, 2, 3, 4, and 5. Using the `np.array()`
function, we convert the Python list into a NumPy array named `numpy_array`.
Creating Arrays of Zeros: The `numpy.zeros()` function creates an array of a given
shape filled with zeros. Let's consider an example:
import numpy as np
# Creating a 3x4 array of zeros
zeros_array = np.zeros((3, 4))
print(zeros_array)
In this example, we import NumPy as `np` and use the `np.zeros()` function to
create an array of zeros with shape (3, 4).
Creating Arrays with Sequences: NumPy provides functions to create arrays with
sequences of numbers. One such function is `numpy.arange()`, which creates an
array with a range of values. Let's consider an example:
import numpy as np
# Creating an array with values from 0 to 9
sequence_array = np.arange(10)
print(sequence_array)  # Output: [0 1 2 3 4 5 6 7 8 9]
In this example, we import NumPy as `np` and use the `np.arange()` function to
create an array with values from 0 to 9.
Creating Arrays with Random Values: NumPy's `random` module fills arrays with
random numbers; a sketch (the shape is illustrative):
import numpy as np
# Creating a 2x3 array of uniform random values in [0, 1)
random_array = np.random.rand(2, 3)
print(random_array)
Array indexing and slicing are powerful features of NumPy that allow us to access
and manipulate specific elements or subsets of elements in a NumPy array. In this
section, we will explore how to perform array indexing and slicing, providing
examples to demonstrate their utility in data manipulation and analysis.
Accessing Individual Elements: Elements are accessed by their zero-based index.
import numpy as np
# Creating an array and accessing an element by index
data_array = np.array([10, 20, 30, 40, 50])
element_at_index_2 = data_array[2]
print(element_at_index_2)  # Output: 30
In this example, we import NumPy as `np` and create a NumPy array called
`data_array` containing elements 10, 20, 30, 40, and 50. We then access the element
at index 2 using `data_array[2]`.
Array Slicing: A slice extracts a contiguous subset of elements.
import numpy as np
data_array = np.array([10, 20, 30, 40, 50])
# Slicing elements from index 1 to 4 (exclusive)
sliced_array = data_array[1:4]
print(sliced_array)  # Output: [20 30 40]
In this example, we import NumPy as `np` and create a NumPy array called
`data_array` with elements 10, 20, 30, 40, and 50. We then use slicing to extract a
subset of elements from index 1 to 4 (exclusive) using `data_array[1:4]`.
Array Slicing with Step: We can also use the `step` parameter in slicing to skip
elements and create subarrays with a specific interval.
import numpy as np
data_array = np.array([10, 20, 30, 40, 50])
# Slicing with a step of 2 (the step value is illustrative)
sliced_array = data_array[0:5:2]
print(sliced_array)  # Output: [10 30 50]
Modifying Array Elements using Slicing: Slicing can also be used to modify
elements of a NumPy array.
import numpy as np
data_array = np.array([10, 20, 30, 40, 50])
# Modifying elements from index 1 to 4 (exclusive) through a slice
data_array[1:4] = [25, 35, 45]
print(data_array)  # Output: [10 25 35 45 50]
In this example, we import NumPy as `np` and create a NumPy array called
`data_array` with elements 10, 20, 30, 40, and 50. We use slicing (`data_array[1:4]`)
to access elements from index 1 to 4 (exclusive) and modify them with the values
[25, 35, 45].
Element-wise Operations: Arithmetic operators act element by element on arrays of
the same shape.
import numpy as np
array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([10, 20, 30, 40, 50])
# Element-wise addition
result_addition = array1 + array2
print(result_addition)  # Output: [11 22 33 44 55]
In this example, we import NumPy as `np` and create two NumPy arrays, `array1`
and `array2`, with values [1, 2, 3, 4, 5] and [10, 20, 30, 40, 50] respectively. We
perform element-wise addition using the `+` operator (`array1 + array2`) and store
the result in `result_addition`.
import numpy as np
array = np.array([1, 2, 3, 4, 5])
# Broadcasting a scalar across the whole array
result_broadcasting = array * 10
print(result_broadcasting)  # Output: [10 20 30 40 50]
In this example, we import NumPy as `np` and create a NumPy array called
`array` with values [1, 2, 3, 4, 5]. We perform element-wise multiplication with a
scalar value (10) using the `*` operator (`array * 10`). NumPy automatically
broadcasts the scalar to match the shape of the array, and the result is stored in
`result_broadcasting`.
import numpy as np
array = np.array([1, 2, 3, 4, 5])
# Element-wise square root with a universal function
result_sqrt = np.sqrt(array)
print(result_sqrt)  # Output: the square root of each element
In this example, we import NumPy as `np` and create a NumPy array called
`array` with values [1, 2, 3, 4, 5]. We use the `np.sqrt()` function to perform element-
wise square root on the array and store the result in `result_sqrt`.
import numpy as np
array1 = np.array([1, 2, 3])
array2 = np.array([[10], [20], [30]])
# Broadcasting a (3,) array against a (3, 1) array
result_broadcasting = array1 * array2
print(result_broadcasting)
# Output:
# [[10 20 30]
#  [20 40 60]
#  [30 60 90]]
In this example, we import NumPy as `np` and create two NumPy arrays, `array1`
and `array2`, with values [1, 2, 3] and [[10], [20], [30]] respectively. We perform
element-wise multiplication with broadcasting (`array1 * array2`). NumPy broadcasts
the arrays to match their shapes and then performs the element-wise multiplication.
import pandas as pd
# Creating a Series (the element values are illustrative)
fruits = pd.Series(['apple', 'banana', 'cherry', 'date'])
print(fruits)
In this example, we import Pandas as `pd` and create a Series called `fruits` with
four elements. The output will display the Series along with its index.
import pandas as pd
# Creating a DataFrame from a dictionary (column contents are illustrative)
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28]}
df = pd.DataFrame(data)
print(df)
In this example, we import Pandas as `pd` and create a DataFrame `df` using a
dictionary `data`. Each key in the dictionary corresponds to a column name, and its
associated values form the column's data.
Accessing Data in Series and DataFrames: Both Series and DataFrames support
indexing and slicing for data retrieval. For Series, indexing is based on the provided
labels, while for DataFrames, it extends to both rows and columns. Let's see an
example:
import pandas as pd
# Label-based indexing on a Series (the labels are illustrative)
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])       # Output: 20
print(s['a':'b'])   # label slicing includes the endpoint
Indexing with Labels: Pandas provides the `loc` indexer to access data by labels,
both for rows and columns. Let's consider an example:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)
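# Accessing by label: second row, 'Name' column
print(df.loc[1, 'Name'])  # Output: Bob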
Here, we create a DataFrame `df` and use the `loc` indexer to access the value in
the second row and the 'Name' column.
Indexing with Position: Pandas also provides the `iloc` indexer for accessing data
by integer position. This is particularly useful when dealing with numeric indexing.
Let's see an example:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)
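# Accessing by position: third row, second column
print(df.iloc[2, 1])  # Output: 28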
In this example, we use the `iloc` indexer to access the value in the third row and
the second column.
Selecting Columns: You can easily select specific columns from a DataFrame by
providing their names in a list. Let's consider an example:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)
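# Selecting the 'Name' and 'Age' columns with double square brackets
selected_columns = df[['Name', 'Age']]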
print(selected_columns)
Here, we create a DataFrame `df` and select only the 'Name' and 'Age' columns
using double square brackets.
Conditional Selection: You can also use boolean conditions to filter data within a
DataFrame. This is particularly useful for extracting rows that meet specific criteria.
Let's see an example:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)
# Conditional selection
young_people = df[df['Age'] < 30]
print(young_people)
In the realm of data analysis, real-world datasets often come with imperfections,
such as missing or inconsistent data. Pandas equips you with powerful tools to clean
and handle these issues, ensuring that your data is accurate and ready for analysis.
Detecting Missing Values: Pandas provides the `isna()` and `isnull()` methods to
detect missing values within a DataFrame. Let's consider an example:
import pandas as pd
import numpy as np
# A DataFrame with missing values (the contents are illustrative)
data = {'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, np.nan, 28, 22]}
df = pd.DataFrame(data)
# Detecting missing values
missing_values = df.isna()
print(missing_values)
Here, we create a DataFrame `df` with missing values and use the `isna()` method
to create a boolean DataFrame that indicates the presence of missing values.
Handling Missing Values: Pandas provides several methods for handling missing
values. The `dropna()` method allows you to remove rows or columns with missing
values. The `fillna()` method lets you replace missing values with specified values or
strategies. Let's explore an example:
import pandas as pd
# Removing rows that contain missing values (using the DataFrame from above)
cleaned_df = df.dropna()
print(cleaned_df)
Filling Missing Values: You can use the `fillna()` method to replace missing values
with specified values or strategies. Let's see an example:
import pandas as pd
# Replacing missing values with a placeholder
filled_df = df.fillna('Unknown')
print(filled_df)
Here, we use the `fillna()` method to replace missing values with the string
'Unknown'.
Handling Missing Values with Strategies: You can also use strategies like mean,
median, or mode to fill missing values based on the distribution of the data. Let's
consider an example:
import pandas as pd
# Filling missing ages with the mean of the column
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
print(df)
In this example, we compute the mean of the 'Age' column using `.mean()` and
then fill the missing values with this mean using `.fillna()`.
In the process of data analysis, it's often essential to aggregate and summarize
data to gain insights and draw meaningful conclusions. Pandas provides powerful
tools for data aggregation and grouping, allowing you to efficiently analyze and
manipulate data based on specific criteria.
Grouping Data: Pandas allows you to group data based on one or more columns
using the `groupby()` function. This function creates a grouped object that can be
used for aggregation. Let's consider an example:
import pandas as pd
# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)
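# Grouping by the 'Category' column
grouped = df.groupby('Category')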
print(grouped)
Here, we create a DataFrame `df` and group the data based on the 'Category'
column using the `groupby()` function. The result is a grouped object that can be
used for further aggregation.
Aggregating Data: Once you have a grouped object, you can apply various
aggregation functions to compute summary statistics for each group. Common
aggregation functions include `sum()`, `mean()`, `max()`, `min()`, and more. Let's
explore an example:
import pandas as pd
# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)
# Mean value for each category
mean_values = df.groupby('Category')['Value'].mean()
print(mean_values)
In this example, we group the data by 'Category' and compute the mean value of
the 'Value' column for each group using `.mean()`.
import pandas as pd
# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)
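# Applying several aggregations to the 'Value' column
summary = df.groupby('Category')['Value'].agg(['sum', 'mean', 'max'])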
print(summary)
Here, we group the data by 'Category' and apply multiple aggregation functions
(`sum`, `mean`, `max`) to the 'Value' column using `.agg()`.
Custom Aggregation: You can also define custom aggregation functions using the
`agg()` function. This allows you to perform more complex calculations based on
specific requirements. Let's explore an example:
import pandas as pd
# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)
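# A custom aggregation: the spread (max - min) within each group
# (this particular function is an illustration; any callable works)
custom_summary = df.groupby('Category')['Value'].agg(lambda x: x.max() - x.min())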
print(custom_summary)
Data visualization is a powerful and essential tool in the field of data analysis and
interpretation. It involves the representation of data through visual elements such as
charts, graphs, and plots, with the primary goal of communicating complex
information in a more accessible and understandable format. Visualization goes
beyond mere aesthetics; it provides a means to uncover patterns, trends, and
insights that might otherwise remain hidden within raw data.
1. Line Charts: Used to display trends over time or a sequence of data points,
line charts are effective for showing continuous data patterns.
2. Bar Charts: These charts are suitable for comparing discrete categories or
data points, making them ideal for showcasing differences or trends.
3. Scatter Plots: Scatter plots depict the relationship between two variables,
helping to identify correlations, clusters, and outliers.
4. Pie Charts: Useful for illustrating parts of a whole, pie charts provide a visual
representation of proportions and percentages.
5. Histograms: Histograms visualize the distribution of continuous data by
grouping it into bins, allowing the analysis of frequency patterns.
6. Heatmaps: Heatmaps represent data values using color intensity, making
them effective for visualizing large datasets and correlations.
In this section, we delve into the practical realm of data visualization using
Matplotlib. We explore the creation of fundamental plot types, equipping you with
the skills to convey data insights effectively. Through concise examples and hands-on
experience, we'll uncover how to construct essential visualizations that lay the
foundation for more advanced techniques.
Line Plot: A line plot is a fundamental visualization type used to represent data
points with connected lines. It is suitable for illustrating trends over time or a
sequence of data points. Let's create a simple line plot:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
# Creating a line plot (label and title text are illustrative)
plt.plot(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple Line Plot')
plt.show()
In this example, we import Matplotlib as `plt` and create two lists, `x` and `y`,
representing data points. We use `plt.plot()` to create the line plot and `plt.xlabel()`,
`plt.ylabel()`, and `plt.title()` to add labels and a title. Finally, `plt.show()` displays the
plot.
Scatter Plot: A scatter plot is used to visualize the relationship between two
numerical variables. Each data point is represented as a dot, and patterns like
correlation or clustering become apparent. Let's create a scatter plot:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
# Creating a scatter plot (the color and marker choices are illustrative)
plt.scatter(x, y, color='red', marker='o')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple Scatter Plot')
plt.show()
In this example, we import Matplotlib as `plt`, create `x` and `y` lists, and use
`plt.scatter()` to generate the scatter plot. The parameters `color` and `marker`
customize the appearance of the dots. Labels and a title are added using
`plt.xlabel()`, `plt.ylabel()`, and `plt.title()`, followed by `plt.show()` to display the
plot.
Bar Chart: A bar chart is effective for comparing categorical data or discrete
values. It uses rectangular bars to represent data points, making it easy to compare
quantities across categories. Let's create a bar chart:
import matplotlib.pyplot as plt
# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 25, 15, 30, 20]
# Creating a bar chart (styling is illustrative)
plt.bar(categories, values)
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Simple Bar Chart')
plt.show()
Histogram: A histogram visualizes the distribution of continuous data by grouping
values into bins. Let's create a histogram:
# Sample data
data = [10, 25, 15, 30, 20, 40, 50, 35, 10, 25]
# Creating a histogram
plt.hist(data, bins=5, color='green', edgecolor='black')
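plt.title('Data Distribution')   # label and title text are illustrative
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()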
In this example, we import Matplotlib as `plt`, define a `data` list, and use
`plt.hist()` to create the histogram. The `bins` parameter specifies the number of
bins, and `color` and `edgecolor` customize the appearance. Labels and a title are
added, and the plot is displayed using `plt.show()`.
Adding Labels and Titles: Clear and descriptive labels and titles provide context
and guide the audience's understanding of a plot. Let's see how to add labels and
titles to a scatter plot:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
plt.scatter(x, y)
# Adding axis labels and a title (the text is illustrative)
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.title('A Descriptive Title')
plt.show()
In this example, we utilize `plt.xlabel()` and `plt.ylabel()` to add labels to the x and
y axes, respectively. The `plt.title()` function adds a title to the plot, enhancing its
context and clarity.
Customizing Colors and Styles: Matplotlib allows you to choose colors and styles
that align with your visualization's purpose and aesthetic. Let's customize the style
and color of a line plot:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
# Customizing color, line style, and markers (the choices are illustrative)
plt.plot(x, y, color='green', linestyle='--', marker='o', label='Data')
# Adding a legend
plt.legend()
plt.show()
Color Maps and Colorbars: Color maps are crucial for visualizing data with color
intensity. They are particularly useful for heatmaps and contour plots. Let's use a
color map and colorbar with a heatmap:
import numpy as np
import matplotlib.pyplot as plt
# Sample data
data = np.random.rand(5, 5)
# Displaying a heatmap with the 'viridis' color map
plt.imshow(data, cmap='viridis')
# Adding a colorbar
plt.colorbar()
plt.show()
In this example, we use `plt.imshow()` with the `cmap` parameter to apply the
'viridis' color map to the heatmap. The `plt.colorbar()` function adds a colorbar to
indicate the color mapping.
Styling Text and Annotations: Annotations and text enhance plot clarity by
providing additional context. Let's add annotations and text to a bar chart:
import matplotlib.pyplot as plt
# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 25, 15, 30, 20]
plt.bar(categories, values)
# Adding annotations above each bar
for i, v in enumerate(values):
    plt.text(i, v + 1, str(v), color='black', ha='center')
plt.show()
Data visualization often involves working with data stored in NumPy arrays and
Pandas DataFrames, which are powerful data structures commonly used in Python
for data manipulation and analysis. Matplotlib, a versatile plotting library, seamlessly
integrates with these structures to create insightful visualizations.
Plotting from NumPy Arrays: NumPy arrays provide a foundation for numerical
computing, and Matplotlib can visualize this data effectively. Let's create a simple
line plot from a NumPy array:
import numpy as np
import matplotlib.pyplot as plt
# Values evenly spaced between 0 and 10 (the point count is illustrative)
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y, label='sin(x)')
# Adding a legend
plt.legend()
plt.show()
Here, we generate a NumPy array `x` with values evenly spaced between 0 and
10. The `np.sin()` function calculates the sine of each value in `x`, creating a
sinusoidal curve. We then use `plt.plot()` to create a line plot from the NumPy
arrays.
Plotting from Pandas DataFrames: Matplotlib accepts DataFrame columns directly. A
sketch (the data is illustrative):
import pandas as pd
import matplotlib.pyplot as plt
data = {'X': [1, 2, 3, 4, 5],
        'Y': [10, 25, 15, 30, 20]}
df = pd.DataFrame(data)
plt.plot(df['X'], df['Y'], label='Y vs. X')
# Adding a legend
plt.legend()
plt.show()
Combining Plotting with NumPy and Pandas: Matplotlib can visualize data
derived from both NumPy arrays and Pandas DataFrames within the same plot. Let's
illustrate this by overlaying a line plot and scatter plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# NumPy-generated values wrapped in a DataFrame
x = np.linspace(0, 10, 50)
df = pd.DataFrame({'X': x, 'Y': np.sin(x)})
plt.plot(df['X'], df['Y'], label='Line')
# Overlaying selected points as red dots (every 10th point, as an illustration)
plt.scatter(df['X'][::10], df['Y'][::10], color='red', label='Selected points')
# Adding a legend
plt.legend()
plt.show()
In this example, we generate a Pandas DataFrame `df` with columns 'X' and 'Y'
containing the NumPy-generated values. The `plt.plot()` function creates a line plot,
and `plt.scatter()` overlays selected data points as red dots.
One frequently encounters the need to combine datasets, which can originate
from various sources or possess related information. Pandas provides powerful tools
for merging and joining data, allowing data professionals to seamlessly integrate
disparate datasets and unlock deeper insights. In this section, we'll explore the
techniques of data merging and joining using Pandas.
import pandas as pd
# Two DataFrames with matching columns (the contents are illustrative)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Concatenating along the rows
result = pd.concat([df1, df2])
print(result)
In this case, two DataFrames, `df1` and `df2`, are concatenated along the rows
using `pd.concat()`. The resulting DataFrame, `result`, contains all rows from both
input DataFrames.
import pandas as pd
# DataFrames sharing a 'key' column (the contents are illustrative)
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'value_left': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'value_right': [4, 5, 6]})
# Inner join on the 'key' column
merged_df = pd.merge(left, right, on='key', how='inner')
print(merged_df)
Here, the `pd.merge()` function performs an inner join on the 'key' column of the
`left` and `right` DataFrames, producing a merged DataFrame with only the matching
rows.
import pandas as pd
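# DataFrames with overlapping indices (the contents are illustrative)
left = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
right = pd.DataFrame({'B': [4, 5]}, index=['x', 'y'])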
# Joining on index
joined_df = left.join(right)
print(joined_df)
The `left.join(right)` operation joins the DataFrames based on their indices. Non-
matching indices result in NaN values, providing a consolidated view of data from
both DataFrames.
Data rarely conforms to a single structure, and effective data manipulation often
requires reshaping to facilitate analysis. Pandas provides powerful tools for
reshaping data, enabling data professionals to transform data between wide and
long formats seamlessly. In this section, we'll explore key reshaping techniques,
including pivoting, melting, and using `stack` and `unstack` methods.
import pandas as pd
# Long-format data (the contents are illustrative)
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'Variable': ['A', 'B', 'A', 'B'],
        'Value': [1, 2, 3, 4]}
df = pd.DataFrame(data)
# Pivoting: 'Date' as the index, 'Variable' as the columns, 'Value' as the values
pivot_df = df.pivot(index='Date', columns='Variable', values='Value')
print(pivot_df)
In this case, the `pivot()` method transforms the DataFrame `df` by using 'Date'
as the index, 'Variable' as the columns, and 'Value' as the values. This operation
creates a pivoted DataFrame, `pivot_df`, which provides a clearer view of the data.
import pandas as pd
# Wide-format data (the contents are illustrative)
data = {'Date': ['2023-01-01', '2023-01-02'],
        'A': [1, 3],
        'B': [2, 4]}
df = pd.DataFrame(data)
# Melting: 'Date' as the identifier, remaining columns into 'Variable'/'Value'
melted_df = pd.melt(df, id_vars='Date', var_name='Variable', value_name='Value')
print(melted_df)
The `melt()` function converts the wide-format DataFrame `df` into a long-format
DataFrame, `melted_df`, where 'Date' is the identifier variable, 'Variable' represents
the original column names, and 'Value' contains the corresponding values.
Stack and Unstack: The `stack()` and `unstack()` methods provide a dynamic way
to reshape data by moving levels of the DataFrame's column index to become the
row index or vice versa. Let's explore this concept:
import pandas as pd
# Wide-format data (the contents are illustrative)
data = {'Date': ['2023-01-01', '2023-01-02'],
        'A': [1, 3],
        'B': [2, 4]}
df = pd.DataFrame(data)
# Moving the columns into the row index, then back again
stacked_df = df.set_index('Date').stack()
unstacked_df = stacked_df.unstack()
print(unstacked_df)
In this example, `stack()` and `unstack()` are used to reshape the DataFrame.
Initially, 'Date' is set as the index using `set_index()`. Then, `stack()` converts
columns into rows, and `unstack()` reverses the process, restoring the original
DataFrame structure.
As datasets grow and evolve, the need to combine multiple DataFrames into a
cohesive structure becomes paramount. Pandas offers powerful methods for
combining DataFrames, allowing data professionals to seamlessly merge data from
various sources. In this section, we'll explore the techniques of concatenation and
appending, demonstrating how to merge DataFrames both vertically and
horizontally.
import pandas as pd
# Two DataFrames to combine vertically (the contents are illustrative)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
concatenated_df = pd.concat([df1, df2])
print(concatenated_df)
Here, the `pd.concat()` function is used to concatenate `df1` and `df2` vertically.
The resulting DataFrame, `concatenated_df`, combines the rows from both input
DataFrames.
import pandas as pd
# Two DataFrames to combine horizontally (the contents are illustrative)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})
# axis=1 places the columns side by side
concatenated_df = pd.concat([df1, df2], axis=1)
print(concatenated_df)
In this case, `pd.concat()` with `axis=1` concatenates `df1` and `df2` horizontally,
merging columns from both DataFrames. The resulting DataFrame,
`concatenated_df`, presents the combined information side by side.
import pandas as pd
# Appending rows (the contents are illustrative; DataFrame.append was removed
# in pandas 2.0, so pd.concat is used here)
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
appended_df = pd.concat([df1, df2], ignore_index=True)
print(appended_df)
Time and date data are fundamental elements in many real-world datasets,
providing context and structure to observations. Python offers robust libraries for
handling and manipulating time-related information, enabling data professionals to
effectively manage temporal data. In this section, we'll cover everything from
creating and formatting dates to performing arithmetic operations and handling
time zones.
import datetime
# Getting today's date
today = datetime.date.today()
# Formatting a date
formatted_date = today.strftime('%d-%m-%Y')
print(formatted_date) # Output: today's date in DD-MM-YYYY form
import datetime
# Two dates to compare (the dates themselves are illustrative)
date1 = datetime.date(2023, 7, 1)
date2 = datetime.date(2023, 7, 15)
difference = date2 - date1
print(difference.days)  # Output: 14
Here, we calculate the difference between `date2` and `date1`, which yields a
`timedelta` object. By accessing the `days` attribute, we obtain the difference in
days.
Working with `pandas` Timestamps: The `pandas` library extends time handling
capabilities with its `Timestamp` object, enhancing time series data manipulation.
Let's explore creating and indexing `Timestamps`:
import pandas as pd
# Creating a Timestamp
timestamp = pd.Timestamp('2023-07-01 09:00:00')
print(timestamp) # Output: 2023-07-01 09:00:00
Handling Time Zones: Time zones are crucial when dealing with global data.
`pandas` simplifies time zone handling, making it easier to work with diverse
temporal datasets:
import pandas as pd
# A timezone-aware Timestamp in UTC (the moment chosen is illustrative)
timestamp_utc = pd.Timestamp('2023-07-01 09:00:00', tz='UTC')
# Converting to Eastern Time
timestamp_eastern = timestamp_utc.tz_convert('US/Eastern')
print(timestamp_eastern)
Here, we create a `Timestamp` in UTC, then convert it to Eastern Time using the
`tz_convert()` function.
Creating a Time Series DataFrame: To begin, let's create a time series DataFrame
using Pandas. We'll generate a sample dataset with timestamped data points:
import pandas as pd
import numpy as np
# A daily date range used as the index (range and length are illustrative)
dates = pd.date_range(start='2023-07-01', periods=10, freq='D')
time_series_df = pd.DataFrame({'value': np.random.randn(10)}, index=dates)
print(time_series_df)
Here, we create a time range using `pd.date_range()` and use it as an index for a
DataFrame containing random data. This establishes a time series dataset for
exploration.
Indexing by Date and Time: Pandas allows indexing using specific dates or date
ranges. Let's demonstrate this by indexing data for a particular date:
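A sketch using the `time_series_df` created above (the date is illustrative):
print(time_series_df.loc['2023-07-03'])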
Using `.loc[]`, we can access data for a specific date, extracting the corresponding
row from the DataFrame.
Slicing Time Series Data: Slicing empowers us to extract specific time periods
from a time series. Let's slice the data for a range of dates:
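A sketch with an illustrative date range:
subset = time_series_df.loc['2023-07-02':'2023-07-05']
print(subset)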
By providing a date range as the index, we use slicing to extract data between
the specified dates, creating a new DataFrame.
Resampling Time Series Data: Resampling is useful for changing the frequency of
time series data. Let's demonstrate resampling by aggregating data to a weekly
frequency:
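A sketch (aggregating with the mean is an assumption; any aggregation works):
weekly_df = time_series_df.resample('W').mean()
print(weekly_df)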
Shifting Time Series Data: Shifting allows us to move data points forwards or
backwards in time. Let's shift our data by one time step:
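shifted_df = time_series_df.shift(1)
print(shifted_df)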
Using `shift()`, we displace the data by one time step, creating a DataFrame with
data points shifted.
Time series data often comes with varying frequencies, which can make analysis
and comparison challenging. Resampling, a crucial technique in time series analysis,
allows us to change the frequency of our data, enabling better insight extraction and
trend identification. In this section, we'll delve into resampling and frequency
conversion using the powerful Pandas library.
import pandas as pd
import numpy as np
# Daily data to resample (the range and values are illustrative)
dates = pd.date_range(start='2023-01-01', periods=90, freq='D')
ts = pd.Series(np.random.randn(90), index=dates)
# Downsampling to monthly frequency with different aggregations
monthly_resampled_sum = ts.resample('M').sum()
monthly_resampled_max = ts.resample('M').max()
print(monthly_resampled_sum)
print(monthly_resampled_max)
Handling Missing Data: Resampling can lead to missing data points, especially
when upsampling. Handling missing data is crucial for accurate analysis. Let's
address this using a combination of resampling and interpolation:
# Upsampling to a 6-hour frequency, then interpolating the gaps
upsampled_interpolated = ts.resample('6H').interpolate(method='linear')
print(upsampled_interpolated)
In this example, we upsample our data to a 6-hour frequency and employ linear
interpolation (`interpolate()`) to estimate missing values, enhancing the accuracy of
our upsampled dataset.
# Custom aggregation: the range (max - min) within each week
custom_resampled = ts.resample('W').apply(lambda x: x.max() - x.min())
print(custom_resampled)
Here, we define a custom resampling function that computes the range between
maximum and minimum values. We then apply this function to downsample our
data to a weekly frequency, gaining insights into the variability within each week.
Exploratory Data Analysis (EDA): Exploratory Data Analysis is the first step in
analyzing any dataset. Let's dive into EDA using Python and the Pandas library:
import pandas as pd
# Load a dataset
url = ('https://raw.githubusercontent.com/datasciencedojo/'
       'datasets/master/titanic.csv')
data = pd.read_csv(url)
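# Statistical summaries and missing-value counts
print(data.describe())
print(data.isnull().sum())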
Here, we load the Titanic dataset from an online source and perform basic
exploratory analysis. We display statistical summaries and identify missing values
using the Pandas library.
Data Visualization: Visualizing data is crucial for gaining insights. Let's use
Matplotlib and Seaborn to create visualizations:
import matplotlib.pyplot as plt
import seaborn as sns
# Create a histogram
plt.figure(figsize=(8, 5))
sns.histplot(data['Age'].dropna(), bins=20, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
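Feature Engineering: creating or transforming features can make a dataset more
informative. A sketch of the transformations described below (the numeric encoding
values are illustrative):
# Encode 'Sex' as numbers
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
# Combine siblings/spouses and parents/children counts into one feature
data['FamilySize'] = data['SibSp'] + data['Parch']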
Here, we encode the 'Sex' variable into numerical values and create a new
feature, 'FamilySize', by combining 'SibSp' and 'Parch'. These transformations
enhance the dataset's suitability for analysis.
Data Filtering: Filtering data allows us to focus on specific subsets. Let's filter
passengers who survived:
# Filter survivors
survivors = data[data['Survived'] == 1]
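# Summaries for the surviving passengers only
print(survivors.describe())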
Here, we filter the dataset to isolate survivors and display statistical summaries
specifically for this subset, aiding our understanding of survivor demographics.
Anomaly Detection: Anomalies are data points that deviate significantly from the
norm. Let's use Z-score to detect anomalies in the 'Fare' column:
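A sketch (the threshold of 3 standard deviations is a common convention):
import numpy as np
# Z-scores for the 'Fare' column
fare = data['Fare'].dropna()
z_scores = np.abs((fare - fare.mean()) / fare.std())
anomalies = fare[z_scores > 3]
print(anomalies)
Clustering: passengers can also be grouped by age and fare. A sketch with
scikit-learn (the cluster count is illustrative):
from sklearn.cluster import KMeans
features = data[['Age', 'Fare']].dropna()
kmeans = KMeans(n_clusters=3, n_init=10).fit(features)
clustered = data.loc[features.index].copy()
clustered['Cluster'] = kmeans.labels_
# Survival rate within each age/fare cluster
print(clustered.groupby('Cluster')['Survived'].mean())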
This code snippet analyzes the survival rates of passengers within each cluster,
offering insights into the relationship between age, fare clusters, and survival
outcomes.
The art of effective data communication lies in presenting complex insights and
patterns in a clear and concise manner. Visualizations serve as powerful tools to
convey information, enabling data analysts to communicate findings, support
decision-making, and tell a clear story with data.
Line Plot: Line plots are suitable for showing trends and variations over time.
Let's visualize the change in stock prices using a line plot:
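A sketch with illustrative data:
import matplotlib.pyplot as plt
days = list(range(1, 11))
prices = [102, 105, 103, 108, 110, 109, 113, 115, 114, 118]
plt.plot(days, prices)
plt.title('Stock Price Over Time')
plt.xlabel('Day')
plt.ylabel('Price')
plt.show()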
This code snippet demonstrates the creation of a line plot to visualize the trend
in stock prices over time, enhancing the audience's understanding of price
fluctuations.
Bar Chart: Bar charts are effective for comparing values across categories. Let's
create a bar chart to display sales data for different products:
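A sketch with illustrative data:
import matplotlib.pyplot as plt
products = ['Laptops', 'Phones', 'Tablets', 'Monitors']
sales = [120, 340, 210, 90]
plt.bar(products, sales)
plt.title('Sales by Product')
plt.xlabel('Product')
plt.ylabel('Units Sold')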
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()
This example illustrates the use of a bar chart to compare sales figures for
different products, providing a clear visualization of sales performance.
Histogram: Histograms help analyze the distribution of data. Let's visualize the
distribution of exam scores using a histogram:
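import matplotlib.pyplot as plt
# Illustrative exam scores
scores = [55, 62, 68, 71, 74, 76, 79, 81, 83, 85, 88, 90, 92, 95, 97]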
# Create a histogram
plt.hist(scores, bins=10, color='orange', edgecolor='black')
plt.title('Exam Score Distribution')
plt.xlabel('Score Range')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
Pie Chart: Pie charts display the proportion of different categories in a dataset.
Let's visualize the market share of mobile operating systems:
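A sketch with illustrative market-share figures:
import matplotlib.pyplot as plt
systems = ['Android', 'iOS', 'Others']
share = [70, 28, 2]
plt.pie(share, labels=systems, autopct='%1.1f%%')
plt.title('Mobile OS Market Share')
plt.show()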
This example showcases the creation of a pie chart to represent the market share
of different mobile operating systems, providing a visual depiction of their respective
proportions.
Scatter Plot: Scatter plots reveal relationships between two variables. Let's
visualize the correlation between study hours and exam scores:
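A sketch with illustrative data:
import matplotlib.pyplot as plt
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 76, 80, 85]
plt.scatter(hours, scores)
plt.title('Study Hours vs. Exam Scores')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.show()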
This code snippet demonstrates the creation of a scatter plot to visualize the
relationship between study hours and exam scores, facilitating an understanding of
their correlation.
As the era of big data continues to unfold, the ability to effectively manage and
manipulate large datasets has become a crucial skill for data professionals. In this
section, we will explore strategies and techniques for handling large datasets in
Python, ensuring that your data analysis remains efficient, scalable, and
manageable.
import pandas as pd
# Reading a large CSV in chunks (the file name and chunk size are illustrative)
total_rows = 0
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    total_rows += len(chunk)  # per-chunk processing goes here
The code demonstrates how to read a large CSV file in chunks using Pandas'
`read_csv` function with the `chunksize` parameter. This enables the iterative
processing of each chunk of data, alleviating memory constraints.
Dask: Dask is a parallel computing library that seamlessly integrates with familiar
APIs like NumPy and Pandas. It enables you to work with larger-than-memory
datasets by breaking them into smaller computational units called "tasks" that can
be executed in parallel.
import dask.dataframe as dd
# Reading lazily; work is split into parallel tasks (the file name is illustrative)
df = dd.read_csv('large_dataset.csv')
# An aggregation (here, a sum) runs only when .compute() is called
result = df.groupby('category')['value'].sum().compute()
import sqlite3
import pandas as pd
# Creating a database and loading an existing DataFrame `df` into it
# (the database and table names are illustrative)
conn = sqlite3.connect('data.db')
df.to_sql('records', conn, if_exists='replace', index=False)
# Querying with SQL
result = pd.read_sql_query(
    'SELECT category, SUM(value) AS total FROM records GROUP BY category', conn)
conn.close()
This code illustrates how to utilize SQLite to create a database, load data into it,
and perform SQL queries. Databases offer efficient storage and retrieval mechanisms
for handling large datasets.
Multiprocessing with concurrent.futures: The standard library's `concurrent.futures`
module spreads independent tasks across processes or threads. A minimal sketch (the
worker function is illustrative):
import concurrent.futures

def square(n):
    return n * n

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(square, range(10)))
Vectorized Operations with NumPy:
import numpy as np
# Multiply every element by 2 without an explicit Python loop
array = np.array([1, 2, 3, 4, 5])
result = array * 2
NumPy's vectorized operations eliminate the need for explicit loops, enhancing
computation speed. This code showcases how to multiply each element of an array
by 2 in a vectorized manner.
import pandas as pd
# Summing values per category (the data contents are illustrative)
data = {'category': ['A', 'B', 'A', 'B'], 'value': [10, 20, 15, 25]}
df = pd.DataFrame(data)
category_totals = df.groupby('category')['value'].sum()
print(category_totals)
This code snippet demonstrates how to use Pandas' `groupby` and aggregation
functions to efficiently calculate the sum of values for each category. Aggregating
data in this way minimizes computation time.
Streaming Data Processing: For continuous data streams or very large files,
streaming data processing avoids loading the entire dataset into memory. Libraries
like `streamz` provide tools to work with streaming data efficiently.
from streamz import Stream

def process_data(x):
    return x * 2  # placeholder processing step

def print_result(result):
    print(result)

# Build a pipeline: map a function over the stream, send results to a sink
source = Stream()
source.map(process_data).sink(print_result)
# Push items into the stream
for item in [1, 2, 3]:
    source.emit(item)
Here, the code sets up a data stream using the `streamz` library. The stream
processes data using the `map` function and outputs the results through the `sink`.
Streaming data processing ensures efficient handling of continuous or large
datasets.
Parallel Processing with Dask: Dask enables parallel and distributed computing
with a familiar API. By breaking tasks into smaller units, Dask efficiently utilizes
multicore processors or distributed clusters for faster data processing.
import dask.dataframe as dd
import dask.bag as db
# A sketch: process a sequence in parallel with dask.bag (the data is illustrative)
bag = db.from_sequence(range(1_000_000), npartitions=8)
total = bag.map(lambda x: x * 2).sum().compute()
print(total)
Memoization: caching the results of expensive function calls avoids recomputing them.
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_function(arg):
    result = arg ** 2  # placeholder for an expensive computation
    return result
The code demonstrates how to use the `functools` library to apply memoization,
caching the results of expensive computations. This approach avoids recalculating
results, enhancing processing efficiency.
Efficient data manipulation is crucial for working with large datasets. NumPy and
Pandas offer various techniques to optimize the performance of your data
processing tasks. In this section, we'll explore key strategies to enhance the speed
and efficiency of your code, enabling you to handle sizable datasets with ease.
import numpy as np
import pandas as pd
# A sketch: operate on whole columns at once (the data is illustrative)
df = pd.DataFrame({'value': np.arange(1_000_000)})
df['double'] = df['value'] * 2
Using the Apply Function Wisely: While Pandas' `apply` function is versatile, it
can be slow on large datasets. Utilize it for complex operations, but opt for
vectorized operations when possible to maximize performance.
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': np.arange(1_000)})
squared_slow = df['value'].apply(lambda x: x ** 2)  # row-wise apply
squared_fast = df['value'] ** 2                     # vectorized equivalent
Filtering Data with NumPy and Pandas: Efficiently filtering data based on
conditions is crucial. NumPy's boolean indexing and Pandas' query function offer
optimized ways to filter data.
import numpy as np
import pandas as pd
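# NumPy boolean indexing on an array (illustrative)
arr = np.arange(10)
filtered_arr = arr[arr > 5]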
df = pd.read_csv('large_dataset.csv')
filtered_df = df.query('column_name > 100')
This code showcases how NumPy's boolean indexing and Pandas' query function
efficiently filter data based on conditions, improving performance.
Just as a craftsman carefully hones their tools and techniques, a proficient data
practitioner must also
adopt a set of practices that ensure their code is not only functional but also robust
and scalable. From structuring your code for clarity to optimizing performance and
ensuring reliability, this chapter is designed to equip you with the skills needed to
elevate your data manipulation endeavors to new heights.
Writing code that is not only functional but also clean and efficient is of
paramount importance. Clean code is more readable, easier to maintain, and less
prone to errors. Efficient code ensures that your data processing tasks are executed
swiftly, enabling you to analyze large datasets without unnecessary delays. In this
section, we will explore essential practices and techniques for crafting clean and
efficient data manipulation code in Python.
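Meaningful Variable Names: a sketch of the improvement described below (the data is
illustrative):
# Unclear
ts = df['sales'].sum()
# Clear
total_sales = df['sales'].sum()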
The improved variable name "total_sales" provides clear context, enhancing the
code's readability and making its purpose evident.
# Inconsistent indentation (raises an IndentationError)
if condition:
    do_something()
      do_something_else()
...
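Modular Code: break work into small, focused functions. A sketch:
import pandas as pd

def load_data(path):
    return pd.read_csv(path)

def clean_data(df):
    return df.dropna()

def summarize(df):
    return df.describe()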
Modularizing code improves readability, allows for easier debugging, and makes
code maintenance more manageable.
Efficient Looping: When working with Pandas, prefer vectorized operations over
explicit loops whenever possible. Vectorized operations are often faster and more
concise.
# Loop-based calculation
result = []
for value in df['column']:
    result.append(value * 2)
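# Vectorized equivalent (returns a Series rather than a list)
result = df['column'] * 2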
Documentation: Provide clear and concise comments to explain the purpose and
functionality of your code. Documenting complex sections or functions is particularly
important.
# Unclear code
def process_data(data):
    # ...
    if flag == 1:
        # Process data differently
        ...
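# Documented code (a sketch)
def process_data(data, use_alternate=False):
    """Process data; apply the alternate path when use_alternate is True."""
    if use_alternate:
        # Explain here *why* the alternate path exists
        ...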
Documentation helps you and others understand the code's intent and
functionality, making it easier to maintain and collaborate on.
Pythonic idioms and best practices are the cornerstone of writing clean,
readable, and efficient Python code. These practices are rooted in the philosophy of
the Python programming language, emphasizing simplicity, readability, and the
utilization of built-in language features. In this section, we will see some essential
Pythonic idioms and best practices that contribute to the development of high-
quality data manipulation code.
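For example:
# Loop version
squares = []
for num in range(10):
    squares.append(num ** 2)
# List comprehension version
squares = [num ** 2 for num in range(10)]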
List comprehensions offer a more elegant and compact syntax for creating lists,
enhancing code readability and reducing the number of lines.
Context Managers with "with": Context managers, often used with the "with"
statement, facilitate resource management and exception handling. They ensure
that resources are properly acquired and released.
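For example (the file name is illustrative):
# The file is closed automatically, even if an exception occurs
with open('data.csv') as f:
    contents = f.read()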
Context managers simplify resource management and ensure that resources are
properly cleaned up, even in the presence of exceptions.
numbers = range(10)  # illustrative data
# List comprehension (builds the whole list in memory)
squared_numbers = [num ** 2 for num in numbers]
# Generator expression (produces values lazily, one at a time)
squared_generator = (num ** 2 for num in numbers)
names = ['Alice', 'Bob', 'Charlie']  # illustrative data
# Without enumerate
for i in range(len(names)):
    print(f"Name at index {i}: {names[i]}")
# Using enumerate
for i, name in enumerate(names):
    print(f"Name at index {i}: {name}")
PEP 8: Adhering to the PEP 8 style guide promotes code consistency and
readability. Consistent naming conventions, proper indentation, and clear formatting
enhance code quality.
# Inconsistent naming
MaxValue = max(numbers)
Total_sum = sum(numbers)
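# Consistent PEP 8 naming (snake_case)
max_value = max(numbers)
total_sum = sum(numbers)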
Following PEP 8 guidelines ensures that your code is easily readable and
understandable by the Python community.
DRY Principle: The "Don't Repeat Yourself" (DRY) principle emphasizes code
reusability by avoiding duplicate code. Create functions and modules for repeated
logic.
# Repeated logic
result1 = perform_calculation(data1)
result2 = perform_calculation(data2)

# Shared, well-named helper (a sketch of the refactor)
def calculate_result(data):
    return perform_calculation(data)

result1 = calculate_result(data1)
result2 = calculate_result(data2)
Error handling and debugging are integral skills for any programmer. When
working with data manipulation and analysis, it's crucial to effectively manage errors
and troubleshoot issues that may arise in your code. In this section, we will explore
various strategies and techniques for error handling and debugging in Python.
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Error: Division by zero")
Exception handling ensures that your program continues running even when
encountering errors, making it more robust.
Logging: Logging is an essential tool for understanding the behavior of your code.
The "logging" module provides various levels of logging, helping you track the flow
and state of your program.
import logging
logging.basicConfig(level=logging.DEBUG)
logging.debug("Debugging message")
Assertions: The `assert` statement documents an assumption and fails fast when it is
violated. A sketch (the validation rule is an assumption):
def calculate_tax(income):
    # Reject invalid input early
    assert income >= 0, "Income cannot be negative"
    return income * 0.2  # illustrative tax rate

calculate_tax(-1000)  # Raises AssertionError
Print Statements: Print statements are a simple yet effective way to inspect
variable values and trace the execution flow of your code.
def calculate_interest(principal, rate, years):
    # Print statements trace each step (parameter names are illustrative)
    for year in range(1, years + 1):
        amount = principal * (1 + rate) ** year
        print(f"Year {year}: amount = {amount:.2f}")

calculate_interest(1000, 0.05, 3)
Print statements provide quick insights into variable values and the execution
sequence, helping you locate issues.
Error Messages and Stack Traces: When an error occurs, Python generates an
error message and a stack trace, indicating where the error occurred in your code.
Understanding error messages and stack traces helps pinpoint the root cause of
errors and facilitates effective troubleshooting.
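For example, a failed division produces a traceback like the following, naming the
error type and the offending line:
result = 10 / 0
# Traceback (most recent call last):
#   File "example.py", line 1, in <module>
# ZeroDivisionError: division by zero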
Unit Testing: Writing unit tests using frameworks like "unittest" and "pytest" can
help catch errors early in development and ensure the correctness of your code.
import unittest
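# Function under test (defined here so the example is self-contained)
def divide(a, b):
    return a / b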
class TestDivision(unittest.TestCase):
    def test_division(self):
        self.assertEqual(divide(10, 2), 5)

if __name__ == '__main__':
    unittest.main()
Unit tests provide a systematic way to validate the functionality of your code and
identify regressions.
Unveiling Time-Driven Patterns: Visualize the growth of the collection over the
years to observe how your library has expanded and evolved over the passage of time.
Deriving Insights and Discerning Patterns: Delve into the data to discern
patterns and insights that underscore the popularity of specific genres,
authors, and other noteworthy attributes. Through meticulous analysis, gain
a deeper understanding of your book collection and its underlying dynamics.
Code
The following code is organized into modules that correspond to the steps
outlined in the problem statement. Each module involves loading, cleaning,
manipulating, and analyzing the dataset while utilizing Pandas, NumPy, and
Matplotlib libraries. Proper comments provide clarity and guidance throughout the
code, ensuring a comprehensive and effective analysis of the personal book library
dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
The walkthrough below explains how the code derives insights from the dataset using
Python and various data manipulation and visualization techniques.
Exploring the Dataset: In this step, the necessary CSV files (Books.csv,
Ratings.csv, and Users.csv) are loaded into Pandas DataFrames. The `.read_csv()`
function is used to read the CSV files, and the `.info()` method provides information
about the datasets, including their sizes, data types, and non-null counts.
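A sketch of this step:
books = pd.read_csv('Books.csv')
ratings = pd.read_csv('Ratings.csv')
users = pd.read_csv('Users.csv')
books.info()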
Data Cleansing and Preprocessing: In this step, missing values are handled by
using the `.dropna()` method, which removes rows with any missing values. The
`'Year of Publication'` column is standardized by converting it to a datetime format
using `pd.to_datetime()`, with `errors='coerce'` handling any errors by converting
them to NaN values.
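A sketch:
books = books.dropna()
books['Year of Publication'] = pd.to_datetime(books['Year of Publication'], errors='coerce')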
Data Manipulation and Insight Generation: In this step, various insights are
generated from the dataset. The average rating for each book author is computed
using `.groupby()` and `.mean()` methods. The distribution of book genres is
calculated using `.value_counts()`. The yearly growth of the book library is obtained
by setting the `'Year of Publication'` column as the index and using `.resample()` to
count the number of books published each year.
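A sketch (the column names are assumptions based on the description):
avg_rating_by_author = books.groupby('Author')['Rating'].mean()
genre_distribution = books['Genre'].value_counts()
yearly_growth = books.set_index('Year of Publication').resample('Y').size()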
Unveiling Time-Driven Patterns: This step involves creating a line plot using
Matplotlib to visualize the yearly growth of the book library. The `plt.plot()` function
is used to plot the `yearly_growth` data, and labels and a title are added using
`plt.xlabel()`, `plt.ylabel()`, and `plt.title()` functions. The resulting plot is displayed
using `plt.show()`.
Crafting Informative Visualizations: This step involves creating a bar plot using
Matplotlib to visualize the average rating by author. The `plt.bar()` function is used
to create the plot, and labels, a title, and rotation for the x-axis labels are
added using `plt.xlabel()`, `plt.ylabel()`, `plt.title()`, and `plt.xticks()` functions.
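A sketch:
plt.bar(avg_rating_by_author.index, avg_rating_by_author.values)
plt.xlabel('Author')
plt.ylabel('Average Rating')
plt.title('Average Rating by Author')
plt.xticks(rotation=90)
plt.show()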
Enhancing Visual Appeal: This step involves creating a scatter plot using
Matplotlib to visualize the relationship between book length and ratings. The
`plt.scatter()` function is used to create the plot, and labels and a title are added
using `plt.xlabel()`, `plt.ylabel()`, and `plt.title()` functions. The resulting plot is
displayed using `plt.show()`.
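A sketch (the page-count and rating column names are assumptions):
plt.scatter(books['Pages'], books['Rating'])
plt.xlabel('Book Length (pages)')
plt.ylabel('Rating')
plt.title('Book Length vs. Rating')
plt.show()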
Deriving Insights and Discerning Patterns: In this step, the five most popular
book genres are extracted from the `genre_distribution` using slicing. The resulting
data is printed to the console.
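A sketch:
top_genres = genre_distribution[:5]
print(top_genres)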
Conveying Discoveries through Visuals: This step involves creating a pie chart
using Matplotlib to visualize the distribution of the top 5 popular book genres. The
`plt.pie()` function is used to create the chart, and a title is added using
`plt.title()`, with the finished chart displayed via `plt.show()`.
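A sketch:
plt.pie(top_genres.values, labels=top_genres.index, autopct='%1.1f%%')
plt.title('Top 5 Popular Book Genres')
plt.show()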