Ch-2 Python Libraries For ML
Ch – 2 Python Libraries for Machine Learning
❖ Table of Contents
2.1 NumPy
❖ Creating Array: array()
❖ Accessing Array: by referring to its index number
❖ Stacking & Splitting: stack(), array_split()
❖ Maths Functions: add(), subtract(), multiply(), divide(), power(), mod()
❖ Statistics Functions: amin(), amax(), mean(), median(), std(), var(), average(), ptp()
NumPy
❖ NumPy (Numerical Python) is an open-source Python library that is used in almost
every field of science and engineering.
❖ It’s the universal standard for working with numerical data in Python, and it’s at the core
of the scientific Python and PyData ecosystems.
Features of NumPy
NumPy has various features including these important ones:
❖ A powerful N-dimensional array object
❖ Sophisticated (broadcasting) functions
❖ Tools for integrating C/C++ and Fortran code
❖ Useful linear algebra, Fourier transform, and random number capabilities
Installation Process NumPy:
array() transforms sequences of sequences into two-dimensional arrays, sequences of sequences of sequences into
three-dimensional arrays, and so on.
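As a minimal sketch of this rule, the nesting depth of the input decides the number of dimensions. The variable names and values below are illustrative only.

```python
import numpy as np

# One flat sequence gives a one-dimensional array.
a1 = np.array([1, 2, 3])

# A sequence of sequences gives a two-dimensional array.
a2 = np.array([[1, 2, 3], [4, 5, 6]])

# A sequence of sequences of sequences gives a three-dimensional array.
a3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(a1.ndim, a1.shape)  # 1 (3,)
print(a2.ndim, a2.shape)  # 2 (2, 3)
print(a3.ndim, a3.shape)  # 3 (2, 2, 2)
```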
* Sometimes we want to create an array before we know the exact values to put in it. NumPy has
special functions to help with that:
* zeros: The zeros() function creates an array filled with zeros. It is like a container that is pre-filled with
zeros and ready to be overwritten.
* ones: The ones() function creates an array filled with ones. It is like a container that is pre-filled with
ones.
* empty: The empty() function creates an array without initializing its values. It is like a container
that already has some arbitrary stuff inside: the content depends on the memory state of the computer, so
it should be overwritten before it is used.
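The three functions can be sketched as follows. The shapes and the integer dtype are arbitrary choices for illustration. The contents of the empty() array are not printed because they are unpredictable.

```python
import numpy as np

z = np.zeros((2, 3))       # 2x3 array of 0.0 (float64 is the default dtype)
o = np.ones(4, dtype=int)  # 1-D array of four integer ones
e = np.empty((2, 2))       # 2x2 array whose values are uninitialized

print(z)
print(o)        # [1 1 1 1]
print(e.shape)  # (2, 2); only the shape is reliable here

# Typical pattern: allocate with empty(), then assign every element.
e[:] = 7.0
```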
arange(starting_number, ending_number, delta):
In NumPy, you can create a regularly spaced array using the numpy.arange() function. This function generates a
sequence of numbers within a specified range with a specified step size. The basic syntax for numpy.arange() is
as follows:
numpy.arange([start, ]stop, [step, ]dtype=None)
Where:
• start (optional): The start of the sequence (inclusive). If not provided, it defaults to 0.
• stop: The end of the sequence (exclusive). The generated sequence will go up to, but not include, this value.
• step (optional): The step size between the numbers in the sequence. If not provided, it defaults to 1.
• dtype (optional): The data type of the elements in the resulting array. If not provided, it will be automatically
determined based on the inputs.
❑ When arange is used with floating-point arguments, it is generally not possible to predict the number of
elements obtained, because of finite floating-point precision. For this reason, it is usually better to use the
function linspace, which receives the number of elements that we want as an argument instead of the step.
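A minimal sketch of the difference, using a step (0.25) that is exactly representable in binary so the arange() result is predictable. With steps such as 0.1 the element count from arange() can vary, which is the problem linspace() avoids.

```python
import numpy as np

# arange: you choose the step; the stop value is excluded.
a = np.arange(0, 1, 0.25)
print(a, len(a))   # four elements: 0, 0.25, 0.5, 0.75

# linspace: you choose the number of elements; the stop value is included.
b = np.linspace(0, 1, 5)
print(b, len(b))   # five elements: 0, 0.25, 0.5, 0.75, 1
```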
Example of arange :
import numpy as np
arr = np.arange(10)
print(arr) # Output: [0 1 2 3 4 5 6 7 8 9]
❑ NumPy also allows you to access a range of elements from an array using slicing. Slicing is done using the
colon : notation within square brackets []. The syntax for slicing is [ start : stop : step ], where start is the
starting index , stop is the stopping index (exclusive) , and step is the interval between elements.
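The [start : stop : step] syntax can be sketched on a small array. The array contents below are illustrative, and the two-dimensional case simply applies one slice per axis.

```python
import numpy as np

arr = np.arange(10)        # [0 1 2 3 4 5 6 7 8 9]

print(arr[2:7])            # [2 3 4 5 6]   start=2, stop=7 (exclusive)
print(arr[:4])             # [0 1 2 3]     start defaults to 0
print(arr[::3])            # [0 3 6 9]     every third element
print(arr[::-1])           # [9 8 7 6 5 4 3 2 1 0]  negative step reverses

# In 2-D, give one slice per axis: rows first, then columns.
m = np.arange(12).reshape(3, 4)
sub = m[1:, :2]            # rows 1 and 2, columns 0 and 1
print(sub)                 # [[4 5]
                           #  [8 9]]
```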
Explain reshape():
In NumPy, the reshape() function is used to change the shape (dimensions) of an existing array while preserving
the total number of elements. Reshaping is a common operation when you want to convert a one-dimensional
array into a multi-dimensional array or change the dimensions of a multi-dimensional array. The reshape()
function returns a view of the original data with the specified shape whenever the memory layout allows it, and a
copy otherwise.
The basic syntax for the reshape() function is as follows:
numpy.reshape(a, newshape, order='C')
Where:
a: The array to be reshaped.
newshape: A tuple or list specifying the new shape (dimensions) of the array. The total number of elements in the
original array must match the total number of elements in the new shape.
order (optional): Specifies the memory layout order of the elements in the reshaped array. It can be 'C' for C-style
(row-major) or 'F' for Fortran-style (column-major). The default is 'C'.
import numpy as np

# Reshape the 1D array into a 2D array with 2 rows and 3 columns
arr1d = np.array([1, 2, 3, 4, 5, 6])
arr2d = arr1d.reshape((2, 3))
print(arr2d)
Output:
[[1 2 3]
 [4 5 6]]

# Reshape a 1D array into a 3D array with dimensions (2, 3, 2).
# This needs 2 * 3 * 2 = 12 elements; the output shown corresponds to the values 0 to 11.
arr1d = np.arange(12)
arr3d = arr1d.reshape((2, 3, 2))
print(arr3d)
Output:
[[[ 0  1]
  [ 2  3]
  [ 4  5]]

 [[ 6  7]
  [ 8  9]
  [10 11]]]
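Two details of reshape() are easy to miss and can be sketched briefly: for a contiguous array the result shares memory with the original, and one dimension may be given as -1 so that NumPy infers it. The array below is illustrative.

```python
import numpy as np

a = np.arange(6)
b = a.reshape(2, 3)

# For a contiguous array, reshape() returns a view onto the same data.
print(np.shares_memory(a, b))  # True
b[0, 0] = 99
print(a[0])                    # 99, because a and b share memory

# -1 asks NumPy to infer that dimension from the total element count.
c = a.reshape(3, -1)
print(c.shape)                 # (3, 2)
```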
Explain Array Splitting:
In NumPy, you can split an array into smaller sub-arrays using various functions, depending on your requirements.
Here are some common ways to split arrays:
1. numpy.split(): This function splits an array into multiple sub-arrays along a specified axis.
import numpy as np
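A minimal sketch of numpy.split() on an illustrative nine-element array. The second argument is either a number of equal parts or a list of indices to cut at.

```python
import numpy as np

arr = np.arange(9)               # [0 1 2 3 4 5 6 7 8]

# An integer asks for that many equal-sized parts.
parts = np.split(arr, 3)
print(parts)                     # three arrays: [0 1 2], [3 4 5], [6 7 8]

# A list gives the indices at which to cut.
at_idx = np.split(arr, [2, 5])
print(at_idx)                    # three arrays: [0 1], [2 3 4], [5 6 7 8]
```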
2. numpy.array_split(): Similar to numpy.split(), but allows for uneven splits. You can specify the number of splits
you want.
import numpy as np
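A minimal sketch of the uneven case, using an illustrative seven-element array. numpy.split() refuses to divide seven elements into three parts, while numpy.array_split() makes the leading parts one element longer.

```python
import numpy as np

arr = np.arange(7)               # 7 elements cannot be split into 3 equal parts

# split() raises ValueError for an unequal division.
try:
    np.split(arr, 3)
except ValueError as exc:
    print("np.split failed:", exc)

# array_split() accepts it and returns parts of sizes 3, 2, 2.
parts = np.array_split(arr, 3)
print(parts)                     # [0 1 2], [3 4], [5 6]
```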
3. numpy.hsplit() and numpy.vsplit(): These functions split arrays horizontally (column-wise, along axis 1) and
vertically (row-wise, along axis 0), respectively. They are useful for 2D arrays.
import numpy as np
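A minimal sketch of both functions on an illustrative 4x4 array:

```python
import numpy as np

m = np.arange(16).reshape(4, 4)

# hsplit cuts between columns: two blocks of shape (4, 2).
left, right = np.hsplit(m, 2)
print(left.shape, right.shape)   # (4, 2) (4, 2)

# vsplit cuts between rows: two blocks of shape (2, 4).
top, bottom = np.vsplit(m, 2)
print(top)                       # [[0 1 2 3]
                                 #  [4 5 6 7]]
```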
2.2 PANDAS
• Pandas is mostly used for data analysis tasks in Python. NumPy is mostly used for working with numerical values, as it
makes it easy to apply mathematical functions. Pandas works well with numeric, text, and heterogeneous
types of data at the same time.
• Pandas is an open-source Python library used for data manipulation, analysis, and visualization. It provides easy-to-use
data structures and data analysis tools for Python, making it a popular choice for working with data in scientific
computing, finance, data analysis, and machine learning.
• The key data structures in Pandas are the Series and the DataFrame. A Series is a one-dimensional labelled array that can
hold any data type. A DataFrame is a two-dimensional labelled data structure with columns of potentially different
types. In addition to these data structures, Pandas provides functions for reading and writing data to and from various
file formats, including CSV, Excel, SQL databases, and more.
• Pandas also offers powerful tools for data cleaning, aggregation, filtering, and transformation, as well as statistical and
time series analysis. Its integration with other popular data analysis libraries in Python, such as NumPy and Matplotlib,
makes it a powerful tool for data analysis and visualization.
2.2.1 Features of Pandas
Pandas is a popular open-source data manipulation and analysis library for Python. It provides data structures and functions
for working with structured data, making it an essential tool for data scientists, analysts, and developers. Here are some of
the key features of Pandas:
1. Data Structures:
• Series: A one-dimensional labeled array capable of holding data of various types.
• DataFrame: A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes
(rows and columns).
2. Data I/O:
Pandas can read data from various file formats, including CSV, Excel, SQL databases, JSON, and more, and write data back to
these formats.
3. Data Alignment:
Pandas can automatically align data from multiple data sources based on index labels. This makes it easy to work with data
from different sources and combine them.
4. Missing Data Handling:
Pandas provides tools to handle missing data, either by filling in missing values with appropriate default values or by
dropping missing data points.
5. Data Filtering and Selection:
You can use Pandas to select, filter, and slice data based on various conditions and criteria, making it easy to extract the
information you need.
6. Data Transformation:
Pandas supports various data transformation operations, such as merging, joining, reshaping, and pivoting data, which are
essential for data cleaning and analysis.
7. Grouping and Aggregation:
Pandas allows you to group data based on one or more columns and perform aggregation functions like sum, mean, count,
etc., on these groups.
8. Time Series and Date Functionality:
Pandas has built-in support for working with time series data, including date and time manipulation, resampling, and rolling
statistics.
9. Data Visualization:
Pandas offers a basic plot() method that is built on Matplotlib, and it can be easily integrated with popular data visualization
libraries like Matplotlib and Seaborn for creating more elaborate charts and plots.
10. High Performance:
Pandas is designed to handle large datasets efficiently, and it leverages low-level libraries like NumPy for high-speed data
operations.
11. Flexibility:
You can work with a wide range of data types within Pandas, including numeric, string, categorical, and datetime data,
making it suitable for various data analysis tasks.
12. Customization:
Pandas provides options for customizing data structures and operations, allowing users to adapt the library to their specific
needs.
13. Interoperability:
Pandas can be easily integrated with other data analysis and machine learning libraries in the Python ecosystem, such as
NumPy, Scikit-Learn, and TensorFlow.
14. Community and Ecosystem:
Pandas has a large and active community, which means there are plenty of resources, tutorials, and third-party extensions
available to enhance its functionality.
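A few of the features listed above (filtering, grouping and aggregation, and index alignment) can be sketched in a handful of lines. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, 7, 3, 12],
})

# Filtering and selection: keep rows where units > 5.
big = df[df["units"] > 5]
print(big)

# Grouping and aggregation: total units per region.
totals = df.groupby("region")["units"].sum()
print(totals)           # North 13, South 19

# Data alignment: values are matched by index label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
aligned = a + b
print(aligned)          # x is NaN (no match in b), y is 12, z is 23
```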
Advantages of Pandas
• Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data.
• Size mutability: columns can be inserted into and deleted from a DataFrame and higher-dimensional objects.
2.2.2 Creating Series
❑ To create a Series in Pandas, you can use the pd.Series() constructor.
import pandas as pd

# Create a Series from a list
data_list = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data_list)
print(series_from_list)
Output:
0    1
1    2
2    3
3    4
4    5
dtype: int64
import pandas as pd

# Create a Series from a dictionary
data_dict = {'A': 100, 'B': 200, 'C': 300, 'D': 400}
series_from_dict = pd.Series(data_dict)
print(series_from_dict)
Output:
A    100
B    200
C    300
D    400
dtype: int64

# Create a Series with custom index
data_values = [0.5, 0.7, 0.2, 0.1]
custom_index = ['First', 'Second', 'Third', 'Fourth']
series_with_custom_index = pd.Series(data_values, index=custom_index)
print(series_with_custom_index)
Output:
First     0.5
Second    0.7
Third     0.2
Fourth    0.1
dtype: float64
2.2.3 Creating DataFrames
❑ To create a DataFrame in Pandas, you can use the pd.DataFrame() constructor.
1. From a Dictionary of Lists/Arrays:
You can create a DataFrame by passing a dictionary where each key represents a column label, and the associated value is a
list or array containing the column's data.
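import pandas as pd

# Each key is a column label; each list holds that column's values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston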
2. Pandas Read CSV File:
• A simple way to store big data sets is to use CSV (comma-separated values) files.
• CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.
• In our examples we will be using a CSV file called 'data.csv'.
import pandas as pd

df = pd.read_csv('data.csv')
print(df.to_string())
Output:
    Duration  Pulse  Maxpulse  Calories
0         60    110       130     409.1
1         60    117       145     479.0
2         60    103       135     340.0
3         45    109       175     282.4
4         45    117       148     406.0
5         60    102       127     300.5
6         60    110       136     374.0
7         45    104       134     253.3
8         30    109       133     195.1
9         60     98       124     269.0
10        60    103       147       ...
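The read_csv() call needs a 'data.csv' file on disk. As a self-contained sketch, the same call also accepts a file-like object, so a few rows in the same column layout can be parsed from a string. Using io.StringIO here is an assumption made for illustration, not part of the original example.

```python
import io
import pandas as pd

# Three rows in the same layout as the first rows of the data.csv output.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (3, 4)
print(df.dtypes)   # integer columns for the first three, float for Calories
print(df.head(2))  # first two rows
```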
2.2.3 Clearing Empty Cells
One way to deal with empty cells is to remove the rows that contain them.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.
By default, the dropna() method returns a new DataFrame and does not change the original.
import pandas as pd

df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
Output:
    Duration  Pulse  Maxpulse  Calories
0         60    110       130     409.1
1         60    117       145     479.0
2         60    103       135     340.0
3         45    109       175     282.4
4         45    117       148     406.0
5         60    102       127     300.5
6         60    110       136     374.0
7         45    104       134     253.3
8         30    109       133     195.1
9         60     98       124     269.0
10        60    103       147       ...
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
The fillna() method allows us to replace empty cells with a value:
import pandas as pd

df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
print(df.to_string())
Output:
    Duration  Pulse  Maxpulse  Calories
0         60    110       130     409.1
1         60    117       145     479.0
2         60    103       135     340.0
3         45    109       175     282.4
4         45    117       148     406.0
5         60    102       127     300.5
6         60    110       136     374.0
7         45    104       134     253.3
8         30    109       133     195.1
9         60     98       124     269.0
10        60    103       147       ...
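Because the rows shown from 'data.csv' contain no empty cells, the effect of dropna() and fillna() is easier to see on a small DataFrame with one missing value. The sketch below builds such a frame by hand. The values are illustrative, and filling with the column mean is one possible choice rather than a recommendation from the original slides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, np.nan, 195.1],   # row 1 has an empty cell
})

# dropna() returns a new DataFrame without row 1; df itself is unchanged.
dropped = df.dropna()
print(len(df), len(dropped))   # 3 2

# fillna() with a single value replaces every empty cell with it.
filled = df.fillna(130)
print(filled)

# A dict fills per column, here with the mean of the non-missing values.
col_filled = df.fillna({"Calories": df["Calories"].mean()})
print(col_filled)
```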
Pandas Plotting
These functions provide valuable tools for analyzing data and computing various
Plotting:
Pandas uses the plot() method to create diagrams. We can use Pyplot, a submodule of Matplotlib, to display the
diagram on the screen.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
# Draw every numeric column as a line against the row index
df.plot()
# Display the figure (a window in a script, inline in a notebook)
plt.show()
Scatter plot: kind = 'scatter'
A scatter plot needs an x-axis and a y-axis. For the 'data.csv' columns, use "Duration" for the x-axis and "Calories" for the y-axis:
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
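A self-contained sketch of the scatter plot, using the first five Duration and Calories rows from the output table instead of reading 'data.csv'. The Agg backend and the output file name are assumptions made so the script runs without a display. In an interactive session you would drop the matplotlib.use() line and call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 60, 45, 45],
    "Calories": [409.1, 479.0, 340.0, 282.4, 406.0],
})

# kind selects the plot type; x and y name the columns for each axis.
ax = df.plot(kind="scatter", x="Duration", y="Calories")

# Save the figure to a file (hypothetical name).
ax.figure.savefig("scatter.png")
```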
Histogram: kind = 'hist'
A histogram shows us the frequency of each interval, e.g. how many workouts lasted between 50
and 60 minutes.
A histogram needs only one column. For the 'data.csv' data, use the "Duration" column:
df["Duration"].plot(kind = 'hist')
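A self-contained sketch of the histogram, using the ten Duration values from rows 0 to 9 of the output table. As an assumption for running without a display, the figure is saved to a hypothetical file name rather than shown in a window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 60, 45, 45, 60, 60, 45, 30, 60],
})

# Select one column, then ask for a histogram of its values.
ax = df["Duration"].plot(kind="hist")
ax.set_xlabel("Duration (minutes)")

# The bar heights are counts, so together they cover all ten workouts.
total = sum(p.get_height() for p in ax.patches)
print(total)

ax.figure.savefig("hist.png")
```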
Pandas NumPy
MATPLOTLIB
People grasp visual representations of data more readily than text or raw numbers. Representing data as a
graph makes it easier to analyze and to base specific decisions on that analysis.
Before learning Matplotlib, we need to understand what data visualization is and why it is
important.
FEATURES OF MATPLOTLIB
Matplotlib is a popular data visualization library in Python that provides a wide range of tools for creating
static, animated, and interactive visualizations. It is highly customizable and allows you to create a variety of
plots, charts, and graphs. Here are some of the key features of Matplotlib:
❖ Plotting Functions: Matplotlib offers a comprehensive set of plotting functions to create various types
of visualizations, including line plots, scatter plots, bar plots, histograms, pie charts, box plots, and
many more. These functions provide the flexibility to customize the appearance and style of the plots.
❖ Support for Multiple Backends: Matplotlib supports multiple rendering backends, which allows you to
display plots in different environments. It can render plots in interactive GUI windows, as static images
in different file formats (PNG, JPEG, PDF, etc.), or embedded within web
applications.
❖ Object-Oriented Interface: Matplotlib provides an object-oriented interface that allows you to have
fine-grained control over the elements of a plot. You can create Figure and Axes objects to define the
layout and positioning of plots, and manipulate individual plot components such as lines, markers,
labels, and legends.
❖ Customization Options: Matplotlib offers extensive customization options to tailor the appearance of
your plots. You can modify colors, line styles, markers, fonts, gridlines, and other visual elements.
Additionally, you can add titles, labels, annotations, and legends to enhance the readability of the
plots.
❖ Support for Multiple Plotting Styles: Matplotlib supports different plotting styles and themes, allowing
you to change the overall look and feel of your plots. It provides a set of predefined styles, such as
classic and ggplot, along with Seaborn-inspired styles (plt.style.available lists the names available in your
installed version). You can also create and customize your own styles.
❖ 3D Plotting: Matplotlib includes functionality for creating 3D visualizations. It allows you to plot three-
dimensional data, such as surface plots, wireframes, scatter plots, and contour plots. These capabilities
are useful for visualizing data in three dimensions and understanding complex relationships.
❖ Subplotting: Matplotlib enables you to create multiple plots within a single figure using subplotting.
You can divide the figure into a grid of subplots and plot different data or views side by side. This
feature is helpful for comparing data, creating multi-panel visualizations, or building complex layouts.
❖ Animation: Matplotlib supports creating animated plots and visualizations. It provides an animation module that allows you to
animate changes in your data over time. You can create dynamic visualizations such as line animations, scatter plot animations, or
custom animations based on your specific requirements.
❖ Integration with Pandas and NumPy: Matplotlib integrates well with other libraries, such as Pandas and NumPy, making it easy to
plot data stored in these data structures. It can directly plot data from Pandas DataFrames or NumPy arrays, allowing for seamless
integration into data analysis workflows.
These are some of the key features of Matplotlib that make it a versatile and powerful library for data visualization in Python. It is
widely used in various fields such as data analysis, scientific research, machine learning, and data exploration.
Pyplot is a Matplotlib module that provides a MATLAB-like interface. Matplotlib is designed to be as usable as
MATLAB, with the ability to use Python and the advantage of being free and open-source. Each pyplot function
makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a
plotting area, decorates the plot with labels, etc. The plot types available through Pyplot include Line Plot,
Histogram, Scatter, 3D Plot, Image, Contour, and Polar.
After this brief introduction to Matplotlib and Pyplot, let’s see how to create a simple plot.
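A minimal sketch of a simple line plot with Pyplot. The data points are illustrative. As an assumption for running without a display, the figure is saved to a hypothetical file name. In an interactive session you would call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# plot() draws a line through the (x, y) points on the current axes.
plt.plot(x, y)

# Write the current figure to a PNG file.
plt.savefig("simple_plot.png")
```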
Adding Title
The title() function in the matplotlib.pyplot module sets the title of the plot. Its parameters control the title's font,
alignment, and padding.
Syntax:
matplotlib.pyplot.title(label, fontdict=None, loc='center', pad=None, **kwargs)
plt.show()
Adding X Label and Y Label
In layman’s terms, the X label and the Y label are the titles given to X-axis and Y-axis respectively. These can be
added to the graph by using the xlabel() and ylabel() methods.
Syntax:
matplotlib.pyplot.xlabel(xlabel, fontdict=None, labelpad=None, **kwargs)
matplotlib.pyplot.ylabel(ylabel, fontdict=None, labelpad=None, **kwargs)
import matplotlib.pyplot as plt
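A minimal sketch that combines title(), xlabel(), and ylabel() on one plot. The data, the left-aligned title, and the labelpad value are illustrative choices, and the figure is saved to a hypothetical file name so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 2, 3])

# loc places the title at 'left', 'center', or 'right'; extra keyword
# arguments such as fontsize are passed on to the text object.
plt.title("Plot Title", loc="left", fontsize=14)

# labelpad is the spacing in points between the axis and its label.
plt.xlabel("X-axis label", labelpad=10)
plt.ylabel("Y-axis label")

plt.savefig("labels.png")
```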
Arrow annotation
Arrow annotations are similar to text annotations, but they include an arrow that points from one location to
another. These annotations are useful for highlighting specific relationships or connections between data
points. Arrow annotations can be created using the plt.annotate() or ax.annotate() functions.
import matplotlib
import matplotlib.pyplot as plt
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Plot Title')
plt.show()
Circle annotation
Circle annotations are used to draw circles or ellipses on the plot. They are typically employed to mark specific data points or
regions. A circle is created with patches.Circle() (also available as plt.Circle()) and added to the plot with ax.add_patch().
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as patches
plt.plot([1,2,3,4],[1,4,2,3], label='Series 1')
plt.plot([1,2,3,4],[4,3,1,3], label='Series 2')
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Plot Title')
plt.show()
Rectangle annotation
Rectangle annotations are used to draw rectangular shapes on the plot. They are typically employed to
emphasize specific regions. A rectangle is created with patches.Rectangle() (also available as plt.Rectangle()) and
added to the plot with ax.add_patch().
import matplotlib.pyplot as plt
import matplotlib.patches as patches

plt.plot([1, 2, 3, 4], [1, 4, 2, 3], label='Series 1')
plt.plot([1, 2, 3, 4], [4, 3, 1, 3], label='Series 2')
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Plot Title')
# Text annotation placed at data coordinates (3, 4)
plt.text(3, 4, 'Example annotation', fontsize=12, color='red')
# Arrow annotations: each arrow points from xytext to xy
plt.annotate('Reversal', xy=(3, 2), xytext=(2, 1.5), arrowprops=dict(facecolor='black', arrowstyle='->'))
plt.annotate('', xy=(3, 1), xytext=(2.4, 1.5), arrowprops=dict(facecolor='red', arrowstyle='->'))
# Circle annotation: center (2, 4), radius 0.4
circle = patches.Circle((2, 4), 0.4, facecolor='green', alpha=0.3)
plt.gca().add_patch(circle)
# Rectangle annotation: lower-left corner (1, 4.5), width 3, height 0.7
rect = patches.Rectangle((1, 4.5), 3, 0.7, facecolor='red', alpha=0.3)
plt.gca().add_patch(rect)
plt.show()
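The same kinds of annotation can also be written with the object-oriented interface, calling ax.annotate() and ax.add_patch() on an explicit Axes object instead of going through plt.gca(). The sketch below is one possible arrangement: the coordinates, label text, and output file name are illustrative, and the Agg backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Create the Figure and Axes explicitly.
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 2, 3], label="Series 1")

# Arrow annotation: text at xytext, arrow head at xy.
ax.annotate("Peak", xy=(2, 4), xytext=(3, 4.2),
            arrowprops=dict(arrowstyle="->"))

# Circle and rectangle patches are added to the Axes with add_patch().
ax.add_patch(patches.Circle((2, 4), 0.3, facecolor="green", alpha=0.3))
ax.add_patch(patches.Rectangle((2.5, 1.5), 1.0, 1.0, facecolor="red", alpha=0.3))

ax.legend()
fig.savefig("annotated.png")
```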