Unit 4
4.1 NumPy
Use of NumPy
import numpy as np
# Create an array
a = np.array([1, 2, 3, 4])
# Performing simple math operations
b = a * 2      # multiplies every element by 2 -> b = [2, 4, 6, 8]
c = np.sum(a)  # sum of all elements -> c = 10
4.2 SciPy
SciPy is a Python library for scientific computing that builds on NumPy. It is used for:
• Optimization
• Signal processing
• Integration (calculus)
• Interpolation
• Statistics
• Fourier transforms
Module Purpose
scipy.optimize Minimization, curve fitting
scipy.integrate Calculus: integration, differential equations
scipy.linalg Advanced linear algebra
scipy.fft Fourier transforms (signal processing)
scipy.stats Probability distributions, statistics
scipy.spatial Distance metrics, spatial algorithms (KD-trees)
scipy.ndimage Image processing
scipy.signal Signal processing (filters, convolution, etc.)
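As a quick illustration of one of these modules, a definite integral such as the integral of x^2 over [0, 1] (which equals 1/3) can be evaluated numerically; a minimal sketch using scipy.integrate:

```python
from scipy.integrate import quad

# quad() returns the integral value and an error estimate
value, error = quad(lambda x: x**2, 0, 1)
print(value)  # approximately 0.3333
```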
4.3 Pandas
Pandas is a Python library for data manipulation and analysis. It provides
easy-to-use tools for working with structured data, such as spreadsheets and
databases. Think of it as Excel for Python, but far more powerful.
Use of Pandas
Pandas provides two main data structures:
1. Series
2. DataFrame
Before using these objects, we need to import the library:
import pandas as pd
Series:
A Series is a one-dimensional data structure provided by pandas. It is similar to
an array, but internally consists of two associated arrays: the main array holds
the data, and the other holds the index labels. A Series is created by passing
the values to the Series() constructor.
a=pd.Series([7,9,13,15])
print(a)
Output:
0 7
1 9
2 13
3 15
dtype: int64
# Indexing
a=pd.Series([7,9,13,15],index=['i','ii','iii','iv'])
print(a)
Output:
i 7
ii 9
iii 13
iv 15
dtype: int64
a.iloc[2]  # selection by integer position
Output:
13
a['iii']   # selection by index label
Output:
13
a[a>9]
Output:
iii 13
iv 15
dtype: int64
unique() function
This function returns the unique values in a Series, excluding duplicates. For
example (the original data was lost; sample values assumed to match the output):
a = pd.Series([10, 30, 20, 10, 40, 30, 50])
a.unique()
Output:
array([10, 30, 20, 40, 50], dtype=int64)
value_counts() function
This function returns the number of occurrences of each distinct value in a
Series, sorted by frequency (most frequent first).
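The original example was not preserved; a small illustration with assumed data:

```python
import pandas as pd

# Count how often each distinct value occurs
a = pd.Series([10, 30, 20, 10, 40, 30, 10])
print(a.value_counts())  # 10 appears 3 times, 30 twice, 20 and 40 once each
```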
isin() function
This function tests whether each element of a data structure is contained in a
given set of values. It returns True for elements that are present and False
otherwise. For example (the original data was lost; sample values assumed to
match the output):
a = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])
a.isin([10, 20, 40])
Output:
0 True
1 True
2 False
3 True
4 False
5 False
6 False
7 False
8 False
dtype: bool
import pandas as pd
# loading CSV file "data.csv"
df = pd.read_csv("data.csv")
# loading Excel file "data.xlsx"
df = pd.read_excel("data.xlsx")
# loading JSON file "data.json"
df = pd.read_json("data.json")
Indexing and slicing in a DataFrame can be done in several ways. The examples
below use the following DataFrame (reconstructed from the outputs shown):
df = pd.DataFrame({
    'Name': ['Rakesh', 'Druv', 'Aina', 'Ekta', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['Dehradun', 'Lucknow', 'Delhi', 'Chandigarh', 'Kolkata']
})
1. Selecting columns
print(df['Name']) # returns a Series
Output:
0 Rakesh
1 Druv
2 Aina
3 Ekta
4 Eva
print(df[['Name', 'City']]) # returns a DataFrame with two columns
Output:
Name City
0 Rakesh Dehradun
1 Druv Lucknow
2 Aina Delhi
3 Ekta Chandigarh
4 Eva Kolkata
2. Selecting rows using loc[]
loc[] selects rows and columns by index label. (The original calls were lost;
the ones below are reconstructed to match the outputs.)
df.loc[4]
Output:
Name Eva
Age 29
City Kolkata
df.loc[2:3]  # note: loc slicing includes both endpoints
Output:
Name Age City
2 Aina 22 Delhi
3 Ekta 32 Chandigarh
df.loc[1:3, ['Name', 'Age']]
Output:
Name Age
1 Druv 27
2 Aina 22
3 Ekta 32
3. Selecting rows using iloc[]
iloc[] selects rows and columns by integer position.
df.iloc[0]
Output:
Name Rakesh
Age 24
City Dehradun
df.iloc[1:5, [0, 2]]
Output:
Name City
1 Druv Lucknow
2 Aina Delhi
3 Ekta Chandigarh
4 Eva Kolkata
4. Re-indexing
We can reindex one or more rows using the reindex() method. Index values in the
new index that are not present in the DataFrame get rows filled with NaN.
new_index = [0,1,2,3,4]
df_new = df.reindex(new_index)
df_new
Output:
Name Age City
0 Rakesh 24 Dehradun
1 Druv 27 Lucknow
2 Aina 22 Delhi
3 Ekta 32 Chandigarh
4 Eva 29 Kolkata
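Because the new index above matches the existing one, no NaN appears in that output. A minimal sketch (assumed data) of how a label missing from the DataFrame becomes a NaN row:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Rakesh', 'Druv'], 'Age': [24, 27]})
# Label 5 does not exist in df, so its row is filled with NaN
df_new = df.reindex([0, 1, 5])
print(df_new)
```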
5. Sorting in DataFrame:
Sorting by Values
df_sorted = df.sort_values(by='Age')
df_sorted
Output:
Name Age City
2 Aina 22 Delhi
0 Rakesh 24 Dehradun
1 Druv 27 Lucknow
4 Eva 29 Kolkata
3 Ekta 32 Chandigarh
Instead of a single column, we can pass a list of columns; the DataFrame is then
sorted by them in order of priority.
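A minimal sketch of sorting by multiple columns (assumed sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Asha', 'Bina', 'Chand', 'Dev'],
    'City': ['Delhi', 'Agra', 'Delhi', 'Agra'],
    'Age':  [30, 25, 22, 28],
})
# Sort by City first; ties within a city are broken by Age
result = df.sort_values(by=['City', 'Age'])
print(result)
```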
7. Filtering in DataFrame
Filtering in DataFrame can be done by defining conditions.
To filter the data of students having age>=25
df[df['Age']>=25]
Output:
Name Age City
1 Druv 27 Lucknow
3 Ekta 32 Chandigarh
4 Eva 29 Kolkata
8. Ranking in DataFrame
Ranking assigns ranks to values in a Series or DataFrame, typically used to
determine relative positions (like 1st, 2nd, etc.) of numeric values.
df['Rank'] = df['Age'].rank(ascending=False)
print(df)
Output:
Name Age City Rank
0 Rakesh 24 Dehradun 4.0
1 Druv 27 Lucknow 3.0
2 Aina 22 Delhi 5.0
3 Ekta 32 Chandigarh 1.0
4 Eva 29 Kolkata 2.0
df['Rank'] = df['Age'].rank()
print(df)
Output:
Name Age City Rank
0 Rakesh 24 Dehradun 2.0
1 Druv 27 Lucknow 3.0
2 Aina 22 Delhi 1.0
3 Ekta 32 Chandigarh 5.0
4 Eva 29 Kolkata 4.0
4.4 Normalization
Normalization is a technique used to scale data so that it fits within a specific
range. The different normalization methods below are illustrated on the following
sample DataFrame 'df':
X1 X2
26 36
35 37
110 100
89 65
98 89
68 110
84 256
1. Min-Max Normalization
X_nor = (X - X_min) / (X_max - X_min)
where X_max and X_min are the maximum and minimum values of the attribute X.
Normalized data:
X1 X2 X1_nor X2_nor
26 36 0.00 0.00
35 37 0.11 0.00
110 100 1.00 0.29
89 65 0.75 0.13
98 89 0.86 0.24
68 110 0.50 0.34
84 256 0.69 1.00
Python programming
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(df)
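The same result can also be computed directly in pandas, which makes the formula explicit (using the sample data above):

```python
import pandas as pd

df = pd.DataFrame({'X1': [26, 35, 110, 89, 98, 68, 84],
                   'X2': [36, 37, 100, 65, 89, 110, 256]})

# Apply X_nor = (X - X_min) / (X_max - X_min) column-wise
df_nor = (df - df.min()) / (df.max() - df.min())
print(df_nor.round(2))
```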
2. Z-score Normalization
X_nor = (X - μ) / σ
where, 𝜇 and 𝜎 are the mean and standard deviation values of the attribute 𝑋.
𝜇1 = 72.86, 𝜎1 = 29.40
𝜇2 = 99, 𝜎2 = 69.52
It transforms the data to a standard normal distribution with mean 0 and
standard deviation 1; most values then lie roughly within the range [-3, 3].
Normalized data:
X1 X2 X1_nor X2_nor
26 36 -1.59 -0.91
35 37 -1.29 -0.89
110 100 1.26 0.01
89 65 0.55 -0.49
98 89 0.86 -0.14
68 110 -0.17 0.16
84 256 0.38 2.26
Python programming
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standardized_data = scaler.fit_transform(df)
3. Max-Abs Normalization
X_nor = X / |X_max|
where |X_max| is the maximum absolute value of the attribute X; dividing by it
scales the data to the range [-1, 1].
Normalized data:
X1 X2 X1_nor X2_nor
26 36 0.24 0.14
35 37 0.32 0.14
110 100 1.00 0.39
89 65 0.81 0.25
98 89 0.89 0.35
68 110 0.62 0.43
84 256 0.76 1.00
Python programming
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
normalized_data = scaler.fit_transform(df)
4.5 Operations on Series
When two Series are combined with an arithmetic operator, pandas aligns them by
index label; labels present in only one of the Series produce NaN. (The
definition of S1 was lost; its values below are inferred from the output.)
S1:
a 1
b 2
c 3
S2:
b 4
c 5
d 6
S1 + S2
Output:
a NaN
b 6.0
c 8.0
d NaN
4.6 Aggregation
Aggregation means reducing a dataset to summary statistics such as mean, sum,
count, min, and max. In pandas, the agg() method is used for this purpose. Let us
consider the DataFrame used in normalization, and suppose we wish to find the
minimum, maximum, mean, median, variance, and standard deviation:
df.agg(['min', 'max', 'mean', 'median', 'var', 'std'])
Output:
X1 X2
min 26.000000 36.000000
max 110.000000 256.000000
mean 72.857143 99.000000
median 84.000000 89.000000
var 1008.142857 5640.000000
std 31.751265 75.099933
If one of the attributes is categorical and we wish to aggregate separately for
each label, the groupby() method is used. Let us consider the DataFrame
data = pd.DataFrame({
'Department': ['HR', 'HR', 'IT', 'IT', 'Sale', 'IT', 'Sale'],
'Salary': [40000, 42000, 50000, 52000, 60000, 55000, 62000]
})
data
Output:
Department Salary
0 HR 40000
1 HR 42000
2 IT 50000
3 IT 52000
4 Sale 60000
5 IT 55000
6 Sale 62000
data.groupby('Department').agg(['min', 'max', 'mean'])
Output:
Salary
min max mean
Department
HR 40000 42000 41000.000000
IT 50000 55000 52333.333333
Sale 60000 62000 61000.000000
4.7 Summarization
Summarization gives a quick overview or statistical snapshot of your data.
describe() method is used for the same.
data.describe()
Output:
Salary
count 7.000000
mean 51571.428571
std 8363.753999
min 40000.000000
25% 46000.000000
50% 52000.000000
75% 57500.000000
max 62000.000000
4.8 Time Series Analysis
Time series data analysis involves studying data points collected in
chronological order to identify trends, patterns, and other behaviors. This helps
extract actionable insights and supports accurate forecasting and decision-making.
Key Concepts in Time Series Analysis
• Order: The order of differencing refers to the number of times the time series
data needs to be differenced to achieve stationarity.
The examples below use the following sales data (the original code was lost; it
is reconstructed from the output):
df = pd.DataFrame({
    'date': ['2024-02-01', '2024-01-15', '2024-02-03',
             '2024-02-03', '2024-03-02', '2024-03-10'],
    'Sale': [1000, 500, 1500, 920, 1200, 1050]
})
df['date'] = pd.to_datetime(df['date'])
df
Output:
date Sale
0 2024-02-01 1000
1 2024-01-15 500
2 2024-02-03 1500
3 2024-02-03 920
4 2024-03-02 1200
5 2024-03-10 1050
Setting the date column as the index (needed for resampling):
df = df.set_index('date')
df
Output:
Sale
date
2024-02-01 1000
2024-01-15 500
2024-02-03 1500
2024-02-03 920
2024-03-02 1200
2024-03-10 1050
Resampling: To better understand the trend of the data we use resampling, which
provides a clearer view of trends and patterns when dealing with daily data.
df_resampled = df.resample('M').mean(numeric_only=True)
This resamples the data to monthly frequency and calculates the mean of all
numeric columns for each month. In newer versions of pandas, 'ME' (month end) is
used in place of 'M':
monthly = df.resample('ME').mean()
monthly
Output:
Sale
date
2024-01-31 500.0
2024-02-29 1140.0
2024-03-31 1125.0
Moving Average
A moving (rolling) average smooths a series by averaging over a sliding window of
observations.
df['rolling_mean'] = df['Sale'].rolling(window=2).mean()
df
Output:
Sale rolling_mean
date
2024-02-01 1000 NaN
2024-01-15 500 750.0
2024-02-03 1500 1000.0
2024-02-03 920 1210.0
2024-03-02 1200 1060.0
2024-03-10 1050 1125.0
4.9 Matplotlib
Matplotlib is a powerful Python library used for creating static, interactive, and
animated visualizations. It can be viewed as the "drawing tool" of Python —
perfect for plotting graphs, charts, and figures.
Plot Types
Line Plot
A Line Plot (or Line Chart) is a graph that connects individual data points with a
straight line. It's used to show trends over time or the relationship between two
continuous variables.
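The original plotting code and figure were not preserved; a minimal line-plot sketch (assumed data):

```python
import matplotlib.pyplot as plt

# Trend of monthly sales (assumed values)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
sales = [500, 1140, 1125, 1300, 1250]

plt.plot(months, sales, marker='o')  # connect the points with a line
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()
```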
Bar Graph
A Bar Graph is a chart that uses rectangular bars to represent and compare
quantitative data across different categories.
Key Features
• Each bar's height (or length) shows the value of the category.
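The original plotting code and figure were not preserved; a minimal bar-graph sketch (assumed category data):

```python
import matplotlib.pyplot as plt

# Number of students in each stream (assumed values)
streams = ['Physics', 'Chemistry', 'Maths', 'Biology']
students = [35, 28, 40, 22]

plt.bar(streams, students, color='skyblue', edgecolor='black')
plt.title('Students per Stream')
plt.xlabel('Stream')
plt.ylabel('Number of Students')
plt.show()
```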
Histogram Example
import matplotlib.pyplot as plt
# Sample data: marks of 20 students
marks = [30, 35, 42, 73, 72, 85, 55, 77, 70, 75, 80, 68, 65, 55,
60, 78, 63, 75, 50, 66, 63]
# Create histogram
plt.hist(marks, bins=5, color='skyblue', edgecolor='black')
# Add labels and title
plt.title('Student Marks Distribution')
plt.xlabel('Marks Range')
plt.ylabel('Number of Students')
# Show plot
plt.show()
Output:
Scatter Plot
A Scatter Plot is a type of graph used to visualize the relationship between two
variables. It displays points (dots) for each observation in the dataset — one variable
on the x-axis, and the other on the y-axis.
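The original plotting code and figure were not preserved; a minimal scatter-plot sketch (assumed data):

```python
import matplotlib.pyplot as plt

# Hours studied (x-axis) vs. marks obtained (y-axis), assumed values
hours = [1, 2, 3, 4, 5, 6, 7, 8]
marks = [35, 45, 50, 58, 65, 70, 78, 85]

plt.scatter(hours, marks, color='green')  # one dot per observation
plt.title('Hours Studied vs Marks')
plt.xlabel('Hours Studied')
plt.ylabel('Marks')
plt.show()
```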
Pie chart
A pie chart is a circular graph that uses slices to represent the proportions of different
categories within a whole. Each slice's area is proportional to the value it represents,
visually illustrating how different parts contribute to the total.
Circular Representation: The chart is a circle divided into slices, with each slice
representing a different category or group.
Proportional Slices: The size of each slice is directly related to the percentage or
proportion it represents of the total.
Visualizing Proportions: Pie charts are effective for showing how different
components contribute to a whole, especially when comparing parts to each other and
to the total.
Limitations:
Pie charts are best suited to a small number of categories; when there are many
slices, or slices of similar size, differences become hard to read, and a bar
graph allows more precise comparison of values.
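The original plotting code and figure were not preserved; a minimal pie-chart sketch (assumed values):

```python
import matplotlib.pyplot as plt

# Share of each department in total headcount (assumed values)
departments = ['HR', 'IT', 'Sales', 'Finance']
share = [10, 45, 30, 15]

# autopct prints each slice's percentage of the whole on the chart
plt.pie(share, labels=departments, autopct='%1.1f%%')
plt.title('Employees by Department')
plt.show()
```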