Python
1. Aim: To understand the preprocessing required for data before using it for AI agent training/testing
2. Prerequisites:
1. Python programming, Basics of probability Theory
FAMT/ IT / Semester –VI (Rev-2019) / DS using Python Lab / Academic Year: 2024-25 / First Half of 2025
13. Experiment/Assignment Evaluation
Experiment/Assignment Evaluation:
References:
[3] Howard J. Seltman, Experimental Design and Analysis, Carnegie Mellon University,
2012/1.
[4] Ethem Alpaydın, “Introduction to Machine Learning”, MIT Press
Viva Questions
1. What is data?
2. What is data processing?
3. What if there are null or missing values in the data?
4. How to identify outliers in the data?
Importing data file into process
import pandas as pd
df = pd.read_csv('employees.csv')
print(type(df))
<class 'pandas.core.frame.DataFrame'>
print(df)
TEAM
0 Marketing
1 NaN
2 Finance
3 Finance
4 Client Services
.. ...
63 Human Resources
64 Business Development
65 Distribution
66 Business Development
67 Finance
Getting the column labels (i.e., the names of all the columns) of the DataFrame df. It returns a pandas.Index
object, which contains the column names of the DataFrame.
df.columns
df['TEAM'].value_counts()
count
TEAM
Client Services 10
Business Development 9
Finance 8
Product 7
Legal 6
Engineering 6
Marketing 5
Human Resources 5
Sales 5
Distribution 3
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 REG 68 non-null object
1 NAME 68 non-null object
2 GENDER 67 non-null object
3 SALARY 68 non-null int64
4 BONUS 65 non-null float64
5 TEAM 64 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 3.3+ KB
df.describe()
SALARY BONUS
df.isnull()
68 rows × 6 columns
Calculating the total number of missing values (NaN) in each column of a DataFrame.
df.isnull().sum()
0
REG 0
NAME 0
GENDER 1
SALARY 0
BONUS 3
TEAM 4
dtype: int64
df.isnull().sum().sum()
Creating a new DataFrame by removing the rows of the original DataFrame that contain any missing values
(NaN). The original DataFrame remains unchanged unless the result is assigned back to it.
df_without_nan= df.dropna()
df_without_nan.head(5)
NAME
df_without_nan.info()
<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 0 to 67
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 REG 60 non-null object
1 NAME 60 non-null object
2 GENDER 60 non-null object
3 SALARY 60 non-null int64
4 BONUS 60 non-null float64
5 TEAM 60 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 3.3+ KB
df_without_nan.isnull().sum()
REG 0
NAME 0
GENDER 0
SALARY 0
BONUS 0
TEAM 0
dtype: int64
df_without_nan.isnull().sum().sum()
Displaying the first rows of a DataFrame. By default, df.head() returns the first 5 rows; passing an integer
argument selects a different number of rows (10 here).
df.head(10)
Displaying the last 5 rows of a DataFrame. If the DataFrame has fewer than 5 rows, it will display all
available rows.
df.tail(5)
Accessing the rows with index labels 11 to 15 (both inclusive, because loc slices by label) and all columns in the DataFrame
df.loc[11:15,:]
df['NAME'].head(5)
NAME
dtype: object
df[['NAME','SALARY']]
NAME SALARY
68 rows × 2 columns
Selecting rows in the dataframe whose salary is greater than or equal to 10000 and gender is Male
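The code cell for this step did not survive in this copy. A minimal sketch of such a filter, assuming the column names reported by df.info() and using a small made-up DataFrame (demo, with invented values) in place of employees.csv, could look like this:

```python
import pandas as pd

# Hypothetical stand-in for employees.csv; the values are invented for illustration.
demo = pd.DataFrame({
    "NAME": ["A", "B", "C", "D"],
    "GENDER": ["M", "F", "M", "M"],
    "SALARY": [9000, 120000, 10000, 101000],
})

# Combine the two conditions with & and wrap each one in parentheses.
males_min_salary = demo[(demo["SALARY"] >= 10000) & (demo["GENDER"] == "M")]
print(males_min_salary)
```

The parentheses matter: & binds more tightly than >= and ==, so leaving them out changes how the expression is evaluated.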
    REG         NAME                      GENDER  SALARY  BONUS  TEAM
4   TD-23-0502  BACHIM ATHARV MARUTI      M       101004  1.389  Client Services
9   T-22-0091   DAMALE KEDAR PRAVIN       M       139852  7.524  Business Development
61  T-22-0514   SAWANT PRANAD VINAYAK **  M       106862  3.699  Business Development
Filtering rows in the DataFrame that contain at least one NaN value in any of their columns.
df[df.isnull().any(axis=1)]
df['SALARY'].median()
95273.0
df['BONUS'].mean()
10.338061538461538
df['GENDER'].mode()
GENDER
0 M
dtype: object
Filling the missing (NaN) values in specific columns of the DataFrame with appropriate values
# Assign the result back to the column. Calling fillna(..., inplace=True) on a column
# selection raises a FutureWarning and will stop working in pandas 3.0.
df['GENDER'] = df['GENDER'].fillna(df['GENDER'].mode()[0])
df['BONUS'] = df['BONUS'].fillna(df['BONUS'].mean())
df['TEAM'] = df['TEAM'].fillna(df['TEAM'].mode()[0])
df.isnull().sum().sum()
Filtering the DataFrame which will only contain rows where the SALARY column has values between 10,000
and 200,000, inclusive.
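The code for this filter is missing from this copy. One way to write it, sketched on a made-up DataFrame (demo) rather than the lab's data, is Series.between, which includes both endpoints by default:

```python
import pandas as pd

# Hypothetical salaries chosen to sit below, on, and above the two limits.
demo = pd.DataFrame({"SALARY": [5000, 10000, 95000, 200000, 250000]})

# between() keeps values with 10000 <= SALARY <= 200000.
in_range = demo[demo["SALARY"].between(10000, 200000)]
print(in_range)
```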
65 rows × 6 columns
Replacing values in the SALARY column that are more than 150,000 units away from the median salary with
the median salary itself.
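The cell that performed this replacement is not present in this copy. A possible sketch, using invented salaries in a DataFrame named demo, builds a boolean mask from the absolute distance to the median and then swaps the flagged values with Series.mask:

```python
import pandas as pd

# Hypothetical data: one value lies far above the others.
demo = pd.DataFrame({"SALARY": [90000, 95000, 100000, 400000, 97000]})

median_salary = demo["SALARY"].median()

# True where the value is more than 150,000 away from the median.
too_far = (demo["SALARY"] - median_salary).abs() > 150000

# mask() replaces the entries where the condition is True and returns a new Series.
demo["SALARY"] = demo["SALARY"].mask(too_far, median_salary)
print(demo)
```

Assigning the result back to the column avoids the chained inplace pattern that pandas warns about.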
68 rows × 6 columns
Outliers are data points that differ significantly from the rest of the data; they usually appear as extreme
or unusual values.
Creating a new DataFrame that contains the rows where the SALARY column values are either less than 10,000
or greater than 200,000. These rows are considered outliers in the SALARY column based on the specified
salary range.
Selecting the rows in the DataFrame where the SALARY column contains values that are more than 60,000
units away (in absolute terms) from the median salary. These rows are considered outliers based on the
specified threshold.
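The cell that defines the outliers variable used in the next two calls is missing from this copy. A sketch of how it might be built, using invented salaries in a DataFrame named demo, is:

```python
import pandas as pd

# Hypothetical data with one very low and one very high salary.
demo = pd.DataFrame({"SALARY": [90000, 95000, 3000, 100000, 400000, 97000]})

# Absolute distance of every salary from the median salary.
distance = (demo["SALARY"] - demo["SALARY"].median()).abs()

# Keep only the rows that lie more than 60,000 away from the median.
outliers = demo[distance > 60000]
print(outliers.sort_values(by="SALARY", ascending=False))
```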
outliers.sort_values(by="SALARY")
outliers.sort_values(by="SALARY",ascending=False)
68 rows × 6 columns
Setting a specific column as the index for the DataFrame. This means that the column will no longer be a
regular column but will instead become the row labels (index).
df.set_index('REG').head(10)
REG
df.set_index('NAME',inplace=True)
df
NAME
68 rows × 5 columns
1. Importing the numpy library, a powerful library for numerical computing in Python.
2. Importing the StandardScaler class from the sklearn.preprocessing module. It is a tool from the
scikit-learn library used for feature scaling.
import numpy as np
from sklearn.preprocessing import StandardScaler
s=StandardScaler()
Performing feature scaling on the SALARY column of the DataFrame using the StandardScaler.
Standardization: Each SALARY value x is replaced by its z-score, (x - mean) / standard deviation, so the
scaled column has a mean of 0 and a standard deviation of 1. The values are not confined to a fixed range,
but they are on a common scale, which many machine learning algorithms handle better.
Reshaping: The reshaping (reshape(-1, 1)) is required because the scaler expects a 2D array as input.
df['SALARY'] = s.fit_transform(np.array(df['SALARY']).reshape(-1, 1))
df
NAME
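As a check on what StandardScaler computes, the following sketch compares its output with a z-score calculated by hand. The DataFrame demo and its values are invented for illustration; the comparison assumes the scaler's default settings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical salaries, not taken from employees.csv.
demo = pd.DataFrame({"SALARY": [40000.0, 60000.0, 80000.0, 100000.0]})

# Passing a one-column DataFrame gives the scaler the 2D input it expects.
scaled = StandardScaler().fit_transform(demo[["SALARY"]]).ravel()

# StandardScaler divides by the population standard deviation (ddof=0),
# whereas pandas' Series.std() defaults to the sample version (ddof=1).
manual = (demo["SALARY"] - demo["SALARY"].mean()) / demo["SALARY"].std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```

The ddof difference is worth keeping in mind when comparing scikit-learn results with pandas summaries such as df.describe().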
Title: Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib
and Seaborn
Marks: Teacher’s Signature:
2. Prerequisites:
1. Python programming, Basics of probability Theory
13. Experiment/Assignment Evaluation
Experiment/Assignment Evaluation:
Viva Questions
1. What are matplotlib and seaborn packages?
2. What are different plots supported by those packages?
3. What is EDA?
Importing the pandas library and loading data from csv file into a dataframe
import pandas as pd
df= pd.read_csv('employees.csv')
Replacing missing values in the GENDER column with the most frequently occurring value (mode) in that column, replacing
missing values in the BONUS column with the mean (average) of the column, and replacing missing values in the TEAM column
with the most frequently occurring value (mode), as for the GENDER column.
# Assign the result back to the column; fillna(..., inplace=True) on a column selection
# raises a FutureWarning and will stop working in pandas 3.0.
df['GENDER'] = df['GENDER'].fillna(df['GENDER'].mode()[0])
df['BONUS'] = df['BONUS'].fillna(df['BONUS'].mean())
df['TEAM'] = df['TEAM'].fillna(df['TEAM'].mode()[0])
Identifying and handling outliers in the SALARY column by replacing extreme values with the median of the SALARY column
df.head(10)
Importing the pyplot module from the matplotlib library under the alias plt, then drawing a line plot of the SALARY column
import matplotlib.pyplot as plt
plt.plot(df['SALARY'])
[<matplotlib.lines.Line2D at 0x7fb84bd3b370>]
Creating a bar plot: x determines the position of each bar on the X-axis, and height specifies the height of each bar (the
corresponding Y-axis value).
plt.bar(x=df.index,height=df['SALARY'])
Creating a box plot, which is a graphical representation of the distribution of the data based on five summary statistics.
grid() adds a grid to the plot for better readability.
plt.boxplot(df['SALARY'])
plt.grid()
Same as above, but the showmeans=True argument also marks the mean on the box plot
plt.boxplot(df['SALARY'],showmeans=True)
plt.grid()
plt.hist(df['BONUS'])
(array([8., 2., 6., 8., 6., 4., 9., 3., 6., 8.]),
array([ 1.256 , 3.0718, 4.8876, 6.7034, 8.5192, 10.335 , 12.1508,
13.9666, 15.7824, 17.5982, 19.414 ]),
<BarContainer object of 10 artists>)
plt.hist(df['BONUS'],density=True)
df.dropna(inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 0 to 67
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 REG 60 non-null object
1 NAME 60 non-null object
2 GENDER 60 non-null object
3 SALARY 60 non-null int64
4 BONUS 60 non-null float64
5 TEAM 60 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 3.3+ KB
Creating a line plot of the SALARY column in which both outliers are clearly visible
[<matplotlib.lines.Line2D at 0x7b4367afe810>]
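The plotting call for this line plot is not present in this copy. If the outliers should be highlighted explicitly rather than just seen as spikes, one option is to overlay markers on the flagged rows. The sketch below uses invented salaries (demo) and a 60,000 distance-from-median rule as assumptions; the original cell may have used a plain plt.plot call instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with one very low and one very high salary.
demo = pd.DataFrame({"SALARY": [90000, 95000, 3000, 100000, 400000, 97000]})

# Flag salaries more than 60,000 away from the median (assumed rule).
median_salary = demo["SALARY"].median()
is_outlier = (demo["SALARY"] - median_salary).abs() > 60000

fig, ax = plt.subplots()
ax.plot(demo.index, demo["SALARY"], label="SALARY")
# Draw red markers on top of the line at the flagged positions.
ax.scatter(demo.index[is_outlier], demo.loc[is_outlier, "SALARY"],
           color="red", zorder=3, label="outlier")
ax.set_xlabel("row index")
ax.set_ylabel("SALARY")
ax.legend()
```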
Creating a vertical bar plot of the SALARY column in which both outliers are clearly visible
plt.bar(x=df.index,height=df['SALARY'])
<BarContainer object of 60 artists>
Creating a box plot of the SALARY column in which both outliers are clearly visible
plt.boxplot(df['SALARY'],showmeans=True)
plt.grid()
Importing the Seaborn library, used for creating statistical and visually appealing plots, under the alias sns.
import seaborn as sns
Creating a vertical box plot to visualize the distribution of the "BONUS" column in the DataFrame df, highlighting its median,
quartiles, and outliers.
sns.boxplot(y="BONUS",data=df)
<Axes: ylabel='BONUS'>
Creating a scatter plot with a regression line to visualize the relationship between "SALARY" (x-axis) and "BONUS" (y-axis) in
the DataFrame df.
sns.regplot(x="SALARY",y="BONUS",data=df)
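Creating a bar plot of the "BONUS" values (y-axis) against the row index of the DataFrame df (x-axis). sns.barplot shows the
mean of "BONUS" for each x value with error bars by default; because every index value occurs only once here, each bar is
simply that row's "BONUS".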
sns.barplot(x=df.index,y="BONUS",data=df)
<Axes: xlabel='None', ylabel='BONUS'>
Creating a count plot to display the frequency (count) of each unique value in the "GENDER" column of the DataFrame df .
sns.countplot(x="GENDER",data=df)
Creating a grouped count plot to show the frequency of each "GENDER" category, further grouped by the "TEAM" column, in the
DataFrame df.
sns.countplot(x="GENDER", hue="TEAM",data=df)
<Axes: xlabel='GENDER', ylabel='count'>
Creating a horizontal count plot to display the frequency (count) of each unique value in the "TEAM" column of the DataFrame
df.
sns.countplot(y="TEAM",data=df)
Creating a horizontal count plot that shows the frequency of each "TEAM" category, further segmented by "GENDER," in the
DataFrame df .
sns.countplot(y="TEAM",hue="GENDER",data=df)
<Axes: xlabel='count', ylabel='TEAM'>
Creating a horizontal count plot showing the frequency of each "TEAM" category, segmented by "GENDER" with custom colors
(tomato red and bright red) for each gender, using the specified color palette ["#FF6347", "#FF0001"] .
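The cell for this plot is missing from this copy. A sketch of the call, assuming it mirrors the previous countplot with a palette argument added, and using a small invented DataFrame (demo), might be:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical team and gender values for illustration only.
demo = pd.DataFrame({
    "TEAM": ["Finance", "Finance", "Sales", "Sales", "Legal"],
    "GENDER": ["M", "F", "M", "M", "F"],
})

# The palette list supplies one color per hue level, in hue-level order.
ax = sns.countplot(y="TEAM", hue="GENDER", data=demo,
                   palette=["#FF6347", "#FF0001"])
```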
Creating a vertical box plot to visualize the distribution of "BONUS" values, grouped by "GENDER" in the DataFrame df , with the
mean values shown for each group.
sns.boxplot(y="BONUS",data=df,hue="GENDER",showmeans=True)
<Axes: ylabel='BONUS'>
Creating a cross-tabulation (contingency table) that shows the count of occurrences for each unique value in the "TEAM"
column of the DataFrame df . The result is a summary of the frequency of each team.
pd.crosstab(index=df['TEAM'],columns="count")
col_0 count
TEAM
Business Development 9
Client Services 9
Distribution 3
Engineering 5
Finance 8
Human Resources 5
Legal 4
Marketing 5
Product 7
Sales 5
Creating a cross-tabulation (contingency table) that shows the frequency distribution of "GENDER" values for each unique
"TEAM" in the DataFrame df , displaying how many occurrences of each gender are present in each team.
pd.crosstab(index=df['TEAM'],columns=df["GENDER"])
GENDER                F  M
TEAM
Business Development  2  7
Client Services       5  4
Distribution          1  2
Engineering           3  2
Finance               4  4
Human Resources       1  4
Legal                 2  2
Marketing             4  1
Product               3  4
Creating a cross-tabulation (contingency table) that shows the normalized (relative frequency) count of occurrences for each
unique value in the "TEAM" column of the DataFrame df, with the results expressed as proportions (the proportions sum to 1).
pd.crosstab(index=df['TEAM'],columns="count",normalize=True)
TEAM
Distribution 0.050000
Engineering 0.083333
Finance 0.133333
Legal 0.066667
Marketing 0.083333
Product 0.116667
Sales 0.083333
Creating a cross-tabulation (contingency table) that shows the normalized (relative frequency) distribution of "GENDER" and
"TEAM" combinations in the DataFrame df. With normalize=True each cell is a proportion of the grand total, so all cells
together sum to 1; normalize='index' would instead make each row (each team) sum to 1.
pd.crosstab(index=df['TEAM'],columns=df["GENDER"],normalize=True)
GENDER F M
TEAM
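The effect of the normalize argument of pd.crosstab can be seen on a small example. The DataFrame demo below is invented for illustration; it contrasts normalizing over the whole table with normalizing within each row.

```python
import pandas as pd

# Hypothetical data: three Finance employees and one Sales employee.
demo = pd.DataFrame({
    "TEAM": ["Finance", "Finance", "Finance", "Sales"],
    "GENDER": ["F", "M", "M", "F"],
})

# normalize=True divides every cell by the grand total (all cells sum to 1).
overall = pd.crosstab(index=demo["TEAM"], columns=demo["GENDER"], normalize=True)

# normalize="index" divides every cell by its row total (each row sums to 1).
by_row = pd.crosstab(index=demo["TEAM"], columns=demo["GENDER"], normalize="index")

print(overall)
print(by_row)
```

Here the Finance/M cell is 0.5 in the first table (2 of 4 employees overall) and about 0.667 in the second (2 of 3 Finance employees).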
Title: Data Modeling
1. Aim: To understand how to split given data into a training and testing set, and validate it.
2. Prerequisites:
1. Python programming, Basics of probability Theory
13. Experiment/Assignment Evaluation
Experiment/Assignment Evaluation:
Viva Questions
1. What are packages that support functionality to split the data into two sets?
2. What do you mean by data validation?
3. What is a two-sample Z-test?