Python

The document outlines an assignment for a Data Science lab course at Finolex Academy, focusing on data preparation using Python libraries NumPy and Pandas. It includes the aim, prerequisites, hardware and software requirements, learning objectives, and evaluation criteria for the experiment. Additionally, it provides code examples for data manipulation and analysis, including handling missing values and generating summary statistics.

Uploaded by

Viral Van
‭Finolex Academy of Management and Technology, Ratnagiri‬

‭Department of Information Technology‬

‭Subject:‬ ‭DS using Python Lab. (ITL605)‬

‭Class:‬ ‭TE IT / Semester – VI (Rev-2019 ‘C’) / Academic year: 2024-25‬


Name of Student: Kedar Pravin Damale
‭Roll No:‬ ‭10‬ ‭Date of performance (DOP) :‬

‭Assignment/Experiment No:‬ ‭01‬ ‭Date of checking (DOC) :‬

‭Title: Data preparation using NumPy and Pandas‬

‭Marks:‬ ‭Teacher’s Signature:‬

‭1.‬‭Aim‬‭: To understand preprocessing required for data‬‭before using it for AI agent training/testing‬

‭2. Prerequisites‬‭:‬
‭1.‬ ‭Python programming, Basics of probability Theory‬

‭3. Hardware Requirements‬‭:‬


‭1.‬ ‭PC with minimum 2GB RAM‬

‭4. Software Requirements:‬


‭1.‬ ‭Windows / Linux OS.‬
‭2.‬ ‭Python 3.6 or higher‬

‭5. Learning Objectives:‬


‭1.‬ ‭To know the records with missing values‬
‭2.‬ ‭To know the records with outliers‬
‭3.‬ ‭To know how to deal with missing values and outliers‬

6. Learning Objectives Applicable: LO1
‭7. Program Outcomes Applicable: PO1, PO2, PO4‬
‭8. Program Education Objectives Applicable: PEO1‬

‭FAMT/ IT / Semester –VI (Rev-2019) / DS using Python Lab / Academic Year: 2024-25 / First Half of 2025‬
13. Experiment/Assignment Evaluation

Sr. No. | Parameters | Marks obtained | Out of
1 | Technical Understanding (Assessment may be done based on Q & A or any other relevant method.) Teacher should mention the other method used - | | 6
2 | Neatness/presentation | | 2
3 | Punctuality | | 2
Date of performance (DOP) | Total marks obtained | | 10
Date of checking (DOC) | Signature of teacher | |


‭Viva Questions‬
1. What is data?
‭2.‬ ‭What is data processing?‬
‭3.‬ ‭What if there are null or missing values in the data?‬
‭4.‬ ‭How to identify outliers in the data?‬

Importing the data file into a pandas DataFrame

import pandas as pd
df = pd.read_csv('employees.csv')

Checking data type of our dataframe

print(type(df))

<class 'pandas.core.frame.DataFrame'>

Displaying the data

print(df)

REG NAME GENDER SALARY BONUS \


0 T-22-0107 AMBERKAR KOMAL SURYAKANT F 97308 6.945
1 T-22-0144 ARLEKAR PRATHAMESH MAHESH M 61933 4.170
2 T-22-0459 AYARE DARSHAN NARESH M 130590 11.858
3 T-22-0140 AYARE SANIA NARENDRA F 138705 9.340
4 TD-23-0502 BACHIM ATHARV MARUTI M 101004 1.389
.. ... ... ... ... ...
63 T-22-0519 SHAIKH FABIHA IMTIYAZ F 35203 18.040
64 T-22-0007 TALEKAR ARYAN RAJENDRA M 77834 18.771
65 T-22-0074 TARVE OMKAR SHANIL M 1012655 12.428
66 T-22-0026 THATTE GANDHAR NILESH M 125250 2.672
67 T-22-0273 VASKAR RAHUL DIPAK M 51178 9.735

TEAM
0 Marketing
1 NaN
2 Finance
3 Finance
4 Client Services
.. ...
63 Human Resources
64 Business Development
65 Distribution
66 Business Development
67 Finance

[68 rows x 6 columns]

Getting the column labels (i.e., the names of all the columns) of the DataFrame df. It returns a pandas.Index
object, which contains the column names of the DataFrame.

df.columns

Index(['REG', 'GENDER', 'SALARY', 'BONUS', 'TEAM'], dtype='object')


Counting how often each unique value occurs in the TEAM column of the DataFrame. The result is sorted in
descending order of count by default, and missing values (NaN) are excluded from the counts.

df['TEAM'].value_counts()

count

TEAM

Client Services 10

Business Development 9

Finance 8

Product 7

Legal 6

Engineering 6

Marketing 5

Human Resources 5

Sales 5

Distribution 3

dtype: int64

displaying concise summary of a DataFrame

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 REG 68 non-null object
1 NAME 68 non-null object
2 GENDER 67 non-null object
3 SALARY 68 non-null int64
4 BONUS 65 non-null float64
5 TEAM 64 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 3.3+ KB

generating summary statistics of the numerical columns in a DataFrame

df.describe()
SALARY BONUS

count 6.800000e+01 65.000000

mean 1.185428e+05 10.338062

std 1.747416e+05 5.439170

min 1.000000e+01 1.256000

25% 6.471475e+04 6.083000

50% 9.527300e+04 10.012000

75% 1.188555e+05 14.543000

max 1.175987e+06 19.414000

Detecting missing values in a DataFrame

df.isnull()

REG NAME GENDER SALARY BONUS TEAM

0 False False False False False False

1 False False False False False True

2 False False False False False False

3 False False False False False False

4 False False False False False False

... ... ... ... ... ... ...

63 False False False False False False

64 False False False False False False

65 False False False False False False

66 False False False False False False

67 False False False False False False

68 rows × 6 columns

Calculating the total number of missing values (NaN) in each column of a DataFrame.

df.isnull().sum()
0

REG 0

NAME 0

GENDER 1

SALARY 0

BONUS 3

TEAM 4

dtype: int64

Calculating the total number of missing values (NaN) in the entire DataFrame.

df.isnull().sum().sum()

Creating a new DataFrame by removing rows from the original DataFrame that contain any missing values
(NaN). The original DataFrame remains unchanged unless the result is explicitly assigned back to it.

df_without_nan= df.dropna()
df_without_nan.head(5)

REG GENDER SALARY BONUS TEAM

NAME

AMBERKAR KOMAL SURYAKANT T-22-0107 F -0.122425 6.945 Marketing

AYARE DARSHAN NARESH T-22-0459 M 0.069456 11.858 Finance

AYARE SANIA NARENDRA T-22-0140 F 0.116241 9.340 Finance

BACHIM ATHARV MARUTI TD-23-0502 M -0.101116 1.389 Client Services

BHOMBAL ZIA AMIR T-22-0498 F -0.305945 10.012 Product


df_without_nan.info()

<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 0 to 67
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 REG 60 non-null object
1 NAME 60 non-null object
2 GENDER 60 non-null object
3 SALARY 60 non-null int64
4 BONUS 60 non-null float64
5 TEAM 60 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 3.3+ KB

df_without_nan.isnull().sum()

REG 0

NAME 0

GENDER 0

SALARY 0

BONUS 0

TEAM 0

dtype: int64

df_without_nan.isnull().sum().sum()
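
The missing-value steps above can be tried on a tiny frame without employees.csv. This is only a sketch: the `sample` frame and its values are invented for illustration and are not taken from the lab data set.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the employees table (values invented for illustration).
sample = pd.DataFrame({
    "NAME": ["A", "B", "C", "D"],
    "GENDER": ["F", np.nan, "M", "M"],
    "BONUS": [6.9, 4.2, np.nan, 9.3],
    "TEAM": ["Finance", np.nan, "Legal", "Sales"],
})

missing_per_column = sample.isnull().sum()       # one count per column
missing_total = int(missing_per_column.sum())    # one number for the whole frame
rows_with_gaps = sample[sample.isnull().any(axis=1)]

complete_rows = sample.dropna()                  # returns a new frame; sample is unchanged

print(missing_per_column.to_dict())
print(missing_total, len(rows_with_gaps), len(complete_rows), len(sample))
```

Because dropna() returns a new object, `sample` still has all four rows afterwards, which mirrors how df and df_without_nan coexist in the cells above.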

displaying the first rows of a DataFrame. By default, df.head() returns the first 5 rows if no argument is
provided, but you can specify the number of rows to display by passing an integer argument.

df.head(10)

REG NAME GENDER SALARY BONUS TEAM

0 T-22-0107 AMBERKAR KOMAL SURYAKANT F 97308 6.945 Marketing

1 T-22-0144 ARLEKAR PRATHAMESH MAHESH M 61933 4.170 NaN

2 T-22-0459 AYARE DARSHAN NARESH M 130590 11.858 Finance

3 T-22-0140 AYARE SANIA NARENDRA F 138705 9.340 Finance

4 TD-23-0502 BACHIM ATHARV MARUTI M 101004 1.389 Client Services

5 T-22-0048 BHATKAR SAHIL VILAS NaN 115163 10.125 Legal

6 T-22-0498 BHOMBAL ZIA AMIR F 65476 10.012 Product

7 T-22-0525 BHUJBAL YUKTA SADASHIV F 45906 11.598 Finance

8 T-22-0085 DABHOLKAR PRACHI PRADIP F 95570 NaN Engineering

9 T-22-0091 DAMALE KEDAR PRAVIN M 139852 7.524 Business Development

Displaying the last 5 rows of a DataFrame. By default, df.tail() also returns 5 rows; if the DataFrame has
fewer rows than requested, all available rows are displayed.
df.tail(5)

REG NAME GENDER SALARY BONUS TEAM

63 T-22-0519 SHAIKH FABIHA IMTIYAZ F 35203 18.040 Human Resources

64 T-22-0007 TALEKAR ARYAN RAJENDRA M 77834 18.771 Business Development

65 T-22-0074 TARVE OMKAR SHANIL M 1012655 12.428 Distribution

66 T-22-0026 THATTE GANDHAR NILESH M 125250 2.672 Business Development

67 T-22-0273 VASKAR RAHUL DIPAK M 51178 9.735 Finance

Accessing rows with index labels from 11 to 15 (both inclusive) and all columns in the DataFrame

df.loc[11:15,:]

# : Denotes all columns


# 11:15 specifies the range

REG NAME GENDER SALARY BONUS TEAM

11 T-22-0442 DHANE KARAN ASHOK M 102508 12.637 Legal

12 T-22-0108 DHURI ADITYA SHANKAR M 112807 17.492 Human Resources

13 T-22-0531 DORLEKAR DEEYA VISHVANTH ** F 109831 5.831 Sales

14 T-22-0516 GADEKAR ANUJA VINAYAK F 41426 14.543 Finance

15 T-22-0247 GARJE MAYUR BALASAHEB M 10 1.256 Product
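
The inclusive end label is specific to label-based selection. A minimal sketch of the difference between .loc and .iloc, assuming a hypothetical frame with the default integer index:

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex 0..5.
sample = pd.DataFrame({"SALARY": [10, 20, 30, 40, 50, 60]})

by_label = sample.loc[2:4, :]       # labels 2, 3 and 4: the end label is included
by_position = sample.iloc[2:4, :]   # positions 2 and 3: the end position is excluded

print(list(by_label.index))
print(list(by_position.index))
```

With a default index the labels and positions coincide, so the two calls differ only in whether the end of the slice is kept.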

Accessing single column in the DataFrame

df['NAME'].head(5)

NAME

0 AMBERKAR KOMAL SURYAKANT

1 ARLEKAR PRATHAMESH MAHESH

2 AYARE DARSHAN NARESH

3 AYARE SANIA NARENDRA

4 BACHIM ATHARV MARUTI

dtype: object

Accessing multiple columns in the DataFrame

df[['NAME','SALARY']]
NAME SALARY

0 AMBERKAR KOMAL SURYAKANT 97308

1 ARLEKAR PRATHAMESH MAHESH 61933

2 AYARE DARSHAN NARESH 130590

3 AYARE SANIA NARENDRA 138705

4 BACHIM ATHARV MARUTI 101004

... ... ...

63 SHAIKH FABIHA IMTIYAZ 35203

64 TALEKAR ARYAN RAJENDRA 77834

65 TARVE OMKAR SHANIL 1012655

66 THATTE GANDHAR NILESH 125250

67 VASKAR RAHUL DIPAK 51178

68 rows × 2 columns

Selecting rows in the DataFrame whose salary is greater than or equal to 100,000 and whose gender is male

df_without_nan[(df_without_nan['SALARY'] >= 100000) & (df_without_nan['GENDER'] == 'M')]

REG NAME GENDER SALARY BONUS TEAM

2 T-22-0459 AYARE DARSHAN NARESH M 130590 11.858 Finance

4 TD-23-0502 BACHIM ATHARV MARUTI M 101004 1.389 Client Services

9 T-22-0091 DAMALE KEDAR PRAVIN M 139852 7.524 Business Development

11 T-22-0442 DHANE KARAN ASHOK M 102508 12.637 Legal

12 T-22-0108 DHURI ADITYA SHANKAR M 112807 17.492 Human Resources

17 T-22-0515 GOGAVALE SAHIL VINOD M 111737 6.414 Product

30 T-22-0451 KALVANKAR ROHIT PRADIP M 118780 9.096 Engineering

33 T-22-0500 KARANGUTKAR SOHAM GUNAVANT M 119082 16.180 Business Development

39 T-22-0463 LENDE KARTIK RAJESH M 122173 7.797 Client Services

44 T-21-0324 MORE YASH NITIN ** M 145146 7.482 Product

49 T-22-0272 PARSHARAM ADITYA ANAND M 113590 3.055 Sales

56 T-22-0167 RANE DARSHAN SANJAY M 130276 16.084 Finance

61 T-22-0514 SAWANT PRANAD VINAYAK ** M 106862 3.699 Business Development

Filtering rows in the DataFrame that contain at least one NaN value in any of their columns.

df[df.isnull().any(axis=1)]

REG NAME GENDER SALARY BONUS TEAM

1 T-22-0144 ARLEKAR PRATHAMESH MAHESH M 61933 4.170 NaN

5 T-22-0048 BHATKAR SAHIL VILAS NaN 115163 10.125 Legal

8 T-22-0085 DABHOLKAR PRACHI PRADIP F 95570 NaN Engineering

10 T-22-0010 DHAMASKAR REHAN RIYAZ M 63241 15.132 NaN

23 T-22-0171 JADHAV TANVI PANKAJ F 125792 5.042 NaN

27 T-21-0123 KADAM GAURI SANTOSH ** F 122367 NaN Legal

32 T-22-0227 KAPADI SANIYA RAFIQUE ** F 122340 6.417 NaN

37 T-22-0126 KHEDEKAR SUJAL SADANAND M 57427 NaN Client Services

calculating the median of the SALARY column in the DataFrame.

df['SALARY'].median()

95273.0

calculating the mean of the BONUS column in the DataFrame.

df['BONUS'].mean()

10.338061538461538

Calculating the mode of the Gender column in the DataFrame

df['GENDER'].mode()

GENDER

0 M

dtype: object

Filling the missing (NaN) values in specific columns of the DataFrame with appropriate values

# Assign the filled column back to df. Calling df[col].fillna(value, inplace=True)
# raises a FutureWarning in recent pandas because it acts on an intermediate copy.
df['GENDER'] = df['GENDER'].fillna(df['GENDER'].mode()[0])
df['BONUS'] = df['BONUS'].fillna(df['BONUS'].mean())
df['TEAM'] = df['TEAM'].fillna(df['TEAM'].mode()[0])

df.isnull().sum().sum()
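
The pandas warning for chained inplace calls suggests passing a column-to-value mapping to the DataFrame-level method instead. A minimal sketch of that form, using a hypothetical `sample` frame with invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column (values invented for illustration).
sample = pd.DataFrame({
    "GENDER": ["F", "M", np.nan, "M"],
    "BONUS": [4.0, np.nan, 8.0, 6.0],
    "TEAM": ["Finance", "Finance", "Legal", np.nan],
})

# One fill value per column: mode for categorical columns, mean for the numeric one.
fill_values = {
    "GENDER": sample["GENDER"].mode()[0],
    "BONUS": sample["BONUS"].mean(),
    "TEAM": sample["TEAM"].mode()[0],
}

filled = sample.fillna(fill_values)   # returns a new frame; sample keeps its gaps

print(filled)
print(int(filled.isnull().sum().sum()))
```

Whether mean or median is the better fill for BONUS depends on how skewed the column is; the mean is used here only because the cell above uses it.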

Filtering the DataFrame which will only contain rows where the SALARY column has values between 10,000
and 200,000, inclusive.

df[(df['SALARY'] >= 10000) & (df['SALARY'] <= 200000)]

REG NAME GENDER SALARY BONUS TEAM

0 T-22-0107 AMBERKAR KOMAL SURYAKANT F 97308 6.945 Marketing

1 T-22-0144 ARLEKAR PRATHAMESH MAHESH M 61933 4.170 Client Services

2 T-22-0459 AYARE DARSHAN NARESH M 130590 11.858 Finance

3 T-22-0140 AYARE SANIA NARENDRA F 138705 9.340 Finance

4 TD-23-0502 BACHIM ATHARV MARUTI M 101004 1.389 Client Services

... ... ... ... ... ... ...

62 T-22-0128 SAWANT SHUBHANKAR DATTARAM M 58112 19.414 Marketing

63 T-22-0519 SHAIKH FABIHA IMTIYAZ F 35203 18.040 Human Resources

64 T-22-0007 TALEKAR ARYAN RAJENDRA M 77834 18.771 Business Development

66 T-22-0026 THATTE GANDHAR NILESH M 125250 2.672 Business Development

67 T-22-0273 VASKAR RAHUL DIPAK M 51178 9.735 Finance

65 rows × 6 columns
Replacing values in the SALARY column that are more than 150,000 units away from the median salary with
the median salary itself.

# Assigning the result back to the column updates df. Chained inplace=True on
# df['SALARY'] raises a FutureWarning in recent pandas.
df['SALARY'] = df['SALARY'].where((df['SALARY'] - df['SALARY'].median()).abs() <= 150000, df['SALARY'].median())
df



REG NAME GENDER SALARY BONUS TEAM

0 T-22-0107 AMBERKAR KOMAL SURYAKANT F 97308 6.945 Marketing

1 T-22-0144 ARLEKAR PRATHAMESH MAHESH M 61933 4.170 Client Services

2 T-22-0459 AYARE DARSHAN NARESH M 130590 11.858 Finance

3 T-22-0140 AYARE SANIA NARENDRA F 138705 9.340 Finance

4 TD-23-0502 BACHIM ATHARV MARUTI M 101004 1.389 Client Services

... ... ... ... ... ... ...

63 T-22-0519 SHAIKH FABIHA IMTIYAZ F 35203 18.040 Human Resources

64 T-22-0007 TALEKAR ARYAN RAJENDRA M 77834 18.771 Business Development

65 T-22-0074 TARVE OMKAR SHANIL M 95273 12.428 Distribution

66 T-22-0026 THATTE GANDHAR NILESH M 125250 2.672 Business Development

67 T-22-0273 VASKAR RAHUL DIPAK M 51178 9.735 Finance

68 rows × 6 columns

Outliers are data points that differ significantly from the rest of the data, often seen as extreme
or unusual values.

Creating a new DataFrame that contains the rows where the SALARY value is either less than 10,000
or greater than 200,000. These rows are treated as outliers in the SALARY column based on the specified
salary range.

outliers=df[(df['SALARY']< 10000) | (df['SALARY'] > 200000)]


outliers
REG NAME GENDER SALARY BONUS TEAM

15 T-22-0247 GARJE MAYUR BALASAHEB M 10 1.256 Product

59 T-22-0517 SALVI SANIKA JITENDRA ** F 1175987 11.279 Engineering

65 T-22-0074 TARVE OMKAR SHANIL M 1012655 12.428 Distribution

Selecting the rows of the DataFrame where the SALARY value is more than 60,000 units away (in absolute
terms) from the median salary. These rows are treated as outliers based on the specified threshold.

outliers=df[(df['SALARY']- df['SALARY'].median()).abs() > 60000]


outliers

REG NAME GENDER SALARY BONUS TEAM

15 T-22-0247 GARJE MAYUR BALASAHEB M 10 1.256 Product

59 T-22-0517 SALVI SANIKA JITENDRA ** F 1175987 11.279 Engineering

63 T-22-0519 SHAIKH FABIHA IMTIYAZ F 35203 18.040 Human Resources

65 T-22-0074 TARVE OMKAR SHANIL M 1012655 12.428 Distribution
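
The 60,000 threshold above is chosen by hand. A common alternative, which this lab sheet does not use, is the 1.5 x IQR rule. The sketch below compares the two on a few SALARY values copied from the tables above; the names `by_distance` and `by_iqr` are illustrative only.

```python
import pandas as pd

# A handful of SALARY values taken from the outputs above (not the full column).
salary = pd.Series([10, 35203, 61933, 95570, 97308, 130590, 138705, 1012655, 1175987])

# Rule used in this experiment: fixed distance from the median.
by_distance = salary[(salary - salary.median()).abs() > 60000]

# Alternative (an assumption, not part of the lab sheet): 1.5 * IQR fences.
q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
by_iqr = salary[(salary < q1 - 1.5 * iqr) | (salary > q3 + 1.5 * iqr)]

print(by_distance.tolist())
print(by_iqr.tolist())
```

On this subset the two rules disagree about the low values, which shows that what counts as an outlier depends on the rule chosen, not only on the data.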

Sorting DataFrame by the values in the SALARY column in ascending order.

outliers.sort_values(by="SALARY")

REG NAME GENDER SALARY BONUS TEAM

15 T-22-0247 GARJE MAYUR BALASAHEB M 10 1.256 Product

63 T-22-0519 SHAIKH FABIHA IMTIYAZ F 35203 18.040 Human Resources

65 T-22-0074 TARVE OMKAR SHANIL M 1012655 12.428 Distribution

59 T-22-0517 SALVI SANIKA JITENDRA ** F 1175987 11.279 Engineering

Sorting DataFrame by the values in the SALARY column in descending order.

outliers.sort_values(by="SALARY",ascending=False)

REG NAME GENDER SALARY BONUS TEAM

59 T-22-0517 SALVI SANIKA JITENDRA ** F 1175987 11.279 Engineering

65 T-22-0074 TARVE OMKAR SHANIL M 1012655 12.428 Distribution

63 T-22-0519 SHAIKH FABIHA IMTIYAZ F 35203 18.040 Human Resources

15 T-22-0247 GARJE MAYUR BALASAHEB M 10 1.256 Product


Replacing the outlier rows in the SALARY column with the median salary

df.loc[(df['SALARY'] - df['SALARY'].median()).abs() > 60000, 'SALARY'] = df['SALARY'].median()


df

REG NAME GENDER SALARY BONUS TEAM

0 T-22-0107 AMBERKAR KOMAL SURYAKANT F 97308 6.945 Marketing

1 T-22-0144 ARLEKAR PRATHAMESH MAHESH M 61933 4.170 NaN

2 T-22-0459 AYARE DARSHAN NARESH M 130590 11.858 Finance

3 T-22-0140 AYARE SANIA NARENDRA F 138705 9.340 Finance

4 TD-23-0502 BACHIM ATHARV MARUTI M 101004 1.389 Client Services

... ... ... ... ... ... ...

63 T-22-0519 SHAIKH FABIHA IMTIYAZ F 95273 18.040 Human Resources

64 T-22-0007 TALEKAR ARYAN RAJENDRA M 77834 18.771 Business Development

65 T-22-0074 TARVE OMKAR SHANIL M 95273 12.428 Distribution

66 T-22-0026 THATTE GANDHAR NILESH M 125250 2.672 Business Development

67 T-22-0273 VASKAR RAHUL DIPAK M 51178 9.735 Finance

68 rows × 6 columns

Setting a specific column as the index for the DataFrame. This means that the column will no longer be a
regular column but will instead become the row labels (index). Without inplace=True, set_index returns a
new DataFrame and df itself is unchanged.

df.set_index('REG').head(10)

NAME GENDER SALARY BONUS TEAM

REG

T-22-0107 AMBERKAR KOMAL SURYAKANT F 97308 6.945 Marketing

T-22-0144 ARLEKAR PRATHAMESH MAHESH M 61933 4.170 NaN

T-22-0459 AYARE DARSHAN NARESH M 130590 11.858 Finance

T-22-0140 AYARE SANIA NARENDRA F 138705 9.340 Finance

TD-23-0502 BACHIM ATHARV MARUTI M 101004 1.389 Client Services

T-22-0048 BHATKAR SAHIL VILAS NaN 115163 10.125 Legal

T-22-0498 BHOMBAL ZIA AMIR F 65476 10.012 Product

T-22-0525 BHUJBAL YUKTA SADASHIV F 45906 11.598 Finance

T-22-0085 DABHOLKAR PRACHI PRADIP F 95570 NaN Engineering

T-22-0091 DAMALE KEDAR PRAVIN M 139852 7.524 Business Development


Setting the NAME column as the index of the DataFrame and modifying it in place. This means the NAME
column will no longer be part of the regular columns but will instead become the row labels (index).

df.set_index('NAME',inplace=True)
df

REG GENDER SALARY BONUS TEAM

NAME

AMBERKAR KOMAL SURYAKANT T-22-0107 F 97308 6.945 Marketing

ARLEKAR PRATHAMESH MAHESH T-22-0144 M 61933 4.170 NaN

AYARE DARSHAN NARESH T-22-0459 M 130590 11.858 Finance

AYARE SANIA NARENDRA T-22-0140 F 138705 9.340 Finance

BACHIM ATHARV MARUTI TD-23-0502 M 101004 1.389 Client Services

... ... ... ... ... ...

SHAIKH FABIHA IMTIYAZ T-22-0519 F 35203 18.040 Human Resources

TALEKAR ARYAN RAJENDRA T-22-0007 M 77834 18.771 Business Development

TARVE OMKAR SHANIL T-22-0074 M 1012655 12.428 Distribution

THATTE GANDHAR NILESH T-22-0026 M 125250 2.672 Business Development

VASKAR RAHUL DIPAK T-22-0273 M 51178 9.735 Finance

68 rows × 5 columns

1. Importing the numpy library, a powerful library for numerical computing in Python.

2. Importing the StandardScaler class from the sklearn.preprocessing module. It is a tool from the
scikit-learn library used for feature scaling.

3. Creating an instance of the StandardScaler class.

import numpy as np
from sklearn.preprocessing import StandardScaler
s=StandardScaler()

Performing feature scaling on the SALARY column of the DataFrame using the StandardScaler.

Standardization: each value x in the SALARY column is replaced by z = (x - mean) / standard deviation, so
the scaled column has a mean of 0 and a standard deviation of 1. The values are not confined to a fixed
range; an outlier still shows up as a large z value. Putting features on a common scale makes them easier
to work with for many machine learning algorithms.

Reshaping: reshape(-1, 1) is required because the scaler expects a 2D array (rows by features) as input.
df['SALARY'] = s.fit_transform(np.array(df['SALARY']).reshape(-1, 1))
df

REG GENDER SALARY BONUS TEAM

NAME

AMBERKAR KOMAL SURYAKANT T-22-0107 F -0.122425 6.945 Marketing

ARLEKAR PRATHAMESH MAHESH T-22-0144 M -0.326372 4.170 NaN

AYARE DARSHAN NARESH T-22-0459 M 0.069456 11.858 Finance

AYARE SANIA NARENDRA T-22-0140 F 0.116241 9.340 Finance

BACHIM ATHARV MARUTI TD-23-0502 M -0.101116 1.389 Client Services

... ... ... ... ... ...

SHAIKH FABIHA IMTIYAZ T-22-0519 F -0.480478 18.040 Human Resources

TALEKAR ARYAN RAJENDRA T-22-0007 M -0.234698 18.771 Business Development

TARVE OMKAR SHANIL T-22-0074 M 5.154813 12.428 Distribution
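
The same standardization can be reproduced by hand, which makes the formula explicit. This is a sketch using only NumPy on five SALARY values copied from the first rows above; StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np

# First five SALARY values from the table above (a subset, so the z values
# differ from the ones printed for the full column).
salary = np.array([97308, 61933, 130590, 138705, 101004], dtype=float)

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z = (salary - salary.mean()) / salary.std()

print(np.round(z, 3))
print(round(float(z.mean()), 6), round(float(z.std()), 6))
```

Note that pandas' Series.std() defaults to ddof=1, so standardizing with it would give slightly different values from StandardScaler.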



‭Assignment/Experiment No:‬ ‭02‬ ‭Date of checking (DOC) :‬

Title: Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and Seaborn
‭Marks:‬ ‭Teacher’s Signature:‬

‭1.‬‭Aim‬‭: To understand how to visualize and understand‬‭the data‬

‭2. Prerequisites‬‭:‬
‭1.‬ ‭Python programming, Basics of probability Theory‬

‭3. Hardware Requirements‬‭:‬


‭1.‬ ‭PC with minimum 2GB RAM‬

‭4. Software Requirements:‬


‭1.‬ ‭Windows / Linux OS.‬
‭2.‬ ‭Python 3.6 or higher‬

‭5. Learning Objectives:‬


‭1.‬ ‭To understand matplotlib and seaborn packages of Python‬
‭2.‬ ‭To understand various plots supported by these packages‬
‭3.‬ ‭To understand the importance of these plots in understanding of the data.‬

6. Learning Objectives Applicable: LO2
‭7. Program Outcomes Applicable: PO2, PO4, PO5‬
‭8. Program Education Objectives Applicable: PEO1‬


‭Viva Questions‬
1. What are matplotlib and seaborn packages?
‭2.‬ ‭What are different plots supported by those packages?‬
‭3.‬ ‭What is EDA?‬

Importing the pandas library and loading data from csv file into a dataframe

import pandas as pd
df= pd.read_csv('employees.csv')

Replacing missing values in the GENDER column with the most frequently occurring value (mode) in that column, replacing
missing values in the BONUS column with the mean (average) of the column, and replacing missing values in the TEAM column
with its mode, as for the GENDER column. Each result is assigned back to the column, because
df[col].fillna(value, inplace=True) raises a FutureWarning in recent pandas.

df['GENDER'] = df['GENDER'].fillna(df['GENDER'].mode()[0])
df['BONUS'] = df['BONUS'].fillna(df['BONUS'].mean())
df['TEAM'] = df['TEAM'].fillna(df['TEAM'].mode()[0])

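Identifying and handling outliers in the SALARY column by replacing extreme values with the median of the SALARY column.
The assignment goes through df.loc so that only SALARY is changed; assigning to df[mask] would overwrite every column of
the matching rows.

df.loc[(df['SALARY'] - df['SALARY'].median()).abs() > 60000, 'SALARY'] = df['SALARY'].median()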

Displaying the dataframe

df.head(10)

REG NAME GENDER SALARY BONUS TEAM

0 T-22-0107 AMBERKAR KOMAL SURYAKANT F 97308 6.945 Marketing

1 T-22-0144 ARLEKAR PRATHAMESH MAHESH M 61933 4.170 NaN

2 T-22-0459 AYARE DARSHAN NARESH M 130590 11.858 Finance

3 T-22-0140 AYARE SANIA NARENDRA F 138705 9.340 Finance

4 TD-23-0502 BACHIM ATHARV MARUTI M 101004 1.389 Client Services

5 T-22-0048 BHATKAR SAHIL VILAS NaN 115163 10.125 Legal

6 T-22-0498 BHOMBAL ZIA AMIR F 65476 10.012 Product

7 T-22-0525 BHUJBAL YUKTA SADASHIV F 45906 11.598 Finance

8 T-22-0085 DABHOLKAR PRACHI PRADIP F 95570 NaN Engineering

9 T-22-0091 DAMALE KEDAR PRAVIN M 139852 7.524 Business Development


Importing the pyplot module from the matplotlib library and assigning it the alias plt

import matplotlib.pyplot as plt

Creating a line plot of the SALARY column

plt.plot(df['SALARY'])
[<matplotlib.lines.Line2D at 0x7fb84bd3b370>]

Creating a vertical bar chart, where x determines the position of each bar on the X-axis and height specifies the height of
each bar (the corresponding Y-axis value).

plt.bar(x=df.index,height=df['SALARY'])

<BarContainer object of 68 artists>

Creating a box plot, a graphical representation of the distribution of the data based on five summary statistics (minimum,
first quartile, median, third quartile, and maximum). plt.grid() adds a grid to the plot for better readability.

plt.boxplot(df['SALARY'])
plt.grid()
Same as above, but the showmeans=True argument also marks the mean on the box plot

plt.boxplot(df['SALARY'],showmeans=True)
plt.grid()
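
The quantities a box plot draws can be computed directly, which helps when reading the figure. The sketch below uses invented values and assumes matplotlib's default whisker setting (whis=1.5), under which points beyond 1.5 x IQR from the box are drawn as individual outlier markers.

```python
import numpy as np

# Invented values with one obvious extreme point.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])   # box edges and centre line
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are what the plot shows as separate markers.
fliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(lower_fence, upper_fence, fliers.tolist())
```

The mean of these values is pulled far above the median by the single extreme point, which is the kind of gap showmeans=True makes visible.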

Creating a histogram of the BONUS column

plt.hist(df['BONUS'])
(array([8., 2., 6., 8., 6., 4., 9., 3., 6., 8.]),
array([ 1.256 , 3.0718, 4.8876, 6.7034, 8.5192, 10.335 , 12.1508,
13.9666, 15.7824, 17.5982, 19.414 ]),
<BarContainer object of 10 artists>)

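Plotting a normalized histogram: with density=True the bar heights are scaled so that the total area of the bars equals 1,
which makes the plot an estimate of the probability density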

plt.hist(df['BONUS'],density=True)

(array([0.07342953, 0.01835738, 0.05507214, 0.07342953, 0.05507214,


0.03671476, 0.08260822, 0.02753607, 0.05507214, 0.07342953]),
array([ 1.256 , 3.0718, 4.8876, 6.7034, 8.5192, 10.335 , 12.1508,
13.9666, 15.7824, 17.5982, 19.414 ]),
<BarContainer object of 10 artists>)
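
The relation between the raw counts and the density values can be checked with np.histogram, which returns the same kind of counts and bin edges as plt.hist. A minimal sketch on invented bonus values:

```python
import numpy as np

# Invented bonus values; bins=5 is an arbitrary choice for the illustration.
bonus = np.array([1.3, 2.5, 4.1, 6.9, 7.5, 9.3, 10.0, 11.9, 15.1, 19.4])

counts, edges = np.histogram(bonus, bins=5)
density, _ = np.histogram(bonus, bins=5, density=True)

widths = np.diff(edges)
area = float((density * widths).sum())   # total area under the normalized bars

print(counts, np.round(density, 4), round(area, 6))
```

Each density value equals count / (total count x bin width), so the heights themselves need not sum to 1; only the area does.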

Creating graphs without null values keeping outliers in the data

df.dropna(inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 0 to 67
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 REG 60 non-null object
1 NAME 60 non-null object
2 GENDER 60 non-null object
3 SALARY 60 non-null int64
4 BONUS 60 non-null float64
5 TEAM 60 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 3.3+ KB

Creating line plot of the SALARY column where both the outliers are clearly indicated

import matplotlib.pyplot as plt


plt.plot(df['SALARY'])

[<matplotlib.lines.Line2D at 0x7b4367afe810>]

Creating vertical bar plot of the SALARY column where both the outliers are clearly indicated

plt.bar(x=df.index,height=df['SALARY'])
<BarContainer object of 60 artists>

Creating boxplot of the SALARY column where both the outliers are clearly indicated

plt.boxplot(df['SALARY'],showmeans=True)
plt.grid()

Importing the Seaborn library for creating statistical and visually appealing plots.

import seaborn as sns

Creating a vertical box plot to visualize the distribution of the "BONUS" column in the DataFrame df, highlighting its median,
quartiles, and outliers.
sns.boxplot(y="BONUS",data=df)

<Axes: ylabel='BONUS'>

Creating a scatter plot with a regression line to visualize the relationship between "SALARY" (x-axis) and "BONUS" (y-axis) in
the DataFrame df.

sns.regplot(x="SALARY",y="BONUS",data=df)

<Axes: xlabel='SALARY', ylabel='BONUS'>

Creating a bar plot showing the "BONUS" values (y-axis) for each index in the DataFrame df (x-axis), displaying the mean of
"BONUS" by default with error bars.

sns.barplot(x=df.index,y="BONUS",data=df)
<Axes: xlabel='None', ylabel='BONUS'>

Creating a count plot to display the frequency (count) of each unique value in the "GENDER" column of the DataFrame df .

sns.countplot(x="GENDER",data=df)

<Axes: xlabel='GENDER', ylabel='count'>

Creating a grouped count plot to show the frequency of each "GENDER" category, further grouped by the "TEAM" column, in the
DataFrame df.

sns.countplot(x="GENDER", hue="TEAM",data=df)
<Axes: xlabel='GENDER', ylabel='count'>

Creating a horizontal count plot to display the frequency (count) of each unique value in the "TEAM" column of the DataFrame
df.

sns.countplot(y="TEAM",data=df)

<Axes: xlabel='count', ylabel='TEAM'>

Creating a horizontal count plot that shows the frequency of each "TEAM" category, further segmented by "GENDER," in the
DataFrame df .

sns.countplot(y="TEAM",hue="GENDER",data=df)
<Axes: xlabel='count', ylabel='TEAM'>

Creating a horizontal count plot showing the frequency of each "TEAM" category, segmented by "GENDER" with custom colors
(tomato red and bright red) for each gender, using the specified color palette ["#FF6347", "#FF0001"] .

sns.countplot(y="TEAM", hue="GENDER", data=df, palette=["#FF6347", "#FF0001"])

<Axes: xlabel='count', ylabel='TEAM'>

Creating a vertical box plot to visualize the distribution of "BONUS" values, grouped by "GENDER" in the DataFrame df , with the
mean values shown for each group.

sns.boxplot(y="BONUS",data=df,hue="GENDER",showmeans=True)
<Axes: ylabel='BONUS'>

Creating a cross-tabulation (contingency table) that shows the count of occurrences for each unique value in the "TEAM"
column of the DataFrame df . The result is a summary of the frequency of each team.

pd.crosstab(index=df['TEAM'],columns="count")

col_0 count

TEAM

Business Development 9

Client Services 9

Distribution 3

Engineering 5

Finance 8

Human Resources 5

Legal 4

Marketing 5

Product 7

Sales 5

Creating a cross-tabulation (contingency table) that shows the frequency distribution of "GENDER" values for each unique
"TEAM" in the DataFrame df , displaying how many occurrences of each gender are present in each team.

pd.crosstab(index=df['TEAM'],columns=df["GENDER"])
GENDER F M

TEAM

Business Development 2 7

Client Services 5 4

Distribution 1 2

Engineering 3 2

Finance 4 4

Human Resources 1 4

Legal 2 2

Marketing 4 1

Product 3 4

Sales 2 3

Creating a cross-tabulation (contingency table) that shows the normalized (relative frequency) count of occurrences for each
unique value in the "TEAM" column of the DataFrame df, with the results expressed as proportions (the proportions sum to 1).

pd.crosstab(index=df['TEAM'],columns="count",normalize=True)

col_0 count


TEAM

Business Development 0.150000

Client Services 0.150000

Distribution 0.050000

Engineering 0.083333

Finance 0.133333

Human Resources 0.083333

Legal 0.066667

Marketing 0.083333

Product 0.116667

Sales 0.083333

Creating a cross-tabulation (contingency table) that shows the normalized (relative frequency) distribution of "TEAM" and
"GENDER" combinations in the DataFrame df. With normalize=True each cell is a proportion of the grand total, so all cells
together sum to 1; normalize='index' would instead make each row sum to 1.

pd.crosstab(index=df['TEAM'],columns=df["GENDER"],normalize=True)

GENDER F M

TEAM

Business Development 0.033333 0.116667

Client Services 0.083333 0.066667

Distribution 0.016667 0.033333

Engineering 0.050000 0.033333

Finance 0.066667 0.066667

Human Resources 0.016667 0.066667

Legal 0.033333 0.033333

Marketing 0.066667 0.016667

Product 0.050000 0.066667

Sales 0.033333 0.050000
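
The three normalization modes of pd.crosstab are easy to confuse, so a small comparison may help. The frame below is invented for illustration:

```python
import pandas as pd

# Hypothetical four-row frame (values invented for illustration).
sample = pd.DataFrame({
    "TEAM": ["Finance", "Finance", "Finance", "Legal"],
    "GENDER": ["F", "M", "M", "F"],
})

overall = pd.crosstab(sample["TEAM"], sample["GENDER"], normalize=True)      # cells sum to 1
by_row = pd.crosstab(sample["TEAM"], sample["GENDER"], normalize="index")    # each row sums to 1
by_col = pd.crosstab(sample["TEAM"], sample["GENDER"], normalize="columns")  # each column sums to 1

print(overall)
print(by_row)
print(by_col)
```

To answer "what share of each team is female", the row-normalized table is the one to read; the overall table answers "what share of all employees are women in this team".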



‭Assignment/Experiment No:‬ ‭03‬ ‭Date of checking (DOC) :‬

‭Title:‬‭Data Modeling‬

‭Marks:‬ ‭Teacher’s Signature:‬

‭1.‬‭Aim‬‭: To understand how to split given data into‬‭a training and testing set, and validate it.‬

‭2. Prerequisites‬‭:‬
‭1.‬ ‭Python programming, Basics of probability Theory‬

‭3. Hardware Requirements‬‭:‬


‭1.‬ ‭PC with minimum 2GB RAM‬

‭4. Software Requirements:‬


‭1.‬ ‭Windows / Linux OS.‬
‭2.‬ ‭Python 3.6 or higher‬

‭5. Learning Objectives:‬


‭1.‬ ‭To understand train_test_split() method‬
‭2.‬ ‭To understand how to validate the data using a two-sample Z-test.‬

6. Learning Objectives Applicable: LO2, LO3
‭7. Program Outcomes Applicable: PO2, PO3, PSO1‬
‭8. Program Education Objectives Applicable: PEO2‬


‭Viva Questions‬
1. What are packages that support functionality to split the data into two sets?
‭2.‬ ‭What do you mean by data validation?‬
‭3.‬ ‭What is a two-sample Z-test?‬
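
The idea behind this experiment can be sketched with the standard library alone. The learning objectives name train_test_split() (from scikit-learn); the shuffle-and-slice split below is a stand-in for it, not that function, and the data is synthetic. The helper `two_sample_z` is a hypothetical name. The Z-test assumes samples large enough for the normal approximation to hold.

```python
import math
import random
import statistics


def two_sample_z(sample_a, sample_b):
    """Return (z, two-sided p-value) for the difference between two sample means."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    z = (mean_a - mean_b) / standard_error
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


rng = random.Random(0)
bonus = [rng.gauss(10.0, 5.0) for _ in range(200)]   # synthetic data, not employees.csv

# Stand-in for a 75/25 train/test split: shuffle a copy, then slice.
shuffled = bonus[:]
rng.shuffle(shuffled)
cut = int(0.75 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

z, p_value = two_sample_z(train, test)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

A large p-value gives no evidence that the training and testing sets differ in their mean, which is what a random split should produce; a very small one would suggest the split is unrepresentative for that variable.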
