Department of AIML
PAI Practical File
NAME: Shivam
BRANCH: CSE(AI-ML)
SEM: 6TH
ROLL NO: 23242
Shivam (23242)
Department of CSE AIML
Certificate
Certified that this practical file entitled “Big Data Lab”, submitted by Shivam (23242), a student
of the Computer Science & Engineering Department, Dronacharya College of
Engineering, Gurgaon, in partial fulfillment of the requirements for the award of the
Bachelor of Technology (Branch) degree of MDU, Rohtak, is a record of the student's own
study carried out under my supervision & guidance.
Sr. No.   Practical Name   Signature
1. Introduction of various Python libraries used for machine learning.
2. Write a program to perform data pre-processing techniques for effective machine learning.
3. Write a program to apply different feature encoding schemes on the given dataset.
4. Write a program to apply filter feature selection techniques.
5.
6.
7.
8.
9.
10.
PROGRAM 1: Introduction of various Python libraries used for machine learning.
Code:
[1]: import pandas as pd
import numpy as np
[2]: # reading data
data=pd.read_csv("data.csv")
[3]: data
[3]: Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
[4]: student_data = {"Name":['Prateek','Ronak','Geetanshu','Naman','Ankit'],
"exam_no":[18,25,45,34,36],
"Result":['pass','fail','pass','pass','fail']}
df = pd.DataFrame(student_data)
df
[4] : Name exam_no Result
0 Prateek 18 pass
1 Ronak 25 fail
2 Geetanshu 45 pass
3 Naman 34 pass
4 Ankit 36 fail
[6]: # access data with the help of label
df.loc[2,['Name']]
[6]: Name    Geetanshu
Name: 2, dtype: object
[7]: df.iloc[2,0]
[7] : 'Geetanshu'
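The difference between label-based `loc` and position-based `iloc` is easier to see when the index labels are not the default 0, 1, 2, ... A minimal sketch, reusing the student data from above; the string labels `s1` to `s5` and the name `df_labeled` are hypothetical and added only for illustration.

```python
import pandas as pd

student_data = {
    "Name": ['Prateek', 'Ronak', 'Geetanshu', 'Naman', 'Ankit'],
    "exam_no": [18, 25, 45, 34, 36],
    "Result": ['pass', 'fail', 'pass', 'pass', 'fail'],
}

# Hypothetical string labels, not part of the original data
df_labeled = pd.DataFrame(student_data, index=['s1', 's2', 's3', 's4', 's5'])

# loc selects by label: row 's3', column 'Name'
by_label = df_labeled.loc['s3', 'Name']

# iloc selects by integer position: third row, first column
by_position = df_labeled.iloc[2, 0]

print(by_label, by_position)
```

Both lookups reach the same cell; only the way the row and column are addressed differs.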
PROGRAM 2: Write a program to perform data pre-processing techniques for effective machine learning.
[1]:# import pandas
import pandas as pd
[47]:#read csv file
df=pd.read_csv('data.csv')
[30]:# print first 5 elements
df.head()
[30]: Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
[6]:# import numpy
import numpy as np
[7]:# import StringIO
from io import StringIO
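`StringIO` wraps a string in a file-like object, so `pd.read_csv` can parse CSV text held in memory. A minimal sketch, assuming the contents of `data.csv` match the table printed in this file (the CSV itself is not included); `csv_text` and `df_demo` are illustrative names.

```python
import pandas as pd
from io import StringIO

# CSV text rebuilt from the printed table; empty fields are read as NaN
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# read_csv accepts any file-like object, including StringIO
df_demo = pd.read_csv(StringIO(csv_text))

# Count of missing values per column
print(df_demo.isnull().sum())
```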
[31]:# check for the null value
df.isnull()
[31]: Country Age Salary Purchased
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False True False
5 False False False False
6 False True False False
7 False False False False
8 False False False False
9 False False False False
[59]: # assign 10 in place of null value
# (assigning the result back avoids the chained inplace call, which newer pandas versions warn about)
df["Age"] = df["Age"].fillna(10)
df["Salary"] = df["Salary"].fillna(10)
[60]: # print updated dataset
df
[60]: Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 10.0 Yes
5 France 35.0 58000.0 Yes
6 Spain 10.0 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
[34]: # check for null values after the update
df.isnull().sum()
[34]: Country 0
Age 0
Salary 0
Purchased 0
dtype: int64
[35]: # import SimpleImputer from sklearn
from sklearn.impute import SimpleImputer
[36]: # set model attributes
imr = SimpleImputer(strategy="constant",fill_value= 10 )
[37]: # Fit the data into the model
imr = imr.fit(df.values)
[54]: imputed_data = imr.transform(df.values)
[55]: # print data after transformation
imputed_data
[55]: array([['France', 44.0, 72000.0, 'No'],
['Spain', 27.0, 48000.0, 'Yes'],
['Germany', 30.0, 54000.0, 'No'],
['Spain', 38.0, 61000.0, 'No'],
['Germany', 40.0, 10, 'Yes'],
['France', 35.0, 58000.0, 'Yes'],
['Spain', 10, 52000.0, 'No'],
['France', 48.0, 79000.0, 'Yes'],
['Germany', 50.0, 83000.0, 'No'],
['France', 37.0, 67000.0, 'Yes']], dtype=object)
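Filling with a constant such as 10 leaves values far outside the range of the real ages and salaries. As an alternative (not the approach used above), the following minimal sketch imputes each numeric column with its own mean via `SimpleImputer(strategy="mean")`. The names `num`, `mean_imputer` and `num_filled` are illustrative; the values are taken from the dataset shown earlier.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numeric columns only; np.nan marks the two missing entries
num = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# strategy="mean" replaces each NaN with the mean of its own column
mean_imputer = SimpleImputer(strategy="mean")
num_filled = pd.DataFrame(mean_imputer.fit_transform(num), columns=num.columns)

print(num_filled)
```

Mean imputation only applies to numeric columns, which is why `Country` and `Purchased` are left out of this sketch.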
PROGRAM 3: Write a program to apply different feature encoding schemes on the given dataset.
[57]: # summary statistics of the numeric columns
df.describe()
[57]: Age Salary
count 9.000000 9.000000
mean 38.777778 63777.777778
std 7.693793 12265.579662
min 27.000000 48000.000000
25% 35.000000 54000.000000
50% 38.000000 61000.000000
75% 44.000000 72000.000000
max 50.000000 83000.000000
[42]: # import and apply LabelEncoder to the data
from sklearn.preprocessing import LabelEncoder
# work on a copy so the original df keeps its string labels
df_le = df.copy()
class_le = LabelEncoder()
df_le['Country'] = class_le.fit_transform(df_le['Country'].values)
df_le
[42]: Country Age Salary Purchased
0 0 44.0 72000.0 No
1 2 27.0 48000.0 Yes
2 1 30.0 54000.0 No
3 2 38.0 61000.0 No
4 1 40.0 10.0 Yes
5 0 35.0 58000.0 Yes
6 2 10.0 52000.0 No
7 0 48.0 79000.0 Yes
8 1 50.0 83000.0 No
9 0 37.0 67000.0 Yes
[48]: df
[48]: Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
[61]: df_new=pd.get_dummies(df)
[62]: df_new
[62]: Age Salary Country_France Country_Germany Country_Spain Purchased_No Purchased_Yes
0 44.0 72000.0 1 0 0 1 0
1 27.0 48000.0 0 0 1 0 1
2 30.0 54000.0 0 1 0 1 0
3 38.0 61000.0 0 0 1 1 0
4 40.0 10.0 0 1 0 0 1
5 35.0 58000.0 1 0 0 0 1
6 10.0 52000.0 0 0 1 1 0
7 48.0 79000.0 1 0 0 0 1
8 50.0 83000.0 0 1 0 1 0
9 37.0 67000.0 1 0 0 0 1
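Calling `pd.get_dummies` on the whole frame also one-hot encodes the `Purchased` column. A minimal sketch of restricting the encoding to the feature column with the `columns` argument, and of dropping one redundant dummy with `drop_first`; the small frame `demo` is an illustrative subset of the rows above, and `dtype=int` is passed because recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Illustrative subset of the dataset shown above
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# Encode only the Country column; Purchased is left untouched
encoded = pd.get_dummies(demo, columns=["Country"], dtype=int)

# drop_first=True removes the first dummy (Country_France here),
# since it is implied when the remaining dummies are all 0
encoded_reduced = pd.get_dummies(
    demo, columns=["Country"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
print(encoded_reduced.columns.tolist())
```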
[63]: df_le['Country']
[63]: 0 0
1 2
2 1
3 2
4 1
5 0
6 2
7 0
8 1
9 0
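The integer codes above follow the sorted order of the labels (France 0, Germany 1, Spain 2). A minimal sketch of how the fitted encoder exposes that mapping through `classes_` and how `inverse_transform` recovers the original labels; the list `countries` and the names `le`, `mapping` and `restored` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# First five Country values from the dataset
countries = ["France", "Spain", "Germany", "Spain", "Germany"]

le = LabelEncoder()
codes = le.fit_transform(countries)

# classes_ holds the labels in sorted order; the position is the code
mapping = {label: code for code, label in enumerate(le.classes_)}

# inverse_transform maps integer codes back to the original labels
restored = le.inverse_transform(codes)

print(mapping)
print(list(restored))
```

Because the codes are ordered integers, a model may read them as a ranking of countries, which is one reason one-hot encoding is often preferred for nominal features.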
PROGRAM 4: Write a program to apply filter feature selection techniques.
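The code for this program is not included in the file. As an illustration only, here is a minimal sketch of one filter technique: each feature is scored by its absolute Pearson correlation with the target, independently of any model, and the top-scoring features are kept. The choice of correlation as the score, the cut-off `k = 1`, and all variable names are assumptions; the rows are the complete (no missing value) rows of the dataset used in the earlier programs.

```python
import pandas as pd

# Rows of the earlier dataset that have no missing values
data = pd.DataFrame({
    "Age": [44, 27, 30, 38, 35, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 58000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes"],
})

# Encode the target as 0/1 so a correlation can be computed
target = data["Purchased"].map({"No": 0, "Yes": 1})
features = data[["Age", "Salary"]]

# Filter step: score each feature on its own, without training a model
scores = features.corrwith(target).abs().sort_values(ascending=False)

# Hypothetical cut-off: keep the single highest-scoring feature
k = 1
selected = scores.index[:k].tolist()

print(scores)
print(selected)
```

With only eight rows the scores are not statistically meaningful; the sketch only shows the mechanics of scoring and selecting.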