Data Wrangling, 2

lab experiment data science and big data analytics

Uploaded by

yashisolanki02
Copyright
© All Rights Reserved
# Create an "Academic performance" dataset of students and perform the following operations using
# Python.
# Scan all variables for missing values and inconsistencies. If there are missing values or
# inconsistencies, use any of the suitable techniques to deal with them.
import pandas as pd
import opendatasets as od
import matplotlib.pylab as plt
import numpy as np

od.download("https://www.kaggle.com/datasets/sankha1998/student-semester-result")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:
Your Kaggle Key:
Downloading student-semester-result.zip to .\student-semester-result
100%|██████████| 2.41k/2.41k [00:00<00:00, 413kB/s]

df = pd.read_csv("student-semester-result/data.csv")
print(df)

      1st   2nd   3rd   4th   5th  College Code  Gender     Roll  Roll no.  \
0    8.11  7.68  7.11  7.43  8.18           115  Female      NaN   17020.0
1    6.48  5.98  4.15  4.29  4.96           115    Male      NaN   17021.0
2    8.41  8.24  7.52  8.25  7.75           115  Female      NaN    1702.0
3    7.33  6.83  6.33  6.79  6.89           115    Male      NaN   17023.0
4    7.89  7.34  7.22  7.32  7.46           115    Male      NaN   17024.0
..    ...   ...   ...   ...   ...           ...     ...      ...       ...
173  7.48  7.55  7.67  7.39  8.65           241       F  17048.0       NaN
174  7.30  6.41  6.59  7.11  7.38           241       M  17049.0       NaN
175  6.30  6.28  5.89  5.71  6.50           241       M  17050.0       NaN
176  7.04  7.10  6.81  7.00  6.92           241       M  17051.0       NaN
177  6.70  6.81  6.52  5.39  7.00           241       M  17052.0       NaN

     Subject Code
0              16
1              16
2              16
3              16
4              16
..            ...
173            28
174            28
175            28
176            28
177            28

[178 rows x 10 columns]

# Scan all variables for missing values and inconsistencies. If there are missing values or
# inconsistencies, use any of the suitable techniques to deal with them.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   1st           176 non-null    float64
 1   2nd           174 non-null    float64
 2   3rd           176 non-null    float64
 3   4th           173 non-null    float64
 4   5th           172 non-null    float64
 5   College Code  178 non-null    int64
 6   Gender        177 non-null    object
 7   Roll          132 non-null    float64
 8   Roll no.      46 non-null     float64
 9   Subject Code  178 non-null    int64
dtypes: float64(7), int64(2), object(1)
memory usage: 14.0+ KB

df.isnull().sum()

1st               2
2nd               4
3rd               2
4th               5
5th               6
College Code      0
Gender            1
Roll             46
Roll no.        132
Subject Code      0
dtype: int64

# Calculate the mean value for all subject columns (NaN entries are skipped by mean()).
avg_1st_Marks = df["1st"].astype("float64").mean(axis = 0)
avg_2nd_Marks = df["2nd"].astype("float64").mean(axis = 0)
avg_3rd_Marks = df["3rd"].astype("float64").mean(axis = 0)
avg_4th_Marks = df["4th"].astype("float64").mean(axis = 0)
avg_5th_Marks = df["5th"].astype("float64").mean(axis = 0)
print("Average marks of 1st Paper:", avg_1st_Marks)
print("Average marks of 2nd Paper:", avg_2nd_Marks)
print("Average marks of 3rd Paper:", avg_3rd_Marks)
print("Average marks of 4th Paper:", avg_4th_Marks)
print("Average marks of 5th Paper:", avg_5th_Marks)

Average marks of 1st Paper: 7.038863636363637
Average marks of 2nd Paper: 6.943390804597701
Average marks of 3rd Paper: 6.6225
Average marks of 4th Paper: 7.027745664739886
Average marks of 5th Paper: 7.432558139534884

# Replace NaN by the column mean in the "1st" to "5th" columns.
# Assigning the result back avoids relying on an in-place change to a column selection.
df["1st"] = df["1st"].replace(np.nan, avg_1st_Marks)
df["2nd"] = df["2nd"].replace(np.nan, avg_2nd_Marks)
df["3rd"] = df["3rd"].replace(np.nan, avg_3rd_Marks)
df["4th"] = df["4th"].replace(np.nan, avg_4th_Marks)
df["5th"] = df["5th"].replace(np.nan, avg_5th_Marks)

df.isnull().sum()

1st               0
2nd               0
3rd               0
4th               0
5th               0
College Code      0
Gender            1
Roll             46
Roll no.        132
Subject Code      0
dtype: int64

# Apply data transformations on at least one of the variables. The purpose of this
# transformation should be one of the following reasons:
# to change the scale for better understanding of the variable,
# to convert a non-linear relation into a linear one, or
# to decrease the skewness and convert the distribution into a normal distribution.
max_1st = df['1st'].max()
max_2nd = df['2nd'].max()
max_3rd = df['3rd'].max()
max_4th = df['4th'].max()
max_5th = df['5th'].max()
print(max_1st, max_2nd, max_3rd, max_4th, max_5th)

9.15 9.21 9.59 9.31 9.46

# Express each semester score as a percentage of the highest score in that column.
cgpa_columns = ['1st', '2nd', '3rd', '4th', '5th']
max_values = [max_1st, max_2nd, max_3rd, max_4th, max_5th]
for col, max_value in zip(cgpa_columns, max_values):
    df[col + '_Percentage'] = (df[col] / max_value) * 100
print(df)

      1st   2nd   3rd   4th   5th  College Code  Gender     Roll  Roll no.  \
0    8.11  7.68  7.11  7.43  8.18           115  Female      NaN   17020.0
1    6.48  5.98  4.15  4.29  4.96           115    Male      NaN   17021.0
2    8.41  8.24  7.52  8.25  7.75           115  Female      NaN    1702.0
3    7.33  6.83  6.33  6.79  6.89           115    Male      NaN   17023.0
4    7.89  7.34  7.22  7.32  7.46           115    Male      NaN   17024.0
..    ...   ...   ...   ...   ...           ...     ...      ...       ...
173  7.48  7.55  7.67  7.39  8.65           241       F  17048.0       NaN
174  7.30  6.41  6.59  7.11  7.38           241       M  17049.0       NaN
175  6.30  6.28  5.89  5.71  6.50           241       M  17050.0       NaN
176  7.04  7.10  6.81  7.00  6.92           241       M  17051.0       NaN
177  6.70  6.81  6.52  5.39  7.00           241       M  17052.0       NaN

     Subject Code  1st_Percentage  2nd_Percentage  3rd_Percentage  \
0              16       88.633880       83.387622       74.139729
1              16       70.819672       64.929425       43.274244
2              16       91.912568       89.467970       78.415016
3              16       80.109290       74.158523       66.006257
4              16       86.229508       79.695983       75.286757
..            ...             ...             ...             ...
173            28       81.748634       81.976113       79.979145
174            28       79.781421       69.598263       68.717414
175            28       68.852459       68.186754       61.418144
176            28       76.939891       77.090119       71.011470
177            28       73.224044       73.941368       67.987487

     4th_Percentage  5th_Percentage
0         79.806660       86.469345
1         46.079484       52.431290
2         88.614393       81.923890
3         72.932331       72.832981
4         78.625134       78.858351
..              ...             ...
173       79.377014       91.437632
174       76.369495       78.012685
175       61.331901       68.710359
176       75.187970       73.150106
177       57.894737       73.995772

[178 rows x 15 columns]
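The mean imputation and the scaling to a percentage of the column maximum can also be written without one variable per column. The following is only a minimal sketch of that idea, assuming a hypothetical three-row frame named toy with made-up scores (it is not the Kaggle dataset) and a hypothetical list named score_columns; it relies on the pandas methods fillna, div, add_suffix and join.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the semester-result data (values are made up).
toy = pd.DataFrame({
    "1st": [8.0, np.nan, 6.0],
    "2nd": [7.0, 5.0, np.nan],
})
score_columns = ["1st", "2nd"]  # hypothetical name for the columns to process

# Impute each column's missing values with that column's own mean.
# DataFrame.mean() skips NaN by default, and fillna() matches the means by column label.
toy[score_columns] = toy[score_columns].fillna(toy[score_columns].mean())

# Rescale every score to a percentage of the maximum of its column.
percentages = toy[score_columns].div(toy[score_columns].max()).mul(100)

# Attach the rescaled columns next to the originals, e.g. "1st_Percentage".
toy = toy.join(percentages.add_suffix("_Percentage"))

print(toy)
```

Working on a list of column names keeps the imputation and the scaling in step when columns are added or removed, at the cost of hiding the per-column means and maxima that the notebook prints explicitly.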
