
Name: Keerthi Uppalapati

Reg No: 20BEC0174

ECE3502
IoT Domain Analyst Lab
TASK-1

FACULTY: Biswajit Dwivedy


SLOT: L5+L6
Aim of the Experiment:
The aim of this experiment is to perform exploratory data analysis (EDA) and pre-processing using Python.

Name of the simulation platform:
Python, using the numpy, pandas and seaborn modules.

Theory:
Exploratory Data Analysis (EDA) is a process within data analysis used to gain a better understanding of aspects of the data, such as:
1. the main features of the data
2. the variables and the relationships that hold between them
3. which variables are important for our problem

We shall look at several exploratory data analysis methods:

1. Descriptive statistics, which give a brief overview of the dataset we are dealing with, including summary measures and features of the sample.
2. Grouping data (basic grouping with groupby; see the sketch after this list).
3. ANOVA (Analysis of Variance), a computational method that divides the variation in a set of observations into different components (also covered in the sketch below).
4. Correlation and correlation methods.
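Grouping and ANOVA are listed above but not exercised in the programs that follow, so a minimal sketch of both is given here, assuming the same StudentsPerformance.csv dataset; the local file path and the use of scipy.stats.f_oneway are assumptions, not part of the original task.

# Sketch: grouping with groupby and a one-way ANOVA (assumed additions)
import pandas as pd
from scipy import stats  # assumed dependency for the ANOVA test
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')  # assumed local path
# Grouping: mean math score for each lunch category
print(df.groupby('lunch')['math score'].mean())
# One-way ANOVA: does the mean math score differ between the lunch groups?
groups = [g['math score'].values for _, g in df.groupby('lunch')]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")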
Algorithm/Flowchart:

(a) Perform the following operations on the dataset: Exploratory Data Analysis / Pre-processing using Python
https://www.kaggle.com/spscientist/students-performance-in-exams?select=StudentsPerformance.csv
(i) First and Last five rows
Program:

#20BEC0174 - Keerthi Uppalapati


import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')
print(df.head())
print(df.tail())
Output:

(ii) Size of the dataset

Program:

#20BEC0174 - Keerthi Uppalapati


import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')
print(f"size of csv file:{df.shape}")
Output:

(iii) Describe the dataset

Program:

#20BEC0174 - Keerthi Uppalapati


import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')
df.info()  # df.info() prints its summary directly and returns None, so it is called on its own
print(df.describe())
Output:

(iv) Function of “nunique”

Program:

#20BEC0174 - Keerthi Uppalapati


import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')
print(df.nunique())
Output:

(v) Correcting the dataset: removing columns, outlier detection, null value detection, etc.

Program:

#20BEC0174 - Keerthi Uppalapati


import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')
print(f"The number of null values in each column is:\n{df.isnull().sum()}")
print(f"\nThe dataset with the 'gender' column removed:\n{df.drop('gender', axis=1)}")
# Outlier detection on the math score using the z-score method
math2 = np.array(df["math score"])
outliers = []
mean = np.mean(math2)
std = np.std(math2)
for i in math2:
    zscore = (i - mean) / std
    if np.abs(zscore) > 3:
        outliers.append(i)
print(f"\nThe number of outliers is {len(outliers)}")
print(outliers)

Output:
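The program above only detects nulls and outliers; a minimal sketch of actually applying the corrections is given below. Dropping null rows and filtering by the same z-score threshold of 3 are assumptions, not steps from the original program.

# Sketch: applying the corrections (assumed follow-up to the detection above)
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')  # assumed local path
df = df.drop('gender', axis=1)  # remove the unwanted column
df = df.dropna()  # drop any rows containing null values
z = np.abs((df['math score'] - df['math score'].mean()) / df['math score'].std())
df = df[z <= 3]  # keep only rows within 3 standard deviations of the mean
print(df.shape)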

(vi) Data Normalization on any two columns

Program:

#20BEC0174 - Keerthi Uppalapati


import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')
df_max_scaled = df.copy()
# Maximum-absolute scaling: divide each column by its maximum absolute value
column = 'math score'
df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
column1 = 'reading score'
df_max_scaled[column1] = df_max_scaled[column1] / df_max_scaled[column1].abs().max()
print(df_max_scaled[['math score', 'reading score']])
Output:
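Maximum-absolute scaling is one way to normalize; a minimal sketch of min-max normalization on the same two columns is given below as an alternative. The (x - min) / (max - min) formula is the standard min-max rule and is not part of the original program.

# Sketch: min-max normalization as an alternative (assumed addition)
import pandas as pd
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')  # assumed local path
for col in ['math score', 'reading score']:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
print(df[['math score', 'reading score']])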

(vii) Correlation between different variables

Program:

#20BEC0174 - Keerthi Uppalapati


import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')
# numeric_only=True restricts the correlation to the numeric score columns (required in newer pandas)
correlation = df.corr(numeric_only=True)
print(correlation)
Output:

(viii) Heatmap to represent correlation


Program:
#20BEC0174 - Keerthi Uppalapati
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')
correlation = df.corr(numeric_only=True)
print(correlation)
sns.heatmap(correlation, xticklabels=correlation.columns, yticklabels=correlation.columns, annot=True)
plt.show()
Output:

(ix) Use of relplot, pairplot, distplot.


Program:
#20BEC0174 - Keerthi Uppalapati
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Users/keert/Downloads/StudentsPerformance.csv')
sns.pairplot(df)
sns.relplot(x='math score', y='reading score', hue='lunch', data=df)
sns.distplot(df['writing score'])  # distplot is deprecated in newer seaborn; histplot/displot is its replacement
plt.show()
Output:
(b) Binarization (Dataset: pima-indians-diabetes.csv)

Program:
#20BEC0174 - Keerthi Uppalapati
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
df = pd.read_csv('C:/Users/keert/Downloads/pima-indians-diabetes.csv')
df2 = df.copy()
# Binarization: glucose values above the threshold become 1, the rest become 0
threshold = 90
df2['Glucose'] = (df['Glucose'] > threshold).astype(int)
print(df2['Glucose'])

Output:
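The program imports sklearn's preprocessing module but performs the binarization manually; a minimal sketch using preprocessing's Binarizer is given below. The threshold of 90 is taken from the program above; everything else is an assumed alternative, not the original method.

# Sketch: binarization via sklearn's Binarizer (assumed alternative to the manual comparison)
import pandas as pd
from sklearn.preprocessing import Binarizer
df = pd.read_csv('C:/Users/keert/Downloads/pima-indians-diabetes.csv')  # assumed local path
binarizer = Binarizer(threshold=90)
df['Glucose_bin'] = binarizer.fit_transform(df[['Glucose']]).ravel()
print(df[['Glucose', 'Glucose_bin']])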
(c) Standardization (Dataset: pima-indians-diabetes.csv)

Program:
#20BEC0174 - Keerthi Uppalapati
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
df = pd.read_csv('C:/Users/keert/Downloads/pima-indians-diabetes.csv')
df2 = df.copy()
m = df['Glucose'].mean()
s = df['Glucose'].std()
# Standardization: df['column'] = (df['column'] - df['column'].mean()) / df['column'].std()
df2['Glucose'] = (df['Glucose'] - m) / s
print(df['Glucose'])
print(df2['Glucose'])

Output:
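As with binarization, sklearn's preprocessing module is imported but the scaling is done by hand; a minimal sketch with StandardScaler is shown below as an assumed alternative. Note that StandardScaler uses the population standard deviation, so its values differ very slightly from the .std() version above.

# Sketch: standardization via sklearn's StandardScaler (assumed alternative)
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('C:/Users/keert/Downloads/pima-indians-diabetes.csv')  # assumed local path
scaler = StandardScaler()
df['Glucose_std'] = scaler.fit_transform(df[['Glucose']]).ravel()
print(df[['Glucose', 'Glucose_std']])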

(d) Data Labelling

Program:
#20BEC0174 - Keerthi Uppalapati
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
df = pd.read_csv('C:/Users/keert/Downloads/pima-indians-diabetes.csv')
# Label each row according to its glucose value
df["label"] = "default_label"
df.loc[df["Glucose"] > 100, "label"] = "diabetic"
df.loc[df["Glucose"] <= 100, "label"] = "not diabetic"
print(df)

Output:
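A natural follow-up to labelling is encoding the string labels as integer codes; a minimal sketch with sklearn's LabelEncoder is given below, an assumed addition that rebuilds the same 'label' column before encoding it.

# Sketch: encoding the string labels as integer codes with LabelEncoder (assumed addition)
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('C:/Users/keert/Downloads/pima-indians-diabetes.csv')  # assumed local path
df["label"] = np.where(df["Glucose"] > 100, "diabetic", "not diabetic")
encoder = LabelEncoder()
df["label_code"] = encoder.fit_transform(df["label"])
print(df[["Glucose", "label", "label_code"]].head())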

Conclusion:
This experiment demonstrated Exploratory Data Analysis and pre-processing in Python. EDA helps us look beyond the raw numbers: the more we explore the data, the more insights we can draw from it. As data analysts, a large share of our time (often cited as around 80%) is spent understanding data and framing business problems through EDA.

Signature of student
