DSBDA Lab Manual 2022-23
Vision
Mission
• Generation of national wealth through education and research
• Imparting quality technical education at the cost affordable to all strata of
the society
• Enhancing the quality of life through sustainable development
• Carrying out high-quality intellectual work
• Achieving the distinction of the highest preferred engineering college in the
eyes of the stakeholders
Vision
Mission
• To produce Best Quality Computer Science Professionals by
imparting quality training, hands on experience and value education.
Companion Course: Data Science and Big Data Analytics Laboratory (310256)
Course Objectives:
To understand principles of Data Science for the analysis of real time
problems
To develop in-depth understanding and implementation of the key
technologies in Data Science and Big Data Analytics
To analyze and demonstrate knowledge of statistical data analysis
techniques for decision-making
To gain practical, hands-on experience with statistics programming
languages and Big Data tools
Course Outcomes:
On completion of the course, learner will be able to
CO1: Apply principles of Data Science for the analysis of real time problems
CO2: Implement data representation using statistical methods
CO3: Implement and evaluate data analytics algorithms
CO4: Perform text preprocessing
CO5: Implement data visualization techniques
CO6: Use cutting edge tools and technologies to analyze Big Data
Lab Manual
TE Computer
YEAR:-2022-2023
SEM-II
INDEX
Sr. Page
No. Name of Assignment No. Date Remark
Group A : Data Science
Data Wrangling I
1 Perform the following operations using Python on any open
source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open source data from the web (e.g.,
https://www.kaggle.com). Provide a clear
description of the data and its source (i.e., URL of
the web site).
3. Load the Dataset into pandas dataframe.
4. Data Preprocessing: check for missing values in the
data using pandas isnull(), describe() function to get
some initial statistics. Provide variable descriptions.
Types of variables etc. Check the dimensions of the
data frame.
5. Data Formatting and Data Normalization:
Summarize the types of variables by checking the
data types (i.e., character, numeric, integer, factor,
and logical) of the variables in the data set. If
variables are not in the correct data type, apply
proper type conversions.
6. Turn categorical variables into quantitative variables
in Python.
In addition to the codes and outputs, explain every operation
that you do in the above steps and explain everything that
you do to import/read/scrape the data set.
Data Wrangling II
2 Create an “Academic performance” dataset of students and
perform the following operations using Python.
Refer dataset
https://github.com/rashida048/Some-NLP-
Projects/blob/master/movie_dataset.csv
4. Use the following covid_vaccine_statewise.csv dataset and
perform the following analytics on the given dataset
https://www.kaggle.com/sudalairajkumar/covid19-in-
india?select=covid_vaccine_statewise.csv
a. Describe the dataset
b. Number of persons state wise vaccinated for first dose in
India
c. Number of persons state wise vaccinated for second dose
in India
d. Number of Males vaccinated
e. Number of females vaccinated
5. Write a case study on data-driven processing for Digital
Marketing OR Health care systems with Hadoop Ecosystem
components as shown. (Mandatory)
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data
services
● HBase: NoSQL Database (Provides real-time
reads and writes)
● Mahout, Spark MLlib: (Provides analytical
tools) Machine Learning algorithm libraries
● Solr, Lucene: Searching and Indexing
Group A
Assignment No: 1
Objective of the Assignment: Students should be able to perform the data wrangling
operation using Python on any open source dataset
Prerequisite:
1. Basic of Python Programming
2. Concept of Data Preprocessing, Data Formatting , Data Normalization and Data
Cleaning.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Introduction to Dataset
2. Python Libraries for Data Science
3. Description of Dataset
1. Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are
similar to table rows, but the columns can contain not only strings or numbers, but also
nested data structures such as lists, maps, and other records.
Instance: A single row of data is called an instance. It is an observation from the domain.
Feature: A single column of data is called a feature. It is a component of an observation
and is also called an attribute of a data instance. Some features may be inputs to a model
(the predictors) and others may be outputs or the features to be predicted.
Data Type: Features have a data type. They may be real or integer-valued or may have a
categorical or ordinal value. You can have strings, dates, times, and more complex types,
but typically they are reduced to real or categorical values when working with traditional
machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning
methods we typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train
our model.
Testing Dataset: A dataset that we use to validate the accuracy of our model but is not
used to train the model. It may be called the validation dataset.
Data Represented in a Table:
Data should be arranged in a two-dimensional space made up of rows and columns. This
type of data structure makes it easy to understand the data and pinpoint any problems. An
example of some raw data stored as a CSV (comma separated values).
Pandas dtype | Python type | NumPy type | Usage
int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers
1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
c. Matplotlib
This is undoubtedly my favorite and a quintessential Python library. You can create
stories with the data visualized with Matplotlib. Another library from the SciPy Stack,
Matplotlib plots 2D figures.
From histograms, bar plots, scatter plots and area plots to pie plots, Matplotlib can depict a wide
range of visualizations. With a bit of effort, you can create almost any visualization with
Matplotlib:
● Line plots
● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots
● Quiver plots
● Spectrograms
Matplotlib also supports labels, grids, legends, and other formatting elements.
d. Seaborn
So when you read the official documentation on Seaborn, it is defined as the data
visualization library based on Matplotlib that provides a high-level interface for drawing
attractive and informative statistical graphics. Putting it simply, seaborn is an extension
of Matplotlib with advanced features.
Introduced to the world as a Google Summer of Code project, Scikit Learn is a robust
machine learning library for Python. It features ML algorithms such as SVMs, random
forests, k-means clustering, spectral clustering and mean shift, along with utilities such as
cross-validation. Scikit Learn is built on top of NumPy and SciPy and is part of the
SciPy Stack.
3. Description of Dataset:
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple
Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning
Repository.
It includes three iris species with 50 samples each as well as some properties about each
flower. One flower species is linearly separable from the other two, but the other two are not
linearly separable from each other.
Description of Dataset-
3. The csv file at the UCI repository does not contain the variable/column names. They are
located in a separate file.
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
4. read in the dataset from the UCI Machine Learning Repository link and specify column
names to use
iris = pd.read_csv(csv_url, names = col_names)
2 dataset.tail(n=5)
Return the last n rows.
17 dataset.iloc[:m, :n] a subset of the first m rows and the first n columns
dataset[cols_2_4]
Function: DataFrame.isnull()
Output:
Function: DataFrame.isnull().any()
Output:
c. count of missing values across each column using isna() and isnull()
To get the count of missing values of the entire dataframe, the isnull() function is
used. The first sum() gives the column-wise count of missing values, and the second sum()
adds these counts to give the count of missing values of the entire dataframe.
Function: dataframe.isnull().sum().sum()
Output : 8
d. count row wise missing value using isnull()
Function: dataframe.isnull().sum(axis = 1)
Output:
Method 2:
Function: dataframe.isna().sum()
df1.Gender.isnull().sum()
Output: 2
g. groupby count of missing values of a column.
In order to get the count of missing values of the particular column by group in
pandas we will be using isnull() and sum() function with apply() and groupby()
which performs the group wise count of missing values as shown below.
Function:
df1.groupby(['Gender'])['Score'].apply(lambda x:
x.isnull().sum())
Output:
analyzed or modelled effectively, and there are several techniques for this process.
a. Data Formatting: Ensuring all data formats are correct (e.g. object, text, floating
number, integer, etc.) is another part of this initial ‘cleaning’ process. If you are
working with dates in Pandas, they also need to be stored in the exact format to use
special date-time functions.
b. Data normalization: Mapping all the numeric data values onto a uniform scale
(e.g. from 0 to 1) is involved in data normalization. Making the ranges consistent
across variables helps with statistical analysis and ensures better comparisons
later on. Scaling to the range 0 to 1 is also known as Min-Max scaling.
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Print iris dataset.
df.head()
Step 5: Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()
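As a minimal sketch of the steps above, the following scales two numeric columns to the range 0 to 1 with MinMaxScaler. The small DataFrame and its values are made up for illustration (the manual itself works on the iris dataset); only MinMaxScaler and fit_transform come from scikit-learn.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical values standing in for two iris feature columns
df = pd.DataFrame({
    "Sepal_Length": [5.0, 6.0, 7.0],
    "Petal_Length": [1.0, 4.0, 5.0],
})

# Create the scaler and map every column onto the range [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
scaled = min_max_scaler.fit_transform(df)

# fit_transform returns a NumPy array, so rebuild a DataFrame with the column names
df_normalized = pd.DataFrame(scaled, columns=df.columns)
print(df_normalized)
```

Each column is scaled independently, so the smallest value of every column becomes 0 and the largest becomes 1.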
Example : Suppose we have a column Height in some dataset. After applying label
encoding, the Height column is converted into:
where 0 is the label for tall, 1 is the label for medium, and 2 is a label for short height.
Label Encoding on iris dataset: For iris dataset the target column which is Species. It
contains three species Iris-setosa, Iris-versicolor, Iris-virginica.
Sklearn Functions for Label Encoding:
● preprocessing.LabelEncoder : It encodes labels with values between 0
and n_classes-1.
● fit_transform(y):
Parameters: y : array-like of shape (n_samples,)
Target values.
Returns: y : array-like of shape (n_samples,)
Encoded labels.
This transformer should be used to encode target values, and not the input.
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Define the label_encoder object, which knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
Step 5: Encode labels in column 'Species'.
df['Species']= label_encoder.fit_transform(df['Species'])
Step 6: Observe the unique values for the Species column.
● Use LabelEncoder when there are only two possible values of a categorical feature.
For example, features having value such as yes or no. Or, maybe, gender features
when there are only two possible values including male or female.
Limitation: Label encoding converts the data into machine-readable form, but it assigns a
unique number (starting from 0) to each class of data. This may lead to the generation
of priority issues in the data sets. A label with a high value may be considered to have
higher priority than a label having a lower value.
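The limitation described above can be seen in a short sketch. The list of height labels is hypothetical. LabelEncoder assigns codes in the sorted (alphabetical) order of the labels, so the numbers carry no meaningful ranking.

```python
from sklearn import preprocessing

# Hypothetical categorical values
heights = ["tall", "medium", "short", "tall"]

label_encoder = preprocessing.LabelEncoder()
encoded = label_encoder.fit_transform(heights)

# classes_ holds the labels in sorted order; the index of each label is its code
print(list(label_encoder.classes_))
print(list(encoded))
```

Here "short" receives a larger code than "medium" only because of alphabetical order, which is why a model may read a priority into the codes that does not exist in the data.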
b. One-Hot Encoding:
In one-hot encoding, we create a new set of dummy (binary) variables that is equal to the
number of categories (k) in the variable. For example, let’s say we have a categorical
variable Color with three categories called “Red”, “Green” and “Blue”, we need to use
three dummy variables to encode this variable using one-hot encoding. A dummy
(binary) variable just takes the value 0 or 1 to indicate the exclusion or inclusion of a
category.
In one-hot encoding,
“Red” color is encoded as [1 0 0] vector of size 3.
“Green” color is encoded as [0 1 0] vector of size 3.
“Blue” color is encoded as [0 0 1] vector of size 3.
Dummy encoding also uses dummy (binary) variables. Instead of creating a number of
dummy variables that is equal to the number of categories (k) in the variable, dummy
encoding uses k-1 dummy variables. To encode the same Color variable with three
categories using the dummy encoding, we need to use only two dummy variables.
Column names in the DataFrame to be encoded. If columns is None then all the
columns with object or category dtype will be converted.
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Apply the label_encoder object to label-encode the Species column, then
observe the unique values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 6: Apply one_hot encoder with dummy variables for Species column.
one_hot_df = pd.get_dummies(df, prefix="Species",
columns=['Species'], drop_first=True)
Step 7: Observe the merged dataframe
one_hot_df
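The difference between one-hot encoding (k columns) and dummy encoding (k-1 columns) described earlier can be sketched with pd.get_dummies. The Color column below is the example from the text; the DataFrame name df_color is only illustrative.

```python
import pandas as pd

# Categorical variable with k = 3 categories
df_color = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# One-hot encoding: one binary column per category (k columns)
one_hot = pd.get_dummies(df_color, columns=["Color"], prefix="Color")

# Dummy encoding: drop_first=True drops the first category (k-1 columns)
dummy = pd.get_dummies(df_color, columns=["Color"], prefix="Color", drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```

With drop_first=True the dropped category is represented by all remaining dummy columns being 0.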
Conclusion- In this way we have explored the functions of the python library for Data
Preprocessing, Data Wrangling Techniques and How to Handle missing values on Iris Dataset.
Assignment Question
1. Explain Data Frame with Suitable example.
2. What is the limitation of the label encoding method?
3. What is the need of data normalization?
4. What are the different Techniques for Handling the Missing Data?
Group A
Assignment No: 2
Prerequisite:
1. Basic of Python Programming
2. Concept of Data Preprocessing, Data Formatting , Data Normalization and Data
Cleaning.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Creation of Dataset using Microsoft Excel.
2. Identification and Handling of Null Values
Step 2: Enter the name of the dataset and save the dataset as type CSV (MS-DOS).
Step 3: Fill the data by using the RANDBETWEEN function. For every feature, fill
the data by considering the above specified range.
One example is given:
The placement count largely depends on the placement score. It is considered that if the
placement score is below 75, 1 offer is facilitated; if the placement score is above 75, 2 offers
are facilitated; and otherwise (above 85) 3 offers are facilitated. A nested IF formula is used for ease of data filling.
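The nested IF rule above is filled in with Excel in this manual; the same logic can be sketched in Python for checking the generated data. The function name placement_offer_count is hypothetical, and the handling of the exact boundary values 75 and 85 is an assumption, since the text does not state it.

```python
def placement_offer_count(placement_score):
    """Return the number of offers for a placement score (boundaries assumed)."""
    if placement_score < 75:
        return 1          # score below 75: 1 offer
    elif placement_score <= 85:
        return 2          # score from 75 up to 85: 2 offers (boundary is an assumption)
    else:
        return 3          # score above 85: 3 offers


for score in (70, 80, 90):
    print(score, placement_offer_count(score))
```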
Step 4: In 20% of the data, fill in impurities. The range of math score is [60,80], so update a
few instances with values below 60 or above 80. Repeat this for Writing_Score [60,80],
Placement_Score [75-100], Club_Join_Date [2018-2021].
Step 5: To violate the rule of the response variable, update a few values. For example, if the placement score is
greater than 85, set only 1 offer as facilitated.
1. None: None is a Python singleton object that is often used for missing data in
Python code.
2. NaN : NaN (an acronym for Not a Number), is a special floating-point value
recognized by all systems that use the standard IEEE floating-point
representation.
Pandas treats None and NaN as essentially interchangeable for indicating missing
or null values. To facilitate this convention, there are several useful functions for
detecting, removing, and replacing null values in a Pandas DataFrame:
● isnull()
● notnull()
● dropna()
● fillna()
● replace()
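A minimal sketch of the functions listed above, using a small made-up DataFrame (the column names mimic the student performance data, but the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with both NaN and None as missing values
df_demo = pd.DataFrame({
    "math score": [72.0, np.nan, 90.0, None],
    "gender": ["female", "male", None, "male"],
})

# isnull() marks missing cells; sum() counts them per column
missing_per_column = df_demo.isnull().sum()

# notnull() is the inverse: keep only rows where math score is present
rows_with_math = df_demo[df_demo["math score"].notnull()]

# dropna() removes every row that has at least one missing value
rows_without_nulls = df_demo.dropna()

# fillna() replaces missing values, here per column
filled = df_demo.fillna({"math score": df_demo["math score"].mean(),
                         "gender": "unknown"})

# replace() can also swap NaN for a chosen value in a numeric column
replaced_math = df_demo["math score"].replace(np.nan, -99)

print(missing_per_column)
print(filled)
```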
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 5: Create a series that is True for NaN values of a specific column, for example
math score in the dataset, and display only the rows where math score is NaN
series = pd.isnull(df["math score"])
df[series]
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 5: Create a series that is True for non-NaN values of a specific column, for example
math score in the dataset, and display only the rows where math score is not NaN
series1 = pd.notnull(df["math score"])
df[series1]
See that there are also categorical values in the dataset, for this, you need to use
Label Encoding or One Hot Encoding.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
newdf=df
df
To fill null values in a dataset, the fillna() and replace() functions are used.
These functions replace NaN values in a DataFrame with a value that you
specify.
df = pd.read_csv("StudentsPerformanceTest1.csv", na_values =
missing_values)
df
Step 5: filling missing values using mean, median and standard deviation of that
column.
Following line will replace Nan value in dataframe with value -99
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: To drop rows with at least 1 null value
new_data = df.dropna()
new_data
Similarly, an Outlier is an observation in a given dataset that lies far from the rest
of the observations. That means an outlier is vastly larger or smaller than the remaining
values in the set.
Mean is the accurate measure to describe the data when we do not have any
outliers present. Median is used if there is an outlier in the dataset. Mode is used if there
is an outlier AND about ½ or more of the data is the same.
‘Mean’ is the measure of central tendency that is most affected by outliers,
which in turn impacts the standard deviation.
Example:
Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By
looking at it, one can quickly say ‘101’ is an outlier that is much larger than the other
values.
From the above calculations, we can clearly say the Mean is more affected than the
Median.
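The calculation for the sample above can be reproduced with Python's statistics module. This is a small sketch; the variable names are illustrative.

```python
import statistics

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
without_outlier = [x for x in sample if x != 101]

# Mean and median with the outlier 101 included
mean_with = statistics.mean(sample)
median_with = statistics.median(sample)

# Mean and median after removing the outlier
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print("With outlier:    mean =", round(mean_with, 2), "median =", median_with)
print("Without outlier: mean =", round(mean_without, 2), "median =", median_without)
```

Removing 101 moves the mean from about 20.08 to about 12.73, while the median only moves from 14 to 13.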
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But
what if we have a huge dataset, how do we identify the outliers then? We need to use
visualization and mathematical techniques.
● Boxplots
● Scatterplots
● Z-score
● Interquartile Range (IQR)
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Step 4:Select the columns for boxplot and draw the boxplot.
Step 5: We can now print the outliers for each column with reference to the box plot.
print(np.where(df['math score']>90))
print(np.where(df['reading score']<25))
print(np.where(df['writing score']<30))
Step 5: We can now print the outliers with reference to scatter plot.
print(np.where((df['placement score']<50) & (df['placement
offer count']>1)))
print(np.where((df['placement score']>85) & (df['placement
offer count']<3)))
Algorithm:
Step 1 : Import numpy and stats from scipy libraries
import numpy as np
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
In the above formulas, following statistical convention, the IQR is scaled up by 0.5
(new_IQR = IQR + 0.5*IQR).
Algorithm:
Step 1 : Import numpy library
import numpy as np
Step 2: Sort Reading Score feature and store it into sorted_rscore.
sorted_rscore= sorted(df['reading score'])
Step 3: Print sorted_rscore
sorted_rscore
Step 4: Calculate and print Quartile 1 and Quartile 3
q1 = np.percentile(sorted_rscore, 25)
q3 = np.percentile(sorted_rscore, 75)
print(q1,q3)
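The quartile step above leads directly to the IQR bounds. As a sketch, the following applies the upper and lower formulas to the small sample used earlier in this assignment instead of the reading score column.

```python
import numpy as np

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

# Quartiles and interquartile range
q1 = np.percentile(sample, 25)
q3 = np.percentile(sample, 75)
iqr = q3 - q1

# Bounds beyond which a value is treated as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [x for x in sample if x < lower or x > upper]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Bounds:", lower, upper)
print("Outliers:", outliers)
```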
b = np.where(df_stud['math score']>ninetieth_percentile,
ninetieth_percentile, df_stud['math score'])
print("New array:",b)
df_stud.insert(1,"m score",b,True)
df_stud
● Mean/Median imputation:
As the mean value is highly influenced by the outliers, it is advised to replace the
outliers with the median value.
1. Plot the box plot for reading score
col = ['reading score']
df.boxplot(col)
median=np.median(sorted_rscore)
median
4. Replace the upper bound outliers using median value
refined_df = df.copy()
refined_df['reading score'] = np.where(refined_df['reading score'] > upr_bound,
                                        median, refined_df['reading score'])
5. Display refined_df
● Smoothing: It is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It
helps in predicting the patterns.
● Aggregation: Data collection or aggregation is the method of storing and presenting
data in a summary format. The data may be obtained from multiple data sources to
integrate these data sources into a data analysis description. This is a crucial step
since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used.
● Generalization: It converts low-level data attributes to high-level data attributes
using concept hierarchy. For example, Age initially in numerical form (22, 25) is
converted into a categorical value (young, old).
● Normalization: Data normalization involves converting all data variables into a
given range. Some of the techniques that are used for accomplishing normalization
are:
○ Min–max normalization: This transforms the original data linearly.
○ Z-score normalization: In z-score normalization (or zero-mean normalization)
the values of an attribute (A), are normalized based on the mean of A and its
standard deviation.
○ Normalization by decimal scaling: It normalizes the values of an attribute by
changing the position of their decimal points
● Attribute or feature construction.
○ New attributes constructed from the given ones: Where new attributes are
created & applied to assist the mining process from the given set of attributes.
This simplifies the original data & makes the mining more efficient.
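The three normalization techniques listed above can be sketched with NumPy. The array of values is made up, and the z-score here uses the population standard deviation (NumPy's default), which is an assumption because the text does not say which form is intended.

```python
import numpy as np

# Hypothetical attribute values
values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: linear mapping onto [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: subtract the mean, divide by the standard deviation
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, with j the smallest integer
# such that the largest absolute scaled value is below 1
j = 0
while np.abs(values).max() / 10 ** j >= 1:
    j += 1
decimal_scaled = values / 10 ** j

print(min_max)
print(z_score)
print(j, decimal_scaled)
```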
In this assignment, the purpose of this transformation should be one of the
following reasons:
Algorithm:
Step 1 : Detecting outliers using Z-Score for the Math_score variable and
remove the outliers.
Step 2: Observe the histogram for math_score variable.
import matplotlib.pyplot as plt
new_df['math score'].plot(kind = 'hist')
Step 3: Convert the variable to its base-10 logarithm.
df['log_math'] = np.log10(df['math score'])
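As a small sketch of what the base-10 log transformation does, the following applies np.log10 to made-up scores that span several orders of magnitude. The DataFrame name df_log and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical positive values with a wide spread
df_log = pd.DataFrame({"math score": [10.0, 100.0, 1000.0]})

# log10 compresses large values: each tenfold increase adds 1
df_log["log_math"] = np.log10(df_log["math score"])
print(df_log)
```

The logarithm is only defined for positive values, so zeros or negative entries would need to be handled before applying it.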
Group A
Assignment No: 3
Provide the codes with outputs and explain everything that you do in this step.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform the Statistical operations
using Python on any open source dataset.
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basic of Python Programming
2. Concept of statistics such as mean, median, minimum, maximum, standard deviation
etc.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Summary statistics
2. Types of Variables
1. Summary statistics:
What is Statistics?
Statistics is the science of collecting data and analysing them to infer proportions (sample)
that are representative of the population. In other words, statistics is interpreting data in
order to make predictions for the population.
Branches of Statistics:
There are two branches of Statistics.
DESCRIPTIVE STATISTICS : A descriptive statistic is a measure that
describes the data.
INFERENTIAL STATISTICS : Using a random sample of data taken from a population to
describe and make inferences about the population is called Inferential Statistics.
Descriptive Statistics
Descriptive Statistics is summarising the data at hand through certain numbers like mean,
median etc. so as to make the understanding of the data easier. It does not involve any
generalisation or inference beyond what is available. This means that the descriptive
statistics are just the representation of the data (sample) available and not based on any
theory of probability.
b. Median : Median is the point which divides the entire data into two equal
halves. One-half of the data is less than the median, and the other half is greater
than the same. Median is calculated by first arranging the data in either ascending
or descending order.
○ If the number of observations is odd, the median is given by the middle
observation in the sorted form.
○ If the number of observations is even, the median is given by the mean of the
two middle observations in the sorted form.
An important point to note is that the order of the data (ascending or
descending) does not affect the median.
c. Mode : Mode is the number which has the maximum frequency in the entire data
set, or in other words, mode is the number that appears the maximum number of
times. A data set can have one or more than one mode.
● If there is only one number that appears the maximum number of times,
the data has one mode, and is called Uni-modal.
● If there are two numbers that appear the maximum number of times, the
data has two modes, and is called Bi-modal.
● If there are more than two numbers that appear the maximum number of
times, the data has more than two modes, and is called Multi-modal.
Mode is given by the number that occurs the maximum number of times.
Here, 17 and 21 both occur twice. Hence, this is a Bimodal data and the modes
are 17 and 21.
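A bimodal case like the one above can be checked with Python's statistics module. The data list is hypothetical; it is chosen only so that 17 and 21 each occur twice, as in the example.

```python
import statistics

# Hypothetical data in which 17 and 21 both occur twice
data = [12, 17, 17, 19, 21, 21, 25]

mean_value = statistics.mean(data)
median_value = statistics.median(data)

# multimode() returns every value that shares the highest frequency
modes = statistics.multimode(data)

print("Mean:", round(mean_value, 2))
print("Median:", median_value)
print("Modes:", modes)
```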
Measures of Dispersion describe the spread of the data around the central value (or the
Measures of Central Tendency)
1. Absolute Deviation from Mean — The Absolute Deviation from Mean, also
called Mean Absolute Deviation (MAD), describes the variation in the data set, in
the sense that it tells the average absolute distance of each data point in the set
from the mean. It is calculated as
$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert$
2. Variance — Variance measures how far data points are spread out from the mean.
A high variance indicates that data points are spread widely and a small variance
indicates that the data points are closer to the mean of the data set. It is calculated
as
4. Range — Range is the difference between the Maximum value and the Minimum
value in the data set. It is given as
$\mathrm{Range} = x_{\max} - x_{\min}$
5. Quartiles — Quartiles are the points in the data set that divide the data set into
four equal parts. Q1, Q2 and Q3 are the first, second and third quartile of the data
set.
● 25% of the data points lie below Q1 and 75% lie above it.
● 50% of the data points lie below Q2 and 50% lie above it. Q2 is nothing but
Median.
● 75% of the data points lie below Q3 and 25% lie above it.
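The measures of dispersion described above can be computed with NumPy as a quick check. The data values are made up, and the variance and standard deviation below are the population versions (division by n), which is an assumption since the text does not specify.

```python
import numpy as np

# Hypothetical data set
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()

# Mean absolute deviation: average absolute distance from the mean
mad = np.mean(np.abs(data - mean))

# Population variance and standard deviation (ddof=0 is NumPy's default)
variance = data.var()
std_dev = data.std()

# Range: maximum minus minimum
data_range = data.max() - data.min()

# Quartiles; Q2 is the median
q1, q2, q3 = np.percentile(data, [25, 50, 75])

print("MAD:", mad, "Variance:", variance, "Std:", std_dev, "Range:", data_range)
print("Q1:", q1, "Q2:", q2, "Q3:", q3)
```

Note that pandas methods such as df.std() use the sample form (division by n-1) by default, so their results differ slightly from these.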
Positive Skew — This is the case when the tail on the right side of the curve is
bigger than that on the left side. For these distributions, mean is greater than the
mode.
Negative Skew — This is the case when the tail on the left side of the curve is
bigger than that on the right side. For these distributions, mean is smaller than the
mode.
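As an illustration of positive skew, the following sketch uses a made-up series with one large value in the right tail and compares its mean and mode. Series.skew() is the pandas method for sample skewness.

```python
import pandas as pd

# Hypothetical right-skewed data: one large value stretches the right tail
scores = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

mean_value = scores.mean()
mode_value = scores.mode().iloc[0]   # most frequent value
skewness = scores.skew()             # positive for a longer right tail

print("Mean:", mean_value, "Mode:", mode_value, "Skewness:", round(skewness, 2))
```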
Python Code:
1. Mean
To find mean of all columns
Syntax:
df.mean()
Output:
2. Median
To find median of all columns
Syntax:
df.median()
Output:
3. Mode
To find mode of all columns
Syntax:
df.mode()
Output:
In the Genre Column mode is Female, for column Age mode is 32 etc. If a
particular column does not have mode all the values will be displayed in
the column.
To find the mode of a specific column.
Syntax:
df.loc[:,'Age'].mode()
Output:
32
4. Minimum
To find minimum of all columns
Syntax:
df.min()
Output:
df.loc[:,'Age'].min(skipna = False)
Output:
18
5. Maximum
To find maximum of all columns
Syntax:
df.max()
Output:
6. Standard Deviation
To find Standard Deviation of all columns
Syntax:
df.std()
Output:
13.969007331558883
To find Standard Deviation row wise
Syntax:
df.std(axis=1)[0:4]
Output:
2. Types of Variables:
A variable is a characteristic that can be measured and that can assume different values.
Height, age, income, province or country of birth, grades obtained at school and type of
housing are all examples of variables.
● Categorical and
● Numeric.
Each category is then classified in two subcategories: nominal or ordinal for categorical
variables, discrete or continuous for numeric variables.
● Categorical variables
○ Ordinal Variable
An ordinal variable is a variable whose values are defined by an order relation
between the different categories. In following table, the variable “behaviour” is
ordinal because the category “Excellent” is better than the category “Very good,”
which is better than the category “Good,” etc. There is some natural ordering, but
it is limited since we do not know by how much “Excellent” behaviour is better
than “Very good” behaviour.
● Numerical Variables
A numeric variable (also called quantitative variable) is a quantifiable characteristic
whose values are numbers (except numbers which are codes standing for categories).
○ Continuous variables
A variable is said to be continuous if it can assume an infinite number of real
values within a given interval.
For instance, consider the height of a student. The height can’t take just any
value. It can’t be negative and it can’t be higher than three metres. But between 0
and 3, the number of possible values is theoretically infinite. A student may be
1.6321748755 … metres tall.
○ Discrete variables
As opposed to a continuous variable, a discrete variable can assume only a finite
number of real values within a given interval.
An example of a discrete variable would be the score given by a judge to a
gymnast in competition: the range is 0 to 10 and the score is always given to one
decimal (e.g. a score of 8.5)
Syntax:
df_u = df.rename(columns={'Annual Income (k$)': 'Income'}, inplace=False)
df_u.groupby(['Genre']).Income.mean()
Output:
To create a list that contains a numeric value for each response to the categorical variable.
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df[['Genre']]).toarray())
enc_df
6. To display basic statistical details like percentile, mean, standard deviation etc. for
Iris-setosa use describe
print('Iris-setosa')
print(iris[irisSet].describe())
8. To display basic statistical details like percentile, mean, standard deviation etc. for
Iris-versicolor use describe
print('Iris-versicolor')
print(iris[irisVer].describe())
10. To display basic statistical details like percentile, mean, standard deviation etc. for
Iris-virginica use describe
print('Iris-virginica')
print(iris[irisVir].describe())
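The per-species describe() calls above rely on boolean masks such as irisSet. As a sketch, assuming irisSet is a mask that selects rows of one species, the same summary can be produced on a tiny made-up frame, and groupby gives all species at once. The values below are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for the iris dataset
iris = pd.DataFrame({
    "Sepal_Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Boolean mask for one species (assumed meaning of irisSet)
irisSet = iris["Species"] == "Iris-setosa"
print("Iris-setosa")
print(iris[irisSet].describe())

# Summary statistics for every species in one call
summary = iris.groupby("Species")["Sepal_Length"].describe()
print(summary)
```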
Conclusion:
Measures of central tendency describe the centre of a data set. It includes the
mean, median, and mode.
Measures of variability or spread describe the dispersion of data within the set and
include standard deviation, variance, and the minimum and maximum values.
Assignment Questions:
1. Explain Measures of Central Tendency with examples.
2. What are the different types of variables? Explain with examples.
3. Which method is used to compute summary statistics of a dataframe? Write the code.
Group A
Assignment No: 4
Title of the Assignment: Create a Linear Regression Model using Python/R to predict
home prices using Boston Housing Dataset (https://www.kaggle.com/c/boston-housing).
The Boston Housing dataset contains information about various houses in Boston through
different parameters. There are 506 samples and 14 feature variables in this dataset.
The objective is to predict the prices of the houses using the given features.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform data analysis using linear
regression in Python for any open source dataset
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basics of Python Programming
2. Concept of Regression.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Linear Regression : Univariate and Multivariate
2. Least Square Method for Linear Regression
3. Measuring Performance of Linear Regression
4. Example of Linear Regression
5. Training data set and Testing data set
---------------------------------------------------------------------------------------------------------------
1. Linear Regression: It is a machine learning algorithm based on supervised learning. It
predicts the value of a target variable on the basis of independent variables.
● It is mostly used to find the relationship between variables and for forecasting.
● The dependent variable (Y) is continuous, while the independent variable (X) may be
continuous or discrete. The method assumes a linear relationship between the
predictor and the target variable, which is why it is known as Linear Regression.
● The cost function commonly used for linear regression is the Mean Squared Error
(MSE), which is equal to the average squared difference between an observation’s
actual and predicted values.
● It is shown as the equation of a line:
Y = m*X + b + e
Where: b is the intercept, m is the slope of the line and e is the error term.
This equation can be used to predict the value of target variable Y based on given
predictor variable(s) X, as shown in Fig. 1.
● Fig. 2 shown below depicts the relation between weight (in Kg) and height (in
cm), which is a linear relation. Linear regression is a statistical approach to
summarising and studying the relationships among continuous (quantitative) variables.
● Here a variable, denoted by ‘x’, is considered as the predictor, explanatory, or
independent variable.
Fig.2 : Relation between weight (in Kg) and height (in cm)
Multivariate Regression: It concerns the study of two or more predictor variables.
Usually the original features are transformed into polynomial features of a given
degree and Linear Regression is then applied to the transformed features.
● A simple linear model Y = a + bX in the original feature is transformed into
polynomial features, and a linear regression applied to the transformed features
gives something like
Y = a + bX + cX^2
● If a high degree value is used in transformation the curve becomes over-fitted as it
captures the noise from data as well.
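The polynomial transformation described above can be sketched in a few lines of pure Python. This is only an illustration: the helper names polynomial_features and predict, and the coefficient values, are hypothetical and are not taken from any library.

```python
def polynomial_features(x, degree):
    """Map a single value x to the feature list [1, x, x^2, ..., x^degree]."""
    return [x ** power for power in range(degree + 1)]

def predict(coefficients, x):
    """Evaluate a model that is linear in the transformed features."""
    features = polynomial_features(x, len(coefficients) - 1)
    return sum(c * f for c, f in zip(coefficients, features))

# Hypothetical coefficients a, b, c for Y = a + bX + cX^2
coefficients = [1.0, 2.0, 0.5]
print(polynomial_features(3, 2))  # [1, 3, 9]
print(predict(coefficients, 3))   # 1 + 2*3 + 0.5*9 = 11.5
```

The model stays linear in its coefficients; only the inputs are transformed. Raising the degree adds more features, which is what allows the fitted curve to start following the noise.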
2. Least Square Method for Linear Regression
● A simple linear model is one which involves only one dependent and one independent
variable. Regression models are usually denoted in matrix notation.
● However, a simple univariate linear model can be denoted by the regression
equation
$y = \beta_0 + \beta_1 x$    (1)
● This linear equation represents a line, also known as the ‘regression line’. The least square
estimation technique is one of the basic techniques used to estimate the values of the
parameters $\beta_0$ and $\beta_1$ based on a sample set.
● This technique estimates the parameters $\beta_0$ and $\beta_1$ by trying to minimise the sum
of the squares of the errors at all the points in the sample set. The error is the deviation of the
actual sample data point from the regression line. The technique can be represented by the equation
$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$    (2)
Using differential calculus on equation (2) we can find the values of $\beta_0$ and $\beta_1$ such
that the sum of squared errors is minimum:
$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$    (3)
$\beta_0 = \bar{y} - \beta_1 \bar{x}$    (4)
Once the Linear Model is estimated using equations (3) and (4), we can estimate the
value of the dependent variable within the given range only. Going outside the range is called
extrapolation, which can be inaccurate if simple regression techniques are used.
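The least-squares estimates can be computed directly from the sample means. The following is a minimal sketch in pure Python; the function name fit_simple_linear_regression is hypothetical, and the five score pairs are the ones used in the worked example of this assignment.

```python
def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) that minimise the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Numerator and denominator of the slope estimate
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

x = [95, 85, 80, 70, 60]   # scores in standard X
y = [85, 95, 70, 65, 70]   # scores in standard XII
beta0, beta1 = fit_simple_linear_regression(x, y)
print(round(beta1, 4), round(beta0, 4))  # 0.6438 26.7808
print(round(beta0 + beta1 * 80, 3))      # 78.288
```

The intercept differs slightly from a hand calculation that rounds the slope to 0.644 before computing the intercept.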
3. Measuring Performance of Linear Regression
Mean Square Error:
The Mean squared error (MSE) represents the error of the estimator or predictive model
created based on the given set of observations in the sample. Two or more regression
models created using a given sample data can be compared based on their MSE. The
lesser the MSE, the better the regression model is. When the linear regression model is
trained using a given set of observations, the model with the least mean sum of squares
error (MSE) is selected as the best model. The Python or R packages select the best-fit
model as the model with the lowest MSE or lowest RMSE when training the linear
regression models.
Mathematically, the MSE is calculated as the average of the squared differences
between the actual value and the predicted or estimated value given by the
regression model (line or plane):
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
An MSE of zero (0) represents the fact that the predictor is a perfect predictor.
RMSE:
Root Mean Squared Error takes the mean of the squared errors and then takes the square
root of that mean.
Mathematically speaking, Root Mean Squared Error is the square root of the sum of all squared
errors divided by the total number of values. This is the formula to calculate RMSE:
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
R-Squared is the ratio of the sum of squares regression (SSR) and the sum of squares total
(SST).
The total sum of squares (SST), the regression sum of squares (SSR) and the sum of squares of
errors (SSE) all measure variation, each in a different way: SST is the variation of the actual
values around their mean, SSR is the variation explained by the regression model, and SSE is the
variation left unexplained.
A value of R-squared closer to 1 would mean that the regression model covers most part
of the variance of the values of the response variable and can be termed as a good
model.
One can alternatively use MSE or R-Squared based on what is appropriate for the problem at
hand. However, the disadvantage of using MSE rather than R-squared is that it is difficult
to gauge the performance of the model using MSE, as the value of MSE can vary from 0 to
any larger number. In the case of R-squared, the value is bounded between 0 and 1.
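The three measures can be computed by hand as in the sketch below. The function name regression_metrics and the actual and predicted values are hypothetical. R-squared is computed here as 1 - SSE/SST, which equals SSR/SST for a least-squares line fitted with an intercept.

```python
import math

def regression_metrics(actual, predicted):
    """Return (MSE, RMSE, R-squared) for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    mse = sse / n
    rmse = math.sqrt(mse)
    r_squared = 1 - sse / sst
    return mse, rmse, r_squared

# Hypothetical actual and predicted values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
mse, rmse, r_squared = regression_metrics(actual, predicted)
print(mse)                  # 0.125
print(round(rmse, 4))       # 0.3536
print(round(r_squared, 3))  # 0.975
```

RMSE is expressed in the same units as the target variable, which often makes it easier to interpret than MSE.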
4. Example of Linear Regression
Consider the following data for 5 students.
Each Xi (i = 1 to 5) represents the score of the ith student in standard X and the corresponding
Yi (i = 1 to 5) represents the score of the ith student in standard XII.
Find:
(i) The linear regression equation that best predicts the standard XII score
(ii) The interpretation of the linear regression equation
(iii) If a student's score is 80 in std X, then what is the expected score in XII standard?
Student   Score in X standard (Xi)   Score in XII standard (Yi)
1         95                         85
2         85                         95
3         80                         70
4         70                         65
5         60                         70
x     y     $x - \bar{x}$   $y - \bar{y}$   $(x - \bar{x})^2$   $(x - \bar{x})(y - \bar{y})$
95    85    17              8               289                 136
85    95    7               18              49                  126
80    70    2               -7              4                   -14
70    65    -8              -12             64                  96
60    70    -18             -7              324                 126
$\bar{x} = 78$   $\bar{y} = 77$   $\sum (x - \bar{x})^2 = 730$   $\sum (x - \bar{x})(y - \bar{y}) = 470$
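(i) Linear regression equation that best predicts the standard XII score
$y = \beta_0 + \beta_1 x$
$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
$\beta_1 = 470 / 730 = 0.644$
$\beta_0 = \bar{y} - \beta_1 \bar{x} = 77 - 0.644 \times 78 = 26.768$
$y = 26.768 + 0.644 x$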
Interpretation 1
For an increase in the value of x by one unit there is an increase in the value of y by 0.644 units.
Interpretation 2
The score in XII standard (Yi) depends on the score in X standard (Xi) at a rate of 0.644 units
per unit, while other factors contribute 26.768 to the result of XII standard.
(iii) If a student's score is 80 in std X, then the expected score in XII standard is
26.768 + 0.644 × 80 = 78.288
(c) Generalization
● Generalization is the ability of a model to make predictions about future data based on
what it has learned from past data.
● The model needs to generalize beyond the training data to some future data that it might
not have seen yet.
● The ultimate aim of the machine learning model is to minimize the generalization error.
● The generalization error is essentially the average error for data the model has never
seen.
● In general, the dataset is divided into two partitions: a training set and a test set.
● The fit method is called on the training set to build the model.
● The predict method of the fitted model is then applied to the test set to estimate the
target value and evaluate the model's performance.
● The reason the data is divided into training and test sets is to use the test set to estimate
how well the model, trained on the training data, would perform on the
unseen data.
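The hold-out procedure in the list above can be sketched without any library, as below. The helper names split_indices, fit_line and mse are hypothetical, and the data set is synthetic (y = 2x + 1 with a small alternating disturbance), chosen only so that the example is self-contained.

```python
import random

def split_indices(n_samples, test_fraction, seed):
    """Shuffle the row indices and reserve a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    x_mean, y_mean = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
             / sum((a - x_mean) ** 2 for a in x))
    return slope, y_mean - slope * x_mean

def mse(x, y, slope, intercept):
    """Mean squared error of the line on the given points."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y)) / len(x)

# Synthetic data: y = 2x + 1 with a small alternating disturbance
x = list(range(20))
y = [2 * xi + 1 + (0.5 if xi % 2 == 0 else -0.5) for xi in x]

train_idx, test_idx = split_indices(len(x), test_fraction=0.2, seed=0)
x_train, y_train = [x[i] for i in train_idx], [y[i] for i in train_idx]
x_test, y_test = [x[i] for i in test_idx], [y[i] for i in test_idx]

slope, intercept = fit_line(x_train, y_train)        # fitting uses the training set only
train_error = mse(x_train, y_train, slope, intercept)
test_error = mse(x_test, y_test, slope, intercept)   # estimate of the generalization error
print(len(train_idx), len(test_idx))                 # 16 4
print(train_error, test_error)
```

The test rows take no part in fitting, so the error measured on them is an estimate of how the model would behave on unseen data.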
Algorithm (Synthetic Dataset):
Step 1: Import libraries and create alias for Pandas, Numpy and Matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Create arrays for the independent variable (x) and the dependent variable (y).
x=np.array([95,85,80,70,60])
y=np.array([85,95,70,65,70])
Step 3 : Create Linear Regression Model using Polyfit Function:
model= np.polyfit(x, y, 1)
Step 4: Observe the coefficients of the model.
model
Output:
array([ 0.64383562, 26.78082192])
Step 5: Predict the Y value for X and observe the output.
predict = np.poly1d(model)
predict(65)
Output:
68.63
Step 6: Predict the y_pred for all values of x.
y_pred= predict(x)
y_pred
Output:
Output:
data.head()
Step 5: Adding target variable to dataframe
data['PRICE'] = boston.target
Step 6: Perform Data Preprocessing( Check for missing values)
data.isnull().sum()
Step 7: Split dependent variable and independent variables
x = data.drop(['PRICE'], axis = 1)
y = data['PRICE']
Step 8: splitting data to training and testing dataset.
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, random_state=0)
Step 9: Use linear regression( Train the Machine ) to Create Model
import sklearn
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
model=lm.fit(xtrain, ytrain)
Step 10: Predict the y_pred for all values of train_x and test_x
ytrain_pred = lm.predict(xtrain)
ytest_pred = lm.predict(xtest)
Step 11:Evaluate the performance of Model for train_y and test_y
df_train = pd.DataFrame({'Actual': ytrain, 'Predicted': ytrain_pred})
df_test = pd.DataFrame({'Actual': ytest, 'Predicted': ytest_pred})
Step 12: Calculate Mean Squared Error for train_y and test_y
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(ytest, ytest_pred)
print(mse)
Output:
33.44897999767638
mse = mean_squared_error(ytrain, ytrain_pred)
print(mse)
Output:
19.32647020358573
Step 13: Plotting the linear regression model
plt.scatter(ytrain, ytrain_pred, c='blue', marker='o', label='Training data')
plt.scatter(ytest,ytest_pred ,c='lightgreen',marker='s',label='Test data')
plt.xlabel('True values')
plt.ylabel('Predicted')
plt.title("True value vs Predicted value")
plt.legend(loc= 'upper left')
#plt.hlines(y=0,xmin=0,xmax=50)
plt.plot()
plt.show()
Conclusion:
In this way we have done data analysis using linear regression for the Boston Dataset and
predicted the price of houses using the features of the Boston Dataset.
Assignment Question:
1) Compute SST, SSE, SSR, MSE, RMSE and R-Square for the example below.
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write Python code to calculate the R-Square for the Boston Dataset.
(Consider the linear regression model created in the practical session.)
Group A
Assignment No: 5
---------------------------------------------------------------------------------------------------------------
Logistic Regression can be used for various classification problems such as spam
detection, diabetes prediction, whether a given customer will purchase a particular product or
churn to a competitor, whether the user will click on a given advertisement
link or not, and many more.
Logistic Regression is one of the most simple and commonly used Machine Learning
algorithms for two-class classification. It is easy to implement and can be used as the
baseline for any binary classification problem. Its basic fundamental concepts are also
constructive in deep learning. Logistic regression describes and estimates the relationship
between one dependent binary variable and independent variables.
Logistic regression is a statistical method for predicting binary classes. The outcome or
target variable is dichotomous in nature. Dichotomous means there are only two possible
classes. For example, it can be used for cancer detection problems. It computes the
probability of an event occurring.
It is a special case of linear regression where the target variable is categorical in nature. It
uses a log of odds as the dependent variable. Logistic Regression predicts the probability
of occurrence of a binary event using a logit function.
Linear regression equation:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
Where y is the dependent variable and x1, x2, ..., xn are the explanatory variables.
3. Sigmoid Function
The sigmoid function, also called the logistic function, gives an ‘S’ shaped curve that can take any
real-valued number and map it into a value between 0 and 1. As the input goes to positive infinity,
the predicted y approaches 1, and as the input goes to negative infinity, the predicted y approaches 0.
If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES,
and if it is less than 0.5, we can classify it as 0 or NO. The output cannot lie outside the range
0 to 1. For example: if the output is 0.75, we can say in terms of probability that there is a
75 percent chance that a patient will suffer from cancer.
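The behaviour described above can be checked with a short sketch. The function names sigmoid and classify are hypothetical. The text does not say how an output of exactly 0.5 is classified, so treating it as 0 here is an assumption.

```python
import math

def sigmoid(z):
    """Logistic function: maps any moderate real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Return 1 (YES) if the probability is above the threshold, else 0 (NO).
    Assumption: a probability exactly equal to the threshold is classified as 0."""
    return 1 if probability > threshold else 0

for z in (-10, 0, 10):
    print(z, sigmoid(z), classify(sigmoid(z)))

print(sigmoid(0))      # 0.5
print(classify(0.75))  # 1
```

Large positive inputs give values close to 1 and large negative inputs give values close to 0, which produces the ‘S’ shape of the curve.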
4. Types of Logistic Regression
Binary Logistic Regression: The target variable has only two possible outcomes such as
Spam or Not Spam, Cancer or No Cancer.
Multinomial Logistic Regression: The target variable has three or more nominal
categories such as predicting the type of Wine.
Ordinal Logistic Regression: the target variable has three or more ordinal categories
such as restaurant or product rating from 1 to 5.
The following table shows the confusion matrix for a two-class classifier.
Here each row indicates the actual classes recorded in the test data set and each column indicates the
classes as predicted by the classifier.
Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal concerns
prediction errors.
● Number of positive (Pos) : Total number of instances which are labelled as positive in a given
dataset.
● Number of negative (Neg) : Total number of instances which are labelled as negative in a given
dataset.
● Number of True Positive (TP) : Number of instances which are actually labelled as positive
and the predicted class by classifier is also positive.
● Number of True Negative (TN) : Number of instances which are actually labelled as negative
and the predicted class by classifier is also negative.
● Number of False Positive (FP) : Number of instances which are actually labelled as negative
and the predicted class by classifier is positive.
● Number of False Negative (FN): Number of instances which are actually labelled as positive
and the class predicted by the classifier is negative.
● Accuracy: Accuracy is calculated as the number of correctly classified instances divided by the total
number of instances.
The ideal value of accuracy is 1, and the worst is 0. It is also calculated as the sum of true positive
and true negative (TP + TN) divided by the total number of instances.
$acc = \frac{TP + TN}{TP + FP + TN + FN} = \frac{TP + TN}{Pos + Neg}$
● Error Rate: Error Rate is calculated as the number of incorrectly classified instances divided
by the total number of instances.
The ideal value of error rate is 0, and the worst is 1. It is also calculated as the sum of false
positive and false negative (FP + FN) divided by the total number of instances.
$err = \frac{FP + FN}{TP + FP + TN + FN} = \frac{FP + FN}{Pos + Neg}$
Or
$err = 1 - acc$
● Precision: It is calculated as the number of correctly classified positive instances divided by the
total number of instances which are predicted positive. It is also called confidence value. The
ideal value is 1, whereas the worst is 0.
$precision = \frac{TP}{TP + FP}$
● Recall: It is calculated as the number of correctly classified positive instances divided by the
total number of positive instances. It is also called sensitivity or true positive rate. The ideal
value of sensitivity is 1, whereas the worst is 0.
$recall = \frac{TP}{TP + FN}$
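The four measures defined above follow directly from the counts in the confusion matrix. The sketch below uses a hypothetical function name, classification_metrics, and hypothetical counts that are not taken from any dataset in this manual.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, error rate, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts for a two-class classifier
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(name, round(value, 4))
# accuracy 0.85
# error_rate 0.15
# precision 0.8
# recall 0.8889
```

Accuracy and error rate always add up to 1, which gives a quick check on a hand calculation.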
Step 6: Predict the y_pred for all values of train_x and test_x
Conclusion:
In this way we have done data analysis using logistic regression for Social Media Adv. and
evaluated the performance of the model.
Value Addition: Visualising Confusion Matrix using Heatmap
Assignment Question:
1) Consider the binary classification task with two classes positive and negative.
Find out TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code for the preprocessing mentioned in step 4. and Explain every
step in detail.
Group A
Assignment No: 6
For example, P(A), P(B), P(C) are prior probabilities because while calculating P(A),
occurrences of event B or C are not concerned i.e. no information about occurrence of
any other event is used.
Conditional Probability
Here, we are predicting the probability of class1 and class2 based on the given condition. Writing
the same formula in terms of classes and features gives the following equation
$P(C_k \mid X) = \frac{P(X \mid C_k) \, P(C_k)}{P(X)}$
Now we have two classes and four features, so if we write this formula for class C1, it will be
$P(C_1 \mid X_1 \cap X_2 \cap X_3 \cap X_4) = \frac{P(X_1 \cap X_2 \cap X_3 \cap X_4 \mid C_1) \, P(C_1)}{P(X_1 \cap X_2 \cap X_3 \cap X_4)}$
Here, we replaced Ck with C1 and X with the intersection of X1, X2, X3, X4. The intersection is used
because we are considering the situation in which all these features are present at the same
time.
The Naive Bayes algorithm assumes that all the features are independent of each other or, in other
words, that all the features are unrelated. With that assumption, we can further simplify the above
formula to
$P(C_1 \mid X_1 \cap X_2 \cap X_3 \cap X_4) = \frac{P(X_1 \mid C_1) \, P(X_2 \mid C_1) \, P(X_3 \mid C_1) \, P(X_4 \mid C_1) \, P(C_1)}{P(X_1) \, P(X_2) \, P(X_3) \, P(X_4)}$
This is the final equation of the Naive Bayes and we have to calculate the probability of both C1
and C2.
P (No | Today) > P (Yes | Today) So, the prediction for whether golf would be played is ‘No’.
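The comparison of class scores can be reproduced with a small categorical Naive Bayes sketch. Everything in it is hypothetical: the function names train_naive_bayes and posterior_scores, the two features (outlook, windy) and the six observations. No smoothing is applied, so a feature value never seen with a class gives that class a score of zero. The common denominator is left out because it is the same for every class and does not change which class wins.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and, per class, the frequency of each feature value."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            feature_counts[(index, label)][value] += 1
    return class_counts, feature_counts

def posterior_scores(row, class_counts, feature_counts):
    """Return P(C) * product of P(x_i | C) for every class (unnormalised posterior)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(C)
        for index, value in enumerate(row):
            score *= feature_counts[(index, label)][value] / count  # likelihood P(x_i | C)
        scores[label] = score
    return scores

# Hypothetical observations: (outlook, windy) -> play
rows = [("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "Yes"),
        ("Overcast", "No"), ("Rainy", "No"), ("Overcast", "Yes")]
labels = ["Yes", "No", "No", "Yes", "Yes", "Yes"]

class_counts, feature_counts = train_naive_bayes(rows, labels)
scores = posterior_scores(("Sunny", "Yes"), class_counts, feature_counts)
prediction = max(scores, key=scores.get)
print(scores)      # 'Yes' scores 1/24, 'No' scores 1/6
print(prediction)  # No
```

The class with the larger score is the prediction, exactly as in the comparison of P (No | Today) and P (Yes | Today).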
Step 5: Use Naive Bayes algorithm( Train the Machine ) to Create Model
# import the class
from sklearn.naive_bayes import GaussianNB
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
Step 6: Predict the y_pred for all values of train_x and test_x
Y_pred = gaussian.predict(X_test)
Conclusion:
In this way we have done data analysis using Naive Bayes Algorithm for Iris dataset and
evaluated the performance of the model.
Value Addition: Visualising Confusion Matrix using Heatmap
Assignment Question:
1) Consider the observation for the car theft scenario having 3 attributes colour, Type and
origin.
Find the probability of car theft having scenarios Red SUV and Domestic.
2) Write python code for the preprocessing mentioned in step 4. and Explain every step in
detail.
Group A
Assignment No: 7
Text mining is also referred to as text analytics. Text mining is a process of exploring
sizable textual data and finding patterns. Text mining processes the text itself, while NLP
works with the underlying metadata. Finding frequency counts of words, the length of a
sentence, or the presence/absence of specific words is known as text mining. Natural language
processing is one of the components of text mining. NLP helps identify sentiment,
find entities in a sentence, and determine the category of a blog or article. Text mining
produces preprocessed data for text analytics. In Text Analytics, statistical and machine
learning algorithms are used to classify information.
2. Text Analysis Operations using natural language toolkit
● Sentence tokenization : split a paragraph into list of sentences using
sent_tokenize() method
● Word tokenization : split a sentence into list of words using word_tokenize()
method
Lemmatization Vs Stemming
A stemming algorithm works by cutting the suffix from the word. In a broader sense, it
cuts either the beginning or the end of the word.
On the contrary, lemmatization is a more powerful operation, and it takes into
consideration the morphological analysis of the words. It returns the lemma, which is
the base form of all its inflectional forms. In-depth linguistic knowledge is
required to create dictionaries and look for the proper form of the word.
Stemming is a general operation while lemmatization is an intelligent operation
where the proper form is looked up in the dictionary. Hence, lemmatization
helps in forming better machine learning features.
2.4. POS Tagging
POS (Parts of Speech) tags tell us about the grammatical information of the words of the sentence.
Department of Computer Engineering, SAE, Kondhwa Page 111
Data Science and Big Data Analytics Laboratory TE Computer Engineering (2022-23)
Example:
The initial step is to make a vocabulary of unique words and calculate TF for each
document. TF will be more for words that frequently appear in a document and
less for rare words in a document.
After applying TFIDF, text in A and B documents can be represented as a TFIDF vector of
dimension equal to the vocabulary words. The value corresponding to each word represents
the importance of that word in a particular document.
TFIDF is the product of TF with IDF. Since TF values lie between 0 and 1, not using ln can
result in high IDF for some words, thereby dominating the TFIDF. We don’t want that, and
therefore, we use ln so that the IDF should not completely dominate the TFIDF.
● Disadvantage of TFIDF
It is unable to capture the semantics. For example, funny and humorous are synonyms, but
TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if the
vocabulary is vast.
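The TF, IDF and TFIDF computation can be sketched in pure Python as follows. The function names compute_tf and compute_idf and the two short documents are hypothetical. TF is taken as the word count divided by the document length, and IDF as ln(N / number of documents containing the word), which is one common convention and is assumed here.

```python
import math

def compute_tf(words):
    """Term frequency: occurrences of each word divided by the document length."""
    total = len(words)
    return {word: words.count(word) / total for word in set(words)}

def compute_idf(tokenized_docs):
    """Inverse document frequency: ln(N / number of documents containing the word)."""
    n_docs = len(tokenized_docs)
    vocabulary = set(word for doc in tokenized_docs for word in doc)
    return {word: math.log(n_docs / sum(1 for doc in tokenized_docs if word in doc))
            for word in vocabulary}

# Hypothetical documents
docs = ["the cat sat on the mat", "the dog sat on the log"]
tokenized = [doc.split() for doc in docs]

idf = compute_idf(tokenized)
tfidf = [{word: tf * idf[word] for word, tf in compute_tf(doc).items()}
         for doc in tokenized]

print(tfidf[0]["the"])            # 0.0, because "the" occurs in every document
print(round(tfidf[0]["cat"], 4))  # 0.1155
```

Words that occur in every document get an IDF of ln(1) = 0, so they contribute nothing to the TFIDF vector, while words specific to one document are weighted up.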
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text must be
converted into vectors of numbers. In natural language processing, a common technique
for extracting features from text is to place all of the words that occur in the text in a
bucket. This approach is called a bag of words model or BoW for short. It’s referred to
as a “bag” of words because any information about the structure of the sentence is lost.
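A bag of words can be built with nothing more than a word counter, as in this minimal sketch. The function name bag_of_words and the two documents are hypothetical, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted(set(word for doc in tokenized for word in doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical documents
documents = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(documents)
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each vector records only how often each vocabulary word occurs, so two documents with the same words in a different order produce the same vector.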
Algorithm for Tokenization, POS Tagging, stop words removal, Stemming and
Lemmatization:
Step 1: Download the required packages
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
Step 2: Initialize the text
text = ("Tokenization is the first step in text analytics. The "
        "process of breaking down a text paragraph into smaller chunks "
        "such as words or sentences is called Tokenization.")
Step 3: Perform Tokenization
#Sentence Tokenization
from nltk.tokenize import sent_tokenize
tokenized_text= sent_tokenize(text)
print(tokenized_text)
#Word Tokenization
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)
for w in e_words:
    rootWord = ps.stem(w)
    print(rootWord)
idfDict = dict.fromkeys(documents[0].keys(), 0)
for document in documents:
    for word, val in document.items():
        if val > 0:
            idfDict[word] += 1
Conclusion:
In this way we have done text data analysis using TF IDF algorithm.
Assignment Question:
1) Perform Stemming for text = "studies studying cries cry". Compare
the results generated with Lemmatization. Comment on your answer how
Stemming and Lemmatization differ from each other.
2) Write Python code for removing stop words from the below documents, convert the
documents into lowercase and calculate the TF, IDF and TFIDF score for each
document.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'
Assignment No: 8
Prerequisite:
1. Basic of Python Programming
2. Seaborn Library, Concept of Data Visualization.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Downloading the Seaborn Library
2. The Dataset
3. Distributional Plots
4. Categorical Plots
---------------------------------------------------------------------------------------------------------------
Theory:
Seaborn is another extremely useful library for data visualization in Python. The Seaborn
library is built on top of Matplotlib and offers many advanced data visualization capabilities.
Although the Seaborn library can be used to draw a variety of charts such as matrix plots, grid plots
and regression plots, in this assignment it is used to draw distributional and categorical plots.
The Seaborn library needs to be installed before it can be used (for example, with pip install seaborn).
The Dataset
The Titanic dataset is used, which can be loaded directly through the Seaborn library. The load_dataset
function is used to load the dataset by passing it the name of the dataset.
import pandas as pd
import numpy as np
import seaborn as sns
dataset = sns.load_dataset('titanic')
dataset.head()
The script above loads the Titanic dataset and displays the first five rows of the dataset using the
head function. The output looks like this:
The dataset contains 891 rows and 15 columns and contains information about the passengers who
boarded the unfortunate Titanic ship. The original task is to predict whether or not the passenger
survived depending upon different features such as their age, ticket, cabin they boarded, the class of
the ticket, etc. Seaborn library is used to find any patterns in the data.
Distributional Plots
Distributional plots, as the name suggests, are plots that show the statistical distribution of
data.
The distplot() shows the histogram distribution of data for a single column. The column is
passed as a parameter to the distplot() function. In recent Seaborn releases distplot() is deprecated
in favour of histplot() and displot(). To check how the price of the ticket for each
passenger is distributed, execute the following script:
sns.distplot(dataset['fare'])
Output:
Passing kde=False removes the line for the kernel density estimation from the plot. To see more or
less detail in the graph, also pass a value for the bins parameter, as in the following script:
sns.distplot(dataset['fare'], kde=False, bins=10)
By setting the number of bins to 10, the data is distributed into 10 bins as shown in the following output:
Output:
The output shows that for more than 700 passengers the ticket price is between 0 and 50.
The jointplot() is used to display the mutual distribution of two columns. Three parameters need to
be passed to jointplot(). The first parameter is the name of the column whose distribution is
displayed on the x-axis. The second parameter is the name of the column whose distribution is
displayed on the y-axis. Finally, the third parameter is the name of the data frame. Plot a joint plot
of the age and fare columns to see if there is any relationship between the two.
sns.jointplot(x='age', y='fare', data=dataset)
Output:
From the output, a joint plot has three parts: a distribution plot at the top for the column on the
x-axis, a distribution plot on the right for the column on the y-axis, and a scatter plot in between that
shows the mutual distribution of data for both the columns. There is no correlation observed
between the ages and the fares.
The type of the joint plot can be changed by passing a value for the kind parameter. The distribution of
data can be displayed in the form of a hexagonal plot by passing the value hex for the kind
parameter.
sns.jointplot(x='age', y='fare', data=dataset, kind='hex')
Output:
In the hexagonal plot, the hexagon with the largest number of points gets the darkest color. The
hexagonal plot shows that most of the passengers are between age 20 and 30 and that most of them paid
between 10 and 50 for their tickets.
The pairplot() is a type of distribution plot that basically plots a joint plot for all the possible
combinations of numeric and Boolean columns in the dataset. The dataset needs to be passed as
the parameter to the pairplot() function as shown below:
sns.pairplot(dataset)
Note: Before executing the script above, remove all null values from the dataset using the following
command:
dataset = dataset.dropna()
The output of the pair plot shows joint plots for all the numeric and Boolean columns
in the Titanic dataset.
To add information from a categorical column to the pair plot, pass the name of the categorical
column to the hue parameter. For instance, to plot the gender information on the pair
plot, execute the following script:
sns.pairplot(dataset, hue='sex')
Output:
In the output, the information about the males is shown in orange and the information about the
females in blue (as shown in the legend). From the joint plot on the top left, it is clear that among the
surviving passengers, the majority were female.
The rugplot() is used to draw small bars along x-axis for each point in the dataset. To plot a rug plot,
pass the name of the column. Plot a rug plot for fare.
sns.rugplot(dataset['fare'])
Output:
From the output, it is clear that, as was the case with the distplot(), most of the instances for the fares
have values between 0 and 100.
These are some of the most commonly used distribution plots offered by Python's Seaborn
library. Some of the categorical plots in the Seaborn library are as follows.
Categorical Plots
Categorical plots, as the name suggests, are normally used to plot categorical data. The categorical
plots plot the values in the categorical column against another categorical column or a numeric
column. The most commonly used categorical plots are as follows:
The barplot() is used to display the mean value for each value in a categorical column, against a
numeric column. The first parameter is the categorical column, the second parameter is the numeric
column while the third parameter is the dataset. To find the mean value of the age of the male and
female passengers, use the bar plot as follows.
sns.barplot(x='sex', y='age', data=dataset)
Output:
From the output, the average age of male passengers is just less than 40 while the average age of
female passengers is around 33.
In addition to the average, the bar plot can also be used to calculate other aggregate values for each
category. To do so, pass the aggregate function to the estimator parameter. To calculate the standard
deviation for the age of each gender, use the following script:
import numpy as np
sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)
Notice that, in the above script, the std aggregate function from the numpy library is used to calculate
the standard deviation for the ages of male and female passengers.
Output:
The count plot is similar to the bar plot; it displays the count of the categories in a specific column.
To count the number of male and female passengers, use the count plot as follows:
sns.countplot(x='sex', data=dataset)
Output:
The box plot is used to display the distribution of the data in each category in the form of quartiles.
The center of the box shows the median value. The range from the lower whisker to the bottom of the
box shows the first quartile. From the bottom of the box to the middle of the box lies the second
quartile. From the middle of the box to the top of the box lies the third quartile and finally from the
top of the box to the top whisker lies the last quartile.
To plot a box plot that displays the distribution of the age with respect to each gender, pass the
categorical column as the first parameter (which is sex) and the numeric column (age) as the second
parameter. The dataset is passed as the third parameter.
sns.boxplot(x='sex', y='age', data=dataset)
Output:
From the above plot, the first quartile starts at around 5 and ends at 22, which means that 25% of the
passengers are aged between 5 and 22. The second quartile starts at around 23 and ends at around 32,
which means that 25% of the passengers are aged between 23 and 32. Similarly, the third quartile
starts and ends between 34 and 42, hence 25% of the passengers are aged within this range, and finally the
fourth or last quartile starts at 43 and ends around 65.
Any passengers whose ages do not belong to any of the quartiles are called
outliers and are represented by dots on the box plot.
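The quantities that a box plot draws can be computed by hand, as in the sketch below. The ages are hypothetical and the function name box_plot_summary is an assumption. Outliers are identified with the common 1.5 × IQR rule, which is the default whisker setting in Matplotlib-based box plots. Quartiles are computed with linear interpolation (method='inclusive' in Python's statistics module).

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles, the outlier fences and the outliers of a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, q2, q3, lower_fence, upper_fence, outliers

# Hypothetical passenger ages
ages = [2, 18, 22, 25, 28, 30, 35, 40, 45, 80]
q1, q2, q3, lower_fence, upper_fence, outliers = box_plot_summary(ages)
print(q1, q2, q3)                # 22.75 29.0 38.75
print(lower_fence, upper_fence)  # -1.25 62.75
print(outliers)                  # [80]
```

The box spans q1 to q3 with a line at the median q2, the whiskers extend to the most extreme values inside the fences, and values outside the fences are drawn as individual dots.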
To see the box plots for the age of passengers of both genders, along with the information about
whether or not they survived, pass survived as the value to the hue parameter as shown below:
sns.boxplot(x='sex', y='age', data=dataset, hue='survived')
Output:
In addition to the information about the age of each gender, the distribution of the passengers who
survived is also displayed. For instance, it is seen that among the male passengers, on average, more
younger people survived than older ones. Similarly, it is observed that the variation in the age of
female passengers who did not survive is much greater than the variation in the age of the
surviving female passengers.
The violin plot is similar to the box plot; however, the violin plot shows the shape of the whole
distribution (a density estimate) rather than only the quartiles. The violinplot() function is used
to plot the violin plot. Like the box plot, the first parameter is the categorical column, the second
parameter is the numeric column, and the third parameter is the dataset.
Let's plot a violin plot that displays the distribution for the age with respect to each gender.
Output:
You can see from the figure above that violin plots provide much more information about the data
than the box plot. Instead of plotting only the quartiles, the violin plot shows how densely the data
is spread at every age. The area where the violin plot is thicker has a
higher number of instances for the age. For instance, from the violin plot for males, it is clearly
evident that the number of passengers with age between 20 and 40 is higher than in all the rest of the
age brackets.
Like box plots, you can also add another categorical variable to the violin plot using the hue
parameter as shown below:
Now you can find a lot of information on the violin plot. For instance, if you look at the bottom of the
violin plot for the males who survived (left-orange), you can see that it is thicker than the bottom of
the violin plot for the males who didn't survive (left-blue). This means that the number of young
male passengers who survived is greater than the number of young male passengers who did not
survive. Violin plots convey a lot of information; on the downside, it takes a bit of
time and effort to understand them.
Instead of plotting two different graphs for the passengers who survived and those who did not, you
can have one violin plot divided into two halves, where one half represents surviving while the other
half represents the non-surviving passengers. To do so, you need to pass True as value for the split
parameter of the violinplot() function. Let's see how we can do this:
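A minimal sketch of the split violin plot follows. The small DataFrame is a hypothetical stand-in
for the Titanic data (its values are invented), and it assumes survived has exactly two levels,
which split=True requires.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Titanic data (values invented for illustration).
dataset = pd.DataFrame({
    "sex": ["male"] * 6 + ["female"] * 6,
    "age": [4.0, 22.0, 28.0, 35.0, 54.0, 65.0, 2.0, 14.0, 27.0, 38.0, 45.0, 58.0],
    "survived": [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0],
})

# hue adds the second categorical variable; split=True draws the two hue
# levels as the two halves of a single violin per category.
ax = sns.violinplot(x="sex", y="age", hue="survived", data=dataset, split=True)
```

Leaving out split=True draws two separate violins side by side for each gender instead.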
Now the comparison between the ages of the passengers who survived and those who did not can be
clearly observed for both males and females.
Both violin and box plots can be extremely useful. However, as a rule of thumb if you are
presenting your data to a non-technical audience, box plots should be preferred since they are easy
to comprehend. On the other hand, if you are presenting your results to the research community it is
more convenient to use violin plot to save space and to convey more information in less time.
The strip plot draws a scatter plot where one of the variables is categorical. We have seen scatter
plots in the joint plot and the pair plot sections where we had two numeric variables. The strip plot
is different in a way that one of the variables is categorical in this case, and for each category in the
categorical variable, you will see scatter plot with respect to the numeric column.
The stripplot() function is used to plot the strip plot. Like the box plot, the first parameter is the
categorical column, the second parameter is the numeric column, and the third parameter is the
dataset. Look at the following script:
Output:
You can see the scattered plots of age for both males and females. The data points look like strips. It
is difficult to comprehend the distribution of data in this form. To better comprehend the data, pass
True for the jitter parameter which adds some random noise to the data. Look at the following
script:
Output:
Now a better view of the distribution of age across the genders can be observed.
Like violin and box plots, you can add an additional categorical column to strip plot using hue
parameter as shown below:
Again you can see there are more points for the males who survived near the bottom of the plot
compared to those who did not survive.
Like violin plots, we can also split the strip plots. Execute the following script:
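The three strip plot variants described above (jitter, hue, and splitting by hue) can be sketched
as follows. The DataFrame is a hypothetical stand-in for the Titanic data. Note two assumptions
about the seaborn version: in current releases jitter is already enabled by default, and the
option that separates the hue levels is named dodge (older tutorials call it split).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Titanic data (values invented for illustration).
dataset = pd.DataFrame({
    "sex": ["male"] * 6 + ["female"] * 6,
    "age": [4.0, 22.0, 28.0, 35.0, 54.0, 65.0, 2.0, 14.0, 27.0, 38.0, 45.0, 58.0],
    "survived": [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)

# 1) Random horizontal noise so that overlapping points become visible.
sns.stripplot(x="sex", y="age", data=dataset, jitter=True, ax=axes[0])

# 2) Colour the points by a second categorical column.
sns.stripplot(x="sex", y="age", hue="survived", data=dataset, jitter=True, ax=axes[1])

# 3) Draw each hue level in its own strip within every category.
sns.stripplot(x="sex", y="age", hue="survived", data=dataset,
              jitter=True, dodge=True, ax=axes[2])
```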
Output:
Now you can clearly see the difference in the distribution for the age of both male and female
passengers who survived and those who did not survive.
The swarm plot is a combination of the strip and the violin plots. In the swarm plots, the points are
adjusted in such a way that they don't overlap. Let's plot a swarm plot for the distribution of age
against gender. The swarmplot() function is used to plot the swarm plot. Like the box plot, the first
parameter is the categorical column, the second parameter is the numeric column, and the third
parameter is the dataset. Look at the following script:
You can clearly see that the above plot contains scattered data points like the strip plot and the data
points are not overlapping. Rather they are arranged to give a view similar to that of a violin plot.
Let's add another categorical column to the swarm plot using the hue parameter.
Output:
From the output, it is evident that the ratio of surviving males is less than the ratio of surviving
females, since the male plot has more blue points and fewer orange points. On the other
hand, for females, there are more orange points (surviving) than blue points (not surviving).
Another observation is that amongst males of age less than 10, more passengers survived than
did not.
We can also split swarm plots as we did in the case of strip and violin plots. Execute the following
script to do so:
Output:
Now you can clearly see that more women survived, as compared to men.
Swarm plots are not recommended if you have a huge dataset since they do not scale well because
they have to plot each data point. If you really like swarm plots, a better way is to combine two
plots. For instance, to combine a violin plot with swarm plot, you need to execute the following
script:
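A minimal sketch of the combined plot follows, assuming a hypothetical stand-in DataFrame for the
Titanic data. The swarm plot is drawn on the same axes as the violin plot, in a single colour so
that the points stay visible against the violins.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Titanic data (values invented for illustration).
dataset = pd.DataFrame({
    "sex": ["male"] * 6 + ["female"] * 6,
    "age": [4.0, 22.0, 28.0, 35.0, 54.0, 65.0, 2.0, 14.0, 27.0, 38.0, 45.0, 58.0],
})

# Draw the violins first, then overlay the individual points on the same axes.
ax = sns.violinplot(x="sex", y="age", data=dataset)
sns.swarmplot(x="sex", y="age", data=dataset, color="black", ax=ax)
```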
Output:
Conclusion:
Seaborn is an advanced data visualization library built on top of the Matplotlib library. In this
assignment, we have explored distributional and categorical plots using the Seaborn library.
Assignment Question:
2. Explain when you will use distribution plots and when you will use categorical plots.
3. Write the conclusion from the following swarm plot (consider titanic dataset).
4. Which parameter is used to add another categorical variable to the violin plot? Explain with
syntax and example.
Group A
Assignment No: 9
Prerequisite:
1. Basics of Python Programming
2. Seaborn Library, Concept of Data Visualization.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Exploratory Data Analysis
2. Univariate Analysis
---------------------------------------------------------------------------------------------------------------
There are various techniques to understand the data. The basic prerequisites are knowledge of
NumPy for mathematical operations and Pandas for data manipulation. The Titanic dataset is used here.
For demonstrating some of the techniques, we also use the tips dataset that is built into seaborn,
which records the tips each waiter gets from different customers.
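import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Titanic dataset (CSV file in the working directory)
data = pd.read_csv("titanic_train.csv")
# tips dataset (bundled with seaborn)
tips = sns.load_dataset("tips")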
Univariate Analysis
Univariate analysis is the simplest form of analysis where we explore a single variable.
Univariate analysis is performed to describe the data in a better way. We perform univariate
analysis of numerical and categorical variables differently because each needs different plots.
Categorical Data:
A variable that holds text-based information is referred to as a categorical variable. The following
are various plots which we can use for visualizing categorical data.
1) CountPlot:
Countplot is basically a frequency count plotted as a bar graph. It plots the count of
each category as a separate bar, so it is the visual form of what the pandas value_counts()
function returns for a column. In our data the target variable is Survived, and it is
categorical, so we plot a countplot of it.
sns.countplot(x='Survived', data=data)
plt.show()
OUTPUT:
2) Pie Chart:
The pie chart shows the same information as the countplot, and additionally gives the
percentage share of each category, that is, how much weight each category carries in the
data. Now we check the Sex column to see what percentage of the passengers are male and
female.
data['Sex'].value_counts().plot(kind="pie", autopct="%.2f")
plt.show()
OUTPUT:
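The numbers behind the count plot and the pie chart can be checked directly with value_counts().
The following is a minimal sketch, assuming a short hypothetical Sex column rather than the real
Titanic file.

```python
import pandas as pd

# Hypothetical sample of the Sex column (values invented for illustration).
sex = pd.Series(
    ["male", "female", "male", "male", "female", "male", "male", "female"],
    name="Sex",
)

counts = sex.value_counts()                        # bar heights of the count plot
percent = sex.value_counts(normalize=True) * 100   # slice percentages of the pie chart

print(counts)
print(percent.round(2))
```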
Numerical Data:
1) Histogram:
plt.hist(data['Age'].dropna(), bins=5)  # drop missing ages before binning
plt.show()
OUTPUT:
2) Distplot:
Distplot is also known as the second histogram because it is a slightly improved version of
the histogram. Distplot draws a KDE (Kernel Density Estimation) curve over the histogram. The
curve estimates the PDF (Probability Density Function), which shows how likely values are to
fall around each point of this column. In recent seaborn releases distplot is deprecated;
sns.histplot(data['Age'], kde=True) draws an equivalent figure.
sns.distplot(data['Age'])
plt.show()
OUTPUT:
3) Boxplot:
Boxplot is a very useful plot that draws the 5 number summary of a variable (minimum, first
quartile, median, third quartile and maximum). To read the 5 number summary we need to describe
some terms.
• Percentile – The value below which a given percentage of the observations fall. For
example, if the 25th percentile is 50, then 25% of the values lie below 50.
• Minimum and Maximum – These are not the smallest and largest values in the data; rather, they
are the lower and upper boundaries beyond which a point is treated as an outlier. They are
calculated using the interquartile range (IQR):
IQR = Q3 - Q1
Lower boundary = Q1 - 1.5 * IQR
Upper boundary = Q3 + 1.5 * IQR
Here Q1 and Q3 are the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
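The quartiles, the IQR and the outlier boundaries can be computed by hand as a check on what the
boxplot draws. This is a minimal sketch using NumPy on a hypothetical list of ages; the 1.5 * IQR
rule is the common convention and is also the default used by matplotlib and seaborn box plots.

```python
import numpy as np

# Hypothetical ages (values invented for illustration).
ages = np.array([2, 14, 22, 25, 27, 30, 35, 38, 54, 80], dtype=float)

q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Points outside these boundaries are drawn as individual outlier dots.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = ages[(ages < lower_bound) | (ages > upper_bound)]

print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)
print("Boundaries:", lower_bound, upper_bound)
print("Outliers:", outliers)
```

On the plot itself, each whisker ends at the most extreme data point that still lies inside these
boundaries, so the whisker tips are actual observations.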
We have studied various plots to explore a single categorical or numerical variable. Bivariate
Analysis is used when we have to explore the relationship between 2 different variables. We
have to do this because, in the end, our main task is to explore the relationship between
variables to build a powerful model. When we analyze more than 2 variables together,
it is known as Multivariate Analysis. We will work on different plots for Bivariate as well as
Multivariate Analysis.
1) Scatter Plot:
A scatter plot is the simplest way to plot the relationship between two numerical variables. Let
us see the relationship between the total bill and the tip provided using a scatter plot.
sns.scatterplot(x="total_bill", y="tip", data=tips)
We can also plot 3-variable or 4-variable relationships with a scatter plot. Suppose we want to
see male and female customers separately in the total bill versus tip plot; we pass the sex column
to the hue argument (hue="sex").
plt.show()
OUTPUT:
We can also perform a 4-variable multivariate analysis with a scatter plot using the style argument.
Suppose along with gender we also want to know whether the customer was a smoker or not; we
pass the smoker column to the style argument.
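A minimal sketch of the four-variable scatter plot follows. The DataFrame is a hypothetical
stand-in with the same column names as seaborn's tips dataset, so that the example runs without
downloading anything; with the real data, replace it with tips = sns.load_dataset("tips").

```python
import pandas as pd
import seaborn as sns

# Hypothetical rows shaped like the tips dataset (values for illustration only).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male", "Male", "Male"],
    "smoker": ["No", "No", "No", "Yes", "No", "Yes", "No", "Yes"],
})

# x and y carry the two numeric variables, hue colours the points by sex,
# and style changes the marker shape by smoker.
ax = sns.scatterplot(x="total_bill", y="tip", hue="sex", style="smoker", data=tips)
```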
plt.show()
OUTPUT:
If one variable is numerical and one is categorical then there are various plots that we can use
for Bivariate and Multivariate analysis.
1) Bar Plot:
Bar plot is a simple plot which we can use to plot a categorical variable on the x-axis and a
numerical variable on the y-axis and explore the relationship between both variables. By default the
height of each bar is the mean of the numerical variable, and the black tip on top of each bar shows
the confidence interval. Let us explore Pclass with age.
sns.barplot(x='Pclass', y='Age', data=data)
plt.show()
OUTPUT:
The hue argument is very useful because it helps to analyze more than 2 variables. Now, along with
the above relationship, we want to see the gender, so we pass hue='Sex'.
plt.show()
OUTPUT:
2) Boxplot:
We have already studied boxplots in the Univariate analysis above. We can draw a separate
boxplot for each category of a variable. Let us explore gender with age using a boxplot.
sns.boxplot(x='Sex', y='Age', data=data)
OUTPUT:
Along with age and gender, let’s see who has survived and who has not by passing hue='Survived'.
plt.show()
OUTPUT:
3) Distplot:
Distplot estimates the PDF using kernel density estimation. Distplot does not have a
hue parameter, but we can get the same effect by drawing one curve per group on the same axes.
Suppose we want to compare the age distribution of the passengers who survived with that of the
passengers who died, and find the age ranges in which survival was more likely.
plt.show()
OUTPUT:
In the above graph, the blue curve shows the age distribution of the passengers who died and the
orange curve shows that of the survivors. If we observe it, we can see that for children the survival
density is higher than the death density, while the opposite holds for older people. A small analysis
like this can sometimes reveal important facts about the data, and it helps while preparing data stories.
1) Heatmap:
If you have ever used the crosstab function of pandas, a heatmap is the visual
representation of the same table. It shows how often each category of one variable
occurs together with each category of another variable in the dataset. Let us look first at the
crosstab and then at the heatmap.
pd.crosstab(data['Pclass'], data['Survived'])
Now with a heatmap we can see how many people survived and died in each passenger class.
sns.heatmap(pd.crosstab(data['Pclass'], data['Survived']))
2) Cluster map:
We can also use a cluster map to understand the relationship between two categorical variables.
A cluster map basically plots a dendrogram that shows the categories of similar behavior
together.
sns.clustermap(pd.crosstab(data['Parch'], data['Survived']))
plt.show()
OUTPUT:
Conclusion-
In this way we have explored univariate, bivariate and multivariate analysis
using the plotting functions of Seaborn and Matplotlib on the Titanic and tips
datasets.
Group A
Assignment No: 10
Aim: Download the Iris flower dataset or any other dataset into a DataFrame (e.g.,
https://archive.ics.uci.edu/ml/datasets/Iris). Use Python/R and perform the following:
How many features are there and what are their types (e.g., numeric, nominal)?
Compute and display summary statistics for each feature available in the dataset (e.g.,
minimum value, maximum value, mean, range, standard deviation, variance and
percentiles).
Data Visualization: Create a histogram for each feature in the dataset to illustrate the
feature distributions. Plot each histogram.
Create a boxplot for each feature in the dataset. All of the boxplots should be combined
into a single plot. Compare distributions and identify outliers.
Objectives: To learn how to display summary statistics for each feature available
in the dataset.
Load a dataset into a dataframe. Implement the following operations:
1. Display data set details.
2. Calculate min, max, mean, range, standard deviation, variance.
3. Create histogram using hist function.
4. Create boxplot using boxplot function.
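The four operations listed above can be sketched with pandas as follows. The DataFrame is a
hypothetical six-row miniature with Iris-like column names (the real dataset has 150 rows); with
the real file, load it with pd.read_csv instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature of the Iris data (values for illustration only).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

# 1. Data set details: number of features and their types.
print(df.shape)
print(df.dtypes)

# 2. Summary statistics for the numeric features.
numeric = df.select_dtypes(include="number")
summary = numeric.describe()          # count, mean, std, min, percentiles, max
summary.loc["range"] = numeric.max() - numeric.min()
summary.loc["variance"] = numeric.var()
print(summary)

# 3. One histogram per numeric feature.
hist_axes = numeric.hist(bins=5)

# 4. All boxplots combined into a single plot.
fig, box_ax = plt.subplots()
numeric.boxplot(ax=box_ax)
```

Calling plt.show() afterwards displays both figures.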
Theory:
How to Find the Mean, Median, Mode, Range, and Standard Deviation
Simplify comparisons of sets of numbers, especially large sets of numbers, by calculating the
center values using mean, mode and median. Use the ranges and standard deviations of the sets
to examine the variability of data.
Calculating Mean
The mean identifies the average value of the set of numbers. For example, consider the data
set containing the values 20, 24, 25, 36, 25, 22, 23.
Formula
To find the mean, use the formula: Mean equals the sum of the numbers in the data set
divided by the number of values in the data set. In mathematical terms: Mean=(sum of all
terms)÷(how many terms or values in the set).
Finding Divisor
Divide by the number of data points in the set. This set has seven values so divide by 7.
Finding Mean
Insert the values into the formula to calculate the mean. The mean equals the sum of the values
(175) divided by the number of data points (7). Since 175÷7=25, the mean of this data set equals
25. Not all mean values will equal a whole number.
Calculating Range
Range shows the mathematical distance between the lowest and highest values in the data set.
Range measures the variability of the data set. A wide range indicates greater variability in the
data, or perhaps a single outlier far from the rest of the data. Outliers may skew, or shift, the
mean value enough to impact data analysis.
In the sample group, the lowest value is 20 and the highest value is 36.
Calculating Range
To calculate range, subtract the lowest value from the highest value. Since 36-20=16, the
range equals 16.
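The mean and range worked out above can be checked with a few lines of Python. This is a small
sketch using only the standard library and the sample values from the text.

```python
from statistics import mean

values = [20, 24, 25, 36, 25, 22, 23]

total = sum(values)                       # sum of all terms
average = mean(values)                    # total divided by the number of values
value_range = max(values) - min(values)   # highest value minus lowest value

print("Sum:", total)
print("Mean:", average)
print("Range:", value_range)
```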
Standard deviation measures the variability of the data set. Like range, a
smaller standard deviation indicates less variability.
Formula
Finding standard deviation requires squaring the difference between each data point and
the mean, adding all the squares [∑(x-µ)²], dividing that sum by one less than the number of
values (N-1), and finally calculating the square root of the quotient.
Department of Computer Engineering, SAE, Kondhwa Page 148
Data Science and Big Data Analytics Laboratory TE Computer Engineering (2022-23)
Calculate the mean by adding all the data point values, then dividing by the number of data
points. In the sample data set, 20+24+25+36+25+22+23=175. Divide the sum, 175, by the
number of data points, 7, or 175÷7=25. The mean equals 25.
Standard Deviation
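The differences from the mean are -5, -1, 0, 11, 0, -3 and -2, so the squared differences are
25, 1, 0, 121, 0, 9 and 4, which sum to 160. Dividing by N-1 gives 160÷6=26.6667. Calculate the
standard deviation by finding the square root of this quotient. In the
example, the square root of 26.6667 equals approximately 5.164. Therefore, the standard
deviation equals approximately 5.164.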
Standard deviation helps evaluate data. Numbers in the data set that fall within one standard
deviation of the mean are typical values of the data set. Numbers that fall more than two standard
deviations from the mean are extreme values or outliers. In the example set, the value 36 lies more
than two standard deviations from the mean (36-25=11, which is greater than 2×5.164=10.328), so 36 is
an outlier. Outliers may represent erroneous data or
may suggest unforeseen circumstances and should be carefully considered when interpreting
data.
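The standard deviation steps and the two-standard-deviation outlier check can be reproduced as
follows. This is a small sketch using only the standard library and the sample values from the
text; the two-standard-deviation threshold is the rule of thumb stated above, not a universal
definition of an outlier.

```python
from statistics import mean, stdev

values = [20, 24, 25, 36, 25, 22, 23]

mu = mean(values)
squared_diffs = [(x - mu) ** 2 for x in values]             # (x - mean)^2 for each value
sample_variance = sum(squared_diffs) / (len(values) - 1)    # divide by N - 1
sample_std = sample_variance ** 0.5                         # square root of the quotient

# Values lying more than two standard deviations away from the mean.
outliers = [x for x in values if abs(x - mu) > 2 * sample_std]

print("Sum of squared differences:", sum(squared_diffs))
print("Variance:", round(sample_variance, 4))
print("Standard deviation:", round(sample_std, 3))
print("Library check:", round(stdev(values), 3))
print("Outliers:", outliers)
```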
Application:
1. The histogram is suitable for visualizing distribution of numerical data over a continuous
interval, or a certain time period. The histogram organizes large amounts of data, and produces
visualization quickly, using a single dimension.
2. The box plot allows quick graphical examination of one or more data sets. Box plots may
seem more primitive than a histogram but they do have some advantages. They take up less
space and are therefore particularly useful for comparing distributions between several groups or
sets of data. In contrast, the choice of the number and width of bins can heavily influence the
appearance of a histogram, and the choice of bandwidth can heavily influence the appearance of a
kernel density estimate.
3. Data Visualization Application lets you quickly create insightful data visualizations, in
minutes.
Data visualization tools allow anyone to organize and present information intuitively. They
enable users to share data visualizations with others.
Input:
Output:
Conclusion:
Hence, we have studied how to load a dataset into a dataframe, compare distributions and identify
outliers.
Questions:
5. What is a dataset?