DSBDA Manual
LABORATORY MANUAL
AY: 2022-23
TE Computer Engineering
Semester –II
Subject Code – (310251)
Prepared By:
Objective of the Assignment: Students should be able to perform data wrangling operations
using Python on any open source dataset.
Prerequisite:
1. Basics of Python Programming
2. Concept of Data Preprocessing, Data Formatting, Data Normalization and Data Cleaning.
---------------------------------------------------------------------------------------------------------------
1. Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are
similar to table rows, but the columns can contain not only strings or numbers, but also
nested data structures such as lists, maps, and other records.
Instance: A single row of data is called an instance. It is an observation from the domain.
Feature: A single column of data is called a feature. It is a component of an observation
and is also called an attribute of a data instance. Some features may be inputs to a model
(the predictors), while another is the output (the target the model predicts).
3. Description of Dataset:
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple
Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning
Repository.
It includes three iris species with 50 samples each as well as some properties about each
flower. One flower species is linearly separable from the other two, but the other two are not
linearly separable from each other.
Total Samples: 150
The columns in this dataset are:
1. Id
2. SepalLengthCm
3. SepalWidthCm
4. PetalLengthCm
5. PetalWidthCm
6. Species
1. Dataframe Operations:
● dataset.tail(n=5): returns the last n rows.
● dataset.iloc[:m, :n]: returns a subset of the first m rows and the first n columns.
The data must be cleaned and prepared so that it can be modelled effectively, and there are several techniques for this process.
a. Data Formatting: Ensuring all data formats are correct (e.g. object, text, floating
point number, integer, etc.) is another part of this initial ‘cleaning’ process. If you are
working with dates in Pandas, they also need to be stored in the correct format so that
the special date-time functions can be used.
b. Data Normalization: Data normalization involves mapping all the numerical values
onto a uniform scale (e.g. from 0 to 1). Making the ranges consistent
across variables helps with statistical analysis and ensures better comparisons
later on. It is also known as Min-Max scaling.
Algorithm:
Step 1: Import the pandas library and the sklearn preprocessing module
import pandas as pd
from sklearn import preprocessing
Step 2: Load the Iris dataset into a dataframe object df
Step 3: Print the Iris dataset
df.head()
Step 4: Create a min-max scaler (MinMaxScaler) object
min_max_scaler = preprocessing.MinMaxScaler()
Step 5: Separate the features from the class label
x = df.iloc[:, :4]
Step 6: Transform the data to fit the min-max scaler
x_scaled = min_max_scaler.fit_transform(x)
Step 7: Run the normalizer on the dataframe
df_normalized = pd.DataFrame(x_scaled)
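The steps above can be combined into one runnable sketch. This is only a minimal illustration, assuming the Kaggle Iris.csv file (with the Id column listed earlier) is present in the working directory:
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv("Iris.csv")                      # Step 2: load the dataset
df = df.drop(columns=['Id'], errors='ignore')     # keep only the four features and the label
print(df.head())                                  # Step 3: inspect the first rows

min_max_scaler = preprocessing.MinMaxScaler()     # Step 4: create the Min-Max scaler
x = df.iloc[:, :4]                                # Step 5: separate the features from the class label
x_scaled = min_max_scaler.fit_transform(x)        # Step 6: scale every feature into [0, 1]

df_normalized = pd.DataFrame(x_scaled, columns=x.columns)   # Step 7: back to a dataframe
print(df_normalized.head())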
There are many ways to convert categorical data into numerical data. Here the three most used
methods are discussed.
1. Label Encoding: Label Encoding refers to converting the labels into a numeric
form so as to convert them into the machine-readable form. It is an important
preprocessing step for the structured dataset in supervised learning.
2. One-Hot Encoding: In one-hot encoding, we create a new set of dummy (binary) variables
that is equal to the number of categories (k) in the variable. For example, let’s say we have a
categorical variable Color with three categories called “Red”, “Green” and “Blue”, we need to
use three dummy variables to encode this variable using one-hot encoding.
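As an illustration (a small sketch, not part of the original steps), both encodings can be produced with pandas and scikit-learn on the hypothetical Color variable described above:
import pandas as pd
from sklearn import preprocessing

colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# Label Encoding: each category becomes one integer (here Blue=0, Green=1, Red=2)
label_encoder = preprocessing.LabelEncoder()
colors['Color_label'] = label_encoder.fit_transform(colors['Color'])

# One-Hot Encoding: one dummy (0/1) column per category
one_hot = pd.get_dummies(colors['Color'], prefix='Color')
print(colors.join(one_hot))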
Conclusion- In this way we have explored the functions of the Python libraries for data
preprocessing and data wrangling techniques, and handled missing values in the Iris dataset.
Viva Questions
1. Explain Data Frame with Suitable example.
2. What is the limitation of the label encoding method?
3. What is the need of data normalization?
4. What are the different Techniques for Handling the Missing Data?
Group A
Assignment No: 2
Title of the Assignment: Data Wrangling, II
Create an “Academic performance” dataset of students and perform the following
operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing
values and/or inconsistencies, use any of the suitable techniques to deal with
them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for
better understanding of the variable, to convert a non-linear relation into a linear
one, or to decrease the skewness and convert the distribution into a normal
distribution.
Reason and document your approach properly.
Prerequisite:
1. Basics of Python Programming
2. Concept of Data Preprocessing, Data Formatting, Data Normalization and Data
Cleaning.
Theory:
1. Creation of Dataset using Microsoft Excel.
The dataset is created in “CSV” format.
● The name of dataset is Students Performance
● The features of the dataset are: Math_Score, Reading_Score, Writing_Score,
Placement_Score, Club_Join_Date .
● Number of Instances: 10
● The response variable is: Placement_Offer_Count .
● Range of Values:
Math_Score [60-80], Reading_Score [75-95], Writing_Score [60-80],
Placement_Score [75-100], Club_Join_Date [2018-2021].
1. None: None is a Python singleton object that is often used for missing data in
Python code.
2. NaN : NaN (an acronym for Not a Number), is a special floating-point value
recognized by all systems that use the standard IEEE floating-point
representation.
To handle these conventions, Pandas provides several useful functions for detecting,
removing, and replacing null values in a DataFrame:
● isnull()
● notnull()
● dropna()
● fillna()
● replace()
1. Checking for missing values using isnull() and notnull()
2. Filling null values using fillna() and replace()
In order to fill null values in a dataset, the fillna() and replace() functions are used.
These functions replace NaN values with some value of their own. All these
functions help in filling null values in the datasets of a DataFrame.
● Filling null values with a single value
ndf = df.copy()
ndf.fillna(0)
The following line will replace every NaN value in the dataframe with the value -99:
ndf.replace(to_replace=np.nan, value=-99)
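The functions listed above can be exercised on a tiny hypothetical dataframe; this is only a sketch showing typical usage:
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'Math_Score': [65, np.nan, 78],
                          'Reading_Score': [80, 90, np.nan]})

print(sample_df.isnull())                               # True where a value is missing
print(sample_df.notnull())                              # the complement of isnull()
print(sample_df.dropna())                               # drop rows containing any NaN
print(sample_df.fillna(0))                              # fill NaN with a single value
print(sample_df.fillna(sample_df.mean()))               # fill NaN with each column's mean
print(sample_df.replace(to_replace=np.nan, value=-99))  # replace NaN with -99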
An outlier is an observation in a given dataset that lies far from the rest
of the observations. That means an outlier is vastly larger or smaller than the remaining
values in the set.
2. Why do they occur?
An outlier may occur due to the variability in the data, or due to experimental
error/human error.
The mean is an accurate measure to describe the data when there are no
outliers present. The median is used if there is an outlier in the dataset. The mode is used if there
is an outlier AND about half or more of the data values are the same.
The mean is the measure of central tendency most affected by outliers,
which in turn impacts the standard deviation.
Example:
Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By
looking at it, one can quickly say ‘101’ is an outlier that is much larger than the other
values.
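As a quick check: the mean of the full sample is (15 + 101 + 18 + 7 + 13 + 16 + 11 + 21 + 5 + 15 + 10 + 9) / 12 = 241 / 12 ≈ 20.1, but after removing 101 it drops to 140 / 11 ≈ 12.7. The median of the sorted sample (5, 7, 9, 10, 11, 13, 15, 15, 16, 18, 21, 101) is (13 + 15) / 2 = 14, and without 101 it is 13.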
From the above calculations, we can clearly say the Mean is more affected than the
Median.
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But
what if we have a huge dataset, how do we identify the outliers then? We need to use
visualization and mathematical techniques.
Below are some of the techniques of detecting outliers
● Boxplots
● Scatterplots
● Z-score
● Inter Quantile Range(IQR)
upper = Q3 +1.5*IQR
lower = Q1 – 1.5*IQR
In the above formulas, the conventional statistical scale-up of 1.5 × IQR is used, so any value
above Q3 + 1.5·IQR or below Q1 − 1.5·IQR is treated as an outlier.
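A minimal sketch of both detection rules, assuming a numeric pandas Series (here a hypothetical scores column; replace it with any numeric variable from the academic performance dataset):
import numpy as np
import pandas as pd

scores = pd.Series([65, 68, 70, 72, 74, 75, 76, 78, 79, 150])   # hypothetical values

# IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged
Q1, Q3 = scores.quantile(0.25), scores.quantile(0.75)
IQR = Q3 - Q1
upper, lower = Q3 + 1.5 * IQR, Q1 - 1.5 * IQR
print(scores[(scores > upper) | (scores < lower)])

# Z-score rule: anything more than 3 standard deviations from the mean is flagged
z = (scores - scores.mean()) / scores.std()
print(scores[np.abs(z) > 3])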
Handling of Outliers:
To remove an outlier, delete the corresponding entry from the dataset using its exact position
(index), because all of the detection methods above ultimately produce a list of the data items
that satisfy the outlier definition of the method used.
Data Transformation: Data transformation is the process of converting raw data into a
format or structure that would be more suitable for model building and also for data discovery in
general. The process of data transformation is also referred to as extract/transform/load
(ETL). Data transformation involves the following steps:
● Smoothing: It is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset and
helps in predicting patterns.
● Aggregation: Data collection or aggregation is the method of storing and presenting
data in a summary format. The data may be obtained from multiple data sources to
integrate these data sources into a data analysis description. This is a crucial step
since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used.
● Generalization: It converts low-level data attributes to high-level data attributes
using concept hierarchy. For Example Age initially in Numerical form (22, 25) is
converted into categorical value (young, old).
● Normalization: Data normalization involves converting all data variables into a
given range. Some of the techniques that are used for accomplishing normalization
are:
○ Min–max normalization: This transforms the original data linearly.
○ Z-score normalization: In z-score normalization (or zero-mean normalization)
the values of an attribute (A), are normalized based on the mean of A and its
standard deviation.
○ Normalization by decimal scaling: It normalizes the values of an attribute by
changing the position of their decimal points
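The three normalization techniques above can be sketched as follows (a small illustration on a hypothetical numeric column, not part of the original assignment steps):
import numpy as np
import pandas as pd

x = pd.Series([20, 35, 50, 80, 95], dtype=float)

# Min-max normalization: linear rescaling into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: subtract the mean of A and divide by its standard deviation
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10**j, where j is the smallest integer with max(|x / 10**j|) < 1
j = int(np.ceil(np.log10(x.abs().max())))
decimal_scaled = x / (10 ** j)

print(pd.DataFrame({'x': x, 'min_max': min_max,
                    'z_score': z_score, 'decimal': decimal_scaled}))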
Conclusion: In this way we have explored the functions of the Python libraries for
identifying and handling outliers. Data transformation techniques were explored with the
purpose of creating new variables and reducing the skewness of the dataset.
Viva Questions:
1. Explain the methods to detect the outlier.
2. Explain data transformation methods
3. Write the algorithm to display the statistics of Null values present in the dataset.
4. Write an algorithm to replace the outlier value with the mean of the variable.
Group A
Assignment No: 3
Objective of the Assignment: Students should be able to perform the Statistical operations
using Python on any open source dataset.
Prerequisite:
1. Basics of Python Programming
2. Concept of statistics such as mean, median, minimum, maximum, standard deviation, etc.
There are three main measures of central tendency: the mode, the median and the mean.
Each of these measures describes a different indication of the typical or central value in the
distribution.
1. Mode
The mode is the most commonly occurring value in a distribution.
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2
The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
The mode has an advantage over the median and the mean as it can be found for both
numerical and categorical (non-numerical) data.
There are some limitations to using the mode. In some distributions, the mode may not
reflect the center of the distribution very well. When the distribution of retirement age is
ordered from lowest to highest value, it is easy to see that the center of the distribution is 57
years, but the mode is lower, at 54 years.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
It is also possible for there to be more than one mode for the same distribution of data (bi-modal,
or multi-modal). The presence of more than one mode can limit the ability of the
mode to describe the center or typical value of the distribution, because a single value to
describe the center cannot be identified.
In some cases, particularly where the data are continuous, the distribution may have no mode
at all (i.e. if all values are different).
In cases such as these, it may be better to consider using the median or mean, or group the
data into appropriate intervals, and find the modal class.
2. Median
The median is the middle value in distribution when the values are arranged in ascending or
descending order.
The median divides the distribution in half (there are 50% of observations on either side of
the median value). In a distribution with an odd number of observations, the median value
is the middle value.
Looking at the retirement age distribution (which has 11 observations), the median is the
middle value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the mean of
the two middle values. In the following distribution, the two middle values are 56 and 57,
therefore the median equals 56.5 years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The median is less affected by outliers and skewed data than the mean, and is usually the
preferred measure of central tendency when the distribution is not symmetrical.
The median cannot be identified for categorical nominal data, as it cannot be logically
ordered.
3. Mean
The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
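For the retirement age data above, the mean is (54 + 54 + 54 + 55 + 56 + 57 + 57 + 58 + 58 + 60 + 60) / 11 = 623 / 11 ≈ 56.6 years.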
The mean can be used for both continuous and discrete numeric data.
The mean cannot be calculated for categorical data, as the values cannot be summed.
As the mean includes every value in the distribution the mean is influenced by outliers and skewed
distributions.
---------------------------------------------------------------------------------------------------------------
Variability
The Variance
In a population, variance is the average squared deviation from the population mean, as defined by
the following formula:
σ² = Σ (Xᵢ − μ)² / N
Where σ² is the population variance, μ is the population mean, Xᵢ is the ith element from the
population, and N is the number of elements in the population.
Observations from a simple random sample can be used to estimate the variance of a population.
For this purpose, the sample variance is defined by a slightly different formula, and uses a slightly
different notation:
s² = Σ (xᵢ − x̄)² / (n − 1)
Where s² is the sample variance, x̄ is the sample mean, xᵢ is the ith element from the sample, and
n is the number of elements in the sample. Using this formula, the sample variance can be
considered an unbiased estimate of the true population variance. Therefore, if you need to
estimate an unknown population variance, based on data from a simple random sample, this is
the formula to use.
The standard deviation is the square root of the variance. Thus, the standard deviation of a
population is:
σ = sqrt[ σ² ] = sqrt[ Σ (Xᵢ − μ)² / N ]
Where σ is the population standard deviation, μ is the population mean, Xᵢ is the ith element
from the population, and N is the number of elements in the population.
Statisticians often use simple random samples to estimate the standard deviation of a population,
based on sample data. Given a simple random sample, the best estimate of the standard deviation
of a population is:
s = sqrt[ Σ (xᵢ − x̄)² / (n − 1) ]
Where s is the sample standard deviation, x̄ is the sample mean, xᵢ is the ith element from the
sample, and n is the number of elements in the sample.
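A minimal sketch of computing these statistics with pandas, assuming the same Iris.csv layout used earlier (adapt the column names to your copy of the file):
import pandas as pd

df = pd.read_csv("Iris.csv")

print(df['SepalLengthCm'].mean())     # arithmetic mean
print(df['SepalLengthCm'].median())   # median
print(df['SepalLengthCm'].mode())     # mode (may return more than one value)
print(df['SepalLengthCm'].var())      # sample variance (divides by n - 1)
print(df['SepalLengthCm'].std())      # sample standard deviation
print(df['SepalLengthCm'].min(), df['SepalLengthCm'].max())

# count, mean, std, min, quartiles and max for each species
print(df.groupby('Species')['SepalLengthCm'].describe())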
Conclusion: In this way we have explored the functions of the Python library for basic
statistical details like percentile, mean, standard deviation, etc. of the species ‘Iris-
setosa’, ‘Iris-versicolor’ and ‘Iris-virginica’ in the iris.csv dataset.
Viva Questions:
1. Explain Measures of Central Tendency with examples.
2. What are the different types of variables? Explain with examples.
3. Which method is used to display statistics of the data frame? Write the code.
Group A
Assignment No: 4
Title of the Assignment: Create a Linear Regression Model using Python/R to predict
home prices using Boston Housing Dataset (https://www.kaggle.com/c/boston-housing).
The Boston Housing dataset contains information about various houses in Boston through
different parameters. There are 506 samples and 14 feature variables in this dataset.
The objective is to predict the value of prices of the house using the given features.
Objective of the Assignment: Students should be able to perform data analysis using linear
regression using Python for any open source dataset.
Prerequisite:
1. Basics of Python Programming
2. Concept of Regression.
2. Measuring Performance of Linear Regression
Mean Square Error:
The Mean squared error (MSE) represents the error of the estimator or predictive model
created based on the given set of observations in the sample. Two or more regression
models created using a given sample data can be compared based on their MSE.
An MSE of zero (0) represents the fact that the predictor is a perfect predictor.
RMSE: The Root Mean Squared Error (RMSE) is the square root of the MSE. It expresses the
typical prediction error in the same units as the target variable, which makes it easier to
interpret than the MSE.
R-Squared :
R-Squared is the ratio of the sum of squares regression (SSR) and the sum of squares total
(SST).
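A hedged sketch of fitting the model and computing MSE, RMSE and R-squared with scikit-learn. It assumes the Boston housing data has been saved locally as "Boston.csv" with a target column named MEDV; adapt the file and column names to your copy of the dataset:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("Boston.csv")
X = data.drop(columns=["MEDV"])          # the 13 feature columns
y = data["MEDV"]                         # house price (target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                      # RMSE: error in the units of the target
r2 = r2_score(y_test, y_pred)            # R-squared: proportion of variance explained
print(mse, rmse, r2)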
Conclusion:
In this way we have done data analysis using linear regression for the Boston dataset and
predicted the price of houses using the features of the Boston dataset.
Viva Questions:
1) Compute SST, SSE, SSR, MSE, RMSE, R Square for the below example .
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code to calculate the RSquare for Boston Dataset.
(Consider the linear regression model created in practical session)
Group A
Assignment No: 5
Objective of the Assignment: Students should be able to perform data analysis using logistic
regression using Python for any open source dataset.
Prerequisite:
1. Basics of Python Programming
2. Concept of Regression.
1. Logistic Regression: Classification techniques are an essential part of machine learning
and data mining applications. Approximately 70% of problems in Data Science are
classification problems. One category is binary classification, where the target variable has only
two classes. Another category is multinomial (multi-class) classification, which handles problems
where multiple classes are present in the target variable. For example, the IRIS dataset is a very
famous example of multi-class classification.
Logistic regression is a statistical method for predicting binary classes. The outcome or
target variable is dichotomous in nature. Dichotomous means there are only two possible
classes. For example, it can be used for cancer detection problems. It computes the probability of
an event occurring by applying the logistic (sigmoid) function to a linear combination of the inputs:
p(y = 1 | x) = 1 / (1 + e^-(b0 + b1·x1 + b2·x2 + … + bn·xn))
Where y is the dependent variable and x1, x2, …, xn are explanatory variables.
3. Sigmoid Function
The sigmoid function, also called logistic function, gives an ‘S’ shaped curve that can take any
real-valued number and map it into a value between 0 and 1. If the curve goes to positive infinity,
y predicted will become 1, and if the curve goes to negative infinity, y predicted will become 0.
If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES,
and if it is less than 0.5, we can classify it as 0 or NO.
The following table shows the confusion matrix for a two class classifier.
Here each row indicates the actual classes recorded in the test data set, and each column indicates the
classes as predicted by the classifier.
Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal concerns
prediction errors.
● Accuracy: Accuracy is calculated as the number of correctly classified instances divided by total
number of instances.
The ideal value of accuracy is 1, and the worst is 0. It is also calculated as the sum of true positive
and true negative (TP + TN) divided by the total number of instances.
acc = (TP + TN) / (TP + FP + TN + FN) = (TP + TN) / (Pos + Neg)
● Error Rate: Error Rate is calculated as the number of incorrectly classified instances divided
by the total number of instances. The ideal value of the error rate is 0, and the worst is 1. It is also
calculated as the sum of false positives and false negatives (FP + FN) divided by the total number
of instances.
err = (FP + FN) / (TP + FP + TN + FN) = (FP + FN) / (Pos + Neg)
err = 1 − acc
● Precision: It is calculated as the number of correctly classified positive instances divided by the
total number of instances which are predicted positive. It is also called confidence value. The
ideal value is 1, whereas the worst is 0.
precision = TP / (TP + FP)
● Recall: It is calculated as the number of correctly classified positive instances divided by the
total number of positive instances. It is also called sensitivity. The ideal value of
recall is 1, whereas the worst is 0.
recall = TP / (TP + FN)
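These metrics can be computed from the confusion matrix as sketched below (with hypothetical label arrays; in the assignment they would come from the trained logistic regression model):
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical test labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + fp + tn + fn)
error_rate = 1 - accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, error_rate, precision, recall)

# The same values can be obtained directly from scikit-learn:
print(accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred))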
Conclusion:
In this way we have done data analysis using logistic regression for the Social Media Adv.
dataset and evaluated the performance of the model.
Viva Questions:
1) Consider the binary classification task with two classes positive and negative.
Find out TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code for the preprocessing mentioned in step 4. and Explain every
step in detail.
Group A
Assignment No: 6
Objective of the Assignment: Students should be able to perform data analysis using the Naïve Bayes
classification algorithm using Python for any open source dataset.
Prerequisite:
1. Basics of Python Programming
2. Concept of Joint and Marginal Probability.
Conditional Probability
Here, we are predicting the probability of class1 and class2 based on the given condition. If we
write Bayes' theorem in terms of classes and features, we get the following equation:
P(Ck | X) = P(X | Ck) · P(Ck) / P(X)
Now we have two classes and four features, so if we write this formula for class C1, it will be:
P(C1 | X1 ∩ X2 ∩ X3 ∩ X4) = P(X1 ∩ X2 ∩ X3 ∩ X4 | C1) · P(C1) / P(X1 ∩ X2 ∩ X3 ∩ X4)
Here, we replaced Ck with C1 and X with the intersection of X1, X2, X3 and X4. Why the
intersection? Because we are considering the situation in which all of these features are present
at the same time.
The Naive Bayes algorithm assumes that all the features are independent of each other, or in other
words, that all the features are unrelated. With that assumption, we can further simplify the above to:
P(C1 | X1, X2, X3, X4) ∝ P(X1 | C1) · P(X2 | C1) · P(X3 | C1) · P(X4 | C1) · P(C1)
This is the final equation of Naive Bayes: we calculate this probability for both C1 and C2 and
predict the class with the larger value.
For the golf example, P(No | Today) > P(Yes | Today), so the prediction is that golf would not be played (‘No’).
Step 5: Use Naive Bayes algorithm( Train the Machine ) to Create Model
Step 6: Predict the y_pred for all values of train_x and test_x
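A minimal sketch of Steps 5 and 6 with scikit-learn's Gaussian Naive Bayes, assuming the feature matrix and labels come from the Iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB().fit(train_x, train_y)     # Step 5: train the model
y_pred = model.predict(test_x)                 # Step 6: predict on the test set

print(accuracy_score(test_y, y_pred))
print(confusion_matrix(test_y, y_pred))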
Conclusion:
In this way we have explored data analysis using Naive Bayes Algorithm for Iris dataset and
evaluated the performance of the model.
Viva Questions:
1) Consider the observation for the car theft scenario having 3 attributes colour, Type and
origin.
Find the probability of car theft having scenarios Red SUV and Domestic.
2) Write python code for the preprocessing mentioned in step 4 and explain every step in
detail.
Group A
Assignment No: 7
Objective of the Assignment: Students should be able to perform Text Analysis using TF
IDF Algorithm
Prerequisite:
1. Basics of Python Programming
2. Basics of the English language.
The initial step is to make a vocabulary of unique words and calculate TF for each
document. TF will be more for words that frequently appear in a document and
less for rare words in a document.
● Inverse Document Frequency (IDF)
It is the measure of the importance of a word. Term frequency (TF) does not
consider the importance of words. Some words such as ‘of’, ‘and’, etc. can be
the most frequently present but are of little significance. IDF provides weightage to
each word based on its frequency in the corpus D.
After applying TFIDF, text in A and B documents can be represented as a TFIDF vector of
dimension equal to the vocabulary words. The value corresponding to each word represents
the importance of that word in a particular document.
TFIDF is the product of TF with IDF. Since TF values lie between 0 and 1, not using ln can
result in high IDF for some words, thereby dominating the TFIDF. We don’t want that, and
therefore, we use ln so that the IDF should not completely dominate the TFIDF.
● Disadvantage of TFIDF
It is unable to capture the semantics. For example, funny and humorous are synonyms, but
TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if the
vocabulary is vast.
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text must be
converted into vectors of numbers. In natural language processing, a common technique
for extracting features from text is to place all of the words that occur in the text in a
bucket. This approach is called a bag of words model or BoW for short. It’s referred to
as a “bag” of words because any information about the structure of the sentence is lost.
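As an illustration, here is a small sketch of both representations with scikit-learn, using the two example documents from the questions at the end of this assignment:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Jupiter is the largest Planet",
        "Mars is the fourth planet from the Sun"]

# Bag of Words: raw term counts per document
bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # use get_feature_names() on older scikit-learn
print(bow_matrix.toarray())

# TF-IDF: term frequency weighted down for terms that are common across the corpus
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
print(tfidf_matrix.toarray().round(3))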
Conclusion:
In this way we have completed text data analysis using TF IDF algorithm
Viva Questions:
1) Perform Stemming for text = "studies studying cries cry". Compare
the results generated with Lemmatization. Comment on your answer how
Stemming and Lemmatization differ from each other.
2) Write Python code for removing stop words from the below documents, convert the
documents into lowercase and calculate the TF, IDF and TFIDF score for each
document.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'
Group A
Assignment No: 8
Objective of the Assignment: Students should be able to perform the data Visualization
operation using Python on any open source dataset
Prerequisite:
1. Basics of Python Programming
2. Seaborn Library, Concept of Data Visualization.
Theory:
Data Visualisation plays a very important role in Data Mining. Data scientists spend much of their
time exploring data through visualisation. To accelerate this process we need well-organised
documentation of all the plots.
Even plenty of resources can’t be transformed into valuable goods without planning and
architecture.
The dataset contains 891 rows and 15 columns and contains information about the passengers
who boarded the unfortunate Titanic ship. The original task is to predict whether or not the
passenger survived depending upon different features such as their age, ticket, cabin they
boarded, the class of the ticket, etc. We will use the Seaborn library to see if we can find any
patterns in the data.
A. Distribution Plots
a. Dist-Plot
b. Joint Plot
c. Pair Plot
d. Rug Plot
B. Categorical Plots
a. Bar Plot
b. Count Plot
c. Box Plot
d. Violin Plot
C. Advanced Plots
a. Strip Plot
b. Swarm Plot
D. Matrix Plots
a. Heat Map
b. Cluster Map
A. Distribution Plots:
These plots help us to visualise the distribution of data. We can use these plots to understand the
range, centre and spread of a continuous variable.
a. Dist-Plot
● It plots a histogram for a single continuous variable.
● We can change the number of bins, i.e. the number of vertical bars in the histogram.
b. Joint Plot
● It is the combination of the distplot of two variables.
● It is an example of bivariate analysis.
● We additionally obtain a scatter plot between the variables to reflect their linear
relationship. We can customise the scatter plot into a hexagonal plot, where, the
more the colour intensity, the more will be the number of observations.
2. Categorical Plots
Categorical plots, as the name suggests, are normally used to plot categorical data. The
categorical plots plot the values in the categorical column against another categorical column or
a numeric column. Let's see some of the most commonly used categorical plots.
In addition to finding the average, the bar plot can also be used to calculate other aggregate
values for each category. To do so, you need to pass the aggregate function to the estimator.
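For example (a short sketch, assuming the seaborn Titanic dataset), passing numpy's std as the estimator plots the standard deviation of age per gender instead of the mean:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

dataset = sns.load_dataset('titanic')
sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)   # aggregate with std instead of mean
plt.show()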
The box plot is used to display the distribution of the categorical data in the form of quartiles.
The centre of the box shows the median value. The value from the lower whisker to the bottom
of the box shows the first quartile. From the bottom of the box to the middle of the box lies the
second quartile. From the middle of the box to the top of the box lies the third quartile and finally
from the top of the box to the top whisker lies the last quartile.
Passengers whose values fall outside the whiskers do not belong to any of the quartile ranges shown;
they are called outliers and are represented by dots on the box plot.
Advanced Plots:
a. The Strip Plot
The strip plot draws a scatter plot where one of the variables is categorical. We have seen scatter
plots in the joint plot and the pair plot sections where we had two numeric variables. The strip
plot is different in a way that one of the variables is categorical in this case, and for each
category in the categorical variable, you will see a scatter plot with respect to the numeric
column.
The stripplot() function is used to plot the strip plot. Like the box plot, the first parameter is the
categorical column, the second parameter is the numeric column, while the third parameter is the
dataset.
1. Matrix Plots
Matrix plots are the type of plots that show data in the form of rows and columns. Heat maps are
the prime examples of matrix plots.
a. Heat Maps
Heat maps are normally used to plot correlation between numeric columns in the form of a
matrix. It is important to mention here that to draw matrix plots, you need to have meaningful
information on rows as well as columns.
To plot matrix plots, we need useful information on both columns and row headers. One way to
do this is to call the corr() method on the dataset. The corr() function returns the correlation
between all the numeric columns of the dataset.
dataset.corr()
Now to create a heat map with these correlation values, you need to call the heatmap() function
and pass it your correlation dataframe. Look at the following script:
corr = dataset.corr()
sns.heatmap(corr)
b. Cluster Map:
In addition to the heat map, another commonly used matrix plot is the cluster map. The
cluster map basically uses Hierarchical Clustering to cluster the rows and columns of the
matrix.
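A minimal sketch of a cluster map over the same correlation matrix (numeric_only is passed so that non-numeric Titanic columns are ignored on newer pandas versions):
import seaborn as sns

dataset = sns.load_dataset('titanic')
corr = dataset.corr(numeric_only=True)   # correlations between numeric columns only
sns.clustermap(corr, annot=True)         # rows and columns grouped by hierarchical clustering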
4. Checking how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.
import seaborn as sns
dataset = sns.load_dataset('titanic')
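One way to complete this step (a sketch; older seaborn versions use distplot() instead of histplot()):
import seaborn as sns
import matplotlib.pyplot as plt

dataset = sns.load_dataset('titanic')
sns.histplot(dataset['fare'], bins=20)   # histogram of the ticket fare
plt.title('Distribution of ticket fare')
plt.show()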
Conclusion-
Seaborn is an advanced data visualisation library built on top of Matplotlib library. In this
assignment, we looked at how we can draw distributional and categorical plots using the Seaborn
library. We have seen how to plot matrix plots in Seaborn. We also saw how to change plot styles
and use grid functions to manipulate subplots.
Assignment Questions
1. List out different types of plot to find patterns of data
2. Explain when you will use distribution plots and when you will use categorical plots.
3. Write the conclusion from the following swarm plot (consider titanic dataset)
4. Which parameter is used to add another categorical variable to the violin plot,
Explain with syntax and example?
Group A
Assignment No: 9
Title of the Assignment: Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for
distribution of age with respect to each gender along with the information about whether they
survived or not. (Column names : 'sex' and 'age')
2. Write observations on the inference from the above statistics.
Prerequisite:
1. Basics of Python Programming
2. Seaborn Library, Concept of Data Visualization.
Theory:
BoxPlot:
A boxplot is a standardized way of displaying the distribution of data based on a five number
summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can
tell you about your outliers and what their values are. It can also tell you if your data is
symmetrical, how tightly your data is grouped, and if and how your data is skewed.
first quartile (Q1/25th Percentile): the middle number between the smallest number (not the
“minimum”) and the median of the dataset.
Third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not
the “maximum”) of the dataset.
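A minimal sketch of the required plot, using the seaborn Titanic dataset and the hue parameter to split each gender's box by survival:
import seaborn as sns
import matplotlib.pyplot as plt

dataset = sns.load_dataset('titanic')
sns.boxplot(x='sex', y='age', hue='survived', data=dataset)
plt.title('Age distribution by gender and survival')
plt.show()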
Viva Questions
1. Explain following plots with sample diagrams:
• Histogram
• Violin Plot
2. Write code to plot a box plot for the distribution of age with respect to each gender along
with the information about whether they survived or not. Write the observations.
Group A
Assignment No: 10
Objective of the Assignment: Students should be able to perform the data Visualization
operation using Python on any open source dataset
Prerequisite:
1. Basics of Python Programming
2. Seaborn Library, Concept of Data Visualization.
3. Types of variables
Theory:
Histograms:
A histogram is basically used to represent data provided in the form of some groups. It is an accurate
method for the graphical representation of numerical data distribution. It is a type of bar plot where
the X-axis represents the bin ranges while the Y-axis gives information about frequency.
The following table shows the parameters accepted by the matplotlib.pyplot.hist() function:
Attribute   Parameter
histtype    optional parameter used to select the type of histogram [bar, barstacked, step, stepfilled]; default is “bar”
align       optional parameter that controls the alignment of the histogram bars [left, right, mid]
rwidth      optional parameter giving the relative width of the bars with respect to the bin width
label       optional parameter; string or sequence of strings to match with multiple datasets
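A short sketch (with hypothetical marks data) that exercises these parameters:
import matplotlib.pyplot as plt

marks = [45, 52, 58, 61, 64, 67, 70, 72, 75, 78, 81, 85, 90]
plt.hist(marks, bins=5, histtype='bar', align='mid', rwidth=0.9, label='marks')
plt.xlabel('Marks (bin ranges)')
plt.ylabel('Frequency')
plt.legend()
plt.show()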
Algorithm:
1. Import required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pylab
import seaborn as sns
import os
2. Create the data frame for the downloaded iris.csv dataset.
os.chdir(r"D:\Pandas")
df = pd.read_csv("Iris.csv")
df
3. Apply data preprocessing techniques.
df.isnull().sum()
df.describe()
4. Plot the box plot for each feature in the dataset and observe and detect the outliers.
sns.set(style="whitegrid", palette="GnBu_d", rc={'figure.figsize': (11.7, 8.27)})
sns.boxplot(x='Species', y='SepalLengthCm', data=df)
plt.title('Distribution of sepal length')
plt.show()
5. Plot the histogram for each feature in the dataset.
df.hist()
Viva Questions
1. For the iris dataset, list down the features and their types.
2. Write a code to create a histogram for each feature. (iris dataset)
3. Write a code to create a boxplot for each feature. (iris dataset)
4. Identify the outliers from the boxplot drawn for iris dataset.
Group B
Assignment No: 1
Theory:
Steps to Install Hadoop
Step 2) Download hadoop-core-1.2.1.jar, which is used to compile and execute the program. Visit
the following link
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1
javac -classpath /home/vijay/words/hadoop-core-1.2.1.jar /home/vijay/words/WordCount.java
hadoop fs -mkdir /input
hadoop fs -ls /input
hadoop fs -ls /out321
hadoop fs -cat /out321/part-r-00000
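The WordCount program itself is written in Java (not reproduced here). Purely as an illustration of the same map/reduce idea, the following Python sketch shows the mapper and reducer logic on an in-memory list of lines:
from collections import defaultdict

lines = ["big data is big", "hadoop processes big data"]   # hypothetical input lines

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle + Reduce phase: group the pairs by key and sum the counts
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}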
Conclusion: In this way we have explored the program in Java for a simple Word Count application
that counts the number of occurrences of each word in a given input set using the Hadoop MapReduce
framework on a local standalone set-up.
Viva Questions
1. What is the map reduce? Explain with a small example?
2. Write down steps to install hadoop.
Group B
Assignment No: 2
Theory:
● Steps to Install Hadoop for distributed environment
cd hadoop-2.7.3
Step 2) Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all the
daemons/nodes.
cd hadoop-2.7.3/sbin
1) Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files
stored in the HDFS and tracks all the files stored across the cluster.
2) Start DataNode:
On startup, a DataNode connects to the Namenode and it responds to the requests from the
Namenode for different operations.
3) Start ResourceManager:
ResourceManager is the master that arbitrates all the available cluster resources and thus helps in
managing the distributed applications running on the YARN system. Its work is to manage each
NodeManagers and each application’s ApplicationMaster.
4) Start NodeManager:
The NodeManager in each machine framework is the agent which is responsible for managing
containers, monitoring their resource usage and reporting the same to the ResourceManager.
5) Start JobHistoryServer:
JobHistoryServer is responsible for servicing all job history related requests from client.
Step 3) To check that all the Hadoop services are up and running, run the below command.
jps
Step 4) cd
Step 9) cd mapreduce_vijay/
Step 10) ls
Step 11) sudo chmod +r *.*
Step 14) ls
Step 17) cd ..
Step 20) ls
Step 21) cd
Viva Questions
1. Write down the steps for Design a distributed application using MapReduce which
processes a log file of a system.
Group B
Assignment No: 3
1) Install Scala
Step 2) Install Scala from the apt repository by running the following commands to search for
scala and install it.
Apache Spark is an open-source, distributed processing system used for big data workloads. It
utilizes in-memory caching, and optimized query execution for fast analytic queries against data
of any size.
Step 1) Now go to the official Apache Spark download page and grab the latest version (i.e.
3.2.1) at the time of writing this article. Alternatively, you can use the wget command to
download the file directly in the terminal.
wget https://apachemirror.wuchna.com/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
source ~/.profile
Step 6) ls -l /opt/spark
Step 7) Run the following command to start the Spark master service and slave service.
start-master.sh
start-workers.sh spark://localhost:7077
(if workers not starting then remove and install openssh:
sudo apt-get remove openssh-client openssh-server
sudo apt-get install openssh-client openssh-server)
Step 8) Once the service is started go to the browser and type the following URL access spark
page. From the page, you can see my master and slave service is started.
http://localhost:8080/
Step 9) You can also check if spark-shell works fine by launching the spark-shell command.
spark-shell
Conclusion: In this way we have explored the program in SCALA using Apache Spark framework
Viva Questions
1. Write down steps to install Scala.
Group C
Assignment No: 2
We will begin by scraping and storing Twitter data. We will then classify the Tweets into
positive, negative, or neutral sentiment with a simple algorithm. Then, we will build charts
using Plotly and Matplotlib to identify trends in sentiment.
Command -
import pandas as pd
df = pd.read_csv('/content/data_visualization.csv')
Output -
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshe
ll.py:2882: DtypeWarning: Columns (22,24) have mixed types.Specify
dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
Let's now take a look at some of the variables present in the data frame:
Command -
df.info()
Output -
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33590 entries, 0 to 33589
Data columns (total 36 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 33590 non-null int64
1 conversation_id 33590 non-null int64
2 created_at 33590 non-null object
3 date 33590 non-null object
4 time 33590 non-null object
5 timezone 33590 non-null int64
6 user_id 33590 non-null int64
7 username 33590 non-null object
8 name 33590 non-null object
9 place 85 non-null object
10 tweet 33590 non-null object
11 language 33590 non-null object
12 mentions 33590 non-null object
13 urls 33590 non-null object
14 photos 33590 non-null object
15 replies_count 33590 non-null int64
16 retweets_count 33590 non-null int64
17 likes_count 33590 non-null int64
18 hashtags 33590 non-null object
19 cashtags 33590 non-null object
20 link 33590 non-null object
21 retweet 33590 non-null bool
22 quote_url 1241 non-null object
23 video 33590 non-null int64
24 thumbnail 9473 non-null object
25 near 0 non-null float64
26 geo 0 non-null float64
27 source 0 non-null float64
28 user_rt_id 0 non-null float64
29 user_rt 0 non-null float64
30 retweet_id 0 non-null float64
31 reply_to 33590 non-null object
32 retweet_date 0 non-null float64
33 translate 0 non-null float64
34 trans_src 0 non-null float64
35 trans_dest 0 non-null float64
dtypes: bool(1), float64(10), int64(8), object(17)
memory usage: 9.0+ MB
The data frame has 36 columns. The main variables we will be using in this analysis are
date and tweet. Let's take a look at a sample Tweet in this dataset, and see if we can predict
whether it is positive or negative:
Command -
df['tweet'][10]
Output -
We are pleased to invite you to the EDHEC DataViz Challenge grand
final for a virtual exchange with all Top 10 finalists to see how
data visualization creates impact and can bring out compelling
stories in support of @UNICEF’s mission. https://t.co/Vbj9B48VjV
Step 2: Sentiment Analysis
The Tweet above is clearly positive. Let's see if the model is able to pick up on this, and
return a positive prediction. Run the following lines of code to import the NLTK library,
along with the SentimentIntensityAnalyzer (SID) module.
Command -
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
import re
import pandas as pd
import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())
The SID module takes in a string and returns a score in each of these four categories -
positive, negative, neutral, and compound. The compound score is calculated by normalizing
the positive, negative, and neutral scores. If the compound score is closer to 1, then the Tweet
can be classified as positive. If it is closer to -1, then the Tweet can be classified as negative.
Let's now analyze the above sentence with the sentiment intensity analyzer.
Command -
sentence = df['tweet'][0]
sid.polarity_scores(sentence)['compound']
The output of the code above is 0.7089, indicating that the sentence is of positive sentiment.
Let's now create a function that predicts the sentiment of every Tweet in the dataframe, and
stores it as a separate column called 'sentiment.' First, run the following lines of code to clean
the Tweets in the data frame:
Command -
def cleaner(tweet):
    tweet = re.sub("@[A-Za-z0-9]+", "", tweet)  # Remove @ sign
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet)  # Remove http links
    tweet = " ".join(tweet.split())
    tweet = tweet.replace("#", "").replace("_", " ")  # Remove hashtag sign but keep the text
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet)
                     if w.lower() in words or not w.isalpha())
    return tweet

df['tweet_clean'] = df['tweet'].apply(cleaner)
Now that the Tweets are cleaned, run the following lines of code to perform the sentiment
analysis:
Command -
word_dict = {'manipulate': -1, 'manipulative': -1, 'jamescharlesiscancelled': -1,
             'jamescharlesisoverparty': -1, 'pedophile': -1, 'pedo': -1,
             'cancel': -1, 'cancelled': -1, 'cancel culture': 0.4,
             'teamtati': -1, 'teamjames': 1, 'teamjamescharles': 1, 'liar': -1}

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
sid.lexicon.update(word_dict)

list1 = []
for i in df['tweet_clean']:
    list1.append(sid.polarity_scores(str(i))['compound'])
The word_dict created above is a dictionary of custom words I wanted to add into the model.
Words like 'teamjames' mean that people's sentiment around James Charles is positive, and
that they support him. The dictionary used to train the sentiment intensity analyzer wouldn't
already have these words in them, so we can update it ourselves with custom words.
Now, we need to convert the compound scores into categories - 'positive', 'negative', and
'neutral.'
Command -
df['sentiment'] = pd.Series(list1)

def sentiment_category(sentiment):
    label = ''
    if sentiment > 0:
        label = 'positive'
    elif sentiment == 0:
        label = 'neutral'
    else:
        label = 'negative'
    return label

df['sentiment_category'] = df['sentiment'].apply(sentiment_category)
Let's take a look at the head of the data frame to ensure everything is working properly:
Command -
df = df[['tweet','date','id','sentiment','sentiment_category']]
df.head()
Output -
Notice that the first few Tweets are the combination of positive, negative and neutral
sentiment. For this analysis, we will only be using Tweets with positive and negative
sentiment, since we want to visualize how stronger sentiments have changed over time.
Step 3: Visualization
Now that we have Tweets classified as positive and negative, let's take a look at changes in
sentiment over time. We first need to group positive and negative sentiment and count them
by date:
Command -
neg = df[df['sentiment_category']=='negative']
neg = neg.groupby(['date'],as_index=False).count()
pos = df[df['sentiment_category']=='positive']
pos = pos.groupby(['date'],as_index=False).count()
pos = pos[['date','id']]
neg = neg[['date','id']]
Now, we can visualize sentiment by date using Plotly, by running the following lines of code:
Command -
import plotly.graph_objs as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=pos['date'], y=pos['id'],
                         name='positive',
                         mode='markers+lines',
                         line=dict(shape='linear'),
                         connectgaps=True,
                         line_color='green'))
fig.add_trace(go.Scatter(x=neg['date'], y=neg['id'],
                         name='negative',
                         mode='markers+lines',
                         line=dict(shape='linear'),
                         connectgaps=True,
                         line_color='red'))
fig.show()
Final Output - You should see a chart that looks like this:
The red line represents negative sentiment, and the green line represents positive sentiment.
Assignment Questions:
Hadoop Ecosystem:
Hadoop Ecosystem is neither a programming language nor a service, it is a platform or
framework which solves big data problems. You can consider it as a suite which encompasses
a number of services (ingesting, storing, analyzing and maintaining) inside it. Let us discuss
and get a brief idea about how these services work individually and in collaboration. Below are
the Hadoop components that together form the Hadoop ecosystem.
HDFS -
● Hadoop Distributed File System is the core component or you can say, the backbone
of Hadoop Ecosystem.
● HDFS is the one, which makes it possible to store different types of large data sets
(i.e. structured, unstructured and semi structured data).
● HDFS creates a level of abstraction over the resources, from where we can see the
whole HDFS as a single unit.
● It helps us in storing our data across various nodes and maintaining the log file about
the stored data (metadata).
● HDFS has two core components, i.e. NameNode and DataNode.
1. The NameNode is the main node and it doesn’t store the actual data. It
contains metadata, just like a log file, or you can say a table of contents.
Therefore, it requires less storage but high computational resources.
2. On the other hand, all your data is stored on the DataNodes and hence it
requires more storage resources. These DataNodes are commodity hardware
(like your laptops and desktops) in the distributed environment. That’s the
reason, why Hadoop solutions are very cost effective.
3. You always communicate with the NameNode while writing data. The
NameNode then responds with the DataNodes on which the client should
store and replicate the data.
YARN - Consider YARN as the brain of your Hadoop Ecosystem. It performs all your
processing activities by allocating resources and scheduling tasks.
● It has two major components, i.e. ResourceManager and NodeManager.
1. ResourceManager is again a main node in the processing department.
2. It receives the processing requests, and then passes the parts of requests to
corresponding NodeManagers accordingly, where the actual processing takes
place.
3. NodeManagers are installed on every DataNode. It is responsible for
execution of task on every single DataNode.
1. Schedulers: Based on your application's resource requirements, schedulers run
scheduling algorithms and allocate the resources.
2. ApplicationsManager: The ApplicationsManager accepts job submissions, negotiates the
container (i.e. the DataNode environment where the process executes) for running the
application-specific ApplicationMaster, and monitors its progress.
MAPREDUCE -
Let us take an example to have a better understanding of a MapReduce program.
We have a sample case of students and their respective departments. We want to calculate the
number of students in each department. Initially, Map program will execute and calculate the
students appearing in each department, producing the key value pair as mentioned above.
This key value pair is the input to the Reduce function. The Reduce function will then
aggregate each department and calculate the total number of students in each department and
produce the given result.
APACHE PIG -
● PIG has two parts: Pig Latin, the language and the pig runtime, for the execution
environment. You can better understand it as Java and JVM.
● It supports pig latin language, which has SQL like command structure.
Not everyone comes from a programming background, so Apache PIG relieves
them. You might be curious to know how.
Don’t be shocked when I say that at the back end of a Pig job, a MapReduce job executes.
● The compiler internally converts pig latin to MapReduce. It produces a sequential set
of MapReduce jobs, and that’s an abstraction (which works like black box).
● PIG was initially developed by Yahoo.
● It gives you a platform for building data flow for ETL (Extract, Transform and Load),
processing and analyzing huge data sets.
APACHE HIVE -
● Facebook created HIVE for people who are fluent with SQL. Thus, HIVE makes
them feel at home while working in a Hadoop Ecosystem.
● Basically, HIVE is a data warehousing component which performs reading, writing
and managing large data sets in a distributed environment using SQL-like interface.
HIVE + SQL = HQL
● The query language of Hive is called Hive Query Language (HQL), which is very
similar to SQL.
● It has 2 basic components: Hive Command Line and JDBC/ODBC driver.
● The Hive Command line interface is used to execute HQL commands.
● While Java Database Connectivity (JDBC) and Open Database Connectivity
(ODBC) are used to establish a connection from data storage.
● Secondly, Hive is highly scalable, as it can serve both purposes, i.e. large data set
processing (batch query processing) and real-time processing (interactive
query processing).
● It supports all primitive data types of SQL.
● You can use predefined functions, or write tailored user defined functions (UDF) also
to accomplish your specific needs.
APACHE MAHOUT -
Now, let us talk about Mahout which is renowned for machine learning. Mahout provides an
environment for creating machine learning applications which are scalable.
Mahout provides a command line to invoke various algorithms. It has a predefined set of
library which already contains different inbuilt algorithms for different use cases.
APACHE SPARK -
● Apache Spark is a framework for real time data analytics in a distributed computing
environment.
● Spark is written in Scala and was originally developed at the University of
California, Berkeley.
● It executes in-memory computations to increase the speed of data processing over
MapReduce.
● It can be up to 100x faster than Hadoop for large-scale data processing by exploiting
in-memory computations and other optimizations. Therefore, it requires higher memory
and processing power than MapReduce.
As you can see, Spark comes packed with high-level libraries, including support for R, SQL,
Python, Scala, Java etc. These standard libraries increase the seamless integrations in
complex workflow. Over this, it also allows various sets of services to integrate with it like
MLlib, GraphX, SQL + Data Frames, Streaming services etc. to increase its capabilities.
A natural question is whether Spark replaces Hadoop. The answer is that this is not an
apples-to-apples comparison. Apache Spark best fits real-time processing, whereas Hadoop was
designed to store unstructured data and execute batch processing over it. When we combine
Apache Spark’s abilities, i.e. high processing speed, advanced analytics and multiple integration
support, with Hadoop’s low-cost operation on commodity hardware, it gives the best results.
That is the reason why, Spark and Hadoop are used together by many companies for
processing and analyzing their Big Data stored in HDFS.
APACHE HBASE -
● Apache HBase is an open-source, non-relational, distributed (NoSQL) database that runs
on top of HDFS and provides real-time read/write access to large data sets.
APACHE SOLR & LUCENE -
Apache Solr and Apache Lucene are the two services which are used for searching and
indexing in the Hadoop Ecosystem.
● Apache Lucene is based on Java, which also helps in spell checking.
● If Apache Lucene is the engine, Apache Solr is the car built around it. Solr is a
complete application built around Lucene.
● It uses the Lucene Java search library as a core for search and full indexing.
Assignment Questions -
1. Why is there a need for Big Data Analytics in Healthcare?
2. What is the role of Hadoop in Healthcare Analytics?
3. Explain how an IBM Watson platform can be used for healthcare analytics?