DSBDAL
Aim: Data Wrangling I: Perform the following operations using Python on any open
source dataset (e.g., data.csv).
4. Data Preprocessing: check for missing values in the data using the pandas isnull() function and
use describe() to get some initial statistics. Provide variable descriptions and the types of the
variables. Check the dimensions of the data frame.
Python3
import pandas as pd
import numpy as np
# Assign data (illustrative sample; the original values are not shown in the source)
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Gender': ['M', 'F', 'M', 'F'],
                   'Age': [20, 21, 19, 22], 'Marks': [85.0, np.nan, 72.0, np.nan]})
# Display data
df
Output:
Dealing with missing values: as we can see from the previous output, there are
NaN values present in the Marks column. These are taken care of by
replacing them with the column mean.
Python3
# Compute average of the available (non-missing) marks
c = avg = 0
for ele in df['Marks']:
    if pd.notna(ele):
        c += 1
        avg += ele
avg /= c

# Replace missing marks with the column mean
df = df.replace(to_replace=np.nan, value=avg)

# Display data
df
Output:
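The same replacement can be written more idiomatically with fillna(); a minimal sketch, assuming
the Marks column as above:

df['Marks'] = df['Marks'].fillna(df['Marks'].mean())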
Reshaping data: in the Gender column, we can reshape the data by encoding the
categories as numbers.
Python3
# Categorize gender as numeric codes
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(float)
# Display data
df
Output:
Filtering data: suppose only the name, gender, and marks of the top-scoring students are
required. Here we need to remove some unwanted data, such as the Age column.
Python3
df = df.drop(['Age'], axis=1)
# Display data
df
Output:
Hence, we have finally obtained a clean dataset which can be used further for
various purposes.
Now that we know the basics of data wrangling, below we discuss various
operations with which data wrangling can be performed:
Wrangling Data Using Merge Operation:
The merge operation is used to combine two data frames into the desired format.
Syntax:
pd.merge(data_frame1, data_frame2, on="field")
Here field is the name of the column that is common to both data frames.
For example, suppose a teacher has two types of data: the first consists of details of
students, and the second consists of the pending fees status taken from the accounts office.
The teacher will use the merge operation here to combine the data and give it meaning,
so that it can be analyzed easily; it also saves the teacher the time and effort of merging manually.
FIRST TYPE OF DATA:
Python3
# Import module
import pandas as pd

# Creating the student details dataframe (the ID and BRANCH columns are assumed;
# only the student names appear in the source)
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'NAME': ['Nikita', 'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Printing details
print(details)
Output:
SECOND TYPE OF DATA:
Python3
# Import module
import pandas as pd
# Creating the pending fees status dataframe (the ID and PENDING columns are assumed)
fees_status = pd.DataFrame({'ID': [101, 102, 103, 104, 105],
                            'PENDING': [5000, 250, 0, 9000, 15000]})
# Printing fees_status
print(fees_status)
Output:
MERGING THE TWO DATAFRAMES:
Python3
# Import module
import pandas as pd
# Reusing the details and fees_status dataframes created above and
# merging them on the common ID column (assumed to be the shared field)
print(pd.merge(details, fees_status, on='ID'))
Output:
Wrangling Data Using Grouping Method:
The grouping method is used when the data needs to be analyzed category-wise. For example,
suppose we have car sales data for different brands across years and want to analyze the data of a
particular year; groupby() helps here.
Python3
# Import module
import pandas as pd

# Creating Data (the Brand and Year values are illustrative; only the Sold list
# appears in full in the source)
car_selling_data = {'Brand': ['Maruti', 'Maruti', 'Maruti', 'Maruti',
                              'Hyundai', 'Hyundai', 'Toyota', 'Mahindra',
                              'Mahindra', 'Ford', 'Toyota', 'Ford'],
                    'Year': [2010, 2011, 2009, 2013, 2010, 2011,
                             2011, 2010, 2013, 2010, 2010, 2011],
                    'Sold': [6, 7, 9, 8, 3, 5,
                             2, 8, 7, 2, 4, 2]}
df = pd.DataFrame(car_selling_data)

# Printing Dataframe
print(df)
Output:
Python3
# Import module
import pandas as pd

# Reusing the car_selling_data dictionary created above
df = pd.DataFrame(car_selling_data)

# Group the data by Year and fetch the records of a particular year
grouped = df.groupby('Year')
print(grouped.get_group(2010))
Output:
Wrangling Data by Removing Duplication:
Suppose the student data contains duplicate entries (for example, the same roll number recorded
more than once); such duplicates need to be removed.
Python3
# Import module
import pandas as pd

# Initializing Data (the Name and Roll_no values are illustrative; duplicate roll
# numbers are intentional, and the email addresses are redacted in the source)
student_data = {'Name': ['A', 'B', 'A', 'C', 'B', 'D', 'E', 'A', 'F', 'G'],
                'Roll_no': [1, 2, 1, 3, 2, 4, 5, 1, 6, 7],
                'Email': ['[email protected]', '[email protected]',
                          '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]']}
df = pd.DataFrame(student_data)

# Printing Dataframe
print(df)
Output:
Python3
# Import module
import pandas as pd

# Reusing the student_data dictionary created above (the redacted email list is unchanged)
# Creating dataframe
df = pd.DataFrame(student_data)

# Removing rows with duplicate roll numbers ('Roll_no' is the assumed key column)
non_duplicate = df[~df.duplicated('Roll_no')]
print(non_duplicate)
Output:
Group A Assignment 2
Title: Create an “Academic performance” dataset of students and perform the given
operations using Python.
Objective:
1. Scan all variables for missing values and inconsistencies. If there are missing
values and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.
Theory:
None: None is a Python singleton object that is often used for missing data in Python code.
NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all
systems that use the standard IEEE floating-point representation.
Methods
1. Using Box Plot
It captures the summary of the data effectively and efficiently with only a simple
box and whiskers.
# Box Plot (df_boston is assumed to be a dataframe holding the Boston housing data,
# loaded earlier)
import seaborn as sns
sns.boxplot(df_boston['DIS'])
2. Using Scatter Plot
It is used when you have paired numerical data, when your dependent variable has
multiple values for each reading of your independent variable, or when trying to determine the
relationship between the two variables.
# Scatter plot
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(18, 10))
ax.scatter(df_boston['INDUS'], df_boston['TAX'])
# x-axis label
ax.set_xlabel('(Proportion non-retail business acres)/(town)')
# y-axis label
ax.set_ylabel('(Full-value property-tax rate)/( $10,000)')
plt.show()
3. Z-score
It is also called a standard score. This value/score helps us understand how far a data
point is from the mean. After setting a threshold value, one can use the z-scores of the
data points to define the outliers.
Z-score = (data_point - mean) / standard deviation
# Z score
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df_boston['DIS']))
print(z)
Now, to define an outlier, a threshold value is chosen, which is generally 3.0, since 99.7%
of the data points lie within +/- 3 standard deviations (using the Gaussian distribution
approach).
threshold = 3
# Position of the outlier
print(np.where(z > 3))
4. IQR (Inter Quartile Range)
IQR is the range between the first and the third quartiles, namely Q1 and Q3.
# IQR
Q1 = np.percentile(df_boston['DIS'], 25,
                   interpolation='midpoint')
Q3 = np.percentile(df_boston['DIS'], 75,
                   interpolation='midpoint')
IQR = Q3 - Q1
To define the outliers, a base value is defined above and below the dataset's normal range,
namely the upper and lower bounds (a distance of 1.5*IQR from the quartiles is considered):
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
# Above Upper bound
upper = df_boston['DIS'] >= (Q3+1.5*IQR)
print("Upper bound:",upper)
print(np.where(upper))
For removing an outlier, one must remove the corresponding entry from the dataset using its
exact position, because all of the above detection methods ultimately produce the list of data
items that satisfy the outlier definition according to the method used.
dataframe.drop(row_index, inplace=True)
The above code can be used to drop a row from the dataset given the row indexes to be dropped.
inplace=True tells pandas to make the required change in the original dataset.
row_index can be a single value, a list of values, or a NumPy array, but it must be one
dimensional.
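A minimal sketch combining the detection and removal steps above, assuming df_boston has
already been loaded and the z-score threshold of 3 is used:

from scipy import stats
import numpy as np

# positions of outliers in the DIS column (threshold of 3 assumed, as above)
z = np.abs(stats.zscore(df_boston['DIS']))
outlier_positions = np.where(z > 3)[0]

# drop those rows in place using their index labels
df_boston.drop(df_boston.index[outlier_positions], inplace=True)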
To change the scale for better understanding of the variable
Consider,
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame({
'Income': [15000, 1800, 120000, 10000],
'Age': [25, 18, 42, 51],
'Department': ['HR','Legal','Marketing','Management']
})
For that, we first create a copy of our dataframe and store the numerical feature names in a list,
along with their values:
df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]
MinMax Scaler
It scales all the data to lie between 0 and 1. The formula for calculating the scaled value is:
x_scaled = (x - x_min) / (x_max - x_min)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled
Standard Scaler
For each feature, the Standard Scaler scales the values such that the mean is 0 and the standard
deviation is 1 (and therefore the variance is 1 as well):
x_scaled = (x - mean) / std_dev
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled
MaxAbsScaler
In simplest terms, the MaxAbs scaler takes the absolute maximum value of each column and
divides each value in the column by the maximum value.
Thus, it first takes the absolute value of each value in the column and then takes the maximum
value out of those. This operation scales the data between the range [-1, 1].
df["Balance"] = [100.0, -263.0, 2000.0, -5.0]
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled
Robust Scaler
The Robust Scaler, as the name suggests, is not sensitive to outliers. This scaler
1. removes the median from the data, and
2. scales the data by the InterQuartile Range (IQR);
a sketch of its use follows below.
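A minimal sketch of the same scaling pattern with the Robust Scaler, reusing the df, df_scaled
and col_names defined above (so the exact columns are an assumption):

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df_scaled[col_names] = scaler.fit_transform(df[col_names].values)
df_scaled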
To decrease the skewness and convert the distribution into a normal distribution
Log Transform
It is primarily used to convert a skewed distribution to a normal distribution/less-skewed
distribution. In this transform, we take the log of the values in a column and use these values as
the column instead.
df['log_income'] = np.log(df['Income'])
# We created a new column to store the log values
df['log_income'].plot.hist(bins = 5)
Conclusion: Thus, we scanned all variables for missing values, inconsistencies, and outliers, and
applied suitable techniques to deal with them. We also applied data transformations on the
variables.
Group A Assignment 3
Title:
Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a
dataset (age, income etc.) with numeric variables grouped by one of the qualitative
(categorical) variable.
Objective:
1. If your categorical variable is age groups and your quantitative variable is income, then
provide summary statistics of income grouped by the age groups.
2. Create a list that contains a numeric value for each response to the categorical variable.
3. Display some basic statistical details like percentile, mean, standard deviation, etc. of the
species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica' of the iris.csv dataset.
Theory:
What is Statistics?
✔ Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and
visualizing empirical data.
✔ Descriptive statistics and inferential statistics are the two major areas of statistics.
✔ Descriptive statistics are for describing the properties of sample and population data (what has
happened).
✔ Inferential statistics use those properties to test hypotheses, reach conclusions, and make
predictions (what can you expect).
Mean
✔ The Mean is the average of the data: the sum of all values divided by the number of values.
✔ To find the mean or the average salary of the employees, you can use the mean() function in
Python.
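A minimal sketch, assuming a hypothetical salary_data dataframe with a 'Salary' column:

import pandas as pd
salary_data = pd.DataFrame({'Salary': [40000, 50000, 55000, 60000, 75000]})
print(salary_data['Salary'].mean())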
Mode
✔ The Mode refers to the most frequently occurring value in your data.
✔ You find the frequency of occurrence of each number and the number with the highest
frequency is your mode. If there are no recurring numbers, then there is no mode in the data.
✔ Using the mode, you can find the most commonly occurring point in your data. This is helpful
when you have to find the central tendency of categorical values, like the flavor of the most
popular chip sold by a brand. You cannot find the average based on the orders; instead, you
choose the chip flavor with the highest orders.
✔ Usually, you can count the most frequently occurring value and get your mode. But this only
works when the values are discrete. Now, again take the example of class marks.
✔ Example: Take the following marks of students :
Over here, the value 35 occurs the most frequently and hence is the mode.
✔ But what if the values are grouped into class intervals? In that case, you must use the formula below:
Mode = l + ((f1 - f0) / (2*f1 - f0 - f2)) * h
Where,
l = lower limit of the modal class
h = size of the class interval
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
The modal class is simply the class with the highest frequency. Consider the range of frequencies
given for the marks obtained by students in a class:
Marks 10-20 20-30 30-40 40-50
Number of Students 1 3 5 4
In this case, you can see that class 30-40 has the highest frequency, hence it is the modal class.
The remaining values are as follows: l = 30, h = 10, f1 = 5, f0 = 3, f2 = 4.
In that case, the mode becomes:
Mode = 30 + ((5 - 3) / (2*5 - 3 - 4)) * 10 = 30 + (2/3) * 10 ≈ 36.67
Hence, the mark which occurs most frequently is approximately 36.67. The mode of salary from
the salary data frame can be calculated as:
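A minimal sketch, reusing the hypothetical salary_data frame from above:

print(salary_data['Salary'].mode())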
Median
✔ Median refers to the middle value of the data. To find the median, you first sort the data in
either ascending or descending order and then find the numerical value present in the middle of
the data.
✔ It can be used to figure out the point around which the data is centered. It divides the data into
two halves and has the same number of data points above and below.
✔ The median is especially useful when the data is skewed, that is, when the distribution leans
heavily towards one side. In this case, the average wouldn't give you a fair mid-value but
would lean more towards the higher values, so you can use the middle data point as
the central point instead.
✔ Consider n terms X_1, X_2, X_3, ..., X_n. The position of the median is found from the total
number of observations: with an odd number of terms, the median is the single middle term, with
the same number of terms above and below it. For an even number of terms, take the two middle
terms and find their average.
Example: Consider the following marks of students:
To find the middle term, you first have to sort the data or arrange the data in ascending or
descending order. This ensures that consecutive terms are next to each other.
You can see that we have 12 data points, so use the median formula for an even number of terms.
So, the middle value of the marks is 37. This means that the marks are centered around a
value of about 37.
The median() function in Python can help you find the median value of a column. From the
salary data frame, you can find the median salary as:
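A minimal sketch, again reusing the hypothetical salary_data frame:

print(salary_data['Salary'].median())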
✔ The Variance is defined as the average of the squared differences from the mean.
To calculate the variance, take each difference from the mean, square it, and then average the
results: Variance = (sum of (x_i - mean)^2) / n
groupby()
A groupby operation involves some combination of splitting the object, applying a function, and
combining the results. This can be used to group large amounts of data and compute operations on
these groups.
Example:
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df
Animal Max Speed
0 Falcon 380.0
1 Falcon 370.0
2 Parrot 24.0
3 Parrot 26.0
>>> df.groupby(['Animal']).mean()
Max Speed
Animal
Falcon 375.0
Parrot 25.0
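A minimal sketch of the assignment tasks themselves, assuming iris.csv has a 'Species' column
and numeric measurement columns (the exact column names vary between copies of the file):

import pandas as pd

iris = pd.read_csv('iris.csv')

# summary statistics of every numeric column grouped by the categorical variable
print(iris.groupby('Species').describe())

# basic statistical details (percentiles, mean, std, etc.) for one species
print(iris[iris['Species'] == 'Iris-setosa'].describe())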
Group A Assignment 4
Title:
Create a Linear Regression model using Python to predict home prices using the Boston Housing
dataset.
Objective:
1. To predict the value of prices of the house using the given features.
Theory:
Linear Regression is a supervised machine learning model in which the model finds the best-fit
linear line between the independent and dependent variables, i.e. it finds the linear relationship
between the dependent and independent variables.
The Boston Housing dataset contains information about different houses in Boston. This data
was originally a part of the UCI Machine Learning Repository and has since been removed. We
can also access this data from the scikit-learn library (in older versions). There are 506 samples and 14 feature
variables in this dataset. The objective is to predict the value of prices of the house using the
given features.
The problem that we are going to solve here is that given a set of features that describe a
house in Boston, our machine learning model must predict the house price. To train our
machine learning model with boston housing data, we will be using scikit-learn’s boston
dataset.
In this dataset, each row describes a boston town or suburb. There are 506 rows and
14 attributes (features) with a target column (MEDV).
Now, split the data into the independent variables (X) and the dependent variable (Y). The
data will be stored in df_x for the independent variables and in df_y for the dependent variable.
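A minimal sketch of this step, assuming a local CSV copy of the Boston housing data named
HousingData.csv with a MEDV target column (the file name is an assumption; on older
scikit-learn versions the load_boston() helper can be used instead):

import pandas as pd

df = pd.read_csv('HousingData.csv')   # hypothetical local copy of the Boston housing data
df_x = df.drop(columns=['MEDV'])      # independent variables (features)
df_y = df['MEDV']                     # dependent variable (target)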
Initialize the Linear Regression model, split the data into 67% training and 33% testing data,
and then train the model with the training data set that contains the independent variables.
#Initialize the linear regression model
from sklearn import linear_model
from sklearn.model_selection import train_test_split

reg = linear_model.LinearRegression()

#Split the data into 67% training and 33% testing data
#NOTE: We have to split the independent variables (x) and the target or dependent variable (y)
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.33,
                                                    random_state=42)

#Train our model with the training data
reg.fit(x_train, y_train)
print(reg.coef_)
Now that we are done training the linear regression model and looking at the coefficients that
describe the linear function, let’s print the model's predictions (what it thinks the values will
be for houses) on the test data.
We want to know what the actual values for that test data set were, so I will print those values
to the screen, but first I will print at least one row from the model's predictions, just to make it
a little easier to compare the data.
#Make predictions with the trained model on the test data
y_pred = reg.predict(x_test)
#Print the prediction for the third row of our test data (actual price = 13.6)
print(y_pred[2])
#Print the actual prices of houses from the testing data set for comparison
print(y_test)
To check the model's performance/accuracy I will use a metric called mean squared error
(MSE). This measurement is simple to implement and easy to understand. The MSE is a
measure of the quality of an estimator — it is always non-negative, and values closer to zero
indicate a better fit. Usually you want to evaluate your model with other metrics as well to
truly get an idea of how well your model performs. I’ll do this two different ways, one using
numpy and the other using sklearn.metrics.
# Two different ways to check model performance/accuracy using,
# mean squared error which tells you how close a regression line is to a set of points.
# 1. Mean squared error by numpy
print(np.mean((y_pred-y_test)**2))
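The second of the two ways mentioned above, a sketch using sklearn.metrics:

# 2. Mean squared error by sklearn.metrics
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred))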
Conclusion: Thus, we predicted the value of prices of the house using the given features.
Group A Assignment 5
Title:
Implement logistic regression using Python/R to perform classification on
Social_Network_Ads.csv dataset.
Objective:
Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate,
Precision, Recall on the given dataset.
Theory:
Logistic regression is a statistical model that in its basic form uses a logistic function to model
a binary dependent variable, although many more complex extensions exist. In regression
analysis, logistic regression (or logit regression) estimates the parameters of a logistic model
(a form of binary regression). Logistic regression is basically a supervised classification
algorithm. In a classification problem, the target variable (or output), y, can take only discrete
values for a given set of features (or inputs), X.
● Dependent variable: The target variable in a logistic regression model, which we are trying to
predict.
● Independent variables: The input characteristics or predictor factors applied to the dependent
variable’s predictions.
● Logistic function: The formula used to represent how the independent and dependent variables
relate to one another. The logistic function transforms the input variables into a probability value
between 0 and 1, which represents the likelihood of the dependent variable being 1 or 0.
● Odds: The proportion of an event’s chances of happening to its chances of not happening. The
chances are used in logistic regression to model the connection between the independent and
dependent variables.
● Log-odds: The logistic regression model’s calculation is made simpler by using the logarithm
of the odds.
● Coefficient: The logistic regression model’s estimated parameters, which show how the
independent and dependent variables relate to one another.
● Intercept: A constant term in the logistic regression model, which represents the log-odds when
all independent variables are equal to zero.
● Maximum likelihood estimation: The method used to estimate the coefficients of the logistic
regression model, which maximizes the likelihood of observing the data given the model.
● Confusion matrix: A table that lists the number of true positive, true negative, false positive,
and false negative predictions made by a logistic regression model; it is used to assess the
model's performance.
We will be taking data from social network ads, which tells us whether a person purchased the
advertised product or not, based on features such as age and salary.
import numpy as np
import pandas as pd
Now we will import the dataset and select only age and salary as the features
dataset = pd.read_csv('Social_Network_Ads.csv')
dataset.head()
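A sketch of the feature/target selection this step refers to, assuming the usual column order of
Social_Network_Ads.csv (User ID, Gender, Age, EstimatedSalary, Purchased):

X = dataset.iloc[:, [2, 3]].values   # Age and EstimatedSalary
y = dataset.iloc[:, 4].values        # Purchased (0/1)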
Now we will perform splitting for training and testing. We will take 75% of the data for
training, and test on the remaining data
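A minimal sketch of the split described above (75% training, 25% testing):

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)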
Next, we scale the features to avoid variation and to let the features follow a roughly standard
normal distribution.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
The preprocessing part is over. It is time to fit the model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
We fitted the model on the training data. Now we will predict the labels of the test data:
y_pred = classifier.predict(X_test)
Confusion Matrix
It is a matrix of size 2×2 for binary classification, with the actual values on one axis and the
predicted values on the other.
Precision
Out of all the predicted positives, what percentage are actually positive.
Recall
Out of the total actual positives, what percentage are predicted positive. It is the same as TPR (true
positive rate).
The prediction is over. Now we will evaluate the performance of our model using the confusion
matrix and the classification report.
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_test, y_pred)
cl_report = classification_report(y_test, y_pred)
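A minimal sketch of deriving the metrics named in the objective from the confusion matrix
(assuming the positive class is labelled 1, so cm has the layout [[TN, FP], [FN, TP]]):

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = 1 - accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
print("Accuracy:", accuracy, "Error rate:", error_rate)
print("Precision:", precision, "Recall:", recall)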
Conclusion: Thus, we implemented the logistic regression on social network adv dataset and
identified the accuracy, precision and recall.
Assignment 6
Title: Data Analytics III
Aim
1. Implement Simple Naïve Bayes classification algorithm using Python/R on the iris.csv dataset.
2. Compute the Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, and
Recall on the given dataset.
Objective:
Students are able to learn:
Theory:
Contents for Theory:
1. Concepts used in Naïve Bayes classifier
2. Naive Bayes Example
3. Confusion Matrix Evaluation Metrics
---------------------------------------------------------------------------------------------------------
1. Concepts used in Naïve Bayes classifier
● The Naïve Bayes Classifier can be used for classification of categorical data.
○ Let there be j classes: C = {1, 2, ..., j}
○ Let the input observation be specified by P features; therefore an input observation
x is given as x = {F1, F2, ..., Fp}
○ The Naïve Bayes classifier depends on Bayes' rule from probability theory.
● Prior probabilities: Probabilities which are calculated for some event based on no other
information are called prior probabilities.
For example, P(A), P(B), P(C) are prior probabilities because while calculating P(A), the
occurrences of events B or C are not considered, i.e. no information about the occurrence
of any other event is used.
Conditional Probabilities: The probability of an event A given that an event B has already
occurred is written P(A|B) = P(A and B) / P(B).
By Bayes' rule, combining the prior and the conditional probabilities,
P(class | features) = P(features | class) * P(class) / P(features).
2. Naive Bayes Example
We have a dataset with the features Outlook, Temp, Humidity, and Windy, and
the target is to predict whether a person or team will play golf or not.
Conditional Probability
Here, we are predicting the probability of class1 and class2 based on the given condition (today's
features). If we write the same formula in terms of classes and features, we get the following
equation:
P(Ck | X) = P(X | Ck) * P(Ck) / P(X)
So if we write this formula for class C1, it will be something like this:
P(C1 | X) = P(X1 and X2 and X3 and X4 | C1) * P(C1) / P(X1 and X2 and X3 and X4)
Here, we replaced Ck with C1 and X with the intersection of X1, X2, X3, X4. You might wonder
why; it is because we are considering the situation when all these features are present at the
same time.
The Naive Bayes algorithm assumes that all the features are independent of each other, or in
other words that all the features are unrelated. With that assumption, we can further simplify the
likelihood into a product of per-feature probabilities:
P(C1 | X) ∝ P(C1) * P(X1 | C1) * P(X2 | C1) * P(X3 | C1) * P(X4 | C1)
This is the final equation of Naive Bayes, and we have to calculate the probability of both
C1 and C2 for this particular example.
P(No | Today) > P(Yes | Today), so the prediction is that golf would not be played ('No').
For this implementation we will use the Iris Flower Species Dataset.
The Iris Flower Dataset involves predicting the flower species given measurements of iris flowers.
It is a multiclass classification problem. The number of observations for each class is balanced. There
are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:
1. Sepal length in cm
2. Sepal width in cm
3. Petal length in cm
4. Petal width in cm
5. Species (class)
The class column is extracted as the target:
y = df['species'].values
Step 6: Predict y_pred for all values of train_x and test_x
Y_pred = gaussian.predict(X_test)
precision = precision_score(y_test, Y_pred, average='micro')
recall = recall_score(y_test, Y_pred, average='micro')
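A minimal end-to-end sketch of the steps referred to above, assuming iris.csv has a 'species'
column and four numeric feature columns (the exact column names are an assumption):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

df = pd.read_csv('iris.csv')
X = df.drop(columns=['species']).values
y = df['species'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
Y_pred = gaussian.predict(X_test)

print(confusion_matrix(y_test, Y_pred))
print("Accuracy:", accuracy_score(y_test, Y_pred))
print("Precision:", precision_score(y_test, Y_pred, average='micro'))
print("Recall:", recall_score(y_test, Y_pred, average='micro'))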
The confusion matrix shows us how our classifier gets confused while predicting. In a confusion
matrix we have four important terms which are:
1. True Positive (TP)- Both actual and predicted values are Positive.
2. True Negative (TN)- Both actual and predicted values are Negative.
3. False Positive (FP)- The actual value is negative but we predicted it as positive.
4. False Negative (FN)- The actual value is positive but we predicted it as negative.
Performance Metrics
The confusion matrix is not only used for finding errors in prediction but is also useful for
finding some important performance metrics like Accuracy, Recall, Precision, and F-measure. We
will discuss these terms one by one.
Accuracy
As the name suggests, the value of this metric suggests the accuracy of our classifier in predicting
results.
It is defined as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Precision is the measure of the correctly predicted positives out of all predicted positive values. It
is defined as:
Precision = TP / (TP + FP)
Recall
Recall is the measure of the positive values that are predicted correctly out of all actual positive values.
It is defined as:
Recall = TP / (TP + FN)
Conclusion: In this assignment, we covered how Naïve Bayes theorem used to solve classification
problem for iris flower dataset and what is confusion matrix, its need, and how to derive it in Python
and R.
DS&BDL
Assignment 7
Title: Text Analytics
Aim
1. Extract a sample document and apply the following document pre-processing methods:
Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
2. Create a representation of the document by calculating Term Frequency and Inverse Document
Frequency.
Objective:
Students are able to learn:
Text Analytics has lots of applications in today's online world. By analyzing tweets on Twitter, we can
find trending news and people's reactions to a particular event. Amazon can understand user feedback
or reviews on a specific product. BookMyShow can discover people's opinion about a movie.
YouTube can also analyze and understand people's viewpoints on a video.
In this assignment we are going to take sample document and will apply following pre-processing
methods.
#Loading NLTK
import nltk
2. Tokenization
Tokenization is the first step in text analytics. The process of breaking down a text
paragraph into smaller chunks such as words or sentences is called tokenization. A token is a
single entity that is a building block of a sentence or paragraph.
Sentence Tokenization
Sentence tokenizer breaks text paragraph into sentences.
from nltk.tokenize import sent_tokenize
text=""" """
tokenized_text=sent_tokenize(text)
print(tokenized_text)
Word Tokenization
Word tokenizer breaks text paragraph into words.
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)
3. Stop words
Stopwords are considered noise in the text. Text may contain stop words such as is, am, are, this, a,
an, the, etc.
In NLTK for removing stopwords, you need to create a list of stopwords and filter out your list of
tokens from these words.
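A minimal sketch of the stopword-filtering step that produces the filtered_sent list used in the
stemming code below (it assumes the NLTK stopwords corpus has been downloaded with
nltk.download('stopwords')):

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

filtered_sent = []
for w in tokenized_word:
    if w.lower() not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:", tokenized_word)
print("Filtered Sentence:", filtered_sent)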
4. Stemming
Stemming reduces words to their root form by removing suffixes, e.g. "connected" and
"connecting" both reduce to "connect".
# Stemming
from nltk.stem import PorterStemmer
ps = PorterStemmer()

stemmed_words = []
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))
print("Filtered Sentence:", filtered_sent)
print("Stemmed Sentence:", stemmed_words)
Lemmatization
Lemmatization reduces words to their base word, which is a linguistically correct lemma. It
transforms the root word with the use of vocabulary and morphological analysis. Lemmatization is
usually more sophisticated than stemming, since a stemmer works on an individual word without
knowledge of the context. For example, the word "better" has "good" as its lemma. This will be
missed by stemming, because it requires a dictionary look-up.
#Lexicon Normalization
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
lem = WordNetLemmatizer()
stem = PorterStemmer()
word = "flying"
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",stem.stem(word))
5. POS Tagging
The primary target of Part-of-Speech(POS) tagging is to identify the grammatical group of a given
word. Whether it is a NOUN, PRONOUN, ADJECTIVE, VERB, ADVERBS, etc. based on the
context. POS Tagging looks for relationships within the sentence and assigns a corresponding tag
to the word.
sent = " "
tokens=nltk.word_tokenize(sent)
print(tokens)
['Albert', 'Einstein', 'was', 'born', 'in', 'Ulm', ',', 'Germany', 'in', '1879', '.']
nltk.pos_tag(tokens)
6. Text Classification
Text Classification Model using TF-IDF. First, import the MultinomialNB module and create a
Multinomial Naive Bayes classifier object using MultinomialNB() function. Then, fit your model
on a train set using fit() and perform prediction on the test set using predict().
from sklearn.naive_bayes import MultinomialNB
predicted= clf.predict(X_test)
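A minimal sketch filling in the missing steps around the two lines above, assuming a list of text
documents named docs and their class labels named labels (both hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# docs: list of documents, labels: their class labels (assumed to exist already)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=1)

clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, predicted))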
In Term Frequency (TF), you just count the number of times each word occurs in each document. The
main issue with Term Frequency is that it gives more weight to longer documents. Term frequency
is basically the output of the BoW model.
IDF (Inverse Document Frequency) measures the amount of information a given word provides across
the documents. It is the logarithmically scaled inverse ratio of the number of documents that contain
the word to the total number of documents: IDF(word) = log(N / n_word), where N is the total number
of documents and n_word is the number of documents containing the word.
Conclusion: In this assignment, we have learned what Text Analytics is, NLP and text mining, and the
basics of text analytics operations using NLTK, such as Tokenization, Normalization, Stemming,
Lemmatization and POS tagging, as well as how to build a Text Classification model using TF-IDF.
Group A Assignment 8
Title:
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about
the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see if we can
find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram
Objective:
Use the Seaborn library to see if we can find any patterns in the data.
Theory:
Contents for Theory:
1. Seaborn Library Basics
2. Know your Data
3. Finding patterns of data.
4. Checking how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.
--------------------------------------------------------------------------------------------------------------
Theory:
Data Visualisation plays a very important role in Data Mining. Data scientists spend much of their
time exploring data through visualization. To accelerate this process we need well-documented
knowledge of all the plots, since even plenty of data cannot be transformed into valuable insight
without planning and structure.
import pandas as pd
import numpy as np
import seaborn as sns
dataset = sns.load_dataset('titanic')
dataset.head()
The dataset contains 891 rows and 15 columns and contains information about the passengers who
boarded the unfortunate Titanic ship. The original task is to predict whether or not the passenger
survived depending upon different features such as their age, ticket, cabin they boarded, the class
of the ticket, etc. We will use the Seaborn library to see if we can find any patterns in the data.
A. Distribution Plots
a. Dist-Plot
b. Joint Plot
c. Rug Plot
B. Categorical Plots
a. Bar Plot
b. Count Plot
c. Box Plot
d. Violin Plot
C. Advanced Plots
a. Strip Plot
b. Swarm Plot
D. Matrix Plots
a. Heat Map
b. Cluster Map
A. Distribution Plots:
These plots help us to visualise the distribution of data. We can use these plots to understand
the spread and central tendency of a variable.
a. Distplot
● The dist plot shows the histogram distribution of data for a single numeric column, here the age
column.
● We can change the number of bins, i.e. the number of vertical bars in the histogram.
● Here the x-axis is the age and the y-axis displays the frequency; for example, for bins = 10,
each bar groups the ages that fall into that bin's range.
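A sketch of the dist plot described above (sns.distplot is deprecated in newer Seaborn versions in
favour of sns.histplot, so both forms are shown; missing ages are dropped first):

sns.distplot(dataset['age'].dropna(), bins=10)
# or, on newer Seaborn versions:
# sns.histplot(dataset['age'].dropna(), bins=10, kde=True)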
b. Joint Plot
# For Plot 1: default scatter joint plot of age against fare
sns.jointplot(x='age', y='fare', data=dataset)
# For Plot 2: the same joint plot with hexagonal bins
sns.jointplot(x='age', y='fare', data=dataset, kind='hex')
● From the output, you can see that a joint plot has three parts: a distribution plot at the top for
the column on the x-axis, a distribution plot on the right for the column on the y-axis, and a
scatter plot in between that shows the mutual distribution of data for both the columns. You
can see that there is no strong correlation observed between the ages and the fares.
● You can change the type of the joint plot by passing a value for the kind parameter. For instance,
if instead of a scatter plot, you want to display the distribution of data in the form of a
hexagonal plot, you can pass the value hex for the kind parameter.
● In the hexagonal plot, the hexagon with the most number of points gets darker colour. So if you
look at the above plot, you can see that most of the passengers are between the ages of 20 and
30 and most of them paid between 10-50 for the tickets.
c. Rug Plot
The rugplot() is used to draw small bars along the x-axis for each point in the dataset. To plot
a rug plot, you need to pass the name of the column. Let's plot a rug plot for fare.
sns.rugplot(dataset['fare'])
From the output, you can see that most of the instances for the fares have values between 0 and
100.
These are some of the most commonly used distribution plots offered by the Python's Seaborn
Library. Let's see some of the categorical plots in the Seaborn library.
2. Categorical Plots
Categorical plots, as the name suggests, are normally used to plot categorical data. The categorical
plots plot the values in the categorical column against another categorical column or a numeric
column. Let's see some of the most commonly used categorical plots.
a. Bar Plot
The bar plot shows the aggregate value (by default the mean) of a numeric column for each
category of a categorical column, as sketched below.
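A sketch of the bar plot the next paragraph refers to, using the default mean estimator:

sns.barplot(x='sex', y='age', data=dataset)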
From the output, you can clearly see that the average age of male passengers is just less than 40
while the average age of female passengers is around 33.
In addition to finding the average, the bar plot can also be used to calculate other aggregate values
for each category. To do so, you need to pass the aggregate function to the estimator. For instance,
you can calculate the standard deviation for the age of each gender as follows:
import numpy as np
sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)
Notice, in the above script we use the std aggregate function from the numpy library to calculate
the standard deviation for the ages of male and female passengers. The output looks like this:
c. Box Plot
The box plot is used to display the distribution of the categorical data in the form of quartiles. The
centre of the box shows the median value. The value from the lower whisker to the bottom of the
box shows the first quartile. From the bottom of the box to the middle of the box lies the second
quartile. From the middle of the box to the top of the box lies the third quartile and finally from
the top of the box to the top whisker lies the last quartile.
Now let's plot a box plot that displays the distribution for the age with respect to each gender. You
need to pass the categorical column as the first parameter (which is sex in our case) and the numeric
column (age in our case) as the second parameter. Finally, the dataset is passed as the third
parameter, take a look at the following script:
sns.boxplot(x='sex', y='age', data=dataset)
Let's try to understand the box plot for females. The first quartile starts at around 1 and ends at 20
which means that 25% of the passengers are aged between 1 and 20. The second quartile starts at
around 20 and ends at around 28 which means that 25% of the passengers are aged between20 and
28. Similarly, the third quartile starts and ends between 28 and 38, hence 25% passengers are aged
within this range and finally the fourth or last quartile starts at 38 and ends around 64.
If there are any outliers or the passengers that do not belong to any of the quartiles, they are
called outliers and are represented by dots on the box plot.
You can make your box plots fancier by adding another layer of distribution. For instance, if
you want to see the box plots of the age of passengers of both genders, along with the information
about whether or not they survived, you can pass survived as the value of the hue parameter as
shown below:
sns.boxplot(x='sex', y='age', data=dataset, hue="survived")
Now in addition to the information about the age of each gender, you can also see the distribution
of the passengers who survived. For instance, you can see that among the male passengers, on
average more younger people survived as compared to the older ones. Similarly, you can see that
the variation among the age of female passengers who did not survive is much greater than the age
of the surviving female passengers.
d. Violin Plot
Let's plot a violin plot that displays the distribution of age with respect to each gender. Like the
box plot, the violinplot() function takes the categorical column as the first parameter, the numeric
column as the second, and the dataset as the third; a sketch follows below.
Like box plots, you can also add another categorical variable to the violin plot using the hue
parameter.
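A sketch of the two violin plots described above (the hue usage is assumed to be analogous to the
box plot):

sns.violinplot(x='sex', y='age', data=dataset)
sns.violinplot(x='sex', y='age', data=dataset, hue='survived')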
Now you can see a lot of information on the violin plot. For instance, if you look at the bottom of
the violin plot for the males who survived (left-orange), you can see that it is thicker than the
bottom of the violin plot for the males who didn't survive (left-blue). This means that the number
of young male passengers who survived is greater than the number of young male passengers
who did not survive
C. Advanced Plots:
a. Strip Plot
The stripplot() function is used to plot the strip plot. Like the box plot, the first parameter is the
categorical column, the second parameter is the numeric column, while the third parameter is the
dataset. Look at the following script:
sns.stripplot(x='sex', y='age', data=dataset)
You can see the scattered plots of age for both males and females. The data points look like
strips. It is difficult to comprehend the distribution of data in this form. To better comprehend the
data, pass True for the jitter parameter which adds some random noise to the data. Look at the
following script:
sns.stripplot(x='sex', y='age', data=dataset, jitter=True)
Now you have a better view for the distribution of age across the genders.
Like violin and box plots, you can add an additional categorical column to strip plot using hue
parameter as shown below:
sns.stripplot(x='sex', y='age', data=dataset, jitter=True,
hue='survived')
b. The Swarm Plot
The swarm plot is a combination of the strip and the violin plots. In the swarm plots, the points
are adjusted in such a way that they don't overlap. Let's plot a swarm plot for the distribution of
age against gender. The swarmplot() function is used to plot the swarm plot. Like the box plot,
the first parameter is the categorical column, the second parameter is the numeric column while
the third parameter is the dataset. Look at the following script:
sns.swarmplot(x='sex', y='age', data=dataset)
You can clearly see that the above plot contains scattered data points like the strip plot and the
data points are not overlapping. Rather they are arranged to give a view similar to that of a violin
plot.
Let's add another categorical column to the swarm plot using the hue parameter:
sns.swarmplot(x='sex', y='age', data=dataset, hue='survived')
From the output, it is evident that the ratio of surviving males is less than the ratio of surviving
females. Since for the male plot, there are more blue points and less orange points. On the other
hand, for females, there are more orange points (surviving) than the blue points (not surviving).
Another observation is that amongst males of age less than 10, more passengers survived as
compared to those who didn't.
1. Matrix Plots
Matrix plots are the type of plots that show data in the form of rows and columns. Heat maps are
the prime examples of matrix plots.
a. Heat Maps
Heat maps are normally used to plot correlation between numeric columns in the form of a matrix.
It is important to mention here that to draw matrix plots, you need to have meaningful information
on rows as well as columns. Let's plot the first five rows of the Titanic dataset to see if both the
rows and column headers have meaningful information. Execute the following script:
import pandas as pd
import numpy as np
dataset = sns.load_dataset('titanic')
dataset.head()
From the output, you can see that the column headers contain useful information such as
passengers surviving, their age, fare etc. However the row headers only contain indexes 0, 1, 2,
etc. To plot matrix plots, we need useful information on both columns and row headers. One way
to do this is to call the corr() method on the dataset. The corr() function returns the correlation
between all the numeric columns of the dataset. Execute the following script:
dataset.corr()
In the output, you will see that both the columns and the rows have meaningful header
information, as shown below:
Now to create a heat map with these correlation values, you need to call the heatmap() function
and pass it your correlation dataframe. Look at the following script:
corr = dataset.corr()
sns.heatmap(corr)
From the output, it can be seen that what heatmap essentially does is that it plots a box for every
combination of rows and column value. The colour of the box depends upon the gradient. For
instance, in the above image if there is a high correlation between two features, the corresponding
cell or the box is white, on the other hand if there is no correlation, the corresponding cell remains
black.
The correlation values can also be plotted on the heatmap by passing True for the annot
parameter. Execute the following script to see this in action:
corr = dataset.corr()
sns.heatmap(corr, annot=True)
You can also change the colour scheme of the heatmap by passing an argument for the cmap
parameter. For example (the colormap name is just an illustration):
corr = dataset.corr()
sns.heatmap(corr, cmap='coolwarm')
b. Cluster Map:
In addition to the heat map, another commonly used matrix plot is the cluster map. The
cluster map basically uses Hierarchical Clustering to cluster the rows and columns of the
matrix.
Let's plot a cluster map for the number of passengers who travelled in a specific month
of a specific year. Execute the following script:
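A sketch of the cluster map described above; it uses Seaborn's built-in flights dataset, which
records monthly passenger counts per year (an assumption, since the script itself is not shown in
the source):

import seaborn as sns
flights = sns.load_dataset('flights')
flights = flights.pivot(index='month', columns='year', values='passengers')
sns.clustermap(flights)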
4. Checking how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.
import seaborn as sns
dataset = sns.load_dataset('titanic')
sns.histplot(dataset['fare'], kde=False, bins=10)
From the histogram, it is seen that for around 730 passengers the price of the ticket lies in the
first bin (roughly 0 to 50). For around 100 passengers the price lies in the next bin, and so on.
Conclusion
Seaborn is an advanced data visualization library built on top of the Matplotlib library. In this
assignment, we looked at how to draw distributional and categorical plots using the Seaborn
library. We also saw how to plot matrix plots (heat maps and cluster maps) in Seaborn and how to
plot a histogram of the ticket fare.
Assignment 9
Title: Data Visualization
Aim: 1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for
distribution of age with respect to each gender along with the information about whether they
survived or not. (Column names : 'sex' and 'age').
2. Write observations on the inference from the above statistics.
Objective:
Students are able to learn:
Theory:
There are various techniques to understand your data, and the basic requirement is knowledge of
NumPy for mathematical operations and Pandas for data manipulation. We are using the Titanic
dataset. For demonstrating some of the techniques we will also use Seaborn's inbuilt tips
dataset, which records the tips each waiter gets from different customers.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from seaborn import load_dataset
#titanic dataset
data = pd.read_csv("titanic_train.csv")
#tips dataset
tips = load_dataset("tips")
Univariate Analysis
Univariate analysis is the simplest form of analysis, where we explore a single variable.
It is performed to describe the data in a better way. We perform univariate analysis of numerical
and categorical variables differently, because they call for different plots.
Categorical Data:
A variable that holds text-based (label) information is referred to as a categorical variable. The
following plots can be used for visualizing categorical data.
1) CountPlot:
Countplot is basically a frequency-count plot in the form of a bar graph. It plots the count of
each category as a separate bar, which is the visual equivalent of using the pandas value_counts()
function on a column. In our data the target variable is Survived, and it is categorical, so we
plot a countplot of it.
sns.countplot(data['Survived'])
plt.show()
2) Pie Chart:
The pie chart is similar to the countplot, but additionally shows the percentage presence of each
category in the data, i.e. how much weightage each category gets. Now we check the Sex column:
what is the percentage of male and female members travelling?
data['Sex'].value_counts().plot(kind="pie", autopct="%.2f")
plt.show()
Numerical Data:
Analyzing Numerical data is important because understanding the distribution of variables
helps to further process the data. Most of the time, we will find much inconsistency with
numerical data so we have to explore numerical variables.
1) Histogram:
A histogram is a value distribution plot of numerical columns. It basically creates bins in various
ranges in values and plots it where we can visualize how values are distributed. We can have a
look where more values lie like in positive, negative, or at the center(mean). Let’s have a look at
the Age column.
plt.hist(data['Age'], bins=5)
plt.show()
2) Distplot:
Distplot is also known as the second Histogram because it is a slight improvement version of
the Histogram. Distplot gives us a KDE(Kernel Density Estimation) over histogram which
explains PDF(Probability Density Function) which means what is the probability of each value
occurring in this column.
sns.distplot(data['Age'])
plt.show()
3) Boxplot:
Boxplot is a very interesting plot that basically plots the 5-number summary. To get the 5-number
summary, we need the following terms:
IQR = Q3 - Q1
Lower_boundary = Q1 - 1.5 * IQR
Upper_boundary = Q3 + 1.5 * IQR
Here Q1 and Q3 are the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
Bivariate Analysis (Numerical and Numerical):
1) Scatter Plot:
A scatter plot shows the relationship between two numerical variables, for example the total bill
and the tip from the tips dataset.
sns.scatterplot(tips["total_bill"], tips["tip"])
Multivariate analysis with scatter plot:
We can also plot 3 variable or 4 variable relationships with scatter plot. suppose we want to
find the separate ratio of male and female with total bill and tip provided.
We can also see 4 variable multivariate analyses with scatter plots using style argument. Suppose
along with gender we also want to know whether the customer was a smoker or not so we can
do this.
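A sketch of the 3-variable and 4-variable scatter plots described above, using the hue and style
arguments of scatterplot (the tips dataset columns total_bill, tip, sex and smoker are assumed):

# 3 variables: colour the points by gender
sns.scatterplot(x='total_bill', y='tip', hue='sex', data=tips)
plt.show()

# 4 variables: additionally vary the marker style by smoker status
sns.scatterplot(x='total_bill', y='tip', hue='sex', style='smoker', data=tips)
plt.show()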
Numerical and Categorical:
1) Bar Plot:
Bar plot is a simple plot in which we put a categorical variable on the x-axis and a
numerical variable on the y-axis and explore the relationship between the two. The black tip
on top of each bar shows the confidence interval. Let us explore Pclass against age.
sns.barplot(data['Pclass'], data['Age'])
plt.show()
2) Boxplot:
sns.boxplot(data['Sex'], data["Age"])
3) Distplot:
We can also plot separate distplots of age for the passengers who died and for those who
survived, as sketched below. In such a graph, the blue curve shows the distribution for passengers
who died and the orange curve the distribution for those who survived. If we observe it, we can
see that children's survival probability is higher than their probability of dying, and the opposite
holds in the case of aged people. Such a small analysis sometimes tells big things about the data
and helps while preparing data stories.
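A sketch of the overlaid distplots the paragraph above refers to (column names as in the Titanic
training CSV; distplot is deprecated in newer Seaborn versions in favour of kdeplot):

sns.distplot(data[data['Survived'] == 0]['Age'].dropna(), hist=False, color='blue')
sns.distplot(data[data['Survived'] == 1]['Age'].dropna(), hist=False, color='orange')
plt.show()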
Categorical and Categorical:
Now, we will work on categorical and categorical columns.
1) Heatmap:
If you have ever used the crosstab function of pandas, then a heatmap is a similar visual
representation of it. It basically shows how strongly one category is present with respect to
another category in the dataset. Let me show this first with crosstab and then with a heatmap.
pd.crosstab(data['Pclass'], data['Survived'])
Now with heatmap, we have to find how many people survived and died.
sns.heatmap(pd.crosstab(data['Pclass'], data['Survived']))
2) Cluster map:
We can also use a cluster map to understand the relationship between two categorical variables.
A cluster map basically plots a dendrogram that shows the categories of similar behavior
together.
sns.clustermap(pd.crosstab(data['Parch'], data['Survived']))
plt.show()
Conclusion: In this assignment, we covered how to plot different types of data visualization
plots, and also studied how to draw inferences from them.
Group A
Assignment No: 10
Objective of the Assignment: Students should be able to perform the data Visualization
operation using Python on any open source dataset .
Prerequisite:
1. Basic of Python Programming
2. Seaborn Library, Concept of Data Visualization.
3. Types of variables
Theory:
Boxplot:A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that
facilitates comparisons between variables or across levels of a categorical variable. The box shows the
quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points
that are determined to be “outliers” using a method that is a function of the inter-quartile range.
Draw a single horizontal boxplot, assigning the data directly to the coordinate variable:
import seaborn as sns
df = sns.load_dataset("iris")
sns.boxplot(x=df["sepal_length"])
Histograms:
A histogram is basically used to represent data grouped into ranges. It is an accurate method
for the graphical representation of a numerical data distribution. It is a type of bar plot where the
X-axis represents the bin ranges while the Y-axis gives information about frequency.
Syntax: matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None,
cumulative=False, bottom=None, histtype='bar', align='mid', orientation='vertical',
rwidth=None, log=False, color=None, label=None, stacked=False, *, data=None, **kwargs)
The following table shows some of the parameters accepted by the matplotlib.pyplot.hist() function:
Attribute - Description
histtype - optional parameter used to select the type of histogram [bar, barstacked, step, stepfilled]; the default is "bar"
align - optional parameter that controls the plotting of the histogram [left, right, mid]
weights - optional parameter, an array of weights of the same shape as x
rwidth - optional parameter giving the relative width of the bars with respect to the bin width
label - optional parameter, a string or sequence of strings to match with multiple datasets
Algorithm:
1. Import required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
2. Create the data frame for downloaded iris.csv dataset.
os.chdir("D:\Pandas")
df =
pd.read_csv("Iris.csv")df
3. Apply data preprocessing techniques.
df.isnull().sum()
df.describe()
4. Plot the box plot for each feature in the dataset and observe and detect the
outliers.
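A sketch of step 4, assuming the iris.csv column names SepalLengthCm, SepalWidthCm,
PetalLengthCm and PetalWidthCm (the names vary between copies of the file):

numeric_cols = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
for col in numeric_cols:
    sns.boxplot(x=df[col])
    plt.title(col)
    plt.show()

# Histogram of one feature to inspect its distribution
plt.hist(df['SepalLengthCm'], bins=10)
plt.show()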
Conclusion: In this assignment, we covered how to plot boxplot and Histogram on iris dataset. Also
studied how to generate inference from it.
DS&BDL
Assignment 11
Aim:
Write a code in JAVA for a simple WordCount application that counts the number of occurrences
of each word in a given input set using the Hadoop MapReduce framework on local-standalone set-
up.
Objective:
By completing this task, students will learn the following
Theory:
Map and Reduce tasks in Hadoop: Within a MapReduce job there are two separate tasks, the map task
and the reduce task.
Map task- A MapReduce job splits the input dataset into independent chunks known as input splits in
Hadoop which are processed by the map tasks in a completely parallel manner. Hadoop framework
creates separate map task for each input split.
Reduce task- The output of the maps is sorted by the Hadoop framework which then becomes input
to the reduce tasks.
Hadoop MapReduce framework operates exclusively on <key, value> pairs. In a MapReduce job, the
input to the Map function is a set of <key, value> pairs and output is also a set of <key, value> pairs.
The output <key, value> pair may have different type from the input <key, value> pair.
The output from the map tasks is sorted by the Hadoop framework. MapReduce guarantees that the
input to every reducer is sorted by key. Input and output of the reduce task can be represented as follows.
WordCount example reads text files and counts the frequency of the words. Each mapper takes
a line of the input file as input and breaks it into words. It then emits a key/value pair of the
word (In the form of (word, 1)) and each reducer sums the counts for each word and emits a
single key/value with the word and sum.
In the word count MapReduce code there is a Mapper class (MyMapper) with map function
and a Reducer class (MyReducer) with a reduce function.
1. Map function
From the wordcount.txt file Each line will be passed to the map function in the following
format.
<0, Hello wordcount MapReduce Hadoop program.>
<41, This is my first MapReduce program.>
In the map function the line is split on space and each word is written to the context along
with the value as 1.
So the output from the map function for the two lines will be as follows.
<Hello, 1> <wordcount, 1> <MapReduce, 1> <Hadoop, 1> <program., 1>
<This, 1> <is, 1> <my, 1> <first, 1> <MapReduce, 1> <program., 1>
You will also need to add at least the following Hadoop jars so that your code can compile. You will
find these jars inside the /share/hadoop directory of your Hadoop installation. Within the
/share/hadoop path, look in the hdfs, mapreduce and common directories for the required jars.
hadoop-common-2.9.0.jar
hadoop-hdfs-2.9.0.jar
hadoop-hdfs-client-2.9.0.jar
hadoop-mapreduce-client-core-2.9.0.jar
hadoop-mapreduce-client-common-2.9.0.jar
hadoop-mapreduce-client-jobclient-2.9.0.jar
hadoop-mapreduce-client-hs-2.9.0.jar
hadoop-mapreduce-client-app-2.9.0.jar
commons-io-2.4.jar
Once you are able to compile your code you need to create jar file.
In the Eclipse IDE, right-click on your Java program and select Export - Java - JAR file.
Once your word count MapReduce program is successfully executed, you can verify the output
file.
hdfs dfs -ls /user/out
Found 2 items
As you can see Hadoop framework creates output files using part-r-xxxx format. Since only one
reducer is used here so there is only one output file part-r-00000. You can see the content of the
file using the following command.
hdfs dfs -cat /user/out/part-r-00000
Hadoop 1
Hello 1
MapReduce 2
This 1
first 1
is 1
my 1
program. 2
wordcount 1
Conclusion: In this assignment, we have learned what is HDFS and How Hadoop
MapReduce framework is used to counts the number of occurrences of each word in a given
input set.
Group B
Assignment No: 2
Theory:
● Steps to Install Hadoop for distributed environment
Step 1) Go to the Hadoop installation directory and format the NameNode:
cd hadoop-2.7.3
bin/hdfs namenode -format
Step 2) Once the NameNode is formatted, go to the hadoop-2.7.3/sbin directory and start all the
daemons/nodes.
cd hadoop-2.7.3/sbin
1) Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files
stored in the HDFS and tracks all the files stored across the cluster.
./hadoop-daemon.sh start namenode
2) Start DataNode:
On startup, a DataNode connects to the Namenode and it responds to the requests from the
Namenode for different operations.
./hadoop-daemon.sh start datanode
3) Start ResourceManager:
ResourceManager is the master that arbitrates all the available cluster resources and thus helps in
managing the distributed applications running on the YARN system. Its work is to manage each
NodeManager and each application's ApplicationMaster.
./yarn-daemon.sh start resourcemanager
4) Start NodeManager:
The NodeManager in each machine framework is the agent which is responsible for managing
containers, monitoring their resource usage and reporting the same to the ResourceManager.
./yarn-daemon.sh start nodemanager
5) Start JobHistoryServer:
The JobHistoryServer is responsible for servicing all job-history related requests from clients.
./mr-jobhistory-daemon.sh start historyserver
Step 3) To check that all the Hadoop services are up and running, run the below command.
jps
Step 4) cd
Step 9) cd mapreduce_vijay/
Step 10) ls
Step 11) sudo chmod +r *.*
Step 12) export CLASSPATH="/home/vijay/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-
mapreduce-client-core-2.7.3.jar:/home/vijay/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-
mapreduce-client-common-2.7.3.jar:/home/vijay/hadoop-2.7.3/share/hadoop/common/hadoop-
common-2.7.3.jar:~/mapreduce_vijay/SalesCountry/*:$HADOOP_HOME/lib/*"
Step 17) cd ..
Step 20) ls
Step 21) cd
----------------------------------------------------------------------------------------------------------------
Objective:
Theory:
1) Install Scala
Step 2) Install Scala from the apt repository by running the following commands to search for
Scala and install it.
sudo apt search scala
sudo apt install scala -y
Apache Spark is an open-source, distributed processing system used for big data workloads. It
utilizes in-memory caching, and optimized query execution for fast analytic queries against data
of any size.
Step 1) Now go to the official Apache Spark download page and grab the latest version (i.e. 3.2.1)
at the time of writing this article. Alternatively, you can use the wget command to download the
file directly in the terminal.
wget https://apachemirror.wuchna.com/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
Step 4) Now you have to set a few environment variables in the .profile file before starting up
Spark, typically SPARK_HOME (pointing to the Spark installation, e.g. /opt/spark) and PATH
(extended with $SPARK_HOME/bin and $SPARK_HOME/sbin).
Step 5) To make sure that these new environment variables are reachable within the shell and
available to Apache Spark, it is also mandatory to run the following command to take recent
changes into effect.
source ~/.profile
Step 6) ls -l /opt/spark
Step 7) Run the following command to start the Spark master service and slave
service. start-master.sh
start-workers.sh spark://localhost:7077
(if workers not starting then remove and install openssh:
sudo apt-get remove openssh-client openssh-server
sudo apt-get install openssh-client openssh-server)
Step 8) Once the services are started, go to the browser and type the following URL to access the
Spark page. From the page, you can see that the master and worker services have started.
http://localhost:8080/
Step 9) You can also check if spark-shell works fine by launching the spark-shell command.
spark-shell
object ExampleString {
  def main(args: Array[String]) {
    // the original string-handling statements are not shown; print a sample string
    println("Hello, Scala")
  }
}
object CheckNumberSign {              // enclosing object and main method assumed
  def main(args: Array[String]) {
    /** declare a variable */
    var number = -100
    if (number == 0) {
      println("number is zero")
    }
    else if (number > 0) {
      println("number is positive")
    }
    else {
      println("number is negative")
    }
  }
}
object LargestNumber {                // enclosing object and sample values assumed
  def main(args: Array[String]) {
    var number1 = 20
    var number2 = 30
    if (number1 > number2) {
      println("Largest number is:" + number1)
    }
    else {
      println("Largest number is:" + number2)
    }
  }
}
Conclusion: Thus, we have studied how to execute a program in Scala using the Apache Spark
framework.