DSBDA Lab Manual
Group A
Assignment No: 1
1. Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are
similar to table rows, but the columns can contain not only strings or numbers, but also
nested data structures such as lists, maps, and other records.
Instance: A single row of data is called an instance. It is an observation from the domain.
Department of Computer Engineering Subject : DSBDAL
Data Type: Features have a data type. They may be real or integer-valued or may have a
categorical or ordinal value. You can have strings, dates, times, and more complex types,
but typically they are reduced to real or categorical values when working with traditional
machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning
methods we typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train
our model.
Testing Dataset: A dataset that we use to evaluate the accuracy of our model but that is
not used to train the model. It is sometimes loosely called the validation dataset.
Data Represented in a Table:
Data should be arranged in a two-dimensional space made up of rows and columns. This
type of data structure makes it easy to understand the data and pinpoint any problems. One
example is raw data stored as a CSV (comma-separated values) file.
Pandas dtype   Python type   NumPy type                                Usage
int64          int           int_, int8, int16, int32, int64, uint8,   Integer numbers
                             uint16, uint32, uint64
1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic Slicing and Advanced Indexing in NumPy Python
c. Matplotlib
This is undoubtedly my favorite and a quintessential Python library. You can create
stories with the data visualized with Matplotlib. Another library from the SciPy Stack,
Matplotlib plots 2D figures.
From histograms, bar plots, scatter plots, and area plots to pie plots, Matplotlib can depict
a wide range of visualizations. With a bit of effort, you can create almost any visualization
with Matplotlib:
● Line plots
● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots
● Quiver plots
● Spectrograms
Matplotlib also supports labels, grids, legends, and other formatting elements.
d. Seaborn
The official Seaborn documentation defines it as a data visualization library based on
Matplotlib that provides a high-level interface for drawing attractive and informative
statistical graphics. Put simply, Seaborn is an extension of Matplotlib with advanced
features.
Introduced to the world as a Google Summer of Code project, Scikit Learn is a robust
machine learning library for Python. It features ML algorithms such as SVMs, random
forests, k-means clustering, spectral clustering, and mean shift, along with model
evaluation tools such as cross-validation. Scikit Learn is built on NumPy and SciPy, so it
works directly with their arrays and scientific routines.
3. Description of Dataset:
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple
Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning
Repository.
It includes three iris species with 50 samples each as well as some properties about each
flower. One flower species is linearly separable from the other two, but the other two are not
linearly separable from each other.
Total Sample- 150
The columns in this dataset are:
1. Id
2. SepalLengthCm
3. SepalWidthCm
4. PetalLengthCm
5. PetalWidthCm
6. Species
Each of the 3 species contains 50 samples.
Description of Dataset-
3. The csv file at the UCI repository does not contain the variable/column names. They are
located in a separate file.
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
4. Read in the dataset from the UCI Machine Learning Repository link and specify the column
names to use
iris = pd.read_csv(csv_url, names = col_names)
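As a minimal sketch of the same call, the example below reads header-less rows from an in-memory string instead of the repository link. The five rows and the name raw_csv are illustrative assumptions; only col_names and the names= argument come from the steps above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the header-less UCI file: five rows in the same layout.
raw_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

# names= supplies the column labels, so the first data row is not consumed as a header.
iris = pd.read_csv(raw_csv, names=col_names)
print(iris.head())
```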
2    dataset.tail(n=5)       Return the last n rows.
17   dataset.iloc[:m, :n]    A subset of the first m rows and the first n columns.
dataset[cols_2_4]
c. Count of missing values across each column using isna() and isnull()
To count the missing values in the entire dataframe, the isnull() function is used. The
first sum() adds up the missing values column-wise, and a second sum() then gives the
count of missing values for the entire dataframe.
Function: dataframe.isnull().sum().sum()
Output : 8
d. Count row-wise missing values using isnull()
Function: dataframe.isnull().sum(axis = 1)
Output:
Output:
Method 2:
Function: dataframe.isna().sum()
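The three counting patterns above (per column, per row, whole dataframe) can be sketched on a small frame. The frame df_demo and its values are hypothetical and chosen only so the counts are easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical data with four missing entries in total.
df_demo = pd.DataFrame({
    "math score": [72, np.nan, 90, np.nan],
    "reading score": [70, 88, np.nan, 64],
    "gender": ["female", None, "male", "male"],
})

per_column = df_demo.isnull().sum()        # missing values in each column
per_row = df_demo.isnull().sum(axis=1)     # missing values in each row
total = df_demo.isnull().sum().sum()       # missing values in the whole frame

print(per_column)
print(per_row)
print(total)
```

isna() is an alias of isnull(), so df_demo.isna().sum() gives the same per-column counts.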
analyzed or modelled effectively, and there are several techniques for this process.
a. Data Formatting: Ensuring all data formats are correct (e.g. object, text, floating
number, integer, etc.) is another part of this initial ‘cleaning’ process. If you are
working with dates in Pandas, they also need to be stored in the correct date-time
format to use special date-time functions.
b. Data normalization: Data normalization maps all numeric data values onto a uniform
scale (e.g. from 0 to 1). Making the ranges consistent across variables helps with
statistical analysis and ensures better comparisons later on. Scaling to the range
0 to 1 is also known as Min-Max scaling.
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Print iris dataset.
df.head()
Step 5: Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()
Step 6: Separate the feature from the class label
x=df.iloc[:,:4]
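The steps above can be followed through to a scaled dataframe. This is a minimal sketch, assuming the four feature columns come first as in x=df.iloc[:,:4]; the four rows in df_demo and the name df_normalized are illustrative assumptions.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical slice of the iris data: four features followed by the class label.
df_demo = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.3],
    "PetalLengthCm": [1.4, 1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Separate the features from the class label.
x = df_demo.iloc[:, :4]

# Rescale every feature column to the range [0, 1].
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)

# fit_transform returns a NumPy array, so wrap it back into a dataframe.
df_normalized = pd.DataFrame(x_scaled, columns=x.columns)
print(df_normalized)
```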
● Categorical features refer to string type data and can be easily understood by
human beings. A machine, however, cannot interpret categorical data directly.
Therefore, the categorical data must be translated into numerical data that can
be understood by the machine.
There are many ways to convert categorical data into numerical data. Here the three most used
methods are discussed.
a. Label Encoding: Label Encoding refers to converting the labels into a numeric form
so as to convert them into the machine-readable form. It is an important preprocessing
step for the structured dataset in supervised learning.
Example : Suppose we have a column Height in some dataset. After applying label
encoding, the Height column is converted into:
where 0 is the label for tall, 1 is the label for medium, and 2 is a label for short height.
Label Encoding on iris dataset: For the iris dataset, the target column is Species. It
contains three species: Iris-setosa, Iris-versicolor, and Iris-virginica.
Sklearn Functions for Label Encoding:
● preprocessing.LabelEncoder : Encodes labels with values between 0
and n_classes-1.
● fit_transform(y):
Parameters: y : array-like of shape (n_samples,)
Target values.
Returns: y : array-like of shape (n_samples,)
Encoded labels.
This transformer should be used to encode target values, and not the input.
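A minimal sketch of these two calls on the three species names is shown below. The list species is illustrative; LabelEncoder assigns codes in sorted order of the class names.

```python
from sklearn import preprocessing

# Illustrative target values; the real column would be df['Species'].
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

label_encoder = preprocessing.LabelEncoder()
encoded = label_encoder.fit_transform(species)

print(list(label_encoder.classes_))  # class names in the order of their codes
print(list(encoded))                 # integer code for each input value
```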
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
● Use LabelEncoder when there are only two possible values of a categorical feature.
For example, features having value such as yes or no. Or, maybe, gender features
when there are only two possible values including male or female.
Limitation: Label encoding converts the data into machine-readable form, but it assigns a
unique number (starting from 0) to each class of data. This may introduce priority
issues in the dataset: a label with a high value may be considered to have higher
priority than a label with a lower value.
b. One-Hot Encoding:
In one-hot encoding, we create a new set of dummy (binary) variables that is equal to the
number of categories (k) in the variable. For example, let’s say we have a categorical
variable Color with three categories called “Red”, “Green” and “Blue”, we need to use
three dummy variables to encode this variable using one-hot encoding. A dummy
(binary) variable just takes the value 0 or 1 to indicate the exclusion or inclusion of a
category.
In one-hot encoding,
“Red” color is encoded as [1 0 0] vector of size 3.
“Green” color is encoded as [0 1 0] vector of size 3.
“Blue” color is encoded as [0 0 1] vector of size 3.
One-hot encoding on iris dataset: For the iris dataset, the target column is Species. It
contains three species: Iris-setosa, Iris-versicolor, and Iris-virginica.
Sklearn Functions for One-hot Encoding:
● sklearn.preprocessing.OneHotEncoder(): Encode categorical
integer features using a one-hot aka one-of-K scheme
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
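Step 4: Apply the label_encoder object to label-encode the Species column, then observe
the unique values for the Species column.
label_encoder = preprocessing.LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)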
Step 5: Remove the target variable from dataset
features_df=df.drop(columns=['Species'])
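Step 6: Apply the one-hot encoder to the Species column. Call toarray() on the encoder
output before building the dataframe.
enc = preprocessing.OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df[['Species']]).toarray())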
Step 7: Join the encoded values with Features variable
df_encode = features_df.join(enc_df)
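The encode-and-join steps above can be sketched end to end on a three-row frame. The frame df_demo is hypothetical, and naming the new columns from enc.categories_ is an added convenience rather than part of the original steps.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical three-row frame with one feature and the target column.
df_demo = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Remove the target variable from the dataset.
features_df = df_demo.drop(columns=["Species"])

# One-hot encode the target; the default output is sparse, so convert it with toarray().
enc = preprocessing.OneHotEncoder()
encoded = enc.fit_transform(df_demo[["Species"]]).toarray()
enc_df = pd.DataFrame(encoded, columns=enc.categories_[0])

# Join the encoded columns with the features (both share the default 0..n-1 index).
df_encode = features_df.join(enc_df)
print(df_encode)
```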
Dummy encoding also uses dummy (binary) variables. Instead of creating a number of
dummy variables that is equal to the number of categories (k) in the variable, dummy
encoding uses k-1 dummy variables. To encode the same Color variable with three
categories using the dummy encoding, we need to use only two dummy variables.
In dummy encoding,
“Red” color is encoded as [1 0] vector of size 2.
“Green” color is encoded as [0 1] vector of size 2.
“Blue” color is encoded as [0 0] vector of size 2.
Dummy encoding removes the redundant category present in one-hot encoding.
Pandas Functions for One-hot Encoding with dummy variables:
● pandas.get_dummies(data, prefix=None, prefix_sep='_',
dummy_na=False, columns=None, sparse=False,
drop_first=False, dtype=None): Convert categorical variable into
dummy/indicator variables.
● Parameters:
data : array-like, Series, or DataFrame
Data of which to get dummy indicators.
prefix : str, list of str, or dict of str, default None
String to append DataFrame column names.
prefix_sep : str, default ‘_’
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as
with prefix.
dummy_na : bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.
columns : list-like, default None
Column names in the DataFrame to be encoded. If columns is None then all the
columns with object or category dtype will be converted.
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
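Step 4: Apply the label_encoder object to label-encode the Species column, then observe
the unique values for the Species column.
label_encoder = preprocessing.LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)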
Step 6: Apply one_hot encoder with dummy variables for Species column.
one_hot_df = pd.get_dummies(df, prefix="Species",
columns=['Species'], drop_first=True)
Step 7: Observe the merged dataframe
one_hot_df
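A minimal sketch of dummy encoding with drop_first=True follows, run directly on the species names rather than on label-encoded integers. The frame df_demo is hypothetical. With k = 3 categories, only k-1 = 2 indicator columns remain, and the dropped first category is represented by zeros in both.

```python
import pandas as pd

# Hypothetical three-row frame, one row per species.
df_demo = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

one_hot_df = pd.get_dummies(df_demo, prefix="Species",
                            columns=["Species"], drop_first=True)
print(one_hot_df)
```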
Conclusion: In this way, we have explored the functions of Python libraries for data
preprocessing, data wrangling techniques, and handling missing values on the Iris dataset.
Assignment Question
1. Explain Data Frame with Suitable example.
2. What is the limitation of the label encoding method?
3. What is the need of data normalization?
4. What are the different Techniques for Handling the Missing Data?
Group A
Assignment No: 2
Step 2: Enter the name of the dataset and save the dataset as type CSV (MS-DOS).
Step 3: Fill the data by using the RANDBETWEEN function. For every feature, fill
the data by considering the range specified above.
One example is given:
The placement count largely depends on the placement score. It is assumed that if the
placement score is below 75, 1 offer is facilitated; if it is between 75 and 85, 2 offers
are facilitated; otherwise (above 85), 3 offers are facilitated. A nested IF formula is
used for ease of data filling.
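The nested-IF rule can also be expressed in Python as a quick check of the spreadsheet logic. This is only a sketch: the function name offers_for_score, the score range 60 to 100, and the treatment of a score of exactly 75 as 2 offers are assumptions, since the rule above does not say which side 75 falls on.

```python
import numpy as np

def offers_for_score(score):
    """Return the offer count for a placement score (hypothetical helper)."""
    if score > 85:
        return 3
    if score >= 75:  # assumption: a score of exactly 75 counts as 2 offers
        return 2
    return 1

# Random integer scores, analogous to RANDBETWEEN(60, 100) in the spreadsheet.
rng = np.random.default_rng(seed=0)
placement_scores = rng.integers(60, 101, size=10)  # upper bound is exclusive
placement_counts = [offers_for_score(s) for s in placement_scores]

print(list(zip(placement_scores.tolist(), placement_counts)))
```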
Step 4: In 20% of the data, introduce impurities. The range of math score is [60,80], so
update a few instances with values below 60 or above 80. Repeat this for Writing_Score
[60,80], Placement_Score [75,100], and Club_Join_Date [2018,2021].
Step 5: To violate the rule for the response variable, update a few values. For example,
where the placement score is greater than 85, record only 1 offer.
1. None: None is a Python singleton object that is often used for missing data in
Python code.
2. NaN : NaN (an acronym for Not a Number), is a special floating-point value
recognized by all systems that use the standard IEEE floating-point
representation.
Pandas treats None and NaN as essentially interchangeable for indicating missing
or null values. To facilitate this convention, there are several useful functions for
detecting, removing, and replacing null values in a Pandas DataFrame:
● isnull()
● notnull()
● dropna()
● fillna()
● replace()
1. Checking for missing values using isnull() and notnull()
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 5: Create a Boolean series that is True for NaN values in a specific column, for
example math score, and display only the rows where math score is NaN
series = pd.isnull(df["math score"])
df[series]
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 5: Create a Boolean series that is True for non-NaN values in a specific column, for
example math score, and display only the rows where math score is not NaN
series1 = pd.notnull(df["math score"])
df[series1]
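The two Boolean masks can be compared side by side on a small frame. The frame df_demo is hypothetical; the real steps load StudentsPerformanceTest1.csv instead.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 3 have no math score.
df_demo = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [72, np.nan, 90, np.nan],
})

series = pd.isnull(df_demo["math score"])    # True where math score is NaN
series1 = pd.notnull(df_demo["math score"])  # True where math score is present

rows_missing = df_demo[series]
rows_present = df_demo[series1]
print(rows_missing)
print(rows_present)
```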
Note that there are also categorical values in the dataset; for these, you need to use
Label Encoding or One Hot Encoding.
■ from sklearn.preprocessing import LabelEncoder
■ le = LabelEncoder()
■ df['gender'] = le.fit_transform(df['gender'])
■ newdf=df
df
To fill null values in a dataset, the fillna() and replace() functions are used.
These functions replace NaN values with a value that you specify.
df = pd.read_csv("StudentsPerformanceTest1.csv", na_values =
missing_values)
df
Step 5: Filling missing values using the mean, median, or standard deviation of that
column.
The following line replaces NaN values in the dataframe with the value -99:
ndf.replace(to_replace = np.nan, value = -99)
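A minimal sketch of filling with the column mean or median, next to the -99 replacement, is given below. The frame df_demo and the result names are hypothetical. Neither fillna() nor replace() modifies the frame in place here, so the results are assigned to new variables.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing scores; the observed mean and median are both 70.
df_demo = pd.DataFrame({"math score": [60, np.nan, 80, np.nan, 70]})

mean_filled = df_demo["math score"].fillna(df_demo["math score"].mean())
median_filled = df_demo["math score"].fillna(df_demo["math score"].median())

# Replace every NaN in the frame with the sentinel value -99.
flagged = df_demo.replace(to_replace=np.nan, value=-99)

print(mean_filled.tolist())
print(flagged["math score"].tolist())
```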
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4:To drop rows with at least 1 null value
ndf.dropna()
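The effect of dropna() and two of its common options can be sketched as follows. The frame df_demo is hypothetical, and the how= and axis= variants are shown only for comparison with the default call used above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column C is complete, A and B contain NaN.
df_demo = pd.DataFrame({
    "A": [1.0, np.nan, np.nan],
    "B": [4.0, 5.0, np.nan],
    "C": [7.0, 8.0, 9.0],
})

rows_dropped = df_demo.dropna()              # drop rows with at least one null value
all_null_dropped = df_demo.dropna(how="all") # drop only rows where every value is null
cols_dropped = df_demo.dropna(axis=1)        # drop columns with at least one null value

print(rows_dropped)
```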
Similarly, an Outlier is an observation in a given dataset that lies far from the rest
of the observations. That means an outlier is vastly larger or smaller than the remaining
values in the set.
Mean is an accurate measure to describe the data when no outliers are present.
Median is preferred if there is an outlier in the dataset. Mode is used if there
is an outlier AND about ½ or more of the data is the same.
Of these measures of central tendency, the mean is the one most affected by outliers,
which in turn impacts the standard deviation.
■ Example:
Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By
looking at it, one can quickly say ‘101’ is an outlier that is much larger than the other
values.
From the above calculations, we can clearly say the Mean is more affected than the
Median.
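The comparison can be reproduced with Python's statistics module on the sample given above. The variable names are illustrative; the data is the sample from the example, with and without the value 101.

```python
import statistics

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
without_outlier = [v for v in sample if v != 101]

mean_with = statistics.mean(sample)               # 241 / 12, about 20.08
median_with = statistics.median(sample)           # 14.0
mean_without = statistics.mean(without_outlier)   # 140 / 11, about 12.73
median_without = statistics.median(without_outlier)  # 13

print(mean_with, median_with)
print(mean_without, median_without)
```

Removing the single outlier moves the mean by about 7.4 but the median by only 1.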
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But
what if we have a huge dataset, how do we identify the outliers then? We need to use
visualization and mathematical techniques.
● Boxplots
● Scatterplots
● Z-score
● Interquartile Range (IQR)
Step 4:Select the columns for boxplot and draw the boxplot.
Step 5: We can now print the outliers for each column with reference to the box plot.
print(np.where(df['math score']>90))
print(np.where(df['reading score']<25))
print(np.where(df['writing score']<30))
A scatter plot is used when you have paired numerical data, when your dependent variable
has multiple values for each value of the independent variable, or when trying to determine
the relationship between the two variables. A scatter plot can also be used for outlier
detection.
To plot the scatter plot one requires two variables that are somehow related to
each other. So here Placement score and Placement count features are used.
Algorithm:
Step 1 : Import pandas , numpy and matplotlib libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Step 4: Draw the scatter plot with placement score and placement offer count
fig, ax = plt.subplots(figsize = (18,10))
ax.scatter(df['placement score'], df['placement offer count'])
plt.show()
Labels can be assigned to the axes (optional; set them before calling plt.show())
ax.set_xlabel('placement score')
ax.set_ylabel('placement offer count')
Step 5: We can now print the outliers with reference to scatter plot.
print(np.where((df['placement score']<50) & (df['placement offer count']>1)))
print(np.where((df['placement score']>85) & (df['placement offer count']<3)))
Algorithm:
Step 1 : Import numpy and stats from scipy libraries
import numpy as np
from scipy import stats
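A minimal sketch of Z-score outlier detection follows. It computes the scores with NumPy alone; scipy.stats.zscore should give the same values with its default settings. The cut-off of 3 is a common convention assumed here, and the data is the small sample used in the outlier example earlier.

```python
import numpy as np

sample = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9])
threshold = 3  # assumed cut-off: flag points more than 3 standard deviations away

# Z-score: distance from the mean in units of the (population) standard deviation.
z_scores = (sample - sample.mean()) / sample.std()

outliers = sample[np.abs(z_scores) > threshold]
print(outliers)
```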
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
In the above formula, following statistical convention, the IQR is scaled up by 0.5
(new_IQR = IQR + 0.5*IQR), which gives the factor of 1.5.
Algorithm:
Step 1 : Import numpy library
import numpy as np
q3 = np.percentile(sorted_rscore, 75)
print(q1,q3)
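The full IQR procedure can be sketched as below. The name sorted_rscore follows the fragment above, but here it holds the small sample from the earlier outlier example rather than the reading scores, which is an assumption made to keep the sketch self-contained.

```python
import numpy as np

sample = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9])
sorted_rscore = np.sort(sample)

# First and third quartiles and the interquartile range.
q1 = np.percentile(sorted_rscore, 25)
q3 = np.percentile(sorted_rscore, 75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = sorted_rscore[(sorted_rscore < lower) | (sorted_rscore > upper)]

print(q1, q3)
print(outliers)
```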
Handling of Outliers:
To remove an outlier, follow the same process as removing any entry from the dataset
using its exact position, because the end result of every detection method above is a
list of the data items that satisfy that method's outlier definition.
df_stud.insert(1,"m score",b,True)
df_stud
● Mean/Median imputation:
As the mean value is highly influenced by the outliers, it is advised to replace the
outliers with the median value.
1. Plot the box plot for reading score
col = ['reading score']
df.boxplot(col)
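Median imputation can be sketched without the plot by using the same IQR fences that a box plot draws. The frame df_demo and its values are hypothetical, and using the IQR rule to pick the outliers is an assumption about how the truncated steps continue.

```python
import pandas as pd

# Hypothetical reading scores with one obviously low value.
df_demo = pd.DataFrame({"reading score": [72.0, 68.0, 75.0, 3.0, 70.0, 74.0, 69.0]})
col = df_demo["reading score"]

# IQR fences, as used by a box plot.
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Replace values outside the fences with the column median.
median = col.median()
is_outlier = (col < lower) | (col > upper)
df_demo.loc[is_outlier, "reading score"] = median

print(df_demo["reading score"].tolist())
```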
integrate these data sources into a data analysis description. This is a crucial step
since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used.
● Generalization: It converts low-level data attributes to high-level data attributes
using a concept hierarchy. For example, Age initially in numerical form (22, 25) is
converted into a categorical value (young, old).
● Normalization: Data normalization involves converting all data variables into a
given range. Some of the techniques that are used for accomplishing normalization
are:
○ Min–max normalization: This transforms the original data linearly.
○ Z-score normalization: In z-score normalization (or zero-mean normalization)
the values of an attribute (A), are normalized based on the mean of A and its
standard deviation.
○ Normalization by decimal scaling: It normalizes the values of an attribute by
changing the position of their decimal points
● Attribute or feature construction.
○ New attributes constructed from the given ones: Where new attributes are
created & applied to assist the mining process from the given set of attributes.
This simplifies the original data & makes the mining more efficient.
In this assignment, the purpose of this transformation should be one of the
following reasons:
A positively skewed distribution means that the extreme data results are larger.
This skews the data in that it brings the mean (average) up. The mean will be
larger than the median in a Positively skewed distribution.
A negatively skewed distribution means the opposite: that the extreme data
results are smaller. This means that the mean is brought down, and the median is
larger than the mean in a negatively skewed distribution.
Algorithm:
Step 1 : Detecting outliers using Z-Score for the Math_score variable and
remove the outliers.
Step 2: Observe the histogram for math_score variable.
import matplotlib.pyplot as plt
new_df['math score'].plot(kind = 'hist')
Step 3: Convert the variable to a base-10 logarithmic scale.
df['log_math'] = np.log10(df['math score'])
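The effect of the log transform on skewness can be checked numerically. The values below are hypothetical and deliberately right-skewed, not real scores. np.log10 requires strictly positive inputs, so zeros or negative values would need handling first.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values standing in for a score column.
df_demo = pd.DataFrame({"math score": [10, 12, 15, 20, 30, 60, 100]})

df_demo["log_math"] = np.log10(df_demo["math score"])

skew_before = df_demo["math score"].skew()
skew_after = df_demo["log_math"].skew()
print(skew_before, skew_after)  # the transformed column is less skewed
```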
Group A
Assignment No: 3
1. Summary statistics
2. Types of Variables
1. Summary statistics:
● What is Statistics?
Statistics is the science of collecting data and analysing them to infer proportions (sample)
that are representative of the population. In other words, statistics is interpreting data in
order to make predictions for the population.
Branches of Statistics:
There are two branches of Statistics.
DESCRIPTIVE STATISTICS : A descriptive statistic is a measure that describes the data.
INFERENTIAL STATISTICS : Using a random sample of data taken from a population to
describe and make inferences about the population is called Inferential Statistics.
Descriptive Statistics
Descriptive Statistics is summarising the data at hand through certain numbers like mean,
median etc. so as to make the understanding of the data easier. It does not involve any
generalisation or inference beyond what is available. This means that the descriptive
statistics are just the representation of the data (sample) available and not based on any
theory of probability.
b. Median : Median is the point which divides the entire data into two equal
halves. One-half of the data is less than the median, and the other half is greater
than the same. Median is calculated by first arranging the data in either ascending
or descending order.
○ If the number of observations is odd, the median is given by the middle
observation in the sorted form.
○ If the number of observations is even, the median is given by the mean of the
two middle observations in the sorted form.
An important point to note is that the order of the data (ascending or
descending) does not affect the median.
c. Mode : Mode is the number which has the maximum frequency in the entire data
set, or in other words, mode is the number that appears the maximum number of
times. A data set can have one or more than one mode.
● If there is only one number that appears the maximum number of times,
the data has one mode, and is called Uni-modal.
● If there are two numbers that appear the maximum number of times, the
data has two modes, and is called Bi-modal.
● If there are more than two numbers that appear the maximum number of
times, the data has more than two modes, and is called Multi-modal.
Mode is given by the number that occurs the maximum number of times.
Here, 17 and 21 both occur twice. Hence, this is a Bimodal data and the modes
are 17 and 21.
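A bimodal case like this can be checked with Python's statistics module. The data list below is hypothetical, chosen only so that 17 and 21 each occur twice; it is not the data set from the original example.

```python
import statistics

# Hypothetical data in which 17 and 21 both occur twice.
data = [17, 16, 21, 18, 15, 17, 21, 19]

modes = statistics.multimode(data)  # all values sharing the highest frequency
median = statistics.median(data)    # mean of the two middle values (even count)

print(modes)
print(median)
```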
Measures of Dispersion describe the spread of the data around the central value (or the
Measures of Central Tendency)
1. Absolute Deviation from Mean — The Absolute Deviation from Mean, also
called Mean Absolute Deviation (MAD), describes the variation in the data set, in
the sense that it tells the average absolute distance of each data point in the set
from the mean. It is calculated as
2. Variance — Variance measures how far the data points are spread out from the mean.
A high variance indicates that data points are spread widely and a small variance
indicates that the data points are closer to the mean of the data set. It is calculated
as
4. Range — Range is the difference between the Maximum value and the Minimum
value in the data set. It is given as
5. Quartiles — Quartiles are the points in the data set that divides the data set into
four equal parts. Q1, Q2 and Q3 are the first, second and third quartile of the data
set.
● 25% of the data points lie below Q1 and 75% lie above it.
● 50% of the data points lie below Q2 and 50% lie above it. Q2 is nothing but
Median.
● 75% of the data points lie below Q3 and 25% lie above it.
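The measures of dispersion listed above can be computed with NumPy as a quick check. The data array is hypothetical, and the variance and standard deviation here use the population form (dividing by n); a sample form dividing by n-1 is also common, and which one the original formulas use is not visible in the text.

```python
import numpy as np

# Hypothetical data with mean 5.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()
mad = np.mean(np.abs(data - mean))   # mean absolute deviation from the mean
variance = data.var()                # population variance (divides by n)
std_dev = data.std()                 # population standard deviation
data_range = data.max() - data.min()
q1, q2, q3 = np.percentile(data, [25, 50, 75])

print(mad, variance, std_dev, data_range)
print(q1, q2, q3)
```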
Positive Skew — This is the case when the tail on the right side of the curve is
bigger than that on the left side. For these distributions, mean is greater than the
mode.
Negative Skew — This is the case when the tail on the left side of the curve is
bigger than that on the right side. For these distributions, mean is smaller than the
mode.
Python Code:
1. Mean
To find mean of all columns
Syntax:
df.mean()
Output:
2. Median
To find median of all columns
Syntax:
df.median()
Output:
3. Mode
To find mode of all columns
Syntax:
df.mode()
Output:
In the Genre column the mode is Female, for the Age column the mode is 32, etc. If a
particular column does not have a mode, all of its values will be displayed in
the column.
To find the mode of a specific column.
Syntax:
df.loc[:,'Age'].mode()
Output:
32
4. Minimum
To find minimum of all columns
Syntax:
df.min()
Output:
Output:
18
5. Maximum
To find Maximum of all columns
Syntax:
df.max()
Output:
6. Standard Deviation
To find Standard Deviation of all columns
Syntax:
df.std()
Output:
Syntax:
df.std(axis=1)[0:4]
Output:
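The calls above can be tried together on a small frame. The frame df_demo and its values are hypothetical, with column names taken from the Genre, Age, and Income columns mentioned in this section. Passing numeric_only=True is an addition: recent pandas versions raise an error when mean(), median(), or std() meet a text column such as Genre.

```python
import pandas as pd

# Hypothetical data using the column names mentioned in this section.
df_demo = pd.DataFrame({
    "Genre": ["Female", "Male", "Female", "Female"],
    "Age": [32, 21, 32, 45],
    "Income": [40, 15, 60, 85],
})

means = df_demo.mean(numeric_only=True)
medians = df_demo.median(numeric_only=True)
stds = df_demo.std(numeric_only=True)
modes = df_demo.mode()                 # works on text and numeric columns alike
age_mode = df_demo.loc[:, "Age"].mode()
age_min, age_max = df_demo["Age"].min(), df_demo["Age"].max()

print(means)
print(modes)
```

Income has no repeated value, so every Income value appears in the mode output, while Genre and Age are padded with NaN after their single mode.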
2. Types of Variables:
A variable is a characteristic that can be measured and that can assume different values.
Height, age, income, province or country of birth, grades obtained at school and type of
housing are all examples of variables.
Variables may be classified into two main categories:
● Categorical and
● Numeric.
Each category is then classified in two subcategories: nominal or ordinal for categorical
variables, discrete or continuous for numeric variables.
● Categorical variables
○ Ordinal Variable
An ordinal variable is a variable whose values are defined by an order relation
between the different categories. In the following table, the variable “behaviour”
is ordinal because the category “Excellent” is better than the category “Very
good,” which is better than the category “Good,” etc. There is some natural
ordering, but it is limited since we do not know by how much “Excellent”
behaviour is better than “Very good” behaviour.
● Numerical Variables
A numeric variable (also called quantitative variable) is a quantifiable characteristic
whose values are numbers (except numbers which are codes standing in for categories).
Numeric variables may be either continuous or discrete.
○ Continuous variables
○ Discrete variables
As opposed to a continuous variable, a discrete variable can assume only a finite
number of real values within a given interval.
An example of a discrete variable would be the score given by a judge to a
gymnast in competition: the range is 0 to 10 and the score is always given to one
decimal (e.g. a score of 8.5)
(df_u.groupby(['Genre']).Income.mean())
Output:
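A minimal sketch of this grouped mean is shown below. The frame df_u here is hypothetical and holds only the two columns the expression needs.

```python
import pandas as pd

# Hypothetical data: two income values per category.
df_u = pd.DataFrame({
    "Genre": ["Female", "Male", "Female", "Male"],
    "Income": [40, 20, 60, 30],
})

# Mean income for each category of the Genre variable.
income_by_genre = df_u.groupby(["Genre"]).Income.mean()
print(income_by_genre)
```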
To create a list that contains a numeric value for each response to the categorical variable.
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df[['Genre']]).toarray())
enc_df
6. To display basic statistical details like percentile, mean, standard deviation etc. for
Iris-setosa use describe
print('Iris-setosa')
print(iris[irisSet].describe())
8. To display basic statistical details like percentile, mean, standard deviation etc. for
Iris-versicolor use describe
print('Iris-versicolor')
print(iris[irisVer].describe())
10. To display basic statistical details like percentile, mean, standard deviation etc. for
Iris-virginica use describe
print('Iris-virginica')
print(iris[irisVir].describe())
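The per-species summaries above rely on irisSet, irisVer, and irisVir, whose definitions are not shown. A minimal sketch, assuming they are Boolean masks on the Species column, and using a hypothetical six-row frame in place of the full dataset:

```python
import pandas as pd

# Hypothetical six-row stand-in for the iris dataframe.
iris = pd.DataFrame({
    "Sepal_Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Petal_Length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Assumed definitions: one Boolean mask per species.
irisSet = (iris["Species"] == "Iris-setosa")
irisVer = (iris["Species"] == "Iris-versicolor")
irisVir = (iris["Species"] == "Iris-virginica")

print('Iris-setosa')
print(iris[irisSet].describe())
```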
Conclusion:
Measures of central tendency describe the centre of a data set. They include the
mean, median, and mode.
Measures of variability or spread describe the dispersion of data within the set;
they include standard deviation, variance, and the minimum and maximum values.
Assignment Questions:
1. Explain Measures of Central Tendency with examples.
2. What are the different types of variables? Explain with examples.
3. Which method is used to compute summary statistics of a dataframe? Write the code.
Group A
Assignment No: 4
● Fig. 2 below shows the relation between weight (in Kg) and height (in cm), which is
a linear relation. Simple linear regression is a statistical approach to
summarise and learn the relationships among continuous (quantitative) variables.
● Here a variable, denoted by ‘x’, is considered as the predictor, explanatory, or
independent variable.
● Another variable, denoted ‘y’, is considered as the response, outcome, or
dependent variable. The terms "predictor" and "response" are used to refer to these
variables.
● Simple linear regression is concerned with the study of only one predictor
variable.
Fig.2 : Relation between weight (in Kg) and height (in cm)
MultiVariate Regression: It concerns the study of two or more predictor variables.
Usually a transformation of the original features into polynomial features of a given
degree is preferred, and Linear Regression is then applied to it.
● A simple linear model Y = a + bX in the original feature is transformed into
polynomial features, and a linear regression is then applied to it, giving
something like
Y = a + bX + cX^2
● If a high degree value is used in transformation the curve becomes over-fitted as it
captures the noise from data as well.
where $\hat{y}$ is the dependent or the response variable
$x$ is the independent or the input variable
$\beta_0$ is the value of y when x=0 or the y intercept
● This linear equation represents a line also known as the ‘regression line’. The least square
estimation technique is one of the basic techniques used to estimate the values of the
parameters $\beta_0$ and $\beta_1$ based on a sample set.
● This technique estimates the parameters $\beta_0$ and $\beta_1$ by trying to minimise the
sum of squared errors over all the points in the sample set. The error is the deviation of
the actual sample data point from the regression line. The technique can be represented by
the equation:
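$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (2)
$\beta_1 = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / \sum_{i=1}^{n} (x_i - \bar{x})^2$ (3)
$\beta_0 = \bar{y} - \beta_1 \bar{x}$ (4)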
Once the linear model is estimated using equations (3) and (4), we can estimate the
value of the dependent variable only within the range of the observed x values. Predicting
outside that range is called extrapolation, which can be inaccurate when simple regression
techniques are used.
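The least squares estimates of the slope and intercept can be computed directly from the sample means. The following is a minimal sketch in plain Python; the function name `fit_simple_linear_regression` and the sample data (points lying exactly on y = 1 + 2x) are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0, beta1) that minimise the sum of squared errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Hypothetical sample lying exactly on y = 1 + 2x
beta0, beta1 = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(beta0, beta1)  # prints 1.0 2.0
```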
3. Measuring Performance of Linear Regression
Mean Square Error:
The mean squared error (MSE) represents the error of the estimator or predictive model
created from the given set of observations in the sample. Two or more regression
models built on the same sample data can be compared based on their MSE: the
lower the MSE, the better the regression model. When a linear regression model is
trained on a given set of observations, the model with the least mean squared
error (MSE) is selected as the best model. Python and R packages select the best-fit
model as the one with the lowest MSE or lowest RMSE when training linear
regression models.
Mathematically, the MSE is the average of the squared differences
between the actual value and the predicted or estimated value given by the
regression model (line or plane):
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
An MSE of zero (0) means that the predictor is a perfect predictor.
RMSE:
The Root Mean Squared Error (RMSE) is the square root of the mean squared error.
Mathematically speaking, RMSE is the square root of the sum of all squared
errors divided by the total number of values:
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
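R-Squared is the ratio of the regression sum of squares (SSR) to the total sum of squares
(SST).
The total sum of squares (SST), the regression sum of squares (SSR), and the sum of squared
errors (SSE) all measure variation: SST is the variation of the actual values around their
mean, SSR is the variation of the predicted values around that mean, and SSE is the variation
of the actual values around the predicted values.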
A value of R-squared closer to 1 means that the regression model explains most
of the variance of the response variable, and it can be termed a good
model.
One can use either MSE or R-squared, depending on what is appropriate for the task.
The disadvantage of MSE compared with R-squared is that it is
difficult to gauge the performance of the model from MSE alone, because MSE can vary
from 0 to any large number, whereas R-squared is bounded
between 0 and 1.
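The measures in this section can be computed in a few lines of plain Python. The sketch below uses hypothetical actual and predicted values, and the function name `regression_metrics` is an assumption made for the example. R-squared is computed here as 1 - SSE/SST, which equals SSR/SST when the predictions come from a least squares line with an intercept.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute SSE, SST, MSE, RMSE and R-squared for paired values."""
    n = len(y_true)
    y_mean = sum(y_true) / n
    # Sum of squared errors: actual values versus predictions
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: actual values versus their mean
    sst = sum((y - y_mean) ** 2 for y in y_true)
    mse = sse / n
    return {"SSE": sse, "SST": sst, "MSE": mse,
            "RMSE": math.sqrt(mse), "R2": 1 - sse / sst}

# Hypothetical actual and predicted values
metrics = regression_metrics([3, 5, 7, 9], [2.5, 5.5, 7, 9.5])
print(metrics)
```

For this sample the MSE is 0.1875 and R-squared is 0.9625, which shows why R-squared is easier to interpret: it is close to 1 regardless of the scale of y.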
4. Example of Linear Regression
Consider the following data for 5 students.
Each Xi (i = 1 to 5) represents the score of the ith student in standard X and the corresponding
Yi (i = 1 to 5) represents the score of the ith student in standard XII.
Student   Xi   Yi
1         95   85
2         85   95
3         80   70
4         70   65
5         60   70
(i) Find the linear regression equation that best predicts the standard XII score.
(ii) Interpret the equation of linear regression.
(iii) If a student's score is 80 in std X, then what is his expected score in XII standard?
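$\bar{x} = 78$, $\bar{y} = 77$
x    y    $x - \bar{x}$   $y - \bar{y}$   $(x - \bar{x})^2$   $(x - \bar{x})(y - \bar{y})$
95   85   17    8     289   136
85   95   7     18    49    126
80   70   2     -7    4     -14
70   65   -8    -12   64    96
60   70   -18   -7    324   126
Sum                   730   470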
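(i) Linear regression equation that best predicts the standard XII score
$\hat{y} = \beta_0 + \beta_1 x$
$\beta_1 = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / \sum_{i=1}^{n} (x_i - \bar{x})^2$
$\beta_1 = 470 / 730 = 0.644$
$\beta_0 = \bar{y} - \beta_1 \bar{x} = 77 - 0.644 \times 78 = 26.768$
$\hat{y} = 26.768 + 0.644 x$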
Interpretation 1
For an increase of one unit in the value of x, the value of y increases by 0.644 units.
Interpretation 2
The score in XII standard (Yi) increases by 0.644 units for each unit of the score in X standard
(Xi), while other factors contribute a constant 26.768 to the result of XII standard.
(iii) If a student's score is 80 in std X, then his expected score in XII standard is 78.288
$\hat{y} = 26.768 + 0.644 \times 80 = 78.288$
model
Output:
array([ 0.64383562, 26.78082192])
Step 5: Predict the Y value for X and observe the output.
predict = np.poly1d(model)
predict(65)
Output:
68.63
Step 6: Predict the y_pred for all values of x.
y_pred = predict(x)
y_pred
Output (rounded to two decimals):
array([87.95, 81.51, 78.29, 71.85, 65.41])
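Steps 1 to 4 are not reproduced in this section. The following is a minimal sketch of the whole flow, assuming the model was obtained with `np.polyfit` on the five student scores of the worked example (the coefficients printed above are consistent with that assumption).

```python
import numpy as np

# Scores of the five students from the worked example
x = np.array([95, 85, 80, 70, 60])  # standard X
y = np.array([85, 95, 70, 65, 70])  # standard XII

# Degree-1 least squares fit; returns [slope, intercept]
model = np.polyfit(x, y, 1)

# Wrap the coefficients in a callable polynomial and predict
predict = np.poly1d(model)
score_65 = predict(65)
y_pred = predict(x)
print(model, round(float(score_65), 2))
```

The slope and intercept differ slightly from the hand calculation (0.644 and 26.768) because the hand calculation rounds the slope before computing the intercept.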
Step 1: Import libraries and create alias for Pandas, Numpy and Matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Import the Boston Housing dataset
# Note: load_boston was removed in scikit-learn 1.2, so this step needs an older version.
from sklearn.datasets import load_boston
boston = load_boston()
Step 3: Initialize the data frame
data = pd.DataFrame(boston.data)
Step 4: Add the feature names to the dataframe
data.columns = boston.feature_names
data.head()
Step 5: Adding target variable to dataframe
data['PRICE'] = boston.target
Step 6: Perform Data Preprocessing( Check for missing values)
data.isnull().sum()
Step 7: Split dependent variable and independent variables
x = data.drop(['PRICE'], axis = 1)
y = data['PRICE']
Step 8: Split the data into training and testing datasets.
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)
print(mse)
Output:
33.44897999767638
mse = mean_squared_error(ytest, ytest_pred)
print(mse)
Output:
19.32647020358573
Step 13: Plotting the linear regression model
plt.scatter(ytrain, ytrain_pred, c='blue', marker='o', label='Training data')
plt.scatter(ytest,ytest_pred ,c='lightgreen',marker='s',label='Test data')
plt.xlabel('True values')
plt.ylabel('Predicted')
plt.title("True value vs Predicted value")
plt.legend(loc= 'upper left')
#plt.hlines(y=0,xmin=0,xmax=50)
plt.plot()
plt.show()
Conclusion:
In this way we have done data analysis using linear regression on the Boston dataset and
predicted the price of houses using the features of the Boston dataset.
Assignment Question:
1) Compute SST, SSE, SSR, MSE, RMSE, R Square for the below example .
1 95 85
2 85 95
3 80 70
4 70 65
5 60 70
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code to calculate the RSquare for Boston Dataset.
(Consider the linear regression model created in practical session)
Group A
Assignment No: 5
Logistic Regression can be used for various classification problems such as spam
detection, diabetes prediction, whether a given customer will purchase a particular product
or churn to a competitor, whether the user will click on a given advertisement
link or not, and many more.
Logistic Regression is one of the simplest and most commonly used Machine Learning
algorithms for two-class classification. It is easy to implement and can be used as the
baseline for any binary classification problem. Its fundamental concepts are also
useful in deep learning. Logistic regression describes and estimates the relationship
between one dependent binary variable and the independent variables.
Logistic regression is a statistical method for predicting binary classes. The outcome or
target variable is dichotomous in nature. Dichotomous means there are only two possible
classes. For example, it can be used for cancer detection problems. It computes the
probability of an event occurring.
It is a special case of linear regression where the target variable is categorical in nature. It
uses the log of odds as the dependent variable. Logistic Regression predicts the probability
of occurrence of a binary event.
Linear regression equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
where y is the dependent variable and x1, x2, ..., xn are the explanatory variables.
3. Sigmoid Function
The sigmoid function, also called the logistic function, gives an ‘S’ shaped curve that can take any
real-valued number and map it to a value between 0 and 1. As the input goes to positive infinity,
the predicted y approaches 1, and as the input goes to negative infinity, the predicted y approaches 0.
If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES,
and if it is less than 0.5, we can classify it as 0 or NO. For example, if the
output is 0.75, we can say in terms of probability that there is a 75 percent chance that a patient
will suffer from cancer.
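The thresholding behaviour described above can be sketched in plain Python. The helper names `sigmoid` and `classify` are hypothetical, and the standard logistic form 1 / (1 + e^(-z)) is assumed.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 (YES) if the sigmoid output exceeds the threshold, else 0 (NO)."""
    return 1 if sigmoid(z) > threshold else 0

print(sigmoid(0))                      # prints 0.5
print(classify(1.1), classify(-2.0))   # prints 1 0
```

An input of about 1.1 gives a sigmoid output of roughly 0.75, which corresponds to the 75 percent example in the text.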
4. Types of Logistic Regression
Binary Logistic Regression: The target variable has only two possible outcomes such as
Spam or Not Spam, Cancer or No Cancer.
Multinomial Logistic Regression: The target variable has three or more nominal
categories such as predicting the type of Wine.
Ordinal Logistic Regression: the target variable has three or more ordinal categories
such as restaurant or product rating from 1 to 5.
The following table shows the confusion matrix for a two class classifier.
Here each row indicates the actual classes recorded in the test data set and each column indicates the
classes as predicted by the classifier.
Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal contains the
prediction errors.
● Number of positive (Pos) : Total number of instances which are labelled as positive in a given
dataset.
● Number of negative (Neg) : Total number of instances which are labelled as negative in a given
dataset.
● Number of True Positive (TP) : Number of instances which are actually labelled as positive
and the predicted class by classifier is also positive.
● Number of True Negative (TN) : Number of instances which are actually labelled as negative
and the predicted class by classifier is also negative.
● Number of False Positive (FP) : Number of instances which are actually labelled as negative
and the predicted class by classifier is positive.
● Number of False Negative (FN): Number of instances which are actually labelled as positive
and the class predicted by the classifier is negative.
● Accuracy: Accuracy is calculated as the number of correctly classified instances divided by the total
number of instances.
The ideal value of accuracy is 1, and the worst is 0. It is also calculated as the sum of true positives
and true negatives (TP + TN) divided by the total number of instances.
$acc = \frac{TP + TN}{TP + FP + TN + FN} = \frac{TP + TN}{Pos + Neg}$
● Error Rate: Error rate is calculated as the number of incorrectly classified instances divided
by the total number of instances.
The ideal value of error rate is 0, and the worst is 1. It is also calculated as the sum of false
positives and false negatives (FP + FN) divided by the total number of instances.
$err = \frac{FP + FN}{TP + FP + TN + FN} = \frac{FP + FN}{Pos + Neg}$
Or
$err = 1 - acc$
● Precision: It is calculated as the number of correctly classified positive instances divided by the
total number of instances which are predicted positive. It is also called confidence value. The
ideal value is 1, whereas the worst is 0.
$precision = \frac{TP}{TP + FP}$
● Recall: It is calculated as the number of correctly classified positive instances divided by the
total number of positive instances. It is also called sensitivity. The ideal value of
recall is 1, whereas the worst is 0.
$recall = \frac{TP}{TP + FN}$
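The four measures above follow directly from the confusion matrix counts. The sketch below uses hypothetical counts (TP = 40, FP = 10, TN = 45, FN = 5), and the function name `classification_metrics` is an assumption made for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, error rate, precision and recall from the four counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,     # correctly classified / all instances
        "error_rate": (fp + fn) / total,   # incorrectly classified / all instances
        "precision": tp / (tp + fp),       # correct positives / predicted positives
        "recall": tp / (tp + fn),          # correct positives / actual positives
    }

# Hypothetical confusion matrix counts
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```

With these counts the accuracy is 0.85, the error rate is 0.15, the precision is 0.8 and the recall is about 0.889.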
Step 6: Predict the y_pred for all values of train_x and test_x
Conclusion:
In this way we have done data analysis using logistic regression for Social Media Adv. and
evaluated the performance of the model.
Value Addition: Visualising Confusion Matrix using Heatmap
Assignment Question:
1) Consider the binary classification task with two classes positive and negative.
Find out TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write Python code for the preprocessing mentioned in step 4 and explain every
step in detail.
Group A
Assignment No: 6
For example, P(A), P(B), P(C) are prior probabilities because, while calculating P(A),
occurrences of event B or C are not taken into account, i.e. no information about the
occurrence of any other event is used.
Conditional Probabilities:
We have a dataset with some features Outlook, Temp, Humidity, and Windy, and the
target here is to predict whether a person or team will play tennis or not.
Conditional Probability
Here, we are predicting the probability of class1 and class2 based on the given condition. If I try
to write the same formula in terms of classes and features, we will get the following equation
Now we have two classes and four features, so if we write this formula for class C1, it will be
Here, we replaced Ck with C1 and X with the intersection of X1, X2, X3, X4. You might ask why
we take the intersection: it is because we are considering the situation in which all these
features are present at the same time.
The Naive Bayes algorithm assumes that all the features are independent of each other or in other
words all the features are unrelated. With that assumption, we can further simplify the above
This is the final equation of the Naive Bayes and we have to calculate the probability of both C1
P (No | Today) > P (Yes | Today). So, the prediction for whether golf would be played is ‘No’.
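The comparison of the two class scores can be sketched in plain Python. The table used in this manual is not reproduced in this section, so the sketch assumes the commonly used 14-row play-tennis table and a hypothetical query for "Today" (Sunny, Cool, High humidity, Windy). The function name `naive_bayes_scores` is also an assumption.

```python
from collections import Counter

# (Outlook, Temp, Humidity, Windy, Play): commonly used play-tennis table (assumed)
rows = [
    ("Sunny", "Hot", "High", False, "No"), ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"), ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"), ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"), ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"), ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rainy", "Mild", "High", True, "No"),
]

def naive_bayes_scores(rows, query):
    """Return the unnormalised score P(C) * product of P(x_i | C) for each class."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(C)
        for i, value in enumerate(query):
            matches = sum(1 for r in rows if r[-1] == label and r[i] == value)
            score *= matches / count  # likelihood P(x_i | C), features assumed independent
        scores[label] = score
    return scores

today = ("Sunny", "Cool", "High", True)  # hypothetical query
scores = naive_bayes_scores(rows, today)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The scores are not normalised because the denominator P(X) is the same for both classes, so comparing the numerators is enough to choose the class.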
Step 5: Use Naive Bayes algorithm( Train the Machine ) to Create Model
# import the class
from sklearn.naive_bayes import GaussianNB
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
Step 6: Predict the y_pred for all values of train_x and test_x
Y_pred = gaussian.predict(X_test)
Conclusion:
In this way we have done data analysis using Naive Bayes Algorithm for Iris dataset and
evaluated the performance of the model.
Assignment Question:
1) Consider the observations for the car theft scenario having 3 attributes: Colour, Type and
Origin.
Find the probability of car theft for the scenario Red, SUV and Domestic.
2) Write Python code for the preprocessing mentioned in step 4 and explain every step in
detail.
Group A
Assignment No: 7
Text mining is also referred to as text analytics. Text mining is a process of exploring
sizable textual data and finding patterns. Text mining processes the text itself, while NLP
works with the underlying metadata. Finding frequency counts of words, the length of a
sentence, and the presence or absence of specific words is known as text mining. Natural language
processing is one of the components of text mining. NLP helps identify sentiment,
find entities in a sentence, and determine the category of a blog or article. Text mining produces
preprocessed data for text analytics. In text analytics, statistical and machine learning
algorithms are used to classify information.
such as WordNet, along with a suite of text processing libraries for classification,
tokenization, stemming, tagging, parsing, and semantic reasoning and many more.
Analysing movie reviews is one of the classic examples to demonstrate a simple NLP
Bag-of-words model, on movie reviews.
Tokenization:
Tokenization is the first step in text analytics. The process of breaking down a text
paragraph into smaller chunks such as words or sentences is called Tokenization.
A token is a single entity that is a building block for a sentence or paragraph.
sent_tokenize() method
method
1) Lemmatization Vs Stemming
A stemming algorithm works by cutting the suffix from the word. In a broader sense,
it cuts either the beginning or the end of the word.
POS Tagging
POS (Parts of Speech) tagging gives grammatical information about the words of a
sentence by assigning a specific tag (DT, NN, JJ, RB, VB, PRP, etc.) for each
category (determiner, noun, adjective, adverb, verb, personal pronoun, etc.) to each word.
A word can have more than one POS depending on the context in which it is used.
POS tags can be used in statistical NLP tasks. They distinguish the sense of a word,
which is very helpful in text realization and in inferring semantic information from text
for sentiment analysis.
Example:
The initial step is to make a vocabulary of unique words and calculate TF for each
document. TF will be more for words that frequently appear in a document and
less for rare words in a document.
● Inverse Document Frequency (IDF)
It is the measure of the importance of a word. Term frequency (TF) does not
consider the importance of words. Some words such as ‘of’, ‘and’, etc. can be
most frequently present but are of little significance. IDF provides weightage to
each word based on its frequency in the corpus D.
After applying TFIDF, text in A and B documents can be represented as a TFIDF vector of
dimension equal to the vocabulary words. The value corresponding to each word represents
the importance of that word in a particular document.
TFIDF is the product of TF with IDF. Since TF values lie between 0 and 1, not using ln can
result in high IDF for some words, thereby dominating the TFIDF. We don’t want that, and
therefore, we use ln so that the IDF should not completely dominate the TFIDF.
● Disadvantage of TFIDF
It is unable to capture the semantics. For example, funny and humorous are synonyms, but
TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if the
vocabulary is vast.
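The TF, IDF and TFIDF calculations can be sketched in plain Python. The formulas are assumptions, since they are not reproduced in this section: TF is taken as the word count divided by the document length, and IDF as ln(N / number of documents containing the word). The two-document corpus and the function names `tf` and `idf` are hypothetical.

```python
import math

def tf(document_words):
    """Term frequency: count of each word divided by the document length."""
    counts = {}
    for word in document_words:
        counts[word] = counts.get(word, 0) + 1
    return {word: c / len(document_words) for word, c in counts.items()}

def idf(corpus):
    """Inverse document frequency: ln(N / number of documents containing the word)."""
    n_docs = len(corpus)
    vocabulary = set(word for doc in corpus for word in doc)
    return {w: math.log(n_docs / sum(1 for doc in corpus if w in doc))
            for w in vocabulary}

# Hypothetical corpus, already lowercased and tokenized
corpus = [["the", "cat", "sat"], ["the", "dog", "barked", "loudly"]]
idf_scores = idf(corpus)
tfidf = [{w: f * idf_scores[w] for w, f in tf(doc).items()} for doc in corpus]
print(tfidf[0])
```

The word "the" appears in both documents, so its IDF is ln(2/2) = 0 and its TFIDF score is 0 in each document, while words unique to one document get a positive score.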
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text must be
converted into vectors of numbers. In natural language processing, a common technique
for extracting features from text is to place all of the words that occur in the text in a
bucket. This approach is called a bag of words model or BoW for short. It’s referred to
as a “bag” of words because any information about the structure of the sentence is lost.
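A bag of words representation can be sketched with the standard library alone. The two documents below are hypothetical, and simple whitespace splitting is used in place of a proper tokenizer.

```python
from collections import Counter

# Hypothetical documents
documents = ["the cat sat on the mat", "the dog sat"]

# Lowercase and split on whitespace (a stand-in for a real tokenizer)
tokenized = [doc.lower().split() for doc in documents]

# Vocabulary: every distinct word across all documents, in sorted order
vocabulary = sorted(set(word for doc in tokenized for word in doc))

# One count vector per document, aligned with the vocabulary
vectors = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]
print(vocabulary)  # prints ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # prints [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each vector records only how often a word occurs, so "the cat sat on the mat" and "the mat sat on the cat" would produce the same vector, which is the loss of structure mentioned above.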
Algorithm for Tokenization, POS Tagging, stop words removal, Stemming and
Lemmatization:
Step 1: Download the required packages
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
Step 2: Initialize the text
text = ("Tokenization is the first step in text analytics. The "
        "process of breaking down a text paragraph into smaller chunks "
        "such as words or sentences is called Tokenization.")
Step 3: Perform Tokenization
#Sentence Tokenization
from nltk.tokenize import sent_tokenize
tokenized_text= sent_tokenize(text)
print(tokenized_text)
#Word Tokenization
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)
idfDict = dict.fromkeys(documents[0].keys(), 0)
for document in documents:
    for word, val in document.items():
        if val > 0:
            idfDict[word] += 1
Conclusion:
In this way we have done text data analysis using TF IDF algorithm
Assignment Question:
2) Perform Stemming for text = "studies studying cries cry". Compare
the results generated with Lemmatization. Comment on your answer how
Stemming and Lemmatization differ from each other.
3) Write Python code for removing stop words from the below documents, convert the
documents into lowercase and calculate the TF, IDF and TFIDF score for each
document.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'
Group A
Assignment No: 8
Contents for Theory:
1. Seaborn Library Basics
2. Know your Data
3. Finding patterns of data.
4. Checking how the price of the ticket (column name: 'fare') for each passenger is distributed
by plotting a histogram.
---------------------------------------------------------------------------------------------------------------
Theory:
Data visualisation plays a very important role in data mining. Various data scientists spent their
time exploring data through visualisation. To accelerate this process we need to have good
documentation of all the plots.
Even plenty of resources can’t be transformed into valuable goods without planning and
architecture.
2. Know your data
The dataset that we are going to use to draw our plots will be the Titanic dataset, which is
downloaded by default with the Seaborn library. All you have to do is use the load_dataset
function and pass it the name of the dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
dataset = sns.load_dataset('titanic')
dataset.head()
The dataset contains 891 rows and 15 columns and contains information about the passengers
who boarded the unfortunate Titanic ship. The original task is to predict whether or not the
passenger survived depending upon different features such as their age, ticket, cabin they
boarded, the class of the ticket, etc. We will use the Seaborn library to see if we can find any
patterns in the data.
A. Distribution Plots
a. Dist-Plot
b. Joint Plot
d. Rug Plot
B. Categorical Plots
a. Bar Plot
b. Count Plot
c. Box Plot
d. Violin Plot
C. Advanced Plots
a. Strip Plot
b. Swarm Plot
D. Matrix Plots
a. Heat Map
b. Cluster Map
A. Distribution Plots:
These plots help us to visualise the distribution of data. We can use these plots to understand the mean,
a. Distplot
● We can change the number of bins, i.e. the number of vertical bars in a histogram.
The line that you see represents the kernel density estimation. You can remove this line by
passing False as the parameter for the kde attribute as shown below
Here the x-axis is the age and the y-axis displays frequency. For example, for bins=10,
● We additionally obtain a scatter plot between the variables to reflect their linear
relationship. We can customise the scatter plot into a hexagonal plot, where, the
more the colour intensity, the more will be the number of observations.
import seaborn as sns
# For Plot 1
sns.jointplot(x=dataset['age'], y=dataset['fare'], kind='scatter')
# For Plot 2
sns.jointplot(x=dataset['age'], y=dataset['fare'], kind='hex')
From the output, you can see that most of the instances for the fares have values between 0 and
100.
These are some of the most commonly used distribution plots offered by the Python's Seaborn
Library. Let's see some of the categorical plots in the Seaborn library.
2. Categorical Plots
Categorical plots, as the name suggests, are normally used to plot categorical data. The
categorical plots plot the values in the categorical column against another categorical column or a
numeric column. Let's see some of the most commonly used categorical plots.
b. The Bar Plot
The barplot() is used to display the mean value for each value in a categorical column, against a
numeric column. The first parameter is the categorical column, the second parameter is the
numeric column while the third parameter is the dataset. For instance, if you want to know the
mean value of the age of the male and female passengers, you can use the bar plot as follows.
From the output, you can clearly see that the average age of male passengers is just less than 40
while the average age of female passengers is around 33.
In addition to finding the average, the bar plot can also be used to calculate other aggregate
values for each category. To do so, you need to pass the aggregate function to the estimator. For
instance, you can calculate the standard deviation for the age of each gender as follows:
import numpy as np
import matplotlib.pyplot as plt
sns.countplot(x='sex', data=dataset)
The box plot is used to display the distribution of the categorical data in the form of quartiles.
The centre of the box shows the median value. The range from the lower whisker to the bottom of
the box shows the first quartile. From the bottom of the box to the middle of the box lies the
second quartile. From the middle of the box to the top of the box lies the third quartile and finally from the
top of the box to the top whisker lies the last quartile.
Now let's plot a box plot that displays the distribution for the age with respect to each gender.
You need to pass the categorical column as the first parameter (which is sex in our case) and the
numeric column (age in our case) as the second parameter. Finally, the dataset is passed as the
third parameter, take a look at the following script:
Let's try to understand the box plot for females. The first quartile starts at around 1 and ends
at 20, which means that 25% of the passengers are aged between 1 and 20. The second quartile starts
at around 20 and ends at around 28, which means that 25% of the passengers are aged between 20
and 28. Similarly, the third quartile starts and ends between 28 and 38, hence 25% of the passengers are
aged within this range, and finally the fourth or last quartile starts at 38 and ends around 64.
Passengers that do not belong to any of the quartiles are
called outliers and are represented by dots on the box plot.
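The quartiles and outliers that a box plot displays can be computed with the standard library. The ages below are hypothetical. The sketch also assumes the usual convention that points more than 1.5 times the interquartile range beyond the box are drawn as outliers, which is not stated in the text above.

```python
import statistics

# Hypothetical ages
ages = [1, 20, 24, 28, 30, 38, 40, 64, 95]

# Quartile cut points: bottom of the box, median line, top of the box
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

# Interquartile range and whisker limits (1.5 x IQR convention, assumed)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the whisker limits are drawn as individual dots
outliers = [a for a in ages if a < lower_fence or a > upper_fence]
print(q1, median, q3, outliers)  # prints 24.0 30.0 40.0 [95]
```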
Now in addition to the information about the age of each gender, you can also see the distribution
of the passengers who survived. For instance, you can see that among the male passengers, on
average more younger people survived as compared to the older ones. Similarly, you can see that
the variation among the age of female passengers who did not survive is much greater than the age
of the surviving female passengers.
e. The Violin Plot
The violin plot is similar to the box plot, however, the violin plot allows us to display all the
components that actually correspond to the data point. The violinplot() function is used to plot the
violin plot. Like the box plot, the first parameter is the categorical column, the second parameter
is the numeric column while the third parameter is the dataset.
Let's plot a violin plot that displays the distribution for the age with respect to each gender.
You can see from the figure above that violin plots provide much more information about the
data as compared to the box plot. Instead of plotting the quartiles, the violin plot allows us to see all
the components that actually correspond to the data. The area where the violin plot is thicker has a
higher number of instances for the age. For instance, from the violin plot for males, it is clearly
evident that the number of passengers with age between 20 and 40 is higher than all the rest of the
age brackets.
Like box plots, you can also add another categorical variable to the violin plot using the hue
parameter as shown below:
Advanced Plots:
SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS
You can see the scattered plots of age for both males and females. The data points look like
strips. It is difficult to comprehend the distribution of data in this form. To better comprehend the data,
pass True for the jitter parameter which adds some random noise to the data. Look at the
following script:
Now you have a better view for the distribution of age across the genders.
Like violin and box plots, you can add an additional categorical column to the strip plot using the hue parameter as
shown below:
You can clearly see that the above plot contains scattered data points like the strip plot and the
data points are not overlapping. Rather they are arranged to give a view similar to that of a violin plot.
Let's add another categorical column to the swarm plot using the hue parameter.
From the output, it is evident that the ratio of surviving males is less than the ratio of surviving
females. Since for the male plot, there are more blue points and less orange points. On the other
hand, for females, there are more orange points (surviving) than the blue points (not surviving).
Another observation is that amongst males of age less than 10, more passengers survived as
compared to those who didn't.
1. Matrix Plots
Matrix plots are the type of plots that show data in the form of rows and columns. Heat maps are the
prime examples of matrix plots.
a. Heat Maps
Heat maps are normally used to plot the correlation between numeric columns in the form of a
matrix. It is important to mention here that to draw matrix plots, you need to have meaningful
information on rows as well as columns. Let's print the first five rows of the Titanic dataset to see if both
the rows and column headers have meaningful information. Execute the following script:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
dataset = sns.load_dataset('titanic')
dataset.head()
From the output, you can see that the column headers contain useful information such as
passengers surviving, their age, fare etc. However the row headers only contain indexes 0, 1, 2,
etc. To plot matrix plots, we need useful information on both columns and row headers. One way to do
this is to call the corr() method on the dataset. The corr() function returns the correlation between
all the numeric columns of the dataset. Execute the following script:
# numeric_only=True restricts the correlation to numeric columns (needed in pandas 2.0 and later)
dataset.corr(numeric_only=True)
In the output, you will see that both the columns and the rows have meaningful header
information, as shown below:
Now to create a heat map with these correlation values, you need to call the heatmap() function
and pass it your correlation dataframe. Look at the following script:
corr = dataset.corr(numeric_only=True)
sns.heatmap(corr)
The correlation values can also be plotted on the heatmap by passing True for the annot
parameter. Execute the following script to see this in action:
corr = dataset.corr(numeric_only=True)
sns.heatmap(corr, annot=True)
b. Cluster Map:
In addition to the heat map, another commonly used matrix plot is the cluster map. The
cluster map basically uses Hierarchical Clustering to cluster the rows and columns of the
matrix.
Let's plot a cluster map for the number of passengers who travelled in a specific month of a
specific year. Execute the following script:
4. Checking how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.
import seaborn as sns
dataset = sns.load_dataset('titanic')
sns.histplot(dataset['fare'],kde=False,bins=10)
Conclusion-
Seaborn is an advanced data visualisation library built on top of the Matplotlib library. In this
assignment, we looked at how we can draw distributional and categorical plots using the Seaborn library.
We have seen how to plot matrix plots in Seaborn. We also saw how to change plot styles and use grid
functions to manipulate subplots.
Assignment Questions
1. List out different types of plots to find patterns of data
2. Explain when you will use distribution plots and when you will use categorical plots.
3. Write the conclusion from the following swarm plot (consider titanic dataset)
4. Which parameter is used to add another categorical variable to the violin plot? Explain
with syntax and example.
----------------------------------------------------------------------------------------------------------------
Group B
Assignment No: 1
----------------------------------------------------------------------------------------------------------------
Theory:
● Steps to Install Hadoop
● Java Code for word count
● Input File
Step 2) Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce
program. Visit the following
link
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1
(Otherwise check in Browsing HDFS -> Utilities -> Browse the file System -> /)
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
{
Path inputPath = new Path(args[0]);
Path outputPath = new Path(args[1]);
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);
job.setJobName("WordCount");
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
return job.waitForCompletion(true) ? 0 : 1;
}
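The driver above wires a mapper, a combiner and a reducer into one job. As a rough, single-process sketch of that data flow (plain Python, not the Hadoop API; the function names and the sample lines are hypothetical), word count can be simulated like this:

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def combine(pairs):
    """Combiner: pre-sum counts within one mapper's output to cut shuffle traffic."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def shuffle(all_pairs):
    """Shuffle: group the partial counts by word across all mapper outputs."""
    grouped = defaultdict(list)
    for word, count in all_pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(word, counts):
    """Reducer: sum the partial counts for one word."""
    return word, sum(counts)

def word_count(lines):
    combined = []
    for line in lines:  # each line stands in for one input split
        combined.extend(combine(map_phase(line)))
    return dict(reduce_phase(w, c) for w, c in shuffle(combined).items())

# Hypothetical input lines
sample_lines = ["big data big ideas", "data beats opinions"]
print(word_count(sample_lines))
```

Summation is associative and commutative, so the same logic can serve as both combiner and reducer, which is why the driver registers Reduce.class for both roles.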
Assignment Questions
1. What is MapReduce? Explain with a small example.
----------------------------------------------------------------------------------------------------------------
Group B
Assignment No: 2
----------------------------------------------------------------------------------------------------------------
Theory:
● Steps to Install Hadoop for distributed environment
● Java code to process a system log file
cd hadoop-2.7.3
cd hadoop-2.7.3/sbin
1) Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory
tree of all files stored in HDFS and tracks where the files are stored across the cluster.
2) Start DataNode:
3) Start ResourceManager:
ResourceManager is the master that arbitrates all the available cluster resources
and thus helps in managing the distributed applications running on the YARN
system. Its work is to manage the NodeManagers and each application’s
ApplicationMaster.
4) Start NodeManager:
5) Start JobHistoryServer:
JobHistoryServer is responsible for servicing all job-history-related requests from
clients.
Step 3) To check that all the Hadoop services are up and running, run the below
command.
jps
Step 4) cd
Step 9) cd mapreduce_vijay/
Step 10) ls
Step 14) ls
Step 17) cd ..
Step 20) ls
Step 21) cd
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
}
output.collect(key, new IntWritable(frequencyForCountry));
}
}
Driver Class:
package SalesCountry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
my_client.setConf(job_conf);
try {
// Run the job
JobClient.runJob(job_conf);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Input File
Pune
Mumbai
Nashik
Pune
Nashik
Kolapur
Assignment Questions
1. Write down the steps to design a distributed application using MapReduce
that processes a system log file.
Group B
Assignment No: 3
Theory:
● Steps to Install Scala
● Apache Spark Framework Installation
● Source Code
1) Install Scala
Step 2) Install Scala from the apt repository by running the following commands to search for
scala and install it.
Apache Spark is an open-source, distributed processing system used for big data workloads. It
utilizes in-memory caching, and optimized query execution for fast analytic queries against data
of any size.
Step 1) Go to the official Apache Spark download page and download the latest version (3.2.1
at the time of writing). Alternatively, you can use the wget command to download the file
directly in the terminal.
wget https://apachemirror.wuchna.com/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
Step 4) Set a few environment variables in the .profile file before starting Spark.
Step 5) To make sure that these new environment variables are reachable within the shell and
available to Apache Spark, you must also run the following command so that the changes take
effect.
source ~/.profile
Step 6) ls -l /opt/spark
Step 7) Run the following command to start the Spark master service and slave service.
start-master.sh
start-workers.sh spark://localhost:7077
(If the workers do not start, remove and reinstall openssh:
sudo apt-get remove openssh-client openssh-server
sudo apt-get install openssh-client openssh-server)
Step 8) Once the services are started, open the browser and enter the following URL to access
the Spark page. From the page, you can see that the master and slave services have started.
http://localhost:8080/
Step 9) You can also check if spark-shell works fine by launching the spark-shell command.
spark-shell
Source Code:
object ExampleString {
def main(args: Array[String]) {
}
}
object CheckNumber { // object name assumed; the original declaration is missing
  def main(args: Array[String]): Unit = {
    /** declare a variable */
    val number = -100
    if (number == 0) {
      println("number is zero")
    }
    else if (number > 0) {
      println("number is positive")
    }
    else {
      println("number is negative")
    }
  }
}
object ExFindLargest {
  def main(args: Array[String]): Unit = {
    val number1 = 20
    val number2 = 30
    // Compare the two numbers and print the larger one
    if (number1 > number2) {
      println("Largest number is:" + number1)
    }
    else {
      println("Largest number is:" + number2)
    }
  }
}
Assignment Questions
1. Write down the steps to install Scala.
Group C
Assignment No: 2
Contents for Theory:
1. Step 1: Data Collection
2. Step 2: Sentiment Analysis
3. Step 3: Visualization
---------------------------------------------------------------------------------------------------------------
We will begin by scraping and storing Twitter data. We will then classify the Tweets into
positive, negative, or neutral sentiment with a simple algorithm. Then, we will build charts
using Plotly and Matplotlib to identify trends in sentiment.
Step 1: Data Collection
Command -
import pandas as pd
df = pd.read_csv('/content/data_visualization.csv')
Output -
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2882:
DtypeWarning: Columns (22,24) have mixed types.Specify dtype option on import or set
low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)
Let's now take a look at some of the variables present in the data frame:
Command -
df.info()
Output -
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33590 entries, 0 to 33589
Data columns (total 36 columns):
 #   Column           Non-Null Count  Dtype
 0   id               33590 non-null  int64
 1   conversation_id  33590 non-null  int64
 2   created_at       33590 non-null  object
 3   date             33590 non-null  object
Command -
df['tweet'][10]
Output -
We are pleased to invite you to the EDHEC DataViz Challenge grand
final for a virtual exchange with all Top 10 finalists to see how
data visualization creates impact and can bring out compelling
stories in support of @UNICEF’s mission. https://t.co/Vbj9B48VjV
Step 2: Sentiment Analysis
The Tweet above is clearly positive. Let's see if the model is able to pick up on this, and return
a positive prediction. Run the following lines of code to import the NLTK library, along with
the SentimentIntensityAnalyzer (SID) module.
Command -
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
import re
import pandas as pd
import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())
The SID module takes in a string and returns a score in each of these four categories -
positive, negative, neutral, and compound. The compound score is calculated by summing the
valence scores of the words in the text and normalizing the result to lie between -1 and 1. If
the compound score is closer to 1, then the Tweet can be classified as positive. If it is closer
to -1, then the Tweet can be classified as negative. Let's now analyze the above sentence with
the sentiment intensity analyzer.
Command -
sentence = df['tweet'][0]
sid.polarity_scores(sentence)['compound']
The output of the code above is 0.7089, indicating that the sentence is of positive sentiment.
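As a rough illustration of how a compound score arises, the sketch below sums per-word valence scores and squashes the total into the range -1 to 1 using the normalization from the VADER reference implementation, x / sqrt(x^2 + alpha) with alpha = 15. The tiny lexicon and its values are hypothetical, and the real analyzer also handles negation, punctuation, and intensifiers, so this sketch will not reproduce NLTK's scores.

```python
import math

ALPHA = 15  # normalization constant used by VADER

def normalize(total, alpha=ALPHA):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return total / math.sqrt(total * total + alpha)

# Hypothetical mini-lexicon; the real VADER lexicon has thousands of entries.
toy_lexicon = {"pleased": 1.9, "impact": 0.5, "terrible": -2.1}

def toy_compound(text):
    """Sum the valence of known words, then normalize the total."""
    total = sum(toy_lexicon.get(word, 0.0) for word in text.lower().split())
    return round(normalize(total), 4)

print(toy_compound("We are pleased with the impact"))  # positive score
print(toy_compound("a terrible result"))               # negative score
print(toy_compound("nothing notable here"))            # no known words
```

Because of the square root in the denominator, a few strongly worded terms move the score quickly towards +1 or -1, while additional terms have a diminishing effect.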
Let's now create a function that predicts the sentiment of every Tweet in the data frame, and
stores it as a separate column called 'sentiment'. First, run the following lines of code to clean
the Tweets in the data frame:
Command -
def cleaner(tweet):
    tweet = re.sub("@[A-Za-z0-9]+", "", tweet)  # Remove @mentions
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet)  # Remove http links
    tweet = " ".join(tweet.split())  # Collapse repeated whitespace
    tweet = tweet.replace("#", "").replace("_", "")  # Remove hashtag sign but keep the text
    # Keep dictionary words and non-alphabetic tokens, joined by single spaces
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet)
                     if w.lower() in words or not w.isalpha())
    return tweet

df['tweet_clean'] = df['tweet'].apply(cleaner)
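To see what the regular-expression steps do without downloading any NLTK data, here is a small standalone sketch. The function name strip_noise, the slightly simplified patterns, and the sample Tweet (including its shortened link) are hypothetical; the dictionary-word filter from the cleaner is deliberately left out.

```python
import re

def strip_noise(tweet: str) -> str:
    """Remove @mentions, links and '#' signs, then collapse whitespace."""
    tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)          # drop @mentions
    tweet = re.sub(r"(?:https?://|www\.)\S+", "", tweet)  # drop links
    tweet = tweet.replace("#", "")                        # keep hashtag text only
    return " ".join(tweet.split())                        # normalize spacing

# Hypothetical sample Tweet
sample = "Loved the #dataviz finals, thanks @UNICEF for hosting   https://t.co/abc123"
print(strip_noise(sample))
```

Collapsing whitespace last matters: removing mentions and links leaves double spaces behind, and the final join cleans those up in one pass.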
Command -
word_dict = {'manipulate': -1, 'manipulative': -1, 'jamescharlesiscancelled': -1,
             'jamescharlesisoverparty': -1, 'pedophile': -1, 'pedo': -1, 'cancel': -1,
             'cancelled': -1, 'cancelculture': 0.4, 'teamtati': -1, 'teamjames': 1,
             'teamjamescharles': 1, 'liar': -1}
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
sid.lexicon.update(word_dict)
list1 = []
for i in df['tweet_clean']:
    list1.append((sid.polarity_scores(str(i)))['compound'])
Command -
df['sentiment'] = pd.Series(list1)

def sentiment_category(sentiment):
    label = ''
    if sentiment > 0:
        label = 'positive'
    elif sentiment == 0:
        label = 'neutral'
    else:
        label = 'negative'
    return label

df['sentiment_category'] = df['sentiment'].apply(sentiment_category)
Let's take a look at the head of the data frame to ensure everything is working properly:
Command -
df = df[['tweet','date','id','sentiment','sentiment_category']]
df.head()
Output -
Notice that the first few Tweets are a combination of positive, negative and neutral
sentiment. For this analysis, we will only be using Tweets with positive and negative
sentiment, since we want to visualize how stronger sentiments have changed over time.
Step 3: Visualization
Now that we have Tweets classified as positive and negative, let's take a look at changes in
sentiment over time. We first need to group positive and negative sentiment and count them
by date:
Command -
neg = df[df['sentiment_category'] == 'negative']
neg = neg.groupby(['date'], as_index=False).count()
pos = df[df['sentiment_category'] == 'positive']
pos = pos.groupby(['date'], as_index=False).count()
pos = pos[['date', 'id']]
neg = neg[['date', 'id']]
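To make clear what the grouping step computes, here is a dependency-free sketch that counts Tweets per date for one sentiment category. The rows are hypothetical sample data and count_by_date is an illustrative helper; on the real data frame the pandas groupby shown above does the same job.

```python
from collections import Counter

# Hypothetical rows mimicking the date and sentiment_category columns
rows = [
    {"date": "2021-05-01", "sentiment_category": "positive"},
    {"date": "2021-05-01", "sentiment_category": "negative"},
    {"date": "2021-05-01", "sentiment_category": "positive"},
    {"date": "2021-05-02", "sentiment_category": "positive"},
    {"date": "2021-05-02", "sentiment_category": "neutral"},
]

def count_by_date(rows, category):
    """Return {date: number of rows with the given sentiment category}."""
    return Counter(r["date"] for r in rows if r["sentiment_category"] == category)

pos_counts = count_by_date(rows, "positive")
neg_counts = count_by_date(rows, "negative")
print(sorted(pos_counts.items()))
print(sorted(neg_counts.items()))
```

A date with no Tweets in a category simply has no entry, just as it has no row in the grouped data frame, so the positive and negative series need not cover the same dates.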
Now, we can visualize sentiment by date using Plotly, by running the following lines of code:
Command -
import plotly.graph_objs as go

fig = go.Figure()
# Daily count of positive Tweets, drawn in green
fig.add_trace(go.Scatter(x=pos['date'], y=pos['id'],
                         name='positive',
                         mode='markers+lines',
                         line=dict(shape='linear', color='green'),
                         connectgaps=True
                         )
              )
# Daily count of negative Tweets, drawn in red
fig.add_trace(go.Scatter(x=neg['date'], y=neg['id'],
                         name='negative',
                         mode='markers+lines',
                         line=dict(shape='linear', color='red'),
                         connectgaps=True
                         )
              )
fig.show()
Final Output - You should see a chart that looks like this:
The red line represents negative sentiment, and the green line represents positive sentiment.
Assignment Questions:
1. What is Twitter sentiment analysis?
2. What is Natural Language Processing (NLP) and what are the stages in the life
cycle of NLP?
3. What is NLTK? How to tokenize a sentence using the NLTK package?
4. Explain any two real-life applications of Natural Language Processing.
Group C
Assignment No: 5
Contents for Theory:
1. Hadoop Ecosystem
2. HDFS
3. YARN
4. MAPREDUCE
5. APACHE PIG
6. APACHE HIVE
7. APACHE MAHOUT
8. APACHE SPARK
9. APACHE HBASE
10. APACHE SOLR & LUCENE
---------------------------------------------------------------------------------------------------------------
Hadoop Ecosystem:
Hadoop Ecosystem is neither a programming language nor a service; it is a platform or
framework which solves big data problems. You can consider it as a suite which encompasses a
number of services (ingesting, storing, analyzing and maintaining) inside it. Let us discuss and
get a brief idea about how the services work individually and in collaboration. Below are the
Hadoop components that together form a Hadoop ecosystem:
HDFS -
● Hadoop Distributed File System is the core component or, you can say, the backbone
of the Hadoop Ecosystem.
● HDFS is the one which makes it possible to store different types of large data sets (i.e.
structured, unstructured and semi-structured data).
● HDFS creates a level of abstraction over the resources, from where we can see the whole
HDFS as a single unit.
● It helps us in storing our data across various nodes and maintaining the log file about the
stored data (metadata).
● HDFS has two core components, i.e. NameNode and DataNode.
1. The NameNode is the main node and it doesn’t store the actual data. It
contains metadata, just like a log file or, you can say, a table of contents.
Therefore, it requires less storage and high computational resources.
2. On the other hand, all your data is stored on the DataNodes and hence they
require more storage resources. These DataNodes are commodity hardware
(like your laptops and desktops) in the distributed environment. That’s the
reason why Hadoop solutions are very cost effective.
3. You always communicate with the NameNode while writing the data. Then, it
internally sends a request to the client to store and replicate data on various
DataNodes.
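The split between metadata and data can be pictured with a toy model. The sketch below is not the HDFS API: place_blocks is a hypothetical helper, the DataNode names are made up, and the round-robin placement is a simplifying assumption (real HDFS uses a rack-aware policy). The 128 MB block size and replication factor of 3 are the Hadoop 2.x defaults.

```python
BLOCK_SIZE_MB = 128  # default HDFS block size in Hadoop 2.x
REPLICATION = 3      # default HDFS replication factor

def place_blocks(file_size_mb, datanodes,
                 block_size=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return NameNode-style metadata: block id -> (size in MB, replica nodes)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = -(-file_size_mb // block_size)  # ceiling division
    metadata = {}
    for b in range(n_blocks):
        size = min(block_size, file_size_mb - b * block_size)
        # Assumed round-robin placement; each replica goes to a different node
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        metadata[b] = (size, replicas)
    return metadata

# Hypothetical 300 MB file on a four-node cluster
meta = place_blocks(300, ["dn1", "dn2", "dn3", "dn4"])
for block_id, (size, nodes) in meta.items():
    print(block_id, size, nodes)
```

The dictionary plays the role of the NameNode: it is small and holds only the block map, while the bulk of the bytes would live on the DataNodes named in each entry.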
YARN - Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing
activities by allocating resources and scheduling tasks.
● It has two major components, i.e. ResourceManager and NodeManager.
1. ResourceManager is again a main node in the processing department.
2. It receives the processing requests, and then passes the parts of the requests to the
corresponding NodeManagers, where the actual processing takes
place.
3. NodeManagers are installed on every DataNode. Each is responsible for the execution
of tasks on its DataNode.
● The ResourceManager itself has two components, i.e. Schedulers and ApplicationsManager.
1. Schedulers: Based on your application's resource requirements, Schedulers run
scheduling algorithms and allocate the resources.
2. ApplicationsManager: The ApplicationsManager accepts the job submission,
negotiates containers (i.e. the DataNode environment where a process executes) for
executing the application-specific ApplicationMaster, and monitors the progress.
MAPREDUCE -
Let us take the above example to have a better understanding of a MapReduce program.
We have a sample case of students and their respective departments. We want to calculate the
number of students in each department. Initially, the Map program will execute and calculate the
students appearing in each department, producing the key value pair as mentioned above. This
key value pair is the input to the Reduce function. The Reduce function will then aggregate
each department and calculate the total number of students in each department and produce the
given result.
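A minimal single-process sketch of this example follows. The (student, department) records are hypothetical, since the sample table is not reproduced here. The map step emits a (department, 1) pair per student, the pairs are sorted and grouped by key to stand in for the shuffle-and-sort phase, and the reduce step sums each group.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records: (student, department)
students = [
    ("Asha", "Computer"), ("Ravi", "Mechanical"), ("Meera", "Computer"),
    ("John", "Civil"), ("Sara", "Mechanical"), ("Amit", "Computer"),
]

def mapper(record):
    """Map: emit (department, 1) for one student record."""
    _student, department = record
    return (department, 1)

def reducer(department, ones):
    """Reduce: total the 1s emitted for one department."""
    return (department, sum(ones))

mapped = [mapper(r) for r in students]
mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle and sort
result = [
    reducer(dept, (value for _, value in group))
    for dept, group in groupby(mapped, key=itemgetter(0))
]
print(result)
```

Sorting before grouping mirrors the guarantee Hadoop gives a reducer: all values for one key arrive together, so the reducer never has to look at another department's records.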
APACHE PIG -
● PIG has two parts: Pig Latin, the language, and the Pig runtime, the execution environment.
You can better understand it as Java and JVM.
● It supports the Pig Latin language, which has an SQL-like command structure.
Not everyone comes from a programming background, so Apache PIG relieves
them. You might be curious to know how?
Well, an interesting fact is:
10 lines of Pig Latin = approx. 200 lines of MapReduce Java code
But don’t be shocked when I say that at the back end of a Pig job, a MapReduce job executes.
● The compiler internally converts Pig Latin to MapReduce. It produces a sequential set of
MapReduce jobs, and that’s an abstraction (which works like a black box).
● PIG was initially developed by Yahoo.
APACHE HIVE -
● Facebook created HIVE for people who are fluent in SQL. Thus, HIVE makes them
feel at home while working in a Hadoop Ecosystem.
● Basically, HIVE is a data warehousing component which performs reading, writing
and managing large data sets in a distributed environment using an SQL-like interface.
HIVE + SQL = HQL
● The query language of Hive is called Hive Query Language (HQL), which is very
similar to SQL.
● It has 2 basic components: Hive Command Line and JDBC/ODBC driver.
● The Hive Command Line interface is used to execute HQL commands.
● Java Database Connectivity (JDBC) and Open Database Connectivity
(ODBC) drivers are used to establish a connection from data storage.
● Hive is also highly scalable, as it can serve both purposes, i.e. large data set
processing (batch query processing) and real-time processing (interactive
query processing).
● It supports all primitive data types of SQL.
APACHE MAHOUT -
Mahout provides a command line to invoke various algorithms. It has a predefined
library which already contains different inbuilt algorithms for different use cases.
APACHE SPARK -
● Apache Spark is a framework for real-time data analytics in a distributed computing
environment.
● Spark is written in Scala and was originally developed at the University of
California, Berkeley.
● It executes in-memory computations to increase the speed of data processing over MapReduce.
● It can be up to 100x faster than Hadoop MapReduce for large-scale data processing by
exploiting in-memory computations and other optimizations. Therefore, it requires more
processing power than MapReduce.
This is a very common question in everyone’s mind:
That is the reason why Spark and Hadoop are used together by many companies for
processing and analyzing their Big Data stored in HDFS.
APACHE HBASE -
● HBase is an open source, non-relational distributed database. In other words, it is a NoSQL
database.
● It supports all types of data, and that is why it’s capable of handling anything and everything
inside a Hadoop ecosystem.
● It is modelled after Google’s BigTable, which is a distributed storage system designed to
cope with large data sets.
● HBase was designed to run on top of HDFS and provides BigTable-like capabilities.
● It gives us a fault-tolerant way of storing sparse data, which is common in most Big
Data use cases.
● HBase is written in Java, whereas HBase applications can be written using REST, Avro
and Thrift APIs.
APACHE SOLR & LUCENE -
Assignment Questions -
1. Why is there a need for Big Data Analytics in Healthcare?
2. What is the role of Hadoop in Healthcare Analytics?
3. Explain how an IBM Watson platform can be used for healthcare analytics?