
1. Describe the terms: 1. Probability distributions, 2. Statistic, 3. Weighted average.

1. Probability distributions
Probability is one of the most fundamental concepts in statistics. Imagine that you’ve
just rolled into Las Vegas and settled into your favorite roulette table over at the
Bellagio. When the roulette wheel spins off, you intuitively understand that there is an
equal chance that the ball will fall into any of the slots of the cylinder on the wheel.
The slot where the ball will land is totally random, and the probability, or likelihood,
of the ball landing in any one slot over another is the same. Because the ball can land
in any slot, with equal probability, there is an equal probability distribution, or a
uniform probability distribution — the ball has an equal probability of landing in any
of the slots in the cylinder. But the slots of the roulette wheel are not all the same —
the wheel has 18 black slots and 20 slots that are either red or green. Because of this
arrangement, there is an 18/38 probability that your ball will land on a black slot. You plan to make successive bets that the ball will land on a black slot.
2. Statistic
A statistic is a result that’s derived from performing a mathematical operation on
numerical data. In general, you use statistics in decision making. Statistics come in
two flavors:
» Descriptive: Descriptive statistics provide a description that illuminates some
characteristic of a numerical dataset, including dataset distribution, central tendency
(such as mean, min, or max), and dispersion (as in standard deviation and variance).
» Inferential: Rather than focus on pertinent descriptions of a dataset, inferential
statistics carve out a smaller section of the dataset and attempt to deduce significant
information about the larger dataset. Use this type of statistics to get information about
a real-world measure in which you’re interested.
3. Weighted average
A weighted average is an average value of a measure over a very large number of data
points. If you take a weighted average of your winnings (your random variable) across
the probability distribution, this would yield an expectation value — an expected
value for your net winnings over a successive number of bets. (An expectation can
also be thought of as the best guess, if you had to guess.) To describe it more formally,
an expectation is a weighted average of some measure associated with a random
variable.
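To make the idea concrete, here is a minimal Python sketch (an illustrative addition, using the standard American-roulette assumption of 18 winning and 20 losing slots for a $1 bet on black) that computes the expectation as a weighted average:

```python
# Expected value of a $1 bet on black in American roulette (illustrative sketch).
# Winning pays +1 with probability 18/38; losing costs -1 with probability 20/38.
outcomes = [1, -1]                    # net winnings for a win / a loss
probabilities = [18 / 38, 20 / 38]

expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
print(f"Expected net winnings per bet: {expected_value:.4f}")  # about -0.0526
```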

2. Write a short note on Regression Methods.

Regression Methods: Machine learning algorithms of the regression variety were adopted from the statistics field to provide data scientists with a set of methods for describing and quantifying the relationships between variables in a dataset. Use
regression techniques if you want to determine the strength of correlation between
variables in your data. You can use regression to predict future values from historical
values, but be careful: Regression methods assume a cause-and-effect relationship
between variables, but present circumstances are always subject to flux. Predicting
future values from historical ones will generate incorrect results when present
circumstances change. In this section, I tell you all about linear regression, logistic
regression, and the ordinary least squares method.
Linear regression

Linear regression is a machine learning method used to describe and quantify the relationship between your target variable, y — the predictant, in statistics lingo — and the dataset features you’ve chosen to use as predictor variables (commonly designated as dataset X in machine learning). When you use just one variable as your predictor, linear regression is as simple as the middle school algebra formula y = mx + b. But you can also use linear regression to quantify correlations between several variables in a dataset; this is called multiple linear regression. Before getting too excited
about using linear regression, though, make sure you’ve considered its limitations:

» Linear regression only works with numerical variables, not categorical ones.

» If your dataset has missing values, this will cause problems. Be sure to address your missing values before attempting to build a linear regression model.

» If your data has outliers, your model will produce inaccurate results. Check for outliers before proceeding.

» Linear regression assumes that there is a linear relationship between the dataset features and the target variable. Test to make sure this is the case, and if it’s not, try using a log transformation to compensate.

» The linear regression model assumes that all features are independent of each other.

» Prediction errors, or residuals, should be normally distributed.

Don’t forget dataset size! A good rule of thumb is that you should have at least 20 observations per predictive feature if you expect to generate reliable results using linear regression.
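As a minimal sketch of the idea, assuming scikit-learn is available and using made-up toy data, a simple linear regression can be fit like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one predictor (X) and a roughly linear target (y).
X = np.array([[1], [2], [3], [4], [5], [6]])    # predictor variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])   # target variable

model = LinearRegression().fit(X, y)

print("slope (m):", model.coef_[0])         # analogous to m in y = mx + b
print("intercept (b):", model.intercept_)   # analogous to b in y = mx + b
print("prediction for X=7:", model.predict([[7]])[0])
```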

Logistic regression

Logistic regression is a machine learning method you can use to estimate values for a
categorical target variable based on your selected features. Your target variable should
be numeric, and contain values that describe the target’s class — or category. One
cool thing about logistic regression is that, in addition to predicting the class of
observations in your target variable, it indicates the probability for each of its
estimates. Though logistic regression is like linear regression, its requirements are simpler, in that:
» There does not need to be a linear relationship between the features and the target variable.

» Residuals don’t have to be normally distributed.

» Predictive features are not required to have a normal distribution. When deciding
whether logistic regression is a good choice for you, make sure to consider the
following limitations:

» Missing values should be treated or removed.

» Your target variable must be binary or ordinal.


» Predictive features should be independent of each other.

Logistic regression requires a greater number of observations (than linear regression) to produce a reliable result. The rule of thumb is that you should have at least 50 observations per predictive feature if you expect to generate reliable results.
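A hedged scikit-learn sketch of logistic regression, using the bundled breast-cancer dataset purely as an illustrative binary-classification example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary classification: predict a tumor class (0 or 1) from numeric features.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
# In addition to the predicted class, logistic regression reports a probability.
print("class probabilities for the first test row:", clf.predict_proba(X_test[:1]))
```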

Ordinary least squares (OLS) regression methods

Ordinary least squares (OLS) is a statistical method that fits a linear regression line to
a dataset. With OLS, you do this by squaring the vertical distance values that describe
the distances between the data points and the best-fit line, adding up those squared
distances, and then adjusting the placement of the best-fit line so that the summed
squared distance value is minimized. Use OLS if you want to construct a function
that’s a close approximation to your data. As always, don’t expect the actual value to
be identical to the value predicted by the regression. Values predicted by the
regression are simply estimates that are most similar to the actual values in the model.
OLS is particularly useful for fitting a regression line to models containing more than
one independent variable. In this way, you can use OLS to estimate the target from
dataset features. When using OLS regression methods to fit a regression line that has
more than one independent variable, two or more of the IVs may be interrelated.
When two or more IVs are strongly correlated with each other, this is called
multicollinearity. Multicollinearity tends to adversely affect the reliability of the IVs
as predictors when they’re examined apart from one another. Luckily, however,
multicollinearity doesn’t decrease the overall predictive reliability of the model when
it’s considered collectively.
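A small illustrative sketch of the OLS idea, using NumPy's least-squares solver on synthetic data with two independent variables (the coefficients below are made-up assumptions):

```python
import numpy as np

# Toy data with two independent variables (IVs) and one target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                      # two predictors
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 2.0 + rng.normal(scale=0.1, size=50)

# Add an intercept column and solve the least-squares problem:
# minimize the sum of squared vertical distances ||y - Xb||^2.
X_design = np.column_stack([np.ones(len(X)), X])
coeffs, residual_ss, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("intercept and slopes:", coeffs)        # close to [2.0, 3.0, -1.5]
print("sum of squared residuals:", residual_ss)
```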

3. What is Dimensionality Reduction? Why is Dimensionality Reduction


important.

Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature extraction and feature selection.
Why dimensionality reduction is important:
1. It reduces the time and storage space required.
2. Removal of multi-collinearity improves the interpretation of the parameters of
the machine learning model.
3. It becomes easier to visualize the data when it is reduced to very low dimensions such as 2D or 3D. In general we have to deal with very high-dimensional data, i.e., data with hundreds or thousands of dimensions; in that case it is not possible to visualize the data, and we may not have enough computational power to work with it either.
4. It avoids the curse of dimensionality.

4. Write down the advantage and disadvantage of Dimensionality Reduction.

Advantages of Dimensionality Reduction

 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features.

Disadvantages of Dimensionality Reduction


 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is sometimes
undesirable.
 PCA fails in cases where mean and covariance are not enough to define
datasets.
 We may not know how many principal components to keep; in practice, some rules of thumb are applied.

5. Explain Decreasing dimensionality and removing outliers with Principal


Component Analysis.

Working methodology of PCA:

For a better understanding of the working principle of PCA, let's take 2D data.

2D representation of data.

1. First, normalize the data so that the average value shifts to the origin and all the data lie in a unit square.

2. Now try to fit a line to the data. Start with a random line, then rotate it until it best fits the data.

Ultimately we end up with the following fit (a high degree of fit), which explains the maximum variance of the feature.

Best fitting line.

How does PCA find the best-fitting line?

Let's work it out with a single point.

Now, to quantify how well the line fits the data, PCA projects the data onto it.

i) Then it can measure the distances from the data to the line and try to find a line that
minimizes those distances.

ii) or it can try to find the line that maximizes the distances from the projected points to
the origin.

Mathematical Intuition:

To understand the math behind this technique, let's go back to our single data point.

After projecting the data point onto the line we get a right-angled triangle, and from the Pythagorean theorem we get A² = B² + C². Because A (the distance from the data point to the origin) is fixed, B and C trade off against each other: if B gets larger, then C must get smaller, and vice versa. Thus PCA can either minimize the distances from the data to the line or maximize the distances from the projected points to the origin.

It is easier to find the maximum distance from the projected point to the origin. Hence PCA finds the best-fitting line that maximizes the sum of the squared distances from the projected points to the origin.

maximum sum of squares of distances.

Cost function of PCA.

Here we take the squares of the distances so that negative values do not cancel out positive values.

Now we have the best-fitting line y = mx + c. This is called PC1 (principal component 1). Let's assume the proportions are 4:1, meaning we go 4 units on the X-axis for every 1 unit on the Y-axis, which tells us that the data is mostly spread along the X-axis.

From the Pythagorean theorem, a² = b² + c² = 4² + 1² = 17, so a = √17 ≈ 4.12. But the data is scaled, hence we divide each side by 4.12 in order to get the unit vector, i.e.,
F1 = 4 / 4.12 = 0.97 and
F2 = 1 / 4.12 = 0.242
The unit vector that we just calculated is called the eigenvector of PC1, and the proportions of the features (0.97 : 0.242) are called loading scores.
SS(distances for PC1) = eigenvalue for PC1.
sqrt(eigenvalue for PC1) = singular value for PC1.
Now we do the same thing for other features to get principal components. For
projecting the data now we will rotate the axis so that PC1 gets parallel (horizontal) to
the X-axis.

Rotating the axis so that PC1 gets horizontal

Projecting the data based on the principal components.

For visualization, let's project the data based on the projected points on both principal components.

We can see that it is equal to the original projection of points.

How to calculate variation?

Here we calculate the variation using the eigenvalues computed by PCA.

Suppose we get a variation of 0.83 for PC1 and 0.17 for PC2.

Now, if we want to convert the data from 2D to 1D, we choose PC1 as the final 1D representation since it covers 83% of the total variation.

This is how PCA works: based on the variance obtained from the principal components, it determines which dimensions can be eliminated for dimensionality reduction.
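A brief sketch of the same workflow with scikit-learn, assuming synthetic 2D data; the loading scores and explained-variance ratios play the roles of the eigenvectors and the 0.83/0.17 variation split described above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy 2D data spread mostly along one direction.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.25 * x1 + rng.normal(scale=0.1, size=200)
data = np.column_stack([x1, x2])

# Step 1: normalize so the mean shifts to the origin.
scaled = StandardScaler().fit_transform(data)

# Fit PCA; components_ holds the unit eigenvectors (loading scores),
# explained_variance_ratio_ holds the share of variation per component.
pca = PCA(n_components=2).fit(scaled)
print("PC1 loading scores:", pca.components_[0])
print("variation explained:", pca.explained_variance_ratio_)

# Reducing 2D -> 1D: keep only the projection onto PC1.
reduced = PCA(n_components=1).fit_transform(scaled)
print("reduced shape:", reduced.shape)   # (200, 1)
```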

6. How do you identify time series analysis patterns?


In time series analysis, we use terms such as “trend” and “seasonal”.
Trend
A trend exists when there is a long-term increase or decrease in the data. It does not
have to be linear. Sometimes we will refer to a trend as “changing direction,” when it
might go from an increasing trend to a decreasing trend. There is a trend in the
antidiabetic drug sales data shown in the figure.
Seasonal
A seasonal pattern occurs when a time series is affected by seasonal factors such as
the time of the year or the day of the week. Seasonality is always of a fixed and
known frequency. The monthly sales of antidiabetic drugs shown above exhibit seasonality, which is induced partly by the change in the cost of the drugs at the end of the calendar year.
Cyclic
A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency.
These fluctuations are usually due to economic conditions, and are often related to the
“business cycle.” The duration of these fluctuations is usually at least 2 years.
Many people confuse cyclic behaviour with seasonal behaviour, but they are really
quite different. If the fluctuations are not of a fixed frequency then they are cyclic; if
the frequency is unchanging and associated with some aspect of the calendar, then the
pattern is seasonal. In general, the average length of cycles is longer than the length of
a seasonal pattern, and the magnitudes of cycles tend to be more variable than the
magnitudes of seasonal patterns.
Many time series include trend, cycles and seasonality. When choosing a forecasting
method, we will first need to identify the time series patterns in the data, and then
choose a method that is able to capture the patterns properly.
The examples in the figure below show different combinations of the above components.

Four examples of time series showing different patterns.

1. The monthly housing sales (top left) show strong seasonality within each year,
as well as some strong cyclic behaviour with a period of about 6–10 years. There is no
apparent trend in the data over this period.
2. The US treasury bill contracts (top right) show results from the Chicago market
for 100 consecutive trading days in 1981. Here there is no seasonality, but an obvious
downward trend. Possibly, if we had a much longer series, we would see that this
downward trend is actually part of a long cycle, but when viewed over only 100 days
it appears to be a trend.
3. The Australian quarterly electricity production (bottom left) shows a strong
increasing trend, with strong seasonality. There is no evidence of any cyclic behaviour
here.
4. The daily change in the Google closing stock price (bottom right) has no trend,
seasonality or cyclic behaviour. There are random fluctuations which do not appear to
be very predictable, and no strong patterns that would help with developing a
forecasting model.
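As an illustrative sketch (not part of the original answer), a series can be split into trend, seasonal, and residual components with statsmodels, here on synthetic monthly data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with an upward trend and yearly seasonality.
index = pd.date_range("2015-01", periods=60, freq="MS")
trend = np.linspace(10, 30, 60)
seasonal = 5 * np.sin(2 * np.pi * index.month.to_numpy() / 12)
noise = np.random.default_rng(0).normal(scale=1.0, size=60)
series = pd.Series(trend + seasonal + noise, index=index)

# Decompose into trend, seasonal, and residual components (period = 12 months).
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))   # repeating yearly pattern
```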

7. What is the K-Means Clustering Algorithm and how does it work?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the


unlabeled dataset into different clusters. Here K defines the number of pre-defined
clusters that need to be created in the process, as if K=2, there will be two clusters,
and for K=3, there will be three clusters, and so on. It is an iterative algorithm that
divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data point
and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative


process.
o Assigns each data point to its closest k-center. Those data points which are near
to the particular k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:

o Let's take the number of clusters k, i.e., K=2, to identify the dataset and to put the points into different clusters. It means here we will try to group these data points into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying some mathematics that we have studied to
calculate the distance between two points. So, we will draw a median between both
the centroids. Consider the below image:

From the above image, it is clear that points on the left side of the line are near the K1 or blue centroid, and points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.

As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of each cluster, and will find new centroids as below:

o Next, we will reassign each datapoint to the new centroid. For this, we will
repeat the same process of finding a median line. The median will be like below
image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.

o We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:

o As we have the new centroids, we will again draw the median line and reassign the data points. So, the image will be:

o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:

As our model is ready, so we can now remove the assumed centroids, and the two
final clusters will be as shown in the below image:
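A minimal scikit-learn sketch of the same procedure, assuming a toy two-variable dataset (M1, M2) and K=2:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy dataset with two variables M1 and M2 (as in the example above).
rng = np.random.default_rng(2)
group_a = rng.normal(loc=[2, 2], scale=0.5, size=(25, 2))
group_b = rng.normal(loc=[7, 7], scale=0.5, size=(25, 2))
data = np.vstack([group_a, group_b])

# K=2 clusters; KMeans iterates centroid placement until assignments stabilize.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print("cluster labels:", kmeans.labels_[:10])
print("final centroids:\n", kmeans.cluster_centers_)
print("sum of squared distances (inertia):", kmeans.inertia_)
```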

8. Explain the Clustering with hierarchical algorithms.

Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work. One difference is that there is no requirement to predetermine the number of clusters as we did in the K-Means algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the


algorithm starts with taking all data points as single clusters and merging them until
one cluster is left.

2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it
is a top-down approach.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data points into clusters, it follows the bottom-up approach. This means the algorithm considers each data point as a single cluster at the beginning, and then starts combining the closest pairs of clusters together. It does this until all the clusters are merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of the dendrogram.

How the Agglomerative Hierarchical clustering Work?

The working of the AHC algorithm can be explained using the below steps:

o Step-1: Create each data point as a single cluster. Let's say there are N data
points, so the number of clusters will also be N.

o Step-2: Take two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.

o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters. Consider the below images:

o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.

9. Explain the Linkage methods in hierarchical algorithms.


How the distance between two clusters is measured is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and these ways decide the rule for clustering. These measures are called linkage methods. Some of the popular linkage methods are given below:

1. Single Linkage: It is the shortest distance between the closest points of the two clusters. Consider the below image:

2. Complete Linkage: It is the farthest distance between the two points of two
different clusters. It is one of the popular linkage methods as it forms tighter clusters
than single-linkage.

3. Average Linkage: It is the linkage method in which the distance between each pair of points (one from each cluster) is added up and then divided by the total number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroids of the two clusters is calculated. Consider the below image:

From the above-given approaches, we can apply any of them according to the type of
problem or business requirement.
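To make the four measures concrete, here is a small NumPy/SciPy sketch (the two clusters below are made-up points) that computes each linkage distance directly:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters of 2D points (illustrative values).
cluster_1 = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5]])
cluster_2 = np.array([[6.0, 6.0], [6.5, 7.0]])

pairwise = cdist(cluster_1, cluster_2)   # all point-to-point distances

print("single linkage   (min pairwise):", pairwise.min())
print("complete linkage (max pairwise):", pairwise.max())
print("average linkage  (mean pairwise):", pairwise.mean())
print("centroid linkage (centroid distance):",
      np.linalg.norm(cluster_1.mean(axis=0) - cluster_2.mean(axis=0)))
```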

10. Explain the working of the dendrogram in hierarchical clustering.

The dendrogram is a tree-like structure that is mainly used to record each step that the HC algorithm performs. In the dendrogram plot, the Y-axis shows
the Euclidean distances between the data points, and the x-axis shows all the data
points of the given dataset.

The working of the dendrogram can be explained using the below diagram:

In the above diagram, the left part is showing how clusters are created in
agglomerative clustering, and the right part is showing the corresponding dendrogram.

o First, the data points P2 and P3 combine together and form a cluster; correspondingly, a dendrogram is created, which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one
dendrogram, and P4, P5, and P6, in another dendrogram.
o At last, the final dendrogram is created that combines all the data points
together.
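A brief SciPy sketch (with illustrative coordinates loosely matching P1 to P6 above) that builds the merge history and plots the dendrogram:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six 2D points, loosely matching the P1..P6 walkthrough above.
points = np.array([[1, 1], [1.2, 1.1], [1.1, 1.3],    # P1, P2, P3
                   [5, 5], [5.3, 5.2], [5.1, 5.4]])   # P4, P5, P6

# Build the merge history (here with average linkage) and plot the dendrogram.
merges = linkage(points, method="average")
dendrogram(merges, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.ylabel("Euclidean distance")
plt.show()
```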

11. Write a short note on Random Forest Algorithm.

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output.

The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:

Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not.
But together, all the trees predict the correct output. Therefore, below are two
assumptions for a better Random forest classifier:

o There should be some actual values in the feature variable of the dataset so that
the classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.

Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:

<="" li="">
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy; even for a large dataset, it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the
new data points to the category that wins the majority votes.
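A minimal scikit-learn sketch of these steps, using the bundled Iris dataset purely as an illustrative example:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Each of the N trees is trained on a random subset of rows and features;
# the forest predicts by majority vote across trees.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("predicted class of first test sample:", forest.predict(X_test[:1])[0])
```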

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into subsets and
given to each decision tree. During the training phase, each decision tree produces a
prediction result, and when a new data point occurs, then based on the majority of
results, the Random Forest classifier predicts the final decision. Consider the below
image:

Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of
loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression


tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

o Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
12. Difference between Classification and Clustering.

Parameter: Classification vs. Clustering

Type: Classification is used for supervised learning; clustering is used for unsupervised learning.

Basic: Classification is the process of classifying the input instances based on their corresponding class labels; clustering is grouping the instances based on their similarity without the help of class labels.

Need: Classification has labels, so a training and testing dataset is needed to verify the model created; clustering has no need for training and testing datasets.

Complexity: Classification is more complex as compared to clustering; clustering is less complex as compared to classification.

Example Algorithms: Classification: logistic regression, Naive Bayes classifier, support vector machines, etc. Clustering: k-means clustering algorithm, fuzzy c-means clustering algorithm, Gaussian (EM) clustering algorithm, etc.

13. Classifying data with K- nearest neighbor algorithm.

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms


based on Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and
available cases and put the new case into the category that is most similar to the
available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on the similarity. This means when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.

o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
o The KNN algorithm just stores the dataset at the training phase, and when it gets new data, it classifies that data into a category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will compare the features of the new image to the cat and dog images and, based on the most similar features, it will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

o Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each
category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already studied
in geometry. It can be calculated as:

o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:

o As we can see the 3 nearest neighbors are from category A, hence this new data
point must belong to category A.
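A short scikit-learn sketch of the same procedure, using made-up 2D points for categories A and B and K=5:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two categories (A=0, B=1) of 2D points, plus one new point to classify.
X = np.array([[1, 2], [2, 1], [1.5, 1.8], [2.2, 2.4],    # category A
              [6, 6], [7, 7], [6.5, 7.2], [7.1, 6.4]])   # category B
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# K=5 neighbors, Euclidean distance (the default metric).
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

new_point = np.array([[2.0, 2.0]])
print("predicted category:", knn.predict(new_point)[0])        # 0 -> category A
print("neighbor vote shares:", knn.predict_proba(new_point)[0])
```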

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN
algorithm:

o There is no particular way to determine the best value for "K", so we need to
try some values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Large values for K are good, but they may cause some difficulties.

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o It always needs the value of K to be determined, which may be complex at times.
o The computation cost is high because of calculating the distance between the
data points for all the training samples.

14. Solving Real-World Problems with Nearest Neighbor Algorithms.

o Nearest neighbor methods are used extensively to understand and create value
from patterns in retail business data. In the following sections, I present two powerful
cases where kNN and average-NN algorithms are being used to simplify management
and security in daily retail operations. Seeing k-nearest neighbor algorithms in action: K-nearest neighbor techniques for pattern recognition are often used for theft prevention in the modern retail business. You're accustomed to seeing CCTV cameras around almost every store you visit, but most people have no idea how the data gathered from these devices is being used. You might imagine someone in the back room monitoring these cameras for suspicious activity, and perhaps that is how things were done in the past. But a modern surveillance system is intelligent enough to analyze and interpret video data on its own, without the need for human assistance.
o The modern systems can now use k-nearest neighbor for visual pattern
recognition to scan and detect hidden packages in the bottom bin of a shopping cart at
checkout. If an object is detected that is an exact match for an object listed in the
database, the price of the spotted product could even automatically be added to the
customer’s bill. Though this automated billing practice is not used extensively now,
the technology has been developed and is available for use. K-nearest neighbor is also
used in retail to detect patterns in credit card usage. Many new transaction-
scrutinizing software applications use kNN algorithms to analyze register data and
spot unusual patterns that indicate suspicious activity. For example, if register data
indicates that a lot of customer information is being entered manually rather than
through automated scanning and swiping, this could indicate that the employee who’s
using that register is in fact stealing a customer’s personal information. Or, if register
data indicates that a particular good is being returned or exchanged multiple times,
this could indicate that employees are misusing the return policy or trying to make
money from making fake returns.

15. Explain Naïve Bayes Classifier Algorithm.

o Naïve Bayes algorithm is a supervised learning algorithm, which is based


on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that can make
quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below
example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes

11 Rainy No
12 Overcast Yes
13 Overcast Yes

Frequency table for the weather conditions:

Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4

Likelihood table for the weather conditions:

Weather No Yes P(Weather)
Overcast 0 5 5/14 = 0.35
Rainy 2 2 4/14 = 0.29
Sunny 2 3 5/14 = 0.35
All 4/14 = 0.29 (P(No)) 10/14 = 0.71 (P(Yes))

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
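The same calculation can be reproduced with a few lines of Python (an illustrative sketch using the 14-row dataset above):

```python
# Reproducing the Bayes calculation above for the weather/"Play" dataset.
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
p_yes = play.count("Yes") / n                     # P(Yes) = 10/14
p_no = play.count("No") / n                       # P(No)  = 4/14
p_sunny = outlook.count("Sunny") / n              # P(Sunny) = 5/14

# Likelihoods P(Sunny|Yes) and P(Sunny|No).
sunny_yes = sum(o == "Sunny" and p == "Yes" for o, p in zip(outlook, play))
sunny_no = sum(o == "Sunny" and p == "No" for o, p in zip(outlook, play))
p_sunny_given_yes = sunny_yes / play.count("Yes")  # 3/10
p_sunny_given_no = sunny_no / play.count("No")     # 2/4

print("P(Yes|Sunny) =", p_sunny_given_yes * p_yes / p_sunny)  # 0.60
print("P(No|Sunny)  =", p_sunny_given_no * p_no / p_sunny)    # 0.40 (0.41 with the rounded figures above)
```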

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.


o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

16. Explain Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal


distribution. This means if predictors take continuous values instead of discrete, then
the model assumes that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as sports, politics, education, etc. The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also popular for document classification tasks.
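A brief scikit-learn sketch showing the three model types side by side (the tiny numeric and text examples are illustrative assumptions):

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Gaussian NB: continuous features assumed normally distributed per class.
gnb = GaussianNB().fit([[1.0, 2.1], [1.2, 1.9], [6.5, 7.0], [7.1, 6.8]], [0, 0, 1, 1])
print(gnb.predict([[1.1, 2.0]]))   # -> class 0

# Multinomial NB: word-count features for document classification.
docs = ["goal match team win", "election vote policy", "team score goal"]
labels = ["sports", "politics", "sports"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
mnb = MultinomialNB().fit(counts, labels)
print(mnb.predict(vectorizer.transform(["vote in the election"])))  # -> 'politics'

# Bernoulli NB: binary (word present / absent) features.
# The default binarize=0.0 converts the counts to present/absent flags.
bnb = BernoulliNB().fit(counts, labels)
```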

17. What is data visualization and why is it important?

Data visualization is the representation of data or information in a graph, chart, or


other visual format. It communicates relationships of the data with images. This is
important because it allows trends and patterns to be more easily seen. With the rise of
big data upon us, we need to be able to interpret increasingly larger batches of data.
Machine learning makes it easier to conduct analyses such as predictive analysis,
which can then serve as helpful visualizations to present. But data visualization is not
only important for data scientists and data analysts, it is necessary to understand data
visualization in any career. Whether you work in finance, marketing, tech, design, or
anything else, you need to visualize data. That fact showcases the importance of data
visualization.

Why do we need data visualization?

We need data visualization because a visual summary of information makes it easier


to identify patterns and trends than looking through thousands of rows on a
spreadsheet. It’s the way the human brain works. Since the purpose of data analysis is
to gain insights, data is much more valuable when it is visualized. Even if a data
analyst can pull insights from data without visualization, it will be more difficult to
communicate the meaning without visualization. Charts and graphs make
communicating data findings easier even if you can identify the patterns without them.

In undergraduate business schools, students are often taught the importance of


presenting data findings with visualization. Without a visual representation of the
insights, it can be hard for the audience to grasp the true meaning of the findings. For
example, rattling off numbers to your boss won’t tell them why they should care about
the data, but showing them a graph of how much money the insights could save/make
them is sure to get their attention.

18. Explain the three main types of data visualizations.

1. Data storytelling for organizational decision makers

Sometimes you have to design data visualizations for a less technically minded audience, perhaps in order to help members of this audience make better-informed business decisions. The purpose of this type of visualization is to tell your audience
the story behind the data. In data storytelling, the audience depends on you to make
sense of the data behind the visualization and then turn useful insights into visual
stories that they can understand. With data storytelling, your goal should be to create a
clutter-free, highly focused visualization so that members of your audience can
quickly extract meaning without having to make much effort. These visualizations are
best delivered in the form of static images, but more adept decision makers may prefer
to have an interactive dashboard that they can use to do a bit of exploration and what-
if modeling.

2. Data showcasing for analysts If you’re designing for a crowd of logical,


calculating analysts, you can create data visualizations that are rather open-ended. The purpose of this type of visualization is to help audience members visually explore the data and draw their own conclusions. When using data showcasing techniques, your goal should be to display a lot of contextual information that supports audience members' interpretations.
These visualizations should include more contextual data and less conclusive focus so
that people can get in and analyze the data for themselves, and then draw their own
conclusions. These visualizations are best delivered as static images or dynamic,
interactive dashboards.

3. Designing data art for activists You could be designing for an audience of
idealists, dreamers, and changemakers. When designing for this audience, you want
your data visualization to make a point! You can assume that typical audience
members aren’t overly analytical. What they lack in math skills, however, they more
than compensate for in solid convictions. These people look to your data visualization
as a vehicle by which to make a statement. When designing for this audience, data art
is the way to go. The main goal in using data art is to entertain, to provoke, to annoy,
or to do whatever it takes to make a loud, clear, attention-demanding statement. Data
art has little to no narrative and offers no room for viewers to form their own
interpretations. Data scientists have an ethical responsibility to always represent data
accurately. A data scientist should never distort the message of the data to fit what the
audience wants to hear — not even for data art! Nontechnical audiences don’t even
recognize, let alone see, the possible issues. They rely on the data scientist to provide
honest and accurate representations, thus amplifying the level of ethical responsibility
that the data scientist must assume.

19. How do you design to meet the needs of the target audience to make a functional data visualization?

Designing for your target audience: To make a functional data visualization, you must get to know your target audience and then design precisely for their needs. But to
make every design decision with your target audience in mind, you need to take a few
steps to make sure that you truly understand your data visualization’s target
consumers. To gain the insights you need about your audience and your purpose,
follow this process:

1. Brainstorm. Think about a specific member of your visualization's audience, and make as many educated guesses as you can about that person's motivations. Give this (imaginary) audience member a name and a few other identifying characteristics. I always imagine a 45-year-old divorced mother of two named Brenda.

2. Define the purpose of your visualization. Narrow the purpose of the visualization
by deciding exactly what action or outcome you want audience members to make as a
result of the visualization.

3. Choose a functional design. Review the three main data visualization types
(discussed earlier in this chapter) and decide which type can best help you achieve
your intended outcome.
The following sections spell out this process in detail.

Step 1: Brainstorm (about Brenda) To brainstorm properly, pull out a sheet of paper
and think about your imaginary audience member (Brenda) so that you can create a
more functional and effective data visualization. Answer the following questions to
help you better understand her, and thus better understand and design for your target
audience. Form a picture of what Brenda’s average day looks like — what she does
when she gets out of bed in the morning, what she does over her lunch hour, and what
her workplace is like. Also consider how Brenda will use your visualization. To form
a comprehensive view of who Brenda is and how you can best meet her needs, ask these questions:
» Where does Brenda work? What does she do for a living?
» What kind of technical education or experience, if any, does she have?
» How old is Brenda? Is she married? Does she have children? What does she look like? Where does she live?
» What social, political, cause-based, or professional issues are important to Brenda? What does she think of herself?
» What problems and issues does Brenda have to deal with every day?
» How does your data visualization help solve Brenda's work problems or her family problems? How does it improve her self-esteem?
» Through what avenue will you present the visualization to Brenda — for example, over the Internet or in a staff meeting?
» What does Brenda need to be able to do with your data visualization?
Say that Brenda is the manager of the zoning department in Irvine County.
She is 45 years old and a single divorcee with two children who are about to start
college. She is deeply interested in local politics and eventually wants to be on the
county’s board of commissioners. To achieve that position, she has to get some major
“oomph” on her county management résumé. Brenda derives most of her feelings of
self-worth from her job and her keen ability to make good management decisions for
her department. Until now, Brenda has been forced to manage her department
according to her gut-feel intuition, backed by a few disparate business systems reports.
She is not extraordinarily analytical, but she knows enough to understand what she
sees. The problem is that Brenda hasn’t had the visualization tools that are necessary
to display all the relevant data she should be considering. Because she has neither the
time nor the skill to code something herself, she’s been waiting in the lurch. Brenda is
excited that you’ll be attending next Monday’s staff meeting to present the data
visualization alternatives available to help her get under way in making data-driven
management decisions.

Step 2: Define the purpose After you brainstorm about the typical audience
member (see the preceding section), you can much more easily pinpoint exactly what
you’re trying to achieve with the data visualization. Are you attempting to get
consumers to feel a certain way about themselves or the world around them? Are you
trying to make a statement? Are you seeking to influence organizational decision
makers to make good business decisions? Or do you simply want to lay all the data
out there, for all viewers to make sense of, and deduce from what they will? Return to
the hypothetical Brenda: What decisions or processes are you trying to help her
achieve? Well, you need to make sense of her data, and then you need to present it to
her in a way that she can clearly understand. What’s happening within the inner
mechanics of her department? Using your visualization, you seek to guide Brenda into
making the most prudent and effective management choices.
Step 3: Choose the most functional visualization type for your purpose. Keep in mind that you have three main types of visualization from which to choose: data storytelling, data art, and data showcasing. If you're designing for organizational decision makers, you'll most likely use data storytelling to directly tell your audience what their data means with respect to their line of business. If you're designing for a social justice organization or a political campaign, data art can best make a dramatic and effective statement with your data. Lastly, if you're designing for engineers, scientists,
effective statement with your data. Lastly, if you’re designing for engineers, scientists,
or statisticians, stick with data showcasing so that these analytical types have plenty of
room to figure things out on their own. Back to Brenda — because she’s not
extraordinarily analytical and because she’s depending on you to help her make
excellent data-driven decisions, you need to employ data storytelling techniques.
Create either a static or interactive data visualization with some, but not too much,
context. The visual elements of the design should tell a clear story so that Brenda
doesn’t have to work through tons of complexities to get the point of what you’re
trying to tell her about her data and her line of business.

20. Write down the Types of data visualization charts.

Charts are an essential part of working with data, as they are a way to condense large amounts of data into an easy-to-understand format. Visualizations of data can bring
out insights to someone looking at the data for the first time, as well as convey
findings to others who won’t see the raw data. There are countless chart types out
there, each with different use cases. Often, the most difficult part of creating a data
visualization is figuring out which chart type is best for the task at hand.

Bar chart

In a bar chart, values are indicated by the length of bars, each of which corresponds
with a measured group. Bar charts can be oriented vertically or horizontally; vertical
bar charts are sometimes called column charts. Horizontal bar charts are a good option
when you have a lot of bars to plot, or the labels on them require additional space to
be legible.

Line chart

Line charts show changes in value across continuous measurements, such as those
made over time. Movement of the line up or down helps bring out positive and
negative changes, respectively. It can also expose overall trends, to help the reader
make predictions or projections for future outcomes. Multiple line charts can also give
rise to other related charts like the sparkline or ridgeline plot.

Scatter plot

A scatter plot displays values on two numeric variables using points positioned on two
axes: one for each variable. Scatter plots are a versatile demonstration of the
relationship between the plotted variables—whether that correlation is strong or weak,
positive or negative, linear or non-linear. Scatter plots are also great for identifying
outlier points and possible gaps in the data.

Box plot

A box plot uses boxes and whiskers to summarize the distribution of values within
measured groups. The positions of the box and whisker ends show the regions where
the majority of the data lies. We most commonly see box plots when we have multiple

groups to compare to one another; other charts with more detail are preferred when we
have only one group to plot.

Before moving on to other chart types, it’s worth taking a moment to appreciate the
option of just showing the raw numbers. In particular, when you only have one
number to show, just displaying the value is a sensible approach to depicting the data.
When exact values are of interest in an analysis, you can include them in an
accompanying table or through annotations on a graphical visualization.
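As an illustrative sketch (not part of the original text), the four basic chart types above can be produced with matplotlib and made-up sample data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Bar chart: one value per measured group.
axes[0, 0].bar(["A", "B", "C", "D"], [23, 17, 35, 29])
axes[0, 0].set_title("Bar chart")

# Line chart: change in value over a continuous measure (e.g., time).
axes[0, 1].plot(np.arange(12), np.cumsum(rng.normal(1, 2, 12)))
axes[0, 1].set_title("Line chart")

# Scatter plot: relationship between two numeric variables.
x = rng.normal(size=100)
axes[1, 0].scatter(x, 2 * x + rng.normal(scale=0.5, size=100), s=10)
axes[1, 0].set_title("Scatter plot")

# Box plot: distribution summary for several groups.
axes[1, 1].boxplot([rng.normal(0, 1, 50), rng.normal(1, 1.5, 50), rng.normal(2, 0.5, 50)])
axes[1, 1].set_title("Box plot")

plt.tight_layout()
plt.show()
```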

Common Variations
Additional chart types can come about from changing the ways encodings are used, or
by including additional encodings. Secondary encodings like area, shape, and color
can be useful for adding additional variables to more basic chart types.

Histogram

If the groups depicted in a bar chart are actually continuous numeric ranges, we can
push the bars together to generate a histogram. Bar lengths in histograms typically
correspond to counts of data points, and their patterns demonstrate the distribution of
variables in your data. A different chart type like line chart tends to be used when the
vertical value is not a frequency count.

Stacked bar chart

One modification of the standard bar chart is to divide each bar into multiple smaller
bars based on values of a second grouping variable, called a stacked bar chart. This

allows you to not only compare primary group values like in a regular bar chart, but
also illustrate a relative breakdown of each group’s whole into its constituent parts.

Grouped bar chart

If, on the other hand, the sub-bars were placed side-by-side into clusters instead of
kept in their stacks, we would obtain the grouped bar chart. The grouped bar chart
does not allow for comparison of primary group totals, but does a much better job of
allowing for comparison of the sub-groups.

Dot plot

A dot plot is like a bar chart in that it indicates values for different categorical
groupings, but encodes values based on a point’s position rather than a bar’s length.
Dot plots are useful when you need to compare across categories, but the zero baseline
is not informative or useful. You can also think of a dot plot as a line plot with the line
removed, so that it can be used with unordered categorical variables rather than just
continuous or ordered variables.

Area chart

An area chart starts with the same foundation as a line chart – value points connected
by line segments – but adds in a concept from the bar chart with shading between the
line and a baseline. This chart is most often seen when combined with the concept of
stacking, to show both how a total has changed over time and how its components’
contributions have changed.

Dual-axis chart

Dual-axis charts overlay two different charts with a shared horizontal axis, but
potentially different vertical axis scales (one for each component chart). This can be
useful to show a direct comparison between the two sets of vertical values, while also
including the context of the horizontal-axis variable. It is common to use different
base chart types, like the bar and line combination, to reduce confusion of the different
axis scales for each component chart.

Bubble chart

Another way of showing the relationship between three variables is through
modification of a scatter plot. When a third variable is categorical, points can use
different shapes or colors to indicate group membership. If the data points are ordered
in some way, points can also be connected with line segments to show the sequence of
values. When the third variable is numeric in nature, that is where the bubble
chart comes in. A bubble chart builds on the base scatter plot by having the third
variable’s value determine the size of each point.

Density curve

The density curve, or kernel density estimate, is an alternative to the histogram for
showing distributions of data. Rather than collecting data points into
frequency bins, each data point contributes a small volume of data whose collected
whole becomes the density curve. While density curves may imply some data values
that do not exist, they can be a good way to smooth out noise in the data to get an
understanding of the distribution signal.

Violin plot

An alternative to the box plot’s approach to comparing value distributions between
groups is the violin plot. In a violin plot, each set of box and whiskers is replaced with
a density curve built around a central baseline. This can provide a better comparison
of data shapes between groups, though this does lose out on comparisons of precise
statistical values. A frequent variation for violin plots is to include box-style markings
on top of the violin plot to get the best of both worlds.

Heatmap

The heatmap presents a grid of values based on two variables of interest. The axis
variables can be numeric or categorical; the grid is created by dividing each variable
into ranges or levels like a histogram or bar chart. Grid cells are colored based on
value, often with darker colors corresponding with higher values. A heatmap can be
an interesting alternative to a scatter plot when there are a lot of data points to plot, but
the point density makes it difficult to see the true relationship between variables.
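As a rough sketch (assuming matplotlib; the two variables are simulated), a heatmap of two
numeric variables can be built by binning both axes and coloring each grid cell by its count:

import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(size=5000)        # hypothetical variable 1
y = x + np.random.normal(size=5000)    # hypothetical variable 2, correlated with x
counts, xedges, yedges, im = plt.hist2d(x, y, bins=30)  # cells colored by point counts
plt.colorbar(im, label="count")
plt.show()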

Specialist Charts
There are plenty of additional charts out there that encode data in other ways for
particular use cases. Xenographics includes a collection of some fanciful charts that
have been driven by very particular purposes. Still, some of these charts have use
cases that are common enough that they can be considered essential to know.

Pie chart

It may seem surprising to see pie charts sequestered here in the ‘specialist’ section,
considering how commonly they are utilized. However, pie charts use an uncommon encoding,
depicting values as areas sliced from a circular form. Since a pie chart typically lacks
value markings around its perimeter, it is usually difficult to get a good idea of exact
slice sizes. However, the pie chart and its cousin the donut plot excel at telling the
reader that the part-to-whole comparison should be the main takeaway from the
visualization.

Funnel chart

A funnel chart is often seen in business contexts where visitors or users need to be
tracked in a pipeline flow. The chart shows how many users make it to each stage of
the tracked process from the width of the funnel at each stage division. The tapering of
the funnel helps to sell the analogy, but can muddle what the true conversion rates are.
A bar chart can often fulfill the same purpose as a funnel chart, but with a cleaner
representation of data.

Bullet chart

The bullet chart enhances a single bar with additional markings for how to
contextualize that bar’s value. This usually means a perpendicular line showing a
target value, but also background shading to provide additional performance
benchmarks. Bullet charts are usually used for multiple metrics, and are more compact
to render than other types of more fanciful gauges.

Map-based plots

There are a number of families of specialist plots grouped by usage, but we’ll close
this article out by touching upon one of them: map-based or geospatial plots. When
values in a dataset correspond to actual geographic locations, it can be valuable to
actually plot them with some kind of map. A common example of this type of map is
the choropleth like the one above. This takes a heat map approach to depicting value
through the use of color, but instead of values being plotted in a grid, they are filled
into regions on a map.

21. What is D3.js? Why build data visualizations with D3.js? When should you use
D3.js? Where is D3.js used?

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you
bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives
you the full capabilities of modern browsers without tying yourself to a proprietary
framework, combining powerful visualization components and a data-driven approach
to DOM manipulation.

Why build Data Visualizations with D3.js?


Use D3.js because it lets you build the data visualization framework that you want.
Graphic and data visualization frameworks make a great many decisions to keep the
framework easy to use; D3.js instead focuses on binding data to DOM elements.
D3.js is written in JavaScript and uses a functional style, which means you can reuse
code and add specific functions to your heart's content. In other words, it is as powerful
as you want to make it: how you choose to style, manipulate, and add interactivity to
the data is up to you.

When should you use D3.js?


You should use D3.js when your webpage is interacting with data. D3 stands for
Data-Driven Documents. We will explore D3.js for its graphing capabilities; we will
not explore the myriad non-graphical ways you can use data in your webpages.

Where is D3.js used?


D3.js is a JavaScript library added to the front end of your web application. Your
back end (the server) generates the necessary data, and the part of the application the
users interact with (the front end) uses D3.js.

22. Explain how D3.js uses Cascading Style Sheets (CSS).

CSS are used to style the elements in the DOM. A style sheet can exist as a
separate .css file that you include in your HTML page or can be embedded directly in
the HTML page. Style sheets refer to an ID, class, or type of element and determine
the appearance of that element. The terminology used to define the style is a CSS
selector and is the same type of selector used in the d3.select() syntax. You can set
inline styles (that are applied to only a single element) by using
d3.select("#someElement") .style("opacity", .5) to set the opacity of an element to
50%. Let’s update your d3ia.html to include a style sheet.

Example

<!doctype html>
<html>
<script src="d3.v3.min.js" type="text/JavaScript"></script>
<style>
.inactive, .tentative {
  stroke: darkgray;
  stroke-width: 2px;
  stroke-dasharray: 5 5;
}
.tentative {
  opacity: .5;
}
.active {
  stroke: black;
  stroke-width: 4px;
  stroke-dasharray: 1;
}
circle {
  fill: red;
}
rect {
  fill: darkgray;
}
</style>
<body>
<div id="infovizDiv">
  <svg style="width:500px;height:500px;border:1px lightgray solid;">
    <path d="M 10,60 40,30 50,50 60,30 70,80" />
    <polygon class="inactive" points="80,400 120,400 160,440 120,480 60,460" />
    <g>
      <circle class="active tentative" cy="100" cx="200" r="30"/>
      <rect class="active" x="410" y="200" width="100" height="50" />
    </g>
  </svg>
</div>
</body>
</html>
Style sheets can also refer to a state of the element, so with :hover you can change the
way an element looks when the user mouses over that element. You can learn about
other, more complex CSS selectors in a book devoted to that subject. The most useful
way to change an element's appearance dynamically is to have CSS classes associated
with particular stylistic changes and then change the class of an element. You can
change the class of an element, which is one of its attributes, by selecting the element
and modifying its class attribute.

23. What is Tableau Public? Write its advantages and disadvantages.

Tableau has a variety of options available, including a desktop app, server and hosted
online versions, and a free public option. There are hundreds of data import options
available, from CSV files to Google Ads and Analytics data to Salesforce data. Output
options include multiple chart formats as well as mapping capability. That means
designers can create color-coded maps that showcase geographically important data in
a format that’s much easier to digest than a table or chart could ever be.
The public version of Tableau is free to use for anyone looking for a powerful way to
create data visualizations that can be used in a variety of settings. From journalists to
political junkies to those who just want to quantify the data of their own lives, there
are tons of potential uses for Tableau Public. They have an extensive gallery of
infographics and visualizations that have been created with the public version to serve
as inspiration for those who are interested in creating their own.
Advantages

 Hundreds of data import options

 Mapping capability

 Free public version available

 Lots of video tutorials to walk you through how to use Tableau

Disadvantages
 Non-free versions are expensive ($70/month/user for the Tableau Creator
software)

 Public version doesn’t allow you to keep data analyses private

24. How does data science drive growth in e-commerce?

Data science in e-commerce serves the same purpose that it does in any other
discipline — to derive valuable insights from raw data. In e-commerce, you’re looking
for data insights that you can use to optimize a brand’s marketing return on investment
(ROI) and to drive growth in every layer of the sales funnel. How you end up doing
that is up to you, but the work of most data scientists in e-commerce involves the
following:

» Data analysis: Simple statistical and mathematical inference. Segmentation analysis
gets rather complicated when trying to make sense of e-commerce data. You also use
a lot of trend analysis, outlier analysis, and regression analysis.

» Data wrangling: Data wrangling involves using processes and procedures to clean
and convert data from one format and structure to another so that the data is accurate
and in the format that analytics tools and scripts require for consumption. In growth
work, source data is usually captured and generated by analytics applications. Most of
the time, you can derive insight within the application, but sometimes you need to
export the data so that you can create data mashups, perform custom analyses, and
create custom visualizations that aren’t available in your out-of-the-box solutions.
These situations could demand that you use a fair bit of data wrangling to get what
you need from the source datasets (a short illustrative sketch follows this list).

» Data visualization design: Data graphics in e-commerce are usually quite simple.
Expect to use a lot of line charts, bar charts, scatter charts, and map-based data
visualizations. Data visualizations should be simple and to the point, but the analyses
required to derive meaningful insights may take some time.

» Communication: After you make sense of the data, you have to communicate its
meaning in clear, direct, and concise ways that decision makers can easily understand.
E-commerce data scientists need to be excellent at communicating data insights via
data visualizations, a written narrative, and conversation.

» Custom development work: In some cases, you may need to design custom scripts
for automated custom data analysis and visualization. In other cases, you may have to
go so far as to design a personalization and recommendation system, but because you
can find a ton of prebuilt applications available for these purposes, the typical e-
commerce data scientist position description doesn’t include this requirement.
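As a small, hedged sketch of what the data wrangling mentioned above can look like
(assuming Python with pandas, and an invented export file and column names), the steps
below clean an analytics export and reshape it for further analysis:

import pandas as pd

# Hypothetical export from an analytics application
df = pd.read_csv("analytics_export.csv")

# Typical wrangling steps: rename columns, fix types, drop incomplete rows
df = df.rename(columns={"Session Date": "session_date", "Revenue ($)": "revenue"})
df["session_date"] = pd.to_datetime(df["session_date"])
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
df = df.dropna(subset=["revenue"])

# Reshape into the structure an analysis or visualization script expects
daily_revenue = df.groupby(df["session_date"].dt.date)["revenue"].sum().reset_index()
print(daily_revenue.head())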

25. How does segmenting enable faster and easier e-commerce growth?

Data scientists working in growth hacking should be familiar with, and know how to
derive insight from, the following user segmentation and targeting applications:

» Google Analytics Segment Builder: Google Analytics contains a Segment Builder
feature that makes it easier for you to set up filters when you configure your segments
within the application. You can use the tool to segment users by demographic data,
such as age, gender, referral source, and nationality.

» Adobe Analytics: Use Adobe Analytics for advanced user segmentation and customer
churn analysis — or analysis to identify reasons for and preempt customer loss.
Customer churn describes the loss, or churn, of existing customers. Customer churn
analysis is a set of analytical techniques that are designed to identify, monitor, and
issue alerts on indicators that signify when customers are likely to churn. With the
information that’s generated in customer churn analysis, businesses can take
preemptive measures to retain at-risk customers.
» Webtrends (http://webtrends.com): Webtrends’ Visitor Segmentation and Scoring
offers real-time customer segmentation features that help you isolate, target, and
engage your highest-value visitors. The Conversion Optimization solution also offers
advanced segmenting and targeting functionality that you can use to optimize your
website, landing pages, and overall customer experience.

» Optimizely (www.optimizely.com): In addition to its testing functionality, you can
use Optimizely for visitor segmentation, targeting, and geotargeting.

» IBM Product Recommendations
(www-01.ibm.com/software/marketingsolutions/products-recommendation-solution):
This solution utilizes IBM Digital Analytics, customer-segmentation, and product-
segmentation methods to make optimal product recommendations to visitors of e-
commerce websites. IBM Product Recommendations Solutions can help you upsell or
cross-sell your offerings.

26) Discuss outliers.
Outliers are an important part of a dataset. They can hold useful information about
your data.

Outliers can give helpful insights into the data you're studying, and they can have an
effect on statistical results. This can potentially help you discover inconsistencies and
detect any errors in your statistical processes.

So, knowing how to find outliers in a dataset will help you better understand your
data.

There are a few different ways to find outliers in statistics.

This article will explain how to detect numeric outliers by calculating the interquartile
range.

I give an example of a very simple dataset and how to calculate the interquartile
range, so you can follow along if you want.

Let's get started!

What is an Outlier in Statistics? A Definition


In simple terms, an outlier is an extremely high or extremely low data point relative to
the nearest data point and the rest of the neighboring co-existing values in a data
graph or dataset you're working with.

Outliers are extreme values that stand out greatly from the overall pattern of values
in a dataset or graph.

For example, in a graph of monthly values where the value for the month of January
is significantly lower than the values for all the other months, the January data point
is an outlier.

An outlier has to satisfy either of the following two conditions:

outlier < Q1 - 1.5(IQR)


outlier > Q3 + 1.5(IQR)
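Here the interquartile range is IQR = Q3 - Q1, where Q1 and Q3 are the first and third
quartiles. A minimal sketch of this rule, assuming Python with NumPy and a small invented
dataset:

import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 45])  # 45 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)   # values outside the fences are flagged as outliers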

27) Briefly discuss time series analysis.

Time series analysis is indispensable in data science, statistics, and analytics.

At its core, time series analysis focuses on studying and interpreting a sequence of
data points recorded or collected at consistent time intervals. Unlike cross-sectional
data, which captures a snapshot in time, time series data is fundamentally dynamic,
evolving over chronological sequences both short and extremely long. This type of
analysis is pivotal in uncovering underlying structures within the data, such as
trends, cycles, and seasonal variations.

Technically, time series analysis seeks to model the inherent structures within the
data, accounting for phenomena like autocorrelation, seasonal patterns, and trends.
The order of data points is crucial; rearranging them can obscure meaningful insights
or distort interpretations. Furthermore, time series analysis often requires a
substantial dataset to maintain the statistical significance of the findings. This
enables analysts to filter out 'noise,' ensuring that observed patterns are not mere
outliers but statistically significant trends or cycles.

To delve deeper into the subject, you must distinguish between time-series data,
time-series forecasting, and time-series analysis. Time-series data refers to the raw
sequence of observations indexed in time order. On the other hand, time-series
forecasting uses historical data to make future projections, often employing statistical
models like ARIMA (AutoRegressive Integrated Moving Average). Time series
analysis, the overarching practice, systematically studies this data to identify and
model its internal structures, including seasonality, trends, and cycles. What sets
time series apart is its time-dependent nature, the requirement for a sufficiently large
sample size for accurate analysis, and its unique capacity to highlight cause-and-effect
relationships that evolve over time.
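As an illustrative sketch only (assuming Python with pandas and statsmodels, an invented
monthly series, and an arbitrary (p, d, q) order), a simple ARIMA forecast might look like
this:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly series: gentle upward trend plus noise
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
values = 100 + np.arange(48) * 1.5 + np.random.normal(scale=3, size=48)
series = pd.Series(values, index=idx)

model = ARIMA(series, order=(1, 1, 1))   # order chosen only for illustration
fitted = model.fit()
print(fitted.forecast(steps=12))         # project the next 12 months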

Why Do Organizations Use Time Series Analysis?

Time series analysis has become a crucial tool for companies looking to make better
decisions based on data. By studying patterns over time, organizations can
understand past performance and predict future outcomes in a relevant and
actionable way. Time series helps turn raw data into insights companies can use to
improve performance and track historical outcomes.

For example, retailers might look at seasonal sales patterns to adapt their inventory
and marketing. Energy companies could use consumption trends to optimize their
production schedule. The applications even extend to detecting anomalies—like a
sudden drop in website traffic—that reveal deeper issues or opportunities. Financial
firms use it to respond to stock market shifts instantly. And health care systems need
it to assess patient risk in the moment.

Rather than a series of stats, time series helps tell a story about evolving business
conditions over time. It's a dynamic perspective that allows companies to plan
proactively, detect issues early, and capitalize on emerging opportunities.

Components of Time Series Data


Time series data is generally comprised of different components that characterize
the patterns and behavior of the data over time. By analyzing these components, we
can better understand the dynamics of the time series and create more accurate
models. Four main elements make up a time series dataset:

 Trends
 Seasonality
 Cycles
 Noise

Trends show the general direction of the data, and whether it is increasing,
decreasing, or remaining stationary over an extended period of time. Trends indicate
the long-term movement in the data and can reveal overall growth or decline. For
example, e-commerce sales may show an upward trend over the last five years.

Seasonality refers to predictable patterns that recur regularly, like yearly retail spikes
during the holiday season. Seasonal components exhibit fluctuations fixed in timing,
direction, and magnitude. For instance, electricity usage may surge every summer
as people turn on their air conditioners.

Cycles demonstrate fluctuations that do not have a fixed period, such as economic
expansions and recessions. These longer-term patterns last longer than a year and
do not have consistent amplitudes or durations. Business cycles that oscillate
between growth and decline are an example.

Finally, noise encompasses the residual variability in the data that the other
components cannot explain. Noise includes unpredictable, erratic deviations after
accounting for trends, seasonality, and cycles.

In summary, the key components of time series data are:

 Trends: Long-term increases, decreases, or stationary movement
 Seasonality: Predictable patterns at fixed intervals
 Cycles: Fluctuations without a consistent period
 Noise: Residual unexplained variability
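To see these components separated out, one hedged sketch (assuming Python with
statsmodels and a made-up monthly series) is a classical seasonal decomposition:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2019-01-01", periods=60, freq="MS")
trend = np.linspace(100, 160, 60)                        # long-term growth
seasonal = 10 * np.sin(2 * np.pi * np.arange(60) / 12)   # yearly pattern
noise = np.random.normal(scale=2, size=60)               # residual variability
series = pd.Series(trend + seasonal + noise, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())     # estimated trend component
print(result.seasonal.head())           # estimated seasonal component
print(result.resid.dropna().head())     # what is left over (noise)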

28) Explain the overfitting problem.
In the real world, the datasets we work with will never be clean and perfect. Each
dataset contains impurities: noisy data, outliers, missing data, or imbalanced data.
Due to these impurities, different problems occur that affect the accuracy and the
performance of the model. One such problem is overfitting, a problem that a model
can exhibit in machine learning.

A statistical model is said to be overfitted if it can’t generalize well with unseen data.

Before understanding overfitting, we need to know some basic terms, which are:

o Noise: meaningless or irrelevant data present in the dataset. It affects the
performance of the model if it is not removed.
o Bias: a prediction error introduced in the model by oversimplifying the machine
learning algorithm; in other words, the difference between the predicted values
and the actual values.
o Variance: occurs when the model performs well with the training dataset but does
not perform well with the test dataset.
o Generalization: how well a trained model predicts unseen data.

Overfitting and underfitting are the two main errors/problems in a machine learning
model, and both cause poor performance.
o Overfitting occurs when the model fits more data than required, and
it tries to capture each and every datapoint fed to it. Hence it starts
capturing noise and inaccurate data from the dataset, which
degrades the performance of the model.
o An overfitted model doesn't perform accurately with the test/unseen
dataset and can’t generalize well.
o An overfitted model is said to have low bias and high variance.
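A hedged illustration (assuming Python with scikit-learn; the synthetic dataset and the tree
depths are arbitrary) of how an over-complex model scores well on training data but worse
on unseen data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The unconstrained tree memorizes the training set but generalizes worse (high variance)
print("deep tree    train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("shallow tree train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))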

29) Write the difference between supervised and unsupervised techniques.

The main differences between supervised and unsupervised learning are summarized below:

 Input data: Supervised learning uses known, labeled data as input; unsupervised
learning uses unknown, unlabeled data as input.
 Computational complexity: Supervised learning is computationally simpler;
unsupervised learning is computationally more complex.
 Analysis: Supervised learning typically uses off-line analysis; unsupervised learning
supports real-time analysis of data.
 Number of classes: In supervised learning the number of classes is known; in
unsupervised learning it is not known.
 Accuracy of results: Supervised learning gives accurate and reliable results;
unsupervised learning gives moderately accurate and reliable results.
 Output data: In supervised learning the desired output is given; in unsupervised
learning it is not given.
 Model complexity: In supervised learning it is not possible to learn larger and more
complex models than in unsupervised learning.
 Training data: In supervised learning, labeled training data is used to infer the model;
in unsupervised learning, labeled training data is not used.
 Another name: Supervised learning is also called classification; unsupervised
learning is also called clustering.
 Test of model: We can test a supervised model against known labels; we cannot test
an unsupervised model in the same way.
 Example: Optical character recognition (supervised); finding a face in an image
(unsupervised).
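As a small sketch of the distinction (assuming Python with scikit-learn and its bundled iris
data), the same features can be fed to a supervised classifier, which needs labels, and to an
unsupervised clustering algorithm, which does not:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are required to fit the classifier
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("classifier accuracy on the labeled data:", clf.score(X, y))

# Unsupervised: only X is used; the algorithm discovers groupings on its own
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments for the first five samples:", km.labels_[:5])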

30) Write a note on tree pruning.

Pruning is a data compression technique in machine learning and search
algorithms that reduces the size of decision trees by removing sections of the tree
that are non-critical and redundant to classify instances. Pruning reduces the
complexity of the final classifier, and hence improves predictive accuracy by the
reduction of overfitting.
One of the questions that arises in a decision tree algorithm is the optimal size of the
final tree. A tree that is too large risks overfitting the training data and poorly
generalizing to new samples. A small tree might not capture important structural
information about the sample space. However, it is hard to tell when a tree algorithm
should stop because it is impossible to tell if the addition of a single extra node will
dramatically decrease error. This problem is known as the horizon effect. A common
strategy is to grow the tree until each node contains a small number of instances
then use pruning to remove nodes that do not provide additional information.[1]

Pruning should reduce the size of a learning tree without reducing predictive
accuracy as measured by a cross-validation set. There are many techniques for tree
pruning that differ in the measurement that is used to optimize performance.
Techniques
Pruning processes can be divided into two types (pre- and post-pruning).
Pre-pruning procedures prevent a complete induction of the training set by using a
stop() criterion in the induction algorithm (e.g., maximum tree depth, or information
gain(Attr) > minGain). Pre-pruning methods are considered more efficient because
they do not induce an entire tree; rather, trees remain small from the start. Pre-pruning
methods share a common problem, the horizon effect: the undesired premature
termination of the induction by the stop() criterion.
Post-pruning (or just pruning) is the most common way of simplifying trees. Here,
nodes and subtrees are replaced with leaves to reduce complexity. Pruning can not
only significantly reduce the size but also improve the classification accuracy of
unseen objects. It may be the case that the accuracy of the assignment on the train
set deteriorates, but the accuracy of the classification properties of the tree increases
overall.
The procedures are differentiated on the basis of their approach in the tree (top-
down or bottom-up).
Bottom-up pruning
These procedures start at the last node in the tree (the lowest point). Following
recursively upwards, they determine the relevance of each individual node. If the
relevance for the classification is not given, the node is dropped or replaced by a
leaf. The advantage is that no relevant sub-trees can be lost with this method. These
methods include Reduced Error Pruning (REP), Minimum Cost Complexity Pruning
(MCCP), or Minimum Error Pruning (MEP).
Top-down pruning
In contrast to the bottom-up method, this method starts at the root of the tree.
Following the structure below, a relevance check is carried out which decides
whether a node is relevant for the classification of all n items or not. By pruning the
tree at an inner node, it can happen that an entire sub-tree (regardless of its
relevance) is dropped. One of these representatives is pessimistic error pruning
(PEP), which brings quite good results with unseen items.
Pruning algorithms
Reduced error pruning
One of the simplest forms of pruning is reduced error pruning. Starting at the leaves,
each node is replaced with its most popular class. If the prediction accuracy is not
affected then the change is kept. While somewhat naive, reduced error pruning has
the advantage of simplicity and speed.
Cost complexity pruning

Cost complexity pruning generates a series of trees T0, T1, ..., Tm, where T0 is the
initial tree and Tm is the root alone. At step i, the tree Ti is created by removing a
subtree t from tree T(i-1) and replacing it with a leaf node whose value is chosen as in
the tree-building algorithm. The subtree that is removed is chosen as follows:

1. Define the error rate of tree T over data set S as err(T, S).

2. The subtree t that minimizes
(err(prune(T(i-1), t), S) - err(T(i-1), S)) / (|leaves(T(i-1))| - |leaves(prune(T(i-1), t))|)
is chosen for removal.
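scikit-learn ships a form of cost complexity pruning; the following is a hedged sketch (the
dataset is one of scikit-learn's bundled examples and the choice of alpha is arbitrary), not a
full implementation of the procedure above:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the sequence of effective alphas along the pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # pick a middle alpha for illustration

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
print("full tree nodes:  ", full.tree_.node_count, " test accuracy:", full.score(X_test, y_test))
print("pruned tree nodes:", pruned.tree_.node_count, " test accuracy:", pruned.score(X_test, y_test))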

