DAV Question Bank + Answers
1. Probability distributions
Probability is one of the most fundamental concepts in statistics. Imagine that you’ve
just rolled into Las Vegas and settled into your favorite roulette table over at the
Bellagio. When the roulette wheel spins off, you intuitively understand that there is an
equal chance that the ball will fall into any of the slots of the cylinder on the wheel.
The slot where the ball will land is totally random, and the probability, or likelihood,
of the ball landing in any one slot over another is the same. Because the ball can land
in any slot, with equal probability, there is an equal probability distribution, or a
uniform probability distribution — the ball has an equal probability of landing in any
of the slots in the cylinder. But the slots of the roulette wheel are not all the same —
the wheel has 18 black slots and 20 slots that are either red or green. Because of this
arrangement, there is an 18/38 probability that your ball will land on a black slot. You
plan to make successive bets that the ball will land on black.
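As a quick check on that 18/38 figure, here is a minimal Python sketch (assuming the American wheel described above: 18 black, 18 red, and 2 green slots) that estimates the probability of black by simulation:

import random

# American roulette wheel: 18 black, 18 red, 2 green slots (38 in total).
slots = ["black"] * 18 + ["red"] * 18 + ["green"] * 2

spins = 100_000
hits = sum(1 for _ in range(spins) if random.choice(slots) == "black")
print(hits / spins)   # close to 18/38, i.e. about 0.474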
2. Statistic
A statistic is a result that’s derived from performing a mathematical operation on
numerical data. In general, you use statistics in decision making. Statistics come in
two flavors:
» Descriptive: Descriptive statistics provide a description that illuminates some
characteristic of a numerical dataset, including dataset distribution, central tendency
(such as the mean, median, or mode), and dispersion (as in standard deviation and variance).
» Inferential: Rather than focus on pertinent descriptions of a dataset, inferential
statistics carve out a smaller section of the dataset and attempt to deduce significant
information about the larger dataset. Use this type of statistics to get information about
a real-world measure in which you’re interested.
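As an illustration of the descriptive flavor, here is a minimal Python sketch on a small made-up sample (the numbers are invented for the example):

import statistics as st

data = [4, 8, 15, 16, 23, 42]               # hypothetical numerical dataset

print(st.mean(data), st.median(data))       # central tendency
print(st.pstdev(data), st.pvariance(data))  # dispersion
print(min(data), max(data))                 # extremes of the distribution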
3. Weighted average
A weighted average is an average value of a measure over a very large number of data
points. If you take a weighted average of your winnings (your random variable) across
the probability distribution, this would yield an expectation value — an expected
value for your net winnings over a successive number of bets. (An expectation can
also be thought of as the best guess, if you had to guess.) To describe it more formally,
an expectation is a weighted average of some measure associated with a random
variable.
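As a hedged worked example (assuming an even-money $1 bet on black on the American wheel described earlier, which pays +$1 with probability 18/38 and loses $1 with probability 20/38):
E[winnings per bet] = (18/38)(+1) + (20/38)(-1) = -2/38 ≈ -$0.053,
so over a long run of successive bets you should expect to lose about 5.3 cents per dollar wagered.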
Linear regression
Linear regression is a machine learning method used to describe and quantify the
relationship between your target variable, y — the predictant, in statistics lingo —
and the dataset features you’ve chosen to use as predictor variables (commonly
designated as dataset X in machine learning). When you use just one variable as your
predictor, linear regression is as simple as the middle school algebra formula y=mx+b.
But you can also use linear regression to quantify correlations between several
variables in a dataset; this is called multiple linear regression. Before getting too excited
about using linear regression, though, make sure you've considered its limitations:
» Linear regression only works with numerical variables, not categorical ones.
» If your dataset has missing values, they will cause problems. Be sure to address
missing values before attempting to build a linear regression model.
» If your data has outliers, your model will produce inaccurate results. Check for
outliers before proceeding.
» Linear regression assumes that there is a linear relationship between the dataset
features and the target variable. Test to make sure this is the case; if it's not, try
using a log transformation to compensate.
» The linear regression model assumes that all features are independent of each other.
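A minimal scikit-learn sketch of fitting a linear regression (the numbers below are invented purely for illustration; with one predictor this is the y = mx + b case, with two it is multiple linear regression):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical numeric features X and target y.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)     # fits y ≈ m1*x1 + m2*x2 + b
print(model.coef_, model.intercept_)     # the slopes (m) and the intercept (b)
print(model.predict([[6.0, 6.0]]))       # estimated y for a new observation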
Logistic regression
Logistic regression is a machine learning method you can use to estimate values for a
categorical target variable based on your selected features. Your target variable should
be numeric, and contain values that describe the target’s class — or category. One
cool thing about logistic regression is that, in addition to predicting the class of
observations in your target variable, it indicates the probability for each of its
estimates. Though logistic regression is like linear regression, its requirements are
simpler, in that:
» There does not need to be a linear relationship between the features and the target variable.
» Predictive features are not required to have a normal distribution.
When deciding whether logistic regression is a good choice for you, make sure to
consider its limitations as well.
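A minimal scikit-learn sketch (with a toy dataset invented for illustration) showing that logistic regression returns both a predicted class and a probability for each class:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])   # one predictive feature
y = np.array([0, 0, 0, 1, 1, 1])               # categorical target (class labels)

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.5]]))         # estimated class for a new observation
print(clf.predict_proba([[3.5]]))   # probability attached to each class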
Ordinary least squares (OLS) is a statistical method that fits a linear regression line to
a dataset. With OLS, you do this by squaring the vertical distance values that describe
the distances between the data points and the best-fit line, adding up those squared
distances, and then adjusting the placement of the best-fit line so that the summed
squared distance value is minimized. Use OLS if you want to construct a function
that’s a close approximation to your data. As always, don’t expect the actual value to
be identical to the value predicted by the regression. Values predicted by the
regression are simply estimates that are most similar to the actual values in the model.
OLS is particularly useful for fitting a regression line to models containing more than
one independent variable. In this way, you can use OLS to estimate the target from
dataset features. When using OLS regression methods to fit a regression line that has
more than one independent variable, two or more of the IVs may be interrelated.
When two or more IVs are strongly correlated with each other, this is called
multicollinearity. Multicollinearity tends to adversely affect the reliability of the IVs
as predictors when they’re examined apart from one another. Luckily, however,
multicollinearity doesn’t decrease the overall predictive reliability of the model when
it’s considered collectively.
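Under the hood, the OLS criterion is just a least-squares minimization; a minimal NumPy sketch with two independent variables (the data here is randomly generated only to illustrate the fit):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                   # two independent variables
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=100)

# Add a column of ones for the intercept, then minimize ||y - Xb||^2,
# which is exactly the summed squared vertical distance described above.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta)    # approximately [1.0, 3.0, -2.0]: intercept and the two slopes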
Principal Component Analysis (PCA)
Benefits of dimensionality reduction with PCA:
It helps in data compression, and hence reduces storage space.
It reduces computation time.
It also helps remove redundant features.
Working of PCA (illustrated on a 2D representation of data):
1. First, normalize the data so that the average value shifts to the origin and all
the data lie in a unit square.
2. Now try to fit a line to the data. Start with a random line and rotate it until it fits
the data as well as possible. Ultimately we end up with a fit (a high degree of fit) that
explains the maximum variance of the data.
To quantify how well the line fits the data, PCA projects the data onto it. Then it can either:
i) measure the distances from the data points to the line and try to find a line that
minimizes those distances,
ii) or it can try to find the line that maximizes the distances from the projected points to
the origin.
Mathematical Intuition:
To understand the math behind this technique, let's go back to a single data point.
After projecting the data point onto the line we get a right-angled triangle. From
Pythagoras' theorem we have A² = B² + C². For a fixed A, B and C are inversely
related: if B gets larger then C must get smaller, and vice versa. Thus PCA can either
minimize the distances to the line or maximize the distances from the projected points
to the origin.
It is easier to find the maximum distance from the projected point to the origin.
Hence PCA finds the best fitting line that maximizes the sum of the squared distances
from the projected points to the origin.
Cost function of PCA: the distances are squared so that negative values won't cancel
the positive values.
Now we have the best-fitting line y = mx + c. This is called PC1 (Principal Component 1).
Let's assume its slope corresponds to the proportions 4:1, meaning we go 4 units along
the X-axis for every 1 unit along the Y-axis, which tells us that the data is mostly spread
along the X-axis.
From Pythagoras' theorem, a² = b² + c² = 4² + 1² = 17, so a = sqrt(17) ≈ 4.12. But the
data is scaled, hence we divide each side by 4.12 in order to get a unit vector, i.e.,
F1 = 4 / 4.12 = 0.97 and
F2 = 1 / 4.12 = 0.242
The unit vector that we just calculated is called the eigenvector of PC1, and
the proportions of the features (0.97 : 0.242) are called loading scores.
SS(distances for PC1) = eigenvalue for PC1.
sqrt(eigenvalue for PC1) = singular value for PC1.
Now we do the same thing for other features to get principal components. For
projecting the data now we will rotate the axis so that PC1 gets parallel (horizontal) to
the X-axis.
Rotating the axis so that PC1 gets horizontal
For visualization, let's plot the data based on the projected points on both principal
components.
Suppose we got variation for PC1 = 0.83 and for PC2 = 0.17.
Now, if we want to convert the data from 2D to 1D, we keep only PC1 as the final 1D
representation, since it covers 83% of the total variation.
This is how PCA works: based on the variance explained by each principal component,
it determines which components (and hence which dimensions) can be dropped for
dimensionality reduction.
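A minimal scikit-learn sketch of the same idea: a small invented 2D dataset is used, the principal components (eigenvectors) and their explained-variance shares are inspected, and the data is reduced to 1D by keeping only PC1. (The variance split will not exactly match the 0.83/0.17 figures above, which belong to the worked illustration.)

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

pca = PCA(n_components=2).fit(X)          # PCA centers the data internally
print(pca.components_)                    # unit eigenvectors (loading scores per feature)
print(pca.explained_variance_ratio_)      # share of variance explained by PC1 and PC2

X_1d = PCA(n_components=1).fit_transform(X)   # keep only PC1: 2D -> 1D
print(X_1d[:3])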
Time series patterns
Four examples of time series showing different patterns:
1. The monthly housing sales (top left) show strong seasonality within each year,
as well as some strong cyclic behaviour with a period of about 6–10 years. There is no
apparent trend in the data over this period.
2. The US treasury bill contracts (top right) show results from the Chicago market
for 100 consecutive trading days in 1981. Here there is no seasonality, but an obvious
downward trend. Possibly, if we had a much longer series, we would see that this
downward trend is actually part of a long cycle, but when viewed over only 100 days
it appears to be a trend.
3. The Australian quarterly electricity production (bottom left) shows a strong
increasing trend, with strong seasonality. There is no evidence of any cyclic behaviour
here.
4. The daily change in the Google closing stock price (bottom right) has no trend,
seasonality or cyclic behaviour. There are random fluctuations which do not appear to
be very predictable, and no strong patterns that would help with developing a
forecasting model.
K-Means Clustering
K-means clustering allows us to cluster the data into different groups and is a convenient
way to discover the categories of groups in an unlabeled dataset on its own, without the
need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data points
and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters,
and repeats the process until it finds the best clusters. The value of k should be
predetermined in this algorithm.
Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest
centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them
into different clusters. It means here we will try to group these datasets into two
different clusters.
o We need to choose some random k points or centroids to form the clusters. These
points can be either points from the dataset or any other points. So, here we are
selecting the two points shown below as k points, which are not part of our dataset.
Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying some mathematics that we have studied to
calculate the distance between two points. So, we will draw a median between both
the centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are nearer to the
K1 or blue centroid, and points to the right of the line are closer to the yellow centroid.
Let's color them blue and yellow for clear visualization.
o Next, we will reassign each datapoint to the new centroid. For this, we will
repeat the same process of finding a median line. The median will be like below
image:
From the above image, we can see that one yellow point is on the left side of the line and
two blue points are to the right of the line. So, these three points will be assigned to new
centroids.
As reassignment has taken place, we will again go to step-4, which is finding
new centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the
new centroids will be as shown in the below image:
o As we got the new centroids so again will draw the median line and reassign
the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either
side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two
final clusters will be as shown in the below image:
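A minimal scikit-learn sketch of the same procedure with K=2 (the M1/M2 values below are invented; they are not the points shown in the images):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical two-variable dataset (columns M1 and M2).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 9.0], [7.8, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assigned to each data point
print(kmeans.cluster_centers_)  # final centroids once reassignment stops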
Hierarchical Clustering
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but they differ in how they work, and in hierarchical clustering there is no
requirement to predetermine the number of clusters as there is in the K-means
algorithm. The hierarchical clustering technique has two approaches:
1. Agglomerative: The agglomerative algorithm is a bottom-up approach in which the
algorithm starts by treating each data point as a single cluster and merges clusters
until only one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it
is a top-down approach.
The working of the agglomerative hierarchical clustering (AHC) algorithm can be explained using the below steps:
o Step-1: Create each data point as a single cluster. Let's say there are N data
points, so the number of clusters will also be N.
o Step-2: Take two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following
clusters. Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
1. Single Linkage: It is the shortest distance between the closest points of the
clusters. Consider the below image:
2. Complete Linkage: It is the farthest distance between the two points of two
different clusters. It is one of the popular linkage methods as it forms tighter clusters
than single-linkage.
3. Average Linkage: It is the linkage method in which the distance between each
pair of datasets is added up and then divided by the total number of datasets to
calculate the average distance between two clusters. It is also one of the most popular
linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the
centroid of the clusters is calculated. Consider the below image:
From the above-given approaches, we can apply any of them according to the type of
problem or business requirement.
The dendrogram is a tree-like structure that is mainly used to store each step as a
memory that the HC algorithm performs. In the dendrogram plot, the Y-axis shows
the Euclidean distances between the data points, and the x-axis shows all the data
points of the given dataset.
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part is showing how clusters are created in
agglomerative clustering, and the right part is showing the corresponding dendrogram.
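A minimal SciPy sketch of agglomerative clustering and its dendrogram (the points are invented for illustration; swapping the method argument switches between the linkage types listed above):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.2], [5.1, 4.9], [9.0, 9.1]])

Z = linkage(X, method="single")    # try "complete", "average", or "centroid" as well
print(fcluster(Z, t=2, criterion="maxclust"))   # cut the tree into 2 clusters

dendrogram(Z)    # y-axis: merge (Euclidean) distances, x-axis: the data points
plt.show()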
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.
A greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not.
But together, all the trees predict the correct output. Therefore, below are two
assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that
the classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy; even for a large dataset it runs
efficiently.
o It can also maintain accuracy when a large proportion of data is missing.
Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions for each tree created in the first phase.
The working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2.
Step-5: For new data points, find the predictions of each decision tree, and assign the
new data points to the category that wins the majority of votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into subsets and
given to each decision tree. During the training phase, each decision tree produces a
prediction result, and when a new data point occurs, then based on the majority of
results, the Random Forest classifier predicts the final decision. Consider the below
image:
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of
loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
o Although random forest can be used for both classification and regression
tasks, it is not as suitable for regression tasks.
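A minimal scikit-learn sketch (the built-in Iris data stands in for the fruit-image example, purely for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number N of decision trees; each tree is trained on a bootstrap subset.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the majority vote over all trees
print(forest.predict(X_test[:3]))     # predicted categories for new data points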
12. Difference between Classification and Clustering.
K-Nearest Neighbors (K-NN)
o K-NN is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and, at the time of classification,
performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets
new data, it classifies that data into a category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a
dog, but we want to know whether it is a cat or a dog. For this identification, we can
use the KNN algorithm, as it works on a similarity measure. Our KNN model will
find the features of the new image that are similar to the cat and dog images and,
based on the most similar features, it will put it in either the cat or the dog category.
o Suppose there are two categories, Category A and Category B, and we have a new
data point x1; into which of these categories will this data point fall? To solve this
type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily
identify the category or class of a particular data point. Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already studied
in geometry. It can be calculated as:
o As we can see, 3 of the nearest neighbors are from Category A; hence this new data
point must belong to Category A.
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
o There is no particular way to determine the best value for "K", so we need to
try some values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the
effects of outliers in the model.
o Large values for K are good, but they may cause some difficulties.
Advantages of the KNN algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of the KNN algorithm:
o It always needs to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the
data points for all the training samples.
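A minimal scikit-learn sketch with k=5 and Euclidean distance (the Category A/Category B points are invented for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 2], [2, 3], [3, 1], [1, 1], [2, 1],
              [6, 5], [7, 7], [8, 6], [7, 8], [6, 7]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # 0 = Category A, 1 = Category B

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)

x1 = np.array([[3, 3]])     # the new data point to classify
print(knn.predict(x1))      # majority class among its 5 nearest neighbours
print(knn.kneighbors(x1))   # Euclidean distances and indices of those neighbours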
o Nearest neighbor methods are used extensively to understand and create value
from patterns in retail business data. In the following sections, I present two powerful
cases where kNN and average-NN algorithms are being used to simplify management
and security in daily retail operations.
Seeing k-nearest neighbor algorithms in action: K-nearest neighbor techniques for
pattern recognition are often used for theft prevention in the modern retail business.
You're probably accustomed to seeing CCTV cameras around almost every store you
visit, but most people have no idea how the data gathered from these devices is being
used. You might imagine that someone sits in a back room, monitoring these cameras
for suspicious activity, and perhaps that is how things were done in the past. But a
modern surveillance system is intelligent enough to analyze and interpret video data
on its own, without the need for human assistance.
o The modern systems can now use k-nearest neighbor for visual pattern
recognition to scan and detect hidden packages in the bottom bin of a shopping cart at
checkout. If an object is detected that is an exact match for an object listed in the
database, the price of the spotted product could even automatically be added to the
customer’s bill. Though this automated billing practice is not used extensively now,
the technology has been developed and is available for use. K-nearest neighbor is also
used in retail to detect patterns in credit card usage. Many new transaction-scrutinizing
software applications use kNN algorithms to analyze register data and
spot unusual patterns that indicate suspicious activity. For example, if register data
indicates that a lot of customer information is being entered manually rather than
through automated scanning and swiping, this could indicate that the employee who’s
using that register is in fact stealing a customer’s personal information. Or, if register
data indicates that a particular good is being returned or exchanged multiple times,
this could indicate that employees are misusing the return policy or trying to make
money from making fake returns.
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of one feature is
independent of the occurrence of the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law. It is used to determine
the probability of a hypothesis with prior knowledge, and it depends on conditional
probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the prior probability of hypothesis A, and P(B) is the marginal probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of the weather conditions:
Weather No Yes
Overcast 0 5 5/14 = 0.35
Rainy 2 2 4/14 = 0.29
Sunny 2 3 5/14 = 0.35
All 4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), on a sunny day the player can play the game.
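A small Python sketch that reproduces this calculation directly from the frequency table (plain arithmetic, no library needed):

# Counts from the tables above: 14 days in total, 10 "Yes", 4 "No", 5 of them Sunny.
p_sunny = 5 / 14            # P(Sunny)
p_yes, p_no = 10 / 14, 4 / 14
p_sunny_given_yes = 3 / 10  # 3 of the 10 "Yes" days were Sunny
p_sunny_given_no = 2 / 4    # 2 of the 4 "No" days were Sunny

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny
print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))
# Exact fractions give 0.6 and 0.4; the 0.60 and 0.41 in the text come from rounding
# the intermediate probabilities first. Either way P(Yes|Sunny) > P(No|Sunny), so play.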
Advantages of the Naïve Bayes classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions as compared to the other algorithms.
o It is the most popular choice for text classification problems.
Disadvantage:
o Naïve Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
There are three types of Naïve Bayes model: Gaussian (assumes the features follow a
normal distribution), Multinomial (used when the data is multinomially distributed,
e.g., word counts in document classification), and Bernoulli (used when the predictor
variables are Boolean).
17. What is data visualization and why is it important?
3. Designing data art for activists: You could be designing for an audience of
idealists, dreamers, and changemakers. When designing for this audience, you want
your data visualization to make a point! You can assume that typical audience
members aren’t overly analytical. What they lack in math skills, however, they more
than compensate for in solid convictions. These people look to your data visualization
as a vehicle by which to make a statement. When designing for this audience, data art
is the way to go. The main goal in using data art is to entertain, to provoke, to annoy,
or to do whatever it takes to make a loud, clear, attention-demanding statement. Data
art has little to no narrative and offers no room for viewers to form their own
interpretations. Data scientists have an ethical responsibility to always represent data
accurately. A data scientist should never distort the message of the data to fit what the
audience wants to hear — not even for data art! Nontechnical audiences don’t even
recognize, let alone see, the possible issues. They rely on the data scientist to provide
honest and accurate representations, thus amplifying the level of ethical responsibility
that the data scientist must assume.
19. How to design data visualizations to meet the needs of your target audience
(functional data visualization).
1. Brainstorm about your target audience.
2. Define the purpose of your visualization. Narrow the purpose of the visualization
by deciding exactly what action or outcome you want audience members to take as a
result of the visualization.
3. Choose a functional design. Review the three main data visualization types
(discussed earlier in this chapter) and decide which type can best help you achieve
your intended outcome.
The following sections spell out this process in detail.
Step 1: Brainstorm (about Brenda) To brainstorm properly, pull out a sheet of paper
and think about your imaginary audience member (Brenda) so that you can create a
more functional and effective data visualization. Answer the following questions to
help you better understand her, and thus better understand and design for your target
audience. Form a picture of what Brenda’s average day looks like — what she does
when she gets out of bed in the morning, what she does over her lunch hour, and what
her workplace is like. Also consider how Brenda will use your visualization. To form
a comprehensive view of who Brenda is and how you can best meet her needs, ask
these questions:
» Where does Brenda work? What does she do for a living?
» What kind of technical education or experience, if any, does she have?
» How old is Brenda? Is she married? Does she have children? What does she look like? Where does she live?
» What social, political, cause-based, or professional issues are important to Brenda? What does she think of herself?
» What problems and issues does Brenda have to deal with every day?
» How does your data visualization help solve Brenda's work problems or her family problems? How does it improve her self-esteem?
» Through what avenue will you present the visualization to Brenda — for example, over the Internet or in a staff meeting?
» What does Brenda need to be able to do with your data visualization?
Say that Brenda is the manager of the zoning department in Irvine County.
She is 45 years old and a single divorcee with two children who are about to start
college. She is deeply interested in local politics and eventually wants to be on the
county’s board of commissioners. To achieve that position, she has to get some major
“oomph” on her county management résumé. Brenda derives most of her feelings of
self-worth from her job and her keen ability to make good management decisions for
her department. Until now, Brenda has been forced to manage her department
according to her gut-feel intuition, backed by a few disparate business systems reports.
She is not extraordinarily analytical, but she knows enough to understand what she
sees. The problem is that Brenda hasn’t had the visualization tools that are necessary
to display all the relevant data she should be considering. Because she has neither the
time nor the skill to code something herself, she’s been waiting in the lurch. Brenda is
excited that you’ll be attending next Monday’s staff meeting to present the data
visualization alternatives available to help her get under way in making data-driven
management decisions.
Step 2: Define the purpose After you brainstorm about the typical audience
member (see the preceding section), you can much more easily pinpoint exactly what
you’re trying to achieve with the data visualization. Are you attempting to get
consumers to feel a certain way about themselves or the world around them? Are you
trying to make a statement? Are you seeking to influence organizational decision
makers to make good business decisions? Or do you simply want to lay all the data
out there, for all viewers to make sense of, and deduce from what they will? Return to
the hypothetical Brenda: What decisions or processes are you trying to help her
achieve? Well, you need to make sense of her data, and then you need to present it to
her in a way that she can clearly understand. What’s happening within the inner
mechanics of her department? Using your visualization, you seek to guide Brenda into
making the most prudent and effective management choices.
Step 3: Choose the most functional visualization type for your purpose. Keep in
mind that you have three main types of visualization from which to choose: data
storytelling, data art, and data showcasing. If you're designing for organizational
decision makers, you'll most likely use data storytelling to directly tell your audience
what their data means with respect to their line of business. If you're designing for a
social justice organization or a political campaign, data art can best make a dramatic and
effective statement with your data. Lastly, if you're designing for engineers, scientists,
or statisticians, stick with data showcasing so that these analytical types have plenty of
room to figure things out on their own. Back to Brenda — because she's not
extraordinarily analytical and because she’s depending on you to help her make
excellent data-driven decisions, you need to employ data storytelling techniques.
Create either a static or interactive data visualization with some, but not too much,
context. The visual elements of the design should tell a clear story so that Brenda
doesn’t have to work through tons of complexities to get the point of what you’re
trying to tell her about her data and her line of business.
Chart types
Charts are an essential part of working with data, as they are a way to condense large
amounts of data into an easy to understand format. Visualizations of data can bring
out insights to someone looking at the data for the first time, as well as convey
findings to others who won’t see the raw data. There are countless chart types out
there, each with different use cases. Often, the most difficult part of creating a data
visualization is figuring out which chart type is best for the task at hand.
Bar chart
In a bar chart, values are indicated by the length of bars, each of which corresponds
with a measured group. Bar charts can be oriented vertically or horizontally; vertical
bar charts are sometimes called column charts. Horizontal bar charts are a good option
when you have a lot of bars to plot, or the labels on them require additional space to
be legible.
Line chart
Line charts show changes in value across continuous measurements, such as those
made over time. Movement of the line up or down helps bring out positive and
negative changes, respectively. It can also expose overall trends, to help the reader
make predictions or projections for future outcomes. Multiple line charts can also give
rise to other related charts like the sparkline or ridgeline plot.
Scatter plot
A scatter plot displays values on two numeric variables using points positioned on two
axes: one for each variable. Scatter plots are a versatile demonstration of the
relationship between the plotted variables—whether that correlation is strong or weak,
positive or negative, linear or non-linear. Scatter plots are also great for identifying
outlier points and possible gaps in the data.
Box plot
A box plot uses boxes and whiskers to summarize the distribution of values within
measured groups. The positions of the box and whisker ends show the regions where
the majority of the data lies. We most commonly see box plots when we have multiple
groups to compare to one another; other charts with more detail are preferred when we
have only one group to plot.
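A minimal matplotlib sketch of the four basic chart types above (all numbers are made up, for illustration only):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fig, ax = plt.subplots(2, 2, figsize=(8, 6))

ax[0, 0].bar(["A", "B", "C"], [5, 9, 3])                      # bar chart
ax[0, 1].plot([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])               # line chart
ax[1, 0].scatter(rng.normal(size=50), rng.normal(size=50))    # scatter plot
ax[1, 1].boxplot([rng.normal(size=50), rng.normal(1, 2, 50)]) # box plot per group

plt.tight_layout()
plt.show()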
Before moving on to other chart types, it’s worth taking a moment to appreciate the
option of just showing the raw numbers. In particular, when you only have one
number to show, just displaying the value is a sensible approach to depicting the data.
When exact values are of interest in an analysis, you can include them in an
accompanying table or through annotations on a graphical visualization.
Common Variations
Additional chart types can come about from changing the ways encodings are used, or
by including additional encodings. Secondary encodings like area, shape, and color
can be useful for adding additional variables to more basic chart types.
Histogram
If the groups depicted in a bar chart are actually continuous numeric ranges, we can
push the bars together to generate a histogram. Bar lengths in histograms typically
correspond to counts of data points, and their patterns demonstrate the distribution of
variables in your data. A different chart type like line chart tends to be used when the
vertical value is not a frequency count.
One modification of the standard bar chart is to divide each bar into multiple smaller
bars based on values of a second grouping variable, called a stacked bar chart. This
allows you to not only compare primary group values like in a regular bar chart, but
also illustrate a relative breakdown of each group’s whole into its constituent parts.
If, on the other hand, the sub-bars were placed side-by-side into clusters instead of
kept in their stacks, we would obtain the grouped bar chart. The grouped bar chart
does not allow for comparison of primary group totals, but does a much better job of
allowing for comparison of the sub-groups.
Dot plot
A dot plot is like a bar chart in that it indicates values for different categorical
groupings, but encodes values based on a point’s position rather than a bar’s length.
Dot plots are useful when you need to compare across categories, but the zero baseline
is not informative or useful. You can also think of a dot plot as like a line plot with the
line removed, so that it can be used with variables with unordered categories rather
than just continuous or ordered variables.
Area chart
An area chart starts with the same foundation as a line chart – value points connected
by line segments – but adds in a concept from the bar chart with shading between the
line and a baseline. This chart is most often seen when combined with the concept of
stacking, to show both how a total has changed over time and how its components'
contributions have changed.
Dual-axis chart
Dual-axis charts overlay two different charts with a shared horizontal axis, but
potentially different vertical axis scales (one for each component chart). This can be
useful to show a direct comparison between the two sets of vertical values, while also
including the context of the horizontal-axis variable. It is common to use different
base chart types, like the bar and line combination, to reduce confusion of the different
axis scales for each component chart.
Bubble chart
Another way of showing the relationship between three variables is through
modification of a scatter plot. When a third variable is categorical, points can use
different shapes or colors to indicate group membership. If the data points are ordered
in some way, points can also be connected with line segments to show the sequence of
values. When the third variable is numeric in nature, that is where the bubble
chart comes in. A bubble chart builds on the base scatter plot by having the third
variable’s value determine the size of each point.
Density curve
Violin plot
Heatmap
The heatmap presents a grid of values based on two variables of interest. The axis
variables can be numeric or categorical; the grid is created by dividing each variable
into ranges or levels like a histogram or bar chart. Grid cells are colored based on
value, often with darker colors corresponding with higher values. A heatmap can be
an interesting alternative to a scatter plot when there are a lot of data points to plot, but
the point density makes it difficult to see the true relationship between variables.
Specialist Charts
There are plenty of additional charts out there that encode data in other ways for
particular use cases. Xenographics includes a collection of some fanciful charts that
have been driven by very particular purposes. Still, some of these charts have use
cases that are common enough that they can be considered essential to know.
Pie chart
It may seem surprising to see pie charts sequestered here in the 'specialist' section,
considering how commonly they are utilized. However, pie charts use an uncommon
encoding, depicting values as areas sliced from a circular form. Since a pie chart typically lacks
value markings around its perimeter, it is usually difficult to get a good idea of exact
slice sizes. However, the pie chart and its cousin the donut plot excel at telling the
reader that the part-to-whole comparison should be the main takeaway from the
visualization.
Funnel chart
A funnel chart is often seen in business contexts where visitors or users need to be
tracked in a pipeline flow. The chart shows how many users make it to each stage of
the tracked process from the width of the funnel at each stage division. The tapering of
the funnel helps to sell the analogy, but can muddle what the true conversion rates are.
A bar chart can often fulfill the same purpose as a funnel chart, but with a cleaner
representation of data.
Bullet chart
The bullet chart enhances a single bar with additional markings for how to
contextualize that bar’s value. This usually means a perpendicular line showing a
target value, but also background shading to provide additional performance
benchmarks. Bullet charts are usually used for multiple metrics, and are more compact
to render than other types of more fanciful gauges.
Map-based plots
There are a number of families of specialist plots grouped by usage, but we’ll close
this article out by touching upon one of them: map-based or geospatial plots. When
values in a dataset correspond to actual geographic locations, it can be valuable to
actually plot them with some kind of map. A common example of this type of map is
the choropleth like the one above. This takes a heat map approach to depicting value
through the use of color, but instead of values being plotted in a grid, they are filled
into regions on a map.
21. What is D3.js? Why build data visualizations with D3.js? When should you use
D3.js? Where is D3.js used?
D3.js is a JavaScript library for manipulating documents based on data. D3 helps you
bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives
you the full capabilities of modern browsers without tying yourself to a proprietary
framework, combining powerful visualization components and a data-driven approach
to DOM manipulation.
CSS is used to style the elements in the DOM. A style sheet can exist as a
separate .css file that you include in your HTML page or can be embedded directly in
the HTML page. Style sheets refer to an ID, class, or type of element and determine
the appearance of that element. The terminology used to define the style is a CSS
selector and is the same type of selector used in the d3.select() syntax. You can set
inline styles (that are applied to only a single element) by using
d3.select("#someElement") .style("opacity", .5) to set the opacity of an element to
50%. Let’s update your d3ia.html to include a style sheet.
Example
<!doctype html>
<html>
<style>
.inactive, .tentative {
  stroke: darkgray;
  stroke-width: 2px;
  stroke-dasharray: 5 5;
}
.tentative {
  opacity: .5;
}
.active {
  stroke: black;
  stroke-width: 4px;
  stroke-dasharray: 1;
}
circle {
  fill: red;
}
rect {
  fill: darkgray;
}
</style>
<body>
<div id="infovizDiv">
  <svg>
    <!-- SVG shapes (e.g., circles and rects drawn with D3) go here;
         the original elements were not preserved in this excerpt. -->
    <g>
    </g>
  </svg>
</div>
</body>
</html>
Style sheets can also refer to a state of the element, so with :hover you can change the
way an element looks when the user mouses over that element. You can learn about
other complex CSS selectors in more detail in a book devoted to that subject. The
most useful way to do this is to have CSS classes associated with particular stylistic
changes and then change the class of an element. You can change the class of an
element, which is an attribute of an element, by selecting and modifying the class
attribute.
Tableau has a variety of options available, including a desktop app, server and hosted
online versions, and a free public option. There are hundreds of data import options
available, from CSV files to Google Ads and Analytics data to Salesforce data. Output
options include multiple chart formats as well as mapping capability. That means
designers can create color-coded maps that showcase geographically important data in
a format that’s much easier to digest than a table or chart could ever be.
The public version of Tableau is free to use for anyone looking for a powerful way to
create data visualizations that can be used in a variety of settings. From journalists to
political junkies to those who just want to quantify the data of their own lives, there
are tons of potential uses for Tableau Public. They have an extensive gallery of
infographics and visualizations that have been created with the public version to serve
as inspiration for those who are interested in creating their own.
Advantages
Mapping capability
Disadvantages
Non-free versions are expensive ($70/month/user for the Tableau Creator
software)
Data science in e-commerce serves the same purpose that it does in any other
discipline — to derive valuable insights from raw data. In e-commerce, you’re looking
for data insights that you can use to optimize a brand’s marketing return on investment
(ROI) and to drive growth in every layer of the sales funnel. How you end up doing
that is up to you, but the work of most data scientists in e-commerce involves the
following:
» Data analysis: Simple statistical and mathematical inference. Segmentation analysis
gets rather complicated when trying to make sense of e-commerce data. You also use
a lot of trend analysis, outlier analysis, and regression analysis.
» Data wrangling: Data wrangling involves using processes and procedures to clean
and convert data from one format and structure to another so that the data is accurate
and in the format that analytics tools and scripts require for consumption. In growth
work, source data is usually captured and generated by analytics applications. Most of
the time, you can derive insight within the application, but sometimes you need to
export the data so that you can create data mashups, perform custom analyses, and
create custom visualizations that aren’t available in your out-of-the-box solutions.
These situations could demand that you use a fair bit of data wrangling to get what
you need from the source datasets.
» Data visualization design: Data graphics in e-commerce are usually quite simple.
Expect to use a lot of line charts, bar charts, scatter charts, and map-based data
visualizations. Data visualizations should be simple and to the point, but the analyses
required to derive meaningful insights may take some time.
» Communication: After you make sense of the data, you have to communicate its
meaning in clear, direct, and concise ways that decision makers can easily understand.
E-commerce data scientists need to be excellent at communicating data insights via
data visualizations, a written narrative, and conversation.
» Custom development work: In some cases, you may need to design custom scripts
for automated custom data analysis and visualization. In other cases, you may have to
go so far as to design a personalization and recommendation system, but because you
can find a ton of prebuilt applications available for these purposes, the typical e-
commerce data scientist position description doesn’t include this requirement.
Segmenting for faster and easier e-commerce growth Data scientists working in
growth hacking should be familiar with, and know how to derive insight from, the
following user segmentation and targeting applications:
» Adobe Analytics: Use Adobe Analytics for advanced user segmentation and customer
churn analysis — or analysis to identify reasons for and preempt customer loss.
Customer churn describes the loss, or churn, of existing customers. Customer churn
analysis is a set of analytical techniques that are designed to identify, monitor, and
issue alerts on indicators that signify when customers are likely to churn. With the
information that’s generated in customer churn analysis, businesses can take
preemptive measures to retain at-risk customers.
» Webtrends (http://webtrends.com): Webtrends' Visitor Segmentation and Scoring
offers real-time customer segmentation features that help you isolate, target, and
engage your highest-value visitors. The Conversion Optimization solution also offers
advanced segmenting and targeting functionality that you can use to optimize your
website, landing pages, and overall customer experience.
26) Discuss Outliers
Outliers are an important part of a dataset. They can hold useful information about
your data.
Outliers can give helpful insights into the data you're studying, and they can have an
effect on statistical results. This can potentially help you discover inconsistencies and
detect any errors in your statistical processes.
So, knowing how to find outliers in a dataset will help you better understand your
data.
The following explains how to detect numeric outliers by calculating the interquartile
range, using a very simple example dataset so you can follow along if you want.
Outliers are extreme values that stand out greatly from the overall pattern of values
in a dataset or graph.
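A minimal Python sketch of the interquartile-range rule on a small made-up dataset (using the usual 1.5 x IQR fences):

import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17, 19, 107])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)    # 102 and 107 stand out from the overall pattern of values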
Time series analysis
At its core, time series analysis focuses on studying and interpreting a sequence of
data points recorded or collected at consistent time intervals. Unlike cross-sectional
data, which captures a snapshot in time, time series data is fundamentally dynamic,
evolving over chronological sequences both short and extremely long. This type of
analysis is pivotal in uncovering underlying structures within the data, such as
trends, cycles, and seasonal variations.
Technically, time series analysis seeks to model the inherent structures within the
data, accounting for phenomena like autocorrelation, seasonal patterns, and trends.
The order of data points is crucial; rearranging them could lose meaningful insights
or distort interpretations. Furthermore, time series analysis often requires a
substantial dataset to maintain the statistical significance of the findings. This
enables analysts to filter out 'noise,' ensuring that observed patterns are not mere
outliers but statistically significant trends or cycles.
To delve deeper into the subject, you must distinguish between time-series data,
time-series forecasting, and time-series analysis. Time-series data refers to the raw
sequence of observations indexed in time order. On the other hand, time-series
forecasting uses historical data to make future projections, often employing statistical
models like ARIMA (AutoRegressive Integrated Moving Average). But Time series
analysis, the overarching practice, systematically studies this data to identify and
model its internal structures, including seasonality, trends, and cycles. What sets
time series apart is its time-dependent nature, the requirement for a sufficiently large
sample size for accurate analysis, and its unique capacity to highlight cause-effect
relationships that evolve.
Time series analysis has become a crucial tool for companies looking to make better
decisions based on data. By studying patterns over time, organizations can
understand past performance and predict future outcomes in a relevant and
actionable way. Time series helps turn raw data into insights companies can use to
improve performance and track historical outcomes.
For example, retailers might look at seasonal sales patterns to adapt their inventory
and marketing. Energy companies could use consumption trends to optimize their
production schedule. The applications even extend to detecting anomalies—like a
sudden drop in website traffic—that reveal deeper issues or opportunities. Financial
firms use it to respond to stock market shifts instantly. And health care systems need
it to assess patient risk in the moment.
Rather than a series of stats, time series helps tell a story about evolving business
conditions over time. It's a dynamic perspective that allows companies to plan
proactively, detect issues early, and capitalize on emerging opportunities.
A time series is typically described in terms of four components:
o Trends
o Seasonality
o Cycles
o Noise
Trends show the general direction of the data, and whether it is increasing,
decreasing, or remaining stationary over an extended period of time. Trends indicate
the long-term movement in the data and can reveal overall growth or decline. For
example, e-commerce sales may show an upward trend over the last five years.
Seasonality refers to predictable patterns that recur regularly, like yearly retail spikes
during the holiday season. Seasonal components exhibit fluctuations fixed in timing,
direction, and magnitude. For instance, electricity usage may surge every summer
as people turn on their air conditioners.
Cycles demonstrate fluctuations that do not have a fixed period, such as economic
expansions and recessions. These longer-term patterns last longer than a year and
do not have consistent amplitudes or durations. Business cycles that oscillate
between growth and decline are an example.
Finally, noise encompasses the residual variability in the data that the other
components cannot explain. Noise includes unpredictable, erratic deviations after
accounting for trends, seasonality, and cycles.
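A minimal statsmodels sketch that separates a monthly series into trend, seasonal, and residual (noise) components; the series here is synthetic and invented purely for illustration:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + yearly seasonality + noise.
idx = pd.date_range("2018-01-01", periods=60, freq="MS")
values = (np.linspace(100, 160, 60)
          + 10 * np.sin(2 * np.pi * idx.month / 12)
          + np.random.default_rng(0).normal(0, 2, 60))
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())   # the long-term movement
print(result.seasonal.head(12))       # the repeating within-year pattern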
28) Explain the overfitting problem.
In the real world, the datasets we encounter are never clean and perfect. Each
dataset contains impurities: noisy data, outliers, missing data, or imbalanced data.
Due to these impurities, different problems occur that affect the accuracy and the
performance of the model. One such problem is overfitting in machine learning.
Overfitting is a problem that a model can exhibit.
A statistical model is said to be overfitted if it can't generalize well with unseen data.
Overfitting is one of the main problems that occur in a machine learning model, and it
causes poor performance.
o Overfitting occurs when the model fits more data than required, and
it tries to capture each and every datapoint fed to it. Hence it starts
capturing noise and inaccurate data from the dataset, which
degrades the performance of the model.
o An overfitted model doesn't perform accurately with the test/unseen
dataset and can’t generalize well.
o An overfitted model is said to have low bias and high variance.
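A minimal sketch of that low-bias/high-variance symptom: an overly flexible polynomial fits the training data almost perfectly but typically scores worse on unseen data (synthetic data, invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=40)   # noisy underlying pattern
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (3, 15):   # modest model vs. overly flexible model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, round(model.score(X_train, y_train), 3),
          round(model.score(X_test, y_test), 3))
# The high-degree model usually has the better training score and the worse
# test score: it has memorized noise instead of generalizing (overfitting).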
Difference between Supervised and Unsupervised Learning:
o Training data: In supervised learning, training data is used to infer the model; in
unsupervised learning, training data is not used.
o Another name: Supervised learning is also called classification; unsupervised
learning is also called clustering.
o Example: Optical Character Recognition (supervised); finding a face in an image
(unsupervised).
Decision tree pruning
Pruning should reduce the size of a learning tree without reducing predictive
accuracy as measured by a cross-validation set. There are many techniques for tree
pruning that differ in the measurement that is used to optimize performance.
Techniques
Pruning processes can be divided into two types (pre- and post-pruning).
Pre-pruning procedures prevent a complete induction of the training set by applying
a stop() criterion in the induction algorithm (e.g., maximum tree depth or information
gain(Attr) > minGain). Pre-pruning methods are considered to be more efficient because
they do not induce an entire tree; rather, trees remain small from the start.
Pre-pruning methods share a common problem, the horizon effect: the undesired
premature termination of the induction by the stop() criterion.
Post-pruning (or just pruning) is the most common way of simplifying trees. Here,
nodes and subtrees are replaced with leaves to reduce complexity. Pruning can not
only significantly reduce the size but also improve the classification accuracy of
unseen objects. It may be the case that the accuracy of the assignment on the train
set deteriorates, but the accuracy of the classification properties of the tree increases
overall.
The procedures are differentiated on the basis of their approach in the tree (top-
down or bottom-up).
Bottom-up pruning
These procedures start at the last node in the tree (the lowest point). Following
recursively upwards, they determine the relevance of each individual node. If the
relevance for the classification is not given, the node is dropped or replaced by a
leaf. The advantage is that no relevant sub-trees can be lost with this method. These
methods include Reduced Error Pruning (REP), Minimum Cost Complexity Pruning
(MCCP), or Minimum Error Pruning (MEP).
Top-down pruning
In contrast to the bottom-up method, this method starts at the root of the tree.
Following the structure below, a relevance check is carried out which decides
whether a node is relevant for the classification of all n items or not. By pruning the
tree at an inner node, it can happen that an entire sub-tree (regardless of its
relevance) is dropped. One of these representatives is pessimistic error pruning
(PEP), which brings quite good results with unseen items.
Pruning algorithms
Reduced error pruning
One of the simplest forms of pruning is reduced error pruning. Starting at the leaves,
each node is replaced with its most popular class. If the prediction accuracy is not
affected then the change is kept. While somewhat naive, reduced error pruning has
the advantage of simplicity and speed.
Cost complexity pruning
Cost complexity pruning generates a series of trees T0, ..., Tm, where T0 is the initial
tree and Tm is the root alone. At step i, the tree Ti is created by removing a subtree
from tree Ti-1 and replacing it with a leaf node whose value is chosen as in the
tree-building algorithm. The subtree to remove at each step is the one whose removal
causes the smallest increase in error per removed leaf, and the best tree in the resulting
sequence is then selected by its accuracy on a hold-out set or by cross-validation.
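A minimal scikit-learn sketch of post-pruning via cost complexity pruning (the ccp_alpha parameter, available in recent scikit-learn versions, controls how aggressively subtrees are replaced with leaves; Iris data is used only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a full tree, then inspect its cost-complexity pruning path.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit with increasing ccp_alpha: a larger alpha means more aggressive pruning.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(round(alpha, 4), pruned.get_n_leaves(), round(pruned.score(X_test, y_test), 3))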