MFCS Notes
■ Properties of PMF:
– 0 ≤ PX(x) ≤ 1 for every x, since each value of the PMF is a probability.
– Since the random variable assigns a value x ∈ R to each sample point s ∈ S, the probabilities over the range must sum to 1: ∑ PX(x) = 1, where the sum is over all x ∈ RX.
Example:
If we toss a fair coin twice and let X be defined as the number of Tails we observe, Find
the range of X (Rx) and Probability Mass Function Px.
Solution:
Here, our sample space is given by
S = {HH, HT, TH, TT}
The number of Tails will be 0, 1 or 2. Hence,
Rx = {0, 1, 2}
Since the Rx = {0, 1, 2} is a countable set, random variable X is a discrete
random variable, and so the PMF will be defined as
Px(k) = P(X=k) for k = 0, 1, 2.
And so we have,
Px(0) = P(X=0) = P(HH)= 0.25
Px(1) = P(X=1) = P({HT,TH}) = 0.25 + 0.25 = 0.5
Px(2) = P(X=2) = P(TT)= 0.25
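To make this concrete, here is a minimal Python sketch (an addition to these notes, not part of the original example) that enumerates the sample space of two fair coin tosses and builds the PMF of X = number of Tails:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Sample space of two fair coin tosses: HH, HT, TH, TT (each with probability 1/4)
sample_space = list(product("HT", repeat=2))

# X = number of Tails observed in each outcome
counts = Counter(outcome.count("T") for outcome in sample_space)

# PMF: P_X(k) = (number of outcomes with k Tails) / (total number of outcomes)
pmf = {k: Fraction(c, len(sample_space)) for k, c in counts.items()}
print(pmf)  # P_X(0) = 1/4, P_X(1) = 1/2, P_X(2) = 1/4
```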
Although the PMF is generally defined on RX, it is sometimes convenient to extend it to all real numbers. So in general, we can write
PX(x) = P(X = x) if x ∈ RX, and PX(x) = 0 otherwise.
● The PMF is one way to describe the distribution of a discrete random variable.
● But PMF cannot be defined for continuous random variables.
● The cumulative distribution function (CDF) of a random variable is another method to
describe the distribution of random variables.
● The advantage of the CDF is that it can be defined for any kind of random variable
(discrete, continuous, and mixed).
The cumulative distribution function (CDF) of a random variable X is defined as
FX(x) = P(X ≤ x), for all x ∈ R.
Example:
Let X be a discrete random variable with range RX={1,2,3,...}. Suppose the PMF of X is
given by
PX(k) = 1/2^k for k = 1, 2, 3, ...
a) Find and plot the CDF of X, FX(x)
b) Find P(2<X≤5)
c) Find P(X>4)
Solution:
a. To find the CDF, note that
For x<1, FX(x)=0
For 1 ≤ x < 2, FX(x) = PX(1) = 1/2
For 2 ≤ x < 3, FX(x) = PX(1) + PX(2) = 1/2 + 1/4 = 3/4
In general we have
For 0 < k ≤ x < k + 1, FX(x) = PX(1) + PX(2) + ... + PX(k)
= (1/2) + (1/4) + ... + (1/2^k) = (2^k − 1)/2^k
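Parts (b) and (c) follow directly from this CDF. The short Python sketch below is an added illustration (not part of the original solution) that evaluates FX and the two requested probabilities:

```python
def pmf(k: int) -> float:
    """PMF of X: P_X(k) = 1/2**k for k = 1, 2, 3, ..."""
    return 0.5 ** k

def cdf(x: float) -> float:
    """CDF of X: F_X(x) = sum of P_X(k) for k <= x, i.e. (2**k - 1)/2**k on [k, k+1)."""
    return sum(pmf(k) for k in range(1, int(x) + 1)) if x >= 1 else 0.0

# b) P(2 < X <= 5) = F_X(5) - F_X(2) = 31/32 - 3/4 = 7/32
print(cdf(5) - cdf(2))   # 0.21875

# c) P(X > 4) = 1 - F_X(4) = 1 - 15/16 = 1/16
print(1 - cdf(4))        # 0.0625
```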
■ Consider a continuous random variable X with an absolutely continuous CDF FX(x). The
function fX(x) defined by
fX(x) = dFX(x)/dx = F′X(x), if FX(x) is differentiable at x
is called the probability density function (PDF) of X.
■ Properties of PDF:
● For every x ∈ R, fX(x) ≥ 0.
● ∫ fX(x) dx = 1, where the integral is taken over (−∞, ∞).
Example:
Let X be a continuous random variable with the following PDF
fX(x) = c·e^(−x) for x ≥ 0, and 0 otherwise
C. We can find P(1 < X < 3) using either the CDF or the PDF. Since we have found the CDF already,
P(1 < X < 3) = FX(3) − FX(1) = [1 − e^(−3)] − [1 − e^(−1)] = e^(−1) − e^(−3)
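As a quick check (an added sketch, not part of the original notes): the normalization condition forces c = 1 for this PDF, and numerical integration with scipy (an assumed dependency) reproduces P(1 < X < 3) = e^(−1) − e^(−3):

```python
import math
from scipy.integrate import quad

# f_X(x) = c * exp(-x) for x >= 0; normalization gives c = 1,
# since the integral of exp(-x) over [0, infinity) is 1.
c = 1.0
pdf = lambda x: c * math.exp(-x)

area, _ = quad(pdf, 0, math.inf)   # should be ~1.0
prob, _ = quad(pdf, 1, 3)          # P(1 < X < 3)
print(area, prob, math.exp(-1) - math.exp(-3))  # prob ≈ 0.3181
```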
Prepared by: Kalpit Patel(210280723001)
Chp 1 topic no (2) : Parametric families of distributions
{ binomial distribution, poisson distribution , geometric distribution }
Binomial Distribution
In probability theory and statistics, the binomial distribution is the discrete
probability distribution that gives only two possible results in an experiment,
either Success or Failure. For example, if we toss a coin, there could be only
two possible outcomes: heads or tails, and if any test is taken, then there
could be only two results: pass or fail. This distribution is also called a
binomial probability distribution.
Ex: Toss a coin 2 times. Find the probability that heads occurs 2 times.
HH HT TH TT
Prob(Head = 2) = 1/4
But when the number of outcomes is large, finding the probability by listing the sample space this way becomes difficult.
The binomial distribution formula for any random variable X is given by:
P(X = x) = nCx p^x q^(n−x)
Where,
n = the number of trials, x = the number of successes (x = 0, 1, ..., n), p = the probability of success in a single trial, and q = 1 − p.
Example 2
A fair coin is tossed 10 times. What is the probability of getting exactly 6 heads, and of getting at least six heads?
Solution:
Let x denote the number of heads in an experiment.
Here, the number of times the coin tossed is 10. Hence, n=10.
The probability of getting a head, p = ½.
The probability of getting a tail, q = 1-p = 1-(½) = ½.
The binomial distribution is given by the formula:
P(X = x) = nCx p^x q^(n−x), where x = 0, 1, 2, 3, …, n
Therefore, P(X = x) = 10Cx (½)^x (½)^(10−x)
(i) The probability of getting exactly 6 heads is:
P(X = 6) = 10C6 (½)^6 (½)^(10−6)
P(X = 6) = 10C6 (½)^10
P(X = 6) = 105/512.
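The "at least six heads" part asked in the question can be obtained by summing P(X = x) for x = 6, ..., 10. A small sketch (added here; it assumes scipy.stats is available, which the notes do not mention) reproduces both answers:

```python
from scipy.stats import binom

n, p = 10, 0.5

# (i) exactly 6 heads: 10C6 * (1/2)^10 = 105/512 ≈ 0.2051
p_exactly_6 = binom.pmf(6, n, p)

# (ii) at least 6 heads: P(X >= 6) = 1 - P(X <= 5) = 193/512 ≈ 0.377
p_at_least_6 = 1 - binom.cdf(5, n, p)

print(p_exactly_6, p_at_least_6)
```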
Let us assume that your team is much more skilled and has a 75% chance of winning. This means there is a 25% chance of losing.
In this example:
Poisson Distribution
Example 2:
Telephone calls arrive at an exchange according to a Poisson process at a rate λ = 2/min. Calculate the probability that exactly two calls will be received during each of the first 5 minutes of the hour.
Solution:
Assume that “N” is the number of calls received during a 1 minute
period.
Therefore,
P(N = 2) = (e^(−2) · 2^2)/2!
P(N = 2) = 2e^(−2)
Now, let “M” be the number of minutes, among the 5 minutes considered, during which exactly 2 calls will be received. Thus “M” follows a binomial distribution with parameters n = 5 and p = 2e^(−2).
P(M = 5) = p^5 = (2e^(−2))^5 = 32 · e^(−10)
P(M = 5) ≈ 0.00145, where “e” is a constant approximately equal to 2.718.
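A short numerical check (added sketch; scipy.stats is an assumed dependency) of the two steps in the solution, the Poisson probability for one minute and the binomial probability over the five minutes:

```python
from scipy.stats import binom, poisson

lam = 2                                    # calls per minute

# P(N = 2) = e^(-2) * 2^2 / 2! = 2e^(-2) ≈ 0.2707
p_two_calls = poisson.pmf(2, lam)

# M = number of the 5 minutes with exactly 2 calls; M ~ Binomial(n = 5, p = 2e^(-2))
p_all_five = binom.pmf(5, 5, p_two_calls)  # = (2e^(-2))^5 = 32 * e^(-10) ≈ 0.00145

print(p_two_calls, p_all_five)
```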
Geometric Distribution
A geometric distribution is defined as the discrete probability distribution of a random variable “X” that counts the number of independent trials needed to obtain the first success, where each trial succeeds with probability p and fails with probability q = 1 − p. Its PMF is
P(X = x) = q^(x−1) p, for x = 1, 2, ….
P(X = x) = 0, otherwise
Example: A trial is repeated until the first success. If the probability of success on each attempt is p = 0.7, find the probability that the first success occurs on the 10th attempt.
Solution:
p = 0.7, so q = 0.3, and x = 10
P(X = 10) = q^(10−1) p = (0.3)^9 (0.7) ≈ 1.38 × 10^(−5)
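A minimal sketch (added to these notes) that evaluates this geometric probability directly from the formula; scipy.stats.geom is used only as a cross-check and is an assumed dependency:

```python
from scipy.stats import geom

p, q, x = 0.7, 0.3, 10

# P(X = 10) = q^(x-1) * p: first success on the 10th attempt
manual = q ** (x - 1) * p
print(manual)             # ≈ 1.38e-05

# scipy's geom uses the same "number of trials until the first success" convention
print(geom.pmf(x, p))     # same value
```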
Uniform Distribution
Example 1
10.4 19.6 18.8 13.9 17.8 16.8 21.6 17.9 12.5 11.1 4.9
12.8 14.8 22.8 20.0 15.9 16.3 13.4 17.1 14.5 19.0 22.8
1.3 0.7 8.9 11.9 10.9 7.3 5.9 3.7 17.9 19.2 9.8
5.8 6.9 2.6 5.8 21.7 11.8 3.4 2.1 4.5 6.3 10.7
8.9 9.4 9.4 7.6 10.0 3.3 6.7 7.8 11.6 13.8 18.6
The sample mean = 11.49 and the sample standard deviation = 6.23.
We will assume that the smiling times, in seconds, follow a uniform distribution
between zero and 23 seconds, inclusive. This means that any smiling time from
zero to and including 23 seconds is equally likely. The histogram that could be
constructed from the sample is an empirical distribution that closely matches the
theoretical uniform distribution.
The notation for the uniform distribution is X ~ U(a, b) where a = the lowest value
of x and b = the highest value of x.
Example 2
The current (in mA) measured in a piece of copper wire is known to follow a
uniform distribution over the interval [0, 25]. Write down the formula for the
probability density function f(x) of the random variable X representing the current.
Calculate the mean and variance of the distribution and find the cumulative
distribution
function F(x).
Solution
Over the interval [0, 25] the probability density function f(x) is given by the
formula
f(x) = 1/(25 − 0) = 0.04, for 0 ≤ x ≤ 25
f(x) = 0, otherwise
Using the formulae developed for the mean and variance gives
E(X) = (25 + 0)/2 = 12.5 mA
and V(X) = (25 − 0)^2/12 = 52.08 mA²
The cumulative distribution function is obtained by integrating the probability
density function as
shown below.
F(x) = ∫ f(t) dt, integrated from −∞ to x
Hence, choosing the three distinct regions x < 0, 0 ≤ x ≤ 25 and x > 25 in turn
gives:
F(x) = 0, for x < 0
F(x) = x/25, for 0 ≤ x ≤ 25
F(x) = 1, for x > 25
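The same quantities can be checked numerically. The sketch below is an added illustration (scipy.stats is an assumed dependency) using the uniform distribution on [0, 25]:

```python
from scipy.stats import uniform

# X ~ U(0, 25); scipy parameterizes the uniform with loc = a and scale = b - a
X = uniform(loc=0, scale=25)

print(X.mean())    # E(X) = 12.5
print(X.var())     # V(X) = 25^2 / 12 ≈ 52.08
print(X.cdf(10))   # F(10) = 10/25 = 0.4
```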
Example 3
In the manufacture of petroleum the distilling temperature (T◦C) is crucial in
determining the quality of the final product. T can be considered as a random
variable uniformly distributed over 150◦C to 300◦C. It costs £C1 to produce 1
gallon of petroleum. If the oil distills at temperatures
less than 200◦C the product sells for £C2 per gallon. If it distills at a temperature
greater than 200◦C it sells for £C3 per gallon. Find the expected net profit per
gallon.
Example 4
Packages have a nominal net weight of 1 kg. However their actual net weights
have a uniform distribution over the interval 980 g to 1030 g.
(a) Find the probability that the net weight of a package is less than 1 kg.
(b) Find the probability that the net weight of a package is less than w g, where
980 < w <1030.
(c) If the net weights of packages are independent, find the probability that, in a sample of five packages, all five net weights are less than w g and hence find the probability density function of the weight of the heaviest of the packages. (Hint: all five packages weigh less than w g if and only if the heaviest weighs less than w g.)
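Example 4 is stated without a solution in the notes; the sketch below works through parts (a)-(c) under the stated model W ~ U(980, 1030). It is an added illustration, not the notes' own solution:

```python
# W ~ U(980, 1030): f(w) = 1/50 on [980, 1030], F(w) = (w - 980)/50

def F(w: float) -> float:
    """CDF of the net weight W, valid for 980 <= w <= 1030."""
    return (w - 980) / 50

# (a) P(W < 1000 g) = (1000 - 980)/50 = 0.4
print(F(1000))

# (b) P(W < w) = (w - 980)/50 for 980 < w < 1030, e.g. w = 1010
print(F(1010))                              # 0.6

# (c) P(all five weights < w) = F(w)^5; differentiating gives the PDF of the heaviest:
#     f_max(w) = 5 * F(w)^4 * (1/50)
f_max = lambda w: 5 * ((w - 980) / 50) ** 4 / 50
print(F(1010) ** 5, f_max(1010))
```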
If you have a collection of numbers a1,a2,...,aN, their average is a single number that describes
the whole collection. Now, consider a random variable X. We would like to define its average, or
as it is called in probability, its expected value or mean. The expected value is defined as the
weighted average of the values in the range.
Let X be a discrete random variable with range RX={x1,x2,x3,...} (finite or countably infinite). The
expected value of X, denoted by EX is defined as:
EX = ∑ x PX(x), where the sum is taken over all x in RX.
Variance is defined as :
2 2 2
Var(x) = 𝐸([(𝑋 − µ) ] = 𝐸[𝑋 ] − (𝐸[𝑋])
Q. A man draws 3 balls from an urn containing 5 white and 7 black balls. He gets Rs. 10 for each white ball and Rs. 5 for each black ball. Find his expectation.
Soln.
Probability of drawing 3 white balls = 5C3/12C3 = 1/22
Probability of drawing 2 white and 1 black = (5C2 × 7C1)/(12C3) = 7/22
Probability of drawing 1 white and 2 black = (5C1 × 7C2)/(12C3) = 21/44
Probability of drawing 3 black balls = 7C3/12C3 = 7/44
The corresponding amounts received are Rs. 30, Rs. 25, Rs. 20 and Rs. 15, so the expectation is
E = 30 × (1/22) + 25 × (7/22) + 20 × (21/44) + 15 × (7/44) = 935/44 = Rs. 21.25
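The following short sketch (added here) verifies these probabilities and the expectation using exact fractions; the payoffs Rs. 30, 25, 20, 15 come from Rs. 10 per white and Rs. 5 per black ball:

```python
from fractions import Fraction
from math import comb

total = comb(12, 3)

# Sum over w = number of white balls drawn: P(w white, 3-w black) * payoff(w)
expectation = sum(
    Fraction(comb(5, w) * comb(7, 3 - w), total) * (10 * w + 5 * (3 - w))
    for w in range(4)
)
print(expectation)   # 85/4, i.e. Rs. 21.25
```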
Q. A coin is tossed 200 times. Find the probability that the number of heads obtained is between 80 and 120. Given P(Z < −2.82) = 0.0024
Let X = no. of heads, n = 200, p = q = 1/2
By the binomial distribution,
µ = np = 200 × 1/2 = 100
σ² = npq = 50, so σ = √50 ≈ 7.07
Z = (X − 100)/√50
P(80 ≤ X ≤ 120) = P(−2.82 ≤ Z ≤ 2.82)
→ 0.4976 + 0.4976
→ 0.9952
Q. A random sample of size 100 is taken from a population whose mean is 60 and variance is 400. Using the central limit theorem, with what probability can we assert that the mean of the sample will not differ from µ = 60 by more than 4? Given P(Z < −2) = 0.0228
n = 100
µi = 60
σi² = 400
The sample mean is X̄ = (X1 + X2 + ... + X100)/100
E(X̄) = (1/100) ∑ E(Xi) = (1/100)[60 + 60 + ...] = (1/100) × 60 × 100 = 60
Var(X̄) = (1/100²) ∑ Var(Xi) = (1/100²) × 100 × 400 = 4, so the standard deviation of X̄ is 2
→ P(|X̄ − 60| ≤ 4)
→ P(−4 ≤ X̄ − 60 ≤ 4)
→ P(56 ≤ X̄ ≤ 64)
→ P(−2 ≤ Z ≤ 2) = 1 − 2 × 0.0228 = 0.9544
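A quick numerical check of the same probability (added sketch; scipy.stats is an assumed dependency), using the normal approximation for the sample mean:

```python
from scipy.stats import norm

mu, sd_xbar = 60, 2      # the sample mean has mean 60 and standard deviation sqrt(4) = 2

# P(56 <= sample mean <= 64) = P(-2 <= Z <= 2)
prob = norm.cdf(64, loc=mu, scale=sd_xbar) - norm.cdf(56, loc=mu, scale=sd_xbar)
print(prob)               # ≈ 0.9545 (0.9544 with rounded table values)
```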
Q. 20 dice are thrown. Find the approx. probability that the sum obtained is between 65 and 75
using the central limit theorem.
Soln.
X takes the values 1, 2, 3, 4, 5, 6, each with probability 1/6.
Since the distribution is uniform, we have E(X) = (n + 1)/2 = (6 + 1)/2 = 7/2
Var(X) = (n² − 1)/12 = (36 − 1)/12 = 35/12
n = 20 dice are thrown, so for the sum Sn:
E(Sn) = ∑ E(Xi), i = 1 to 20, = (7/2) × 20 = 70
Var(Sn) = ∑ Var(Xi), i = 1 to 20, = 20 × (35/12) = 175/3, so σ(Sn) = √(175/3) ≈ 7.64
→ P(65 ≤ Sn ≤ 75) = P(−0.65 ≤ Z ≤ 0.65)
→ 0.2422 + 0.2422
→ 0.4844
Q. An electric firm manufactures light bulbs whose lifetimes are normally distributed with mean = 800 hours and standard deviation = 40 hours. Find the probability that a bulb burns between 778 and 834 hours.
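The notes do not work this one out; a hedged sketch of the standard normal-probability calculation (scipy.stats assumed available):

```python
from scipy.stats import norm

mu, sigma = 800, 40

# Standardize: z1 = (778 - 800)/40 = -0.55, z2 = (834 - 800)/40 = 0.85
prob = norm.cdf(834, loc=mu, scale=sigma) - norm.cdf(778, loc=mu, scale=sigma)
print(prob)   # ≈ 0.511
```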
P = 1 − (1 − 1/N)^n
Topic 2: Sampling distributions of estimators
● Since our estimators are statistics (particular functions of random variables), their distribution can be derived from the joint distribution of X1, ..., Xn.
● It is called the sampling distribution because it is based on the joint distribution of the
random sample.
● Given a sampling distribution, we can
○ Calculate the probability that an estimator will not differ from the parameter θ
by more than a specified amount.
○ Obtain interval estimates rather than point estimates after we have a sample; an
interval estimate is a random interval such that the true parameter lies within
this interval with a given probability (say 95%).
○ Choose between two estimators: we can, for instance, calculate the mean-squared error of the estimator, Eθ[(θ̂ − θ)²], using the distribution of θ̂.
● Sampling distributions of estimators depend on sample size, and we want to know
exactly how the distribution changes as we change this size so that we can make the
right trade-offs between cost and accuracy
Why is this important?
To estimate population parameters, we can use samples!
Use for better understanding :
https://onlinestatbook.com/2/sampling_distributions/samp_dist_mean.html
Sampling Distribution of the Mean
Learning Objectives
1. State the mean and variance of the sampling distribution of the mean
2. Compute the standard error of the mean
3. State the central limit theorem
The sampling distribution of the mean was defined in the section introducing sampling
distributions. This section reviews some important properties of the sampling distribution of the
mean introduced in the demonstrations in this chapter.
Mean
The mean of the sampling distribution of the mean is the mean of the population from which
the scores were sampled. Therefore, if a population has a mean μ, then the mean of the
sampling distribution of the mean is also μ. The symbol μM is used to refer to the mean of the
sampling distribution of the mean. Therefore, the formula for the mean of the sampling
distribution of the mean can be written as:
μM = μ
Variance
The variance of the sampling distribution of the mean is computed as follows:
σ²M = σ²/N
That is, the variance of the sampling distribution of the mean is the population variance divided
by N, the sample size (the number of scores used to compute a mean). Thus, the larger the
sample size, the smaller the variance of the sampling distribution of the mean.
(optional) This expression can be derived very easily from the variance sum law. Let's begin by
computing the variance of the sampling distribution of the sum of three numbers sampled from
a population with variance σ2. The variance of the sum would be σ2 + σ2 + σ2. For N numbers,
the variance would be Nσ2. Since the mean is 1/N times the sum, the variance of the sampling
distribution of the mean would be 1/N2 times the variance of the sum, which equals σ2/N.
The standard error of the mean is the standard deviation of the sampling distribution of the mean. It is therefore the square root of the variance of the sampling distribution of the mean and can be written as:
σM = σ/√N
The standard error is represented by a σ because it is a standard deviation. The subscript (M)
indicates that the standard error in question is the standard error of the mean.
Central Limit Theorem
The central limit theorem states that:
Given a population with a finite mean μ and a finite non-zero variance σ2, the
sampling distribution of the mean approaches a normal distribution with a mean of
μ and a variance of σ2/N as N, the sample size, increases.
The expressions for the mean and variance of the sampling distribution of the mean are not
new or remarkable. What is remarkable is that regardless of the shape of the parent population,
the sampling distribution of the mean approaches a normal distribution as N increases. If you
have used the "Central Limit Theorem Demo," you have already seen this for yourself. As a
reminder, Figure 1 shows the results of the simulation for N = 2 and N = 10. The parent
population was a uniform distribution. You can see that the distribution for N = 2 is far from a
normal distribution. Nonetheless, it does show that the scores are denser in the middle than in
the tails. For N = 10 the distribution is quite close to a normal distribution. Notice that the
means of the two distributions are the same, but that the spread of the distribution for N = 10 is
smaller.
Figure 1. A simulation of a sampling distribution. The parent population is uniform. The blue line
under "16" indicates that 16 is the mean. The red line extends from the mean plus and minus
one standard deviation.
Figure 2 shows how closely the sampling distribution of the mean approximates a normal
distribution even when the parent population is very non-normal. If you look closely you can
see that the sampling distributions do have a slight positive skew. The larger the sample size,
the closer the sampling distribution of the mean would be to a normal distribution.
Figure 2. A simulation of a sampling distribution. The parent population is very non-normal.
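The behaviour described above can be reproduced with a small simulation. The sketch below is an added illustration (using numpy, which the notes do not reference): it draws repeated samples from a uniform parent population and compares the spread of the sample means for N = 2 and N = 10 against σ/√N:

```python
import numpy as np

rng = np.random.default_rng(0)
population_sd = np.sqrt(1 / 12)    # standard deviation of a U(0, 1) parent population

for n in (2, 10):
    # 100,000 samples of size n; each row mean is one draw from the sampling distribution
    means = rng.uniform(0, 1, size=(100_000, n)).mean(axis=1)
    print(n, means.std(), population_sd / np.sqrt(n))   # empirical SD vs. sigma/sqrt(N)
```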
Topic 3: Methods of Moments
Example
Topic 4: Maximum Likelihood
Chapter 3
Presented by : Parth Nirmal (210280723008)
Statistical inference is the process of using data analysis to deduce properties of an underlying probability
distribution.
Inference means the process of drawing conclusions about population parameters based on a sample taken
from the population.
Population means a set of all units and a sample is the data we collect from the population.
Confidence Interval :
A confidence interval expresses how much uncertainty there is in a particular statistic. Confidence intervals are
often used with a margin of error. It tells you how confident you can be that the results from a poll or
survey reflect what you would expect to find if it were possible to survey the entire population. Confidence
intervals are intrinsically connected to confidence levels.
The interval for the population mean is given by x̄ ± Z(α/2) · σ/√n, where
x̄ is the sample mean,
Z is the critical value,
n is the sample size,
α is the level of significance,
σ is the standard deviation
Example :
To infer the average strength of some product A, a sample of size of 80 is taken from the entire lot of that
product. The sample mean is 18.85 with sample variance 30.77. Construct a 99% confidence interval for the
product’s true average strength.
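The example is stated without a worked solution; the sketch below (added, with scipy.stats as an assumed dependency) computes the 99% interval using the large-sample formula x̄ ± z(α/2)·s/√n:

```python
import math
from scipy.stats import norm

x_bar, s2, n = 18.85, 30.77, 80
s = math.sqrt(s2)

z = norm.ppf(1 - 0.01 / 2)               # z(alpha/2) ≈ 2.576 for a 99% confidence level
margin = z * s / math.sqrt(n)
print(x_bar - margin, x_bar + margin)    # ≈ (17.25, 20.45)
```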
Hypothesis testing :
A hypothesis is an educated guess about something in the world around you. It should be testable, either
by experiment or observation.
Example :
The specification for a certain kind of ribbon calls for a mean breaking strength of 180 pounds. If five pieces
of the ribbon have a mean breaking strength of 169.5 pounds with a standard deviation of 5.7 pounds, test
the null hypothesis µ = 180 pounds against the alternative hypothesis µ < 180 pounds at the 0.01 level of
significance. Assume that the population distribution is normal.
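No worked solution is given in the notes; a hedged sketch of the corresponding one-sample t-test (scipy.stats assumed). With n = 5 and only the sample standard deviation known, the t distribution with 4 degrees of freedom applies:

```python
import math
from scipy.stats import t

mu0, x_bar, s, n = 180, 169.5, 5.7, 5

t_stat = (x_bar - mu0) / (s / math.sqrt(n))   # ≈ -4.12
t_crit = t.ppf(0.01, df=n - 1)                # ≈ -3.747 (left-tail critical value, alpha = 0.01)

print(t_stat, t_crit, t_stat < t_crit)        # True -> reject H0: mu = 180 pounds
```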
Multivariate Analysis :
Multivariate analysis is a collection of statistical techniques that analyze the relationships among more than two variables, i.e., the effect of more than one independent variable on one or more dependent variables. It is used for:
•Data reduction
•Grouping
•Investigate relationship among variables
•Prediction
•Hypothesis construction and testing
Multivariate analysis techniques :
Topic : Statistical Inference and Introduction to multivariate statistical models: regression and
classification problems
Multivariate Statistics :
Multivariate Statistics is a subdivision of statistics encompassing the simultaneous observation and analysis
of more than two outcome variables.
Which simple means it looks at more than two variables.
A class or cluster is a grouping of points in this multidimensional attribute space. Two locations belong to
the same class or cluster if their attributes (vector of band values) are similar. A multiband raster and
individual single band rasters can be used as the input into a multivariate statistical analysis.
In a normal classification problem we give the answer as yes/no, right/wrong, possible/not possible.
Here we can instead say which class the test tuple actually belongs to.
Example :
Here, many classes are available, like class1, class2, class3, class4, so it is a multiclass classification problem.
1. Part of speech tagging (verb,noun,adjective,etc.)
2. Different topics
Two common strategies for building a multiclass classifier from binary classifiers are:
1. One-vs-all
2. One-vs-one
Example :
Number of classes = 3
M1,M2,M3
Suppose M1 and M3 give output +1 for the test tuple. Then we can neglect M2 (the test tuple does not belong to M2).
But the probabilities differ: the test tuple belongs to M1 with probability 90% and to M3 with probability 70%. So we say that the test tuple belongs to M1, because its probability is greater than that of M3.
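A minimal sketch of the one-vs-all idea described above (an added illustration using scikit-learn, which the notes do not name): one binary classifier is fitted per class, and the class with the highest probability wins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data with 3 classes, playing the role of M1, M2, M3
X, y = make_classification(n_samples=300, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Per-class probabilities for one test tuple; the largest one decides the class
print(clf.predict_proba(X[:1]))
print(clf.predict(X[:1]))
```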
Multivariate Regression helps us to measure the relationship between more than one independent variable and more than one dependent variable. It finds the relation between the variables (assumed to be linearly related).
It is used to predict the behavior of the outcome variables based on how the associated predictor variables change.
It can be applied to many practical fields like politics, economics, medicine, research work and many different kinds of businesses.
Examples :
(1) An e-commerce company has collected data about its customers, such as age, purchase history and gender, and the company wants to find the relationship between these different dependent and independent variables.
(2) A gym trainer has collected data about the clients coming to his gym and wants to observe some things about each client: health, eating habits (which kinds of products the client consumes every week) and the weight of the client. He wants to find a relation between these variables.
Example :
In the above table there are only x and y in the dataset, so we can easily find the regression.
Multiple features (variables):
Here there are many variables in the dataset, so it becomes a multivariate problem.
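A small sketch of a multi-feature, multi-output linear regression in scikit-learn (an added illustration; the feature and target names are hypothetical, loosely following the e-commerce example above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical features: age, number of past purchases, gender (encoded 0/1)
X = np.column_stack([rng.integers(18, 65, 200),
                     rng.integers(0, 50, 200),
                     rng.integers(0, 2, 200)])

# Two hypothetical dependent variables, e.g. yearly spend and number of visits
Y = np.column_stack([50 + 3.0 * X[:, 1] + rng.normal(0, 10, 200),
                     5 + 0.1 * X[:, 0] + rng.normal(0, 2, 200)])

model = LinearRegression().fit(X, Y)   # handles multiple outputs directly
print(model.coef_.shape)               # (2 outputs, 3 features)
print(model.predict(X[:1]))
```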
Presented By: Zalak Mistry ( 210280723011 )
A statistical model is said to be overfitted when it fits the training data more closely than necessary. When a model fits more data than it actually needs, it starts catching the noisy data and inaccurate values in the data. As a result, the efficiency and accuracy of the model decrease. Let us take a look at a few examples of overfitting in order to understand how it actually happens.
Examples Of Overfitting
Example 1
If we take an example of simple linear regression, training the data is all about finding out the
minimum cost between the best fit line and the data points. It goes through a number of
iterations to find out the optimum best fit, minimizing the cost. This is where overfitting comes
into the picture.
The line seen in the image above can give a very efficient outcome for a new data point. In the
case of overfitting, when we run the training algorithm on the data set, we allow the cost to
reduce with each number of iteration.
Running this algorithm for too long will mean a reduced cost but it will also fit the noisy data
from the data set. The result would look something like in the graph below.
This might look efficient but isn’t really. The main goal of an algorithm such as linear regression
is to find a dominant trend and fit the data points accordingly. But in this case, the line fits all
data points, which is irrelevant to the efficiency of the model in predicting optimum outcomes
for new entry data points.
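The effect described above can be reproduced with a tiny experiment (added sketch; scikit-learn assumed): a high-degree polynomial drives the training error towards zero by chasing the noise, unlike a simple line:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y = 2 * x.ravel() + rng.normal(0, 0.2, 20)    # linear trend + noise

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    print(degree, mean_squared_error(y, model.predict(x)))   # degree 15 fits the noise too
```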
Now let us consider a more descriptive example with the help of a problem statement.
Example 2
Problem Statement: Let us consider we want to predict if a soccer player will land a slot in a tier
1 football club based on his/her current performance in the tier 2 league.
Now imagine, we train and fit the model with 10,000 such players with outcomes. When we try
to predict the outcome on the original data set, let us say we got a 99% accuracy. But the
accuracy on a different data set comes around 50 percent. This means the model does not
generalize well from our training data and unseen data.
This is what overfitting looks like. It is a very common problem in Machine Learning and even
data science. Now let us understand the signal and noise.
Signal vs Noise
In predictive modeling, signal refers to the true underlying pattern that helps the model to learn
the data. On the other hand, noise is irrelevant and random data in the data set. To understand
the concept of noise and signal, let us take a real-life example.
Let us suppose we want to model age vs literacy among adults. If we sample a very large part of
the population, we will find a clear relationship. This is the signal, whereas noise interferes with
the signal. If we do the same on a local population, the relationship will become muddy. It
would be affected by outliers and randomness, e.g., one adult went to school early or some adult couldn't afford education, etc.
Talking about noise and signal in terms of Machine Learning, a good Machine Learning
algorithm will automatically separate signals from the noise. If the algorithm is too complex or
inefficient, it may learn the noise too. Hence, overfitting the model.
To address this problem, we can split the initial data set into separate training and test data
sets. With this technique, we can actually approximate how well our model will perform with
the new data.
Let us understand this with an example, imagine we get a 90+ percent accuracy on the training
set and a 50 percent accuracy on the test set. Then, automatically it would be a red flag for the
model.
Another way to detect overfitting is to start with a simplistic model that will serve as a
benchmark.
With this approach, if you try more complex algorithms you will be able to understand whether the additional complexity is even worthwhile for the model or not. This is also known as the Occam's razor test: it basically chooses the simpler model when two models show comparable performance. Although detecting overfitting is a good practice, there are also several techniques to prevent overfitting. Let us take a look at how we can prevent overfitting in Machine Learning.
1. Cross-Validation
2. Training with more data
3. Removing Features
4. Early Stopping
5. Regularization
6. Ensembling
1. Cross-Validation
One of the most powerful features to avoid/prevent overfitting is cross-validation. The idea
behind this is to use the initial training data to generate mini train-test-splits, and then use
these splits to tune your model.
In a standard k-fold cross-validation, the data is partitioned into k subsets, also known as folds. After this, the algorithm is trained iteratively on k−1 folds while using the remaining fold as the test set, also known as the holdout fold.
The cross-validation helps us to tune the hyperparameters with only the original training set. It
basically keeps the test set separately as a true unseen data set for selecting the final model.
Hence, avoiding overfitting altogether.
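A minimal k-fold cross-validation sketch (added; scikit-learn assumed), showing how the training data alone is used to estimate performance across folds:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, validate on the held-out fold, repeat 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```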
2. Training with More Data
This technique might not work every time, as we have also discussed in the example above, where training with a significant amount of the population helps the model. It basically helps the model in identifying the signal better.
But in some cases, the increased data can also mean feeding more noise to the model. When
we are training the model with more data, we have to make sure the data is clean and free from
randomness and inconsistencies.
3. Removing Features
Although some algorithms have automatic feature selection, a significant number of them do not have built-in feature selection. For those, we can manually remove a few irrelevant features from the input features to improve the generalization.
One way to do it is by deriving a conclusion as to how a feature fits into the model. It is quite
similar to debugging the code line-by-line.
If a feature is unable to explain any relevancy in the model, we can simply identify those features and remove them. We can even use a few feature-selection heuristics as a good starting point.
4. Early Stopping
When the model is training, you can actually measure how well the model performs based on
each iteration. We can do this until a point when the iterations improve the model’s
performance. After this, the model overfits the training data as the generalization weakens after
each iteration.
So basically, early stopping means stopping the training process before the model passes the
point where the model begins to overfit the training data. This technique is mostly used in deep
learning.
5. Regularization
It basically means, artificially forcing your model to be simpler by using a broader range of
techniques. It totally depends on the type of learner that we are using. For example, we can
prune a decision tree, use a dropout on a neural network or add a penalty parameter to the
cost function in regression.
Quite often, regularization is a hyperparameter as well. It means it can also be tuned through
cross-validation.
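For the regression case mentioned above, adding a penalty parameter to the cost function corresponds, for example, to ridge regression. A hedged sketch (scikit-learn assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))          # few samples, many features: easy to overfit
y = X[:, 0] + rng.normal(0, 0.5, 30)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)    # alpha is the penalty (regularization) parameter

# The penalty shrinks the coefficients, i.e. it artificially forces a simpler model
print(np.abs(plain.coef_).sum(), np.abs(ridge.coef_).sum())
```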
6. Ensembling
This technique basically combines predictions from different Machine Learning models. Two of the most common methods for ensembling are listed below:
o Bagging
o Boosting
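A brief sketch of combining predictions from different models (an added illustration; scikit-learn assumed). A simple voting ensemble is shown here, one of several possible combination schemes:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Combine predictions from two different model families by majority vote
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
print(cross_val_score(ensemble, X, y, cv=5).mean())
```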
o Ans: OSI stands for Open System Interconnection. It is a reference model that describes how information from a software application in one computer moves through a physical medium to the software application in another computer.
o OSI consists of seven layers, and each layer performs a particular network function.
o The OSI model was developed by the International Organization for Standardization (ISO) in 1984, and it is now considered an architectural model for inter-computer communications.
o The OSI model divides the whole task into seven smaller and manageable tasks. Each layer is assigned a particular task.
o Each layer is self-contained, so that the task assigned to each layer can be performed independently.
There are the seven OSI layers. Each layer has different functions. A list of seven layers are given
below:
1. Physical Layer
2. Data-Link Layer
3. Network Layer
4. Transport Layer
5. Session Layer
6. Presentation Layer
7. Application Layer
Physical layer
o The main functionality of the physical layer is to transmit the individual bits from one
node to another node.
o It is the lowest layer of the OSI model.
o It establishes, maintains and deactivates the physical connection.
o It specifies the mechanical, electrical and procedural network interface specifications.
o Line Configuration: It defines the way how two or more devices can be connected
physically.
o Data Transmission: It defines the transmission mode whether it is simplex, half duplex or
full-duplex mode between the two devices on the network.
o Topology: It defines the way how network devices are arranged.
o Signals: It determines the type of the signal used for transmitting the information.
Data-Link Layer
o Physical Addressing: The Data link layer adds a header to the frame that contains a
destination address. The frame is transmitted to the destination address mentioned in
the header.
o Flow Control: Flow control is the main functionality of the Data-link layer. It is the technique through which a constant data rate is maintained on both sides so that no data gets corrupted. It ensures that a transmitting station, such as a server with higher processing speed, does not overwhelm a receiving station with lower processing speed.
o Error Control: Error control is achieved by adding a calculated value, the CRC (Cyclic Redundancy Check), which is placed in the Data link layer's trailer that is added to the message frame before it is sent to the physical layer. If any error occurs, then the receiver sends an acknowledgment requesting the retransmission of the corrupted frames.
o Access Control: When two or more devices are connected to the same communication
channel, then the data link layer protocols are used to determine which device has
control over the link at a given time.
Network Layer
o It is a layer 3 that manages device addressing, tracks the location of devices on the
network.
o It determines the best path to move data from source to the destination based on the
network conditions, the priority of service, and other factors.
o The network layer is responsible for routing and forwarding the packets.
o Routers are the layer 3 devices, they are specified in this layer and used to provide the
routing services within an internetwork.
o The protocols used to route the network traffic are known as Network layer protocols.
Examples of such protocols are IP and IPv6.
Transport Layer
o The Transport layer is Layer 4. It ensures that messages are transmitted in the order in which they are sent and that there is no duplication of data.
o The main responsibility of the transport layer is to transfer the data completely.
o It receives the data from the upper layer and converts it into smaller units known as segments.
o This layer can be termed an end-to-end layer as it provides a point-to-point connection between source and destination to deliver the data reliably.
The two protocols used in this layer are:
o Transmission Control Protocol (TCP)
o User Datagram Protocol (UDP)
Session Layer
o Dialog control: Session layer acts as a dialog controller that creates a dialog between two
processes or we can say that it allows the communication between two processes which
can be either half-duplex or full-duplex.
o Synchronization: Session layer adds some checkpoints when transmitting the data in a
sequence. If some error occurs in the middle of the transmission of data, then the
transmission will take place again from the checkpoint. This process is known as
Synchronization and recovery.
Presentation Layer
o A Presentation layer is mainly concerned with the syntax and semantics of the information
exchanged between the two systems.
o It acts as a data translator for a network.
o This layer is a part of the operating system that converts the data from one presentation
format to another format.
o The Presentation layer is also known as the syntax layer.
o Translation: The processes in two systems exchange the information in the form of
character strings, numbers and so on. Different computers use different encoding
methods, the presentation layer handles the interoperability between the different
encoding methods. It converts the data from sender-dependent format into a common
format and changes the common format into receiver-dependent format at the receiving
end.
o Encryption: Encryption is needed to maintain privacy. Encryption is a process of
converting the sender-transmitted information into another form and sends the
resulting message over the network.
o Compression: Data compression is a process of compressing the data, i.e., it reduces the
number of bits to be transmitted. Data compression is very important in multimedia
such as text, audio, video.
Application Layer
● An application layer serves as a window for users and application processes to access network services.
● It handles issues such as network transparency, resource allocation, etc.
● An application layer is not an application, but it performs the application layer functions.
● This layer provides the network services to the end-users.
Directory services: An application that provides distributed database sources and is used to provide global information about various objects.
Ans:
• Data mining is how the patterns in large data sets are viewed and discovered using intersecting techniques such as statistics, machine learning, and database systems.
• It involves data extraction from a group of raw and unidentified data sets to provide some meaningful results through mining.
• The extracted data is then further transformed and arranged to best serve business requirements and needs.
List of Data Mining Applications
• Here is the list of various Data Mining Applications, which are given below:
2. Health care domain and insurance domain
• The data mining-related applications can efficiently track and monitor a
patient’s health condition and help in efficient diagnosis based on the past
sickness record.
• Similarly, the insurance industry’s growth depends on the ability to convert the
data into knowledge form or by providing various details about the
customers, markets, and prospective competitors.
• Therefore, all those companies who have applied the data mining techniques efficiently have reaped the benefits.
• This is used over the claims and their analysis, i.e., identifying the medical procedures claimed together.
• It enables the forecasting of new policies, helps detect risky customer behaviour patterns, and helps see fraudulent behaviour.
3. Application in the domain of transportation
• The historic or batch form of data will help identify the mode of transport a particular customer generally opts for when going to a particular place, say his home town, thereby providing him alluring offers and heavy discounts on new products and launched services.
• This will thus be included in targeted and organic advertisements where the prospective customer lead has the right chance of being converted.
• It is also helpful in determining the distribution of the schedules among various warehouses and outlets for analyzing load-based patterns.
4. Education
• In education, the application of data mining has been prevalent, where the emerging field of educational data mining focuses mainly on the ways and methods by which data can be extracted from the age-old processes and systems of educational institutions.
• The goal is often to help a student grow and learn in various facets using advanced scientific knowledge.
• Here, data mining comes majorly into play by ensuring that the right quality of knowledge and decision-making content is provided to the education departments.
3. Explain Scheduling Algorithm of Operating System.
A Process Scheduler schedules different processes to be assigned to the CPU based on
particular scheduling algorithms. There are six popular process scheduling algorithms which
we are going to discuss in this chapter −
Process    Waiting Time
P0         0 - 0 = 0
P1         5 - 1 = 4
P2         8 - 2 = 6
P3         16 - 3 = 13
• Easy to implement in Batch systems where required CPU time is known in advance.
• Impossible to implement in interactive systems where required CPU time is not known.
• The processor should know in advance how much time the process will take.
Given: Table of processes, and their Arrival time, Execution time
Process    Arrival Time    Execution Time    Service Time
P0         0               5                 0
P1         1               3                 5
P2         2               8                 14
P3         3               6                 8
Waiting time of each process is as follows −
Process    Waiting Time
P0         0 - 0 = 0
P1         5 - 1 = 4
P2         14 - 2 = 12
P3         8 - 3 = 5
• Priority can be decided based on memory requirements, time requirements or any other
resource requirement.
Given: Table of processes, and their Arrival time, Execution time, and priority. Here we are considering 1 as the lowest priority.
Process    Arrival Time    Execution Time    Priority    Service Time
P0         0               5                 1           0
P1         1               3                 2           11
P2         2               8                 1           14
P3         3               6                 3           5
Process    Waiting Time
P0         0 - 0 = 0
P1         11 - 1 = 10
P2         14 - 2 = 12
P3         5 - 3 = 2
• Shortest remaining time (SRT) is the preemptive version of the SJN algorithm.
• The processor is allocated to the job closest to completion, but it can be preempted by a newer ready job with a shorter time to completion.
• Impossible to implement in interactive systems where required CPU time is not known.
• It is often used in batch environments where short jobs need to be given preference.
Round Robin Scheduling
• Once a process is executed for a given time period, it is preempted and other process
executes for a given time period.
• Context switching is used to save states of preempted processes.
Process    Waiting Time
P0         (0 - 0) + (12 - 3) = 9
P1         (3 - 1) = 2
P3         (9 - 3) + (17 - 12) = 11
Multiple-level queues are not an independent scheduling algorithm. They make use of other
existing algorithms to group and schedule jobs with common characteristics.
For example, CPU-bound jobs can be scheduled in one queue and all I/O-bound jobs in
another queue. The Process Scheduler then alternately selects jobs from each queue and
assigns them to the CPU based on the algorithm assigned to the queue.
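As an added illustration (not from the original notes), a small Python sketch that computes waiting times for a first-come-first-served order, consistent with the first waiting-time table above (waiting time = start time − arrival time):

```python
# Processes given as (name, arrival_time, execution_time), served in arrival order
processes = [("P0", 0, 5), ("P1", 1, 3), ("P2", 2, 8), ("P3", 3, 6)]

clock = 0
for name, arrival, burst in processes:
    start = max(clock, arrival)          # the CPU may sit idle until the process arrives
    print(name, "waiting time =", start - arrival)   # P0: 0, P1: 4, P2: 6, P3: 13
    clock = start + burst                # the next process starts when this one finishes
```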
2. Design Phase: This phase aims to transform the requirements gathered in the SRS into a
suitable form which permits further coding in a programming language. It defines the overall
software architecture together with high level and detailed design. All this work is documented
as a Software Design Document (SDD).
3. Implementation and unit testing: During this phase, design is implemented. If the SDD is
complete, the implementation or coding phase proceeds smoothly, because all the information
needed by software developers is contained in the SDD.
During testing, the code is thoroughly examined and modified. Small modules are tested in
isolation initially. After that these modules are tested by writing some overhead code to check
the interaction between these modules and the flow of intermediate output.
4. Integration and System Testing: This phase is highly crucial as the quality of the end product
is determined by the effectiveness of the testing carried out. The better output will lead to
satisfied customers, lower maintenance costs, and accurate results. Unit testing determines the
efficiency of individual modules. However, in this phase, the modules are tested for their
interactions with each other and with the system.
5. Operation and maintenance phase: Maintenance is the task performed by every user once
the software has been delivered to the customer, installed, and operational.
● Some circumstances where the use of the Waterfall model is most suited are:
● When the requirements are constant and not changed regularly.
● The project is short.
● The situation is calm.
● Where the tools and technology used are consistent and not changing.
● When resources are well prepared and are available to use.
● Advantages of Waterfall model
● This model is simple to implement also the number of resources that are required for it
is minimal.
● The requirements are simple and explicitly declared; they remain unchanged during the
entire project development.
● The start and end points for each phase are fixed, which makes it easy to track progress.
● The release date for the complete product, as well as its final cost, can be determined before development.
● It gives easy to control and clarity for the customer due to a strict reporting system.
● Disadvantages of Waterfall model
● In this model, the risk factor is higher, so this model is not suitable for more significant and complex projects.
● This model cannot accept changes in requirements during development.
● It becomes tough to go back to a previous phase. For example, if the application has now shifted to the coding phase and there is a change in requirement, it becomes tough to go back and change it.
● Since the testing is done at a later stage, it does not allow identifying the challenges and risks in the earlier phases, so a risk reduction strategy is difficult to prepare.
INCREMENTAL MODEL
2. Design & Development: In this phase of the Incremental model of SDLC, the design of the system functionality and the development method are completed successfully. When the software adds new functionality, the incremental model uses the design and development phase.
3. Testing: In the incremental model, the testing phase checks the performance of each existing
function as well as additional functionality. In the testing phase, the various methods are used
to test the behavior of each task.
4. Implementation: The implementation phase enables the coding phase of the development system. It involves the final coding of the design produced in the design and development phase and tests the functionality from the testing phase. After completion of this phase, the working of the product is enhanced and upgraded up to the final system product.
• Web Analytics is the process of collecting, processing, and analyzing website data.
• With Web analytics, we can truly see how effective our marketing campaigns have been, find problems in our online services and make them better, and create customer profiles to boost the profitability of advertisement and sales efforts.
• Every successful business is based on its ability to understand and utilize the data provided by its customers, competitors, and partners.
Benefits of Web Analytics
Web analytics helps you analyze the online traffic that comes to your website: how many potential customers and users visit and browse through your webpage, where most of the potential customers come from, what they were doing on your webpage, and the amount of time they spent on your webpage. Because of this, and the way the data is presented, you can easily identify which activities produce better results and profits.
The bounce rate of your website basically means the number of times a user has
browsed through your website, or even visited your website without interacting with
your webpage. If you have a high bounce rate, it means that overall, your website has a
weak user experience and the user did not feel as though the content was suited for
their purpose or what they were searching for. And when you have a high bounce rate, it
is really hard for your business to perform well in sales or quality leads and any sort of
conversions as well.
For every business, it is very important that you find the right audience and ensure that your content reaches them in order to capitalize on your efforts. Web analytics helps companies by giving them information on how and where to find the right audience as well as how to create content for and reach the right audience. With the right audience, you can make better and more targeted marketing campaigns that will increase and promote sales, which further increases conversions and improves your website tenfold.
Tracking the success rate of your marketing campaigns will tell you how the campaigns have been perceived by users, whether they have been a success or not, and whether the whole campaign was profitable to you and your website. By tracking with the help of unique links for each campaign, you also ensure that if a campaign performs terribly, you can always cancel it.