Prob and Stats in AI Unit-4
Probability
Probability stands for the chance that something will happen and measures how likely that event is to occur. It's an intuitive concept that we use on a daily basis.
Randomness and uncertainty are everywhere in the world, so a good grasp of probability can prove to be very useful.
In the context of data science, statistical inferences are often used to analyze or
predict trends from data, and these inferences use probability distributions of
data.
Thus, your efficacy at working on data science problems depends on a solid understanding of probability.
Conditional probability is the probability of an event given that another event
has occurred.
Now, if the probability of the event changes when the first event is taken into
account, the two events are dependent. Bayes' theorem describes the probability of an event based on prior
knowledge about the conditions that might be associated with the event. Reverse
probabilities can be found with Bayes' theorem when the
conditional probability is known to us. With the help of this theorem, it's
possible to update the probabilities of these outcomes as new information becomes available.
A random variable is a set of possible values derived from a random experiment: a variable
whose possible values are numerical outcomes of a random phenomenon.
A discrete random variable is one that may take on only a countable number of
distinct values like 2, 3, 4, 5, etc. These variables are usually counts, such as the
number of children in a family or the number of faulty light bulbs in a box of ten.
A continuous random variable, in contrast, can take infinitely many values within a given range.
A probability distribution describes the probabilities of the values that can be taken by a random variable within a given range.
For a continuous random variable, the distribution is described by
the probability density function. And for a discrete random variable, it's
described by the probability mass function. For example, the binomial distribution measures the probability of an
event occurring a given number of times over a given number of trials, given
the probability of the event in each trial. The normal distribution is symmetric
about the mean, demonstrating that the data closer to the mean are more
frequent in occurrence than data far from the mean.
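To make the two distributions concrete, here is a minimal sketch (assuming NumPy/SciPy are installed; the parameter values are made up for illustration) that evaluates the binomial PMF for a discrete variable and the normal PDF for a continuous one:

```python
# Binomial PMF (discrete) vs. normal PDF (continuous); illustrative values only.
from scipy.stats import binom, norm

# Binomial: probability of exactly k successes in n = 10 trials with p = 0.5 each.
n, p = 10, 0.5
for k in [3, 5, 7]:
    print(f"P(X = {k}) = {binom.pmf(k, n, p):.4f}")

# Normal: the density is highest at the mean and symmetric around it.
mu, sigma = 100, 10
for x in [80, 100, 120]:
    print(f"f({x}) = {norm.pdf(x, mu, sigma):.4f}")
```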
2- Statistics
Statistics deals with the collection, analysis, interpretation, and
organization of data, and thus data science professionals need to have a solid
grasp of statistics.
Descriptive statistics together with probability theory can help them in making
better data-driven decisions. Core statistical concepts need to be
learned in order to excel in the field. There are some basic algorithms and
theorems that form the foundation of different libraries that are widely used
in data science. Let's have a look at some common statistical techniques widely
used in the field.
2.1- Classification
It's a data-mining technique that assigns categories to a collection of data to aid in more accurate predictions and analysis. It applies when the output to be predicted is a categorical
variable.
2.3- Resampling Methods
Resampling is a methodology in which repeated
samples are drawn from the actual data samples. Here, the utilization of generic
distribution theory is not required to work out how likely the observed outcomes are to
happen.
Resampling can develop a unique sampling distribution based on the original data by using
experimental rather than analytical methods. Understanding Cross-Validation
and Bootstrapping can help you develop the concept of Resampling Methods.
2.4- Bootstrapping
It's a technique that aids in different situations, like validating the performance of a
predictive model and working with ensemble methods. It works by sampling with replacement from the actual data; the "not chosen"
(out-of-bag) data points are then used as test cases, as the sketch below shows.
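A minimal bootstrapping sketch, assuming NumPy; the data are invented for illustration. It resamples with replacement and keeps the "not chosen" (out-of-bag) points as a test set:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(20)  # hypothetical dataset of 20 observations

# Draw a bootstrap sample: same size as the data, sampled with replacement.
indices = rng.integers(0, len(data), size=len(data))
bootstrap_sample = data[indices]

# The observations that were never chosen form the out-of-bag test set.
out_of_bag = np.setdiff1d(data, bootstrap_sample)

print("bootstrap sample:", bootstrap_sample)
print("out-of-bag test points:", out_of_bag)
```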
2.5- Cross-Validation
It's a technique followed for validating model performance. It's done by
splitting the training data into k parts; the k−1 parts are used as the
training set while the "held out" part is used as the test set. The process is repeated so that each of the k parts is held out once, and the results are averaged.
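Here is a minimal k-fold sketch using scikit-learn's KFold (an assumption; a manual split would work just as well). Each of the k parts is held out exactly once while the remaining k−1 parts serve as the training set:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # hypothetical features
y = np.arange(10)                 # hypothetical targets

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Train a model on X[train_idx], evaluate it on X[test_idx].
    print(f"fold {fold}: train on {train_idx}, test on {test_idx}")
```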
2.6- Tree-Based Methods
These methods can be used for both regression and classification
problems. Here, the predictor space gets segmented into various simple regions.
Tree-based methods can develop multiple trees that are merged to obtain a single consensus prediction.
The Random Forest algorithm, Boosting, and Bagging are the major approaches used
here.
2.7- Bagging
Bagging (bootstrap aggregating) involves creating multiple copies of a single model,
such as a Decision Tree. Every single model is trained on a different bootstrap sample of the data, each of the same
size as the original dataset.
This grouping of Decision Trees essentially helps in decreasing the total error, as
there's a reduction in the overall variance with the addition of every new tree.
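As a sketch of this idea (using scikit-learn, which is an assumption; the toy data are generated, not real), the snippet below bags 50 decision trees, each fit on its own bootstrap sample of the same size as the original data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data

bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # the single base model being bagged
    n_estimators=50,                     # number of bootstrap-trained trees
    random_state=0,
)  # note: `estimator` is called `base_estimator` in scikit-learn < 1.2
bagger.fit(X, y)
print("training accuracy:", bagger.score(X, y))
```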
In data science, a lot of other concepts and techniques of statistics are used as well.
It's also important to note that if you obtain a good grasp of statistics in the
context of data science, working with machine learning models becomes one of the most natural next steps.
Once you've learned the core concepts of statistics, you can try to implement
some machine learning models right from the beginning to develop a good
feel for how these concepts are applied in practice.
Key takeaway
The demand for probability and statistics in data science keeps growing, and its potential is only just getting tapped. Once you've developed a good grasp of probability
theory, you can gradually move forward to learn about statistics,
which will lead you toward interpreting data and helping stakeholders make informed decisions.
Probability
Most events cannot be predicted with total certainty, which is why we use words like "probably" and "unlikely" in our daily conversation. Here we
will talk about how to make quantitative claims about those degrees of uncertainty [1].
Given an event E, the probability P(E) measures the chance that E will occur. A situation where E might happen (success) or might
not happen (failure) is called a trial.
This event can be anything, like tossing a coin, rolling a die, or pulling a colored
ball out of a bag. In these examples the outcome of the event is random, so the
variable that represents the outcome of such an event is called a random variable.
Let us consider a basic example of tossing a coin. If the coin is fair, then it is just
as likely to come up heads as tails. If we repeatedly toss the coin many times, we would expect about half of the
tosses to be heads and half to be tails. In this case, we say that the
probability of heads is 1/2.
The empirical probability of an event equals the number of times the event
occurs divided by the total number of incidents observed. If in n trials we
observe s successes, the empirical probability of success is s/n. In the above example, any particular
sequence of coin tosses may have more or less than exactly 50% heads.
Theoretical probability, on the other hand, is given by the number of ways the
particular event can occur divided by the total number of possible outcomes. So,
a head can occur in one way and the possible outcomes are two (head, tail). The true (theoretical) probability of heads is therefore 1/2.
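A small simulation (assuming NumPy) makes this difference visible: the empirical probability s/n drifts toward the theoretical 1/2 as the number of tosses grows, but rarely equals it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 10_000]:
    tosses = rng.integers(0, 2, size=n)  # 1 = heads, 0 = tails
    # Empirical probability: number of heads (successes) divided by n trials.
    print(f"n = {n:>6}: empirical P(heads) = {tosses.mean():.4f}")
```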
Joint Probability
The joint probability of events A and B, denoted by P(A and B) or P(A ∩ B), is the probability
that events A and B both occur. P(A ∩ B) = P(A) · P(B). This only applies if A
and B are independent, which means that if A occurred, that doesn't change the
probability of B, and vice versa.
Conditional Probability
The conditional probability of A given B is the probability that A occurs given that B has already occurred: P(A|B) = P(A ∩ B) / P(B). Similarly, P(B|A) = P(A ∩ B) / P(A). We can write the joint probability of A
and B as P(A ∩ B) = P(A) · P(B|A), which means: "The chance of both things
happening is the chance that the first one happens, and then the second one
happens given that the first one happened."
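A quick numeric check of these identities (assuming NumPy; the events are chosen for illustration) uses two dice, with A = "first die shows 6" and B = "sum is at least 10". A and B are dependent, so the chain rule holds while the independence shortcut does not:

```python
import numpy as np

rng = np.random.default_rng(1)
d1 = rng.integers(1, 7, size=100_000)  # first die
d2 = rng.integers(1, 7, size=100_000)  # second die

A = d1 == 6            # event A: first die shows 6
B = (d1 + d2) >= 10    # event B: sum is at least 10

p_ab = (A & B).mean()                  # joint probability P(A and B)
p_b_given_a = (A & B).sum() / A.sum()  # conditional probability P(B|A)

print("P(A and B)    =", round(p_ab, 4))
print("P(A) * P(B|A) =", round(A.mean() * p_b_given_a, 4))  # matches exactly
print("P(A) * P(B)   =", round(A.mean() * B.mean(), 4))     # differs: dependent
```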
Bayes’ Theorem
Bayes' theorem is a way of calculating a conditional probability from prior knowledge of related
events. For example, if we want to find the probability of selling ice cream on a
hot and sunny day, Bayes' theorem gives us the tools to use prior knowledge
about the likelihood of selling ice cream on any other type of day (rainy, windy,
snowy, etc.):
P(H|E) = P(E|H) · P(H) / P(E)
where H and E are events, and P(H|E) is the conditional probability that event H occurs
given that event E has already occurred. The probability P(H) in the equation is
basically frequency analysis: given our prior data, what is the probability of the
event occurring? The P(E|H) in the equation is called the likelihood and is
essentially the probability that the evidence is correct, given the information from
the frequency analysis. P(E) is the probability that the actual evidence is true.
Let H represent the event that we sell ice cream and E be the event of the
weather. Then we might ask what is the probability of selling ice cream on any
given day given the type of weather? Mathematically this is written as P(H=ice
cream sale | E= type of weather) which is equivalent to the left-hand side of the
equation. P(H) on the right-hand side is the expression that is known as
the prior because we might already know the marginal probability of the sale of
ice cream. In our example this is P(H = ice cream sale), i.e., the probability of
selling ice cream regardless of the type of weather outside. For example, I could
look at data that said 30 people out of a potential 100 actually bought ice cream
at some shop somewhere. So, my P(H = ice cream sale) = 30/100 = 0.3, prior to
me knowing anything about the weather. This is how Bayes' Theorem allows us to
incorporate prior knowledge into our probability estimates.
As a second example, suppose that during a routine medical examination, your doctor informs you that you have
tested positive for a rare disease. You are also aware that there is some
uncertainty in these test results: the test returns a positive result for 95% of the patients with the disease (its
Sensitivity, also called the true positive rate), and a negative result for 95% of the
healthy patients (its Specificity, also called the true negative rate).
If we let "+" and "−" denote a positive and negative test result, respectively,
then the test accuracies are the conditional probabilities: P(+|disease) = 0.95,
P(−|healthy) = 0.95.
We want the conditional probability of having the disease given a positive
test, P(disease|+). By Bayes' theorem,
P(disease|+) = P(+|disease) · P(disease) / P(+)
where the false positive rate is P(+|healthy) = 1 − 0.95 = 0.05 and P(+) = P(+|disease) · P(disease) + P(+|healthy) · P(healthy).
Importantly, Bayes’ theorem reveals that in order to compute the conditional
probability that you have the disease given the test was positive, you need to
know the "prior" probability you have the disease, P(disease), given no information
at all. That is, you need to know the overall incidence of the disease in the
population to which you belong. Assuming these tests are applied to a population
in which, say, only 0.5% of people actually have the disease, Bayes' theorem gives
P(disease|+) = 0.95 × 0.005 / (0.95 × 0.005 + 0.05 × 0.995) ≈ 0.087.
In other words, despite the apparent reliability of the test, the probability that you
actually have the disease is still less than 9%. Getting a positive result increases
the probability that you have the disease, but it is incorrect to interpret the 95% test
accuracy as the probability that you have the disease.
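The whole calculation fits in a few lines of Python; the 0.5% prevalence is an assumption chosen to be consistent with the "less than 9%" figure above:

```python
sensitivity = 0.95   # P(+ | disease), the true positive rate
specificity = 0.95   # P(- | healthy), the true negative rate
prior = 0.005        # P(disease): assumed overall incidence of 0.5%

# Total probability of a positive test (law of total probability).
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# Bayes' theorem: P(disease | +).
posterior = sensitivity * prior / p_positive
print(f"P(disease | +) = {posterior:.3f}")  # ~0.087, i.e. under 9%
```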
Descriptive Statistics
Descriptive statistics are methods for summarizing and organizing the
information in a data set. We will use the table below to describe some of the fundamental concepts.
[Table: 10 loan applicants, with the variables marital status, mortgage, income, rank, year, and risk]
In the above table, the elements are the 10 applicants. Elements are also called
cases or subjects.
A variable is a characteristic of an element that takes
different values for different elements; marital status, mortgage, income, rank,
year, and risk are the variables here. The qualitative variables
are marital status, mortgage, rank, and risk. Qualitative variables are also
called categorical variables. The quantitative variables
are income and year. Quantitative variables are also called numerical variables.
Discrete Variable: A numerical variable that can take only a finite or
countable number of values is a discrete variable, for which each value can be
graphed as a separate point, with space between the points. 'Year' is an example
of a discrete variable.
Continuous Variable: A numerical variable that can take infinitely many values
within an interval is a continuous variable. 'Income' is an example of a continuous
variable.
Random sample: When we take a sample for which each element has an equal
chance of being selected, we have a random sample.
Mean
The mean is the arithmetic average of a data set. To calculate the mean, add up
the values and divide by the number of values. The sample mean is the
arithmetic average of a sample and is written x̄; for the income data in our table, the sample mean is $32,540.
Median
The median is the middle data value, when there is an odd number of data values
and the data have been sorted into ascending order. If there is an even number,
the median is the mean of the two middle data values. When the income data are
sorted into ascending order, the two middle values are $32,100 and $32,200, so the median income is their mean, $32,150.
Mode
The mode is the data value that occurs with the greatest frequency. Both
quantitative and categorical variables can have modes, but only quantitative
variables can have means or medians. Each income value occurs only once, so
there is no mode for the income data.
Mid-range
The mid-range is the average of the maximum and minimum values in a data set.
Range
The range of a variable equals the difference between the maximum and
minimum values. The range only reflects the difference between the largest and smallest observations; it
tells us nothing about how the rest of the data are spread out.
Variance
Population variance is defined as the average of the squared differences from the
population mean: σ² = Σ(x − μ)² / N. The sample variance s² is computed the same way,
with N replaced by n − 1. This difference occurs because the sample mean is used
in place of the unknown population mean, which costs one degree of freedom.
The standard deviation (sd) of a bunch of numbers tells you how much the
individual numbers tend to differ from the mean.
The sample standard deviation is the square root of the sample variance: s = √s².
Three data distributions with the same mean (100) and different standard
deviations (5, 10, 20).
The smaller the standard deviation, the narrower the peak and the closer the data points are
to the mean. The further the data points are from the mean, the greater the
standard deviation.
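The snippet below computes these summary statistics, assuming NumPy. The ten income values are hypothetical, chosen only so the mean ($32,540) and median ($32,150) match the figures quoted in this post; they are not the actual applicant table:

```python
import numpy as np

income = np.array([24_000, 25_000, 28_000, 30_000, 32_100,
                   32_200, 34_000, 36_000, 38_000, 46_100])

print("mean      :", income.mean())                     # 32540.0
print("median    :", np.median(income))                 # 32150.0
print("mid-range :", (income.max() + income.min()) / 2)
print("range     :", income.max() - income.min())
print("variance  :", income.var(ddof=1))  # sample variance (divides by n-1)
print("std dev   :", income.std(ddof=1))  # square root of sample variance
# Every value occurs only once, so this sample has no mode.
```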
Measures of Position: Percentile, Z-score, Quartiles
These measures indicate the relative position of a particular data value within the data distribution.
Percentile
The pth percentile of a data set is the data value such that p percent of the
values in the data set are at or below this value. The 50th percentile is the
median. For example, the median income is $32,150, and 50% of the data values
are at or below this value.
Percentile rank
The percentile rank of a data value equals the percentage of values in the data
set that are at or below that value. For example, the percentile rank of Applicant
1's income of $38,000 is 90%, since that is the percentage of incomes equal to or
below $38,000.
The first quartile (Q1) is the 25th percentile of a data set; the second quartile (Q2)
is the 50th percentile (median); and the third quartile (Q3) is the 75th percentile.
The interquartile range (IQR) measures the difference between the 75th and 25th percentiles: IQR = Q3 − Q1.
Z-score
The Z-score for a particular data value represents how many standard deviations
the value lies above or below the mean. For an income of $24,000,
the Z-score is (24,000 − 32,540) / 7,201 ≈ −1.2, which means this income lies
about 1.2 standard deviations below the mean.
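Continuing with the same hypothetical income sample (assuming NumPy), the measures of position look like this; note the Z-score line plugs in the mean ($32,540) and standard deviation ($7,201) quoted in the text for the original applicant table:

```python
import numpy as np

income = np.array([24_000, 25_000, 28_000, 30_000, 32_100,
                   32_200, 34_000, 36_000, 38_000, 46_100])

q1, q2, q3 = np.percentile(income, [25, 50, 75])  # quartiles
print("Q1, median, Q3:", q1, q2, q3)
print("IQR           :", q3 - q1)

# Percentile rank of $38,000: share of values at or below it (90%).
print("percentile rank of 38,000:", (income <= 38_000).mean() * 100, "%")

# Z-score of the $24,000 income, using the text's mean and sd.
print("z-score of 24,000:", (24_000 - 32_540) / 7_201)  # about -1.2
```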
Different ways you can describe patterns found in uni-variate data include measures of central
tendency (mean, median, and mode) and measures of dispersion (range, variance, minimum, maximum,
and standard deviation).
Pie chart [left] & Bar chart [right] of Marital status from loan applicants table.
The plots typically used to visualize uni-variate data are bar charts, histograms, pie charts, and box plots.
Bi-variate analysis, by contrast, examines two variables together,
determining the empirical relationship between them. The plots used to
visualize bi-variate data include scatter plots and box plots.
Scatter Plots
A scatter plot displays the relationship between two quantitative variables on a single
graph. Each (x, y) point is graphed on a Cartesian plane, with the x axis on the
horizontal and the y axis on the vertical. Scatter plots are sometimes called
correlation plots because they show how two variables are correlated.
Correlation
The correlation coefficient r quantifies the strength
and direction of the linear relationship between two quantitative variables. For a sample, it is given by
r = Σ(x − x̄)(y − ȳ) / ((n − 1) · sx · sy)
where sx and sy represent the standard deviations of the x-variable and the y-
variable, respectively. It always satisfies −1 ≤ r ≤ 1.
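As a sketch (assuming NumPy; the x and y values are made up), r can be computed straight from this formula and checked against NumPy's built-in:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly linear in x

n = len(x)
# r = sum((x - x_bar)(y - y_bar)) / ((n - 1) * sx * sy)
r = np.sum((x - x.mean()) * (y - y.mean())) / (
    (n - 1) * x.std(ddof=1) * y.std(ddof=1)
)
print("r from the formula:", r)
print("r from NumPy      :", np.corrcoef(x, y)[0, 1])  # should agree
```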
Box Plots
A box plot is also called a box and whisker plot and it’s used to picture the
distribution of values. When one variable is categorical and the other continuous,
a box-plot is commonly used. When you use a box plot you divide the data values
into four parts called quartiles. You start by finding the median or middle value.
The median splits the data values into halves, and finding the median of each half
splits the values into four parts, the quartiles. The box extends from the median of the lower half
of the values (Q1) at the bottom of the box to the median of the upper half of the
values (Q3) at the top of the box. A line in the middle of the box occurs at
the median of all the data values. The whiskers then point to the largest and
smallest values that are not outliers.
Box plots are especially useful for indicating whether a distribution is skewed and
whether there are potential unusual observations (outliers) in the data set.
The left whisker extends down to the minimum value that is not an outlier, and the
right whisker extends up to the maximum value that is not an outlier. When the
left whisker is longer than the right whisker, the distribution is left-skewed,
and vice versa. When the whiskers are about equal in length, the distribution is
symmetric.
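A minimal box-plot sketch using Matplotlib (an assumption; the data are randomly generated) shows the box from Q1 to Q3, the median line, the whiskers, and outliers plotted as individual points:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(100, 10, size=200)  # roughly symmetric sample
data = np.append(data, [150, 155])    # add a couple of outliers

# Box = Q1..Q3, middle line = median, points beyond the whiskers = outliers.
plt.boxplot(data)
plt.title("Box plot of a roughly symmetric sample with two outliers")
plt.show()
```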
Basic concepts of probability and statistics are a must-have for anyone interested
in machine learning. In this post I briefly covered some of the essential concepts that are
most used in machine learning. I hope you enjoyed it and learned something new.