Data Science Book
Introduction to Probability

Syllabus : Introduction to probability theory, fundamental concepts in probability, axioms of probability, application of simple probability rules, association rule learning, Bayes' theorem, random variables, Probability Mass Function (PMF) and Cumulative Distribution Function (CDF) of a distribution, binomial distribution, normal distribution, Poisson distribution, geometric distribution, uniform distribution, exponential distribution, chi-square test.

Introduction to Probability Theory

• Probability theory is concerned with the study of random phenomena. Such phenomena are characterized by the fact that their future behaviour is not predictable in a deterministic fashion. The role of probability theory is to model the behaviour of a system or algorithm under a given probability assignment and given distributions.
• Probability was developed to analyze games of chance. It is a modeling of the phenomenon of chance or randomness. The measure of chance of a statement is called the probability of the statement.
• The probability of an event is defined as the number of favorable outcomes divided by the total number of possible outcomes.

Classical Definition of Probability

1. Computing Probability Using the Classical Method

• If an experiment has n equally likely simple events and if the number of ways that an event E can occur is m, then the probability of E, P(E), is

  P(E) = (number of ways that E can occur) / (number of possible outcomes) = m/n

  Equivalently, if S is the sample space of this experiment, then P(E) = N(E)/N(S).

• If E' denotes the event of non-occurrence of E, then the number of outcomes in E' is n - m, and hence the probability of E' is

  P(E') = (n - m)/n = 1 - P(E)  =>  P(E) + P(E') = 1

• Here m is a non-negative integer, n is a positive integer and m <= n. This is called mathematical or a priori probability.

2. Random Experiment

• In random experiments, we are not able to control the values of all variables, so the results will vary from one performance of the experiment to the next, even though most of the conditions are the same.
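The classical rule P(E) = m/n can be sketched directly in code by counting outcomes. This is an illustrative sketch only; the function name and the die example are our own, not from the text:

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(E) = (outcomes where E occurs) / (total equally likely outcomes)."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

die = range(1, 7)                                          # S = {1, ..., 6}
p_even = classical_probability(lambda x: x % 2 == 0, die)  # m = 3, n = 6
p_not_even = classical_probability(lambda x: x % 2 != 0, die)
```

Here p_even is 1/2 and p_even + p_not_even = 1, matching the complement rule P(E) + P(E') = 1.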
• A random experiment is defined as an experiment whose possible outcomes are known before the experiment is performed, but for which the outcome of any particular performance cannot be predicted.

Example : If we toss a die, the result of the experiment is that it will come up with one of the numbers in the set {1, 2, 3, 4, 5, 6}.

• A random variable is simply an expression whose value is the outcome of a particular experiment.

Sample Space

• The totality of the possible outcomes of a random experiment is called the sample space of the experiment and it will be denoted by the letter 'S'.
• There may be more than one sample space that can describe the outcomes of an experiment, but there is usually only one that will provide the most information.
• The sample space is not determined completely by the experiment. It is partially determined by the purpose for which the experiment is carried out.

Example 1 : If the experiment consists of flipping two coins, then the sample space consists of the following points :

  S = {(T, T), (T, H), (H, T), (H, H)}

The outcome will be (T, T) if both coins are tails, (T, H) if the first coin is tails and the second heads, (H, T) if the first is heads and the second tails, and (H, H) if both coins are heads.

Example 2 : A die is rolled once. We let X denote the outcome of this experiment. Then the sample space for this experiment is the 6-element set S = {1, 2, 3, 4, 5, 6}.

• It is convenient to classify sample spaces according to the number of elements they contain. If a sample space has a finite number of points, it is called a finite sample space. If it has as many points as there are natural numbers 1, 2, 3, ..., it is called a countably infinite sample space. If it has as many points as there are in some interval on the x axis, such as 0 <= x <= 1, it is called a non-countably infinite sample space.
• A sample space that is finite or countably infinite is often called a discrete sample space, while one that is non-countably infinite is called a non-discrete sample space.
• The result of a trial in a random experiment is called an outcome.

Event

• An event is simply a collection of certain sample points.
• An event is a subset A of the sample space S, i.e., it is a set of possible outcomes. If the outcome of an experiment is an element of A, we say that the event A has occurred. An event consisting of a single point of S is called a simple or elementary event.
• A simple event is any single outcome from a probability experiment. Each simple event is denoted ei.
• A single performance of the experiment is known as a trial.
• As particular events, we have S itself, which is the sure or certain event since an element of S must occur, and the empty set ∅, which is called the impossible event because an element of ∅ cannot occur.
• An unusual event is an event that has a low probability of occurring.
• Independent events : Let E1 and E2 be two events. Then E1 and E2 are said to be independent events if P(E1 ∩ E2) = P(E1) P(E2).
• Mutually exclusive events : E1, E2, ..., En are said to be mutually exclusive if Ei ∩ Ej = ∅ for i ≠ j. If E1 and E2 are independent and mutually exclusive, then either P(E1) = 0 or P(E2) = 0. If E1 and E2 are independent events, then E1' and E2' are also independent.
• Mutually exclusive events are sometimes called disjoint events. If two events are mutually exclusive, then it is not possible for both events to occur on the same trial; they do not share any outcomes.

Example : 1. In throwing a die, all 6 possible cases are mutually exclusive. 2. In tossing a coin, the event head turning up and the event tail turning up are mutually exclusive.

• Two events of a sample space whose intersection is ∅ and whose union is the entire sample space are called complementary events. If E is an event of sample space S, its complement is denoted by E'.
• Equally likely events : Two or more events are said to be equally likely if the chances of their happening in a trial are equal. For example, in drawing from a pack of cards, any card may be obtained.
• In such a trial, all outcomes are equally likely.
• Dependent events : If the happening of an event in the first trial influences the happening of an event in a successive trial, then the events are said to be dependent.
• Exhaustive events : The total number of all possible outcomes of a random experiment is called the exhaustive events. For example, in tossing a coin there are two exhaustive elementary events, i.e. head and tail.

Algebra of Events

• Fig. 3.1.1 shows the relations between two sets (union, intersection and complement).

De Morgan's laws :

• The useful relationships between the three basic operations of forming unions, intersections and complements are known as De Morgan's laws.
• The complement of a union (intersection) of two sets A and B equals the intersection (union) of the complements of A and B. Thus

  (A ∪ B)' = A' ∩ B'
  (A ∩ B)' = A' ∪ B'

Example : Prove that P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

Solution : Let A and B be any two events. Write A ∪ B as the union of the mutually exclusive events A ∩ B', A ∩ B and A' ∩ B :

  A ∪ B = (A ∩ B') ∪ (A ∩ B) ∪ (A' ∩ B)   ... (1)

By axiom 3 :

  P(A ∪ B) = P(A ∩ B') + P(A ∩ B) + P(A' ∩ B)   ... (2)

Similarly, A = (A ∩ B') ∪ (A ∩ B) and B = (A' ∩ B) ∪ (A ∩ B), so

  P(A) = P(A ∩ B') + P(A ∩ B)   ... (3)
  P(B) = P(A' ∩ B) + P(A ∩ B)   ... (4)

Adding (3) and (4) :

  P(A) + P(B) = P(A ∩ B') + P(A' ∩ B) + 2 P(A ∩ B)   ... (5)

From equations (2) and (5) :

  P(A) + P(B) = P(A ∪ B) + P(A ∩ B)

  P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
For each event E of the sample apace 1, we 4 n(Z) to be the number of times in the first n repetitions # the experiment that th event E occurs. Then P(E), the probability of the event E is denoted by, . n(E) Pm = ts The theory of probability starts with the assumption that probabilities car assigned so as to satisfy the following three basic axioms of probability. Suppose we have a sample space S. If $ is discrete, all subsets correspon events and conversely; if $ is non-discrete, only special subsets co events. To each event A in the class C of events, we associate a real number P(A), is called a probability function, and P(A) the probability of the event, if the axioms are satisfied. om 1: For every event A in class C, P(A) 2 0. : For the certain event $ in the class C, P(S) = 1 om 3 : For any number of mutually exclusive events A, A>,A3,... in P(A U Ag U Ag Uns) = P(Aq)+ PlAg)+ P(Ag)” arly for two mutually exclusive events A, and A >, P(A; UA) = P(A1)+P(Ad) st axiom says that the outcome of an experiment is always in th ie second axiom says that the long-run frequency of any event is é and 100 %, The third axiom says that the probability is either too Probability probability is a probability that likelihood ‘ meast will happen concurrently. ae ie are two independent events A and B, i Gviaaldciving tia waa , the probability that A and abilities, ' rule of multiplication shown. ae a fore a P(A and B) = P(A) P(B). 3-11 introduction to Probability rule of multiplication is used to find the joint probability that two ‘ccur. Symbolically, the general rule of multiplication is, P(A and B) = P(A) P(B|A). ility P(A™B) is called the joint probability for two events A and B in the sample space. Venn diagram will readily shows that P(A B) = P(A) + P(B) - P (AUB) P(A B) = P(A)+ P(B)- P(AM B)< P(A) + P(B) ity of the union of two events never exceeds the sum of the event gram is very useful for portraying conditional and joint probabilities. 
• A tree diagram portrays outcomes that are mutually exclusive.

Application of Simple Probability Rules

• Market basket analysis is solved by using association rule learning with the help of joint probability and conditional probability.
• Market basket analysis is just one form of frequent pattern mining; there are many kinds of frequent patterns, and association rules in frequent pattern mining can be classified in various ways.
• Market basket analysis is an example of frequent itemset mining. The purpose of market basket analysis is to determine what products customers purchase together. It takes its name from the idea of customers throwing all their purchases into a shopping cart (a "market basket") during grocery shopping.
• Market basket analysis is a technique which identifies the strength of association between pairs of products purchased together and identifies patterns of co-occurrence. A co-occurrence is when two or more things take place together.
• Market basket analysis creates if-then scenario rules; for example, if item A is purchased then item B is likely to be purchased. The rules are probabilistic in nature or, in other words, they are derived from the frequencies of co-occurrence in the observations.
• Support is the proportion of baskets that contain the items of interest. The rules can be used in pricing strategies, product placement, and various types of cross-selling strategies.
• Market basket analysis takes data at transaction level, which lists all items bought by a customer in a single purchase.
• The technique determines relationships of what products were purchased with which other product(s). These relationships are then used to build profiles containing if-then rules of the items purchased.

Association Rule Learning

• Association rule learning is also called association rule mining.
• In a retail context, association rule learning is a method for discovering relationships that exist in frequently purchased items. Association rules express a relationship of the form X → Y (that is, X implies Y).
• Association rule mining can be viewed as a two-step process :
1. Find all frequent item sets : By definition, each of these item sets will occur at least as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent item sets : By definition, these rules must satisfy minimum support and minimum confidence.
• The association rule X → Y means that transactions containing items from set X tend to contain items from set Y.
• Association rules show attribute value conditions that occur frequently together in a given data set. A typical example of association rule mining is market basket analysis.
• Data is collected using bar-code scanners in supermarkets. Such market-basket databases consist of a large number of transaction records.
• Each record lists all items bought by a customer on a single purchase transaction. Managers would be interested to know if certain groups of items are consistently purchased together.
• They could use this data for adjusting store layouts, for cross-selling, for promotions, for catalog design and to identify customer segments based on buying patterns.
• Association rules provide information of this form as if-then statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature.
• In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has two numbers that express the degree of uncertainty about the rule.
• In association analysis, the antecedent and consequent are sets of items (called itemsets) that are disjoint (do not have any items in common).
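The two rule metrics, support and confidence, reduce to simple counting over transactions. A minimal sketch; the transaction data and item names are invented for illustration:

```python
def support_count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Toy transaction data (illustrative only).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread"},
]
antecedent, consequent = {"milk"}, {"bread"}
n = len(transactions)

support = support_count(antecedent | consequent, transactions) / n   # 3/5
confidence = (support_count(antecedent | consequent, transactions)
              / support_count(antecedent, transactions))             # 3/4
```

Support here is the fraction of all baskets containing both items; confidence is the conditional frequency P(bread | milk) estimated from the same counts.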
(AUB) Count Support = PP 2 b) Confidence : The rule hold in T with confidence % of transaction that contain A also contain B. Conf = P,(B/A) Confidence = (AB) Count Bayes’ Theorem Bayes’ theorem is a method to revise the probability of an event given additional information. Bayes's theorem calculates a conditional probability called a posterior or revised probability. Bayes’ theorem is a result in probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring, given that B occurs, The two conditional probabilities P(A|B) and P(B|A) are in general different. Bayes theorem gives a relation between P(A|B) and P(B|A). An important theorem is that it gives a rule how to update or revise the application of Bayes’ 4 beliefs in light of new evidence a posteriori. strengths of evidence-base A prior probability is an initial probability value originally obtained before any additional information is obtained. A posterior probability is 4 probability value that has been revised by using additional information that is later obtained. Suppose that B,,B2,B3 + B, partition the outcomes of an experiment and that A is another event. For any number, k with 1 sk $n we have the formula : S P(A/B))-P@i) P(BY/A) = PUBLICATIONS® - An up thrust for knowledge eee 3-14 A mechanical factory production line is manufacturing ats, AB. Teh! at, mcine A esos, machine A is defective, 4S fore mctind8 and 2.% from machine C. ‘at random from the production line and found to be defective, What is | it came from : i. machine A it, machine B iti, machine C ? Solution : Let = {bolt is defective}, * = {bolt is from machine A}, = {bolt is from machine B}, = {bolt is from machine C}. aOwrsgd Given data: P(A) = 0.25, P(B) = 0.35, P(C) = 0.4. P(D|A) = 0.05, P(D|B) = 0.04, P(D|C) = 0.02. 
‘rom the Bayes’ Theorem : P(D/A)x P(A) P(D/A)x P(A) + P(D/B)x P(B)+ P(D/O)x PO) 0.05 0.25 0.05% 0.25 + 0.04% 0.35+ 0.02x 04 0.0125 0.0125+ 0.014+ 0.008 0.3621 P(A/D) Do) P(/B)x PB) ‘P(D/A)x P(A) + P(D/B)x P(B) + PO/OXPO he 04x 0.35 TRIB OORT UOC P(D/C)x P(C) P(D/A)x P(A) + P(D/B)x P(B)+ PO/OXPO) 4 x 0.4 OUBKOBT OOS OSS TOORCOT TECHNICAL PUBLICATIONS® ~ An up thrust for knowledge ae Introduction to Probability ig 0.008 0.008 0.0125 + 0.014+ 0.008 ~ 0.0345 P(C/D) = 0.2318 At a certain university, 4 % of men are over 6 feet tall and 1 % of toome over 6 feet tall. The total student population is divided in the ratio 3 : 2 in favour of . If a student is selected at random from among all those over six feet tall, what is probability that the student is a woman ? : Let us assume following : M = {Student is Male}, F = {Student is Female}, T = {Student is over 6 feet tall). data : P(M) = 2/5, PF) = 3/5, P(T|M) = 4/100 P(T|F) = 1/100. fe require to find P(F|T)? Ising Bayes’ Theorem we have : P(I/F) PF) P(F/T) = POV P+ POM) POM) PF/T) If the sum of 9 has appeeed, find the proababiity A pair of dice is rolled. ‘one of the dice shows 3. Let A = The event that the sum is 9 E the event the one of dice shows 3. ustive cases = 67 = 36 puBLIGATIONS® - An up trust or krowedge ad ior 3-16 Introduction to Probabjgy — Favorable cases of the event A = (3, 6), (6, 3), (4, 5), (5, 4). So P(A) 4.36 1 9 0 P(A) Favorable case for the event AM B = (3, 6), (6, 3) 2 1 Hence P(AM B) = % But P(AMB) = P(A)x P(B/A) P(ANB) P(B/A) P(A) yig_ 1.9 SE ome a P(B/A) 1/2 PEDERED 4¢ « cerizin university, 4 % of men are over 6 feet tall and over 6 feet tall. 
Random Variables

• A random variable is a set of possible values from a random experiment.
• A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables : discrete and continuous.
• Whenever you run an experiment (flip a coin, roll a die, pick a card), you assign a number to represent the value of the outcome that you get. This assignment is called a random variable.
• A random variable is a function X that assigns a real number x to each and every outcome of a random experiment. If S is the sample space containing all the n outcomes {e1, e2, e3, ..., en} of a random experiment and X is a random variable defined as a function X(e) on S, then for every outcome ei (where i = 1, 2, 3, ..., n) in S the random variable X(ei) will assign a real value xi.
• An advantage of random variables is that the user can define probability functions that make it both convenient and easy to compute the probabilities of various events.
• A random variable is a numerically valued variable which takes on different values with given probabilities.

Examples :
1. Return on an investment in a one-year period.
2. Price of an equity.
3. Number of customers entering a store.
4. Sales volume of a store on a particular day.
5. Turnover rate at your organization next year.

Discrete Random Variables

• If we can find a way to list all possible outcomes for a random variable and assign probabilities to each one, we have a discrete random variable.
• The random variable is called a discrete random variable if it is defined over a sample space having a finite or a countably infinite number of sample points. In this case, the random variable takes on discrete values and it is possible to enumerate all the values it may assume.
• A discrete random variable can only take a specific (countable) set of numerical values.
• We can have infinite discrete random variables if we think about things we know have only an estimated number. Think about the number of stars in the universe : there is not a specific number that we have a way to count, so this is an example of an infinite discrete random variable.
• The mean of any discrete random variable is an average of the possible outcomes, with each outcome weighted by its probability.
• Examples :
1. Total of a roll of two dice : 2, 3, ..., 12.
2. Number of desktops sold : 0, 1, ...
3. Customer count : 0, 1, ...

Continuous Random Variable

• In the case of a sample space having an uncountably infinite number of sample points, the associated random variable is called a continuous random variable, with its values distributed over one or more continuous intervals on the real line.
• A continuous random variable is one which takes an infinite number of possible values. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, and the time required to run a mile.
• A continuous random variable is one having a continuous range of values. It cannot be produced from a discrete sample space because of our requirement that all random variables be single-valued functions of all sample space points.
• A continuous random variable is not defined at specific values. Instead, it is defined over an interval of values and is represented by the area under a curve. The probability of observing any single value is equal to 0, since the number of values which may be assumed by the random variable is infinite.
• Both types of random variables are important in science and engineering. A mixed random variable is one for which some of its values are discrete and some are continuous.

Probability Mass Function and Cumulative Distribution Function of a Random Variable

• For a discrete random variable, the probability that a random variable X takes a specific value xi, P(X = xi), is called the probability mass function P(x).
• A probability mass function is a function that maps each outcome of a random experiment to a probability.
th types of random variables are important in science and variable is one for which some of its values are 3-19 Introduction to Probability Mass Function and Cumulative Distribution Function of a Random Variable discrete random variable, the probability that a random variable X taking a ¢ value x;, P(X = x,), is called the probability mass function P(x). ty mass function is a function that maps each outcome of a random to a probability be a discrete random variable with range Ry = (xj, Xz, X3--) (finite or ly infinite). The function Py (x,) = P(X = x,), for k = 1, 2, 3... is called the lity Mass Function (PMF) of X the PMF is a probability measure that gives us probabilities of the possible for a random variable. vior of a random variable is characterized by its probability distribution, is, by the way probabilities are distributed over the values it assumes. A ity mass function are two ways to characterize this distribution for a random variable are equivalent in the sense that the knowledge of either one completely the random variable. The corresponding functions for a continuous variable are the probability distribution function, defined in the same way case of discrete random variable and the probability density function random variable, then the function F(x) is defined by x) = PX xjF (x; for discrete distribution j f xf (x) dx variance 6? (Sigma square) by + D jw? £O) for continuous distribution u Bw for discrete distribution re TECHNICAL PUBLICATIONS® - An up thrust for knowledge o? = J Ow? & for continuous d . ‘The mean (ui) is also denoted by E (X) and is called the expectation of X gives the average value of X to be expected in many trials. © Let us compute the variance of a normal distribution. If X has an J distribution, then Var (X) = E((x-E[Xp*] Here we substituted Z = (x-)/o. Using integration, _ 75+ 80+ 82+ 87+96 420 OS Wea -w)? + 62 -n)? +. + (&q -W)?] ——<$$—<—$ —___ ae n [75~ 84)? + (60- 84)? + (62-8424 (87-84)? 5 9)? +4)? +2)? +6)? 4 aay? 
pe + (96- 81+ 16+ 4494145 Sl+ 1644494144 _ = = sane Introduction to Probability tion (3) 0 = Vo? = 88 = 7.1274 HEX is @ normal variate with mean 30 and standard deviation 5. Find the b) 245, data : deviation o = 5. M1 = 26 and x, = 40 4 =. 2 45 TECHNICAL PUBLICATIONS® « An up thust for owwdye )K it) Evaluate P(X <6), P(X26), P(O 5 find the minimum value of K. : iv) Determine the distribution function of X. v) Mean vi) Variance. s Solution : i) K K+2K+2K+3K+ K? + 2K? +7K2+K 10K? +9K-1 = 0 (K+1) (0K-1) = 0 K+1=0 and 10K-1=0 K=-1 ad Kei 10 We discard K = — 1 value. Therefore K = To= 0.1. ii) «=P (X <6) P(X=0)+P (X= 1)+P (X=2)+...4 P(X = 5) * 0+K+2K+ 2K+ 3K+ K? 04 s 040.142 (0.1)+2(0.1)+3 (0.1)+ (0.1)? 01 +.0.2 +0.2 +03 + 0.01 P(X <6) = 081 P(X26) = 1-P(XK <6) 1-081 P(x26) = 0.19 W >, minimum value of K. 2 = P(X=0)+P (X=) = 0+K =K = 01 = P(K=0)+P(XK=1)+P(X=2) = 0+K+2K = 3K = 3x01 = 03 PHO) +PK=N+PX=H+PX=3) = 0+K+K+2K = 5K = 5x01 = 05 <4) = 08 (We already calculated) condition is PSK «is suitable fr this minimum value of K = 4. function of X. PUBLICATIONS® - An up thrust for knowledge 3-27 Introduction to Probability ial Distribution means ‘two numbers’. mes of health research are often measured by whether they have occured example, recovered from disease, admitted to hospital, died ete. ial distribution occurs in games of chance, quality inspection, opinion icine and so on. be modelled by assuming that the number of events ‘n’ has a binomial m with a fixed probability of event p. Binomial distribution is ion for a series of Bernoulli trials. distribution written as B (n, p) where n is the total number of events and ability of an event. of binomial distribution : iment consist of n identical trials. ‘trial has only two outcomes. probability of one outcome is p and the other is q = 1 - p. trials are independent. are interested in x, the number of success observed during the n trials. tisfying the above properties are called Bernoulli trials. 
bility function X, — (™)oxqn-* -(T}p otherwise. The distribution of X with probability function is called the distribution or Bernoulli distribution. 1 (mu) of the binomial distribution is and variance of binomial distribution with parameters (n, p) are given E00 SEX) i=l = np TECHNICAL PUBLICATIONS® - An up thrust for knowiedge — ~ Ss Introduction to Probability s CypXqh—*x2 2 = 1x29Cp p2q?-243x2p3qn-34 +E CyxpXqh yu? +" Cyn (n- 1) np +8 = n(a-)p? > n-2C, 2p%-2q"-* ¢np—n2p? = n(n-1)p? (p+ g)"-2 +np—n2p? = n(n-1)p? +np-n2p? = n?p? ~ np? + np-n2p? = np-np? = np (1-p) o = npq standard deviation (a) of the binomial distribution is /npq. x 0 fet 2 pig “Poke 00 dos 016s 03290329 0.132 the mean value of distribution. xP (X= 0)+ xP (X= 1)+ xP (X= 2)+ xP (X= 3)+ xP (X= 4) + xP(X= 5) = Ox (0.004) + 1x (0.0041) + 2x (0.165) + 3x (0.329) + 4x (0.329) + 5x (0.132) 0+ 0.0041 + 0.33+ 0.987+ 1316+ 0.66 = " = 3.2971 Find the probability of getting 3 and 6 heads inclusive im 10 tosses of a distribution approximation to the binomal distribution. TECHNICAL PUBLICATIONS® - An up thrust for knowledge introduction to Probability = 0.1171 + 0.2050 + 0.246094 + o0sa36 = 0575195 distribution : ider data is continuous b= mpe ix! Normal Distribution lormal distribution was discovered 1733 by de Moivre as an approximation to the ‘omial distribution when the number of trails large. It is derived in 1809 by uss. lormal distribution is also called Gauss distribution. e normal distribution describes a special class of such distributions that are ‘< and can be discribed by the distribution mean p and the standard viation o (or variance 0”). Central Limit Theorem, which states that the sum of a large ber of independent random variables (binomial, Poisson, etc) will roximate a normal distribution. For example : Human height is determined by Iarge mumbo forsonn bal gesate ancl envizormnentsl which ots SoSetemi i ‘effects. Thus, it follows a normal distribution. 
ccmtirana tania semetese se eicl wo be mommally dietslouted 27% Teen ia ‘ance 6? if its probability density function is, xe ee ov2n f0) is ‘not the same as P(x)- TECHGAL PUBLICATIONS! - An up tmet f mowtedg® 3-33 Introduction to Probability ition is symmetric about its mean. of distribution is determined by standard deviation : Large value of SD ® the height and increase the spread of the curve, small value of SD increase t and reduce the spread of the curve. t all of the distribution will lie within 3 deviations of the mean. total area under the curve is 1 curve extends indefinitely in both directions, approaching, but never touching, rizontal axis as it does so. about normally distributed variables and normal-curve areas : we know the mean and the standard deviation of a normally distributed le, we know its distribution and associated normal curve. The mean and lard deviation are normal distribution’s sufficient statistics, they completely the variable's distribution. probability a normally distributed variable assumes a value between a and b is to the area under the curve between a and b. lation of the probability that a normal random variable lies within some al. Theoretically, we need to calculate the area under the curve between the end points of the interval. Integration for each and every different normal . Due to the complication of calculation and frequency in which it is done, a dardized way has been derived. dard normal distribution : The normal distribution with mean = 0 and lard deviation = 1. dard normal random varible : The normal random variable with the standard distribution is called standard normal random variable. approximation of the binomial probability distribution : Recall the discrete distribution P(x) = CR p*q?* nt xin i tances the continuous normal distribution is a good ae eee binomial distribution. 
When the probability (p) of distribution is near zero or 1, or n (times of trials) is small, the binomial ‘bution will be nonsymmetrical and the normal will not give an good chs TECHNICAL PUBLICATIONS® - An up thrust for knowledge Data Science Poisson Distribution « Poisson distribution, named after its invertor simeon pols who was: mathematician. He found that if we have a rare event (ie. p is y know the expected or mean (or 1) number of occurances, the prob 2 ... events are given by : evkp® er Poisson distribution : Is a distribution the number of rare events that o of time, distance, space and so on. Examples : 1. Number of insurance claims in a unit of time. 2. Number of accidents in a ten-mile highway. 3, Number of airplane crash in triangle area. s When there is a large number of trials, but a small probability binomial calculate becomes impractical. Example : Number of | horse kicks in the army in different years. The mean number of n trials is p= np. If we substitute p1/n for p, and let n tend to infinity, the binomial becomes the Poisson distribution : : e Mus x! PO) = = Poisson distribution is applied where random events in expected to occur. Deviation from poisson distribution may degree of non-randomness in the events under study. ‘ = Example : 64 deaths in 20 years from thousands of soldiers. * If a mean or average probability of an event happening page/per mile cycled etc., is given and you ar asked to calculate a events happening in a given time/number of pages/number of mil the Poisson distribution is used. ’ + If on the other hand, an exact probability of an event happenng is implied, in the question, and you are asked to calculate the probal then the Binomial distribution must be ae me Introduction to Probability ; Given data: n = 100, p= P= 500 1 n= np =1000x=.. = Q 599 = 2 probability el x! 
= 0.18 cf Poisson's distribution : number of trials 'n’ are large and probability of success 'p' is very small ial probabilities are approximated by Poisson's distribution. np remains constant when n—>< and p— 0 tributed random variable in 'n' trials is given as, bability of Poisson's dist py = CO yx =m) = SP isson's distribution is applicable when the events do not occur as the outcomes: ‘a definite number of trials. isson's distribution is applied ely rare, but they have a wrrence. of Poisson's distribution is > ‘A random variable X has a Pol X33) Here P(1s X $3) is given 4 SECO eet 2 = 3 we get, to the events whose probabilities of occurrence are large number of independent opportunities for = np and its variance is 02 = np. jason distribution with a mean of 3. Find X53) = PIX a = 3andk aken® 3ten? a 3e i =1,2,3 and (X = 1) Data Science 3-36 PsXx<3) ap De, 9S hg 3e +5e" +5 345% = 0.59744 Geometric Distribution * The geometric distribution represents the number of failures success in a series of Bernoulli trials. This discrete probability represented by the probability density function : f(x) = (1-p)*1p. * Instead of counting the number of successes, we can also count the ni trials until a success is obtained. That is, we shall let the rando X represent the number of trials needed to obtain the first success. In this situation, the number of trials will not be fixed. But if the independent, only two outcomes are available for each trial and the a success is still constant, then the random variable will have distribution. In a geometric distribution, if p is the probability of a success and x is th of trials to obtain the first success, then the following formulas apply. PO) = pal-p)** If X is a geometric random variable with parameter p, then Mean un = E(x) = 1/p 1- VO) = - P Variance o? Assumptions for the geometric distribution are as follows : 1. There are two possible outcomes for each trial (success or failure). 2. The trials are independent. 3. 
The probability of success is the same for each trial.

• The memoryless property means that a given probability distribution is independent of its history : any time may be marked down as time zero.

• Let X be exponentially distributed with parameter λ. Suppose we know X > t. What is the probability that X is also greater than some value s + t ? That is, we want to know

P(X > s + t | X > t)

Uniform Distribution

• A uniform distribution, also called a rectangular distribution, is a probability distribution that has constant probability.

• This distribution is defined by two parameters, a and b :
1. a is the minimum.
2. b is the maximum.
The distribution is written as U(a, b).

• Its probability density function is

f(x) = 1/(b − a), x ∈ [a, b]

• Mean and variance of the uniform distribution : the mean value of a continuous random variable is given by

m_x = ∫ x f(x) dx = (a + b)/2

and the variance is

σ² = (b − a)²/12

Exponential Distribution

• The exponential distribution is a continuous distribution. This type of problem shows up frequently in queueing systems, where we are interested in the time between events.

• For example, suppose that jobs in our system have exponentially distributed service times. If we have a job that has been running for one hour, what is the probability that it will continue to run for more than two hours ?

• By the definition of conditional probability, we have

P(X > s + t | X > t) = P(X > s + t, X > t) / P(X > t)

If X > s + t, then X > t is redundant, so we can simplify the numerator :

P(X > s + t | X > t) = P(X > s + t) / P(X > t)

Using the CCDF of the exponential distribution,

P(X > s + t | X > t) = e^(−λ(s + t)) / e^(−λt)

The e^(−λt) terms cancel, giving the surprising result

P(X > s + t | X > t) = e^(−λs)

• The memoryless property is an important property that simplifies calculations associated with conditional probabilities. The geometric distribution is the only discrete probability distribution that has the memoryless property.

• The exponential distribution is memoryless because the past has no bearing on its future behaviour.
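The cancellation in this derivation is easy to confirm numerically. Below is a small illustrative sketch (the values of λ, s and t are arbitrary choices, not from the text):

```python
import math

def exp_ccdf(x, lam):
    """P(X > x) for an exponential random variable with rate lam: e^(-lam * x)."""
    return math.exp(-lam * x)

lam, s, t = 0.5, 2.0, 3.0

# P(X > s + t | X > t) = P(X > s + t) / P(X > t)
conditional = exp_ccdf(s + t, lam) / exp_ccdf(t, lam)
# Memorylessness says this equals the unconditional P(X > s) = e^(-lam * s)
unconditional = exp_ccdf(s, lam)

print(round(conditional, 10), round(unconditional, 10))  # both 0.3678794412
```

The conditional probability agrees with e^(−λs) to floating-point precision, mirroring the algebraic cancellation of the e^(−λt) terms.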
Every instant is like the beginning of a new random period, which has the same distribution regardless of how much time has already elapsed. The exponential is the only memoryless continuous random variable.

Parameters of Continuous Distributions

Continuous distributions are defined using the following three parameters :
1. Scale parameter : it defines the range of the continuous distribution. The larger the scale parameter value, the larger the spread of the distribution.
2. Shape parameter : the shape parameter defines the shape of the probability distribution. Changes to the value of the shape parameter change the shape of the distribution.
3. Location parameter : the location parameter locates (or shifts) the distribution on the horizontal axis.

• The exponential probability density function is

f(x) = λ e^(−λx) for x ≥ 0, and f(x) = 0 for x < 0

• P-value : high probability that the test statistic exceeds the observed test statistic → do not reject the null hypothesis.
• P-value : low probability that the test statistic exceeds the observed test statistic → reject the null hypothesis.

Example : A firm manufacturing rivets wants to limit variations in their length as much as possible.

Given data : n = 10, σ₀ = 0.145. The observed lengths are

2.15, 1.99, 2.05, 2.12, 2.17, 2.01, 1.98, 2.03, 2.25, 1.93

x̄ = (2.15 + 1.99 + 2.05 + 2.12 + 2.17 + 2.01 + 1.98 + 2.03 + 2.25 + 1.93)/10 = 2.068

x       x − x̄     (x − x̄)²
2.15    0.082     0.006724
1.99   −0.078     0.006084
2.05   −0.018     0.000324
2.12    0.052     0.002704
2.17    0.102     0.010404
2.01   −0.058     0.003364
1.98   −0.088     0.007744
2.03   −0.038     0.001444
2.25    0.182     0.033124
1.93   −0.138     0.019044

Σ(x − x̄)² = 0.09096, so σ̂² = Σ(x − x̄)²/n = 0.009096

Steps :
1. Null hypothesis (H₀) : σ² ≤ σ₀²
2. Alternative hypothesis (H₁) : σ² > σ₀²
3. Level of significance α = 0.05
4. Test statistic :

χ² = n σ̂² / σ₀² = (10 × 0.009096) / (0.145)² = 4.326

Chi-square Goodness-of-Fit Test

• The chi-square goodness-of-fit test is applied to binned data. The test can also be applied to discrete distributions such as the binomial and the Poisson.

• The goodness-of-fit test begins by hypothesizing that the distribution of a variable behaves in a particular manner. For example, in order to determine daily staffing needs of a retail store, the manager may wish to know whether an equal number of customers come in on each day of the week; equal numbers of customers on each day would be the null hypothesis.

• Suppose we have a frequency distribution with k categories into which the data have been grouped.
The frequencies of occurrence of the variable, for each category, are called the observed values. The way the chi-square goodness-of-fit test works is to determine how many cases there would be in each category if the sample data were distributed according to the claim. The total of the expected number of cases in each category is made equal to the total of the observed number of cases in each category.

• The chi-square goodness-of-fit test is a non-parametric test that is used to find out how the observed value of a given phenomenon differs significantly from the expected value. In this test, the term goodness of fit is used to compare the observed sample distribution with the expected probability distribution.

• The test determines how well a theoretical distribution (such as normal, binomial or Poisson) fits the empirical distribution. The sample data is divided into intervals; then the number of points that fall into each interval is compared with the expected number of points in each interval under the hypothesized probability distribution function.

• Suppose that the values x₁, x₂, ..., x_k of a sample of size n have occurred with frequencies (Of)₁, (Of)₂, ..., (Of)_k respectively, where (Of) stands for observed frequency and Σᵢ (Of)ᵢ = n.

Procedure :
1. State the null-hypothesized proportions for each category, pᵢ. The alternative is that at least one of the proportions is different than specified in the null.
2. Calculate the expected counts for each cell as n pᵢ.
3. Calculate the χ² statistic :

χ² = Σ (observed − expected)² / expected

4. Compute the p-value as the proportion above the χ² statistic for a randomization distribution, or a χ² distribution with df = (number of categories − 1) if the expected counts are all at least 5.
5. Interpret the p-value in context.

Chi-square Test for Independence of Attributes

• The chi-square test of independence is used to determine if there is a significant relationship between two nominal (categorical) variables.

• The frequency of each category for one nominal variable is compared across the categories of the second nominal variable.

• The data can be displayed in a contingency table where each row represents a category for one variable and each column represents a category for the other variable. For example, say a researcher wants to examine the relationship between gender (male vs. female) and empathy (high vs. low).
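The goodness-of-fit procedure outlined above can be sketched in a few lines of Python. The daily customer counts below are hypothetical numbers invented for illustration (the retail-store example in the text gives no data):

```python
# Hypothetical observed customer counts for the seven days of the week
observed = [50, 60, 40, 47, 53, 70, 30]
n = sum(observed)                 # total number of customers observed
k = len(observed)

# Steps 1-2: H0 says equal proportions p_i = 1/7, so expected counts are n * p_i
expected = [n / k] * k

# Step 3: chi-square statistic = sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Step 4: compare against a chi-square distribution with k - 1 degrees of freedom
df = k - 1
print(round(chi2, 2), df)  # 20.36 with 6 degrees of freedom
```

With all expected counts equal to 50 (well above 5), the χ² distribution with 6 degrees of freedom is the appropriate reference distribution for the p-value.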
The chi-square test of independence can be used to examine the relationship between gender and empathy. The null hypothesis for this test is that there is no relationship between gender and empathy.

• The alternative hypothesis is that there is a relationship between gender and empathy (e.g., there are more high-empathy females than high-empathy males).

• This test is also known as the chi-square test of association. It uses a contingency table to analyze the data. A contingency table is an arrangement in which data is classified according to two categorical variables. The categories for one variable appear in the rows, and the categories for the other variable appear in the columns. Each variable must have two or more categories. Each cell reflects the total count of cases for a specific pair of categories out of the sample of N observations.

• For a 2 × 2 contingency table with cell frequencies a, b, c, d, where R₁ = a + b and R₂ = c + d are the row totals and C₁ = a + c and C₂ = b + d are the column totals, so that R₁ + R₂ = C₁ + C₂ = N (the total frequency), χ² can be found directly using the shortcut formula

χ² = N(ad − bc)² / (R₁ R₂ C₁ C₂)

Strength and Limitation of Chi-Square Test

• It is easier to compute than some statistics, and it makes no assumptions about the distribution of the population.

Example : Four methods are under development for making discs of a superconducting material. Fifty discs are made by each method, and they are checked for superconductivity when cooled with liquid nitrogen.

Method            M1    M2    M3    M4    Total
Superconductors   31    42    22    25    120
Failures          19     8    28    25     80
Total             50    50    50    50    200

Test the significance of the difference between the proportions of superconductors under the different methods at the 0.05 level using a chi-square test.

Step 1 : Null hypothesis (H₀) : the proportions of superconductors are equal.
Step 2 : Alternative hypothesis (H₁) : the proportions of superconductors are not equal.
Step 3 : Computation of expected frequencies (E)ᵢⱼ :

Observed frequency (Of)    Expected frequency (E)
(Of)₁₁ = 31                (E)₁₁ = (50 × 120) / 200 = 30
(Of)₁₂ = 42                (E)₁₂ = (50 × 120) / 200 = 30
(Of)₁₃ = 22                (E)₁₃ = (50 × 120) / 200 = 30
(Of)₁₄ = 25                (E)₁₄ = (50 × 120) / 200 = 30
(Of)₂₁ = 19                (E)₂₁ = (50 × 80) / 200 = 20
(Of)₂₂ = 8                 (E)₂₂ = (50 × 80) / 200 = 20
(Of)₂₃ = 28                (E)₂₃ = (50 × 80) / 200 = 20
(Of)₂₄ = 25                (E)₂₄ = (50 × 80) / 200 = 20

Step 4 : Determination of degrees of freedom = (m − 1)(n − 1) = (2 − 1)(4 − 1) = 3

Step 5 : Chi-square statistic :

χ² = Σ [(Of)ᵢⱼ − (E)ᵢⱼ]² / (E)ᵢⱼ
= (31 − 30)²/30 + (42 − 30)²/30 + (22 − 30)²/30 + (25 − 30)²/30 + (19 − 20)²/20 + (8 − 20)²/20 + (28 − 20)²/20 + (25 − 20)²/20
= 19.50

Step 6 : The calculated chi-square value (19.50) is greater than the tabulated value (7.815), so we reject the null hypothesis. Therefore, the proportions of superconductors are not equal.

Student's t-Distribution

• When the sample values come from a normal distribution, the exact distribution of the statistic was worked out by W. S. Gosset. He called it a t-distribution.

• Unfortunately, there is not one t-distribution; there is a different t-distribution for each value of n. If n = 7 there is a certain t-distribution, but for n = 8 the t-distribution is a little different. We say that the variable t has a t-distribution with n − 1 degrees of freedom.

• Suppose a simple random sample of size n is drawn from a population. If the population from which the sample is taken follows a normal distribution, the distribution of the random variable

t = (X̄ − μ₀) / (s/√n)

follows Student's t-distribution with n − 1 degrees of freedom. The sample mean is X̄ and the sample standard deviation is s.

• The degrees of freedom are the number of free choices left after a sample statistic such as X̄ is calculated. When you use a t-distribution to estimate a population mean, the degrees of freedom are equal to one less than the sample size :

d.f. = n − 1

• The t-statistic assumes that the population is normal, although this assumption can be relaxed, and that the sample was drawn from the population of interest.
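The t statistic defined above can be computed directly from summary statistics. As a sketch, the numbers below are taken from the n = 6 worked example that follows in the text (sample mean 58392, s = 648, hypothesized mean 58000):

```python
import math

def t_statistic(xbar, mu0, s, n):
    """t = (xbar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    return (xbar - mu0) / (s / math.sqrt(n))

t = t_statistic(58392, 58000, 648, 6)
df = 6 - 1
print(round(t, 3), df)  # 1.482 with 5 degrees of freedom
```

Since 1.482 is below the two-tailed critical value 3.365 for 5 degrees of freedom, the example's null hypothesis is not rejected.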
• On the basis of the comparison of the calculated t value with the theoretical (tabulated) value of Student's t-distribution, we draw our conclusion.

Fig. 3.13.1 Student's t-distribution

Properties of Student's t-Distribution
1. The t-distribution is different for different degrees of freedom.
2. The t-distribution is centered at 0 and symmetric about 0.
3. The total area under the curve is 1. The area to the left of 0 is 1/2 and the area to the right of 0 is 1/2.
4. As the magnitude of t increases, the graph approaches the horizontal axis but never equals 0.
5. The area in the tails of the t-distribution is larger than the area in the tails of the normal distribution, because using s as an estimate of σ introduces further variability.
6. The shape of the t-distribution is dependent on the sample size n. As the sample size n increases, the density curve of t gets closer to the standard normal curve and the distribution becomes approximately normal.
7. The mean, median and mode of the t-distribution are equal to zero.
8. The standard deviation of the t-distribution is greater than 1.

Critical values for various degrees of freedom are tabulated.

t = (58392 − 58000) / (648/√6) = 392 / 264.545 = 1.482

Since t = 1.482 is less than t_(α/2) = 3.365, i.e. t < t_(α/2), the null hypothesis is accepted.

Example : A sample of 100 iron bars is said to be drawn from a large number of bars whose lengths are normally distributed with mean 4 feet and S.D. 0.6 ft. If the sample mean is 4.2 ft, can the sample be regarded as a truly random sample ?

Solution : Sample size n = 100, sample mean X̄ = 4.2, μ = 4, S.D. σ = 0.6

Null hypothesis (H₀) : the sample can be regarded as truly random.
Alternative hypothesis (H₁) : the sample cannot be so regarded.

z = (X̄ − μ) / (σ/√n) = (4.2 − 4) / (0.6/√100) = 0.2/0.06 = 3.33

Since the calculated value 3.33 exceeds the tabulated value 3, H₀ is rejected (the level of significance is 5 %).

F-Distribution

• In probability theory and statistics, the F-distribution is a continuous probability distribution. It is also known as Snedecor's F-distribution or the Fisher-Snedecor distribution.

• A variate of the F-distribution arises as the ratio of two chi-squared variates :

F = (U₁/d₁) / (U₂/d₂)

where U₁ and U₂ have chi-square distributions with d₁ and d₂ degrees of freedom respectively, and U₁ and U₂ are independent.
The F-distribution arises frequently as the null distribution of a test statistic.

n         Degrees of freedom
6         5
16        15
31        30
101       100
1001      1000
Normal    "infinite"

Example : Decide whether to use the normal distribution, the t-distribution, or neither :
1. n = 50, the distribution is skewed, s = 2.5
2. n = 25, the distribution is skewed, s = 52.9
3. n = 25, the distribution is normal, σ = 4.12

Solution :
1. The normal distribution would be used because the sample size is 50.
2. Neither distribution would be used because n < 30 and the distribution is skewed.
3. The normal distribution would be used because, although n < 30, the population standard deviation is known.

• Given data : sample size n = 6; here n < 30, so the t-distribution is used. Sample mean X̄ = 58392, standard deviation s = 648.

Degrees of freedom = n − 1 = 6 − 1 = 5
Null hypothesis (H₀) : μ = 58000
Alternative hypothesis (H₁) : μ ≠ 58000
Level of significance α = 0.05 : a two-tailed test is used.

• The probability density function of an F(d₁, d₂)-distributed random variable is given by

f(x) = √( (d₁x)^d₁ · d₂^d₂ / (d₁x + d₂)^(d₁+d₂) ) / ( x · B(d₁/2, d₂/2) )

for real x ≥ 0, where d₁ and d₂ are positive integers and B is the beta function.

• An F random variable is defined as a ratio of two independent chi-square random variables.

Basic Properties of F-distributions
1. The total area under an F-curve equals 1.
2. An F-curve is only defined for x ≥ 0.
3. An F-curve has value 0 at x = 0, is positive for x > 0, extends indefinitely to the right, and approaches 0 as x → ∞.
4. An F-curve is right-skewed.

Testing the equality of two variances
1. To test the assumption of equal variances that was made in using the t-test.
2. When there is interest in actually comparing the variances of two populations.

• Assume we repeatedly select random samples from two normal populations and consider the distribution of the ratio of the two sample variances, F = s₁²/s₂². The ratio formed in this manner follows an F-distribution with the following degrees of freedom : ν₁ = n₁ − 1 and ν₂ = n₂ − 1. The F table gives the critical values of the F-distribution, which depend on the degrees of freedom.

The difference is not significant at the 5 % level.
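The variance-ratio setup described above can be sketched as follows; the two sample variances and sample sizes below are hypothetical values chosen for illustration:

```python
def f_ratio(var1, n1, var2, n2):
    """F = s1^2 / s2^2 with (n1 - 1, n2 - 1) degrees of freedom.
    By convention the larger sample variance is placed in the numerator,
    so the computed F is always >= 1 and is compared to an upper critical value."""
    if var1 < var2:
        var1, var2, n1, n2 = var2, var1, n2, n1
    return var1 / var2, (n1 - 1, n2 - 1)

# Hypothetical samples: s1^2 = 9.0 with n1 = 10, and s2^2 = 4.0 with n2 = 12
F, (df1, df2) = f_ratio(9.0, 10, 4.0, 12)
print(F, df1, df2)  # 2.25 9 11
```

The observed ratio would then be compared against the tabulated F critical value for (9, 11) degrees of freedom at the chosen significance level.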
Example : The following measurements of the heat-producing capacity (in millions of calories per ton) of the coal produced by two mines are available.

• The P-value is calculated using the t distribution.

Two-sample confidence intervals

• In addition to two-sample t-tests, we can also use the t distribution to construct confidence intervals for the mean difference. When σ₁ and σ₂ are unknown, we can form the following 100·C % confidence interval for the mean difference μ₁ − μ₂ :

(x̄₁ − x̄₂) ± t* √( s₁²/n₁ + s₂²/n₂ )

• The critical value t* is calculated from a t distribution with k degrees of freedom, where k is equal to the smaller of (n₁ − 1) and (n₂ − 1).

Summary of two-sample tests
1. Two independent samples with known σ₁ and σ₂ : use the two-sample Z-test, with P-values calculated using the standard normal distribution.
2. Two independent samples with unknown σ₁ and σ₂ : use the two-sample t-test, with P-values calculated using the t distribution with degrees of freedom equal to the smaller of n₁ − 1 and n₂ − 1.
3. Two independent samples with unknown σ₁ and σ₂ that are assumed equal : use the two-sample t-test with the pooled variance estimator. The P-value is calculated using the t distribution with n₁ + n₂ − 2 degrees of freedom.
4. Two samples that are matched pairs : first calculate the differences for each pair, then use the usual one-sample t-test on these differences.

Two-Sample t Test with Unknown Variances

• Suppose we are comparing two populations that have different means but equal standard deviations. We want to infer about the difference between the means when the standard deviation is unknown. Assuming that both populations have the same standard deviation, instead of using the two estimates s₁² and s₂² separately, there is a way to combine these two estimates to give a more informative single estimator.
The pooled estimator of the variance is

s_p² = [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂ − 2)

Sampling and Estimation

Syllabus

Introduction to sampling, Population parameters and sample statistics, Probability and non-probability sampling, Sampling distribution, Central Limit Theorem (CLT), Sample size for mean of the population, Estimation of population parameters, Estimation of parameters using method of moments, Estimation of parameters with maximum likelihood estimation.

Contents

Introduction to Sampling, Probability and Non-Probability Sampling, Sampling Distribution, Central Limit Theorem (CLT), Estimation of Population Parameters, Maximum Likelihood Estimation.

Introduction to Sampling

• A sample is a group of units selected from a larger group (the population). By studying the sample, it is hoped to draw valid conclusions about the larger group.

• A sample is a subset of a population. Since it is generally a smaller version of the population of interest, we do not really examine the whole group but instead gather information from a smaller collection of units from the population, used to determine truths about that population.

• A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population; this is often best achieved by random sampling. Also, before collecting the sample, the researcher carefully and completely defines the population, including a description of the members to be included.

• Example : The population for a study of infant health might be all children born in the UK in the 1980's. The sample might be all babies born on 7th May in any of the years.

• A population is any entire collection of people, animals, plants or things from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.

• A population is a collection of objects. It may be finite or infinite according to the number of objects in the population.
• In order to make generalizations about a population, a sample that is meant to be a representative sample (or subset) of that population is studied.

• For each population, there are many possible samples. A sample statistic gives information about a corresponding population parameter. For example, the sample mean for a set of data would give information about the overall population mean μ.

• It is important that the investigator carefully and completely defines the population before collecting the sample, including a description of the members to be included.

• Example : The population for a study of infant health might be all children born in the UK in the 1980's. The sample might be all babies born on 7th May in any of the years.

• When measures such as the mean, median, mode, variance and standard deviation of a population distribution are computed, they are referred to as parameters. A parameter can be simply defined as a summary characteristic of a population distribution.

Probabilistic and Non-Probability Sampling

• Two general approaches to sampling are used. In probability sampling, common designs include :
1. Simple random sample
2. Systematic random sample
3. Stratified random sample

• When the population embraces a number of distinct categories, or strata (e.g., male and female, or different age groups), stratified sampling can be used : a separate sample is selected from each stratum.

• Sampling frame : the list of units in the population from which the sample is actually selected, e.g., a telephone book or a city directory.

• Parameter : a summary measure of the population, e.g., total annual GDP or exports, or the percentage voting Liberal in a federal election.

• Statistic : a summary measure computed from the sample, e.g., the monthly unemployment rate.

• Sampling distribution : the probability distribution of the statistic.

• N is the size of the population, i.e., the number of elements in the population; n is the size of the sample, i.e., the number of elements in the sample.

• A simple random sample of size n is selected in a manner such that each possible sample of size n has the same probability of being selected.
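The distinction between a population parameter and a sample statistic can be illustrated with a tiny simulation; the population values and sizes below are made up for this sketch:

```python
import random

random.seed(0)  # make the draw reproducible

population = list(range(1, 1001))      # a finite population with N = 1000 elements
N = len(population)
mu = sum(population) / N               # parameter: the population mean

n = 50
sample = random.sample(population, n)  # simple random sample: each subset of size n equally likely
xbar = sum(sample) / n                 # statistic: the sample mean, an estimate of mu

print(N, mu, n)  # 1000 500.5 50
```

Here μ = 500.5 is a fixed property of the population, while x̄ varies from one random sample to the next; the distribution of x̄ over repeated draws is exactly the sampling distribution defined above.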

select the splitting attribute from the attribute list;
for each outcome j of the splitting criterion :
    let Dj be the set of data tuples in D satisfying outcome j;
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute list) to node N;
end for
return N;

• Decision tree generation consists of two phases : tree construction and tree pruning.
• In the tree construction phase, all the examples are at the root; the examples are then partitioned recursively based on selected attributes.
• In the pruning phase, branches that reflect noise or outliers are identified and removed.
