STATISTICS, STATISTICAL MODELLING AND DATA ANALYTICS

By:
Rajni, Assistant Professor, MAIT, Delhi
Ajay Kumar Gupta, Assistant Professor, MAIT, Delhi
Kanika Agarwal, Assistant Professor, MAIT, Delhi

Published by: S.K. Kataria & Sons, Educational Publishers
4885/109, Prakash Mahal, Dr. Subhash Bhargav Lane, Opposite Delhi Medical Association, Daryaganj, New Delhi-110002 (INDIA)
Phone: +91-11-23243489, 43551243; Mobile: +91-9871775858
e-mail: katariabook@gmail.com; katariabooks@yahoo.com
Website: www.skkatariaandsons.com (e-books, subscriptions and bulk buying also available on our website)

© ALL RIGHTS STRICTLY RESERVED

Information contained in this book has been obtained by the authors from sources believed to be reliable and is correct to the best of their knowledge. Every effort has been made to avoid errors and omissions and ensure accuracy. Any error or omission noted may be brought to the notice of the publisher, which shall be taken care of in the forthcoming edition of this book. However, neither the publisher nor the authors guarantee the accuracy or completeness of any information published herein, and neither the publisher nor the authors take any responsibility or liability for any inconvenience, expense, loss or damage to anyone resulting from the contents of this book. The book is not intended to render engineering and other professional services. If such services are required, the assistance of an appropriate professional should be sought. The authors have taken all possible care to ensure that the contents of the book do not violate any existing copyright or other intellectual property rights of any person in any manner whatsoever.
In the event the authors have been unable to track any source, and if any copyright has been inadvertently infringed upon, the facts may be brought to the notice of the publisher in writing for corrective action.

Send all correspondence to: M/s. S.K. Kataria & Sons, New Delhi
Email: [email protected]

First Edition: 2024
Laser Typesetting by: S.K. Computers, Delhi-32
Printed at: AAPTE UDHYOG, NOIDA

Preface

Welcome to the first edition of "Statistics, Statistical Modelling, and Data Analytics." This book is tailored for undergraduate students, with a particular emphasis on the curriculum of Guru Gobind Singh Indraprastha University (GGSIPU). However, the content is designed to be comprehensive and adaptable, making it suitable for students enrolled in other universities as well.

The field of statistics has evolved into a critical discipline that plays a fundamental role in various domains, from scientific research to business and industry. Understanding statistical concepts, methodologies, and data analytics is crucial for making informed decisions and solving real-world problems. This book is crafted to provide a solid foundation in these areas, catering to the specific needs and requirements of undergraduate students.

Statistics: Introduction & Descriptive Statistics, Data Visualization, Introduction to Probability Distributions, Hypothesis Testing, Linear Algebra and Population Statistics, Mathematical Methods and Probability Theory, Sampling Distributions and Statistical Inference, Quantitative Analysis.

Statistical Modelling: Linear models, regression analysis, analysis of variance, applications in various fields;
Gauss-Markov theorem; geometry of least squares, subspace formulation of linear models, orthogonal projections; regression models, factorial experiments, analysis of covariance and model formulae; regression diagnostics, residuals, influence diagnostics, transformations, Box-Cox models, model selection and model building strategies, logistic regression models; Poisson regression models.

Data Analytics: Classes of open and closed sets, concept of compactness, metric spaces; concepts of Cauchy sequences, completeness, compactness, and connectedness to solve problems.

Advanced Concepts in Data Analytics: Vector spaces, subspaces, independence of vectors, basis and dimension; eigenvalues, eigenvectors and related results.

Rajni
Ajay Kumar Gupta
Kanika Agarwal

Acknowledgment

The creation of this book is a collaborative effort, and we extend our sincere gratitude to our colleagues and students at Maharaja Agrasen Institute of Technology, who played a crucial role in the creation and completion of this book. Their support, encouragement, and expertise have been invaluable. Special thanks to our family and friends, whose patience, understanding, and encouragement sustained us during the long hours of writing and editing. Lastly, we extend our appreciation to the publisher, S.K. Kataria & Sons, who worked tirelessly with us to complete this project. Their collective efforts have left an indelible mark on the book.

Rajni
Ajay Kumar Gupta
Kanika Agarwal

Contents

1. Descriptive Statistics
   1.1 Introduction
   1.2 Descriptive Statistics
   1.3 Descriptive Measures
       1.3.1 Measures of Centre
       1.3.2 Measures of Dispersion
   1.4 Graphical Representation
       1.4.1 Pie Charts
       1.4.2 Bar Charts and Histograms
       1.4.3 Boxplots
   Problems

2. Probability Theory and Probability Distribution
   2.1 Introduction
   2.2 Types of Random Variables
       2.2.1 Discrete Random Variables
       2.2.2 Continuous Random Variables
   2.3 Jointly Distributed Random Variables
       2.3.1 Joint Distribution Function and its Properties
       2.3.2 Joint Discrete Distribution Function
       2.3.3 Joint Continuous Distribution Function
   2.4 Independent and Dependent Random Variables
   2.5 Expectation of the Sum of Random Variables
   2.6 Expectation of the Product of Random Variables
   2.7 Probability Distribution
       2.7.1 Binomial Distribution
       2.7.2 Poisson Distribution
       2.7.3 Poisson Process
       2.7.4 Normal Distribution
   Problems

3. Sampling Distribution
   3.1 Introduction
   3.2 Some Common Terms used in Sampling Distributions
   3.3 Sampling Distributions
       3.3.1 The Sampling Distribution of the Mean (σ known)
       3.3.2 Central Limit Theorem
       3.3.3 Sampling Distribution of Proportions
       3.3.4 Sampling Distributions and Sums
       3.3.5 Sampling Distribution of the Mean (σ unknown): t-distribution
       3.3.6 χ²-Distribution
       3.3.7 Sampling Distribution of Variance
       3.3.8 F-Distribution (Sampling Distribution of the Ratio of Two Sample Variances)
   Problems

4. Estimation
   4.1 Point Estimates
   4.2 Confidence Intervals about the Mean (μ) when the Population Standard Deviation (σ) is Known
   4.3 Confidence Intervals about the Mean (μ) when the Population Standard Deviation (σ) is Unknown
   4.4 Confidence Intervals about the Population Proportion (p)
   4.5 Confidence Interval Summary
   Problems

5. Hypothesis Testing
   5.1 The Fundamentals of Hypothesis Testing
       5.1.1 Components of a Formal Hypothesis Test
       5.1.2 The Null and Alternative Hypotheses
       5.1.3 A Two-sided Test
       5.1.4 A Right-sided Test
       5.1.5 A Left-sided Test
       5.1.6 Statistically Significant
       5.1.7 Types of Errors
       5.1.8 Power of the Test
   5.2 Hypothesis Test about the Population Mean (μ) when the Population Standard Deviation (σ) is Known
       5.2.1 Testing a Hypothesis Using the Classical Approach
       5.2.2 Testing a Hypothesis Using P-values
   5.3 Hypothesis Test about the Population Mean (μ) when the Population Standard Deviation (σ) is Unknown
       5.3.1 Testing a Hypothesis Using the Classical Approach
       5.3.2 P-value Approach
   5.4 Hypothesis Test for a Population Proportion (p)
   5.5 Hypothesis Test about a Variance
       5.5.1 One-sample Test for Testing the Hypotheses
   5.6 Inferences about the Differences of Two Populations
       5.6.1 Inferences about Two Means with Independent Samples (Assuming Unequal Variances)
       5.6.2 Construct and Interpret a Confidence Interval about the Difference of Two Independent Means
       5.6.3 P-value Approach
       5.6.4 Pooled Two-sample t-test (Assuming Equal Variances)
       5.6.5 Inferences about Two Means with Dependent Samples (Matched Pairs)
       5.6.6 P-value Approach
       5.6.7 Construct and Interpret a Confidence Interval about the Differences of the Data for Matched Pairs
       5.6.8 Confidence Interval Interpretations
   5.7 Inferences about Two Population Proportions
       5.7.1 Using the P-value Approach
   5.8 Test for Comparing Two Population Variances
   Problems

6. Linear Algebra
   6.1 Linear Equations
   6.2 What is a System of Linear Equations?
   6.3 Solving Linear Systems
   6.4 Geometric Interpretation of the Solution Set
   6.5 Row Reduction and Echelon Forms
   6.6 Reduced Row Echelon Form (RREF)
   6.7 Existence and Uniqueness of Solutions
   Problems

7. Simple Linear Regression
   7.1 Introduction
   7.2 Simple Linear Regression Model
   7.3 Estimation of the Regression Parameters (β₀, β₁)
   7.4 Meaning of Regression Parameters
   7.5 Properties of the Least Squares Estimators
   7.6 Coefficient of Determination
   Problems

8. Multiple Linear Regression Model
   8.1 Introduction
   8.2 Multiple Linear Regression Model
   8.3 Meaning of Regression Parameters
   8.4 Least-Squares Estimator for β
   8.5 Properties of the Estimated Parameters
       8.5.1 Unbiasedness of β̂
       8.5.2 Cov(β̂) = (XᵀX)⁻¹σ²
       8.5.3 Gauss-Markov Theorem
   8.6 Geometry of Least Squares
       8.6.1 Subspace Formulation of Linear Models
       8.6.2 Orthogonal Projections
   Problems

9. Analysis of Variance (ANOVA) Models
   9.1 Introduction
   9.2 Single Factor ANOVA Model
   9.3 Analysis of Single Factor ANOVA
       9.3.1 Assumptions Required for the Analysis of Single Factor ANOVA
       9.3.2 Sums of Squares
       9.3.3 Testing of Hypotheses (Overall F-test for the Equality of the Factor Level Means)
   9.4 Two-Factor ANOVA Model
   9.5 Factorial Experiment
       9.5.1 Factorial Experiments with Factors at Two Levels (2² Factorial Experiment)
   9.6 The Two-way Factorial Analysis
       9.6.1 The Linear Model for Two-way Factorial Experiments
       9.6.2 ANOVA for a Two-way Factorial Design
   9.7 Analysis of Covariance
       9.7.1 The Analysis of Covariance (ANCOVA) Model
       9.7.2 One-way Model with One Covariate
       9.7.3 Testing Hypotheses
   Problems

10. Model Diagnostics
   10.1 Introduction
   10.2 Checking of the Linear Relationship between Study and Explanatory Variables
   10.3 Residual Analysis
   10.4 Methods for Scaling Residuals
   10.5 Diagnostic Plots
       10.5.1 Normal Probability Plot
       10.5.2 Plots of Residuals against the Fitted Value
       10.5.3 Plots of Residuals against Explanatory Variable
   10.6 Transformation and Weighting to Correct Model Inadequacies
       10.6.1 Variance Stabilizing Transformations
       10.6.2 Transformations to Linearize the Model
   10.7 Analytical Methods for Selecting a Transformation on Study Variable (The Box-Cox Method)
   10.8 Leverage and Influence Diagnostics
       10.8.1 Leverage
       10.8.2 Measures of Influence
   10.9 Variable Selection
       10.9.1 Omission/Exclusion of Relevant Variables
       10.9.2 Inclusion of Irrelevant Variables
   10.10 Model Building
       10.10.1 Coefficient of Determination
       10.10.2 Adjusted Coefficient of Determination
       10.10.3 Residual Mean Square
       10.10.4 Mallow's Cp Statistic
       10.10.5 Akaike's Information Criterion (AIC)
       10.10.6 Bayesian Information Criterion (BIC)
       10.10.7 PRESS Statistic
       10.10.8 Partial F-statistics
   10.11 Computational Techniques for Variable Selection
       10.11.1 Use All Possible Explanatory Variables
       10.11.2 Stepwise Regression Techniques
   Problems

11. Logistic Regression Model and Poisson Regression Model
   11.1 Logistic Regression Model
   11.2 Linear Predictor and Link Functions
   11.3 Maximum Likelihood Estimation of Parameters
   11.4 Hypothesis Testing of Logistic Regression Model
   11.5 Poisson Regression Model
   11.6 Linear Predictor and Link Functions
   11.7 Maximum Likelihood Estimation of Parameters
   11.8 Hypothesis Testing of Poisson Regression Model
   Problems

12. Data Analytics
   12.1 Real Analysis and Data Science
   12.2 Basic Set Theory
   12.3 Norms and Metrics
       12.3.1 Components of a Metric Space
       12.3.2 Importance of Metric Space
   12.4 Open Set and Closed Set
       12.4.1 Open Set
       12.4.2 Closed Set
       12.4.3 Unions of Open Sets and Intersections of Closed Sets
       12.4.4 Example of Open and Closed Set
       12.4.5 Example of Half-Open Interval
   12.5 Sequences of Numbers
       12.5.1 Sequences
       12.5.2 Convergence
       12.5.3 Cauchy Sequence
   12.6 Compact Sets
       12.6.1 Sequential Compactness
       12.6.2 Key Properties of Compact Sets
       12.6.3 Examples of Compact Sets
       12.6.4 Importance of Compact Sets
   12.7 Completeness in Metric Spaces
       12.7.1 Examples of Complete Metric Spaces
   12.8 Connected Metric Space
       12.8.1 Theorems Based on Connectedness of Metric Space
       12.8.2 Few Examples of Connected Metric Spaces
       12.8.3 Problems on Connected Metric Space
   Problems

13. Vector and Vector Space
   13.1 Vector
   13.2 Vector Addition and Scalar Multiplication
   13.3 Hyperplanes and Lines in Rⁿ
   13.4 Directions and Magnitudes
   13.5 Field
   13.6 Vector Space
   13.7 Spanning Sets
   13.8 Subspaces
   13.9 Linear Dependence and Independence
   13.10 Basis and Dimensions
   Problems

14. Eigen Values and Eigen Vectors
   14.1 Introduction
   14.2 Cayley-Hamilton Theorem, Characteristic Polynomial
   14.3 Eigen Values and Eigen Vectors
   14.4 Some Other Properties of Eigen Values
   14.5 Diagonalization of a Matrix
   Problems

Appendix
Bibliography
Model Question Paper
Index

CHAPTER 1

DESCRIPTIVE STATISTICS

Chapter Outline
* Introduction
* Descriptive Statistics
* Descriptive Measures
* Graphical Representation

1.1 INTRODUCTION

Statistics has evolved as a universal language of the sciences and as an art of data analysis that can yield influential outcomes. In the field of natural resource management, scientists, researchers, and administrators rely extensively on statistical analysis to address inquiries that emerge within the populations under their supervision. To illustrate:

* Has there been a substantial alteration in the average sawtimber volume within the red pine stands?
* Is there any evidence of a rise in the prevalence of invasive species within the Great Lakes region?
* What portion of the white-tailed deer in New Hampshire exhibits weights below the threshold considered healthy?
* Did fertilizers A, B, or C produce any discernible effects on corn yields?
These questions typically require statistical analysis for resolution. To tackle them effectively, it is crucial to acquire a well-constructed random sample from the relevant population. Once the sample is obtained, we use descriptive statistics to organize and summarize the sample data. The next phase involves inferential statistics, which lets us leverage our sample statistics and extend our findings to the entire population while assessing the reliability of the results. Before delving into various statistical methodologies, however, a concise review of descriptive statistics is essential.

Statistics is the scientific field focussed on gathering, structuring, summarizing, analysing, and making sense of information. It is a branch of mathematics that involves the collection, analysis, interpretation, presentation, and organization of data, and it provides methods for making inferences and decisions in the presence of uncertainty. There are two main branches of statistics:

* Descriptive Statistics
* Inferential Statistics

Descriptive Statistics: Descriptive statistics encompasses the concise summarization and systematic organization of data to enhance its comprehensibility. Key metrics employed in descriptive statistics include the mean, median, mode, range, and standard deviation.

Inferential Statistics: Inferential statistics encompasses the utilization of data from a subset to draw inferences or predictions about a broader population. It covers methodologies such as hypothesis testing, confidence intervals, and regression analysis.

Statistics is widely applied in various fields, including science, business, economics, social sciences, and government, to draw meaningful conclusions from data and inform decision-making processes. In this chapter, we will focus on descriptive statistics.
Fig. 1.1 Using sample statistics to estimate population parameters

1.2 DESCRIPTIVE STATISTICS

A population represents the subject of study, comprising all the elements within it. Population data entails the complete collection of these elements. For instance:

* Every single fish inhabiting Long Lake.
* The entirety of lakes situated in the Adirondack Park.
* The total grizzly bear population residing in Yellowstone National Park.

In contrast, a sample refers to a portion of data selected from the population under investigation. For example:

* 100 fish selected at random from Long Lake.
* 25 lakes chosen through random selection within the Adirondack Park.
* A sample consisting of 60 grizzly bears, each with a designated home range in Yellowstone National Park.

Populations are characterized by descriptive metrics known as parameters. Inferences regarding these parameters rely on sample statistics. As an illustration, the population mean (μ) is approximated using the sample mean (x̄), while the population variance (σ²) finds its estimate in the sample variance (s²).

Variables represent the characteristics of interest, such as:

* The length of fish dwelling in Long Lake.
* The pH levels of lakes in the Adirondack Park.
* The weights of grizzly bears residing in Yellowstone National Park.

Variables can be classified into two primary categories: qualitative and quantitative. Qualitative variables encompass values that denote attributes or categories for which mathematical operations are not applicable. Examples include gender, race, and petal colour. In contrast, quantitative variables consist of numeric values, such as measurements, to which mathematical operations can be applied. Illustrations of quantitative variables include age, height, and length.
Quantitative variables can be further classified into discrete and continuous variables. Discrete variables involve a finite or countable set of potential values, much like a collection of hens capable of laying specific numbers of eggs, such as 1, 2, 13, and so on. The range of conceivable values for discrete variables is finite and well-defined. Continuous variables, in contrast, have an infinite number of possible values. Think of continuous variables as cows: a cow can give 4.6713245 gallons of milk, or 7.0918754 gallons, or 13.272698 gallons... there are almost infinitely many values that a continuous variable could take on.

Problem 1.1 Is each variable qualitative or quantitative?

Species (qualitative), Weight (quantitative), Diameter (quantitative), Zip Code (qualitative)

1.3 DESCRIPTIVE MEASURES

Descriptive characteristics that pertain to populations are referred to as parameters and are commonly denoted by Greek letters. The population mean is symbolized by μ (pronounced 'mu'), the population variance by σ² ('sigma squared'), and the population standard deviation by σ ('sigma'). Conversely, descriptive properties pertaining to samples are termed statistics and are typically expressed using Roman letters. The sample mean is denoted x̄ ('x-bar'), the sample variance s², and the sample standard deviation s. Sample statistics serve the purpose of estimating unknown population parameters.

In this section, we will delve into descriptive statistics, specifically focusing on measures of central tendency and measures of data variability. These descriptive statistics aid us in discerning both the central tendency and the extent of variability within the dataset.
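The parameter-versus-statistic idea above can be made concrete with a short simulation. The snippet below is an illustrative sketch, not from the book: the population of 100 values, the sample size of 20, and the seed are invented for the demo. It draws one random sample and compares the sample statistics with the known population parameters.

```python
import random
import statistics

# Hypothetical population of 100 known values, so the true parameters are exact.
population = list(range(1, 101))
mu = statistics.mean(population)           # population mean (parameter)
sigma2 = statistics.pvariance(population)  # population variance (parameter)

# Draw one random sample and compute the corresponding statistics.
random.seed(42)                  # fixed seed so the demo is repeatable
sample = random.sample(population, 20)
x_bar = statistics.mean(sample)  # sample mean (statistic) estimates mu
s2 = statistics.variance(sample) # sample variance (statistic) estimates sigma^2

print(f"mu = {mu}, x-bar = {x_bar:.2f}")
print(f"sigma^2 = {sigma2:.2f}, s^2 = {s2:.2f}")
```

A different seed gives a different x̄ and s², which is exactly the point: statistics vary from sample to sample, while the parameters they estimate stay fixed.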
1.3.1 Measures of Centre

Mean

The arithmetic mean of a variable, often referred to as the average, is calculated by summing all the values and dividing the total by the count of values. The population mean is symbolized by the Greek letter μ ('mu'), while the sample mean is represented as x̄ ('x-bar'). Generally, the sample mean serves as the most reliable and unbiased estimate of the population mean. Nevertheless, it can be influenced by extreme values, also known as outliers, and may not be the most suitable measure of central tendency when dealing with highly skewed data. The following formulas illustrate the computation of the population mean and the sample mean:

Population mean: μ = (Σxᵢ)/N
Sample mean: x̄ = (Σxᵢ)/n

Here, xᵢ represents an individual element within the dataset, N signifies the total number of elements in the population, and n stands for the number of elements in the sample dataset.

Problem 1.2 Find the mean for the following sample data set: 6.4, 5.2, 7.9, 3.4.

x̄ = (6.4 + 5.2 + 7.9 + 3.4)/4 = 5.725

Median

The median of a variable is determined by identifying the middle value within a dataset when the data are arranged in ascending order. It effectively divides the data into two equal halves, with 50% of the data points falling below the median and the remaining 50% above it. Importantly, the median is not significantly affected by the presence of outliers and can serve as a more suitable measure of central tendency, especially when dealing with highly skewed data. The method for calculating the median varies based on the number of observations in the dataset. To compute the median for a dataset with an odd number of values (where n is odd), first arrange the data in ascending order.

Fig. 1.2 The method of calculating the median varies based on the number of observations in the dataset
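The mean formulas above translate directly into code. A minimal sketch (the function name is mine), checked against the data of Problem 1.2:

```python
def sample_mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

# Problem 1.2: sample data 6.4, 5.2, 7.9, 3.4
data = [6.4, 5.2, 7.9, 3.4]
print(round(sample_mean(data), 3))  # 5.725
```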
Problem 1.3 Find the median for the following sample dataset: 23, 27, 29, 31, 35, 39, 40, 42, 44, 47, 51

The median is 39. It is the middle value that separates the lower 50% of the data from the upper 50% of the data.

To calculate the median with an even number of values (n is even), first sort the data from smallest to largest and take the average of the two middle values.

Problem 1.4 Find the median for the following sample data set: 23, 27, 29, 31, 35, 39, 40, 42, 44, 47

m = (35 + 39)/2 = 37

Mode

The mode, in statistics, refers to the most frequently occurring value in a dataset. It is particularly useful when working with qualitative data, where the values are categorical and lack the numerical properties needed for mathematical operations such as addition, subtraction, multiplication, or division. As a result, computing the mean and median for categorical data is impractical, but the mode serves as a suitable measure of central tendency for such data sets. In contrast, when dealing with quantitative data, the mode is less commonly employed as a central measure, particularly when each value occurs only once, rendering the mode irrelevant.

Understanding the interplay between the mean and median is essential, as it provides insight into the distribution characteristics of a variable. For instance, in a positively skewed distribution (skewed right), the mean tends to increase due to a few larger observations that pull the distribution towards the right. Meanwhile, the median remains less affected by these extreme larger values. Consequently, in such a scenario, the mean will be greater than the median. In a symmetric distribution, the values of the mean, median, and mode align closely.
Conversely, in a negatively skewed distribution (skewed left), the mean tends to decrease to account for a few smaller observations that pull the distribution towards the left. Once again, the median is less influenced by these extreme smaller values, resulting in the mean being lower than the median.

Fig. 1.3 Illustration of skewed (right and left) and symmetric distributions

1.3.2 Measures of Dispersion

Measures of central tendency focus on the average or middle values within a dataset, while measures of dispersion analyze the extent or range of data variability. Variability indicates the degree to which values deviate from one another. Data points that are in close proximity to each other exhibit lower measures of variability, whereas values that are more widely dispersed display higher measures of variability.

Consider the two histograms provided below. Both groups possess an identical mean weight of 267 pounds, yet the data points in Group A exhibit greater dispersion than those in Group B.

Fig. 1.4 Histograms of Group A and Group B

This section will examine five measures of dispersion: range, variance, standard deviation, standard error, and coefficient of variation.

Range

The range of a variable is determined by subtracting the smallest value from the largest value within a quantitative dataset, making it the most basic measure, as it relies solely on these two extreme values.

Problem 1.5 Find the range for the given dataset: 12, 29, 32, 34, 38, 49, 57

Range = 57 - 12 = 45

Variance

Variance is calculated by determining the squared differences between each value and the arithmetic mean.
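The centre and spread measures introduced so far can be reproduced with Python's statistics module. The datasets below are the book's own (Problems 1.3, 1.4, and 1.5); the fish-type list for the mode is an invented illustration of the categorical case:

```python
import statistics

# Problem 1.3: odd number of observations -> middle value
odd_data = [23, 27, 29, 31, 35, 39, 40, 42, 44, 47, 51]
print(statistics.median(odd_data))   # 39

# Problem 1.4: even number of observations -> average of the two middle values
even_data = [23, 27, 29, 31, 35, 39, 40, 42, 44, 47]
print(statistics.median(even_data))  # 37.0

# Mode: most frequent value; works for categorical data where mean/median do not
print(statistics.mode(["bass", "carp", "bass", "trout"]))  # bass

# Problem 1.5: range = largest value minus smallest value
range_data = [12, 29, 32, 34, 38, 49, 57]
print(max(range_data) - min(range_data))  # 45
```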
This method considers both positive and negative deviations. The sample variance (s²) serves as an unbiased estimator of the population variance (σ²), with (n - 1) degrees of freedom. Degrees of freedom can be generally defined as the number of values involved minus the number of parameters estimated on the way to obtaining the particular estimate. The sample variance is considered unbiased because of the adjustment in the denominator: if we used n instead of (n - 1), we would consistently underestimate the actual population variance. To rectify this bias, we use (n - 1).

Population variance: σ² = Σ(xᵢ - μ)²/N
Sample variance: s² = Σ(xᵢ - x̄)²/(n - 1)

Problem 1.6 Compute the variance of the sample data: 3, 5, 7. The sample mean is 5.

s² = [(3 - 5)² + (5 - 5)² + (7 - 5)²]/(3 - 1) = 4

Standard Deviation

The standard deviation is the square root of the variance and is applicable to both population and sample data. Although the sample variance serves as a positive and unbiased estimator for the population variance, it is important to note that variance units are squared. The standard deviation is a frequently used numerical method that describes the spread of a variable's distribution. Specifically, the population standard deviation is denoted σ (sigma), while the sample standard deviation is represented by s.

Population standard deviation: σ = √σ²
Sample standard deviation: s = √s²

Problem 1.7 Compute the standard deviation of the sample data: 3, 5, 7, with a sample mean of 5.

s = √4 = 2

Standard Error of the Means

Frequently, we utilize the sample mean, denoted x̄, to estimate the population mean, μ. For instance, if we aim to estimate the heights of eighty-year-old cherry trees, we can follow these steps:

* Randomly select 100 trees.
* Calculate the sample mean of these 100 heights.
* Utilize this computed mean as our estimate for the population mean.
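The (n - 1) denominator discussed above is exactly what Python's statistics.variance implements. A quick check on the data of Problems 1.6 and 1.7, with the biased n-denominator version shown for contrast:

```python
import statistics

data = [3, 5, 7]

# Sample variance uses the (n - 1) denominator (unbiased estimator):
s2 = statistics.variance(data)  # [(3-5)^2 + (5-5)^2 + (7-5)^2] / 2 = 4
# Sample standard deviation is its square root:
s = statistics.stdev(data)      # sqrt(4) = 2

# Dividing by n instead (the population formula) gives a smaller value,
# which is why the raw average of squared deviations underestimates sigma^2:
pop_var = statistics.pvariance(data)  # 8 / 3, about 2.67

print(s2, s, round(pop_var, 2))
```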
However, it is important to recognize that our sample of 100 trees is merely one of numerous potential samples, all of the same size, that could have been randomly chosen. Imagine we took a series of distinct random samples from the same population, each with the same sample size:

Sample 1: compute sample mean x̄
Sample 2: compute sample mean x̄
Sample 3: compute sample mean x̄
And so on.

With each new sample, we might obtain a different result because we are using a distinct subset of data to calculate the sample mean. This demonstrates that the sample mean is, in fact, a random variable.

The sample mean (x̄) is a random variable that possesses its own probability distribution, known as the sampling distribution of the sample mean. This distribution has a mean identical to μ and a standard deviation equal to the standard error, σ/√n.

The standard error serves as the standard deviation of all conceivable sample means. In practical terms, we would typically take only one sample, but it is crucial to comprehend and quantify the variability that occurs between different samples. The estimated standard error is

s(x̄) = s/√n

Note: s² is the sample variance and s is the sample standard deviation.

Problem 1.8 Describe the distribution of the sample mean. A population of fish has weights that are normally distributed with μ = 8 lb and σ = 2.6 lb.

If you take a sample of size n = 6, the sample mean will have a normal distribution with a mean of 8 lb and a standard deviation (standard error) of 2.6/√6 = 1.061 lb.

If you increase the sample size to 10, the sample mean will be normally distributed with a mean of 8 lb and a standard deviation (standard error) of 2.6/√10 = 0.822 lb.

Notice how the standard error decreases as the sample size increases.

The Central Limit Theorem (CLT)

The Central Limit Theorem (CLT) states that as the sample size increases, the distribution of sample means approaches a normal distribution.
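The standard errors of Problem 1.8 can be reproduced directly, and a small simulation illustrates the point that the spread of sample means shrinks as n grows. This is a sketch: the normal population model, the replication count, and the seed are my choices for the demo.

```python
import math
import random
import statistics

def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)

# Problem 1.8: fish weights with mu = 8 lb, sigma = 2.6 lb
print(round(standard_error(2.6, 6), 3))   # 1.061
print(round(standard_error(2.6, 10), 3))  # 0.822

# Simulation sketch: the spread of simulated sample means tracks sigma/sqrt(n).
random.seed(1)

def spread_of_means(n, reps=2000):
    """Standard deviation of `reps` simulated sample means of size n."""
    means = [statistics.mean(random.gauss(8, 2.6) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

print(spread_of_means(40) < spread_of_means(5))  # larger n -> smaller spread
```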
In situations where the underlying distribution of the random variable is unknown or poorly understood, the CLT provides assurance that the distribution of sample means (x̄) becomes increasingly normal with larger sample sizes (n). As a rule of thumb, a commonly accepted guideline suggests that a sample size of n > 30 is typically sufficient for this approximation. The Central Limit Theorem states that as the sample size grows, the sampling distribution of the sample mean becomes approximately normal, irrespective of the shape of the original population distribution.

Coefficient of Variation

Comparing standard deviations across diverse populations or samples proves challenging due to the influence of units of measurement. The coefficient of variation addresses this issue by representing the standard deviation as a percentage of the sample or population mean, making it a unitless measure.

Population data: CV = (σ/μ) × 100
Sample data: CV = (s/x̄) × 100

Problem 1.9 Fisheries biologists conducted research on Pacific salmon, focussing on their length and weight. They gathered a random sample and calculated the mean and standard deviation for both length and weight (provided below). Although the standard deviations are somewhat alike, the challenge lies in comparing variability due to the different units of measurement between lengths and weights. By computing the coefficient of variation for each variable, the biologists can effectively assess which variable exhibits the higher relative variability, regardless of unit discrepancies.

          Sample mean    Sample standard deviation
Length    63 cm          19.97 cm
Weight    37.6 kg        19.39 kg

CV(length) = (19.97/63) × 100 = 31.7%
CV(weight) = (19.39/37.6) × 100 = 51.6%

There is greater variability in the weight of Pacific salmon as compared to its length.

Variability

Variability can be assessed through diverse approaches.
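The coefficient-of-variation comparison in Problem 1.9, in code (the helper name is mine):

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean (unitless)."""
    return sd / mean * 100

# Problem 1.9: Pacific salmon length (cm) and weight (kg)
cv_length = coefficient_of_variation(19.97, 63)
cv_weight = coefficient_of_variation(19.39, 37.6)
print(round(cv_length, 1))  # 31.7
print(round(cv_weight, 1))  # 51.6
```

Because both results are percentages, they can be compared directly even though one variable is measured in centimetres and the other in kilograms.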
The standard deviation measures the dispersion between individual data points in a sample, revealing the diversity among the sampled units. In contrast, the coefficient of variation evaluates point-to-point variability relative to the mean, remaining independent of the units of measurement. Meanwhile, the standard error quantifies variability between different samples, capturing the diversity across the various samples within the sampling procedure. Typically, when dealing with a single sample, the standard error provides a means to quantify the inherent uncertainty in our sampling process.

Problem 1.10 Consider the following tally from 11 sample plots on Heiburg Forest, where X is the number of downed logs per acre. Compute the basic statistics for the sample plots.

Table 1.1 Sample data on number of downed logs per acre from Heiburg Forest

    x      x²      x − x̄     (x − x̄)²    Rank
   25     625     −7.27      52.85       4
   35    1225      2.73       7.45       6
   55    3025     22.73     516.65      10
   15     225    −17.27     298.25       2
   40    1600      7.73      59.75       8
   25     625     −7.27      52.85       5
   55    3025     22.73     516.65      11
   35    1225      2.73       7.45       7
   45    2025     12.73     162.05       9
    5      25    −27.27     743.65       1
   20     400    −12.27     150.55       3
  Sum: Σx = 355   Σx² = 14025   Σ(x − x̄) = 0.0   Σ(x − x̄)² = 2568.18

1. Sample mean: x̄ = Σx/n = 355/11 = 32.27
2. Median = 35
3. Variance: s² = [Σx² − (Σx)²/n]/(n − 1) = (14025 − 355²/11)/10 = 2568.18/10 = 256.82
4. Standard deviation: s = √256.82 = 16.0256
5. Range: 55 − 5 = 50
6. Coefficient of variation: CV = (s/x̄) × 100 = (16.0256/32.27) × 100 = 49.66%
7. Standard error: s/√n = 16.0256/√11 = 4.8319

1.4 GRAPHICAL REPRESENTATION

Data organization and summarization can be accomplished through both graphical and numerical methods. Tables and graphs offer a rapid overview of the collected information and facilitate the presentation of the data utilized in the project. Although numerous types of graphics are available, this chapter will concentrate on a selected few commonly employed tools.

1.4.1 Pie Charts

Pie charts serve as an effective visual aid for a quick understanding of relationships between categories.
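The basic statistics requested in Problem 1.10 can be verified with Python's standard statistics module (a sketch; statistics.variance and statistics.stdev use the sample divisor n − 1, matching the formulas in the worked solution):

```python
import statistics

# Downed logs per acre from the 11 Heiburg Forest sample plots (Problem 1.10).
logs = [25, 35, 55, 15, 40, 25, 55, 35, 45, 5, 20]

n = len(logs)
mean = statistics.mean(logs)          # 355 / 11
median = statistics.median(logs)
variance = statistics.variance(logs)  # sample variance, divisor n - 1
std_dev = statistics.stdev(logs)
data_range = max(logs) - min(logs)
cv = std_dev / mean * 100
std_error = std_dev / n ** 0.5

print(round(mean, 2), median, round(variance, 2))
print(data_range, round(cv, 2), round(std_error, 4))
```

The printed values reproduce the hand computation above: mean 32.27, median 35, variance 256.82, range 50, CV 49.66% and standard error 4.8319.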
It is crucial to provide distinct labels for each category, and often, including the frequency or relative frequency can enhance understanding. Nonetheless, it is essential to exercise caution when incorporating numerous categories, as this can lead to confusion. Avoid overloading a pie chart with excessive information. The first pie chart offers a clear depiction of fish types in relation to the entire sample. Conversely, the second pie chart, with its abundance of categories, proves more challenging to interpret. Selecting the most appropriate graphical representation is imperative when conveying information to the reader.

Fig. 1.5 Comparison of pie charts

1.4.2 Bar Charts and Histograms

Bar charts provide a graphical representation of the distribution of a qualitative variable, such as fish type, while histograms are used to depict the distribution of a quantitative variable, whether discrete or continuous, like bear weight.

Fig. 1.6 Comparison of a bar chart for qualitative data and a histogram for quantitative data

In both scenarios the bars have uniform widths, and the y-axis is distinctly defined. For qualitative data, each category corresponds to a distinct bar. When working with continuous data, it is essential to establish consistent class widths by defining lower and upper class limits. There should be no gap between these classes, ensuring that each observation falls into precisely one class without any overlap.

1.4.3 Boxplots

A box plot, also known as a box-and-whisker plot, is a graphical representation that displays the distribution of a dataset's values. Boxplots employ the 5-number summary (the minimum and maximum values along with the three quartiles) to depict the central tendency, spread, and data distribution.
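As a sketch of the 5-number summary that a boxplot draws, the following uses Python's statistics.quantiles (its default "exclusive" method is one common convention for quartiles; the dataset is the one from exercise 6 at the end of this chapter):

```python
import statistics

# Dataset from exercise 6 at the end of this chapter.
data = sorted([3, 5, 8, 8, 9, 11, 12, 12, 13, 13, 16])

# statistics.quantiles with n=4 returns the three quartiles Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
five_number_summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1  # the box in a boxplot spans this interquartile range

print(five_number_summary, iqr)
```

The middle value Q2 is the median (the line inside the box), while the whiskers extend from the box toward the minimum and maximum.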
When combined with histograms, they provide a comprehensive description of the data, presenting both numerical and visual insights.

A boxplot consists of a rectangular box that represents the interquartile range (IQR), which spans the middle 50% of the data. Inside the box, a line indicates the median (the middle value when the data are sorted). Two additional lines, the whiskers, extend from the box to represent the range of the data beyond the quartiles; points beyond the whiskers can indicate potential outliers or extreme values.

Box plots are useful for visualizing the central tendency, spread, and skewness of a dataset and can help identify outliers, i.e., data points that deviate significantly from the norm. They are a valuable tool in exploratory data analysis and are often used in statistics to compare and summarize data from different groups or populations. In cases where the data is symmetric, both the histogram and the boxplot exhibit a bell-shaped, reasonably symmetrical pattern.

Fig. 1.7 A histogram and boxplot of a normal distribution

In the context of left-skewed distributions, the histogram visually appears to be shifted or "pulled" towards the left side. When observing the corresponding boxplot, one can notice that the first quartile (Q1) is positioned farther from the median, as is the minimum value. Additionally, the left whisker of the boxplot is notably longer than the right whisker.

Fig. 1.8 A histogram and boxplot of a skewed left distribution

In the case of right-skewed distributions, the histogram exhibits a visual shift or "pull" towards the right side. When examining the corresponding boxplot, you will notice that the third quartile (Q3) is positioned farther from the median, as is the maximum value.
Furthermore, the right whisker in the boxplot is noticeably longer than the left whisker.

Fig. 1.9 A histogram and boxplot of a skewed right distribution

1. A sample contains the following data values: 1.50, 1.50, 10.50, 3.40, 10.50, 11.50, and 2.00. Find the mean, median, and variance of this data.
2. Set 1: 5, 13, −2, −1, 19, 27. Set 2: 6, −3, 23, 15, m, 1. What should the value of m be if the ranges of both sets of numbers are equal?
3. Find the coefficient of variation of the following sample set of numbers: {1, 5, 6, 8, 10, 40, 65, 88}.
4. Brian asks 60 people about their favourite colours and separates their answers into 5 categories. The results are shown in the table below. Draw a pie chart to display the results.

     Colour      Red   Blue   Green   Yellow   Other
     Frequency   10    13     24      5        8

5. A survey asked 90 people how many bathrooms there are in their homes, and Freddie makes a pie chart to display the results.
   (a) Which number of bathrooms in a house was most common in this survey?
   (b) Find the number of people who have 1 bathroom in their house.
   (c) If Freddie picks someone at random from the group surveyed, what is the probability that he chooses a person with exactly 1 bathroom in their home?
6. Construct a box plot for the following dataset: 3, 5, 8, 8, 9, 11, 12, 12, 13, 13, 16.

CHAPTER 2
PROBABILITY THEORY AND PROBABILITY DISTRIBUTION

* Introduction
* Types of Random Variables
* Jointly Distributed Random Variables
* Independent and Dependent Random Variables
* Expectation of the Sum of the Random Variables
* Expectation of the Product of the Random Variables
* Probability Distribution

2.1 INTRODUCTION

In random experiments, we are interested in the numerical outcomes, i.e., numbers associated with the outcomes of the experiment.
For example, when 50 coins are tossed, we ask for the number of heads. Whenever we associate a real number with each outcome of a trial, we are dealing with a function whose range is the set of real numbers. Such a function is called a random variable (r.v.), chance variable, stochastic variable or simply a variable.

Definition: Quantities which vary with some probabilities are called random variables.

Definition: By a random variable we mean a real number associated with the outcomes of a random experiment.

Example 2.1 Suppose two coins are tossed simultaneously; then the sample space is S = {HH, HT, TH, TT}. Let X denote the number of heads. If X = 0, the outcome is {TT} and P(X = 0) = 1/4. If X takes the value 1, the outcome is {HT, TH} and P(X = 1) = 2/4. If X takes the value 2, the outcome is {HH} and P(X = 2) = 1/4. The probability distribution of this random variable X is given by the following table:

  X     0     1     2     Total
  P    1/4   2/4   1/4      1

Example 2.2 Out of 24 mangoes, 6 are rotten; 2 mangoes are drawn. Obtain the probability distribution of the number of rotten mangoes that can be drawn.

Let X denote the number of rotten mangoes drawn; then X can take the values 0, 1, 2.

P(X = 0) = 18C2/24C2 = (18 × 17)/(24 × 23) = 51/92
P(X = 1) = (18C1 × 6C1)/24C2 = (18 × 6)/276 = 9/23
P(X = 2) = 6C2/24C2 = (6 × 5)/(24 × 23) = 5/92

  X      0       1       2      Total
  P    51/92   9/23    5/92       1

2.2 TYPES OF RANDOM VARIABLES

There are two types of random variables:
(i) Discrete random variables
(ii) Continuous random variables

Distribution function: Let X be a one-dimensional random variable. The function F defined for all x by F(x) = P(X ≤ x) is called the distribution function (cumulative distribution function, c.d.f.) of X. Events of the form [X < x], [X ≤ x], [X > x] and [X ≥ x] are called tail events. For distinction, we may label them open, closed, upper and lower tails. Often, simple r.v.'s are expanded as linear combinations of tail events.

Some Properties of a c.d.f.
(1) P(a < X ≤ b) = F(b) − F(a).
(2) A function F cannot be a distribution function if F(x) > 1 for some x.
(3) A function F cannot be a distribution function if F is not non-decreasing, i.e., if F is decreasing anywhere.

2.2.1 Discrete Random Variables

Quantities which can take only integer values are called discrete random variables.

Examples: The number of children in a family of a colony. The number of rooms in the houses of a township.

Probability mass function (probability distribution)

Definition: Let X be a discrete random variable taking values x = 0, 1, 2, 3, ...; then P(X = x) is called the probability mass function of X and it satisfies the following:
(i) P(X = x) ≥ 0
(ii) Σ_{x=0}^{∞} P(X = x) = 1

Discrete distribution function

A r.v. X is said to be discrete if there exist a countable number of points x₁, x₂, x₃, ... with p(xᵢ) ≥ 0 such that Σᵢ p(xᵢ) = 1, and F(x) = Σ_{xᵢ ≤ x} p(xᵢ).

Finite equiprobable space (uniform space)

A finite equiprobable space is a finite probability distribution where each sample point x₁, x₂, x₃, ..., xₙ has the same probability, i.e., P(X = xᵢ) = pᵢ = a constant for all i, with Σᵢ pᵢ = 1.

Example 2.4 A random variable X has the following probability distribution:

  x      0   1    2    3    4    5     6     7     8
  p(x)   k   3k   5k   7k   9k   11k   13k   15k   17k

(a) Determine the value of k.
(b) Find P(X < 4), P(X ≥ 5) and P(0 < X < 4).
(c) Find the c.d.f.
(d) Find the largest value of x for which F(x) < 1/2.

Since Σ p(x) = 1, we have 81k = 1, so k = 1/81. Then P(X < 4) = (1 + 3 + 5 + 7)k = 16/81, P(X ≥ 5) = (11 + 13 + 15 + 17)k = 56/81, and P(0 < X < 4) = (3 + 5 + 7)k = 15/81.

Definition (Expectation): E(X) = Σ x f(x) if X is discrete, and E(X) = ∫ x f(x) dx if X is continuous.

Definition (Variance): Variance characterizes the variability in a distribution, since two distributions with the same mean can still have different dispersion of data about their means. The variance of a r.v. X is

σ² = E[(X − µ)²] = Σ (x − µ)² f(x) for X discrete,
σ² = E[(X − µ)²] = ∫ (x − µ)² f(x) dx for X continuous.

Standard deviation: The standard deviation, denoted by σ (S.D.), is the positive square root of the variance. Expanding,

σ² = E[(X − µ)²] = Σ (x − µ)² f(x)
   = Σ (x² − 2µx + µ²) f(x)
   = Σ x² f(x) − 2µ Σ x f(x) + µ² Σ f(x)
   = E(X²) − 2µ·µ + µ²·1
   = E(X²) − µ²
since Σ x f(x) = µ and Σ f(x) = 1.

2.2.2 Continuous Random Variables

Let X be a continuous random variable taking values x, with probability density function f(x). The density satisfies:
1. f(x) ≥ 0 for all x, i.e., the graph of f(x) lies on or above the x-axis.
2. The total area under the curve is 1, i.e., ∫ f(x) dx = 1 over the whole range.
3. The area under the probability curve y = f(x) bounded by x = a and x = b is simply P(a < X < b).

2.3 JOINTLY DISTRIBUTED RANDOM VARIABLES

Example 2.5 When a card is drawn from an ordinary deck, it may be characterized according to its suit in some order, say clubs, diamonds, hearts and spades; let X index the suit and let Y be a variate that assumes the values 1, 2, 3, ..., 13, which correspond to the denominations Ace, 2, 3, ..., 10, J, Q, K. Then (X, Y) is a 2-dimensional variate. The probability of drawing a particular card is denoted by f(x, y), and if each card is equally likely to be drawn, the density of (X, Y) is

f(x, y) = 1/52 for all 1 ≤ x ≤ 4, 1 ≤ y ≤ 13.

For the joint distribution function F(x, y) of (X, Y), two useful facts are:

Tail events: probabilities of events such as {X > x} follow from F, e.g., P(X > x) = 1 − P(X ≤ x).

Rectangle rule: Let a, b, c, d be any real numbers with a < b and c < d. Then P(a < X ≤ b, c < Y ≤ d) = F(b, d) − F(a, d) − F(b, c) + F(a, c).

Mathematical expectation: Let X be a discrete random variable taking values x = 0, 1, 2, 3, ...; then the mathematical expectation of X is denoted by E(X) (read as the expected value of X) and is defined as

E(X) = Σ x P(X = x).

Similarly, if k is any positive integer, then E(X^k) = Σ x^k P(X = x).

If X is a continuous random variable taking values x, −∞ < x < ∞, with f(x) as the probability density function, then E(X) = ∫ x f(x) dx and E(X^k) = ∫ x^k f(x) dx.

Definition: E(X) is also called the mean or arithmetic mean of X, denoted by µ.

Definition: If X is a random variable, then the variance of X is denoted by V(X) and is defined as V(X) = E[(X − E(X))²]. This can be simplified to V(X) = E(X²) − [E(X)]². Notation: the variance is denoted by σ² = V(X).

Standard deviation: The positive square root of the variance is defined as the standard deviation and is denoted by σ.
Therefore, σ = √V(X).

2.4 INDEPENDENT AND DEPENDENT RANDOM VARIABLES

Two random variables X and Y are called independent if for every pair of real numbers x and y, the two events {X ≤ x} and {Y ≤ y} are independent. That is,

P(X ≤ x, Y ≤ y) = P(X ≤ x) · P(Y ≤ y).

2.5 EXPECTATION OF THE SUM OF THE RANDOM VARIABLES

Theorem: The mathematical expectation of the sum of two random variables is the sum of their expectations.

Proof: With pᵢⱼ denoting the joint probabilities of (X, Y),

E(X + Y) = Σᵢ Σⱼ (xᵢ + yⱼ) pᵢⱼ = Σᵢ xᵢ pᵢ + Σⱼ yⱼ p′ⱼ = E(X) + E(Y).

By generalization of the above theorem, we have

E(x₁ + x₂ + x₃ + ... + xₙ) = E(x₁) + E(x₂) + E(x₃) + ... + E(xₙ).

This completes the proof of the theorem.

2.6 EXPECTATION OF THE PRODUCT OF THE RANDOM VARIABLES

The mathematical expectation of the product of a number of independent random variates is equal to the product of their expectations.

Proof: Let us write E(xy) = Σᵢ Σⱼ xᵢ yⱼ pᵢⱼ. Since the variates are independent, by the law of compound probabilities we have pᵢⱼ = pᵢ p′ⱼ, so

E(xy) = Σᵢ Σⱼ xᵢ yⱼ pᵢ p′ⱼ = (Σᵢ xᵢ pᵢ)(Σⱼ yⱼ p′ⱼ) = E(x) E(y).

The theorem can be generalized to a number of independent random variates:

E(x₁ x₂ ... xₙ) = E(x₁) · E(x₂) · E(x₃) ... E(xₙ).

This completes the proof of the theorem.

Note: E(xy) = E(x) E(y) does not guarantee the independence of x and y.

One can easily verify the following expectations:
(a) E(a) = a
(b) E(aX) = a E(X)
(c) E(aX ± bY) = a E(X) ± b E(Y)
(d) E(aX + b) = a E(X) + b
(e) V(a) = 0
(f) V(aX ± b) = a² V(X)
(g) V(X) = E(X²) − [E(X)]²
(h) V(aX + bY) = a²σ²_X + b²σ²_Y + 2ab σ_XY

Example 2.6 Two coins are tossed simultaneously. Let X denote the number of heads. Find E(X) and V(X).

Solution:

  X = x        0     1     2     Total
  P(X = x)    1/4   2/4   1/4      1

Mean: µ = E(X) = 0 · (1/4) + 1 · (2/4) + 2 · (1/4) = 1.

Variance: σ² = V(X) = E(X²) − [E(X)]² = 0² · (1/4) + 1² · (2/4) + 2² · (1/4) − 1² = 3/2 − 1 = 1/2.

Hence the solution.

Example 2.7 If it rains, a dealer in raincoats earns ₹500 per day, and if it is fair, he loses ₹50 per day. If the probability of a rainy day is 0.4, find his average daily income.

Solution:

  t    500    −50    Total
  P    0.4    0.6      1

Average: E(X) = 500 (0.4) + (−50)(0.6) = 200 − 30 = ₹170.

Hence the solution.
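Examples 2.6 and 2.7 can be reproduced with a small helper (a sketch with our own naming; exact fractions avoid any rounding):

```python
from fractions import Fraction

def mean_and_variance(dist):
    """Return (E[X], V[X]) for a finite distribution given as {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, var

# Example 2.6: X = number of heads in two coin tosses.
heads = {0: Fraction(1, 4), 1: Fraction(2, 4), 2: Fraction(1, 4)}
mu, var = mean_and_variance(heads)

# Example 2.7: daily income of the raincoat dealer.
income = {500: Fraction(4, 10), -50: Fraction(6, 10)}
avg_income, _ = mean_and_variance(income)

print(mu, var, avg_income)  # → 1 1/2 170
```

The helper computes V(X) directly as E[(X − µ)²]; by property (g) above this agrees with E(X²) − [E(X)]².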
2.7 PROBABILITY DISTRIBUTION

2.7.1 Binomial Distribution

This distribution was discovered by James Bernoulli. It is a discrete distribution. It occurs in cases of repeated trials, such as students writing an examination, births in a hospital, etc. Here all the trials are assumed to be independent, and each trial has only two outcomes, namely success and failure.

Let an experiment consist of n independent trials and let it succeed x times. Let p be the probability of success and q the probability of failure in each trial, so that p + q = 1.

The probability of getting x successes = p · p · ... · p (x times) = p^x.
The probability of getting (n − x) failures = q · q · ... · q (n − x times) = q^(n−x).

From the multiplication theorem, the probability of getting x successes and (n − x) failures in one particular combination is p^x q^(n−x). There are nCx such mutually exclusive combinations, each with probability p^x q^(n−x). From the addition theorem, the probability of getting x successes is nCx p^x q^(n−x).

Notation: b(x; n, p) denotes a binomial distribution with x successes, n trials and p as the probability of success:

b(x; n, p) = nCx p^x q^(n−x),  x = 0, 1, 2, 3, ..., n.

Parameters of the Binomial Distribution

In b(x; n, p) there are 3 constants, viz. n, p and q. Since q = 1 − p, there are only 2 independent constants, namely n and p. These are called the parameters of the binomial distribution.

Note: Since b(x; n, p) is the (x + 1)th term in the binomial expansion of (q + p)^n, this distribution is called the 'Binomial Distribution'.

Mean of the Binomial Distribution

Mean = µ = Σ_{x=0}^{n} x · b(x; n, p)
= >) x?b(x,n,p) x=0 = Sbex-1)+ xiblx,n, p) 0 ¥, xb b(x,n, p) +} xblx,n,p) x=0 x=0 = . n x (nx) - Bre Vani +H pF) + np Puty=x-2 .. x=2+y { When x = 2 implies y = 0 When x = n implies y = n-2 (1-2)! gine») = nin type nin MPO Tin B— ol ’ +np n-2 = nln Dp? D.C, pig"? + mp yb r Probability Theory and Probability Distribution || 31 || = n(n ~ 1)p(q + p}"-? + np EQ@) = n(n - 1)p? + np o? = E(X?) - [E(X)]? = n(n — 1)p? + np - n2p? p(1 — p) = npq ". The variance of binomial distribution is npq. The standard deviation iso = mpq oe Moments of Binomial Distribution Mean yi’ = yt = E[X] = 5, xP(x) = yx "pq? x0 nt x gir-x) peg = atn=Ittx-tygin-x) Lea 2 = nplp + q)!"-) = np(1)"-D= np E[X2] = p'y = B, x2P(x) by definition = Dx? c,p%q—" x0 x0 w= DY txlx- +0 "Cp" ug = Yo xtx-1) "C,p*a"™) + np = Consider 13 = Yoxtx—1)"C,p%qe™ x=0 " . ! (nx) Hp = Yeas qn) ee — || 32 |] Statistics, Statistical Modelling and Data Analytics 3 = a Xqin-x) 4 Die Bina?" m2 oS nn=1) N= 2)! og 2) tne wy = yee 2 pl 2i gir» Sg lx -2)in— x) Wy =n ) pq + pi” (n= 1 u = n(n 1) p H’g = n(n~ 1) p? + np hy (variance) = yw} = n(n~ 1)p? + np - (np)? = np? — np? + np - np? = np(1-p) = npq [since, 1 - p = q] np > npq [since q is a fraction] Mean > Variance Similarly lig = npq(1 - 2p) 2 22g? 2 2 Hence, p= Wg. mp a ley _O=p=py BS (npq)' npq 1 When p=. a74 Therefore, 8, = 0 Case (i) when p= hh 0 Case (ii) when noo. then By= Standard deviation = npg Skewness = npq Moment Generating Function of Binomial Distribution Let X be a variable following binomial distribution, then M,(t) = E(e*) = Y)%-0 px) ree Probability Theory and Probability Distribution Il. 33 || = Le "cart Inox) a ‘a + pe'y” M.G.F about Mean of Binomial Distribution fell} = Efex = e*P E(e*) en? M,(t) e"r(q + pe (qe + petryn = (ge*+ petp)n = (ge + pe)” 3 t tt 3 Pala — P)+ = pall -3pq) + 3i 2 p e Z "C, eebnndeeet bee Since we have, a? + b? = (a +b) (a? - ab + b’); p? + q? = (p + q) (p?- pq + q?) 
= (1)[(p + q)² − 3pq] = 1 − 3pq.

Collecting terms: µ₂ = coefficient of t²/2! = npq; µ₃ = coefficient of t³/3! = npq(q − p); and

µ₄ = coefficient of t⁴/4! = npq(1 − 3pq) + 3n(n−1)p²q².

Additive Property of the Binomial Distribution

Let X ~ b(n₁, p₁) and Y ~ b(n₂, p₂) be independent random variables. Then

M_X(t) = (q₁ + p₁e^t)^{n₁},  M_Y(t) = (q₂ + p₂e^t)^{n₂}.

What is the distribution of X + Y? Since X and Y are independent,

M_{X+Y}(t) = M_X(t) M_Y(t) = (q₁ + p₁e^t)^{n₁} (q₂ + p₂e^t)^{n₂}.

Since this cannot in general be expressed in the form (q + pe^t)^n, it follows from the uniqueness theorem of m.g.f.s that X + Y is not a binomial variate. Hence, in general, the sum of two independent binomial variates is not a binomial variate. In other words, the binomial distribution does not possess the additive or reproductive property. However, if we take p₁ = p₂ = p, we get

M_{X+Y}(t) = (q + pe^t)^{n₁+n₂},

which is the m.g.f. of a binomial variate with parameters (n₁ + n₂, p). Hence, by the uniqueness theorem of m.g.f.s, X + Y ~ b(n₁ + n₂, p). Thus the binomial distribution possesses the additive or reproductive property if p₁ = p₂.

Generalization: If Xᵢ ~ b(nᵢ, p) for i = 1, 2, ..., k are independent, then Σᵢ Xᵢ ~ b(Σᵢ nᵢ, p).

Recurrence Relation for the Probabilities of the Binomial Distribution (Fitting of the Binomial Distribution)

We have P(x + 1) = nC_{x+1} p^{x+1} q^{n−x−1} and P(x) = nCx p^x q^{n−x}, so

P(x + 1)/P(x) = [x!(n−x)!/((x+1)!(n−x−1)!)] · (p/q) = [(n − x)/(x + 1)] · (p/q),

and therefore

P(x + 1) = [(n − x)/(x + 1)] (p/q) P(x),

which is the required recurrence formula.

Example 2.8 It has been claimed that in 60% of solar-heat installations the utility bill is reduced by at least one-third. Accordingly, what are the probabilities that the utility bill will be reduced by at least one-third in (i) four of five installations, (ii) at least four of five installations?
Solution: n = 5, p = 0.6, q = 1 − p = 0.4.

(i) b(4; 5, 0.6) = 5C4 (0.6)⁴ (0.4)¹ = 5(0.6)⁴(0.4) = 0.2592.

(ii) "At least 4" means 4 or 5. Since b(5; 5, 0.6) = 5C5 (0.6)⁵ (0.4)⁰ = 0.0778, the probability in at least four installations is b(4; 5, 0.6) + b(5; 5, 0.6) = 0.2592 + 0.0778 = 0.337.

Hence the solution.

Example 2.9 Ten coins are tossed simultaneously. Find the probability of getting at least seven heads.

Solution: n = 10, p = P(H) = 1/2, q = 1/2.

P(X ≥ 7) = P(X = 7) + P(X = 8) + P(X = 9) + P(X = 10)
= [10C7 + 10C8 + 10C9 + 10C10]/2¹⁰
= (120 + 45 + 10 + 1)/1024 = 176/1024 = 0.172.

Hence the solution.

Example 2.10 If 3 of 20 tyres are defective and 4 of them are randomly chosen for inspection, what is the probability that only one of the defective tyres will be included?

Solution: n = 4, p = 3/20, q = 1 − p = 17/20.

P(X = 1) = 4C1 (3/20)¹ (17/20)³ = 4 × (3/20) × (17/20)³ = 0.368.

Hence the solution.

Chebyshev's Theorem: Let X be a random variable with mean µ and standard deviation σ; then

P(|X − µ| ≥ kσ) ≤ 1/k².

Proof: Let f(x) be the probability mass function of a random variable having mean µ and variance σ², so that

σ² = Σ (x − µ)² f(x).

Let R₁ be the region in which x < µ − kσ, R₂ the region in which µ − kσ ≤ x ≤ µ + kσ, and R₃ the region in which x > µ + kσ.

Fig. 2.3 Chebyshev's theorem

Splitting the sum over the three regions,

σ² = Σ_{R₁} (x − µ)² f(x) + Σ_{R₂} (x − µ)² f(x) + Σ_{R₃} (x − µ)² f(x).

Since (x − µ)² ≥ 0 and f(x) ≥ 0, the sum over R₂ is non-negative, hence

σ² ≥ Σ_{R₁} (x − µ)² f(x) + Σ_{R₃} (x − µ)² f(x).

In R₁, x < µ − kσ, so x − µ < −kσ; in R₃, x > µ + kσ, so x − µ > kσ. Therefore |x − µ| > kσ in both R₁ and R₃, and in both regions (x − µ)² ≥ k²σ². Thus

σ² ≥ Σ_{R₁} k²σ² f(x) + Σ_{R₃} k²σ² f(x),  i.e.,  σ² ≥ k²σ² [Σ_{R₁} f(x) + Σ_{R₃} f(x)].
Dividing by k²σ²,

1/k² ≥ Σ_{R₁} f(x) + Σ_{R₃} f(x).

Now Σ_{R₁} f(x) + Σ_{R₃} f(x) represents the probability assigned to the region R₁ ∪ R₃, i.e., P(|x − µ| ≥ kσ). Hence

P(|X − µ| ≥ kσ) ≤ 1/k².

This completes the proof of the theorem.

Note: P(|X − µ| ≥ kσ) ≤ 1/k² may equivalently be written as P(|X − µ| < kσ) ≥ 1 − 1/k².

2.7.2 Poisson Distribution (Poisson Approximation to the Binomial Distribution)

Theorem: If n → ∞ and p → 0 in such a way that np = λ remains fixed, then

b(x; n, p) → e^{−λ} λ^x / x!  for x = 0, 1, 2, ...

Proof: Consider b(x; n, p) = nCx p^x q^(n−x) = [n(n−1)...(n−x+1)/x!] p^x q^(n−x). Given np = λ, we have p = λ/n and q = 1 − p = 1 − λ/n. Then

b(x; n, p) = [n(n−1)...(n−x+1)/n^x] (λ^x/x!) (1 − λ/n)^(n−x).

As n → ∞, the first factor tends to 1 and (1 − λ/n)^(n−x) → e^{−λ}, so

b(x; n, p) → (λ^x/x!) e^{−λ}.

This completes the proof of the Poisson approximation to the binomial distribution.

Notes:
1. The Poisson probability mass function is f(x, λ) = e^{−λ} λ^x/x!, x = 0, 1, 2, ...
2. Σ_{x=0}^{∞} f(x, λ) = 1, since Σ_{x=0}^{∞} e^{−λ} λ^x/x! = e^{−λ} e^{λ} = 1.
3. λ > 0 is called the parameter of the Poisson distribution.
4. P(X = 0) = e^{−λ}.

Applications of the Poisson Distribution

The Poisson distribution is applicable when n is very large and p is very small. Some of its applications are:
1. Number of faulty blades produced by a reputed firm.
2. Number of deaths from a disease such as heart attack or cancer.
3. Number of telephone calls received at a particular telephone exchange.
4. Number of cars passing a crossing per minute.
5. Number of printing mistakes in a page of a book.

Mean and Variance of the Poisson Distribution

Mean: µ = E(X) = Σ_{x=0}^{∞} x f(x, λ) = Σ_{x=1}^{∞} x e^{−λ} λ^x/x! = λ e^{−λ} Σ_{y=0}^{∞} λ^y/y! = λ e^{−λ} e^{λ} = λ.

So the mean is µ = λ. For the second moment,

E(X²) = Σ x² f(x, λ) = Σ [x(x−1) + x] f(x, λ) = Σ x(x−1) f(x, λ) + Σ x f(x, λ) = λ² e^{−λ} e^{λ} + λ = λ² + λ.

Variance: V(X) = σ² = E(X²) − [E(X)]² = λ² + λ − λ² = λ.

Standard deviation: S.D. = √variance = √λ.

Note: In a Poisson distribution the mean is always equal to the variance.

Moment Generating Function of the Poisson Distribution

M_X(t) = E(e^{tX}) = Σ_{x=0}^{∞} e^{tx} e^{−λ} λ^x/x! = e^{−λ} Σ_{x=0}^{∞} (λe^t)^x/x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.

Additive Property of Poisson Variates

Theorem: If X and Y are two independent Poisson variates with parameters λ and µ, then X + Y is also a Poisson variate with parameter λ + µ.
Proof: Since X is a Poisson variate with parameter λ, M_X(t) = e^{λ(e^t − 1)}. Similarly, since Y is a Poisson variate with parameter µ, M_Y(t) = e^{µ(e^t − 1)}. From the multiplicative property of the moment generating function of independent variates,

M_{X+Y}(t) = M_X(t) M_Y(t) = e^{λ(e^t − 1)} e^{µ(e^t − 1)} = e^{(λ + µ)(e^t − 1)},

which is the moment generating function of a Poisson variate with parameter λ + µ. Hence X + Y is also a Poisson variate with parameter λ + µ.

2.7.3 Poisson Process

Suppose we have to find the probability of x successes during a time interval T. Divide the time interval T into n equal parts of width δt, so that T = n·δt. Here we make the following assumptions:
(a) The probability of a success during an interval δt is α·δt.
(b) The probability of more than one success in a small interval δt is negligible.
(c) The probability of a success in the interval (t, t + δt) is independent of the actual time t and also of all previous successes.

The assumptions of the binomial distribution are then satisfied, with n = T/δt and p = α·δt. As n → ∞, the binomial distribution approaches the Poisson distribution, and here the parameter is λ = np = (T/δt)(α·δt) = αT. Therefore

P(X = x) = e^{−αT} (αT)^x / x!.

Fig. 2.4 Poisson process

Example 2.11 If the probability that an individual suffers a bad reaction due to a certain injection is 0.001, determine the probability that out of 2000 individuals (i) exactly 3, (ii) more than 2 individuals will suffer a bad reaction.

Solution: Given p = 0.001, n = 2000, λ = np = 2.

(i) P(X = 3) = e^{−λ} λ³/3! = e^{−2} 2³/3! = 0.1804.

(ii) P(X > 2) = 1 − P(X ≤ 2) = 1 − [P(X = 0) + P(X = 1) + P(X = 2)]
= 1 − e^{−2}[1 + 2 + 2²/2!] = 1 − 5e^{−2} = 0.323.

Hence the solution.

Example 2.12 A manufacturer of cotter pins knows that 5% of his product is defective.
If he sells cotter pins in boxes of 100 and guarantees that not more than 10 pins will be defective, what is the approximate probability that a box will fail to meet the guaranteed quality?

Solution: We are given n = 100, p = probability of a defective pin = 5% = 0.05, and λ = mean number of defective pins in a box of 100 = np = 100 × 0.05 = 5.

Since p is small, we may use the Poisson distribution. The probability of x defective pins in a box of 100 is

P(X = x) = e^{−λ} λ^x/x! = e^{−5} 5^x/x! for all x = 0, 1, 2, ...

The probability that a box will fail to meet the guaranteed quality is

P(X > 10) = 1 − P(X ≤ 10) = 1 − Σ_{x=0}^{10} e^{−5} 5^x/x!.

Hence the solution.

Example 2.13 10% of the bolts produced by a certain machine turn out to be defective. Find the probability that in a sample of 10 tools selected at random exactly two will be defective, using (i) the binomial distribution, (ii) the Poisson distribution, and comment upon the results.

Solution: Given p = 10/100 = 0.1, n = 10, λ = np = 1.

(i) Using the binomial distribution, with q = 1 − p = 1 − 0.1 = 0.9:

P(X = 2) = 10C2 p² q⁸ = 45 × (0.1)² (0.9)⁸ = 0.194.

(ii) Using the Poisson distribution:

P(X = 2) = e^{−λ} λ²/2! = e^{−1}/2 = 0.184.

Comment: There is a difference between the two probabilities because the Poisson distribution (P.D.) is an approximation to the binomial distribution (B.D.) and it is applicable for large n.

Hence the solution.

Example 2.14 A hospital switchboard receives an average of 4 emergency calls in a 10-minute interval. What is the probability that (i) there are at most 2 emergency calls and (ii) there are exactly 3 emergency calls in a 10-minute interval?

Solution: Given λ = 4.

(i) P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2) = e^{−4}[1 + 4 + 4²/2!] = 13e^{−4} = 0.238.

(ii) P(X = 3) = e^{−4} 4³/3! = 0.195.

Hence the solution.

2.7.4 Normal Distribution

The normal distribution is also a continuous distribution. A random variable X is said to follow the normal distribution with mean µ and variance σ² if its probability density function is given by
A random variable X is said to follow normal distribution (N.D?) with mean 1 and variance o? if its probability density function is given by i a er a Probability Theory and Probability Distribution {| 43 || fl fe-w | fix) = Lie | 20° eee erO oven -9 otherwise. wow la The corresponding distribution function is f(x) = I 1_ et 27 |* ~oV2n Let Z = ~— then the mean of Z is 0 and the variance is 1 o The corresponding probability density function is | pe Zis called standard normal variate. Nee Notation: 1. X ~ N(u, 0”) denotes that X is a normal variate with mean p: and variance 2. Z ~ N(0, 1) denotes that z is a standard normal variate with mean 0 and variance 1. Features of Normal Distribution curve The graph of f(x) is a bell shaped curve extending from - co to « with its peak at m =e Fig. 2.5 Gaussian distribution ((z) Note: 1, The mode of normal distribution is p. 2, The median of normal distribution is also 41, Hence, for a normal distribution the mean, median and mode coincide. a » Fig. 2.6 Area under the normal curve The area under the normal curve between the ordinates x = a and x = b gives the probability that the random variable X lies between a and b. 44 || Statistics, Statistical Modelling and Data Analytics be pe Put go kat ° So d= % - de=c.de 6 When x= a,2= 22! = ¢(say) 6 When = d (say) a # d Pla 6) (iv) PO 2) (b} Find P(- 1 < Y < 3) (c) Find P(X > 4|Y < 2) Acar hire firm has iwo cars which it hires out day by day. the number of demands for a car on each day is disiributed as a Poisson distribution with mean 1.5. calculate the proportion of days. (i) On which there is no demand. (ii) On which demand is refused (e® = 0.2231)? X is a normal variate with mean 30 and standard deviation 5. Find the probabilities that (i) 26 < X < 40, (ii) X= 45. In a normal distribution (N.D") 31% of the items are under 45 and 8% are over 63. Find the mean and variance of the distribution. 009 CHAPTER SAMPLING DISTRIBUTION . Tnbodulion . 
* Some Common Terms Used in Sampling Distributions
* Sampling Distributions

3.1 INTRODUCTION

The field of statistics encompasses the processes of collecting, presenting, analyzing, and utilizing data to inform decision-making and problem-solving. The primary goal of any statistical inquiry is to derive meaningful conclusions about a group of subjects under examination, referred to as the population. Given the challenges or impracticalities associated with scrutinizing an entire population, the concept of investigating only a representative subset, known as a sample, emerges. This exploration seeks to draw inferences about the entire population by leveraging insights gained from the sample, a method termed statistical inference.

The act of selecting samples is denoted as sampling, and a sample is deemed a faithful representation of the population when a probabilistic sampling method is employed. Notably, random sampling stands out as the most pivotal among probabilistic sampling techniques, ensuring that each population member has an equal likelihood of inclusion in the sample.

Samples play a crucial role in making inferences about populations by estimating key parameters such as the mean (µ), the standard deviation (σ), and others. The estimation of population parameters relies on the examination of pertinent statistical quantities derived from a sample, referred to as sample statistics. The term 'statistic' is commonly used to denote either the random variable or its value, with the specific meaning discerned from the surrounding context.

Let us consider all possible samples of a population and calculate a statistic, for instance the sample mean, for each one. The set of all such values, one for each sample, is called the sampling distribution of the statistic.

Now we can compute statistics such as the mean, variance, etc., for this sampling distribution.
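This definition can be sketched with a toy enumeration (the population values are illustrative only, not from the text): listing every possible sample of size 2 from a population of 4 values gives the sampling distribution of the sample mean, whose own mean equals the population mean.

```python
from itertools import combinations
from statistics import mean

# Small hypothetical population (values chosen for illustration only).
population = [2, 4, 6, 8]
mu = mean(population)

# Every possible sample of size 2 drawn without replacement, and its mean:
# the collection of these means is the sampling distribution of the mean.
sample_means = [mean(s) for s in combinations(population, 2)]

# The mean of the sampling distribution equals the population mean mu.
print(sorted(sample_means), mean(sample_means), mu)
```

Here the six possible samples give means 3, 4, 5, 5, 6 and 7, and their average is 5, the population mean, illustrating why the sample mean is an unbiased estimator of µ.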
In most statistical problems, it is necessary to use the information from a sample to draw inferences about the population.

3.2 SOME COMMON TERMS USED IN SAMPLING DISTRIBUTIONS

Population: The population in a statistical study is the set, collection or totality of observations about which inferences are to be drawn. Thus the population consists of sets of numbers, measurements or observations. The population size N is the number of objects or observations in the population. A population is said to be finite or infinite depending on the size N being finite or infinite. Since it is impracticable to examine the entire population, a finite subset of the population known as a sample is studied. The sample size n is the number of objects or observations in the sample.

Example 3.1 Engineering graduate students in A.P. (population); engineering graduate students of one college (sample).

Fig. 3.1 Example 3.1

Example 3.2 Total production of items in a month (population); total production of items in one day (sample).

Example 3.3 Budget of India (population); budget of A.P. (sample); budget of a district (sub-sample).

Fig. 3.2 Population, Sample and Sub-sample

Population parameter: A statistical measure or constant obtained from the population is called a population parameter, for example the population mean (μ) and the population variance (σ²).

(Sample) statistic: A statistical measure computed from sample observations is called a (sample) statistic, for example the sample mean (x̄) and the sample variance (s²). Clearly, parameters belong to the population while statistics belong to the sample.

μ, σ, P represent the population mean, the population standard deviation and the population proportion respectively.
Similarly, x̄, s, p denote the sample mean, the sample standard deviation (s.s.d.) and the sample proportion.

Note: The sample must be a true or good representative of the population, so sampling should be random or probabilistic.

Sampling: The process of drawing or obtaining samples is called sampling.

Large sampling: If n > 30, then the sampling is known as large sampling.

Small sampling: If n < 30, then the sampling is known as small or exact sampling.

Note: The simplest and most commonly used type of probabilistic sampling is random sampling.

Random sampling: Each member of the population has an equal chance or probability of being included in the sample. The sample obtained by this method is termed a random sample.

Finite population: A population may be finite or infinite. If the number of items or observations constituting the population is fixed and limited, it is called a finite population.

Example 3.4 The workers in a factory, the students in a college, etc.

Fig. 3.3 Example 3.4

Infinite population: If the number of items or observations constituting the population is infinite (not fixed and not limited), it is called an infinite population. Examples: the population of all real numbers lying between 0 and 1; the population of stars or astral bodies in the sky.

Sampling with replacement: If the items are selected or drawn one by one in such a way that an item drawn at a time is replaced back into the population before the next or subsequent draw, it is known as (random) sampling with replacement. In this type of sampling from a population of size N, the probability of selection of a unit at each draw remains 1/N. Thus sampling from a finite population with replacement can be considered theoretically as sampling from an infinite population. In this case, N^n samples (of size n) can be drawn.

Sampling without replacement: An item of the population cannot be chosen more than once, as it is not replaced.
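The two schemes are easy to contrast with Python's standard library: `random.choices` draws with replacement, while `random.sample` draws without. A minimal sketch, using a small made-up population of ten numbered items:

```python
import random

random.seed(0)
population = list(range(1, 11))  # a small finite population, N = 10

# With replacement: an item may appear more than once, and each draw
# picks any of the N items with probability 1/N.
with_repl = random.choices(population, k=5)

# Without replacement: every drawn item is distinct.
without_repl = random.sample(population, k=5)

print(with_repl, without_repl)
```

Running this repeatedly, `with_repl` will sometimes contain duplicates, whereas `without_repl` never can.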
In this case, NCn samples can be drawn. Hence the probability of drawing a unit from a population of N items at the r-th draw is 1/(N − r + 1).

A statistic is a real-valued function of the random sample. It is a function of one or more random variables not involving any unknown parameter. Thus a statistic is a function of the sample observations only and is itself a random variable. Hence, a statistic must have a probability distribution.

Sample mean: Let x1, x2, x3, ..., xn be a random sample of size n from a population. Then the sample mean is

x̄ = (1/n) Σ xi  (sum over i = 1 to n).

Sample variance: The sample variance is

s² = (1/(n − 1)) Σ (xi − x̄)²  (sum over i = 1 to n).

The sample standard deviation is the positive square root of the sample variance. The sample mean and sample variance are two important statistics, which are statistical measures of a random sample of size n.

3.3 SAMPLING DISTRIBUTIONS

Let us consider all possible samples of size n from a finite population of size N. The total number of all possible samples of size n which can be drawn from the population is NCn = m. Compute a statistic θ (such as the mean, variance/s.d. or proportion) for each of these samples using the sample data x1, x2, x3, ..., xn:

θ = θ(x1, x2, x3, ..., xn).

Sample number:  1    2    3    ...   m
Statistic:      θ1   θ2   θ3   ...   θm

The sampling distribution of the statistic θ is the set of values {θ1, θ2, θ3, ..., θm} of the statistic θ so obtained, one for each sample. Thus the sampling distribution describes how a statistic θ varies from one sample to another of the same size. Although all the m samples are drawn from the given population, the items included in different samples are different.
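The two formulas above are a few lines of arithmetic; as a sketch (the five observations are made-up data), we can compute them by hand and check against Python's `statistics` module, which uses the same n − 1 divisor for the sample variance:

```python
import statistics

sample = [12.0, 15.0, 11.0, 14.0, 13.0]  # hypothetical observations
n = len(sample)

xbar = sum(sample) / n                                # sample mean
s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)   # sample variance, divisor n - 1
s = s2 ** 0.5                                         # sample standard deviation

# The standard library agrees with the hand-rolled formulas:
assert xbar == statistics.mean(sample)
assert abs(s2 - statistics.variance(sample)) < 1e-12
print(xbar, s2)
```

Here x̄ = 13 and s² = 2.5; note that `statistics.pvariance` (divisor n) would give the population variance instead.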
If the statistic θ is the mean, then the corresponding distribution of the statistic is known as the sampling distribution of means. Likewise, if θ is a variance, proportion, etc., the corresponding distribution is known as the sampling distribution of variances, the sampling distribution of proportions, etc.

The mean of the sampling distribution of θ is

μθ = (1/m) Σ θi  (sum over i = 1 to m),

and the variance of the sampling distribution of θ is

σθ² = (1/m) Σ (θi − μθ)²  (sum over i = 1 to m).

Similarly, we can speak of the mean of the sampling distribution of means, the variance of the sampling distribution of means, the variance of the sampling distribution of variances, etc.

Standard Error

The standard deviation of the sampling distribution of a statistic is known as its Standard Error (SE). The standard error gives some idea of the precision of the estimate of the parameter. As the sample size n increases, the SE decreases. The SE plays a very important role in large-sample decision theory and forms the basis for hypothesis testing. The sampling distribution of a statistic enables us to obtain information about the corresponding population parameter.

Degrees of Freedom (v)

The number of degrees of freedom, usually denoted by the Greek letter v, is a positive integer equal to n − k, where n is the number of independent observations of the random sample and k is the number of population parameters which are calculated using the sample data. The degrees of freedom v = n − k is the difference between n, the sample size, and k, the number of independent constraints imposed on the observations in the sample.

3.3.1 The Sampling Distribution of the Mean (σ known)

To answer questions about the sampling distribution of the mean x̄, we consider a random sample of n observations and determine the value of x̄ for each sample; from the various values of x̄ it is possible to get an idea of the nature of the sampling distribution.
Also consider the following theorem for the mean μx̄ and the variance σx̄² of the sampling distribution of the mean x̄.

Theorem: If a random sample of size n is taken from a population having the mean μ and the variance σ², then x̄ is a random variable whose distribution has the mean μ. For samples from an infinite population, the variance of this distribution is σ²/n; for samples from a finite population of size N, the variance is (σ²/n)((N − n)/(N − 1)).

By the above statement, when the population is infinite (or sampling is with replacement),

μx̄ = μ and σx̄ = σ/√n,

and when the population is finite of size N (sampling without replacement),

μx̄ = μ and σx̄ = (σ/√n)√((N − n)/(N − 1)).

Note: The factor (N − n)/(N − 1) is known as the finite population correction factor. In sampling with replacement we will have N^n samples, each with probability 1/N^n. In sampling without replacement we will have NCn samples, each with probability 1/NCn.

Note: The factor (N − n)/(N − 1) can be neglected if N is very large compared to the sample size n.

3.3.2 Central Limit Theorem

Whenever n is large, the sampling distribution of x̄ is approximately normal with mean μ and variance σ²/n, regardless of the form of the parent population distribution, as the following theorem states (without proof).

Theorem: If x̄ is the mean of a random sample of size n drawn from a population with mean μ and finite variance σ², then the standardized sample mean

Z = (x̄ − μ)/(σ/√n)

is a random variable whose distribution function approaches that of the standard normal distribution N(0, 1) as n → ∞.

The normal distribution provides a good approximation to the sampling distribution for almost all populations when n > 30. For small samples (n < 30), the sampling distribution of x̄ is normally distributed provided sampling is from a normal population.
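The theorem can be checked empirically. The sketch below is a hypothetical simulation (not from the text): it draws many samples of size n = 36 from a Uniform(0, 1) parent, whose mean is 1/2 and standard deviation 1/√12, and verifies that the sample means centre on μ and that roughly Φ(1) − Φ(−1) ≈ 68% of them fall within one standard error σ/√n of μ:

```python
import random
from math import erf, sqrt

random.seed(42)
n, trials = 36, 20000
mu, sigma = 0.5, sqrt(1 / 12.0)   # Uniform(0,1) parent: mean 1/2, s.d. 1/sqrt(12)

# Build the sampling distribution of the mean by brute force
means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

m = sum(means) / trials
se = sigma / sqrt(n)              # theoretical standard error sigma/sqrt(n)

# Fraction of sample means within one standard error of mu
within = sum(abs(x - mu) < se for x in means) / trials
phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF via erf
print(round(m, 3), round(within, 3), round(phi(1) - phi(-1), 3))
```

Despite the decidedly non-normal parent distribution, the empirical fraction lands very close to the normal-theory value 0.6827, which is the content of the theorem.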
3.3.3 Sampling Distribution of Proportions

Suppose that a population is infinite and that the probability of occurrence of an event, called a success, is p, while the probability of non-occurrence of the event is q = 1 − p. Consider all possible samples of size n drawn from this population, and for each sample compute the proportion p̄ of successes. Then we have a sampling distribution of proportions whose mean μp̄ and standard deviation σp̄ are given by

μp̄ = p and σp̄² = p(1 − p)/n = pq/n.

While the population is binomially distributed, the sampling distribution of the proportion is approximately normally distributed whenever n is large (n > 30). The above equations are also valid for a finite population in which sampling is with replacement. For a finite population with sampling without replacement, the variance is multiplied by the correction factor (N − n)/(N − 1).

3.3.4 Sampling Distributions of Differences and Sums

Let μs1 and σs1 be the mean and standard deviation of the sampling distribution of a statistic s1, obtained by calculating s1 for all possible samples of size n1 drawn from population 1. This yields the sampling distribution of the statistic s1. In an analogous manner, let μs2 and σs2 be the mean and standard deviation of the sampling distribution of a statistic s2, obtained by calculating s2 for all possible samples of size n2 drawn from another, different population 2.

Now we can form the distribution of the differences s1 − s2, called the sampling distribution of the differences of the statistics, from the two populations 1 and 2. The mean μ(s1 − s2) and the standard deviation σ(s1 − s2) of the sampling distribution of differences are given by

μ(s1 − s2) = μs1 − μs2 and σ(s1 − s2) = √(σs1² + σs2²),

provided the samples are independent. The sampling distribution of the sum of the statistics has mean and standard deviation

μ(s1 + s2) = μs1 + μs2 and σ(s1 + s2) = √(σs1² + σs2²),

provided the samples are independent.

For infinite populations the sampling distribution of the differences of proportions has mean μ(p̄1 − p̄2) and standard deviation σ(p̄1 − p̄2) given by

μ(p̄1 − p̄2) = p1 − p2 and σ(p̄1 − p̄2) = √(σp̄1² + σp̄2²) = √(p1 q1/n1 + p2 q2/n2).
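As a small sketch (the helper name and the numbers are invented for illustration), the standard error of a sample proportion follows directly from the formulas above, with the finite population correction applied when a population size N is supplied:

```python
from math import sqrt

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion sqrt(pq/n); apply the
    finite population correction factor when the population size N
    is given (sampling without replacement)."""
    se = sqrt(p * (1 - p) / n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return se

print(round(se_proportion(0.4, 100), 4))          # infinite population
print(round(se_proportion(0.4, 100, N=1000), 4))  # with correction factor
```

The corrected value is always smaller, reflecting the fact that sampling a sizeable fraction of a finite population leaves less room for the proportion to vary.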
For infinite populations the sampling distribution of the differences of means has mean μ(x̄1 − x̄2) and standard deviation σ(x̄1 − x̄2) given by

μ(x̄1 − x̄2) = μ1 − μ2 and σ(x̄1 − x̄2) = √(σx̄1² + σx̄2²) = √(σ1²/n1 + σ2²/n2).

3.3.5 Sampling Distribution of the Mean, σ Unknown: t-Distribution

To estimate or draw inferences on a population mean, or the difference between two population means, it was assumed that the population standard deviation σ is known. When σ is unknown, for large n ≥ 30, σ can be replaced by the sample standard deviation s, calculated using the sample mean x̄ by the formula

s² = (1/(n − 1)) Σ (xi − x̄)²  (sum over i = 1 to n).

For a small sample of size n < 30, the unknown σ can be substituted by s, provided we make the assumption that the sample is drawn from a normal population.

Let x̄ be the mean of a random sample of size n drawn from a normal population with mean μ and variance σ². Then

t = (x̄ − μ)/(s/√n)

is a random variable having the t-distribution with v = n − 1 degrees of freedom, where s² = (1/(n − 1)) Σ (xi − x̄)².

This result is more general than the previous theorem (CLT) in the sense that it does not require knowledge of σ; on the other hand, it is less general than the CLT in the sense that it requires the assumption of a normal population.

Thus, for all small samples n < 30 with σ unknown, a statistic for inference on the population mean μ is t = (x̄ − μ)/(s/√n), with the underlying assumption of sampling from a normal population.

The t-distribution curve is symmetric about the mean 0, bell-shaped, and asymptotic on both sides of the horizontal t-axis; thus the t-distribution curve is similar to the normal curve. The variance of the t-distribution is more than 1, as it depends on the parameter v = n − 1 degrees of freedom, but it approaches 1 as n → ∞. In essence, as v = (n − 1) → ∞, the t-distribution tends to the standard normal distribution. Clearly, for n > 30 the standard normal distribution provides a good approximation to the t-distribution.

Fig. 3.4 t-distribution
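The t statistic above is just a few lines of arithmetic. A minimal sketch, using eight invented measurements and a hypothesized population mean μ0 = 10.0:

```python
from math import sqrt

sample = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3]  # hypothetical data
n = len(sample)
xbar = sum(sample) / n
# sample standard deviation s with divisor n - 1
s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))

mu0 = 10.0                        # hypothesized population mean
t = (xbar - mu0) / (s / sqrt(n))  # t statistic of Section 3.3.5
df = n - 1                        # v = n - 1 degrees of freedom
print(round(t, 3), df)
```

The resulting t value would then be compared against the tabulated critical value t(alpha) for v = 7 degrees of freedom, as described next.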
The critical value of the t-distribution is denoted by t(alpha), which is such that the area under the curve to the right of t(alpha) equals alpha. Since the t-distribution is symmetric, it follows that t(1 − alpha) = −t(alpha); i.e., the t-value leaving an area of 1 − alpha to the right, and therefore an area alpha to its left, is equal to the negative of the t-value which leaves an area alpha in the right tail of the distribution. Observe the critical values of t(alpha) for values of the parameter v: in tables, the left-hand column contains values of v, the column headings are areas alpha in the right-hand tail of the t-distribution, and the entries are values of t(alpha).

3.3.6 χ²-Distribution

The χ² (chi-squared) distribution is a continuous probability distribution of a continuous random variable X with probability density function given by

f(x) = (1/(2^(v/2) Γ(v/2))) x^((v/2) − 1) e^(−x/2),  x > 0,

where the positive integer v, known as the degrees of freedom, is the only parameter of the distribution.

Properties of the χ²-Distribution
(i) The χ²-distribution curve is not symmetrical and lies entirely in the first quadrant; hence it is not a normal curve, since χ² varies from 0 to ∞.
(ii) It depends only on the degrees of freedom v.
(iii) If χ1² and χ2² are two independent chi-squared distributions with v1 and v2 degrees of freedom, then χ1² + χ2² is a chi-squared distribution with v1 + v2 degrees of freedom; i.e., it is additive.

Fig. 3.5 χ²-Distribution

The mean (μ) of a chi-square distribution with v degrees of freedom is v; the variance (σ²) is 2v.

Applications: The chi-square distribution is commonly used in hypothesis testing, particularly in tests comparing observed and expected frequencies in categorical data. It is used in the construction of confidence intervals for the variance of a normal distribution. A chi-square distribution table is a tabulated set of values that helps determine critical values and probabilities associated with the chi-square distribution.
To use a chi-square distribution table, follow these steps:

For Critical Values:
1. Identify Degrees of Freedom (v):
* Determine the degrees of freedom (v) associated with your chi-square distribution.
2. Determine the Significance Level (alpha):
* Choose a significance level (alpha) for your test. Common choices include 0.05, 0.01, or others depending on the specific requirements of your statistical test.
3. Look Up the Critical Value:
* Locate the intersection of the row corresponding to the degrees of freedom (v) and the column corresponding to the chosen significance level (alpha). The value in that cell is the critical value.

For Probabilities:
1. Identify Degrees of Freedom (v):
* Determine the degrees of freedom (v) associated with your chi-square distribution.
2. Determine the Chi-Square Statistic (χ²):
* Identify the chi-square statistic (χ²) for which you want to find the probability.
3. Look Up the Probability:
* Locate the intersection of the row corresponding to the degrees of freedom (v) and the column corresponding to the chi-square statistic (χ²). The value in that cell is the cumulative probability.

3.3.7 Sampling Distribution of the Variance s²

From the earlier discussions, the sample mean is used to estimate the population mean. Similarly, the sample variance is used to estimate the population variance (σ²). If s² is the variance of a random sample of size n drawn from a normal population with variance σ², then

χ² = (n − 1)s²/σ²

is a random variable having the χ²-distribution with v = n − 1 degrees of freedom. Exactly 95% of the χ²-distribution lies between χ²(0.975) and χ²(0.025). When σ² is too small, the χ²-value falls to the right of χ²(0.025), and when σ² is too large, χ² falls to the left of χ²(0.975); thus when σ² is correct, the χ²-value falls between χ²(0.975) and χ²(0.025).

Table 3.1 Critical region for testing H0 : σ² = σ0²
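The statistic of Section 3.3.7 is straightforward to compute. A minimal sketch, using invented data and a hypothesized population variance σ0² = 0.04:

```python
sample = [2.1, 1.9, 2.4, 2.0, 1.8, 2.3, 2.2, 1.9, 2.1, 2.3]  # hypothetical data
n = len(sample)
xbar = sum(sample) / n
# sample variance s^2 with divisor n - 1
s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)

sigma0_sq = 0.04                   # hypothesized population variance
chi_sq = (n - 1) * s2 / sigma0_sq  # chi-squared statistic (n - 1) s^2 / sigma0^2
df = n - 1                         # v = n - 1 degrees of freedom
print(round(chi_sq, 3), df)
```

The computed χ² value would then be compared against the tabulated χ²(0.975) and χ²(0.025) values for v = 9 degrees of freedom, per Table 3.1.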
3.3.8 F-Distribution (sampling distribution of the ratio of two sample variances)

Another important continuous probability distribution, which plays an important role in connection with sampling from normal populations, is the F-distribution. Let s1² and s2² be the variances of independent random samples of sizes n1 and n2 drawn from normal populations with variances σ1² and σ2². To determine whether the two samples come from two populations having equal variances, consider the sampling distribution of the ratio of the variances of the two independent random samples, defined by

F = (s1²/σ1²)/(s2²/σ2²),

which follows the F-distribution with v1 = n1 − 1 and v2 = n2 − 1 degrees of freedom.

Uses: The F-distribution can be used for testing the equality of several population means and for comparing sample variances; the analysis of variance depends completely on the F-distribution.

Under the hypothesis that the two normal populations have the same variance, σ1² = σ2², we have

F = s1²/s2².

F determines whether the ratio of the two sample variances s1² and s2² is too small or too large. When F is close to 1, the two sample variances s1² and s2² are almost the same. F is always a positive number whenever the larger sample variance is taken as the numerator.

Fig. 3.6 F-Distribution (probability density functions of several F-distributions; tabulated values F(0.05) and F(0.01))

Properties of the F-distribution
(i) The F-distribution curve lies entirely in the first quadrant.
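Under the equal-variance hypothesis, the F ratio takes only a few lines to compute. A sketch with two invented samples, placing the larger sample variance in the numerator as the text recommends, so that F ≥ 1:

```python
def sample_variance(xs):
    """Sample variance with divisor n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

# Two hypothetical independent samples from normal populations
s1 = [4.2, 3.9, 4.5, 4.1, 4.3, 3.8]
s2 = [5.0, 5.6, 4.8, 5.2]

v1, v2 = sample_variance(s1), sample_variance(s2)
# Larger variance in the numerator, so F >= 1; degrees of freedom
# follow the numerator/denominator ordering.
F = max(v1, v2) / min(v1, v2)
df = (len(s1) - 1, len(s2) - 1) if v1 >= v2 else (len(s2) - 1, len(s1) - 1)
print(round(F, 3), df)
```

The computed F would then be compared against the tabulated critical value for the corresponding (v1, v2) pair.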
