Stats 1 - IITM BS Notes - Part 2
Continuation of notes on Statistics 1 for Data Science by IIT Madras
Uploaded by
ryandonovan.des
Based on this information, answer questions 4 to 7.
The number of bicycles sold in a bicycle shop over the first nine days of the month is given below.
[Table: bicycles sold on each of the nine days]

4) What is the value of the ...th percentile?
Accepted answer: 15
5) What is the value of the 25th percentile?
Accepted answer: 25
6) What is the value of the 75th percentile?
Accepted answer: 37

Box Plot
A box plot is also called a box-and-whisker plot or, loosely, a candlestick chart.
[Figure: box plot marking the minimum, the quartiles, the median and the maximum]
For a detailed explanation on box charts, see this video: https://www.youtube.com/watch?v=-2¥nDBacOUY

Week 4
Learning objectives
1. Understand the association between two categorical variables.
2. Construct a two-way table or contingency table.
3. Summarise each group in a categorical variable through row totals and column totals.
4. Construct a contingency table in Google Sheets using the pivot table option.

Association between two categorical variables
Sample question - Is owning a smartphone dependent on the gender of a person?
Here we compare a categorical variable such as gender (male/female/other) with another categorical variable, smartphone ownership (yes or no).

Contingency table or two-way table
This compares variable 1 (in rows) to variable 2 (in columns).

Contingency table using Google Sheets
Step 1 Choose the columns of the variables for which you seek an association.
Step 2 Go to Data and click on the Pivot table option.
Step 3 Click on the Create option in the pivot table dialog. This opens the pivot table editor.
3.1 Under the Rows tab, click on the first categorical variable.
3.2 Under the Columns tab, click on the second categorical variable.
3.3 Under the Values tab, click on either of the variables and then choose COUNTA under "Summarize by".

Row/column relative frequency
A row (column) relative frequency is a cell count divided by the total of its own row (column).

Example 1: Row relative frequency
             | No              | Yes             | Row total
Female       | 10/44 = 22.73%  | 34/44 = 77.27%  | 44
Male         | 14/56 = 25.00%  | 42/56 = 75.00%  | 56
Column total | 24/100 = 24.00% | 76/100 = 76.00% | 100

Example 1: Column relative frequency
             | No             | Yes            | Row total
Female       | 10/24 = 41.67% | 34/76 = 44.74% | 44/100 = 44.00%
Male         | 14/24 = 58.33% | 42/76 = 55.26% | 56/100 = 56.00%
Column total | 24             | 76             | 100

Association between two variables
What do we mean by stating that two variables are associated? Knowing information about one variable provides information about the other variable.
If the row relative frequencies (or the column relative frequencies) are the same for all rows (or columns), then we say that the two variables are not associated with each other.
If the row relative frequencies (or the column relative frequencies) are different for some rows (or some columns), then we say that the two variables are associated with each other.

Associated example: income versus smartphone ownership
[Figure: 100% stacked bar chart, income versus smartphone ownership]
Not associated example: gender versus smartphone ownership
[Figure: 100% stacked bar chart, gender versus smartphone ownership]

Scatterplot
We use a scatter plot to look for association between numerical variables. A scatter plot is a graph that displays pairs of values as points on a two-dimensional plane.

Describing association
When describing association between variables in a scatter plot, there are four key questions that need to be answered:
1. Direction: Does the pattern trend up, down, or both?
2. Curvature: Does the pattern appear to be linear or does it curve?
3. Variation: Are the points tightly clustered along the pattern?
4. Outliers: Did you find something unexpected?

Measures of association
How to quantify association?
There are two ways:
- Covariance
- Correlation

Covariance
- When large (or small) values of x tend to be associated with large (or small) values of y, the signs of the deviations $(x_i - \bar{x})$ and $(y_i - \bar{y})$ will tend to be the same.
- When large (or small) values of x tend to be associated with small (or large) values of y, the signs of the deviations $(x_i - \bar{x})$ and $(y_i - \bar{y})$ will tend to be different.

Definition
Let $x_i$ denote the $i$th observation of variable x, and $y_i$ denote the $i$th observation of variable y. Let $(x_i, y_i)$ be the $i$th paired observation of a population (sample) dataset having $N$ ($n$) observations. The covariance between the variables x and y is given by

Population covariance: $$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$
Sample covariance: $$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

- Covariance is a measure of linear association between two numerical variables.
- The size of the covariance, however, is difficult to interpret because the covariance has units.
- The units of the covariance are those of the x-variable times those of the y-variable.

Correlation
A more easily interpreted measure of linear association between two numerical variables is correlation. It is derived from covariance. To find the correlation between two numerical variables x and y, divide the covariance between x and y by the product of the standard deviations of x and y. The Pearson correlation coefficient, r, between x and y is given by

$$r = \frac{\mathrm{Cov}(x, y)}{s_x \, s_y}$$

Remark
The units of the standard deviations cancel out the units of the covariance. It can be shown that the correlation measure always lies between -1 and +1.
[Figure: scatter plots showing a downward trend (negative covariance, r towards -1) and an upward trend (positive covariance, r towards +1)]

Line of best fit
The strength of linear association between the variables was measured using covariance and correlation. The linear association itself can be described using the equation of a line.
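The covariance and correlation definitions above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the `hours` and `marks` lists are made-up values, not data from the notes, and the function names are chosen here for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys, sample=True):
    """Covariance of paired observations.

    Divides by n - 1 for a sample and by N for a population.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (len(xs) - 1 if sample else len(xs))


def std_dev(values, sample=True):
    # The variance is the covariance of a variable with itself.
    return math.sqrt(covariance(values, values, sample))


def correlation(xs, ys):
    """Pearson r: covariance divided by the product of standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


# Hypothetical data: hours studied and marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

print(covariance(hours, marks))                # sample covariance: 10.25
print(covariance(hours, marks, sample=False))  # population covariance
print(correlation(hours, marks))               # close to +1: strong upward trend

# Changing the unit of x (hours to minutes) rescales the covariance
# but leaves the correlation unchanged, because r has no units.
minutes = [60 * h for h in hours]
print(covariance(minutes, marks), correlation(minutes, marks))
```

Expressing `hours` in minutes multiplies the covariance by 60 while r stays the same, which is the point of the remark about units cancelling.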
Google Sheets can display the best-fit line on the scatter plot, together with its equation and the R² value. R² is the square of r, the correlation coefficient. R² ranges between 0 and 1. A value closer to 1 indicates a better fit and a value closer to 0 indicates a loose fit.
[Figure: scatter plot in Google Sheets with the best-fit line, its equation and the R² value]

Statistics explanation - https://www.youtube.com/watch?v=1-9_vRZV-ck&t=752s
Maths explanation for sum of squared errors: https://www.youtube.com/watch?v=xdZHsFuyBZM

Important concepts:
R² is (more than one option can be correct):
(a) the measure of direction of linear association between two variables
(b) the measure of proportion of the variance in the dataset explained by the explanatory variable
(c) the measure of goodness of fit of the line to the dataset
(d) not the measure of goodness of fit of the line to the dataset
Marked options: (b) and (c)

Association between a categorical variable and a numerical variable
Example - A teacher was interested in knowing if female students performed better than male students in her class. She collected data from twenty students and the marks they obtained out of 100 in the subject.
First, code the categorical variable. Let male students be 0 and female students be 1, or vice versa.
[Figure: sample scatter plot of marks against coded gender]
Such plots do not have best-fit lines. Instead, they are summarised by the point bi-serial correlation coefficient.

Point Bi-serial Correlation Coefficient
- Let X be a numerical variable and Y be a categorical variable with two categories (a dichotomous variable).
- The following steps are used for calculating the point bi-serial correlation between these two variables:
Step 1 Group the data into two sets based on the value of the dichotomous variable Y. That is, assume that the value of Y is either 0 or 1.
Step 2 Calculate the mean values of X in the two groups: let $\bar{Y}_0$ and $\bar{Y}_1$ be the mean values of the groups with Y = 0 and Y = 1, respectively.
Step 3 Let $p_0$ and $p_1$ be the proportions of observations in the groups with Y = 0 and Y = 1, respectively, and let $s_X$ be the standard deviation of the variable X. With $\bar{Y}_0$ and $\bar{Y}_1$ the group means from Step 2, the correlation coefficient is

$$r_{pb} = \left(\frac{\bar{Y}_0 - \bar{Y}_1}{s_X}\right)\sqrt{p_0 \, p_1}$$

The absolute value of r lies between 0 and 1. The positive or negative sign only reflects the coding assigned to the categorical variable, so the sign must not be taken into account. Closer to 0 means not associated and closer to 1 means strongly associated.

Example - Gender vs marks data:
[Table: gender, gender code (0/1) and marks of the twenty students, followed by the computed group means, proportions, standard deviation and correlation]

Simpson's paradox
Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined. Simpson's paradox is defined as the reversal of conclusions in disaggregated and aggregated cross-tabulation.
[Table: match results of captains A and B]

8) Does the above data illustrate Simpson's paradox? (More than one option can be correct)
(a) No, because the conclusions in aggregated and disaggregated cross-tabulation are the same.
(b) Yes, because the reversal of conclusions occurs in draw percentages of both the captains.
(c) Yes, because the reversal of conclusions occurs in win percentages of both the captains.
(d) Yes, because the reversal of conclusions occurs in loss percentages of both the captains.
Feedback:

Win percentage:
          | Home          | Away            | Total
Captain A | 9/10 (90%)    | 76/142 (53.52%) | 85/152 (55.92%)
Captain B | 150/200 (75%) | 25/60 (41.66%)  | 175/260 (67.30%)
Table 4.1

Draw percentage:
          | Home       | Away         | Total
Captain A | 0/10 (0%)  | 1/142 (0.7%) | 1/152 (0.6%)
Captain B | 0/200 (0%) | 0/60 (0%)    | 0/260 (0%)
Table 4.2

Loss percentage:
          | Home         | Away            | Total
Captain A | 1/10 (10%)   | 65/142 (45.77%) | 66/152 (43.42%)
Captain B | 50/200 (25%) | 35/60 (58.33%)  | 85/260 (32.69%)
Table 4.3

We observe from Table 4.1 that the winning percentage in ODI cricket is higher for captain A in both Home and Away matches, but overall the winning percentage is higher for captain B. From Table 4.3, we observe that the loss percentage is higher for captain B in both Home and Away matches, but overall the loss percentage is higher for captain A. Therefore, options (c) and (d) are correct.

Learning objectives
1. Understand basic principles of counting.
2. Understand the addition rule of counting.
3. Appreciate the importance of the OR keyword in the addition rule of counting.
4. Understand the multiplication rule of counting.
5. Appreciate the importance of the AND keyword in the multiplication rule of counting.

Probability
Probability is a tool that helps us understand uncertainty in an uncertain world.

Principles of counting
Imagine you have a gift card to buy either a pair of pants or a shirt. There are 4 shirt options and 3 pant options. You have 4 + 3 = 7 choices. This is called the addition rule of counting.
Now imagine you have a gift card with which you can buy a shirt and a pair of pants. There are again 4 shirt options and 3 pant options. You now have 4 * 3 = 12 options. Add a pair of shoes to the gift card offer, with two kinds of shoes to choose from. You now have 4 * 3 * 2 = 24 options.
[Figure: the 24 combinations, shown as 12 ways for each of the two kinds of shoes]
This is called the multiplication rule of counting.
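The gift card example can be checked by listing the choices explicitly. The sketch below is only an illustration: the item labels (`S1`, `P1`, `H1`, ...) are placeholders invented here, and `itertools.product` from the standard library is used to enumerate every combination.

```python
from itertools import product

# Placeholder labels for the 4 shirts, 3 pants and 2 kinds of shoes.
shirts = ["S1", "S2", "S3", "S4"]
pants = ["P1", "P2", "P3"]
shoes = ["H1", "H2"]

# Addition rule: a shirt OR a pair of pants, so the options are pooled.
shirt_or_pants = shirts + pants
print(len(shirt_or_pants))  # 4 + 3 = 7

# Multiplication rule: a shirt AND a pair of pants, so every pairing counts.
shirt_and_pants = list(product(shirts, pants))
print(len(shirt_and_pants))  # 4 * 3 = 12

# Adding shoes multiplies the count again.
full_outfits = list(product(shirts, pants, shoes))
print(len(full_outfits))  # 4 * 3 * 2 = 24
```

Pooling two lists corresponds to OR, while pairing every item of one list with every item of another corresponds to AND.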
So, from the above examples we can summarise:

Multiplication rule of counting
- If an action A can occur in $n_1$ different ways and another action B can occur in $n_2$ different ways, then the total number of ways in which the actions A and B can occur together is $n_1 \times n_2$.
- Suppose that $r$ actions are to be performed in a definite order. Further suppose that there are $n_1$ possibilities for the first action and that corresponding to each of these possibilities there are $n_2$ possibilities for the second action, and so on. Then there are $n_1 \times n_2 \times \dots \times n_r$ possibilities altogether for the $r$ actions.

Application in the real world:
- Suppose you are asked to create a six-character alphanumeric password with the following requirement:
- The password should have two letters first, followed by four digits.
- Repetition allowed: number of ways = 26 x 26 x 10 x 10 x 10 x 10 = 6,760,000
- Repetition not allowed: number of ways = 26 x 25 x 10 x 9 x 8 x 7 = 3,276,000

Factorials
Imagine you have 6 different coins and 6 boxes. You can put one coin in each box. In how many ways can you arrange the coins?
In the first box you can put one of 6 coins, in the second box one of the 5 remaining coins, in the third one of the 4 remaining coins, in the fourth one of the 3 remaining coins, in the fifth one of the 2 remaining coins, and in the last box the last coin. So, there are 6 x 5 x 4 x 3 x 2 x 1 ways of arranging the coins in the boxes. This can be written as 6!, also known as 6 factorial.

Ways to work with and simplify factorials:
4! can also be represented as 4 x 3!
1. 5! = 5 x 4 x 3 x 2 x 1 = 120
2. Observe 5! = 5 x 4!
   In general, n! = n x (n - 1)!
3. Observe 5! = 5 x 4! = 5 x 4 x 3!
   In general, for i
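The password counts and the factorial identities in this section can be verified numerically. The following is a small sketch using Python's standard `math` module (`math.perm` requires Python 3.8 or later); the variable names are chosen here for illustration only.

```python
import math

LETTERS = 26  # letters available for the first two positions
DIGITS = 10   # digits available for the last four positions

# Repetition allowed: every position is chosen independently.
with_repetition = LETTERS ** 2 * DIGITS ** 4

# Repetition not allowed: each choice removes one option from its pool.
# math.perm(n, k) = n * (n - 1) * ... * (n - k + 1)
without_repetition = math.perm(LETTERS, 2) * math.perm(DIGITS, 4)

print(with_repetition)     # 6760000
print(without_repetition)  # 3276000

# Six different coins in six boxes: 6! arrangements.
print(math.factorial(6))   # 720

# The simplification rules for factorials listed above.
assert math.factorial(5) == 5 * math.factorial(4)
assert math.factorial(5) == 5 * 4 * math.factorial(3)
```

Letters and digits are treated as separate pools here, which is why the no-repetition count multiplies 26 x 25 by 10 x 9 x 8 x 7.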