
INTRODUCTION TO STATISTICS Notes

This document provides an introduction to statistics. It discusses key concepts like population, sample, descriptive statistics, inferential statistics, measures of central tendency (mean, median, mode), measures of variability (range, standard deviation), and data types (qualitative, quantitative). Descriptive statistics summarize and organize data, while inferential statistics make inferences about populations from samples.

Uploaded by

sourav guha
Copyright
© All Rights Reserved

INTRODUCTION TO STATISTICS
Statistics is a field which is omnipresent.

💡 Statistics is the science that deals with collecting, analyzing and
interpreting data with the help of mathematical tools.

POPULATION : It is a set or collection of items of interest in a
statistical study.

SAMPLE : A sample is a subset of items that have been
collected from the population.

There are 2 major types of statistics :

1. Descriptive statistics

Methods of organizing, summarizing and presenting numerical data fall into this
area of statistics. It describes the data actually collected, whether that is a
sample or an entire population, without generalizing beyond it.

2. Inferential statistics
Problems involving statistical inference arise when a statistician takes a sample
from a population and wishes to make statements about the population
characteristics from the information in the sample .

DIFFERENCE BETWEEN DESCRIPTIVE AND INFERENTIAL STATISTICS

DESCRIPTIVE STATISTICS | INFERENTIAL STATISTICS
It gives information about raw data and describes the data in some manner. | It makes inferences about a population using data drawn from that population.
It helps in organizing, analyzing and presenting data in a meaningful manner. | It allows us to compare data and make hypotheses and predictions.
It is used to describe a situation. | It is used to explain the chance of occurrence of an event.
It explains already known data and is limited to a sample or population of small size. | It attempts to reach a conclusion about the population.
It can be achieved with the help of charts, graphs, tables etc. | It can be achieved by probability.

Measure of central tendency

1. MEAN : The mean is a measure of where the center of a data set lies. It is
also known as the arithmetic mean or average, and it is widely used with
probability distributions, especially in the central limit theorem. It is
calculated by adding all of the numbers together and dividing by the no. of
items in the set. Eg: Population mean.

Other types of mean : weighted mean, harmonic mean, geometric mean,
arithmetic-geometric mean.
Eg: mean of 7, 9 and 12 = (7 + 9 + 12)/3 = 28/3 ≈ 9.33.

2. MEDIAN : It is the middle no. in a data set, once the data is arranged in either
ascending or descending order.
Eg: 1,2,3,4,5,6,7,8,9,10. This set has an even no. of values, so there are two
middle numbers, 5 and 6. To find the median we calculate the arithmetic mean of
5 and 6, which is 5.5.

3. MODE : Mode is the most common no. in a set. For example : 21,21,21,23,24,26.
The mode is the value with the highest frequency, i.e. the most repeated no., here 21.
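
As a small illustration, the three measures above can be computed with Python's standard `statistics` module. This is only a sketch; the variable names are chosen here for illustration and the data are the example values used in the notes.

```python
from statistics import mean, median, mode

# Mean: add all the values and divide by how many there are.
mean_value = mean([7, 9, 12])              # (7 + 9 + 12) / 3

# Median: middle value of the sorted data; with an even count,
# the two middle values (5 and 6) are averaged.
median_value = median([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Mode: the most frequently occurring value.
mode_value = mode([21, 21, 21, 23, 24, 26])

print(mean_value, median_value, mode_value)
```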

Measure of variability
1. RANGE : It is the difference between the maximum value and the minimum value of
a data set.

Eg : 21,21,21,23,24,25,26,28,29,31,33.

Range = 33 - 21 = 12.

💡 An unusually high or unusually low data value is known as an outlier.

💡 Range is not a reliable measure when an outlier is present in the data set,
because it depends only on the two most extreme values.
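
A short sketch of why the range is sensitive to outliers. The helper name `value_range` and the extra value 95 are hypothetical, added only to illustrate the point; the base data are the example above.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

data = [21, 21, 21, 23, 24, 25, 26, 28, 29, 31, 33]
range_without_outlier = value_range(data)          # 33 - 21

# Hypothetical outlier appended to the same data set.
range_with_outlier = value_range(data + [95])      # 95 - 21

print(range_without_outlier, range_with_outlier)
```

A single extreme value changes the range from 12 to 74 even though the other eleven values are unchanged.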

2. STANDARD DEVIATION : Standard deviation is a measure of the amount of
variation or dispersion of a set of values. A low standard deviation indicates that
the values of the data set tend to be close to the mean, while a high standard
deviation indicates that the values are spread out over a wider range. Standard
deviation is abbreviated as SD and is represented in mathematical texts and
equations by the lower case Greek letter sigma (σ) for the population SD and the
Latin letter s for the sample SD. The SD of a random variable, sample, statistical
population, probability distribution or data set is the square root of its variance.
The SD of a population and the standard error of a statistic are two different
but related things.

STEPS TO CALCULATE VARIANCE AND HENCE STANDARD
DEVIATION :

STEP 1 : The mean value is calculated by adding all the data
points and dividing by the no. of data points.

STEP 2 : The deviation of each data point is calculated by
subtracting the mean from the value of the data point. Each of the
resulting values is then squared and the squares are summed.
This sum is divided by the no. of data points (less one for the
sample SD) to give the variance.

STEP 3 : The standard deviation is the square root of the variance.
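
The steps above can be sketched in Python as follows. The function names, the `sample` flag and the data values are illustrative assumptions, not part of the original notes.

```python
import math

def variance(values, sample=False):
    n = len(values)
    mean = sum(values) / n                                   # step 1: the mean
    squared_deviations = [(x - mean) ** 2 for x in values]   # step 2: squared deviations
    divisor = n - 1 if sample else n                         # "less one" for a sample
    return sum(squared_deviations) / divisor

def standard_deviation(values, sample=False):
    return math.sqrt(variance(values, sample))               # step 3: square root

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical data, mean = 5
population_sd = standard_deviation(data)
sample_sd = standard_deviation(data, sample=True)
print(population_sd, sample_sd)
```

For this data the squared deviations sum to 32, so the population variance is 32/8 = 4 and the population SD is 2, while the sample variance is 32/7.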

Standard deviation is a very useful tool in investing and trading strategies as it
helps to measure market volatility and anticipate performance trends. A lower SD is
not necessarily preferable. It is one of the key fundamental risk measures that
analysts, portfolio managers and advisors use. A large dispersion shows how much the
return on the fund is deviating from the expected normal returns.

STANDARD DEVIATION v/s VARIANCE


The variance measures the size of the data's spread around the mean
value. As the variance gets bigger, more variation in data values occurs and there
may be a larger gap between one data value and another. If the data values are all
close together the variance will be smaller. However, variance is more difficult to
grasp than the standard deviation because it is a squared result that
may not be meaningfully expressed on the same graph as the original data set.

Standard deviations are usually easier to picture and apply. The standard
deviation is expressed in the same unit of measurement as the data, which is not the
case with variance. Together with the mean, the SD helps to judge whether the data
follow a normal curve or some other distribution. A larger variance means the data
points lie farther from the mean; a smaller variance means more of the data is
close to the average.

💡 DRAWBACK OF SD : The biggest drawback of standard deviation is that
it can be impacted by outliers and extreme values.

AVERAGE ABSOLUTE DEVIATION (AAD)

The AAD of a data set is the average of the absolute (positive) deviations from a
central point. In general form the central point can be the mean, median or mode.
AAD includes the mean absolute deviation (MAD) and the median absolute deviation
(also abbreviated MAD).

Two types of mean absolute deviation :

1. Mean absolute deviation around mean

2. Mean absolute deviation around median.

Two types of median absolute deviation :

1. Median absolute deviation around mean

2. Median absolute deviation around median.

MAXIMUM ABSOLUTE DEVIATION


Maximum absolute deviation around an arbitrary point is the maximum of the
absolute deviations of a sample from that point . It is not a strict measure of
central tendency .
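
The absolute deviations described above differ only in which central point is used and in how the deviations are combined. A minimal sketch, assuming a hypothetical data set and a helper named `absolute_deviations` introduced here for illustration:

```python
from statistics import mean, median

def absolute_deviations(values, center):
    """Absolute (positive) distance of every value from the chosen central point."""
    return [abs(x - center) for x in values]

data = [2, 2, 3, 4, 14]   # hypothetical data: mean = 5, median = 3

# Mean absolute deviation around the mean and around the median.
mean_ad_around_mean = mean(absolute_deviations(data, mean(data)))
mean_ad_around_median = mean(absolute_deviations(data, median(data)))

# Median absolute deviation around the median.
median_ad_around_median = median(absolute_deviations(data, median(data)))

# Maximum absolute deviation around an arbitrary point (here, the mean).
max_ad_around_mean = max(absolute_deviations(data, mean(data)))

print(mean_ad_around_mean, mean_ad_around_median,
      median_ad_around_median, max_ad_around_mean)
```

Note how the value 14 pulls the mean-based figures up, while the median absolute deviation around the median stays at 1.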

DATA SET
A data set is a collection of data of all kinds.
There are 2 major types of data set :

1. Qualitative data type

It is also known as categorical data. It describes the object under consideration
using a finite set of discrete classes. This type of data cannot be
counted or measured easily using numbers and is therefore divided into categories.

Example : Gender of a person .

Qualitative data type has 2 subtypes :

1. Nominal : These are sets of values that don't possess a natural ordering. Eg:
colour, gender of persons.

2. Ordinal : These types of values have a natural ordering while maintaining their
class of value. Eg: If we consider the sizes offered by a clothing brand then we can
easily sort them according to their name tag in the order small, medium and
large. The grading system used while marking candidates in a test can also be
considered an ordinal data type, where an A grade is definitely better than a B grade.

💡 NOTE : These categories help us in deciding which encoding strategy
can be applied to which type of data. For the nominal data type, where there
is no comparison among the categories, one-hot encoding can be applied,
which is similar to binary coding. For the ordinal type, label encoding can be
applied, which is a form of integer encoding.
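
A minimal sketch of the two encoding strategies in plain Python. The function names and the example categories are assumptions made for illustration; libraries such as scikit-learn provide their own encoders, which are not shown here.

```python
def one_hot_encode(values):
    """One 0/1 column per category; suited to nominal data (no ordering)."""
    categories = sorted(set(values))          # column order: alphabetical
    return [[1 if value == category else 0 for category in categories]
            for value in values]

def label_encode(values, ordered_categories):
    """Map each category to its position in a given order; suited to ordinal data."""
    codes = {category: index for index, category in enumerate(ordered_categories)}
    return [codes[value] for value in values]

colours = ["red", "green", "blue", "green"]            # nominal
one_hot = one_hot_encode(colours)                      # columns: blue, green, red

sizes = ["medium", "small", "large"]                   # ordinal
labels = label_encode(sizes, ["small", "medium", "large"])

print(one_hot)
print(labels)
```

The caller supplies the order for `label_encode` explicitly, because the natural ordering of ordinal categories cannot be inferred from the names alone.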

2. Quantitative data type

This data type tries to quantify things and it does so by considering numerical
values that make it countable in nature. Eg: Price of a smart phone.

💡 The key thing is that there can be an infinite no. of values a feature can
take. For example, the price of a smart phone can vary from "X" amount to any
value.

Quantitative data type has 2 subtypes :

1. Discrete data type : The numerical values which fall under integers or whole
numbers are placed under this category. For example : The no. of speakers in
a cell phone or the no. of SIM cards.

2. Continuous data type : The fractional numbers or decimal values are
considered as continuous. For example : The Android version of a phone,
the frequency of Wi-Fi, or the height of a person.

If numbers can be assigned to ordinal data, should it then be called
discrete type or ordinal type ?

The truth is that it is still ordinal data. The reason for this is that, even if the
numbering is done, it does not convey the actual distances between the classes.
Eg: Consider the grading system of a test. The respective grades can be
A,B,C,D,E and if we number them from the start then they would be 1,2,3,4,5.
Now according to the numerical difference the distance between D & E grades is
the same as the distance between C & D, which is not very accurate, as we all
know that a C grade is still acceptable compared with an E grade.

💡 NOTE We have discussed all the major classification of data . This is


important because now we can prioritize the tests to be performed on
different categories . Now it makes sense to plot a histogram or
frequency plot for quantitative data and a pie chart and bar plot for
qualitative data.

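💡 Regression analysis, which studies the relationship between one dependent
variable and one or more independent variables, is possible only for
quantitative data.

💡 ANOVA is applicable when the groups being compared are defined by
qualitative data; the values measured within each group are quantitative.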

GRAPHICAL REPRESENTATION
Graphics can be used as an effective method of visual communication. Statistical
graphics are beneficial for presentation and analysis of data. The statistical
graphic forms that we usually encounter are line charts, bar or column charts,
grouped bar charts, combination charts, pie charts and pictorial charts.

1. LINE CHARTS : These use lines between data points to depict magnitudes of
data for 2 variables or for one variable over time. A line chart for a time series
is known as a time series plot or sequence plot.

💡 Data values for a variable over time are known as a time series.

2. BAR CHART OR COLUMN CHART : Bar charts are used to depict the magnitude of
data for different qualitative categories or over time. The length/height of the bars
allows the user to compare magnitudes easily.

3. GROUPED BAR CHARTS : These can be used to depict the magnitudes of 2 or
more grouped data values for different qualitative categories or over time.

4. COMBINATION CHARTS : These charts use lines and bars to depict the
magnitudes of 2 or more data values for different categories or for different times.

5. PIE CHARTS : Pie charts can be used effectively to depict the proportions or
percentages of the total quantity that correspond to several qualitative
categories. Each category is depicted as a wedge of a circle or a piece of a pie.
The angle in degrees of each wedge is equal to the category's proportion
multiplied by 360 degrees.
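
As a small worked illustration of the wedge-angle rule, the sketch below converts category counts into angles. The function name `wedge_angles` and the counts are hypothetical.

```python
def wedge_angles(counts):
    """Angle of each wedge = category's proportion of the total x 360 degrees."""
    total = sum(counts.values())
    return {category: count * 360 / total for category, count in counts.items()}

# Hypothetical counts for three qualitative categories.
counts = {"small": 10, "medium": 25, "large": 15}
angles = wedge_angles(counts)
print(angles)
```

Here "medium" accounts for half of the total, so its wedge spans half of the circle, and the angles of all wedges add up to 360 degrees.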

6. PICTORIAL CHARTS : These charts use pictorial symbols to represent data.
They are often used to gain attention, but they can be difficult to interpret and are
also misused at times.

SEGMENTING DATA
We often talk about the top 25% or top 20% or top 10% or top 1% of something.

When we are segmenting data into percentages we are commonly talking about
quartiles, quintiles, deciles and percentiles respectively.

💡 Quartiles divide the data into 4 parts .

💡 Deciles divide the data into 10 parts .

💡 Quintiles divide the data into 5 parts .

💡 Percentiles divide the data into 100 parts .

KEY FEATURES OF QUARTILES

1. Quartiles measure the spread of values above and below the
median by dividing the distribution into 4 groups.

2. Quartiles divide the data at 3 points : a lower quartile, the median and an upper
quartile, which together form 4 groups of the data set.

3. Quartiles are used to calculate the interquartile range, which is a measure of
variability around the median.

The quartiles of a data set divide the data into 4 equal parts with 1/4th of the data
values in each part.
The first quartile Q1 is the median of the first half of the data set and marks the
point at which 25% of the data values are lower and 75% are higher.
The second quartile Q2 is the median of the data set which divides the data set
in half .
The third quartile Q3 is the median of the second half of the data set and marks
the point at which 25% of the data values are higher and 75% are lower.
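For example : 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15. ( ascending order )

Q2 = 8
Q1 = 4 ( the median of the first half, 1 to 7 )
Q3 = 12 ( the median of the second half, 9 to 15 )

Some conventions include the median in both halves, which gives
Q1 = 4.5 ( (4 + 5)/2 ) and Q3 = 11.5 ( (11 + 12)/2 ) for the same data.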

DECILES AND PERCENTILES

Deciles and percentiles are usually applied to large data sets .


Deciles divide the data set into 10 equal parts and percentiles divide them into
100 equal parts .

One example of the use of deciles is in school awards or rankings. For example:
students in the top 10% may be given an award. If there are 578 students in a
graduating class, the top 10%, or 58 students, may be given the award.

Similarly, at the opposite end of the scale, students who score in the bottom 10%
or 20% may be given extra assistance to boost their scores.

Percentiles divide the data set into groupings of 1%. Standardized tests often
report percentile scores. These scores help compare a student's performance to that
of their peers (often across a state or country). A percentile score
reflects the percentage of students who scored at or below that particular
student's score.
For example : Students who receive a percentile ranking of 87 on a particular test
received scores that were equal to or higher than those of 87% of the students who
took the test.
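
A brief sketch of both ideas: a percentile rank computed as the percentage of scores at or below a given score, and the size of the top 10% of a class of 578. The function name `percentile_rank` and the list of scores are hypothetical, and rounding the group size up is an assumption consistent with the figure of 58 in the text.

```python
import math

def percentile_rank(scores, score):
    """Percentage of all scores that are equal to or lower than `score`."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical test scores for ten students.
scores = [52, 61, 67, 70, 74, 78, 81, 85, 90, 96]
rank_of_85 = percentile_rank(scores, 85)      # 8 of the 10 scores are <= 85

# Top 10% of a graduating class of 578 students, rounded up to whole students.
award_count = math.ceil(578 * 10 / 100)       # 57.8 rounds up

print(rank_of_85, award_count)
```

The percentile rank of 85 is 80 here even though the score itself is 85, which is why a percentile should not be mistaken for the score.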

💡 DO NOT mistake percentile for the score of the student .

Growth charts are another common example of an application of percentile .
