
Unit 1

Uploaded by

pes2ug23cs007

UNIT 1

Date : 21/09/2024
Subject : #UE23MA242A

What is Data?
Data refers to individual facts, statistics or items of
information that are collected through observation

Data v/s Information


1. Data:
Data is usually raw facts formatted in a specific way.
Data is based on records and observations
Unorganized
2. Information:
Information usually has additional meaning beyond the facts
themselves
Based on Analysis of Data
Organized

Example:
Data => {13.5C, 14.7C, 19.9C}
Information => Highest temp = 19.9C and lowest temp = 13.5C

Structured, Semi-Structured and Unstructured

1. Structured:
This type of data is organized in a fixed schema, which makes it
easy to address and analyze. Basically a database
2. Semi-Structured:
Although this type of data does not have relations, it still
has some structure in it. Ex: XML -> has tags to
organize the data but no relations
3. Unstructured:
No structuring at all, just raw data. Ex:
Audio, Video, Location etc

Why do we need Data Science?


The main reason we need data science is to
process and interpret data, which helps us make better
decisions and supports growth, optimization etc
Unstructured data is very hard to work with, yet its
volume keeps increasing. Data Science provides a
way to extract valuable information from this unstructured data

The 6 V's of Big Data


Volume: The amount of data that is generated
Velocity: The speed with which data is generated
Variety: The types of data being generated
Veracity: The trustworthiness of the data
Value: The value this data provides to the end user, business
etc
Variability: The ways in which data can be used and formatted

What is Statistical Analysis?


It's the science of collecting, exploring and presenting large
amounts of data to discover underlying patterns and trends
The basic idea of any statistical method is to make
inferences about a population by studying a relatively small
sample chosen from that population

Population:
A population is the entire collection of data/objects about
which the information is sought

Sample:

A sample is a subset of the population, containing the
data/objects over which the outcomes are actually observed
Sample Size:
The number of items in the considered sample. This number
cannot exceed the population size and is usually much smaller
The process of selecting observations in order to make an
inference that can be generalized to the population is known as
Sampling

Population v/s Sample

Population | Sample
It is the complete set | It is a subset of the population
Hard to define and observe | Easier to define and observe
Time consuming and costly to study | Faster and cheaper to study
Contains all the members of a specified group | Contains a small part of the population that represents the population
Eg: All the countries of the world | Eg: Countries with data available about GDP and birth rates since 2000
Why Sampling?
Necessity:
Sometimes it's simply not possible to study the whole
population due to its size or inaccessibility.
Practicality:
It's easier and more efficient to collect data from a
sample.
Cost-effectiveness:
There are fewer participant, laboratory, equipment, and
researcher costs involved.
Manageability:
Storing and running statistical analyses on smaller datasets
is easier and more reliable.
Saves time:
As the sample is much smaller than the population, data
collection is faster

Characteristics of Sample:
1. Unbiased, which means it should contain all types of data in the
same proportion as the population
2. It should represent the population
3. It should be Goal-Oriented
4. It should be random i.e. every item in the population must have
an equal chance of getting selected
5. It should be appropriately sized: it cannot be very small
compared to the population and cannot be bigger than the
population. It should be just large enough to satisfy the
other characteristics

Types of Population:
1. Tangible or Concrete Population:
This type of population consists of physical objects such as cars,
bolts, apples etc which can be counted. Thus, this
population is usually finite and involves counting
Each time an item is sampled (without replacement), the remaining
population size decreases by 1
2. Conceptual Population:
A population that does not consist of physical objects
is called a Conceptual Population
This type of population usually involves measuring something
multiple times
The size of the population is usually very large
Example:
1. A geologist weighs a rock, several times on a scale ->
Conceptual Population
3. Target Population:
It is the entire population to which researchers want to
generalise their conclusions
4. Study Population:
It is the population that the researchers have access to, it
can be due to geographic limitations, permissions, money etc
It is a subset of the Target Population
Sampling Breakdown:
Study : Find the mean weight of all students of all universities
in India.

1. To whom do you want to generalize the results?
Students of all universities in India
So this is our Target or Theoretical population
2. What population can you get access to?
Students of all universities in Karnataka
So this is our Study population
3. How can you get access to them?
List of universities in Karnataka
So this is our Sampling frame (a sampling frame is the list of
items or events from which potential respondents are drawn)
4. Who is in your study?
Students of two universities from Karnataka
So this is our Sample

Types of Sampling Methods:


Read the Advantages and Disadvantages of each Sampling Method
from Slides

Samples -> Non-Probability Samples, Probability Samples

Probability Sampling
It is a type of sampling in which every unit in the population
has a non-zero chance/probability of being selected
This type of sampling reduces bias
When every item in the population has an equal probability of being
selected in the sample, it is known as Equal Probability Sampling
or Self-Weighting

Probability Samples -> Simple Random, Systematic, Stratified, Cluster

1) Simple Random:

As the name suggests, it is an entirely random method of
selecting the sample
Here, each item has an equal probability of getting selected
Sampling Frame should be the entire Population
Best to use when sample size is small
Usually, all the items in the Population are assigned a number
and numbers are chosen at random to make the sample
Probability (in %) of an item getting selected is

$$P = \frac{n}{N} \times 100$$

n -> Sample Size, N -> Population Size
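
A minimal Python sketch of simple random sampling, assuming the population is a list of items numbered 1 to 100. The helper name `simple_random_sample` and the seed are illustrative choices, not part of the notes.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items, each with the same chance of selection."""
    rng = random.Random(seed)          # seeded generator so runs are repeatable
    return rng.sample(population, n)   # sampling without repeats

population = list(range(1, 101))       # items numbered 1..100, so N = 100
sample = simple_random_sample(population, n=10, seed=42)

# Chance (in %) that any given item ends up in the sample: (n / N) * 100
inclusion_pct = 100 * len(sample) / len(population)
print(sample, inclusion_pct)
```

Here `inclusion_pct` is the (n/N) × 100 figure from the formula above, with n = 10 and N = 100.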

1.5) (i) Simple Random Sampling With Replacement:

In this sampling method, a random item is selected, measured
or recorded, and then returned to the population, where
it can be selected again
This can lead to a single item being sampled multiple times
Each time we sample a unit, all units have the same probability
of being sampled

1.5) (ii) Simple Random Sampling Without Replacement:

As the name suggests, in this sampling method there is no
replacement; hence, a single unit cannot be sampled more than
once
As a result, the selection probability of the remaining items
changes after each draw
Example: take a box of 10 balls. Initially each ball has a
probability of 10% (1/10 * 100) of being drawn, but once we remove
a ball it becomes 11.11% (1/9 * 100) for each remaining ball
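
The difference between the two variants can be sketched in Python using the box of 10 balls. The variable names and the fixed seed are assumptions made for illustration.

```python
import random

rng = random.Random(0)             # fixed seed so the run is repeatable
balls = list(range(1, 11))         # a box of 10 numbered balls

# With replacement: every draw sees all 10 balls, so repeats are possible.
with_replacement = rng.choices(balls, k=5)

# Without replacement: each drawn ball leaves the box.
box = balls.copy()
chances = []                       # per-ball chance (in %) before each draw
without_replacement = []
for _ in range(3):
    chances.append(round(100 / len(box), 2))
    drawn = rng.choice(box)
    box.remove(drawn)
    without_replacement.append(drawn)

print(chances)                     # per-ball chance rises as the box empties
```

The `chances` list reproduces the 10% and 11.11% figures from the example and shows the next step (1/8 * 100) as well.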

2) Systematic:
In this type of sampling, we arrange the population in some
systematic way and then pick out items at regular intervals
The first element is selected randomly from the first K elements
Then we choose every K'th element after it, where K = (Population Size /
Sample Size)
This is also an Equal Probability Sampling method
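
A short Python sketch of systematic sampling, assuming the population size is an exact multiple of the sample size so that K is a whole number. The function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick every k-th item after a random start, with k = N // n."""
    k = len(population) // n                      # sampling interval K
    start = random.Random(seed).randrange(k)      # random start among the first k items
    return [population[start + i * k] for i in range(n)]

population = list(range(1, 101))                  # already ordered, N = 100
sample = systematic_sample(population, n=10, seed=1)
print(sample)                                     # 10 items spaced 10 apart
```

Only the starting point is random; once it is fixed, the rest of the sample is fully determined by K.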

3) Stratified:

The population is broken down into smaller, non-overlapping
sub-populations called strata, then any of the sampling methods
is applied to each stratum to create a sample
Usually we apply simple random sampling to choose items from
each stratum
The strata can be chosen in terms of common characteristics but
we need to make sure that none of the strata overlap
The main advantage of this method is that minorities in the
population are also represented
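
A minimal Python sketch of stratified sampling with simple random sampling inside each stratum. Proportional allocation (the same fraction from every stratum) and the UG/PG student data are assumptions for illustration; the notes do not prescribe an allocation rule.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, seed=None):
    """Group items into strata, then draw the same fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)     # strata never overlap
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))   # at least one per stratum
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical population: 90 undergraduates and 10 postgraduates
students = [("UG", i) for i in range(90)] + [("PG", i) for i in range(10)]
sample = stratified_sample(students, stratum_of=lambda s: s[0], fraction=0.1, seed=7)
print(sample)
```

Because every stratum contributes at least one item, the small PG group cannot be missed, which a simple random sample of the same size would not guarantee.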

4) Cluster:

In this type of sampling, the population is again divided into
non-overlapping groups called clusters, each of which is ideally a
miniature of the population (in stratified sampling we broke down
the population based on common traits)
Then entire clusters are selected randomly to be part of the
Sample

4.5) (i) One Stage Cluster Sampling:

In this, clusters are made -> random clusters are selected ->
entire clusters are put in the sample

4.5) (ii) Two Stage Cluster Sampling:

In this, clusters are made -> random clusters are selected ->
random items from the randomly chosen clusters are selected ->
put in the sample
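
Both variants can be sketched with one Python function. The cluster contents, the function name `cluster_sample` and its `per_cluster` parameter are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """One-stage if per_cluster is None, otherwise two-stage."""
    rng = random.Random(seed)
    chosen = rng.sample(clusters, n_clusters)     # stage 1: pick whole clusters
    if per_cluster is None:
        return [item for cluster in chosen for item in cluster]
    # stage 2: pick random items inside each chosen cluster
    return [item for cluster in chosen for item in rng.sample(cluster, per_cluster)]

# 5 hypothetical clusters of 4 items each
clusters = [[f"c{c}-item{i}" for i in range(4)] for c in range(5)]
one_stage = cluster_sample(clusters, n_clusters=2, seed=3)
two_stage = cluster_sample(clusters, n_clusters=2, per_cluster=2, seed=3)
print(one_stage)
print(two_stage)
```

In both cases only the chosen clusters appear in the sample; the two-stage version simply keeps fewer items from each of them.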

Strata v/s Clusters:

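Each stratum is represented in the sample, since every stratum
contributes one or more elements to the sample
Only the randomly selected clusters are represented in the sample;
the other clusters contribute nothing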

Non-Probability Sampling:
In this type of sampling, not every item has a known, non-zero
chance of getting selected in the sample.
This means some items might have a 0 probability of getting
selected; these items are usually referred to as Out of
coverage/Undercovered items
It involves the selection of elements based on assumptions
regarding the population of interest
The resulting sample is more prone to bias

Non-Probability Samples -> Convenience, Judgement, Snowball, Quota

1) Convenience Sampling:

It is also known as grab, opportunity, haphazard
or accidental sampling
The sample is drawn from the part of the population which is
close at hand to the researcher
The researcher cannot generalise about the population from
this, as the sample will not be representative enough of the
population

2) Judgement Sampling:

Also known as Purposive sampling
In this type of sampling, the researcher chooses the sample based on
what they think is best for the research
This is used when only a limited number of people have expertise in the
area being researched
It is not a scientific way of sampling data

3) Snowball Sampling:

In this type of sampling, survey subjects are selected based on
referrals from other survey respondents
This method is effective when the sampling frame is difficult to
identify

4) Quota Sampling:
In this sampling, sample elements are selected until the quota
controls are satisfied
The population is first divided into mutually exclusive
subgroups, as in stratified sampling, and then judgement sampling is
used to select items from each subgroup in a specific
proportion

Sample Statistic v/s Population Parameter


Sample Statistic:
It is a piece of information that you get from a small part
of the population or sample
It can also be called the statistic computed from sample
data
Ex: Sample average, median etc
Population Parameter:
A statistical measure for a given population
It refers to the entire population
Ex: mean and variance of the population

Errors in Sampling
1. Sampling Error or random error
The discrepancy between a sample statistic and its
population parameter is called sampling error
It occurs because only a sample, not the whole population, is
observed, and it grows when the sample is not representative of the
population
2. Non-Sampling or Systematic Error
Occurs during data collection, causing the data to differ
from the true values
Sampling Bias:
This bias occurs when the sample is not representative of
the population
It can be either Selection Bias or Non-Response Bias
Selection Bias:
It is a bias in which the samples are chosen in such
a way that some members of the intended population
have a higher or lower sampling probability than
others
Non-Response Bias:
This occurs when certain groups of items selected from the
population are absent or do not respond during sampling

Types of Data:
NOIR => Nominal Ordinal Interval Ratio

1. Qualitative (N-O):
Measurements that cannot be recorded on a naturally
occurring numerical scale
This information can be sorted into categories but not
measured as numbers
Nominal and Ordinal
Nominal:
Data that can be categorized without any natural order
Ex: Gender (Male, Female), Colours(Red, Green and Blue)
Ordinal:
Data has a natural order, but does not have a regular
interval between them
Ex: Grades (A,B,C), Satisfaction Levels
2. Quantitative (I-R):
These are measurements that can be recorded on a naturally
occurring numerical scale
This data is readily open to statistical analysis and can be plotted
on various graphs
Discrete, Continuous, Interval and Ratio
Discrete:
If the values of the set are distinct and separate (countable) then
it is said to be discrete data
Usually bar charts are used to display this data
It has a countable number of possible values
Continuous:
If the values in the set can take any value within an
interval, it is said to be continuous
data
Interval:
In this data type, data is measured along a scale with
regular intervals
Although they are placed at regular intervals, they do
not have a meaningful zero
Ex: Temperature in Celsius (0 Celsius does not mean "No
temperature")
Ratio:
Similar to interval data but it has a meaningful zero
point
It is the most precise type of data and allows for all
statistical techniques
Ex: Weight (0 means no weight)

Variables and Attributes


Variables are placeholders that can hold any type of data
Attributes refer to characteristics or properties of an entity.
For example, in a dataset of children, name, age and sex can be the
attributes
Types of variables are the same as the types of data
The type of an attribute depends on the properties it possesses
Nominal - Distinctiveness
Ordinal - Distinctiveness and Order
Interval - Distinctiveness, Order and Addition
Ratio - Distinctiveness, Order, Addition and Multiplication

Types of Studies:
Observational:
No interference from the researcher; subjects are just
observed
Experimental:
The researcher intervenes to perform an experiment and then
observations are made
Usually, an experimental study involves two groups, namely
Control and Experimental
The control group usually has no intervention by the
researcher, so the independent variable being
studied has no effect on it
The experimental group is the group on which the experiment
is conducted, and where the effect of the independent
variable is observed

Types of Statistics:
Descriptive Statistics:
Involves organization, summarization and display of data
It uses numerical and graphical methods to look for patterns
in a data set, to summarize the given data and to display
the data in a convenient form
Inferential Statistics:
Involves using a sample to infer
something about the population
It uses sample data to make estimates, generalizations,
decisions and predictions about a larger set of data
There are two main areas of Inferential Statistics:
Estimating Parameters:
This means taking a sample statistic (from sample
data) and using it to say something about a
population parameter
Hypothesis Testing:
This is where sample data is used to answer research
questions

Descriptive Statistics:
Measures of central tendency:
There are 3 different types of average: mean, median and mode
All of them summarize where the centre of the data is

Mean:
It is the arithmetic average of the given data
Population mean

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

N -> Population Size
x_i -> Measurement Values
Sample Mean

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

n -> Sample Size
x_i -> Measurement Values

Weighted mean:
It is the average where some of the elements contribute more to
the mean value

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

w -> Weights of the data
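
As a quick check of the mean and weighted mean formulas, here is a small Python sketch. The marks and weights are made-up values used only for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

marks = [70, 80, 90]
weights = [1, 1, 2]                        # hypothetical weights: last mark counts double

plain = mean(marks)                        # (70 + 80 + 90) / 3
weighted = weighted_mean(marks, weights)   # (70 + 80 + 180) / 4
print(plain, weighted)
```

The weighted mean is pulled towards 90 because that value carries the largest weight.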

Trimmed Mean:

It is computed by arranging the sample data in order,
trimming an equal number of values from both ends and then
finding the mean of what remains
If p% of the data is trimmed from each end, then the mean is
called the p% trimmed mean
If the sample size is denoted by n and a p% trim is required, then
the number of data points to be removed from each end is

$$\frac{np}{100}$$

This mean is used to reduce the effect of outliers on the mean
This method is suited to heavily skewed data or data with erratic deviations
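
A minimal Python sketch of the p% trimmed mean. Truncating np/100 to a whole number when it is not an integer is an assumption made here, and p is assumed to be below 50 so that some data remains.

```python
def trimmed_mean(values, p):
    """Drop p% of the points from each end of the sorted data, then average."""
    data = sorted(values)
    cut = int(len(data) * p / 100)        # n*p/100 points removed per end
    kept = data[cut:len(data) - cut]
    return sum(kept) / len(kept)

data = [8, 11, 5, 9, 7, 6, 3616, 10, 12, 4]    # n = 10 with one extreme outlier
print(sum(data) / len(data))                   # ordinary mean, dominated by 3616
print(trimmed_mean(data, 10))                  # drops 4 and 3616 before averaging
```

With a 10% trim on 10 values, one value is removed from each end, so the outlier no longer affects the result.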

Median:

It is the value separating the upper half of the values from the lower
half
It is the middle number of a sorted sample
To find the median we first arrange the data in ascending order,
then:
If the number of elements n is odd, the median is the value of the
((n+1)/2)th term
If the number of elements n is even, the median is the average of
the (n/2)th and ((n/2)+1)th terms
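
A small Python sketch of the median rule: for odd n it returns the middle term, and for even n it averages the two middle terms. Positions in the notes are 1-based, so the code subtracts 1 when indexing.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        return data[(n + 1) // 2 - 1]              # ((n+1)/2)-th term, 1-based
    return (data[n // 2 - 1] + data[n // 2]) / 2   # (n/2)-th and (n/2 + 1)-th terms

print(median([7, 3, 9]))        # odd n: middle of 3, 7, 9
print(median([4, 1, 3, 2]))     # even n: average of 2 and 3
```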

Mode:

It is the value that occurs the most number of times in the sample
data
If a sample has only one distinct mode it is called Unimodal
If it has 2 distinct modes -> Bimodal
If it has more than 2 distinct modes -> Multimodal
If all the values occur with roughly the same frequency, so that no
value stands out as a mode -> Uniform
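
One way to find the mode(s) in Python is to count occurrences and keep every value that reaches the highest count. The function name `modes` is an illustrative choice.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs the maximum number of times."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3]))        # one mode    -> unimodal
print(modes([1, 1, 2, 2, 3]))     # two modes   -> bimodal
```

The length of the returned list tells you whether the sample is unimodal, bimodal or multimodal.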

Empirical Formula

mean − mode = 3 × (mean − median)
(holds approximately for moderately skewed distributions)

Skewness:
It is the measure of asymmetry of the distribution about its
mean
Skewness can be +ive, -ive, zero or undefined
Symmetric distribution is the one where the left and right side
of the distributions are balanced. The mean, median and mode are
the same
Skewed distribution is the one where the left and right side of
the distribution is imbalanced. The mean, median lie more
towards the skewness than the mode.
mean < median < mode -> Left Skewed
mode < median < mean -> Right Skewed
mean = median = mode -> Symmetric

Measures of Spread/Deviation:
It tells us how much the data is spread out, i.e. how
homogeneous/heterogeneous the data is
There are two main kinds of measures of spread
Absolute:
Has the same unit as the data; usually based on the
deviations of observations from the centre, such as standard
deviation etc
Relative:
A unit-free measure used to compare the spread of two or more data
sets

Range:

Most common and easily understandable
Difference b/w Max and Min of the data set
It can sometimes be misleading, as it is affected a lot when
the outlier values are very big
Example: Consider {8,11,5,9,7,6,3616}. Although all but one of the
values are around 10, the range is 3616 - 5 = 3611, which
does not accurately describe the spread of the dataset

Percentile:
A percentile is a comparison measure between a particular value
and the rest of the values in the dataset
For example, if you have scored 75 marks and are ranked in the
85th percentile, that means that 75 marks is greater than 85% of
the scores
The rank (position) of the pth percentile in the sorted sample is
calculated using

$$R = \left(\frac{P}{100}\right)(n + 1)$$

P -> Required Percentile
n -> Sample Size

The pth percentile of a sample divides the distribution such
that
p% of the sample values are less than the pth percentile
(100 - p)% of the sample values are greater than the pth
percentile
Steps to calculate the pth percentile:
1. Order the sample in ascending order
2. Compute the rank R
3. If the rank is an integer, then the sample value in this position is
the pth percentile
4. Otherwise it is the average of the sample values at the integer
positions just below and just above the rank
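
The steps above can be sketched in Python as follows. Clamping ranks that fall outside 1..n is an assumption added for safety; the notes do not say what to do in that case.

```python
def percentile(values, p):
    """p-th percentile using the rank R = (p / 100) * (n + 1)."""
    data = sorted(values)                 # step 1: ascending order
    n = len(data)
    rank = p / 100 * (n + 1)              # step 2: 1-based rank
    rank = min(max(rank, 1), n)           # assumption: clamp ranks outside 1..n
    lower = int(rank)                     # integer position at or below the rank
    if rank == lower:
        return data[lower - 1]            # step 3: integer rank
    return (data[lower - 1] + data[lower]) / 2   # step 4: average the neighbours

scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]    # n = 9
print(percentile(scores, 50))    # rank 5   -> the 5th value
print(percentile(scores, 25))    # rank 2.5 -> average of the 2nd and 3rd values
```

Note that library routines often interpolate between neighbours instead of averaging them, so their answers can differ slightly from this method.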

Quartile:
In this method the sorted distribution is divided into 4 equal parts
Position of Q1 = 0.25(n + 1)
Position of Q2 = 0.5(n + 1), which is also the median
Position of Q3 = 0.75(n + 1)
Here also, if a position is an integer we directly take the value
at that position; if not, we take the average of the preceding and
succeeding values
The values must first be arranged in ascending order

InterQuartile Range:

It is the distance/range between the 75th Percentile and the
25th Percentile
IQR = Q3 - Q1
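
A short Python sketch that finds the quartiles from their (n + 1)-based positions and then the IQR. The helper `value_at` and the sample data are hypothetical.

```python
def value_at(data, position):
    """Value at a 1-based position in sorted data; average neighbours if fractional."""
    lower = int(position)
    if position == lower:
        return data[lower - 1]
    return (data[lower - 1] + data[lower]) / 2

data = sorted([7, 1, 3, 5, 9, 11, 13])      # n = 7
n = len(data)
q1 = value_at(data, 0.25 * (n + 1))         # position 2
q2 = value_at(data, 0.50 * (n + 1))         # position 4, the median
q3 = value_at(data, 0.75 * (n + 1))         # position 6
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Because the IQR ignores the lowest and highest quarters of the data, it is much less sensitive to outliers than the range.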

Variance:
It is the measure of how spread out the values are from the
centre/average of the distribution

$$\text{Sample Variance, } s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

n -> Sample Size

$$\text{Population Variance, } \sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

N -> Population Size

Steps to calculate:
1. Find the mean of the given data
2. Subtract the mean from each data value
3. Square each deviation found in step 2
4. Add all the squared deviations and divide by N for a population or n-1
for a sample
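
The four steps translate directly into Python. The data set below is made up so that the arithmetic is easy to follow by hand.

```python
import math

def variance(values, sample=True):
    """Variance by the four steps: mean, deviations, squares, divide."""
    mean = sum(values) / len(values)                     # step 1
    squared_devs = [(x - mean) ** 2 for x in values]     # steps 2 and 3
    divisor = len(values) - 1 if sample else len(values)
    return sum(squared_devs) / divisor                   # step 4

data = [2, 4, 4, 4, 5, 5, 7, 9]           # mean is 5, squared deviations sum to 32
s2 = variance(data)                       # sample variance: 32 / 7
sigma2 = variance(data, sample=False)     # population variance: 32 / 8
sigma = math.sqrt(sigma2)                 # population standard deviation
print(s2, sigma2, sigma)
```

The sample variance is slightly larger than the population variance because it divides by n - 1 instead of N.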

Standard Deviation:
Standard deviation = sqrt(Variance)
Larger the standard deviation, greater the spread
Just like mean, std. deviation is affected by outlier values

Chebyshev's Inequality:
This states that at least 1 - (1/k^2) of the data from a sample must
fall within k standard deviations of the mean (for k > 1)
Example: For k = 2, we have 1 - (1/k^2) = 1 - 1/4 = 3/4 = 75%.
According to Chebyshev's Inequality, at least 75% of the data
should lie within 2 standard deviations of the mean
The inequality is written as

$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$

In questions we will mostly use the rephrased version, which is

$$P(\mu - k\sigma \leq X \leq \mu + k\sigma) \geq 1 - \frac{1}{k^2}$$
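
Chebyshev's bound can be checked numerically. This Python sketch reuses the outlier-heavy data set from the Range example and uses the population standard deviation (dividing by N), which is an assumption about how the data is treated.

```python
import math

def fraction_within(values, k):
    """Fraction of values within k population standard deviations of the mean."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    inside = sum(1 for x in values if abs(x - mu) <= k * sigma)
    return inside / n

data = [8, 11, 5, 9, 7, 6, 3616]        # data set from the Range example
k = 2
bound = 1 - 1 / k ** 2                  # Chebyshev lower bound for k = 2
observed = fraction_within(data, k)
print(bound, observed)                  # observed is never below the bound
```

The bound is deliberately loose: it holds for any distribution, so the observed fraction is usually well above it.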

Do the questions on slides 367 - 369 in Unit 1 Combined Slides

Sampling Distribution:
It is the probability distribution of a sample statistic, such as the
sample mean, taken over different samples from the same
population
It is used to estimate population parameters
If X1...Xn is a simple random sample from a population with mean
μ and variance σ², then the sample mean X̄ is a random variable
with

$$\mu_{\bar{x}} = \mu$$

$$\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$$

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
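
These results can be illustrated with a small simulation. The Python sketch below assumes a Uniform(0, 1) population (mean 0.5, variance 1/12); the sample size, number of samples and seed are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(2024)                # fixed seed so the run is repeatable
n = 25                                   # sample size
num_samples = 5000                       # how many samples we draw

# Assumed population: Uniform(0, 1), which has mean 0.5 and variance 1/12
mu, sigma = 0.5, math.sqrt(1 / 12)

# One sample mean per sample
sample_means = [
    statistics.fmean([rng.random() for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)   # should be near mu
sd_of_means = statistics.pstdev(sample_means)    # should be near sigma / sqrt(n)
print(mean_of_means, mu)
print(sd_of_means, sigma / math.sqrt(n))
```

Increasing n shrinks the spread of the sample means, which is exactly what the σ/√n formula predicts.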

Central Limit Theorem:


Point Estimate:
A quantity calculated from the data is called a statistic, and
a statistic used to estimate an unknown constant or parameter
is called a Point Estimator
Properties of Point Estimator:
Bias:
The difference between the expected value of the estimator and
the value of the parameter being estimated; the estimator is
unbiased when this difference is zero
Consistency:
This describes how closely the point estimator approaches the
true value as the sample size increases
Efficiency:
A very efficient point estimator should have the following
Least Variance
Least Bias
Consistent
Goodness Measure of a Point Estimator - Mean Squared Error
Method to construct a Point Estimator - Maximum Likelihood
Estimate

Mean Squared Error:


MSE combines both bias and uncertainty (variance): MSE = Bias^2 + Variance

Maximum Likelihood Estimate:


Refer from page 69 of vibha notes (combined) in downloads

Next : _MCSE_/UNIT 2
