Intro To Statistics
Data Science
Why Data Science?
● Data is everywhere.
● Data Science plays an important role in:
❖ Discovering useful information.
❖ Answering questions.
❖ Predicting the future or the unknown.
Data Science and Statistics
● Data comes from many sources: sensor measurements, events, text, images,
and videos. The Internet of Things (IoT) is spewing out streams of
information.
● Much of this data is unstructured: images are a collection of pixels, with each
pixel containing RGB (red, green, blue) color information. Clickstreams are
sequences of actions by a user interacting with an app or a web page. In fact,
a major challenge of data science is to harness this torrent of raw data into
actionable information.
● To apply statistical concepts, unstructured raw data must be processed
and manipulated into a structured form. One of the most common forms of
structured data is a table with rows and columns, as data might emerge from a
relational database or be collected for a study.
What is Statistics?
● Statistics is an area of applied mathematics concerned with the collection,
organization, analysis, interpretation, and presentation of data, including
large data sets.
● It helps in becoming familiar with the data and describing the data.
● It helps in understanding how data can be used in solving complex problems.
● Statistics is the art of creating meaning from data and quantifying its associated
uncertainty.
● The word "statistics" can also refer to a collection of quantitative data.
Statistics
In general, statistics relate to numerical data; in fact, the term “statistics” can refer to the science of
dealing with numerical data itself. Statistics are also a type of information obtained through
mathematical operations on data. Above all, statistics aim to provide useful information by means
of numbers.
The most commonly used statistics to report statistical information are called descriptive statistics.
For numeric variables, measures of central tendency provide the value that is the most
representative of the units found in a data set. Measures of dispersion describe the spread of the
data around the central tendency. For categorical variables, frequency distributions are used to
summarize the data. Proportions, ratios and rates are also useful statistics to analyze the data.
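As a minimal sketch of a frequency distribution and proportions for a categorical variable, the snippet below uses Python's standard library. The `departments` list is made-up illustrative data, not taken from any real data set.

```python
from collections import Counter

# Hypothetical categorical data: the department of each employee in a small sample.
departments = ["Sales", "IT", "Sales", "HR", "IT", "Sales", "Sales", "HR"]

# Frequency distribution: how many times each category occurs.
frequencies = Counter(departments)

# Proportions: each count divided by the total number of observations.
total = len(departments)
proportions = {category: count / total for category, count in frequencies.items()}

for category, count in frequencies.most_common():
    print(f"{category}: count={count}, proportion={proportions[category]:.2f}")
```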
Statistical information
Statistical information is data that has been recorded, classified, organized, related, or
interpreted within a framework so that meaning emerges. Statistical information that is
communicated to information users should help them understand the story told by the data and
communicate to them the quality of the information that is presented. Statistical information can
be presented in various formats: texts, tables, graphs, infographics, videos, or even databases.
Statistics terminology
Population: a collection or set of individuals, objects, or events whose properties are to be analyzed.
Sample: a subset of the population. A well-chosen sample will contain most of the information
about a particular population parameter.
Sampling
Sampling: a statistical method that deals with the selection of individual
observations within a population. It is performed in order to infer statistical
knowledge about a population.
Why sampling?
Sampling is a shortcut for drawing inferences about the entire population: you
study a well-chosen subset instead of measuring every member of the
population.
Sampling Techniques:
● Probability sampling involves random selection, allowing you to make strong statistical inferences.
● Example: Simple random sampling. You want to select a simple random sample of 100 employees of Company X.
You assign a number to every employee in the company database from 1 to 1000, and use a random number
generator to select 100 numbers.
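The simple random sampling example can be sketched in a few lines of Python. This assumes employees are identified only by the numbers 1 to 1000; the seed value is arbitrary and is there only to make the run repeatable.

```python
import random

# Assumption: employees are identified by the numbers 1..1000, as in the example.
employee_ids = list(range(1, 1001))

rng = random.Random(42)  # arbitrary fixed seed so the sketch is reproducible

# sample() draws without replacement, so no employee is selected twice.
simple_random_sample = rng.sample(employee_ids, k=100)

print(len(simple_random_sample))
print(sorted(simple_random_sample)[:5])
```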
Systematic Sample
● Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member
of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at
regular intervals.
● Example: Systematic sampling. All employees of the company are listed in alphabetical order. From the first 10
numbers, you randomly select a starting point: number 6. From number 6 onwards, every 10th person on the list is
selected (6, 16, 26, 36, and so on), and you end up with a sample of 100 people.
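A possible sketch of the systematic sampling example follows. It assumes the alphabetical list is represented by positions 1 to 1000; the starting point is drawn at random, so it will not necessarily be 6 as in the text.

```python
import random

# Assumption: each number is an employee's position in the alphabetical list.
employee_ids = list(range(1, 1001))
sample_size = 100
interval = len(employee_ids) // sample_size  # 1000 // 100 = 10

rng = random.Random(0)  # arbitrary seed for reproducibility
start = rng.randint(1, interval)  # random starting point among the first 10 numbers

# From the starting point onwards, take every 10th employee on the list.
systematic_sample = employee_ids[start - 1::interval]

print(start, systematic_sample[:4], len(systematic_sample))
```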
Stratified Sample
Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw
more precise conclusions by ensuring that every subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata) based on the relevant characteristic.
Based on the overall proportions of the population, you calculate how many people should be sampled from each subgroup.
Then you use random or systematic sampling to select a sample from each subgroup.
Example: Stratified sampling. The company has 800 female employees and 200 male employees. You want to ensure that
the sample reflects the gender balance of the company, so you sort the population into two strata based on gender. Then
you use random sampling on each group, selecting 80 women and 20 men, which gives you a representative sample of 100
people.
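The stratified example might look like this in Python. The population records are hypothetical `(gender, index)` pairs built to match the 800/200 split; with other proportions, rounding the per-stratum sizes may not add up exactly to the target sample size.

```python
import random

rng = random.Random(1)  # arbitrary seed for reproducibility

# Hypothetical population matching the example: 800 women and 200 men.
population = [("F", i) for i in range(800)] + [("M", i) for i in range(200)]
sample_size = 100

# Group the population into strata by the characteristic of interest.
strata = {}
for gender, employee in population:
    strata.setdefault(gender, []).append((gender, employee))

# Sample from each stratum in proportion to its share of the population.
stratified_sample = []
for gender, members in strata.items():
    n_stratum = round(sample_size * len(members) / len(population))
    stratified_sample.extend(rng.sample(members, k=n_stratum))

counts = {g: sum(1 for s in stratified_sample if s[0] == g) for g in strata}
print(counts)
```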
Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each subgroup should have similar characteristics to the
whole population. Instead of sampling individuals from each subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled cluster. If the clusters themselves are large, you can
also sample individuals from within each cluster using one of the techniques above. This is called multistage sampling.
This method is good for dealing with large and dispersed populations, but there is more risk of error in the sample, as there could be
substantial differences between clusters. It’s difficult to guarantee that the sampled clusters are really representative of the whole
population.
Example: Cluster sampling. The company has offices in 10 cities across the country (all with roughly the same number of employees
in similar roles). You don’t have the capacity to travel to every office to collect your data, so you use random sampling to select 3
offices – these are your clusters.
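Here is a rough sketch of the cluster example, including the multistage variant. The office names, the 100 employees per office, and the 20 employees drawn per office in the second stage are all assumptions made for illustration.

```python
import random

rng = random.Random(7)  # arbitrary seed for reproducibility

# Hypothetical offices: 10 cities with 100 employees each.
offices = {
    f"city_{c}": [f"city_{c}_emp_{e}" for e in range(100)]
    for c in range(1, 11)
}

# Stage 1: randomly select 3 whole offices (the clusters).
sampled_offices = rng.sample(sorted(offices), k=3)

# Single-stage cluster sample: every employee in the selected offices.
cluster_sample = [emp for office in sampled_offices for emp in offices[office]]

# Multistage variant: a simple random sample of 20 employees per selected office.
multistage_sample = [
    emp
    for office in sampled_offices
    for emp in rng.sample(offices[office], k=20)
]

print(sampled_offices, len(cluster_sample), len(multistage_sample))
```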
Types of Statistics:
1. Descriptive statistics
2. Inferential statistics
Descriptive statistics
1. Mean (Average): The sum of all values divided by the number of values.
2. Median (50th percentile): The value such that one-half of the data lies above it and one-half below it.
● The most basic estimate of location is the mean, or average value. The mean
is the sum of all values divided by the number of values.
● The formula to compute the mean for a set of n values x_1, x_2, ..., x_n is:
$$\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
● N (or n) refers to the total number of records or observations. In statistics
it is capitalized if it refers to a population, and lowercase if it refers to a
sample from a population. In data science, that distinction is not vital, so
you may see it both ways.
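As a small illustration of the mean formula, the sketch below computes the mean by hand and with the standard library's `statistics.mean`. The ten values in `data` are illustrative.

```python
import statistics

# Illustrative data: ten values.
data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

n = len(data)                 # number of observations
mean_manual = sum(data) / n   # sum of all values divided by the number of values
mean_library = statistics.mean(data)

print(mean_manual, mean_library)
```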
Median
● The median is the middle number on a sorted list of the data.
● If there is an even number of data values, the middle value is one that is not actually
in the data set, but rather the average of the two values that divide the sorted data
into upper and lower halves.
● The mode is the most frequent number: the value that occurs the highest
number of times.
● At the heart of statistics lies variability: measuring it, reducing it, distinguishing
random from real variability, identifying the various sources of real variability,
and making decisions in the presence of it.
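The median and mode described above can be checked with Python's `statistics` module. The three small lists are made-up examples chosen to show the even-count case, the odd-count case, and a clear single mode.

```python
import statistics

# Even number of values: the median is the average of the two middle values.
even_data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
median_even = statistics.median(even_data)  # (100 + 105) / 2

# Odd number of values: the median is the middle value of the sorted list.
odd_data = [3, 1, 7, 5, 9]
median_odd = statistics.median(odd_data)  # sorted: 1, 3, 5, 7, 9

# Mode: the value that occurs most often.
mode_data = [2, 4, 4, 6, 4, 8, 2]
mode_value = statistics.mode(mode_data)

print(median_even, median_odd, mode_value)
```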
Variance
● Variance indicates the spread of the data. For a set of n values x_1, x_2, ..., x_n with mean \bar{x}, it is given by:
$$\mathrm{Var}(X) = \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$
● Variance is often represented by Var(X) or by the symbol sigma squared: σ^2.
● It describes how much a random variable differs from its expected value.
Step 1 to Calculate the Variance: Find the Mean
(80+85+90+95+100+105+110+115+120+125) / 10 = 102.5
Step 2: For Each Value - Find the Difference from the Mean
80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5
Step 3: For Each Difference - Find the Square Value
(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
2.5^2 = 6.25
7.5^2 = 56.25
12.5^2 = 156.25
17.5^2 = 306.25
22.5^2 = 506.25
Step 4: Find the Average of the Squared Differences
(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25
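import numpy as np
data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
var = np.var(data)  # population variance: divides by the number of values
print(var)  # 206.25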
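The hand calculation above can also be traced step by step in plain Python, which may help when checking the intermediate numbers. This is a sketch that uses only the standard library; `statistics.pvariance` is included as a cross-check because it also divides by the number of values.

```python
import statistics

data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# Step 1: find the mean.
mean = sum(data) / len(data)

# Step 2: subtract the mean from each value.
differences = [x - mean for x in data]

# Step 3: square each difference.
squared_differences = [d ** 2 for d in differences]

# Step 4: average the squared differences (divide by the number of values).
variance = sum(squared_differences) / len(data)

print(mean, variance, statistics.pvariance(data))
```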
Standard deviation (std)
● Standard deviation is a measure of the dispersion of a set of data from its mean.
● A low standard deviation means that most of the numbers are close to the mean (average) value.
● A high standard deviation means that the values are spread out over a wider range.
● Standard deviation is the square root of the variance.
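Standard deviation in Python
import numpy as np
data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
std = np.std(data)  # population standard deviation: square root of np.var(data)
print(std)  # about 14.36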
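To connect the definitions above, the sketch below checks that the standard deviation equals the square root of the variance, and compares a tightly clustered data set with a widely spread one. It uses the standard library's `statistics.pstdev` and `statistics.pvariance`, which divide by the number of values; the `tight` and `wide` lists are made-up examples.

```python
import math
import statistics

data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

variance = statistics.pvariance(data)
std = statistics.pstdev(data)

# The standard deviation is the square root of the variance.
print(math.isclose(std, math.sqrt(variance)))

# Hypothetical data sets with the same mean (100) but different spread.
tight = [99, 100, 101]
wide = [50, 100, 150]
std_tight = statistics.pstdev(tight)  # values close to the mean: low std
std_wide = statistics.pstdev(wide)    # values spread out: high std

print(round(std, 2), round(std_tight, 2), round(std_wide, 2))
```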