Intro To Statistics

The document discusses principles and practices of data science. It covers topics like why data science is important, how data science relates to statistics, different data types, statistical terminology, sampling techniques, statistic types, and key terms for estimates of location and variability in statistics. Rectangular and structured data are important formats for data science and analysis.


Principles and Practices of Data Science
Why Data Science?

● Data is everywhere.
● Data Science plays an important role in:
❖ Discovering useful information.
❖ Answering questions.
❖ Predicting the future or the unknown.
Data Science and Statistics
● Data comes from many sources: sensor measurements, events, text, images,
and videos. The Internet of Things (IoT) is spewing out streams of
information.
● Much of this data is unstructured: images are a collection of pixels, with each
pixel containing RGB (red, green, blue) color information. Clickstreams are
sequences of actions by a user interacting with an app or a web page. In fact,
a major challenge of data science is to harness this torrent of raw data into
actionable information.
● To apply statistical concepts, unstructured raw data must be processed and
manipulated into a structured form. One of the most common forms of
structured data is a table with rows and columns, as data might emerge from a
relational database or be collected for a study.
What is Statistics?
● Statistics is an area of applied mathematics concerned with the collection,
organization, analysis, interpretation, and presentation of data.
● It helps us become familiar with the data and describe it.
● It helps us understand how data can be used in solving complex problems.
● Statistics is the art of creating meaning from data and quantifying its associated
uncertainty.
● Statistics is also a collection of various quantitative data.
Statistics
In general, statistics relate to numerical data; in fact, the term “statistics” can refer to the science of
dealing with numerical data itself. Statistics are also a type of information obtained through
mathematical operations on data. Above all, statistics aim to provide useful information by means
of numbers.

The most commonly used statistics to report statistical information are called descriptive statistics.
For numeric variables, measures of central tendency provide the value that is the most
representative of the units found in a data set. Measures of dispersion describe the spread of the
data around the central tendency. For categorical variables, frequency distributions are used to
summarize the data. Proportions, ratios and rates are also useful statistics to analyze the data.
Statistical information
Statistical information is data that has been recorded, classified, organized, related, or
interpreted within a framework so that meaning emerges. Statistical information that is
communicated to information users should help them understand the story told by the data and
communicate to them the quality of the information that is presented. Statistical information can
be presented in various formats: texts, tables, graphs, infographics, videos, or even databases.
Statistical terminology
Population: a collection or set of individuals, objects, or events whose properties are to be analyzed.

Sample: a subset of the population; a well-chosen sample will contain most of the information
about a particular population parameter.
Sampling
Sampling: a statistical method that deals with the selection of individual
observations within a population. It is performed in order to infer statistical
knowledge about the population.

Why sampling?

Studying a well-chosen sample is a shortcut for drawing inferences about the entire
population: instead of taking the whole population and measuring every member, we
examine only the sample.
Sampling Techniques:
● Probability sampling involves random selection, allowing you to make strong statistical inferences
about the whole group. It is mainly used in quantitative research.

● Non-probability sampling involves non-random selection based on convenience or other criteria,
allowing you to easily collect data.

Probability Sampling
Probability sampling means that every member of the population has a chance of being selected. It is mainly used in
quantitative research. If you want to produce results that are representative of the whole population, probability sampling
techniques are the most valid choice.

There are four types of probability sampling:


1. Simple Random Sample
2. Systematic Sample
3. Stratified Sample
4. Cluster Sample
Simple Random Sampling
● In a simple random sample, every member of the population has an equal chance of being selected. Your sampling
frame should include the whole population.

● Example (simple random sampling): You want to select a simple random sample of 100 employees of Company X.
You assign a number from 1 to 1000 to every employee in the company database and use a random number
generator to select 100 numbers.
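A minimal sketch of that selection step in Python, assuming the employee IDs are simply the numbers 1 to 1000 (the random module is part of the standard library):

import random

# Hypothetical sampling frame: employee IDs 1 to 1000
employee_ids = list(range(1, 1001))

# Draw 100 IDs without replacement; every ID is equally likely to be chosen
simple_random_sample = random.sample(employee_ids, 100)

print(simple_random_sample)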
Systematic Sample
● Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member
of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at
regular intervals.

● Example (systematic sampling): All employees of the company are listed in alphabetical order. From the first 10
numbers, you randomly select a starting point: number 6. From number 6 onwards, every 10th person on the list is
selected (6, 16, 26, 36, and so on), and you end up with a sample of 100 people.
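A minimal sketch of systematic sampling in Python, assuming the same hypothetical list of 1000 employees in alphabetical order:

import random

# Hypothetical sampling frame: 1000 employees, already in alphabetical order
employee_ids = list(range(1, 1001))

interval = 10                        # sampling interval: 1000 / 100
start = random.randint(1, interval)  # random starting point within the first 10
systematic_sample = employee_ids[start - 1::interval]  # every 10th person

print(len(systematic_sample))  # 100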
Stratified Sample
Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to
draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.

To use this sampling method, you divide the population into subgroups (called strata) based on the relevant characteristic

(e.g. gender, age range, income bracket, job role).

Based on the overall proportions of the population, you calculate how many people should be sampled from each subgroup.

Then you use random or systematic sampling to select a sample from each subgroup.

Example (stratified sampling): The company has 800 female employees and 200 male employees. You want to ensure that
the sample reflects the gender balance of the company, so you sort the population into two strata based on gender. Then
you use random sampling on each group, selecting 80 women and 20 men, which gives you a representative sample of 100
people.
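A minimal sketch of that stratified selection using pandas; the column names and the 800/200 split follow the example above, and the groupby(...).sample call assumes pandas 1.1 or newer:

import pandas as pd

# Hypothetical employee table: 800 female and 200 male employees
employees = pd.DataFrame({
    "id": range(1, 1001),
    "gender": ["F"] * 800 + ["M"] * 200,
})

# Sample 10% of each stratum: 80 women and 20 men
stratified_sample = employees.groupby("gender").sample(frac=0.1, random_state=42)

print(stratified_sample["gender"].value_counts())  # F: 80, M: 20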
Cluster sampling

Cluster sampling also involves dividing the population into subgroups, but each subgroup should have similar characteristics to the
whole sample. Instead of sampling individuals from each subgroup, you randomly select entire subgroups.

If it is practically possible, you might include every individual from each sampled cluster. If the clusters themselves are large, you can
also sample individuals from within each cluster using one of the techniques above. This is called multistage sampling.

This method is good for dealing with large and dispersed populations, but there is more risk of error in the sample, as there could be
substantial differences between clusters. It’s difficult to guarantee that the sampled clusters are really representative of the whole
population.

Example (cluster sampling): The company has offices in 10 cities across the country (all with roughly the same number of
employees in similar roles). You don’t have the capacity to travel to every office to collect your data, so you use random
sampling to select 3 offices – these are your clusters.
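A minimal sketch of selecting the clusters in Python (the city names are placeholders):

import random

# Hypothetical clusters: the company's 10 office cities
offices = ["City_" + str(i) for i in range(1, 11)]

# Randomly select 3 whole offices; every employee in them joins the sample
sampled_clusters = random.sample(offices, 3)

print(sampled_clusters)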
Statistic Types:

Statistics are broadly categorized into two types:

1. Descriptive statistics
2. Inferential statistics
Descriptive statistics

Descriptive statistics: uses the data to describe the population through numbers,
tables, graphs, and summary measures such as the following (a short Python sketch
follows the list):
● Count
● Sum
● Standard Deviation
● Percentile
● Average
● Etc.
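As a quick sketch, most of these summary measures can be computed with NumPy; the data values here are made up for illustration:

import numpy as np

# Made-up data set for illustration
data = [3, 5, 1, 2, 7, 7, 9]

print(len(data))                # Count
print(np.sum(data))             # Sum
print(np.std(data))             # Standard deviation
print(np.percentile(data, 50))  # 50th percentile (the median)
print(np.mean(data))            # Average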
Data Types
Data Types in Software
Rectangular Data
● The typical frame of reference for an analysis in data science is a rectangular
data object, like a spreadsheet or database table.
● Rectangular data is the general term for a two-dimensional matrix with rows
indicating records and columns indicating features (variables).
● The data frame is the specific format for rectangular data in Python (see the
sketch below).
● Data in relational databases must be extracted and put into a single table for
most data analysis and modeling tasks.
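A minimal sketch of a data frame in Python using pandas; the column names and values are hypothetical, loosely modeled on the auction data discussed on the next slide:

import pandas as pd

# A small, made-up rectangular data set: rows are records, columns are features
auctions = pd.DataFrame({
    "category": ["Music/Movie/Game", "Automotive", "Automotive"],
    "currency": ["US", "US", "US"],
    "duration": [5, 7, 5],        # measured / counted data
    "price": [0.01, 0.01, 0.01],  # measured / counted data
})

print(auctions.shape)   # (3, 4): 3 records, 4 features
print(auctions.dtypes)  # a mix of numeric and categorical (object) columns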
Rectangular Data
Data Frame
● In Table 1-1, there is a mix of measured
or counted data (e.g., duration and price)
and categorical data (e.g., category and
currency).

● As mentioned earlier, a special form of
categorical variable is a binary (yes/no or
0/1) variable: an indicator variable showing
whether an auction was competitive (had
multiple bidders) or not.
● This indicator variable also happens to be
an outcome variable, when the scenario
is to predict whether an auction is
competitive or not.
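A small sketch of how such an indicator/outcome variable might be derived, assuming a hypothetical column with the number of bidders per auction:

import pandas as pd

# Hypothetical auction records with the number of bidders per auction
auctions = pd.DataFrame({"auction_id": [1, 2, 3], "n_bidders": [1, 4, 2]})

# Binary indicator (outcome) variable: 1 if the auction was competitive, else 0
auctions["competitive"] = (auctions["n_bidders"] > 1).astype(int)

print(auctions)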
Data Science and Statistics

1. Key Terms for Estimates of Location

2. Key Terms for Estimates of Variability


● Key Terms for Estimates of Location
Variables with measured or count data might have thousands of distinct values. A basic step in
exploring your data is getting a “typical value” for each feature (variable): an estimate of where most
of the data is located (i.e., its central tendency).

● Examples of Key Terms for Estimates of Location:

1. Mean (average): the sum of all values divided by the number of values.

2. Median (50th percentile): the value such that one-half of the data lies above it and one-half below it.

3. Mode: the most frequent value in the data.


Mean

● The most basic estimate of location is the mean, or average value. The mean
is the sum of all values divided by the number of values.

● Consider the following set of numbers: {3 5 1 2}.

The mean is (3 + 5 + 1 + 2) / 4 = 11 / 4 = 2.75.


Mean
● You will encounter the symbol x̄ (pronounced “x-bar”) being used to represent
the mean of a sample from a population.

● The formula to compute the mean for a set of n values x1, x2, ..., xn is:

Mean = x̄ = (x1 + x2 + ... + xn) / n = (∑ xi) / n

N (or n) refers to the total number of records or observations. In statistics it is
capitalized if it is referring to a population, and lowercase if it refers to a sample
from a population. In data science, that distinction is not vital, so you may see it
both ways.
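A quick sketch checking the earlier example in Python (NumPy is assumed to be available, as in the later slides):

import numpy as np

data = [3, 5, 1, 2]           # the example set from the previous slide

print(np.mean(data))          # 2.75
print(sum(data) / len(data))  # 2.75, computed directly from the formula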
Median
● The median is the middle number on a sorted list of the data.
● If there is an even number of data values, the middle value is one that is not actually
in the data set, but rather the average of the two values that divide the sorted data
into upper and lower halves.

● For example: the median of 1, 4, 9, 6, 7 is 6.


● What is the median of these numbers: 1, 4, 9, 11, 15, 17, 6, 7?
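A minimal sketch using NumPy, which sorts the data internally and averages the two middle values when the list has an even length; the exercise above can be checked the same way:

import numpy as np

data = [1, 4, 9, 6, 7]

print(np.median(data))  # 6.0, matching the example above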
Mode

● The mode is the most frequent number, the number that occurs the highest
number of times.

● For example: the mode of 1, 4, 9, 6, 8, 9, 9, 6, 7 is 9.
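A minimal sketch using the mode function from Python's standard-library statistics module:

import statistics

data = [1, 4, 9, 6, 8, 9, 9, 6, 7]

print(statistics.mode(data))  # 9, which appears three times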


Estimates of Variability:

● Estimates of variability: location is just one dimension in summarizing a
feature. A second dimension, variability, also referred to as dispersion,
measures whether the data values are tightly clustered or spread out.

● At the heart of statistics lies variability: measuring it, reducing it, distinguishing
random from real variability, identifying the various sources of real variability,
and making decisions in the presence of it.
Variance

● Variance indicates the spread of the data. It is often represented by Var(X) or by
the symbol sigma squared: σ^2.
● The variance of a random variable X is given by Var(X) = σ^2 = E[(X − μ)^2],
the average of the squared differences from the mean.
● It describes how much a random variable differs from its expected value.
Step 1 to Calculate the Variance: Find the Mean

1. Find the mean:

(80+85+90+95+100+105+110+115+120+125) / 10 = 102.5

The mean is 102.5


Variance
Step 2: For Each Value - Find the Difference From the Mean

2. Find the difference from the mean for each value:

80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5
Variance
Step 3: For Each Difference - Find the Square Value

3. Find the square value for each difference:

(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
2.5^2 = 6.25
7.5^2 = 56.25
12.5^2 = 156.25
17.5^2 = 306.25
22.5^2 = 506.25

Note: We square the differences so that negative and positive deviations do not cancel out when measuring the total spread.


Variance
Step 4: The Variance is the Average of These Squared Values

4. Sum the squared values and find the average:

(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25

The variance is 206.25.


Variance in Python
import numpy as np

# The ten values from the worked example above
data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# np.var computes the population variance (divides by n) by default
var = np.var(data)

print(var)  # 206.25
Standard deviation (std)
● Standard deviation is a measure of the dispersion of a set of data from its mean.
● A low standard deviation means that most of the numbers are close to the mean (average) value.
● A high standard deviation means that the values are spread out over a wider range.
● The standard deviation is the square root of the variance.
Standard deviation in Python
import numpy as np

# The same ten values used in the variance example
data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

std = np.std(data)

print(std)  # about 14.36, the square root of the variance 206.25
