Intro To Statistics

The document discusses principles and practices of data science. It covers topics like why data science is important, how data science relates to statistics, different data types, statistical terminology, sampling techniques, statistic types, and key terms for estimates of location and variability in statistics. Rectangular and structured data are important formats for data science and analysis.


Principles and Practices of Data Science
Why Data Science?

● Data is everywhere.
● Data Science plays an important role in:
❖ Discovering useful information.
❖ Answering questions.
❖ Predicting the future or the unknown.
Data Science and Statistics
● Data comes from many sources: sensor measurements, events, text, images,
and videos. The Internet of Things (IoT) is spewing out streams of
information.
● Much of this data is unstructured: images are a collection of pixels, with each
pixel containing RGB (red, green, blue) color information. Clickstreams are
sequences of actions by a user interacting with an app or a web page. In fact,
a major challenge of data science is to harness this torrent of raw data into
actionable information.
● To apply statistical concepts, unstructured raw data must be processed and
manipulated into a structured form. One of the most common forms of
structured data is a table with rows and columns, as data might emerge from a
relational database or be collected for a study.
What is Statistics?
● Statistics is an area of applied mathematics concerned with the collection,
organization, analysis, interpretation, and presentation of data.
● It helps us become familiar with the data and describe it.
● It helps us understand how data can be used in solving complex problems.
● Statistics is the art of creating meaning from data and quantifying its associated
uncertainty.
● Statistics is also a collection of various quantitative data.
Statistics
In general, statistics relate to numerical data; in fact, the term “statistics” can refer to the science of
dealing with numerical data itself. Statistics are also a type of information obtained through
mathematical operations on data. Above all, statistics aim to provide useful information by means
of numbers.

The most commonly used statistics to report statistical information are called descriptive statistics.
For numeric variables, measures of central tendency provide the value that is the most
representative of the units found in a data set. Measures of dispersion describe the spread of the
data around the central tendency. For categorical variables, frequency distributions are used to
summarize the data. Proportions, ratios and rates are also useful statistics to analyze the data.
Statistical information
Statistical information is data that has been recorded, classified, organized, related, or
interpreted within a framework so that meaning emerges. Statistical information that is
communicated to information users should help them understand the story told by the data and
communicate to them the quality of the information that is presented. Statistical information can
be presented in various formats: texts, tables, graphs, infographics, videos, or even databases.
Statistical terminology
Population: a collection or set of individuals, objects, or events whose properties are to be analyzed.

Sample: a subset of the population; a well-chosen sample will contain most of the information
about a particular population parameter.
Sampling
Sampling: a statistical method that deals with the selection of individual
observations within a population. It is performed in order to infer statistical
knowledge about the population.

Why sampling?

Studying a well-chosen sample is a shortcut for drawing inferences about the entire
population: instead of taking the whole population and measuring every member, we
examine only the sample.
Sampling Techniques:
● Probability sampling involves random selection, allowing you to make strong statistical inferences
about the whole group. It is mainly used in quantitative research.

● Non-probability sampling involves non-random selection based on convenience or other criteria,
allowing you to easily collect data.

Probability Sampling
Probability sampling means that every member of the population has a chance of being selected. It is mainly used in
quantitative research. If you want to produce results that are representative of the whole population, probability sampling
techniques are the most valid choice.

There are four types of probability sampling:


1. Simple Random Sample
2. Systematic Sample
3. Stratified Sample
4. Cluster Sample
Simple Random Sampling
● In a simple random sample, every member of the population has an equal chance of being selected. Your sampling
frame should include the whole population.

● Example (simple random sampling): You want to select a simple random sample of 100 employees of Company X.
You assign a number from 1 to 1000 to every employee in the company database and use a random number
generator to select 100 numbers.
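A minimal sketch of that selection step in Python, assuming the employee IDs are simply the numbers 1 to 1000 (the random module is part of the standard library):

import random

# Hypothetical sampling frame: employee IDs 1 to 1000
employee_ids = list(range(1, 1001))

# Draw 100 IDs without replacement; every ID is equally likely to be chosen
simple_random_sample = random.sample(employee_ids, 100)

print(simple_random_sample)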
Systematic Sample
● Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member
of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at
regular intervals.

● Example (systematic sampling): All employees of the company are listed in alphabetical order. From the first 10
numbers, you randomly select a starting point: number 6. From number 6 onwards, every 10th person on the list is
selected (6, 16, 26, 36, and so on), and you end up with a sample of 100 people.
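A minimal sketch of systematic sampling in Python, assuming the same hypothetical list of 1000 employees in alphabetical order:

import random

# Hypothetical sampling frame: 1000 employees, already in alphabetical order
employee_ids = list(range(1, 1001))

interval = 10                        # sampling interval: 1000 / 100
start = random.randint(1, interval)  # random starting point within the first 10
systematic_sample = employee_ids[start - 1::interval]  # every 10th person

print(len(systematic_sample))  # 100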
Stratified Sample
Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to
draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.

To use this sampling method, you divide the population into subgroups (called strata) based on the relevant characteristic

(e.g. gender, age range, income bracket, job role).

Based on the overall proportions of the population, you calculate how many people should be sampled from each subgroup.

Then you use random or systematic sampling to select a sample from each subgroup.

Example (stratified sampling): The company has 800 female employees and 200 male employees. You want to ensure that
the sample reflects the gender balance of the company, so you sort the population into two strata based on gender. Then
you use random sampling on each group, selecting 80 women and 20 men, which gives you a representative sample of 100
people.
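A minimal sketch of that stratified selection using pandas; the column names and the 800/200 split follow the example above, and the groupby(...).sample call assumes pandas 1.1 or newer:

import pandas as pd

# Hypothetical employee table: 800 female and 200 male employees
employees = pd.DataFrame({
    "id": range(1, 1001),
    "gender": ["F"] * 800 + ["M"] * 200,
})

# Sample 10% of each stratum: 80 women and 20 men
stratified_sample = employees.groupby("gender").sample(frac=0.1, random_state=42)

print(stratified_sample["gender"].value_counts())  # F: 80, M: 20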
Cluster sampling

Cluster sampling also involves dividing the population into subgroups, but each subgroup should have similar characteristics to the
whole sample. Instead of sampling individuals from each subgroup, you randomly select entire subgroups.

If it is practically possible, you might include every individual from each sampled cluster. If the clusters themselves are large, you can
also sample individuals from within each cluster using one of the techniques above. This is called multistage sampling.

This method is good for dealing with large and dispersed populations, but there is more risk of error in the sample, as there could be
substantial differences between clusters. It’s difficult to guarantee that the sampled clusters are really representative of the whole
population.

Example (cluster sampling): The company has offices in 10 cities across the country (all with roughly the same number of
employees in similar roles). You don’t have the capacity to travel to every office to collect your data, so you use random
sampling to select 3 offices – these are your clusters.
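A minimal sketch of selecting the clusters in Python (the city names are placeholders):

import random

# Hypothetical clusters: the company's 10 office cities
offices = ["City_" + str(i) for i in range(1, 11)]

# Randomly select 3 whole offices; every employee in them joins the sample
sampled_clusters = random.sample(offices, 3)

print(sampled_clusters)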
Statistic Types:

Statistics are broadly categorized into two types:

1. Descriptive statistics
2. Inferential statistics
Descriptive statistics

Descriptive statistics: uses the data to describe the population through numbers,
tables, graphs, and summary measures such as the following (a short Python sketch
follows the list):
● Count
● Sum
● Standard Deviation
● Percentile
● Average
● Etc.
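As a quick sketch, most of these summary measures can be computed with NumPy; the data values here are made up for illustration:

import numpy as np

# Made-up data set for illustration
data = [3, 5, 1, 2, 7, 7, 9]

print(len(data))                # Count
print(np.sum(data))             # Sum
print(np.std(data))             # Standard deviation
print(np.percentile(data, 50))  # 50th percentile (the median)
print(np.mean(data))            # Average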
Data Types
Data Types in Software
Rectangular Data
● The typical frame of reference for an analysis in data science is a rectangular
data object, like a spreadsheet or database table.
● Rectangular data is the general term for a two-dimensional matrix with rows
indicating records and columns indicating features (variables).
● The data frame is the specific format for rectangular data in Python (see the
sketch below).
● Data in relational databases must be extracted and put into a single table for
most data analysis and modeling tasks.
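A minimal sketch of a data frame in Python using pandas; the column names and values are hypothetical, loosely modeled on the auction data discussed on the next slide:

import pandas as pd

# A small, made-up rectangular data set: rows are records, columns are features
auctions = pd.DataFrame({
    "category": ["Music/Movie/Game", "Automotive", "Automotive"],
    "currency": ["US", "US", "US"],
    "duration": [5, 7, 5],        # measured / counted data
    "price": [0.01, 0.01, 0.01],  # measured / counted data
})

print(auctions.shape)   # (3, 4): 3 records, 4 features
print(auctions.dtypes)  # a mix of numeric and categorical (object) columns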
Rectangular Data
Data Frame
● In Table 1-1, there is a mix of measured
or counted data (e.g., duration and price)
and categorical data (e.g., category and
currency).

● As mentioned earlier, a special form of
categorical variable is a binary (yes/no or
0/1) variable: an indicator variable showing
whether an auction was competitive (had
multiple bidders) or not.
● This indicator variable also happens to be
an outcome variable, when the scenario
is to predict whether an auction is
competitive or not.
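A small sketch of how such an indicator/outcome variable might be derived, assuming a hypothetical column with the number of bidders per auction:

import pandas as pd

# Hypothetical auction records with the number of bidders per auction
auctions = pd.DataFrame({"auction_id": [1, 2, 3], "n_bidders": [1, 4, 2]})

# Binary indicator (outcome) variable: 1 if the auction was competitive, else 0
auctions["competitive"] = (auctions["n_bidders"] > 1).astype(int)

print(auctions)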
Data Science and Statistics

1. Key Terms for Estimates of Location

2. Key Terms for Estimates of Variability


● Key Terms for Estimates of Location
Variables with measured or count data might have thousands of distinct values. A basic step in
exploring your data is getting a “typical value” for each feature (variable): an estimate of where most
of the data is located (i.e., its central tendency).

● Examples of Key Terms for Estimates of Location:

1. Mean (average): the sum of all values divided by the number of values.

2. Median (50th percentile): the value such that one-half of the data lies above it and one-half below it.

3. Mode: the most frequent value in the data.


Mean

● The most basic estimate of location is the mean, or average value. The mean
is the sum of all values divided by the number of values.

● Consider the following set of numbers: {3 5 1 2}.

The mean is (3 + 5 + 1 + 2) / 4 = 11 / 4 = 2.75.


Mean
● You will encounter the symbol x̄ (pronounced “x-bar”) being used to represent
the mean of a sample from a population.

● The formula to compute the mean for a set of n values x1, x2, ..., xn is:

Mean = x̄ = (x1 + x2 + ... + xn) / n = (∑ xi) / n

N (or n) refers to the total number of records or observations. In statistics it is
capitalized if it is referring to a population, and lowercase if it refers to a sample
from a population. In data science, that distinction is not vital, so you may see it
both ways.
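A quick sketch checking the earlier example in Python (NumPy is assumed to be available, as in the later slides):

import numpy as np

data = [3, 5, 1, 2]           # the example set from the previous slide

print(np.mean(data))          # 2.75
print(sum(data) / len(data))  # 2.75, computed directly from the formula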
Median
● The median is the middle number on a sorted list of the data.
● If there is an even number of data values, the middle value is one that is not actually
in the data set, but rather the average of the two values that divide the sorted data
into upper and lower halves.

● For example: the median of 1, 4, 9, 6, 7 is 6.


● What is the median of these numbers: 1, 4, 9, 11, 15, 17, 6, 7?
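A minimal sketch using NumPy, which sorts the data internally and averages the two middle values when the list has an even length; the exercise above can be checked the same way:

import numpy as np

data = [1, 4, 9, 6, 7]

print(np.median(data))  # 6.0, matching the example above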
Mode

● The mode is the most frequent number, the number that occurs the highest
number of times.

● For example: the mode of 1, 4, 9, 6, 8, 9, 9, 6, 7 is 9.
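A minimal sketch using the mode function from Python's standard-library statistics module:

import statistics

data = [1, 4, 9, 6, 8, 9, 9, 6, 7]

print(statistics.mode(data))  # 9, which appears three times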


Estimates of Variability:

● Estimates of variability: location is just one dimension in summarizing a
feature. A second dimension, variability, also referred to as dispersion,
measures whether the data values are tightly clustered or spread out.

● At the heart of statistics lies variability: measuring it, reducing it, distinguishing
random from real variability, identifying the various sources of real variability,
and making decisions in the presence of it.
Variance

● Variance indicates the spread of the data. It is often represented by Var(X) or by
the symbol sigma squared: σ^2.
● The variance of a random variable X is given by Var(X) = σ^2 = E[(X − μ)^2],
the average of the squared differences from the mean.
● It describes how much a random variable differs from its expected value.
Step 1 to Calculate the Variance: Find the Mean

1. Find the mean:

(80+85+90+95+100+105+110+115+120+125) / 10 = 102.5

The mean is 102.5


Variance
Step 2: For Each Value - Find the Difference From the Mean

2. Find the difference from the mean for each value:

80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5
Variance
Step 3: For Each Difference - Find the Square Value

3. Find the square value for each difference:

(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
2.5^2 = 6.25
7.5^2 = 56.25
12.5^2 = 156.25
17.5^2 = 306.25
22.5^2 = 506.25

Note: We square the differences so that negative and positive deviations do not cancel out when measuring the total spread.


Variance
Step 4: The Variance is the Average of These Squared Values

4. Sum the squared values and find the average:

(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25

The variance is 206.25.


Variance in Python
import numpy as np

# The ten values from the worked example above
data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# np.var computes the population variance (divides by n) by default
var = np.var(data)

print(var)  # 206.25
Standard deviation (std)
● Standard deviation is a measure of the dispersion of a set of data from its mean.
● A low standard deviation means that most of the numbers are close to the mean (average) value.
● A high standard deviation means that the values are spread out over a wider range.
● The standard deviation is the square root of the variance.
Standard deviation in Python
import numpy as np

# The same ten values used in the variance example
data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

std = np.std(data)

print(std)  # about 14.36, the square root of the variance 206.25
