Inferential Statistics
Drawing inferences and conclusions about a broader population from sample data, rather than merely describing the data at hand
Analysis of Variance (ANOVA)
ANOVA is a statistical technique that compares means across two or more groups. It tests
whether there are statistically significant differences between the groups' means. ANOVA
calculates both within-group variance (variation within each group) and between-group
variance (variation between the group means) to determine whether any observed
differences are likely due to chance or represent true differences between groups.
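As a rough illustration (the three groups of scores below are invented, and SciPy is assumed to be available), a one-way ANOVA can be run with scipy.stats.f_oneway:
from scipy import stats
# Hypothetical scores for three groups (made-up data)
group_a = [85, 90, 88, 92, 87]
group_b = [78, 82, 80, 79, 81]
group_c = [90, 94, 91, 93, 95]
# One-way ANOVA: tests whether the group means differ significantly
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. below 0.05) suggests at least one group mean differs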
Correlation Coefficient
Correlation is a statistical measure that indicates the extent to which two or more variables
fluctuate in relation to each other. The correlation coefficient is measured on a scale that varies from
+1 through 0 to -1. Complete correlation between two variables is expressed by either +1 or -1.
When one variable increases as the other increases, the correlation is positive; when one decreases as
the other increases, it is negative. Complete absence of correlation is represented by 0.
import numpy as np
# Sample data: y decreases steadily as x increases
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])
# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between x and y
correlation_matrix = np.corrcoef(x, y)
correlation = correlation_matrix[0, 1]
print(correlation)  # -1.0, a perfect negative correlation
Alternatively, the formula can be written in terms of the covariance and standard deviations of
X and Y:
r = Cov(X, Y) / (σX · σY)
Where:
Cov(X, Y) – Covariance of X and Y
σX, σY – Standard deviations of X and Y
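For illustration, the same coefficient can be computed directly from the covariance matrix; the arrays below mirror the earlier example and are made up:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])
# Sample covariance of x and y divided by the product of their
# sample standard deviations (ddof=1 matches np.cov's default)
cov_xy = np.cov(x, y)[0, 1]
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)  # -1.0, identical to np.corrcoef(x, y)[0, 1]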
Regression Coefficient
The term regression is used when you try to find the relationship between variables. In
Machine Learning and in statistical modeling, that relationship is used to predict the outcome
of events.
Least Square Method
The concept is to draw a line through the plotted data points, positioned so that it minimizes
the sum of the squared vertical distances (residuals) from the line to all of the data points.
In the accompanying figure, the red dashed lines represent these distances from the data points
to the fitted line. Linear regression by least squares relies on the following assumptions:
1. The dependent and independent variables show a linear relationship.
2. The independent variable is not random.
3. The expected value of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations (homoscedasticity).
5. The residual (error) values are not correlated across observations.
6. The residual (error) values follow the normal distribution.
Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable. The simple linear model is expressed using the
following equation:
Y = a + bX + ϵ
Where:
Y – Dependent variable
X – Independent (explanatory) variable
a – Intercept
b – Slope
ϵ – Residual (error)
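A minimal least-squares fit of this model with NumPy, using made-up observations; np.polyfit estimates the slope b and intercept a, and the residuals ϵ are whatever the line does not explain:
import numpy as np
# Hypothetical observations (made-up data)
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
# Fit Y = a + bX by ordinary least squares; polyfit returns
# coefficients from highest degree to lowest: [b, a]
b, a = np.polyfit(X, Y, deg=1)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
# Predicted values and residuals (errors)
Y_hat = a + b * X
residuals = Y - Y_hat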
Multiple linear regression analysis is essentially similar to the simple linear model, with the
exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is:
Y = a + bX1 + cX2 + dX3 + ϵ
Where:
Y – Dependent variable
X1, X2, X3 – Independent (explanatory) variables
a – Intercept
b, c, d – Slopes
ϵ – Residual (error)
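A sketch of the multiple linear model using scikit-learn (assumed to be available); the three feature columns X1, X2, X3 and the target values are invented for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression
# Hypothetical dataset: columns are X1, X2, X3 (made-up values)
X = np.array([
    [1, 4, 7],
    [2, 5, 8],
    [3, 7, 9],
    [4, 8, 11],
    [5, 10, 12],
])
Y = np.array([10, 14, 19, 23, 28])
# Fit Y = a + bX1 + cX2 + dX3 by least squares
model = LinearRegression().fit(X, Y)
print("intercept a:", model.intercept_)
print("slopes b, c, d:", model.coef_)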
Logistic Regression: Logistic regression is used when the dependent variable is binary (two
possible outcomes). It models the probability of a particular outcome occurring.
In business, logistic regression is employed for tasks like predicting customer churn (yes/no),
whether a customer will purchase a product (yes/no), or whether a loan applicant will default
on a loan (yes/no).
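A minimal churn-style sketch with scikit-learn's LogisticRegression; the single feature (months since last purchase) and the 0/1 churn labels are invented for illustration:
import numpy as np
from sklearn.linear_model import LogisticRegression
# Hypothetical feature: months since the customer's last purchase
X = np.array([[1], [2], [3], [6], [8], [10], [12], [15]])
# 1 = customer churned, 0 = customer stayed
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = LogisticRegression().fit(X, y)
# Predicted probability of churn for a customer inactive for 9 months
print(model.predict_proba([[9]])[0, 1])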
Polynomial Regression: Polynomial regression is used when the relationship between the
independent and dependent variables follows a polynomial curve and is not linear.
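As a short sketch, a degree-2 polynomial can be fitted with NumPy; the data below are invented and follow a roughly quadratic curve:
import numpy as np
# Hypothetical data with a curved (quadratic-looking) relationship
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.0, 4.9, 10.1, 17.2, 25.8, 37.1])
# Fit y = c2*x^2 + c1*x + c0 by least squares
coeffs = np.polyfit(x, y, deg=2)
print("coefficients:", coeffs)
# Evaluate the fitted polynomial at new points
print(np.polyval(coeffs, [7, 8]))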
Model development in predictive analytics is the process of creating algorithms to analyze
data and make predictions:
1. Data collection and preparation: Identify relevant data sets and remove any redundant or
noisy data
3. Training: Train the model to learn the data so it can make predictions
4. Model assessment: Test the accuracy of the model's predictions and make adjustments as
needed
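A compressed sketch of the training and assessment steps above, using scikit-learn's bundled Iris dataset so the example stays self-contained; the choice of a decision tree model is only an example:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Data collection and preparation (here: a bundled example dataset)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
# Training: the model learns from the training split
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Model assessment: test prediction accuracy on held-out data
print(accuracy_score(y_test, model.predict(X_test)))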
Predictive modeling is a data mining technique that uses historical and current data to predict
future outcomes. It's used in many industries and applications, including:
Marketing to gauge customer responses
Fraud detection
Customer segmentation
Disease diagnosis
Common predictive modeling techniques include:
Regression
Neural networks
Decision trees
Missing values
Missing values are common when working with real-world datasets. Missingness generally falls
into three categories.
Missing Completely at Random (MCAR)
Here the fact that a value is missing is entirely independent from the data. There is no
discernible pattern to this type of data missingness. This means that you cannot predict whether
the value was missing due to any specific characteristic of the observation.
Missing at Random (MAR)
These types of data are missing at random but not completely at random. The data's missingness
can be explained by other variables that were observed in the dataset.
Consider for instance that you built a smart watch that can track people's heart rates every
hour. Then you distributed the watch to a group of individuals to wear so you can collect data
for analysis.
After collecting the data, you discovered that some data were missing, which was due to
some people being reluctant to wear the wristwatch at night. As a result, we can conclude that
the missingness is explained by an observable circumstance (not wearing the watch at night),
so these data are missing at random (MAR).
Missing Not at Random (MNAR)
These are data that are not missing at random; they are also known as non-ignorable data. In
other words, the missingness of the missing data is determined by the value of the variable of
interest itself.
A common example is a survey in which students are asked how many cars they own. In this
case, some students may purposefully fail to complete the survey, resulting in missing values.
Handling missing values generally falls into two categories:
Deletion
Imputation
One of the most prevalent methods for dealing with missing data is deletion, and one of the
most commonly used deletion approaches is listwise deletion. In the listwise deletion method,
you remove a record or observation from the dataset if it contains one or more missing values.
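A minimal pandas sketch of listwise deletion on a made-up table:
import numpy as np
import pandas as pd
# Hypothetical dataset with missing values (NaN)
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50000, 62000, np.nan, 71000],
})
# Listwise deletion: drop any row that contains a missing value
df_complete = df.dropna()
print(df_complete)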
Another frequent general method for dealing with missing data is to fill in the missing values
with substituted estimates; this approach is known as imputation.
Regression imputation
The regression imputation method involves creating a model to predict the observed values of
a variable based on other variables, and then using the model to fill in the missing values of
that variable.
This technique is utilized for the MAR and MCAR categories when the features in the dataset
are dependent on one another, for example by fitting a linear regression model that predicts the
variable with missing values from the other variables.
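One way to sketch regression imputation is scikit-learn's IterativeImputer, which models each feature with missing values as a function of the other features (by default using a Bayesian ridge regression); the data below are invented:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
# Hypothetical features whose columns depend on one another
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.2],
])
# Each missing entry is predicted from the other column(s)
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))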
Simple Imputation
This method involves utilizing a numerical summary of the variable where the missing value
occurred (that is, using the feature's central tendency summary, such as the mean, median, or
mode) to fill in the missing values.
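A short sketch with scikit-learn's SimpleImputer, filling missing entries with the column mean (median or most_frequent work the same way); the data are made up:
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([
    [7.0, 2.0],
    [np.nan, 3.0],
    [4.0, np.nan],
    [8.0, 5.0],
])
# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))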
KNN Imputation
This method involves replacing missing data with the mean of the values from its nearest neighbors.
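A sketch using scikit-learn's KNNImputer on invented data; each missing entry is replaced by the mean of its two nearest rows:
import numpy as np
from sklearn.impute import KNNImputer
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
# Missing values are filled from the n_neighbors closest rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))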
What is an outlier?
In data analytics, outliers are values within a dataset that vary greatly from the others—
they’re either much larger, or significantly smaller. Outliers may indicate variabilities in a
measurement, experimental errors, or a novelty.
In a real-world example, the average height of a giraffe is about 16 feet tall. However, there
have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively.
These two giraffes would be considered outliers in comparison to the general giraffe
population.
Types of outliers
A univariate outlier is an extreme value that relates to just one variable. For
example, Sultan Kösen is currently the tallest man alive, with a height of 8ft, 2.8
inches (251cm). This case would be considered a univariate outlier as it’s an extreme
case of just one factor: height.
A multivariate outlier, by contrast, is an unusual combination of values on two or more
variables, such as a height and weight pairing that is extreme even though neither value is
extreme on its own.
Besides the distinction between univariate and multivariate outliers, outliers can be categorized
as any of the following:
Global outliers (otherwise known as point outliers) are single data points that lay
far from the rest of the data distribution.
Collective outliers are a subset of data points that are collectively very different
with respect to the entire dataset. For example, a group of customers who consistently make
purchases that are significantly larger than those of the rest of the customers would form a
collective outlier.
Outliers can arise for several reasons, including:
Sampling errors that arise from extracting or mixing data from inaccurate or varied
sources
Data processing errors that arise from data manipulation, or unintended mutations of a
dataset
Natural outliers, which occur “naturally” in the dataset, as opposed to being the result
of an error of the kinds listed above. These naturally occurring outliers are known as novelties
Visualizing data as a box plot makes it very easy to spot outliers. A box plot shows the
“box”, which indicates the interquartile range (from the lower quartile to the upper quartile,
with the line in the middle indicating the median value), and any outliers are shown outside of
the “whiskers” of the plot, which extend to the smallest and largest values that are not
considered outliers. If the box skews closer to the maximum whisker, the prominent outlier
would be the minimum value; likewise, if the box skews closer to the minimum-valued
whisker, the prominent outlier would then be the maximum value.
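The same quartile logic can be applied numerically: a common rule flags values more than 1.5 times the interquartile range beyond the quartiles. A small sketch on made-up data:
import numpy as np
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
# Interquartile range (IQR) = Q3 - Q1
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [102]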
A data analyst may use a statistical method to assist with machine learning modeling, which
can be improved by identifying, understanding, and—in some cases—removing outliers.
Density-based clustering can also flag outliers: in DBSCAN, points that do not belong to any
cluster are labelled as noise. The above illustration is of a DBSCAN cluster analysis. Points
around A are core points. Points B and C are not core points, but are density-connected via the
cluster of A (and thus belong to this cluster). Point N is noise, since it is neither a core point
nor reachable from a core point.
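A minimal DBSCAN sketch with scikit-learn; points labelled -1 are the noise points (outlier candidates). The coordinates and the eps/min_samples settings are arbitrary choices for illustration:
import numpy as np
from sklearn.cluster import DBSCAN
# Two tight clusters plus one far-away point (made-up 2D data)
X = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
    [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],
    [25.0, 25.0],   # isolated point
])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)            # -1 marks noise
print(X[labels == -1])   # the isolated point is flagged as noise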
Outliers by finding the Z-Score
Computing a z-score places any data point in relation to the mean and standard deviation of
the whole group of data points. Positive standard scores correspond to raw scores above the
mean, whereas negative standard scores correspond to raw scores below the mean. After
standardization, the scores have a mean of 0 and a standard deviation of 1.
Outliers are found from z-score calculations by observing the data points that are too far from
0 (the mean). In many cases, the “too far” threshold is ±3: anything above +3 or below -3 is
considered an outlier.
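A small z-score sketch on made-up values, using the ±3 threshold described above:
import numpy as np
data = np.array([50, 52, 49, 51, 53, 50, 48, 54,
                 52, 51, 49, 50, 53, 47, 55, 150])
# Standardize: subtract the mean, divide by the standard deviation
z_scores = (data - data.mean()) / data.std()
# Flag points whose |z| exceeds 3 as outliers
outliers = data[np.abs(z_scores) > 3]
print(outliers)  # [150]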
Another method is the Isolation Forest, which builds “isolation trees” by repeatedly
partitioning the data at random; outliers show up as the points with shorter average path
lengths than the rest of the observations, because they are easier to isolate.
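A brief Isolation Forest sketch with scikit-learn; fit_predict labels outliers as -1. The data and the contamination setting (the expected fraction of outliers) are made up:
import numpy as np
from sklearn.ensemble import IsolationForest
# Mostly similar 2D points, plus one anomaly (made-up data)
X = np.array([
    [2.0, 2.1], [2.1, 1.9], [1.9, 2.0], [2.2, 2.1],
    [2.0, 1.8], [1.8, 2.2], [2.1, 2.0], [9.0, 9.0],
])
model = IsolationForest(contamination=0.15, random_state=0)
labels = model.fit_predict(X)   # -1 = outlier, 1 = inlier
print(X[labels == -1])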
Segmentation
Segmentation analysis is a powerful data analytics technique that helps you identify and
understand different groups of customers, users, or prospects based on their characteristics,
preferences, or behaviors. By segmenting your data, you can tailor your marketing, product,
or service strategies to each group and optimize your performance and outcomes.
Criteria-based segmentation
This technique involves dividing your data into segments based on predefined criteria that are
relevant to your business goals, such as demographics, psychographics, geographic, or
behavioral variables. For example, you can segment your customers by age, income, lifestyle,
location, or purchase frequency. The main advantage of this technique is that it is easy to
implement and interpret, and it can help you target specific segments with customized
messages or offers. The main disadvantage is that it can be too simplistic or arbitrary, and it
may not capture the underlying patterns or relationships in your data.
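A tiny pandas sketch of criteria-based segmentation, bucketing made-up customers by predefined age brackets and a simple purchase-frequency rule:
import pandas as pd
# Hypothetical customer data
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [22, 35, 47, 58, 31],
    "purchases_per_year": [2, 14, 6, 1, 9],
})
# Predefined age brackets
customers["age_segment"] = pd.cut(
    customers["age"], bins=[0, 30, 50, 100],
    labels=["young", "middle", "senior"])
# Simple behavioral criterion
customers["frequent_buyer"] = customers["purchases_per_year"] >= 6
print(customers)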
Cluster analysis
This technique involves using statistical methods to group your data into segments based on
their similarities or differences in multiple dimensions, without relying on predefined criteria.
For example, you can use cluster analysis to find out how your customers are clustered based
on their preferences, needs, or behaviors across various product or service attributes. The
main advantage of this technique is that it can reveal hidden insights and patterns in your
data, and it can help you discover new or niche segments that you may not have considered
before. The main disadvantage is that it can be complex and challenging to implement and
validate, and it may require advanced data analytics skills and tools.
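A minimal k-means sketch with scikit-learn (one common clustering method); the two behavioral features and the choice of three clusters are illustrative assumptions:
import numpy as np
from sklearn.cluster import KMeans
# Hypothetical features: annual spend and visits per month (made-up)
X = np.array([
    [200, 2], [220, 3], [250, 2],
    [900, 10], [950, 12], [880, 11],
    [500, 6], [520, 5], [480, 7],
])
# Group customers into three segments by similarity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(X)
print(segments)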
Hybrid segmentation
This technique involves combining criteria-based and cluster analysis techniques to create
segments that are both meaningful and actionable. For example, you can use criteria-based
segmentation to create broad segments based on your business objectives, and then use
cluster analysis to refine or subdivide those segments based on your data characteristics. The
main advantage of this technique is that it can leverage the strengths of both techniques and
overcome their limitations, and it can help you create segments that are more accurate and
relevant to your data and goals. The main disadvantage is that it can be time-consuming and
resource-intensive, and it may require more testing and evaluation to ensure its validity and
reliability.
1. Data Ingestion
Data ingestion is the process of collecting and importing data from various sources into a
centralized repository. Automated data ingestion tools can connect to multiple data sources,
such as databases, cloud storage, APIs, and streaming platforms, to extract and load data in
real-time or batch mode.
2. Data Cleaning
Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values
in the dataset. Automated data cleaning tools use algorithms and rules to detect and correct
common data quality issues, such as:
Outliers: Detecting and handling outliers that may skew analysis results.
3. Data Transformation
Data transformation involves converting data from one format or structure to another to meet
the requirements of the analysis or target system. Automated data transformation tools can
perform a range of tasks, including:
Data Pivoting: Converting rows to columns and vice versa to restructure the
dataset.
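A small pandas sketch of pivoting (rows to columns) and its reverse, melt, on a made-up sales table:
import pandas as pd
# Hypothetical long-format sales data
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 90, 110],
})
# Pivot: one row per region, one column per quarter
wide = sales.pivot(index="region", columns="quarter", values="revenue")
print(wide)
# Melt: back from columns to rows
long = wide.reset_index().melt(id_vars="region", value_name="revenue")
print(long)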
4. Data Integration
Data integration involves combining data from multiple sources into a unified dataset.
Automated data integration tools can perform tasks such as:
Data Merging: Combining data from different sources based on common keys
or identifiers.
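A short pandas merge sketch joining two made-up tables on a shared customer_id key:
import pandas as pd
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ada", "Ben", "Cara"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [250, 40, 95],
})
# Combine the two sources on the common key
merged = pd.merge(customers, orders, on="customer_id", how="inner")
print(merged)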
5. Data Enrichment
Data enrichment involves enhancing the dataset with additional information or context to
improve analysis and decision-making. Automated data enrichment tools can:
Text Enrichment: Extract and add relevant information from unstructured text
data using natural language processing techniques.