
KMEC/III SEM/CSM/IDS T.

GAYATHRI

UNIT-2
Syllabus
Techniques: Introduction: Data Analysis and Data Analytics, Descriptive Analysis:
Variables, Frequency Distribution, Measures of Centrality, Dispersion of a Distribution.

Diagnostic Analytics: Correlations, Predictive Analytics, Prescriptive Analytics, Exploratory


Analysis, Mechanistic Analysis: Regression.

Statistics: Understanding attributes and their types, Types of attributes, Discrete and
continuous attributes, measuring central tendency: Mean, Mode, Median, measuring
dispersion, Skewness and kurtosis, understanding relationships using covariance and
correlation coefficients: Pearson's correlation coefficient, Spearman's rank correlation
coefficient, collecting samples, Performing parametric tests.

Descriptive Statistics: Understanding statistics, distribution functions uniform distribution,


normal distribution, exponential distribution, binomial distribution. Cumulative distribution
function, descriptive statistics.

Data Analysis and Data Analytics


• While the terms "data analysis" and "data analytics" are often used interchangeably,
there are subtle but important differences between them.

• Understanding these distinctions can help data practitioners apply the right techniques
for the right purposes.

1. Data Analysis:

Refers to hands-on exploration and evaluation of data.

• It involves looking at historical data to understand what has happened in the past.

• Data analysis is focused on examining and summarizing data to extract insights, often
using descriptive techniques.

2. Data Analytics:

• A broader term that includes data analysis as a subcomponent.

• Analytics is the science behind data analysis and focuses on the methodologies used
to extract meaningful insights from data.

• It involves using mathematics, statistics, and predictive models to not only


understand the past (like data analysis) but also forecast future outcomes and guide
decision-making.

• In essence, data analysis looks backward, providing historical insights, while data
analytics looks forward, often predicting future outcomes or recommending actions
based on data insights.


Types of Data Analysis and Analytics:

1. Descriptive Analysis: Summarizes historical data to describe what has happened.

2. Diagnostic Analytics: Investigates data to explain why something happened.

3. Predictive Analytics: Uses historical data and models to predict future outcomes.

4. Prescriptive Analytics: Recommends actions to optimize outcomes.

5. Exploratory Analysis: Explores data to uncover patterns or relationships without


predefined hypotheses.

6. Mechanistic Analysis: Seeks to understand the underlying mechanisms driving data


patterns or phenomena.

Descriptive Analysis
• Descriptive analysis is about: “What is happening now based on incoming data.”

• It is a method for quantitatively describing the main features of a collection of


data.

• Typically, it is the first kind of data analysis performed on a dataset.

• Usually it is applied to large volumes of data, such as census data.

• Descriptive analysis is useful for organizing and summarizing data to reveal


patterns and trends.

• For example, in sales, it helps categorize customers by product preferences and


purchasing patterns.

• Descriptive measures are critical for summarizing data and providing insights into the data’s structure.

• Without summarization, data cannot be meaningfully analyzed.

• It involves summary statistics like the mean, median, and mode.

Numerical Representation:

• Numbers provide more precision and differentiation than words.

• For instance, numbers can distinguish between two "small" rooms by specifying their
exact size, allowing for clearer comparisons and more accurate conclusions.

Advantages of Numbers Over Words:

• Numbers not only quantify but also provide specific details, such as counts (e.g., 2500
people), ranks (e.g., third most populated city).

• This level of specificity makes data more useful for analysis and decision-making.

Variables

• A variable is a label or name given to data in order to categorize and represent it. It
helps organize and identify different types of data in a dataset.

Types of Variables:

• Numeric Variables: These represent quantities and can be measured on a scale (e.g.,
age, height).

• Categorical Variables: These represent categories or groups (e.g., yes/no, gender).

Levels of Measurement

• Numeric variables can be classified based on their measurement properties, which


dictate the types of mathematical operations that can be meaningfully applied.

• Categorical Variables: These represent distinct categories without any inherent


order. They are useful for counting and classifying data, but you cannot perform
arithmetic operations on them.

– Nominal Variables: These are categories without a specific order. Numbers


may be used to label categories, but they have no quantitative meaning.

• Example: Gender (male, female), animal species (mammals, reptiles).



• Ordinal Variables: These represent categories with a defined order or


ranking, but the differences between ranks are not necessarily equal or
meaningful.

• Example: Customer satisfaction rankings (1st, 2nd, 3rd) — while you


can rank people, the difference between "1st" and "2nd" is not
necessarily the same as between "2nd" and "3rd."

Numerical Variable:

• Interval Variables: These represent numerical data where the differences between
values are meaningful and consistent, but there is no true zero point, meaning ratios
are not meaningful.

– Example: Temperature in Fahrenheit or Celsius — a 10-degree difference is


meaningful, but 100°F is not "twice as hot" as 50°F because the zero point is
arbitrary.

• Ratio Variables: These are similar to interval variables but with a true zero point,
which makes both differences and ratios meaningful. These variables allow for all
arithmetic operations, including multiplication and division.

– Example: Length (meters, feet), weight (kilograms, pounds) — 2 meters is


exactly twice as long as 1 meter, and 0 meters means no length.

Independent and Dependent Variables

 When working with multiple variables, particularly in predictive modeling, variables


can be classified based on their relationship to one another:

 Independent Variables: These are variables that are manipulated or controlled in an


analysis. They are believed to influence or predict the value of another variable. In
predictive modeling, they are called predictor variables.
Example: "Tumor size" could be an independent variable in a model predicting the
likelihood of having cancer.

 Dependent Variables: These are the variables that depend on or are affected by the
independent variables. In predictive modeling, they are referred to as outcome
variables.
Example: "Cancer" status (yes/no) could be the dependent variable, predicted based
on tumor size.

Example Use Case:

Tumor Size and Cancer Prediction:

• In a dataset with the variables "tumor size" (a ratio variable) and "cancer" (a
categorical variable with values "yes" or "no"), "tumor size" would be the

independent (predictor) variable, and "cancer" would be the dependent (outcome)


variable.

• This setup would allow you to predict whether a person has cancer based on their
tumor size.

Frequency Distribution
• This is a summary of how often each value in a dataset occurs. It helps in
understanding the distribution pattern of data.

Histogram: A type of graph that displays the frequency distribution of data. Each bar on a
histogram represents how many times a value (or a range of values) appears in the data. The
horizontal axis shows the values or intervals, and the height of each bar represents the
frequency.

Pie Chart:

Pie charts are used for visualizing categorical data to display proportions and distributions.
Although they are widely recognized and intuitive, pie charts have limitations and are best
suited for specific scenarios.
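As a sketch of how a frequency distribution is computed in practice (assuming pandas is available; the grade data here is hypothetical), value_counts tallies how often each value occurs — the same counts a histogram or pie chart would display:

```python
import pandas as pd

# Hypothetical categorical data: letter grades of ten students
grades = pd.Series(['A', 'B', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'D'])

# Frequency distribution: how often each value occurs
freq = grades.value_counts()
print(freq)

# Relative frequencies (proportions), as a pie chart would display them
print(grades.value_counts(normalize=True))
```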

Normal Distribution

• Ideally, data follows a normal (bell-shaped) curve, with scores symmetrically


distributed around a central point. A vertical line through the center would split the
data evenly.

• Types of Deviations from Normality:

– Skew: This refers to a lack of symmetry in the distribution.

• Positive Skew: The tail is longer on the right side

• Negative Skew: The tail is longer on the left side

– Kurtosis: This refers to the "pointiness" or clustering of scores at the ends.

• Platykurtic: Distributions are flat with less clustering at the ends.

• Leptokurtic: Distributions are pointy with more clustering at the ends.

import pandas as pd

# Create dataframe
sample_data = {'name': ['John', 'Alia', 'Ananya', 'Steve', 'Ben'],
               'gender': ['M', 'F', 'F', 'M', 'M'],
               'communication_skill_score': [40, 45, 23, 39, 39],
               'quantitative_skill_score': [38, 41, 42, 48, 32]}

data = pd.DataFrame(sample_data)

# Skewness and kurtosis of the communication skill scores
print(data['communication_skill_score'].skew())
print(data['communication_skill_score'].kurtosis())

Measures of Centrality
• A single value that represents the "center" of a data distribution. The three common
measures of central tendency are:

• Mean: The average of all values, calculated by summing them and dividing by the
total number of observations. It is sensitive to extreme values (outliers).

• If there are n values in a dataset, x1, x2, . . ., xn, then the mean is calculated as:

mean = (x1 + x2 + . . . + xn) / n

• Median: The middle value when all observations are ordered. For datasets with an
odd number of values, it’s the exact middle; for even numbers, it’s the average of the
two central values. The median is less affected by outliers, making it a robust measure
for skewed distributions.

• Mode: The most frequently occurring value(s) in a dataset. A distribution can have
one mode (unimodal), more than one mode (multimodal), or no mode if all values
occur with the same frequency.
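A minimal sketch of the three measures, assuming pandas and an illustrative set of scores:

```python
import pandas as pd

scores = pd.Series([12, 15, 15, 18, 20, 22, 25])

print("Mean:", scores.mean())            # sum of values / count
print("Median:", scores.median())        # middle value of the ordered data
print("Mode:", scores.mode().tolist())   # most frequent value(s)
```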

Dispersion of a Distribution
• When analyzing the shape of a dataset, measures of centrality (mean, median, mode)
alone may not provide a complete picture.

• To fully understand the distribution, measures of dispersion (how spread out the data
is) are used.

1. Range

The difference between the largest and smallest values in the dataset.

Example: For the "Productivity" dataset, if the highest value is 25 and the lowest is 12:

• Range = 25 - 12 = 13

Limitation: The range is sensitive to outliers, as it considers only the extremes.



2. Interquartile Range (IQR)

The range of the middle 50% of the data, calculated by removing the lowest 25% (first
quartile, Q1) and highest 25% (third quartile, Q3) of values.

3. Variance

Variance is a statistical measure that quantifies how much the data points in a dataset
deviate from the mean.

 It captures the spread or dispersion of the data.

 A larger variance indicates greater spread, while a smaller variance indicates data
points are closer to the mean.

Population vs. Sample Variance

• Population Variance :Used when analyzing the entire dataset or population.

• Sample Variance : Used when analyzing a subset of data (sample), accounting for
sampling bias.

4. Standard Deviation

The standard deviation is the square root of the variance, giving an average spread of data
points relative to the mean in the original units.
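The four dispersion measures above can be sketched with NumPy (the productivity values are illustrative, chosen so the range matches the example above):

```python
import numpy as np

productivity = np.array([12, 15, 15, 18, 20, 22, 25])

# Range: max minus min
print("Range:", productivity.max() - productivity.min())

# Interquartile range: Q3 minus Q1
q1, q3 = np.percentile(productivity, [25, 75])
print("IQR:", q3 - q1)

# Population variance (ddof=0) vs. sample variance (ddof=1)
print("Population variance:", productivity.var(ddof=0))
print("Sample variance:", productivity.var(ddof=1))

# Standard deviation: square root of the variance
print("Sample std:", productivity.std(ddof=1))
```

The sample variance divides by n - 1 rather than n, which is what "accounting for sampling bias" refers to above.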

Diagnostic Analytics
• Diagnostic analytics focuses on understanding why something happened by
examining past performance and identifying causal relationships between variables.

• For example, in social media marketing, descriptive analytics can measure campaign
metrics like posts, mentions, followers, and page views.

• Diagnostic analytics distills this data into insights about what strategies succeeded or
failed.

• A common technique in diagnostic analytics is correlation analysis, which helps


identify relationships between variables to uncover patterns and causal factors.

Correlations
• Correlation measures the strength and direction of the relationship between two
variables. Strength reflects how closely the variables are related, while direction
indicates whether one variable increases or decreases as the other changes.

• For instance, "rain" and "umbrella" show a strong, positive correlation: umbrellas are
used when it rains and not on dry days.

• A commonly used statistic for measuring linear relationships is Pearson’s r


correlation, which quantifies the degree of linear association between two variables.

• It is widely applied, such as in the stock market, to determine how two commodities
are related.

• Pearson's r is calculated using a specific formula to assess this relationship


numerically.

• The correlation coefficient can never be less than -1 or higher than 1.

• 1 = There is a perfect positive linear relationship between the variables (like Average_Pulse
against Calorie_Burnage)

• 0 = There is no linear relationship between the variables

• -1 = There is a perfect negative linear relationship between the variables (e.g., fewer
hours of training leading to higher calorie burnage during a session)

Correlation

data.corr(method='pearson')

• The 'method' parameter can take one of the following three parameters:

• pearson: Standard correlation coefficient

• kendall: Kendall's tau correlation coefficient

• spearman: Spearman's rank correlation coefficient.

Spearman's rank correlation coefficient

• Spearman's rank correlation coefficient is a non-parametric measure that evaluates


the strength of association between two ranked variables.

Key points:

• It calculates Pearson's correlation on the ranks of observations.

• Applicable to both continuous and discrete ordinal variables.

• Suitable when data is skewed or contains outliers, as it does not assume a specific
data distribution.
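A short sketch, assuming scipy is available and using hypothetical scores, showing how Spearman's coefficient captures a monotonic but non-linear relationship that Pearson's r understates:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

# Hypothetical data with a monotonic but non-linear relationship
hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score = np.array([10, 12, 20, 45, 80, 150])

rho, p = spearmanr(hours_studied, exam_score)
print("Spearman rho:", rho)   # 1.0: the ranks agree perfectly

r, _ = pearsonr(hours_studied, exam_score)
print("Pearson r:", r)        # below 1: the relationship is not linear
```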


Predictive Analytics
 Predictive analytics focuses on forecasting future outcomes by analyzing past data,
trends, and emerging contexts.

 For example, it can predict consumer spending behaviors based on historical data and
new factors like changes in tax policy.

 While predictive analytics provides actionable insights, it relies on probabilities and


cannot guarantee 100% accuracy.

This process involves several stages:

 Data Cleaning: Preparing and organizing data for analysis.

 Hindsight: Identifying relationships between variables through visualization, like


scatterplots.

 Insight: Confirming patterns in the data using regression analysis to understand


variable distributions.

 Foresight: Using identified patterns to predict future outcomes.

Applications of Predictive analytics

• Sales Forecasting: By analyzing the expenditures on advertising campaigns,


predictive analytics can help predict sales, as seen with Salesforce's use of data to
forecast results based on ad spend in different media.

• Credit Scoring: Financial services use predictive analytics to assess the likelihood of
customers making timely credit payments. FICO, for example, uses this to calculate
individual credit scores.

• CRM: In marketing, sales, and customer service, predictive analytics helps optimize
campaigns and improve customer engagement by forecasting behaviors and
outcomes.

• Healthcare: Predictive analytics is used to identify patients at risk for conditions like
diabetes, asthma, and other chronic illnesses, enabling proactive care.

Prescriptive Analytics
• Prescriptive analytics focuses on recommending the best course of action for a given
situation.

• It builds on descriptive and predictive analytics by analyzing variables, their


relationships, and potential decisions to prescribe optimal solutions in real time.

• This approach helps organizations take advantage of opportunities or mitigate risks,


offering actionable options and illustrating their implications.

• Techniques: Includes optimization, simulation, game theory, and decision analysis.



Applications:

• In healthcare, it identifies high-priority patients (e.g., clinically obese individuals


with specific risk factors like diabetes or high LDL cholesterol) to focus treatment
effectively.

• Continuously processes new data to refine predictions and improve decision-making


accuracy.

Exploratory Analysis
• Exploratory Data Analysis (EDA) is a methodology for analyzing datasets to uncover
hidden patterns, relationships, or insights, especially when there's no clear question or
hypothesis at the outset.

• Rather than applying predefined models, EDA focuses on letting the data itself reveal
its structure through techniques like data visualization.

Purpose:

• To identify patterns, trends, or outliers.

• To generate hypotheses for future analysis.

Methods:

• Visualizations like scatterplots, bar charts, or dimensional reduction.

• Sampling subsets of data or variables to manage complexity.

Applications:

• Commonly used for uncovering relationships in large datasets, such as identifying


groups or trends.

Mechanistic Analysis
• Mechanistic analysis focuses on understanding precise cause-and-effect
relationships between variables for individual cases.

• It aims to determine how changes in one variable directly influence another.

Examples include:

Employee Productivity: Studying how giving employees more benefits impacts their
productivity, where one benefit might boost productivity by 5%, but excess could lead to
negative outcomes.

Climate Change: Analyzing how increased atmospheric CO₂ levels contribute to rising
global temperatures. Over the past 150 years, CO₂ levels increased from 280 to 400 ppm,
corresponding to a 1.53°F (0.85°C) rise in Earth's temperature, a clear sign of climate change.

• Mechanistic analysis often employs techniques like regression to explore and quantify
relationships between variables, enabling deeper insights into how changes in one
factor directly influence another.

1.Regression

• Regression analysis is a statistical method used to estimate relationships between


variables and predict outcomes.

• Unlike correlation, which only measures the strength and direction of a relationship,
regression provides a way to predict the outcome variable (dependent variable, y)
based on one or more predictor variables (independent variables, x).

Types of Regression:

– Simple Linear Regression: Predicts y using one predictor variable x.

– Multiple Linear Regression: Uses multiple predictor variables.

Linear Relationship:

• Assumes a straight-line relationship between x and y, represented by the equation:

y = b0 + b1x

where b0 is the intercept and b1 is the slope of the line.

• Regression analysis has numerous applications in data science and statistics,


particularly in business.

It helps uncover insights and make predictions, such as:

• Consumer Behavior: Understanding factors influencing customer actions and


profitability.

• Sales and Advertising: Measuring sensitivity of sales to advertising expenditures.

• Stock Market: Analyzing how stock prices respond to changes in interest rates.

• Forecasting: Predicting future demand for products or stock performance using


regression equations.
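As a sketch of simple linear regression (the advertising figures below are made up for illustration), NumPy's polyfit estimates the intercept and slope by least squares:

```python
import numpy as np

# Hypothetical advertising spend (x) and sales (y)
ad_spend = np.array([10, 20, 30, 40, 50], dtype=float)
sales = np.array([25, 45, 62, 85, 105], dtype=float)

# Fit y = b0 + b1 * x by least squares (highest-degree coefficient first)
b1, b0 = np.polyfit(ad_spend, sales, 1)
print("Intercept b0:", b0)
print("Slope b1:", b1)

# Use the fitted line to predict sales for a new ad spend of 60
print("Predicted sales at 60:", b0 + b1 * 60)   # 4.4 + 2.0 * 60 = 124.4
```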

Covariance:
Measures how two variables change together.

• Indicates the direction of the relationship (positive or negative).

• Its value ranges from −∞ to +∞.
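A minimal sketch, assuming NumPy and illustrative data; np.cov returns the sample covariance matrix, whose off-diagonal entry is cov(x, y):

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10])
y = np.array([1, 3, 5, 7, 11])

# Sample covariance matrix; the [0, 1] entry is cov(x, y)
cov_xy = np.cov(x, y)[0, 1]
print("Covariance:", cov_xy)  # positive: x and y increase together
```

Unlike the correlation coefficient, this value is not bounded; dividing it by the product of the two standard deviations normalizes it into [-1, 1].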

Collecting samples
1. Probability Sampling (Random Selection)

• Simple Random Sampling: Every respondent has an equal chance of being selected
(e.g., selecting 20 products randomly from 500).

• Stratified Sampling: The population is divided into strata (groups) based on shared
characteristics. This technique reduces selection bias.

• Systematic Sampling: Respondents are selected at regular intervals (e.g., every nth
person from the population).

• Cluster Sampling: The population is divided into clusters (e.g., by gender, location),
and entire clusters are sampled.

2.Non-Probability Sampling (Non-Random Selection)

• Convenience Sampling: Respondents are selected based on availability and


willingness. It's cheap and quick but may be biased.

• Purposive Sampling: Also called judgmental sampling, it involves selecting


participants based on the researcher's judgment and predefined characteristics.

• Quota Sampling: Respondents are selected to meet predefined characteristics, and


the selection continues until the quota is reached, though it's not random.

• Snowball Sampling: Used for hard-to-reach populations (e.g., illegal immigrants or


HIV-positive individuals). Initial participants refer others who fit the sample criteria.

Performing parametric tests


Parametric tests are statistical methods that rely on assumptions about the population
distribution. These tests are more powerful and reliable than non-parametric tests and are
primarily used for quantitative and continuous data. Examples include t-tests and ANOVA,
which help in hypothesis testing based on population parameters.

T-Test

A t-test is used to assess whether there is a significant difference between the means of
groups or datasets. It assumes the data follows a normal distribution and has three main types:

1. One-Sample T-Test

 Purpose: Compares the mean of a single sample to a known population mean.


 Example: Testing if the average weight of 10 students is 68 kg.
 Result:
o p-value: 0.5987 (greater than 0.05 significance level).
o Conclusion: Hypothesis accepted; the mean is not significantly different from
68 kg.

2. Two-Sample (Independent) T-Test

 Purpose: Compares the means of two independent groups.


 Example: Comparing the average weight of two groups of students.
 Result:
o p-value: 0.0152 (less than 0.05 significance level).
o Conclusion: Hypothesis rejected; the group means are significantly different.

3. Paired Sample (Dependent) T-Test

 Purpose: Tests the mean difference between two observations from the same group.
 Example: Evaluating the impact of weight-loss treatment by comparing patient
weights before and after treatment.
 Result:
o p-value: 0.0137 (less than 0.05 significance level).
o Conclusion: Hypothesis rejected; weight-loss treatment had a significant
effect.
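The two-sample and paired variants can be sketched with scipy.stats as follows; the weight data here is hypothetical, so the p-values will differ from the ones quoted above:

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

# Two independent groups of student weights (hypothetical)
group1 = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])
group2 = np.array([53, 43, 31, 113, 33, 57, 27, 23, 24, 43])

# Two-sample (independent) t-test: compares the two group means
t_ind, p_ind = ttest_ind(group1, group2)
print("Independent t-test p-value:", p_ind)

# Weights of the same patients before and after treatment (hypothetical)
before = np.array([82, 91, 77, 88, 95, 84, 79, 90])
after = np.array([78, 89, 74, 84, 91, 80, 77, 86])

# Paired (dependent) t-test: compares two observations of the same group
t_rel, p_rel = ttest_rel(before, after)
print("Paired t-test p-value:", p_rel)

# In either case, p < 0.05 leads to rejecting the null hypothesis
```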

ANOVA (Analysis of Variance)

ANOVA is used to compare means across multiple groups. It evaluates the variance within
and between groups to test null hypotheses for differences in means.

1. One-Way ANOVA

 Purpose: Compares the means of multiple groups based on a single independent


variable.
 Example: Comparing employee productivity scores across three locations (Mumbai,
Chicago, London).
 Result:
o p-value: 0.2767 (greater than 0.05 significance level).
o Conclusion: Hypothesis accepted; no significant difference in mean
performance across locations.

2. Two-Way ANOVA

 Purpose: Compares group means based on two independent variables (e.g., working
hours and project complexity).

3. N-Way ANOVA

 Purpose: Compares group means based on multiple independent variables (e.g.,


working hours, training, perks).

Example code

import numpy as np
from scipy.stats import ttest_1samp

# Create data: weights of 10 students
data = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])

# Find mean
mean_value = np.mean(data)
print("Mean:", mean_value)

# One-sample t-test against the hypothesized population mean of 68
t_stat, p_value = ttest_1samp(data, popmean=68)
print("p-value:", p_value)

Output:
Mean: 70.5
p-value: 0.5987 (approximately; greater than 0.05, so the hypothesis is accepted)

Example Code

from scipy.stats import f_oneway

# Performance scores of Mumbai location
mumbai = [0.14730927, 0.59168541, 0.85677052, 0.27315387, 0.78591207,
          0.52426114, 0.05007655, 0.64405363, 0.9825853, 0.62667439]

# Performance scores of Chicago location
chicago = [0.99140754, 0.76960782, 0.51370154, 0.85041028, 0.19485391,
           0.25269917, 0.19925735, 0.80048387, 0.98381235, 0.5864963]

# Performance scores of London location
london = [0.40382226, 0.51613408, 0.39374473, 0.0689976, 0.28035865,
          0.56326686, 0.66735357, 0.06786065, 0.21013306, 0.86503358]

In the preceding code block, we have created three lists of employee performance
scores for three locations: Mumbai, Chicago, and London. Let's perform a one-way
ANOVA test, as follows:

# Compare results using one-way ANOVA
stat, p = f_oneway(mumbai, chicago, london)
print("p-values:", p)
print("ANOVA:", stat)
if p < 0.05:
    print("Hypothesis Rejected")
else:
    print("Hypothesis Accepted")

Output:
p-values: 0.27667556390705783
ANOVA: 1.3480446381965452
Hypothesis Accepted

Descriptive Statistics
Understanding statistics

• In data science, qualitative(Categorical) and quantitative(Numerical) analyses


play complementary roles.

• While qualitative analysis focuses on understanding the underlying patterns,


behaviors, and relationships in data through non-numerical insights, quantitative
analysis relies heavily on numerical and statistical methods to extract meaningful
insights.

• Statistics is a branch of mathematics that deals with collecting, organizing, and


interpreting data.

• Hence, by using statistical concepts, we can understand the nature of the data, a
summary of the dataset, and the type of distribution that the data has.

Distribution function

Continuous Function:

• A continuous function is any function that does not have any unexpected changes in
value.

• These abrupt or unexpected changes are referred to as discontinuities.

• For example, consider a cubic function such as f(x) = x³.

If you plot the graph of this function, you will see that there are no jumps or holes in the
series of values. Hence, this function is continuous.

Probability density function (PDF):

is a continuous function that describes the probability of a random variable taking a specific
value within a range.

It applies to continuous random variables and satisfies the following conditions:

 The function values must be non-negative (p(a) ≥ 0 for all a).

 The total area under the curve equals 1.

 For discrete random variables, the equivalent is the probability mass function
(PMF), which lists probabilities associated with each possible value of the variable.

 The probability distribution of a random variable (discrete or continuous)


characterizes the likelihood of outcomes and is represented by a density curve for
continuous distributions.

 Common examples include the uniform, normal (Gaussian), and exponential
distributions (continuous), and the binomial distribution (discrete).

Uniform distribution

• The probability density function of any continuous uniform distribution is given by the
following equation:

f(x) = 1 / (b - a) for a ≤ x ≤ b, and f(x) = 0 otherwise

where a and b are the lower and upper bounds, respectively, of the interval over which the
random variable X is uniformly distributed.

from scipy.stats import uniform

number = 10000

start = 20

width = 25

uniform_data = uniform.rvs(size=number, loc=start, scale=width)

• The uniform function is used to generate a uniform continuous variable between the
given start location (loc) and the width of the arguments (scale).

• The size arguments specify the number of random variates taken under consideration.

Normal distribution

• The normal distribution, or Gaussian distribution, is a symmetrical, bell-shaped


curve that represents the distribution of random variables.

Key features include:

• Symmetry: The curve is centered around its mean (μ).

• Spread: Determined by its standard deviation (σ).

It has two main parameters:

• Mean (μ): Defines the center of the distribution.

• Standard Deviation (σ): Defines the spread or width of the curve.

• The central limit theorem makes the normal distribution significant, as it states that
for a sufficiently large sample size (n), the sampling distribution of the mean
approaches a normal distribution, regardless of the population's original distribution.

• Mathematically, its probability density function is given as follows:

f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))

from scipy.stats import norm

normal_data = norm.rvs(size=90000,loc=20,scale=30)

Exponential distribution

• A process in which some events occur continuously and independently at a constant


average rate is referred to as a Poisson point process.

• The exponential distribution describes the time between events in such a Poisson
point process, and its probability density function is given as follows:

f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 for x < 0, where λ is the rate parameter.

from scipy.stats import expon

expon_data = expon.rvs(scale=1,loc=0,size=1000)

The curve is decreasing over the x axis.

Binomial distribution

• Binomial distribution, as the name suggests, has only two possible outcomes, success
or failure.

• The outcomes do not need to be equally likely and each trial is independent of the
other.

from scipy.stats import binom

binomial_data = binom.rvs(n=10, p=0.8,size=10000)

Cumulative distribution function

• Cumulative distribution function (CDF) is the probability that the variable takes a
value that is less than or equal to x.

Mathematically, it is written as follows:

F(x) = P(X ≤ x)
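In code, the CDF of the distributions introduced above can be evaluated directly with scipy.stats, using the same loc/scale convention as the earlier snippets (the evaluation points here are illustrative):

```python
from scipy.stats import norm, uniform

# CDF of the standard normal at its mean: half the probability lies below
print(norm.cdf(0, loc=0, scale=1))           # 0.5

# CDF of a uniform distribution on [20, 45] at its midpoint
print(uniform.cdf(32.5, loc=20, scale=25))   # 0.5
```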

Descriptive Statistics

• Descriptive statistics deals with the formulation of simple summaries of data so that
they can be clearly understood.

• The summaries of data may be either numerical representations or visualizations with


simple graphs for further understanding.

• Typically, such summaries help in the initial phase of statistical analysis.

There are two types of descriptive statistics:

1. Measures of central tendency

2. Measures of variability (spread)

• Measures of central tendency include mean, median, and mode.

• Measures of variability include standard deviation (or variance), the minimum and
maximum values of the variables, kurtosis, and skewness.
