IDS Unit-2
GAYATHRI
UNIT-2
Syllabus
Techniques: Introduction: Data Analysis and Data Analytics, Descriptive Analysis:
Variables, Frequency Distribution, Measures of Centrality, Dispersion of a Distribution.
Statistics: Understanding attributes and their types, Types of attributes, Discrete and
continuous attributes, measuring central tendency: Mean, Mode, Median, measuring
dispersion, Skewness and kurtosis, understanding relationships using covariance and
correlation coefficients: Pearson's correlation coefficient, Spearman's rank correlation
coefficient, collecting samples, Performing parametric tests.
• Understanding these distinctions can help data practitioners apply the right techniques
for the right purposes.
1. Data Analysis:
• It involves looking at historical data to understand what has happened in the past.
• Data analysis is focused on examining and summarizing data to extract insights, often
using descriptive techniques.
2. Data Analytics:
• Analytics is the science behind data analysis and focuses on the methodologies used
to extract meaningful insights from data.
• In essence, data analysis looks backward, providing historical insights, while data
analytics looks forward, often predicting future outcomes or recommending actions
based on data insights.
3. Predictive Analytics: Uses historical data and models to predict future outcomes.
Descriptive Analysis
• Descriptive analysis answers: "What is happening now, based on incoming data?"
• Descriptive techniques are critical for summarizing data and providing insights into the data's structure.
Numerical Representation:
• For instance, numbers can distinguish between two "small" rooms by specifying their
exact size, allowing for clearer comparisons and more accurate conclusions.
• Numbers not only quantify but also provide specific details, such as counts (e.g., 2500
people), ranks (e.g., third most populated city).
• This level of specificity makes data more useful for analysis and decision-making.
Variables
• A variable is a label or name given to data in order to categorize and represent it. It
helps organize and identify different types of data in a dataset.
Types of Variables:
• Numeric Variables: These represent quantities and can be measured on a scale (e.g.,
age, height).
Numeric variables are further divided into:
• Interval Variables: These represent numerical data where the differences between
values are meaningful and consistent, but there is no true zero point, meaning ratios
are not meaningful.
• Ratio Variables: These are similar to interval variables but with a true zero point,
which makes both differences and ratios meaningful. These variables allow for all
arithmetic operations, including multiplication and division.
Dependent Variables: These are the variables that depend on or are affected by the
independent variables. In predictive modeling, they are referred to as outcome
variables.
Example: "Cancer" status (yes/no) could be the dependent variable, predicted based
on tumor size.
• In a dataset with the variables "tumor size" (a ratio variable) and "cancer" (a
categorical variable with values "yes" or "no"), "tumor size" would be the
independent variable and "cancer" the dependent variable.
KMEC/III SEM/CSM/IDS T.GAYATHRI
• This setup would allow you to predict whether a person has cancer based on their
tumor size.
Frequency Distribution
• This is a summary of how often each value in a dataset occurs. It helps in
understanding the distribution pattern of data.
Histogram: A type of graph that displays the frequency distribution of data. Each bar on a
histogram represents how many times a value (or a range of values) appears in the data. The
horizontal axis shows the values or intervals, and the height of each bar represents the
frequency.
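As an illustration, a frequency distribution like the one a histogram displays can be computed with NumPy's histogram function; the exam scores below are hypothetical, chosen only to show the binning:

```python
import numpy as np

# Hypothetical exam scores (illustrative data only)
scores = [55, 62, 67, 71, 71, 74, 78, 81, 84, 84, 84, 90, 95]

# np.histogram counts how many values fall into each interval (bin)
counts, bin_edges = np.histogram(scores, bins=4)

# Print each interval and its frequency (the height of the histogram bar)
for count, left, right in zip(counts, bin_edges, bin_edges[1:]):
    print(f"[{left:.0f}, {right:.0f}): {count}")
```

Each printed interval corresponds to one bar of the histogram: the horizontal axis is the interval, the count is the bar's height.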
Pie Chart:
Pie charts are used for visualizing categorical data to display proportions and distributions.
Although they are widely recognized and intuitive, pie charts have limitations and are best
suited for specific scenarios.
Skewness and Kurtosis
import pandas as pd
# Hypothetical sample data (assumed; the original does not define sample_data)
sample_data = {'communication_skill_score': [34, 56, 78, 65, 45, 67, 88, 54]}
# Create dataframe
data = pd.DataFrame(sample_data)
# Skewness: asymmetry of the distribution around its mean
print(data['communication_skill_score'].skew())
# Kurtosis: heaviness of the distribution's tails
print(data['communication_skill_score'].kurtosis())
Measures of Centrality
• A single value that represents the "center" of a data distribution. The three common
measures of central tendency are:
• Mean: The average of all values, calculated by summing them and dividing by the
total number of observations. It is sensitive to extreme values (outliers).
• If there are n values in a dataset, x1, x2, ..., xn, then the mean is calculated as
mean = (x1 + x2 + ... + xn) / n
• Median: The middle value when all observations are ordered. For datasets with an
odd number of values, it’s the exact middle; for even numbers, it’s the average of the
two central values. The median is less affected by outliers, making it a robust measure
for skewed distributions.
• Mode: The most frequently occurring value(s) in a dataset. A distribution can have
one mode (unimodal), more than one mode (multimodal), or no mode if all values
occur with the same frequency.
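The three measures of centrality above can be computed directly with Python's standard statistics module; the dataset below is illustrative:

```python
import statistics

# Illustrative dataset
data = [12, 15, 15, 18, 20, 22, 25]

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # middle value when sorted
mode_value = statistics.mode(data)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

Note that the mean (about 18.14) differs from the median (18) and the mode (15), and only the mean would shift noticeably if an extreme outlier were added.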
Dispersion of a Distribution
• When analyzing the shape of a dataset, measures of centrality (mean, median, mode)
alone may not provide a complete picture.
• To fully understand the distribution, measures of dispersion (how spread out the data
is) are used.
1. Range
The difference between the largest and smallest values in the dataset.
Example: For the "Productivity" dataset, if the highest value is 25 and the lowest is 12:
• Range = 25 - 12 = 13
2. Interquartile Range (IQR)
The range of the middle 50% of the data: IQR = Q3 - Q1, where Q1 (the first quartile) cuts
off the lowest 25% of values and Q3 (the third quartile) cuts off the highest 25%.
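The range and the interquartile range can be computed as follows; the productivity values are illustrative, chosen so the range matches the example above:

```python
import numpy as np

# Illustrative "Productivity" dataset
productivity = [12, 14, 15, 17, 18, 20, 21, 23, 25]

# Range: largest value minus smallest value
data_range = max(productivity) - min(productivity)

# IQR: spread of the middle 50% of the data (Q3 - Q1)
q1, q3 = np.percentile(productivity, [25, 75])
iqr = q3 - q1

print(data_range, iqr)
```

Here the range is 25 - 12 = 13, while the IQR ignores the extreme quarters of the data on each side, making it robust to outliers.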
3.Variance is a statistical measure that quantifies how much the data points in a dataset
deviate from the mean.
A larger variance indicates greater spread, while a smaller variance indicates data
points are closer to the mean.
• Sample Variance: Used when analyzing a subset of data (a sample); it divides the sum of
squared deviations by (n - 1) rather than n to correct for sampling bias.
4. Standard deviation :
The standard deviation is the square root of the variance, giving an average spread of data
points relative to the mean in the original units.
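The sample/population distinction and the variance-to-standard-deviation relationship can be shown with the standard statistics module (illustrative data):

```python
import statistics

# Illustrative dataset
data = [12, 15, 15, 18, 20, 22, 25]

# Sample variance divides by (n - 1); population variance divides by n
sample_var = statistics.variance(data)
pop_var = statistics.pvariance(data)

# Standard deviation: square root of the variance, in the original units
sample_std = statistics.stdev(data)

print(sample_var, pop_var, sample_std)
```

The sample variance is always slightly larger than the population variance for the same data, and the standard deviation squared recovers the variance.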
Diagnostic Analytics
• Diagnostic analytics focuses on understanding why something happened by
examining past performance and identifying causal relationships between variables.
• For example, in social media marketing, descriptive analytics can measure campaign
metrics like posts, mentions, followers, and page views.
• Diagnostic analytics distills this data into insights about what strategies succeeded or
failed.
Correlations
• Correlation measures the strength and direction of the relationship between two
variables. Strength reflects how closely the variables are related, while direction
indicates whether one variable increases or decreases as the other changes.
• For instance, "rain" and "umbrella" show a strong, positive correlation: umbrellas are
used when it rains and not on dry days.
• It is widely applied, such as in the stock market, to determine how two commodities
are related.
• +1 = perfect positive linear relationship; 0 = no linear relationship; -1 = perfect
negative linear relationship between the variables (e.g., fewer hours worked leading to
higher calorie burnage during a training session is a negative relationship)
Correlation
data.corr(method ='pearson')
• The 'method' parameter can take one of three values: 'pearson', 'kendall', or
'spearman'.
Key points (Spearman's rank correlation):
• Suitable when data is skewed or contains outliers, as it does not assume a specific
data distribution.
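As a sketch of the two coefficients covered in the syllabus, Pearson's and Spearman's correlations can be computed on the same (hypothetical) hours-studied vs. exam-score data:

```python
import pandas as pd

# Hypothetical hours-studied vs. exam-score data
df = pd.DataFrame({
    'hours': [1, 2, 3, 4, 5, 6],
    'score': [52, 55, 61, 68, 70, 79],
})

# Pearson: strength of the linear relationship on the raw values
pearson = df['hours'].corr(df['score'], method='pearson')

# Spearman: strength of the monotonic relationship on the ranks
# (more robust to outliers and skewed data)
spearman = df['hours'].corr(df['score'], method='spearman')

print(pearson, spearman)
```

Because the scores increase strictly with hours, Spearman's coefficient is exactly 1, while Pearson's is slightly below 1 since the increase is not perfectly linear.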
Predictive Analytics
Predictive analytics focuses on forecasting future outcomes by analyzing past data,
trends, and emerging contexts.
For example, it can predict consumer spending behaviors based on historical data and
new factors like changes in tax policy.
• Credit Scoring: Financial services use predictive analytics to assess the likelihood of
customers making timely credit payments. FICO, for example, uses this to calculate
individual credit scores.
• CRM: In marketing, sales, and customer service, predictive analytics helps optimize
campaigns and improve customer engagement by forecasting behaviors and
outcomes.
• Healthcare: Predictive analytics is used to identify patients at risk for conditions like
diabetes, asthma, and other chronic illnesses, enabling proactive care.
Prescriptive Analytics
• Prescriptive analytics focuses on recommending the best course of action for a given
situation.
Exploratory Analysis
• Exploratory Data Analysis (EDA) is a methodology for analyzing datasets to uncover
hidden patterns, relationships, or insights, especially when there's no clear question or
hypothesis at the outset.
• Rather than applying predefined models, EDA focuses on letting the data itself reveal
its structure through techniques like data visualization.
Mechanistic Analysis
• Mechanistic analysis focuses on understanding precise cause-and-effect
relationships between variables for individual cases.
Examples include:
Employee Productivity: Studying how giving employees more benefits impacts their
productivity, where one benefit might boost productivity by 5%, but excess could lead to
negative outcomes.
Climate Change: Analyzing how increased atmospheric CO₂ levels contribute to rising
global temperatures. Over the past 150 years, CO₂ levels increased from 280 to 400 ppm,
corresponding to a 1.53°F (0.85°C) rise in Earth's temperature, a clear sign of climate change.
• Mechanistic analysis often employs techniques like regression to explore and quantify
relationships between variables, enabling deeper insights into how changes in one
factor directly influence another.
1. Regression
• Unlike correlation, which only measures the strength and direction of a relationship,
regression provides a way to predict the outcome variable (dependent variable, y)
based on one or more predictor variables (independent variables, x).
Types of Regression:
Linear Regression (models a straight-line relationship between predictor and outcome):
• Stock Market: Analyzing how stock prices respond to changes in interest rates.
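A minimal linear-regression sketch, using least squares via NumPy's polyfit; the interest-rate and index values below are hypothetical:

```python
import numpy as np

# Hypothetical data: interest rate (%) vs. stock index level
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])   # predictor (independent variable)
y = np.array([110, 104, 99, 93, 88])       # outcome (dependent variable)

# Fit a line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Unlike correlation, regression lets us predict y for a new x value
y_pred = slope * 3.5 + intercept

print(slope, intercept, y_pred)
```

The negative slope quantifies how much the index falls per percentage-point rise in the rate, and the fitted line then produces a prediction for an unseen rate of 3.5%.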
Covariance:
Measures how two variables change together.
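A short sketch of covariance with NumPy, on illustrative data:

```python
import numpy as np

# Illustrative paired observations
x = np.array([2, 4, 6, 8])
y = np.array([10, 14, 15, 21])

# np.cov returns the 2x2 sample covariance matrix;
# the off-diagonal entry is cov(x, y), computed with an (n - 1) divisor
cov_matrix = np.cov(x, y)
cov_xy = cov_matrix[0, 1]

print(cov_xy)
```

A positive covariance, as here, means the two variables tend to rise together; unlike correlation, its magnitude depends on the units of x and y, which is why correlation (covariance scaled by the standard deviations) is easier to interpret.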
Collecting samples
1. Probability Sampling (Random Selection)
• Simple Random Sampling: Every respondent has an equal chance of being selected
(e.g., selecting 20 products randomly from 500).
• Stratified Sampling: The population is divided into strata (groups) based on shared
characteristics. This technique reduces selection bias.
• Systematic Sampling: Respondents are selected at regular intervals (e.g., every nth
person from the population).
• Cluster Sampling: The population is divided into clusters (e.g., by gender, location),
and entire clusters are sampled.
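Two of the probability-sampling techniques above can be sketched with the standard random module; the population of 500 "products" and the sample size of 20 mirror the example in the notes:

```python
import random

# Population of 500 product IDs (illustrative)
population = list(range(1, 501))

# Simple random sampling: every item has an equal chance of selection
random.seed(42)  # fixed seed only so the run is reproducible
simple_sample = random.sample(population, 20)

# Systematic sampling: select every nth item (n = 25 yields 20 items)
systematic_sample = population[::25]

print(len(simple_sample), len(systematic_sample))
```

Both approaches return 20 items, but the systematic sample is spread at fixed intervals across the population while the simple random sample has no such structure.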
T-Test
A t-test is used to assess whether there is a significant difference between the means of
groups or datasets. It assumes the data follows a normal distribution and has three main
types: one-sample, independent (two-sample), and paired.
1. One-Sample T-Test
Purpose: Tests whether the mean of a single sample differs from a hypothesized value;
applied to the differences between paired observations from the same group, it
evaluates a treatment effect.
Example: Evaluating the impact of weight-loss treatment by comparing patient
weights before and after treatment.
Result:
o p-value: 0.0137 (less than the 0.05 significance level).
o Conclusion: Null hypothesis rejected; the weight-loss treatment had a
significant effect.
ANOVA
ANOVA is used to compare means across multiple groups. It evaluates the variance within
and between groups to test the null hypothesis that the group means are equal.
1. One-Way ANOVA
Purpose: Compares group means based on a single independent variable.
2. Two-Way ANOVA
Purpose: Compares group means based on two independent variables (e.g., working
hours and project complexity).
3. N-Way ANOVA
Purpose: Compares group means based on more than two independent variables.
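A one-way ANOVA can be sketched with SciPy's f_oneway; the three groups of scores below are hypothetical:

```python
from scipy.stats import f_oneway

# Hypothetical scores for three groups (illustrative data)
group_a = [85, 86, 88, 75, 78, 94]
group_b = [81, 85, 88, 90, 86, 89]
group_c = [71, 72, 77, 80, 72, 70]

# One-way ANOVA: tests the null hypothesis that all group means are equal
f_stat, p_value = f_oneway(group_a, group_b, group_c)

print("F =", f_stat, "p =", p_value)
```

Here group_c's mean is visibly lower than the others, so the between-group variance dominates the within-group variance, the F statistic is large, and the p-value falls below 0.05, rejecting the null hypothesis of equal means.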
Example code
import numpy as np
from scipy.stats import ttest_1samp
# Create data
data = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])
# Find mean
mean_value = np.mean(data)
print("Mean:", mean_value)
# One-sample t-test against a hypothesized population mean (60 is an assumed value)
t_stat, p_value = ttest_1samp(data, popmean=60)
print("p-value:", p_value)
Output:
Mean: 70.5
Descriptive Statistics
Understanding statistics
• By using statistical concepts, we can understand the nature of the data, obtain a
summary of the dataset, and identify the type of distribution that the data has.
Distribution function
Continuous Function:
• A continuous function is any function that does not have any unexpected changes in
value.
If you plot the graph of such a function, you will see that there are no jumps or holes in
the series of values. Hence, the function is continuous.
The probability density function (PDF) is a continuous function that describes the
probability of a random variable taking a specific value within a range.
For discrete random variables, the equivalent is the probability mass function
(PMF), which lists probabilities associated with each possible value of the variable.
Uniform
Normal-Gaussian
Binomial
Exponential-Poisson
Uniform distribution
The probability density function is f(x) = 1 / (b - a) for a <= x <= b (and 0 otherwise), where
a and b are the lower and upper bounds, respectively, of the interval over which the random
variable X is uniformly distributed.
from scipy.stats import uniform
number = 10000
start = 20
width = 25
uniform_data = uniform.rvs(size=number, loc=start, scale=width)
• The uniform.rvs function generates uniform continuous variates starting at the given
location (loc) and spanning the given width (scale).
• The size argument specifies the number of random variates generated.
Normal distribution
• The central limit theorem makes the normal distribution significant, as it states that
for a sufficiently large sample size (n), the sampling distribution of the mean
approaches a normal distribution, regardless of the population's original distribution.
from scipy.stats import norm
normal_data = norm.rvs(size=90000, loc=20, scale=30)
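The central limit theorem stated above can be demonstrated empirically: starting from a population that is clearly not normal (uniform), the means of repeated samples still cluster normally around the population mean. The population size, sample size, and number of samples below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population far from normal: uniform on [0, 1), mean 0.5
population = rng.uniform(0, 1, size=100_000)

# Sampling distribution of the mean: average many samples of size n = 50
sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]

# The sample means concentrate near 0.5 with a much smaller spread
# than the population itself (roughly sigma / sqrt(n))
print(np.mean(sample_means), np.std(sample_means))
```

A histogram of sample_means would look bell-shaped even though the underlying population is flat, which is exactly what the theorem predicts.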
Exponential distribution
• The exponential distribution describes the time between events in such a Poisson
point process, and its probability density function is
f(x; λ) = λ e^(-λx) for x >= 0 (and 0 otherwise).
from scipy.stats import expon
expon_data = expon.rvs(scale=1, loc=0, size=1000)
Binomial distribution
• Binomial distribution, as the name suggests, has only two possible outcomes, success
or failure.
• The outcomes do not need to be equally likely and each trial is independent of the
other.
• Cumulative distribution function (CDF) is the probability that the variable takes a
value that is less than or equal to x.
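Both ideas, the probability of an exact outcome and the cumulative probability, can be sketched with SciPy's binom; the 10 trials with success probability 0.5 are an illustrative choice:

```python
from scipy.stats import binom

n, p = 10, 0.5   # 10 independent trials, success probability 0.5 each

# PMF: probability of exactly 5 successes
p_five = binom.pmf(5, n, p)

# CDF: probability of at most 5 successes (sum of PMF from 0 through 5)
p_at_most_five = binom.cdf(5, n, p)

print(p_five, p_at_most_five)
```

The PMF value is 252/1024 (about 0.246), while the CDF value is larger because it accumulates the probabilities of 0 through 5 successes.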
Descriptive Statistics
• Descriptive statistics deals with the formulation of simple summaries of data so that
they can be clearly understood.
• Measures of variability include standard deviation (or variance), the minimum and
maximum values of the variables, kurtosis, and skewness.
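Most of these summaries are available in one call via pandas' describe(); the productivity values below are illustrative:

```python
import pandas as pd

# Illustrative dataset
df = pd.DataFrame({'productivity': [12, 14, 15, 17, 18, 20, 21, 23, 25]})

# describe() reports count, mean, std, min, quartiles (25%, 50%, 75%), and max
summary = df['productivity'].describe()
print(summary)
```

This single table covers the measures of centrality (mean, median as the 50% quartile) and variability (std, min, max, quartiles) discussed in this unit.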