IDS Unit-2
GAYATHRI
UNIT-2
Syllabus
Techniques: Introduction: Data Analysis and Data Analytics, Descriptive Analysis:
Variables, Frequency Distribution, Measures of Centrality, Dispersion of a Distribution.
Statistics: Understanding attributes and their types, Types of attributes, Discrete and
continuous attributes, measuring central tendency: Mean, Mode, Median, measuring
dispersion, Skewness and kurtosis, understanding relationships using covariance and
correlation coefficients: Pearson's correlation coefficient, Spearman's rank correlation
coefficient, collecting samples, Performing parametric tests.
• Understanding these distinctions can help data practitioners apply the right techniques
for the right purposes.
1. Data Analysis:
• It involves looking at historical data to understand what has happened in the past.
• Data analysis is focused on examining and summarizing data to extract insights, often
using descriptive techniques.
2. Data Analytics:
• Analytics is the science behind data analysis and focuses on the methodologies used
to extract meaningful insights from data.
• In essence, data analysis looks backward, providing historical insights, while data
analytics looks forward, often predicting future outcomes or recommending actions
based on data insights.
3. Predictive Analytics: Uses historical data and models to predict future outcomes.
Descriptive Analysis
• Descriptive analysis answers: "What is happening now, based on incoming data?"
• Descriptive techniques are critical for summarizing data and providing insights into the data's structure.
Numerical Representation:
• For instance, numbers can distinguish between two "small" rooms by specifying their
exact size, allowing for clearer comparisons and more accurate conclusions.
• Numbers not only quantify but also provide specific details, such as counts (e.g., 2500
people), ranks (e.g., third most populated city).
• This level of specificity makes data more useful for analysis and decision-making.
Variables
• A variable is a label or name given to data in order to categorize and represent it. It
helps organize and identify different types of data in a dataset.
Types of Variables:
• Numeric Variables: These represent quantities and can be measured on a scale (e.g.,
age, height).
Numeric variables are further divided into:
• Interval Variables: These represent numerical data where the differences between
values are meaningful and consistent, but there is no true zero point, meaning ratios
are not meaningful.
• Ratio Variables: These are similar to interval variables but with a true zero point,
which makes both differences and ratios meaningful. These variables allow for all
arithmetic operations, including multiplication and division.
Dependent Variables: These are the variables that depend on or are affected by the
independent variables. In predictive modeling, they are referred to as outcome
variables.
Example: "Cancer" status (yes/no) could be the dependent variable, predicted based
on tumor size.
• In a dataset with the variables "tumor size" (a ratio variable) and "cancer" (a
categorical variable with values "yes" or "no"), "tumor size" would be the
independent variable and "cancer" the dependent variable.
KMEC/III SEM/CSM/IDS T.GAYATHRI
• This setup would allow you to predict whether a person has cancer based on their
tumor size.
Frequency Distribution
• This is a summary of how often each value in a dataset occurs. It helps in
understanding the distribution pattern of data.
Histogram: A type of graph that displays the frequency distribution of data. Each bar on a
histogram represents how many times a value (or a range of values) appears in the data. The
horizontal axis shows the values or intervals, and the height of each bar represents the
frequency.
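As an illustration, a frequency distribution like the one a histogram displays can be computed with NumPy's histogram function; the exam scores below are hypothetical, chosen only to show the binning:

```python
import numpy as np

# Hypothetical exam scores (illustrative data only)
scores = [55, 62, 67, 71, 71, 74, 78, 81, 84, 84, 84, 90, 95]

# np.histogram counts how many values fall into each interval (bin)
counts, bin_edges = np.histogram(scores, bins=4)

# Print each interval and its frequency (the height of the histogram bar)
for count, left, right in zip(counts, bin_edges, bin_edges[1:]):
    print(f"[{left:.0f}, {right:.0f}): {count}")
```

Each printed interval corresponds to one bar of the histogram: the horizontal axis is the interval, the count is the bar's height.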
Pie Chart:
Pie charts are used for visualizing categorical data to display proportions and distributions.
Although they are widely recognized and intuitive, pie charts have limitations and are best
suited for specific scenarios.
Skewness and Kurtosis
import pandas as pd
# Hypothetical sample data (assumed; the original does not define sample_data)
sample_data = {'communication_skill_score': [34, 56, 78, 65, 45, 67, 88, 54]}
# Create dataframe
data = pd.DataFrame(sample_data)
# Skewness: asymmetry of the distribution around its mean
print(data['communication_skill_score'].skew())
# Kurtosis: heaviness of the distribution's tails
print(data['communication_skill_score'].kurtosis())
Measures of Centrality
• A single value that represents the "center" of a data distribution. The three common
measures of central tendency are:
• Mean: The average of all values, calculated by summing them and dividing by the
total number of observations. It is sensitive to extreme values (outliers).
• If there are n values in a dataset, x1, x2, ..., xn, then the mean is calculated as
mean = (x1 + x2 + ... + xn) / n
• Median: The middle value when all observations are ordered. For datasets with an
odd number of values, it’s the exact middle; for even numbers, it’s the average of the
two central values. The median is less affected by outliers, making it a robust measure
for skewed distributions.
• Mode: The most frequently occurring value(s) in a dataset. A distribution can have
one mode (unimodal), more than one mode (multimodal), or no mode if all values
occur with the same frequency.
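The three measures of centrality above can be computed directly with Python's standard statistics module; the dataset below is illustrative:

```python
import statistics

# Illustrative dataset
data = [12, 15, 15, 18, 20, 22, 25]

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # middle value when sorted
mode_value = statistics.mode(data)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

Note that the mean (about 18.14) differs from the median (18) and the mode (15), and only the mean would shift noticeably if an extreme outlier were added.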
Dispersion of a Distribution
• When analyzing the shape of a dataset, measures of centrality (mean, median, mode)
alone may not provide a complete picture.
• To fully understand the distribution, measures of dispersion (how spread out the data
is) are used.
1. Range
The difference between the largest and smallest values in the dataset.
Example: For the "Productivity" dataset, if the highest value is 25 and the lowest is 12:
• Range = 25 - 12 = 13
2. Interquartile Range (IQR)
The range of the middle 50% of the data: IQR = Q3 - Q1, where Q1 (the first quartile) cuts
off the lowest 25% of values and Q3 (the third quartile) cuts off the highest 25%.
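The range and the interquartile range can be computed as follows; the productivity values are illustrative, chosen so the range matches the example above:

```python
import numpy as np

# Illustrative "Productivity" dataset
productivity = [12, 14, 15, 17, 18, 20, 21, 23, 25]

# Range: largest value minus smallest value
data_range = max(productivity) - min(productivity)

# IQR: spread of the middle 50% of the data (Q3 - Q1)
q1, q3 = np.percentile(productivity, [25, 75])
iqr = q3 - q1

print(data_range, iqr)
```

Here the range is 25 - 12 = 13, while the IQR ignores the extreme quarters of the data on each side, making it robust to outliers.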
3.Variance is a statistical measure that quantifies how much the data points in a dataset
deviate from the mean.
A larger variance indicates greater spread, while a smaller variance indicates data
points are closer to the mean.
• Sample Variance: Used when analyzing a subset of data (a sample); it divides the sum of
squared deviations by (n - 1) rather than n to correct for sampling bias.
4. Standard deviation :
The standard deviation is the square root of the variance, giving an average spread of data
points relative to the mean in the original units.
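The sample/population distinction and the variance-to-standard-deviation relationship can be shown with the standard statistics module (illustrative data):

```python
import statistics

# Illustrative dataset
data = [12, 15, 15, 18, 20, 22, 25]

# Sample variance divides by (n - 1); population variance divides by n
sample_var = statistics.variance(data)
pop_var = statistics.pvariance(data)

# Standard deviation: square root of the variance, in the original units
sample_std = statistics.stdev(data)

print(sample_var, pop_var, sample_std)
```

The sample variance is always slightly larger than the population variance for the same data, and the standard deviation squared recovers the variance.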
Diagnostic Analytics
• Diagnostic analytics focuses on understanding why something happened by
examining past performance and identifying causal relationships between variables.
• For example, in social media marketing, descriptive analytics can measure campaign
metrics like posts, mentions, followers, and page views.
• Diagnostic analytics distills this data into insights about what strategies succeeded or
failed.
Correlations
• Correlation measures the strength and direction of the relationship between two
variables. Strength reflects how closely the variables are related, while direction
indicates whether one variable increases or decreases as the other changes.
• For instance, "rain" and "umbrella" show a strong, positive correlation: umbrellas are
used when it rains and not on dry days.
• It is widely applied, such as in the stock market, to determine how two commodities
are related.
• +1 = perfect positive linear relationship; 0 = no linear relationship; -1 = perfect
negative linear relationship between the variables (e.g., fewer hours worked leading to
higher calorie burnage during a training session is a negative relationship)
Correlation
data.corr(method ='pearson')
• The 'method' parameter can take one of three values: 'pearson', 'kendall', or
'spearman'.
Key points (Spearman's rank correlation):
• Suitable when data is skewed or contains outliers, as it does not assume a specific
data distribution.
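As a sketch of the two coefficients covered in the syllabus, Pearson's and Spearman's correlations can be computed on the same (hypothetical) hours-studied vs. exam-score data:

```python
import pandas as pd

# Hypothetical hours-studied vs. exam-score data
df = pd.DataFrame({
    'hours': [1, 2, 3, 4, 5, 6],
    'score': [52, 55, 61, 68, 70, 79],
})

# Pearson: strength of the linear relationship on the raw values
pearson = df['hours'].corr(df['score'], method='pearson')

# Spearman: strength of the monotonic relationship on the ranks
# (more robust to outliers and skewed data)
spearman = df['hours'].corr(df['score'], method='spearman')

print(pearson, spearman)
```

Because the scores increase strictly with hours, Spearman's coefficient is exactly 1, while Pearson's is slightly below 1 since the increase is not perfectly linear.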
Predictive Analytics
Predictive analytics focuses on forecasting future outcomes by analyzing past data,
trends, and emerging contexts.
For example, it can predict consumer spending behaviors based on historical data and
new factors like changes in tax policy.
• Credit Scoring: Financial services use predictive analytics to assess the likelihood of
customers making timely credit payments. FICO, for example, uses this to calculate
individual credit scores.
• CRM: In marketing, sales, and customer service, predictive analytics helps optimize
campaigns and improve customer engagement by forecasting behaviors and
outcomes.
• Healthcare: Predictive analytics is used to identify patients at risk for conditions like
diabetes, asthma, and other chronic illnesses, enabling proactive care.
Prescriptive Analytics
• Prescriptive analytics focuses on recommending the best course of action for a given
situation.
Exploratory Analysis
• Exploratory Data Analysis (EDA) is a methodology for analyzing datasets to uncover
hidden patterns, relationships, or insights, especially when there's no clear question or
hypothesis at the outset.
• Rather than applying predefined models, EDA focuses on letting the data itself reveal
its structure through techniques like data visualization.
Mechanistic Analysis
• Mechanistic analysis focuses on understanding precise cause-and-effect
relationships between variables for individual cases.
Examples include:
Employee Productivity: Studying how giving employees more benefits impacts their
productivity, where one benefit might boost productivity by 5%, but excess could lead to
negative outcomes.
Climate Change: Analyzing how increased atmospheric CO₂ levels contribute to rising
global temperatures. Over the past 150 years, CO₂ levels increased from 280 to 400 ppm,
corresponding to a 1.53°F (0.85°C) rise in Earth's temperature, a clear sign of climate change.
• Mechanistic analysis often employs techniques like regression to explore and quantify
relationships between variables, enabling deeper insights into how changes in one
factor directly influence another.
1. Regression
• Unlike correlation, which only measures the strength and direction of a relationship,
regression provides a way to predict the outcome variable (dependent variable, y)
based on one or more predictor variables (independent variables, x).
Types of Regression:
Linear Regression (models a straight-line relationship between predictor and outcome):
• Stock Market: Analyzing how stock prices respond to changes in interest rates.
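A minimal linear-regression sketch, using least squares via NumPy's polyfit; the interest-rate and index values below are hypothetical:

```python
import numpy as np

# Hypothetical data: interest rate (%) vs. stock index level
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])   # predictor (independent variable)
y = np.array([110, 104, 99, 93, 88])       # outcome (dependent variable)

# Fit a line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Unlike correlation, regression lets us predict y for a new x value
y_pred = slope * 3.5 + intercept

print(slope, intercept, y_pred)
```

The negative slope quantifies how much the index falls per percentage-point rise in the rate, and the fitted line then produces a prediction for an unseen rate of 3.5%.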
Covariance:
Measures how two variables change together.
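A short sketch of covariance with NumPy, on illustrative data:

```python
import numpy as np

# Illustrative paired observations
x = np.array([2, 4, 6, 8])
y = np.array([10, 14, 15, 21])

# np.cov returns the 2x2 sample covariance matrix;
# the off-diagonal entry is cov(x, y), computed with an (n - 1) divisor
cov_matrix = np.cov(x, y)
cov_xy = cov_matrix[0, 1]

print(cov_xy)
```

A positive covariance, as here, means the two variables tend to rise together; unlike correlation, its magnitude depends on the units of x and y, which is why correlation (covariance scaled by the standard deviations) is easier to interpret.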
Collecting samples
1. Probability Sampling (Random Selection)
• Simple Random Sampling: Every respondent has an equal chance of being selected
(e.g., selecting 20 products randomly from 500).
• Stratified Sampling: The population is divided into strata (groups) based on shared
characteristics. This technique reduces selection bias.
• Systematic Sampling: Respondents are selected at regular intervals (e.g., every nth
person from the population).
• Cluster Sampling: The population is divided into clusters (e.g., by gender, location),
and entire clusters are sampled.
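Two of the probability-sampling techniques above can be sketched with the standard random module; the population of 500 "products" and the sample size of 20 mirror the example in the notes:

```python
import random

# Population of 500 product IDs (illustrative)
population = list(range(1, 501))

# Simple random sampling: every item has an equal chance of selection
random.seed(42)  # fixed seed only so the run is reproducible
simple_sample = random.sample(population, 20)

# Systematic sampling: select every nth item (n = 25 yields 20 items)
systematic_sample = population[::25]

print(len(simple_sample), len(systematic_sample))
```

Both approaches return 20 items, but the systematic sample is spread at fixed intervals across the population while the simple random sample has no such structure.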
T-Test
A t-test is used to assess whether there is a significant difference between the means of
groups or datasets. It assumes the data follows a normal distribution and has three main
types: one-sample, independent (two-sample), and paired.
1. One-Sample T-Test
Purpose: Tests whether the mean of a single sample differs from a hypothesized value;
applied to the differences between paired observations from the same group, it
evaluates a treatment effect.
Example: Evaluating the impact of weight-loss treatment by comparing patient
weights before and after treatment.
Result:
o p-value: 0.0137 (less than the 0.05 significance level).
o Conclusion: Null hypothesis rejected; the weight-loss treatment had a
significant effect.
ANOVA
ANOVA is used to compare means across multiple groups. It evaluates the variance within
and between groups to test the null hypothesis that the group means are equal.
1. One-Way ANOVA
Purpose: Compares group means based on a single independent variable.
2. Two-Way ANOVA
Purpose: Compares group means based on two independent variables (e.g., working
hours and project complexity).
3. N-Way ANOVA
Purpose: Compares group means based on more than two independent variables.
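A one-way ANOVA can be sketched with SciPy's f_oneway; the three groups of scores below are hypothetical:

```python
from scipy.stats import f_oneway

# Hypothetical scores for three groups (illustrative data)
group_a = [85, 86, 88, 75, 78, 94]
group_b = [81, 85, 88, 90, 86, 89]
group_c = [71, 72, 77, 80, 72, 70]

# One-way ANOVA: tests the null hypothesis that all group means are equal
f_stat, p_value = f_oneway(group_a, group_b, group_c)

print("F =", f_stat, "p =", p_value)
```

Here group_c's mean is visibly lower than the others, so the between-group variance dominates the within-group variance, the F statistic is large, and the p-value falls below 0.05, rejecting the null hypothesis of equal means.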
Example code
import numpy as np
from scipy.stats import ttest_1samp
# Create data
data = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])
# Find mean
mean_value = np.mean(data)
print("Mean:", mean_value)
# One-sample t-test against a hypothesized population mean (60 is an assumed value)
t_stat, p_value = ttest_1samp(data, popmean=60)
print("p-value:", p_value)
Output:
Mean: 70.5
Descriptive Statistics
Understanding statistics
• By using statistical concepts, we can understand the nature of the data, obtain a
summary of the dataset, and identify the type of distribution that the data has.
Distribution function
Continuous Function:
• A continuous function is any function that does not have any unexpected changes in
value.
If you plot the graph of such a function, you will see that there are no jumps or holes in
the series of values. Hence, the function is continuous.
The probability density function (PDF) is a continuous function that describes the
probability of a random variable taking a specific value within a range.
For discrete random variables, the equivalent is the probability mass function
(PMF), which lists probabilities associated with each possible value of the variable.
Uniform
Normal-Gaussian
Binomial
Exponential-Poisson
Uniform distribution
The probability density function is f(x) = 1 / (b - a) for a <= x <= b (and 0 otherwise), where
a and b are the lower and upper bounds, respectively, of the interval over which the random
variable X is uniformly distributed.
from scipy.stats import uniform
number = 10000
start = 20
width = 25
uniform_data = uniform.rvs(size=number, loc=start, scale=width)
• The uniform.rvs function generates uniform continuous variates starting at the given
location (loc) and spanning the given width (scale).
• The size argument specifies the number of random variates generated.
Normal distribution
• The central limit theorem makes the normal distribution significant, as it states that
for a sufficiently large sample size (n), the sampling distribution of the mean
approaches a normal distribution, regardless of the population's original distribution.
from scipy.stats import norm
normal_data = norm.rvs(size=90000, loc=20, scale=30)
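The central limit theorem stated above can be demonstrated empirically: starting from a population that is clearly not normal (uniform), the means of repeated samples still cluster normally around the population mean. The population size, sample size, and number of samples below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population far from normal: uniform on [0, 1), mean 0.5
population = rng.uniform(0, 1, size=100_000)

# Sampling distribution of the mean: average many samples of size n = 50
sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]

# The sample means concentrate near 0.5 with a much smaller spread
# than the population itself (roughly sigma / sqrt(n))
print(np.mean(sample_means), np.std(sample_means))
```

A histogram of sample_means would look bell-shaped even though the underlying population is flat, which is exactly what the theorem predicts.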
Exponential distribution
• The exponential distribution describes the time between events in such a Poisson
point process, and its probability density function is
f(x; λ) = λ e^(-λx) for x >= 0 (and 0 otherwise).
from scipy.stats import expon
expon_data = expon.rvs(scale=1, loc=0, size=1000)
Binomial distribution
• Binomial distribution, as the name suggests, has only two possible outcomes, success
or failure.
• The outcomes do not need to be equally likely and each trial is independent of the
other.
• Cumulative distribution function (CDF) is the probability that the variable takes a
value that is less than or equal to x.
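Both ideas, the probability of an exact outcome and the cumulative probability, can be sketched with SciPy's binom; the 10 trials with success probability 0.5 are an illustrative choice:

```python
from scipy.stats import binom

n, p = 10, 0.5   # 10 independent trials, success probability 0.5 each

# PMF: probability of exactly 5 successes
p_five = binom.pmf(5, n, p)

# CDF: probability of at most 5 successes (sum of PMF from 0 through 5)
p_at_most_five = binom.cdf(5, n, p)

print(p_five, p_at_most_five)
```

The PMF value is 252/1024 (about 0.246), while the CDF value is larger because it accumulates the probabilities of 0 through 5 successes.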
Descriptive Statistics
• Descriptive statistics deals with the formulation of simple summaries of data so that
they can be clearly understood.
• Measures of variability include standard deviation (or variance), the minimum and
maximum values of the variables, kurtosis, and skewness.
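Most of these summaries are available in one call via pandas' describe(); the productivity values below are illustrative:

```python
import pandas as pd

# Illustrative dataset
df = pd.DataFrame({'productivity': [12, 14, 15, 17, 18, 20, 21, 23, 25]})

# describe() reports count, mean, std, min, quartiles (25%, 50%, 75%), and max
summary = df['productivity'].describe()
print(summary)
```

This single table covers the measures of centrality (mean, median as the 50% quartile) and variability (std, min, max, quartiles) discussed in this unit.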