Unit 3
COMBINING DATASETS
• Transformations modify data to make it suitable for analysis. This includes scaling,
normalizing, or applying mathematical operations.
• Scaling: Standardizing values to have a mean of 0 and standard deviation of 1.
• Normalization: Rescaling values to a specific range (e.g., 0 to 1).
Formula: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
• Log Transformation: Reducing skewness in data.
• Square Root Transformation: Compresses the range of the data and reduces right skew (more mildly than a log transformation) while preserving the order of the values.
• Categorical Transformation (Binning): Converts continuous data into categories or bins.
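# Example numeric vector (assumed here; the slide does not define `data`)
data <- mtcars$mpg
# Log transformation (base 10; values must be positive)
log_data <- log10(data)
# Scaling (standardization to mean 0 and standard deviation 1)
scaled_data <- scale(data)
# Normalization (min-max rescaling to the range 0 to 1)
normalized_data <- (data - min(data)) / (max(data) - min(data))
# Power transformation
cubed_data <- data^3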
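The claims in the list above can be checked numerically. The following is a minimal sketch, assuming the built-in `mtcars$mpg` column as example data; the variable names (`x`, `scaled_x`, and so on) are illustrative and not part of the original notes.

```r
# Example data: miles per gallon from the built-in mtcars dataset (assumed)
x <- mtcars$mpg

# Scaling: the result should have mean 0 and standard deviation 1
scaled_x <- as.numeric(scale(x))   # scale() returns a matrix; convert to a vector
mean(scaled_x)                     # approximately 0 (up to floating-point error)
sd(scaled_x)                       # approximately 1

# Min-max normalization: the result should lie between 0 and 1
normalized_x <- (x - min(x)) / (max(x) - min(x))
range(normalized_x)

# Log and square root transformations compress large values more than
# small ones, but keep the values in the same order
log_x  <- log10(x)                 # requires x > 0
sqrt_x <- sqrt(x)                  # requires x >= 0
all(rank(x) == rank(sqrt_x))       # TRUE if the ordering is unchanged
```

Note that `scale()` returns a one-column matrix with centering and scaling attributes, which is why the sketch converts it back to a plain vector.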
WHY TRANSFORM DATA?
1. Equal-Width Binning: Divides the range of data into intervals of equal size.
mtcars$mpg_bin <- cut(mtcars$mpg,
                      breaks = 4, # Number of bins
                      labels = c("Low", "Medium", "High", "Very High"),
                      include.lowest = TRUE)
Example of equal-width bins over the range 0 to 100: 0-20, 20-40, 40-60, 60-80, 80-100.
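When `breaks` is a single number, `cut()` picks the bin edges itself and slightly widens the outer limits so the extreme values fall inside. The sketch below, which assumes `mtcars$mpg` as example data and uses illustrative names (`edges`, `mpg_bin`), computes the equal-width edges explicitly so the intervals are visible.

```r
# Equal-width binning of mpg into 4 bins with explicit break points
mpg <- mtcars$mpg

# 5 edges define 4 intervals of equal width between the minimum and maximum
edges <- seq(min(mpg), max(mpg), length.out = 5)
diff(edges)                             # the four widths are equal

mpg_bin <- cut(mpg,
               breaks = edges,
               labels = c("Low", "Medium", "High", "Very High"),
               include.lowest = TRUE)   # keep the minimum value in the first bin

table(mpg_bin)                          # number of cars in each bin
```

Equal widths do not imply equal counts: the bins cover the same span of values, but skewed data will fill them unevenly.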
METHODS OF BINNING
• Data cleaning ensures that datasets are consistent, accurate, and usable.
• Steps in Data Cleaning:
• Remove Duplicates: Use unique() or duplicated().
• Handle Missing Values: Use is.na() to identify missing values, then either drop them with na.omit() or fill them with replace().
• Standardize Data: Ensure consistent formatting (e.g., capitalization, whitespace, units).
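The cleaning steps above can be sketched on a small data frame. Everything in this example is made up for illustration (the data frame `df`, its columns, and the choice to fill with the mean). Formatting is standardized first so that rows differing only in case or whitespace are recognized as duplicates.

```r
# Illustrative data with a near-duplicate row, missing values,
# and inconsistent text formatting
df <- data.frame(
  name  = c("Asha", "asha ", "Ravi", "Ravi", "Meena"),
  score = c(80, 80, NA, NA, 90),
  stringsAsFactors = FALSE
)

# Standardize data: trim whitespace and use one capitalization style
df$name <- tolower(trimws(df$name))

# Remove duplicates: duplicated() flags rows that repeat an earlier row
df <- df[!duplicated(df), ]

# Handle missing values: is.na() locates them
is.na(df$score)

# Option 1: drop rows that contain any NA
df_dropped <- na.omit(df)

# Option 2: fill NAs, here with the mean of the observed scores
df_filled <- df
df_filled$score <- replace(df$score, is.na(df$score),
                           mean(df$score, na.rm = TRUE))
```

Whether to drop or fill missing values depends on the analysis; filling with the mean is only one possible choice.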
ANALYZING DATA
• A T-test is a statistical test used to compare the means of two groups to see if
they are significantly different from each other.
• Types:
• One-Sample T-Test: Compares the sample mean to a known or hypothesized value (e.g., a population mean).
• Two-Sample T-Test: Compares the means of two independent groups.
• Paired T-Test: Compares the means from the same group at two different times or
under two different conditions.
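All three types can be run with the base R function `t.test()`. The sketch below uses the built-in `mtcars` and `sleep` datasets as example data. The hypothesized mean of 20 and the variable names are assumptions made for illustration.

```r
# One-sample: is the mean mpg different from a hypothesized value of 20?
one_sample <- t.test(mtcars$mpg, mu = 20)

# Two-sample: compare mpg of automatic (am = 0) and manual (am = 1) cars.
# By default t.test() does not assume equal variances (Welch test).
two_sample <- t.test(mpg ~ am, data = mtcars)

# Paired: the sleep dataset records extra sleep for the same 10 subjects
# under two drugs (group 1 and group 2), so the observations are paired
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]
paired <- t.test(drug1, drug2, paired = TRUE)

# A small p-value suggests the means differ significantly
one_sample$p.value
two_sample$p.value
paired$p.value
```

Printing any of the result objects (for example `two_sample`) also shows the t statistic, degrees of freedom, and a confidence interval.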
ONE-SAMPLE T-TEST
• One-sample: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
• Two-sample: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
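• s: sample standard deviation (s² is the sample variance)
• n: sample size
• x: data; mean(x), also written x̄, is the sample mean
• μ: known or hypothesized mean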
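To connect the formulas to R, the statistics can be computed by hand and compared with `t.test()`. This is a sketch under assumed example data (`mtcars$mpg`, split by `am` for the two-sample case) with an illustrative hypothesized mean of 20.

```r
x  <- mtcars$mpg
mu <- 20                                    # hypothesized mean (illustrative)

# One-sample: t = (mean(x) - mu) / (s / sqrt(n))
t_one <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))

# Two-sample: t = (mean(x1) - mean(x2)) / sqrt(s1^2/n1 + s2^2/n2)
x1 <- mtcars$mpg[mtcars$am == 0]
x2 <- mtcars$mpg[mtcars$am == 1]
t_two <- (mean(x1) - mean(x2)) /
  sqrt(var(x1) / length(x1) + var(x2) / length(x2))

# Compare with the statistics reported by t.test()
all.equal(t_one, unname(t.test(x, mu = mu)$statistic))
all.equal(t_two, unname(t.test(x1, x2)$statistic))
```

The two-sample comparison relies on the default `var.equal = FALSE` in `t.test()`, which matches the unpooled denominator used in the formula.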
PAIRED T-TEST
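• $t = \frac{\bar{d}}{s_d / \sqrt{n}}$
• d: differences between the paired observations; $\bar{d}$: their mean; $s_d$: their standard deviation; n: number of pairs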
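The paired statistic is a one-sample t statistic computed on the differences. A minimal sketch, assuming the built-in `sleep` dataset as example data (the names `drug1`, `drug2`, and `d` are illustrative):

```r
# Extra sleep for the same 10 subjects under two drugs
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Differences within each pair
d <- drug1 - drug2

# t = mean(d) / (s_d / sqrt(n)), where n is the number of pairs
t_paired <- mean(d) / (sd(d) / sqrt(length(d)))

# Compare with the statistic reported by t.test()
all.equal(t_paired, unname(t.test(drug1, drug2, paired = TRUE)$statistic))
```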
CONTINUOUS DATA
• Continuous data is numerical data that can take any value within a given range, so the number of possible values is infinite. It is typically measured rather than counted.
• Graphical Representation: Continuous data is often represented using
histograms, box plots, or line graphs.
DISCRETE DATA
• Discrete data consists of distinct or separate values, often counted and finite.
• Graphical Representation: Bar charts or pie charts are typically used for
discrete data.
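The difference shows up in how the two kinds of data are plotted. The sketch below assumes two columns of the built-in `mtcars` dataset as examples: `cyl` (number of cylinders, discrete) and `mpg` (miles per gallon, continuous).

```r
# Discrete: cylinders are counted and take only a few distinct values,
# so the counts per value are shown as a bar chart
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars")

# Continuous: mpg is measured and can take any value in a range,
# so the values are grouped into intervals (histogram) or summarized (box plot)
hist(mtcars$mpg, xlab = "Miles per gallon", main = "Distribution of mpg")
boxplot(mtcars$mpg, ylab = "Miles per gallon")
```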
DISCRETE VS CONTINUOUS DATA
Aspect | Discrete data | Continuous data
Definition | Discrete data consists of distinct or separate values, often counted and finite. | Continuous data refers to numerical data that can take an infinite number of values within a given range.