Unit 3

The document provides an overview of R programming techniques for combining datasets, transforming data, and performing statistical analyses. It covers methods such as row and column binding, data transformations, binning, and various statistical tests like T-tests and ANOVA. Additionally, it discusses data visualization techniques and the differences between discrete and continuous data.

Uploaded by

Devanshi Gupta

R PROGRAMMING

UNIT-III
COMBINING DATASETS

• Combining datasets is a common task when dealing with multiple sources of data. It can be done row-wise, column-wise, or through joins.
• Row Binding (rbind()): Combines datasets by rows (the datasets should have the same number and names of columns).
• Column Binding (cbind()): Combines datasets by columns (the datasets should have the same number of rows).
• Merging (merge()): Combines datasets based on a common column or key.
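
The three approaches above can be sketched as follows. The data frames (students_a, students_b, ages, scores) and their values are hypothetical and made up for illustration; only rbind(), cbind(), and merge() come from the slides.

```r
# Hypothetical example data frames (not from the original slides)
students_a <- data.frame(id = c(1, 2), name = c("Asha", "Ravi"))
students_b <- data.frame(id = c(3, 4), name = c("Meera", "John"))

# Row binding: both data frames have the same column names, so rows are stacked
all_students <- rbind(students_a, students_b)

# Column binding: both objects have the same number of rows (4)
ages <- data.frame(age = c(20, 21, 19, 22))
with_age <- cbind(all_students, ages)

# Merging: rows are matched on the common key column "id"
scores <- data.frame(id = c(1, 2, 4), score = c(88, 75, 92))
merged <- merge(with_age, scores, by = "id")                    # only ids present in both
merged_all <- merge(with_age, scores, by = "id", all.x = TRUE)  # keeps id 3, with score NA

print(merged_all)
```

By default merge() keeps only matching keys; all.x = TRUE keeps every row of the first data frame and fills unmatched values with NA.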
TRANSFORMATIONS

• Transformations modify data to make it suitable for analysis. This includes scaling,
normalizing, or applying mathematical operations.
• Scaling: Standardizing values to have a mean of 0 and standard deviation of 1
• Normalization: Rescaling values to a specific range (e.g., 0 to 1).
Formula: $x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$
• Log Transformation: Reducing skewness in data.
• Square Root Transformation: Compresses large values (a milder effect than the log transformation) while preserving the order of the observations.
• Categorical Transformation: Converts continuous data into categories or bins. (Binning)
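# Example data (hypothetical): a positive numeric vector
data <- c(1, 10, 100, 1000)

# Log transformation (base 10; values must be positive)
log_data <- log10(data)

# Square root transformation (values must be non-negative)
sqrt_data <- sqrt(data)

# Scaling (mean 0, standard deviation 1); scale() returns a one-column matrix
scaled_data <- scale(data)

# Normalization (rescales values to the range 0 to 1)
normalized_data <- (data - min(data)) / (max(data) - min(data))

# Power transformation (cube)
cubed_data <- data^3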
WHY TRANSFORM DATA?

1. Reduce Skewness: Log or square root transformations pull in extreme values in skewed data, making the distribution more symmetrical.
2. Improve Model Performance: Many statistical methods (t-test, ANOVA, regression, etc.) work better with normalized or standardized data.
3. Handle Outliers: Transformations reduce the influence of outliers.
4. Simplify Interpretation: Transformed data can be easier to interpret and visualize. However, the output must then be interpreted on the transformed scale.
BINNING DATA

• Binning converts numerical data into categories or intervals (useful for histograms or classification).
• Techniques:
• Manual Binning: Define intervals explicitly.
• Automatic Binning: Use functions like cut().
METHODS OF BINNING

1. Equal-Width Binning: Divides the range of data into intervals of equal size.
mtcars$mpg_bin <- cut(mtcars$mpg,
                      breaks = 4, # Number of bins
                      labels = c("Low", "Medium", "High", "Very High"),
                      include.lowest = TRUE)
Example: a range of 0-100 split into five equal-width intervals: 0-20, 20-40, 40-60, 60-80, 80-100.
METHODS OF BINNING

2. Equal-Frequency Binning: Divides data into intervals containing an equal number of observations.
quantiles <- quantile(mtcars$mpg, probs = seq(0, 1, 0.25)) # Quartiles
# Equivalent: probs = c(0, 0.25, 0.5, 0.75, 1)
mtcars$mpg_bin_freq <- cut(mtcars$mpg,
                           breaks = quantiles,
                           labels = c("Q1", "Q2", "Q3", "Q4"),
                           include.lowest = TRUE)
METHODS OF BINNING

3. Custom Binning: Manually defines bins based on domain knowledge.
mtcars$mpg_custom_bin <- cut(mtcars$mpg,
                             breaks = c(10, 15, 20, 25, 35),
                             labels = c("Very Low", "Low", "Medium", "High"),
                             include.lowest = TRUE)
SUBSETS

• Subsetting extracts specific portions of a dataset based on conditions or indices.
• Syntax: subset(x, subset, select, drop = FALSE), for example select = c(name, gender)
• x: The data frame, matrix, or vector to subset.
• subset: Logical condition to filter rows (optional).
• select: Columns to select from the dataset (optional).
• drop: If TRUE, drops dimensions that are not needed (default is FALSE).
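
A minimal sketch of subset() on the built-in mtcars data frame. The threshold mpg > 25 and the chosen columns are arbitrary choices for illustration.

```r
# Keep rows where mpg > 25 and only the mpg, cyl, and hp columns
efficient <- subset(mtcars, mpg > 25, select = c(mpg, cyl, hp))

# The same selection written with bracket indexing
efficient_idx <- mtcars[mtcars$mpg > 25, c("mpg", "cyl", "hp")]

print(efficient)
```

Inside subset(), column names can be written without quotes or the mtcars$ prefix, which is why it is convenient for interactive work.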
SUMMARIZING FUNCTIONS

• Summarizing functions provide insights into data, such as mean, median, variance, and frequency.
• Common Functions:
• mean(), median(), var(), sd(): Summary statistics.
• summary(): Provides a summary of data.
• table(): Counts frequencies.
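
As an illustration, the functions listed above can be applied to the built-in mtcars dataset. The choice of the mpg and cyl columns is an assumption made for the example.

```r
# Summary statistics for a numeric column
mpg_mean   <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)
mpg_var    <- var(mtcars$mpg)
mpg_sd     <- sd(mtcars$mpg)

# Minimum, quartiles, median, mean, and maximum in one call
print(summary(mtcars$mpg))

# Frequency count of each distinct value of a categorical-like column
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
```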
DATA CLEANING

• Data cleaning ensures that datasets are consistent, accurate, and usable.
• Steps in Data Cleaning:
• Remove Duplicates: Use unique() or duplicated().
• Handle Missing Values: Use is.na() to identify missing values, then either remove the affected rows (na.omit()) or fill them in (replace()).
• Standardize Data: Ensure consistent formatting.
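
The cleaning steps above can be sketched on a small data frame. The data frame raw and its values are hypothetical, and filling missing scores with the mean is only one possible strategy, chosen here for illustration.

```r
# Hypothetical messy data: one duplicate row, one missing score, stray whitespace
raw <- data.frame(
  name  = c("Asha", "ravi ", "Asha", "Meera"),
  score = c(88, NA, 88, 92),
  stringsAsFactors = FALSE
)

# Remove duplicates: duplicated() flags repeated rows
deduped <- raw[!duplicated(raw), ]

# Identify missing values
n_missing <- sum(is.na(deduped$score))

# Option 1: drop rows that contain missing values
complete_rows <- na.omit(deduped)

# Option 2: fill missing values (here with the mean of the observed scores)
filled <- deduped
filled$score <- replace(filled$score, is.na(filled$score),
                        mean(filled$score, na.rm = TRUE))

# Standardize formatting: trim surrounding whitespace from names
filled$name <- trimws(filled$name)

print(filled)
```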
ANALYZING DATA

• Analyzing data involves inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.
• Steps in Data Analysis:
• Data Collection: Gather relevant data.
• Data Cleaning: Identify and correct errors or inconsistencies in the data.
• Exploratory Data Analysis: Summarize the main characteristics of data, often using visual methods.
• Statistical Analysis: Apply statistical models to make inferences or predictions.
• Interpretation and Visualization: Translate the findings into actionable insights.
BAR CHART & PIE CHART

• A bar chart is a visual display of the frequency for each category of a categorical variable, or the relative frequency (%) for each category.
• Bar Chart: Represents categorical data with rectangular bars.
• Pie Chart: Circular representation of data as proportions of a whole.
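
A short sketch of both chart types using base R graphics. It treats the number of cylinders in mtcars as the categorical variable, which is an assumption made for the example.

```r
# Frequency of each category
cyl_counts <- table(mtcars$cyl)

# Bar chart: one bar per category, height = frequency
barplot(cyl_counts,
        main = "Cars by Number of Cylinders",
        xlab = "Cylinders", ylab = "Frequency",
        col = "steelblue")

# Pie chart: each slice is a category's share of the whole
cyl_pct <- round(100 * prop.table(cyl_counts), 1)
pie(cyl_counts,
    labels = paste0(names(cyl_counts), " cyl (", cyl_pct, "%)"),
    main = "Share of Cars by Cylinders")
```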
BOXPLOT

• A boxplot is appropriate for summarising the distribution of a numeric variable.
• A boxplot is a visual display of the five-number summary: minimum, 1st quartile, median, 3rd quartile, and maximum.
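
The five-number summary and the corresponding boxplot can be sketched with base R as below; the mtcars columns are an illustrative choice. Note that boxplot() draws points beyond 1.5 times the interquartile range as separate outliers, so the whiskers do not always reach the minimum and maximum.

```r
# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(mtcars$mpg)
print(five)

# Boxplot of a single numeric variable
boxplot(mtcars$mpg, main = "Miles per Gallon", ylab = "mpg")

# Boxplots of a numeric variable split by a grouping variable
boxplot(mpg ~ cyl, data = mtcars,
        main = "mpg by Number of Cylinders",
        xlab = "Cylinders", ylab = "mpg")
```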
HISTOGRAM

• A histogram is appropriate for summarizing the distribution of a numeric variable.
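
A minimal histogram sketch with base R. The column and the suggested number of breaks are arbitrary choices for illustration; hist() treats breaks as a suggestion and may pick slightly different bin edges.

```r
# Histogram of a numeric variable; the returned object holds the bin edges and counts
h <- hist(mtcars$mpg,
          breaks = 8,
          main = "Distribution of Miles per Gallon",
          xlab = "mpg", col = "lightgray")

print(h$breaks)  # bin edges actually used
print(h$counts)  # number of observations in each bin
```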
LINE CHART & SCATTER PLOT
THREE DIMENSIONAL DATA

• Three-dimensional plots represent three variables.
1. 3D Scatter Plot (scatterplot3d)
2. Surface Plot
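
A surface plot can be sketched with the base R function persp(). The function being plotted (a bell-shaped surface) and the viewing angles are assumptions made for illustration. A 3D scatter plot would typically use the separate scatterplot3d package, which must be installed first and is not used here.

```r
# Grid of x and y values
x <- seq(-3, 3, length.out = 30)
y <- x

# z is computed for every (x, y) combination, giving a 30 x 30 matrix
z <- outer(x, y, function(a, b) exp(-(a^2 + b^2) / 2))

# Surface plot; theta and phi control the viewing angle
persp(x, y, z,
      theta = 30, phi = 25,
      col = "lightblue",
      xlab = "x", ylab = "y", zlab = "z",
      main = "Surface Plot")
```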
HEATMAP

• A heatmap is another way to represent 3D data, where the intensity of color represents the third dimension.
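
As a rough sketch, base R's heatmap() draws a matrix with rows and columns as two dimensions and cell color as the third. The subset of mtcars rows and columns is an arbitrary choice, and scale = "column" is used here so that variables measured in different units are comparable.

```r
# Numeric matrix: rows are cars, columns are variables
m <- as.matrix(mtcars[1:10, c("mpg", "hp", "wt", "qsec")])

# Color intensity encodes the (column-scaled) value of each cell
heatmap(m, scale = "column", main = "Heatmap of Selected mtcars Variables")
```

By default heatmap() also reorders rows and columns by clustering and draws dendrograms along the margins.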
APPLICATIONS OF 3D PLOTS

• Visualizing multivariate data.
• Representing mathematical functions.
• Exploring relationships between three variables.
BASIC GRAPH FUNCTIONS

1. plot(): Base plotting function.
2. lines(): Add lines to an existing plot.
3. points(): Add points to an existing plot.
4. text(): Add text annotations.
5. abline(): Add straight lines (e.g., regression lines).
COMMON ARGUMENTS FOR CHART FUNCTIONS

• main: Title of the chart.
• xlab/ylab: Labels for the axes.
• col: Colors for the chart.
• xlim/ylim: Set axis limits.
• las: Orientation of axis labels (0 = parallel to the axis, 1 = horizontal, 2 = perpendicular to the axis, 3 = vertical).
• type: Type of plot ("p" for points, "l" for lines, "b" for both).
• Save plots to file using:
png("plot.png")   # open a PNG file as the graphics device
plot(data)        # draw the plot into the file
dev.off()         # close the device and write the file
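
The basic graph functions and common arguments above can be combined as in the following sketch. The variables (wt and mpg from mtcars), colors, and axis limits are illustrative assumptions.

```r
# Scatter plot with title, axis labels, color, axis limits, and horizontal tick labels
plot(mtcars$wt, mtcars$mpg,
     type = "p",
     main = "Fuel Efficiency vs Weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     col = "blue",
     xlim = c(1, 6), ylim = c(10, 35),
     las = 1)

# abline(): add the fitted regression line
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red")

# lines(): add a smoothed trend line to the existing plot
lines(lowess(mtcars$wt, mtcars$mpg), lty = 2)

# points() and text(): mark and label the point of means
points(mean(mtcars$wt), mean(mtcars$mpg), pch = 19, col = "darkgreen")
text(mean(mtcars$wt), mean(mtcars$mpg), labels = "mean", pos = 4)
```

plot() starts a new chart, while lines(), points(), text(), and abline() only add to a chart that already exists, so they must be called after plot().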
T-TEST DESIGN

• A T-test is a statistical test used to compare the means of two groups to see if
they are significantly different from each other.
• Types:
• One-Sample T-Test: Compares the sample mean to a known value
(assumed/hypothesized) (e.g., population mean).
• Two-Sample T-Test: Compares the means of two independent groups.
• Paired T-Test: Compares the means from the same group at two different times or
under two different conditions.
ONE-SAMPLE T-TEST

• $t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$

• $\bar{x}$: Sample mean
• $\mu$: Population mean (the hypothesized value)
• $s$: Sample standard deviation
• $n$: Sample size
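
A minimal sketch that computes the one-sample t statistic from the formula above and compares it with t.test(). The data (mtcars$mpg) and the hypothesized mean mu0 = 20 are assumptions chosen for illustration.

```r
x   <- mtcars$mpg
mu0 <- 20   # hypothesized population mean (illustrative value)

# t statistic computed directly from the formula
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# Built-in one-sample t-test
res <- t.test(x, mu = mu0)

print(t_manual)
print(res)
```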
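TWO-SAMPLE T-TEST (INDEPENDENT SAMPLES)

• $t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

• $\bar{x}_1, \bar{x}_2$: Sample means of the two groups
• $s_1^2, s_2^2$: Sample variances of the two groups
• $n_1, n_2$: Sample sizes of the two groups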
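
The two-sample statistic can be checked against t.test() in the same way. Splitting mtcars$mpg by transmission type (am) is an assumption made for the example. With its default var.equal = FALSE, t.test() uses the unpooled (Welch) form of the statistic, which matches the formula above.

```r
automatic <- mtcars$mpg[mtcars$am == 0]
manual    <- mtcars$mpg[mtcars$am == 1]

# t statistic computed directly from the formula
t_manual <- (mean(automatic) - mean(manual)) /
  sqrt(var(automatic) / length(automatic) + var(manual) / length(manual))

# Built-in two-sample t-test (Welch test by default)
res <- t.test(automatic, manual)

print(t_manual)
print(res)
```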
PAIRED T-TEST

• $t = \dfrac{\bar{d}}{s_d / \sqrt{n}}$

• $\bar{d}$: Mean of the differences
• $s_d$: Standard deviation of the differences
• $n$: Number of pairs
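
A sketch of the paired test using R's built-in sleep dataset, in which the same 10 patients were measured under two drugs. Using this dataset is an assumption for illustration; it is not referenced in the slides.

```r
# Measurements for the same 10 subjects under two conditions
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Differences within each pair
d <- drug1 - drug2

# t statistic computed directly from the formula
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

# Built-in paired t-test
res <- t.test(drug1, drug2, paired = TRUE)

print(t_manual)
print(res)
```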
ANOVA TEST DESIGN

• Analysis of Variance (ANOVA) is a statistical method used to test the differences between three or more group means.
• Types:
• One-Way ANOVA: Tests for differences across the levels of one factor (e.g., different teaching methods affecting student performance).
• Two-Way ANOVA: Tests the effects of two factors and the interaction between them (e.g., the effect of teaching method and gender on student performance).
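
Both designs can be sketched with aov(). The built-in PlantGrowth and ToothGrowth datasets stand in for the teaching-method examples and are an assumption made for illustration.

```r
# One-way ANOVA: plant weight across three treatment groups
one_way <- aov(weight ~ group, data = PlantGrowth)
print(summary(one_way))

# Follow-up pairwise comparisons between the groups
print(TukeyHSD(one_way))

# Two-way ANOVA: tooth length by supplement type, dose, and their interaction
tg <- ToothGrowth
tg$dose <- factor(tg$dose)   # treat dose as a categorical factor
two_way <- aov(len ~ supp * dose, data = tg)
print(summary(two_way))
```

In the model formula, supp * dose expands to both main effects plus the supp:dose interaction term.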
T-TEST VS ANOVA TEST

| Aspect | T-Test | ANOVA Test |
|---|---|---|
| Definition | A statistical test used to compare the means of two groups. | A statistical test used to compare the means of three or more groups. |
| Type of Output | T-test gives a direct comparison between two groups. | ANOVA identifies whether there is a general difference among groups but requires additional tests for specific pairwise comparisons. |
| Example Use Cases | Testing if male and female students have the same average score. Comparing average blood pressure before and after a treatment. | Testing if different teaching methods lead to different student performances. Comparing the average yield of crops with three types of fertilizers. |
| R Functions | t.test() | aov() |
REGRESSION

• Regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables.
• Types of Regression:
• Linear Regression: Models the relationship between two variables as a straight line.
• Multiple Regression: Involves more than one independent variable.
LINEAR MODEL

• A linear model assumes a straight-line relationship between the independent variable(s) and the dependent variable.
• Equation: $Y = \beta_0 + \beta_1 X_1 + \epsilon$, where $Y$ is the dependent variable, $X_1$ is the independent variable, $\beta_0$ and $\beta_1$ are coefficients, and $\epsilon$ is the error term.
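
A minimal sketch of fitting linear models with lm(). Modeling mpg from wt and hp in mtcars is an assumption chosen for illustration.

```r
# Simple linear regression: mpg = b0 + b1 * wt + error
simple_fit <- lm(mpg ~ wt, data = mtcars)
print(coef(simple_fit))   # estimated intercept (b0) and slope (b1)

# Multiple regression: more than one independent variable
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)
print(summary(multi_fit))

# Predict mpg for a hypothetical car weighing 3000 lbs (wt is in units of 1000 lbs)
pred <- predict(simple_fit, newdata = data.frame(wt = 3))
print(pred)
```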
SMOOTHENING

• Smoothening is a technique used to reduce noise or fluctuations in data, making the data easier to analyze.
• Methods:
• Moving Average: A simple method where each data point is replaced by the average of itself and its neighbors.
• Exponential Smoothing: Gives more weight to recent data points.
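
Both methods can be sketched in base R as below. The noisy series y, the 3-point window, and the smoothing weight alpha = 0.3 are all assumptions chosen for illustration.

```r
set.seed(1)
# Hypothetical noisy series: a sine wave plus random noise
y <- sin(seq(0, 2 * pi, length.out = 50)) + rnorm(50, sd = 0.3)

# Moving average: each point becomes the mean of itself and its two neighbors
# (sides = 2 centers the window; the first and last values are NA)
ma3 <- stats::filter(y, rep(1 / 3, 3), sides = 2)

# Simple exponential smoothing: s[i] = alpha * y[i] + (1 - alpha) * s[i - 1]
alpha <- 0.3
es <- numeric(length(y))
es[1] <- y[1]
for (i in 2:length(y)) {
  es[i] <- alpha * y[i] + (1 - alpha) * es[i - 1]
}

# Compare the raw and smoothed series
plot(y, type = "l", col = "gray", main = "Smoothing", ylab = "value")
lines(ma3, col = "blue")
lines(es, col = "red")
```

A larger alpha makes the exponentially smoothed series follow recent values more closely, while a smaller alpha gives a smoother but slower-reacting curve.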
PROBABILITY DISTRIBUTION

• A probability distribution describes the likelihood of different outcomes in a random experiment, that is, how the probabilities of a random variable are distributed. It can be classified as discrete or continuous, depending on whether the variable takes countable values or any value within a continuous range.
• Random Variable: A variable whose values depend on the outcome of a random event.
• Types of Probability Distributions:
• Discrete Probability Distributions: The outcome can take only a finite or countably infinite number of values. Example: Binomial, Poisson.
• Continuous Probability Distributions: The outcome can take any value within a range. Example: Normal, Exponential.
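
R provides functions for each of the distributions named above, following the prefix convention d (density or probability), p (cumulative probability), q (quantile), and r (random draws). The parameter values below are arbitrary choices for illustration.

```r
# Discrete: probability of exactly 3 successes in 10 trials with p = 0.5
p_binom <- dbinom(3, size = 10, prob = 0.5)

# Discrete: probability of exactly 2 events when the average rate is 4
p_pois <- dpois(2, lambda = 4)

# Continuous: P(Z <= 1.96) for a standard normal variable
p_norm <- pnorm(1.96)

# Continuous: P(X <= 1) for an exponential variable with rate 2
p_exp <- pexp(1, rate = 2)

# Random draws from a distribution
set.seed(42)
draws <- rnorm(5, mean = 0, sd = 1)

print(c(binomial = p_binom, poisson = p_pois, normal = p_norm, exponential = p_exp))
```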
CONTINUOUS DATA

• Continuous data refers to numerical data that can take an infinite number of
values within a given range. These are typically measured and can take on any
value within an interval.
• Graphical Representation: Continuous data is often represented using
histograms, box plots, or line graphs.
DISCRETE DATA

• Discrete data consists of distinct or separate values, often counted and finite.
• Graphical Representation: Bar charts or pie charts are typically used for
discrete data.
DISCRETE VS CONTINUOUS DATA

| Aspect | Discrete Data | Continuous Data |
|---|---|---|
| Definition | Discrete data consists of distinct or separate values, often counted and finite. | Continuous data refers to numerical data that can take an infinite number of values within a given range. |
| Nature | Countable and distinct values. | Measurable and can take any value. |
| Representation | Often represented as whole numbers. | Can include fractions and decimals. |
| Examples | Number of pets, number of books, test scores (whole numbers). | Weight, height, distance, time. |
| Visualization | Bar charts or scatter plots. | Histograms, line charts, density plots. |
