Unit 3

The document provides an overview of R programming techniques for combining datasets, transforming data, and performing statistical analyses. It covers methods such as row and column binding, data transformations, binning, and various statistical tests like T-tests and ANOVA. Additionally, it discusses data visualization techniques and the differences between discrete and continuous data.

Uploaded by

Devanshi Gupta

R PROGRAMMING

UNIT-III
COMBINING DATASETS

• Combining datasets is a common task when dealing with multiple sources of data. It can be done row-wise, column-wise, or through joins.
• Row Binding (rbind()): Combines datasets by rows (the datasets should have the same number and names of columns).
• Column Binding (cbind()): Combines datasets by columns (the datasets should have the same number of rows).
• Merging (merge()): Combines datasets based on a common column or key.
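As a rough illustration of the three approaches above, the sketch below uses small made-up data frames. The names students_a, students_b, scores and cities are invented for this example and are not part of the original notes.

```r
# Two small, made-up data frames with identical column names
students_a <- data.frame(id = c(1, 2), name = c("Asha", "Ben"))
students_b <- data.frame(id = c(3, 4), name = c("Chen", "Dev"))

# Row binding: stack the rows (columns must match by number and name)
all_students <- rbind(students_a, students_b)

# Column binding: attach a new column (row counts must match)
scores <- data.frame(score = c(80, 65, 90, 72))
with_scores <- cbind(all_students, scores)

# Merging: join on the shared key column "id"
cities <- data.frame(id = c(1, 2, 3), city = c("Delhi", "Mumbai", "Pune"))
merged     <- merge(with_scores, cities, by = "id")                # only ids found in both
merged_all <- merge(with_scores, cities, by = "id", all.x = TRUE)  # keep id 4, city is NA
```

By default merge() keeps only keys present in both inputs; all.x = TRUE keeps every row of the first data frame and fills the unmatched columns with NA.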
TRANSFORMATIONS

• Transformations modify data to make it suitable for analysis. This includes scaling, normalizing, or applying mathematical operations.
• Scaling: Standardizing values to have a mean of 0 and a standard deviation of 1.
• Normalization: Rescaling values to a specific range (e.g., 0 to 1).
Formula: $x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$
• Log Transformation: Reduces right skewness in data (requires positive values).
• Square Root Transformation: Compresses the range of the data, more gently than a log transformation, while preserving the order of the values.
• Categorical Transformation (Binning): Converts continuous data into categories or bins.
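# Example input (values assumed here for illustration); use your own numeric vector
data <- c(1, 4, 9, 16, 100)

# Log transformation (base 10; values must be positive)
log_data <- log10(data)

# Square root transformation (values must be non-negative)
sqrt_data <- sqrt(data)

# Scaling: subtract the mean, then divide by the standard deviation
scaled_data <- scale(data)

# Normalization (min-max) to the range 0 to 1
normalized_data <- (data - min(data)) / (max(data) - min(data))

# Power transformation
cubed_data <- data^3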
WHY TRANSFORM DATA?

1. Reduce Skewness: Log or square root transformations pull in extreme values in skewed data, making the distribution more symmetric.
2. Improve Model Performance: Many statistical methods (t-test, ANOVA, regression, etc.) work better when the data are closer to normal or on a common scale.
3. Handle Outliers: Transformations reduce the influence of outliers.
4. Simplify Interpretation: Transformed data can be easier to interpret and visualize. However, the output must then be interpreted on the transformed scale.
BINNING DATA

• Binning converts numerical data into categories or intervals (useful for histograms or classification).
• Techniques:
• Manual Binning: Define intervals explicitly.
• Automatic Binning: Use functions like cut().
METHODS OF BINNING

1. Equal-Width Binning: Divides the range of data into intervals of equal size.
mtcars$mpg_bin <- cut(mtcars$mpg,
                      breaks = 4, # Number of bins
                      labels = c("Low", "Medium", "High", "Very High"),
                      include.lowest = TRUE)
Example of equal-width intervals: 0-20, 20-40, 40-60, 60-80, 80-100
METHODS OF BINNING

2. Equal-Frequency Binning: Divides data into intervals containing an (approximately) equal number of observations.
# Quartile cut points; equivalently, probs = c(0, 0.25, 0.5, 0.75, 1)
quantiles <- quantile(mtcars$mpg, probs = seq(0, 1, 0.25))
mtcars$mpg_bin_freq <- cut(mtcars$mpg,
                           breaks = quantiles,
                           labels = c("Q1", "Q2", "Q3", "Q4"),
                           include.lowest = TRUE)
METHODS OF BINNING

3. Custom Binning: Manually defines bins based on domain knowledge.
mtcars$mpg_custom_bin <- cut(mtcars$mpg,
                             breaks = c(10, 15, 20, 25, 35),
                             labels = c("Very Low", "Low", "Medium", "High"),
                             include.lowest = TRUE)
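One detail worth seeing in action is which end of each interval cut() treats as closed. The sketch below uses a made-up vector of ages (the vector, the break points and the labels are assumptions for illustration) and counts the observations per bin with table().

```r
# Made-up ages, used only for illustration
ages <- c(5, 12, 18, 25, 33, 41, 58, 67, 72, 90)
bin_labels <- c("Child", "Young adult", "Middle-aged", "Senior")

# By default each interval is closed on the right: (0,18], (18,40], (40,65], (65,100]
age_group <- cut(ages, breaks = c(0, 18, 40, 65, 100), labels = bin_labels)

# With right = FALSE the intervals become [0,18), [18,40), ... so 18 moves up one bin
age_group_left <- cut(ages, breaks = c(0, 18, 40, 65, 100), labels = bin_labels,
                      right = FALSE)

# Count how many observations fall in each bin
table(age_group)
table(age_group_left)
```

Values that fall outside every interval come back as NA, which is why the snippets above pass include.lowest = TRUE when the smallest break equals the smallest data value.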
SUBSETS

• Subsetting extracts specific portions of a dataset based on conditions or indices.
• Syntax: subset(x, subset, select, drop = FALSE), for example with select = c(name, gender)
• x: The data frame, matrix, or vector to subset.
• subset: Logical condition to filter rows (optional).
• select: Columns to select from the dataset (optional).
• drop: If TRUE, drops dimensions that are not needed (default is FALSE).
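A minimal sketch of subset() in use, assuming a small made-up data frame called students with name, gender and score columns (the data are invented for this example):

```r
students <- data.frame(
  name   = c("Asha", "Ben", "Chen", "Dev"),
  gender = c("F", "M", "F", "M"),
  score  = c(80, 65, 90, 72),
  stringsAsFactors = FALSE
)

# Rows where score is above 70, keeping only the name and gender columns
high_scorers <- subset(students, score > 70, select = c(name, gender))

# The same selection written with bracket indexing
high_scorers_idx <- students[students$score > 70, c("name", "gender")]
```

Inside subset() the column names can be written without quotes or the students$ prefix, which keeps interactive filtering short.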
SUMMARIZING FUNCTIONS

• Summarizing functions provide insights into data, such as mean, median, variance, and frequency.
• Common Functions:
• mean(), median(), var(), sd() – Summary statistics.
• summary() – Provides a summary of data.
• table() – Counts frequencies.
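The functions listed above can be tried on a small made-up vector (the values of scores and grades are assumptions for illustration):

```r
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(scores)     # arithmetic mean
median(scores)   # middle value of the sorted data
var(scores)      # sample variance (divides by n - 1)
sd(scores)       # sample standard deviation, the square root of var()
summary(scores)  # minimum, quartiles, median, mean, maximum

# Frequency counts for a categorical vector
grades <- c("A", "B", "A", "C", "B", "A")
grade_counts <- table(grades)
grade_counts
```

Note that var() and sd() use the sample (n - 1) denominator, not the population (n) denominator.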
DATA CLEANING

• Data cleaning ensures that datasets are consistent, accurate, and usable.
• Steps in Data Cleaning:
• Remove Duplicates: Use unique() or duplicated().
• Handle Missing Values: Use is.na() to identify missing values, then either drop them (na.omit()) or fill them in (replace()).
• Standardize Data: Ensure consistent formatting.
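The cleaning steps above might look like this on a small made-up data frame. The data, the choice to fill missing scores with the mean, and the choice to lower-case city names are all assumptions for illustration, not rules from the notes.

```r
raw <- data.frame(
  name  = c("Asha", "Ben", "Ben", "Chen", "Dev"),
  city  = c(" delhi", "Mumbai", "Mumbai", "PUNE ", "Delhi"),
  score = c(80, 60, 60, NA, 70),
  stringsAsFactors = FALSE
)

# 1. Remove duplicates: keep only the first copy of each identical row
clean <- raw[!duplicated(raw), ]

# 2. Handle missing values: locate them with is.na(), then fill with the mean
missing <- is.na(clean$score)
clean$score <- replace(clean$score, missing, mean(clean$score, na.rm = TRUE))

# 3. Standardize data: trim stray spaces and use one letter case
clean$city <- tolower(trimws(clean$city))
```

Dropping incomplete rows with na.omit(clean) is the alternative to step 2 when filling in values is not appropriate.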
ANALYZING DATA

• Analyzing data involves inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.
• Steps in Data Analysis:
• Data Collection: Gather relevant data.
• Data Cleaning: Identify and correct errors or inconsistencies in the data.
• Exploratory Data Analysis: Summarize the main characteristics of data, often using visual methods.
• Statistical Analysis: Apply statistical models to make inferences or predictions.
• Interpretation: Translate the findings into actionable insights, typically supported by visualization.
BAR CHART & PIE CHART

• A bar chart is a visual display of the frequency, or the relative frequency (%), for each category of a categorical variable.
• Bar Chart: Represents categorical data with rectangular bars.
• Pie Chart: Circular representation of data as proportions of a whole.
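A short sketch of both charts with base R graphics, using a made-up categorical vector (fruit is an invented example):

```r
# Made-up categorical data for illustration
fruit <- c("apple", "banana", "apple", "cherry", "banana", "apple")

counts <- table(fruit)               # frequency of each category
shares <- prop.table(counts) * 100   # relative frequency in percent

barplot(counts, main = "Fruit counts", xlab = "Fruit", ylab = "Frequency")
pie(counts, main = "Fruit shares")
```

Passing shares instead of counts to barplot() would plot relative frequencies on the vertical axis.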
BOXPLOT

• A boxplot is appropriate for summarising the distribution of a numeric variable.
• A boxplot is a visual display of the five-number summary: minimum, 1st quartile, median, 3rd quartile, maximum.
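The five-number summary can be computed with fivenum() and drawn with boxplot(). The vector x below is made up for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)   # made-up data

five <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
five

boxplot(x, horizontal = TRUE, main = "Boxplot of x")
```

By default R's boxplot() extends the whiskers only to the most extreme points within 1.5 times the interquartile range of the box, so more distant values are drawn as individual points rather than as the whisker ends.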
HISTOGRAM

• A histogram is appropriate for summarizing the distribution of a numeric variable: the values are grouped into bins and the height of each bar shows the count in that bin.
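A minimal sketch with hist(), using a made-up vector and explicitly chosen break points (both are assumptions for illustration):

```r
values <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)   # made-up data

# Bins (0,1], (1,2], (2,3], (3,4], (4,5]; hist() draws the plot and
# also returns the bin counts it used
h <- hist(values, breaks = 0:5, main = "Histogram of values", xlab = "Value")
h$counts
```

Leaving out breaks lets hist() choose the bins automatically.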
LINE CHART & SCATTER PLOT
THREE DIMENSIONAL DATA

• Plots representing three variables:
1. Scatterplot3d
2. Surface Plot
HEATMAP

• A heatmap is another way to represent 3D data, where the intensity of color represents the third dimension.
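One way to see the difference between a surface plot and a heatmap is to draw the same data both ways. The sketch below uses the base R functions persp() and image(), which are a choice made for this example (the notes do not name specific functions for these plots), and a made-up mathematical function for z.

```r
x <- seq(-2, 2, by = 0.25)
y <- seq(-2, 2, by = 0.25)

# Height z at every (x, y) pair: a smooth bump centred at the origin
z <- outer(x, y, function(x, y) exp(-(x^2 + y^2)))

# Surface plot: z drawn as height above the x-y plane
persp(x, y, z, theta = 30, phi = 25, xlab = "x", ylab = "y", zlab = "z")

# Heatmap-style plot: the same z values encoded as colour intensity
image(x, y, z, xlab = "x", ylab = "y", main = "z as colour intensity")
```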
APPLICATIONS OF 3D PLOTS

• Visualizing multivariate data.

• Representing mathematical functions.
• Exploring relationships between three variables.
BASIC GRAPH FUNCTIONS

1. plot(): Base plotting function.

2. lines(): Add lines to an existing plot.
3. points(): Add points to an existing plot.
4. text(): Add text annotations.
5. abline(): Add straight lines (e.g., regression lines).
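The five functions above are usually combined: plot() starts the chart and the others add layers to it. A small sketch with made-up data (x and y are invented for this example):

```r
# Made-up data for illustration
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)

# Start the chart with points
plot(x, y, type = "p", main = "Basic plot", xlab = "x", ylab = "y",
     col = "blue", xlim = c(0, 11), ylim = c(0, 22), las = 1)

lines(x, y, col = "grey")                  # connect the points
points(5, 10.1, col = "red", pch = 19)     # highlight one observation
text(5, 10.1, labels = "x = 5", pos = 4)   # annotate it, to the right of the point

fit <- lm(y ~ x)                           # straight-line fit
abline(fit, col = "darkgreen")             # add the fitted regression line
```

lines(), points(), text() and abline() only work after a plot has been opened, so plot() must come first.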
COMMON ARGUMENTS FOR CHART FUNCTIONS

• main: Title of the chart.

• xlab/ylab: Labels for the axes.
• col: Colors for the chart.
• xlim/ylim: Set axis limits.
• las: Orientation of axis labels (0 = parallel to the axis, 1 = always horizontal, 2 = perpendicular to the axis, 3 = always vertical).
• type: Type of plot ("p" for points, "l" for lines, "b" for both).
• Save plots to file using:
png("plot.png")
plot(data)
dev.off()
T-TEST DESIGN

• A T-test is a statistical test used to compare the means of two groups to see if
they are significantly different from each other.
• Types:
• One-Sample T-Test: Compares the sample mean to a known or hypothesized value (e.g., a population mean).
• Two-Sample T-Test: Compares the means of two independent groups.
• Paired T-Test: Compares the means from the same group at two different times or
under two different conditions.
ONE-SAMPLE T-TEST

• $t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$

• $\bar{x}$: Sample mean
• $\mu$: Population mean (hypothesized value)
• $s$: Sample standard deviation
• $n$: Sample size
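To connect the formula to R, the sketch below computes t by hand and compares it with t.test(). The sample x and the hypothesized mean of 5 are made up for illustration.

```r
x  <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.2)   # made-up sample
mu <- 5                                            # hypothesized population mean

# t statistic computed directly from the formula
t_manual <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))

# The same test with the built-in function
result <- t.test(x, mu = mu)
result$statistic   # t value, agrees with t_manual
result$p.value     # two-sided p-value
```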
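TWO-SAMPLE T-TEST (INDEPENDENT SAMPLES)

• $t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

• $s_1^2, s_2^2$: Sample variances
• $n_1, n_2$: Sample sizes
• $\bar{x}_1, \bar{x}_2$: Sample means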
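The unequal-variance form of the statistic shown above is what t.test() computes by default for two samples (var.equal = FALSE). A sketch with two made-up groups:

```r
group1 <- c(78, 85, 90, 72, 88, 81)       # made-up scores, group 1
group2 <- c(70, 75, 80, 68, 74, 79, 73)   # made-up scores, group 2

# t statistic from the formula: difference in means over its standard error
t_manual <- (mean(group1) - mean(group2)) /
  sqrt(var(group1) / length(group1) + var(group2) / length(group2))

# Built-in two-sample test; the groups may have different sizes
result <- t.test(group1, group2)
result$statistic   # agrees with t_manual
```

Setting var.equal = TRUE would instead pool the two variances, which gives a slightly different statistic.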
PAIRED T-TEST

• $t = \dfrac{\bar{d}}{s_d / \sqrt{n}}$

• $\bar{d}$: Mean of the differences
• $s_d$: Standard deviation of the differences
• $n$: Number of pairs
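A paired test is a one-sample test on the within-pair differences. The sketch below shows this with made-up before and after measurements on the same six subjects.

```r
before <- c(140, 152, 138, 147, 160, 155)   # made-up readings before treatment
after  <- c(135, 148, 136, 140, 151, 150)   # same subjects after treatment

d <- before - after                          # one difference per pair

# t statistic from the formula
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

# Built-in paired test
result <- t.test(before, after, paired = TRUE)
result$statistic   # agrees with t_manual
```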
ANOVA TEST DESIGN

• Analysis of Variance (ANOVA) is a statistical method used to test the differences between three or more group means.
• Types:
• One-Way ANOVA: Tests for differences across the levels of one factor (e.g., different teaching methods affecting student performance).
• Two-Way ANOVA: Tests for the effects of two factors and their interaction (e.g., the effect of teaching method and gender on student performance).
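Both designs can be tried with aov() on datasets that ship with R. Using PlantGrowth and ToothGrowth here is a choice made for this example; they stand in for the teaching-method scenario described above.

```r
# One-way ANOVA: plant weight across a control group and two treatments
fit <- aov(weight ~ group, data = PlantGrowth)
anova_table <- summary(fit)[[1]]   # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)
anova_table

# Follow-up pairwise comparisons between the three groups
TukeyHSD(fit)

# Two-way ANOVA with interaction: supplement type and dose on tooth length
fit2 <- aov(len ~ supp * factor(dose), data = ToothGrowth)
summary(fit2)
```

In the model formula, supp * factor(dose) expands to both main effects plus their interaction; dose is wrapped in factor() so that it is treated as a grouping variable rather than a number.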
T-TEST VS ANOVA TEST

| Aspect | T-Test | ANOVA Test |
|---|---|---|
| Definition | A statistical test used to compare the means of two groups. | A statistical test used to compare the means of three or more groups. |
| Type of Output | T-test gives a direct comparison between two groups. | ANOVA identifies whether there is a general difference among groups but requires additional tests for specific pairwise comparisons. |
| Example Use Cases | Testing if male and female students have the same average score; comparing average blood pressure before and after a treatment. | Testing if different teaching methods lead to different student performances; comparing the average yield of crops with three types of fertilizers. |
| R Functions | t.test() | aov() |
REGRESSION

• Regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables.
• Types of Regression:
• Linear Regression: Models the relationship between two variables as a straight line.
• Multiple Regression: Involves more than one independent variable.
LINEAR MODEL

• A linear model assumes a straight-line relationship between the independent variable(s) and the dependent variable.
• Equation: $Y = \beta_0 + \beta_1 X_1 + \epsilon$, where $Y$ is the dependent variable, $X_1$ is the independent variable, $\beta_0$ and $\beta_1$ are coefficients (intercept and slope), and $\epsilon$ is the error term.
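In R a linear model is fitted with lm(). The sketch below uses the built-in mtcars data, with fuel economy (mpg) as the dependent variable; the choice of variables is an assumption made for illustration.

```r
# Simple linear regression: mpg modelled as a straight-line function of weight
simple_fit <- lm(mpg ~ wt, data = mtcars)
coef(simple_fit)      # estimated intercept (beta0) and slope (beta1)

# Multiple regression: more than one independent variable
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(multi_fit)    # coefficients, standard errors, R-squared

# Predict mpg for a hypothetical car with wt = 3 (weight in 1000 lbs)
new_car <- data.frame(wt = 3)
predicted_mpg <- predict(simple_fit, newdata = new_car)
```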
SMOOTHENING

• Smoothening (smoothing) is a technique used to reduce noise or fluctuations in data, making the data easier to analyze.
• Methods:
• Moving Average: A simple method where each data point is replaced by the average
of itself and its neighbors.
• Exponential Smoothing: Gives more weight to recent data points.
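Both methods can be sketched in a few lines of base R. The series x, the 3-point window and the smoothing weight alpha = 0.5 are assumptions chosen for illustration.

```r
x <- c(2, 4, 6, 8, 10)   # made-up series

# Moving average over a 3-point window centred on each value; the first
# and last positions have no full window and come back as NA
moving_avg <- as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 2))

# Simple exponential smoothing:
#   s[1] = x[1],  s[i] = alpha * x[i] + (1 - alpha) * s[i - 1]
alpha <- 0.5
exp_smooth <- numeric(length(x))
exp_smooth[1] <- x[1]
for (i in 2:length(x)) {
  exp_smooth[i] <- alpha * x[i] + (1 - alpha) * exp_smooth[i - 1]
}
```

A larger alpha makes the smoothed series follow recent values more closely; a smaller alpha smooths more heavily.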
PROBABILITY DISTRIBUTION

• A probability distribution describes the likelihood of the different outcomes of a random experiment, that is, how the probabilities of a random variable are distributed. It can be classified as discrete or continuous, depending on whether the variable takes countable values or any value within a range.
• Random Variable: A variable whose values depend on the outcome of a random
event.
• Types of Probability Distributions:
• Discrete Probability Distributions: The outcome can take only a finite or countably
infinite number of values. Example: Binomial, Poisson.
• Continuous Probability Distributions: The outcome can take any value within a range.
Example: Normal, Exponential.
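R provides functions for each of the example distributions above, with a d prefix for the probability or density and a p prefix for the cumulative probability. The parameter values below are made up for illustration.

```r
# Discrete: Binomial with 4 trials and success probability 0.5
p_two       <- dbinom(2, size = 4, prob = 0.5)          # P(X = 2) = choose(4, 2) / 2^4
binom_total <- sum(dbinom(0:4, size = 4, prob = 0.5))   # all probabilities sum to 1

# Discrete: Poisson with mean 3, probability of at most 2 events
p_pois <- ppois(2, lambda = 3)

# Continuous: standard normal
d_zero   <- dnorm(0)                      # density at 0 (a height, not a probability)
p_below  <- pnorm(0)                      # P(X <= 0), one half by symmetry
p_within <- pnorm(1.96) - pnorm(-1.96)    # roughly 0.95

# Continuous: exponential with rate 2
p_exp <- pexp(1, rate = 2)                # P(X <= 1) = 1 - exp(-2)
```

For a discrete distribution the d function returns an actual probability; for a continuous one it returns a density, and probabilities come from differences of the p function over an interval.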
CONTINUOUS DATA

• Continuous data refers to numerical data that can take an infinite number of
values within a given range. These are typically measured and can take on any
value within an interval.
• Graphical Representation: Continuous data is often represented using
histograms, box plots, or line graphs.
DISCRETE DATA

• Discrete data consists of distinct or separate values, often counted and finite.
• Graphical Representation: Bar charts or pie charts are typically used for
discrete data.
DISCRETE VS CONTINUOUS DATA

| Aspect | Discrete Data | Continuous Data |
|---|---|---|
| Nature | Countable and distinct values. | Measurable and can take any value. |
| Representation | Often represented as whole numbers. | Can include fractions and decimals. |
| Examples | Number of pets, number of books, test scores (whole numbers). | Weight, height, distance, time. |
| Visualization | Bar charts or scatter plots. | Histograms, line charts, density plots. |
| Definition | Discrete data consists of distinct or separate values, often counted and finite. | Continuous data refers to numerical data that can take an infinite number of values within a given range. |
