TARKWA
LECTURE MATERIAL
DS 155 FOUNDATIONS OF DATA SCIENCE WITH R.
Compiled by:
Ernest Kwame Ampomah (PhD)
Course Objective
The objective of this course is to introduce students to the core set of knowledge, skills, and ways
of thinking required to solve real-world data-science problems and build applications in this space.
Students will also be able to demonstrate an understanding of data collection, sampling,
quality assessment and repair; statistical analysis and machine learning; state-of-the-art tools for
building data-science applications for different types of data, including text and CSV data; and key
concepts in data science, including tools, approaches, and application scenarios.
Course Outline
Fundamental concepts in Data Science and Analytics
Basic R Programming Concepts
Data Collection and Preprocessing with R
Data Visualization in R
Data Analysis Techniques
Reference
1.) Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform,
visualize, and model data. O'Reilly Media.
https://r4ds.had.co.nz
2.) James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical
learning: With applications in R (1st ed.). Springer.
https://www.springer.com/gp/book/9781461471370
3.) Bruce, P., & Bruce, A. (2017). Practical statistics for data scientists: 50 essential
concepts (2nd ed.). O'Reilly Media.
https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/
Course Presentation
This course is delivered through a combination of lectures and hands-on laboratory practice,
supplemented by comprehensive handouts. During lab sessions, students will be guided in solving
practical problems of varying complexity to reinforce theoretical concepts. In addition, students
will be assigned practical exercises to complete independently and submit as assignments. To gain
a thorough understanding and appreciation of the subject, students are encouraged to actively
participate in all lectures and lab sessions, consistently practice programming tasks, review
provided references and handouts, and complete all assignments on time.
Chapter 1:
Fundamental Concepts of Data Science
Data science is the art and science of acquiring knowledge through data.
It is an interdisciplinary field that uses scientific methods, algorithms, processes, and
systems to extract insights and knowledge from data.
Data science is an interdisciplinary field that integrates principles and techniques from various
domains to extract meaningful insights from data. The key components include:
Mathematics and Statistics: These disciplines provide the theoretical foundation for data
analysis, enabling data scientists to perform hypothesis testing, model evaluation, and identify
trends within datasets.
Computer Science and Programming: Skills in programming languages such as Python, R,
and Java are essential for processing, storing, and analyzing large datasets. These tools
facilitate the development of algorithms and models that can handle complex data structures.
Domain Knowledge: Understanding the specific area of application—be it medicine, finance,
social sciences, or another field—is crucial. This expertise allows data scientists to frame
relevant questions, interpret results accurately, and ensure that analyses are contextually
appropriate.
Applications of Data Science
Data science is applied across many industries, including the following:
Education:
Data science is utilized to analyze student performance, tailor educational content,
and improve learning outcomes. By examining data on student interactions and
achievements, educators can identify areas needing attention and adapt teaching
methods accordingly.
Airline Industry:
Airlines employ data science for route optimization, demand forecasting, and
enhancing customer experience. Analyzing historical flight data helps in predicting
delays and optimizing schedules, leading to improved operational efficiency.
Delivery Logistics:
Logistics companies leverage data science to optimize delivery routes, manage
inventory, and predict shipping delays. This ensures timely deliveries and cost savings
by efficiently managing resources.
Energy Sector:
In the energy industry, data science aids in predictive maintenance of equipment,
demand forecasting, and optimizing energy distribution. By analyzing consumption
patterns, companies can enhance efficiency and reduce operational costs.
Manufacturing:
Manufacturers use data science for quality control, supply chain optimization, and
predictive maintenance. Analyzing production data helps in identifying defects early
and streamlining operations.
Retail and E-commerce:
Retailers analyze customer data to personalize shopping experiences, manage
inventory, and optimize pricing strategies. This leads to increased customer
satisfaction and sales.
Transportation and Travel:
Data science is applied in optimizing routes, managing traffic flow, and improving
public transportation systems. Analyzing travel patterns helps in reducing congestion and
enhancing commuter experience.
Healthcare:
In the medical field, data science aids in detecting diseases, such as cancer, by
analyzing medical images and patient data to identify patterns indicative of tumors.
Supply Chain Management:
Businesses utilize data science to optimize supply chain networks, ensuring efficient
operations and reducing costs through predictive analytics and demand forecasting.
Sports Analytics:
Professional sports teams analyze in-game performance metrics of athletes to
enhance strategies and training programs, leading to improved performance and
competitive advantage.
Finance:
Financial institutions develop credit reports and assess risk by analyzing vast
amounts of financial data, enabling better decision-making in lending and investments.
Data professionals play crucial roles in managing, analyzing, and safeguarding data within
organizations. Their responsibilities vary based on specific roles, each contributing uniquely to the
organization's data strategy. The following are key data professional roles and their primary
responsibilities:
Providing the necessary infrastructure and tools for data scientists and analysts to perform
their tasks effectively.
4.) Data Steward
Data stewards are responsible for ensuring the quality and fitness for purpose of the organization's
data assets. Their key responsibilities include:
Ensuring each data element has a clear and unambiguous definition.
Ensuring data is accurate, consistent, and used appropriately across the organization.
Documenting the origin and sources of authority for each data element.
The Data Science Lifecycle
1.) Business Understanding: This initial phase involves clearly defining the problem to be
solved. Data scientists work with stakeholders to understand the business context, set clear
objectives, and ensure that the data science project is aligned with broader business goals.
This stage is crucial for framing the problem in a way that the data science approach can
address it effectively.
2.) Data Understanding and Collection: This phase involves identifying and gathering the
necessary data to solve the business problem, ensuring that the data is relevant, accurate,
and in a suitable format. This stage also includes checking the quality of the data for
completeness and correctness, as well as understanding its structure to ensure it aligns with
the problem at hand. Essentially, this phase ensures that the right data is collected and
prepared for further analysis.
3.) Data Preparation: Data often comes in raw, unstructured forms, so it’s important to
clean and preprocess it to ensure it’s ready for analysis. This step includes handling missing
values, removing duplicates, encoding categorical variables, scaling numerical data, and
transforming features. The goal is to prepare high-quality, consistent data that will lead to
accurate, reliable models.
4.) Exploratory Data Analysis (EDA): EDA is an essential step where data scientists
explore the dataset to identify patterns, trends, and relationships. This involves
summarizing the data with descriptive statistics, visualizing distributions, and examining
correlations. EDA helps in gaining a deeper understanding of the data, identifying potential
outliers, and discovering insights that could influence the choice of modeling techniques.
5.) Model Building: In this stage, data scientists select appropriate machine learning
algorithms based on the nature of the problem (e.g., classification, regression) and the
characteristics of the data. They split the data into training and test sets, train the models
on the training data, and fine-tune parameters to improve performance. The goal is to build
a model that captures the patterns in the data effectively.
6.) Model Evaluation: Once the model is built, it is crucial to assess its performance.
Evaluation involves using various metrics such as accuracy, precision, recall, F1-score, and
confusion matrices to determine how well the model performs. Data scientists may also
use techniques like cross-validation to ensure that the model generalizes well to unseen
data and doesn’t overfit.
7.) Model Deployment: After the model has been trained and evaluated, the next step is to
deploy it into a production environment. This involves integrating the model into
operational systems where it can make real-time predictions or decisions. Deployment may
include creating APIs, setting up data pipelines, and ensuring that the model can scale and
function smoothly in the business environment.
Figure 1.1. The lifecycle of data science
1.5.2 Types of Data
Data can be classified based on its structure and nature.
Structured Data
Structured data refers to data that is highly organized and conforms to a predefined format, such
as rows and columns in a table. It adheres to a fixed schema, making it easily stored, queried, and
analyzed.
Examples:
Spreadsheets (e.g., Excel files with rows and columns).
Relational databases (e.g., SQL databases like MySQL, PostgreSQL, Oracle).
Transactional data such as banking records or point-of-sale data
Semi-Structured Data
Semi-structured data is partially organized, combining elements of structured data with
flexibility. It does not conform to a strict schema but includes tags, markers, or keys to
provide structure and context.
Examples:
XML (Extensible Markup Language) and JSON (JavaScript Object Notation) files.
NoSQL databases (e.g., MongoDB, Cassandra).
Emails, where metadata (e.g., sender, recipient, timestamp) is structured, but the body
text is unstructured.
API responses.
Unstructured Data
Unstructured data lacks a predefined format or organizational structure, making it more
challenging to process and analyze. Despite its complexity, it represents the majority of
data generated in today's digital world.
Examples:
Text files (e.g., documents, PDFs).
Multimedia content (e.g., images, videos, audio recordings).
Social media posts (e.g., tweets, Facebook updates).
Quantitative Data
Quantitative data refers to numerical data that can be measured, counted, or expressed in terms of
quantities. It provides objective, measurable information that allows for statistical analysis and
comparison.
Examples:
Age of individuals (e.g., 25 years, 40 years).
Monthly income of employees (e.g., $5,000, $10,000).
Temperature readings (e.g., 22°C, 30°F).
Sales numbers (e.g., 500 units sold in a month).
Features of Quantitative Data:
Objectivity: Quantitative data is unbiased and represents measurable facts.
Answers Specific Questions: It addresses questions such as "how much," "how many," or
"how often."
Statistical Analysis: Allows for techniques such as averages, percentages, trends, and
hypothesis testing.
Types of Quantitative Data
1) Discrete Data:
o Counts of distinct items that can only take whole-number values.
o Examples: Number of employees in a department, number of cars in a parking lot,
number of customers visiting a store.
2) Continuous Data:
o Measurements that can take any value within a range, often involving decimals or
fractions
o Examples: Weight of an individual (e.g., 70.5 kg), height (e.g., 5.8 feet), time taken
to complete a task (e.g., 12.3 seconds).
Continuous Data is further divided into two subcategories: interval data and ratio data. These
classifications are based on the nature of the scale used to measure the data and the presence or
absence of a true zero point.
Interval Data
Interval data refers to data measured on a scale where the intervals between values are consistent
and equal. However, it lacks a true zero point, meaning that zero does not represent the complete
absence of the measured attribute.
Examples:
Temperature measured in Celsius or Fahrenheit.
Time of day on a 12-hour clock.
IQ scores.
NB: The absence of a true zero limits the types of comparisons and calculations that can be made.
For example, ratios (e.g., "twice as much") cannot be accurately determined.
Ratio Data
Ratio data also has equal intervals between values but differs from interval data by having a true
zero point, which indicates the complete absence of the measured attribute.
Features of Ratio Data
Equal Intervals: The scale maintains consistent intervals between values, just like interval
data.
True Zero Point: Zero represents the absence of the property being measured. For instance,
0 kg signifies no weight, and 0 cm signifies no height.
o Mathematical Operations: All arithmetic operations—addition, subtraction, multiplication,
and division—are meaningful. For example, a weight of 40 kg is objectively twice as heavy
as 20 kg.
Examples
Weight (e.g., kilograms, pounds).
Height (e.g., centimeters, meters).
Distance (e.g., kilometers, miles).
Age (e.g., years, months).
Ratio data is the most informative type of quantitative data because it supports the widest range of
mathematical and statistical analyses.
Qualitative Data
Qualitative data, also known as categorical data, is data represented by a name or
symbol. This could be the names of each department in your organisation, office locations, and
many other names that are all categorical data. It is descriptive, non-numerical information that
captures the characteristics, attributes, traits, or properties of an object, person, or phenomenon.
Unlike quantitative data, it focuses on "what" something is like rather than measuring it in
numerical terms.
Other examples include textual data like blog posts, photos, videos, social media comments, and
cultural observations.
Categories of Qualitative Data
1.) Nominal Data (Categorical)
o Data that represents categories or groups with no inherent order or ranking.
o Examples:
Gender (e.g., male, female, non-binary).
Colors (e.g., red, blue, green).
Types of cuisine (e.g., Italian, Mexican, Indian).
o Features of Nominal Data
Categories are mutually exclusive.
No quantitative comparison or order between categories.
1.5.3 Data Collection
Data collection in data science refers to the systematic gathering of information from a range of
sources to be used for analysis, modeling, and decision-making. It plays a pivotal role in the data
science process as it provides the raw material for deriving insights and predictions.
Accurate data collection is critical because poor-quality data leads to biased models and incorrect
conclusions. High-quality data allows data scientists to develop models that accurately reflect real-
world scenarios, ensuring that decisions based on these models are both effective and reliable. The
relationship between data quality and model performance is direct—better data leads to better
models and, consequently, better outcomes.
Secondary Data
Secondary data refers to information that has already been collected by others and is made
available for analysis. It is often used to complement primary data or provide a broader context.
Examples: Public datasets from government databases, research publications, and data shared by
organizations like the World Bank.
Methods of Data Collection
Various methods are employed to collect data, each suitable for different scenarios:
Surveys and Questionnaires: Gathering information through structured questions.
Interviews and Focus Groups: Collecting in-depth insights through direct interaction.
Observations: Recording behaviors or events as they occur naturally.
Experiments: Conducting controlled tests to study specific variables.
Transactional Tracking: Monitoring and recording transactions or interactions.
Chapter 2:
Basic R Programming Concepts
The RStudio interface is organized into several panes:
1.) Console
The console is the main area where you type and run R commands directly.
It acts like a command-line interface within R.
You can enter R code here, press Enter, and immediately see the output or results. This is
great for quick calculations or testing small pieces of code.
Example: Typing 2 + 3 and pressing Enter will instantly show the result 5 in the console.
NB: If an error occurs, the console will display an error message to help you troubleshoot
the issue.
3.) Environment/History Pane
History Tab:
o Keeps a record of all the commands you’ve previously run in the console.
o You can scroll through past commands, copy them, and reuse them without having
to retype everything.
o NB: This is especially helpful when working on complex analyses where you need
to reference earlier steps.
4.) Plots/Files/Packages/Help Pane
Plots Tab:
o Displays any graphs or visualizations you create using R’s plotting functions (like
plot() or ggplot2).
o You can zoom in, export plots as images or PDFs, and navigate between different
plots you’ve created.
Files Tab:
o Allows you to browse files and folders on your computer, similar to a file explorer.
o Makes it easy to open data files, save scripts, and manage your project directory.
Packages Tab:
o Shows the list of R packages installed on your system. You can load or unload
packages, install new ones, and check for updates.
o Packages are collections of functions and tools that extend R’s capabilities. For
example, dplyr is a popular package for data manipulation.
Help Tab:
o Provides documentation for R functions, packages, and commands.
o You can search for help topics using ?function_name in the console (e.g., ?mean),
and detailed explanations will appear in this tab.
Change the Font Size in RStudio Settings
Go to Tools in the top menu bar.
Select Global Options.
In the Global Options dialog, select Appearance.
Under Editor font size, use the slider or type a value to adjust the font size of
your code.
After adjusting, click Apply and then OK to save the changes.
2.3 Variables in R
A variable is a name that stores data or a value, which can be used and modified in your R code.
1.) Creating a Variable
Use the assignment operator <- (preferred) or = to assign a value.
Syntax:
variable_name <- value
2.) Rules for Naming Variables
Can include letters, numbers, and underscores.
Cannot start with a number.
Case-sensitive: Name and name are different
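Example (the names and values below are only for illustration):
# Creating and updating variables
age <- 25          # assign with <-
name = "Ama"       # '=' also works, but '<-' is preferred
age <- age + 1     # a variable can be reassigned
print(age)         # prints 26
print(name)        # prints "Ama"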
Data types in R
1.) Numeric
The numeric data type in R is used to represent numbers, including both integers (whole
numbers) and decimals (floating-point numbers).
Types of Numeric Values:
o Integers: Whole numbers without any decimal point (e.g., 5, 10, -3).
o Decimals (Floating Point Numbers): Numbers that include a decimal point (e.g.,
5.7, 10.3, -4.6).
R automatically recognizes numbers with or without decimal points as numeric. You don't
need to specify the type of number explicitly, and R will treat both integers and floating-
point numbers as numeric.
Example:
x <- 5.7 # a floating-point number
y <- 10 # an integer
2.) Character
o The character data type in R is used for text or strings—sequences of characters.
A string can contain letters, numbers, symbols, and spaces, enclosed within
quotation marks (either single ' or double ").
o In R, any text or string that appears between quotation marks is automatically
treated as a character.
Example
name <- "John" # character string
greeting <- "Hello" # another character string
NB: Be careful not to forget the quotation marks, as text without quotes will cause an error.
For example, name <- John without quotes would result in an error.
3) Logical
The logical data type in R represents boolean values, which can only be either
TRUE or FALSE.
These values are often used in conditions, loops, and logical operations.
Logical values are used to test conditions or relationships between values.
They are essential for controlling flow in your code, such as in if statements or
while loops.
Example
is_active <- TRUE # logical value
has_data <- FALSE # logical value
o Operations: Logical values can also be the result of comparison operators (e.g., ==,!=,
>, <), and they can be combined using logical operators like & (AND), | (OR), and ! (NOT).
o Example
5>3 # returns TRUE
10 == 5 # returns FALSE
4.) Complex
The complex data type in R is used to represent complex numbers.
A complex number consists of two parts: a real part and an imaginary part.
The imaginary part is denoted by i.
In R, complex numbers can be written in the form of a + bi, where a is the real part,
and b is the imaginary part.
Example
z <- 3 + 2i # a complex number with real part 3 and imaginary part 2
2.4 Basic Commands in R
Using =
y = 20
language = "R Programming"
Note:
The <- operator is preferred in R programming.
Variable names are case-sensitive (Var and var are different).
You can also assign vectors, lists, data frames, etc.
# Example
y <- 20
y # This will print 20
2.5 Basic Operations
2) Relational Operations
Relational operations compare two values and return TRUE or FALSE.
Greater than (>): Checks if the left value is greater than the right value.
Example: 6 > 3 returns TRUE.
Less than (<): Checks if the left value is less than the right value.
Example: 2 < 5 returns TRUE.
Greater than or equal to (>=): Checks if the left value is greater than or equal to
the right value.
Example: 5 >= 5 returns TRUE, while 4 >= 5 returns FALSE.
Less than or equal to (<=): Checks if the left value is less than or equal to the
right value.
Example: 3 <= 3 returns TRUE, while 6 <= 3 returns FALSE.
Example
x <- 5
y <- 3
is_equal <- x == y # FALSE
is_greater <- x > y # TRUE
print(is_equal)
print(is_greater)
3) Logical Operations
Logical operations are used to combine or manipulate logical values (TRUE and
FALSE).
These are especially useful in decision-making structures like if statements or loops.
XOR (Exclusive OR) (xor()): Returns TRUE if one condition is TRUE but not
both.
Example: xor(TRUE, FALSE) returns TRUE, but xor(TRUE, TRUE)
returns FALSE.
Example
a <- TRUE
b <- FALSE
and_result <- a & b # FALSE (because both must be TRUE)
or_result <- a | b # TRUE (because at least one is TRUE)
not_result <- !a # FALSE (because !TRUE is FALSE)
Example
# Combining relational and logical operations
x <- 5
y <- 10
z <- 15
result <- (x < y) & (y < z) # TRUE (x < y and y < z are both true)
4.) Creating a Vector in R
a) Using c() function
# Numeric vector
num_vector <- c(1, 2, 3, 4, 5)
# Character vector
char_vector <- c("apple", "banana", "cherry")
# Logical vector
log_vector <- c(TRUE, FALSE, TRUE)
5.) Accessing Vector Elements
num_vector[2] # Access 2nd element
num_vector[1:3] # Access elements from 1st to 3rd
num_vector[-1] # Exclude 1st element
# Logical comparison (assuming a vector a <- c(1, 2, 3))
a > 2 # Returns: FALSE FALSE TRUE
7.) Basic Functions
length(a) # Returns number of elements in a.
sum(a) # Returns sum of elements in a.
mean(a) # Calculates the average of a.
max(a) # Finds the largest value in a.
min(a) # Finds the smallest value in a.
8.) Lists
A list can hold elements of different data types (numbers, strings, vectors, etc.).
Created using the list() function.
Example:
# List with different data types
my_list <- list(10, "Hello", TRUE, c(1, 2, 3))
Accessing Elements
my_list[[1]] # Access 1st element → 10
my_list[[4]] # Access the vector → 1 2 3
9.) Factors
o Factors are used to handle categorical data (like labels or groups).
o They store data as levels, which makes them memory efficient.
o Levels are the unique categories or distinct values within a factor.
o When you create a factor from a vector of categorical data, R automatically
identifies and stores the unique values as levels.
o Created using the factor() function.
Example:
# Creating a Factor
colors <- factor(c("Red", "Blue", "Red", "Green", "Blue"))
# Checking Levels
levels(colors) # Output: "Blue" "Green" "Red"
10.) Matrices
o A matrix is a 2-dimensional data structure with rows and columns.
o It can only contain one data type (numeric, character, etc.).
o Created using the matrix() function.
# Creating a Matrix
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)
# Output:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
2.6 Introduction to Data Frames in R
A Data Frame is a table-like structure used to store data in rows and columns.
It can contain different data types in different columns (numeric, character, logical, etc.).
Think of it like an Excel spreadsheet or a SQL table in R.
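The handout's original example is not reproduced here; the sketch below assumes a small students table with made-up column names and values:
# Creating a data frame (columns and values are assumptions)
students <- data.frame(
  name  = c("Ama", "Kofi", "Esi"),
  age   = c(21, 23, 22),
  grade = c("A", "B", "A")
)
students            # prints the whole table
students$age        # access the age column with $
nrow(students)      # number of rows: 3
ncol(students)      # number of columns: 3
str(students)       # structure of the data frame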
NB:
Columns can have different data types, but each column must have the same type
of data.
Use $ to access columns easily.
nrow(students) and ncol(students) give the number of rows and columns.
str(students) shows the structure of the data frame.
if Statement
Executes code only if a condition is TRUE.
Syntax
if (condition) {
# Code to run if condition is TRUE
}
Example
x <- 10
if (x > 5) {
print("x is greater than 5")
}
if...else Statement
Adds an alternative if the condition is FALSE.
if (condition) {
# Code if TRUE
} else {
# Code if FALSE
}
Example
x <- 3
if (x > 5) {
print("x is greater than 5")
} else {
print("x is 5 or less")
}
if...else if...else Statement
Checks multiple conditions one by one.
Executes the code for the first TRUE condition.
Syntax:
if (condition1) {
# Code if condition1 is TRUE
} else if (condition2) {
# Code if condition2 is TRUE
} else {
# Code if none of the conditions are TRUE
}
Example
score <- 85
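The rest of this example is not shown in the handout; a possible completion, with the grade boundaries assumed purely for illustration:
if (score >= 80) {
  print("Grade A")       # runs, because score is 85
} else if (score >= 60) {
  print("Grade B")
} else {
  print("Grade C")
}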
ifelse() Function
A vectorized version of if...else.
Applies a condition to every element of a vector at once.
Good for quick checks on multiple values.
Syntax:
ifelse(condition, value_if_true, value_if_false)
Example
# Vector example
marks <- c(80, 45, 70, 30)
result <- ifelse(marks >= 50, "Pass", "Fail")
print(result)
2.8 Loops in R: for, while, repeat
Loops are used to repeat a set of instructions multiple times in R.
for Loop
Used to iterate over a sequence (like a vector or range).
Syntax
for (variable in sequence) {
# Code to repeat
}
Example
# Print numbers from 1 to 5
for (i in 1:5) {
print(i)
}
while Loop
Repeats code as long as a condition is TRUE.
Syntax:
while (condition) {
# Code to repeat
}
Example:
# Print numbers from 1 to 5
x <- 1
while (x <= 5) {
print(x)
x <- x + 1
}
repeat Loop
Repeats code indefinitely until explicitly stopped with break.
Syntax
repeat {
# Code to repeat
if (condition) {
break # Stops the loop
}
}
Example
# Print numbers from 1 to 5
y <- 1
repeat {
print(y)
y <- y + 1
if (y > 5) {
break # Stop the loop when y > 5
}
}
NB:
for loop → Best for iterating over a known sequence.
while loop → Runs as long as the condition stays TRUE.
repeat loop → Runs indefinitely unless stopped with break.
2.9 Functions in R: Creating and Calling Functions
Functions allow you to reuse code and organize it into manageable chunks. In R, you can create
and call functions to make your code more modular and efficient.
Syntax:
function_name <- function(parameter1, parameter2, ...) {
# Code to execute
# Return value (optional)
}
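The worked example that follows uses a function called add_numbers; its definition is not reproduced in the handout, so here is a reconstruction consistent with the description below:
# Reconstructed example function
add_numbers <- function(a, b) {
  return(a + b)   # returns the sum of the two parameters
}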
NB:
Function Name: add_numbers
Parameters: a and b
Return: The sum of a and b.
Example
# Calling the function with arguments
sum_result <- add_numbers(5, 10)
print(sum_result)
# Output: 15
NB: Here, 5 is passed as a and 10 as b. The function calculates the sum and returns it
3.) Function with Default Arguments
You can also set default values for parameters, so they are used if no value is provided
when calling the function.
Example:
# Function with default argument
greet <- function(name = "Guest") {
message <- paste("Hello,", name)
return(message)
}
# Calling without argument
print(greet()) # Output: Hello, Guest
# Calling with argument
print(greet("Alice")) # Output: Hello, Alice
1) Global Variables
Global variables are defined outside of any function.
They can be accessed anywhere in the script (inside or outside functions).
# Global Variable
x <- 10
# Function accessing global variable
printValue <- function() {
print(x) # Accessing global variable
}
printValue() # Output: 10
2) Local Variables
Local variables are defined inside a function.
They can be accessed only within that function.
Example
# Function with a Local Variable
calculateSum <- function() {
y <- 5 # Local variable
z <- 7
return(y + z)
}
calculateSum() # Output: 12
# print(y) # Error: 'y' is not found (because 'y' is local)
Step 4: Install tidyverse package
1. Open RStudio:
2. Run the following command in the R console:
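install.packages("tidyverse")   # installs ggplot2, dplyr, tidyr and the other core packages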
This will install the core packages of the tidyverse including ggplot2, dplyr, tidyr, and others.
NB:
To install and load the required packages in R, follow these steps:
Step 1: Install the Packages
If you haven't installed the packages yet, you can install them with the
install.packages() function, or from the Packages tab in RStudio.
R Data File
An R Data File (typically with the .RData or .rds extension) is a file format used to save R objects
such as data frames, vectors, lists, or even entire R workspaces. These files allow you to save your
R environment or specific objects and load them later without needing to recalculate or recreate
the data.
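1) .RData
The .RData format can hold several objects (or the whole workspace) in one file. A minimal sketch, assuming an object called students and placeholder file names:
save(students, file = "students.RData")   # save selected objects
save.image("workspace.RData")             # save the entire workspace
load("students.RData")                    # load the saved objects back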
This will load all objects saved in that .RData file back into your environment.
2) .rds
The .rds file is used to save a single R object. It is often used for saving larger datasets or
models because it's more space-efficient.
Unlike .RData, when you load an .rds file, you must assign it to a variable.
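A minimal sketch (object and file names are placeholders):
saveRDS(students, file = "students.rds")   # save a single object
students2 <- readRDS("students.rds")       # read it back and assign it to a variable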
Chapter 3:
Data Collection and Preprocessing with R
3.1 Data Collection in R
Data collection refers to gathering and importing data from various sources into R for analysis.
There are multiple ways to collect data in R.
a) Using vectors
A vector is a basic data structure in R that holds elements of the same type (numeric,
character, logical, etc.).
Creation: Use c( )
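The original example is not reproduced here; a small illustrative sketch:
ages   <- c(23, 25, 31, 28)               # numeric vector
fruits <- c("apple", "banana", "cherry")  # character vector
ages                                       # prints: 23 25 31 28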
b) Reading Data from External Files
read.csv()
Imports CSV (Comma-Separated Values) files into R as a data frame.
The file extension is .csv
read_excel()
The read_excel() function from the readxl package is used to read Excel files.
The file extension is .xlsx
Parameters:
"your_file.xlsx" – Path to the Excel file
sheet = 1 – Specifies the sheet (default is the first sheet)
read.table( )
In R, read.table() is a general function to read delimited text files into a data frame.
Parameters:
"file.txt" – Path to the file
header = TRUE – First row as column names
sep = "\t" – Tab-separated values (use "," for CSV)
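Hedged examples of the three import functions (the file paths are placeholders):
df_csv   <- read.csv("your_file.csv", header = TRUE)           # CSV file
library(readxl)
df_excel <- read_excel("your_file.xlsx", sheet = 1)            # Excel file
df_txt   <- read.table("file.txt", header = TRUE, sep = "\t")  # tab-delimited text file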
c) Accessing Data
View()
The View() function opens a spreadsheet-like GUI view of a data frame in RStudio.
glimpse()
The glimpse() function (from the dplyr package) provides a compact, transposed view of
a data frame.
Displays column names, types, and first few values
dim()
Returns the dimensions (number of rows and columns) of an object, like a data frame or
matrix.
summary()
The summary() function provides descriptive statistics for each column in a data frame.
Numeric columns: Min, Max, Mean, Median, 1st & 3rd Quartiles
str()
The str() function displays the structure of an object, including data types and sample
values.
Data type of each column
First few values of each column
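Assuming a data frame named df has already been imported, typical calls look like:
View(df)         # spreadsheet-like viewer in RStudio
library(dplyr)
glimpse(df)      # compact, transposed overview
dim(df)          # number of rows and columns
summary(df)      # descriptive statistics per column
str(df)          # structure: column types and first values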
Selecting a Column
You can select a column from a data frame in multiple ways:
i. Using $
ii. Using [[ ]]
iii. Using [ , ]
Selecting Rows
Filtering Data
Remove a Column
e) Exporting Data
Save to CSV
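Illustrative sketches, assuming a data frame df with a column named age:
df$age                  # i.   select a column with $
df[["age"]]             # ii.  select a column with [[ ]]
df[ , "age"]            # iii. select a column with [ , ]
df[1:3, ]               # select the first three rows
df[df$age > 25, ]       # filter rows by a condition
df$age <- NULL          # remove a column
write.csv(df, "output.csv", row.names = FALSE)   # save to CSV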
3.2. Data Preprocessing with R
Data preprocessing is the process of cleaning, transforming, and organizing raw data to make it
suitable for analysis. It is a crucial step in machine learning and data science to improve the quality
and performance of models.
1) Detecting Missing Values
is.na() Function
is.na() is a function in R that checks whether a value is missing (NA - Not Available).
It returns TRUE for missing values (NA) and FALSE for non-missing values.
na.omit()
na.omit() removes all rows containing NA values from a dataset or vector
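A small illustration with a made-up vector:
x <- c(4, NA, 7, NA, 9)
is.na(x)        # TRUE where values are missing
na.omit(x)      # drops the NA entries: 4 7 9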
3. Imputing Missing Values
Replace missing values with the mode
o To replace missing (NA) values with the mode in R, you can use names(),
sort(), and table().
o Suitable for categorical data where mean/median cannot be used.
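Hedged sketches of the imputation approaches (the values are made up):
x <- c(4, NA, 7, NA, 9)
x[is.na(x)] <- mean(x, na.rm = TRUE)     # replace NAs with the mean
y <- c(4, NA, 7, NA, 9)
y[is.na(y)] <- median(y, na.rm = TRUE)   # replace NAs with the median

# Mode imputation for categorical data, using table(), sort(), and names()
colour <- c("Red", "Blue", NA, "Red", "Green")
mode_val <- names(sort(table(colour), decreasing = TRUE))[1]
colour[is.na(colour)] <- mode_val        # replace NAs with the most frequent value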
Handling Duplicates in R
Duplicates in a dataset can lead to redundancy and incorrect analysis. Below are methods to
identify and remove duplicates in R with practical examples.
duplicated()
duplicated() checks for duplicate values in a vector or rows in a data frame and returns TRUE
for duplicates (except the first occurrence).
For Data Frames:
sum(duplicated())
sum(duplicated()) counts the number of duplicate values in a vector or dataset.
df[duplicated(df), ]
df[duplicated(df), ] extracts only the duplicate rows from a data frame, excluding the first
occurrence.
With a simple vector instead of a data frame, you can still use duplicated() to find and extract
duplicate values.
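A small, hedged illustration with a vector and a data frame:
v <- c(1, 2, 2, 3, 3, 3)
duplicated(v)          # FALSE FALSE TRUE FALSE TRUE TRUE
sum(duplicated(v))     # 3 duplicate values

df <- data.frame(id = c(1, 2, 2, 3),
                 name = c("Ama", "Kofi", "Kofi", "Esi"))
duplicated(df)         # flags duplicate rows (except the first occurrence)
df[duplicated(df), ]   # extracts only the duplicate rows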
Identify unique rows:
Removing Duplicates
Remove all duplicate rows:
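Continuing with the df defined above:
unique(df)               # keep only the unique rows
df[!duplicated(df), ]    # equivalent: drop the duplicate rows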
Handling Outliers in R
Outliers are extreme values that differ significantly from other observations in a dataset. They
can impact statistical analysis and machine learning models.
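The handout's own example is not shown here; one common approach is the IQR rule, sketched below with made-up values:
x <- c(10, 12, 11, 13, 95)            # 95 looks extreme
q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
x[x < lower | x > upper]              # returns 95, the outlier
boxplot(x)                            # outliers show up beyond the whiskers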
Data Transformation: Feature Scaling
Feature scaling is a data preprocessing technique used to normalize or standardize the range of
numerical features in a dataset. It ensures that all features contribute equally to a machine learning
model by bringing them to a common scale, preventing features with larger magnitudes from
dominating the learning process. Feature scaling is an important step in data preprocessing,
especially for machine learning models that rely on distance-based calculations (e.g., KNN, SVM,
linear regression).
The two common methods are Normalization (Min-Max Scaling) and Standardization (Z-score
Scaling).
lapply() applies normalization to each column in base R.
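Min-Max scaling maps each value to (x - min) / (max - min); a sketch applying it column-wise with lapply() (the column names and values are made up):
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
df_num  <- data.frame(a = c(1, 5, 10), b = c(100, 200, 300))
df_norm <- as.data.frame(lapply(df_num, normalize))
df_norm   # every value now lies between 0 and 1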
Using scale() Function
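Standardization rescales each value to z = (x - mean) / sd; using the df_num from the previous sketch:
df_std <- as.data.frame(scale(df_num))   # each column: mean 0, standard deviation 1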
Label Encoding
Assigns a unique number to each category (e.g., "Red" → 1, "Blue" → 2).
Suitable for ordinal categorical data (e.g., "Low", "Medium", "High").
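A hedged sketch using a factor (the category names are placeholders):
sizes <- factor(c("Low", "High", "Medium", "Low"),
                levels = c("Low", "Medium", "High"))
as.numeric(sizes)   # Low -> 1, Medium -> 2, High -> 3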
One-Hot Encoding
Converts each category into a separate binary column (0 or 1).
Suitable for nominal categorical data (e.g., "Country", "Color").
using model.matrix()
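A sketch with a made-up color column:
colors_df <- data.frame(color = c("Red", "Blue", "Green"))
one_hot <- model.matrix(~ color - 1, data = colors_df)
one_hot   # one 0/1 column per category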
Feature Selection & Extraction
Feature selection and extraction help make machine learning models better by removing
unnecessary data and keeping only the most useful information.
2.) Wrapper Methods: Use a model to decide which features are best
Forward Selection: Start with no features, add them one by one, and check performance.
Backward Elimination: Start with all features, remove the least useful ones one by one.
Recursive Feature Elimination (RFE): Train a model multiple times and remove the least
important features step by step.
Log Transformation: Convert data into a better scale to handle uneven distributions.
Binning: Group numerical values into categories (e.g., age groups like 0-18, 19-35, 36+).
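Quick sketches of both transformations (the values are made up):
income <- c(1000, 5000, 250000)
log_income <- log(income)                     # compresses a highly skewed scale

ages <- c(12, 25, 40)
age_group <- cut(ages, breaks = c(0, 18, 35, Inf),
                 labels = c("0-18", "19-35", "36+"))
age_group                                     # 0-18  19-35  36+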
Data Splitting
In R, data splitting is commonly used to divide datasets into training and testing (or validation)
sets for machine learning and statistical modeling.
Using rsample::initial_split()
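A sketch with the rsample package; the 80/20 proportion and the built-in iris dataset are only examples:
library(rsample)
set.seed(123)                                 # makes the split reproducible
data_split <- initial_split(iris, prop = 0.8)
train_data <- training(data_split)            # 80% of the rows
test_data  <- testing(data_split)             # remaining 20%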
Chapter 4:
Data Visualization in R
plot() – Scatter Plots
Used for visualizing the relationship between two numeric variables.
Explanation:
x, y are data points.
main specifies the title.
xlab and ylab label the axes.
col sets the color of points.
pch=16 changes the point style.
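The plotted code itself is not reproduced in the handout; a sketch consistent with the explanation above:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
plot(x, y, main = "Scatter Plot", xlab = "X values", ylab = "Y values",
     col = "blue", pch = 16)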
Line Plot
Explanation:
type="l" specifies a line plot.
col="blue" sets the line color.
lwd=2 makes the line thicker.
points(x, y, col="red", pch=16) adds red dots at data points.
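A matching sketch, reusing the x and y vectors from the scatter-plot example:
plot(x, y, type = "l", col = "blue", lwd = 2,
     main = "Line Plot", xlab = "X values", ylab = "Y values")
points(x, y, col = "red", pch = 16)   # red dots at the data points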
barplot() – Bar Charts
Used for comparing values across categories.
Explanation:
heights represents bar heights.
names.arg assigns category names to bars.
col assigns different colors to bars.
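A sketch consistent with the explanation (category names and heights are made up):
heights <- c(10, 25, 15)
barplot(heights, names.arg = c("A", "B", "C"),
        col = c("red", "green", "blue"), main = "Bar Plot")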
hist() – Histograms
Used for visualizing the distribution of numerical data.
Explanation:
breaks=5 defines the number of bins.
col="lightblue" sets the bar color.
boxplot() – Box Plots
Used for summarizing the spread of numerical data.
Explanation:
Displays median, quartiles, and outliers.
col="purple" sets the box color.
pie() – Pie Charts
Used for showing parts of a whole.
Explanation:
slices represents portions.
col sets segment colors.
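A sketch with made-up slice sizes:
slices <- c(30, 20, 50)
pie(slices, labels = c("A", "B", "C"),
    col = c("red", "yellow", "green"), main = "Pie Chart")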
heatmap(): A heatmap is used to visualize matrix-like data, where values are represented using colors.
Explanation:
matrix(rnorm(100), nrow=10) creates a random numeric matrix with 10 rows and 10
columns.
heatmap() generates a heatmap.
heat.colors(10) applies a color gradient.
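A sketch matching the explanation:
m <- matrix(rnorm(100), nrow = 10)   # 10 x 10 random numeric matrix
heatmap(m, col = heat.colors(10))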
Chapter 5:
Data Analysis Techniques
Data analysis techniques are systematic methods used to inspect, clean, transform, and model data
to extract useful insights, support decision-making, and predict future trends. These techniques
help organizations understand patterns, relationships, and anomalies in data.
2.) Diagnostic Analysis
Diagnostic analysis is a data analysis method used to investigate and determine the causes behind
past events or trends. It goes beyond descriptive analysis by answering the question: "Why did this
happen?"
Characteristics
Cause-and-Effect Investigation: Identifies factors that contributed to a specific outcome.
Deeper Data Exploration: Uses advanced techniques to uncover hidden relationships in
data.
Data-Driven Decision Making: Helps organizations understand issues and take corrective
actions.
3.) Predictive Analysis
Predictive analysis uses historical data to forecast what is likely to happen in the future.
Characteristics:
Forecasting Future Trends: Uses past data patterns to predict future events.
Probabilistic Outcomes: Provides likelihood estimates rather than exact predictions.
Data-Driven Decision Making: Helps organizations anticipate risks and opportunities.
4.) Prescriptive Analysis
Prescriptive analysis is an advanced form of data analysis that not only predicts future outcomes
but also provides recommendations on the best course of action to achieve desired results. It helps
organizations make data-driven decisions by answering the question: "What should be done
next?" It combines descriptive analysis (what happened) and predictive analysis (what might
happen) to suggest the best possible course of action to achieve desired outcomes.
Characteristics:
Action-Oriented: Focuses on suggesting specific actions to optimize outcomes.
Decision Support: Helps businesses and individuals make informed choices based on data
insights.
Advanced Analytics: Uses machine learning, artificial intelligence, and mathematical
optimization techniques.
Techniques Used:
Optimization Algorithms: Determines the best possible solution for a given scenario.
Decision Trees: Models various decision paths and their possible outcomes.
Artificial Intelligence (AI): Automates complex decision-making processes.
Exploratory Data Analysis (EDA)
EDA involves exploring and summarizing a dataset, often visually, to understand its main characteristics before formal modeling.
Objectives:
Understand Data Structure: Identify key variables, data types, and distributions.
Detect Patterns and Trends: Discover relationships between variables.
Identify Anomalies: Find missing values, outliers, or inconsistencies.
Guide Further Analysis: Helps decide which statistical models or machine learning
techniques to use.
Example: Analyzing customer demographics to find buying patterns.
A data scientist analyzing customer purchase behavior might use EDA to visualize
spending patterns, identify the most common products bought together, and detect
unusual transactions before building a predictive model.
Inferential Analysis
Inferential analysis uses a sample of data to draw conclusions about a larger population.
Objectives:
Generalization: Extends findings from a sample to a broader population.
Hypothesis Testing: Determines whether observed patterns are statistically significant.
Prediction and Estimation: Estimates unknown population parameters based on sample
data.
Text Analysis
Text analysis extracts useful information and patterns from unstructured text data.
Objectives:
Information Extraction: Identifies key entities, phrases, and relationships within text.
Pattern Recognition: Finds trends, sentiments, and recurring themes in textual data.
Data Structuring: Converts unstructured text into structured data for analysis.
Topic Modeling: Identifies topics present in a collection of documents (e.g., Latent
Dirichlet Allocation (LDA)).
Text Classification: Categorizes text into predefined labels (e.g., spam detection).
Named Entity Recognition (NER): Identifies names, locations, organizations, and other
key entities.
Keyword Extraction: Identifies the most important words or phrases in a text.
Time Series Analysis
Time series analysis examines data collected over time to identify trends and seasonality and to forecast future values.
Objectives:
Trend Analysis: Identifies long-term patterns in data.
Seasonality Detection: Recognizes repeating cycles or patterns within a fixed time
period.
Forecasting: Predicts future values based on historical data.
Anomaly Detection: Identifies unexpected changes or outliers in time-based data.
Spatial Analysis
Spatial analysis examines data that has a geographic or locational component.
Objectives:
Pattern Recognition: Identifies spatial distributions and relationships in data.
Proximity Analysis: Measures distances between locations and finds nearest points of
interest.
Spatial Prediction: Uses geographic trends to forecast future outcomes.
Cluster Detection: Groups similar spatial points to identify trends or anomalies.
Network Analysis
Network analysis studies relationships between connected entities (nodes and edges).
Objectives:
Understanding Relationships: Examines how different entities (nodes) are connected.
Identifying Key Influencers: Finds the most important nodes in a network.
Detecting Communities: Identifies clusters or groups of closely connected nodes.
Optimizing Network Flow: Analyzes efficiency and bottlenecks in a system.