
UNIVERSITY OF MINES AND TECHNOLOGY

TARKWA

SCHOOL OF RAILWAYS AND INFRASTRUCTURE


DEVELOPMENT (SRID)

LECTURE MATERIAL
DS 155 FOUNDATIONS OF DATA SCIENCE WITH R.

Compiled by:
Ernest Kwame Ampomah (PhD)

Course Objective
The objective of this course is to introduce students to the core set of knowledge, skills, and ways
of thinking required to solve real-world data-science problems and build applications in this space.
Students will also be able to demonstrate an understanding of data collection, sampling, quality
assessment and repair; statistical analysis and machine learning; state-of-the-art tools to build
data-science applications for different types of data, including text and CSV data; and key concepts
in data science, including tools, approaches, and application scenarios.

Course Outline
 Fundamental concepts in Data Science and Analytics
 Basic R Programming Concepts
 Data Collection and Preprocessing with R
 Data Visualization in R
 Data Analysis Techniques

Reference
1.) Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform,
visualize, and model data. O'Reilly Media.
https://r4ds.had.co.nz
2.) James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical
learning: With applications in R (1st ed.). Springer.
https://www.springer.com/gp/book/9781461471370
3.) Bruce, P., & Bruce, A. (2017). Practical statistics for data scientists: 50 essential
concepts (2nd ed.). O'Reilly Media.
https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/

Course Presentation
This course is delivered through a combination of lectures and hands-on laboratory practice,
supplemented by comprehensive handouts. During lab sessions, students will be guided in solving
practical problems of varying complexity to reinforce theoretical concepts. In addition, students
will be assigned practical exercises to complete independently and submit as assignments. To gain
a thorough understanding and appreciation of the subject, students are encouraged to actively
participate in all lectures and lab sessions, consistently practice programming tasks, review
provided references and handouts, and complete all assignments on time.

Chapter 1:
Fundamental Concepts of Data Science

Data science is the art and science of acquiring knowledge through data.
It is an interdisciplinary field that uses scientific methods, algorithms, processes, and
systems to extract insights and knowledge from data.

Data science is an interdisciplinary field that integrates principles and techniques from various
domains to extract meaningful insights from data. The key components include:
 Mathematics and Statistics: These disciplines provide the theoretical foundation for data
analysis, enabling data scientists to perform hypothesis testing, model evaluation, and identify
trends within datasets.
 Computer Science and Programming: Skills in programming languages such as Python, R,
and Java are essential for processing, storing, and analyzing large datasets. These tools
facilitate the development of algorithms and models that can handle complex data structures.
 Domain Knowledge: Understanding the specific area of application—be it medicine, finance,
social sciences, or another field—is crucial. This expertise allows data scientists to frame
relevant questions, interpret results accurately, and ensure that analyses are contextually
appropriate.

1.1 Importance of Data Science


Data science is all about how we take data, use it to acquire knowledge, and then use that
knowledge to do the following:
 Make Informed Decisions:
By analyzing historical and current data, data science provides a foundation for
strategic decision-making across various sectors.
 Predict Future Outcomes:
Through predictive analytics, data science forecasts future events, aiding in
proactive planning and risk management.
 Understand Past and Present Trends:
Data science uncovers patterns and trends in historical data, offering insights into past
behaviors and current conditions.
 Drive Innovation:
By identifying opportunities and optimizing processes, data science fosters the
creation of new products and services, spurring industry innovation.

1.2 Applications of Data Science


Data science is a versatile field with applications across various industries, providing valuable
insights that drive decision-making, optimize processes, and create new opportunities.
Some common applications of data science include:

 Education:
Data science is utilized to analyze student performance, tailor educational content,
and improve learning outcomes. By examining data on student interactions and
achievements, educators can identify areas needing attention and adapt teaching
methods accordingly.
 Airline Industry:
Airlines employ data science for route optimization, demand forecasting, and
enhancing customer experience. Analyzing historical flight data helps in predicting
delays and optimizing schedules, leading to improved operational efficiency.
 Delivery Logistics:
Logistics companies leverage data science to optimize delivery routes, manage
inventory, and predict shipping delays. This ensures timely deliveries and cost savings
by efficiently managing resources.
 Energy Sector:
In the energy industry, data science aids in predictive maintenance of equipment,
demand forecasting, and optimizing energy distribution. By analyzing consumption
patterns, companies can enhance efficiency and reduce operational costs.
 Manufacturing:
Manufacturers use data science for quality control, supply chain optimization, and
predictive maintenance. Analyzing production data helps in identifying defects early
and streamlining operations.
 Retail and E-commerce:
Retailers analyze customer data to personalize shopping experiences, manage
inventory, and optimize pricing strategies. This leads to increased customer
satisfaction and sales.
 Transportation and Travel:
Data science is applied in optimizing routes, managing traffic flow, and improving
public transportation systems. Analyzing travel patterns helps in reducing congestion and
enhancing commuter experience.
 Healthcare:
In the medical field, data science aids in detecting diseases, such as cancer, by
analyzing medical images and patient data to identify patterns indicative of tumors.
 Supply Chain Management:
Businesses utilize data science to optimize supply chain networks, ensuring efficient
operations and reducing costs through predictive analytics and demand forecasting.
 Sports Analytics:
Professional sports teams analyze in-game performance metrics of athletes to
enhance strategies and training programs, leading to improved performance and
competitive advantage.

 Finance:
Financial institutions develop credit reports and assess risk by analyzing vast
amounts of financial data, enabling better decision-making in lending and investments.

1.3 Roles and Responsibilities of Data Professionals

Data professionals play crucial roles in managing, analyzing, and safeguarding data within
organizations. Their responsibilities vary based on specific roles, each contributing uniquely to the
organization's data strategy. The following are key data professional roles and their primary
responsibilities:

1.) Data Engineer


Data engineers are responsible for designing, building, and maintaining the infrastructure that
allows for the collection, storage, and analysis of data.
Their key responsibilities include:
 Creating automated pipelines to extract, transform, and load (ETL) data from various
sources into data warehouses or databases.
 Developing scalable and efficient data architectures to support analytics and reporting
needs.

 Providing the necessary infrastructure and tools for data scientists and analysts to perform
their tasks effectively.

2.) Data Analyst


Data analysts play a crucial role in gathering, organizing, and analyzing data to uncover insights
and trends. Their key responsibilities include:
 Gathering data from various sources and ensuring its accuracy and completeness.
 Applying statistical techniques and data mining algorithms to identify patterns and
correlations within the data.
 Generating reports and presenting findings to aid decision-making processes.

3.) Data Scientist


Data scientists analyze and interpret complex data to help organizations make informed decisions.
Their key responsibilities include:
 Utilizing machine learning and statistical methods to build predictive models.
 Designing experiments to test hypotheses and validate models.
 Translating complex analytical results into actionable insights for stakeholders.

4.) Data Steward
Data stewards are responsible for ensuring the quality and fitness for purpose of the organization's
data assets. Their key responsibilities include:
 Ensuring each data element has a clear and unambiguous definition.
 Ensuring data is accurate, consistent, and used appropriately across the organization.
 Documenting the origin and sources of authority for each data element.

5.) Data Custodian


Data custodians are responsible for the safe custody, transport, and storage of data, as well as the
implementation of business rules. Their key responsibilities include:
 Ensuring access to data is authorized and controlled.
 Implementing technical processes to sustain data integrity.
 Applying technical controls to safeguard data.

6.) Chief Information Officer (CIO)


The CIO is responsible for the overall technology strategy of an organization, including data
management. Their key responsibilities include:
 Developing and implementing the organization's IT strategy.
 Establishing policies related to information technology and data governance.
 Leading and directing the IT workforce to align with business objectives.

7.) Chief Privacy Officer (CPO)


The CPO is responsible for managing the organization's data privacy policies and compliance.
Their key responsibilities include:
 Overseeing the company's data governance policies and procedures.
 Driving privacy-related awareness and training among employees.
 Assessing privacy-related risks arising from existing and new business activities.

1.4 Data Science Lifecycle


The Data Science Lifecycle is a structured and iterative framework that guides data scientists
through the various stages of a project, from understanding a business problem to deploying and
maintaining a predictive model. This process ensures that data-driven solutions align with business
goals and deliver actionable insights.
The key stages in data science lifecycle include the following:

1.) Business Understanding: This initial phase involves clearly defining the problem to be
solved. Data scientists work with stakeholders to understand the business context, set clear
objectives, and ensure that the data science project is aligned with broader business goals.
This stage is crucial for framing the problem in a way that the data science approach can
address it effectively.

2.) Data Understanding and Collection: This phase involves identifying and gathering the
necessary data to solve the business problem, ensuring that the data is relevant, accurate,
and in a suitable format. This stage also includes checking the quality of the data for
completeness and correctness, as well as understanding its structure to ensure it aligns with
the problem at hand. Essentially, this phase ensures that the right data is collected and
prepared for further analysis.

3.) Data Preparation: Data often comes in raw, unstructured forms, so it’s important to
clean and preprocess it to ensure it’s ready for analysis. This step includes handling missing
values, removing duplicates, encoding categorical variables, scaling numerical data, and
transforming features. The goal is to prepare high-quality, consistent data that will lead to
accurate, reliable models.

4.) Exploratory Data Analysis (EDA): EDA is an essential step where data scientists
explore the dataset to identify patterns, trends, and relationships. This involves
summarizing the data with descriptive statistics, visualizing distributions, and examining
correlations. EDA helps in gaining a deeper understanding of the data, identifying potential
outliers, and discovering insights that could influence the choice of modeling techniques.

5.) Model Building: In this stage, data scientists select appropriate machine learning
algorithms based on the nature of the problem (e.g., classification, regression) and the
characteristics of the data. They split the data into training and test sets, train the models
on the training data, and fine-tune parameters to improve performance. The goal is to build
a model that captures the patterns in the data effectively.

6.) Model Evaluation: Once the model is built, it is crucial to assess its performance.
Evaluation involves using various metrics such as accuracy, precision, recall, F1-score, and
confusion matrices to determine how well the model performs. Data scientists may also
use techniques like cross-validation to ensure that the model generalizes well to unseen
data and doesn’t overfit.

7.) Model Deployment: After the model has been trained and evaluated, the next step is to
deploy it into a production environment. This involves integrating the model into
operational systems where it can make real-time predictions or decisions. Deployment may
include creating APIs, setting up data pipelines, and ensuring that the model can scale and
function smoothly in the business environment.

Figure 1.1. The lifecycle of data science

1.5 Fundamentals of Data


Data refers to raw facts, figures, or observations that are collected, stored, and processed to
generate meaningful insights. These raw elements can be in various forms, such as numbers, text,
images, audio, or video, and serve as the foundation for analysis, modeling, and decision-making.
Data in data science is the raw material that, when properly collected, cleaned, and analyzed,
transforms into actionable knowledge.

1.5.1 Role of Data in Data Science:


 Foundation for Analysis
Data serves as the input for various analyses, algorithms, and models.
 Insights Generation
Through processing and analysis, data reveals patterns, trends, and correlations.
 Decision Support
Data-driven insights inform business decisions and strategies.
 Model Training
In machine learning, data is used to train algorithms to make predictions or classifications.

1.5.2 Types of Data
Data can be classified based on its structure and nature

1.) Based on Structure


Based on structure, data is classified into three types: Structured, Semi-structured, and Unstructured.

 Structured Data
Structured data refers to data that is highly organized and conforms to a predefined format, such
as rows and columns in a table. It adheres to a fixed schema, making it easily stored, queried, and
analyzed.
Examples:
 Spreadsheets (e.g., Excel files with rows and columns).
 Relational databases (e.g., SQL databases like MySQL, PostgreSQL, Oracle).
 Transactional data such as banking records or point-of-sale data

Features of Structured Data


 Predefined Schema: The structure of the data, including fields and data types, is defined
in advance.
 Ease of Use: Structured data can be efficiently stored, retrieved, and manipulated using
standardized tools like SQL.
 Searchable: Data can be queried quickly using indexing and well-defined relationships
between tables

 Semi-Structured Data
Semi-structured data is partially organized, combining elements of structured data with
flexibility. It does not conform to a strict schema but includes tags, markers, or keys to
provide structure and context.
Examples:
 XML (Extensible Markup Language) and JSON (JavaScript Object Notation) files.
 NoSQL databases (e.g., MongoDB, Cassandra).
 Emails, where metadata (e.g., sender, recipient, timestamp) is structured, but the body
text is unstructured.
 API responses.

Features of Semi-Structured Data


 Flexible Structure: It does not require a rigid schema, making it suitable for dynamic or
evolving datasets.
 Interoperability: Easy exchange and integration of data across systems.
 Scalability: Well-suited for large and complex datasets, such as those generated by web
services or IoT devices.

 Unstructured Data
Unstructured data lacks a predefined format or organizational structure, making it more
challenging to process and analyze. Despite its complexity, it represents the majority of
data generated in today's digital world.
Examples:
 Text files (e.g., documents, PDFs).
 Multimedia content (e.g., images, videos, audio recordings).
 Social media posts (e.g., tweets, Facebook updates).

Features of Unstructured Data:


 No Fixed Schema: The data is not arranged in rows, columns, or fields.
 Complex Processing Requirements: Advanced techniques such as natural language
processing (NLP), computer vision, or machine learning are often needed to extract
insights.
 High Value Potential: While unstructured data is harder to manage, it often contains
valuable information that can drive business intelligence.

2.) Based on Nature


Data can be classified into two primary types: quantitative data and qualitative data. These
classifications help determine the type of analysis, tools, and approaches suitable for understanding
the data and drawing insights.

 Quantitative Data
Quantitative data refers to numerical data that can be measured, counted, or expressed in terms of
quantities. It provides objective, measurable information that allows for statistical analysis and
comparison.
Examples:
 Age of individuals (e.g., 25 years, 40 years).
 Monthly income of employees (e.g., $5,000, $10,000).
 Temperature readings (e.g., 22°C, 30°F).
 Sales numbers (e.g., 500 units sold in a month).
Features of Quantitative Data:
 Objectivity: Quantitative data is unbiased and represents measurable facts.
 Answers Specific Questions: It addresses questions such as "how much," "how many," or
"how often."
 Statistical Analysis: Allows for techniques such as averages, percentages, trends, and
hypothesis testing.

Categories of Quantitative Data:


1) Discrete Data:
o Discrete data is numerical data that can only take whole-number values.
o Consists of distinct, countable values.
o Typically represented as whole numbers.

o Examples: Number of employees in a department, number of cars in a parking lot,
number of customers visiting a store.

2) Continuous Data:
o Measurements that can take any value within a range, often involving decimals or
fractions
o Examples: Weight of an individual (e.g., 70.5 kg), height (e.g., 5.8 feet), time taken
to complete a task (e.g., 12.3 seconds).

Continuous Data is further divided into two subcategories: interval data and ratio data. These
classifications are based on the nature of the scale used to measure the data and the presence or
absence of a true zero point.

Interval Data
Interval data refers to data measured on a scale where the intervals between values are consistent
and equal. However, it lacks a true zero point, meaning that zero does not represent the complete
absence of the measured attribute.

Features of Interval Data:


 Equal Intervals: The difference between any two values on the scale is the same
throughout (e.g., the difference between 20°C and 30°C is the same as between 40°C and
50°C).
 No True Zero Point: Zero is arbitrary and does not imply the absence of the measured
characteristic. For example, 0°C does not mean "no temperature."
 Mathematical Operations: Addition and subtraction are meaningful, but multiplication
and division are not. For instance, it is incorrect to say 40°C is "twice as hot" as 20°C.

Examples:
 Temperature measured in Celsius or Fahrenheit.
 Time of day on a 12-hour clock.
 IQ scores.

NB: The absence of a true zero limits the types of comparisons and calculations that can be made.
For example, ratios (e.g., "twice as much") cannot be accurately determined.

Ratio Data
Ratio data also has equal intervals between values but differs from interval data by having a true
zero point, which indicates the complete absence of the measured attribute.
Features of Ratio Data
 Equal Intervals: The scale maintains consistent intervals between values, just like interval
data.
 True Zero Point: Zero represents the absence of the property being measured. For instance,
0 kg signifies no weight, and 0 cm signifies no height.

 Mathematical Operations: All arithmetic operations—addition, subtraction, multiplication,
and division—are meaningful. For example, a weight of 40 kg is objectively twice as heavy
as 20 kg.

Examples
 Weight (e.g., kilograms, pounds).
 Height (e.g., centimeters, meters).
 Distance (e.g., kilometers, miles).
 Age (e.g., years, months).

Ratio data is the most informative type of quantitative data because it supports the widest range of
mathematical and statistical analyses.

 Qualitative Data
Qualitative data, also known as categorical data, is data represented by names or symbols rather
than numbers, for example the names of departments in an organisation or office locations. It is
descriptive, non-numerical information that captures the characteristics, attributes, traits, or
properties of an object, person, or phenomenon. Unlike quantitative data, it focuses on "what"
something is like rather than measuring it in numerical terms.

Examples of Qualitative Data


 Customer Reviews:
 Interview Transcripts:
 Survey Responses:

Other examples include textual data like blog posts, photos, videos, social media comments, and
cultural observations.

Features of Qualitative Data


1) Subjective and Descriptive:
o Reflects people's experiences, opinions, and emotions, often requiring interpretation to
identify patterns or themes.
2) Open-Ended Nature:
o Captures detailed and nuanced insights, addressing questions like "why" or "how" rather
than "how much."
3) Unstructured or Semi-Structured:
o Often lacks a fixed format, requiring analysis techniques such as coding or thematic
analysis to organize the data.
4) Context-Rich:
o Provides a deep understanding of a subject within its specific context, which is often
missed by purely numerical data.

Categories of Qualitative Data
1.) Nominal Data (Categorical)
o Data that represents categories or groups with no inherent order or ranking.
o Examples:
 Gender (e.g., male, female, non-binary).
 Colors (e.g., red, blue, green).
 Types of cuisine (e.g., Italian, Mexican, Indian).
o Features of Nominal Data
 Categories are mutually exclusive.
 No quantitative comparison or order between categories.

2.) Ordinal Data (Ranked)


o Data that represents categories with an inherent order or ranking, but the intervals
between ranks are not consistent or measurable.
o Examples:
 Satisfaction Levels: Very Satisfied, Satisfied, Neutral, Dissatisfied, Very
Dissatisfied.
 Rankings: First place, second place, third place in a competition.
 Educational Attainment: High school diploma, bachelor's degree, master's
degree, Ph.D.
o Features of Ordinal Data
 Categories have a logical order.
 Differences between ranks are subjective and not uniform.

Figure 1.2. Classification of Data based on Nature

1.5.3 Data Collection
Data collection in data science refers to the systematic gathering of information from a range of
sources to be used for analysis, modeling, and decision-making. It plays a pivotal role in the data
science process as it provides the raw material for deriving insights and predictions.
Accurate data collection is critical because poor-quality data leads to biased models and incorrect
conclusions. High-quality data allows data scientists to develop models that accurately reflect real-
world scenarios, ensuring that decisions based on these models are both effective and reliable. The
relationship between data quality and model performance is direct—better data leads to better
models and, consequently, better outcomes.

Types of Data Sources


Data can come from a variety of sources, each with its own advantages and use cases.
 Primary Data
Primary data is collected firsthand through surveys, experiments, or observations. This type of data
is highly specific to the problem at hand and allows for greater control over the data-gathering
process.
Examples: Surveys, laboratory experiments, and field observations.

 Secondary Data
Secondary data refers to information that has already been collected by others and is made
available for analysis. It is often used to complement primary data or provide a broader context.
Examples: Public datasets from government databases, research publications, and data shared by
organizations like the World Bank.

Figure 1.3: Classification of Data Sources

Methods of Data Collection
Various methods are employed to collect data, each suitable for different scenarios:
 Surveys and Questionnaires: Gathering information through structured questions.
 Interviews and Focus Groups: Collecting in-depth insights through direct interaction.
 Observations: Recording behaviors or events as they occur naturally.
 Experiments: Conducting controlled tests to study specific variables.
 Transactional Tracking: Monitoring and recording transactions or interactions.

Websites to get Secondary Data


1. Kaggle: https://www.kaggle.com/datasets
2. UCI Machine Learning Repository: one of the oldest dataset sources on the web.
http://mlr.cs.umass.edu/ml/
3. The awesome-public-datasets GitHub repository, which curates high-quality datasets.
https://github.com/awesomedata/awesome-public-datasets
4. If you are looking for governments' open data, here are a few portals:
Indian Government: http://data.gov.in
US Government: https://www.data.gov/
British Government: https://data.gov.uk/
French Government: https://www.data.gouv.fr/en/

Chapter 2:
Basic R Programming Concepts

 R is a powerful open-source programming language widely used for statistics, data
analysis, and visualization.
 It is free and open-source.
NB: An open-source programming language is a language whose source code is freely
available for anyone to view, modify, and distribute.
 RStudio is a user-friendly Integrated Development Environment (IDE) that enhances the
R programming experience with tools for coding, debugging, data analysis, and
visualization.

2.1. Installing R and RStudio


 Install R:
o Visit the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/.
o Choose the version suitable for your operating system (Windows, macOS, or
Linux).
o Download and follow the installation prompts.
 Install RStudio:
o Navigate to the RStudio download page: https://posit.co/download/rstudio-desktop/
o Select the free version of RStudio Desktop.
o Download and install the application.

2.2 Exploring the RStudio Interface


RStudio has several important sections that help you write, execute, and manage your code
effectively. The interface is designed to keep your workflow organized. It is divided into four
primary sections (panes), each serving a distinct purpose to enhance your workflow.
Below are the various sections:

1.) Console
 The console is the main area where you type and run R commands directly.
 It acts like a command-line interface within R.
 You can enter R code here, press Enter, and immediately see the output or results. This is
great for quick calculations or testing small pieces of code.
 Example: Typing 2 + 3 and pressing Enter will instantly show the result 5 in the console.
 NB: If an error occurs, the console will display an error message to help you troubleshoot
the issue.

2.) Script Editor (Source Pane)


 This is the area where you can write, edit, and save longer blocks of code as scripts.
 Unlike the console, code here doesn’t run automatically—you need to run it manually.
 You can type multiple lines of code, save the file (usually with a .R extension), and run
selected parts or the entire script when needed.
 This is useful for projects where you want to keep your code organized and reusable. You
can also add comments to explain your code, which helps when revisiting your work later.
 Use Ctrl + Enter (or Cmd + Enter on Mac) to run a selected line of code from the script
directly in the console.
Writing Your First R Script
 In RStudio, go to File > New File > R Script to open a new script editor.

3.) Environment/History Pane


 Environment Tab:
o Displays all the variables, data frames, functions, and objects you’ve created during
your session.
o You can see the names and values of objects, which helps you keep track of your
data and avoid mistakes.
o Example: If you create a variable, it will appear in the Environment tab with its
value.

 History Tab:
o Keeps a record of all the commands you’ve previously run in the console.
o You can scroll through past commands, copy them, and reuse them without having
to retype everything.
o NB: This is especially helpful when working on complex analyses where you need
to reference earlier steps.
4.) Plots/Files/Packages/Help Pane
 Plots Tab:
o Displays any graphs or visualizations you create using R’s plotting functions (like
plot() or ggplot2).
o You can zoom in, export plots as images or PDFs, and navigate between different
plots you’ve created.
 Files Tab:
o Allows you to browse files and folders on your computer, similar to a file explorer.
o Makes it easy to open data files, save scripts, and manage your project directory.
 Packages Tab:
o Shows the list of R packages installed on your system. You can load or unload
packages, install new ones, and check for updates.
o Packages are collections of functions and tools that extend R’s capabilities. For
example, dplyr is a popular package for data manipulation.
 Help Tab:
o Provides documentation for R functions, packages, and commands.

o You can search for help topics using ?function_name in the console (e.g., ?mean),
and detailed explanations will appear in this tab.

Figure 2.1. RStudio Interface

Change the Font Size in RStudio Settings
 Go to Tools in the top menu bar.
 Select Global Options.
 In the General section, click on the Appearance tab.
 In the Editor font size section, use the Slider or manually input a value to adjust the font size of
your code.
 After adjusting, click Apply and then OK to save the changes.

Running Code in R Studio


1.) Running Code from the Script Editor
 Highlight the Code: Select the lines of code in your script that you want to run.
 Run the Selected Code:
o Press Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac) to execute the highlighted code
in the console.
o You can also use the Run button (green triangle) in the toolbar above the script editor.
2.) Running Entire Script
 Source the Entire Script:
o Click the Source button in the script editor, or use the shortcut Ctrl + Shift + S
(Windows/Linux) or Cmd + Shift + S (Mac) to run the entire script.
3.) Running Code in the Console Pane
 Direct Execution:
o You can type code directly into the Console Pane and press Enter to execute it.

Delete or clear the content in the RStudio console


1.) Clear the Console via the Menu
 Go to the "Edit" menu in the top-left corner of RStudio.
 Click "Clear Console"

2.) Use Keyboard Shortcut


 Windows/Linux: Press Ctrl + L
 macOS: Press Cmd + L
This will instantly clear the console.

2.3 Variable in R
A variable is a name that stores data or a value, which can be used and modified in your R code.
1.) Creating a Variable
 Use the assignment operator <- (preferred) or = to assign a value.
Syntax:
variable_name <- value

2.) Rules for Naming Variables
 Can include letters, numbers, and underscores.
 Cannot start with a number.
 Case-sensitive: Name and name are different

Data types in R
1.) Numeric
 The numeric data type in R is used to represent numbers, including both integers (whole
numbers) and decimals (floating-point numbers).
 Types of Numeric Values:
o Integers: Whole numbers without any decimal point (e.g., 5, 10, -3).
o Decimals (Floating Point Numbers): Numbers that include a decimal point (e.g.,
5.7, 10.3, -4.6).
 R automatically recognizes numbers with or without decimal points as numeric. You don't
need to specify the type of number explicitly, and R will treat both integers and floating-
point numbers as numeric.
Example:
x <- 5.7 # a floating-point number
y <- 10 # an integer

2.) Character
o The character data type in R is used for text or strings—sequences of characters.
A string can contain letters, numbers, symbols, and spaces, enclosed within
quotation marks (either single ' or double ").
o In R, any text or string that appears between quotation marks is automatically
treated as a character.
Example
name <- "John" # character string
greeting <- "Hello" # another character string

NB: Be careful not to forget the quotation marks, as text without quotes will cause an error.
For example, name <- John without quotes would result in an error.

3) Logical
 The logical data type in R represents boolean values, which can only be either
TRUE or FALSE.
 These values are often used in conditions, loops, and logical operations.
 Logical values are used to test conditions or relationships between values.
 They are essential for controlling flow in your code, such as in if statements or
while loops.

Example
is_active <- TRUE # logical value
has_data <- FALSE # logical value

o Operations: Logical values can also be the result of comparison operators (e.g., ==, !=,
>, <), and they can be combined using logical operators like & (AND), | (OR), and ! (NOT).
o Example
5>3 # returns TRUE
10 == 5 # returns FALSE

4.) Complex
The complex data type in R is used to represent complex numbers.
A complex number consists of two parts: a real part and an imaginary part.
The imaginary part is denoted by i,
In R, complex numbers can be written in the form of a + bi, where a is the real part,
and b is the imaginary part.

Example
z <- 3 + 2i # a complex number with real part 3 and imaginary part 2

2. 4. Basic Commands in R

1.) Assigning values to variables


In R, you can assign values to variables using the following operators:
 Using <-
x <- 10 # Assigning a number
name <- "R" # Assigning a string
flag <- TRUE # Assigning a logical value
z <- x + y # Adds x and y and stores in z.

 Using =
y = 20
language = "R Programming"

 Using assign() Function


assign("z", 30)

Note:
The <- operator is preferred in R programming.
Variable names are case-sensitive (Var and var are different).
You can also assign vectors, lists, data frames, etc.

2.) Printing output


In R, you can print output to the screen using the following functions:
a.) print() Function
Displays the value of an object.
# Example
x <- 10
print(x)

# Prints text with quotes.


print("Hello, R!")

b.) cat() Function


Concatenates and prints text or variables without quotes.
# Example
name <- "R Programming"
cat("Welcome to", name)

c.) Typing the Variable Name (R Console Only)


Simply type the variable name, and R will display its value.

# Example
y <- 20
y # This will print 20

2.5 Basic Operations

Basic operations in R, categorized into arithmetic, relational, and logical operations:

1.) Arithmetic Operations


Arithmetic operations are used to perform basic mathematical calculations such as
addition, subtraction, multiplication, division, and more.
Common Arithmetic Operators:
Addition (+): Adds two numbers.
Example: 5 + 3 returns 8.
Subtraction (-): Subtracts one number from another.
Example: 10 - 2 returns 8.
Multiplication (*): Multiplies two numbers.
Example: 4 * 3 returns 12.
Division (/): Divides one number by another.
Example: 9 / 3 returns 3.
Exponentiation (^): Raises a number to a power.
Example: 2^3 returns 8
Modulus (%%): Returns the remainder of a division.
Example: 9 %% 4 returns 1
Integer Division (%/%): Divides two numbers and returns the integer part of the
result (without the remainder).
Example: 9 %/% 4 returns 2 (since 9 divided by 4 gives 2 as the integer
quotient).
Example
x <- 5
y <- 3
sum_result <- x + y #5+3=8
product_result <- x * y # 5 * 3 = 15

2.) Relational (Comparison) Operations


Relational operations are used to compare two values. They return a logical result: TRUE
if the comparison is valid, or FALSE if it's not.

Common Relational Operators


Equal to (==): Checks if two values are equal.
Example: 5 == 5 returns TRUE, while 5 == 3 returns FALSE.

Not equal to (!=): Checks if two values are not equal.


Example: 5 != 3 returns TRUE, while 5 != 5 returns FALSE.

Greater than (>): Checks if the left value is greater than the right value.
Example: 6 > 3 returns TRUE.

Less than (<): Checks if the left value is less than the right value.
Example: 2 < 5 returns TRUE.

Greater than or equal to (>=): Checks if the left value is greater than or equal to
the right value.
Example: 5 >= 5 returns TRUE, while 4 >= 5 returns FALSE.

Less than or equal to (<=): Checks if the left value is less than or equal to the
right value.
Example: 3 <= 3 returns TRUE, while 6 <= 3 returns FALSE.

Example
x <- 5
y <- 3
is_equal <- x == y # FALSE
is_greater <- x > y # TRUE

print(is_equal)
print(is_greater)

3) Logical Operations
Logical operations are used to combine or manipulate logical values (TRUE and
FALSE).
These are especially useful in decision-making structures like if statements or loops.

Common Logical Operators:


AND (&): Checks if both conditions are TRUE.
Example: TRUE & TRUE returns TRUE, but TRUE & FALSE returns
FALSE.

OR (|): Checks if at least one condition is TRUE.


Example: TRUE | FALSE returns TRUE, while FALSE | FALSE returns
FALSE.

NOT (!): Reverses the value of a logical condition.


Example: !TRUE returns FALSE, and !FALSE returns TRUE.

XOR (Exclusive OR) (xor()): Returns TRUE if one condition is TRUE but not
both.
Example: xor(TRUE, FALSE) returns TRUE, but xor(TRUE, TRUE)
returns FALSE.

Example
a <- TRUE
b <- FALSE
and_result <- a & b # FALSE (because both must be TRUE)
or_result <- a | b # TRUE (because at least one is TRUE)
not_result <- !a # FALSE (because !TRUE is FALSE)

Example
# Combining relational and logical operations
x <- 5
y <- 10
z <- 15
result <- (x < y) & (y < z) # TRUE (x < y and y < z are both true)

4.) Working with Vectors


In R, a vector is a basic data structure that holds elements of the same type (numeric,
character, logical, etc.).

Creating a Vector in R
a) Using c() function
# Numeric vector
num_vector <- c(1, 2, 3, 4, 5)
# Character vector
char_vector <- c("apple", "banana", "cherry")
# Logical vector
log_vector <- c(TRUE, FALSE, TRUE)

b) Using : operator (for sequences)


seq_vector <- 1:5 # creates a vector: 1, 2, 3, 4, 5
seq_vector <- seq(1, 10, by = 2) # Creates: 1, 3, 5, 7, 9

c) Using rep() function (to repeat elements)

rep_vector <- rep(3, times = 4) # Creates: 3, 3, 3, 3

5.) Accessing Vector Elements
num_vector[2] # Access 2nd element
num_vector[1:3] # Access elements from 1st to 3rd
num_vector[-1] # Exclude 1st element

6.) Vector Operations


# Arithmetic
a <- c(1, 2, 3)
b <- c(4, 5, 6)
a+b # Addition: 5, 7, 9

# Logical
a>2 # Returns: FALSE, FALSE, TRUE
7.) Basic Functions
length(a) # Returns number of elements in a.
sum(a) # Returns sum of elements in a.
mean(a) # Calculates the average of a.
max(a) # Finds the largest value in a.
min(a) # Finds the smallest value in a.

8.) Lists
A list can hold elements of different data types (numbers, strings, vectors, etc.).
Created using the list() function.
Example:
# List with different data types
my_list <- list(10, "Hello", TRUE, c(1, 2, 3))

Accessing Elements
my_list[[1]] # Access 1st element → 10
my_list[[4]] # Access the vector → 1 2 3

9.) Factors
o Factors are used to handle categorical data (like labels or groups).
o They store data as levels, which makes them memory efficient.
o Levels are the unique categories or distinct values within a factor.
o When you create a factor from a vector of categorical data, R automatically
identifies and stores the unique values as levels.
o Created using the factor() function.

Example:
# Creating a Factor
colors <- factor(c("Red", "Blue", "Red", "Green", "Blue"))
# Checking Levels
levels(colors) # Output: "Blue" "Green" "Red"

10.) Matrices
o A matrix is a 2-dimensional data structure with rows and columns.
o It can only contain one data type (numeric, character, etc.).
o Created using the matrix() function.

# Creating a Matrix
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)

# Output:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

Accessing Matrix Elements:

m[1, 2] # Element in 1st row, 2nd column → 3


m[, 3] # All elements in 3rd column → 5, 6
m[2, ] # All elements in 2nd row → 2, 4, 6

2.6 Introduction to Data Frames in R
 A Data Frame is a table-like structure used to store data in rows and columns.
 It can contain different data types in different columns (numeric, character, logical, etc.).
 Think of it like an Excel spreadsheet or a SQL table in R.

Creating a Data Frame


Use the data.frame() function

# Creating a simple Data Frame


students <- data.frame(
Name = c("Alice", "Bob", "Charlie"), # Character
Age = c(20, 22, 19), # Numeric
Passed = c(TRUE, FALSE, TRUE) # Logical
)

# Viewing the Data Frame


print(students)

Output
     Name Age Passed
1   Alice  20   TRUE
2     Bob  22  FALSE
3 Charlie  19   TRUE

Accessing Data in a Data Frame


1) By Column:
students$Name # Access the 'Name' column
students$Age # Access the 'Age' column

2) By Row and Column (using indices):


students[1, ] # First row
students[, 2] # Second column (Age)
students[2, 3] # Value at 2nd row, 3rd column (FALSE)

Adding New Data


Add a New Column
students$Grade <- c("A", "B", "A")

Add a New Row


new_student <- data.frame(Name = "David", Age = 21, Passed = TRUE, Grade = "B")
students <- rbind(students, new_student)

NB:
 Columns can have different data types, but each column must have the same type
of data.
 Use $ to access columns easily.
 nrow(students) and ncol(students) give the number of rows and columns.
 str(students) shows the structure of the data frame.

2.7 Conditional Statements in R


 Conditional statements are used to make decisions in your code based on certain
conditions.

 if Statement
Executes code only if a condition is TRUE.
Syntax
if (condition) {
# Code to run if condition is TRUE
}

Example
x <- 10
if (x > 5) {
print("x is greater than 5")
}

 if...else Statement
Adds an alternative if the condition is FALSE.
if (condition) {
# Code if TRUE
} else {
# Code if FALSE
}

Example
x <- 3
if (x > 5) {
print("x is greater than 5")
} else {
print("x is 5 or less")
}

 if...else if...else Statement
Checks multiple conditions one by one.
Executes the code for the first TRUE condition.
Syntax:
if (condition1) {
# Code if condition1 is TRUE
} else if (condition2) {
# Code if condition2 is TRUE
} else {
# Code if none of the conditions are TRUE
}

Example
score <- 85

if (score >= 90) {


print("Grade: A")
} else if (score >= 75) {
print("Grade: B")
} else if (score >= 60) {
print("Grade: C")
} else {
print("Grade: F")
}

 ifelse() Function
A vectorized version of if...else,
A vectorized way to handle conditions.
Great for working with vectors.
Good for quick checks on multiple values.

Syntax:
ifelse(condition, value_if_true, value_if_false)

Example
# Vector example
marks <- c(80, 45, 70, 30)
result <- ifelse(marks >= 50, "Pass", "Fail")
print(result)

# Output: "Pass" "Fail" "Pass" "Fail"

2.8 Loops in R: for, while, repeat
Loops are used to repeat a set of instructions multiple times in R.

for Loop
Used to iterate over a sequence (like a vector or range).

Syntax
for (variable in sequence) {
# Code to repeat
}
Example
# Print numbers from 1 to 5
for (i in 1:5) {
print(i)
}

NB: i takes values from 1 to 5 one by one.

while Loop
Repeats code as long as a condition is TRUE.

Syntax:
while (condition) {
# Code to repeat
}

Example:
# Print numbers from 1 to 5
x <- 1
while (x <= 5) {
print(x)
x <- x + 1
}

NB: The loop runs until x becomes 6 (condition x <= 5 is FALSE).

repeat Loop

Repeats code indefinitely until you use a break to stop it.

Syntax
repeat {
# Code to repeat
if (condition) {
break # Stops the loop
}
}

Example
# Print numbers from 1 to 5
y <- 1
repeat {
print(y)
y <- y + 1
if (y > 5) {
break # Stop the loop when y > 5
}
}

NB:
for loop → Best for iterating over a known sequence.
while loop → Runs as long as the condition stays TRUE.
repeat loop → Runs indefinitely unless stopped with break.

2.9 Functions in R: Creating and Calling Functions
Functions allow you to reuse code and organize it into manageable chunks. In R, you can create
and call functions to make your code more modular and efficient.

1.) Creating a Function in R


A function in R is created using the function() keyword.
You define the function, specify the parameters (inputs), and write the body of the
function (the operations the function performs).

Syntax:
function_name <- function(parameter1, parameter2, ...) {
# Code to execute
# Return value (optional)
}

Example: Creating a Function

Let's create a simple function to add two numbers:

# Creating a function to add two numbers


add_numbers <- function(a, b) {
result <- a + b
return(result)
}

NB:
Function Name: add_numbers
Parameters: a and b
Return: The sum of a and b.

2.) Calling a Function


Once the function is created, you can call it by using its name and passing the required
arguments (values for the parameters).

Example
# Calling the function with arguments
sum_result <- add_numbers(5, 10)
print(sum_result)

# Output: 15

NB: Here, 5 is passed as a and 10 as b. The function calculates the sum and returns it

3.) Function with Default Arguments
You can also set default values for parameters, so they are used if no value is provided
when calling the function.

Example:
# Function with default argument
greet <- function(name = "Guest") {
message <- paste("Hello,", name)
return(message)
}
# Calling without argument
print(greet()) # Output: Hello, Guest
# Calling with argument
print(greet("Alice")) # Output: Hello, Alice

4) Returning Values from Functions


R functions can return values using the return() statement. If not explicitly specified, the
function will return the last evaluated expression.

# Function without return() statement


multiply <- function(x, y) {
x * y # Implicit return
}

result <- multiply(3, 4)


print(result) # Output: 12

2.10 Scope of Variables in R


The scope of a variable defines where in the code the variable can be accessed or modified.

1) Global Variables
Global variables are defined outside of any function.
They can be accessed anywhere in the script (inside or outside functions).

# Global Variable
x <- 10
# Function accessing global variable
printValue <- function() {
print(x) # Accessing global variable
}
printValue() # Output: 10

2) Local Variables
Local variables are defined inside a function.
They can be accessed only within that function.
Example
# Function with a Local Variable
calculateSum <- function() {
y <- 5 # Local variable
z <- 7
return(y + z)
}
calculateSum() # Output: 12
# print(y) # Error: 'y' is not found (because 'y' is local)

2.11 Installation of Rtools and Tidyverse


Rtools is a collection of tools that is used to build and compile R packages on Windows. When
you install certain R packages, especially those that are not pre-compiled and need to be built from
source code, you need Rtools to provide the necessary tools for this process.
Tidyverse is a collection of R packages designed for data science tasks, making it easier to
manipulate, visualize, and analyze data in a clean and consistent way.

Step 1: Download Rtools


1) Go to the official Rtools page: Rtools for Windows.
https://cran.r-project.org/bin/windows/Rtools/rtools44/rtools.html
2) Download the version recommended for your version of R (e.g., Rtools40 for R 4.0 and
above).

Step 2: Install Rtools


1) Double-click the downloaded .exe file (e.g., Rtools40-x86_64.exe).
2) Follow the installation prompts, and accept the default settings (the default installation path
is fine).
3) Make sure to check the option that says "Add Rtools to system PATH".

Step 3: Verify Installation


1) Open R or RStudio.
2) Run this command to check if Rtools is installed correctly:

If Rtools is installed properly, you should see a path like this:
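A common way to verify the toolchain, shown here as a minimal sketch assuming base R's Sys.which() (pkgbuild::has_rtools() is an optional alternative that requires the pkgbuild package):

Sys.which("make")   # looks for the Rtools 'make' utility on the PATH
# Expected: a path such as "C:\rtools44\usr\bin\make.exe" (an empty string means not found)

# Alternative, if the pkgbuild package is installed:
# pkgbuild::has_rtools()   # TRUE when a usable Rtools installation is found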

Step 4: Install tidyverse package
1. Open RStudio:
2. Run the following command in the R console:

This will install the core packages of the tidyverse including ggplot2, dplyr, tidyr, and others.

3. Load the tidyverse package
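For reference, the standard commands for steps 2 and 3 are:

install.packages("tidyverse")   # Step 2: install the tidyverse meta-package from CRAN
library(tidyverse)              # Step 3: load it into the current session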

NB:
To install and load the required packages in R, follow these steps:
Step 1: Install the Packages
If you haven't installed the packages yet, you can install them using the install.packages() function

Step 2: Load the Installed Packages


After installation, load the required packages using the library() function

Step 3: Verify Successful Loading


To check if the packages are loaded successfully, you can use:

Or

Load Packages Automatically


To ensure the packages are always loaded in your R scripts, you can use:
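A sketch of the usual pattern for these steps, using a hypothetical set of packages:

# Step 1: install the packages (needed only once)
install.packages(c("dplyr", "ggplot2", "readxl"))

# Step 2: load them in each session
library(dplyr)
library(ggplot2)
library(readxl)

# Step 3: verify they are loaded
"dplyr" %in% (.packages())   # TRUE if dplyr is attached
sessionInfo()                # also lists the attached packages

# To load packages automatically, place the library() calls at the top of every
# script (or in your .Rprofile) so they run each time the script is executed.
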

R Data File
An R Data File (typically with the .RData or .rds extension) is a file format used to save R objects
such as data frames, vectors, lists, or even entire R workspaces. These files allow you to save your
R environment or specific objects and load them later without needing to recalculate or recreate
the data.

Types of R Data Files


1) .RData (or .rda)
o The .RData file can store multiple R objects, such as variables, data frames, or functions,
and is used for saving the entire workspace or a selected set of objects.
o When you load an .RData file, all the saved objects are restored to your current R
session.
Saving an R Workspace to .RData

Loading an R Workspace from .RData

This will load all objects saved in that .RData file back into your environment.
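A minimal sketch, assuming two objects df and x already exist in the workspace:

save(df, x, file = "my_objects.RData")   # save selected objects
save.image(file = "workspace.RData")     # save every object in the workspace

load("my_objects.RData")                 # restore the saved objects into the session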

2) .rds
 The .rds file is used to save a single R object. It is often used for saving larger datasets or
models because it's more space-efficient.
 Unlike .RData, when you load an .rds file, you must assign it to a variable.

Saving an R Object to .rds

Loading an R Object from .rds
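A minimal sketch, assuming a data frame df:

saveRDS(df, file = "my_data.rds")   # save a single object
df2 <- readRDS("my_data.rds")       # load it later; the result must be assigned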

When to Use .RData vs .rds


 .RData is convenient when you want to save the entire workspace or multiple objects at
once.
 .rds is ideal when you want to save and load a single object, such as a model, dataset, or
results of a computation.

Chapter 3:
Data Collection and Preprocessing with R
3.1 Data Collection in R
Data collection refers to gathering and importing data from various sources into R for analysis.
There are multiple ways to collect data in R.

a) Manually Creating Data in R

 Using vectors
 A vector is a basic data structure in R that holds elements of the same type (numeric,
character, logical, etc.).
 Creation: Use c( )

 Using data frames


 A data frame is a table-like structure in R where columns can have different data types.
 Creation: Use data.frame( ).

Output:
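A minimal sketch with hypothetical values:

# A numeric vector created with c()
ages <- c(23, 31, 27)

# A data frame created with data.frame(); columns can hold different types
people <- data.frame(
  Name = c("Ama", "Kofi", "Esi"),
  Age  = ages
)
print(people)
#   Name Age
# 1  Ama  23
# 2 Kofi  31
# 3  Esi  27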

b) Reading Data from External Files

 read.csv()
 Imports CSV (Comma-Separated Values) files into R as a data frame.
 The file extension is .csv

 Load Data from a URL

 read_excel()
 the read_excel() function from the readxl package is used to read Excel files.
 The file extension is .xlsx

Parameters:
 "your_file.xlsx" – Path to the Excel file
 sheet = 1 – Specifies the sheet (default is the first sheet)

 read.table( )
 In R, read.table() is a general function to read delimited text files into a data frame.

Parameters:
 "file.txt" – Path to the file
 header = TRUE – First row as column names
 sep = "\t" – Tab-separated values (use "," for CSV)
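A sketch of the import functions described above, using hypothetical file names and a hypothetical URL:

sales  <- read.csv("sales.csv")                        # CSV file into a data frame
remote <- read.csv("https://example.com/data.csv")    # CSV read directly from a URL

library(readxl)
scores <- read_excel("scores.xlsx", sheet = 1)         # first sheet of an Excel file

survey <- read.table("survey.txt", header = TRUE, sep = "\t")   # tab-delimited text file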

c) Accessing Data

 head() and tail()


 These functions are used to view the first or last few rows of a data frame.

 View()
 The View() function opens a spreadsheet-like GUI view of a data frame in RStudio.

 glimpse()
 The glimpse() function (from the dplyr package) provides a compact, transposed view of
a data frame.
 Displays column names, types, and first few values

 dim()
 Returns the dimensions (number of rows and columns) of an object, like a data frame or
matrix.

 summary()
 The summary() function provides descriptive statistics for each column in a data frame.
 Numeric columns: Min, Max, Mean, Median, 1st & 3rd Quartiles

 str()
 The str() function displays the structure of an object, including data types and sample
values.
 Data type of each column
 First few values of each column
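A sketch applying these inspection functions to a hypothetical data frame df:

head(df)      # first 6 rows
tail(df, 3)   # last 3 rows
View(df)      # spreadsheet-like viewer in RStudio

library(dplyr)
glimpse(df)   # compact, transposed overview of the columns

dim(df)       # number of rows and columns
summary(df)   # descriptive statistics for each column
str(df)       # structure: data types and first few values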

 Selecting a Column
 You can select a column from a data frame in multiple ways:

i. Using $

ii. Using [[ ]]

iii. Using [ , ]

 Selecting Rows

 Filtering Data
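A sketch of these selection and filtering approaches, assuming a data frame df with an Age column:

df$Age                 # i. column by name with $
df[["Age"]]            # ii. column by name with [[ ]]
df[ , "Age"]           # iii. column with [ , ]

df[1:3, ]              # select the first three rows
df[df$Age > 20, ]      # filter: rows where Age is greater than 20
subset(df, Age > 20)   # equivalent filter using subset()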

d.) Adding & Removing Columns

 Add a New Column

 Remove a Column

e) Exporting Data

 Save to CSV

 Save to Excel (using writexl package)
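A sketch of adding, removing, and exporting, assuming a data frame df with an Age column:

df$AgeGroup <- ifelse(df$Age >= 18, "Adult", "Minor")   # add a new column
df$AgeGroup <- NULL                                     # remove a column

write.csv(df, "df_out.csv", row.names = FALSE)          # save to CSV

library(writexl)
write_xlsx(df, "df_out.xlsx")                           # save to Excel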

3.2. Data Preprocessing with R
Data preprocessing is the process of cleaning, transforming, and organizing raw data to make it
suitable for analysis. It is a crucial step in machine learning and data science to improve the quality
and performance of models.

Steps in Data Preprocessing


1.) Data Collection
 Gather data from various sources such as databases, files (CSV, Excel, JSON), or APIs.
2.) Data Cleaning
 Handling Missing Values
o Remove missing values
o Impute missing values using mean, median, or mode
 Handling Duplicates
o Identify duplicates:
o Remove duplicates
 Handling Outliers
o Use boxplots to detect outliers
o Remove or replace extreme values

3.) Data Transformation


 Feature Scaling (Normalization & Standardization)
o Normalization (Min-Max Scaling): Scales values between 0 and 1
o Standardization (Z-score Scaling): Centers data around mean with standard
deviation of 1
 Encoding Categorical Data
o Convert text labels into numerical format (One-Hot Encoding, Label Encoding)

4) Feature Selection & Extraction


 Remove irrelevant or redundant features
 Extract meaningful information from existing data

5.) Data Splitting


 Divide the dataset into Training Set and Testing Set
 Common ratios: 70-30, 80-20
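A sketch of a 70-30 split using base R's sample(), assuming a data frame df:

set.seed(123)                                               # make the split reproducible
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df))) # randomly pick 70% of row indices
train_set <- df[train_idx, ]                                # training set (70%)
test_set  <- df[-train_idx, ]                               # testing set (remaining 30%)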

3.2.1. Handling Missing Values in R


Missing values can negatively impact data analysis and machine learning models. In R, missing
values are represented as NA. Methods used to handle missing values effectively in R include the
following.

1) Detecting Missing Values

 is.na() Function
 is.na() is a function in R that checks whether a value is missing (NA - Not Available).
 It returns TRUE for missing values (NA) and FALSE for non-missing values.

 Find missing values in a dataset

 Check missing values in each column

 Find rows with missing values
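A sketch of these checks, using a small vector and a hypothetical data frame df:

x <- c(4, NA, 7, NA, 9)
is.na(x)                    # TRUE at the missing positions

colSums(is.na(df))          # number of missing values in each column
df[!complete.cases(df), ]   # rows that contain at least one NA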

2. Removing Missing Values

 na.omit()
 na.omit() removes all rows containing NA values from a dataset or vector

 Remove specific columns with many missing values:
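A sketch, assuming a data frame df and a hypothetical Comments column with many missing values:

clean_df <- na.omit(df)   # drop every row that contains an NA
df$Comments <- NULL       # drop a column that has too many missing values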

3. Imputing Missing Values

 Replace missing values with the mean


To replace missing (NA) values with the mean in R, use the mean() function along with na.rm
= TRUE and is.na().

 Replace missing values with the median


To replace missing (NA) values with the median in R, you can use the median() function along
with na.rm = TRUE and is.na().

45
 Replace missing values with the mode

o To replace missing (NA) values with the mode in R, you can use names(), sort(), and
table() to find the most frequent value.
o Suitable for categorical data where mean/median cannot be used.

For Categorical Columns

For numerical columns
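One possible sketch of mode imputation; the column names colour and age are hypothetical, and the same helper works for both cases:

# Most frequent value: tabulate, sort in decreasing order, take the first name
get_mode <- function(x) names(sort(table(x), decreasing = TRUE))[1]

df$colour[is.na(df$colour)] <- get_mode(df$colour)      # categorical column
df$age[is.na(df$age)] <- as.numeric(get_mode(df$age))   # numerical column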

Handling Duplicates in R
Duplicates in a dataset can lead to redundancy and incorrect analysis. Below are methods to
identify and remove duplicates in R with practical examples.

 duplicated()
 duplicated() checks for duplicate values in a vector or rows in a data frame and returns TRUE
for duplicates (except the first occurrence).
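A sketch consistent with the output described below:

x <- c(1, 2, 2, 3, 4, 4)
duplicated(x)   # FALSE FALSE TRUE FALSE FALSE TRUE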

Output:

(It marks the second occurrence of 2 and 4 as duplicates.)

46
For Data Frames:

 Returns TRUE for duplicate rows.

 sum(duplicated())
 sum(duplicated()) counts the number of duplicate values in a vector or dataset.


 For Data Frames: It counts the number of duplicate rows


 df[duplicated(df), ]
 df[duplicated(df), ] extracts only the duplicate rows from a data frame, excluding the first
occurrence.


 With a simple vector instead of a data frame, you can still use duplicated() to find and extract
duplicate values.

47
 Identify unique rows:

 Check duplicates based on a specific column:

 Removing Duplicates
 Remove all duplicate rows:

 Remove duplicates based on a specific column:
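The duplicate-handling operations above can be sketched together on a small hypothetical data frame:

df <- data.frame(id = c(1, 2, 2, 3), name = c("Ama", "Kofi", "Kofi", "Esi"))

duplicated(df)              # TRUE for the repeated row
sum(duplicated(df))         # number of duplicate rows
df[duplicated(df), ]        # extract only the duplicate rows
unique(df)                  # identify unique rows
duplicated(df$name)         # check duplicates based on a specific column
df[!duplicated(df), ]       # remove all duplicate rows
df[!duplicated(df$name), ]  # remove duplicates based on a specific column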

Handling Outliers in R
 Outliers are extreme values that differ significantly from other observations in a dataset. They
can impact statistical analysis and machine learning models.

1.) Detecting Outliers Using Boxplots


Boxplots help visualize outliers based on the interquartile range (IQR) rule.

Outliers appear as individual points outside the whiskers.

2.) Removing Outliers Using IQR Method


o The Interquartile Range (IQR) rule considers values beyond 1.5 * IQR from the first
quartile (Q1) and third quartile (Q3) as outliers.
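A sketch of both steps, assuming a hypothetical numeric column df$income:

boxplot(df$income, main = "Income")   # outliers appear as points beyond the whiskers

Q1 <- quantile(df$income, 0.25, na.rm = TRUE)
Q3 <- quantile(df$income, 0.75, na.rm = TRUE)
iqr <- Q3 - Q1

lower <- Q1 - 1.5 * iqr
upper <- Q3 + 1.5 * iqr

df_no_outliers <- df[df$income >= lower & df$income <= upper, ]   # keep only values inside the fences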

48
Data Transformation: Feature Scaling
Feature scaling is a data preprocessing technique used to normalize or standardize the range of
numerical features in a dataset. It ensures that all features contribute equally to a machine learning
model by bringing them to a common scale, preventing features with larger magnitudes from
dominating the learning process. Feature scaling is an important step in data preprocessing,
especially for machine learning models that rely on distance-based calculations (e.g., KNN, SVM,
linear regression).

The two common methods are Normalization (Min-Max Scaling) and Standardization (Z-score
Scaling).

1. Normalization (Min-Max Scaling)


 Min-Max Scaling (Normalization) rescales the feature values between 0 and 1
 Formula: x_scaled = (x - min(x)) / (max(x) - min(x))

49
lapply() applies normalization to each column in base R.
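A sketch in base R, assuming df contains only numeric columns:

normalize <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

df_norm <- as.data.frame(lapply(df, normalize))   # every column rescaled to lie between 0 and 1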

2. Standardization (Z-score Scaling)


 Standardization (Z-score normalization) transforms features to have a mean of 0 and a
standard deviation of 1. It is useful for models that assume normally distributed data, such
as linear regression and logistic regression.
 Formula: z = (x - mean(x)) / sd(x)

50
Using scale() Function
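A minimal sketch, again assuming numeric columns:

df_std <- as.data.frame(scale(df))   # each column now has mean 0 and standard deviation 1
colMeans(df_std)                     # approximately 0
apply(df_std, 2, sd)                 # approximately 1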

Encoding Categorical Data in R


Categorical data needs to be converted into numerical format before being used in machine
learning models. The major methods include:

 Label Encoding – Assigns a unique integer to each category.

 One-Hot Encoding – Creates binary columns for each category.

Label Encoding
 Assigns a unique number to each category (e.g., "Red" → 1, "Blue" → 2).
 Suitable for ordinal categorical data (e.g., "Low", "Medium", "High").
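A sketch with hypothetical size labels; factor() with explicit levels preserves the Low < Medium < High order:

size <- c("Low", "High", "Medium", "Low")
size_encoded <- as.integer(factor(size, levels = c("Low", "Medium", "High")))
size_encoded   # 1 3 2 1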

51

One-Hot Encoding
 Converts each category into a separate binary column (0 or 1).
 Suitable for nominal categorical data (e.g., "Country", "Color").

using model.matrix()
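A sketch with a hypothetical colour column; the "- 1" in the formula drops the intercept so every category gets its own indicator column:

df <- data.frame(colour = c("Red", "Blue", "Green"))
one_hot <- model.matrix(~ colour - 1, data = df)
one_hot
#   colourBlue colourGreen colourRed
# 1          0           0         1
# 2          1           0         0
# 3          0           1         0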


52
Feature Selection & Extraction
Feature selection and extraction help make machine learning models better by removing
unnecessary data and keeping only the most useful information.

A.) Feature Selection


 Remove unimportant or duplicate features to improve model accuracy and speed.

Types of Feature Selection Techniques:


1.) Filter Methods: Use statistics to pick features
 Variance Threshold: Remove features that don’t change much (low variance).
 Correlation Analysis: Remove features that are too similar (highly correlated).
 Chi-Square Test: Find which features are important for categorical data.

2.) Wrapper Methods: Use a model to decide which features are best
 Forward Selection: Start with no features, add them one by one, and check performance.
 Backward Elimination: Start with all features, remove the least useful ones one by one.
 Recursive Feature Elimination (RFE): Train a model multiple times and remove the least
important features step by step.

3.) Embedded Methods: Feature selection happens inside the model


 Lasso Regression (L1 Regularization): Reduces the impact of less useful features by
shrinking their values to zero.
 Decision Trees & Random Forests: Rank features based on how much they help in
making predictions.
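As one illustration of the filter methods above, a correlation check can be sketched in base R (the 0.9 cut-off is an arbitrary choice):

cor_matrix <- cor(mtcars)   # pairwise correlations on a numeric dataset

# flag pairs of features whose absolute correlation exceeds 0.9
high_cor <- which(abs(cor_matrix) > 0.9 & upper.tri(cor_matrix), arr.ind = TRUE)
high_cor                    # candidate features to drop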

B.) Feature Extraction


 Create new, better features by transforming the original data.

Types of Feature Extraction Techniques:


1) Principal Component Analysis (PCA)
 Reduces the number of features while keeping the most important information.
 Converts similar features into new independent features.

2) Linear Discriminant Analysis (LDA)


 Similar to PCA but focuses on separating different categories of data.

3) t-SNE & UMAP


 Reduce the number of features for better visualization (useful for graphs and clustering).

4) Feature Engineering: Manually creating new features from existing data


 Polynomial Features: Create new features by squaring or cubing existing values (e.g., x²,
x³).

53
 Log Transformation: Convert data into a better scale to handle uneven distributions.

 Binning: Group numerical values into categories (e.g., age groups like 0-18, 19-35, 36+).
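As a brief illustration of feature extraction, PCA can be run in base R with prcomp() (using the built-in mtcars data):

pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)   # scale features before PCA
summary(pca)          # proportion of variance explained by each component
head(pca$x[, 1:2])    # first two principal components as new features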

Data Splitting
In R, data splitting is commonly used to divide datasets into training and testing (or validation)
sets for machine learning and statistical modeling.

Using rsample::initial_split()
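A sketch of an 80-20 split with the rsample package (install it first if needed):

library(rsample)

set.seed(123)                              # for reproducibility
split <- initial_split(mtcars, prop = 0.8)
train_data <- training(split)              # 80% training set
test_data  <- testing(split)               # 20% testing set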

54
Chapter 4:
Data Visualization in R

4.1 Overview of Data Visualization


Data visualization is the process of representing data graphically or visually to help people
understand patterns, trends, and insights from the data. It involves using charts, graphs, maps, and
other visual elements to make complex information more accessible and interpretable.

Importance of Data Visualization


 Enhances data comprehension and decision-making.
 Simplifies complex datasets for better communication.
 Helps in identifying correlations and trends.
 Aids in detecting anomalies and outliers.

Common Types of Data Visualization


1. Bar Chart – Represents categorical data using rectangular bars. It shows comparisons
between categories.
2. Line Chart – Shows trends over time with connected data points.
3. Pie Chart – Displays proportions in a circular format.
4. Histogram – Represents the distribution of numerical data.
5. Scatter Plot – Shows relationships between two numerical variables.
6. Heatmap – Uses color gradients to represent data values.
7. Box Plot – Summarizes data distribution using quartiles.
8. Tree Map – Visualizes hierarchical data as nested rectangles.

4.2 Introduction to Data Visualization in R


R provides powerful visualization tools, including Base R graphics and the ggplot2 package.

Basic Plotting Functions in Base R


 plot(): Generic function for scatter plots and line plots.
 barplot(): Creates bar charts.
 hist(): Generates histograms.
 boxplot(): Displays box-and-whisker plots.
 pie(): Produces pie charts.

plot() – Scatter Plots & Line Plots


The plot() function is a generic function used for creating different types of plots, including scatter
plots and line plots.
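A sketch consistent with the explanation below; x and y are made-up data points:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

plot(x, y,
     main = "Scatter Plot Example",
     xlab = "X values", ylab = "Y values",
     col = "blue", pch = 16)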

55
Explanation:
 x, y are data points.
 main specifies the title.
 xlab and ylab label the axes.
 col sets the color of points.
 pch=16 changes the point style.

Line Plot
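A sketch matching the explanation that follows, reusing the same made-up x and y:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

plot(x, y, type = "l", col = "blue", lwd = 2,
     main = "Line Plot Example", xlab = "X values", ylab = "Y values")
points(x, y, col = "red", pch = 16)   # add red dots at the data points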

Explanation:
 type="l" specifies a line plot.
 col="blue" sets the line color.
 lwd=2 makes the line thicker.
 points(x, y, col="red", pch=16) adds red dots at data points.

barplot() – Bar Charts


Used for visualizing categorical data.
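A sketch with hypothetical category counts:

heights <- c(10, 25, 15, 30)
barplot(heights,
        names.arg = c("A", "B", "C", "D"),
        col = c("red", "green", "blue", "orange"),
        main = "Bar Chart Example")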

56
Explanation:
 heights represents bar heights.
 names.arg assigns category names to bars.
 col assigns different colors to bars.

hist() – Histograms
Used for visualizing the distribution of numerical data.
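A sketch with made-up values:

values <- c(5, 7, 8, 9, 10, 12, 13, 13, 15, 18)
hist(values, breaks = 5, col = "lightblue", main = "Histogram Example")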

Explanation:
 breaks=5 defines the number of bins.
 col="lightblue" sets the bar color.

boxplot() – Box-and-Whisker Plots


Used for showing data distribution and detecting outliers.
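A sketch with made-up scores; the extreme value should appear as an outlier point:

scores <- c(55, 60, 62, 65, 70, 72, 75, 95)
boxplot(scores, col = "purple", main = "Box Plot Example")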

Explanation:
 Displays median, quartiles, and outliers.
 col="purple" sets the box color.

pie() – Pie Charts


Used for visualizing proportions.
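A sketch with hypothetical proportions:

slices <- c(40, 30, 20, 10)
pie(slices,
    labels = c("North", "South", "East", "West"),
    col = c("red", "blue", "green", "yellow"),
    main = "Pie Chart Example")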

57
Explanation:
 slices represents portions.
 col sets segment colors.

heatmap( ): A heatmap is used to visualize matrix-like data, where values are represented using colors.
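A sketch consistent with the explanation below:

m <- matrix(rnorm(100), nrow = 10)                          # random 10 x 10 numeric matrix
heatmap(m, col = heat.colors(10), main = "Heatmap Example")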

Explanation:
 matrix(rnorm(100), nrow=10) creates a random numeric matrix with 10 rows and 10
columns.
 heatmap() generates a heatmap.
 heat.colors(10) applies a color gradient.

58
Chapter 5:
Data Analysis Techniques

Data analysis techniques are systematic methods used to inspect, clean, transform, and model data
to extract useful insights, support decision-making, and predict future trends. These techniques
help organizations understand patterns, relationships, and anomalies in data.

Types of Data Analysis Technique


Data analysis techniques can be broadly categorized into several types based on the purpose and
nature of the analysis. The main types of data analysis techniques include the following:

1.) Descriptive Analysis


 Descriptive analysis is a data analysis technique used to summarize and interpret historical
data to understand past trends and patterns. It helps in organizing raw data into a meaningful
format, making it easier to interpret and draw insights.
 It is used to examine historical data, providing a comprehensive summary of past events. It
involves aggregating, organizing, and interpreting raw data to identify patterns, trends, and
distributions.
 By employing statistical measures such as central tendency (mean, median, mode) and
dispersion (standard deviation, variance), along with data visualization techniques (charts,
graphs, histograms), descriptive analysis enables a clearer understanding of underlying data
characteristics and informs decision-making processes.

Characteristics of Descriptive Analysis


 Summarization: Condenses large datasets into key statistics.
 Pattern Recognition: Identifies trends and distributions in data.
 No Predictions: Unlike predictive analysis, it does not forecast future outcomes

Common Methods Used:


 Mean (Average): The sum of all values divided by the total number of values.
 Median: The middle value when all values are arranged in order.
 Mode: The value that appears most often in a dataset.
 Standard Deviation: Measures how much data varies from the average.
 Visualizations: Charts, graphs, and tables make data easier to understand.

Example: Sales reports showing total revenue per quarter.


 A company’s sales report shows total revenue earned in each quarter (every three months).
This helps managers see which months had higher or lower sales.
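A small sketch of descriptive analysis in R, using made-up quarterly revenue figures:

revenue <- c(120, 150, 135, 180)   # revenue per quarter (hypothetical)

mean(revenue)      # average revenue
median(revenue)    # middle value
sd(revenue)        # spread around the average
summary(revenue)   # min, quartiles, median, mean, max
barplot(revenue, names.arg = c("Q1", "Q2", "Q3", "Q4"), main = "Revenue per Quarter")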

59
2.) Diagnostic Analysis
Diagnostic analysis is a data analysis method used to investigate and determine the causes behind
past events or trends. It goes beyond descriptive analysis by answering the question: "Why did this
happen?"

Characteristics
 Cause-and-Effect Investigation: Identifies factors that contributed to a specific outcome.
 Deeper Data Exploration: Uses advanced techniques to uncover hidden relationships in
data.
 Data-Driven Decision Making: Helps organizations understand issues and take corrective
actions.

Common Methods Used:


 Drill-Down Analysis: Examines data at different levels of detail to find patterns.
 Correlation Analysis: Determines relationships between variables (e.g., how website traffic
relates to marketing efforts).
 Root Cause Analysis (RCA): Identifies the fundamental reason behind an issue.

Example: Identifying the reason for a sudden drop in website traffic.


 If a website experiences a sudden drop in traffic, diagnostic analysis can help identify the
reason, such as a technical issue, search engine ranking changes, or reduced marketing
efforts.

3.) Predictive Analysis


Predictive analysis is a data-driven technique that leverages statistical models and machine
learning algorithms to forecast future outcomes based on historical data. It answers the question:
"What is likely to happen next?"

Characteristics:
 Forecasting Future Trends: Uses past data patterns to predict future events.
 Probabilistic Outcomes: Provides likelihood estimates rather than exact predictions.
 Data-Driven Decision Making: Helps organizations anticipate risks and opportunities.

Common Techniques used:


 Regression Analysis: Identifies relationships between variables to make predictions.
 Time Series Forecasting: Analyzes trends over time to predict future values.
 Classification: Categorizes data into predefined groups (e.g., predicting customer churn).
 Clustering: Groups similar data points to uncover hidden patterns.

Example: Predicting customer churn in a subscription-based service.


 A subscription-based service can use predictive analysis to anticipate customer churn by
analyzing past usage patterns, engagement levels, and demographic factors, allowing the
company to take proactive retention measures.

60
4.) Prescriptive Analysis
Prescriptive analysis is an advanced form of data analysis that not only predicts future outcomes
but also provides recommendations on the best course of action to achieve desired results. It helps
organizations make data-driven decisions by answering the question: "What should be done
next?" It combines descriptive analysis (what happened) and predictive analysis (what might
happen) to suggest the best possible course of action to achieve desired outcomes.

Characteristics:
 Action-Oriented: Focuses on suggesting specific actions to optimize outcomes.
 Decision Support: Helps businesses and individuals make informed choices based on data
insights.
 Advanced Analytics: Uses machine learning, artificial intelligence, and mathematical
optimization techniques.

Techniques Used:
 Optimization Algorithms: Determines the best possible solution for a given scenario.
 Decision Trees: Models various decision paths and their possible outcomes.
 Artificial Intelligence (AI): Automates complex decision-making processes.

Example: Recommending the best pricing strategy for an e-commerce platform


 An online retail company can use prescriptive analysis to determine the best pricing
strategy by considering factors such as competitor prices, customer demand, and seasonal
trends, ensuring maximum sales and profitability.

5.) Exploratory Data Analysis (EDA)


EDA is a process used to analyze and summarize datasets by identifying patterns, relationships,
and anomalies. It is typically performed before applying formal modeling techniques to gain
insights and make informed decisions.

Objectives:
 Understand Data Structure: Identify key variables, data types, and distributions.
 Detect Patterns and Trends: Discover relationships between variables.
 Identify Anomalies: Find missing values, outliers, or inconsistencies.
 Guide Further Analysis: Helps decide which statistical models or machine learning
techniques to use.

Common Techniques Used in EDA:


 Summary Statistics: Mean, median, mode, standard deviation, variance.
 Data Visualization: Histograms, scatter plots, box plots, and correlation matrices.
 Missing Value Analysis: Identifies gaps in the dataset.
 Feature Engineering: Creates new variables to enhance model performance.

61
Example: Analyzing customer demographics to find buying patterns.
 A data scientist analyzing customer purchase behavior might use EDA to visualize
spending patterns, identify the most common products bought together, and detect
unusual transactions before building a predictive model.

6.) Inferential Analysis


Inferential analysis is a statistical approach used to draw conclusions about a larger population
based on a sample of data. It helps in making predictions, testing hypotheses, and determining
relationships between variables.

Objectives:
 Generalization: Extends findings from a sample to a broader population.
 Hypothesis Testing: Determines whether observed patterns are statistically significant.
 Prediction and Estimation: Estimates unknown population parameters based on sample
data.

Common Techniques used:


 Hypothesis Testing: t-tests, chi-square tests, ANOVA (Analysis of Variance).
 Confidence Intervals: Provides a range in which the true population parameter is likely
to fall.
 Regression Analysis: Establishes relationships between variables (e.g., linear regression).
 Correlation Analysis: Measures the strength and direction of relationships between
variables.

Example: Determining if a new drug has a significant effect compared to a placebo.


 A pharmaceutical company tests a new drug on a sample of patients and uses inferential
analysis to determine whether the drug will be effective for the entire population. By
using statistical tests, the company can estimate the drug’s effectiveness with a certain
level of confidence.
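A sketch of such a hypothesis test in R, with made-up improvement scores for the two groups:

drug    <- c(5.1, 6.0, 5.8, 6.4, 5.9)   # treatment group (hypothetical data)
placebo <- c(4.2, 4.8, 5.0, 4.5, 4.9)   # placebo group (hypothetical data)

result <- t.test(drug, placebo)
result$p.value   # a value below 0.05 suggests a statistically significant difference at the 5% level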

7.) Text Analysis (Text Mining)


Text mining, also known as text analytics, is the process of extracting meaningful insights,
patterns, and knowledge from large volumes of unstructured text data. It involves techniques from
natural language processing (NLP), machine learning, and statistics to analyze and interpret text.

Objectives:
 Information Extraction: Identifies key entities, phrases, and relationships within text.
 Pattern Recognition: Finds trends, sentiments, and recurring themes in textual data.
 Data Structuring: Converts unstructured text into structured data for analysis.

Common Techniques Used:


 Natural Language Processing (NLP): Enables machines to understand human language.
 Sentiment Analysis: Determines whether text expresses positive, negative, or neutral
sentiment.

62
 Topic Modeling: Identifies topics present in a collection of documents (e.g., Latent
Dirichlet Allocation (LDA)).
 Text Classification: Categorizes text into predefined labels (e.g., spam detection).
 Named Entity Recognition (NER): Identifies names, locations, organizations, and other
key entities.
 Keyword Extraction: Identifies the most important words or phrases in a text.

Example: Analyzing customer reviews to identify common complaints.


 A company uses text mining to analyze customer reviews and determine common
complaints or praises. Sentiment analysis helps them understand overall customer
satisfaction, while keyword extraction highlights frequent issues.

8.) Time Series Analysis


Time series analysis is a statistical technique used to analyze data points collected or recorded at
specific time intervals. It helps in identifying trends, seasonal patterns, and fluctuations over time,
making it useful for forecasting future values.

Objectives:
 Trend Analysis: Identifies long-term patterns in data.
 Seasonality Detection: Recognizes repeating cycles or patterns within a fixed time
period.
 Forecasting: Predicts future values based on historical data.
 Anomaly Detection: Identifies unexpected changes or outliers in time-based data.

Common Techniques Used:


 Moving Averages: Smooths fluctuations to identify trends.
 Autoregressive Integrated Moving Average (ARIMA): A model used for time series
forecasting.
 Exponential Smoothing: Assigns different weights to past observations for prediction.
 Seasonal Decomposition of Time Series (STL): Separates time series data into trend,
seasonal, and residual components.

Example: Stock price prediction or weather forecasting.


 A retail company uses time series analysis to forecast future sales based on past sales data,
accounting for seasonal peaks during holiday seasons and long-term growth trends.

9.) Spatial Analysis


Spatial analysis is a technique used to examine, interpret, and model spatial data to identify
patterns, relationships, and trends based on geographic or spatial characteristics. It is widely used
in geographic information systems (GIS), urban planning, environmental science, and logistics.

Objectives:
 Pattern Recognition: Identifies spatial distributions and relationships in data.
 Proximity Analysis: Measures distances between locations and finds nearest points of
interest.

63
 Spatial Prediction: Uses geographic trends to forecast future outcomes.
 Cluster Detection: Groups similar spatial points to identify trends or anomalies.

Common Techniques Used:


 Geocoding: Assigns geographic coordinates to data points (e.g., addresses).
 Spatial Interpolation: Estimates unknown values based on nearby data points.
 Heat Maps: Visualizes density or intensity of spatial data.
 Network Analysis: Examines connectivity and optimal routing (e.g., shortest path
algorithms).
 Spatial Regression: Analyzes relationships between spatial variables.

Example: Mapping disease outbreaks to identify high-risk areas.


 A city government uses spatial analysis to determine the best locations for new fire
stations by analyzing population density, emergency response times, and historical fire
incident data.

10.) Network Analysis


Network analysis is the study of relationships and connections between entities (nodes) in a
network. It helps in understanding the structure, behavior, and interactions within a system, such
as social networks, computer networks, transportation systems, or biological systems.

Objectives:
 Understanding Relationships: Examines how different entities (nodes) are connected.
 Identifying Key Influencers: Finds the most important nodes in a network.
 Detecting Communities: Identifies clusters or groups of closely connected nodes.
 Optimizing Network Flow: Analyzes efficiency and bottlenecks in a system.

Common Techniques Used:


 Graph Theory: Represents networks as nodes (points) and edges (connections).
 Centrality Measures: Identifies key nodes using metrics like degree centrality,
betweenness centrality, and closeness centrality.
 Community Detection: Finds clusters of highly interconnected nodes (e.g., modularity-
based methods).
 Shortest Path Analysis: Determines the most efficient route between nodes (e.g.,
Dijkstra’s algorithm).
 Network Visualization: Uses tools like Gephi, Cytoscape, or Python libraries (NetworkX)
to display networks.

Example: Identifying influencers in a social media network.


 A company performs network analysis on social media data to identify influential users
who can help promote a product. By analyzing connections, they determine which users
have the most influence based on their interactions with others.

64
