Data Analysis
Reading external data into R is a fundamental task for data analysis. R offers several functions and
packages to read data from various file formats and sources. Here are some common methods to read
external data into R:
Reading from Text Files: You can use functions like read.table() or read.csv() to read data from text files.
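For example, a CSV file can be read with read.csv(). The sketch below writes a small sample file first so it is self-contained; the file contents and column names are invented for illustration.

```r
# Write a small sample CSV to a temporary file, then read it back
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Alice,90", "Bob,85"), csv_path)

data <- read.csv(csv_path)  # read.table(csv_path, sep = ",", header = TRUE) also works
print(data)
str(data)   # inspect column types: name is character, score is integer
```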
Reading from Excel Files: If your data is in Excel format, you can use the readxl package to read it.
install.packages("readxl")
library(readxl)
data <- read_excel("mydata.xlsx")  # hypothetical path; reads the first sheet by default
Reading from Databases: R provides various packages to connect to databases and fetch data. For
example, you can use RODBC or DBI along with database-specific packages like RMySQL, RPostgreSQL,
etc.
install.packages("RMySQL")
library(RMySQL)
con <- dbConnect(MySQL(), user = "user", password = "pass", dbname = "mydb", host = "localhost")  # hypothetical credentials
df <- dbGetQuery(con, "SELECT * FROM mytable")  # hypothetical table
Reading from APIs: If your data is available through an API, you can use packages like httr or jsonlite to
interact with APIs and fetch data.
# Example using jsonlite package to fetch data from a JSON API
install.packages("jsonlite")
library(jsonlite)
data <- fromJSON("https://api.example.com/data")  # hypothetical API endpoint
Reading from Web Scraping: You can use packages like rvest to scrape data from websites.
install.packages("rvest")
library(rvest)
page <- read_html("https://example.com")           # hypothetical URL
headings <- html_text(html_elements(page, "h1"))   # extract the text of all <h1> elements
R – Line Graphs
A line graph is a chart that is used to display information in the form
of a series of data points. It utilizes points and lines to represent change
over time. The plot() function in R is used to create line graphs.
v <- c(17, 25, 38, 13, 41)
plot(v, type = "o")  # plot the line chart
Histograms in R
A histogram displays rectangular bars whose heights are proportional to the frequency of a
variable within successive numerical intervals (bins). Histograms are created in R Programming
Language using the hist() function.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)
# Create the histogram.
hist(v, xlab = "Weight", ylab = "Mileage", col = "green")
R – Pie Charts
A pie chart is a circular statistical graphic divided into slices to illustrate numerical
proportions; each slice shows the relative size of one category of the data. R uses the
function pie() to create pie charts.
geeks <- c(23, 56, 20, 63)                             # sample values (illustrative)
labels <- c("Mumbai", "Pune", "Chennai", "Bangalore")  # sample labels (illustrative)
pie(geeks, labels)
Boxplots in R
A boxplot is a chart that displays the distribution of data through its quartiles, drawing one
box per group. Boxplots are created in R by using the boxplot() function.
# Boxplot of displacement grouped by gear (built-in mtcars data, assumed from the labels)
boxplot(disp ~ gear, data = mtcars,
        xlab = "Gear",
        ylab = "Displacement")
Random Forest in R Programming
Random Forest in R Programming is an ensemble of decision trees. It builds and combines multiple
decision trees to get more accurate predictions. It is a non-linear classification algorithm. Each
tree is grown from a bootstrap sample, so some cases are left out of that tree's construction; the
error measured on those left-out cases is called the out-of-bag (OOB) error estimate and is
reported as a percentage.
Grow a decision tree from a bootstrap sample. At each node of the tree, randomly select d features.
Split the node using the feature (variable) that provides the best split according to an objective function.
Aggregate the predictions of all trees for a new data point and assign the class label by majority vote.
Example:
Consider a fruit box consisting of three fruits, Apples, Oranges, and Cherries, in the training
data, i.e., n = 3. We are predicting the fruit which is maximum in number in the fruit box. A
random forest model is built using the training data with the number of trees k = 3.
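The steps above can be sketched in base R. This is a toy majority-vote ensemble of one-feature "stumps", not the randomForest package; the data, split rule, and function names are invented for illustration.

```r
set.seed(42)
# Toy training data: two numeric features and a binary class
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
train <- data.frame(x1, x2, y = factor(ifelse(x1 + x2 > 0, "A", "B")))

# Grow k stumps: each from a bootstrap sample, splitting on one randomly chosen feature
k <- 25
stumps <- lapply(seq_len(k), function(i) {
  boot <- train[sample(n, n, replace = TRUE), ]  # bootstrap sample
  feat <- sample(c("x1", "x2"), 1)               # random feature subset (d = 1)
  split <- median(boot[[feat]])
  list(feat = feat, split = split,
       left  = names(which.max(table(boot$y[boot[[feat]] <= split]))),
       right = names(which.max(table(boot$y[boot[[feat]] >  split]))))
})

# Aggregate: assign the class label for a new data point by majority vote
predict_forest <- function(newx) {
  votes <- vapply(stumps, function(s) {
    if (newx[[s$feat]] <= s$split) s$left else s$right
  }, character(1))
  names(which.max(table(votes)))
}

predict_forest(list(x1 = 2, x2 = 2))  # vote for a point deep in class "A" territory
```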
Decision Tree in R Programming
Decision Trees are useful supervised machine learning algorithms that can perform both
regression and classification tasks. A tree is characterized by nodes and branches: the tests on
each attribute are represented at the nodes, the outcomes of those tests are represented by the
branches, and the class labels are represented at the leaf nodes.
These types of tree-based algorithms are one of the most widely used algorithms due to the fact that
these algorithms are easy to interpret and use.
Categorical Variable Decision Tree: This refers to the decision trees whose target variables have limited
value and belong to a particular group.
Continuous Variable Decision Tree: This refers to the decision trees whose target variables can take
values from a wide range of data types.
Pruning: This refers to the process wherein the branch nodes are turned into leaf nodes which results in
the shortening of the branches of the tree.
Selection of the tree: The main goal of this process is to select the smallest tree that fits the data due to
the reasons discussed in the pruning section.
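As a sketch, a classification tree can be fitted with the rpart package (one common choice, shipped with standard R distributions); the data and new-flower measurements below are illustrative.

```r
library(rpart)

# Fit a classification tree on the built-in iris data
fit <- rpart(Species ~ ., data = iris, method = "class")

# Predict the class of a new flower (measurements are illustrative)
new_flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.5,
                         Petal.Length = 1.4, Petal.Width = 0.2)
predict(fit, new_flower, type = "class")
```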
Important factors
Entropy:
Mainly used to measure the impurity of a given sample. If the sample is completely homogeneous
(all one class), the entropy is 0; if it is evenly partitioned between classes, the entropy is 1.
Information Gain:
Statistical property which measures how well training examples are separated based on the target
classification.
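Entropy and information gain can be computed directly in base R; the function names and tiny label vectors below are illustrative, not from the original text.

```r
# Shannon entropy (log base 2) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

entropy(c("A", "A", "A", "A"))  # 0: completely homogeneous sample
entropy(c("A", "A", "B", "B"))  # 1: evenly partitioned sample

# Information gain of splitting labels y by a categorical attribute x
info_gain <- function(y, x) {
  groups <- split(y, x)
  weighted <- sum(sapply(groups, function(g) length(g) / length(y) * entropy(g)))
  entropy(y) - weighted
}

# A perfect split recovers all the entropy
info_gain(c("A", "A", "B", "B"), c("left", "left", "right", "right"))  # 1
```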
Normal Distribution in R
Normal Distribution is a probability function used in statistics that tells about how the data values are
distributed. It is the most important probability distribution function used in statistics because of its
advantages in real case scenarios.
It is generally observed that data distribution is normal when there is a random collection of data from
independent sources.
pnorm(): this function is the cumulative distribution function; it measures the probability that a
random variable X takes a value less than or equal to x.
rnorm(): this function is used to generate a vector of normally distributed random numbers.
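A minimal sketch of these two functions using base R (the mean and standard deviation are chosen arbitrarily for illustration):

```r
set.seed(1)
# rnorm(): generate 1000 normally distributed values with mean 50 and sd 5
x <- rnorm(1000, mean = 50, sd = 5)
mean(x)   # close to 50

# pnorm(): P(X <= 50) for N(50, 5) is exactly 0.5, since 50 is the mean
pnorm(50, mean = 50, sd = 5)

# Probability of falling within one standard deviation of the mean (about 0.68)
pnorm(55, 50, 5) - pnorm(45, 50, 5)
```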
Binomial distribution helps us to find the individual probabilities as well as cumulative probabilities over
a certain range.
dbinom() Function: This function is used to find the probability at a particular value k for data
that follow a binomial distribution.
Syntax: dbinom(k, n, p)
pbinom() Function: This function is used to find the cumulative probability, P(X <= k), of data
following a binomial distribution up to a given value k.
Syntax: pbinom(k, n, p)
qbinom() Function: This function is used to find the quantile; that is, given P(X <= k), it finds k.
Syntax: qbinom(P, n, p)
rbinom() Function: This function generates n random values from a binomial distribution with N trials and success probability p.
Syntax: rbinom(n, N, p)
where k is the number of successes, n is the number of trials (in rbinom(), n is the number of
random values to generate and N is the number of trials), p is the probability of success on each
trial, and P is the cumulative probability.
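The four functions can be illustrated with a coin-flip example (10 trials, p = 0.5):

```r
# 10 coin flips (n = 10 trials) with success probability p = 0.5
dbinom(5, size = 10, prob = 0.5)     # P(exactly 5 heads) = 252/1024
pbinom(5, size = 10, prob = 0.5)     # P(at most 5 heads)
qbinom(0.5, size = 10, prob = 0.5)   # smallest k with P(X <= k) >= 0.5
set.seed(1)
rbinom(3, size = 10, prob = 0.5)     # three simulated head counts
```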
Converting Data to a Time Series Object: This is often done using the ts() function for equally
spaced time series or the xts() function (from the xts package) for irregularly spaced time series.
Exploratory Data Analysis (EDA): Perform exploratory data analysis to understand the characteristics of
your time series data.
Modeling Time Series: Fit appropriate models to your time series data. This can include simple models
like ARIMA (autoregressive integrated moving average).
Forecasting: Use fitted models to forecast future values of the time series.
Model Evaluation: Evaluate the performance of your models using appropriate metrics such as mean
absolute error (MAE) or mean squared error (MSE).
Visualization: Visualize your forecasts along with the historical data to understand the model's
performance.
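The conversion, modeling, and forecasting steps above can be sketched in base R (the monthly values below are hypothetical, and the AR(1) model is an arbitrary simple choice):

```r
# Hypothetical monthly values converted to a time series object starting Jan 2020
v <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118)
ts_data <- ts(v, start = c(2020, 1), frequency = 12)

# Fit a simple AR(1) model, i.e. ARIMA(1,0,0), and forecast 3 steps ahead
fit <- arima(ts_data, order = c(1, 0, 0))
fc <- predict(fit, n.ahead = 3)
fc$pred   # point forecasts for the next three months
```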
Linear Regression
Linear Regression is a commonly used type of predictive analysis. Linear Regression is a statistical
approach for modelling the relationship between a dependent variable and a given set of independent
variables.
Linear regression shows the linear relationship: it finds how the value of the dependent
variable changes according to the value of the independent variable.
Positive Linear Relationship: If the dependent variable increases on the Y-axis and the independent
variable increases on the X-axis, then such a relationship is termed as a Positive linear relationship.
Negative Linear Relationship: If the dependent variable decreases on the Y-axis and independent
variable increases on the X-axis, then such a relationship is called a negative linear relationship.
Linear relationship: Linear regression assumes a linear relationship between the dependent and
independent variables; when the true relationship is non-linear, this assumption makes it
difficult to capture the real relationship between the predictors and the target variable.
Homoscedasticity: Homoscedasticity is a situation when the error term is the same for all the values of
independent variables.
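A minimal simple-regression sketch using the built-in mtcars data (the choice of variables is illustrative, not from the original text):

```r
# Simple linear regression: model fuel efficiency (mpg) as a function of car weight (wt)
model <- lm(mpg ~ wt, data = mtcars)
coef(model)   # intercept and slope; the slope is negative (heavier cars use more fuel)

# Predicted mpg for a car weighing 3000 lbs (wt is in 1000-lb units)
predict(model, data.frame(wt = 3))
```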
Multiple Linear Regression: This is the most common form of linear regression. It describes how a
single response variable Y depends linearly on a number of predictor variables.
The basic examples where Multiple Regression can be used are as follows:
The height of a child can depend on the height of the mother, the height of the father, nutrition, and
environmental factors.
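A multiple-regression sketch, again on the built-in mtcars data with illustrative predictors:

```r
# Multiple linear regression: mpg modeled from weight (wt) and horsepower (hp)
model <- lm(mpg ~ wt + hp, data = mtcars)
coef(model)   # one intercept plus one coefficient per predictor

# Predicted mpg for a hypothetical car: 2500 lbs, 120 hp
predict(model, data.frame(wt = 2.5, hp = 120))
```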
Logistic Regression
Logistic regression in R Programming is a classification algorithm used to find the probability of
event success and event failure. Logistic regression is used when the dependent variable is
binary (0/1, True/False, Yes/No) in nature. The logit function is used as the link function in a
binomial distribution. Logistic regression is also known as binomial logistic regression. It is
based on the sigmoid function, whose output is a probability and whose input can range from
-infinity to +infinity.
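A minimal sketch using glm() on the built-in mtcars data; modeling transmission type (am) from mpg is an illustrative choice, not from the original text:

```r
# Logistic regression: probability of a manual transmission (am = 1) given mpg
model <- glm(am ~ mpg, data = mtcars, family = binomial)
coef(model)   # positive mpg coefficient: fuel-efficient cars are more often manual

# Predicted probability of a manual transmission for a car with mpg = 25
p <- predict(model, data.frame(mpg = 25), type = "response")
p
```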
Survival Analysis
Survival analysis in R Programming Language deals with the time until an event of interest
occurs. When the event does not occur for a subject within the study period, the result is a
censored observation.
# Installing package
install.packages("survival")
# Loading package
library(survival)
# Dataset information
?lung
# Fitting the Kaplan-Meier survival function
Survival_Function <- survfit(Surv(time, status) ~ 1, data = lung)
# Plotting the function
plot(Survival_Function)
Kaplan-Meier Method
The Kaplan-Meier method estimates the survival distribution using the Kaplan-Meier estimator for
truncated or censored data. It is a non-parametric statistic that allows us to estimate the
survival function without relying on an underlying probability distribution.
The Kaplan-Meier estimates are based on the number of patients (each patient is a row of data),
out of the total number of patients, who survive for a certain time after treatment (the event).
Cox Proportional Hazards Model
The Cox proportional hazards model is a regression model that measures the instantaneous risk of
death and is a bit more difficult to illustrate than the Kaplan-Meier estimator. It uses a hazard
function h(t), which describes the instantaneous risk of the event occurring at a particular time t.
It does not assume an underlying probability distribution, but it does assume that the hazard
ratio between the patient groups we compare is constant over time.
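A minimal sketch of fitting a Cox model on the same built-in lung dataset; age and sex are chosen as illustrative covariates.

```r
library(survival)  # survival ships with standard R distributions

# Cox proportional hazards model: risk as a function of age and sex
cox_model <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox_model)

# Exponentiated coefficients are hazard ratios
exp(coef(cox_model))
```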
Prescriptive Analytics
Prescriptive Analytics can be defined as a type of data analytics that uses algorithms and analysis
of raw data to achieve better and more effective decisions over both the short and the long term.
It suggests a strategy across possible scenarios.
Prescriptive Analytics is the area of Business Analytics dedicated to finding the best solution
for day-to-day problems.
Creating data for analytics through designed experiments
Creating data for prescriptive analytics often involves designing experiments to generate data that can
be used to build and validate prescriptive models.
Define the Objective: Clearly define the objective of the prescriptive analytics project.
Identify Factors: Identify the factors (variables) that influence the outcome of interest and any
constraints that need to be considered.
Design Experiments or Simulations: Design experiments or simulations to systematically vary the factors
and observe the outcomes.
Collect and Prepare Data: Collect the data generated from experiments and prepare it for analysis.
Build Prescriptive Models: Use the collected data to build models that relate the factors
Validate models: Validate the prescriptive models using techniques like cross-validation, sensitivity
analysis
Creating data for analytics through active learning
Creating data for prescriptive analytics through active learning involves iteratively selecting
and labeling data points to train predictive models, optimizing the model's performance over time.
Define the Objective: Clearly define the objective of the prescriptive analytics project.
Identify Factors: Identify the factors (variables) that influence the outcome of interest and any
constraints that need to be considered.
Initial Data Collection: Start with an initial dataset that includes historical data relevant to the
decision-making process.
Select Initial Training Set: Select a small subset of the initial dataset as the initial training set for the
prescriptive model.
Train Initial Model: Train an initial prescriptive model using the selected training set.
Deploy Initial Model: Deploy the initial model into the decision-making process to start making
recommendations.
Iterative Data Selection: Use active learning strategies to iteratively select additional data
points for labeling.
Creating data for analytics through reinforcement learning
Define the Objective: Clearly define the objective of the prescriptive analytics project.
Identify Factors: Identify the factors (variables) that influence the outcome of interest and any
constraints that need to be considered.
Design State Space: Define the state space, which represents the possible states of the environment.
Define Action Space: Define the action space, which represents the possible actions or decisions that the agent
can take in each state.
Model Reward Function: Define a reward function that quantifies the desirability of different outcomes or
decisions.
Train Reinforcement Learning Agent: Train a reinforcement learning agent using the initial dataset and the
defined state and action spaces.
Deploy Agent: Deploy the trained reinforcement learning agent into the decision-making process.