Data Analysis 2

Data analysis class notes 2

Reading and getting data into R (External Data)

Reading external data into R is a fundamental task for data analysis. R offers several functions and
packages to read data from various file formats and sources. Here are some common methods to read
external data into R:

Reading from Text Files: You can use functions like read.table() or read.csv() to read data from text files.

# Reading from a CSV file
data <- read.csv("data.csv")

Reading from Excel Files: If your data is in Excel format, you can use the readxl package to read it.

# Install and load the readxl package
install.packages("readxl")
library(readxl)

# Reading from an Excel file
data <- read_excel("data.xlsx")

Reading from Databases: R provides various packages to connect to databases and fetch data. For
example, you can use RODBC or DBI along with database-specific packages like RMySQL, RPostgreSQL,
etc.

# Example using the RMySQL package
install.packages("RMySQL")
library(RMySQL)

# Connect to the MySQL database
con <- dbConnect(MySQL(), user = "username", password = "password",
                 dbname = "database_name", host = "host_name")

# Fetch data from a table
data <- dbGetQuery(con, "SELECT * FROM table_name")

Reading from APIs: If your data is available through an API, you can use packages like httr or jsonlite to interact with APIs and fetch data.

# Example using the httr and jsonlite packages to fetch data from a JSON API
install.packages(c("httr", "jsonlite"))
library(httr)
library(jsonlite)

# Fetch data from the API
url <- "https://api.example.com/data"
response <- GET(url)
data <- fromJSON(content(response, "text"))

Reading from Web Scraping: You can use packages like rvest to scrape data from websites.

install.packages("rvest")
library(rvest)

# Scraping data from a website
url <- "https://www.example.com"
webpage <- read_html(url)
data <- html_table(webpage)[[1]]

R – Line Graphs
A line graph is a chart that is used to display information in the form of a series of data points. It utilizes points and lines to represent change over time. The plot() function in R is used to create line graphs.

Syntax: plot(v, type, col, xlab, ylab)


# Create the data for the chart
v <- c(17, 25, 38, 13, 41)

# Plot the line graph
plot(v, type = "o")


R – Bar Charts
A bar chart also known as bar graph is a pictorial representation of data
that presents categorical data with rectangular bars with heights or
lengths proportional to the values that they represent. R uses
the barplot() function to create bar charts.

Syntax: barplot(H, xlab, ylab, main, names.arg, col)


# Create the data for the chart
A <- c(17, 32, 8, 53, 1)

# Plot the bar chart
barplot(A, xlab = "X-axis", ylab = "Y-axis", main = "Bar-Chart")

Histograms in R
A histogram displays statistical information using rectangular bars whose heights are proportional to the frequency of a variable within successive numerical intervals. Histograms are created in R Programming Language using the hist() function.

Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)

# Create data for the graph
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)

# Create the histogram
hist(v, xlab = "No. of Articles", col = "green", border = "black")


Scatter plots in R
A scatter plot is a set of points representing individual data values on the horizontal and vertical axes. A scatter plot is created in R using the plot() function.

Syntax: plot(x, y, main, xlab, ylab, xlim, ylim, axes)

input <- mtcars[, c('wt', 'mpg')]

plot(x = input$wt, y = input$mpg,
     xlab = "Weight",
     ylab = "Mileage",
     xlim = c(1.5, 4),
     ylim = c(10, 25),
     main = "Weight vs Mileage")

R – Pie Charts
A pie chart is a circular statistical graphic, which is divided into slices to
illustrate numerical proportions. It depicts a special chart that uses “pie
slices”, where each sector shows the relative sizes of data. R uses the
function pie() to create pie charts.

Syntax: pie(x, labels, radius, main, col, clockwise)

# Create the data for the chart
geeks <- c(23, 56, 20, 63)
labels <- c("Mumbai", "Pune", "Chennai", "Bangalore")

# Plot the chart
pie(geeks, labels)

Boxplots in R
A box plot is a chart that displays the distribution of data by drawing a box-and-whisker summary for each group. Boxplots are created in R by using the boxplot() function.

Syntax: boxplot(x, data, notch, varwidth, names, main)

# Create the box plot
boxplot(disp ~ gear, data = mtcars,
        main = "Displacement by Gear",
        xlab = "Gear",
        ylab = "Displacement")
Random Forest in R Programming
Random Forest in R Programming is an ensemble of decision trees. It builds and combines multiple decision trees to get more accurate predictions. It is a non-linear classification algorithm. Each tree is trained on a bootstrap sample, and the cases not used when constructing a tree provide an error estimate. This is called the out-of-bag (OOB) error estimate, reported as a percentage.

Random forest algorithm

1. Draw a random bootstrap sample of size n.

2. Grow a decision tree from the bootstrap sample. At each node of the tree, randomly select d features.

3. Split the node using the feature (variable) that provides the best split according to the objective function.

4. Repeat steps 1 to 3, k times (k is the number of trees).

5. Aggregate the predictions of all trees to assign the class label for a new data point by majority vote.

Example:

Consider a fruit box consisting of three fruits (Apples, Oranges, and Cherries) in the training data, i.e. n = 3. We are predicting which fruit is most frequent in the fruit box. A random forest model is built using the training data with k = 3 trees.
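As a minimal sketch (not part of the original notes), a random forest can be fit with the randomForest package; the iris dataset and the parameter values are illustrative assumptions:

```r
# Illustrative sketch using the randomForest package (assumed installed)
library(randomForest)

set.seed(42)  # reproducible bootstrap samples
model <- randomForest(Species ~ ., data = iris,
                      ntree = 100,  # k, the number of trees
                      mtry = 2)     # d, features tried at each split

print(model)                  # includes the out-of-bag (OOB) error estimate
predict(model, head(iris))    # majority-vote class labels
```

Printing the model shows the OOB error percentage described above, computed from the cases left out of each bootstrap sample.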
Decision Tree in R Programming
Decision Trees are useful supervised Machine learning algorithms that have the ability to perform both
regression and classification tasks. It is characterized by nodes and branches, where the tests on each
attribute are represented at the nodes, the outcome of this procedure is represented at the branches
and the class labels are represented at the leaf nodes.

These types of tree-based algorithms are one of the most widely used algorithms due to the fact that
these algorithms are easy to interpret and use.

Types of Decision Trees

Decision trees fall into two main categories:

Categorical Variable Decision Tree: This refers to the decision trees whose target variables have limited
value and belong to a particular group.

Continuous Variable Decision Tree: This refers to the decision trees whose target variables can take
values from a wide range of data types.

Working of a Decision Tree in R


Partitioning: It refers to the process of splitting the data set into subsets. The tree uses various splitting criteria to divide a node into sub-nodes, which results in an overall increase in the purity of the nodes.

Pruning: This refers to the process wherein the branch nodes are turned into leaf nodes which results in
the shortening of the branches of the tree.

Selection of the tree: The main goal of this process is to select the smallest tree that fits the data due to
the reasons discussed in the pruning section.

Important factors

Entropy:

Mainly used to measure the impurity of a given sample. If the sample is completely homogeneous (all one class), the entropy is 0; if it is evenly partitioned between two classes, the entropy is 1.
Information Gain:

A statistical property that measures how well training examples are separated based on the target classification.
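The two-class entropy described above can be computed directly in R; this small sketch is an illustration, not from the original notes:

```r
# Two-class entropy in bits: H(p) = -p*log2(p) - (1 - p)*log2(1 - p)
entropy <- function(p) {
  if (p == 0 || p == 1) return(0)  # a pure (homogeneous) node has zero entropy
  -p * log2(p) - (1 - p) * log2(1 - p)
}

entropy(1)    # completely homogeneous sample: 0
entropy(0.5)  # evenly partitioned sample: 1
```

Information gain is then the entropy of the parent node minus the weighted average entropy of its child nodes after a split.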

Advantages

Easy to understand and interpret.

Does not require data normalization.

Disadvantages

Requires more time to train the model.

Has considerably higher complexity.
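A minimal sketch of fitting and pruning a decision tree in R with the rpart package (the package choice, dataset, and cp value are assumptions, not from the original notes):

```r
library(rpart)

# Categorical-target (classification) tree on the built-in iris data
tree <- rpart(Species ~ ., data = iris, method = "class")

printcp(tree)                     # complexity table consulted when pruning
pruned <- prune(tree, cp = 0.1)   # turn branch nodes into leaves (smaller tree)

# Class labels are read off at the leaf nodes
predict(pruned, head(iris), type = "class")
```

The printcp() output supports the "selection of the tree" step: it helps pick the smallest tree that still fits the data well.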

Normal Distribution in R
Normal Distribution is a probability function used in statistics that tells about how the data values are
distributed. It is the most important probability distribution function used in statistics because of its
advantages in real case scenarios.

It is generally observed that data distribution is normal when there is a random collection of data from
independent sources.

In R, there are 4 built-in functions to generate normal distribution:

dnorm(): this function in R programming measures the density function of the distribution.

syntax : dnorm(x, mean, sd)

pnorm(): this function is the cumulative distribution function, which measures the probability that a random variable X takes a value less than or equal to x.

Syntax: pnorm(x, mean, sd)


qnorm(): this function is the inverse of the pnorm() function. It takes a probability value and gives as output the quantile corresponding to that probability.

Syntax: qnorm(p, mean, sd)

rnorm(): this function in R programming is used to generate a vector of random numbers which are normally distributed.

Syntax: rnorm(n, mean, sd)
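The four functions above can be demonstrated together on a standard normal distribution (the mean, sd, and sample size are illustrative):

```r
set.seed(1)

dnorm(0, mean = 0, sd = 1)    # density at x = 0, about 0.3989
pnorm(0, mean = 0, sd = 1)    # P(X <= 0) = 0.5 for a standard normal
qnorm(0.5, mean = 0, sd = 1)  # inverse of pnorm: the 0.5 quantile is 0

x <- rnorm(1000, mean = 0, sd = 1)  # 1000 normally distributed draws
mean(x)                             # close to 0 for a large sample
```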

Binomial Distribution in R Programming


Binomial distribution in R is a probability distribution used in statistics. The binomial distribution is a
discrete distribution and has only two outcomes i.e. success or failure. All its trials are independent, the
probability of success remains the same and the previous outcome does not affect the next outcome.

Binomial distribution helps us to find the individual probabilities as well as cumulative probabilities over
a certain range.

Functions for Binomial Distribution

dbinom() Function: This function is used to find the probability at a particular value for data that follow a binomial distribution.

Syntax: dbinom(k, n, p)

pbinom() Function: The function pbinom() is used to find the cumulative probability, up to a given value, of data following a binomial distribution.

Syntax: pbinom(k, n, p)

qbinom() Function: This function is used to find the quantile, that is, if P(x <= k) is given, it finds k.

Syntax: qbinom(P, n, p)
rbinom() Function: This function generates n random values from a binomial distribution with a particular probability of success.

Syntax: rbinom(n, N, p)
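The four binomial functions can be demonstrated for 10 independent trials with success probability 0.5 (the trial count and probability are illustrative):

```r
# Probability of exactly 3 successes in 10 trials with p = 0.5,
# i.e. choose(10, 3) / 2^10, about 0.117
dbinom(3, size = 10, prob = 0.5)

# Cumulative probability P(X <= 3)
pbinom(3, size = 10, prob = 0.5)

# Quantile: smallest k with P(X <= k) >= 0.5
qbinom(0.5, size = 10, prob = 0.5)

# Five random draws, each counting successes out of 10 trials
set.seed(1)
rbinom(5, size = 10, prob = 0.5)
```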

Time Series Analysis in R


Time Series Analysis in R is used to see how an object behaves over time. In R Programming it can easily be done with the ts() function and some parameters. A time series takes a data vector, and each data point is associated with a timestamp value given by the user.

Syntax: objectName <- ts(data, start, end, frequency)

where,

data – represents the data vector

start – represents the first observation in time series

end – represents the last observation in time series

frequency – represents number of observations per unit time.
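The parameters above can be sketched with a small example; the rainfall values are made up for illustration:

```r
# Monthly observations for one year, starting January 2023
rainfall <- c(79, 66, 51, 38, 26, 18, 17, 25, 41, 61, 80, 96)
rain_ts <- ts(rainfall, start = c(2023, 1), frequency = 12)

print(rain_ts)  # values laid out by year and month
plot(rain_ts, main = "Monthly Rainfall", ylab = "mm")
```

Here frequency = 12 tells R there are 12 observations per unit of time (one year), so the timestamps run from January to December 2023.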

Steps to follow for time series analysis:


Loading Time Series Data: Load your time series data into R. This could be in various formats like CSV, Excel, or fetched directly from a database.

Converting Data to Time Series Object: This is often done using the ts() function for equally spaced time series or the xts() function for irregularly spaced time series.

Exploratory Data Analysis (EDA): Perform exploratory data analysis to understand the characteristics of your time series data.

Modeling Time Series: Fit appropriate models to your time series data. This can include simple models
like ARIMA

Forecasting: Use fitted models to forecast future values of the time series.

Model Evaluation: Evaluate the performance of your models using appropriate metrics such as mean
absolute error, mean squared error

Visualization: Visualize your forecasts along with the historical data to understand the model's
performance.

Linear Regression
Linear Regression is a commonly used type of predictive analysis. Linear Regression is a statistical
approach for modelling the relationship between a dependent variable and a given set of independent
variables.

There are two types of linear regression.

Simple Linear Regression:

Simple Linear Regression makes predictions for continuous or numeric variables.

Simple linear regression shows the linear relationship, which means it finds how the value of the dependent variable changes according to the value of the independent variable.

Linear Regression Line

A regression line can show two types of relationship:

Positive Linear Relationship: If the dependent variable increases on the Y-axis and the independent
variable increases on the X-axis, then such a relationship is termed as a Positive linear relationship.

Negative Linear Relationship: If the dependent variable decreases on the Y-axis and independent
variable increases on the X-axis, then such a relationship is called a negative linear relationship.

Assumptions of Simple Linear Regression

Linear relationship : Linear regression assumes the linear relationship between the dependent and
independent variables.

No multicollinearity: Multicollinearity means high correlation between the independent variables. Because of this, it may be difficult to find the true relationship between the predictors and the target variable.

Homoscedasticity: Homoscedasticity is a situation in which the variance of the error term is the same for all values of the independent variables.

Multiple Linear Regression It is the most common form of Linear Regression. Multiple Linear Regression
basically describes how a single response variable Y depends linearly on a number of predictor variables.

The basic examples where Multiple Regression can be used are as follows:

The height of a child can depend on the height of the mother, the height of the father, nutrition, and
environmental factors.
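Both forms of linear regression can be fit with the built-in lm() function; the mtcars data and the chosen predictors are illustrative:

```r
# Simple linear regression: mpg as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)  # slope is negative: heavier cars get lower mileage

# Multiple linear regression: mpg depends linearly on weight and horsepower
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit2)

# Predict mpg for a hypothetical car (wt in 1000 lb, hp in horsepower)
predict(fit2, newdata = data.frame(wt = 3.0, hp = 110))
```

The negative wt coefficient in the simple model is an example of a negative linear relationship as described above.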

Logistic Regression
Logistic regression in R Programming is a classification algorithm used to find the probability of event
success and event failure. Logistic regression is used when the dependent variable is binary(0/1,
True/False, Yes/No) in nature. The logit function is used as a link function in a binomial distribution.

Logistic regression is also known as binomial logistic regression. It is based on the sigmoid function, whose output is a probability and whose input can range from -infinity to +infinity.
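Logistic regression is fit in R with glm() using the binomial family, which applies the logit link described above; the mtcars example below (predicting a manual transmission, am = 1, from weight) is illustrative:

```r
# Logistic regression with a binary (0/1) dependent variable
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)

# Predicted probabilities: the sigmoid of the linear predictor,
# so every value lies between 0 and 1
probs <- predict(fit, type = "response")
head(probs)
```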

Survival Analysis
Survival analysis in R Programming Language deals with predicting the time until an event of interest occurs. Observations for which the event does not occur within the study period are treated as censored observations.

# Installing the package
install.packages("survival")

# Loading the package
library(survival)

# Dataset information
?lung

# Fitting the survival model
Survival_Function <- survfit(Surv(lung$time, lung$status == 2) ~ 1)
Survival_Function

# Plotting the function
plot(Survival_Function)

Methods used to do survival analysis:

Kaplan-Meier Method

The Kaplan-Meier method estimates the survival distribution using the Kaplan-Meier estimator for truncated or censored data. It is a non-parametric statistic that allows us to estimate the survival function and is thus not based on an underlying probability distribution.

The Kaplan-Meier estimates are based on the number of patients (each patient as a row of data), out of the total number of patients, who survive for a certain time after treatment (the event).

Cox proportional hazard model

It is a regression model that measures the instantaneous risk of death and is a bit more difficult to illustrate than the Kaplan-Meier estimator. It consists of a hazard function h(t), which describes the instantaneous risk of the event (e.g. death) at a particular time t.

It does not assume an underlying probability distribution, but it assumes that the ratio of hazards between the patient groups we compare is constant over time (the proportional hazards assumption).
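A Cox model can be fit with coxph() from the survival package on the same lung dataset used earlier; the choice of sex as the covariate is an illustrative assumption:

```r
library(survival)

# Cox proportional hazards model: status == 2 marks the event (death)
cox_fit <- coxph(Surv(time, status == 2) ~ sex, data = lung)

summary(cox_fit)  # exp(coef) gives the hazard ratio between the groups
```

The hazard ratio reported by summary() is assumed constant over time, which is exactly the proportional hazards assumption described above.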

Prescriptive Analytics
Prescriptive Analytics can be defined as a type of data analytics that uses algorithms and the analysis of raw data to achieve better and more effective decisions over both long and short time spans. It suggests strategies over possible scenarios.

Prescriptive Analytics is the area of Business Analytics dedicated to searching out the best solution for
day-to-day occurring problems.
Creating data for analytics through designed experiments

Creating data for prescriptive analytics often involves designing experiments to generate data that can
be used to build and validate prescriptive models.

Here's a general framework for creating data

Define the Objective: Clearly define the objective of the prescriptive analytics project.

Identify Factors : Identify the factors (variables) that influence the outcome of interest and any
constraints that need to be considered.

Design Experiments or Simulations: Design experiments or simulations to systematically vary the factors
and observe the outcomes.

Generate Data: Implement the designed experiments or simulations to generate data.

Collect and Prepare Data: Collect the data generated from experiments and prepare it for analysis.

Build Prescriptive Models: Use the collected data to build models that relate the factors to the outcome.

Validate Models: Validate the prescriptive models using techniques like cross-validation and sensitivity analysis.
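The designed-experiment steps above can be sketched with a small simulation in R; the factor names, levels, and the response formula are invented purely for illustration:

```r
set.seed(7)

# Design experiments: systematically vary two factors (full factorial design)
design <- expand.grid(price = c(9.99, 12.99, 14.99),
                      promo = c("none", "email", "banner"))

# Generate data: simulate an outcome (units sold) for each factor combination
design$units_sold <- round(200 - 8 * design$price +
                           ifelse(design$promo == "none", 0, 25) +
                           rnorm(nrow(design), sd = 5))

design  # collected data, ready for building a prescriptive model
```

A model fit to this data (e.g. with lm()) could then recommend the price and promotion combination with the best predicted outcome.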

Creating data for analytics through active learning

Creating data for prescriptive analytics through active learning involves iteratively selecting and labeling
data points to train predictive models, optimizing the model's performance over time.

Here's a framework for creating data

Define the Objective: Clearly define the objective of the prescriptive analytics project.

Identify Factors : Identify the factors (variables) that influence the outcome of interest and any
constraints that need to be considered.

Initial Data Collection: Start with an initial dataset that includes historical data relevant to the decision-making process.

Select Initial Training Set: Select a small subset of the initial dataset as the initial training set for the
prescriptive model.
Train Initial Model: Train an initial prescriptive model using the selected training set.

Deploy Initial Model: Deploy the initial model into the decision-making process to start making recommendations.

Iterative Data Selection: Use active learning strategies to iteratively select additional data points for labeling.

Creating data for analytics through reinforcement learning

It involves using the principles of reinforcement learning to iteratively optimize decision-making processes.

Here's a framework for creating data

Define the Objective: Clearly define the objective of the prescriptive analytics project.

Identify Factors : Identify the factors (variables) that influence the outcome of interest and any
constraints that need to be considered

Design State Space: Define the state space, which represents the possible states of the environment

Define Action Space: Define the action space, which represents the possible actions or decisions that the agent
can take in each state.

Model Reward Function: Define a reward function that quantifies the desirability of different outcomes or
decisions.

Train Reinforcement Learning Agent: Train a reinforcement learning agent using the initial dataset and the
defined state and action spaces.

Deploy Agent: Deploy the trained reinforcement learning agent into the decision-making process.
