data analytics lab manual using R programming
LAB MANUAL
List of Programs:
1. Data Preprocessing
   a. Handling missing values
   b. Noise detection and removal
   c. Identifying data redundancy and elimination
2. Implement any one imputation model
3. Implement Linear Regression
4. Implement Logistic Regression
5. Implement Decision Tree Induction for classification
6. Implement Random Forest Classifier
7. Implement ARIMA on Time Series data
8. Object segmentation using hierarchical based methods
9. Perform Visualization techniques (types of charts: Bar, Column, Line, Scatter, 3D Cubes, etc.)
10. Perform Descriptive analytics on healthcare data
11. Perform Predictive analytics on Product Sales data
12. Apply Predictive analytics for Weather forecasting.
Introduction to R programming:
R is a programming language and free software developed by Ross Ihaka and Robert Gentleman in
1993. R offers an extensive catalog of statistical and graphical methods, including machine
learning algorithms, linear regression, time series analysis and statistical inference, to name a few.
Most R libraries are written in R itself, but C, C++ and Fortran are preferred for heavy
computational tasks. R is trusted not only in academia: many large companies, including Uber,
Google, Airbnb and Facebook, also use the R programming language.
A typical data analysis in R includes the following stages:
Discover: Investigate the data, refine your hypotheses and analyze them
Model: R provides a wide array of tools to capture the right model for your data
Communicate: Integrate code, graphs and outputs into a report with R Markdown, or build Shiny
apps to share with the world
Step – 1: With base R installed, let's move on to installing RStudio. To begin, go to the
RStudio download page and click on the download button for RStudio Desktop.
Step – 2: Click on the link for the Windows version of RStudio and save the .exe file.
Step – 3: Run the .exe and follow the installation instructions.
Enter/browse the path to the installation folder and click Next to proceed.
Select the folder for the Start menu shortcut, or choose not to create shortcuts, and
then click Next.
Installing Packages:-
The most common place to get packages from is CRAN. To install packages from
CRAN you use install.packages("package name"). For instance, if you want to
install the ggplot2 package, which is a very popular visualization package, you
would type the following in the console:-
Syntax:-
# install package from CRAN
install.packages("ggplot2")
Loading Packages:-
Once the package is downloaded to your computer, you can access the functions and
resources provided by the package in two different ways:
# load the package to use in the current R session
library(packagename)
# use a particular function from a package without loading the whole package
packagename::functionname
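The install-then-load pattern described above can be wrapped so that a package is installed only when it is missing. This is a minimal sketch: the helper name `ensure_package` is hypothetical, and it is demonstrated with `stats`, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is not already available, then load it
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs an internet connection
  }
  library(pkg, character.only = TRUE)  # character.only lets us pass the name as a string
  invisible(TRUE)
}

ensure_package("stats")  # "stats" is bundled with R, so no download happens here

# Call a function through its package namespace instead of relying on library()
m <- stats::median(c(1, 3, 5))
print(m)
```

For a CRAN package such as ggplot2, the call would be `ensure_package("ggplot2")`, which triggers a download the first time it runs.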
1 Data Preprocessing
a. Handling missing values
b. Noise detection and removal
c. Identifying data redundancy and elimination
a. Handling Missing Values:
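# Option 1: remove rows with missing values
data <- na.omit(data)
# Option 2 (use instead of Option 1; after na.omit() nothing is left to impute):
# impute missing values with the column mean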
data$column_with_missing <- ifelse(is.na(data$column_with_missing),
mean(data$column_with_missing, na.rm = TRUE),
data$column_with_missing)
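To see both approaches side by side, here is a minimal sketch on a small made-up data frame. The column names `id` and `score` and all values are illustrative assumptions, not part of the lab data.

```r
# Small illustrative data frame (made-up values) with missing entries
df <- data.frame(
  id    = 1:6,
  score = c(10, NA, 30, 40, NA, 20)
)

# Count missing values per column
print(colSums(is.na(df)))

# Option 1: drop every row that contains at least one NA
dropped <- na.omit(df)

# Option 2: replace NAs in `score` with the mean of the observed values
imputed <- df
imputed$score[is.na(imputed$score)] <- mean(df$score, na.rm = TRUE)

print(nrow(dropped))
print(imputed$score)
```

Dropping rows loses the other columns of those records, while mean imputation keeps every row but shrinks the variance of the imputed column, so the choice depends on how much data is missing.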
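b. Noise Detection and Removal:
# Flag values more than 3 standard deviations from the mean (adjust the threshold as needed)
z_scores <- as.vector(scale(data$numeric_column))
is_outlier <- !is.na(z_scores) & abs(z_scores) > 3
# Logical indexing stays correct when no outliers are found; data[-which(...), ] would drop every row
cleaned_data <- data[!is_outlier, ]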
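The z-score rule can be tried on a small made-up vector. This is a sketch under the assumption that a single extreme reading is the noise; the values and the column name `value` are illustrative only.

```r
# Made-up measurements: 20 values near 10 plus one extreme reading
values <- c(rep(c(10, 11, 9, 10, 12, 8, 10, 11, 9, 10), times = 2), 100)
df <- data.frame(id = seq_along(values), value = values)

# z-score: distance from the mean in units of standard deviation
z <- (df$value - mean(df$value)) / sd(df$value)

# Flag rows beyond the threshold; logical indexing keeps all rows if none are flagged
threshold <- 3
is_outlier <- abs(z) > threshold
cleaned <- df[!is_outlier, ]

print(df[is_outlier, ])
print(nrow(cleaned))
```

Note that with the sample standard deviation, a z-score can never exceed (n - 1) / sqrt(n), so a threshold of 3 cannot flag anything in a sample of fewer than 11 observations.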
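c. Identifying Data Redundancy and Elimination:
# Remove duplicate rows
unique_data <- unique(data)
# Remove highly correlated variables (findCorrelation() comes from the caret package)
library(caret)
numeric_data <- unique_data[sapply(unique_data, is.numeric)]
cor_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
high_correlation <- findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE) # Adjust the threshold as needed
# setdiff() keeps every column when no pair exceeds the cutoff
cleaned_data <- unique_data[, setdiff(names(unique_data), high_correlation), drop = FALSE]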
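The same idea can be sketched in base R without extra packages. The rule used here (drop the later column of each highly correlated pair) is a simplification chosen for illustration and is not the algorithm of caret's `findCorrelation()`, which removes the variable with the larger mean absolute correlation. The data and column names are made up.

```r
# Made-up data: x2 is almost a copy of x1, x3 is unrelated; row 5 duplicates row 1
df <- data.frame(
  x1 = c(1, 2, 3, 4, 1, 5, 6),
  x2 = c(2.1, 3.9, 6.2, 8.1, 2.1, 9.8, 12.1),
  x3 = c(5, 1, 4, 2, 5, 6, 3)
)

# Remove exact duplicate rows
dedup <- unique(df)

# Absolute pairwise correlations; keep only the upper triangle so each pair is checked once
cor_abs <- abs(cor(dedup))
cor_abs[lower.tri(cor_abs, diag = TRUE)] <- 0

# Drop the later column of every pair whose correlation exceeds the cutoff
cutoff <- 0.9
to_drop <- colnames(cor_abs)[apply(cor_abs, 2, function(col) any(col > cutoff))]
reduced <- dedup[, setdiff(names(dedup), to_drop), drop = FALSE]

print(nrow(dedup))
print(names(reduced))
```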
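3 Implement Linear Regression
```r
# Assumes that `data` (a data frame with columns x and y) and
# lm_model <- lm(y ~ x, data = data) were created earlier in the program
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Linear Regression",
       x = "X",
       y = "Y")
# Predict using the model
new_x <- 11
predicted_y <- predict(lm_model, newdata = data.frame(x = new_x))
cat("Predicted value for x =", new_x, ":", predicted_y)
```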
In this program:
1. We generate some sample data with a linear relationship between `x` and `y`,
adding some random noise.
2. We visualize the sample data using a scatter plot.
3. We fit a linear regression model using the `lm()` function, specifying the
formula `y ~ x`.
4. We print a summary of the fitted model using the `summary()` function.
5. We plot the original data points along with the fitted regression line using
`geom_smooth()` in ggplot2.
6. We demonstrate how to make predictions using the fitted model for a new
value of `x`.
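The non-plotting steps listed above (1, 3, 4 and 6) can be sketched in base R as follows. The seed, the true coefficients (y = 2x + 1) and the noise level are assumptions made for illustration. The closed-form check is an addition that shows where the `lm()` estimates come from.

```r
set.seed(42)  # assumed seed, only for reproducibility

# Step 1: sample data with a linear relationship (assumed: y = 2x + 1) plus random noise
x <- 1:10
y <- 2 * x + 1 + rnorm(length(x), sd = 0.5)
data <- data.frame(x = x, y = y)

# Steps 3 and 4: fit the model and print its summary
lm_model <- lm(y ~ x, data = data)
print(summary(lm_model))

# Cross-check: closed-form least-squares estimates for a single predictor
slope     <- cov(data$x, data$y) / var(data$x)
intercept <- mean(data$y) - slope * mean(data$x)
print(c(intercept = intercept, slope = slope))

# Step 6: predict for a new value of x
new_x <- 11
predicted_y <- predict(lm_model, newdata = data.frame(x = new_x))
cat("Predicted value for x =", new_x, ":", predicted_y, "\n")
```

The prediction is simply `intercept + slope * new_x`, which is why the column name in `newdata` must match the predictor name used in the formula.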
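4 Implement Logistic Regression
```r
# Assumes `logit_model` was fitted earlier in the program with
# glm(y ~ x, family = binomial) on the sample data
# Predict using the model
new_x <- 1
predicted_probability <- predict(logit_model, newdata = data.frame(x = new_x),
                                 type = "response")
cat("Predicted probability for x =", new_x, ":", predicted_probability)
```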
In this program:
1. We generate some sample data with a binary outcome variable `y` based on a
linear combination of the feature `x`.
2. We visualize the sample data using a scatter plot.
3. We fit a logistic regression model using the `glm()` function with `family =
binomial`.
4. We print a summary of the fitted model using the `summary()` function.
5. We plot the logistic regression curve using the coefficients obtained from the
fitted model.
6. We demonstrate how to make predictions using the fitted model for a new
value of `x`.
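A compact base R sketch of the steps listed above, leaving out the plots. The seed, sample size and true coefficients (log-odds = 0.5 + 2x) are assumptions for illustration, and `curve_at` is a hypothetical helper name.

```r
set.seed(42)  # assumed seed, only for reproducibility

# Step 1: binary outcome whose log-odds are linear in x (assumed: 0.5 + 2x)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 2 * x))
data <- data.frame(x = x, y = y)

# Steps 3 and 4: fit the logistic regression and print its summary
logit_model <- glm(y ~ x, data = data, family = binomial)
print(summary(logit_model))

# Step 5 (without plotting): the fitted curve is plogis(b0 + b1 * x)
b <- coef(logit_model)
curve_at <- function(x_val) plogis(b[["(Intercept)"]] + b[["x"]] * x_val)

# Step 6: predicted probability for a new value of x
new_x <- 1
predicted_probability <- predict(logit_model, newdata = data.frame(x = new_x),
                                 type = "response")
cat("Predicted probability for x =", new_x, ":", predicted_probability, "\n")
```

With `type = "response"`, `predict()` returns a probability between 0 and 1. Without it, the result is on the log-odds scale, and `plogis()` converts it to a probability.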
10 Perform Descriptive analytics on healthcare data
In this program:
2. We read the healthcare data from a CSV file using the `read.csv()` function.
Replace `"healthcare_data.csv"` with the path to your dataset.
3. We display the structure of the dataset using `str()` and summary statistics
using `summary()`.
4. We check for missing values in the dataset.
5. We visualize the distributions of key variables (age, weight, height) using
histograms and boxplots.
6. We analyze categorical variables (gender, diagnosis) using the `table()`
function.
7. We calculate correlations between numeric variables using the `cor()`
function and display the correlation matrix.
8. We create a scatterplot matrix to visualize relationships between variables.
9. Additional exploratory analysis can be performed based on the specific
requirements of the analysis.
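The non-graphical steps above can be sketched as follows. Because the CSV file is not available here, a small made-up data frame stands in for it: the column names follow the step list, while the values and units are illustrative assumptions.

```r
# Made-up stand-in for healthcare_data.csv; values and units are illustrative only
health <- data.frame(
  age       = c(34, 51, 47, NA, 29, 62, 45, 38),
  weight    = c(68, 82, 75, 90, 59, 77, NA, 71),          # assumed unit: kg
  height    = c(165, 178, 172, 181, 160, 170, 175, 168),  # assumed unit: cm
  gender    = c("F", "M", "M", "M", "F", "F", "M", "F"),
  diagnosis = c("Diabetes", "Hypertension", "Diabetes", "Asthma",
                "Asthma", "Hypertension", "Diabetes", "Asthma")
)

# Structure and summary statistics
str(health)
print(summary(health))

# Missing values per column
missing_counts <- colSums(is.na(health))
print(missing_counts)

# Frequency tables for the categorical variables
gender_counts <- table(health$gender)
print(gender_counts)
print(table(health$diagnosis))

# Correlation matrix of the numeric variables, using only complete pairs
numeric_cols <- health[sapply(health, is.numeric)]
cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```

With a real dataset, the `data.frame(...)` call would be replaced by `read.csv()` on the file path, and `hist()`, `boxplot()` and `pairs()` could be added for the plots mentioned in steps 5 and 8.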