
Name: Abi S

Class: MBA
Reg no: DA2352305010370
Sub: DATA SCIENCE USING R
Assignment:2

1.
1. Standardizing Date Formats in R
To standardize date formats, we'll use the lubridate package,
which is excellent for handling dates and times.
Code snippet
# Install and load the lubridate package (if not already installed)
# install.packages("lubridate")
library(lubridate)

# Sample data with inconsistent date formats
dates <- c("2023-01-15", "01/15/2023", "15.01.2023", "Jan 15, 2023", "15-Jan-2023")

# Function to standardize date formats
standardize_dates <- function(date_vector) {
  # Try different date formats and parse them
  parsed_dates <- parse_date_time(date_vector,
                                  orders = c("ymd", "mdy", "dmy", "b d Y", "d b Y"),
                                  quiet = TRUE)

  # Convert to a consistent Date format (YYYY-MM-DD)
  standardized_dates <- as.Date(parsed_dates)

  return(standardized_dates)
}

# Apply the function
standardized_dates <- standardize_dates(dates)

# Print the standardized dates
print(standardized_dates)
Explanation:
 lubridate::parse_date_time(): Automatically detects and parses the various date formats listed in the orders argument.
 as.Date(): Converts the parsed date-times to Date objects, which print in the consistent "YYYY-MM-DD" format.
 quiet = TRUE: Suppresses warnings if some dates cannot be parsed (those entries become NA).
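If some inputs fail to parse, they come back as NA. A quick follow-up check on the dates and standardized_dates objects defined above makes those failures visible:
Code snippet
# Flag any inputs that could not be parsed (they are NA after standardization)
unparsed <- dates[is.na(standardized_dates)]
if (length(unparsed) > 0) {
  message("Could not parse: ", paste(unparsed, collapse = ", "))
}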
2. Detecting and Removing Duplicate Entries
We can use the duplicated() and unique() functions in R to
handle duplicates.
Code snippet
# Sample data with duplicate entries
data <- data.frame(
  id = c(1, 2, 3, 2, 4, 1),
  value = c("A", "B", "C", "B", "D", "A")
)

# Detect duplicate rows
duplicate_rows <- duplicated(data)

# Print the logical vector indicating duplicate rows
print(duplicate_rows)

# Remove duplicate rows
unique_data <- unique(data)

# Print the data without duplicates
print(unique_data)
Explanation:
 duplicated(data): Returns a logical vector where TRUE
indicates a duplicate row.
 unique(data): Returns a data frame with only the unique
rows.
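If duplicates should be judged on a subset of columns rather than whole rows (here the id column, as an assumed business rule), a minimal sketch using the data frame above:
Code snippet
# Keep the first occurrence of each id and drop later repeats
dedup_by_id <- data[!duplicated(data$id), ]
print(dedup_by_id)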
3. Handling Inconsistencies in Categorical Variables
We can use tolower(), toupper(), and str_to_title() (from the
stringr package) to standardize capitalization.
Code snippet
# Install and load the stringr package (if not already installed)
# install.packages("stringr")
library(stringr)

# Sample categorical data with inconsistent capitalization
categories <- c("Male", "male", "MALE", "Female", "female", "FEMALE")

# Standardize to lowercase
lowercase_categories <- tolower(categories)
print(lowercase_categories)

# Standardize to uppercase
uppercase_categories <- toupper(categories)
print(uppercase_categories)

# Standardize to title case (first letter capitalized)
titlecase_categories <- str_to_title(categories)
print(titlecase_categories)

# If you want to replace the original column in your data frame:
# data$gender <- tolower(data$gender)
Explanation:
 tolower(): Converts all characters to lowercase.
 toupper(): Converts all characters to uppercase.
 str_to_title(): Converts the first letter of each word to
uppercase and the rest to lowercase.
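Beyond capitalization, you may also want to map every variant onto one canonical label. A minimal sketch using the categories vector above (the canonical labels here are an assumption for illustration):
Code snippet
# Collapse all capitalization variants to a single canonical label per level
canonical <- c(male = "Male", female = "Female")
recoded <- unname(canonical[tolower(categories)])
print(recoded)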
4. Detecting and Treating Outliers
We'll use Tukey's method (the interquartile range, IQR) to detect outliers and then either remove or cap them.
Code snippet
# Sample numerical data with outliers
numerical_data <- c(10, 12, 15, 18, 20, 100, -50)

# Function to detect and treat outliers using the IQR
treat_outliers_iqr <- function(data) {
  # Calculate quartiles and IQR
  q1 <- quantile(data, 0.25)
  q3 <- quantile(data, 0.75)
  iqr <- q3 - q1

  # Calculate lower and upper bounds
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr

  # Detect outliers
  outliers <- data < lower_bound | data > upper_bound

  # Treat outliers (e.g., cap them at the bounds)
  data[data < lower_bound] <- lower_bound
  data[data > upper_bound] <- upper_bound

  return(list(treated_data = data, outliers = outliers))
}

# Apply the outlier treatment function
result <- treat_outliers_iqr(numerical_data)

# Print the treated data and outlier flags
print(result$treated_data)
print(result$outliers)
Explanation:
 We calculate the first quartile (Q1), the third quartile (Q3), and the interquartile range (IQR = Q3 − Q1).
 We calculate the lower and upper bounds as Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, respectively.
 Values outside these bounds are considered outliers.
 We can then cap the outliers at the lower or upper bound or remove them entirely.
 Other methods for detecting outliers include the z-score and boxplot visualization (a short z-score sketch follows the removal snippet below).
 If you want to remove the outliers instead of capping them, you can use:
Code snippet
numerical_data[!result$outliers]
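For comparison, here is a minimal z-score sketch on the same numerical_data vector; the threshold of 3 is a common convention, not something specified in this assignment.
Code snippet
# z-score based outlier detection: flag values more than 3 SDs from the mean
z_scores <- abs((numerical_data - mean(numerical_data)) / sd(numerical_data))
z_outliers <- z_scores > 3
print(numerical_data[z_outliers])
Note that on a sample this small the extreme values inflate the standard deviation, so the z-score method flags little or nothing here; the IQR approach above is more robust for such data.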
These R scripts provide a foundation for handling inconsistent
and messy data. Remember to adjust the code based on the
specific characteristics of your dataset.

2.
1. What is SparkR, and How Does it Help in Processing Big
Data?
 SparkR:
o SparkR is an R package that provides a lightweight
frontend to use Apache Spark from R. It allows data
scientists and analysts to leverage the distributed
computing capabilities of Spark directly from their
R environment.
o It essentially acts as an R interface to Spark's
distributed data processing engine.
 How it helps in processing Big Data:
o Distributed Computing: SparkR enables R users to
process massive datasets that would be impossible
to handle with traditional R alone. Spark distributes
the data and computations across a cluster of
machines, significantly speeding up processing.
o Scalability: SparkR scales horizontally, meaning
you can add more machines to the cluster to handle
even larger datasets.
o In-Memory Processing: Spark's in-memory
processing capabilities dramatically reduce I/O
operations, leading to faster data processing.
o Data Manipulation and Transformation: SparkR provides a rich set of functions for data manipulation, transformation, and analysis, similar to those available in R's dplyr package, but operating on distributed Spark DataFrames (a short sketch follows this list).
o Machine Learning: SparkR integrates with Spark's
MLlib library, allowing you to build and deploy
machine learning models on large datasets.
o Streaming Data: SparkR, combined with Spark
Streaming or Structured Streaming, can process
real-time data streams from IoT sensors or other
sources.
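As an illustration of that dplyr-like syntax, here is a minimal sketch. It assumes an active SparkSession and a Spark DataFrame spark_df with columns name and value (like the one created in the next answer); the variable names are illustrative.
Code snippet
# Filter rows and compute a grouped average on a distributed Spark DataFrame
filtered <- filter(spark_df, spark_df$value > 20)   # runs on the cluster, not in local R
grouped <- agg(groupBy(filtered, filtered$name),
               avg_value = avg(filtered$value))     # grouped aggregation
head(collect(grouped))                              # bring the (small) result back into R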
2. Demonstrating How to Connect R with a Spark Cluster
and Load Data into Spark
Here's a demonstration of how to connect R with a Spark
cluster and load data into Spark using SparkR.
Code snippet
# Install and load SparkR (if not already installed)
# install.packages("SparkR")
library(SparkR)

# Initialize a SparkSession
spark <- sparkR.session(master = "local[*]", appName = "SparkR_Example")

# Confirm the session is up by printing the Spark version
print(sparkR.version())

# Create a sample data frame in R
r_df <- data.frame(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  value = c(10, 20, 30, 40, 50)
)

# Convert the R data frame to a Spark DataFrame
spark_df <- createDataFrame(r_df)

# Print the schema of the Spark DataFrame
printSchema(spark_df)

# Show the first few rows of the Spark DataFrame
head(spark_df)

# Load data from a CSV file into a Spark DataFrame
# (Replace "path/to/your/data.csv" with the actual path)
# spark_csv_df <- read.df("path/to/your/data.csv", source = "csv",
#                         header = "true", inferSchema = "true")

# Show the first few rows of the CSV Spark DataFrame
# head(spark_csv_df)

# Stop the SparkSession when finished
sparkR.stop()
Explanation:
1. Initialize SparkSession:
o sparkR.session(master = "local[*]", appName =
"SparkR_Example") creates a SparkSession.
 master = "local[*]" indicates that Spark will
run locally, using all available cores. In a
cluster environment, you would replace this
with the cluster's master URL (e.g., "yarn-
client").
 appName Provides a name for the spark
application.
2. Create Spark DataFrame from R DataFrame:
o createDataFrame(r_df) converts the R data frame
r_df into a Spark DataFrame.
3. Print Schema and Show Data:
o printSchema(spark_df) displays the schema (data
types and column names) of the Spark DataFrame.
o head(spark_df) shows the first few rows of the
DataFrame.
4. Load Data from CSV:
o read.df() loads data from a CSV file into a Spark
DataFrame.
o source = "csv" specifies the data source as CSV.
o header = "true" indicates that the first row contains
column headers.
o inferSchema = "true" tells Spark to automatically
infer the data types of the columns.
5. Stop SparkSession:
o sparkR.stop() terminates the SparkSession.
Important Notes:
 Ensure that Apache Spark is installed and configured on
your system or cluster.
 You may need to set environment variables (e.g.,
SPARK_HOME) to point to your Spark installation.
 For a cluster-based installation, the master parameter of sparkR.session() changes (see the configuration sketch after these notes).
 For streaming data, you would use Spark Streaming or Structured Streaming in conjunction with SparkR. This involves setting up data streams and using Spark's streaming APIs to process them.
 SparkR is often used with YARN for cluster resource management.
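As a hedged illustration of that cluster case, the session call might look like the following; the YARN master, memory, and core settings are assumptions for illustration, not values required by this assignment.
Code snippet
library(SparkR)

spark <- sparkR.session(
  master = "yarn",                      # submit to a YARN-managed cluster instead of local[*]
  appName = "SparkR_Preprocessing",
  sparkConfig = list(
    spark.executor.memory = "4g",       # memory per executor (illustrative)
    spark.executor.cores = "2"          # cores per executor (illustrative)
  )
)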

3.
1. Key Challenges in Automating Data Preprocessing
 Data Variability and Inconsistency:
o Healthcare data often comes from diverse sources
(e.g., electronic health records, lab results, imaging),
leading to varying formats, units, and terminologies.
o Inconsistent data entry and human errors contribute
to inconsistencies.
 Missing Values:
o Missing data is common due to incomplete records,
technical issues, or privacy concerns.
o Different missing value patterns may require
different imputation strategies.
 Outliers:
o Outliers can arise from measurement errors, rare
medical conditions, or data entry mistakes.
o Identifying and handling outliers appropriately is
crucial to avoid biased analyses.
 Data Privacy and Security:
o Healthcare data is highly sensitive and requires
strict adherence to privacy regulations (e.g.,
HIPAA).
o Preprocessing pipelines must incorporate data
anonymization and security measures.
 Scalability and Performance:
o Processing large volumes of patient records
demands efficient and scalable algorithms.
o Automation should minimize processing time and
resource consumption.
 Domain-Specific Knowledge:
o Understanding medical terminologies and clinical
contexts is essential for accurate data preprocessing.
o Incorporating domain expertise into the pipeline is
challenging.
 Pipeline Maintenance and Adaptability:
o Data structures and requirements can change over
time, requiring robust and adaptable pipelines.
o Maintaining the pipeline to handle these changes is
a challenge.
2. Automating Missing Value Imputation, Outlier
Detection, and Data Transformation in R/Python
R:
Code snippet
# Load necessary libraries
library(dplyr)
library(tidyr)
library(outliers)   # for outlier detection functions (optional; the IQR treatment below is written by hand)
library(lubridate)

# Sample healthcare data (replace with your actual data)
patient_data <- data.frame(
  patient_id = 1:100,
  age = sample(18:90, 100, replace = TRUE),
  blood_pressure = sample(80:200, 100, replace = TRUE),
  diagnosis_date = sample(seq(as.Date('2020/01/01'), as.Date('2023/12/31'), by = "day"),
                          100, replace = TRUE),
  gender = sample(c("Male", "Female", NA), 100, replace = TRUE),
  lab_value = c(rnorm(90, mean = 50, sd = 10), sample(200:300, 10))  # outliers added
)

# Missing value imputation (e.g., mean imputation for numeric, mode for categorical)
patient_data <- patient_data %>%
  mutate(
    age = ifelse(is.na(age), mean(age, na.rm = TRUE), age),
    blood_pressure = ifelse(is.na(blood_pressure), mean(blood_pressure, na.rm = TRUE), blood_pressure),
    gender = ifelse(is.na(gender), names(which.max(table(gender))), gender)
  )

# Outlier detection and treatment (e.g., IQR method)
treat_outliers <- function(x) {
  q1 <- quantile(x, 0.25)
  q3 <- quantile(x, 0.75)
  iqr <- q3 - q1
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  return(x)
}

patient_data <- patient_data %>%
  mutate(lab_value = treat_outliers(lab_value))

# Data transformation (e.g., date formatting, categorical encoding)
patient_data <- patient_data %>%
  mutate(
    diagnosis_date = as.Date(diagnosis_date),  # already Date class; kept as an explicit formatting step
    gender_encoded = as.numeric(as.factor(gender))
  )

print(head(patient_data))
Python:
Python
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import LabelEncoder

# Sample healthcare data
patient_data = pd.DataFrame({
    'patient_id': range(1, 101),
    'age': np.random.randint(18, 90, 100),
    'blood_pressure': np.random.randint(80, 200, 100),
    'diagnosis_date': pd.to_datetime(pd.Series(
        np.random.choice(pd.date_range('20200101', '20231231'), 100))),
    'gender': np.random.choice(['Male', 'Female', None], 100),
    'lab_value': np.concatenate([np.random.normal(50, 10, 90),
                                 np.random.randint(200, 300, 10)])
})

# Missing value imputation
patient_data['age'] = patient_data['age'].fillna(patient_data['age'].mean())
patient_data['blood_pressure'] = patient_data['blood_pressure'].fillna(
    patient_data['blood_pressure'].mean())
patient_data['gender'] = patient_data['gender'].fillna(
    patient_data['gender'].mode()[0])

# Outlier detection and treatment (z-score)
z_scores = np.abs(stats.zscore(patient_data['lab_value']))
patient_data['lab_value'] = np.where(z_scores > 3,
                                     patient_data['lab_value'].median(),
                                     patient_data['lab_value'])

# Data transformation
patient_data['gender_encoded'] = LabelEncoder().fit_transform(patient_data['gender'])

print(patient_data.head())
3. Automating Cleaning in R Using dplyr, tidyr, and
data.table
 dplyr:
o Provides a consistent and intuitive syntax for data
manipulation.
o mutate() for creating or modifying columns.
o filter() for subsetting rows.
o group_by() and summarise() for aggregate
calculations.
o %>% for chaining operations.
 tidyr:
o Helps to tidy messy data by reshaping and
restructuring it.
o pivot_longer() and pivot_wider() for converting
between wide and long formats.
o separate() and unite() for splitting and combining
columns.
o drop_na() and fill() for handling missing values.
 data.table:
o Provides a high-performance data manipulation
framework, especially for large datasets.
o Uses efficient indexing and grouping for fast
operations.
o The := operator for in-place modification of data (see the short sketch after the code below).
o setkey() for fast subsetting and joining.
Code snippet
library(dplyr)
library(tidyr)
library(data.table)
library(lubridate)

# Sample data
patient_data <- data.frame(
  patient_id = 1:6,
  lab_test_1 = c(10, 12, NA, 15, 18, 20),
  lab_test_2 = c(25, 28, 30, 32, NA, 35),
  gender = c('male', 'Female', 'MALE', 'female', 'Male', 'FEMALE'),
  test_date = c('2023-01-01', '01/02/2023', '2023 03 01',
                '2023-04-01', '2023-05-01', '2023-06-01')
)

patient_data_dt <- as.data.table(patient_data)

# Clean the data with data.table, dplyr, and tidyr
cleaned_data <- patient_data_dt %>%
  mutate(
    gender = tolower(gender),
    test_date = as.Date(parse_date_time(test_date, orders = c('ymd', 'mdy', 'dmy')))
  ) %>%
  replace_na(list(
    lab_test_1 = mean(patient_data_dt$lab_test_1, na.rm = TRUE),
    lab_test_2 = mean(patient_data_dt$lab_test_2, na.rm = TRUE)
  ))

print(cleaned_data)
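The pipeline above leans on dplyr and tidyr; a minimal data.table-style sketch on the same patient_data_dt shows the := in-place update and setkey() indexing mentioned earlier (the keyed lookup on patient_id is purely illustrative):
Code snippet
# In-place cleaning with data.table syntax
patient_data_dt[, gender := tolower(gender)]        # := modifies the column without copying
patient_data_dt[is.na(lab_test_1),
                lab_test_1 := mean(patient_data_dt$lab_test_1, na.rm = TRUE)]
setkey(patient_data_dt, patient_id)                 # index the table for fast subsetting/joins
print(patient_data_dt[.(3L)])                       # keyed lookup of patient_id == 3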

4.
1. Scheduling with Cron Jobs


Cron is a time-based job scheduler in Unix-like operating systems. It allows you to schedule commands or scripts to run automatically at specific intervals.

Steps:
1. Create a Script:
o Write an R or Python script that performs the data
preprocessing tasks.
o Ensure the script is executable and handles any
necessary file paths and dependencies.
2. Open the Crontab:
o In your terminal, type crontab -e and press Enter.
This will open the crontab file in your default text
editor.
3. Define the Schedule:
o Add a line to the crontab file that specifies the
schedule and the command to run.
o The cron schedule format is: minute hour
day_of_month month day_of_week command
o Example: To run a Python script preprocess_data.py
every day at 2:00 AM, the crontab entry would be:
o 0 2 * * * /usr/bin/python3
/path/to/preprocess_data.py
o Example: To run an R script preprocess_data.R
every hour at minute 0, the crontab entry would be:
o 0 * * * * /usr/bin/Rscript /path/to/preprocess_data.R
4. Save and Exit:
o Save the crontab file and exit the editor. Cron will
automatically load the new schedule.
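Building on the examples in step 3 above, a crontab entry can also redirect the script's output to a log file for basic monitoring (the paths below are illustrative):
0 2 * * * /usr/bin/Rscript /path/to/preprocess_data.R >> /path/to/preprocess.log 2>&1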
Pros:
 Simple and widely available.
 Easy to set up for basic scheduling needs.
Cons:
 Limited monitoring and error handling.
 Difficult to manage complex dependencies and
workflows.
 No built-in retry mechanisms.
 Difficult to track job history.
2. Scheduling with Apache Airflow
Apache Airflow is a platform to programmatically author,
schedule, and monitor workflows. It's designed for complex
data pipelines and provides robust features for scheduling and
managing jobs.
Steps:
1. Install and Set Up Airflow:
o Install Airflow on your system or cluster.
o Configure the Airflow environment, including the
database, executor, and other settings.
2. Create a DAG (Directed Acyclic Graph):
o A DAG defines the workflow as a series of tasks
and their dependencies.
o Write a Python script to create the DAG.
o Use Airflow's operators (e.g., BashOperator,
PythonOperator) to define the tasks.
o Define the schedule using the schedule_interval
argument in the DAG definition.
3. Define Tasks:
o Use Airflow operators to define the tasks in the
pipeline.
o For example, you can use BashOperator to run shell
commands or PythonOperator to execute Python
functions.
4. Define Dependencies:
o Use Airflow's dependency operators (e.g., >>, <<) to define the order in which tasks should run (a short extension sketch follows the example DAG below).
5. Deploy the DAG:
o Place the DAG script in the Airflow DAGs
directory.
o Airflow will automatically detect and schedule the
DAG.
Example Airflow DAG (Python):
Python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from datetime import datetime

def preprocess_data():
    # Your R or Python data preprocessing code here.
    import pandas as pd
    import numpy as np
    # Example to show the function is running.
    print(pd.DataFrame(np.random.rand(5, 5)))

with DAG(
    dag_id='data_preprocessing_pipeline',
    schedule_interval='0 2 * * *',  # Run daily at 2:00 AM
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:

    run_preprocessing = PythonOperator(
        task_id='run_preprocessing',
        python_callable=preprocess_data,
    )

    run_preprocessing
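The steps above mention Airflow's dependency operators (>> and <<). The single-task DAG does not need them, but a hypothetical second task (the validation step and its bash_command are illustrative) could be chained inside the same with DAG(...) block like this:
Python
    # Hypothetical second task, placed inside the with DAG(...) block above
    validate_output = BashOperator(
        task_id='validate_output',
        bash_command='echo "validating preprocessed data"',
    )

    # run_preprocessing must finish before validate_output starts
    run_preprocessing >> validate_output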
Pros:
 Robust monitoring and error handling.
 Dependency management and workflow orchestration.
 Scalability and fault tolerance.
 Web-based UI for monitoring and managing pipelines.
 Retry mechanisms and logging.
 Very useful for complex data pipelines.
Cons:
 More complex to set up and configure.
 Requires more resources than cron.
 Steeper learning curve.
Choosing Between Cron and Airflow:
 Cron: Use cron for simple, standalone scripts that need
to be run on a schedule.
 Airflow: Use Airflow for complex data pipelines with
dependencies, monitoring requirements, and scalability
needs. Airflow is better for production level data
pipelines.
5.
1. Increased Efficiency and Speed:
 Faster Processing: Automated scripts can process vast
amounts of data much faster than manual methods,
saving considerable time and resources.
 Reduced Manual Labor: Automation eliminates
repetitive tasks, freeing up data scientists and analysts to
focus on higher-level analysis and interpretation.
 Scalability: Automated pipelines can handle increasing
data volumes without proportionally increasing manual
effort.
2. Improved Accuracy and Consistency:
 Reduced Human Error: Manual data processing is
prone to errors due to fatigue, inconsistencies, or
variations in interpretation. Automation minimizes these
errors, ensuring data accuracy.
 Standardized Processes: Automated pipelines enforce
consistent data preprocessing rules and transformations,
leading to uniform data quality across all datasets.
 Reproducibility: Automated scripts provide a clear
record of all preprocessing steps, making it easy to
reproduce results and track changes.
3. Enhanced Reliability and Robustness:
 Scheduled Execution: Automated pipelines can be
scheduled to run at specific intervals, ensuring timely
data preprocessing and analysis.
 Error Handling and Logging: Automated systems can
incorporate error handling mechanisms and logging to
detect and address issues promptly.
 Reduced Downtime: Automated systems can operate
continuously, minimizing downtime and ensuring
uninterrupted data availability.
4. Streamlined Workflows and Integration:
 Pipeline Orchestration: Automation tools like Apache
Airflow enable the creation of complex data pipelines
with dependencies and scheduling.
 Seamless Integration: Automated pipelines can be
integrated with other data systems, such as databases,
data warehouses, and machine learning platforms.
 Version Control: Automated scripts can be managed
using version control systems (e.g., Git), facilitating
collaboration and tracking changes.
5. Cost Reduction:
 Reduced Labor Costs: Automation can significantly
reduce the need for manual data processing, lowering
labor costs.
 Improved Resource Utilization: Automated pipelines
can optimize resource utilization, minimizing processing
time and costs.
 Reduced Errors and Rework: By minimizing errors
and inconsistencies, automation reduces the need for
costly rework.
6. Improved Data Governance and Compliance:
 Data Lineage: Automated pipelines provide a clear
record of data transformations, facilitating data lineage
tracking.
 Data Quality Monitoring: Automated systems can
incorporate data quality checks and monitoring to ensure
compliance with data governance policies.
 Auditing: Automated logs and records can facilitate data
auditing and compliance reporting.
In summary: Automating data preprocessing improves
efficiency, accuracy, reliability, and cost-effectiveness. It is
essential for organizations that handle large volumes of data
and require consistent, high-quality data for analysis and
decision-making.
