LEARNING R PROGRAMMING FOR DATA SCIENCE ENTHUSIASTS

The document provides an overview of data science, emphasizing its importance in decision-making and the growing demand for data scientists. It introduces R programming as a statistical language suitable for data analysis and visualization, highlighting the dplyr package for data manipulation and ggplot2 for data visualization. Additionally, it includes practical examples and questions related to data analysis using R.

Uploaded by

A.S. ROHIT
Copyright: © All Rights Reserved

Learning R Programming for Data Science Enthusiasts

What is Data Science?

Data science is a field that focuses on making sense of data. It involves collecting
data, cleaning it up, and looking for patterns and insights.

This often includes using math and computer algorithms. Data scientists
also create visual charts and graphs to help explain what the data means.

How is Data Science Helpful?

Collecting, cleaning, and analyzing data helps businesses and organizations
make better decisions and solve problems.

Why Should You Learn Data Science?

1. Demand
According to the U.S. Bureau of Labor Statistics (BLS), employment of data
scientists is projected to grow by 36% from 2021 to 2031, far faster than the
average for all occupations. This high demand makes data science a promising
career choice.
2. Growth
The average salary for a data scientist in the United States is $143,970.
3. Opportunities

4. Flexibility
Data scientists are needed in many sectors, including healthcare, finance,
manufacturing, and logistics.
R Programming

Why R Programming?

Python is a general-purpose language used to develop and deploy a wide range of projects.

R is a statistical language used for analyzing and visualizing data.

Python is better suited to machine learning, deep learning, and large-scale web applications.

R is well suited to statistical learning and has powerful libraries for data experimentation and exploration.

R's statistical packages are very powerful.

Installing R and RStudio

1. R: https://cran.r-project.org/bin/windows/base/
2. RStudio Desktop: https://posit.co/download/rstudio-desktop/
DATA MANIPULATION USING THE DPLYR PACKAGE

The dplyr package is used to transform and summarize tabular data (data with rows and columns).

It works with data frames, and built-in datasets can be converted to data frames or data tables.

It is often faster and easier to read than equivalent base R code.
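
As a rough illustration of the readability point, the sketch below requests the same rows and columns twice, once with base R indexing and once with dplyr verbs. It uses the built-in mtcars data; the object names base_result and dplyr_result are illustrative only, and the sketch says nothing about speed.

```r
library(dplyr)

# Base R: logical indexing for rows, a character vector for columns
base_result <- mtcars[mtcars$mpg > 25, c("mpg", "hp")]

# dplyr: the same request written as a chain of verbs
dplyr_result <- mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, hp)

# Compare the values returned by the two approaches
all.equal(base_result, dplyr_result, check.attributes = FALSE)
```

Both versions do the same work; the dplyr version names each step, which tends to help once several steps are chained.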

Things that can be done with dplyr

1. Data Filtering

You can use filter() to select specific rows from a dataset based on
conditions, allowing you to focus on subsets of data that are relevant
to your analysis.
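
A minimal sketch of how filter() combines conditions, using the built-in mtcars data so that it runs without any extra downloads. The object names are illustrative. Conditions separated by commas must all be true; | and %in% cover the "either/or" cases.

```r
library(dplyr)

# Comma-separated conditions are combined with AND
four_cyl_manual <- filter(mtcars, cyl == 4, am == 1)

# %in% keeps rows whose value matches any element of a vector
four_or_six_cyl <- filter(mtcars, cyl %in% c(4, 6))

# | keeps rows where at least one condition is true
light_or_efficient <- filter(mtcars, wt < 2 | mpg > 30)

nrow(four_cyl_manual)
nrow(four_or_six_cyl)
nrow(light_or_efficient)
```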

# Install the dplyr package (needed only once)
install.packages("dplyr")

# library() loads an installed package into the current R session
library(dplyr)

# Install and load nycflights13, which provides the flights dataset
install.packages("nycflights13")
library(nycflights13)

# View the dataset
View(flights)
head(flights)

# Subset data using filter()

# Keep only flights that operated in July
July_flights <- filter(flights, month == 7)
View(July_flights)

# Keep only flights that operated on 3 July
July_flights_3 <- filter(flights, month == 7, day == 3)
View(July_flights_3)

# Base R equivalent: flights that left LGA on 3 September
head(flights[flights$month == 9 & flights$day == 3 & flights$origin == "LGA", ])

# Look at specific rows by position
slice(flights, 1:5)

# mutate() adds a new column and keeps the existing ones
flights <- mutate(flights, overall_delay = arr_delay - dep_delay)

# transmute() keeps only the new column, so store the result under a separate name
delay_only <- transmute(flights, overall_delay = arr_delay - dep_delay)

View(flights)
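
The difference between mutate() and transmute() is easiest to see by counting columns. This is a small sketch on the built-in mtcars data; hp_per_wt, with_ratio, and ratio_only are illustrative names, not part of the original notes.

```r
library(dplyr)

# mutate() keeps every existing column and appends the new one
with_ratio <- mutate(mtcars, hp_per_wt = hp / wt)

# transmute() returns only the columns it is asked to create
ratio_only <- transmute(mtcars, hp_per_wt = hp / wt)

ncol(mtcars)      # original number of columns
ncol(with_ratio)  # one more than the original
ncol(ratio_only)  # just the new column
```

Assigning a transmute() result back to the original name discards every other column, so a separate name is usually safer.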

library(nycflights13)
flights <- nycflights13::flights

# Summarize: mean air time
summary_data <- summarise(flights, avg_air_time = mean(air_time, na.rm = TRUE))
summary_data

# Total air time
summary_data <- summarise(flights, total_air_time = sum(air_time, na.rm = TRUE))
summary_data

# Standard deviation of air time
summary_data <- summarise(flights, sd_air_time = sd(air_time, na.rm = TRUE))
summary_data
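
Several statistics can also be computed in a single summarise() call rather than one call per statistic. The sketch below assumes the built-in mtcars data, and the column names on the left of each = are illustrative.

```r
library(dplyr)

# One row out, one column per statistic
mpg_summary <- summarise(
  mtcars,
  n_cars   = n(),
  mean_mpg = mean(mpg, na.rm = TRUE),
  sd_mpg   = sd(mpg, na.rm = TRUE),
  max_mpg  = max(mpg, na.rm = TRUE)
)
mpg_summary
```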

Grouping Data

# Group mtcars by the number of gears
by_gear <- mtcars %>% group_by(gear)

# Sum and mean of gear within each group
a <- summarize(by_gear, gear1 = sum(gear), gear2 = mean(gear))
a
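
Grouping is usually more informative when the summarized column differs from the grouping column. As a hedged sketch on the same mtcars data, the code below counts the cars in each gear group and averages their fuel economy; mpg_by_gear, n_cars, and mean_mpg are illustrative names.

```r
library(dplyr)

mpg_by_gear <- mtcars %>%
  group_by(gear) %>%
  summarise(
    n_cars   = n(),        # number of rows in each group
    mean_mpg = mean(mpg)   # average miles per gallon in each group
  )
mpg_by_gear
```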

Sampling

# Draw 15 random rows
sample_n(flights, 15)
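
Random sampling returns different rows on every run unless a seed is set. The sketch below fixes a seed and uses slice_sample(), which recent dplyr versions (1.0.0 and later) offer as the successor to sample_n(). The seed value and object names are arbitrary choices for illustration.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the same rows are drawn on every run

# Draw a fixed number of rows
sampled_rows <- slice_sample(mtcars, n = 5)

# Draw a proportion of the rows instead
sampled_share <- slice_sample(mtcars, prop = 0.25)

nrow(sampled_rows)
nrow(sampled_share)
```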

Arranging the DataSet

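# Sort by year, then by departure time (ascending)
View(arrange(flights, year, dep_time))
# Sort by year, then by departure time (descending)
View(arrange(flights, year, desc(dep_time)))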

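Nesting
df <- mtcars

# Nested calls are evaluated from the inside out: filter, then sample, then sort
result <- arrange(sample_n(filter(df, mpg > 20), size = 5), desc(mpg))
result

# Similar steps written as a pipeline with %>%, read from left to right
result <- df %>% filter(mpg > 20) %>% sample_n(size = 10) %>% arrange(desc(mpg))
result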
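
To check that nesting and piping are two spellings of the same computation, the sketch below runs identical steps both ways and compares the results. Random sampling is left out so that the comparison is deterministic; nested and piped are illustrative names.

```r
library(dplyr)

# Nested form: read from the inside out
nested <- arrange(select(filter(mtcars, mpg > 20), mpg, hp), desc(mpg))

# Piped form: read from top to bottom
piped <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg))

identical(nested, piped)
```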

Selecting the columns

Df_mpg_hp <- df %>% select(mpg, hp)
Df_mpg_hp
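
select() accepts more than a plain list of names. A short sketch of three common forms, again on mtcars and with illustrative object names:

```r
library(dplyr)

# Keep a contiguous range of columns
first_three <- select(mtcars, mpg:disp)

# Drop a column with a minus sign
without_carb <- select(mtcars, -carb)

# Pick columns by name pattern
d_columns <- select(mtcars, starts_with("d"))

names(first_three)
names(d_columns)
```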
Questions:

1. How many passengers traveled on the Titanic?
2. How many passengers survived?
3. How many passengers did not survive?
4. Which category of passengers was the largest?
5. What is the median age of passengers in the dataset?
6. How many male passengers are in the dataset?
7. How many female passengers are in the dataset?
8. How many passengers were in each class (1st, 2nd, 3rd)?
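
The questions do not say which Titanic dataset to use. As one possible starting point, the sketch below assumes the Titanic contingency table that ships with R, which is an assumption and may differ from the dataset intended here: it includes crew as a fourth class and records age only as Child or Adult, so question 5 cannot be answered from it. The object names are illustrative.

```r
library(dplyr)

# Built-in table: Class x Sex x Age x Survived, with a Freq count per combination
titanic_df <- as.data.frame(Titanic)

# Total number of people recorded
total_people <- sum(titanic_df$Freq)

# People by survival status, weighting each row by its count
by_survival <- count(titanic_df, Survived, wt = Freq)

# People by class
by_class <- count(titanic_df, Class, wt = Freq)

total_people
by_survival
by_class
```

The same count(..., wt = Freq) pattern works for Sex. With a passenger-level dataset (one row per person) the wt argument would not be needed.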
DATA VISUALIZATION USING ggplot2

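# Install ggplot2 (needed only once), then load it
install.packages("ggplot2")
library(ggplot2)

# List the available built-in datasets
data()

# Open the help page for the BOD dataset
?BOD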

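Scatterplot

# BOD: points joined by a red line
ggplot(BOD, aes(Time, demand)) +
  geom_point(size = 3) +
  geom_line(color = "red")

# CO2: colour the points by treatment (%>% requires dplyr or magrittr to be loaded)
CO2 %>% ggplot(aes(conc, uptake, colour = Treatment)) + geom_point()

# Larger, semi-transparent points with a smoothed trend line
CO2 %>% ggplot(aes(conc, uptake, colour = Treatment)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth()

# Straight-line fit per treatment, without the confidence band
CO2 %>% ggplot(aes(conc, uptake, colour = Treatment)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Concentration of CO2") +
  theme_classic()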
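
The scatterplots above are built by adding layers with +. A minimal sketch of that idea, assuming the built-in CO2 data: the base plot is stored in a variable and layers are added one at a time. The names base_plot, points_only, and with_trend are illustrative.

```r
library(ggplot2)

# Data and aesthetic mappings only; nothing is drawn yet
base_plot <- ggplot(CO2, aes(conc, uptake, colour = Treatment))

# Add a layer of points
points_only <- base_plot + geom_point(size = 3, alpha = 0.5)

# Add a second layer: one straight-line fit per treatment
with_trend <- points_only + geom_smooth(method = "lm", se = FALSE)
```

Typing an object name such as with_trend at the console draws the plot. Storing intermediate plots makes it easy to compare versions without retyping the shared part.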

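Boxplot

# Uptake by treatment
CO2 %>% ggplot(aes(Treatment, uptake)) +
  geom_boxplot()

# Overlay the individual observations
CO2 %>% ggplot(aes(Treatment, uptake)) +
  geom_boxplot() +
  geom_point()

# Colour the points by plant; size is a fixed value, so it goes outside aes()
CO2 %>% ggplot(aes(Treatment, uptake)) +
  geom_boxplot() +
  geom_point(aes(colour = Plant), size = 3)

# Make the points semi-transparent
CO2 %>% ggplot(aes(Treatment, uptake)) +
  geom_boxplot() +
  geom_point(aes(colour = Plant), size = 3, alpha = 0.5)

# Flip the axes, switch the theme, and add a title
CO2 %>% ggplot(aes(Treatment, uptake)) +
  geom_boxplot() +
  geom_point(aes(colour = Plant), size = 3, alpha = 0.5) +
  coord_flip() +
  theme_bw() +
  labs(title = "Chilled vs Non-chilled")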
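
A common source of confusion in these plots is whether a property belongs inside aes(). As a hedged sketch on the CO2 data: properties that vary with a column are mapped inside aes(), while constants are set outside it. The colour "steelblue" and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

# Mapped: colour varies with the Plant column, so it goes inside aes()
mapped <- ggplot(CO2, aes(Treatment, uptake)) +
  geom_boxplot() +
  geom_point(aes(colour = Plant), size = 3, alpha = 0.5)

# Set: one constant colour for every point, so it goes outside aes()
fixed <- ggplot(CO2, aes(Treatment, uptake)) +
  geom_boxplot() +
  geom_point(colour = "steelblue", size = 3, alpha = 0.5)
```

Putting a constant such as size = 3 inside aes() makes ggplot2 treat it as data and add a legend entry for it, which is rarely what is wanted.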

View(mpg)

Questions using mtcars

1. Is there a relationship between a car's weight and its miles per gallon (MPG)?

mtcars %>% ggplot(aes(wt,mpg)) + geom_point()
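
One way to back up what the scatterplot suggests is to compute the correlation and overlay a fitted line. This is a sketch rather than a full analysis; wt_mpg_cor and wt_mpg_plot are illustrative names.

```r
library(ggplot2)

# Pearson correlation between weight and fuel economy
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)

# Scatterplot with a fitted straight line
wt_mpg_plot <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

wt_mpg_cor
```

A negative correlation means that heavier cars tend to have lower MPG; the closer the value is to -1, the stronger that linear relationship.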

2. What is the distribution of car counts based on the number of cylinders?

mtcars %>%
  group_by(cyl) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = as.factor(cyl), y = count)) +
  geom_bar(stat = "identity")
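
The counting step can also be left to ggplot2: with no y aesthetic, geom_bar() counts the rows in each category itself. A short sketch of that alternative, with an illustrative object name and axis labels:

```r
library(ggplot2)

# geom_bar() counts rows per cylinder category when no y is supplied
cyl_counts_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Number of cars")
```

Both versions draw the same bars; pre-summarizing is mainly useful when the counts are needed elsewhere as a table.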

3. What is the distribution of MPG among the cars in the dataset?

mtcars %>% ggplot(aes(mpg))+geom_histogram()

4. Does horsepower (hp) change with engine displacement (cubic inches)?

mtcars %>% ggplot(aes(disp, hp)) + geom_line()
