
Session 12

The document provides an overview of data analytics techniques using SQL and R, focusing on data import, manipulation, and querying. It includes examples of reading data from various formats, performing SQL queries using the sqldf package, and practical exercises related to employee and admission data. Additionally, it highlights the popularity of SQL in the analytics industry and offers SQL query examples for data selection and filtering.

Uploaded by

Sahil Dugar

Data Analytics

Session-12
SQL and Regression

Indian Institute of Management Rohtak


Workspace and Working Directory
> getwd(): shows the current working directory

> setwd(): sets (changes) the working directory


setwd("C:/Users/your User Name here/Desktop")

> setwd("F:/da2024")
> getwd()
[1] "F:/da2024"
> setwd("~/")
> getwd()
[1] "C:/Users/PRS/Documents"


Import Data (input.csv)
➢ data <- read.csv("C:/Users/admin/Desktop/input.csv")

• data <- read.csv("input.csv")

data <- read.csv(file.choose(), header = TRUE)


Excel file
library("readxl")
my_data <- read_excel("my_file.xls")
my_data <- read_excel(file.choose())
# Specify sheet by its name
my_data <- read_excel("my_file.xlsx", sheet = "data")
# Specify sheet by its index
my_data <- read_excel("my_file.xlsx", sheet = 2)
Import from Desktop
The function read.table() can then be used to read the data frame directly

> airqual <- read.table("C:/Desktop/airquality.txt")

Similarly, to read .csv files the read.csv() function can be used to read in the data frame
directly

> airqual <- read.csv("C:/Desktop/airquality.csv")


Excel:

library("readxl")
airqual <- read_excel("C:\\Users\\admin\\Desktop\\BA_Gradesheet.xlsx")
Import Data (input.csv)

• data <- read.csv("input.csv")

View(data)
Practices
Place your input.csv file in the working directory
data <- read.csv("input.csv")
print(data)
# Get the max salary from data frame.
sal <- max(data$salary)
print(sal)
[1] 843.25
Get the details of the person with max salary
# Get the person detail having max salary.
retval <- subset(data, salary == max(salary))
print(retval)

#Get all the people working in IT department
retval <- subset( data, dept == "IT")
print(retval)

# Get the persons in IT department whose salary is greater than 600
info <- subset(data, salary > 600 & dept == "IT")
print(info)

# Write filtered data into a new file.


write.csv(retval,"output.csv")
newdata <- read.csv("output.csv")
print(newdata)
Employee.csv
• Get maximum salary from given data
• Get the details of person having max salary
• Get the min salary in each dept
• Get the details of people in HR dept
• Get the details of people with salary > 25000 in Sales dept
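One way to approach these exercises in base R is sketched here. Employee.csv itself is not shown in the slides, so the data frame below is made up and the column names (name, salary, dept) are assumptions; replace it with read.csv("Employee.csv") and adjust the names to match the real file.

```r
# Hypothetical stand-in for Employee.csv; names and values are invented.
emp <- data.frame(
  name   = c("Asha", "Bala", "Chitra", "Dev", "Esha"),
  salary = c(30000, 22000, 41000, 27000, 19000),
  dept   = c("Sales", "HR", "Sales", "HR", "IT")
)

# Maximum salary and the person who earns it
max_sal <- max(emp$salary)
top     <- subset(emp, salary == max(salary))

# Minimum salary in each department
min_by_dept <- aggregate(salary ~ dept, data = emp, FUN = min)

# Everyone in HR, and Sales staff earning more than 25000
hr         <- subset(emp, dept == "HR")
sales_high <- subset(emp, salary > 25000 & dept == "Sales")

print(min_by_dept)
```

aggregate() with a formula plays the role that GROUP BY plays in SQL: one row per department, with the function applied within each group.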



SQL ?
SQL is widely used across the analytics and data science
industry. According to an online survey conducted
by O'Reilly Media in 2022, SQL was used by 77% of the
respondents, followed by R and Python. The survey also
found that people who know Excel (spreadsheets)
tend to get a significant salary boost once they learn SQL.
Similarly, a survey by datasciencecentral inferred that
R users tend to get a salary boost once they learn SQL.
In a way, SQL is meant to complement your current
set of skills.
Data Selection
SELECT – Specifies which columns to select.
FROM – Specifies the table (dataset) the columns are selected from.
LIMIT – By default, a command is executed on all rows in a table. LIMIT
restricts the number of rows returned, which can make a query run faster.
WHERE – Specifies a filter condition, so that only rows satisfying it are
retrieved.
Comparison Operators – ( = , != , < , > , <= , >= ). They are used in
conjunction with WHERE.
Logical Operators – AND, OR and NOT combine multiple filter conditions.
Other operators are:
LIKE – Matches values against a pattern (similar values) rather than an exact value.
IN – Specifies a list of values to keep (or, with NOT IN, to leave out).
BETWEEN – Keeps values that fall within a given range, both ends included.
IS NULL – Selects rows where the specified column is missing; use IS NOT NULL
to extract rows without missing values.
ORDER BY – Sorts the result by a column in ascending or descending order.
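A few of these keywords can be tried on the sample table used later in the slides. This is a minimal sketch, assuming the sqldf package is installed (the queries are skipped otherwise); the data frame repeats the input.csv rows without the start_date column, and the result names (top3, mid, others, no_salary) are only illustrative.

```r
# Sample table copied from the slides (the rows of input.csv, minus start_date).
data <- data.frame(
  id     = 1:8,
  name   = c("Rick", "Dan", "Michelle", "Ryan", "Gary", "Nina", "Simon", "Guru"),
  salary = c(623.30, 515.20, 611.00, 729.00, 843.25, 578.00, 632.80, 722.50),
  dept   = c("IT", "Operations", "IT", "HR", "Finance", "IT", "Operations", "Finance")
)

# Run the queries only if sqldf is installed.
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)

  # ORDER BY + LIMIT: the three highest salaries
  top3 <- sqldf("SELECT name, salary FROM data ORDER BY salary DESC LIMIT 3")

  # BETWEEN: salaries from 600 to 700, both ends included
  mid <- sqldf("SELECT name FROM data WHERE salary BETWEEN 600 AND 700")

  # NOT IN: everyone outside the IT and HR departments
  others <- sqldf("SELECT name FROM data WHERE dept NOT IN ('IT', 'HR')")

  # IS NULL: rows where salary is missing (none in this table)
  no_salary <- sqldf("SELECT * FROM data WHERE salary IS NULL")

  print(top3)
}
```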



Practising SQL in R
For writing SQL queries, we’ll use the sqldf package.
It is one of the most versatile packages available
these days for running SQL queries on R data frames.

library(sqldf)

sqldf("select * from data")


id name salary start_date dept
1 Rick 623.30 1/1/2012 IT
2 Dan 515.20 9/23/2013 Operations
3 Michelle 611.00 11/15/2014 IT
4 Ryan 729.00 5/11/2014 HR
5 Gary 843.25 3/27/2015 Finance
6 Nina 578.00 5/21/2013 IT
7 Simon 632.80 7/30/2013 Operations
8 Guru 722.50 6/17/2014 Finance



sqldf("SELECT name FROM data WHERE salary < 623 AND dept = 'IT'")
name
1 Michelle
2 Nina
sqldf("select * from data where dept!='HR'")
id name salary start_date dept
1 1 Rick 623.30 1/1/2012 IT
2 2 Dan 515.20 9/23/2013 Operations
3 3 Michelle 611.00 11/15/2014 IT
4 5 Gary 843.25 3/27/2015 Finance
5 6 Nina 578.00 5/21/2013 IT
6 7 Simon 632.80 7/30/2013 Operations
7 8 Guru 722.50 6/17/2014 Finance



sqldf("select sum(salary) as 'Total_Count' from data")
Total_Count
1 5255.05

sqldf("select min(salary), max(salary) from data")
min(salary) max(salary)
1 515.2 843.25


sqldf("select * from data order by name ")
id name salary start_date dept
1 2 Dan 515.20 9/23/2013 Operations
2 5 Gary 843.25 3/27/2015 Finance
3 8 Guru 722.50 6/17/2014 Finance
4 3 Michelle 611.00 11/15/2014 IT
5 6 Nina 578.00 5/21/2013 IT
6 1 Rick 623.30 1/1/2012 IT
7 4 Ryan 729.00 5/11/2014 HR
8 7 Simon 632.80 7/30/2013 Operations

sqldf("select * from data order by dept,name ")

sqldf("select * from data where name like 'R%' ")


id name salary start_date dept
1 1 Rick 623.3 1/1/2012 IT
2 4 Ryan 729.0 5/11/2014 HR

sqldf("select * from data order by name desc ")


sqldf("select * from data order by dept desc,name asc")
id name salary start_date dept
1 2 Dan 515.20 9/23/2013 Operations
2 7 Simon 632.80 7/30/2013 Operations
3 3 Michelle 611.00 11/15/2014 IT
4 6 Nina 578.00 5/21/2013 IT
5 1 Rick 623.30 1/1/2012 IT
6 4 Ryan 729.00 5/11/2014 HR
7 5 Gary 843.25 3/27/2015 Finance
8 8 Guru 722.50 6/17/2014 Finance
sqldf("select * from data where salary in (515.20,722.50,611.00)")
id name salary start_date dept
1 2 Dan 515.2 9/23/2013 Operations
2 3 Michelle 611.0 11/15/2014 IT
3 8 Guru 722.5 6/17/2014 Finance

sqldf("SELECT * FROM data WHERE (salary > 600 AND name like 'R%')")
id name salary start_date dept
1 1 Rick 623.3 1/1/2012 IT
2 4 Ryan 729.0 5/11/2014 HR

sqldf("select * from data where salary > 600 order by name desc ")
id name salary start_date dept
1 7 Simon 632.80 7/30/2013 Operations
2 4 Ryan 729.00 5/11/2014 HR
3 1 Rick 623.30 1/1/2012 IT
4 3 Michelle 611.00 11/15/2014 IT
5 8 Guru 722.50 6/17/2014 Finance
6 5 Gary 843.25 3/27/2015 Finance
Road.xlsx

Crashes.xlsx
crash <- read.csv("crashes.csv")
OR
crash <- read.csv.sql("crashes.csv", sql = "select * from file")

p1 <- read.csv("roads.csv")
OR
p1 <- read.csv.sql("roads.csv", sql = "select * from file")


p1

Crash
#join the data sets
➢ a <- sqldf("select * from crash join p1 on crash.Road = p1.Road")
➢ View(a)


Interstate275 is not present
#left join
B1 <- sqldf("select crash.Year, crash.Volume, p1.* from crash left join p1 on crash.Road = p1.Road")
View(B1)
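The difference between the inner join and the left join can be seen on a toy example. This sketch uses base R merge() as a stand-in for the SQL joins, and the two small tables are invented (the road names and values are placeholders, not the contents of crashes.csv or roads.csv).

```r
# Invented tables: one road in crash_toy has no match in roads_toy.
crash_toy <- data.frame(
  Year   = c(2010, 2010, 2011),
  Road   = c("Road A", "Road B", "Road C"),
  Volume = c(100, 200, 300)
)
roads_toy <- data.frame(
  Road     = c("Road A", "Road B"),
  District = c("North", "South")
)

# Inner join: keeps only roads found in both tables
inner <- merge(crash_toy, roads_toy, by = "Road")

# Left join: keeps every crash row; unmatched columns become NA
left <- merge(crash_toy, roads_toy, by = "Road", all.x = TRUE)

print(inner)
print(left)
```

The unmatched road disappears from the inner join but stays in the left join with NA in the columns that come from the second table.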



Test
➢ data()

data("UCBAdmissions")

ucb <- as.data.frame(UCBAdmissions)

View(ucb)

Return all admission records for female students
sqldf("select * from ucb where Gender = 'Female'")
Admit Gender Dept Freq
1 Admitted Female A 89
2 Rejected Female A 19
3 Admitted Female B 17
4 Rejected Female B 8
5 Admitted Female C 202
6 Rejected Female C 391
7 Admitted Female D 131
8 Rejected Female D 244
9 Admitted Female E 94
10 Rejected Female E 299
11 Admitted Female F 24
12 Rejected Female F 317
Total admitted students
sqldf("select sum(Freq) from ucb where Admit = 'Admitted'")
## SUM("Freq")
## 1 1755


# Return total rejected females
sqldf("select sum(Freq) as total_ladies from ucb where Admit = 'Rejected' AND Gender = 'Female'")

## total_ladies
## 1 1278
Group by dept



Department-wise total admitted students
Hint: Group by dept
sqldf("select Dept, sum(Freq) as sum_admitted from ucb where Admit = 'Admitted' group by Dept")
Dept sum_admitted
1 A 601
2 B 370
3 C 322
4 D 269
5 E 147
6 F 46
Group by admitted and rejected

sqldf("select Admit, sum(Freq) as sum_admitted from ucb group by admit")
Admit sum_admitted
1 Admitted 1755
2 Rejected 2771
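The grouped totals above can be cross-checked without SQL. This sketch uses base R aggregate() in place of sqldf, on the same built-in UCBAdmissions data; the result names are only illustrative.

```r
# UCBAdmissions ships with R (datasets package).
ucb <- as.data.frame(UCBAdmissions)

# Counterpart of: select Dept, sum(Freq) ... where Admit = 'Admitted' group by Dept
admitted_by_dept <- aggregate(Freq ~ Dept,
                              data = subset(ucb, Admit == "Admitted"),
                              FUN = sum)

# Counterpart of: select Admit, sum(Freq) ... group by Admit
by_admit <- aggregate(Freq ~ Admit, data = ucb, FUN = sum)

print(admitted_by_dept)
print(by_admit)
```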
https://www.hackerearth.com/blog/developers/exclusive-sql-tutorial-on-data-analysis-in-r/

https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/sql.html

https://jasminedaly.com/tech-short-papers/sqldf_tutorial.html


Some Practice Database SQL Questions

emp (eno, ename, bdate, title, salary, dno)
proj (pno, pname, budget, dno)
dept (dno, dname, mgreno)
workson (eno, pno, resp, hours)


Write an SQL query that returns the project
number and name for projects with a
budget greater than $100,000.
proj (pno, pname, budget, dno)

SELECT pno, pname
FROM proj
WHERE budget > 100000
Write an SQL query that returns all workson
records where hours worked is less than
10 and the responsibility is 'Manager'.
workson (eno, pno, resp, hours)

SELECT *
FROM workson
WHERE hours < 10 AND resp = 'Manager'
Write an SQL query that returns the employees
(number and name only) who have a
title of 'EE' or 'SA' and make more than $35,000.
emp (eno, ename, bdate, title, salary, dno)

SELECT eno, ename
FROM emp
WHERE (title = 'EE' OR title = 'SA') AND salary > 35000


Write an SQL query that returns the employees (name
only) in department no 'D1' ordered by decreasing
salary.
emp (eno, ename, bdate, title, salary, dno)

SELECT ename
FROM emp
WHERE dno = 'D1'
ORDER BY salary DESC
Write an SQL query that returns the departments (all fields)
ordered by ascending department name.

dept (dno, dname, mgreno)

SELECT *
FROM dept
ORDER BY dname ASC



Write an SQL query that returns the employee name, department
name, and employee title.
emp (eno, ename, bdate, title, salary, dno)
proj (pno, pname, budget, dno)
dept (dno, dname, mgreno)
workson (eno, pno, resp, hours)

SELECT ename, dname, title
FROM emp, dept
WHERE emp.dno = dept.dno


Write an SQL query that returns the project name, department name,
and budget for all projects with a budget < $50,000.
emp (eno, ename, bdate, title, salary, dno)
proj (pno, pname, budget, dno)
dept (dno, dname, mgreno)
workson (eno, pno, resp, hours)

SELECT pname, dname, budget
FROM proj, dept
WHERE budget < 50000 AND proj.dno = dept.dno
Write an SQL query that returns the project name, hours worked, and
project number for all workson records where hours > 10.
emp (eno, ename, bdate, title, salary, dno)
proj (pno, pname, budget, dno)
dept (dno, dname, mgreno)
workson (eno, pno, resp, hours)

SELECT pname, hours, proj.pno
FROM workson, proj
WHERE hours > 10 AND proj.pno = workson.pno
Write an SQL query that returns the employee name, project
name, employee title, and hours for all workson records.

emp (eno, ename, bdate, title, salary, dno)
proj (pno, pname, budget, dno)
dept (dno, dname, mgreno)
workson (eno, pno, resp, hours)

SELECT ename, pname, title, hours
FROM emp, proj, workson
WHERE emp.eno = workson.eno
AND proj.pno = workson.pno
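The practice queries join tables by listing them after FROM and matching keys in WHERE. The same join can also be written with an explicit JOIN ... ON clause. A minimal sketch, assuming sqldf is installed (the queries are skipped otherwise); the rows below are invented and only follow the emp and dept schemas, with a subset of the columns.

```r
# Invented rows that follow the emp and dept schemas (subset of columns).
emp <- data.frame(
  eno   = c("E1", "E2", "E3"),
  ename = c("J. Doe", "M. Smith", "A. Lee"),
  title = c("EE", "SA", "ME"),
  dno   = c("D1", "D2", "D1")
)
dept <- data.frame(
  dno   = c("D1", "D2"),
  dname = c("Management", "Consulting")
)

if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)

  # Comma join with the matching condition in WHERE, as in the practice answers
  implicit <- sqldf("SELECT ename, dname, title
                     FROM emp, dept
                     WHERE emp.dno = dept.dno
                     ORDER BY ename")

  # The same join written with an explicit JOIN ... ON
  explicit <- sqldf("SELECT ename, dname, title
                     FROM emp JOIN dept ON emp.dno = dept.dno
                     ORDER BY ename")

  # TRUE when both forms return the same rows
  print(isTRUE(all.equal(implicit, explicit)))
}
```

The ORDER BY is there only so the two results can be compared row by row.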
Linear Regression
Steps to Establish a Regression
A simple example of regression is predicting the weight of a person when
their height is known. To do this, we need the relationship
between a person's height and weight.
The steps to create the relationship are:
• Gather a sample of observed values of height and the corresponding weight.
• Create a relationship model using the lm() function in R.
• Find the coefficients from the model and use them to write the
mathematical equation.
• Get a summary of the relationship model to see the average error
in prediction (the residuals).
• To predict the weight of new persons, use the predict() function in R.


Input Data
Height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

Weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.

relation <- lm(Weight~Height)



summary(relation)

Person weight = b0 + b1 * person height

Person weight = -38.45509 + 0.67461 * person height

# Find weight of a person with height 170.

Person weight = -38.45509 + 0.67461 * 170 = 76.22861

x11 <- data.frame(Height=170)


result <-predict(relation,x11)
print(result)
       1 
76.22869 
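The hand calculation and predict() can be compared directly by pulling the coefficients out of the model with coef(). A short sketch that repeats the input data so it runs on its own; by_hand and by_predict are illustrative names.

```r
Height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
Weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(Weight ~ Height)

# Intercept (b0) and slope (b1) of the fitted line
b <- coef(relation)
print(b)

# Prediction by hand: b0 + b1 * 170
by_hand <- unname(b[1] + b[2] * 170)

# Same prediction through predict()
by_predict <- unname(predict(relation, data.frame(Height = 170)))

print(c(by_hand = by_hand, by_predict = by_predict))
```

Using the unrounded coefficients, the two values agree; small differences appear only when the coefficients are rounded before the calculation.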
Find the weights if the heights are as follows: 170, 172, 174, 176, 178

k1 <- data.frame(Height=c(170,172,174,176,178))
> k1
Height
1 170
2 172
3 174
4 176
5 178
x14 <- predict(relation,k1)
x14
76.22869, 77.57791, 78.92713, 80.27635, 81.62557
Visualize the Regression Graphically
# Give the chart file a name.
png(file = "linearregression.png")
# Plot the chart, then add the fitted regression line.
plot(Height, Weight)
abline(lm(Weight ~ Height))
# Save the file.
dev.off()
getwd()
plot(Height, Weight, col = "blue", main = "Height & Weight Regression",
cex = 1.3, pch = 16, xlab = "Height in cm", ylab = "Weight in Kg")
abline(lm(Weight ~ Height))
plot(x,y,pch=2,cex=6,col="red")
plot(Height,Weight,pch=4,cex=7,col="red")


library(ggplot2)
ggplot(data.frame(Height, Weight), aes(Height, Weight)) +
geom_point() + stat_smooth(method = "lm")


Multiple Regression

Multiple regression is an extension of linear regression
to the relationship between more than two variables. In
simple linear regression we have one predictor and one
response variable, but in multiple regression we have
more than one predictor variable and one response
variable.
y = a + b1x1 + b2x2 + ... + bnxn
y is the response variable.

a, b1, b2, ..., bn are the coefficients.

x1, x2, ..., xn are the predictor variables.


Multiple Regression
Input Data
Consider the data set "mtcars" available in the R
environment. It compares different car models in
terms of miles per gallon ("mpg"), cylinder
displacement ("disp"), horsepower ("hp"), weight
of the car ("wt") and some more parameters.
The goal of the model is to establish the
relationship between "mpg" as the response
variable and "disp", "hp" and "wt" as predictor
variables. We create a subset of these variables
from the mtcars data set for this purpose.
Multiple Regression

input <- mtcars[,c("mpg","disp","hp","wt")]

View(input)



Multiple Regression
# Create the relationship model.

model <- lm(mpg~disp+hp+wt, data=input)

model <- lm(mpg~., data=input)

model <- lm(mpg~., input)



Multiple Regression
Create Equation for Regression Model

Apply Equation for predicting New Values


We can use the regression equation created above to
predict the mileage when a new set of values for
displacement, horse power and weight is provided
For a car with disp = 221, hp = 102 and wt = 2.91 the
predicted mileage is



Multiple Regression

a <- data.frame(disp=221,hp=102,wt=2.91)

> result <-predict(model,a)

> print(result)

22.65987
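The same prediction can be reproduced from the regression equation y = a + b1x1 + b2x2 + b3x3 by reading the coefficients off the model with coef(). A short sketch that refits the model so it runs on its own; by_hand and by_predict are illustrative names.

```r
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)

# a (intercept) and b1, b2, b3 for disp, hp and wt
b <- coef(model)
print(b)

# y = a + b1*disp + b2*hp + b3*wt for disp = 221, hp = 102, wt = 2.91
by_hand <- unname(b["(Intercept)"] + b["disp"] * 221 +
                  b["hp"] * 102 + b["wt"] * 2.91)

# Same prediction through predict()
by_predict <- unname(predict(model, data.frame(disp = 221, hp = 102, wt = 2.91)))

print(c(by_hand = by_hand, by_predict = by_predict))
```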



Data Frame
x <- data.frame("roll"=1:3,"name"=c("jack","jill","Tom"),"age"=c(20,22,23))
Add a new column with data
x$bloodgroup <- c("A+","B-","AB+")
Add a new column using cbind, also with data
x <- cbind(x,city=c("Delhi","Mumbai","Chennai"))


adding new row
x<-rbind(x,c(4,"jack",24,"B+","Delhi"))
Deletion
x$age <- NULL
x[,c(2,4)]<- NULL
data[-c(row_number), ]

x[-c(1,3), ]
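One thing to watch with the rbind() call above: c() can hold only one type, so c(4, "jack", 24, ...) is a character vector, and binding it turns the numeric columns into character. A sketch of an alternative that binds a one-row data frame instead, shown on a smaller three-column version of x; x_vec and x_df are illustrative names.

```r
x <- data.frame(roll = 1:3, name = c("jack", "jill", "Tom"), age = c(20, 22, 23))

# Binding a plain vector: all values arrive as text
x_vec <- rbind(x, c(4, "jack", 24))
print(sapply(x_vec, class))

# Binding a one-row data frame: numeric columns stay numeric
x_df <- rbind(x, data.frame(roll = 4L, name = "jack", age = 24))
print(sapply(x_df, class))

# Numeric summaries still work on the second version
print(mean(x_df$age))
```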
Thank you !!!