R Doc Ii Vee
ASSIGNMENT II
SUBMITTED TO:
MR. SUBHANKAR MISHRA
ASST. PROF
NIFT BHUBANESWAR
SUBMITTED BY:
VAISISTHA BAL
BFT/17/470
DEPT. OF FASHION TECHNOLOGY
NATIONAL INSTITUTE OF FASHION TECHNOLOGY
Result: 21.76625
Interpretation: The trim argument of mean() drops a fraction (here 10%) of the observations
from each end of the sorted data before averaging, which reduces the influence of extreme
values. The trimmed mean for the column “MEDV” over the selected range “h” is 21.76625,
which is the trimmed average of the median value of prices.
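As an illustration of what trimming does, here is a minimal sketch on made-up numbers (not the MEDV data). The vector x and the variable names are hypothetical; only mean() and its trim argument are base R.

```r
# Toy values (not the MEDV data): ten numbers, one of them extreme.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

plain_mean   <- mean(x)              # uses all ten values
trimmed_mean <- mean(x, trim = 0.1)  # drops 10% of the values (one here) from each end

# The same trimmed mean by hand: sort, drop the first and last value, average the rest.
by_hand <- mean(sort(x)[2:9])

print(c(plain = plain_mean, trimmed = trimmed_mean, by_hand = by_hand))
```

The extreme value pulls the plain mean up, while the trimmed mean stays close to the bulk of the data.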
• To sort:
sort(v)
Result: [1] 12.7 13.1 13.2 13.5 13.6 13.9 14.4 14.5 14.5 14.8 15.0 15.2 15.6 16.0
[15] 16.5 16.6 16.6 17.4 17.5 18.2 18.2 18.4 18.7 18.9 18.9 18.9 18.9 19.3
[29] 19.4 19.4 19.6 19.6 19.7 19.9 20.0 20.0 20.0 20.2 20.3 20.4 20.5 20.6
[43] 20.8 20.9 21.0 21.0 21.2 21.2 21.4 21.4 21.6 21.7 21.7 22.0 22.0 22.2
[57] 22.2 22.5 22.6 22.8 22.9 22.9 22.9 23.1 23.3 23.4 23.4 23.5 23.6 23.9
[71] 23.9 24.1 24.2 24.7 24.7 24.7 24.8 25.0 25.0 25.0 25.3 26.6 26.6 27.1
[85] 27.5 28.0 28.4 28.7 28.7 30.8 31.6 33.0 33.2 33.4 34.7 34.9 35.4 36.2
[99] 38.7 43.8
Interpretation: sort() arranges the values in the selected range in ascending order.
• To calculate inter quartile range:
IQR(h)
Result: 5.8
Interpretation: IQR() calculates the interquartile range of the given data, which in this
case is stored in h. The IQR is the difference between the third quartile (75%) and the
first quartile (25%), so it ignores the lowest 25% and the highest 25% of the values and
shows how spread out the middle 50% is. Here IQR = 24.7 - 18.9 = 5.8. The data do not
need to be sorted beforehand; IQR() orders them internally.
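The relation between IQR() and quantile() can be checked with a short sketch. The vector x below is made up for illustration and is not the data in h.

```r
# Toy values, deliberately unsorted to show that no manual sorting is needed.
x <- c(10, 4, 15, 2, 7, 4, 12, 5, 9)

q <- quantile(x, probs = c(0.25, 0.75))  # first and third quartile
iqr_by_hand <- unname(q[2] - q[1])       # third quartile minus first quartile
iqr_builtin <- IQR(x)                    # base R helper, same default quantile type

print(c(by_hand = iqr_by_hand, builtin = iqr_builtin))
```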
• To calculate standard deviation:
sd(h)
Result: 5.948352
Interpretation: The standard deviation measures how far the data points typically lie
from the mean of the values; it is the square root of the variance.
• To calculate quantile:
quantile(h)
Interpretation: quantile() returns the quartile points of the data from lowest to highest.
0% is the minimum, equal to 12.7, and 100% is the maximum, equal to 43.8. 50% is the
median, equal to 21.5. 25% is the first quartile, 18.9 (a quarter of the values lie below
it), and 75% is the third quartile, 24.7 (three quarters of the values lie below it).
• To calculate median:
median(h)
Result: 21.5
Interpretation: Middle value of the data range is 21.5.
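• To calculate mad
mad(h)
Result: 4.07715
Interpretation: mad() returns the median absolute deviation: the median of the absolute
differences between each value and the median, multiplied by the constant 1.4826 by
default. It is a measure of spread that is robust to outliers.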
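To make the definition concrete, the sketch below computes the median absolute deviation by hand on made-up values and compares it with base R's mad(), which applies a default scale constant of 1.4826. The vector x is hypothetical and unrelated to h.

```r
# Toy values with one outlier.
x <- c(1, 2, 3, 4, 100)

# Unscaled median absolute deviation: median distance from the median.
raw_mad <- median(abs(x - median(x)))

# mad() multiplies the raw value by 1.4826 unless another constant is supplied.
scaled_mad <- mad(x)

print(c(raw = raw_mad, scaled = scaled_mad))
```

Note that R is case-sensitive, so the function is mad(), not Mad().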
• To find out variance
var(h)
Result: 35.38289
• To find out the maximum
max(h)
Result: 43.8
Interpretation: In the specified data range, max shows the highest value, i.e.
43.8.
• To find out the minimum
min(h)
Result: 12.7
Interpretation: In the specified range, min function shows the lowest value, i.e. 12.7
BASIC GRAPHS
• barplot(table(v))
Interpretation: The highest counts of the median value of prices lie in the range of
17.5 to 19.7.
• hist(v)
Interpretation: The histogram shows the distribution of the median value of prices. The
tallest bars, i.e. the most observations in the specified data range, fall between 20 and 25.
• rug(v)
Interpretation: rug() adds a tick mark on the axis for each individual value, so dense
clusters of ticks show where values occur most often. In more detail than the histogram,
we can see that most values fall between 20 and 25.
Interpretation:
• pie(table(v))
• CORRELATION PLOT
corrplot(cor(select(housing, -CHAS)), method = 'number')
With an increase in low-income households in an area, the median price of the houses
decreases exponentially.
NOTE: The plot reveals that the peak densities of MEDV lie between 15 and 30.
• SCATTER PLOTS
SCATTER PLOTS of different variables w.r.t. `MEDV`, with a smoothed trend line
(stat_smooth) to emphasize correlation patterns
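library(dplyr)     # select() and %>%
library(reshape2)  # melt(); assumed here to be the source of melt()
library(ggplot2)
# Reshape to long format: one row per (variable, value) pair, keeping MEDV as id column
melt(select(housing,
c(CRIM, RM, AGE, RAD,
TAX, LSTAT, MEDV, INDUS,
NOX, PTRATIO, ZN)),
id.vars = "MEDV") %>%
ggplot(aes(x=value, y=MEDV, color=variable)) +
# color is set outside aes() so that the trend line is actually drawn in black
geom_point(alpha=0.7) + stat_smooth(color='black') +
facet_wrap(~variable, scales='free', ncol = 2) +
labs(x="Variable Value", y="Median House Price($1000)") + theme_minimal()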
NOTE:
2. There is a slight decrease in the Median Price (MEDV) of the houses with increasing `AGE`
of the house.
3. With an increase in the number of rooms (RM), the Median Price (MEDV) shows an almost
linear increase; for an average of 4-6 rooms housing prices rise slowly, whereas in the
range 6-9 they rise at a higher rate.
• BOXPLOT
par(mfrow=c(1, 4))
boxplot(housing$CRIM, main='CRIM', col='aquamarine')
boxplot(housing$ZN, main='ZN', col='aquamarine')
boxplot(housing$RM, main='RM', col='aquamarine')
boxplot(housing$B, main='B', col='aquamarine')
NOTE: It can be observed that the variables 'CRIM' and 'B' take a wide range of values.
The variables 'CRIM', 'ZN', 'RM' and 'B' have a large difference between their median and
mean, which indicates many outliers in the respective variables.
CRIM: Most houses in the Boston region have low crime rates.
• HISTOGRAM PLOT
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggh_nox = ggplot(housing, aes(x=NOX)) +
geom_histogram(aes(y=after_stat(density)), bins=50,
color='darkblue',fill="lightblue", position="dodge") +
geom_density(alpha=.2, fill='antiquewhite3') + theme_minimal()
fill="red", position="dodge") +
geom_density(alpha=.2, fill='antiquewhite3') + theme_minimal()
NOTE: TAX: Tax rates for different houses are NOT UNIFORM due to different socio-economic
attributes.
LSTAT: The percentage of lower-status communities is more uniform in the 3-19
percent range of observations.
ZN: A high percentage of houses in Boston have a ZN of 0, i.e. no residential land
zoned for lots over 25,000 sq ft.
We can conclude that there is little scope for new residential developments around
the existing houses.
NOX: The concentration of nitrogen oxides varies NON-UNIFORMLY across the Boston area.
We can also observe that all the houses lie under the 1.0 NOX pptm (parts per ten million)
level, which is recommended as a healthy NO2 level by the WHO.
drop_na() removes all rows that contain missing values from the dataset. This helps us
further with transformation and Linear Modelling, since we don’t want missing values to
affect our modelling.
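A minimal sketch of what drop_na() does, assuming the tidyr package is installed. The small data frame toy is made up for illustration; only the column names CRIM and MEDV are taken from the dataset.

```r
library(tidyr)  # provides drop_na()

# Toy data frame with missing values in two different rows.
toy <- data.frame(
  CRIM = c(0.10, NA, 2.30, 0.45),
  MEDV = c(24.0, 21.6, NA, 33.4)
)

# Keep only the rows in which every column has a value.
clean <- drop_na(toy)

print(clean)
```

Rows 2 and 3 each contain one NA, so only the two complete rows remain.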
STEP 4: TRANSFORMATION
PACKAGES WE MAY NEED:
- library(tidyverse)
- library(dplyr)
BASIC FUNCTIONS OF DPLYR:
NOTE: Boston is the dataset we are working on. Since we want information related
to crime rate, it is easier to work on that specific column, so we use select() to pick the
CRIM (crime rate) column out of the dataset. Then we use filter() to keep all the
observations greater than or equal to 1 with the conditional statement CRIM >= 1. The
result is stored in “cm”.
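The select() and filter() steps described in this note could look roughly like the sketch below. The data frame is a handful of made-up rows standing in for the real dataset; only the names boston, CRIM and cm come from the text.

```r
library(dplyr)

# Made-up stand-in for the real dataset (assumption: columns named CRIM and MEDV).
boston <- data.frame(
  CRIM = c(0.02, 1.00, 3.50, 0.75, 8.90),
  MEDV = c(24.0, 21.6, 17.8, 30.1, 10.2)
)

cm <- boston %>%
  select(CRIM) %>%     # keep only the crime-rate column
  filter(CRIM >= 1)    # keep rows with a crime rate of 1 or more

print(cm)
```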
view(desc)
NOTE: We’re working on the “boston” dataset. The result of the transformation is stored
in “desc”. To work on MEDV, we choose that specific column with select(). To sort the
values we use arrange(); by default, arrange() sorts in ascending order, so for
descending order we specify arrange(desc(MEDV)).
RESULT: To view the full result in a new file, type “view(desc)”.
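A possible version of this transformation is sketched below on made-up rows. The result is stored under the hypothetical name sorted_medv rather than “desc”, purely to keep it visually distinct from dplyr's desc() helper.

```r
library(dplyr)

# Made-up stand-in for the real dataset.
boston <- data.frame(
  CRIM = c(0.02, 1.00, 3.50, 0.75),
  MEDV = c(24.0, 21.6, 33.4, 17.8)
)

sorted_medv <- boston %>%
  select(MEDV) %>%        # keep only the MEDV column
  arrange(desc(MEDV))     # desc() reverses arrange()'s default ascending order

print(sorted_medv)
```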
• Find out highest crime rates in the town in descending order, along with their
respective median values
m <- mean(boston$CRIM)
hcrim <- boston %>%
select(CRIM, MEDV) %>%
filter(CRIM >= m) %>%
arrange(desc(CRIM))
view(hcrim)
NOTE: In order to find the highest crime rates, we first find the mean crime rate. Crime
rates greater than or equal to the mean are treated as the highest range, and crime rates
less than the mean as the lowest range.
The mean of the crime rates is stored in “m”. Boston is the dataset we are working on.
Since we have to display two variables, the highest crime rates along with their
respective median values, we choose CRIM and MEDV using select(). To pick crime rates
at or above the mean, we use filter() with the conditional statement CRIM >= m, since
the mean crime rate is stored in m. Finally, we sort the crime rates in descending order
using arrange(desc(CRIM)).
RESULT: To view the result in a separate file, type “view(hcrim)”. It isn’t feasible to
show the full result, so it is convenient to attach the screenshot.
NOTE: We’re working on the boston dataset. We have to perform some statistical
calculations before the transformation. To find the minimum age, use min() and store
the result in “a”.
For the average age, use mean(); the result is stored in “ma”. Now, to find the range
of ages between the two, we create “rage”, choosing the AGE column with select() and
keeping the required observations with filter() and the conditional statement
AGE >= a & AGE <= ma.
RESULT: Now, you can view the full result in a new file using view(rage). It isn’t feasible
to show the full result, so it is easier to attach the screenshot.
P.S: Tried to sort the list using arrange(); however, there seemed to be an issue with the
library packages and the cloud server, which is why it failed to run.
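The steps in this note might be written as in the sketch below, shown on made-up ages. The names a, ma and rage follow the text; the final arrange(AGE) is an assumption about the sorting step mentioned in the P.S.

```r
library(dplyr)

# Made-up stand-in for the real dataset (assumption: a column named AGE).
boston <- data.frame(AGE = c(80, 20, 100, 60, 40))

a  <- min(boston$AGE)    # minimum age
ma <- mean(boston$AGE)   # average age

rage <- boston %>%
  select(AGE) %>%
  filter(AGE >= a & AGE <= ma) %>%  # ages between the minimum and the mean
  arrange(AGE)                      # ascending sort (assumed intent of the P.S.)

print(rage)
```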
HYPOTHESIS:
H0: An increase in the crime rate leads to a decrease in the median value of the house prices
Ha: An increase in the crime rate does not lead to a decrease in the median value of the house prices
library(dplyr)
CODE: The step-by-step process, with interpretations and notes on each step, is
given below
library(data.table) # provides fread()
boston <- fread(file = "housing.csv") #mentioned above
FIND CORRELATION
cor(boston$CRIM, boston$MEDV)
RESULT: -0.2863118
NOTE: There is a negative correlation between the two variables, CRIM and MEDV. We will
be working on those two variables considering the hypothesis we want to test.
Rules of correlation: If cor > 0, positively related (the variables tend to increase together)
If cor < 0, negatively related (one tends to decrease as the other increases)
If cor = 0, there is no linear relationship between the variables
NOTE: As we can see, there seems to be a linear relationship between CRIM and MEDV;
however, the data points are concentrated on one side.
RESULT:
Call:
lm(formula = MEDV ~ CRIM, data = boston)
Coefficients:
(Intercept) CRIM
25.189 -1.011
RESULT: We can clearly see a linear graph here. However, it is an inverse linear relationship,
with a few outliers.
FIND OUT THE MODEL SUMMARY
Use summary() to compute the summary of the linearMod model object. The result is stored
in “modelSummary” and can be displayed with print().
Residuals:
Min 1Q Median 3Q Max
-2.5990 -1.4706 -1.0395 0.3764 9.9371
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.35000 0.32443 10.326 < 2e-16 ***
MEDV -0.08110 0.01281 -6.332 5.88e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NOTE: After the data has been split off for training, we can use lm() to find the intercept
and slope of the regression of MEDV (Y) on CRIM for the training data.
RESULT:
Call:
lm(formula = MEDV ~ CRIM, data = train)
Coefficients:
(Intercept) CRIM
25.320 -0.876
This result is needed when we plot the graph to compare the regression line fitted on the
training data with the line fitted on the original data.
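The split into training and test data is not shown in the text. One common way to do it is sketched below on simulated data; the 80/20 ratio, the seed, the simulated values and the name test are all assumptions, while train, bmodel, CRIM and MEDV follow the output above.

```r
set.seed(1)  # arbitrary seed so the random split is reproducible

# Simulated stand-in for the real dataset: MEDV falls as CRIM rises, plus noise.
n <- 100
boston <- data.frame(CRIM = runif(n, min = 0, max = 10))
boston$MEDV <- 25 - 1 * boston$CRIM + rnorm(n, sd = 3)

# Randomly pick 80% of the row indices for training; the rest are held out.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- boston[train_idx, ]
test  <- boston[-train_idx, ]

# Fit the simple linear regression on the training rows only.
bmodel <- lm(MEDV ~ CRIM, data = train)

print(coef(bmodel))  # intercept and slope for the training data
```

The held-out rows in test can then be used to compare predicted and actual values.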
summary(bmodel)
RESULT:
Call:
lm(formula = MEDV ~ CRIM, data = train)
Residuals:
Min 1Q Median 3Q Max
-17.038 -5.479 -2.389 3.038 32.768
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.3197 0.5505 45.994 < 2e-16 ***
CRIM -0.8760 0.1847 -4.742 3.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
geom_point() +
geom_abline(slope = -0.876, intercept =25.320 , color = "blue") +
geom_abline(slope =-1.011 , intercept = 25.189, color ="red")
NOTE: Blue represents the training data and red represents the original data
RESULT
NOTE: Now, in order to know the difference between the actual values and the predicted
values, it is useful to create a new column, such as predError, using mutate()
view(avspT)
RESULT: The full result can be seen using the view(). It isn’t feasible to display the full
result, so it is appropriate to attach the screenshot
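A minimal sketch of how such an error column could be built, assuming the model is applied to held-out rows with predict(). The tiny train and test data frames are made up so that the numbers are easy to follow; avspT and predError follow the text, and the column name predicted is hypothetical.

```r
library(dplyr)

# Made-up training rows lying exactly on the line MEDV = 25 - CRIM.
train <- data.frame(CRIM = c(0, 1, 2, 3, 4), MEDV = c(25, 24, 23, 22, 21))
# Made-up held-out rows whose actual values deviate from that line.
test  <- data.frame(CRIM = c(5, 6), MEDV = c(21, 18))

bmodel <- lm(MEDV ~ CRIM, data = train)

avspT <- test %>%
  mutate(
    predicted = as.numeric(predict(bmodel, newdata = test)),  # model's estimate
    predError = MEDV - predicted                              # actual minus predicted
  )

print(avspT)
```

A positive predError means the model underestimated the actual value; a negative one means it overestimated.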
FINAL INFERENCE
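Since the p-value of the CRIM coefficient in the lm() summary (3.16e-06) is far below the
0.05 significance level, the slope is significantly different from zero, and its sign is
negative. The data therefore support the statement that an increase in the crime rate goes
with a decrease in the median house value: there is an inverse linear relationship between
CRIM and MEDV.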
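The slope and its p-value can be read from the model summary programmatically rather than by eye. The sketch below does this on simulated data; the seed, the simulated values and the 0.05 threshold are assumptions, and fit is a hypothetical name.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in for the real data: MEDV falls as CRIM rises, plus noise.
CRIM <- runif(100, min = 0, max = 10)
MEDV <- 25 - CRIM + rnorm(100, sd = 3)

fit <- lm(MEDV ~ CRIM)

# summary()$coefficients is a matrix with one row per term.
coefs   <- summary(fit)$coefficients
slope   <- coefs["CRIM", "Estimate"]
p_value <- coefs["CRIM", "Pr(>|t|)"]

# lm() tests each coefficient against the null hypothesis that it equals zero.
significant <- p_value < 0.05

print(c(slope = slope, p_value = p_value))
```

A small p-value here means the hypothesis of a zero slope is rejected; the sign of the slope then gives the direction of the relationship.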