DATA ANALYTICS LABORATORY

WEEK-1: Practicing basic commands in R

A.Lokesh
21071A1268, IT-B
Handling packages, setting path and working directories:
a. .libPaths() :
Aim: It gets/sets the library trees within which packages are looked for.
Syntax: .libPaths()
Example: .libPaths()

> .libPaths()
[1] "C:/Users/nenav/AppData/Local/R/win-library/4.3"
[2] "C:/Program Files/R/R-4.3.1/library"

b. find.package() :
Aim: It returns path to the locations where the given packages are found.
Syntax: find.package(package, lib.loc = NULL, quiet = FALSE,
verbose = getOption("verbose"))
Example: find.package()

> find.package()
[1] "C:/Program Files/R/R-4.3.1/library/stats"
[2] "C:/Program Files/R/R-4.3.1/library/graphics"
[3] "C:/Program Files/R/R-4.3.1/library/grDevices"
[4] "C:/Program Files/R/R-4.3.1/library/utils"
[5] "C:/Program Files/R/R-4.3.1/library/datasets"
[6] "C:/Program Files/R/R-4.3.1/library/methods"
[7] "C:/PROGRA~1/R/R-4.3.1/library/base"

c. installed.packages()
Aim: Find details of all packages installed in the specified libraries.
Syntax: installed.packages(lib.loc = NULL, priority = NULL,
noCache = FALSE, fields = NULL, subarch = .Platform$r_arch, ...)
Example: installed.packages()
> installed.packages()
Package LibPath Version
base "base" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
boot "boot" "C:/Program Files/R/R-4.3.1/library" "1.3-28.1"
class "class" "C:/Program Files/R/R-4.3.1/library" "7.3-22"
cluster "cluster" "C:/Program Files/R/R-4.3.1/library" "2.1.4"
codetools "codetools" "C:/Program Files/R/R-4.3.1/library" "0.2-19"
compiler "compiler" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
datasets "datasets" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
foreign "foreign" "C:/Program Files/R/R-4.3.1/library" "0.8-84"
graphics "graphics" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
grDevices "grDevices" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
grid "grid" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
KernSmooth "KernSmooth" "C:/Program Files/R/R-4.3.1/library" "2.23-21"
lattice "lattice" "C:/Program Files/R/R-4.3.1/library" "0.21-8"
MASS "MASS" "C:/Program Files/R/R-4.3.1/library" "7.3-60"
Matrix "Matrix" "C:/Program Files/R/R-4.3.1/library" "1.5-4.1"
methods "methods" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
mgcv "mgcv" "C:/Program Files/R/R-4.3.1/library" "1.8-42"
nlme "nlme" "C:/Program Files/R/R-4.3.1/library" "3.1-162"
nnet "nnet" "C:/Program Files/R/R-4.3.1/library" "7.3-19"
parallel "parallel" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
rpart "rpart" "C:/Program Files/R/R-4.3.1/library" "4.1.19"
spatial "spatial" "C:/Program Files/R/R-4.3.1/library" "7.3-16"
splines "splines" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
stats "stats" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
stats4 "stats4" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
survival "survival" "C:/Program Files/R/R-4.3.1/library" "3.5-5"
tcltk "tcltk" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
tools "tools" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
translations "translations" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
utils "utils" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
Priority Depends
base "base" NA
boot "recommended" "R (>= 3.0.0), graphics, stats"
class "recommended" "R (>= 3.0.0), stats, utils"
cluster "recommended" "R (>= 3.5.0)"
codetools "recommended" "R (>= 2.1)"
compiler "base" NA
datasets "base" NA
foreign "recommended" "R (>= 4.0.0)"
graphics "base" NA
grDevices "base" NA
grid "base" NA
KernSmooth "recommended" "R (>= 2.5.0), stats"
lattice "recommended" "R (>= 4.0.0)"
MASS "recommended" "R (>= 4.0), grDevices, graphics, stats, utils"
Matrix "recommended" "R (>= 3.5.0), methods"
methods "base" NA
mgcv "recommended" "R (>= 3.6.0), nlme (>= 3.1-64)"
nlme "recommended" "R (>= 3.5.0)"
nnet "recommended" "R (>= 3.0.0), stats, utils"
parallel "base" NA
rpart "recommended" "R (>= 2.15.0), graphics, stats, grDevices"
spatial "recommended" "R (>= 3.0.0), graphics, stats, utils"
splines "base" NA
stats "base" NA
stats4 "base" NA
survival "recommended" "R (>= 3.5.0)"
tcltk "base" NA
tools "base" NA
translations NA NA
utils "base" NA
Imports LinkingTo
base NA NA
boot NA NA
class "MASS" NA
cluster "graphics, grDevices, stats, utils" NA
codetools NA NA
compiler NA NA
datasets NA NA
foreign "methods, utils, stats" NA
graphics "grDevices" NA
grDevices NA NA
grid "grDevices, utils" NA
KernSmooth NA NA
lattice "grid, grDevices, graphics, stats, utils" NA
MASS "methods" NA
Matrix "graphics, grid, lattice, stats, utils" NA
methods "utils, stats" NA
mgcv "methods, stats, graphics, Matrix, splines, utils" NA
nlme "graphics, stats, utils, lattice" NA
nnet NA NA
parallel "tools, compiler" NA
rpart NA NA
spatial NA NA
splines "graphics, stats" NA
stats "utils, grDevices, graphics" NA
stats4 "graphics, methods, stats" NA
survival "graphics, Matrix, methods, splines, stats, utils" NA
tcltk "utils" NA
tools NA NA
translations NA NA
utils NA NA
Suggests
base "methods"
boot "MASS, survival"
class NA
cluster "MASS, Matrix"
codetools NA
compiler NA
datasets NA
foreign NA
graphics NA
grDevices "KernSmooth"
grid NA
KernSmooth "MASS, carData"
lattice "KernSmooth, MASS, latticeExtra, colorspace"
MASS "lattice, nlme, nnet, survival"
Matrix "MASS, expm"
methods "codetools"
mgcv "parallel, survival, MASS"
nlme "Hmisc, MASS, SASmixed"
nnet "MASS"
parallel "methods"
rpart "survival"
spatial "MASS"
splines "Matrix, methods"
stats "MASS, Matrix, SuppDists, methods, stats4"
stats4 NA
survival NA
tcltk NA
tools "codetools, methods, xml2, curl, commonmark, knitr, xfun, mathjaxr, V8"
translations NA
utils "methods, xml2, commonmark, knitr"
Enhances
base NA
boot NA
class NA
cluster NA
codetools NA
compiler NA
datasets NA
foreign NA
graphics NA
grDevices NA
grid NA
KernSmooth NA
lattice "chron"
MASS NA
Matrix "MatrixModels, SparseM, graph, igraph, maptools, sfsmisc, sp, spdep"
methods NA
mgcv NA
nlme NA
nnet NA
parallel "snow, Rmpi"
rpart NA
spatial NA
splines NA
stats NA
stats4 NA
survival NA
tcltk NA
tools NA
translations NA
utils NA
License License_is_FOSS License_restricts_use
base "Part of R 4.3.1" NA NA
boot "Unlimited" NA NA
class "GPL-2 | GPL-3" NA NA
cluster "GPL (>= 2)" NA NA
codetools "GPL" NA NA
compiler "Part of R 4.3.1" NA NA
datasets "Part of R 4.3.1" NA NA
foreign "GPL (>= 2)" NA NA
graphics "Part of R 4.3.1" NA NA
grDevices "Part of R 4.3.1" NA NA
grid "Part of R 4.3.1" NA NA
KernSmooth "Unlimited" NA NA
lattice "GPL (>= 2)" NA NA
MASS "GPL-2 | GPL-3" NA NA
Matrix "GPL (>= 2) | file LICENCE" NA NA
methods "Part of R 4.3.1" NA NA
mgcv "GPL (>= 2)" NA NA
nlme "GPL (>= 2)" NA NA
nnet "GPL-2 | GPL-3" NA NA
parallel "Part of R 4.3.1" NA NA
rpart "GPL-2 | GPL-3" NA NA
spatial "GPL-2 | GPL-3" NA NA
splines "Part of R 4.3.1" NA NA
stats "Part of R 4.3.1" NA NA
stats4 "Part of R 4.3.1" NA NA
survival "LGPL (>= 2)" NA NA
tcltk "Part of R 4.3.1" NA NA
tools "Part of R 4.3.1" NA NA
translations "Part of R 4.3.1" NA NA
utils "Part of R 4.3.1" NA NA
OS_type MD5sum NeedsCompilation Built
base NA NA NA "4.3.1"
boot NA NA "no" "4.3.1"
class NA NA "yes" "4.3.1"
cluster NA NA "yes" "4.3.1"
codetools NA NA "no" "4.3.1"
compiler NA NA NA "4.3.1"
datasets NA NA NA "4.3.1"
foreign NA NA "yes" "4.3.1"
graphics NA NA "yes" "4.3.1"
grDevices NA NA "yes" "4.3.1"
grid NA NA "yes" "4.3.1"
KernSmooth NA NA "yes" "4.3.1"
lattice NA NA "yes" "4.3.1"
MASS NA NA "yes" "4.3.1"
Matrix NA NA "yes" "4.3.1"
methods NA NA "yes" "4.3.1"
mgcv NA NA "yes" "4.3.1"
nlme NA NA "yes" "4.3.1"
nnet NA NA "yes" "4.3.1"
parallel NA NA "yes" "4.3.1"
rpart NA NA "yes" "4.3.1"
spatial NA NA "yes" "4.3.1"
splines NA NA "yes" "4.3.1"
stats NA NA "yes" "4.3.1"
stats4 NA NA NA "4.3.1"
survival NA NA "yes" "4.3.1"
tcltk NA NA "yes" "4.3.1"
tools NA NA "yes" "4.3.1"
translations NA NA NA "4.3.1"
utils NA NA "yes" "4.3.1"
d. install.packages()
Aim: It is used to install R packages from a repository such as CRAN.
Syntax: install.packages(pkgs, lib)
Example: install.packages("readxl")

> install.packages("readxl")
package ‘cli’ successfully unpacked and MD5 sums checked
package ‘glue’ successfully unpacked and MD5 sums checked
package ‘utf8’ successfully unpacked and MD5 sums checked
package ‘rematch’ successfully unpacked and MD5 sums checked
package ‘fansi’ successfully unpacked and MD5 sums checked
package ‘lifecycle’ successfully unpacked and MD5 sums checked
package ‘magrittr’ successfully unpacked and MD5 sums checked
package ‘pillar’ successfully unpacked and MD5 sums checked
package ‘pkgconfig’ successfully unpacked and MD5 sums checked
package ‘rlang’ successfully unpacked and MD5 sums checked
package ‘vctrs’ successfully unpacked and MD5 sums checked
package ‘hms’ successfully unpacked and MD5 sums checked
package ‘prettyunits’ successfully unpacked and MD5 sums checked
package ‘R6’ successfully unpacked and MD5 sums checked
package ‘crayon’ successfully unpacked and MD5 sums checked
package ‘cellranger’ successfully unpacked and MD5 sums checked
package ‘tibble’ successfully unpacked and MD5 sums checked
package ‘cpp11’ successfully unpacked and MD5 sums checked
package ‘progress’ successfully unpacked and MD5 sums checked
package ‘readxl’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in


C:\Users\nenav\AppData\Local\Temp\Rtmp63KyV0\downloaded_packages

e. packageDescription()
Aim: Parses and returns the DESCRIPTION file of a package as a
"packageDescription".
Syntax: packageDescription(pkg, lib.loc = NULL, fields = NULL, drop = TRUE,
encoding = "")
Example: packageDescription("stats")
> packageDescription("stats")
Package: stats
Version: 4.3.1
Priority: base
Title: The R Stats Package
Author: R Core Team and contributors worldwide
Maintainer: R Core Team <[email protected]>
Contact: R-help mailing list <[email protected]>
Description: R statistical functions.
License: Part of R 4.3.1
Imports: utils, grDevices, graphics
Suggests: MASS, Matrix, SuppDists, methods, stats4
NeedsCompilation: yes
Built: R 4.3.1; x86_64-w64-mingw32; 2023-06-16 07:34:01 UTC; windows

-- File: C:/Program Files/R/R-4.3.1/library/stats/Meta/package.rds


f. help()
Aim: help is the primary interface to the help systems.
Syntax: help(topic, package = NULL, lib.loc = NULL, verbose =
getOption("verbose"), try.all.packages = getOption("help.try.all.packages"),
help_type = getOption("help_type"))
Example: help(lapply)
lapply {base} R Documentation

g. library()
Aim: library() and require() load and attach add-on packages.
Syntax: library(package)
Example: library(datasets)

h. dir()
Aim: Returns a character vector of file and/or folder names within a directory.
Syntax: dir(path)
Example: dir("C:/Program Files")
> dir("C:/Program Files")
[1] "Autodesk" "Common Files"
[3] "Dell" "desktop.ini"
[5] "dotnet" "Google"
[7] "Intel" "Internet Explorer"
[9] "MATLAB" "McAfee"
[11] "McAfee.com" "Microsoft Office"
[13] "Microsoft Office 15" "Microsoft OneDrive"
[15] "Microsoft Silverlight" "Microsoft Update Health Tools"
[17] "ModifiableWindowsApps" "National Instruments"
[19] "R" "RStudio"
[21] "Uninstall Information" "Waves"
[23] "Windows Defender" "Windows Mail"
[25] "Windows Media Player" "Windows NT"
[27] "Windows Photo Viewer" "Windows Sidebar"
[29] "WindowsApps" "WindowsPowerShell"

i. setwd()
Aim: setwd() stands for "set working directory". It is used to set the working
directory of the environment.
Syntax: setwd("path")
Example: setwd('C:/')

j. getwd()
Aim: getwd() stands for "get working directory". It is used to get the current working
directory of the environment.
Syntax: getwd()
Example: getwd()
> getwd()
[1] "C:/"
2. Variables in R Programming

I. Create a variable "RectangleHeight" and assign the value 2 to it. Note the use of the
operator "<-" to assign a value to the variable. Likewise, the variable
"RectangleWidth" is defined and assigned the value 4. Compute the area of the
rectangle, store it in "RectangleArea", and print the value of RectangleArea.
> RectangleHeight<-2
> RectangleWidth<-4
> RectangleArea<-RectangleHeight*RectangleWidth
> print(RectangleArea)
[1] 8

II. ls()
Aim: Returns a vector of character strings giving the names of the objects in the
specified environment.
Syntax: ls(name, pos, envir = as.environment(pos), all.names = FALSE, pattern, sorted = TRUE)
Example: ls()

> ls()
[1] "RectangleArea"   "RectangleHeight" "RectangleWidth"

3. Input statements
i. scan()
Aim: Read data into a vector or list from the console or a file.
Syntax: scan("data.txt", what = "character")
Example: scan(text = "1 2 3 4 5")

> scan(text = "1 2 3 4 5")


Read 5 items
[1] 1 2 3 4 5

ii. readline()
Aim: Reads a line from the terminal.
Syntax: readline(prompt = "")
Example: readline("Hi, what's your name: ")
> readline("Hi, what's your name: ")
Hi, what's your name: rakesh
[1] "rakesh"
4. Output Statements
i. print()
Aim: Prints the data written inside the brackets, whether argument or string.
Syntax: print(x, ...)
Example: print("Good Morning")

> print("Good morning")


[1] "Good morning"

ii. cat()
Aim: Outputs the objects, concatenating the representations. cat() performs
much less conversion than print().
Syntax: cat(..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE)
Example: cat(paste(letters, 100*1:26), fill = TRUE, labels = paste0("{", 1:10, "}:"))

5. Few commands to explore datasets


Let the dataset be “PlantGrowth”
I. summary():
Aim: Provide a summary of descriptive statistics for each variable in the dataset.
Syntax: summary(data)
Example: summary(PlantGrowth)

II. str ():

Aim: Display the structure of the dataset, including the data types of variables and
their dimensions.
Syntax: str(data)
Example: str(PlantGrowth)

III. rnorm ():


Aim: used to generate random numbers from a normal distribution
Syntax: rnorm(n, mean = 0, sd = 1)
Example: rnorm(25)

IV. View ():


Aim: View the data in a tabular format within RStudio's data viewer.
Syntax: View(data)
Example: View(PlantGrowth)
V. ncol ():
Aim: Get the number of columns in the dataset.
Syntax: ncol(data)
Example: ncol(PlantGrowth)

VI. nrow ():


Aim: Get the number of rows in the dataset.
Syntax: nrow(data)
Example: nrow(PlantGrowth)
VII. head ():
Aim: Show the first few rows of the dataset.
Syntax: head(data, n)
Example: head(PlantGrowth)

VIII. tail ():


Aim: Show the last few rows of the dataset.
Syntax: tail(data, n)
Example: tail(PlantGrowth)

IX. edit ():


Aim: It invokes a text editor
Syntax: edit(name,file,title,editor,… )
Example: edit(PlantGrowth)
X. fix ():
Aim: invokes edit on x and then assigns the new (edited) version of x in the user's
workspace.
Syntax: fix(x, …)
Example: fix(PlantGrowth)
XI. plot ():
Aim: This function is used to draw points (markers) in a diagram.
Syntax: plot(x, y, ...)
Example: plot(PlantGrowth$group, PlantGrowth$weight)

XII. save.image ():


Aim: It is just a shortcut for "save my current workspace".
Syntax: save.image(file = ".RData", version = NULL, ascii = FALSE, safe = TRUE)
Example: x<-stats::runif(20)
y<-list(a=1,b=TRUE,c='oops')
save(x,y,file="dataAnalytics.RData")
save.image()
DATA ANALYTICS LABORATORY
WEEK-2: Loading and handling data using R

A.Lokesh
21071A1268, IT-B

DATES
i) Print System’s date

Syntax: Sys.Date()
> Sys.Date()
[1] "2023-07-30"
ii) Print System’s time

Syntax: Sys.time()
> Sys.time()
[1] "2023-07-30 19:32:14 IST"

iii) Print the time Zone


Syntax: Sys.timezone()
> Sys.timezone()
[1] "Asia/Calcutta"

iv) Format of date

Specifier Description

%a Abbreviated weekday
%A Full weekday
%b Abbreviated month
%B Full month
%C Century
%y Year without century
%Y Year with century
%d Day of month (01-31)
%j Day in year (001-366)
%m Month of year (01-12)
%D Date in %m/%d/%y format
%u Weekday (1-7), starting on Monday

> date<-Sys.Date()
> format(date,format='%a')
[1] "Sun"
> format(date,format='%A')
[1] "Sunday"
> format(date,format='%b')
[1] "Jul"
> format(date,format='%B')
[1] "July"
> format(date,format='%C')
[1] "20"
> format(date,format='%y')
[1] "23"
> format(date,format='%Y')
[1] "2023"
> format(date,format='%d')
[1] "30"
> format(date,format='%j')
[1] "211"
> format(date,format='%m')
[1] "07"
> format(date,format='%D')
[1] "07/30/23"
> format(date,format='%u')
[1] "7"
FUNCTIONS
i) sum() (with and without null values)

• It finds the sum of the values in the vector.


> print(sum(1:10))
[1] 55

ii) min() (with and without null values)

• It returns the minimum value in the vector.


> print(min(1:10))
[1] 1

iii) max() (with and without null values)

• It returns the maximum value in the vector.


> print(max(1:20))
[1] 20

iv) seq()

• It creates a sequence of elements in a vector.


> print(seq(1:10))
[1] 1 2 3 4 5 6 7 8 9 10
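The headings above mention behaviour with and without null values, but the captured examples use complete vectors only. A minimal sketch (the vector v is illustrative, not from the lab session) of how sum(), min() and max() treat NA:

```r
# A vector containing missing values (NA)
v <- c(1, 2, NA, 4, 5)

sum(v)                # any NA makes the result NA
sum(v, na.rm = TRUE)  # drop NAs first: 1+2+4+5 = 12
min(v, na.rm = TRUE)  # 1
max(v, na.rm = TRUE)  # 5
```

The na.rm = TRUE argument works the same way for mean(), median() and sd().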

MANIPULATING TEXT IN DATA


i) substr()
> num<-"1234567890"
> substr(num,4,5)
[1] "45"
> substr(num,5,6)
[1] "56"

ii) strsplit()
> str<-"i am going to split the sentence"
> strsplit(str," ")
[[1]]
[1] "i"        "am"       "going"
[4] "to"       "split"    "the"
[7] "sentence"

iii) paste()
> paste("i","am","Rakesh")
[1] "i am Rakesh"

iv) grep()

It is used for pattern matching and replacement. grep, grepl, regexpr, gregexpr and regexec search
for matches with argument pattern within each element of a character vector.
> grep("am", c("i","am","going","to","split"))
[1] 2

v) toupper()
> print(toupper(str))
[1] "I AM GOING TO SPLIT THE SENTENCE"

vi) tolower()
> print(tolower(str))
[1] "i am going to split the sentence"

vii) rep()
> rep(1:10,times=2)
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3
[14] 4 5 6 7 8 9 10

MISSING VALUES TREATMENT


i) is.na()

• This function returns a vector that contains only logical values (TRUE or FALSE).
> data=c(1,2,NA,34,12,NA,11)
> print(data)
[1] 1 2 NA 34 12 NA 11
> print(is.na(data))
[1] FALSE FALSE TRUE FALSE FALSE TRUE
[7] FALSE

ii) na.omit

• It simply rules out any rows that contain any missing value and forgets those rows
forever.
> na.omit(data)
[1] 1 2 34 12 11
attr(,"na.action")
[1] 3 6
attr(,"class")
[1] "omit"

iii) na.exclude

• This argument ignores rows having at least one missing value.

> na.exclude(data)
[1] 1 2 34 12 11
attr(,"na.action")
[1] 3 6
attr(,"class")
[1] "exclude"
iv) na.fail

• It terminates the execution if any of the missing values are found.

> na.fail(data)
Error in na.fail.default(data) : missing values in object
v) na.pass

• Take no action.

> na.pass(data)
[1] 1 2 NA 34 12 NA 11

VECTORS
(i) Creation with numbers, string values, logical values
> num<-c(4,1,3,2,7)
> str<-"this is R language"
> bool<-c(TRUE,FALSE)
(ii) Combining the vectors into a list (a vector holds one type; mixed data needs a list)
> vec<-list(str,num,bool)
> print(vec)
[[1]]
[1] "this is R language"

[[2]]
[1] 4 1 3 2 7

[[3]]
[1] TRUE FALSE

(iii) Sequence Vector with 1,2,3 parameters as arguments


Syntax: seq(from, to, by)
> vec1<-seq(1,2,3)
> print(vec1)
[1] 1
(Here from = 1, to = 2, by = 3, so only the value 1 is produced.)

(iv) VECTOR ACCESS:


(i) Adding value to vector
> vec1<-seq(1,10,1)
> print(vec1)
[1] 1 2 3 4 5 6 7 8 9 10
> new_val<-12
> vec1<-c(vec1,new_val)
> print(vec1)
[1] 1 2 3 4 5 6 7 8 9 10 12

(ii) Modify value of a vector


> vec1[3]<-23
> print(vec1)
[1] 1 2 23 4 5 6 7 8 9 10 12
(v) VECTOR NAMES:
(i)Plotting vectors using bar chart
> barplot(vec1,names.arg=seq_along(vec1),col="skyblue",main="Bar Graph
of Vector",xlab="Index",ylab="Values")

(vi) VECTOR MATH:


(i) Applying mathematical functions
> v2<-c(7,8,9)
> v1<-c(1,2,3)
> print(v1+v2)
[1] 8 10 12
> print(v1*v2)
[1] 7 16 27
> print(v2-v1)
[1] 6 6 6
> print(v1/v2)
[1] 0.1428571 0.2500000 0.3333333
> print(2*v1)
[1] 2 4 6
(vii) VECTOR RECYCLING:
(i) Scatterplot of vector
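No code was captured for this item. A minimal sketch of recycling plus a scatterplot of the result (the vectors long and short are assumed illustrations; the plot goes to a null device so it runs without a display):

```r
# Recycling: the shorter vector is repeated to match the longer one
long  <- 1:6
short <- c(10, 20)          # recycled as 10 20 10 20 10 20
total <- long + short       # 11 22 13 24 15 26

# Scatterplot of the vector: index on the x-axis, value on the y-axis
pdf(NULL)                   # null graphics device, no window needed
plot(total, main = "Recycled sum", xlab = "Index", ylab = "Value")
invisible(dev.off())
```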

MATRICES
(i) Creation of matrices
Matrices are created using matrix() function. The matrix() function takes a data vector as
input and reshapes it into a matrix with the specified number of rows and columns.
(ii) Accessing matrix
Access elements of a matrix in R using square brackets `[]`. To access specific elements, you
need to specify the row and column indices.
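The two steps above can be sketched as follows (the matrix m is illustrative):

```r
# Creation: reshape a data vector into 2 rows x 3 columns (filled column-wise)
m <- matrix(1:6, nrow = 2, ncol = 3)

# Access with [row, column]
m[1, 2]   # element in row 1, column 2: 3
m[2, ]    # whole second row: 2 4 6
m[, 3]    # whole third column: 5 6
```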

(iii) Plotting matrix using Contour Plot


A contour plot displays 3-dimensional data in a 2-dimensional format. Creating a contour plot
using the `contour()` function.
(iv) Persp
The persp() function is used to create 3D surface plots. It displays the matrix data as a 3D
plot.

(v) Image
The image() function is used to create an image plot of the matrix data. It displays the values
in the matrix as a heatmap, with colors representing different values.
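A sketch of the three plotting calls on one illustrative grid (the surface z = x^2 + y^2 is an assumed example, not from the lab session); drawing goes to a null device so the code runs headless:

```r
# Grid of z-values: z = x^2 + y^2 over a 20x20 grid
x <- seq(-2, 2, length.out = 20)
y <- seq(-2, 2, length.out = 20)
z <- outer(x, y, function(a, b) a^2 + b^2)

pdf(NULL)                              # null graphics device
contour(x, y, z)                       # 2-D contour lines
persp(x, y, z, theta = 30, phi = 30)   # 3-D surface plot
image(x, y, z)                         # heatmap of the matrix values
invisible(dev.off())
```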
FACTORS
(i) Creating Factor
Factors are used to represent categorical data. Creating a factor using the factor() function.
Factors are used to store data that has specific categories or levels.

(ii) As.integer
The as.integer() function is used to convert factors into their underlying integer
representation. Each unique level in the factor is assigned a unique integer value.

(iii) Levels
The levels() function is used to extract or modify the levels of a factor. It allows to view the
unique categories present in the factor and also assign custom levels if needed.
(iv) Plotting factors
Plotting factors using categorical plots to visualize the distribution of different categories.

(v) Apply legend


The legend() function is used to specify the labels and colors for the legend items.
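The factor operations above can be sketched with illustrative data (the sizes vector and colours are assumptions; plotting uses a null device):

```r
# (i) Creating a factor with explicit levels
sizes <- factor(c("small", "large", "small", "medium"),
                levels = c("small", "medium", "large"))

# (ii) as.integer: each level maps to its position in levels()
as.integer(sizes)   # 1 3 1 2

# (iii) levels: view the unique categories
levels(sizes)       # "small" "medium" "large"

# (iv)/(v) Plot the factor counts and attach a legend
pdf(NULL)
plot(sizes, col = c("skyblue", "orange", "tomato"))
legend("topright", legend = levels(sizes),
       fill = c("skyblue", "orange", "tomato"))
invisible(dev.off())
```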
LIST
(i) Creation

(ii) List with tags and values


(iii) Add or delete elements from list

(iv) Size of list

(v) Recursive list
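No code was captured for items (i)-(v); a minimal sketch covering all five (the student list and its values are illustrative):

```r
# (i) Creation
l <- list("R", c(1, 2, 3), TRUE)

# (ii) List with tags (names) and values
student <- list(name = "Lokesh", roll = 1268, passed = TRUE)
student$name              # access by tag

# (iii) Add or delete elements
student$branch <- "IT"    # add a tagged element
student$passed <- NULL    # delete an element

# (iv) Size of the list
length(student)           # 3 (name, roll, branch)

# (v) Recursive list: a list that contains another list
nested <- list(inner = student, extra = 1:3)
nested$inner$roll         # 1268
```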


AGGREGATING AND GROUP PROCESSING OF A VARIABLE
(i) Aggregation function
Aggregation functions are used to calculate summary statistics on a vector or a data frame.
Some commonly used aggregation functions include sum, mean, median, min, max, sd
(standard deviation), var(variance), etc.

(ii) tapply function


The tapply() function is used to apply a function to subsets of a vector based on one or
more factors. It is particularly useful for grouping data and performing calculations within
each group.
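The description above can be sketched with illustrative data (the scores and groups are assumed values):

```r
# Scores split across two groups
scores <- c(80, 90, 70, 85, 95, 60)
group  <- factor(c("A", "B", "A", "B", "A", "B"))

# Aggregation over the whole vector
mean(scores)                  # 80

# tapply: apply a function within each group
tapply(scores, group, mean)   # A = (80+70+95)/3, B = (90+85+60)/3
```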

SIMPLE ANALYSIS
• Describe Data Structure
(i) names(),

(ii) str(),

(iii) summary(),
(iv) head() and tail()

• Describe Variable Structure
(i) Summary

(ii) Mean

(iii) Sum

(iv) table
(v) hist

(vi) Boxplot
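The items above can be sketched on the built-in PlantGrowth dataset used earlier in this manual (plots drawn to a null device):

```r
data(PlantGrowth)               # built-in dataset

# Describe data structure
names(PlantGrowth)              # "weight" "group"
str(PlantGrowth)                # 30 obs. of 2 variables
summary(PlantGrowth)
head(PlantGrowth, 3)
tail(PlantGrowth, 3)

# Describe variable structure
mean(PlantGrowth$weight)
sum(PlantGrowth$weight)
table(PlantGrowth$group)        # 10 plants per group

pdf(NULL)                       # null device for headless plotting
hist(PlantGrowth$weight)
boxplot(weight ~ group, data = PlantGrowth)
invisible(dev.off())
```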
METHODS FOR READING DATA
(i) Read from CSV file
Aim: Use the read.csv() function, or its alternative read.csv2(), depending on your
locale settings.
Syntax: read.csv("path")
> data<-read.csv("C:\\Users\\rakes\\Downloads\\username.csv")
> data
Username..Identifier.First.name.Last.name
1 booker12;9012;Rachel;Booker
2 grey07;2070;Laura;Grey
3 johnson81;4081;Craig;Johnson
4 jenkins46;9346;Mary;Jenkins
5 smith79;5079;Jamie;Smith

(ii) Read from Spread sheets


Aim: Read data from spreadsheets using the readxl package's read_excel()
function.
Syntax: read_excel(path)
> library(readxl)
> data<-read_excel("C:\\Users\\rakes\\Downloads\\EmployeeSampleData\\Employee
Sample Data.xlsx")
> print(data)
# A tibble: 1,000 × 14
EEID `Full Name` `Job Title` Department `Business Unit` Gender
<chr> <chr> <chr> <chr> <chr> <chr>
1 E02387 Emily Davis Sr. Manger IT Research & Dev… Female
2 E04105 Theodore D… Technical … IT Manufacturing Male
3 E02572 Luna Sande… Director Finance Speciality Pro… Female
4 E02832 Penelope J… Computer S… IT Manufacturing Female
5 E01639 Austin Vo Sr. Analyst Finance Manufacturing Male
6 E00644 Joshua Gup… Account Re… Sales Corporate Male
7 E01550 Ruby Barnes Manager IT Corporate Female
8 E04332 Luke Martin Analyst Finance Manufacturing Male
9 E04533 Easton Bai… Manager Accounting Manufacturing Male
10 E03838 Madeline W… Sr. Analyst Finance Speciality Pro… Female
# ℹ 990 more rows
# ℹ 8 more variables: Ethnicity <chr>, Age <dbl>,
# `Hire Date` <dttm>, `Annual Salary` <dbl>, `Bonus %` <dbl>,
# Country <chr>, City <chr>, `Exit Date` <dttm>
# ℹ Use `print(n = ...)` to see more rows

(iii) Read from Package


Aim: datasets are typically part of the package's functionality and can be accessed
easily.
Syntax: library(“package”)
> library("datasets")
> mtcars
mpg cyl disp hp drat wt qsec vs am gear
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4
carb
Mazda RX4 4
Mazda RX4 Wag 4
Datsun 710 1
Hornet 4 Drive 1
Hornet Sportabout 2
Valiant 1
Duster 360 4
Merc 240D 2
Merc 230 2
Merc 280 4
Merc 280C 4
Merc 450SE 3
Merc 450SL 3
Merc 450SLC 3
Cadillac Fleetwood 4
Lincoln Continental 4
Chrysler Imperial 4
Fiat 128 1
Honda Civic 2
Toyota Corolla 1
Toyota Corona 1
Dodge Challenger 2
AMC Javelin 2
Camaro Z28 4
Pontiac Firebird 2
Fiat X1-9 1
Porsche 914-2 2
Lotus Europa 2
Ford Pantera L 4
Ferrari Dino 6
Maserati Bora 8
Volvo 142E 2

(iv) Read from Webpages


Aim: You can use several libraries that allow you to scrape or extract data from
HTML tables, APIs, or other web elements.
Syntax: library(rvest)
> library(rvest)
> url<-"https://www.kaggle.com/datasets"
> page<-read_html(url)
> page
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; ...
[2] <body data-turbolinks="false">\r\n <main><div id="site-cont ...

(v) Read from JSON


Aim: The jsonlite package is designed to parse and create JSON data in R. It allows
you to convert JSON data into R data structures (lists, data frames, etc.) and vice
versa. It is an essential tool when working with APIs or handling data stored in JSON
format.
Syntax: install.packages("jsonlite")
> library(jsonlite)
> j<-fromJSON("C:\\Users\\rakes\\Downloads\\dwsample1-json.json")
> j
$fruit
[1] "Apple"

$size
[1] "Large"

$color
[1] "Red"

(vi) Reading XML files


Aim: you can use the xml2 package, which provides functions for parsing and
working with XML data
Syntax: read_xml("path")
> library(xml2)
> x<-read_xml("C:\\Users\\rakes\\OneDrive\\Desktop\\sample.xml")
> x
{xml_document}
<catalog>
[1] <book id="bk101">\n <author>Gambardella, Matthew</author>\ ...
[2] <book id="bk102">\n <author>Ralls, Kim</author>\n <title> ...
[3] <book id="bk103">\n <author>Corets, Eva</author>\n <title ...
[4] <book id="bk104">\n <author>Corets, Eva</author>\n <title ...
[5] <book id="bk105">\n <author>Corets, Eva</author>\n <title ...
[6] <book id="bk106">\n <author>Randall, Cynthia</author>\n < ...
[7] <book id="bk107">\n <author>Thurman, Paula</author>\n <ti ...
[8] <book id="bk108">\n <author>Knorr, Stefan</author>\n <tit ...
[9] <book id="bk109">\n <author>Kress, Peter</author>\n <titl ...
[10] <book id="bk110">\n <author>O'Brien, Tim</author>\n <titl ...
[11] <book id="bk111">\n <author>O'Brien, Tim</author>\n <titl ...
[12] <book id="bk112">\n <author>Galos, Mike</author>\n <title ...
DATA ANALYTICS LABORATORY
WEEK-3: Exploring datasets using R

A.Lokesh
21071A1268, IT-B

1. Data frame
(a) Creation
Aim: Data frames store data displayed in a table-like format.
Syntax: data.frame()
Example:
> df<-data.frame(Roll_no=c(1,2,3),
+ name=c("RAkesh","Rajesh","Rakhi"),
+ score=c(1281,1282,1283)
+ )
> df
Roll_no name score
1 1 RAkesh 1281
2 2 Rajesh 1282
3 3 Rakhi 1283

(b) Access
Aim: To access data from data frames for manipulating, using, etc.
Syntax: data_frame$column_name
Example: df$name
> df$name
[1] "RAkesh" "Rajesh" "Rakhi"

(c) String in double brackets


Aim: To access data from data frames using the column name.

Syntax: data_frame[["column_name"]]

Example: df[["name"]]
> df[["name"]]
[1] "RAkesh" "Rajesh" "Rakhi"

(d) Ordering data frames


Aim: To sort the data in data frames in the desired order.
Syntax: order(dataframe$column_name, decreasing)
Example: df[order(df$score, decreasing = FALSE), ]
> df<-df[order(df$score,decreasing=FALSE),]
> df
Roll_no name score
1 1 RAkesh 1281
2 2 Rajesh 1282
3 3 Rakhi 1283

(2) Understanding data in data frame


(a) Load data frame
> data<-read.csv("C:/Users/rakes/OneDrive/Desktop/Boston.csv")
> data
X crim zn indus chas nox rm age dis rad tax ptratio black
1 1 0.00632 18.0 2.31 0 0.5380 6.575 65.2 4.0900 1 296 15.3 396.90
2 2 0.02731 0.0 7.07 0 0.4690 6.421 78.9 4.9671 2 242 17.8 396.90
3 3 0.02729 0.0 7.07 0 0.4690 7.185 61.1 4.9671 2 242 17.8 392.83
4 4 0.03237 0.0 2.18 0 0.4580 6.998 45.8 6.0622 3 222 18.7 394.63
5 5 0.06905 0.0 2.18 0 0.4580 7.147 54.2 6.0622 3 222 18.7 396.90
6 6 0.02985 0.0 2.18 0 0.4580 6.430 58.7 6.0622 3 222 18.7 394.12
7 7 0.08829 12.5 7.87 0 0.5240 6.012 66.6 5.5605 5 311 15.2 395.60
8 8 0.14455 12.5 7.87 0 0.5240 6.172 96.1 5.9505 5 311 15.2 396.90
9 9 0.21124 12.5 7.87 0 0.5240 5.631 100.0 6.0821 5 311 15.2 386.63
10 10 0.17004 12.5 7.87 0 0.5240 6.004 85.9 6.5921 5 311 15.2 386.71
11 11 0.22489 12.5 7.87 0 0.5240 6.377 94.3 6.3467 5 311 15.2 392.52
12 12 0.11747 12.5 7.87 0 0.5240 6.009 82.9 6.2267 5 311 15.2 396.90
13 13 0.09378 12.5 7.87 0 0.5240 5.889 39.0 5.4509 5 311 15.2 390.50
14 14 0.62976 0.0 8.14 0 0.5380 5.949 61.8 4.7075 4 307 21.0 396.90
15 15 0.63796 0.0 8.14 0 0.5380 6.096 84.5 4.4619 4 307 21.0 380.02

(i) Sub setting data frame


Aim: To extract specific rows or columns from a data frame based on certain conditions.
Syntax: subset_df <- subset(data_frame, condition)
> subset<-data[c('crim','zn','indus')]
> subset
crim zn indus
1 0.00632 18.0 2.31
2 0.02731 0.0 7.07
3 0.02729 0.0 7.07
4 0.03237 0.0 2.18
5 0.06905 0.0 2.18
6 0.02985 0.0 2.18
7 0.08829 12.5 7.87
8 0.14455 12.5 7.87
9 0.21124 12.5 7.87
10 0.17004 12.5 7.87
11 0.22489 12.5 7.87
12 0.11747 12.5 7.87
13 0.09378 12.5 7.87
14 0.62976 0.0 8.14
15 0.63796 0.0 8.14

(ii) Read from tab separated file


Aim: To read data from a tab-separated file and store it in a data frame.
Syntax: data_frame <- read.table("file_path/file_name.tsv", header = TRUE,
sep = "\t")

(iii) Reading from table


Aim: To convert data from a tabular form into a data frame.
Syntax: data_frame <- as.data.frame(matrix_data)
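A minimal sketch of the conversion (the matrix and its column names are illustrative):

```r
# Matrix with named columns, converted to a data frame
m <- matrix(1:6, nrow = 3, ncol = 2,
            dimnames = list(NULL, c("a", "b")))
df_from_m <- as.data.frame(m)
df_from_m$a   # 1 2 3
```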

(iv) Merging data frames


Aim: To combine two or more data frames based on common columns.
Syntax: merged_df <- merge(data_frame1, data_frame2, by =
"common_column", all = TRUE/FALSE)

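A minimal sketch of merging (the two data frames and their values are illustrative, modelled on the Roll_no/name/score frame created earlier):

```r
# Two data frames sharing the Roll_no column
marks <- data.frame(Roll_no = c(1, 2, 3), score = c(81, 92, 73))
names_df <- data.frame(Roll_no = c(2, 3, 4),
                       name = c("Rajesh", "Rakhi", "Ravi"))

# Inner join: only Roll_no values present in both (2 and 3)
inner <- merge(marks, names_df, by = "Roll_no")

# Full outer join: keep unmatched rows too, filled with NA
full <- merge(marks, names_df, by = "Roll_no", all = TRUE)
```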
(3) Data summary and ddply


The aim of the "summary" function is to provide a quick overview of the data's central
tendencies, distribution, and missing values.
Syntax: summary(object, ...)

#Summarisation

#ddply
Aim: The "ddply" function is used to split a data frame into subsets based on one or
more variables, apply a specified function to each subset, and then combine the results
into a new data frame.
Syntax: ddply(data, .variables, .fun)
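The split-apply-combine idea behind ddply can be sketched in plain Python (hypothetical data; the real function returns a data frame rather than a dictionary):

```python
# Split rows by a grouping variable, apply a summary function to each
# subset, then combine the per-group results -- the ddply pattern.
from collections import defaultdict

def split_apply_combine(rows, by, fun):
    groups = defaultdict(list)
    for row in rows:            # split
        groups[row[by]].append(row)
    return {key: fun(subset)    # apply + combine
            for key, subset in groups.items()}

emp = [{"dept": 10, "sal": 1000},
       {"dept": 10, "sal": 1500},
       {"dept": 20, "sal": 2000}]
totals = split_apply_combine(emp, "dept", lambda rs: sum(r["sal"] for r in rs))
print(totals)  # -> {10: 2500, 20: 2000}
```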

(4) Invalid values and outliers


The aim of handling invalid values is to ensure data quality by identifying and dealing
with data entries that do not conform to the expected or permissible range of values for
a particular variable. By addressing invalid values, we can improve the accuracy and
reliability of the data analysis.

Outliers
The aim of handling outliers is to identify and manage data points that significantly
deviate from the rest of the data. Outliers can arise due to natural variations in the data
or measurement errors and can have a substantial impact on statistical analysis,
visualization, and modelling. By addressing outliers, we can prevent them from unduly
influencing the results of the analysis.
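One common (though not the only) detection rule flags points lying more than 1.5 interquartile ranges outside the quartiles; a minimal sketch, assuming that convention:

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
import statistics

def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # -> [95]
```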

(5) Descriptive statistics


(a) Frequency
The aim of frequency analysis is to determine the count or occurrence of each unique
value in a dataset, often presented in the form of a frequency table.
Syntax: table(x)

(b) Mean
The aim of calculating the mean is to find the arithmetic average of a set of numeric
values.
Syntax: mean(x)
(c) Mode
The aim of calculating the mode is to find the value that appears most frequently in a
dataset.
Syntax: table_x <- table(x); mode_x <- as.numeric(names(table_x[table_x == max(table_x)]))
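The R idiom above (tabulate, then keep the value with the largest count) mirrors this plain-Python sketch; note that a dataset can have more than one mode:

```python
# Tabulate counts, then keep every value whose count equals the maximum.
from collections import Counter

def modes(values):
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 3, 3]))  # -> [3]
print(modes([1, 1, 2, 2]))        # -> [1, 2] (bimodal)
```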

(d) Median
The aim of calculating the median is to find the middle value of a dataset when it is ordered.
Syntax: median(x)

(e) Standard deviation


The aim of calculating the standard deviation is to measure the spread or dispersion of
data points from the mean.
Syntax: sd(x)
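For a quick cross-check of these summaries, Python's statistics module uses the same definitions (statistics.stdev is the sample standard deviation, matching R's sd()):

```python
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mean(x))    # -> 5 (arithmetic average)
print(statistics.median(x))  # -> 4.5 (middle of the ordered values)
print(statistics.stdev(x))   # sample standard deviation, about 2.138
```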

(f) Removing NA
The aim is to remove missing or NA (Not Available) values from the dataset in order to
perform calculations or analyses on complete data.
Syntax: na.omit(x)

(g) Plotting
The aim of plotting is to visually represent data for better understanding and analysis.

Syntax: plot(), hist(), barplot(), etc.


(h) Abline
The aim of adding an abline is to include a straight line in a plot to represent a linear
relationship or to visualize a specific reference line.
Syntax: abline(a = NULL, b = NULL, h = NULL, v = NULL, ...)

(6) Spotting problems in data with visualization


(a) Histograms
A histogram visualizes the distribution of a continuous variable by dividing it into
intervals (bins) and showing the frequency or count of data points falling into each bin.
Syntax: hist(x, breaks = ..., main = ..., xlab = ..., ylab = ...)
(b) Barplot
A barplot displays the distribution of a categorical variable or compares the values
of different categories.
Syntax: barplot(height, names.arg = ..., main = ..., xlab = ..., ylab = ...)

(c) densityplot
A density plot visualizes the probability density function of a continuous variable,
providing insights into the data's underlying distribution.
Syntax: densityplot(x, main = ..., xlab = ..., ylab = ...)
DATA ANALYTICS LABORATORY
WEEK-4: HDFS (Storage) commands

1) Version

The Hadoop fs shell command version prints the Hadoop version.

Syntax: hadoop version

Output:

2) ls command with options:


The ls (list) command is used to display the files and directories in HDFS. It shows the
list of files and directories with permissions, user, group, size, and other details.

Options:

-C Display the paths of files and directories only.
-d Directories are listed as plain files.
-h Format the sizes of files in a human-readable fashion rather than as a number of bytes.
-q Print ? instead of non-printable characters.
-R Recursively list the contents of directories.
-t Sort files by modification time (most recent first).
-S Sort files by size.
-r Reverse the order of the sort.
-u Use the time of last access instead of modification for display and sorting.
-e Display the erasure coding policy of files and directories.

Syntax: hdfs dfs -ls [options] /path

Output:

3) get:
The Hadoop fs shell command get copies the file or directory from the Hadoop file system to the
local file system.

Syntax: hdfs dfs -get <src> <localdest>

Output:

4) copyToLocal:
copyToLocal command copies the file from HDFS to the local file system.

Syntax: hdfs dfs -copyToLocal <path_to_file_in_hdfs> <local_file_path>

Output:
5) cat

The cat command reads the file in HDFS and displays the content of the file on console or stdout.

Syntax: hdfs dfs -cat /path_to_file_in_hdfs

Output:

6) put:
The Hadoop fs shell command put is similar to copyFromLocal: it copies files or directories
from the local filesystem to the destination in the Hadoop filesystem.

Syntax: hdfs dfs -put <localsrc> <dest>


Output:

7) copyFromLocal

This command copies the file from the local file system to HDFS.

Syntax: hadoop fs -copyFromLocal <localsrc> <hdfs destination>

Output:

8) fsck
The fsck Hadoop command is used to check the health of the HDFS.
Syntax: hadoop fsck <path> [ -move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
Options Description
<path> start checking from the path specified here
-move It moves a corrupted file to the lost+found directory.
-delete It deletes the corrupted files present in HDFS.
-openforwrite It prints the files which are opened for write
-files It prints the files being checked.
-blocks It prints out all the blocks of the file while checking.
-locations It prints the location of all the blocks of files while checking.
-racks It displays the network topology for DataNode locations.
Output:

9) mkdir
This command creates the directory in HDFS if it does not already exist.

Syntax: hdfs dfs -mkdir /path/directory_name

Output:
10) cp
The cp command copies a file from one directory to another directory within the HDFS.

Syntax: hdfs dfs -cp <src> <dest>

Output:

11) touchz
The touchz command creates a file in HDFS with a file size of 0 bytes. Here, directory is the
name of the directory where we will create the file, and filename is the name of the new file
we are going to create.

Syntax: hdfs dfs -touchz /directory/filename

Output:

12) du
This Hadoop fs shell command du prints a summary of the amount of disk usage of all
files/directories in the path.

Syntax: hadoop fs -du -s /directory/filename

Output:

13) count
The Hadoop fs shell command count counts the number of files, directories, and bytes under the
paths that match the specified file pattern.

Syntax: hadoop fs -count [options] <path>

Options:
-q – shows quotas (a quota is the hard limit on the number of names and the amount of space
used for individual directories)

-u – it limits output to show quotas and usage only

-h – shows sizes in a human-readable format

-v – shows header line

Output:

14) rm command with options


The rm command removes the file present in the specified path.

Syntax: hadoop fs -rm <path>

Options:
-r : Recursively remove directories and files.
-skipTrash : Bypass the trash and immediately delete the file.
-f : Do not report an error or modify the exit status if the file does not exist.
-R : Same as -r; recursively delete directories.

Output:

15) mv
The HDFS mv command moves files or directories from the source to a destination within HDFS.

Syntax: hadoop fs -mv <src> <dest>

Output:
16) help
The Hadoop fs shell command help shows help for all the commands or the specified command.

Syntax: hadoop fs -help [command]

Output:

17) usage
The Hadoop fs shell command usage returns the help for an individual command.

Syntax: hadoop fs -usage <command>

Output:

18) df
The Hadoop fs shell command df shows the capacity, size, and free space available on the HDFS
file system. The -h option formats the file size in a human-readable format.

Syntax: hdfs dfs -df [-h] <path>

Output:

19) chmod
The Hadoop fs shell command chmod changes the permissions of a file. The -R option recursively
changes file permissions through the directory structure. The user must be the owner of the
file or a superuser.

Syntax: hdfs dfs -chmod [-R] <mode> <path>

Output:
20) tail
The Hadoop fs shell tail command shows the last 1KB of a file on the console or stdout. The -f
option shows the appended data as the file grows.

Syntax: hdfs dfs -tail [-f] <file>

Output:

21) expunge
HDFS expunge command makes the trash empty.

Syntax: hdfs dfs -expunge

Output:

22) appendToFile
This command appends the contents of all the given local files to the provided destination file
on the HDFS filesystem. The destination file will be created if it does not already exist.
Syntax: hadoop fs -appendToFile <localsrc> <dest>

Output:

23) chown
The Hadoop fs shell command chown changes the owner of the file. The -R option recursively
changes ownership through the directory structure. The user must be the owner of the file or
a superuser.

Syntax: hdfs dfs -chown [-R] [owner] [:[group]] <path>

Output:
DATA ANALYTICS LABORATORY
WEEK-5: Map Reduce Programming

Map Reduce Programming-Max Temperature

Step-1:

open eclipse editor and create a new project


Step-2:
Add Jar files from hadoop and hadoop client. After that, click on Finish.
Step-3:
Right click on src from project create class files

Step-4:
Type source code in src files and save
maxtemperature.java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
public class maxtemperature extends Mapper<LongWritable, Text, Text, IntWritable > {

public void map(LongWritable key, Text value, Context context)


throws IOException, InterruptedException {
String line=value.toString();
String year=line.substring(15,19);
int airtemp;
if(line.charAt(87)== '+')
{
airtemp=Integer.parseInt(line.substring(88,92));
}
else
airtemp=Integer.parseInt(line.substring(87,92));
String q=line.substring(92,93);
if(airtemp!=9999&&q.matches("[01459]"))
{
context.write(new Text(year),new IntWritable(airtemp));

}
}
}

maxtempreduce.java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class maxtempreduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int maxvalue=Integer.MIN_VALUE;
for (IntWritable value : values) {
maxvalue=Math.max(maxvalue, value.get());
}
context.write(key, new IntWritable(maxvalue));
}

}
maxtempdriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class maxtempdriver {

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "maxtemperature");
job.setJarByClass(maxtempdriver.class);
// TODO: specify a mapper
job.setMapperClass(maxtemperature.class);
// TODO: specify a reducer
job.setReducerClass(maxtempreduce.class);

// TODO: specify output types


job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// TODO: specify input and output DIRECTORIES (not files)


FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

if (!job.waitForCompletion(true))
return;
}

}
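The mapper/reducer pair above can be traced in plain Python on a synthetic fixed-width record (the column positions and quality codes follow the layout the Java code assumes; temperatures are in tenths of a degree, and the record below is invented for illustration):

```python
# map: extract (year, temperature) from a fixed-width record, dropping
# missing readings (9999) and bad quality codes.
def extract(line):
    year = line[15:19]
    temp = int(line[88:92]) if line[87] == "+" else int(line[87:92])
    if temp != 9999 and line[92] in "01459":
        yield year, temp

# reduce: keep the maximum temperature seen for each year.
def max_by_year(pairs):
    best = {}
    for year, temp in pairs:
        best[year] = max(best.get(year, temp), temp)
    return best

record = " " * 15 + "1950" + " " * 68 + "+0123" + "1"  # synthetic line
print(max_by_year(extract(record)))  # -> {'1950': 123}, i.e. 12.3 degrees
```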
Step-5:

Export project file to desktop or any folder that contain input file
Step-6:
Run commands in terminal
1. create input directory

2. move the input file to input directory


3. check whether the input data is moved or not by printing the data.

4. Run the Jar file which contain src code

Step-7:
print output from output directory
DATA ANALYTICS LABORATORY
WEEK-6: Map Reduce Programming
Map Reduce Programming- Word Count
Step-1:
open eclipse editor and create a new project

Step-2:

Add Jar files from hadoop and hadoop client. After that, click on Finish.
Step-3:
Right click on src from project create class files

Step-4:
Type source code in src files and save
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

public static class TokenizerMapper


extends Mapper<Object, Text, Text, IntWritable>{

private final static IntWritable one = new IntWritable(1);


private Text word = new Text();

public void map(Object key, Text value, Context context


) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}

public static class IntSumReducer


extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
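The job's three phases can be traced in plain Python (a sketch of the same logic, not Hadoop code):

```python
# map emits (word, 1) per token; shuffle groups pairs by key; reduce sums
# each group -- what the TokenizerMapper/IntSumReducer pair above does.
from collections import defaultdict

def word_count(lines):
    pairs = [(w, 1) for line in lines for w in line.split()]    # map
    grouped = defaultdict(list)
    for word, one in pairs:                                     # shuffle
        grouped[word].append(one)
    return {word: sum(ones) for word, ones in grouped.items()}  # reduce

print(word_count(["hello world", "hello hadoop"]))
# -> {'hello': 2, 'world': 1, 'hadoop': 1}
```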
Step-5:

Export project file to desktop or any folder that contain input file
Step-6:
Run commands in terminal
1. create input directory

2. move the input file to input directory

3. check whether the input data is moved or not by printing the data.
4. Run the Jar file which contain src code

Step-7:
print output from output directory
DATA ANALYTICS LABORATORY
WEEK-7: Data Processing Tool – Hive
Data Processing Tool-Hive

Q1) How to enter the HIVE Shell?

Q2) Create a database?

Q3)How to create a Managed Table in Hive?

Q4)How to load the data from Local to Hive Table?

Q5)How to check where Managed Table is created in hive?

Q6)check the schema of the created table emp?


Q7)How to see all tables present in the database?

Q8)Select all the enames from emp table?

Q9)Get the records where name is ‘A’?

Q10)Count the total number of records in the created table?

Q11)Group the sum of salaries as per deptno?


Q12)Get the salary of people between 1000 and 2000?

Q13)select the name of employees where job has exactly 5 characters?

Q14)List the employee names where job has l as the second character?

Q15) Retrieve the total salary for each department?

Q16) Add a column to the table?


Q17) How to Rename a table?

Q18) How to drop table?

Q1) Create a database called movies?

Q2) Work with database movies?

Q3) create a table movies_details inside movies database?

Q4) Load the data set of movies from local to hive table?

Q5) Check the table created inside database.

Q6) Retrieve all the records in movies_details?


hive> select * from movies_details;
Q7)Print all movies between year 1920 and 1990
hive> select * from movie_details where year between 1920 and 1990;
Q8)Select all records where the movie name starts with the letter c or C
hive> select * from movie_details where name LIKE 'C%' or name LIKE 'c%';
Q9)select all records where movie name starts with The
hive> select * from movie_details where name LIKE 'The%';
Q10)What is the maximum rating of the movie?
hive> select max(rating) from movie_details;

Q11)count the number of records


hive> select count(*) from movie_details;
Q12)select rating of the movie School Ties

Q13) List all the years with total number of views in each year
( hint group by year), restrict the records to 5
hive> select year, sum(views) from movie_details group by year LIMIT 5;
PARTITIONING AND BUCKETING
Q1) create a database shopping?

Q2)create table (shopping1) inside the database shopping?

Q3) Load the data in HIVE table from local?

Q4) create a partition (shopping3) for table shopping1 and also create 3 buckets inside each partition?
Q5)Populate the partition with data?

Q6) Check your partition?

Q7) Check out the buckets for the partition “utensils”?


DATA ANALYTICS LABORATORY
WEEK-8: Data Processing Tool – Hive

WORD COUNT IN HIVE

AIM: To perform word count on a text file using Hive Query Language

Objective:To perform word count on a text file using functions like split and explode

Step 1: Creating a database hive_wordcount_table.

Step 2:
Use created database and create table hive_count_tb.

Step 3: Make a .txt file on local machine consisting of few sentences, in my case I
made hive_count.txt, then load that data to table called hive_count_tb.

Step 4: Check whether the table contains the data by Show tables command and Select * from
hive_count_tb.
Step 5: The data we have is in sentences; first we have to convert it into words,
applying space as the delimiter using the split function.

Step 6: Explode is used to expand an array in a single row across multiple rows, one for
each value in the array.

Step 7: Implement the word count query using the functions above together with GROUP BY.
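In Python terms, the split and explode steps behave like this (hypothetical sentences; the final GROUP BY corresponds to the Counter):

```python
# split(sentence, ' ')  -> one array of words per row
# explode(array)        -> one row per word
# GROUP BY word         -> count of each word
from collections import Counter

sentences = ["the quick fox", "the lazy dog"]
split_rows = [s.split(" ") for s in sentences]
exploded = [word for row in split_rows for word in row]
counts = Counter(exploded)
print(counts["the"])  # -> 2
```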


DATA ANALYTICS LABORATORY
WEEK-9: Data Processing Tool – Pig

Pig Commands
Q1: How to enter in grunt shell?

Q2: Create two data sets using gedit command in local?

Q3: Copy the above files in HDFS?

Q4: How to read your (pigfile.txt and pigfile1.txt) data in PIG?


Q5: Specify the schema for above two tables?
Q6: Check the schema of the two tables?

Q7: Combine the two tables

Q8: Split the data set c into two different relations, e.g. d and e: one data set where $0
has the value 1, and another data set where $0 has the value 4?

Q9: Do filtering on data set c where $1 is greater than 3?

Q10: Group data set c by $2?


Q11: Multiply columns 1 and 2 and keep the result in column 2?

Q12: Select columns 1 and 2 from data set a?

Q13: Store the above result in HDFS?

Q14: check the file written in HDFS?


DATA ANALYTICS LABORATORY
WEEK-10: Data Processing Tool – Pig
WORD COUNT IN PIG
Aim: To perform word count on a text file using pig latin commands
Objective: To perform word count on a text file using functions like tokenize and flatten
STEP 1: Create a file in local on which you want to perform word count.
STEP 2: Copy the file in HDFS

STEP 3: Load the data in pig

STEP 4: Break each line into words using the TOKENIZE function, and then flatten the resulting
bag of words into individual tuples using the FLATTEN function
STEP 5: Now group the collection of words based on word

STEP 6: Determine the count of each word


STEP 7: Arrange the words in desc order

STEP 8: Store the above result in HDFS


DATA ANALYTICS LABORATORY
WEEK-11: Exploring text mining algorithms

Exploring Text mining algorithms

Storing and managing the group of documents

Current files in the folder

Representing the documents into corpus


install.packages("tm")
TermDocumentMatrix():

Access the document IDs and terms

Finding the number of terms in the document


#preprocessing
After preprocessing
Relationship between terms
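What TermDocumentMatrix() produces can be sketched by hand: rows are terms, columns are documents, and each cell holds a term's frequency in that document (the two-document corpus below is hypothetical):

```python
from collections import Counter

docs = {"doc1.txt": "data mining and text mining",
        "doc2.txt": "text analytics"}
freqs = {name: Counter(text.split()) for name, text in docs.items()}
terms = sorted({t for c in freqs.values() for t in c})    # the Terms dimension
matrix = {t: [freqs[d][t] for d in docs] for t in terms}  # one row per term
print(matrix["mining"])  # -> [2, 0]: twice in doc1, absent from doc2
print(len(terms))        # number of distinct terms in the corpus
```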
DATA ANALYTICS LABORATORY
WEEK-12: Exploring text mining algorithms

Exploring Text mining algorithms

Storing and managing the group of documents

Current files in the folder

Representing the documents into corpus


install.packages("tm")
TermDocumentMatrix():

Access the document IDs and terms

Finding the number of terms in the document


#preprocessing
After preprocessing
Relationship between terms
