A.Lokesh
21071a1268
IT-B
Handling packages, setting path and working directories:
a. .libPaths() :
Aim: It gets/sets the library trees within which packages are looked for.
Syntax: .libPaths()
Example: .libPaths()
> .libPaths()
[1] "C:/Users/nenav/AppData/Local/R/win-library/4.3"
[2] "C:/Program Files/R/R-4.3.1/library"
b. find.package() :
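Aim: It returns the paths to the locations where the given packages are found.
Syntax: find.package(package = NULL, lib.loc = NULL, quiet = FALSE,
verbose = getOption("verbose"))
Example: find.package("base") returns the path of one package; called with no argument, as below, it lists the paths of the attached packages.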
> find.package()
[1] "C:/Program Files/R/R-4.3.1/library/stats"
[2] "C:/Program Files/R/R-4.3.1/library/graphics"
[3] "C:/Program Files/R/R-4.3.1/library/grDevices"
[4] "C:/Program Files/R/R-4.3.1/library/utils"
[5] "C:/Program Files/R/R-4.3.1/library/datasets"
[6] "C:/Program Files/R/R-4.3.1/library/methods"
[7] "C:/PROGRA~1/R/R-4.3.1/library/base"
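As a small sketch of how the two functions relate: the path returned by find.package() normally sits inside one of the library trees reported by .libPaths(). The variable names lib_trees and stats_path are illustrative, and the paths printed depend on the machine.

```r
# Library trees that R searches for packages (machine-specific paths)
lib_trees <- .libPaths()
print(lib_trees)

# Path of a single package; "stats" ships with every R installation
stats_path <- find.package("stats")
print(stats_path)

# The last component of the returned path is the package name itself
basename(stats_path)
```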
c. installed.packages()
Aim: Find details of all packages installed in the specified libraries.
Syntax: installed.packages(lib.loc = NULL, priority = NULL,
noCache = FALSE, fields = NULL, subarch = .Platform$r_arch, ...)
Example: installed.packages()
> installed.packages()
Package LibPath Version
base "base" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
boot "boot" "C:/Program Files/R/R-4.3.1/library" "1.3-28.1"
class "class" "C:/Program Files/R/R-4.3.1/library" "7.3-22"
cluster "cluster" "C:/Program Files/R/R-4.3.1/library" "2.1.4"
codetools "codetools" "C:/Program Files/R/R-4.3.1/library" "0.2-19"
compiler "compiler" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
datasets "datasets" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
foreign "foreign" "C:/Program Files/R/R-4.3.1/library" "0.8-84"
graphics "graphics" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
grDevices "grDevices" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
grid "grid" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
KernSmooth "KernSmooth" "C:/Program Files/R/R-4.3.1/library" "2.23-21"
lattice "lattice" "C:/Program Files/R/R-4.3.1/library" "0.21-8"
MASS "MASS" "C:/Program Files/R/R-4.3.1/library" "7.3-60"
Matrix "Matrix" "C:/Program Files/R/R-4.3.1/library" "1.5-4.1"
methods "methods" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
mgcv "mgcv" "C:/Program Files/R/R-4.3.1/library" "1.8-42"
nlme "nlme" "C:/Program Files/R/R-4.3.1/library" "3.1-162"
nnet "nnet" "C:/Program Files/R/R-4.3.1/library" "7.3-19"
parallel "parallel" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
rpart "rpart" "C:/Program Files/R/R-4.3.1/library" "4.1.19"
spatial "spatial" "C:/Program Files/R/R-4.3.1/library" "7.3-16"
splines "splines" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
stats "stats" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
stats4 "stats4" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
survival "survival" "C:/Program Files/R/R-4.3.1/library" "3.5-5"
tcltk "tcltk" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
tools "tools" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
translations "translations" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
utils "utils" "C:/Program Files/R/R-4.3.1/library" "4.3.1"
Priority Depends
base "base" NA
boot "recommended" "R (>= 3.0.0), graphics, stats"
class "recommended" "R (>= 3.0.0), stats, utils"
cluster "recommended" "R (>= 3.5.0)"
codetools "recommended" "R (>= 2.1)"
compiler "base" NA
datasets "base" NA
foreign "recommended" "R (>= 4.0.0)"
graphics "base" NA
grDevices "base" NA
grid "base" NA
KernSmooth "recommended" "R (>= 2.5.0), stats"
lattice "recommended" "R (>= 4.0.0)"
MASS "recommended" "R (>= 4.0), grDevices, graphics, stats, utils"
Matrix "recommended" "R (>= 3.5.0), methods"
methods "base" NA
mgcv "recommended" "R (>= 3.6.0), nlme (>= 3.1-64)"
nlme "recommended" "R (>= 3.5.0)"
nnet "recommended" "R (>= 3.0.0), stats, utils"
parallel "base" NA
rpart "recommended" "R (>= 2.15.0), graphics, stats, grDevices"
spatial "recommended" "R (>= 3.0.0), graphics, stats, utils"
splines "base" NA
stats "base" NA
stats4 "base" NA
survival "recommended" "R (>= 3.5.0)"
tcltk "base" NA
tools "base" NA
translations NA NA
utils "base" NA
Imports LinkingTo
base NA NA
boot NA NA
class "MASS" NA
cluster "graphics, grDevices, stats, utils" NA
codetools NA NA
compiler NA NA
datasets NA NA
foreign "methods, utils, stats" NA
graphics "grDevices" NA
grDevices NA NA
grid "grDevices, utils" NA
KernSmooth NA NA
lattice "grid, grDevices, graphics, stats, utils" NA
MASS "methods" NA
Matrix "graphics, grid, lattice, stats, utils" NA
methods "utils, stats" NA
mgcv "methods, stats, graphics, Matrix, splines, utils" NA
nlme "graphics, stats, utils, lattice" NA
nnet NA NA
parallel "tools, compiler" NA
rpart NA NA
spatial NA NA
splines "graphics, stats" NA
stats "utils, grDevices, graphics" NA
stats4 "graphics, methods, stats" NA
survival "graphics, Matrix, methods, splines, stats, utils" NA
tcltk "utils" NA
tools NA NA
translations NA NA
utils NA NA
Suggests
base "methods"
boot "MASS, survival"
class NA
cluster "MASS, Matrix"
codetools NA
compiler NA
datasets NA
foreign NA
graphics NA
grDevices "KernSmooth"
grid NA
KernSmooth "MASS, carData"
lattice "KernSmooth, MASS, latticeExtra, colorspace"
MASS "lattice, nlme, nnet, survival"
Matrix "MASS, expm"
methods "codetools"
mgcv "parallel, survival, MASS"
nlme "Hmisc, MASS, SASmixed"
nnet "MASS"
parallel "methods"
rpart "survival"
spatial "MASS"
splines "Matrix, methods"
stats "MASS, Matrix, SuppDists, methods, stats4"
stats4 NA
survival NA
tcltk NA
tools "codetools, methods, xml2, curl, commonmark, knitr, xfun,\nmathjaxr, V8"
translations NA
utils "methods, xml2, commonmark, knitr"
Enhances
base NA
boot NA
class NA
cluster NA
codetools NA
compiler NA
datasets NA
foreign NA
graphics NA
grDevices NA
grid NA
KernSmooth NA
lattice "chron"
MASS NA
Matrix "MatrixModels, SparseM, graph, igraph, maptools, sfsmisc, sp,\nspdep"
methods NA
mgcv NA
nlme NA
nnet NA
parallel "snow, Rmpi"
rpart NA
spatial NA
splines NA
stats NA
stats4 NA
survival NA
tcltk NA
tools NA
translations NA
utils NA
License License_is_FOSS License_restricts_use
base "Part of R 4.3.1" NA NA
boot "Unlimited" NA NA
class "GPL-2 | GPL-3" NA NA
cluster "GPL (>= 2)" NA NA
codetools "GPL" NA NA
compiler "Part of R 4.3.1" NA NA
datasets "Part of R 4.3.1" NA NA
foreign "GPL (>= 2)" NA NA
graphics "Part of R 4.3.1" NA NA
grDevices "Part of R 4.3.1" NA NA
grid "Part of R 4.3.1" NA NA
KernSmooth "Unlimited" NA NA
lattice "GPL (>= 2)" NA NA
MASS "GPL-2 | GPL-3" NA NA
Matrix "GPL (>= 2) | file LICENCE" NA NA
methods "Part of R 4.3.1" NA NA
mgcv "GPL (>= 2)" NA NA
nlme "GPL (>= 2)" NA NA
nnet "GPL-2 | GPL-3" NA NA
parallel "Part of R 4.3.1" NA NA
rpart "GPL-2 | GPL-3" NA NA
spatial "GPL-2 | GPL-3" NA NA
splines "Part of R 4.3.1" NA NA
stats "Part of R 4.3.1" NA NA
stats4 "Part of R 4.3.1" NA NA
survival "LGPL (>= 2)" NA NA
tcltk "Part of R 4.3.1" NA NA
tools "Part of R 4.3.1" NA NA
translations "Part of R 4.3.1" NA NA
utils "Part of R 4.3.1" NA NA
OS_type MD5sum NeedsCompilation Built
base NA NA NA "4.3.1"
boot NA NA "no" "4.3.1"
class NA NA "yes" "4.3.1"
cluster NA NA "yes" "4.3.1"
codetools NA NA "no" "4.3.1"
compiler NA NA NA "4.3.1"
datasets NA NA NA "4.3.1"
foreign NA NA "yes" "4.3.1"
graphics NA NA "yes" "4.3.1"
grDevices NA NA "yes" "4.3.1"
grid NA NA "yes" "4.3.1"
KernSmooth NA NA "yes" "4.3.1"
lattice NA NA "yes" "4.3.1"
MASS NA NA "yes" "4.3.1"
Matrix NA NA "yes" "4.3.1"
methods NA NA "yes" "4.3.1"
mgcv NA NA "yes" "4.3.1"
nlme NA NA "yes" "4.3.1"
nnet NA NA "yes" "4.3.1"
parallel NA NA "yes" "4.3.1"
rpart NA NA "yes" "4.3.1"
spatial NA NA "yes" "4.3.1"
splines NA NA "yes" "4.3.1"
stats NA NA "yes" "4.3.1"
stats4 NA NA NA "4.3.1"
survival NA NA "yes" "4.3.1"
tcltk NA NA "yes" "4.3.1"
tools NA NA "yes" "4.3.1"
translations NA NA NA "4.3.1"
utils NA NA "yes" "4.3.1"
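Because installed.packages() returns a wide character matrix, it is often easier to read after picking a few columns. The following is a minimal sketch; the names pkgs, overview and base_pkgs are illustrative.

```r
# installed.packages() returns a character matrix, one row per package
pkgs <- installed.packages()

# Keep only a few columns so the result fits on screen
overview <- pkgs[, c("Package", "Version", "Priority")]
head(overview)

# Names of the packages whose priority is "base"
# (which() drops the rows where Priority is NA)
base_pkgs <- rownames(pkgs)[which(pkgs[, "Priority"] == "base")]
base_pkgs
```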
d. install.packages()
Aim: It downloads and installs R packages, together with the packages they depend on.
Syntax: install.packages(pkgs, lib)
Example: install.packages("readxl")
> install.packages("readxl")
package ‘cli’ successfully unpacked and MD5 sums checked
package ‘glue’ successfully unpacked and MD5 sums checked
package ‘utf8’ successfully unpacked and MD5 sums checked
package ‘rematch’ successfully unpacked and MD5 sums checked
package ‘fansi’ successfully unpacked and MD5 sums checked
package ‘lifecycle’ successfully unpacked and MD5 sums checked
package ‘magrittr’ successfully unpacked and MD5 sums checked
package ‘pillar’ successfully unpacked and MD5 sums checked
package ‘pkgconfig’ successfully unpacked and MD5 sums checked
package ‘rlang’ successfully unpacked and MD5 sums checked
package ‘vctrs’ successfully unpacked and MD5 sums checked
package ‘hms’ successfully unpacked and MD5 sums checked
package ‘prettyunits’ successfully unpacked and MD5 sums checked
package ‘R6’ successfully unpacked and MD5 sums checked
package ‘crayon’ successfully unpacked and MD5 sums checked
package ‘cellranger’ successfully unpacked and MD5 sums checked
package ‘tibble’ successfully unpacked and MD5 sums checked
package ‘cpp11’ successfully unpacked and MD5 sums checked
package ‘progress’ successfully unpacked and MD5 sums checked
package ‘readxl’ successfully unpacked and MD5 sums checked
e. packageDescription()
Aim: Parses and returns the DESCRIPTION file of a package as an object of class
"packageDescription".
Syntax: packageDescription(pkg, lib.loc = NULL, fields = NULL, drop = TRUE,
encoding = "")
Example: packageDescription("stats")
> packageDescription("stats")
Package: stats
Version: 4.3.1
Priority: base
Title: The R Stats Package
Author: R Core Team and contributors worldwide
Maintainer: R Core Team <[email protected]>
Contact: R-help mailing list <[email protected]>
Description: R statistical functions.
License: Part of R 4.3.1
Imports: utils, grDevices, graphics
Suggests: MASS, Matrix, SuppDists, methods, stats4
NeedsCompilation: yes
Built: R 4.3.1; x86_64-w64-mingw32; 2023-06-16 07:34:01 UTC; windows
f. library()
Aim: library() and require() load and attach add-on packages.
Syntax: library(package)
Example: library(datasets)
g. dir()
Aim: Returns a character vector of file and/or folder names within a directory.
Syntax: dir(path)
Example: dir("C:/Program Files")
> dir("C:/Program Files")
[1] "Autodesk" "Common Files"
[3] "Dell" "desktop.ini"
[5] "dotnet" "Google"
[7] "Intel" "Internet Explorer"
[9] "MATLAB" "McAfee"
[11] "McAfee.com" "Microsoft Office"
[13] "Microsoft Office 15" "Microsoft OneDrive"
[15] "Microsoft Silverlight" "Microsoft Update Health Tools"
[17] "ModifiableWindowsApps" "National Instruments"
[19] "R" "RStudio"
[21] "Uninstall Information" "Waves"
[23] "Windows Defender" "Windows Mail"
[25] "Windows Media Player" "Windows NT"
[27] "Windows Photo Viewer" "Windows Sidebar"
[29] "WindowsApps" "WindowsPowerShell"
h. setwd()
Aim: setwd() stands for "set working directory". It sets the working directory of the
current R session.
Syntax: setwd("path")
Example: setwd('C:/')
i. getwd()
Aim: getwd() stands for "get working directory". It returns the current working
directory of the R session.
Syntax: getwd()
Example: getwd()
> getwd()
[1] "C:/"
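A short sketch of using setwd(), getwd() and dir() together. It switches to R's temporary directory rather than a fixed drive so that it runs on any machine; the file name example.txt is hypothetical.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()

# Switch to the per-session temporary directory (exists on every platform)
setwd(tempdir())
getwd()

# Create an empty file and confirm that dir() lists it
file.create("example.txt")
"example.txt" %in% dir()

# Clean up and return to the original directory
file.remove("example.txt")
setwd(old_dir)
```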
2. Variables in R Programming
I. Create a variable “RectangleHeight” and assign the value 2 to it. Note the use of the
operator “<-” to assign a value to the variable. Likewise, the variable
“RectangleWidth” is defined and assigned the value 4. Compute the area of a
rectangle and store it in “RectangleArea” and print the value of RectangleArea.
> r_hight<-3
> r_width<-4
> r_area<-r_hight*r_width
> print(r_area)
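For comparison, the same exercise written with the variable names and values given in the task statement (height 2, width 4):

```r
# Assign the sides with the "<-" operator
RectangleHeight <- 2
RectangleWidth <- 4

# Area of a rectangle = height * width
RectangleArea <- RectangleHeight * RectangleWidth
print(RectangleArea)   # [1] 8
```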
II. ls()
Aim: Returns a vector of character strings giving the names of the objects in the
specified environment.
Syntax: ls(name, pos = -1L, envir = as.environment(pos), all.names = FALSE, pattern, sorted = TRUE)
Example: ls()
> ls()
[1] "r_area" "r_hight" "r_width"
3. Input statements
i. scan()
Aim: Reads data into a vector or list from the console or a file.
Syntax: scan("data.txt", what = "character")
Example: scan(text = "1 2 3 4 5")
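A minimal sketch of scan() reading from a string through the text argument, which avoids the need for a data file. quiet = TRUE only suppresses the "Read N items" message; the names nums and words are illustrative.

```r
# Read five numbers from a string instead of a file or the console
nums <- scan(text = "1 2 3 4 5", quiet = TRUE)
nums        # 1 2 3 4 5
sum(nums)   # 15

# what = "character" keeps the items as strings
words <- scan(text = "red green blue", what = "character", quiet = TRUE)
length(words)   # 3
```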
ii. readline()
Aim: Reads a line from the terminal (in interactive use).
Syntax: readline(prompt = "")
Example: readline("Hi, enter your name")
> readline("Hi,Whats your name:")
Hi,Whats your name:HELPsetwd('C:/')rakesh
[1] "HELPsetwd('C:/')rakesh"
4. Output Statements
i. print()
Aim: Prints its argument, whether a variable or a literal string.
Syntax: print(x, ...)
Example: print("Good Morning")
ii. cat()
Aim: Outputs the objects, concatenating the representations. cat performs
much less conversion than print.
Syntax: cat(… , file = "", sep = " ", fill, labels, append)
Example: cat(paste(letters, 100 * 1:26), fill = TRUE, labels = paste0("{", 1:10, "}:"))
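The difference between the two output functions is easiest to see side by side. In this sketch, capture.output() is used only to store what each call would print; the names msg, printed and catted are illustrative.

```r
msg <- "Good Morning"

# print() shows the element index and the quotes
print(msg)                    # [1] "Good Morning"

# cat() writes the raw text and needs an explicit newline
cat(msg, "\n")
cat("Total:", 3 + 4, "\n")    # Total: 7

# Store the printed text to compare the two forms
printed <- capture.output(print(msg))
catted <- capture.output(cat(msg))
```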
DATES
i) Print System’s date
Syntax: Sys.Date()
> Sys.Date()
[1] "2023-07-30"
ii) Print System’s time
Syntax: Sys.time()
> Sys.time()
[1] "2023-07-30 19:32:14 IST"
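Specifier Description
%a Abbreviated weekday
%A Full weekday
%b Abbreviated month
%B Full month
%C Century
%y Year without century
%Y Year with century
%d Day of the month (01-31)
%j Day of the year (001-366)
%m Month as a number (01-12)
%D Date in %m/%d/%y form
%u Weekday as a number (1-7, Monday is 1)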
> date<-Sys.Date()
> format(date,format='%a')
[1] "Sun"
> format(date,format='%A')
[1] "Sunday"
> format(date,format='%b')
[1] "Jul"
> format(date,format='%B')
[1] "July"
> format(date,format='%C')
[1] "20"
> format(date,format='%y')
[1] "23"
> format(date,format='%Y')
[1] "2023"
> format(date,format='%d')
[1] "30"
> format(date,format='%j')
[1] "211"
> format(date,format='%m')
[1] "07"
> format(date,format='%D')
[1] "07/30/23"
> format(date,format='%u')
[1] "7"
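The specifiers can also be combined in one format string, and the same specifiers work in the other direction when parsing text into a date. This sketch uses a fixed date so the results do not depend on the day it is run; only numeric specifiers are used because %a, %A, %b and %B depend on the locale.

```r
# A fixed date, the same day as the console session above
d <- as.Date("2023-07-30")

# Several specifiers combined in one format string
format(d, "%d/%m/%Y")   # "30/07/2023"
format(d, "%j")         # "211"
format(d, "%u")         # "7"

# Parsing: the format argument describes how the text is laid out
parsed <- as.Date("30/07/2023", format = "%d/%m/%Y")
parsed == d             # TRUE
```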
FUNCTIONS
i) Sum(Functions with and without null values)
iv) Seq
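The Sum and Seq headings above carry no example, so here is a minimal sketch of both. The vector values is illustrative; na.rm = TRUE tells sum() to skip the missing values.

```r
values <- c(1, 2, NA, 34, 12, NA, 11)

# With missing values present, the sum itself is missing
sum(values)                  # NA

# na.rm = TRUE removes the NA values before adding
sum(values, na.rm = TRUE)    # 60

# seq() generates regular sequences
seq(1, 10, by = 3)           # 1 4 7 10
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00
```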
ii) Strsplit
> str<-"i am going to split the scentence"
> strsplit(str," ")
[[1]]
[1] "i" "am" "going"
[4] "to" "split" "the"
[7] "scentence"
iii) Paste
> paste("i","am","Rakesh")
[1] "i am Rakesh"
iv) Grep
It is used for pattern matching and replacement. grep, grepl, regexpr, gregexpr and regexec search
for matches with argument pattern within each element of a character vector.
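A minimal sketch of pattern matching and replacement on a small character vector. The vector words is illustrative; sub() is the base R function that replaces the first match in each element.

```r
words <- c("i", "am", "Rakesh")

# Indices of the elements that contain the pattern
grep("a", words)                  # 2 3

# The same test as a logical vector
grepl("a", words)                 # FALSE TRUE TRUE

# value = TRUE returns the matching elements themselves
grep("a", words, value = TRUE)    # "am" "Rakesh"

# Replace the first match in each element
sub("a", "A", words)              # "i" "Am" "RAkesh"
```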
v) Toupper
> print(toupper(str))
[1] "I AM GOING TO SPLIT THE SCENTENCE"
vi) Tolower
> print(tolower(str))
[1] "i am going to split the scentence"
vii) rep
> rep(1:10,times=2)
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3
[14] 4 5 6 7 8 9 10
i) is.na
• This function returns a logical vector (TRUE or FALSE) that marks which elements are missing.
> data=c(1,2,NA,34,12,NA,11)
> print(data)
[1] 1 2 NA 34 12 NA 11
> print(is.na(data))
[1] FALSE FALSE TRUE FALSE FALSE TRUE
[7] FALSE
ii) na.omit
• It drops every element (or, for a data frame, every row) that contains a missing value and
records the removed positions in the "na.action" attribute.
> na.omit(data)
[1] 1 2 34 12 11
attr(,"na.action")
[1] 3 6
attr(,"class")
[1] "omit"
iii) na.exclude
> na.exclude(data)
[1] 1 2 34 12 11
attr(,"na.action")
[1] 3 6
attr(,"class")
[1] "exclude"
iv) na.fail
> na.fail(data)
Error in na.fail.default(data) : missing values in object
v) na.pass
• Take no action.
> na.pass(data)
[1] 1 2 NA 34 12 NA 11
VECTORS
(i) Creation with numbers, string values, logical values
> num<-c(4,1,3,2,7)
> str<-"this is R language"
> bool<-c(TRUE,FALSE)
(ii) Declaration of vector (list() is used here because the three elements have different types)
> vec=list(str,num,bool)
> print(vec)
[[1]]
[1] "this is R language"
[[2]]
[1] 4 1 3 2 7
[[3]]
[1] TRUE FALSE
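A brief sketch of why list() rather than c() keeps the three pieces apart: c() builds an atomic vector and coerces all elements to one type, while a list keeps each element's own type. The names mixed and vec are illustrative.

```r
num <- c(4, 1, 3, 2, 7)
class(num)            # "numeric"

# c() coerces mixed inputs to a single type, here character
mixed <- c(4, "this is R language", TRUE)
class(mixed)          # "character"
mixed                 # "4" "this is R language" "TRUE"

# list() keeps each element's own type
vec <- list("this is R language", num, c(TRUE, FALSE))
sapply(vec, class)    # "character" "numeric" "logical"
```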
MATRICES
(i) Creation of matrices
Matrices are created using matrix() function. The matrix() function takes a data vector as
input and reshapes it into a matrix with the specified number of rows and columns.
(ii) Accessing matrix
Access elements of a matrix in R using square brackets `[]`. To access specific elements, you
need to specify the row and column indices.
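A minimal sketch of creating a matrix and accessing its elements. The data 1:6 and the names m and mr are illustrative; by default matrix() fills column by column.

```r
# 2 rows and 3 columns, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m

m[2, 3]   # element in row 2, column 3: 6
m[1, ]    # whole first row: 1 3 5
m[, 2]    # whole second column: 3 4
dim(m)    # 2 3

# byrow = TRUE fills row by row instead
mr <- matrix(1:6, nrow = 2, byrow = TRUE)
mr[1, ]   # 1 2 3
```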
(v) Image
The image() function is used to create an image plot of the matrix data. It displays the values
in the matrix as a heatmap, with colors representing different values.
FACTORS
(i) Creating Factor
Factors represent categorical data, that is, data with a fixed set of categories or levels. A factor is
created with the factor() function.
(ii) As.integer
The as.integer() function converts a factor into its underlying integer
representation. Each unique level in the factor is assigned a unique integer code.
(iii) Levels
The levels() function extracts or modifies the levels of a factor. It lets you view the
unique categories present in the factor and assign custom levels if needed.
(iv) Plotting factors
Factors are plotted with categorical plots, such as bar charts, to visualize the distribution of the categories.
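The four steps above can be sketched on a small made-up vector of sizes. The data and the name sizes are illustrative; the plotting call is left as a comment because it opens a graphics device.

```r
# (i) Create a factor from a character vector
sizes <- factor(c("Large", "Small", "Medium", "Small", "Large", "Small"))

# (iii) Levels are sorted alphabetically by default
levels(sizes)        # "Large" "Medium" "Small"

# (ii) Integer codes index into the levels
as.integer(sizes)    # 1 3 2 3 1 3

# Frequency of each category
size_counts <- table(sizes)
size_counts

# (iii) Rename one level
levels(sizes)[levels(sizes) == "Medium"] <- "Mid"
levels(sizes)        # "Large" "Mid" "Small"

# (iv) A bar chart of the counts would be drawn with:
# barplot(table(sizes))
```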
SIMPLE ANALYSIS
• Describe Data Structure
(i) names(),
(ii) str(),
(iii) summary(),
(iv) head() and tail()
(ii) Mean
(iii) Sum
(iv) table
(v) hist
(vi) Boxplot
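The functions listed above can be tried on a small made-up data frame. The data and names here are hypothetical; hist() and boxplot() are called with plot = FALSE so that they return their computed statistics instead of drawing.

```r
df <- data.frame(name = c("A", "B", "C", "D", "E"),
                 score = c(10, 20, 20, 30, 40))

# Describe the data structure
names(df)        # "name" "score"
str(df)
summary(df)
head(df, 2)      # first two rows
tail(df, 2)      # last two rows

# Simple statistics on one column
mean(df$score)   # 24
sum(df$score)    # 120
table(df$score)  # frequency of each distinct score

# Statistics behind the plots, without drawing them
h <- hist(df$score, plot = FALSE)
b <- boxplot(df$score, plot = FALSE)
b$stats          # lower whisker, lower hinge, median, upper hinge, upper whisker
```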
METHODS FOR READING DATA
(i) Read from CSV file
Aim: read.csv() (or its alternative read.csv2(), depending on your locale
settings) reads a CSV file into a data frame.
Syntax: read.csv("path")
> data<-read.csv("C:\\Users\\rakes\\Downloads\\username.csv")
> data
Username..Identifier.First.name.Last.name
1 booker12;9012;Rachel;Booker
2 grey07;2070;Laura;Grey
3 johnson81;4081;Craig;Johnson
4 jenkins46;9346;Mary;Jenkins
5 smith79;5079;Jamie;Smith
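In the output above every row lands in a single column, which is what happens when a semicolon-separated file is read with the default comma separator. A sketch of the fix, assuming the file really is semicolon-delimited; the text argument stands in for the file so the example is self-contained, and the name csv_text is illustrative.

```r
# A few rows in the same layout as the file above, held in a string
csv_text <- "Username;Identifier;First name;Last name
booker12;9012;Rachel;Booker
grey07;2070;Laura;Grey
johnson81;4081;Craig;Johnson"

# Tell read.csv() that the separator is a semicolon
users <- read.csv(text = csv_text, sep = ";")
ncol(users)         # 4
names(users)        # "Username" "Identifier" "First.name" "Last.name"
users$Identifier    # 9012 2070 4081

# read.csv2() uses ";" as its default separator
users2 <- read.csv2(text = csv_text)
ncol(users2)        # 4
```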
$size
[1] "Large"
$color
[1] "Red"
1. Data frame
(a) Creation
Aim: A data frame stores data in tabular form, with one column per variable and one row per observation.
Syntax: data.frame()
Example:
> df<-data.frame(Roll_no=c(1,2,3),
+ name=c("RAkesh","Rajesh","Rakhi"),
+ score=c(1281,1282,1283)
+ )
> df
Roll_no name score
1 1 RAkesh 1281
2 2 Rajesh 1282
3 3 Rakhi 1283
(b) Access
Aim: To access data in a data frame for manipulation or further use.
Syntax: data_frame$column_name
Example: df$name
> df$name
[1] "RAkesh" "Rajesh" "Rakhi"
Syntax: data_frame[["column_name"]]
Example: df[["name"]]
> df[["name"]]
[1] "RAkesh" "Rajesh" "Rakhi"
#Summarisation
#ddply
Aim: The ddply() function from the plyr package splits a data frame into subsets based on one or
more variables, applies a specified function to each subset, and then combines the results
into a new data frame.
Syntax: ddply(data, .variables, .fun)
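The same split-apply-combine idea can be sketched in base R with aggregate(), which needs no extra package. This is a stand-in for ddply(), not ddply() itself; the data frame scores is hypothetical, and the commented line shows what an equivalent plyr call would look like if plyr is installed.

```r
scores <- data.frame(group = c("a", "a", "b", "b", "b"),
                     score = c(10, 20, 30, 40, 50))

# Split by group, apply mean() to each subset, combine into a data frame
by_group <- aggregate(score ~ group, data = scores, FUN = mean)
by_group

# With the plyr package installed, an equivalent call would be:
# plyr::ddply(scores, "group", plyr::summarise, score = mean(score))
```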
Outliers
The aim of handling outliers is to identify and manage data points that significantly
deviate from the rest of the data. Outliers can arise due to natural variations in the data
or measurement errors and can have a substantial impact on statistical analysis,
visualization, and modelling. Addressing outliers prevents them from unduly
influencing the results of the analysis.
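One common convention for flagging outliers, used here only as an illustration, is the 1.5 x IQR rule: values more than 1.5 interquartile ranges below the first quartile or above the third quartile are flagged. The data and names are hypothetical.

```r
x <- c(10, 12, 11, 13, 12, 95)

# First and third quartiles, and the width of the fences
q <- unname(quantile(x, c(0.25, 0.75)))   # 11.25 12.75
fence <- 1.5 * IQR(x)                     # 2.25

lower <- q[1] - fence                     # 9
upper <- q[2] + fence                     # 15

# Values outside the fences are flagged as outliers
outliers <- x[x < lower | x > upper]
outliers                                  # 95
```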
(b) Mean
The aim of calculating the mean is to find the arithmetic average of a set of numeric
values.
Syntax: mean(x)
(c) Mode
The aim of calculating the mode is to find the value that appears most frequently in a
dataset.
Syntax: table_x <- table(x); mode_x <- as.numeric(names(table_x[table_x == max(table_x)]))
(d) Median
The aim of calculating the median is to find the middle value of a dataset when it is ordered.
Syntax: median(x)
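The three measures can be compared on one small vector. The data is illustrative; base R has no built-in function for the statistical mode, so it is derived from a frequency table as in the syntax line above.

```r
x <- c(2, 3, 3, 5, 7, 3, 5)

mean(x)     # 4
median(x)   # 3

# Mode: the value(s) with the highest frequency
table_x <- table(x)
mode_x <- as.numeric(names(table_x[table_x == max(table_x)]))
mode_x      # 3
```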
(f) Removing NA
The aim is to remove missing or NA (Not Available) values from the dataset so that
calculations or analyses run on complete data.
Syntax: na.omit(x)
(g) Plotting
The aim of plotting is to represent data visually for better understanding and analysis.
(c) densityplot
The aim of a density plot (densityplot() from the lattice package) is to visualize the estimated
probability density of a continuous variable, providing insights into the data's underlying distribution.
Syntax: densityplot(x, main = ..., xlab = ..., ylab = ...)
DATA ANALYTICS LABORATORY
WEEK-4: HDFS (Storage) commands
1) Version
The Hadoop fs shell command version prints the Hadoop version.
Output:
2) ls
Output:
3) get:
The Hadoop fs shell command get copies the file or directory from the Hadoop file system to the
local file system.
Output:
4) copyToLocal:
copyToLocal command copies the file from HDFS to the local file system.
Output:
5) cat
The cat command reads the file in HDFS and displays the content of the file on console or stdout.
Output:
6) put:
The Hadoop fs shell command put is similar to the copyFromLocal, which copies files or directory
from the local filesystem to the destination in the Hadoop filesystem.
7) copyFromLocal
This command copies the file from the local file system to HDFS.
Output:
8) fsck
The fsck Hadoop command is used to check the health of the HDFS.
Syntax: hadoop fsck <path> [ -move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
Options Description
<path> start checking from the path specified here
-move It moves a corrupted file to the lost+found directory.
-delete It deletes the corrupted files present in HDFS.
-openforwrite It prints the files which are opened for write
-files It prints the files being checked.
-blocks It prints out all the blocks of the file while checking.
-locations It prints the location of all the blocks of files while checking.
-racks It displays the network topology for DataNode locations.
Output:
9) mkdir
This command creates the directory in HDFS if it does not already exist.
Output:
10) cp
The cp command copies a file from one directory to another directory within the HDFS.
Output:
11) touchz
The touchz command creates a file in HDFS with a size of 0 bytes. The directory is the name of
the directory where the file will be created, and filename is the name of the new file.
Output:
12) du
This Hadoop fs shell command du prints a summary of the amount of disk usage of all
files/directories in the path.
Output:
13) count
The Hadoop fs shell command count counts the number of files, directories, and bytes under the
paths that match the specified file pattern.
Options:
-q : shows quotas (a quota is the hard limit on the number of names and amount of space used for
individual directories)
Output:
14) rm
Options:
-r : Recursively remove directories and files
-skipTrash : Bypass the trash and delete the source immediately
-f : Do not report an error if the file does not exist
-R : Recursively delete directories (same as -r)
Output:
15) mv
The HDFS mv command moves the files or directories from the source to a destination within hdfs.
Output:
16) help
The Hadoop fs shell command help shows help for all the commands or the specified command.
Output:
17) usage
The Hadoop fs shell command usage returns the help for an individual command.
Output:
18) df
The Hadoop fs shell command df shows the capacity, size, and free space available on the HDFS file
system. The -h option formats the file size in a human-readable format.
Output:
19) chmod
The Hadoop fs shell command chmod changes the permissions of a file. The -R option recursively
changes file permissions through the directory structure. The user must be the owner of the file or a
superuser.
Output:
20) tail
The Hadoop fs shell tail command shows the last 1KB of a file on console or stdout. The -f option shows the
appended data as the file grows.
Output:
21) expunge
The HDFS expunge command empties the trash.
Output:
22) appendToFile
This command appends the contents of all the given local files to the provided destination file on the
HDFS filesystem. The destination file is created if it does not already exist.
Syntax: hadoop fs -appendToFile <localsrc> <dest>
Output:
25) chown
The Hadoop fs shell command chown changes the owner of the file. The -R option applies the
change recursively through the directory structure. The user must be a
superuser.
Output:
DATA ANALYTICS LABORATORY
WEEK-5: Map Reduce Programming
Map Reduce Programming-Max Temperature
Step-1:
Step-4:
Type source code in src files and save
maxtemperature.java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
public class maxtemperature extends Mapper<LongWritable, Text, Text, IntWritable > {
}
}
maxtempreduce.java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class maxtempreduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int maxvalue=Integer.MIN_VALUE;
for (IntWritable value : values) {
maxvalue=Math.max(maxvalue, value.get());
}
context.write(key, new IntWritable(maxvalue));
}
}
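What the reducer computes can be illustrated outside Hadoop with a few lines of R: for every key, keep the largest value seen. This is only an analogue of the reduce step, not Hadoop code, and the readings below are hypothetical.

```r
# Hypothetical (key, value) pairs as they might leave the map step
readings <- data.frame(year = c("1949", "1949", "1950", "1950", "1950"),
                       temp = c(111, 78, 0, 22, -11))

# Group the values by key and keep the maximum of each group,
# which mirrors the Math.max loop in the reduce() method
max_by_year <- tapply(readings$temp, readings$year, max)
max_by_year[["1949"]]   # 111
max_by_year[["1950"]]   # 22
```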
maxtempdriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
if (!job.waitForCompletion(true))
return;
}
}
Step-5:
Export the project file to the desktop or any folder that contains the input file.
Step-6:
Run commands in terminal
1. create input directory
Step-7:
Print the output from the output directory.
DATA ANALYTICS LABORATORY
WEEK-6: Map Reduce Programming
Map Reduce Programming- Word Count
Step-1:
Open the Eclipse editor and create a new project.
Step-2:
Add the JAR files from hadoop and hadoop client, then click Finish.
Step-3:
Right-click src in the project and create the class files.
Step-4:
Type the source code in the src files and save.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
Step-5:
Export the project file to the desktop or any folder that contains the input file.
Step-6:
Run commands in terminal
1. create input directory
3. Check whether the input data has been moved by printing the data.
4. Run the JAR file that contains the source code.
Step-7:
Print the output from the output directory.
DATA ANALYTICS LABORATORY
WEEK-7: Data Processing Tool – Hive
Data Processing Tool-Hive
1) How to enter the HIVE Shell?
Q14) List the employee names where job has l as the second character?
Q4) Load the data set of movies from local to hive table?
Q13) List all the years with the total number of views in each year
(hint: group by year), and restrict the records to 5.
hive> select year, sum(views) from movie_details group by year LIMIT 5;
PARTITIONING AND BUCKETING
Q1) create a database shopping?
Q4) Create a partition (shopping3) for table shopping1 and also create 3 buckets inside each partition?
Q5) Populate the partition with data?
Objective: To perform word count on a text file using functions like split and explode
Step 2:
Use the created database and create the table hive_count_tb.
Step 3: Make a .txt file on the local machine consisting of a few sentences (here,
hive_count.txt), then load that data into the table hive_count_tb.
Step 4: Check that the table contains the data with the Show tables command and Select * from
hive_count_tb.
Step 5: The data is in sentences, so first convert it into words
with the split function, using a space as the delimiter.
Step 6: Explode expands an array held in a single row across multiple rows, one for
each value in the array.
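The split, explode and count sequence can be illustrated with a rough R analogue, which is not Hive code: strsplit() plays the role of split, unlist() flattens the per-line arrays the way explode does, and table() does the grouping and counting. The two sentences are made up.

```r
# Two hypothetical lines of text, one per table row
lines <- c("hive is a data tool", "pig is a data tool too")

# split: each line becomes an array of words
word_arrays <- strsplit(lines, " ")

# explode: one word per element instead of one array per line
words <- unlist(word_arrays)
length(words)        # 11

# group by word and count
counts <- table(words)
counts[["data"]]     # 2
counts[["hive"]]     # 1
```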
Pig Commands
Q1: How to enter the grunt shell?
Q8: Split the c data set into two different relations, e.g. d and e, where d holds the records whose
$0 has the value 1 and e holds the records whose $0 has the value 4.
STEP 4: Break each line into a bag of words using the TOKENIZE function, then un-nest the bag
into one word per tuple using the FLATTEN function.
STEP 5: Now group the collection of words based on word.