R Notes Based on Text Module 2
● Importing data from Text files and other software, Exporting data
○ This topic covers how to bring data into R from external sources like text files
and other software formats.
○ "R in a Nutshell" discusses reading data from text files in the context of
importing data. It also mentions exporting data.
○ The syllabus also mentions exporting data as part of this topic.
● Importing data from databases- Database Connection packages
○ This topic covers connecting R to databases and retrieving data directly,
using database connection packages such as RODBC and DBI.
● Missing values- NA and NULL
○ This topic deals with how R represents and handles missing values, specifically
the special values NA (Not Available) and NULL.
○ "R in a Nutshell" explicitly discusses special values including NA. While it
doesn't dedicate a separate section to NULL in the main data handling chapters, it
does mention NULL as a special object type. The predict function's
na.action argument and the lm function's handling of NA values in data are
also mentioned.
● Combining data sets, Transformations
KTU - AIT 362 – Programming in R (2019 Scheme) | Based on prescribed text and syllabus.
Prepared by Mr. Nisanth P, Asst. Professor, Department of AI & ML , VAST | +91 90372 22822
○ This section covers merging or joining multiple datasets together and
transforming variables within a dataset.
○ "R in a Nutshell" addresses combining data sets and includes a section on
transformations. This section details methods for reassigning variables and
using the transform function. It also covers pasting together data.
● Binning Data, Subsets, summarizing functions
○ This topic includes categorizing continuous data into bins, selecting specific
portions (subsets) of a dataset, and using functions to get summary
information about the data.
○ While "R in a Nutshell" doesn't have a section explicitly titled "Binning Data," it
does discuss techniques like creating factor vectors which can be related to
binning. It has a section on subsets as part of data manipulation. The syllabus
itself later mentions summarizing functions again. "R in a Nutshell" has a
section on summarizing functions such as summary, tapply, and
aggregate.
● Data Cleaning, Finding and removing Duplicates, Sorting
Importing data from Text files and other software, Exporting data
R offers a family of functions based on the read.table function for importing delimited text files.
Some of the most commonly used functions include:
● read.csv: Specifically designed for comma-separated value (CSV) files. It assumes a header
row and uses a comma as the default separator.
● read.delim: Used for files where fields are delimited by a specific character, with the tab
character (\t) being the default.
● read.csv2 and read.delim2: These are variations that use a semicolon (;) as the separator
and a comma (,) for the decimal point, which is common in some European locales.
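As a minimal self-contained sketch (using a temporary file rather than a real dataset), the delimited-file readers can be exercised like this:

```r
# Write a tiny CSV to a temporary file, then read it back with read.csv,
# which assumes header = TRUE and sep = "," by default.
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score", "alice,90", "bob,85"), tmp)
df <- read.csv(tmp)
print(df)
# read.delim works the same way for tab-separated files (sep = "\t"), and
# read.csv2 / read.delim2 swap in ";" separators and "," decimal points.
```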
These functions have various arguments to control how the data is read. Some important ones include
header (whether the first line contains column names), sep (the field separator character),
na.strings (which strings should be treated as NA), skip (the number of lines to skip before
reading), and stringsAsFactors (whether character columns should be converted to factors).
You can even import data directly from a URL by providing the URL as the file argument to these
functions. For example, you can fetch a CSV file from a website.
For fixed-width files, where variables are differentiated by their location on each line, R provides the
read.fwf function. This function requires specifying the widths of each field.
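A small self-contained sketch of read.fwf (the file contents are invented for illustration):

```r
# Two fixed-width records: a 3-character id followed by a 2-digit value.
tmp <- tempfile()
writeLines(c("abc12", "def34"), tmp)
df <- read.fwf(tmp, widths = c(3, 2), col.names = c("id", "value"))
print(df)   # id "abc"/"def", value 12/34
```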
For more complex scenarios, you can read data into R one line at a time using the readLines function.
Another versatile function is scan, which allows reading data in various formats and specifying the data
type to be read.
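The difference between the two is easy to see on a toy file: readLines returns whole lines as character strings, while scan reads individual values of a stated type.

```r
tmp <- tempfile()
writeLines(c("10 20", "30 40"), tmp)
lines <- readLines(tmp)              # character vector of length 2
nums <- scan(tmp, what = numeric())  # numeric vector: 10 20 30 40
print(lines)
print(nums)
```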
"R in a Nutshell" also provides tips for creating text files that are easy to import into R. For example,
when exporting from Microsoft Excel, you can choose between CSV and tab-delimited files and should
ideally specify Unix-style line delimiters. When exporting from databases, using GUI tools to export the
results can be easier than command-line scripts. For very large text files, the book suggests using scripting
languages like Perl, Python, or Ruby to preprocess the data into a more manageable format before
importing it into R.
While many software packages can export data as text files, R can also directly read files in various other
formats using functions often provided by additional packages. Table 11-1 in "R in a Nutshell" lists
some of these functions:
● S (older versions): read.S (in foreign package).
● SPSS: read.spss (in foreign package).
● SAS Permanent Dataset: read.ssd (in foreign package).
● Systat: read.systat (in foreign package).
● SAS XPORT File: read.xport (in foreign package).
The foreign package is a crucial package that provides many of these functions for reading data files
from other statistical software. You would typically need to install and load this package using
install.packages("foreign") and library(foreign) before using these functions.
Additionally, for users familiar with Microsoft Excel, the RExcel software allows running R directly
from within Excel on Microsoft Windows systems, facilitating data exchange. Information about this
software can be found at the provided URL.
For importing data from databases, R provides Database Connection Packages like RODBC and DBI.
The syllabus explicitly mentions these. These packages allow you to connect to various database systems
and retrieve data directly into R. This topic is covered in detail in a subsequent section of Module 2.
Exporting Data
R can export R data objects, primarily data frames and matrices, to text files using the write.table
function. This function has several arguments to control the output format:
You can also export R data objects (like individual variables, lists, or entire workspaces) in a binary
format using the save function. This saves the objects to a file (typically with the extension .RData or
.rda) that can be reloaded into R later using the load function. The save function allows you to
specify which objects to save and the file name. Connections, such as those created with gzfile for
compressed files, can be used with save and load.
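A round-trip sketch of both export paths, using temporary files: write.table for a text export, and save/load for R's binary format.

```r
# Text export and re-import.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
txt <- tempfile(fileext = ".txt")
write.table(df, txt, sep = "\t", row.names = FALSE, quote = FALSE)
df2 <- read.delim(txt)        # read the tab-delimited file back

# Binary export: save() writes the object, load() restores it by name.
rda <- tempfile(fileext = ".RData")
save(df, file = rda)
rm(df)
load(rda)                     # the object `df` exists again
print(df)
```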
In summary, R provides a robust set of tools for importing data from various text file formats and other
software, often relying on the base functions for text files and additional packages like foreign for
other formats. Similarly, the write.table and save functions enable flexible exporting of data and R
objects. Understanding these functions and their arguments is crucial for effectively managing data within
R.
There are primarily two approaches to getting data from databases into R: exporting the data to a text
file and then importing it into R, or establishing a direct connection from R to the database using
database connection packages.
One effective method, particularly for large datasets (1 GB or more) accessed once for analysis, is to first
export the data from the database to a text file (as discussed in the "Importing data from Text files and
other software" topic) and then import that file into R. "R in a Nutshell" suggests that importing from text
files can be significantly faster than direct database connections for very large datasets. For guidance on
importing these exported text files, refer to the "Text Files" section. Tools for querying and exporting data
from databases often include GUI options like Toad for Data Analysts (for various databases), MySQL
Query Browser, and Oracle SQL Developer, which can simplify the export process compared to
command-line scripts.
For scenarios requiring regular reports or repeated analysis, directly importing data into R via database
connections can be more efficient. To achieve this, R utilizes optional packages that need to be installed
based on the specific database(s) you intend to connect to and the preferred connection method. "R in a
Nutshell" highlights two main sets of database interfaces available in R:
● RODBC (R Open DataBase Connectivity): This package allows R to retrieve data from
databases using ODBC (Open DataBase Connectivity) connections. ODBC is a standard
interface that enables different programs to connect to various databases.
○ To use RODBC:
1. Install the RODBC package (for example, with
install.packages("RODBC")).
2. Install the ODBC drivers for your platform and the database you want to reach.
3. Configure an ODBC connection to your database. This typically involves
setting up a Data Source Name (DSN). Examples for configuring an ODBC
connection to an SQLite database are provided for both Microsoft Windows and
Mac OS X.
○ Once configured, you can use functions from the RODBC package to interact with the
database:
1. odbcConnect(dsn, uid = "", pwd = "", ...): Establishes a
connection to the database using the specified DSN and optional username (uid)
and password (pwd). It returns an object of class RODBC representing the
connection (often called a channel).
2. odbcGetInfo(channel): Retrieves information about the ODBC
connection.
3. sqlTables(channel): Lists the tables available in the database for the
connected user.
4. sqlColumns(channel, sqtable): Provides detailed information about
the columns in a specific table (sqtable).
5. sqlPrimaryKeys(channel, sqtable): Discovers the primary keys for
a table.
6. sqlFetch(channel, sqtable, ..., colnames = FALSE,
rownames = TRUE): Fetches the entire content of a table or view (sqtable) into an R data
frame.
7. sqlQuery(channel, query, errors = TRUE, max = 0, ...,
rows_at_time): Executes an arbitrary SQL query in the database and
returns the results as a data frame. The max argument can be used to fetch results
piecewise for very large queries, with sqlGetResults used to retrieve
subsequent rows.
8. sqlSave(channel, data, tablename, ...),
sqlUpdate(channel, data, tablename, ...): Functions for
writing an R data frame to a database table or updating an existing table (see help
files for details).
9. odbcClose(channel): Closes the database connection.
10. odbcCloseAll(): Closes all open RODBC channels.
○ RODBC also has a second set of lower-level functions like odbcQuery, odbcTables,
odbcColumns, odbcPrimaryKeys (for executing queries but not fetching results)
and odbcFetchResults (to get the results), but the higher-level functions are
generally more convenient.
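The higher-level functions above fit together roughly as follows. This is only a sketch: "mydsn", the credentials, and the employees table are all hypothetical, and the block runs (and fails silently) only when RODBC is installed and such a DSN exists.

```r
# Hypothetical RODBC session; skipped gracefully when the package
# or the DSN is absent.
if (requireNamespace("RODBC", quietly = TRUE)) {
  try({
    ch <- RODBC::odbcConnect("mydsn", uid = "user", pwd = "secret")
    print(RODBC::sqlTables(ch))                 # list the available tables
    emp <- RODBC::sqlFetch(ch, "employees")     # whole table as a data frame
    n <- RODBC::sqlQuery(ch, "SELECT COUNT(*) FROM employees")
    RODBC::odbcClose(ch)
  }, silent = TRUE)
}
```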
● DBI (Database Interface): DBI is not a single package but rather a framework and a set of
packages for accessing databases. It provides a common database abstraction for R software. To
connect to specific databases using DBI, you need to install additional driver packages. Table
11-3 in "R in a Nutshell" lists some of these packages and the corresponding databases:
○ RMySQL for MySQL
○ RSQLite for SQLite
○ ROracle for Oracle
○ RPostgreSQL for PostgreSQL
○ RJDBC for any database with a JDBC driver
○ To use DBI:
1. Install the appropriate driver package (e.g.,
install.packages("RSQLite")). Installing a driver package often
automatically loads the DBI package as well.
2. Open a connection using the dbConnect(drv, ...) function. The drv
argument can be a driver object created with dbDriver("driverName")
(e.g., dbDriver("SQLite")) or a character string specifying the driver.
Additional arguments, specific to the database type (like dbname for SQLite),
are also required.
3. Get database information using functions like dbGetInfo(con) (for
connection details), dbListConnections(drv) (for connections associated
with a driver), dbListTables(con) (for available tables), and
dbListFields(con, "tableName") (for column names).
4. Query the database using functions like dbGetQuery(con,
sqlStatement) which executes an SQL query and returns the result as a data
frame. Alternatively, you can use dbSendQuery(con, sqlStatement) to
send a query and then fetch(res) to retrieve the results, allowing for
piecewise fetching.
5. Read or write entire tables using dbReadTable(con, "tableName")
and dbWriteTable(con, dataFrame, "tableName", ...)
respectively. You can also check if a table exists with dbExistsTable(con,
"tableName") and remove it with dbRemoveTable(con,
"tableName").
6. Close the connection using dbDisconnect(con). You can also unload the
driver with dbUnloadDriver(drv) to free system resources.
○ DBI utilizes S4 objects to represent drivers and connections, which can help in writing
better code.
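The DBI steps above can be sketched end-to-end with the RSQLite driver and an in-memory database; the block is a no-op if RSQLite is not installed, and the "cars" table name is just for illustration.

```r
# DBI workflow sketch: connect, write, list, query, disconnect.
if (requireNamespace("RSQLite", quietly = TRUE)) {
  library(DBI)
  con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  dbWriteTable(con, "cars", mtcars)
  print(dbListTables(con))        # the tables now in the database
  res <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM cars GROUP BY cyl")
  print(res)                      # result returned as a data frame
  dbDisconnect(con)
}
```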
● TSDBI (Time Series Database Interface): This is another database interface in R specifically
designed for time series data. TSDBI packages are available for various databases, including
MySQL (TSMySQL), SQLite (TSSQLite), Fame (TSFame), PostgreSQL (TSPostgreSQL), and
any database with an ODBC driver (TSODBC).
Choosing between RODBC and DBI often depends on factors like driver availability across different
operating systems, the need for specific database features or performance, and personal preference
regarding coding style (RODBC is more procedural, while DBI uses S4 objects).
Finally, for working with data from Hadoop, "R in a Nutshell" mentions the RHadoop package as a
mature and well-integrated project, which includes functions for moving data in and out of Hadoop
clusters and executing Map/Reduce jobs. Another option mentioned is using Hadoop streaming with other
applications like R. The book refers to the "R and Hadoop" section for more details.
NA (Not Available)
In R, NA values are used to represent missing values. NA stands for "not available". You might
encounter NA values in several situations:
● When text is loaded into R, NA can be used to represent missing values in the original data.
● When data is loaded from databases, NULL values in the database are often replaced by NA
in R.
If you expand the size of a vector, matrix, or array beyond the initially defined size, the newly created
spaces will be filled with the value NA. For example:
> v <- c(1, 2, 3)
> length(v) <- 4
> v
[1]  1  2  3 NA
● When the length of the vector v is increased to 4, the fourth element is assigned the value NA.
Similarly, expanding a vector w also results in NA values for the uninitialized elements.
It's important to note that NA is a specific value used to indicate that a piece of data is missing or not
applicable. Many functions in R have an argument, often na.rm (NA remove), which specifies how NA
values should be treated during calculations. By default, if any value in a vector is NA, functions like
mean() will often return NA. However, setting na.rm=TRUE will instruct the function to ignore the
NA values in the calculation.
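A quick illustration of na.rm:

```r
# mean() propagates NA by default; na.rm = TRUE drops the missing value first.
v <- c(1, 2, NA, 4)
mean(v)                # NA
mean(v, na.rm = TRUE)  # 2.333333 (the mean of 1, 2, and 4)
```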
Furthermore, many modeling functions in R have an na.action argument to specify how NA values in
the data should be handled during model fitting. Common options include na.omit (remove rows with
NA values) and na.fail (return an error if NA values are present).
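A small sketch of na.action with lm(), on toy data invented for illustration:

```r
# One y value is missing; na.omit drops that row before fitting,
# while na.fail would raise an error instead.
df <- data.frame(x = c(1, 2, 3, 4), y = c(2, NA, 6, 8))
fit <- lm(y ~ x, data = df, na.action = na.omit)  # fits on the 3 complete rows
coef(fit)
# lm(y ~ x, data = df, na.action = na.fail)       # would stop with an error
```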
NULL
NULL is another special object in R, represented by the symbol NULL. The symbol NULL always points
to the same object. NULL is distinct from NA, Inf, -Inf, and NaN.
NULL represents the absence of an object. It has a length of zero and no type.
A key distinction is that NA represents a missing value within a data structure (like a vector or data
frame), whereas NULL represents the absence of a data structure or object itself. An NA in a vector
makes the length of the vector remain the same, but indicates that a particular element's value is unknown.
NULL, on the other hand, effectively means "nothing is there."
For example, if you create a list and assign NULL to one of its elements, that element will effectively
disappear, and the length of the list might change depending on how it's handled. If you assign NA, the
element will still exist but have a missing value.
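This difference is easy to see with a small list:

```r
# Assigning NULL removes a list element; assigning NA keeps it as missing.
lst <- list(a = 1, b = 2, c = 3)
lst$b <- NULL
length(lst)    # 2 -- element "b" is gone
names(lst)     # "a" "c"

lst2 <- list(a = 1, b = 2, c = 3)
lst2$b <- NA
length(lst2)   # 3 -- element "b" still exists, but its value is missing
```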
In summary, both NA and NULL are important concepts for handling data in R. NA signifies missing
values within a dataset, which is common in real-world data. NULL represents the absence of an object
and is often used in programming contexts like function arguments and return values. Understanding their
differences is crucial for effective data manipulation and analysis in R.
Combining Datasets in R
There are several functions in R that allow you to combine multiple data structures into a single one:
paste(): This function is used to concatenate multiple character vectors into a single character
vector. If you try to concatenate a vector of another type, it will be coerced to a character vector first. By
default, the values being pasted together are separated by a space, but you can specify a different
separator using the sep argument or no separator at all. The collapse argument can be used to
concatenate all the values in the resulting vector into a single character string with a specified separator.
> x <- c("a", "b", "c")
> y <- c(1, 2, 3)
> paste(x, y, sep = "-", collapse = "#")
[1] "a-1#b-2#c-3"
cbind(): This function combines objects (like vectors, matrices, or data frames) by adding columns.
You can think of this as combining tables horizontally. For this to work effectively, the objects you are
combining should generally have the same number of rows. "R in a Nutshell" provides an example of
combining a data frame top.5.salaries with another data frame more.cols using cbind().
> top.5.salaries <- data.frame(name.last = c("Manning", "Brady"),
+   name.first = c("Peyton", "Tom"), salary = c(18700000, 14626720))
> more.cols <- data.frame(team = c("Colts", "Patriots"))  # illustrative extra columns
> cbind(top.5.salaries, more.cols)
rbind(): This function combines objects (like vectors, matrices, or data frames) by adding rows. This
can be visualized as stacking tables vertically. For rbind() to work properly, the objects should have
the same number of columns and ideally the same column names. "R in a Nutshell" demonstrates
combining two data frames, top.5.salaries and next.three, using rbind().
> df1 <- data.frame(name = c("A", "B"), value = c(10, 20))
> df2 <- data.frame(name = c("C", "D"), value = c(30, 40))
> rbind(df1, df2)
  name value
1    A    10
2    B    20
3    C    30
4    D    40
merge(): This is a powerful function for combining two data frames based on common columns or
row names. The merge() function in R is similar to database join operations. You can specify the
common columns to join by using the by, by.x, and by.y arguments. The all, all.x, and all.y
arguments control whether rows with no match in the other data frame should be included (similar to
OUTER, LEFT OUTER, and RIGHT OUTER joins in SQL). By default, merge() performs a natural
join (only rows with matching values in the specified columns are included).
> df1 <- data.frame(ID = c(1, 2, 3), Name = c("X", "Y", "Z"))
> df2 <- data.frame(ID = c(2, 3, 4), Value = c(100, 200, 300))
> merge(df1, df2)              # natural (inner) join on the common column ID
  ID Name Value
1  2    Y   100
2  3    Z   200
> merge(df1, df2, all = TRUE)  # full outer join: unmatched rows filled with NA
  ID Name Value
1  1    X    NA
2  2    Y   100
3  3    Z   200
4  4 <NA>   300
make.groups() (from the lattice package): Sometimes you might want to combine a set of
similar objects (either vectors or data frames) into a single data frame, with an additional column that
labels the source of each object. The make.groups() function in the lattice package is useful for
this purpose.
> library(lattice)
> vec1 <- c(1, 2, 3)
> vec2 <- c(4, 5, 6)
> make.groups(Vector1 = vec1, Vector2 = vec2)
data which
1 1 Vector1
2 2 Vector1
3 3 Vector1
4 4 Vector2
5 5 Vector2
6 6 Vector2
These functions provide various ways to combine datasets in R depending on the structure of your data
and the desired outcome. Chapter 12 of "R in a Nutshell" provides more detailed explanations and
examples of these and other data preparation techniques.
Transformations in R
Chapter 12 of "R in a Nutshell" also covers data transformations, which involve modifying variables
within a data set.
● Reassigning Variables:
○ You can create new variables or modify existing ones within a data frame using standard
assignment. For instance, dow30$mid <- (dow30$High + dow30$Low) / 2
creates a new column named mid in the dow30 data frame.
● The transform() Function:
○ The transform() function provides a convenient way to change variables in a data
frame. Its formal definition is transform(_data, ...).
○ You specify the data frame as the first argument, followed by a series of expressions that
use variables from that data frame. transform() applies each expression and returns
the modified data frame.
For example:
> dow30.transformed <- transform(dow30, Date = as.Date(Date),
+   mid = (High + Low) / 2)
● Applying a Function to Each Element of an Object:
○ R offers several functions to apply a function to a set of objects, including:
■ lapply(): Applies a function to each element of a list and returns a list.
■ sapply(): A user-friendly version of lapply() that tries to simplify the
output to a vector or matrix.
■ apply(): Applies a function to the margins of an array or matrix.
■ tapply(): Applies a function to each cell of a ragged array, based on
combinations of factor levels.
■ mapply(): A multivariate version of lapply() that applies a function to
corresponding elements of multiple vectors, lists, or other vector-like objects.
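Minimal sketches of each, on toy data:

```r
# lapply returns a list; sapply simplifies to a vector where possible.
lapply(list(a = 1:3, b = 4:6), sum)    # list(a = 6, b = 15)
sapply(list(a = 1:3, b = 4:6), sum)    # named vector c(a = 6, b = 15)

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                       # row sums: 9 12
apply(m, 2, sum)                       # column sums: 3 7 11

tapply(c(10, 20, 30), factor(c("x", "y", "x")), sum)  # x = 40, y = 20
mapply(function(a, b) a + b, 1:3, 4:6)                # 5 7 9
```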
● The plyr Package:
○ The plyr package provides a set of functions (with names following the
pattern **ply, such as ddply, ldply, and adply) for applying a
function to an R data object (array, data frame, or list) and returning the results in various
formats.
● Binning Data:
○ Binning involves grouping observations into bins based on the value of a variable.
○ Shingles: These represent intervals and are used extensively in the lattice package
for conditioning variables. The shingle() function handles shingles.
○ cut() Function: This function divides the range of a numeric variable into intervals
(bins) and codes the values according to which interval they fall into.
● Reshaping Data:
○ Reshaping transforms the structure of data, often between "wide" and "long" formats.
○ stack() and unstack() are basic functions for this.
○ The reshape() function is more powerful but can be complex. The direction
argument specifies whether to transform to "long" or "wide" format.
○ The reshape2 package offers a more intuitive approach with two main functions:
■ melt(): Turns a data frame into a long format.
■ dcast() (or acast() for arrays): Takes a melted data frame and reshapes ("casts") it into a different format.
These methods provide a comprehensive set of tools for combining and transforming data sets in R,
allowing you to prepare your data effectively for analysis.
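The base reshaping functions mentioned above can be sketched on a toy data frame:

```r
# stack() flattens a wide data frame into long format (values + ind columns);
# unstack() reverses the operation.
wide <- data.frame(x = c(1, 2), y = c(3, 4))
long <- stack(wide)
print(long)            # 4 rows: values 1 2 3 4, ind x x y y
print(unstack(long))   # back to the original wide layout
```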
Here are the key methods for binning data in R as described in the sources:
● The cut() function is specifically designed for taking a continuous variable and
splitting it into discrete pieces (bins).
● Its default form for numeric vectors is cut(x, breaks, labels = NULL,
include.lowest = FALSE, right = TRUE, dig.lab = 3,
ordered_result = FALSE).
○ x: This is the numeric vector that you want to convert into a factor by binning its
values.
○ breaks: This argument specifies how the bins should be defined. It can be
either:
■ A single integer value indicating the desired number of breakpoint
intervals. cut() will then try to create approximately equal-width
intervals.
■ A numeric vector specifying the actual breakpoint values that define the
boundaries of the bins.
○ labels: This is an optional argument that allows you to provide labels for the
levels (bins) in the output factor. If not provided, labels are generated based on
the break points.
○ include.lowest: A logical value indicating whether the interval should
include the lowest breakpoint if right = TRUE, or the highest if right =
FALSE. The default is FALSE.
○ right: A logical value that specifies whether the intervals should be closed on
the right and open on the left (i.e., (a, b]). If right = FALSE, intervals
will be open on the right and closed on the left (i.e., [a, b)). The default is
TRUE.
○ dig.lab: An integer that controls the number of digits used when generating
labels if labels are not explicitly specified. The default is 3.
○ ordered_result: A logical value indicating whether the resulting factor
should be an ordered factor. The default is FALSE.
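These arguments can be seen in a tiny self-contained sketch:

```r
# Equal-width binning: split 1..10 into 3 intervals.
x <- 1:10
bins <- cut(x, breaks = 3)
table(bins)                 # counts per interval

# Explicit breakpoints with custom labels; intervals are (a, b] by default.
cut(c(5, 15, 25), breaks = c(0, 10, 20, 30),
    labels = c("low", "mid", "high"))
```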
An example in "R in a Nutshell" demonstrates how to use cut() to bin batting
averages and then use table() to count the number of players in each bin:
> library(nutshell)
> data(batting.2008)
> batting.2008 <- transform(batting.2008, AVG = H / AB)
> batting.2008.over100AB <- subset(batting.2008, AB > 100)
> battingavg.2008.bins <- cut(batting.2008.over100AB$AVG, breaks = 10)
> table(battingavg.2008.bins)
● This code first calculates the batting average (AVG), then subsets the data to include
players with over 100 at-bats, and finally uses cut() to divide the batting averages into
10 bins. The table() function then shows the frequency of players in each of these
bins.
● The cut() function is also used in conjunction with lattice graphics to bin a date
variable into months in the bwplot() function.
2. Shingles: As described above, shingles represent numeric intervals (created with the
shingle() function) and are used mainly for conditioning in the lattice package.
3. Summarizing binned data:
● While tapply() and aggregate() are primarily for summarizing data within
groups, they can be used in conjunction with binning. You would first use cut() to
create a factor representing the bins, and then use tapply() or aggregate() to
calculate summary statistics for each bin.
● The table() and xtabs() functions are used to create contingency tables
(frequency counts) based on factor levels. After using cut() to bin a numeric variable
into a factor, you can use these functions to see the distribution of the data across the
created bins.
In summary, R provides powerful and flexible tools for binning data. The cut() function
is a fundamental tool for creating bins from continuous variables, while shingles offer a
more specialized approach often used with the lattice package for conditioning.
Additionally, summarizing functions can be effectively used after binning to analyze the
data within the defined intervals.
Both subsets and summarizing functions are important aspects of preparing data for analysis in R, as
highlighted in your syllabus and detailed in Chapter 12 of "R in a Nutshell".
Subsets
Often, when working with data, you may need to focus on a specific portion of it rather than the entire
dataset. This is where subsetting comes in handy. You might have too much data overall, or you might
want to analyze specific groups of observations based on certain criteria, such as patient records for a
particular age range and condition. Subsetting allows you to extract the relevant parts of your data for
further analysis.
● Bracket Notation: You can subset data frames using square brackets []. For data frames, you
can select rows by providing a vector of logical values within the brackets. If a logical expression
can describe the rows you want to select, you can use that expression as an index. For example, if
you have a data frame named batting.w.names and want to select rows where yearID is
equal to 2008, you could potentially use a logical condition within the brackets (though the source
doesn't provide the exact bracket notation example, it sets up a subset() example based on this
condition).
● The subset() Function: A more readable alternative to bracket notation is the subset()
function. While anything achievable with subset() can also be done with bracket notation,
subset() can make your code clearer by allowing you to directly use variable names from the
data frame in your selection criteria. The subset() function has the following arguments:
○ x: The object (like a data frame, vector, or matrix) from which you want to calculate a
subset.
○ subset: A logical expression that describes which rows you want to return. Only rows
where this expression evaluates to TRUE will be included in the subset.
○ select: An expression indicating which columns you want to return. You can specify
column names (as character strings or unquoted names) or use functions to help select
columns.
○ drop: This argument is passed to the [ operator and its default is FALSE.
> data(Batting)
> batting.w.names.2008 <- subset(batting.w.names, yearID == 2008)
> head(batting.w.names.2008)
...
> batting.w.names.2008.short <- subset(batting.w.names, yearID == 2008,
+   select = c("playerID","yearID","stint","teamID","G","AB","H"))
> head(batting.w.names.2008.short)
   playerID yearID stint teamID  G AB H
3 aardsda01   2008     1    BOS 47  0 0
...
○ The first example selects all columns for rows where yearID is 2008. The second
example selects only the specified columns ("playerID", "yearID", "stint",
"teamID", "G", "AB", "H") for the same subset of rows.
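Both styles can be sketched self-containedly on the built-in iris data frame (standing in for the baseball data):

```r
# Bracket notation: a logical row index, a comma, then the columns to keep.
tall <- iris[iris$Sepal.Length > 7, c("Species", "Sepal.Length")]

# The equivalent, more readable subset() call.
tall2 <- subset(iris, Sepal.Length > 7, select = c(Species, Sepal.Length))

nrow(tall)              # the same rows are selected either way
identical(tall, tall2)  # TRUE
```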
Random Sampling: Sometimes, you might want to take a random sample of your data. This can be
useful if you have a very large dataset and want to work with a more manageable subset for performance
reasons or statistical validity. The sample() function can be used to take a random sample of
observations.
> n_rows <- nrow(batting.w.names)                     # total number of rows
> sample_indices <- sample(1:n_rows, size = 100)      # take a random sample of 100 row indices
> batting.sample <- batting.w.names[sample_indices, ] # subset the data frame using these indices
> head(batting.sample)
● The sample() function here generates a random selection of row numbers, which are then used
to subset the data frame.
For more complex random sampling methods like stratified sampling, the sampling package is
mentioned.
Summarizing Functions
Summarizing functions in R allow you to aggregate and condense your data into a more concise and
understandable form. This is often necessary when you have very detailed data, but you need to see
overall trends or key statistics. "R in a Nutshell" describes several functions for this purpose:
tapply(): The tapply() function is a flexible way to apply a function to subsets of a vector, where
the subsets are defined by the levels of one or more factor variables. Its syntax is:
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
○ X: The vector you want to summarize.
○ INDEX: A list of factors (or objects that can be coerced to factors) of the same length as
X. Each unique combination of factor levels in INDEX defines a subset of X.
○ FUN: The function to be applied to each subset of X. If NULL, tapply() returns a table
of the subsets.
○ ...: Optional arguments to be passed to FUN.
○ simplify: A logical value indicating whether the result should be simplified to an
array if possible.
For example, to find the mean number of hits (H) for each team (teamID) in the batting.2008 data
(assuming you have this data frame):
> library(nutshell)
> data(batting.2008)
> # Mean hits (H) for each team, grouped by teamID
> mean_hits_by_team <- tapply(batting.2008$H, INDEX = batting.2008$teamID, FUN = mean)
> head(mean_hits_by_team)
● This applies the mean() function to the H column, grouped by the unique values in the teamID
column.
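The same pattern can be sketched self-contained, with hypothetical hits and teams in place of the Lahman data:

```r
# tapply: apply a function to subsets of a vector defined by a factor
hits  <- c(10, 20, 30, 40, 50, 60)
teams <- factor(c("A", "A", "B", "B", "C", "C"))
mean_hits <- tapply(hits, INDEX = teams, FUN = mean)
mean_hits
#  A  B  C
# 15 35 55
```

Each unique level of teams defines a subset of hits, and mean() is applied to each subset.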
aggregate(): The aggregate() function is another powerful tool for summarizing data, especially
within data frames. It can also be applied to time series. The syntax for data frames is:
aggregate(x, by, FUN, ...)
○ x: The data frame (or other data object) you want to aggregate.
○ by: A list of grouping elements, each as long as the rows of x. These typically are
columns (or combinations of columns) that define the groups for aggregation.
○ FUN: A scalar function to be used to compute the summary statistic for each group.
○ ...: Further arguments passed to FUN.
The example in "R in a Nutshell" shows how to use aggregate() to sum various batting statistics (AB,
H, BB, 2B, 3B, HR) for each team (teamID):
> aggregate(x=batting.2008[, c("AB", "H", "BB", "2B", "3B", "HR")],
+ by=list(batting.2008$teamID), FUN=sum)
Group.1 AB H BB 2B 3B HR
...
● Here, the by argument is a list containing batting.2008$teamID, which specifies that the
aggregation should be done for each unique team ID. The FUN=sum argument indicates that the
sum() function should be applied to the specified columns for each team.
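The same grouping can be sketched on a small hypothetical data frame (the column names mirror the batting example but the values are made up):

```r
# aggregate: compute a summary statistic per group for data-frame columns
stats <- data.frame(
  teamID = c("A", "A", "B", "B"),
  AB = c(100, 200, 300, 400),
  H  = c(30, 50, 80, 120)
)
totals <- aggregate(x = stats[, c("AB", "H")],
                    by = list(teamID = stats$teamID), FUN = sum)
totals
#   teamID  AB   H
# 1      A 300  80
# 2      B 700 200
```

Naming the list element (teamID =) gives the grouping column a readable name in the result instead of the default Group.1.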
● Counting Values: You can also summarize data by counting the occurrences of different values
or combinations of values using functions like table() and xtabs(). The table() function
uses cross-classifying factors to build a contingency table of the counts at each combination of
factor levels. For example, after binning a continuous variable using cut(), you can use
table() to see the frequency of observations in each bin.
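For instance, a minimal sketch with a hypothetical numeric vector:

```r
# Bin a continuous variable with cut(), then count observations per bin
values <- c(3, 7, 12, 18, 22, 27, 33, 38)
bins <- cut(values, breaks = c(0, 10, 20, 30, 40))  # four bins of width 10
table(bins)
# bins
#  (0,10] (10,20] (20,30] (30,40]
#       2       2       2       2
```

cut() returns a factor whose levels are the intervals, which table() then cross-tabulates into counts.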
summary(): The summary() function is a generic function that provides a quick and useful overview
of the data in a data frame or other R object. The output of summary() depends on the data type of each
column. For numeric variables, it typically shows the minimum, first quartile, median, mean, third
quartile, and maximum values. For factor variables, it shows the count of each level. For character
variables, it reports the length, class, and mode.
> summary(batting.2008$AB)
> summary(as.factor(batting.2008$teamID))
ARI ATL BAL BOS CHA CLE COL DET FLO HOU KCA LAA LAN MIA
33 34 33 35 33 33 35 34 32 33 32 34 33 31
MIL MIN NYA NYN OAK PHI PIT SDN SEA SFN SLN TBA TEX TOR
34 33 33 34 33 34 33 34 34 32 34 33 35 33
WAS
33
● str(): While not strictly a summarizing function in the sense of aggregation, the str()
function (for "structure") provides a concise overview of the structure of an R object, including its
data type, dimensions, and the first few values. This can be helpful in understanding the overall
makeup of your data.
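A quick sketch on a small hypothetical data frame:

```r
# str() gives a compact structural overview of any R object
df <- data.frame(id = 1:3, name = c("a", "b", "c"), score = c(1.5, 2.5, 3.5))
str(df)
# 'data.frame': 3 obs. of 3 variables:
#  $ id   : int  1 2 3
#  $ name : chr  "a" "b" "c"
#  $ score: num  1.5 2.5 3.5
```

One line per column shows its name, type, and first few values, which makes str() a convenient first check after importing data.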
These subsetting and summarizing techniques are fundamental for exploring and preparing your data for
more advanced analyses in R.
Data Cleaning
Data cleaning is a crucial step in preparing data for analysis. It involves identifying problems introduced
during data collection, processing, and storage, and modifying the data so that these problems
don’t interfere with analysis. Data cleaning doesn’t mean changing the meaning of the data.
"R in a Nutshell" emphasizes that even data from seemingly perfect sources like web servers or financial
databases can have issues. It estimates that a significant portion (around 80%) of the effort in a typical
data analysis project is spent on finding, cleaning, and preparing data.
The syllabus mentions "Data Cleaning" as a key step in Module 2 ("Reading and writing data"). It also
lists "Missing Data - NA, NULL" as a related concept, as handling missing values is often a part of data
cleaning. Course Outcome 2 includes a question to "Explain Data cleaning in R".
"R in a Nutshell" provides an example related to credit scores, where valid scores should be between 340
and 840, but the data contained values like 997, 998, and 999, which had special meanings like
"insufficient data". Similarly, hospital patient data might have issues with the same doctor seeing multiple
patients who share the same name, causing them to be merged into one incorrect record, or the same
patient seeing multiple doctors, creating duplicate records.
Therefore, data cleaning involves identifying and addressing such inconsistencies and errors to ensure the
data is suitable for meaningful analysis.
Data sources often contain duplicate values. Depending on how the data is used, duplicates can cause
problems, so checking for duplicates is a good practice if they aren’t supposed to be there. The
syllabus lists "Finding and removing Duplicates" as a specific data preparation task.
"R in a Nutshell" highlights that if stock tickers are fetched multiple times accidentally, it will result in
duplicate rows in the data. R provides useful functions for detecting duplicate values, such as the
duplicated() function. This function returns a logical vector showing which elements are
duplicates of values with lower indices. Applying duplicated() to a data frame will indicate which
rows are duplicates of earlier rows.
You can use the logical vector returned by duplicated() to remove duplicates by subsetting, for
example df[!duplicated(df), ]. Alternatively, you can use the unique() function to directly return
a vector, data frame, or array with duplicate elements or rows removed.
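A minimal sketch of both approaches, using a small hypothetical data frame of stock quotes:

```r
# Detect and drop duplicate rows in a data frame
df <- data.frame(ticker = c("IBM", "AAPL", "IBM", "GOOG"),
                 price  = c(120, 150, 120, 100))
duplicated(df)                        # FALSE FALSE TRUE FALSE: row 3 repeats row 1
df.deduped <- df[!duplicated(df), ]   # keep only the first occurrence of each row
df.unique  <- unique(df)              # equivalent shortcut
nrow(df.deduped)                      # 3
```

Both approaches keep the first occurrence of each row and drop the later repeats.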
Sorting
"R in a Nutshell" mentions sorting as one of the final operations that might be useful for analysis, along
with ranking functions. The primary function for sorting in R is sort(). The sort() function sorts
(or orders) a vector or factor (partially) into ascending (or descending) order.
> sort(x)
To sort in descending order, use the decreasing = TRUE argument, as in sort(x, decreasing = TRUE).
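A quick sketch of both directions on a hypothetical vector:

```r
# Sorting a numeric vector in ascending and descending order
x <- c(3, 1, 4, 1, 5)
sort(x)                     # 1 1 3 4 5
sort(x, decreasing = TRUE)  # 5 4 3 1 1
```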
The order() function is also relevant for sorting. However, instead of returning the sorted vector itself,
order() returns a permutation that rearranges its first argument into ascending or descending
order, breaking ties by further arguments. This is particularly useful for sorting data frames, as you can
use the indices returned by order() to rearrange the rows of the data frame based on the values in one
or more columns.
For instance, to sort a data frame df by the age column in ascending order, subset with
df[order(df$age), ]. You can also sort by multiple columns: df[order(df$age, -df$salary), ]
sorts by age (ascending) and then by salary (descending) within each age group.
The - sign before df$salary indicates descending order for that numeric column.
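A self-contained sketch, assuming a hypothetical df with age and salary columns:

```r
# order() returns row indices that sort a data frame by one or more columns
df <- data.frame(age    = c(30, 25, 30, 25),
                 salary = c(50, 60, 70, 40))
df[order(df$age), ]                           # sort rows by age, ascending
df.sorted <- df[order(df$age, -df$salary), ]  # age ascending, then salary descending
df.sorted$salary
# 60 40 70 50
```

Because order() returns indices rather than sorted values, the same permutation reorders every column of the data frame consistently.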
In summary, data cleaning involves addressing various data quality issues. Finding and removing
duplicates can be done using duplicated() and unique(). Sorting can be achieved using sort()
for vectors and factors, and order() for rearranging rows in data frames based on column values. These
are fundamental steps in preparing data for effective analysis in R.