Assignment For MCA 3rd Sem HPU R Programming

The document presents a data mining/programming assignment in RStudio. Students are asked to complete four programs on topics such as data analysis and cluster analysis on the Iris dataset using R. It also provides supplementary material on RStudio, the R language and its features, and libraries useful for official statistics and survey statistics.


SUBJECT: Data Mining / Programming with RStudio

ASSIGNMENT: To be submitted on 21-10-22.

Total marks: 10. Compulsory, assessment-based assignment.

Directions: Implement the programs in RStudio and paste the code and outputs into your respective files.

Question 1: What are the features of RStudio and the R language?

Question 2: What is the use of the Apriori algorithm in data mining? Explain its significance.

Program 1: Write a program to apply different operators in RStudio.

Link: https://www.datamentor.io/r-programming/operator/
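As a starting point, a minimal sketch of operator usage in R (the values 7 and 3 are purely illustrative):

# Arithmetic operators
x <- 7; y <- 3
x + y; x - y; x * y; x / y   # 10, 4, 21, 2.333...
x %% y    # modulus: 1
x %/% y   # integer division: 2
x ^ y     # exponentiation: 343

# Relational operators
x > y; x == y; x != y        # TRUE, FALSE, TRUE

# Logical operators (element-wise on vectors)
a <- c(TRUE, FALSE); b <- c(TRUE, TRUE)
a & b; a | b; !a

# Assignment operators
z <- 10   # leftward assignment
10 -> w   # rightward assignment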

Program 2: Write a program using R on the MNIST/Iris dataset for the purpose of data analysis.

Link: https://www.statology.org/iris-dataset-r/
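A possible starting point for Program 2, using the iris data frame that ships with base R:

data(iris)
head(iris)       # first six rows
str(iris)        # four numeric variables plus the Species factor
summary(iris)    # per-column summary statistics

# Simple exploratory graphics
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
boxplot(Sepal.Length ~ Species, data = iris)
pairs(iris[, 1:4], col = iris$Species)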

Program 3: Write a program in R implementing cluster analysis on the Iris dataset.

Link: https://techvidvan.com/tutorials/cluster-analysis-in-r/
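One possible sketch for Program 3 applies k-means to the four numeric columns, holding Species out as reference labels (centers = 3 reflects the three known species; other methods from the linked tutorial work equally well):

data(iris)
set.seed(42)                        # k-means uses random starting centers
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
table(km$cluster, iris$Species)     # cross-tabulate clusters vs. species
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster, pch = 19,
     xlab = "Petal length", ylab = "Petal width")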

Program 4: Write a program to implement time series forecasting using R in RStudio; use any one model.

Link: https://www.pluralsight.com/guides/time-series-forecasting-using-r
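A self-contained sketch for Program 4 fits an ARIMA model with the forecast package to the built-in AirPassengers series (the linked guide works with different data; AirPassengers is used here only so the example runs as-is):

# install.packages("forecast")     # once, if not yet installed
library(forecast)
fit <- auto.arima(AirPassengers)   # selects the ARIMA orders automatically
fc <- forecast(fit, h = 24)        # forecast the next 24 months
plot(fc)                           # forecast with prediction intervals
accuracy(fit)                      # in-sample accuracy measures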

SUPPLEMENTARY MATERIAL
Data mining is a field of analytics concerned with discovering patterns in large
data sets. Many data-mining specialists are turning to R for its cutting-edge analytical capabilities.

Introduction to R
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to
the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent
Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There
are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series
analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often
the vehicle of choice for research in statistical methodology, and R provides an Open-Source route to participation
in that activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including
mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor
design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in
source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including
FreeBSD and Linux), Windows, and macOS.
R is a great place to start your data science journey because it is an environment designed from the ground up to
support data science. R is not just a programming language, but it is also an interactive environment for doing
data science.

RStudio is an integrated development environment, or IDE, for R programming. Download and install it from
http://www.rstudio.com/download.
If you’re using Windows, launch R from the Start menu. On a Mac, double-click the R icon in the Applications
folder. For Linux, type R at the command prompt of a terminal window.
R is a case-sensitive, interpreted language. You can enter commands one at a time at the command prompt (>) or
run a set of commands from a source file. There are a wide variety of data types, including vectors, matrices, data
frames (similar to datasets), and lists (collections of objects).

All objects are kept in memory during an interactive session. Basic functions are
available by default. Other functions are contained in packages that can be
attached to a current session as needed. Statements consist of functions and assignments. R uses the symbol <- for
assignments, rather than the typical = sign.

Comments are preceded by the # symbol. Any text appearing after the # is ignored by the R interpreter.
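For example, a short session illustrating assignment, the basic data types, and comments:

x <- c(2, 4, 6)                # assignment with <- ; a numeric vector
m <- matrix(1:6, nrow = 2)     # a 2 x 3 matrix
df <- data.frame(id = 1:3, value = x)  # a data frame
lst <- list(vec = x, mat = m)  # a list holding other objects
mean(x)                        # calling a basic function returns 4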

The R environment
R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. It includes
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy, and
• a well-developed, simple, and effective programming language which includes conditionals, loops, user-
defined recursive functions, and input and output facilities.
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an
incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis
software.
R, like S, is designed around a true computer language, and it allows users to add additional functionality by
defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to
follow the algorithmic choices made. For computationally-intensive tasks, C, C++, and Fortran code can be linked
and called at run time. Advanced users can write C code to manipulate R objects directly.
Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical
techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied
with the R distribution and many more are available through the CRAN family of Internet sites covering a very
wide range of modern statistics.
R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both online
in several formats and in hardcopy.
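For example, a package is installed from CRAN once and then attached in each session where it is needed (ggplot2 serves only as an illustration):

install.packages("ggplot2")   # download and install from a CRAN mirror
library(ggplot2)              # attach the package to the current session
help(package = "ggplot2")     # browse the package documentation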

HOW TO USE IMPORTANT LIBRARIES

Official Statistics & Survey Statistics

Installation: The packages from this task view can be installed automatically using the ctv package. For
example, ctv::install.views("OfficialStatistics", coreOnly = TRUE) installs all the core packages
or ctv::update.views("OfficialStatistics") installs all packages that are not yet installed and up-to-
date. See the CRAN Task View Initiative for more details.
This CRAN Task View contains a list of packages with methods typically used in official statistics and survey
statistics. Many packages provide functions for more than one of the topics listed below. Therefore, this list is not
a strict categorization and packages may be listed more than once.

The task view is split into several parts

• First part: “Producing Official Statistics”. This first part is targeted at people working at national
statistical institutes, national banks, international organizations, etc. who are involved in the production
of official statistics and using methods from survey statistics. It is loosely aligned to the “Generic
Statistical Business Process Model”.
• Second part: “Access to Official Statistics”. This second part’s target audience is everyone interested in
using official statistics results directly from within R.
• Third part: “Related Methods” shows packages that are important in official and survey statistics but do
not directly fit into the production of official statistics. It is complemented by a subsection
on “Miscellaneous”: a collection of packages that are loosely linked to official statistics or that provide
limited complements to official statistics and survey methods.

First Part: Production of Official Statistics

1 Preparations/ Management/ Planning (questionnaire design, etc.)

• questionr package contains a set of functions to make the processing and analysis of surveys easier. It
provides interactive shiny apps and addins for data recoding, contingency tables, dataset metadata
handling, and several convenience functions.
• surveydata makes it easy to keep track of metadata from surveys, and to easily extract columns with
specific questions.
• blaise implements functions for reading and writing files in the Blaise Format (Statistics Netherlands).

2 Sampling

• sampling includes many different algorithms (Brewer, Midzuno, pps, systematic, Sampford, balanced
(cluster or stratified) sampling via the cube method, etc.) for drawing survey samples and calibrating the
design weights.
• pps contains functions to select samples using pps sampling. Stratified simple random sampling is also
possible, as is computing joint inclusion probabilities for Sampford’s method of pps sampling.
• BalancedSampling provides functions to select balanced and spatially balanced probability samples in
multi-dimensional spaces with any prescribed inclusion probabilities. It also includes the local pivotal
method, the cube and local cube methods, and a few more methods.
• PracTools contains functions for sample size calculation for survey samples using stratified or clustered
one-, two-, and three-stage sample designs as well as functions to compute variance components for
multistage designs and sample sizes in two-phase designs.
• surveyplanning includes tools for sample survey planning, including sample size calculation, estimation
of expected precision for the estimates of totals, and calculation of optimal sample size allocation.
• stratification allows univariate stratification of survey populations with a generalisation of the Lavallee-
Hidiroglou method.
• SamplingStrata offers an approach for choosing the best stratification of a sampling frame in a
multivariate and multidomain setting, where the sample sizes in each stratum are determined in order to
satisfy accuracy constraints on target estimates. To evaluate the distribution of target variables in
different strata, information from the sampling frame, or data from previous rounds of the same survey,
may be used.
• R2BEAT provides functions for multivariate, domain-specific optimal sample size allocation for one-
and two-stage stratified sampling designs (i.e., generalization of the allocation methods of Neyman and
Tschuprow to the case of several variables).

3 Data Collection (incl. record linkage)


3.1 Data Integration (Statistical Matching and Record Linkage)

• StatMatch provides functions to perform statistical matching between two data sources sharing a number
of common variables. It creates a synthetic data set after matching of two data sources via a likelihood
approach or via hot-deck.
• MatchIt allows nearest neighbor matching, exact matching, optimal matching and full matching amongst
other matching methods. If two data sets have to be matched, the data must come as one data frame
including a factor variable which includes information about the membership of each observation.
• MatchThem provides tools of matching and weighting multiply imputed datasets to control for effects of
confounders. Multiple imputed data files from mice and amelia can be used directly.
• stringdist can calculate various string distances based on edits (Damerau-Levenshtein, Hamming,
Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics
(Jaro, Jaro-Winkler).
• reclin is a record linkage toolkit to assist in performing probabilistic record linkage and deduplication.
• XBRL allows the extraction of business financial information from XBRL Documents.
• RecordLinkage implements the Fellegi-Sunter method for record linkage.
• fastLink implements a Fellegi-Sunter probabilistic record linkage model that allows for missing data and
the inclusion of auxiliary information. Documentation can be found on
http://imai.princeton.edu/research/linkage.html.
• fuzzyjoin provides functions for joining tables based on exact or similar matches. It allows for matching
records based on inaccurate keys.
• PPRL implements privacy-preserving record linkage, which is especially useful when personal IDs cannot be
used to link two data sets. This approach protects the identity of persons.

3.2 Web Scraping

Web scraping is nowadays used more frequently in the production of official statistics. For example, in price
statistics the collection of product prices, formerly done by hand over the web or by in-person visits to stores,
is being replaced by scraping specific homepages. Tools for this process step are not listed here, but a detailed
overview can be found in the CRAN task view on WebTechnologies.

4 Data Processing

4.1 Weighting and Calibration

• survey allows for post-stratification, generalized raking/calibration, GREG estimation and trimming of
weights.
• sampling provides the function calib() to calibrate for nonresponse (with response homogeneity groups)
for stratified samples.
• laeken provides the function calibWeights() for calibration, which is possibly faster (depending on the
example) than calib() from sampling.
• icarus focuses on calibration and re-weighting in survey sampling and was designed to provide a familiar
setting in R for users of the SAS macro Calmar developed by INSEE.
• CalibrateSSB includes a function to calculate weights and estimates for panel data with non-response.
• Frames2 allows point and interval estimation in dual frame surveys, where two probability samples (one
from each frame) are drawn and the information collected is suitably combined to get estimators of the
parameter of interest.
• surveysd (archived) provides calibration by iterative proportional fitting, a calibrated bootstrap optimized
for complex surveys, and error estimation based on it.
• inca performs calibration weighting with integer weights.

4.2 Editing (including outlier detection)

• validate includes rule management and data validation; package validatetools checks and
simplifies sets of validation rules.
• errorlocate includes error localisation based on the principle of Fellegi and Holt. It supports categorical
and/or numeric data and linear equalities, inequalities and conditional rules. The package includes a
configurable backend for MIP-based error localization.
• editrules converts readable linear (in)equalities into matrix form.
• deducorrect depends on package editrules and applies deductive correction of simple rounding, typing
and sign errors based on balanced edits. Values are changed so that the given balanced edits are fulfilled.
To determine which values are changed, the Levenshtein metric is applied.
• deductive allows for data correction and imputation using deductive methods.
• rspa implements functions to minimally adjust numerical records so they obey (in)equation restrictions.
• surveyoutliers winsorizes values of a variable of interest.
• univOutl includes various methods for detecting univariate outliers, e.g. the Hidiroglou-Berthelot
method.
• extremevalues is designed to detect univariate outliers based on modeling the bulk distribution.

4.3 Imputation

A general overview of imputation methods can be found in the CRAN Task View on Missing Data, MissingData.
However, most of the methods presented there do not take into account the specificities of surveys from complex
designs, i.e., they are not specifically designed for official statistics and surveys. For example, the criteria
for applying a method often depend on the scale of the data, which in official statistics are usually a mixture of
continuous, semi-continuous, binary, categorical, and count variables. In addition, measurement error can greatly
affect non-robust imputation methods.

Commonly used packages within statistical agencies are VIM and simputation, which implement fast k-nearest
neighbor (knn) algorithms for general distances as well as (robust) EM-based multiple imputation algorithms; a minimal sketch follows.
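A minimal sketch, assuming the sleep data bundled with VIM (k = 5 is the default of kNN(), written out here for clarity):

library(VIM)
data(sleep, package = "VIM")   # mammal sleep data with missing values
imputed <- kNN(sleep, k = 5)   # imputes all variables; adds *_imp indicator columns
summary(imputed)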

4.4 Seasonal Adjustment

Seasonal adjustment is an important step in producing official statistics, and a very limited set of methodologies
is used here frequently, e.g. X13-ARIMA-SEATS developed by the US Census Bureau. R packages for this can be found
in the seasonal adjustment section of the CRAN Task View TimeSeries.

5 Analysis of Survey Data

5.1 Estimation and Variance Estimation

• survey works with survey samples. It allows the user to specify a complex survey design (stratified sampling
design, cluster sampling, multi-stage sampling and pps sampling with or without replacement). Once the
given survey design is specified within the function svydesign(), point and variance estimates can be
computed. The resulting object can be used to estimate (Horvitz-Thompson) totals, means, ratios and
quantiles for domains or the whole survey sample, and to apply regression models. Variance estimation
for means, totals and ratios can be done either by Taylor linearization or resampling (BRR, jackknife,
bootstrap or user-defined); a minimal sketch appears after this list.
• robsurvey provides functions for the computation of robust (outlier-resistant) estimators of finite
population characteristics (means, totals, ratios, regression, etc.) using weight reduction, trimming,
winsorization and M-estimation. The package complements survey.
• surveysd (archived) offers calibration, bootstrap and error estimation for complex surveys (incl.
rotational designs).
• gustave provides a toolkit for analytical variance estimation in survey sampling.
• lavaan.survey provides a wrapper function for packages survey and lavaan. It can be used for fitting
structural equation models (SEM) on samples from complex designs (clustering, stratification, sampling
weights, and finite population corrections). Using the design object functionality from package survey,
lavaan objects are re-fit (corrected) with the lavaan.survey() function. This function also accommodates
replicate weights and multiply imputed datasets.
• vardpoor allows calculation of the linearisation of several nonlinear population statistics, variance estimation
of sample surveys by the ultimate cluster method, variance estimation for longitudinal and cross-sectional
measures, and measures of change for any-stage cluster sampling designs.
• rpms fits a linear model to survey data in each node obtained by recursively partitioning the data. The
algorithm accounts for one-stage of stratification and clustering as well as unequal probability of
selection.
• collapse implements advanced and computationally fast methods for grouped and weighted statistics and
multi-type data aggregation (e.g. mean, variance, statistical mode etc.), fast (grouped, weighted)
transformations of time series and panel data (e.g. scaling, centering, differences, growth rates), and fast
(grouped, weighted, panel-decomposed) summary statistics for complex multilevel / panel data.
• srvyr is inspired by the syntax style of the dplyr package (i.e., piping, verbs
like group_by and summarize). It offers summary statistics for design objects of the survey package.
• weights provides a variety of functions for producing simple weighted statistics, such as weighted
Pearson’s correlations, partial correlations, Chi-Squared statistics, histograms and t-tests.
• svrep provides tools for creating, updating and analyzing survey replicate weights as an extension
of survey. Non-response adjustments to both full-sample and replicate weights can be applied.
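As referenced in the survey bullet above, a minimal sketch using the stratified apistrat sample bundled with the package:

library(survey)
data(api)                      # loads apistrat among other data sets
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)
svymean(~api00, dstrat)        # mean with design-based standard error
svytotal(~enroll, dstrat)      # estimated population total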

5.2 Visualization

• VIM is designed to visualize missing values using suitable plot methods. It can be used to analyse the
structure of missing values in microdata using univariate, bivariate, multiple and multivariate plots, where
the information on missing values from specified variables is highlighted in selected variables. It also
comes with a graphical user interface.
• longCatEDA extends the matrixplot from package VIM to check for monotone missingness in
longitudinal data.
• treemap provides treemaps. A treemap is a space-filling visualization of aggregates of data with
hierarchical structures. Colors can be used to highlight differences between comparable aggregates.
• tmap offers a layer-based way to make thematic maps, like choropleths and bubble maps.
• rworldmap outlines how to map country-referenced data and supports users in visualizing their own data.
Examples are given, e.g., maps for the World Bank and UN. It also provides new ways to visualize maps.

6 Statistical Disclosure Control

Data from statistical agencies and other institutions are, in their raw form, mostly confidential, and data providers
have to ensure confidentiality both by modifying the original data so that no statistical units can be re-identified
and by guaranteeing a minimum amount of information loss.

Unit-level data (microdata)

• sdcMicro can be used to anonymize data, i.e. to create anonymized files for public and scientific use. It
implements a wide range of methods for anonymizing categorical and continuous (key) variables. The
package also contains a graphical user interface, which is available by calling the function sdcGUI.
• simPop uses linear and robust regression methods, random forests (and many more methods) to simulate
synthetic data from given complex data. It is also suitable for producing synthetic data when the data have
hierarchical and cluster information (such as persons in households), as well as when the data have been
collected with a complex sampling design. It makes use of parallel computing internally.
• synthpop uses regression tree methods to simulate synthetic data from given data. It is suitable for
producing synthetic data when the data have no hierarchical or cluster information (such as households),
and when the data were not collected with a complex sampling design.

Aggregated information (tabular data)

• sdcTable can be used to provide confidential (hierarchical) tabular data. It includes the HITAS and the
HYPERCUBE techniques and uses linear programming packages (Rglpk and lpSolveAPI) for solving (a
large number of) linear programs.
• sdcSpatial can be used to smooth and/or suppress raster cells in a map. This is useful when plotting raster-
based counts on a map.
• sdcHierarchies provides methods to generate, modify, import and convert nested hierarchies that are
often used when defining inputs for statistical disclosure control methods.
• SmallCountRounding can be used to protect frequency tables by rounding necessary inner cells so that
cross-classifications to be published are safe.

Remote access

• DSI is an interface to DataShield. DataShield is an infrastructure and series of R packages that enables
the remote and non-disclosive analysis of sensitive research data.

Second Part: Access to Official Statistics

Access to data from international organizations and multiple organizations

• OECD searches and extracts data from the OECD.


• Rilostat contains tools to download data from the International Labour Organization (ILO) database,
together with search and manipulation utilities. It can also import ILOSTAT data that are available in its
database in SDMX format.
• eurostat provides search for and access to data from Eurostat, the statistical agency for the European
Union.
• ipumsr provides an easy way to import census, survey and geographic data provided by IPUMS.
• FAOSTAT can be used to download data from the FAOSTAT database of the Food and Agricultural
Organization (FAO) of the United Nations.
• pxweb provides a generic interface for the PX-Web/PC-Axis API used by many National Statistical
Agencies.
• PxWebApiData provides easy API access to e.g. Statistics Norway, Statistics Sweden and Statistics
Finland.
• rdhs interacts with The Demographic and Health Surveys (DHS) Program datasets.
• prevR implements functions (see import.dhs()) to import data from the Demographic Health Survey.
• rsdmx provides easy access to data from statistical organisations that support SDMX web services. The
package contains a list of SDMX access points of various national and international statistical institutes.
• readsdmx implements functions to read SDMX into data frames from local SDMX-ML file or web-
service. By OECD.
• regions offers tools to process regional statistics focusing on European data.
• statcodelists makes the internationally standardized SDMX code lists available for the R user.
• rdbnomics provides access to the DB.nomics database of macroeconomic data from 38 official providers
such as INSEE, Eurostat, World Bank, etc.
• iotables makes input-output tables tidy and allows for economic and environmental impact analysis by
formatting the data received from the Eurostat data warehouse into appropriate, validated matrix forms.

Access to data from national organizations

• tidyqwi provides an API for accessing the United States Census Bureau’s Quarterly Workforce Indicator.
• tidyBdE provides access to official statistics provided by the Spanish banking authority Banco de
España.
• cancensus provides access to Statistics Canada’s Census data with the option to retrieve all data as spatial
data.
• sorvi provides access to Finnish open government data.
• insee searches and extracts data from the Insee’s BDM database.
• acs downloads, manipulates, and presents the American Community Survey and decennial data from the
US Census.
• censusapi implements a wrapper for the U.S. Census Bureau APIs that returns data frames of Census
data and meta data.
• censusGeography converts specific United States Census geographic code for city, state (FIP and ICP),
region, and birthplace.
• idbr implements functions to make requests to the US Census Bureau’s International Data Base API.
• tidycensus provides an integrated R interface to the decennial US Census and American Community
Survey APIs and the US Census Bureau’s geographic boundary files.
• inegiR provides access to data published by INEGI, Mexico’s official statistics agency.
• cbsodataR provides access to Statistics Netherlands’ (CBS) open data API.
• EdSurvey includes analysis of NCES Education Survey and Assessment Data.
• nomisr gives access to Nomis UK Labour Market Data including Census and Labour Force Survey.
• readabs implements functions to download and tidy time series data from the Australian Bureau of
Statistics.
• BIFIEsurvey includes tools for survey statistics in educational assessment including data with replication
weights (e.g. from bootstrap).
• CANSIM2R provides functions to extract CANSIM (Statistics Canada) tables and transform them into
readily usable data.
• statcanR provides an R connection to Statistics Canada’s Web Data Service. Open economic data
(formerly CANSIM tables) are accessible as a data frame in the R environment.
• cdlTools provides functions to download USDA National Agricultural Statistics Service (NASS)
cropscape data for a specified state.

Third Part: Related Methods

Small Area Estimation

• sae provides functions for small area estimation (basic area- and unit-level model, Fay-Herriot model
with spatial/ temporal correlations), for example, direct estimators, the empirical best predictor and
composite estimators.
• rsae provides functions to estimate the parameters of the basic unit-level small area estimation (SAE)
model (aka nested error regression model) by means of maximum likelihood (ML) or robust M-
estimation. On the basis of the estimated parameters, robust predictions of the area-specific means are
computed (incl. MSE estimates; parametric bootstrap).
• emdi provides functions that support estimating, assessing and mapping regional disaggregated
indicators. So far, estimation methods comprise direct estimation, the model-based unit-level approach
Empirical Best Prediction, the area-level model and various extensions of it, as well as their precision
estimates. The assessment of the used model is supported by a summary and diagnostic plots. For a
suitable presentation of estimates, map plots can be easily created and exported.
• hbsae provides functions to compute small area estimates based on a basic area or unit-level model. The
model is fit using restricted maximum likelihood, or in a hierarchical Bayesian way. Auxiliary
information can be either counts resulting from categorical variables or means from continuous
population information.
• BayesSAE provides Bayesian estimation methods that range from the basic Fay-Herriot model to its
improvement such as You-Chapman models, unmatched models, spatial models and so on.
• SAEval provides diagnostics and graphic tools for the evaluation of small area estimators.
• mind provides multivariate prediction and inference (mean square error) for domains using mixed linear
models as proposed in Datta, Day, and Basawa (1999, J. Stat. Plan. Inference).
• JoSAE provides point and variance estimation for the generalized regression (GREG) and unit-level
empirical best linear unbiased prediction (EBLUP) estimators at the domain level. It basically
provides wrapper functions to the nlme package, which is used to fit the basic random effects models.

Microsimulation

• simPop allows the production of synthetic population data, sometimes needed as a starting population for
microsimulations.
• sms provides facilities to simulate micro-data from given area-based macro-data. Simulated annealing is
used to best satisfy the available description of an area. For computational issues, the calculations can be
run in parallel mode.
• saeSim implements tools for the simulation of data in the context of small area estimation.
• SimSurvey simulates age-structured spatio-temporal populations given built-in or user-defined sampling
protocols.

Indices, Indicators, Tables and Visualization of Indicators


• laeken provides functions to estimate popular risk-of-poverty and inequality indicators (at-risk-of-
poverty rate, quintile share ratio, relative median risk-of-poverty gap, Gini coefficient). In addition,
standard and robust methods for tail modeling of Pareto distributions are provided for semi-parametric
estimation of indicators from continuous univariate distributions such as income variables.
• convey estimates variances on indicators of income concentration and poverty using familiar linearized
and replication-based designs created by the survey package such as the Gini coefficient, Atkinson index,
at-risk-of-poverty threshold, and more than a dozen others.
• ineq computes various inequality measures (Gini, Theil, entropy, among others), concentration measures
(Herfindahl, Rosenbluth), and poverty measures (Watts, Sen, SST, and Foster). It also computes and
draws empirical and theoretical Lorenz curves as well as Pen’s parade. It is not designed to deal with
sampling weights directly (these could only be emulated via rep(x, weights)).
• DHS.rates estimates key indicators (especially fertility rates) and their variances for the Demographic
and Health Survey (DHS) data.
• micEconIndex implements functions to compute price indices (of type Paasche, Fisher and Laspeyres);
see priceIndex(). For estimating quantities (of goods, for example) see function quantityIndex().

Miscellaneous

• samplingbook includes sampling procedures from the book ‘Stichproben. Methoden und praktische
Umsetzung mit R’ by Goeran Kauermann and Helmut Kuechenhoff (2010).
• SDaA is designed to reproduce results from Lohr, S. (1999) ‘Sampling: Design and Analysis, Duxbury’
and includes the data sets from this book.
• samplingVarEst implements jackknife methods for variance estimation in unequal-probability one- or
two-stage designs.
• memisc includes tools for the management of survey data, graphics and simulation.
• anesrake provides a comprehensive system for selecting variables and weighting data to match the
specifications of the American National Election Studies.
• spsurvey includes facilities for spatial survey design and analysis for equal and unequal probability
(stratified) sampling.
• FFD provides functions to calculate optimal sample sizes of a population of animals living in herds for
surveys to substantiate freedom from disease. The criteria for estimating the sample sizes take the herd-
level clustering of diseases as well as imperfect diagnostic tests into account and select the samples based
on a two-stage design. Inclusion probabilities are not considered in the estimation. The package provides
a graphical user interface as well.
• mipfp provides multidimensional iterative proportional fitting to calibrate n-dimensional arrays given
target marginal tables.
• MBHdesign provides spatially balanced designs from a set of (contiguous) potential sampling locations
in a study region.
• quantification provides different functions for quantifying qualitative survey data. It supports the
Carlson-Parkin method, the regression approach, the balance approach and the conditional expectations
method.
• surveybootstrap (archived) includes tools for using different kinds of bootstrap for estimating sampling
variation using complex survey data.
• RRreg (archived) implements univariate and multivariate analysis (correlation, linear, and logistic
regression) for several variants of the randomized response technique, a survey method for eliminating
response biases due to social desirability.
• RRTCS includes randomized response techniques for complex surveys.
• panelaggregation aggregates business tendency survey data (and other qualitative surveys) to time series
at various aggregation levels.
• rtrim implements functions to study trends and indices for monitoring data. It provides tools for
estimating animal/plant populations based on site counts, including occurrence of missing data.
• rjstat reads and writes data sets in the JSON-stat format.
• diffpriv implements the perturbation of statistics with differential privacy.
• easySdcTable provides a graphical interface to a small selection of functionality of package sdcTable.
• MicSim includes methods for microsimulations. Given an initial population, mortality rates, divorce rates,
marriage rates, education changes, etc., together with their transition matrices, can be defined and included
in the simulation of future states of the population. The package does not contain compiled code, but
functionality to run the microsimulation in parallel is provided.
CRAN packages

Core: errorlocate, sae, sampling, SamplingStrata, sdcMicro, sdcTable, simPop, survey, validate, validatetools, VIM.

Regular: acs, anesrake, BalancedSampling, BayesSAE, BIFIEsurvey, blaise, CalibrateSSB, cancensus, CANSIM2R, cbsodataR, cdlTools, censusapi, censusGeography, collapse, convey, deducorrect, deductive, DHS.rates, diffpriv, DSI, easySdcTable, editrules, EdSurvey, emdi, eurostat, extremevalues, FAOSTAT, fastLink, FFD, Frames2, fuzzyjoin, gustave, hbsae, icarus, idbr, inca, inegiR, ineq, insee, iotables, ipumsr, JoSAE, laeken, lavaan, lavaan.survey, longCatEDA, MatchIt, MatchThem, MBHdesign, memisc, micEconIndex, MicSim, mind, mipfp, nlme, nomisr, OECD, panelaggregation, PPRL, pps, PracTools, prevR, pxweb, PxWebApiData, quantification, questionr, R2BEAT, rdbnomics, rdhs, readabs, readsdmx, reclin, RecordLinkage, regions, Rilostat, rjstat, robsurvey, rpms, RRTCS, rsae, rsdmx, rspa, rtrim, rworldmap, saeSim, SAEval, samplingbook, samplingVarEst, SDaA, sdcHierarchies, sdcSpatial, simputation, SimSurvey, SmallCountRounding, sms, sorvi, spsurvey, srvyr, statcanR, statcodelists, StatMatch, stratification, stringdist, surveydata, surveyoutliers, surveyplanning, svrep, synthpop, tidyBdE, tidycensus, tidyqwi, tmap, treemap, univOutl, vardpoor, weights, XBRL.

Clustering:
Installation: The packages from this task view can be installed automatically using the ctv package. For
example, ctv::install.views("Cluster", coreOnly = TRUE) installs all the core packages
or ctv::update.views("Cluster") installs all packages that are not yet installed and up-to-date. See
the CRAN Task View Initiative for more details.

This CRAN Task View contains a list of packages that can be used for finding groups in data and modeling
unobserved cross-sectional heterogeneity. Many packages provide functionality for more than one of the topics
listed below; the section headings are mainly meant as quick starting points rather than as an ultimate
categorization. Except for packages stats and cluster (which essentially ship with base R and hence are part of
every R installation), each package is listed only once.

Most of the packages listed in this view, but not all, are distributed under the GPL. Please have a look at the
DESCRIPTION file of each package to check under which license it is distributed.

Hierarchical Clustering:

• Functions hclust() from package stats and agnes() from package cluster are the primary functions for
agglomerative hierarchical clustering; function diana() can be used for divisive hierarchical clustering (a
minimal sketch appears after this list). Faster alternatives to hclust() are provided by the packages fastcluster and flashClust.
• Function dendrogram() from package stats and associated methods can be used for improved
visualization of cluster dendrograms.
• Package dendextend provides functions for easy visualization (coloring labels and branches, etc.),
manipulation (rotating, pruning, etc.), and comparison of dendrograms (tanglegrams with heuristics for
optimal branch rotations, and tree correlation measures with bootstrap and permutation tests for
significance).
• Package dynamicTreeCut contains methods for the detection of clusters in hierarchical clustering
dendrograms.
• Package genieclust implements a fast hierarchical clustering algorithm with a linkage criterion which is
a variant of the single linkage method combining it with the Gini inequality measure to robustify the
linkage method while retaining computational efficiency to allow for the use of larger data sets.
• Package idendr0 allows us to interactively explore hierarchical clustering dendrograms and the clustered
data. The data can be visualized (and interacted with) in a built-in heat map, but also in GGobi dynamic
interactive graphics (provided by rggobi), or base R plots.
• Package mdendro provides an alternative implementation of agglomerative hierarchical clustering. The
package natively handles similarity matrices, calculates variable-group dendrograms, which solve the
non-uniqueness problem that arises when there are ties in the data, and calculates five descriptors for the
final dendrogram: cophenetic correlation coefficient, space distortion ratio, agglomerative coefficient,
chaining coefficient, and tree balance.
• Package protoclust implements a form of hierarchical clustering that associates a prototypical element
with each interior node of the dendrogram. Using the package’s plot() function, one can produce
dendrograms that are prototype-labeled and are therefore easier to interpret.
• Package pvclust assesses the uncertainty in hierarchical cluster analysis. It provides approximately
unbiased p-values as well as bootstrap p-values.
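As referenced in the first bullet, a minimal hierarchical-clustering sketch on the iris measurements (Ward linkage is one common choice, not the only one):

d <- dist(scale(iris[, 1:4]))        # Euclidean distances on scaled data
hc <- hclust(d, method = "ward.D2")  # agglomerative clustering, Ward linkage
plot(hc, labels = FALSE)             # dendrogram
groups <- cutree(hc, k = 3)          # cut the tree into three clusters
table(groups, iris$Species)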

Partitioning Clustering:

• Function kmeans() from package stats provides several algorithms for computing partitions with respect to
Euclidean distance.
• Function pam() from package cluster implements partitioning around medoids and can work with
arbitrary distances (a minimal sketch appears after this list). Function clara() is a wrapper to pam() for larger data sets. Silhouette plots and
spanning ellipses can be used for visualization.
• Package apcluster implements Frey’s and Dueck’s Affinity Propagation clustering. The algorithms in the
package are analogous to the Matlab code published by Frey and Dueck.
• Package ClusterR implements k-means, mini-batch-kmeans, k-medoids, affinity propagation clustering,
and Gaussian mixture models with the option to plot, validate, predict (new data) and estimate the optimal
number of clusters. The package takes advantage of RcppArmadillo to speed up the computationally
intensive parts of the functions.
• Package clusterSim allows searching for the optimal clustering procedure for a given dataset.
• Package clustMixType implements Huang’s k-prototypes extension of k-means for mixed-type data.
• Package evclust implements various clustering algorithms that produce a credal partition, i.e., a set of
Dempster-Shafer mass functions representing the membership of objects to clusters.
• Package flexclust provides k-centroid cluster algorithms for arbitrary distance measures, hard
competitive learning, neural gas, and QT clustering. Neighborhood graphs and image plots of partitions
are available for visualization. Some of this functionality is also provided by package cclust.
• Package kernlab provides a weighted kernel version of the k-means algorithm by kkmeans and spectral
clustering by specc.
• Package kml provides k-means clustering specifically for longitudinal (joint) data.
• Package skmeans allows spherical k-means clustering, i.e. k-means clustering with cosine similarity. It
features several methods, including a genetic and a simple fixed-point algorithm and an interface to the
CLUTO vcluster program for clustering high-dimensional datasets.
• Package Spectrum implements a self-tuning spectral clustering method for single or multi-view data and
uses either the eigengap or multimodality gap heuristics to determine the number of clusters. The method
is sufficiently flexible to cluster a wide range of Gaussian and non-Gaussian structures with automatic
selection of K.
• Package tclust allows for trimmed k-means clustering. In addition, using this package other covariance
structures can also be specified for the clusters.
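As referenced in the pam() bullet above, a minimal sketch on the iris measurements:

library(cluster)
pm <- pam(scale(iris[, 1:4]), k = 3)  # partitioning around three medoids
table(pm$clustering, iris$Species)    # compare clusters with species
plot(pm)                              # clusplot and silhouette plot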

Model-Based Clustering:

• ML estimation:
• For semi- or partially supervised problems, where for a part of the observations labels are given with
certainty or with some probability, package bgmm provides belief-based and soft-label mixture
modeling for mixtures of Gaussians with the EM algorithm.
• Package EMCluster provides EM algorithms and several efficient initialization methods for model-
based clustering of finite mixtures of Gaussian distributions with unstructured dispersion in unsupervised
as well as semi-supervised learning situations.
• Packages funHDDC and funFEM implement model-based functional data analysis.
The funFEM package implements the funFEM algorithm, which allows the clustering of time series or,
more generally, functional data. It is based on a discriminative functional mixture model which
allows the clustering of the data in a unique and discriminative functional subspace. This model
has the advantage of being parsimonious and can therefore handle long time series.
The funHDDC package implements the funHDDC algorithm, which allows the clustering of
functional data within group-specific functional subspaces. The funHDDC algorithm is based on a
functional mixture model which models and clusters the data into group-specific functional
subspaces. The approach allows meaningful interpretations afterwards by looking at the group-
specific functional curves.
• Package GLDEX fits mixtures of generalized lambda distributions and for grouped conditional data
package mixdist can be used.
• Package GMCM fits Gaussian mixture copula models for unsupervised clustering and meta-
analysis.
• Package HDclassif provides function hddc to fit the Gaussian mixture model to high-dimensional
data where it is assumed that the data lives in a lower dimension than the original space.
• Package teigen allows fitting multivariate t-distribution mixture models (with eigen-decomposed
covariance structure) from a clustering or classification point of view.
• Package mclust fits mixtures of Gaussians using the EM algorithm (a minimal sketch appears at the end
of this section). It allows fine control of the volume and shape of covariance matrices and agglomerative
hierarchical clustering based on maximum likelihood. It provides comprehensive strategies using
hierarchical clustering, EM, and the Bayesian Information Criterion (BIC) for clustering, density
estimation, and discriminant analysis. Package Rmixmod provides tools for fitting mixture models of
multivariate Gaussian or multinomial components to a given data set, with either a clustering, a density
estimation, or a discriminant analysis point of view. Package mclust as well as
packages mixture and Rmixmod provide all 14 possible variance-covariance structures based on the
eigenvalue decomposition.
• Package MetabolAnalyze fits mixtures of probabilistic principal component analysis with the EM
algorithm.
• Package MixAll provides EM estimation of diagonal Gaussian, gamma, Poisson, and categorical
mixtures combined based on the conditional independence assumption using different EM variants
and allowing for missing observations. The package accesses the clustering part of the Statistical
ToolKit STK++.
• Package mixR performs maximum likelihood estimation of finite mixture models for raw or binned
data for families including Normal, Weibull, Gamma, and Lognormal using the EM algorithm,
together with the Newton-Raphson algorithm or the bisection method when necessary. The package
also provides information criteria or the bootstrap likelihood ratio test for model selection and the
model fitting process is accelerated using package Rcpp.
• Package mixtools provides fitting with the EM algorithm for parametric and non-parametric
(multivariate) mixtures. Parametric mixtures include mixtures of multinomials, multivariate
normals, normals with repeated measures, Poisson regressions, and Gaussian regressions (with
random effects). Non-parametric mixtures include the univariate semi-parametric case where
symmetry is imposed for identifiability and multivariate non-parametric mixtures with a conditional
independence assumption. In addition, fitting mixtures of Gaussian regressions with the Metropolis-
Hastings algorithm is available.
• Fitting finite mixtures of uni- and multivariate scale mixtures of skew-normal distributions with the
EM algorithm is provided by package mixsmsn.
• Package MoEClust fits parsimonious finite multivariate Gaussian mixtures of expert models via the
EM algorithm. Covariates may influence the mixing proportions and/or component densities and all
14 constrained covariance parameterizations from package mclust are implemented.
• Package movMF fits finite mixtures of von Mises-Fisher distributions with the EM algorithm.
• Package mritc provides tools for classification using normal mixture models and (higher resolution)
hidden Markov normal mixture models fitted by various methods.
• Package prabclus clusters a presence-absence matrix object by calculating an MDS from the
distances, and applying maximum likelihood Gaussian mixtures clustering to the MDS points.
• Package psychomix estimates mixtures of the dichotomous Rasch model (via conditional ML) and
the Bradley-Terry model.
• Package rebmix implements the REBMIX algorithm to fit mixtures of conditionally independent
normal, lognormal, Weibull, gamma, binomial, Poisson, Dirac, or von Mises component densities
as well as mixtures of multivariate normal component densities with unrestricted variance-
covariance matrices.
• Bayesian estimation:
• Bayesian estimation of finite mixtures of multivariate Gaussians is possible using package bayesm.
The package provides functionality for sampling from such a mixture as well as estimating the model
using Gibbs sampling. Additional functionality for analyzing the MCMC chains is available for
averaging the moments over MCMC draws, determining the marginal densities, clustering
observations, and plotting the uni- and bivariate marginal densities.
• Package bayesmix provides Bayesian estimation using JAGS.
• Package Bmix (archived) provides Bayesian Sampling for stick-breaking mixtures.
• Package bmixture provides Bayesian estimation of finite mixtures of univariate Gamma and normal
distributions.
• Package GSM fits mixtures of gamma distributions.
• Package IMIFA fits Infinite Mixtures of Infinite Factor Analyzers and a flexible suite of related
models for clustering high-dimensional data. The number of clusters and/or several cluster-specific
latent factors can be non-parametrically inferred, without recourse to model selection criteria.
• Package mcclust implements methods for processing a sample of (hard) clusterings, e.g. the MCMC
output of a Bayesian clustering model. Among them are methods that find a single best clustering to
represent the sample, which is based on the posterior similarity matrix or a relabeling algorithm.
• Package mixAK contains a mixture of statistical methods including the MCMC methods to analyze
normal mixtures with possibly censored data.
• Package NPflow fits Dirichlet process mixtures of multivariate normal, skew-normal, or skew t-
distributions. The package was developed and oriented toward flow-cytometry data preprocessing
applications.
• Package PReMiuM is a package for profile regression, which is a Dirichlet process Bayesian
clustering where the response is linked non-parametrically to the covariate profile.
• Package rjags provides an interface to the JAGS MCMC library which includes a module for mixture
modeling.
• Other estimation methods:
• Package AdMit allows the fitting of an adaptive mixture of Student-t distributions to approximate a
target density through its kernel function.
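As referenced in the mclust bullet above, a minimal model-based clustering sketch on the iris measurements:

library(mclust)
mb <- Mclust(iris[, 1:4])          # BIC selects the model and number of components
summary(mb)                        # chosen model, mixing proportions, classification
plot(mb, what = "classification")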

Other Cluster Algorithms and Clustering Suites:

• Package ADPclust allows the clustering of high-dimensional data based on a two-dimensional decision
plot. This density-distance plot plots, for each data point, the local density against the shortest distance to all
observations with a higher local density value. The cluster centroids of this non-iterative procedure can
be selected using an interactive or automatic selection mode.
• Package amap provides alternative implementations of k-means and agglomerative hierarchical
clustering.
• Package biclust provides several algorithms to find biclusters in two-dimensional data.
• Package cba implements clustering techniques for business analytics like “rock” and “proximus”.
• Package CHsharp clusters 3-dimensional data into their local modes based on a convergent form of Choi
and Hall’s (1999) data sharpening method.
• Package clue implements ensemble methods for both hierarchical and partitioning cluster methods.
• Package CoClust implements a clustering algorithm that is based on copula functions and therefore
allows to group observations according to the multivariate dependence structure of the generating process
without any assumptions on the margins.
• Package compHclust provides complementary hierarchical clustering, which was specially designed for
microarray data to uncover structures present in the data that arise from ‘weak’ genes.
• Package DatabionicSwarm implements a swarm system called Databionic swarm (DBS) for self-
organized clustering. This method can adapt itself to structures of high-dimensional data such as natural
clusters characterized by distance and/or density-based structures in the data space.
• Package dbscan provides a fast reimplementation of the DBSCAN (density-based spatial clustering of
applications with noise) algorithm using a kd-tree.
• Fuzzy clustering and bagged clustering are available in package e1071. Further and more extensive tools
for fuzzy clustering are available in package fclust.
• Package FactoClass performs a combination of factorial methods and cluster analysis.
• Package FCPS provides many conventional clustering algorithms with consistent input and output,
several statistical approaches for the estimation of the number of clusters as well as the mirrored density
plot (MD-plot) of clusterability and offers a variety of clustering challenges any algorithm should be able
to handle when facing real-world data.
• The hopach algorithm is a hybrid between hierarchical methods and PAM and builds a tree by recursively
partitioning a data set.
• For graphs and networks, model-based clustering approaches are implemented in latentnet.
• Package ORIClust provides order-restricted information-based clustering, a clustering algorithm that has
specifically been developed for bioinformatics applications.
• Package pdfCluster provides tools to perform cluster analysis via kernel density estimation. Clusters are
associated with the maximally connected components with estimated density above a threshold. In
addition, a tree structure associated with the connected components is obtained.
• Package prcr implements the 2-step cluster analysis where first hierarchical clustering is performed to
determine the initial partition for the subsequent k-means clustering procedure.
• Package ProjectionBasedClustering implements projection-based clustering (PBC) for high-dimensional
datasets in which clusters are formed by both distance and density structures (DDS).
• Package randomLCA provides the fitting of latent class models which optionally also include a random
effect. Package poLCA allows for polytomous variable latent class analysis and
regression. BayesLCA allows fitting Bayesian LCA models employing the EM algorithm, Gibbs
sampling, or variational Bayes methods.
• Package RPMM fits recursively partitioned mixture models for Beta and Gaussian Mixtures. This is a
model-based clustering algorithm that returns a hierarchy of classes, similar to hierarchical clustering,
but also similar to finite mixture models.
• Self-organizing maps are available in package som.

Cluster-wise Regression:

• Package crimCV fits finite mixtures of zero-inflated Poisson models for longitudinal data with time as a
covariate.
• Multigroup mixtures of latent Markov models on mixed categorical and continuous data (including time
series) can be fitted using depmix or depmixS4. The parameters are optimized using a general-purpose
optimization routine given linear and nonlinear constraints on the parameters.
• Package flexmix implements a user-extensible framework for EM estimation of mixtures of regression
models, including mixtures of (generalized) linear models.
• Package fpc provides fixed-point methods both for model-based clustering and linear regression. A
collection of asymmetric projection methods can be used to plot various aspects of clustering.
• Package lcmm fits a latent class linear mixed model which is also known as a growth mixture model or
heterogeneous linear mixed model using a maximum likelihood method.
• Package mixreg fits mixtures of one-variable regressions and provides the bootstrap test for the number
of components.
• Package mixPHM fits mixtures of proportional hazard models with the EM algorithm.

Additional Functionality:

• Package clusterGeneration contains functions for generating random clusters and random
covariance/correlation matrices, calculating a separation index (data and population version) for pairs of
clusters or cluster distributions, and 1-D and 2-D projection plots to visualize clusters.
Alternatively, MixSim generates a finite mixture model with Gaussian components for prespecified
levels of maximum and/or average overlaps. This model can be used to simulate data for studying the
performance of cluster algorithms.
• Package clusterCrit computes various clustering validation or quality criteria and partition comparison
indices.
• For cluster validation package clusterRepro tests the reproducibility of a cluster. Package clv contains
popular internal and external cluster validation methods ready to use for most of the outputs produced
by functions from the package cluster and clValid calculates several stability measures.
• Package clustvarsel provides variable selection for Gaussian model-based clustering. Variable selection
for latent class analysis for clustering multivariate categorical data is implemented in
package LCAvarsel. Package VarSelLCM provides variable selection for model-based clustering of
continuous, count, categorical or mixed-type data with missing values where the models used impose a
conditional independence assumption given group membership.
• Package factoextra provides some easy-to-use functions to extract and visualize the output of
multivariate data analyses in general including also heuristic and model-based cluster analysis. The
package also contains functions for simplifying some cluster analysis steps and uses ggplot2-based
visualization.
• The functionality to compare the similarity between two cluster solutions is provided by
cluster.stats() in package fpc; see the sketch at the end of this list.
• The stability of k-centroid clustering solutions fitted using functions from package flexclust can also be
validated via bootFlexclust() using bootstrap methods.
• Package MOCCA provides methods to analyze cluster alternatives based on multi-objective
optimization of cluster validation indices.
• Package NbClust implements 30 different indices which evaluate the cluster structure and should help to
determine a suitable number of clusters.
• Mixtures of univariate normal distributions can be printed and plotted using package nor1mix.
• Package seriation provides dissplot() for visualizing dissimilarity matrices using seriation and matrix
shading. This also allows for inspecting cluster quality by restricting objects belonging to the same cluster
to be displayed in consecutive order.
• Package sigclust provides a statistical method for testing the significance of clustering results.
• Package treeClust calculates dissimilarities between data points based on their leaf memberships in
regression or classification trees for each variable. It also performs the cluster analysis using the resulting
dissimilarity matrix with available heuristic clustering algorithms in R.
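
As a quick, hedged illustration of cluster validation with cluster.stats() from fpc (referenced above): the data, the choice of three clusters, and the two candidate solutions are invented purely for the example.

library(fpc)
X <- scale(iris[, 1:4])                      # standardized iris measurements
d <- dist(X)                                 # Euclidean dissimilarities
km <- kmeans(X, centers = 3, nstart = 25)$cluster
hc <- cutree(hclust(d, method = "ward.D2"), k = 3)
out <- cluster.stats(d, clustering = km, alt.clustering = hc)
out$corrected.rand                           # adjusted Rand index between the two solutions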

CRAN packages
Core: cluster, flexclust, flexmix, mclust, Rmixmod.

Regular: AdMit, ADPclust, amap, apcluster, BayesLCA, bayesm, bayesmix, bgmm, biclust, bmixture, cba, cclust, CHsharp, clue, clusterCrit, clusterGeneration, ClusterR, clusterRepro, clusterSim, clustMixType, clustvarsel, clv, clValid, CoClust, compHclust, crimCV, DatabionicSwarm, dbscan, dendextend, depmix, depmixS4, dynamicTreeCut, e1071, EMCluster, evclust, FactoClass, factoextra, fastcluster, fclust, FCPS, flashClust, fpc, funFEM, funHDDC, genieclust, GLDEX, GMCM, GSM, HDclassif, idendr0, IMIFA, kernlab, kml, latentnet, LCAvarsel, lcmm, mcclust, mdendro, MetabolAnalyze, mixAK, MixAll, mixdist, mixPHM, mixR, mixreg, MixSim, mixsmsn, mixtools, mixture, MOCCA, MoEClust, movMF, mritc, NbClust, nor1mix, NPflow, ORIClust, pdfCluster, poLCA, prabclus, prcr, PReMiuM, ProjectionBasedClustering, protoclust, psychomix, pvclust, randomLCA, rebmix, rjags, RPMM, seriation, sigclust, skmeans, som, Spectrum, tclust, teigen, treeClust, VarSelLCM.

Financial and Empirical Analysis


Installation: The packages from this task view can be installed automatically using the ctv package. For
example, ctv::install.views("Finance", coreOnly = TRUE) installs all the core packages
or ctv::update.views("Finance") installs all packages that are not yet installed and are up-to-date.
See the CRAN Task View Initiative for more details.

This CRAN Task View contains a list of packages useful for empirical work in Finance, grouped by topic.

Besides these packages, a very wide variety of functions suitable for empirical work in Finance is provided by
both the basic R system (and its set of recommended core packages), and several other packages on
the Comprehensive R Archive Network (CRAN). Consequently, several of the other CRAN Task Views may
contain suitable packages, in particular the Econometrics, Optimization, Robust, and TimeSeries Task Views.

The ctv package supports these Task Views. Its functions install.views() and update.views() allow, respectively,
installation or update of packages from a given Task View; the option coreOnly can restrict operations to
packages labeled as core below.

Contributions are always welcome and encouraged, either via e-mail to the maintainer or by submitting an issue
or pull request in the GitHub repository linked above. See the Contributing page in the CRAN Task Views repo
for details.
Standard regression models

• A detailed overview of the available regression methodologies is provided by the Econometrics task
view. This is complemented by the Robust task view, which focuses on more robust and resistant
methods.
• Linear models such as ordinary least squares (OLS) can be estimated by lm() (from the stats
package contained in the basic R distribution). Maximum Likelihood (ML) estimation can be
undertaken with the standard optim() function. Many other suitable methods are listed in
the Optimization view. Non-linear least squares can be estimated with the nls() function, as well as
with nlme() from the nlme package. A short sketch of these interfaces follows this list.
• For the linear model, a variety of regression diagnostic tests are provided by
the car, lmtest, strucchange, urca, and sandwich packages. The Rcmdr package provides user
interfaces that may be of interest as well.
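
As a brief sketch of the estimation and diagnostic functions listed above (the data frame, coefficients and starting values are invented for the example, and lmtest is assumed to be installed):

set.seed(1)
df <- data.frame(x = 1:50)
df$y <- 2 + 0.5 * df$x + rnorm(50)           # simulated linear relationship
fit_ols <- lm(y ~ x, data = df)              # OLS via lm()
summary(fit_ols)

df$y2 <- 2 * exp(0.05 * df$x) + rnorm(50)    # simulated exponential growth
fit_nls <- nls(y2 ~ a * exp(b * x), data = df,
               start = list(a = 1, b = 0.05))  # non-linear least squares

lmtest::bptest(fit_ols)                      # Breusch-Pagan test for heteroscedasticity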

Time series

• A detailed overview of tools for time series analysis can be found in the TimeSeries task view. Below
a brief overview of the most important methods in finance is given.
• Classical time series functionality is provided by the arima() and KalmanLike() commands in the
basic R distribution.
• The dse and timsac packages provide a variety of more advanced estimation methods; fracdiff can
estimate fractionally integrated series; longmemo covers related material.
• For volatility modeling, the standard GARCH(1,1) model can be estimated with the garch() function
in the tseries package (see the sketch at the end of this list). Rmetrics (see below) contains the fGarch
package which has additional models. The rugarch package can be used to model a variety of univariate
GARCH models with extensions such as ARFIMA, in-mean, external regressors and various other
specifications; methods for fitting, forecasting, simulation, inference and plotting are provided as well.
The rmgarch package builds on it to provide the ability to estimate several multivariate GARCH models.
The betategarch package can estimate and simulate the Beta-t-EGARCH model by Harvey.
The bayesGARCH package can perform Bayesian estimation of a GARCH(1,1) model with Student’s t
innovations. For multivariate models, the gogarch package provides functions for generalized orthogonal
GARCH models. The gets package (which was preceded by a related package AutoSEARCH) provides
automated general-to-specific model selection of the mean and log-volatility of a log-ARCH-X model.
The lgarch package can estimate and fit log-GARCH models. The garchx package estimates GARCH
models with leverage and external covariates. The bmgarch package fits several multivariate GARCH
models in a Bayesian setting.
• Unit root and cointegration tests are provided by tseries, and urca. The Rmetrics
packages timeSeries and fMultivar contain a number of estimation functions for ARMA, GARCH,
long memory models, unit roots and more. The CADFtest package implements the Hansen unit root
test.
• The dlm package provides Bayesian and likelihood analysis of dynamic linear models (i.e. linear Gaussian
state space models).
• The vars package offers estimation, diagnostics, forecasting and error decomposition of VAR and
SVAR models in a classical framework.
• The dyn and dynlm packages are suitable for dynamic (linear) regression models.
• Several packages provide wavelet analysis functionality: wavelets, waveslim, wavethresh. Some
methods from chaos theory are provided by the package tseriesChaos. tsDyn adds time series analysis
based on dynamical systems theory.
• The forecast package adds functions for forecasting problems.
• The stochvol package implements Bayesian estimation of stochastic volatility using Markov Chain
Monte Carlo, and factorstochvol extends this to the multivariate case.
• The MSGARCH package adds methods to fit (by Maximum Likelihood or Bayesian), simulate, and
forecast various Markov-Switching GARCH processes.
• The DriftBurstHypothesis package estimates a t-test statistics for the explosive drift burst hypothesis
(Christensen, Oomen and Reno, 2018).
• Package lmForc provides various in-sample, out-of-sample, pseudo-out-of-sample and benchmark linear
model forecast tests.
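
To make the basic volatility and ARMA workflow concrete, here is a minimal sketch using the built-in EuStockMarkets data; garch() from tseries and arima() from the stats package are the functions mentioned above, and the model orders are chosen only for illustration.

library(tseries)
dax <- diff(log(EuStockMarkets[, "DAX"]))    # daily DAX log-returns
garch_fit <- garch(dax, order = c(1, 1))     # standard GARCH(1,1)
summary(garch_fit)                           # coefficient tests and diagnostics

arima_fit <- arima(dax, order = c(1, 0, 1))  # classical ARMA(1,1) from stats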
Finance

• The Rmetrics suite of packages comprises fAssets, fBasics, fBonds, timeDate (formerly:
fCalendar), fCopulae, fExtremes, fGarch, fImport, fNonlinear, fPortfolio, fRegression, timeSeries (f
ormerly: fSeries), fTrading, and contains a very large number of relevant functions for different aspect
of empirical and computational finance.
• The RQuantLib package provides several option-pricing functions as well as some fixed-income
functionality from the QuantLib project to R. The RcppQuantuccia package provides a smaller subset of
QuantLib functionality as a header-only library; at present only some calendaring functionality is
exposed.
• The quantmod package offers a number of functions for quantitative modelling in finance as well as
data acquisition, plotting and other utilities.
• The backtest package offers tools to explore portfolio-based hypotheses about financial instruments.
The pa package offers performance attribution functionality for equity portfolios.
• The PerformanceAnalytics package contains a large number of functions for portfolio performance
calculations and risk management.
• The TTR package contains functions to construct technical trading rules in R.
• The sde package provides simulation and inference functionality for stochastic differential equations.
• The vrtest package contains a number of variance ratio tests for the weak-form of the efficient markets
hypothesis.
• The gmm package provides generalized method of moments (GMM) estimation functions that are
often used when estimating the parameters of the moment conditions implied by an asset pricing
model.
• The BurStFin and BurStMisc packages have a collection of functions for finance, including the
estimation of covariance matrices.
• The AmericanCallOpt package contains a pricer for different American call options.
• The FinAsym package implements the Lee and Ready (1991) and Easley and O’Hara (1987) tests for,
respectively, trade direction, and probability of informed trading.
• The parma package provides support for portfolio allocation and risk management applications.
• The GUIDE package provides a GUI for DErivatives and contains numerous pricer examples as well
as interactive 2d and 3d plots to study these pricing functions.
• The SharpeR package contains a collection of tools for analyzing the significance of trading strategies
based on the Sharpe ratio, and for assessing overfitting of the same.
• The RND package implements various functions to extract risk-neutral densities from option prices.
• The LSMonteCarlo package can price American Options via the Least Squares Monte Carlo method.
• The BenfordTests package provides seven statistical tests and support functions for determining if
numerical data could conform to Benford’s law.
• The OptHedging package values call and put option portfolios and implements an optimal hedging
strategy.
• The markovchain package provides functionality to easily handle and analyse discrete Markov chains.
• The tvm package provides functions for time-value-of-money calculations such as cashflows and yield
curves.
• The MarkowitzR package provides functions to test the statistical significance of Markowitz
portfolios.
• The pbo package models the probability of backtest overfitting, performance degradation, probability
of loss, and the stochastic dominance when analysing trading strategies.
• The OptionPricing package implements efficient Monte Carlo algorithms for the price and the
sensitivities of Asian and European Options under Geometric Brownian Motion.
• The matchingMarkets package implements a structural estimator to correct for the bias arising from
endogenous matching (e.g. group formation in microfinance or matching of firms and venture
capitalists).
• The restimizeapi package interfaces the API at www.estimize.com which provides crowd-sourced
earnings estimates.
• The credule package is another pricer for credit default swaps.
• The obAnalytics package analyses and visualizes information from events in limit order book data.
• The derivmkts package adds a set of pricing and expository functions useful in teaching derivatives
markets.
• The PortfolioEffectHFT package provides portfolio analysis suitable for intra-day and high-frequency
data, and also interfaces the PortfolioEffect service.
• The ragtop package prices equity derivatives under an extension to Black and Scholes supporting
default under a power-law link price and hazard rate.
• The InfoTrad package estimates PIN and extends it to different factorization and estimation
algorithms.
• The FinancialMath package contains financial math and derivatives pricing functions as required by
the actuarial exams by the Society of Actuaries and Casualty Actuarial Society ‘Financial
Mathematics’ exam.
• The tidyquant package re-arranges functionality from several other key packages for use in the so-
called tidyverse.
• The BCC1997 package prices European options under the Bakshi, Cao and Chen (1997) model for stochastic
volatility, stochastic rates and random jumps.
• The Sim.DiffProc package provides functions to simulate and analyse multidimensional Itô and
Stratonovitch stochastic calculus for continuous-time models.
• The BLModel package computes the posterior distribution in a Black-Litterman model from a prior
distribution given by asset returns and continuous distribution of views given by an external function.
• The rpatrec (archived) package aims to recognise charting patterns in (financial) time series data.
• The PortfolioOptim package solves both small- and large-sample portfolio optimization problems.
• The estudy2 (archived) package implements most commonly used parametric and nonparametric
event-study methodology tests.
• The DtD package computes the distance to default per Merton’s model.
• The PeerPerformance package analyzes the performance of investment funds relative to their peers in a
pairwise manner that is robust to false discoveries.
• The crseEventStudy package provides another event-study tool to analyse abnormal returns in long-
horizon events.
• The simfinapi package provides R access to SimFin fundamental financial statement data (given an
API key).
• The NFCP package models commodity prices via an n-factor term structure estimation.
• The LSMRealOptions package uses least-squares Monte Carlo to value American and Real options.
• The AssetCorr package estimates intra- and inter-cohort correlations from default data in a Vasicek
credit portfolio model.
• The ichimoku package provides tools for creating and visualising Ichimoku Kinko Hyo strategies, and
provides an interface to the OANDA fxTrade API for retrieving historical and live streaming price
data (which requires free registration).
• The greeks package calculates sensitivities of financial option prices for European, Asian and
American options in the Black-Scholes model.
• The RTL (Risk Tool Library) package offers a collection of functions and metadata to complement
core packages in finance and commodities, including futures expiry tables.
• The GARCHSK package estimates GARCHSK and GJRSK models allowing for time-varying
volatility, skewness and kurtosis.
• The bidask package offers a novel procedure to estimate bid-ask spreads from OHLC data, and
implements other reference models.
• The strand package adds a framework for discrete (share-level) simulations of investment strategies.
• The HDShOP package constructs shrinkage estimators of high-dimensional mean-variance portfolios
and performs high-dimensional tests on optimality, and the DOSPortfolio package uses it to construct
dynamic optimal shrinkage estimators for the weights of the global minimum variance portfolio.
• The fixedincome package adds functions for fixed-income calculations, models and curves.
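
As an illustration of the data-acquisition and technical-analysis utilities in quantmod and TTR listed above; this sketch needs an internet connection, and the ticker "AAPL" and the 20-day window are arbitrary examples.

library(quantmod)
getSymbols("AAPL", src = "yahoo")            # creates an xts object named AAPL
sma20 <- TTR::SMA(Cl(AAPL), n = 20)          # 20-day simple moving average of closes
chartSeries(AAPL, TA = "addSMA(20)")         # price chart with the SMA overlaid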

Risk management

• The packages qrmtools and qrmdata provide tools and data for standard tasks in Quantitative Risk
Management (QRM) and accompany the book of McNeil, Frey, Embrechts (2005, 2015, “Quantitative
Risk Management: Concepts, Techniques, Tools”).
• The Task View ExtremeValue regroups a number of relevant packages.
• The mvtnorm package provides code for multivariate Normal and t-distributions.
• The package nvmix provides functionality for multivariate normal variance mixtures (including
normal and t for non-integer degrees of freedom).
• The Rmetrics packages fPortfolio and fExtremes also contain a number of relevant functions.
• The packages copula and copulaData cover a wide range of modeling tasks for copulas.
• The actuar package provides an actuarial perspective to risk management.
• The ghyp package provides generalized hyperbolic distribution functions as well as procedures for
VaR, CVaR or target-return portfolio optimizations.
• The ChainLadder package provides functions for modeling insurance claim reserves; and
the lifecontingencies package provides functions for financial and actuarial evaluations of life
contingencies.
• The ESG package can be used for asset projection modelling via a scenario-based simulation approach.
• The riskSimul package provides efficient simulation procedures to estimate tail loss probabilities and
conditional excess for stock portfolios whose log-returns are assumed to follow a t-copula model
with generalized hyperbolic or t marginals.
• The GCPM package analyzes the default risk of a credit portfolio using both analytical and simulation
approaches.
• The FatTailsR package provides a family of four distributions tailored to distributions with symmetric
and asymmetric fat tails.
• The Dowd package contains functions ported from the ‘MMR2’ toolbox offered in Kevin Dowd’s
book “Measuring Market Risk”.
• The PortRisk package computes portfolio risk attribution.
• The NetworkRiskMeasures package implements some risk measures for financial networks such as
DebtRank, Impact Susceptibility, Impact Diffusion and Impact Fluidity.
• The Risk package computes 26 financial risk measures for any continuous distribution.
• The RiskPortfolios package constructs risk-based portfolios as per the corresponding papers by Ardia
et al.
• The reinsureR package models reinsurance via a class Claims, whose objective is to store claims and
premiums and to which different treaties can be applied.
• The RM2006 package estimates conditional covariance matrix using the RiskMetrics 2006
methodology described in Zumbach (2007).
• The cvar package computes expected shortfall and value at risk for continuous distributions.
• riskParityPortfolio offers fast implementations for constructing risk-parity portfolios.
• The monobin package performs monotonic binning of numeric risk factors in the development of credit
rating models (PD, LGD, EAD).
• The etrm package contains a collection of functions to perform core tasks within energy trading and
risk management (ETRM).
• Package ufRisk offers multiple Value at Risk and Expected Shortfall measures from both parametric
and semiparametric models.
• Packages bondAnalyst and stockAnalyst provide, respectively, bond pricing and fixed-income
valuation functions, and fundamental equity valuation functions, corresponding to standard industry
practices for risk and return.
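
A minimal simulation-based sketch of Value-at-Risk and expected shortfall using the multivariate normal sampler from mvtnorm mentioned above; the means, covariance matrix and portfolio weights are invented for illustration.

library(mvtnorm)
set.seed(7)
mu    <- c(0.0004, 0.0002)                      # hypothetical daily mean returns
Sigma <- matrix(c(1e-4, 4e-5, 4e-5, 9e-5), 2)   # hypothetical covariance matrix
w     <- c(0.6, 0.4)                            # portfolio weights

ret  <- rmvnorm(10000, mean = mu, sigma = Sigma)
loss <- -(ret %*% w)                            # simulated portfolio losses
VaR99 <- quantile(loss, 0.99)                   # empirical 99% Value-at-Risk
ES99  <- mean(loss[loss > VaR99])               # expected shortfall beyond the VaR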

Books

• The NMOF package provides functions, examples and data from Numerical Methods and
Optimization in Finance by Manfred Gilli, Dietmar Maringer and Enrico Schumann (2011), including
the different optimization heuristics such as Differential Evolution, Genetic Algorithms, Particle
Swarms, and Threshold Accepting.
• The FRAPO package provides data sets and code for the book Financial Risk Modelling and Portfolio
Optimization with R by Bernhard Pfaff (2013).

Data and date management

• The zoo and timeDate (part of Rmetrics) packages provide support for irregularly-spaced time series.
The xts package extends zoo specifically for financial time series. See the TimeSeries task view for
more details.
• timeDate also addresses calendar issues such as recurring holidays for a large number of financial
centers, and provides code for high-frequency data sets.
• The fame package can access Fame time series databases (but also requires a Fame backend).
The tis package provides time indices and time-indexed series compatible with Fame frequencies.
• The TSdbi (archived) package provides a unifying interface for several time series data base
backends, and its SQL implementations provide a database table design.
• The IBrokers package provides access to the Interactive Brokers API for data access (but requires an
account to access the service).
• The data.table package provides very efficient and fast access to in-memory data sets such as asset
prices.
• The package highfrequency contains functionality to manage, clean and match highfrequency trades
and quotes data and enables users to calculate various liquidity measures, estimate and forecast
volatility, and investigate microstructure noise and intraday periodicity.
• The bizdays package computes business days given a list of holidays.
• The TAQMNGR package manages tick-by-tick (equity) transaction data performing ‘cleaning’,
‘aggregation’ and ‘import’ where cleaning and aggregation are performed according to Brownlees and
Gallo (2006).
• The Rblpapi package offers efficient access to the Bloomberg API and allows bdp, bdh,
and bds queries as well as data retrieval both in (regular time-)bars and ticks (albeit without subsecond
resolution).
• The finreportr package can download reports from the SEC Edgar database, and relies on, inter alia,
the XBRL package for parsing these reports.
• The GetTDData package imports Brazilian government bonds data (such as LTN, NTN-B and LFT)
from the Tesouro Direto website.
• The fmdates package implements common date calculations according to the ISDA schedules, and
can check for business days in different locales.
• Data from Kenneth French’s website can be downloaded with packages FFdownload and frenchdata.
Individual datasets can also be downloaded with function French in package NMOF.
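
A short sketch of irregular time series handling with zoo/xts as described above, using made-up dates and values.

library(xts)                                  # loads zoo as well
dates <- as.Date("2024-01-01") + c(0, 1, 3, 7, 10)
x <- xts(rnorm(5), order.by = dates)          # irregularly spaced series

grid   <- seq(min(dates), max(dates), by = "day")
filled <- na.locf(merge(x, xts(, order.by = grid)))  # align to a daily grid, carry last value forward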

CRAN packages
Core: fAssets, fBasics, fBonds, fCopulae, fExtremes, fGarch, fImport, fMultivar, fNonlinear, fPortfolio, fRegression, fTrading, PerformanceAnalytics, rugarch, timeDate, timeSeries, tseries, urca, xts, zoo.

Regular: actuar, AmericanCallOpt, AssetCorr, backtest, bayesGARCH, BCC1997, BenfordTests, betategarch, bidask, bizdays, BLModel, bmgarch, bondAnalyst, BurStFin, BurStMisc, CADFtest, car, ChainLadder, copula, copulaData, credule, crseEventStudy, cvar, data.table, derivmkts, dlm, DOSPortfolio, Dowd, DriftBurstHypothesis, dse, DtD, dyn, dynlm, ESG, etrm, factorstochvol, fame, FatTailsR, FFdownload, FinancialMath, FinAsym, finreportr, fixedincome, fmdates, forecast, fracdiff, FRAPO, frenchdata, GARCHSK, garchx, GCPM, gets, GetTDData, ghyp, gmm, gogarch, greeks, GUIDE, HDShOP, highfrequency, IBrokers, ichimoku, InfoTrad, lgarch, lifecontingencies, lmForc, lmtest, longmemo, LSMonteCarlo, LSMRealOptions, markovchain, MarkowitzR, matchingMarkets, monobin, MSGARCH, mvtnorm, NetworkRiskMeasures, NFCP, nlme, NMOF, nvmix, obAnalytics, OptHedging, OptionPricing, pa, parma, pbo, PeerPerformance, PortfolioEffectHFT, PortfolioOptim, PortRisk, qrmdata, qrmtools, quantmod, ragtop, Rblpapi, Rcmdr, RcppQuantuccia, reinsureR, restimizeapi, Risk, riskParityPortfolio, RiskPortfolios, riskSimul, RM2006, rmgarch, RND, RQuantLib, RTL, sandwich, sde, SharpeR, Sim.DiffProc, simfinapi, stochvol, stockAnalyst, strand, strucchange, TAQMNGR, tidyquant, timsac, tis, tsDyn, tseriesChaos, TTR, tvm, ufRisk, vars, vrtest, wavelets, waveslim, wavethresh, XBRL.

Analysis of Ecological and Environmental data


Installation: The packages from this task view can be installed automatically using the ctv package. For
example, ctv::install.views("Environmetrics", coreOnly = TRUE) installs all the core packages
or ctv::update.views("Environmetrics") installs all packages that are not yet installed and up-to-
date. See the CRAN Task View Initiative for more details.
Introduction

This Task View contains information about using R to analyse ecological and environmental data.

The base version of R ships with a wide range of functions for use within the field of environmetrics. This
functionality is complemented by a plethora of packages available via CRAN, which provide specialist methods
such as ordination & cluster analysis techniques. A brief overview of the available packages is provided in this
Task View, grouped by topic or type of analysis. As a testament to the popularity of R for the analysis of
environmental and ecological data, a special volume of the Journal of Statistical Software was produced in
2007.

Those interested in environmetrics should consult the Spatial view. Complementary information is also
available in the Cluster, and SpatioTemporal task views.

If you have any comments or suggestions for additions or improvements, then please contact the maintainer or
submit an issue or pull request in the GitHub repository linked above.

A list of available packages and functions is presented below, grouped by analysis type.

General packages

These packages are general, having wide applicability to the environmetrics field.

• Package EnvStats is the successor to the S-PLUS module EnvironmentalStats, both by Steven
Millard. A user guide in the form of a book has recently been released.

Modelling species responses and other data

Analysing species response curves or modelling other data often involves the fitting of standard statistical
models to ecological data and includes simple (multiple) regression, Generalized Linear Models (GLM),
extended regression (e.g. Generalized Least Squares [GLS]), Generalized Additive Models (GAM), and mixed
effects models, amongst others.

• The base installation of R provides lm() and glm() for fitting linear and generalized linear models,
respectively.
• Generalized least squares and linear and non-linear mixed effects models extend the simple regression
model to account for clustering, heterogeneity and correlations within the sample of observations.
Package nlme provides functions for fitting these models. The package is supported by Pinheiro &
Bates (2000) Mixed-effects Models in S and S-PLUS, Springer, New York. An updated approach to
mixed effects models, which also fits Generalized Linear Mixed Models (GLMM) and Generalized
non-Linear Mixed Models (GNLMM) is provided by the lme4 package, though this is currently beta
software and does not yet allow correlations within the error structure.
• Recommended package mgcv fits GAMs and Generalized Additive Mixed Models (GAMM) with
automatic smoothness selection via generalized cross-validation. The author of mgcv has also written
a companion monograph, Wood (2017) Generalized Additive Models: An Introduction with R, Second
Edition, Chapman Hall/CRC, which has an accompanying package gamair.
• Alternatively, package gam provides an implementation of the S-PLUS function gam() that includes
LOESS smooths.
• Proportional odds models for ordinal responses can be fitted using polr() in the MASS package, of
Bill Venables and Brian Ripley.
• A negative binomial family for GLMs to model over-dispersion in count data is available in MASS.
• Models for overdispersed counts and proportions
o Package pscl also contains several functions for dealing with over-dispersed count data.
Poisson or negative binomial distributions are provided for both zero-inflated and hurdle
models.
o aod provides a suite of functions to analyse overdispersed counts or proportions, plus utility
functions to calculate e.g. AIC, AICc, Akaike weights.
• Detecting change points and structural changes in parametric models is well catered for in
the segmented package and the strucchange package, respectively. segmented is discussed in an R
News article.
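
A brief sketch of the GLM and GAM interfaces discussed above, on simulated species counts along an environmental gradient (the data and the quadratic response shape are invented for the example):

set.seed(3)
env    <- runif(200, 0, 10)
counts <- rpois(200, lambda = exp(1 + 0.8 * env - 0.08 * env^2))

glm_fit <- glm(counts ~ env + I(env^2), family = poisson)  # quadratic Poisson GLM

library(mgcv)
gam_fit <- gam(counts ~ s(env), family = poisson)  # smoothness selected automatically
plot(gam_fit)                                      # fitted smooth response curve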

Tree-based models

Tree-based models are being increasingly used in ecology, particularly for their ability to fit flexible models to
complex data sets and the simple, intuitive output of the tree structure. Ensemble methods such as bagging,
boosting and random forests are advocated for improving predictions from tree-based models and to provide
information on uncertainty in regression models or classifiers.

Univariate trees

Tree-structured models for regression, classification and survival analysis, following the ideas in the CART
book, are implemented in

• recommended package rpart


• party provides an implementation of conditional inference trees which embed tree-structured
regression models into a well-defined theory of conditional inference procedures

Multivariate trees

Multivariate trees are available in

• package party, which can also handle multivariate responses.

Ensembles of trees

Ensemble techniques for trees:

• The Random Forest method of Breiman and Cutler is implemented in randomForest, providing
classification and regression based on a forest of trees using random inputs
• Package ipred provides functions for improved predictive models for classification, regression and
survival problems.

Graphical tools for the visualization of trees are available in package maptree.

Packages mda and earth implement Multivariate Adaptive Regression Splines (MARS), a technique which
provides a more flexible, tree-based approach to regression than the piecewise constant functions used in
regression trees.
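
For concreteness, a minimal sketch of a single classification tree and a random forest ensemble on the built-in iris data (the number of trees is arbitrary):

library(rpart)
tree_fit <- rpart(Species ~ ., data = iris)     # CART-style classification tree

library(randomForest)
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf_fit)                                   # OOB error estimate and confusion matrix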

Ordination

R and add-on packages provide a wide range of ordination methods, many of which are specialized techniques
particularly suited to the analysis of species data. The two main packages are ade4 and vegan. ade4 derives
from the traditions of the French school of “Analyse des Donnees” and is based on the use of the duality
diagram. vegan follows the approach of Mark Hill, Cajo ter Braak and others, though the implementation owes
more to that presented in Legendre & Legendre (1998) Numerical Ecology, 2nd English Edition, Elsevier.
Where the two packages provide duplicate functionality, the user should choose whichever framework best
suits their background.

• Principal Components (PCA) is available via the prcomp() function. rda() (in
package vegan), pca() (in package labdsv) and dudi.pca() (in package ade4), provide more
ecologically-orientated implementations.
• Redundancy Analysis (RDA) is available via rda() in vegan and pcaiv() in ade4.
• Canonical Correspondence Analysis (CCA) is implemented in cca() in both vegan and ade4.
• Detrended Correspondence Analysis (DCA) is implemented in decorana() in vegan.
• Principal coordinates analysis (PCO) is implemented
in dudi.pco() in ade4, pco() in labdsv, pco() in ecodist, and cmdscale() in package MASS.
• Non-Metric multi-Dimensional Scaling (NMDS) is provided by isoMDS() in
package MASS and nmds() in ecodist. nmds(), a wrapper function for isoMDS(), is also provided by
package labdsv. vegan provides helper function metaMDS() for isoMDS(), implementing random
starts of the algorithm and standardized scaling of the NMDS results. The approach adopted
by vegan with metaMDS() is the recommended approach for ecological data.
• Coinertia analysis is available via coinertia() and mcoa(), both in ade4.
• Co-correspondence analysis to relate two ecological species data matrices is available in cocorresp.
• Canonical Correlation Analysis (CCoA - not to be confused with CCA, above) is available
in cancor() in standard package stats.
• Procrustes rotation is available in procrustes() in vegan and procuste() in ade4, with
both vegan and ade4 providing functions to test the significance of the association between ordination
configurations (as assessed by Procrustes rotation) using permutation/randomization and Monte Carlo
methods.
• Constrained Analysis of Principal Coordinates (CAP), implemented in capscale() in vegan, fits
constrained ordination models similar to RDA and CCA but with any dissimilarity coefficient.
• Constrained Quadratic Ordination (CQO; formerly known as Canonical Gaussian Ordination (CGO))
is a maximum likelihood estimation alternative to CCA fit by Quadratic Reduced Rank Vector GLMs.
Constrained Additive Ordination (CAO) is a flexible alternative to CQO which uses Quadratic
Reduced Rank Vector GAMs. These methods and more are provided in Thomas
Yee’s VGAM package.
• Fuzzy set ordination (FSO), an alternative to CCA/RDA and CAP, is available in
package fso. fso complements a recent paper on fuzzy sets in the journal Ecology by Dave Roberts
(2008, Statistical analysis of multidimensional fuzzy set ordinations. Ecology 89(5), 1246-1260).
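
A compact sketch of some of the ordination functions above, using vegan's built-in dune meadow data:

library(vegan)
data(dune); data(dune.env)

pca  <- rda(dune)                         # unconstrained PCA via rda()
cca1 <- cca(dune ~ Management, dune.env)  # CCA constrained on Management
nmds <- metaMDS(dune)                     # NMDS with random starts and scaling
plot(nmds)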

Dissimilarity coefficients

Much ecological analysis proceeds from a matrix of dissimilarities between samples. A large amount of effort
has been expended formulating a wide range of dissimilarity coefficients suitable for ecological data. A
selection of the more useful coefficients are available in R and various contributed packages.

Standard functions that produce, square, symmetric matrices of pair-wise dissimilarities include:

• dist() in standard package stats


• daisy() in recommended package cluster
• vegdist() in vegan
• dsvdis() in labdsv
• Dist() in amap
• distance() in ecodist
• a suite of functions in ade4

Function distance() in package analogue can be used to calculate dissimilarity between samples of one matrix
and those of a second matrix. The same function can be used to produce pair-wise dissimilarity matrices, though
the other functions listed above are faster. distance() can also be used to generate matrices based on Gower’s
coefficient for mixed data (mixtures of binary, ordinal/nominal and continuous variables). Function daisy() in
package cluster provides a faster implementation of Gower’s coefficient for mixed-mode data than distance() if
a standard dissimilarity matrix is required. Function gowdis() in package FD also computes Gower’s
coefficient and implements extensions to ordinal variables.
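
A small illustration of these dissimilarity functions; the mixed-mode data frame is invented for the example.

library(vegan)
data(dune)
d_bray <- vegdist(dune, method = "bray")  # Bray-Curtis dissimilarities

library(cluster)
mixed <- data.frame(size = c(1.2, 3.4, 2.2),
                    type = factor(c("a", "b", "a")))
d_gower <- daisy(mixed, metric = "gower") # Gower's coefficient for mixed data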

Cluster analysis

Cluster analysis aims to identify groups of samples within multivariate data sets. A large range of approaches
to this problem have been suggested, but the main techniques are hierarchical cluster analysis, partitioning
methods, such as k-means, and finite mixture models or model-based clustering. In the machine learning
literature, cluster analysis is an unsupervised learning problem.

The Cluster task view provides a more detailed discussion of available cluster analysis methods and appropriate
R functions and packages.

Hierarchical cluster analysis:

• hclust() in standard package stats


• Recommended package cluster provides functions for cluster analysis following the methods
described in Kaufman and Rousseeuw (1990) Finding Groups in data: an introduction to cluster
analysis, Wiley, New York
• hcluster() in amap
• pvclust is a package for assessing the uncertainty in hierarchical cluster analysis. It provides
approximately unbiased p-values as well as bootstrap p-values.

Partitioning methods:

• kmeans() in stats provides k-means clustering


• cmeans() in e1071 implements a fuzzy version of the k-means algorithm
• Recommended package cluster also provides functions for various partitioning methodologies.

Mixture models and model-based cluster analysis:

• mclust and flexmix provide implementations of model-based cluster analysis.


• prabclus clusters a species presence-absence matrix object by calculating an MDS from the distances,
and applying maximum likelihood Gaussian mixtures clustering to the MDS points. The website of the
maintainer, Christian Hennig, contains several publications in ecological contexts that use prabclus,
especially Hausdorf & Hennig (2007; Oikos 116, 818-828).
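
The three main approaches can be sketched in a few lines on scaled iris measurements (illustrative only; three groups are assumed):

X <- scale(iris[, 1:4])

hc <- hclust(dist(X), method = "average")  # hierarchical clustering
groups_h <- cutree(hc, k = 3)

km <- kmeans(X, centers = 3, nstart = 25)  # k-means partitioning

library(mclust)
mc <- Mclust(X, G = 3)                     # Gaussian mixture (model-based clustering)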

Ecological theory

There is a growing number of packages and books that focus on the use of R for theoretical ecological models.

• vegan provides a wide range of functions related to ecological theory, such as diversity indices
(including the so-called Hill's numbers [e.g. Hill's N2] and rarefaction), ranked abundance
diagrams, Fisher's log series, the Broken Stick model, and Hubbell's abundance model, amongst
others; see the sketch after this list.
• untb provides a collection of utilities for biodiversity data, including the simulation of ecological drift
under Hubbell's Unified Neutral Theory of Biodiversity, and the calculation of various diagnostics
such as Preston curves.
• Package BiodiversityR provides a GUI for biodiversity and community ecology analysis.
• Function betadiver() in vegan implements all the diversity indices reviewed in Koleff et al
(2003; Journal of Animal Ecology 72(3), 367-382 ). betadiver() also provides a plot method to
produce the co-occurrence frequency triangle plots of the type found in Koleff et al (2003).
• Function betadisper(), also in vegan, implements Marti Anderson’s distance-based test for
homogeneity of multivariate dispersions (PERMDISP, PERMDISP2), a multivariate analogue of
Levene’s test (Anderson 2006; Biometrics 62, 245-253 ). Anderson et al (2006; Ecology Letters 9(6),
683-693 ) demonstrate the use of this approach for measuring beta diversity.
• The FD package computes several measures of functional diversity from multiple traits.
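
As a quick illustration of vegan's diversity tools mentioned in the first item of this list (dune and BCI are example data sets shipped with vegan):

library(vegan)
data(dune); data(BCI)
H   <- diversity(dune, index = "shannon")      # Shannon diversity per site
N2  <- diversity(dune, index = "invsimpson")   # Hill's N2 (inverse Simpson)
rar <- rarefy(BCI, sample = min(rowSums(BCI))) # rarefied species richness
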
Population dynamics
Estimating animal abundance and related parameters

This section concerns estimation of population parameters (population size, density, survival probability, site
occupancy etc.) by methods that allow for incomplete detection. Many of these methods use data on marked
animals, variously called ‘capture-recapture’, ‘mark-recapture’ or ‘capture-mark-recapture’ data.

• Rcapture fits loglinear models to estimate population size and survival rate from capture-recapture
data as described by Baillargeon and Rivest (2007).
• secr estimates population density given spatially explicit capture-recapture data from traps, passive
DNA sampling, automatic cameras, sound recorders etc. Models are fitted by maximum likelihood.
The detection function may be half-normal, exponential, cumulative gamma etc. Density surfaces may
be fitted. Covariates of density and detection parameters are specified via formulae.
• unmarked fits hierarchical models of occurrence and abundance to data collected on species subject
to imperfect detection. Examples include single- and multi-season occupancy models, binomial
mixture models, and hierarchical distance sampling models. The data can arise from survey methods
such as temporally replicated counts, removal sampling, double-observer sampling, and distance
sampling. Parameters governing the state and observation processes can be modelled as functions of
covariates.
• Package RMark provides a formula-based R interface for the MARK package which fits a wide
variety of capture-recapture models. See the RMark website and a NOAA report (PDF) for further
details.
• Package marked provides a framework for handling data and analysis for mark-recapture. marked can
fit Cormack-Jolly-Seber (CJS) and Jolly-Seber (JS) models via maximum likelihood and the CJS
model via MCMC. Maximum likelihood estimates for the CJS model can be obtained using R or via
a link to the Automatic Differentiation Model Builder software. A description of the package was
published in Methods in Ecology and Evolution.
• mrds fits detection functions to point and line transect distance sampling survey data (for both single
and double observer surveys). Abundance can be estimated using Horvitz-Thompson-type estimators.
• Distance is a simpler interface to mrds for single observer distance sampling surveys.
• dsm fits density surface models to spatially-referenced distance sampling data. Count data are
corrected using detection function models fitted using mrds or Distance. Spatial models are
constructed as in mgcv.

Package secr can also be used to simulate data from the respective models.

See also the SpatioTemporal task view for analysis of animal tracking data under Moving objects, trajectories.

Modelling population growth rates:

• Package popbio can be used to construct and analyse age- or stage-specific matrix population models.

Environmental time series

• Time series objects in R are created using the ts() function, though see tseries or zoo below for
alternatives.
• Classical time series functionality is provided by the ar(), and arima() functions in standard package
stats for autoregressive (AR), moving average (MA), autoregressive moving average (ARMA) and
integrated ARMA (ARIMA) models.
• The forecast package provides methods and tools for displaying and analysing univariate time series
forecasts, including exponential smoothing via state space models and automatic ARIMA modelling;
see the sketch at the end of this section.
• The dse package provides a variety of more advanced estimation methods and multivariate time series
analysis.
• Packages tseries and zoo provide general handling and analysis of time series data.
• Irregular time series can be handled using package zoo as well as by irts() in package tseries.
• pastecs provides functions specifically tailored for the analysis of space-time ecological series.
• strucchange allows for testing, dating and monitoring of structural change in linear regression
relationships.
• Detecting change points in time series data — see segmented above.
• The surveillance package implements statistical methods for the modelling of and change-point
detection in time series of counts, proportions and categorical data. Focus is on outbreak detection in
count data time series.
• Package dynlm provides a convenient interface to fitting time series regressions via ordinary least
squares
• Package dyn provides a different approach to that of dynlm, which allows time series data to be used
with any regression function written in the style of lm such
as lm(), glm(), loess(), rlm() and lqs() from MASS, randomForest() (package randomForest), rq() (pa
ckage quantreg) amongst others, whilst preserving the time series information.
• The openair package provides numerous tools to analyse, interpret and understand air pollution time series data.
• The bReeze package is a collection of widely used methods to analyse, visualize, and interpret wind
data. Wind resource analyses can subsequently be combined with characteristics of wind turbines to
estimate the potential energy production.
• The Rbeast package provides a Bayesian model averaging method to decompose time series into
abrupt changes, trend, and seasonality and can be used for changepoint detection, time series
decomposition, and nonlinear trend analysis.

Additionally, a fuller description of available packages for time series analysis can be found in
the TimeSeries task view.
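
A minimal forecasting sketch with the forecast package mentioned above, on the built-in AirPassengers series (the 12-month horizon is arbitrary):

library(forecast)
fit <- auto.arima(AirPassengers)  # automatic ARIMA order selection
fc  <- forecast(fit, h = 12)      # 12-month-ahead forecasts with intervals
plot(fc)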

Spatial data analysis

See the Spatial CRAN Task View for an overview of spatial analysis in R.

Extreme values

ismev provides functions for models for extreme value statistics and is support software for Coles (2001) An
Introduction to Statistical Modelling of Extreme Values , Springer, New York. Other packages for extreme
value theory include

• evir
• evd
• evdbayes (archived), which provides a Bayesian approach to extreme value theory
• extRemes

See also the ExtremeValue task view for further information.

Phylogenetics and evolution

Packages specifically tailored for the analysis of phylogenetic and evolutionary data include:

• ape
• ouch

UseRs may also be interested in Paradis (2006) Analysis of Phylogenetics and Evolution with R, Springer, New
York, a book in the “Use R!” book series from Springer.

Soil science

Several packages are now available that implement R functions for widely-used methods and approaches in
pedology.

• soiltexture provides functions for soil texture plot, classification and transformation.
• aqp contains a collection of algorithms related to modelling of soil resources, soil classification, soil
profile aggregation, and visualization.
• The Soil Water project on R-Forge provides soil water retention functions, soil hydraulic
conductivity functions and pedotransfer functions to estimate their parameters from easily available
soil properties. Two packages form the project: soilwaterfun and soilwaterptf.

Hydrology and Oceanography

A growing number of packages are available that implement methods specifically related to the fields of
hydrology and oceanography. Also see the Extreme Value and the Climatology sections for related packages.

• hydroTSM is a package for management, analysis, interpolation and plotting of time series used in
hydrology and related environmental sciences.
• hydroGOF is a package implementing both statistical and graphical goodness-of-fit measures between
observed and simulated values, mainly oriented to be used during the calibration, validation, and
application of hydrological/environmental models. Related packages are tiger (archived), which
allows temporally resolved groups of typical differences (errors) between two time series to be
determined and visualized, and qualV, which provides quantitative and qualitative criteria to compare
models with data and to measure similarity of patterns.
• EcoHydRology (archived) provides a flexible foundation for scientists, engineers, and policymakers
to base teaching exercises as well as for more applied use to model complex eco-hydrological
interactions.
• topmodel is a set of hydrological functions including an R implementation of the hydrological model
TOPMODEL, which is based on the 1995 FORTRAN version by Keith Beven. New functionality is
being developed as part of the RHydro package on R-Forge.
• Package seacarb provides functions for calculating parameters of the seawater carbonate system.
• Stephen Sefick’s StreamMetabolism package contains function for calculating stream metabolism
characteristics, such as GPP, NDM, and R, from single station diurnal Oxygen curves.
• Package oce supports the analysis of Oceanographic data, including ADP measurements, CTD
measurements, sectional data, sea-level time series, and coastline files.
• The nsRFA package provides a collection of statistical tools for objective (non-supervised) applications
of the Regional Frequency Analysis methods in hydrology.
• The boussinesq package is a collection of functions implementing the one-dimensional Boussinesq
Equation (ground-water).
• rtop is a package for geostatistical interpolation of data with irregular spatial support such as runoff
related data or data from administrative units.

Climatology

Several packages related to the field of climatology.

• seas implements a number of functions for analysis and graphics of seasonal data.
• RMAWGEN is a set of S3 and S4 functions for spatial multi-site stochastic generation of daily time
series of temperature and precipitation, making use of Vector Autoregressive Models.

Palaeoecology and stratigraphic data

Several packages now provide specialist functionality for the import, analysis, and plotting of palaeoecological
data.

• Transfer function models including weighted averaging (WA), modern analogue technique (MAT),
Locally-weighted WA, & maximum likelihood (aka Gaussian logistic) regression (GLR) are provided
by the rioja and analogue packages.
• Import of common, legacy, palaeo-data formats is provided by package vegan (Cornell format).
• Stratigraphic data plots can be drawn using Stratiplot() function in analogue and
functions strat.plot() and strat.plot.simple in the rioja package. Also see the tidypaleo package, which
provides tools to produce stratigraphic plots using ggplot(). A blog post by the maintainer of
the tidypaleo package, Dewey Dunnington, shows how to use the package to create stratigraphic plots.
• analogue provides extensive support for developing and interpreting MAT transfer function models,
including ROC curve analysis. Summary of stratigraphic data is supported via principal curves in
the prcurve() function.

Other packages

Several other relevant contributed packages for R are available that do not fit under nice headings.

• Andrew Robinson’s equivalence package provides some statistical tests and graphics for assessing
tests of equivalence. Such tests have similarity as the alternative hypothesis instead of the null. The
package contains functions to perform two one-sided t-tests (TOST) and paired t-tests of equivalence.
• Thomas Petzoldt’s simecol package provides an object oriented framework and tools to simulate
ecological (and other) dynamic systems within R. See the simecol website and a R News article on
the package for further information.
• Functions for circular statistics are found in CircStats and circular.
• Package e1071 provides functions for latent class analysis, short time Fourier transform, fuzzy
clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes
classifier, and more...
• Package pgirmess provides a suite of miscellaneous functions for data analysis in ecology.
• mefa provides functions for handling and reporting on multivariate count data in ecology and
biogeography.
• Sensitivity analysis of models is provided by package sensitivity. sensitivity contains a collection of
functions for factor screening and global sensitivity analysis of model output.
• Functions to analyse coherence, boundary clumping, and turnover following the pattern-based
metacommunity analysis of Leibold and Mikkelson (2002) are provided in the metacom package.
• Growth curve estimation via noncrossing and nonparametric regression quantiles is implemented in
package quantregGrowth. A supporting paper is Muggeo et al. (2013).
• The siplab package provides an R platform for experimenting with spatially explicit individual-based
vegetation models. A supporting paper is García, O. (2014).
• PMCMRplus provides parametric and non-parametric many-to-one and all-pairs multiple comparison
procedures for continuous or at least interval based variables. The package provides implementations
of a wide range of tests involving pairwise multiple comparisons.

CRAN packages
Core: ade4, cluster, labdsv, MASS, mgcv, vegan.

Regular: amap, analogue, aod, ape, aqp, BiodiversityR, boussinesq, bReeze, CircStats, circular, cocorresp, Distance, dse, dsm, dyn, dynlm, e1071, earth, ecodist, EnvStats, equivalence, evd, evir, extRemes, FD, flexmix, forecast, fso, gam, gamair, hydroGOF, hydroTSM, ipred, ismev, lme4, maptree, marked, mclust, mda, mefa, metacom, mrds, nlme, nsRFA, oce, openair, ouch, party, pastecs, pgirmess, PMCMRplus, popbio, prabclus, pscl, pvclust, qualV, quantreg, quantregGrowth, randomForest, Rbeast, Rcapture, rioja, RMark, RMAWGEN, rpart, rtop, seacarb, seas, secr, segmented, sensitivity, simecol, siplab, soiltexture, StreamMetabolism, strucchange, surveillance, topmodel, tseries, unmarked, untb, VGAM, zoo.

Machine Learning and statistics

Installation: The packages from this task view can be installed automatically using the ctv package. For
example, ctv::install.views("MachineLearning", coreOnly = TRUE) installs all the core packages
or ctv::update.views("MachineLearning") installs all packages that are not yet installed and up-to-
date. See the CRAN Task View Initiative for more details.
Several add-on packages implement ideas and methods developed at the borderline between computer science
and statistics - this field of research is usually referred to as machine learning. The packages can be roughly
structured into the following topics:

• Neural Networks and Deep Learning: Single-hidden-layer neural networks are implemented in
package nnet (shipped with base R). Package RSNNS offers an interface to the Stuttgart Neural
Network Simulator (SNNS). Packages implementing deep learning flavours of neural networks
include deepnet (feed-forward neural network, restricted Boltzmann machine, deep belief network,
stacked autoencoders), RcppDL (denoising autoencoder, stacked denoising autoencoder, restricted
Boltzmann machine, deep belief network) and h2o (feed-forward neural network, deep autoencoders).
An interface to tensorflow is available in tensorflow. The torch package implements an interface to
the libtorch library.
• Recursive Partitioning : Tree-structured models for regression, classification and survival analysis,
following the ideas in the CART book, are implemented in rpart (shipped with base R) and tree.
Package rpart is recommended for computing CART-like trees. A rich toolbox of partitioning
algorithms is available in Weka; package RWeka provides an interface to this implementation,
including the J4.8 variant of C4.5 and M5. The Cubist package fits rule-based models (similar to trees)
with linear regression models in the terminal leaves, instance-based corrections and boosting.
The C50 package can fit C5.0 classification trees, rule-based models, and boosted versions of
these. pre can fit rule-based models for a wider range of response variable types.
Two recursive partitioning algorithms with unbiased variable selection and statistical stopping
criterion are implemented in package party and partykit. Function ctree() is based on non-parametric
conditional inference procedures for testing independence between response and each input variable
whereas mob() can be used to partition parametric models. Extensible tools for visualizing binary trees
and node distributions of the response are available in package party and partykit as well. Partitioning
of mixed-effects models (GLMMs) can be performed with package glmertree; partitioning of
structural equation models (SEMs) can be performed with package semtree. Graphical tools for the
visualization of trees are available in package maptree.
Partitioning of mixture models is performed by RPMM.
Computational infrastructure for representing trees and unified methods for prediction and
visualization is implemented in partykit. This infrastructure is used by package evtree to implement
evolutionary learning of globally optimal trees. Survival trees are available in various packages.

Trees for subgroup identification with respect to heterogeneous treatment effects are available in
packages partykit, model4you, dipm, quint, SIDES, psica, and MrSGUIDE (and probably many
more); a brief ctree() sketch follows below.

• Random Forests : The reference implementation of the random forest algorithm for regression and
classification is available in package randomForest. Package ipred has bagging for regression,
classification and survival analysis as well as bundling, a combination of multiple models via
ensemble learning. In addition, a random forest variant for response variables measured at arbitrary
scales based on conditional inference trees is implemented in
package party. randomForestSRC implements a unified treatment of Breiman’s random forests for
survival, regression and classification problems. Quantile regression forests quantregForest allow one to
regress quantiles of a numeric response on explanatory variables via a random forest approach. For
binary data, the varSelRF and Boruta packages focus on variable selection by means of random
forest algorithms. In addition, packages ranger and Rborist offer R interfaces to fast C++
implementations of random forests. Reinforcement Learning Trees, featuring splits in variables which
will be important down the tree, are implemented in package RLT. wsrf implements an alternative
variable weighting method for variable subspace selection in place of the traditional random variable
sampling. Package RGF is an interface to a Python implementation of a procedure called regularized
greedy forests. Random forests for parametric models, including forests for the estimation of
predictive distributions, are available in packages trtf (predictive transformation forests, possibly
under censoring and truncation) and grf (an implementation of generalised random forests).
• Regularized and Shrinkage Methods : Regression models with some constraint on the parameter
estimates can be fitted with the lasso2 (archived) and lars packages. Lasso with simultaneous updates
for groups of parameters (groupwise lasso) is available in package grplasso; the grpreg package
implements a number of other group penalization models, such as group MCP and group SCAD. The
L1 regularization path for generalized linear models and Cox models can be obtained from functions
available in package glmpath, the entire lasso or elastic-net regularization path (also in elasticnet) for
linear regression, logistic and multinomial regression models can be obtained from package glmnet.
The penalized package provides an alternative implementation of lasso (L1) and ridge (L2) penalized
regression models (both GLM and Cox models). Package RXshrink can be used to identify and
display TRACEs for a specified shrinkage path and to determine the appropriate extent of shrinkage.
Semiparametric additive hazards models under lasso penalties are offered by package ahaz. Fisher’s
LDA projection with an optional LASSO penalty to produce sparse solutions is implemented in
package penalizedLDA. The shrunken centroids classifier and utilities for gene expression analyses
are implemented in package pamr. An implementation of multivariate adaptive regression splines is
available in package earth. Various forms of penalized discriminant analysis are implemented in
packages hda and sda. Package LiblineaR offers an interface to the LIBLINEAR library.
The ncvreg package fits linear and logistic regression models under the SCAD and MCP
regression penalties using a coordinate descent algorithm. The same penalties are also implemented
in the picasso package. An implementation of bundle methods for regularized risk minimization is
available from package bmrm. The Lasso under non-Gaussian and heteroscedastic errors is estimated
by hdm, which also offers inference on low-dimensional components of Lasso regression and on
estimated treatment effects in high-dimensional settings. Package SIS implements sure independence
screening in generalised linear and Cox models. Elastic nets for correlated outcomes are available
from package joinet. Robust penalized generalized linear models and robust support vector machines
are fitted by package mpath using composite optimization by conjugation operator.
The islasso package provides an implementation of the lasso based on the induced smoothing idea,
which makes it possible to obtain reliable p-values for all model parameters. Best-subset selection for linear, logistic,
Cox and other regression models, based on a fast polynomial time algorithm, is available from
package abess.
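As a hedged illustration of the penalized-regression packages above, a minimal lasso path with glmnet; mtcars is used purely as example data:

# Lasso regularization path with glmnet
# install.packages("glmnet")
library(glmnet)

x <- as.matrix(mtcars[, -1])        # predictors
y <- mtcars$mpg                     # numeric response
fit <- glmnet(x, y, alpha = 1)      # alpha = 1 selects the lasso penalty
plot(fit, xvar = "lambda")          # coefficient paths along the penalty
cvfit <- cv.glmnet(x, y)            # cross-validation over lambda
coef(cvfit, s = "lambda.min")       # coefficients at the selected penalty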
• Boosting and Gradient Descent : Various forms of gradient boosting are implemented in
package gbm (tree-based functional gradient descent boosting).
Packages lightgbm and xgboost implement tree-based boosting using efficient trees as base learners
for several built-in as well as user-defined objective functions. The hinge loss is optimized by the boosting
implementation in package bst. An extensible boosting framework for generalized linear, additive and
nonparametric models is available in package mboost. Likelihood-based boosting for mixed models
is implemented in GMMBoost. GAMLSS models can be fitted using boosting by gamboostLSS.
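A minimal sketch of tree-based gradient boosting with gbm, as described above; the mtcars data and the tuning values are illustrative assumptions:

# Gradient boosting regression with gbm
# install.packages("gbm")
library(gbm)

set.seed(1)
fit <- gbm(mpg ~ ., data = mtcars,
           distribution = "gaussian",     # squared-error loss
           n.trees = 500, interaction.depth = 2,
           shrinkage = 0.05, cv.folds = 5)
best <- gbm.perf(fit, method = "cv")      # CV-selected number of trees
summary(fit, n.trees = best)              # relative influence of predictors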
• Support Vector Machines and Kernel Methods : The function svm() from e1071 offers an interface to
the LIBSVM library and package kernlab implements a flexible framework for kernel learning
(including SVMs, RVMs and other kernel learning algorithms). An interface to the SVMlight
implementation (only for one-against-all classification) is provided in package klaR.
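For example, a minimal SVM fit through the LIBSVM interface in e1071, again using iris:

# Support vector machine via e1071's svm()
# install.packages("e1071")
library(e1071)

fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
table(predict(fit, iris), iris$Species)   # training-set confusion table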
• Bayesian Methods : Bayesian Additive Regression Trees (BART), where the final model is defined in
terms of the sum over many weak learners (not unlike ensemble methods), are implemented in
packages BayesTree, BART, and bartMachine. Bayesian nonstationary, semiparametric nonlinear
regression and design by treed Gaussian processes including Bayesian CART and treed linear models
are made available by package tgp. Bayesian structure learning in undirected graphical models for
multivariate continuous, discrete, and mixed data is implemented in package BDgraph; corresponding
methods relying on spike-and-slab priors are available from package ssgraph. Naive Bayes classifiers
are available in naivebayes.
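A minimal sketch of the naive Bayes classifier from the naivebayes package mentioned above, once more on iris:

# Naive Bayes classification with naivebayes
# install.packages("naivebayes")
library(naivebayes)

nb <- naive_bayes(Species ~ ., data = iris)    # Gaussian class-conditionals for numeric predictors
head(predict(nb, iris[, -5], type = "prob"))   # posterior class probabilities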
• Optimization using Genetic Algorithms : Package rgenoud offers optimization routines based on
genetic algorithms. The package Rmalschains implements memetic algorithms with local search
chains, which are a special type of evolutionary algorithms, combining a steady state genetic algorithm
with local search for real-valued parameter optimization.
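As a small illustration of genetic-algorithm optimization with rgenoud; the objective function below is our own illustrative choice, not something prescribed by the text:

# Minimizing a Rosenbrock-style function with rgenoud's genoud()
# install.packages("rgenoud")
library(rgenoud)

f <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
res <- genoud(f, nvars = 2, max = FALSE, pop.size = 200, print.level = 0)
res$par     # best parameters found (close to c(1, 1))
res$value   # objective value at the optimum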
• Association Rules : Package arules provides both data structures for efficient handling of sparse binary
data as well as interfaces to implementations of Apriori and Eclat for mining frequent itemsets,
maximal frequent itemsets, closed frequent itemsets and association rules.
Package opusminer provides an interface to the OPUS Miner algorithm (implemented in C++) for
finding the key associations in transaction data efficiently, in the form of self-sufficient itemsets, using
either leverage or lift.
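Since the assignment above asks about Apriori, a minimal sketch of rule mining with arules on its bundled Groceries transactions follows; the support and confidence thresholds are illustrative:

# Apriori association-rule mining with arules
# install.packages("arules")
library(arules)

data(Groceries)                            # transactions shipped with arules
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5)) # top five rules by lift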
• Fuzzy Rule-based Systems : Package frbs implements a host of standard methods for learning fuzzy
rule-based systems from data for regression and classification. Package RoughSets provides
comprehensive implementations of the rough set theory (RST) and the fuzzy rough set theory (FRST)
in a single package.
• Model selection and validation : Package e1071 has function tune() for hyperparameter tuning and
function errorest() (ipred) can be used for error rate estimation. The cost parameter C for support
vector machines can be chosen utilizing the functionality of package svmpath. Data splitting for
cross-validation and other resampling schemes is available in the splitTools package. Functions for
ROC analysis and other visualisation techniques for comparing candidate classifiers are available from
package ROCR. Packages hdi and stabs implement stability selection for a range of models, hdi also
offers other inference procedures in high-dimensional models.
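For instance, a minimal grid search over SVM hyperparameters with e1071's tune(); the grid itself is an illustrative assumption:

# Hyperparameter tuning for an SVM with e1071::tune()
library(e1071)

set.seed(1)
tuned <- tune(svm, Species ~ ., data = iris,
              ranges = list(cost = 2^(-1:3), gamma = 2^(-2:1)))
summary(tuned)          # cross-validated error over the grid
tuned$best.parameters   # selected cost/gamma combination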
• Causal Machine Learning : The package DoubleML is an object-oriented implementation of the
double machine learning framework in a variety of causal models. Building upon the mlr3 ecosystem,
estimation of causal effects can be based on an extensive collection of machine learning methods.
• Other procedures : Evidential classifiers quantify the uncertainty about the class of a test pattern using
a Dempster-Shafer mass function in package evclass. The OneR (One Rule) package offers a
classification algorithm with enhancements for sophisticated handling of missing values and numeric
data together with extensive diagnostic functions.
• Meta packages : Package tidymodels provides miscellaneous functions for building predictive
models, including parameter tuning and variable importance measures. In a similar spirit,
package mlr3 offers high-level interfaces to various statistical and machine learning packages.
Package SuperLearner implements a similar toolbox. The h2o package implements a general-purpose
machine learning platform that has scalable implementations of many popular algorithms such as
random forest, GBM, GLM (with elastic net regularization), and deep learning (feedforward
multilayer networks), among others. An interface to the mlpack C++ library is available from
package mlpack. CORElearn implements a rather broad class of machine learning algorithms, such as
nearest neighbors, trees, random forests, and several feature selection methods. Similarly,
package rminer interfaces several learning algorithms implemented in other packages and computes
several performance measures.
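A minimal sketch of the high-level tidymodels interface mentioned above: the same kind of random forest as earlier, specified through parsnip and fitted on iris (the choice of the ranger engine is illustrative):

# Random forest via tidymodels/parsnip
# install.packages(c("tidymodels", "ranger"))
library(tidymodels)

spec <- rand_forest(mode = "classification", trees = 500) %>%
  set_engine("ranger")
rf_fit <- fit(spec, Species ~ ., data = iris)
predict(rf_fit, new_data = iris) %>% head()   # predicted classes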
• Visualisation (initially contributed by Brandon Greenwell) The stats::termplot() function can
be used to plot the terms in a model whose predict method supports type="terms". The effects package
provides graphical and tabular effect displays for models with a linear predictor (e.g., linear and
generalized linear models). Friedman’s partial dependence plots (PDPs), which are low-dimensional
graphical renderings of the prediction function, are implemented in a few
packages. gbm, randomForest and randomForestSRC provide their own functions for displaying
PDPs, but are limited to the models fit with those packages (the
function partialPlot from randomForest is more limited since it only allows for one predictor at a
time). Packages pdp, plotmo, and ICEbox are more general and allow for the creation of PDPs for a
wide variety of machine learning models (e.g., random forests, support vector machines, etc.);
both pdp and plotmo support multivariate displays (plotmo is limited to two predictors while pdp uses
trellis graphics to display PDPs involving three predictors). By default, plotmo fixes the background
variables at their medians (or first level for factors) which is faster than constructing PDPs but
incorporates less information. ICEbox focuses on constructing individual conditional expectation
(ICE) curves, a refinement over Friedman’s PDPs. ICE curves, as well as centered ICE curves, can
also be constructed with the partial() function from the pdp package.
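To illustrate the PDP and ICE tools above, a minimal sketch with pdp on a random forest; regression on mtcars is chosen purely for illustration:

# Partial dependence and ICE curves with pdp
# install.packages(c("randomForest", "pdp"))
library(randomForest)
library(pdp)

set.seed(1)
rf <- randomForest(mpg ~ ., data = mtcars)
pd <- partial(rf, pred.var = "wt")   # partial dependence of mpg on wt
plotPartial(pd)                      # lattice-based PDP display
ice <- partial(rf, pred.var = "wt", ice = TRUE, center = TRUE)
plotPartial(ice)                     # centered ICE curves, as described above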

CRAN packages
Core: abess, e1071, gbm, kernlab, mboost, nnet, randomForest, rpart.

Regular: ahaz, arules, BART, bartMachine, BayesTree, BDgraph, bmrm, Boruta, bst, C50, CORElearn, Cubist, deepnet,
dipm, DoubleML, earth, effects, elasticnet, evclass, evtree, frbs, gamboostLSS, glmertree, glmnet, glmpath,
GMMBoost, grf, grplasso, grpreg, h2o, hda, hdi, hdm, ICEbox, ipred, islasso, joinet, klaR, lars, LiblineaR, lightgbm,
maptree, mlpack, mlr3, model4you, mpath, naivebayes, ncvreg, OneR, opusminer, pamr, party, partykit, pdp,
penalized, penalizedLDA, picasso, plotmo, pre, quantregForest, quint, randomForestSRC, ranger, Rborist,
RcppDL, rgenoud, RGF, RLT, Rmalschains, rminer, ROCR, RoughSets, RPMM, RSNNS, RWeka, RXshrink,
sda, semtree, SIS, splitTools, ssgraph, stabs, SuperLearner, svmpath, tensorflow, tgp, tidymodels, torch, tree,
trtf, varSelRF, wsrf, xgboost.
