Data Science Training On Statistical Techniques For Analytics


DATA MINING: TODAY'S AGENDA

Introduction to statistics with descriptive analysis
Answers to the questions below: What is data mining? Data sources, user interface (visualization), methods, and impact on performance for other tasks
Data mining process, techniques, process models with tools and applications, data mining issues
Text mining topics
Mahout (theoretical concepts)

DATA MINING

Process of semi-automatically analyzing large databases to find patterns that are:
  valid: hold on new data with some certainty
  novel: non-obvious to the system
  useful: should be possible to act on the item
  understandable: humans should be able to interpret the pattern

Also known as Knowledge Discovery in Databases (KDD).

[Diagram: data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems]

WHY DATA MINING

Credit ratings/targeted marketing: given a database of 100,000 names, which persons are the least likely to default on their credit cards? Identify likely responders to sales promotions.

Fraud detection: which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

Customer relationship management: which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data mining helps extract such information.

WHY MINE DATA? COMMERCIAL VIEWPOINT

Lots of data is being collected and warehoused:
  web data, e-commerce
  purchases at department/grocery stores
  bank/credit card transactions
Computers have become cheaper and more powerful.
Competitive pressure is strong: provide better, customized services for an edge (e.g. in customer relationship management).

WHY MINE DATA? SCIENTIFIC VIEWPOINT

Data is collected and stored at enormous speeds (GB/hour):
  remote sensors on a satellite
  telescopes scanning the skies
  microarrays generating gene expression data
  scientific simulations generating terabytes of data
Traditional techniques are infeasible for raw data.
Data mining may help scientists in classifying and segmenting data, and in hypothesis formation.

APPLICATIONS
Banking: loan/credit card approval
  predict good customers based on old customers
Customer relationship management:
  identify those who are likely to leave for a competitor
Targeted marketing:
  identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
  identify fraudulent events from an online stream of events
Manufacturing and production:
  automatically adjust knobs when process parameters change

DATA MINING ARCHITECTURE


Architecture of a data mining system:

  User interface: communicates between users and the data mining system; visualizes results or performs exploration on data and schemas.
  Knowledge base: the information about the domain we are mining, such as concept hierarchies used to organize attributes into various levels of abstraction; it also contains user beliefs, which can be used to assess the interestingness of a pattern, and thresholds.
  Pattern evaluation module: tests for the interestingness of a pattern.
  Data mining engine: performs functionalities like characterization, association, classification, prediction, etc.
  Database/warehouse server: responsible for fetching relevant data based on the user request.
  Data sources: this is usually the source of data; the data may require cleaning and integration.

DATA SOURCE
Operational database: application-orientation (loans, credit card, trust, savings)
Data warehouse: subject-orientation (customer, vendor, product, activity)

EXAMPLES OF OPERATIONAL DATA


Data               | Industry           | Usage                        | Technology                                             | Volumes
Customer File      | All                | Track customer details       | Legacy application, flat files, mainframes             | Small-medium
Account Balance    | Finance            | Control account activities   | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail             | Generate bills, manage stock | ERP, client/server, relational databases               | Very large
Call Record        | Telecommunications | Billing                      | Legacy application, hierarchical database, mainframe   | Very large
Production Record  | Manufacturing      | Control production           | ERP, relational databases, AS/400                      | Medium

DATA SOURCE: MART & WAREHOUSE

[Diagram: data flows from data sources into data marts and the data warehouse]

CONTINUED

If you end up creating multiple warehouses, integrating them is a problem.

DATA VISUALIZATION:
SEEING THE DATA

VISUAL PRESENTATION
For any kind of high-dimensional data set, displaying predictive relationships is a challenge.
The picture on the previous slide uses 3-D graphics to portray the weather balloon data; we learn very little from just examining the numbers.
Shading is used to represent relative degrees of thunderstorm activity, with the darkest regions showing the heaviest activity.

A BIT OF HISTORY
An early effort used sequences of two-dimensional graphs to add depth.
Current virtual reality programs allow the user to step through a data set. Try going to a realtor's website and taking a tour of a house up for sale.

HUMAN VISUAL PERCEPTION AND DATA VISUALIZATION

Data visualization is so powerful because the human visual cortex converts objects into information so quickly.
The next three slides show (1) usage of global private networks, (2) flow through natural gas pipelines, and (3) a risk analysis report that permits the user to draw an interactive yield curve.
All three use height or shading to add additional dimensions to the figure.

GLOBAL PRIVATE NETWORK ACTIVITY
[Figure: activity map shaded from low to high activity. © 2003, Prentice-Hall]

NATURAL GAS PIPELINE ANALYSIS
Height shows total flow through compressor stations.
[Figure © 2003, Prentice-Hall]

AN ENLIVENED RISK ANALYSIS REPORT
[Figure © 2003, Prentice-Hall]

GEOGRAPHICAL INFORMATION SYSTEMS
A GIS is a special-purpose database that contains a spatial coordinate system. A comprehensive GIS requires:
1. Data input from maps, aerial photos, etc.
2. Data storage, retrieval and query
3. Data transformation and modeling
4. Data reporting (maps, reports and plans)
On the live map, clicking on an area allows the user to drill down and see results for smaller areas.

TELEPHONE POLLING RESULTS
[Figure © 2003, Prentice-Hall]

THE SPECIAL CAPABILITIES OF A GIS

In general, a GIS contains two types of data:
  Spatial data: these elements correspond to a uniquely defined location on earth. They could be in point, line or polygon form.
  Attribute data: the data that will be portrayed at the geographic references established by the spatial data.
Example: data from an opinion poll is displayed for multiple regions in the United States. Clicking on an area allows the user to drill down to the results for smaller areas.
https://gramener.com/fmcg/?Channel=Direct
https://gramener.com/codersearch/#lang=&city=Bangalore&color=lang&size=5&gap=50
https://gramener.com/mahabharatha/

TRENDS IN DATA WAREHOUSING (IMPACT ON PERFORMANCE FOR OTHER TASKS)

Customer interaction and learning relationships require capturing information everywhere, and massive scalability.
Enterprise applications generate data that is doubling every 9-12 months.
The time available for working with data is shrinking, and the need for 24x7 access is becoming the norm.
Fast implementation and ease of management are becoming more and more important.
In the future, more organizations will build Web applications that operate in conjunction with the DW.

THE FUTURE OF DATA MINING (IMPACT ON PERFORMANCE)

As promising as the field may be, it has pitfalls:
  The quality of data can make or break the data mining effort.
  In order to mine the data, companies first have to integrate, transform and cleanse it.
  To obtain value from data mining, organizations must be able to change their mode of operation and maintain the effort.
  Finally, there are concerns about privacy.

PERSONALIZATION VERSUS PRIVACY

Companies that use data mining for target marketing walk a tightrope
between personalization and privacy.
Further, technology appears to create new ways to acquire information
faster than the legal system can handle the ethical and property
issues.
Nonetheless, many view information as a natural resource that should
be managed as such.

TRENDS AFFECTING THE FUTURE OF DATA MINING

While the available data increases exponentially, the number of new data analysts graduating each year has been fairly constant. Either a lot of data will go unanalyzed, or automatic procedures will be needed.
Increases in hardware speed and capacity make it possible to analyze data sets that were too large just a few years ago.
The next generation Internet will connect sites 100
times faster than current speeds.
To be more profitable, businesses will need to react
more quickly and offer better service, and do it all
with fewer people and at a lower cost.

THE FUTURE OF DATA VISUALIZATION (PERFORMANCE)

Weapons performance and safety: data visualization coupled with simulation models can show how weapons perform under typical conditions, and the effect of weapons aging.
Medical trauma treatment: today's surgeons use computer vision to assist in surgery. In the future this trend suggests that local medical personnel can also be assisted from afar by specialists through telepresence.

STATISTICS

We use statistics for many reasons:
  To mathematically describe/depict our findings
  To draw conclusions from our results
  To test hypotheses
  To test for relationships among variables

STATISTICS
Numerical representations of our data. Can be:
  Descriptive: descriptive statistics summarize data.
  Inferential: inferential statistics are tools that indicate how much confidence we can have when we generalize from a sample to a population.

STATISTICS

Powerful tools; we must use them for good:
  Be sure our data is valid and reliable
  Be sure we have the right type of data
  Be sure statistical tests are applied appropriately
  Be sure the results are interpreted correctly
  Remember: numbers may not lie, but people can

STATISTICS: WHAT'S WHAT?

Descriptive objectives/research questions -> descriptive statistics
Comparative objectives/hypotheses -> inferential statistics

DESCRIPTIVE STATISTICS
Can be applied to any measurements (quantitative or qualitative).
Offers a summary/overview/description of data.
Does not explain or interpret.

DESCRIPTIVE STATISTICS
  Number
  Frequency count
  Percentage
  Deciles and quartiles
  Measures of central tendency (mean, median, mode)
  Variability: variance and standard deviation
  Graphs
  Normal curve
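These summaries can be sketched in code; a minimal illustration on a made-up set of twelve scores, using Python's standard statistics module:

```python
# Descriptive statistics on a small hypothetical set of scores.
import statistics

scores = [50, 69, 41, 89, 75, 69, 25, 85, 75, 39, 81, 65]

mean = statistics.mean(scores)          # central tendency: mean
median = statistics.median(scores)      # central tendency: median
mode = statistics.multimode(scores)     # central tendency: mode(s)
variance = statistics.variance(scores)  # variability: sample variance
stdev = statistics.stdev(scores)        # variability: sample standard deviation

print(mean, median, mode, variance, stdev)
```

For this set, the mean is about 63.6, the median is 69, and there are two modes (69 and 75), since both occur twice.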

SHAPES OF DISTRIBUTION
Normal Curve (aka Bell Curve)
Repeated sampling of a population should result
in a normal distribution- clustering of values
around a central tendency.
In a symmetrical distribution, median, mode and
mean all fall at the same point

INFERENTIAL STATISTICS

Allows for comparisons across variables, e.g. is there a relation between one's occupation and one's reason for using the public library?
Hypothesis testing.

LEVELS OF SIGNIFICANCE

The level of significance is the predetermined level at which a null hypothesis is not supported.
The most common level is p < .05
  p = probability
  < = less than (> = more than)
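The decision rule behind the significance level can be written out directly (a sketch; the p-values passed in below are hypothetical):

```python
# Significance-test decision rule: reject the null hypothesis
# when the observed p-value falls below the predetermined level.
ALPHA = 0.05  # the most common significance level

def decide(p_value, alpha=ALPHA):
    """Return the verdict for a given p-value."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.03))  # below .05
print(decide(0.20))  # above .05
```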

ERROR TYPE

Type I error: reject the null hypothesis when it is really true.
Type II error: fail to reject the null hypothesis when it is really false.

PROBABILITY
By using inferential statistics to make decisions,
we can report the probability that we have made
a Type I error (indicated by the p value we
report)
By reporting the p value, we alert readers to the
odds that we were incorrect when we decided to
reject the null hypothesis

PARAMETRIC AND
NONPARAMETRIC STATISTICS
Parametric statistical tests generally require interval or ratio level data and assume that the scores were drawn from a normally distributed population, or that both sets of scores were drawn from populations with the same variance or spread of scores.
Nonparametric methods do not make assumptions about the shape of the population distribution. These are typically less powerful and often need large samples.
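A toy contrast between the two families (made-up samples; the pooled-variance t statistic assumes normally distributed populations with equal variance, while the Mann-Whitney U uses only order information):

```python
# Parametric vs nonparametric statistics on two small toy samples.
import math

a = [12, 15, 14, 10, 13, 16]
b = [9, 11, 8, 13, 10, 7]

def t_statistic(x, y):
    """Pooled-variance two-sample t statistic (parametric)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

def mann_whitney_u(x, y):
    """U statistic (nonparametric): counts how often x beats y, 0.5 for ties."""
    u = 0.0
    for xv in x:
        for yv in y:
            if xv > yv:
                u += 1.0
            elif xv == yv:
                u += 0.5
    return u

print(t_statistic(a, b), mann_whitney_u(a, b))
```

Both statistics would then be compared against their reference distributions to obtain a p-value.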

SELECTING AN APPROPRIATE
STATISTICAL TEST

The appropriate measurement scale(s) to use
Whether the intent is to characterize respondents (descriptive statistics) or to draw inferences to a population (inferential statistics)
The level of significance used, and whether to focus on a one- or two-tailed distribution
Whether the mean or median better characterizes the dataset
Whether the population is normal
The number of independent variables (experimental or predictor variables that evaluators manipulate and that presumably change) and dependent variables (influenced by the independent variable(s))
Whether to use parametric or nonparametric statistics
Willingness to risk a Type I or Type II error
  Type I: the possibility of rejecting a true null hypothesis
  Type II: the possibility of accepting the null hypothesis when it is false

DATA MINING MODELS AND FUNCTIONALITIES

CONTINUED

Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
Decisions in data mining:
  Kinds of databases to be mined
  Kinds of knowledge to be discovered
  Kinds of techniques utilized
  Kinds of applications adapted
Data mining tasks can be classified into two categories:
  Descriptive: characterize general properties of data in the database
    Clustering / similarity matching
    Association rules and variants
    Deviation detection
  Predictive: perform inference on data to make predictions
    Regression
    Classification
    Collaborative filtering

CONTINUED

Medicine: disease outcome, effectiveness of treatments; analyze patient disease history to find relationships between diseases
Molecular/pharmaceutical: identify new drugs
Scientific data analysis: identify new galaxies by searching for sub-clusters
Web site/store design and promotion: find the affinity of visitors to pages and modify the layout accordingly

THE KDD PROCESS

Problem formulation
Data collection
  subset data: sampling might hurt if the data is highly skewed
  feature selection: principal component analysis, heuristic search
Pre-processing: cleaning
  name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
Transformation
  map complex objects, e.g. time-series data, to features, e.g. frequency
Choosing the mining task and mining method
Result evaluation and visualization

Knowledge discovery is an iterative process.
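The cleaning steps can be sketched on a toy record set (hypothetical field names and values; a minimal illustration of normalizing synonyms, removing duplicates, and supplying missing values):

```python
# Pre-processing sketch: normalize a synonym ("yearly" -> "annual"),
# remove duplicate records, and supply a missing value with the mean.
records = [
    {"name": "Alice", "period": "yearly", "amount": 120},
    {"name": "Alice", "period": "annual", "amount": 120},  # duplicate once normalized
    {"name": "Bob",   "period": "annual", "amount": None},  # missing value
    {"name": "Carol", "period": "annual", "amount": 80},
]

# 1. Cleaning: map different words with the same meaning onto one term.
for r in records:
    if r["period"] == "yearly":
        r["period"] = "annual"

# 2. Duplicate removal: keep the first record for each unique tuple.
seen, deduped = set(), []
for r in records:
    key = (r["name"], r["period"], r["amount"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 3. Supply missing values with the mean of the observed amounts.
known = [r["amount"] for r in deduped if r["amount"] is not None]
mean_amount = sum(known) / len(known)
for r in deduped:
    if r["amount"] is None:
        r["amount"] = mean_amount

print(deduped)
```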

RELATIONSHIP WITH OTHER FIELDS

Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on:
  scalability in the number of features and instances
  algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
  automation for handling large, heterogeneous data

Classification
(Supervised learning)

CLASSIFICATION

Given old data about customers and payments, predict a new applicant's loan eligibility.

[Diagram: previous customers (age, salary, profession, location, customer type) train a classifier, which produces decision rules such as "salary > 5L and prof. = exec"; the rules label new applicants' data as good/bad.]

CLASSIFICATION METHODS
Goal: predict class Ci = f(x1, x2, ..., xn)
  Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci
  Nearest neighbour
  Decision tree classifier: divide the decision space into piecewise-constant regions
  Probabilistic/generative models
  Neural networks: partition by non-linear boundaries
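A minimal sketch of the nearest-neighbour method on the loan example (all numbers hypothetical; in practice, features on different scales should be normalized before computing distances):

```python
# 1-nearest-neighbour classification: label a new applicant with the
# class of the closest previous customer in feature space.
import math

# (age, salary in lakhs) -> customer type; a tiny made-up training set.
previous = [
    ((25, 3.0), "bad"),
    ((40, 8.0), "good"),
    ((35, 6.5), "good"),
    ((22, 2.0), "bad"),
]

def classify(applicant):
    """Return the label of the nearest previous customer (Euclidean)."""
    def dist(customer):
        return math.dist(customer[0], applicant)
    return min(previous, key=dist)[1]

print(classify((38, 7.0)))
print(classify((23, 2.5)))
```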

Clustering or
Unsupervised Learning

CLUSTERING
Unsupervised learning is used when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on the time series of their payment history, such that similar customers are in the same cluster.
Key requirement: a good measure of similarity between instances.
Identify micro-markets and develop policies for each.
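The grouping step can be sketched with a one-dimensional k-means (made-up average payment amounts; two clusters, plain Python):

```python
# k-means sketch in one dimension: alternate between assigning each
# value to its nearest center and moving each center to its cluster mean.
def kmeans_1d(values, centers, iterations=10):
    for _ in range(iterations):
        # assignment step: attach each value to the nearest center
        clusters = [[] for _ in centers]
        for v in values:
            j = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            clusters[j].append(v)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

payments = [20, 22, 25, 90, 95, 100]
centers, clusters = kmeans_1d(payments, centers=[0.0, 50.0])
print(centers, clusters)
```

The low-paying and high-paying customers separate into two groups, each a candidate micro-market.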

APPLICATIONS
Customer segmentation, e.g. for targeted marketing:
  Group/cluster existing customers based on the time series of their payment history, such that similar customers are in the same cluster
  Identify micro-markets and develop policies for each
Collaborative filtering: group based on common items purchased
Text clustering
Compression

DISTANCE FUNCTIONS
Numeric data: Euclidean, Manhattan distances
Categorical data: 0/1 to indicate presence/absence, followed by
  Hamming distance (# of dissimilarities)
  Jaccard coefficient: # of shared 1s / # of positions where at least one vector has a 1
Data-dependent measures: the similarity of A and B depends on their co-occurrence with C
Combined numeric and categorical data: weighted normalized distance
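The measures above as a quick sketch on toy numeric and 0/1 vectors:

```python
# Distance and similarity measures on small example vectors.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where the 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared 1s divided by positions where at least one vector has a 1."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either

print(euclidean((0, 0), (3, 4)))             # 5.0
print(manhattan((0, 0), (3, 4)))             # 7
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))   # 2
print(jaccard((1, 0, 1, 1), (1, 1, 0, 1)))   # 0.5
```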

Data types and levels of measurement.

DATA TYPES
Qualitative (or categorical) data consist of
values that can be separated into different
categories that are distinguished by some
nonnumeric characteristic.
Quantitative data consist of values representing
counts or measurements.

DETERMINE WHETHER THE DATA DESCRIBED ARE QUALITATIVE OR QUANTITATIVE AND EXPLAIN WHY.

  A person's social security number
  The number of textbooks owned by a student
  The incomes of college graduates
  The gender of college graduates

TYPES OF QUANTITATIVE DATA

Continuous data can take on any value in a given interval. Continuous data values result from some continuous scale that covers a range of values without gaps, interruptions, or jumps.
Discrete data can take on only particular distinct values and not other values in between. The values in discrete data are either a finite number or a countable number.

STATE WHETHER THE ACTUAL DATA ARE DISCRETE OR CONTINUOUS AND EXPLAIN WHY.

  The number of 1916 dimes still in circulation
  The voltage of electricity in a power line
  The number of eggs that hens lay
  The time it takes for a student to complete a test

LEVELS OF MEASUREMENT
Nominal
Ordinal
Interval
Ratio
Nominal and ordinal are qualitative (categorical)
levels of measurement.
Interval and ratio are quantitative levels of
measurement.

TYPES OF QUALITATIVE MEASUREMENTS
Nominal level of measurement: classifies data into names, labels or categories in which no order or ranking can be imposed. Example: the number of courses offered in each of the different colleges.
Ordinal level of measurement: classifies data into categories that can be ordered or ranked, but precise differences between the ranks do not exist. Generally it does not make sense to do calculations with data at the ordinal level. Example: letter grades of A, B, C, D, and F.

TYPES OF QUANTITATIVE MEASUREMENTS

Interval level of measurement: ranks data, and precise differences between units of measure exist, but there is no meaningful zero. If a zero exists, it is an arbitrary point. Example: IQ scores; it makes sense to talk about someone having an IQ 20 points higher than another person, but an IQ of zero has no meaning.
Ratio level of measurement: has all the characteristics of the interval level, but a true zero exists. Also, true ratios exist when the same variable is measured on two different members of the population. Example: the weight of an individual. It makes sense to say that a 150 lb adult weighs twice as much as a 75 lb child.

CLASSIFY THE FOLLOWING AS QUALITATIVE OR QUANTITATIVE MEASUREMENT, THEN STATE THE LEVEL OF MEASUREMENT.

  Eye color (blue, brown, green, hazel)
  Rating scale (poor, good, excellent)
  ACT score
  Salary
  Age
  Ranking of high school football teams in Missouri
  Nationality
  Temperature
  Zip code

DATA MINING IN PRACTICE

APPLICATION AREAS

Industry               | Application
Finance                | Credit card analysis
Insurance              | Claims, fraud analysis
Telecommunications     | Call record analysis
Transport              | Logistics management
Consumer goods         | Promotion analysis
Data service providers | Value-added data
Utilities              | Power usage analysis

WHY NOW?
Data is being produced
Data is being warehoused
The computing power is available
The computing power is affordable
The competitive pressures are strong
Commercial products are available

DATA MINING WORKS WITH WAREHOUSE DATA

Data warehousing provides the enterprise with a memory.
Data mining provides the enterprise with intelligence.

USAGE SCENARIOS
Data warehouse mining:
  assimilate data from operational sources
  mine static data
Mining log data
Continuous mining: for example, in process control
Stages in mining: data selection -> pre-processing (cleaning) -> transformation -> mining -> result evaluation -> visualization

MINING MARKET
Around 20 to 30 mining tool vendors.
Major tool players: SAS Enterprise Miner, IBM SPSS & AMOS, R, MATLAB, RapidMiner, Tanagra, Weka, Clementine.
All offer pretty much the same set of tools.
Many embedded products: fraud detection, electronic commerce applications, health care, customer relationship management.

OLAP MINING INTEGRATION

OLAP (On-Line Analytical Processing):
  Fast interactive exploration of multidimensional aggregates.
  Heavy reliance on manual operations for analysis: tedious and error-prone on large multidimensional data.
  An ideal platform for vertical integration of mining, but it needs to be interactive instead of batch.

STATE OF THE ART IN MINING-OLAP INTEGRATION

Decision trees [Information Discovery, Cognos]: find factors influencing high profits
Clustering [Pilot Software]: segment customers to define a hierarchy on that dimension
Time series analysis [Seagate's Holos]: query for various shapes along time, e.g. spikes, outliers
Multi-level associations [Han et al.]: find associations between members of dimensions
Sarawagi [VLDB 2000]

DATA MINING IN USE

The US Government uses data mining to track fraud
A supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross selling
Target marketing
Holding on to good customers
Weeding out bad customers

MAJOR ISSUES IN DATA MINING IN A NUTSHELL

1. Mining different kinds of data
2. Handling multiple levels of abstraction
3. Incorporation of background knowledge
4. Visualization of mining results
5. Handling of incomplete or noisy data
6. Scalability of algorithms

MAJOR ISSUES IN DATA MINING IN DETAIL

Mining methodology and user interaction:
  Mining different kinds of knowledge in databases
  Interactive mining of knowledge at multiple levels of abstraction
  Incorporation of background knowledge
  Data mining query languages and ad-hoc data mining
  Expression and visualization of data mining results
  Handling noise and incomplete data
  Pattern evaluation: the interestingness problem
Performance and scalability:
  Efficiency and scalability of data mining algorithms
  Parallel, distributed and incremental mining methods

CONTINUED

Issues relating to the diversity of data types:
  Handling relational and complex types of data
  Mining information from heterogeneous databases and global information systems (WWW)
Issues related to applications and social impacts:
  Application of discovered knowledge
    Domain-specific data mining tools
    Intelligent query answering
    Process control and decision making
  Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  Protection of data security, integrity, and privacy

INTRODUCTION TO THE R LANGUAGE

TOPICS FOR R: AN INTRODUCTION

What is R
List of companies using R
Benefits of R
Limitations of R
How to install R
How to change the directory
Starting with R
Basic commands of R
General commands for packages
Importing data in R
Vectors in R
Matrices in R
Factors in R
Data frames in R
Lists in R
R as a calculator and plotter
Statistics in R
  (a) Central tendency (descriptive analysis)
  (b) Basic assignment and operations in R
  (c) Loading (importing data files)
  (d) Working with data in R
  (e) Case study

WHAT IS R

R is a programming language and environment for statistical computing and graphics.
R is an open-source (GPL - General Public License) statistical environment modeled after S and S-Plus. The S language was developed in the late 1980s by John Chambers at AT&T labs. The R project was started by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland, New Zealand in 1995, and today R is developed by the R Development Core Team, of which Chambers is a member.
The R software environment is written primarily in C, Fortran, and R. R uses a command-line interface; however, several graphical user interfaces are available for use with R. For computationally intensive tasks, C, C++, and Fortran code can be linked and called at run time. Advanced users can write C or Java code to manipulate R objects directly.
You can freely download R from http://www.r-project.org/.

R provides a wide variety of statistical (linear and nonlinear modeling,


classical statistical tests, time-series analysis, classification, clustering, ...)
and graphical techniques, and is highly extensible.

R is an integrated suite of software facilities for data manipulation,


calculation and graphical display. It includes

an effective data handling and storage facility,


a suite of operators for calculations on arrays, in particular matrices,
a large, coherent, integrated collection of intermediate tools for data
analysis,
graphical facilities for data analysis and display either on-screen or on
hardcopy, and
a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and input and
output facilities.

R works with packages. Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library.

R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into every session to be used.

There are about eight packages supplied with the R distribution, and many more are available through the CRAN family of Internet sites, covering a very wide range of modern statistics.

To download and install other packages, visit http://cran.r-project.org/web/packages/. Here you can find a list of the packages currently in R's CRAN package repository. At the time of writing, the CRAN repository has 4366 available packages.

LIST OF COMPANIES USING R

Asia Capital Reinsurance chose Revolution R to analyze complex data and generate business insights.
ANZ, the fourth largest bank in Australia, uses R for credit risk analysis.
Analysis of Facebook status updates and Facebook's social network graph.
Jesse Bridgewater works on "social search awesomeness" for the Bing search engine, and set up his development environment with the necessary tools including Python, Vim, and R.
Google uses R to predict economic activity.
Mozilla, the foundation responsible for the Firefox web browser, uses R to visualize Web activity.
An analytics outsourcer leverages the continually evolving R library to stay at the cutting edge.
New Scientist magazine uses R for data journalism and for visualization and analysis in news articles.
The New York Times uses R for data visualization and to prepare NYT charts.
The free dating site OkCupid (which was recently acquired by match.com) collects a lot of data. With over 3 million members, they have a wealth of information upon which to identify trends about the love lives of a typical OkCupid member.
Source: http://www.revolutionanalytics.com/what-is-open-source-r/companies-using-r.php

BENEFITS OF R

R is open-source statistical software.
R runs on UNIX, Windows and Macintosh.
R has an excellent built-in help system.
R is a programming language, so it will feel familiar to programmers, and it is easy for new users to learn because R does not have a very large syntax.
R programming is very powerful due to its large collection of built-in statistical functions, and R also allows you to create user-defined functions.
R also has excellent graphics facilities.
R has strong object-oriented programming facilities, as well as support for procedural programming.

LIMITATIONS OF R

It has a limited graphical interface.
There is no commercial support.
It cannot handle very large datasets.
R is not so easy to use for the novice.
There is no one to complain to if something doesn't work, because it is open-source software.
Many R commands give little thought to memory management, and so R can very quickly consume all available memory.

HOW TO INSTALL R

When you open the http://www.r-project.org/ site, it provides a link to download R, followed by a list of CRAN mirrors. Select your particular mirror and click on it.
On the next page, you are given a choice of R setups for different operating systems. Choose the option to download R for your OS; Windows users click on "Download R for Windows".
One more page opens with general information and the link "install R for the first time". Click on it and you will see "Download R 2.15.3 for Windows (47 megabytes, 32/64 bit)" to download the R exe to your PC/laptop.
After downloading the R-2.15.3-win.exe file, click on run and install it in a particular directory.
Double-click on the R icon on the desktop to open an R console window.

How to change directory

When R is started, by default it looks for files in the directory in which R is installed. If you want to know the current directory path, use the getwd() command.
It returns the current directory path in which R looks for files. If you want to change the directory and set it to the files of the project you are currently working on, there are two ways in R:
1. Click on File -> Change dir
2. Use the setwd() command

If your current directory is C:\ and you want to work on some files which are in D:\practise, then use the setwd() command as follows:
setwd("D:\\practise")

STARTING WITH R

When you open an R console window, it looks like this:

WHEN YOU OPEN AN R STUDIO WINDOW (GUI BASED)

In the R console window, the > sign in red at the start of each line is called the prompt. When a command is too long to fit on a line, a + sign is used for the continuation prompt.

Now, we do general mathematics in R, like addition, subtraction, multiplication and division:

> 2+2
[1] 4
> 2-3
[1] -1
> 156/12
[1] 13
> 18*5
[1] 90

In R, if you want to assign a value to a particular variable, the <- sign is used. The = sign is also valid (as of R version 1.4.0), but nowadays <- is the conventional assignment operator.

Example: > str <- "Hello, I am R Console Window"
> str
[1] "Hello, I am R Console Window"

In the above example, str is a variable in which we store the string "Hello, I am R Console Window". To show the output, we just give the variable name and press enter. Using the <- operator, we can assign numeric values, text, vectors, data frames, characters, lists, etc.

Note: in the R console window, the up-arrow key retrieves previous commands, the down-arrow key retrieves later commands, and the left and right arrow keys are used to edit a recalled command.

Basic Commands of R

To clear the R console window, we use the shortcut Ctrl+L.

To close the R console window, we use the command
> q()
or the shortcut Alt+F4.

To print some text without storing it in any variable, data frame or vector, we use the print() command.
Example: > print("Hello, I am programmer")
Output: [1] "Hello, I am programmer"

To create a small dataset quickly in R, we use the c() function.
Example: > num1 <- c(50,69,41,89,75,69,25,85,75,39,81,65)
> num1
Output: [1] 50 69 41 89 75 69 25 85 75 39 81 65

GENERAL COMMANDS FOR PACKAGES


Finding Available Packages

R's functionality is organised in packages: it provides a large number of packages that perform different statistical and graphical techniques.

To obtain a list of all packages available at a given mirror site, use the available.packages() command.

Example: > available.packages()

In the window that opens, you can choose your mirror site; R then shows the list of packages available on that mirror.

Installing Packages

In R, we can install packages in two ways:

1. Using the prompt: use the install.packages("NAME") command, where NAME is the name of the desired package.
Example: > install.packages("tm")

2. Using the Packages menu: click on Packages -> Install package(s). This opens a window with the available packages; select the appropriate package and click OK.

Removing packages: to remove a specific package, use the remove.packages("NAME") command, where NAME is the name of the desired package.

Example: > remove.packages("tm")

Updating packages: R is open-source software, so contributors regularly improve packages and release updates. To update all installed packages, use the update.packages() command. For each out-of-date package that is found, you will be prompted to confirm the update; type "y" and press Enter to continue, and the update is downloaded automatically.
Example: > update.packages()

Loading packages: a package is installed once, but it must be loaded in every R session. To use a package, first load it with the library(NAME) command, where NAME is the name of the package. Each time R is started, you will have to reload any special packages you need.
Example: > library(Rstem)
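The install and load steps are often combined so a script installs a package only when it is missing. A minimal sketch (the package name here is illustrative; stats is a base package that is always installed, so the sketch runs as-is):

```r
pkg <- "stats"  # illustrative; substitute e.g. "tm" or "Rstem"

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Loading must be repeated in every new R session
library(pkg, character.only = TRUE)
```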

IMPORTING DATA IN R

When we start working with R, we need data on which to run the various statistical and graphical techniques. To import data into R, we use different commands for different types of files. To import several of these file types, the foreign package must be installed and loaded, so first install and load the foreign package in the R console window.

To learn about the foreign package, use the following command in R:

Example: > library(help=foreign)

The output of this command gives all information related to the foreign package, such as its version, authors, title and description, plus a short index of the package.

Generally, the foreign package provides the functions for reading and writing data stored by statistical packages such as Minitab, S, SAS, SPSS, Stata, Systat, dBase, etc.

Import a .csv file in R: to import a comma-delimited (.csv) file, we use the read.csv() command. In comma separated values (CSV) format, each cell of the data file is separated by a special character, which is usually a comma, although other characters can be used as well.

Syntax: read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".",
fill = TRUE, comment.char="", ...)

Example: > prdct <- read.csv("d:\\product.csv", header=TRUE, sep=",")
> prdct
  Product_id Product_name Price  Qty
1          1         Rice 12000 20kg
2          2        Sugar 15000 35kg
3          3          Tea 18000 30kg
4          4         Daal 25000 23kg
5          5        wheat 56000 25kg
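To experiment with read.csv() without having a file like d:\\product.csv on disk, one can write a small CSV from R first; this sketch uses the session's temporary directory:

```r
# Write a two-row CSV file into the temporary directory
path <- file.path(tempdir(), "product.csv")
writeLines(c("Product_id,Product_name,Price,Qty",
             "1,Rice,12000,20kg",
             "2,Sugar,15000,35kg"), path)

# Read it back as a data frame
prdct <- read.csv(path, header = TRUE, sep = ",")
print(prdct)
print(dim(prdct))  # 2 rows, 4 columns
```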

Excel files: to import an .xls file into R, the gdata package must be installed and loaded.

Example code:
> library(gdata)
> data <- read.xls("D:\\data.xls")

For the .xlsx file format, the xlsx package (and its companion xlsxjars) must be loaded, and read.xlsx() is used:

Example code:
> library(xlsx)
> library(xlsxjars)
> data1 <- read.xlsx("D:\\data1.xlsx", 1)
where the first argument is the path or URL of the xlsx file and the second, numeric, argument is the sheet index within the file.

SPSS files:

Example code:
> library(foreign)
> data1 <- read.spss("D:\\datafile.spss", use.value.labels=TRUE,
max.value.labels = 10)

where the first argument is the path or URL of the file to read;
use.value.labels: convert variables with value labels into R factors;
max.value.labels: only variables with value labels and at most this many unique values will be converted to factors if use.value.labels = TRUE.

VECTOR IN R

A vector is one of the basic data structures of R.

A vector is a sequence of data elements that share the same data type. A data element is sometimes referred to as a vector member or component. Vectors can contain numeric, logical or character values.
Example:
> numvec <- c(12,52,48,36,85)
> numvec
[1] 12 52 48 36 85
> charvec <- c("Aa","Bb","Cc","Dd","Ee")
> charvec
[1] "Aa" "Bb" "Cc" "Dd" "Ee"
> logicalvec <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
> logicalvec
[1]  TRUE FALSE FALSE  TRUE  TRUE

The length() function returns the length of a vector.
Example:
> length(numvec)
[1] 5

To display the first element of a vector:
> numvec[1]
[1] 12

To fetch a range of elements from the middle of the vector:
> a <- numvec[2:4]
> a
[1] 52 48 36

To fetch particular elements of the vector:
> a <- numvec[c(1,3,5)]
> a
[1] 12 48 85

To multiply each element of the vector by 3 and store the result in a:
> a <- 3*numvec
> a
[1]  36 156 144 108 255

To divide 2 by each element of the vector and store the result in b (note the order: 2/numvec divides 2 by each element, not each element by 2):
> b <- 2/numvec
> b
[1] 0.16666667 0.03846154 0.04166667 0.05555556 0.02352941

To multiply the vector by itself term by term and transform each term into its logarithm:
> log(a*a)
[1]  7.167038 10.099712  9.939627  9.364262 11.082527

To transpose a; the result is displayed as a row vector:
> t(a)
     [,1] [,2] [,3] [,4] [,5]
[1,]   36  156  144  108  255

To calculate the sum of the elements of the vector:
> sum(numvec)
[1] 233
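These element-wise operations also recycle the shorter operand when two vectors of different lengths are combined, which is worth keeping in mind. A small sketch:

```r
v <- c(10, 20, 30, 40)
w <- c(1, 2)

# w is recycled to c(1, 2, 1, 2) before the addition
print(v + w)   # 11 22 31 42

# Comparisons are vectorized in the same way
print(v > 15)  # FALSE TRUE TRUE TRUE
```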

MATRIX IN R

matrix() creates a matrix from a given set of values.

Syntax: matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
where
data = the data vector;
nrow = desired number of rows;
ncol = desired number of columns;
byrow = a logical value: if FALSE (the default) the matrix is filled by columns, otherwise it is filled by rows;
dimnames = a dimnames attribute for the matrix: NULL, or a list of length 2 giving the row and column names respectively. An empty list is treated as NULL, and a list of length one as row names. The list can be named, and the list names will be used as names for the dimensions.

Example code:
> datamtrx <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE, dimnames =
list(c("row1", "row2"), c("C.1", "C.2", "C.3")))
> datamtrx
     C.1 C.2 C.3
row1   1   2   3
row2  11  12  13

Build the numeric matrix m1 of dimension 5 x 4 (fill the matrix by column):
> m1 <- matrix(1:20, nrow=5)
> m1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

Build the numeric matrix m2 of dimension 5 x 4 (fill the matrix by row):
> m2 <- matrix(1:20, nrow=5, byrow=T)
> m2
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20

To transpose the matrix m2:
> m3 <- t(m2)
> m3
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

Product of matrix m3 and matrix m2:
> product <- m3 %*% m2
> product
     [,1] [,2] [,3] [,4]
[1,]  565  610  655  700
[2,]  610  660  710  760
[3,]  655  710  765  820
[4,]  700  760  820  880

To find the dimensions of a matrix, use the dim() function:
> dim(product)
[1] 4 4

To select a particular column of the matrix:
> product[,4]
[1] 700 760 820 880

To select multiple rows:
> product[c(2,4),]
     [,1] [,2] [,3] [,4]
[1,]  610  660  710  760
[2,]  700  760  820  880

To delete a row from the product matrix:
> product[-2,]
     [,1] [,2] [,3] [,4]
[1,]  565  610  655  700
[2,]  655  710  765  820
[3,]  700  760  820  880

To delete a column from the product matrix:
> product[,-3]
     [,1] [,2] [,3]
[1,]  565  610  700
[2,]  610  660  760
[3,]  655  710  820
[4,]  700  760  880

Set to zero the elements of m1 that are less than 15 (here m1 <- matrix(1:20, ncol=5); note that m1[m1<15] on its own is a vector):
> m1 <- matrix(1:20, ncol=5)
> m1[m1<15] <- 0
> m1
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0   17
[2,]    0    0    0    0   18
[3,]    0    0    0   15   19
[4,]    0    0    0   16   20

To merge matrices m1 and m2, use the rbind() function, which stacks the matrices vertically (they must have the same number of columns):

> m1 <- matrix(1:20, ncol=5)
> m2 <- matrix(1:20, ncol=5, byrow=T)
> rbind(m1, m2)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
[5,]    1    2    3    4    5
[6,]    6    7    8    9   10
[7,]   11   12   13   14   15
[8,]   16   17   18   19   20

To calculate the sums of the rows or columns, use apply(): set 2 as the second argument for column sums, and 1 for row sums.
> apply(m2, 1, sum)
[1] 15 40 65 90
> apply(m2, 2, sum)
[1] 34 38 42 46 50
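For plain sums, base R also provides the shortcuts rowSums() and colSums(), which return the same values as the apply() calls above:

```r
m2 <- matrix(1:20, ncol = 5, byrow = TRUE)

print(rowSums(m2))  # 15 40 65 90    (same as apply(m2, 1, sum))
print(colSums(m2))  # 34 38 42 46 50 (same as apply(m2, 2, sum))
```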

FACTOR IN R

Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. One of the most important uses of factors is in statistical modeling; since categorical variables enter into statistical models differently than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly.

Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed.

A factor can be created with the factor() command/function. Both numeric and character variables can be made into factors, but a factor's levels will always be character values. You can see the possible levels for a factor through the levels() command. The levels of a factor are used when displaying the factor's values.

Syntax:
factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x))
where x = a vector of data, usually taking a small number of distinct values;
levels = an optional vector of the values that x might have taken;
labels = an optional vector of labels for the levels;
exclude = a vector of values to be excluded when forming the set of levels (this should be of the same type as x);
ordered = a logical flag to determine whether the levels should be regarded as ordered.
Example Code:
> state=c("Tarang","Veer","shashi","sarang","Anand")
> statef=factor(state)
> levels(statef)
[1] "Anand" "sarang" "shashi" "Tarang" "Veer"

> incomes=c(60,59,40,42,23)
> tapply(incomes,statef,mean)
 Anand sarang shashi Tarang   Veer
    23     42     40     60     59

The tapply() function applies a function to each cell of a ragged array, that is, to each (non-empty) group of values given by a unique combination of the levels of certain factors.
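The "ragged" part is easier to see when a level repeats, so the groups have different sizes. A sketch with made-up city data:

```r
city  <- factor(c("Rajkot", "Surat", "Rajkot", "Surat", "Rajkot"))
sales <- c(10, 20, 30, 40, 80)

# Mean of sales within each city: Rajkot averages 3 values, Surat 2
print(tapply(sales, city, mean))  # Rajkot 40, Surat 30
```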

DATA FRAME IN R

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.

The data stored in the columns of a data frame can be of various types: one column might be a numerical variable, another might be a factor, and a third might be a character variable. All columns have to be the same length, i.e. contain the same number of data items.

To create a data frame in R, we use the data.frame() function, which creates data frames: tightly coupled collections of variables that share many of the properties of matrices and of lists, used as a fundamental data structure.

Syntax:
data.frame(..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
stringsAsFactors = default.stringsAsFactors())
where row.names = NULL, or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame;
check.rows = if TRUE, the rows are checked for consistency of length and names;
check.names = logical; if TRUE, the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated;
stringsAsFactors = logical: should character vectors be converted to factors?
Example code:
> df <- data.frame(no=c(1,2,3), name=c("Ajay","Jayendra","Raj"),
Income=c(10000,20000,30000))
> df
  no     name Income
1  1     Ajay  10000
2  2 Jayendra  20000
3  3      Raj  30000

To see the column names of a data frame object, use the names() command:
> names(df)
[1] "no"     "name"   "Income"

To access a particular column of the data frame object:
> df$name
[1] Ajay     Jayendra Raj
Levels: Ajay Jayendra Raj

All components of a data frame can be made accessible as vectors under their own names with attach():
> attach(df)
> Income
[1] 10000 20000 30000

To remove a data frame from the R search path, use the detach() function:
> detach(df)
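Rows of a data frame can be filtered with the same bracket notation, using a logical condition as the row index. A sketch on the df built above (re-created here so the snippet is self-contained):

```r
df <- data.frame(no = c(1, 2, 3),
                 name = c("Ajay", "Jayendra", "Raj"),
                 Income = c(10000, 20000, 30000))

# Keep only the rows where Income exceeds 15000
high <- df[df$Income > 15000, ]
print(high)  # the Jayendra and Raj rows
```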

To add a new data column to a data frame:
> df$gender <- c("M","M","M")
> df
  no     name Income gender
1  1     Ajay  10000      M
2  2 Jayendra  20000      M
3  3      Raj  30000      M

To sort the rows of a data frame, use order() inside the row index of the data frame object.

Example code:
> df$address <- c("Rajkot","Ahmedabad","Surat")
> df
  no     name Income gender   address
1  1     Ajay  10000      M    Rajkot
2  2 Jayendra  20000      M Ahmedabad
3  3      Raj  30000      M     Surat

> sort <- df[order(df[,"address"]), ]
> sort
  no     name Income gender   address
2  2 Jayendra  20000      M Ahmedabad
1  1     Ajay  10000      M    Rajkot
3  3      Raj  30000      M     Surat

LIST IN R

A list is an object consisting of an ordered collection of objects, known as its components.

There is no particular need for the components to be of the same mode or type; for example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on.

Lists are used to bind together vectors with different lengths, which is impossible to do with matrices. Another advantage of lists is that you may assign names (categories) to the values.

Example code:
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3)
> x

Output:
[[1]]
[1] 2 3 5
[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
[[3]]
[1]  TRUE FALSE  TRUE FALSE FALSE
[[4]]
[1] 3

To retrieve a particular list slice, use single brackets:
> x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"

To modify an element inside a list slice, use double brackets:
> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"

Note that this change affects only x, not the s vector:
> s
[1] "aa" "bb" "cc" "dd" "ee"
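Giving the components names (the "categories" mentioned above) makes list access self-documenting: named components can be retrieved with $ or by name in double brackets. A sketch:

```r
person <- list(name = "Ajay", scores = c(80, 90, 85), active = TRUE)

print(person$name)           # "Ajay"
print(person[["scores"]])    # 80 90 85
print(mean(person$scores))   # 85
```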

R AS A CALCULATOR AND PLOTTER

> log2(32)
[1] 5

> sqrt(2)
[1] 1.414214

> seq(0, 5, length=6)
[1] 0 1 2 3 4 5

> plot(sin(seq(0, 2*pi, length=100)))
[Scatter plot of one sine cycle: y-axis sin(seq(0, 2*pi, length=100)) from -1.0 to 1.0, x-axis Index from 0 to 100]

STATISTICS IN R

Introduction to packages:

Packages are the crucial infrastructure for efficiently producing, loading and keeping consistent software libraries from various authors; most packages deal with statistics and data analysis, and many statistical researchers provide their methods as R packages.

The R distribution contains functionality for a large number of statistical procedures, including (the list is endless):

Descriptive statistics
Regression analysis (relational statistics such as linear models, generalized linear models and nonlinear regression models)
Time series analysis
Clustering (classification) analysis, etc.

DESCRIPTIVE STATISTICS WITH R


To start with descriptive statistics in R, we first refresh the basic fundamentals of statistics by going through some basic terms:
1. Population: a population is any entire collection of people, animals, plants or things from which we may collect data. In other words, the term "population" is used in statistics to represent all possible measurements or outcomes that are of interest to us in a particular study.
2. Sample: a sample is a group of units selected from the population. The term "sample" refers to a portion of the population that is representative of the population from which it was selected.
3. Observation: an observation is the value, at a particular period, of a particular variable. We can also say that an observation is a single piece of data about a variable.
4. Variable: a variable is an attribute that describes a person, place, thing, or idea. The value of the variable can "vary" from one entity to another, e.g. name, height, age, etc.

STATISTICAL FUNCTIONS IN R
If you are a beginner, use this command to see all the statistical functions available in R:
> help(package=stats)
This command gives you a list of functions in alphabetical order, so that you can find what you need easily and efficiently.

mean(): this is a generic function to find the arithmetic mean in R.

Syntax: mean(x, na.rm = FALSE)

> x <- c(10,25,5,63,45)
> mean(x)
[1] 29.6
If some values in the data are missing, mean() returns NA; to avoid this, set the na.rm argument to TRUE. It is a logical value indicating whether NA values should be stripped before the computation proceeds.
> x <- c(10,25,NA,5,63,NA,45)
> mean(x)
[1] NA
> mean(x, na.rm=T)
[1] 29.6

median(): this function calculates a simple median, the middle point of the ordered data. If the number of observations is odd, the middle observation is the median; if the number of observations is even, the average of the two middle observations is the median.

Example code:
> x <- c(155, 160, 171, 182, 162, 153, 190, 167, 168, 165, 191)
> median(x)
[1] 167
> x <- c(155, 160, 171, 182, 162, 153, 190, 167, 168, 165, 191, 175)
> median(x)
[1] 167.5
> x <- c(155, 160, 171, 182, 162, 153, 190, 167, 168, 165, NA, 191)
> median(x)
[1] NA
> median(x, na.rm=T)
[1] 167

sd(): this function computes the standard deviation of the values in x.
Syntax: sd(x, na.rm = FALSE)

If na.rm is TRUE, missing values are removed before the computation proceeds.

Example code:
> x <- c(10,25,NA,5,63,NA,45)
> sd(x)
[1] NA
> sd(x, na.rm=T)
[1] 24.30638
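Beyond mean(), median() and sd(), the summary() function reports several descriptive statistics at once, and var() gives the variance (sd() squared). A sketch on the same vector:

```r
x <- c(10, 25, NA, 5, 63, NA, 45)

print(summary(x))             # min, quartiles, median, mean, max, and the NA count
print(var(x, na.rm = TRUE))   # 590.8; sd() is its square root
```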

BASIC ASSIGNMENT AND OPERATIONS IN R

Arithmetic operations:
+, -, *, /, ^ are the standard arithmetic operators.

Matrix arithmetic:
* is element-wise multiplication;
%*% is matrix multiplication.

Assignment:
To assign a value to a variable, use <-.

How to use help in R?

R has a very good help system built in.

If you know which function you want help with, simply use ? followed by the function name.
Ex: ?hist
If you don't know which function to use, then use help.search("...").
Ex: help.search("histogram")

LOADING (IMPORTING) DATA FILES

How do we get data into R? Remember, there is no point and click. First make sure your data is in an easy-to-read format such as CSV (Comma Separated Values). Then use:

D <- read.table(path, sep=",", header=TRUE)

WORKING WITH DATA IN R

Accessing columns:
D has our data in it, but you can't see it directly. To select a column, use D$column.

Subsetting data:
Use a logical operator to do this. ==, >, <, <=, >=, != are all logical operators. Note that the equality operator is two = signs.

Example:
D[D$Gender == "M",]
This will return the rows of D where Gender is "M". Remember, R is case sensitive! This code does nothing to the original dataset.
D.M <- D[D$Gender == "M",] gives a dataset with the appropriate rows.
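Base R's subset() function expresses the same row filter without repeating the data-frame name. A sketch on a toy dataset (the column names are illustrative, matching the example above):

```r
D <- data.frame(Gender = c("M", "F", "M"), Age = c(30, 25, 40))

# Equivalent to D[D$Gender == "M", ]
D.M <- subset(D, Gender == "M")
print(D.M)        # two rows, both with Gender "M"
print(nrow(D.M))  # 2
```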
Case Study 1 (Part A & B)

GRAPHICS IN R LANGUAGE

THE AGENDA
Introduction to graphics in R language
Line charts (code & output)
Bar charts (code & output)
Histograms (code & output)
Pie charts (code & output)
Dot charts (code & output)
Colorful dot charts (code & output)
How to save a graph

INTRODUCTION TO GRAPHICS IN R
LANGUAGE

R is widely preferred for statistical graphics.

R has a very rich set of graphics facilities. An entire book, R Graphics by Paul Murrell (Chapman and Hall, 2005), is devoted to presenting the various graphics facilities of R.

R provides extensive facilities for preparing different types of charts and graphics.

R supports the following types of charts and graphics:
1. Line charts
2. Bar charts
3. Histograms
4. Pie charts
5. Dot charts
6. Misc

SIMPLE PLOTTING POINTS

First, we start with how to plot points on a graph. We first load the graphics package and then use the plot() function to put points on a graph.
Example:
> library(graphics)
> sample <- c(20,52,62,14,65,42,48,36,26)
> plot(sample)

To continue this example, we add a title to the graph and connect all the points.
Example:
> plot(sample, type="o", col="blue")
> title(main="This is a demo of a line chart", col.main="red", font.main=4)

The type parameter of plot() accepts:
1. "p" for points,
2. "l" for lines,
3. "b" for both,
4. "c" for the lines part alone of "b",
5. "o" for both overplotted,
6. "h" for histogram-like (or high-density) vertical lines,
7. "s" for stair steps,
8. "S" for other steps (see the documentation for details),
9. "n" for no plotting.
main - the main title (on top), using the font and size (character expansion) parameter "font.main" and the color parameter "col.main".
sub - the sub-title (at bottom), using the font and size parameter "font.sub" and the color parameter "col.sub".
xlab - the x-axis label, using the font and character expansion parameter "font.lab" and the color parameter "col.lab".
ylab - the y-axis label, with the same font attributes as xlab.
To know more about the different parameters of the plot() function, visit the local help page:
http://127.0.0.1:30636/library/graphics/html/par.html

BAR CHARTS

Simple bar chart: to create a simple bar chart in R, we use the barplot() function.
Example code: > sample <- c(20,52,62,14,65,42,48,36,26)
> barplot(sample)

Fill diagonal shading lines in the bars and add a border:

Example code:
> barplot(sample, main="Bar chart example",
xlab="Days", ylab="Sample",
names.arg=c("1","2","3","4","5","6","7","8","9"), border="blue",
density=c(10,20,30,40,50,60,70,80,90))

where:
main - this parameter gives the main title of the graph;
xlab, ylab - these parameters give the labels of the x- and y-axes of the graph;
names.arg - this parameter puts the names under the bars on the x-axis;
density - a vector giving the density of shading lines, in lines per inch, for the bars or bar components. The default value of NULL means that no shading lines are drawn. Non-positive values of density also inhibit the drawing of shading lines;
border - the color to be used for the border of the bars.

Bar chart with different colors using file data:

> data1 <- read.table("D:\\test.txt", header=T, sep="\t")
> barplot(as.matrix(data1), main="Colorful bar chart example",
ylab="Quantity", beside=TRUE, col=rainbow(5))

where:
data1 is a data object in which we store a tab-delimited file named test.txt using the read.table() function;
as.matrix() attempts to turn its argument into a matrix.

To learn about more parameters of the barplot() function, visit the local help page:
http://127.0.0.1:30636/library/graphics/html/barplot.html

HISTOGRAM
Simple histogram:
Example code: > sample <- c(20,52,62,14,65,42,48,36,26)
> hist(sample)

PIE CHART

Simple pie chart:
Example code: > sample <- c(20,52,62,14,65,42,48,36,26)
> pie(sample)

We use different colors to display the selling of rice in a week.
Example code: > rice <- c(2,3,1,2,2,8,3)
> pie(rice, main="Rice selling in a week",
col=rainbow(length(rice)),
labels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))

Pie chart with changed colors, labels using percentages, and a legend: continuing the above example, we add a color vector to display rice selling in shades running from white to black, then convert the values to percentages and display them in the pie chart.
Example code:
> rice <- c(2,3,1,2,2,8,3)
> colors <- c("white","grey90","grey70","grey50","grey30","grey10","black")
> rice_labels <- round(rice/sum(rice) * 100, 1)
> rice_labels <- paste(rice_labels, "%", sep="")
> pie(rice, main="Rice selling", col=colors,
labels=rice_labels, cex=0.8)
> legend(1.5, 0.5, c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"),
cex=0.8, fill=colors)

DOT CHART

Simple dot chart in R: to create a dot chart in R, we use the dotchart() function. Here we use the test.txt file to draw a dot chart.

Example code:
> product <- read.table("D:\\test.txt", header=T, sep="\t")
> dotchart(t(product))
Here the t() function is new: it returns the transpose of the x object, where x is a matrix or data frame, here named product.

COLORFUL DOT CHART

Example code:
> product <- read.table("D:\\test.txt", header=T, sep="\t")
> dotchart(t(product), color=c("red","blue","darkgreen"), main="Dotchart for
Product", cex=0.8)

HOW TO SAVE A GRAPH

R gives us facilities to save graph output in .png, .pdf, Windows Bitmap (BMP), PostScript (ps) and JPEG formats.

To create the different file formats, use the following functions in R:
1. jpeg(): produces a JPEG file.
2. png(): produces a PNG file.
3. bmp(): produces a BMP file.
4. pdf(): creates a PDF file.
5. postscript(): produces a PostScript file.

Example code: > png("hist.png")
> hist(sample)
> dev.off()

Create a file using one of the functions described above, then run all the plot functions to produce the final graph. The dev.off() function is used to close the graphics device. The height and width options control the size (in pixels) of the graph being saved. The file is saved in the current directory; to know the path, use the getwd() function.

Questions?

Thank you!
