R Introduction
Welcome to the statistical world of the R language. This chapter gives a basic introduction to R
with the help of a few case studies. The R software provides an environment for data management and
statistical analysis. Although this environment is often perceived as less pleasant to use than more
user-friendly software, the growth in its demand in academia and industry suggests that it will be
central to the future of statistical analysis in both.
1.1 Benefits of R
The R project was started by Robert Gentleman and Ross Ihaka of the Statistics Department of the
University of Auckland in 1995. It has quickly gained a widespread audience. It is currently maintained by
the R core-development team, a hard-working, international team of volunteer developers. The R project web
page http://www.r-project.org is the main site for information on R. At this site are directions for obtaining
the software, accompanying packages and other sources of documentation.
2. Starting R
The console is the main window, where you run commands and see the results of executing them.
Graphs produced by your commands appear in the graphics window.
>, + and #
The symbol “>” is called the prompt and indicates that the text after it is typed by the user. Lines
beginning with anything else are produced by R. You type a command after the “>” symbol to instruct
R to execute it. If a command is too long to fit on one line, R shows a + as the continuation prompt
on the following lines. The symbol # starts a comment. Anything after the comment character on that
line is ignored by R.
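The sketch below illustrates comments and a command continued over two lines. The object names x and total are made up for this illustration. The prompts themselves are not typed: at the console, R would print “>” before each command and “+” before the continued line.

```r
# Everything on this line is a comment and is ignored by R.
x <- 5 + 10   # a comment can also follow a command on the same line

# This command is incomplete at the end of its first line, so at the
# console R would show the "+" continuation prompt for the second line.
total <- 1 + 2 +
  3

print(x)
print(total)
```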
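Objects and functions
An object is anything created in R. It may be a variable or a collection of variables. Functions are built
into the software and do the work. The two are joined by the assignment operator <-, which stores the result
of the function on its right in the object named on its left. That is, object <- function(arguments)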
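As a minimal sketch of this pattern, the lines below create two objects from built-in functions. The names marks and avg_marks and the values are hypothetical, chosen only for illustration.

```r
# c() is a built-in function; its result is stored in the object 'marks'.
marks <- c(12, 15, 9)

# mean() is another built-in function; its result is stored in 'avg_marks'.
avg_marks <- mean(marks)

# Typing an object's name (or calling print) shows its value.
print(avg_marks)
```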
Typing less: You can save a lot of typing in R. Arrow keys can be used to retrieve your previous commands
in R. In particular, each command is stored in a history and the up arrow will traverse backwards along this
history and the down arrow forwards. Left and right arrow keys will work as expected.
5+10+20-15+25
15+10/2-15/3*5
(15+20/2-50/10)*5
pi*2^5-sqrt(16)+ log10(15) -log(5)
15-17*2/3-20
abs(15-17*2/3-20)
factorial(5)
log10(2)
log(2)
exp(0.6931472)
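The commands above rely on the usual order of operations: powers first, then multiplication and division from left to right, then addition and subtraction, with parentheses overriding all of these. The sketch below stores a few results in objects (the names are assumptions made for this illustration) so that they can be inspected.

```r
# Division and multiplication are done before addition and subtraction:
# 15 + (10 / 2) - ((15 / 3) * 5) = 15 + 5 - 25
no_brackets <- 15 + 10 / 2 - 15 / 3 * 5

# Parentheses are evaluated first: (15 + 10 - 5) * 5
with_brackets <- (15 + 20 / 2 - 50 / 10) * 5

# log() is the natural logarithm, log10() is the base-10 logarithm,
# and exp() is the inverse of log().
natural_log <- log(2)
round_trip <- exp(natural_log)

print(no_brackets)     # -5
print(with_brackets)   # 100
print(round_trip)
```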
Exercise 2: The following data set gives, for a few companies, the stock price, earnings per share (EPS),
book value per share, the average PE ratio of the industry each company belongs to, and the type of
industry.
Company     Sector       Buy Price  Current Price  EPS     Book Value  Industry PE  Industry Type
DLF         Real estate  110        90             2.52    85          29           Manufacturing
SBI         Banking      180        160            147     1584        11           Service
HDFC        Banking      990        1000           36      178         25           Service
Bharti      Telecom      310        326            19      152         14           Service
Reliance    Petroleum    990        1020           69      609         18           Manufacturing
Infosys     IT           1100       1120           185     733         23           Service
BHEL        Infra        140        110            13      138         17           Manufacturing
Ranbaxy     Pharma       577        450            9.78    26          22.57        Manufacturing
Tata Steel  Metal        283        250            75.41   629         10.33        Manufacturing
L&T         Infra        1470       1200           62.86   362         17.39        Manufacturing
8. Calculate the descriptive statistics (mean, median, mode, standard deviation, variance, minimum and
maximum) of the variables.
9. Export the new dataset in the default folder in .csv format.
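One possible way to approach tasks 8 and 9 is sketched below, using the Exercise 2 table as the data. The column names, the helper functions stat_mode and describe, and the file name stocks.csv are assumptions made for this sketch. R has no built-in function for the statistical mode, so a small helper is defined. The file is written to a temporary folder here to avoid side effects; passing a bare file name to write.csv would write to the working (default) folder instead.

```r
# Data transcribed from the Exercise 2 table (column names are assumptions).
stocks <- data.frame(
  company       = c("DLF", "SBI", "HDFC", "Bharti", "Reliance",
                    "Infosys", "BHEL", "Ranbaxy", "Tata Steel", "L&T"),
  sector        = c("Real estate", "Banking", "Banking", "Telecom", "Petroleum",
                    "IT", "Infra", "Pharma", "Metal", "Infra"),
  buy_price     = c(110, 180, 990, 310, 990, 1100, 140, 577, 283, 1470),
  current_price = c(90, 160, 1000, 326, 1020, 1120, 110, 450, 250, 1200),
  eps           = c(2.52, 147, 36, 19, 69, 185, 13, 9.78, 75.41, 62.86),
  book_value    = c(85, 1584, 178, 152, 609, 733, 138, 26, 629, 362),
  industry_pe   = c(29, 11, 25, 14, 18, 23, 17, 22.57, 10.33, 17.39),
  industry_type = c("Manufacturing", "Service", "Service", "Service",
                    "Manufacturing", "Service", "Manufacturing",
                    "Manufacturing", "Manufacturing", "Manufacturing")
)

# Hypothetical helper: most frequent value (the first one if there are ties).
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: the statistics asked for in task 8, for one variable.
describe <- function(x) {
  c(mean = mean(x), median = median(x), mode = stat_mode(x),
    sd = sd(x), variance = var(x), min = min(x), max = max(x))
}

# Apply the helper to every numeric column; the result has one column per variable.
numeric_cols <- sapply(stocks, is.numeric)
desc_stats <- sapply(stocks[numeric_cols], describe)
print(round(desc_stats, 2))

# Task 9: export as .csv (temporary folder used here; see the note above).
out_file <- file.path(tempdir(), "stocks.csv")
write.csv(stocks, out_file, row.names = FALSE)
```

Reading the file back with read.csv(out_file) is a quick way to confirm that the export worked.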
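Scalar : A single number, e.g. a = 16
Vector : An ordered collection of values of the same type, e.g. a = c(2,5,6,7,8,9,10)
Data frame : Data = data.frame(a,b,c), where a, b and c are vectors of equal length. This creates a table
with the variables as columns and the observations as rows.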
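The three structures can be sketched as follows. All object names and values below are made up for illustration.

```r
# Scalar: a single number (in R this is a vector of length one).
a_scalar <- 16

# Vector: several values of the same type combined with c().
a_vector <- c(2, 5, 6, 7, 8, 9, 10)

# Data frame: equal-length vectors become the columns (variables);
# each row is one observation.
id    <- c(1, 2, 3)
score <- c(57, 78, 46)
grp   <- c("M", "M", "F")
Data  <- data.frame(id, score, grp)

print(length(a_vector))  # number of elements in the vector
print(dim(Data))         # number of rows, then number of columns
str(Data)                # compact summary of each column and its type
```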
Exercise 3 (Descriptive statistics in R) The HR manager of ABC Ltd is interested in analysing the performance score
as well as the retention level of the employees in the company. She collected data on 40 employees working with the
company for six variables (gender, age, performance score, education background, monthly income and the time
spent by the employee in the company). The data set for these 40 employees is given below:
Employee Code  Gender  Age (in years)  Performance Score  Education Background  Monthly Income  Time spent in the company
1 M 25 57 BSc 29000 2
2 M 27 78 Bcom 30000 2
3 M 36 57 Btech 43000 7
4 F 43 46 BE 56000 6
5 F 36 59 BA 67000 5
6 F 28 65 BSc 76000 3
7 F 23 67 BA 15000 4
8 M 35 73 Bcom 72000 5
9 F 34 49 BE 52000 4
10 M 45 63 Btech 65000 7
11 F 34 68 Btech 61000 6
12 F 43 62 BA 89000 5
13 M 42 75 BSc 87000 6
14 M 25 56 Bcom 39000 2
15 F 43 64 Bcom 73000 4
16 F 42 69 BSc 76000 3
17 M 33 75 BA 30000 3
18 M 26 65 BE 28000 2
19 M 27 52 Btech 39000 3
20 F 24 99 BA 19000 3
21 F 25 56 BA 20000 4
22 M 26 87 BE 32000 3
23 F 35 67 BA 48000 9
24 F 36 52 BSc 52000 5
25 M 25 91 BSc 26000 2
26 M 36 78 BE 54000 7
27 F 38 50 BA 71000 6
28 F 39 72 Btech 39000 6
29 M 35 69 Btech 41000 5
30 M 36 61 Btech 43000 5
31 F 31 66 Bcom 20000 6
32 F 34 89 Bcom 45000 4
33 F 35 59 BE 51000 3
34 M 32 60 BA 48000 2
35 M 71 65 Btech 120000 15
36 M 29 59 Btech 49000 4
37 F 32 72 BSc 50000 7
38 M 36 63 BE 48000 5
39 F 48 56 BE 51000 8
40 M 39 73 BSc 63000 3
7. Make the frequency distribution tables of gender and education background of the employees.
8. Find the univariate outliers in the variables performance score and monthly income using a box plot
diagram.
9. Use the barplot( ), hist( ) and pie( ) functions to plot graphs of the performance score of the employees.
10. Test whether the variables ps (performance score), age, mi (monthly income) and rt (retention, i.e. time
spent in the company) are normally distributed.
11. Draw the bivariate plot of the variables age and monthly income (using the plot( ) function).
12. (a) Test the null hypothesis that the average monthly income of the employees of the company is Rs 50000.
(b) Test the null hypothesis that the average performance score is the same for employees of all education
backgrounds.
13. Analyse the correlation between age, performance score, monthly income and retention of the employees.
14. Run and analyse the following regression model:
Performance score = α + β₁ × Age + β₂ × Retention
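The sketch below shows one possible way to approach several of the tasks above, using the Exercise 3 table as the data. The column names (gender, age, ps, edu, mi, rt) are assumptions that follow the abbreviations in task 10. The exercise does not name specific tests, so the Shapiro-Wilk test, the one-sample t-test and one-way ANOVA are used here as common choices, not as the only valid ones. Plots are drawn only in an interactive session.

```r
# Data transcribed from the Exercise 3 table (column names are assumptions).
emp <- data.frame(
  gender = c("M","M","M","F","F","F","F","M","F","M",
             "F","F","M","M","F","F","M","M","M","F",
             "F","M","F","F","M","M","F","F","M","M",
             "F","F","F","M","M","M","F","M","F","M"),
  age = c(25,27,36,43,36,28,23,35,34,45, 34,43,42,25,43,42,33,26,27,24,
          25,26,35,36,25,36,38,39,35,36, 31,34,35,32,71,29,32,36,48,39),
  ps  = c(57,78,57,46,59,65,67,73,49,63, 68,62,75,56,64,69,75,65,52,99,
          56,87,67,52,91,78,50,72,69,61, 66,89,59,60,65,59,72,63,56,73),
  edu = c("BSc","Bcom","Btech","BE","BA","BSc","BA","Bcom","BE","Btech",
          "Btech","BA","BSc","Bcom","Bcom","BSc","BA","BE","Btech","BA",
          "BA","BE","BA","BSc","BSc","BE","BA","Btech","Btech","Btech",
          "Bcom","Bcom","BE","BA","Btech","Btech","BSc","BE","BE","BSc"),
  mi  = 1000 * c(29,30,43,56,67,76,15,72,52,65, 61,89,87,39,73,76,30,28,39,19,
                 20,32,48,52,26,54,71,39,41,43, 20,45,51,48,120,49,50,48,51,63),
  rt  = c(2,2,7,6,5,3,4,5,4,7, 6,5,6,2,4,3,3,2,3,3,
          4,3,9,5,2,7,6,6,5,5, 6,4,3,2,15,4,7,5,8,3)
)
emp$edu <- factor(emp$edu)

# Task 7: frequency tables.
gender_freq <- table(emp$gender)
edu_freq    <- table(emp$edu)

# Task 8: points a box plot would flag as outliers (beyond 1.5 x IQR from the hinges).
ps_outliers <- boxplot.stats(emp$ps)$out
mi_outliers <- boxplot.stats(emp$mi)$out

# Task 10: Shapiro-Wilk p-values (assumed choice of normality test).
num_vars    <- c("ps", "age", "mi", "rt")
normality_p <- sapply(emp[num_vars], function(x) shapiro.test(x)$p.value)

# Task 12(a): one-sample t-test of mean monthly income against Rs 50000.
income_test <- t.test(emp$mi, mu = 50000)

# Task 12(b): one-way ANOVA of performance score across education backgrounds.
edu_anova <- summary(aov(ps ~ edu, data = emp))

# Task 13: Pearson correlation matrix of the four numeric variables.
cor_mat <- cor(emp[num_vars])

# Task 14: performance score regressed on age and retention.
model <- lm(ps ~ age + rt, data = emp)

print(gender_freq); print(edu_freq)
print(ps_outliers); print(mi_outliers)
print(normality_p)
print(income_test); print(edu_anova)
print(round(cor_mat, 2))
print(summary(model))

# Tasks 8, 9 and 11: plots, drawn only when R is used interactively.
if (interactive()) {
  boxplot(emp$ps, main = "Performance score")
  boxplot(emp$mi, main = "Monthly income")
  hist(emp$ps, main = "Performance score", xlab = "Score")
  plot(emp$age, emp$mi, xlab = "Age (in years)", ylab = "Monthly income")
}
```

Keeping each result in a named object makes it possible to inspect individual parts afterwards, for example income_test$p.value or coef(model).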