Reading a CSV File and Sorting a Data Frame
mydata <- read.csv("nameofthedatafile.csv")
What we do is just read the data file and assign the data to an R internal object called 'mydata'.
This is technically called a data frame. Type mydata to display your data frame and see the
columns.
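Here is a minimal sketch of that step that you can run without having a data file at hand. The temporary file and the column names column1 and column2 are made up for illustration; with your own data you would pass your real file name to read.csv().

```r
# Write a tiny example CSV to a temporary file (a stand-in for your real file)
path <- tempfile(fileext = ".csv")
writeLines(c("column1,column2", "3,a", "1,b", "2,c"), path)

# Read the file into a data frame
mydata <- read.csv(path)

mydata          # print the whole data frame
str(mydata)     # show column names and types
names(mydata)   # show just the column names
```

str() is handy for larger files, where printing the whole data frame would flood the console.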
To access one of the columns of the data frame, just use the $ operator:
mydata$column1
To calculate the mean of that column, pass it to the mean() function, as in mean(mydata$column1).
This will work if you don't have NAs in your data. If there are some NAs, mean() returns NA
instead of a number. But we can fix it by telling R to ignore the NAs:
mean(mydata$column1, na.rm=TRUE)
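To see the effect of na.rm for yourself, here is a small sketch with a made-up column that contains one missing value (the numbers are invented for illustration):

```r
# Hypothetical column with one missing value
mydata <- data.frame(column1 = c(10, 20, NA, 30))

mean(mydata$column1)                # NA, because one value is missing
mean(mydata$column1, na.rm = TRUE)  # 20, the mean of 10, 20 and 30
sum(is.na(mydata$column1))          # 1, the number of NAs in the column
```

Counting the NAs first with is.na() is a quick way to check how much data you are dropping.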
That's it. To calculate a standard deviation it's the same procedure, but this time use the sd()
function. As you might guess, the median is computed with the median() function. Both accept
na.rm=TRUE as well. All easy-peasy.
You can also see a bunch of summary statistics of an object with the summary() function:
summary(mydata)
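As a quick illustration, here are sd(), median() and summary() on a made-up column (the values are invented, with one NA thrown in):

```r
# Hypothetical data with one missing value
mydata <- data.frame(column1 = c(2, 4, 4, 4, 5, 5, 7, 9, NA))

sd_value <- sd(mydata$column1, na.rm = TRUE)          # sample standard deviation
median_value <- median(mydata$column1, na.rm = TRUE)  # 4.5

sd_value
median_value

# summary() reports min, quartiles, mean, max and the NA count for each column
summary(mydata)
```

Note that sd() computes the sample standard deviation (it divides by n - 1, not n).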
If you want to add a constant, say 1500, to each of the values of your column1, simply type:
1500+mydata$column1
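Keep in mind that this only prints the result and leaves mydata untouched. A small sketch, assuming you want to keep the shifted values in a new column (the name column1_plus is made up):

```r
# Hypothetical data
mydata <- data.frame(column1 = c(1, 2, 3))

# The constant is added to every element; mydata itself is not changed
1500 + mydata$column1

# To keep the result, store it in a new column (hypothetical name)
mydata$column1_plus <- mydata$column1 + 1500
mydata
```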
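Sorting a data frame
To sort the data frame by column1 in ascending order, use the order() function in the row index:
mysorteddata <- mydata[order(mydata$column1), ]
If you want it sorted in descending order, just add a minus sign before the column name, as in
order(-mydata$column1).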
Now type mysorteddata and see the new data frame: it's now sorted by column1! Cool.
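Here is a self-contained sketch of sorting with order(), in both directions. The data and the name mysorteddata_desc are made up for illustration:

```r
# Hypothetical data
mydata <- data.frame(column1 = c(3, 1, 2), column2 = c("c", "a", "b"))

# Ascending: order() returns the row positions that sort column1
mysorteddata <- mydata[order(mydata$column1), ]

# Descending: put a minus sign before the (numeric) column
mysorteddata_desc <- mydata[order(-mydata$column1), ]

mysorteddata
mysorteddata_desc
```

The minus sign only works for numeric columns. For text columns, order(mydata$column2, decreasing = TRUE) does the same job.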
Now, let's say you want to select only the first 15 rows (those with the lowest column1 values).
Use the head() function:
low <- head(mysorteddata, 15)
And if you want to select only the last 15 rows (those with the highest column1 values), use the
tail() function instead:
high <- tail(mysorteddata, 15)
Then you can apply statistical functions as normal on the new 'low' and 'high' data frames.
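For example, with a made-up data frame that is already sorted in ascending order (40 rows, values 1 to 40), it could look like this:

```r
# Hypothetical sorted data: column1 runs from 1 to 40
mysorteddata <- data.frame(column1 = 1:40)

low <- head(mysorteddata, 15)    # rows with the 15 lowest values (1 to 15)
high <- tail(mysorteddata, 15)   # rows with the 15 highest values (26 to 40)

mean(low$column1)    # 8
mean(high$column1)   # 33
```

This only picks the lowest and highest values if the data frame is sorted in ascending order first.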
Linear regression
Here our buddy is the lm() function. A linear regression comes like this: y = α + βx. In R it
would be like:
lm(y ~ x, somedata)
Here y and x are columns of a data frame called 'somedata'. To see the results of the regression,
use our old friend summary():
summary(lm(y ~ x, somedata))
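If you want to try this without real data, here is a sketch on simulated data. The true intercept (2), slope (3) and the noise are all made up, and storing the model in an object called fit is just one convenient way to reuse it:

```r
# Simulated data: y is roughly 2 + 3x plus some random noise
set.seed(1)
somedata <- data.frame(x = 1:20)
somedata$y <- 2 + 3 * somedata$x + rnorm(20)

# Fit the regression once and keep the result
fit <- lm(y ~ x, data = somedata)

summary(fit)             # full table: estimates, standard errors, p-values
coef(fit)                # just the estimated intercept and slope
summary(fit)$r.squared   # just the R-squared
```

The estimates should land close to 2 and 3, but not exactly on them, because of the noise.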
To plot the data, use the plot() function, and RStudio will show the plot. To add the regression
line, use the abline() function (type it immediately after the plot() command, on the next line):
plot(x=somedata$x, y=somedata$y)
abline(lm(y ~ x, somedata))
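Here is a runnable sketch of the same two steps on simulated data. Writing the plot to a temporary PDF file is my own assumption, so that the code also works outside RStudio; in RStudio you can drop the pdf() and dev.off() lines and the plot will appear in the Plots pane.

```r
# Simulated data: y is roughly 2 + 3x plus some random noise
set.seed(1)
somedata <- data.frame(x = 1:20)
somedata$y <- 2 + 3 * somedata$x + rnorm(20)

# Open a PDF device (optional inside RStudio)
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

plot(x = somedata$x, y = somedata$y)   # scatter plot of the data
abline(lm(y ~ x, somedata))            # add the fitted regression line

dev.off()   # close the device so the file is written
```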
And there you have it, guys! That's all you need to get through this week's homework. If you
want to know more about R and statistics, check out my blog, mathsuser.blogspot.com. It's full of
cool stuff.
Last tip: I had trouble with the illiteracy rate questions. Just remember that the illiteracy
rate is 100 minus the literacy rate (both in percent).
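In R that conversion is a one-liner. The literacy values below are made up, and the variable names are just for illustration:

```r
# Hypothetical literacy rates, in percent
literacy_rate <- c(99.0, 87.5, 62.3)

# Illiteracy rate is simply 100 minus the literacy rate
illiteracy_rate <- 100 - literacy_rate
illiteracy_rate
```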
Good luck!