
Week 3 - Home Work

The document provides instructions for cleaning and analyzing a dataset on US state populations from 2010-2011. It involves: 1) Creating a function to read the CSV file from a URL into a dataframe; 2) Cleaning the dataframe by removing unnecessary columns and rows, and changing column names; 3) Storing the cleaned dataframe and calculating summary statistics like the mean population; 4) Finding the most populous state in 2011 and sorting the data by 2011 population.

Uploaded by

Anu Maria
Copyright
© All Rights Reserved

Step 1: Create a function (named readStates) to read a CSV file into R

1. Note that you are to read a URL, not a file local to your computer.
2. The file is a dataset on state populations (within the United States).

The URL is: http://www2.census.gov/programs-surveys/popest/tables/2010-2011/state/totals/nst-est2011-01.csv

> ReadStates <- function()
+ {
+ # Read the raw Census CSV straight from the URL; header = FALSE keeps every row as data
+ StatePopulation <- read.csv("http://www2.census.gov/programs-surveys/popest/tables/2010-2011/state/totals/nst-est2011-01.csv",header = FALSE)
+ return(StatePopulation)
+ }
> MyData <- ReadStates()
> #MyData
>
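
The reading step can be tried without a network connection. The sketch below is only an illustration: `readCsvRaw`, the temporary file, and its two sample lines are assumptions made for the example, not part of the assignment. It shows that `read.csv` with `header = FALSE` names the columns V1, V2, ... and keeps comma-formatted figures as strings.

```r
# Hypothetical helper: read any CSV source (URL or local path) without
# treating the first row as a header.
readCsvRaw <- function(source) {
  read.csv(source, header = FALSE, stringsAsFactors = FALSE)
}

# Write a tiny stand-in file that mimics the shape of the Census CSV:
# a title row followed by a state row with quoted, comma-formatted figures.
tmp <- tempfile(fileext = ".csv")
writeLines(c("table with row headers,,",
             ".Alabama,\"4,779,736\",\"4,779,735\""), tmp)

preview <- readCsvRaw(tmp)
names(preview)   # default column names assigned by read.csv
preview$V2       # still character strings, commas included
```

Passing the source as a parameter is a design choice for the sketch only; it makes the same function usable with the Census URL or with a local test file.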

Step 2: Clean the dataframe

3. Note the issues that need to be fixed (removing columns, removing rows,
changing column names).
4. Within your function, make sure there are 51 rows (one per state plus the District of
Columbia). Make sure there are only 5 columns, with the following names
(stateName, Jul2010, Jul2011, base2010, base2011).
5. Make sure the last four columns are numbers (i.e. not strings).
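
The three requirements above can be checked mechanically. The following is a minimal sketch, assuming the column names listed in item 4; `checkStates` and the three-row `toy` frame are hypothetical and exist only to make the example self-contained.

```r
# Hypothetical helper: stop with an error unless the frame meets the
# Step 2 requirements (row count, column names, numeric columns 2 to 5).
checkStates <- function(df, expectedRows = 51) {
  expectedNames <- c("stateName", "Jul2010", "Jul2011", "base2010", "base2011")
  stopifnot(
    nrow(df) == expectedRows,
    identical(names(df), expectedNames),
    all(sapply(df[, 2:5], is.numeric))
  )
  invisible(TRUE)
}

# Toy stand-in for the cleaned data (three states instead of 51).
toy <- data.frame(
  stateName = c("Alabama", "Alaska", "Arizona"),
  Jul2010   = c(4779736, 710231, 6392017),
  Jul2011   = c(4779735, 710231, 6392013),
  base2010  = c(4785401, 714146, 6413158),
  base2011  = c(4802740, 722718, 6482505),
  stringsAsFactors = FALSE
)
checkStates(toy, expectedRows = 3)
```

With the real data the call would simply be `checkStates(df)` on the cleaned frame, relying on the default of 51 rows.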

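> # Clean the dataframe
> MyData <- MyData[-1:-9,]     # drop the first 9 rows, which sit above Alabama
> MyData <- MyData[-52:-58,]   # drop the trailing rows after Wyoming (positions 52 to 58)
> MyData <- MyData[1:5]        # keep only the first 5 columns
> Datastored <- MyData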
>
> # Change the column names
> MyData <- Datastored
> colnames(MyData,do.NULL = TRUE, prefix = "col")
[1] "V1" "V2" "V3" "V4" "V5"
> oldnames = c("V1", "V2", "V3", "V4", "V5")
> newnames = c("StateName","July_2010","July_2011","Base_2010","Base_2011")
> for(i in 1:5) names(MyData)[names(MyData) == oldnames[i]] = newnames[i]
> MyData
               StateName  July_2010  July_2011  Base_2010  Base_2011
10              .Alabama  4,779,736  4,779,735  4,785,401  4,802,740
11               .Alaska    710,231    710,231    714,146    722,718
12              .Arizona  6,392,017  6,392,013  6,413,158  6,482,505
13             .Arkansas  2,915,918  2,915,921  2,921,588  2,937,979
14           .California 37,253,956 37,253,956 37,338,198 37,691,912
15             .Colorado  5,029,196  5,029,196  5,047,692  5,116,796
16          .Connecticut  3,574,097  3,574,097  3,575,498  3,580,709
17             .Delaware    897,934    897,934    899,792    907,135
18 .District of Columbia    601,723    601,723    604,912    617,996
19              .Florida 18,801,310 18,801,311 18,838,613 19,057,542
20              .Georgia  9,687,653  9,687,660  9,712,157  9,815,210
21               .Hawaii  1,360,301  1,360,301  1,363,359  1,374,810
22                .Idaho  1,567,582  1,567,582  1,571,102  1,584,985
23             .Illinois 12,830,632 12,830,632 12,841,980 12,869,257
24              .Indiana  6,483,802  6,483,800  6,490,622  6,516,922
25                 .Iowa  3,046,355  3,046,350  3,050,202  3,062,309
26               .Kansas  2,853,118  2,853,118  2,859,143  2,871,238
27             .Kentucky  4,339,367  4,339,362  4,347,223  4,369,356
28            .Louisiana  4,533,372  4,533,372  4,545,343  4,574,836
29                .Maine  1,328,361  1,328,361  1,327,379  1,328,188
30             .Maryland  5,773,552  5,773,552  5,785,681  5,828,289
31        .Massachusetts  6,547,629  6,547,629  6,555,466  6,587,536
32             .Michigan  9,883,640  9,883,635  9,877,143  9,876,187
33            .Minnesota  5,303,925  5,303,925  5,310,658  5,344,861
34          .Mississippi  2,967,297  2,967,297  2,970,072  2,978,512
35             .Missouri  5,988,927  5,988,927  5,995,715  6,010,688
36              .Montana    989,415    989,415    990,958    998,199
37             .Nebraska  1,826,341  1,826,341  1,830,141  1,842,641
38               .Nevada  2,700,551  2,700,551  2,704,283  2,723,322
39        .New Hampshire  1,316,470  1,316,472  1,316,807  1,318,194
40           .New Jersey  8,791,894  8,791,894  8,799,593  8,821,155
41           .New Mexico  2,059,179  2,059,180  2,065,913  2,082,224
42             .New York 19,378,102 19,378,104 19,395,206 19,465,197
43       .North Carolina  9,535,483  9,535,475  9,560,234  9,656,401
44         .North Dakota    672,591    672,591    674,629    683,932
45                 .Ohio 11,536,504 11,536,502 11,537,968 11,544,951
46             .Oklahoma  3,751,351  3,751,354  3,760,184  3,791,508
47               .Oregon  3,831,074  3,831,074  3,838,332  3,871,859
48         .Pennsylvania 12,702,379 12,702,379 12,717,722 12,742,886
49         .Rhode Island  1,052,567  1,052,567  1,052,528  1,051,302
50       .South Carolina  4,625,364  4,625,364  4,637,106  4,679,230
51         .South Dakota    814,180    814,180    816,598    824,082
52            .Tennessee  6,346,105  6,346,110  6,357,436  6,403,353
53                .Texas 25,145,561 25,145,561 25,253,466 25,674,681
54                 .Utah  2,763,885  2,763,885  2,775,479  2,817,222
55              .Vermont    625,741    625,741    625,909    626,431
56             .Virginia  8,001,024  8,001,030  8,023,953  8,096,604
57           .Washington  6,724,540  6,724,540  6,742,950  6,830,038
58        .West Virginia  1,852,994  1,852,996  1,854,368  1,855,364
59            .Wisconsin  5,686,986  5,686,986  5,691,659  5,711,767
60              .Wyoming    563,626    563,626    564,554    568,158
> # Clean the StateName, change Census data to numeric.
>
> MyData$StateName <- gsub("\\.","",MyData[,1])
> MyData$July_2010 <- as.numeric(gsub(",","",MyData[,2]))
> MyData$July_2011 <- as.numeric(gsub(",","",MyData[,3]))
> MyData$Base_2010 <- as.numeric(gsub(",","",MyData[,4]))
> MyData$Base_2011 <- as.numeric(gsub(",","",MyData[,5]))
> DataSorted <- MyData
>
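
The four near-identical conversion lines above can also be written once and applied to several columns. This is a sketch under assumptions: `toNumber`, `raw`, and `numericCols` are hypothetical names, and the two sample rows are copied from the printout above.

```r
# Hypothetical helper: turn strings such as "4,779,736" into numbers.
toNumber <- function(x) as.numeric(gsub(",", "", x))

# Small stand-in for the uncleaned frame (two rows, two population columns).
raw <- data.frame(
  StateName = c(".Alabama", ".Alaska"),
  July_2010 = c("4,779,736", "710,231"),
  July_2011 = c("4,779,735", "710,231"),
  stringsAsFactors = FALSE
)

# Convert every listed column in one pass.
numericCols <- c("July_2010", "July_2011")
raw[numericCols] <- lapply(raw[numericCols], toNumber)

# Remove only the leading dot, so a dot elsewhere in a name would survive.
raw$StateName <- sub("^\\.", "", raw$StateName)
str(raw)
```

Anchoring the pattern with `^` is a deliberate difference from `gsub("\\.", "", ...)`, which removes every dot in the string; for this dataset both give the same names.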

Step 3: Store and Explore the dataset

6. Store the dataset into a dataframe, called dfStates.
7. Test your dataframe by calculating the mean for the July2011 data, by doing:
mean(dfStates$Jul2011) → you should get an answer of 6,053,834

> MyData <- DataSorted
> #Store the dataset into a dataframe, called dfStates
> dfStates <- data.frame(MyData)
>
> #Test your dataframe by calculating the mean for the July2011 data
> Mean <- mean(dfStates$July_2011)
> Mean
[1] 6053834
>

Step 4: Find the state with the Highest Population

8. Based on the July2011 data, what is the population of the state with the highest
population? What is the name of that state?
9. Sort the data, in increasing order, based on the July2011 data.
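
Both questions rest on two base R functions: `which.max` returns the position of the largest value, and `order` returns the row positions in increasing order. A minimal sketch on a three-row stand-in frame follows; `toy`, `top`, and `ascending` are hypothetical names, and the column names are an assumption for the example.

```r
# Three-row stand-in for the cleaned data.
toy <- data.frame(
  StateName = c("California", "Wyoming", "Texas"),
  July_2011 = c(37253956, 563626, 25145561),
  stringsAsFactors = FALSE
)

# Row holding the largest July_2011 value (name and population together).
top <- toy[which.max(toy$July_2011), ]

# Rows rearranged from smallest to largest July_2011 value.
ascending <- toy[order(toy$July_2011), ]

top
ascending
```

Selecting the whole row with `which.max` keeps the state name and its population together, so the two answers cannot drift apart.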

>
> #what is the population of the state with the highest population?
> MaxPop <- max(MyData$July_2011,na.rm = FALSE)
> MaxPop
[1] 37253956
>
> #What is the name of that state?
> MaxPopState <- MyData$StateName[which.max(MyData$July_2011)]
> MaxPopState
[1] "California"
>
> #Sort data in increasing order
> SortedData <- MyData[order(MyData$July_2011),]
> SortedData
              StateName July_2010 July_2011 Base_2010 Base_2011
60              Wyoming    563626    563626    564554    568158
18 District of Columbia    601723    601723    604912    617996
55              Vermont    625741    625741    625909    626431
44         North Dakota    672591    672591    674629    683932
11               Alaska    710231    710231    714146    722718
51         South Dakota    814180    814180    816598    824082
17             Delaware    897934    897934    899792    907135
36              Montana    989415    989415    990958    998199
49         Rhode Island   1052567   1052567   1052528   1051302
39        New Hampshire   1316470   1316472   1316807   1318194
29                Maine   1328361   1328361   1327379   1328188
21               Hawaii   1360301   1360301   1363359   1374810
22                Idaho   1567582   1567582   1571102   1584985
37             Nebraska   1826341   1826341   1830141   1842641
58        West Virginia   1852994   1852996   1854368   1855364
41           New Mexico   2059179   2059180   2065913   2082224
38               Nevada   2700551   2700551   2704283   2723322
54                 Utah   2763885   2763885   2775479   2817222
26               Kansas   2853118   2853118   2859143   2871238
13             Arkansas   2915918   2915921   2921588   2937979
34          Mississippi   2967297   2967297   2970072   2978512
25                 Iowa   3046355   3046350   3050202   3062309
16          Connecticut   3574097   3574097   3575498   3580709
46             Oklahoma   3751351   3751354   3760184   3791508
47               Oregon   3831074   3831074   3838332   3871859
27             Kentucky   4339367   4339362   4347223   4369356
28            Louisiana   4533372   4533372   4545343   4574836
50       South Carolina   4625364   4625364   4637106   4679230
10              Alabama   4779736   4779735   4785401   4802740
15             Colorado   5029196   5029196   5047692   5116796
33            Minnesota   5303925   5303925   5310658   5344861
59            Wisconsin   5686986   5686986   5691659   5711767
30             Maryland   5773552   5773552   5785681   5828289
35             Missouri   5988927   5988927   5995715   6010688
52            Tennessee   6346105   6346110   6357436   6403353
12              Arizona   6392017   6392013   6413158   6482505
24              Indiana   6483802   6483800   6490622   6516922
31        Massachusetts   6547629   6547629   6555466   6587536
57           Washington   6724540   6724540   6742950   6830038
56             Virginia   8001024   8001030   8023953   8096604
40           New Jersey   8791894   8791894   8799593   8821155
43       North Carolina   9535483   9535475   9560234   9656401
20              Georgia   9687653   9687660   9712157   9815210
32             Michigan   9883640   9883635   9877143   9876187
45                 Ohio  11536504  11536502  11537968  11544951
48         Pennsylvania  12702379  12702379  12717722  12742886
23             Illinois  12830632  12830632  12841980  12869257
19              Florida  18801310  18801311  18838613  19057542
42             New York  19378102  19378104  19395206  19465197
53                Texas  25145561  25145561  25253466  25674681
14           California  37253956  37253956  37338198  37691912
>

Step 5: Explore the distribution of the states

10. Write a function that takes two parameters. The first is a vector and the
second is a number.
11. The function will return the percentage of the elements within the vector
that are less than the given number (i.e. the cumulative distribution below the
value provided).
12. For example, if the vector had 5 elements (1,2,3,4,5), with 2 being the
number passed into the function, the function would return 0.2 (since 20% of
the numbers were below 2).
13. Test the function with the vector ‘dfStates$Jul2011Num’, and the mean of
‘dfStates$Jul2011Num’.
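
The worked example in item 12 can be reproduced in a few lines. This is one possible sketch, not the required solution: `fractionBelow` is a hypothetical name, and it relies on the fact that the mean of a logical vector is the share of TRUE values.

```r
# Hypothetical helper: fraction of elements strictly below the threshold.
# values < threshold yields TRUE/FALSE; mean() of that is the share of TRUEs.
fractionBelow <- function(values, threshold) {
  mean(values < threshold)
}

# Only the element 1 is below 2, so one element out of five qualifies.
fractionBelow(c(1, 2, 3, 4, 5), 2)   # 0.2
```

The result is a fraction between 0 and 1, matching the 0.2 in the example; multiplying by 100 would express it as a percent instead.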

> MyFunction <- function(MyVector,MyNumber)
+ {
+ Value <- MyVector < MyNumber              # TRUE where an element is below MyNumber
+ MyVal <- length(which(Value))             # number of elements below MyNumber
+ NumVect <- length(MyVector)               # total number of elements
+ CumulativeDist <- (MyVal/NumVect)*100     # share below MyNumber, in percent
+ return(CumulativeDist)
+ }
> MyFunction(dfStates$July_2011,Mean)
[1] 66.66667
