
Programming for Data Science Assignment-2

The document outlines the steps needed to clean and prepare a dataset in RStudio, focusing on filtering, renaming columns, and converting data types for analysis. It details the use of functions like str(), summary(), and View() to assess the data structure and identify necessary modifications. Additionally, it includes instructions for creating functions to convert temperature units and perform basic arithmetic operations.

Uploaded by

Faisal Mohammed

1.

After I have this file in RStudio, I will likely need to make the following changes:
i. Since we only want data for the 50 US states, I'll need to filter out the rows for the District
of Columbia and Puerto Rico, as well as the non-data rows above and below the table.
ii. Columns such as population estimates, changes, and percentages may be read in as
character strings. These will need to be converted to numeric types for calculations and
analysis.
iii. Some values are represented by a dash ('-'), indicating zero or no data. We will need to
replace these dashes with NA or 0.
iv. The column headers are spread across multiple rows. We will need to collapse these into a single
row by providing correct column names.
v. We will have to make sure the data types align with our needs for analysis.
2. Importing Dataset:
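A minimal sketch of the import step, assuming the source is a CSV file. The inline text is a made-up stand-in for the real file, whose name and path are not given here, and `raw_text` is a hypothetical name; with the actual file, `read.csv()` would receive the file path instead of the `text` argument.

```r
# Made-up stand-in for the census CSV: a title row, a header row, then data.
raw_text <- paste(
  "table with row headers in column A,,",
  "Geographic Area,April 2010,July 2011",
  ".Alabama,\"4,779,736\",\"4,802,740\"",
  ".Alaska,\"710,231\",\"722,718\"",
  sep = "\n"
)

# With the default header = TRUE, the title row is consumed as column names
# and the real header row ends up as the first data row, so every column
# is read as character.
MyData <- read.csv(text = raw_text, stringsAsFactors = FALSE)

print(dim(MyData))
print(sapply(MyData, class))
```

Because the header text lands inside the data, the numeric columns arrive as character, which is why the later cleaning steps are needed.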
3. Using the str(), summary(), and View() functions:

From the str() function, we can observe that:


a) The first column name, i.e.
“table.with.row.headers.in.column.A.and.column.headers.in.rows.3.through.5...leading.dots.indicate
.sub.parts.”, is a descriptive title rather than data.
b) In Excel there are no columns X.8, X.9, and X.10, but R reads columns X.8, X.9,
and X.10 as empty (all NA values).
c) We will need to drop the first row with the table title.
d) We will need to drop the empty columns (X.8, X.9, and X.10) since they don’t contain any data.
e) We need to use the content in rows 2 through 4 as headers and manually rename the columns based
on what each column represents.
f) We need to remove rows that don’t contain actual data, in this case the top rows and a few
bottom rows.
From the summary() function:

We can observe that:


1. R reads all columns (X to X.7) as text, since it shows them as character data type. This is because the
CSV includes text headers and other non-numeric data.
2. R reads columns X.8, X.9, and X.10 as logical type with NA values throughout all 69 rows, which
means that they are empty.
3. Row 1 contains the table title and the following 3 rows contain headers, which caused R to interpret
the columns as character.
4. We need to remove columns X.8, X.9, and X.10 because they are empty.
5. We need to rename the columns, since the current column names are not meaningful
because the CSV file's headers are spread across multiple rows.
6. We need to convert the data types of columns X to X.7 to numeric, after cleaning up any text
headers.
7. As per our requirement, we will also need to remove the rows for the United States, the 4 regions, the District of
Columbia, and Puerto Rico so as to obtain data for only the 50 states.
From the View() function:

We can now visually observe that


a) R is reading columns X.8, X.9, and X.10 as all-NA columns.
b) Rows 1 to 4 and rows 63 to 69 don’t contain analytical data, only descriptive text.
c) Columns X.4 and X.5 contain the same data, suggesting that the ranking data is not correct in all columns;
therefore we cannot rely on columns X.4 to X.7 for any analysis.
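The inspection step can be sketched on a small stand-in data frame. The values and the name `toy` are made up for illustration; only the pattern (numbers stored as text plus an all-NA column) mirrors what the observations above describe.

```r
# Toy data frame (made-up values) mimicking the imported file:
# numbers stored as text plus one completely empty column.
toy <- data.frame(
  X   = c("April 2010", "4,779,736", "710,231"),
  X.1 = c("July 2011", "4,802,740", "722,718"),
  X.8 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

str(toy)      # column types: character for X and X.1, logical for X.8
summary(toy)  # reports the NA count for the empty column

# Columns in which every value is NA are candidates for removal.
empty_cols <- names(toy)[colSums(!is.na(toy)) == 0]
print(empty_cols)

# View(toy) would open the spreadsheet-style viewer in an interactive
# RStudio session; it is left out so the sketch also runs non-interactively.
```

Detecting the empty columns programmatically is one option; reading them off the str() output, as done above, works equally well for a file this small.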
4. As per our requirement, the possible column names are:
1. "State",
2. "Population_Estimates_Base_April_2010",
3. "Population_Estimates_Base_July_2011",
4. "Population_Change_2010_2011",
5. "Population_Change_Percentage",
6. "Population_Ranking_April_2010",
7. "Population_Ranking_July_2011",
8. "Ranking_Change_2010_2011",
9. "Ranking_Change_Percentage"
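As a sketch, the names listed above can be held in a vector and assigned with colnames(). The placeholder data frame is made up and only serves to give colnames() something with nine columns to work on.

```r
# Column names as proposed in the list above.
MyColNames <- c(
  "State",
  "Population_Estimates_Base_April_2010",
  "Population_Estimates_Base_July_2011",
  "Population_Change_2010_2011",
  "Population_Change_Percentage",
  "Population_Ranking_April_2010",
  "Population_Ranking_July_2011",
  "Ranking_Change_2010_2011",
  "Ranking_Change_Percentage"
)

# Placeholder data frame with nine columns (contents are made up).
MyData <- as.data.frame(matrix(NA_character_, nrow = 2, ncol = 9),
                        stringsAsFactors = FALSE)

# The vector length must match the number of columns before assigning.
stopifnot(length(MyColNames) == ncol(MyData))
colnames(MyData) <- MyColNames
print(colnames(MyData))
```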
5.
a) Remove any unneeded rows at the top of the file.
b) Use the View() function to show that the changes have been made.
c) Remove any columns which are not needed by telling RStudio to keep only
the columns you want in the dataframe.
d) Remove the unneeded rows at the bottom of the dataframe; these are either blank or contain data
which we do not need.
e) Create a vector called MyColNames which will hold the new column names.
f) Use the colnames() function to give the MyData dataframe column names from the vector you
created in the previous step.
g) Use the View() function to see if the column names have been added.
h) You will notice from the output of the View() function that the States have a dot before them. Use
the gsub() function to remove these dots and to set the State column data type as character. Use the
View() function and the str() function to verify that the dots have been removed and that the data
type is correct.
i) You will note that the remaining columns are all character data, but they should be numeric data.
Use the gsub() function to remove the commas from the numbers and as.numeric() to change these
columns to numeric. You will use one command for each column. Make all the changes and list
the code you used below. Then use the str() function to show that the data types are correct.
Code Used:
MyData$Population_Estimates_Base_April_2010 <- as.numeric(gsub(",", "", MyData$Population_Estimates_Base_April_2010))

MyData$Population_Estimates_Base_July_2011 <- as.numeric(gsub(",", "", MyData$Population_Estimates_Base_July_2011))

MyData$Population_Change_2010_2011 <- as.numeric(gsub(",", "", MyData$Population_Change_2010_2011))

MyData$Population_Change_Percentage <- as.numeric(gsub(",", "", MyData$Population_Change_Percentage))

MyData$Population_Ranking_April_2010 <- as.numeric(gsub(",", "", MyData$Population_Ranking_April_2010))

MyData$Population_Ranking_July_2011 <- as.numeric(gsub(",", "", MyData$Population_Ranking_July_2011))

MyData$Ranking_Change_2010_2011 <- as.numeric(gsub(",", "", MyData$Ranking_Change_2010_2011))

MyData$Ranking_Change_Percentage <- as.numeric(gsub(",", "", MyData$Ranking_Change_Percentage))


j) You may have received a warning that NAs were introduced by coercion. This means that
as.numeric() encountered a value it could not parse as a number. Here the Percent column value for Maine
was so small that it would not show up in a reasonable number of decimal places, so the file holds a dash
instead of a number and R used the NA value. Set the value of the Percent column for the Maine row to zero.
Then use the View() function to show that the value has been set to zero.
k) Everything should look good to this point. How many rows are in your dataframe? Use the str()
function to show the number of rows.

We have 51 rows in the dataframe.


l) You have probably noticed that there are 51 rows, but the US only has 50 States. That is because we
removed the Puerto Rico data when we removed the rows at the bottom of the dataframe. We did
not remove the data for the District of Columbia. Use the View() function to find the row number for
the District of Columbia, and then delete that row.

The MyData dataframe is now cleaned.
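The later cleaning steps (h through l) can be sketched on a three-row stand-in. The values are illustrative only, and the sketch swaps the one-command-per-column approach used above for a single lapply() pass and removes the District of Columbia by name instead of by row number; both are alternatives, not the method the assignment prescribes.

```r
# Toy data (illustrative values) in the state the real data reaches after
# the header rows and empty columns have been dropped.
MyData <- data.frame(
  State = c(".Maine", ".District of Columbia", ".Texas"),
  Population_Estimates_Base_April_2010 = c("1,328,361", "601,723", "25,145,561"),
  Population_Change_Percentage = c("-", "2.7", "2.1"),
  stringsAsFactors = FALSE
)

# Remove the leading dot; "^\\." matches a literal dot at the start only.
MyData$State <- gsub("^\\.", "", MyData$State)

# Strip commas and convert every non-State column in one pass.
# The dash cannot be parsed, so it becomes NA (warning suppressed here).
num_cols <- setdiff(names(MyData), "State")
MyData[num_cols] <- lapply(MyData[num_cols], function(col) {
  suppressWarnings(as.numeric(gsub(",", "", col)))
})

# Set the NA produced for Maine to zero, as the assignment asks.
MyData$Population_Change_Percentage[MyData$State == "Maine"] <- 0

# Drop the District of Columbia and renumber the rows.
MyData <- MyData[MyData$State != "District of Columbia", ]
rownames(MyData) <- NULL

str(MyData)
```

Filtering by name avoids depending on a row number that shifts whenever earlier rows are removed.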


6. Create a function to convert Celsius degrees to Fahrenheit. Test the function with the following
inputs: -40, 0, 20, 100. Show the code to create the function, and the results of each run of the
function for the above four values.
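One possible sketch of the Celsius-to-Fahrenheit function, using the standard formula F = C * 9/5 + 32. The function name `celsius_to_fahrenheit` is a hypothetical choice.

```r
# Convert a temperature from degrees Celsius to degrees Fahrenheit.
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

# Run the function once for each of the four test inputs.
for (temp in c(-40, 0, 20, 100)) {
  cat(temp, "C =", celsius_to_fahrenheit(temp), "F\n")
}
```

The four inputs should give -40, 32, 68, and 212 degrees Fahrenheit respectively.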
7. Create a function that will convert Fahrenheit to Celsius. Use the following values to test the
function: -40, 32, 68, 212. Show the code to create the function, and the results of each run of the
function for the above four values.
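One possible sketch of the Fahrenheit-to-Celsius function, using the standard formula C = (F - 32) * 5/9. The function name `fahrenheit_to_celsius` is a hypothetical choice.

```r
# Convert a temperature from degrees Fahrenheit to degrees Celsius.
fahrenheit_to_celsius <- function(fahrenheit) {
  (fahrenheit - 32) * 5 / 9
}

# Run the function once for each of the four test inputs.
for (temp in c(-40, 32, 68, 212)) {
  cat(temp, "F =", fahrenheit_to_celsius(temp), "C\n")
}
```

The four inputs should give -40, 0, 20, and 100 degrees Celsius respectively.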
8. Write a function to add two numbers together. Show your results with the numbers 278 and -136.
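A minimal sketch of such a function; the name `add_two` is a hypothetical choice.

```r
# Return the sum of two numbers.
add_two <- function(a, b) {
  a + b
}

result <- add_two(278, -136)
print(result)  # 142
```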
9. Write a function that will accept three numbers and will return the largest number.
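A minimal sketch with explicit comparisons; the name `largest_of_three` and the test values are hypothetical. R's built-in max() would give the same result in one call, and the comparisons are written out only to show the logic.

```r
# Return the largest of three numbers using pairwise comparisons.
largest_of_three <- function(a, b, c) {
  largest <- a
  if (b > largest) largest <- b
  if (c > largest) largest <- c
  largest
}

print(largest_of_three(3, 9, 5))
print(largest_of_three(-1, -7, -4))
```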
10. Write a function called average that will accept a set of 5 numbers and will return the average of
these numbers.
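A minimal sketch, assuming the five numbers are passed as a single numeric vector; that calling convention and the test values are assumptions, since the task does not say how the set is supplied.

```r
# Return the arithmetic mean of exactly five numbers.
average <- function(numbers) {
  stopifnot(is.numeric(numbers), length(numbers) == 5)
  sum(numbers) / length(numbers)
}

print(average(c(1, 2, 3, 4, 5)))
```

The built-in mean() performs the same calculation; the sum is divided by the count explicitly here to show the steps.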
