Introduction To Data Wrangling
By Ayamba Victor Ndoma
Questions for a Dataset
Let's use pandas to take a look at a sample dataset from Kaggle!
From the dataset, what are good questions you can ask based on this information?
The dataset we are going to explore is the Breast Cancer Wisconsin dataset, obtained from
Kaggle.
More information about the columns in the dataset is available at this link. Follow the link to
1. ID number
The following ten features are computed for each cell nucleus. For each of these ten features,
a column is created for the mean, standard error, and max value. Table 1.0 on the next slide
describes each feature of the dataset.
Breast Cancer Dataset Description
Table 1.0 Breast cancer dataset feature description
Feature            Description
Radius             Mean of distances from center to points on the perimeter
Texture            Standard deviation of gray-scale values
Perimeter
Area
Smoothness         Local variation in radius lengths
Compactness        Perimeter^2 / Area - 1.0
Concavity          Severity of concave portions of the contour
Concave Points     Number of concave portions of the contour
Symmetry
Fractal Dimension  "Coastline approximation" - 1
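As a rough sketch of how ten features and three statistics turn into thirty measurement columns, the snippet below builds one possible set of column names. The snake_case spelling and the `_mean`, `_SE`, and `_max` suffixes are assumptions made for illustration; the actual file may name its columns differently.

```python
# Feature names follow Table 1.0; the snake_case spelling and the
# "_mean" / "_SE" / "_max" suffixes are assumptions for illustration only.
features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
statistics = ["mean", "SE", "max"]

# Group by statistic: all mean columns first, then standard error, then max.
measurement_columns = [
    f"{feature}_{stat}" for stat in statistics for feature in features
]

print(len(measurement_columns))
print(measurement_columns[:3])
```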
What Types of Questions Should We Ask?
Data Gathering and Reading CSV Files in Python
Data Gathering
Data gathering can happen in a number of ways:
Data can be downloaded from online repositories such as Kaggle and UCI, where files are readily available.
Data may also need to be combined from multiple sources in different formats.
Datasets used for analysis are usually stored in a format called CSV (Comma-Separated Values).
A CSV file is a text file with a tabular structure that holds only raw data.
CSV files are similar to Excel files, except that the values in each row are separated by commas,
which makes them easy and fast to process with code such as Python.
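To make the "plain text with commas" idea concrete, here is a minimal sketch that parses a CSV held in a string. The rows are made up for illustration, and `io.StringIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up rows, used only to show what CSV text looks like.
csv_text = """id,diagnosis,radius_mean
101,M,17.99
102,B,13.54
"""

# read_csv accepts a file path or a file-like object; StringIO plays the
# role of a file here so the example needs nothing on disk.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.dtypes)
```

The first line of the text becomes the column names, and each following line becomes one row.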
Reading CSV (Continued)
For more information on how to use the `read_csv` function, always refer to the official
pandas documentation here.
Quiz #1
Use `read_csv()` to read in `cancer_data.csv` and use an appropriate column as the index.
Then, use `.head()` on your dataframe to see if you've done this correctly.
*Hint: First call `read_csv()` with only the file name and then `head()` to see what the data looks
like.
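One way to approach this kind of task is sketched below on a tiny made-up table rather than on `cancer_data.csv` itself. It assumes the file has an `id` column that identifies each row; check the output of `head()` on the real data before choosing the index.

```python
import io

import pandas as pd

# Made-up stand-in for the real file; the column names are assumptions.
csv_text = """id,diagnosis,radius_mean
101,M,17.99
102,B,13.54
103,B,12.45
"""

# Step 1: read with defaults and inspect the first rows.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.head())

# Step 2: read again, using the identifier column as the index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col="id")
print(df_indexed.head())
```

With `index_col` set, `id` is no longer an ordinary column; it labels the rows instead.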
Quiz #2
Use `read_csv()` to read in `powerplant_data.csv` with more descriptive column names based on
Then, use `.head()` on your dataframe to see if you've done this correctly.
*Hint: Like in the previous quiz, first call `read_csv()` without parameters and then `head()` to see
Kindly note that the dataset has also been provided to you on Google Classroom and in the WhatsApp
group chat.
More information about the dataset from the website is shown on the next slide.
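The renaming step can be sketched as follows. Both the terse header and the descriptive labels below are assumptions chosen for illustration, and the values are made up; take the real names from the dataset description.

```python
import io

import pandas as pd

# Made-up stand-in for the power plant file with short, cryptic headers.
csv_text = """AT,V,AP,RH,PE
15.0,41.8,1024.1,73.2,463.3
25.2,63.0,1020.0,59.1,444.4
"""

# Hypothetical descriptive labels, one per column, in file order.
new_labels = ["temperature", "exhaust_vacuum", "pressure", "humidity", "energy_output"]

# header=0 marks the first line as the existing header so that `names`
# replaces it instead of treating it as a data row.
df_powerplant = pd.read_csv(io.StringIO(csv_text), header=0, names=new_labels)
print(df_powerplant.head())
```

Without `header=0`, passing `names` alone would keep the original header line as the first data row.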
Quiz #2 Information
Writing to CSV
Now, we'll save the dataframe from the second quiz, with the power plant data, into a CSV file
for more analysis later.
df_powerplant.to_csv('powerplant_data_edited.csv')
Checking to see if it works
df = pd.read_csv('powerplant_data_edited.csv')
df.head()
What's this `Unnamed: 0`? `to_csv()` will store our index unless we tell it not to. To make it
ignore the index, we have to provide the parameter `index=False`.
df_powerplant.to_csv('powerplant_data_edited.csv', index=False)
df = pd.read_csv('powerplant_data_edited.csv')
df.head()
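The effect of `index=False` can be checked without touching the disk. The sketch below writes a small made-up dataframe to in-memory buffers and compares the columns that come back; the column names and values are placeholders.

```python
import io

import pandas as pd

# Small made-up dataframe standing in for the power plant data.
df_example = pd.DataFrame(
    {"temperature": [15.0, 25.2], "energy_output": [463.3, 444.4]}
)

# Default behaviour: the index is written as an extra, unnamed first column.
buffer_with_index = io.StringIO()
df_example.to_csv(buffer_with_index)
buffer_with_index.seek(0)
columns_with_index = list(pd.read_csv(buffer_with_index).columns)

# index=False: only the real columns are written.
buffer_without_index = io.StringIO()
df_example.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
columns_without_index = list(pd.read_csv(buffer_without_index).columns)

print(columns_with_index)
print(columns_without_index)
```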
Assessing and Building Intuition
Assessing and Building Intuition (Continued)
We can select data using `loc` and `iloc`, which you can read more about here. `loc` selects data by row or
column labels, while `iloc` selects by integer position. We'll use these to index the dataframe below.
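A small sketch of the difference, using a made-up three-row dataframe indexed by `id` (the column names and values are placeholders). Note that a `loc` slice includes its end label, while an `iloc` slice excludes its end position.

```python
import pandas as pd

# Made-up rows; the index holds the row labels that loc works with.
df = pd.DataFrame(
    {
        "diagnosis": ["M", "B", "B"],
        "radius_mean": [17.99, 13.54, 12.45],
        "texture_mean": [10.38, 14.36, 15.70],
    },
    index=pd.Index([101, 102, 103], name="id"),
)

# loc: select by label; both ends of each slice are included.
by_label = df.loc[101:102, "diagnosis":"radius_mean"]

# iloc: select by integer position; the end of each slice is excluded.
by_position = df.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```

Both selections return the same two rows and two columns, reached in two different ways.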
Selecting Multiple Ranges in Pandas
Selecting the columns for the mean dataframe was pretty straightforward - the columns we needed
to select were all together (`id`, `diagnosis`, and the mean columns).
Now we run into a little issue when we try to do the same for the standard errors or maximum
values. `id` and `diagnosis` are separated from the rest of the columns we need!
We can't specify all of these in one range.
First, try creating the standard error dataframe on your own to see why doing this with just `loc`
and `iloc` is an issue.
Then, use this Stack Overflow link to learn how to select multiple ranges in pandas and try it below.
By the way, to figure this out myself, I found this link by googling "how to select multiple
ranges df.iloc".
Hint: You may have to import a new package!
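One common approach, sketched here on a made-up eight-column dataframe, is NumPy's `np.r_`, which concatenates several slices into a single array of positions that `iloc` accepts. The column names and positions below are placeholders; the ranges for the real dataset will differ.

```python
import numpy as np
import pandas as pd

# Made-up layout: identifiers first, then mean, standard error, and max columns.
columns = [
    "id", "diagnosis",
    "radius_mean", "texture_mean",
    "radius_SE", "texture_SE",
    "radius_max", "texture_max",
]
df = pd.DataFrame([[101, "M", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]], columns=columns)

# np.r_ joins the two position ranges 0-1 and 4-5 into one index array.
positions = np.r_[0:2, 4:6]
df_se = df.iloc[:, positions]

print(positions)
print(list(df_se.columns))
```

The same pattern works for the max columns by swapping in the second range that covers them.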
Conclusion
In this section, we considered the first part of the data analysis process: data
gathering. We saw how to read and write CSV files and how to build intuition using the pandas
library. In the next lesson, we will see how to clean these datasets and check for issues such as
References
1. Learn.udacity.com
2. https://stackoverflow.com/questions/41256648/select-multiple-ranges-of-columns-in-pandas-dataframe
3. https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html