Introduction To Data Wrangling

This document discusses data wrangling and analyzing a breast cancer dataset. It covers reading CSV files into Pandas dataframes, assessing the data to build intuition, selecting portions of data, and writing dataframes back to CSV. Key steps include loading the dataset, describing the features, asking questions of the data, and preparing for further cleaning and analysis.

Introduction to Data Wrangling
By Ayamba Victor Ndoma
Questions for a Dataset

Let's use pandas to take a look at a sample dataset from Kaggle!

From the dataset, what are good questions you can ask based on this information?

The dataset we are going to explore is the Breast Cancer Wisconsin dataset, obtained from Kaggle.

There is more information about the columns of the dataset in this link. Follow the link to explore the dataset further.


Information About the Breast Cancer Dataset
Attribute Information:

1. ID number

2. Diagnosis (M = malignant, B = benign), which shows whether a patient has cancer or not.

3. 30 features (columns)

Ten features are computed for each cell nucleus. For each of these ten features, a column is created for the mean, standard error, and max value, which gives 10 x 3 = 30 feature columns. Table 1.0 on the next slide describes each feature of the dataset.
Breast Cancer Dataset Description
Table 1.0 Breast cancer dataset feature description

Feature            Description
Radius             Mean of distances from center to points on the perimeter
Texture            Standard deviation of gray-scale values
Perimeter
Area
Smoothness         Local variation in radius lengths
Compactness        Perimeter^2 / Area - 1.0
Concavity          Severity of concave portions of the contour
Concave Points     Number of concave portions of the contour
Symmetry
Fractal Dimension  "Coastline approximation" - 1
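
The arithmetic behind the 30 feature columns can be sketched in a few lines of Python. The column names below (for example `radius_mean`) are assumed for illustration only; the names in the actual Kaggle file may differ.

```python
# Ten base features measured for each cell nucleus (see Table 1.0).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Three statistics are stored per feature. These suffixes are assumed
# names for this sketch, not necessarily the ones used in the file.
statistics = ["mean", "se", "max"]

# Grouped by statistic: all means first, then standard errors, then maxima.
feature_columns = [
    f"{feature}_{stat}" for stat in statistics for feature in base_features
]

print(len(feature_columns))  # 30
print(feature_columns[:3])   # ['radius_mean', 'texture_mean', 'perimeter_mean']
```

Together with the ID and diagnosis columns, this would give 32 columns in total.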
What Type of Questions Should We Ask?
Data Gathering and Reading CSV Files in Python
Data Gathering
Data gathering can happen in a number of ways:

Either by downloading files that are readily available from online repositories such as Kaggle and UCI

Or by getting data from an API or by web scraping

Or by pulling data from existing databases

There may also be a need to combine data from multiple different formats.

Datasets used for analysis are usually in a format called CSV (Comma Separated Values).

A CSV file is a text file with a tabular structure that holds only raw data.

CSV files are almost like Excel files, except that the values in each row are separated by commas, which makes them easy and fast to process with code such as Python.
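
To make the structure of a CSV file concrete, here is a minimal sketch that parses a few lines of CSV text with pandas. The rows and column names are made up for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# A tiny, made-up CSV: the first line holds the column names and each
# following line is one row, with values separated by commas.
csv_text = """id,diagnosis,radius_mean
101,M,17.5
102,B,12.3
103,B,11.8
"""

# read_csv accepts a file path or a file-like object, so StringIO lets us
# parse the text without creating a file on disk.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)          # (3, 3)
print(list(df_demo.columns))  # ['id', 'diagnosis', 'radius_mean']
```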
Reading CSV Continues..

For more information on how to use the read_csv function, always refer to the official
documentation here.
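
As a rough sketch of two commonly used `read_csv` parameters, the example below reads semicolon-separated text and uses one column as the row index. The data is invented for this example; with a real file, the path would be passed in place of the `StringIO` object.

```python
import io

import pandas as pd

# Made-up data standing in for a real file on disk.
csv_text = "id;diagnosis;radius_mean\n101;M;17.5\n102;B;12.3\n"

# sep sets the delimiter (a semicolon here instead of the default comma);
# index_col names the column to use as the row index.
df_indexed = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="id")

print(df_indexed.head())
print(df_indexed.index.name)  # id
```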
Quiz #1

Use `read_csv()` to read in `cancer_data.csv` and use an appropriate column as the index.

Then, use `.head()` on your dataframe to see if you've done this correctly.

*Hint: First call `read_csv()` with only the file name and then `head()` to see what the data looks like.*
Quiz #2

Use `read_csv()` to read in `powerplant_data.csv` with more descriptive column names based on the description of features on this website.

Then, use `.head()` on your dataframe to see if you've done this correctly.

*Hint: Like in the previous quiz, first call `read_csv()` with only the file name and then `head()` to see what the data looks like.*

Kindly note that the dataset has also been provided to you on Google Classroom and in the WhatsApp group chat.

More information about the dataset from the website is shown on the next slide.
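
One possible way to replace short column labels while reading is sketched below. The abbreviated headers and the descriptive names are hypothetical and are not the real power plant columns; use the feature descriptions from the website for the quiz itself.

```python
import io

import pandas as pd

# Made-up file contents with terse header labels (hypothetical).
csv_text = "T,H,E\n20.5,60.1,455.2\n25.1,48.7,440.9\n"

# Descriptive replacements, also hypothetical.
new_labels = ["temperature", "humidity", "energy_output"]

# header=0 tells pandas that the first line holds the old labels, and
# names= supplies the labels to use instead.
df_renamed = pd.read_csv(io.StringIO(csv_text), header=0, names=new_labels)

print(df_renamed.head())
```

Without `header=0`, the original header line would be read as a row of data.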
Quiz #2 Information
Writing to CSV
Now we'll save the dataframe from the second quiz, which holds the power plant data, to a CSV file for further analysis.
df_powerplant.to_csv('powerplant_data_edited.csv')
Checking to see if it works:
df = pd.read_csv('powerplant_data_edited.csv')
df.head()
What's this `Unnamed: 0`? `to_csv()` will store our index unless we tell it not to. To make it ignore the index, we have to provide the parameter `index=False`:
df_powerplant.to_csv('powerplant_data_edited.csv', index=False)
df = pd.read_csv('powerplant_data_edited.csv')
df.head()
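
The effect of `index=False` can be checked with a small round trip. This is a self-contained sketch that uses an invented dataframe and a temporary file rather than the quiz data.

```python
import os
import tempfile

import pandas as pd

# Small made-up dataframe standing in for the power plant data.
df_small = pd.DataFrame(
    {"temperature": [20.5, 25.1], "energy_output": [455.2, 440.9]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")

    # Default behaviour: the index is written as an extra first column
    # with an empty header.
    df_small.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False: only the data columns are written.
    df_small.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))     # ['Unnamed: 0', 'temperature', 'energy_output']
print(list(without_index.columns))  # ['temperature', 'energy_output']
```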
Assessing and Building Intuition

We can select data using `loc` and `iloc`, which you can read more about here. `loc` selects rows and columns by their labels, while `iloc` selects them by integer position. We'll use these to index the dataframe below.
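
The difference between the two can be sketched on a small made-up dataframe (the values and column names are assumptions for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "diagnosis": ["M", "B", "B"],
        "radius_mean": [17.5, 12.3, 11.8],
        "texture_mean": [10.4, 17.8, 21.3],
    },
    index=[101, 102, 103],
)

# loc selects by label; label slices include both endpoints.
by_label = df_demo.loc[101:102, "diagnosis":"radius_mean"]

# iloc selects by integer position; the end of a slice is excluded.
by_position = df_demo.iloc[0:2, 0:2]

print(by_label.equals(by_position))  # True
```

Both selections return the first two rows and the first two columns; only the way they are addressed differs.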
Selecting Multiple Ranges in Pandas

Selecting the columns for the mean dataframe was pretty straightforward: the columns we needed to select were all together (`id`, `diagnosis`, and the mean columns).
Now we run into a little issue when we try to do the same for the standard errors or maximum values. `id` and `diagnosis` are separated from the rest of the columns we need!
We can't specify all of these in one range.
First, try creating the standard error dataframe on your own to see why doing this with just `loc` and `iloc` is an issue.
Then, use this Stack Overflow link to learn how to select multiple ranges in Pandas and try it below.
By the way, to figure this out myself, I just found this link by googling "how to select multiple ranges df.iloc".
Hint: You may have to import a new package!
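
One approach to selecting several column ranges at once uses NumPy's `np.r_` together with `iloc`. The sketch below uses a made-up dataframe with shortened, hypothetical column names; the positions would need to be adjusted for the real dataset.

```python
import numpy as np
import pandas as pd

# Two leading columns, then three "mean" and three "se" columns (made up).
columns = ["id", "diagnosis", "a_mean", "b_mean", "c_mean", "a_se", "b_se", "c_se"]
df_demo = pd.DataFrame([[1, "M", 1.0, 2.0, 3.0, 0.1, 0.2, 0.3]], columns=columns)

# np.r_ joins several slices into one array of positions: [0, 1, 5, 6, 7].
positions = np.r_[0:2, 5:8]

# iloc accepts that array, so both ranges are selected in one step.
df_se = df_demo.iloc[:, positions]

print(list(df_se.columns))  # ['id', 'diagnosis', 'a_se', 'b_se', 'c_se']
```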
Conclusion

In this section, we have considered the first part of the data analysis process, which is data gathering. We have seen how to read and write CSV files and how to build intuition about a dataset using the Pandas library. In the next lesson, we will see how to clean these datasets and check for issues such as missing values, incorrect data types, duplicates, and structural issues.


References

1. Learn.udacity.com

2. https://stackoverflow.com/questions/41256648/select-multiple-ranges-of-columns-in-pandas-dataframe

3. https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
