Introduction To Data Wrangling

This document discusses data wrangling and analyzing a breast cancer dataset. It covers reading CSV files into Pandas dataframes, assessing the data to build intuition, selecting portions of data, and writing dataframes back to CSV. Key steps include loading the dataset, describing the features, asking questions of the data, and preparing for further cleaning and analysis.

By Ayamba Victor Ndoma
Questions for a Dataset

Let's use pandas to take a look at a sample dataset from Kaggle!

Based on the information in the dataset, what are good questions you can ask?

The dataset we are going to explore is the Breast Cancer Wisconsin dataset, obtained from Kaggle.

There is more information about the columns of the dataset at this link. Follow the link to explore the dataset further.

Information About the Breast Cancer Dataset
Attribute Information:

1. ID number

2. Diagnosis (M = malignant, B = benign), which shows whether a patient has cancer or not.

3. 30 feature columns.

Ten features are computed for each cell nucleus. For each of these ten features, a column is created for the mean, standard error, and maximum value, which gives the 30 feature columns (10 features x 3 statistics). Table 1.0 on the next slide describes each feature of the dataset.
Breast Cancer Dataset Description
Table 1.0 Breast cancer dataset feature description

| Feature | Description |
| --- | --- |
| Radius | Mean of distances from center to points on the perimeter |
| Texture | Standard deviation of gray-scale values |
| Perimeter | |
| Area | |
| Smoothness | Local variation in radius lengths |
| Compactness | Perimeter^2 / Area - 1.0 |
| Concavity | Severity of concave portions of the contour |
| Concave Points | Number of concave portions of the contour |
| Symmetry | |
| Fractal Dimension | "Coastline approximation" - 1 |
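To get a feel for the compactness formula in Table 1.0, here is a small worked sketch. The function name `compactness` is made up for illustration. It evaluates the formula for a circle, where perimeter^2 / area is always 4*pi, so the result does not depend on the radius.

```python
import math


def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined in Table 1.0: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


# For a circle of radius r: perimeter = 2*pi*r and area = pi*r^2,
# so perimeter^2 / area = 4*pi whatever the value of r.
for r in (1.0, 5.0):
    value = compactness(2 * math.pi * r, math.pi * r ** 2)
    print(f"radius={r}: compactness={value:.3f}")
```

Both radii give the same value, which shows that the feature describes shape rather than size.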
What Types of Questions Should We Ask?
Data Gathering and Reading CSV Files in Python
Data Gathering
Data gathering can happen in a number of ways:

By downloading files that are readily available from online repositories such as Kaggle and UCI

By getting data from an API or by web scraping

By pulling data from existing databases

There may also be a need to combine data from multiple different formats.

Datasets used for analysis are usually in a format called CSV (Comma-Separated Values).

A CSV file is a text file with a tabular structure that holds only raw data.

CSV files are similar to Excel files, except that the values in each row are separated by commas, which makes them easy and fast to process with code such as Python.
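To make this concrete, here is a minimal sketch of what CSV text looks like and how pandas turns it into a table. The rows and the `radius_mean` column are made up for illustration and are not taken from the actual dataset.

```python
import io

import pandas as pd

# A CSV file is plain text: one row per line, values separated by commas.
# These rows are invented sample values, not real records.
csv_text = """id,diagnosis,radius_mean
101,M,17.9
102,B,12.4
103,B,11.8
"""

# io.StringIO lets pandas read the text as if it were a file on disk.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.shape)  # (number of rows, number of columns)
```

The first line of the text becomes the column names, and every following line becomes one row of the dataframe.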
Reading CSV Continues..

For more information on how to use the `read_csv` function, always refer to the official pandas documentation.
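As a rough sketch of the options you are most likely to need, the example below uses the `sep`, `index_col`, `header` and `names` parameters of `read_csv`. The sample text and the column names are invented for illustration.

```python
import io

import pandas as pd

# Invented sample text that uses ";" instead of "," as the separator.
raw = "name;age\nAda;36\nLinus;29\n"

# sep tells read_csv which character separates the values.
df = pd.read_csv(io.StringIO(raw), sep=";")

# index_col uses one of the columns as the row labels.
df_indexed = pd.read_csv(io.StringIO(raw), sep=";", index_col="name")

# header=0 together with names replaces the labels found in the first line.
df_renamed = pd.read_csv(
    io.StringIO(raw), sep=";", header=0, names=["first_name", "age_years"]
)

print(df.head())
print(df_indexed.head())
print(df_renamed.head())
```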
Quiz #1

Use `read_csv()` to read in `cancer_data.csv` and use an appropriate column as the index.

Then, use `.head()` on your dataframe to see if you've done this correctly.

*Hint: First call `read_csv()` with only the file name and then `head()` to see what the data looks like.*
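One possible approach is sketched below. It assumes the file has an `id` column that identifies each row. The in-memory sample stands in for `cancer_data.csv` so the sketch runs without the real file, and its values and the `radius_mean` and `texture_mean` columns are made up.

```python
import io

import pandas as pd

# Stand-in for cancer_data.csv with invented values.
# With the real file, pass 'cancer_data.csv' to read_csv instead.
sample = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean\n"
    "1001,M,17.99,10.38\n"
    "1002,B,13.54,14.36\n"
)

# First look: read with default settings and inspect the first rows.
df = pd.read_csv(sample)
print(df.head())

# The id column identifies each row, so it is a reasonable index.
sample.seek(0)  # rewind the in-memory file before reading it again
df = pd.read_csv(sample, index_col="id")
print(df.head())
```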
Quiz #2

Use `read_csv()` to read in `powerplant_data.csv` with more descriptive column names based on the description of features on this website.

Then, use `.head()` on your dataframe to see if you've done this correctly.

*Hint: Like in the previous quiz, first call `read_csv()` with only the file name and then `head()` to see what the data looks like.*

Note that the dataset has also been provided to you on Google Classroom and in the WhatsApp group chat.

More information about the dataset from the website is shown on the next slide.
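A minimal sketch of one way to do this is shown below. The short labels in the sample (`AT`, `V`, `AP`, `RH`, `PE`), the values, and the descriptive names are all assumptions for illustration. Check the real file and the feature description before choosing your own names.

```python
import io

import pandas as pd

# Stand-in for powerplant_data.csv. The short labels and the values are
# assumptions for illustration, not taken from the real file.
sample = io.StringIO(
    "AT,V,AP,RH,PE\n"
    "8.34,40.77,1010.84,90.01,480.48\n"
    "23.64,58.49,1011.40,74.20,445.75\n"
)

# Hypothetical descriptive names, one per column, in file order.
labels = [
    "temperature",
    "exhaust_vacuum",
    "pressure",
    "humidity",
    "energy_output",
]

# header=0 marks the first line as the old header so names can replace it.
df_powerplant = pd.read_csv(sample, header=0, names=labels)
print(df_powerplant.head())
```

Passing `header=0` matters here: without it, pandas would treat the old header line as a row of data.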
Quiz #2 Information
Writing to CSV
Now, we'll save the dataframe from the second quiz, which holds the power plant data, to a CSV file for further analysis.
df_powerplant.to_csv('powerplant_data_edited.csv')
Checking to see if it works
df = pd.read_csv('powerplant_data_edited.csv')
df.head()
What's this `Unnamed: 0`? `to_csv()` will store our index unless we tell it not to. To make it ignore the index, we have to provide the parameter `index=False`.
df_powerplant.to_csv('powerplant_data_edited.csv', index=False)
df = pd.read_csv('powerplant_data_edited.csv')
df.head()
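The effect of `index=False` can be checked with a small self-contained sketch. The dataframe below is an invented stand-in for `df_powerplant`, and the file is written to a temporary folder so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Small invented dataframe standing in for df_powerplant.
df_powerplant = pd.DataFrame(
    {"temperature": [8.3, 23.6], "humidity": [90.0, 74.2]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "powerplant_data_edited.csv")

    # Default: the index is written as an extra, unnamed first column.
    df_powerplant.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False: only the data columns are written.
    df_powerplant.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))     # includes 'Unnamed: 0'
print(list(without_index.columns))  # data columns only
```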
Assessing and Building Intuition
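As a minimal sketch, these are common pandas calls for a first look at a dataframe. The sample rows and the `radius_mean` column are invented for illustration, and one value is left missing on purpose.

```python
import pandas as pd

# Invented sample rows; the column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "id": [1001, 1002, 1003, 1004],
        "diagnosis": ["M", "B", "B", "M"],
        "radius_mean": [17.99, 13.54, None, 20.57],
    }
)

print(df.shape)           # number of rows and columns
print(df.dtypes)          # data type of each column
df.info()                 # non-null counts and memory usage
print(df.nunique())       # number of distinct values per column
print(df.isnull().sum())  # missing values per column
print(df.describe())      # summary statistics for numeric columns
print(df["diagnosis"].value_counts())  # rows per diagnosis class
```

Each call answers a simple question about the data, such as how big it is, what types it holds, and where values are missing.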
Assessing and Building Intuition Continues..

We can select data using `loc` and `iloc`, which you can read more about in the pandas indexing documentation (reference 3). `loc` selects rows or columns by their labels, while `iloc` selects them by their integer positions. We'll use these to index the dataframe below.
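The difference between the two can be seen in a short sketch. The dataframe is an invented sample with `id` as the index. Note that a `loc` slice includes both endpoints, while an `iloc` slice excludes the stop position.

```python
import pandas as pd

# Invented sample rows with id as the index.
df = pd.DataFrame(
    {
        "diagnosis": ["M", "B", "B"],
        "radius_mean": [17.99, 13.54, 12.45],
        "texture_mean": [10.38, 14.36, 15.70],
    },
    index=[1001, 1002, 1003],
)
df.index.name = "id"

# loc selects by label; a label slice includes both endpoints.
by_label = df.loc[1001:1002, "diagnosis":"radius_mean"]

# iloc selects by integer position; the stop position is excluded.
by_position = df.iloc[0:2, 0:2]

print(by_label)
print(by_position)
print(by_label.equals(by_position))  # both select the same cells
```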
Selecting Multiple Ranges in Pandas

Selecting the columns for the mean dataframe was pretty straightforward: the columns we needed to select were all together (`id`, `diagnosis`, and the mean columns).
Now we run into a little issue when we try to do the same for the standard errors or maximum values. `id` and `diagnosis` are separated from the rest of the columns we need!
We can't specify all of these in one range.
First, try creating the standard error dataframe on your own to see why doing this with just `loc` and `iloc` is an issue.
Then, use the Stack Overflow link (reference 2) to learn how to select multiple ranges in pandas and try it below.
By the way, to figure this out myself, I just found this link by googling "how to select multiple ranges df.iloc".
Hint: You may have to import a new package!
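One common way to do this is with NumPy's `np.r_`, which joins several slices into a single array of positions. The sketch below assumes NumPy is the package the hint refers to. The dataframe is a miniature, invented version of the layout described above, and its column names are assumptions.

```python
import numpy as np
import pandas as pd

# Invented miniature of the layout described above: id and diagnosis first,
# then mean, standard error and maximum columns. Names are assumptions.
df = pd.DataFrame(
    {
        "id": [1001, 1002],
        "diagnosis": ["M", "B"],
        "radius_mean": [17.99, 13.54],
        "texture_mean": [10.38, 14.36],
        "radius_SE": [1.09, 0.27],
        "texture_SE": [0.91, 0.79],
        "radius_max": [25.38, 15.11],
        "texture_max": [17.33, 19.26],
    }
)

# The mean columns sit next to id and diagnosis, so one range is enough.
df_means = df.iloc[:, :4]

# np.r_ concatenates several slices into one array of positions:
# np.r_[:2, 4:6] gives the positions 0, 1, 4, 5.
df_se = df.iloc[:, np.r_[:2, 4:6]]
df_max = df.iloc[:, np.r_[:2, 6:8]]

print(list(df_se.columns))
print(list(df_max.columns))
```

With the full dataset the idea is the same, only the slice boundaries change to match where each group of columns starts and ends.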
Conclusion

In this section, we have considered the first part of the data analysis process, which is data gathering. We have seen how to read and write CSV files and how to build intuition about a dataset using the pandas library. In the next lesson, we will see how to clean these datasets and check for issues such as missing values, incorrect data types, duplicates, and structural issues.

References

1. Learn.udacity.com

2. https://stackoverflow.com/questions/41256648/select-multiple-ranges-of-columns-in-pandas-dataframe

3. https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
