
Gathering Data: E-Predictions/image-Predictions - TSV

This document summarizes data wrangling efforts for a project involving three datasets. The tasks included gathering data from multiple sources, assessing for quality and tidiness issues, and cleaning the data. Key steps taken were identifying missing values, inconsistent formatting, merging datasets, and consolidating columns. The cleaned data was then stored as a new CSV file.

Uploaded by

pola osama


This report briefly describes my wrangling efforts.

Project details
The tasks of this project are as follows:
• Gathering data
• Assessing data
• Cleaning data

1. Gathering Data

I gathered the first dataset by manually downloading the twitter-archive-enhanced.csv file that Udacity provided.

The second dataset was downloaded programmatically from Udacity's server using the requests library (URL =
https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)
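
A minimal sketch of this download step, assuming the requests library is installed. The helper name download_file and its target_dir parameter are illustrative and do not come from the original notebook.

```python
from pathlib import Path

import requests

# Location of image-predictions.tsv on Udacity's server.
URL = ("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
       "599fd2ad_image-predictions/image-predictions.tsv")


def download_file(url, target_dir="."):
    """Download `url` and save it under its own file name in `target_dir`."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    # Use the last path segment of the URL as the local file name.
    target = Path(target_dir) / url.rsplit("/", 1)[-1]
    target.write_bytes(response.content)
    return target
```

Calling download_file(URL) would write image-predictions.tsv to the working directory; the file could then be loaded with pandas.read_csv(path, sep="\t").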

The third dataset was gathered via the Twitter API by using the Tweepy
library and stored as a JSON file. Specifically, I gathered information about
the favorite count and retweet count for each tweet as well as the tweet id.
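
The report does not state how the JSON file is laid out. The sketch below assumes one JSON object per line and shows how the three fields could be read back from it. The function name read_tweet_counts and the sample tweets are made up for illustration; the keys id, retweet_count and favorite_count are the field names used in Twitter API status objects.

```python
import io
import json


def read_tweet_counts(lines):
    """Extract tweet id, retweet count and favorite count from JSON lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": str(tweet["id"]),  # keep the id as text, not a number
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Two made-up tweets standing in for the stored file.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 12}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}\n'
)
print(read_tweet_counts(sample))
```

The resulting list of dictionaries can be passed directly to pandas.DataFrame to build a table such as twitter_add_info.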

2. Assessing Data

I assessed the data visually (e.g., using the head() and sample() functions)
and programmatically to find quality and tidiness issues.
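
As a rough illustration of the two kinds of assessment, the sketch below runs a few checks on a made-up DataFrame. The rows are invented; only the column names follow the report.

```python
import pandas as pd

# Made-up rows that mimic a few columns of twitter_archive.
twitter_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "name": ["Charlie", "None", "a", "None"],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
})

# Visual assessment: look at a few rows.
print(twitter_archive.head(2))
print(twitter_archive.sample(2, random_state=0))

# Programmatic assessment: data types, placeholder strings, duplicates.
print(twitter_archive.dtypes)
none_names = (twitter_archive["name"] == "None").sum()
lowercase_names = twitter_archive["name"].str.islower().sum()
duplicate_ids = twitter_archive["tweet_id"].duplicated().sum()
print(none_names, lowercase_names, duplicate_ids)
```

Counting all-lowercase entries in the name column is one possible heuristic for spotting words that are not real names; it is an assumption here, not necessarily the check used in the project.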

Quality Issues

Here is a list of quality issues I found for the three datasets:

a) DataFrame twitter_archive

- retweeted_user_id and retweeted_status_id columns: the archive contains
some retweets

- expanded_urls column: some tweets/retweets have no images

- timestamp column: not in datetime format

- name column: the string "None" appears 745 times (missing data, but not
encoded as NaN)

- name column: some entries are not real names (O, a, not, ...)

- tweet_id: is int, should be type object as no calculation is needed

- text and rating_numerator columns: some tweets include more than one
rating and/or decimal numbers, which leads to wrong or missing data in the
rating_numerator and rating_denominator columns

- pupper, puppo, floofer and doggo columns: for 1976 IDs there is no dog
"stage" information

- pupper, puppo, floofer and doggo columns: some IDs have more than one
dog "stage" (two dogs are rated)

- missing column for the fraction rating_numerator / rating_denominator
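
The rating issues above (several ratings per tweet, decimal numerators, no fraction column) could be approached with a regular expression along the following lines. This is a sketch, not the code used in the project: the helper names are hypothetical, and taking the first rating in the text is an assumption that will not be right for every tweet.

```python
import re

# Numerator and denominator may each carry a decimal part, e.g. "13.5/10".
RATING = re.compile(r"(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)")


def extract_ratings(text):
    """Return every (numerator, denominator) pair found in `text` as floats."""
    return [(float(n), float(d)) for n, d in RATING.findall(text)]


def rating_fraction(text):
    """Fraction numerator/denominator of the first rating, or None if absent."""
    ratings = extract_ratings(text)
    if not ratings or ratings[0][1] == 0:
        return None  # no rating found, or a zero denominator
    numerator, denominator = ratings[0]
    return numerator / denominator
```

Tweets for which extract_ratings returns more than one pair are the ones worth inspecting by hand. The pattern also matches other slash-separated numbers such as "24/7", so its results still need a plausibility check.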

b) DataFrame predictions

- p1, p2, p3 columns: dog breed names are not consistently lowercase or uppercase

- tweet_id is int, should be type object as no calculation is needed

- img_num column does not contain new information

c) DataFrame twitter_add_info
- tweet_id is int, should be type object as no calculation is needed
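
Several of the quality issues listed above are type or formatting problems. The sketch below shows how they might be fixed with pandas on made-up data; the IDs and values are invented, and the variable names archive_clean and predictions_clean are illustrative.

```python
import pandas as pd

# Made-up rows; only the column names follow the report.
archive = pd.DataFrame({
    "tweet_id": [1001, 1002],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
})
predictions = pd.DataFrame({
    "tweet_id": [1001],
    "p1": ["Golden_Retriever"],
})

# Work on copies so the raw data stays untouched.
archive_clean = archive.copy()
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

predictions_clean = predictions.copy()
predictions_clean["tweet_id"] = predictions_clean["tweet_id"].astype(str)
# Normalize breed names to lowercase for consistency.
predictions_clean["p1"] = predictions_clean["p1"].str.lower()

print(archive_clean.dtypes)
print(predictions_clean)
```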

Tidiness Issues

Here is a list of tidiness issues I found:

- twitter_archive: 4 columns (doggo, floofer, pupper and puppo) for one
variable (dog stage)

- predictions: the dog breed prediction could be packed into one column
(breed_pred)
- predictions: the prediction confidence could be packed into one column
(pred_confidence)

- predictions: jpg_url, breed_pred and pred_confidence should be joined to
the twitter_archive DataFrame

- twitter_add_info: the favorite_count and retweet_count columns should be
joined to the twitter_archive DataFrame
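
The two joins described above could look like the following sketch. The data is made up, and the choice of a left join (keeping every row of twitter_archive) is an assumption, since the report does not say which join type was used.

```python
import pandas as pd

# Made-up data; column names follow the report, values are invented.
twitter_archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "rating_numerator": [13, 12, 11],
})
predictions = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "jpg_url": ["http://example.com/a.jpg", "http://example.com/b.jpg"],
    "breed_pred": ["pug", "chihuahua"],
    "pred_confidence": [0.91, 0.47],
})
twitter_add_info = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "favorite_count": [100, 250, 30],
    "retweet_count": [10, 40, 2],
})

# Join both tables onto the archive using the shared tweet_id key.
complete_df = (
    twitter_archive
    .merge(predictions[["tweet_id", "jpg_url", "breed_pred", "pred_confidence"]],
           on="tweet_id", how="left")
    .merge(twitter_add_info[["tweet_id", "favorite_count", "retweet_count"]],
           on="tweet_id", how="left")
)
print(complete_df)
```

With a left join, tweets that have no image prediction stay in the result with missing values in the prediction columns. An inner join would drop them instead.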

3. Cleaning Data

Before I started cleaning the datasets I created copies of each dataset.

Afterwards, I tried to fix every problem programmatically. First, I fixed the
quality issues regarding missing or wrong data. Some issues were fixed in
one cleaning step because they were closely related. In one case I had to
change the dog stage manually. The code I wrote works in most cases, but it
does not take the order in which the words occur into account: if doggo
occurs before pupper, the dog stage is set to doggo even when pupper is the
intended stage and doggo is only part of another word, such as "didodoggo"
for ID 817777686764523521. As this is the only case in the dataset where the
function does not work properly, I changed the dog_stage for this ID
manually to pupper.
After this, I fixed the tidiness issues, merged the datasets, and fixed the
remaining quality issues.
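
One way to avoid the "didodoggo" problem is to match the stage names only as whole words. The sketch below is an alternative approach under that assumption, not the function used in the project; the helper names and the example sentences are made up.

```python
import re

STAGES = ("doggo", "floofer", "pupper", "puppo")
# \b keeps "doggo" inside a longer word such as "didodoggo" from matching;
# the optional "s" also accepts plurals like "puppers".
STAGE_PATTERN = re.compile(r"\b(" + "|".join(STAGES) + r")s?\b", re.IGNORECASE)


def find_dog_stages(text):
    """Return the distinct dog stages mentioned in `text`, in order of appearance."""
    found = []
    for match in STAGE_PATTERN.finditer(text):
        stage = match.group(1).lower()
        if stage not in found:
            found.append(stage)
    return found


def dog_stage(text):
    """Single dog_stage value: stages joined by ', ', or None if none is found."""
    stages = find_dog_stages(text)
    return ", ".join(stages) if stages else None


print(dog_stage("Meet this didodoggo. What a pupper"))
print(dog_stage("A doggo and two puppers"))
```

Joining multiple stages into one string is only one option for tweets that rate two dogs; such rows could also be flagged for manual review.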

4. Storing Data

After cleaning the data, I stored the final cleaned dataset as a CSV file:
complete_df_clean.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)
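
A CSV file does not store column types, so the string type of tweet_id is lost on reload unless it is requested explicitly. The sketch below illustrates this with made-up data written to a temporary directory; the path handling is illustrative only.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned master DataFrame.
complete_df_clean = pd.DataFrame({
    "tweet_id": ["1001", "1002"],
    "dog_stage": ["pupper", None],
})

path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
complete_df_clean.to_csv(path, encoding="utf-8", index=False)

# Without the dtype argument, read_csv would infer tweet_id as an integer again.
reloaded = pd.read_csv(path, dtype={"tweet_id": str})
print(reloaded)
```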
