
This report briefly describes my wrangling efforts.

Project details
The tasks of this project are as follows:
• Gathering data
• Assessing data
• Cleaning data

1. Gathering Data

I gathered the first dataset by manually downloading the twitter-archive-enhanced.csv file that Udacity provided to me.

The second dataset was programmatically downloaded from Udacity's server using the requests library (URL = https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv).
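
A minimal sketch of what that download step could look like (the local file name is an assumption):

import requests

# Download the image predictions TSV from Udacity's server and save it locally.
url = ('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/'
       '599fd2ad_image-predictions/image-predictions.tsv')
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

with open('image-predictions.tsv', 'wb') as f:
    f.write(response.content)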

The third dataset was gathered via the Twitter API by using the Tweepy
library and stored as a JSON file. Specifically, I gathered information about
the favorite count and retweet count for each tweet as well as the tweet id.
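
A hedged sketch of that API step, assuming Tweepy v3, placeholder credentials, and tweet_json.txt as the output file name (the exact code in the notebook may differ):

import json

import pandas as pd
import tweepy

# Placeholder credentials; the real keys are not part of this report.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Query each tweet ID from the archive and keep only the fields of interest.
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
with open('tweet_json.txt', 'w') as f:
    for tweet_id in twitter_archive.tweet_id:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
            json.dump({'tweet_id': status.id,
                       'favorite_count': status.favorite_count,
                       'retweet_count': status.retweet_count}, f)
            f.write('\n')
        except tweepy.TweepError:
            pass  # skip tweets that have been deleted or are otherwise unavailable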

2. Assessing Data

I assessed the data visually (e.g., using the head() and sample() functions)
and programmatically to find quality and tidiness issues.
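
For example, the assessment of the archive looked roughly like this (a sketch, not the full assessment):

import pandas as pd

twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

# Visual assessment: inspect a handful of rows.
twitter_archive.head()
twitter_archive.sample(5)

# Programmatic assessment: column types, non-null counts, suspicious values.
twitter_archive.info()
twitter_archive.name.value_counts().head(10)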

Quality Issues

Here is a list of quality issues I found for the three datasets (a short sketch of some of the corresponding programmatic checks follows the list):

a) DataFrame twitter_archive

- retweeted_status_user_id and retweeted_status_id columns: there are some retweets

- expanded_urls column: tweets/retweets without images

- timestamp column: not in datetime format

- name column: "None" appears 745 times (missing data, but not NaN)

- name column: some names are invalid (e.g., "O", "a", "not", ...)

- tweet_id: is int, should be type object as no calculation is needed

- text and rating_numerator column: tweets that include more than one
rating and/or decimal numbers, hence, wrong or missing data in the
rating_numerator and rating_denominator column

- pupper, puppo, floofer and doggo columns: for 1976 IDs there is no dog "stage" information.

- pupper, puppo, floofer and doggo columns: there are some IDs with more than one dog "stage" (two dogs are rated).

- missing column for the fraction of rating_numerator and rating_denominator

b) DataFrame predictions

- p1, p2, p3 columns: dog breed names are not consistently lowercase or uppercase

- tweet_id is int, should be type object as no calculation is needed

- img_num column does not contain new information

c) DataFrame twitter_add_info
- tweet_id is int, should be type object as no calculation is needed
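
As referenced above, here is a short sketch of some of the programmatic checks behind these findings; it assumes the stage columns store the string "None" for missing values, as the name column does:

import pandas as pd

twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

# Rows that are retweets (retweeted_status_id is not null).
twitter_archive.retweeted_status_id.notnull().sum()

# Tweet IDs with more than one dog stage recorded across the four columns.
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
(twitter_archive[stage_cols] != 'None').sum(axis=1).gt(1).sum()

# Ratings written with decimals in the tweet text (e.g. "13.5/10").
twitter_archive[twitter_archive.text.str.contains(r'\d+\.\d+/\d+')]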

Tidiness Issues

Here is a list of tidiness issues I found:


- twitter_archive: 4 columns (doggo, floofer, pupper and puppo) for one
variable (dog stage)

- predictions: the dog breed prediction could be packed into one column
(breed_pred)
- predictions: the prediction confidence could be packed into one column
(pred_confidence)

- predictions: jpg_url, breed_pred and pred_confidence should be joined to the twitter_archive DataFrame

- twitter_add_info: the favorite_count and retweet_count columns should be joined to the twitter_archive DataFrame

3. Cleaning Data

Before I started cleaning the datasets, I created a copy of each dataset.
Afterwards, I tried to fix every problem programmatically. First, I fixed the
quality issues regarding missing/wrong data. Some issues were fixed in one
cleaning step because they were closely related. In one case I had to change
the dog stage information manually: the code I wrote works in most cases, but
it does not take the order of occurrence of the stage words into account (if
"doggo" occurs before "pupper", the dog stage will be doggo even when pupper
is the intended stage and "doggo" is only part of a longer word, as in
"didodoggo" for ID 817777686764523521). As there is only one case in this
dataset where the function does not work properly, I changed the dog_stage for
this ID manually to pupper.
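
The report does not show the exact extraction code, but a hedged sketch of logic with the limitation described above might look like this (column and function names are illustrative):

import numpy as np
import pandas as pd

twitter_archive_clean = pd.read_csv('twitter-archive-enhanced.csv').copy()

def extract_stage(text):
    # Return the first stage keyword found via a plain substring search.
    # Limitation described above: a fixed search order and no word-boundary
    # check, so "doggo" wins even when it is only part of a longer word.
    for stage in ['doggo', 'floofer', 'pupper', 'puppo']:
        if stage in text.lower():
            return stage
    return np.nan

twitter_archive_clean['dog_stage'] = twitter_archive_clean.text.apply(extract_stage)

# The single row where this heuristic fails was corrected by hand.
twitter_archive_clean.loc[
    twitter_archive_clean.tweet_id == 817777686764523521, 'dog_stage'] = 'pupper'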
After this, I fixed the tidiness issues, merged the datasets, and fixed the
remaining quality issues.
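
A hedged sketch of those tidiness fixes and the final merge; file and column names follow the report, but the p*_conf / p*_dog column names and the consolidation rule for the breed predictions are assumptions:

import pandas as pd

twitter_archive_clean = pd.read_csv('twitter-archive-enhanced.csv')
predictions_clean = pd.read_csv('image-predictions.tsv', sep='\t')
twitter_add_info = pd.read_json('tweet_json.txt', lines=True)

# Consolidate the three breed predictions into breed_pred / pred_confidence,
# keeping the highest-ranked prediction that is actually a dog (one plausible
# rule; the report does not spell out the exact one used).
def consolidate(row):
    for i in (1, 2, 3):
        if row[f'p{i}_dog']:
            return pd.Series([row[f'p{i}'].lower(), row[f'p{i}_conf']],
                             index=['breed_pred', 'pred_confidence'])
    return pd.Series([None, None], index=['breed_pred', 'pred_confidence'])

predictions_clean[['breed_pred', 'pred_confidence']] = (
    predictions_clean.apply(consolidate, axis=1))

# Join the prediction and API columns onto the archive on tweet_id.
complete_df_clean = (
    twitter_archive_clean
    .merge(predictions_clean[['tweet_id', 'jpg_url', 'breed_pred',
                              'pred_confidence']], on='tweet_id', how='left')
    .merge(twitter_add_info[['tweet_id', 'favorite_count', 'retweet_count']],
           on='tweet_id', how='left')
)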

4. Storing Data

After cleaning the data, I stored the final cleaned dataset as a CSV file:

complete_df_clean.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)
