RM - Pandas - Importing Data

This document is a tutorial on importing data using the Pandas library in Python, covering various file formats including CSV, text, and JSON. It explains how to read these files into dataframes, handle missing values, and perform operations such as specifying delimiters and data types. Additionally, it demonstrates how to read files directly from URLs and manage JSON data structures.

MDS5103/MSBA5104 Segment 06

PANDAS: IMPORTING DATA - TUTORIAL

Table of Contents

1. Reading CSV Files

2. Reading Text Files

3. Reading Files Using URL

4. Handling JSON Files

©COPYRIGHT 2022 (Ver. 1.0), ALL RIGHTS RESERVED. MANIPAL ACADEMY OF HIGHER EDUCATION 2/15

Introduction
Flat files are basic text files containing records of tabular data. A record is a row of fields
or attributes. Each column of a flat file is a feature or an attribute, for example, name,
address, age or income. Flat files typically contain a header as the first row, which
describes the contents of each column.

Files also have extensions. For example, ‘.csv’ represents a file with comma-separated
values and ‘.txt’ represents a text file. In a ‘.csv’ file, the values in each row are separated by
a comma. In a ‘.txt’ file, values can be separated by other characters or sequences of characters,
for example, a tab. Such a character or sequence of characters is called a ‘delimiter’.
In data science, this 2-dimensional labelled data is read from a file into a
dataframe. Several operations, such as manipulation, slicing, reshaping, group-by, joins
and merging, can be performed on a dataframe. They are carried out at various stages
of analysis, such as exploratory data analysis, data pre-processing, model building and
visualisation. This topic covers reading such files and preparing them for further
manipulation.


1. Reading CSV Files


As a first step, get the current working directory using the ‘getcwd()’ function of the ‘os’
module. It can be changed to another directory using the ‘chdir()’ function. Note that files
written with a relative path are also saved to the current working directory.

import os
os.getcwd()

Output
'E:\\Python for ML'
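
The call above only reports the working directory. A minimal sketch of changing it with ‘chdir()’ is given here. The temporary directory is a stand-in introduced for illustration (an assumption, not part of the tutorial files); in practice it would be the folder that holds the data.

```python
import os
import tempfile

# Remember the starting directory so the change can be undone.
original_dir = os.getcwd()

# A temporary directory stands in for a real data folder (illustrative only).
with tempfile.TemporaryDirectory() as data_dir:
    os.chdir(data_dir)            # set the working directory
    current_dir = os.getcwd()     # relative file names now resolve here
    print(current_dir)
    os.chdir(original_dir)        # restore before the folder is deleted
```

Restoring the original directory at the end keeps later relative paths pointing where they did before.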

The required libraries and classes can be imported as shown below.

import pandas as pd
from pandas import Series, DataFrame
import numpy as np

In the code below ‘sample2.csv’ is a comma-separated file. It can be read using the
‘read_csv()’ function as shown below. The complete path of the file can be specified. If
the complete path of the file is not specified, the file is searched for in the current working
directory. The ‘sep’ or ‘delimiter’ parameter can be used to specify the delimiter. By
default, for a CSV file, it is assumed to be ‘,’. The delimiter can be explicitly provided within
single quotes or double quotes as shown in the code below.

dat = pd.read_csv('sample2.csv', sep=',', header=0)
dat.head()


Output
   Age  Customer  Income  Location
0   30         A    1200        A1
1   32         B    1333        A2
2   33        NA      NA        NA

The ‘read_csv()’ function reads the content of the file and converts it into a dataframe
as shown above. It is a good practice to display the contents of the dataframe using
the head() method after the data is read.
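
To experiment without a file on disk, the same call can be tried on text held in memory. The sketch below assumes contents for ‘sample2.csv’ that are modelled on the output above; the real file is not reproduced in this tutorial, so the text is only a stand-in.

```python
import io
import pandas as pd

# Stand-in for 'sample2.csv'; the exact contents of the real file are assumed.
csv_text = """Age,Customer,Income,Location
30,A,1200,A1
32,B,1333,A2
33,NA,NA,NA
"""

# read_csv() accepts any file-like object, so StringIO behaves like a file path.
dat = pd.read_csv(io.StringIO(csv_text), sep=',', header=0)

print(dat.head())
print(dat.shape)   # (rows, columns)
```

Checking ‘shape’ along with ‘head()’ is a quick way to confirm that all rows and columns were read.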

The ‘read_table()’ function can also be used to read the contents of a file with tabular
data. The default delimiter in this case is ‘\t’, the tab character.

Typically, the first line of the file contains the header or column names for the data.
The location of the header row can be specified using the ‘header’ parameter. In the
code below, header = 0 implies that the first row of the data contains the column
names. Since the ‘read_table()’ function is used and the file is comma-separated, the
‘sep’ parameter should be specified.

# Use the pd.read_table() function to import the file
dat1 = pd.read_table('sample2.csv', sep=',', header=0)
dat1.head()

Output
   Age  Customer  Income  Location
0   30         A  1200.0        A1
1   32         B  1333.0        A2
2   33        NA      NA        NA
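
The relationship between the two functions can be checked with a short sketch. The in-memory text is an assumed stand-in for a comma-separated file, not one of the tutorial files.

```python
import io
import pandas as pd

csv_text = "Age,Customer,Income,Location\n30,A,1200,A1\n32,B,1333,A2\n"

# read_csv() splits on commas by default.
from_csv = pd.read_csv(io.StringIO(csv_text))
# read_table() splits on tabs by default, so the comma is passed explicitly.
from_table = pd.read_table(io.StringIO(csv_text), sep=',')

print(from_csv.equals(from_table))

# Without sep=',' read_table() finds no tabs and returns a single wide column.
one_column = pd.read_table(io.StringIO(csv_text))
print(one_column.shape)
```

A dataframe with only one column after reading is a common sign that the wrong delimiter was used.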

Once the data is read into a dataframe, it is recommended to check the data type of
each column present in the data. This is done by displaying the ‘dtypes’ attribute.


dat.dtypes

Output
Age int64
Customer object
Income float64
Location object
dtype: object

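The output above indicates that ‘Age’ is treated as an integer and ‘Income’ as a float, while
‘Customer’ and ‘Location’ are stored as ‘object’, the type pandas uses here for text. ‘Income’ is a
float rather than an integer because the column contains a missing value, which a column of
type ‘int64’ cannot hold.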
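
Data types can also be requested while reading, through the ‘dtype’ parameter of ‘read_csv()’. The sketch below uses assumed in-memory data modelled on the output above. ‘Int64’ (with a capital I) is the nullable integer type in pandas, chosen here only to show that an integer column can still hold a missing value.

```python
import io
import pandas as pd

csv_text = "Age,Customer,Income,Location\n30,A,1200,A1\n32,B,1333,A2\n33,NA,NA,NA\n"

# Inferred types: Income becomes float64 because NaN cannot be stored in int64.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Requested types: a dictionary maps column names to the desired data types.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'Age': 'float64', 'Income': 'Int64'},
)
print(typed.dtypes)
```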

2. Reading Text Files


The file ‘sample1.txt’ is a text file in which the data is stored with ‘tab’ as the separator.
The content can be read using the ‘read_table()’ function with ‘\t’ as the separator.

dat2 = pd.read_table('sample1.txt', sep='\t')
dat2.head()

Output
   Age  Customer   Income  Location
0   30         A     1200        A1
1   32         B     1333        A2
2   33       NAN  Missing       NAN

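The output above indicates that there are a few missing values in this dataframe and
that missing values are not always represented in the same way: the ‘Income’ column uses
the string ‘Missing’. By default, pandas recognises only a fixed set of markers, such as an
empty field, ‘NA’ and ‘NaN’, as missing values.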

dat2.dtypes


Output
Age int64
Customer object
Income object
Location object
dtype: object

Since the ‘Income’ column contains the string ‘Missing’, pandas sets the type of
the column to ‘object’, as shown in the above output. Real-life data is often similar,
with missing values represented as blanks, ‘.’ and so on.

In such cases, the strings that should be treated as NaN can be passed to the
‘na_values’ parameter while reading the content. For example, while reading the data
in ‘sample1.txt’, ‘na_values’ should include ‘Missing’, as shown below.

dat2=pd.read_table('sample1.txt', na_values=['Missing'])
dat2.head()

Output
   Age  Customer  Income  Location
0   30         A    1200        A1
1   32         B    1333        A2
2   33       NAN     NAN       NAN

As shown in the output above, the value ‘Missing’ has been replaced with NaN. Any
string can be provided in the ‘na_values’ list, and every occurrence of it is replaced by
NaN in the dataframe during the read operation.

dat2.dtypes


Output
Age int64
Customer object
Income float64
Location object
dtype: object

Since the ‘Income’ column now contains only numbers and NaN, its data type is set
to ‘float64’, as shown in the above output.
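
The effect of ‘na_values’ on the data type can be reproduced with a small sketch. The tab-separated text is an assumed stand-in for ‘sample1.txt’, and the extra marker ‘.’ is added purely for illustration. Passing a dictionary instead of a list restricts each marker to the named column.

```python
import io
import pandas as pd

# Tab-separated stand-in for 'sample1.txt' (contents assumed from the output above).
txt = (
    "Age\tCustomer\tIncome\tLocation\n"
    "30\tA\t1200\tA1\n"
    "32\tB\t1333\tA2\n"
    "33\tNaN\tMissing\tNaN\n"
)

raw = pd.read_table(io.StringIO(txt))
print(raw['Income'].dtype)       # not numeric: 'Missing' is kept as a string

# A dict limits the markers to one column; a list would apply to every column.
clean = pd.read_table(io.StringIO(txt), na_values={'Income': ['Missing', '.']})
print(clean['Income'].dtype)     # numeric now that 'Missing' is read as NaN
print(clean.isna().sum())        # number of missing values per column
```

Counting missing values per column with ‘isna().sum()’ is a convenient check that every marker was recognised.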

3. Reading Files Using URL


The Internet contains various repositories of data, and often there is a need to read data
directly from the web. The ‘request’ module in the ‘urllib’ package contains functions that
can perform such operations. The code below reads a file from the internet and stores it
locally using the ‘urlretrieve()’ function. To begin with, import the ‘urlretrieve’ function
from the ‘urllib.request’ module. The function takes two arguments: the first is the URL
where the file is present and the second is the name and location where the file should be
stored locally.

from urllib.request import urlretrieve

# Import pandas
import pandas as pd
# Assign url of file: url
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
# Save file locally
urlretrieve(url, 'winequality-red.csv')


Output
('winequality-red.csv', <http.client.HTTPMessage at 0x880cef0>)

In this case, the contents are stored in a file named 'winequality-red.csv' in the current
working directory.
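
The two arguments and the returned tuple can be explored without a network connection, since ‘urlretrieve()’ also accepts ‘file://’ URLs. The sketch below is only an illustration under that assumption: the file names ‘source.csv’ and ‘copy.csv’ are hypothetical, and a temporary folder is used so that nothing is left behind.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

with tempfile.TemporaryDirectory() as folder:
    # Create a small source file and address it with a file:// URL (illustrative).
    source = Path(folder) / 'source.csv'
    source.write_text('a;b\n1;2\n')
    url = source.as_uri()

    # First argument: where to read from. Second argument: where to save the copy.
    target = os.path.join(folder, 'copy.csv')
    saved_path, headers = urlretrieve(url, target)

    print(saved_path)                        # the local file name that was written
    copied_text = Path(saved_path).read_text()
```

The first element of the returned tuple is the local file name, which matches the first element of the output shown above.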

The file can also be read directly from the URL and converted into a dataframe using
the ‘read_csv()’ function. In this case, a local copy of the file need not be saved. Note that
the layout of the file must be examined before performing this operation. For example,
this file contains data that is separated by semicolons. Therefore, the ‘sep’ parameter
should be set to ‘;’. The code below demonstrates the same.

import pandas as pd
# Assign url of file: url
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')
# Print the head of the DataFrame
print(df.head())


Output

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide
0            7.4               0.7            0             1.9      0.076                   11
1            7.8              0.88            0             2.6      0.098                   25
2            7.8              0.76         0.04             2.3      0.092                   15
3           11.2              0.28         0.56             1.9      0.075                   17
4            7.4               0.7            0             1.9      0.076                   11

   total sulfur dioxide  density    pH  sulphates  alcohol  quality
0                    34   0.9978  3.51       0.56      9.4        5
1                    67   0.9968   3.2       0.68      9.8        5
2                    54    0.997  3.26       0.65      9.8        5
3                    60    0.998  3.16       0.58      9.8        6
4                    34   0.9978  3.51       0.56      9.4        5

The output displays the first five rows of the ‘wine quality’ data.
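
Why the ‘sep’ parameter matters for this file can be seen in a small offline sketch. The text below is a shortened, assumed sample in the same semicolon-separated layout (three columns and two rows only), not the downloaded file itself.

```python
import io
import pandas as pd

# Two rows in a semicolon-separated layout, cut down to three columns.
text = "fixed acidity;volatile acidity;citric acid\n7.4;0.7;0\n7.8;0.88;0\n"

wrong = pd.read_csv(io.StringIO(text))             # default sep=',' finds no commas
right = pd.read_csv(io.StringIO(text), sep=';')    # split on semicolons

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

With the default separator every line ends up in a single column, which is why the file should be inspected before it is read.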

The code below demonstrates retrieving other types of files from the web using the
‘requests’ package, which is a third-party library and must be installed separately. It
provides the ‘get()’ function, which retrieves data from a given URL.


# Import package
import requests
# Specify the url: url
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
# Package the request, send it and catch the response: r
r = requests.get(url)
# Extract the response body as text: text
text = r.text
# Print
print(text)

Output
M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
F,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20
F,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16
M,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9
F,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,19
F,0.525,0.38,0.14,0.6065,0.194,0.1475,0.21,14
:
:
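
Text retrieved in this way can be turned into a dataframe by wrapping it in a file-like object. The sketch below uses the first three lines of the output above instead of a live request. Passing ‘header=None’ is an assumption based on that output, which shows no header row.

```python
import io
import pandas as pd

# The first three lines of the response shown above, used instead of a live request.
text = (
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)

# header=None: no header row is assumed, so pandas numbers the columns 0 to 8.
abalone = pd.read_csv(io.StringIO(text), header=None)

print(abalone.shape)
print(abalone.head())
```

Column names, if known, could be supplied through the ‘names’ parameter of ‘read_csv()’.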


4. Handling JSON Files


In the code below, the ‘visa.txt’ file is a JSON file. JSON (JavaScript Object Notation) is a
popular text format in which data is arranged as key-value pairs, much like Python
dictionaries. Dictionaries can be nested within a dictionary to several levels.
The ‘json’ library can be used to handle such files.

To begin with, the file has to be opened using the ‘open()’ function, which returns a file
object. The ‘open()’ function can be used along with the ‘with’ keyword so that the file is
closed automatically. The file object is then passed to ‘json.load()’, which reads the
content and converts it into a form that can easily be manipulated in Python.

import pandas as pd
import json

# Open the file; json.load() parses the open file object
with open('visa.txt') as json_data:
    d = json.load(json_data)
print(d)

Output
{'fields': [{'label': 'COUNTRY', 'id': 'a', 'type': 'string'}, {'label': 'MISSION', 'id': 'b', 'type': 'string'},
{'label': 'VISA ISSUE DATE', 'id': 'c', 'type': 'string'}, {'label': 'EMPLOYMENT VISA', 'id': 'd', 'type': 'string'},
{'label': 'TOURIST VISA',…

The output shows a snapshot of the data in ‘json’ format that is read from the ‘visa.txt’
file.

type(d)

Output
dict

In this case, the return value of ‘json.load()’ is a dictionary object, because the top level of
the JSON data is a set of key-value pairs.
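
Once the data is a dictionary, parts of it can be moved into a dataframe. The sketch below assumes a shortened stand-in for ‘visa.txt’ that contains only the first three ‘fields’ entries visible in the output above; the rest of the real file is not shown in this tutorial and is not guessed here.

```python
import json
import pandas as pd

# A shortened stand-in for the contents of 'visa.txt', based on the output above.
json_text = '''
{"fields": [
    {"label": "COUNTRY", "id": "a", "type": "string"},
    {"label": "MISSION", "id": "b", "type": "string"},
    {"label": "VISA ISSUE DATE", "id": "c", "type": "string"}
]}
'''

# json.loads() parses a string; json.load() parses an open file object.
d = json.loads(json_text)

# A list of dictionaries maps naturally to rows; the keys become column names.
fields = pd.DataFrame(d['fields'])
print(fields)
```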


JSON files can also be read directly from the web. Much of the data served by web
services is in JSON format. The code below demonstrates the loading of JSON data
directly from the web. For this, both the ‘json’ module and the ‘urllib.request’ module are
used. The connection to the URL can be opened using the ‘with’ keyword and the
‘urlopen()’ function. The ‘json.loads()’ function, applied to the result of
‘url.read().decode()’, converts the data to a dictionary object.

import urllib.request, json

# Open the URL, read the response bytes, decode them to text and parse the JSON
with urllib.request.urlopen("https://api.github.com") as url:
    data = json.loads(url.read().decode())
print(data)


Output
{'current_user_url': 'https://api.github.com/user',
 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}',
 'authorizations_url': 'https://api.github.com/authorizations',
 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}',
 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}',
 'emails_url': 'https://api.github.com/user/emails',
 'emojis_url': 'https://api.github.com/emojis',
 'events_url': 'https://api.github.com/events',
 'feeds_url': 'https://api.github.com/feeds',
 'followers_url': 'https://api.github.com/user/followers',
 'following_url': 'https://api.github.com/user/following{/target}',
 'gists_url': 'https://api.github.com/gists{/gist_id}',
 'hub_url': 'https://api.github.com/hub',
 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}',
 'issues_url': 'https://api.github.com/issues',
 'keys_url': 'https://api.github.com/user/keys',
 'label_search_url': 'https://api.github.com/search/labels?q={query}&repository_id={repository_id}{&page,per_page}',
 'notifications_url': 'https://api.github.com/notifications',
 'organization_url': 'https://api.github.com/orgs/{org}',
 'organization_repositories_url': 'https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}',
 'organization_teams_url': 'https://api.github.com/orgs/{org}/teams',
 'public_gists_url': 'https://api.github.com/gists/public',
 'rate_limit_url': 'https://api.github.com/rate_limit',
 'repository_url': 'https://api.github.com/repos/{owner}/{repo}',
 'repository_search_url': 'https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}',
 'current_user_repositories_url': 'https://api.github.com/user/repos{?type,page,per_page,sort}',
 'starred_url': 'https://api.github.com/user/starred{/owner}{/repo}',
 'starred_gists_url': 'https://api.github.com/gists/starred',
 'topic_search_url': 'https://api.github.com/search/topics?q={query}{&page,per_page}',
 'user_url': 'https://api.github.com/users/{user}',
 'user_organizations_url': 'https://api.github.com/user/orgs',
 'user_repositories_url': 'https://api.github.com/users/{user}/repos{?type,page,per_page,sort}',
 'user_search_url': 'https://api.github.com/search/users?q={query}{&page,per_page,sort,order}'}
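
The decoding step and the lookup of individual values can be tried offline. The sketch below assumes a two-key stand-in for the response, written as the bytes that ‘url.read()’ would return; the two keys and their values are copied from the output above.

```python
import json

# Bytes as they might arrive from url.read(); a two-key stand-in for the real response.
raw_bytes = (
    b'{"emojis_url": "https://api.github.com/emojis", '
    b'"hub_url": "https://api.github.com/hub"}'
)

# bytes -> str (decode) -> dict (json.loads)
data = json.loads(raw_bytes.decode('utf-8'))

print(type(data).__name__)
print(sorted(data))          # the keys of the dictionary
print(data['hub_url'])       # look up one value by its key
```

Because the result is an ordinary dictionary, the usual dictionary operations such as key lookup and iteration apply.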


