0% found this document useful (0 votes)
66 views

Python - Working With Data - Text Formats

This document discusses reading text file formats into Python for data analysis. It provides an example of reading a .txt file of monthly UK rainfall data from 1910-present. Functions are defined to read the header metadata, check for and handle missing values, and read the data into a dictionary with columns as keys and masked arrays as values. The techniques demonstrated work similarly for comma-separated CSV files using Python's csv module.

Uploaded by

sunil jha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
66 views

Python - Working With Data - Text Formats

This document discusses reading text file formats into Python for data analysis. It provides an example of reading a .txt file of monthly UK rainfall data from 1910-present. Functions are defined to read the header metadata, check for and handle missing values, and read the data into a dictionary with columns as keys and masked arrays as values. The techniques demonstrated work similarly for comma-separated CSV files using Python's csv module.

Uploaded by

sunil jha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Python – working with data –

text formats
ASCII or text file formats
Advantages of working with text formats:
• They are usually human-readable.
• They tend to be simple structures.
• It is relatively easy to write code to interpret
them.
Disadvantages include:
• Inefficient storage for big data volumes.
• Most people invent their own format so
there is a lack of standardisation.
Using python to read text formats
As we have seen Python has a great toolkit for
reading files and working with strings.

In this example we use a file that we found on


the web, and then adapt some code to read it
into a useful, re-usable form.
Our example file
We found a suitable data set on the web:
http://www.metoffice.gov.uk/climate/uk/summaries/datasets#Yearorder

Met Office monthly weather statistics for


the UK since 1910.
Header

Lines numbers
(for reference
only)
Data (first 9 columns)
Data (last 8 columns)
Look! A missing value!
Let's write some code to read it
We'll need:

• To read the header and data separately


• To think about the data structure (so it is easy to
retrieve the data in a useful manner).

Let's put into practice what we have learnt:

• Use NumPy to store the arrays


• But we'll need to test for missing values and use
Masked Array (numpy.ma)
Example code (and data)
Please refer to the example code:

example_code/test_read_rainfall.py

And data file:

example_data/uk_rainfall.txt
Reading the header
UK Rainfall (mm)
Areal series, starting from 1910
Allowances have been made for topographic, coastal
and urban effects where relationships are
found to exist.
Seasons: Winter=Dec-Feb, Spring=Mar-May,
Summer=June-Aug, Autumn=Sept-Nov. (Winter:
Year refers to Jan/Feb).
Values are ranked and displayed to 1 dp. Where
values are equal, rankings are based in
order of year descending.
Data are provisional from December 2014 & Winter
2015. Last updated 07/04/2015
Reading the header
UK Rainfall (mm) Line 1 is important
Areal series, starting from information.
1910
Allowances have been made for topographic, coastal
and urban effects where relationships are
found to exist. Other lines are useful
information.
Seasons: Winter=Dec-Feb, Spring=Mar-May,
Summer=June-Aug, Autumn=Sept-Nov. (Winter:
Let's capture the metadata in:
Year refers to Jan/Feb).
Values are ranked and displayed to 1 dp. Where
values are equal, rankings are based in
- location: UK
order of year descending.
- variable:
Data are provisional from December Rainfall
2014 & Winter
- units: mm
2015. Last updated 07/04/2015
Reading the header
def readHeader(fname):
# Open the file and read the relevant lines
f = open(fname)
head = f.readlines()[:6]
f.close()

# Get important stuff


location, variable, units = head[0].split()
units = units.replace("(", "").replace(")", "")

# Put others lines in comments


comments = head[1:6]
return (location, variable, units, comments)
Test the reader
>>> (location, variable, units, comments) = \
readHeader("example_data/uk_rainfall.txt")

>>> print location, variable, units


UK Rainfall mm

>>> print comments[1]


Allowances have been made for topographic, coastal
and urban effects where relationships are found to
exist.
Write a function to handle missing
data properly
import numpy.ma as MA

def checkValue(value):
# Check if value should be a float
# or flagged as missing
if value == "---":
value = MA.masked
else:
value = float(value)
return value
Reading the data (part 1)
import numpy.ma as MA
def readData(fname):
# Open file and read column names and data block
f = open(fname)

# Ignore header
for i in range(7):
f.readline()

col_names = f.readline().split()
data_block = f.readlines()
f.close()

# Create a data dictionary, containing


# a list of values for each variable
data = {}
Data (first 9 columns)
Reading the data (part 2)
# Add an entry to the dictionary for each column
for col_name in col_names:

data[col_name] = MA.zeros(len(data_block), 'f',


fill_value = -999.999)
Reading the data (part 3)
# Loop through each value: append to each column
for (line_count, line) in enumerate(data_block):
items = line.split()

for (col_count, col_name) in enumerate(col_names):


value = items[col_count]
data[col_name][line_count] = checkValue(value)

return data
Testing the code
>>> data = readData("example_data/uk_rainfall.txt")
>>> print data["Year"]
[ 1910. 1911. 1912. ...

>>> print data["JAN"]


[ 111.40000153 59.20000076 111.69999695 ...

>>> winter = data["WIN"]


>>> print MA.is_masked(winter[0])
True
>>> print MA.is_masked(winter[1])
False
Look! A missing value!
What about CSV or tab-delimited?
The above example will work exactly the same with
a tab-delimited file (because the string split
method splits on white space) .

If the file used commas (CSV) to separate columns


then you could use:

line.split(",")
Or try the Python "csv" module
There is a python "csv" module that is able to read text files
with various delimiters. E.g.:

>>> import csv


>>> r = csv.reader(open("example_data/weather.csv"))
>>> for row in r:
... print row

['Date', 'Time', 'Temp', 'Rainfall']


['2014-01-01', '00:00', '2.34', '4.45']
['2014-01-01', '12:00', '6.70', '8.34']
['2014-01-02', '00:00', '-1.34', '10.25']

See: https://docs.python.org/2/library/csv.html

You might also like