
How To Parse Data Tables From A PDF Bank Statement With Python - by Phillip Heita - Nov, 2021 - Medium

This document summarizes a Medium article that describes how to parse data tables from a PDF bank statement into a pandas dataframe using the Python tabula library. It discusses installing tabula, importing packages, defining the file path, cleaning the transaction description column, reading the PDF with tabula, cleaning up the output, joining tables, distinguishing debits from credits, converting columns to numeric, and classifying transactions into spending categories. The full article provides code examples to implement these steps.


11/29/21, 4:58 PM How to Parse Data Tables from a PDF Bank Statement with Python | by Phillip Heita | Nov, 2021 | Medium


Phillip Heita

How to Parse Data Tables from a PDF Bank Statement with Python
This article looks at how to extract data from your PDF bank statement with Python.

Phillip Heita Nov 16 · 3 min read

Snapshot of PDF

The image above shows a snapshot of my student life about four years back: flights home
during breaks and Uber trips. The PDF contains a date, a relatively long transaction
description, the amount, the running balance, and accrued bank charges. These variables
are a sound basis for answering interesting questions about one's spending behaviour.
But before answering any questions with this data, we need to "liberate" it from the PDF.

https://medium.com/@phillipheita/how-to-parse-data-tables-from-a-pdf-bank-statement-with-python-ebc3b8dd8990 1/8

Installation & imports


For this example, let's parse the data tables from the PDF bank statement into a pandas
data frame, using a Python package called tabula (installed as tabula-py). Let's take a
look at the code!

To install tabula, run:

!pip install -q tabula-py
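
One thing the install command does not cover: tabula-py is a wrapper around tabula-java, so it needs a Java runtime on the machine. The sketch below is only a quick sanity check, not part of the original code; the helper name java_available is made up for illustration.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    # shutil.which returns the full path of the executable, or None if missing
    return shutil.which("java") is not None


if not java_available():
    print("Java was not found on the PATH; tabula.read_pdf() will likely fail.")
```

If the check fails, installing a Java runtime and reopening the terminal or notebook is usually the first thing to try.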

First, we import a couple of packages and define the path to the PDF bank statement.

# Only execute the code if the script is run, not if it is imported
# if __name__ == "__main__":
import os
import re
import string
import sys

import numpy as np   # Numerical Python package
import pandas as pd
import tabula        # PDF table extraction package (installed as tabula-py)
from dateutil.parser import parse  # Fixing the dates

# Silence warnings unless the user asked for them on the command line
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

# The path to the PDF bank statement ("~" is expanded to the home directory)
filepath = os.path.expanduser('~/BankStatement.pdf')


Clean transaction description


Second, as seen from the PDF snapshot, the transaction description column is quite busy.
It includes the type of transaction (Point of Sale (POS) purchase, Cash Withdrawal, etc.),
the merchant name (Tallie bean, South African Airways, and Uber in this case), and a
combination of what I suspect to be the masked card number and date. To clean this column
up, we use regular expressions to remove punctuation, numbers, and certain unnecessary
words.

def clean_trns_desc(text):
    text = text.lower()
    # Remove anything within square brackets
    text = re.sub(r'\[.*?\]', '', text)  # TODO: Ensure this is not excluding stuff
    # Remove every punctuation mark listed in string.punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove all numbers
    text = re.sub(r'\d+', '', text)
    # Remove the words "purch" and "aankp", plus the variants "puchc" and "aankg"
    text = re.sub('purch', '', text)
    text = re.sub('aankp', '', text)
    text = re.sub('puchc', '', text)
    text = re.sub('aankg', '', text)
    return text

# Wrapper applied to the description column later on
round1 = lambda x: clean_trns_desc(x)
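
To see what the cleaning function does, here is a small self-contained sketch that runs it on one description. The sample string is made up for illustration (it is not taken from the statement), and the final whitespace collapse is an extra step that the original function does not perform.

```python
import re
import string


def clean_trns_desc(text):
    """Lower-case the text and strip brackets, punctuation, digits and filler words."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                                # bracketed parts
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # punctuation
    text = re.sub(r'\d+', '', text)                                    # numbers
    for word in ('purch', 'aankp', 'puchc', 'aankg'):                  # filler words
        text = re.sub(word, '', text)
    return text


# Made-up description: filler word, merchant, masked card number, date, bracketed note
sample = "Purch Uber Trip 412345*1234 12 Nov [ref]"

# Collapse the runs of spaces left behind by the removals (extra step, see lead-in)
cleaned = " ".join(clean_trns_desc(sample).split())
print(cleaned)  # uber trip nov
```

Removing digits and punctuation leaves gaps of several spaces, which is why the article later calls .str.strip(" ") on the result; collapsing inner whitespace as well keeps the merchant names easier to match.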


Putting Everything Together


The bulk of the remaining code is a single function called main_func, which accepts the
file path and then does several things. This section will provide a breakdown of the
individual pieces of code, along with the motivations.

try:
    df_list = tabula.read_pdf(filepath, stream=True, guess=True, pages='all',
                              multiple_tables=True,
                              pandas_options={'header': None})
except Exception as e:
    print('The Error is', e)
    raise  # nothing below can run without df_list

### Clean up each page before joining them together
df = []
for dfs in df_list:
    # Keep only the columns that are less than 80% empty
    dfs = dfs[dfs.columns[dfs.isnull().mean() < 0.8]]
    # Drop rows with fewer than two non-empty cells
    dfs = dfs.dropna(axis=0, thresh=2)
    # dfs['Description'] = dfs.iloc[:,1].str.cat(dfs.iloc[:,2],sep=" ")
    # Drop the trailing column when a page comes back with more than five columns
    if dfs.shape[1] > 5:
        dfs = dfs.drop(dfs.columns[-1], axis=1)
    df.append(dfs)

# Join individual dataframes into one
df_fin = pd.concat([df[1], df[2], df[3], df[4]], axis=0, sort=False)  # FIX: make this part dynamic
# Remove the header row that is repeated on every page
df_fin = df_fin[~df_fin.iloc[:, 0].str.contains("Date", na=False)]
df_fin.columns = ['date', 'trns_desc_1', 'trns_desc_2', 'trns_desc_3', 'amount', 'balance']


You might find that your bank statement has a completely different structure, so you
might need to tweak the input parameters depending on the format. The code above reads
the tables on each PDF page into a list of data frames (df_list), using tabula.read_pdf().
Given that the initial output is imperfect, i.e., contains columns with missing values, we
clean up each list element by dropping unnecessary columns and rows, collect the cleaned
pieces in a new list, concatenate them into a single data frame, and rename the columns to
get the view below:
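
The cleanup and join can also be written without hard-coded list positions. The following is a minimal sketch on two made-up "pages" rather than real tabula output; the helper name clean_page, the sample values, and the three column names are assumptions for illustration. Unlike the article's code, which picks specific list elements, it concatenates every page.

```python
import pandas as pd


def clean_page(page: pd.DataFrame, max_null_share: float = 0.8) -> pd.DataFrame:
    """Drop mostly-empty columns, then rows with fewer than two non-empty cells."""
    keep = page.columns[page.isnull().mean() < max_null_share]
    return page[keep].dropna(axis=0, thresh=2)


# Two made-up pages standing in for the data frames tabula would return
page_1 = pd.DataFrame({
    0: ["Date", "01 Nov", None],
    1: ["Description", "POS Purchase", None],
    2: [None, None, None],            # empty column, as produced by a bad guess
    3: ["Amount", "150.00", None],
})
page_2 = pd.DataFrame({
    0: ["Date", "02 Nov"],
    1: ["Description", "Cash Withdrawal"],
    2: [None, None],
    3: ["Amount", "200.00"],
})

# Clean every page, then join them however many there are
pages = [clean_page(p) for p in (page_1, page_2)]
statement = pd.concat(pages, axis=0, ignore_index=True, sort=False)

# Remove the header row that is repeated on each page, then name the columns
statement = statement[~statement.iloc[:, 0].str.contains("Date", na=False)]
statement.columns = ["date", "trns_desc", "amount"]
print(statement)
```

Passing the whole list to pd.concat avoids an IndexError on statements with fewer pages, at the cost of including any page (such as a cover page) that you would otherwise skip by hand.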

Almost there! To answer reasonable questions with this data, we have to do a few more
things. Firstly, we want to distinguish between debits and credits, i.e. money leaving
and coming into the account. Secondly, we convert the amount and balance columns to
numeric so that we can perform aggregations later on. Thirdly, we create a transaction
type (trns_type) column by extracting the first two words from the trns_desc_1 column.
Finally, although not ideal, we manually classify the various transactions into
arbitrarily chosen spending categories, including Groceries, Transport/Fuel,
Construction, Airtime, Savings & Investments, Fast Food, Health & Fitness, and
Restaurant/Bars. The spend classification is the most tedious part of the code, so
predicting the spend category with machine learning algorithms might be more promising.
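
The first three steps can be tried on a toy data frame before running them on a real statement. This is a sketch under assumptions: the two rows are made up, and the amount format (a trailing "Cr" on credits, commas as thousands separators) is inferred from the strings the article's code looks for.

```python
import numpy as np
import pandas as pd

# Made-up rows in the assumed statement format
df = pd.DataFrame({
    "trns_desc_1": ["POS Purchase Chq Card", "Cash Deposit Branch"],
    "amount": ["1,250.00", "3,000.00Cr"],
})

# 1. Debit/credit indicator: amounts marked "Cr" are credits, the rest debits
is_credit = df["amount"].str.contains("Cr", na=False)
df["cr_dr_ind"] = np.where(is_credit, "CR", "DR")

# 2. Strip the marker and thousands separators, then convert to numbers
stripped = (
    df["amount"]
    .str.replace("Cr", "", regex=False)
    .str.replace(",", "", regex=False)
)
df["amount_cleaned"] = pd.to_numeric(stripped, errors="coerce")

# 3. Transaction type: the first two words of the description
words = df["trns_desc_1"].str.split()
df["trns_type"] = words.str[0] + " " + words.str[1]

print(df[["cr_dr_ind", "amount_cleaned", "trns_type"]])
```

With errors="coerce", any amount that still fails to parse becomes NaN instead of raising, so it is worth counting the NaN values afterwards to spot formats the replacements missed.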

# Get the statement start and end date
p_s, p_e = df_fin.iloc[0, 0], df_fin.iloc[-1, 0]

# Feature Engineering
#df_fin['month_year'] = df_fin['date'].str.extract(pat = '([A-Z].{2})')
#df_fin['date'] = df_fin['date'].str.strip(" ").astype(str)
#df_fin['trns_date'] = pd.to_datetime(df_fin['date'].apply(parse))
#df_fin['month_year'] = df_fin['trns_date'].dt.strftime('%Y-%m')#.astype('category')
#df_fin['month_year'] = df_fin['month_year'].astype('category')
#df_fin['month_year_two'] = df_fin['month_year'].astype('str')

# Flag each transaction as a credit (CR) or a debit (DR)
# FIX: Build in an indicator for bank charges
df_fin['cr_dr_ind'] = np.where(df_fin['amount'].str.contains('Cr', na=False), 'CR', 'DR')

# clean amount and balance
# FIX: add more general strings
df_fin['amount_cleaned'] = df_fin['amount'].replace(to_replace=['Cr', ','], value='', regex=True)
df_fin['balance_cleaned'] = df_fin['balance'].replace(to_replace=['Cr', ','], value='', regex=True)

df_fin['amount_cleaned'] = pd.to_numeric(df_fin['amount_cleaned'], errors='coerce')
df_fin['balance_cleaned'] = pd.to_numeric(df_fin['balance_cleaned'], errors='coerce')

# Get the statement opening and closing balances (first and last row by position)
bal_s, bal_e = df_fin['balance_cleaned'].iloc[0], df_fin['balance_cleaned'].iloc[-1]

# Remove Fees
df_fin = df_fin[~df_fin['trns_desc_1'].str.startswith('#', na=False)].copy()

# Create column to allow for easier summing
df_fin['Count'] = 1

# Get first two words of column 1
df_fin['trns_type'] = df_fin['trns_desc_1'].str.split(' ').str[0] + ' ' + df_fin['trns_desc_1'].str.split(' ').str[1]

df_fin['merchant'] = df_fin['trns_desc_2'].apply(round1).str.strip(" ")

df_fin['merchant_category'] = np.nan

lst = [df_fin]

# Manual process to classify individual transactions
for col in lst:
    col.loc[(col['merchant'].str.contains(r'\bwoolworths\b|\bok foods\b|\bmetro\b|\bhypersave
    col.loc[(col['merchant'].str.contains(r'\buber\b|\blefa\b|\bcab\b|\bfuel\b|\bpetro\b',reg
    col.loc[(col['merchant'].str.contains(r'\bbuco\b|\bbuild it\b|\bbuildit\b|\bcashbuild\b|
    col.loc[(col['merchant'].str.contains(r'\bairtime\b',regex=True)),'merchant_category'] =
    col.loc[(col['merchant'].str.contains(r'\bsavings\b|\bsaving\b|\binvest\b|\binvestment\b
    col.loc[(col['merchant'].str.contains(r'\bpizza\b|\bkfc\b|\bhungry lion\b|\bchicken licke
    col.loc[(col['merchant'].str.contains(r'salary|payrol|\bsal\b',regex=True)),'merchant_cat
    col.loc[(col['merchant'].str.contains(r'\babc pharmacy\b|\bauas valley pharmacy\b|\bclic
    col.loc[(col['merchant'].str.contains(r'\bairtime\b',regex=True)),'merchant_category'] =
    col.loc[(col['merchant'].str.contains(r'ocean basket|cappello',regex=True)),'merchant_cat

# 'Other' category for everything that was not matched above
df_fin['merchant_category'] = df_fin['merchant_category'].fillna('Other')

# Count the number of unpaids
#df_fin['unpaid_ind'] = np.nan
df_fin.loc[df_fin['merchant'].str.contains(r'\bunpaid\b', regex=True, na=False), 'unpaid_ind'] = 1

#df_fin['unpaid_ind'].fillna(0,inplace=True)
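
One way to make the manual classification less tedious is to keep the keywords in a lookup table and build the patterns from it. This is a sketch, not the article's implementation: the function name classify is made up, the keywords are a few of those visible in the patterns above, and the pairing of keywords with category names is an assumption based on the category list in the text.

```python
import re

import pandas as pd

# Assumed keyword-to-category pairing (see lead-in); extend as needed
CATEGORY_KEYWORDS = {
    "Groceries": ["woolworths", "ok foods", "metro"],
    "Transport/Fuel": ["uber", "lefa", "cab", "fuel"],
    "Fast Food": ["pizza", "kfc", "hungry lion"],
}


def classify(merchants: pd.Series, keywords: dict) -> pd.Series:
    """Return one category per merchant; unmatched merchants get 'Other'."""
    categories = pd.Series("Other", index=merchants.index, dtype=object)
    for category, words in keywords.items():
        # Whole-word match on any keyword of this category
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
        mask = merchants.str.contains(pattern, regex=True, na=False)
        # Only fill rows that no earlier category has claimed
        categories[mask & (categories == "Other")] = category
    return categories


# Made-up merchant names in the cleaned, lower-case form
merchants = pd.Series(["uber trip", "woolworths windhoek", "kfc", "unknown shop"])
result = classify(merchants, CATEGORY_KEYWORDS)
print(result.tolist())
```

Here the first matching category wins, whereas in a chain of .loc assignments a later match overwrites an earlier one; either rule works as long as it is applied consistently.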


There you have it! We have gone from a PDF with tables to a well-formatted data frame.
Pretty cool! Have a look at the complete code here.

We will visualize this data in the following article and present the final format in a
dashboard view using Python, Panel and Plotly. Here's a preview:

Screenshot of the Panel Application.

That's all for this article, thank you for reading, and I hope you found this article
interesting! Stay tuned for the next one.

Let me know in the comments if you have a better alternative to the manual spend
classification 🙂.


Python Data Science Pdf Data Visualization Banking
