How To Parse Data Tables From A PDF Bank Statement With Python - by Phillip Heita - Nov, 2021 - Medium
Phillip Heita
Snapshot of PDF
The image above shows a snapshot of my student life from about four years back: flights home
during breaks and Uber trips. We see that the PDF contains a date, a relatively long
transaction description, the amount, the running balance, and accrued bank
charges. These variables serve as a sound basis for answering exciting questions
regarding one's spending behaviour. But before answering any questions using this data,
we need to "liberate" it from the PDF.
https://medium.com/@phillipheita/how-to-parse-data-tables-from-a-pdf-bank-statement-with-python-ebc3b8dd8990 1/8
11/29/21, 4:58 PM How to Parse Data Tables from a PDF Bank Statement with Python | by Phillip Heita | Nov, 2021 | Medium
First, we import a couple of packages and define the path to the PDF bank statement.
################################################################################
# Only execute the code if the script is run and not if it is imported        #
################################################################################
# if __name__ == "__main__":
import os
import re
import string
import sys

import numpy as np   # Numerical Python package
import pandas as pd  # Data frames
import tabula        # PDF table extraction package (tabula-py)
from dateutil.parser import parse  # Fixing the dates

# Silence warnings unless warning options were passed on the command line
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

# The path to the PDF bank statement ("~" is expanded to the home directory)
filepath = os.path.expanduser('~/BankStatement.pdf')
date. To clean this column up, we do a couple of things using regular expressions,
including removing punctuation, numbers and certain unnecessary words.
def clean_trns_desc(text):
    text = text.lower()
    # Remove anything within square brackets
    text = re.sub(r'\[.*?\]', '', text)  # TODO: Ensure this is not excluding stuff
    # Remove every punctuation mark listed in string.punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Get rid of all numbers
    text = re.sub(r'\d+', '', text)
    # Get rid of the word purch
    text = re.sub('purch', '', text)
    # Get rid of the word aankp and the variants puchc and aankg
    text = re.sub('aankp', '', text)
    text = re.sub('puchc', '', text)
    text = re.sub('aankg', '', text)
    return text

round1 = lambda x: clean_trns_desc(x)
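To get a feel for what the cleaning step does, here is a small self-contained sketch that runs the same substitutions on two made-up descriptions. The sample strings are assumptions, not rows from the actual statement, and the final whitespace collapse is an addition that the function above does not perform.

```python
import re
import string

def clean_trns_desc(text):
    """Lower-case a description and strip brackets, punctuation, digits and filler words."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\d+", "", text)
    for word in ("purch", "aankp", "puchc", "aankg"):
        text = text.replace(word, "")
    # Collapse the runs of spaces left behind (an addition, not in the original)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical descriptions, invented for illustration only
samples = [
    "POS Purch Uber Trip 1234*5678 [REF 99]",
    "Aankp Checkers Hyper 4412",
]
cleaned = [clean_trns_desc(s) for s in samples]
print(cleaned)  # ['pos uber trip', 'checkers hyper']
```

Note that the filler words are removed as plain substrings, so a description containing "purchase" would be left with "ase". If that matters for your statement, a word-boundary pattern such as `r"\bpurch\b"` is one way to avoid it.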
try:
    df_list = tabula.read_pdf(
        filepath,
        stream=True,
        guess=True,
        pages='all',
        multiple_tables=True,
        pandas_options={'header': None},
    )
except Exception as e:
    print('The Error is', e)
    raise  # nothing below can run without df_list

### Clean up each page before joining them together
df = []
for dfs in df_list:
    # Keep only the columns that are less than 80% empty
    dfs = dfs[dfs.columns[dfs.isnull().mean() < 0.8]].copy()
    # Drop rows with fewer than two non-empty cells
    dfs.dropna(axis=0, thresh=2, inplace=True)
    # dfs['Description'] = dfs.iloc[:,1].str.cat(dfs.iloc[:,2],sep=" ")
    # Drop the last column when a page has more than five columns
    if dfs.shape[1] > 5:
        dfs.drop(dfs.columns[-1], axis=1, inplace=True)
    df.append(dfs)

# Join individual dataframes into one
df_fin = pd.concat([df[1], df[2], df[3], df[4]], axis=0, sort=False)  # FIX: make this part dynamic
# Remove the header rows that repeat on each page
df_fin = df_fin[~df_fin.iloc[:, 0].str.contains("Date", na=False)]
df_fin.columns = ['date', 'trns_desc_1', 'trns_desc_2', 'trns_desc_3', 'amount', 'balance']
You might find that your bank statement has a completely different structure, and thus
you might need to tweak the input parameters depending on the format. The code above
reads the tables found on the PDF pages into a list of data frames (df_list), using
tabula.read_pdf(). Given that the initial output is imperfect, i.e., contains columns with
missing values, we clean up each list element by dropping unnecessary columns and sparse
rows, concatenate the cleaned elements into a single data frame, drop the repeated header
rows and rename the columns, which yields one table with a row per transaction.
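The per-page cleanup and the join can be tried without a PDF at hand. The sketch below uses two invented "pages" in place of the frames returned by tabula.read_pdf(); the helper name clean_page, its parameters and the three-column layout are assumptions made for illustration. Unlike the code above, it concatenates every non-empty page rather than a fixed set of list positions, which is one possible way to make the join dynamic.

```python
import numpy as np
import pandas as pd

def clean_page(page, max_null_share=0.8, min_values=2):
    """Drop mostly-empty columns, then rows with fewer than `min_values` filled cells."""
    keep = page.columns[page.isnull().mean() < max_null_share]
    page = page[keep].dropna(axis=0, thresh=min_values)
    # Renumber the columns so pages line up by position when concatenated
    page.columns = range(page.shape[1])
    return page

# Two made-up "pages" standing in for the output of tabula.read_pdf()
page1 = pd.DataFrame([
    ["Date", "Description", np.nan, "Amount"],
    ["01 Jan", "uber trip", np.nan, "55.00"],
    [np.nan, np.nan, np.nan, np.nan],
])
page2 = pd.DataFrame([
    ["Date", "Description", "Amount"],
    ["02 Jan", "groceries", "120.50"],
])

pages = [clean_page(p) for p in (page1, page2)]
pages = [p for p in pages if not p.empty]
statement = pd.concat(pages, axis=0, ignore_index=True)

# Remove the header row that repeats on every page
statement = statement[~statement.iloc[:, 0].str.contains("Date", na=False)]
statement.columns = ["date", "trns_desc", "amount"]
print(statement)
```

Concatenating all pages only works if every page really holds a transaction table with the same column layout. If the first extracted table is something else, such as an account summary, it would still need to be skipped explicitly.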
Almost there! Now to answer reasonable questions with this data, we have to do a few
more things. Firstly, we want to distinguish between debits and credits, i.e. money
leaving and coming into the account. Secondly, convert the amount and balance
columns to numeric to allow us to perform aggregations later on. Thirdly, create a
transaction type (trns_type) column by extracting the first two words from the
trns_desc_1 column. Finally, although not ideal, manually classify the various
transactions into arbitrarily chosen spending categories, including Groceries,
Transport/Fuel, Construction, Airtime, Savings & Investments, Fast Food, Health & Fitness,
and Restaurant/Bars. The spend classification is the most tedious part of the code, and
thus assigning the spend category via machine learning algorithms might be more promising.
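The four steps can be sketched as follows. This is not the article's own code: the sample rows, the trailing-minus convention for debits, the column name direction, the keyword table and the "Uncategorised" fallback are all assumptions chosen to keep the example small.

```python
import pandas as pd

# Made-up rows; a real statement may format numbers differently
df_fin = pd.DataFrame({
    "trns_desc_1": ["POS Purchase Uber Trip", "Salary Credit ACME Ltd", "POS Purchase Checkers"],
    "amount": ["55.00-", "10,000.00", "120.50-"],
    "balance": ["945.00", "10,945.00", "10,824.50"],
})

def to_number(series):
    """Strip thousands separators and turn a trailing minus into a negative number."""
    s = series.str.replace(",", "", regex=False).str.strip()
    negative = s.str.endswith("-")
    values = pd.to_numeric(s.str.rstrip("-"), errors="coerce")
    return values.where(~negative, -values)

# Convert amount and balance to numeric so they can be aggregated
df_fin["amount"] = to_number(df_fin["amount"])
df_fin["balance"] = to_number(df_fin["balance"])

# Distinguish money leaving (debit) from money coming in (credit)
df_fin["direction"] = df_fin["amount"].lt(0).map({True: "debit", False: "credit"})

# Transaction type: the first two words of the description
df_fin["trns_type"] = df_fin["trns_desc_1"].str.split().str[:2].str.join(" ")

# Manual classification with a hypothetical keyword table
CATEGORY_KEYWORDS = {"uber": "Transport/Fuel", "checkers": "Groceries"}

def classify(description):
    lowered = description.lower()
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in lowered:
            return category
    return "Uncategorised"

df_fin["category"] = df_fin["trns_desc_1"].apply(classify)
print(df_fin[["amount", "direction", "trns_type", "category"]])
```

A keyword table like this grows quickly and misses new merchants, which is why the manual step is tedious. It can still serve as a source of labelled examples if you later try a learned classifier.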
There you have it! We have gone from a PDF with tables to a well-formatted data frame.
Pretty cool! Have a look at the complete code here.
We will visualize this data in the following article and present the final format in a
dashboard view using Python, Panel and Plotly. Here's a preview:
That's all for this article, thank you for reading, and I hope you found this article
interesting! Stay tuned for the next one.
Let me know if you have a better alternative to the manual spend classification in the
comments 🙂.