A Table Detection, Cell Recognition and Text Extraction Algorithm To Convert Tables in Images To Excel Files by Hucker Marius Towards Data Science
1 of 19 23/12/22, 18:33
A table detection, cell recognition and text extraction ... https://towardsdatascience.com/a-table-detection-cell-...
source: pixabay
Let’s say you have a table in an article, PDF or image and want to transfer it into an
Excel sheet or dataframe so that you can edit it. Especially in preprocessing for
machine learning, this algorithm is helpful for converting many images and tables
into editable data.
If your data consists of text-based PDFs, there are already a handful of free
solutions. The most popular ones are Tabula and Camelot/Excalibur, which you can
find at https://tabula.technology/, https://camelot-py.readthedocs.io/en/master/
and https://excalibur-py.readthedocs.io/en/master/.
However, what if your PDF is image-based, or you find an article with a table
online? Why not just take a screenshot and convert it into an Excel sheet? Since
there seems to be no free or open source software for image-based data (JPG, PNG,
image-based PDF etc.), the idea came up to develop a generic solution that converts
tables into editable Excel files.
Getting started
The algorithm consists of three parts: the first is table detection and cell
recognition with OpenCV, the second is the allocation of each cell to its proper
row and column, and the third is the text extraction of each allocated cell
through Optical Character Recognition (OCR) with pytesseract.
Like most table recognition algorithms, this one is based on the line structure of
the table. Clear and detectable lines are necessary for the proper identification of
cells. Tables with broken lines, gaps and holes are identified less reliably, and
cells that are only partially surrounded by lines are not detected. If some of your
documents have broken lines, repair the lines before running the algorithm.
First, we need the input data, which in my case is a screenshot in PNG format. The
goal is a dataframe and an Excel file with the identical tabular structure, where
each cell can be edited and used for further analysis.
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
The first step is to read in your file from the proper path, use thresholding to
convert the input image to a binary image, and invert it to get a black
background with white lines and text.
The next step is to define kernels that detect rectangular boxes and, from them, the
tabular structure. First, we define the length of the kernel, and then the vertical
and horizontal kernels that are later used to detect all vertical and all
horizontal lines.
# A kernel of 2x2
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
We combine the horizontal and vertical lines into a third image by weighting both
with 0.5. The aim is to get a clear tabular structure in which each cell can be detected.
img_vh = cv2.addWeighted(vertical_lines, 0.5, horizontal_lines, 0.5, 0.0)
bitxor = cv2.bitwise_xor(img,img_vh)
bitnot = cv2.bitwise_not(bitxor)
Once we have the tabular structure, we use the findContours function to detect the
contours. This gives us the exact coordinates of each box.
The following function is necessary to get a sequence of the contours and to sort
them from top to bottom
(https://www.pyimagesearch.com/2015/04/20/sorting-contours-using-python-and-opencv/).
    # construct the list of bounding boxes and sort them from top to bottom
    boundingBoxes = [cv2.boundingRect(c) for c in cnts]
    (cnts, boundingBoxes) = zip(*sorted(zip(cnts, boundingBoxes),
        key=lambda b: b[1][i], reverse=reverse))
Next, we retrieve the position, width and height of each contour and store them in
the box list. Then we draw rectangles around all our boxes and plot the image. In my
case I only did this for boxes narrower than 1000 px and lower than 500 px, to
exclude rectangles that are probably not cells, e.g. the table as a whole. These two
values depend on your image size, so if your image is a lot smaller or bigger
you need to adjust both.
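# Create list box to store all boxes in
box = []
# Get position (x,y), width and height for every contour and show the contour on image
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    # keep only boxes small enough to be cells
    if w < 1000 and h < 500:
        image = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
        box.append([x, y, w, h])
plotting = plt.imshow(image, cmap='gray')
plt.show()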
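Because the two pixel limits depend on the image size, one possible variation is to express them as fractions of the image dimensions. The helper name `filter_cell_boxes` and the default fraction of 0.8 are hypothetical choices for this sketch, not values from the article.

```python
def filter_cell_boxes(boxes, img_width, img_height, max_w_frac=0.8, max_h_frac=0.8):
    """Keep boxes that are clearly smaller than the image (hypothetical helper)."""
    max_w = img_width * max_w_frac
    max_h = img_height * max_h_frac
    return [b for b in boxes if b[2] < max_w and b[3] < max_h]

# (x, y, w, h): the first box spans the whole table, the others are cells.
boxes = [(0, 0, 600, 400), (10, 10, 180, 40), (200, 10, 180, 40)]
cells = filter_cell_boxes(boxes, img_width=600, img_height=400)
```

The box spanning the whole table is dropped, while the two cell-sized boxes are kept.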
Now that we have every cell with its location, height and width, we need to find
its position within the table, that is, the row and the column in which it is
located. As long as the y-coordinate of a box lies no more than mean/2 (half the
mean box height) below that of the previous box, the box is in the same row. As
soon as the difference is larger than mean/2, we know that a new row starts.
Within a row, columns are arranged from left to right.
# lists that collect the cells of the current row and all finished rows
row = []
column = []
for i in range(len(box)):
    if i == 0:
        column.append(box[i])
        previous = box[i]
    else:
        if box[i][1] <= previous[1] + mean / 2:
            column.append(box[i])
            previous = box[i]
            if i == len(box) - 1:
                row.append(column)
        else:
            row.append(column)
            column = []
            previous = box[i]
            column.append(box[i])
print(column)
print(row)
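The row grouping can also be written as a small function, sketched below on hand-made boxes. The function name `group_rows` and the sample boxes are assumptions. Unlike the loop above, this variant appends the last row after the loop, so a final row that contains a single cell is not lost.

```python
def group_rows(box, mean):
    """Group boxes (sorted top to bottom) into rows by their y-coordinate."""
    rows, current, previous = [], [], None
    for b in box:
        # same row while the y offset to the previous box stays within mean/2
        if previous is None or b[1] <= previous[1] + mean / 2:
            current.append(b)
        else:
            rows.append(current)
            current = [b]
        previous = b
    if current:
        rows.append(current)
    return rows

# [x, y, w, h] boxes, already sorted from top to bottom.
box = [[200, 10, 180, 40], [10, 12, 180, 40],
       [10, 60, 180, 40], [200, 61, 180, 40],
       [10, 110, 180, 40]]
mean = sum(b[3] for b in box) / len(box)  # mean box height
rows = group_rows(box, mean)
```

For these boxes the function returns three rows with two, two and one cell.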
# calculating maximum number of cells
countcol = 0
for i in range(len(row)):
    current = len(row[i])
    if current > countcol:
        countcol = current
After having the maximum number of cells we store the midpoint of each column
in a list, create an array and sort the values.
center=np.array(center)
center.sort()
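A minimal sketch of how the column midpoints could be computed is shown below. Taking them from the row with the most cells is an assumption of this example, and the sample rows are made up.

```python
import numpy as np

# Rows of [x, y, w, h] boxes; the second row is the widest one.
row = [
    [[10, 10, 180, 40], [200, 12, 180, 40]],
    [[10, 60, 180, 40], [200, 61, 180, 40], [390, 60, 100, 40]],
    [[10, 110, 180, 40]],
]

# Maximum number of cells found in any row.
countcol = max(len(r) for r in row)

# Assumption: use the widest row as the reference for the column midpoints.
widest = max(row, key=len)
center = np.array([int(b[0] + b[2] / 2) for b in widest])
center.sort()
```

The sorted `center` array then holds one x-coordinate per column, from left to right.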
At this point, we have all boxes and their values, but as you might see in the
output of your row list, the values are not always sorted in the right order. That’s
what we do next, using the distance of each box to the column centers. The proper
sequence is stored in the list finalboxes.
finalboxes = []
for i in range(len(row)):
    lis = []
    for k in range(countcol):
        lis.append([])
    for j in range(len(row[i])):
        diff = abs(center - (row[i][j][0] + row[i][j][2] / 4))
        minimum = min(diff)
        indexing = list(diff).index(minimum)
        lis[indexing].append(row[i][j])
    finalboxes.append(lis)
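The nearest-center assignment can be tried out in isolation with the sketch below. The sample centers and boxes are assumptions, and this variant compares against the true midpoint of each box (x + w/2) rather than the x + w/4 offset used above.

```python
import numpy as np

center = np.array([100, 290, 440])  # column midpoints, sorted left to right
countcol = len(center)
row = [
    [[200, 12, 180, 40], [10, 10, 180, 40]],   # cells in the wrong order
    [[10, 60, 180, 40], [390, 60, 100, 40]],   # middle cell missing
]

finalboxes = []
for cells in row:
    lis = [[] for _ in range(countcol)]
    for b in cells:
        mid = b[0] + b[2] / 2
        # index of the column whose center is closest to this box
        idx = int(np.argmin(np.abs(center - mid)))
        lis[idx].append(b)
    finalboxes.append(lis)
```

Each row of `finalboxes` now has exactly `countcol` slots, and a missing cell shows up as an empty list in its column.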
outer = []
for i in range(len(finalboxes)):
    for j in range(len(finalboxes[i])):
        inner = ''
        if len(finalboxes[i][j]) == 0:
            # empty cell
            outer.append(' ')
        else:
            for k in range(len(finalboxes[i][j])):
                # note: y holds the horizontal and x the vertical offset here
                y, x, w, h = (finalboxes[i][j][k][0], finalboxes[i][j][k][1],
                              finalboxes[i][j][k][2], finalboxes[i][j][k][3])
                finalimg = bitnot[x:x+h, y:y+w]
                kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 1))
                border = cv2.copyMakeBorder(finalimg, 2, 2, 2, 2,
                                            cv2.BORDER_CONSTANT, value=[255, 255])
                resizing = cv2.resize(border, None, fx=2, fy=2,
                                      interpolation=cv2.INTER_CUBIC)
                dilation = cv2.dilate(resizing, kernel, iterations=1)
                erosion = cv2.erode(dilation, kernel, iterations=2)
                out = pytesseract.image_to_string(erosion)
                if len(out) == 0:
                    # retry with a different page segmentation mode
                    out = pytesseract.image_to_string(erosion, config='--psm 3')
                inner = inner + " " + out
            outer.append(inner)
The last step is the conversion of the list to a dataframe and storing it into an
excel-file.
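# Converting the list to a dataframe and storing it in an Excel file
arr = np.array(outer)
data = pd.DataFrame(arr.reshape(len(row), countcol))
data.to_excel("/Users/YOURPATH/output.xlsx")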
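The reshaping from the flat list of cell texts to a table can be sketched as follows. The sample strings and the whitespace stripping are assumptions added for this example; the export is commented out because it needs an Excel writer such as openpyxl.

```python
import numpy as np
import pandas as pd

# Flat list of cell texts in row-major order, as the OCR loop produces it.
outer = [" Name\n", " Qty\n", " Apples\n", " 3\n", " Pears\n", " "]
n_rows, countcol = 3, 2

# Assumption: strip the whitespace that surrounds the OCR output.
cleaned = [s.strip() for s in outer]

arr = np.array(cleaned)
dataframe = pd.DataFrame(arr.reshape(n_rows, countcol))

# dataframe.to_excel("output.xlsx")  # requires an Excel writer such as openpyxl
```

The reshape only works if the list holds exactly n_rows * countcol entries, which is why empty cells are stored as a single space in the loop above.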
That’s it! Your table should now be stored in a dataframe and in an Excel file and
can be used for Natural Language Processing, for further statistical analysis or
just for editing. This works for tables with a clear and simple structure. If
your table has an unusual structure, in the sense that many cells are merged,
that the cell size varies strongly or that many colours are used, the
algorithm may have to be adapted. Furthermore, OCR (pytesseract) is nearly
perfect at recognizing computer fonts. However, if your tables contain
handwritten input, the results may vary.
If you use it for your own table(s), let me know how it worked.
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

#read your file
file=r'/Users/marius/Desktop/Masterarbeit/Medium/Medium.png'
img = cv2.imread(file,0)
img.shape

#thresholding the image to a binary image
thresh,img_bin = cv2.threshold(img,128,255,cv2.THRESH_BINARY | cv2.THRESH_OTSU)

#inverting the image
img_bin = 255-img_bin
cv2.imwrite('/Users/marius/Desktop/cv_inverted.png',img_bin)
#Plotting the image to see the output
plotting = plt.imshow(img_bin,cmap='gray')
plt.show()

# countcol(width) of kernel as 100th of total width
kernel_len = np.array(img).shape[1]//100
# Defining a vertical kernel to detect all vertical lines of image
ver_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_len))
# Defining a horizontal kernel to detect all horizontal lines of image
hor_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_len, 1))
# A kernel of 2x2
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))

#Use vertical kernel to detect and save the vertical lines in a jpg
image_1 = cv2.erode(img_bin, ver_kernel, iterations=3)
vertical_lines = cv2.dilate(image_1, ver_kernel, iterations=3)
cv2.imwrite("/Users/marius/Desktop/vertical.jpg",vertical_lines)
#Plot the generated image
plotting = plt.imshow(image_1,cmap='gray')
plt.show()

#Use horizontal kernel to detect and save the horizontal lines in a jpg
image_2 = cv2.erode(img_bin, hor_kernel, iterations=3)
horizontal_lines = cv2.dilate(image_2, hor_kernel, iterations=3)
[...]
            previous=box[i]

            if(i==len(box)-1):
                row.append(column)

        else:
            row.append(column)
            column=[]
            previous = box[i]
            column.append(box[i])

print(column)
print(row)

#calculating maximum number of cells