Raw-fi Data

Extract table data from image

2024-10-15

As I explained in my previous post, "Extract CSV from tables in PDF", tabula-py only works with text-based PDFs. If you need to extract table data from a scanned, image-based PDF (one that requires OCR), you will have to find another way.

So, in this post I will show how to extract data from a table in an image using img2table. Unlike with text-based PDFs, the characters have to be recognized by OCR, so they cannot be extracted with complete accuracy. However, in some cases the result may be good enough.

img2table

img2table is a useful library that can automatically detect tabular data in images and output it as Python objects, Excel files, or CSV files. It is well suited to extracting data from scanned images and screenshots, as it can detect tables in images and extract their text in a single process.

img2table can use several OCR tools, such as Tesseract and PaddleOCR. Check the project's README.md for the list of supported OCR tools.

Tools and environment

First, install the following libraries to get ready.

pip install img2table opencv-python-headless pandas
  • img2table: A library for extracting table data from images.
  • opencv-python-headless: OpenCV for image processing. Useful when you don't use a GUI.
  • pandas: A library for processing the extracted data as a data frame.

Also, to use Tesseract, install the following:

sudo yum install leptonica tesseract

Extract table data from an image

In this article, we will extract the data from table_eng.jpg.

table_eng.jpg

The following code extracts the table data as a pandas DataFrame.

import cv2
from img2table.ocr import TesseractOCR
from img2table.document import Image

import io
import os


os.environ['TESSDATA_PREFIX'] = '/usr/share/tesseract/tessdata/'
src = "./images/table_eng.JPG"

def preprocess(image):
  # gray scale
  image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

  # threshold
  _, image = cv2.threshold(image, 150, 255, cv2.THRESH_BINARY)

  # resize
  #image = cv2.resize(image, None, interpolation = cv2.INTER_AREA, fx=2, fy=2)

  # noise
  # image = cv2.medianBlur(image, 1)

  return image


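image = cv2.imread(src)
if image is None:
  # cv2.imread returns None instead of raising when the file cannot be read
  raise FileNotFoundError(src)
image = preprocess(image)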

# Convert cv2 image to byte array
_, encoded_image = cv2.imencode('.png', image)
doc = Image(io.BytesIO(encoded_image.tobytes()))

ocr = TesseractOCR(n_threads=1, lang="eng")
# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                   implicit_rows=False,
                   implicit_columns=False,
                   borderless_tables=False,
                   min_confidence=50)

for table in extracted_tables:
  print(table.df)

In preprocess, some lines are commented out. Applying all of the steps reduced accuracy for this image, so this time I used only grayscale conversion and thresholding. Depending on the quality of the image, more preprocessing may be required.

The result is as follows:

            0                         1
0          Name                      Address
1          13th       47 W 13th St, New York, NY 10011, USA
2    20 Cooper Square     20 Cooper Square, New York, NY 10003, USA
3     2nd Street Dorm        LE 2nd St, New York, NY 10003, USA.
4        3rd North        75 3rd Ave, New York, NY 10003, USA.
5   6 Metrotech Center     Metrotech Center, Brooklyn, NY 11201, USA
6      721 Broadway       [721 Broadway, New York, NY 10003, USA
7       7th Street       [40 E 7th St, New York, NY 10003, USA
8      838 Broadway       [838 Broadway, New York, NY 10003, USA
9  9th St. PATH Station        69 W 9th St, New York, NY 10011, USA
10    [Abu Dhabi House  19 Washington Square N, New York, NY 10011, USA
11    Manhattan Hotel          371 7th Ave, New York, NY 10001.
12      [Alumni Hall        33 3rd Ave, New York, NY 10003, USA.
13     Bobst Library 70 Washington Square South, New York, NY 10012...
14     Brittany Hall [55 East 10th Street, New York, NY 10003, Unit...
15        Broome.       400 Broome St, New York, NY 10013, USA
16      Brown Bldg.     29 Washington Pl, New York, NY 10003, USA
17      Cantor Film 36 East Street, New York, NY 10003, United States
18     [Carlyle Hall    25 Union Square W, New York, NY 10003, USA.
19 [Clark Residence Hall        55 Clark St, Brooklyn, NY 11201, USA
20      Coles Center 181 Mercer Street, New York, NY 10012, United ...
21 [College of Dentistry       345 E 24th St, New York, NY 10010, USA
22    [Computer Center      242 Greene St, New York, NY 10003, USA,
23      [Coral Tower East 14th Street, New York, NY 10003, United S...
24       D'Agostino       110 W 3rd St, New York, NY 10012, USA.
25       East Bldg      [239 Greene St, New York, NY 10003, USA,
26          None  13 Washington Square $, New York, NY 10012, USA.
27    Faculty of Arts  [5 Washington Square S, New York, NY 10012, USA
28     Founders Halll        120 12th St, New York, NY 10003, USA
29      [Furman Hall     249 Sullivan St, New York, NY 10012, USA.
30        Gallatin       715 Broadway, New York, NY 10003, USA
31       Glucksman     1 Washington Mews, New York, NY 10003, USA
32      Goddard Hall  80 Washington Square E, New York, NY 10003, USA
33     Gramercy Green        310 3rd Ave, New York, NY 10010, USA
34    Greenwich Hotel     636 Greenwich St, New York, NY 10014, USA
35    Health Services         Broadway, New York, NY 10003, USA
36          HKMC        44 W 4th St, New York, NY 10012, USA
37        Kaufman        44 W 4th St, New York, NY 10012, USA
38       Kevorkian   [50 Washington Square New York, NY 10012, USA

As you can see, the result contains recognition errors: a stray "[" appears at the start of some cells, a period (.) or comma (,) is sometimes appended after "USA", and "1E" is recognized as "LE". There is clearly room for improvement.
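Some of these errors can be cleaned up after extraction. The following is a minimal sketch, assuming that the only edge noise is a leading "[" and trailing punctuation after "USA" or a five-digit ZIP code, as in the result above. The function name clean_cell is hypothetical, and the rules are drawn from this one sample, so they are not general. Substitutions inside a cell, such as "LE" for "1E", are not handled here because fixing them requires knowing the expected value.

```python
import re

# Noise seen at the start of cells in the sample output above (assumption:
# a real cell value never legitimately starts with "[").
LEADING_NOISE = "["


def clean_cell(value):
    """Strip stray OCR characters from the edges of a single cell value."""
    if value is None:
        # Empty cells come back as None; leave them untouched.
        return None
    text = value.strip().lstrip(LEADING_NOISE)
    # Remove trailing "." or "," only after "USA" or a 5-digit ZIP code,
    # so that legitimate abbreviations such as "Brown Bldg." are kept.
    text = re.sub(r"(USA|\d{5})[.,]+$", r"\1", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_cell("[721 Broadway, New York, NY 10003, USA"))
    print(clean_cell("75 3rd Ave, New York, NY 10003, USA."))
```

To apply it to a whole table, something like table.df.map(clean_cell) should work on recent pandas versions (older versions use applymap instead).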

Improvement

The following factors affect the accuracy of the extraction:

  • Image quality: Blurry or low-resolution tables can reduce OCR accuracy. It is recommended to use images of the highest quality possible.
  • Table complexity: Tables that span multiple pages or have complex layouts can be difficult to extract. Accuracy may be improved by adjusting img2table's extraction options, such as implicit_rows and borderless_tables.
  • Preprocessing: Preprocessing images using OpenCV, such as grayscaling and noise removal, can improve OCR results.

Conclusion

We extracted the table data from the image with img2table and Tesseract and converted it to a pandas DataFrame. This method is hard to use when high accuracy is required, but it can be a quick way to get an overall picture of the data. We would like to continue experimenting with it.