Extract table data from image with Phi-3 Vision

2024-12-29

Using CPU instead of GPU

In the previous article, I tried to extract text from an image using MPLUG-DOCOWL2. For trying it on Google Colab, I had to pay to use A100 GPU.

In this article, I extract text from an image on CPU with Phi-3 Vision.

Phi-3 Vision

Phi-3 is a family of small, open models developed by Microsoft. The family includes Phi-3 Vision, a multimodal model that combines language and vision capabilities, and the language models Phi-3-mini, Phi-3-small, and Phi-3-medium.

It can be assumed that the development of models to meet specific needs, such as use in resource-limited environments and local execution on devices, is the background behind the development of Phi-3.

Phi-3 Vision is a multimodal model with 4.2 billion parameters. This model combines both language and vision capabilities and can perform inference by combining text and images.

Run Phi-3 Vision on a CPU

The page of Phi-3 Vision describes the execution environment as follows:

Note that by default, the Phi-3-Vision-128K model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:

NVIDIA A100

NVIDIA A6000

NVIDIA H100

However, as of the end of 2024, a model that runs on a CPU is available. So, I try it to analyze images on CPU.

Google Colab Runtime Type for Phi-3 Vision on CPU

As a reulst of my trying, I found that it can run on CPU, but it requires about 13-14GB RAM, which was not enough by default. Therefore, Runtime Type setting for Google Colab must be set to CPU and High Memory.

Run Phi-3 Vision on Google Colab

The gist is here:

https://gist.github.com/kevind391/dc221dc3b02b6d2efebe2eb5e942be3a

I am going to explain the some of code.

First, log in to Huggingface. You will be asked to enter a token, so please get a token in advance.

!huggingface-cli login

Next, download Phi-3 Vision model file. Here, use --include to download the files under "cpu_and_mobile". This allows you to download model files compatible with CPU.

!huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .

This code requires ONNX Runtime, so install onnxruntime-genai. ONNX (Open Neural Network Exchange) is an open format for making machine learning models interoperable across different frameworks. Phi-3 model is optimized to run efficiently on a variety of hardware, and ONNX Runtime is used as part of this.

!pip install --pre onnxruntime-genai==0.5.1

Then, download the sample code provided by onnxruntime-genai. This sample code prompts the user to enter the path and prompt for the image to be loaded, and returns the response result of Phi-3 Vision.

Please note that this sample code requires you to use the branch of the version of onnxruntime-genai you have installed.

!wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/rel-0.5.1/examples/python/phi3v.py

Finally, run the sample code you downloaded.

!python phi3v.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -p cpu

Enter a prompt and the image path that you uploaded to Google Colab in advance, and you will get the following response in about 2-3 minutes.

Loading model...
Image Path (comma separated; leave empty if no image): table_eng.jpg
['table_eng.jpg']
Loading images...
Prompt: show all records
Processing images and prompt...
Generating response...
13th Street
47 W 13th St, New York, NY 10011, USA
20 Cooper Square
20 Cooper Square, New York, NY 10003, USA
2nd Street Dorm
1 E 2nd St, New York, NY 10003, USA
3rd North
75 3rd Ave, New York, NY 10003, USA
6 Metrotech Center
Metrotech Center, Brooklyn, NY 11201, USA
721 Broadway
721 Broadway, New York, NY 10003, USA
7th Street
40 E 7th St, New York, NY 10003, USA
838 Broadway
838 Broadway, New York, NY 10003, USA
9th St. PATH Station
69 W 9th St, New York, NY 10011, USA
Abu Dhabi House
19 Washington Square N, New York, NY 10011, USA
Affinia Manhattan Hotel
371 7th Ave, New York, NY 10001
Alumni Hall
33 3rd Ave, New York, NY 10003, USA
Bobst Library
70 Washington Square South, New York, NY 10012, United States
Brittany Hall
55 East 10th Street, New York, NY 10003, United States
Broome
400 Broome St, New York, NY 10013, USA
Brown Bldg
29 Washington Pl, New York, NY 10003, USA
Cantor Film
36 East 8th Street, New York, NY 10003, United States
Carlyle Hall
25 Union Square W, New York, NY 10003, USA
Clark Residence Hall (Poly)
55 Clark St, Brooklyn, NY 11201, USA
Coles Center
181 Mercer Street, New York, NY 10012, United States
College of Dentistry
345 E 24th St, New York, NY 10010, USA
Computer Center
242 Greene St, New York, NY 10003, USA
Coral Tower
East 14th Street, New York, NY 10003, United States
D'Agostino
110 W 3rd St, New York, NY 10012, USA
East Bldg
239 Greene St, New York, NY 10003, USA
Ehrenkranz
13 Washington Square S, New York, NY 10012, USA
Faculty of Arts
5 Washington Square S, New York, NY 10012, USA
Founders Hall
120 E 12th St, New York, NY 10003, USA
Furman Hall
249 Sullivan St, New York, NY 10012, USA
Gallatin
715 Broadway, New York, NY 10003, USA
Glucksman
1 Washington Mews, New York, NY 10003, USA
Goddard Hall
80 Washington Square E, New York, NY 10003, USA
Gramercy Green
310 3rd Ave, New York, NY 10010, USA
Greenwich Hotel
636 Greenwich St, New York, NY 10014, USA
Health Services
726 Broadway, New York, NY 10003, USA
HKMC
44 W 4th St, New York, NY 10012, USA
Kaufman
44 W 4th St, New York, NY 10012, USA
Kevorkian
50 Washington Square S, New York, NY 10012, USA

Conclusion

I tried Phi-3 Vision to extract text from images on CPU. If you don't mind the execution time, you might want to try using a combination of Phi-3 Vision and CPU. GPU instances are generally expensive, so it's nice to have such a lightweight model.