Raw-fi Data

Extract table data from docx

2024-10-12

Word Document

Last time, I extracted table data from a PDF. Like PDF, Word is widely used in business and education, and many documents are saved in Word format. Extracting table data directly from a .docx file (a Word document) saves you from retyping or manually converting the data when you move it to another format or import it into an analysis tool.

This time, let's use python-docx to extract table data from Word.

python-docx

python-docx is a Python library for working with .docx files from Microsoft Word 2007 and later. It allows you to read, create, and update Word documents in Python. It can be installed with pip install python-docx; note that the package is then imported under the name docx.

Get table data from a Word document using python-docx

With python-docx, you can access and extract table data in a Word document.

  1. Create a Document object: Load the Word document by passing its path to Document(), which returns a Document object.
  2. Access the tables: The Document object has a tables attribute that holds the document's top-level tables as a list, in document order. Tables nested inside a cell are not included.
  3. Iterate through rows and cells: For each table, access the rows using the rows attribute, and then access the cells using each row's cells attribute.
  4. Get the cell contents: The text of a cell is available as a string through its text attribute.

Below is a code example to retrieve table data from a Word document:

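from docx import Document

# Load the Word document (replace with the path to your own file)
document = Document('your_word_file.docx')

# Walk every top-level table, row by row and cell by cell
for table in document.tables:
  for row in table.rows:
    for cell in row.cells:
      print(cell.text)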

Visualize table data read with python-docx using pandas and matplotlib

To visualize the acquired table data, use pandas and matplotlib.

  1. Create a DataFrame with pandas: Store the acquired table data in a pandas DataFrame. A DataFrame is a convenient structure for handling tabular data.
  2. Create a graph with matplotlib: Create a graph using matplotlib based on the data in the DataFrame. You can create various types of graphs, such as line graphs, bar graphs, and scatter plots.
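Step 1 can be sketched on its own as follows. Because every cell comes back as a string, numeric columns have to be converted before they can be plotted. This is only an illustration under stated assumptions: rows_to_dataframe is a hypothetical helper name, the sample rows are made up, and the first row is assumed to be the header.

```python
import pandas as pd


def rows_to_dataframe(rows, numeric_columns=()):
    """Build a DataFrame from extracted rows, treating rows[0] as the header."""
    header, *body = rows
    df = pd.DataFrame(body, columns=header)
    for name in numeric_columns:
        # Cell text is always a string; values that are not numbers become NaN
        df[name] = pd.to_numeric(df[name], errors="coerce")
    return df


# Made-up rows standing in for data extracted from a Word table
sample_rows = [
    ["Item", "Count"],
    ["A", "12"],
    ["B", "30"],
    ["C", "n/a"],
]
df = rows_to_dataframe(sample_rows, numeric_columns=["Count"])
print(df)
```

Using errors="coerce" turns unparseable cells into NaN instead of raising an error, which tends to be more forgiving for hand-edited Word tables. Once the column is numeric, a call such as df.plot(kind="bar", x="Item", y="Count") can draw it with matplotlib.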

Below is a code example that reads table data from a Word document and visualizes it using pandas and matplotlib on Google Colab:

!pip install --quiet python-docx
!pip install --quiet pandas
!pip install --quiet matplotlib

import docx
import pandas as pd
import matplotlib.pyplot as plt

# Download the file (replace with your download method if needed)
!wget https://www2.hu-berlin.de/stadtlabor/wp-content/uploads/2021/12/sample3.docx -O sample3.docx

doc = docx.Document('sample3.docx')

# Find the first table
table = doc.tables[0]

# Extract data from the table
data = []
for row in table.rows:
  row_data = [cell.text for cell in row.cells]
  data.append(row_data)

# Convert to a pandas DataFrame, using the first row as the header
df = pd.DataFrame(data[1:], columns=data[0])
# Cell text is always a string, so convert the numeric column before plotting
df = df.astype({'Responses': int})

# Plot the 'Responses' values as a pie chart labeled by 'Screen Reader'
# Adjust both column names to match the header row of your own table
df.plot(kind='pie', y='Responses', labels=df['Screen Reader'], autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.show()

Figure: python-docx_ex (pie chart produced by the code above)

Conclusion

As with tabula-py, tabular data extracted with python-docx can be easily converted to a pandas DataFrame, which makes it easy to combine with other data and visualize.

python-docx has a lot more functionality than just extracting tabular data. See the python-docx documentation for more information.