EXTRACT INFORMATION FROM A MS WORD FILE USING PYTHON

Do you often have to manually copy the same information out of a Microsoft Word document? You are in luck: a short Python script can automate this.

Some use cases for MS Word data extraction:

  • Resumes
    • Automatically pull out phone numbers, emails and all bold text
  • Receipts
    • Automatically identify tax-file number information and underlined numbers
  • Legal Documents
    • Summarize legal documents by extracting bold headers that start with a number

Background

Firstly, let’s understand a little more about the structure of a Word document. Perhaps surprisingly, a DOCX file is just a ZIP archive containing a set of XML (Extensible Markup Language) documents, which use HTML-like tags. If you unzip any Word DOCX file, you will get a set of folders like the ones below.

The most important folder is ‘word’, and within it the XML file that holds the main structure and content is ‘document.xml’. You will notice that the other XML files contain information about styles and numbering. This matters because the python-docx library may struggle with anything that is not contained within document.xml. If you need access to those details, you may have to work with the XML files directly.
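
To see this layout without unzipping by hand, Python’s built-in zipfile module can open a DOCX directly. The sketch below is only an illustration and not part of the script in this post: it builds a tiny stand-in archive in memory (it is not a valid Word file, it only mimics the folder layout) so that it runs without any document on disk. The helper names list_docx_parts and read_document_xml are made up for this example.

```python
import io
import zipfile

def list_docx_parts(docx_file):
    """Return the names of the files stored inside a DOCX (ZIP) archive."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()

def read_document_xml(docx_file):
    """Return the raw XML of the main document part as a string."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read("word/document.xml").decode("utf-8")

# Build a tiny stand-in archive in memory so the example runs anywhere.
# This is NOT a valid Word file; it only mimics the folder layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document>Hello</w:document>")
    archive.writestr("word/styles.xml", "<w:styles/>")
    archive.writestr("word/numbering.xml", "<w:numbering/>")

parts = list_docx_parts(buffer)
xml_text = read_document_xml(buffer)
print(parts)
print(xml_text)
```

With a real file, you would pass its path instead of the in-memory buffer, for example list_docx_parts('thefonz.docx').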

Document Structure

A DOCX file has three levels of hierarchy. The Document object is the entire document and is composed of Paragraphs. A Paragraph object is made up of Run objects, where each run is a stretch of text with the same style, e.g. font, color, etc. In the example below you will see a single sentence that has 4 runs, each triggered by a change in style. The style changes from normal to bold to underline and back to normal.
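
Inside document.xml, this hierarchy appears as nested elements: a paragraph is a w:p element, each run is a w:r element, and a run’s formatting lives in its w:rPr child. As a rough sketch, the standard library can walk that structure. The XML fragment here is hand-written and simplified (real files carry many more attributes), and describe_runs is a hypothetical helper that treats any w:b or w:u element as "on", which is a simplification.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written, simplified paragraph with 4 runs:
# normal, bold, underline, normal.
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Plain, </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">bold, </w:t></w:r>
  <w:r><w:rPr><w:u w:val="single"/></w:rPr><w:t>underlined</w:t></w:r>
  <w:r><w:t xml:space="preserve"> and plain again.</w:t></w:r>
</w:p>
"""

def describe_runs(paragraph_xml):
    """Return a (text, style) tuple for each run in one paragraph element."""
    paragraph = ET.fromstring(paragraph_xml.strip())
    runs = []
    for run in paragraph.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        # Simplification: the mere presence of w:b / w:u counts as "on".
        if run.find("w:rPr/w:b", NS) is not None:
            style = "bold"
        elif run.find("w:rPr/w:u", NS) is not None:
            style = "underline"
        else:
            style = "normal"
        runs.append((text, style))
    return runs

runs = describe_runs(SAMPLE)
for text, style in runs:
    print(repr(text), style)
```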

Example – Extract Resume Data

Let’s say we have a resume from a gentleman named ‘The Fonz.’ From his DOCX file we want to know the email addresses, phone numbers and all the bold text.

The Code

The aim of the code is to generate JSON that contains our three items of data: emails, phone numbers and bold phrases. We will use our code to produce the following JSON.

Note, if you would like to download the code and the Word file to follow along, please use the below link:

Firstly, let’s install the library python-docx. Go ahead and use pip to install the library.

pip install python-docx

Next we will create a new Python file called ‘wordextract.py’ and import our libraries.

from docx import Document
import re
import json

#----------01_Import File Name----------
document = Document('thefonz.docx')  #Change filename here

The ‘docx’ module on line 1 (installed by python-docx) will read the Word file, the ‘re’ module on line 2 (short for Regular Expressions) is used to search text strings, and the ‘json’ module on line 3 will be used to format the output. On line 6, we open the Word file ‘thefonz.docx’ and assign it to the variable ‘document’. If you have another Word file, change the filename on this line.

In section two, we declare lists to place the output for the bold text, emails and phone numbers.

#-----------02_Declare Variables-----------
bolds = []
emails = []
phones = []

The third section does the actual data extraction from the Word DOCX file.

#-----------03_Extract Elements From the Word File-----------
for para in document.paragraphs:

    #03.1 Find email and phone numbers within the paragraph text
    text = para.text
    email_list = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text, re.IGNORECASE)  # also match uppercase letters
    phone_list = re.findall(r'[\+\(]?[0-9][0-9 .\-\(\)]{8,}[0-9]', text)

    for email in email_list:
        emails.append(email)

    for phone in phone_list:
        phones.append(phone)

    #03.2 Find the bold style within the word document
    for run in para.runs:
        if run.bold:
            bolds.append(run.text)

The first line of code loops through all the paragraphs in the document. Each paragraph is turned into text on line 5. From the paragraph’s text, regular expressions are used to search for string patterns.

On line 6, the email pattern looks for strings of letters, digits and a few punctuation characters in the format ‘string@string.string’. Similarly, on line 7 the phone pattern looks for text that optionally starts with a ‘+’ (for an international code) or a ‘(’ (for an area code), followed by a digit, then at least 8 further characters that may be digits, spaces, ‘.’, ‘-’ or brackets, and finally a closing digit. The results of each regular expression search are then appended to the lists ‘emails’ and ‘phones’.
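
To get a feel for what the two patterns accept, it helps to try them on a few strings outside the Word file. The following is a small sketch using made-up sample text. The helper find_contacts and the constant names are hypothetical, and the optional flags argument is only there to show how re.IGNORECASE changes the email match.

```python
import re

# The two patterns from the extraction step.
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
PHONE_PATTERN = r"[\+\(]?[0-9][0-9 .\-\(\)]{8,}[0-9]"

def find_contacts(text, flags=0):
    """Return (emails, phones) found in text using the two patterns."""
    emails = re.findall(EMAIL_PATTERN, text, flags)
    phones = re.findall(PHONE_PATTERN, text, flags)
    return emails, phones

sample = "Call +1 (414) 555-0100 or email fonz@example.com today"
emails, phones = find_contacts(sample)
print(emails)  # ['fonz@example.com']
print(phones)  # ['+1 (414) 555-0100']

# Too short: the phone pattern needs at least 10 characters in total.
short = find_contacts("Ext. 555-0100")
print(short)  # ([], [])

# The email character classes list lowercase letters only;
# re.IGNORECASE widens them to uppercase as well.
mixed = "Fonz@Example.COM"
strict = find_contacts(mixed)[0]
relaxed = find_contacts(mixed, re.IGNORECASE)[0]
print(strict)   # []
print(relaxed)  # ['Fonz@Example.COM']
```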

In section 3.2, we look at each run within the paragraph. If run.bold is set (line 17), we place the run’s text in the list ‘bolds’. If you want italic text instead, use ‘run.italic’; for underlined text, use ‘run.underline’, and so on. For more information on the properties of run objects, have a look at the python-docx link.
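
One caveat: in python-docx, run.bold is tri-state. It is True or False when bold is set directly on the run, and None when the run inherits its setting from a style, so a plain truth test only catches directly applied bold. Word also often splits one visually bold phrase across several adjacent runs. The sketch below shows one possible way to join adjacent bold runs into a single phrase. The helper bold_phrases is hypothetical, and the SimpleNamespace objects are stand-ins for python-docx paragraphs and runs so the example runs without a Word file.

```python
from types import SimpleNamespace

def bold_phrases(paragraphs):
    """Collect bold text, joining adjacent bold runs within a paragraph.

    Accepts any objects exposing .runs, whose items expose .bold and .text,
    which is the shape of python-docx Paragraph and Run objects.
    """
    phrases = []
    for para in paragraphs:
        current = []
        for run in para.runs:
            if run.bold:      # truthy only when bold is set on the run itself
                current.append(run.text)
            elif current:     # a non-bold run ends the current phrase
                phrases.append("".join(current))
                current = []
        if current:           # flush a phrase that ends the paragraph
            phrases.append("".join(current))
    return phrases

def make_run(text, bold=None):
    """Stand-in for a python-docx Run (not a real Run object)."""
    return SimpleNamespace(text=text, bold=bold)

# Stand-in paragraph with five runs; two adjacent bold runs form one phrase.
paragraph = SimpleNamespace(runs=[
    make_run("Skills: "),
    make_run("Motorcycle ", bold=True),
    make_run("Repair", bold=True),
    make_run(" and "),
    make_run("Jukebox Tuning", bold=True),
])

result = bold_phrases([paragraph])
print(result)  # ['Motorcycle Repair', 'Jukebox Tuning']
```

With a real document, this would be called as bold_phrases(document.paragraphs).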

Note, Regular Expressions are almost a language in their own right. If you want to learn more about Regular Expressions, check out this link and this cheat sheet.

Finally, we create a dictionary of our three pieces of data: emails, phone numbers and bold text. Note that a dictionary maps a set of objects (keys) to another set of objects (values).

#-----------04_Create Output-----------
style_Dict={'emails':emails,
              'phone_numbers':phones,
              'bold_phrases':bolds
              }

print("\nWord File Output:\n")

r = json.dumps(style_Dict)
loaded_r = json.loads(r)
print("\n",json.dumps(loaded_r,indent=4, sort_keys=False))  #Pretty print the JSON output

We first use json.dumps() to convert the dictionary to a JSON-formatted string, then json.loads() to parse that string back into a Python dictionary. This round trip is not strictly required, since the dictionary could be passed to the final json.dumps() call directly. The last line calls json.dumps() again with the ‘indent’ argument to pretty-print the output with indentation, giving the result shown in figure 4.
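
The script only prints the JSON. To keep the result, the same dictionary can also be written to a file with json.dump(). A minimal sketch follows, using placeholder values in place of the extracted lists. The helper save_results and the output name ‘thefonz.json’ are assumptions made for this example.

```python
import json
import os
import tempfile

def save_results(results, path):
    """Write the results dictionary to path as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=4)

# Placeholder values standing in for the lists built by the script.
style_Dict = {
    "emails": ["fonz@example.com"],
    "phone_numbers": ["+1 (414) 555-0100"],
    "bold_phrases": ["Motorcycle Repair"],
}

# Written to a temporary folder here so the example leaves nothing behind;
# in the script you might pass 'thefonz.json' (a hypothetical name) instead.
with tempfile.TemporaryDirectory() as folder:
    output_path = os.path.join(folder, "thefonz.json")
    save_results(style_Dict, output_path)
    with open(output_path, encoding="utf-8") as handle:
        reloaded = json.load(handle)

print(reloaded == style_Dict)  # True
```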

Aaaaaaay! That is it, you now have a way to automate data extraction from any MS Word DOCX.

