Following my previous post, The Science Behind Readability, a management consultant asked me how to extract text from a PowerPoint deck and assess its readability. For this consultant, the top line of the slide, or tagline, is crucial in conveying the slide's message. He described the tagline as the ‘So What’ of the slide.
To help improve taglines, I compiled a standalone Windows executable that reads text from a PowerPoint deck and scores the readability of the selected text. After introducing the app, I will share the Python code I use to extract text from PowerPoint.
1. A Windows App for Tagline Readability
The attached ZIP file contains a Windows command-line program called ‘Tagline.exe’. It is a self-contained Windows program that does not need Python installed. The app takes a PowerPoint deck and assesses the readability of your taglines. I have also included a deck called ‘Test.pptx’ that you can use to test the app. Note that the PowerPoint file you are using must be in the same directory as Tagline.exe.
You can download the ZIP file here: PPT Readability App
ZIP File Contents
To use the app, first type in the file name including the extension, e.g. Test.pptx. The file name is case sensitive, so use capital letters where required. Next, enter the font size to extract from the PPT file. If you press return without typing anything, the default is 18 pt. Then enter the minimum number of characters to pull. This filter removes page numbers and title-page text that are not full sentences. If you press return without typing anything, the default is 30 characters.
Enter PowerPoint File
Lastly, the app extracts the sentences from your PowerPoint deck into a file called output.txt and gives you the Flesch-Kincaid Grade Level and the Flesch Reading Ease score.
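For readers curious how these two scores are calculated, here is a minimal sketch using the standard published formulas. The syllable counter is a rough vowel-group heuristic of my own choosing for illustration, and the function names `count_syllables` and `readability` are hypothetical. This is not necessarily how Tagline.exe computes its scores, so expect small differences.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Heuristic only: treat a trailing 'e' as silent, except in '-le' endings
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / float(len(sentences))
    syllables_per_word = syllables / float(len(words))
    reading_ease = (206.835 - 1.015 * words_per_sentence
                    - 84.6 * syllables_per_word)
    grade_level = (0.39 * words_per_sentence
                   + 11.8 * syllables_per_word - 15.59)
    return reading_ease, grade_level
```

A higher Reading Ease score means easier text, while a higher Grade Level means harder text, so short taglines built from short words score best on both measures.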
2. Use Python to Extract PowerPoint Text
The code I used to extract text from PPT relies on a Python library called ‘python-pptx’. The library can be used not only to extract text but also to automate slide generation for reports, i.e. you can both read and write PowerPoint slides. You can read more on how to use this library at Python-PPTX.
I use Python 2.7 here. An explanation of each code section comes first, followed by the code itself:
In #01, I use os.getcwd() to get the name of the current directory and os.listdir() to return the files in it.
Next, in #02, I define a list to hold the extracted lines and prompt for the name of the file to read.
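As a small illustration of this directory check, the sketch below lists the .pptx files in a directory and tests whether a given name is among them. The helper names `list_pptx_files` and `is_available` are hypothetical and are not part of my script.

```python
import os


def list_pptx_files(directory=None):
    """Return the .pptx file names found in a directory, sorted by name."""
    directory = directory or os.getcwd()
    return sorted(name for name in os.listdir(directory)
                  if name.endswith(".pptx"))


def is_available(file_name, directory=None):
    """Exact, case-sensitive check that file_name is a listed .pptx file."""
    return file_name in list_pptx_files(directory)
```

Because the check compares the typed name against the directory listing as plain strings, ‘test.pptx’ does not match ‘Test.pptx’, which is why the file name has to be typed with the right capital letters.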
Section #03 is where most of the work occurs. I first grab two more variables: the font size and the minimum number of characters for a sentence. #03.2 is a set of nested ‘for’ loops that goes through every slide, shape, paragraph and run in the PowerPoint deck. The ‘if’ statement then checks the font size of each run using ‘if font.size == Pt(size):’. If the font size matches, the text is added to the list ‘text_runs’. Lastly, in #03.3, we write the list to a file called ‘output.txt’. This is where we apply the minimum number of characters, to ensure we keep proper sentences and not small titles or page numbers. You will notice there is a try and except clause here. It is necessary because PowerPoint sometimes uses characters that are difficult to write out as plain text, e.g. a long ‘–’ or a misspelled word that has a red highlight. When Python cannot handle a character, it skips that text and moves on to the next.
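#01_Import libraries and list the files in the current directory
import os
from pptx import Presentation
from pptx.util import Pt

cwd = os.getcwd()
listOfdir = os.listdir(cwd)

#02_Define the output list and prompt for the file name
# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []
UserFileName = raw_input("\nEnter File Name e.g. 'Test.pptx':")

#03_Extract the text of the chosen font size
if (UserFileName in listOfdir) and (UserFileName.endswith('.pptx')):
    prs = Presentation(UserFileName)
    size = raw_input("Enter size of font to pull from PPT e.g. '18':")
    minchar = raw_input("Enter minimum characters in text e.g. '30':")
    # Fall back to the defaults when the user just presses return
    if size == '':
        size = 18
    if minchar == '':
        minchar = 30
    # raw_input returns strings, so convert before comparing
    size = int(size)
    minchar = int(minchar)

    #03.2_Read ppt deck for all text frames of font size
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    font = run.font
                    if font.size == Pt(size):
                        text_runs.append(run.text)

    #03.3_Write output to file 'output.txt'
    with open('output.txt', 'w') as file_handler:
        for item in text_runs:
            if len(item) >= minchar:
                try:
                    file_handler.write("{}\n".format(item))
                except UnicodeEncodeError:
                    # Skip text with characters that cannot be written
                    continue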
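If you are on Python 3, one possible alternative to skipping troublesome text is to write the output file as UTF-8, so that long dashes and similar characters are kept. The sketch below is an assumption on my part rather than what Tagline.exe does. The helper name `write_taglines` and its parameters are hypothetical, and it expects a list of strings such as ‘text_runs’.

```python
def write_taglines(text_runs, minchar=30, path="output.txt"):
    """Write runs of at least minchar characters to path, one per line.

    Assumes Python 3, where strings are Unicode and open() accepts
    an encoding argument.
    """
    # Keep only text long enough to be a full sentence
    kept = [text.strip() for text in text_runs
            if len(text.strip()) >= minchar]
    # UTF-8 can represent characters such as a long dash
    with open(path, "w", encoding="utf-8") as handle:
        for line in kept:
            handle.write(line + "\n")
    return kept
```

Returning the kept lines makes it easy to pass the same text on to a readability scorer without reading the file back in.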
That’s it for the PowerPoint tagline readability post. Let me know if you have any comments by writing to me at firstname.lastname@example.org