Using Regular Expressions to Make Blackout Poetry

I regex your reality and substitute my own

data science
programming
python
Author

Vincent “VM” Mercator

Published

May 11, 2024

Modified

July 30, 2024

A blackout poem that reads 'made you look'. The original poem is 'The Two Crabs' by Walter Crane.

I don’t think this is how blackout poetry is supposed to be made.

Blackout poetry, also known as erasure poetry, is the act of removing characters and words from an existing text to create an entirely new poem.1 There are lots of ways to make this kind of poetry more artful, but my favorites are the funny ones where most of the poem is removed. Some great examples of these include the Tumblr blog reallybadblackoutpoems and the posts in the r/othepelican subreddit.

Instead of making a blackout poem the usual way by first choosing the source text and then blacking out text to make the message, let’s work backwards. Starting with a desired message, we’ll find a source that contains it by combing through a text database, then black out the text.

Update

I’ve turned the algorithm that I describe in this article into a command-line application called blackout! You can view its source code on its GitHub repository.

Setup

Public domain works are a great source of text that hopefully won’t get me into legal or ethical trouble when I use them as input data.2 For this example, I’ll be using the Public Domain Poetry dataset by Hugging Face user DanFosing3 (scraped from the Public Domain Poetry website) to find a poem that contains the letters of blackoutpoem in order.

import datetime
import os
import re
import string
import sys

import pandas as pd

POEMS_JSON = "_poems.json"
"""Where the JSON database of public domain poems is stored locally."""
BLACKOUT_MESSAGE = "blackoutpoem"
"""The desired message for the blackout poem."""
MIN_POEM_LENGTH = 0
"""Minimum poem length [characters]."""
MAX_POEM_LENGTH = 400
"""Maximum poem length [characters]."""
SEED = 20240511
"""RNG Seed for reproducibility."""

# Get a public domain poetry dataset from Hugging Face; save it for reuse
if os.path.exists(POEMS_JSON):
    pdp_dataset = pd.read_json(POEMS_JSON)
else:
    pdp_dataset = pd.read_json(
        "hf://datasets/DanFosing/public-domain-poetry/poems.json"
    )
    pdp_dataset.to_json(POEMS_JSON)

# Filter poem size
pdp_filter = [
    (MIN_POEM_LENGTH <= len(poem_str) <= MAX_POEM_LENGTH)
    for poem_str in pdp_dataset["text"]
]
pdp_dataset = pdp_dataset[pdp_filter]

To reduce the dataset and speed up the combing process, the filter above limits our search to poems with 400 characters or fewer.

Finding the Right Poem with RegEx

Now that we have our source text dataset, we can use our message to write a regular expression (abbreviated as “regex”)4 to find the poem we want to black out. Regular expressions describe how to select strings of characters based on a specific rule or pattern.5

We can split a blackout poem with an \(n\)-character message into \(2n+1\) groups: \(n\) single-character groups, one for each character in the blackout message; \(n-1\) groups of zero or more characters between consecutive message characters; and \(2\) more groups of zero or more characters before and after the blackout message.

If our blackout message was abc, for instance, we would need seven groups. The regex syntax uses parentheses to denote individual groups of substrings to match.

flowchart LR
  any1["0+ characters"]
  --> a["#quot;a#quot;"]
  --> any2["0+ characters"]
  --> b["#quot;b#quot;"]
  --> any3["0+ characters"]
  --> c["#quot;c#quot;"]
  --> any4["0+ characters"]

Flowchart showing the pattern a string should match for the regular expression (.*?)(a)(.*?)(b)(.*?)(c)(.*?).
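
As a quick check of the \(2n+1\) count, here is a minimal sketch that runs the pattern from the caption against a made-up sample string ("a big cat" is my own example, not from the dataset) and counts the groups.

```python
import re

# The pattern from the flowchart caption: 3 message characters -> 2*3 + 1 groups
pattern = re.compile(r"(.*?)(a)(.*?)(b)(.*?)(c)(.*?)")

# "a big cat" is an invented sample string, not a poem from the dataset
match = pattern.search("a big cat")
groups = match.groups()

print(len(groups))
for idx, group in enumerate(groups):
    # Odd positions hold message characters; even positions hold filler text
    kind = "message" if idx % 2 else "filler"
    print(idx, kind, repr(group))
```

The odd-numbered positions hold the message characters and the even-numbered ones hold the text that will later be blacked out.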

For the groups around the blackout message characters, we use the period (.) to select any non-newline character, then place the asterisk (*) to say that we want it to match that zero or more times. We can further speed up the regex searching process by adding a question mark (?) after the asterisk, which makes the group lazy: it matches as few characters as possible.
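
To see what the question mark changes, here is a small sketch comparing the lazy pattern with a greedy variant on an invented string. It only shows which characters each version selects; it makes no claim about timing.

```python
import re

# Invented sample: the message "abc" appears twice in a row
text = "abcabc"

lazy = re.compile(r"(.*?)(a)(.*?)(b)(.*?)(c)(.*?)")
greedy = re.compile(r"(.*)(a)(.*)(b)(.*)(c)(.*)")

lazy_match = lazy.search(text)
greedy_match = greedy.search(text)

# start(2) is the index of the "a" that each pattern chose
print("lazy picks 'a' at index", lazy_match.start(2))
print("greedy picks 'a' at index", greedy_match.start(2))
print(lazy_match.groups())
print(greedy_match.groups())
```

The lazy version settles on the earliest usable characters, while the greedy version first swallows as much as it can and backtracks to the latest ones.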

So, we can convert our blackout message into a regex using the following string concatenation.

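# Create regex from blackout message; escape each character so that
# punctuation such as "." or "?" in the message is matched literally
blackout_msg_regex = re.compile(
    "(.*?)(" + ")(.*?)(".join(map(re.escape, BLACKOUT_MESSAGE)) + ")(.*?)"
)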
print(blackout_msg_regex)
re.compile('(.*?)(b)(.*?)(l)(.*?)(a)(.*?)(c)(.*?)(k)(.*?)(o)(.*?)(u)(.*?)(t)(.*?)(p)(.*?)(o)(.*?)(e)(.*?)(m)(.*?)')

Since our regular expression won’t search across line breaks, let’s replace all line breaks with the “carriage return” symbol so that the regex can search each poem in its entirety for our blackout message.

# Condense each poem into one line for regex matching
pdp_dataset["strings"] = [poem.replace("\n", "↵") for poem in pdp_dataset["text"]]
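
The line-break behaviour can be checked on a tiny made-up two-line string. The sketch below also tries the re.DOTALL flag, which lets the period match newlines; that flag is an alternative I'm pointing out, not what the code in this article uses.

```python
import re

# Invented two-line "poem": the message characters sit on different lines
poem = "a\nb"

single_line = re.compile(r"(.*?)(a)(.*?)(b)(.*?)")
# Alternative (an assumption, not used in the article): let "." match newlines
dot_all = re.compile(r"(.*?)(a)(.*?)(b)(.*?)", re.DOTALL)

no_match = single_line.search(poem)
replaced_match = single_line.search(poem.replace("\n", "↵"))
dotall_match = dot_all.search(poem)

print(no_match)                      # the period stops at the line break
print(replaced_match is not None)    # works once the line break is replaced
print(dotall_match is not None)      # works with re.DOTALL as well
```

Replacing the line breaks has the side benefit that the blacked-out groups keep a visible marker for where each line ended.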

We now use our blackout message regex to find a good source poem in our dataset that contains our blackout message.

# Initialize regex match
regex_match = None

# Shuffle the dataset, find a good poem for our blackout message
for poem_idx, poem in pdp_dataset.sample(frac=1, random_state=SEED).iterrows():
    regex_match = re.search(blackout_msg_regex, str(poem["strings"]))
    if regex_match:
        break

# Raise an error if no poem matches the blackout regex
if regex_match is None:
    msg = "Can't find a source poem for the given blackout message."
    raise ValueError(msg)

print(f'"{poem["Title"]}", by {poem["Author"]}\n')
print(poem["text"])
"A Man To A Sunflower", by Peter Courtney Quennell, Sir

See, I have bent thee by thy saffron hair
O most strange masker, 
Towards my face, thy face so full of eyes
O almost legendary monster, 
Thee of the saffron, circling hair I bend,
Bend by my fingers knotted in thy hair
Hair like broad flames.
So, shall I swear by beech-husk, spindleberry,
To break thee, saffron hair and peering eye,
To have the mastery?

Our algorithm has chosen the poem “A Man to A Sunflower” by Sir Peter Courtney Quennell.
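
What the regex search really asks is whether the message is a subsequence of the poem. As a rough sketch of that idea without regex (a swapped-in technique, not what the code above does; contains_message is a hypothetical helper name), an iterator can walk the poem once:

```python
def contains_message(text, message):
    """Hypothetical helper: True if `message` is a subsequence of `text`."""
    remaining = iter(text)
    # `ch in remaining` advances the iterator until it finds `ch`,
    # so each message character must appear after the previous one
    return all(ch in remaining for ch in message)


# Two lines from the chosen poem, checked against a short invented message
first = contains_message("Bend by my fingers knotted in thy hair", "bin")
second = contains_message("Hair like broad flames.", "bin")
print(first, second)
```

Like the regex, this check is case-sensitive. It only answers yes or no, so the regex is still needed to recover the groups for blacking out.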

Blacking Out the Poem

To black out the poem, all we need to do is replace every non-whitespace character outside the blackout message with the Unicode full block character (█), and replace all “carriage return” symbols with actual newline characters.

rebuilt_poem = ""
blackout_regex = re.compile(f"[^{string.whitespace}]")

for group_idx, group in enumerate(regex_match.groups()):
    if group_idx % 2 == 0:
        # Groups for blacking out
        enter_string = group.replace("↵", "\n")
        rebuilt_poem += re.sub(blackout_regex, "█", enter_string)

    else:
        # Groups to preserve
        rebuilt_poem += group

print(rebuilt_poem)
print(f'\n{BLACKOUT_MESSAGE}\n{poem["Author"]}; "{poem["Title"]}"')
████ █ ████ b███ ████ ██ ███ ███████ ████
█ ████ ███████ ███████ 
███████ ██ █████ ███ ████ ██ ██l█ ██ ████
█ a█████ █████████ ████████ 
████ ██ ███ ████████ c███████ ████ █ █████
████ ██ ██ ███████ k█o████ ██ ███ ████
████ ████ █████ ███████
███ █████ █ █████ ██ ███████u███ █████████████
██ █████ t████ ███████ ████ ███ p██████ ████
█o ███e ███ m

blackoutpoem
Peter Courtney Quennell, Sir; "A Man To A Sunflower"
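
Notice that the blacked-out poem stops at the final m, even though the last line of the source reads “To have the mastery?”. The trailing (.*?) group is lazy and nothing follows it, so it matches the empty string and the rest of the poem is dropped. Here is a minimal sketch of one way to keep the tail, assuming we black out everything after the end of the match; black_out is a hypothetical helper, and it uses re.DOTALL instead of the “carriage return” replacement.

```python
import re

BLOCK = "█"


def black_out(text, message):
    """Hypothetical helper: black out `text` so only `message` stays visible."""
    # No trailing group: the tail is handled separately below
    pattern = re.compile(
        "(.*?)(" + ")(.*?)(".join(map(re.escape, message)) + ")",
        re.DOTALL,
    )
    match = pattern.search(text)
    if match is None:
        return None

    pieces = []
    for idx, group in enumerate(match.groups()):
        # Even-indexed groups are filler; odd-indexed groups are message characters
        pieces.append(re.sub(r"\S", BLOCK, group) if idx % 2 == 0 else group)

    # Black out whatever follows the last message character
    pieces.append(re.sub(r"\S", BLOCK, text[match.end():]))
    return "".join(pieces)


result = black_out("To have the mastery?", "tm")
print(result)
```

With this change the last line keeps its full length, with the letters after the m blacked out instead of missing.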

Conclusion

Why did I decide to do this?

██ ████ ███ ███████ b██ █████ █████e█ c██a█u███ ████s█ ██e█ █i██ ███ ██ c█a████ ███ ███ ██ n██████

“because I can”

Excerpt from Answered Extempore By Dr. Swift by Jonathan Swift

If I want to turn this into an actual command-line tool, then I should use a faster language or library that works better with string parsing. Python is great for quick programming, but it is notoriously slow because it is a high-level, interpreted language, and its global interpreter lock (GIL) limits how much multithreading can help. While the pandas library was convenient for downloading the database, it’s not designed for string processing.

Parallelization could help with speeding up the regex matching process. I could hand that off to grep or similar tools like ripgrep, ack, or ag for faster regex matching; ripgrep and ag can also search in parallel.

Package Versions & Last Run Date

Python 3.13.5 (main, Jun 25 2025, 18:55:22) [GCC 14.2.0]
Pandas 2.3.3
Last Run 2025-10-04 12:52:34.068058

References

Academy of American Poets. “Erasure.” Text. Poets.org. Accessed May 12, 2024. https://poets.org/glossary/erasure.
DanFosing. “Public-Domain-Poetry,” September 2023. https://huggingface.co/datasets/DanFosing/public-domain-poetry.
Kuchling, A. M. “Regular Expression HOWTO.” Python Documentation, April 2024. https://docs.python.org/3/howto/regex.html.

Footnotes

  1. Academy of American Poets, “Erasure”.

  2. I would tell OpenAI, Microsoft, and LAION-AI to take some notes here, but I’m pretty sure I’ve blocked their web scrapers from my website.

  3. DanFosing, “Public-Domain-Poetry”.

  4. I pronounce this as “REE-jeks”, but apparently the correct pronunciation is “REH-gehks”.

  5. Kuchling, “Regular Expression HOWTO”.

Reuse

CC BY-SA 4.0 International License