Project mines 8 million news pages for poetry
What differentiates a line of text from a news story and a line of text from a poem? Not much, and that’s a problem for researchers of American poetry.
For nearly a century, United States history was documented in newspapers through more than the typical news reports. Millions of poems submitted to and published by newspapers from the late 18th through early 20th centuries also illustrated the lives and concerns of Americans. These poems, if analyzed, could change the history of American literature, said Elizabeth Lorang, research assistant professor and digital humanities projects librarian in the University Libraries.
“Millions of poems were published in newspapers,” she said. “Looking at them will shift the way we understand poetry in the United States.”
But how do researchers locate poems among eight million digitized news pages? Lorang has teamed up with Leen-Kiat Soh, associate professor of computer science and engineering to develop software that will perform image-processing functions to mine data from digital formats.
“This is a big data problem,” Soh said. “What could be done manually, there is now a possibility to do it with computers.”
Similar to text-mining applications, where specific words and phrases are mined from digital sources, the goal of the image processing computer program is to locate specific images or outlines of images. The idea traces back to Lorang’s doctoral dissertation project, when she spent 18 months scouring old newspapers for poems. She was only able to catalog 3,000 poems in that time, but she noticed that the poems were often easily recognizable when looking at the whole page at once.
“Going through the newspapers, I zoomed out on the pages, because the poetic content is so visually distinctive on the page,” Lorang said. “There are differences in white space, justification and other visual cues.”
Lorang and Soh secured a first round of funding to build the software through an 18-month, $60,000 start-up grant from the National Endowment for the Humanities. The funds are enabling Lorang and Soh to work with two undergraduate CSE students to write the code. They plan to have the application running efficiently to locate nearly all of the poems in the millions of digitized news pages of the Chronicling America collection.
The students, Spencer Kulwicki and Maanas Varma Datla, joined the project last spring. The UNL sophomores spend about 15 to 20 hours a week working on code and consulting with Lorang and Soh. The experience has been invaluable to them, they said; the interdisciplinary nature of the project is not surprising to the students, but is demonstrative of the possibilities in their chosen career field.
“When I came to UNL, I had no clue what was going on, but at this moment, I can program in three different languages,” Datla said.
Lorang said she believes the project will grow as more researchers learn about it. She also knows that image processing is gaining significant traction in the realm of digital humanities research.
“If we think about the massive digital libraries that we’re creating, the tradition has been to use the text that’s created in those processes to enable us to discover content, but at the same time we’re creating digital images,” Lorang said. “If we don’t do anything with those digital images, we’re missing a lot of the potential of the digital libraries.
“We’re advancing some new methods of digital library research and information retrieval research. Instead of looking for content by trying to discern linguistic features or textual cues, we are looking for the visual forms of those texts, which is a novel approach. I could foresee this software being used in many different applications. We’ve talked about death notices, advertisements, tabular information such as sports scores or weather information. We see this as a methodology that could be relevant to finding a whole variety of content in digitized collections, but we’re starting with the poetry in newspapers as a test case.”