GovScape Simplifies Search of Millions of Gov Documents

University of Washington

At the end of every presidential term, the End of Term Web Archive preserves that administration's web presence as a vast trove of documents and webpages. The archive began in 2008, with George W. Bush's second term, and runs up to 2024, collecting images, text, graphs, redacted pages and other media. So while it contains important public information, finding that information in the glut can prove difficult.

A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like "FAFSA," or use a semantic search, which finds documents on a topic even if the exact search terms don't appear on the page. A visual search option lets them query for qualities like "redacted documents," "aerial photographs" or "pie charts." The system can currently search the 10 million PDFs hosted online during Donald Trump's first term; the team plans to expand it to the whole archive.

Because the researchers used highly efficient artificial intelligence models to read the documents, processing all the PDFs cost less than $1,500, or about $1 per 47,000 pages. By comparison, Google might charge consumers $1 to parse around 100 pages with AI.
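The figures above can be checked with a few lines of arithmetic. This is a rough sketch that uses only the numbers quoted in the article; the variable names are illustrative, and because the total is given as "less than $1,500" and the rate as "about" $1 per 47,000 pages, the results are approximate upper bounds rather than reported page counts.

```python
# Figures quoted in the article (approximate).
TOTAL_COST_USD = 1_500              # "less than $1,500" for all the PDFs
PAGES_PER_DOLLAR = 47_000           # "about $1 per 47,000 pages"
COMPARISON_PAGES_PER_DOLLAR = 100   # "around 100 pages" per $1 in the comparison

# Pages implied by the quoted total and rate (an upper bound).
implied_pages = TOTAL_COST_USD * PAGES_PER_DOLLAR

# What the same number of pages would cost at the comparison rate.
comparison_cost = implied_pages / COMPARISON_PAGES_PER_DOLLAR

# How many times more pages each dollar covers at the quoted rate.
cost_ratio = PAGES_PER_DOLLAR / COMPARISON_PAGES_PER_DOLLAR

print(f"Implied pages: {implied_pages:,}")
print(f"Cost at comparison rate: ${comparison_cost:,.0f}")
print(f"Pages per dollar ratio: {cost_ratio:.0f}x")
```

Under these assumptions, each dollar covers several hundred times more pages than at the comparison rate.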

The team will present its research July 5 at the Annual Meeting of the Association for Computational Linguistics in San Diego.

"The End of Term Web Archive is immensely important to historians, journalists and the American public," said senior author Benjamin Charles Germain Lee , a UW assistant professor in the Information School. "But many of these digital archives are getting so big — The Internet Archive just announced its trillionth page archived — that finding information is the real challenge."

The team worked with PDFs because they are a ubiquitous file format and can contain text, charts and images — a mix that is challenging for existing search systems but makes the documents ideal candidates for GovScape's multimodal search.

To process the documents, they built a pipeline that splits each PDF into individual pages, saves the pages as images, then pulls out the text. The researchers used highly efficient AI models to generate "embeddings" for both the text and images from each page. An embedding is essentially a string of numbers that systematically captures the content of the text or image.

"Just as library classification systems group books on similar topics on the same shelf, these embeddings group similar pages with one another based on their visual and textual content," Lee said.

Researchers then built different indexing systems for the three kinds of search. The keyword search uses a basic index — similar to a book index — for all the text. If a user types in "FAFSA," the system finds all the pages the word appears on.
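A book-style index of this kind is often implemented as an inverted index, which maps each word to the pages it appears on. The sketch below is a minimal illustration under that assumption, not GovScape's actual code; the function names, page IDs and sample text are all hypothetical.

```python
import re
from collections import defaultdict


def build_keyword_index(pages):
    """Map each lowercased word to the set of page IDs it appears on."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index


def keyword_search(index, term):
    """Return the sorted page IDs containing the exact term (case-insensitive)."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample pages, invented for illustration.
pages = {
    "doc1-p1": "Complete the FAFSA form before the deadline.",
    "doc1-p2": "Contact the financial aid office for help.",
    "doc2-p1": "FAFSA submissions rose this year.",
}

index = build_keyword_index(pages)
print(keyword_search(index, "FAFSA"))  # ['doc1-p1', 'doc2-p1']
```

The index is built once, so each lookup is a single dictionary access rather than a scan of every page.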

For semantic and image searches, the system takes the user's search term and creates an embedding. It then compares this embedding with the indices created from the embeddings of PDF pages and identifies the closest matches, which are returned as search results.
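The compare-and-rank step can be sketched as a nearest-neighbor search over vectors. In the example below, `embed` is a toy stand-in that just counts words; it is an assumption made for illustration and only matches shared words, whereas the AI models the article describes can also match pages that use different wording. The function names, the choice of cosine similarity as the measure of closeness, and the sample pages are likewise hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a neural embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, page_embeddings, k=2):
    """Embed the query, score it against every page, return the k closest IDs."""
    q = embed(query)
    scored = [(cosine(q, vec), page_id) for page_id, vec in page_embeddings.items()]
    # Highest similarity first; ties broken by page ID for a stable order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [page_id for _, page_id in scored[:k]]


# Hypothetical sample pages, invented for illustration.
pages = {
    "p1": "student loan application and financial aid",
    "p2": "aerial photographs of the coastline",
    "p3": "pie charts of agency budgets",
}
page_embeddings = {pid: embed(text) for pid, text in pages.items()}

print(search("financial aid for a student", page_embeddings, k=1))  # ['p1']
```

This sketch scores every page for each query. At the scale of millions of pages, the precomputed indices the article mentions would take the place of that exhaustive comparison.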

"Our next goal is to cover all of the 70 million PDFs in the entire End of Term Web Archive — everything from 2008 to 2024," Lee said. "One of the challenges moving forward is how to efficiently search at that scale."

Because government archives contain "every file type under the sun," Lee said, future work might expand to documents such as spreadsheets, images and HTML pages.

"I'm really excited about the prospects for better access to government information with projects like GovScape," Lee said. "Being able to actually find relevant information is vital to the health of democracy and to the functioning of society."

Co-authors include Kyle Deeds of Boston University, who completed this research as a doctoral student in the Paul G. Allen School of Computer Science & Engineering; Ying-Hsiang Huang and Leslie Harka, who completed this research as UW master's students in the Information School; Claire Gong, Shreya Shaji, Alison Yan, Albert Du and Anjali Gopal, all students in the Allen School; Samuel J. Klein of Harvard University; Shannon Zejiang Shen of the Massachusetts Institute of Technology; Mark Phillips of the University of North Texas; and Trevor Owens of the American Institute of Physics.
