Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston’s public library.
Supported by “unrestricted gifts” from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries and museums around the world on how to make their historic collections AI-ready in a way that also benefits the communities they serve.
“We’re trying to move some of the power from this current AI moment back to these institutions,” said Aristana Scourtas, who manages research at Harvard Law School’s Library Innovation Lab. “Librarians have always been the stewards of data and the stewards of information.”
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
“A lot of the data that’s been used in AI training has not come from original sources,” said the data initiative’s executive director, Greg Leppert, who is also chief technologist at Harvard’s Berkman Klein Center for Internet & Society. This book collection goes “all the way back to the physical copy that was scanned by the institutions that actually collected those items,” he said.
Meta, for its part, is battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from “shadow libraries” of pirated works.
Now, with some reservations, the real libraries are standing up.
When OpenAI first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
“OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,” Chapel said.
Digitization is expensive. It has been painstaking work, for instance, for Boston’s library to scan and curate dozens of New England’s French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such text is valuable as training data, it helps bankroll projects that librarians want to do anyway.
The new effort was applauded Thursday by the same authors’ group that sued Google over its book project and more recently has brought AI companies to court.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be “immensely critical” for the tech industry’s efforts to build AI agents that can plan and reason as well as humans, Leppert said.
“At a university, you have a lot of pedagogy around what it means to reason,” Leppert said. “You have a lot of scientific information about how to run processes and how to run analyses.”
At the same time, there’s also plenty of outdated data, from debunked scientific and medical theories to racist and colonial narratives.
“When you’re dealing with such a large data set, there are some tricky issues around harmful content and language,” said Kristi Mukk, a coordinator at Harvard’s Library Innovation Lab. She said the initiative is trying to provide guidance about mitigating the risks of using the data, to “help them make their own informed decisions and use AI responsibly.”