As libraries have increasingly come to recognize the value of digitizing historical works in their holdings, many institutions with significant collections of Chinese materials have committed themselves to large-scale scanning projects, often making the resulting images freely available over the internet.
While an enormously positive development in itself, for many scholarly use cases this represents only the first step towards adequate digitization of these works.
Since its origins in 2005 as an online search tool for a small number of classical Chinese texts, the Chinese Text Project has grown to become one of the largest and most widely used digital libraries of pre-modern Chinese writing, containing tens of thousands of transmitted texts dating from the Warring States through to the late Qing and republican period, while also serving as a platform for the application of digital methods to the study of pre-modern Chinese literature.
Unlike most digital libraries and full-text databases, users of the site are not passive consumers of its materials, but instead active curators through whose work it is maintained and developed – and increasingly, not all users of the library are human.
Volunteers located around the world correct mistakes and add modern punctuation to the texts as time allows and according to their own interests – typically hundreds of corrections are made each day.
Digital methods have revolutionized many aspects of the study of pre-modern Chinese literature, from the simple but transformative ability to perform full-text searches and automated concordancing, through to the application of sophisticated statistical techniques that would be entirely impractical without the aid of a computer.
While the methods themselves have evolved significantly – and continue to do so – one of the most fundamental prerequisites to almost all digital studies of Chinese literature remains access to reliable digital editions of these texts themselves.
OCR software must correctly identify all of these instances as corresponding to the same abstract character – a challenging task for a computer.In an attempt to address this problem, the Chinese Text Project has developed a hybrid system, in which uncorrected OCR results are imported directly into a database system providing full-text search of the source images and assembling the contents of the scanned images of pages into complete textual transcriptions, while also providing an integrated mechanism for users to directly correct the data.