Language tool helps decipher ancient texts

2023-12-26 | China Daily

Xunzi, a pioneering large language model designed specifically for the processing and study of ancient texts, was launched earlier this month by Professor Wang Dongbo and his research team from the College of Information Management at Nanjing Agricultural University.

Xunzi, the first intelligent tool of its kind in China, features a vast corpus of more than 2 billion words from ancient texts, including the Siku Quanshu (The Complete Library in the Four Branches of Literature).

Xunzi can understand natural language, translate automatically, generate poems, and index texts automatically. It has been open-sourced on platforms such as GitHub and ModelScope.

The research team named the language model after ancient Chinese philosopher and master of prose, Xun Zi, from the Warring States Period (475-221 BC).

During its research, the team found that he was not only a great philosopher, but also a pioneer in linguistics.

Nowadays, readers often find it difficult to understand ancient texts due to challenges such as complex traditional Chinese characters, vertical layout, and the absence of punctuation marks.

The launch of Xunzi makes it possible to engage with such texts in the era of smart media, Wang says.

In a demonstration, Wang instructed the model to generate a five-character quatrain with Jinling (the ancient name of Nanjing, in East China's Jiangsu province) as the theme. The system promptly produced a well-written original quatrain.

Xunzi can also easily handle challenging tasks involving ancient texts, such as reading comprehension, adding punctuation, and translating texts into modern Chinese.

Experts in ancient Chinese studies can leverage Xunzi for tasks like analyzing word structure, recognizing linguistic entities, and classifying and summarizing ancient texts.

The model can complete all these tasks thanks to high-performance computing facilities provided by Nanjing Agricultural University and a substantial corpus of annotated and refined data accumulated over a long time, Wang says.

"Our team has fed the model with a massive 4 billion-word mixed corpus," he says.

Many factors can influence the building of a language model, such as computing power or application scenarios, but it essentially relies on the precise, high-quality data fed to it, Wang says. Since 2013, his research team has focused on painstaking manual data annotation to establish a solid foundation for Xunzi.

Wang takes the essay In Praise of Yueyang Tower by Fan Zhongyan, a politician and writer from the Song Dynasty (960-1279), as an example.

"To train the machine to mark all the adjective words in this ancient essay, we need to first train people to do the work, and afterward let the machine learn the marked text," he says.

Wang says the research is expected to benefit both the training of interdisciplinary talent and everyday users of ancient texts. The ultimate goal is to engage a broader audience with ancient texts, promoting innovation in traditional Chinese culture.

Beyond helping general users work with ancient texts and advancing the organization and digitization of those texts, Xunzi is poised for extensive applications in AI writing and teaching, digital entertainment, and various other domains.