How the current research was conducted
Most grammars for South Africa's Nguni languages are fairly dated (from the 1950s), so applying machine learning to study how these languages work can help update those linguistic descriptions and reflect modern language use.
Because the four languages share a similar linguistic structure, textual data can be collected and analysed in parallel, which allows researchers to do comparative computational linguistic studies. From this data, core technologies were developed in the form of morphological analysers, part-of-speech taggers, and lemmatisers.
Analysing the text with the new morphological analyser raised overall accuracy to between 82% and 92%, outperforming previously developed rule-based analysers for the same languages.
SADiLaR is a research infrastructure established by the Department of Science and Innovation (DSI) of the South African government as part of the South African Research Infrastructure Roadmap (SARIR).
These resources are available as open source on the SADiLaR repository website.
Definitions of core technologies
Morphological analyser: a tool that breaks a word into its meaningful parts, aiming to find the smallest units of meaning in a language (morphemes).
Part-of-speech (POS) tagger: a software tool that labels each word with one of several categories, for example noun or verb, to identify the word's function in a given language.
Lemmatiser: a tool that groups together the different inflectional forms of the same word.
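As a rough illustration of the three definitions above, the Python sketch below uses a tiny hand-written lookup table for three isiZulu words. It is not how the tools described in this article work (those rely on machine learning over collected text); the table entries, the function names, and the choice of the stem as the lemma are all assumptions made only for this example.

```python
# Toy illustration only: a hand-written lookup table, not the SADiLaR tools.
# The entries, tags, and choice of lemma are assumptions made for this sketch.
TOY_LEXICON = {
    "umuntu": {"morphemes": ["umu", "ntu"], "pos": "NOUN", "lemma": "ntu"},
    "abantu": {"morphemes": ["aba", "ntu"], "pos": "NOUN", "lemma": "ntu"},
    "ngiyabonga": {
        "morphemes": ["ngi", "ya", "bong", "a"],
        "pos": "VERB",
        "lemma": "bonga",
    },
}


def analyse_morphology(word):
    """Morphological analyser: return the word's smallest meaningful parts."""
    return list(TOY_LEXICON[word]["morphemes"])


def tag_pos(word):
    """POS tagger: return the category label (noun, verb, ...) for the word."""
    return TOY_LEXICON[word]["pos"]


def lemmatise(word):
    """Lemmatiser: return the shared base form for the word."""
    return TOY_LEXICON[word]["lemma"]


def group_by_lemma(words):
    """Group inflected forms that share the same lemma."""
    groups = {}
    for word in words:
        groups.setdefault(lemmatise(word), []).append(word)
    return groups


if __name__ == "__main__":
    for word in ["umuntu", "abantu", "ngiyabonga"]:
        print(word, analyse_morphology(word), tag_pos(word), lemmatise(word))
    print(group_by_lemma(["umuntu", "abantu", "ngiyabonga"]))
```

Here the singular and plural forms umuntu and abantu end up in the same group because they share a lemma, which is the grouping behaviour the lemmatiser definition describes. A real tool would have to predict these analyses for words it has never seen, rather than look them up.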