
CENTRE FOR TEXT TECHNOLOGY (CTexT®)
Postgraduate study
Successful completion of the four-year BA Language Technology degree enables graduates to continue with postgraduate studies. Students with a BA Languages degree and/or Computer Studies may also be considered for the master’s degree in Language Technology. Contact us for more information.
Projects in Human Language Technology, Computational Linguistics and Natural Language Engineering
All the projects below form part of larger projects at the Centre for Text Technology (CTexT®). You will therefore be supervised by a lecturer from Computer Engineering, as well as by a researcher/engineer from CTexT.
Other topics that are not mentioned here can be explored. Feel free to contact us should you have any suggestions or questions.
Because these projects form part of larger projects, there are good opportunities for further postgraduate study. CTexT also provides several bursaries for M.Ing. and PhD students.
Several full-time or part-time assistantships as programmers/system developers are available at CTexT. Contact Martin Puttkammer (martin.puttkammer@nwu.ac.za) in this regard.
- Machine Translation
Machine translation deals with the automatic translation of texts from one language (e.g. English) into another (e.g. Dutch). An example is Google Translate, which translates between many different language pairs.
Since the 1950s, machine translation has been one of the most interesting challenges in the field of computer engineering, and it remains one of the most problematic fields in human language processing. The complexity of language and the diversity of languages make it an exceptionally challenging task for a computer. For instance, how should a computer know to translate "crooked arms dealer" as "skelm wapenhandelaar" and not as "krom arms verkoopsmanne"?
To resolve these kinds of problems, one of three approaches is usually followed:
1. Rule-based approaches;
2. Data-driven approaches (e.g. by making use of machine learning); and
3. Hybrid approaches (combinations of the two approaches above).
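The difference between translating word by word and translating with context can be sketched with the "crooked arms dealer" example. The Python sketch below is only an illustration: the tiny lexicon and phrase table contain nothing beyond the two translations quoted in the text, and the function names are hypothetical. A real system, whether rule-based, data-driven or hybrid, would need far richer resources.

```python
# Toy word lexicon: the context-free choices that lead to the bad
# translation quoted in the text.
WORD_LEXICON = {
    "crooked": "krom",
    "arms": "arms",
    "dealer": "verkoopsmanne",
}

# Hypothetical phrase table: multi-word entries carry the context that
# single words lack. In a data-driven system such entries would be
# learned from parallel text rather than written by hand.
PHRASE_TABLE = {
    ("crooked", "arms", "dealer"): "skelm wapenhandelaar",
}


def translate_word_by_word(sentence):
    """Translate every word in isolation, ignoring its neighbours."""
    words = sentence.lower().split()
    return " ".join(WORD_LEXICON.get(word, word) for word in words)


def translate_with_phrases(sentence):
    """Prefer the longest phrase-table match; fall back to single words."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest remaining span first, then shorter ones.
        for length in range(len(words) - i, 0, -1):
            chunk = tuple(words[i:i + length])
            if chunk in PHRASE_TABLE:
                output.append(PHRASE_TABLE[chunk])
                i += length
                break
        else:
            # No phrase matched at this position: translate one word.
            output.append(WORD_LEXICON.get(words[i], words[i]))
            i += 1
    return " ".join(output)
```

Calling `translate_word_by_word("crooked arms dealer")` reproduces the poor translation from the text, while `translate_with_phrases` returns the intended one because the whole phrase is looked up as a unit.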
Despite great international interest in machine translation, almost no research on machine translation has been done in the South African context. The South African Government, however, recently decided to invest a great deal of money in the development of machine translation systems for South African languages. This project requires an exploratory study in the field of machine translation for South African languages, as well as the development of a prototype machine translation system for one South African language pair. Read more about the Autshumato project.
- Copy protection on flash drive
There are several advantages associated with the use of flash drives for the distribution of software. The greatest advantage is that the software can be executed directly from the flash drive, without the need to first install it on the computer. It therefore increases the mobility of the user, in that the user can make use of software on several computers. The problem, however, is that it is very easy to copy software from a flash drive. The challenge in this project is to find a solution that will prevent the unauthorised copying of software from a flash drive.
- Plagiarism protector
Plagiarism is a problem that sometimes occurs among students. This project entails the development of a system that can highlight plagiarism in an assignment by using plagiarism detection metrics and Internet searches, and then comparing assignments with one another.
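One simple metric for comparing assignments with one another is the overlap of word n-grams. The Python sketch below is a minimal illustration under that assumption; the project description does not prescribe a particular metric, and the function names and the choice of three-word n-grams are hypothetical.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of lower-cased word n-grams in a text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(text_a, text_b, n=3):
    """Share of n-grams that two texts have in common (0.0 to 1.0)."""
    ngrams_a = word_ngrams(text_a, n)
    ngrams_b = word_ngrams(text_b, n)
    if not ngrams_a or not ngrams_b:
        # Texts shorter than n words have no n-grams to compare.
        return 0.0
    return len(ngrams_a & ngrams_b) / len(ngrams_a | ngrams_b)
```

Pairs of assignments with a high similarity value could then be flagged for a lecturer to inspect. The threshold would have to be tuned on real assignments.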
- Email classifier for Afrikaans
Currently there is a need for an email classifier for Afrikaans. The purpose of such a system is to categorise email for effective distribution within a large organisation, as well as to block spam.
- Question-answering system in a limited field
In question-answering systems, the end user gains access to information by asking questions in a human language (like Afrikaans or English). For example, you would rather ask a system “Who was Moses’ mother?” than type in “Moses Mother” (which returns a lot of irrelevant information). In this project you will develop such a system within a limited, chosen domain.
- Recognition of spoken South African languages
When a computer system processes speech, it is often necessary to determine the language of the speech. Such spoken language recognition for the South African languages poses a number of interesting challenges related to the unique sounds that occur in our languages (e.g. clicks), as well as to the hierarchies of family relationships between these languages. During this project we will model spoken language (currently being recorded by CTexT) with techniques like hidden Markov models and neural networks, and develop new techniques in order to build an accurate language recognition system. The student who wants to tackle this project must be able to program comfortably in C and should be familiar with the principles of signal processing.
- Fault analysis for pronunciation dictionaries
Pronunciation dictionaries are an important component in systems for speech recognition and speech synthesis, and there are a number of standard dictionaries that are widely used (e.g. "CMUdict" for American English and "BEEP" for British English). An automatic analysis of the content of these dictionaries, however, indicates that they contain somewhat inconsistent pronunciations, and possibly also a number of errors. For this project we will use machine-learning algorithms to analyse such dictionaries carefully, in order to improve them and to determine guidelines for the compilation of future pronunciation dictionaries.
- Retrieval of written word forms from speech recognition
The aim of speech recognition is usually to recognise a collection of known words. Sometimes, however, the system must ‘guess’ how to write an unknown word, for example when an unknown proper noun is spoken to the system. The best approach to this problem is to recognise the phonemes in such an utterance and then to estimate a reasonable spelling from the recognised phoneme string. Machine-learning techniques (which are also used when building pronunciation dictionaries) are suited to the latter task. We will make use of an existing speech recognition system and apply appropriate machine-learning algorithms for this task.
- Automatic topic classification of Afrikaans documents
In the past decade several techniques have been developed to automatically group documents according to the topics they deal with. These techniques usually make use of the counts of different words in the documents (the so-called ‘bag-of-words’ model) and have been most successful for English. During the current project, similar techniques will be applied to Afrikaans documents, and we will investigate whether pre-processing (e.g. part-of-speech tagging or morphological analysis) can be used to achieve better grouping.
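The bag-of-words idea can be sketched in a few lines of Python. The example below is only an illustration under simple assumptions: each document is reduced to word counts, and a new document is assigned to the topic whose example text it resembles most by cosine similarity. The function names and the idea of one example text per topic are hypothetical and are not taken from the project description.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Reduce a text to word counts, discarding word order."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two word-count vectors."""
    shared = bag_a.keys() & bag_b.keys()
    dot = sum(bag_a[word] * bag_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_topic(document, topic_examples):
    """Return the topic label whose example text is most similar."""
    doc_bag = bag_of_words(document)
    return max(
        topic_examples,
        key=lambda topic: cosine_similarity(
            doc_bag, bag_of_words(topic_examples[topic])
        ),
    )
```

Pre-processing such as morphological analysis would slot in inside `bag_of_words`, for example by counting stems instead of surface word forms.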
- Unsupervised learning of Afrikaans Grammar
Unsupervised learning is an area of machine learning that is based on assumptions with regard to the underlying structures in a data set, without direct access to these structures. This project entails the determination of the morphological structure of words or the syntactic structure of sentences, without any prior knowledge of Afrikaans grammar. You must therefore teach a computer how Afrikaans works, without you or the computer necessarily knowing anything about Afrikaans!
The results of the first component will be incorporated in a program that the end user will be able to use to interpret results. As the end user makes corrections, the algorithm must automatically, in the background, improve the initial language model. In this way unsupervised and supervised learning are combined in one system.
- Combination part-of-speech tagger for Afrikaans
CTexT currently has three different part-of-speech taggers for Afrikaans. Each of these taggers has its own characteristics that influence its accuracy. One tagger, for example, may be better than the other two at tagging verbs, while possibly being the worst at tagging adjectives. The purpose of this project is to develop a combined part-of-speech tagger that selects the best tagger for the task at hand each time. This project is Linux-based.
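One possible way to select the best tagger each time is to trust, for every token, the tagger that has proved most reliable for the tag it proposes. The Python sketch below assumes that such reliability figures have been estimated beforehand on held-out data. The three lookup taggers, their outputs and the reliability values are hypothetical stand-ins and do not describe the actual CTexT taggers.

```python
def combine_tags(proposals, reliability):
    """Pick the tag proposed by the most reliable tagger for that tag.

    proposals:   {tagger_name: proposed_tag} for a single token
    reliability: {(tagger_name, tag): estimated precision}, assumed to
                 have been measured on held-out data
    """
    best = max(
        proposals,
        key=lambda name: reliability.get((name, proposals[name]), 0.0),
    )
    return proposals[best]


def tag_sentence(tokens, taggers, reliability):
    """Run every tagger on the sentence and combine their output per token."""
    outputs = {name: tagger(tokens) for name, tagger in taggers.items()}
    combined = []
    for i, token in enumerate(tokens):
        proposals = {name: tags[i] for name, tags in outputs.items()}
        combined.append((token, combine_tags(proposals, reliability)))
    return combined


def make_lookup_tagger(lexicon):
    """Build a toy tagger from a word-to-tag dictionary (illustration only)."""
    return lambda tokens: [lexicon.get(token, "NOUN") for token in tokens]


# Hypothetical taggers that disagree on some words.
TAGGERS = {
    "a": make_lookup_tagger({"die": "DET", "hond": "NOUN", "blaf": "NOUN"}),
    "b": make_lookup_tagger({"die": "DET", "hond": "NOUN", "blaf": "VERB"}),
    "c": make_lookup_tagger({"die": "DET", "hond": "ADJ", "blaf": "VERB"}),
}

# Hypothetical per-tagger, per-tag precision values.
RELIABILITY = {
    ("a", "DET"): 0.99, ("a", "NOUN"): 0.90,
    ("b", "DET"): 0.98, ("b", "NOUN"): 0.85, ("b", "VERB"): 0.95,
    ("c", "DET"): 0.97, ("c", "ADJ"): 0.50, ("c", "VERB"): 0.80,
}
```

With these toy values, `tag_sentence(["die", "hond", "blaf"], TAGGERS, RELIABILITY)` follows tagger "a" for the noun and tagger "b" for the verb. Majority voting or a trained classifier over the taggers' outputs would be alternative combination strategies.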
- Voice-activated marker
At the School of Languages a system has been developed with which lecturers can mark students’ work with the aid of a list of 80 standardised error messages. The written work is read into the system in txt-format and is then marked on the computer. When a mistake is found, the faulty text is highlighted with the mouse and the lecturer clicks on the specific error message. The faulty section is then changed to another colour. When the student gets back his/her work and moves his/her mouse over the coloured section, a mouse-over message appears providing feedback.
The problem is that it takes a long time to mark a piece of work properly, partially due to the size of the error list (it consists of 80 categories). The error list is too long to be displayed properly on one screen, which in turn means that a great deal of time is wasted clicking through the entire tree view to get to the specific error. Initially drop-down menus were considered, but these would not entirely solve the problem either.
A possible solution is to activate the error messages by means of speech recognition. The lecturer selects the faulty text and speaks into a microphone (e.g. something like “Error Typing”), and the error is marked by the computer. This process effectively eliminates between 3 and 8 mouse clicks, which will save a considerable amount of time.
- Research questions:
Is it possible for 80 text units (of between 2 and 5 words each) to be recognised accurately and quickly?
Is it possible to let the computer execute a command when the unit of text is recognised?
Is it possible for the lecturer to record a message that can be played back to the student along with the comments?
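The second research question, executing a command once a unit of text has been recognised, can be sketched independently of the speech recogniser itself. The Python sketch below assumes that the recogniser returns plain text, and it uses approximate string matching from the standard library to tolerate small recognition errors. Only the phrase "Error Typing" comes from the description above. The other error labels, the handler functions and the similarity cutoff are hypothetical.

```python
import difflib

# Hypothetical excerpt of the error list; the real list has 80 entries.
ERROR_MESSAGES = [
    "error typing",
    "error spelling",
    "error word order",
    "error punctuation",
]


def match_error_message(recognised_text, messages=ERROR_MESSAGES, cutoff=0.6):
    """Map recogniser output to the closest known error message, if any."""
    matches = difflib.get_close_matches(
        recognised_text.lower().strip(), messages, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


def apply_voice_command(recognised_text, handlers):
    """Run the handler registered for the recognised error message.

    handlers: {error_message: callable}; returns the handler's result,
    or None when nothing matches well enough.
    """
    label = match_error_message(recognised_text, list(handlers))
    if label is None:
        return None
    return handlers[label]()
```

In the marking tool each handler would mark the currently selected text with the corresponding error message. Returning None when no label matches lets the interface ask the lecturer to repeat the command instead of marking the wrong error.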


