Cihat Eryiğit, PhD Thesis, 2017.

Text to sign language machine translation system for Turkish. İTÜ Graduate School of Science, Engineering and Technology, Istanbul.

Summary

Computer processing of human languages has been a research topic of interest ever since the invention of computers. Sign languages are the native languages of many prelingually deaf people. As is the case for spoken languages, the sign languages used in different countries and communities differ substantially from each other (and also from the spoken languages used in those countries) at the lexical, morphological and syntactic levels, and systems tailored to a specific sign language are usually not directly applicable to another one. Although sign languages are real human languages, research on their computerized processing remains rather limited compared to that on spoken languages. A very important reason for this is the lack of data resources usable in computerized systems for most of the under-studied sign languages. The lack of a conventional written representation for sign languages makes the collection of such resources even harder: since sign languages are generally not written, no written corpus is available to serve as data for computational studies.

The difficulties that hearing-impaired individuals face in communicating smoothly in writing or speech create obstacles to their access to information, job opportunities and education. Hearing-impaired individuals use sign languages, which are visual languages based on movement, as their natural languages. Turkish Sign Language (Türk İşaret Dili, TİD) is an officially recognized natural language used by deaf individuals living in Turkey. As is the case for many lesser-studied natural languages, TİD poses unique challenges for natural language processing. Advances in computer technology make it possible to translate automatically between spoken and sign languages, at least to a certain level. Each country has its own native sign language (e.g., TİD in Turkey, ASL in the U.S.A.), and these sign languages have linguistic properties that differ from those of the spoken language(s) of the same countries.

Studies have been conducted to build translation systems that render official documents and educational materials from written text into sign language for deaf people. Signing avatars are being actively developed for Tunisian, German, English, Dutch, French and American sign languages. However, sign languages differ from each other as substantially as spoken languages do, so a machine translation system developed for another language cannot be used directly for Turkish-TİD translation. In the few recent studies on translation from Turkish to TİD, only selected words were translated into TİD signs (shown as pictures/photos, videos or avatar animations). This approach leads to incorrect translations at the sentence level. Turkish and TİD are different languages and, as with all translation between spoken and sign languages, translating between them is a machine translation problem that has to be studied at the syntactic and semantic levels, with all the challenges of machine translation. For the same reason, the systems developed so far for Turkish have been little more than limited dictionary-like systems that translate words to signs.

This thesis aims to develop a machine translation infrastructure for translating written Turkish materials into Turkish Sign Language. The work introduced in this thesis is the first academic study conducted on this topic. The output of the translation system is intended to be a machine-readable representation of TİD that can be fed as input to animation systems (e.g., an avatar or a humanoid robot).
With this aim, the thesis investigates the grammatical properties of Turkish Sign Language relevant to Turkish-TİD machine translation, the natural language processing methods necessary for this translation, previous machine translation studies for sign languages, electronic sign language dictionaries, and manual annotation platforms for sign languages. Since the Turkish-TİD language pair lacks bilingual data resources, we are compelled to choose rule-based machine translation (RBMT) for our initial translation system. As the number of bilingual text corpora grows, it will become possible to build example-based, statistical or hybrid machine translation systems. The representation scheme proposed in this thesis aims to remove the obstacles to this process and pave the way for rapid resource creation.

Turkish, as a morphologically rich language with flexible word order, presents challenges for natural language processing that differ from those of widely studied languages such as English. Therefore, one cannot directly apply methods and findings from other languages to Turkish. In this respect, the introduced structure is considered valuable for pairs of other agglutinative spoken languages (e.g., Finnish, Hungarian and Korean) and sign languages.

This system will increase natural language interaction between students and teachers and contribute to studies on computer-assisted cooperation. In line with the policies of the Ministry of National Education (MEB), communication in this setting will be from teacher to student. In other words, the content that the curriculum or teacher aims to deliver will be transformed into a form that a deaf student can understand more efficiently, thus enabling the student to adapt to the mixed-education classroom setting more quickly.

We use a transfer-based machine translation approach, in which the transfer model is the stage consisting of the translation rules from Turkish to TİD.
The input to the translation rules component is the analysis of the source language (produced by the Turkish NLP pipeline), and the output, which feeds the animation layer, is a generated machine-readable representation of the target language (TİD). In our transfer model, we aim to use both syntactic and semantic transfer. To this end, the dependency formalism is chosen for the syntactic representation of both Turkish and TİD. When no equivalent sign entry with the same lexical sense as a Turkish input can be found, we aim to map senses to concepts for semantic transfer, using a model adapted from the "Lexicon Model for Ontologies" (LEMON). Although LEMON supports some features for agglutinative languages, it seems hard to represent all possible lexical forms of a Turkish word because of the language's highly agglutinative, complex nature, which complicates the creation of morphological generation rules (handled mostly with finite-state transducers in the literature). As a straightforward solution, we propose using the Turkish NLP pipeline to reach lexical entries from the provided lexical forms.

This thesis introduces a machine-readable knowledge representation of Turkish Sign Language for the first time in the literature. One of the biggest obstacles facing statistical machine translation systems for sign languages is the collection of bilingual text corpora in machine-readable form, which are a crucial component of current state-of-the-art approaches. The representation scheme proposed in this thesis also aims to remove the obstacles to this process and pave the way for rapid resource creation. The introduced machine-readable representation scheme of TİD is linked to the ELAN annotation tool in order to produce such corpora, as well as the input the avatar system needs to generate natural-looking continuous sign sequences.
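
The transfer step described above (dependency analysis of Turkish in, machine-readable gloss sequence out, with a concept-level fallback when no direct sign entry exists) can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Token` structure, the UD-style relation labels, the toy lexicon and concept tables, and the punctuation-dropping rule are all hypothetical and were invented for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    """One node of a (hypothetical) dependency analysis of a Turkish sentence."""
    idx: int      # 1-based position in the source sentence
    lemma: str    # lemma as produced by a morphological analyzer
    head: int     # index of the governing token, 0 for the root
    deprel: str   # dependency relation label (UD-style labels assumed here)


# Toy resources invented for this sketch; a real system would query
# a sign dictionary and an ontology instead.
SIGN_LEXICON: Dict[str, str] = {"çocuk": "ÇOCUK", "okul": "OKUL", "gitmek": "GİTMEK"}
LEMMA_TO_CONCEPT: Dict[str, str] = {"okul": "concept:school", "mektep": "concept:school"}
CONCEPT_TO_SIGN: Dict[str, str] = {"concept:school": "OKUL"}


def lexical_transfer(lemma: str) -> Optional[str]:
    """Try a direct sign entry first, then fall back to a shared concept."""
    if lemma in SIGN_LEXICON:
        return SIGN_LEXICON[lemma]
    concept = LEMMA_TO_CONCEPT.get(lemma)
    if concept is None:
        return None
    return CONCEPT_TO_SIGN.get(concept)


def translate(tokens: List[Token]) -> List[dict]:
    """Map each analyzed token to a gloss record, keeping the dependency links."""
    output = []
    for tok in tokens:
        if tok.deprel == "punct":
            continue  # illustrative rule: punctuation has no sign counterpart
        gloss = lexical_transfer(tok.lemma)
        if gloss is None:
            gloss = f"?{tok.lemma}"  # flag missing entries for dictionary work
        output.append({"idx": tok.idx, "gloss": gloss,
                       "head": tok.head, "deprel": tok.deprel})
    return output


# "Çocuk mektebe gitti." with an assumed analysis.
sentence = [
    Token(1, "çocuk", 3, "nsubj"),
    Token(2, "mektep", 3, "obl"),
    Token(3, "gitmek", 0, "root"),
    Token(4, ".", 3, "punct"),
]
print([record["gloss"] for record in translate(sentence)])
```

In this toy example "mektep" has no direct sign entry, so it reaches the gloss `OKUL` through the shared concept, which mirrors the semantic-transfer fallback described in the text. Keeping `head` and `deprel` on each output record is one possible way to retain a dependency structure on the target side.
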
This study also provides an online dictionary platform that houses the unique glosses of the signs, their possible variations, and the layers required to feed the ELAN tool with adequate depth of information. The developed sign language infrastructure, as well as the sign database and corpus to be generated as part of the system, will be vital for researchers working on TİD.

The contributions of the thesis are as follows:

- A machine-readable knowledge representation was proposed for Turkish Sign Language.
- A parallel treebank study based on the dependency formalism was conducted for a spoken language and sign language pair.
- A Turkish Sign Language electronic dictionary infrastructure, which makes it possible to use the annotations in machine translation studies, was developed.
- A TİD-specific plugin for the ELAN manual annotation platform (which is widely used in the linguistic annotation of sign language discourse) was developed so that it can produce machine-readable annotations for use in machine translation studies.
- A prototype for developing Turkish-TİD parallel data sets for machine translation studies (using the proposed annotation infrastructures) was introduced.
- An ontology infrastructure for Turkish and TİD was developed.
- A rule-based machine translation system from written Turkish to TİD, based on syntactic and partially semantic transfer, was designed, and basic translation rules were proposed.

The proposed machine translation infrastructure was tested on a parallel text of 306 Turkish-TİD parallel sentences (selected from primary school textbooks and prepared within the scope of TÜBİTAK project 114E263). The transfer success rates were shown to fall within acceptable performance levels. The translation system architecture is designed to be expandable: new rules obtained from linguistic research on TİD can be easily incorporated into the system. In addition, new sign entries added to the TİD dictionary and new semantic relations created in the Turkish word network will enhance the performance of the translation system.