The annual Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) and the Third Conference on Machine Translation (WMT18) were held in Brussels from 31 October to 4 November. WMT18 is one of the most important international events for research and development in machine translation (MT). Participants from both industry and academia come together to present and debate the latest advances in the field. This year, AT Language Solutions successfully took part in the WMT18 shared task on parallel corpus filtering.
Data-based machine translation
The task addressed the problem of cleaning noisy parallel corpora. This is a common scenario in the development of current data-based machine translation systems, which require huge amounts of training data to function properly. Training data can be obtained, for instance, through web crawling, but this procedure tends to produce noisy data: crawled parallel corpora can contain sentences in a third language, mismatched phrases, incorrect or incomplete translations, and so on. At WMT18, participants in the shared task on parallel corpus filtering were asked to design a method for selecting valid translation pairs from an extremely noisy German-English corpus obtained through web crawling, and to submit the resulting subset of clean sentence pairs. The proposals were assessed by measuring the quality of machine translation systems trained on the selected data.
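To make the kinds of noise described above more concrete, here is a minimal Python sketch of a rule-based pre-filter. It is only an illustration and is not the method we submitted, nor the official task baseline. The function names (`looks_valid`, `filter_pairs`) and all thresholds are hypothetical choices made for this example. It catches very short fragments, untranslated copies and pairs with implausible length differences, but not sentences in a third language or fluent but wrong translations.

```python
from typing import Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (German sentence, English sentence)


def looks_valid(src: str, tgt: str,
                min_tokens: int = 3,
                max_tokens: int = 80,
                max_ratio: float = 2.0) -> bool:
    """Return True if the pair passes a few simple sanity checks.

    All thresholds are illustrative assumptions, not tuned values.
    """
    src_tokens = src.split()
    tgt_tokens = tgt.split()

    # Reject empty, very short or very long segments on either side.
    for tokens in (src_tokens, tgt_tokens):
        if not (min_tokens <= len(tokens) <= max_tokens):
            return False

    # Reject untranslated copies (same text on both sides).
    if src.strip().lower() == tgt.strip().lower():
        return False

    # Reject pairs whose lengths differ too much to be translations.
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    if longer / shorter > max_ratio:
        return False

    return True


def filter_pairs(pairs: Iterable[SentencePair]) -> List[SentencePair]:
    """Keep only the sentence pairs that pass every check."""
    return [(src, tgt) for src, tgt in pairs if looks_valid(src, tgt)]


noisy_corpus = [
    ("Das ist ein kleiner Test .", "This is a small test ."),
    ("Startseite", "Home"),
    ("Copyright 2018 Example GmbH", "Copyright 2018 Example GmbH"),
    ("Wir liefern weltweit .",
     "We ship worldwide and offer free returns on all orders placed "
     "before the end of the month ."),
]

clean_corpus = filter_pairs(noisy_corpus)
print(clean_corpus)
```

In this toy corpus only the first pair survives. Rules of this kind are cheap, so they are typically useful as a first pass before a more expensive scoring step.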
Participation of AT Language Solutions
In our submission we framed the problem as a machine learning task, in which the aim is to estimate to what extent two sentences in different languages match and can therefore be considered translations of each other. The article presented at the conference, which contains all the technical details, is publicly available here. The presentation was given in a group session in which the participants presented their approaches. Ours can be seen here. In summary, the score we obtained placed us in the top third of all participants, just a few points away from the top-scoring systems. The detailed results of the task can be viewed here.