EAMT Conference 2014: Welocalize Language Tools Team Overview

The Welocalize Language Tools team attended and presented at the 2014 EAMT Conference in Croatia. In this blog post, Laura Casanellas, Welocalize Language Tools Program Manager and presenter at EAMT, shares her highlights, along with insights from the Welocalize colleagues who took part in the conference.

Just as in Trento in 2012 and Nice in 2013, the Welocalize Language Tools Team participated in the Annual Conference of the European Association for Machine Translation (EAMT). The conference took place June 16 – 18 in the city of Dubrovnik, Croatia, and four members of the Welocalize Language Tools team attended:

Olga Beregovaya, VP of Language Tools, and Dave Landan, Pre-sales Support Engineer, presented a project poster on “Source Content Analysis and Training Data Selection Impact on an MT-driven Program Design with a Leading LSP.”

Lena Marg, Training Manager, and I delivered our presentation “Assumptions, Expectations and Outliers in Post-Editing.”

We take the EAMT conference and associated conferences (International, Asian and American) seriously, as most of the important developments that are currently taking place around machine translation (MT) are presented and followed up in those forums.

As a global language services provider (LSP), Welocalize adds value to the EAMT conference by being able to share real-life MT production experiences, demonstrated through thorough analysis of large and varied quantities of actual data. We are privileged in that we work in a real scenario where some of the new technologies around natural language processing (NLP) and MT can be tested in depth.

In their poster, “Source Content Analysis + Training Data Selection Impact,” Olga and Dave stressed the importance of preparing the training corpus in advance and matching it to the specific requirements of the content that will subsequently be translated. To give an example, many translation memories come from different projects created at different points in time. They may contain inconsistencies, their sentences may simply be too long, or they may contain a lot of “noisy” data. They need to be cleaned up before they can be used as engine training assets. Going deeper into the possibilities of automatic data selection and matching it with the source content, Olga and Dave spoke about our suite of analytic applications. The suite is divided between proprietary tools, such as Candidate Scorer, Perplexity Evaluator and StyleScorer, and tools that are being developed as part of an industry partnership with CNGL, namely Source Content Profiler and TMT Prime.
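To make the clean-up step more concrete, here is a minimal Python sketch of the kind of filtering a translation memory might go through before engine training. It is not the Welocalize tooling described in the poster: the function name `clean_translation_memory`, the thresholds `max_tokens` and `max_length_ratio`, and the specific filters (empty segments, overlong segments, mismatched lengths, duplicates) are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[str, str]  # (source sentence, target sentence)


def clean_translation_memory(
    segments: Iterable[Segment],
    max_tokens: int = 80,           # hypothetical cut-off for "too long"
    max_length_ratio: float = 3.0,  # hypothetical cut-off for mismatched pairs
) -> List[Segment]:
    """Drop segment pairs that are likely to add noise to MT training data."""
    seen = set()
    kept: List[Segment] = []
    for source, target in segments:
        # Normalize whitespace so trivially different copies compare equal.
        source = " ".join(source.split())
        target = " ".join(target.split())

        # Filter 1: one side is empty.
        if not source or not target:
            continue

        src_len = len(source.split())
        tgt_len = len(target.split())

        # Filter 2: either side is too long.
        if src_len > max_tokens or tgt_len > max_tokens:
            continue

        # Filter 3: lengths differ so much that the pair is probably misaligned.
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_length_ratio:
            continue

        # Filter 4: case-insensitive duplicate of a pair we already kept.
        key = (source.lower(), target.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((source, target))
    return kept
```

In practice the thresholds would depend on the language pair and content type, and a production pipeline would likely add further checks (tags, encoding issues, terminology consistency) that this sketch leaves out.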

Olga Beregovaya’s impressions about the EAMT Conference and Welocalize’s role within it are very positive. “Overall, the great thing about the conference was the applicability of the new generation of academic research in real-life production scenarios. Many of the academic talks were relevant for the work on MT adaptation and customization that we do at Welocalize. Today, we need to cover more and more domains and content types, so domain and sub-domain adaptation is becoming the key area of our R&D. This means that we benefit greatly from academic and field research around data acquisition for training SMT systems and the relatively new developments around using terminology databases to augment the SMT training data. Not all of our clients come to us with their legacy translation memories, and while there are some public corpora available, we still need to rely on acquiring and aligning data ourselves.”

Dave found two of the presentations he attended particularly interesting, as both focused on common pain points within the industry. “The challenges of using MT with morphologically rich languages are well-known, and we were happy to see interesting research in possible ways to overcome those challenges. We also found a talk on gathering training data from the web very interesting. The presenters discussed using general and specific data to train separate engines which could be weighted and combined to give improved results in cases of sparse in-domain training data. Indeed there were several innovations from academia that we are looking forward to incorporating into our bleeding-edge MT tools and processes.”
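The idea of weighting and combining a general model with an in-domain one can be illustrated with a toy example. The Python sketch below is not the presenters’ method and not a full MT engine: it swaps in simple add-one-smoothed unigram language models and combines them by linear interpolation, picking the weight that gives the lowest perplexity on a small held-out sample. All function names and the grid-search step count are assumptions, and the vocabulary is assumed to cover every token in the held-out sample.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def unigram_model(sentences: List[str], vocab: Set[str]) -> Dict[str, float]:
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(
    in_domain: Dict[str, float], general: Dict[str, float], weight: float
) -> Dict[str, float]:
    """Mix two models: weight * in-domain + (1 - weight) * general."""
    return {w: weight * in_domain[w] + (1 - weight) * general[w] for w in in_domain}


def perplexity(model: Dict[str, float], sentences: List[str]) -> float:
    """Per-token perplexity; every token must be in the model's vocabulary."""
    tokens = [t for s in sentences for t in s.lower().split()]
    if not tokens:
        raise ValueError("need at least one token to compute perplexity")
    log_prob = sum(math.log(model[t]) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def best_weight(
    in_domain: Dict[str, float],
    general: Dict[str, float],
    dev_sentences: List[str],
    steps: int = 10,  # hypothetical grid resolution
) -> float:
    """Grid-search the in-domain weight that minimizes held-out perplexity."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda w: perplexity(interpolate(in_domain, general, w), dev_sentences),
    )
```

With sparse in-domain data, the chosen weight usually ends up somewhere between the two extremes: the in-domain model supplies the right vocabulary, while the general model fills in everything the small corpus never saw.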

In our presentation, Lena and I focused on different challenges in a real MT production scenario: the necessity of forecasting future post-editing effort, with an emphasis on post-editors’ behavior, and their personal and cultural circumstances, as an important variable of the MT + PE equation. As part of a large LSP, we have been able to gather a large amount of data and focus on the quality of a number of MT outputs related to different languages and content types. Our presentation elaborated on our findings around correlations between different types of evaluation methods (automatic scoring, human evaluations and productivity tests). We obtained interesting findings around the adequacy scores in our human evaluation tests and the productivity gains observed in post-editing. We will continue gathering data and investigating this area.
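For readers who want to try this kind of comparison on their own data, a correlation between two evaluation methods can be computed in a few lines of Python. The sketch below assumes one common choice, the Pearson correlation coefficient, which may or may not match the statistics used in our presentation. The numbers in the usage example are invented placeholders, not Welocalize results.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


# Invented placeholder values, one entry per language/content-type sample:
adequacy_scores = [3.1, 3.8, 4.2, 2.9, 4.5]          # human evaluation, 1-5 scale
productivity_gains = [0.10, 0.25, 0.30, 0.05, 0.40]  # relative post-editing speed-up

print(f"adequacy vs. productivity: r = {pearson(adequacy_scores, productivity_gains):.2f}")
```

A value close to 1 would suggest that the two methods rank MT outputs similarly, while a value near 0 would suggest they capture different things. With only a handful of samples, any such figure should be read with caution.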

Another topic that was touched upon during the conference was the area of quality. Lena and Olga both shared their perspectives:

“After closely following the QTLaunchpad project for several months, it was particularly interesting to see and discuss results from their error annotation exercises using MQM earlier in the year. Welocalize took part in these exercises by providing data and annotator resources. The findings of this exercise are contributing to further advances both in quality estimation and quality evaluation, fine-tuning metrics further for better inter-annotator agreement, etc. These discussions also provided some immediate take-aways for our approach to evaluation.” – Lena Marg

“The other area of high relevance to us is Quality Evaluation. Again, it is great to see so many research projects dealing with predicting MT quality and utility. While it still may be challenging to deploy such quality estimation systems in production, as various CAT tools and TMS systems have their own constraints around metadata-driven workflows, it is very encouraging to know that this research is available.” – Olga Beregovaya

“A general theme of the EAMT Conference was the question of how to increase cooperation between the translation and the MT research community. In this context, Jost Zetzsche’s keynote speech was important in pointing out that translators should take an active interest in providing constructive feedback on MT and on how they work, to ensure new advances in MT developments are truly benefiting them. And yet, with the presence of some interested freelance translators, translation studies researchers and a handful of LSPs presenting on MT, it would seem that progress has already been made in bringing the two sides together.” – Lena Marg

The EAMT Conference was a great opportunity to meet professionals, academics and researchers who work in the field of MT. The Welocalize team members were able to exchange ideas around the current pressing challenges surrounding MT technology and we still had time to admire the beautiful surroundings of historical Dubrovnik.

Laura Casanellas is program manager on the Welocalize Language Tools team.