Workshop on Automatic and Manual Metrics for Operational Translation Evaluation
London-based Lena Marg is a Training Manager on the Language Tools Team at Welocalize. She is a regular speaker at key localization events on the subject of language tools and machine translation (MT). Lena shares her insights and takeaways from the MTE 2014 conference, the first of three speaking engagements in less than 30 days.
May and June have been busy months for me with exciting opportunities to present and participate in three different localization and machine translation conferences all over Europe. I have already spoken at the MTE 2014 workshop held at the Harpa Conference Centre in Reykjavik and the TAUS Quality Evaluation Summit hosted in Dublin. My third presentation with Laura Casanellas, Program Manager of Language Tools at Welocalize, takes place at the 17th Annual Conference of the European Association for Machine Translation (EAMT), June 16–18 in Dubrovnik, Croatia.
The MTE 2014 workshop, like the TAUS summit, had quality evaluation as its overarching theme, with topics ranging from the quality of raw machine translation to human translation, post-edited machine translation (PEMT) and crowdsourcing. The common discussions focused on defining and measuring quality, and on the new trends in both.
The Automatic and Manual Metrics for Operational Translation Evaluation workshop brought together representatives from academia, industry and government institutions to discuss and assess metrics for manual quality evaluations of MT. The intent was to compare them with well-established metrics for automatic evaluation, as well as reference-less metrics for quality prediction.
The workshop was hosted in the context of the Language Resources and Evaluation Conference (LREC), organized by the European Language Resources Association (ELRA), under the auspices of UNESCO. Irina Bokova, UNESCO’s Director-General, opened the conference. As it turned out, Vigdís Finnbogadóttir, former President of the Republic of Iceland and UNESCO Goodwill Ambassador for Languages, was sitting right in front of me.
The workshop program in the morning had a total of 12 presentations on different aspects of quality evaluation metrics. This was followed by an afternoon of hands-on exercises. We each had 10 minutes to present. All the workshop presentations are available here: http://mte2014.github.io/MTE2014-Workshop-Proceedings.pdf. I personally enjoyed a presentation by a team from the Department of Applied Linguistics and Translation of Saarland University (where I graduated from myself), stressing the importance of function and purpose of a given translation task.
My presentation, “Rating Evaluation Methods through Correlation,” focused on the results of a major data-gathering exercise that the Welocalize Language Tools team carried out earlier this year. We correlated results from automatic scoring (in this case BLEU), human scoring of raw MT output on a 1-5 Likert scale, and productivity test deltas from 2013 data. The total test set comprised 22 locales, five different MT systems and various source content types.
In line with findings from other speakers and recent publications, we found that while automatic scores such as BLEU serve as great trend indicators for overall MT system performance, they don’t tell us much about how useful the given MT output is for post-editors. Human scoring, on the other hand, correlated with the productivity gains seen in post-editing and, together with error classification, proved a better indicator of usability. This confirmed the validity of our evaluation approach, which combines productivity data and human evaluation.
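For readers who want to try this kind of comparison on their own data, here is a minimal Python sketch of correlating the three measures. It is not the analysis we ran: the choice of Pearson and Spearman coefficients is an assumption, the function names are hypothetical, and every number in the sample rows is an invented placeholder rather than a result from our study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson computed on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented placeholder rows, one per locale/system pair:
# (BLEU score, mean human score on the 1-5 scale, productivity delta in %).
SAMPLE_ROWS = [
    (42.0, 3.6, 25.0),
    (35.5, 3.9, 31.0),
    (51.2, 3.1, 12.0),
    (28.4, 2.7, 8.0),
    (46.8, 4.2, 38.0),
]

if __name__ == "__main__":
    bleu = [row[0] for row in SAMPLE_ROWS]
    human = [row[1] for row in SAMPLE_ROWS]
    delta = [row[2] for row in SAMPLE_ROWS]
    print("BLEU vs productivity  (Spearman):", round(spearman(bleu, delta), 3))
    print("Human vs productivity (Spearman):", round(spearman(human, delta), 3))
    print("BLEU vs human         (Pearson): ", round(pearson(bleu, human), 3))
```

Spearman is included because Likert scores are ordinal, so a rank-based coefficient is often the safer choice for the human scores, while Pearson is shown for comparison.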
During the hands-on exercises in the afternoon, participants were asked to design comprehension tests, as well as to post-edit and perform error annotation on machine-translated output based on the MQM methodology proposed by the QTLaunchPad project. One test required us to annotate errors found on “chunked” entities, such as sub-entities of sentences or strings, which proved surprisingly challenging. Another interesting exercise was trying to map issues found in MT output back to the source.
Last but not least, some participants also trialed Welocalize’s human evaluation scoring and error marking, which provided a great opportunity to discuss our own categories and take away additional food for thought. The exercises confirmed that one key challenge in human evaluation and error annotation is making them quick and easy to use for evaluators from very different cultures, who usually have a translation background rather than knowledge of formal linguistics.
If you would like to receive additional information about our study or our findings, or to discuss my presentation, contact me directly at Lena.firstname.lastname@example.org.