Predictive MT and Quality Analysis

Predictive analysis in the localization industry is swiftly becoming a key approach to improving quality and efficiency in the translation workflow. Extracting information from existing datasets to determine patterns and predict future outcomes can significantly help the translation automation process. In this blog, Welocalize Technology Solutions team members Dave Clarke and Dave Landan give us an update on how two new Welocalize tools are benefiting clients.

It is becoming increasingly important for language service providers (LSPs) to quickly determine the nature of content: how suitable it actually is for the envisaged localization outcomes and, subsequently, which processes and workflows it should be routed through to meet those expectations. At the same time, today’s clients have an ever-increasing volume, and often diversity, of content to be translated. The ability to analyze large quantities of source content quickly, accurately, and consistently therefore becomes imperative.

Welocalize has recently added two new tools to its language tool portfolio to help automate these analyses:  TMTprime and StyleScorer. 

TMTprime was developed through a collaboration between Welocalize and the Centre for Next Generation Localisation (CNGL, now ADAPT). TMTprime provides a way to predict which of several translation assistance systems, whether translation memories (TMs), machine translation (MT) engines, or both, would provide the best output for a given content set. Given TMs and/or MT training data and a “tuning set,” TMTprime learns to predict which of the systems it is trained on is best for different source content types. We are also currently researching the capabilities of TMTprime when applied to the task of predictive quality analysis, with a view to drastically reducing or, more often, replacing costly and time-consuming human evaluations of multiple MT engines.

StyleScorer is a proprietary Welocalize tool that learns the content authoring style of a set of documents and then, through a scoring system, rates how well new content matches the style of the initial documents. Analytic tools like Welocalize StyleScorer can work with documents in any language and can be useful for analyzing both source and target content. Automated analysis of source content gives a fast, accurate impression of suitability or potential difficulty of translation at the very beginning of the production cycle, which is exactly the right time to be informed. Further through the cycle, analyzing target content gives us a way to automate certain tasks in linguistic quality analysis (LQA).

By running StyleScorer on raw MT output, we can use the scores to rank documents by how much post-editing (PE) they are likely to need to bring them in line with the style of known target documents. This is good news when time is precious because it allows us to focus PE work where it is needed.
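As a rough illustration of that ranking step, the sketch below sorts documents by a style score so that the weakest matches come first in the PE queue. It is not StyleScorer itself: the function name `rank_for_post_editing`, the file names, and the score values are hypothetical, and it assumes that a lower score means a poorer match to the reference style.

```python
from typing import Dict, List, Tuple


def rank_for_post_editing(style_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (document, score) pairs, lowest style score first.

    Assumption: a lower score means the document is further from the
    reference style, so it is likely to need more post-editing.
    """
    return sorted(style_scores.items(), key=lambda item: item[1])


# Placeholder scores for illustration only; not real StyleScorer output.
example_scores = {"doc_a.txt": 3.2, "doc_b.txt": 1.4, "doc_c.txt": 2.7}
pe_queue = rank_for_post_editing(example_scores)
```

Post-editors would then work through `pe_queue` from the top, so the documents that deviate most from the target style get attention first.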

TMTprime and StyleScorer are just two examples of the cutting-edge tools that Welocalize uses to make sure that content gets translated as quickly as possible, to appropriate quality levels. More exciting innovation in the area of content analysis will be brought out later this year so watch this space!

Welocalize Technology Solutions

Dave Clarke and Dave Landan

For further reading on StyleScorer, read Dave Landan’s blog: Welocalize StyleScorer helps MT and Linguistic Review Workflow


EAMT Conference 2014: Welocalize Language Tools Team Overview

The Welocalize Language Tools team attended and presented at the 2014 EAMT Conference in Croatia. In this blog, Laura Casanellas, Welocalize Language Tools Program Manager and presenter at EAMT, provides her highlights and insights from her Welocalize colleagues who took part in the conference.

Just like Trento in 2012 and Nice in 2013, the Welocalize Language Tools Team participated in the Annual Conference of the European Association for Machine Translation (EAMT). The conference took place June 16 – 18 in the city of Dubrovnik, Croatia and four members of the Welocalize Language Tools team attended:

Olga Beregovaya, VP of Language Tools, and Dave Landan, Pre-sales Support Engineer, presented a project poster on “Source Content Analysis and Training Data Selection Impact on an MT-driven Program Design with a Leading LSP.”

Lena Marg, Training Manager, and I delivered our presentation “Assumptions, Expectations and Outliers in Post-Editing.”

We take the EAMT conference and associated conferences (International, Asian and American) seriously, as most of the important developments that are currently taking place around machine translation (MT) are presented and followed up in those forums.

As a global language services provider (LSP), Welocalize adds value to the EAMT conference by being able to share real-life MT production experiences, demonstrated through thorough analysis of large and varied quantities of actual data. We are privileged in that we work in a real scenario where some of the new technologies around natural language processing (NLP) and MT can be tested in depth.

In their poster (Source Content Analysis + Training Data Selection Impact – EAMT POSTER by Welocalize), Olga and Dave stressed the importance of preparing the training corpus in advance and matching it to the specific requirements of the content that will subsequently be translated. To give an example, many translation memories come from different projects created at different points in time. They may contain inconsistencies, overly long sentences, or a lot of “noisy” data, so they need to be cleaned up before they can be used as engine training assets. Going deeper into the possibilities of automatic data selection and matching it with the source content, Olga and Dave spoke about our suite of analytic applications, which is divided between proprietary tools, such as Candidate Scorer, Perplexity Evaluator and StyleScorer, and tools being developed as part of an industry partnership with CNGL: Source Content Profiler and TMTprime.

Olga Beregovaya’s impressions of the EAMT Conference and Welocalize’s role within it are very positive. “Overall, the great thing about the conference was the applicability of the new generation of academic research in real-life production scenarios. Many of the academic talks were relevant for the work on MT adaptation and customization that we do at Welocalize. Today, we need to cover more and more domains and content types, so domain and sub-domain adaptation is becoming the key area of our R&D. This means that we benefit greatly from academic and field research around data acquisition for training SMT systems and the relatively new developments around using terminology databases to augment the SMT training data. Not all of our clients come to us with their legacy translation memories, and while there are some public corpora available, we still need to rely on acquiring and aligning data ourselves.”

Dave found two of the presentations he attended particularly interesting; both focused on common pain points within the industry. “The challenges of using MT with morphologically rich languages are well-known, and we were happy to see interesting research in possible ways to overcome those challenges. We also found a talk on gathering training data from the web very interesting. The presenters discussed using general and specific data to train separate engines which could be weighted and combined to give improved results in cases of sparse in-domain training data. Indeed there were several innovations from academia that we are looking forward to incorporating into our bleeding-edge MT tools and processes.”

In our presentation, Lena and I focused on different challenges in a real MT production scenario: the necessity of forecasting future post-editing effort, with an emphasis on post-editors’ behavior, and their personal and cultural circumstances, as an important variable of the MT + PE equation. As part of a large LSP, we have been able to gather a large amount of data and focus on the quality of a number of MT outputs related to different languages and content types. Our presentation elaborated on our findings around correlations between different types of evaluation methods (automatic scoring, human evaluations and productivity tests). We obtained interesting findings around the adequacy score in our human evaluation tests and the productivity gains observed in post-editing. We will continue gathering data and investigating this area.
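One simple way to quantify how two evaluation methods relate is a Pearson correlation coefficient. The sketch below is illustrative only and is not the analysis from the presentation; the function name `pearson` and the idea of pairing per-sample adequacy scores with productivity gains are assumptions made for the example.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long series of measurements."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

Called on paired per-sample values, for example adequacy scores and productivity gains for the same set of samples, it returns a value between -1 and 1, where values near 0 suggest little linear relationship between the two methods.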

Another topic that was touched upon during the conference was the area of quality. Lena and Olga both shared their perspectives:

“After closely following the QTLaunchpad project for several months, it was particularly interesting to see and discuss results from their error annotation exercises using MQM earlier in the year. Welocalize took part in these exercises by providing data and annotator resources. The findings of this exercise are contributing to further advances both in quality estimation and quality evaluation, fine-tuning metrics further for better inter-annotator agreement, etc. These discussions also provided some immediate take-aways for our approach to evaluation.” – Lena Marg

“The other area of high relevance to us is Quality Evaluation. Again, it is great to see so many research projects dealing with predicting MT quality and utility. While it still may be challenging to deploy such quality estimation systems in production, as various CAT tools and TMS systems have their own constraints around metadata-driven workflows, it is very encouraging to know that this research is available.” – Olga Beregovaya

“A general theme of the EAMT Conference was the question of how to increase cooperation between the translation and the MT research community. In this context, Jost Zetzsche’s keynote speech was important in pointing out that translators should take an active interest in providing constructive feedback on MT and on how they work, to ensure new advances in MT developments are truly benefiting them. And yet, with the presence of some interested freelance translators, translation studies researchers and a handful of LSPs presenting on MT, it would seem that progress has already been made in bringing the two sides together.” – Lena Marg

The EAMT Conference was a great opportunity to meet professionals, academics and researchers who work in the field of MT. The Welocalize team members were able to exchange ideas around the current pressing challenges surrounding MT technology and we still had time to admire the beautiful surroundings of historical Dubrovnik.

Laura Casanellas is program manager on the Welocalize Language Tools team.

Welocalize to Present in LA at Translation Technology Conference memoQfest Americas

Frederick, Maryland – February 25, 2014 – Welocalize, a global leader in translation and localization, will be sharing machine translation (MT) knowledge and expertise at the 2014 memoQfest Americas conference in Los Angeles, February 27 through March 1.

David Landan from Welocalize’s Language Tools Team will be presenting “Better translations through automated source and post-edit analysis” on day two of the conference.

“My presentation at memoQfest Americas will discuss how Welocalize is developing processes and tools grounded in computational linguistics and NLP to reduce post-editing effort,” said David Landan, support engineer at Welocalize. “We analyze data using techniques from machine learning, language modeling, and information retrieval. Our data-driven approach allows us to build more targeted, more accurate MT systems.”

David will explore ways of automating training data selection using a source content analysis suite and show how the selected data led to improved MT engine quality, using Welocalize’s WeScore and StyleScorer to evaluate translations. Welocalize’s WeScore is a dashboard for viewing several metrics in a single application. It makes automatic scoring of MT output easier by handling the parsing of input formats, tokenization, and the running of multiple scoring algorithms in parallel.
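To make the idea of one tokenization pass feeding several scoring algorithms concrete, here is a minimal sketch. It does not reflect how WeScore is implemented: the function names, the naive whitespace tokenizer, and the two toy metrics (clipped unigram precision and length ratio) are all assumptions chosen for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

Metric = Callable[[List[str], List[str]], float]


def tokenize(text: str) -> List[str]:
    """Naive lowercase whitespace tokenizer (a stand-in for a real one)."""
    return text.lower().split()


def unigram_precision(hyp: List[str], ref: List[str]) -> float:
    """Share of hypothesis tokens matched in the reference, with clipping."""
    if not hyp:
        return 0.0
    remaining: Dict[str, int] = {}
    for tok in ref:
        remaining[tok] = remaining.get(tok, 0) + 1
    matched = 0
    for tok in hyp:
        if remaining.get(tok, 0) > 0:
            matched += 1
            remaining[tok] -= 1
    return matched / len(hyp)


def length_ratio(hyp: List[str], ref: List[str]) -> float:
    """Hypothesis length divided by reference length."""
    return len(hyp) / len(ref) if ref else 0.0


DEFAULT_METRICS: Dict[str, Metric] = {
    "unigram_precision": unigram_precision,
    "length_ratio": length_ratio,
}


def score_all(hypothesis: str, reference: str,
              metrics: Optional[Dict[str, Metric]] = None) -> Dict[str, float]:
    """Tokenize once, then submit every metric to a thread pool."""
    metrics = DEFAULT_METRICS if metrics is None else metrics
    hyp, ref = tokenize(hypothesis), tokenize(reference)
    # Threads only illustrate the fan-out pattern; CPU-heavy metrics
    # might be better served by a process pool.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, hyp, ref) for name, fn in metrics.items()}
        return {name: future.result() for name, future in futures.items()}


scores = score_all("the cat sat", "the cat sat down")
```

Adding another metric is then a matter of registering one more function in the dictionary, which is the kind of convenience a shared scoring dashboard aims for.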

Machine translation (MT) is a topic with a high level of interest at localization and translation industry events. As global organizations produce more and more content and the demands for quick localization grow, Welocalize will highlight how combining translation approaches, like MT and post-edit analysis, can achieve the desired quality of output that meets time and budget goals.

memoQfest is an annual conference, hosted by Kilgray Translation Technologies, to learn more about trends within the translation technology industry. The memoQfest event also provides networking opportunities for translators, language service providers and translation end-users.

About Welocalize – Welocalize, Inc., founded in 1997, offers innovative translation and localization solutions helping global brands to grow and reach audiences around the world in more than 125 languages. Our solutions include global localization management, translation, supply chain management, people sourcing, language services and automation tools including MT, testing and staffing solutions and enterprise translation management technologies. With over 600 employees worldwide, Welocalize maintains offices in the United States, UK, Germany, Ireland, Japan and