Evolving MT: McAfee’s Journey into MT
Morgan O’Brien from Intel Security (McAfee Division) took part as a guest panelist on the “Evolving MT” discussion at the Welocalize LocLeaders event in Dublin. Morgan talked about McAfee’s journey to machine translation (MT) and, in this blog, summarizes some of his key points.
Everyone has to start from somewhere. That usually means starting from nothing. A few years ago, we looked at MT and had to make some decisions. We had to claw our way through the hype and the bold statements that many providers make at conferences and make some choices for ourselves. The journey was all about understanding what was out there in terms of MT offerings, understanding our own content and also understanding what our internal use cases were for MT. So here is some wisdom from that journey that may help others.
Acceptance of MT
You don’t just wake up one morning and decide that you’re going to use MT and that everyone will accept that. It starts by making the business case and it grows from there. We started with a very simple question: “Would MT be better than Pseudo builds for QA testing?” The free APIs from Microsoft and Google at the time limited the amount of content you could machine translate through them. We needed something else, low cost, to get us off the ground. That turned out to be a copy of Systran. It came with 11 language pairs and gave us the ability to test out theories. Perfect. You build up the reputation of MT within your organization from nowhere and let acceptance follow from the positive results.
Having used MT with some Pseudo test projects, we enhanced the quality by training in our own terminology and UI information. It was very quick and dirty, but it gave the Pseudo translations the accuracy that we needed: not linguistic accuracy, but accurate enough for the MT Pseudo builds to be useful. Now that we had used MT, the next steps were to start looking at other uses and providers, filtering through the marketing material and sales pitches. There is no quick answer for this, unfortunately. Only you can make the best decisions for your organization. With every bit you learn, you refocus back on your internal challenge and see how it fits. Eventually you will take the plunge with some pilot and proof of concept projects.
Documentation Localization with Post-Edited MT (PEMT)
We completed a number of tests before we started looking at our documentation. We profiled our content. We’d seen similarities in some content types and the terminology used. We also had a large scale terminology management project running in parallel. We had also made ground with our documentation authors who were starting to look at controlled language and terminology. Conditions were good for documentation to go to MT. It’s important to understand that MT quality is affected by several conditions.
Training Corpora for MT
A minimum number of translation units is needed to train an effective statistical machine translation (SMT) system. If you have more content than is required, profile it and select the best content for training your MT systems. Profiling content is extremely important to SMT. Our training corpora were selected based on this profiling.
I’ll admit, some of it was trial and error, and for some of it we employed tools to help us on the way. We kept a keen eye on BLEU* scores, as this is all you can do at this initial stage. Later on, BLEU becomes less relevant.
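For readers who have not met BLEU before, here is a minimal sketch of the idea behind the score: it compares the n-grams in the MT output against a human reference translation and penalizes output that is too short. This is a simplified, single-reference, unsmoothed version written for illustration only. The function names and example sentences are my own assumptions, and it is not the tooling we used; real projects would rely on an established implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference (no smoothing)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # this sketch does not smooth zero counts
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example sentences, not taken from any real project.
reference = "the scan is complete now"
print(simple_bleu("the scan is complete now", reference))    # identical: 1.0
print(simple_bleu("the scan is complete today", reference))  # about 0.67
```

A score like this only tells you how close the output is to one particular reference, which is part of why it matters less once you can measure real post-editing productivity.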
In addition to having the right training corpora collected and cleaned, we also had an input from our terminology management project, which allowed us to weight the most important terms from our products in our MT output. As part of compliance in the Language Quality Analysis, this was a big deal for us. If we want to post-edit our MT output, then the least we should do is ensure that the terminology suggested to the Post Editor is correct, as this significantly increases the quality and speed of post-editing.
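As a rough illustration of what a terminology compliance check on MT output can look like, the sketch below flags approved target terms that are missing from a translated segment. The function name, the tiny English to French term base and the example segment are all hypothetical, and the plain substring matching is an assumption made for brevity; it ignores inflection and word boundaries, which a real check would need to handle.

```python
def missing_terms(source, mt_output, termbase):
    """Return (source term, approved target term) pairs the MT output missed.

    termbase maps a source-language term to its approved target-language term.
    Matching is a naive case-insensitive substring test.
    """
    src = source.lower()
    out = mt_output.lower()
    return [
        (s, t)
        for s, t in termbase.items()
        if s.lower() in src and t.lower() not in out
    ]


# Hypothetical term base and segment, for illustration only.
termbase = {"firewall": "pare-feu", "quarantine": "quarantaine"}
source = "Enable the firewall and check the quarantine."
mt_output = "Activez le mur de feu et vérifiez la quarantaine."

print(missing_terms(source, mt_output, termbase))
# [('firewall', 'pare-feu')]
```

Running a check like this before segments reach the Post Editor is one way to catch terminology problems early, rather than leaving them to be fixed by hand.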
Quality Levels: PEMT with Gold Standard
At McAfee, we talk about the “gold standard” with regard to our output, because we don’t want to compromise our language and ultimately the localization message. It’s important to our global brand. Gold Standard means that there is compliance with our terminology, the style and accuracy of the message are good and the text flows well for the reader. It is our highest quality and what we expect from our translators during a normal human translation workflow. This was the aim for our PEMT.
Content Optimization (Acrolinx)
We don’t have our source content optimized for MT at the moment. Why? Because it takes years of writing and localization to build an MT training corpus that matches the authoring process. This is not to say that the content was not good, but it was not optimized in terms of standards which could significantly help MT. Starting at this point, where content is now being written with structure and common standards across the organization, means that when we retrain future MT engines, we should see better gains. Simple rules for authoring, such as trying to keep sentence length to 12 words, can make a huge impact on your MT effectiveness. And of course, more consistent terminology usage helps hugely.
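To show how simple an authoring rule like the 12-word guideline is to check, here is a minimal sketch that flags sentences over the limit. It is not how Acrolinx or any other content optimization tool works; the function name, the naive punctuation-based sentence splitter and the sample text are assumptions made for illustration.

```python
import re

MAX_WORDS = 12  # the sentence-length guideline mentioned above


def long_sentences(text, max_words=MAX_WORDS):
    """Return (word count, sentence) pairs for sentences over the limit."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_words:
            flagged.append((count, sentence))
    return flagged


# Hypothetical source text, for illustration only.
text = (
    "Run the scan. "
    "This sentence is deliberately written to be much longer than "
    "the twelve word guideline allows."
)
for count, sentence in long_sentences(text):
    print(count, sentence)
```

A report like this gives authors quick feedback, but the splitter will stumble on abbreviations and product names containing periods, so treat the output as a prompt for review rather than a verdict.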
Translator Productivity Expectations
Ultimately your MT must be tested with productivity. BLEU gives you an indication of quality; however, real quality estimation is based on how much better your translation flow is after MT is introduced. This is calculated by two old favorites: time and money. For effective MT, reducing costs and increasing productivity, you need the following:
- Good selection of MT training corpora (usually ~300K segments of bilingual data)
- A process for cleaning, selecting and organizing (engineering and linguistic)
- Terminology consistency (a set process and buy-in to that process)
- Style and rules consistency in authoring (authoring standards)
- Expected quality levels (full PE or light PE?)
- An understanding of the human factor
- Fit-for-purpose tool set
So, how do you calculate the time and money? This is where tools and the human factor come in.
You can use tools such as TAUS DQF or more integrated tools like iOmegaT or MemoQ for calculating time on segments. TAUS DQF is a good start to assess your general productivity; however, for an actual production project, it does not allow the Post Editor to use the tool and set of macros and shortcuts that they are familiar with. I’d like to see all translator CAT tools in the future have productivity data built in for MT. This may cross some trust boundaries with translators. As long as it has an ‘opt in’ system, I believe that most post-editors will be happy enough to share some data to help you improve your MT.
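Once you have time-on-segment data, the time and money calculation itself is simple arithmetic. The sketch below is one possible way to do it, assuming you can export a word count and an editing time per segment for both a human translation baseline and a PEMT run. The SegmentLog structure, the function names, the hourly rate and all the numbers are hypothetical and do not reflect the export format of any of the tools named above.

```python
from dataclasses import dataclass


@dataclass
class SegmentLog:
    """Time spent on one segment (hypothetical export format)."""
    words: int
    seconds: float


def words_per_hour(logs):
    """Throughput across all logged segments."""
    total_words = sum(log.words for log in logs)
    total_seconds = sum(log.seconds for log in logs)
    if total_seconds == 0:
        raise ValueError("no time recorded")
    return total_words * 3600 / total_seconds


def productivity_gain(baseline_logs, pemt_logs):
    """Relative throughput change of PEMT over the baseline (0.5 = +50%)."""
    base = words_per_hour(baseline_logs)
    pemt = words_per_hour(pemt_logs)
    return (pemt - base) / base


def cost_per_word(hourly_rate, wph):
    """Cost of one word at a given hourly rate and throughput."""
    return hourly_rate / wph


# Made-up numbers for illustration only.
baseline = [SegmentLog(10, 60), SegmentLog(20, 120)]  # human translation
pemt = [SegmentLog(10, 40), SegmentLog(20, 80)]       # post-edited MT

print(words_per_hour(baseline))           # 600.0
print(words_per_hour(pemt))               # 900.0
print(productivity_gain(baseline, pemt))  # 0.5
```

Comparing the same post editor on comparable content keeps the human factor from swamping the result, which is why a per-person baseline is more useful than a team average.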
The Human Factor
The human factor is, very simply, the fact that two people working side by side will perform differently on any type of work. It’s part skill, part familiarity with tools, part experience and part acceptance.
A post editor with little experience of post-editing and lots in traditional translation will simply not perform as well in a productivity assessment as someone with post-editing experience. Post-editing experience means, at the very least, understanding patterns, knowing quick working methods and being familiar with the tools used to implement these methods.
One influencing factor is how much they (post editors) accept the process with MT and want to work with it. If they don’t like it, they simply will not perform well. It’s our challenge, as providers of the MT and the process, to bring the end post editors closer and give them good MT output that evolves and gets better the more they are working on it. Don’t allow the post editors to suffer from repetitive, stupid mistakes from the MT. Make them part of the MT activity workflow and provide rewards by keeping MT quality high.
The MT Feedback Loop
There is currently nothing out there that automates the MT feedback loop other than retraining data. I would like to see CAT tools create a standard for reporting back on MT, through XLIFF kits or similar, so that when a kit is delivered, there’s a part for the MT specialist to dive into. If this is possible, the translator who bothered to give feedback should be rewarded for doing so, as they are helping the MT improve.
The Application of Raw MT
There are cases for the use of raw MT; however, you have to be careful. Using raw MT on an area of information where the message could be misrepresented could damage your brand and your reputation. For each use case for raw MT, it would be prudent to at least do a usability study every now and again.
Most of the MT systems out there that offer on-demand translation are pretty decent, especially in the IT domain. But there is always a chance of introducing total language blunders into your output. These may have a humorous side (“Windows ME” translated as “Fenêtre Moi” in French!), but most of the time they are unprofessional and give the end user the wrong message. With too many embarrassing translations, you really should not be using raw MT on what may be high impact content. This makes the case for ‘raw MT’ to actually be ‘Refined MT’. You can take feedback from various studies to improve the usability of the content, but this requires your own engines, as opposed to the free services out there.
Start small. Understand what you want as an achievable goal of quality and where you’re going to use the MT in line with your overall business goals. Understand your content and any issues it may have. Use the right tools. And assess the quality in a variety of different ways, depending on what the content type is and the varying levels of impact that content will have on your global brand.
Morgan O’Brien is project manager at Intel Security (McAfee Division).
*BLEU: Bilingual Evaluation Understudy. An algorithm for evaluating the quality of text which has been machine translated.