Building Culturally Adapted AI Datasets

welocalize October 6, 2022

At LT Innovate’s Language Intelligence @ Work Summit, our Senior AI Solutions Architect, Aaron Schliem, shared insights into building culturally adapted AI datasets. In this blog, Aaron explains some of the key concepts:


Humans apply a wide range of inferences, prejudices, and judgements to how we perceive things. There are a lot of implicit assumptions that we may share, or we may not. Generally, we will understand each other because we have a wide range of experiences that help inform how we perceive these realities as we communicate.

Machines, on the other hand, don’t have those experiences. Machines are very limited. They are limited to the data that they are processing. For machines, this reality is an explicit one, it is not inferred.

Multilingual artificial intelligence (AI) systems such as chatbots or support assistants need to be trained and tested with lots of culturally rich data. Those explicit references can then be reflected in the actual AI performance.

Basic concepts for building AI datasets


Intents

Within a domain, or subject-matter area, there’s something called an intent: something you want to do that is specific to that domain. If the subject-matter domain is banking, an intent might be wanting to open a bank account.


Slots

Slots are the key kinds of information that you might convey, or that an AI algorithm might ask you for, in order to complete your intent. For example, if I want to open a bank account, the slot would be the account type.


Lexicons

A lexicon is a set of words or terms that are commonly used when interacting in a particular subject-matter domain. Again, if the domain were banking, the lexicon might include ‘account’, ‘balance’, ‘finance’, ‘savings’, etc. All of these things go into building out a domain-specific conversational AI engine.
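
As a rough sketch, the relationship between domains, intents, slots, and lexicons might be captured in a simple data structure. All the names below are hypothetical and not tied to any specific toolkit:

```python
from dataclasses import dataclass, field

@dataclass
class Intent:
    """Something a user wants to do within a domain."""
    name: str
    slots: list  # key pieces of information needed to complete the intent

@dataclass
class Domain:
    """A subject-matter area for a conversational AI engine."""
    name: str
    intents: list = field(default_factory=list)
    lexicon: set = field(default_factory=set)  # terms common in this domain

# Banking example from the text
banking = Domain(
    name="banking",
    intents=[Intent(name="open_account", slots=["account_type"])],
    lexicon={"account", "balance", "finance", "savings"},
)
```

A real conversational AI platform would layer much more on top of this (training examples, entity resolution, dialogue state), but the core vocabulary is essentially what is shown here.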


Prompts

Typically, when humans are creating AI training data, they need to be given a prompt to elicit natural language expression. The prompt is a description of a domain-specific intent – this helps to position the person psychologically to produce the kind of data that is representative of the things humans would say.


Utterances

Utterances are the core currency used to train a conversational AI engine. Within utterances, there are:

  • Carrier Phrases
    Carrier phrases can be somewhat generic and surround all of the key information. For example, “I would like to”, “I really feel like”, “Could you help me to”. They are the ways that the important information (intent and slots) is carried so that the machine can understand us coherently.
  • Slot Values
    Slot values are inserted into these carrier phrases. The slot value is the information. Again, if the domain is banking, the slot is the account type, but the slot value may be checking account.
  • Fillers or non-informational words/sounds
    Not every human being uses sound in the same way. Fillers are things like ‘hmm’, ‘um’, ‘erm’, ‘I guess’ – the things we all use in our everyday conversations but are not the core information that we’re trying to convey.
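
Putting the pieces together, a training utterance combines a carrier phrase, a slot value, and an optional filler. The sketch below illustrates the anatomy only – real data collection elicits these from people rather than generating them mechanically:

```python
import itertools

carrier_phrases = ["I would like to", "Could you help me to"]
fillers = ["", "um, ", "I guess "]  # empty string = no filler
slot_values = {"account_type": ["checking account", "savings account"]}

# Template for the "open_account" intent; {account_type} is the slot.
template = "open a {account_type}"

utterances = []
for filler, carrier, value in itertools.product(
    fillers, carrier_phrases, slot_values["account_type"]
):
    utterances.append(f"{filler}{carrier} {template.format(account_type=value)}")

print(utterances[0])  # "I would like to open a checking account"
```

Even this toy example shows why fillers and varied carrier phrases matter: the same intent and slot value can surface in many surface forms, and the engine has to recognize all of them.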

Dimensions of culture

Taking these basics of AI data, you can then apply dimensions of culture and diversity to reflect this in the datasets being built and ensure you’re hitting the mark in terms of cultural diversity.

Colloquialisms, texting and abbreviations

Here we’re thinking about non-standard language: the real language people use on an everyday basis, including when they talk to their smartphone, computer, or assistant devices. For example, in Australia the word ‘arvo’ is commonly used as an abbreviation for afternoon. Words like this should be represented in the dataset so that the AI engine can understand them too.
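
One lightweight way to make sure colloquialisms like ‘arvo’ are covered is to keep a locale-specific variant map and use it to augment utterances. The mapping below is an illustrative sketch for en-AU, not a real resource:

```python
# Locale-specific colloquial variants (illustrative, en-AU)
variants = {"afternoon": ["arvo"], "barbecue": ["barbie"]}

def augment(utterance: str) -> list:
    """Return the utterance plus copies with colloquial substitutions."""
    results = [utterance]
    for standard, colloquials in variants.items():
        if standard in utterance:
            for c in colloquials:
                results.append(utterance.replace(standard, c))
    return results

print(augment("book a table for this afternoon"))
# includes "book a table for this arvo"
```

Augmentation like this is a supplement, not a substitute, for sourcing data from people who actually use these words naturally.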

Regulatory Regimes

An example of this could be for the insurance domain and considering what kind of insurance people need in a particular culture/country. In some countries, it’s a requirement to have car insurance, resulting in a very common intent. This is how culture plays a role in defining what we actually want, because it defines what we need by virtue of the regulatory regime.

Local Customs

What are the common things that people do? In Japan, for example, it’s not uncommon to send your luggage ahead of you when travelling – something that is not common in American culture. So if you’re Japanese, you might ask a hotel or a travel service to arrange this.

Local brands

Even though we live in a globalized world, there’s still a very strong local focus. Think of your local bank, insurance agency, or healthcare provider: local brands are important. They need to be part of the dataset so that the AI engine can perceive them as part of the experience for the person.

Product categories

There might be a need in the south of France to buy a parasol because of the heat of the sun. That may not be relevant in another location where there may be only cold weather. Think about the local reality and the kinds of products that are sold there.

Institutional Infrastructure

An example of this could be if you live in a location where there’s only private health insurance. It would be confusing to create a lot of data related to public health coverage because it simply is not a reality in that culture.


Cultural preferences

Certain cultures have certain preferences – food preferences, for example. Chayote is a common food in Central America; it might be part of an inquiry made to a restaurant and so needs to be understood by the AI engine.

AI training data needs to be infused with real-life. The kinds of things that real people care about. Those things then get reflected in lexicon, intent, prompts, slots, and slot values.


So HOW do we build culture into the AI datasets?

Typically, data for AI is gathered by asking a wide range of different kinds of people to generate it. But a lot can be done to maximize the success of those people and the probability that they will reflect important cultural dynamics in the data they deliver.

Here are some ways we can adapt the task design to maximize the success of the data:

  • Locale contextualization. When writing a prompt for someone, it should be done in a way that makes them think about their locale and makes an assumption about where they sit right now in their local reality. An example would be ‘imagine that it’s snowing, and you want to buy a shovel’. This helps place them in the reality of their culture and to generate data that might be produced naturally.
  • Start with easy domains and intents. It’s not easy for people to imagine what they might do in a given scenario, and at the same time imagine a very diverse range of sentence structures they might use. Starting with easier content allows workers to better understand what you’re looking for before moving on to more complex domains.
  • Simplify the user experience (UX). If a worker is too focused on a complex user interface, that can detract from their ability to focus on the subject matter itself and on the way they’re employing language, which is ultimately what we want.
  • Infuse prompts with diversity. Asking for the data in a certain way, using a variety of different genders and people in the examples, can help to reflect greater gender diversity in the datasets.
  • Limit individual contributions. Each of us is limited by our own life experience and cultural experiences. You can’t expect one person to create such a vast array of culturally reflected content. It’s important to limit what each person can contribute to the total dataset.
  • Use NLP as a validation tool. When you’re working at scale, conversational AI requires large volumes of data in order to produce good algorithms. Checking at a very micro level for what is in the content your workforce is delivering can be really tedious and challenging whereas NLP can help do that at scale.
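
As a sketch of what “NLP as a validation tool” might look like in practice, simple automated checks – near-duplicate detection and per-worker contribution caps, echoing the points above – can run over the whole dataset at scale. The thresholds and function names here are illustrative assumptions:

```python
from collections import Counter

def validate(utterances, max_per_worker=0.1, min_unique_ratio=0.8):
    """Flag datasets that are too repetitive or dominated by one worker.

    utterances: list of (worker_id, text) pairs.
    """
    texts = [t.lower().strip() for _, t in utterances]
    workers = Counter(w for w, _ in utterances)

    issues = []
    # Lexical diversity: too many exact duplicates is a red flag.
    if len(set(texts)) / len(texts) < min_unique_ratio:
        issues.append("low lexical diversity")
    # Limit individual contributions (see above).
    top_share = workers.most_common(1)[0][1] / len(utterances)
    if top_share > max_per_worker:
        issues.append("one worker dominates the dataset")
    return issues

data = [("w1", "open a checking account")] * 5 + [("w2", "check my balance")]
print(validate(data))  # both checks fire on this toy dataset
```

Production pipelines would add richer checks (embedding-based similarity, slot-coverage statistics, language identification), but even these two catch problems that are tedious to spot by reading utterances one at a time.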

Sourcing Data Workers

When producing high-quality, high diversity data, sourcing data workers is very important. Ultimately, the workers are going to be the lens through which their individual cultural experiences shine.

Here are some potential techniques you might deploy when sourcing workers:

  • Consumer/user experience. Look for people who have consumer and user experience, rather than subject-matter specialists. You want somebody who may have used a chatbot from their local fast-food place. That matters far more than being an expert in how a fast-food organization runs.
  • Use vocal register targeting. Gather different voices – high, low, medium, a wide range of pitches. Don’t gender-target, which can often lead to bias in sourcing the data.
  • Technology experience. Users with different amounts of experience will interface with machines in different ways. Some with lower experience will produce data that is different to those with high tech experience. Don’t assume based on someone’s age!


Bilingualism and code-switching

Bilingualism and code-switching are important and complex phenomena in conversational AI. Many people speak more than one language and switch back and forth between them in a fluid way when talking.

Use imaginative contextualization

You want to have reflections of bilingualism, but don’t force it! Write prompts that place someone in a mental frame of reference that might elicit from them a code-switching utterance.

Say it out loud

Often when we write things down first, they come out stilted and unnatural. If someone is naturally bilingual, let them say something out loud first to see how it flows off the tongue, and then write it down. That can produce much more natural data that reflects true use of code-switching in a conversational setting.

Enforce script usage

If workers are writing the utterances, it’s important to make some decisions about how the words will be written – will you keep each language in its native script, or transliterate everything into Latin characters? It’s important to have a structure, otherwise parsing multiple languages is going to be quite challenging.
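
A simple way to enforce a script decision is to check each utterance’s characters against the expected Unicode script. Below is a sketch using Python’s standard `unicodedata` module; the “Latin only” policy itself is an assumption for illustration:

```python
import unicodedata

def scripts_used(text: str) -> set:
    """Return the Unicode script names appearing in alphabetic characters."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "LATIN SMALL LETTER A"
            found.add(unicodedata.name(ch).split()[0])
    return found

def violates_policy(text: str, allowed=frozenset({"LATIN"})) -> bool:
    """Flag utterances that use scripts outside the agreed policy."""
    return not scripts_used(text) <= allowed

print(violates_policy("quiero un café"))     # False: all Latin
print(violates_policy("quiero un コーヒー"))  # True: mixes Latin and Katakana
```

Whichever policy you choose, a check like this makes it enforceable at scale rather than relying on each worker to remember the convention.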

Welocalize develops high-quality conversational AI multilingual training data sets, so you can create conversational AI systems that understand text and speech in multiple languages. For more information about Welocalize AI services, connect with us here.