IXP - Multilingual sources and datasets

Communications Mining user guide

Multilingual sources and datasets

Communications Mining™ supports multilingual sources and datasets. Models can process sources that contain several supported languages without translating them.

The languages available within multilingual sources and datasets are:

English
Dutch
French
German
Italian
Japanese
Portuguese
Spanish

If you work and do business in several languages that the platform supports, you can train on messages in those languages, rather than translating everything into a single language.
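Before choosing a multilingual setup, it can help to check which of the languages in your own data appear in the list above. The following is a minimal sketch of such a check, not part of the platform: the helper name `partition_languages` and the sample inventory are hypothetical, and the language set is copied from the list above.

```python
# Languages listed above as available for multilingual sources and datasets.
MULTILINGUAL_LANGUAGES = frozenset({
    "english", "dutch", "french", "german",
    "italian", "japanese", "portuguese", "spanish",
})


def partition_languages(languages):
    """Split language names into (listed, not_listed), preserving input order.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    listed, not_listed = [], []
    for name in languages:
        cleaned = name.strip()
        if cleaned.lower() in MULTILINGUAL_LANGUAGES:
            listed.append(cleaned)
        else:
            not_listed.append(cleaned)
    return listed, not_listed


if __name__ == "__main__":
    # Hypothetical inventory of languages found in your own messages.
    found = ["English", "german", "Polish", " Japanese "]
    listed, not_listed = partition_languages(found)
    print("In the list above:", listed)
    print("Not in the list above:", not_listed)
```

Languages that land in the second group are worth checking against the Preview list and the key considerations on this page before you start training.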

Key considerations

If a dataset is multilingual, you cannot view translations of its messages, as you can for translated datasets. As a result, you need to understand all of the languages in the dataset to train your model effectively.
Understanding multiple languages is a more complex machine-learning problem than understanding a single language. As a result, multilingual datasets may perform slightly worse than single-language datasets.
If the dataset contains languages other than the supported ones, applying the labels used for supported languages to those messages may cause confusion. Instead, annotate these messages with language-specific labels.
Note:
The platform cannot process or understand the content of unsupported languages.

Creating multilingual sources and datasets

When you create a data source or a dataset, the platform selects English by default for both.

To change the language while creating your data source or dataset, proceed as follows:

Navigate to the "Set the language, and enable translation for your source" step.
In the Language dropdown menu, select Multilingual.
Note:
- You cannot change the language after the data source or dataset is created.
- Multilingual datasets can contain sources of any language family that the platform supports.
- To learn how to create data sources and datasets, see Creating a data source and Creating a dataset.

Supported languages in Preview

Note:

We currently support a wide range of additional languages in Preview mode, as shown in the following list. Preview means that our team continues to refine them based on your usage.

Afrikaans
Albanian
Amharic
Arabic
Armenian
Assamese
Azerbaijani
Basque
Belarusian
Bengali
Bengali (Romanized)
Bosnian
Breton
Bulgarian
Burmese
Catalan
Chinese (Simplified)
Chinese (Traditional)
Croatian
Czech
Danish
Esperanto
Estonian
Filipino
Finnish
Galician
Georgian
Greek
Gujarati
Hausa
Hebrew
Hindi
Hindi (Romanized)
Hungarian
Icelandic
Indonesian
Irish
Javanese
Kannada
Kazakh
Khmer
Korean
Kurdish (Kurmanji)
Kyrgyz
Lao
Latin
Latvian
Lithuanian
Macedonian
Malagasy
Malay
Malayalam
Marathi
Mongolian
Nepali
Norwegian
Oriya
Oromo
Pashto
Persian
Polish
Punjabi
Romanian
Russian
Sanskrit
Scottish Gaelic
Serbian
Sindhi
Sinhala
Slovak
Slovenian
Somali
Sundanese
Swahili
Swedish
Swiss German
Tamil
Tamil (Romanized)
Telugu
Telugu (Romanized)
Thai
Turkish
Ukrainian
Urdu
Urdu (Romanized)
Uyghur
Uzbek
Vietnamese
Welsh
Western Frisian
Xhosa
Yiddish
