This is one of the 52 terms in The Language of Localization, published by XML Press in 2017. The contributor for this term is Aljoscha Burchardt.
What is it?
A collection (usually electronic) of texts in two languages that can be considered translations of each other and that are aligned at the sentence or paragraph level.
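As a rough illustration of this definition, a sentence-aligned bitext can be modeled as an ordered list of source/target pairs. The sketch below is only one possible representation. The names `SegmentPair` and `make_bitext` are hypothetical, and it assumes a strict one-to-one alignment, whereas real alignments may map one sentence to several.

```python
from typing import NamedTuple


class SegmentPair(NamedTuple):
    """One aligned unit of a bitext (hypothetical helper type)."""
    source: str
    target: str


def make_bitext(source_segments, target_segments):
    """Pair up segments by position, assuming a one-to-one alignment."""
    if len(source_segments) != len(target_segments):
        raise ValueError("source and target must have the same number of segments")
    return [SegmentPair(s, t) for s, t in zip(source_segments, target_segments)]


# Illustrative sample data, not taken from any real corpus.
english = ["The cat sleeps.", "It is raining."]
german = ["Die Katze schläft.", "Es regnet."]
bitext = make_bitext(english, german)
```

Keeping the pairs in document order preserves the original text flow, which matters when the bitext is later read as a whole text rather than as isolated sentences.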
Why is it important?
A bitext is one of the most basic results of translation. It can be used in the language industry for training, revision, and quality control. Bitexts also serve as training data for statistical machine translation.
Why does a business professional need to know this?
In linguistics, a sentence is often considered a natural unit. In translation, a translated sentence pair is, therefore, also a natural unit.
From a technical point of view, bitexts are a straightforward representation of the source text and the product of translation. They can serve as an exchange or interface format between localization experts, system developers, and machines. Bitexts play a key role in training, evaluating, and improving localization technologies, such as translation memories, terminology management tools, or machine translation engines. They can also serve as a basic format for proofreading and interaction with customers, e.g., in the process of formal quality control. XLIFF is a standard format for representing bitexts in localization processes.
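To make the exchange-format idea concrete, here is a minimal sketch of reading source/target pairs from an XLIFF file with Python's standard library. It assumes XLIFF 1.2, where each pair is stored in a `trans-unit` element. The sample document and the function name `read_pairs` are made up for illustration, and inline markup inside segments is simply flattened to text.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 documents; other XLIFF versions use a different one.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Tiny illustrative XLIFF 1.2 document (invented sample content).
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.txt" source-language="en" target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="1">
        <source>Save</source>
        <target>Speichern</target>
      </trans-unit>
      <trans-unit id="2">
        <source>Cancel</source>
        <target>Abbrechen</target>
      </trans-unit>
    </body>
  </file>
</xliff>
"""


def read_pairs(xliff_text):
    """Return (source, target) tuples for every translated trans-unit."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.find("x:source", XLIFF_NS)
        target = unit.find("x:target", XLIFF_NS)
        if source is None or target is None:
            continue  # skip units that have not been translated yet
        # itertext() flattens any inline markup into plain text.
        pairs.append(("".join(source.itertext()), "".join(target.itertext())))
    return pairs


pairs = read_pairs(SAMPLE)
```

A production workflow would more likely rely on a dedicated localization toolkit. The point here is only that the file is, at its core, a list of aligned segments.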
If bitexts are used for training language technology applications, they must provide the application with all the information necessary for its intended functionality. To do this, they need to be of high quality, represent a sensible range of linguistic variation, and have a large enough vocabulary. In general, it is best to use bitexts based on literal, uncreative translations when setting up translation engines.
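Some of these properties can be checked with very simple heuristics before a bitext is used as training data. The following sketch counts the vocabulary on each side and flags pairs whose lengths differ sharply, which often hints at a misalignment or a very free translation. The tokenizer, the function names, and the `max_ratio` threshold of 3.0 are all assumptions chosen for illustration, not established values.

```python
import re


def tokenize(text):
    """Very rough word tokenizer: lowercased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def vocabulary_size(segments):
    """Number of distinct tokens across all segments of one language."""
    return len({tok for seg in segments for tok in tokenize(seg)})


def suspicious_pairs(pairs, max_ratio=3.0):
    """Flag pairs that are empty on one side or very unequal in length.

    max_ratio is an arbitrary illustrative threshold, not a recommended value.
    """
    flagged = []
    for src, tgt in pairs:
        n_src, n_tgt = len(tokenize(src)), len(tokenize(tgt))
        shorter, longer = min(n_src, n_tgt), max(n_src, n_tgt)
        if shorter == 0 or longer / shorter > max_ratio:
            flagged.append((src, tgt))
    return flagged


# Invented sample data; the last pair is deliberately a free translation.
pairs = [
    ("The cat sleeps.", "Die Katze schläft."),
    ("It is raining.", "Es regnet."),
    ("Yes.", "Ja, das ist vollkommen richtig so."),
]
source_vocab = vocabulary_size([src for src, _ in pairs])
target_vocab = vocabulary_size([tgt for _, tgt in pairs])
flagged = suspicious_pairs(pairs)
```

Checks like these say nothing about translation accuracy. They only help spot obvious problems in large collections where manual review of every pair is impractical.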
Bitexts usually present complete, ordered texts that are normally aligned at the sentence or paragraph level. This makes it possible to study meaning in stretches of text larger than a sentence, also known as discourse: how texts organize information, maintain coherence, and refer to topics both inside and outside of the current text. Such analyses can be used to improve the quality of the translation memory and, in the case of machine translation, to train the system.