XTM Advanced Text Aligner

Grace Cowan

Updated on March 16, 2023

Introduction

Alignment became a classic topic in computational linguistics following the publication of the seminal paper by Gale & Church in 1993. Although the paper is very interesting, it assumes that the source and target texts correspond at least at the paragraph level, if not at the sentence level.

The real world

In reality, the input material for alignment projects is often quite different from what the theoretical work of Gale & Church assumes. Typically you will encounter the following issues:

The source and target documents are not the same version:
- Additional text may have been added to the source document
- Text may have been removed or changed
- Text may have been reordered and moved around in the source document

In an unconstrained translation environment (i.e. where a CAT tool has not been used) the translator can make arbitrary decisions regarding the translation:
- Sentences may be omitted because the translator regards them as superfluous or unwarranted in the target language
- Multiple sentences may be rendered as one sentence
- A single sentence may be translated as multiple sentences
- Complete paragraphs may be ‘reworked’ by the translator to provide a clearer interpretation in the target language
- The translator may decide to completely rework the structure of the original document, especially if this is an unconstrained translation of a Microsoft Word document

In a real-world environment it is not possible to rely on the classical approach: the documents may be too different to provide a basis for Gale & Church style alignment. In my experience this is the case in 90% of real-world alignment projects: someone has the translation and a version of the source, but the two do not correspond.

The XTM Aligner approach

The only really viable solution to the problem of alignment lies in the use of dictionaries. It has recently become possible to source commercial Big Data lexicons covering a very large number of languages. Nevertheless, it is not possible to rely on standard lexicons alone. Translators often use synonyms that do not correspond directly to the source term. In addition, there is the problem of grammatical ‘function’ words, which lexicons do not necessarily cover adequately.

Lexicons provide a very effective way of assessing the viability of matching individual segments of text, and also of identifying text that is out of sequence between source and target elements. The newly released XTM Aligner creates a ‘skeleton’ overlay of the text and assesses the degree of change between the source and target versions of the document. It then uses this overlay to commence the alignment of each individual segment, depending on its viability: the segment may be new or modified, or it may have been moved within the document.
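The idea of using a lexicon to judge whether two segments match, and to spot segments that have moved, can be illustrated with a minimal sketch. This is not the XTM Aligner's actual algorithm, which is not described in detail here. The toy lexicon `LEXICON`, the helper names `lexical_score` and `best_matches`, and the 0.5 threshold are all assumptions made purely for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy English-to-French lexicon; a real one would be far larger.
LEXICON: Dict[str, Set[str]] = {
    "cat": {"chat"},
    "black": {"noir", "noire"},
    "dog": {"chien"},
    "sleeps": {"dort"},
    "barks": {"aboie"},
}

PUNCTUATION = ".,;:!?"


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into words, dropping punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def lexical_score(source: str, target: str) -> float:
    """Fraction of lexicon-covered source words with a translation in target.

    Words missing from the lexicon (for example function words such as
    'the') are ignored rather than counted against the match.
    """
    target_words = set(tokenize(target))
    covered = [w for w in tokenize(source) if w in LEXICON]
    if not covered:
        return 0.0
    hits = sum(1 for w in covered if LEXICON[w] & target_words)
    return hits / len(covered)


def best_matches(
    sources: List[str], targets: List[str], threshold: float = 0.5
) -> List[Tuple[int, Optional[int], float]]:
    """For each source segment, find the best-scoring target segment.

    Returns (source_index, target_index, score) tuples. target_index is
    None when no target reaches the threshold (the segment may be new or
    removed); an index different from source_index suggests moved text.
    """
    results: List[Tuple[int, Optional[int], float]] = []
    for i, src in enumerate(sources):
        scores = [lexical_score(src, tgt) for tgt in targets]
        if not scores:
            results.append((i, None, 0.0))
            continue
        j = max(range(len(targets)), key=scores.__getitem__)
        if scores[j] < threshold:
            results.append((i, None, 0.0))
        else:
            results.append((i, j, scores[j]))
    return results
```

Because every source segment is scored against every target segment, a sentence that was moved elsewhere in the translation can still be paired up, at the cost of quadratic work. A production aligner would need a much richer lexicon and some handling of synonyms.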

Languages and scripts

The XTM Aligner is designed to work with all scripts and languages, including Cyrillic, Greek, Hebrew, Arabic, Chinese, Japanese, Korean and Devanagari. It can currently align in any direction between the following languages:

AF – Afrikaans
AR – Arabic
AZ – Azerbaijani
BE – Belarusian
BG – Bulgarian
BN – Bengali
CA – Catalan
CS – Czech
DA – Danish
DE – German
EL – Greek
EN – English
ES – Spanish
ET – Estonian
FA – Farsi
FI – Finnish
FR – French
HE – Hebrew
HI – Hindi
HR – Croatian/Bosnian/Serbian
HU – Hungarian
HY – Armenian
ID – Bahasa (Indonesian)
IS – Icelandic
IT – Italian
JA – Japanese
KA – Georgian
KO – Korean
LT – Lithuanian
LV – Latvian
MK – Macedonian
ML – Maltese
MS – Bahasa (Malaysian)
NL – Dutch
NO – Norwegian
PL – Polish
PT – Portuguese
RO – Romanian
RU – Russian
SK – Slovak
SL – Slovenian
SQ – Albanian
SV – Swedish
SW – Swahili (Kiswahili)
TH – Thai
TL – Tagalog
TR – Turkish
UK – Ukrainian
VI – Vietnamese
ZH – Chinese

Excel output

The XTM Aligner creates an output file in Microsoft Excel format and uses colour to provide hints as to the viability of each match. A probability score is also provided, on the standard scale from ‘0’ (no probability) to ‘1’ (total probability).
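As a rough illustration of how a probability score could be turned into a colour hint, consider the sketch below. The colour names and the 0.9 and 0.5 cut-offs are assumptions chosen for the example; the actual colours and thresholds used by the XTM Aligner are not stated here.

```python
def colour_hint(probability: float) -> str:
    """Map a match probability in [0, 1] to an illustrative colour band."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if probability >= 0.9:   # assumed cut-off for a high-probability match
        return "green"
    if probability >= 0.5:   # assumed cut-off for a match worth reviewing
        return "amber"
    return "red"             # unlikely match, probably needs realignment
```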

The Excel file can then be used by a linguist to check and, if necessary, correct the alignments. Why use Excel? Excel provides a very effective way of correcting misaligned segments. Individual cells can be edited or deleted, and the subsequent rows moved up or down to realign the remaining part of the document. It is quick and easy to proof and tidy up the alignment if required.
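The effect of deleting a target cell and shifting the cells below it up can be modelled in a few lines. This is only a sketch of what the linguist does by hand in Excel; the function name `delete_target_and_shift_up` and the representation of the sheet as (source, target) pairs are assumptions for illustration.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (source cell, target cell)


def delete_target_and_shift_up(rows: List[Row], index: int) -> List[Row]:
    """Remove the target cell at `index` and move later targets up one row.

    The source column is left untouched, and an empty target cell is
    appended at the bottom so both columns keep the same length.
    """
    sources = [source for source, _ in rows]
    targets = [target for _, target in rows]
    del targets[index]
    targets.append("")
    return list(zip(sources, targets))
```

For example, if a stray target sentence in the second row pushes every later translation one row down, deleting that cell brings the remaining rows back into line with their sources.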

The aligned Excel document can then be uploaded directly into XTM as is. Any source cells without corresponding target text will be ignored.

The XTM Aligner creates two Excel files:

- One with all source and target segments
- A second file, with ‘.90+’ in the name, containing only high-probability matches

The second ‘90+’ file can be used for ‘fast’ alignment, where the translator only needs to review and confirm the matches rather than work through the alignment of the two complete documents.
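The relationship between the two files can be pictured as a simple filter over the full set of aligned rows. The sketch below assumes that ‘90+’ means a probability of at least 0.9 and that rows without target text are left out; the `AlignedRow` type and the function name are hypothetical and are not part of XTM.

```python
from typing import List, NamedTuple


class AlignedRow(NamedTuple):
    source: str
    target: str
    probability: float  # 0 = no probability, 1 = total probability


def high_probability_rows(
    rows: List[AlignedRow], threshold: float = 0.9
) -> List[AlignedRow]:
    """Keep only rows that have target text and score at least `threshold`."""
    return [
        row for row in rows
        if row.target.strip() and row.probability >= threshold
    ]
```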

Conclusion

The XTM Aligner is designed to cope with the most demanding alignment projects that exist in the ‘real’ world, including all scripts and languages. It attempts to ‘salvage’ as much translation memory as possible from what may otherwise seem an impossible task.