How robust are multilingual wordlists? Implications for language trees and colexification data

Event: How Robust Are Multilingual Wordlists? Implications for Language Trees and Colexification Data

Date: Friday 26 June 2026

Time: 12:00-13:30

Venue: Online via Zoom

The Linguistic Circle Seminar will be delivered by Mr David Snee (PhD Candidate, University of Passau, Chair of Multilingual Computational Linguistics).

Abstract
Multilingual wordlists are datasets which provide translations for the same set of concepts across a number of languages. These wordlists are widely used to compare languages and reconstruct their evolutionary histories, yet we know surprisingly little about how early decisions in the preparation of these datasets influence the results of later analyses.

This talk first examines how different researchers translate the same concepts when they compile multilingual wordlists. In a sample of 10 datasets covering 9 language families, we find that only about 83% of concept translations result in the same word form, and only 23% use identical phonetic transcriptions.
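
To make the comparison concrete, here is a minimal sketch of how agreement between two wordlists could be measured. It is not the speaker's method: the function name `agreement_rate`, the data layout, and the toy German entries are all assumptions made for illustration, and the resulting numbers have no relation to the figures reported above.

```python
# Hypothetical toy data: (language, concept) -> (word, phonetic transcription).
# These entries are invented for illustration and come from no real dataset.
dataset_a = {
    ("German", "HAND"): ("Hand", "hant"),
    ("German", "TREE"): ("Baum", "baum"),
    ("German", "DOG"): ("Hund", "hʊnt"),
}
dataset_b = {
    ("German", "HAND"): ("Hand", "hant"),
    ("German", "TREE"): ("Baum", "baʊm"),
    ("German", "DOG"): ("Köter", "køːtɐ"),
}


def agreement_rate(first, second, field):
    """Share of shared (language, concept) slots whose chosen field is identical.

    field=0 compares the word, field=1 compares the transcription.
    Returns None when the two datasets share no slots.
    """
    shared = first.keys() & second.keys()
    if not shared:
        return None
    matches = sum(1 for key in shared if first[key][field] == second[key][field])
    return matches / len(shared)


word_agreement = agreement_rate(dataset_a, dataset_b, field=0)
transcription_agreement = agreement_rate(dataset_a, dataset_b, field=1)
print(f"same word: {word_agreement:.2f}, same transcription: {transcription_agreement:.2f}")
```

In this toy case the two compilers pick the same word more often than they write it the same way, which mirrors the direction of the gap described above.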

In a second study using 7 datasets from 5 language families, we test how these differences affect computationally inferred trees that model evolutionary relationships between languages using Bayesian statistics.

The results show that variation in concept translation can noticeably change the resulting trees, suggesting alternative evolutionary relationships between languages. Additional analyses suggest that these differences mainly impact smaller subgroups, while the major branches of language families typically remain stable.

A closer look at the Indo-European and Tupian datasets illustrates, with examples, how concept translation variation can influence relationships within these language trees. Finally, the talk will discuss our work on the robustness of colexification data in the CLICS4 database.

A colexification occurs when the same word expresses more than one concept. For example, the German word Bank can refer to both a BENCH and a BANK. Our work aims to make targeted analyses in this area easier by mapping the topology of the CLICS4 data more effectively.
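
As a rough illustration of the definition, the sketch below finds colexifications in a tiny wordlist by grouping concepts under each (language, form) pair. The rows and the helper name `find_colexifications` are assumptions for illustration only. They are not taken from CLICS4 and do not reflect how that database is built.

```python
from collections import defaultdict

# Toy wordlist rows: (language, concept, form). Illustrative only.
rows = [
    ("German", "BENCH", "Bank"),
    ("German", "BANK", "Bank"),
    ("German", "TREE", "Baum"),
    ("English", "BENCH", "bench"),
    ("English", "BANK", "bank"),
]


def find_colexifications(wordlist_rows):
    """Return {(language, form): [concepts]} for forms that express 2+ concepts."""
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist_rows:
        concepts_by_form[(language, form)].add(concept)
    return {
        key: sorted(concepts)
        for key, concepts in concepts_by_form.items()
        if len(concepts) > 1
    }


print(find_colexifications(rows))
```

Because the grouping relies on exact string matches, any variation in how a form is written or transcribed changes which colexifications are detected, which is one reason robustness matters for this kind of data.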

Ultimately, the talk highlights the need to explicitly address data robustness in computational historical linguistics.