Teemu Roos - Probabilistic models for phylogenetics and stemmatology: Theory and practice
It has long been known that textual traditions that are produced by
repeated copying with modification, as well as many other cultural
objects, evolve in ways that can be likened to biological evolution.
Hence, it is not surprising that many techniques initially developed
for building evolutionary trees (phylogenetics) can be applied to the
analysis of such cultural objects. I will discuss recent advances in
the theory and practice of phylogenetics applied to the study of
cultural evolution. In particular, I will describe a new method based
on probabilistic models such as Bayesian networks. In experiments
with artificially created textual traditions, the new method
outperforms the current state of the art in the specific task of
reconstructing copying histories, both in terms of a numerical score
and in terms of interpretability.
No prior knowledge of phylogenetics or algorithmics is assumed.
Short bio:
Teemu Roos is a Senior Researcher at the Helsinki Institute for Information
Technology HIIT and an Adjunct Professor in Computer Science at the
University of Helsinki, Finland. His research interests include
machine learning, probabilistic modelling, information theory, and
their cross-disciplinary applications.
____________________________
Michael Cysouw - Back to the roots: using regular sound correspondences for linguistic phylogeny (as one should)
Traditional historical linguistics stresses the importance of looking for regular sound correspondences for the phylogenetic reconstruction of languages. In recent computational phylogenetic work this old truism is mostly disregarded. This is unfortunate and unnecessary. I will argue that it is possible to statistically approach the regularity of sound correspondences in a straightforward way, even without the necessity of perfectly detected cognacy.
However, for finding similarities in the structure of
more peripheral classes, direct cross-linguistic identification is illegitimate.
For example, it is not possible to unequivocally decide on structural grounds whether
German schauen (auf + ACC) and Lithuanian žiūrėti (į + ACC), both meaning
‘look at’, should be treated as belonging to similar classes in the two
languages (for genetically closely related languages a possible approach would
be to check whether the verbs use coding devices that are cognate, but this is
not a very useful approach even for large genetic groupings, let alone for
genetically unrelated languages). Yet, there is a need to quantitatively grasp
the intuitive idea that some pairs of languages are closer to each other in
terms of their systems of verb classes than others. The basic claim in this
study is that such an assessment can be based on properties of the groups that the verbs
fall into (e.g. one can check to what extent the set of verbal meanings that require
auf + ACC in German overlaps with the group of verbs that take į + ACC in
Lithuanian). In my talk, I am going to discuss entropy-based measures (mutual
information and predictability) that can be used to measure (dis)similarities
between languages in this respect. It will be shown that the results obtained
with the help of these techniques are in some respects different from results
arrived at with the help of simpler methods based on transitivity alone.
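
The entropy-based comparison described above can be illustrated with a minimal sketch. It assumes that each language is represented as a list of coding-frame labels, one per questionnaire meaning, in the same order. The toy labels, the function names, and the definition of "predictability" as mutual information normalised by the entropy of the predicted language are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned label lists."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def predictability(xs, ys):
    """Assumed definition: share of the uncertainty about ys that is
    removed by knowing xs, i.e. I(X;Y) / H(Y)."""
    h = entropy(ys)
    return mutual_information(xs, ys) / h if h > 0 else 1.0


# Hypothetical toy data: the coding frame used for the same four meanings.
german = ["TR", "auf+ACC", "auf+ACC", "TR"]
lithuanian = ["TR", "į+ACC", "į+ACC", "GEN"]

print(mutual_information(german, lithuanian))
print(predictability(lithuanian, german))  # how well Lithuanian predicts German
print(predictability(german, lithuanian))  # how well German predicts Lithuanian
```

Because only the grouping of meanings matters, the labels themselves never have to be identified across languages; the measure is also asymmetric once normalised, which a plain overlap count would not show.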
____________________________
Sergej Saj - Two-place verb classes: towards measuring (dis)similarity between the languages of Europe
The study is a part of a project devoted to the study of
valency classes in the languages of Europe. It is based on a questionnaire
consisting of 130 polyvalent predicates. These predicates were chosen with the
help of a pilot study, so that most of the predicates chosen are not uniformly
transitive across languages but rather often fall into one of the smaller two-place
valency classes. I will concentrate on the problem of measuring (dis)similarity
between languages based on the data obtained for a small (currently consisting
of 15) but ever-growing sample of genetically diverse languages of Europe.
In some respects the data obtained for these languages can
be compared directly. The simplest kind of typology is based on binary features,
such as whether or not a particular meaning, e.g. ‘wait’, is expressed by
a transitive verb. The legitimacy of this operation rests upon typological
assumptions about cross-linguistic validity of the notion of transitivity.
Generalizing these binary results, one may arrive at a very simple measure
that allows one to calculate the overall transitivity profile for various languages
(for the verbs chosen, the transitivity ratio shows a high degree of
variability, with SAE languages being much more transitive than peripheral
European languages).
Likewise, for comparing the sets of transitive and
intransitive verbs (not their sizes) the usual techniques (Hamming distance
etc.) are appropriate. This approach allows one to build Neighbor Nets and
similar visualizations so that one grasps both genetic (e.g. Lithuanian is very
similar to Latvian) and areal (Basque is quite close to French) similarities.
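
As a rough illustration of the two simple measures mentioned above, the following sketch computes a transitivity ratio per language and pairwise Hamming distances between binary transitivity profiles. The language names and values are placeholders, not data from the study, and the function names are hypothetical.

```python
from itertools import combinations


def transitivity_ratio(profile):
    """Share of questionnaire meanings expressed by a transitive verb."""
    return sum(profile) / len(profile)


def hamming_distance(a, b):
    """Number of meanings on which two binary profiles disagree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same meanings")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles):
    """Pairwise Hamming distances, keyed by (language, language) pairs."""
    return {
        (p, q): hamming_distance(profiles[p], profiles[q])
        for p, q in combinations(sorted(profiles), 2)
    }


# Placeholder data: 1 = the meaning is expressed by a transitive verb.
profiles = {
    "Language A": [1, 1, 0, 1, 0],
    "Language B": [1, 1, 0, 0, 0],
    "Language C": [0, 0, 1, 0, 1],
}

for name, profile in profiles.items():
    print(name, transitivity_ratio(profile))
print(distance_matrix(profiles))
```

A distance matrix of this kind is the usual input for network visualizations such as Neighbor Nets; the ratio alone compares only the sizes of the transitive sets, whereas the Hamming distance compares their membership.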
____________________________
Jamie Tehrani - Phylomemetics in Anthropology
Anthropologists have become increasingly interested in the application of
phylogenetic methods to study “descent with modification” in
cultural traditions. Folk tales, weaving styles, pottery techniques,
etc. are handed down from one generation to another and gradually
evolve into new forms through the accumulation of copying errors and
innovations. These processes have clear parallels in other fields
where phylogenetics has been successfully applied, namely
evolutionary biology, historical linguistics and stemmatology.
However, it is important to note that traditional textiles,
folktales, etc. are rarely copied from a single model, but are
compiled from many sources. Moreover, the objects of these traditions
are not “copied” in a literal sense, but are reconstructed from
observation and memory, which can make learning very different from
replication. These points have important implications for how we
approach, code and analyse folk tradition data, which I discuss using
examples from my own work and that of other anthropologists.
____________________________
Gerold Schneider - Syntactic parsing as a phylogenetic task
The sequence-based character of natural language has been
described by Sinclair's idiom principle (Sinclair 1991), Hunston and Francis'
pattern grammar (Hunston and Francis 2000), and by Hoey's lexical priming
theory (Hoey 2005). Syntactic rules and sequence preferences (often called
collocations) work in close cooperation with each other. The blind application
of syntactic rules typically leads to dozens of syntactically correct analyses
for real-world sentences, although typically all except for one are
semantically implausible. Bi-lexically or tri-lexically conditioned
collocational preference statistics (Collins 1999, Nivre 2006, Schneider 2008)
are needed to rank and prune these analyses, calculating the probability of a
certain syntactic relation (such as object) given the lemma of the governor and
the lemma of the dependent in a dependency representation. For example, the
sequence verb-noun is licensed to attach as an object or adjunct relation
according to the grammar. Given the governor lemma "eat" and the
dependent lemma "pizza", the probability for an object relation is
very high, while for the governor lemma "eat" and the dependent lemma
"Friday", the probability for an adjunct relation is high.
More recent approaches such as data-oriented parsing (Bod et
al. 2003) condition on larger context than just the governor and the dependent.
Parsing can be seen as subtree-mapping. The most closely related gold standard
subtree from the manually annotated training resource is used to deliver the
analysis of the candidate sequence (be it a sentence, clause, or chunk) at
parse time, as far as sparse data allows. While word-sequences and genetic
sequences may not have much in common, mapping candidate sequences to training
sequences could be seen as a phylogenetic task. The candidate sequence is seen
as a genetically mutated version of the gold standard sequence. Finding the
most closely related sequence efficiently at parse-time is vital.
Even if sequences are short, the sparse data problem is
enormous, due to Zipf's law. It says that most word types are rare, and the
combination of rare events is exponentially rarer. Even in the case of a local
bi- and tri-lexical preference for a dependency relation, the majority of the
counts needed for disambiguating between the various candidates at parse-time
are null counts, and we need to back off to semantic classes or less lexical
information. In terms of genetic mutations, we can only compare sequences with
enormous distances between them.
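
The bi-lexical preference statistics and the back-off step described in this abstract can be sketched as follows. This is an illustrative relative-frequency estimate with a simple three-level back-off (lemma pair, then semantic class of the dependent, then governor only); the training triples, the semantic classes, and the function name are hypothetical and are not taken from the cited parsers.

```python
def backoff_probability(relation, governor, dependent, triples, semantic_class):
    """Estimate P(relation | governor, dependent) by relative frequency,
    backing off to coarser contexts when the finer one was never observed.

    triples: list of (relation, governor_lemma, dependent_lemma) observations.
    semantic_class: dict mapping a lemma to a coarse class (assumed resource).
    Returns (level_used, probability), or ("unseen", None) if nothing matches.
    """
    levels = [
        ("bilexical", lambda g, d: (g, d)),
        ("class", lambda g, d: (g, semantic_class.get(d, d))),
        ("governor", lambda g, d: g),
    ]
    for name, key in levels:
        target = key(governor, dependent)
        matching = [r for r, g, d in triples if key(g, d) == target]
        if matching:  # non-null count at this level
            return name, matching.count(relation) / len(matching)
    return "unseen", None


# Hypothetical training observations and semantic classes.
triples = [
    ("obj", "eat", "pizza"),
    ("obj", "eat", "pizza"),
    ("obj", "eat", "pasta"),
    ("adjunct", "eat", "Friday"),
]
semantic_class = {
    "pizza": "food", "pasta": "food",
    "Friday": "weekday", "Tuesday": "weekday",
}

print(backoff_probability("obj", "eat", "pizza", triples, semantic_class))
print(backoff_probability("adjunct", "eat", "Tuesday", triples, semantic_class))
print(backoff_probability("obj", "eat", "quickly", triples, semantic_class))
```

The second call shows the sparse-data case: the pair "eat" and "Tuesday" has a null count, so the estimate comes from the semantic class shared with "Friday".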
____________________________
Steven Moran & Johann-Mattis List - A Python Toolkit for Quantitative Tasks in Historical-Comparative Linguistics
The use of numerous quantitative methods in historical
linguistics, often inspired by algorithms in information theory and
evolutionary biology, has led to a situation in which there are many different
tools for the comparative analysis of linguistic data. These tools, however,
are typically incompatible with each other. For example, the STARLING software
package (cf. Starostin 2000) is database software that provides
lexicostatistical and glottochronological analyses, the calculation of family
trees, and some rudimentary routines for automatic cognate detection. However,
there are also software packages and programs for phonetic alignment analyses,
such as the Rug/L04 software (Kleiweg 2009) and the ALINE algorithm (Kondrak
2000). Other software tools are described in the literature, but not made
publicly available (Downey et al. 2008). Additionally there are algorithms that
are not yet implemented but show promise (Covington 1996). Finally, there are
various software packages for evolutionary biology that are often used in
linguistic applications, such as MrBayes (Ronquist & Huelsenbeck 2003),
Phylip (Felsenstein 2005) and SplitsTree (Huson 1998), but they have not yet
been satisfactorily updated to handle the intricacies of linguistic data. This
myriad of software puts a particular burden on linguists: those who want to
analyze their data in more than just one way and compare their results have to
convert their data into many different formats and they have to be familiar
with many different kinds of software (and each software package has its
own idiosyncrasies as well). As a result, errors can accumulate
during data transformation, and the output of different tools becomes
harder to compare, since the tools differ not only in their data formats
but also in the content of the data they require (some of the tools
have only a limited application range). The expenditure of time required for
such research can be enormous.
Our goal is to overcome these problems by developing a
Python library for quantitative tasks in historical-comparative linguistics
that unifies existing methods within a single open source framework, offers
easy routines to convert linguistic data into the formats needed for
third-party software, and provides a forum to publish new and innovative
quantitative methods in historical linguistics.
____________________________
Balthasar Bickel - Exploring similarities: phylogenetic methods beyond phylogeny
(will follow)
____________________________
Harald Hammarström - An Algorithm for Isogloss-Compatible Historical Reconstruction (Poster Session)
Orthodox theory in linguistics (Ross 1997 inter alia) holds that the only valid criterion for positing a subgroup is exclusive shared innovations. Yet modern phylogenetic inference algorithms provide a tree output without any explicit reference to exclusive shared innovations in their calculations, and it remains unclear to what extent this criterion is implicitly modeled. In fact, empirically, modern phylogenetic methods tend to find series of binary branchings where linguists, based on exclusive shared innovations, have higher-order branchings (e.g., the Indo-European tree of Bouckaert et al. 2012). We will present an algorithm whose input is a matrix of languages x features and which infers a tree-model subgrouping based on shared innovations. The algorithm closely models the intuitions in orthodox comparative linguistics and is similar to, but not identical to, Maximum Parsimony. Tested on Chapacuran, Quechuan and Indo-European datasets, it does indeed find amounts of non-binary branching that are realistic from the point of view of traditional linguistic analysis.
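
The abstract does not spell out the algorithm, so the following is not the author's method. It is only a minimal sketch of the general idea of reading subgroups off shared innovations, using a standard character-compatibility check: two innovation sets are compatible with a tree if one contains the other or if they are disjoint. The matrix, the language names, and the crude choice to drop every set involved in a conflict are assumptions made for illustration.

```python
def innovation_sets(matrix, languages):
    """For each feature (column), collect the languages showing the innovation.

    matrix[i][j] == 1 means language i shows innovation j. Sets covering a
    single language or all languages carry no subgrouping information.
    """
    groups = set()
    for j in range(len(matrix[0])):
        group = frozenset(
            lang for lang, row in zip(languages, matrix) if row[j] == 1
        )
        if 1 < len(group) < len(languages):
            groups.add(group)
    return groups


def compatible(a, b):
    """Two sets fit on one tree if they are nested or disjoint."""
    return a <= b or b <= a or not (a & b)


def subgroups(matrix, languages):
    """Innovation-defined subgroups that conflict with no other such set
    (crude assumption: conflicting sets are simply discarded)."""
    groups = innovation_sets(matrix, languages)
    kept = [g for g in groups if all(compatible(g, h) for h in groups)]
    return sorted(kept, key=lambda g: (-len(g), sorted(g)))


# Hypothetical data: five languages (rows) and four innovations (columns).
languages = ["A", "B", "C", "D", "E"]
matrix = [
    [1, 0, 1, 1],  # A
    [1, 0, 0, 1],  # B
    [1, 0, 0, 1],  # C
    [0, 1, 0, 0],  # D
    [0, 1, 0, 0],  # E
]

for group in subgroups(matrix, languages):
    print(sorted(group))
```

In this toy case no innovation separates A, B and C further, so the subgroup stays a three-way branching instead of being forced into a series of binary splits.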