Konstantinos Kogkalidis and Gijs Wijnholds
Assessing the ability of large-scale language models to automatically acquire aspects of linguistic theory has become a prominent theme in the literature ever since the inception of BERT [3] and its many variants, largely due to their unanticipated performance. Standard practice involves attaching BERT to a shallow neural model of low parametric complexity and training the latter to detect various linguistic patterns of interest, revealing in the process the extent to which those patterns are encoded within BERT’s representations. The consensus points to BERT-like models having some capacity for syntactic understanding [7]. Their contextualized representations encode structural hierarchies [6] that can be projected into parse structures using linear [4] or hyperbolic transformations [1], from which one can even obtain an accurate reconstruction of the underlying constituency tree [8].
Despite its broadening scope, the probing literature carries a latent bias in the insights it provides, owing to its default focus on English. English, albeit boasting a rich collection of evaluation resources, is characterized by a simple grammar with relatively few complications along the syntactic and morphological axes. When it comes to syntax in particular, English lies in close proximity to a context-free language, a class characterized by its low rank in terms of formal complexity and expressive power [2]. Perhaps more importantly, several commonly used evaluation test beds, including the Penn Treebank [5], are themselves context-free, muddying the territory between probing for acquired syntactic generalization and arbitrary pattern extraction. As such, claims about the syntactic skills of language models should not be assumed to transfer freely between languages (and, in some cases, even between datasets).
In this paper, we seek to evaluate BERT in the face of patterns that go beyond context-freeness. We employ a mildly context-sensitive grammar formalism to generate complex patterns that do not naturally occur in English. We choose instead to experiment on Dutch, a language long argued to be non-context-free due to its capacity for exhibiting an arbitrary number of cross-serial dependencies. In Dutch, cross-serial dependencies arise in sentences where verbs form clusters, causing their respective dependencies with their arguments to intersect when drawn on a plane.
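For concreteness, the sketch below encodes the textbook cross-serial example "dat Jan Marie de kinderen zag helpen zwemmen" ("that Jan saw Marie help the children swim"), in which noun phrases and verbs interleave as NP1 NP2 NP3 V1 V2 V3 with NPi the subject of Vi. The Python representation and the crossing check are purely illustrative and are not part of the grammar formalism or annotation scheme used here.

```python
# Illustrative only: the canonical Dutch cross-serial pattern
# "... dat Jan Marie de kinderen zag helpen zwemmen"
# "... that Jan saw Marie help the children swim"

tokens = ["dat", "Jan", "Marie", "de", "kinderen", "zag", "helpen", "zwemmen"]

# Span annotations as (start, end) token indices, end exclusive.
noun_phrases = {"Jan": (1, 2), "Marie": (2, 3), "de kinderen": (3, 5)}
verbs = {"zag": (5, 6), "helpen": (6, 7), "zwemmen": (7, 8)}

# The verb-to-subject mapping a probe would have to recover.
subject_of = {"zag": "Jan", "helpen": "Marie", "zwemmen": "de kinderen"}

# Two arcs (i, j) and (k, l) cross if i < k < j < l (after sorting).
def crossing(a, b):
    (i, j), (k, l) = sorted([a, b])
    return i < k < j < l

arcs = [(noun_phrases[n][0], verbs[v][0]) for v, n in subject_of.items()]
assert any(crossing(a, b) for a in arcs for b in arcs if a != b)
```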
To that end, we first identify two well-studied constructions in Dutch that commonly involve cross-serial dependencies: control verb nesting and verb raising. We produce an artificial but naturalistic dataset of annotated samples for each construction; each sample contains span annotations for the verb and noun phrases occurring within it, as well as a mapping that associates each verb with its corresponding subject. We then implement a probing model intended to select a verb’s subject from a number of candidate phrases, train it on a gold-standard resource of Dutch, and apply it to our data. Our experimental results reveal rapidly declining performance in the presence of discontinuous syntax, suggesting that the Dutch models investigated do not automatically learn to resolve the complex dependencies occurring in the language. To facilitate further research on the topic, our code is publicly available online.
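As a rough sketch of what such a subject-selection probe can look like: the snippet below assumes a frozen, publicly available Dutch BERT encoder (here BERTje, GroNLP/bert-base-dutch-cased), mean-pooled span representations, and a small bilinear scorer over verb/candidate pairs. These specifics are illustrative assumptions and should not be read as the exact configuration used in our experiments.

```python
# Minimal sketch of a span-based probe that picks a verb's subject among
# candidate noun-phrase spans; architecture and encoder are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "GroNLP/bert-base-dutch-cased"  # example Dutch BERT (BERTje)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)
encoder.eval()  # the encoder stays frozen; only the probe is trained


class SubjectProbe(torch.nn.Module):
    """Scores each candidate span against the verb span with a bilinear map."""

    def __init__(self, dim: int):
        super().__init__()
        self.bilinear = torch.nn.Bilinear(dim, dim, 1)

    def forward(self, verb_vec: torch.Tensor, cand_vecs: torch.Tensor) -> torch.Tensor:
        # verb_vec: (dim,), cand_vecs: (num_candidates, dim) -> (num_candidates,) logits
        verb = verb_vec.expand_as(cand_vecs)
        return self.bilinear(verb, cand_vecs).squeeze(-1)


def span_vector(hidden: torch.Tensor, offsets, span) -> torch.Tensor:
    """Mean-pool the subword vectors whose character offsets fall inside `span`."""
    start, end = span
    idx = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end and e > s]
    return hidden[idx].mean(dim=0)


def score_candidates(sentence: str, verb_span, candidate_spans, probe: SubjectProbe):
    # Spans are (start, end) character offsets into `sentence`.
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]
    verb_vec = span_vector(hidden, offsets, verb_span)
    cand_vecs = torch.stack([span_vector(hidden, offsets, s) for s in candidate_spans])
    return probe(verb_vec, cand_vecs)


# Example instantiation: probe = SubjectProbe(encoder.config.hidden_size)
```

A probe along these lines would be trained with cross-entropy against the gold subject index on the gold-standard resource, and then evaluated without further tuning on the generated control-verb and verb-raising samples.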
References
[1] Boli Chen, Yao Fu, Guangwei Xu, Pengjun Xie, Chuanqi Tan, Mosha Chen, and Liping Jing. Probing BERT in hyperbolic spaces. In International Conference on Learning Representations, 2021.
[2] Noam Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124, 1956.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[4] John Hewitt and Christopher D. Manning. A structural probe for finding
syntax in word representations. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), pages
4129–4138, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[5] Dan Klein and Christopher D Manning. Parsing with treebank grammars:
Empirical bounds, theoretical models, and the structure of the Penn Treebank. In Proceedings of the 39th Annual Meeting of the Association for
Computational Linguistics, pages 338–345, 2001.
[6] Yongjie Lin, Yi Chern Tan, and Robert Frank. Open sesame: Getting inside BERT’s linguistic knowledge. In Proceedings of the Second BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2019.
[7] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology:
What we know about how BERT works. Transactions of the Association for
Computational Linguistics, 8:842–866, 2020.
[8] David Vilares, Michalina Strzyz, Anders Søgaard, and Carlos Gómez-Rodríguez. Parsing as pretraining. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9114–9121, 2020.