Position: PhD Candidate
Current Institution: Johns Hopkins University
Abstract: What Kind of Language Is Hard to Language-Model?
“What Kind of Language Is Hard to Language-Model?” (Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner; ACL 2019) How language-agnostic are current state-of-the-art NLP tools? Are some types of language easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling and observed that recurrent neural network language models do not perform equally well even over the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology might be the primary culprit for the discrepancy. In this paper we extend those earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora; in other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that “translationese” is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity, and instead reveal far simpler statistics of the data that seem to drive difficulty in this much larger sample.
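The multiplicative mixed-effects idea can be caricatured as a two-way model in log space: each sentence gets its own effect, each language gets a difficulty coefficient, their product models the per-sentence cost, and untranslated (missing) sentences simply drop out of the fit. The toy sketch below is an assumed simplification, not the paper's actual estimator (which includes noise terms and a proper statistical fitting procedure); the cost numbers and the alternating-averages fit are purely illustrative.

```python
# Toy sketch (NOT the paper's estimator): model the cost of sentence i in
# language j as y[i, j] ~ exp(n_i + d_j), i.e. a sentence-specific effect
# times a multiplicative language "difficulty" factor. In log space this is
# additive, so we can fit it with alternating averages, and missing
# (untranslated) cells are simply skipped.
import math

# Hypothetical per-sentence bits under each language's LM; None = missing.
costs = {
    "en": [100.0, 80.0, 120.0],
    "de": [110.0, 88.0, None],   # third verse untranslated in this toy corpus
    "fi": [130.0, 104.0, 156.0],
}
langs = list(costs)
n_sent = 3

# Observed log-costs, indexed by (language, sentence); missing cells omitted.
log_cost = {(j, i): math.log(c)
            for j in langs for i, c in enumerate(costs[j]) if c is not None}

sent = [0.0] * n_sent           # n_i: sentence effects
diff = {j: 0.0 for j in langs}  # d_j: language difficulty (log scale)

for _ in range(200):  # alternating coordinate updates until convergence
    for i in range(n_sent):
        obs = [log_cost[j, i] - diff[j] for j in langs if (j, i) in log_cost]
        sent[i] = sum(obs) / len(obs)
    for j in langs:
        obs = [log_cost[j, i] - sent[i]
               for i in range(n_sent) if (j, i) in log_cost]
        diff[j] = sum(obs) / len(obs)

# Identifiability: center the difficulties so they are relative coefficients.
mean_d = sum(diff.values()) / len(diff)
diff = {j: d - mean_d for j, d in diff.items()}

hardest = max(diff, key=diff.get)
print({j: round(d, 3) for j, d in diff.items()}, "hardest:", hardest)
```

Because the toy data were generated with German costing 1.1x and Finnish 1.3x the English cost per sentence, the fit recovers Finnish as the "hardest" language even though one German cell is missing, illustrating how a paired model can use incomplete parallel data.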
Sabrina is a PhD student at Johns Hopkins University researching open-vocabulary language modeling for unit discovery. Pivoting from formal language theory, her PhD work has led to publications at ACL, NAACL, EMNLP, LREC, and AAAI on morphology, fair language model comparison, stochastic romanization (at Google AI), and metacognition and calibration for chatbots (at Facebook AI Research). Beyond that, she has reviewed for NeurIPS, ICML, ICLR, and a number of workshops, and co-organized the SIGMORPHON shared tasks in 2018, 2019, 2020, and 2021 as well as the SIGTYP shared tasks in 2020 and 2021. She currently serves as a co-chair of the Widening NLP (WiNLP) initiative and is involved in the BigScience summer of large language models workshop through her most recent part-time research internship at HuggingFace. Sabrina has TA'd a variety of computer science classes in Germany and the US and is currently teaching JHU's Artificial Intelligence class for the second time.