Probabilistic and Multivariate Modelling in Latin Grammar: the Participle-Auxiliary Alternation as a Case Study

  • James Brookes

Student thesis: PhD


Recent research has shown that language is sensitive to probabilities and a whole host of multivariate conditioning factors. However, most of the research in this arena centres on the grammar of English, and, as yet, there is no statistical modelling of the grammar of Latin, studies of which have to date been largely philological. The rise of advanced statistical methodologies allows us to capture the underlying structure of the rich datasets which this corpus-only language can potentially offer. This thesis intends to remedy this deficit by applying probabilistic and multivariate models to a specific case study, namely the alternation of word order in Latin participle-auxiliary clusters (PACs), which alternate between participle-auxiliary order, as in mortuus est 'dead is', and auxiliary-participle order, as in est mortuus 'is dead'. The broad research questions to be explored in this thesis are the following: (i) To what extent are probabilistic models useful for, and reflective of, syntactic variation in Latin? (ii) What are the most useful statistical models to use? (iii) What types of linguistic variables influence variation? (iv) What theoretical implications and explanations do the statistical models suggest?

Against this backdrop, a dataset of 2409 PAC observations is extracted from Late Republican texts of the first century BC. The dataset is annotated for an "information space" of thirty-three predictor variables from various levels of linguistics: text and lemma-based variability, prosody and phonology, grammar, semantics and pragmatics, and usage-based features such as frequency.

The study exploits such statistical tools as generalized linear models and multilevel generalized linear models for the regression modelling of the binary categorical outcome. However, because of potential collinearity and the large number of predictor terms, amongst other issues, the use of these models to assess the joint effect of all predictors is particularly problematic. As such, the new statistical toolkit of random forests is utilized for evaluating the relative contribution of each predictor.

Overall, it is found that Latin is indeed probabilistic in its grammar, and the conditioning factors that govern it are spread widely throughout the language space. It is also noted that probabilistic models, such as the ones used in this study, have practical applications in traditional areas of philology, including textual criticism and literary stylistics.
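As a purely illustrative aside, the kind of regression modelling of a binary outcome mentioned in the abstract can be sketched as a one-predictor logistic regression. This is a minimal sketch under stated assumptions, not the thesis's method or data: the observations are synthetic, the single binary predictor `x` is hypothetical, and the names `fit_logistic`, `TRUE_B0` and `TRUE_B1` are invented for the example. The outcome 1 stands for participle-auxiliary order and 0 for auxiliary-participle order.

```python
import math
import random


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by batch gradient ascent
    on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Synthetic data only (NOT taken from the thesis): a hypothetical binary
# predictor x that raises the log-odds of participle-auxiliary order.
random.seed(0)
TRUE_B0, TRUE_B1 = -0.5, 1.5  # assumed values, chosen for illustration
xs = [random.choice([0, 1]) for _ in range(500)]
ys = [1 if random.random() < sigmoid(TRUE_B0 + TRUE_B1 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ys)
print(f"estimated intercept {b0:.2f}, estimated slope {b1:.2f}")
print(f"P(participle-auxiliary | x=0) = {sigmoid(b0):.2f}")
print(f"P(participle-auxiliary | x=1) = {sigmoid(b0 + b1):.2f}")
```

With many correlated predictors, coefficient estimates of this kind become unstable, which is the difficulty the abstract cites as motivation for turning to random forests.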
Date of Award: 31 Dec 2014
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: David Langslow & Benedikt Szmrecsanyi


  • Latin
  • Grammatical variation
  • Statistics
  • Probability
  • Multivariate
  • Auxiliary
  • Participle
