TY - CONF
T1 - The Statistical Approximation Hypothesis
T2 - Digital Literary Stylistics Special Interest Group Workshop at the Digital Humanities 2023 conference
AU - Nini, Andrea
PY - 2023/7/10
Y1 - 2023/7/10
N2 - The reason for the effectiveness of function word frequency in stylometry is still a mystery. Although some notable attempts have been made in the past to shed some light on this, like Kestemont (2014) or McMenamin (2002), these explanations are still rather unsatisfactory. For example, although one could argue that the frequency of a function word like "the" is part of the style of an author, it is still unclear why this is the case. In addition, any stylometric method based on frequency of single words is similarly implausible scientifically because the basic unit of language processing is not the word but something larger and more similar to a phrase (Sinclair 2004; Christiansen & Chater 2016; Langacker 1987). In this paper I introduce a Theory of Linguistic Individuality based on Cognitive Linguistics that can explain the effectiveness of function word frequencies with the Statistical Approximation Hypothesis: the frequency of function words is a statistical approximation to the distribution of a person’s unique repository of basic linguistic units. This repository is idiosyncratic because of the processes of entrenchment (Schmid 2015) and chunking (Gobet et al. 2001). Despite their simplicity, methods compatible with this Theory, such as n-gram tracing (Grieve et al. 2019), outperform Delta as well as more sophisticated machine learning methods in tests on the English refcor corpus (Jannidis et al. 2015) and the c50 corpus (Houvardas & Stamatatos 2006). This finding is interpreted as indirect evidence in favour of the proposed hypothesis. References: Christiansen, Morten H. & Nick Chater. 2016. The Now-or-Never bottleneck: A fundamental constraint on language. Behavioral and Brain Sciences. Cambridge University Press 39. e62. https://doi.org/10.1017/S0140525X1500031X. Gobet, Fernand, Peter C.R. Lane, Steve Croker, Peter C-H. Cheng, Gary Jones, Iain Oliver & Julian M Pine. 2001. Chunking mechanisms in human learning. Trends in Cognitive Sciences 5(6). 236–243. https://doi.org/10.1016/S1364-6613(00)01662-4. Grieve, Jack, Emily Chiang, Isobelle Clarke, Hannah Gideon, Annina Heini, Andrea Nini & Emily Waibel. 2019. Attributing the Bixby Letter using n-gram tracing. Digital Scholarship in the Humanities 34(3). 493–512. Houvardas, John & Efstathios Stamatatos. 2006. N-gram feature selection for authorship identification. In Jérôme Euzenat & John Domingue (eds.), Artificial Intelligence: Methodology, Systems, and Applications, 77–86. Bulgaria. https://doi.org/10.1007/11861461_10. Jannidis, Fotis, Steffen Pielström, Christof Schöch & Thorsten Vitt. 2015. Improving Burrows’ Delta - An empirical evaluation of text distance measures. In Digital Humanities Conference 2015. Sydney, Australia: Alliance of Digital Humanities Organizations. Kestemont, Mike. 2014. Function Words in Authorship Attribution. From Black Magic to Theory? In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLfL) @ EACL 2014, 59–66. Gothenburg, Sweden: Association for Computational Linguistics. (11 April, 2020). Langacker, Ronald W. 1987. Foundations of cognitive grammar. Vol. 1. Stanford, CA: Stanford University Press. McMenamin, Gerald R. 2002. Forensic Linguistics: Advances in Forensic Stylistics. Boca Raton, Fla: CRC Press. Schmid, Hans-Jörg. 2015. A blueprint of the Entrenchment-and-Conventionalization Model. Yearbook of the German Cognitive Linguistics Association 3(1). 3–25. https://doi.org/10.1515/gcla-2015-0002. Sinclair, John. 2004. Trust the Text: Language, Corpus and Discourse. London: Routledge.
AB - The reason for the effectiveness of function word frequency in stylometry is still a mystery. Although some notable attempts have been made in the past to shed some light on this, like Kestemont (2014) or McMenamin (2002), these explanations are still rather unsatisfactory. For example, although one could argue that the frequency of a function word like "the" is part of the style of an author, it is still unclear why this is the case. In addition, any stylometric method based on frequency of single words is similarly implausible scientifically because the basic unit of language processing is not the word but something larger and more similar to a phrase (Sinclair 2004; Christiansen & Chater 2016; Langacker 1987). In this paper I introduce a Theory of Linguistic Individuality based on Cognitive Linguistics that can explain the effectiveness of function word frequencies with the Statistical Approximation Hypothesis: the frequency of function words is a statistical approximation to the distribution of a person’s unique repository of basic linguistic units. This repository is idiosyncratic because of the processes of entrenchment (Schmid 2015) and chunking (Gobet et al. 2001). Despite their simplicity, methods compatible with this Theory, such as n-gram tracing (Grieve et al. 2019), outperform Delta as well as more sophisticated machine learning methods in tests on the English refcor corpus (Jannidis et al. 2015) and the c50 corpus (Houvardas & Stamatatos 2006). This finding is interpreted as indirect evidence in favour of the proposed hypothesis. References: Christiansen, Morten H. & Nick Chater. 2016. The Now-or-Never bottleneck: A fundamental constraint on language. Behavioral and Brain Sciences. Cambridge University Press 39. e62. https://doi.org/10.1017/S0140525X1500031X. Gobet, Fernand, Peter C.R. Lane, Steve Croker, Peter C-H. Cheng, Gary Jones, Iain Oliver & Julian M Pine. 2001. Chunking mechanisms in human learning. Trends in Cognitive Sciences 5(6). 236–243. https://doi.org/10.1016/S1364-6613(00)01662-4. Grieve, Jack, Emily Chiang, Isobelle Clarke, Hannah Gideon, Annina Heini, Andrea Nini & Emily Waibel. 2019. Attributing the Bixby Letter using n-gram tracing. Digital Scholarship in the Humanities 34(3). 493–512. Houvardas, John & Efstathios Stamatatos. 2006. N-gram feature selection for authorship identification. In Jérôme Euzenat & John Domingue (eds.), Artificial Intelligence: Methodology, Systems, and Applications, 77–86. Bulgaria. https://doi.org/10.1007/11861461_10. Jannidis, Fotis, Steffen Pielström, Christof Schöch & Thorsten Vitt. 2015. Improving Burrows’ Delta - An empirical evaluation of text distance measures. In Digital Humanities Conference 2015. Sydney, Australia: Alliance of Digital Humanities Organizations. Kestemont, Mike. 2014. Function Words in Authorship Attribution. From Black Magic to Theory? In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLfL) @ EACL 2014, 59–66. Gothenburg, Sweden: Association for Computational Linguistics. (11 April, 2020). Langacker, Ronald W. 1987. Foundations of cognitive grammar. Vol. 1. Stanford, CA: Stanford University Press. McMenamin, Gerald R. 2002. Forensic Linguistics: Advances in Forensic Stylistics. Boca Raton, Fla: CRC Press. Schmid, Hans-Jörg. 2015. A blueprint of the Entrenchment-and-Conventionalization Model. Yearbook of the German Cognitive Linguistics Association 3(1). 3–25. https://doi.org/10.1515/gcla-2015-0002. Sinclair, John. 2004. Trust the Text: Language, Corpus and Discourse. London: Routledge.
U2 - 10.5281/zenodo.8136362
DO - 10.5281/zenodo.8136362
M3 - Abstract
Y2 - 10 July 2023 through 10 July 2023
ER -