Modelling semantic change workshop


Linguistic DNA of Modern Western Thought:

Modelling concepts and semantic change in English 1500–1800





Workshop 1: Computer-Assisted Language Processing

Friday 18th September 2015
University of Sussex, Jubilee Building, Room G22



Registration & coffee (9.00-9.30)

Session 1 (9.30-10.45)
Susan Fitzmaurice: Introduction to Linguistic DNA
Research Associates & HRI: Resources, progress, problems and queries
Coffee break (10.45-11.15)

Session 2 (11.15-12.30)
Diana McCarthy (University of Cambridge): Inducing and contrasting word meanings from different sources
Kathryn Allan (UCL): ‘Degrees of lexicalization’ in the Historical Thesaurus of the OED
Lunch (12.30-1.30)

Session 3 (1.30-3.15)

Gabriel Egan (De Montfort University): Instructive failures in authorship attribution by shared phrases in large textual corpora
Dirk Geeraerts (KU Leuven): Quantitative corpus approaches to lexical and conceptual variation I
Dirk Speelman (KU Leuven): Quantitative corpus approaches to lexical and conceptual variation II
Coffee break (3.15-3.30)

Session 4 (3.30-5.00)

Panel discussion: Dawn Archer (Manchester Metropolitan University), Scott Gibbens (Jisc Historical Texts), David Weir (University of Sussex), Pip Willcox (Bodleian Libraries)

Close (5.00)




‘Degrees of lexicalization’ in the Historical Thesaurus of the OED by Kathryn Allan

One of the most intriguing issues raised by the Historical Thesaurus of the Oxford English Dictionary (HTOED), which will be addressed by the ‘Linguistic DNA of Modern Western Thought’ project, is the significance of vocabulary size. Why are some semantic fields very densely populated in comparison to others, and why are concepts lexicalised to differing degrees across time? For some concepts, such as those in the fields of Food and Colour, there are obvious answers. There are no terms for ‘potato’ attested earlier than the late sixteenth century because the plant is not native to Britain and was only introduced then, and many terms from the late eighteenth century onwards reflect the increasing numbers of varieties that have become familiar to speakers in modern times. Similarly, the rise in non-basic colour terms from the early Modern English period onwards corresponds to the technological changes that enabled the production of dyes, leading to sophisticated methods of creating and recreating precisely differentiated shades (discussed by Carole Biggam and Laura Wright, for example). This example seems to provide fairly clear evidence to support the view suggested in the preface of HTOED that in some cases the ‘degree of lexicalization [of a category] reflect[s] its considerable degree of importance to speakers of the language’. However, in other cases, including many abstract categories, the relationship between semantic field and conceptual domain is much less straightforward, and the emergence of a high number of new terms has no obvious external-world trigger. For example, HTOED records a spike of new terms for ‘sweet (in taste)’ between 1400 and 1700, including several variant forms with a common derivation such as douce, dulcet, dulce, dulcid, dulcorous and dulceous.
Some of these are only attested a small number of times, and none replaces the basic term sweet; their appearance is most readily explained as the result of shifts in stylistic norms combined with greater receptivity to Latinate vocabulary in this period. This paper considers some of the difficulties that arise in assessing the degree of lexicalisation of different concepts, and especially the complications presented by the data itself.

‘Instructive failures in authorship attribution by shared phrases in large textual corpora’ by Gabriel Egan

We can learn much from the mistakes made in recent authorship attribution endeavours that hunt for phrases shared between a work of unknown or contested authorship and the works in the large textual corpora now available to us digitally. Investigators have long known to distrust the intuitive conviction that a series of apparently unusual phrases cannot be shared between two works merely by chance (in fact they can), and have long acknowledged that a series of ‘negative checks’ is needed to be sure that linguistic constructions that seem rare really are rare. Despite awareness of these pitfalls, investigators have recently made spectacular errors because i) the methods of searching corpora are fallible in ways unforeseen by the investigator, ii) textual corpora are not necessarily as complete as investigators believe, iii) it is shockingly easy to introduce methodological bias into the experiments, and iv) it is easy to misunderstand and/or misrepresent the statistical significance of particular findings. This talk will discuss what went wrong in a series of recent investigations and draw lessons from them. Chief among these is that the principles that are supposed to prevail in scientific experiments should also govern work in our field: our datasets and software source code should be made publicly available so that anyone may replicate our investigations, and our statistical methods should be subject to proper critique by professional statisticians. These rather dry findings will, it is hoped, be leavened by the telling of some amusing stories about what happens when authorship attribution goes wrong.

‘Inducing and Contrasting Word Meanings from Different Sources’ by Diana McCarthy

In Computational Linguistics, work on representing lexical meaning previously focused on manually created inventories. There is, however, now a large body of work that builds models of word meaning directly from corpus data. This has the advantage that one does not rely on advance knowledge of the relevant meanings; instead, that knowledge emerges from the data. It also provides more scope for contrasting the meanings induced from different sources, where the sources may differ with respect to, for example, textual domain, genre or time. In this talk I will outline some approaches for inducing word meanings and describe work, conducted with collaborators at the University of Melbourne, to induce and compare word meanings from different sources using topic models. I'll particularly focus on our work using these models to detect novel word senses in diachronic corpora. One key drawback of automatic word sense induction is the requirement for a large amount of data for training the models. Since corpora of sufficient size have only become available in the last twenty years or so, this limits the application of these techniques to word meaning change attested within that period, and to the types of corpora available. I will therefore also describe some corpus linguistics work, conducted with Sketch Engine for the National Ecosystem Assessment. In this work we used Sketch Engine to contrast usages of lexemes pertaining to the environment from different sources (academic, government and public). I'll discuss the pros and cons of the different approaches, which can, of course, be complementary to one another.
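By way of illustration, the core idea of contrasting a word's usage across sources can be sketched with a simple bag-of-context-words comparison. This is a minimal distributional sketch, not the topic-model method the talk describes; the corpora and the target word here are invented toy data.

```python
from collections import Counter

def context_profile(sentences, target, window=2):
    """Count the words observed within `window` tokens of each
    occurrence of `target` across a list of sentences."""
    profile = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                profile.update(t for j, t in enumerate(tokens[lo:hi], lo)
                               if j != i)
    return profile

def cosine(p, q):
    """Cosine similarity between two count profiles (0 = disjoint contexts)."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return dot / (norm(p) * norm(q)) if p and q else 0.0

# Toy 'sources': 'mouse' in a computing corpus vs. a nature corpus.
tech = ["click the mouse button twice", "move the mouse to the icon"]
nature = ["the mouse ate the grain", "a field mouse ran past"]
sim = cosine(context_profile(tech, "mouse"),
             context_profile(nature, "mouse"))
print(sim)
```

A low similarity between the two profiles suggests the word is being used differently in the two sources; real systems work from far larger corpora, remove function words, and cluster usages into induced senses rather than comparing raw counts.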

‘Quantitative corpus approaches to lexical and conceptual variation’ by Dirk Geeraerts and Dirk Speelman

In this talk, we intend to present an overview of various types of corpus-based variation studies that we have been conducting in our research team Quantitative Lexicology and Variationist Linguistics and that we believe could be interesting for the 'Linguistic DNA' project. Specifically, we will introduce the distinction between formal and conceptual onomasiological variation, with a further distinction between direct and indirect approaches to the latter, and suggest that a formal onomasiological and an indirect conceptual onomasiological perspective could be the most relevant ones for the 'Linguistic DNA' project. We will illustrate these perspectives, with a methodological focus on the diagnostic concept of ‘onomasiological profile’ and the use of semantic vector spaces.
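The notion of an onomasiological profile can be glossed concretely: for a given concept, it is the distribution of relative frequencies over the competing terms that name it in a given source, and profiles from different sources can then be compared. The sketch below is a deliberately simplified illustration with invented toy data (the variant terms are borrowed from the ‘sweet’ example in Kathryn Allan's abstract), not the QLVL team's implementation.

```python
from collections import Counter

def onomasiological_profile(tokens, variants):
    """Relative frequency of each competing term for one concept
    within a tokenised source; returns {} if no variant occurs."""
    counts = Counter(t for t in tokens if t in variants)
    total = sum(counts.values())
    return {v: counts[v] / total for v in variants} if total else {}

# Toy data: two 'sources' naming the concept SWEET with competing terms.
variants = {"sweet", "dulcet", "douce"}
source_a = "sweet dulcet sweet sweet douce".split()
source_b = "dulcet dulcet sweet dulcet douce".split()

print(onomasiological_profile(source_a, variants))
print(onomasiological_profile(source_b, variants))
```

Here source A prefers sweet (0.6) while source B prefers dulcet (0.6); divergent profiles for the same concept are the kind of formal onomasiological variation the talk distinguishes from conceptual variation.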

