Workshop 1: Computer-Assisted Language Processing
Friday 18th September 2015
University of Sussex, Jubilee Building, Room G22
Programme
Registration & coffee (9.00-9.30)
Session 1 (9.30-10.45)
Susan Fitzmaurice: Introduction to Linguistic DNA
Research Associates & HRI: Resources, progress, problems and queries
Discussion
Coffee break (10.45-11.15)
Session 2 (11.15-12.30)
Diana McCarthy (University of Cambridge): Inducing and contrasting word meanings from different sources
Kathryn Allan (UCL): ‘Degrees of lexicalization’ in the Historical Thesaurus of the OED
Discussion
Lunch (12.30-1.30)
Session 3 (1.30-3.15)
Gabriel Egan (De Montfort University): Instructive failures in authorship attribution by shared phrases in large textual corpora
Dirk Geeraerts (KU Leuven): Quantitative corpus approaches to lexical and conceptual variation I
Dirk Speelman (KU Leuven): Quantitative corpus approaches to lexical and conceptual variation II
Discussion
Coffee break (3.15-3.30)
Session 4 (3.30-5.00)
Panel discussion: Dawn Archer (Manchester Metropolitan University), Scott Gibbens (Jisc Historical Texts), David Weir (University of Sussex), Pip Willcox (Bodleian Libraries)
Close (5.00)
Abstracts
‘“Degrees of lexicalization” in the Historical Thesaurus of the OED’ by Kathryn Allan
One of the most intriguing issues raised by the Historical Thesaurus of the Oxford English Dictionary (HTOED), which will be addressed by the ‘Linguistic DNA of Modern Western Thought’ project, is the significance of vocabulary size. Why are some semantic fields very densely populated in comparison to others, and why are concepts lexicalised to differing degrees across time? For some concepts, such as those in fields like Food and Colour, there are obvious answers. There are no terms for ‘potato’ attested earlier than the late sixteenth century because it is not native to Britain and was only brought to the country then, and many terms from the late eighteenth century onwards show the increasing numbers of varieties that have become familiar to speakers in modern times. Similarly, the rise in non-basic colour terms from the early Modern English period onwards corresponds to the technological changes that enabled the production of dyes, leading to sophisticated methods of creating and recreating precisely differentiated shades (discussed by Carole Biggam and Laura Wright, for example). This example seems to provide fairly clear evidence to support the view suggested in the preface of HTOED that in some cases the ‘degree of lexicalization [of a category] reflect[s] its considerable degree of importance to speakers of the language’. However, in other cases, including many abstract categories, the relationship between semantic field and conceptual domain is much less straightforward, and the emergence of a high number of new terms has no obvious external-world trigger. For example, HTOED records a spike of new terms for ‘sweet (in taste)’ between 1400 and 1700, including several variant forms with a common derivation such as douce, dulcet, dulce, dulcid, dulcorous and dulceous. Some of these are only attested a small number of times, and none replaces the basic term sweet; their appearance is most readily explained as the result of shifts in stylistic norms combined with greater receptivity to Latinate vocabulary in this period. This paper considers some of the difficulties that emerge when considering the degree of lexicalisation of different concepts, and especially the complications that emerge from the data itself.
‘Instructive failures in authorship attribution by shared phrases in large textual corpora’ by Gabriel Egan
We can learn much from the mistakes made in recent authorship attribution endeavours that hunt for phrases shared between a work of unknown or contested authorship and the works in large textual corpora available to us digitally. Investigators have long known to suspect our intuitive conviction that a series of apparently unusual phrases cannot be shared between two works merely by chance (in fact they can), and have long acknowledged that a series of ‘negative checks’ is needed to be sure that linguistic constructions that seem rare really are rare. Although these pitfalls are well known, spectacular errors have been made recently because i) the methods of searching corpora are fallible in ways unforeseen by the investigator, ii) textual corpora are not necessarily as complete as investigators believe, iii) it is shockingly easy to introduce methodological bias into the experiments, and iv) it is easy to misunderstand and/or misrepresent the statistical significance of particular findings. This talk will discuss what went wrong in a series of recent investigations and draw lessons from them. Chiefly, the finding is that the principles that are supposed to prevail in scientific experiments should also govern work in our field. Our datasets and software source code should be made publicly available so that anyone may replicate our investigations. And our statistical methods should be subject to proper critique by professional statisticians. These rather dry findings will, it is hoped, be leavened by the telling of some amusing stories about what happens when authorship attribution goes wrong.
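For readers unfamiliar with this kind of study, the short Python sketch below illustrates the general idea of a shared-phrase search with a rudimentary ‘negative check’: an n-gram shared by the disputed text and one candidate only counts if it is absent from every other candidate's corpus. It is not the method of any investigation discussed in the talk; the texts, author names and n-gram length are invented, and real studies would also need statistical testing against chance expectation.

# A minimal sketch, not any investigator's actual method: count word n-grams
# shared exclusively between a disputed text and one candidate's corpus.
from collections import defaultdict

def ngrams(text, n=4):
    """Return the set of word n-grams in a lower-cased, whitespace-tokenised text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def exclusive_shared_phrases(disputed, candidate_corpora, n=4):
    """Count n-grams found in the disputed text and in exactly one candidate's corpus."""
    disputed_grams = ngrams(disputed, n)
    candidate_grams = {name: ngrams(" ".join(texts), n)
                       for name, texts in candidate_corpora.items()}
    counts = defaultdict(int)
    for gram in disputed_grams:
        holders = [name for name, grams in candidate_grams.items() if gram in grams]
        if len(holders) == 1:  # crude 'negative check': unique to one candidate
            counts[holders[0]] += 1
    return dict(counts)

# Hypothetical miniature example; the corpora and attribution are invented.
corpora = {"Author A": ["the quality of mercy is not strained it droppeth as the gentle rain"],
           "Author B": ["now is the winter of our discontent made glorious summer"]}
print(exclusive_shared_phrases("it droppeth as the gentle rain from heaven", corpora))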
‘Inducing and Contrasting Word Meanings from Different Sources’ by Diana McCarthy
In Computational Linguistics, work on representing lexical meaning previously focused on manually created inventories. There is, however, now a large body of work that builds models of word meaning directly from corpus data. This has the advantage that one does not rely on advance knowledge of the relevant meanings; instead the knowledge emerges from the data. This provides more scope for contrasting the meanings induced from different sources, where the sources could differ with respect to, for example, textual domain, genre or time. In this talk I will outline some approaches for inducing word meanings and describe work, conducted with collaborators at the University of Melbourne, to induce and compare word meanings from different sources using topic models. I'll particularly focus on our work using these models to detect novel word senses in diachronic corpora. One key drawback with automatic word sense induction is the requirement for a large amount of data for training the models. Since corpora of a sufficient size have only been available in the last twenty years or so, this limits application of these techniques to word meaning change attested within that period and to the types of corpora available. I will also therefore describe some corpus linguistics work, conducted with Sketch Engine, for the National Ecosystem Assessment. In this work we used Sketch Engine to contrast usages of lexemes pertaining to the environment from different sources (academic, government and public). I'll discuss the pros and cons of the different approaches, which can, of course, be complementary to one another.
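As a rough illustration of topic-model word sense induction (not the system described in the talk), the Python sketch below treats each context window around a target word as a ‘document’, fits an LDA model over the usages, and compares the averaged topic (sense) distributions of two corpus slices; a sense prominent in the later slice but rare in the earlier one would be a candidate novel sense. The corpus slices, the number of topics and the gensim settings are all assumptions.

# A minimal sketch, assuming gensim is installed; data and parameters are invented.
from gensim import corpora, models

def induce_senses(context_windows, num_topics=3):
    """Fit an LDA model over tokenised context windows of one target word."""
    dictionary = corpora.Dictionary(context_windows)
    bows = [dictionary.doc2bow(window) for window in context_windows]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary,
                          passes=10, random_state=0)
    return lda, dictionary

def sense_distribution(lda, dictionary, context_windows, num_topics=3):
    """Average topic (sense) weights over all usages in one corpus slice."""
    totals = [0.0] * num_topics
    for window in context_windows:
        for topic_id, weight in lda.get_document_topics(
                dictionary.doc2bow(window), minimum_probability=0.0):
            totals[topic_id] += weight
    return [t / len(context_windows) for t in totals]

# Hypothetical usage with two tiny diachronic slices for the target word 'tape'.
early = [["magnetic", "tape", "recording"], ["tape", "reel", "audio"]]
late = [["sticky", "tape", "parcel"], ["tape", "adhesive", "wrap"]]
lda, d = induce_senses(early + late)
print(sense_distribution(lda, d, early), sense_distribution(lda, d, late))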
‘Quantitative corpus approaches to lexical and conceptual variation’ by Dirk Geeraerts and Dirk Speelman
In this talk, we intend to present an overview of various types of corpus-based variation studies that we have been conducting in our research team Quantitative Lexicology and Variationist Linguistics and that we believe could be interesting for the 'Linguistic DNA' project. Specifically, we will introduce the distinction between formal and conceptual onomasiological variation, with a further distinction between direct and indirect approaches to the latter, and suggest that a formal onomasiological and an indirect conceptual onomasiological perspective could be the most relevant ones for the 'Linguistic DNA' project. We will illustrate these perspectives, with a methodological focus on the diagnostic concept of ‘onomasiological profile’ and the use of semantic vector spaces.
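As a rough illustration of the ‘onomasiological profile’ idea, the Python sketch below turns raw counts of competing terms for one concept into relative frequencies and compares two sources with a simple profile overlap measure. The concept, terms, counts and the choice of overlap measure are invented for illustration and are not taken from the speakers' studies.

# A minimal sketch: compare onomasiological profiles of one concept in two sources.
def profile(counts):
    """Turn raw term counts for one concept into relative frequencies."""
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def overlap(profile_a, profile_b):
    """Profile overlap: summed minimum relative frequency per term (1.0 = identical)."""
    terms = set(profile_a) | set(profile_b)
    return sum(min(profile_a.get(t, 0.0), profile_b.get(t, 0.0)) for t in terms)

# Hypothetical counts for the concept 'trousers' in two regional subcorpora.
netherlandic = {"broek": 90, "pantalon": 10}
belgian = {"broek": 60, "pantalon": 40}
print(round(overlap(profile(netherlandic), profile(belgian)), 2))  # 0.7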