Past projects

Analysis of Verbal Interaction and Automated Text Retrieval

AVIATOR is an automated system for identifying new words and new uses of existing words, and so for monitoring the changing state of the language in text. The software is designed as a series of filters through which the textual data 'flows' at regular intervals, thus providing a diachronic view of new linguistic events. It comprises four filters:

Filter 1: Identifies and categorises new word forms (neologisms)
Filter 2: Identifies new word pairs or terms
Filter 3: Identifies words which have changed in meaning
Filter 4: Tracks the frequency of each type across time, with a view to establishing the core, unchanging words of the language.
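
As a rough illustration of the filter sequence above, the following Python sketch passes chronologically ordered chunks of text through simplified versions of Filters 1, 2 and 4. It is not the AVIATOR software: the function names, the whitespace tokeniser and the sample chunks are assumptions made for illustration, "word pairs" are taken to be adjacent words only, and Filter 3 (change in meaning) is omitted because it needs collocational analysis beyond a short example.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    stripped = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in stripped if w]

def run_filters(chunks):
    """Process text chunks in chronological order.

    For each chunk, report the word forms and adjacent word pairs that
    have not occurred in any earlier chunk, plus the chunk's word counts.
    """
    seen_words, seen_pairs = set(), set()
    report = []
    for text in chunks:
        tokens = tokenize(text)
        pairs = set(zip(tokens, tokens[1:]))
        report.append({
            "new_words": sorted(set(tokens) - seen_words),   # Filter 1 (simplified)
            "new_pairs": sorted(pairs - seen_pairs),         # Filter 2 (simplified)
            "freq": Counter(tokens),                         # input to Filter 4
        })
        seen_words.update(tokens)
        seen_pairs.update(pairs)
    return report

def frequency_series(report, word):
    """Frequency of one word across successive chunks (Filter 4, simplified)."""
    return [entry["freq"][word] for entry in report]

# Hypothetical sample data, oldest chunk first.
chunks = [
    "the green party met",
    "the green vote grew",
    "the eco-chic vote grew",
]
report = run_filters(chunks)
print(report[2]["new_words"])            # ['eco-chic']
print(frequency_series(report, "vote"))  # [0, 1, 1]
```

Keeping a running set of everything seen so far is what gives the diachronic view: each chunk is judged only against the text that preceded it.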

Project funded by the Department of Trade and Industry from 1990 to 1993. The results of this project informed the ACRONYM and APRIL projects (below).

Automatic Collocational Retrieval of 'Nyms'

An automated system to identify semantically related pairs of words (or 'nyms'), based on the similarity of their collocational environments. The ACRONYM thesaural facility identifies the conventional sense relations of synonymy (e.g. doctor and medic), antonymy (e.g. luxury and no-frills), hyponymy (e.g. doctor and paediatrician), meronymy (e.g. car and engine), and less well-established semantic relations characteristic of textual usage. The lexical realisations of the sense relations are very often different from those found in a representation of the mental lexicon such as Roget's Thesaurus, but they reflect the thesaurus as it is used in text.
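
The idea of comparing collocational environments can be sketched in Python as follows. This is a minimal illustration, not the ACRONYM system: the window size, the use of cosine similarity and the toy sentence are all assumptions, and the sketch only scores how alike two words' contexts are without distinguishing synonymy from the other sense relations.

```python
from collections import Counter
import math

def collocate_profile(tokens, target, window=2):
    """Count the words found within `window` positions of each `target`."""
    profile = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Collect the neighbours, skipping the target position itself.
            profile.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return profile

def cosine(p, q):
    """Cosine similarity between two collocate profiles (0.0 if either is empty)."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus, already tokenised.
tokens = ("the doctor examined the patient . the medic examined the patient . "
          "the car needs a new engine").split()

doctor = collocate_profile(tokens, "doctor")
medic = collocate_profile(tokens, "medic")
car = collocate_profile(tokens, "car")

# 'doctor' and 'medic' share more of their environment than 'doctor' and 'car'.
print(cosine(doctor, medic) > cosine(doctor, car))  # True
```

On a real corpus the profiles would need to exclude or down-weight very frequent function words, which dominate the counts in this toy example.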

See "The ACRONYM Project: Discovering the textual thesaurus" by A. Renouf for a description of the system and its results.

Project funded by the EPSRC from 1994 to 1997.

Summary Extraction Algorithm Generated Using Lexical Links

SEAGULL is an automated summarisation system which produces cohesive summaries (abridgements) of texts by extracting key topic-bearing sentences. The summaries are short and are designed to express the essence of a text, in terms of its conceptual content and the development of its topic. The system exploits the patterns of lexical repetition across a text by finding links and bonds between its sentences.
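
A minimal Python sketch of sentence extraction by links and bonds follows. It assumes that a link is a content word shared by two sentences and that a bond is a pair of sentences with at least three links; the stopword list, the threshold, the function names and the sample sentences are illustrative assumptions rather than details of the SEAGULL system, and only exact word-form repetition is counted.

```python
from itertools import combinations

# Small illustrative stopword list (an assumption, not SEAGULL's).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "will", "this", "say"}

def content_words(sentence):
    """Set of lowercased non-stopword forms in a sentence."""
    words = (w.strip(".,;:!?\"'()").lower() for w in sentence.split())
    return {w for w in words if w and w not in STOPWORDS}

def bonded_summary(sentences, link_threshold=3, top_n=2):
    """Return the `top_n` sentences with the most bonds, in text order."""
    words = [content_words(s) for s in sentences]
    bonds = [0] * len(sentences)
    for i, j in combinations(range(len(sentences)), 2):
        links = len(words[i] & words[j])   # shared content words
        if links >= link_threshold:        # enough links form a bond
            bonds[i] += 1
            bonds[j] += 1
    # Rank by bond count, breaking ties in favour of earlier sentences.
    ranked = sorted(range(len(sentences)), key=lambda k: (-bonds[k], k))[:top_n]
    return [sentences[k] for k in sorted(ranked) if bonds[k] > 0]

# Hypothetical sample text.
sentences = [
    "The council approved the new flood defence plan.",
    "Residents welcomed the sunny weather this week.",
    "The flood defence plan will protect the town centre.",
    "Critics say the council plan ignores the town centre.",
]
for line in bonded_summary(sentences):
    print(line)
```

Here the second sentence shares no vocabulary with the others, forms no bonds, and is left out of the extract.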

See "Textual Distraction as a Basis for Evaluating Automatic Summarisers" by A. Renouf & A. Kehoe for further details and an evaluation.

The concepts of links and bonds were also used in the SHARES project to measure document similarity.

Analysis and Prediction of Innovation in the Lexicon

This work is concerned with the development of a system for the semi-automatic classification of rare words in journalistic text, over a period of years, with a view to extrapolating from the resultant analysis and predicting some aspects of the future structure of the language.

As with other Unit research, APRIL findings serve a dual purpose: a linguistic role in informing descriptions of the nature of rare words and their patterns of productivity, and an IT role in assisting in the refinement of indexes to large textual database systems.

The rare words of the lexicon constitute 50% of the types (different words) in any database, yet they are routinely ignored in the management of databases. They are statistically insignificant, and the received wisdom is that they are a miscellany of typographical errors and ephemera that will not yield much informational benefit as retrieval mechanisms in database search.

However, this is not so. These singletons, or hapax legomena, which trickle into and out of the language, are an intrinsic part of its fabric. They form classes at and below the level of the word. In terms of word formation, for instance, they are primarily compounds (like eco-chic) and derivations (like cosmopolitanising). In terms of derivation, there is a clear ranking in the morphemes and classes of morphemes chosen. Grammatical trends are apparent.
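
To make the notion of hapax legomena concrete, the following Python sketch finds the words that occur exactly once in a token list and reports their share of the types. The function names, the sample sentence and, in particular, the crude surface test for compounds and derivations (a hyphen, or one of a few hand-picked suffixes) are assumptions for illustration only; APRIL's semi-automatic classification is not reproduced here.

```python
from collections import Counter

# A few illustrative derivational suffixes (an assumption, far from complete).
SUFFIXES = ("ising", "isation", "ness", "able")

def hapax_legomena(tokens):
    """Words that occur exactly once in the token list, sorted."""
    freq = Counter(tokens)
    return sorted(w for w, n in freq.items() if n == 1)

def hapax_share(tokens):
    """Proportion of types (distinct words) that are hapax legomena."""
    types = set(tokens)
    return len(hapax_legomena(tokens)) / len(types) if types else 0.0

def rough_class(word):
    """Very rough surface guess at how a word was formed."""
    if "-" in word:
        return "compound"
    if word.endswith(SUFFIXES):
        return "derivation"
    return "other"

# Hypothetical sample text.
tokens = "the eco-chic look is the look of the season".split()
print(hapax_legomena(tokens))             # ['eco-chic', 'is', 'of', 'season']
print(round(hapax_share(tokens), 2))      # 0.67
print(rough_class("eco-chic"))            # compound
print(rough_class("cosmopolitanising"))   # derivation
```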

The study brings fascinating insights into the nature of productivity in the language.

See "A Finer Definition of Neology in English: the life-cycle of a word" by A. Renouf for further details.

Project funded by the EPSRC from 1997 to 2000.

System of Hypermatrix Analysis, Retrieval, Evaluation and Summarisation

Informed by the SEAGULL project, SHARES is an automated system for the retrieval of similar documents. The background of this research lies in an investigation into improved methods of identifying related texts in a collection, given one or more exemplar texts. More specifically, it is the examination of the hypothesis that similar patterns of lexical repetition are sufficiently maintained across differently authored documents on similar topics to support a high-performance retrieval engine.

We have developed an intertextual mechanism for the identification and ranking of documents in terms of their relatedness to one or more exemplar texts. The SHARES approach is novel in taking the degree of Lexical Cohesion (Hoey, 1991) between texts as the primary criterion for document similarity. It uses a novel hypermatrix structure, which identifies links between repeated words, and bonds between closely linked sentences, across texts. Links and bonds will be strong between texts which are similar in content, and weak or non-existent between dissimilar texts.
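
The ranking of documents by links and bonds across texts can be sketched in Python as below. This is an illustration under stated assumptions, not the SHARES implementation: a link is taken to be a content word shared by a sentence in each text, a bond is a sentence pair with at least three links, sentences are split naively on full stops, and a plain nested loop stands in for the hypermatrix structure. The function names, stopword list and sample texts are hypothetical.

```python
# Small illustrative stopword list (an assumption).
STOPWORDS = {"the", "a", "an", "of", "by", "from", "to", "in", "and"}

def sentence_word_sets(text):
    """Split on full stops and return one content-word set per sentence."""
    sets = []
    for sentence in text.split("."):
        words = {w.strip(",;:!?\"'()").lower() for w in sentence.split()}
        words = {w for w in words if w and w not in STOPWORDS}
        if words:
            sets.append(words)
    return sets

def bond_count(text_a, text_b, link_threshold=3):
    """Number of cross-text sentence pairs sharing at least `link_threshold` words."""
    sents_a = sentence_word_sets(text_a)
    sents_b = sentence_word_sets(text_b)
    return sum(1 for a in sents_a for b in sents_b if len(a & b) >= link_threshold)

def rank_by_similarity(exemplar, candidates):
    """Rank candidate documents (name -> text) by bonds with the exemplar."""
    scores = [(name, bond_count(exemplar, text)) for name, text in candidates.items()]
    return sorted(scores, key=lambda item: (-item[1], item[0]))

# Hypothetical exemplar and candidate documents.
exemplar = ("Flood defence plan approved by council. "
            "The plan protects the town centre from flood water.")
candidates = {
    "flood": "Council flood defence spending rises. Town centre flood plan delayed.",
    "football": "The team won the cup final. Fans filled the town centre.",
}
print(rank_by_similarity(exemplar, candidates))  # [('flood', 2), ('football', 0)]
```

The football text shares only "town" and "centre" with one exemplar sentence, which falls below the bond threshold, so it scores zero.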

Project funded by the EPSRC from 2000 to 2004.