Multilingual Lexical Representation: Structure-Sharing versus Micro-Features
Carole Tiberius, ITRI, University of Brighton

This paper explores a relatively new approach to lexical representation in Multilingual Natural Language Processing. Most multilingual lexicons constructed for NLP systems to date are simply monolingual lexicons with links between translation equivalents (e.g. MULTILEX 1993).

The approach described here goes further and explores the possibility of developing a multilingual lexicon in which information can be shared across (related) languages. Related languages exhibit many similarities in their syntax, morphology, phonology, etc.

Compare:

        swim    -- swam    -- swum         (English)
        schwimm -- schwamm -- geschwommen  (German)
        zwem    -- zwom    -- gezwommen    (Dutch)

Capturing such similarities can help to produce more robust, more readily maintainable and more readily extensible multilingual natural language processing systems for related languages (Cahill and Gazdar 1996). Consider the addition of new languages: by allowing information to be shared across languages, it might be possible to add a new language to a system by defining it in terms of its differences from related languages already available in the system (for example, Afrikaans by reference to Dutch).

DATR, an inheritance-based lexical knowledge representation formalism (Evans and Gazdar 1996), is used to represent the lexical information. The rationale of inheritance-based lexicons requires information to be pushed as far up the hierarchy as it can go, generalising as much as possible. In a monolingual lexicon, this means that information which applies to all words of the language appears right at the top of the hierarchy, whereas information that is common to all nouns will appear above all the individual noun entries and so on. In a multilingual lexicon, the same rationale can be extended to carry across languages: information which is common to several languages is stated at higher points in the hierarchy than that which is unique to just one of the languages.
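The rationale above can be illustrated with a minimal sketch of default inheritance, using the three verbs compared earlier. It is written in Python as an analogy, not in DATR syntax, and it is not meant to represent either of the models discussed in this paper. The node names, the attribute names (onset, coda, pres_vowel and so on) and the decomposition of the forms are assumptions made purely for illustration.

```python
class Node:
    """A lexicon node: local attribute values plus an optional parent to inherit from."""

    def __init__(self, name, parent=None, **local):
        self.name = name
        self.parent = parent
        self.local = local

    def get(self, attr):
        # Default inheritance: the most specific node that defines attr wins.
        node = self
        while node is not None:
            if attr in node.local:
                return node.local[attr]
            node = node.parent
        raise KeyError(attr)


# Hypothetical shared node: values stated once, high in the hierarchy.
SHARED = Node("SharedSwim", pres_vowel="i", past_vowel="a", part_vowel="o",
              part_prefix="ge", part_suffix="en")

# Language nodes state only what the shared node does not already provide.
GERMAN = Node("German", SHARED, onset="schw", coda="mm")
DUTCH = Node("Dutch", SHARED, onset="zw", coda="m",
             pres_vowel="e", past_vowel="o", part_coda="mm")
ENGLISH = Node("English", SHARED, onset="sw", coda="m",
               part_vowel="u", part_prefix="", part_suffix="")


def forms(node):
    """Build (present, past, participle) from inherited and local values."""
    onset, coda = node.get("onset"), node.get("coda")
    try:
        part_coda = node.get("part_coda")
    except KeyError:
        part_coda = coda  # fall back to the ordinary coda
    present = onset + node.get("pres_vowel") + coda
    past = onset + node.get("past_vowel") + coda
    participle = (node.get("part_prefix") + onset + node.get("part_vowel")
                  + part_coda + node.get("part_suffix"))
    return present, past, participle


for language in (ENGLISH, GERMAN, DUTCH):
    print(language.name, forms(language))
```

In this sketch the German node states only its consonants and inherits everything else, whereas the Dutch and English nodes override some of the shared values. Which values deserve to be stated at the shared level is exactly the kind of design question the hierarchy raises.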

Currently, it is not clear what the best way of structuring such a hierarchical multilingual lexicon is. Several models have been proposed (Evans 1996), but none of them has been tested sufficiently to support claims in favour of any one of them.

In this paper, I will explore two of these models: the structure-sharing model and the micro-features model. I will explain both models and discuss their advantages and disadvantages with reference to two sample lexical fragments of Danish, Dutch, English and Icelandic. I will conclude with some suggestions for further research.

References

Cahill, L. and G. Gazdar. 1996. 'A lexical analysis of numeral expressions in three related languages', in Proceedings of the AISB-96 Workshop on Multilinguality in the Lexicon, pp. 69-75.

Evans, R. 1996. 'Exploiting inheritance in multilingual lexicons', unpublished manuscript presented at the AISB-96 Workshop on Multilinguality in the Lexicon, Brighton, 1996.

Evans, R. and G. Gazdar. 1996. 'DATR: A Language for Lexical Knowledge Representation', in Computational Linguistics, Vol. 22-2, pp. 167-216.

MULTILEX. 1993. 'MLEX_d Standards for a Multifunctional Lexicon', Final Report, CAP GEMINI INNOVATION for the MULTILEX Consortium, Paris.

