Abstract
Through the alignment of definitions from
two or more different sources, it is
possible to retrieve pairs of words that can
be used interchangeably in the same
sentence without changing the meaning of
the concept. As lexicographic work
exploits common defining schemes, such
as genus and differentia, a concept is
similarly defined by different dictionaries.
The difference in words used between two
lexicographic sources lets us extend the
lexical knowledge base, so that clustering
is available through merging two or more
dictionaries into a single database and
then using an appropriate alignment
technique. Since alignment starts from the
same entry of two dictionaries, clustering
is faster than any other technique.
The algorithm introduced here is
analogy-based and starts by calculating the
Levenshtein distance, a form of edit
distance, which allows us to align
the definitions. As a measure of similarity,
the concept of longest collocation couple
is introduced, which is the basis of
clustering similar words. The process
iterates, replacing similar pairs of words
in the definitions until no new clusters are
found.
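
The alignment step described above can be illustrated with a minimal sketch. It assumes that the Levenshtein distance is computed over words rather than characters, and that substituted words in the resulting alignment are the candidate pairs; both are assumptions made for illustration, as are the function name `align_definitions` and the sample definitions. The longest collocation couple measure and the iterative replacement step are not defined in the abstract, so they are not reproduced here.

```python
def align_definitions(def_a, def_b):
    """Align two definitions word by word (hypothetical helper).

    Returns (distance, pairs): the word-level Levenshtein distance and
    the word pairs that the alignment marks as substitutions.
    """
    a, b = def_a.lower().split(), def_b.lower().split()
    n, m = len(a), len(b)

    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a word from a
                dist[i][j - 1] + 1,         # insert a word from b
                dist[i - 1][j - 1] + cost,  # match or substitute
            )

    # Walk back through the table to recover the substituted word pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if a[i - 1] == b[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost:
                pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


# Two invented definitions of the same entry, as if from different dictionaries.
result = align_definitions(
    "a large domesticated animal used for riding",
    "a big domesticated animal used for riding",
)
print(result)  # (1, [('large', 'big')])
```

In this sketch, the pair ("large", "big") is the kind of candidate that a later clustering step would group together.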
Original language | English |
---|---|
Title of host publication | Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000) |
Place of Publication | New Brunswick |
Publisher | Association for Computational Linguistics |
Pages | 795-801 |
Volume | 2 |
ISBN (Print) | 1-55860-717-X |
Publication status | Published - 2000 |
Keywords
- clustering
- alignment of definitions
- computational linguistics
- natural language processing