Abstract
In many real applications of text mining, information retrieval and natural language processing, large-scale feature sets are frequently used, which often make the employed machine learning algorithms intractable and lead to the well-known "curse of dimensionality". Aiming not only to remove redundant information from the original features but also to improve their discriminating ability, we present a novel approach to the supervised generation of low-dimensional, proximity-based graph embeddings to facilitate multi-label classification. The optimal embeddings are computed from a supervised adjacency graph, called the multi-label graph, which simultaneously preserves proximity structures between samples constructed from feature and multi-label class information. We propose different ways to obtain this multi-label graph, working either in a binary label space or in a projected real label space. To reduce the training cost of the dimensionality reduction procedure caused by large-scale features, a smaller set of relation features between each sample and a set of representative prototypes is employed. The effectiveness of our proposed method is demonstrated on two document collections for text categorization based on the "bag of words" model.
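As a rough illustration of the kind of supervised, proximity-based graph embedding the abstract describes (not the authors' exact construction), the following Python sketch builds a combined feature/label adjacency graph and computes a Laplacian-eigenmaps-style embedding. The kNN feature graph, the cosine label similarity, and the mixing parameter `alpha` are assumptions made here for illustration only.

```python
import numpy as np
from scipy.linalg import eigh


def multilabel_graph_embedding(X, Y, n_components=2, k=5, alpha=0.5):
    """Sketch of a supervised, proximity-based graph embedding.

    X     : (n_samples, n_features) feature matrix (e.g. bag-of-words counts)
    Y     : (n_samples, n_labels) binary multi-label indicator matrix
    alpha : hypothetical weight balancing feature vs. label proximity
    """
    n = X.shape[0]

    # Feature-based proximity: symmetric k-nearest-neighbour graph on cosine similarity.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S_feat = Xn @ Xn.T
    W_feat = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(-S_feat[i])[1:k + 1]      # skip the sample itself
        W_feat[i, idx] = S_feat[i, idx]
    W_feat = np.maximum(W_feat, W_feat.T)          # symmetrise the graph

    # Label-based proximity: cosine similarity of binary label vectors.
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
    W_label = Yn @ Yn.T
    np.fill_diagonal(W_label, 0.0)

    # Combined "multi-label graph" adjacency.
    W = alpha * W_feat + (1.0 - alpha) * W_label

    # Laplacian-eigenmaps-style embedding: smallest non-trivial generalised eigenvectors.
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = eigh(L, D)                  # generalised symmetric eigenproblem
    return eigvecs[:, 1:n_components + 1]          # drop the constant eigenvector

# Toy usage: 6 documents, 4 "words", 3 labels.
X = np.random.RandomState(0).rand(6, 4)
Y = (np.random.RandomState(1).rand(6, 3) > 0.5).astype(float)
Z = multilabel_graph_embedding(X, Y, n_components=2, k=2)
print(Z.shape)  # (6, 2)
```

In the same spirit as the paper's prototype-based relation features, `X` here could be replaced by a smaller matrix of similarities between each document and a set of representative prototypes, which would reduce the cost of the eigendecomposition step.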
Original language | English |
---|---|
Title of host publication | KDIR 2010 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval |
Pages | 74-84 |
Number of pages | 10 |
Publication status | Published - 2010 |
Event | International Conference on Knowledge Discovery and Information Retrieval, KDIR 2010 - Valencia. Duration: 1 Jul 2010 → … |
Conference
Conference | International Conference on Knowledge Discovery and Information Retrieval, KDIR 2010 |
---|---|
City | Valencia |
Period | 1/07/10 → … |
Keywords
- Adjacency graph
- Dimensionality reduction
- Embedding
- Multi-label classification
- Supervised