The Taub Faculty of Computer Science Events and Talks
Sarai Duek (M.Sc. Thesis Seminar)
Wednesday, 17.08.2016, 12:00
Advisor: Prof. Shaul Markovitch
Text categorization is the task of labeling documents with predefined categories. The main approach to text categorization is learning from labeled examples. For many tasks, labeled examples may be difficult to find in one language but easy to find in others. The problem of learning from examples in one or more languages and classifying (categorizing) documents in another is called cross-language learning.
Existing approaches to this problem rely on translating the texts from all languages into a single language. Such approaches suffer from the known weaknesses of machine translation, which is imprecise and inherently limited when facing more complex texts that require human world knowledge and inference abilities.
In this work we present a novel approach to the general cross-language text categorization problem. Our method uses a hierarchical language-independent ontology and a group of semantic interpreters that map texts in each of the involved languages to a Language-Independent Space (LIS) spanned by the ontology. Given a set of labeled examples in multiple languages, we apply the semantic interpreters to generate features in the LIS and learn a language-independent classifier over them. When categorizing a new document, we first map the document to the LIS and then apply the language-independent classifier. Our methodology addresses the most general cross-lingual text categorization setting: it can learn from any mix of languages and classify documents in any other language.
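The pipeline described above (interpret each text into the LIS, learn a classifier there, then classify a document from another language) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the method presented in the talk: the ontology is a flat list of two concepts rather than a hierarchy, the interpreters are hypothetical word-to-concept lookup tables, and a nearest-centroid rule stands in for whatever classifier is actually used.

```python
from collections import Counter

# Hypothetical toy ontology: a flat list of language-independent concepts
# (the ontology in the abstract is hierarchical; this is a simplification).
CONCEPTS = ["SPORT", "FINANCE"]

# Hypothetical semantic interpreters: one word-to-concept table per language.
INTERPRETERS = {
    "en": {"football": "SPORT", "goal": "SPORT", "bank": "FINANCE", "stock": "FINANCE"},
    "es": {"partido": "SPORT", "gol": "SPORT", "banco": "FINANCE", "bolsa": "FINANCE"},
    "de": {"spiel": "SPORT", "tor": "SPORT", "bank": "FINANCE", "aktie": "FINANCE"},
}


def to_lis(text, lang):
    """Map a document to a vector of concept counts in the LIS."""
    lexicon = INTERPRETERS[lang]
    counts = Counter(lexicon[w] for w in text.lower().split() if w in lexicon)
    return [counts[c] for c in CONCEPTS]


def train(examples):
    """Compute one LIS centroid per label from (text, lang, label) triples."""
    sums, sizes = {}, Counter()
    for text, lang, label in examples:
        acc = sums.setdefault(label, [0.0] * len(CONCEPTS))
        for i, value in enumerate(to_lis(text, lang)):
            acc[i] += value
        sizes[label] += 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}


def classify(centroids, text, lang):
    """Return the label whose centroid is closest to the document in the LIS."""
    vec = to_lis(text, lang)

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Train on English and Spanish examples only.
model = train([
    ("football goal", "en", "sports"),
    ("gol partido", "es", "sports"),
    ("bank stock", "en", "finance"),
    ("banco bolsa", "es", "finance"),
])

# Classify German documents, a language absent from the training set.
print(classify(model, "Spiel Tor", "de"))   # prints "sports"
print(classify(model, "Aktie Bank", "de"))  # prints "finance"
```

Because the classifier only ever sees concept vectors, adding a language in this sketch amounts to adding one more interpreter table; the trained centroids are left unchanged.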