Scalable Evaluation and Improvement of Document Set Expansion via Neural Positive-Unlabeled Learning
Alon Jacovi, Gang Niu, Yoav Goldberg, Masashi Sugiyama
Machine Learning for NLP, Long paper
Abstract:
We consider the situation in which a user has collected a small set of documents on a cohesive topic and wants to retrieve additional documents on this topic from a large collection. Information Retrieval (IR) solutions treat the document set as a query and look for similar documents in the collection. We propose to extend the IR approach by treating the problem as an instance of positive-unlabeled (PU) learning, i.e., learning binary classifiers from only positive data (the query documents) and unlabeled data (the results of the IR engine). Applying PU learning to text with large neural networks is largely unexplored. We discuss various challenges in applying PU learning to this setting and show that standard implementations of state-of-the-art PU solutions fail. We propose a solution for each challenge and validate them empirically with ablation tests. We demonstrate the effectiveness of the new method in a series of experiments on retrieving PubMed abstracts that adhere to fine-grained topics, showing improvements over the common IR solution and other baselines.
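To make the PU setup concrete, the following is a minimal sketch of one well-known PU objective, the non-negative PU risk estimator with a sigmoid loss. The abstract does not name the estimator used in the paper, so this choice is an assumption made for illustration, as are the function names, the example scores, and the class prior (the assumed fraction of true positives among the unlabeled IR results).

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid loss l(z, y) = 1 / (1 + exp(y * z)) for a label y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk for classifier scores (illustrative sketch).

    pos_scores: scores on the positive set (here, the query documents).
    unl_scores: scores on the unlabeled set (here, the IR results).
    prior:      assumed share of positives in the unlabeled set.
    """
    # Cost of scoring known positives low, weighted by the class prior.
    pos_as_pos = prior * mean([sigmoid_loss(s, +1) for s in pos_scores])
    # Cost the positives would incur if they were treated as negatives.
    pos_as_neg = prior * mean([sigmoid_loss(s, -1) for s in pos_scores])
    # Cost of treating every unlabeled document as a negative.
    unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Estimated negative-class risk; clamping at zero keeps it non-negative,
    # since a flexible model can otherwise drive this estimate below zero.
    neg_risk = max(0.0, unl_as_neg - pos_as_neg)
    return pos_as_pos + neg_risk


# Hypothetical scores for three query documents and four IR results.
risk = nn_pu_risk([2.0, 1.5, 3.0], [1.0, -2.0, -1.5, 0.5], prior=0.3)
```

In training, a quantity like this would be computed per mini-batch on the network's outputs and minimized with a gradient-based optimizer. The sketch uses plain floats and is only meant for moderate score magnitudes.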