08 June 2011
Carnegie Mellon University
Jaime Carbonell

As machine learning gains prominence, we must address the issue of efficiently providing learning algorithms with the most useful labeled training instances. Active learning seeks to select near-optimal training sequences in an online manner. However, traditional active learning makes restrictive assumptions, such as the existence of a single omniscient labeling oracle. Instead, we investigate more realistic settings, such as the presence of multiple potentially fallible or reluctant external information sources with variable cost and unknown reliability. Proactive learning reaches out to these sources, learning their properties (e.g., accuracy, cost, area of expertise) and jointly selecting the most informative instances and sources for the learning task at hand. The proactive sampling methods trade off cost vs. information value and amortized benefit vs. immediate reward, while remaining largely agnostic to the base-level learning algorithms. We have applied these methods to synthetic data and benchmark test data, and most recently are applying them to new challenges such as low-resource machine translation, inferring the human interactome (and host-pathogen interactomes), and large-scale wind-energy optimization. The talk will focus primarily on the underlying methods, but will also touch on these applications.
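One way to picture the cost-vs-information-value trade-off described above is a selection rule that scores every (instance, oracle) pair by expected label value per unit cost. The sketch below is an illustrative simplification, not the method presented in the talk: it assumes label value is measured by predictive entropy, that each oracle's reliability and cost are already estimated, and all names (`select_query`, `entropy`) are hypothetical.

```python
import math

def entropy(p):
    """Binary entropy of a predicted positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_query(instances, oracles):
    """Pick the (instance, oracle) pair maximizing information value per unit cost.

    instances: list of (id, predicted_prob) pairs from the current model.
    oracles:   list of (name, reliability, cost) triples, where reliability is
               the estimated probability that the oracle labels correctly.
    """
    best, best_score = None, -1.0
    for inst_id, prob in instances:
        info = entropy(prob)  # uncertainty = information value of obtaining a label
        for name, reliability, cost in oracles:
            # Discount value by oracle reliability, amortize by its cost.
            score = info * reliability / cost
            if score > best_score:
                best, best_score = (inst_id, name), score
    return best, best_score

# Example: a maximally uncertain instance queried via a cheap but noisy oracle
# can beat an expensive expert on value-per-cost.
instances = [("x1", 0.5), ("x2", 0.9)]
oracles = [("expert", 0.95, 5.0), ("novice", 0.7, 1.0)]
choice, score = select_query(instances, oracles)
```

In this toy setting, `x1` (entropy 1.0) queried through the cheap novice oracle wins, since 1.0 × 0.7 / 1.0 exceeds the expert's 1.0 × 0.95 / 5.0. A fuller treatment would also update reliability estimates online and account for amortized benefit over future queries.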