Notes on Active Learning
Why? Rather than training on all the available knowledge, can we 1) pick just the best books, and 2) ask the right questions (i.e., ask an oracle to label selected unlabeled data points)?
Active Learning Scenarios
- Membership query synthesis
Learner may request the label for an unlabeled example in the input space, or even generate a new data point and ask for the label.
- Stream-based selective sampling
The algorithm doesn’t access all the data at once. Data points arrive one at a time, and for each one we have to decide on the spot whether we want to ask for its label or discard it.
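- Pool-based sampling
Assumes we have a small set of labeled data and a vast amount of unlabeled data. How is this different from the previous one? Here we have access to the whole pool of unlabeled data at once, so we can compare candidates against each other before choosing which ones to label, instead of deciding on each point as it arrives.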
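To make the difference between the stream-based and pool-based scenarios concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function names `stream_select` and `pool_select`, the fixed `threshold`, the batch size `k`, and the toy score (closeness to a made-up decision boundary at 0.5). A real learner would plug in one of the query strategies instead.

```python
from typing import Callable, Iterable, List, Sequence


def stream_select(stream: Iterable[float],
                  score: Callable[[float], float],
                  threshold: float) -> List[float]:
    """Stream-based: decide for each point, as it arrives, whether to query it."""
    queried = []
    for x in stream:
        # The keep/discard decision is made immediately, without seeing later points.
        if score(x) >= threshold:
            queried.append(x)
    return queried


def pool_select(pool: Sequence[float],
                score: Callable[[float], float],
                k: int) -> List[float]:
    """Pool-based: score the whole pool first, then query the k highest-scoring points."""
    return sorted(pool, key=score, reverse=True)[:k]


def toy_score(x: float) -> float:
    """Hypothetical informativeness score: larger when x is closer to an
    assumed decision boundary at 0.5."""
    return 1.0 - abs(x - 0.5) * 2.0


points = [0.1, 0.45, 0.9, 0.55, 0.3]
print(stream_select(points, toy_score, threshold=0.8))
print(pool_select(points, toy_score, k=2))
```

The stream version needs a threshold chosen in advance, while the pool version only needs a labeling budget `k`, because it can rank all candidates before committing.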
Main question: Asking the right question
In the literature, “asking the right question” is technically referred to as a Query Strategy. When I say asking the right question, it’s essentially deciding which data points are worth labeling (both for classification and regression). The following measures can be used to determine which data points to label:
- Uncertainty Sampling: Select the instances whose predicted label distribution has the highest entropy, i.e., the ones the current model is least sure about.
- Query-By-Committee: Maintains a group (committee) of models and queries the instance they disagree on the most.
- Expected Model Change: A decision-theoretic approach that selects the instance that would most significantly impact the current model’s parameters (e.g., Expected Gradient Length).
- Expected Error Reduction: Aims to query the instance that will most reduce the model’s future generalization error (or “risk”).
- Variance Reduction: Minimizes future error indirectly by minimizing the model’s output variance, often using Fisher information.
- Density Weighted Methods: Ensures the learner doesn’t just pick “confusing” outliers by also considering how representative an instance is of the overall data distribution.
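Three of the measures above are simple enough to sketch with the standard library: entropy-based uncertainty sampling, disagreement in query-by-committee (measured here with vote entropy, which is one common choice), and density weighting (here the base score multiplied by a density term raised to a power `beta`, one common formulation). The function names, the `beta` parameter, and all the numbers are assumptions made up for illustration. In practice the probabilities and votes would come from trained models, and the densities from something like average similarity to the rest of the pool.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_scores(pred_probs: Sequence[Sequence[float]]) -> List[float]:
    """Uncertainty sampling: entropy of each instance's predicted class distribution."""
    return [entropy(p) for p in pred_probs]


def vote_entropy_scores(votes: Sequence[Sequence[int]]) -> List[float]:
    """Query-by-committee: entropy of the committee's label votes per instance.

    votes[i] holds one predicted label per committee member for instance i.
    """
    scores = []
    for instance_votes in votes:
        n = len(instance_votes)
        counts = Counter(instance_votes)
        scores.append(entropy([c / n for c in counts.values()]))
    return scores


def density_weighted(scores: Sequence[float],
                     densities: Sequence[float],
                     beta: float = 1.0) -> List[float]:
    """Density weighting: scale a base score by how representative the instance is."""
    return [s * d ** beta for s, d in zip(scores, densities)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up model outputs for three unlabeled instances.
pred_probs = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
committee_votes = [[0, 0, 0], [0, 1, 1], [1, 1, 1]]
densities = [0.1, 0.8, 0.9]  # hypothetical: instance 0 is an outlier

u = uncertainty_scores(pred_probs)
print("uncertainty sampling picks:", argmax(u))
print("query-by-committee picks:", argmax(vote_entropy_scores(committee_votes)))
print("density-weighted uncertainty picks:", argmax(density_weighted(u, densities)))
```

In this toy setup, plain uncertainty sampling prefers the maximally confusing instance, while the density-weighted score moves the choice to an instance that is nearly as uncertain but far more representative of the pool.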