The unfortunate reality is that many of the most commonly used machine learning metrics don't account for the complex trade-offs that come with real-world decision making. This is one of the challenges that Sanmi Koyejo has dedicated his research to addressing.
Sanmi is an assistant professor in the Department of Computer Science at the University of Illinois, where he applies his background in cognitive science, probabilistic modeling, and Bayesian inference to research that focuses broadly on "adaptive and robust machine learning."
Constructing ML Models that Optimize Complex Metrics
As an example of the disconnect between simple and complex machine learning metrics, think about an information retrieval problem, like search or document classification. For these types of problems, it's common to use a metric known as the F-measure to assess your model's performance. F-measure is preferred to simpler metrics like accuracy because it produces a more balanced result by combining the model's precision and recall.
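To make that concrete, here is a minimal Python sketch on a made-up, imbalanced set of labels: a classifier that never retrieves the one relevant document still scores 90% accuracy, while its F-measure drops to zero.

```python
# Toy illustration: accuracy vs. F-measure on an imbalanced problem.
# Labels and predictions are made up for illustration only.

y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # 1 relevant document out of 10
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # classifier that always says "not relevant"

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy)  # 0.9 -- looks great
print(f1)        # 0.0 -- reveals the classifier never finds the relevant document
```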
Before Sanmi began his research in this area, there wasn't a good understanding of how to build a machine learning system that was specifically good at optimizing F-measure.
Sanmi and his collaborators explored this area through a series of papers, including Online Classification with Complex Metrics, which looks at building models that optimize complex, non-decomposable metrics. (Non-decomposable here means the metric can't be written as an average over individual examples, which is what would let you apply standard tools like gradient descent.)
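As a rough illustration of non-decomposability, using made-up labels and predictions: accuracy on a dataset equals the average of accuracy on its halves, but F-measure computed from aggregate counts does not average that way.

```python
# Illustrative sketch: accuracy decomposes across examples, F1 does not.
# Data is made up; f1() computes F1 from aggregate counts.

def f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
half = len(y_true) // 2

# Accuracy: the whole equals the average of the halves (decomposable).
print(accuracy(y_true, y_pred))                       # 0.75
print((accuracy(y_true[:half], y_pred[:half]) +
       accuracy(y_true[half:], y_pred[half:])) / 2)   # 0.75

# F1: averaging F1 over the halves does NOT recover F1 on the whole.
print(f1(y_true, y_pred))                             # ~0.667
print((f1(y_true[:half], y_pred[:half]) +
       f1(y_true[half:], y_pred[half:])) / 2)         # 0.75 -- a different number
```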
Scaling up to More Complex Measures
To generalize this idea beyond simple binary classifiers, we have to think about the confusion matrix, a key statistical tool for assessing classifiers. The confusion matrix tabulates, for each true label, the distribution of predictions the classifier makes.
Sanmi's research provided guidance for building models that optimize arbitrary metrics defined on the confusion matrix.
"Initially we work[ed out] linear weighted combinations. Eventually, we got to ratios of linear things, which captures things like F-measure. Now we're at the point where we can pretty much do any function of the confusion matrix."
Domain Experts and Metric Elicitation
Having developed a framework for optimizing classifiers against complex performance metrics, Sanmi turned to the next question (because it was the next question asked of him): which metric should you choose for a particular problem? This is where metric elicitation comes in.
The idea is to flip the question around: instead of assuming a metric up front, interact with experts or users to figure out which of the metrics we can now optimize best approximates how those experts trade off different types of predictions and classification errors.
For example, a doctor understands the costs associated with diagnosing or misdiagnosing someone with a disease. The trade-off factors could include treatment prices or side effects--factors that can be compressed into the pros and cons of predicting a diagnosis or not. Building a trade-off function for these decisions by hand is difficult. Metric elicitation lets us identify a doctor's preferences through a series of interactions and work out the trade-offs that correspond to those preferences. Once we know these trade-offs, we can build a metric that captures them and optimize for those preferences directly using the techniques Sanmi developed earlier.
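As a hedged sketch of what such an elicited metric might look like, suppose, purely for illustration, that a clinician treats a missed diagnosis as five times as costly as a false alarm. That preference compresses into a single cost-weighted function of the confusion matrix, which the techniques above can then optimize; the 5:1 ratio and the numbers below are assumptions, not values from the interview.

```python
# Hypothetical cost-weighted metric capturing a clinician's trade-off.
# The 5:1 cost ratio is an illustrative assumption, not a real elicited value.

FN_COST = 5.0   # cost of missing a disease (false negative)
FP_COST = 1.0   # cost of a false alarm (false positive)

def expected_cost(tp, fp, fn, tn):
    """Lower is better: expected cost per example, given confusion-matrix rates."""
    return FN_COST * fn + FP_COST * fp

# Two hypothetical classifiers described by their confusion-matrix rates (tp, fp, fn, tn).
cautious  = (0.18, 0.20, 0.02, 0.60)   # flags many patients, rarely misses disease
selective = (0.12, 0.03, 0.08, 0.77)   # flags few patients, misses more disease

print(expected_cost(*cautious))    # 0.30
print(expected_cost(*selective))   # 0.43 -- worse under this expert's trade-off
```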
In research developed with Gaurush Hiranandani and other colleagues at the University of Illinois, the paper Performance Metric Elicitation from Pairwise Classifier Comparisons proposes a system that asks experts to compare pairs of classifiers and indicate which they prefer, kind of like an eye exam for machine learning metrics.
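Here is a toy sketch of that eye-exam idea, not the paper's actual algorithm: assume the expert's hidden preference is a false-negative vs. false-positive cost ratio, and binary-search for it by repeatedly asking which of two classifiers they prefer.

```python
# Toy sketch of the pairwise-comparison idea (not the paper's actual algorithm).
# We pretend the expert's hidden preference is a false-negative vs. false-positive
# cost ratio and binary-search for it with "which classifier do you prefer?" queries.
# Every number here is made up for illustration.

def tuned_classifier(w):
    """Hypothetical classifier on a trade-off frontier fn * fp = 0.01,
    tuned to minimize w * fn + fp (treating a miss as w times a false alarm)."""
    return {"fn": 0.1 / w ** 0.5, "fp": 0.1 * w ** 0.5}

def expert_prefers(conf_a, conf_b, hidden_fn_cost=4.0):
    """Stand-in for the human expert; the hidden 4:1 cost ratio is an assumption."""
    cost = lambda c: hidden_fn_cost * c["fn"] + c["fp"]
    return "A" if cost(conf_a) < cost(conf_b) else "B"

# Binary search over candidate cost ratios in [1, 10].
lo, hi = 1.0, 10.0
for _ in range(30):
    mid = (lo + hi) / 2
    a = tuned_classifier(0.9 * mid)  # slightly less miss-averse than the candidate ratio
    b = tuned_classifier(1.1 * mid)  # slightly more miss-averse than the candidate ratio
    if expert_prefers(a, b) == "A":
        hi = mid  # expert leans toward fewer false alarms -> true ratio is below mid
    else:
        lo = mid  # expert leans toward fewer misses -> true ratio is above mid

print((lo + hi) / 2)  # lands close to the expert's hidden 4.0 cost ratio
```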
Metric Elicitation and Inverse Reinforcement Learning
Sanmi notes that learning metrics in this manner is similar to inverse reinforcement learning, where reward functions are learned, often through interaction with humans. The fields differ, however, in that inverse RL is more focused on replicating behavior than on getting the reward function exactly right, while metric elicitation aims to recover the same decision-making reward function the human expert is using. Matching the expert's reward function, as opposed to their behavior, has the benefit of greater generalizability: the elicited metrics are agnostic to the data distribution and to the specific learner you're using.
Sanmi mentions another interesting application area, fairness and bias, where different measures of fairness correspond to different notions of trade-offs. Upcoming research is focused on finding "elicitation procedures that build context-specific notions of metrics or statistics" that should be normalized across groups to reach a fairness goal in a specific setting.
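As a simple illustration of a statistic normalized across groups (not drawn from that upcoming research), the sketch below checks how far apart recall is between two hypothetical groups; an elicitation procedure would decide which statistic to equalize and how much disparity is acceptable in a given context.

```python
# Illustrative sketch: checking whether one statistic (here, recall / true-positive rate)
# is equalized across two groups. The groups, labels, and predictions are made up,
# and recall is just one of many statistics an elicitation procedure might select.

def recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

group_a = {"y_true": [1, 1, 0, 1, 0, 0], "y_pred": [1, 1, 0, 0, 0, 0]}
group_b = {"y_true": [1, 0, 1, 0, 0, 1], "y_pred": [1, 0, 0, 0, 1, 0]}

gap = abs(recall(**group_a) - recall(**group_b))
print(gap)   # 0.0 would mean the statistic is perfectly equalized across groups
```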
Robust Distributed Learning
This interview also covers Sanmi's research into robust distributed learning, which aims to harden distributed machine learning systems against adversarial attacks.
Be sure to check out the full interview for the interesting discussion Sam and Sanmi had on both metric elicitation and robust distributed learning. The latter discussion starts about 33 minutes into the interview.