Ilias Diakonikolas is a faculty member in the Computer Science Department at the University of Wisconsin-Madison, where his main research interests are in algorithms and machine learning. His work focuses on understanding the tradeoffs among statistical efficiency, computational efficiency, and robustness for fundamental problems in statistics and machine learning, where robustness refers broadly to a model's ability to cope with noisy data.
Ilias recently won the Outstanding Paper award at NeurIPS for his work, Distribution-Independent PAC Learning of Halfspaces with Massart Noise, which falls in an area called high-dimensional robust learning and represents essentially the first progress on distribution-independent learning with noise since the 80s.
Robustness in Machine Learning
This field seeks to identify machine learning algorithms that are robust to noise in their input data under a variety of settings. Machine learning algorithms typically make assumptions about the data that they operate on, and the (often implicit) models that are used to generate that data.
Statisticians and computer scientists have long had algorithms that can weed out noise when the input data has only one or a few dimensions. But with high-dimensional data like images or medical records, separating the good data from the noise remains a challenging research problem. Ilias emphasizes that, "Even a very tiny deviation from the assumed model could make the machine learning algorithm very very fragile...we need to rectify this and develop a methodology to have robust learning algorithms for high dimensional problems."
The robust learning problem that Ilias is working on differs from other research on robustness to adversarial attacks in terms of when the data is perturbed. Typically, in the adversarial robustness scenario, you assume the model is trained on clean data and the goal is to improve its performance when presented with corrupt data (i.e. a test-time attack). For the type of robustness Ilias is working with, the model is trained on corrupted data from the beginning (i.e. a training-time attack).
Problems with Corrupt Data in High-Dimensional Settings
Consider the problem of estimating the mean of a set of independent and identically distributed (IID) samples. As you probably know, the process is fairly simple: take the sum and divide by the number of samples. But if there are outliers that don't come from the distribution you assumed, your estimate can end up arbitrarily far from the true mean and may never converge to it. If you're only operating in a single dimension, you can get around the outliers by taking the median instead of the mean.
If you apply this same approach to high-dimensional data points and your data is clean (meaning drawn from a single distribution), then you can similarly compute the empirical mean. However, if your data is corrupted and contains a lot of outliers, the simple fix of taking the median no longer works.
In a high-dimensional setting, you can't just take the median of each coordinate: noise can accumulate across the many dimensions, and the resulting error can grow very large as the dimension increases. Information-theoretically this is avoidable, but doing it with a computationally efficient algorithm was not something we knew how to do.
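As a toy illustration of this point (my own numpy sketch, not something from Ilias' work, with arbitrarily chosen sample sizes and outlier values): in one dimension the median shrugs off a small fraction of planted outliers even though the mean gets dragged far away, while in higher dimensions the coordinate-wise median drifts further and further from the true mean as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 10_000, 0.05                        # sample size, fraction of corrupted points
n_bad = int(eps * n)

# One dimension: mean vs. median under contamination.
clean = rng.normal(loc=0.0, scale=1.0, size=n - n_bad)
outliers = np.full(n_bad, 1_000.0)           # adversarial points far from the true mean (0)
x = np.concatenate([clean, outliers])
print("1D mean:  ", round(x.mean(), 3))      # dragged far away from 0
print("1D median:", round(np.median(x), 3))  # stays close to 0

# High dimensions: the coordinate-wise median's error grows with the dimension d.
for d in (10, 100, 1000):
    clean = rng.normal(size=(n - n_bad, d))
    outliers = np.full((n_bad, d), 10.0)     # shift every coordinate at once
    X = np.vstack([clean, outliers])
    cw_median = np.median(X, axis=0)
    print(f"d = {d:4d}, distance from true mean:",
          round(float(np.linalg.norm(cw_median)), 3))
```

Each coordinate's median is only nudged a little, but those small errors add up across coordinates, which is exactly the accumulation problem described above.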
The NeurIPS Paper: Distribution-Independent, PAC Learning, Massart Noise, Oh My!
The goal of this paper is to find an algorithm that can fit a linear model to noisy, high-dimensional data. Specifically, it tackles binary classification: learning a linear separator that assigns each point to one of two classes, positive or negative.
"What we show in this paper is that, in fact, in a very reasonable model of random label perturbations, this is something that...can be done with a computationally efficient algorithm...You can actually learn linear separators efficiently in the presence of noise."
Let's explore the terminology introduced in the title of Ilias' paper:
Distribution-Independent. We observe labeled examples of x (points in high dimensions) and y (binary labels), and we don't need to make any assumptions about the x's! We only assume that the x's are IID samples from some fixed distribution, and even that distribution can be arbitrary, because in reality you might not know anything about how your data is distributed.
PAC Learning (Probably Approximately Correct). This is the standard model of binary classification introduced by Leslie Valiant in the 80s (with closely related models due to Vapnik), in which you try to find a classifier that generalizes to unseen data.
Halfspace. The linear separator we seek between the positive and negative classes.
Massart Noise. One way to model noise in your data is to fix the probability that any given data point (or, in this case, label) is flipped. Massart noise is a bit more subtle: each label is flipped with some unknown probability that can differ from point to point, and all we know is that every flip probability is bounded above by some constant; it can be anything from zero up to that constant. One way to generate data under this model is sketched just after this list.
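For concreteness, here is a minimal numpy sketch of that data-generation process. It is my own illustration rather than code from the paper, and it draws the x's from a Gaussian purely for convenience; the whole point of the distribution-independent setting is that x can come from an arbitrary distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, eta = 20, 5_000, 0.3             # dimension, sample size, Massart noise bound

# The true halfspace, unknown to the learner.
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)

# Gaussian x's chosen only for convenience; the setting allows arbitrary distributions.
X = rng.normal(size=(n, d))
clean_labels = np.sign(X @ w_star)     # labels given by sign(w* . x)

# Massart noise: an adversary picks any flip probability eta(x) in [0, eta] per point.
flip_prob = eta * rng.random(n)        # one arbitrary choice of eta(x)
flips = rng.random(n) < flip_prob
y = np.where(flips, -clean_labels, clean_labels)

# Expected flip fraction is eta/2 here; no individual point's flip probability exceeds eta.
print("fraction of flipped labels:", flips.mean())
```

The learner only ever sees (X, y) and has to recover a halfspace that predicts the labels well despite the flips.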
Iteratively Applying Stochastic Gradient Descent
To try to solve this problem, Ilias and his team's initial approach was to use good-old, handy-dandy Stochastic Gradient Descent (SGD). Unfortunately, they found that a direct application of SGD not only fails, they were able to prove that it cannot solve the problem on its own.
However, all was not lost. In the process of proving this, the team discovered that SGD can actually solve a non-trivial part of the problem. This was enough to give them a toe-hold: the paper presents an algorithm that lets SGD solve that part of the problem, then re-runs it on what's left, recovering more of the solution each time.
Of course, an algorithm may exist without being efficient, for example if it requires an exponential number of iterations to find a solution. In fact, Ilias and his team showed that every iteration covers a non-trivial, inverse-polynomial fraction of the space, so at most a polynomial number of iterations is needed to cover all of it.
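The snippet below is a deliberately simplified caricature of that iterate-and-recurse structure, not the algorithm from the paper (which analyzes a carefully chosen surrogate loss and a more principled way of selecting the region to commit to). The function names, the hinge-style surrogate, and the fixed margin band are all my own stand-ins: run SGD on the points that are not yet handled, commit to the resulting halfspace's predictions where it is most trustworthy, and recurse on what remains.

```python
import numpy as np

def sgd_halfspace(X, y, steps=2000, lr=0.01, seed=0):
    # Plain SGD on a hinge-style surrogate, a stand-in for the paper's
    # more careful choice of loss.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(X))
        if y[i] * (X[i] @ w) <= 1:       # hinge subgradient step
            w += lr * y[i] * X[i]
    return w / (np.linalg.norm(w) + 1e-12)

def iterative_learner(X, y, band=0.5, max_rounds=50):
    # Caricature of the iterative scheme: learn a halfspace on the points not
    # yet covered, commit to its predictions where |<w, x>| is large (where it
    # is most trustworthy), then recurse on the remaining region.
    remaining = np.ones(len(X), dtype=bool)
    pieces = []                          # list of (w, margin threshold) rules
    for _ in range(max_rounds):
        if not remaining.any():
            break
        w = sgd_halfspace(X[remaining], y[remaining])
        covered = remaining & (np.abs(X @ w) >= band)
        if not covered.any():            # nothing left with a large margin:
            pieces.append((w, 0.0))      # let the last halfspace label the rest
            break
        pieces.append((w, band))
        remaining &= ~covered            # shrink the uncovered region
    return pieces
```

To classify a new point you would walk through the returned pieces in order and use the first halfspace whose margin condition the point satisfies, mirroring the argument above that polynomially many rounds suffice when each one covers an inverse-polynomial fraction of the space.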
What's next?
For Ilias, this is just the beginning, "In some sense, [this] is the first paper in this line of work of doing distribution-independent learning with noise...and I believe it's going to lead to many developments."
A continuation of the work in the paper, with his colleague Christos Tzamos and two PhD students at UW, shows that if you make reasonable concentration and anti-concentration assumptions about the distribution (i.e., that the distribution of the inputs is neither too dense nor too empty anywhere), SGD can work as-is.
A related venture is a survey Ilias worked on with longtime collaborator Daniel Kane, called Recent Advances in Algorithmic High-Dimensional Robust Statistics. The survey gives an overview of roughly 100 papers, explores the techniques developed in the space so far, and discusses what directions the community should pursue next. It will be published soon as part of the book Beyond Worst-Case Analysis.
Practical Implications: Data Poisoning and Implementation
One of the practical implications of robust statistics is the prevention of data poisoning. Data poisoning occurs when a system ingests data from outside sources and is vulnerable to malicious users who insert fake data that degrades the behavior of the model.
While the potential for applications is large, what's holding back widespread adoption is that the algorithms rely on spectral methods and are not as automatic as the machine learning community would like them to be. Further, many real-world problems are non-convex, meaning that SGD can't be applied directly. Ilias believes this can change soon, and is currently working on simplifying the algorithmic machinery and giving structure to these non-convex problems so that they can be formulated in a way that SGD can solve.
NeurIPS 2019 Outstanding Paper: Distribution-Independent PAC Learning of Halfspaces with Massart Noise
Slides: Distribution-Independent PAC Learning of Halfspaces with Massart Noise
Paper: Private Testing of Distributions via Sample Permutations
Paper: Nearly Tight Bounds for Robust Proper Learning of Halfspaces with a Margin
Paper: Outlier-Robust High-Dimensional Sparse Estimation via Iterative Filtering
Paper: Robust Estimators in High Dimensions without the Computational Intractability
Survey: Recent Advances in Algorithmic High-Dimensional Robust Statistics