Feature selection is a common technique in machine learning: useful input features are used to train the model, while non-predictive features are discarded. This often improves model quality by letting the model focus on the most important features. A common way to select features is based on the mutual information between the features and the label the model predicts. This works well in classical machine learning scenarios where the data is tabular and does not contain too many features. But does it still work for very high-dimensional data like images, which require a feature extractor to carry out the machine learning task? We set out to answer this question and obtained an encouraging result.
But What is Mutual Information?
Before going into the technical details of this exploration, let us first look at what mutual information is. Mutual information is a metric that quantifies the dependence between two random variables, say X and Y. It is defined as:
I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
where H(X) is the entropy of a variable X, a measure of how much uncertainty the variable carries on average, or equivalently, how much information we gain on average by observing its value (remember that these variables are probability distributions, not fixed values). H(X|Y), the conditional entropy of X given Y, measures how much uncertainty about X remains after we already know the value of Y.
If X is a deterministic function of Y, for example, knowing the value of Y leaves no uncertainty about X. Therefore H(X|Y) = 0, and the mutual information is maximal: I(X;Y) = H(X), meaning all the information in X is shared with Y. On the other hand, if the two variables are independent, knowing the value of Y tells us nothing about X, so H(X|Y) = H(X). Thus H(X) − H(X|Y) = 0, and the mutual information is minimal.
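These two extremes can be checked numerically. Below is a small sketch (function names are ours) that computes mutual information for a discrete joint distribution via the equivalent identity I(X;Y) = H(X) + H(Y) − H(X,Y):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector, ignoring zeros."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a discrete joint distribution."""
    px = joint.sum(axis=1)   # marginal distribution of X
    py = joint.sum(axis=0)   # marginal distribution of Y
    return entropy(px) + entropy(py) - entropy(joint.ravel())

# X is a deterministic function of Y: knowing Y pins down X exactly,
# so I(X;Y) = H(X) = 1 bit here.
deterministic = np.array([[0.5, 0.0],
                          [0.0, 0.5]])
print(mutual_information(deterministic))  # → 1.0

# X and Y independent: the joint factorizes, so I(X;Y) = 0 bits.
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
print(mutual_information(independent))  # → 0.0
```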
Features with high mutual information with the label are informative: they carry information the model can use to predict the label, and are worth selecting.
How to Estimate Mutual Information?
To use mutual information for feature selection, you first have to estimate it. Calculating the exact value is intractable since the true underlying probability distributions are unknown; the distributions, and hence the mutual information, need to be estimated from the data.
Fortunately, we don’t have to go through the tedious math here, as scikit-learn already provides a pair of good estimators: mutual_info_classif and mutual_info_regression. Since one of our variables (the class label) is categorical, mutual_info_classif is the right choice.
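As a quick sketch of how the estimator is used (the toy data here is made up purely for illustration, it is not from our experiments):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Toy data: feature 0 is predictive of the binary label, feature 1 is noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = np.column_stack([
    y + 0.3 * rng.normal(size=500),  # correlated with the label
    rng.normal(size=500),            # independent noise
])

# Estimated mutual information between each feature and the label;
# the first score should come out much larger than the second.
mi = mutual_info_classif(X, y, random_state=0)
print(mi)
```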
The Few-Shot Image Classification Pipeline
Few-shot image classification is the task of classifying images into one of several categories when the model only has access to a few labeled images per category. Because images are very high-dimensional (often millions of dimensions), it is hard to model the data distribution from scratch and classify accurately from only a few data points without overfitting. Therefore, a pretrained vision encoder, one that has already been trained on many images, is typically used.
The vision encoder extracts features and maps each image into a high-dimensional embedding vector, which is then passed to a prototypical network classifier. For each class, the classifier stores the component-wise mean of that class's image embeddings, known as the class prototype. When it encounters a new image embedding, it predicts the class whose prototype is closest.
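A minimal sketch of this classifier, assuming the encoder's embeddings have already been computed (function names are ours):

```python
import numpy as np

def build_prototypes(support_embeddings, support_labels, n_classes):
    """Class prototype = component-wise mean of each class's support embeddings."""
    return np.stack([
        support_embeddings[support_labels == c].mean(axis=0)
        for c in range(n_classes)
    ])

def classify(query_embeddings, prototypes):
    """Assign each query to the class with the nearest (Euclidean) prototype."""
    # Pairwise distances, shape (n_query, n_classes)
    dists = np.linalg.norm(
        query_embeddings[:, None, :] - prototypes[None, :, :], axis=-1
    )
    return dists.argmin(axis=1)
```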
How, then, do we combine this pipeline with mutual-information-based feature selection? We intervene when the class prototypes are generated: we mask out the non-useful features using a Boolean mask and retain the more useful ones. In the general case, the mask is a vector of the same size as the class prototype, multiplied element-wise with the prototype to produce the new prototype we work with.
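Concretely, the masking step can be folded into the distance computation; here is a sketch assuming NumPy arrays, with the same mask applied to queries and prototypes (names are of our choosing):

```python
import numpy as np

def masked_distances(query_embeddings, prototypes, mask):
    """Distances in the masked feature space: both queries and class
    prototypes are multiplied element-wise by the same feature mask,
    so down-weighted features contribute less to the distance."""
    q = query_embeddings * mask  # mask broadcasts over the query dimension
    p = prototypes * mask        # mask broadcasts over the class dimension
    return np.linalg.norm(q[:, None, :] - p[None, :, :], axis=-1)
```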
Experimental Setups
We have tested 3 types of feature masks:
- Boolean mask: selects only the features whose feature-label mutual information exceeds a threshold value γ (a hyperparameter)
- Exponential mask: weights each feature by its feature-label mutual information raised to a certain exponent k (a hyperparameter)
- Top-k feature sampling: selects only the most informative features, where the hyperparameter λ is the fraction of features retained
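The three masks above could be computed from a vector of per-feature mutual information scores roughly as follows (a sketch; function names are ours):

```python
import numpy as np

def boolean_mask(mi, gamma):
    """Keep only features whose feature-label MI exceeds the threshold gamma."""
    return (mi > gamma).astype(float)

def exponential_mask(mi, k):
    """Soft weighting: raise each feature's MI score to the power k."""
    return mi ** k

def top_fraction_mask(mi, lam):
    """Keep the top lam fraction of features ranked by MI, zero out the rest."""
    n_keep = max(1, int(round(lam * mi.size)))
    mask = np.zeros_like(mi)
    mask[np.argsort(mi)[-n_keep:]] = 1.0
    return mask
```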
We have tested the few-shot image classification pipeline on 2 closely related vision encoders:
- ViT-S-14 (DINOv2)
- ViT-B-14 (DINOv2)
And 5 datasets (covering the general and fine-grained classification cases):
- CIFAR-100
- CUB-200-2011
- FGVC-Aircraft
- Flowers-102
- miniImageNet
We use the common 5-shot 5-class setting to evaluate the accuracy of each setup: 5 classes are drawn randomly from the dataset, with 5 support samples per class. We then evaluate the model's accuracy on 15 query samples per class.
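One way to sample such an episode from a pool of precomputed embeddings (a sketch under our own naming; the actual evaluation loop would average accuracy over many episodes):

```python
import numpy as np

def sample_episode(embeddings, labels, n_way=5, n_shot=5, n_query=15, seed=0):
    """Sample one 5-way 5-shot episode: 5 support and 15 query embeddings
    per class, from 5 randomly chosen classes (relabeled 0..4)."""
    rng = np.random.default_rng(seed)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, s_y, query, q_y = [], [], [], []
    for new_label, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == c))[: n_shot + n_query]
        support.append(embeddings[idx[:n_shot]])
        s_y += [new_label] * n_shot
        query.append(embeddings[idx[n_shot:]])
        q_y += [new_label] * n_query
    return (np.concatenate(support), np.array(s_y),
            np.concatenate(query), np.array(q_y))
```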
Experimental Results
The experimental results (accuracy table) across different setups with different hyperparameter choices are listed below:
Observations
From the experimental results, we make some observations:
- For the exponential mask, an exponent less than 1 usually works better than an exponent greater than 1
- For most fine-grained datasets, the Boolean mask works better than the exponential mask
- Some setups perform very well on one dataset but fall below the baseline on another. The performance therefore appears highly dependent on the base-model/dataset pair, and without a setup that consistently improves accuracy, it might be difficult to apply this paradigm in practice.
Conclusion
Although mutual-information-based feature selection might be difficult to apply in real-life scenarios, it offers a promising glimpse of how mutual information can be used to improve the accuracy of neural network classifiers. Future work could focus on designing an effective feature-label mutual information regularizer that improves the performance of neural networks.