Two approaches exist for optimizing performance on F1. Structured loss min-
imization incorporates the performance metric into the loss function and then
optimizes during training. In contrast, plug-in rules convert the numerical out-
puts of a classifier into optimal predictions [5]. In this paper, we highlight the
latter scenario to differentiate between the beliefs of a system and the predictions
selected to optimize alternative metrics. In the multilabel case, we show that the
same beliefs can produce markedly dissimilar optimally thresholded predictions
depending upon the choice of averaging method.
It is well known that F1 is asymmetric in the positive and negative class: given
complemented predictions and complemented actual labels, F1 may award a different
score. It is also generally known that micro F1 is less affected by performance on rare
labels, while macro F1 weighs the F1 of each label equally [11]. In this pa-
per, we show how these properties are manifest in the optimal decision-making
thresholds and introduce a theorem to describe that threshold. Additionally,
we demonstrate that given an uninformative classifier, optimal thresholding to
maximize F1 predicts all instances positive regardless of the base rate.
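To see why, here is a back-of-the-envelope sketch using expected counts (the exact batch-level statement follows from Theorem 1 below). Suppose an uninformative classifier assigns every one of $n$ examples the same probability $b$, the base rate. Any threshold then predicts either all instances negative or all instances positive. Predicting all negative yields no true positives and hence an F1 of zero, while predicting all positive yields, in expectation, $tp = nb$, $fp = n(1-b)$, and $fn = 0$, so
\[
F1 \approx \frac{2\,tp}{2\,tp + fp + fn} = \frac{2nb}{2nb + n(1-b)} = \frac{2b}{1+b} > 0,
\]
and the all-positive prediction is therefore optimal.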
While F1 is widely used, some of its properties are not widely recognized.
In particular, when choosing predictions to maximize the expectation of F1 for
a batch of examples, each prediction depends not only on the probability that
the label applies to that example, but also on the distribution of probabilities
for all other examples in the batch. We quantify this dependence in Theorem 1,
where we derive an expression for optimal thresholds. The dependence makes it
difficult to relate predictions that are optimally thresholded for F1 to a system’s
predicted probabilities.
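To make this batch dependence concrete, the following sketch (illustrative only; the probabilities, grid, and Monte Carlo estimator are our own assumptions, not part of the paper) estimates the expected F1 of thresholded predictions when the true labels are treated as independent Bernoulli draws with the classifier's predicted probabilities. The threshold that maximizes estimated expected F1 changes when the other probabilities in the batch change, even though one example's own probability stays fixed.

import numpy as np

def expected_f1(probs, threshold, n_sims=20000, seed=0):
    """Monte Carlo estimate of E[F1] when labels are independent
    Bernoulli(probs) and we predict positive iff prob >= threshold."""
    rng = np.random.default_rng(seed)
    preds = probs >= threshold                          # fixed decisions
    labels = rng.random((n_sims, probs.size)) < probs   # simulated labels
    tp = (preds & labels).sum(axis=1)
    fp = (preds & ~labels).sum(axis=1)
    fn = (~preds & labels).sum(axis=1)
    denom = 2 * tp + fp + fn
    # Convention (an assumption): F1 = 1 when there are neither
    # predicted nor actual positives.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 1.0)
    return float(f1.mean())

# Two hypothetical batches that share one example with probability 0.6
# but differ in the rest of the batch.
batch_a = np.array([0.1, 0.1, 0.1, 0.1, 0.6])
batch_b = np.array([0.4, 0.4, 0.4, 0.4, 0.6])
grid = np.linspace(0.05, 0.95, 19)
for name, batch in [("batch_a", batch_a), ("batch_b", batch_b)]:
    best = max(grid, key=lambda t: expected_f1(batch, t))
    print(name, "best threshold ~", round(float(best), 2))

With the first batch it is best to predict positive only the 0.6 example, while with the second batch it is best to predict every example positive, so the maximizing threshold drops even though the shared example's probability is unchanged.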
We show that the difference in F1 score between perfect predictions and
optimally thresholded random guesses depends strongly on the base rate. As
a result, assuming optimal thresholding and a classifier outputting calibrated
probabilities, predictions on rare labels typically get a score ranging from close to
zero up to one, while scores on common labels will always be high. In this sense,
macro average F1 can be argued not to weigh labels equally, but actually to give
greater weight to performance on rare labels.
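As a quick numerical illustration (using the same expected-count approximation as in the sketch above, with made-up base rates):

# Perfect predictions score F1 = 1 at any base rate; under the expected-count
# approximation, optimally thresholded uninformative guessing scores 2b / (1 + b).
for b in (0.001, 0.01, 0.1, 0.5):
    guess = 2 * b / (1 + b)
    print(f"base rate {b:>5}: guess F1 ~ {guess:.3f}, gap to perfect ~ {1 - guess:.3f}")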
As a case study, we consider tagging articles in the biomedical literature with
MeSH terms, a controlled vocabulary of 26,853 labels. These labels have hetero-
geneously distributed base rates. We show that if the predictive features for rare
labels are lost (because of feature selection or another cause) then the optimal
threshold to maximize macro F1 leads to predicting these rare labels frequently.
For the case study application, and likely for similar ones, this behavior is far
from desirable.
2 Definitions of Performance Metrics
Consider binary classification in the single or multilabel setting. Given training
data of the form $\{\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle\}$, where each $x_i$ is a feature vector of
dimension $d$ and each $y_i$ is a binary vector of true labels of dimension $m$, a
probabilistic classifier outputs a model which specifies the conditional probability