Two approaches exist for optimizing performance on F1. Structured loss min-
imization incorporates the performance metric into the loss function and then
optimizes during training. In contrast, plug-in rules convert the numerical out-
puts of a classifier into optimal predictions [5]. In this paper, we highlight the
latter scenario to differentiate between the beliefs of a system and the predictions
selected to optimize alternative metrics. In the multilabel case, we show that the
same beliefs can produce markedly dissimilar optimally thresholded predictions
depending upon the choice of averaging method.
It is well known that F1 is asymmetric in the positive and negative class: given
complemented predictions and complemented actual labels, F1 may award a different
score. It is also generally known that micro F1 is less affected by performance on rare
labels, while macro F1 weighs the F1 of each label equally [11]. In this pa-
per, we show how these properties are manifest in the optimal decision-making
thresholds and introduce a theorem to describe that threshold. Additionally,
we demonstrate that given an uninformative classifier, optimal thresholding to
maximize F1 predicts all instances positive regardless of the base rate.
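To see why, here is a back-of-the-envelope sketch using expected counts (the exact batch-level statement follows from Theorem 1 below). Suppose an uninformative classifier assigns every one of $n$ examples the same probability $b$, the base rate. Any threshold then predicts either all instances negative or all instances positive. Predicting all negative yields no true positives and hence an F1 of zero, while predicting all positive yields, in expectation, $tp = nb$, $fp = n(1-b)$, and $fn = 0$, so
\[
F1 \approx \frac{2\,tp}{2\,tp + fp + fn} = \frac{2nb}{2nb + n(1-b)} = \frac{2b}{1+b} > 0,
\]
and the all-positive prediction is therefore optimal.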
While F1 is widely used, some of its properties are not widely recognized.
In particular, when choosing predictions to maximize the expectation of F1 for
a batch of examples, each prediction depends not only on the probability that
the label applies to that example, but also on the distribution of probabilities
for all other examples in the batch. We quantify this dependence in Theorem 1,
where we derive an expression for optimal thresholds. The dependence makes it
difficult to relate predictions that are optimally thresholded for F1 to a system’s
predicted probabilities.
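To make this batch dependence concrete, the following sketch (illustrative only; the probabilities, grid, and Monte Carlo estimator are our own assumptions, not part of the paper) estimates the expected F1 of thresholded predictions when the true labels are treated as independent Bernoulli draws with the classifier's predicted probabilities. The threshold that maximizes estimated expected F1 changes when the other probabilities in the batch change, even though one example's own probability stays fixed.

import numpy as np

def expected_f1(probs, threshold, n_sims=20000, seed=0):
    """Monte Carlo estimate of E[F1] when labels are independent
    Bernoulli(probs) and we predict positive iff prob >= threshold."""
    rng = np.random.default_rng(seed)
    preds = probs >= threshold                          # fixed decisions
    labels = rng.random((n_sims, probs.size)) < probs   # simulated labels
    tp = (preds & labels).sum(axis=1)
    fp = (preds & ~labels).sum(axis=1)
    fn = (~preds & labels).sum(axis=1)
    denom = 2 * tp + fp + fn
    # Convention (an assumption): F1 = 1 when there are neither
    # predicted nor actual positives.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 1.0)
    return float(f1.mean())

# Two hypothetical batches that share one example with probability 0.6
# but differ in the rest of the batch.
batch_a = np.array([0.1, 0.1, 0.1, 0.1, 0.6])
batch_b = np.array([0.4, 0.4, 0.4, 0.4, 0.6])
grid = np.linspace(0.05, 0.95, 19)
for name, batch in [("batch_a", batch_a), ("batch_b", batch_b)]:
    best = max(grid, key=lambda t: expected_f1(batch, t))
    print(name, "best threshold ~", round(float(best), 2))

With the first batch it is best to predict positive only the 0.6 example, while with the second batch it is best to predict every example positive, so the maximizing threshold drops even though the shared example's probability is unchanged.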
We show that the difference in F1 score between perfect predictions and
optimally thresholded random guesses depends strongly on the base rate. As
a result, assuming optimal thresholding and a classifier outputting calibrated
probabilities, predictions on rare labels typically get a score ranging from close to
zero up to one, while scores on common labels will always be high. In this sense,
macro average F1 can be argued not to weigh labels equally, but actually to give
greater weight to performance on rare labels.
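As a quick numerical illustration (using the same expected-count approximation as in the sketch above, with made-up base rates):

# Perfect predictions score F1 = 1 at any base rate; under the expected-count
# approximation, optimally thresholded uninformative guessing scores 2b / (1 + b).
for b in (0.001, 0.01, 0.1, 0.5):
    guess = 2 * b / (1 + b)
    print(f"base rate {b:>5}: guess F1 ~ {guess:.3f}, gap to perfect ~ {1 - guess:.3f}")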
As a case study, we consider tagging articles in the biomedical literature with
MeSH terms, a controlled vocabulary of 26,853 labels. These labels have hetero-
geneously distributed base rates. We show that if the predictive features for rare
labels are lost (because of feature selection or another cause) then the optimal
threshold to maximize macro F1 leads to predicting these rare labels frequently.
For the case study application, and likely for similar ones, this behavior is far
from desirable.
2 Definitions of Performance Metrics
Consider binary classification in the single or multilabel setting. Given training
data of the form $\{\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle\}$, where each $x_i$ is a feature vector of
dimension $d$ and each $y_i$ is a binary vector of true labels of dimension $m$, a
probabilistic classifier outputs a model which specifies the conditional probability