Platt scaling
In machine learning, Platt scaling or Platt calibration is a method of transforming the outputs of a classification model into a probability distribution over classes. The method was invented by John Platt in the context of support vector machines,[1] replacing an earlier method by Vapnik, but can be applied to other classification models.[2] Platt scaling works by fitting a logistic regression model to a classifier's scores.
Description
Let f be a real-valued function used as a binary classifier that predicts, for examples x, a label y from the set {+1, -1} as y = sign(f(x)) (disregarding the possibility of a zero output for now). When a probability P(y=1|x) is required instead, but the model does not provide one (or gives poor probability estimates), Platt scaling can be used. This method produces probabilities
- $P(y=1 \mid x) = \frac{1}{1 + \exp(Af(x) + B)}$,
i.e., a logistic transformation of the classifier scores f(x). Note that predictions can now be made according to y = 1 iff P(y=1|x) > ½; if B ≠ 0, the probability estimates contain a correction compared to the old decision function y = sign(f(x)).[3]
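The transformation itself is a simple elementwise sigmoid applied to the raw scores. As a minimal sketch (the parameter values A = -1.7 and B = 0.1 below are illustrative assumptions, not values from the source):

```python
import numpy as np

def platt_probabilities(scores, A, B):
    """Map raw classifier scores f(x) to calibrated probabilities
    P(y=1|x) = 1 / (1 + exp(A*f(x) + B))."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(A * scores + B))

# With A < 0, the probability increases with the score;
# a nonzero B shifts the decision threshold away from f(x) = 0.
probs = platt_probabilities([-2.0, -0.5, 0.0, 0.5, 2.0], A=-1.7, B=0.1)
labels = np.where(probs > 0.5, 1, -1)
```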
The (scalar) parameters A and B are estimated using a maximum likelihood method. The training set for parameter optimization is typically the same as that for the original classifier f. To avoid overfitting to this set, a held-out calibration set or cross-validation can be used, but Platt additionally suggests transforming the labels y to target probabilities
- $t_{+} = \frac{N_{+}+1}{N_{+}+2}$ for positive samples (y = 1), and
- $t_{-} = \frac{1}{N_{-}+2}$ for negative samples (y = -1).
Here, N₊ and N₋ are the numbers of positive and negative samples, respectively. This transformation follows by applying Bayes' rule to a model of out-of-sample data that has a uniform prior over the labels.[1]
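A sketch of this label transformation (the function name platt_targets is a hypothetical helper, not part of any standard library):

```python
import numpy as np

def platt_targets(y):
    """Replace hard labels y in {+1, -1} with the soft target
    probabilities t+ = (N+ + 1)/(N+ + 2) and t- = 1/(N- + 2)."""
    y = np.asarray(y)
    n_pos = np.sum(y == 1)
    n_neg = np.sum(y == -1)
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    return np.where(y == 1, t_pos, t_neg)

# e.g. with 3 positives and 2 negatives:
# platt_targets([1, 1, 1, -1, -1]) -> [0.8, 0.8, 0.8, 0.25, 0.25]
```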
Platt himself suggested using the Levenberg–Marquardt algorithm to optimize the parameters, but a Newton algorithm was later proposed that should be more numerically stable.[4]
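The maximum-likelihood fit itself amounts to minimizing the cross-entropy between the sigmoid outputs and the soft targets. The sketch below uses a general-purpose optimizer (scipy's BFGS) as a stand-in for the specialized Newton iteration of reference [4]; the starting point (-1, 0) is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, y):
    """Estimate the sigmoid parameters (A, B) by maximum likelihood on
    (score, target) pairs, using Platt's soft targets."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    targets = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0),
                       1.0 / (n_neg + 2.0))

    def neg_log_likelihood(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        eps = 1e-12  # guard against log(0)
        return -np.sum(targets * np.log(p + eps)
                       + (1.0 - targets) * np.log(1.0 - p + eps))

    result = minimize(neg_log_likelihood, x0=np.array([-1.0, 0.0]),
                      method="BFGS")
    return tuple(result.x)  # (A, B)
```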
See also
- Relevance vector machine: probabilistic alternative to the support vector machine
References
- ^ a b Platt, John (1999). "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods" (PDF). Advances in large margin classifiers. 10 (3): 61–74.
- ^ Niculescu-Mizil, Alexandru; Caruana, Rich (2005). Predicting good probabilities with supervised learning (PDF). ICML.
- ^ Olivier Chapelle; Vladimir Vapnik; Olivier Bousquet; Sayan Mukherjee (2002). "Choosing multiple parameters for support vector machines" (PDF). Machine Learning. 46: 131–159.
- ^ Lin, Hsuan-Tien; Lin, Chih-Jen; Weng, Ruby C. (2007). "A note on Platt's probabilistic outputs for support vector machines" (PDF). Machine Learning. 68 (3): 267–276.