Hinge function
Source: en.wikipedia.org/wiki/Hinge_function
Loss function in machine learning
The vertical axis represents the value of the hinge loss (in blue) and zero-one loss (in green) for fixed t = 1, while the horizontal axis represents the value of the prediction y. The plot shows that the hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.
For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

\ell(y) = \max(0, 1 - t \cdot y).
Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, y = \mathbf{w} \cdot \mathbf{x} + b, where (\mathbf{w}, b) are the parameters of the hyperplane and \mathbf{x} is the input variable(s).
When t and y have the same sign (meaning y predicts the right class) and |y| \ge 1, the hinge loss \ell(y) = 0. When they have opposite signs, \ell(y) increases linearly with y, and similarly if |y| < 1, even if it has the same sign (correct prediction, but not by enough margin).
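For concreteness, a minimal NumPy sketch of the binary hinge loss (the function name and sample values are illustrative, not from any library):

```python
import numpy as np

def hinge_loss(t, y):
    """Binary hinge loss max(0, 1 - t*y) for labels t in {-1, +1}
    and raw classifier scores y."""
    return np.maximum(0.0, 1.0 - t * y)

# Correctly classified points with margin >= 1 incur zero loss;
# wrong-signed or low-margin scores are penalized linearly.
t = np.array([1, 1, -1, -1])
y = np.array([2.0, 0.5, -0.3, 1.5])
print(hinge_loss(t, y))  # [0.  0.5 0.7 2.5]
```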
While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2]
it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4]
defined it for a linear classifier as[5]
\ell(y) = \max(0, 1 + \max_{y \ne t} \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x}),

where t is the target label, and \mathbf{w}_t and \mathbf{w}_y are the model parameters.
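A sketch of this loss for a linear classifier with an explicit weight matrix (W, x, and the function name are illustrative assumptions, not notation from the cited papers):

```python
import numpy as np

def crammer_singer_hinge(W, x, t):
    """Crammer-Singer multiclass hinge loss.
    W: (num_classes, num_features) weight matrix,
    x: feature vector, t: index of the target class."""
    scores = W @ x
    # Margin against the highest-scoring wrong class only.
    wrong = np.delete(scores, t)
    return max(0.0, 1.0 + wrong.max() - scores[t])
```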
Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][3]
\ell(y) = \sum_{y \ne t} \max(0, 1 + \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x}).
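Under the same assumptions as the sketch above, the Weston-Watkins variant sums the per-class violations instead of taking the single worst one:

```python
import numpy as np

def weston_watkins_hinge(W, x, t):
    """Weston-Watkins multiclass hinge: sum of margin violations
    over all wrong classes (same conventions as above)."""
    scores = W @ x
    margins = np.maximum(0.0, 1.0 + scores - scores[t])
    margins[t] = 0.0  # exclude the target class from the sum
    return margins.sum()
```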
In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where \mathbf{w} denotes the SVM's parameters, \mathbf{y} the SVM's predictions, \phi the joint feature function, and \Delta the Hamming loss:

\ell(\mathbf{y}) = \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle).
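A margin-rescaling sketch that maximizes over an explicit candidate set (phi, delta, and candidates are problem-specific placeholders; a real structured SVM replaces the Python loop with combinatorial loss-augmented inference):

```python
import numpy as np

def structured_hinge(w, phi, delta, x, t, candidates):
    """Margin-rescaled structured hinge loss.
    phi(x, y) -> joint feature vector; delta(y, t) -> Hamming loss;
    candidates: iterable of structured outputs y to search over."""
    # Loss-augmented inference: find the most-violating output.
    worst = max(delta(y, t) + w @ phi(x, y) for y in candidates)
    return max(0.0, worst - w @ phi(x, t))
```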
The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters w of a linear SVM with score function y = \mathbf{w} \cdot \mathbf{x} that is given by

\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1, \\ 0 & \text{otherwise.} \end{cases}
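This subgradient yields a simple update rule; a sketch of one subgradient-descent step (the function name and learning rate are illustrative):

```python
import numpy as np

def hinge_subgradient_step(w, x, t, lr=0.1):
    """One subgradient step on the hinge loss for a linear score
    y = w @ x with label t in {-1, +1}."""
    if t * (w @ x) < 1:     # margin violated: subgradient is -t * x
        w = w + lr * t * x  # descend along the negative subgradient
    return w                # zero subgradient otherwise: no update
```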
Plot of three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piece-wise smooth version by Rennie and Srebro (red). The vertical axis is the loss \ell(z) and the horizontal axis is z = ty.
However, since the derivative of the hinge loss at ty = 1 is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

\ell(z) = \begin{cases} \tfrac{1}{2} - z & \text{if } z \le 0, \\ \tfrac{1}{2}(1 - z)^2 & \text{if } 0 < z < 1, \\ 0 & \text{if } z \ge 1, \end{cases} \qquad z = ty,
or the quadratically smoothed

\ell_\gamma(y) = \begin{cases} \frac{1}{2\gamma} \max(0, 1 - ty)^2 & \text{if } ty \ge 1 - \gamma, \\ 1 - \frac{\gamma}{2} - ty & \text{otherwise,} \end{cases}

suggested by Zhang.[8] The modified Huber loss L is a special case of this loss function with \gamma = 2, specifically L(t, y) = 4 \ell_2(y).
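Both smoothed variants are short to implement; a NumPy sketch (function names are illustrative):

```python
import numpy as np

def smooth_hinge_rennie_srebro(z):
    """Piecewise-smooth hinge of Rennie and Srebro, with z = t * y."""
    return np.where(z <= 0, 0.5 - z,
                    np.where(z < 1, 0.5 * (1 - z) ** 2, 0.0))

def smooth_hinge_zhang(t, y, gamma=2.0):
    """Quadratically smoothed hinge; gamma = 2 recovers the
    modified Huber loss up to the factor of 4 noted above."""
    ty = t * y
    return np.where(ty >= 1 - gamma,
                    np.maximum(0.0, 1.0 - ty) ** 2 / (2 * gamma),
                    1.0 - gamma / 2 - ty)
```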