The swish function is a family of functions defined as swish_β(x) = x sigmoid(βx) = x / (1 + e^(−βx)), where β can be constant (usually set to 1) or trainable.
The swish family was designed to smoothly interpolate between a linear function and the ReLU function.
For positive values, Swish is a particular case of the doubly parameterized sigmoid shrinkage function defined in [2] (Eq. 3). Variants of the swish function include Mish.[3]
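As an illustration only (not part of the original article), a minimal NumPy sketch of the definition above; the function name swish and the use of NumPy are assumptions made for this example:

import numpy as np

def swish(x, beta=1.0):
    # swish_beta(x) = x * sigmoid(beta * x) = x / (1 + exp(-beta * x))
    return x / (1.0 + np.exp(-beta * x))

# Example: beta = 1 recovers the SiLU; beta = 0 gives the linear function x / 2.
print(swish(np.array([-2.0, 0.0, 2.0]), beta=1.0))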
Special values
For β = 0, the function is linear: f(x) = x/2.
For β = 1, the function is the Sigmoid Linear Unit (SiLU).
For β → ∞, the function converges pointwise to the ReLU function.
Thus, the swish family smoothly interpolates between a linear function and the ReLU function.[1]
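These special values can be checked numerically; the following is an illustrative sketch (assuming NumPy and a hypothetical swish helper as above), not code from the article:

import numpy as np

def swish(x, beta):
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-5.0, 5.0, 11)
assert np.allclose(swish(x, 0.0), x / 2.0)                         # beta = 0: linear, x / 2
assert np.allclose(swish(x, 1.0), x / (1.0 + np.exp(-x)))          # beta = 1: SiLU
assert np.allclose(swish(x, 50.0), np.maximum(x, 0.0), atol=1e-6)  # large beta: close to ReLU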
Since swish_β(x) = β^(−1) swish_1(βx), all instances of swish have the same shape as the default swish_1, zoomed by β. One usually sets β > 0. When β is trainable, this constraint can be enforced by β = e^b, where b is trainable.
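A quick numerical check of the rescaling identity and of the β = e^b reparameterization, as an illustrative sketch (NumPy and the helper names are assumptions):

import numpy as np

def swish(x, beta):
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-4.0, 4.0, 9)
beta = 2.5
# Rescaling identity: swish_beta(x) = swish_1(beta * x) / beta
assert np.allclose(swish(x, beta), swish(beta * x, 1.0) / beta)

# Enforcing beta > 0: train an unconstrained b and set beta = exp(b)
b = -0.7               # unconstrained (hypothetical) trainable scalar
beta = np.exp(b)       # always positive
y = swish(x, beta)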
Derivatives
Because swish_β(x) = β^(−1) swish_1(βx), it suffices to calculate its derivatives for the default case β = 1. Writing σ(x) = 1 / (1 + e^(−x)) for the sigmoid, the first derivative is

swish_1′(x) = σ(x) + x σ(x)(1 − σ(x)),

so swish_1′(x) − 1/2 is odd. The second derivative is

swish_1″(x) = σ(x)(1 − σ(x))(2 + x(1 − 2σ(x))),

so swish_1″ is even.
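The closed forms above can be verified against central finite differences, together with the stated odd/even symmetries; the sketch below is illustrative only (NumPy and the helper names swish1_prime, swish1_second are assumptions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish1(x):
    return x * sigmoid(x)

def swish1_prime(x):
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def swish1_second(x):
    s = sigmoid(x)
    return s * (1.0 - s) * (2.0 + x * (1.0 - 2.0 * s))

x = np.linspace(-4.0, 4.0, 81)

# Compare the closed forms with central finite differences.
h = 1e-5
assert np.allclose(swish1_prime(x), (swish1(x + h) - swish1(x - h)) / (2 * h), atol=1e-6)
h = 1e-4
assert np.allclose(swish1_second(x), (swish1(x + h) - 2 * swish1(x) + swish1(x - h)) / h**2, atol=1e-5)

# Symmetries: swish_1'(x) - 1/2 is odd, swish_1'' is even.
assert np.allclose(swish1_prime(x) - 0.5, -(swish1_prime(-x) - 0.5))
assert np.allclose(swish1_second(x), swish1_second(-x))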
History
SiLU was first proposed alongside the GELU in 2016,[4] and was proposed again in 2017 as the Sigmoid-weighted Linear Unit (SiL) in reinforcement learning.[5][1] More than a year after its initial introduction, the SiLU/SiL was proposed yet again as SWISH, originally without the learnable parameter β, so that β implicitly equaled 1. The swish paper was later updated to propose the activation with the learnable parameter β.
References
[4] Hendrycks, Dan; Gimpel, Kevin (2016). "Gaussian Error Linear Units (GELUs)". arXiv:1606.08415 [cs.LG].
[5] Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji (2017). "Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning". arXiv:1702.03118 [cs.LG].