Thus, the swish family smoothly interpolates between a linear function and the ReLU function.[1]
Since $\operatorname{swish}_{-\beta}(x) = x\,\sigma(-\beta x) = x - x\,\sigma(\beta x) = x - \operatorname{swish}_{\beta}(x)$, swish with negative values of $\beta$ is equivalent to a linear transform of swish with positive values of $\beta$ and does not lead to a different shape. Consequently, one usually sets $\beta \geq 0$. When $\beta$ is trainable, this constraint can be enforced by $\beta = e^{b}$, where $b$ is trainable.
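A minimal sketch of swish with a trainable, positivity-constrained $\beta$, assuming PyTorch (the module name Swish and the parameter name b are illustrative, not taken from the original papers):

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish activation with trainable beta, constrained positive via beta = exp(b)."""

    def __init__(self):
        super().__init__()
        # b = 0 gives beta = 1, i.e. the SiLU special case.
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        beta = torch.exp(self.b)             # beta > 0 by construction
        return x * torch.sigmoid(beta * x)   # swish_beta(x) = x * sigma(beta * x)
```

With b fixed at 0 this reduces to the SiLU; training b lets the network move between the near-linear regime (small $\beta$) and the ReLU-like regime (large $\beta$).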
History
SiLU was first proposed alongside the GELU in 2016,[3] and was proposed again in 2017 as the Sigmoid-weighted Linear Unit (SiL) in reinforcement learning.[4][1] The SiLU/SiL was then proposed once more as SWISH over a year after its initial discovery, originally without the learnable parameter β, so that β implicitly equaled 1. The swish paper was later updated to propose the activation with the learnable parameter β.
Hendrycks, Dan; Gimpel, Kevin (2016). "Gaussian Error Linear Units (GELUs)". arXiv:1606.08415 [cs.LG].
Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji (2017). "Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning". arXiv:1702.03118v3 [cs.LG].