Activation Functions: The Unsung Component That Makes Neural Networks Work

A neural network without activation functions is just a very complicated way of doing linear algebra.

Every layer would multiply its inputs by a matrix of weights and add a bias. Stack as many of those layers as you want, and the result is still equivalent to a single matrix multiplication followed by a single bias. The depth would be an illusion. A hundred-layer network would have no more expressive power than a one-layer network.
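The collapse follows from the algebra: W2(W1·x + b1) + b2 = (W2·W1)·x + (W2·b1 + b2). The sketch below checks this numerically for two small layers. The weights, biases, and helper names (`matvec`, `matmul`, `vecadd`) are arbitrary choices made for illustration, not part of any library.

```python
# Two stacked linear layers with no activation collapse into one linear layer.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vecadd(u, v):
    """Add two vectors elementwise."""
    return [a + b for a, b in zip(u, v)]

# Illustrative weights and biases (arbitrary values).
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, -0.5]
W2, b2 = [[3.0, 1.0], [-2.0, 4.0]], [1.0, 0.0]

def two_layers(x):
    """Apply layer 1, then layer 2, with no activation in between."""
    h = vecadd(matvec(W1, x), b1)
    return vecadd(matvec(W2, h), b2)

# The equivalent single layer: W = W2 @ W1, b = W2 @ b1 + b2.
W = matmul(W2, W1)
b = vecadd(matvec(W2, b1), b2)

def one_layer(x):
    """Apply the single collapsed layer."""
    return vecadd(matvec(W, x), b)

x = [2.0, -3.0]
print(two_layers(x), one_layer(x))  # the two results match
```

The same argument applies to any number of layers: each extra linear layer just folds into the combined matrix and bias.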

Activation functions break that equivalence. They introduce non-linearity, the property that makes deep networks capable of representing and learning complex functions rather than just linear transformations. Understanding what they do and why different choices matter is one of those foundational pieces of knowledge that clarifies a lot of otherwise puzzling behavior in neural networks.

The role of an activation function is simple to describe. After each layer computes its weighted sum of inputs, the activation function is applied to that sum before passing the result to the next layer. The activation function takes a single number in and produces a single number out, applied independently to each unit in the layer. It's the non-linearity of this function that matters: if the output is not a linear function of the input, the composition of many layers becomes capable of representing a much richer class of functions than any single linear transformation could.
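As a minimal sketch of that description, the function below computes one layer: a weighted sum per unit, followed by the activation applied to each sum independently. The name `dense_layer` and the example weights are hypothetical, and `math.tanh` is used only as a stand-in non-linear function.

```python
import math

def dense_layer(weights, bias, inputs, activation):
    """One layer: a weighted sum per unit, then the activation applied to each sum."""
    outputs = []
    for row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(row, inputs)) + b  # pre-activation (weighted sum)
        outputs.append(activation(z))                    # one number in, one number out
    return outputs

# Illustrative values only.
weights = [[1.0, -1.0], [0.5, 0.5]]
bias = [0.0, 0.0]

out = dense_layer(weights, bias, [1.0, 1.0], math.tanh)
out_doubled = dense_layer(weights, bias, [2.0, 2.0], math.tanh)

# A linear layer would satisfy f(2x) == 2 * f(x); with tanh the second unit does not.
print(out, out_doubled)
```

Doubling the input does not double the output of the second unit, which is exactly the non-linearity the paragraph above refers to.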

The activation functions that dominated the early decades of neural network training were sigmoid and hyperbolic tangent, or tanh. Sigmoid takes any input and squashes it to a value between zero and one, producing an S-shaped curve. Tanh has the same shape but squashes to a range between negative one and one. Both are smooth, differentiable everywhere, and have a clear interpretation: sigmoid can be read as a probability, tanh as a rescaled version of the same curve centered on zero. They were the standard choices for decades.
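Both functions are short to write down. The sketch below uses only the Python standard library. The sign branch in `sigmoid` is an implementation choice to avoid overflow in `math.exp` for large negative inputs, and the final check uses the identity tanh(x) = 2·sigmoid(2x) − 1, which is the precise sense in which tanh is a centered sigmoid.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x); output lies between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # for negative x, exp(x) cannot overflow
    return e / (1.0 + e)

def tanh(x):
    """Hyperbolic tangent; output lies between -1 and 1."""
    return math.tanh(x)

for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid(x), tanh(x))

# tanh is a shifted and rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
x = 1.3
difference = abs(tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))
```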

Both have a significant problem that wasn't fully appreciated until networks got deeper. In the regions far from zero, both functions become very flat. The gradient, the signal used to update weights during training, becomes very small in these regions. Even at its steepest point, sigmoid's derivative is only 0.25. When gradients are multiplied together across many layers during backpropagation, small gradients become vanishingly small. Weights in early layers receive almost no update signal. This is the vanishing gradient problem, and it made training deep networks with sigmoid or tanh activations extremely difficult.
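A rough numerical illustration of the effect, under a simplifying assumption: every weight on the path is taken to be 1, so only the activation derivatives contribute to the product. Real networks also multiply by the weights at each layer, so this is a sketch of the mechanism rather than a model of any particular network. The function names are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic sigmoid (plain form; fine for the moderate inputs used here)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x == 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def gradient_through_layers(pre_activations):
    """Product of sigmoid derivatives along one path (all weights assumed to be 1)."""
    g = 1.0
    for z in pre_activations:
        g *= sigmoid_grad(z)
    return g

# Best case: every unit sits at the steepest point of the curve.
best_case = gradient_through_layers([0.0] * 10)   # equals 0.25 ** 10

# Saturated case: every unit sits out in the flat region.
saturated = gradient_through_layers([4.0] * 10)

print(best_case, saturated)
```

Even in the best case the signal shrinks by a factor of four per layer, so ten layers reduce it to under one millionth of its original size.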

ReLU, the Rectified Linear Unit, addressed this problem with striking simplicity. Its definition: if the input is positive, pass it through unchanged. If the input is negative or zero, output zero. That's it. The gradient in the positive region is always exactly one, so the activation itself never shrinks the signal, no matter how many active layers it passes through. Training deep networks became dramatically more tractable once ReLU became the standard activation function, and this contributed significantly to the deep learning renaissance of the 2010s.
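In code the definition is a one-liner. The sketch below also multiplies ReLU derivatives along a path, again considering only the activation's contribution and ignoring weights. The derivative at exactly zero is undefined mathematically; returning 0 there is a common convention and an assumption of this sketch.

```python
def relu(x):
    """Pass positive inputs through unchanged; output zero otherwise."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

def activation_gradient_product(pre_activations):
    """Product of ReLU derivatives along one path (weights ignored)."""
    g = 1.0
    for z in pre_activations:
        g *= relu_grad(z)
    return g

# 100 layers, every unit on the path active: the product stays exactly 1.
active_path = activation_gradient_product([0.3, 2.0, 5.0, 0.1] * 25)

# A single inactive unit on the path blocks the gradient entirely.
blocked_path = activation_gradient_product([0.3, -1.0, 5.0])

print(active_path, blocked_path)
```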

ReLU introduced its own problem, the dying ReLU issue covered in a separate piece in this blog. A neuron whose weighted sum is negative for every input it sees outputs zero every time and receives no gradient signal, effectively dropping out of the network. Several variants addressed this. Leaky ReLU multiplies negative inputs by a small fixed slope, so the output is a small negative value rather than zero and a gradient signal is preserved. ELU uses an exponential curve for negative inputs, producing smoother behavior near zero. Parametric ReLU learns the slope for negative inputs rather than fixing it.
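The three variants differ only in how they treat negative inputs, as the sketch below shows. The default values `slope=0.01` and `alpha=1.0` are common choices but are assumptions here, and `prelu_grad_wrt_a` is a hypothetical helper included to show why the Parametric ReLU slope can be learned.

```python
import math

def leaky_relu(x, slope=0.01):
    """Negative inputs are scaled by a small fixed slope instead of being zeroed."""
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    """Negative inputs follow alpha * (e^x - 1), which levels off at -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def prelu(x, a):
    """Same form as Leaky ReLU, but the slope `a` is a trainable parameter."""
    return x if x > 0 else a * x

def prelu_grad_wrt_a(x):
    """Derivative of prelu with respect to `a`: non-zero only for negative inputs."""
    return 0.0 if x > 0 else x

for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, leaky_relu(x), elu(x), prelu(x, a=0.25))
```

Because each variant has a non-zero derivative for negative inputs, a unit that currently produces negative sums still receives an update signal.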

The activation functions used in modern large language models have moved beyond these simpler variants. GELU, the Gaussian Error Linear Unit, multiplies each input by the probability that a standard Gaussian random variable is smaller than that input. The result is a smooth, non-monotonic function that performs better than ReLU on many natural language tasks and became a common default in transformer models such as BERT and GPT-2. SwiGLU, a variant that combines a gating mechanism with a smooth activation, is used in several frontier large language models including versions of Llama and PaLM. The specific choice of activation function at this scale has measurable effects on model quality, which is why it's an active area of architectural experimentation.
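Both can be sketched with the standard library. GELU below is the exact form x·Φ(x), written with `math.erf`. The SwiGLU sketch follows the usual formulation, SiLU(W_gate·x) multiplied elementwise by W_value·x, and leaves out the biases and the output projection that a full feed-forward block would normally include. The names `swiglu`, `W_gate`, `W_value`, and `matvec` are hypothetical, and the weights are arbitrary.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    """Logistic sigmoid, branching on sign to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def silu(x):
    """SiLU (also called Swish): x * sigmoid(x), the smooth activation inside SwiGLU."""
    return x * sigmoid(x)

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu(x, W_gate, W_value):
    """Gated unit: SiLU(W_gate @ x) multiplied elementwise by (W_value @ x)."""
    gate = [silu(z) for z in matvec(W_gate, x)]
    value = matvec(W_value, x)
    return [g * v for g, v in zip(gate, value)]

# Non-monotonic: gelu(-3) is closer to zero than gelu(-1), although -3 < -1.
print(gelu(-3.0), gelu(-1.0), gelu(1.0))

# Illustrative weights only.
W_gate = [[1.0, 0.0], [0.0, 1.0]]
W_value = [[0.0, 1.0], [1.0, 0.0]]
out = swiglu([1.0, 2.0], W_gate, W_value)
print(out)
```

The gate lets one projection of the input decide how much of the other projection passes through, which is the gating mechanism the paragraph above mentions.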

The reasons why one activation function outperforms another on a given task are not always fully understood. The empirical pattern is clear enough: GELU often outperforms ReLU on language tasks, ReLU and its variants often perform well on vision tasks, and the right choice depends on architecture and domain in ways that theory doesn't yet fully predict. This is one of those areas where neural network research still proceeds substantially through experiment rather than through principled derivation.

For practitioners, activation functions are mostly a choice made once at architecture design time and then left alone. The practical guidance is relatively settled: for transformer-based language models, GELU or SwiGLU variants are standard; for convolutional networks applied to vision, ReLU variants remain common; for networks where dying neurons have been a problem, Leaky ReLU or ELU are reasonable alternatives. The deeper value of understanding activation functions is what it reveals about the fundamental mechanism of neural networks: the non-linearity that makes deep learning possible is not an incidental detail of implementation but the essential ingredient that makes the whole enterprise work.