User:Jeblad/Standard notation (neural net)
Standard notation, as it is used within deep learning, has changed a lot since the first published works. It is undergoing some standardization, but mostly at an informal level.
Notation
Indexes
- training
- Superscript $(i)$, like $x^{(i)}$, denotes the iᵗʰ training example in a training set
- layer
- Superscript $[l]$, like $a^{[l]}$, denotes the lᵗʰ layer in a set of layers
- sequence
- Superscript $\langle t \rangle$, like $x^{\langle t \rangle}$, denotes the tᵗʰ item in a sequence of items
- 1D node
- Subscript $i$, like $a_i$, denotes the iᵗʰ node in a one-dimensional layer
- 2D node
- Subscript $i,j$ or $ij$, like $a_{i,j}$ or $a_{ij}$, denotes the node at the iᵗʰ row and jᵗʰ column in a two-dimensional layer[note 1]
- 1D weight
- Subscript $ij$ or $ji$, like $w_{ij}$ or $w_{ji}$, denotes the weight between the iᵗʰ node in the previous layer and the jᵗʰ node in the following layer[note 2]
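As a minimal sketch of how these indices map onto array axes, assuming NumPy and the conventions above (all names and sizes here are illustrative, and NumPy indexing is 0-based):

```python
import numpy as np

# Illustrative sizes: a layer of 4 nodes feeding a layer of 2 nodes,
# with 3 training examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # column i holds x^(i), the i-th training example
W = rng.standard_normal((2, 4))   # W[j, i]: weight from node i in the previous
                                  # layer to node j in the following layer
x_i = X[:, 1]                     # one training example
a = W @ x_i                       # a[j] = sum_i W[j, i] * x_i[i]
print(a.shape)                    # (2,)
```

The `W[j, i]` ordering shown here follows one of the two conventions mentioned in note 2; the other convention simply transposes the weight matrix.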
Sizes
- number of samples
- $m$ is the number of samples in the dataset
- input size
- $n_x$ is the size of the input (or the number of features)
- output size
- $n_y$ is the size of the output (or the number of classes)
- hidden units
- $n_h^{[l]}$ is the number of units in the lᵗʰ hidden layer
- number of layers
- $L$ is the number of layers in the network
- input sequence size
- $T_x$ is the size of the input sequence
- output sequence size
- $T_y$ is the size of the output sequence
- input training sequence size
- $T_x^{(i)}$ is the size of the input sequence for the iᵗʰ training sample
- output training sequence size
- $T_y^{(i)}$ is the size of the output sequence for the iᵗʰ training sample
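These sizes determine the array shapes in an implementation. A sketch assuming the column-per-sample layout common in Andrew Ng's course material (the sizes are hypothetical):

```python
import numpy as np

# Hypothetical sizes: m samples, n_x features, n_h hidden units, n_y classes.
m, n_x, n_h, n_y = 5, 3, 4, 2
rng = np.random.default_rng(1)
X = rng.standard_normal((n_x, m))     # input matrix, one column per sample
W1 = rng.standard_normal((n_h, n_x))  # weights of the hidden layer
W2 = rng.standard_normal((n_y, n_h))  # weights of the output layer (L = 2)
Y_hat = W2 @ np.tanh(W1 @ X)          # forward pass; shape (n_y, m)
print(Y_hat.shape)                    # (2, 5)
```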
Other
- cross entropy
- $\mathcal{L}(\hat{y}, y) = -\sum_i y_i \log \hat{y}_i$ is the cross-entropy loss between a prediction $\hat{y}$ and a target $y$
- elementwise sequence loss
- $\mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle})$ is the loss for the tᵗʰ element of a sequence, and by using cross entropy that is $-\bigl(y^{\langle t \rangle} \log \hat{y}^{\langle t \rangle} + (1 - y^{\langle t \rangle}) \log(1 - \hat{y}^{\langle t \rangle})\bigr)$, where the sum is over the two outcomes for classification in and out of a single class
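A minimal NumPy sketch of the two loss forms above (the function names are illustrative):

```python
import numpy as np

def cross_entropy(y_hat, y):
    # Multi-class cross entropy: -sum_i y_i * log(y_hat_i)
    return -np.sum(y * np.log(y_hat))

def binary_cross_entropy(y_hat, y):
    # Single-class case: the sum runs over the two outcomes,
    # in the class (y) and out of the class (1 - y).
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([0.0, 1.0, 0.0])      # one-hot target
y_hat = np.array([0.1, 0.8, 0.1])  # predicted class probabilities
print(round(cross_entropy(y_hat, y), 4))         # 0.2231
print(round(binary_cross_entropy(0.8, 1.0), 4))  # 0.2231
```

With a one-hot target the multi-class sum reduces to $-\log$ of the probability assigned to the true class, which is why both calls print the same value here.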
Notes
- ^ This can easily be confused with a weight index.
- ^ Michael Nielsen defines $w_{jk}$ as the weight from the kᵗʰ neuron in the previous layer to the jᵗʰ neuron in the following layer, while Andrew Ng defines it in the opposite direction.