Camera matrix

In computer vision a camera matrix or camera projection matrix is a $3\times 4$ matrix which describes the mapping of a pinhole camera from 3D points in the world to 2D points in an image.

Let $\mathbf {x}$ be a representation of a 3D point in homogeneous coordinates (a 4-dimensional vector), and let $\mathbf {y}$ be a representation of the image of this point in the pinhole camera (a 3-dimensional vector). Then the following relation holds

\mathbf {y} \sim \mathbf {C} \,\mathbf {x}

where $\mathbf {C}$ is the camera matrix and the $\,\sim$ sign implies that the left and right hand sides are equal up to a non-zero scalar multiplication.

Since the camera matrix $\mathbf {C}$ is involved in the mapping between elements of two projective spaces, it too can be regarded as a projective element. This means that it has only 11 degrees of freedom since any multiplication by a non-zero scalar results in an equivalent camera matrix.

Derivation

The geometry related to the mapping of a pinhole camera is illustrated in the figure. The figure contains the following basic objects

A 3D orthogonal coordinate system with its origin at O. This is also where the camera pinhole is located. The three axes of the coordinate system are referred to as X1, X2, X3. Axis X3 is pointing in the viewing direction of the camera.

An image plane where the 3D world is projected through the pinhole of the camera. The image plane is parallel to axes X1 and X2 and is located at distance $f$ from the origin O in the negative direction of the X3 axis. A practical implementation of a pinhole camera implies that the image plane is located such that it intersects the X3 axis at coordinate -f where f > 0.

A point P somewhere in the world at coordinate $(x_{1},x_{2},x_{3})$ relative to the axes X1,X2,X3.

The projection line of point P into the camera. This is the green line which passes through point P and the point O.

The projection of point P onto the image plane, denoted Q. This point is given by the intersection of the projection line (green) and the image plane. In any practical situation we can assume that $x_{3}>0$, which means that the intersection point is well defined.

There is also a 2D coordinate system in the image plane, with the origin where the X3 axis intersects the image plane and with axes Y1 and Y2 which are parallel to X1 and X2, respectively. The coordinates of point Q relative to this coordinate system are $(y_{1},y_{2})$ .

The pinhole of the camera, through which all projection lines must pass, is assumed to be infinitely small. In the following this point in 3D space is referred to as the camera focal point or the camera center.

Next we want to understand how the coordinates $(y_{1},y_{2})$ of point Q depend on the coordinates $(x_{1},x_{2},x_{3})$ of point P. This can be done with the help of the following figure which shows the same scene as the previous figure but now from above, looking down in the negative direction of the X2 axis.

In this figure we see two similar triangles, both having parts of the projection line (green) as their hypotenuses. The catheti of the left triangle are $-y_{1}$ and f and the catheti of the right triangle are $x_{1}$ and $x_{3}$ . Since the two triangles are similar it follows that

{\frac {-y_{1}}{f}}={\frac {x_{1}}{x_{3}}}

or

y_{1}=-{\frac {f\,x_{1}}{x_{3}}}

A similar investigation, looking in the negative direction of the X1 axis gives

{\frac {-y_{2}}{f}}={\frac {x_{2}}{x_{3}}}

or

y_{2}=-{\frac {f\,x_{2}}{x_{3}}}

This can be summarized as

{\begin{pmatrix}y_{1}\\y_{2}\end{pmatrix}}=-{\frac {f}{x_{3}}}{\begin{pmatrix}x_{1}\\x_{2}\end{pmatrix}}

which is an expression that describes the relation between the 3D coordinates $(x_{1},x_{2},x_{3})$ of point P and its image coordinates $(y_{1},y_{2})$ given by point Q in the image plane.
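As a minimal numerical sketch of this relation, the following Python snippet evaluates the projection directly from the formula. The function name `pinhole_project` and the sample values are illustrative assumptions, not part of the derivation.

```python
def pinhole_project(point, f):
    """Project a 3D point (x1, x2, x3), given in camera coordinates,
    onto the image plane located at X3 = -f (so the image is rotated 180 degrees)."""
    x1, x2, x3 = point
    if x3 <= 0:
        # The derivation assumes the point lies in front of the camera.
        raise ValueError("x3 must be positive")
    scale = -f / x3
    return (scale * x1, scale * x2)

# Hypothetical sample point and focal length.
q = pinhole_project((1.0, 2.0, 4.0), f=2.0)
print(q)  # (-0.5, -1.0)
```

Both image coordinates change sign relative to the 3D coordinates, which is the rotation of the image discussed next.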

Before continuing, it should be noted that the mapping from 3D to 2D coordinates described by a pinhole camera is a perspective projection followed by a $180^{\circ }$ rotation in the image plane. This corresponds to how a real pinhole camera operates: the resulting image is rotated $180^{\circ }$ , the relative size of projected objects depends on their distance to the focal point, and the overall size of the image depends on the distance f between the image plane and the focal point.

The next step is to rewrite the last expression in terms of homogeneous coordinates. Instead of the 2D vector $(y_{1},y_{2})$ we consider the projective element (a 3D vector) $(y_{1},y_{2},1)$ and instead of equality we consider equality up to scaling by a non-zero number, denoted $\,\sim$ . First, we write the homogeneous image coordinates as expressions in the usual 3D coordinates.

{\begin{pmatrix}y_{1}\\y_{2}\\1\end{pmatrix}}\sim -{\frac {f}{x_{3}}}{\begin{pmatrix}x_{1}\\x_{2}\\-{\frac {x_{3}}{f}}\end{pmatrix}}\sim {\begin{pmatrix}x_{1}\\x_{2}\\-{\frac {x_{3}}{f}}\end{pmatrix}}

Finally, the 3D coordinates are also expressed in a homogeneous representation, and this is how the camera matrix appears:

{\begin{pmatrix}y_{1}\\y_{2}\\1\end{pmatrix}}\sim {\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&{\frac {-1}{f}}&0\end{pmatrix}}\,{\begin{pmatrix}x_{1}\\x_{2}\\x_{3}\\1\end{pmatrix}}

or

\mathbf {y} \sim \mathbf {C} \,\mathbf {x}

where $\mathbf {C}$ is the camera matrix, which here is given by

\mathbf {C} ={\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&{\frac {-1}{f}}&0\end{pmatrix}}
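The homogeneous form can be checked numerically. The sketch below, with hypothetical helper names `apply_camera` and `dehomogenize` and arbitrary sample values, multiplies this camera matrix by a homogeneous 3D point and divides by the third component to recover $(y_{1},y_{2})$ .

```python
def apply_camera(C, x):
    """Multiply a 3x4 matrix C (list of rows) by a homogeneous 4-vector x."""
    return [sum(C[i][j] * x[j] for j in range(4)) for i in range(3)]

def dehomogenize(y):
    """Scale a homogeneous 3-vector so that its last component is 1 and drop it."""
    return (y[0] / y[2], y[1] / y[2])

f = 2.0  # illustrative focal length
C = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, -1.0 / f, 0.0]]

x = [1.0, 2.0, 4.0, 1.0]          # homogeneous form of the point (1, 2, 4)
y = apply_camera(C, x)            # [1.0, 2.0, -2.0]
image_point = dehomogenize(y)
print(image_point)                # (-0.5, -1.0)
```

The result agrees with evaluating $-f\,x_{1}/x_{3}$ and $-f\,x_{2}/x_{3}$ directly for the same point.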

The mapping between the 3D world and the 2D image presented here results in an image which is rotated $180^{\circ }$ . In order to produce an unrotated image, which is what we expect from a camera, there are two possibilities:

Rotate the coordinate system in the image plane $180^{\circ }$ (in either direction). This is how any practical implementation of a pinhole camera solves the problem: for a photographic camera we rotate the image before looking at it, and for a digital camera we read out the pixels in such an order that the image becomes rotated.

Place the image plane so that it intersects the X3 axis at f instead of at -f and rework the previous calculations. This would generate a virtual image plane since it cannot be implemented in practice, but this provides a theoretical camera which may be simpler to analyse than the real one.

In both cases the resulting mapping from 3D coordinates to 2D image coordinates is given by

{\begin{pmatrix}y_{1}\\y_{2}\end{pmatrix}}={\frac {f}{x_{3}}}{\begin{pmatrix}x_{1}\\x_{2}\end{pmatrix}}

(same as before except no minus sign), and the corresponding camera matrix now becomes

\mathbf {C} ={\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&{\frac {1}{f}}&0\end{pmatrix}}\sim {\begin{pmatrix}f&0&0&0\\0&f&0&0\\0&0&1&0\end{pmatrix}}

The last step is a consequence of $\mathbf {C}$ itself being a projective element.
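That the two forms are equivalent can be illustrated with a short sketch: the matrices differ only by the scalar factor f, so they produce the same image point after the homogeneous coordinates are normalized. The helper name `project` and the numbers are assumptions chosen for illustration.

```python
def project(C, x):
    """Apply a 3x4 camera matrix to a homogeneous 4-vector and dehomogenize."""
    y = [sum(C[i][j] * x[j] for j in range(4)) for i in range(3)]
    return (y[0] / y[2], y[1] / y[2])

f = 2.0  # illustrative focal length

C_a = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0 / f, 0.0]]

# C_b equals f * C_a, so it represents the same projective element.
C_b = [[f, 0.0, 0.0, 0.0],
       [0.0, f, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0]]

x = [1.0, 2.0, 4.0, 1.0]
p_a = project(C_a, x)
p_b = project(C_b, x)
print(p_a, p_b)  # (0.5, 1.0) (0.5, 1.0)
```

The same reasoning explains why a $3\times 4$ camera matrix has 11 rather than 12 degrees of freedom: one degree is absorbed by the arbitrary overall scale.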

The camera matrix derived here may appear trivial in the sense that it contains very few non-zero elements. This depends to a large extent on the particular coordinate systems which have been chosen for the 3D and 2D points. In practice, however, other forms of camera matrices are common, as will be shown below.

The camera focal point

The camera matrix $\mathbf {C}$ derived in the previous section has a null space which is spanned by the vector

\mathbf {n} ={\begin{pmatrix}0\\0\\0\\1\end{pmatrix}}

This is also the homogeneous representation of the 3D point which has coordinates (0,0,0), that is, the camera focal point O. This means that the focal point (and only this point) cannot be mapped to a particular point in the image plane by the camera. This is consistent with the fact that the projection line becomes ambiguous in this case.
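A small sketch makes the null-space property concrete: multiplying the camera matrix by the homogeneous representation of the focal point gives the zero vector, which does not correspond to any image point. The focal length value and the helper name `apply_camera` are illustrative assumptions.

```python
def apply_camera(C, x):
    """Multiply a 3x4 matrix C (list of rows) by a homogeneous 4-vector x."""
    return [sum(C[i][j] * x[j] for j in range(4)) for i in range(3)]

f = 2.0  # illustrative focal length
C = [[f, 0.0, 0.0, 0.0],
     [0.0, f, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]

n = [0.0, 0.0, 0.0, 1.0]   # homogeneous representation of the focal point O
result = apply_camera(C, n)
print(result)              # [0.0, 0.0, 0.0]

# The zero vector is not a valid projective element: dividing by its
# third component to obtain image coordinates is undefined.
is_mappable = result[2] != 0.0
```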

Normalized camera matrix and normalized image coordinates

The camera matrix derived above can be simplified even further if we assume that f = 1:

\mathbf {C} _{0}={\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\end{pmatrix}}=\left({\begin{array}{c|c}\mathbf {I} &\mathbf {0} \end{array}}\right)

where $\mathbf {I}$ here denotes a $3\times 3$ identity matrix. Note that the $3\times 4$ matrix $\mathbf {C} _{0}$ is here written as a concatenation of a $3\times 3$ matrix and a 3-dimensional vector. The camera matrix $\mathbf {C} _{0}$ is sometimes referred to as a canonical form.

So far all points in the 3D world have been represented in a camera centered coordinate system, that is, a coordinate system which has its origin at the camera focal point. In practice however, the 3D points may be represented in terms of coordinates relative to an arbitrary coordinate system (X1',X2',X3'). Assuming that the camera coordinate axes (X1,X2,X3) and the axes (X1',X2',X3') are of Euclidean type (orthogonal and isotropic), there is a unique Euclidean 3D transformation (rotation and translation) between the two coordinate systems.

The two operations of rotation and translation of 3D coordinates can be represented as the two $4\times 4$ matrices

\left({\begin{array}{c|c}\mathbf {R} &\mathbf {0} \\\hline \mathbf {0} &1\end{array}}\right)

and

\left({\begin{array}{c|c}\mathbf {I} &\mathbf {t} \\\hline \mathbf {0} &1\end{array}}\right)

where $\mathbf {R}$ is a $3\times 3$ rotation matrix and $\mathbf {t}$ is a 3-dimensional translation vector. When the first matrix is multiplied onto the homogeneous representation of a 3D point, the result is the homogeneous representation of the rotated point; the second matrix instead performs a translation. Performing the two operations in sequence, first the rotation and then the translation, gives a combined rotation and translation matrix

\left({\begin{array}{c|c}\mathbf {R} &\mathbf {t} \\\hline \mathbf {0} &1\end{array}}\right)

Assuming that $\mathbf {R}$ and $\mathbf {t}$ are precisely the rotation and translation which relate the two coordinate systems (X1,X2,X3) and (X1',X2',X3') above, this implies that

\mathbf {x} =\left({\begin{array}{c|c}\mathbf {R} &\mathbf {t} \\\hline \mathbf {0} &1\end{array}}\right)\mathbf {x} '

where $\mathbf {x} '$ is the homogeneous representation of the point P in the coordinate system (X1',X2',X3').

Assuming also that the camera matrix is given by $\mathbf {C} _{0}$ , the mapping from the coordinates in the (X1',X2',X3') system to homogeneous image coordinates becomes

\mathbf {y} \sim \mathbf {C} _{0}\,\mathbf {x} =\left({\begin{array}{c|c}\mathbf {I} &\mathbf {0} \end{array}}\right)\,\left({\begin{array}{c|c}\mathbf {R} &\mathbf {t} \\\hline \mathbf {0} &1\end{array}}\right)\mathbf {x} '=\left({\begin{array}{c|c}\mathbf {R} &\mathbf {t} \end{array}}\right)\,\mathbf {x} '

Consequently, the camera matrix which relates points in the coordinate system (X1',X2',X3') to image coordinates is

\mathbf {C} _{N}=\left({\begin{array}{c|c}\mathbf {R} &\mathbf {t} \end{array}}\right)

a concatenation of a 3D rotation matrix and a 3-dimensional translation vector.

This type of camera matrix is referred to as a normalized camera matrix. It assumes a focal length of 1 and that image coordinates are measured in a coordinate system whose origin is located at the intersection between axis X3 and the image plane and which has the same units as the 3D coordinate system. The resulting image coordinates are referred to as normalized image coordinates.
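The construction of a normalized camera matrix can be sketched as follows. The rotation (a quarter turn about the X3 axis), the translation, and the sample point are arbitrary assumptions chosen so that the arithmetic is easy to follow; `normalized_camera` and `project` are hypothetical helper names.

```python
def normalized_camera(R, t):
    """Concatenate a 3x3 rotation R and a translation t into the 3x4 matrix (R | t)."""
    return [R[i] + [t[i]] for i in range(3)]

def project(C, x):
    """Apply a 3x4 camera matrix to a homogeneous 4-vector and dehomogenize."""
    y = [sum(C[i][j] * x[j] for j in range(4)) for i in range(3)]
    return (y[0] / y[2], y[1] / y[2])

# Assumed example: rotate 90 degrees about X3, then move 5 units along X3.
R = [[0.0, -1.0, 0.0],
     [1.0,  0.0, 0.0],
     [0.0,  0.0, 1.0]]
t = [0.0, 0.0, 5.0]

C_N = normalized_camera(R, t)

# The point (1, 0, 0) in the (X1', X2', X3') system becomes (0, 1, 5)
# in camera coordinates, which projects to (0/5, 1/5).
x_prime = [1.0, 0.0, 0.0, 1.0]
y = project(C_N, x_prime)
print(y)  # (0.0, 0.2)
```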

The camera focal point

The null space of the normalized camera matrix $\mathbf {C} _{N}$ described above is spanned by the 4-dimensional vector

\mathbf {n} ={\begin{pmatrix}-\mathbf {R} ^{-1}\,\mathbf {t} \\1\end{pmatrix}}={\begin{pmatrix}{\tilde {\mathbf {n} }}\\1\end{pmatrix}}

These are, again, the coordinates of the focal point, but now relative to the (X1',X2',X3') system. This can be seen by applying first the rotation and then the translation to the 3-dimensional vector ${\tilde {\mathbf {n} }}$ , which gives $\mathbf {R} \,{\tilde {\mathbf {n} }}+\mathbf {t} =\mathbf {0}$ , that is, the 3D coordinates (0,0,0).

This implies that the focal point is always a null vector (in its homogeneous representation) of the camera matrix, provided that it is represented in terms of its coordinates relative to the same 3D coordinate system as the camera matrix refers to.

The normalized camera matrix $\mathbf {C} _{N}$ can now be written as

\mathbf {C} _{N}=\mathbf {R} \,\left({\begin{array}{c|c}\mathbf {I} &\mathbf {R} ^{-1}\,\mathbf {t} \end{array}}\right)=\mathbf {R} \,\left({\begin{array}{c|c}\mathbf {I} &-{\tilde {\mathbf {n} }}\end{array}}\right)

where ${\tilde {\mathbf {n} }}$ is the 3D coordinates of the focal point relative to the (X1',X2',X3') system.
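The focal point can be recovered numerically from $\mathbf {R}$ and $\mathbf {t}$ . The sketch below uses the fact that the inverse of a rotation matrix is its transpose, so ${\tilde {\mathbf {n} }}=-\mathbf {R} ^{T}\,\mathbf {t}$ , and then checks that the homogeneous focal point is mapped to the zero vector. The values of `R` and `t` and the function name `focal_point` are illustrative assumptions.

```python
def focal_point(R, t):
    """Return n_tilde = -R^T t, the camera center in the (X1', X2', X3') system.
    Uses R^{-1} = R^T, which holds for any rotation matrix."""
    return [-sum(R[j][i] * t[j] for j in range(3)) for i in range(3)]

# Assumed example: quarter turn about X3 and an arbitrary translation.
R = [[0.0, -1.0, 0.0],
     [1.0,  0.0, 0.0],
     [0.0,  0.0, 1.0]]
t = [1.0, 2.0, 5.0]

n_tilde = focal_point(R, t)
print(n_tilde)  # [-2.0, 1.0, -5.0]

# Check the null-space property: (R | t) applied to (n_tilde, 1) is zero.
C_N = [R[i] + [t[i]] for i in range(3)]
n = n_tilde + [1.0]
residual = [sum(C_N[i][j] * n[j] for j in range(4)) for i in range(3)]
print(residual)  # [0.0, 0.0, 0.0]
```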

General camera matrix

Given the mapping produced by a normalized camera matrix, the resulting normalized image coordinates can be transformed by means of an arbitrary 2D homography. This includes 2D translations and rotations as well as scaling (isotropic and anisotropic) but also general 2D perspective transformations. Such a transformation can be represented as a $3\times 3$ matrix $\mathbf {H}$ which maps the homogeneous normalized image coordinates $\mathbf {y}$ to the homogeneous transformed image coordinates $\mathbf {y} '$ :

\mathbf {y} '=\mathbf {H} \,\mathbf {y}

Inserting the above expression for the normalized image coordinates in terms of the 3D coordinates gives

\mathbf {y} '=\mathbf {H} \,\mathbf {C} _{N}\,\mathbf {x} '

This produces the most general form of camera matrix

\mathbf {C} =\mathbf {H} \,\mathbf {C} _{N}=\mathbf {H} \,\left({\begin{array}{c|c}\mathbf {R} &\mathbf {t} \end{array}}\right)
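As a hedged illustration, the sketch below composes a general camera matrix from a normalized one. The particular $\mathbf {H}$ used here, a scaling by 100 combined with a translation by (320, 240), is only one assumed example of a 2D homography; it could, for instance, model a conversion from normalized coordinates to pixel coordinates. All numerical values and helper names are assumptions.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def project(C, x):
    """Apply a 3x4 camera matrix to a homogeneous 4-vector and dehomogenize."""
    y = [sum(C[i][j] * x[j] for j in range(4)) for i in range(3)]
    return (y[0] / y[2], y[1] / y[2])

# Assumed homography: isotropic scaling followed by a 2D translation.
H = [[100.0,   0.0, 320.0],
     [  0.0, 100.0, 240.0],
     [  0.0,   0.0,   1.0]]

# Assumed normalized camera matrix (R | t): quarter turn about X3, t = (0, 0, 5).
C_N = [[0.0, -1.0, 0.0, 0.0],
       [1.0,  0.0, 0.0, 0.0],
       [0.0,  0.0, 1.0, 5.0]]

C = matmul(H, C_N)   # general 3x4 camera matrix

x_prime = [1.0, 0.0, 0.0, 1.0]
pixel = project(C, x_prime)
print(pixel)  # (320.0, 260.0)
```

The normalized image point for this input is (0, 0.2); scaling by 100 and adding (320, 240) gives the same result, (320, 260).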
