
L1-norm principal component analysis


L1-norm principal component analysis (L1-PCA) is a method for robust multivariate data analysis.[1] L1-PCA is often preferred over standard principal component analysis (PCA) when the analyzed data may contain outliers (faulty points, irregular corruptions).[2][3][4]

PCA seeks a collection of orthogonal directions (principal components) that define a subspace wherein data representation is maximized.[5][6][7] Standard PCA quantifies data representation as the aggregate L2-norm of the data-point projections into the subspace or, equivalently, minimizes the aggregate Euclidean distance of the original points from their subspace-projected representations. Therefore, standard PCA is also referred to as L2-PCA, mostly to distinguish it from L1-PCA.[8] In PCA and L1-PCA, the number of principal components (PCs) is lower than the rank of the analyzed matrix, which coincides with the dimensionality of the space spanned by the original data points. Therefore, PCA is commonly employed for dimensionality reduction, e.g., for the purpose of denoising or compression. Among the advantages of PCA that have contributed to its high popularity are its low-cost implementation by means of singular-value decomposition (SVD)[9] and the quality of its approximation to the maximum-variance subspace of the data source, for certain data distributions such as the multivariate Normal, when it operates on sufficiently many nominal data points.

However, modern big data sets often include grossly corrupted, faulty points, commonly referred to as outliers.[10] Regretfully, standard PCA is known to be very sensitive to outliers, even when they constitute a small fraction of the processed data.[11] The reason is that the L2-norm formulation of PCA places squared emphasis on the magnitude of each coordinate of each data point, ultimately overweighting peripheral points, such as outliers. By contrast, following an L1-norm formulation, L1-PCA places linear emphasis on the coordinates of each data point, thus counteracting and effectively restraining outliers.[12]

Formulation

Consider matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ consisting of $N$ $D$-dimensional data points (one per column). Define $r = \operatorname{rank}(\mathbf{X})$. For integer $K$ such that $K < r$, L1-PCA is formulated as:[1]

$$\mathbf{Q}_{L1} = \underset{\mathbf{Q} \in \mathbb{R}^{D \times K},\ \mathbf{Q}^\top \mathbf{Q} = \mathbf{I}_K}{\operatorname{argmax}}\ \|\mathbf{X}^\top \mathbf{Q}\|_1. \qquad (1)$$

For $K = 1$, (1) simplifies to finding the L1-norm principal component (L1-PC) of $\mathbf{X}$ as

$$\mathbf{q}_{L1} = \underset{\mathbf{q} \in \mathbb{R}^{D},\ \|\mathbf{q}\|_2 = 1}{\operatorname{argmax}}\ \|\mathbf{X}^\top \mathbf{q}\|_1. \qquad (2)$$

In (1)-(2), the L1-norm $\|\cdot\|_1$ returns the sum of the absolute values of the entries of its argument and the L2-norm $\|\cdot\|_2$ returns the square root of the sum of the squared entries of its argument. If one substitutes $\|\cdot\|_1$ in (1) by the Frobenius/L2-norm $\|\cdot\|_F$, then the problem becomes standard PCA and it is solved by the matrix $\mathbf{Q}$ that contains the $K$ dominant left singular vectors of $\mathbf{X}$ (i.e., the singular vectors that correspond to the $K$ highest singular values).

The maximization metric in (1) can be expanded as

$$\|\mathbf{X}^\top \mathbf{Q}\|_1 = \sum_{n=1}^{N} \|\mathbf{Q}^\top \mathbf{x}_n\|_1 = \underset{\mathbf{B} \in \{\pm 1\}^{N \times K}}{\max}\ \operatorname{tr}\!\left(\mathbf{Q}^\top \mathbf{X} \mathbf{B}\right). \qquad (3)$$
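For illustration, the metrics in (1) and (3) can be evaluated directly with NumPy. The following minimal sketch (variable names are illustrative and not from [1]) computes the L1 objective attained by the standard-PCA subspace and contrasts it with the Frobenius objective that standard PCA itself maximizes:

    import numpy as np

    rng = np.random.default_rng(0)
    D, N, K = 5, 20, 2
    X = rng.standard_normal((D, N))               # columns are the N data points

    U, _, _ = np.linalg.svd(X, full_matrices=False)
    Q = U[:, :K]                                  # standard-PCA subspace: K dominant left singular vectors

    l1_metric = np.abs(X.T @ Q).sum()             # ||X^T Q||_1, the L1-PCA objective in (1)
    l2_metric = np.linalg.norm(X.T @ Q, 'fro')    # Frobenius objective maximized by standard PCA
    print(l1_metric, l2_metric)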

Solution

For any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $m \geq n$, define $\Phi(\mathbf{A})$ as the nearest (in the L2-norm sense) matrix to $\mathbf{A}$ that has orthonormal columns. That is, define

$$\Phi(\mathbf{A}) = \underset{\mathbf{Q} \in \mathbb{R}^{m \times n},\ \mathbf{Q}^\top \mathbf{Q} = \mathbf{I}_n}{\operatorname{argmin}}\ \|\mathbf{A} - \mathbf{Q}\|_F. \qquad (4)$$

The Procrustes Theorem[13][14] states that if $\mathbf{A}$ has thin SVD $\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$, with $\mathbf{U} \in \mathbb{R}^{m \times n}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{n \times n}$, and $\mathbf{V} \in \mathbb{R}^{n \times n}$, then $\Phi(\mathbf{A}) = \mathbf{U} \mathbf{V}^\top$.
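A minimal NumPy sketch of the Procrustes step in (4) follows (the function name procrustes_phi is illustrative):

    import numpy as np

    def procrustes_phi(A):
        # Nearest matrix with orthonormal columns, per (4): U V^T from the thin SVD of A.
        U, _, Vt = np.linalg.svd(A, full_matrices=False)
        return U @ Vt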

Markopoulos et al.[1] showed that, if $\mathbf{B}_{\rm BNM}$ is the exact solution to the binary nuclear-norm maximization (BNM) problem

$$\mathbf{B}_{\rm BNM} = \underset{\mathbf{B} \in \{\pm 1\}^{N \times K}}{\operatorname{argmax}}\ \|\mathbf{X} \mathbf{B}\|_{*}, \qquad (5)$$

then

$$\mathbf{Q}_{L1} = \Phi(\mathbf{X} \mathbf{B}_{\rm BNM}) \qquad (6)$$

is the exact solution to L1-PCA in (1). The nuclear norm $\|\cdot\|_{*}$ in (5) returns the summation of the singular values of its matrix argument and can be calculated by means of standard SVD. Moreover, it holds that, given the solution to L1-PCA, $\mathbf{Q}_{L1}$, the solution to BNM can be obtained as

$$\mathbf{B}_{\rm BNM} = \operatorname{sgn}(\mathbf{X}^\top \mathbf{Q}_{L1}),$$

where $\operatorname{sgn}(\cdot)$ returns the $\pm 1$-sign matrix of its matrix argument (with no loss of generality, we can consider $\operatorname{sgn}(0) = 1$). In addition, it follows that $\|\mathbf{X} \mathbf{B}_{\rm BNM}\|_{*} = \|\mathbf{X}^\top \mathbf{Q}_{L1}\|_1$. Clearly, BNM in (5) is a combinatorial problem over antipodal binary variables. Therefore, its exact solution can be found through exhaustive evaluation of all $2^{NK}$ elements of its feasibility set, with asymptotic cost $\mathcal{O}(2^{NK})$. Therefore, L1-PCA can also be solved, through BNM, with cost $\mathcal{O}(2^{NK})$ (exponential in the product of the number of data points with the number of the sought-after components).

For the special case of $K = 1$ (single L1-PC of $\mathbf{X}$), BNM takes the binary-quadratic-maximization (BQM) form

$$\mathbf{b}_{\rm BQM} = \underset{\mathbf{b} \in \{\pm 1\}^{N}}{\operatorname{argmax}}\ \mathbf{b}^\top \mathbf{X}^\top \mathbf{X}\, \mathbf{b}. \qquad (7)$$

The transition from (5) to (7) for $K = 1$ holds true, since the unique singular value of $\mathbf{X} \mathbf{b}$ is equal to $\|\mathbf{X} \mathbf{b}\|_2 = \sqrt{\mathbf{b}^\top \mathbf{X}^\top \mathbf{X}\, \mathbf{b}}$, for every $\mathbf{b}$. Then, if $\mathbf{b}_{\rm BQM}$ is the solution to BQM in (7), it holds that

$$\mathbf{q}_{L1} = \frac{\mathbf{X} \mathbf{b}_{\rm BQM}}{\|\mathbf{X} \mathbf{b}_{\rm BQM}\|_2} \qquad (8)$$

is the exact L1-PC of $\mathbf{X}$, as defined in (2). In addition, it holds that $\mathbf{b}_{\rm BQM} = \operatorname{sgn}(\mathbf{X}^\top \mathbf{q}_{L1})$ and $\|\mathbf{X}^\top \mathbf{q}_{L1}\|_1 = \|\mathbf{X} \mathbf{b}_{\rm BQM}\|_2$.
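For the $K = 1$ case, (7)-(8) can be prototyped by brute-force enumeration of the sign vector, which is tractable only for small $N$; the helper name l1_pc_exact below is illustrative:

    import itertools
    import numpy as np

    def l1_pc_exact(X):
        # Exact single L1-PC of X (data points as columns) via exhaustive BQM (7) and normalization (8).
        N = X.shape[1]
        G = X.T @ X                               # Gram matrix appearing in (7)
        best_val, best_b = -np.inf, None
        for signs in itertools.product([-1.0, 1.0], repeat=N):
            b = np.asarray(signs)
            val = b @ G @ b                       # BQM metric b^T X^T X b
            if val > best_val:
                best_val, best_b = val, b
        q = X @ best_b
        return q / np.linalg.norm(q)              # eq. (8)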

Algorithms

Exact Solution

As shown above, the exact solution to L1-PCA can be obtained by the following two-step process:

1. Solve the problem in (5) to obtain $\mathbf{B}_{\rm BNM}$.
2. Apply SVD on $\mathbf{X} \mathbf{B}_{\rm BNM}$ to obtain $\mathbf{Q}_{L1} = \Phi(\mathbf{X} \mathbf{B}_{\rm BNM})$, as in (6).

BNM in (5) can be solved by exhaustive search of its feasibility set with cost $\mathcal{O}(2^{NK})$. Also, it can be solved with polynomial cost $\mathcal{O}(N^{rK - K + 1})$, when the rank $r$ of $\mathbf{X}$ is considered a constant with respect to $N$.[1][15]
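The two-step process can likewise be prototyped for small $N$ and $K$ by enumerating every sign matrix in (5) and then applying the SVD step of (6). The sketch below (the function name l1_pca_exact is illustrative) has cost exponential in $NK$ and is intended only to make the procedure concrete:

    import itertools
    import numpy as np

    def l1_pca_exact(X, K):
        # Step 1: exhaustive BNM (5); Step 2: SVD-based Procrustes step (6).
        N = X.shape[1]
        best_val, best_B = -np.inf, None
        for signs in itertools.product([-1.0, 1.0], repeat=N * K):
            B = np.asarray(signs).reshape(N, K)
            val = np.linalg.norm(X @ B, ord='nuc')    # nuclear norm ||X B||_*
            if val > best_val:
                best_val, best_B = val, B
        U, _, Vt = np.linalg.svd(X @ best_B, full_matrices=False)
        return U @ Vt                                 # Q_L1 = Phi(X B_BNM)

For realistic $N$ and $K$ the enumeration is infeasible, which motivates the approximate solvers below.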


Approximate/Efficient Solvers

In 2008, Kwak[12] proposed an iterative algorithm for the approximate solution of L1-PCA for $K = 1$. This iterative method was later generalized for $K > 1$ components.[16] Another algorithm was proposed for solving BNM in (5) by means of semi-definite programming (SDP). Most recently, L1-PCA in (1) and BNM in (5) were solved efficiently by means of bit-flipping iterations (the L1-BF algorithm).[8]

L1-BF Algorithm

function L1BF(X, K):
    Initialize B ∈ {±1}^(N×K) and compute Ω ← ||X B||_*
    Set the candidate-bit set L ← {1, 2, ..., NK}
    Until termination (or a maximum number of iterations)
        Ω_best ← Ω, x_best ← none
        For each bit index x ∈ L
            B′ ← B, with the x-th entry of B negated         // flip bit x
            Ω′ ← ||X B′||_*                                   // calculated by SVD or faster (see [8])
            if Ω′ > Ω_best
                Ω_best ← Ω′, x_best ← x
            end
        if x_best = none                                      // no bit was flipped
            if L = {1, 2, ..., NK}
                terminate
            else
                L ← {1, 2, ..., NK}                           // re-examine all bits before terminating
        else
            Negate the x_best-th entry of B, set Ω ← Ω_best, and remove x_best from L
    return Q_L1 ← Φ(X B)                                      // SVD step, as in (6)
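A compact NumPy sketch of the greedy bit-flipping idea follows; it is not the published L1-BF implementation (which, per the note in the pseudocode, can evaluate the metric by SVD or by faster updates[8]), and the function name l1_pca_bitflip is illustrative:

    import numpy as np

    def l1_pca_bitflip(X, K, max_iter=1000, seed=0):
        # Greedy single-bit-flipping ascent on ||X B||_*, followed by the SVD step of (6).
        N = X.shape[1]
        rng = np.random.default_rng(seed)
        B = rng.choice([-1.0, 1.0], size=(N, K))
        metric = np.linalg.norm(X @ B, ord='nuc')
        for _ in range(max_iter):
            best_gain, best_idx = 0.0, None
            for n in range(N):
                for k in range(K):
                    B[n, k] *= -1                 # tentatively flip bit (n, k)
                    val = np.linalg.norm(X @ B, ord='nuc')
                    B[n, k] *= -1                 # undo the flip
                    if val - metric > best_gain:
                        best_gain, best_idx = val - metric, (n, k)
            if best_idx is None:                  # no single flip increases the metric
                break
            B[best_idx] *= -1                     # keep the best flip
            metric += best_gain
        U, _, Vt = np.linalg.svd(X @ B, full_matrices=False)
        return U @ Vt                             # estimate of Q_L1, as in (6)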

References

  1. ^ a b c d Markopoulos, Panos P.; Karystinos, George N.; Pados, Dimitris A. (October 2014). "Optimal Algorithms for L1-subspace Signal Processing". IEEE Transactions on Signal Processing. 62 (19): 5046–5058. doi:10.1109/TSP.2014.2338077.
  2. ^ Barrodale, I. (1968). "L1 Approximation and the Analysis of Data". Applied Statistics. 17 (1): 51. doi:10.2307/2985267.
  3. ^ Barnett, Vic; Lewis, Toby (1994). Outliers in statistical data (3rd ed.). Chichester: Wiley. ISBN 0471930946.
  4. ^ Kanade, T.; Ke, Qifa (June 2005). "Robust L1 Norm Factorization in the Presence of Outliers and Missing Data by Alternative Convex Programming". 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). IEEE. doi:10.1109/CVPR.2005.309.
  5. ^ Jolliffe, I.T. (2004). Principal component analysis (2nd ed.). New York: Springer. ISBN 0387954422.
  6. ^ Bishop, Christopher M. (2007). Pattern recognition and machine learning (Corr. printing. ed.). New York: Springer. ISBN 978-0-387-31073-2.
  7. ^ Pearson, Karl (8 June 2010). "On Lines and Planes of Closest Fit to Systems of Points in Space". The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. 2 (11): 559–572. doi:10.1080/14786440109462720.
  8. ^ Markopoulos, Panos P.; Kundu, Sandipan; Chamadia, Shubham; Pados, Dimitris A. (15 August 2017). "Efficient L1-Norm Principal-Component Analysis via Bit Flipping". IEEE Transactions on Signal Processing. 65 (16): 4252–4264. doi:10.1109/TSP.2017.2708023.
  9. ^ Golub, Gene H. (April 1973). "Some Modified Matrix Eigenvalue Problems". SIAM Review. 15 (2): 318–334. doi:10.1137/1015032.
  10. ^ Barnett, Vic; Lewis, Toby (1994). Outliers in statistical data (3rd ed.). Chichester: Wiley. ISBN 0471930946.
  11. ^ Candès, Emmanuel J.; Li, Xiaodong; Ma, Yi; Wright, John (1 May 2011). "Robust principal component analysis?". Journal of the ACM. 58 (3): 1–37. doi:10.1145/1970392.1970395.
  12. ^ a b Kwak, N. (September 2008). "Principal Component Analysis Based on L1-Norm Maximization". IEEE Transactions on Pattern Analysis and Machine Intelligence. 30 (9): 1672–1680. doi:10.1109/TPAMI.2008.114.
  13. ^ Eldén, Lars; Park, Haesun (1 June 1999). "A Procrustes problem on the Stiefel manifold". Numerische Mathematik. 82 (4): 599–619. doi:10.1007/s002110050432.
  14. ^ Schönemann, Peter H. (March 1966). "A generalized solution of the orthogonal procrustes problem". Psychometrika. 31 (1): 1–10. doi:10.1007/BF02289451.
  15. ^ Markopoulos, PP; Kundu, S; Chamadia, S; Tsagkarakis, N; Pados, DA (2018). "Outlier-Resistant Data Processing with L1-Norm Principal Component Analysis". Advances in Principal Component Analysis (Springer, Singapore). doi:10.1007/978-981-10-6704-4_6.
  16. ^ Nie, F; Huang, H; Ding, C; Luo, Dijun; Wang, H (July 2011). "Robust principal component analysis with non-greedy l1-norm maximization". 22nd International Joint Conference on Artificial Intelligence: 1433–1438.