Draft:Nonlinear Multimodal Embeddings
Nonlinear multimodal embeddings are representation learning techniques used to project data from different modalities, such as text, images, audio, or video, into a shared latent space, so that semantically similar content across modalities is mapped to nearby points. Unlike linear approaches, nonlinear multimodal embeddings employ nonlinear transformations that can capture complex relationships between different data types, enabling more effective cross-modal retrieval, fusion, and analysis.
Motivation
While traditional multimodal learning methods demonstrate the value of integrating different data types, they often rely on linear projections that struggle to capture the complex, nonlinear relationships inherent in cross-modal data. Real-world connections between modalities such as images and text rarely follow simple linear patterns. For instance, an image might relate to certain words through abstract concepts, metaphorical relationships, or cultural associations that linear models cannot adequately represent.[1] Nonlinear multimodal embeddings address this limitation by employing more sophisticated mathematical transformations capable of modeling intricate interdependencies. The nonlinear approach is particularly motivated by applications requiring fine-grained semantic understanding across modalities. As datasets grow, preserving local structure and overall semantic coherence becomes crucial, which requires embedding methods that handle the nonlinearity of the data.
Approaches and methods
Canonical-correlation analysis based methods
Canonical-correlation analysis (CCA) was first introduced in 1936 by Harold Hotelling[2] and is a fundamental approach for multimodal learning. CCA aims to find linear relationships between two sets of variables. Given two data matrices $X$ and $Y$ representing different modalities, CCA finds projection vectors $w_x$ and $w_y$ that maximize the correlation between the projected variables:

$$\rho = \max_{w_x, w_y} \frac{w_x^\top \Sigma_{XY}\, w_y}{\sqrt{w_x^\top \Sigma_{XX}\, w_x}\,\sqrt{w_y^\top \Sigma_{YY}\, w_y}}$$

where $\Sigma_{XX}$ and $\Sigma_{YY}$ are the within-modality covariance matrices, and $\Sigma_{XY}$ is the between-modality covariance matrix. However, standard CCA is limited by its linearity, which led to the development of nonlinear extensions such as kernel CCA and deep CCA.
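The linear CCA solution can be computed in closed form from the covariance matrices: the canonical correlations are the singular values of the whitened cross-covariance $\Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}$. The following NumPy sketch illustrates this linear baseline; the function names and the small ridge term `eps` are illustrative choices rather than part of any standard API.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def linear_cca(X, Y, k=2, eps=1e-6):
    """Top-k canonical directions for paired data X (n, p) and Y (n, q)."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + eps * np.eye(X.shape[1])   # within-modality covariance
    Syy = Yc.T @ Yc / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)                               # between-modality covariance
    # Canonical correlations are the singular values of the whitened cross-covariance.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(T)
    Wx = inv_sqrt(Sxx) @ U[:, :k]      # projection vectors w_x
    Wy = inv_sqrt(Syy) @ Vt.T[:, :k]   # projection vectors w_y
    return Wx, Wy, s[:k]
```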
Kernel CCA
Kernel canonical correlation analysis (KCCA) extends traditional CCA to capture nonlinear relationships between modalities by implicitly mapping the data into high-dimensional feature spaces using kernel functions. Given kernel functions $k_x$ and $k_y$ with corresponding Gram matrices $K_x$ and $K_y$, KCCA seeks coefficient vectors $\alpha$ and $\beta$ that maximize:

$$\rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y\, \beta}{\sqrt{\alpha^\top K_x^2\, \alpha}\,\sqrt{\beta^\top K_y^2\, \beta}}$$

To prevent overfitting, regularization terms are typically added, resulting in:

$$\rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y\, \beta}{\sqrt{\alpha^\top \left(K_x^2 + \epsilon_x K_x\right) \alpha}\,\sqrt{\beta^\top \left(K_y^2 + \epsilon_y K_y\right) \beta}}$$

where $\epsilon_x$ and $\epsilon_y$ are regularization parameters. KCCA has proven effective for tasks such as cross-modal retrieval and semantic analysis, though it faces computational challenges with large datasets due to the memory required to store the kernel matrices.
KCCA was proposed independently by several researchers.[3][4][5]
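In the dual form, the regularized objective reduces to a generalized eigenvalue problem over the Gram matrices. The sketch below assumes a Gaussian kernel and the regularization scheme above; it is illustrative rather than an optimized implementation, since forming full Gram matrices is impractical beyond a few thousand samples.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def rbf_gram(Z, gamma=1.0):
    """Gaussian (RBF) Gram matrix for the rows of Z."""
    return np.exp(-gamma * cdist(Z, Z, "sqeuclidean"))

def kcca(X, Y, gamma=1.0, eps=0.1):
    """Leading canonical pair of regularized kernel CCA for paired X, Y."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kx = H @ rbf_gram(X, gamma) @ H              # centred Gram matrices
    Ky = H @ rbf_gram(Y, gamma) @ H
    Z = np.zeros((n, n))
    # Generalized eigenproblem A w = rho * B w with w = [alpha; beta].
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + eps * Kx, Z], [Z, Ky @ Ky + eps * Ky]])
    B += 1e-8 * np.eye(2 * n)                    # keep B positive definite
    vals, vecs = eigh(A, B)                      # eigenvalues in ascending order
    alpha, beta = vecs[:n, -1], vecs[n:, -1]     # eigenvector of the largest eigenvalue
    return alpha, beta, vals[-1]
```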
Deep CCA
Deep canonical correlation analysis (DCCA), introduced in 2013, employs neural networks to learn nonlinear transformations for maximizing correlation between modalities. DCCA uses separate neural networks $f$ and $g$ for each modality to transform the original data before applying CCA:

$$\left(\theta_f^*, \theta_g^*, U^*, V^*\right) = \arg\max_{\theta_f, \theta_g, U, V} \operatorname{corr}\!\left(U^\top f(X; \theta_f),\; V^\top g(Y; \theta_g)\right)$$

where $\theta_f$ and $\theta_g$ represent the parameters of the neural networks, and $U$ and $V$ are the CCA projection matrices. The correlation objective is computed as:

$$\operatorname{corr}(H_1, H_2) = \operatorname{tr}\!\left(\left(T^\top T\right)^{1/2}\right), \qquad T = \hat\Sigma_{11}^{-1/2}\, \hat\Sigma_{12}\, \hat\Sigma_{22}^{-1/2}$$

where $H_1 = f(X; \theta_f)$ and $H_2 = g(Y; \theta_g)$ are the network outputs, $\hat\Sigma_{11}$, $\hat\Sigma_{22}$ and $\hat\Sigma_{12}$ are regularized empirical covariance matrices of those outputs, and $r_1$ and $r_2$ are the regularization parameters added to the diagonals of $\hat\Sigma_{11}$ and $\hat\Sigma_{22}$. DCCA overcomes the limitations of linear CCA and kernel CCA by learning complex nonlinear relationships while maintaining computational efficiency for large datasets through mini-batch optimization.[6]
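A minimal NumPy sketch of this objective is shown below, assuming the two networks have already produced batch feature matrices `H1` and `H2`; in practice the same computation is written in an automatic-differentiation framework so that its gradient can be backpropagated through both networks, and the function name is illustrative.

```python
import numpy as np

def dcca_correlation(H1, H2, r1=1e-3, r2=1e-3):
    """Total correlation objective of deep CCA for network outputs H1, H2 (n, d)."""
    n = H1.shape[0]
    H1c, H2c = H1 - H1.mean(axis=0), H2 - H2.mean(axis=0)
    S11 = H1c.T @ H1c / (n - 1) + r1 * np.eye(H1.shape[1])   # regularized covariances
    S22 = H2c.T @ H2c / (n - 1) + r2 * np.eye(H2.shape[1])
    S12 = H1c.T @ H2c / (n - 1)

    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    # Sum of singular values of T (its trace norm) is the quantity DCCA maximizes.
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()
```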
Graph-based methods
Graph-based approaches for nonlinear multimodal embeddings leverage graph structure to model relationships between entities across different modalities. These methods typically represent each modality as a graph and then learn embeddings that preserve cross-modal similarities, enabling more effective joint representation of heterogeneous data.[7]
One such method is cross-modal graph neural networks (CMGNNs), which extend traditional graph neural networks (GNNs) to handle data from multiple modalities by constructing graphs that capture both intra-modal and inter-modal relationships. These networks model interactions across modalities by representing entities as nodes and their relationships as edges.
Another graph-based method, deep graph matching networks (DGMNs), focuses on establishing correspondences between nodes in graphs from different modalities by learning a matching function that aligns similar entities across heterogeneous data sources. This approach combines graph neural networks with attention mechanisms to compute node-to-node similarity scores across modalities.
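There is no single canonical CMGNN architecture; the sketch below only illustrates the common pattern of propagating features over a joint graph whose adjacency matrix contains both intra-modal and inter-modal edges, using a standard GCN-style update. All function names, shapes, and parameters here are hypothetical.

```python
import numpy as np

def cross_modal_gcn_layer(H_img, H_txt, A, W, activation=np.tanh):
    """One illustrative cross-modal message-passing step.

    H_img: (n_i, d) image-node features, H_txt: (n_t, d) text-node features.
    A: (n_i + n_t, n_i + n_t) adjacency over the joint graph, containing both
       intra-modal edges (image-image, text-text) and inter-modal edges.
    W: (d, d) weight matrix shared by this layer.
    """
    H = np.vstack([H_img, H_txt])                 # stack both modalities as nodes
    A_hat = A + np.eye(A.shape[0])                # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)             # symmetric degree normalization
    H_next = activation(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
    n_img = H_img.shape[0]
    return H_next[:n_img], H_next[n_img:]         # split back into the two modalities
```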
Nonlinear manifold alignment
Manifold alignment is a class of machine learning algorithms that produce projections between sets of data, given that the original data sets lie on a common manifold. Nonlinear manifold alignment extends traditional manifold learning techniques to align data manifolds from different modalities or domains. These methods assume that high-dimensional data from each modality lies on a lower-dimensional manifold, and seek transformations that preserve both the geometric structure within each manifold and the correspondence relationships between manifolds. Techniques such as local tangent space alignment (LTSA) represent the local geometry of the manifold using tangent spaces, which are then aligned to provide global coordinates of the data points.[8] This method is particularly effective for heterogeneous multimodal data, where linear alignment techniques may fall short.
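As a rough illustration, the sketch below embeds each modality separately with LTSA (available in scikit-learn as `LocallyLinearEmbedding` with `method="ltsa"`) and then rotates one embedding onto the other with a Procrustes fit over known correspondences. This two-stage pipeline is an assumption made for illustration, not the joint manifold-alignment objective described in the literature cited above.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from scipy.linalg import orthogonal_procrustes

def ltsa_then_align(X, Y, n_components=2, n_neighbors=10):
    """Embed paired modalities X (n, p) and Y (n, q) with LTSA, then align them."""
    def ltsa(Z):
        # Local tangent space alignment within a single modality.
        return LocallyLinearEmbedding(
            n_neighbors=n_neighbors, n_components=n_components, method="ltsa"
        ).fit_transform(Z)

    Ex, Ey = ltsa(X), ltsa(Y)
    # Orthogonal Procrustes: rotation mapping the Y embedding onto the X embedding,
    # using the row-by-row correspondences between the two data sets.
    R, _ = orthogonal_procrustes(Ey, Ex)
    return Ex, Ey @ R
```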
See also
References
- ^ Baltrušaitis, Tadas; Ahuja, Chaitanya; Morency, Louis-Philippe (February 2019). "Multimodal Machine Learning: A Survey and Taxonomy". IEEE Transactions on Pattern Analysis and Machine Intelligence. 41 (2): 423–443. arXiv:1705.09406. doi:10.1109/TPAMI.2018.2798607. ISSN 1939-3539. PMID 29994351.
- ^ Hotelling, H. (1936-12-01). "Relations Between Two Sets of Variates". Biometrika. 28 (3–4): 321–377. doi:10.1093/biomet/28.3-4.321. ISSN 0006-3444.
- ^ Lai, P (October 2000). "Kernel and Nonlinear Canonical Correlation Analysis". International Journal of Neural Systems. 10 (5): 365–377. doi:10.1016/S0129-0657(00)00034-X.
- ^ Dorffner, Georg; Bischof, Horst; Hornik, Kurt (2001). Artificial Neural Networks -- ICANN 2001: International Conference Vienna, Austria, August 21-25, 2001, Proceedings. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. ISBN 978-3-540-44668-2.
- ^ Akaho, Shotaro (2007-02-14), A kernel method for canonical correlation analysis, arXiv:cs/0609071.
- ^ Andrew, Galen; Arora, Raman; Bilmes, Jeff; Livescu, Karen (2013-05-26). "Deep Canonical Correlation Analysis". Proceedings of the 30th International Conference on Machine Learning. PMLR: 1247–1255.
- ^ Ektefaie, Yasha; Dasoulas, George; Noori, Ayush; Farhat, Maha; Zitnik, Marinka (April 2023). "Multimodal learning with graphs". Nature Machine Intelligence. 5 (4): 340–350. doi:10.1038/s42256-023-00624-6. ISSN 2522-5839. PMC 10704992. PMID 38076673.
- ^ Zhang, Zhenyue; Zha, Hongyuan (January 2004). "Principal Manifolds and Nonlinear Dimensionality Reduction via Tangent Space Alignment". SIAM Journal on Scientific Computing. 26 (1): 313–338. arXiv:cs/0212008. Bibcode:2004SJSC...26..313Z. doi:10.1137/S1064827502419154. ISSN 1064-8275.