Draft:Nonlinear Multimodal Embeddings
Nonlinear multimodal embeddings are representation learning techniques used to project data from different modalities, such as text, images, audio, or video, into a shared latent space, so that semantically similar content across modalities is mapped to nearby points. Unlike linear approaches, nonlinear multimodal embeddings employ nonlinear transformations that can capture complex relationships between different data types, enabling more effective cross-modal retrieval, fusion, and analysis.
Motivation
While traditional multimodal learning methods demonstrate the value of integrating different data types, they often rely on linear projections that struggle to capture the complex, nonlinear relationships inherent in cross-modal data. Real-world connections between modalities such as images and text rarely follow simple linear patterns. For instance, an image might relate to certain words through abstract concepts, metaphorical relationships, or cultural associations that linear models cannot adequately represent.[1] Nonlinear multimodal embeddings address this limitation by employing more sophisticated mathematical transformations capable of modeling intricate interdependencies. The nonlinear approach is particularly motivated by applications requiring fine-grained semantic understanding across modalities. As datasets grow, preserving local structure and overall semantic coherence becomes crucial, which requires embedding methods that handle the nonlinearity of the data.
Approaches and methods
Canonical-correlation analysis based methods
Canonical-correlation analysis (CCA) was first introduced in 1936 by Harold Hotelling[2] and is a fundamental approach for multimodal learning. CCA aims to find linear relationships between two sets of variables. Given two data matrices $X$ and $Y$ representing different modalities, CCA finds projection vectors $w_x$ and $w_y$ that maximize the correlation between the projected variables:

$$\rho = \max_{w_x, w_y} \frac{w_x^\top \Sigma_{XY}\, w_y}{\sqrt{w_x^\top \Sigma_{XX}\, w_x}\,\sqrt{w_y^\top \Sigma_{YY}\, w_y}}$$

where $\Sigma_{XX}$ and $\Sigma_{YY}$ are the within-modality covariance matrices, and $\Sigma_{XY}$ is the between-modality covariance matrix. However, standard CCA is limited by its linearity, which led to the development of nonlinear extensions such as kernel CCA and deep CCA.
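The linear CCA solution can be computed in closed form from the covariance matrices: the canonical correlations are the singular values of the whitened cross-covariance $\Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}$. The following NumPy sketch illustrates this linear baseline; the function names and the small ridge term `eps` are illustrative choices rather than part of any standard API.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def linear_cca(X, Y, k=2, eps=1e-6):
    """Top-k canonical directions for paired data X (n, p) and Y (n, q)."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + eps * np.eye(X.shape[1])   # within-modality covariance
    Syy = Yc.T @ Yc / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)                               # between-modality covariance
    # Canonical correlations are the singular values of the whitened cross-covariance.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(T)
    Wx = inv_sqrt(Sxx) @ U[:, :k]      # projection vectors w_x
    Wy = inv_sqrt(Syy) @ Vt.T[:, :k]   # projection vectors w_y
    return Wx, Wy, s[:k]
```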
Kernel CCA
Kernel canonical correlation analysis (KCCA) extends traditional CCA to capture nonlinear relationships between modalities by implicitly mapping the data into high-dimensional feature spaces using kernel functions. Given kernel functions $k_x$ and $k_y$ with corresponding Gram matrices $K_x$ and $K_y$, KCCA seeks coefficient vectors $\alpha$ and $\beta$ that maximize:

$$\rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y\, \beta}{\sqrt{\alpha^\top K_x^2\, \alpha}\,\sqrt{\beta^\top K_y^2\, \beta}}$$

To prevent overfitting, regularization terms are typically added, resulting in:

$$\rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y\, \beta}{\sqrt{\alpha^\top \left(K_x^2 + \epsilon_x K_x\right) \alpha}\,\sqrt{\beta^\top \left(K_y^2 + \epsilon_y K_y\right) \beta}}$$

where $\epsilon_x$ and $\epsilon_y$ are regularization parameters. KCCA has proven effective for tasks such as cross-modal retrieval and semantic analysis, though it faces computational challenges with large datasets due to the memory required to store the kernel matrices.
KCCA was proposed independently by several researchers.[3][4][5]
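In the dual form, the regularized objective reduces to a generalized eigenvalue problem over the Gram matrices. The sketch below assumes a Gaussian kernel and the regularization scheme above; it is illustrative rather than an optimized implementation, since forming full Gram matrices is impractical beyond a few thousand samples.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def rbf_gram(Z, gamma=1.0):
    """Gaussian (RBF) Gram matrix for the rows of Z."""
    return np.exp(-gamma * cdist(Z, Z, "sqeuclidean"))

def kcca(X, Y, gamma=1.0, eps=0.1):
    """Leading canonical pair of regularized kernel CCA for paired X, Y."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kx = H @ rbf_gram(X, gamma) @ H              # centred Gram matrices
    Ky = H @ rbf_gram(Y, gamma) @ H
    Z = np.zeros((n, n))
    # Generalized eigenproblem A w = rho * B w with w = [alpha; beta].
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + eps * Kx, Z], [Z, Ky @ Ky + eps * Ky]])
    B += 1e-8 * np.eye(2 * n)                    # keep B positive definite
    vals, vecs = eigh(A, B)                      # eigenvalues in ascending order
    alpha, beta = vecs[:n, -1], vecs[n:, -1]     # eigenvector of the largest eigenvalue
    return alpha, beta, vals[-1]
```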
Deep CCA
Deep canonical correlation analysis (DCCA), introduced in 2013, employs neural networks to learn nonlinear transformations for maximizing correlation between modalities. DCCA uses separate neural networks $f$ and $g$ for each modality to transform the original data before applying CCA:

$$\left(\theta_f^*, \theta_g^*, U^*, V^*\right) = \arg\max_{\theta_f, \theta_g, U, V} \operatorname{corr}\!\left(U^\top f(X; \theta_f),\; V^\top g(Y; \theta_g)\right)$$

where $\theta_f$ and $\theta_g$ represent the parameters of the neural networks, and $U$ and $V$ are the CCA projection matrices. The correlation objective is computed as:

$$\operatorname{corr}(H_1, H_2) = \operatorname{tr}\!\left(\left(T^\top T\right)^{1/2}\right), \qquad T = \hat\Sigma_{11}^{-1/2}\, \hat\Sigma_{12}\, \hat\Sigma_{22}^{-1/2}$$

where $H_1 = f(X; \theta_f)$ and $H_2 = g(Y; \theta_g)$ are the network outputs, $\hat\Sigma_{11}$, $\hat\Sigma_{22}$ and $\hat\Sigma_{12}$ are regularized empirical covariance matrices of those outputs, and $r_1$ and $r_2$ are the regularization parameters added to the diagonals of $\hat\Sigma_{11}$ and $\hat\Sigma_{22}$. DCCA overcomes the limitations of linear CCA and kernel CCA by learning complex nonlinear relationships while maintaining computational efficiency for large datasets through mini-batch optimization.[6]
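A minimal NumPy sketch of this objective is shown below, assuming the two networks have already produced batch feature matrices `H1` and `H2`; in practice the same computation is written in an automatic-differentiation framework so that its gradient can be backpropagated through both networks, and the function name is illustrative.

```python
import numpy as np

def dcca_correlation(H1, H2, r1=1e-3, r2=1e-3):
    """Total correlation objective of deep CCA for network outputs H1, H2 (n, d)."""
    n = H1.shape[0]
    H1c, H2c = H1 - H1.mean(axis=0), H2 - H2.mean(axis=0)
    S11 = H1c.T @ H1c / (n - 1) + r1 * np.eye(H1.shape[1])   # regularized covariances
    S22 = H2c.T @ H2c / (n - 1) + r2 * np.eye(H2.shape[1])
    S12 = H1c.T @ H2c / (n - 1)

    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    # Sum of singular values of T (its trace norm) is the quantity DCCA maximizes.
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()
```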
Graph-based methods
Graph-based approaches for nonlinear multimodal embeddings leverage graph structure to model relationships between entities across different modalities. These methods typically represent each modality as a graph and then learn embeddings that preserve cross-modal similarities, enabling more effective joint representation of heterogeneous data.[7]
One such method is cross-modal graph neural networks (CMGNNs), which extend traditional graph neural networks (GNNs) to handle data from multiple modalities by constructing graphs that capture both intra-modal and inter-modal relationships. These networks model interactions across modalities by representing entities as nodes and their relationships as edges.
Another graph-based method, deep graph matching networks (DGMNs), focuses on establishing correspondences between nodes in graphs from different modalities by learning a matching function that aligns similar entities across heterogeneous data sources. This approach combines graph neural networks with attention mechanisms to compute node-to-node similarity scores across modalities.
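There is no single canonical CMGNN architecture; the sketch below only illustrates the common pattern of propagating features over a joint graph whose adjacency matrix contains both intra-modal and inter-modal edges, using a standard GCN-style update. All function names, shapes, and parameters here are hypothetical.

```python
import numpy as np

def cross_modal_gcn_layer(H_img, H_txt, A, W, activation=np.tanh):
    """One illustrative cross-modal message-passing step.

    H_img: (n_i, d) image-node features, H_txt: (n_t, d) text-node features.
    A: (n_i + n_t, n_i + n_t) adjacency over the joint graph, containing both
       intra-modal edges (image-image, text-text) and inter-modal edges.
    W: (d, d) weight matrix shared by this layer.
    """
    H = np.vstack([H_img, H_txt])                 # stack both modalities as nodes
    A_hat = A + np.eye(A.shape[0])                # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)             # symmetric degree normalization
    H_next = activation(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
    n_img = H_img.shape[0]
    return H_next[:n_img], H_next[n_img:]         # split back into the two modalities
```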
Nonlinear manifold alignment
Manifold alignment is a class of machine learning algorithms that produce projections between sets of data, given that the original data sets lie on a common manifold. Nonlinear manifold alignment extends traditional manifold learning techniques to align data manifolds from different modalities or domains. These methods assume that high-dimensional data from each modality lies on a lower-dimensional manifold, and seek transformations that preserve both the geometric structure within each manifold and the correspondence relationships between manifolds. Techniques such as local tangent space alignment (LTSA) represent the local geometry of the manifold using tangent spaces, which are then aligned to provide global coordinates of the data points.[8] This method is particularly effective for heterogeneous multimodal data, where linear alignment techniques may fall short.
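As a rough illustration, the sketch below embeds each modality separately with LTSA (available in scikit-learn as `LocallyLinearEmbedding` with `method="ltsa"`) and then rotates one embedding onto the other with a Procrustes fit over known correspondences. This two-stage pipeline is an assumption made for illustration, not the joint manifold-alignment objective described in the literature cited above.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from scipy.linalg import orthogonal_procrustes

def ltsa_then_align(X, Y, n_components=2, n_neighbors=10):
    """Embed paired modalities X (n, p) and Y (n, q) with LTSA, then align them."""
    def ltsa(Z):
        # Local tangent space alignment within a single modality.
        return LocallyLinearEmbedding(
            n_neighbors=n_neighbors, n_components=n_components, method="ltsa"
        ).fit_transform(Z)

    Ex, Ey = ltsa(X), ltsa(Y)
    # Orthogonal Procrustes: rotation mapping the Y embedding onto the X embedding,
    # using the row-by-row correspondences between the two data sets.
    R, _ = orthogonal_procrustes(Ey, Ex)
    return Ex, Ey @ R
```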
See also
References
- ^ Baltrušaitis, Tadas; Ahuja, Chaitanya; Morency, Louis-Philippe (February 2019). "Multimodal Machine Learning: A Survey and Taxonomy". IEEE Transactions on Pattern Analysis and Machine Intelligence. 41 (2): 423–443. arXiv:1705.09406. doi:10.1109/TPAMI.2018.2798607. ISSN 1939-3539. PMID 29994351.
- ^ Hotelling, H. (1936-12-01). "Relations Between Two Sets of Variates". Biometrika. 28 (3–4): 321–377. doi:10.1093/biomet/28.3-4.321. ISSN 0006-3444.
- ^ Lai, P (October 2000). "Kernel and Nonlinear Canonical Correlation Analysis". International Journal of Neural Systems. 10 (5): 365–377. doi:10.1016/S0129-0657(00)00034-X.
- ^ Dorffner, Georg; Bischof, Horst; Hornik, Kurt (2001). Artificial Neural Networks -- ICANN 2001: International Conference Vienna, Austria, August 21-25, 2001, Proceedings. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. ISBN 978-3-540-44668-2.
- ^ Akaho, Shotaro (2007-02-14), A kernel method for canonical correlation analysis, arXiv:cs/0609071.
- ^ Andrew, Galen; Arora, Raman; Bilmes, Jeff; Livescu, Karen (2013-05-26). "Deep Canonical Correlation Analysis". Proceedings of the 30th International Conference on Machine Learning. PMLR: 1247–1255.
- ^ Ektefaie, Yasha; Dasoulas, George; Noori, Ayush; Farhat, Maha; Zitnik, Marinka (April 2023). "Multimodal learning with graphs". Nature Machine Intelligence. 5 (4): 340–350. doi:10.1038/s42256-023-00624-6. ISSN 2522-5839. PMC 10704992. PMID 38076673.
- ^ Zhang, Zhenyue; Zha, Hongyuan (January 2004). "Principal Manifolds and Nonlinear Dimensionality Reduction via Tangent Space Alignment". SIAM Journal on Scientific Computing. 26 (1): 313–338. arXiv:cs/0212008. Bibcode:2004SJSC...26..313Z. doi:10.1137/S1064827502419154. ISSN 1064-8275.