Learning-based models

Instead of seeking to model optical flow directly, one can train a machine learning system to estimate it. Learning-based models have been applied to optical flow since 2015, when FlowNet[1] was proposed, and have since gained prominence. Early approaches were based on convolutional neural networks arranged in a U-Net architecture, often using encoder-decoder or feature-pyramid structures. PWC-Net[2], for example, integrated cost volumes (4D tensors representing the matching costs between all pairs of pixels in two feature maps) and warping (spatially transforming one image according to a predicted flow field) to refine flow estimates across multiple scales.

Following the advent of the transformer architecture in 2017, transformer-based models have also become prominent.[3] A significant shift occurred with the introduction of RAFT[4] (Recurrent All-Pairs Field Transforms), which replaced coarse-to-fine pyramids with a single GRU-based state that iteratively updates the flow field. By maintaining a constant feature resolution at 1/8 of the input, RAFT preserved fine details and handled fast motion more robustly than previous bottleneck-heavy designs, and its iterative update mechanism has been adopted by a wide range of subsequent models.
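
The following is a minimal PyTorch sketch of the two ingredients described above, the all-pairs cost volume and backward warping; the tensor shapes and function names are illustrative assumptions, not taken from any cited implementation:

```python
import torch
import torch.nn.functional as F

def all_pairs_correlation(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """4D cost volume: matching cost between every pixel pair of two
    feature maps f1, f2 of shape (B, C, H, W), as in RAFT-style models."""
    b, c, h, w = f1.shape
    corr = torch.einsum('bci,bcj->bij',
                        f1.reshape(b, c, h * w),
                        f2.reshape(b, c, h * w))   # (B, H*W, H*W)
    return corr.reshape(b, h, w, h, w) / c ** 0.5  # scaled dot products

def warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp img (B, C, H, W) by flow (B, 2, H, W), the warping
    step used in PWC-Net-style coarse-to-fine refinement."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack((xs, ys)).to(img)   # pixel coordinates, (2, H, W)
    coords = grid + flow                   # where each output pixel samples from
    # normalise coordinates to [-1, 1], the range grid_sample expects
    coords = torch.stack((2 * coords[:, 0] / (w - 1) - 1,
                          2 * coords[:, 1] / (h - 1) - 1), dim=-1)
    return F.grid_sample(img, coords, align_corners=True)
```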

However, the all-pairs correlation used in such models is computationally expensive: for high-resolution content such as Full HD or 4K, global matching can require more than 32 GB of VRAM[5], making it impractical on consumer-grade GPUs. To address this, efficiency-focused methods such as Flow1D[5], MeFlow[6], and MEMFOF[7] have been developed. Flow1D and MeFlow reduce memory usage by decomposing the 2D search space into 1D correlations, while MEMFOF optimises correlation volumes for high-resolution multi-frame sequences, making such models practical on standard GPUs.
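
These figures can be checked with back-of-the-envelope arithmetic. The sketch below assumes fp32 values, a single correlation level, and 1/8 feature resolution, so exact numbers for any particular model will differ:

```python
def all_pairs_bytes(height: int, width: int, stride: int = 8) -> int:
    """Memory for a full all-pairs cost volume: one fp32 value per pixel pair."""
    n = (height // stride) * (width // stride)   # number of feature-map positions
    return n * n * 4

def decomposed_bytes(height: int, width: int, stride: int = 8) -> int:
    """Flow1D-style decomposition: correlations only along rows and columns."""
    h, w = height // stride, width // stride
    return h * w * (h + w) * 4

print(all_pairs_bytes(2160, 3840) / 2**30)    # 4K, all pairs: ~62.6 GiB
print(decomposed_bytes(2160, 3840) / 2**30)   # 4K, decomposed: ~0.36 GiB
```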

Most learning-based approaches to optical flow use supervised learning: many frame pairs of video data and their corresponding ground-truth flow fields are used to optimise the parameters of the model to accurately estimate optical flow. Because of the large number of parameters involved, training often relies on vast synthetic datasets such as FlyingChairs[1] and FlyingThings3D[8].[9] The models are then evaluated on benchmarks such as MPI Sintel[10], KITTI[11], and the high-resolution Spring[12] dataset. However, models trained exclusively on synthetic data often struggle with the domain gap when applied to real-world footage.
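
For illustration, the standard supervised objective and evaluation metric is the endpoint error (EPE), the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch, assuming (B, 2, H, W) flow tensors:

```python
import torch

def endpoint_error(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean per-pixel endpoint error between two (B, 2, H, W) flow fields."""
    return (pred - gt).pow(2).sum(dim=1).sqrt().mean()
```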

To address this, some learning-based optical flow approaches use self-supervised learning (sometimes called unsupervised learning) to reduce the need for large datasets with ground-truth labels and to exploit unlabelled real-world footage during training. Instead of minimising the difference between estimated and ground-truth flow fields, these models are trained on learning objectives such as brightness constancy and smoothness of the flow field.[13] More recently, methods such as CroCo[14] have introduced cross-view completion pre-training: forcing the network to predict masked regions of one image from a full view of a second image teaches it a strong geometric understanding and yields better generalisation than training solely on task-specific labels.
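
A minimal sketch of such a self-supervised objective, combining a brightness-constancy (photometric) term with a first-order smoothness term; the `warp` helper from the earlier sketch is assumed, and the loss weighting is illustrative:

```python
import torch

def self_supervised_loss(img1: torch.Tensor, img2: torch.Tensor,
                         flow: torch.Tensor,
                         smooth_weight: float = 0.1) -> torch.Tensor:
    """img1, img2: (B, C, H, W) frames; flow: (B, 2, H, W) predicted flow."""
    # brightness constancy: img1 should match img2 warped back by the flow
    photometric = (img1 - warp(img2, flow)).abs().mean()
    # smoothness: penalise large spatial gradients of the flow field
    dx = (flow[..., :, 1:] - flow[..., :, :-1]).abs().mean()
    dy = (flow[..., 1:, :] - flow[..., :-1, :]).abs().mean()
    return photometric + smooth_weight * (dx + dy)
```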

  1. ^ a b Dosovitskiy, Alexey; Fischer, Philipp; Ilg, Eddy; Hausser, Philip; Hazirbas, Caner; Golkov, Vladimir; Smagt, Patrick van der; Cremers, Daniel; Brox, Thomas (2015). FlowNet: Learning Optical Flow with Convolutional Networks. 2015 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 2758–2766. doi:10.1109/ICCV.2015.316. ISBN 978-1-4673-8391-2.
  2. ^ Sun, Deqing; Yang, Xiaodong; Liu, Ming-Yu; Kautz, Jan (2018-06-25), PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume, arXiv, doi:10.48550/arXiv.1709.02371, arXiv:1709.02371, retrieved 2026-01-09
  3. ^ Alfarano, Andrea; Maiano, Luca; Papa, Lorenzo; Amerini, Irene (2024). "Estimating optical flow: A comprehensive review of the state of the art". Computer Vision and Image Understanding. 249: 104160. doi:10.1016/j.cviu.2024.104160. hdl:11573/1726258.
  4. ^ Teed, Zachary; Deng, Jia (2020-08-25), RAFT: Recurrent All-Pairs Field Transforms for Optical Flow, arXiv, doi:10.48550/arXiv.2003.12039, arXiv:2003.12039, retrieved 2026-01-09
  5. ^ a b Xu, Haofei; Yang, Jiaolong; Cai, Jianfei; Zhang, Juyong; Tong, Xin (2021-08-29), High-Resolution Optical Flow from 1D Attention and Correlation, arXiv, doi:10.48550/arXiv.2104.13918, arXiv:2104.13918, retrieved 2026-01-09
  6. ^ Xu, Gangwei; Chen, Shujun; Jia, Hao; Feng, Miaojie; Yang, Xin (2025-03-04), Memory-Efficient Optical Flow via Radius-Distribution Orthogonal Cost Volume, arXiv, doi:10.48550/arXiv.2312.03790, arXiv:2312.03790, retrieved 2026-01-09
  7. ^ Bargatin, Vladislav; Chistov, Egor; Yakovenko, Alexander; Vatolin, Dmitriy (2025-06-29), MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation, arXiv, doi:10.48550/arXiv.2506.23151, arXiv:2506.23151, retrieved 2026-01-09
  8. ^ Mayer, Nikolaus; Ilg, Eddy; Häusser, Philip; Fischer, Philipp; Cremers, Daniel; Dosovitskiy, Alexey; Brox, Thomas (2015-12-07), A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation, arXiv, doi:10.48550/arXiv.1512.02134, arXiv:1512.02134, retrieved 2026-01-09
  9. ^ Tu, Zhigang; Xie, Wei; Zhang, Dejun; Poppe, Ronald; Veltkamp, Remco C.; Li, Baoxin; Yuan, Junsong (1 March 2019). "A survey of variational and CNN-based optical flow techniques". Signal Processing: Image Communication. 72: 9–24. doi:10.1016/j.image.2018.12.002. hdl:1874/379559.
  10. ^ Butler, Daniel J.; Wulff, Jonas; Stanley, Garrett B.; Black, Michael J. (2012). Fitzgibbon, Andrew; Lazebnik, Svetlana; Perona, Pietro; Sato, Yoichi; Schmid, Cordelia (eds.). "A Naturalistic Open Source Movie for Optical Flow Evaluation". Computer Vision – ECCV 2012. Berlin, Heidelberg: Springer: 611–625. doi:10.1007/978-3-642-33783-3_44. ISBN 978-3-642-33783-3.
  11. ^ Menze, Moritz; Geiger, Andreas (2015). "Object scene flow for autonomous vehicles". 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 3061–3070. doi:10.1109/CVPR.2015.7298925.
  12. ^ Mehl, Lukas; Schmalfuss, Jenny; Jahedi, Azin; Nalivayko, Yaroslava; Bruhn, Andrés (2023-03-03), Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo, arXiv, doi:10.48550/arXiv.2303.01943, arXiv:2303.01943, retrieved 2026-01-09
  13. ^ Jonschkowski, Rico; Stone, Austin; Barron, Jonathan T.; Gordon, Ariel; Konolige, Kurt; Angelova, Anelia (2020). What Matters in Unsupervised Optical Flow. Cham: Springer International Publishing. pp. 557–572. arXiv:2006.04902. doi:10.1007/978-3-030-58536-5_33. ISBN 978-3-030-58536-5.
  14. ^ Weinzaepfel, Philippe; Lucas, Thomas; Leroy, Vincent; Cabon, Yohann; Arora, Vaibhav; Brégier, Romain; Csurka, Gabriela; Antsfeld, Leonid; Chidlovskii, Boris (2023-08-18), CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow, arXiv, doi:10.48550/arXiv.2211.10408, arXiv:2211.10408, retrieved 2026-01-09