
Large Vision-Language Models: Pre-training, Prompting, and Applications

Editors: Kaiyang Zhou, Ziwei Liu, Peng Gao

The rapid progress of large multimodal foundation models, especially vision-language models, has dramatically transformed machine learning, computer vision, and natural language processing. Trained on vast amounts of paired image and text data, these models have demonstrated remarkable capabilities in tasks ranging from image classification and object detection to visual content generation and question answering.

This book provides a comprehensive and up-to-date exploration of large vision-language models, covering their pre-training, prompting techniques, and diverse real-world computer vision applications. It begins with the fundamentals of large vision-language models, covering architectural designs, training techniques, and dataset construction methods. It then examines prompting strategies and other adaptation methods, demonstrating how these models can be effectively fine-tuned for a wide range of downstream tasks. The final part focuses on applications across various domains, including open-vocabulary object detection, 3D point cloud processing, and text-driven visual content generation and manipulation.

Beyond the technical foundations, the book surveys the wide-ranging applications of vision-language models (VLMs), from enhancing image recognition systems to enabling sophisticated visual content generation and facilitating more natural human-machine interaction. It also addresses key challenges in the field, such as feature alignment, scalability, data requirements, and evaluation metrics. By providing a comprehensive roadmap for both newcomers and experts, the book serves as a valuable resource for researchers, practitioners, and students in computer vision, natural language processing, and artificial intelligence.

Publisher: Springer
Publication Year: 2026
Language: English
Pages: 432
File Format: PDF
File Size: 13.9 MB
Text Preview (First 20 pages)

Advances in Computer Vision and Pattern Recognition
Kaiyang Zhou, Ziwei Liu, Peng Gao (Editors)
Large Vision-Language Models: Pre-training, Prompting, and Applications
Advances in Computer Vision and Pattern Recognition

Founding Editor: Sameer Singh
Series Editor: Srinivasa Narasimhan, Carnegie Mellon University, Pittsburgh, PA, USA

Advisory Editors:
Richard Bowden, University of Surrey, Guildford, UK
Sven Dickinson, University of Toronto, Toronto, ON, Canada
Jiaya Jia, The Chinese University of Hong Kong, Shatin, Hong Kong
Zhouchen Lin, Peking University, Beijing, China
Bernt Schiele, Max Planck Institute for Informatics, Saarbrücken, Germany
The field of computer vision and pattern recognition has a rich history of nearly 50 years. In the past decade, however, the field has experienced remarkable advances in scene understanding and image generation. This advancement is driven by three key factors: (a) the availability of large, diverse datasets, (b) the accessibility of cloud and personal computing, and (c) the open release of advanced neural network architectures and models. These breakthroughs have led to significant successes across numerous application domains, including intelligent transportation, augmented reality, healthcare, agriculture, oceanography, and more. The ACVPR book series aims to introduce, analyze, and synthesize recent and foundational research, offering valuable references to both beginners and expert practitioners. The series covers timely topics such as:

- Deep Learning for Vision
- Large-scale Foundational Models
- Generative Methods
- Multimodal Learning (vision, audio, language, action, etc.)
- Neural Fields for Vision
- 3D Computer Vision
- Computational Photography, Display and Illumination
- Video Understanding and Synthesis
- Virtual, Mixed and Augmented Reality
- Biological and Human Vision
- Physics-based Vision
- Vision for Graphics
- Ethics in Computer Vision
- Applications (Robotics, Agriculture, Health, Intelligent Transportation, Oceanography, Safety/Security, etc.)

This series includes monographs, introductory and advanced textbooks, and state-of-the-art collections. Furthermore, it supports Open Access publication mode.
Editors
Kaiyang Zhou, Hong Kong Baptist University, Kowloon, Hong Kong
Ziwei Liu, Nanyang Technological University, Singapore, Singapore
Peng Gao, Shanghai Artificial Intelligence Laboratory, Shanghai, China

ISSN 2191-6586; ISSN 2191-6594 (electronic)
Advances in Computer Vision and Pattern Recognition
ISBN 978-3-031-94968-5; ISBN 978-3-031-94969-2 (eBook)
https://doi.org/10.1007/978-3-031-94969-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2026

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland. If disposing of this product, please recycle the paper.
Preface

The pursuit of multimodal understanding has emerged as one of the most transformative frontiers in artificial intelligence. The ability to create models that can interpret, reason about, and generate multimodal data is not only a technical milestone but also an intellectual leap in the way we approach machine cognition. Vision-Language Models (VLMs) have brought about a paradigm shift, bridging the previously distinct domains of computer vision and natural language processing to open new avenues for intelligent systems.

This book, Large Vision-Language Models: Pre-training, Prompting, and Applications, aims to provide a comprehensive overview of the theoretical underpinnings, progress, and challenges in this rapidly evolving field. We explore the foundational concepts that power VLMs, shedding light on the unique architectures, the role of large-scale pre-training, and the essential multimodal representations that form the backbone of these models for both understanding and generation. Through this exploration, we aim to present a clear understanding of how VLMs operate, their capabilities, and the nuances of aligning visual and textual data in ways that allow for complex reasoning and generation.

The applications of VLMs are vast and growing—from enhancing image recognition systems to enabling sophisticated visual content generation, and even creating systems that can interact with humans in more natural and intuitive ways. However, alongside these opportunities come significant challenges. As the field progresses, issues such as feature alignment, scalability, data requirements, and evaluation metrics require ongoing attention and innovation. Furthermore, concerns regarding computational inefficiency and ethical implications also demand careful consideration as we look toward the future of these technologies.

In this book, we offer a roadmap for both newcomers and experts interested in understanding the current landscape of VLMs, their limitations, and the exciting directions they may take. We aim to provide not only a technical guide but also a reflection on the broader impact of this technology, laying the groundwork for the next wave of advancements in AI.

We hope this book serves as a valuable resource for researchers, practitioners, and students who wish to engage with the foundational concepts and future directions of VLMs. It is our belief that the continued development of this field will play a critical role in shaping the future of artificial intelligence, enabling machines to interact with, understand, and generate the world in ways that are richer, more accurate, and more human-like than ever before.

Kaiyang Zhou, Kowloon, Hong Kong
Ziwei Liu, Singapore, Singapore
Peng Gao, Shanghai, China
December 2024
Acknowledgements

The completion of this book has been a truly collaborative effort. First and foremost, we thank all the contributors whose expertise, dedication, and insightful chapters form the foundation of this work. We also thank the reviewers for their constructive feedback, which was instrumental in enhancing the quality of this book. Special thanks go to Prof. Chen Change Loy for sharing book writing tips that enhanced the clarity and presentation of this work, and to Dr. Bin Fu for his assistance with LaTeX editing, which greatly facilitated the manuscript preparation. Finally, we are deeply grateful to Paul Drougas, Kalai Shahethya, and Katherine Moretti at Springer for their invaluable support and guidance throughout the preparation of this book.
Contents

1 Foundations of Vision-Language Models: Concepts and Roadmap
  Kaiyang Zhou, Ziwei Liu, and Peng Gao
  1.1 Introduction
  1.2 The Vision-Language Modeling Paradigm
  1.3 Navigating the Complexities
  1.4 Perspectives and Progress
  References

Part I: Scaling Intelligence: Pre-Training Strategies for Vision-Language Models

2 InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  Wenhai Wang, Zhe Chen, Yangzhou Liu, Yue Cao, Weiyun Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, and Jifeng Dai
  2.1 Introduction
  2.2 Related Work
  2.3 Proposed Method
  2.4 Experiments
  2.5 Conclusion
  References

3 Multimodal Large Language Models for Video Understanding
  Yi Wang, Jiashuo Yu, Yinan He, Limin Wang, and Yu Qiao
  3.1 Introduction
  3.2 Related Work
  3.3 Method
  3.4 Training Multimodal Data Overview
  3.5 Experiments
  3.6 Conclusion and Discussion
  References

4 Generative Multimodal Models Are In-Context Learners
  Yufeng Cui, Quan Sun, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang
  4.1 Introduction
  4.2 Related Work
  4.3 Method
  4.4 Experiments
  4.5 Conclusion
  References

Part II: Shaping Intelligence: Prompting Techniques for Multimodal Adaptation

5 Differentiable Prompt Learning for Vision-Language Models
  Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu
  5.1 Introduction
  5.2 Related Work
  5.3 Preliminaries
  5.4 Context Optimization
  5.5 Experiments
  5.6 Conclusion
  References

6 Test-Time Prompt Tuning for Vision-Language Models
  Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao
  6.1 Introduction
  6.2 Related Work
  6.3 TPT: Test-Time Prompt Tuning
  6.4 Experiments
  6.5 Ablation Study
  6.6 Conclusion
  Appendix
  References

7 Learning Efficient Feature Adapters for Vision-Language Models
  Renrui Zhang and Peng Gao
  7.1 Introduction
  7.2 Related Work
  7.3 Method
  7.4 Experiments
  References

8 Efficient Tuning of Vision Foundation Models with Neural Prompt Search
  Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu
  8.1 Introduction
  8.2 Related Work
  8.3 Neural Prompt Search
  8.4 Experiments
  8.5 Conclusion
  References

9 Confidence Calibration in Contrastive Vision-Language Models
  Shuoyuan Wang, Kaiyang Zhou, and Hongxin Wei
  9.1 Introduction
  9.2 Background
  9.3 An Empirical Study
  9.4 Open-Vocabulary Calibration
  9.5 Experiments
  9.6 Discussion
  9.7 Related Work
  9.8 Conclusion
  References

Part III: Applying Intelligence: Real-World Applications of Vision-Language Models

10 Open-Vocabulary Object Detection Based on Detection Transformers
  Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy
  10.1 Introduction
  10.2 Related Work
  10.3 OV-DETR
  10.4 Experiments
  10.5 Conclusion
  References

11 Unlocking CLIP for Zero-Shot Dense Segmentation
  Chong Zhou, Chen Change Loy, and Bo Dai
  11.1 Introduction
  11.2 Related Work
  11.3 Methodology
  11.4 Experiments
  11.5 Conclusion
  References

12 Adapting CLIP for 3D Understanding
  Xiangyang Zhu, Renrui Zhang, and Peng Gao
  12.1 Introduction
  12.2 Related Work
  12.3 PointCLIP for 3D Understanding
  12.4 PointCLIP V2 for 3D Open-World Understanding
  12.5 Experiments
  12.6 Conclusion
  References

13 Multimodal Face Generation and Manipulation with Collaborative Diffusion Models
  Ziqi Huang, Kelvin C. K. Chan, Yuming Jiang, and Ziwei Liu
  13.1 Introduction
  13.2 Methodology
  13.3 Experiments
  13.4 Conclusion
  References

14 Boosting Diffusion U-Net with Free Lunch for Text-to-Image and Text-to-Video Generation
  Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu
  14.1 Introduction
  14.2 Related Work
  14.3 Methodology
  14.4 Experiments
  14.5 Conclusion
  References

15 Text-Conditioned Zero-Shot 3D Avatar Creation and Animation
  Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu
  15.1 Introduction
  15.2 Related Work
  15.3 Our Approach
  15.4 Experiments
  15.5 Discussion
  References

16 Text-Driven 3D Human Motion Generation
  Mingyuan Zhang, Zhongang Cai, Liang Pan, Chenyang Gu, Fangzhou Hong, Xinying Guo, Jiawei Ren, Lei Yang, and Ziwei Liu
  16.1 Introduction
  16.2 Text-Driven Motion Diffusion Model
  16.3 Retrieval-Augmented Motion Generation
  16.4 Fine-Grained Text-Driven Generation
  16.5 Experiments
  16.6 Conclusion
  References

17 Text-Driven Scene Generation
  Zhaoxi Chen, Guangcong Wang, and Ziwei Liu
  17.1 Introduction
  17.2 Related Work
  17.3 Panoramic Scene Representation
  17.4 Scene Generation from Texts
  17.5 Evaluations
  17.6 Downstream Applications
  17.7 Conclusion
  References

Index
Contributors

Anima Anandkumar, California Institute of Technology, Pasadena, CA, USA
Zhongang Cai, S-Lab, Nanyang Technological University and SenseTime Research, Singapore, Singapore
Yue Cao, School of Computer Science, Nanjing University, Nanjing, China
Kelvin C.K. Chan, S-Lab, Nanyang Technological University, Singapore, Singapore
Zhaoxi Chen, Nanyang Technological University, Singapore, Singapore
Zhe Chen, School of Computer Science, Nanjing University, Nanjing, China
Yufeng Cui, Beijing Academy of Artificial Intelligence, Beijing, China
Bo Dai, Shanghai Artificial Intelligence Laboratory, Shanghai, China
Jifeng Dai, Department of Electronic Engineering, Tsinghua University, Beijing, China
Peng Gao, Shanghai Artificial Intelligence Laboratory, Shanghai, China
Tom Goldstein, University of Maryland, College Park, MD, USA
Chenyang Gu, S-Lab, Nanyang Technological University, Singapore, Singapore
Xinying Guo, S-Lab, Nanyang Technological University, Singapore, Singapore
Yinan He, Shanghai Artificial Intelligence Laboratory, Shanghai, China
Fangzhou Hong, S-Lab, Nanyang Technological University, Singapore, Singapore
Chen Huang, Apple Inc., Cupertino, CA, USA
De-An Huang, Nvidia Corporation, Santa Clara, CA, USA
Tiejun Huang, Peking University, Beijing, China
Ziqi Huang, S-Lab, Nanyang Technological University, Singapore, Singapore
Yuming Jiang, S-Lab, Nanyang Technological University, Singapore, Singapore
Wei Li, S-Lab, Nanyang Technological University, Singapore, Singapore
Jingjing Liu, Tsinghua University, Beijing, China
Yangzhou Liu, School of Computer Science, Nanjing University, Nanjing, China
Ziwei Liu, S-Lab, Nanyang Technological University, Singapore, Singapore
Chen Change Loy, S-Lab, Nanyang Technological University, Singapore, Singapore
Lewei Lu, SenseTime Research, Hong Kong, China
Tong Lu, School of Computer Science, Nanjing University, Nanjing, China
Zhengxiong Luo, Beijing Academy of Artificial Intelligence, Beijing, China
Weili Nie, Nvidia Corporation, Santa Clara, CA, USA
Liang Pan, S-Lab, Nanyang Technological University, Singapore, Singapore
Yu Qiao, Shanghai AI Laboratory, Shanghai, China
Yongming Rao, Beijing Academy of Artificial Intelligence, Beijing, China
Jiawei Ren, S-Lab, Nanyang Technological University, Singapore, Singapore
Manli Shu, University of Maryland, College Park, MD, USA
Chenyang Si, Nanyang Technological University, Singapore, Singapore
Quan Sun, Beijing Academy of Artificial Intelligence, Beijing, China
Guangcong Wang, Great Bay University, Dongguan, China
Limin Wang, Nanjing University, Nanjing, China
Shuoyuan Wang, Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China
Weiyun Wang, School of Computer Science, Fudan University, Shanghai, China; Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, China
Xinlong Wang, Beijing Academy of Artificial Intelligence, Beijing, China
Yi Wang, Shanghai AI Laboratory, Shanghai, China
Yueze Wang, Beijing Academy of Artificial Intelligence, Beijing, China
Hongxin Wei, Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China
Chaowei Xiao, University of Wisconsin, Madison, WI, USA
Jingkang Yang, S-Lab, Nanyang Technological University, Singapore, Singapore
Lei Yang, SenseTime Research, Hong Kong, China
Jiashuo Yu, Shanghai AI Laboratory, Shanghai, China
Qiying Yu, Tsinghua University, Beijing, China
Zhiding Yu, Nvidia Corporation, Santa Clara, CA, USA
Yuhang Zang, Shanghai AI Laboratory, Shanghai, China
Fan Zhang, Beijing Academy of Artificial Intelligence, Beijing, China
Mingyuan Zhang, S-Lab, Nanyang Technological University, Singapore, Singapore
Renrui Zhang, The Chinese University of Hong Kong, Hong Kong, China
Xiaosong Zhang, Beijing Academy of Artificial Intelligence, Beijing, China
Yuanhan Zhang, S-Lab, Nanyang Technological University, Singapore, Singapore
Chong Zhou, S-Lab, Nanyang Technological University, Singapore, Singapore
Kaiyang Zhou, Hong Kong Baptist University, Hong Kong, China
Xiangyang Zhu, Shanghai Artificial Intelligence Laboratory, Shanghai, China
Xizhou Zhu, Department of Electronic Engineering, Tsinghua University, Beijing, China
Chapter 1
Foundations of Vision-Language Models: Concepts and Roadmap

Kaiyang Zhou, Ziwei Liu, and Peng Gao

Abstract: Vision-language models have significantly advanced the field of artificial intelligence by bridging the gap between visual and textual understanding. These models can enable wide-ranging applications including image recognition, object detection, scene understanding, visual content generation and editing in both 2D and 3D, and visual question answering, to name a few. This chapter introduces the foundational concepts underlying these models, emphasizing their unique ability to learn multimodal representations through novel neural network architectures and large-scale data pre-training. We explore the vision-language modeling paradigm, highlight key challenges in feature alignment, scalability, and data and evaluation, and review notable progress in the field. In addition, we discuss the limitations of current approaches, from computational inefficiencies to ethical concerns, and outline potential directions for future research. This chapter serves as a roadmap for understanding the field's core principles and its transformative potential in AI applications.

Keywords: Vision-Language Model · Multimodal Learning · Feature Alignment · Large-Scale Pre-training

K. Zhou, CS Department, Hong Kong Baptist University, Hong Kong, China; e-mail: kyzhou@hkbu.edu.hk
Z. Liu, S-Lab, Nanyang Technological University, Singapore, Singapore; e-mail: ziwei.liu@ntu.edu.sg
P. Gao, Shanghai Artificial Intelligence Laboratory, Shanghai, China; e-mail: gaopeng@pjlab.org.cn

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2026
K. Zhou et al. (eds.), Large Vision-Language Models, Advances in Computer Vision and Pattern Recognition, https://doi.org/10.1007/978-3-031-94969-2_1
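Before the chapter proper begins, it may help to see what the feature alignment and large-scale pre-training named in the abstract and keywords typically look like in code. The sketch below follows the widely used CLIP-style dual-encoder recipe, in which paired images and captions are projected into a shared embedding space and matched with a symmetric contrastive loss. It is a minimal illustration, not code from the book: the encoder placeholders, embedding dimension, and temperature initialization are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveVLM(nn.Module):
    """Minimal dual-encoder sketch of CLIP-style image-text alignment.

    `image_encoder` and `text_encoder` are placeholders for any backbone
    (e.g., a ViT and a text Transformer); only the shared projection and
    the symmetric contrastive loss are spelled out here.
    """

    def __init__(self, image_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder          # maps images -> (B, d_img)
        self.text_encoder = text_encoder            # maps tokens -> (B, d_txt)
        self.image_proj = nn.LazyLinear(embed_dim)  # project into shared space
        self.text_proj = nn.LazyLinear(embed_dim)
        # learnable temperature, initialized roughly as in CLIP (log(1/0.07))
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, images, tokens):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(tokens)), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()   # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        # symmetric loss: each image should match its own caption and vice versa
        loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
        return loss
```

Scaling this recipe to hundreds of millions or billions of image-text pairs is what the chapter refers to as large-scale pre-training, and the resulting shared embedding space is what later chapters adapt through prompting and other lightweight tuning methods.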
1.1 Introduction

Vision and language are two deeply intertwined capabilities in human intelligence. In artificial intelligence (AI) research, these two capabilities have traditionally been studied in divided fields, known as computer vision and natural language processing. Specifically, computer vision focuses on interpreting images, such as recognizing objects in images [1, 2] or identifying their pixel-wise locations [3, 4]; on the other hand, natural language processing aims to analyze and generate language, such as predicting the sentiment of customer reviews [5] or summarizing news articles or books [6].

Fundamentally, humans do not learn concepts from a single modality alone. Rather, the learning process often involves interactions between vision and language. For example, when children learn the concept of apple, they usually receive a combination of visual and linguistic information: a parent may show a real apple or a picture of an apple and say something related to it, like "this is an apple," "do you want to eat an apple," "it is red, round, and sweet"; over time, this process is repeated in various contexts, such as during a meal or bedtime story, and as a result, the child begins to associate the visual characteristics of apples with the linguistic label. Moreover, existing cognitive research shows that children acquire concepts by forming associations between words and visual scenes through everyday communication [7, 8]. Therefore, integrating vision and language is a natural step toward building artificial general intelligence (AGI).

The emergence of large vision-language models (VLMs) has dramatically changed the landscape of AI research, enabling numerous novel applications, such as detecting objects of arbitrary categories [9], generating photorealistic images based on text descriptions [10], or using natural language to instruct robots to perform grasping [11], navigation [12], or even more complicated operations like surgery [13]. Compared with early "VLMs," such as those developed during the 2010s [14–16], modern VLMs are built at a much larger scale in terms of both model architecture and training data [17–21]. For example, the size of modern VLMs has grown exponentially, evolving from just a few million parameters to hundreds of millions, and even billions. Similarly, the scale of training data has expanded to encompass billions of examples [22], a practice that has now become standard for training commercial models. Research has suggested that it is the sheer scale of modern VLMs—both their parameter size and the volume of training data—that allows these models to effectively learn extensive and generalizable world knowledge [23, 24]. With the broad knowledge learned from massive datasets, VLMs are versatile and can be adapted to a wide range of downstream applications, spanning both discriminative and generative tasks, as well as extending from 2D to 3D domains.

While the scale of modern VLMs has unlocked unprecedented capabilities, achieving this scale has presented significant challenges, spanning algorithmic, computational, and data-related dimensions. From an algorithmic perspective, designing architectures capable of integrating vision and language is nontrivial.
Furthermore, when adapting VLMs to downstream applications, one also needs to handle problems arising from task-specific designs and modality gaps, such as connecting vision and language to human pose [25] or motion [26]. Computationally, training VLMs at such scales demands enormous compute resources, which limits the widespread adoption of these models in practice. To address this challenge, advancements in efficient training techniques and adaptation methods, such as prompting [27, 28], are essential. On the data front, curating and managing billion-scale training data requires addressing various issues like data noise, bias, and diversity to ensure robust and safe learning.

This chapter introduces fundamental concepts of VLMs, with a particular focus on feature alignment, scalability, and data and evaluation. These concepts aim to provide readers with a solid foundation for understanding the problems and algorithms presented later in this book. The chapter is organized as follows. In Sect. 1.2, we discuss the evolution of the vision-language modeling paradigm. In Sect. 1.3, we highlight key challenges related to feature alignment, scalability, and data and evaluation. In Sect. 1.4, we review representative advancements and provide insights into the current state of progress.

1.2 The Vision-Language Modeling Paradigm

This section provides a background on the learning paradigm behind vision-language models (VLMs). Specifically, it discusses the evolution of the learning paradigm, core components, popular model architectures, and finally, training objectives and strategies widely used by the community.

1.2.1 The Evolution

The rise of VLMs can be largely attributed to advancements in pre-training, which has become a cornerstone of modern AI systems. Pre-training involves training a model on large-scale data for a generic task, such as predicting masked words in text [29] or missing patches in images [30]. This step allows the model to learn generic patterns, structures, and relationships in the data, enabling it to develop representations that are broadly applicable across different downstream tasks.

Evolution of Vision Models

The rise of pre-training in computer vision arguably began with ImageNet [31], a large-scale benchmark that spurred numerous breakthroughs in visual recognition. AlexNet's success with convolutional neural networks (CNNs) marked a turning point [1], demonstrating the power of deep learning by setting new performance records on ImageNet. Subsequently, the introduction of ResNet [2] addressed the notorious vanishing gradient problem with residual connections,
enabling deeper, more expressive networks. More recently, the Vision Transformer (ViT) [32] has redefined the field by treating image patches as sequences and learning representations in a fully self-attention-based manner, achieving state-of-the-art performance across various vision domains [33–36].

The development of these models was underpinned by the growing availability of massive datasets and computational power, both of which enabled increasingly sophisticated models to be trained at scale. Essentially, the remarkable generalization ability of these pre-trained models marks a broader trend in the evolution of modern computer vision. As these models continued to push the boundaries of visual recognition, they also inspired a rethinking of model architectures and learning paradigms. With advances in unsupervised and self-supervised learning techniques [37–44], the vision community began exploring ways to pre-train models on vast amounts of unlabeled data, which has not only expanded the scope of vision models but also laid the foundation for VLMs.

Evolution of Language Models

The evolution of pre-training in NLP occurred later than in computer vision. However, the trajectory from early word representation models to the emergence of large language models has not only transformed NLP but also had a profound impact on the vision domain, exemplified by models like the previously mentioned ViT. Initially, foundational models like Word2Vec [45] and GloVe [46] represented words as dense, fixed vectors, capturing semantic relationships through word co-occurrence patterns. Then, the community shifted toward contextual embeddings led by models like ELMo [47]. ELMo's dynamic word representations, derived from a bidirectional LSTM, addressed the issue of static embeddings by taking into account the context in which a word appeared. This breakthrough allowed models to produce different vector representations of a word depending on its surrounding words, significantly improving performance on tasks requiring fine-grained semantic understanding.

The advent of sequence-to-sequence (seq2seq) models [48] and the subsequent development of Transformers [49] introduced another pivotal transformation. Seq2seq models, which are based on encoder-decoder architectures, enabled significant progress in tasks like machine translation, as they could encode an entire sequence of words into a fixed-size vector and then decode it into another sequence. However, the reliance on recurrent layers weakens seq2seq models' ability to handle long-range dependencies. To remedy this weakness, Transformers replaced recurrence with self-attention mechanisms, which better capture long-sequence relationships and enable faster training due to parallelization. Representative language Transformers include BERT [29] and GPT [50], which are based on masked language modeling and autoregressive learning (a.k.a. next-token prediction), respectively.

Convergence of Vision and Language Models

The advancements of pre-trained models in both vision and language communities have largely facilitated the development of VLMs, which integrate both visual and textual modalities.
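As a concrete footnote to the two language-model objectives named above, the sketch below contrasts a masked-token loss (BERT-style) with a next-token loss (GPT-style) computed over the same token batch. It is a simplified illustration rather than an excerpt from either model: the `model` interface (token ids in, per-position vocabulary logits out), the mask token id, and the 15% mask rate are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(model, tokens, mask_id, mask_rate=0.15):
    # BERT-style: hide random positions and predict them using bidirectional context.
    inputs = tokens.clone()
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_rate
    inputs[mask] = mask_id
    logits = model(inputs)                       # (batch, seq_len, vocab_size)
    return F.cross_entropy(logits[mask], tokens[mask])

def next_token_loss(model, tokens):
    # GPT-style: predict token t+1 from tokens up to t under a causal attention mask.
    logits = model(tokens[:, :-1])               # (batch, seq_len - 1, vocab_size)
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

Both objectives are self-supervised, which is what allows them to scale to web-sized corpora; the same ideas carry over to vision and vision-language pre-training when image patches or image-text pairs take the place of text tokens.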