Xihui Liu, UC Berkeley: “From Perception to Imagination: Vision-Language Cross-Modal Understanding and Generation”

Position: Postdoctoral Scholar

Current Institution: UC Berkeley

Abstract:

Thanks to the fast evolution of deep learning algorithms, hardware, and large-scale datasets, the past decade has witnessed rapid progress in both computer vision and natural language processing. While most previous research focuses on the visual modality or the language modality separately, bridging vision and language is a fundamental and challenging problem that remains underexplored. Cross-modal vision-language research is important for bridging visual and semantic information. It also has various applications for future intelligent systems, such as human-robot interaction. I address this problem from two perspectives: cross-modal understanding and cross-modal generation.

Firstly, traditional visual perception research mostly focuses on classification, detection, or segmentation with a closed label set. To enable more intelligent visual systems beyond closed-set labels, we explore the integration of language and vision for better cross-modal understanding. We first explore a novel cross-modal attention-guided erasing approach for referring expression grounding, where we aim at locating an object in an image given a natural language instruction. Then we introduce a method that incorporates cross-modal interaction into text-image retrieval.

Secondly, the ability to generate images and language is attracting increasing attention. We explore cross-modal generation between images and language. Image captioning aims at generating text describing a given image. We first propose a novel approach that encourages the image captioning system to generate discriminative and diverse captions. On the other hand, we investigate the challenging problem of manipulating images based on language instructions. We propose a novel framework named Open-Edit, which takes advantage of the visual-semantic embedding space for open-vocabulary, open-domain image editing. We also propose a new approach for semantic image synthesis, which can better exploit the semantic information for synthesizing images.


Xihui Liu is a postdoctoral scholar at UC Berkeley, advised by Prof. Trevor Darrell. Previously, she obtained her Ph.D. degree in July 2021 from The Chinese University of Hong Kong, supervised by Prof. Xiaogang Wang and Prof. Hongsheng Li. She obtained her bachelor’s degree from Tsinghua University in 2017. Her research interests include computer vision and deep learning, with a special emphasis on cross-modal language and vision and on image/video synthesis and editing. She interned at Adobe Research in 2019 and at NVIDIA Research in 2020. She was awarded the Adobe Research Fellowship 2020, the CVPR 2019 outstanding reviewer award, and the ICLR 2021 outstanding reviewer award.