Chun-Hsiao (Daniel) Yeh

Chun-Hsiao (Daniel) Yeh
email: daniel_yeh_at_berkeley.edu

I am a fourth-year Ph.D. candidate at University of California, Berkeley, co-advised by Professor Yi Ma (EECS / HKU) and Professor Meng C. Lin (Vision Science). I received my M.Sc. at National Taiwan University (NTU).

Throughout my time at Berkeley, I have engaged in close collaborations with Prof. Yubei Chen at Meta FAIR / UC Davis. I also worked closely with Prof. Stella Yu during the initial two years of my Ph.D. study. Prior to joining Berkeley, I had the privilege of working with Dr. Tyng-Luh Liu at IIS, Academia Sinica. I was a visiting researcher at UC Berkeley / ICSI from 2018 to 2019.

I am currently doing research internship with Meta FAIR, working with Fanyi Xiao on VLM's spatial reasoning & 3D understanding. I also spend two wonderful summers as a research intern at Adobe Research. In 2024, my focus was on the VLM planner for complex instruction-based image editing, collaborating with Krishna Singh, Yilin Wang, Cherry Zhao, Yuheng Li, and Richard Zhang. In 2022, I worked on personalized VLM for video retrieval with Simon Jenni, Fabian Caba, Bryan Russell, and Josef Sivic.

I am passionate about building world models that integrate source from various modalities, with a particular focus on aligning vision with language.

UC Berkeley Ph.D. Student Sept. 21 - Present	FAIR Research Intern May. 25 - Present	Adobe Inc. Research Intern 2nd: May. 24 - Mar. 25 1st: May. 22 - Nov. 22	IIS, Academia Sinica Research Assistant Apr. 20 - Aug. 21	NTU Master Degree Sept. 15 - Mar. 19

News

[05/2025] I join Meta FAIR @ Menlo Park as research intern this summer!
[05/2025] One paper, Gen4Gen is accepted to BMVC 2025 @ UK!
[06/2024] One paper, MDPipe is accepted to MICCAI 2024 @ Morocco!
[05/2024] Start my summer research internship at Adobe Reseach @ San Jose, CA!
[06/2023] One paper is accepted to CVPR 2023 MultiEarth Workshop!
[02/2023] One paper, Meta-Personalization is accepted to CVPR 2023! Congratulations to all the Adobe co-authors!
[12/2022] Passed the qualifying exam and became a PhD candidate @ UC Berkeley !
[11/2022] Finish my first internship at Adobe Inc. !
[07/2022] One paper, DCL is accepted to ECCV 2022! Can't wait to go Israel!
[05/2022] Start the internship at Adobe Research working with Simon Jenni, Fabian Caba, Bryan Russell, and Josef Sivic.
[01/2022] One paper, SAGA is accepted to ICASSP 2022.
[08/2021] Join UC Berkeley as a Ph.D. student!

Publications

	Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, and Yi Ma ArXiv, 2025. \| Project Page \| Abstract \| Bibtex \| Preprint \| PDF \| 🤗 Huggingface Benchmark \| Code \| Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge in Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and crossview correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 human carefully annotated multi-view question–answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test model’s geometric correspondence and the capacity to align information consistently across views. Our extensive experiments, benchmark on 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators reveals a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs are particularly underperforming under two aspects: (1) cross-view correspondence for partially occluded views and (2) establishing the coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that our All-Angles Bench offers valuable insights and contribute to bridging the gap between MLLMs and human-level multiview understanding @article{yeh2025seeing, title={Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs}, author={Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao and Yi Ma}, journal={arXiv preprint arXiv:2504.15280}, year={2025} }
	Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, and Krishna Kumar Singh ArXiv, 2025. \| Project Page \| Abstract \| Bibtex \| Preprint \| Code (Coming Soon) \| Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction , X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train \textbf{X-Planner} which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark. @article{yeh2025beyond, title={Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing}, author={Yeh, Chun-Hsiao and Wang, Yilin and Zhao, Nanxuan and Zhang, Richard and Li, Yuheng and Ma, Yi and Singh, Krishna Kumar}, journal={arXiv preprint arXiv:2507.05259}, year={2025} }
	Insight: A Multi-Modal Diagnostic Pipeline using LLMs for Ocular Surface Disease Diagnosis Chun-Hsiao Yeh, Jiayun Wang, Andrew D. Graham, Andrea J. Liu, Bo Tan, Yubei Chen, Yi Ma, and Meng C. Lin 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2024. \| Project Page \| Abstract \| Bibtex \| Preprint \| PDF \| Poster \| 🤗 Huggingface Model \| Code \| Accurate diagnosis of ocular surface diseases is critical in optometry and ophthalmology, which hinge on integrating clinical data sources (e.g., meibography imaging and clinical metadata). Traditional human assessments lack precision in quantifying clinical observations, while current machine-based methods often treat diagnoses as multi-class classification problems, limiting the diagnoses to a predeﬁned closed-set of curated answers without reasoning the clinical relevance of each variable to the diagnosis. To tackle these challenges, we introduce an innovative multi-modal diagnostic pipeline (MDPipe) by employing large language models (LLMs) for ocular surface disease diagnosis. We first employ a visual translator to interpret meibography images by converting them into quantifiable morphology data, facilitating their integration with clinical metadata and enabling the communication of nuanced medical insight to LLMs. To further advance this communication, we introduce a LLM-based summarizer to contextualize the insight from the combined morphology and clinical metadata, and generate clinical report summaries. Finally, we refine the LLMs' reasoning ability with domain-specific insight from real-life clinician diagnoses. Our evaluation across diverse ocular surface disease diagnosis benchmarks demonstrates that MDPipe outperforms existing standards, including GPT-4, and provides clinically sound rationales for diagnoses. @inproceedings{yeh2024insight, title={Insight: A Multi-modal Diagnostic Pipeline Using LLMs for Ocular Surface Disease Diagnosis},author={Yeh, Chun-Hsiao and Wang, Jiayun and Graham, Andrew D and Liu, Andrea J and Tan, Bo and Chen, Yubei and Ma, Yi and Lin, Meng C},booktitle= {International Conference on Medical Image Computing and Computer-Assisted Intervention},pages={711--721},year={2024}, organization={Springer} }
	Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition Chun-Hsiao Yeh, Ta-Ying Cheng, He-Yen Hsieh, Chuan-En Lin, Yi Ma, Andrew Markham, Niki Trigoni, H.T. Kung, and Yubei Chen British Machine Vision Conference (BMVC)*, 2025. \| Project Page \| Abstract \| Bibtex \| Preprint \| Code \| Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models. First, current personalization techniques fail to reliably extend to multiple concepts --- we hypothesize this to be due to the mismatch between complex scenes and simple text descriptions in the pre-training dataset (e.g., LAION). Second, given an image containing multiple personalized concepts, there lacks a holistic metric that evaluates performance on not just the degree of resemblance of personalized concepts, but also whether all concepts are present in the image and whether the image accurately reflects the overall text description. To address these issues, we introduce Gen4Gen, a semi-automated dataset creation pipeline utilizing generative models to combine personalized concepts into complex compositions along with text-descriptions. Using this, we create a dataset called MyCanvas, that can be used to benchmark the task of multi-concept personalization. In addition, we design a comprehensive metric comprising two scores (CP-CLIP and TI-CLIP) for better quantifying the performance of multi-concept, personalized text-to-image diffusion methods. We provide a simple baseline built on top of Custom Diffusion with empirical prompting strategies for future researchers to evaluate on MyCanvas. We show that by improving data quality and prompting strategies, we can significantly increase multi-concept personalized image generation quality, without requiring any modifications to model architecture or training algorithms. @article{yeh2024gen4gen, title={Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition},author={Yeh, Chun-Hsiao and Cheng, Ta-Ying and Hsieh, He-Yen and Lin, Chuan-En and Ma, Yi and Markham, Andrew and Trigoni, Niki and Kung, Hsiang-Tsung and Chen, Yubei}, journal={arXiv preprint arXiv:2402.15504}, year={2024} }
	Magic-Me: Identity-Specific Video Customized Diffusion Ze Ma, Daquan Zhou, Xue-She Wang, Chun-Hsiao Yeh, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, Jiashi Feng European Conference on Computer Vision (ECCV) AI4VA Workshop, 2024. \| Project Page \| Abstract \| Bibtex \| Preprint \| 🤗 Huggingface Demo \| Code \| Creating content for a specific identity (ID) has shown significant interest in the field of generative models. In the field of text-to-image generation (T2I), subject-driven content generation has achieved great progress with the ID in the images controllable. However, extending it to video generation is not well explored. In this work, we propose a simple yet effective subject identity controllable video generation framework, termed Video Custom Diffusion (VCD). With a specified subject ID defined by a few images, VCD reinforces the identity information extraction and injects frame-wise correlation at the initialization stage for stable video outputs with identity preserved to a large extent. To achieve this, we propose three novel components that are essential for high-quality ID preservation: 1) an ID module trained with the cropped identity by prompt-to-segmentation to disentangle the ID information and the background noise for more accurate ID token learning; 2) a text-to-video (T2V) VCD module with 3D Gaussian Noise Prior for better inter-frame consistency and 3) video-to-video (V2V) Face VCD and Tiled VCD modules to deblur the face and upscale the video for higher resolution. Despite its simplicity, we conducted extensive experiments to verify that VCD is able to generate stable and high-quality videos with better ID over the selected strong baselines. Besides, due to the transferability of the ID module, VCD is also working well with finetuned text-to-image models available publically, further improving its usability. @article{ma2024magic, title={Magic-Me: Identity-Specific Video Customized Diffusion}, author={Ma, Ze and Zhou, Daquan and Yeh, Chun-Hsiao and Wang, Xue-She and Li, Xiuyu and Yang, Huanrui and Dong, Zhen and Keutzer, Kurt and Feng, Jiashi}, journal={arXiv preprint arXiv:2402.09368}, year={2024} }
	Meta-Personalizing Vision-Language Models to Find Named Instances in Video Chun-Hsiao Yeh, Byran Russell, Josef Sivic, Fabian Caba Heilbron, and Simon Jenni Conference on Computer Vision and Pattern Recognition (CVPR), 2023. \| Project Page \| Abstract \| Bibtex \| Preprint \| PDF \| Project Video \| Dataset \| Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as ``My dog Biscuit'' appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and DeepFashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset. @inproceedings{yeh2023meta, title={Meta-Personalizing Vision-Language Models To Find Named Instances in Video}, author={Yeh, Chun-Hsiao and Russell, Bryan and Sivic, Josef and Heilbron, Fabian Caba and Jenni, Simon}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, pages={19123--19132}, year={2023} }
	Decoupled Contrastive Learning Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu, Yubei Chen, and Yann LeCun European Conference on Computer Vision (ECCV), 2022. \| Abstract \| Bibtex \| Preprint \| PDF \| Project Video \| Code (Lightly-ai) \| Contrastive learning (CL) is one of the most successful paradigms for self-supervised learning (SSL). In a principled way, it considers two augmented "views" of the same image as positive to be pulled closer, and all other images as negative to be pushed further apart. However, behind the impressive success of CL-based techniques, their formulation often relies on heavy-computation settings, including large sample batches, extensive training epochs, etc. We are thus motivated to tackle these issues and establish a simple, efficient, yet competitive baseline of contrastive learning. Specifically, we identify, from theoretical and empirical studies, a noticeable negative-positive-coupling (NPC) effect in the widely used InfoNCE loss, leading to unsuitable learning efficiency concerning the batch size. By removing the NPC effect, we propose decoupled contrastive learning (DCL) loss, which removes the positive term from the denominator and significantly improves the learning efficiency. DCL achieves competitive performance with less sensitivity to sub-optimal hyperparameters, requiring neither large batches in SimCLR, momentum encoding in MoCo, or large epochs. We demonstrate with various benchmarks while manifesting robustness as much less sensitive to suboptimal hyperparameters. Notably, SimCLR with DCL achieves 68.2% ImageNet-1K top-1 accuracy using batch size 256 within 200 epochs pre-training, outperforming its SimCLR baseline by 6.4%. Further, DCL can be combined with the SOTA contrastive learning method, NNCLR, to achieve 72.3% ImageNet-1K top-1 accuracy with 512 batch size in 400 epochs, which represents a new SOTA in contrastive learning. We believe DCL provides a valuable baseline for future contrastive SSL studies. @inproceedings{yeh2022decoupled, title={Decoupled contrastive learning}, author={Yeh, Chun-Hsiao and Hong, Cheng-Yao and Hsu, Yen-Chi and Liu, Tyng-Luh and Chen, Yubei and LeCun, Yann}, booktitle={Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXVI}, pages={668--684}, year={2022}, organization={Springer} }
	SAGA: Self-Augmentation with Guided Attention for Representation Learning Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, and Tyng-Luh Liu IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022. \| Abstract \| Bibtex \| IEEE Webpage \| PDF \| Self-supervised training that elegantly couples contrastive learning with a wide spectrum of data augmentation techniques has been shown to be a successful paradigm for representation learning. However, current methods implicitly maximize the agreement between differently augmented views of the same sample, which may perform poorly in certain situations. For example, considering an image comprising a boat on the sea, one augmented view is cropped solely from the boat and the other from the sea, whereas linking these two to form a positive pair could be misleading. To resolve this issue, we introduce a Self-Augmentation with Guided Attention (SAGA) strategy, which augments input data based on predictive attention to learn representations rather than simply applying off-the-shelf augmentation schemes. As a result, the proposed self-augmentation framework enables feature learning to enhance the robustness of representation. @inproceedings{yeh2022saga, title={SAGA: Self-Augmentation with Guided Attention for Representation Learning}, author={Yeh, Chun-Hsiao and Hong, Cheng-Yao and Hsu, Yen-Chi and Liu, Tyng-Luh}, booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages={3463--3467}, year={2022}, organization={IEEE} }
	Scene Novelty Prediction from Unsupervised Discriminative Feature Learning Arian Ranjbar, Chun-Hsiao Yeh, Sascha Hornauer, Stella X. Yu, and Ching-Yao Chan IEEE International Conference on Intelligent Transportation Systems (ITSC), 2020. (* indicates equal contribution) \| Abstract \| Bibtex \| IEEE Webpage \| PDF \| Deep learning approaches are widely explored in safety-critical autonomous driving systems on various tasks. Network models, trained on big data, map input to probable prediction results. However, it is unclear how to get a measure of confidence on this prediction at the test time.Our approach to gain this additional information is to estimate how similar test data is to the training data that the model was trained on. We map training instances onto a feature space that is the most discriminative among them. We then model the entire training set as a Gaussian distribution in that feature space. The novelty of the test data is characterized by its low probability of being in that distribution, or equivalently a large Mahalanobis distance in the feature space.Our distance metric in the discriminative feature space achieves a better novelty prediction performance than the state-of-the-art methods on most classes in CIFAR-10 and ImageNet. Using semantic segmentation as a proxy task often needed for autonomous driving, we show that our unsupervised novelty prediction correlates with the performance of a segmentation network trained on full pixel-wise annotations. These experimental results demonstrate potential applications of our method upon identifying scene familiarity and quantifying the confidence in autonomous driving actions. @inproceedings{ranjbar2020scene, title={Scene novelty prediction from unsupervised discriminative feature learning}, author={Ranjbar, Arian and Yeh, Chun-Hsiao and Hornauer, Sascha and Stella, X Yu and Chan, Ching-Yao}, booktitle={2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC)}, pages={1--7}, year={2020}, organization={IEEE} }
	Face Liveness Detection Based on Perceptual Image Quality Assessment Features with Multi-scale Analysis Chun-Hsiao Yeh, and Herng-Hua Chang IEEE Winter Conference on Applications of Computer Vision (WACV), 2018. \| Abstract \| Bibtex \| IEEE Webpage \| PDF \| Vulnerability of recognition systems to spoofing attacks (presentation attacks) is still an open security issue in the biometrics domain. Among all biometric traits, face is exposed to the most serious threat since it is particularly easy to access and reproduce. In this paper, an effective approach against face spoofing attacks based on perceptual image quality assessment features with multiscale analysis is presented. First, we demonstrate that the recently proposed blind image quality evaluator (BIQE) is effective in detecting spoofing attacks. Next, we combine the BIQE with an image quality assessment model called effective pixel similarity deviation (EPSD), which we propose to obtain the standard deviation of the gradient magnitude similarity map by selecting effective pixels in the image. A total number of 21 features acquired from the BIQE and EPSD constitute the multi-scale descriptor for classification. Extensive experiments based on both intradataset and cross-dataset protocols were performed using three existing benchmarks, namely, Replay-Attack, CASIA, and UVAD. The proposed algorithm demonstrated its superiority in detecting face spoofing attacks over many state of the art methods. We believe that the incorporation of the image quality assessment knowledge into face liveness detection is promising to improve the overall accuracy. @inproceedings{yeh2018face, title={Face liveness detection based on perceptual image quality assessment features with multi-scale analysis}, author={Yeh, Chun-Hsiao and Chang, Herng-Hua}, booktitle={2018 IEEE Winter conference on applications of computer vision (WACV)}, pages={49--56}, year={2018}, organization={IEEE} }
	Face Liveness Detection with Feature Discrimination between Sharpness and Blurriness Chun-Hsiao Yeh, and Herng-Hua Chang International Conference on Machine, Vision and Applications (MVA), 2017. (Oral) \| Abstract \| Bibtex \| IEEE Webpage \| PDF \| Face recognition has been extensively used in a wide variety of security systems for identity authentication for years. However, many security systems are vulnerable to spoofing face attacks (e.g., 2D printed photo, replayed video). Consequently, a number of anti-spoofing approaches have been proposed. In this study, we introduce a new algorithm that addresses the face liveness detection based on the digital focus technique. The proposed algorithm relies on the property of digital focus with various depths of field (DOFs) while shooting. Two features of the blurriness level and the gradient magnitude threshold are computed on the nose and the cheek subimages. The differences of these two features between the nose and the cheek in real face images and spoofing face images are used to facilitate detection. A total of 75 subjects with both real and spoofing face images were used to evaluate the proposed framework. Preliminary experimental results indicated that this new face liveness detection system achieved a high recognition rate of 94.67% and outperformed many state-of-the-art methods. The computation speed of the proposed algorithm was the fastest among the tested methods. @inproceedings{yeh2017face, title={Face liveness detection with feature discrimination between sharpness and blurriness}, author={Yeh, Chun-Hsiao and Chang, Herng-Hua}, booktitle={2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA)}, pages={398--401}, year={2017}, organization={IEEE} }

Patents

Jenni, Simon, Fabian David Caba Heilbron, Chun-hsiao Yeh, Bryan Russell, and Josef Sivic. "Learning to Personalize Vision-Language Models through Meta-Personalization." U.S. Patent Application 18/210,535, filed December 19, 2024.

Professional Activities

Conference Reviewer: ICLR 2025, NeurIPS 2025, CVPR (2023, 2024, 2025), ECCV (2022, 2024), ICCV (2023, 2025), SIGGRAPH 2025, NeurIPS-W 2024, MICCAI (2024, 2025), CPAL (2024, 2025), ACCV 2024, IV 2022
Journal Reviewer: Pattern Recognition Letters, IEEE Access, Elseiver ESWA

Awards, Honors, and Grants

National Institutes of Health (NIH) Grant, 2023 - 2024
UC Berkeley Conference Travel Grant | $900, 2023
UC Berkeley Conference Travel Grant | $1500, 2022
National Taiwan University Exchange Program Application - Top 3.5% (17/501), 2016
Top-3 in IEEE International Conference on Robotics and Automation (ICRA) Challenge, USA, 2015
First Prize in Undergraduate Project Competition, 2015
Third Place in Federation of International Robot-Soccor Association (FIRA) Competition, China, 2014
First Prize in International Competition on Intelligent Humanoid Robotics (HuroCup), Taiwan, 2014
Dean's List Award, 2014
Top-5 in IEEE International Robot Hands-on Competition & Symposium Robot Bowling Competition (IRHOCS), Taiwan, 2013

Many thanks to webpage, webpage, and website for awesome template.