Chun-Hsiao (Daniel) Yeh

| CV | Google Scholar | GitHub | LinkedIn |

I am a second-year Ph.D. student at the University of California, Berkeley, co-advised by Professor Stella X. Yu and Professor Meng C. Lin. I received my M.Sc. from National Taiwan University (NTU).

Prior to joining Berkeley, I had the privilege of working with Dr. Tyng-Luh Liu at IIS, Academia Sinica, and collaborating with Dr. Yubei Chen at Meta AI Research from 2019 to 2021. I also spent half a year as a Visiting Researcher at UC Berkeley / ICSI from 2018 to 2019.

In 2022, I was a Research Intern at Adobe Inc., where I had the opportunity to work with Simon Jenni, Fabian Caba, Bryan Russell, and Josef Sivic.

My research interests lie in self-supervised representation learning and understanding, particularly for image and video tasks.


UC Berkeley
Ph.D. Student
Sept. 21 - Present


Adobe Inc.
Research Intern
May. 22 - Nov. 22


IIS, Academia Sinica
Research Assistant
Apr. 20 - Aug. 21


UC Berkeley / ICSI
Visiting Researcher
Sept. 18 - Mar. 19


National Taiwan University
Master's Degree
Sept. 15 - Mar. 19

  • [02/2023] One paper accepted to CVPR 2023! Congratulations to all the Adobe co-authors!
  • [12/2022] Passed the qualifying exam and became a Ph.D. candidate at UC Berkeley!
  • [11/2022] Finished my first internship at Adobe Inc.!
  • [07/2022] One paper accepted to ECCV 2022! Can't wait to go to Israel!
  • [05/2022] Started an internship at Adobe Research, working with Simon Jenni, Fabian Caba, Bryan Russell, and Josef Sivic.
  • [01/2022] One paper accepted to ICASSP 2022.
  • [08/2021] Joined UC Berkeley as a Ph.D. student!

Meta-Personalizing Vision-Language Models to Find Named Instances in Video
Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, and Simon Jenni
Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

| Project Page | Abstract | Bibtex | Preprint | PDF | Project Video |

Large-scale vision-language models (VLMs) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and DeepFashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset.



Decoupled Contrastive Learning
Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu, Yubei Chen, and Yann LeCun
European Conference on Computer Vision (ECCV), 2022.

| Abstract | Bibtex | Preprint | PDF | Project Video |

Contrastive learning (CL) is one of the most successful paradigms for self-supervised learning (SSL). In a principled way, it considers two augmented "views" of the same image as positive to be pulled closer, and all other images as negative to be pushed further apart. However, behind the impressive success of CL-based techniques, their formulation often relies on heavy-computation settings, including large sample batches, extensive training epochs, etc. We are thus motivated to tackle these issues and establish a simple, efficient, yet competitive baseline of contrastive learning. Specifically, we identify, from theoretical and empirical studies, a noticeable negative-positive-coupling (NPC) effect in the widely used InfoNCE loss, leading to unsuitable learning efficiency concerning the batch size. To remove the NPC effect, we propose the decoupled contrastive learning (DCL) loss, which removes the positive term from the denominator and significantly improves the learning efficiency. DCL achieves competitive performance with less sensitivity to suboptimal hyperparameters, requiring neither the large batches of SimCLR, the momentum encoding of MoCo, nor a large number of epochs. We demonstrate this on various benchmarks, where DCL remains robust and is much less sensitive to suboptimal hyperparameters. Notably, SimCLR with DCL achieves 68.2% ImageNet-1K top-1 accuracy using batch size 256 within 200 epochs of pre-training, outperforming its SimCLR baseline by 6.4%. Further, DCL can be combined with the SOTA contrastive learning method, NNCLR, to achieve 72.3% ImageNet-1K top-1 accuracy with batch size 512 in 400 epochs, which represents a new SOTA in contrastive learning. We believe DCL provides a valuable baseline for future contrastive SSL studies.
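
To make "removes the positive term from the denominator" concrete, here is a minimal NumPy sketch that computes a SimCLR-style InfoNCE loss and its decoupled counterpart side by side. It is an illustration written for this page, not the released implementation: the function name `contrastive_losses`, the cosine-similarity setup, and the default temperature of 0.1 are assumptions, and any weighting variants from the paper are omitted.

```python
import numpy as np

def contrastive_losses(z1, z2, temperature=0.1):
    """Return (infonce, dcl), each averaged over all 2N anchor views.

    z1, z2: arrays of shape (N, D), embeddings of two augmented views
    of the same N images (N >= 2 so that negatives exist).
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-length rows
    n = z1.shape[0]
    sim = z @ z.T / temperature                        # (2N, 2N) scaled cosine similarity

    idx = np.arange(2 * n)
    pos_idx = (idx + n) % (2 * n)                      # partner view of each anchor
    pos = sim[idx, pos_idx]                            # positive logits, shape (2N,)

    self_mask = np.eye(2 * n, dtype=bool)
    pos_mask = np.zeros_like(self_mask)
    pos_mask[idx, pos_idx] = True

    exp_sim = np.exp(sim)
    # InfoNCE: the denominator holds the positive and all negatives.
    denom_infonce = np.where(self_mask, 0.0, exp_sim).sum(axis=1)
    # Decoupled variant: the positive is dropped, only negatives remain.
    denom_dcl = np.where(self_mask | pos_mask, 0.0, exp_sim).sum(axis=1)

    infonce = np.mean(-pos + np.log(denom_infonce))
    dcl = np.mean(-pos + np.log(denom_dcl))
    return infonce, dcl
```

The two losses differ only in the mask applied to the denominator, so in this sketch the decoupled value is always strictly smaller than the InfoNCE value for the same batch.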

  title={Decoupled contrastive learning},
  author={Yeh, Chun-Hsiao and Hong, Cheng-Yao and Hsu, Yen-Chi and Liu, Tyng-Luh and Chen, Yubei and LeCun, Yann},
  booktitle={Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXVI},

SAGA: Self-Augmentation with Guided Attention for Representation Learning
Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, and Tyng-Luh Liu
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

| Abstract | Bibtex | IEEE Webpage | PDF |

Self-supervised training that elegantly couples contrastive learning with a wide spectrum of data augmentation techniques has been shown to be a successful paradigm for representation learning. However, current methods implicitly maximize the agreement between differently augmented views of the same sample, which may perform poorly in certain situations. For example, given an image of a boat on the sea, one augmented view may be cropped solely from the boat and the other from the sea; linking these two to form a positive pair could be misleading. To resolve this issue, we introduce a Self-Augmentation with Guided Attention (SAGA) strategy, which augments input data based on predictive attention to learn representations rather than simply applying off-the-shelf augmentation schemes. As a result, the proposed self-augmentation framework enables feature learning to enhance the robustness of representation.

  title={SAGA: Self-Augmentation with Guided Attention for Representation Learning},
  author={Yeh, Chun-Hsiao and Hong, Cheng-Yao and Hsu, Yen-Chi and Liu, Tyng-Luh},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},

Scene Novelty Prediction from Unsupervised Discriminative Feature Learning
Arian Ranjbar*, Chun-Hsiao Yeh*, Sascha Hornauer, Stella X. Yu, and Ching-Yao Chan
IEEE International Conference on Intelligent Transportation Systems (ITSC), 2020.
(* indicates equal contribution)

| Abstract | Bibtex | IEEE Webpage | PDF |

Deep learning approaches are widely explored in safety-critical autonomous driving systems on various tasks. Network models, trained on big data, map input to probable prediction results. However, it is unclear how to get a measure of confidence on this prediction at the test time. Our approach to gain this additional information is to estimate how similar test data is to the training data that the model was trained on. We map training instances onto a feature space that is the most discriminative among them. We then model the entire training set as a Gaussian distribution in that feature space. The novelty of the test data is characterized by its low probability of being in that distribution, or equivalently a large Mahalanobis distance in the feature space. Our distance metric in the discriminative feature space achieves a better novelty prediction performance than the state-of-the-art methods on most classes in CIFAR-10 and ImageNet. Using semantic segmentation as a proxy task often needed for autonomous driving, we show that our unsupervised novelty prediction correlates with the performance of a segmentation network trained on full pixel-wise annotations. These experimental results demonstrate potential applications of our method upon identifying scene familiarity and quantifying the confidence in autonomous driving actions.
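
The Gaussian-plus-Mahalanobis step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the features are taken as given (the discriminative feature learning itself is not shown), the helper names `fit_gaussian` and `mahalanobis_novelty` are hypothetical, and the small ridge added to the covariance is a numerical convenience rather than something taken from the paper.

```python
import numpy as np

def fit_gaussian(train_features, ridge=1e-6):
    """Model the training set as one Gaussian in feature space.

    train_features: array of shape (N, D). Returns the mean and the
    inverse covariance. The ridge term (an assumption) keeps the
    covariance invertible when features are nearly collinear.
    """
    mean = train_features.mean(axis=0)
    cov = np.cov(train_features, rowvar=False)
    cov = cov + ridge * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis_novelty(features, mean, cov_inv):
    """Mahalanobis distance of each row to the training Gaussian.

    A larger distance means a lower probability under the Gaussian,
    i.e. a more novel sample.
    """
    diff = features - mean
    return np.sqrt(np.einsum("nd,de,ne->n", diff, cov_inv, diff))

# Toy usage with synthetic features standing in for learned ones.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))
mean, cov_inv = fit_gaussian(train)
familiar = rng.normal(size=(50, 4))           # drawn like the training data
shifted = rng.normal(loc=6.0, size=(50, 4))   # drawn far from the training data
familiar_scores = mahalanobis_novelty(familiar, mean, cov_inv)
shifted_scores = mahalanobis_novelty(shifted, mean, cov_inv)
```

In this toy setup the shifted samples receive clearly larger scores than the familiar ones, which is the behavior the novelty measure relies on.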

  title={Scene novelty prediction from unsupervised discriminative feature learning},
  author={Ranjbar, Arian and Yeh, Chun-Hsiao and Hornauer, Sascha and Yu, Stella X. and Chan, Ching-Yao},
  booktitle={2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC)},

Face Liveness Detection Based on Perceptual Image Quality Assessment Features with Multi-scale Analysis
Chun-Hsiao Yeh and Herng-Hua Chang
IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.

| Abstract | Bibtex | IEEE Webpage | PDF |

Vulnerability of recognition systems to spoofing attacks (presentation attacks) is still an open security issue in the biometrics domain. Among all biometric traits, the face is exposed to the most serious threat since it is particularly easy to access and reproduce. In this paper, an effective approach against face spoofing attacks based on perceptual image quality assessment features with multi-scale analysis is presented. First, we demonstrate that the recently proposed blind image quality evaluator (BIQE) is effective in detecting spoofing attacks. Next, we combine the BIQE with an image quality assessment model called effective pixel similarity deviation (EPSD), which we propose to obtain the standard deviation of the gradient magnitude similarity map by selecting effective pixels in the image. A total of 21 features acquired from the BIQE and EPSD constitute the multi-scale descriptor for classification. Extensive experiments based on both intra-dataset and cross-dataset protocols were performed using three existing benchmarks, namely, Replay-Attack, CASIA, and UVAD. The proposed algorithm demonstrated its superiority in detecting face spoofing attacks over many state-of-the-art methods. We believe that incorporating image quality assessment knowledge into face liveness detection is a promising way to improve the overall accuracy.

  title={Face liveness detection based on perceptual image quality assessment features with multi-scale analysis},
  author={Yeh, Chun-Hsiao and Chang, Herng-Hua},
  booktitle={2018 IEEE Winter conference on applications of computer vision (WACV)},

Face Liveness Detection with Feature Discrimination between Sharpness and Blurriness
Chun-Hsiao Yeh and Herng-Hua Chang
International Conference on Machine Vision Applications (MVA), 2017. (oral)

| Abstract | Bibtex | IEEE Webpage | PDF |

Face recognition has been extensively used in a wide variety of security systems for identity authentication for years. However, many security systems are vulnerable to face spoofing attacks (e.g., 2D printed photo, replayed video). Consequently, a number of anti-spoofing approaches have been proposed. In this study, we introduce a new algorithm that addresses face liveness detection based on the digital focus technique. The proposed algorithm relies on the property of digital focus with various depths of field (DOFs) while shooting. Two features, the blurriness level and the gradient magnitude threshold, are computed on the nose and the cheek subimages. The differences in these two features between the nose and the cheek in real face images and spoofing face images are used to facilitate detection. A total of 75 subjects with both real and spoofing face images were used to evaluate the proposed framework. Preliminary experimental results indicated that this new face liveness detection system achieved a high recognition rate of 94.67% and outperformed many state-of-the-art methods. The computation speed of the proposed algorithm was the fastest among the tested methods.

  title={Face liveness detection with feature discrimination between sharpness and blurriness},
  author={Yeh, Chun-Hsiao and Chang, Herng-Hua},
  booktitle={2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA)},

(TensorFlow) Comics Generation
NTU CSIE ADLxMLDS 2017 Fall Project
TensorFlow implementation of a Conditional Generative Adversarial Network (CGAN) that automatically generates anime images based on given constraints (e.g., green hair, blue eyes).


(TensorFlow) Video Captioning
NTU CSIE ADLxMLDS 2017 Fall Project
Implementation of a Seq2seq model (S2VT) with an attention mechanism that generates descriptions (captions) for a given video.

  Professional Activities
  • Conference Reviewer: IV 2022, ECCV 2022, CVPR 2023, ICCV 2023
  • Journal Reviewer: IEEE Access
  Awards and Honors
  • UC Berkeley Conference Travel Grant | $1500, 2022
  • National Taiwan University Exchange Program Application - Top 3.5% (17/501), 2016
  • Top-3 in IEEE International Conference on Robotics and Automation (ICRA) Challenge, USA, 2015
  • First Prize in Undergraduate Project Competition, 2015
  • Third Place in Federation of International Robot-Soccer Association (FIRA) Competition, China, 2014
  • First Prize in International Competition on Intelligent Humanoid Robotics (HuroCup), Taiwan, 2014
  • Dean's List Award, 2014
  • Top-5 in IEEE International Robot Hands-on Competition & Symposium Robot Bowling Competition (IRHOCS), Taiwan, 2013
