📝 Selected Publications
For my full publication list, please see my Google Scholar profile.

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo
- We propose OmniMedVQA, a large-scale, comprehensive Visual Question Answering benchmark tailored to the medical domain. OmniMedVQA spans 12 different imaging modalities and more than 20 distinct human anatomical regions, establishing a comprehensive benchmark for evaluating the fundamental capabilities of LVLMs on medical challenges.
- We conduct a thorough evaluation of 12 different LVLMs, including 8 general-domain LVLMs and 4 specialized LVLMs designed for medical applications. To the best of our knowledge, this is currently the most comprehensive evaluation of LVLMs in the medical domain.

Beyond One-to-One: Rethinking the Referring Image Segmentation
Yutao Hu, Qixiong Wang, Wenqi Shao, Enze Xie, Zhenguo Li, Jungong Han, Ping Luo
- We identify a deficiency of existing referring image segmentation methods when handling one-to-many and one-to-zero text inputs, which strongly limits their value in real-world applications.
- We collect a new challenging dataset, termed RefZOM, in which the text inputs are not limited to the one-to-one setting. The proposed dataset provides a new perspective and benchmark for future research.

Hierarchical self-distilled feature learning for fine-grained visual categorization
Yutao Hu, Xiaolong Jiang, Xuhui Liu, Xiaoyan Luo, Yao Hu, Xianbin Cao, Baochang Zhang, Jun Zhang
- We rethink the efficacy of deep supervision in fine-grained visual categorization (FGVC) and propose a new learning framework, termed HSD, in which the prediction from the next block is used as the soft label for the previous one. By doing so, overly strict intermediate supervision, which has been proven detrimental to FGVC, is avoided.
- We design a distillation strategy with dynamic temperature adjustment. In HSD, instead of using a fixed temperature hyperparameter T to produce the soft labels, we employ SIELoss to regularize each prediction, which not only drives each layer to deliver soft labels adaptively but also improves the generalization ability of the network (a minimal sketch of the idea follows below).
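
  The snippet below is a minimal, illustrative sketch of the hierarchical self-distillation idea, assuming a backbone split into sequential blocks with one auxiliary classifier per block; the names (`blocks`, `aux_heads`) are placeholders, and the paper's SIELoss, which adaptively controls the softness of each prediction, is replaced here by a plain KL term for brevity.

  ```python
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class HSDSketch(nn.Module):
      """Backbone stages with one auxiliary classifier per stage (illustrative)."""
      def __init__(self, blocks, aux_heads):
          super().__init__()
          self.blocks = nn.ModuleList(blocks)        # backbone stages, shallow -> deep
          self.aux_heads = nn.ModuleList(aux_heads)  # one classifier per stage

      def forward(self, x):
          logits = []
          for block, head in zip(self.blocks, self.aux_heads):
              x = block(x)
              logits.append(head(x))                 # per-stage predictions
          return logits

  def hsd_loss(logits, target):
      # Hard-label loss only on the deepest (final) prediction.
      loss = F.cross_entropy(logits[-1], target)
      # Each shallower stage is supervised by the next deeper stage's prediction,
      # used as a soft label (detached so gradients do not flow into the teacher).
      for shallow, deep in zip(logits[:-1], logits[1:]):
          soft_target = F.softmax(deep.detach(), dim=1)
          loss = loss + F.kl_div(F.log_softmax(shallow, dim=1),
                                 soft_target, reduction="batchmean")
      return loss
  ```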

Variational self-distillation for remote sensing scene classification
Yutao Hu, Xin Huang, Xiaoyan Luo, Jungong Han, Xianbin Cao, Jun Zhang
- We propose the Variational Self-Distillation Network (VSD-Net) to adaptively exploit class-entanglement information and hierarchically distill it from the deep layers into the shallow parts of the network. In this way, fine-grained information beyond the category labels is provided during training, contributing to discriminative feature learning.

NAS-Count: Counting-by-Density with Neural Architecture Search
Yutao Hu, Xiaolong Jiang, Xuhui Liu, Baochang Zhang, Jungong Han, Xianbin Cao, David Doermann
- To the best of our knowledge, NAS-Count is the first attempt to introduce NAS into crowd counting, where a multi-scale architecture is automatically developed to address the scale-variation issue.
- A counting-specific two-level search space is developed in NAS-Count, from which a multi-path encoder-decoder architecture (AMSNet) is discovered efficiently with a differentiable search strategy optimized by stochastic gradient descent (SGD); a minimal sketch of the differentiable-search idea follows after this list.
- A Scale Pyramid Pooling Loss (SPPLoss) is searched automatically to improve on plain MSE supervision, which helps produce higher-quality density maps by optimizing structural information across multiple scales.
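
  The sketch below illustrates, under simplifying assumptions, the differentiable-search mechanism referenced above in the spirit of DARTS: each edge holds a softmax-weighted mixture of candidate operations, and the mixture weights (architecture parameters) are trained by gradient descent alongside the operation weights. The candidate operations shown are illustrative only, not the paper's counting-specific two-level search space.

  ```python
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MixedOp(nn.Module):
      """One searchable edge: a softmax-weighted mixture of candidate ops (illustrative)."""
      def __init__(self, channels):
          super().__init__()
          self.ops = nn.ModuleList([
              nn.Conv2d(channels, channels, 3, padding=1),              # 3x3 conv
              nn.Conv2d(channels, channels, 3, padding=2, dilation=2),  # dilated 3x3 conv
              nn.Identity(),                                            # skip connection
          ])
          # One architecture parameter per candidate operation.
          self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

      def forward(self, x):
          weights = F.softmax(self.alpha, dim=0)
          return sum(w * op(x) for w, op in zip(weights, self.ops))

  # After the search converges, the operation with the largest alpha on each edge
  # would be kept to form the discrete architecture (AMSNet in the paper).
  ```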