Xuan He
2025
Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks?
Xuan He
|
Da Yin
|
Nanyun Peng
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
How can “weak teacher models” (Bowman et al., 2022) such as average human annotators or existing AI systems, effectively supervise LLMs to improve performance on hard reasoning tasks, especially those that challenge and requires expertise or daily practice from the teacher models? In this paper, we seek for empirical answers to this question by investigating various data-driven strategies that offer supervision data at different quality levels upon tasks of varying complexity. Two intuitive strategies emerge for teacher models to provide supervision during alignment training: 1) using lower-quality supervision from complete tasks that match the difficulty of the target reasoning tasks, and 2) leveraging higher-quality supervision from easier subtasks that are less challenging. Interestingly, we find that even when the outcome error rate for hard task supervision is high (e.g., 90%), training on such data can outperform perfectly correct supervision on easier subtasks on multiple hard math benchmarks. We further identify a more critical factor influencing training performance: step-wise error rates, which indicate the severity of errors in solutions. Specifically, training on hard task supervision with the same outcome error rates but disparate step-wise error rates can lead to a 30% accuracy gap on MATH benchmark. Our results also reveal that supplementing hard task supervision with the corresponding subtask supervision can yield notable performance improvements than simply combining rephrased hard full task supervision, suggesting new avenues for data augmentation. Data and code will be released upon acceptance.
2024
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Xuan He
|
Dongfu Jiang
|
Ge Zhang
|
Max Ku
|
Achint Soni
|
Sherman Siu
|
Haonan Chen
|
Abhranil Chandra
|
Ziyan Jiang
|
Aaran Arulraj
|
Kai Wang
|
Quy Duc Do
|
Yuansheng Ni
|
Bohan Lyu
|
Yaswanth Narsupalli
|
Rongqi Fan
|
Zhiheng Lyu
|
Bill Yuchen Lin
|
Wenhu Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis)based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman’s correlation betweenVideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further result onother held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with humanjudges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
Search
Fix data
Co-authors
- Aaran Arulraj 1
- Abhranil Chandra 1
- Haonan Chen 1
- Wenhu Chen 1
- Quy Duc Do 1
- show all...