Zekun Moore Wang


2025

COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Yuelin Bai | Xeron Du | Yiming Liang | Leo Jin | Junting Zhou | Ziqiang Liu | Feiteng Fang | Mingshan Chang | Tianyu Zheng | Xincheng Zhang | Nuo Ma | Zekun Moore Wang | Ruibin Yuan | Haihong Wu | Hongquan Lin | Wenhao Huang | Jiajun Zhang | Chenghua Lin | Jie Fu | Min Yang | Shiwen Ni | Ge Zhang
Findings of the Association for Computational Linguistics: NAACL 2025

Remarkable progress on large language models (LLMs), particularly in English, has facilitated impressive capabilities in following human instructions. However, there remains a noticeable gap in instruction fine-tuning for Chinese, where complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well aligned with Chinese users’ interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world data resources and subjected to comprehensive human verification. We conduct extensive experiments on COIG-CQIA and compare the resulting models against strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance on diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
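A minimal sketch of loading the released dataset with the Hugging Face datasets library is shown below; the subset name "ruozhiba" and the "train" split are assumptions for illustration, and the available configuration names should be checked on the dataset page.

    # Minimal sketch: loading COIG-CQIA from the Hugging Face Hub.
    # Assumes the `datasets` library is installed (pip install datasets).
    # The subset name "ruozhiba" and split "train" are illustrative
    # assumptions; consult the dataset card for the actual configurations.
    from datasets import load_dataset

    dataset = load_dataset("m-a-p/COIG-CQIA", "ruozhiba", split="train")

    # Inspect one instruction-response pair.
    example = dataset[0]
    print(example)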