Zekun Moore Wang
2025
COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Yuelin Bai | Xeron Du | Yiming Liang | Leo Jin | Junting Zhou | Ziqiang Liu | Feiteng Fang | Mingshan Chang | Tianyu Zheng | Xincheng Zhang | Nuo Ma | Zekun Moore Wang | Ruibin Yuan | Haihong Wu | Hongquan Lin | Wenhao Huang | Jiajun Zhang | Chenghua Lin | Jie Fu | Min Yang | Shiwen Ni | Ge Zhang
Findings of the Association for Computational Linguistics: NAACL 2025
Remarkable progress on large language models (LLMs), particularly in English, has facilitated impressive capabilities in following human instructions. However, there remains a noticeable gap in instruction fine-tuning for Chinese, where the complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well aligned with Chinese users’ interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction-tuning dataset derived from various real-world data sources and subjected to comprehensive human verification. We conduct extensive experiments on COIG-CQIA and compare the results with strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance on diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data-mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.