Dai An Nguyen
2025
Improving Vietnamese-English Cross-Lingual Retrieval for Legal and General Domains
Toan Ngoc Nguyen | Nam Le Hai | Nguyen Doan Hieu | Dai An Nguyen | Linh Ngo Van | Thien Huu Nguyen | Sang Dinh
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Document retrieval plays a crucial role in numerous question-answering systems, yet research has concentrated on the general knowledge domain and resource-rich languages like English. In contrast, it remains largely underexplored in low-resource languages and in cross-lingual scenarios within specialized domains such as law. We present a novel dataset designed for cross-lingual retrieval between Vietnamese and English, which not only covers the general domain but also extends to the legal field. Additionally, we propose an auxiliary loss function and a symmetrical training strategy that significantly enhance the performance of state-of-the-art models on these retrieval tasks. Our contributions offer a significant resource and methodology aimed at improving cross-lingual retrieval in both legal and general QA settings, facilitating further advancements in document retrieval research across multiple languages and a broader spectrum of specialized domains. All the resources related to our work can be accessed at huggingface.co/datasets/bkai-foundation-models/crosslingual.