Rubén Manrique

Also published as: Ruben Manrique


2025

Findings of the AmericasNLP 2025 Shared Tasks on Machine Translation, Creation of Educational Material, and Translation Metrics for Indigenous Languages of the Americas
Ona De Gibert | Robert Pugh | Ali Marashian | Raul Vazquez | Abteen Ebrahimi | Pavel Denisov | Enora Rice | Edward Gow-Smith | Juan Prieto | Melissa Robles | Rubén Manrique | Oscar Moreno | Angel Lino | Rolando Coto-Solano | Aldo Alvarez | Marvin Agüero-Torales | John E. Ortega | Luis Chiruzzo | Arturo Oncevay | Shruti Rijhwani | Katharina Von Der Wense | Manuel Mager
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

This paper presents the findings of the AmericasNLP 2025 Shared Tasks: (1) machine translation for truly low-resource languages, (2) morphological adaptation for generating educational examples, and (3) developing metrics for machine translation in Indigenous languages. The shared tasks cover 14 diverse Indigenous languages of the Americas. A total of 11 teams participated, submitting 26 systems across all tasks, languages, and models. We describe the shared tasks, introduce the datasets and evaluation metrics used, summarize the baselines and submitted systems, and report our findings.

Bridging the Gap in Health Literacy: Harnessing the Power of Large Language Models to Generate Plain Language Summaries from Biomedical Texts
Andrés Arias-Russi | Carolina Salazar-Lara | Rubén Manrique
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

A Scalable Framework for Legal Text Understanding in Regulatory and Financial Contexts.
Santiago Martínez | Juan Manuel Castañeda | Ruben Manrique
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)

This study presents a comprehensive approach to developing a domain-specific large language model (LLM) for regulatory and financial text interpretation. A specialized corpus was constructed through large-scale scraping of financial and regulatory documents across domains such as compliance, licensing, and financial reporting. The data was preprocessed using GPT-4o-mini with prompt engineering to retain critical information and remove noise. We further pre-trained a LLaMA-3.1-8B model on the curated corpus and fine-tuned it on an instruction dataset covering nine tasks from the COLING 2025 Regulations Challenge, including acronym expansion, regulatory question answering, and XBRL-based financial analytics, employing QLoRA to reduce memory requirements. The model exhibits a slight improvement over the baseline in answering complex regulatory questions (detailed QA) and in expanding acronyms. This study demonstrates the potential of domain-specific LLMs in regulatory text interpretation and lays the groundwork for future research in specialized NLP evaluation methodologies.
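
The QLoRA setup described above can be sketched roughly with the Hugging Face transformers, peft, and bitsandbytes libraries; the checkpoint name, adapter rank, and target modules below are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of QLoRA-style fine-tuning; hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Llama-3.1-8B"  # assumed base checkpoint

# 4-bit quantization so the 8B model fits on a single GPU (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Only small low-rank adapters are trained; the quantized base weights stay frozen
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the small trainable fraction of weights
```

The wrapped model can then be passed to a standard instruction-tuning loop (e.g., the transformers Trainer or TRL's SFTTrainer) over the nine-task instruction dataset.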

Historical Ink: Exploring Large Language Models for Irony Detection in 19th-Century Spanish
Kevin Cohen | Laura Manrique-Gómez | Ruben Manrique
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

This study explores the use of large language models (LLMs) to enhance datasets and improve irony detection in 19th-century Latin American newspapers. Two strategies were employed to evaluate the efficacy of BERT and GPT models in capturing the subtle, nuanced nature of irony through both multi-class and binary classification tasks. First, we implemented dataset enhancements focused on enriching emotional and contextual cues; however, these showed limited impact on historical language analysis. The second strategy, a semi-automated annotation process, effectively addressed class imbalance and augmented the dataset with high-quality annotations. Despite the challenges posed by the complexity of irony, this work advances sentiment analysis through two key contributions: introducing a new historical Spanish dataset tagged for sentiment analysis and irony detection, and proposing a semi-automated annotation methodology in which human expertise is crucial for refining LLM results, enriched by incorporating historical and cultural contexts as core features.
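
As a rough illustration of the binary classification setting, the sketch below fine-tunes a Spanish BERT checkpoint on an irony-labeled CSV file with Hugging Face transformers; the checkpoint, file names, and hyperparameters are assumptions for illustration, not the paper's actual configuration.

```python
# Minimal sketch of binary irony classification with a Spanish BERT model (assumed setup).
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)
from datasets import load_dataset

checkpoint = "dccuchile/bert-base-spanish-wwm-cased"  # BETO, a common Spanish BERT (assumed)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Assumes CSV files with "text" and "label" columns (0 = not ironic, 1 = ironic)
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=256), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="irony-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch to its longest example
)
trainer.train()
```

The multi-class variant differs only in `num_labels` and the label column, and a GPT-based system would instead be prompted with few-shot examples.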

2024

Translation systems for low-resource Colombian Indigenous languages, a first step towards cultural preservation
Juan Prieto | Cristian Martinez | Melissa Robles | Alberto Moreno | Sara Palacios | Rubén Manrique
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

The use of machine learning and Natural Language Processing (NLP) technologies can assist in the preservation and revitalization of indigenous languages, particularly those classified as “low-resource.” Given the increasing digitization of information, the development of translation tools for these languages is of significant importance. These tools not only facilitate better access to digital resources for indigenous communities but also stimulate language preservation efforts and potentially foster more inclusive, equitable societies, as demonstrated by the AmericasNLP workshop since 2021. The focus of this paper is Colombia, a country home to 65 distinct indigenous languages, presenting a vast spectrum of linguistic characteristics. This cultural and linguistic diversity is an inherent pillar of the nation’s identity, and safeguarding it has been increasingly challenging given the dwindling number of native speakers and the communities’ inclination towards oral traditions. Considering this context, scattered initiatives exist to develop translation systems for these languages. However, these endeavors suffer from a lack of consolidated, comparable data. This paper consolidates a dataset of parallel data in four Colombian indigenous languages - Wayuunaiki, Arhuaco, Inga, and Nasa - gathered from existing digital resources. It also presents the creation of baseline models for future translation and comparison, ultimately serving as a catalyst for incorporating more digital resources progressively.
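
One way such translation baselines can be compared on a held-out parallel test set is with corpus-level BLEU and chrF from sacrebleu, as in the minimal sketch below; the file names and the choice of metrics are assumptions for illustration only, not the paper's reported evaluation.

```python
# Illustrative comparison of a baseline translation system against references (assumed files).
import sacrebleu

with open("test.nasa.ref", encoding="utf-8") as f:   # reference Nasa translations (hypothetical file)
    refs = [line.strip() for line in f]
with open("test.nasa.hyp", encoding="utf-8") as f:   # baseline system outputs (hypothetical file)
    hyps = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}  chrF = {chrf.score:.2f}")
```

Character-level metrics such as chrF are often preferred for morphologically rich, low-resource languages, since word-level overlap can be sparse even for good translations.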

Historical Ink: Semantic Shift Detection for 19th Century Spanish
Tony Montes | Laura Manrique-Gómez | Rubén Manrique
Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gomez | Tony Montes | Arturo Rodriguez Herrera | Ruben Manrique
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.
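
A minimal sketch of LLM-based OCR post-correction in this spirit is shown below, assuming the OpenAI chat API and an illustrative prompt; neither the model name nor the prompt wording is necessarily what the authors' framework uses.

```python
# Hypothetical sketch of LLM-based OCR post-correction for historical Spanish text.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You correct OCR errors in 19th-century Latin American Spanish newspaper text. "
    "Fix character-level recognition mistakes only; preserve historical spelling, "
    "punctuation, and line structure."
)

def correct_ocr(passage: str) -> str:
    """Send one OCR'd passage to the model and return the corrected text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",           # assumed model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": passage},
        ],
        temperature=0,                 # deterministic output for reproducible corrections
    )
    return response.choices[0].message.content
```

In a semi-automated pipeline like the one described, such corrections would typically be spot-checked by annotators before being folded back into the corpus.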