| CET | Sunday 29/03 | Monday 30/03 | Tuesday 31/03 | Wednesday 1/04 | Thursday 2/04 | Friday 3/04 |
|---|---|---|---|---|---|---|
| 8-9 | Breakfast | Nature activities | Breakfast | Breakfast | Breakfast | |
| 9-10 | Welcome session | Poster session 3 | (9.30) Julia Kreutzer | | | |
| 10-11 | Roger K. Moore | François Yvon | | | | |
| 11-12 | Carlos Ramisch, Manon Scholivet: Zen Research 3/4 | | | | | |
| 12-13 | Lunch | Lunch | Lunch | Lunch | (12.30) Lunch | |
| 13-14 | Poster session 1 | Carlos Ramisch, Manon Scholivet: Zen Research 1 | Carlos Ramisch, Manon Scholivet: Zen Research 2 | Shuttle back to Grenoble | Nature activities | |
| 14-15 | | | | | | |
| 15-16 | Timothée Lacroix | Projects | Project | | | |
| 16-17 | Arrival in Grenoble | Coffee break | Coffee break | Coffee break | | |
| 17-18 | Poster session 2 | Isabelle Augenstein | Project | | | |
| 18-19 | Project presentation | | | | | |
| 19-20 | Dinner | Dinner | Dinner | Dinner | Dinner | |
| 20-21 | Social | Social | Social with karaoke! | Birds-of-a-feather session | Social | |
| 21-22 | | | | | | |
Poster Sessions
Session 1
-
A1 Zofia Milczarek
Evaluating Social Intelligence in the Long-Term[Abstract]
[Work in Progress] Evaluating LLMs and LLM-based agents on long-term memory tasks has grown in popularity in recent years. This evaluation ranges from retrieval of small pieces of information (needle-in-a-haystack), to complex temporal and multi-hop reasoning over long-term dialogue traces, to evaluating agents’ capabilities in simulated environments. The generalisation abilities of LLMs have also given rise to studies that evaluate their social intelligence – a dimension becoming increasingly important as LLM-based systems are deployed in real-life human-facing contexts (e.g. hospitals, personal assistants, therapy). Existing evaluations focus on question answering or short-term simulations. Since long-term social intelligence has been overlooked so far, we propose to join these two emerging directions and evaluate it via simulated interactions inspired by SmallVille agents. -
A2 Yosra Jelassi
Interpretable attributes for transparent Language Detection[Abstract]
The lack of transparency in machine learning models raises critical concerns, particularly in domains where interpretability and fairness are essential. Understanding the why behind a model’s prediction is crucial not only for experts seeking to validate system behavior but also for ensuring that decisions do not reflect biases. Indeed, fairness often rests on our ability to determine whether predictions are influenced by discriminatory patterns or protected attributes. To address this challenge, the field of explainability has emerged, seeking to develop methods that help uncover the behavior of these systems and make their decisions intelligible to humans. The literature distinguishes multiple types of explainability. One of them seeks information about the model’s learned representations. In this context, a state-of-the-art approach called BA-LR aims to make the model more interpretable by allowing a mapping between learned internal representations and interpretable attributes. A post-hoc study correlates these attributes with linguistic traits that can identify a speaker. Transposing this idea to language detection, our study focuses on designing a method to model language-specific attributes that are not only discriminative but also linguistically and phonetically meaningful. The discovery of these attributes can be initiated either in an unsupervised fashion by the model itself, or guided through knowledge-injection constraints that relate the internal representations to desired linguistic features. These constraints could shape the distribution of learned attributes. For instance, when analyzing a set of languages, we might seek to extract attributes that are either highly distinctive or widely shared across languages, depending on the phonetic markers we are looking for. By enforcing a constraint that aligns the attributes with this target, we can drive the model to focus on specific and interpretable attributes. -
A3 Tobias Kalmbach
Always So Sure: Can LLM’s Confidence be Trusted?[Abstract]
Confidence estimation techniques are often used to better gauge the reliability of Large Language Model (LLM) answers. One such technique is verbalized confidence: a prompting setup that produces confidence scores alongside the actual answers, though the mechanisms behind these self-reported confidence values remain poorly understood. This paper presents a comprehensive analysis of verbalized confidence across multiple datasets spanning factual questions, multiple-choice QA, and causal reasoning using four different LLMs. Our investigation reveals that verbalized confidence scores are highly quantized, clustering around specific values (e.g., 0, 90, 100) with minimal differentiation between correct and incorrect answers. Through causal mediation analysis and targeted input perturbations, we demonstrate that confidence score generation is primarily influenced by structural prompt elements like the word “confidence” and the specified scale range rather than the actual question’s content. These findings raise significant concerns about the reliability of verbalized confidence as a self-evaluation mechanism in LLMs.
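To make the setup concrete, here is a minimal sketch of a verbalized-confidence harness; the prompt wording, output format, and bucket values are illustrative assumptions, not the paper’s exact protocol.

```python
import re

# Hypothetical prompt template eliciting an answer plus a self-reported
# confidence score; the study's actual phrasing may differ.
PROMPT = (
    "Question: {question}\n"
    "Answer the question, then state your confidence as an integer from 0 to 100.\n"
    "Format: Answer: <answer> | Confidence: <number>"
)

def parse_answer_and_confidence(completion: str):
    """Extract (answer, confidence) from a formatted completion."""
    m = re.search(r"Answer:\s*(.*?)\s*\|\s*Confidence:\s*(\d{1,3})", completion)
    if m is None:
        return None, None
    return m.group(1), int(m.group(2))

def quantization_rate(confidences, buckets=(0, 90, 100)):
    """Fraction of scores landing exactly on a few round values,
    the clustering effect reported in the abstract."""
    return sum(c in buckets for c in confidences) / len(confidences)
```

-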
A4 Sarah Bouaraba
Clinically-Conditioned Synthetic Generation of French Breast Radiotherapy Reports: A Multi-LLM Study[Abstract]
Clinical NLP for French faces a critical data bottleneck. Unlike English, which benefits from large-scale, publicly available resources such as MIMIC-III [Johnson et al., 2016] and the i2b2/n2c2 shared tasks, French clinical corpora remain exceptionally scarce. Existing resources — including QUAERO [Névéol et al., 2014], CAS [Grabar et al., 2018], and selected CLEF eHealth subsets — cover relatively narrow domains and provide limited annotated data. This scarcity is largely driven by regulatory constraints: under GDPR Article 9, health records are classified as sensitive personal data, and French legislation requires HDS (Hébergeur de Données de Santé) certification for any storage or processing of patient data. As a result, access to real-world clinical records remains highly restricted, even within academic research settings. Synthetic data generation has therefore emerged as a promising strategy to mitigate this structural limitation. We introduce a controlled pipeline for generating synthetic breast radiotherapy end-of-treatment reports (comptes rendus de fin de radiothérapie mammaire) using locally deployed, instruction-tuned LLMs, without domain-specific fine-tuning. This design preserves patient privacy by construction while ensuring experimental reproducibility. Each report is generated under explicit clinical conditioning across predefined variables. Clinical plausibility and textual quality are assessed using lexical richness, inter-report diversity, and fluency metrics. In addition, reports undergo expert clinical review, and we are currently exploring evaluation via an LLM-as-a-judge framework to assess structural completeness and medical coherence. Preliminary experiments indicate encouraging levels of structural compliance and lexical variability. A larger-scale benchmark is currently underway. The resulting synthetic corpus is intended to support downstream clinical NLP applications in French, including named entity recognition (NER) and information extraction (IE). -
A5 Nickil Maveli
Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility[Abstract]
LLMs demonstrate strong performance on code benchmarks, yet round-trip code execution reveals limitations in their ability to maintain consistent reasoning across forward and backward execution. We present RoundTripCodeEval (RTCE), a comprehensive benchmark consisting of four distinct code execution reasoning tasks designed to rigorously test round-trip consistency. RTCE provides an execution-free, exact-match evaluation of bijection fidelity, assessing whether models preserve a consistent one-to-one mapping between encoding and decoding operations across various algorithms and directions. We systematically evaluate state-of-the-art Code-LLMs using zero-shot prompting, supervised fine-tuning on execution traces, and self-reflection mechanisms. Each yields modest improvements, but none closes the gap, indicating that current LLMs struggle with true round-trip consistency, which demonstrates that they lack the internal coherence required for trustworthy code reasoning. RTCE surfaces several new and previously unmeasured insights that are not captured by existing I/O-prediction, execution-reasoning, or round-trip natural-language benchmarks.
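The round-trip (encode/decode) consistency criterion can be illustrated with a toy harness; the run-length codec and exact-match check below are stand-ins for the benchmark’s actual algorithms and LLM calls.

```python
import re

def rle_encode(s: str) -> str:
    """Run-length encode, e.g. 'aaab' -> 'a3b1'."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(f"{s[i]}{j - i}")
        i = j
    return "".join(out)

def rle_decode(s: str) -> str:
    return "".join(ch * int(n) for ch, n in re.findall(r"(\D)(\d+)", s))

def round_trip_consistent(encode, decode, x: str) -> bool:
    """Exact-match bijection test: decode(encode(x)) must reproduce x.
    In the benchmark, encode/decode would be model-predicted executions."""
    return decode(encode(x)) == x

assert round_trip_consistent(rle_encode, rle_decode, "aaabbbbcc")
```

-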
A7 Kristýna Onderková
Which Logic do LLMs Use?[Abstract]
The SemEval-2026 shared task 11 investigates how content interacts with formal reasoning in large language models (LLMs), by testing the validity of Aristotelian syllogisms that either align with or oppose commonsense knowledge. Our main system submission uses an LLM translation to a first-order logic (FOL) prover syntax. However, error analysis shows that simple adjustments to modern logic rules are not enough to reconcile Aristotelian and modern logic. This poster investigates how language models can follow different logical frameworks. By comparing the FOL-based prover, a heuristic Aristotelian pipeline, and zero-shot prompting, we test whether models genuinely follow the specified rules or default to unstated patterns learned during training. -
A8 Jeremias Bohn
Adaptive Base Logarithmic Quantisation[Abstract]
In recent years, large language models have grown significantly in size, making it difficult to run them for inference on consumer hardware, since the growth of GPU memory sizes has stagnated. A common approach is to quantise the model’s weights and/or activations, thus reducing the memory requirements significantly. While most approaches resort to a linear quantisation codebook, this is not optimal, since precision is lost in high-density regions of the weight distributions. Instead, we propose a logarithmic quantisation codebook with variable bases, which shows superior downstream task performance and perplexity compared to standard linear approaches, in particular in low-bitwidth scenarios.
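A minimal sketch of the idea, assuming a symmetric signed codebook and a per-tensor base search; both are illustrative assumptions rather than the method’s exact design.

```python
import numpy as np

def log_codebook(base: float, n_bits: int, max_val: float) -> np.ndarray:
    """Signed logarithmic codebook: levels crowd near zero, where weight
    density is highest, instead of being spaced linearly."""
    n_pos = 2 ** (n_bits - 1) - 1
    mags = max_val * base ** -np.arange(n_pos)   # max_val, max_val/b, max_val/b^2, ...
    return np.sort(np.concatenate([-mags, [0.0], mags]))

def quantise(w: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each weight to its nearest codebook level (w is a flat array)."""
    idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx]

def best_base(w: np.ndarray, n_bits: int, bases=np.linspace(1.5, 4.0, 26)):
    """'Adaptive base': pick the base minimising reconstruction error
    for this particular weight tensor."""
    errs = [np.mean((w - quantise(w, log_codebook(b, n_bits, np.abs(w).max()))) ** 2)
            for b in bases]
    return float(bases[int(np.argmin(errs))])
```

-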
A9 Ikram Belmadani
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA[Abstract]
Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings. -
A10 Gabriel Oliveira Dos Santos
What do vision-language models see (or not) in the context? Investigating multimodal in-context learning[Abstract]
In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on image–text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples. -
A11 Elke Vandermeerschen
SEED: Self-Explanation-Enhanced Distillation for Common Sense Reasoning in SLMs[Abstract]
Improving the reasoning capabilities of small language models (SLMs) remains challenging, and existing approaches typically rely on large external teacher models or annotated explanation datasets. Such methods are costly, limit scalability, and often fail to capture reasoning quality directly. We introduce SEED (Self-Explanation-Enhanced Distillation for Common Sense Reasoning in SLMs), a self-improvement framework in which an SLM iteratively learns from its own generated natural language explanations. At each iteration, the model produces explanations alongside answers, after which candidate reasoning traces are filtered and reused as supervision for fine-tuning. Unlike prior self-training approaches that rely on consistency or confidence as proxies for correctness—assumptions that are often violated in poorly calibrated SLMs—SEED employs joint multi-signal filtering to assess explanation quality. Specifically, we combine epistemic signals (logit-based evidence strength, sequence-level entropy), semantic signals (natural language inference consistency), and robustness signals (Contrastive Explanation Invariance, CEI) to retain only high-quality reasoning trajectories. Within SEED, explanations play multiple functional roles: they act as intermediate reasoning representations guiding prediction, as supervision targets during self-distillation, and as structured data for iterative training set construction. This unified use of explanations enables the model to bootstrap its reasoning capabilities from its own generated signals, without external supervision. -
A12 Doria Bonzi
CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field[Abstract]
Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal. -
A13 Deborah Dore
Leveraging Graph Structural Knowledge for Argument Relation Prediction in Political Debates[Abstract]
Argument Mining (AM) aims to detect argument structures in text, including premises, claims, and their support or attack relations. Political debates are a key application domain, as analyzing politicians’ argumentation strategies can help identify fallacious or propagandist arguments. However, predicting relations between argument components remains challenging. Most existing approaches rely only on textual content and ignore structural information from the overall argument graph. In this paper, we address relation prediction by combining structural knowledge from a Knowledge Graph Embedding model with contextual knowledge from a fine-tuned Language Model. Experiments on a benchmark of US presidential debates (1960–2020) show that integrating textual and structural knowledge improves prediction accuracy over existing methods. -
A14 Anastasiia Vozniuk
Broken Benchmarks, Wrong Questions: A Critical View of AI-Text Detection[Abstract]
Detection methods for AI-generated text routinely report accuracy above 95%, yet collapse under real-world perturbations such as paraphrasing or stylistic shifts. We show that this gap is partly explained by systematic quality problems in benchmark datasets, where surface artifacts make detection artificially easy. But beyond the technical failures lies a deeper issue: even a perfect detector would answer the wrong question. AI origin is a weak proxy for what actually motivates concern: is a text truthful, credible, and informative? We propose shifting the research agenda from detecting AI authorship toward directly assessing these content properties, which better address the underlying risks of AI-generated text. -
A15 Alla Chepurova
Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models[Abstract]
Knowledge graphs (KGs) provide structured, verifiable grounding for large language models (LLMs), but current LLM-based systems commonly use KGs as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. In this work, we propose Wikontic, a multi-stage pipeline that constructs KGs from open-domain text by extracting candidate triplets with qualifiers, enforcing Wikidata-based type and relation constraints, and normalizing entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well-connected; on MuSiQue, the correct answer entity appears in 96% of generated triplets. On HotpotQA, our triplets-only setup achieves 76.0 F1, and on MuSiQue 59.8 F1, matching or surpassing several retrieval-augmented generation baselines that still require textual context. In addition, Wikontic attains state-of-the-art information-retention performance on the MINE-1 benchmark (86%), outperforming prior KG construction methods. Wikontic is also efficient at build time: KG construction uses less than 1,000 output tokens, about 3× fewer than AriGraph and <1/20 of GraphRAG. The proposed pipeline enhances the quality of the generated KG and offers a scalable solution for leveraging structured knowledge in LLMs. -
A16 Alejandra Lorenzo
Privacy-Preserving Generation of Synthetic Pathology Reports for Information Extraction[Abstract]
A long-standing goal of the clinical NLP community is to extract relevant clinical variables from clinical text. However, progress has been limited by distribution shift from the general domain, the scarcity of publicly available annotated clinical data, and privacy constraints. We propose a privacy-preserving method to generate synthetic data that simulate the information extraction task by associating LLM-generated pathology reports with thirteen variables commonly found in real reports for breast cancer patients. First, we generate synthetic tabular data for these variables and their possible values, comparing several tabular synthesizers and selecting PATE-CTGAN for its strong statistical fidelity and differential privacy guarantees. Second, we generate pathology reports using three different LLMs to maximize linguistic diversity and conditioning generation on synthetic variable–value sets. We create synthetic report/data pairs on which we fine-tune Mistral-7B-Instruct with LoRA-based supervised training. When evaluated on a manually validated benchmark of 377 real pathology reports and their associated variable-value pairs, the fine-tuned model substantially outperforms Mistral-7B-Instruct. These results show that high-quality synthetic data can effectively compensate for limited annotated clinical data while enabling accurate and privacy-preserving clinical information extraction. -
A17 Huy Hoang Ha
(tba)[Abstract]
(tba) -
A18 Emilio Raimond
(tba)[Abstract]
(tba) -
A19 Angelo Basile
PyRater: A Python Toolkit for Annotation Analysis[Abstract]
In this work, we build PyRater, an open-source Python library, to address the lack of accessible tools for probabilistic annotation analysis in NLP. Probabilistic models of annotation can jointly estimate gold standard labels, annotator reliability, and item difficulty, outperforming majority voting and standard agreement metrics like Cohen’s Kappa. Despite these advantages, they remain underused in the field, in part due to the absence of user-friendly implementations. PyRater provides a unified interface for several such models, along with built-in dataset readers, visualisation tools, and an API for adding new models. We also present a novel application of these models to zero-shot prompt selection, where labeled development data is unavailable. In a zero-shot setting, prompt choice can significantly affect model performance, yet there is no straightforward way to identify the best prompt without supervision. By treating different prompts as repeated annotations over the same instances, PyRater can rank prompts and predict true labels, effectively acting as an unsupervised ensemble. We validate this on sentiment analysis datasets, where the probabilistic approach outperforms both majority voting and Kappa-based ranking.
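The model family PyRater implements can be illustrated with a bare-bones Dawid-Skene-style EM loop; this is a generic sketch of such probabilistic annotation models, not PyRater’s actual API.

```python
import numpy as np

def dawid_skene(labels: np.ndarray, n_classes: int, n_iter: int = 50):
    """labels: (n_items, n_annotators) int matrix, -1 marks missing votes.
    Returns a posterior over true labels and per-annotator confusion matrices."""
    n_items, n_ann = labels.shape
    post = np.ones((n_items, n_classes)) / n_classes
    for i in range(n_items):                      # init with majority proportions
        obs = labels[i][labels[i] >= 0]
        if len(obs):
            post[i] = np.bincount(obs, minlength=n_classes) / len(obs)
    for _ in range(n_iter):
        conf = np.full((n_ann, n_classes, n_classes), 1e-6)
        for a in range(n_ann):                    # M-step: confusion matrices
            for i in range(n_items):
                if labels[i, a] >= 0:
                    conf[a, :, labels[i, a]] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)
        prior = post.mean(axis=0)
        for i in range(n_items):                  # E-step: label posteriors
            p = prior.copy()
            for a in range(n_ann):
                if labels[i, a] >= 0:
                    p *= conf[a, :, labels[i, a]]
            post[i] = p / p.sum()
    return post, conf
```

Treating each prompt’s zero-shot predictions as one annotator column in `labels` yields the unsupervised prompt-ranking use case described above: estimated annotator reliability becomes a supervision-free prompt score.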
Session 2
-
B1 Zihao Li
Test-Time Scaling of Reasoning Models for Machine Translation[Abstract]
Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model’s reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications like multi-step, self-correction workflows and in conjunction with task-specialized models. -
B2 Xinhao Zhang
What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search[Abstract]
Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary optimization, collecting optimization trajectories for 15 LLMs across 8 optimization problems. While base problem-solving ability, measured via zero-shot performance, correlates with final optimization outcomes, it explains only part of the variance: models with similar zero-shot capability often induce dramatically different search trajectories and final performance. To explain this gap, we analyze breakthrough dynamics and the geometry of optimization trajectories in the semantic space of candidate solutions. We find that effective LLM optimizers behave as strong local refiners, progressively localizing their search while producing frequent, incremental improvements across generations. In contrast, weaker optimizers exhibit large semantic drift, with occasional large breakthroughs followed by prolonged stagnation, reminiscent of behavior observed in classical metaheuristics. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory-level evaluation for understanding and improving LLM-based agentic optimization systems, and provide actionable insights for future work on learning to search. -
B3 Thomas Palmeira Ferraz
Latent Reasoning in LLMs: Revisiting the Efficiency-Interpretability Trade-off[Abstract]
We investigate the emerging field of latent reasoning in large language models and discuss its relevance for reasoning and planning. The starting point is a key trade-off: Explicit chain-of-thought reasoning offers a visible reasoning trace, but it is slow, token-intensive, and may fail to faithfully reflect the model’s true internal computation; Latent reasoning, by contrast, moves the reasoning process into hidden states, potentially enabling more efficient computation and broader exploration, at the cost of transparency. We review recent work on latent reasoning with a particular focus on two emerging lines of research: latent sequential reasoning (L-SEQ), where models produce sequences of latent intermediate states, and latent looped reasoning (L-LOOP), where reasoning is deepened through iterative internal computation. The goal is to better understand how these methods are compared to textual chain-of-thought, what kind of efficiency gains these approaches promise, what kind of tasks they appear to support, and how interpretability tools might help uncover more structured or symbolic forms of latent computation. We compare current evidence, and discuss open problems and next research directions. -
B5 Nazanin Shafiabadi
Biases in Translation: Assessing Opinion Distortion in Machine Translated Texts[Abstract]
Current machine translation (MT) evaluation practices largely assume that high lexical and semantic fidelity implies preservation of meaning. We question this assumption by introducing a framework for detecting and quantifying translation-induced distortion—the systematic alteration of a text’s subjective properties during translation. Focusing on stance as a socially consequential property, we formalize stance preservation as an invariance problem and adapt two classical statistical tests, McNemar’s test and the two-proportion Z-test, to diagnose systematic opinion shifts between source texts and their translations. Unlike standard MT metrics such as BLEU or COMET, which prioritize surface similarity and adequacy, our approach explicitly targets preservation of subjective meaning. In controlled experiments with synthetically distorted translations, we demonstrate that the proposed tests are sensitive to graded levels of stance manipulation. We apply our framework to evaluate twelve multilingual models and find that none reliably preserve stance across all tested language directions. Our findings reveal a critical gap in current MT evaluation practices and highlight the need for explicit evaluation of subjective meaning preservation in socially and politically sensitive contexts.
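The McNemar-style diagnosis can be made concrete with a small sketch; the binary stance labels and the exact (binomial) form of the test are assumptions for illustration.

```python
from scipy.stats import binomtest

def mcnemar_stance_shift(src_stance, tgt_stance):
    """Exact McNemar test on paired binary stance labels (1 = positive,
    0 = negative) for source texts and their translations. A small
    p-value signals a systematic, directional opinion shift rather
    than symmetric noise."""
    b = sum(s == 1 and t == 0 for s, t in zip(src_stance, tgt_stance))  # pos -> neg
    c = sum(s == 0 and t == 1 for s, t in zip(src_stance, tgt_stance))  # neg -> pos
    if b + c == 0:
        return 1.0  # no discordant pairs, hence no evidence of a shift
    return binomtest(b, n=b + c, p=0.5).pvalue
```

-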
B6 Marie Dewulf
Webcare through the Eyes of the Bystander: A cross-linguistic comparison of pragmatic-rhetorical features in hotel review-response interactions[Abstract]
Webcare, as a manifestation of digital reputation management, has become ubiquitous within the tourism industry. The significance of this online customer service communication, accessible to all, cannot be overstated. It demonstrates a commitment to guest satisfaction, thereby positively influencing the hotel’s image. Although recent studies suggest that guest reviews and hotel responses are influenced by cultural factors, cross-cultural analyses of hotel interactions remain scarce in terms of the languages and cultures investigated. Therefore, the objective of this project is to conduct a cross-linguistic study of a multilingual corpus consisting of approximately 50,000 hotel reviews and their corresponding responses in German, French, English (UK/US), Italian, Dutch, and Spanish (ES/MX). Specifically, this project aims to explore the cross-linguistic characteristics of hotel interactions in L1. We will use NLP techniques, such as sentiment analysis, to obtain a quantitative overview of the corpus across the 8 different cultures. The knowledge gained from this research will present opportunities for the development of generative AI systems that can automatically craft responses tailored to the linguistic and cultural context. -
B7 Karima Kadaoui
All For One: A Multilinguality Quest to Assist Low Resource Sign Languages[Abstract]
[Work in progress] Sign Language Recognition efforts are hindered by a lack of data. The need for data is especially dire when learning from Sign Language videos, given the inter- and intra-signer variability and the presence of linguistically irrelevant information, which make it harder to generalize. To alleviate this issue, we take inspiration from the Speech domain and explore the use of SignWriting similarly to phones, i.e. as a discrete and language-agnostic intermediate representation, in an effort to reduce the complexity of the input video modality. Doing this would allow us to make use of higher-resource sign languages (e.g. American Sign Language) to improve performance on lower-resource ones (e.g. Emirati Sign Language). -
B8 Jason Chan
Explanation Generation for Reconciling Contradictions with LLMs[Abstract]
Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human intelligence is the ability to hypothesise explanations to reconcile apparently contradictory observations. Despite growing research into LLMs’ reasoning capabilities, their ability to generate such reconciliatory explanations remains underexplored. We address this gap by introducing a novel task, repurposing existing natural language inference datasets, and proposing metrics that enable scalable automatic evaluation. Our experiments show that, even with extended test-time compute, most LLMs struggle to generate successful explanations in reconciling contradictions, highlighting the need for future work to address this limitation. -
B10 Filippo Tonini
Super-additive Cooperation in Language Model Agents[Abstract]
With the prospect of autonomous artificial intelligence (AI) agents, studying their tendency for cooperative behavior becomes an increasingly relevant topic. This study is inspired by the super-additive cooperation theory, where the combined effects of repeated interactions and inter-group rivalry have been argued to be the cause for cooperative tendencies found in humans. We devised a virtual tournament where language model agents, grouped into teams, face each other in a Prisoner’s Dilemma game. By simulating both internal team dynamics and external competition, we discovered that this blend substantially boosts both overall and initial, one-shot cooperation levels (the tendency to cooperate in one-off interactions). This research provides a novel framework for large language models to strategize and act in complex social scenarios and offers evidence for how intergroup competition can, counter-intuitively, result in more cooperative behavior. These insights are crucial for designing future multi-agent AI systems that can effectively work together and better align with human values.
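A skeleton of such a tournament is sketched below; the payoff values, match length, and team structure are standard or illustrative choices, and each agent function stands in for an LLM call.

```python
import itertools

# Standard Prisoner's Dilemma payoffs (temptation > reward > punishment > sucker).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play_match(agent_a, agent_b, rounds=10):
    """Repeated PD; an agent maps the opponent's visible history to 'C'/'D'."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = agent_a(hist_b), agent_b(hist_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

def tournament(teams):
    """Inter-group rivalry: every agent meets every agent of every other
    team, and scores aggregate at the team level."""
    totals = {name: 0 for name in teams}
    for (na, ta), (nb, tb) in itertools.combinations(teams.items(), 2):
        for agent_a, agent_b in itertools.product(ta, tb):
            sa, sb = play_match(agent_a, agent_b)
            totals[na] += sa
            totals[nb] += sb
    return totals

tit_for_tat = lambda opp_hist: "C" if not opp_hist else opp_hist[-1]
defector = lambda opp_hist: "D"
print(tournament({"team_a": [tit_for_tat] * 2, "team_b": [defector] * 2}))
```

-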
B11 Eleni Gkovedarou
ReGender: Gender-Fair Rewriter for English-to-Greek Machine Translation[Abstract]
The use of gender-fair language can lead to a more inclusive society, yet machine translation (MT) systems frequently reproduce and amplify gender bias. Some of this bias is due to inherent ambiguities in the source: English largely lacks grammatical gender marking, whereas Greek requires morphological and semantic gender specifications, forcing MT systems to resolve ambiguity in ways that default to gendered (and often biased) outputs. This research explores gender-fair rewriting as a strategy for bias mitigation for English-to-Greek MT, a language pair that remains highly understudied. We propose a twofold approach: ReGender, a system that first detects gender ambiguity in the English source text and then generates a set of gender-fair Greek translations for the ambiguous cases, including gendered, gender-neutral, and gender-inclusive forms. Through a human-centered design, the project combines NLP methods with community-informed gender-fair language practices that go beyond the gender binary. The resulting model will take the form of a plug-in that can be integrated into existing MT systems, enabling users to make informed translation choices while promoting the broader goal of inclusive language technologies. -
B12 Dominik Seip
Preference Redirection via Attention Concentration: An Attack on Multimodal Agents[Abstract]
Advancements in multimodal foundation models have enabled the development of Computer Use Agents (CUAs) capable of autonomously interacting with GUI environments. As CUAs are not restricted to certain tools, they make it possible to automate more complex agentic tasks, but at the same time open up new security vulnerabilities. While prior work has concentrated on the language modality, the vulnerability of the vision modality has received less attention. In this work, we introduce PRAC, a novel attack that, unlike prior methods targeting the VLM output directly, manipulates the model’s internal preferences by redirecting its attention toward a stealthy adversarial patch. We show that PRAC is able to manipulate the selection process of a CUA on an online shopping platform towards a chosen target product. While we require white-box access to the model for the creation of the attack, we show that our attack generalizes to fine-tuned versions of the same model, presenting a critical threat as multiple companies build specific CUAs based on open-weights models. -
B13 Daria Galimzianova
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA[Abstract]
Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions — whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o’s retrieval behavior. -
B14 Amirbek Djanibekov
SPIRIT: Patching Speech Language Models against Jailbreak Attacks[Abstract]
Speech Language Models (SLMs) enable natural interactions via spoken instructions, which more effectively capture user intent by detecting nuances in speech. The richer speech signal introduces new security risks compared to text-based models, as adversaries can better bypass safety mechanisms by injecting imperceptible noise into speech. We analyze adversarial attacks and find that SLMs are substantially more vulnerable to jailbreak attacks, which can achieve a perfect 100% attack success rate in some instances. To improve security, we propose post-hoc patching defenses that intervene during inference by modifying the SLM’s activations, improving robustness by up to 99% with (i) negligible impact on utility and (ii) no re-training. We conduct ablation studies to maximize the efficacy of our defenses and improve the utility/security trade-off, validated with large-scale benchmarks unique to SLMs. -
B15 Alexander Shabalin
Cosmos: Compressed and Smooth Latent Space for Text Diffusion Modeling[Abstract]
Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by 8× while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks, including story generation, question generation, summarization, and detoxification, and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than 2× faster inference. -
B16 Abdelrahman Sadallah
The Good, the Bad and the Constructive: Automatically Measuring Peer Review’s Utility for Authors[Abstract]
Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects. -
B17 Gaganpreet Jhajj
An Empirical Study of Many-Shot In-Context Learning for Machine Translation of Low-Resource Languages[Abstract]
In-context learning (ICL) enables LLMs to adapt to new tasks from a few examples, making it attractive for low-resource languages. Recent many-shot ICL work shows that larger context windows can further boost performance, but gains depend on example selection, and inference costs can be prohibitive. We present an empirical study of many-shot ICL for English-to-low-resource machine translation across ten languages recently added to FLORES+, examining retrieval-based example selection, out-of-domain data, and length-based ordering. Our results show that many-shot ICL improves with more examples, and that BM25-based retrieval greatly enhances data efficiency: 50 retrieved examples roughly match 250 random ones, while 250 retrieved examples rival 1,000 random ones.
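The retrieval-based selection can be sketched with the rank_bm25 package; the corpus handling and prompt format below are simplified assumptions.

```python
from rank_bm25 import BM25Okapi

def select_icl_examples(pool, query, k=50):
    """pool: list of (source, target) translation pairs. Rank the pool by
    BM25 similarity between the query and each source side, and keep the
    top-k pairs as in-context examples."""
    bm25 = BM25Okapi([src.lower().split() for src, _ in pool])
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]

def build_prompt(examples, query):
    shots = "\n".join(f"English: {s}\nTranslation: {t}" for s, t in examples)
    return f"{shots}\nEnglish: {query}\nTranslation:"
```

-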
B18 Briag Rehel
RAG Controller[Abstract]
Retrieval-Augmented Generation (RAG) has become the go-to architecture in industry applications. Many variants and components have been developed, making the selection of an effective configuration a challenging optimization problem in low-data industry use cases. We hypothesize that adapting RAG configurations at the query level, rather than at the use-case level, improves transferability. To this end, we train a controller using offline reinforcement learning to jointly control four RAG inference parameters. -
B19 Laurits Lyngbaek
Statistical Probes of Multilingual Embeddings Fail to Generalize Across Language Learner Corpora[Abstract]
Session 3
-
C1 Yusser Al Ghussin
Steering Multilingual Models Towards Cultural Knowledge[Abstract]
Prior work provides mechanistic evidence that multilingual LLMs encode cultural information in representations that overlap and interact with language-specific components (Namazifard and Poech, 2025), suggesting that intervening on language-aligned directions may also modulate culturally relevant behavior. Motivated by this, our system uses activation steering: instead of optimizing model parameters through fine-tuning, we modify internal activations at inference time using steering vectors (Rimsky et al., 2024). Concretely, we extract language steering vectors and inject them into the residual stream of multilingual LLMs during generation. We build on evidence that language identity is encoded as a stable direction in activation space (Marks and Tegmark, 2023), and hypothesize that steering along such directions can improve access to culturally relevant knowledge.
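A bare-bones version of the extraction-and-injection loop, assuming a HuggingFace-style causal LM; the layer index, scale, and contrastive prompt sets are hypothetical choices, not the system’s tuned values.

```python
import torch

@torch.no_grad()
def language_steering_vector(model, tok, prompts_lang_a, prompts_lang_b, layer):
    """Mean difference of residual-stream activations between prompts in
    two languages, taken at one layer (last-token position)."""
    def mean_act(prompts):
        acts = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            out = model(ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer][0, -1])
        return torch.stack(acts).mean(dim=0)
    return mean_act(prompts_lang_a) - mean_act(prompts_lang_b)

def add_steering_hook(layer_module, vector, scale=4.0):
    """Add the steering vector into the residual stream during generation."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)  # call .remove() to undo
```

-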
C2 Trung Hieu Ngo
Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health[Abstract]
Large Language Models (LLMs) excel in Natural Language Processing (NLP) tasks, but they often propagate biases embedded in their training data, which is potentially impactful in sensitive domains like healthcare. While existing benchmarks evaluate biases related to individual social determinants of health (SDoH) such as gender or ethnicity, they often overlook interactions between these factors and lack context-specific assessments. This study investigates bias in LLMs by probing the relationships between gender and other SDoH in French patient records. Through a series of experiments, we found that embedded stereotypes can be probed using SDoH input and that LLMs rely on embedded stereotypes to make gendered decisions, suggesting that evaluating interactions among SDoH factors could usefully complement existing approaches to assessing LLM performance and bias. -
C3 Tenney Hu
DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking[Abstract]
Existing RAG systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information-seeking scenarios with multiple plausible answers, where diversity is essential to avoid collapsing to a single dominant response, thereby constraining creativity and compromising fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug-and-play agentic RAG framework with novel reflection-guided generation and memory-augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity-quality trade-off in open-ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity-quality trade-off compared to competitive baselines and previous state-of-the-art methods on the real-world Infinity-Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM-based systems for open-ended information-seeking and show that explicitly modeling diversity can mitigate it. -
C4 Patrícia Schmidtová
How Important is ‘Perfect’ English for Machine Translation Prompts?[Abstract]
Large language models (LLMs) show state-of-the-art performance in machine translation, but are also known to be sensitive to errors in user prompts. Given that these models are largely trained on and respond best to prompts in standard English, this may affect the quality of LLM outputs for second language English speakers as well as real-world lay users, with potentially disproportionate effects on the former. We explore this effect by modeling a range of error types exhibited by such users, motivated by studies of L2 English, and quantifying their impact on LLM performance. We work with two related tasks: machine translation and machine translation evaluation. We find that LLMs-as-MT are brittle to natural spelling errors but not to phrasal simplifications. However, the quality drop caused by these errors is lower than the variance over the initial prompt choice, suggesting that “perfect English” for a given prompt is less important than choosing a good prompt. Since lay users and L2 speakers may use non-optimal prompts as well as display imperfect language skills, our work calls for increasing the resilience of model performance to both these phenomena, in order to best serve a diverse user base, both from a robustness and fairness perspective. -
C5 Michelle Elizabeth
Conversational Grounding in LLMs: Evaluation Methods, Challenges and Future Directions[Abstract]
Conversational grounding is the collaborative process through which speakers establish and maintain mutual understanding. It is essential for the success of a dialogue. While it is an inherent property of human conversations, it remains a challenge for instruction-following Large Language Models (LLMs). This paper surveys how conversational grounding, from a psycholinguistic perspective, is evaluated in task-oriented dialogue in the current era of LLMs. The literature is organised based on how grounding is modelled – dialogue-act-based methods and approaches that model participants’ mental states. We also review collaborative tasks that enable the evaluation of grounding at the conversation level based on outcomes. We highlight the limitations of current metrics and outline research directions in grounding evaluation. -
C7 Joanna Radola
Plus d’une langue! Language Identification for Code-Switched Utterances[Abstract]
Automatic identification of Code-Switched (CS) utterances remains a challenge for language identification (LID) systems, causing such texts to be underrepresented in the training data of Large Language Models. In this paper, we revisit MaskLID, a state-of-the-art approach for CS identification, which requires no training and detects arbitrary language combinations. We make three main contributions: (a) we reformulate the underlying algorithm as an Integer Linear Program, with clear and interpretable constraints; (b) our experiments with 10 languages match MaskLID’s results and highlight a major issue with the underlying LID; (c) using an improved model delivering better word-level scores, we achieve results that outperform MaskLID both on monolingual and CS datasets.
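The ILP view can be sketched with PuLP; the coverage constraint and minimal-set objective below are a paraphrase of the general idea, not the paper’s exact program.

```python
import pulp

def code_switch_ilp(word_scores, threshold=0.5):
    """word_scores: one dict per word mapping language -> word-level LID
    score. Select the smallest set of languages such that every word is
    covered by at least one selected language scoring above threshold."""
    langs = sorted({lang for ws in word_scores for lang in ws})
    prob = pulp.LpProblem("cs_lid", pulp.LpMinimize)
    use = {l: pulp.LpVariable(f"use_{l}", cat="Binary") for l in langs}
    prob += pulp.lpSum(use.values())          # prefer as few languages as possible
    for i, ws in enumerate(word_scores):
        covering = [use[l] for l, s in ws.items() if s >= threshold]
        if covering:                          # each word needs an explanation
            prob += pulp.lpSum(covering) >= 1, f"cover_word_{i}"
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [l for l in langs if use[l].value() == 1]
```

-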
C8 Iuliia Belikova
Detecting Overflow in Compressed Token Representations for RAG[Abstract]
Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility — and when compression begins to erase task-relevant content — remain underexplored. In this paper, we define “token overflow” as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In xRAG soft-compression setting, query-agnostic saturation statistics effectively identify compressed tokens but show limited capability in detecting overflow. Conversely, lightweight query-aware probing classifiers successfully detect overflow across multiple standard QA datasets. This advancement toward query-aware detection enables efficient pre-LLM gating to mitigate compression-induced errors. -
C9 Hawau Olamide Toyin
A summary of efforts towards building truly adaptable speech technology for stuttered speech.[Abstract]
[Work in progress] Speech technologies often underperform for people who stutter due to limited atypical-speech data, scarce expert annotation, and a lack of alignment between stakeholder needs and research objectives. This poster summarizes an ongoing effort toward multimodal stuttering severity assessment. We take a multi-pronged approach: (1) a stakeholder-focused study combining interviews, questionnaire surveys (70 respondents), and a literature review (200+ papers) to identify gaps between what people who stutter (PWS) and speech-language pathologists (SLPs) need and what current systems optimize for; (2) benchmarking ASR for atypical speech under two practical transcription objectives—verbatim versus intended transcripts—to highlight how model behavior changes depending on whether disfluencies are preserved or normalized; and (3) an ongoing multilingual, multimodal data collection effort in collaboration with clinical partners that includes SLP-provided clinical annotations. -
C10 Erwan Fagnou
Chain and Causal Attention for Efficient Entity Tracking[Abstract]
This paper investigates the limitations of transformers for entity-tracking tasks in large language models. We identify a theoretical constraint, showing that transformers require at least log2(n+1) layers to handle entity tracking with n state changes. To address this issue, we propose an efficient and frugal enhancement to the standard attention mechanism, enabling it to manage long-term dependencies more efficiently. By considering attention as an adjacency matrix, our model can track entity states with a single layer. Empirical results demonstrate significant improvements in entity tracking datasets while keeping competitive performance on standard natural language modeling. Our modified attention allows us to achieve the same performance with drastically fewer layers. Additionally, our enhanced mechanism reveals structured internal representations of attention. Extensive experiments on both toy and complex datasets validate our approach. Our contributions include theoretical insights, an improved attention mechanism, and empirical validation.
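A toy numeric check of the stated depth bound; this only illustrates the counting argument, under the assumption that each layer can at best compose two already-resolved spans of the state-change chain.

```python
import math

def reachable_after(layers: int, n_changes: int) -> bool:
    """Can `layers` composition steps resolve a chain of n state changes?
    Each step at most doubles the resolvable chain length."""
    span = 1                      # chain length resolvable with 0 layers
    for _ in range(layers):
        span *= 2
    return span >= n_changes + 1

for n in (1, 3, 7, 15):
    depth = math.ceil(math.log2(n + 1))   # the log2(n+1) bound
    assert reachable_after(depth, n)
    assert not reachable_after(depth - 1, n)
```

-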
C11 Elena Golimblevskaia
WeightLens: Input-Independent Interpretability for LLM Transcoders[Abstract]
Existing automated interpretability methods for Large Language Models (LLMs) often infer feature meanings by analyzing activations with another LLM, but they suffer from high computational cost, dataset dependence, prompt sensitivity, and explainer bias. Transcoders provide a promising direction for automated interpretability, since their architecture allows separating input-dependent and input-invariant components of feature attributions. We investigate whether analyzing only the input-invariant component (weights) yields meaningful interpretations for token-based features, reducing reliance on external explainers, and develop the WeightLens framework. Experiments show that in the chosen setting, WeightLens performs comparably to, or even better than, activation-based methods, suggesting a low-cost complement to existing approaches. -
C12 Dina Pisarevskaya
Claim Matching with Instruction-following LLMs for Automated Fact-checking[Abstract]
Claim matching (CM) as a Natural Language Processing task can benefit an automated fact-checking pipeline by putting together claims that can be resolved with the same fact-check. We explore zero-shot and few-shot learning approaches to CM as a binary classification task and experiment with instruction-following LLMs, investigating prompt templates. CM can be tackled by leveraging more mature yet similar tasks such as natural language inference or paraphrase detection. We present a novel agent-based approach for CM and a two-step pipeline that first generates prompts with LLMs, to then perform claim matching as a binary classification task with LLMs. We demonstrate the effectiveness of using different LLMs for each step of the pipeline, i.e. using one LLM for prompt generation and another for claim matching. We reveal insights into the LLMs’ understanding and handling of the CM task. -
C13 Arnisa Fazla
How Certain Is Uncertainty? A Benchmark for Confidence, Calibration, and Failure Modes in Clinical VQA[Abstract]
Clinical vision-language model (VLM) evaluation often relies on evaluating accuracy on visual question-answering (VQA) datasets, yet real-world clinical use additionally requires reliable uncertainty estimation (UE) to identify cases requiring clinician review. We present the first comprehensive benchmark of eight post-hoc UE methods across thirteen VLMs spanning five model families. Using controlled “None of the Above” (NOTA) perturbations to the answer options, we show that replacing the correct answer with NOTA unexpectedly increases model confidence while degrading accuracy. Moreover, we further show that initial uncertainty estimates predict answer instability under NOTA perturbations, revealing a meaningful link between uncertainty and robustness to answer-space shifts. -
C14 Amanda Le
Causal Graph-Based Models for Lossless Explanations[Abstract]
Large Language Models offer significant potential for conversational data analytics across diverse domains. However, deploying them in critical areas is limited by the lack of trustworthy explanations that accurately reflect their internal computations. In this work, we propose a causal graph-based model to provide lossless explanations. This model leverages the theory of causal abstraction to capture both task- and instance-specific LLM behavior. Our approach constructs a compressed representation of a neural network by learning low-dimensional abstractions of internal activations alongside causally constrained transition mechanisms. This results in an explanatory graph that reveals traversable computational pathways and supports causal interventions. -
C15 Alex Jiang
Automatic detection of bot-generated content for disinformation purposes[Abstract]
Large language models (LLMs) such as GPT, Claude, LLaMA or Mistral have transformed text generation by producing artificial content that is credible, fluent and contextually relevant. While simplifying our daily tasks, the rise of LLMs leads to issues in many fields (e.g., academic research, education, fake news, social media). Current detection tools struggle to keep up with this rapid evolution, especially against new LLMs or domains different from those they were trained on. Often built to deal with simple generated texts, they lack robustness against basic evasive strategies such as paraphrasing or back-translation. -
C17 Filip Boltuzic
Automated Consolidation of Legal Amendments into Temporal Knowledge Graphs[Abstract]
We introduce an automated pipeline that transforms sequences of legal amendments into temporal knowledge graphs, preserving both structural dependencies and versioned states of legal norms. This representation supports precise reconstruction of the law at any point in time and facilitates graph-based retrieval and reasoning over evolving legal corpora. -
C18 AriaRay Brown
Listening for Speaker Identity in Multilingual Speech Models: How Phonetic Similarity Modulates Cross-Lingual Transfer[Abstract]
Multilingual self-supervised transformer speech models build shared representations to encode speech from multiple languages. While these models learn features and perform well in downstream tasks, it remains unclear (1) how such features are represented similarly or uniquely for all languages, and (2) what might account for any differences. This research investigates how speaker and phonetic information is represented in the multilingual speech models XLS-R and WavLabLM for ten phylogenetically-dispersed languages. Motivated by speech perception research linking human ease in speaker discriminability to phonetic similarity (the Language Familiarity Effect), this work explores whether phonetic similarity influences the success of cross-lingual transfer for speaker verification. Accordingly, we train probing classifiers on embedded utterance pairs in each language and test speaker verification on target languages. Experiments show that speaker information is represented similarly across model layers and languages, but cross-lingual performance varies widely. To examine this, we turn to phonetic insight. We introduce a new method to measure phonetic similarity by calculating the distance between phonetic profiles, based on distributions of contextualized phone embeddings per language, extracted from a distinct phonetic layer. Overall, we find that phonetic similarity tends to significantly improve the accuracy of cross-lingual speaker verification, while this correlation varies for individual test languages.
