Schedule

All times CET. The school runs from Sunday 30/03 (shuttle from Grenoble) to Friday 4/04 (return to Grenoble).

8-9     Breakfast (Mon-Thu); Fri: 8h00 Breakfast, 8h30 Project presentation
9-10    Mon: Welcome session | Tue: François Yvon | Wed: Dzmitry Bahdanau | Thu: Nature activities | Fri: Project presentations
10-11   Poster session 1; Dirk Hovy
12-13   Lunch (Mon-Fri)
13-14   Mon: Kyunghyun Cho | Tue: Nature activities | Wed: Poster session 2 | Thu: Project session | Fri: Shuttle back to Grenoble
15-16   Project session
16-17   Marzieh Fadaee; Benoît Favre; Fri: Arrival in Grenoble at 4:30pm
17-18   Sun: Shuttle departs from Grenoble at 5pm; Alexandra Birch
19-20   Dinner (Sun-Thu)
20-21   Poster session 3

Poster Sessions

Session 1

  • A1 Zain Muhammad Mujahid
    SAFARI: Cross-lingual Bias and Factuality Detection in News Media and News Articles
    [Abstract] In an era where information is quickly shared across many cultural and language contexts, the neutrality and integrity of news media are essential. Ensuring that media content remains unbiased and factual is crucial for maintaining public trust. With this in mind, we introduce SAFARI (CroSs-lingual BiAs and Factuality Detection in News MediA and News ARtIcles), a novel corpus of news media and articles for predicting political bias and the factuality of reporting in a multilingual and cross-lingual setup. To the best of our knowledge, this corpus is unprecedented in its collection and introduces a dataset for political bias and factuality for three tasks: (i) media-level, (ii) article-level, and (iii) joint modeling at the article-level. At the media and article levels, we evaluate the cross-lingual ability of the models; however, in joint modeling, we evaluate on English data. Our frameworks set a new benchmark in the cross-lingual evaluation of political bias and factuality. This is achieved through the use of various Multilingual Pre-trained Language Models (MPLMs) and Large Language Models (LLMs) coupled with ensemble learning methods.
  • A2 Wenyan Li
    Lost in Embeddings: Information Loss in Vision–Language Models
    [Abstract] Vision–language models typically process visual inputs through a pretrained vision encoder followed by projection into the language model’s embedding space. While crucial for modality fusion, this projection step induces an under-characterized information loss that directly impacts model capabilities. We propose two novel approaches to quantify the visual information loss introduced at this projection step. First, we evaluate the preservation of semantic information and structural relationships by analyzing changes in nearest-neighbor rankings between representations. Second, to locate information loss in the image representation at the patch level, we directly measure information loss through visual embedding reconstruction. Focusing on connector-based VLMs, our experiments reveal that projection layers fundamentally alter visual semantic relationships: nearest-neighbor similarity rankings diverge by 40-60% post-projection, directly explaining observed retrieval performance drops. Our embedding reconstruction approach provides interpretable insights into model behavior on visual question-answering tasks, finding that areas of high information loss reliably predict instances where models struggle.
    (An illustrative code sketch for this poster appears after this session’s list.)
  • A3 Thomas Bauwens
    GRaMPa: Subword Regularisation by Sampling Biased Uniformly Random Tokenisations from a Vocabulary-constrained Path-counting Markov Model
    [Abstract] Stochastically sampling word segmentations from a subword tokeniser, also called subword regularisation, is a known way to increase the robustness of language models to out-of-distribution inputs, such as text containing spelling errors. Recent work has observed that the usual augmentations that make popular deterministic subword tokenisers stochastic still cause only a handful of all possible segmentations to be sampled. It has therefore been proposed to sample uniformly across all segmentations instead, through rejection sampling of paths in an unweighted segmentation graph. In this paper, we argue that uniformly random segmentation in turn skews the distributions of certain segmentational properties (e.g. token lengths and the number of tokens produced) away from uniformity, which still ends up hiding meaningfully diverse tokenisations. We propose an alternative uniform sampler using the same segmentation graph, but weighted by counting the paths through it. Our sampling algorithm, GRaMPa, provides hyperparameters allowing sampled tokenisations to be biased towards fewer, longer tokens. Furthermore, GRaMPa is single-pass, guaranteeing significantly better computational complexity than previous approaches relying on rejection sampling. We show experimentally that language models trained with GRaMPa outperform existing regularising tokenisers in a data-scarce setting on token-level tasks such as dependency parsing, especially when spelling errors are introduced.
    (An illustrative code sketch for this poster appears after this session’s list.)
  • A4 Simon Devauchelle
    A probabilistic perspective of the source-filter model for vowel production
    [Abstract] Phonetic analysis of vowels often relies on the estimation of formants, which measure resonances over short speech utterances. Autoregressive (AR) modelling of speech is still a widely used analysis-synthesis approach for this purpose, based on the well-known source-filter model (Fant, 1974). Vowel signals are considered as the output of a linear stationary system that simulates the effect of the vocal tract, driven by a periodic excitation for voiced speech (or a noise-like excitation for unvoiced speech). The formants of a speech frame are usually estimated using linear predictive coding (LPC), by solving for the roots of the system’s transfer function (Makhoul 1975). This technique, based on inverse filtering of the speech signal, relies on several heuristics when used for phonetic purposes (vocal tract properties, gender of the speaker, dynamic programming, number of resonances – the “poles” of the AR system – to be found in a specific range of frequencies) and is intrinsically prone to measurement errors induced by the attraction of formant estimates towards the harmonics of the fundamental frequency (Chen et al. 2019; Vallabha et al. 2002). Recasting such acoustic models in probabilistic terms to analyse a set of frames requires its own formalization, but makes it possible to control and model the interactions between all model parameters. Integrating the AR model into a Bayesian framework enables the addition of prior knowledge derived from the phonetic literature and previous acoustic measures. The robustness of this approach and its ability to resolve these systematic errors will be evaluated with experiments on the TIMIT dataset (Garofolo et al. 1993) and on synthetic vowels, and results will be compared with the state-of-the-art solutions implemented in the software Praat (Boersma et al. 2025) and with manually measured ground-truth formants (Deng et al. 2006).
    (An illustrative code sketch for this poster appears after this session’s list.)
  • A5 Sarah Masud
    QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
    [Abstract] The rise of large language models (LLMs) has created a need for advanced benchmarking systems beyond traditional setups. To this end, we introduce QUENCH, a novel text-based English Quizzing Benchmark manually curated and transcribed from YouTube quiz videos. QUENCH possesses masked entities and rationales for the LLMs to predict via generation. At the intersection of world knowledge, geographical context, and common sense reasoning, QUENCH helps assess world knowledge and deduction capabilities of LLMs via a zero-shot, open-domain quizzing setup. We perform an extensive evaluation on 7 LLMs and 4 metrics, investigating the influence of model size, prompting style, geographical context, and gold-labeled rationale generation. The benchmarking concludes with an error analysis of various types of generative errors to which the LLMs are prone.
  • A6 Mohamed Salim Aissi
    Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting
    [Abstract] Reinforcement learning (RL) is a promising approach for aligning large language model (LLM) knowledge with sequential decision-making tasks. However, few studies have thoroughly investigated how fine-tuning LLM agents with RL in a specific environment affects their capabilities. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. We then analyze the source of this sensitivity by examining the model’s internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs.
  • A7 Maïwenn Fleig
    Investigating Neural Correlates of Predictability in Natural Conversation
    [Abstract] Prediction is a core mechanism in language processing, allowing the brain to anticipate what others will say next based on prior knowledge and context. Surprisal theory models this process by measuring how unexpected a word is in a given context. While research has shown that the brain tracks word surprisal in passive tasks like reading and listening, much less is known about how this prediction process operates in dynamic, real-time conversation. We adapt a pre-trained large language model using conversational data that captures the fluid and unpredictable nature of spoken dialogue, including hesitations, repetitions, and interruptions. We then investigate how the brain anticipates language in natural dialogue by identifying neural correlates of word predictability in EEG recorded during natural conversation.
    (An illustrative code sketch for this poster appears after this session’s list.)
  • A8 Markarit Vartampetian
    Benchmarking Large Language Models: Challenges, Methods and Engineering Applications
    [Abstract] The evaluation of Large Language Models (LLMs) is essential for ensuring their reliability and applicability in real-world tasks. This research explores state-of-the-art evaluation methods, integrating intrinsic and extrinsic assessments. It investigates LLMs as judges, compares automated with human assessments, and develops carefully curated domain-specific benchmark datasets in collaboration with professionals to ensure relevance and accuracy. A key focus is on assessing LLMs’ effectiveness in engineering contexts, particularly for technical writing tasks. By integrating systematic evaluation frameworks with expert-driven datasets, this study aims to advance LLM assessment methodologies for practical and professional applications.
  • A9 Lukas Mielczarek
    Approaching Psycholinguistically Interpretable Transformers: Syntax-guided Attention and its Effect on Human Reading Time Prediction
    [Abstract] Recently, there has been an increased interest in evaluating language models in terms of their psycholinguistic plausibility. Two important approaches to human processing are expectation-based theories and memory-based theories. The former postulate that model surprisal is a good indicator of human reading times. The latter explain difficulties in processing with the limitations of information encoding in human working memory (e.g. cue-based retrieval). Against this backdrop, efforts have been made to unify these theories in a language model (LM) architecture, and to test them against human data. Timkey and Linzen (2023) show that self-attention can be seen as a cue-based retrieval system. They propose a unified cognitive model, combining an LSTM with a single attention head. Given previous syntax-based explanations of working memory and linguists’ assumptions about language structure, we are interested in incorporating syntax into this model. Linguists often assume that humans parse a sentence in a connected, tree-like manner, leading us to question the plausibility of a single attention head that can attend to all preceding tokens. We therefore propose modelling retrieval/attention along syntactic relations, effectively parsing dependencies incrementally. We aim to compare a transformer LM architecture with a variant whose attention heads are supervised by dependency relations. We compare both LMs with respect to human reading time predictions and try to identify to what extent a transformer with uncontrolled attention matrices performs differently from a comparable model whose attention matrices are interpreted as dependency structures.
    (An illustrative code sketch for this poster appears after this session’s list.)
  • A10 Jingyi Sun
    Evaluating Input Feature Explanations through a Unified Diagnostic Evaluation Framework
    [Abstract] Explaining the decision-making process of machine learning models is crucial for ensuring their reliability and transparency for end users. One popular explanation form highlights key input features, such as i) tokens (e.g., Shapley Values and Integrated Gradients), ii) interactions between tokens (e.g., Bivariate Shapley and Attention-based methods), or iii) interactions between spans of the input (e.g., Louvain Span Interactions). However, these explanation types have only been studied in isolation, making it difficult to judge their respective applicability. To bridge this gap, we develop a unified framework that facilitates an automated and direct comparison between highlight and interactive explanations, built around four diagnostic properties. We conduct an extensive analysis across these three types of input feature explanations, each instantiated with three different explanation techniques, across two datasets and two models, and reveal that each explanation type has distinct strengths across the different diagnostic properties. Nevertheless, interactive span explanations outperform the other types of input feature explanations across most diagnostic properties. Although such explanations are still relatively understudied, our analysis underscores the need for further research to improve the methods that generate them. Additionally, integrating them with other explanation types that perform better on certain characteristics could further enhance their overall effectiveness.
  • A11 Himanshu Beniwal
    Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs
    [Abstract] We explore Cross-lingual Backdoor ATtacks (X-BAT) in multilingual Large Language Models (mLLMs), revealing how backdoors inserted in one language can automatically transfer to others through shared embedding spaces. Using toxicity classification as a case study, we demonstrate that attackers can compromise multilingual systems by poisoning data in a single language, with rare tokens serving as specific effective triggers. Our findings expose a critical vulnerability in the fundamental architecture that enables cross-lingual transfer in these models.
  • A12 Frederic Sadrieh
    Token Splitting Effect in soft prompts
    [Abstract] Prompt tuning in natural language processing enables efficient utilization of Large Language Models (LLMs), but soft prompts often struggle with interpretability. This study introduces a novel multi-model training methodology for soft prompts, validated across the MultiBERTs collection using IMDb, Emotion, and MNLI datasets. We uncover the token splitting effect in soft prompts, a phenomenon where individual prompt tokens align with specific models within their embedding spaces, significantly impacting performance. Our findings reveal that post-training prompt compression enhances efficiency with minimal performance loss. We thereby advance the understanding of soft prompt behavior in multi-model settings, offering pathways for resource-efficient optimization and strategic compression in Large Language Models.
  • A13 Flora Helmers
    Causally Reasoning LLMs
    [Abstract] LLMs have seen their reasoning capacities increase in recent years. They are now not only proficient in coding and solving mathematical problems, but are also expanding into new areas. However, one field in mathematics remains underexplored: causality. Typically, measuring an LLM’s causal reasoning evaluates its commonsense reasoning skills. But the causal inference framework introduced by Pearl goes further: it allows us to mathematically measure the causal effect of one variable on another. To assess the causal reasoning abilities of LLMs, Jin et al. introduced the Cladder dataset, which evaluates a model’s performance on tasks ranging from probabilistic reasoning to counterfactual reasoning. Current models, such as Llama-3.1, perform poorly on these tasks. Our work focuses on fine-tuning the model on the Cladder dataset using different techniques and evaluating how its reasoning capacities improve.
  • A14 Daniil Gurgurov
    GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge
    [Abstract] Contextualized embeddings based on large language models (LLMs) are available for various languages, but their coverage is often limited for lower resourced languages. Using LLMs for such languages is often difficult due to a high computational cost; not only during training, but also during inference. Static word embeddings are much more resource-efficient (“green”), and thus still provide value, particularly for very low-resource languages. There is, however, a notable lack of comprehensive repositories with such embeddings for diverse languages. To address this gap, we present GrEmLIn, a centralized repository of green, static baseline embeddings for 87 mid- and low-resource languages. We compute GrEmLIn embeddings with a novel method that enhances GloVe embeddings by integrating multilingual graph knowledge, which makes our static embeddings competitive with LLM representations, while being parameter-free at inference time. Our experiments demonstrate that GrEmLIn embeddings outperform state-of-the-art contextualized embeddings from E5 on the task of lexical similarity. They remain competitive in extrinsic evaluation tasks like sentiment analysis and natural language inference, with average performance gaps of just 5-10% or less compared to state-of-the-art models, given a sufficient vocabulary overlap with the target task, and underperform only on topic classification.
  • A15 Celia Nouri
    Graphically Speaking: Unmasking Abuse in Social Media with Conversation Insights
    [Abstract] Detecting abusive language in social media conversations poses significant challenges, as identifying abusiveness often depends on the conversational context, characterized by the content and topology of preceding comments. Traditional Abusive Language Detection (ALD) models often overlook this context, which can lead to unreliable performance metrics. Recent Natural Language Processing (NLP) methods that integrate conversational context often depend on limited and simplified representations, and report inconsistent results. In this paper, we propose a novel approach that utilizes graph neural networks (GNNs) to model social media conversations as graphs, where nodes represent comments and edges capture reply structures. We systematically investigate various graph representations and context windows to identify the optimal configuration for ALD. Our GNN model outperforms both context-agnostic baselines and linear context-aware methods, achieving significant improvements in F1 scores. These findings demonstrate the critical role of structured conversational context and establish GNNs as a robust framework for advancing context-aware abusive language detection.
    (An illustrative code sketch for this poster appears after this session’s list.)
  • A16 Siddhesh Pawar
    (tba)
    [Abstract] (tba)
  • A17 Hippolyte Gisserot-boukhlef
    Is Preference Alignment Always the Best Option to Enhance LLM-based Translation?
    [Abstract] Neural metrics for machine translation (MT) evaluation have become increasingly prominent due to their superior correlation with human judgments compared to traditional lexical metrics. Researchers have therefore utilized neural metrics through quality-informed decoding strategies, achieving better results than likelihood-based methods. With the rise of Large Language Models (LLMs), preference-based alignment techniques have gained attention for their potential to enhance translation quality by optimizing model weights directly on preferences induced by quality estimators. This study focuses on Contrastive Preference Optimization (CPO) and conducts extensive experiments to evaluate the impact of preference-based alignment on translation quality. Our findings indicate that while CPO consistently outperforms Supervised Fine-Tuning (SFT) on high-quality data with regard to the alignment metric, it may lead to instability across downstream evaluation metrics, particularly between neural and lexical ones. Additionally, we demonstrate that relying solely on the base model for generating candidate translations achieves performance comparable to using multiple external systems, while ensuring better consistency across downstream metrics.
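
For poster A2 (information loss in VLM projections), here is a minimal numpy sketch of the kind of nearest-neighbour comparison the abstract describes. The embeddings and the projection matrix below are random placeholders, and cosine k-NN overlap stands in for the paper’s ranking analysis; it is not the authors’ implementation.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbours (cosine) of each row of X, excluding the row itself."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)
    return np.argsort(-sims, axis=1)[:, :k]

def knn_overlap(A, B, k=10):
    """Average fraction of shared k-nearest-neighbours between two spaces with aligned rows."""
    na, nb = knn_indices(A, k), knn_indices(B, k)
    return np.mean([len(set(na[i]) & set(nb[i])) / k for i in range(len(A))])

# Synthetic stand-ins: 200 "patch" embeddings before and after a random linear projection.
rng = np.random.default_rng(0)
pre = rng.normal(size=(200, 768))              # vision-encoder outputs (placeholder)
W = rng.normal(size=(768, 1024)) / 768 ** 0.5  # connector / projection layer (placeholder)
post = pre @ W
print(f"k-NN overlap after projection: {knn_overlap(pre, post):.2f}")
```

A random projection largely preserves neighbourhoods; the divergences reported in the abstract come from trained connectors, which this toy example does not reproduce.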
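
For poster A3 (GRaMPa), a self-contained sketch of the path-counting idea the abstract builds on: a dynamic program counts the vocabulary-constrained segmentations of a word, and sampling proportionally to suffix path counts yields a uniformly random segmentation. GRaMPa’s bias towards fewer, longer tokens is not shown, and the toy vocabulary is made up.

```python
import random

def count_paths(word, vocab):
    """counts[i] = number of ways to segment word[i:] into tokens from vocab."""
    n = len(word)
    counts = [0] * (n + 1)
    counts[n] = 1
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                counts[i] += counts[j]
    return counts

def sample_segmentation(word, vocab, counts, rng=random):
    """Draw one segmentation uniformly at random among all valid ones."""
    i, tokens = 0, []
    while i < len(word):
        options = [(j, counts[j]) for j in range(i + 1, len(word) + 1)
                   if word[i:j] in vocab and counts[j] > 0]
        r = rng.randrange(sum(c for _, c in options))
        for j, c in options:
            if r < c:
                tokens.append(word[i:j])
                i = j
                break
            r -= c
    return tokens

vocab = {"un", "u", "n", "re", "r", "e", "a", "d", "ad", "read", "able", "b", "l", "bl"}
word = "unreadable"
counts = count_paths(word, vocab)
print("total segmentations:", counts[0])
print("one uniform sample:", sample_segmentation(word, vocab, counts))
```

Biasing towards longer tokens can be obtained by reweighting the per-step choice probabilities; the details of how GRaMPa parameterises that bias are in the poster, not here.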
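
For poster A4 (source-filter / AR formant analysis), a classical, non-Bayesian LPC formant estimate sketched with numpy and scipy: an autocorrelation-method AR fit followed by reading resonance frequencies off the pole angles. The synthetic “vowel” is a crude placeholder; real use would window actual speech frames and apply Praat-style heuristics that are only roughly imitated by the filtering below.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, sr, order=12):
    """Classical LPC formant estimation: fit an AR model to one windowed frame
    (autocorrelation method), then convert pole angles to resonance frequencies."""
    x = frame * np.hamming(len(frame))
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])               # pre-emphasis
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:-1], r[1:])                         # AR coefficients
    poles = np.roots(np.concatenate(([1.0], -a)))
    poles = poles[np.imag(poles) > 0]                         # one of each conjugate pair
    freqs = np.angle(poles) * sr / (2 * np.pi)
    bandwidths = -(sr / np.pi) * np.log(np.abs(poles))
    keep = (freqs > 90) & (bandwidths < 400)                  # rough heuristic filtering
    return np.sort(freqs[keep])

# Crude synthetic "vowel" frame with resonances near 700 Hz and 1200 Hz.
sr = 16000
t = np.arange(int(0.03 * sr)) / sr
frame = (np.sin(2 * np.pi * 700 * t) + 0.8 * np.sin(2 * np.pi * 1200 * t)) * np.exp(-40 * t)
print("estimated formants (Hz):", lpc_formants(frame, sr).round(1))
```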
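
For poster A7 (surprisal in conversation), a minimal worked example of token-level surprisal with an off-the-shelf GPT-2 via Hugging Face transformers. The conversational fine-tuning and the EEG alignment described in the abstract are not reproduced here.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Token-level surprisal (in bits): -log2 p(token | preceding context).
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "Well, I was going to say that we should probably leave early."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
surprisal = -log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]] / torch.log(torch.tensor(2.0))
for t, s in zip(tok.convert_ids_to_tokens(ids[0, 1:]), surprisal):
    print(f"{t:>12}  {s.item():6.2f} bits")
```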
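
For poster A9 (syntax-guided attention), a toy illustration of one way to supervise an attention head with dependency arcs: a cross-entropy term pushes each token’s attention distribution towards its syntactic governor. The parse indices and dimensions are placeholders, and this is not the authors’ architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d = 6, 32

# Placeholder UD-style parse of "The cat sat on the mat":
# head[i] is the index of token i's governor (the root points to itself).
head = torch.tensor([1, 2, 2, 5, 5, 2])

x = torch.randn(seq_len, d)                      # stand-in token representations
W_q = nn.Linear(d, d, bias=False)
W_k = nn.Linear(d, d, bias=False)

scores = W_q(x) @ W_k(x).T / d ** 0.5            # (seq_len, seq_len) attention logits
attn_loss = nn.functional.cross_entropy(scores, head)  # push attention mass onto the governor

attn_loss.backward()
print(f"syntax-supervision loss: {attn_loss.item():.3f}")
```

In a full model this term would be added, with some weight, to the usual language-modelling loss.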
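
For poster A15 (conversation graphs for abuse detection), a tiny sketch of the graph construction the abstract describes (nodes are comments, edges follow reply links), followed by one round of neighbourhood averaging, which is the basic aggregation step a GNN layer performs. The thread and features are placeholders and no learned weights are included.

```python
import numpy as np

# Toy conversation thread: each comment replies to an earlier one (None = thread root).
comments = ["original post", "reply 1", "reply to reply 1", "reply 2", "possibly abusive reply"]
parents  = [None, 0, 1, 0, 3]

# Placeholder node features (in the poster these would come from a text encoder).
rng = np.random.default_rng(0)
X = rng.normal(size=(len(comments), 16))

# Undirected adjacency from the reply structure, with self-loops.
A = np.eye(len(comments))
for child, parent in enumerate(parents):
    if parent is not None:
        A[child, parent] = A[parent, child] = 1.0

# One step of mean-aggregation over neighbours: each comment's representation now
# mixes in the comments it is connected to in the reply graph.
H = (A / A.sum(axis=1, keepdims=True)) @ X
print("context-aware representation of the last comment:", H[-1][:4].round(2), "...")
```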

Session 2

  • B1 Yuxuan Zhang
    Multitask Learning to Model Dynamics of Dimensional Affect in Speech
    [Abstract] Dimensional affect prediction from speech has traditionally relied on acoustic features to predict continuous affect representations (e.g., arousal, valence) at each time step. However, affect evolves dynamically over time, and incorporating temporal information may improve accuracy. This study investigates emotional dynamics in speech emotion recognition using multitask learning, where a model predicts both the affect state and its temporal derivative. Experiments on the RECOLA and SEWA datasets show that dynamic information consistently improves valence prediction, which is particularly challenging from audio alone. While concordance correlation coefficient (CCC) scores for the dynamic predictions remain lower than for the affect state predictions, results indicate that modelling dynamics enhances valence estimation over time. These findings highlight the crucial role of emotional dynamics in capturing the temporal evolution of affect.
    (An illustrative code sketch for this poster appears after this session’s list.)
  • B2 Vladana Perlic
    Multimodal Knowledge Enhancement: Integrating Text Analysis, Image Processing, and Knowledge Graphs for Advanced Retrieval-Augmented Generation Systems
    [Abstract] Since the advent of ChatGPT, artificial intelligence has revolutionized document creation and technical content management. This project focuses on developing an advanced Multimodal Knowledge Enhancement system that integrates text analysis, image processing, and knowledge graphs within Retrieval-Augmented Generation (RAG) frameworks to meet the specific needs of STMicroelectronics. The research specifically addresses the challenges of multimodal document understanding, where both textual and visual elements (diagrams, irregular tables, technical illustrations) must be jointly analyzed and represented. By creating robust methods for reliable data extraction from complex documents, optimizing compact multimodal representations, and effectively managing scattered information across extensive documentation, this work aims to overcome current limitations in multimodal RAG systems. The expected outcomes include improved integration of industrial texts and images within a unified retrieval framework, enhanced semantic connections through structured knowledge graphs, and more intelligent information retrieval that maintains contextual relationships while controlling complexity. This multimodal approach will significantly advance document intelligence capabilities for technical industries requiring sophisticated information management solutions.
  • B3 Sumanth Doddapaneni
    Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
    [Abstract] Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments compared to proprietary models, demonstrating the effectiveness of such cross-lingual evaluation in low resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment.
  • B4 Seth Aycock
    Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book?
    [Abstract] Extremely low-resource (XLR) languages lack substantial corpora for training NLP models, motivating the use of all available resources such as dictionaries and grammar books. Machine Translation from One Book (Tanzer et al., 2024) suggests that prompting long-context LLMs with one grammar book enables English–Kalamang translation, an XLR language unseen by LLMs—a noteworthy case of linguistics helping an NLP task. We investigate the source of this translation ability, finding almost all improvements stem from the book’s parallel examples rather than its grammatical explanations. We find similar results for Nepali and Guarani, seen low-resource languages, and we achieve performance comparable to an LLM with a grammar book by simply fine-tuning an encoder-decoder translation model. We then investigate where grammar books help by testing two linguistic tasks, grammaticality judgment and gloss prediction, and we explore what kind of grammatical knowledge helps by introducing a typological feature prompt that achieves leading results on these more relevant tasks. We thus emphasise the importance of task-appropriate data for XLR languages: parallel examples for translation, and grammatical data for linguistic tasks. As we find no evidence that long-context LLMs can make effective use of grammatical explanations for XLR translation, we conclude data collection for multilingual XLR tasks such as translation is best focused on parallel data over linguistic description.
  • B5 Qinyue Liu
    Detection of Reliable and Unreliable Citations in Scientific Papers: Dataset and Methods
    [Abstract] Citation verification is an important task for ensuring the integrity of academic discourse. However, few datasets have been built for this task, and only a small part of the literature concentrates on automatic citation verification. To address this gap, we introduce a novel dataset designed for citation verification, comprising both reliable and unreliable citations. The dataset includes real citations extracted from scientific papers, as well as generated and synthetic citations that simulate various citation misuse scenarios. We evaluate three approaches, based on (1) textual similarity metrics, (2) fine-tuning BERT-family models and (3) prompting GPT-4o, to verify whether a citation is reliable or not. While the best-performing textual similarity metric achieves strong results in detecting off-topic citations, it struggles with citations involving subtle semantic misinterpretations. Conversely, GPT-4o and fine-tuned BERT models demonstrate robust performance across all citation categories. However, GPT-4o’s cost and lack of transparency limit its practical applicability. Our study advances the field of automatic citation verification by providing a dataset and methodological insights.
  • B6 Millicent Ochieng
    Benchmarking LLMs: From Standard NLP Tasks to Real-World Multilingual Challenges
    [Abstract] Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities and limitations, particularly in multilingual and real-world contexts. This work presents insights from two complementary benchmarking studies: MEGA, a large-scale evaluation of LLMs on 16 NLP benchmarks covering 70 typologically diverse languages, comparing generative models like GPT-4 with fine-tuned state-of-the-art models, and Beyond Metrics, a real-world assessment of LLMs using multilingual and code-mixed WhatsApp conversations to evaluate their effectiveness in sentiment analysis, linguistic comprehension, and interpretability. Our results reveal significant gaps in LLM performance across languages and contexts. While models like GPT-4 demonstrate strong alignment with human understanding, they still struggle with reasoning and cultural nuances in non-English settings. We highlight challenges in multilingual AI, propose new benchmarking strategies, and discuss future directions for developing more robust and inclusive LLMs.
  • B7 Matyas Vincze
    Peer-Rewarding Language Models for Multilingual Self-Improving
    [Abstract] Current self-improving alignment methods for large language models often focus on high-resource languages, neglecting the needs of low-resource languages. We propose Peer-Rewarding Language Models, a framework that leverages multilingual collaboration to enhance performance across languages. By training language-specific adapters while sharing reward signals, our approach creates a synergistic system where improvements in one language inform others. We fine-tune Meta-Llama-3-8B-Instruct over 10 iterations, demonstrating sustained performance gains that surpass saturation observed in baseline methods. Evaluations on X-AlpacaEval and M-RewardBench show significant improvements in both low-resource languages and high-resource languages’ capabilities. This method addresses linguistic imbalance and offers a pathway for continuous multilingual improvement without requiring additional human annotation.
  • B8 Maria Francis
    Revisiting Spatial Reasoning in Visual Programmatic Models
    [Abstract] In recent years, Visual Language Models (VLMs) have been tasked with solving increasingly complex tasks, some involving multiple reasoning steps. Some research suggests that monolithic VLMs are not suited to multi-step reasoning problems, as a single error somewhere along the reasoning chain may cause a false output. Visual Programmatic Models (VPMs) attempt to work around this by deconstructing multi-step reasoning problems into a set of subtasks in the form of a program (e.g. in Python), then calling off-the-shelf vision or language models to solve each subtask sequentially. This allows the precise source of a reasoning error to be identified and corrected. VPMs have recently been criticized for performing poorly on problems involving spatial reasoning, due to the code-generating LLM’s weaknesses in converting spatial information into code. Our research shows that this is not necessarily the case. In addition, we offer suggestions on how future research on VLMs can be improved.
  • B9 Loïc Fosse
    Statistical Deficiency for Task Inclusion Estimation
    [Abstract] Tasks are central in machine learning, as they are the most natural objects to assess the capabilities of current models. The trend is to build general models able to address any task. Even though transfer learning and multitask learning try to leverage the underlying task space, no well-founded tools are available to study its structure. This study proposes a theoretically grounded setup to define the notion of task and to compute the inclusion between two tasks from a statistical deficiency point of view. We propose a tractable proxy as information sufficiency to estimate the degree of inclusion between tasks, show its soundness on synthetic data, and use it to reconstruct empirically the classic NLP pipeline.
  • B10 Jack Cook
    Brain-like pathways form during learning in models with heterogeneous experts
    [Abstract] The brain is a vast collection of heterogeneous functional and anatomical structures, which, remarkably, are connected in a way that is relatively stable across primate brains. Notable examples of stable pathways include the MD system and DMN, cortical-subcortical pathways, and the ventral and dorsal streams. Despite their large role in cognition, how these pathways form in the brain remains poorly understood. In this work, we train heterogeneous mixture-of-experts models to study the emergence of brain-like pathways in artificial neural networks. We find that pathways between heterogeneous experts do not develop on their own: they require the presence of several specific architectural features. These include a cost-based expert usage loss that penalizes the model according to the size of the experts used, expert-task loss normalization, which encourages the model to learn a task before adjusting the pathways used to solve it, and dropout. With these features implemented, our “Mixture of Pathways” model is able to replicate key findings made in the MD system and on habit formation in the brain. Crucially, each of our architectural features is biologically plausible, indicating that each could likely be implemented in some form in the brain. We then study our model’s behavior, including its ability to re-use skills during learning and its similarities to anxiety and rapid-response pathways. Our results provide insights into several brain systems and offer a pathway for future research on how connections emerge between different brain regions.
    (An illustrative code sketch for this poster appears after this session’s list.)
  • B11 Gerard Gállego
    Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios
    [Abstract] We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.
  • B12 Francesca Padovani
    What is the real benefit of using Child Directed Language for Language Modeling?
    [Abstract] Recent studies show that language models (LMs) trained on Child-Directed Language (CDL) achieve syntactic competencies comparable to larger models trained on internet data. But what makes CDL so effective despite its simplicity? We analyze specific syntactic paradigms to disentangle LMs’ syntactic learning from lexical memorization, comparing CDL-trained and Wikipedia-trained models. Our study explores whether CDL benefits early learning stages or the entire training process and extends beyond English to languages like French and German. Preliminary results with GPT-2 and RoBERTa indicate CDL-trained models outperform Wikipedia-trained ones only in select benchmarks (e.g., Zorro, CLAMS agreement tasks). Our findings provide insights into efficient language learning in neural models and could inform data augmentation strategies for improving NLP in low-resource languages.
  • B13 Enzo Doyen
    Analyzing use of masculine generics by LLMs in French
    [Abstract] Large language models (LLMs) have been shown to propagate and even amplify gender bias, in English and other languages. While current studies have evaluated LLMs’ gender biases in specific or constrained contexts, no studies so far have focused on gender biases conveyed by LLMs’ responses to generic instructions, especially with regard to masculine generics (MG). MG are a linguistic feature found in many gender-marked languages, denoting the use of the masculine gender as a “default” or supposedly neutral gender to refer to a mixed group of men and women, or to a person whose gender is irrelevant or unknown. Numerous psycholinguistic studies have shown that MG are not neutral and induce gender bias. We analyze the use of MG by both proprietary and local LLMs in responses to generic instructions and evaluate their MG bias rate. We focus on French and create a human noun database from existing lexical resources. We filter existing French instruction datasets to retrieve generic instructions and analyze the responses of 6 different LLMs. Overall, we find that ≈39.5% of LLMs’ responses to generic instructions are MG-biased (≈73.1% across responses with human nouns). Our findings also reveal that LLMs are reluctant to use gender-fair language spontaneously in French.
  • B14 Clémence Sebe
    Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows
    [Abstract] Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles with source code in public repositories. Extracting detailed workflow information from articles can improve accessibility and reusability but is hindered by limited annotated corpora. To address this, we framed the problem as a low-resource extraction task and tested four strategies: 1) creating a tailored annotated corpus, 2) few-shot named-entity recognition (NER) with an autoregressive language model, 3) NER using masked language models with existing and new corpora, and 4) integrating workflow knowledge into NER models. Using BioToFlow, a new corpus of 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a 70.4 F-measure, comparable to inter-annotator agreement. While knowledge integration improved performance for specific entities, it was less effective across the entire information schema. Our results demonstrate that high-performance information extraction for bioinformatics workflows is achievable.
  • B15 Alba Haveriku
    Understanding Reading Patterns of Albanian Native Readers through Mouse Tracking Analysis
    [Abstract] Eye-tracking has been recognized as an effective method for studying reading patterns in various languages, proving its value in diverse linguistic contexts. High-quality eye-tracking equipment is costly, especially in the case of understudied languages, such as Albanian, where there is no dedicated laboratory infrastructure. To address this, we explore the usage of low-cost alternatives to study reading behaviours. Our study aims to integrate two datasets: (1) a mouse-tracking corpus collected from 50 native Albanian speakers using the adaptations made to the MoTR tool in Haveriku et al. [2], and (2) a Stanza model trained on a universal dependencies treebank for Standard Albanian Language (SAL), presented in Kote et al. [3], composed of 24,537 tokens, including part-of-speech (PoS) tagging, morphological features, lemmas and syntactic dependencies. We leverage the MoTR corpus by integrating the data from the CoNLL-U files (automatically generated by the Albanian model) to understand whether the addition of mouse-tracking data can improve the accuracy of predicting syntactic dependencies and PoS tags. The findings provide valuable insights into reading speed and linguistic processing, advancing language research for Albanian and laying the groundwork for future data post-processing efforts.
  • B16 Nicolas Boizard
    (tba)
    [Abstract] (tba)
  • B17 Minnie Kabra
    (tba)
    [Abstract] (tba)
  • B18 Haeun Yu
    (tba)
    [Abstract] (tba)
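
For poster B1 (dynamics of dimensional affect), a short numpy sketch of the concordance correlation coefficient (CCC) and of the derivative targets used in the multitask setup. The signals below are synthetic placeholders rather than RECOLA/SEWA annotations.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient, the standard metric for continuous affect."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

# Multitask targets: the affect state (e.g. valence) and its temporal derivative.
t = np.linspace(0, 10, 250)                       # placeholder 25 Hz annotation rate
valence = 0.4 * np.sin(0.6 * t) + 0.1 * np.random.default_rng(1).normal(size=t.size)
d_valence = np.gradient(valence, t)               # dynamics target

pred = 0.4 * np.sin(0.6 * t + 0.2)                # stand-in for a model's prediction
print(f"CCC(state)    = {ccc(valence, pred):.3f}")
print(f"CCC(dynamics) = {ccc(d_valence, np.gradient(pred, t)):.3f}")
```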
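
For poster B10 (heterogeneous mixture of experts), a toy PyTorch sketch of a cost-based expert usage loss: routing probability mass placed on an expert is penalised in proportion to that expert’s parameter count. The expert-task loss normalization and dropout mentioned in the abstract are omitted, and nothing here is the authors’ actual model.

```python
import torch
import torch.nn as nn

class HeterogeneousMoE(nn.Module):
    """Toy mixture of heterogeneous experts with a cost-based usage penalty."""
    def __init__(self, d_in=32, d_out=8, hidden_sizes=(16, 64, 256)):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, h), nn.ReLU(), nn.Linear(h, d_out))
            for h in hidden_sizes
        )
        self.gate = nn.Linear(d_in, len(hidden_sizes))
        costs = torch.tensor([sum(p.numel() for p in e.parameters()) for e in self.experts],
                             dtype=torch.float)
        self.register_buffer("costs", costs / costs.sum())   # relative expert sizes

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=-1)                    # routing weights (B, E)
        y = torch.stack([e(x) for e in self.experts], dim=-1)      # (B, d_out, E)
        out = (y * w.unsqueeze(1)).sum(-1)
        usage_loss = (w.mean(0) * self.costs).sum()                # cost-based usage loss
        return out, usage_loss

model = HeterogeneousMoE()
x, target = torch.randn(4, 32), torch.randn(4, 8)
out, usage_loss = model(x)
loss = nn.functional.mse_loss(out, target) + 0.1 * usage_loss
loss.backward()
print(f"task loss + usage penalty = {loss.item():.3f}")
```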

Session 3

  • C1 Yanis Labrak
    An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling
    [Abstract] This paper investigates discrete unit representations in Speech Language Models (SLMs), focusing on optimizing speech modeling during continual pre-training. In this paper, we systematically examine how model architecture, data representation, and training robustness influence the adaptation of pre-trained language models to the speech modality. Our experiments highlight the role of speech encoders and clustering granularity across different model scales, showing how optimal discretization strategies vary with model capacity. By examining cluster distribution and phonemic alignments, we investigate the effective use of discrete vocabulary, uncovering both linguistic and paralinguistic patterns. Additionally, we explore the impact of clustering data selection on model robustness, highlighting the importance of domain matching between discretization training and target applications.
  • C2 Tunde Ajayi
    Cross-lingual Transfer and Multilingual Learning for Detecting Harmful Behaviour in African Under-Resourced Language Dialogue
    [Abstract] Most harmful dialogue detection models are developed for high-resourced languages. Consequently, users who speak under-resourced languages cannot fully benefit from these models in terms of usage, development, detection and mitigation of harmful dialogue utterances. Our work aims at detecting harmful utterances in under-resourced African languages. We leverage transfer learning using pretrained models trained with multilingual embeddings to develop a cross-lingual model capable of detecting harmful content across various African languages. We first fine-tune a harmful dialogue detection model on a selected African dialogue dataset. Additionally, we fine-tune a model on a combined dataset in some African languages to develop a multilingual harmful dialogue detection model. We then evaluate the cross-lingual model’s ability to generalise to an unseen African language by performing harmful dialogue detection in an under-resourced language not present during pretraining or fine-tuning. We evaluate our models on the test datasets. We show that our best performing models achieve impressive results in terms of F1 score. Finally, we discuss the results and limitations of our work.
  • C3 Song Duong
    SCOPE: A Self-supervised Framework for Improving Faithfulness in Conditional Text Generation
    [Abstract] Large Language Models (LLMs), when used for conditional text generation, often produce hallucinations, i.e., information that is unfaithful or not grounded in the input context. This issue arises in typical conditional text generation tasks, such as text summarization and data-to-text generation, where the goal is to produce fluent text based on contextual input. When fine-tuned on specific domains, LLMs struggle to provide faithful answers to a given context, often adding information or generating errors. One underlying cause of this issue is that LLMs rely on statistical patterns learned from their training data. This reliance can interfere with the model’s ability to stay faithful to a provided context, leading to the generation of ungrounded information. We build upon this observation and introduce a novel self-supervised method for generating a training set of unfaithful samples. We then refine the model using a training process that encourages the generation of grounded outputs over unfaithful ones, drawing on preference-based training. Our approach leads to significantly more grounded text generation, outperforming existing self-supervised techniques in faithfulness, as evaluated through automatic metrics, LLM-based assessments, and human evaluations.
  • C4 Sekh Mainul Islam
    Attributing and Mitigating Concept Drift in Misinformation Detection through Active Learning-Enhanced Topic Adaptation
    [Abstract] Misinformation classification on streaming text input, such as social media posts or news articles, can be challenging, as distributional shifts in the incoming data cause model performance to decay over time. Prior research defines this phenomenon as “concept drift in misinformation detection” and has addressed it with various approaches, such as online learning and domain adaptation. However, these approaches either require expensive and time-consuming data annotation or assume that the distributional shift in incoming data is mainly caused by the emergence of new topics. Through an extensive analysis of four misinformation datasets, we discover another cause of concept drift in misinformation classification: linguistic variation within similar topics over time. We propose a novel Active Learning enhanced Class and Topic conditional Domain Adaptation (ALCTDA) approach for misinformation classification of text data in a streaming setting. To address limitations in labeled data, our approach uses an uncertainty-based sampling strategy to intelligently select examples for annotation from incoming data. It then uses a novel method for domain adaptation that aligns both the labels and topics between historical and incoming data, minimizing the distribution gap caused by the emergence of new topics and the change in vocabulary within the same topic. We conduct rigorous experiments on ALCTDA using four major misinformation datasets – GossipCop, Politifact, Fake cures, and 5G – and find a consistent 2-3% performance improvement over state-of-the-art models. This demonstrates the effectiveness of ALCTDA, which offers a promising avenue to mitigate concept drift in misinformation detection.
    (An illustrative code sketch for this poster appears after this session’s list.)
  • C5 Philippe Martin
    Fine Control of Vocal Parameters for High Quality Voice Synthesis
    [Abstract] Voice synthesis has progressed from articulatory systems to statistical systems, and nowadays to neural models. High performance has been achieved in the quality of the generated speech, but we do not yet have a full grasp of the finer control needed to synthesize any voice with any characteristics, such as emotion and prosody.
  • C6 Michal Štefánik
    Towards Robust Algorithmic Reasoning with Neural Language Models
    [Abstract] The applicability of language models in many applications with a huge potential for automation is hindered by language models’ unreliability in independent decision-making. We posit that this unreliability is caused by models’ inability to robustly model and follow an underlying algorithm of the task, a skill we refer to as algorithmic reasoning. Towards creating future models capable of robust algorithmic reasoning, we identify two main challenges that remain unsolved with current technology: (1) the ability to flawlessly represent exact information, such as numeric or discrete values, and (2) the ability to consistently combine pieces of information to infer new information. We present several directions that we work on with the potential to address these flaws, and challenge the research community to try to address these fundamental deficiencies.
  • C7 Matthias Orlikowski
    Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals’ Subjective Text Perceptions
    [Abstract] People naturally vary in their annotations for subjective questions and some of this variation is thought to be due to the person’s sociodemographic characteristics. LLMs have also been used to label data, but recent work has shown that models perform poorly when prompted with sociodemographic attributes, suggesting limited inherent sociodemographic knowledge. Here, we ask whether LLMs can be trained to be accurate sociodemographic models of annotator variation. Using a curated dataset of five tasks with standardized sociodemographics, we show that models do improve in sociodemographic prompting when trained but that this performance gain is largely due to models learning annotator-specific behaviour rather than sociodemographic patterns. Across all tasks, our results suggest that models learn little meaningful connection between sociodemographics and annotation, raising doubts about the current use of LLMs for simulating sociodemographic variation and behaviour.
  • C8 Maelig Hurte
    Structural Biases for Compositional Generalization
    [Abstract] Compositionality is a foundational hypothesis in formal semantics: it states that the semantic interpretation of a sentence is a function of its atomic parts and the way they are combined. In NLP, the current dominant paradigm is to design models with no linguistically interpretable representations, assuming these are already encoded through latent representations. However, research shows that the encoded information may be insufficient, or too complex to be properly interpreted, hence requiring more direct syntactic information. Research on the COGS dataset shows that seq2seq models fail at structural generalization tasks, which consist of deriving the meaning of a novel structure that is the composition of other structures encountered at training time. To improve the models’ ability to generalize, we first apply a new training paradigm called iLM, which consists in dividing the training space in such a way that the model naturally exhibits invariants, which can be used to better understand the linguistic structure of sentences. Secondly, since most seq2seq models tested on COGS tasks are encoder-decoder models, we propose to study decoder-only models, such as recent LLMs, on those tasks. The goal is to see whether they have developed strong generalization capabilities through their pre-training, or whether they are able to develop such abilities when fine-tuned on the COGS dataset.
  • C9 Joseph James
    On the Rigour of Scientific Writing: Criteria, Analysis, and Insights
    [Abstract] Rigour is crucial for scientific research as it ensures the reproducibility and validity of results and findings. Despite its importance, little work exists on modelling rigour computationally, and there is a lack of analysis on whether these criteria can effectively signal or measure the rigour of scientific papers in practice. In this paper, we introduce a bottom-up, data-driven framework to automatically identify and define rigour criteria and assess their relevance in scientific writing. Our framework includes rigour keyword extraction, detailed rigour definition generation, and salient criteria identification. Furthermore, our framework is domain-agnostic and can be tailored to the evaluation of scientific rigour for different areas, accommodating the distinct salient criteria across fields. We conducted comprehensive experiments based on datasets collected from two high impact venues for Machine Learning and NLP (i.e., ICLR and ACL) to demonstrate the effectiveness of our framework in modelling rigour. In addition, we analyse linguistic patterns of rigour, revealing that framing certainty is crucial for enhancing the perception of scientific rigour, while suggestion certainty and probability uncertainty diminish it.
  • C10 Iuliia Korotkova
    Word-level Text Markup for Prosody Control in Speech Synthesis
    [Abstract] Modern Text-to-Speech (TTS) technologies generate speech very close to the natural one, but synthesized voices still lack variation in intonation which, in addition, is hard to control. In this work, we address the problem of prosody control, aiming to capture information about intonation in a markup without hand-labeling and linguistic expertise. We propose a method of encoding prosodic knowledge from textual and acoustic modalities, which are obtained with the help of models pretrained on self-supervised tasks, into latent quantized space with interpretable features. Based on these features, the prosodic markup is constructed, and it is used as an additional input to the TTS model to solve the one-to-many problem and is predicted by text. Moreover, this method allows for prosody control during inference time and scalability to new data and other languages.
  • C11 Gabrielle Le Bellier
    Control model generation with cultural proxies
    [Abstract] Despite the recent improvements of LLMs, models still struggle to provide pertinent outputs when specific cultures are indicated. As shown by AlKhamissi et al. (2024), current models primarily manifest Western values and opinions. While the notion of culture is not well defined and is interpreted in different ways in the current literature (Liu et al., 2024), we use proxies of culture to evaluate models’ adaptation to cultural contexts. However, fine-tuning LLMs on specific cultural datasets entails a high computational cost and risks catastrophic forgetting (Liu et al., 2024). We leverage parameter-efficient fine-tuning methods to control the model’s generation based on a given culture while preserving its modeling capacities. References: Badr AlKhamissi, Muhammad ElNokrashy, Mai Alkhamissi, and Mona Diab. 2024. Investigating Cultural Alignment of Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12404–12422, Bangkok, Thailand. Association for Computational Linguistics. Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen. 2024. Culturally Aware and Adapted NLP: A Taxonomy and a Survey of the State of the Art. arXiv, abs/2406.03930.
  • C12 Florian Le Bronnec
    SCOPE: A Self-supervised Framework for Improving Faithfulness in Conditional Text Generation
    [Abstract] Large Language Models (LLMs), when used for conditional text generation, often produce hallucinations, i.e., information that is unfaithful or not grounded in the input context. This issue arises in typical conditional text generation tasks, such as text summarization and data-to-text generation, where the goal is to produce fluent text based on contextual input. When fine-tuned on specific domains, LLMs struggle to provide faithful answers to a given context, often adding information or generating errors. One underlying cause of this issue is that LLMs rely on statistical patterns learned from their training data. This reliance can interfere with the model’s ability to stay faithful to a provided context, leading to the generation of ungrounded information. We build upon this observation and introduce a novel self-supervised method for generating a training set of unfaithful samples. We then refine the model using a training process that encourages the generation of grounded outputs over unfaithful ones, drawing on preference-based training. Our approach leads to significantly more grounded text generation, outperforming existing self-supervised techniques in faithfulness, as evaluated through automatic metrics, LLM-based assessments, and human evaluations.
  • C13 Ekaterina Borisova
    Are LLMs Up to the Challenge of Cross-Domain Table Understanding? A Multimodal Case Study on Scientific vs. Non-Scientific Data
    [Abstract] Tables are ubiquitous data presentation tools used across various domains, including scientific research, finance, medicine, business, and education. While large language models (LLMs) have demonstrated strong performance in a wide range of applications, their ability to understand (semi-)structured data remains under-researched – especially for tables from scientific sources such as scholarly articles. In this study, we aim to address the aforementioned gap by evaluating the performance of both textual and multi-modal LLMs on various table understanding tasks. We compare their ability to process tables from scientific and non-scientific contexts and investigate the impact of different representation modalities (image vs. text) on model performance. Our results will provide insights into the strengths and limitations of contemporary models in understanding tables across diverse domains and formats.
  • C14 Clovis Varangot-reille
    Doing More with Less — Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey
    [Abstract] Large Language Models (LLM)-based systems, i.e. interconnected elements that include an LLM as a central component (e.g., conversational agents), are typically monolithic static architectures that rely on a single LLM for all user queries. However, they often require different preprocessing strategies, levels of reasoning, or knowledge. Generalist LLMs (e.g. GPT-4) trained on very large multi-topic corpora can perform well in a variety of tasks. They require significant financial, energy, and hardware resources that may not be justified for basic tasks. This implies potentially investing in unnecessary costs for a given query. To overcome this problem, a routing mechanism routes user queries to the most suitable components, such as smaller LLMs or experts in specific topics. This approach may improve response quality while minimising costs. Routing can be expanded to other components of the conversational agent architecture, such as the selection of optimal embedding strategies. This paper explores key considerations for integrating routing into LLM-based systems, focusing on resource management, cost definition, and strategy selection. Our main contributions include a formalisation of the problem, a novel taxonomy of existing approaches emphasising relevance and resource efficiency, and a comparative analysis of these strategies in relation to industry practices. Finally, we identify critical challenges and directions for future research.
  • C15 Theo Charlot
    Emergent properties with repeated examples
    [Abstract] We study the performance of transformers as a function of the number of repetitions of training examples, using algorithmically generated datasets. On three mathematical problems – the greatest common divisor, modular multiplication, and matrix eigenvalues – we show that for a fixed number of training steps, models trained on smaller sets of repeated examples outperform models trained on larger sets of single-use examples. We also demonstrate that two-set training – the repeated use of a small random subset of examples, alongside normal sampling from the rest of the training set – provides faster learning and better performance. This highlights that the benefits of repetition can outweigh those of data diversity. These datasets and problems provide a controlled setting in which to shed light on the still poorly understood interplay between generalization and memorization in deep learning.
    (An illustrative code sketch for this poster appears after this session’s list.)
  • C16 Mohammed Ghennai
    (tba)
    [Abstract] (tba)
  • C17 Maksim Aparovich
    (tba)
    [Abstract] (tba)
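
For poster C4 (active-learning-enhanced topic adaptation), a minimal sketch of uncertainty-based sample selection, using predictive entropy as the uncertainty score. The classifier outputs are random placeholders, and the class- and topic-conditional domain adaptation component of ALCTDA is not shown.

```python
import numpy as np

def entropy_sampling(probs, budget):
    """Pick the `budget` incoming examples whose predicted class distribution has
    the highest entropy, i.e. where the current classifier is least certain."""
    probs = np.clip(probs, 1e-12, 1.0)
    ent = -(probs * np.log(probs)).sum(axis=1)
    return np.argsort(-ent)[:budget]

# Stand-in for classifier outputs on a new batch of streaming posts (3 classes).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

to_annotate = entropy_sampling(probs, budget=20)
print("indices sent to human annotators:", to_annotate[:10], "...")
```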
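
For poster C15 (repeated examples), a small sketch of the two-set training scheme described in the abstract: each batch mixes examples drawn from a small fixed "repeated" subset with examples sampled from the rest of the training data. The subset size and mixing probability are arbitrary placeholders, not the values used in the poster.

```python
import random

def two_set_batches(train_set, repeated_fraction=0.05, p_repeated=0.25,
                    batch_size=64, steps=1000, seed=0):
    """Yield batches in which each example comes from a small fixed 'repeated' subset
    with probability p_repeated, and is drawn from the rest of the data otherwise."""
    rng = random.Random(seed)
    cut = int(len(train_set) * repeated_fraction)
    repeated, rest = train_set[:cut], train_set[cut:]
    for _ in range(steps):
        yield [rng.choice(repeated) if rng.random() < p_repeated else rng.choice(rest)
               for _ in range(batch_size)]

# Toy usage with GCD-style operand pairs as placeholder data.
data = [(a, b) for a in range(1, 200) for b in range(1, 200)]
random.Random(0).shuffle(data)
first_batch = next(two_set_batches(data))
print(len(first_batch), "examples; first three:", first_batch[:3])
```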

Birds-of-a-feather Session

(tba)