{"id":221,"date":"2020-12-16T08:57:11","date_gmt":"2020-12-16T08:57:11","guid":{"rendered":"http:\/\/lig-alps.imag.fr\/?page_id=221"},"modified":"2026-03-30T07:31:26","modified_gmt":"2026-03-30T07:31:26","slug":"schedule","status":"publish","type":"page","link":"https:\/\/lig-alps.imag.fr\/index.php\/schedule\/","title":{"rendered":"Schedule"},"content":{"rendered":"\n<style type=\"text\/css\">\n.tg  {border-spacing:0;}\n.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;\n  overflow:hidden;padding:10px 5px;word-break:normal;}\n.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;\n  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}\n\n.tg .talk{background-color:#e69f00;text-align:left;vertical-align:top}\n.tg .lab{background-color:#d55e00;text-align:left;vertical-align:top}\n.tg .poster{background-color:#f0e442;border-color:inherit;text-align:left;vertical-align:top}\n.tg .social{background-color:#56b4e9;border-color:inherit;text-align:left;vertical-align:top}\n.tg .nature{background-color:#009e73;text-align:left;vertical-align:top}\n.tg .food{background-color:#cc79a7;text-align:left;vertical-align:top}\n.tg .default{border-color:inherit;text-align:left;vertical-align:top}\n<\/style>\n<table class=\"tg\">\n<thead>\n  <tr>\n    <th class=\"default\"><b>CET<\/b><\/th>\n    <th class=\"default\"><b>Sunday 29\/03<\/b><\/th>\n    <th class=\"default\"><b>Monday 30\/03<\/b><br><\/th>\n    <th class=\"default\"><b>Tuesday 31\/03<\/b><\/th>\n    <th class=\"default\"><b>Wednesday 1\/04<\/b><\/th>\n    <th class=\"default\"><b>Thursday 2\/04<\/b><br><\/th>\n    <th class=\"default\"><b>Friday 3\/04<\/b><\/th>\n  <\/tr>\n<\/thead>\n<tbody>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">8-9<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"food\" rowspan=\"2\">Breakfast<\/td>\n    <td class=\"nature\" rowspan=\"8\">Nature activities<\/td>\n    <td class=\"food\" rowspan=\"2\">Breakfast<\/td>\n    <td class=\"food\" rowspan=\"2\">Breakfast<\/td>\n    <td class=\"food\" rowspan=\"2\">Breakfast<\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">9-10<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"talk\" rowspan=\"2\">Welcome session<\/td>\n    <td class=\"default\" rowspan=\"2\"><\/td>\n    <td class=\"poster\" rowspan=\"4\">Poster session 3<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n      <td class=\"talk\" rowspan=\"4\">(9.30) Julia Kreutzer<\/td>\n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">10-11<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"talk\" rowspan=\"4\">Roger K.Moore<\/td>\n    <td class=\"talk\" rowspan=\"4\">Fran\u00e7ois Yvon<\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">11-12<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"talk\" rowspan=\"3\">Carlos Ramisch <br> Manon Scholivet <br> Zen Research 3\/\/4<br><\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">12-13<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"food\" rowspan=\"2\">Lunch<\/td>\n    <td 
class=\"food\" rowspan=\"2\">Lunch<\/td>\n    <td class=\"food\" rowspan=\"2\">Lunch<\/td>\n    <td class=\"food\" rowspan=\"2\">Lunch<\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n      <td class=\"food\" rowspan=\"2\">(12.30) Lunch<\/td>\n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">13-14<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"poster\" rowspan=\"4\">Poster session 1<br><\/td>\n    <td class=\"talk\" rowspan=\"4\">Carlos Ramisch <br> Manon Scholivet <br> Zen Research 1<br><\/td>\n    <td class=\"talk\" rowspan=\"4\">Carlos Ramisch <br> Manon Scholivet <br> Zen Research 2<br><\/td>\n    \n    <td class=\"default\" rowspan=\"2\">Shuttle back<br>to Grenoble<\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n      <td class=\"nature\" rowspan=\"11\">Nature activities<br><\/td>\n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">14-15<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"default\" rowspan=\"2\"><\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">15-16<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"talk\" rowspan=\"3\">Timoth\u00e9e Lacroix<br><\/td>\n    <td class=\"lab\" rowspan=\"3\">Projects<br><\/td>\n    <td class=\"lab\" rowspan=\"3\">Project<br><\/td>\n    <td class=\"default\" rowspan=\"2\"><\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">16-17<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"default\" rowspan=\"2\">Arrival in Grenoble<\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n        <td class=\"food\">Coffee break<\/td>\n        <td class=\"food\">Coffee break<\/td>\n        <td class=\"food\">Coffee break<\/td>\n        \n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">17-18<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"poster\" rowspan=\"4\">Poster session 2<br><\/td>\n    <td class=\"talk\" rowspan=\"4\">Isabelle Augenstein<br><\/td>\n    <td class=\"lab\" rowspan=\"2\">Project<br><\/td>\n    <td class=\"default\" rowspan=\"2\"><\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">18-19<\/td>\n    <td class=\"default\" rowspan=\"1\"><\/td>\n    <td class=\"lab\" rowspan=\"2\">Project presentation<br><\/td>\n    <td class=\"default\" rowspan=\"2\"><\/td>\n  <\/tr>\n  <tr>\n      <td class=\"default\" rowspan=\"1\"><\/td>\n  <\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">19-20<\/td>\n    <td class=\"food\" rowspan=\"2\">Dinner<\/td>\n    <td class=\"food\" rowspan=\"2\">Dinner<\/td>\n    <td class=\"food\" rowspan=\"2\">Dinner<\/td>\n    <td class=\"food\" rowspan=\"2\">Dinner<\/td>\n    <td class=\"food\" rowspan=\"2\">Dinner<\/td>\n    <td class=\"default\" rowspan=\"2\"><\/td>\n  <\/tr>\n  <tr><\/tr>\n  <tr>\n    <td class=\"default\" rowspan=\"2\">20-21<\/td>\n    <td class=\"social\" rowspan=\"4\">Social<\/td>\n    <td class=\"social\" rowspan=\"4\">Social<\/td>\n    <td class=\"social\" rowspan=\"4\">Social with karaoke!<\/td>\n    <td class=\"social\" rowspan=\"4\">Birds-of-a-feather session<\/td>\n    <td class=\"social\" rowspan=\"4\">Social<\/td>\n    <td class=\"default\"  rowspan=\"2\"><\/td>\n  <\/tr>\n  <tr><\/tr>\n  <tr>\n    <td class=\"default\" 
rowspan=\"2\">21-22<\/td>\n    <td class=\"default\" rowspan=\"2\"><\/td>\n  <\/tr>\n  <tr><\/tr>\n<\/tbody>\n<\/table>\n\n\n\n<h2 class=\"wp-block-heading\">Poster Sessions<\/h2>\n\n\n\n<h3><span id=\"S1\">Session 1<\/span><\/h3>\n<ul>\n\n<li>\n    A1 <b> Zofia Milczarek <\/b> \n    <br>\n    <i>Evaluating Social Intelligence in the Long-Term<\/i>\n    <details><summary>[Abstract]<\/summary>\n    [Work in Progress] Evaluating LLMs and LLM-based agents on long-term memory tasks has grown in popularity in recent years. This evaluation ranges from retrieval of small small information (needle-in-a-haystack), to complex temporal and multi-hop reasoning over long-term dialogue traces, to finally evaluating agents&#8217; capabilities in simulated environments. Generalisation abilities of LLMs have also given rise to studies that evaluate their social intelligence &#8211; a dimension becoming increasingly more important as LLM-based systems are being deployed in real-life human-facing contexts (e.g. hospitals, personal assistants, therapy). The existing evaluations focus on Qustion-Anaswering or short-term simulations. Seeing as Long-Term Social Intelligence has been overlooked so far, our goal is to propose to join these two emerging directions and evaluate it via simulated interactions inspired by SmallVille agents.\n    <\/details>\n<\/li>\n\n\n<li>\n    A2 <b> Yosra Jelassi <\/b> \n    <br>\n    <i>Interpretable attributes for transparent Language Detection<\/i>\n    <details><summary>[Abstract]<\/summary>\n    The  lack  of  transparency in machine learning models raises critical concerns, particularly in domains where interpretability  and  fairness  are  essential.  Understanding  the why behind a model\u2019s prediction is crucial not only for experts seeking to validate system behavior but also for ensuring that decisions do not reflect biases. Indeed, fairness is often based on our ability to determine whether predictions are influenced by discriminatory patterns or protected attributes.\n\nTo  address  this  challenge,  the  field  of  explainability has emerged,  seeking  to  develop  methods  that  helps  to  uncover the  behavior  of  these  systems  and  to  make  their  decisions intelligible to humans.  \nThe literature distinguishes multiple types of explainability. One of them is about seeking information about the model learned representations.\nIn this context, a state of the art approach called BA-LR aims to make the model more interpretable by  allowing  a  mapping  between  learned internal  representations  and  interpretable attributes. A posthoc study enables to correlate these attributes to linguistic traits enabling to identify a speaker. \n\nTransposing this idea to language detection, our study focuses on designing a method to model language-specific attributes that are not only discriminative but also linguistically and phonetically meaningful. The discovery of these attributes can be initiated either unsupervisedly by the model itself, or guided through knowledge injection constraints that relate the internal representations with desired linguistic features.\nThese constraints could shape the distribution of learned attributes. \nFor instance, when analyzing a set of languages, we might seek to extract attribute that are either highly distinctive, or widely shared across languages depending on the desired phonetic markers we are looking for. 
By enforcing a constraint that aligns the attributes with this target, we can drive the model to focus on specific and interpretable attributes. \n    <\/details>\n<\/li>\n\n\n<li>\n    A3 <b> Tobias Kalmbach <\/b> \n    <br>\n    <i>Always So Sure: Can LLM&#8217;s Confidence be Trusted?<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Confidence estimation techniques are often used to better gauge the answers of Large Language Models (LLMs). One such technique is verbalized confidence. This prompting setup produces confidence scores alongside the actual answers, but the mechanisms behind these self-reported confidence values remain poorly understood. This paper presents a comprehensive analysis of verbalized confidence across multiple datasets spanning factual questions, multiple-choice QA, and causal reasoning using four different LLMs. \nOur investigation reveals that verbalized confidence scores are highly quantized, clustering around specific values (e.g., 0, 90, 100) with minimal differentiation between correct and incorrect answers. Through causal mediation analysis and targeted input perturbations, we demonstrate that confidence score generation is primarily influenced by structural prompt elements like the word &#8220;confidence&#8221; and the specified scale range rather than the actual question&#8217;s content. \nThese findings raise significant concerns about the reliability of verbalized confidence as a self-evaluation mechanism in LLMs.\n    <\/details>\n<\/li>\n\n\n<li>\n    A4 <b> Sarah Bouaraba <\/b> \n    <br>\n    <i>Clinically-Conditioned Synthetic Generation of French Breast Radiotherapy Reports: A Multi-LLM Study<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Clinical NLP for French faces a critical data bottleneck. Unlike English, which benefits from large-scale, publicly available resources such as MIMIC-III [Johnson et al., 2016] and the i2b2\/n2c2 shared tasks, French clinical corpora remain exceptionally scarce. Existing resources \u2014 including QUAERO [N\u00e9v\u00e9ol et al., 2014], CAS [Grabar et al., 2018], and selected CLEF eHealth subsets \u2014 cover relatively narrow domains and provide limited annotated data. This scarcity is largely driven by regulatory constraints: under GDPR Article 9, health records are classified as sensitive personal data, and French legislation requires HDS (H\u00e9bergeur de Donn\u00e9es de Sant\u00e9) certification for any storage or processing of patient data. As a result, access to real-world clinical records remains highly restricted, even within academic research settings.\n\nSynthetic data generation has therefore emerged as a promising strategy to mitigate this structural limitation. We introduce a controlled pipeline for generating synthetic breast radiotherapy end-of-treatment reports (comptes rendus de fin de radioth\u00e9rapie mammaire) using locally deployed, instruction-tuned LLMs, without domain-specific fine-tuning. This design preserves patient privacy by construction while ensuring experimental reproducibility.\n\nEach report is generated under explicit clinical conditioning across predefined variables. 
Clinical plausibility and textual quality are assessed using lexical richness, inter-report diversity, and fluency metrics. In addition, reports undergo expert clinical review, and we are currently exploring evaluation via an LLM-as-a-judge framework to assess structural completeness and medical coherence.\n\nPreliminary experiments indicate encouraging levels of structural compliance and lexical variability. A larger-scale benchmark is currently underway. The resulting synthetic corpus is intended to support downstream clinical NLP applications in French, including named entity recognition (NER) and information extraction (IE).\n    <\/details>\n<\/li>\n\n\n<li>\n    A5 <b> Nickil Maveli <\/b> \n    <br>\n    <i>Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility<\/i>\n    <details><summary>[Abstract]<\/summary>\n    LLMs demonstrate strong performance on code benchmarks, yet round-trip code execution reveals limitations in their ability to maintain consistent reasoning across forward and backward execution. We present RoundTripCodeEval (RTCE), a comprehensive benchmark consisting of four distinct code execution reasoning tasks designed to rigorously test round-trip consistency. RTCE provides an execution-free, exact-match evaluation of bijection fidelity, assessing whether models preserve a consistent one-to-one mapping between encoding and decoding operations across various algorithms and directions. We systematically evaluate state-of-the-art Code-LLMs using zero-shot prompting, supervised fine-tuning on execution traces, and self-reflection mechanisms. Each yields modest improvements, but none closes the gap, indicating that current LLMs struggle with true round-trip consistency, which demonstrates that they lack the internal coherence required for trustworthy code reasoning. RTCE surfaces several new and previously unmeasured insights that are not captured by existing I\/O-prediction, execution-reasoning, or round-trip natural-language benchmarks.\n    <\/details>\n<\/li>\n\n<!----\n<li>\n    A6 <b> Markus Frohmann <\/b> \n    <br>\n    <i>Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation\n<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. 
Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at this https URL under the MIT license.\n\n    <\/details>\n<\/li>\n--->\n\n<li>\n    A7 <b> Krist\u00fdna Onderkov\u00e1 <\/b> \n    <br>\n    <i>Which Logic do LLMs Use?<\/i>\n    <details><summary>[Abstract]<\/summary>\n    The SemEval-2026 shared task 11 investigates how content interacts with formal reasoning in large language models (LLMs), by testing the validity of Aristotelian syllogisms that either align with or oppose commonsense knowledge. Our main system submission uses an LLM translation to a first-order logic (FOL) prover syntax. However, error analysis shows that simple adjustments to modern logic rules are not enough to reconcile Aristotelian and modern logic. This poster investigates how language models can follow different logical frameworks. By comparing the FOL-based prover, a heuristic Aristotelian pipeline, and zero-shot prompting, we test whether models genuinely follow the specified rules or default to unstated patterns learned during training.\n    <\/details>\n<\/li>\n\n\n<li>\n    A8 <b> Jeremias Bohn <\/b> \n    <br>\n    <i>Adaptive Base Logarithmic Quantisation<\/i>\n    <details><summary>[Abstract]<\/summary>\n    In recent years, large language models have grown significantly in size, making it difficult to run them for inference on consumer hardware, since the growth of GPU memory capacity has stagnated. A common approach is to quantise the model&#8217;s weights and\/or activations, thus reducing the memory requirements significantly. While most approaches resort to a linear quantisation codebook, this method is not optimal since precision gets lost in high-density regions of the weight distributions. Instead, we propose a logarithmic quantisation codebook with variable bases, which shows superior downstream task performance and perplexity compared to standard linear approaches, in particular for low bitwidth scenarios.\n    <\/details>\n<\/li>\n\n\n<li>\n    A9 <b> Ikram Belmadani <\/b> \n    <br>\n    <i>Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. 
Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.\n    <\/details>\n<\/li>\n\n\n<li>\n    A10 <b> Gabriel Oliveira Dos Santos <\/b> \n    <br>\n    <i>What do vision-language models see (or not) in the context? Investigating multimodal in-context learning<\/i>\n    <details><summary>[Abstract]<\/summary>\n    In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge,  we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations.\nOur results reveal that training on image\u2013text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.\n    <\/details>\n<\/li>\n\n\n<li>\n    A11 <b> Elke Vandermeerschen <\/b> \n    <br>\n    <i>SEED: Self-Explanation-Enhanced Distillation for Common Sense Reasoning in SLMs<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Improving the reasoning capabilities of small language models (SLMs) remains challenging, and existing approaches typically rely on large external teacher models or annotated explanation datasets. Such methods are costly, limit scalability, and often fail to capture reasoning quality directly. We introduce SEED (Self-Explanation-Enhanced Distillation for Common Sense Reasoning in SLMs), a self-improvement framework in which an SLM iteratively learns from its own generated natural language explanations. At each iteration, the model produces explanations alongside answers, after which candidate reasoning traces are filtered and reused as supervision for fine-tuning. Unlike prior self-training approaches that rely on consistency or confidence as proxies for correctness\u2014assumptions that are often violated in poorly calibrated SLMs\u2014SEED employs joint multi-signal filtering to assess explanation quality. Specifically, we combine epistemic signals (logit-based evidence strength, sequence-level entropy), semantic signals (natural language inference consistency), and robustness signals (Contrastive Explanation Invariance, CEI) to retain only high-quality reasoning trajectories.\nWithin SEED, explanations play multiple functional roles: they act as intermediate reasoning representations guiding prediction, as supervision targets during self-distillation, and as structured data for iterative training set construction. 
This unified use of explanations enables the model to bootstrap its reasoning capabilities from its own generated signals, without external supervision. \n\n    <\/details>\n<\/li>\n\n\n<li>\n    A12 <b> Doria Bonzi <\/b> \n    <br>\n    <i>CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.\n    <\/details>\n<\/li>\n\n\n<li>\n    A13 <b> Deborah Dore <\/b> \n    <br>\n    <i>Leveraging Graph Structural Knowledge for Argument Relation Prediction in Political Debates<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Argument Mining (AM) aims to detect argument structures in text, including premises, claims, and their support or attack relations. Political debates are a key application domain, as analyzing politicians\u2019 argumentation strategies can help identify fallacious or propagandist arguments. However, predicting relations between argument components remains challenging. Most existing approaches rely only on textual content and ignore structural information from the overall argument graph. In this paper, we address relation prediction by combining structural knowledge from a Knowledge Graph Embedding model with contextual knowledge from a fine-tuned Language Model. Experiments on a benchmark of US presidential debates (1960\u20132020) show that integrating textual and structural knowledge improves prediction accuracy over existing methods.\n    <\/details>\n<\/li>\n\n\n<li>\n    A14 <b> Anastasiia Vozniuk <\/b> \n    <br>\n    <i>Broken Benchmarks, Wrong Questions: A Critical View of AI-Text Detection<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Detection methods for AI-generated text routinely report accuracy above 95%, yet collapse under real-world perturbations such as paraphrasing or stylistic shifts. We show that this gap is partly explained by systematic quality problems in benchmark datasets, where surface artifacts make detection artificially easy. But beyond the technical failures lies a deeper issue: even a perfect detector would answer the wrong question. AI origin is a weak proxy for what actually motivates concern: is a text truthful, credible, and informative? 
We propose shifting the research agenda from detecting AI authorship toward directly assessing these content properties, which better address the underlying risks of AI-generated text.\n    <\/details>\n<\/li>\n\n\n<li>\n    A15 <b> Alla Chepurova <\/b> \n    <br>\n    <i>Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Knowledge graphs (KGs) provide structured, verifiable grounding for large language models (LLMs), but current LLM-based systems commonly use KGs as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. In this work, we propose Wikontic, a multi-stage pipeline that constructs KGs from open-domain text by extracting candidate triplets with qualifiers, enforcing Wikidata-based type and relation constraints, and normalizing entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well-connected; on MuSiQue, the correct answer entity appears in 96% of generated triplets. On HotpotQA, our triplets-only setup achieves 76.0 F1, and on MuSiQue 59.8 F1, matching or surpassing several retrieval-augmented generation baselines that still require textual context. In addition, Wikontic attains state-of-the-art information-retention performance on the MINE-1 benchmark (86%), outperforming prior KG construction methods. Wikontic is also efficient at build time: KG construction uses less than 1,000 output tokens, about 3\u00d7 fewer than AriGraph and <1\/20 of GraphRAG. The proposed pipeline enhances the quality of the generated KG and offers a scalable solution for leveraging structured knowledge in LLMs.\n    <\/details>\n<\/li>\n\n\n<li>\n    A16 <b> Alejandra Lorenzo <\/b> \n    <br>\n    <i>Privacy-Preserving Generation of Synthetic Pathology Reports for Information Extraction<\/i>\n    <details><summary>[Abstract]<\/summary>\n    A long-standing goal of the clinical NLP community is to extract relevant clinical variables from clinical text. However, progress has been limited by distribution shift from the general domain, the scarcity of publicly available annotated clinical data, and privacy constraints.\nWe propose a privacy-preserving method to generate synthetic data that simulate the information extraction task by associating LLM-generated pathology reports with thirteen variables commonly found in real reports for breast cancer patients. First, we generate synthetic tabular data for these variables and their possible values, comparing several tabular synthesizers and selecting PATE-CTGAN for its strong statistical fidelity and differential privacy guarantees. Second, we generate pathology reports using three different LLMs to maximize linguistic diversity, while conditioning generation on synthetic variable\u2013value sets.\nWe create synthetic report\/data pairs on which we fine-tune Mistral-7B-Instruct with LoRA-based supervised training. When evaluated on a manually validated benchmark of 377 real pathology reports and their associated variable-value pairs, the fine-tuned model substantially outperforms Mistral-7B-Instruct. 
These results show that high-quality synthetic data can effectively compensate for limited annotated clinical data while enabling accurate and privacy-preserving clinical information extraction.\n    <\/details>\n<\/li>\n\n\n<li>\n    A17 <b> Huy Hoang Ha <\/b> \n    <br>\n    <i>(tba)<\/i>\n    <details><summary>[Abstract]<\/summary>\n    (tba)\n    <\/details>\n<\/li>\n\n\n<li>\n    A18 <b> Emilio Raimond <\/b> \n    <br>\n    <i>(tba)<\/i>\n    <details><summary>[Abstract]<\/summary>\n    (tba)\n    <\/details>\n<\/li>\n\n\n<li>\n    A19 <b> Angelo Basile <\/b> \n    <br>\n    <i>PyRater: A Python Toolkit for Annotation Analysis<\/i>\n    <details><summary>[Abstract]<\/summary>\nIn this work, we build PyRater, an open-source Python library, to address the lack of accessible tools for probabilistic annotation analysis in NLP. Probabilistic models of annotation can jointly estimate gold standard labels, annotator reliability, and item difficulty, outperforming majority voting and standard agreement metrics like Cohen&#8217;s Kappa. Despite these advantages, they remain underused in the field, in part due to the absence of user-friendly implementations. PyRater provides a unified interface for several such models, along with built-in dataset readers, visualisation tools, and an API for adding new models.\nWe also present a novel application of these models to zero-shot prompt selection, where labeled development data is unavailable. In a zero-shot setting, prompt choice can significantly affect model performance, yet there is no straightforward way to identify the best prompt without supervision. By treating different prompts as repeated annotations over the same instances, PyRater can rank prompts and predict true labels, effectively acting as an unsupervised ensemble. We validate this on sentiment analysis datasets, where the probabilistic approach outperforms both majority voting and Kappa-based ranking.\n    <\/details>\n<\/li>\n\n<\/ul>\n<h3><span id=\"S2\">Session 2<\/span><\/h3>\n<ul>\n\n<li>\n    B1 <b> Zihao Li <\/b> \n    <br>\n    <i>Test-Time Scaling of Reasoning Models for Machine Translation<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model&#8217;s reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. 
These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications like multi-step, self-correction workflows and in conjunction with task-specialized models.\n    <\/details>\n<\/li>\n\n\n<li>\n    B2 <b> Xinhao Zhang <\/b> \n    <br>\n    <i>What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary optimization, collecting optimization trajectories for 15 LLMs across 8 optimization problems. While base problem-solving ability, measured via zero-shot performance, correlates with final optimization outcomes, it explains only part of the variance: models with similar zero-shot capability often induce dramatically different search trajectories and final performance. To explain this gap, we analyze breakthrough dynamics and the geometry of optimization trajectories in the semantic space of candidate solutions. We find that effective LLM optimizers behave as strong local refiners, progressively localizing their search while producing frequent, incremental improvements across generations. In contrast, weaker optimizers exhibit large semantic drift, with occasional large breakthroughs followed by prolonged stagnation, reminiscent of behavior observed in classical metaheuristics. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory-level evaluation for understanding and improving LLM-based agentic optimization systems, and provide actionable insights for future work on learning to search.\n    <\/details>\n<\/li>\n\n\n<li>\n    B3 <b> Thomas Palmeira Ferraz <\/b> \n    <br>\n    <i>Latent Reasoning in LLMs: Revisiting the Efficiency-Interpretability Trade-off<\/i>\n    <details><summary>[Abstract]<\/summary>\n    We investigate the emerging field of latent reasoning in large language models and discuss its relevance for reasoning and planning. The starting point is a key trade-off: Explicit chain-of-thought reasoning offers a visible reasoning trace, but it is slow, token-intensive, and may fail to faithfully reflect the model\u2019s true internal computation; Latent reasoning, by contrast, moves the reasoning process into hidden states, potentially enabling more efficient computation and broader exploration, at the cost of transparency. We review recent work on latent reasoning with a particular focus on two emerging lines of research: latent sequential reasoning (L-SEQ), where models produce sequences of latent intermediate states, and latent looped reasoning (L-LOOP), where reasoning is deepened through iterative internal computation. The goal is to better understand how these methods are compared to textual chain-of-thought, what kind of efficiency gains these approaches promise, what kind of tasks they appear to support, and how interpretability tools might help uncover more structured or symbolic forms of latent computation. 
We compare current evidence, and discuss open problems and next research directions.\n    <\/details>\n<\/li>\n\n<!---\n<li>\n    B4 <b> Reza Sanayei <\/b> \n    <br>\n    <i>Can LLMs Judge Debates? Evaluating Non-Linear Reasoning via Argumentation Theory Semantics<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Large Language Models (LLMs) excel at linear reasoning tasks but remain underexplored on non-linear structures such as those found in natural debates, which are best expressed as argument graphs. We evaluate whether LLMs can approximate structured reasoning from Computational Argumentation Theory (CAT). Specifically, we use Quantitative Argumentation Debate (QuAD) semantics, which assigns acceptability scores to arguments based on their attack and support relations. Given only dialogue-formatted debates from two NoDE datasets, models are prompted to rank arguments without access to the underlying graph. We test several LLMs under advanced instruction strategies, including Chain-of-Thought and In-Context Learning. While models show moderate alignment with QuAD rankings, performance degrades with longer inputs or disrupted discourse flow. Advanced prompting helps mitigate these effects by reducing biases related to argument length and position. Our findings highlight both the promise and limitations of LLMs in modeling formal argumentation semantics and motivate future work on graphaware reasoning.\n    <\/details>\n<\/li>\n--->\n\n<li>\n    B5 <b> Nazanin Shafiabadi <\/b> \n    <br>\n    <i>Biases in Translation: Assessing Opinion Distortion in Machine Translated Texts<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Current machine translation (MT) evaluation practices largely assume that high lexical and semantic fidelity implies preservation of meaning. We question this assumption by introducing a framework for detecting and quantifying translation-induced distortion\u2014the systematic alteration of a text\u2019s subjective properties during translation. Focusing on stance as a socially consequential property, we formalize stance preservation as an invariance problem and adapt two classical statistical tests, McNemar\u2019s test and the two-proportion Z-test, to diagnose systematic opinion shifts between source texts and their translations. Unlike standard MT metrics such as BLEU or COMET, which prioritize surface similarity and adequacy, our approach explicitly targets preservation of subjective meaning. In controlled experiments with synthetically distorted translations, we demonstrate that the proposed tests are sensitive to graded levels of stance manipulation. We apply our framework to evaluate twelve multilingual models and find that none reliably preserve stance across all tested language directions. Our findings reveal a critical gap in current MT evaluation practices and highlight the need for explicit evaluation of subjective meaning preservation in socially and politically sensitive contexts.\n    <\/details>\n<\/li>\n\n\n<li>\n    B6 <b> Marie Dewulf <\/b> \n    <br>\n    <i>Webcare through the Eyes of the Bystander: A cross-linguistic comparison of pragmatic-rhetorical features in hotel review-response interactions<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Webcare, as a manifestation of digital reputation management, has become ubiquitous within the tourism industry. The significance of this online customer service communication, accessible to all, cannot be overstated. 
It demonstrates a commitment to guest satisfaction, thereby positively influencing the hotel&#8217;s image. Although recent studies suggest that guest reviews and hotel responses are influenced by cultural factors, cross-cultural analyses of hotel interactions remain scarce in terms of the languages and cultures investigated. Therefore, the objective of this project is to conduct a cross-linguistic study of a multilingual corpus consisting of approximately 50,000 hotel reviews and their corresponding responses in German, French, English (UK\/US), Italian, Dutch, and Spanish (ES\/MX). Specifically, this project aims to explore the cross-linguistic characteristics of hotel interactions in L1. We will use NLP techniques, such as sentiment analysis, to obtain a quantitative overview of the corpus across the 8 different cultures. The knowledge gained from this research will present opportunities for the development of generative AI systems that can automatically craft responses tailored to the linguistic and cultural context.\n    <\/details>\n<\/li>\n\n\n<li>\n    B7 <b> Karima Kadaoui <\/b> \n    <br>\n    <i>All For One: A Multilinguality Quest to Assist Low Resource Sign Languages<\/i>\n    <details><summary>[Abstract]<\/summary>\n    [Work in progress] Sign Language Recognition efforts are hindered by a lack of data. The need for data is especially dire when learning from Sign Language videos, given the inter- and intra-signer variability, and the presence of linguistically irrelevant information making it harder to generalize. To alleviate this issue, we take inspiration from the Speech domain and explore the use of SignWriting similarly to phones, i.e. as a discrete and language-agnostic intermediate representation, in an effort to reduce the complexity of the input video modality. Doing this would allow us to make use of higher-resource sign languages (e.g. American Sign Language) to improve performance on lower-resource ones (e.g. Emirati Sign Language).\n    <\/details>\n<\/li>\n\n\n<li>\n    B8 <b> Jason Chan <\/b> \n    <br>\n    <i>Explanation Generation for Reconciling Contradictions with LLMs<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human intelligence is the ability to hypothesise explanations to reconcile apparently contradictory observations. Despite growing research into LLMs\u2019 reasoning capabilities, their ability to generate such reconciliatory explanations remains underexplored. We address this gap by introducing a novel task, repurposing existing natural language inference datasets, and proposing metrics that enable scalable automatic evaluation. Our experiments show that, even with extended test-time compute, most LLMs struggle to generate successful explanations in reconciling contradictions, highlighting the need for future work to address this limitation.\n    <\/details>\n<\/li>\n\n<!---\n<li>\n    B9 <b> Houdna Khilouf <\/b> \n    <br>\n    <i>DziriMTL: Low-Rank Adaptation for Multi-Task Learning in Algerian Arabic NLP<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Automatic processing of the Algerian Arabic dialect remains a challenging task due to the scarcity of annotated linguistic resources and the high linguistic variability of the dialect. 
Algerian Arabic exhibits complex characteristics, including code-switching with Modern Standard Arabic (MSA) and French, orthographic inconsistencies, and strong regional variations, which further complicate the development of robust Natural Language Processing (NLP) systems.\nRecent advances in large language models (LLMs) have significantly improved performance across a wide range of NLP tasks. These models learn rich linguistic representations by training billions of parameters on large-scale textual corpora and are typically adapted to downstream tasks through the use of pre-trained language models. However, adaptation is commonly performed through full fine-tuning, which requires updating all parameters of the model. This approach becomes prohibitively expensive in terms of computational and memory requirements, particularly when adapting a model to multiple tasks simultaneously.\nTo address this limitation, parameter-efficient adaptation techniques have recently gained increasing attention. One effective method is LoRA, which freezes the weights of the pre-trained model and injects trainable low-rank decomposition matrices into the Transformer layers. This approach significantly reduces the number of trainable parameters while preserving the knowledge encoded in the original model.\nIn this study, we propose a parameter-efficient multi-task learning framework for several text classification tasks in Algerian Arabic dialect. Our approach adapts the pre-trained model DziriBERT using LoRA to efficiently support multiple downstream tasks, including sentiment analysis, emotion detection, spam detection,  topic classification and fake news detection.\n    <\/details>\n<\/li>\n--->\n\n<li>\n    B10 <b> Filippo Tonini <\/b> \n    <br>\n    <i>Super-additive Cooperation in Language Model Agents<\/i>\n    <details><summary>[Abstract]<\/summary>\n    With the prospect of autonomous artificial intelligence (AI) agents, studying their tendency for cooperative behavior becomes an increasingly relevant topic. This study is inspired by the super-additive cooperation theory, where the combined effects of repeated interactions and inter-group rivalry have been argued to be the cause for cooperative tendencies found in humans. We devised a virtual tournament where language model agents, grouped into teams, face each other in a Prisoner&#8217;s Dilemma game. By simulating both internal team dynamics and external competition, we discovered that this blend substantially boosts both overall and initial, one-shot cooperation levels (the tendency to cooperate in one-off interactions). This research provides a novel framework for large language models to strategize and act in complex social scenarios and offers evidence for how intergroup competition can, counter-intuitively, result in more cooperative behavior. These insights are crucial for designing future multi-agent AI systems that can effectively work together and better align with human values.\n    <\/details>\n<\/li>\n\n\n<li>\n    B11 <b> Eleni Gkovedarou <\/b> \n    <br>\n    <i>ReGender: Gender-Fair Rewriter for English-to-Greek Machine Translation<\/i>\n    <details><summary>[Abstract]<\/summary>\n    The use of gender-fair language can lead to a more inclusive society, yet machine translation (MT) systems frequently reproduce and amplify gender bias. 
Some of this bias is due to inherent ambiguities in the source: English largely lacks grammatical gender marking, whereas Greek requires morphological and semantic gender specifications, forcing MT systems to resolve ambiguity in ways that default to gendered (and often biased) outputs. This research explores gender-fair rewriting as a strategy for bias mitigation for English-to-Greek MT, a language pair that remains highly understudied. We propose a twofold approach: ReGender, a system that first detects gender ambiguity in the English source text and then generates a set of gender-fair Greek translations for the ambiguous cases, including gendered, gender-neutral, and gender-inclusive forms. Through a human-centered design, the project combines NLP methods with community-informed gender-fair language practices that go beyond the gender binary. The resulting model will take the form of a plug-in that can be integrated into existing MT systems, enabling users to make informed translation choices while promoting the broader goal of inclusive language technologies.\n    <\/details>\n<\/li>\n\n\n<li>\n    B12 <b> Dominik Seip <\/b> \n    <br>\n    <i>Preference Redirection via Attention Concentration: An Attack on Multimodal Agents<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Advancements in multimodal foundation models have enabled the development of Computer Use Agents (CUAs) capable of autonomously interacting with GUI environments. As CUAs are not restricted to certain tools, they make it possible to automate more complex agentic tasks, but at the same time open up new security vulnerabilities. While prior work has concentrated on the language modality, the vulnerability of the vision modality has received less attention. In this work, we introduce PRAC, a novel attack that, unlike prior methods targeting the VLM output directly, manipulates the model&#8217;s internal preferences by redirecting its attention toward a stealthy adversarial patch. We show that PRAC is able to manipulate the selection process of a CUA on an online shopping platform towards a chosen target product. While we require white-box access to the model for the creation of the attack, we show that our attack generalizes to fine-tuned versions of the same model, presenting a critical threat as multiple companies build specific CUAs based on open-weights models.\n    <\/details>\n<\/li>\n\n\n<li>\n    B13 <b> Daria Galimzianova <\/b> \n    <br>\n    <i>Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions &#8212; whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training.\nUsing EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. 
Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o\u2019s retrieval behavior.\n    <\/details>\n<\/li>\n\n\n<li>\n    B14 <b> Amirbek Djanibekov <\/b> \n    <br>\n    <i>SPIRIT: Patching Speech Language Models against Jailbreak Attacks<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Speech Language Models (SLMs) enable natural interactions via spoken instructions, which more effectively capture user intent by detecting nuances in speech. The richer speech signal introduces new security risks compared to text-based models, as adversaries can better bypass safety mechanisms by injecting imperceptible noise into speech. We analyze adversarial attacks and find that SLMs are substantially more vulnerable to jailbreak attacks, which can achieve a perfect 100% attack success rate in some instances. To improve security, we propose post-hoc patching defenses that intervene during inference by modifying the SLM&#8217;s activations, improving robustness by up to 99% with (i) negligible impact on utility and (ii) no re-training. We conduct ablation studies to maximize the efficacy of our defenses and improve the utility\/security trade-off, validated with large-scale benchmarks unique to SLMs.\n    <\/details>\n<\/li>\n\n\n<li>\n    B15 <b> Alexander Shabalin <\/b> \n    <br>\n    <i>Cosmos: Compressed and Smooth Latent Space for Text Diffusion Modeling\n<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by 8\u00d7 while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks, including story generation, question generation, summarization, and detoxification, and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than 2\u00d7 faster inference.\n    <\/details>\n<\/li>\n\n\n<li>\n    B16 <b> Abdelrahman Sadallah <\/b> \n    <br>\n    <i>The Good, the Bad and the Constructive: Automatically Measuring Peer Review&#8217;s Utility for Authors<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. 
To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding &#038; Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.\n    <\/details>\n<\/li>\n\n\n<li>\n    B17 <b> Gaganpreet Jhajj <\/b> \n    <br>\n    <i>An Empirical Study of Many-Shot In-Context Learning for Machine Translation of Low-Resource Languages<\/i>\n    <details><summary>[Abstract]<\/summary>\nIn-context learning (ICL) enables LLMs to adapt to new tasks from a few examples, making it attractive for low-resource languages. Recent many-shot ICL work shows that larger context windows can further boost performance, but gains depend on example selection, and inference costs can be prohibitive. We present an empirical study of many-shot ICL for English-to-low-resource machine translation across ten languages recently added to FLORES+, examining retrieval-based example selection, out-of-domain data, and length-based ordering. Our results show that many-shot ICL improves with more examples, and that BM25-based retrieval greatly enhances data efficiency: 50 retrieved examples roughly match 250 random ones, while 250 retrieved examples rival 1,000 random ones.\n    <\/details>\n<\/li>\n\n\n<li>\n    B18 <b> Briag Rehel <\/b> \n    <br>\n    <i>RAG Controller<\/i>\n    <details><summary>[Abstract]<\/summary>\nRetrieval-Augmented Generation (RAG) has become the go-to architecture in industry applications. Many variants and components have been developed, making the selection of an effective configuration a challenging optimization problem in low-data industry use cases. \nWe hypothesize that adapting RAG configurations at the query level, rather than at the use-case level, improves transferability. To this end, we train a controller using offline reinforcement learning to jointly control four RAG inference parameters.\n    <\/details>\n<\/li>\n\n<li>\n    B19 <b> Laurits Lyngbaek <\/b> \n    <br>\n    <i>Statistical Probes of Multilingual Embeddings Fail to Generalize Across Language Learner Corpora<\/i>\n    <details><summary>[Abstract]<\/summary>\n    <\/details>\n<\/li>\n\n<\/ul>\n<h3><span id=\"S3\">Session 3<\/span><\/h3>\n<ul>\n\n<li>\n    C1 <b> Yusser Al Ghussin <\/b> \n    <br>\n    <i>Steering Multilingual Models Towards Cultural Knowledge<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Prior work provides mechanistic evidence that multilingual LLMs encode cultural information in representations that overlap and interact with language-specific components (Namazifard and Poech, 2025), suggesting that intervening on language-aligned directions may also modulate culturally relevant behavior. 
Motivated by this, our system uses activation steering: instead of optimizing model parameters through fine-tuning, we modify internal activations at inference time using steering vectors (Rimsky et al., 2024). Concretely, we extract language steering vectors and inject them into the residual stream of multilingual LLMs during generation. We build on evidence that language identity is encoded as a stable direction in activation space (Marks and Tegmark, 2023), and hypothesize that steering along such directions can improve access to culturally relevant knowledge.\n    <\/details>\n<\/li>\n\n\n<li>\n    C2 <b> Trung Hieu Ngo <\/b> \n    <br>\n    <i>Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Large Language Models (LLMs) excel in Natural Language Processing (NLP) tasks, but they often propagate biases embedded in their training data, a risk that is especially consequential in sensitive domains like healthcare. While existing benchmarks evaluate biases related to individual social determinants of health (SDoH) such as gender or ethnicity, they often overlook interactions between these factors and lack context-specific assessments. This study investigates bias in LLMs by probing the relationships between gender and other SDoH in French patient records. Through a series of experiments, we found that embedded stereotypes can be probed using SDoH inputs and that LLMs rely on them to make gendered decisions, suggesting that evaluating interactions among SDoH factors could usefully complement existing approaches to assessing LLM performance and bias.\n    <\/details>\n<\/li>\n\n\n<li>\n    C3 <b> Tenney Hu <\/b> \n    <br>\n    <i>DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Existing RAG systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information-seeking scenarios with multiple plausible answers, where diversity is essential to avoid collapsing to a single dominant response, thereby constraining creativity and compromising fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug-and-play agentic RAG framework with novel reflection-guided generation and memory-augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity-quality trade-off in open-ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity-quality trade-off compared to competitive baselines and previous state-of-the-art methods on the real-world Infinity-Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM-based systems for open-ended information-seeking and show that explicitly modeling diversity can mitigate it. 
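[Illustrative aside, not the paper\u2019s metric: one minimal diversity score of this kind is the mean pairwise cosine distance between embedded answers. The random embeddings below are hypothetical stand-ins for sentence-encoder outputs.]
<pre>
import numpy as np

# Five generated answers represented by 32-dim embeddings (random stand-ins).
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 32))
E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows

sims = E @ E.T                      # pairwise cosine similarities
iu = np.triu_indices(len(E), k=1)   # indices of unique unordered pairs
diversity = float(np.mean(1.0 - sims[iu]))
print(round(diversity, 3))          # higher means more diverse answers
</pre>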
\n    <\/details>\n<\/li>\n\n\n<li>\n    C4 <b> Patr\u00edcia Schmidtov\u00e1 <\/b> \n    <br>\n    <i>How Important is \u2018Perfect\u2019 English for Machine Translation Prompts?<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Large language models (LLMs) show state-of-the-art performance in machine translation, but are also known to be sensitive to errors in user prompts. Given that these models are largely trained on and respond best to prompts in standard English, this may affect the quality of LLM outputs for second language English speakers as well as real-world lay users, with potentially disproportionate effects on the former. We explore this effect by modeling a range of error types exhibited by such users, motivated by studies of L2 English, and quantifying their impact on LLM performance. We work with two related tasks: machine translation and machine translation evaluation. We find that LLMs-as-MT are brittle to natural spelling errors but not to phrasal simplifications. However, the quality drop caused by these errors is lower than the variance over the initial prompt choice, suggesting that \u201cperfect English\u201d for a given prompt is less important than choosing a good prompt. Since lay users and L2 speakers may use non-optimal prompts as well as display imperfect language skills, our work calls for increasing the resilience of model performance to both these phenomena, in order to best serve a diverse user base, from both a robustness and a fairness perspective.\n    <\/details>\n<\/li>\n\n\n<li>\n    C5 <b> Michelle Elizabeth <\/b> \n    <br>\n    <i>Conversational Grounding in LLMs: Evaluation Methods, Challenges and Future Directions<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Conversational grounding is the collaborative process through which speakers establish and maintain mutual understanding. It is essential for the success of a dialogue. While it is an inherent property of human conversations, it remains a challenge for instruction-following Large Language Models (LLMs). This paper surveys how conversational grounding, from a psycholinguistic perspective, is evaluated in task-oriented dialogue in the current era of LLMs. The literature is organised by how grounding is modelled &#8211; dialogue-act-based methods versus approaches that model the participants\u2019 mental state. We also review collaborative tasks that enable the evaluation of grounding at the conversation level based on outcomes. We highlight the limitations of current metrics and outline research directions in grounding evaluation.\n    <\/details>\n<\/li>\n\n\n<li>\n    C7 <b> Joanna Radola <\/b> \n    <br>\n    <i>Plus d&#8217;une langue! Language Identification for Code-Switched Utterances<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Automatic identification of Code-Switched (CS) utterances remains a challenge for language identification (LID) systems, causing such texts to be underrepresented in the training data of Large Language Models. In this paper, we revisit MaskLID, a state-of-the-art approach for CS identification, which requires no training and detects arbitrary language combinations. 
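[Illustrative aside, not the paper\u2019s formulation: one natural integer-program view of CS identification is to select the smallest set of languages whose word-level LID scores jointly explain every word. The PuLP toy below uses hypothetical words, scores, and threshold.]
<pre>
# Toy set-cover style ILP: pick the fewest languages that explain a
# code-switched sentence from word-level LID scores (all numbers made up).
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

words = ['plus', 'une', 'langue', 'please']
langs = ['fra', 'eng']
score = {                     # hypothetical word-level LID scores
    'plus':   {'fra': 0.7, 'eng': 0.6},
    'une':    {'fra': 0.9, 'eng': 0.1},
    'langue': {'fra': 0.8, 'eng': 0.2},
    'please': {'fra': 0.1, 'eng': 0.9},
}
tau = 0.5                     # a word counts as 'explained' above this score

prob = LpProblem('lang_selection', LpMinimize)
use = {g: LpVariable('use_' + g, cat=LpBinary) for g in langs}
prob += lpSum(use.values())   # objective: as few languages as possible
for w in words:               # every word needs one explaining language
    prob += lpSum(use[g] for g in langs if score[w][g] >= tau) >= 1

prob.solve()
print(sorted(g for g in langs if use[g].value() == 1))  # ['eng', 'fra']
</pre>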
We make three main contributions: (a) we reformulate the underlying algorithm as an Integer Linear Program, with clear and interpretable constraints; (b) our experiments with 10 languages match MaskLID&#8217;s results and highlight a major issue with the underlying LID; (c) using an improved model delivering better word-level scores, we achieve results that outperform MaskLID on both monolingual and CS datasets.\n    <\/details>\n<\/li>\n\n\n<li>\n    C8 <b> Iuliia Belikova <\/b> \n    <br>\n    <i>Detecting Overflow in Compressed Token Representations for RAG<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility \u2014 and when compression begins to erase task-relevant content \u2014 remain underexplored. In this paper, we define \u201ctoken overflow\u201d as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, query-agnostic saturation statistics effectively identify compressed tokens but show limited capability in detecting overflow. Conversely, lightweight query-aware probing classifiers successfully detect overflow across multiple standard QA datasets. This advancement toward query-aware detection enables efficient pre-LLM gating to mitigate compression-induced errors.\n    <\/details>\n<\/li>\n\n\n<li>\n    C9 <b> Hawau Olamide Toyin <\/b> \n    <br>\n    <i>A summary of efforts towards building truly adaptable speech technology for stuttered speech<\/i>\n    <details><summary>[Abstract]<\/summary>\n    [Work in progress] Speech technologies often underperform for people who stutter due to limited atypical-speech data, scarce expert annotation, and a lack of alignment between stakeholder needs and research objectives. This poster summarizes an ongoing effort toward multimodal stuttering severity assessment. We take a multi-pronged approach: (1) a stakeholder-focused study combining interviews, questionnaire surveys (70 respondents), and a literature review (200+ papers) to identify gaps between what people who stutter (PWS) and speech-language pathologists (SLPs) need and what current systems optimize for; (2) benchmarking ASR for atypical speech under two practical transcription objectives\u2014verbatim versus intended transcripts\u2014to highlight how model behavior changes depending on whether disfluencies are preserved or normalized; and (3) an ongoing multilingual, multimodal data collection effort in collaboration with clinical partners that includes SLP-provided clinical annotations.\n    <\/details>\n<\/li>\n\n\n<li>\n    C10 <b> Erwan Fagnou <\/b> \n    <br>\n    <i>Chain and Causal Attention for Efficient Entity Tracking<\/i>\n    <details><summary>[Abstract]<\/summary>\n    This paper investigates the limitations of transformers for entity-tracking tasks in large language models. We identify a theoretical constraint, showing that transformers require at least log\u2082(n+1) layers to handle entity tracking with n state changes. 
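[Illustrative aside, a numeric sanity check rather than the paper\u2019s proof: if each layer can at best compose already-resolved dependency chains pairwise, the resolvable chain length at most doubles per layer, so n = 20 updates need ceil(log\u2082(21)) = 5 compositions.]
<pre>
import math
import numpy as np

n = 20                                    # number of entity state changes
A = np.zeros((n + 1, n + 1), dtype=bool)
for i in range(n):
    A[i, i + 1] = True                    # a chain of n successive updates

reach = A.copy()
layers = 0
while not reach[0, n]:                    # until all n updates are composed
    paths = reach.astype(int) @ reach.astype(int)
    reach = reach | paths.astype(bool)    # chain length at most doubles
    layers += 1

print(layers, math.ceil(math.log2(n + 1)))  # prints: 5 5
</pre>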
To address this issue, we propose an efficient and frugal enhancement to the standard attention mechanism, enabling it to manage long-term dependencies more efficiently. By considering attention as an adjacency matrix, our model can track entity states with a single layer. Empirical results demonstrate significant improvements on entity-tracking datasets while keeping competitive performance on standard natural language modeling. Our modified attention allows us to achieve the same performance with drastically fewer layers. Additionally, our enhanced mechanism reveals structured internal representations of attention. Extensive experiments on both toy and complex datasets validate our approach. Our contributions include theoretical insights, an improved attention mechanism, and empirical validation.\n    <\/details>\n<\/li>\n\n\n<li>\n    C11 <b> Elena Golimblevskaia <\/b> \n    <br>\n    <i>WeightLens: Input-Independent Interpretability for LLM Transcoders<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Existing automated interpretability methods for Large Language Models (LLMs) often infer feature meanings by analyzing activations with another LLM, but they suffer from high computational cost, dataset dependence, prompt sensitivity, and explainer bias. Transcoders provide a promising direction for automated interpretability, since their architecture allows separating input-dependent and input-invariant components of feature attributions. We investigate whether analyzing only the input-invariant component (weights) yields meaningful interpretations for token-based features, reducing reliance on external explainers, and develop the WeightLens framework. Experiments show that in the chosen setting, WeightLens performs comparably to, or even better than, activation-based methods, suggesting a low-cost complement to existing approaches.\n    <\/details>\n<\/li>\n\n\n<li>\n    C12 <b> Dina Pisarevskaya <\/b> \n    <br>\n    <i>Claim Matching with Instruction-following LLMs for Automated Fact-checking<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Claim matching (CM), as a Natural Language Processing task, can benefit an automated fact-checking pipeline by putting together claims that can be resolved with the same fact-check. We explore zero-shot and few-shot learning approaches to CM as a binary classification task and experiment with instruction-following LLMs, investigating prompt templates. CM can be tackled by leveraging more mature yet similar tasks such as natural language inference or paraphrase detection. We present a novel agent-based approach to CM: a two-step pipeline that first generates prompts with LLMs and then performs claim matching as a binary classification task with LLMs. We demonstrate the effectiveness of using different LLMs for each step of the pipeline, i.e., using an LLM for prompt generation, and another for claim matching. We reveal insights into the LLMs\u2019 understanding and handling of the CM task.\n    <\/details>\n<\/li>\n\n\n<li>\n    C13 <b> Arnisa Fazla <\/b> \n    <br>\n    <i>How Certain Is Uncertainty? A Benchmark for Confidence, Calibration, and Failure Modes in Clinical VQA<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Clinical vision-language model (VLM) evaluation often focuses on accuracy on visual question-answering (VQA) datasets, yet real-world clinical use additionally requires reliable uncertainty estimation (UE) to identify cases requiring clinician review. 
We present the first comprehensive benchmark of eight post-hoc UE methods across thirteen VLMs spanning five model families. Using controlled &#8220;None of the Above&#8221; (NOTA) perturbations to the answer options, we show that replacing the correct answer with NOTA unexpectedly increases model confidence while degrading accuracy. We further show that initial uncertainty estimates predict answer instability under NOTA perturbations, revealing a meaningful link between uncertainty and robustness to answer-space shifts.\n    <\/details>\n<\/li>\n\n\n<li>\n    C14 <b> Amanda Le <\/b> \n    <br>\n    <i>Causal Graph-Based Models for Lossless Explanations<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Large Language Models offer significant potential for conversational data analytics across diverse domains. However, deploying them in critical areas is limited by the lack of trustworthy explanations that accurately reflect their internal computations. In this work, we propose a causal graph-based model to provide lossless explanations. This model leverages the theory of causal abstraction to capture both task- and instance-specific LLM behavior. Our approach constructs a compressed representation of a neural network by learning low-dimensional abstractions of internal activations alongside causally constrained transition mechanisms. This results in an explanatory graph that reveals traversable computational pathways and supports causal interventions.\n    <\/details>\n<\/li>\n\n\n<li>\n    C15 <b> Alex Jiang <\/b> \n    <br>\n    <i>Automatic detection of bot-generated content for disinformation purposes<\/i>\n    <details><summary>[Abstract]<\/summary>\n    Large language models (LLMs) such as GPT, Claude, LLaMA or Mistral have transformed text generation by producing artificial content that is credible, fluent and contextually relevant. While they simplify our daily tasks, the rise of LLMs has led to issues in many fields (e.g., academic research, education, fake news, social media).\n\nCurrent detection tools struggle to keep up with this rapid evolution, especially against new LLMs or domains different from those they were trained on. Often built to deal with simple generated texts, they lack robustness against basic evasive strategies such as paraphrasing and back-translation.\n    <\/details>\n<\/li>\n\n<li>\n    C17 <b> Filip Boltuzic <\/b> \n    <br>\n    <i>Automated Consolidation of Legal Amendments into Temporal Knowledge Graphs<\/i>\n    <details><summary>[Abstract]<\/summary>\n    We introduce an automated pipeline that transforms sequences of legal amendments into temporal knowledge graphs, preserving both structural dependencies and versioned states of legal norms. This representation supports precise reconstruction of the law at any point in time and facilitates graph-based retrieval and reasoning over evolving legal corpora.\n    <\/details>\n<\/li>\n\n\n<li>\n    C18 <b> AriaRay Brown <\/b> \n    <br>\n    <i>Listening for Speaker Identity in Multilingual Speech Models: How Phonetic Similarity Modulates Cross-Lingual Transfer<\/i>\n    <details><summary>[Abstract]<\/summary>\nMultilingual self-supervised transformer speech models build shared representations to encode speech from multiple languages. 
While these models learn features and perform well in downstream tasks, it remains unclear (1) how such features are represented similarly or uniquely for all languages, and (2) what might account for any differences. This research investigates how speaker and phonetic information is represented in the multilingual speech models XLS-R and WavLabLM for ten phylogenetically dispersed languages. Motivated by speech perception research linking human ease in speaker discriminability to phonetic similarity (the Language Familiarity Effect), this work explores whether phonetic similarity influences the success of cross-lingual transfer for speaker verification. Accordingly, we train probing classifiers on embedded utterance pairs in each language and test speaker verification on target languages. Experiments show that speaker information is represented similarly across model layers and languages, but cross-lingual performance varies widely. To examine this, we turn to phonetic insight. We introduce a new method to measure phonetic similarity by calculating the distance between phonetic profiles, based on distributions of contextualized phone embeddings per language, extracted from a distinct phonetic layer. Overall, we find that phonetic similarity tends to significantly improve the accuracy of cross-lingual speaker verification, while this correlation varies for individual test languages.\n    <\/details>\n<\/li>\n\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>CET Sunday 29\/03 Monday 30\/03 Tuesday 31\/03 Wednesday 1\/04 Thursday 2\/04 Friday 3\/04 8-9 Breakfast Nature activities Breakfast Breakfast Breakfast 9-10 Welcome session Poster session 3 (9.30) Julia Kreutzer 10-11 Roger K.Moore Fran\u00e7ois Yvon 11-12 Carlos Ramisch Manon Scholivet Zen Research 3\/\/4 12-13 Lunch Lunch Lunch Lunch (12.30) Lunch 13-14 Poster session 1 Carlos Ramisch &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/lig-alps.imag.fr\/index.php\/schedule\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Schedule&#8221;<\/span><\/a><\/p>\n","protected":false},"author":3,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-221","page","type-page","status-publish","hentry","entry"],"_links":{"self":[{"href":"https:\/\/lig-alps.imag.fr\/index.php\/wp-json\/wp\/v2\/pages\/221","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lig-alps.imag.fr\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/lig-alps.imag.fr\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/lig-alps.imag.fr\/index.php\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/lig-alps.imag.fr\/index.php\/wp-json\/wp\/v2\/comments?post=221"}],"version-history":[{"count":102,"href":"https:\/\/lig-alps.imag.fr\/index.php\/wp-json\/wp\/v2\/pages\/221\/revisions"}],"predecessor-version":[{"id":977,"href":"https:\/\/lig-alps.imag.fr\/index.php\/wp-json\/wp\/v2\/pages\/221\/revisions\/977"}],"wp:attachment":[{"href":"https:\/\/lig-alps.imag.fr\/index.php\/wp-json\/wp\/v2\/media?parent=221"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}