Seminar

Tools based on LLMs for software development

28/Oct/2025

Speaker:

Antonio Lobo

Institution:

IIIA-CSIC

Language:

EN

Type:

Attending seminar

Description:

The field of software engineering is undergoing a gradual shift as large language models evolve from assistive tools to more autonomous agents. This seminar offers a balanced overview of that transition, with attention to practical capabilities, known limitations, and architectural patterns that support dependable agent-assisted development.

We begin with a simple question that matters for real work: how long can an AI system carry a task without drifting? Recent studies introduce a task-completion time horizon to capture this idea. We will see why long sequences fail even when single steps look competent, how small errors compound through self-conditioning, and why sheer scale does not remove these failure modes. The goal is a practical intuition you can use when planning workflows.
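A minimal back-of-the-envelope sketch (ours, not taken from the cited papers) makes the compounding argument concrete: if each step succeeds independently with probability p, a task of n sequential steps succeeds with probability p^n, and the 50% horizon shrinks quickly as p falls. The independence assumption is a deliberate simplification; the studies above show real errors behave worse because of self-conditioning.

```python
import math

# Toy model: each step succeeds independently with probability p_step,
# so an n-step task succeeds with probability p_step ** n_steps.
def task_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

# Largest number of steps completed with at least `target` reliability.
def horizon(p_step: float, target: float = 0.5) -> int:
    return int(math.log(target) / math.log(p_step))

for p in (0.90, 0.99, 0.999):
    print(f"p={p}: 100-step success {task_success(p, 100):.3g}, "
          f"50% horizon {horizon(p)} steps")
# p=0.90 -> 100-step success ~2.7e-05, 50% horizon 6 steps
# p=0.99 -> 100-step success ~0.366, 50% horizon 68 steps
```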

The central part presents a straightforward way to build with agents: break a complex job into smaller pieces, assign each piece to a focused subagent with a clear skill and the tools it needs, and control how information moves between them. We cover prompting practice, context management, and the Model Context Protocol, and we connect each element to patterns that teams can adopt with modest effort.
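As an illustration of the pattern (all names here are hypothetical, not a specific framework's API), a minimal orchestrator might look like the sketch below: each subagent carries one focused system prompt and an explicit tool list, and the orchestrator decides exactly what context flows between steps.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch of the decomposition pattern: one narrow skill and
# an explicit toolset per subagent; `run_llm` stands in for any model call.
@dataclass
class Subagent:
    name: str
    system_prompt: str                                # the focused skill
    tools: list[str] = field(default_factory=list)    # minimal, explicit

    def run(self, task: str, run_llm: Callable[[str, str], str]) -> str:
        # The subagent sees only its own instructions and its task,
        # never the whole conversation: that is the context control.
        return run_llm(self.system_prompt, task)

def orchestrate(plan: list[tuple[Subagent, str]],
                run_llm: Callable[[str, str], str]) -> list[str]:
    results: list[str] = []
    for agent, subtask in plan:
        # Pass forward only the previous result, not accumulated history.
        handoff = results[-1] if results else ""
        results.append(agent.run(f"{subtask}\n\nInput:\n{handoff}", run_llm))
    return results
```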

Two hands on sessions make the ideas concrete. First, we extend an existing repository to add a narrowly scoped agent, with logging and inspection so that its behaviour remains visible and auditable. Second, we build a compact dashboard from scratch to analyse tweets, from data ingestion to light processing and interactive views. Both practicals emphasise clear interfaces, small tests, and failure cases that are easy to diagnose.
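For the first practical, the logging idea can be previewed with a sketch like the following (illustrative only; the session follows the repository's own conventions): every tool the agent may call is wrapped so its arguments, outcome, and duration land in an audit log.

```python
import json
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("agent-audit")

def audited(tool):
    """Wrap an agent-callable tool so every call is logged and auditable."""
    @wraps(tool)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        record = {"tool": tool.__name__, "args": repr((args, kwargs))}
        try:
            result = tool(*args, **kwargs)
            record.update(ok=True, secs=round(time.monotonic() - start, 3))
            return result
        except Exception as exc:
            record.update(ok=False, error=str(exc))
            raise
        finally:
            log.info(json.dumps(record))
    return wrapper

@audited
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()
```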

We then look at what changes in developer routines. Evidence from code review shows shorter turnaround with AI support in some settings, but the effect depends on context and practice. We compare large-sample observational work with field studies and draw cautious conclusions about when agents help and where they may add friction.

Finally, we discuss when smaller models are the right tool. Small language models can be efficient for narrow and repetitive tasks when evaluation is thorough and feedback loops are short. We also set out limits that matter day to day. Long inputs can degrade reasoning. Models can produce unsupported statements. Security and bias require ongoing attention. We close with rules of thumb that favour reliable, observable systems over optimistic promises.
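As a sketch of what "thorough evaluation with short feedback loops" can mean in practice (the cases and threshold below are placeholder assumptions, not seminar material): keep a fixed, labelled test set and re-run it on every change to the model or prompt, so regressions surface immediately.

```python
# Placeholder cases: in practice these come from real task data.
CASES = [
    ("Classify sentiment: 'great talk!'", "positive"),
    ("Classify sentiment: 'waste of time'", "negative"),
]

def evaluate(model, cases=CASES, threshold=1.0):
    """Run every labelled case through `model` (any prompt -> str callable)
    and fail loudly if accuracy drops below the threshold."""
    failures = []
    for prompt, expected in cases:
        got = model(prompt).strip().lower()
        if got != expected:
            failures.append((prompt, expected, got))
            print(f"FAIL: {prompt!r} -> {got!r} (expected {expected!r})")
    accuracy = 1 - len(failures) / len(cases)
    print(f"accuracy: {accuracy:.0%}")
    return accuracy >= threshold
```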

IMPORTANT:

During the seminar, we will use the following tools. It is advisable to have them installed:
git (https://git-scm.com/)
docker (https://docs.docker.com/engine/install/)

VSCode (https://code.visualstudio.com/)
GitHub Copilot (the free version will suffice; the Pro version is free for students or those with an institutional account: https://github.com/education/students). Install the extension for VSCode: https://code.visualstudio.com/docs/copilot/setup

As an alternative to GitHub Copilot, we will also use a paid option, Claude Code: https://docs.claude.com/en/docs/claude-code/setup. You can use GLM-4.6 as a cheaper alternative to a Claude subscription, at €3 per month (10% discount with this link: https://z.ai/subscribe?ic=3NB26JQGBD). Instructions for integrating GLM into Claude Code: https://docs.z.ai/devpack/tool/claude

With just git and docker installed, you will be able to see and test the results of my executions. I recommend installing everything for a fully immersive experience during the seminar.

References:
[1] T. Kwa et al., Measuring AI Ability to Complete Long Tasks, 2025, arXiv:2503.14499. https://arxiv.org/abs/2503.14499
[2] A. Sinha, A. Arun, S. Goel, S. Staab, J. Geiping, The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs, 2025, arXiv:2509.09677. https://arxiv.org/abs/2509.09677
[3] A. Collante, S. Abedu, S. Khatoonabadi, A. Abdellatif, E. Alor, E. Shihab, The Impact of Large Language Models on Code Review Process, 2025, arXiv:2508.11034. https://arxiv.org/abs/2508.11034
[4] F. S. Aðalsteinsson, B. B. Magnússon, M. Milicevic, A. N. Davidsson, C. H. Cheng, Rethinking Code Review Workflows with LLM Assistance: An Empirical Study, 2025, arXiv:2505.16339. https://arxiv.org/abs/2505.16339
[5] P. Belcak, G. Heinrich, S. Diao, Y. Fu, X. Dong, S. Muralidharan, Y. C. Lin, P. Molchanov, Small Language Models are the Future of Agentic AI, 2025, arXiv:2506.02153. https://arxiv.org/abs/2506.02153
[6] M. Levy, A. Jacoby, Y. Goldberg, Same Task, More Tokens: The Impact of Input Length on the Reasoning Performance of Large Language Models, 2024, arXiv:2402.14848. https://arxiv.org/abs/2402.14848
[7] N. Huynh, B. Lin, Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications, 2025, arXiv:2503.01245. https://arxiv.org/abs/2503.01245
[8] Anthropic, Interactive Prompt Engineering Tutorial. https://github.com/anthropics/prompt-eng-interactive-tutorial
[9] Anthropic, Claude Code: Overview. https://docs.claude.com/en/docs/claude-code/overview
[10] OpenAI, Why language models hallucinate, 2025. https://openai.com/index/why-language-models-hallucinate/
[11] Apple Machine Learning Research, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, 2025. https://machinelearning.apple.com/research/illusion-of-thinking
