29/06/2026

Can AI take on the literature review? New iCARE4CVD research puts it to the test

Screening thousands of research papers is one of the most time-consuming steps in evidence synthesis. A new study from the iCARE4CVD consortium asks whether large language models can do it reliably — and at scale.

Every systematic literature review starts the same way: with a very large pile of papers. 

For researchers studying biomarkers in heart failure, that pile can run to thousands of publications, accumulated over decades and written in the varied language of clinical science. Before any analysis can begin, each paper must be read and assessed against a set of inclusion criteria. The process is labour-intensive and, as studies of human reviewers repeatedly show, hard to keep consistent.

Two experienced researchers reading the same paper will sometimes reach different conclusions, reflecting the complexity of applying precise criteria across a large and varied body of literature. This inconsistency in screening affects which evidence makes it into a review, and ultimately, what conclusions can be drawn from it. 

For iCARE4CVD, this challenge is directly relevant. The consortium’s work on identifying and validating biomarkers in heart failure depends on comprehensive, reproducible evidence synthesis, at a scale that is difficult to achieve through manual review alone. A new study from the consortium asks whether a well-designed AI system can help, and what it would take to make that reliable in practice.

An AI tool built to screen thousands of papers against detailed criteria 

5,405 publications is a large but not uncommon starting point for a systematic review in medical research — and a scale that makes manual screening alone difficult to sustain. As part of its work on biomarker use in heart failure with reduced ejection fraction (HFrEF), the iCARE4CVD consortium needed a way to make that screening process both rigorous and feasible. The answer was an AI-based tool designed to automate full-text screening. 

The resulting tool works by breaking each paper into short, searchable segments and converting them into a numerical form that allows meaning-based comparison, capturing context and relationships between terms rather than simply matching keywords. To handle the range of clinical terminology in use across the literature, the team also built a curated medical dictionary. This was necessary because the same clinical concept can appear in many forms: left ventricular ejection fraction — a measure of how much blood the heart pumps with each beat — might be written as “ejection fraction,” “EF,” or “LVEF” depending on the author, institution, or decade in which a study was published. 
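The two preparation steps described above, splitting a paper into segments and mapping variant spellings onto one term, can be sketched roughly as follows. This is only an illustration: the dictionary entries, the function names and the 50-word segment size are assumptions, not details from the study, and the embedding step is left out because the article does not name the model used for it.

```python
import re

# Hypothetical mini-dictionary mapping surface forms to one canonical term.
# The study's curated medical dictionary is not reproduced here.
SYNONYMS = {
    "left ventricular ejection fraction": "LVEF",
    "ejection fraction": "LVEF",
    "lvef": "LVEF",
    "ef": "LVEF",
}

# Longest variants first, so the full phrase wins over its shorter forms.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def normalise_terms(text: str) -> str:
    """Replace every known variant with its canonical term."""
    return _PATTERN.sub(lambda m: SYNONYMS[m.group(0).lower()], text)


def split_into_segments(text: str, max_words: int = 50) -> list[str]:
    """Cut a paper into consecutive segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


if __name__ == "__main__":
    sentence = "Patients with reduced ejection fraction (EF < 40%) were enrolled."
    print(normalise_terms(sentence))
```

In a real pipeline each normalised segment would then be converted to a vector and compared by meaning. A plain lookup like this one only handles the spelling variants it has been told about.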

Rather than assessing each paper in one broad pass, the tool poses 136 specific questions, each targeting a single inclusion or exclusion criterion. Dedicated AI agents handle each question, drawing only on the most relevant sections of the paper. Their answers feed into a logical rule: a paper is included only if it satisfies both the patient population and biomarker criteria. 
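As a rough sketch, the final decision rule could look like the code below. The article states only the top-level rule, that both the population and the biomarker criteria must be satisfied. How individual answers are combined within each group is an assumption made here for illustration (at least one inclusion criterion met and no exclusion criterion met), as are the class, field and identifier names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    criterion_id: str  # hypothetical identifier, e.g. "POP-01"
    group: str         # "population" or "biomarker"
    kind: str          # "inclusion" or "exclusion"
    met: bool          # the agent's yes/no answer for this criterion


def group_satisfied(answers: list[Answer], group: str) -> bool:
    """Assumed within-group rule: some inclusion criterion met, no exclusion met."""
    relevant = [a for a in answers if a.group == group]
    any_inclusion = any(a.met for a in relevant if a.kind == "inclusion")
    any_exclusion = any(a.met for a in relevant if a.kind == "exclusion")
    return any_inclusion and not any_exclusion


def include_paper(answers: list[Answer]) -> bool:
    """Top-level rule from the article: both groups must be satisfied."""
    return group_satisfied(answers, "population") and group_satisfied(answers, "biomarker")


if __name__ == "__main__":
    answers = [
        Answer("POP-01", "population", "inclusion", True),
        Answer("BIO-01", "biomarker", "inclusion", False),
    ]
    print(include_paper(answers))  # population is met, biomarker is not
```

Keeping each question narrow and combining the answers with an explicit rule means that a wrong decision can be traced back to the individual criterion that caused it.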

 

Selecting and training an open-source model on a human-validated reference standard 

After testing several models, the team selected LLaMA 3.3 70B — an open-source model that can be downloaded and run on local infrastructure, meaning that research documents and patient-related data never need to leave a secure environment. This was a deliberate choice: privacy and data confidentiality were prioritised alongside performance. In initial testing on 49 papers, it matched the performance of larger commercial models, with no false negatives — meaning it did not miss any papers that should have been included. 

To train and evaluate the tool, the team first established a human-generated reference standard: pairs of independent reviewers — one senior, one junior — assessed 200 publications through a double-blind process, with disagreements resolved by a heart failure expert. This agreed set of correct answers was then used to test and progressively refine the AI. Over multiple training rounds, prompts were adjusted wherever the tool produced incorrect outputs, until it consistently reached the performance threshold the team had set. 
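The way such a reference standard is assembled can be illustrated with a short sketch. It assumes that each reviewer's decisions are stored as a mapping from a paper identifier to include/exclude, and that the expert is represented by a callback. Both are illustrative choices, not details from the study.

```python
from typing import Callable


def build_reference_standard(
    senior: dict[str, bool],
    junior: dict[str, bool],
    adjudicate: Callable[[str], bool],
) -> tuple[dict[str, bool], list[str]]:
    """Merge two independent sets of decisions into one agreed set.

    Where the reviewers agree, their shared decision is kept. Where they
    disagree, the adjudicator (standing in for the expert) decides.
    Returns the agreed decisions and the list of disputed paper IDs.
    """
    reference: dict[str, bool] = {}
    disputed: list[str] = []
    for paper_id, senior_decision in senior.items():
        if senior_decision == junior[paper_id]:
            reference[paper_id] = senior_decision
        else:
            disputed.append(paper_id)
            reference[paper_id] = adjudicate(paper_id)
    return reference, disputed
```

The list of disputed papers is worth keeping. It shows how often the human reviewers themselves disagreed, and it is a natural place to look when refining prompts.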

The tool reached a sensitivity of 91.4% in validation — correctly identifying more than nine in ten relevant papers, with a false negative rate of 8.6%. 

Consistency was another notable finding. When the same batch of papers was screened three times, results matched 100% across all runs. As Martina Colombo, researcher in the Unit of Biostatistics at Istituto Mario Negri and a co-author of the study, reflects: “During the evaluation phase, we observed that repeated runs of the AI produced fully consistent results, while human reviewers showed a higher level of variability when assessing the same papers. This really shifted my perspective: inconsistency is not just an AI problem, but a broader challenge in evidence synthesis.”

Decisions by human reviewers in the same study varied more from one reviewer to another, a finding consistent with broader evidence on the difficulty of achieving uniform results in manual screening. The AI screened each paper in approximately two minutes, compared with around ten minutes for a human reviewer.
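A repeated-run check of the kind described above can be expressed in a few lines. The sketch assumes each run is stored as a mapping from paper identifier to an include/exclude decision. The function name and data layout are illustrative, not the study's own.

```python
def run_agreement(runs: list[dict[str, bool]]) -> float:
    """Share of papers that received the same decision in every run."""
    paper_ids = list(runs[0])
    identical = sum(
        1 for pid in paper_ids if len({run[pid] for run in runs}) == 1
    )
    return identical / len(paper_ids)


if __name__ == "__main__":
    first = {"p1": True, "p2": False, "p3": True, "p4": False}
    # Three identical runs agree on every paper.
    print(run_agreement([first, dict(first), dict(first)]))
```

A value of 1.0 corresponds to the 100% match reported for the tool. The same function applied to two human reviewers' decisions gives a simple measure of how often they agree.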

Where the tool has clear limits 

The results also point to areas where the approach requires further development. 

Specificity in the validation phase was 53.2%, meaning the AI accepted nearly half (46.8%) of the papers that human reviewers excluded. This reflects a deliberate design choice rather than a technical failure: the tool was calibrated to minimise the risk of missing relevant studies, which meant accepting a higher rate of false positives in exchange. Those papers still proceed to human review, so the tool functions as a first filter rather than a final decision-maker.
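For readers less familiar with these measures, the sketch below shows how sensitivity and specificity are derived from a comparison with the reference standard. The counts in the usage example are invented round numbers chosen for readability. They are not the study's data, which the article does not report at that level of detail.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard screening measures from a 2x2 comparison with a reference standard.

    tp: relevant papers the tool included   fn: relevant papers it missed
    tn: irrelevant papers it excluded       fp: irrelevant papers it let through
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "false_negative_rate": 1 - sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
    }


if __name__ == "__main__":
    # Invented counts for illustration only, not taken from the study.
    for name, value in screening_metrics(tp=9, fp=5, tn=5, fn=1).items():
        print(f"{name}: {value:.1%}")
```

The two pairs always sum to 100%, which is why a sensitivity of 91.4% implies a false negative rate of 8.6%, and a specificity of 53.2% implies a false positive rate of 46.8%.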

Historical literature presented a more fundamental challenge. As Yongxin Ye, postdoctoral researcher at Novo Nordisk and a co-author of the study, explains: “The AI tool could struggle with domain-specific terminology and changes in how heart failure has been described over time. Some older studies used terminology that predates today’s LVEF-based definitions, and the tool initially risked excluding relevant papers.” Papers from the 1980s and 1990s describing “acute systolic heart failure” – a term commonly used before LVEF-based classification became standard – were initially missed.  

Finally, the study was conducted in a specific disease context with well-defined criteria. How well the approach performs across other clinical domains, with different levels of terminological complexity, remains to be tested. 

A reusable framework for large-scale evidence synthesis 

The architecture developed by the iCARE4CVD team is designed to be adaptable: new criteria, new biomarkers, or new disease areas can be incorporated without rebuilding the system from scratch. That modularity matters in a field where evidence accumulates faster than any team can manually process it. 

For researchers who remain sceptical, both Yongxin Ye and Martina Colombo argue that caution is the right instinct, but not a reason to dismiss the approach. “Scepticism is not only understandable, it is necessary,” says Yongxin. “AI should be treated as a carefully validated assistant, not a replacement for scientific judgement.” Martina echoes this framing: “Rather than thinking in terms of trust vs. distrust, I would frame it as controlled use – understanding where the tool performs well, where it struggles, and integrating it in a way that strengthens the overall quality of the review.”

For large-scale initiatives like iCARE4CVD, tools that improve the consistency and efficiency of screening while keeping human expertise at the centre of final decisions have clear practical value.

The next step is understanding how this framework performs across different clinical contexts — and how it can be validated and deployed in ways that meet the standards of rigorous scientific review. 
