A real-world evaluation of the implementation of NLP technology in abstract screening of a systematic review

Sara Perlman-Arrow*†1, Noel Loo†2, Niklas Bobrovitz3, Tingting Yan††3, Rahul K. Arora††4,5

1 School of Population and Global Health, McGill University, QC, Canada
2 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, MA, United States of America
3 Temerty Faculty of Medicine, University of Toronto, ON, Canada
4 Centre for Health Informatics, University of Calgary, AB, Canada
5 Institute of Biomedical Engineering, University of Oxford, United Kingdom

* sara.perlman-arrow@mail.mcgill.ca
† These authors contributed equally to this work
†† These authors contributed equally to this work

medRxiv preprint doi: https://doi.org/10.1101/2022.02.24.22268947 (version posted February 25, 2022; CC-BY-NC-ND 4.0). NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

Abstract

The laborious and time-consuming nature of systematic review production hinders the dissemination of up-to-date evidence synthesis. Well-performing natural language processing (NLP) tools for systematic reviews have been developed, showing promise to improve efficiency. However, the feasibility and value of these technologies have not been comprehensively demonstrated in a real-world review. We developed an NLP-assisted abstract screening tool that provides text inclusion recommendations, keyword highlights, and visual context cues. We evaluated this tool in a living systematic review on SARS-CoV-2 seroprevalence, conducting a quality improvement assessment of screening with and without the tool. We evaluated changes to abstract screening speed, screening accuracy, characteristics of included texts, and user satisfaction. The tool improved efficiency, reducing screening time per abstract by 45.9% and decreasing inter-reviewer conflict rates. The tool conserved the precision of article inclusion (positive predictive value; 0.92 with tool vs 0.88 without) and recall (sensitivity; 0.90 vs 0.81). The summary statistics of included studies were similar with and without the tool. Users were satisfied with the tool (mean satisfaction score of 4.2/5). We also evaluated an abstract screening process in which one human reviewer was replaced with the tool's votes, finding that this maintained recall (0.92 one person and one tool vs 0.90 two tool-assisted humans) and precision (0.91 vs 0.92) while reducing screening time by 70%. Implementing an NLP tool in this living systematic review improved efficiency, maintained accuracy, and was well received by researchers, demonstrating the real-world effectiveness of NLP in expediting evidence synthesis.
1 Background

Evidence synthesis is crucial to evidence-based decision-making in modern medicine and public health. [1, 2] It is especially important during health emergencies, when the evidence base can change rapidly. The volume of COVID-19 literature exemplifies this, with over 500,000 articles on the subject published as of October 2021. [3] Unfortunately, producing systematic reviews is time-consuming and laborious: the mean time from registration to publication is 67.3 weeks for PROSPERO-registered reviews. [4, 5] Living systematic reviews are designed to circumvent such delays and provide up-to-date results, [6, 7] but it is similarly laborious to update these at an adequate frequency. [8] Hence, there is increasing urgency to expedite evidence synthesis methods without sacrificing quality.

Software tools using natural language processing (NLP) have been developed to accelerate systematic review methods, [9] many of which target abstract screening. [10, 11, 12, 13, 14, 15] Under 3% of texts screened are typically eligible for inclusion, [4] making it time-consuming to parse through search results and particularly useful to expedite this stage of the process. NLP-based screening tools classify abstracts for inclusion or exclusion and are trained on abstracts labeled by human reviewers. Examples include Rayyan, [14] DistillerSR, [12] and Research Screener, [10] which use naive Bayes or n-gram-based support-vector machine approaches. Recently, transformer-based NLP models have shown particular promise for text screening. [16] Transformer models are typically pre-trained on large bodies of medical literature, then fine-tuned on a specific screening task, resulting in broadly improved performance. [17, 18]

Despite these many existing technologies, there is little data on the real-world utility of such screening tools. Most previous reports focus primarily on performance metrics, such as tool precision and recall on abstracts previously unseen by the model. [19, 20] Some studies have assessed efficiency measures, such as the impact on reviewer workload or the time saved while screening, but these are typically conducted retrospectively, with data from completed reviews. [15, 11, 12, 21] Only one study to date has evaluated these tools in the context of an ongoing review with user interactions; that evaluation involved only one reviewer, was done after traditional screening was completed, and focused exclusively on screening time. [10]

Furthermore, few reports have evaluated the impact of implementing NLP tools in living literature reviews, [13] and none have assessed user-tool interactions or user satisfaction in this context.
Living reviews stand to benefit from sustained screening efficiency and lend themselves well to the integration of NLP tools: an initial manual review yields a large number of screened texts, which can serve as the training set for an algorithm that, in turn, expedites continuous review updates.

SeroTracker conducts a living systematic review of global SARS-CoV-2 seroprevalence and publishes results on an interactive dashboard (Serotracker.com). [22, 23] Each week, our team screens 800-1000 new abstracts and extracts approximately 30 articles. To optimize the efficiency of our screening process, we developed an NLP-assisted software tool and conducted a quality improvement (QI) project assessing the efficiency changes and usability of integrating this tool into our usual methods. We evaluated changes in the time taken to conduct screening, the accuracy of the screening process, the characteristics of included texts in our overall review, user interactions with the tool, and user satisfaction with the process. Moreover, we assessed different combinations of reviewer and tool pairings to determine how best to improve our screening process. As an evaluation of NLP-based tools in an ongoing living systematic review, our report provides novel and comprehensive evidence regarding the feasibility of NLP for screening and its real-world performance benefits.

2 Methods

2.1 The SeroTracker systematic review of SARS-CoV-2 seroprevalence

SeroTracker conducts a living systematic review of SARS-CoV-2 seroprevalence that is registered with PROSPERO (CRD42020183634, version of 7 May 2020). We run weekly searches in four literature databases (MEDLINE, EMBASE, Web of Science, and Europe PMC) to find relevant peer-reviewed and preprint literature. Full details on this review are available in previous publications. [23]

Every week, the searches yield approximately 800-1000 new texts, which are uploaded to the Covidence platform for screening. [24] Abstracts are reviewed in duplicate, with first and second reviewers blinded to each other's votes. All texts whose abstracts receive "include" votes from both reviewers undergo a full-text screen. All thirteen research team members conduct screening and can provide the first vote, and one of six designated second reviewers provides the second vote. This abstract screening process yields 25-40 articles for full-text screening and 20-35 included articles each week.

2.2 Development and implementation of an NLP tool into SeroTracker's processes

This study was conducted as a quality improvement project, following the Plan-Do-Study-Act (PDSA) model, to facilitate and expedite SeroTracker's screening process while maintaining accuracy. The Conjoint Health Research Ethics Board (CHREB) at the University of Calgary granted us an exemption from research ethics board review for this QI project. This study followed the Standards for QUality Improvement Reporting Excellence (SQUIRE 2.0).
2.2.1 Plan

In line with process improvement measures at SeroTracker, reviewers were interviewed to assess satisfaction with the screening process. Team members noted its time-consuming nature and identified the following key challenges: difficulty tracking the number of texts screened, the inability to reverse a vote after a misclick, and the inability to identify key information at a glance to determine whether a text should be included.

We developed an NLP-enabled tool that adds features to Covidence to allow more efficient identification of text eligibility. The tool included (1) an inclusion recommendation indicator, which displays a confidence rating ranging from "not recommended" to "strongly recommended" in the form of a coloured circle beside the abstract title. This was developed using the transformer-based pre-trained NLP model PubMedBERT, [18] which we fine-tuned on a set of 25,000 previously screened abstracts from the living systematic review. The tool also included (2) a feature that highlights the Population, Intervention, and Outcome (PIO) abstract components in different colours, using the same model but trained on the EBM-NLP dataset. [25] More details about the tool's development can be found in Appendix B.

The tool also incorporated features to streamline screening and improve the user experience: (3) a screening progress tracker; (4) a button to undo a user's most recent votes on a text; (5) a feature that displays abstracts separated by section headings (e.g., "Background", "Methods"); and (6) a feature highlighting reviewer-specified keywords (Appendix Figure C3).
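To make the classification component concrete, the sketch below shows how a PubMedBERT checkpoint can be fine-tuned as a binary include/exclude abstract classifier with the Hugging Face transformers library. This is a minimal sketch under our own assumptions: the checkpoint name, hyperparameters, and placeholder data are illustrative, not the exact configuration described in Appendix B.

```python
# Minimal sketch: fine-tuning PubMedBERT for include/exclude abstract
# classification. Checkpoint name and hyperparameters are illustrative
# assumptions, not the project's actual configuration.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"

class AbstractDataset(Dataset):
    """(abstract text, 0/1 include label) pairs from prior human screening."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=512)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Placeholder data; the project used ~25,000 previously screened abstracts.
texts = ["SARS-CoV-2 seroprevalence among blood donors in ...",
         "A randomized trial of a novel antiviral for ..."]
labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT,
                                                           num_labels=2)

args = TrainingArguments(output_dir="screening-model", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=AbstractDataset(texts, labels, tokenizer)).train()
```

At inference time, the model's softmax probability for the "include" class can be binned into the five recommendation categories displayed beside each abstract title.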
2.2.2 Do

We conducted a project with an AB design to assess the feasibility and impact of tool implementation on abstract screening. We selected a set of 400 abstracts ("pilot abstracts") to evaluate tool performance. These abstracts had been previously screened using the same inclusion criteria as part of SeroTracker's review: 309/400 had previously been excluded and 91/400 previously included.

The project was conducted over a five-week period in three stages (Figure 1). In the first two weeks, team members conducted screening without the tool ("without-tool stage"); 200 pilot abstracts were added to the regular primary searches each week. We then implemented a week-long washout period, during which no pilot abstracts were added and reviewers installed and familiarized themselves with the tool. In the final two weeks, team members used the tool for screening ("with-tool stage"), using the features they felt were most helpful. In this stage, 200 pilot abstracts were again added to the regular primary searches each week.

2.2.3 Study

Three sets of reviewer votes on the pilot abstracts were collected: votes from the initial screen in April ("pre-project votes"), votes from the without-tool stage, and votes from the with-tool stage. We used the pre-project votes as the reference standard for comparison.

We evaluated key process, outcome, and structure measures. Process measures included (1) efficiency metrics, namely the screening time and the conflict rate with and without the tool, and (2) accuracy metrics, namely the precision (positive predictive value) and recall (sensitivity) of screening, as well as the performance of the tool's inclusion recommendations. To evaluate precision and recall, we first calculated the baseline variability expected from human error in screening, by comparing the included and excluded texts between the pre-project and without-tool stages, as there is inherent human error in the systematic review process. [26] We then assessed whether the outcomes of the with-tool stage fell within these expected levels of variability.

The first outcome measure was the tool's impact on the results of the review, assessed by comparing summary descriptive statistics for included seroprevalence estimates in the pre-project, without-tool, and with-tool stages. We also assessed reviewers' usage of the different features, such as voter alignment with the NLP recommendations and the frequency of use of each feature, and surveyed users to understand overall satisfaction with the tool.

Finally, one structure measure was evaluated, comparing performance under different combinations of human and tool votes. We assessed the tool's performance using data from the project in a "one-person and one-tool" (OPOT) screening process, a simulated abstract screening scenario in which one human reviewer is replaced with the tool's automated inclusion recommendations. The tool voted "include" on an abstract if it provided a rating of "weakly recommended" or stronger. Two scenarios were considered: one in which the human reviewer had access to the tool [OPOT (W)], and one in which they did not [OPOT (W/O)]. Each of these scenarios was compared to a "tool-only" system, in which only the tool's automated inclusion recommendations were used for abstract screening, while conflict resolution and full-text screening remained with human reviewers. For the human reviewer screening scenarios, we used human votes from the without-tool and with-tool stages from the voter with the longest tenure at SeroTracker. Any additional conflict resolution or full-text screenings required for this analysis were done after the with-tool stage by human reviewers.

Details about the process, outcome, and structure measures studied are presented in Table 1, along with their key results.

2.2.4 Act

Results from this work were used to decide whether to integrate the tool into regular practice at SeroTracker and to inform further improvements.
3 Results

3.1 Process measures

3.1.1 Screening efficiency: time taken to screen

Across all abstracts, the tool was associated with a 33.7% decrease in mean screening time per abstract, from 21.45 ± 22.30 s (n = 2852) to 14.22 ± 17.52 s (n = 2961) (p = 5.25e-42 by two-sided unequal-variances t-test) (Table 2, Figure 2a).

This reduction was similar when considering the pilot abstracts alone (22.78 ± 22.59 s (n = 746) to 14.00 ± 18.04 s (n = 736); p = 3.03e-16) (Table 2, Figure 2b). Of the 800 pilot abstracts, 168 were screened twice by the same reviewer, once with and once without the tool. To account for possible order effects in these votes, we repeated the analysis excluding them and continued to observe a significant decrease in abstract screening time, from 23.99 ± 22.44 s without the tool (n = 603) to 14.70 ± 18.21 s with it (n = 591) (p = 8.79e-15) (Table 2). Lastly, we repeated the analysis using only abstracts ultimately included at the abstract screening stage; these took a similar time to screen with or without the tool (Table 2, Figure 2c).

To account for inter-reviewer effects and inter-abstract variability, we modeled the logarithm of the time taken as a function of tool use, adjusted for reviewer speed (mean log-time μ = μ0 + a_i + b·t, where a_i are per-reviewer effects and t indicates tool use; see Table 1). Under this model, the tool was associated with a 45.9% reduction in screening time per abstract (p < 0.0001). There was substantial variability in mean screening time between reviewers, with a_i ranging from -0.96 to 0.66.

The tool's impact on the conflict rate was also assessed. A higher number of conflicting votes decreases the efficiency of abstract screening, as a third reviewer must resolve these. Of the 1960 unique abstracts voted on without the tool, 163 had a conflicting vote. The conflict rate decreased from 8.32% (163/1960) to 3.64% (87/2388) (p = 5.36e-11, Fisher's exact test) when the tool was added (Table 3). When considering only the pilot abstracts, a similar decrease was observed: 11.25% (45/400) to 5.75% (23/400) (p = 0.007, Fisher's exact test) (Table 3). This conflict rate reduction amounts to a 2.2% time saving per abstract.
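For concreteness, the sketch below reproduces the shape of these three analyses with pandas, SciPy, and statsmodels on synthetic stand-in data; the column names and data are illustrative assumptions, and the project's actual analysis code is available in the cited repositories [28].

```python
# Sketch of the efficiency analyses on synthetic stand-in data: (a) Welch's
# t-test on screening times, (b) per-reviewer-adjusted log-time model,
# (c) Fisher's exact test on conflict rates. Column names are illustrative.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"reviewer": rng.integers(0, 7, n),
                   "tool": rng.integers(0, 2, n)})
df["seconds"] = np.exp(3.0 - 0.6 * df["tool"] + rng.normal(0, 0.8, n))

# (a) Two-sided unequal-variances (Welch's) t-test on mean screening time.
t, p_time = stats.ttest_ind(df.loc[df.tool == 0, "seconds"],
                            df.loc[df.tool == 1, "seconds"], equal_var=False)

# (b) Gaussian model of log screening time with per-reviewer effects:
# log(seconds) = mu0 + a_i + b * tool. The tool's implied percentage
# reduction in screening time is 1 - exp(b).
fit = smf.ols("np.log(seconds) ~ C(reviewer) + tool", data=df).fit()
reduction = 1 - np.exp(fit.params["tool"])

# (c) Fisher's exact test on conflict counts (conflicts vs. non-conflicts),
# using the totals reported in Table 3.
odds, p_conflict = stats.fisher_exact([[163, 1960 - 163],
                                       [87, 2388 - 87]])
print(f"t-test p={p_time:.2e}; modeled reduction={reduction:.1%}; "
      f"conflict-rate p={p_conflict:.2e}")
```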
In summary, the tool improved all efficiency metrics except the mean time taken for included abstracts. Combining the conflict rate saving (2.2%) and the mean time saving (33.7%) multiplicatively (1 - (1 - 0.337)(1 - 0.022) ≈ 0.352), we expect a 35.2% reduction in screening time per abstract, corresponding to 3.93 hours of reviewer time saved each week.

3.1.2 Effect on screening precision and recall

We compared the precision and recall of the screening process with and without the tool to ensure that tool use did not interfere with screening accuracy. Considering pilot abstracts that were included past full-text review, 74 of the 91 previously included (PI) abstracts were included without the tool and 82 with it (Table 4). This change was not significant (p = 0.137, Fisher's exact test), meaning accuracy was conserved. Among the texts included past full-text review, there were 7 false positive (FP) texts with the tool and 10 without it. All FPs were excluded at the extraction stage (if, during full-text extraction, a text that passed screening is found to lack information to extract, it can be excluded). No new texts were therefore included in the review as a result of the project.

3.1.3 Inclusion recommendation performance on the pilot abstracts

The tool's inclusion recommendation feature rates each abstract's inclusion likelihood on one of five categories. The feature's operating characteristics were evaluated on the pilot abstracts (Table 5, Figure 3). FPs and false negatives (FNs) were calculated taking the pre-project votes as "true", and an inclusion prediction was assigned if the tool recommended at least that level of confidence. All PI pilot abstracts were at least weakly recommended, with 73/91 being strongly recommended. The "weakly recommended" threshold gave the highest F1 score (0.905), with a precision of 0.827 and recall of 1.000 at this level. Scores remained high at all thresholds.
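These per-threshold metrics follow from a simple cumulative rule: an abstract counts as a predicted include when the tool's confidence is at or above the threshold in question. A minimal sketch, with synthetic stand-in labels and illustrative names:

```python
# Sketch of the threshold analysis behind Table 5: precision, recall, and F1
# at each cumulative confidence level, scored against pre-project labels.
import numpy as np

LEVELS = ["not recommended", "weakly recommended", "somewhat recommended",
          "recommended", "strongly recommended"]

def operating_characteristics(levels, included, threshold):
    """Predicted include = tool confidence >= threshold."""
    pred = levels >= threshold
    tp = np.sum(pred & included)
    fp = np.sum(pred & ~included)
    fn = np.sum(~pred & included)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Synthetic stand-ins: 91 of 400 pilot abstracts previously included.
rng = np.random.default_rng(0)
included = np.arange(400) < 91
levels = np.where(included, rng.integers(1, 5, 400), rng.integers(0, 3, 400))

for t, name in enumerate(LEVELS):
    p, r, f1 = operating_characteristics(levels, included, t)
    print(f"at least '{name}': precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

At the lowest threshold every abstract is a predicted include, so recall is 1.0 and precision equals the inclusion rate, matching the structure of the first row of Table 5.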
3.2 Outcome measures

3.2.1 Results of overall systematic review

Table 6 summarizes the statistics of the seroprevalence estimates included in the pre-project, without-tool, and with-tool stages. The majority of the statistics remain consistent with and without the tool. Compared to the pre-project votes, the with-tool stage did not exclude any estimates deemed to have a low or moderate risk of bias.

3.2.2 User interaction with the tool

We first evaluated the agreement between user votes and the tool's five thresholds of inclusion recommendation. Users typically agreed with the recommendations, with the highest F1 score (0.821) occurring at the "somewhat recommended" threshold (Appendix A.1). There was a positive correlation between users voting "include" and the presence of keywords in the abstract (Appendix A.2).

The inclusion recommendation indicator was the most commonly used feature: it was used by all 7 unique reviewers in the with-tool stage and in 3140/3142 unique votes. PIO highlighting was used by 5/7 reviewers and in 1918/3142 (61.0%) votes. While all users had the keyword feature enabled, only 3 actually specified any keywords, so 21.1% (663/3142) of abstract votes were cast with keyword highlighting active. The undo feature was the least used (3/7 reviewers; 23/3142 texts), with the vote being changed in only 6 cases. Of note, undoing a vote in Covidence is not otherwise always possible.

3.2.3 User feedback on the tool

After the project's conclusion, reviewers were sent a satisfaction survey; 9 of 13 team members provided feedback (Table 7). The self-reported usage information did not align perfectly with the computer-recorded usage, possibly owing to recall bias, as the survey was conducted one week after the project concluded. Reviewers who used features rated their usefulness out of five (Table 7). The inclusion recommendation feature was voted the most useful (mean score of 4.70/5), while keyword highlighting (3.88/5) and PIO highlighting (4.00/5) were rated the least useful.

Reviewers reported that the tool improved perceived screening speed by allowing them to rapidly identify key information that qualifies or disqualifies abstracts for inclusion, specifically through the bolded headings, PIO highlighting, and keyword highlighting. While the undo feature was rarely used, users reported that it provided a sense of security, allowing them to correct mistakes that would otherwise be permanent. While many users found the inclusion recommendations useful, they noted that the feature could give a false sense of security and cause users to trust the tool blindly rather than carefully read the abstract. Users also noted that the PIO highlighting feature often highlighted incorrect information, making it distracting at times; this complaint was reflected in the feature's low adoption.

3.3 Structure measures

3.3.1 One-person-one-tool (OPOT) model

Given the reliability of the tool's inclusion recommendations, we evaluated the outcomes of an abstract screening process in which one human reviewer is replaced with the tool's vote, both in a scenario where the human reviewer did not have access to the tool (OPOT (W/O)) and in one where they did (OPOT (W)).

Results are reported in Table 4 for texts included at the abstract screen and for texts included at the full-text screen. None of the FPs were ultimately included in the review; all were excluded during extraction. Recall for included texts improved from 0.813 without the tool to 0.868 in the OPOT (W/O) model, and from 0.901 with the tool to 0.923 in the OPOT (W) setting. Assuming the conflict rate in the OPOT (W) setting remains the same as in the with-tool stage, OPOT would yield a further 48.6% reduction in screening time, for a potential total time saving of 72.2%; if the conflict rate doubled, the additional OPOT saving would be 45.8%, for a total time saving of 70.7%.

The tool-only screening scenario, in which votes are provided by the tool while a human reviewer conducts only conflict resolution and full-text screening, performed comparably well to both OPOT and two humans. Precision, however, was reduced under this system (Table 4).
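A minimal sketch of the OPOT construction, under our own simplifying assumptions: the votes are synthetic, the tool votes "include" at "weakly recommended" or stronger as described in Section 2.2.3, and human-tool disagreements are merely flagged here, whereas in the project they were resolved by human reviewers after the with-tool stage.

```python
# Sketch of the one-person-one-tool (OPOT) simulation: pair one human vote
# with the tool's vote and score unanimous decisions against reference
# labels. All data and names here are illustrative stand-ins.
import numpy as np

def opot_screen(human_include, tool_level, threshold=1):
    """Tool votes include when confidence >= threshold (1 = at least
    'weakly recommended'). Agreements are decided immediately; disagreements
    are flagged for a human conflict-resolution vote."""
    tool_include = tool_level >= threshold
    agree = human_include == tool_include
    return agree & human_include, ~agree  # (unanimous includes, conflicts)

def scores(pred, truth):
    tp = np.sum(pred & truth)
    precision = tp / max(int(np.sum(pred)), 1)
    recall = tp / max(int(np.sum(truth)), 1)
    return precision, recall, 2 * precision * recall / max(precision + recall, 1e-12)

# Synthetic stand-ins: 91/400 reference includes, a human reviewer with a
# small error rate, and tool confidence levels correlated with the labels.
rng = np.random.default_rng(0)
truth = np.arange(400) < 91
human = truth ^ (rng.random(400) < 0.05)
tool_level = np.where(truth, rng.integers(1, 5, 400), rng.integers(0, 2, 400))

decided, conflicts = opot_screen(human, tool_level)
p, r, f1 = scores(decided, truth)
print(f"{int(conflicts.sum())} conflicts flagged; "
      f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} on unanimous votes")
```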
4 Discussion

SeroTracker improved its screening processes by implementing an NLP-based tool. Tool use significantly reduced both the mean time taken per abstract and the screening conflict rate, producing an overall saving of 3.93 hours of weekly screening time. This improved efficiency did not come at the cost of accuracy: the precision and recall of the screening process and the summary statistics of the included estimates were similar with and without the tool. Users gave overall positive feedback on the tool.

Furthermore, the one-person-one-tool and tool-only screening systems, which offer the potential for further time reductions, performed as well as two human reviewers. The results of the OPOT analyses are particularly promising, as this system can reduce screening time by over 70%. Though the tool-only system performed comparably to the OPOT system in our analyses, further study is required: our pilot abstract sample was enriched with previously included texts, so the tool-only system's precision at the abstract screening stage would likely be lower in practice. SeroTracker plans to adopt the OPOT (W) system into regular screening practice to reduce screening time while maintaining a level of quality control through human monitoring, and we will incorporate reviewer feedback to further augment the tool's utility. These improvements can ultimately increase the efficiency of the entire living systematic review, allowing us to maintain up-to-date information.

The failure cases of the tool provide avenues for future development. The most and least used features were the inclusion recommendation indicator and PIO highlighting, respectively. The performance of the named-entity recognition used in PIO highlighting lags behind that of classification tasks in similar domains, [18] suggesting that more mature technologies may provide greater benefit. Furthermore, the tool did not aid with full-text screening: as described in Section 3.1.2, 6 PI texts were excluded in full-text review in the with-tool stage.
This suggests an inherent level of variation in the systematic review process that the tool could not mitigate.

In a broader context, our study supports previous reports on the potential for NLP to improve screening efficiency without impacting review quality. We build on previous work by providing the first comprehensive analysis of the feasibility and impact of NLP implementation. We examined this technology within an ongoing living systematic review, rather than through a retrospective evaluation, and our study design allowed us to document user-specific aspects of NLP tool development. Through our QI approach, we addressed the research team's specific needs when designing the tool's features and evaluated how users interacted with them. User feedback demonstrated that NLP tools are useful for augmenting efficiency and that reviewers felt they improved the screening workflow. Finally, the design of this project allowed us to compare the characteristics of texts included in the systematic review with and without the NLP tool, which has not previously been evaluated.

This project has implications for evidence synthesis at large and its role in evidence-based decision-making. As highlighted by the COVID-19 pandemic, there are roadblocks to adequate evidence-based medicine during health emergencies: while the need for rapid research results is evident, evidence synthesis is generally slow and rigorous. [27] This tension calls for technological advances, such as NLP-assisted tools, that accelerate the systematic review process. The success of this tool at SeroTracker demonstrates that such tools can fulfill this role and expedite the response of evidence-based medicine to a health emergency.

Furthermore, the living systematic review in particular provides an ideal context for implementing such tools, as the results of an initial manual review can be used as a training set to develop a tool that, in turn, expedites all future rounds of the review. Other teams working on living systematic reviews would benefit from integrating such tools tailored to their research question and design. Our QI project had a robust design, evaluated a comprehensive set of metrics, and can serve as a model; Appendix B provides more implementation details on the planning and technical aspects of the tool.

4.1 Limitations

Because the project was designed as a QI project within an ongoing review, the proportion of abstracts screened by individual reviewers varied from week to week. Reviewers may have systematic differences in how they vote, which could change which abstracts were included.
While the efficiency analysis in Section 3.1.1 attempts to decompose per-reviewer variability, other potentially informative covariates, such as whether the abstract received an "include" or "exclude" vote, or whether a reviewer had seen the abstract before, were not accounted for.

Furthermore, the AB project design could induce order effects. While we showed in Section 3.1.1 that removing duplicate abstract-vote pairs did not affect the observed decrease in screening time, we could not definitively demonstrate that order effects did not influence votes, particularly for texts that were accepted into full-text screening and subsequently interacted with multiple times. Finally, the sample size was limited to reduce the screening load placed on the team, and some comparisons lacked the sample size to reach statistical significance. This screening load constraint also resulted in a higher inclusion rate in the pilot abstract set (23%) than is typically observed in screening (5%).

Beyond design, there are limitations to assessing the impact of our tool as a whole. Firstly, there is no definitive "gold standard" for the inclusion or exclusion of abstracts; we assumed the "pre-project" screening labels to be accurate. Furthermore, our precision and recall analyses treated all false negatives as equally costly, but studies generally contribute differently to the quantitative results of a review: falsely excluding a paper reporting several unique seroprevalence estimates would alter overall results more than excluding a paper with just one estimate. While we examined summary statistics of missed texts in Section 3.2.1, a meta-analysis was not performed to determine the impact on downstream analyses. Finally, since the tool incorporated several features, the individual contribution of each feature could not be quantified. While we assessed user interaction and agreement with individual features in Section 3.2.2, the number of covariates arising from the multiple features prevents this analysis from demonstrating causality.

5 Conclusion

This study provides the first comprehensive analysis of the implementation of NLP technology in the screening stage of a living systematic review. Incorporating an internally developed tool at SeroTracker was feasible and significantly improved our processes by increasing screening efficiency while maintaining accuracy. User feedback was positive, leading to continued use of the tool in regular practice. This provides promising evidence for the evidence synthesis community, as similar tools could expedite the time-consuming and laborious manual systematic review process for other research groups, allowing for more up-to-date dissemination of evidence syntheses.
Acknowledgements

We greatly thank the reviewers on the SeroTracker team, including Mercedes Yanes-Lane, Anabel Selemon, Natasha Ilincic, Judy Chen, Christian Cao, Mairead Whelan, Zihan Li, and Xiaomeng Ma, for their participation in the quality improvement project. We also thank Gabriel Deveaux, Himanshu Ranka, and Aaron Bensmihen for help with a review of the finalized manuscript.

Funding Information

SeroTracker receives funding for SARS-CoV-2 seroprevalence study evidence synthesis from the Public Health Agency of Canada through Canada's COVID-19 Immunity Task Force, the World Health Organization Health Emergencies Programme, the Robert Koch Institute, and the Canadian Medical Association Joule Innovation Fund. No funding source had any role in the design of this study, its execution, analyses, interpretation of the data, or decision to submit results. This manuscript does not necessarily reflect the views of the World Health Organization or any other funder.

Conflicts of Interest

RKA was previously a Technical Consultant for the Bill and Melinda Gates Foundation Strategic Investment Fund, is a minority shareholder of Alethea Medical, and was a former Senior Policy Advisor at Health Canada. Each of these relationships is unrelated to the present work. TY reports a role at the Centre for Addiction and Mental Health and past employment at Health Canada, outside of the submitted work. All other authors declare that they have no competing interests.

Author Contributions

NL contributed to the development of the NLP tool. SPA, NL, and RKA contributed to the investigation and data collection. NL contributed to the statistical analysis and visualization. SPA and NL contributed to the writing of the manuscript. RKA, NB, and TY contributed to the conceptualisation, methodology, and review and editing of the manuscript. All authors approved the final manuscript prior to submission.

Data Availability

Code for the data analysis performed in this study, as well as code for the training of the NLP tool, is published in public GitHub repositories [28, 29].

References

1. Garritty C, Stevens A, Hamel C, Golfam M, Hutton B, Wolfe D. Knowledge Synthesis in Evidence-Based Medicine. Semin Nucl Med. 2019;49(2):136-144. doi:10.1053/j.semnuclmed.2018.11.006.
2. Guyatt GH, Sackett DL, Sinclair JC, et al. Users' Guides to the Medical Literature: IX. A Method for Grading Health Care Recommendations. JAMA. 1995;274(22):1800-1804. doi:10.1001/jama.1995.03530220066035.

3. COVID-19 Primer. https://covid19primer.com/. Accessed October 6, 2021.

4. Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the Time and Workers Needed to Conduct Systematic Reviews of Medical Interventions Using Data from the PROSPERO Registry. BMJ Open. 2017;7(2):e012545. doi:10.1136/bmjopen-2016-012545.

5. Bastian H, Glasziou P, Chalmers I. Seventy-Five Trials and Eleven Systematic Reviews a Day: How Will We Ever Keep Up? PLoS Med. 2010;7(9):e1000326. doi:10.1371/journal.pmed.1000326.

6. Pearson H. How COVID Broke the Evidence Pipeline. Nature. 2021;593(7858):182-185. doi:10.1038/d41586-021-01246-x.

7. Elliott JH, Turner T, Clavisi O, et al. Living Systematic Reviews: An Emerging Opportunity to Narrow the Evidence-Practice Gap. PLoS Med. 2014;11(2):e1001603. doi:10.1371/journal.pmed.1001603.

8. Millard T, Synnot A, Elliott J, Green S, McDonald S, Turner T. Feasibility and Acceptability of Living Systematic Reviews: Results from a Mixed-Methods Evaluation. Syst Rev. 2019;8(1):325. doi:10.1186/s13643-019-1248-5.

9. Marshall IJ, Wallace BC. Toward Systematic Review Automation: A Practical Guide to Using Machine Learning Tools in Research Synthesis. Syst Rev. 2019;8(1):163. doi:10.1186/s13643-019-1074-9.

10. Chai KEK, Lines RLJ, Gucciardi DF, Ng L. Research Screener: A Machine Learning Tool to Semi-Automate Abstract Screening for Systematic Reviews. Syst Rev. 2021;10(1):93. doi:10.1186/s13643-021-01635-3.

11. Gates A, Guitard S, Pillay J, et al. Performance and Usability of Machine Learning for Screening in Systematic Reviews: A Comparative Evaluation of Three Tools. Syst Rev. 2019;8(1):278. doi:10.1186/s13643-019-1222-2.

12. Hamel C, Kelly SE, Thavorn K, Rice DB, Wells GA, Hutton B. An Evaluation of DistillerSR's Machine Learning-Based Prioritization Tool for Title/Abstract Screening: Impact on Reviewer-Relevant Outcomes. BMC Med Res Methodol. 2020;20(1):256. doi:10.1186/s12874-020-01129-1.

13. Lerner I, Créquit P, Ravaud P, Atal I. Automatic Screening Using Word Embeddings Achieved High Sensitivity and Workload Reduction for Updating Living Network Meta-Analyses. J Clin Epidemiol. 2019;108:86-94. doi:10.1016/j.jclinepi.2018.12.001.

14. Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan: A Web and Mobile App for Systematic Reviews. Syst Rev. 2016;5(1):210. doi:10.1186/s13643-016-0384-4.

15. Wallace BC, Trikalinos TA, Lau J, Brodley C, Schmid CH. Semi-Automated Screening of Biomedical Citations for Systematic Reviews. BMC Bioinformatics. 2010;11(1):55. doi:10.1186/1471-2105-11-55.

16. Qin X, Liu J, Wang Y, et al. Natural Language Processing Was Effective in Assisting Rapid Title and Abstract Screening When Updating Systematic Reviews. J Clin Epidemiol. 2021;133:121-129. doi:10.1016/j.jclinepi.2021.01.010.
17. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint. 2018;arXiv:1810.04805. http://arxiv.org/abs/1810.04805.

18. Gu Y, Tinn R, Cheng H, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans Comput Healthcare. 2022;3(1):1-23. doi:10.1145/3458754.

19. Huang KC, Chiang IJ, Xiao F, Liao CC, Liu CCH, Wong JM. PICO Element Detection in Medical Text without Metadata: Are First Sentences Enough? J Biomed Inform. 2013;46(5):940-946. doi:10.1016/j.jbi.2013.07.009.

20. Cohen AM, Hersh WR, Peterson K, Yen PY. Reducing Workload in Systematic Review Preparation Using Automated Citation Classification. J Am Med Inform Assoc. 2006;13(2):206-219. doi:10.1197/jamia.M1929.

21. van de Schoot R, de Bruin J, Schram R, et al. An Open Source Machine Learning Framework for Efficient and Transparent Systematic Reviews. Nat Mach Intell. 2021;3(2):125-133. doi:10.1038/s42256-020-00287-7.

22. Arora RK, Joseph A, Van Wyk J, et al. SeroTracker: A Global SARS-CoV-2 Seroprevalence Dashboard. Lancet Infect Dis. 2021;21(4):e75-e76. doi:10.1016/S1473-3099(20)30631-9.

23. Bobrovitz N, Arora RK, Cao C, et al. Global Seroprevalence of SARS-CoV-2 Antibodies: A Systematic Review and Meta-Analysis. PLOS ONE. 2021;16(6):e0252617. doi:10.1371/journal.pone.0252617.

24. Veritas Health Innovation. Covidence. https://www.covidence.org/. Accessed November 5, 2021.

25. Nye B, Li JJ, Patel R, et al. A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2018:197-207.

26. Wang Z, Nayfeh T, Tetzlaff J, O'Blenis P, Murad MH. Error Rates of Human Reviewers during Abstract Screening in Systematic Reviews. PLOS ONE. 2020;15(1):e0227742. doi:10.1371/journal.pone.0227742.

27. Evidence-Based Medicine: How COVID Can Drive Positive Change. Nature. 2021;593(7858):168. doi:10.1038/d41586-021-01255-w.

28. SeroTracker. Serotracker-NLP-Tool-Analysis. GitHub. https://github.com/serotracker/Serotracker-NLP-Tool-Analysis. 2021. Accessed December 24, 2021.

29. SeroTracker. Serotracker-NLP-Training-and-Inference. GitHub. https://github.com/serotracker/Serotracker-NLP-Training-and-Inference. 2022. Accessed January 4, 2022.
6 Tables

Table 1: Key methods and results of the process, outcome, and structure measures evaluated.

Process measures

Efficiency (a)
  Description: mean time taken to screen an abstract with-tool compared to without-tool.
  Rationale: direct indicator of whether screening time was reduced with the tool.
  Analysis: two-sided unequal-variances t-tests comparing mean time taken per abstract with-tool vs without-tool for the three sets of abstracts described.
  Key result: the tool was associated with a 33.7% decrease in mean time taken per abstract (21.45 ± 22.30 s, n = 2852, to 14.22 ± 17.52 s, n = 2961; p = 5.25e-42).

Efficiency (b)
  Description: estimate of the overall time reduction from using the tool, controlling for per-reviewer effects.
  Rationale: direct indicator of whether screening time was reduced with the tool.
  Analysis: modelled the time taken as a function of tool usage using a Gaussian GLM with an identity link, with mean μ = μ0 + a_i + b·t, where a_i (1 ≤ i ≤ N) are per-reviewer effects, t ∈ {0, 1} indicates use of the tool, and b is the associated coefficient.
  Key result: under this model, the tool was associated with a 45.9% reduction in screening time per abstract (p < 0.0001).

Efficiency (c)
  Description: conflict rate, i.e., the proportion of abstracts requiring conflict-resolving votes.
  Rationale: fewer conflicts decrease overall screening time, as conflicting votes must be resolved by a third reviewer.
  Analysis: Fisher's exact test comparing conflict rates with vs without the tool.
  Key result: the conflict rate decreased with the tool, from 8.32% (163/1960) to 3.64% (87/2388).

Accuracy (a)
  Description: screening precision and recall with vs without the tool.
  Rationale: to ensure that implementing the tool does not detract from the accuracy of the review.
  Analysis: Fisher's exact test comparing the proportions of included abstracts.
  Key result: no significant change in the proportion of ultimately included texts (74/91 without tool vs 82/91 with tool; p = 0.137).

Accuracy (b)
  Description: evaluation of the tool's inclusion recommendations.
  Rationale: to ensure that the tool's inclusion recommendations are accurate.
  Analysis: calculated the operating characteristics of the tool on the pilot abstracts.
  Key result: all previously included abstracts were rated at least "weakly recommended", with 73/91 classified as "strongly recommended". The highest F1 score (0.905) was at the "weakly recommended" threshold, with precision of 0.827 and recall of 1.000.

Outcome measures

Results of overall review
  Description: comparison of the characteristics of the final included seroprevalence estimates from the pilot abstracts in the pre-project, without-tool, and with-tool stages.
  Rationale: to identify any differences in the results of the systematic review when using the tool.
  Analysis: compared summary statistics of the estimates ultimately included in each stage.
  Key result: estimates ultimately included were consistent across all three stages.

User interaction with the tool (a)
  Description: agreement between user votes and the inclusion recommendations.
  Rationale: to assess whether users voted in alignment with the tool and whether they found it useful.
  Analysis: measured operating characteristics comparing tool predictions to users' votes.
  Key result: users voted strongly in agreement with the tool.

User interaction with the tool (b)
  Description: quantification of feature usage, by number of reviewers and number of votes.
  Rationale: to assess which features likely contributed most to observed changes in the screening process.
  Analysis: descriptive summary.
  Key result: all 7 reviewers who voted in the with-tool stage used the inclusion recommendations and 5/7 used PIO highlighting; 3/7 input keywords and 3/7 used the undo feature (23 of 3142 votes were undone).

User satisfaction
  Description: whether team members felt the tool was adequate, whether using it augmented the screening process, and whether any features should be modified.
  Rationale: to determine whether the SeroTracker team was satisfied with the tool and to guide modifications that would further improve the screening process.
  Analysis: circulated a user satisfaction survey and produced a qualitative summary of the results.
  Key result: the inclusion recommendation feature was the most used and rated most useful (mean 4.7/5); keyword highlighting was rated lowest (mean 3.88/5). PIO highlighting was the least frequently used by self-report (3/9 users) and rated only marginally higher than keyword highlighting (mean 4.0/5).

Structure measures

"One-person-one-tool" (OPOT) model
  Description: operating characteristics when one human reviewer's abstract-screen vote is replaced by the tool's vote.
  Rationale: this use case could significantly reduce screening time and costs.
  Analysis: built a scenario with one human vote from the project data and one vote from the tool's inclusion recommendations, and assessed operating characteristics; analyses were performed for human votes without access to the tool (OPOT (W/O)) and with access (OPOT (W)).
  Key result: the F1 score for texts included at full-text screening improved from 0.846 in the without-tool stage to 0.878 in the OPOT (W/O) model, and from 0.911 in the with-tool stage to 0.918 in the OPOT (W) setting.

"Tool-only" model
  Description: operating characteristics when abstract-screen votes are given exclusively by the tool, with human reviewers only resolving conflicts and conducting the full-text screen.
  Rationale: to serve as a comparator to the project and OPOT models.
  Analysis: assessed operating characteristics in this setting.
  Key result: the tool-only setting performed as well as both OPOT and two humans, but with reduced precision at the abstract screen (0.827, compared to 0.881 in OPOT (W)).

Table 2: Mean screening time per abstract with-tool and without-tool, stratified into four categories: all abstracts, pilot abstracts only, pilot abstracts excluding repeated reviewer-abstract vote pairs, and abstracts receiving an "include" vote.

| Category | Condition | N | Mean (s) | t-test result |
|---|---|---|---|---|
| All abstracts | Without tool | 2852 | 21.45 ± 22.30 | t(5811) = 13.70, p = 5.25e-42 |
| | With tool | 2961 | 14.22 ± 17.52 | |
| Pilot abstracts only | Without tool | 746 | 22.78 ± 22.59 | t(1480) = 8.27, p = 3.03e-16 |
| | With tool | 736 | 14.00 ± 18.04 | |
| Pilot abstracts only (no repeats) | Without tool | 603 | 23.99 ± 22.44 | t(1192) = 7.86, p = 8.79e-15 |
| | With tool | 591 | 14.70 ± 18.21 | |
| Inclusion votes only | Without tool | 119 | 45.67 ± 32.37 | t(276) = 0.93, p = 0.264 |
| | With tool | 159 | 41.25 ± 32.43 | |

Table 3: Conflict rate of abstract screening with-tool vs without-tool, for all abstracts and for pilot abstracts only.

| Category | Condition | Conflicts | Abstracts | Rate | Test result |
|---|---|---|---|---|---|
| All abstracts | Without tool | 163 | 1960 | 8.32% | p = 5.36e-11 |
| | With tool | 87 | 2388 | 3.64% | |
| Pilot abstracts | Without tool | 45 | 400 | 11.25% | p = 0.007 |
| | With tool | 23 | 400 | 5.75% | |
Table 4: Precision, recall, and F1 scores for all user-tool pairings, for texts included at the abstract screening and full-text screening stages.

Included Abstracts      TN     FP    FN    TP    Prec.   Rec.    F1
  Without              297     12    14    77    0.865   0.846   0.856
  With                 298     11     3    88    0.889   0.967   0.926
  OPOT (W/O)           298     11     7    84    0.884   0.923   0.903
  OPOT (W)             297     12     2    89    0.881   0.978   0.927
  Tool-Only            290     19     0    91    0.827   1.000   0.905

Included Full Texts     TN     FP    FN    TP    Prec.   Rec.    F1
  Without              299     10    17    74    0.881   0.813   0.846
  With                 302      7     9    82    0.921   0.901   0.911
  OPOT (W/O)           299     10    12    79    0.888   0.868   0.878
  OPOT (W)             301      8     7    84    0.913   0.923   0.918
  Tool-Only            301      8     6    85    0.914   0.934   0.924

Abbreviations: TN, true negative; FP, false positive; FN, false negative; TP, true positive; Prec., precision; Rec., recall.

Table 5: Operating characteristics of the tool evaluated on the pilot abstracts at different inclusion thresholds.

                        TN     FP    FN    TP    Prec.   Rec.    F1
Not Recommended          0    309     0    91    0.228   1.000   0.371
Weakly Recommended     290     19     0    91    0.827   1.000   0.905
Somewhat Recommended   299     10     9    82    0.891   0.901   0.896
Recommended            301      8    14    77    0.906   0.846   0.875
Strongly Recommended   302      7    18    73    0.912   0.802   0.854

Abbreviations: TN, true negative; FP, false positive; FN, false negative; TP, true positive; Prec., precision; Rec., recall.
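Every precision, recall, and F1 value in Tables 4 and 5 follows mechanically from the four confusion-matrix counts. A minimal, self-contained check against the "Weakly Recommended" row of Table 5:

```python
def operating_characteristics(tn: int, fp: int, fn: int, tp: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# "Weakly Recommended" row of Table 5: TN=290, FP=19, FN=0, TP=91.
prec, rec, f1 = operating_characteristics(tn=290, fp=19, fn=0, tp=91)
print(f"precision={prec:.3f}, recall={rec:.3f}, F1={f1:.3f}")
# -> precision=0.827, recall=1.000, F1=0.905, matching the table.
```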
Table 6: Summary statistics of the seroprevalence estimates from the included texts in the pre-project, without-tool, and with-tool stages.

Characteristic                                    Pre-Project (n=308)   Without Tool (n=287)   With Tool (n=297)
Geographic Scope
  Local                                            69 (22%)              50 (17%)               59 (20%)
  Regional                                        220 (71%)             218 (76%)              219 (74%)
  National                                         19 (6.2%)             19 (6.6%)              19 (6.4%)
Study Population
  Assisted living and long-term care facilities     1 (0.3%)              1 (0.3%)               4 (1.3%)
  Blood donors                                      4 (1.3%)              4 (1.4%)               3 (1.0%)
  Essential non-healthcare workers                  4 (1.3%)              3 (1.0%)              35 (12%)
  Health care workers and caregivers               37 (12%)              29 (10%)               29 (9.8%)
  Household and community samples                  30 (9.7%)             26 (9.1%)               1 (0.3%)
  Multiple general populations                      1 (0.3%)              1 (0.3%)               1 (0.3%)
  Multiple populations                              1 (0.3%)              1 (0.3%)               1 (0.3%)
  Non-essential workers and unemployed persons      1 (0.3%)              1 (0.3%)               8 (2.7%)
  Patients seeking care for non-COVID-19 reasons   11 (3.6%)              6 (2.1%)               1 (0.3%)
  Persons experiencing homelessness                 1 (0.3%)              1 (0.3%)               2 (0.7%)
  Pregnant or parturient women                      2 (0.6%)              0 (0.0%)               0 (0.0%)
  Residual sera                                   212 (69%)             211 (74%)              209 (70%)
  Students and Daycares                             3 (1.0%)              3 (1.0%)               3 (1.0%)
World Bank Income Level
  High income                                     286 (93%)             269 (94%)              275 (93%)
  Low income                                        2 (0.6%)              2 (0.7%)               2 (0.7%)
  Lower middle income                               7 (2.3%)              7 (2.4%)               7 (2.4%)
  Upper middle income                              13 (4.2%)              9 (3.1%)              13 (4.4%)
HRP Status
  HRP                                              19 (6.2%)             15 (5.2%)              19 (6.4%)
  non-HRP                                         289 (94%)             272 (95%)              278 (94%)
Risk of Bias
  High                                             89 (29%)              72 (25%)               78 (26%)
  Low                                              10 (3.2%)              9 (3.1%)              10 (3.4%)
  Moderate                                        204 (66%)             201 (70%)              204 (69%)
  Unclear                                           5 (1.6%)              5 (1.7%)               5 (1.7%)
Sampling Method
  Non-probability                                 288 (94%)             268 (93%)              278 (94%)
  Probability                                      20 (6.5%)             19 (6.6%)              19 (6.4%)
Test Type
  CLIA                                            224 (76%)             217 (79%)              220 (77%)
  ELISA                                            36 (12%)              28 (10%)               34 (12%)
  LFIA                                             16 (5.5%)             13 (4.8%)              16 (5.6%)
  Multiple Types                                   11 (3.8%)             10 (3.7%)              11 (3.9%)
  Other                                             6 (2.0%)              5 (1.8%)               3 (1.1%)
  Neutralization                                    0 (0%)                0 (0%)                 0 (0%)
  Unclear                                           0 (0%)                0 (0%)                 0 (0%)

Table 7: Self-reported usage and usefulness ratings by 9 users for features of the NLP-based tool.

Feature                      Self-reported usage   Mean usefulness (individual ratings)
Undo                         4/9                   4.4 (3, 4, 5, 5, 5)
Inclusion Recommendations    7/9                   4.7 (4, 4, 5, 5, 5, 5, 5)
Bold Headings                7/9                   4.13 (3, 3, 3, 4, 5, 5, 5, 5)
Keyword Highlighting         7/9                   3.88 (3, 3, 3, 4, 4, 4, 5, 5)
PIO Highlighting             3/9                   4.0 (3, 3, 4, 5, 5)

7 Figures

Figure 1: The timeline of the quality improvement project.

Figure 2: Violin plots of times taken per abstract screen with and without the tool. Results are split into three categories: (a) all abstracts (routine + pilot), (b) pilot abstracts only, and (c) abstracts which received an inclusion vote. Means and standard deviations are marked by the black and grey dotted lines, respectively.

Figure 3: Operating characteristics of the tool evaluated on the pilot abstracts, with "true" labels taken as the outcome of the previous full screening and predicted labels taken as the tool's inclusion likelihood. (a) shows the ROC curve, with the four confidence thresholds given by the tool marked on the curve. (b) shows the precision, recall, and F1 scores as a function of the tool's confidence threshold, from the lowest confidence (at least "not recommended"/red) to the highest (at least "strongly recommended"/dark green).
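The ROC curve in Figure 3(a) is obtained by sweeping the tool's inclusion-likelihood score against the labels from the previous full screening. A minimal sketch using scikit-learn's roc_curve on toy data; the variable names, placeholder cut-points, and use of scikit-learn are assumptions, not details taken from the tool's implementation:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical inputs, one entry per pilot abstract:
# y_true[i] = 1 if the abstract was included by the previous full screening;
# scores[i] = the tool's inclusion likelihood for that abstract.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=400)
scores = np.clip(y_true * 0.5 + rng.normal(0.4, 0.2, size=400), 0, 1)

# Sweep all score thresholds to trace the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(f"AUC = {auc(fpr, tpr):.3f}")

# The four recommendation categories correspond to fixed cut-points on the
# score; marking them on the curve means locating each cut-point's (FPR, TPR).
# These cut-point values are placeholders, not the tool's actual thresholds.
for cut in (0.2, 0.4, 0.6, 0.8):
    preds = scores >= cut
    tp = np.sum(preds & (y_true == 1)); fp = np.sum(preds & (y_true == 0))
    fn = np.sum(~preds & (y_true == 1)); tn = np.sum(~preds & (y_true == 0))
    print(f"cut={cut}: FPR={fp / (fp + tn):.3f}, TPR={tp / (tp + fn):.3f}")
```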