SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications

Full item details

dc.contributor.author
Becker, Devan
Champredon, David
Chato, Connor
Gugan, Gopi
Poon, Art
dc.date.accessioned
2024-03-06T15:50:14Z
dc.date.available
2024-03-06T15:50:14Z
dc.date.issued
2023-04-24
dc.description.abstract - en
Genetic sequencing is subject to many different types of errors, but most analyses treat the resultant sequences as if they were known without error. Next-generation sequencing methods rely on significantly larger numbers of reads than previous sequencing methods, in exchange for a loss of accuracy in each individual read. Still, the coverage of such machines is imperfect and leaves uncertainty in many of the base calls. In this work, we demonstrate that the uncertainty in sequencing techniques affects downstream analysis, and we propose a straightforward method to propagate that uncertainty. Our method (which we have dubbed Sequence Uncertainty Propagation, or SUP) uses a probabilistic matrix representation of individual sequences that incorporates base quality scores as a measure of uncertainty, which leads naturally to resampling and replication as a framework for uncertainty propagation. With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis. Analyses based on these resampled sequences account more completely for the error involved. We demonstrate our resampling method on SARS-CoV-2 data. The resampling procedures add a linear computational cost to the analyses, but the large impact on the variance of downstream estimates makes it clear that ignoring this uncertainty may lead to overly confident conclusions. We show that SARS-CoV-2 lineage designations via Pangolin are much less certain than the bootstrap support reported by Pangolin would imply, and that the clock rate estimates for SARS-CoV-2 are much more variable than reported.
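The resampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that quality scores are Phred-scaled (error probability 10^(-Q/10)) and that the error probability is split evenly among the three bases that were not called; the function names `phred_to_error`, `probability_matrix`, and `resample` are hypothetical.

```python
import random

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality score to an error probability (assumed scaling)."""
    return 10 ** (-q / 10)

def probability_matrix(seq, quals):
    """Build one row of four base probabilities (A, C, G, T) per position.

    Assumption: the called base gets 1 - p_err and the remaining three
    bases share p_err equally.
    """
    matrix = []
    for base, q in zip(seq, quals):
        p_err = phred_to_error(q)
        row = [p_err / 3] * 4
        row[BASES.index(base)] = 1 - p_err
        matrix.append(row)
    return matrix

def resample(matrix, n, rng):
    """Draw n sequences, sampling each position independently from its row."""
    return [
        "".join(rng.choices(BASES, weights=row)[0] for row in matrix)
        for _ in range(n)
    ]

if __name__ == "__main__":
    rng = random.Random(1)
    m = probability_matrix("ACGT", [30, 30, 10, 40])
    for s in resample(m, 5, rng):
        print(s)
```

Each resampled sequence would then be passed through the same downstream analysis (for example lineage assignment or clock rate estimation), and the spread of the results gives a picture of how much the sequencing uncertainty matters. The cost grows linearly with the number of resamples.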
dc.identifier.doi
https://doi.org/10.1093/nargab/lqad038
dc.identifier.issn
2631-9268
dc.identifier.pubmedID
37101658
dc.identifier.uri
https://open-science.canada.ca/handle/123456789/1996
dc.language.iso
en
dc.publisher
Oxford University Press
dc.rights - en
Creative Commons Attribution 4.0 International (CC BY 4.0)
dc.rights - fr
Creative Commons Attribution 4.0 International (CC BY 4.0)
dc.rights.openaccesslevel - en
Gold
dc.rights.openaccesslevel - fr
Or
dc.rights.uri - en
https://creativecommons.org/licenses/by/4.0/
dc.rights.uri - fr
https://creativecommons.org/licenses/by/4.0/deed.fr
dc.subject - en
Health
dc.subject - fr
Santé
dc.subject.en - en
Health
dc.subject.fr - fr
Santé
dc.title - en
SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications
dc.type - en
Article
dc.type - fr
Article
local.acceptedmanuscript.articlenum
lqad038
local.article.journalissue
2
local.article.journaltitle
NAR Genomics and Bioinformatics
local.article.journalvolume
5
local.pagination
1-12
local.peerreview - en
Yes
local.peerreview - fr
Oui
Download(s)

Original bundle

Name: becker-sup-probabilistic-framework-to-propagate-genome-sequence-uncertainty.pdf

Size: 1.29 MB

Format: PDF
