Identification of Mislabeled Samples and Sample Mix-ups in Genotype Data using Barcode Genotypes
Research output: Contribution to journal › Journal article › Research › peer-review
Standard
Identification of Mislabeled Samples and Sample Mix-ups in Genotype Data using Barcode Genotypes. / Have, Christian Theil; Appel, Emil Vincent Rosenbaum; Grarup, Niels; Hansen, Torben; Bork-Jensen, Jette.
In: International Journal of Bioscience, Biochemistry and Bioinformatics, Vol. 4, No. 5, 370, 2014, p. 355-360.
RIS
TY - JOUR
T1 - Identification of Mislabeled Samples and Sample Mix-ups in Genotype Data using Barcode Genotypes
AU - Have, Christian Theil
AU - Appel, Emil Vincent Rosenbaum
AU - Grarup, Niels
AU - Hansen, Torben
AU - Bork-Jensen, Jette
PY - 2014
Y1 - 2014
N2 - Undetected mislabeled samples may affect the results of genotype studies, particularly when rare genetic variants are investigated. Mislabeled samples are often not detected during quality control, and if they are detected, they are normally discarded due to the lack of a reliable method to recover the correct labels. Here we describe a statistical method which, given a few extra independent genotypes (barcode genotypes), detects mislabeled samples and recovers the correct labels for sample mix-ups. We have implemented the method in a program (named Wunderbar) and we evaluate the reliability of the method on simulated data. We find that even with only a small number of barcode genotypes, Wunderbar is capable of identifying mislabeled samples and sample mix-ups with high sensitivity and specificity, even with a high genotyping error rate and even in the presence of dependency between the individual barcode genotypes. To detect mislabeled samples we calculate the probability that the discordance between genotypes in the data and in the independent genotypes can be attributed to random (non-mislabeling) genotyping errors. To identify mix-ups we calculate the probability of identifying the set of identical genotypes between sample x and sample y by chance. Based on this we calculate a mix-up confidence score with penalization for introducing mismatches in the proposed new label and adjustment for independency among the genotypes. This confidence score is used to identify probable mix-ups.
U2 - 10.7763/IJBBB.2014.V4.370
DO - 10.7763/IJBBB.2014.V4.370
M3 - Journal article
VL - 4
SP - 355
EP - 360
JO - International Journal of Bioscience, Biochemistry and Bioinformatics
JF - International Journal of Bioscience, Biochemistry and Bioinformatics
SN - 2010-3638
IS - 5
M1 - 370
ER -