A Machine-Learning-Based Approach to Prediction of Biogeographic Ancestry within Europe
2023
Authors:
Kloska, Anna
Giełczyk, Agata
Grzybowski, Tomasz
Płoski, Rafał
Kloska, Sylwester
Marciniak, Tomasz
Pałczynski, Krzysztof
Rogalla-Ładniak, Urszula
Malyarchuk, Boris
Derenko, Miroslava
Kovačević-Grujičić, Nataša
Stevanović, Milena
Drakulić, Danijela
Davidović, Slobodan
Spólnicka, Magdalena
Zubanska, Magdalena
Wozniak, Marcin
Document type:
Journal article (Published version)
Abstract:
Data obtained by massively parallel sequencing (MPS) can be valuable in population genetics studies. In particular, such data have the potential to distinguish samples from different populations, especially adjacent populations of common origin. Machine learning (ML) techniques appear especially well suited to analyzing the large datasets obtained using MPS. Slavic populations constitute about a third of the population of Europe and inhabit a large area of the continent, yet they are relatively closely related in population genetics terms. In this proof-of-concept study, various ML techniques were used to classify DNA samples from Slavic and non-Slavic individuals. The primary objective was to evaluate empirically whether the genetic provenance of genetically similar individuals of Slavic descent can be discerned, with the overarching goal of classifying DNA samples from representatives of different Slavic populations. Raw sequencing data were pre-processed to obtain a 1200-character-long binary vector. Three classifiers were used: Random Forest, Support Vector Machine (SVM), and XGBoost. The most promising results were obtained using SVM with a linear kernel, with 99.9% accuracy and F1-scores of 0.9846–1.000 for all classes.
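The abstract reports overall accuracy together with a separate F1-score for each class. As a minimal sketch of how such figures are computed from true and predicted labels, the Python below uses only the standard library. The function name `per_class_f1` and the toy labels "A", "B", and "C" are hypothetical placeholders. They are not the study's populations, data, or code, and the numbers they produce have no relation to the reported results.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return (accuracy, {label: F1}) for a multi-class prediction.

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count true positives, false positives and false negatives per label.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but the sample was not p
            fn[t] += 1  # true label t was missed

    f1 = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when the label never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1[label] = 2 * tp[label] / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return accuracy, f1


# Toy example with made-up labels (assumption: three classes, six samples).
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
accuracy, f1 = per_class_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}")
for label, score in f1.items():
    print(f"F1[{label}] = {score:.3f}")
```

Reporting F1 per class, rather than accuracy alone, shows whether any single population is systematically confused with a neighboring one, which matters when the classes are closely related.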
Keywords:
machine learning; SVM; biogeographic origin; biogeographic ancestry
Source:
International Journal of Molecular Sciences, 2023, 24, 20, 15095
Funding / projects:
- National Centre for Research and Development within the framework of the project NEXT (DOBBIO7/ 17/01/2015)
- Ministry of Science, Technological Development and Innovation of the Republic of Serbia, institutional funding - 200007 (University of Belgrade, Institute for Biological Research 'Siniša Stanković') (RS-MESTD-inst-2020-200007)
- Ministry of Science, Technological Development and Innovation of the Republic of Serbia, institutional funding - 200042 (University of Belgrade, Institute of Molecular Genetics and Genetic Engineering) (RS-MESTD-inst-2020-200042)
DOI: 10.3390/ijms242015095
ISSN: 1422-0067