Identification of disease related significant SNPs

Sol, Ceyda (2010) Identification of disease related significant SNPs. [Thesis]

Abstract

Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide in the genome sequence is altered. Because variations in DNA sequence can have a major impact on complex human diseases such as obesity, epilepsy, type 2 diabetes, and rheumatoid arthritis, SNPs have become increasingly significant in the identification of such diseases. Recent biological studies indicate that a single altered gene may have only a small effect on a complex disease, whereas interactions between multiple genes may play a significant role. Identifying multiple genes associated with complex disorders is therefore essential, and combinations of multiple SNPs rather than individual SNPs should be analyzed. However, assessing a very large number of SNP combinations is computationally challenging, and as a result the literature contains only a limited number of studies on extracting statistically significant SNP combinations. In this thesis, we focus on this problem and develop a five-step "disease-associated multi-SNP combinations search procedure" to identify statistically significant SNP combinations and the significant rules defining the associations between SNPs and a specified disease. The proposed five-step procedure is applied to the simulated rheumatoid arthritis data set provided by Genetic Analysis Workshop 15. In each step, statistically significant SNPs are extracted from the available set of SNPs that are not yet classified as significant or insignificant. In the first step, genome-wide association (GWA) analysis is performed on the original complete multi-family data set. In the second step, we use a tag SNP selection algorithm to find a smaller subset of informative SNP markers. In the literature, most tag SNP selection methods are based on pairwise (two-marker) linkage disequilibrium (LD) measures.
In this thesis, however, both pairwise and multiple-marker LD measures are incorporated to improve the genetic coverage. Up to the third step, the procedure aims to identify individual significant SNPs. In the third step, a genetic algorithm (GA) based feature selection method is applied. The GA constructs a significant combination of SNPs by maximizing the explanatory power of the selected SNPs while dynamically trying to decrease the number of selected SNPs. Because the GA is a probabilistic search approach, each execution may yield a different SNP combination. We run the GA several times to obtain multiple significant SNP combinations, and for each combination we calculate the associated pseudo R-squared value and apply statistical tests to check its significance. We also consider the union and the intersection of the SNP combinations identified by the GA as potentially significant SNP combinations. After identifying multiple statistically significant SNP combinations, in the fourth and fifth steps we focus on extracting rules that explain the association between the SNPs and the disease. In the fourth step, we apply a classification method called Decision Tree Forest to calculate the importance values of the individual SNPs that belong to at least one of the SNP combinations found by the GA. Since each marker in a SNP combination is bi-allelic, the genotypes of a SNP can affect the disease status, and the different genotypes of the SNPs are used to define candidate rules. In the fifth step, using the calculated importance values and the occurrence percentage of each candidate rule in the data set, we apply our proposed rule extraction method to select rules from among the candidates. The literature contains many classification approaches, such as the decision tree, decision forest, and random forest. Each of these methods considers SNP interactions that are explanatory for a large subset of patients.
However, in real life some SNP interactions that are observed only in a small subset of patients might cause the disease, and the existing classification methods do not identify such interactions as significant. The proposed five-step multi-SNP combinations procedure extracts these interactions as well as the others. This is a significant contribution to research on identifying interactions that may cause a person to have the disease.
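The genetic algorithm step described in the abstract (maximize the explanatory power of the selected SNPs while keeping the selection small) can be illustrated with a minimal sketch. This is not the thesis implementation. The function names ga_select and toy_fitness, the toy genotype data, the size penalty weight, and all GA parameters are assumptions made for illustration, and a simple threshold-rule accuracy stands in for the pseudo R-squared that the thesis uses as its measure of explanatory power.

```python
import random

# Hypothetical toy data: rows are individuals, columns are SNPs,
# values are minor-allele counts (0, 1 or 2). STATUS is 1 for cases.
GENOTYPES = [
    [2, 0, 1, 0],
    [1, 0, 2, 1],
    [0, 1, 0, 0],
    [0, 2, 0, 1],
    [2, 1, 1, 0],
    [0, 0, 0, 2],
]
STATUS = [1, 1, 0, 0, 1, 0]


def toy_fitness(mask, penalty=0.02):
    """Assumed stand-in for explanatory power: accuracy of the rule
    'case if any selected SNP carries a minor allele', minus a penalty
    for every selected SNP (to favour small combinations)."""
    selected = [j for j, bit in enumerate(mask) if bit]
    correct = 0
    for row, label in zip(GENOTYPES, STATUS):
        predicted = 1 if sum(row[j] for j in selected) > 0 else 0
        correct += predicted == label
    return correct / len(STATUS) - penalty * len(selected)


def ga_select(n_snps, fitness, pop_size=30, generations=40,
              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search over 0/1 masks of length n_snps with a basic GA.

    Returns (best_mask, best_score, history), where history holds the
    best score of each generation plus the final population."""
    rng = random.Random(seed)  # seeded so a run is reproducible
    pop = [[rng.randint(0, 1) for _ in range(n_snps)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        next_pop = [best[:]]  # elitism: the best mask always survives
        while len(next_pop) < pop_size:
            # Binary tournament selection of two parents.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            if n_snps > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_snps - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each position.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, fitness(best), history


best, score, history = ga_select(len(GENOTYPES[0]), toy_fitness)
print(best, round(score, 3))
```

Changing the seed argument gives different runs, which mirrors the point made in the abstract that a probabilistic search may return a different SNP combination on each execution. The masks from several runs could then be combined by union or intersection.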
Item Type: Thesis
Uncontrolled Keywords: Genome wide association analysis. -- Tag SNP selection. -- Genetic algorithm. -- Feature selection. -- Association rule mining. -- SNP combination. -- SNP. -- Genome association analysis. -- Single nucleotide polymorphism (SNP). -- Feature selection method. -- Rule mining. -- Informative SNP selection.
Subjects: T Technology > T Technology (General) > T055.4-60.8 Industrial engineering. Management engineering
Divisions: Faculty of Engineering and Natural Sciences > Academic programs > Manufacturing Systems Eng.
Faculty of Engineering and Natural Sciences
Depositing User: IC-Cataloging
Date Deposited: 07 Jun 2012 16:51
Last Modified: 26 Apr 2022 09:56
URI: https://research.sabanciuniv.edu/id/eprint/19092
