Sunar, Emine Ayşe (2024) Improving DeepKinZero with protein language models and transductive learning. [Thesis]

Abstract
Phosphorylation is a critical post-translational modification that regulates numerous cellular processes, including cell signaling. Kinases are the enzymes responsible for catalyzing phosphorylation events. Due to their essential roles in the cell, kinases are major drug targets. The amino acid residue that receives the phosphate in the substrate protein is termed a phosphosite. While high-throughput experimental techniques can detect phosphosites, identifying the specific kinases that phosphorylate these sites remains challenging. Computational methods, which typically rely on supervised techniques and existing training data, fall short for understudied kinases, also known as dark kinases, due to insufficient examples for training.

Our research group previously addressed this data limitation by framing the prediction of dark kinases as a zero-shot learning problem and introduced DeepKinZero. DeepKinZero takes the phosphosite with its surrounding sequence, together with kinase attributes, and transfers knowledge from well-studied kinases to understudied kinases to make predictions. In this thesis, we aim to enhance DeepKinZero in several aspects. Firstly, we present a new evaluation setup in which the splitting strategy takes into account not only the zero-shot nature of the problem but also kinase group memberships and kinase sequence similarities. The resulting benchmark dataset, DARKIN, is designed to be a challenging benchmark that accurately assesses zero-shot learning performance on dark kinase-phosphosite prediction tasks. Secondly, we improve the protein sequence representation by evaluating various protein language models on this task. As part of this study, two zero-shot models (a zero-shot k-NN model and a zero-shot bi-linear model) are presented to benchmark the representation power of protein language models. Thirdly, we demonstrate that using kinase active sites can be as effective as using the entire kinase domain. Using these active sites slightly surpasses the performance of the original DeepKinZero model. Additionally, we explore a transductive approach and pseudo-labeling strategies to leverage the known sequences of the unlabeled phosphosites.
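To make the zero-shot bi-linear idea concrete, the sketch below scores a phosphosite embedding against a kinase embedding with a learned matrix `W`, trains `W` on well-studied (seen) kinases only, and then ranks kinases that never appeared in training. It is an illustration under stated assumptions, not the thesis's implementation: the function names, the plain gradient-descent loop, and the toy 2-D embeddings are all hypothetical, and in practice the embeddings would come from a protein language model.

```python
import numpy as np

def bilinear_scores(site_emb, W, kinase_emb):
    """Compatibility scores s[i, j] = site_emb[i] @ W @ kinase_emb[j]."""
    return site_emb @ W @ kinase_emb.T

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_bilinear(site_emb, labels, kinase_emb, lr=0.5, epochs=200):
    """Fit W with gradient descent on softmax cross-entropy over seen kinases."""
    n = site_emb.shape[0]
    W = np.zeros((site_emb.shape[1], kinase_emb.shape[1]))
    onehot = np.eye(kinase_emb.shape[0])[labels]
    for _ in range(epochs):
        probs = softmax(bilinear_scores(site_emb, W, kinase_emb))
        # Gradient of the mean cross-entropy with respect to W.
        grad = site_emb.T @ (probs - onehot) @ kinase_emb / n
        W -= lr * grad
    return W

def predict_zero_shot(site_emb, W, unseen_kinase_emb):
    """Return, for each site, the index of the best-scoring unseen kinase."""
    return bilinear_scores(site_emb, W, unseen_kinase_emb).argmax(axis=1)

# Toy data (hypothetical): two seen kinases with orthogonal embeddings.
seen_kinases = np.array([[1.0, 0.0], [0.0, 1.0]])
train_sites = np.array([[1.0, 0.0], [0.0, 1.0]])
train_labels = np.array([0, 1])  # index into seen_kinases

# Unseen ("dark") kinases: index 0 resembles seen kinase 1, index 1 resembles seen kinase 0.
unseen_kinases = np.array([[0.2, 0.8], [0.8, 0.2]])
test_sites = np.array([[1.0, 0.0], [0.0, 1.0]])

W = train_bilinear(train_sites, train_labels, seen_kinases)
predictions = predict_zero_shot(test_sites, W, unseen_kinases)
```

Because the score depends on a kinase only through its embedding, the same `W` can rank kinases with no training examples, which is what allows knowledge to transfer from well-studied to understudied kinases.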
Item Type: | Thesis |
---|---|
Uncontrolled Keywords: | Benchmark Dataset, Protein Language Models, Kinases, Phosphorylation, Zero-Shot Learning, Transductive Learning |
Subjects: | T Technology > TK Electrical engineering. Electronics. Nuclear engineering > TK7800-8360 Electronics > TK7885-7895 Computer engineering. Computer hardware |
Divisions: | Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.; Faculty of Engineering and Natural Sciences |
Depositing User: | Dila Günay |
Date Deposited: | 25 Mar 2025 11:25 |
Last Modified: | 25 Mar 2025 11:25 |
URI: | https://research.sabanciuniv.edu/id/eprint/51540 |