MultiGML combines relevant knowledge and experimental data (e.g., gene expression data, microscopy images, protein sequences) in a comprehensive, structured, and unbiased manner.
Initial situation and context
Finding the right drug target
Identifying a good target is key for the effectiveness and safety of a drug candidate in pharmaceutical research. Traditionally, the identification of targets is based on human knowledge and understanding of fundamental disease mechanisms. However, given the growing flood of scientific literature, this manual approach is prone to overlooking relevant data and information, which may lead to sub-optimal choices.
Adverse drug events as a risk for clinical trials
An adverse drug event (ADE) is an injury resulting from the use of a drug. The term covers harm caused by the drug itself (adverse drug reactions and overdoses) as well as harm arising from the use of the drug (including dose reductions and discontinuations of drug therapy). The occurrence of an ADE can be attributed, at least in part, to the choice of the primary target protein or to properties of the drug's chemical structure. Experimental approaches for assessing potential ADEs (e.g., liver toxicity) in animal and tissue models are well established. Yet results obtained in such model systems do not always reflect the situation in humans, and the use of animal models raises ethical concerns. Furthermore, reliable model systems do not exist for all ADEs and indication areas.
Limitations of human genetics
One possible way to address the above-mentioned concerns is to check whether genetic variants in a candidate drug target have been associated with an unfavorable phenotype. This approach has been reported to be highly effective in cases where such an association could be identified. However, there is a high risk of missing relevant associations due to a lack of statistical power.
Learning from comprehensive knowledge and experimental data in an unbiased manner
Our solution combines relevant knowledge and experimental data (e.g., gene expression data, microscopy images, protein sequences) in a comprehensive, structured, and unbiased manner. Technically, this is realized via a semantically harmonized knowledge graph, which we compiled from 14 curated biological databases, resulting in around 400,000 relations between proteins, drugs, and phenotypes, including ADEs (Figure 1). Based on this wealth of information, we trained a multimodal Graph Neural Network architecture capable of accurately predicting new associations between compounds and phenotypes or between protein targets and phenotypes. What distinguishes MultiGML from alternative solutions is its ability to deliver explanations for its model predictions (Figure 2).
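To make the idea of a typed knowledge graph and link prediction more concrete, the following is a minimal sketch in plain Python. It is not the MultiGML implementation: the node names, relation names, and helper functions are hypothetical and chosen only for illustration. It stores relations as (head, relation, tail) triples and enumerates the compound-phenotype or protein-phenotype pairs that are not yet in the graph, which are the candidates a trained model would score.

```python
# Toy knowledge graph as typed triples. All identifiers below are
# hypothetical examples, not entries from the MultiGML knowledge graph.
TRIPLES = [
    ("drug:A", "targets", "protein:P1"),
    ("drug:B", "targets", "protein:P2"),
    ("protein:P1", "interacts_with", "protein:P2"),
    ("protein:P1", "associated_with", "phenotype:liver_toxicity"),
    ("drug:A", "causes", "phenotype:liver_toxicity"),
]


def node_type(node):
    """Return the node type encoded as the prefix before the colon."""
    return node.split(":", 1)[0]


def candidate_links(triples, head_type, tail_type):
    """List (head, tail) pairs of the given types with no known relation.

    These unobserved pairs are what a link-prediction model would score.
    """
    nodes = {n for head, _, tail in triples for n in (head, tail)}
    known = {(head, tail) for head, _, tail in triples}
    heads = sorted(n for n in nodes if node_type(n) == head_type)
    tails = sorted(n for n in nodes if node_type(n) == tail_type)
    return [(h, t) for h in heads for t in tails if (h, t) not in known]


drug_candidates = candidate_links(TRIPLES, "drug", "phenotype")
protein_candidates = candidate_links(TRIPLES, "protein", "phenotype")
```

In this toy graph, the only unobserved drug-phenotype pair is drug:B with liver toxicity, and the only unobserved protein-phenotype pair is protein:P2 with liver toxicity. A real graph of the size described above would need negative sampling rather than full enumeration.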
We offer MultiGML in two variants:
- MultiGML_Model: Our fully trained MultiGML model in both RGCN and RGAT variants (PyTorch version 1.9.1), ready for use, with instructions for installing the environment and applying the pre-trained model.
- MultiGML_Code: The code for running MultiGML with a custom knowledge graph. It includes the scripts we used in our previous publication to generate the node features of the knowledge graph, as well as scripts to explain predictions post hoc using the Integrated Gradients method. The node feature generation scripts cover the following features per node type:
  - Drugs: molecular fingerprint, gene expression signature, morphological fingerprint
  - Proteins: sequence embedding, gene ontology fingerprint
  - Phenotypes: medical concept embedding
Instructions for installing the environments and using the command-line interface are included. This gives you full flexibility: you can enrich your knowledge graph as needed, train your own customized MultiGML instance, and subsequently explain its predictions.
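For readers unfamiliar with Integrated Gradients, the following minimal sketch illustrates the general method on a toy logistic model in plain Python. It is not the MultiGML explanation script: the weights, the input, the all-zero baseline, and the function names are assumptions made for illustration. The attribution of feature i is (x_i - baseline_i) multiplied by the average gradient of the model output along the straight path from the baseline to the input, approximated here with a midpoint Riemann sum.

```python
import math

# Hypothetical toy model: logistic regression with fixed weights.
WEIGHTS = [1.5, -2.0, 0.5]
BIAS = 0.1


def predict(x):
    """Model output: sigmoid of a weighted sum of the input features."""
    z = sum(w * v for w, v in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    """Analytical gradient of predict with respect to each input feature."""
    p = predict(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    grad_sum = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            grad_sum[i] += g
    # Scale the average gradient by the input's distance from the baseline.
    return [(v - b) * s / steps for v, b, s in zip(x, baseline, grad_sum)]


x = [1.0, 0.5, 2.0]          # hypothetical input features
baseline = [0.0, 0.0, 0.0]   # assumed all-zero reference input
attributions = integrated_gradients(x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline. In a graph setting, the same principle would be applied to node features or edges, with gradients obtained by automatic differentiation rather than by hand.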