14th Annual UT-KBRIN Bioinformatics Summit 2017

Abstracts

Poster Presenters: Please submit your half-page abstract to Terry Mark-Major with the subject "Abstract for Summit 2017" by the registration deadline.

Hao Chen, University of Tennessee Health Science Center

Applying deep learning to predict phenotype based on genetic variation

Hao Chen, Department of Pharmacology, University of Tennessee Health Science Center, Memphis, TN 38103

Predicting phenotype from genetic variation has long been a goal of genetic studies. Deep learning, including deep neural networks (DNNs), has emerged as a superior method in many fields where machine learning is applied, such as image and speech recognition. Inspired by this success, we explored the potential of DNNs for learning phenotype-genotype associations. We used a well-characterized data set of heterogeneous stock rats (Baud et al., PMID:23708188) that contained 1407 individuals and many phenotypes. We chose to focus on coat color because it has the most complete data and a strong QTL, which is located on chr 1. We used the Keras library (v 2.0.2) with the Theano backend (v 0.9.0) to train DNNs on the 46,943 chr 1 SNPs. A GPU (GeForce GTX 1070, 8 GB) running CUDA (8.0.61) was used to accelerate computation. A simple neural network with one hidden layer of 200 neurons achieved an accuracy of 99.24% on the training data in predicting coat color after 100 training epochs. Accuracy dropped to 60.23% when new samples were tested, indicating overfitting. Using five hidden layers increased test accuracy slightly, to 61.93%. Further increasing the depth of the network reduced test accuracy. Adding dropout layers did not improve test accuracy. However, augmenting the samples by swapping 20% of SNPs and then adding the swapped samples to the training set increased test accuracy to 63.07%. Test accuracy was 60.8% when a network with five hidden layers was trained on the augmented samples. In contrast to chr 1, test accuracy was approximately 20% when these networks were trained on chr 2 data, which had no QTL for the phenotype. In summary, our data show that DNNs can learn genotype-phenotype associations directly from raw genotype data. Further performance improvement will likely require a much larger training data set. The code for this exercise is available in a GitHub repository (https://github.com/chen42/DNN4G2P/).
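
The abstract does not say exactly how SNPs were swapped, so the following is only a minimal sketch of one possible reading: each augmented sample copies a training sample and replaces 20% of its SNP calls with the calls of a randomly chosen partner that has the same phenotype label. The function name, the 0/1/2 genotype coding, and the toy labels "A" and "B" are all assumptions for illustration and are not taken from the linked repository.

```python
import random


def augment_by_snp_swap(genotypes, labels, swap_fraction=0.2, seed=0):
    """Build one augmented copy per sample (hypothetical interpretation).

    For each sample, a partner with the same label is chosen at random and
    `swap_fraction` of the SNP positions take the partner's genotype calls.
    """
    rng = random.Random(seed)
    n_snps = len(genotypes[0])
    n_swap = int(round(swap_fraction * n_snps))

    # Group sample indices by phenotype label.
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    augmented, augmented_labels = [], []
    for idx, label in enumerate(labels):
        partners = [j for j in by_label[label] if j != idx]
        if not partners:
            continue  # no same-label partner available for this sample
        partner = rng.choice(partners)
        positions = rng.sample(range(n_snps), n_swap)
        new_sample = list(genotypes[idx])
        for pos in positions:
            new_sample[pos] = genotypes[partner][pos]
        augmented.append(new_sample)
        augmented_labels.append(label)
    return augmented, augmented_labels


# Toy data (made up): 4 samples x 10 SNPs, genotypes coded 0/1/2.
genotypes = [[0] * 10, [2] * 10, [1] * 10, [1] * 10]
labels = ["A", "A", "B", "B"]
aug_x, aug_y = augment_by_snp_swap(genotypes, labels)
train_x = genotypes + aug_x  # augmented samples are added to the training set
train_y = labels + aug_y
```

Restricting partners to the same label keeps the augmented sample's phenotype label plausible; whether the original work imposed such a restriction is not stated.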

Naresh Prodduturi, Mayo Clinic

SREVED - Splicing Regulatory Element Variant Effect Determination

Naresh Prodduturi, Gavin R. Oliver, Ying Li, Eric W. Klee

Department of Biomedical Informatics and Statistics, Health Sciences Research, Mayo Clinic, 200 First Street, SW, Rochester, MN

Background:
More than 60% of the disease-causing variants in humans affect splice sites or splicing regulatory sites that cause differences in the splicing mechanism[1].
Many hereditary diseases are linked to misregulation in the splicing component[2]. These variants can affect a variety of splicing regulatory elements (SREs) including splice sites, splicing silencers, and splicing enhancers.
Existing DNA-based algorithms predict mutations that affect these elements, but the observed effects at the mRNA level are highly variable, resulting in a large number of false-positive events predicted as pathogenic mutations.
Available RNA-based methods also generate large numbers of false positive candidate events. Despite tool availability, a comprehensive pipeline to identify splice variants is still lacking.

Method:
We have developed an integrative rule-based approach to annotate cis-acting mutations in splice sites and SREs that cause splicing aberrations based on RNA-Seq and DNA-Seq data.
The pipeline takes as input an RNA-Seq BAM file, variant calls, DNA-Seq variant calls (optional), gene- and exon-level expression data, and junction read counts.
Optional features like variant calling and gene expression sample filtering are also available.
The pipeline restricts the variants to the local vicinity of junctions, integrates the various genomic input data types, summarizes them at the variant level, and uses a rule-based approach to annotate variants as mutations causing whole or partial intron retention, exon exclusion, or the abolition or introduction of splice sites.
Where reference expression data exist, the method can optionally predict the splicing aberration's effects on relative exon- or gene-level transcript expression.
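
The first two steps described above, restricting variants to the vicinity of junctions and summarizing evidence per variant, could look roughly like the sketch below. It is not the SREVED implementation: the class names, the 50 bp window, the 5-read threshold, and the single toy labeling rule are all hypothetical and are chosen only to show the shape of a rule-based pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    chrom: str
    start: int        # intron start (donor side), 1-based
    end: int          # intron end (acceptor side), 1-based
    read_count: int   # spliced reads supporting the junction
    annotated: bool   # present in the reference annotation?


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int


def summarize_variants(variants, junctions, window=50, min_reads=5):
    """Keep variants within `window` bp of a junction boundary and
    summarize the junction evidence for each kept variant."""
    rows = []
    for v in variants:
        nearby = [
            j for j in junctions
            if j.chrom == v.chrom
            and min(abs(v.pos - j.start), abs(v.pos - j.end)) <= window
        ]
        if not nearby:
            continue  # variant is not in the local vicinity of any junction
        novel_reads = sum(j.read_count for j in nearby if not j.annotated)
        # Toy rule (assumption): flag the variant when nearby unannotated
        # junctions are supported by at least `min_reads` reads.
        label = "candidate_novel_splice_site" if novel_reads >= min_reads else "no_call"
        rows.append({
            "variant": v,
            "n_junctions": len(nearby),
            "novel_reads": novel_reads,
            "label": label,
        })
    return rows


# Made-up example: one annotated junction and one unannotated junction.
junctions = [
    Junction("chr1", 1000, 2000, read_count=40, annotated=True),
    Junction("chr1", 1030, 2000, read_count=12, annotated=False),
]
variants = [Variant("chr1", 1025), Variant("chr1", 5000), Variant("chr2", 1025)]
summary = summarize_variants(variants, junctions)
```

A real rule set would also need the expression and intron-coverage inputs listed above to distinguish intron retention from exon exclusion; those rules are not given in the abstract and are not guessed at here.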

Results:
With the proposed pipeline, SREVED (Splicing Regulatory Element Variant Effect Determination), aberrant splice variants can be detected with high specificity, and the pipeline can be applied to genomics and transcriptomics studies.
We have used this pipeline to identify causal genetic variants affecting splicing in unsolved diagnostic odyssey cases.

References:
1. López-Bigas, Núria; Audit, Benjamin; Ouzounis, Christos; Parra, Genís; Guigó, Roderic (2005). "Are splicing mutations the most frequent cause of hereditary disease?". FEBS Letters. 579 (9): 1900–3.
2. Lim, KH; Ferraris, L; Filloux, ME; Raphael, BJ; Fairbrother, WG (2011). "Using positional distribution to identify splicing elements and predict pre-mRNA processing defects in human genes". Proc. Natl. Acad. Sci. USA. 108 (27): 11093–11098.

Acknowledgements
This work was supported by the Center for Individualized Medicine, Mayo Clinic.

Kalpani De Silva, University of Louisville

Analysis in modern horses for non-caballine introgression

Kalpani De Silva¹*, Ernest Bailey², Joel Claiborne Stephens³, Theodore S. Kalbfleisch⁴

¹Interdisciplinary Studies Program: Specialization in Bioinformatics, University of Louisville, Louisville, Kentucky 40292 ²Department of Veterinary Science, University of Kentucky, Lexington, Kentucky 40506 ³Genomics GPS, Guilford CT 06437 ⁴Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, Kentucky 40292

*kandaudamalinika.desilva@louisville.edu

Horse breeds have undergone many outcrossing events over time. A recent study identified a region of introgression from an event estimated by an evolutionary clock to have occurred about 500,000 years ago. We are now searching for other introgression events that might explain adaptive introgression in modern horses. Our study consists of six animals: three horses and three non-caballine equids (a kiang, a zebra, and a Somali ass). Because the events we are searching for happened so long ago, the haplotype blocks are likely to be much smaller than those reported in humans, such as those from human × Denisovan or human × Neanderthal introgression, necessitating a different search strategy. Here we present an algorithm and preliminary results of our search for non-caballine introgression in modern horses.

Arvind Ramanathan, ORNL

Computing to cure cancer: Developing exascale integrative bioinformatics tools for effective cancer surveillance

Joint work with: Hong-Jun Yoon, James B. Christian, John X. Qiu, Shang Gao, Tianxiang Chen, Paul A. Fearn, Lynne Penberthy, Georgia D. Tourassi

The nation has embarked on an "all government" approach to cancer. As part of this initiative, the Department of Energy (DOE) has entered into a partnership with the National Cancer Institute (NCI) of the National Institutes of Health (NIH). This partnership has identified three key challenges on which the combined resources of DOE and NIH can accelerate progress. At ORNL, we are leading the population health pilot, which focuses on improving the ability to monitor cancer patients across the country. Cancer registries from the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) program collect cancer data for about 30% of the US population and process a high volume of pathology reports. These information-rich reports are usually hand coded, a process that not only suffers from human variability in the interpretation and application of coding rules but also scales poorly given the ever-increasing duration and complexity of cancer care. To overcome these challenges and to address issues related to the comprehensiveness, variability, and timeliness of reporting, we are developing text comprehension tools based on scalable deep learning approaches to automate aspects of information extraction from clinical pathology reports. We present our results in assessing the effects of class prevalence and inter-class transfer learning using a set of 947 pathology reports with human expert annotations as the gold standard. The information extraction task is the abstraction of primary cancer site topography. By systematically varying the complexity of a deep learning network topology, we demonstrate that even with limited training data, it is possible to achieve comparable task-level accuracy. Since annotated datasets for training deep learning algorithms are challenging to obtain, we also developed several classes of deep generative models that can faithfully synthesize pathology report "examples", which can then be used to train text comprehension tools.
We highlight our experience in developing end-to-end workflows and scaling text comprehension tools across heterogeneous compute architectures, including ORNL's Titan supercomputer, the next-generation Summitdev supercomputer, and the NVIDIA DGX-1. Collectively, our results demonstrate the potential of deep text comprehension tools for automated information extraction from pathology reports.

Kazi Zaman, University of Memphis

Evaluation of Gene Networks Using Literature Cohesion

GeneNetwork is a web tool that allows analysis of genetic and gene expression datasets across a large panel of recombinant inbred mice. Analysis of GeneNetwork data is challenging due to variability in microarray platforms, normalization methods, and biological factors. The goal of this project was to develop an analysis approach based on literature-derived functional cohesion (GCAT) to evaluate GeneNetwork output and extract meaningful insights. To evaluate our approach, we used a set of 429 SIRT3 target proteins (genes), determined by mass spectrometry, as the gold standard (Rardin et al., 2013). We computed the literature cohesion p-value for the top 500 genes correlated with SIRT3 in GeneNetwork using different platforms and normalization methods. We found a high correlation (linear regression R² = 0.97) between literature cohesion and overlap with the gold standard. Our results suggest that literature cohesion analysis is useful for filtering gene networks derived from high-throughput experiments.
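
The evaluation above relates two quantities per dataset: a literature cohesion score and the overlap of the top correlated genes with the gold standard. The sketch below shows how such an overlap and a simple linear regression R² could be computed. It is an illustration only: the gene names and numbers are made up, the function names are hypothetical, and the abstract does not say how the cohesion p-value was transformed before regression (something like -log10 of the p-value is assumed here).

```python
def overlap_fraction(top_genes, gold_standard):
    """Fraction of the top correlated genes that are in the gold standard."""
    top = set(top_genes)
    return len(top & set(gold_standard)) / len(top)


def r_squared(xs, ys):
    """R^2 of a simple linear regression of ys on xs.

    With one predictor this equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up toy values, one entry per platform/normalization combination.
cohesion_scores = [1.0, 2.0, 3.0, 4.0]    # assumed: -log10(cohesion p-value)
gold_overlaps = [2.0, 4.0, 6.0, 8.0]      # assumed: overlap with gold standard
fit_quality = r_squared(cohesion_scores, gold_overlaps)

example_overlap = overlap_fraction(["A", "B", "C", "D"], {"B", "D", "E"})
```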