
Abstracts

Hao Chen, University of Tennessee Health Science Center

Synaptic plasticity in infralimbic cortex mediates socially-acquired nicotine self-administration

Hao Chen, Tengfei Wang, Wenyan Han, Department of Pharmacology, University of Tennessee Health Science Center, Memphis, TN 38163

Social environment plays a critical role in the initiation of cigarette smoking among adolescents. It has been established that nicotine, the principal psychoactive ingredient of tobacco products, has both aversive and rewarding properties. Using olfactogustatory stimuli as the sensory cue for intravenous nicotine self-administration (SA), we have shown that social learning of a nicotine-contingent odor cue prevented rats from developing conditioned taste aversion (CTA) and instead allowed them to establish stable nicotine SA. We have also shown that the infralimbic cortex is a critical brain region for the effect of social learning in reversing nicotine conditioned aversion. We hypothesized that gene expression changes related to synaptic plasticity in the infralimbic cortex underlie this effect of social learning. We trained rats to self-administer i.v. nicotine in either an inducing or a neutral social environment, with contingent olfactogustatory cues, for three days. Rats were then tested using a standard CTA protocol on day four. Rats were killed immediately after the CTA test and the infralimbic cortex was dissected. RNA was then extracted for transcriptome sequencing on the Ion Proton instrument. Reads were mapped to the reference genome (rn6). Gene expression levels were estimated using HTSeq, with the RefSeq database as the reference annotation. DESeq2 was then used to normalize the expression levels and to identify statistically significant differences between the groups placed in the inducing and neutral social environments. GSEA analysis supported our hypothesis that genes related to synaptic plasticity are preferentially regulated by the social environment. Funding: UTHSC Center for Integrative and Translational Genomics.
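The normalization step performed by DESeq2 is based on its median-of-ratios method: each count is divided by the gene's geometric mean across samples, and the per-sample median of those ratios becomes the size factor. A minimal numpy illustration of the idea (not the package itself, which does considerably more):

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Estimate per-sample size factors, DESeq2-style.
    `counts` is a genes x samples array of raw read counts."""
    counts = np.asarray(counts, dtype=float)
    log_counts = np.log(counts)
    # per-gene geometric mean; genes with any zero count are excluded
    log_geo_means = log_counts.mean(axis=1)
    usable = np.isfinite(log_geo_means)
    # per-sample median of log-ratios, back-transformed to linear scale
    log_ratios = log_counts[usable] - log_geo_means[usable, None]
    return np.exp(np.median(log_ratios, axis=0))

# toy example: sample 2 is sequenced twice as deeply as sample 1
counts = np.array([[10, 20],
                   [100, 200],
                   [50, 100]])
sf = median_of_ratios_size_factors(counts)
normalized = counts / sf  # expression is comparable across samples
```

Dividing by the size factors removes the library-size difference, so a gene with equal true expression gets equal normalized counts in both samples.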

Sally R. Ellingson, University of Kentucky

Convex-hull voting method on a large data set

Sally R. Ellingson1,3,*, Chi Wang2,4, and Radhakrishnan Nagarajan1

1Division of Biomedical Informatics, College of Public Health, University of Kentucky, Lexington, KY, USA 2Division of Cancer Biostatistics, College of Public Health, University of Kentucky, Lexington, KY, USA 3Cancer Research Informatics Shared Resource Facility, Markey Cancer Center, Lexington, KY, USA 4Biostatistics and Bioinformatics Shared Resource Facility, Markey Cancer Center, Lexington, KY, USA

*sally@kcr.uky.edu

Genes work in concert as a system, not as independent entities, to mediate disease states. There has been considerable interest in understanding variations in molecular signatures between normal and disease states. The selective-voting convex-hull ensemble procedure accommodates molecular heterogeneity within and between groups and allows retrieval of sample-specific sets and investigation of variations in individual networks relevant to personalized medicine. The work here describes using the convex-hull voting method on a large data set.

Hypotheses Using parallelization techniques, we predict that we can execute the convex-hull voting algorithm on the University of Kentucky cluster (DLX) using a dataset much too large to run in a feasible time on a single machine.

Procedures Normalized RNA-seq data for 208 samples (104 matched normal/tumor pairs) from the TCGA breast carcinoma data set were downloaded and analyzed with the edgeR package, which identified 2,882 differentially expressed genes with at least a 2-fold difference between tumor and normal samples at a 1% false discovery rate. The convex-hull voting method1 was applied to data from the differentially expressed genes. A general idea of the algorithm, including its levels of parallelism, is given in Figure 1.

A parallel-for loop within the R code allows multiple processors on a node to concurrently perform the voting calculations for different sample pairs within one iteration. Multiple jobs are then submitted to perform the randomized iterations. This turns a computationally intensive problem into a data-intensive one, since each iteration produces just over 6 GB of data.
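The two levels of parallelism described above (a parallel-for over sample pairs within each iteration, plus independently submitted jobs for the randomized iterations) can be sketched as follows. The original implementation is in R; this Python sketch uses a hypothetical `vote_on_pair` stand-in for the actual convex-hull voting computation:

```python
from itertools import combinations
from multiprocessing import Pool

def vote_on_pair(pair):
    # hypothetical stand-in for the convex-hull voting computation
    # performed for one pair of samples within a single iteration
    i, j = pair
    return (i, j, (i * j) % 7)  # placeholder "vote"

def run_iteration(n_samples, n_workers=4):
    """One randomized iteration: score every sample pair concurrently
    across the workers of a single node (the inner parallel-for)."""
    pairs = list(combinations(range(n_samples), 2))
    with Pool(n_workers) as pool:
        return pool.map(vote_on_pair, pairs)

if __name__ == "__main__":
    # the outer level of parallelism would submit many such iterations
    # as separate cluster jobs
    votes = run_iteration(8)  # 8 samples -> C(8, 2) = 28 pair computations
```

Because the iterations share no state, they scale out trivially as separate jobs; only the pair-level loop needs shared-memory parallelism.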

Results The final runtime of one iteration on the large dataset was just under 34 hours, and up to 32 iterations can run concurrently. The entire run of 100 iterations on this large data set took less than a week.

Future Work Future work will involve parallelizing the entire set of computationally and data-intensive steps in a way that reduces the complexity of job submission and improves the scalability of the entire job. Computing paradigms such as Hadoop are being explored for this task.

Acknowledgements This research was supported by the Cancer Research Informatics and the Biostatistics and Bioinformatics Shared Resource Facilities of the University of Kentucky Markey Cancer Center (P30CA177558) and the University of Kentucky Center for Computational Sciences.

References 1. Nagarajan, R.; Kodell, R. L., A Selective Voting Convex-Hull Ensemble Procedure for Personalized Medicine. AMIA Summits on Translational Science Proceedings 2012, 2012, 87.

Tamas S Gal, University of Kentucky

Using large public data repositories to discover novel genetic mutations with prospective links to melanoma

Tamas S Gal*, Sally R Ellingson, Chi Wang, Jinpeng Liu, Stuart G Jarrett, John A D’Orazio

Markey Cancer Center, University of Kentucky

tamas.gal@uky.edu

Background Next generation sequencing (NGS) data analysis pipelines are frequently described in the literature. NGS data is relatively easy to acquire from national data repositories, and most software used in these pipelines is open source. This study extends research on the causal relation between changes in the ataxia telangiectasia and Rad3 related (ATR) pathways and melanoma [1].

Procedures To study the effects of mutations in the ATR region on melanoma, we downloaded the Melanoma Genome Sequencing Project dataset (dbGaP Study Accession: phs000452.v1.p1) from the dbGaP repository [2]. The dataset contained whole-exome sequencing data for paired normal and tumor samples from 122 phenotyped subjects, in the form of trimmed and aligned BAM files. The dataset also included basic demographic information, such as gender and age, as well as disease-specific variables, such as the localization and stage of the melanoma. The total size of the dataset was over 4 TB, so we downloaded only the region of interest (the ATR gene region) with 50 kb of padding before and after it. For analysis we used an available pipeline (Figure 1) that was previously developed for a lung cancer project by the Biostatistics and Bioinformatics Shared Resource Facility of the University of Kentucky Markey Cancer Center. The details of the data analysis pipeline will be published elsewhere. We used Python to automate data submission to the pipeline, in combination with a configuration file that allowed us to easily swap different versions of the tools used in the pipeline and to match normal and tumor samples for the same patient (Figures 2, 3). Our experiments were executed on the Lipscomb High Performance Computing Cluster at the University of Kentucky.
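The configuration-driven automation described above might look like the following sketch. The tool names, versions, flags, and file names here are illustrative placeholders, not the actual pipeline's configuration:

```python
import json

# illustrative configuration: each pipeline step maps to a tool and
# version, so versions can be swapped without touching the driver code
CONFIG = json.loads("""
{
  "aligner": {"tool": "bwa", "version": "0.7.12"},
  "caller":  {"tool": "mutect", "version": "1.1.7"}
}
""")

def build_commands(config, normal_bam, tumor_bam):
    """Assemble (hypothetical) command lines for one matched
    normal/tumor pair of the same patient."""
    aligner = config["aligner"]
    caller = config["caller"]
    return [
        f"{aligner['tool']}-{aligner['version']} realign {normal_bam}",
        f"{aligner['tool']}-{aligner['version']} realign {tumor_bam}",
        f"{caller['tool']}-{caller['version']} --normal {normal_bam} --tumor {tumor_bam}",
    ]

cmds = build_commands(CONFIG, "subj01_normal.bam", "subj01_tumor.bam")
```

Switching a tool version is then a one-line change to the configuration file, while the pairing of normal and tumor samples stays in the driver.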

Results Though analysis and validation of the results are still ongoing at the time of this report, we can share that we identified five previously unreported somatic missense or splice site SNP mutations in the ATR gene region in melanoma patients. Results will be further validated by analysis of NGS data from melanoma cell lines.

Conclusions The main goal of this abstract was to describe a methodology we used to identify novel genetic markers in publicly available data. This methodology offers a cost-effective way to test hypotheses drawn from laboratory research against human genome data.

Acknowledgements This research was supported by the Cancer Research Informatics and the Biostatistics and Bioinformatics Shared Resource Facilities of the University of Kentucky Markey Cancer Center (P30CA177558). We would like to thank the University of Kentucky Information Technology department and Center for Computational Sciences for computing time on the Lipscomb High Performance Computing Cluster and for access to other supercomputing resources.

References 1. Jarrett SG, Horrell EM, Christian PA, Vanover JC, Boulanger MC, Zou Y, D'Orazio JA: PKA-mediated phosphorylation of ATR promotes recruitment of XPA to UV-induced DNA damage. Mol Cell. 2014 Jun 19;54(6):999-1011 2. Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A, Paschall J, Phan L, Popova N, Pretel S, Ziyabari L, Lee M, Shao Y, Wang ZY, Sirotkin K, Ward M, Kholodov M, Zbicz K, Beck J, Kimelman M, Shevelev S, Preuss D, Yaschenko E, Graeff A, Ostell J, Sherry ST: The NCBI dbGaP database of genotypes and phenotypes. Nat Genet. 2007 Oct;39(10):1181-6.

Shanshan Gao, University of Memphis

Alignment-free methods for metagenomic profiling

Shanshan Gao, Diem-Trang Pham, Vinhthuy Phan

Department of Computer Science, University of Memphis, Memphis, TN 38152, USA

Determining the abundances of microbial genomes in metagenomic samples is an important problem in analyzing metagenomic data. This task can be computationally expensive, since microbial communities usually consist of hundreds to thousands of environmental microbial species. For instance, the microbial community of the human gut contains up to 100 trillion cells. The goal of this work is to determine the abundance of each microbial genome based on short reads taken from a metagenomic sample.

We investigated three variations of an alignment-free method for profiling the abundances of microbial communities. Since the method does not align reads to reference genomes, it avoids the large indices that alignment often requires; the construction and use of a large index for thousands of microbial genomes can be computationally prohibitive. The main idea of the method is to solve linear equations for optimal solutions that satisfy specific constraints. The first step is to collect a set of genomic markers for the entire set of genomes. From these markers, a matrix F is constructed, where Fij represents the frequency of marker i in genome j. In the ideal case of no sequencing errors and no mutations in reads, the abundance of each genome can be found by solving the linear equation Fx = b, where b is the occurrence vector in which bi represents the frequency of marker i, and x is the abundance vector in which xj represents the abundance of genome j.

In the presence of sequencing errors, mutations and unknown genomes, this ideal case does not hold. Instead, we considered three variations of an optimization problem in which, rather than finding an exact solution x, we find an optimal x satisfying certain constraints. The three variations can be formulated, respectively, as a linear programming problem (LP), a least-squares approximation problem (L2), and an L1-approximation problem.
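The ideal-case solve and the L2 variant can be sketched with numpy alone; the F matrix and abundances below are toy values, and the LP and L1 variants, which add linear constraints, would need a linear-programming solver and are omitted:

```python
import numpy as np

# marker-frequency matrix F: F[i, j] = frequency of marker i in genome j
# (toy values for three genomes and four markers)
F = np.array([[3., 0., 1.],
              [0., 2., 1.],
              [1., 1., 0.],
              [2., 0., 2.]])

true_abundance = np.array([0.5, 0.3, 0.2])

# ideal case: the observed marker frequencies are exactly F @ x,
# so solving F x = b recovers the abundances
b_ideal = F @ true_abundance
x_exact, *_ = np.linalg.lstsq(F, b_ideal, rcond=None)

# with sequencing noise, solve the L2 variant instead:
# minimize ||F x - b||_2 over x
rng = np.random.default_rng(0)
b_noisy = b_ideal + rng.normal(scale=0.01, size=b_ideal.shape)
x_l2, *_ = np.linalg.lstsq(F, b_noisy, rcond=None)
```

With small noise the least-squares estimate stays close to the true abundance vector, which is the behavior the optimization variants exploit.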

Our investigation on a data set of 100 microbial genomes showed that the linear programming formulation (LP) yielded the best predictions of microbial genome abundances. This result was consistent across different levels of abundance. The LP variant also achieved better results across the board than FOCUS, a popular metagenomic profiler previously found to be superior to other methods. Our preliminary results also indicated that choosing genomic markers specific to each genome improved the results even further.

Contact: vphan@memphis.edu

Christy M Gearheart, University of Colorado School of Medicine

Efficient identification of mutations that potentially confer treatment resistance through computational derivation

Christy M Gearheart1, James R Lambert2, Christopher C Porter1

1Department of Pediatrics, University of Colorado School of Medicine, Aurora, CO, USA 2Department of Pathology, University of Colorado School of Medicine, Aurora, CO, USA

*christy.gearheart@ucdenver.edu

Background One of the major factors in cancer progression or relapse is the development of chemoresistance. The heterogeneity of cancer subtypes often means that oncogene addictions differ between subtypes, and possibly between patients. We have developed an efficient computational pipeline to filter, analyze, and interpret the vast number of called mutations that differentiate genomes, in order to identify a targeted list of genetic determinants of phenotype; in this case, drug-resistant vs. drug-sensitive cancer cells.

Methods Called mutations are filtered for quality, depth of coverage, and frequency of the mutation in the mapped reads. Only exonic regions are considered for further analysis. To quickly eliminate mutations in a resistant sample that are also present in the paired sensitive sample, we created an efficient algorithm that dramatically reduces the required search space. This is accomplished with a hash table that maps each chromosome to a balanced binary tree of the sensitive parental line's mutation positions and base changes, reducing the search space to less than 1% of that of a brute-force approach. Subsequent analysis interprets the potential impact of the remaining mutations through two methodologies. The first predicts whether a specific mutation will have a synonymous or deleterious effect on gene expression; the second examines the mutations in aggregate for an individual gene, under the hypothesis that highly mutated genes have a higher probability of disrupted functionality.
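The lookup structure described above (a hash table keyed on chromosome, each entry holding an ordered structure of mutation positions and base changes) can be sketched in Python. Python has no built-in balanced tree, so binary search over a sorted list plays that role here; the mutation tuples are invented for illustration:

```python
from bisect import bisect_left
from collections import defaultdict

def build_index(parental_mutations):
    """Index the sensitive parental line's mutations as
    chromosome -> sorted list of (position, base_change)."""
    index = defaultdict(list)
    for chrom, pos, change in parental_mutations:
        index[chrom].append((pos, change))
    for entries in index.values():
        entries.sort()
    return index

def in_parent(index, chrom, pos, change):
    """O(log n) membership test within one chromosome's sorted entries."""
    entries = index.get(chrom, [])
    i = bisect_left(entries, (pos, change))
    return i < len(entries) and entries[i] == (pos, change)

parent = [("chr1", 100, "A>G"), ("chr1", 250, "C>T"), ("chr2", 42, "G>A")]
idx = build_index(parent)

resistant = [("chr1", 100, "A>G"), ("chr1", 300, "T>C")]
# keep only mutations absent from the sensitive parental line
novel = [m for m in resistant if not in_parent(idx, *m)]
```

The chromosome hash narrows each query to one chromosome's entries, and the binary search within those entries gives the logarithmic lookup the abstract relies on.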

Application We have applied our bioinformatics pipeline to triple negative breast cancer (TNBC) cells to identify mutations that confer resistance to AMPI-109, an internally developed drug molecule that potently induces apoptosis in TNBC cells of various molecular subtypes yet has little to no effect on non-TNBC and non-tumorigenic cell lines. To identify potential mechanisms of resistance, we compared the transcriptome of the sensitive parental TNBC BT-20 cell line against five AMPI-109-resistant clones. Validation of the targeted mutations identified by our pipeline is ongoing.

Conclusions This work establishes an efficient computational pipeline to quickly identify a targeted list of mutations potentially enabling chemoresistance in cancer patients. Early detection of mechanisms of resistance may enable more personalized therapeutic treatments with improved efficacy and survival. Moreover, this pipeline may be applied to any massively parallel sequencing application, such as the identification of somatic mutations in diseased tissue.

Hyoil Han, Marshall University

Informative Multi-Document Summarization (iMDS)

Hyoil Han

Weisberg Division of Computer Science, Marshall University, Huntington, WV 25755, USA

hyoil.han@acm.org

Over the last 20 years, the number of online articles, and consequently the amount of data available to people, has increased tremendously. With this overwhelming amount of information, searching and/or processing online data manually is not feasible; we are living in a world of information overload. Yet this information is scattered across the internet, which makes needed information very hard to find, so we are at the same time living in a world of information underload. Keyword-based Web search typically returns a great deal of information unrelated to the user's query because it uses each keyword independently instead of exploiting the relationships among the terms/concepts of the documents. The aim of this project is to help people overcome the information overload and underload problems by providing multi-document summarization of a given set of documents on the same topic.

This project aims to create an informative multi-document summary from a set of documents: informative, extractive multi-document summarization (iMDS), which extracts key sentences from the input documents without using any specific queries or user needs. In general, document summarization consists of three components: sentence ranking, sentence selection, and summary generation. Based on chosen features, a score is assigned to each sentence, and the sentences are ranked by score. The chosen features are (1) the document frequency of each term, which indicates how many documents include the term; and (2) the cosine similarity of a sentence to all other sentences in the given set of documents. The two features are linearly combined to give each sentence a score, and sentences are ranked accordingly. The top-ranked sentence is included in the summary first, and the remaining sentences are re-ranked. The loop shown in Figure 1 represents this re-ranking of unselected sentences and is repeated until the user-defined summary size is reached. The generated summary is evaluated against publicly available reference summaries (i.e., human summaries).

Figure 1: The overall process of the proposed multi-document summarization.
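The rank-select-re-rank loop can be sketched as a greedy procedure. This is a simplified stand-in for iMDS: each sentence is treated as its own "document" for the frequency feature, and the feature weight `alpha` is an invented parameter, not one taken from the papers:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def summarize(sentences, size, alpha=0.5):
    """Greedy extractive summary: score = alpha * mean frequency of a
    sentence's terms + (1 - alpha) * mean cosine similarity to the other
    unselected sentences; unselected sentences are re-scored after each
    selection, as in the loop of Figure 1."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    freq = Counter(t for v in vecs for t in v)  # term -> sentence frequency
    chosen, remaining = [], list(range(len(sentences)))
    while remaining and len(chosen) < size:
        def score(i):
            tf = sum(freq[t] for t in vecs[i]) / max(len(vecs[i]), 1)
            sim = sum(cosine(vecs[i], vecs[j])
                      for j in remaining if j != i) / max(len(remaining) - 1, 1)
            return alpha * tf + (1 - alpha) * sim
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return [sentences[i] for i in chosen]

summary = summarize(["the cat sat", "the cat ran", "dogs bark"], 2)
```

On this toy input the two sentences sharing frequent, mutually similar terms outrank the outlier, which is the behavior the linear combination of the two features is meant to produce.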

References 1. Sovine and Han, A Computationally Efficient System for High-Performance Multi-Document Summarization, Proceedings of the 26th International FLAIRS Conference, Florida, USA, May 22-24, 2013. 2. Sovine and Han, Classification of Sentence Ranking Methods for Multi-Document Summarization, Innovative Document Summarization Techniques: Revolutionizing Knowledge Understanding (Edited by Alessandro Fiori), IGI Global, 2014.

Benjamin Harrison, University of Louisville

Assessment of gene 3’ untranslated region (3’UTR) expression in neuroplasticity

Benjamin Harrison, Robert Flight, Jeff Petruska and Eric Rouchka

KBRIN Bioinformatics Core, University of Louisville

Abstract

mRNA 3’-untranslated regions (3’UTRs) play an important role in regulating gene function by modifying the cellular localization, stability and/or translational efficiency of transcripts during normal biological processes (e.g., development, nervous system function) and disease states (e.g., UTR shortening in cancer). 3’UTR isoform diversity is primarily generated by alternate polyadenylation (APA) during mRNA biosynthesis, providing a dynamic substrate for RNA-binding proteins (RNAbps), ribonucleoprotein aggregates and miRNAs. The recent surge in whole transcriptome sequencing studies has revealed an unexpected diversity and specificity of 3’UTR regulation, and current estimates suggest there may be up to 5,000 genes with unannotated 3’UTR isoforms, some of which are more than 10 kb longer than those in the Ensembl and/or Entrez databases.

Despite this fundamental importance to gene regulation, accurate measurement of APA remains an unresolved challenge. Current methods either rely heavily on UTR annotation and/or employ statistical models sensitive to UTR shortening events at the expense of lengthening events. We therefore developed a method to assess dynamic APA in RNA-seq profiles. Initially, a comprehensive genome-wide database of polyadenylation signals was developed by compiling available data, including 3’-end sequencing studies and EST-tag datasets. The resulting putative alternate 3’ ends were then filtered to remove technical bias using a naïve Bayes classifier (the cleanUpdTSeq package in R), yielding over 200,000 putative alternate polyadenylation sites. These putative alternate 3’ ends were then assigned to their respective gene loci to build a model of all possible 3’UTR variants for all known genes. The extent of APA was then assessed in RNA-seq profiles from neurons undergoing axonal plasticity. We detected approximately 2,000 previously uncharacterized 3’UTR sequences, of which more than 200 are extended when plasticity is induced. Using position weight matrices, analyses of the UTR sequences extended during neuroplasticity revealed strongly over-represented motifs for neuron-specific miRNAs and RNAbps with known roles in nervous system pathologies.
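Assigning putative alternate 3’ ends to gene loci, the step that builds the model of possible 3’UTR variants, amounts to an interval check: a site belongs to a locus if it falls inside it or a bounded distance downstream (a possible UTR extension). A minimal sketch with invented coordinates and an invented extension window:

```python
genes = {  # illustrative loci: name -> (chrom, strand, start, end)
    "GeneA": ("chr1", "+", 1000, 5000),
    "GeneB": ("chr1", "+", 8000, 12000),
}

polya_sites = [("chr1", 4200), ("chr1", 4900), ("chr1", 11500), ("chr1", 700)]

def assign_sites(genes, sites, max_extension=2000):
    """Attach each putative poly(A) site to a gene if it lies inside the
    locus or within `max_extension` bp downstream of it (for plus-strand
    genes here; a real implementation would handle both strands)."""
    model = {name: [] for name in genes}
    for chrom, pos in sites:
        for name, (g_chrom, strand, start, end) in genes.items():
            if chrom != g_chrom:
                continue
            downstream_ok = strand == "+" and end < pos <= end + max_extension
            if start <= pos <= end or downstream_ok:
                model[name].append(pos)
    return {name: sorted(p) for name, p in model.items()}

utr_model = assign_sites(genes, polya_sites)
```

Each gene then carries the ordered list of candidate 3’ ends against which shortening and lengthening events can be scored in RNA-seq profiles.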

These preliminary studies demonstrate the utility of our method, which employs a comprehensive database of putative polyadenylation sites to assess existing RNA-seq data sets for both UTR shortening and lengthening events. In addition, our analyses of RNA-seq profiles of neural plasticity suggest a role for widespread 3’UTR extension during neuroplasticity, and that further analysis of UTR-interacting molecules in neurons may provide novel insights into neurological disease states.

David R Henderson, University of Kentucky

Automated, iterative and scored assignment of metabolites via 1H-NMR spectral peak lists

David R Henderson and Hunter N.B. Moseley*

Department of Molecular and Cellular Biochemistry / Markey Cancer Center / Resource Center for Stable Isotope Resolved Metabolomics. University of Kentucky, Lexington KY 40356

*hunter.moseley@uky.edu

Background & Introduction The assignment of metabolites is currently the single largest data analysis step limiting the effectiveness of metabolomics as an omics-level experimental technique. Mechanistic biological/biomedical interpretation is impossible without assignment of observed metabolic experimental features between case and control samples, relegating unassigned features to simple biomarker uses only. However, 1H nuclear magnetic resonance (NMR) metabolomics peak list datasets contain high biological and analytical variation in the observed chemical shifts (dimensions of spectral peaks) due to a range of sample-specific conditions like pH, temperature, other solutes, etc. This high degree of variation makes automated assignment of a mixture of metabolites difficult and unreliable. While general regions of expected chemical shifts can be utilized to provide assignment profiles of metabolites, direct comparison to a table of expected chemical shift ranges is not a robust analysis method.

Methods We are developing a probabilistic analysis approach that uses several (mostly) independent 1H-NMR peak features, including chemical shift, relative peak intensity, and coupling constants (specific chemical shift differences), to create a set of metabolite classifiers for metabolite assignment. The set of classifiers is created from a database of expected features for a set of metabolites that are subsequently analyzed for unique metabolite features. Also, variances for specific expected features are estimated from an initial analysis of a peak list dataset and then refined in an iterative assignment algorithm that selects from a list of possible assignments based on a probabilistic model. The set of classifiers is then optimized based on identified unique features and refined feature variances, without requiring all features to be present.
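The probabilistic matching of observed peaks to expected features can be illustrated by scoring each candidate metabolite with a Gaussian log-likelihood over its expected chemical shifts. This is a crude stand-in for the actual classifiers (it ignores intensities and coupling constants), and the shift values and variances below are invented for illustration:

```python
import math

# invented expected chemical shifts (ppm) and standard deviations
EXPECTED = {
    "lactate": [(1.33, 0.02), (4.11, 0.03)],
    "alanine": [(1.48, 0.02), (3.78, 0.03)],
}

def log_likelihood(observed_peaks, expected):
    """Sum, over expected features, the Gaussian log-density of the
    best-matching observed peak (one-feature stand-in for the full
    multi-feature probabilistic model)."""
    total = 0.0
    for mu, sigma in expected:
        best = min(observed_peaks, key=lambda p: abs(p - mu))
        total += (-0.5 * ((best - mu) / sigma) ** 2
                  - math.log(sigma * math.sqrt(2 * math.pi)))
    return total

observed = [1.32, 4.12, 3.05]  # peaks from a hypothetical spectrum
scores = {m: log_likelihood(observed, feats) for m, feats in EXPECTED.items()}
best_match = max(scores, key=scores.get)
```

Because the likelihood penalizes deviation in units of the feature variance, refining the variance estimates (as described in the iterative algorithm) directly sharpens the assignment scores.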

Results and Conclusions Initial analyses of our growing metabolite peak feature database indicate that most metabolites have unique peak features, even though many of the expected chemical shift ranges are not unique. We are now testing the set of classifiers built from the metabolite feature database against experimental 1D 1H-NMR peak list datasets to evaluate the robustness of the assignment methodology. In the future, the program will be expanded to handle 2D HSQC peak list datasets, which provide a richer set of features, but with higher feature variance.

Eugene W. Hinderer, University of Kentucky

Extracting subcellular localization from Gene Ontology

Eugene W. Hinderer and Hunter N.B. Moseley*

Department of Molecular and Cellular Biochemistry / Markey Cancer Center / Resource Center for Stable Isotope Resolved Metabolomics, University of Kentucky, Lexington KY 40356

*hunter.moseley@uky.edu

Background The exponential growth of genomic and transcriptomic sequencing over the last decade has driven a dramatic increase in the size of the gene and protein sequence repositories and derived protein knowledge databases like Uniprot. To integrate this sequence and functional information with other high-throughput omics-level technologies like mass spectrometry and nuclear magnetic resonance-based metabolomics, new bioinformatics tools are required to gather and organize relevant biological data in an automatic manner. These new tools are enabling a systems biochemical approach to the computational modeling of cellular metabolism as metabolic networks. Such representations can aid in the study of metabolic processes involved in disease and the discovery of drug targets. Toward this end, we present ongoing efforts to create a supervised-automated reorganization of Gene Ontology (GO) subcellular location terms into biologically sensible super-categories for visualizing compartmentalization of metabolic networks and aiding in the study of cellular processes under various conditions.

Methods Currently, GO exists as a directed acyclic graph (DAG) representing a hierarchy of subcellular compartment descriptions ordered from most general to most specific. Protein and gene product databases, such as Uniprot, commonly annotate subcellular location by cross-referencing the appropriate term in GO. Annotations such as these are highly specific but complicate omics-level querying and the evaluation of large data sets for generalized localization. Using object-oriented Python programs, we are developing a method to automatically create subgraphs within GO based on a set of user-defined keywords that delineate a particular compartment. In this way, annotation extensions may be added within GO between specific sub-location terms, such as ‘Cajal body’, and their overarching compartment, such as ‘nucleus’, simplifying queries of GO based on generalized location. Additionally, by analyzing intersections between the automatically created subgraphs, potentially transiently-localized gene products can be identified, generating hypotheses about dynamic compartmentation.
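The core operation, collecting every GO term whose ancestor chain reaches a keyword-defined compartment root, can be sketched as a graph traversal. The tiny DAG below is a hand-made illustration, not real GO content, and real GO edges carry typed relations (is_a, part_of) that this sketch collapses into plain parent links:

```python
from collections import deque

# tiny illustrative DAG: child term -> parent terms
GO_PARENTS = {
    "cajal body": ["nuclear body"],
    "nuclear body": ["nucleus"],
    "nucleolus": ["nucleus"],
    "nucleus": ["intracellular organelle"],
    "mitochondrion": ["intracellular organelle"],
    "intracellular organelle": [],
}

def compartment_subgraph(root_keywords):
    """Collect every term that is, or descends from, one of the
    keyword-defined compartment roots (a simplification of the method)."""
    members = set()
    for term in GO_PARENTS:
        queue, seen = deque([term]), {term}
        while queue:
            cur = queue.popleft()
            if cur in root_keywords:
                members.add(term)
                break
            for parent in GO_PARENTS.get(cur, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
    return members

nucleus_terms = compartment_subgraph({"nucleus"})
```

A gene product annotated to the specific term ‘cajal body’ can then be retrieved by a query on the generalized compartment ‘nucleus’, and intersecting two such subgraphs flags terms shared between compartments.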

Results Using Uniprot’s manually curated controlled vocabulary of subcellular location as a gold-standard, we have evaluated the contents of our 20 selected cellular compartments, within which approximately 87% of GO’s cellular compartment DAG is represented. 12 of our compartments had identical root nodes in Uniprot’s considerably smaller controlled vocabulary. Of these, 7 contained 100% inclusion and were 18 times larger on average. The other 4 contained between 58.8% and 84.6% inclusion. Some discrepancies in the organizational patterns between Uniprot and GO may account for the lower inclusion.

B. F. O’Hara, University of Kentucky

Using microarray data for an improved sleep related gene ontology and identifying candidate genes for sleep QTL

S. S. Joshi1, B. F. O’Hara1
1Dept. of Biology, University of Kentucky, Lexington, KY, USA

bohara@uky.edu

Humans spend approximately one third of their lives sleeping, but compared with other biological processes, most of the molecular and genetic aspects of sleep have not been elucidated. A nearly random gene ontology and the lack of a dedicated database containing a comprehensive list of sleep-related genes and their functions present a hurdle for sleep researchers. We took a two-pronged approach to this problem: publicly available microarray data from the NCBI GEO (National Center for Biotechnology Information - Gene Expression Omnibus) database were used to develop a list of sleep-related genes for traits of interest. The data were analyzed using R Bioconductor and custom Perl scripts. The genes from this list were then matched with the genes in QTL (Quantitative Trait Loci) for the trait. Genes within a QTL's chromosomal region that matched any gene in the list of sleep-related genes were considered potential candidates for causing variation in the quantitative trait. Here we present results from our preliminary study of sleep deprivation (SD) using this approach. Three microarray datasets belonging to two superseries in the GEO database were analyzed; the datasets were selected on the basis of similarity of experimental design. 227 candidate sleep-related genes were identified by comparing data from control and sleep-deprived mice. We identified 4 candidate genes in the Dps1 QTL, 2 in Dps2, and 9 in Dps3. These Dps loci are QTL associated with delta power in slow wave sleep [1]. The list contains Homer1, which has already been established as a molecular correlate of sleep loss [2], with alleles that appear responsible for Dps1. A second highlighted gene, Asrb, has also been previously reported as a candidate gene. Analysis of additional datasets from mice and Drosophila is underway. The advantage of this approach is that it provides more information and cross-support than a simple list of sleep-related candidate genes.
Experimental validation of candidate genes identified using this approach will help establish the validity of this method. The use of microarrays and other data for improved lists of sleep-related genes is not perfect, but it should represent a substantial improvement over the list of genes returned by querying "sleep" or similar terms in a gene ontology database, and should be useful for many sleep researchers.
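The matching step, intersecting the expression-derived gene list with QTL chromosomal regions, reduces to an interval-containment test. In this sketch the gene positions and QTL boundaries are invented placeholders (only the Homer1-to-Dps1 association itself comes from the abstract):

```python
# illustrative data: sleep-related candidate genes from expression
# analysis as name -> (chromosome, position); coordinates are invented
sleep_genes = {"Homer1": ("chr13", 93_300_000),
               "Bdnf":   ("chr2", 109_700_000),
               "Clock":  ("chr5", 76_200_000)}

# QTL intervals as name -> (chromosome, start, end); invented boundaries
qtl_regions = {"Dps1": ("chr13", 92_000_000, 95_000_000),
               "Dps2": ("chr2",  10_000_000, 20_000_000)}

def candidates_in_qtl(genes, qtls):
    """Genes whose position falls inside a QTL interval are candidates
    for driving variation in that quantitative trait."""
    hits = {}
    for qtl, (chrom, start, end) in qtls.items():
        hits[qtl] = sorted(g for g, (g_chrom, pos) in genes.items()
                           if g_chrom == chrom and start <= pos <= end)
    return hits

hits = candidates_in_qtl(sleep_genes, qtl_regions)
```

A gene appearing in a hit list, like Homer1 under Dps1 here, gains cross-support from both the expression data and the QTL mapping, which is the point of the two-pronged approach.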

References: 1. Franken, P., D. Chollet, and M. Tafti, The homeostatic regulation of sleep need is under genetic control. J Neurosci, 2001.21(8): p. 2610-21.

2. Maret, S., et al., Homer1a is a core brain molecular correlate of sleep loss. Proc Natl Acad Sci U S A, 2007. 104(50): p.20090-5.

Akhilesh Kaushal, University of Memphis

Which methods to choose to correct cell types in genome-scale blood-derived DNA methylation data?

Akhilesh Kaushal1, Hongmei Zhang1*, Wilfried J.J. Karmaus1, Julie SL Wang2

1Division of Epidemiology, Biostatistics, and Environmental Health, University of Memphis, Memphis, TN 38152, USA 2Division of Environmental Health & Occupational Medicine, National Health Research Institutes, Miaoli, Taiwan

*hzhang6@memphis.edu

Background High-throughput studies, such as microarray and DNA methylation assays, are used to measure transcriptional variation due to exposures, treatments, phenotypes or clinical outcomes in whole blood, and such measurements can be confounded by cellular heterogeneity [1, 2]. Several algorithms have been developed to account for this cellular heterogeneity. However, it is unknown whether these approaches are consistent and, if not, which method(s) perform better.

Materials and methods The data used in this study were from the Taiwan Maternal and Infant Cohort Study [3, 4]. We compared five cell-type correction methods: four methods recently proposed (the method implemented in the minfi R package[5], the method by Houseman et al.[6], FaST-LMM-EWASher[7], and RefFreeEWAS[8]) and one method using surrogate variables[9] (SVA). The association of DNA methylation at each CpG site across the whole genome with maternal arsenic exposure levels was assessed, adjusting for the estimated cell types. To further demonstrate and evaluate the methods that do not require reference cell types, we first simulated DNA methylation data at 150 CpG sites across 600 samples, based on an association of DNA methylation with a variable of interest (e.g., level of arsenic exposure) and a set of latent variables representing “cell types”, and then simulated DNA methylation at additional CpG sites showing association only with the latent variables.
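The simulation design described (some CpG sites driven by the exposure plus latent "cell type" variables, the remainder by the latent variables only) can be sketched with numpy. The effect sizes, number of latent components, and noise level below are arbitrary illustrations, not the parameters used in the study:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_assoc, n_latent_only, k = 600, 150, 100, 3

exposure = rng.normal(size=n_samples)      # e.g., arsenic exposure level
latent = rng.normal(size=(n_samples, k))   # unobserved "cell type" variables

# CpG sites associated with both the exposure and the latent structure
beta = rng.normal(scale=0.5, size=n_assoc)
load_a = rng.normal(scale=0.3, size=(k, n_assoc))
assoc = exposure[:, None] * beta + latent @ load_a

# CpG sites associated with the latent variables only
load_b = rng.normal(scale=0.3, size=(k, n_latent_only))
latent_only = latent @ load_b

noise = rng.normal(scale=0.1, size=(n_samples, n_assoc + n_latent_only))
methylation = np.hstack([assoc, latent_only]) + noise  # samples x CpG sites
```

Because the true exposure-associated sites are known (the first 150 columns), a correction method's calls against this matrix yield the sensitivities and specificities compared in the Results.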

Results Without adjusting for cell types, only 3 CpG sites showed significant associations with maternal arsenic exposure at an FDR level of 0.05. Adjustment by FaST-LMM-EWASher did not identify any CpG sites. For the other methods, Figure 1 illustrates the overlap of identified CpG sites. Further simulation studies on the reference-free methods (i.e., FaST-LMM-EWASher, RefFreeEWAS and SVA) revealed that RefFreeEWAS and SVA provided good and comparable sensitivities and specificities, while FaST-LMM-EWASher gave the lowest sensitivity but the highest specificity (Table 1).

Figure 1. Venn diagram illustrating the overlap of significant CpG sites at FDR level of 0.05 after adjusting for cell types by different methods for the association study of maternal arsenic exposure with DNA-methylation

Conclusions The results from real data indicated that RefFreeEWAS and SVA were able to identify a large number of CpG sites, and that the results from SVA showed the highest agreement with all other approaches. Simulation studies further confirmed that RefFreeEWAS and SVA are comparable and perform better than FaST-LMM-EWASher. Overall, the findings support a recommendation to use SVA to adjust for cell types, given its high agreement with the other methods and its favorable performance in the simulation studies.

References 1. Adalsteinsson BT, Gudnason H, Aspelund T, Harris TB, Launer LJ, Eiriksdottir G, Smith AV, Gudnason V: Heterogeneity in white blood cells has potential to confound DNA methylation measurements. PloS one 2012, 7(10):e46705. 2. Talens RP, Boomsma DI, Tobi EW, Kremer D, Jukema JW, Willemsen G, Putter H, Slagboom PE, Heijmans BT: Variation, patterns, and temporal stability of DNA methylation: considerations for epigenetic epidemiology. FASEB journal : official publication of the Federation of American Societies for Experimental Biology 2010, 24(9):3135-3144. 3. Lin L-C, Wang S-L, Chang Y-C, Huang P-C, Cheng J-T, Su P-H, Liao P-C: Associations between maternal phthalate exposure and cord sex hormones in human infants. Chemosphere 2011, 83(8):1192-1199. 4. Wang S-L, Su P-H, Jong S-B, Guo YL, Chou W-L, Päpke O: In utero exposure to dioxins and polychlorinated biphenyls and its relations to thyroid function and growth hormone in newborns. Environmental health perspectives 2005:1645-1650. 5. Jaffe AE, Irizarry RA: Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome biology 2014, 15(2):R31. 6. Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, Wiencke JK, Kelsey KT: DNA methylation arrays as surrogate measures of cell mixture distribution. BMC bioinformatics 2012, 13:86. 7. Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J: Epigenome-wide association studies without the need for cell-type composition. Nature methods 2014, 11(3):309-311. 8. Houseman EA, Molitor J, Marsit CJ: Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics 2014, 30(10):1431-1439. 9. Leek JT, Storey JD: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS genetics 2007, 3(9):e161.

Xing Li, Mayo Clinic College of Medicine

Time course transcriptome data analysis for in vitro modeling of dilated cardiomyopathy using patient-derived induced pluripotent stem cells

Xing Li1, Saranya Wyles2, Sybil C. Hrstka3, Jean-Pierre A. Kocher1, Andre Terzic3-5, Timothy M. Olson5,6, Timothy J. Nelson3-5

1 Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, USA 55905 2 Center for Clinical and Translational Sciences, Mayo Clinic College of Medicine, Rochester, MN, USA 55905 3 Division of General Internal Medicine, Mayo Clinic College of Medicine, Rochester, MN, USA 55905 4 Marriott Heart Disease Research Program, Division of Cardiovascular Diseases, Mayo Clinic College of Medicine, Rochester, MN, USA 55905 5 Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic College of Medicine, Rochester, MN, USA 55905 6 Division of Pediatric Cardiology, Mayo Clinic College of Medicine, Rochester, MN, USA 55905 7 Cardiovascular Genetics Research Laboratory, Mayo Clinic College of Medicine, Rochester, MN, USA 55905

Background Induced pluripotent stem cells (iPSCs) derived from dilated cardiomyopathy (DCM) patients offer an unprecedented platform for in vitro disease modeling [1]. Time course transcriptome analysis of the differentiation process from iPSCs to beating cardiomyocytes can reveal the dynamic gene expression landscape as a whole and pinpoint molecular deficiencies in cardiogenesis in DCM patients.

Materials and Methods In this study, dermal fibroblasts were isolated from skin biopsies of two unrelated patients who carry the RBM20 R636S mutation. The dermal fibroblasts were reprogrammed to iPSCs and then differentiated to cardiomyocytes to model cardiogenesis in DCM patients. During the differentiation process, cell samples at five stages (day 0, 10, 15, 20, and 25) were collected and RNA was extracted for time course transcriptome analysis. iPSCs from a healthy subject were used as a control.

Results Unsupervised hierarchical clustering on genome-wide expression profiles defined clearly separated developmental stages comprising pluripotent samples (day 0), early cardiac samples (days 10 and 15), and late cardiac samples (days 20 and 25). Furthermore, principal component analysis revealed dramatic transcriptome differences between patients with severe and minor phenotypes. Comparison of the transcriptome profiles of the two RBM20 familial DCM patient-specific cell lines and the control showed hundreds of differentially expressed genes, 50 of which showed consistent differential expression patterns between the two disease cell lines. Gene function enrichment analysis performed on these 50 genes highlighted the vital functional group of pattern specification process, including TBX18, CYP26B1, HHIP, and LHX2 (P ≤ 2.8E-8), which regulates the cellular response to differentiation during heart development.
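The two unsupervised steps reported above, hierarchical clustering of samples followed by principal component analysis, can be illustrated on a toy expression matrix; the sample counts, gene counts, and stage effect below are invented for the sketch and do not reflect the actual iPSC data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Toy expression matrix: 15 samples (3 developmental stages x 5 replicates)
# x 200 genes, with a stage-specific mean shift so stages are separable.
stages = np.repeat([0, 1, 2], 5)
expr = rng.normal(size=(15, 200)) + stages[:, None] * 3.0

# Unsupervised hierarchical clustering (average linkage, Euclidean distance).
Z = linkage(pdist(expr, metric="euclidean"), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")

# PCA via SVD on the centered matrix; PC1 should track the stage axis.
centered = expr - expr.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pc_scores = U * S  # sample coordinates in principal-component space
print(labels)
```

With a stage effect this strong, the three flat clusters recover the stage groups and the first principal component orders samples along the differentiation time course.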

Conclusion This study highlights developmental defects linked to the causative etiology of RBM20 familial DCM arising from dysfunctional cardiac gene expression during cardiogenesis. Insights gained from patient-specific stem cells enable the anticipation of disease outcomes and the targeting of molecular therapy at the root cause of DCM.

1. Beraldi R, Li X, Martinez-Fernandez A, Reyes S, Secreto F, Terzic A, Olson TM, Nelson TJ. Rbm20-deficient cardiogenesis reveals early disruption of RNA processing and sarcomere remodeling establishing a developmental etiology for dilated cardiomyopathy. Hum Mol Genet. 2014 Jul 15;23(14):3779-91. doi: 10.1093/hmg/ddu091. Epub 2014 Feb 28.

Pinyi Lu, Virginia Tech

Molecular modeling of NLRX1: a potential therapeutic target for immune-mediated and infectious diseases

Pinyi Lu1,2, Vida Abedi1,2, Casandra Philipson1, 2, Raquel Hontecillas1,2, and Josep Bassaganya-Riera1,2

1 The Center for Modeling Immunity to Enteric Pathogens, Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061, USA 2 Nutritional Immunology and Molecular Medicine Laboratory, Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061, USA

Nucleotide-binding and oligomerization domain (NOD)-like receptors (NLRs) are intracellular sentinels of cytosolic sanctity that are capable of orchestrating innate immunity and inflammatory responses. Thus, dissecting the various NLR genes in higher eukaryotes is important for understanding the intriguing mechanisms of host defense against pathogens. Mammalian NLRs are classified according to the type of their N-terminal domain. The nucleotide-binding oligomerization domain and leucine-rich repeat containing X1 (NLRX1) is a NOD-like receptor that modulates immune responses by negatively regulating Toll-like receptor-mediated NF-κB activation. Furthermore, NLRX1 is expressed in mitochondria, and its molecular features and structural organization are poorly characterized. In this study, we utilized homology modeling, docking, and molecular dynamics approaches to better understand the structure and ligand-binding affinity of NLRX1 and its potential roles in immune-mediated diseases. Based on the crystal structure of the C-terminal fragment (residues 629-975) of human NLRX1 (cNLRX1), the potential ligand-binding domain of NLRX1 was predicted. Furthermore, since the crystal structure of cNLRX1 is a trimer, the stability of cNLRX1 was compared between monomer and polymer architectures, and the polymer architecture of cNLRX1 was shown to be more stable, with higher binding affinity for ligands. In addition to cNLRX1, we also investigated the three-dimensional structures of the two other domains of NLRX1 using a homology modeling approach: the N-terminal effector domain and the central NACHT domain. Potential functional motifs within these two domains were also identified, indicating the role of NLRX1 in the host innate immune system. This is the first comprehensive in silico study of the structure of NLRX1, and it highlights its potential as a therapeutic target for immune-mediated and infectious diseases.

Behrouz Madahian, University of Memphis

Development of a Literature Informed Bayesian Machine Learning Method for Feature Extraction and Classification

Behrouz Madahian1, Ramin Homayouni2, Lih Yuan Deng1 1 Department of Mathematics, University of Memphis, Memphis, TN, 38152, USA. 2 Department of Bioinformatics, University of Memphis, Memphis, TN, 38152, USA.

Abstract Gene expression profiling has two major limitations that constrain analysis performance. First, a large number of variables is assessed relative to small sample sizes. Second, identifying a set of biologically relevant markers with high predictive power remains difficult. Several machine learning algorithms have been used for cancer classification, but they are geared toward obtaining the highest classification accuracy and do not take into account the biological relevance of the markers obtained. Thus, in the majority of applications, the markers found do not convey meaningful biological information and are merely good classifiers. A machine learning schema that bridges classification accuracy and biological relevance would therefore be of high merit to the community and could potentially result in deeper understanding of the mechanisms involved. In this study, we developed a literature-aided Bayesian shrinkage generalized linear model that utilizes a Generalized Double Pareto prior to induce shrinkage in the number of covariates. Additionally, instead of uninformed hyperparameters for the prior distributions, we adopt a literature-informed approach that adjusts the hyperparameters based on each marker's biological relevance to the phenotype under study. This allows us to control the shrinkage imposed on genes according to their biological relevance.

The method was applied to the leukemia data set of Golub et al. (1999). The dataset was split into training and test groups, and classification performance on the test group was evaluated. The top 500 most differentially expressed genes were used for the modeling step. Using the top 10 genes obtained from our model, we achieved 91% classification accuracy in the test group. We then switched the training and test data and obtained 92% classification accuracy on the new test group. The model without incorporation of biological information achieves 91% and 86% classification accuracy in the two scenarios. Additionally, the majority of biologically relevant genes rank very high and stand out when biological information is incorporated, compared to the non-informative setup. This demonstrates that our literature-informed choice of hyperparameters helps us obtain more biological insight. Taken together, these results suggest that a literature-informed sparse Bayesian generalized linear model applied to leukemia data sets allows for better subclass prediction based on more functionally relevant gene sets.
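The idea of relevance-dependent shrinkage can be illustrated with a deliberately simplified stand-in: the actual model places a Generalized Double Pareto prior on the coefficients and is fit in a Bayesian framework, whereas this sketch uses a per-gene Gaussian (ridge-type) prior whose precision is lowered for literature-relevant genes, so relevant genes are shrunk less. All data, relevance scores, and penalty scales here are hypothetical.

```python
import numpy as np

def map_logistic(X, y, penalty, lr=0.1, n_iter=2000):
    """MAP estimate for logistic regression with a per-coefficient
    Gaussian prior; penalty[j] is the prior precision for gene j."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        prob = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (prob - y) / n + penalty * w  # neg. log-posterior gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
# Genes 0 and 1 truly drive the phenotype in this simulation.
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(float)

# Hypothetical literature relevance scores in [0, 1]; relevant genes
# receive a weaker penalty (less shrinkage), mimicking informed priors.
relevance = np.zeros(p)
relevance[[0, 1]] = 1.0
penalty = 2.0 * (1.0 - 0.9 * relevance)

w = map_logistic(X, y, penalty)
top = np.argsort(-np.abs(w))[:2]  # indices of the two largest coefficients
print(sorted(int(i) for i in top))
```

In this toy setup, the truly associated genes carry the largest coefficients, and the relevance-scaled penalty makes them stand out further, mirroring the ranking behavior described in the abstract.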

William A Mattingly, University of Louisville

An iterative workflow for creating biomedical visualizations using Inkscape and D3.js

William A Mattingly1*, Robert R Kelley1, Julia H Chariker2, Timothy L Wiemken1, Julio Ramirez1

1Infectious Diseases, University of Louisville, Louisville, KY, 40202, USA 2Psychological and Brain Sciences, University of Louisville, Louisville, KY, 40292, USA

*wamatt02@louisville.edu

Background: Many biological disciplines use data visualization alongside computational methods to explore large-scale biomedical data. Visualization often provides insight into patterns in the data that are not apparent from the numerical data and statistics [1]. The development of new visualization tools requires sophisticated software and programming skills. Commercial standalone software like Tableau [2] creates multiple types of common visualizations and can customize certain features. There are also freely available software libraries like D3.js [3] that can be used to make interactive web applications based on static or dynamic data. Nevertheless, modern data visualization is highly sophisticated, and creating customized visualizations to interact with a specific dataset can be challenging for a variety of reasons. Specifically with D3.js, which builds a scalable vector graphic (SVG) programmatically, generating the visualization is a process of trial and error: the programmer generates SVG markup manually, views it in a browser, and iterates until the final graphic is realized. We present an iterative workflow, shown in Figure 1, that simplifies the creation of SVG images using freely available software. We demonstrate this workflow by constructing an interactive dashboard to track clinical trial enrollment.

Materials and Methods: Using the Scalable Vector Graphics (SVG) language [4] and the open source SVG authoring tool Inkscape [5], we created and tested several prototypes for visualizing clinical trial enrollment across nine adult hospitals in Jefferson County. The information required by a clinical trial manager included the total number of enrollments per hospital and other contextual data related to enrollment, such as the numbers admitted, screened, and eligible for the two trials, UAD and HAPPI. A layered bar graph design, shown in Figure 2, provided an efficient method for displaying the necessary information within the appropriate context and in the smallest space. After the mockup of the system is created in Inkscape, it is saved as an SVG document and imported into a live website. Inkscape assigns variable names to each primitive object at the time of creation. The D3.js library can then be used to access the properties of these objects and manipulate them according to the data.
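The authors drive this step with D3.js in JavaScript; as a language-neutral illustration of the same principle, the sketch below locates an object by its Inkscape-assigned id in an SVG document and rebinds its geometry to a data value. The id `rect1`, the bar geometry, and the enrollment numbers are invented for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)

# Minimal stand-in for an Inkscape-authored mockup: one bar of a layered
# bar graph, with the id assigned by the editor at creation time.
svg = f'''<svg xmlns="{SVG_NS}" width="100" height="100">
  <rect id="rect1" x="10" y="60" width="20" height="40" fill="steelblue"/>
</svg>'''

root = ET.fromstring(svg)
bar = root.find(f".//{{{SVG_NS}}}rect[@id='rect1']")

# Bind a data value (e.g., enrollments at one hospital) to the bar height,
# anchoring the bar to the chart baseline at y = 100.
enrollments, max_enroll, chart_h = 75, 100, 100
h = chart_h * enrollments / max_enroll
bar.set("height", str(h))
bar.set("y", str(chart_h - h))
print(bar.get("height"))  # 75.0
```

The key point is the one the abstract makes: because the authoring tool assigns stable ids, the mockup's objects can be addressed and updated programmatically without rebuilding the SVG from scratch.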

Results: The enrollment dashboard prototype was created over the course of one week. Many hours of development time were saved on each feature by allowing SVG prototypes to be designed without needing to learn the language-specific layout syntax. The demo is currently available at ctrsu.org/screen_demo [6].

Conclusions: SVG prototypes developed in Inkscape can be adapted for use with advanced visualization libraries like D3.js to form an iterative workflow for creating customized visualizations and dashboards. While manipulating interactive SVG still requires knowing JavaScript, our approach significantly reduces development time.

Figure 1: Workflow diagram.

Figure 2: Layered bar graph created in Inkscape for use in the clinical dashboard.

References 1. Anscombe FJ: Graphs in statistical analysis. American Statistician 1973, 27(1):17-21. 2. Tableau Software. Business Intelligence and Analytics [http://www.tableau.com] Accessed 22 Feb 2015. 3. Bostock M. D3.js – Data-Driven Documents [http://d3js.org] Accessed 22 Feb 2015. 4. W3C SVG Working Group [http://www.w3.org/Graphics/SVG] Accessed 22 Feb 2015. 5. Inkscape [https://inkscape.org] Accessed 22 Feb 2015. 6. Clinical Enrollment Dashboard Demo [http://ctrsu.org/screen_demo/] Accessed 22 Feb 2015.

Hunter N.B. Moseley, University of Kentucky

A graph database atom-resolved implementation of KEGG metabolic pathways

William A. McCollam, Joshua M. Mitchell, and Hunter N.B. Moseley* Department of Molecular and Cellular Biochemistry; Markey Cancer Center; Resource Center for Stable Isotope Resolved Metabolomics; University of Kentucky, Lexington, KY 40356 *hunter.moseley@uky.edu

Background Metabolomics is the systematic study of the metabolites (small biomolecules) present in a cell, tissue, organism, or community of organisms. Meaningful interpretation of metabolomics datasets requires analyzing them within the context of the metabolic networks that create, utilize, and consume metabolites. This is especially true for stable isotope-resolved metabolomics (SIRM) datasets, which contain data representing the incorporation of stable isotopes into specific atoms of detected metabolites. To facilitate the interpretation of SIRM datasets, we have created atom-resolved metabolic networks using metabolic reaction information from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [1]. Atom-Resolved KEGG (ARK) remodels a subset of KEGG as a graph of nodes within a graph database. Nodes in the graph represent everything from atoms, molecules, and enzymes up to genes and organisms, with edges between nodes representing relationships between these entities.

Methods & Results The ARK software package synchronizes an SQL database with various relational tables representing entities from KEGG, which are then processed to create atom entries from the KCF fields of compounds (i.e., a Mol file-like chemical format). Edges connecting atom nodes across reactions (mappings) are created from the ALIGN sections of the KCF fields of reactant pairs (RPairs), with additional mappings created by calculating all possible combinations of molecular symmetry between the reactant-product pairs using an enhanced version of the previously developed Chemically Aware Substructure Search (CASS) [2]. Finally, ARK uses the SQL representation of the graph to create an analogous representation in a high-performance graph database. In this form, simple graph database queries can find paths between atoms in source metabolites and atoms in destination metabolites, facilitating pathway-specific interpretation of stable isotope tracing data. Furthermore, these query methods are orders of magnitude faster than traditional SQL queries, allowing systematic analyses of possible isotope tracings between metabolites.
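The atom-path queries described above can be illustrated, independently of any particular graph database, as a breadth-first search over an atom-transition adjacency list; the metabolite and atom labels below are invented stand-ins, not actual KEGG atom mappings.

```python
from collections import deque

# Toy atom-transition graph: each node is a (metabolite, atom) pair and each
# edge is an atom mapping across a reaction. Labels are illustrative only.
edges = [
    (("glucose", "C1"), ("g6p", "C1")),
    (("g6p", "C1"), ("f6p", "C1")),
    (("f6p", "C1"), ("fbp", "C1")),
    (("fbp", "C1"), ("dhap", "C3")),
    (("dhap", "C3"), ("pyruvate", "C3")),
]
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)

def atom_path(source, target):
    """Breadth-first search for a shortest atom-tracing path."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = atom_path(("glucose", "C1"), ("pyruvate", "C3"))
print(len(path))  # number of atom nodes on the traced route: 6
```

A graph database expresses the same traversal declaratively; the point of the sketch is that atom-level tracing reduces to path finding once atoms are first-class nodes.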

Conclusion We have created a graph database implementation of KEGG metabolic maps that are both atom-resolved and directly atom-traceable at efficiencies that are computationally feasible.

References 1. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 2000, 28(1):27-30. 2. Mitchell JM, Fan TW-M, Lane AN, Moseley HN: Development and in silico evaluation of large-scale metabolite identification methods using functional group detection for metabolomics. Frontiers in genetics 2014, 5:237.

Sarah Neuner, University of Tennessee Health Science Center

Multi-scale study of normal aging predicts novel late-onset Alzheimer’s disease risk variants

Sarah Neuner1, Matthew de Both2, Matthew Huentelman2, and Catherine Kaczorowski1

1. Dept. of Anatomy and Neurobiology, University of Tennessee Health Science Center, Memphis, TN, 38163, USA 2. Neurogenomics Division, The Translational Genomics Research Institute (TGEN), Phoenix, AZ, 85004, USA

Alzheimer’s disease (AD) is a neurodegenerative disorder characterized by severe memory impairment and accumulation of neuropathological amyloid plaques and tau tangles. By contrast, ‘normal’ age-associated cognitive decline is less severe and generally occurs in the absence of neuropathology. Many believe aging- and AD-related memory impairments result from separate etiologies, but this distinction is not consistent with emerging evidence that both are linked to hippocampal dysfunction. As aging is the most significant risk factor for late-onset AD (LOAD), we hypothesize that both conditions are driven by common underlying mechanisms, which are exacerbated in AD by disease-specific insults such as neurodegeneration, neuroinflammation, and neuropathologies. Thus, elucidating genetic correlates of memory decline in ‘normal’ aging may identify risk factors that influence susceptibility to LOAD. Given that genetically diverse mouse models have emerged as a powerful way to study complex human traits, we conducted a multi-scalar analysis of two independent mouse models of aging to test this hypothesis. We combined memory tests with proteomic, transcriptomic, and genomic data to generate a list of the top 30 candidates that correlate with memory impairments. To evaluate the translational potential of these candidates, we analyzed their expression in the hippocampus of LOAD patients relative to age-matched non-demented controls. Eighteen genes including TRPC3, GABRB1, GABRB2, WDFY3, and GRM1 were significantly differentially expressed relative to disease status, suggesting they may play a role in the human disease process. In addition to correlation with disease phenotype, we wanted to identify whether any of our candidate genes contained single nucleotide polymorphisms (SNPs) that could be used to predict an individual’s susceptibility to LOAD. 
To test the ability of our multi-scalar approach to detect LOAD risk genes, we searched our full dataset for published risk genes. We identified APOE, SORL1, EPHA1, BIN1, and TREM2 as significantly differentially expressed relative to memory function in our aging models. We then identified four nominally significant novel putative risk variants (in GABRB1, GABRB2, GRM1, and WDFY3) using data generated by our study of aging models. Future work will investigate the functional significance of these SNPs and validate their mechanistic relevance to cognitive decline during aging and AD. To our knowledge, this work demonstrates for the first time the utility of studying ‘normal’ aging models both to better understand molecular mechanisms mediating memory function in diverse populations and to identify the candidates with the best potential to translate into effective treatments and predictive biomarkers for cognitive decline in elderly humans.

Iwona Pawlikowska, St. Jude Children’s Research Hospital

Dunn Index Bootstrap (DIBS): A Procedure to Empirically Select a Cluster Analysis Method that Identifies Biologically and Clinically Relevant Molecular Disease Subgroups

Iwona Pawlikowska1,3, Zhifa Liu1, Lei Shi1, Tong Lin1, Tanja Gruber2, Giles Robinson2, Arzu Onar-Thomas1, Stan Pounds1

1 Departments of Biostatistics and 2 Oncology, St. Jude Children’s Research Hospital, Memphis, TN, 3 Institute of Mathematics, University of Silesia, Katowice, Poland

Motivation: Cluster analysis is widely used in cancer research to discover molecular subgroups that inform subsequent laboratory investigations and define risk classification criteria for subsequent clinical trials. However, for any data set, there are a very large number of candidate cluster analysis methods (CCAMs) due to the many choices for feature selection criteria, number of selected features, number of clusters to define, etc. Frequently, a specific CCAM is chosen without quantifying the validity of its results in terms of reproducibility or distinctiveness of the reported subgroups.

Methods: Here, we propose the Dunn Index Bootstrap (DIBS) procedure to quantify the reproducibility and distinctiveness of subgroups defined by many CCAMs. DIBS applies each CCAM to the observed data and many bootstrap data sets obtained by subject resampling. The bootstrap results are used to compute metrics of subgroup reproducibility and distinctiveness of the subgroups defined by each CCAM.
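A minimal sketch of the two DIBS ingredients, a Dunn index (minimum between-cluster separation over maximum within-cluster diameter) computed across bootstrap resamples of subjects, is shown below; the "CCAM" here is reduced to a single-feature threshold rule purely for illustration, and the data are simulated.

```python
import numpy as np

def dunn_index(X, labels):
    """Min between-cluster distance divided by max within-cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    diam = max(np.linalg.norm(c[:, None] - c[None, :], axis=-1).max()
               for c in clusters)
    sep = min(np.linalg.norm(a[:, None] - b[None, :], axis=-1).min()
              for i, a in enumerate(clusters)
              for b in clusters[i + 1:])
    return sep / diam

rng = np.random.default_rng(3)
# Two well-separated toy subgroups of 20 "subjects" x 5 features each.
X = np.vstack([rng.normal(0, 0.5, (20, 5)), rng.normal(5, 0.5, (20, 5))])

# Bootstrap over subjects: re-cluster each resample (here: a threshold on
# the first feature stands in for a full CCAM) and record the Dunn index.
scores = []
for _ in range(50):
    idx = rng.integers(0, len(X), len(X))
    Xb = X[idx]
    lb = (Xb[:, 0] > 2.5).astype(int)
    if len(np.unique(lb)) == 2:
        scores.append(dunn_index(Xb, lb))
print(round(float(np.mean(scores)), 2))
```

In the full procedure, the distribution of such per-resample metrics, computed for each candidate method, is what allows CCAMs to be ranked on reproducibility and distinctiveness rather than chosen arbitrarily.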

Results: DIBS was used to characterize the performance of each of 4,032 CCAMs in the analysis of one RNA-seq, two microarray gene expression, and one methylation array data set from three different cancers. In each example, DIBS identified specific CCAMs that defined subgroups of well-established biological and clinical relevance.

Roger Chui, University of Kentucky

Validation and Quality Assurance for Genome Browser Database Exports

Roger Chui1, Jerzy W Jaromczyk1, Neil Moore1, Christopher L Schardl2 1 Department of Computer Science, University of Kentucky, Lexington, KY 40506, USA 2 Department of Plant Pathology, University of Kentucky, Lexington, KY 40546, USA

A genome browser transition utility designed in our lab, FPD2GB2 (Fungal Project Database to GBrowse 2), exports data from a custom database used by the Fungal Endophytes Genome Project. Designed as a collection of scripts, FPD2GB2 outputs the contents of a locally developed genome annotation database into the standard GFF3 format, allowing for bulk import of data into the GBrowse2 genome browser.

Any application that converts between data formats should ensure the completeness and accuracy of its output. Adding a data validator to the FPD2GB2 script collection allows for independent verification of the quality and soundness of the GFF3 files being imported into a production GBrowse2 environment. We measure accuracy by comparing the features listed in the GFF3 files to the contents of the original database and by ensuring accurate offsets relative to reference features. Comparing the parent-child inheritance structure of features in the output to that of the source data ensures the completeness of the output.
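The parent-child check can be sketched as a small validator that confirms every `Parent` attribute in a GFF3 file refers to an `ID` defined in the same file; the feature names below are invented, and a production validator would additionally compare features and offsets against the source database.

```python
# Minimal GFF3 parent-child validator. Column layout follows the GFF3 spec
# (column 9 holds semicolon-separated key=value attributes).
def parse_attrs(col9):
    return dict(kv.split("=", 1) for kv in col9.strip().split(";") if kv)

def check_parents(gff3_text):
    ids, parents = set(), []
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = parse_attrs(cols[8])
        if "ID" in attrs:
            ids.add(attrs["ID"])
        for p in attrs.get("Parent", "").split(","):
            if p:
                parents.append(p)
    # Report Parent references that never appear as an ID (dangling links).
    return [p for p in parents if p not in ids]

gff3 = "\n".join([
    "##gff-version 3",
    "ctg1\tfpd\tgene\t1\t900\t.\t+\t.\tID=gene1",
    "ctg1\tfpd\tmRNA\t1\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "ctg1\tfpd\texon\t1\t400\t.\t+\t.\tID=exon1;Parent=mrna1",
    "ctg1\tfpd\texon\t500\t900\t.\t+\t.\tID=exon2;Parent=mrna2",  # dangling
])
print(check_parents(gff3))  # → ['mrna2']
```

A dangling `Parent` reference like this would silently break feature aggregation in GBrowse2, which is exactly the class of error an independent validator is meant to catch before import.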

We discuss the issues involved in creating this validator and how the validator fits into the overall workflow of FPD2GB2.

Zhang Pan, Vanderbilt University

Practicality of Identifying Mitochondria Variants from Exome and RNAseq data

Zhang Pan1, David C. Samuels2, Brian Lehmann3, Jennifer Pietenpol3, Yu Shyr1, Yan Guo1

1 Center for Quantitative Sciences, Vanderbilt University, Nashville TN, 37027 2 Center for Human Genetics Research, Vanderbilt University, Nashville TN, 37037 3 Department of Biochemistry, Vanderbilt University, Nashville TN, 37232

Background The rapid progress in high-throughput sequencing technology has significantly enriched our capability to study the mitochondrial genome. Other than performing mitochondria-targeted sequencing, an increasingly popular alternative is to utilize the off-target reads from exome sequencing to infer mitochondrial genomic variants, including SNPs and heteroplasmies [1-9]. However, the effectiveness and practicality of this approach have not been tested. Recently, RNAseq data have also been suggested as a good source for alternative data mining [10, 11], but whether mitochondrial variants are minable from them has not been studied.

Materials and methods We designed a study using targeted mitochondria sequencing data as a gold standard to evaluate the practicality of SNP and heteroplasmy detection from exome sequencing and RNAseq data. Six breast cancer cell lines were sequenced with mitochondria-targeted sequencing, exome sequencing, and RNAseq. Furthermore, we examined three mitochondria alignment strategies: 1) align all reads directly to the mitochondrial genome; 2) align all reads to the nuclear and mitochondrial genomes simultaneously; 3) align all reads to the nuclear genome first, then align the unaligned reads to the mitochondrial genome.

Results Our analyses found that exome sequencing can accurately detect mitochondrial SNPs and can detect a portion of the true heteroplasmies with a reasonable false discovery rate. RNAseq data, on the other hand, had a lower detection rate for SNPs but a higher detection rate for heteroplasmies. However, its higher false discovery rate makes RNAseq a less ideal source for studying mitochondria than exome sequencing data. Furthermore, we found that aligning all reads directly to the mitochondrial genome reference, or aligning all reads to the nuclear and mitochondrial genome references simultaneously, produced the best results.

Conclusions Exome sequencing and RNAseq data can potentially be mined for mitochondrial variants. Overall, exome sequencing provides a lower false discovery rate than RNAseq for mitochondrial variant detection, making it the more desirable choice. In conclusion, our study provides important guidelines for future studies that intend to use exome sequencing or RNAseq data to infer mitochondrial SNPs and heteroplasmy.

References 1. Samuels DC, Han L, Li J, Quanghu S, Clark TA, Shyr Y, Guo Y: Finding the lost treasures in exome sequencing data. Trends Genet 2013. 2. Ye F, Samuels DC, Clark T, Guo Y: High-throughput sequencing in mitochondrial DNA research. Mitochondrion 2014, 17:157-163. 3. Picardi E, Pesole G: Mitochondrial genomes gleaned from human whole-exome sequencing. Nature methods 2012, 9(6):523-524. 4. Guo Y, Li J, Li CI, Shyr Y, Samuels DC: MitoSeek: extracting mitochondria information and performing high-throughput mitochondria sequencing analysis. Bioinformatics 2013, 29(9):1210-1211. 5. Dinwiddie DL, Smith LD, Miller NA, Atherton AM, Farrow EG, Strenk ME, Soden SE, Saunders CJ, Kingsmore SF: Diagnosis of mitochondrial disorders by concomitant next-generation sequencing of the exome and mitochondrial genome. Genomics 2013. 6. Falk MJ, Pierce EA, Consugar M, Xie MH, Guadalupe M, Hardy O, Rappaport EF, Wallace DC, LeProust E, Gai XW: Mitochondrial Disease Genetic Diagnostics: Optimized Whole-Exome Analysis for All MitoCarta Nuclear Genes and the Mitochondrial Genome. Discov Med 2012, 79:389-U140. 7. Nemeth AH, Kwasniewska AC, Lise S, Schnekenberg RP, Becker EBE, Bera KD, Shanks ME, Gregory L, Buck D, Cader MZ, Talbot K, De Silva R, Fletcher N, Hastings R, Jayawant S, Morrison PJ, Worth P, Taylor M, Tolmie J, O'Regan M, Consortium UA, Valentine R, Packham E, Evans J, Seller A, Ragoussis J: Next generation sequencing for molecular diagnosis of neurological disorders using ataxias as a model. Brain 2013, 136:3106-3118. 8. Sevini F, Giuliani C, Vianello D, Giampieri E, Santoro A, Biondi F, Garagnani P, Passarino G, Luiselli D, Capri M, Franceschi C, Salvioli S: mtDNA mutations in human aging and longevity: controversies and new perspectives opened by high-throughput technologies. Exp Gerontol 2014, 56:234-244. 9. McMahon S, LaFramboise T: Mutational patterns in the breast cancer mitochondrial genome, with clinical correlates. Carcinogenesis 2014, 35(5):1046-1054. 10. Han L, Vickers KC, Samuels DC, Guo Y: Alternative applications for distinct RNA sequencing strategies. Brief Bioinform 2014. 11. Vickers KC, Roteta LA, Hucheson-Dilks H, Han L, Guo Y: Mining diverse small RNA species in the deep transcriptome. Trends Biochem Sci 2015, 40(1):4-7.

Shruti S Sakhare, Meharry Medical College

Transcriptome analysis of breast cancer in African American women

Shruti S Sakhare, MS1#, Jamaine Davis, PhD2#, Sammed N Mandape, MS1, Siddharth Pratap, PhD1* 1Bioinformatics Core, Meharry Medical College, Nashville, TN, 37208, USA 2Dept. of Biochemistry and Cancer Biology, Meharry Medical College, Nashville, TN, 37208, USA # denotes equal author contribution * Corresponding author e-mail: spratap@mmc.edu

Introduction Breast cancer is the second most lethal cancer in women, and death rates for African American women are the highest of any racial/ethnic group. Hormone receptor status is one of the major prognostic factors and a determinant of treatment options for breast cancer, underscoring the importance of molecular-level characterization. In this study, we identified transcriptome-level differences among the receptor-specific molecular subtypes of breast cancer in the African American population.

Materials and Methods The Cancer Genome Atlas (TCGA: http://cancergenome.nih.gov/) clinical and level 3 gene expression data (Agilent G4502A transcriptome arrays) from African American women were analyzed. Data from 18 samples were classified by receptor-specific breast cancer subtype: 1) triple negative: ER- PgR- Her2- [---]; 2) Luminal A: ER+ PgR+ Her2- [++-]; 3) Her2 over-expressing: ER- PgR- Her2+ [--+]; 4) ER positive: ER+ PgR- Her2- [+--]. The samples were analyzed using one-way ANOVA with Welch's correction for unequal sample sizes and type 3 sums of squares. Genes with an ANOVA p-value < 0.01 and an absolute relative expression fold change > 2.0 were considered significant, yielding 90 differentially expressed genes (Figure 1). Next, we constructed a biological interaction network using the Michigan Molecular Interactions database (MiMI) and Cytoscape (version 2.8.3) (Figure 2). Pathway enrichment analysis by hypergeometric test was conducted with the WEB-based GEne SeT AnaLysis Toolkit (WebGestalt) to identify significantly enriched KEGG pathways.
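The per-gene screen described above (Welch-corrected one-way ANOVA plus a two-fold change cutoff) can be sketched for a single gene as follows; the Welch F statistic and degrees of freedom follow the standard formulas, while the group sizes and expression values are illustrative, not the TCGA data.

```python
import numpy as np
from scipy.stats import f as f_dist

def welch_anova(groups):
    """Welch's one-way ANOVA, robust to unequal variances and sample sizes."""
    k = len(groups)
    n = np.array([len(g) for g in groups], float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                                  # per-group weights
    mw = np.sum(w * m) / np.sum(w)             # weighted grand mean
    a = np.sum(w * (m - mw) ** 2) / (k - 1)
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    stat = a / (1 + 2 * (k - 2) / (k ** 2 - 1) * tmp)
    df2 = (k ** 2 - 1) / (3 * tmp)             # Welch-adjusted denominator df
    return stat, f_dist.sf(stat, k - 1, df2)

rng = np.random.default_rng(4)
# One gene, four subtypes with unequal n (as in an 18-sample design);
# subtype 0 is shifted up ~2.5 units on the log2 expression scale.
groups = [rng.normal(2.5, 0.4, 6), rng.normal(0, 0.4, 5),
          rng.normal(0, 0.4, 4), rng.normal(0, 0.4, 3)]
stat, p = welch_anova(groups)
log2_fc = np.mean(groups[0]) - np.mean(groups[1])  # log2 fold change
significant = p < 0.01 and abs(log2_fc) > 1        # two-fold cutoff
print(significant)
```

Applying this test gene-by-gene across the array and keeping genes that clear both the p-value and fold-change thresholds reproduces the kind of filter that yielded the 90 significant genes.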

Results Immediate neighbor genes of the 90 significant genes included important DNA repair genes such as BRCA1, SMAD3, SMAD4, EGFR and MDC1 [1, 2]. Specifically, MDC1 showed altered expression in all subtypes of breast cancer and a significant p-value for the Luminal A [+ + -] versus Triple negative [- - -] comparison. This protein has previously been implicated in the DNA damage response [3]. Signaling pathway genes responsible for cell-mediated immunity, such as RAF1 and NFATC4, also showed significantly changed expression. Clustering among subtypes based on fold-change data suggests the greatest similarity between the Luminal A [+ + -] and ER+ PgR- Her2- [+ - -] subtypes of breast cancer.

Conclusion Significant transcriptome-level changes in important DNA repair genes such as MDC1, and in cell-mediated immunity signaling genes such as RAF1 and NFATC4, in the triple negative subtype of breast cancer in African American women stress the importance of evaluating DNA damage response and immune competence to predict breast cancer chemo-sensitivity, and warrant further investigation.

Figure 1. Hierarchical clustering heat map of significantly altered breast cancer subtype genes in African American women

Legend: Green indicates genes down-regulated at least two-fold and red indicates genes up-regulated at least two-fold.

Figure 2. Interaction network of breast cancer subtypes

Legend: Diamond nodes are seed nodes of significantly altered gene transcripts varying among the Triple Negative, Her2 over-expressing, Luminal A, and ER+ PgR- Her2- breast cancer subtypes in African American women; circular nodes are first-degree biological interactors. Red indicates genes significant in the (- - -, + + -) comparison, yellow indicates genes significant in the (- - +, - - -) comparison, and orange indicates genes significant in the (+ - -, - - -) comparison.

Acknowledgements NIH grants MD007586 and MD007593 from the National Institute on Minority Health and Health Disparities (NIMHD), and grant CA166544 from the National Cancer Institute (NCI).

References
1. Tommiska J, Bartkova J, Heinonen M, Hautala L, Kilpivaara O, Eerola H, Aittomäki K, Hofstetter B, Lukas J, Von Smitten K: The DNA damage signalling kinase ATM is aberrantly reduced or lost in BRCA1/BRCA2-deficient and ER/PR/ERBB2-triple-negative breast cancer. Oncogene 2008, 27(17):2501-2506.
2. Guler G, Himmetoglu C, Jimenez RE, Geyer SM, Wang WP, Costinean S, Pilarski RT, Morrison C, Suren D, Liu J: Aberrant expression of DNA damage response proteins is associated with breast cancer subtype and clinical features. Breast Cancer Research and Treatment 2011, 129(2):421-432.
3. Prat A, Perou CM: Deconstructing the molecular portraits of breast cancer. Molecular Oncology 2011, 5(1):5-23.

Mansi Sethi, University of Kentucky

Analysis Of Sleep Traits In Knockout Mice From The Large-scale KOMP2 Population Using A Non-invasive, High-throughput Piezoelectric System

Mansi Sethi, MS1, Martin Striz, MS1, Shreyas S. Joshi, MS1, Neil Cole, BS2, Jennifer Ryan, BS2, Michael E. Lhamon, Ph.D.3, Anuj Agarwal, Ph.D.3, Stacey J. Sukoff Rizzo, Ph.D.2, James M. Denegre, Ph.D.2, Robert E. Braun, Ph.D.2, Kevin D. Donohue, Ph.D.3,4, Elissa J. Chesler, Ph.D.2, Karen L. Svenson, Ph.D.2, Bruce F. O'Hara, Ph.D.1.

1 Department of Biology, University of Kentucky, Lexington, KY, USA, 2 The Jackson Laboratory, Bar Harbor, ME, USA, 3 Signal Solutions, LLC, Lexington, KY, USA, 4 Department of Electrical and Computer Engineering, University of Kentucky, Lexington, KY, USA.

Introduction In our current study, we employed a non-invasive, high-throughput piezoelectric system to characterize sleep-wake phenotypes in a large population of control and single-gene knockout mice, recorded as part of the KOMP2 studies at JAX [1].

Methods Knockout mice (15 weeks old) generated on a C57BL/6NJ background were phenotyped for sleep-wake parameters in the JAX phenotyping pipeline under baseline 12:12 light:dark conditions for 5 days using a non-invasive piezoelectric system, and were compared to control (C57BL/6NJ) mice [2].

The piezoelectric system consists of a sensor pad placed at the bottom of the mouse cage, which records gross body movements. The pressure signals thus generated are classified into sleep and wake by an automated classifier. The system characterizes traits that include sleep time over 24 hours, during the light and dark phases, or over any desired interval. Likewise, the distribution of sleep bout lengths (sleep fragmentation) is assessed, in addition to activity onset. The piezoelectric system has been validated against EEG and human observation, and demonstrates a classification accuracy of over 90% [3,4]. Thus far, we have recorded over 1000 C57BL/6NJ mice, both males and females. The number of animals in each of the 130 knockout mouse groups ranges from 4 to 17.
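
The epoch-based scoring idea can be illustrated with a toy classifier; the actual system extracts richer signal features and uses a validated classifier, so the threshold, epoch length, and signal values below are placeholders.

```python
from statistics import pstdev

def classify_epochs(signal, epoch_len=8, threshold=0.5):
    """Toy illustration of epoch-based sleep/wake scoring from a piezo
    pressure trace: epochs dominated by large, irregular movements score
    as 'wake'; low-amplitude epochs (breathing only) score as 'sleep'."""
    labels = []
    for i in range(0, len(signal) - epoch_len + 1, epoch_len):
        epoch = signal[i:i + epoch_len]
        labels.append("wake" if pstdev(epoch) > threshold else "sleep")
    return labels

def percent_sleep(labels):
    """Fraction of scored epochs labeled sleep, as a percentage."""
    return 100.0 * labels.count("sleep") / len(labels)
```

From such labels, the bout-length distribution and sleep time per light/dark phase follow by simple aggregation.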

Results C57BL/6NJ female mice exhibited shorter bout lengths and less total sleep than males. Significant sleep-wake differences in both light and dark phases were also found for a number of knockout lines and inbred mouse strains compared to control mice.

Conclusion We present the results of sleep phenotyping for a variety of inbred strains, single-gene knockouts, and control mice. A number of genes influencing various sleep traits have been identified, and these data will also be compared and correlated with non-sleep traits assessed in the same mice. Recently improved algorithms now allow classification of REM vs. non-REM sleep as well.

Support NIH Grant OD011185, NIH Grant HG006332

References
1. Morgan H, Simon M, Mallon AM: Accessing and mining data from large-scale mouse phenotyping projects. Int Rev Neurobiol 2012, 104:47-70.
2. Skarnes WC, Rosen B, West AP, Koutsourakis M, Bushell W, Iyer V, Mujica AO, Thomas M, Harrow J, Cox T, et al: A conditional knockout resource for the genome-wide study of mouse gene function. Nature 2011, 474:337-342.
3. Flores AE, Flores JE, Deshpande H, Picazo JA, Xie XMS, Franken P, Heller HC, Grahn DA, O'Hara BF: Pattern recognition of sleep in rodents using piezoelectric signals generated by gross body movements. IEEE Transactions on Biomedical Engineering 2007, 54:225-233.
4. Mang GM, Nicod J, Emmenegger Y, Donohue KD, O'Hara BF, Franken P: Evaluation of a piezoelectric system as an alternative to electroencephalogram/electromyogram recordings in mouse sleep studies. Sleep 2014, 37:1383-1392.

Jasmit S Shah, University of Louisville

Metabolomics Data Analysis and Missing Value Issues with Application to Infarcted Mouse Hearts

Jasmit S Shah1,2*, Guy N Brock2, Shesh N Rai2 1 The Diabetes and Obesity Center, University of Louisville, Louisville, Kentucky, 40202, USA 2 Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, Kentucky, 40202, USA * jasmit.shah@louisville.edu

High-throughput technology makes it possible to monitor metabolites across different experiments and has been widely used to detect differences in metabolites in many areas of biomedical research. Mass spectrometry has become one of the main analytical techniques for profiling a wide array of compounds in biological samples. Extracting relevant biological information from these large datasets is a challenge. Missing values occur widely in metabolomics datasets and can arise from different sources, both technical and biological. Most commonly, a missing value is substituted with the minimum observed value, but this substitution may alter downstream analysis results, and different imputation methods tend to give different results. In this study we summarize the statistical analysis of metabolomics data with and without missing values. For the missing-value case, we compare the different imputation methods and examine the outcomes of each.
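
The minimum-value substitution discussed above, and two common alternatives, can be sketched as follows (a minimal illustration; the values are placeholders and the method set is not the full set compared in the study):

```python
from statistics import mean

def impute(values, method="min"):
    """Fill missing intensities (None) in one metabolite feature's vector.
    'min' and 'half_min' follow the common assumption that values are
    missing because they fell below the detection limit; 'mean' assumes
    they are missing at random. Different choices can change downstream
    statistics, which is the comparison explored here."""
    observed = [v for v in values if v is not None]
    fill = {"min": min(observed),
            "half_min": min(observed) / 2,
            "mean": mean(observed)}[method]
    return [fill if v is None else v for v in values]
```

Running a differential analysis on the same matrix imputed each way makes the method-dependence of the results directly visible.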

Andrey Smelter, University of Louisville

Automated Assignment of Magic-Angle-Spinning Solid-State Protein NMR Spectra

Andrey Smelter1, Indraneel Reddy3, Eric C. Rouchka1, and Hunter N.B. Moseley2,*

1 School of Interdisciplinary and Graduate Studies / Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40208 2 Department of Molecular and Cellular Biochemistry / Markey Cancer Center / Resource Center for Stable Isotope Resolved Metabolomics, University of Kentucky, Lexington, KY 40356 3 Department of Biology, University of Louisville, Louisville, KY 40208 *hunter.moseley@uky.edu

Background Magic-Angle-Spinning Solid-State NMR (MAS SSNMR) is a rapidly developing methodology for studying protein structure, function and dynamics that is complementary to both solution NMR and X-ray crystallography. MAS SSNMR is an invaluable experimental technique due to its capacity to study proteins and complexes in the solid state, especially membrane proteins, amyloid fibrils and other proteins that are difficult to study by other analytical methods due to their insolubility or inability to form crystals. One of the critical steps in the protein structure determination process is the assignment of NMR spectral data: the assignment of protein resonances via the association of chemical shift values with specific nuclei in a protein macromolecule. Depending on the quality of the spectra, manual protein resonance assignment can take a considerable amount of time even for an experienced spectroscopist, requiring weeks or even months of effort. Currently, there is a lack of software tools capable of performing protein resonance assignment of SSNMR spectra automatically; this situation calls for the development of new methodologies and tools. Our hypothesis is that protein resonance assignment of uniformly 13C- and 15N-labeled proteins can be automated for MAS SSNMR. This hypothesis is based on the fact that several software tools have been developed for automated protein resonance assignment in solution (liquid-state) NMR, and the assignment process is very similar between solution and solid-state NMR.

Methods and Results We are developing core data structures and algorithms that implement the following basic steps used by most assignment programs for protein solution NMR: i) peak list registration and dataset quality assessment, ii) spin system grouping, iii) sequence site typing, iv) spin system linking, v) segment mapping, and vi) assignment quality assessment. However, improvements in the grouping, typing, linking, and mapping steps are required beyond current solution NMR tools, due to fundamental differences in experimental strategies and in the information content of the derived datasets. Our current implementation addresses these differences and the variety of experimental designs in MAS SSNMR. Issues in dataset information content are being addressed as we implement improved algorithms.
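
As a concrete illustration of step ii), spin system grouping can be sketched as below. This is a toy version under invented assumptions: peaks are dictionaries carrying 15N and 13CA shifts (keys "N" and "CA" are placeholders), and grouping is by a simple shift tolerance; real grouping must handle match ambiguity, additional dimensions, and match-quality scoring.

```python
def group_spin_systems(peaks, tol=0.2):
    """Toy spin system grouping: peaks whose anchor shifts (here 15N and
    13CA, in ppm) agree within a tolerance are merged into one spin
    system, keyed to the first matching peak seen."""
    systems = []
    for peak in peaks:
        for sys in systems:
            if (abs(sys["anchor"]["N"] - peak["N"]) <= tol
                    and abs(sys["anchor"]["CA"] - peak["CA"]) <= tol):
                sys["peaks"].append(peak)
                break
        else:
            # No existing system matches: start a new one anchored here.
            systems.append({"anchor": peak, "peaks": [peak]})
    return systems
```

Subsequent typing, linking, and mapping steps would then operate on these grouped spin systems rather than on raw peaks.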

Conclusions Our long-term goal is to develop software tools that will significantly improve the speed and quality of protein resonance assignment specifically for MAS SSNMR.

Tamas L. Nagy, University of Kentucky

Characterization of the Structural Constraints of Viral Type I Fusion Proteins

Tamas L. Nagy1, Stacy R. Webb1, Rebecca E. Dutch1, Hunter Moseley1,2,3,† 1Department of Cellular and Molecular Biochemistry, 2Markey Cancer Center, University of Kentucky, 3Resource Center for Stable Isotope Resolved Metabolomics, University of Kentucky, Lexington, KY, 40508, USA, †hunter.moseley@uky.edu

Background Enveloped viruses—including major human pathogens like HIV, SARS, and Dengue—require the fusion of host and viral membranes to initiate infection. This process is driven by controlled, large-scale conformational changes in specialized, meta-stable proteins known as fusion proteins. The evolution of these proteins is constrained by two dichotomous forces: they must be stable enough to remain in their high-energy, prefusion state, but also undergo a structural rearrangement that facilitates the fusion process when triggered. The transition between the higher-energy prefusion and lower-energy postfusion forms is generally irreversible, making the stabilization of the prefusion form and tight control of triggering essential to viral stability and infectivity. For certain Type I fusion proteins, several structural characteristics have been identified as having a critical role in the meta-stability of the whole protein. Recent work has suggested that the process is driven by weakly interacting transmembrane domains (TMDs) that spring apart, unzipping an upstream coiled-coil, which causes the fusion peptide to be inserted in the host membrane and membrane fusion to occur (Smith et al., 2013). Despite its importance, it is not currently known how widespread this fusion approach is among the viral families. This work characterizes the structural features within the Paramyxoviridae family, with further application to other viral families.

Methods Curated sequences for the Paramyxovirus fusion proteins were downloaded from Uniprot and aligned based on the location of TMDs, as predicted by TMHMM (Krogh et al., 2001). The Paircoil2 (McDonnell et al., 2006) coiled-coil prediction software was run on each sequence using a 21 AA window length. Secondary structure prediction was done using the JPRED3 algorithm (Cole et al., 2008) in batch mode. We implemented all analyses in the Python programming language using the Python data science stack.
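
The TMD-anchored alignment step can be sketched as follows (a simplified stand-in: in the study the TMD positions came from TMHMM, whereas here they are passed in directly, and sequences are merely left-padded so the TMD starts coincide):

```python
def align_on_tmd(seqs_with_tmd_start, pad="-"):
    """Align sequences by left-padding so that the predicted TMD start
    positions (0-based indices) fall in the same column. The sequences
    and positions used here are placeholders, not real fusion proteins."""
    anchor = max(start for _, start in seqs_with_tmd_start)
    return [pad * (anchor - start) + seq
            for seq, start in seqs_with_tmd_start]
```

With sequences anchored this way, per-column coiled-coil scores (e.g. from Paircoil2) can be compared at equivalent distances from the TMD.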

Results Our analyses found a consistent, gradually weakening pattern in the predicted coiled-coil strength spanning from the upstream coiled-coil region into the TMD of Paramyxoviridae fusion proteins. Using a separate secondary structure prediction tool, we found that these proteins have a helix-gap-helix structure in which the coiled-coil alpha helix is disrupted just prior to the start of the TMD and re-formed once inside the TMD. These data agree with our oligomerization studies on alanine-scan mutants of the Hendra paramyxovirus fusion protein, as measured by sedimentation equilibrium analytical ultracentrifugation.

References
Cole,C. et al. (2008) The Jpred 3 secondary structure prediction server. Nucleic Acids Research, 36, W197–W201.
Krogh,A. et al. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology, 305, 567–580.
McDonnell,A.V. et al. (2006) Paircoil2: improved prediction of coiled coils from sequence. Bioinformatics, 22, 356–358.
Smith,E. et al. (2013) Trimeric transmembrane domain interactions in paramyxovirus fusion proteins: roles in protein folding, stability and function. J Biol Chem, 288, 35726–35735.

Quang Tran, University of Memphis

A linear model for predicting performance of short-read aligners using genome complexity

Quang Tran, Shanshan Gao, Nam S. Vo and Vinhthuy Phan

Department of Computer science, University of Memphis, TN 38152, USA

Motivation: The effectiveness and accuracy of aligning short reads to genomes have an important impact on many applications that rely on next-generation sequencing data. The computational requirements and material costs of aligning large-scale short-read data to genomes are also substantial. To avoid wasting time and resources on alignment, we investigated which measures of genome complexity correlate best with alignment performance, and propose a linear model for each alignment method.

Materials and Methods: We demonstrated that repeats in genomic DNA can greatly affect the performance of short-read aligners. Exploring several different measures of genome complexity, we showed that there was a high correlation between our proposed measure of genome complexity and, respectively, alignment accuracy and chromosomal coverage. The result was validated using 9 state-of-the-art aligners (Bowtie2, BWA-SW, Cushaw2, Masai, mrFast, SeqAlto, SHRiMP2, Smalt, and SOAP2) and two different datasets. The first dataset, consisting of 100 genomic sequences including bacteria, plants and eukaryotes, was used to correlate complexity with alignment accuracy. The second dataset, consisting of all 24 human chromosomes, 20 soybean chromosomes, and 10 corn chromosomes, was used to correlate complexity with chromosomal coverage. The high correlation between alignment performance and complexity enabled us to build linear regression models that accurately predict alignment accuracy and chromosomal coverage.
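
The regression models can be sketched with an ordinary least-squares fit; the x and y values below are placeholders, standing in for a genome's complexity measure and an aligner's measured accuracy:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; a stand-in for the
    per-aligner linear models relating genome complexity to alignment
    accuracy or chromosomal coverage."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def predict(a, b, complexity):
    """Predicted accuracy/coverage for a genome of given complexity."""
    return a + b * complexity
```

Once fitted on benchmarked genomes, the model predicts performance on a new genome from its complexity alone, with no alignment run.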

Results: We demonstrated the utility of this method by showing how to use the linear models to predict aligner accuracy based solely on genome complexity, without aligning any reads. This can potentially help reduce experimental cost. Further, we showed how to use the linear models to predict chromosomal coverage based on the expected read coverage. This can also help reduce experimental cost, as it allows researchers to predict how effectively a given number of reads will cover the chromosomes of interest. A visualization of genome complexity along chromosomes also helps to visually identify chromosomal regions that are potentially difficult to cover with reads.

Acknowledgements: This work is partly supported by the National Science Foundation [CCF-1320297 to V.P.]

Availability: Software to compute measures of genome complexity is available at https://github.com/vtphan/shortread-alignment-prediction

Keywords: next-generation sequencing, genome complexity, sequence alignment

Correspondence: vphan@memphis.edu

Nam S. Vo, University of Memphis

Improving variant calling by integrating read alignment with existing genetic variants

Nam S. Vo*, Vinhthuy Phan Department of Computer Science, University of Memphis, Memphis, TN 38152, USA *Corresponding author: nsvo1@memphis.edu

Motivation: The identification of genetic variants has great significance in genetic research. To call variants from next-generation sequencing data, current methods rely primarily on mapped reads produced by a separate read aligner, without taking into account existing genetic variants [1]. Thus, these methods usually require a large number of reads (high coverage) to detect variants accurately [2]. Moreover, the separation of read alignment and variant calling results in a complex workflow involving many separate steps and different tools [3].

Methods and Results: We introduce a novel method that leverages existing information about genetic variants to improve the performance of variant calling. Incorporating existing variants allows reads to be aligned more accurately and variants to be detected accurately even at low coverage. The method further integrates the two separate processes of read alignment and variant calling into one unified workflow, resulting in a much more automated, simplified and faster process. A Bayesian method is used to update the probability of each base being a variant throughout the read alignment process, and these probabilities are used to calculate the quality of variant calls.
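
The per-base Bayesian update can be sketched as follows (an illustrative haploid model with a single symmetric error rate, not the paper's actual formulation):

```python
def update_variant_prob(prior, read_base, alt_base, error=0.01):
    """One Bayesian update of the probability that a position carries
    alt_base, given one aligned read base. The prior could come from a
    database of known variants; repeated updates over the reads covering
    the position yield a posterior usable as a call-quality score."""
    # Likelihood of the observed base if the position is alt vs. reference.
    like_alt = (1 - error) if read_base == alt_base else error
    like_ref = error if read_base == alt_base else (1 - error)
    return like_alt * prior / (like_alt * prior + like_ref * (1 - prior))
```

Each supporting read pushes the posterior toward the variant, so fewer reads are needed at sites with an informative prior, which is the intuition behind the low-coverage gains.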

We showed that this method significantly improved the accuracy of variant calls, especially with low-coverage data, compared to popular methods such as GATK on data simulated from human chromosomes. At low coverage (<= 5x), this method achieved recall rates 2-19% higher than GATK while maintaining competitive precision. In particular, the method showed a significant improvement in identifying INDELs, with recall rates 33-42% higher and precision rates 9-34% higher than GATK. Our method also greatly simplifies the workflow, requiring 2 steps to call variants, whereas GATK requires 6-8 steps and 2 additional external tools (Picard and SAMtools) to preprocess the data.

Conclusions: As genetic variants are being collected for more and more people, integrating existing information into variant calling is realistic. We demonstrated that by incorporating existing variant information, accurate detection of variants can be achieved even at low coverage. Thus, the method is promising for helping to reduce experimental cost.

Acknowledgements: This work is partly supported by NSF CCF-1320297. We thank Quang Tran for his help in the initial stage of the project.

References:
[1] Nielsen, R., Paul, J. S., Albrechtsen, A., Song, Y. S.: Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443–451 (2011).
[2] Yu, X., Sun, S.: Comparing a few SNP calling algorithms using low-coverage sequencing data. BMC Bioinformatics, 14:274 (2013).
[3] Pabinger, S., Dander, A., Fischer, M., Snajder, R., Sperk, M., Efremova, M., Krabichler, B., Speicher, M. R., Zschocke, J., Trajanoski, Z.: A survey of tools for variant analysis of next-generation genome sequencing data. Briefings in Bioinformatics, 15(2), 256-278 (2014).

Chanung Wang, University of Kentucky

A comparative study of circadian rhythms and sleep between the house mouse (Mus musculus) and African spiny mouse (Acomys cahirinus)

Chanung Wang1, Thomas Gawriluk1, Melissa Keinath1, Shishir Biswas1, Jeramiah Smith1, Ashley W. Seifert1, and Bruce O’Hara1 1Department of Biology, University of Kentucky, Lexington, KY, 40506-0225

The study of circadian and sleep behavior in different organisms can provide valuable insight for understanding behavioral, physiological and environmental influences on these processes. Interestingly, two species of African spiny mice, Acomys russatus (golden spiny mouse) and Acomys cahirinus (Cairo spiny mouse), have been reported to exhibit different circadian rhythm patterns in locations where the two species overlap. Both species are primarily nocturnal when not in direct competition, but in areas of overlap A. cahirinus exhibit nocturnal behavior, while A. russatus become more diurnal. However, very few studies on the circadian activity of these species are available, and nothing is known of their sleep behavior, which can be the dominant force driving other diurnal variables. Therefore, we have begun to study one of these species (A. cahirinus) in greater detail alongside the well-studied house mouse (Mus musculus), using a well-validated, non-invasive piezoelectric system that picks up all movements during wake and the breathing rhythms during sleep. In these studies, we found A. cahirinus and M. musculus to be primarily nocturnal, but with clearly distinct behavioral patterns. Specifically, the activity of A. cahirinus sharply increases right at dark onset, which is common in nocturnal species, but, surprisingly, decreases sharply just one hour later. These differences may be related to foraging differences between these species, or to the socialized behavior of A. cahirinus and its poorer adaptation to isolation as compared to M. musculus. We have sequenced and assembled a low-coverage genome for A. cahirinus and explored genes known to influence sleep and circadian rhythms in A. cahirinus and M. musculus. We are currently investigating these and other variables that might explain A. cahirinus sleep behavior, including a comparison of genomic sequences between these species.

Jing Wang, Vanderbilt University School of Medicine

Population structure analysis on 2504 individuals across 26 ancestries using bioinformatics approaches

Jing Wang1, David C. Samuels2, Yu Shyr1*, Yan Guo1*

1Center for Quantitative Sciences, Vanderbilt University School of Medicine, Nashville, TN 37232, USA 2Center for Human Genetics Research, Vanderbilt University, Nashville, TN 37232, USA

Corresponding: Yan Guo, Yu Shyr

Background Characterizing genetic diversity is crucial for reconstructing human evolution and for understanding the genetic basis of complex diseases; however, human population genetics is very complicated. Previously, we showed that under the Hardy-Weinberg equilibrium, the heterozygous vs. non-reference homozygous single nucleotide polymorphism (SNP) ratio (het/nonref-hom) is two [1]. Later, we found that this ratio is highly ancestry dependent, with Africans being the most genetically diverse group and Asians the most homozygous [2]. This observation prompted us to conduct further study to understand the reasons behind this diversity.

Material and Methods Using genomic data for 2504 individuals released by the 1000 Genomes Project (1000G; 26 populations from five major ancestral groups), we first computed the het/nonref-hom ratio, which has been applied as a quality control parameter for sequencing data [1, 3]. As expected, we found that the het/nonref-hom ratio is strongly associated with human ancestry. Africans had the highest het/nonref-hom ratios, followed by Americans and Europeans, and East Asians had the lowest (Figure 1). More interestingly, the het/nonref-hom ratios of South Asians are much higher than those of East Asians, and Americans showed the widest range (Figure 1). We therefore quantitatively analyzed genetic variation in human populations on a 1000G dataset of observed genotypes (2504 individuals at 13,424,776 SNPs) using Structure 2.3.4 [4]. The resulting population structure is consistent with the major geographical regions. All populations showed a dominant origin population, except Americans, who had the most variation in structure, represented by several populations including the dominant population of East Asians (Figure 2). Moreover, East Asians and South Asians were found to originate from different ancestries (Figure 2).
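
Computing the het/nonref-hom ratio itself is straightforward; a minimal sketch, assuming diploid genotypes coded as allele pairs with "0" the reference allele:

```python
def het_nonref_hom_ratio(genotypes):
    """het/nonref-hom ratio over a list of diploid genotypes given as
    allele pairs, e.g. ("0", "1") = heterozygous, ("1", "1") =
    non-reference homozygous. Under Hardy-Weinberg equilibrium this
    ratio tends toward 2 when summed over sites [1]."""
    het = sum(1 for a, b in genotypes if a != b)
    nonref_hom = sum(1 for a, b in genotypes if a == b and a != "0")
    return het / nonref_hom
```

Computed per individual over all SNP sites, the ratio can then be stratified by population, as in Figure 1.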

Conclusions Using these bioinformatics approaches, we identified new insights into the history and geography of human evolution; such insights are valuable for tracking human migrations and adaptation to local conditions.

Figure 1. het/nonref-hom ratio across 26 ancestries

Figure 2. Population structure inferred from the TGP genetic data.

References
1. Guo Y, Ye F, Sheng Q, Clark T, Samuels DC: Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform 2013.
2. Wang J, Raskin L, Samuels DC, Shyr Y, Guo Y: Genome measures used for quality control are dependent on gene function and ancestry. Bioinformatics 2015, 31(3):318-323.
3. Guo Y, Zhao S, Sheng Q, Ye F, Li J, Lehmann B, Pietenpol J, Samuels DC, Shyr Y: Multi-perspective quality control of Illumina exome sequencing data using QC3. Genomics 2014, 103(5-6):323-328.
4. Hubisz MJ, Falush D, Stephens M, Pritchard JK: Inferring weak population structure with the assistance of sample group information. Mol Ecol Resour 2009, 9(5):1322-1332.

Kai Wang, University of Tennessee

An Automated Resource for Enhanced Differential Analysis

Kai Wang1, Charles A. Phillips1, Arnold M. Saxton2 and Michael A. Langston1

1 Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996-2250, USA 2 Department of Animal Science, University of Tennessee Institute of Agriculture, Knoxville, TN 37996-4574, USA

Differential Shannon entropy (DSE) and differential coefficient of variation (DCV) have proven to be effective complements to differential expression (DE) in the analysis of gene co-expression data [1]. Because DSE and DCV measure differences in variability, rather than mere differences in magnitude, they can often identify significant changes in gene activity not reflected in mean expression level. Thus, we have devised a general-purpose, easy-to-use R package to calculate DSE and DCV. Dubbed EntropyExplorer, this package operates on two numeric matrices with identically labeled rows, such as case/control transcriptomic data. All functionality is wrapped into one routine: with a single procedure call a user may select a metric; choose whether to display that metric, its p-value, or both; choose whether to sort by metric or p-value; and specify how many of the most highly ranked results to display.
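
EntropyExplorer is an R package, but the two metrics can be sketched in Python under one reasonable formulation (the package's exact definitions and p-value machinery may differ): entropy is taken over each gene's expression shares across samples, and CV is the standard deviation over the mean.

```python
from math import log2
from statistics import mean, pstdev

def shannon_entropy(expr):
    """Entropy of a gene's expression profile, treating each sample's
    (non-negative) expression as a share of the gene's total."""
    total = sum(expr)
    probs = [x / total for x in expr if x > 0]
    return -sum(p * log2(p) for p in probs)

def dse(case, control):
    """Differential Shannon entropy between case and control profiles."""
    return abs(shannon_entropy(case) - shannon_entropy(control))

def dcv(case, control):
    """Differential coefficient of variation between the two profiles."""
    return abs(pstdev(case) / mean(case) - pstdev(control) / mean(control))
```

A gene expressed evenly in controls but concentrated in a few case samples scores high on both metrics even when its mean expression is unchanged, which is exactly the signal DE misses.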

1. Wang K, Phillips CA, Rogers GL, Barrenas F, Benson M, Langston MA: Differential Shannon Entropy and Differential Coefficient of Variation: Alternatives and Augmentations to Differential Expression in the Search for Disease-Related Genes. International Journal of Computational Biology and Drug Design 2014:183-194.

Lynda A Wilmott, University of Tennessee Health Science Center

Novel drug discovery method identifies and validates Kv12.2 as a target for cognitive enhancement

LA Wilmott, SM Neuner, TM Shapaker, & CC Kaczorowski

More than 5 million elderly people in the United States are currently affected by Alzheimer's disease (AD), and by 2050 this number is expected to rise to ~16 million (Alzheimer's Association, 2012). The working hypothesis in our lab is that memory deficits in AD and 'normal' aging may share a common mechanism. One of the most important risk factors for AD dementia is aging, and age-associated changes in protein expression play a role in this memory impairment. We utilized unbiased proteomics to map the hippocampal membrane proteome of strong- and weak-learner AD mice (n=9 mice per group) and identified a novel candidate protein, KCNH3 (Kv12.2), that was significantly differentially expressed in AD mice relative to their memory performance, evidenced by a 4.8-fold increase in Kv12.2 in impaired AD mice vs. AD mice with intact memory (p = 8.5 x 10-10). Kv12.2 is a member of the voltage-gated potassium channel family involved in hyperpolarizing the resting membrane potential of hippocampal neurons, thereby reducing their excitability. Next, we utilized the BXD mouse panel, whose lines are descended from crosses between C57BL/6J and DBA/2J and are 99.5% isogenic, genetically diverse, and densely phenotyped. These BXD lines provide an excellent resource for studying both genetic and phenotypic variation across a population in traits such as 'normal' aging. Consistent with the AD findings, we found a negative relationship between Kv12.2 expression and memory on a fear conditioning task (n=13 strains) in aged BXD mice. Although a discrepancy exists regarding the role of Kv12.2 in spatial memory, we hypothesized that blocking this channel with the Kv12.2-selective antagonist CX4 during the post-training consolidation window would enhance long-term memory.
To this end, we utilized contextual and cued fear conditioning and found that CX4 significantly increased both contextual and cued fear memory in adult C57BL/6J mice (n=5/grp), with a similar trend in pre-symptomatic Cg 5XFAD mice (n=4/grp), an AD mouse model. Collectively, our results suggest that there may be a disease-related change in Kv12.2 expression. Further tests will be conducted to elucidate the mechanism by which Kv12.2 affects memory.

Dongfeng Wu, University of Louisville

Long Term Screening Outcomes for Aged People with a Screening History

Dongfeng Wu

University of Louisville, School of Public Health and Information Sciences, KY 40202, USA

This research extends a probability model that the author previously developed for evaluating long-term outcomes of regular screening. The previous model was for people without any screening history, while the current extension focuses on older people who have a screening history and are apparently healthy so far. People with a screening history are categorized into four mutually exclusive groups: True-early-detection, No-early-detection, Over-diagnosis, and Symptom-free-life. Probability formulae were derived for each case. These probabilities change with a person's current age, previous screening history, future exam frequency, screening sensitivity, and other parameters. Human lifetime was treated as a random variable using the actuarial life table from the US Social Security Administration. Simulation studies using the HIP breast cancer data provide estimates for these probabilities and corresponding credible intervals. The model provides important information on the future proportion of each category for aged individuals with a screening history, and on the potential risk of over-diagnosis at an advanced age. Finally, the model is applicable to other kinds of screening tests as well.
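
The four-way categorization can be illustrated with a toy Monte Carlo; every distribution below is a made-up placeholder, whereas the model itself derives these probabilities analytically and draws lifetime from the SSA actuarial life table.

```python
import random

def simulate_outcomes(exam_ages, sens, onset_ages, sojourn, death_ages,
                      n=20000, seed=1):
    """Toy simulation of the four mutually exclusive outcomes for a
    currently healthy individual. onset_ages may contain None (disease
    never starts); sojourn is the preclinical duration before symptoms;
    detection at an exam requires the exam to fall in the preclinical
    window, before death, and to succeed with probability sens."""
    rng = random.Random(seed)
    counts = dict.fromkeys(
        ["true_early_detection", "no_early_detection",
         "over_diagnosis", "symptom_free_life"], 0)
    for _ in range(n):
        death = rng.choice(death_ages)
        onset = rng.choice(onset_ages)
        if onset is None or onset >= death:
            counts["symptom_free_life"] += 1
            continue
        clinical = onset + sojourn  # age at which symptoms would appear
        detected = any(onset <= e < min(clinical, death)
                       and rng.random() < sens for e in exam_ages)
        if detected:
            key = ("true_early_detection" if clinical < death
                   else "over_diagnosis")
        else:
            key = ("no_early_detection" if clinical < death
                   else "symptom_free_life")
        counts[key] += 1
    return {k: v / n for k, v in counts.items()}
```

Because the four categories are mutually exclusive and exhaustive, the estimated proportions always sum to one, mirroring the analytic model.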

Key words: over-diagnosis, true-early-detection, symptom-free-life, sensitivity, sojourn time, transition probability.

References: 1. Wu D, Kafadar K, and Rosner GL (2014). Inference of long-term effects and overdiagnosis in periodic cancer screening. Statistica Sinica 24(2), 815-831.

2. Wu, D. (2014). Long term effects of periodic cancer screening for aged people with a screening history. In JSM Proceedings, International Chinese Statistical Association Section. Alexandria, VA: American Statistical Association. 793-804.

Zhonghang Xia, Western Kentucky University

A semi-supervised learning framework for peptide identification

1Xijun Liang, 2Zhonghang Xia, 3Xinnan Niu, 4Andrew J. Link

1 College of Science, China University of Petroleum, Qingdao, China. 2 Department of Mathematics & Computer Science, Western Kentucky University, Bowling Green, KY 42101. 3,4 Department of Microbiology and Immunology, Vanderbilt University, Nashville, TN 37232, USA.

The complexity of biological samples and the experimental process introduces substantial noise into mass spectra, resulting in a large number of incorrect peptide-spectrum matches (PSMs) in SEQUEST’s search results. This substantially increases the computational burden of post-database search methods. Existing filtering methods cannot efficiently cope with large-scale datasets. Moreover, trained models are hard to validate because the correctness of a target PSM is difficult to determine. We have developed a post-database search method, FC-Ranker, that validates PSMs by assigning each target PSM a nonnegative weight indicating the probability that it is correct. The weight is computed from a clustering analysis of target PSMs. Experimental studies show that FC-Ranker outperforms other post-database search algorithms across a variety of datasets in terms of ROC curves and the number of identified PSMs. However, choosing appropriate weight-computation parameters for different PSM datasets remains a challenge. In this work, we propose a novel approach, CRanker, that addresses this difficulty by formulating the filtering problem in a semi-supervised learning framework. The weights of target PSMs are treated as optimization variables in an SVM-based classification model. A Cholesky factorization technique is employed in CRanker to reduce the memory required by the kernel matrix in large-scale problems. By dividing large datasets into several sub-datasets, CRanker can efficiently handle datasets of 400,000 samples and apply the trained classifiers to each independent subset at a controlled FDR. Compared with PeptideProphet and Percolator, CRanker identified more PSMs at similar false discovery rates across different datasets.
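The semi-supervised weighting idea can be sketched as follows. The 2-D features, the sigmoid-of-margin weight update, and all constants are illustrative stand-ins, not CRanker's actual SVM formulation or kernel.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy sketch of the semi-supervised weighting idea: decoy PSMs are fixed
# negatives, while each target PSM carries a weight in [0, 1] (its estimated
# probability of being correct) that enters the SVM as a sample weight.

rng = np.random.default_rng(0)
decoys = rng.normal(-1.0, 1.0, size=(200, 2))                # incorrect by construction
targets = np.vstack([rng.normal(1.0, 1.0, size=(150, 2)),    # mostly correct matches
                     rng.normal(-1.0, 1.0, size=(50, 2))])   # contaminating incorrect PSMs

X = np.vstack([decoys, targets])
y = np.array([0] * len(decoys) + [1] * len(targets))
w = np.ones(len(X))                                      # target weights start at 1

for _ in range(5):   # alternate: fit the SVM, then re-derive the target weights
    clf = LinearSVC(C=1.0, dual=False).fit(X, y, sample_weight=w)
    margin = clf.decision_function(X[len(decoys):])
    w[len(decoys):] = 1.0 / (1.0 + np.exp(-margin))      # soft "correctness" weight

target_weights = w[len(decoys):]
```

After a few iterations the contaminating (incorrect) target PSMs are down-weighted relative to the correct ones, which is the behavior the weight variables are meant to capture.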

Farid Yaghouby, University of Kentucky

An Effective Seizure Forecasting Model Based on Random Forest Classifiers

Farid Yaghouby1*, Behrouz Madahian2, Sridhar Sunderam1

1. Department of Biomedical Engineering, University of Kentucky, Lexington, KY. 2. Department of Mathematical Sciences, University of Memphis, Memphis, TN.

About 30% of all patients with epilepsy experience seizures that are unresponsive to medication or resective surgery. Although seizure frequency in these patients may be moderate, the constant threat of an impending seizure prevents them from engaging in many routine daily activities. Hence, an effective seizure forecasting system that identifies periods of elevated seizure risk would improve the quality of life of patients with intractable seizures. Early seizure warning would help patients avoid potentially risky activities (e.g., driving or swimming) and enable individually tailored closed-loop anti-seizure therapies. Research over the past decade has shown that seizures are not purely random events and that statistical models can predict seizures to some extent. The goal of a seizure prediction algorithm is typically to differentiate interictal (baseline) and preictal (pre-seizure) periods. In this study, a statistical algorithm for anticipating seizures based on a random forest classifier is proposed and tested on prolonged intracranial EEG recordings in dogs. The feasibility of classifying preictal and interictal states is explored, and out-of-sample testing showed perfect sensitivity and a very low false positive rate for the proposed algorithm.
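A minimal version of such a forecasting classifier might look like the sketch below; the six per-epoch features are synthetic stand-ins (e.g. for spectral band powers), not the study's actual iEEG feature set, and the split is only a crude analogue of true out-of-sample testing.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Minimal forecasting sketch: a random forest separating preictal from
# interictal epochs using per-epoch features (synthetic here).

rng = np.random.default_rng(42)
n = 400
interictal = rng.normal(0.0, 1.0, size=(n, 6))           # baseline epochs
preictal = rng.normal(1.5, 1.0, size=(n, 6))             # shifted pre-seizure distribution
X = np.vstack([interictal, preictal])
y = np.array([0] * n + [1] * n)

# chronological-style split: train on the first 75% of each state's epochs
train = np.r_[0:300, 400:700]
test = np.setdiff1d(np.arange(2 * n), train)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[train], y[train])
pred = clf.predict(X[test])

sensitivity = (pred[y[test] == 1] == 1).mean()           # preictal epochs caught
false_positive_rate = (pred[y[test] == 0] == 1).mean()   # alarms during interictal epochs
```

Sensitivity and false positive rate on the held-out epochs are the same two quantities the abstract reports for the dog iEEG data.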

Dake Yang, University of Louisville

Integrated analysis of miRNA-mRNA expression profiles

Dake Yang, Partha Mukhopadhyay, Robert Greene, Michele Pisano, and Guy Brock. Affiliation: Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, Kentucky, 40218, USA

MicroRNAs (miRNAs) are small endogenous non-coding RNA molecules (18-25 nucleotides in length) that regulate gene expression post-transcriptionally. While a variety of algorithms exist for determining the targets of miRNAs, they are generally based on sequence information and frequently produce lists of thousands of genes. Canonical correlation analysis (CCA) is a multivariate statistical method for finding linear relationships between two data sets; here we apply CCA to find the linear combinations of differentially expressed miRNAs and their corresponding target genes having maximal negative correlation. Due to the high dimensionality, sparse CCA is used to constrain the problem and obtain a solution. A novel gene set enrichment analysis statistic based on the sparse CCA results is proposed for estimating the significance of predefined gene sets. The methods are illustrated with both a simulation study and real miRNA-mRNA expression data from the murine embryonic developing neural tube.

Sen Yao, University of Louisville

A less biased analysis of metalloproteins reveals novel zinc coordination geometries

Sen Yao1, Robert M. Flight2, Eric C. Rouchka1, Hunter N.B. Moseley2* 1 School of Interdisciplinary and Graduate Studies / Department of Computer Engineering and Computer Science, University of Louisville, Louisville, Kentucky, USA 2 Department of Molecular & Cell Biochemistry / Markey Cancer Center / Resource Center for Stable Isotope Resolved Metabolomics, University of Kentucky, Kentucky, USA

* hunter.moseley@uky.edu

Background Zinc metalloproteins are involved in many biological processes and play crucial biochemical roles across all domains of life. The local structure around the zinc ion, especially the coordination geometry (CG), is dictated by the protein sequence and is often directly related to the function(s) of the protein. Current methodologies for characterizing the CG of zinc metalloproteins consider only previously reported (canonical) CG models, based mainly on non-biological chemical contexts. Exceptions to these canonical CG models are either misclassified or discarded as “outliers”.

Methods We developed a less-biased method that directly handles potential exceptions without presuming any canonical CG model. Zinc metalloproteins were acquired from the worldwide Protein Data Bank (wwPDB). We calculated the number of binding ligands for each zinc site using criteria derived from an analysis of ligand-zinc bond lengths. Zinc sites with a compressed ligand-zinc-ligand angle (about 58 or 38 degrees) were separated from normal zinc sites, as compressed angles are very likely to indicate a novel CG. K-means clustering was then applied to the normal and compressed classes separately to differentiate CGs based on angle statistics. Assignment of clusters to canonical and novel CGs was based on cluster centers, 3D structures, and average χ2 probabilities together. We also cross-validated our k-means clustering results against functional annotations derived from InterProScan.
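The clustering step can be illustrated with a toy k-means run on simulated angle statistics for two canonical geometries; the real analysis uses richer angle statistics computed from wwPDB structures, and these simulated sites are not data from the study.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy version of the clustering step: each zinc site is summarized by simple
# angle statistics (mean and s.d. of its ligand-zinc-ligand angles) and the
# sites are grouped by k-means. Angles are simulated around two idealized
# geometries: tetrahedral (109.5 deg) and octahedral (90 deg).

rng = np.random.default_rng(3)

def site(mean_angle, n_angles=6):
    angles = rng.normal(mean_angle, 4.0, size=n_angles)
    return [angles.mean(), angles.std()]

tetrahedral = np.array([site(109.5) for _ in range(100)])
octahedral = np.array([site(90.0) for _ in range(100)])
X = np.vstack([tetrahedral, octahedral])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = sorted(km.cluster_centers_[:, 0])   # mean-angle coordinate of each center
```

The recovered cluster centers land near the two idealized mean angles, which is the sense in which clusters can then be matched to canonical or novel CG models.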

Results and Conclusions Our study shows that thousands of exceptions to canonical CGs can in fact be classified and that new CG models are needed to characterize them. These new CG models are cross-validated by a strong correlation between independent structural and functional annotation distance metrics, which is lost if the new CG models are ignored. Furthermore, the new CG models exhibit functional propensities distinct from those of the canonical CG models.

Guannan Zhao, University of Tennessee Health Science Center

Lentiviral CRISPR/Cas9-mediated genome editing reveals functional differences among Notch receptors

1,2Guannan Zhao, 3Jinggang Yin, 1,2Qingqing Gu, 4Lu Lu, 3Edward Chaum, 1,2Junming Yue*

Department of Pathology and Laboratory Medicine, Center for Cancer Research, Viral Vector Core Facility, 3 Department of Ophthalmology and 4 Genomics and Bioinformatics, the University of Tennessee Health Science Center, Memphis, TN

Background CRISPR/Cas9 (clustered regularly interspaced short palindromic repeats)-mediated genome editing is a powerful approach for defining the functions of genes and non-coding RNAs. A single guide RNA (gRNA) directing wild-type Cas9 induces indels at a specific genomic locus, leading to gene knockout. A single gRNA directing a Cas9 nickase (carrying a single mutation) nicks one strand at the targeted locus, and two gRNAs combined with the nickase can generate a genomic deletion by nicking the two different strands. A gRNA directing catalytically inactivated Cas9 (dCas9, carrying two mutations) fused to an engineered transcriptional activator or repressor has been used to study gene function through transcriptional activation or repression. Human and mouse gRNA libraries, as well as the Synergistic Activation Mediator (SAM) library, are currently available from the Addgene website; these gRNA sequences are readily available to target individual genes or miRNAs using the CRISPR/Cas9 or CRISPR/dCas9 system. While CRISPR/Cas9 is used for genome editing, CRISPR/dCas9 is a newer approach for studying transcriptional regulation through transcriptional interference by an engineered transcriptional complex.

Materials and methods We cloned two gRNAs targeting each of Notch1, Notch2, Notch3, and Notch4 into lentiviral CRISPR/Cas9 vectors and generated lentivirus in HEK293FT cells. Viruses were used to transduce the ovarian cancer cell lines SKOV3 and OVCAR3, and transduced cells were selected with puromycin. Gene knockouts were examined by Western blot.

Results We have generated several ovarian cancer knockout cell lines using the lentiviral CRISPR/Cas9 system by disrupting Notch1, Notch2, Notch3, and Notch4. We further examined several downstream target genes of the Notch signaling pathway (MAM1, JAG1, and Rbpsuh) and found that disruption of the Notch receptors using CRISPR/Cas9 leads to downregulation of these target genes. Moreover, we found that the different receptors contribute differently to tumor invasion and metastasis through distinct mechanisms.

Conclusions The lentiviral CRISPR/Cas9 system is a powerful approach for defining gene function, and disruption of the different Notch receptors reveals their functional differences in contributing to tumor metastasis.

*Correspondence to: Dr. Junming Yue, University of Tennessee Health Science Center, 19 S. Manassas St., Rm. 266, Memphis, TN 38163; Fax: 901-448-3910; Phone: 901-448-2091; Email: jyue@uthsc.edu

Junfei Zhao, Vanderbilt University School of Medicine

SGDriver: A novel structural genomics-based method to prioritize druggable mutations in 16 major cancer types

Junfei Zhao1, Feixiong Cheng1, Zhongming Zhao1,2,3,4*

1Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37203, USA 2Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232, USA 3Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, TN 37232, USA 4Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA

*Address correspondence to: zhongming.zhao@vanderbilt.edu

A huge volume of somatic mutations has been generated through The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) projects. However, understanding the functional consequences of somatic mutations in cancer remains a monumental challenge in cancer genomics. Thanks to the development of structural genomics technologies such as X-ray crystallography and NMR, a large amount of protein structure data has been generated in the past decade, enabling us to map somatic mutations onto protein functional features (e.g., protein-ligand binding sites) and investigate their potential impacts.

In this study, we have developed SGDriver, a structural genomics-based method that incorporates protein-ligand binding-site information into somatic missense mutation data to help understand the pathophysiological role of variants and to prioritize putative druggable mutations using a Bayesian inference statistical framework. We applied SGDriver to 746,631 missense mutations across 16 major cancer types from The Cancer Genome Atlas. We found 251 genes whose protein products were enriched with ligand-binding-site mutations at a false discovery rate < 0.05, including 43 Cancer Gene Census genes. Furthermore, drug-gene network analysis identified ~100 druggable anticancer targets using data from the DrugBank, Therapeutic Target Database, and PharmGKB databases. Finally, bioinformatics analysis using Connectivity Map data identified several existing drugs that could potentially be repurposed for precision cancer therapy by targeting the products of the cancer driver genes identified by SGDriver. Taken together, this study provides a novel method for identifying druggable mutations for precision cancer medicine.
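A toy stand-in for the enrichment logic is sketched below: a simple one-sided binomial test replaces the paper's Bayesian inference framework, and the gene names and mutation counts are invented.

```python
from scipy.stats import binomtest

# Toy stand-in for SGDriver's enrichment idea: for each gene, ask whether
# more missense mutations hit ligand-binding-site residues than expected
# from the fraction of the protein those residues cover.

genes = {
    # gene: (mutations_at_binding_site, total_mutations, binding_site_fraction)
    "GENE_A": (12, 30, 0.10),   # 40% of mutations in 10% of residues: enriched
    "GENE_B": (2, 40, 0.08),    # roughly background-level
}

pvalues = {g: binomtest(k, n, p, alternative="greater").pvalue
           for g, (k, n, p) in genes.items()}
```

In the real method the per-gene evidence would then be combined across 746,631 mutations and corrected for multiple testing to reach the FDR < 0.05 gene list.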

Shilin Zhao, Vanderbilt University

FunTFPair: An R package to identify functional transcription factor pairs by expression data

Shilin Zhao1#, Qi Liu1#, Yu Shyr1*

1. Center for Quantitative Sciences, Vanderbilt University, Nashville, TN 37027, USA # Equal Contribution * Corresponding author; Email: yu.shyr@vanderbilt.edu

Background Transcription factors (TFs) are fundamental controllers of cellular regulation and function in a complex and cooperative manner. Accurate identification of functional transcription factor pairs is essential to understanding their roles under given conditions. However, due to the high false positive rates of current methods, reliably identifying functional transcription factor pairs under a given condition remains difficult.

Materials and methods The Encyclopedia of DNA Elements (ENCODE) project provides comprehensive information on the targets of hundreds of TFs and thus on possible TF pairs, while the Gene Expression Omnibus (GEO) archives expression data from a wide variety of conditions. FunTFPair uses the TF pairs from the ENCODE project as candidates; users can supply their own expression data or select any GEO dataset that best matches their experimental design. FunTFPair then performs differential or correlation analysis to determine whether the target genes of a TF pair are significantly influenced by the change in conditions (Figure 1).

Result The functional TF pairs and their relative importance are reported in a TF cooperation network. Two datasets from GEO are used as examples to demonstrate the usage and reliability of the package.

Availability FunTFPair is implemented in R and is freely available on GitHub (https://github.com/slzhao/FunTFPair). Figure 1: The flowchart of the FunTFPair package.

Zhongming Zhao, Vanderbilt University School of Medicine

A comparative analysis of the inconsistencies in detecting variants from whole exome sequencing versus transcriptome sequencing in lung cancer

Timothy D. O’Brien1,2, Peilin Jia2, Junfeng Xia2, Uma Saxena3, Hailing Jin4, Huy Vuong2, Pora Kim2, Qingguo Wang2, Martin J Aryee3, Mari Mino-Kenudson3, Jeffrey Engelman5, Long P. Le3, A. John Iafrate3, Rebeca S Heist5, William Pao4, and Zhongming Zhao2,6,7

1 Center for Human Genetics Research, Vanderbilt University School of Medicine, Nashville, TN 37232, United States. 2 Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37203, United States. 3 Department of Pathology, Massachusetts General Hospital, Boston, MA, 02114, United States 4 Department of Medicine/Division of Hematology-Oncology, Vanderbilt University School of Medicine, Nashville, TN 37232, United States 5 Department of Medicine, Division of Hematology and Oncology, Massachusetts General Hospital, Boston, MA 02114, United States 6 Department of Psychiatry, Vanderbilt University School of Medicine, Nashville, TN 37232, United States. 7 Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232, United States.

*Corresponding author: Zhongming Zhao, Zhongming.zhao@vanderbilt.edu

Background Whole exome sequencing (WES) and RNA sequencing (RNA-Seq) are two next-generation sequencing (NGS) techniques used for detecting genetic alterations in cancer. WES is used primarily for detection of single nucleotide variants (SNVs), while RNA-Seq is more commonly used for measurement of gene expression. However, RNA-Seq can also be used to detect SNVs and has been applied in several cancer studies [1, 2]. A detailed analysis of how consistently SNVs are detected by the two sequencing methods on identical samples has not yet been performed.

Materials and methods We used 27 matched tumor-normal lung cancer samples that have both WES and RNA-Seq data to compare SNVs called from WES and RNA-Seq of the same patients. We mapped WES reads to the human genome reference (hg19) using bwa [3]. Post-processing of the initial mapping included steps to mark duplicate reads using Picard [4] and perform local realignment using GATK [5, 6]. We mapped RNA-Seq reads to the human transcriptome and genome (hg19), using TopHat2 [7]. We used MuTect [8] to call SNVs for both WES and RNA-Seq, VarScan 2 [9] to determine read counts, and Cufflinks [10] to compute gene-based expression levels (FPKM) of RNA-Seq data.
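The platform comparison itself reduces to set operations on variant keys, as in the sketch below; the calls shown are invented examples, not variants from the study.

```python
# Sketch of the comparison step: SNVs from each platform keyed by
# (chromosome, position, ref, alt) and intersected per patient.

wes_calls = {("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T"), ("chr3", 3000, "G", "A")}
rna_calls = {("chr1", 1000, "A", "G"), ("chr4", 4000, "T", "C")}

shared = wes_calls & rna_calls           # detected by both platforms
wes_unique = wes_calls - rna_calls       # e.g. unexpressed or non-transcribed-strand sites
rna_unique = rna_calls - wes_calls       # e.g. outside the capture kit, or RNA editing

overlap = len(shared) / len(wes_calls | rna_calls)   # fraction of all calls shared
```

In the study, each WES-unique and RNA-Seq-unique call would then be annotated with coverage, expression level, and strand information to explain why it was missed on the other platform.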

Results We found a low overlap of ~14% between the SNVs detected from WES and RNA-Seq. Among WES-unique SNVs, 41% were not covered in RNA-Seq, and 17% were missed due to low coverage (2 - 7 reads). We also examined gene expression levels in RNA-Seq for these WES-unique SNVs. As expected, 51% of WES-unique SNVs were located in unexpressed genes (FPKM < 1). We surveyed WES-unique SNVs with available cDNA information, and on average 49% were located on the non-transcribed strand. Among RNA-Seq-unique SNVs, 71% were not covered by the WES capture kit. However, for SNVs covered by the WES capture kit, 82 - 98% had callable coverage (≥ 8 reads) in WES. We examined their allele frequencies and determined that only 3% of the alternate alleles occurred at frequencies ≥ 20% in WES. We analyzed the mutation patterns of the SNVs and found that 55% of the RNA-Seq-unique SNVs displayed a T:A → C:G pattern, a signature of potential RNA editing by adenosine deaminase acting on RNA (ADAR) in these samples.

Conclusions In conclusion, we found several technical and biological reasons for the small overlap between SNVs detected in identical samples using WES and RNA-Seq. This work serves as an important resource regarding the inconsistencies in detecting SNVs in WES and RNA-Seq data.

Acknowledgements This work was supported by a grant from LUNGevity Foundation and Upstage Lung Cancer. We also thank financial support from US National Institutes of Health grants (R01LM011177, P50CA095103, P50CA098131, and P30CA068485), a Vanderbilt Breast SPORE pilot grant, and Ingram Professorship funds (ZZ). TO was supported by a National Institute of General Medical Sciences Training Grant (T32GM080178).

References
1. Seo JS, Ju YS, Lee WC, Shin JY, Lee JK, Bleazard T, Lee J, Jung YJ, Kim JO, Shin JY, et al: The transcriptional landscape and mutational profile of lung adenocarcinoma. Genome Res 2012, 22:2109-2119.
2. Govindan R, Ding L, Griffith M, Subramanian J, Dees ND, Kanchi KL, Maher CA, Fulton R, Fulton L, Wallis J, et al: Genomic landscape of non-small cell lung cancer in smokers and never-smokers. Cell 2012, 150:1121-1134.
3. Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010, 26:589-595.
4. Picard web site: http://picard.sourceforge.net/.
5. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011, 43:491-498.
6. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010, 20:1297-1303.
7. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL: TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 2013, 14:R36.
8. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, Getz G: Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013, 31:213-219.
9. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK: VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012, 22:568-576.
10. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010, 28:511-515.