
Data Integration of omic data using Differential Bayesian Networks

Project Ref: NGCM0102
Available: Yes
Supervisor: Faisal I Rezwan (Medicine)
Faculty: Faculty of Medicine
Research Group: Genomic and Epigenomic Bioinformatics
Co-supervisor: Ruben Sanchez-Garcia (Maths), Antony Overstall (Maths)
Faculty: FSHM
Academic Unit: School of Mathematics
Research Area: Computational Engineering

Project Description:

Integrating data from multiple sources is a key challenge in the Life Sciences, particularly for omic data, where analysis of a single data type is often insufficient to explain the aetiology of complex traits. Assimilating different sources of data is therefore essential to understanding the underlying biological mechanisms linked with phenotype. To be applicable to typical biological data sets, data integration methodologies must meet many computational challenges, from data size and heterogeneity to dimensionality and noise. Although a number of analytical approaches and software tools are available for integrating biological data, new strategies are required to better understand the aetiology of complex traits.

Graphical models are often used to represent the conditional independence structure (i.e. the associations) between a series of variables. For example, a particular gene may be associated with a given disease, but this association may depend on several other variables. In an undirected graphical model there is no direction to these associations: one variable cannot be said to predict or cause the other. Bayesian networks, which are defined on directed acyclic graphs, add a direction to these associations. This means that, given the presence of a particular gene and the status of other variables, the probability of disease can be inferred.
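To make the last point concrete, the following is a minimal sketch of inference in a tiny Bayesian network, assuming a hypothetical three-node structure (Gene -> Disease <- Exposure) with made-up probabilities. The variable names, the network structure and every number are illustrative assumptions, not values from the project or from any cohort. The probability of disease is obtained by summing the joint distribution over whichever variables are not observed.

```python
from itertools import product

# Hypothetical network: Gene -> Disease <- Exposure.
# All probabilities below are invented for illustration only.
P_GENE = {True: 0.3, False: 0.7}          # P(Gene)
P_EXPOSURE = {True: 0.4, False: 0.6}      # P(Exposure)
# P(Disease = True | Gene, Exposure), keyed by (gene, exposure)
P_DISEASE_GIVEN = {
    (True, True): 0.5,
    (True, False): 0.2,
    (False, True): 0.1,
    (False, False): 0.05,
}


def joint(gene, exposure, disease):
    """Joint probability, factorised along the directed edges of the graph."""
    p_d = P_DISEASE_GIVEN[(gene, exposure)]
    return P_GENE[gene] * P_EXPOSURE[exposure] * (p_d if disease else 1.0 - p_d)


def prob_disease(evidence):
    """P(Disease = True | evidence) by enumeration.

    `evidence` maps "gene" and/or "exposure" to an observed True/False value;
    unobserved variables are summed out.
    """
    numerator = 0.0
    denominator = 0.0
    for gene, exposure, disease in product([True, False], repeat=3):
        assignment = {"gene": gene, "exposure": exposure}
        # Skip assignments that contradict the observed evidence.
        if any(assignment[name] != value for name, value in evidence.items()):
            continue
        p = joint(gene, exposure, disease)
        denominator += p
        if disease:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print(prob_disease({"gene": True, "exposure": True}))
    print(prob_disease({"gene": True}))
    print(prob_disease({}))
```

With these invented numbers, observing the gene alone gives a disease probability of about 0.32, compared with about 0.145 when nothing is observed. Enumeration is only practical for very small networks; real omic data would call for structure learning and approximate inference, which this sketch does not attempt.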
Differential Bayesian networks are a technique for determining whether the association structures of two (or more) different populations are the same.

The main aim of the project is to develop a robust data integration technique for heterogeneous biological data sources, using a meta-dimensional transformation integration technique based on a Bayesian network approach. We will then extend it by performing differential Bayesian network testing on the resulting networks, which could potentially differentiate phenotypes. To implement this method, we will use omic data available from the Isle of Wight Third Generation (IoW F2) cohort, which will help us understand gene-environment interaction in childhood asthma. The project will add value to both data integration and respiratory research: it will develop a novel methodology for integrating omic datasets and contribute to the understanding of childhood respiratory disease.

The main objectives for the PhD project are:

a. Data preprocessing and quality control: DNA methylation profiles of blood samples from the IoW F2 cohort will be processed using the standard methods in [6,7]. Microbiome samples will be preprocessed and quality controlled (QCed) with the standard bioinformatic process used in [8]. QCed and preprocessed gene expression data for the matching samples are also available.

b. Data reduction: Explore and implement data reduction techniques to handle the high dimensionality of these three omic data types while retaining biologically meaningful variables. This can be a combination of intrinsic techniques (such as principal component analysis or factor analysis) and extrinsic techniques (using an existing knowledge base).

c. Data integration: Develop, perform, and validate a meta-dimensional data integration method using Bayesian networks for heterogeneous sources of biological data.
Meta-dimensional analyses are not hypothesis free, so we first need to establish relationships between phenotypes and each genomic data set. Then, based on phenotypic classification (for example, childhood asthma and non-asthma), we will implement transformation-based data integration techniques.

d. Web service: The last stage of the project is to implement a web service where external users can upload their omic data and run a user-interactive data integration analysis. This will produce a user-friendly QC and analysis report that the user can take into further downstream analyses.

The prospective candidate must be a highly motivated individual with at least an upper second-class degree in Computer Science, Bioinformatics, Physics or a related field, and a background and/or interest in Molecular Biology. Programming experience in a numerical computing environment (ideally R, Matlab, Perl, Python, Java, C or C++), data analytical techniques, and UNIX skills are desirable. An enthusiasm for real-world applications of complex mathematical ideas and a positive attitude towards interdisciplinary research are essential.

If you wish to discuss any details of the project informally, please contact Faisal Rezwan, Email: F.Rezwan@soton.ac.uk, Tel: +44 (0) 2380 482002

Keywords: Applied Mathematics, Bioinformatics, Computer Science, Health sciences

Support: All studentships provide access to our unique facilities and training and research support.

