Final Degree Projects in Statistics UB-UPC, Facultat d'Economia i Empresa (UB) and Facultat de Matemàtiques i Estadística (UPC), Academic year: 2013-2014, Supervisor: Esteban Vegas Lozano
Over the last decade, new high-throughput technologies have been developed that
generate such a large volume of biological data that it has motivated the creation of new
algorithms in the field of bioinformatics to analyze the data generated. These advances
have revolutionized molecular biology and have led to a new mindset in which
a global view of biological systems is developed. In this context, there are currently
two major lines of research: the integration of omics data and the visualization of
the original variables. The simultaneous analysis of more than one type of omics data,
combined with the visualization of the relationships between the thousands of biological
variables, may lead to a better understanding of biological processes. This project studies
the Kernel PCA technique together with procedures for representing the original
variables, applies it to two omics datasets and presents it in an accessible form through
interactive web applications.
The development over the last decade of high-throughput technologies, new techniques
for measuring biological data, has dramatically changed our view of molecular
biology. Whereas a few years ago each gene or protein was studied as a single entity,
new technologies allow large numbers of genes or proteins to be analyzed simultaneously. As
a result, biological processes are studied as complex systems of functionally interacting
macromolecules. This new mindset has led to the rise of new disciplines, such as genomics,
proteomics and transcriptomics, in the so-called “omics era”. What they all have in common
is that they are based on the analysis of a large volume of heterogeneous biological data. These
datasets encourage researchers to develop new algorithms in the field of bioinformatics
for their interpretation.
Within this context, there are currently two major research challenges: omics data
integration and visualization of the input variables. The simultaneous analysis of
integrated omics data, combined with the visualization of relationships between the thousands
of biological variables generated, may lead to a better understanding of the global
functioning of biological systems. Although individual analysis of each of these omics
datasets undoubtedly results in interesting findings, it is only by integrating them that one
can gain a global insight into cellular behavior. A systems approach is thus predicated
on the integration of multiple independent datasets. Visualization is a key aspect of both
the analysis and understanding of the omics data. The challenge is to create clear and
meaningful visualizations that give biological insight, despite the complexity of the data.
In this project, we first present the main types of omics data, the associated high-throughput
technologies and the challenges that their analysis presents, including the integration
of omics data. After this, we give an overview of the discipline of machine
learning, which provides algorithms and techniques to analyze omics data. In addition,
special attention is paid to kernel methods, which are among the most powerful methods
for integrating heterogeneous data types. In the present work, we analyze the integration
of data from several sources of information using the Kernel PCA technique together with
a set of procedures to represent the input variables. Then we apply them to two different
omics datasets. Finally, we make this technique accessible through
interactive web applications.
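
As a rough illustration of the idea of integrating several data sources with Kernel PCA, the following minimal sketch builds one kernel matrix per source, averages them into a single kernel, centres it and extracts the leading components. It is not the implementation used in this project: the Gaussian (RBF) kernel, the value of `gamma`, the averaging rule, the power-iteration eigensolver, all function names and the toy values are assumptions made only for this example.

```python
import math

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix between the rows of X (kernel choice is an assumption)."""
    n = len(X)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]

def combine_kernels(kernels):
    """Integrate sources by averaging their kernel matrices (one possible rule, assumed here)."""
    n, m = len(kernels[0]), len(kernels)
    return [[sum(K[i][j] for K in kernels) / m for j in range(n)] for i in range(n)]

def center_kernel(K):
    """Double-centre a symmetric K so the implicit feature vectors have zero mean."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]
    grand_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]

def top_eigenpairs(K, k, iters=500):
    """Leading k eigenpairs of a symmetric PSD matrix via power iteration with deflation."""
    n = len(K)
    A = [row[:] for row in K]
    pairs = []
    for c in range(k):
        v = [math.sin(i + 1.0 + c) for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for the unit vector v
        lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found before searching for the next one
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

def kernel_pca(K, k=2):
    """Sample scores on the first k kernel principal components."""
    pairs = top_eigenpairs(center_kernel(K), k)
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs] for i in range(len(K))]

# Synthetic toy values: two hypothetical data sources measured on the same six samples
source_a = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
source_b = [[1.0], [1.1], [0.9], [4.0], [4.2], [3.9]]

K = combine_kernels([rbf_kernel(source_a, 0.5), rbf_kernel(source_b, 0.5)])
scores = kernel_pca(K, k=2)
for i, s in enumerate(scores):
    print(i, [round(x, 3) for x in s])
```

Because every source is reduced to a sample-by-sample kernel matrix before being combined, sources with different numbers or types of variables can be merged as long as they describe the same samples. In this toy case the first component should separate the first three samples from the last three.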