Automated Discovery of Cancer Types from Genes

30-minute Talk - Sunday, July 28 at 2:30pm in Barbie Tootle

Cancer treatment often focuses on the organ of origin, but different types of cancer can occur in one organ. Gene expression provides valuable clues to the cancer type, but studying the data manually is difficult. Instead, we use variational autoencoding, a deep learning method, to derive a 36-dimensional feature space from a 5000-dimensional gene space and show its efficacy in classification and a t-SNE visualization.

While many other diseases are relatively predictable and treatable, cancer is diverse and unpredictable, which makes diagnosis, treatment, and control extremely difficult. Traditional methods treat cancer based on the organ of origin in the body, such as breast or brain cancer, but this type of classification is often inadequate. If we can identify cancers based on their gene expression profiles, there is hope of finding better medicines and treatment methods. However, gene expression data is so vast that humans cannot detect such patterns manually. In this project, we apply unsupervised deep learning to automatically identify cancer subtypes. In addition, we seek to organize patients by the similarity of their gene expression profiles, so that similar patients are easier to recognize.

While traditional clustering algorithms use nearest neighbor methods and linear mappings, we use a recently developed technique called Variational Autoencoding (VAE) that can automatically find clinically meaningful patterns and therefore find clusters that have medical significance. Keras, a Python-based deep learning framework, offers an elegant way to define, train, and apply such a VAE model. In this work, data from 11,000 patients across 32 different cancer types was retrieved from The Cancer Genome Atlas. A VAE was used to compress 5000 dimensions into 100 clinically meaningful dimensions. Then, the data was reduced to two dimensions for visualization using t-SNE (t-distributed stochastic neighbor embedding). Finally, an interactive JavaScript scatter plot was created. We noticed that the VAE representation correctly clustered existing types, identified new subtypes, and pointed to similarities across cancer types. The interactive plot of patient data also allows the study of each patient's nearest neighbors. In a classification task created to validate the representation, it achieved 98% accuracy. The hope is that this tool will allow doctors to quickly identify specific cancer subtypes from gene expression and to study the treatments given to other patients with similar gene expression profiles.
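
As a rough illustration of what a VAE optimizes, the sketch below computes the two standard ingredients of its per-sample objective in plain Python: a latent sample drawn with the reparameterization trick, and a loss made of a reconstruction term plus a KL-divergence regularizer. It is not the Keras model from this work. The function names, the toy 3-dimensional latent code, and the choice of squared error for the reconstruction term are all assumptions made for the example.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_divergence(mu, log_var):
    """KL divergence between N(mu, sigma^2) and the standard normal prior."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def reconstruction_error(x, x_hat):
    """Squared error between an input and its reconstruction (assumed loss)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, log_var):
    """Per-sample VAE objective: reconstruction error plus KL regularizer."""
    return reconstruction_error(x, x_hat) + kl_divergence(mu, log_var)


# Toy numbers only: a 3-dimensional latent code for one hypothetical patient.
rng = random.Random(0)
mu = [0.5, -1.0, 0.0]
log_var = [0.0, -0.5, 0.2]
z = sample_latent(mu, log_var, rng)
```

In a real model, an encoder network would produce `mu` and `log_var` from a patient's gene expression vector, and a decoder network would produce the reconstruction from `z`. The KL term pulls the latent codes toward a standard normal distribution, which is what distinguishes a VAE from a plain autoencoder.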


Presented by: