Why I love Computational Biology: A Statement of Purpose

November 18, 2024

Imagine you had a book that told you how to cure cancer, and it had been sitting in your basement for 21 years. Wouldn’t you want to learn how to read it?

Unfortunately, if your book is the human genome, you would be reading for quite a while. A lot of the pages also wouldn’t make sense until you folded the book into an origami masterpiece and tore sections out according to notes left in the margins.

We’ve collected so much assay data, yet we know so little about what its genomic and transcriptomic measurements mean. Why can’t we just use computers to parse this data? What could we learn about sickness, therapeutics, and ourselves? I want to answer these questions with the help of machine learning, conducting research that maximizes impact and minimizes human suffering from disease.

Worming into Comp Bio Research

In my first year, getting a research position without experience was hard. I eventually landed a role studying C. elegans in Professor Mei Zhen’s lab, and I reluctantly took it. I thought computational biology sucked. It was “what researchers did when they couldn’t get into computer science.” But little by little I fell in love with writing code to crack the code of life.

My first research project was building an image processing and modeling pipeline to extract key metrics from fluorescence microscope recordings of the worm’s chewing and swallowing. Watching the worm swallow bacteria on rainbow-colored slides and discovering never-before-seen behaviors was profound. Even better, our model and insights could be applied to understand human neurological disorders. I learned that computational biology could help us understand the most complex biological systems and solve real problems.

Although this opened my eyes to the power of computers in biology, our lab was publishing at a glacial pace and lacked compute resources. I wanted to do research that impacted the real world and was hungry to learn more.

Getting Deep into Genomics

The next year I joined Deep Genomics as a Machine Learning Research Intern. This $500 million startup had done pioneering transformer-based genomics research with BigRNA, and it had ample compute to meet my model training ambitions. By integrating chromatin accessibility data and advances from recent literature (including papers by Dr. Anshul Kundaje), I achieved state-of-the-art variant effect prediction accuracy. To maintain model performance on other tasks, I fine-tuned the model on variant effect data from massively parallel reporter assays (MPRAs) with linear probing and parameter-efficient tuning. My work was pushed to production code and used by eight research scientists, furthering Deep Genomics’ mission to develop lifesaving drugs by understanding RNA biology.

A Foundation for the Future

I’m currently researching cell metabolites in the Hannes Rost lab, applying large language modeling techniques to train foundation models on unlabelled mass spectrometry data. We hope to understand how cells function on a molecular level and to apply this knowledge to search for new cures for various diseases.

In metabolomics, we have an abundance of mass spectra, but few that are labeled with their corresponding molecules. Our lab’s unique insight is that pretraining paradigms from multimodal models used in language and vision (e.g. CLIP) can be applied to the retrieval of mass spectrometry data. I have learned about all aspects of model development, from data preprocessing to training to sampling. I’m excited about how recent advances in generative AI can be applied to more than just chatbots and customer service agents, impacting biology and genomics research for good.

Why do I want a PhD?

To solve all diseases, stop aging, and make the world a generally better place. I love doing research and I love to win. I considered applying my competitive personality to high-frequency trading research and dunking on all the other quants, but I wanted work that wasn’t soulless. I settled on computational biology since winning against disease and suffering is good enough for me. I’m also a generally curious person who wants answers about how genes work and why machine learning models exhibit certain behaviors. Here are some specific research questions I’d pursue:

  1. Why don’t we understand how genes are regulated? Many models predict various genomic quantities, such as chromatin accessibility and RNA-seq data, each providing a different perspective on the bigger picture. My mission is to create systems that have a unified understanding of gene regulation.

  2. Why are genomics models often biased and noisy, rendering them useless in production? I’m curious to find improvements in modeling to denoise and de-bias assay data and evaluate the trade-offs of training dataset size versus quality.

  3. How should models be scored for variant effect prediction? Variant effect prediction remains one of the most effective zero-shot ways to evaluate model quality, as variants can affect any point in the regulatory chain, from chromatin structure up to splicing.

Conclusion

The human genome sequence has been available since 2003, yet large parts of it remain a mystery to modern science. I hope to learn from awesome like-minded researchers, satisfy my competitive curiosity, and apply the AI revolution to genomics for good :)