Making machine learning models make sense

Gemma Moran, an alum of the Eric and Wendy Schmidt Center postdoctoral fellowship program, is working to improve the interpretability and transparency of machine learning models in medicine.
Credit: Eva Woolridge
Allison Whitten
July 27, 2023

Gemma Moran will never forget how magical it felt to run her very first statistical models on genomics data during her undergraduate summer research project at the University of Sydney. Moran had initially planned to major in pure mathematics but veered away from that path towards a career in applied research after taking a few statistics courses. “I came to realize that I was much more interested in being able to apply math to real world applications and data,” she said. 

Now that she is a postdoctoral fellow with the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, Moran’s interest in using statistical models to uncover biological patterns that could improve health care has only grown stronger. These days, Moran, who is based at Columbia University’s Data Science Institute, is working to combine the rigor and intuitiveness of the simple statistical models she first learned about as an undergraduate with the flexibility and power of modern machine learning algorithms. In September, Moran will launch her own research group to pursue this direction as an assistant professor of statistics at Rutgers University.

Those who work with her are confident that her research has been and will continue to be impactful. “Gemma is a clear thinker, a careful scientist, and a fantastic collaborator to work with and learn from,” said David Blei, Moran’s postdoctoral adviser and a professor of statistics and computer science at Columbia University. “What her algorithms discover is information that we can use to help make better scientific and medical predictions, and use to help further our understanding of biology and genetics.”

As part of a project with Anthony Philippakis, co-director of the Eric and Wendy Schmidt Center and chief data officer of the Broad Institute, Moran has been using a type of machine learning algorithm called a variational autoencoder (VAE) to reveal important connections between disease symptoms that doctors may be missing. Though it’s still in its early stages, this work has the potential to affect clinical care if these algorithms discover new ways to cluster symptoms into one disease versus another, a challenging task that has long relied on doctors’ observations alone.

Revealing New Relationships

As a graduate student at the University of Pennsylvania, Moran worked on designing a method to uncover genes that are most relevant in different subtypes of breast cancer. She also developed new theoretical techniques to estimate the uncertainty present in models. During her postdoctoral fellowship in Blei’s lab, Moran developed a new method that allows researchers to better interpret the results that variational autoencoder algorithms spit out. These algorithms are masterful at paring down massive datasets into tiny summaries that contain only the most important aspects of the bigger dataset. The problem, Moran explains, is that it’s very challenging for researchers to understand exactly what parts of the original dataset are captured in the small summaries.

Moran working in her office in Columbia's Data Science Institute

To illustrate the challenge and her new fix, Moran gives the example of a large dataset filled with hundreds and hundreds of movie ratings. To create a meaningful summary with fewer data points, the variational autoencoder algorithm might divide these ratings into categories like horror, comedy, action, and science fiction. While it learns, the algorithm creates connections between the movie titles in the original dataset and its new summary output. But if left to its own devices, the algorithm will create thousands of connections that will be difficult to interpret. 

Importantly, by pruning down these connections at certain places in the network until they become sparse, Moran’s new method, named “sparse VAE,” makes it much easier to see what parts of the original data are directly linked to the smaller summary. For example, she could trace back the new “anchor points” to find that the movie “Alien” is only represented in the science fiction category of the summary, but a movie like “Everything Everywhere All At Once” might be represented in the categories of action, comedy, and science fiction. And as a rare added bonus, Moran’s new method achieves a statistical property known as identifiability. This ensures that there is only one way to interpret the model, as long as there are anchor points in the data.
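The idea of sparse connections can be illustrated with a toy sketch. This is not Moran’s implementation: the weight values, the threshold, the third movie title, and the helper names (`sparsify`, `linked_factors`, `anchor_rows`) are all hypothetical, and simple hard thresholding stands in here for the way the actual method arrives at sparsity. The sketch only shows why a sparse set of links is easier to read than a dense one, and how a movie tied to a single category stands out.

```python
import numpy as np

# Hypothetical titles and summary categories (illustrative only).
movies = ["Alien", "Everything Everywhere All At Once", "Hypothetical Comedy"]
factors = ["horror", "comedy", "action", "science fiction"]

# Hypothetical weights linking each movie (row) to each category (column).
# A dense model leaves many small nonzero links like the 0.01 entries here.
weights = np.array([
    [0.02, 0.01, 0.03, 0.95],
    [0.01, 0.60, 0.70, 0.80],
    [0.03, 0.90, 0.02, 0.01],
])

def sparsify(w, threshold=0.1):
    """Zero out weak links so only strong connections remain (a stand-in for learned sparsity)."""
    return np.where(np.abs(w) >= threshold, w, 0.0)

def linked_factors(sparse_w, names):
    """For each row, list the categories that still have a nonzero link."""
    return [[names[j] for j in np.flatnonzero(row)] for row in sparse_w]

def anchor_rows(sparse_w):
    """Return indices of rows linked to exactly one category."""
    return np.flatnonzero((sparse_w != 0).sum(axis=1) == 1)

sparse_weights = sparsify(weights)
links = linked_factors(sparse_weights, factors)
anchors = [movies[i] for i in anchor_rows(sparse_weights)]

for title, categories in zip(movies, links):
    print(f"{title}: {categories}")
print("Movies tied to a single category:", anchors)
```

In this toy setup, “Alien” keeps a single link to science fiction, while “Everything Everywhere All At Once” keeps links to comedy, action, and science fiction, mirroring the example in the text.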

After chatting with Philippakis last year about her new sparse VAE method, the two realized that it could be a great way to unearth previously unknown relationships between health symptoms in ways that would be easy for doctors and health researchers to interpret. Essentially, their project uses machine learning to improve nosology, which is the scientific field of disease classification. Until now, to classify a new disease, doctors have relied on their own expertise and experience to know which symptoms co-occur, such as blurry vision and increased urination in diabetes. They’ve also had to decide how to meaningfully differentiate these symptoms from another group of symptoms that make up a separate disease. But it’s possible that physicians haven’t noticed some co-occurring symptoms that might tell them more about disease severity, indicate a new subgroup of a disease, or require a new disease label altogether.

“What these machine learning methods are exactly designed to do is find what things travel together, and so in that way, they can help physicians see more things that travel together that they might not have noticed just by observation alone,” said Moran.

Moran stands in front of the Low Memorial Library

Moran and Philippakis are currently applying the sparse VAE method to data from 500,000 patients in the UK Biobank, which is a large patient dataset filled with detailed genetic and health information collected by researchers in the United Kingdom. They hope it may yield surprising correlations between biological signals that could improve the classification of diseases, with the goal of obtaining their first results later this year.

“I’m incredibly excited about where this line of research is headed,” said Anthony Philippakis. “In the same way that Gemma has already shown that her method can identify ‘eigen-movies’ that indicate similar classes of films, there is the opportunity to uncover ‘eigen-phenotypes’ that indicate collections of traits that are correlated with each other.”

New Job, Same Thrill

When Moran starts her own research group this fall at Rutgers University, she will continue her work on improving the interpretability and transparency of powerful machine learning algorithms applied to medical research. Her ultimate goal is to create algorithms that provide the most advantages to the health of society without propagating harmful biases against certain groups. Indeed, Moran sees this problem of bias in machine learning as one of the biggest challenges facing the field over the next ten years. 

“It’s a really crazy time to be in machine learning. There are so many developments happening at breakneck speed,” she said. “What worries me is people building these powerful [machine learning] models without necessary checks and balances and transparency and interpretability … especially applied to health care because it's such a critical domain where we could see negative consequences if we're not using these tools responsibly.”

While Moran’s goals have changed over an academic career that has taken her to opposite sides of the globe, the joy she finds in the work has remained constant. “That feeling when you've had an idea and then you code up something that works, it's just very thrilling,” she said. For Moran, that thrill becomes even more meaningful when she’s answering a question that could help actual patients. “At the end of the day, I love math and modeling and thinking about variation and how to think about data, but it's nice to connect it to real world questions.”
