News & Updates

September 10, 2023
New machine learning techniques boost predictions for virtual drug screening with less data

Scientists using machine learning tools to analyze biomedical data often turn to neural networks, but before these models became popular, a simpler class of machine learning algorithms called kernel methods was commonly used. Kernel methods work by first applying straightforward operations to transform data and then training a simple model on the transformed data.
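To make the "transform, then fit a simple model" idea concrete, here is a minimal sketch of one common kernel method, kernel ridge regression with a Gaussian kernel. It is an illustration only: the kernel choice, the `bandwidth` and `ridge` values, and the toy sine-wave data are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Pairwise Gaussian (RBF) similarity between rows of A and rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def fit_kernel_ridge(X_train, y_train, ridge=1e-3, bandwidth=1.0):
    """Step 2: fit a simple (linear) model on the kernel-transformed data."""
    K = gaussian_kernel(X_train, X_train, bandwidth)  # step 1: transform
    return np.linalg.solve(K + ridge * np.eye(len(X_train)), y_train)

def predict_kernel_ridge(X_train, alpha, X_new, bandwidth=1.0):
    """Predictions are weighted sums of similarities to the training points."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha

# Toy data (hypothetical): learn y = sin(x) from 200 noiseless samples.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

alpha = fit_kernel_ridge(X, y)
X_test = np.linspace(-2.5, 2.5, 50)[:, None]
y_pred = predict_kernel_ridge(X, alpha, X_test)
max_error = float(np.max(np.abs(y_pred - np.sin(X_test[:, 0]))))
print(f"largest absolute error on the test grid: {max_error:.4f}")
```

The only "training" here is a single linear solve, which is part of why kernel methods are considered simpler than neural networks.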

Now, in a paper recently published in Nature Communications, researchers at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have developed a new way of using kernel methods that could make them more useful for a wider range of applications, such as virtual drug screening. They came up with the first “transfer learning” techniques for kernel methods that can be successfully applied to large-scale datasets. Transfer learning allows researchers to improve machine learning models by training them on one task in a way that enhances their performance on a second task, without having to spend the time and resources training a new model for each new task. In their paper, the team showed how their transfer learning framework allowed them to predict which drugs might be most effective in certain cancer cell lines for which little data is available. They did this by transferring knowledge from cell lines in which many drugs have already been tested.

“Before our paper, there was no transfer learning method for kernel methods that could scale to the large datasets of most interest in the biomedical field and beyond. We’ve shown for the first time that transfer learning using kernels in these settings is possible and I think that is really exciting,” said Caroline Uhler, the senior author on the paper and a Broad core institute member, co-director of the Schmidt Center at Broad, and a professor in the Department of Electrical Engineering and Computer Science as well as the Institute for Data, Systems, and Society at MIT.  

The team’s key innovation was creatively adapting transfer learning methods used in neural network algorithms so that they can be applied to kernel methods. This advance could find uses in other applications.

“Particularly for healthcare and biomedical applications, it's very hard to collect a lot of data for every question of interest. When you have very little data for a certain task but a related task has abundant data, this is exactly a setting where our method is effective,” said Adityanarayanan Radhakrishnan, a co-first author on the study, who worked on it while completing his PhD as an Eric and Wendy Schmidt Center Fellow in Uhler’s lab at Broad and MIT and is currently the George F. Carrier Postdoctoral Fellow at the Harvard School of Engineering and Applied Sciences.

Transferring knowledge

The research team focused on kernel methods because they found in a previous paper that these performed better than typical neural network models on virtual drug screening tasks. But they wanted to make it possible for researchers to quickly reuse their kernel method algorithms to identify drugs for a wide range of cancer types without having to train a new model for each new type of cancer. They realized that transfer learning techniques are necessary for this, but because existing techniques don’t work well for kernel methods, they had to come up with new ones.

They decided to take inspiration from two transfer learning techniques that work well for neural network models, which they called projection and translation. The team adapted them to work with kernel methods and then tested their approach in a virtual drug screen.

The researchers analyzed the performance of their transfer learning algorithms on two massive Broad datasets, one from the Connectivity Map (CMAP) and the other from the Cancer Dependency Map (DepMap). These datasets describe the effects of drugs on cancer cell lines across millions of drug and cell line combinations. The team trained their kernel method algorithms to predict either the genes expressed by a certain cell type after it was treated with a certain drug (using the CMAP dataset), or the proportion of cancer cells that survived after treatment with the same drug (using the DepMap dataset).

The scientists then applied their projection and translation techniques to their model so that it could complete the second task: predicting the effect of the drug on new cancer cell lines that have much less data. The projection technique corrects the model’s predictions on the second task by recognizing when prediction errors fall into categories that can be easily mapped to the right category. The translation technique fine-tunes the model by applying a correction term that shifts the model’s predictions so that they are more accurate on the second task.
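The two ideas can be sketched on a toy regression problem. This is one plausible reading of the description above, not the authors' implementation: "translation" is modeled as the source model plus a second kernel model fitted to the source model's errors on the target data, and "projection" as a second kernel model that maps the source model's outputs to target labels. The `KernelRidge` class, the sine-wave tasks, and every numeric setting are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

class KernelRidge:
    """Small kernel ridge regressor (illustrative helper, not a library class)."""
    def __init__(self, bandwidth, ridge=1e-3):
        self.bandwidth, self.ridge = bandwidth, ridge

    def fit(self, X, y):
        self.X = X
        K = gaussian_kernel(X, X, self.bandwidth)
        self.alpha = np.linalg.solve(K + self.ridge * np.eye(len(X)), y)
        return self

    def predict(self, X_new):
        return gaussian_kernel(X_new, self.X, self.bandwidth) @ self.alpha

# Hypothetical tasks: the target is the source signal shifted upward by 0.5.
def source_fn(x):
    return np.sin(4.0 * x)

def target_fn(x):
    return np.sin(4.0 * x) + 0.5

rng = np.random.default_rng(0)
X_src = rng.uniform(-3, 3, size=(300, 1))   # data-rich source task
X_tgt = np.linspace(-3, 3, 8)[:, None]      # data-poor target task
y_tgt = target_fn(X_tgt[:, 0])

source_model = KernelRidge(bandwidth=0.5).fit(X_src, source_fn(X_src[:, 0]))

# Baseline: a fresh model trained on the 8 target points only.
baseline = KernelRidge(bandwidth=0.5).fit(X_tgt, y_tgt)

# Translation: keep the source model and learn an additive correction term.
residual = y_tgt - source_model.predict(X_tgt)
correction = KernelRidge(bandwidth=0.5).fit(X_tgt, residual)

def predict_translation(X):
    return source_model.predict(X) + correction.predict(X)

# Projection: learn a map from source-model outputs to target labels.
source_outputs = source_model.predict(X_tgt)[:, None]
projector = KernelRidge(bandwidth=1.0).fit(source_outputs, y_tgt)

def predict_projection(X):
    return projector.predict(source_model.predict(X)[:, None])

X_test = np.linspace(-2.8, 2.8, 200)[:, None]
y_test = target_fn(X_test[:, 0])

def mse(pred):
    return float(np.mean((pred - y_test) ** 2))

baseline_mse = mse(baseline.predict(X_test))
translation_mse = mse(predict_translation(X_test))
projection_mse = mse(predict_projection(X_test))
print(baseline_mse, translation_mse, projection_mse)
```

In this toy setting the eight target points are too sparse to capture the oscillation on their own, so both transferred models should come out ahead of the baseline. In both cases the source model itself is left untouched and only a small second model is fitted on the scarce target data.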

The team found that their transfer learning techniques allowed their original kernel method to be successfully “transferred” to the second task, without needing to be retrained. Compared to a new model trained only on the second task, the transfer learning techniques greatly boosted the accuracy of their model in predicting the effect of drugs for new cancer cell lines. And on a common machine learning task where the team trained their kernel method algorithms to recognize images, their approach surprisingly boosted the accuracy by up to 10 percent.

Moreover, the researchers were also able to pinpoint exactly how much extra data they would need to collect to increase the performance of the model. Uhler said this could be helpful to scientists trying to decide whether it’s worthwhile to collect more data in the lab. “That's really quite exciting because you can ask ‘how much is it worth for me to have a little bit better performance of my model if I know that we’ll need to collect, say, 10 or 20 percent more data?’” said Uhler.

Beyond drug screening

Two additional advantages of kernel methods are that they provide interpretability as well as a quantification of how uncertain the model is on a given prediction. To take advantage of the interpretability aspect, the research team is working on pinning down the features of a drug that lead their model to predict that it will be effective. In addition, the research team hopes that the uncertainty estimates provided by their kernel approach will be helpful in identifying which new drug and cell line combinations should be screened experimentally for a more effective drug discovery pipeline.

They also have plans to expand their framework to other applications, such as screening cancer genes that tumors heavily depend on for survival and might be targeted with new drugs.

The team adds that their transfer learning approach for kernel methods may also open up other, unexpected applications. Because kernel methods make it easy for scientists to mathematically understand what the model is doing, they can investigate what kinds of biomedical questions will be the best fit to study. “It now gives us a more thorough or deeper understanding of transfer learning and where the power comes from, so that we can analyze which tasks it will actually work for,” said Uhler.

August 31, 2023
Schmidt Center, Helmholtz Munich launch AI and machine learning in genomics collaboration

Helmholtz Munich and the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard today announce the launch of a collaboration to bridge a gap in health research with AI and machine learning.

In the past decade, the field of genomics has accelerated to a point where we can now both measure and perturb biological systems at massive, unprecedented scales, holding huge potential for disease treatment. However, the computational tools needed to take advantage of all this data have not kept pace. By leveraging machine learning methods, the partnership between Helmholtz Munich and the Eric and Wendy Schmidt Center seeks to gain valuable insights into important genomics problems while simultaneously advancing the foundations of machine learning through novel research inspired by genomics questions.

Leading this joint initiative are Caroline Uhler, co-director of the Eric and Wendy Schmidt Center at the Broad Institute, and Fabian Theis, head of the Computational Health Center (CHC) at Helmholtz Munich and Director of Helmholtz AI. Both Caroline Uhler and Fabian Theis have backgrounds in machine learning, statistics, data science, biology, and human biology. “This exchange model between the Broad Institute and Helmholtz Munich will merge our expertise on machine learning and genomics to foster innovative ways to address major challenges in biomedical research,” said Fabian Theis.

The collaboration will encompass a range of activities, including the exchange of graduate students, postdoctoral fellows, and other research staff between the two research centers. These individuals will undertake short research stays, enabling them to benefit from the expertise and resources available at both centers. In addition, the research centers will co-organize workshops and conferences to facilitate knowledge exchange and foster collaboration in the field of AI and genomics.

“Despite an explosion in biological data, the technology sector remains the key driver of machine learning advances today,” said Caroline Uhler. “Both Helmholtz Munich and the Broad Institute are seeking to change that by developing foundations of machine learning that are geared specifically to biological problems, and we’re excited for this collaboration to amplify our efforts.”

July 27, 2023
Making machine learning models make sense

Gemma Moran will never forget how magical it felt to run her very first statistical models on genomics data during her undergraduate summer research project at the University of Sydney. Moran had initially planned to major in pure mathematics but veered away from that path towards a career in applied research after taking a few statistics courses. “I came to realize that I was much more interested in being able to apply math to real world applications and data,” she said. 

Now, as a postdoctoral fellow with the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, Moran’s interest in using statistical models to uncover biological patterns that could improve health care has only grown stronger. These days, Moran, who is based at Columbia University’s Data Science Institute, is working to combine the rigorous and intuitive nature of the simple statistical models she first learned about in undergrad with the flexibility and power of today’s modern machine learning algorithms. In September, Moran will launch her own research group to pursue this direction as an assistant professor of statistics at Rutgers University.

Those who work with her are confident that her research has been and will continue to be impactful. “Gemma is a clear thinker, a careful scientist, and a fantastic collaborator to work with and learn from,” said David Blei, Moran’s postdoctoral adviser and a professor of statistics and computer science at Columbia University. “What her algorithms discover is information that we can use to help make better scientific and medical predictions, and use to help further our understanding of biology and genetics.”

As part of a project with Anthony Philippakis, co-director of the Eric and Wendy Schmidt Center and chief data officer of the Broad Institute, Moran has been using a type of machine learning algorithm called a variational autoencoder (VAE) to reveal important connections between disease symptoms that doctors may be missing. Though it’s still in its early stages, this work has the potential to affect clinical care if these algorithms discover new ways to cluster symptoms into one disease versus another, a challenging task that has long relied on doctors’ observations alone.

Revealing New Relationships

As a graduate student at the University of Pennsylvania, Moran worked on designing a method to uncover genes that are most relevant in different subtypes of breast cancer. She also developed new theoretical techniques to estimate the uncertainty present in models. During her postdoctoral fellowship in Blei’s lab, Moran developed a new method that allows researchers to better interpret the results that variational autoencoder algorithms spit out. These algorithms are masterful at paring down massive datasets into tiny summaries that contain only the most important aspects of the bigger dataset. The problem, Moran explains, is that it’s very challenging for researchers to understand exactly what parts of the original dataset are captured in the small summaries.

Moran working in her office in Columbia's Data Science Institute

To illustrate the challenge and her new fix, Moran gives the example of a large dataset filled with hundreds and hundreds of movie ratings. To create a meaningful summary with fewer data points, the variational autoencoder algorithm might divide these ratings into categories like horror, comedy, action, and science fiction. While it learns, the algorithm creates connections between the movie titles in the original dataset and its new summary output. But if left to its own devices, the algorithm will create thousands of connections that will be difficult to interpret. 

Importantly, by pruning these connections at certain places in the network until they become sparse, Moran’s new method, named “sparse VAE,” makes it much easier to see what parts of the original data are directly linked to the smaller summary. For example, she could trace back the new “anchor points” to find that the movie “Alien” is only represented in the science fiction category of the summary, but a movie like “Everything Everywhere All At Once” might be represented in the categories of action, comedy, and science fiction. As a rare added bonus, Moran’s new method achieves a statistical property known as identifiability, which ensures that the model can be interpreted in only one way, as long as there are anchor points in the data.
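The anchor-point idea can be illustrated with a tiny sparsity mask. The sketch below is not the sparse VAE itself: the connection strengths are made-up numbers chosen to match the two movie examples above, and the hard threshold is a simple stand-in for the way the actual method learns which connections to keep.

```python
import numpy as np

factors = ["horror", "comedy", "action", "science fiction"]

# Hypothetical connection strengths between each movie and each summary factor.
weights = {
    "Alien": np.array([0.02, 0.01, 0.05, 0.90]),
    "Everything Everywhere All At Once": np.array([0.03, 0.60, 0.70, 0.50]),
}

threshold = 0.1  # assumed cutoff; weaker connections are pruned away

def linked_factors(row):
    """Names of the factors that remain connected after pruning."""
    return [factors[i] for i in np.flatnonzero(np.abs(row) > threshold)]

links = {movie: linked_factors(row) for movie, row in weights.items()}

# An anchor is a movie that stays connected to exactly one factor.
anchors = {movie: kept[0] for movie, kept in links.items() if len(kept) == 1}

print(links)
print(anchors)
```

Once most connections are removed, reading off which factor a movie belongs to becomes a simple lookup, which is the sense in which sparsity helps interpretation.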

After chatting with Philippakis last year about her new sparse VAE method, the two realized that it could be a great way to unearth previously unknown relationships between health symptoms in ways that would be easy for doctors and health researchers to interpret. Essentially, their project uses machine learning to improve nosology, which is the scientific field of disease classification. Until now, to classify a new disease, doctors have relied on their own expertise and experience to know what symptoms — like blurry vision and increased urination for diabetes — co-occur. They’ve also had to decide how to meaningfully differentiate these symptoms from another group of symptoms that comprise a separate disease. But it’s possible that physicians haven’t noticed some co-occurring symptoms that might tell them more about disease severity or indicate a new subgroup of a disease — or require a new disease label altogether. 

“What these machine learning methods are exactly designed to do is find what things travel together, and so in that way, they can help physicians see more things that travel together that they might not have noticed just by observation alone,” said Moran.

Moran stands in front of the Low Memorial Library

Moran and Philippakis are currently applying the sparse VAE method to data from 500,000 patients in the UK Biobank, which is a large patient dataset filled with detailed genetic and health information collected by researchers in the United Kingdom. They hope it may yield surprising correlations between biological signals that could improve the classification of diseases, with the goal of obtaining their first results later this year.

“I’m incredibly excited about where this line of research is headed,” said Anthony Philippakis. “In the same way that Gemma has already shown that her method can identify ‘eigen-movies’ that indicate similar classes of films, there is the opportunity to uncover ‘eigen-phenotypes’ that indicate collections of traits that are correlated with each other.”

New Job, Same Thrill

When Moran starts her own research group this fall at Rutgers University, she will continue her work on improving the interpretability and transparency of powerful machine learning algorithms applied to medical research. Her ultimate goal is to create algorithms that provide the most advantages to the health of society without propagating harmful biases against certain groups. Indeed, Moran sees this problem of bias in machine learning as one of the biggest challenges facing the field over the next ten years. 

“It’s a really crazy time to be in machine learning. There are so many developments happening at breakneck speed,” she said. “What worries me is people building these powerful [machine learning] models without necessary checks and balances and transparency and interpretability … especially applied to health care because it's such a critical domain where we could see negative consequences if we're not using these tools responsibly.”

While Moran’s goals have changed across her academic career, and her work has taken her to opposite sides of the globe, the joy she finds in the work has remained constant. “That feeling when you've had an idea and then you code up something that works, it's just very thrilling,” she said. For Moran, that thrill becomes even more meaningful when she’s answering a question that could help actual patients. “At the end of the day, I love math and modeling and thinking about variation and how to think about data, but it's nice to connect it to real world questions.”

June 6, 2023
Yue Qin named to Forbes 30 Under 30 Asia 2023

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is excited to announce that postdoctoral fellow Yue Qin was named to the Forbes 30 Under 30 Asia 2023 list this May. The Forbes 30 Under 30 lists highlight some of the most successful researchers, leaders, and entrepreneurs around the world.

Qin joined the Eric and Wendy Schmidt Center in January 2023. She is co-advised by Paul Blainey, a core member of the Broad Institute and an associate professor of biological engineering at MIT, and Caroline Uhler, co-director of the Eric and Wendy Schmidt Center. Qin's research interests lie in understanding how to read out the programs of cells from the genome. Qin uses that knowledge to create in silico cells that simulate the effect of therapeutic interventions in different disease and genetic contexts with the ultimate goal of developing personalized medicine.

Qin holds a PhD in Bioinformatics and Systems Biology and a BSc in Bioinformatics from the University of California San Diego (UCSD). As a graduate student, she was the first author on a 2021 Nature paper that developed a machine learning framework to map the structure of human cells by fusing data from protein imaging and protein biophysical interactions. Qin is a Siebel Scholar and a recipient of an NCI Predoctoral to Postdoctoral Fellow Transition Award (F99/K00) as well as the Chancellor’s Dissertation Medal within the Jacobs School of Engineering at UCSD.

“Yue embodies the type of researcher we’re excited to work with at the Eric and Wendy Schmidt Center,” said Uhler, who is also a core member of the Broad Institute and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT. “Her research is a great example of how computation and biology can go hand in hand in an age where the number of possible experiments we could perform has exploded.”

April 28, 2023
A deep (learning) dive into the roots of cancer

In a recent grant application to the National Institutes of Health, Petar Stojanov was required, among other things, to describe his “specific aims” as well as his background. It’s doubtful that the NIH reviewers would have considered Stojanov’s research agenda lacking in ambition, given its broad scope: to identify the genetic mutations that cause cancer and figure out how they cause it.

The reviewers, moreover, must have decided he had a credible chance of achieving these goals, or at least making progress toward their realization, as he was informed earlier this year that he had earned a coveted Pathway to Independence (K99) Award. As a result, Stojanov — a current Eric and Wendy Schmidt Center Postdoctoral Fellow at the Broad Institute of MIT and Harvard — will receive up to five years of research support, meaning he can devote himself fully to his scientific inquiries without having to worry about funding.

K99 grants help “outstanding” researchers transition from postdoctoral positions to running their own labs. In this next stage of his career, Stojanov will develop new methods in two types of machine learning: algorithms related to causality and deep generative models.

An early interest in computational biology

In some sense, Stojanov set off on the path that led him to this milestone when he was a high school student in Macedonia. A family friend told him that computational biology was becoming a hot area in science. Stojanov was immediately intrigued, he said, “for the same reason that has brought many people to this field — math and biology were my favorite subjects.” And here was a chance to combine his preferred disciplines into a unified course of study that might lead to an interesting career.

He spent his senior year of high school in Pelham, New York (where he lived with his family friend), as he’d always believed he “would have the best opportunities for innovation in the U.S.” A year later, he enrolled in Bard College, which had no courses, let alone a major, in computational biology. Stojanov stuck to his passion, nevertheless, taking the bulk of his classes in computer science, biology, mathematics, and chemistry. He gained hands-on experience in computational biology through summer research programs at George Washington University and the University of Maryland.

Stojanov on his way to work at the Broad Institute

After graduating from Bard in 2010, he took a job in the laboratory of Gaddy Getz, director of the Broad’s Cancer Genome Computational Analysis Group. That’s where Stojanov got started on the two-pronged research track he’s still pursuing today: First, to figure out which mutations are present in cancerous tissue and, second, to determine which of those mutations actually spur our cells to multiply out of control and drive cancer. The standard approach at the time was to rely on statistical methodology, such as examining whether the number of mutations in a given gene was greater than would be expected from random processes, unrelated to cancer.
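The counting approach described above can be written as a short significance test. The sketch below assumes the simplest possible background model, a Poisson distribution with a single uniform mutation rate, and all numbers (rate, gene length, cohort size, observed count) are hypothetical. Methods used in practice model the background in far more detail.

```python
import math

def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    term = math.exp(-expected)  # P(X = 0)
    cdf_below = 0.0             # running sum of P(X = 0..observed-1)
    for i in range(observed):
        cdf_below += term
        term *= expected / (i + 1)  # step from P(X = i) to P(X = i + 1)
    return max(0.0, 1.0 - cdf_below)

# Hypothetical numbers for one gene across a patient cohort.
background_rate = 1e-6   # assumed mutations per base per patient
gene_length = 2000       # bases
n_patients = 500
observed_mutations = 8

expected_mutations = background_rate * gene_length * n_patients
p_value = poisson_tail(observed_mutations, expected_mutations)
print(f"expected {expected_mutations:.1f} mutations, observed {observed_mutations}, "
      f"p = {p_value:.2e}")
```

A small p-value says the gene is mutated more often than chance alone would explain under this background model. It does not say what those mutations do.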

Stojanov spent four productive years at the Broad, coauthoring more than a dozen papers, on four of which he was a lead author. He didn’t sleep much those days, mainly because he was “hungry for projects and never said no to an opportunity.” Yet, by the end of that tenure, he felt that his work in this area could benefit from additional training in computer science, which would enable him to bring new tools to the kinds of problems he’d been grappling with. In 2014, he entered a PhD program at Carnegie Mellon University, where he immersed himself in machine learning techniques and other emerging approaches in artificial intelligence. Although his graduate research had nothing to do with biology, he recognized that the methods he was learning, combined with statistics, might lead to breakthroughs in his previous cancer investigations.

Bringing ML to bear on cancer research

Stojanov returned to the Broad in 2021 and picked up in the Getz lab where he had left off — this time ready to unleash the full power of AI. Getz was eager to have him back, touting “the unique set of skills that Petar has,” given his prior experience in cancer research and his recently strengthened background in computer science. “And now,” Getz said, “he’s applying his expertise in machine learning to the search for the drivers of cancer.”

Just counting the number of mutations in a gene is not enough to reveal the mechanisms underpinning cancer, Stojanov explained. “That may tell you which mutations are most prevalent, and maybe the most important, but it still doesn’t tell you what they do.” To understand how a mutation affects a gene, you have to look at gene expression, the cellular process by which the information encoded in a gene is used to create proteins.

In his latest work at the Broad, Stojanov is focusing on two variables: gene mutations, which can be gleaned from DNA sequencing data, and gene expression, which can be obtained from RNA sequencing data by measuring the amount of RNA (a gene-decoding molecule) in the cell. He then uses a set of machine learning tools called causal inference and discovery algorithms to uncover the “causal relationships” between these two variables.

“The idea is to show that some aspects of gene expression are the consequences of mutations,” he said.  

The only causal relationships he cares about are those associated with cancer. While sorting through DNA and RNA sequencing data from thousands of cancer patients, he’s looking for patterns. In particular, he said, “we might find mutations that influence patients with the same cancer type (or subtype), in the same way.”

Stojanov in his office with colleagues Pinar Eser (center) and Tim Coorens

As an intermediate step, Stojanov relies on a related class of machine learning-based tools, so-called deep generative models, which take abstract (“high-dimensional”) information processed by computers and represent it in a form that is meaningful to humans. If you have mutation and expression data for 20,000 genes, he said, these models offer a way to summarize that vast amount of data in terms of the concepts you’re interested in, such as biological processes or cell subtypes that might be impacted by cancer.

The ultimate goal is to learn as much as possible about this multifaceted disease — how and where it starts and progresses. “To really understand what’s going on,” Stojanov said, “we need an interpretable map that shows which processes are affected by what mutations.”

Existing techniques can only get you so far

Eric and Wendy Schmidt Center co-director Caroline Uhler is excited by the prospect of “getting at the causal genes, which contain the mutations that drive cancer.” “Once you have that,” she said, “you’re in a much better position to think about effective therapies. That’s really the promise of this work.”

Stojanov’s current research is, admittedly, at an early stage. He has a solid base of experience to draw on, and he’s picked out a set of tools, in the form of machine learning algorithms, that are poised to advance our knowledge base. The big challenge, Uhler pointed out, is that “existing techniques can only get you so far. Petar has to build on these methods and develop new algorithms in order to solve the important biological questions he plans to address.”

Stojanov is mindful of the hard work ahead and grateful that his burden has been eased by having several years of funding already secured. “This [K99] award gives you the ultimate amount of independence you can have as a postdoc,” he said.

When asked if getting the award is the best thing that could happen to someone in his position, embarking on such an ambitious enterprise, he replied, “Well, it’s certainly up there.”

April 28, 2023
Machine learning model finds genetic factors for heart disease

To get an inside look at the heart, cardiologists often use electrocardiograms (ECGs) to trace its electrical activity and magnetic resonance images (MRIs) to map its structure. Because the two types of data reveal different details about the heart, physicians typically study them separately to diagnose heart conditions.

Now, in a paper published in Nature Communications, scientists in the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have developed a machine learning approach that can learn patterns from ECGs and MRIs simultaneously, and based on those patterns, predict characteristics of a patient’s heart. Such a tool, with further development, could one day help doctors better detect and diagnose heart conditions from routine tests such as ECGs.

The researchers also showed that they could analyze ECG recordings, which are easy and cheap to acquire, and generate MRI movies of the same heart, which are much more expensive to capture. And their method could even be used to find new genetic markers of heart disease that existing approaches that look at individual data modalities might miss.

Overall, the team said their technology is a more holistic way to study the heart and its ailments. “It is clear that these two views, ECGs and MRIs, should be integrated because they provide different perspectives on the state of the heart,” said Caroline Uhler, a co-senior author on the study, a Broad core institute member, co-director of the Schmidt Center at Broad, and a professor in the Department of Electrical Engineering and Computer Science as well as the Institute for Data, Systems, and Society at MIT.

"As a field, cardiology is fortunate to have many diagnostic modalities, each providing a different view into cardiac physiology in health and diseases. A challenge we face is that we lack systematic tools for integrating these modalities into a single, coherent picture,” said Anthony Philippakis, a senior co-author on the study and chief data officer at Broad and co-director of the Schmidt Center. “This study represents a first step towards building such a multi-modal characterization."

Model making

To develop their model, the researchers used a machine learning algorithm called an autoencoder, which automatically integrates gigantic swaths of data into a concise representation – a simpler form of the data. The team then used this representation as input for other machine learning models that make specific predictions.

In their study, the team first trained their autoencoder using ECGs and heart MRIs from participants in the UK Biobank. They fed in tens of thousands of ECGs, each paired with MRI images from the same person. The algorithm then created shared representations that captured crucial details from both types of data.

“Once you have these representations, you can use them for many different applications,” said Adityanarayanan Radhakrishnan, a co-first author on the study, an Eric and Wendy Schmidt Center Fellow at the Broad, and a graduate student at MIT in Uhler’s lab. Sam Friedman, a senior machine learning scientist in the Data Sciences Platform at the Broad, is the other co-first author.

One of those applications is predicting heart-related traits. The researchers used the representations created by their autoencoders to build a model that could predict a range of traits, including features of the heart like the weight of the left ventricle, other patient characteristics related to heart function like age, and even heart disorders. Moreover, their model outperformed more standard machine learning approaches, as well as autoencoder algorithms that were trained on just one of the imaging modalities.

“What we showed here is that you get better prediction accuracy if you incorporate multiple types of data,” Uhler said.

Radhakrishnan explained that their model made more accurate predictions because it used representations that had been trained on a much larger dataset. Autoencoders don’t require data that have been labeled by humans, so the team could feed their autoencoder with around 39,000 unlabeled pairs of ECGs and MRI images, rather than just around 5,000 labeled pairs.

The researchers demonstrated another application of their autoencoder: generating new MRI movies. By inputting an individual’s ECG recording into the model — without a paired MRI recording — the model produced the predicted MRI movie for the same person.

With more work, the scientists envision that such technology could potentially allow physicians to learn more about a patient’s heart health from just ECG recordings, which are routinely collected at doctors’ offices.

Broader gene search

The team realized they could also use their autoencoder representations to look for genetic variants associated with heart disease. The traditional method of finding genetic variants for a disease, called a genome-wide association study (GWAS), requires genetic data from individuals who have been labeled with the disease of interest.

But because the team’s autoencoder framework doesn’t require labeled data, they were able to generate representations that reflected the overall state of a patient’s heart. Using these representations and genetic data on the same patients from the UK Biobank, the researchers created a model that looked for genetic variants that impact the state of the heart in more general ways. The model produced a list of variants including many of the known variants related to heart disease and some new ones that can now be investigated further.
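
As a rough illustration of searching for variants against a learned representation rather than a disease label, the Python sketch below correlates genotype dosage (0, 1, or 2 copies of a variant) with one latent coordinate and ranks variants by the strength of that correlation. The variant names, genotypes, and latent values are invented, and plain Pearson correlation is a simplifying assumption; a real association study would use regression with covariates and genome-wide significance thresholds.

```python
# Toy association scan: genotype dosage vs. one latent coordinate.
# Variant IDs, genotypes, and latent values are invented for illustration.
from math import sqrt

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# One latent coordinate per person, e.g. taken from an autoencoder.
latent = [0.1, 0.4, 0.9, 1.1, 1.6, 2.0]

# Copies of each variant carried by the same six people.
genotypes = {
    "variant_A": [0, 0, 1, 1, 2, 2],
    "variant_B": [1, 0, 2, 0, 1, 2],
}

scores = {name: abs(pearson(g, latent)) for name, g in genotypes.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```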

Radhakrishnan said that genetic discovery could be the area in which the autoencoder framework, with more data and development, could have the most impact – not just for heart disease, but for any disease. The research team is already working on applying their autoencoder framework to study neurological diseases.

Uhler said this project is a good example of how innovations in biomedical data analysis emerge when machine learning researchers collaborate with biologists and physicians. “An exciting aspect about getting machine learning researchers interested in biomedical questions is that they might come up with a completely new way of looking at a problem.”

Support for the research was provided in part by the Eric and Wendy Schmidt Center at the Broad Institute, the National Science Foundation, the Office of Naval Research, the MIT-IBM Watson AI Lab, a Simons Investigator Award, the National Institutes of Health, and the American Heart Association.

Adapted from a news story posted on the Broad Institute website.

March 30, 2023
A method for designing neural networks optimally suited for certain tasks

Neural networks, a type of machine-learning model, are being used to help humans complete a wide variety of tasks, from predicting if someone’s credit score is high enough to qualify for a loan to diagnosing whether a patient has a certain disease. But researchers still have only a limited understanding of how these models work. Whether a given model is optimal for a certain task remains an open question.

MIT researchers have found some answers. They conducted an analysis of neural networks and proved that they can be designed so they are “optimal,” meaning they minimize the probability of misclassifying borrowers or patients into the wrong category when the networks are given a lot of labeled training data. To achieve optimality, these networks must be built with a specific architecture.

The researchers discovered that, in certain situations, the building blocks that enable a neural network to be optimal are not the ones developers use in practice. These optimal building blocks, derived through the new analysis, are unconventional and haven’t been considered before, the researchers say.

In a paper published this week in the Proceedings of the National Academy of Sciences, they describe these optimal building blocks, called activation functions, and show how they can be used to design neural networks that achieve better performance on any dataset. The results hold even as the neural networks grow very large. This work could help developers select the correct activation function, enabling them to build neural networks that classify data more accurately in a wide range of application areas, explains senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS) and co-director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard.

“While these are new activation functions that have never been used before, they are simple functions that someone could actually implement for a particular problem. This work really shows the importance of having theoretical proofs. If you go after a principled understanding of these models, that can actually lead you to new activation functions that you would otherwise never have thought of,” says Uhler, who is a core institute member of the Broad Institute, and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS) and Institute for Data, Systems and Society (IDSS).

Joining Uhler on the paper are lead author Adityanarayanan Radhakrishnan, an EECS graduate student and an Eric and Wendy Schmidt Center Fellow, and Mikhail Belkin, a professor in the Halicioğlu Data Science Institute at the University of California at San Diego.

Activation investigation

A neural network is a type of machine-learning model that is loosely based on the human brain. Many layers of interconnected nodes, or neurons, process data. Researchers train a network to complete a task by showing it millions of examples from a dataset.

For instance, a network that has been trained to classify images into categories, say dogs and cats, is given an image that has been encoded as numbers. The network performs a series of complex multiplication operations, layer by layer, until the result is just one number. If that number is positive, the network classifies the image as a dog, and if it is negative, as a cat.

Activation functions help the network learn complex patterns in the input data. They do this by applying a transformation to the output of one layer before data are sent to the next layer. When researchers build a neural network, they select one activation function to use. They also choose the width of the network (how many neurons are in each layer) and the depth (how many layers are in the network).
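
The Python sketch below shows how these pieces fit together in a forward pass. It is only an illustration under stated assumptions: the weights are random and untrained, ReLU stands in as one commonly used activation function, depth is counted as the number of hidden layers, and the helper names `build_network`, `forward`, and `classify` are hypothetical.

```python
import math
import random

def relu(x):
    """A commonly used activation function: zero out negative values."""
    return max(0.0, x)

def build_network(n_inputs, width, depth, seed=0):
    """Random weights for `depth` hidden layers of `width` neurons, plus one output neuron."""
    rng = random.Random(seed)
    sizes = [n_inputs] + [width] * depth + [1]
    return [
        [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

def forward(network, x, activation=relu):
    """Multiply layer by layer, applying the activation between layers."""
    for i, layer in enumerate(network):
        x = [sum(w * v for w, v in zip(row, x)) for row in layer]
        if i < len(network) - 1:  # no activation on the final output
            x = [activation(v) for v in x]
    return x[0]  # the network's single output number

def classify(network, x):
    """Positive output means dog; otherwise cat."""
    return "dog" if forward(network, x) > 0 else "cat"

net = build_network(n_inputs=4, width=8, depth=3)
print(classify(net, [0.2, -0.1, 0.5, 0.3]))
```

Swapping the `activation` argument for a different function is all it takes to change the network's building block, which is the design choice the researchers analyzed.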

“It turns out that, if you take the standard activation functions that people use in practice, and keep increasing the depth of the network, it gives you really terrible performance. We show that if you design with different activation functions, as you get more data, your network will get better and better,” says Radhakrishnan.

He and his collaborators studied a situation in which a neural network is infinitely deep and wide — which means the network is built by continually adding more layers and more nodes — and is trained to perform classification tasks. In classification, the network learns to place data inputs into separate categories.

“A clean picture”

After conducting a detailed analysis, the researchers determined that there are only three ways this kind of network can learn to classify inputs. One method classifies an input based on the majority of inputs in the training data; if there are more dogs than cats, it will decide every new input is a dog. Another method classifies by choosing the label (dog or cat) of the training data point that most resembles the new input.

The third method classifies a new input based on a weighted average of all the training data points that are similar to it. Their analysis shows that this is the only method of the three that leads to optimal performance. They identified a set of activation functions that always use this optimal classification method.
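
The three behaviors can be sketched as simple classifiers in Python. This is an illustration on invented two-dimensional points, not the paper's construction: in particular, the inverse-distance weighting in `weighted_average` (and its `power` parameter) is an assumed stand-in for whatever weighting the analysis actually derives.

```python
from collections import Counter
from math import dist

def majority_vote(train, x):
    """Ignore x and return the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def nearest_neighbor(train, x):
    """Return the label of the single closest training point."""
    return min(train, key=lambda item: dist(item[0], x))[1]

def weighted_average(train, x, power=2.0):
    """Average labels (+1 dog, -1 cat), weighting nearby points more heavily."""
    total = 0.0
    for point, label in train:
        d = dist(point, x)
        if d == 0.0:
            return label  # x coincides with a training point
        total += (1.0 if label == "dog" else -1.0) / d ** power
    return "dog" if total > 0 else "cat"

# Invented training data: three cats near the origin, two dogs farther away.
train = [
    ((0.0, 0.0), "cat"), ((0.0, 1.0), "cat"), ((1.0, 0.0), "cat"),
    ((5.0, 5.0), "dog"), ((5.0, 6.0), "dog"),
]
x = (4.0, 4.0)
print(majority_vote(train, x), nearest_neighbor(train, x), weighted_average(train, x))
```

On this toy data the majority rule answers "cat" for every input because cats outnumber dogs, while the other two rules take the location of the new point into account.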

“That was one of the most surprising things — no matter what you choose for an activation function, it is just going to be one of these three classifiers. We have formulas that will tell you explicitly which of these three it is going to be. It is a very clean picture,” he says.

They tested this theory on several classification benchmarking tasks and found that it led to improved performance in many cases. Neural network builders could use their formulas to select an activation function that yields improved classification performance, Radhakrishnan says.

In the future, the researchers want to use what they’ve learned to analyze situations where they have a limited amount of data and for networks that are not infinitely wide or deep. They also want to apply this analysis to situations where data do not have labels.

“In deep learning, we want to build theoretically grounded models so we can reliably deploy them in some mission-critical setting. This is a promising approach at getting toward something like that — building architectures in a theoretically grounded way that translates into better results in practice,” he says.

This work was supported, in part, by the National Science Foundation, Office of Naval Research, the MIT-IBM Watson AI Lab, the Eric and Wendy Schmidt Center at the Broad Institute, and a Simons Investigator Award.

Adapted from a news story posted on MIT News.

March 28, 2023
Machine learning experts from around the world compete to improve cancer immunotherapy

Marios Gavrielatos had never participated in a machine learning competition when he decided to enter the Eric and Wendy Schmidt Center’s Cancer Immunotherapy Data Science Grand Challenge.

Gavrielatos’ friend and colleague, Konstantinos Kyriakidis, asked him to team up in the competition after learning about it from a promotional video on YouTube.

Despite Gavrielatos’ newcomer status, the pair developed a new deep learning model that won them the first part of the competition last month.

The challenge “helped me develop new computational skills, deep-learning wise,” said Gavrielatos, a bioinformatics master’s student at the National and Kapodistrian University of Athens, adding that because they couldn’t find similar problems online, “we had to develop something new ourselves, which was interesting.”

The Cancer Immunotherapy Data Science Grand Challenge, which ran on Topcoder from January 9 to February 3, aimed to uncover new ways to modify, or “perturb,” T cells to make them more effective at killing cancer cells to ultimately improve cancer treatment.

Top challenge submissions will be tested out in a lab at the Broad Institute of MIT and Harvard later this year.

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard partnered with Harvard’s Laboratory for Innovation Science, the MIT Department of Electrical Engineering and Computer Science, Topcoder, Gordian Biotechnology, and Massachusetts General Hospital (MGH) to run the challenge. Over 900 people registered for the first part of the competition — making it Topcoder’s fifth-largest data science challenge to date.

“In biology, we can perform perturbations on a scale that other fields can only dream of, meaning we need to develop novel machine learning methods to best make use of such data and answer biological questions,” said Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT. “We held this data science challenge to direct bright computational minds from around the world to this problem in cancer immunotherapy. And we’re thrilled that we now get to test out some of their proposed perturbations experimentally.”

A great fit for a data science challenge

While chemotherapy and radiation have saved many lives, these treatments have a weak spot: they are not specific enough — meaning they can kill cancerous and healthy cells. The promise of cancer immunotherapy, a newer and effective form of cancer treatment, is that it can harness our immune system to recognize and kill cancer cells while leaving other cells alone in most cases.

Cancer cells have developed a number of ways to evade our immune system. One such strategy is sending signals to T cells to make them exhausted and ineffective at killing cancer cells. That’s why cancer researchers like Nir Hacohen, an institute member at Broad, director of the Broad Institute’s Cell Circuits Program and director of the Center for Cancer Immunology at Mass General Hospital, are investigating whether perturbing certain genes could shift T cells to a cancer-fighting, “effector” state.

“We were excited to develop this data science challenge with the Eric and Wendy Schmidt Center because the T cell exhaustion problem seemed like a great fit for this kind of competition,” said Hacohen. “It was an opportunity to combine our cancer biology and immunology knowledge with the computational and mathematical skills of machine learning experts from all over the world.”

Marc Schwartz, a postdoctoral fellow in the Hacohen Lab, ran experiments testing the effects of 73 gene knockouts in T cells on mice with cancer. A gene knockout is a genetic perturbation that stops a gene from functioning. Given that it took months to test a fraction of the 20,000 potential gene knockouts, Broad researchers wanted a way to zero in on the most promising perturbations. Enter machine learning.

The overarching challenge was divided into three parts that ran as individual data science competitions on Topcoder. In Challenge 1, participants received gene expression data from 66 of the 73 T-cell-gene knockouts from Schwartz’s experiments as training data. They then had to develop an algorithm that could predict how knocking out the seven “held-out” genes would affect T cells.
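
The Challenge 1 setup is a standard hold-out evaluation, which can be sketched in Python as follows. The gene names are placeholders, the random choice of held-out genes is an assumption (the organizers selected the real ones), and `proportion_error` is a hypothetical scoring helper that compares predicted and observed fractions of T cells in each state; it is not presented as the competition's official metric.

```python
import random

def split_knockouts(genes, n_held_out, seed=0):
    """Set aside some knockouts for testing; the rest are training data."""
    rng = random.Random(seed)
    held_out = set(rng.sample(genes, n_held_out))
    train = [g for g in genes if g not in held_out]
    return train, sorted(held_out)

def proportion_error(predicted, observed):
    """Total absolute difference between two vectors of cell-state fractions."""
    return sum(abs(p - o) for p, o in zip(predicted, observed))

genes = [f"gene_{i:02d}" for i in range(73)]  # placeholder names, not the real knockouts
train, held_out = split_knockouts(genes, n_held_out=7)
print(len(train), len(held_out))

# Hypothetical prediction vs. observation for one held-out knockout
# (fractions of cells in three states; invented numbers).
print(proportion_error([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
```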

Challenge 2 participants used their algorithms from the first challenge to propose new gene knockouts (picking from any of the 20K genes in the entire genome) to shift as many T cells as possible into a cancer-fighting state. In Challenge 3, participants proposed a metric for ranking how well a particular gene knockout would bring about this desired shift in T cells.

To solve Challenge 1, winners Gavrielatos and Kyriakidis first pared down the single-cell dataset so that it contained only expression information from important genes, that is, genes whose expression changed across different T cell states. Preprocessing the data is a crucial step to distill the “signal,” or useful information, when working with such noisy data, said Kyriakidis, who has previously won several precisionFDA data science challenges.

The pair next trained a deep learning model to predict what portion of T cells would move into an effector, exhausted, or alternate state after a specific gene was knocked out. Initially, they tried to come up with an algorithm using only the training data provided from Schwartz’s experiment. But as they continued working, they realized that incorporating public biomedical databases into their analysis — namely, Reactome, a database of biological pathways in human cells, and STRING, a protein interaction database — could reveal associations between the missing and observed genes.

“The whole process was so rewarding,” said Kyriakidis. “You have to divide the whole problem into smaller parts to try to find the solution to each part and connect the dots.”

Sometimes, simple algorithms are best

The second-place winners were a team of four from MIT: two graduate students from the Laboratory for Information and Decision Systems (LIDS), Yuzhou Gu and Anzo Teh; MIT Institute for Data, Systems, and Society (IDSS) postdoc Yanjun Han; and undergraduate student Brandon Wang. Teh, who is also an Eric and Wendy Schmidt Center PhD fellow, said his advisor, MIT professor Yury Polyanskiy, suggested that he and the other researchers join forces for the challenge.

Teh, Gu, and Han have a theoretical and computational background, specifically in information theory, while the undergraduate student, Brandon Wang, has expertise in computational biology.

“I did feel like this challenge was a good way for me to learn how to work on these types of problems because I’m pretty new to the biology field,” said Teh.

Several teams used neural networks to describe the experimental gene expression data, an approach that often requires thousands of parameters to create an effective model. The MIT team, on the other hand, made a simplifying assumption that gene expression could be modeled with a small number of parameters following a Gaussian distribution, or a bell curve.

They then reduced the dimensions of their data from 20,000 to 50 columns using a machine learning technique called “principal component analysis.” The MIT team also incorporated an outside public database on human genes into their model, mapping human gene expression profiles to their missing mouse counterparts. Finally, they used a proven machine learning classification algorithm to determine how the gene expression profiles lined up with T cell states.
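
For readers unfamiliar with principal component analysis, the Python sketch below shows the dimension-reduction step on a toy scale, going from 3 columns to 1 rather than from 20,000 to 50. It uses power iteration with deflation, which is one simple way to compute principal components and an assumption here; the source does not say how the team computed theirs, and the function names are hypothetical. It also assumes `k` does not exceed the rank of the centered data.

```python
import random
from math import sqrt

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def center(rows):
    """Subtract each column's mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def top_component(rows, iters=100, seed=0):
    """Power iteration on X^T X, computed as X^T (X v) without forming the matrix."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]                                   # X v
        v = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(d)]  # X^T (X v)
        norm = sqrt(dot(v, v))
        v = [a / norm for a in v]
    return v

def pca(rows, k):
    """Return the top-k principal directions and the data projected onto them."""
    centered = center(rows)
    residual = centered
    components = []
    for _ in range(k):
        v = top_component(residual)
        scores = [dot(r, v) for r in residual]
        # Deflate: remove the direction just found before searching for the next.
        residual = [[a - s * b for a, b in zip(r, v)] for r, s in zip(residual, scores)]
        components.append(v)
    reduced = [[dot(r, v) for v in components] for r in centered]
    return components, reduced

# Toy data: three columns that all vary along a single direction.
toy = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
components, reduced = pca(toy, k=1)
print(components[0], reduced)
```

The reduced table has one row per sample and `k` columns, which is the form a downstream classifier would then consume.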

“Sometimes simple algorithms can work better than neural networks,” said Teh. The MIT team’s background in information theory, which is the study of organizing and quantifying data, helped them discover what signals in the experimental data to focus their models on.

Peter Novotný, the third-place winner and a math professor at the University of Žilina in Slovakia, also took a relatively simple approach to solving Challenge 1. Novotný, a former Topcoder “copilot” who had participated in a NASA asteroid-hunter challenge, among many other competitions, has more of a mathematics than a computer science background. In part through participating in data science challenges, he has discovered that he enjoys machine learning.

“And, I also quite like competing,” he said.

For the cancer immunotherapy challenge, Novotný first selected 14 features from the T cell data that quantified how gene expression levels differed between perturbed and unperturbed cells, and used them to represent his training data. Then, he built a model using a common machine learning algorithm, the “random forest,” and predicted the distribution of T cell states for each of the seven withheld genes.

To make the challenge accessible to participants without a biology background, Lightmark Creative and Orr Ashenberg, associate director of computational biology at The Klarman Cell Observatory of the Broad Institute, produced a 1.5-hour crash course on cancer biology, perturbation data, and single-cell sequencing technologies.

“To compete in this contest, you really need to understand what the data is, and without those lectures, it would be quite difficult to understand the problem,” said Novotný.

In addition, Uhler held an Independent Activities Period (IAP) course that ran at the same time as the challenge, encouraging MIT students to team up and participate in the competition.

Testing perturbations in the lab

The Eric and Wendy Schmidt Center also announced last month who won the third challenge, in which participants came up with a metric to rank new T cell perturbations.

The winners of that challenge were:

  • First place: Dariusz Brzeziński and Wojciech Kotlowski from Poznań University of Technology in Poland
  • Second place: Salil Bhate, MIT, postdoctoral fellow at the Eric and Wendy Schmidt Center
  • Third place: Irene Bonafonte Pardàs, Artur Szalata, and Benjamin Schubert from Helmholtz Center Munich and Miriam Lyzotte from Mila - Quebec AI Institute

Now, researchers at the Hacohen Lab will run experiments to test how the perturbations proposed in Challenge 2 affect mouse T cells’ cancer-fighting abilities.

“It will be really exciting to see how these computationally identified perturbations actually perform in the lab,” said Uhler. “After all, machine learning cannot replace experiments, but the goal is to work hand in hand with biologists and help prioritize the next experiments to run.”

January 20, 2023
Researchers develop an AI model that can detect future lung cancer risk

The name Sybil has its origins in the oracles of Ancient Greece, also known as sibyls: feminine figures who were relied upon to relay divine knowledge of the unseen and the omnipotent past, present, and future. Now, the name has been excavated from antiquity and bestowed on an artificial intelligence tool for lung cancer risk assessment being developed by researchers at MIT's Abdul Latif Jameel Clinic for Machine Learning in Health, Mass General Cancer Center (MGCC), and Chang Gung Memorial Hospital (CGMH).

Lung cancer is the No. 1 deadliest cancer in the world, resulting in 1.7 million deaths worldwide in 2020, killing more people than the next three deadliest cancers combined.

“It’s the biggest cancer killer because it’s relatively common and relatively hard to treat, especially once it has reached an advanced stage,” says Florian Fintelmann, MGCC thoracic interventional radiologist and co-author on the new work. “In this case, it’s important to know that if you detect lung cancer early, the long-term outcome is significantly better. Your five-year survival rate is closer to 70 percent, whereas if you detect it when it’s advanced, the five-year survival rate is just short of 10 percent.”

Although there has been a surge in new therapies introduced to combat lung cancer in recent years, the majority of patients with lung cancer still succumb to the disease. Low-dose computed tomography (LDCT) scans of the lung are currently the most common way patients are screened for lung cancer with the hope of finding it in the earliest stages, when it can still be surgically removed. Sybil takes the screening a step further, analyzing the LDCT image data without the assistance of a radiologist to predict the risk of a patient developing a future lung cancer within six years.

In their new paper published in the Journal of Clinical Oncology, Jameel Clinic, MGCC, and CGMH researchers demonstrated that Sybil obtained C-indices of 0.75, 0.81, and 0.80 over the course of six years from diverse sets of lung LDCT scans taken from the National Lung Screening Trial (NLST), Mass General Hospital (MGH), and CGMH, respectively. Models achieving a C-index over 0.7 are considered good, and those over 0.8 are considered strong. The ROC-AUCs for one-year prediction using Sybil scored even higher, ranging from 0.86 to 0.94, with 1.00 being the highest score possible.
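
For context on what a C-index measures, here is a short Python sketch of one common definition (Harrell's concordance index): among pairs of patients whose outcomes can be ordered, it is the fraction in which the patient who developed cancer sooner was also given the higher risk score. The patient data are invented, and the paper's exact estimator may differ from this one.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index.

    times[i]  : follow-up time for patient i
    events[i] : True if cancer was observed at times[i], False if censored
    risks[i]  : model risk score (higher means higher predicted risk)
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event in a pair
        for j in range(n):
            if times[i] < times[j]:  # patient i's event came before j's last follow-up
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable

# Invented example: two patients develop cancer (years 1 and 2), three do not.
times = [1, 2, 3, 6, 6]
events = [True, True, False, False, False]
risks = [0.9, 0.3, 0.5, 0.1, 0.2]
c_index = concordance_index(times, events, risks)
print(round(c_index, 3))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of patients by risk.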

Despite its success, the 3D nature of lung CT scans made Sybil a challenge to build. Co-author Peter Mikhael, an MIT PhD student in electrical engineering and computer science, a fellow at the Eric and Wendy Schmidt Center, and an affiliate at the Jameel Clinic and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), likened the process to “trying to find a needle in a haystack.” The imaging data used to train Sybil was largely absent of any signs of cancer because early-stage lung cancer occupies small portions of the lung — just a fraction of the hundreds of thousands of pixels making up each CT scan. Denser portions of lung tissue are known as lung nodules, and while they have the potential to be cancerous, most are not, and can occur from healed infections or airborne irritants.  

To ensure that Sybil would be able to accurately assess cancer risk, Fintelmann and his team labeled hundreds of CT scans with visible cancerous tumors that would be used to train Sybil before testing the model on CT scans without discernible signs of cancer.

MIT electrical engineering and computer science PhD student Jeremy Wohlwend, co-author of the paper and Jameel Clinic and CSAIL affiliate, was surprised by how highly Sybil scored despite the lack of any visible cancer. “We found that while we [as humans] couldn’t quite see where the cancer was, the model could still have some predictive power as to which lung would eventually develop cancer,” he recalls. “Knowing [Sybil] was able to highlight which side was the most likely side was really interesting to us.”

Co-author Lecia V. Sequist, a medical oncologist, lung cancer expert, and director of the Center for Innovation in Early Cancer Detection at MGH, says the results the team achieved with Sybil are important “because lung cancer screening is not being deployed to its fullest potential in the U.S. or globally, and Sybil may be able to help us bridge this gap.”

Lung cancer screening programs are underdeveloped in regions of the United States hardest hit by lung cancer due to a variety of factors. These range from stigma against smokers to political and policy landscape factors like Medicaid expansion, which varies from state to state.

Moreover, many patients diagnosed with lung cancer today have either never smoked or are former smokers who quit over 15 years ago, traits that make both groups ineligible for lung cancer CT screening in the United States.

“Our training data consisted only of smokers because this was a necessary criterion for enrolling in the NLST,” Mikhael says. “In Taiwan, they screen nonsmokers, so our validation data is expected to contain people who didn’t smoke, and it was exciting to see Sybil generalize well to that population.”

“An exciting next step in the research will be testing Sybil prospectively on people at risk for lung cancer who have not smoked or who quit decades ago,” says Sequist. “I treat such patients every day in my lung cancer clinic and it’s understandably hard for them to reconcile that they would not have been candidates to undergo screening. Perhaps that will change in the future.”

There is a growing population of patients with lung cancer who are categorized as nonsmokers. Women nonsmokers are more likely to be diagnosed with lung cancer than men who are nonsmokers. Globally, over 50 percent of women diagnosed with lung cancer are nonsmokers, compared to 15 to 20 percent of men.

MIT Professor Regina Barzilay, a paper co-author and the Jameel Clinic AI faculty lead, who is also a member of the Koch Institute for Integrative Cancer Research, credits MIT and MGH’s joint efforts on Sybil to Sylvia, the sister to a close friend of Barzilay and one of Sequist’s patients. “Sylvia was young, healthy and athletic; she never smoked,” Barzilay recalls. “When she started coughing, neither her doctors nor her family initially suspected that the cause could be lung cancer. When Sylvia was finally diagnosed and met Dr. Sequist, the disease was too advanced to revert its course. When mourning Sylvia’s death, we couldn’t stop thinking how many other patients have similar trajectories.”

This work was supported by the Bridge Project, a partnership between the Koch Institute at MIT and the Dana-Farber/Harvard Cancer Center; the MIT Jameel Clinic; Quanta Computer; Stand Up To Cancer; the MGH Center for Innovation in Early Cancer Detection; the Bralower and Landry Families; Upstage Lung Cancer; and the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. The Cancer Center of Linkou CGMH under Chang Gung Medical Foundation provided assistance with data collection and R. Yang, J. Song and their team (Quanta Computer Inc.) provided technical and computing support for analyzing the CGMH dataset. The authors thank the National Cancer Institute for access to NCI’s data collected by the National Lung Screening Trial, as well as patients who participated in the trial.

Adapted from a news story posted on MIT News.

December 7, 2022
New method identifies spatial biomarkers of Alzheimer’s disease progression in animal model

Many diseases affect how cells are spatially organized in tissues, such as in Alzheimer’s disease, where amyloid-β proteins clump together to form plaques in the brain. Studying how cells differ in various regions of tissue could help scientists better understand the key changes that lead to Alzheimer’s and other diseases. But integrating data on gene expression and cell structure and spatial location into the same analysis has proven challenging.

Now, researchers from the Broad Institute of MIT and Harvard and ETH Zürich in Switzerland have developed a computational framework for simultaneously analyzing gene expression, the structure of cell nuclei, and their position in space. STACI (Spatial Transcriptomics combined using Autoencoders with Chromatin Imaging) is the first method that combines these three kinds of data. The findings appeared recently in Nature Communications.

The team, led by Caroline Uhler, the study’s senior author and co-director of the Eric and Wendy Schmidt Center at the Broad, and Xinyi Zhang, first author on the study and a graduate student in Uhler’s lab, developed STACI and applied it to study a mouse model of Alzheimer’s disease.

STACI uses a kind of computational model called a neural network to analyze data generated by a technique called STARmap, which measures the expression of more than two thousand genes and maps their location in intact tissue. STARmap was developed by Xiao Wang, a core institute member at the Broad and co-author on the study.

The team used STACI to analyze brain tissue from the Alzheimer’s mouse model. By studying gene expression and the location of cells in the tissue, the scientists identified a part of the cortex in the mouse brain that was more likely to have significant plaque accumulations. With the help of G. V. Shivashankar, a study author and professor of mechano-genomics at ETH Zürich, the team also found that they could predict plaque size — a marker of disease progression — by analyzing just one feature of cells near the plaques: the structure of chromatin, the complex of DNA and protein that makes up chromosomes. The results suggest that chromatin structure could be a marker of Alzheimer’s disease progression.

“We began by asking how we can integrate these different data modalities,” said Uhler, who is also a core institute member at Broad and professor in the Department of Electrical Engineering and Computer Science at MIT. “What’s really exciting is that now, with STACI, we can begin to ask biological questions to learn more about disease by taking all modalities into account simultaneously.”

Zhang, who is also a fellow at the Schmidt Center, says that STACI is a useful tool for researchers because chromatin imaging is routine in labs and cheaper than measuring the gene expression of cells directly. “This study may provide simple, low-cost avenues for studying which regions of the brain are more affected by disease and for tracking disease progression,” she said.

Cells in space

In previous work, Uhler and Shivashankar showed that they could use computational techniques to analyze single-cell RNA sequencing data along with chromatin images. They collaborated with Wang to incorporate the analysis of cell location data from STARmap and build STACI.

STACI relies on a neural network, which learns patterns from “training” data to predict characteristics of new data. To develop STACI, the researchers trained it to build a map, called a latent space, that groups together cells with similar locations, gene expression, or chromatin structure. They then used STACI to analyze images of chromatin from mouse brain tissue.

From this latent space, the scientists found that the size of plaque deposits is highly correlated with the ratio of heterochromatin to euchromatin, which indicates how densely packed the chromatin is. This relationship suggests DNA packing could be a marker of disease progression.
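To make "highly correlated" concrete, here is a minimal sketch of how such a relationship could be quantified with a Pearson correlation coefficient. The numbers are synthetic and purely illustrative, not measurements from the study, and the variable names are assumptions made for this example.

```python
import numpy as np

# Synthetic, illustrative values only (NOT data from the study).
# Each entry is one hypothetical plaque: the heterochromatin-to-euchromatin
# ratio of nearby cells, and the plaque size in arbitrary units.
ratio = np.array([0.8, 1.0, 1.1, 1.3, 1.6, 1.9])
plaque_size = np.array([12.0, 15.0, 15.5, 19.0, 24.0, 27.5])


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


r = pearson_r(ratio, plaque_size)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that plaque size rises almost linearly with the ratio. Correlation alone does not establish a causal link, which is why the team frames the finding as a possible marker.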

The team says the connection between chromatin density and plaques suggests new questions in Alzheimer’s research. They hope their findings will spur other groups to investigate the biological relationship between DNA packing and plaque build-up.  

Branching out

Brain tissue samples can vary widely in how they are collected and prepared, but the scientists designed STACI to account for this variation. The technique could also be applied to other spatial data types, such as those from Slide-Seq — developed by Fei Chen, Evan Macosko and other colleagues at the Broad — as well as Visium and MERFISH.

Uhler adds that STACI could also help researchers learn more about other diseases, since many have important spatial features. She envisions using the framework to analyze the local microenvironment in cancer, fibrosis or scarring in the lungs or other tissues, as well as developmental processes. As scientists apply STACI to new problems, they’ll likely encounter new analytical challenges, but she thinks this is an opportunity to help the model expand.

“This work shows how biology can be a great inspiration for novel computational questions and developments,” Uhler said. “And that’s really exciting.”

This work was supported in part by the Eric and Wendy Schmidt Center, the Simons Foundation, the Office of Naval Research, the National Institutes of Health, and the National Science Foundation.

Adapted from a news story posted on the Broad Institute website.

November 21, 2022
Eric and Wendy Schmidt Center announces data science challenge to harness machine learning for cancer immunotherapy

The immune system is adept at fighting off viral and bacterial infections, but it can also find and attack cancer in the body. Cancer cells, however, are skilled at disarming the immune system’s T cells — allowing tumors to continue growing unabated.

Scientists at the Broad Institute of MIT and Harvard and beyond have been looking for ways to genetically modify T cells to improve their cancer-fighting ability. Now the Eric and Wendy Schmidt Center at the Broad Institute is joining this effort, by holding a data science challenge this winter that will call on machine learning enthusiasts to develop algorithms that identify effective genetic modifications in T cells.

Winners will receive monetary prizes at each stage — and, unlike in most data science challenges, the top-scoring participants will have their submissions experimentally validated. Members of a cancer immunology lab at Broad led by institute member Nir Hacohen will make the top-ranked genetic modifications in T cells in the lab and assess the cells’ cancer-fighting abilities.  

The "Cancer Immunotherapy Data Science Grand Challenge" was announced earlier this month at the online coding tournament Topcoder Open, and will run from January 9 to February 3, 2023. The Eric and Wendy Schmidt Center is partnering with Harvard’s Laboratory for Innovation Science, the MIT Department of Electrical Engineering and Computer Science, Topcoder, and Massachusetts General Hospital (MGH) to run the challenge.

“Machine learning experts have largely gone into the fields of big technology and finance. With this challenge, we’re describing an important problem in cancer immunology in a way that is approachable for computational minds — thus hoping to entice more of these experts to the life sciences,” said Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT.

Improving cancer immunotherapy through machine learning

Cancer immunotherapies boost the immune system to fight off cancer in a variety of ways. Scientists have made many breakthroughs in cancer immunotherapy in the last decade, such as the development of several FDA-approved checkpoint blockade and “CAR T” therapies. CAR T treatments involve removing T cells from a cancer patient, genetically engineering them in the lab to target tumors, and then reintroducing them back into the patient. However, these treatments work for only a small number of cancer types and only in some patients.

To make T cell-based immunotherapies more effective for more patients, scientists are looking for other genetic changes they can introduce in T cells to make them better cancer killers. With the development of genome-editing technologies such as CRISPR in the last decade, researchers can look for those desirable changes by performing large-scale genetic screens to systematically modify or knock out each gene and study the effect of these “perturbations” at the single-cell level.

However, perturbing each of the 20,000 genes in the cell or the several hundred million different combinations of genes in the lab would be too costly and time-consuming. Machine learning can help, by predicting which genetic perturbations might be most effective.
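A quick back-of-the-envelope calculation shows why exhaustive screening is impractical. The sketch below assumes that "combinations" means unordered pairs of genes, which the article does not state explicitly. The three-gene count is included only for comparison.

```python
import math

n_genes = 20_000  # figure quoted in the article

single = n_genes                 # one perturbation per gene
pairs = math.comb(n_genes, 2)    # unordered pairs of distinct genes (assumed meaning)
triples = math.comb(n_genes, 3)  # unordered triples, shown for comparison only

print(f"single-gene perturbations: {single:,}")
print(f"pairwise combinations:     {pairs:,}")
print(f"three-gene combinations:   {triples:,}")
```

Under this assumption the pairwise count alone is about 200 million, and it grows past a trillion once three genes are perturbed together.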

“We hope that this challenge will allow us to quickly hone in on the most promising perturbations so we can better target our experimental validation,” said Hacohen, director of the Broad Institute’s Cell Circuits Program, institute member of the Broad Institute, and director of MGH’s Center for Cancer Immunology. “The predictions from this challenge will provide a crucial step toward making cancer immunotherapy more effective for more patients.”

The Cancer Immunotherapy Data Science Challenge will consist of three parts that will run at the same time. In the first part, participants will use transcriptomic and perturbational data from T cells in mouse tumors to develop algorithms that predict the effect of perturbations that have already been studied in the lab, allowing them to see how well their algorithms work. In part two, they’ll come up with a metric for ranking how well a particular gene knockout would shift T cells to a desired state.

And, third, participants will use their algorithms to propose perturbations that boost T cells’ ability to destroy tumors. The top-scoring participants from part one will have their proposed perturbations experimentally validated.
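As a rough illustration of what a ranking metric in part two might look like, the sketch below scores each knockout by how much of the gap between an unperturbed cell-state distribution and a desired distribution it closes. Everything here is hypothetical: the state labels, the knockout names, the proportions, and the choice of an L1 distance are assumptions made for this example, not the challenge's actual data or scoring rule.

```python
import numpy as np

# Hypothetical cell states and distributions (placeholders, not challenge data).
states = ["state_A", "state_B", "state_C"]
desired = np.array([0.10, 0.80, 0.10])   # target proportion of cells per state
baseline = np.array([0.20, 0.30, 0.50])  # unperturbed T cells

# Hypothetical model predictions: state proportions after each knockout.
predicted = {
    "knockout_gene_1": np.array([0.25, 0.45, 0.30]),
    "knockout_gene_2": np.array([0.10, 0.20, 0.70]),
    "knockout_gene_3": np.array([0.15, 0.60, 0.25]),
}


def shift_score(proportions, desired, baseline):
    """Fraction of the baseline-to-desired L1 gap closed by a knockout.

    1 means the desired distribution is reached, 0 means no improvement,
    and negative values mean the knockout moves cells further away.
    """
    gap_before = np.abs(baseline - desired).sum()
    gap_after = np.abs(proportions - desired).sum()
    return float(1.0 - gap_after / gap_before)


scores = {name: shift_score(p, desired, baseline) for name, p in predicted.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:+.2f}")
```

Other distances between distributions could be substituted for the L1 distance. Choosing among them is the kind of design decision part two asks participants to make.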

“Data science challenges like this one draw on the power of the crowd to bring in outside computational and creative machine learning techniques to solve biological problems,” said MarcAntonio Awada, head of research and data science at Harvard’s Digital, Data, and Design Institute. “In the past, crowdsourcing has led to out-of-the-box approaches and completely novel solutions compared to what experts had come up with.”

Unique learning and data access opportunities

The challenge will run concurrently with an Independent Activities Period course at MIT, which brings together computer science and biology students to collaborate on this problem. “The course provides a great opportunity for MIT students to apply their education and see that what they’re learning in the classroom has a direct impact on answering critical biomedical questions,” said Uhler, who is one of the course’s instructors.

A biology background isn’t necessary to participate. The Eric and Wendy Schmidt Center will provide all challenge participants with an online crash course on cancer immunology and unique features of the large-scale datasets. Interested participants can pre-register now as an individual or as part of a team on Topcoder, which is hosting the challenge on its platform.

Participants will have free access to Saturn Cloud to complete the challenge.

Adapted from a news story posted on the Broad Institute website.

May 13, 2022
Workshop sparks new tissue biology and AI research areas and collaborations

Advancing our understanding of tissue biology requires tight collaborations between biologists with driving questions, technologists creating new experimental methods, and computational scientists who are creating new ways of analyzing data. One of the key aims of an April 27 workshop held by the Eric and Wendy Schmidt Center and the Klarman Cell Observatory at the Broad Institute was to explore the interface between these disciplines. Speakers and panelists included researchers at Stanford University, MIT, Harvard University, the Sloan Kettering Institute, UC Berkeley, Princeton University, and the Broad Institute.

The workshop brought together a diverse set of communities to discuss new tissue biology research questions — and new opportunities for collaboration between the biomedical sciences and machine learning.

Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and an associate professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT, told workshop attendees during opening remarks that biology has seen an “explosion” of data in recent years. “We now have the opportunity to understand the programs of life, so not just the units (like genes or single cells), but actually the interactions between these units.”

Biological research frontiers

These cellular interactions play a key role in the cancer immunotherapy research shared by keynote speaker Garry Nolan, a professor in the pathology department at Stanford University. His research team develops algorithms to model tissue areas where different groups of cells interact, areas he calls “interface zones,” to gain insights into how cancer remodels its surrounding tissue and evades the body’s immune system. These interface zones are critical as the locus of cellular changes that lead to tumor growth.

“I would urge you, when you're looking at your RNA data sets, to the extent that you can call out these kinds of interface zones, pay special attention to the RNA changes that are occurring there,” said Nolan, adding later: “The boundary space is where the action is.”

Additionally, biologists should reconsider labeling tumors and other features “heterogeneous,” which implies that tumors from different patients are too distinct from one another to be compared. “There is an order here that can be extracted,” said Nolan.

Meanwhile, keynote speaker Emma Lundberg, an associate professor of bioengineering at Stanford and co-director of the Human Protein Atlas, outlined how her team has mapped where proteins are located in cells — a process known as “spatial proteomics.” Interestingly, over half of proteins can be found in more than one part of the cell, which changes how they function.

“If you ask me, I would say that we'll never be able to fully understand and model cell function if we don't include spatial proteomics data,” said Lundberg.

Panelists also discussed next steps for engineered tissues and artificial organs in disease study and regenerative medicine. Sangeeta Bhatia, a professor of health sciences and of electrical engineering and computer science at MIT’s Koch Institute, said that researchers have been able to engineer artificial tissues and organs that have little structure, like the skin and cartilage, for decades. Now, they're moving on to endocrine tissue, like the pancreas and liver. “Then you start to think about the tissues whose function is dependent on architecture, like the kidney, the lung — that's the next frontier, and I think we are not quite there yet,” she said.

One challenge brought up by Paola Arlotta, a professor of stem cell and regenerative biology at Harvard University, is how to factor genetics into tissue and organ models. One way to do this is to see how cells from different individuals respond to the same kinds of disturbances. If researchers don’t take genetic variability into account, “we’re ignoring a fundamental component of what human disease is,” she said.

Computational and technological challenges

Keynote speaker Dana Pe’er, chair of the Computational and Systems Biology Program at the Sloan Kettering Institute, outlined computational limitations that need to be addressed to answer pressing biological questions. For example, as researchers move from profiling a small section of a tissue to mapping a whole tissue or organ in different samples, they need to be able to map different tissue sections to each other.

“We’re still largely trying to figure out how to process this data, which is hampering our ability to interpret and powerfully utilize the data,” Pe’er said.

Given that there’s not yet a spatial profiling technology that can provide both high-resolution and high-content information on features like proteins, researchers will often need to combine a spatial profiling method with single-cell data.

Barbara Engelhardt, an associate professor of computer science at Princeton University, said taking multiple images from the same type of tissue and aligning them can help researchers better understand cell type variability.

At the end of the second panel, Anthony Philippakis, co-director of the Eric and Wendy Schmidt Center and chief data officer of the Broad Institute, asked panelists whether they had any “recipes for success” to foster collaborations between the two fields.

Bhatia emphasized the importance of having researchers, or research teams, who are “bilingual” — that is, able to understand both experimental and computational biology. "It doesn't work well if you're just the recipient of data and you don't understand the context," Bhatia said. "We have to create these teams where we can really speak both languages."

Starting the conversations needed to build this bilingual proficiency was precisely the goal of the workshop.

April 13, 2022
Fellows develop AI methods to design antibodies and virtually screen drugs

Wengong Jin planned to research language processing for his computer science PhD. But when Jin learned about research on machine learning for drug discovery at the MIT Computer Science and Artificial Intelligence Laboratory, he told his advisor, Regina Barzilay, that he’d had a change of heart.

“She thought I was jet lagged, because I’d just come over from China and I was proposing a really big switch,” he said.

Jin, now a fellow at the Eric and Wendy Schmidt Center, stayed the course. Six years later, he and a team of researchers have come up with a new kind of model to automatically design antibodies — holding huge potential for immunotherapy.

Meanwhile, another Eric and Wendy Schmidt Center Fellow, PhD candidate Adit Radhakrishnan, recently developed a simple yet powerful method for virtually screening new drug candidates. That framework appears in a study published this April in Proceedings of the National Academy of Sciences.

“A number of research institutes have started using machine learning to answer key questions in biology. But at the Eric and Wendy Schmidt Center, as Jin’s and Radhakrishnan’s research shows, our goal is to also go in the other direction, by using biomedical problems to drive advances in machine learning,” said Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT.

Game-changer for antibody design

Discovering drugs has traditionally been a labor-intensive process, with researchers toiling away for years to test millions of molecules only to come up with a handful of candidates. Now, researchers like Jin and Radhakrishnan are working to automate that process.

“The idea is that we don't need experts to get a cup of coffee and then work all night trying to figure out a new molecule, but rather, to let the machine do the heavy lifting,” Jin said.

During his PhD, Jin was part of a research team that developed a machine-learning algorithm to speed up antibiotic discovery. The researchers found a new antibiotic that was effective against bacteria that are resistant to multiple drugs. In this instance, the team provided the model with roughly a million possible compounds to sort through.

That left Jin and other researchers wondering: Could they use artificial intelligence to design molecules from scratch?

The answer was yes. Jin and other researchers developed a generative model that designed antibodies — Y-shaped proteins that bind to viruses, bacteria, and other pathogens, activating our bodies’ immune response — that could neutralize the SARS-CoV-2 virus. Their findings were published earlier this year in a paper at the International Conference on Learning Representations.

“The new model can propose in a couple of seconds an antibody that has a high likelihood of working — totally changing the game,” said Jin.

While researchers had worked on generative models for antibody discovery before, those models could only come up with a protein’s amino acid sequence — not its shape. In contrast, the new model, which represents the antibody as a graph, simultaneously designs both the sequence and structure of its binding region. “Whether or not the antibody is the right shape to bind to a virus or other pathogen is crucial to its success,” said Jin.

“While human experts have methods to generate neutralizing antibodies, it takes time and effort. The task becomes even more challenging when additional properties need to be enforced. As our understanding of disease biology and immune system deepens, the number of such desired characteristics will continue to grow. Computational methods for antibody design are particularly useful to address this challenge,” said Regina Barzilay, the AI faculty lead for the MIT Jameel Clinic for Machine Learning in Health.

And, because so many types of data are structured as networks, the model also represents an advance in the field of machine learning. “It’s an example of how biology proposed a new problem for machine learning to solve,” said Jin.

An old machine-learning method repurposed for virtual drug screening

Adit Radhakrishnan's father had pursued a mathematics education in India prior to immigrating to the U.S. He instilled in his son a love of math, which led the younger Radhakrishnan to pursue a PhD of his own in electrical engineering and computer science at MIT.

Radhakrishnan researches the fundamentals of deep learning — a kind of artificial intelligence modeled after the human brain that processes unstructured data. Understanding why deep learning is successful, and using that knowledge to build novel models for the healthcare and genomic space, underpins much of Radhakrishnan’s research as an Eric and Wendy Schmidt Center fellow.

Over the past few years, deep learning has become widely adopted in biological applications, with researchers increasingly turning to it to screen potential new drugs. In order to perform well on such tasks, researchers use very large deep learning models that often require significant computing power. Moreover, the complexity of this approach makes it hard for scientists to understand why these models make a given prediction, shedding little light on why a proposed drug could work.

To get around the complexities of deep learning, Radhakrishnan and other researchers, including Uhler and Mikhail Belkin, a professor at the Halıcıoğlu Data Science Institute at the University of California, San Diego, turned to an older class of machine learning models: kernel methods. Prior to the recent wave of deep learning, kernel methods were a prominent and computationally simple approach for machine learning tasks. These models have recently become popular again since they can serve as a proxy for using very large deep learning models with much less computational burden.

The team came up with a simple yet highly adaptable kernel framework that was able to predict the effect that a drug has on gene expression, a measure of how cells change in response to a drug. “In contrast to the expertise needed to train large deep learning models to solve a particular problem, it takes about three lines of code to train the kernel method to do the same task,” said Radhakrishnan.
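To give a sense of how compact kernel training can be, here is a minimal sketch of generic kernel ridge regression in NumPy. It is not the researchers' framework: the Gaussian kernel, the bandwidth, the ridge term, and the synthetic sine-wave data are all assumptions chosen for illustration, and the article does not say which kernel the team used.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """Pairwise Gaussian (RBF) kernel between the rows of A and the rows of B."""
    sq_dists = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


# Synthetic regression data standing in for (features, response) pairs.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(X_train[:, 0])
X_test = np.linspace(-2.5, 2.5, 50)[:, None]

# The core of training and prediction: build the kernel matrix,
# solve one regularized linear system, then predict.
K = gaussian_kernel(X_train, X_train)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(K)), y_train)
y_pred = gaussian_kernel(X_test, X_train) @ alpha

max_err = float(np.max(np.abs(y_pred - np.sin(X_test[:, 0]))))
print(f"max abs error on test grid: {max_err:.4f}")
```

Once a kernel function is available, fitting reduces to a single linear solve, with no iterative optimization or architecture tuning. That is the sense in which kernel methods are computationally simple compared with large deep learning models.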

The framework has uses beyond biology; the researchers demonstrated, for example, that it could be used by video streaming providers to predict how a viewer would rank a particular movie they hadn’t yet seen. And the framework allows researchers to gain insights into how more complex deep learning models function.

According to Radhakrishnan, who is not trained as a biologist, the best part of being a fellow at the Eric and Wendy Schmidt Center is that the center puts machine learning experts and biologists in constant conversation with each other.

“You don’t just have computational researchers running their methods on a biology dataset without a biologist in the mix. You can get continuous feedback on: Is this actually useful?” said Radhakrishnan. “So it gives you a much more guided focus on what biological problems are important and what computational methods are missing.”

In the media: