News

May 23, 2024
Caroline Uhler, Schmidt Center director, named IMS Fellow
2024

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is pleased to share that Center Director Caroline Uhler has been elected Fellow of the Institute of Mathematical Statistics (IMS). Uhler received the award for interdisciplinary excellence and for merging mathematical statistics and computational biology in innovative and impactful ways. 

Uhler is a core institute member of the Broad Institute and a professor in the Department of Electrical Engineering and Computer Science (EECS) and the Institute for Data, Systems, and Society (IDSS) at MIT. She is also a SIAM Fellow, a Sloan Research Fellow, and an elected member of the International Statistical Institute. 

Uhler’s research lies at the intersection of machine learning, statistics, and genomics, with a particular focus on causal inference, representation learning, and gene regulation. Her use of probabilistic graphical models and development of scalable algorithms with healthcare applications has enabled her research group to gain insights into causal relationships hidden within massive amounts of data, such as those generated during gene knockout or knockdown experiments.

Caroline Uhler, director of the Schmidt Center

For almost 90 years, the title of IMS Fellow has represented a prestigious honor. Evaluated by a committee of peers, each Fellow has exhibited exceptional mastery in statistical or probabilistic research and/or has showcased remarkable leadership that has left a lasting impact on the field.

Established in 1935, the IMS is a member organization that fosters the development and dissemination of the theory and applications of statistics and probability. The IMS has over 4,700 active members throughout the world, with approximately 10% of the current IMS members earning the fellowship status. The announcement of the 2024 class of IMS Fellows can be viewed here.

Uhler will be honored among the new IMS Fellows at the IMS Presidential Address and Awards Ceremony at the Bernoulli-IMS 11th World Congress in Probability and Statistics on August 12-16, 2024 in Bochum, Germany.

People
Causal Inference
Representation Learning
May 13, 2024
Machine learning method reveals chromosome locations in individual cell nucleus
2024

Researchers from Carnegie Mellon University’s School of Computer Science and the Broad Institute of MIT and Harvard have made a significant advancement toward understanding how the human genome is organized inside a single cell. This knowledge is crucial for analyzing how DNA structure influences gene expression and disease processes. 

In a paper published by the journal Nature Methods, Ray and Stephanie Lane Professor of Computational Biology Jian Ma and former Ph.D. students Kyle Xiong and Ruochi Zhang introduce scGHOST, a machine learning method that detects subcompartments — a specific type of 3D genome feature in the cell nucleus — and connects them to gene expression patterns. Zhang is currently a postdoctoral fellow at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard.

Ruochi Zhang, postdoctoral fellow at the Eric and Wendy Schmidt Center

In human cells, chromosomes aren’t arranged linearly but are folded into 3D structures. Researchers are particularly interested in 3D genome subcompartments because they reveal where chromosomes are located spatially inside the nucleus. 

“One of the ultimate goals of single-cell biology is to elucidate the connections between cellular structure and function across a wide variety of biological contexts,” Ma said. “In this case, we are exploring how chromosome organization within the nucleus correlates with gene expression.” 

While new technologies allow the study of these structures at the single-cell level, poor data quality can hinder precise understanding. scGHOST addresses this problem by using graph-based machine learning to enhance the data, making it easier to pinpoint how chromosomes are spatially organized. scGHOST builds upon two methods that Ma's research group previously developed: Higashi and its evolution, Fast Higashi, which focuses on scHi-C embeddings and imputations.

"Graph and hypergraph representation learning are integral to these methods and scGHOST, as they allow for a more nuanced and detailed exploration of the complex interactions within the genome,” said Zhang.

With the ability to accurately identify 3D genome subcompartments, scGHOST adds to the growing array of single-cell analysis tools scientists use to delineate the intricate molecular landscape of complex tissues, such as those in the brain. Ma anticipates that scGHOST could open new avenues to understanding gene regulation in health and disease. 

Read more about their work in Nature Methods. Additionally, learn more about this research in a February 8, 2023, Models, Inference and Algorithms talk by Zhang.

Adapted from a news story posted on the CMU School of Computer Science’s website.

Cells
Representation Learning
People
April 11, 2024
Researchers introduce new AI tool to help clinicians capture uncertainty in medical images
2024

In biomedicine, segmentation involves annotating pixels from an important structure in a medical image, like an organ or cell. Artificial intelligence models can help clinicians by highlighting pixels that may show signs of a certain disease or anomaly.

However, these models typically only provide one answer, while the problem of medical image segmentation is often far from black and white. Five expert human annotators might provide five different segmentations, perhaps disagreeing on the existence or extent of the borders of a nodule in a lung CT image.

“Having options can help in decision-making. Even just seeing that there is uncertainty in a medical image can influence someone’s decisions, so it is important to take this uncertainty into account,” says Marianne Rakic, an MIT computer science PhD candidate and fellow at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard.

Rakic is lead author of a paper with others at MIT, the Broad Institute, and Massachusetts General Hospital that introduces a new AI tool that can capture the uncertainty in a medical image.

Known as Tyche (named for the Greek divinity of chance), the system provides multiple plausible segmentations that each highlight slightly different areas of a medical image. A user can specify how many options Tyche outputs and select the most appropriate one for their purpose.

Importantly, Tyche can tackle new segmentation tasks without needing to be retrained. Training is a data-intensive process that involves showing a model many examples and requires extensive machine-learning experience.

Because it doesn’t need retraining, Tyche could be easier for clinicians and biomedical researchers to use than some other methods. It could be applied “out of the box” for a variety of tasks, from identifying lesions in a lung X-ray to pinpointing anomalies in a brain MRI.

Ultimately, this system could improve diagnoses or aid in biomedical research by calling attention to potentially crucial information that other AI tools might miss.

“Ambiguity has been understudied. If your model completely misses a nodule that three experts say is there and two experts say is not, that is probably something you should pay attention to,” adds senior author Adrian Dalca, an assistant professor at Harvard Medical School and MGH, and a research scientist in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

Their co-authors include Hallee Wong, a graduate student in electrical engineering and computer science; Jose Javier Gonzalez Ortiz PhD ’23; Beth Cimini, associate director for bioimage analysis at the Broad Institute; and John Guttag, the Dugald C. Jackson Professor of Computer Science and Electrical Engineering. Rakic will present Tyche at the IEEE Conference on Computer Vision and Pattern Recognition, where Tyche has been selected as a highlight.

Addressing ambiguity

AI systems for medical image segmentation typically use neural networks. Loosely based on the human brain, neural networks are machine-learning models comprising many interconnected layers of nodes, or neurons, that process data.

After speaking with collaborators at the Broad Institute and MGH who use these systems, the researchers realized two major issues limit their effectiveness. The models cannot capture uncertainty and they must be retrained for even a slightly different segmentation task.

Some methods try to overcome one pitfall, but tackling both problems with a single solution has proven especially tricky, Rakic says.

“If you want to take ambiguity into account, you often have to use an extremely complicated model. With the method we propose, our goal is to make it easy to use with a relatively small model so that it can make predictions quickly,” she says.

The researchers built Tyche by modifying a straightforward neural network architecture.

A user first feeds Tyche a few examples that show the segmentation task. For instance, examples could include several images of lesions in a heart MRI that have been segmented by different human experts so the model can learn the task and see that there is ambiguity.

The researchers found that just 16 example images, called a “context set,” is enough for the model to make good predictions, but there is no limit to the number of examples one can use. The context set enables Tyche to solve new tasks without retraining.

For Tyche to capture uncertainty, the researchers modified the neural network so it outputs multiple predictions based on one medical image input and the context set. They adjusted the network’s layers so that, as data move from layer to layer, the candidate segmentations produced at each step can “talk” to each other and the examples in the context set.

In this way, the model can ensure that candidate segmentations are all a bit different, but still solve the task.

“It is like rolling dice. If your model can roll a two, three, or four, but doesn’t know you have a two and a four already, then either one might appear again,” she says.

They also modified the training process so that the model is rewarded for maximizing the quality of its best prediction.

If the user asked for five predictions, at the end they can see all five medical image segmentations Tyche produced, even though one might be better than the others.
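The idea of rewarding only the best prediction can be illustrated with a minimal sketch. It is not the researchers' implementation: the Dice overlap used as the quality measure, the flat binary masks, and the function names `dice_score` and `best_candidate_loss` are all assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice overlap between two binary masks given as flat lists of 0/1.

    Assumption: Dice stands in for "quality"; the article does not name a metric.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def best_candidate_loss(candidates, target):
    """Loss that depends only on the best of the candidate segmentations."""
    return min(1.0 - dice_score(c, target) for c in candidates)


# Hypothetical toy data: one annotated mask and three candidate masks.
target = [0, 1, 1, 0]
candidates = [
    [1, 1, 0, 0],  # overlaps the target on one pixel
    [0, 1, 1, 0],  # matches the target exactly
    [0, 0, 0, 1],  # no overlap with the target
]

loss = best_candidate_loss(candidates, target)
print(loss)
```

Because only the best candidate counts toward the loss, the remaining candidates are not penalized for differing from a given annotation, which leaves room for them to cover other plausible segmentations.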

The researchers also developed a version of Tyche that can be used with an existing, pretrained model for medical image segmentation. In this case, Tyche enables the model to output multiple candidates by making slight transformations to images.

Better, faster predictions

When the researchers tested Tyche with datasets of annotated medical images, they found that its predictions captured the diversity of human annotators, and that its best predictions were better than any from the baseline models. Tyche also performed faster than most models.

“Outputting multiple candidates and ensuring they are different from one another really gives you an edge,” Rakic says.

The researchers also saw that Tyche could outperform more complex models that have been trained using a large, specialized dataset.

For future work, they plan to try using a more flexible context set, perhaps including text or multiple types of images. In addition, they want to explore methods that could improve Tyche’s worst predictions and enhance the system so it can recommend the best segmentation candidates.

This research is funded, in part, by the National Institutes of Health, the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, and Quanta Computer.

This story was adapted from a piece on MIT News.

Organisms
Representation Learning
March 15, 2024
Schmidt Center director awarded Department of Defense MURI funding
2024

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is excited to announce that center director Caroline Uhler has received a Multidisciplinary University Research Initiative (MURI) award from the U.S. Department of Defense.

MURI awards support interdisciplinary teams of researchers in conducting fundamental research on topics deemed critical by the defense department. Uhler and collaborators will use the award to advance optimal intervention design in complex systems — an effort that should enhance decision-making in areas ranging from biomedical to engineering and societal applications. 

Uhler, the project’s principal investigator, is a core member of the Broad Institute and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT. She is joined on the award by Alberto Abadie and Devavrat Shah, MIT; Miguel Hernan, Harvard University; John Ioannidis, Stanford University; Mengdi Wang, Princeton University; and Feng Zhang, Broad Institute. The project will involve several graduate students and postdocs.

Uhler and the research team will develop a computational framework for evidence-based decision-making. The researchers plan to host machine learning competitions to test out their methodology in engineering, biological, health, and societal application areas.

Project Title: “Evaluating, Predicting, Optimizing, and Monitoring Hypothetical Interventions in Large Networked Systems”

People
March 13, 2024
How do neural networks learn? A mathematical formula explains how they detect relevant patterns
2024

Neural networks have been powering breakthroughs in artificial intelligence, including the large language models that are now being used in a wide range of applications, from finance to human resources to healthcare. But these networks remain a black box whose inner workings engineers and scientists struggle to understand. Now, a team led by data and computer scientists at the University of California San Diego has given neural networks the equivalent of an X-ray to uncover how they actually learn. 

The researchers found that a formula used in statistical analysis provides a streamlined mathematical description of how neural networks, such as GPT-2, a precursor to ChatGPT, learn relevant patterns in data, known as features. This formula also explains how neural networks use these relevant patterns to make predictions. 

“We are trying to understand neural networks from first principles,” said Daniel Beaglehole, a PhD student in the UC San Diego Department of Computer Science and Engineering and co-first author of the study. “With our formula, one can simply interpret which features the network is using to make predictions.”  

Adit Radhakrishnan, a postdoctoral fellow at Harvard who worked on the paper as an MIT EECS PhD student funded by the Schmidt Center and co-first author of the study, added: “We showed that neural networks, unlike other machine learning models, automatically implement this formula to identify features most relevant for prediction.”

The team presented their findings in the March 7 issue of the journal Science.

Why does it matter how neural networks make predictions? AI-powered tools are now pervasive in everyday life. Banks use them to approve loans. Hospitals use them to analyze medical data, such as X-rays and MRIs. Companies use them to screen job applicants. But it’s currently difficult to understand the mechanism neural networks use to make decisions and the biases in the training data that might impact this. 

“If you don’t understand how neural networks learn, it’s very hard to establish whether neural networks produce reliable, accurate, and appropriate responses,” said Mikhail Belkin, the paper’s corresponding author and a professor at the UC San Diego Halicioglu Data Science Institute. “This is particularly significant given the rapid recent growth of machine learning and neural net technology.”

Former Eric and Wendy Schmidt Center PhD fellow Adit Radhakrishnan's research focuses on advancing the theoretical foundations of machine learning and developing new methods for tackling biomedical problems.

Understanding how neural networks make predictions is especially important in biological applications. In the realm of drug discovery, for example, researchers would not only want a model that accurately predicts drugs that are effective in treating cancer — they also want to discover biological mechanisms that make such drugs effective, explained Radhakrishnan. “By applying our findings to models trained to predict the effect of drugs on cancer cells, we can discover features of cancer cells that make them susceptible to a given drug and then develop new drugs to specifically target those mechanisms,” he said.

The study is part of a larger effort in Belkin’s research group to develop a mathematical theory that explains how neural networks work. “Technology has outpaced theory by a huge amount,” he said. “We need to catch up.” 

The team also showed that the statistical formula they used to understand how neural networks learn, known as Average Gradient Outer Product (AGOP), could be applied to improve performance and efficiency in other types of machine learning architectures that do not include neural networks.
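As a rough illustration of the Average Gradient Outer Product, the sketch below averages the outer product of a predictor's gradient with itself over a set of input points. It assumes a scalar-output predictor and estimates gradients by finite differences; the toy predictor, the sample points, and the function names are hypothetical and are not taken from the paper.

```python
import math


def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def agop(f, points):
    """Average over the points of the outer product grad f(x) grad f(x)^T."""
    d = len(points[0])
    total = [[0.0] * d for _ in range(d)]
    for x in points:
        g = numerical_gradient(f, x)
        for i in range(d):
            for j in range(d):
                total[i][j] += g[i] * g[j]
    n = len(points)
    return [[value / n for value in row] for row in total]


def toy_predictor(x):
    """Hypothetical predictor: strong use of x[0], weak use of x[1], none of x[2]."""
    return math.sin(2.0 * x[0]) + 0.1 * x[1]


points = [[0.1 * i, 0.2 * i, -0.3 * i] for i in range(10)]
M = agop(toy_predictor, points)

# The diagonal gives a per-coordinate measure of how much the predictor relies on each input.
feature_importance = [M[i][i] for i in range(3)]
print(feature_importance)
```

In this toy case the diagonal entry for the first coordinate is the largest and the entry for the unused third coordinate is zero, which mirrors the article's description of the formula highlighting the patterns a model relies on.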

“If we understand the underlying mechanisms that drive neural networks, we should be able to build machine learning models that are simpler, more efficient, and more interpretable,” Belkin said. “We hope this will help democratize AI.”

The machine learning systems that Belkin envisions would need less computational power, and therefore less power from the grid, to function. These systems also would be less complex and so easier to understand. 

Illustrating the new findings with an example

(Artificial) neural networks are computational tools to learn relationships between data characteristics (e.g., identifying specific objects or faces in an image). One example of a task is determining whether a person in a new image is wearing glasses or not. Machine learning approaches this problem by providing the neural network with many example (training) images labeled as images of “a person wearing glasses” or “a person not wearing glasses.” The neural network learns the relationship between images and their labels, and extracts data patterns, or features, that it needs to focus on to make a determination. One of the reasons AI systems are considered a black box is that it is often difficult to describe mathematically what criteria the systems are actually using to make their predictions, including potential biases. The new work provides a simple mathematical explanation for how the systems are learning these features.

Features are relevant patterns in the data. In the example above, there is a wide range of features that the neural network learns, and then uses, to determine whether a person in a photograph is wearing glasses or not. One feature it would need to pay attention to for this task is the upper part of the face. Other features could be the eye or the nose area where glasses often rest. The network selectively pays attention to the features that it learns are relevant and then discards the other parts of the image, such as the lower part of the face, the hair and so on.

Feature learning is the ability to recognize relevant patterns in data and then use those patterns to make predictions. In the glasses example, the network learns to pay attention to the upper part of the face. In the new Science paper, the researchers identified a statistical formula that describes how the neural networks are learning features. 

Alternative neural network architectures: The researchers went on to show that inserting this formula into computing systems that do not rely on neural networks allowed these systems to learn faster and more efficiently.  

“How do I ignore what’s not necessary? Humans are good at this,” said Belkin. “Machines are doing the same thing. Large Language Models, for example, are implementing this ‘selective paying attention’ and we haven’t known how they do it. In our Science paper, we present a mechanism explaining at least some of how the neural nets are ‘selectively paying attention.’” 

Study funders included the National Science Foundation and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning. Belkin is part of NSF-funded and UC San Diego-led The Institute for Learning-enabled Optimization at Scale, or TILOS. 

Paper title: Mechanism for feature learning in neural networks and backpropagation-free machine learning models

Adit Radhakrishnan, Harvard School of Engineering and Applied Sciences and Broad Institute of MIT and Harvard

Daniel Beaglehole and Mikhail Belkin, University of California San Diego

Parthe Pandit, IIT Bombay (Pandit did the work for this paper as a postdoctoral researcher at the UC San Diego Halicioglu Data Science Institute)

This story was adapted from a piece in UC San Diego Today.

March 4, 2024
Student Spotlight: Victory Yinka-Banjo
2024

This interview is part of a series of short interviews from the Department of EECS, called Student Spotlights. Each Spotlight features a student answering their choice of questions about themselves and life at MIT. Today’s interviewee, Victory Yinka-Banjo, is a junior majoring in 6-7: Computer Science and Molecular Biology. Yinka-Banjo keeps a packed schedule; she is a member of the Office of Minority Education (OME) Laureates & Leaders program; a 2024 fellow in the public service-oriented BCAP program; has previously served as Secretary of the African Students’ Association and is now undergraduate president of the MIT Biotech Group; additionally, she is working on a cardiometabolic disease and deep learning project at the Broad Institute as an Eric and Wendy Schmidt Center Funded SuperUROP Scholar; a member of the Ginkgo Bioworks’ Cultivate Fellowship (a program that supports students interested in synthetic biology/biotech); and an ambassador for Leadership Brainery, which equips juniors/leaders of color with the resources needed to prepare for graduate school. Nevertheless, she found time to share a peek into her MIT experience with readers.

What’s your favorite building or room within MIT, and what’s special about it to you?

It has to be the Broad Institute of MIT & Harvard on Ames Street in Kendall Square, where I do my SuperUROP research in Caroline Uhler’s lab. Outside of classes, you’re 90% likely to find me on the newest mezzanine floor (between the 11th and 12th floor), in one of the UROP rooms I share with two other undergrads in the lab. We have standing desks, an amazing coffee/hot chocolate machine, external personal monitors, comfortable sofas – everything really! Not only is it my favorite building, it is also my favorite study spot on campus. In fact, I am there so often that when friends recently planned a birthday surprise for me, they told me they were considering having it at the Broad, since they could count on me being there.

I think the most beautiful thing about this building, apart from the beautiful view of Cambridge we get from being on one of the highest floors, is that when I was applying to MIT from high school, I had fantasized working at the Broad because of the ground-breaking research. To think that it is now a reality makes me appreciate every minute I spend on my floor, whether I am doing actual research or some last-minute studying for a midterm.

Tell me about one interest or hobby you’ve discovered since you came to MIT. (It doesn’t have to be academic!)

I have become pretty involved in the performing arts since I got to MIT! I have acted in two plays run by the Black Theater Guild, which was revived during my freshman year by one of my friends. I played a supporting role in the first play called Nkrumah’s Last Day, which was about Ghana at a time of governance under Kwame Nkrumah (its first president). In the second play, a ghost story/comedy called Shooting the Sheriff, I played one of the lead roles. Both caused me to step way out of my comfort zone and I loved the experiences because of that. I also got to act with some of my close friends who were first-time stage actors as well, so that made it even more fun.

Outside of acting, I also do spoken word/poetry. I have performed at events like the African Students’ Association Cultural Night, MIT Africa Innovate Conference and Black Women’s Alliance Banquet. I try to use my pieces to share my experiences both within and beyond MIT, offering the perspective of an international Nigerian student. My favorite piece was called Code Switch, and I used concepts from CS & Biology (especially genetic code switching), to draw parallels with linguistic code-switching, and emphasize the beauty and originality of authenticity. This semester, I’m also a part of MIT Monologues and will be performing a piece called Inheritance, about the beauty of self-love found in affection transferred from a mother.

Are you a re-reader or a re-watcher—and if so, what are your comfort books, shows, or movies?

I don’t watch too many movies, although I used to be obsessed with all parts of High School Musical; and the only book I’ve ever reread is Americanah. I would actually say I am a re-podcaster! My go-to comfort-podcast is this episode, “A Breakthrough Unfolds”, by Google DeepMind. It makes me a little emotional every time I listen. It is such an exemplification of the power of science and its ability to break boundaries that humans formerly thought impossible. As a Computer Science & Biology major, I am particularly interested in these two disciplines’ applications to relevant problems, like the protein-folding problem discussed in the episode, which DeepMind’s solution for has caused massive advances in the biotech industry. It makes me so hopeful for the future of biology, and the ways in which computation can advance human health and precision medicine.

Who’s your favorite artist? (Using the term very broadly; any form of art can qualify!)

When I think of the word ‘artist’, I think of music artists first. There are so many who I love; my favorites also evolve over time. I’m Christian, so I listen to a lot of gospel music. I’m also Nigerian so I listen to a lot of afrobeats. Since last summer, I’ve been obsessed with Limoblaze, who fuses both gospel and afrobeats music! KB, a super talented gospel rapper, is also somewhat tied in ranking with Limo for me right now. His songs are probably ~50% of my workout playlist.

It’s time to get on the shuttle to the first Mars colony, and you can only bring one personal item. What are you going to bring along with you?

Oooh, this is a tough one, but it has to be my brass rat. Ever since I got mine at the end of sophomore year, it’s been nearly impossible for me to take it off. If there’s ever a time I forget to wear it, my finger feels off for the entire day.

Tell me about one conversation that changed the trajectory of your life.

Two specific career-defining moments come to mind. They aren’t quite conversations, but they are talks/lectures that I was deeply inspired by. The first was towards the end of high school when I watched this TEDx Talk about storing data in DNA. At the time, I was getting ready to apply to colleges and I knew that biology and computer science were two things I really liked, but I didn’t really understand the possibilities that could be birthed from them coming together as an interdisciplinary field. The TEDx talk was my eureka moment for computational biology.

The second moment was in my junior Fall during an introductory lecture to “Lab Fundamentals for Bioengineering” by Professor Jacquin Niles. I started the school year with a lot of confusion about my future post-grad, and the relevance of my planned career path to the communities that I care about. Basically, I was unsure about how Computational Biology fit into the context of Nigeria’s problems, especially because my interest in the field is oriented towards molecular biology/medicine, not necessarily public health.

In the US, most research focuses on diseases like cancer and Alzheimer’s, which, while important, are not the most pressing health conditions in tropical regions like Nigeria. When Prof Niles told us about his lab’s dedication to malaria research from a molecular biology standpoint, it was yet another eureka moment. Like yes! Computation and molecular biology can indeed mitigate diseases that affect developing nations like Nigeria–diseases that are understudied, and whose research is underfunded.

Since his talk, I found a renewed sense of purpose. Grad school isn’t the end goal. Using my skills to shine a light on the issues affecting my people that deserve far more attention is the goal. I’m so excited to see how I will use Computational Biology to possibly create the next cure to a commonly neglected tropical disease, or accelerate the diagnosis of one. Whatever it may be, I know that it will be close to home, eventually 🙂

What are you looking forward to about life after graduation? What do you think you’ll miss about MIT?

Thinking about graduating actually makes me sad. I’ve grown to love MIT. The biggest thing I’ll miss, though, is Independent Activities Period (IAP). It is such a unique part of the MIT experience. I’ve done a web development class/competition, research, a data science challenge, a molecular bio crash course, and a deep learning crash course over the past 3 IAPs. It is SUCH an amazing time to try something low stakes, forget about grades, explore Boston, build a robot, travel abroad, do less, go slower, really rejuvenate before the Spring, and embrace MIT’s motto of “mind and hand” by just being creative and explorative. It is such an exemplification of what it means to go here, and I can’t imagine it being the same anywhere else.

That said, I look forward to graduating so I can do more research. My hours spent at the Broad thinking about my UROP are always the quickest hours of my week. I love the rabbit holes my research allows me to explore, and I hope that I find those over and over again as I apply and hopefully get into PhD programs. I look forward to exploring a new city after I graduate too. I wouldn’t mind staying in Cambridge/Boston. I love it here. But I would welcome a chance to be somewhere new and embrace all the people and unique experiences it has to offer. I also hope to work on more passion projects post-grad. I feel like I have this idea in my head that once I graduate from MIT, I’ll have so much more time on my hands (we’ll see how that goes). I hope that I can use that time to work on education projects in Nigeria, which is a space I care a lot about. Generally, I want to make service more integrated in my lifestyle. I hope that post-graduation, I can prioritize doing that even more: making it a norm to lift others as I continue to climb.

Adapted from a profile posted on MIT Electrical Engineering and Computer Science Department's site.

February 5, 2024
Data science challenge reveals new research directions for cancer immunotherapy
2024

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is delighted to announce the completion of its Cancer Immunotherapy Data Science Grand Challenge.

Participants in the challenge developed algorithms to uncover new ways to modify, or “perturb,” T cells to make them more effective at killing cancer cells. Scientists in the Hacohen Lab at the Broad then tested their predictions in mouse models, making this the first challenge that the Schmidt Center knows of in which new experiments were performed based on the output of machine-learning models developed in the challenge.  

While it’s too early to say whether any of the proposed perturbations could prove useful for cancer treatment, the researchers plan to further study some of the identified perturbations and the algorithms that gave rise to them.

The Schmidt Center partnered with Harvard’s Laboratory for Innovation Science (LISH), the MIT Department of Electrical Engineering and Computer Science, Topcoder, Gordian Biotechnology, and Saturn Cloud to run the challenge. More than 1,000 people from around the world registered for the competition.

“We are thrilled that our first data science challenge attracted so many participants, including various machine-learning experts who had not previously worked on biological problems,” said Caroline Uhler, director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT.

Karim Lakhani, founder and co-director of LISH and a professor of business administration at the Harvard Business School, said: “At LISH, we believe that data science challenges can help organizations harness the power of the crowd to answer pressing questions in biology and other fields. We hope this challenge will serve as a case study in how machine-learning experts can collaborate with biologists to improve experimental design.”

Boosting cancer research with machine learning

Cancer immunotherapy seeks to harness the body’s immune system, and most often T cells, to recognize and kill cancer cells while leaving healthy cells alone. In the last decade, there have been many breakthroughs in cancer immunotherapy, yet treatments still only work for some cancer patients some of the time.

“We’re hopeful that challenges like this can help us home in on T-cell-perturbations that could ultimately lead to new therapeutics — and make cancer immunotherapy work for more patients,” said Nir Hacohen, an institute member at Broad, director of the Broad Institute’s Cell Circuits Program, and director of the Center for Cancer Immunology at Mass General Brigham.

Marc Schwartz, a postdoctoral fellow in the Hacohen Lab, previously ran experiments testing the effects of 73 gene knockouts in T cells in mouse models. Because researchers can’t scale mouse model experiments beyond 100 or so genes at a time, it’s not feasible to test out every gene in a particular disease pathway, explained Schwartz.

“That’s why we were excited about the idea of testing a limited number of genes that we think are important and then training an algorithm to learn something that we can't see from that data on our own,” he added.

The overarching data science challenge was divided into three parts that ran as individual data science competitions on Topcoder. In Challenge 1, participants received gene expression data from 66 of the 73 T-cell gene knockouts from Schwartz’s experiments as training data. They then developed an algorithm that could predict how knocking out the seven “held-out” genes would affect T cells.

Challenge 2 participants used their algorithms from the first challenge to propose new gene knockouts (picking from any of the 20,000 genes in the entire genome) to shift as many T cells as possible into a cancer-fighting state. In Challenge 3, participants proposed a metric for ranking how well a particular gene knockout would bring about this desired shift in T cells.

To make the challenge accessible to participants without a biology background, Orr Ashenberg, associate director of computational biology at the Klarman Cell Observatory of the Broad Institute, produced a 1.5-hour crash course on cancer biology, genetic perturbations, and single-cell sequencing technologies.

Orr Ashenberg, associate director of computational biology at the Broad's Klarman Cell Observatory, delivers a lecture on single-cell sequencing technologies.

The Schmidt Center announced the winners of Challenges 1 and 3 last March. The researchers then ran the top-scoring algorithms from Challenge 1 to predict which genes to knock out to mimic two kinds of cancer immunotherapy: CAR T-cell therapy and checkpoint blockade therapy. Next, Schwartz conducted experiments to see how well the proposed gene knockouts performed in a mouse model. To determine the Challenge 2 winners, Schmidt Center research fellow Jiaqi Zhang, who was instrumental in developing the challenge, calculated how well each participant’s algorithm from Challenge 1 predicted the effects of those roughly 60 gene knockouts.

The winners of Challenge 2 — the final part of the competition — are:

-First place: Brody Langille, Jordan Trajkovski, and Elizabeth Hudson

-Second place: mglettig (username)*

-Third place: Ai Vu Hong, researcher at Genethon, France

-Fourth place: Saket Kunwar, independent researcher, Nepal

-Fifth place: lxastro0 (username)*

-Sixth place: John Gardner, freelance data scientist

-Seventh place: agilsoft (username)*

-Eighth place: Basak Eraslan, postdoctoral researcher holding a joint position at the Regev Lab at Genentech and the Kundaje Lab at Stanford University

-Ninth place: Haoyue Dai, Kun Zhang, Ignavier Ng, Yujia Zheng, Xinshuai Dong, and Yewen Fan from Carnegie Mellon University; Petar Stojanov, postdoctoral fellow at the Eric and Wendy Schmidt Center; Gongxu Luo, Mohamed bin Zayed University of Artificial Intelligence; and Biwei Huang, University of California, San Diego

-Ninth place: Liu Xindi, freelance programmer

-Ninth place: Johnson Zhou, Camille Sayoc, and Yi-Cheng Peng, Master’s students of the Faculty of Engineering and IT at the University of Melbourne, Victoria, Australia

The winning teams approached the problem using different deep-learning methods depending on the chosen input features. These features include gene expression and “chromatin accessibility,” the degree to which genetic information encoded in DNA can be accessed and read, measured by ATAC-seq peak counts. Additionally, some of the top-scoring teams incorporated learned representations from variational autoencoders — models that can capture meaningful features from raw data — or graph neural networks constructed based on the gene ontology database.

"We are grateful for the opportunity to participate in this challenge and are excited by the results,” said the first-place team in a prepared statement. “It's not often that you get invited to work on an important problem alongside preeminent scientists who furnish the problem description and data that you need to develop a novel solution — a novel solution that those same scientists can then turn around and validate in their lab.”

Martin Borch Jensen, chief scientific officer of Gordian Biotechnology, said: "Technological advances in sequencing have led to a vast amount of genomics data. As we pile up more and more transcriptomes from every type of cell in the human body, it becomes increasingly valuable to develop ways to understand how gene expression can cause and predict health and disease. I'm very excited for this competition to catalyze more work on this problem.”

Now, researchers at the Schmidt Center will further study the top-scoring algorithms to see if they can combine components from each into an even better predictive tool. The center plans to hold its second data science challenge later this year.

*Editor's note: Usernames were used instead of participant names in cases where the Schmidt Center could not get in touch with winners.

Cells
January 16, 2024
Building a two-way street between cell biology and machine learning
2024

In a Comment for Nature Cell Biology, the Eric and Wendy Schmidt Center's director Caroline Uhler discusses how the rise of large-scale datasets in biology positions the field to become a driver of foundational advances in machine learning — and vice versa. Uhler, who is also a full professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT, advocates for new machine learning models that can better integrate different types of biological data and can uncover causal mechanisms in disease, not just associations. She also discusses the need for close collaborations between biologists and computational scientists so that predictive and causal algorithms can be incorporated into experimental design — and outlines some of the challenges, such as distinct cultures and vocabularies, of building those teams.

January 10, 2024
Researchers identify new regulators of cellular aging
2024

As we age, the risk for a wide range of diseases, including cancer and neurodegenerative conditions, increases. But while aging has been extensively studied, scientists don’t have a clear picture of the molecular changes that take place as we get older.

Now, researchers at the Broad Institute of MIT and Harvard and ETH Zürich in Switzerland have found key gene-expression regulators related to cellular aging that are tightly coupled to structural alterations of chromatin — the DNA-protein complex that forms chromosomes. The findings, published last month in Aging Cell, offer new insights into the biology of cellular aging. The research may also provide potential targets for aging reversal.

The study stems from a long-term collaboration between the laboratory of GV Shivashankar at ETH Zürich on the biological side and Caroline Uhler at the Broad Institute on the computational side.

“The explosion of biomedical data presents an exciting opportunity to develop novel machine learning methods to help answer important biological questions,” said study co-senior author Uhler, the director of the Eric and Wendy Schmidt Center at the Broad and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT. “In this work, the availability of large-scale sequencing data from many individuals in different age groups motivated us to develop methods to identify drivers of cellular aging,” she added.

Shivashankar’s lab has long been interested in understanding the relationship between a cell’s microenvironment, the three-dimensional structure of the genome, and gene expression in health and disease. Depending on how DNA is packed inside a cell’s nucleus, it may alter the expression of specific genes, which could in turn result in certain diseases, explained co-senior author Shivashankar, professor of Mechano-Genomics at ETH Zürich and head of the Laboratory of Nanoscale Biology at the Paul Scherrer Institute in Switzerland. “We’re very excited about understanding what may lead to healthy aging as opposed to cancer or neurodegeneration,” he added.

The study also aligns with the Eric and Wendy Schmidt Center’s goal of developing computational approaches for challenging biomedical questions. To this end, the Schmidt Center trains talented undergraduate, master’s, and PhD students as well as postdoctoral fellows with computational backgrounds on how to work with experimental biologists.

“As a graduate student in statistics, working closely with a biological lab allows me to gain a much deeper understanding of the kinds of questions and data that are most interesting to biologists,” said study co-first author Louis Cammarata, a PhD student at Harvard University and the Eric and Wendy Schmidt Center. “I’m able to design more useful computational methods because of this constant communication.”

Drivers of aging

In the nucleus of a cell, DNA coils around proteins to form chromatin. Other proteins bind along chromatin, creating complex three-dimensional structures that leave some genes accessible to transcription and others closed off.

Clockwise from top right: Caroline Uhler, GV Shivashankar (image credit: Paul Scherrer Institute), Louis Cammarata, and Jana Braunger

Uhler, Shivashankar, and their teams analyzed gene expression data from skin cells of 133 individuals aged 1 to 96 years, who were divided into five age groups. The difference in gene expression was particularly prominent when comparing the two oldest groups, which included people aged 61 to 85 years and those aged 86 to 96 years. Differentially expressed genes tended to be involved in biological processes such as immune response and cell proliferation, which play important roles in aging.

Next, the researchers used statistical algorithms to combine these data with information from a database that lists protein-protein interactions. The analysis revealed key age-associated regulators of gene expression, which include transcription factors — proteins that control how other genes are expressed.

“Transcription factors may be post-translationally activated or they may benefit from changes in chromatin organization to activate their target genes at a later time point,” said study co-first author Jana Braunger, a former master’s student at the Eric and Wendy Schmidt Center and current PhD student at the University of Heidelberg.

Gene expression hubs

To analyze the coupling between chromatin organization and changes in gene expression, the researchers used an experimental method called Hi-C, which provides a proximity map of the DNA packing.

Comparing Hi-C data from old and young skin cells revealed that the structure of chromatin changes over time, either drawing apart genes that were close together or bringing together genes that were far apart in young cells.

In the cell’s nucleus, nearby genes are often expressed as a group, Cammarata explained. “There are specific hotspots where different chromosomes come together, along with other molecules that are useful for transcription, and within those hubs, you have active transcription and co-regulation of genes,” he said. “In aging, changes in how DNA is folded influence these hotspots of transcription.”

Mitigating aging

Although more work is needed to determine whether alterations in chromatin structure drive changes in gene expression or vice versa, some of the gene-expression regulators identified in this study could serve as potential targets to mitigate, prevent, or even reverse cellular aging. “Identifying the key transcriptional drivers of cellular aging is crucial to develop interventions for cellular reprogramming and rejuvenation,” Shivashankar said.

Uhler noted that the study is an example of how computational researchers can develop new methods to help answer important biological questions, a core mission of the Eric and Wendy Schmidt Center. “We place great importance on training the next generation of scientists: researchers who are strong on the computational side and understand the biological questions,” she said. “Merging computational science and biology can help us tackle some of medicine’s biggest challenges.”

Cells
December 14, 2023
A new method for genomics analysis doesn’t require reference data
2023

In 2003, scientists finished sequencing almost all of the three billion nucleotide base pairs that make up the human genome. This feat led to an explosion in genomics analysis, which to this day relies on aligning sequencing data to a “reference genome” (a composite made up of DNA samples from different individuals in the same species) for humans and other species.

Now, researchers at Stanford University and the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have developed a genomics analysis framework, SPLASH, that directly analyzes raw sequencing samples, eliminating the need for reference data. The method can perform genomic analyses more quickly and with less computing power than traditional methods. SPLASH should prove especially useful for analyzing genomes of understudied or rapidly mutating species.

In a study published earlier this month in Cell, the team showed that the framework can detect different strains of SARS-CoV-2 and find sequence diversity in adaptive immune receptors, among other findings. Kaitlin Chuang and Tavor Baharav, former PhD students at Stanford University, were co-first authors on the paper, and Julia Salzman, associate professor of biomedical science and biochemistry at Stanford, was the senior author. All research was performed in Salzman's group, whose lab combines statistics and genomics.

“A lot of sequencing analysis is done with implicit priors, meaning that your pipeline is only going to identify the one feature that it was designed to find,” said Baharav, who is now an Eric and Wendy Schmidt Center postdoctoral fellow. “With SPLASH, we’ve developed a method for unbiased, reference-free hypothesis generation.”

From alignment- to statistics-first

While genomics has revolutionized both medicine and ecology, its dependence on reference genomes has its limitations. For example, only 5% of mammalian species have had their genomes sequenced, a percentage that drops even further for organisms like bacteria and viruses. Additionally, because the human reference genome only contains samples from a handful of individuals, it does not reflect global genomic diversity.

Eric and Wendy Schmidt Center postdoctoral fellow Tavor Baharav

Also, traditional genomics analysis aligns samples with references before comparing the samples to each other, discarding outliers. “When you're trying to detect an interesting, novel event, it almost by definition isn't going to align well to the reference,” said Baharav.

To address these and other limitations, researchers in the Salzman Lab at Stanford University came up with a way to analyze raw sequencing data without having to first align it to a reference genome. 

Their framework, SPLASH, identifies unchanging "anchor" subsequences in the raw sequencing data that are followed by "target" sequences that vary by sample. SPLASH, which stands for “Statistically Primary aLignment Agnostic Sequence Homing,” uses a new statistical test to determine which stretches of RNA reads exhibit the most variation.
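The anchor-and-target idea can be illustrated with a toy sketch. The code below is not SPLASH itself and does not implement its statistical test: it only counts, for each fixed-length anchor, which target follows it in each sample, and then flags anchors whose most common target differs between samples. The function names, the k-mer length `k`, and the `gap` parameter are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def anchor_target_counts(samples, k=4, gap=0):
    """Count, for every anchor k-mer, how often each following target k-mer
    appears in each sample. `samples` maps a sample name to a list of reads."""
    table = defaultdict(lambda: defaultdict(Counter))
    for sample_name, reads in samples.items():
        for read in reads:
            # Slide a window over the read; the anchor is followed (after an
            # optional gap) by a target of the same length.
            for i in range(len(read) - 2 * k - gap + 1):
                anchor = read[i:i + k]
                target = read[i + k + gap:i + 2 * k + gap]
                table[anchor][sample_name][target] += 1
    return table


def varies_by_sample(per_sample):
    """Crude stand-in for a statistical test: True when the samples disagree
    on the most common target seen after this anchor."""
    top_targets = {counts.most_common(1)[0][0] for counts in per_sample.values()}
    return len(top_targets) > 1


# Two made-up samples: the anchor ACGT is followed by different targets,
# while the anchor GGGG is followed by the same target in both.
samples = {
    "sample_A": ["ACGTAAAA", "GGGGTTTT"],
    "sample_B": ["ACGTCCCC", "GGGGTTTT"],
}
table = anchor_target_counts(samples, k=4)
flagged = [anchor for anchor in table if varies_by_sample(table[anchor])]
```

A real analysis would replace `varies_by_sample` with a test that yields p-values and effect sizes, as the article describes, and would work on far longer reads and many more samples.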

"This work illustrates how interdisciplinary teams with diverse perspectives and skill sets are powerful and needed for scientific progress,” said Salzman. “Initially, the team questioned why such a straightforward approach hadn't been implemented before, but we gradually came to realize that rethinking conventions can sometimes yield simple solutions that could work better than ingrained approaches.”

Unlike traditional methods, which can only detect certain types of genetic variations, the framework can detect a wide variety of variations. SPLASH is also much more computationally efficient than those methods. An updated version of the framework can complete the entire analysis in an hour while using much less computing power than alignment-first approaches. 

Detecting viral mutations + microalgae growing on eelgrass

To test the effectiveness of SPLASH, the team used it to perform a range of genomic analyses. In one, they compared nasal swab samples from patients taken at different periods during the COVID-19 pandemic, when different viral strains were dominant. SPLASH was able to identify which anchors had “low p-values” and high effect sizes — indicators of viral mutations. They then mapped these reads to control samples from different COVID strains, determining that almost all of the anchors that SPLASH homed in on were indeed strain-defining mutations.

Eelgrass provides foraging areas and shelter for fish. Adam Obaza/NOAA.

Given that very few species have reference genomes, the team also tested how well SPLASH can detect variations between samples from two species (eelgrass and octopus) with limited reference data available. They compared RNA from eelgrass, a common seagrass found in the Mediterranean and Norway, finding that almost 6% of targets did not align to eelgrass references. In particular, they noticed that the target sequences for one anchor varied by location and season.

The team theorized that these discrepancies could indicate the presence of different species of diatoms, microalgae that grow on other plants, as the anchor was less abundant in samples taken at night, when diatoms reduce expression of this particular type of gene.

“On its own, SPLASH does not provide immediately interpretable results, but it points researchers to interesting questions that they can investigate further,” said Baharav.

Next steps

Baharav, who completed his PhD in electrical engineering at Stanford earlier this year, is now applying his computational background to cancer research. As white blood cells develop, they shuffle around parts of their genome through a process called “V(D)J recombination.” This genetic reshuffling allows them to produce a huge array of antibodies and T-cell receptors, which they use to recognize and kill millions of microbes. 

Cancer researchers like Baharav’s mentor, Rafael Irizarry, chair of the Department of Data Science at Dana-Farber Cancer Institute, want to better understand how V(D)J recombination works to design cancer vaccines. As a Schmidt Center fellow, Baharav is developing a reference-free way to analyze these adaptive immune receptors. 

“SPLASH provides an exciting new statistical and computational framework for genomic analysis. I'm looking forward to building on this work to expand the scope of reference-free analysis, allowing researchers to perform unbiased inference on their data,” said Baharav. “As discussed in SPLASH, reference-based methods fall short in analyzing highly diverse genomic regions such as T cell receptors, which I'm looking to change.”

Cells
October 24, 2023
Maria Skoularidou receives Blackwell-Rosenbluth Award
2023

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is excited to announce that postdoctoral fellow Maria Skoularidou was awarded the 2023 Blackwell-Rosenbluth Award earlier this month. The Blackwell-Rosenbluth Award is granted to outstanding young researchers in the field of Bayesian statistics. 

Skoularidou joined the Eric and Wendy Schmidt Center in September 2023. She is co-advised by Nikos Daskalakis, director of the Neurogenomics and Translational Bioinformatics Laboratory at McLean Hospital and an associate professor of psychiatry at Harvard Medical School, and Costis Daskalakis, a professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the MIT Computer Science and Artificial Intelligence Laboratory. Her research focuses on developing scalable and efficient computational methods to detect epigenetic effects in diverse trauma and PTSD contexts by drawing on information from various datasets.

Skoularidou holds a PhD in biostatistics from the University of Cambridge, where she was advised by Sylvia Richardson. Skoularidou has a four-year degree in informatics and a Master of Science in statistical science from the Athens University of Economics and Business. She founded (Dis)Ability in AI, a group that supports and advocates for disabled people’s needs at machine learning conferences and other venues, and is on the editorial board of ACM Transactions on Probabilistic Machine Learning.

“Maria has already made impressive contributions to the field of Bayesian inference as well as generative modeling and its applications to biomedical data,” said Caroline Uhler, director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT. “We’re excited to see what she’ll continue to accomplish as a Schmidt Center fellow.”

People
October 2, 2023
A more effective experimental design for engineering a cell into a new state
2023

A strategy for cellular reprogramming involves using targeted genetic interventions to engineer a cell into a new state. The technique holds great promise in immunotherapy, for instance, where researchers could reprogram a patient’s T-cells so they are more potent cancer killers. Someday, the approach could also help identify life-saving cancer treatments or regenerative therapies that repair disease-ravaged organs.

But the human body has about 20,000 genes, and a genetic perturbation could be on a combination of genes or on any of the over 1,000 transcription factors that regulate the genes. Because the search space is vast and genetic experiments are costly, scientists often struggle to find the ideal perturbation for their particular application.  

Researchers from MIT and Harvard University developed a new, computational approach that can efficiently identify optimal genetic perturbations based on a much smaller number of experiments than traditional methods.

Their algorithmic technique leverages the cause-and-effect relationship between factors in a complex system, such as genome regulation, to prioritize the best intervention in each round of sequential experiments.

The researchers conducted a rigorous theoretical analysis to determine that their technique did, indeed, identify optimal interventions. With that theoretical framework in place, they applied the algorithms to real biological data designed to mimic a cellular reprogramming experiment. Their algorithms were more efficient and effective than the baseline methods they were compared against.

“Too often, large-scale experiments are designed empirically. A careful causal framework for sequential experimentation may allow identifying optimal interventions with fewer trials, thereby reducing experimental costs,” says co-senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS) who is also the director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS) and Institute for Data, Systems and Society (IDSS).

Joining Uhler on the paper, which appears today in Nature Machine Intelligence, are lead author Jiaqi Zhang, a graduate student and Eric and Wendy Schmidt Center Fellow; co-senior author Themistoklis P. Sapsis, professor of mechanical and ocean engineering at MIT and a member of IDSS; and others at Harvard and MIT.

Active learning

When scientists try to design an effective intervention for a complex system, like in cellular reprogramming, they often perform experiments sequentially. Such settings are ideally suited for the use of a machine-learning approach called active learning. Data samples are collected and used to learn a model of the system that incorporates the knowledge gathered so far. From this model, an acquisition function is designed — an equation that evaluates all potential interventions and picks the best one to test in the next trial.

This process is repeated until an optimal intervention is identified (or resources to fund subsequent experiments run out).
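The loop described above can be sketched in a few lines. This is a generic toy, not the researchers' algorithm: the "model" is just a dictionary of outcomes observed so far, and `nearest_neighbor_acquisition` is a made-up acquisition function that scores an untested intervention by the outcome of its nearest tested neighbor plus a bonus for being far from anything tested. All names here (`active_learning`, `run_experiment`, `budget`) are hypothetical.

```python
def active_learning(candidates, run_experiment, acquisition, budget):
    """Sequentially pick an intervention, observe its outcome, and repeat
    until the budget runs out or every candidate has been tested."""
    observed = {}  # intervention -> measured outcome (the toy "model")
    for _ in range(budget):
        untested = [c for c in candidates if c not in observed]
        if not untested:
            break
        # The acquisition function scores every untested intervention and
        # the highest-scoring one is tested in the next trial.
        choice = max(untested, key=lambda c: acquisition(c, observed))
        observed[choice] = run_experiment(choice)
    if not observed:
        return None, observed
    best = max(observed, key=observed.get)
    return best, observed


def nearest_neighbor_acquisition(candidate, observed):
    """Toy score: outcome of the closest tested intervention, plus an
    exploration bonus equal to the distance from it."""
    if not observed:
        return 0.0
    nearest = min(observed, key=lambda o: abs(o - candidate))
    return observed[nearest] + abs(nearest - candidate)


def simulated_experiment(x):
    """Stand-in for a costly lab experiment; its optimum is at x = 7."""
    return -(x - 7) ** 2


best, history = active_learning(
    candidates=list(range(10)),
    run_experiment=simulated_experiment,
    acquisition=nearest_neighbor_acquisition,
    budget=10,
)
```

With a budget large enough to test every candidate, the loop necessarily finds the optimum; the point of a good acquisition function is to get there in far fewer trials.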

“While there are several generic acquisition functions to sequentially design experiments, these are not effective for problems of such complexity, leading to very slow convergence,” Sapsis explains.

Acquisition functions typically consider correlation between factors, such as which genes are co-expressed. But focusing only on correlation ignores the regulatory relationships or causal structure of the system. For instance, a genetic intervention can only affect the expression of downstream genes, but a correlation-based approach would not be able to distinguish between genes that are upstream or downstream.

“You can learn some of this causal knowledge from the data and use that to design an intervention more efficiently,” Zhang explains.

The MIT and Harvard researchers leveraged this underlying causal structure for their technique. First, they carefully constructed an algorithm so it can only learn models of the system that account for causal relationships.

Then the researchers designed the acquisition function so it automatically evaluates interventions using information on these causal relationships. They crafted this function so it prioritizes the most informative interventions, meaning those most likely to lead to the optimal intervention in subsequent experiments.

“By considering causal models instead of correlation-based models, we can already rule out certain interventions. Then, whenever you get new data, you can learn a more accurate causal model and thereby further shrink the space of interventions,” Uhler explains.
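One way to picture how a causal model rules out interventions is a small directed graph of regulators and targets. The sketch below assumes the graph is already known and is purely illustrative: the gene names and the helper functions `descendants` and `upstream_candidates` are hypothetical, and the researchers' method is considerably more involved than a reachability check.

```python
def descendants(graph, node):
    """All genes reachable from `node` by following regulator -> target edges."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def upstream_candidates(graph, goal_gene):
    """Keep only genes that sit upstream of the goal gene, since perturbing
    a gene can only change the expression of genes downstream of it."""
    return sorted(g for g in graph if goal_gene in descendants(graph, g))


# Hypothetical regulatory graph: D regulates A, A regulates B, B regulates C.
# E is connected to nothing, so perturbing E cannot change C.
regulatory_graph = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
    "D": ["A"],
    "E": [],
}
candidates = upstream_candidates(regulatory_graph, "C")  # ["A", "B", "D"]
```

A purely correlation-based view would have no principled reason to drop a gene like E from the search, whereas the graph removes it before any experiment is run.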

This smaller search space, coupled with the acquisition function’s special focus on the most informative interventions, is what makes their approach so efficient.

The researchers further improved their acquisition function using a technique known as output weighting, inspired by the study of extreme events in complex systems. This method carefully emphasizes interventions that are likely to be closer to the optimal intervention.

“Essentially, we view an optimal intervention as an ‘extreme event’ within the space of all possible, suboptimal interventions and use some of the ideas we have developed for these problems,” Sapsis says.    

Enhanced efficiency

They tested their algorithms using real biological data in a simulated cellular reprogramming experiment. For this test, they sought a genetic perturbation that would result in a desired shift in average gene expression. Their acquisition functions consistently identified better interventions than baseline methods through every step in the multi-stage experiment.

“If you cut the experiment off at any stage, ours would still be more efficient than the baselines. This means you could run fewer experiments and get the same or better results,” Zhang says.

The researchers are currently working with experimentalists to apply their technique toward cellular reprogramming in the lab.

Their approach could also be applied to problems outside genomics, such as identifying optimal prices for consumer products or enabling optimal feedback control in fluid mechanics applications.

In the future, they plan to enhance their technique for optimizations beyond those that seek to match a desired mean. In addition, their method assumes that scientists already understand the causal relationships in their system, but future work could explore how to use AI to learn that information, as well.

This work was funded, in part, by the Office of Naval Research, the MIT-IBM Watson AI Lab, the MIT J-Clinic for Machine Learning and Health, the Eric and Wendy Schmidt Center at the Broad Institute, a Simons Investigator Award, the Air Force Office of Scientific Research, and a National Science Foundation Graduate Fellowship.

Adapted from a news story posted on the MIT News website.

Cells
Active Learning
September 10, 2023
New machine learning techniques boost predictions for virtual drug screening with less data
2023

Scientists using machine learning tools to analyze biomedical data often turn to neural network algorithms, but before these models became popular, a simpler class of machine learning algorithms called kernel methods was commonly used. Kernel methods work by first applying straightforward operations to transform data and then training a simple model on the transformed data.
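As a rough illustration of that two-step recipe, here is a minimal kernel perceptron in plain Python. It is a textbook example chosen for brevity, not the method from the paper: a radial basis function kernel plays the role of the data transformation, and a perceptron is the simple model trained on top of it. The function names and the `gamma` and `epochs` values are assumptions made for this sketch.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Similarity between two points; implicitly transforms the data."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def train_kernel_perceptron(points, labels, epochs=10):
    """Learn one weight per training point by counting its mistakes."""
    alphas = [0] * len(points)
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(points, labels)):
            if predict(x, points, labels, alphas) != y:
                alphas[i] += 1
    return alphas


def predict(x, points, labels, alphas):
    """Sign of the kernel-weighted vote of the training points."""
    score = sum(a * y * rbf_kernel(p, x) for a, y, p in zip(alphas, labels, points))
    return 1 if score > 0 else -1


# XOR-style data: no straight line separates the two classes, but the
# simple model succeeds once the kernel has transformed the inputs.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, 1, 1, -1]
alphas = train_kernel_perceptron(points, labels)
predictions = [predict(p, points, labels, alphas) for p in points]
```

On this tiny dataset the perceptron stops making mistakes after a few passes, so `predictions` matches `labels`.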

Now, in a paper recently published in Nature Communications, researchers at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have developed a new way of using kernel methods that could make them more useful for a wider range of applications, such as virtual drug screening. They came up with the first “transfer learning” techniques for kernel methods that can be successfully applied to large-scale datasets. Transfer learning allows researchers to improve machine learning models by training them on one task in a way that enhances their performance on a second task, without having to spend the time and resources training a new model for each new task. In their paper, the team showed how their transfer learning framework allowed them to predict which drugs might be most effective in certain cancer cell lines where little data is available. They did this by transferring knowledge from cell lines in which many drugs have already been tested.

“Before our paper, there was no transfer learning method for kernel methods that could scale to the large datasets of most interest in the biomedical field and beyond. We’ve shown for the first time that transfer learning using kernels in these settings is possible and I think that is really exciting,” said Caroline Uhler, the senior author on the paper and a Broad core institute member, co-director of the Schmidt Center at Broad, and a professor in the Department of Electrical Engineering and Computer Science as well as the Institute for Data, Systems, and Society at MIT.  

The team’s key innovation was creatively adapting transfer learning methods used in neural network algorithms so that they can be applied to kernel methods. This advance could find uses in other applications.

“Particularly for healthcare and biomedical applications, it's very hard to collect a lot of data for every question of interest. When you have very little data for a certain task but a related task has abundant data, this is exactly a setting where our method is effective,” said Adityanarayanan Radhakrishnan, a co-first author on the study, who worked on it while completing his PhD as an Eric and Wendy Schmidt Center Fellow in Uhler’s lab at Broad and MIT and is currently the George F. Carrier Postdoctoral Fellow at the Harvard School of Engineering and Applied Sciences.

Transferring knowledge

The research team focused on kernel methods because they found in a previous paper that these performed better than typical neural network models on virtual drug screening tasks. But they wanted to make it possible for researchers to quickly reuse their kernel method algorithms to identify drugs for a wide range of cancer types without having to train a new model for each new type of cancer. They realized that transfer learning techniques are necessary for this, but because existing techniques don’t work well for kernel methods, they had to come up with new ones.

They decided to take inspiration from two transfer learning techniques that work well for neural network models, which they called projection and translation. The team adapted them to work with kernel methods and then tested their approach in a virtual drug screen.

The researchers analyzed the performance of their transfer learning algorithms on two massive Broad datasets, one from the Connectivity Map (CMAP) and the other from the Cancer Dependency Map (DepMap). These datasets describe the effects of drugs on cancer cell lines across millions of drug and cell line combinations. The team trained their kernel method algorithms to predict either the genes expressed by a certain cell type after it was treated with a certain drug (using the CMAP dataset), or the proportion of cancer cells that survived after treatment with the same drug (using the DepMap dataset).

The scientists then applied their projection and translation techniques to their model so that it could complete the second task: predicting the effect of the drug on new cancer cell lines that have much less data. The projection technique corrects the model’s predictions on the second task by recognizing when the prediction errors are falling into categories that can be easily corrected to the right category. The translation technique fine-tunes the model by applying a correction term that shifts the model’s predictions so that they are more accurate on the second task.
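
One way to picture the two techniques is the toy sketch below. It is a simplified reading of the description above, not the authors' implementation: the kernel ridge models, the synthetic source and target tasks, and every function name are assumptions. Here translation adds a correction fitted to the source model's errors on the small target dataset, and projection fits a second model that maps the source model's predictions to target labels.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq_dists = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def fit(X, y, reg=1e-3):
    """Kernel ridge regression; returns a prediction function."""
    coef = np.linalg.solve(rbf_kernel(X, X) + reg * np.eye(len(X)), y)
    return lambda X_new: rbf_kernel(X_new, X) @ coef

# Source task with plenty of data (toy stand-in): y = sin(x).
X_src = np.linspace(0.0, 3.0, 40).reshape(-1, 1)
source_model = fit(X_src, np.sin(X_src).ravel())

# Target task with only five labeled points: y = sin(x) + 0.5.
X_tgt = np.linspace(0.0, 3.0, 5).reshape(-1, 1)
y_tgt = np.sin(X_tgt).ravel() + 0.5

# Translation: fit a correction to the source model's errors on the target data.
correction = fit(X_tgt, y_tgt - source_model(X_tgt))

def translated(X_new):
    return source_model(X_new) + correction(X_new)

# Projection: fit a second model that maps source predictions to target labels.
remap = fit(source_model(X_tgt).reshape(-1, 1), y_tgt)

def projected(X_new):
    return remap(source_model(X_new).reshape(-1, 1))

# Compare mean absolute error on held-out target inputs.
X_test = np.linspace(0.2, 2.8, 50).reshape(-1, 1)
y_test = np.sin(X_test).ravel() + 0.5
errors = {
    name: float(np.mean(np.abs(model(X_test) - y_test)))
    for name, model in [("source", source_model),
                        ("translated", translated),
                        ("projected", projected)]
}
```

In both cases the source model itself is left untouched; only a small additional model is fitted on the scarce target data.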

The team found that their transfer learning techniques allowed their original kernel method to be successfully “transferred” to the second task, without needing to be retrained. Compared to a new model trained only on the second task, the transfer learning techniques greatly boosted the accuracy of their model in predicting the effect of drugs for new cancer cell lines. And on a common machine learning task where the team trained their kernel method algorithms to recognize images, their approach surprisingly boosted the accuracy by up to 10 percent.

Moreover, the researchers were also able to pinpoint exactly how much extra data they would need to collect to increase the performance of the model. Uhler said this could be helpful to scientists trying to decide whether it’s worthwhile to collect more data in the lab. “That's really quite exciting because you can ask ‘how much is it worth for me to have a little bit better performance of my model if I know that we’ll need to collect, say, 10 or 20 percent more data?’” said Uhler.

Beyond drug screening

Two additional advantages of kernel methods are that they provide interpretability as well as a quantification of how uncertain the model is on a given prediction. To take advantage of the interpretability aspect, the research team is working on pinning down the features of a drug that lead their model to predict that it will be effective. In addition, the research team hopes that the uncertainty estimates provided by their kernel approach will be helpful in identifying which new drug and cell line combinations should be screened experimentally for a more effective drug discovery pipeline.

They also have plans to expand their framework to other applications, such as screening cancer genes that tumors heavily depend on for survival and might be targeted with new drugs.

The team adds that their transfer learning approach for kernel methods may also open up other, unexpected applications. Because kernel methods make it easy for scientists to mathematically understand what the model is doing, they can investigate what kinds of biomedical questions will be the best fit to study. “It now gives us a more thorough or deeper understanding of transfer learning and where the power comes from, so that we can analyze which tasks it will actually work for,” said Uhler.

Proteins
Representation Learning
August 31, 2023
Schmidt Center, Helmholtz Munich launch AI and machine learning in genomics collaboration
2023

Helmholtz Munich and the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard today announce the launch of a collaboration to bridge a gap in health research with AI and machine learning.

In the past decade, the field of genomics has accelerated to a point where we can now both measure and perturb biological systems at massive, unprecedented scales, holding huge potential for disease treatment. However, the computational tools needed to take advantage of all this data have not kept pace. By leveraging machine learning methods, the partnership between Helmholtz Munich and the Eric and Wendy Schmidt Center seeks to gain valuable insights into important genomics problems while simultaneously advancing the foundations of machine learning through novel research inspired by genomics questions.

Leading this joint initiative are Caroline Uhler, co-director of the Eric and Wendy Schmidt Center at the Broad Institute, and Fabian Theis, head of the Computational Health Center (CHC) at Helmholtz Munich and Director of Helmholtz AI. Both Caroline Uhler and Fabian Theis have backgrounds in machine learning, statistics, data science, biology, and human biology. “This exchange model between the Broad Institute and Helmholtz Munich will merge our expertise on machine learning and genomics to foster innovative ways to address major challenges in biomedical research,” said Fabian Theis.

The collaboration will encompass a range of activities, including the exchange of graduate students, postdoctoral fellows, and other research staff between the two research centers. These individuals will undertake short research stays, enabling them to benefit from the expertise and resources available at both centers. In addition, the research centers will co-organize workshops and conferences to facilitate knowledge exchange and foster collaboration in the field of AI and genomics.

“Despite an explosion in biological data, the technology sector remains the key driver of machine learning advances today,” said Caroline Uhler. “Both Helmholtz Munich and the Broad Institute are seeking to change that by developing foundations of machine learning that are geared specifically to biological problems, and we’re excited for this collaboration to amplify our efforts.”

July 27, 2023
Making machine learning models make sense
2023

Gemma Moran will never forget how magical it felt to run her very first statistical models on genomics data during her undergraduate summer research project at the University of Sydney. Moran had initially planned to major in pure mathematics but veered away from that path towards a career in applied research after taking a few statistics courses. “I came to realize that I was much more interested in being able to apply math to real world applications and data,” she said. 

Now, as a postdoctoral fellow with the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, Moran’s interest in using statistical models to uncover biological patterns that could improve health care has only grown stronger. These days, Moran, who is based at Columbia University’s Data Science Institute, is working to combine the rigorous and intuitive nature of the simple statistical models she first learned about in undergrad with the flexibility and power of today’s modern machine learning algorithms. In September, Moran will launch her own research group to pursue this direction as an assistant professor of statistics at Rutgers University.

Those who work with her are confident that her research has been and will continue to be impactful. “Gemma is a clear thinker, a careful scientist, and a fantastic collaborator to work with and learn from,” said David Blei, Moran’s postdoctoral adviser and a professor of statistics and computer science at Columbia University. “What her algorithms discover is information that we can use to help make better scientific and medical predictions, and use to help further our understanding of biology and genetics.”

As part of a project with Anthony Philippakis, co-director of the Eric and Wendy Schmidt Center and chief data officer of the Broad Institute, Moran has been using a type of machine learning algorithm called a variational autoencoder (VAE) to reveal important connections between disease symptoms that doctors may be missing. Though it’s still in its early stages, this work has the potential to affect clinical care if these algorithms discover new ways to cluster symptoms into one disease versus another, a challenging task that has long relied on doctors’ observations alone.

Revealing New Relationships

As a graduate student at the University of Pennsylvania, Moran worked on designing a method to uncover genes that are most relevant in different subtypes of breast cancer. She also developed new theoretical techniques to estimate the uncertainty present in models. During her postdoctoral fellowship in Blei’s lab, Moran developed a new method that allows researchers to better interpret the results that variational autoencoder algorithms spit out. These algorithms are masterful at paring down massive datasets into tiny summaries that contain only the most important aspects of the bigger dataset. The problem, Moran explains, is that it’s very challenging for researchers to understand exactly what parts of the original dataset are captured in the small summaries.

Moran working in her office in Columbia's Data Science Institute

To illustrate the challenge and her new fix, Moran gives the example of a large dataset filled with hundreds and hundreds of movie ratings. To create a meaningful summary with fewer data points, the variational autoencoder algorithm might divide these ratings into categories like horror, comedy, action, and science fiction. While it learns, the algorithm creates connections between the movie titles in the original dataset and its new summary output. But if left to its own devices, the algorithm will create thousands of connections that will be difficult to interpret. 

Importantly, by pruning down these connections at certain places in the network until they become sparse, Moran’s new method, named “sparse VAE,” makes it much easier to see what parts of the original data are directly linked to the smaller summary. For example, she could trace back the new “anchor points” to find that the movie “Alien” is only represented in the science fiction category of the summary, but a movie like “Everything Everywhere All At Once” might be represented in the categories of action, comedy, and science fiction. As a rare added bonus, Moran’s new method achieves a statistical property known as identifiability. This ensures that there is only one way to interpret the model, as long as there are anchor points in the data.

After chatting with Philippakis last year about her new sparse VAE method, the two realized that it could be a great way to unearth previously unknown relationships between health symptoms in ways that would be easy for doctors and health researchers to interpret. Essentially, their project uses machine learning to improve nosology, which is the scientific field of disease classification. Until now, to classify a new disease, doctors have relied on their own expertise and experience to know what symptoms — like blurry vision and increased urination for diabetes — co-occur. They’ve also had to decide how to meaningfully differentiate these symptoms from another group of symptoms that comprise a separate disease. But it’s possible that physicians haven’t noticed some co-occurring symptoms that might tell them more about disease severity or indicate a new subgroup of a disease — or require a new disease label altogether. 

“What these machine learning methods are exactly designed to do is find what things travel together, and so in that way, they can help physicians see more things that travel together that they might not have noticed just by observation alone,” said Moran.

Moran stands in front of the Low Memorial Library

Moran and Philippakis are currently applying the sparse VAE method to data from 500,000 patients in the UK Biobank, which is a large patient dataset filled with detailed genetic and health information collected by researchers in the United Kingdom. They hope it may yield surprising correlations between biological signals that could improve the classification of diseases, with the goal of obtaining their first results later this year.

“I’m incredibly excited about where this line of research is headed,” said Anthony Philippakis. “In the same way that Gemma has already shown that her method can identify ‘eigen-movies’ that indicate similar classes of films, there is the opportunity to uncover ‘eigen-phenotypes’ that indicate collections of traits that are correlated with each other.”

New Job, Same Thrill

When Moran starts her own research group this fall at Rutgers University, she will continue her work on improving the interpretability and transparency of powerful machine learning algorithms applied to medical research. Her ultimate goal is to create algorithms that provide the most advantages to the health of society without propagating harmful biases against certain groups. Indeed, Moran sees this problem of bias in machine learning as one of the biggest challenges facing the field over the next ten years. 

“It’s a really crazy time to be in machine learning. There are so many developments happening at breakneck speed,” she said. “What worries me is people building these powerful [machine learning] models without necessary checks and balances and transparency and interpretability … especially applied to health care because it's such a critical domain where we could see negative consequences if we're not using these tools responsibly.”

While Moran’s goals have changed over her academic career, and her work has taken her to opposite sides of the globe, the joy she finds in it has remained constant. “That feeling when you've had an idea and then you code up something that works, it's just very thrilling,” she said. For Moran, that thrill becomes even more meaningful when she’s answering a question that could help actual patients. “At the end of the day, I love math and modeling and thinking about variation and how to think about data, but it's nice to connect it to real world questions.”

Organisms
Representation Learning
June 6, 2023
Yue Qin named to Forbes 30 Under 30 Asia 2023
2023

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is excited to announce that postdoctoral fellow Yue Qin was named to the Forbes 30 Under 30 Asia 2023 list this May. The Forbes 30 Under 30 lists highlight some of the most successful researchers, leaders, and entrepreneurs around the world.

Qin joined the Eric and Wendy Schmidt Center in January 2023. She is co-advised by Paul Blainey, a core member of the Broad Institute and an associate professor of biological engineering at MIT, and Caroline Uhler, co-director of the Eric and Wendy Schmidt Center. Qin's research interests lie in understanding how to read out the programs of cells from the genome. Qin uses that knowledge to create in silico cells that simulate the effect of therapeutic interventions in different disease and genetic contexts with the ultimate goal of developing personalized medicine.

Qin holds a PhD in Bioinformatics and Systems Biology and a BSc in Bioinformatics from the University of California San Diego (UCSD). As a graduate student, she was the first author on a 2021 Nature paper that developed a machine learning framework to map the structure of human cells by fusing data from protein imaging and protein biophysical interactions. Qin is a Siebel Scholar and a recipient of an NCI Predoctoral to Postdoctoral Fellow Transition Award (F99/K00) as well as the Chancellor’s Dissertation Medal within the Jacobs School of Engineering at UCSD.

“Yue embodies the type of researcher we’re excited to work with at the Eric and Wendy Schmidt Center,” said Uhler, who is also a core member of the Broad Institute and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT. “Her research is a great example of how computation and biology can go hand in hand in an age where the number of possible experiments we could perform has exploded.”

People
Cells
April 28, 2023
A deep (learning) dive into the roots of cancer
2023

In a recent grant application to the National Institutes of Health, Petar Stojanov was required, among other things, to describe his “specific aims” as well as his background. It’s doubtful that the NIH reviewers would have considered Stojanov’s research agenda lacking in ambition, given its broad scope: to identify the genetic mutations that cause cancer and figure out how they cause it.

The reviewers, moreover, must have decided he had a credible chance of achieving these goals, or at least making progress toward their realization, as he was informed earlier this year that he had earned a coveted Pathway to Independence (K99) Award. As a result, Stojanov — a current Eric and Wendy Schmidt Center Postdoctoral Fellow at the Broad Institute of MIT and Harvard — will receive up to five years of research support, meaning he can devote himself fully to his scientific inquiries without having to worry about funding.

K99 grants help “outstanding” researchers transition from postdoctoral positions to running their own labs. In this next stage of his career, Stojanov will develop new methods in two types of machine learning: algorithms related to causality and deep generative models.

An early interest in computational biology

In some sense, Stojanov set off on the path that led him to this milestone when he was a high school student in Macedonia. A family friend told him that computational biology was becoming a hot area in science. Stojanov was immediately intrigued, he said, “for the same reason that has brought many people to this field — math and biology were my favorite subjects.” And here was a chance to combine his preferred disciplines into a unified course of study that might lead to an interesting career.

He spent his senior year of high school in Pelham, New York (where he lived with his family friend), as he’d always believed he “would have the best opportunities for innovation in the U.S.” A year later, he enrolled in Bard College, which had no courses, let alone a major, in computational biology. Stojanov stuck to his passion, nevertheless, taking the bulk of his classes in computer science, biology, mathematics, and chemistry. He gained hands-on experience in computational biology through summer research programs at George Washington University and the University of Maryland.

Stojanov on his way to work at the Broad Institute

After graduating from Bard in 2010, he took a job in the laboratory of Gaddy Getz, director of the Broad’s Cancer Genome Computational Analysis Group. That’s where Stojanov got started on the two-pronged research track he’s still pursuing today: First, to figure out which mutations are present in cancerous tissue and, second, to determine which of those mutations actually spur our cells to multiply out of control and drive cancer. The standard approach at the time was to rely on statistical methodology, such as examining whether the number of mutations in a given gene was greater than would be expected from random processes, unrelated to cancer.
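
As a rough illustration of that kind of statistical test, the sketch below asks how likely it would be to see at least the observed number of mutations in a gene if mutations arose purely by chance. The Poisson background model, the function name, and the example numbers are assumptions made for illustration; the methods used in practice at the time are not described here and were likely more elaborate.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) when X ~ Poisson(expected)."""
    term = math.exp(-expected)      # P(X = 0)
    below = 0.0                     # accumulates P(X < observed)
    for i in range(observed):
        below += term
        term *= expected / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - below)

# Hypothetical gene: 2 mutations expected by chance across the cohort, 10 observed.
p_value = poisson_upper_tail(observed=10, expected=2.0)
```

A very small `p_value` flags the gene as mutated more often than chance would suggest, though, as the article goes on to note, it says nothing about what the mutations actually do.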

Stojanov spent four productive years at the Broad, coauthoring more than a dozen papers, including four on which he was a lead author. He didn’t sleep much in those days, mainly because he was “hungry for projects and never said no to an opportunity.” Yet, by the end of that tenure, he felt that his work in this area could benefit from additional training in computer science, which would enable him to bring new tools to the kinds of problems he’d been grappling with. In 2014, he entered a PhD program at Carnegie Mellon University, where he immersed himself in machine learning techniques and other emerging approaches in artificial intelligence. Although his graduate research had nothing to do with biology, he recognized that the methods he was learning, combined with statistics, might lead to breakthroughs in his previous cancer investigations.

Bringing ML to bear on cancer research

Stojanov returned to the Broad in 2021 and picked up in the Getz lab where he had left off — this time ready to unleash the full power of AI. Getz was eager to have him back, touting “the unique set of skills that Petar has,” given his prior experience in cancer research and his recently strengthened background in computer science. “And now,” Getz said, “he’s applying his expertise in machine learning to the search for the drivers of cancer.”

Just counting the number of mutations in a gene is not enough to reveal the mechanisms underpinning cancer, Stojanov explained. “That may tell you which mutations are most prevalent, and maybe the most important, but it still doesn’t tell you what they do.” To understand how a mutation affects a gene, you have to look at gene expression, the cellular process by which the information encoded in a gene is used to create proteins.

In his latest work at the Broad, Stojanov is focusing on two variables: gene mutations, which can be gleaned from DNA sequencing data, and gene expression, which can be obtained from RNA sequencing data by measuring the amount of RNA, a gene-decoding molecule, in the cell. He then uses a set of machine learning tools called causal inference and discovery algorithms to uncover the “causal relationships” between these two variables, mutations and expression.

“The idea is to show that some aspects of gene expression are the consequences of mutations,” he said.  

The only causal relationships he cares about are those associated with cancer. While sorting through DNA and RNA sequencing data from thousands of cancer patients, he’s looking for patterns. In particular, he said, “we might find mutations that influence patients with the same cancer type (or subtype), in the same way.”

Stojanov in his office with colleagues Pinar Eser (center) and Tim Coorens

As an intermediate step, Stojanov relies on a related class of machine learning-based tools, so-called deep generative models, which basically take abstract (“high-dimensional”) information processed by computers and represent it in a form that is meaningful to humans. If you have mutation and expression data for 20,000 genes, he said, these models offer a way to summarize that vast amount of data in terms of the concepts you’re interested in, such as biological processes or cell subtypes that might be impacted by cancer.

The ultimate goal is to learn as much as possible about this multifaceted disease — how and where it starts and progresses. “To really understand what’s going on,” Stojanov said, “we need an interpretable map that shows which processes are affected by what mutations.”

Existing techniques can only get you so far

Eric and Wendy Schmidt Center co-director Caroline Uhler is excited by the prospect of “getting at the causal genes, which contain the mutations that drive cancer.” “Once you have that,” she said, “you’re in a much better position to think about effective therapies. That’s really the promise of this work.”

Stojanov’s current research is, admittedly, at an early stage. He has a solid base of experience to draw on, and he’s picked out a set of tools, in the form of machine learning algorithms, that are poised to advance our knowledge base. The big challenge, Uhler pointed out, is that “existing techniques can only get you so far. Petar has to build on these methods and develop new algorithms in order to solve the important biological questions he plans to address.”

Stojanov is mindful of the hard work ahead and grateful that his burden has been eased by having several years of funding already secured. “This [K99] award gives you the ultimate amount of independence you can have as a postdoc,” he said.

When asked if getting the award is the best thing that could happen to someone in his position, embarking on such an ambitious enterprise, he replied, “Well, it’s certainly up there.”

Cells
Causal Inference
April 28, 2023
Machine learning model finds genetic factors for heart disease
2023

To get an inside look at the heart, cardiologists often use electrocardiograms (ECGs) to trace its electrical activity and magnetic resonance images (MRIs) to map its structure. Because the two types of data reveal different details about the heart, physicians typically study them separately to diagnose heart conditions.

Now, in a paper published in Nature Communications, scientists in the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have developed a machine learning approach that can learn patterns from ECGs and MRIs simultaneously, and based on those patterns, predict characteristics of a patient’s heart. Such a tool, with further development, could one day help doctors better detect and diagnose heart conditions from routine tests such as ECGs.

The researchers also showed that they could analyze ECG recordings, which are easy and cheap to acquire, and generate MRI movies of the same heart, which are much more expensive to capture. Their method could even be used to find new genetic markers of heart disease that existing approaches, which look at individual data modalities, might miss.

Overall, the team said their technology is a more holistic way to study the heart and its ailments. “It is clear that these two views, ECGs and MRIs, should be integrated because they provide different perspectives on the state of the heart,” said Caroline Uhler, a co-senior author on the study, a Broad core institute member, co-director of the Schmidt Center at Broad, and a professor in the Department of Electrical Engineering and Computer Science as well as the Institute for Data, Systems, and Society at MIT.

“As a field, cardiology is fortunate to have many diagnostic modalities, each providing a different view into cardiac physiology in health and diseases. A challenge we face is that we lack systematic tools for integrating these modalities into a single, coherent picture,” said Anthony Philippakis, a co-senior author on the study, chief data officer at Broad, and co-director of the Schmidt Center. “This study represents a first step towards building such a multi-modal characterization.”

Model making

To develop their model, the researchers used a machine learning algorithm called an autoencoder, which automatically integrates gigantic swaths of data into a concise representation – a simpler form of the data. The team then used this representation as input for other machine learning models that make specific predictions.

In their study, the team first trained their autoencoder using ECGs and heart MRIs from participants in the UK Biobank. They fed in tens of thousands of ECGs, each paired with MRI images from the same person. The algorithm then created shared representations that captured crucial details from both types of data.
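
The idea of a shared representation can be sketched with a much simpler stand-in. The example below is an assumption-laden toy, not the team's model: it uses synthetic numbers in place of ECGs and MRIs and replaces the neural autoencoder with a linear one, computed from a truncated singular value decomposition. All names, sizes, and the choice of three hidden factors are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for paired recordings: both modalities are driven by
# the same 3 hidden factors per person (an assumption made for this toy).
n_people, n_factors = 200, 3
factors = rng.normal(size=(n_people, n_factors))
ecg = factors @ rng.normal(size=(n_factors, 12))   # 12 "ECG" features
mri = factors @ rng.normal(size=(n_factors, 20))   # 20 "MRI" features

# Pair the two modalities person by person and center the data.
paired = np.hstack([ecg, mri])
mean = paired.mean(axis=0)
centered = paired - mean

# The best linear autoencoder with a k-dimensional bottleneck spans the same
# subspace as the top-k right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt[:n_factors]                        # shape (3, 32)

def encode(x):
    """Compress paired features into the shared representation."""
    return (x - mean) @ components.T

def decode(z):
    """Map a shared representation back to ECG and MRI features."""
    return z @ components + mean

codes = encode(paired)                             # shape (200, 3)
reconstruction = decode(codes)
```

Because each person's 32 paired features are compressed into 3 numbers that still reconstruct both modalities, those numbers can serve as compact inputs to downstream prediction models.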

“Once you have these representations, you can use them for many different applications,” said Adityanarayanan Radhakrishnan, a co-first author on the study, an Eric and Wendy Schmidt Center Fellow at the Broad, and a graduate student at MIT in Uhler’s lab. Sam Friedman, a senior machine learning scientist in the Data Sciences Platform at the Broad, is the other co-first author.

One of those applications is predicting heart-related traits. The researchers used the representations created by their autoencoders to build a model that could predict a range of traits, including features of the heart like the weight of the left ventricle, other patient characteristics related to heart function like age, and even heart disorders. Moreover, their model outperformed more standard machine learning approaches, as well as autoencoder algorithms that were trained on just one of the data modalities.

“What we showed here is that you get better prediction accuracy if you incorporate multiple types of data,” Uhler said.

Radhakrishnan explained that their model made more accurate predictions because it used representations that had been trained on a much larger dataset. Autoencoders don’t require data that have been labeled by humans, so the team could feed their autoencoder with around 39,000 unlabeled pairs of ECGs and MRI images, rather than just around 5,000 labeled pairs.

The researchers demonstrated another application of their autoencoder: generating new MRI movies. By inputting an individual’s ECG recording into the model — without a paired MRI recording — the model produced the predicted MRI movie for the same person.

With more work, the scientists envision that such technology could potentially allow physicians to learn more about a patient’s heart health from just ECG recordings, which are routinely collected at doctors’ offices.

Broader gene search

The team realized they could also use their autoencoder representations to look for genetic variants associated with heart disease. The traditional method of finding genetic variants for a disease, called a genome-wide association study (GWAS), requires genetic data from individuals who have been labeled with the disease of interest.

But because the team’s autoencoder framework doesn’t require labeled data, they were able to generate representations that reflected the overall state of a patient’s heart. Using these representations and genetic data on the same patients from the UK Biobank, the researchers created a model that looked for genetic variants that impact the state of the heart in more general ways. The model produced a list of variants including many of the known variants related to heart disease and some new ones that can now be investigated further.

Radhakrishnan said that genetic discovery could be the area in which the autoencoder framework, with more data and development, could have the most impact – not just for heart disease, but for any disease. The research team is already working on applying their autoencoder framework to study neurological diseases.

Uhler said this project is a good example of how innovations in biomedical data analysis emerge when machine learning researchers collaborate with biologists and physicians. “An exciting aspect about getting machine learning researchers interested in biomedical questions is that they might come up with a completely new way of looking at a problem.”

Support for the research was provided in part by the Eric and Wendy Schmidt Center at the Broad Institute, the National Science Foundation, the Office of Naval Research, the MIT-IBM Watson AI Lab, a Simons Investigator Award, the National Institutes of Health, and the American Heart Association.

Adapted from a news story posted on the Broad Institute website.

Organisms
Representation Learning
March 30, 2023
A method for designing neural networks optimally suited for certain tasks
2023

Neural networks, a type of machine-learning model, are being used to help humans complete a wide variety of tasks, from predicting if someone’s credit score is high enough to qualify for a loan to diagnosing whether a patient has a certain disease. But researchers still have only a limited understanding of how these models work. Whether a given model is optimal for a certain task remains an open question.

MIT researchers have found some answers. They conducted an analysis of neural networks and proved that they can be designed so they are “optimal,” meaning they minimize the probability of misclassifying borrowers or patients into the wrong category when the networks are given a lot of labeled training data. To achieve optimality, these networks must be built with a specific architecture.

The researchers discovered that, in certain situations, the building blocks that enable a neural network to be optimal are not the ones developers use in practice. These optimal building blocks, derived through the new analysis, are unconventional and haven’t been considered before, the researchers say.

In a paper published this week in the Proceedings of the National Academy of Sciences, they describe these optimal building blocks, called activation functions, and show how they can be used to design neural networks that achieve better performance on any dataset. The results hold even as the neural networks grow very large. This work could help developers select the correct activation function, enabling them to build neural networks that classify data more accurately in a wide range of application areas, explains senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS) and co-director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard.

“While these are new activation functions that have never been used before, they are simple functions that someone could actually implement for a particular problem. This work really shows the importance of having theoretical proofs. If you go after a principled understanding of these models, that can actually lead you to new activation functions that you would otherwise never have thought of,” says Uhler, who is a core institute member of the Broad Institute, and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS) and Institute for Data, Systems and Society (IDSS).

Joining Uhler on the paper are lead author Adityanarayanan Radhakrishnan, an EECS graduate student and an Eric and Wendy Schmidt Center Fellow, and Mikhail Belkin, a professor in the Halicioğlu Data Science Institute at the University of California at San Diego.

Activation investigation

A neural network is a type of machine-learning model that is loosely based on the human brain. Many layers of interconnected nodes, or neurons, process data. Researchers train a network to complete a task by showing it millions of examples from a dataset.

For instance, a network that has been trained to classify images into categories, say dogs and cats, is given an image that has been encoded as numbers. The network performs a series of complex multiplication operations, layer by layer, until the result is just one number. If that number is positive, the network classifies the image as a dog, and if it is negative, as a cat.

Activation functions help the network learn complex patterns in the input data. They do this by applying a transformation to the output of one layer before data are sent to the next layer. When researchers build a neural network, they select one activation function to use. They also choose the width of the network (how many neurons are in each layer) and the depth (how many layers are in the network).
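The pieces described above (layers, an activation function, width, depth, and a single output number whose sign decides the class) can be sketched in a few lines of Python. This is a toy illustration only: the function names, the random weights, and the choice of ReLU as the activation are assumptions made for the example, and ReLU is a standard activation, not one of the new functions derived in the paper.

```python
import math
import random


def relu(x):
    # A commonly used activation; NOT one of the new functions from the paper.
    return max(0.0, x)


def make_network(input_dim, width, depth, seed=0):
    """Random weights for `depth` hidden layers of `width` neurons each,
    followed by an output layer that produces a single number."""
    rng = random.Random(seed)
    dims = [input_dim] + [width] * depth + [1]
    return [
        [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
        for n_in, n_out in zip(dims[:-1], dims[1:])
    ]


def forward(layers, x, activation=relu):
    """Multiply layer by layer, applying the activation between layers."""
    for i, weights in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        if i < len(layers) - 1:  # no activation after the output layer
            x = [activation(v) for v in x]
    return x[0]


def classify(layers, x):
    # Positive output means "dog", otherwise "cat", as in the article's example.
    return "dog" if forward(layers, x) > 0 else "cat"
```

Swapping a different function in for `relu` is all it takes to change the activation, which is the design choice the researchers' formulas are meant to guide.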

“It turns out that, if you take the standard activation functions that people use in practice, and keep increasing the depth of the network, it gives you really terrible performance. We show that if you design with different activation functions, as you get more data, your network will get better and better,” says Radhakrishnan.

He and his collaborators studied a situation in which a neural network is infinitely deep and wide — which means the network is built by continually adding more layers and more nodes — and is trained to perform classification tasks. In classification, the network learns to place data inputs into separate categories.

“A clean picture”

After conducting a detailed analysis, the researchers determined that there are only three ways this kind of network can learn to classify inputs. One method classifies an input based on the majority of inputs in the training data; if there are more dogs than cats, it will decide every new input is a dog. Another method classifies by choosing the label (dog or cat) of the training data point that most resembles the new input.

The third method classifies a new input based on a weighted average of all the training data points that are similar to it. Their analysis shows that this is the only method of the three that leads to optimal performance. They identified a set of activation functions that always use this optimal classification method.
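To make the three behaviors concrete, here is a minimal Python sketch of each classifier, with labels encoded as +1 (dog) and -1 (cat). The function names are hypothetical, and the Gaussian weighting in the third function is an assumption chosen for readability; the article does not say which weighting the researchers' analysis yields.

```python
import math


def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def majority_vote(train_x, train_y, x):
    # Ignores the new input entirely: predicts the most common training label.
    return 1 if sum(train_y) > 0 else -1


def nearest_neighbor(train_x, train_y, x):
    # Copies the label of the single most similar training point.
    closest = min(range(len(train_x)), key=lambda i: distance(train_x[i], x))
    return train_y[closest]


def weighted_average(train_x, train_y, x, bandwidth=1.0):
    # Every training point votes, with closer points weighted more heavily.
    # The Gaussian weight is an illustrative assumption.
    score = sum(
        y * math.exp(-distance(xi, x) ** 2 / (2 * bandwidth ** 2))
        for xi, y in zip(train_x, train_y)
    )
    return 1 if score > 0 else -1


# One dog at 0.0 and three cats clustered between 1.0 and 1.4.
train_x = [(0.0,), (1.0,), (1.2,), (1.4,)]
train_y = [1, -1, -1, -1]
query = (0.4,)
print(majority_vote(train_x, train_y, query))     # -1 (cats are the majority)
print(nearest_neighbor(train_x, train_y, query))  # 1 (the dog is closest)
print(weighted_average(train_x, train_y, query))  # -1 (three nearby cats outweigh it)
```

On this toy data the nearest-neighbor rule follows the single closest point, while the weighted average takes the whole neighborhood into account.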

“That was one of the most surprising things — no matter what you choose for an activation function, it is just going to be one of these three classifiers. We have formulas that will tell you explicitly which of these three it is going to be. It is a very clean picture,” he says.

They tested this theory on several classification benchmarking tasks and found that it led to improved performance in many cases. Neural network builders could use their formulas to select an activation function that yields improved classification performance, Radhakrishnan says.

In the future, the researchers want to use what they’ve learned to analyze situations where they have a limited amount of data and for networks that are not infinitely wide or deep. They also want to apply this analysis to situations where data do not have labels.

“In deep learning, we want to build theoretically grounded models so we can reliably deploy them in some mission-critical setting. This is a promising approach at getting toward something like that — building architectures in a theoretically grounded way that translates into better results in practice,” he says.

This work was supported, in part, by the National Science Foundation, Office of Naval Research, the MIT-IBM Watson AI Lab, the Eric and Wendy Schmidt Center at the Broad Institute, and a Simons Investigator Award.

Adapted from a news story posted on MIT News.

Representation Learning
March 28, 2023
Machine learning experts from around the world compete to improve cancer immunotherapy
2023

Marios Gavrielatos had never participated in a machine learning competition when he decided to enter the Eric and Wendy Schmidt Center’s Cancer Immunotherapy Data Science Grand Challenge.

Gavrielatos’ friend and colleague, Konstantinos Kyriakidis, asked him to team up in the competition after learning about it from a promotional video on YouTube.

Despite Gavrielatos’ newcomer status, the pair developed a new deep learning model that won them the first part of the competition last month.

The challenge “helped me develop new computational skills, deep-learning wise,” said Gavrielatos, a bioinformatics master’s student at the National and Kapodistrian University of Athens, adding that because they couldn’t find similar problems online, “we had to develop something new ourselves, which was interesting.”

The Cancer Immunotherapy Data Science Grand Challenge, which ran on Topcoder from January 9 to February 3, aimed to uncover new ways to modify, or “perturb,” T cells to make them more effective at killing cancer cells to ultimately improve cancer treatment.

Top challenge submissions will be tested out in a lab at the Broad Institute of MIT and Harvard later this year.

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard partnered with Harvard’s Laboratory for Innovation Science, the MIT Department of Electrical Engineering and Computer Science, Topcoder, Gordian Biotechnology, and Massachusetts General Hospital (MGH) to run the challenge. Over 900 people registered for the first part of the competition — making it Topcoder’s fifth-largest data science challenge to date.

“In biology, we can perform perturbations on a scale that other fields can only dream of, meaning we need to develop novel machine learning methods to best make use of such data and answer biological questions,” said Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT. “We held this data science challenge to direct bright computational minds from around the world to this problem in cancer immunotherapy. And we’re thrilled that we now get to test out some of their proposed perturbations experimentally.”

A great fit for a data science challenge

While chemotherapy and radiation have saved many lives, these treatments have a weak spot: they are not specific enough — meaning they can kill cancerous and healthy cells. The promise of cancer immunotherapy, a newer and effective form of cancer treatment, is that it can harness our immune system to recognize and kill cancer cells while leaving other cells alone in most cases.

Cancer cells have developed a number of ways to evade our immune system. One such strategy is sending signals to T cells to make them exhausted and ineffective at killing cancer cells. That’s why cancer researchers like Nir Hacohen, an institute member at Broad, director of the Broad Institute’s Cell Circuits Program and director of the Center for Cancer Immunology at Mass General Hospital, are investigating whether perturbing certain genes could shift T cells to a cancer-fighting, “effector” state.

“We were excited to develop this data science challenge with the Eric and Wendy Schmidt Center because the T cell exhaustion problem seemed like a great fit for this kind of competition,” said Hacohen. “It was an opportunity to combine our cancer biology and immunology knowledge with the computational and mathematical skills of machine learning experts from all over the world.”

Marc Schwartz, a postdoctoral fellow in the Hacohen Lab, ran experiments testing the effects of 73 gene knockouts in T cells on mice with cancer. A gene knockout is a genetic perturbation that stops a gene from functioning. Given that it took months to test a fraction of the 20,000 potential gene knockouts, Broad researchers wanted a way to zero in on the most promising perturbations. Enter machine learning.

The overarching challenge was divided into three parts that ran as individual data science competitions on Topcoder. In Challenge 1, participants received gene expression data from 66 of the 73 T-cell-gene knockouts from Schwartz’s experiments as training data. They then had to develop an algorithm that could predict how knocking out the seven “held-out” genes would affect T cells.
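The Challenge 1 setup is a standard hold-out evaluation, which can be sketched as follows. Everything named here is hypothetical: the gene labels are placeholders, the baseline simply averages the training outcomes, and the error measure is an assumption, since the article does not describe how submissions were scored.

```python
import random


def split_knockouts(genes, n_held_out=7, seed=0):
    """Randomly withhold `n_held_out` knockouts; the rest are training data."""
    rng = random.Random(seed)
    held_out = set(rng.sample(genes, n_held_out))
    train = [g for g in genes if g not in held_out]
    return train, sorted(held_out)


def mean_distribution(distributions):
    """Naive baseline: average the T cell state proportions seen in training."""
    n = len(distributions)
    return [sum(d[i] for d in distributions) / n
            for i in range(len(distributions[0]))]


def l1_error(predicted, observed):
    """Total absolute difference between two sets of state proportions
    (an assumed metric, not the challenge's official score)."""
    return sum(abs(p - o) for p, o in zip(predicted, observed))


genes = [f"gene_{i}" for i in range(73)]  # placeholder names
train, held_out = split_knockouts(genes)
print(len(train), len(held_out))  # 66 7
```

A competing model would replace `mean_distribution` with something that uses the identity of the knocked-out gene, and would be judged on how close its predictions come to the withheld measurements.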

Challenge 2 participants used their algorithms from the first challenge to propose new gene knockouts (picking from any of the 20K genes in the entire genome) to shift as many T cells as possible into a cancer-fighting state. In Challenge 3, participants proposed a metric for ranking how well a particular gene knockout would bring about this desired shift in T cells.

To solve Challenge 1, winners Gavrielatos and Kyriakidis first pared down the single-cell dataset so that it contained only expression information from important genes, that is, genes whose expression changed across different T cell states. Preprocessing the data is a crucial step to distill the “signal” (the useful information) when working with such noisy data, said Kyriakidis, who has previously won several precisionFDA data science challenges.

The pair next trained a deep learning model to predict what portion of T cells would move into an effector, exhausted, or alternate state after a specific gene was knocked out. Initially, they tried to come up with an algorithm using only the training data provided from Schwartz’s experiment. But as they continued working, they realized that incorporating public biomedical databases into their analysis — namely, Reactome, a database of biological pathways in human cells, and STRING, a protein interaction database — could reveal associations between the missing and observed genes.

“The whole process was so rewarding,” said Kyriakidis. “You have to divide the whole problem into smaller parts to try to find the solution to each part and connect the dots.”

Sometimes, simple algorithms are best

The second-place winners were a team of four MIT researchers: two graduate students from the Laboratory for Information and Decision Systems (LIDS), Yuzhou Gu and Anzo Teh; MIT Institute for Data, Systems, and Society (IDSS) postdoc Yanjun Han; and undergraduate student Brandon Wang. Teh, who is also an Eric and Wendy Schmidt Center PhD fellow, said his advisor, MIT professor Yury Polyanskiy, suggested that he and the other researchers join forces for the challenge.

Anzo Teh, Eric and Wendy Schmidt Center PhD Fellow

Teh, Gu, and Han have a theoretical and computational background, specifically in information theory, while Wang has expertise in computational biology.

“I did feel like this challenge was a good way for me to learn how to work on these types of problems because I’m pretty new to the biology field,” said Teh.

Several teams used neural networks to describe the experimental gene expression data, an approach that often requires thousands of parameters to create an effective model. The MIT team, on the other hand, made a simplifying assumption that gene expression could be modeled with a small number of parameters following a Gaussian distribution, or a bell curve.

They then reduced the dimensions of their data from 20,000 to 50 columns using a machine learning technique called “principal component analysis.” The MIT team also incorporated an outside public database on human genes into their model, mapping human gene expression profiles to their missing mouse counterparts. Finally, they used a proven machine learning classification algorithm to determine how the gene expression profiles lined up with T cell states.
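Principal component analysis finds the directions along which the data vary most and keeps only the coordinates along those directions. The sketch below shows the idea on a tiny example using power iteration; it is a teaching toy under stated assumptions (the function names are invented here, and building a full covariance matrix would be impractical for 20,000 columns, where a numerical library would normally be used). It is not the MIT team's code.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so the data are centered at zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=500, seed=0):
    """Power iteration: repeatedly multiply by the covariance and normalize."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in cov]
    for _ in range(iters):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    d = len(v)
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, eigenvalue


def pca(rows, k):
    """Project rows onto their top `k` principal components."""
    x = center(rows)
    cov = covariance(x)
    components = []
    for _ in range(k):
        v, lam = top_component(cov)
        components.append(v)
        # Deflate: remove the direction just found before searching again.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    projected = [[sum(a * b for a, b in zip(r, v)) for v in components]
                 for r in x]
    return projected, components
```

For example, `pca(rows, 1)` on points that lie along a single line returns one coordinate per point, measured along that line.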

“Sometimes simple algorithms can work better than neural networks,” said Teh. The MIT team’s background in information theory, which is the study of organizing and quantifying data, helped them discover what signals in the experimental data to focus their models on.

Peter Novotný, the third-place winner and a math professor at the University of Žilina in Slovakia, also took a relatively simple approach to solving Challenge 1. Novotný, a former Topcoder “copilot” who had participated in a NASA asteroid-hunter challenge, among many other competitions, has more of a mathematics than a computer science background. Partly through participating in data science challenges, though, he has discovered that he enjoys machine learning.

“And, I also quite like competing,” he said.

For the cancer immunotherapy challenge, Novotný first selected 14 features from the T cell data that quantified how gene expression levels differed between perturbed and unperturbed cells, and used them to represent his training data. Then, he built a model using a common machine learning algorithm, the “random forest,” and predicted the distribution of T cell states for each of the seven withheld genes.

To make the challenge accessible to participants without a biology background, Lightmark Creative and Orr Ashenberg, associate director of computational biology at The Klarman Cell Observatory of the Broad Institute, produced a 1.5-hour crash course on cancer biology, perturbation data, and single-cell sequencing technologies.

“To compete in this contest, you really need to understand what the data is, and without those lectures, it would be quite difficult to understand the problem,” said Novotný.

In addition, Uhler held an IAP course that ran at the same time as the challenge, encouraging MIT students to team up and participate in the competition.

Testing perturbations in the lab

The Eric and Wendy Schmidt Center also announced last month who won the third challenge, in which participants came up with a metric to rank new T cell perturbations.

The winners of that challenge were:

  • First place: Dariusz Brzeziński and Wojciech Kotlowski from Poznań University of Technology in Poland
  • Second place: Salil Bhate, MIT, postdoctoral fellow at the Eric and Wendy Schmidt Center
  • Third place: Irene Bonafonte Pardàs, Artur Szalata, and Benjamin Schubert from Helmholtz Center Munich and Miriam Lyzotte from Mila - Quebec AI Institute

Now, researchers at the Hacohen Lab will run experiments to test how the perturbations proposed in Challenge 2 affect mouse T cells’ cancer-fighting abilities.

“It will be really exciting to see how these computationally identified perturbations actually perform in the lab,” said Uhler. “After all, machine learning cannot replace experiments, but the goal is to work hand in hand with biologists and help prioritize the next experiments to run.”

Cells
Active Learning
January 20, 2023
Researchers develop an AI model that can detect future lung cancer risk
2023

The name Sybil has its origins in the oracles of Ancient Greece, also known as sibyls: feminine figures who were relied upon to relay divine knowledge of the unseen and the omnipotent past, present, and future. Now, the name has been excavated from antiquity and bestowed on an artificial intelligence tool for lung cancer risk assessment being developed by researchers at MIT's Abdul Latif Jameel Clinic for Machine Learning in Health, Mass General Cancer Center (MGCC), and Chang Gung Memorial Hospital (CGMH).

Lung cancer is the No. 1 deadliest cancer in the world, resulting in 1.7 million deaths worldwide in 2020, killing more people than the next three deadliest cancers combined.

“It’s the biggest cancer killer because it’s relatively common and relatively hard to treat, especially once it has reached an advanced stage,” says Florian Fintelmann, MGCC thoracic interventional radiologist and co-author on the new work. “In this case, it’s important to know that if you detect lung cancer early, the long-term outcome is significantly better. Your five-year survival rate is closer to 70 percent, whereas if you detect it when it’s advanced, the five-year survival rate is just short of 10 percent.”

Although there has been a surge in new therapies introduced to combat lung cancer in recent years, the majority of patients with lung cancer still succumb to the disease. Low-dose computed tomography (LDCT) scans of the lung are currently the most common way patients are screened for lung cancer with the hope of finding it in the earliest stages, when it can still be surgically removed. Sybil takes the screening a step further, analyzing the LDCT image data without the assistance of a radiologist to predict the risk of a patient developing a future lung cancer within six years.

In their new paper published in the Journal of Clinical Oncology, Jameel Clinic, MGCC, and CGMH researchers demonstrated that Sybil obtained C-indices of 0.75, 0.81, and 0.80 over the course of six years from diverse sets of lung LDCT scans taken from the National Lung Screening Trial (NLST), Mass General Hospital (MGH), and CGMH, respectively. A C-index above 0.7 is considered good, and one above 0.8 is considered strong. The ROC-AUCs for one-year prediction using Sybil were even higher, ranging from 0.86 to 0.94, with 1.00 being the highest score possible.
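Both scores measure how well a model ranks patients by risk. The following Python sketch shows one common way to compute each. The function names are hypothetical, and the C-index shown is the Harrell-style pairwise version, which is an assumption: the article does not say which variant the paper used.

```python
def roc_auc(scores, labels):
    """Chance that a randomly chosen positive case (label 1) receives a higher
    score than a randomly chosen negative case (label 0). Ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def c_index(risk, time, event):
    """Harrell-style concordance index for follow-up data.

    risk  : predicted risk score per patient (higher = more at risk)
    time  : time of diagnosis, or of last follow-up if none occurred
    event : 1 if cancer was diagnosed, 0 if the patient was censored
    """
    concordant = 0.0
    comparable = 0
    n = len(risk)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's recorded time.
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))      # 0.75
print(c_index([0.9, 0.5, 0.1], [1, 2, 3], [1, 1, 0]))   # 1.0
```

A score of 0.5 on either measure corresponds to ranking patients no better than chance, and 1.0 corresponds to a perfect ranking.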

Despite its success, the 3D nature of lung CT scans made Sybil a challenge to build. Co-author Peter Mikhael, an MIT PhD student in electrical engineering and computer science, a fellow at the Eric and Wendy Schmidt Center, and an affiliate at the Jameel Clinic and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), likened the process to “trying to find a needle in a haystack.” The imaging data used to train Sybil was largely absent of any signs of cancer because early-stage lung cancer occupies small portions of the lung — just a fraction of the hundreds of thousands of pixels making up each CT scan. Denser portions of lung tissue are known as lung nodules, and while they have the potential to be cancerous, most are not, and can occur from healed infections or airborne irritants.  

To ensure that Sybil would be able to accurately assess cancer risk, Fintelmann and his team labeled hundreds of CT scans with visible cancerous tumors that would be used to train Sybil before testing the model on CT scans without discernible signs of cancer.

MIT electrical engineering and computer science PhD student Jeremy Wohlwend, co-author of the paper and Jameel Clinic and CSAIL affiliate, was surprised by how highly Sybil scored despite the lack of any visible cancer. “We found that while we [as humans] couldn’t quite see where the cancer was, the model could still have some predictive power as to which lung would eventually develop cancer,” he recalls. “Knowing [Sybil] was able to highlight which side was the most likely side was really interesting to us.”

Co-author Lecia V. Sequist, a medical oncologist, lung cancer expert, and director of the Center for Innovation in Early Cancer Detection at MGH, says the results the team achieved with Sybil are important “because lung cancer screening is not being deployed to its fullest potential in the U.S. or globally, and Sybil may be able to help us bridge this gap.”

Lung cancer screening programs are underdeveloped in regions of the United States hardest hit by lung cancer due to a variety of factors. These range from stigma against smokers to political and policy landscape factors like Medicaid expansion, which varies from state to state.

Moreover, many patients diagnosed with lung cancer today have either never smoked or are former smokers who quit over 15 years ago, traits that make both groups ineligible for lung cancer CT screening in the United States.

“Our training data consisted only of smokers because this was a necessary criterion for enrolling in the NLST,” Mikhael says. “In Taiwan, they screen nonsmokers, so our validation data is expected to contain people who didn’t smoke, and it was exciting to see Sybil generalize well to that population.”

“An exciting next step in the research will be testing Sybil prospectively on people at risk for lung cancer who have not smoked or who quit decades ago,” says Sequist. “I treat such patients every day in my lung cancer clinic and it’s understandably hard for them to reconcile that they would not have been candidates to undergo screening. Perhaps that will change in the future.”

There is a growing population of patients with lung cancer who are categorized as nonsmokers. Women nonsmokers are more likely to be diagnosed with lung cancer than men who are nonsmokers. Globally, over 50 percent of women diagnosed with lung cancer are nonsmokers, compared to 15 to 20 percent of men.

MIT Professor Regina Barzilay, a paper co-author and the Jameel Clinic AI faculty lead, who is also a member of the Koch Institute for Integrative Cancer Research, credits MIT and MGH’s joint efforts on Sybil to Sylvia, the sister to a close friend of Barzilay and one of Sequist’s patients. “Sylvia was young, healthy and athletic — she never smoked,” Barzilay recalls. “When she started coughing, neither her doctors nor her family initially suspected that the cause could be lung cancer. When Sylvia was finally diagnosed and met Dr. Sequist, the disease was too advanced to revert its course. When mourning Sylvia’s death, we couldn’t stop thinking how many other patients have similar trajectories.”

This work was supported by the Bridge Project, a partnership between the Koch Institute at MIT and the Dana-Farber/Harvard Cancer Center; the MIT Jameel Clinic; Quanta Computer; Stand Up To Cancer; the MGH Center for Innovation in Early Cancer Detection; the Bralower and Landry Families; Upstage Lung Cancer; and the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. The Cancer Center of Linkou CGMH under Chang Gung Medical Foundation provided assistance with data collection and R. Yang, J. Song and their team (Quanta Computer Inc.) provided technical and computing support for analyzing the CGMH dataset. The authors thank the National Cancer Institute for access to NCI’s data collected by the National Lung Screening Trial, as well as patients who participated in the trial.

Adapted from a news story posted on MIT News.

Organisms
December 7, 2022
New method identifies spatial biomarkers of Alzheimer’s disease progression in animal model
2022

Many diseases affect how cells are spatially organized in tissues, such as in Alzheimer’s disease, where amyloid-β proteins clump together to form plaques in the brain. Studying how cells differ in various regions of tissue could help scientists better understand the key changes that lead to Alzheimer’s and other diseases. But integrating data on gene expression and cell structure and spatial location into the same analysis has proven challenging.

Now, researchers from the Broad Institute of MIT and Harvard and ETH Zürich in Switzerland have developed a computational framework for simultaneously analyzing gene expression, the structure of cell nuclei, and their position in space. STACI (Spatial Transcriptomics combined using Autoencoders with Chromatin Imaging) is the first method that combines these three kinds of data. The findings appeared recently in Nature Communications.

The team, led by Caroline Uhler, the study’s senior author and co-director of the Eric and Wendy Schmidt Center at the Broad, and Xinyi Zhang, first author on the study and a graduate student in Uhler’s lab, developed STACI and applied it to study a mouse model of Alzheimer’s disease.

STACI uses a kind of computational model called a neural network to analyze data generated by a technique called STARmap, which measures the expression of more than two thousand genes and maps their location in intact tissue. STARmap was developed by Xiao Wang, a core institute member at the Broad and co-author on the study.

The team used STACI to analyze brain tissue from the Alzheimer’s mouse model. By studying gene expression and the location of cells in the tissue, the scientists identified a part of the cortex in the mouse brain that was more likely to have significant plaque accumulations. With the help of G. V. Shivashankar, a study author and professor of mechano-genomics at ETH Zürich, the team also found that they could predict plaque size — a marker of disease progression — by analyzing just one feature of cells near the plaques: the structure of chromatin, the complex of DNA and protein that makes up chromosomes. The results suggest that chromatin structure could be a marker of Alzheimer’s disease progression.

“We began by asking how we can integrate these different data modalities,” said Uhler, who is also a core institute member at Broad and professor in the Department of Electrical Engineering and Computer Science at MIT. “What’s really exciting is that now, with STACI, we can begin to ask biological questions to learn more about disease by taking all modalities into account simultaneously.”

Zhang, who is also a fellow at the Schmidt Center, says that STACI is a useful tool for researchers because chromatin imaging is routine in labs and cheaper than measuring the gene expression of cells directly. “This study may provide simple, low-cost avenues for studying which regions of the brain are more affected by disease and for tracking disease progression,” she said.

Cells in space

In previous work, Uhler and Shivashankar showed that they could use computational techniques to analyze single-cell RNA sequencing data along with chromatin images. They collaborated with Wang to incorporate the analysis of cell location data from STARmap and build STACI.

STACI relies on a neural network, which learns patterns from “training” data to predict characteristics of new data. To develop STACI, the researchers trained it to build a map, called a latent space, that groups together cells with similar locations, gene expression, or chromatin structure. They then used STACI to analyze images of chromatin from mouse brain tissue.

From this latent space, the scientists found that the size of plaque deposits is highly correlated with the ratio of heterochromatin to euchromatin, which indicates how densely packed the chromatin is. This relationship suggests DNA packing could be a marker of disease progression.

The team says the connection between chromatin density and plaques suggests new questions in Alzheimer’s research. They hope their findings will spur other groups to investigate the biological relationship between DNA packing and plaque build-up.  

Branching out

Brain tissue samples can vary widely in how they are collected and prepared, but the scientists designed STACI to account for this variation. The technique could also be applicable to other spatial data types, such as from Slide-Seq — developed by Fei Chen, Evan Macosko and other colleagues at the Broad — as well as Visium and MERFISH.

Uhler adds that STACI could also help researchers learn more about other diseases, since many have important spatial features. She envisions using the framework to analyze the local microenvironment in cancer, fibrosis or scarring in the lungs or other tissues, as well as developmental processes. As scientists apply STACI to new problems, they’ll likely encounter new analytical challenges, but she thinks this is an opportunity to help the model expand.

“This work shows how biology can be a great inspiration for novel computational questions and developments,” Uhler said. “And that’s really exciting.”

This work was supported in part by the Eric and Wendy Schmidt Center, the Simons Foundation, the Office of Naval Research, the National Institutes of Health, and the National Science Foundation.

Adapted from a news story posted on the Broad Institute website.

Cells
Representation Learning
November 21, 2022
Eric and Wendy Schmidt Center announces data science challenge to harness machine learning for cancer immunotherapy
2022

The immune system is adept at fighting off viral and bacterial infections, but it can also find and attack cancer in the body. Cancer cells, however, are skilled at disarming the immune system’s T cells — allowing tumors to continue growing unabated.

Scientists at the Broad Institute of MIT and Harvard and beyond have been looking for ways to genetically modify T cells to improve their cancer-fighting ability. Now the Eric and Wendy Schmidt Center at the Broad Institute is joining this effort, by holding a data science challenge this winter that will call on machine learning enthusiasts to develop algorithms that identify effective genetic modifications in T cells.

Winners will receive monetary prizes at each stage — and, unlike in most data science challenges, the top-scoring participants will have their submissions experimentally validated. Members of a cancer immunology lab at Broad led by institute member Nir Hacohen will make the top-ranked genetic modifications in T cells in the lab and assess the cells’ cancer-fighting abilities.  

The "Cancer Immunotherapy Data Science Grand Challenge" was announced earlier this month at the online coding tournament Topcoder Open, and will run from January 9 to February 3, 2023. The Eric and Wendy Schmidt Center is partnering with Harvard’s Laboratory for Innovation Science, the MIT Department of Electrical Engineering and Computer Science, Topcoder, and Massachusetts General Hospital (MGH) to run the challenge.

“Machine learning experts have largely gone into the fields of big technology and finance. With this challenge, we’re describing an important problem in cancer immunology in a way that is approachable for computational minds — thus hoping to entice more of these experts to the life sciences,” said Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT.

Improving cancer immunotherapy through machine learning

Cancer immunotherapies boost the immune system to fight off cancer in a variety of ways. Scientists have made many breakthroughs in cancer immunotherapy in the last decade, such as the development of several FDA-approved checkpoint blockade and “CAR T” therapies. CAR T treatments involve removing T cells from a cancer patient, genetically engineering them in the lab to target tumors, and then reintroducing them back into the patient. However, these treatments work for only a small number of cancer types and only in some patients.

“We hope that this challenge will allow us to quickly hone in on the most promising perturbations so we can better target our experimental validation.” — Nir Hacohen

To make T cell-based immunotherapies more effective for more patients, scientists are looking for other genetic changes they can introduce in T cells to make them better cancer killers. With the development of genome-editing technologies such as CRISPR in the last decade, researchers can look for those desirable changes by performing large-scale genetic screens to systematically modify or knock out each gene and study the effect of these “perturbations” at the single-cell level.

However, perturbing each of the 20,000 genes in the cell or the several hundred million different combinations of genes in the lab would be too costly and time-consuming. Machine learning can help, by predicting which genetic perturbations might be most effective.
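The scale mentioned above can be checked with a quick back-of-the-envelope calculation. The sketch below assumes that "combinations" refers to unordered pairs of genes drawn from roughly 20,000 genes (an interpretation, not something the article states), and also shows how quickly the count grows for triples.

```python
from math import comb

# Assumption: about 20,000 genes, as quoted in the article.
n_genes = 20_000

# Unordered pairs of genes: n * (n - 1) / 2
pairs = comb(n_genes, 2)

# Unordered triples, to illustrate how fast the search space grows.
triples = comb(n_genes, 3)

print(f"single-gene perturbations: {n_genes:,}")
print(f"pairwise combinations:     {pairs:,}")
print(f"three-gene combinations:   {triples:,}")
```

Under this assumption there are about 200 million gene pairs, which matches the order of magnitude quoted in the article, and more than a trillion three-gene combinations.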

“We hope that this challenge will allow us to quickly hone in on the most promising perturbations so we can better target our experimental validation,” said Hacohen, director of the Broad Institute’s Cell Circuits Program, institute member of the Broad Institute, and director of MGH’s Center for Cancer Immunology. “The predictions from this challenge will provide a crucial step toward making cancer immunotherapy more effective for more patients.”

The Cancer Immunotherapy Data Science Challenge will consist of three parts that will run at the same time. In the first part, participants will use transcriptomic and perturbational data from T cells in mouse tumors to develop algorithms that predict the effect of perturbations that have already been studied in the lab, allowing them to see how well their algorithms work. In part two, they’ll come up with a metric for ranking how well a particular gene knockout would shift T cells to a desired state.

And, third, participants will use their algorithms to propose perturbations that boost T cells’ ability to destroy tumors. The top-scoring participants from part one will have their proposed perturbations experimentally validated.

“Data science challenges like this one draw on the power of the crowd to bring in outside computational and creative machine learning techniques to solve biological problems,” said MarcAntonio Awada, head of research and data science at Harvard’s Digital, Data, and Design Institute. “In the past, crowdsourcing has led to out-of-the-box approaches and completely novel solutions compared to what experts had come up with.”

Unique learning and data access opportunities

The challenge will run concurrently with an Independent Activities Period course at MIT, which brings together computer science and biology students to collaborate on this problem. “The course provides a great opportunity for MIT students to apply their education and see that what they’re learning in the classroom has a direct impact on answering critical biomedical questions,” said Uhler, who is one of the course’s instructors.

A biology background isn’t necessary to participate. The Eric and Wendy Schmidt Center will provide all challenge participants with an online crash course on cancer immunology and unique features of the large-scale datasets. Interested participants can pre-register now as an individual or as part of a team on Topcoder, which is hosting the challenge on their platform.

Participants will have free access to Saturn Cloud to complete the challenge.

Adapted from a news story posted on the Broad Institute website.

May 13, 2022
Workshop sparks new tissue biology and AI research areas and collaborations
2022

Advancing our understanding of tissue biology requires tight collaborations between biologists with driving questions, technologists creating new experimental methods, and computational scientists who are creating new ways of analyzing data. One of the key aims of an April 27 workshop held by the Eric and Wendy Schmidt Center and the Klarman Cell Observatory at the Broad Institute was to explore the interface between these disciplines. Speakers and panelists included researchers at Stanford University, MIT, Harvard University, the Sloan Kettering Institute, UC Berkeley, Princeton University, and the Broad Institute.

The workshop brought together a diverse set of communities to discuss new tissue biology research questions — and new opportunities for collaboration between the biomedical sciences and machine learning.

Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and an associate professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT, told workshop attendees during opening remarks that biology has seen an “explosion” of data in recent years. “We now have the opportunity to understand the programs of life, so not just the units (like genes or single cells), but actually the interactions between these units.”

Biological research frontiers

These cellular interactions play a key role in the cancer immunotherapy research shared by keynote speaker Garry Nolan, a professor in the pathology department at Stanford University. His research team develops algorithms to model tissue areas where different groups of cells interact, areas he calls “interface zones,” to gain insights into how cancer remodels its surrounding tissue and evades the body’s immune system. These interface zones are critical as the locus of cellular changes that lead to tumor growth.

“I would urge you, when you're looking at your RNA data sets, to the extent that you can call out these kinds of interface zones, pay special attention to the RNA changes that are occurring there,” said Nolan, adding later: “The boundary space is where the action is.”

Additionally, biologists should reconsider labeling tumors and other features “heterogeneous,” which implies that tumors from different patients are too distinct from one another to be compared. “There is an order here that can be extracted,” said Nolan.

“If you ask me, I would say that we'll never be able to fully understand and model cell function if we don't include spatial proteomics data.” — Emma Lundberg

Meanwhile, keynote speaker Emma Lundberg, an associate professor of bioengineering at Stanford and co-director of the Human Protein Atlas, outlined how her team has mapped where proteins are located in cells — a process known as "spatial proteomics." Interestingly, over half of proteins can be found in more than one part of the cell, which changes how they function.

“If you ask me, I would say that we'll never be able to fully understand and model cell function if we don't include spatial proteomics data,” said Lundberg.

Panelists also discussed next steps for engineered tissues and artificial organs in disease study and regenerative medicine. Sangeeta Bhatia, a professor of health sciences and of electrical engineering and computer science at MIT’s Koch Institute, said that researchers have been able to engineer artificial tissues and organs that have little structure, like the skin and cartilage, for decades. Now, they're moving on to endocrine tissue, like the pancreas and liver. “Then you start to think about the tissues whose function is dependent on architecture, like the kidney, the lung — that's the next frontier, and I think we are not quite there yet,” she said.

One challenge brought up by Paola Arlotta, a professor of stem cell and regenerative biology at Harvard University, is how to factor genetics into tissue and organ models. One way to do this is to see how cells from different individuals respond to the same kinds of disturbances. If researchers don’t take genetic variability into account, “we’re ignoring a fundamental component of what human disease is,” she said.

Computational and technological challenges

Keynote speaker Dana Pe’er, chair of the Computational and Systems Biology Program at the Sloan Kettering Institute, outlined computational limitations that need to be addressed to answer pressing biological questions. For example, as researchers move from profiling a small section of a tissue to mapping a whole tissue or organ in different samples, they need to be able to map different tissue sections to each other.

“We’re still largely trying to figure out how to process this data, which is hampering our ability to interpret and powerfully utilize the data,” Pe’er said.

Given that there’s not yet a spatial profiling technology that can provide both high-resolution and high-content information on features like proteins, researchers will often need to combine a spatial profiling method with single-cell data.

Barbara Engelhardt, an associate professor of computer science at Princeton University, said taking multiple images from the same type of tissue and aligning them can help researchers better understand cell type variability.

At the end of the second panel, Anthony Philippakis, co-director of the Eric and Wendy Schmidt Center and chief data officer of the Broad Institute, asked panelists whether they had any “recipes for success” to foster collaborations between the two fields.

Bhatia emphasized the importance of having researchers, or research teams, who are “bilingual” — that is, able to understand both experimental and computational biology. “It doesn’t work well if you’re just the recipient of data and you don’t understand the context,” Bhatia said. “We have to create these teams where we can really speak both languages.”

Starting the conversations needed to build this bilingual proficiency was precisely the goal of the workshop.

Events
Tissues
April 13, 2022
Fellows develop AI methods to design antibodies and virtually screen drugs
2022

Wengong Jin planned to research language processing for his computer science PhD. But when Jin learned about research on machine learning for drug discovery at the MIT Computer Science and Artificial Intelligence Laboratory, he told his advisor, Regina Barzilay, that he’d had a change of heart.

“She thought I was jet lagged, because I’d just come over from China and I was proposing a really big switch,” he said.

Jin, now a fellow at the Eric and Wendy Schmidt Center, stayed the course. Six years later, he and a team of researchers have come up with a new kind of model to automatically design antibodies — holding huge potential for immunotherapy.

Meanwhile, another Eric and Wendy Schmidt Center Fellow, PhD candidate Adit Radhakrishnan, recently developed a simple yet powerful method for virtually screening new drug candidates. That framework appears in a study published this April in Proceedings of the National Academy of Sciences.

“A number of research institutes have started using machine learning to answer key questions in biology. But at the Eric and Wendy Schmidt Center, as Jin’s and Radhakrishnan’s research shows, our goal is to also go in the other direction, by using biomedical problems to drive advances in machine learning,” said Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT.

Game-changer for antibody design

Discovering drugs has traditionally been a labor-intensive process, with researchers toiling away for years to test millions of molecules only to come up with a handful of candidates. Now, researchers like Jin and Radhakrishnan are working to automate that process.

“The idea is that we don't need experts to get a cup of coffee and then work all night trying to figure out a new molecule, but rather, to let the machine do the heavy lifting,” Jin said.

Wengong Jin

During his PhD, Jin was part of a research team that developed a machine-learning algorithm to speed up antibiotic discovery. The researchers found a new antibiotic that was effective against bacteria that are resistant to multiple drugs. In this instance, the team provided the model with roughly a million possible compounds to sort through.

That left Jin and other researchers wondering: Could they use artificial intelligence to design molecules from scratch?

The answer was yes. Jin and other researchers developed a generative model that designed antibodies — Y-shaped proteins that bind to viruses, bacteria, and other pathogens, activating our bodies’ immune response — that could neutralize the SARS-CoV-2 virus. Their findings were published earlier this year in a paper at the International Conference on Learning Representations.

“The new model can propose in a couple of seconds an antibody that has a high likelihood of working — totally changing the game,” said Jin.

While researchers had worked on generative models for antibody discovery before, those models could only come up with a protein’s amino acid sequence — not its shape. In contrast, the new model, which represents the antibody as a graph, simultaneously designs both the sequence and structure of its binding region. “Whether or not the antibody is the right shape to bind to a virus or other pathogen is crucial to its success,” said Jin.

“The new model can propose in a couple of seconds an antibody that has a high likelihood of working — totally changing the game.” — Wengong Jin

“While human experts have methods to generate neutralizing antibodies, it takes time and effort. The task becomes even more challenging when additional properties need to be enforced. As our understanding of disease biology and the immune system deepens, the number of such desired characteristics will continue to grow. Computational methods for antibody design are particularly useful to address this challenge,” said Regina Barzilay, the AI faculty lead for the MIT Jameel Clinic for Machine Learning in Health.

And, because so many types of data are structured as networks, the model also represents an advance in the field of machine learning. “It’s an example of how biology proposed a new problem for machine learning to solve,” said Jin.

An old machine-learning method repurposed for virtual drug screening

Adit Radhakrishnan's father had pursued a mathematics education in India prior to immigrating to the U.S. He instilled in his son a love of math, which led the younger Radhakrishnan to pursue a PhD of his own in electrical engineering and computer science at MIT.

Radhakrishnan researches the fundamentals of deep learning — a kind of artificial intelligence modeled after the human brain that processes unstructured data. Understanding why deep learning is successful, and using that knowledge to build novel models for the healthcare and genomic space, underpins much of Radhakrishnan’s research as an Eric and Wendy Schmidt Center fellow.

Adit Radhakrishnan

Over the past few years, deep learning has become widely adopted in biological applications, with researchers increasingly turning to it to screen potential new drugs. In order to perform well on such tasks, researchers use very large deep learning models that often require significant computing power. Moreover, the complexity of this approach makes it hard for scientists to understand why these models make a given prediction, shedding little light on why a proposed drug could work.

To get around the complexities of deep learning, Radhakrishnan and other researchers, including Uhler and Mikhail Belkin, a professor at the Halıcıoğlu Data Science Institute at the University of California, San Diego, turned to an older class of machine learning models: kernel methods. Prior to the recent wave of deep learning, kernel methods were a prominent and computationally simple approach for machine learning tasks. These models have recently become popular again since they can serve as a proxy for using very large deep learning models with much less computational burden.

The team came up with a simple yet highly adaptable kernel framework that was able to predict the effect that a drug has on gene expression, a measure of how cells change in response to a drug. “In contrast to the expertise needed to train large deep learning models to solve a particular problem, it takes about three lines of code to train the kernel method to do the same task,” said Radhakrishnan.
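To give a feel for why kernel methods can be trained in only a few lines, here is a minimal sketch of generic kernel ridge regression with a Gaussian kernel. It is not the team's framework: the kernel choice, the `bandwidth` and `ridge` values, the helper names, and the synthetic "drug features" and "expression responses" are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def fit(X, Y, bandwidth=1.0, ridge=1e-6):
    """Training is a single linear solve: (K + ridge * I) alpha = Y."""
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + ridge * np.eye(len(X)), Y)

def predict(X_train, alpha, X_new, bandwidth=1.0):
    """Predictions are kernel-weighted combinations of the fitted coefficients."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha

# Synthetic stand-in data (hypothetical): 50 "drugs" with 5 features each,
# and a 3-dimensional "expression response" per drug.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Y = np.sin(X @ rng.normal(size=(5, 3)))

alpha = fit(X, Y)
train_pred = predict(X, alpha, X)
new_pred = predict(X, alpha, rng.normal(size=(4, 5)))

print("max training error:", float(np.max(np.abs(train_pred - Y))))
print("prediction shape for 4 unseen inputs:", new_pred.shape)
```

The core of the method is the three lines inside `fit` and `predict`: build the kernel matrix, solve one linear system, and multiply. There is no iterative training loop or architecture search, which is the kind of simplicity the quote points to.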

The framework has uses beyond biology; the researchers demonstrated, for example, that it could be used by video streaming providers to predict how a viewer would rank a particular movie they hadn’t yet seen. And the framework allows researchers to gain insights into how more complex deep learning models function.

According to Radhakrishnan, who is not trained as a biologist, the best part of being a fellow at the Eric and Wendy Schmidt Center is that the center puts machine learning experts and biologists in constant conversation with each other.

“You don’t just have computational researchers running their methods on a biology dataset without a biologist in the mix. You can get continuous feedback on: Is this actually useful?” said Radhakrishnan. “So it gives you a much more guided focus on what biological problems are important and what computational methods are missing.”

Proteins
