Data science challenge reveals new research directions for cancer immunotherapy

More than 1,000 people registered for the challenge, which harnessed machine learning to predict ways to make T cells better cancer-cell killers.
A pseudo-colored scanning electron micrograph shows two T cells (red) attacking a cancer cell (white).
Credit: Rita Elena Serda, Duncan Comprehensive Cancer Center at Baylor College of Medicine, National Cancer Institute, National Institutes of Health
Elizabeth Gribkoff
February 5, 2024

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is delighted to announce the completion of its Cancer Immunotherapy Data Science Grand Challenge.

Participants in the challenge developed algorithms to uncover new ways to modify, or “perturb,” T cells to make them more effective at killing cancer cells. Scientists in the Hacohen Lab at the Broad then tested their predictions in mouse models. To the Schmidt Center’s knowledge, this is the first challenge in which new experiments were performed based on the output of machine-learning models developed during the challenge.

While it’s too early to say whether any of the proposed perturbations could prove useful for cancer treatment, the researchers plan to further study some of the identified perturbations and the algorithms that gave rise to them.

The Schmidt Center partnered with Harvard’s Laboratory for Innovation Science (LISH), the MIT Department of Electrical Engineering and Computer Science, Topcoder, Gordian Biotechnology, and Saturn Cloud to run the challenge. More than 1,000 people from around the world registered for the competition.

“We are thrilled that our first data science challenge attracted so many participants, including various machine-learning experts who had not previously worked on biological problems,” said Caroline Uhler, director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT.

Karim Lakhani, founder and co-director of LISH and a professor of business administration at the Harvard Business School, said: “At LISH, we believe that data science challenges can help organizations harness the power of the crowd to answer pressing questions in biology and other fields. We hope this challenge will serve as a case study in how machine-learning experts can collaborate with biologists to improve experimental design.”

Boosting cancer research with machine learning

Cancer immunotherapy seeks to harness the body’s immune system, most often T cells, to recognize and kill cancer cells while leaving healthy cells alone. In the last decade, there have been many breakthroughs in cancer immunotherapy, yet treatments still work for only some cancer patients, and only some of the time.

“We’re hopeful that challenges like this can help us home in on T-cell perturbations that could ultimately lead to new therapeutics and make cancer immunotherapy work for more patients,” said Nir Hacohen, an institute member at Broad, director of the Broad Institute’s Cell Circuits Program, and director of the Center for Cancer Immunology at Mass General Brigham.

Marc Schwartz, a postdoctoral fellow in the Hacohen Lab, previously ran experiments testing the effects of 73 gene knockouts in T cells in mouse models. Because researchers can’t scale mouse model experiments beyond 100 or so genes at a time, it’s not feasible to test every gene in a particular disease pathway, Schwartz explained.

“That’s why we were excited about the idea of testing a limited number of genes that we think are important and then training an algorithm to learn something that we can’t see from that data on our own,” he added.

The overarching data science challenge was divided into three parts that ran as individual data science competitions on Topcoder. In Challenge 1, participants received gene expression data from 66 of the 73 T-cell gene knockouts from Schwartz’s experiments as training data. They then developed an algorithm that could predict how knocking out the seven “held-out” genes would affect T cells.
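As a rough illustration of the held-out setup in Challenge 1, the sketch below splits knockout profiles by gene and builds a naive baseline that predicts the average training profile for every unseen knockout. The gene names, the three-value profiles, and the helper functions are hypothetical placeholders, not the challenge's actual data format or any participant's method.

```python
def split_by_gene(profiles, held_out):
    """Split knockout profiles into training and held-out sets by gene name."""
    train = {g: p for g, p in profiles.items() if g not in held_out}
    test = {g: p for g, p in profiles.items() if g in held_out}
    return train, test


def mean_profile(train):
    """Naive baseline: average the training profiles element by element."""
    n = len(train)
    length = len(next(iter(train.values())))
    return [sum(p[i] for p in train.values()) / n for i in range(length)]


# Toy data (assumed): each knockout maps to a short numeric profile,
# for example fractions of T cells observed in three hypothetical states.
profiles = {
    "geneA": [0.6, 0.3, 0.1],
    "geneB": [0.2, 0.5, 0.3],
    "geneC": [0.4, 0.4, 0.2],
    "geneD": [0.1, 0.2, 0.7],
}

train, test = split_by_gene(profiles, held_out={"geneD"})
baseline = mean_profile(train)

# Every held-out knockout receives the same baseline prediction.
predictions = {gene: baseline for gene in test}
```

A real entry would replace `mean_profile` with a trained model; the point here is only that the held-out genes never appear in the training set.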

Challenge 2 participants used their algorithms from the first challenge to propose new gene knockouts (picking from any of the 20,000 genes in the entire genome) to shift as many T cells as possible into a cancer-fighting state. In Challenge 3, participants proposed a metric for ranking how well a particular gene knockout would bring about this desired shift in T cells.
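One simple way to picture the ranking idea behind Challenges 2 and 3 is to sort candidate knockouts by the predicted fraction of T cells that end up in the desired state. The sketch below does exactly that; the state names, gene names, and numbers are invented for illustration, and the metrics actually proposed by participants may have been quite different.

```python
def rank_knockouts(predicted_states, target_state):
    """Return gene names sorted from highest to lowest predicted
    fraction of cells in the target state."""
    return sorted(
        predicted_states,
        key=lambda gene: predicted_states[gene][target_state],
        reverse=True,
    )


# Hypothetical model output: predicted fraction of T cells per state.
predicted_states = {
    "geneA": {"cancer_fighting": 0.35, "other": 0.65},
    "geneB": {"cancer_fighting": 0.60, "other": 0.40},
    "geneC": {"cancer_fighting": 0.10, "other": 0.90},
}

ranking = rank_knockouts(predicted_states, target_state="cancer_fighting")
top_candidate = ranking[0]
```

A more careful metric might also penalize knockouts that push cells into undesirable states, which is the kind of trade-off a proposed scoring function would need to encode.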

To make the challenge accessible to participants without a biology background, Orr Ashenberg, associate director of computational biology at the Klarman Cell Observatory of the Broad Institute, produced a 1.5-hour crash course on cancer biology, genetic perturbations, and single-cell sequencing technologies.

Orr Ashenberg, associate director of computational biology at the Broad's Klarman Cell Observatory, delivers a lecture on single-cell sequencing technologies.

The Schmidt Center announced the winners of Challenges 1 and 3 last March. The researchers then ran the top-scoring algorithms from Challenge 1 to predict which genes to knock out to mimic two kinds of cancer immunotherapy: CAR T-cell therapy and checkpoint blockade therapy. Next, Schwartz conducted experiments to see how well the proposed gene knockouts performed in a mouse model. To determine the Challenge 2 winners, Schmidt Center research fellow Jiaqi Zhang, who was instrumental in developing the challenge, calculated how well each participant’s algorithm from Challenge 1 predicted the effects of those roughly 60 gene knockouts.
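Scoring algorithms against new experimental results could, in principle, look like the following sketch, which compares each submission's predictions with measured outcomes and orders teams by average error. The L1 error, the team names, and the two-value profiles are assumptions made for illustration; the article does not specify the scoring function that was actually used.

```python
def mean_l1_error(predicted, observed):
    """Average, over knockouts, of the summed absolute difference
    between predicted and observed profiles."""
    total = 0.0
    for gene, obs in observed.items():
        total += sum(abs(p - o) for p, o in zip(predicted[gene], obs))
    return total / len(observed)


# Hypothetical measured outcomes for two validated knockouts.
observed = {"geneX": [0.5, 0.5], "geneY": [0.2, 0.8]}

# Hypothetical predictions from two teams for the same knockouts.
submissions = {
    "team_1": {"geneX": [0.5, 0.5], "geneY": [0.3, 0.7]},
    "team_2": {"geneX": [0.9, 0.1], "geneY": [0.2, 0.8]},
}

scores = {team: mean_l1_error(pred, observed) for team, pred in submissions.items()}

# Lower error is better, so sort ascending.
leaderboard = sorted(scores, key=scores.get)
```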

The winners of Challenge 2 — the final part of the competition — are:

- First place: Brody Langille, Jordan Trajkovski, and Elizabeth Hudson

- Second place: mglettig (username)*

- Third place: Ai Vu Hong, researcher at Genethon, France

- Fourth place: Saket Kunwar, independent researcher, Nepal

- Fifth place: lxastro0 (username)*

- Sixth place: John Gardner, freelance data scientist

- Seventh place: agilsoft (username)*

- Eighth place: Basak Eraslan, postdoctoral researcher holding a joint position at the Regev Lab in Genentech and Kundaje Lab at Stanford University

- Ninth place: Haoyue Dai, Kun Zhang, Ignavier Ng, Yujia Zheng, Xinshuai Dong, and Yewen Fan from Carnegie Mellon University; Petar Stojanov, postdoctoral fellow at the Eric and Wendy Schmidt Center; Gongxu Luo, Mohamed bin Zayed University of Artificial Intelligence; and Biwei Huang, University of California, San Diego

- Ninth place: Liu Xindi, freelance programmer

- Ninth place: Johnson Zhou, Camille Sayoc, and Yi-Cheng Peng, Master’s students of the Faculty of Engineering and IT at the University of Melbourne, Victoria, Australia

The winning teams approached the problem using different deep-learning methods depending on the chosen input features. These features included gene expression and “chromatin accessibility,” the degree to which genetic information encoded in DNA can be accessed and read, measured by ATAC-seq peak counts. Additionally, some of the top-scoring teams incorporated learned representations from variational autoencoders (models that can capture meaningful features from raw data) or from graph neural networks constructed from the Gene Ontology database.
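To give a flavor of how a graph neural network can pool information across related genes, here is a minimal sketch of a single graph-convolution step in plain Python. The three-gene graph, the edge between the first two genes (imagined as sharing a Gene Ontology term), the feature values, and the weights are all invented; this is a generic textbook layer, not a reconstruction of any team's model.

```python
import math


def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adjacency)
    # Add self-loops so each gene keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    n_in, n_out = len(weights), len(weights[0])
    # Aggregate neighbor features for every gene.
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n)) for f in range(n_in)]
           for i in range(n)]
    # Apply the linear map followed by ReLU.
    return [[max(0.0, sum(agg[i][f] * weights[f][h] for f in range(n_in)))
             for h in range(n_out)]
            for i in range(n)]


# Toy graph (assumed): genes 0 and 1 are linked, gene 2 is isolated.
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
features = [[1.0], [3.0], [5.0]]  # one hypothetical feature per gene
weights = [[1.0]]                 # identity weight for readability

embeddings = gcn_layer(adjacency, features, weights)
```

In this toy case the two linked genes end up with the same averaged value while the isolated gene keeps its own, which is the basic smoothing behavior that lets such models share signal between functionally related genes.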

"We are grateful for the opportunity to participate in this challenge and are excited by the results,” said the first-place team in a prepared statement. “It's not often that you get invited to work on an important problem alongside preeminent scientists who furnish the problem description and data that you need to develop a novel solution — a novel solution that those same scientists can then turn around and validate in their lab.”

Martin Borch Jensen, chief scientific officer of Gordian Biotechnology, said: “Technological advances in sequencing have led to a vast amount of genomics data. As we pile up more and more transcriptomes from every type of cell in the human body, it becomes increasingly valuable to develop ways to understand how gene expression can cause and predict health and disease. I’m very excited for this competition to catalyze more work on this problem.”

Now, researchers at the Schmidt Center will further study the top-scoring algorithms to see if they can combine components from each into an even better predictive tool. The center plans to hold its second data science challenge later this year.

*Editor's note: Usernames were used instead of participant names in cases where the Schmidt Center could not get in touch with winners.
