Navigating the data jungle with deep learning
Comparing one cell with another requires focusing on what matters most. There’s only one problem: scientists are often unable to say at first what matters most. Even if two cells in the body produce the exact same molecules, they can still look different to scientists when analyzed. That may simply be because they were extracted on different days or in different labs and were exposed to stress during extraction. When data scientists work with such data, they therefore face the problem of “ground truth”: without a reliable reference for what the correct answer should be, no model can be evaluated.
Pia Rautenstrauch uses algorithms and computer models to try to fish the relevant information out of a sea of data. The doctoral candidate is part of Professor Uwe Ohler’s lab at the Berlin Institute for Medical Systems Biology (BIMSB) at the Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC). She recently participated in a data challenge hosted by NeurIPS, the world’s largest machine learning conference. The results are slated to be published soon in a special issue of Proceedings of Machine Learning Research.
High-dimensional data
Last year, international researchers from universities and industry organized the NeurIPS competition. They supplied an enormous data set from the single cell sequencing of 120,000 bone marrow cells, with as much “ground truth” as possible. The participants’ goal was to develop new methods of analysis within two months. As traditional statistical methods cannot cope with the increasing complexity of data from single cells, algorithms and deep learning models are necessary to interpret the high-dimensional data.
Just a few years ago, it was simply impossible to collect several data types at once from thousands of cells of one tissue and see individual differences. Only the latest technology from single cell genomics allows scientists to measure various biomolecules within a single cell.
Three data types that shed light on cellular gene regulation were made available to the participating scientists. The first, chromatin accessibility, indicates when and where the genes on the chromosomes are open to DNA-binding proteins, for example to start RNA transcription. This accessibility controls the rate of gene expression, which constituted the second data type. The third data type captured the proteins present on the cell surface.
All three data types depend on one another. The participants could therefore use them to develop models that reliably predict one data type from another. For example, they could infer the abundance of surface proteins from gene expression.
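Such a cross-modality prediction can be sketched in miniature. The example below fits a ridge regression that maps gene-expression counts to surface-protein levels; all data here are synthetic stand-ins, and the actual competition models were of course far more elaborate than a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 cells, 50 genes, 10 surface proteins.
n_cells, n_genes, n_proteins = 200, 50, 10
X = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)  # gene expression counts
W_true = rng.normal(size=(n_genes, n_proteins)) * 0.1
Y = X @ W_true + rng.normal(scale=0.1, size=(n_cells, n_proteins))  # protein levels

# Ridge regression in closed form: W = (X^T X + lambda I)^-1 X^T Y
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(n_genes), X.T @ Y)

# Predict protein abundance from gene expression and score the fit.
Y_pred = X @ W
r2 = 1 - ((Y - Y_pred) ** 2).sum() / ((Y - Y.mean(axis=0)) ** 2).sum()
print(f"R^2 on training data: {r2:.3f}")
```

Because the synthetic data were generated by a linear rule, the fit is nearly perfect; on real single-cell data the relationship is noisier and nonlinear, which is why deep learning models are used.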
Filtering out relevant signals
Rautenstrauch focused on a different task, however. She wanted to figure out which cells are similar to one another in order to then identify the different cell types contained in the samples. To do so, she first had to navigate her way through the data jungle so that in the end she could create a simple, low-dimensional representation of the data. She thought about how she could group the cells and meaningfully aggregate various data types so that genuine biological differences would stand out.
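The idea of compressing cells into a low-dimensional representation and then grouping them can be illustrated with standard tools. The sketch below uses PCA (via a singular value decomposition) followed by a minimal k-means loop, on synthetic data with two invented "cell types"; it illustrates the principle only and is not Rautenstrauch's method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: two "cell types" in a 100-dimensional feature space.
type_a = rng.normal(loc=0.0, size=(150, 100))
type_b = rng.normal(loc=3.0, size=(150, 100))
X = np.vstack([type_a, type_b])

# PCA via SVD: project all 300 cells onto the top 2 principal components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T  # low-dimensional representation, shape (300, 2)

# Minimal k-means (k=2) on the embedding; init with one cell from each end.
centers = Z[[0, -1]]
for _ in range(20):
    labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([Z[labels == k].mean(axis=0) for k in range(2)])
```

In this toy setting the two groups separate cleanly along the first principal component; the difficulty with real single-cell data is that technical artefacts can produce equally convincing-looking clusters.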
It was a challenge, as the three data types each have different mathematical and statistical properties. “My goal was to filter out the technical differences that can result from cell stress during sample extraction,” Rautenstrauch says. The problem, she explains, is that all cells are destroyed during single-cell sequencing. It is therefore hard to tell afterwards whether observed differences point to two distinct cell types or stem from a measurement error, i.e., an artefact. For the competition, Rautenstrauch used a deep learning model that she had developed herself. Such models learn from data on their own, continually improving at meaningfully aggregating two data types and filtering out artefacts. She came in second place for the combination of “gene expression and surface protein” and fourth place for “gene expression and chromatin accessibility.”
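A rough, linear analogue of such a joint embedding can be sketched as follows: standardize each modality so that neither dominates by sheer scale, concatenate them, and take a truncated SVD as the shared low-dimensional code. The data below are synthetic, and a real deep learning model would replace the SVD with a learned, nonlinear encoder.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in: 300 cells share a hidden 5-dim "biological state"
# observed through two modalities with very different scales and noise.
n_cells, d_state = 300, 5
state = rng.normal(size=(n_cells, d_state))
rna = state @ rng.normal(size=(d_state, 40)) + rng.normal(scale=0.5, size=(n_cells, 40))
protein = 10 * (state @ rng.normal(size=(d_state, 15))) + rng.normal(scale=5.0, size=(n_cells, 15))

# Standardize each modality separately, then concatenate per cell.
def zscore(m):
    return (m - m.mean(axis=0)) / m.std(axis=0)

joint = np.hstack([zscore(rna), zscore(protein)])

# Truncated SVD yields one shared low-dimensional code per cell -- a
# linear stand-in for the learned embedding of a deep autoencoder.
U, S, Vt = np.linalg.svd(joint, full_matrices=False)
embedding = U[:, :d_state] * S[:d_state]
print(embedding.shape)  # one 5-dim code per cell
```

Standardizing before concatenating is the key step here: without it, the modality with the larger numeric range would dominate the shared code.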
Rautenstrauch’s deep learning model may have applications for the Human Cell Atlas – a reference map of all cells in the human body. The raw data it is based on are heterogeneous in multiple ways. They come from thousands of donors from various age groups around the world, and the samples were extracted using various types of equipment in over 2,000 laboratories. Such differences are reflected in the data and can obfuscate essential biological differences between cell types.
A different type of prestigious publication
The NeurIPS challenge was the first competition for single cell sequencing data with a standardized problem and predefined evaluation criteria. The participants’ solutions are now slated to be judged and published. “In our community we publish our results in journals, but we prefer to present them at large conferences,” Rautenstrauch explains. She and Ohler are named as consortium authors on the accompanying manuscript. There is also a peer review process. The competition was sponsored mostly by U.S. institutions and companies (Cellarity, Yale University, the Chan Zuckerberg Initiative, and the Chan Zuckerberg Biohub) but also by Helmholtz Center München.
Text: Christina Anders
Further information
- Genomic regulatory map of the zebrafish
- More about the data set and the competition
- NeurIPS Conference 2021