Data Science, the extraction of relevant insights from data, has firmly positioned itself at the forefront of health and life science research. Machine Learning and Deep Learning are foreseen to become key disruptive technologies within Data Science in translational research and personalized medicine applications.
The cross-cutting area Data Science and Artificial Intelligence currently comprises 16 Labs and four Technology Platforms spanning different research areas at the Max Delbrück Center, with emphasis on:
Data Science at the Max Delbrück Center develops applications across the different research areas to:
Data Science Platforms at the Max Delbrück Center support established and front-end application of technologies, provide instrumentation and methodologies and are engaged in collaborative research projects and technology developments.
Omics and precision medicine addresses models for gene regulatory networks to understand how they are encoded in the epigenome and non-coding regulatory sequences. This is complemented by machine learning to model cancer evolution, as well as to use gene regulation and expression data to characterize cancer subtypes at the molecular level.
Aided by high-resolution single-cell molecular data, we anticipate that our AI efforts will enable the interpretation of patient genomes in different contexts, such as cancer cohorts or patient-derived organoids, as well as in vitro differentiation systems that recapitulate the molecular disease phenotypes of specific patient variants.
Data integration and disease modelling uses genomic, proteomic and phenotyping data in mechanistic mathematical multiscale models to elucidate and describe disease mechanisms. The particular strength of this approach is the integration of heterogeneous data to establish the link to cell and organ function.
Biomedical image analysis and complex phenotyping platforms that provide phenotypic readouts across multiple scales form a key component of Data Science at the Max Delbrück Center. Our groups develop new algorithms to process, visualize, and analyze large-volume datasets, enabling integrative analyses and data-driven classification for more reliable diagnosis and personalized treatment. Our platforms span scales from in vivo live imaging of whole model organisms down to single-molecule readouts.
Epidemiology and health data integration studies connect the above approaches with large-scale data from clinical studies or insurance records that provide a wide array of phenotypic data for healthy and affected individuals. Harmonizing clinical data across study centers and patients will allow the bidirectional projection of findings between healthy cohorts and patient groups with severe diseases, as well as the study of treatment effects.
Data Science at the Max Delbrück Center coordinates efforts towards the development of platforms and bioinformatics tools for:
with the overall aim to reveal new insights on disease progression and treatment.
Below you can find a non-exhaustive list of Bioinformatics Tools developed by Data Science research labs at the Max Delbrück Center. For a complete list of Bioinformatics Tools, please visit the homepage of the respective research lab.
genomation
R package that contains a collection of tools for visualizing and analyzing genome-wide data sets.
janggu
Deep learning infrastructure for bioinformatics.
maui
Deep learning-based heterogeneous data analysis toolkit.
methylKit
R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing
netSmooth
R/Bioconductor package for network smoothing of single cell RNA sequencing data
PiGx
Collection of reproducible genomics pipelines for: 1. raw fastq read data of bisulfite experiments, 2. RNA-seq samples and single-cell RNA-seq analysis, 3. reads from ChIP-seq experiments, and 4. the analysis of sequence mutations in CRISPR-Cas9 targeted amplicon sequencing data.
RCAS
R package that provides dynamic annotations with interactive figures and tables for custom input files that contain transcriptomic target regions
DigestiFlow
Automated and reproducible conversion of base calls to sequences and the demultiplexing thereof including sample sheet management and comprehensive data and processing QC.
MeTaQuaC
Comprehensive quality control, rich visualization, and summary statistics for large scale metabolomics studies.
SCelVis and SCelVis-demo
Cloud-ready interactive visualization of single-cell RNA-seq data. It provides easy-to-use yet flexible means of scRNA-seq data exploration for researchers without a computational background.
VarFish and VarFish server
User-friendly web application for the quality control, filtering, prioritization, analysis, and user-based annotation of DNA variant data with focus on rare disease genetics.
Linnaeus
Online tool and interactive portal for lineage tracing by nuclease-activated editing of ubiquitous sequences.
PatchPerPix
Automated segmentation of densely clustered and/or overlapping objects in microscopy data
BIQ
Web server to query RNA-seq datasets for backsplice junctions
CoBold
Visualization app of CoBold predictions of transient RNA structure features that could aid or hinder the formation of a given RNA secondary structure.
CoFold
Web server for the prediction of RNA secondary structure that takes co-transcriptional folding into account.
R-chie and R4RNA
Web server and R package for plotting RNA secondary structures, trans RNA-RNA interactions and genomic interactions.
RNA-Decoder
Web-based comparative tool for finding and folding RNA secondary structures within protein-coding regions.
SimulFold
Web server using a Bayesian MCMC framework for co-estimating an RNA structure including pseudo-knots, a multiple-sequence alignment and an evolutionary tree, given a set of evolutionarily related RNA sequences as input.
Transat
Web server for detection of conserved helices of high statistical significance, including pseudo-knotted, transient and alternative structures.
DeepRipe
Multitask and multimodal deep neural network for characterizing in vivo RBP binding preferences and interpreting RNA binding protein target preferences.
FootprintPipeline
Pipeline to find transcription factor footprints in DNase-seq or ATAC-seq datasets.
JAMM
Peak finder for NGS datasets (ChIP-Seq, ATAC-Seq, DNase-Seq, etc.) that can integrate replicates and assign peak boundaries accurately.
ORFquant
R package for detecting and quantifying ORF translation on complex transcriptomes using Ribo-seq data.
PARpipe
Complete analysis pipeline for PAR-CLIP data that provides: 1. Pre-processing and alignment of reads, 2. Definition of interaction sites, 3. Additional site-level metrics, 4. Annotation of reads, groups, and clusters, and 5. Meta-analysis of binding sites relative to important transcript features.
Overview of Data, Software and Resources
ALVIS
Java-based software with a graphical user interface for the embedded analysis and visualisation of multiple-sequence alignments (MSAs).
GAMIBHEAR
R package for whole-genome phasing of germline variants and haplotype reconstruction from Genome Architecture Mapping (GAM) data.
MEDICC
Finite-state transducer-based framework and software tool for the reconstruction of phylogenetic trees from allele-specific copy-number profiles.
Jointly with Charité researchers within the Berlin Institute of Health (BIH), the Berlin Long-term Observation of Cardiovascular Events (BeLOVE) follows circa 10,000 subjects with primary cardiovascular disease or its key precursor, type 2 diabetes. BeLOVE allows for direct observation of disease comorbidities and the study of mechanisms, differential risk factors and determinants of treatment efficacy.
The Berlin Institute for the Foundations of Learning and Data (BIFOLD) aims to conduct research into the scientific foundations of Big Data and Machine Learning, to advance AI application development, and greatly increase the impact to society, the economy, and science.
The German Network for Bioinformatics Infrastructure (de.NBI) is a national, academic and non-profit infrastructure supported by the Federal Ministry of Education and Research providing bioinformatics services to users in life sciences research and biomedicine in Germany and Europe. The partners organize training events, courses and summer schools on tools, standards and compute services provided by de.NBI to assist researchers to more effectively exploit their data.
The Helmholtz Artificial Intelligence Cooperation Unit (Helmholtz AI) is one of five platforms initiated by the Helmholtz Information and Data Science Incubator. Its main goal is to become a driver for applied artificial intelligence (AI) through the development and distribution of AI methods across all Helmholtz centres, effectively combining AI-based analytics with Helmholtz' unique research questions and datasets.
The Helmholtz Information and Data Science Academy (HIDA) connects and serves as the umbrella for six data science research schools linked by a network of 14 national research centers and 17 top-tier universities across Germany. HIDA was developed by the Helmholtz Information and Data Science Incubator, which was founded in 2016. The Incubator is a body of 38 experts drawn from the Helmholtz Centers and from industry.
The Helmholtz Imaging Platform (HIP) brings scientists and engineers in the Helmholtz Association together to promote and develop imaging science and to foster synergies across imaging modalities and applications within the Helmholtz Association.
LifeTime is a pan-European initiative involving 50+ research institutes in 18 countries. Its goal is to track the molecular make-up of human cells in time and space at single-cell resolution in order to predict the onset and course of diseases. An entire work package is focused on “Data Science, Artificial Intelligence and Machine Learning”. The initiative is jointly coordinated by Nikolaus Rajewsky from the Max Delbrück Center and Geneviève Almouzni from the Institut Curie.
The Max Delbrück Center hosts a study center for the German National Cohort (NaKO), which tracks health trajectories on a population level over longer time scales.
The Pan-Cancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in more than 2,800 cancer whole genomes from the International Cancer Genome Consortium. The Schwarz Lab is part of PCAWG Working Group 3 (Interaction of Genome and Transcriptome) and is responsible for conducting allele-specific expression analyses to understand the impact of somatic genetic variation on gene expression in these tumours.
In the sparse2big consortium, eight Helmholtz Centers work together on developing, evaluating and sharing methods for data imputation and integration, with the aim of achieving meaningful big data and in-depth, insightful analyses. Potential use cases range from patient data in medicine to remote sensing in geography or sample noise in imaging.
MDC faculty contributes to the following Data Science doctoral education programs, either as coordinators (HEIBRiDS, iNAMES, Regulatory Genome) or as partners (CompCancer):
CompCancer is a PhD programme (DFG-funded research training group) that focusses on computational aspects of cancer research. The goal of CompCancer is to develop computational methods, apply them to relevant questions of current cancer research, and thereby train the next generation of computational oncologists.
The Max Delbrück Center is one of the six Helmholtz Centers that have joined forces with the Einstein Center Digital Future to create a new PhD program in data science. Established in 2018, the Helmholtz Einstein International Berlin Research School in Data Science (HEIBRiDS) is an interdisciplinary school that trains young scientists in Data Science applications within a broad range of natural science domains, spanning from Earth & Environment, Astronomy, Space & Planetary Research to Geosciences, Materials & Energy and Molecular Medicine.
The Max Delbrück Center, the Weizmann Institute of Science in Rehovot, the Humboldt-Universität zu Berlin and the Charité-Universitätsmedizin have joined forces to establish iNAMES - MDC-Weizmann Helmholtz International Research School (HIRS) for Imaging from the NAno to the MESo. The mission of iNAMES is the training of outstanding young imaging and data scientists in a truly international research school.
In an alliance between Berlin institutions (led by Humboldt University) and Duke University, the DFG-funded international research training group Dissecting and Reengineering the Regulatory Genome aims to teach the next generation of researchers a quantitative understanding of genome function and gene regulation within the context of biological systems.
In addition to the above PhD Programs, Data Science group leaders participate in the Graduate Program at the Max Delbrück Center, which runs PhD Recruitment rounds twice a year.
MDC faculty contributes to the following MSc Programs of partner Universities:
The Master Program Data Science is offered by the Department of Mathematics and Computer Science of the Free University of Berlin. It is aimed at students who wish to specialize in the processing and analysis of large amounts of data.
Employing adequate training in the various sub-disciplines, this program provides the required knowledge for students to be able to judge mathematical methods and models, to recognize relevant biological questions, and to correctly interpret the results of the models in a biological context.
The Master Program in Biophysics of the Humboldt University in Berlin offers research-based teaching in the interdisciplinary field of experimental and theoretical biophysics.