Data Sciences and Artificial Intelligence

Cross-cutting Focus Area at the MDC


Data Science, the extraction of relevant insights from data, has firmly positioned itself at the forefront of health and life science research. Machine Learning and Deep Learning are foreseen to become key disruptive technologies within Data Science in translational research and personalized medicine applications.

The cross-cutting focus area currently comprises 15 research groups spanning MDC’s different research areas, with an emphasis on:

  1. Omics and precision medicine, including single-cell technologies
  2. Data integration and disease modelling
  3. Biomedical image analysis and complex phenotyping
  4. Epidemiology and health data integration


Research Areas

Omics and precision medicine


Omics and precision medicine develops models of gene regulatory networks to understand how they are encoded in the epigenome and in non-coding regulatory sequences. This is complemented by machine learning to model cancer evolution and to characterize cancer subtypes at the molecular level from gene regulation and expression data.

Aided by high-resolution single-cell molecular data, we anticipate that our AI efforts will make it possible to interpret patient genomes in different contexts, such as cancer cohorts, patient-derived organoids, and in vitro differentiation systems that recapitulate the molecular disease phenotypes of specific patient variants.

Data integration and disease modelling


Data integration and disease modelling uses genomic, proteomic and phenotyping data in mechanistic, multiscale mathematical models to elucidate and describe disease mechanisms. The particular strength of this approach is the integration of heterogeneous data to establish the link to cell and organ function.

Biomedical image analysis and complex phenotyping


Biomedical image analysis and complex phenotyping platforms, which provide phenotypic readouts across multiple scales, form a key component of Data Science at the MDC. Our groups develop new algorithms to process, visualize, and analyze large-volume datasets, enabling integrative analyses and data-driven classification for more reliable diagnosis and personalized treatment. Our platforms span scales from in vivo live imaging of whole model organisms down to single-molecule resolution.

Epidemiology and health data integration


Epidemiology and health data integration studies connect the above approaches with large-scale data from clinical studies or insurance records that provide a wide array of phenotypic data for healthy and affected individuals. Harmonizing clinical data across study centers and patients will allow the bidirectional projection of findings between healthy cohorts and patient groups with severe diseases, as well as the study of treatment effects.


Data Science at the MDC coordinates efforts to provide reproducible, scalable pipelines for

  1. standardized data analysis and deep phenotyping,
  2. biosample data collection and
  3. microbiome analysis

with the overall aim of revealing new insights into disease progression and treatment.

To this end, Data Science at the MDC develops data science applications across its research areas to:

  • model biological processes from the cellular to the organismal level,
  • detect patterns of health/disease trajectories and
  • identify early warning biomarkers for drug development or repurposing.

Data Science Platforms

Bioinformatics Tools


  • ALVIS

Java-based software with a graphical user interface for the embedded analysis and visualisation of multiple-sequence alignments (MSAs). ALVIS

  • BIQ

Web server to query RNA-seq datasets for backsplice junctions. BIQ

  • CoBold

Visualization app of CoBold predictions of transient RNA structure features that could aid or hinder the formation of a given RNA secondary structure. CoBold

  • CoFold

Web server for the prediction of RNA secondary structure that takes co-transcriptional folding into account. CoFold

  • DigestiFlow

Automated and reproducible conversion of base calls to sequences and their demultiplexing, including sample sheet management and comprehensive quality control (QC) of data and processing. DigestiFlow


  • GAMIBHEAR

R package for whole-genome phasing of germline variants and haplotype reconstruction from Genome Architecture Mapping (GAM) data. GAMIBHEAR

  • genomation

R package that contains a collection of tools for visualizing and analyzing genome-wide data sets. genomation

  • janggu

Deep learning infrastructure for bioinformatics. janggu

  • maui

Deep learning-based heterogeneous data analysis toolkit. maui


  • MEDICC

Finite-state transducer-based framework and software tool for the reconstruction of phylogenetic trees from allele-specific copy-number profiles. MEDICC

  • MeTaQuaC

Comprehensive quality control, rich visualization, and summary statistics for large-scale metabolomics studies. MeTaQuaC

  • methylKit

R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing. methylKit

  • netSmooth

R/Bioconductor package for network smoothing of single-cell RNA sequencing data. netSmooth

  • PatchPerPix

Automated segmentation of densely clustered and/or overlapping objects in microscopy data. PatchPerPix

  • PiGx

Collection of reproducible genomics pipelines for: i) raw FASTQ reads from bisulfite sequencing experiments, ii) RNA-seq samples, iii) single-cell RNA-seq data, iv) reads from ChIP-seq experiments and v) sequence mutations in CRISPR-Cas9 targeted amplicon sequencing data. PiGx

  • RCAS

R package that provides dynamic annotations with interactive figures and tables for custom input files that contain transcriptomic target regions. RCAS

  • R-chie and R4RNA

Web server and R package for plotting RNA secondary structures, trans RNA-RNA interactions and genomic interactions. R-chie and R4RNA

  • RNA-Decoder

Web-based comparative tool for finding and folding RNA secondary structures within protein-coding regions. RNA-Decoder

  • SCelVis

Cloud-ready interactive visualization of single-cell RNA-seq data. It provides an easy-to-use yet flexible means of scRNA-seq data exploration for researchers without a computational background. SCelVis and SCelVis-demo

  • Simufold

Web server using a Bayesian MCMC framework for co-estimating an RNA structure including pseudo-knots, a multiple-sequence alignment and an evolutionary tree, given a set of evolutionarily related RNA sequences as input. Simufold

  • Transat

Web server for detection of conserved helices of high statistical significance, including pseudo-knotted, transient and alternative structures. Transat

  • VarFish

User-friendly web application for the quality control, filtering, prioritization, analysis, and user-based annotation of DNA variant data, with a focus on rare disease genetics. VarFish and VarFish server



BeLOVE

Run jointly with Charité researchers within the Berlin Institute of Health (BIH), the Berlin Long term Observation of Cardiovascular Events (BeLOVE) follows approximately 10,000 subjects with primary cardiovascular disease or its key precursor, type 2 diabetes. BeLOVE allows direct observation of disease comorbidities and the study of mechanisms, differential risk factors and determinants of treatment efficacy.

BeLOVE project page


NAKO

The MDC hosts a study center for the German National Cohort (NAKO), which tracks health trajectories at the population level over long time scales.

NAKO Health study


LifeTime

LifeTime, a new pan-European consortium of more than 90 leading research institutions supported by over 70 companies, aims at revolutionising healthcare by mapping, understanding, and targeting human cells during disease. An entire work package is focused on “Data Science, Artificial Intelligence and Machine Learning”. The initiative is jointly coordinated by Nikolaus Rajewsky from the MDC and Geneviève Almouzni from the Institut Curie.

LifeTime website

Berlin Center for Machine Learning (BZML)

The Berlin Center for Machine Learning (BZML, Berliner Zentrum für Maschinelles Lernen) aims at the systematic and sustainable expansion of interdisciplinary machine learning research, both in proven research constellations and in new, highly topical scientific areas that have not yet been jointly researched.

BZML website


de.NBI

The 'German Network for Bioinformatics Infrastructure – de.NBI' is a national, academic, non-profit infrastructure supported by the Federal Ministry of Education and Research. It provides bioinformatics services to users in life sciences research and biomedicine in Germany and Europe. The partners organize training events, courses and summer schools on the tools, standards and compute services provided by de.NBI to help researchers exploit their data more effectively.

de.NBI website


PCAWG

The Pan-Cancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in more than 2,800 cancer whole genomes from the International Cancer Genome Consortium. The Schwarz Lab is part of PCAWG Working Group 3 (Interaction of Genome and Transcriptome) and is responsible for conducting allele-specific expression analyses to understand the impact of somatic genetic variation on gene expression in these tumours.

PCAWG website


Doctoral Education

MDC faculty contribute to the following Data Science doctoral education programs, either as coordinators (HEIBRiDS, Regulatory Genome) or as partners (CompCancer):



HEIBRiDS

The MDC is one of the six Helmholtz Centers that have joined forces with the Einstein Center Digital Future to create a new PhD program in data science. Established in 2018, the Helmholtz Einstein International Berlin Research School in Data Science (HEIBRiDS) is an interdisciplinary school that trains young scientists in Data Science applications within a broad range of natural science domains, spanning from Earth & Environment, Astronomy, Space & Planetary Research to Geosciences, Materials & Energy and Molecular Medicine.

HEIBRiDS website



CompCancer

CompCancer is a PhD programme (a DFG-funded research training group) that focusses on computational aspects of cancer research. The goal of CompCancer is to develop computational methods, apply them to relevant questions of current cancer research, and thereby train the next generation of computational oncologists.

CompCancer website


Regulatory Genome

In an alliance between Berlin institutions (led by Humboldt University) and Duke University, the DFG-funded international research training group Dissecting and Reengineering the Regulatory Genome aims to teach the next generation of researchers a quantitative understanding of genome function and gene regulation within the context of biological systems.

Regulatory Genome website


MDC-funded PhD positions on Data Science

In addition to the above PhD Programs, Data Science group leaders participate in the MDC Graduate Program, which runs PhD Recruitment rounds twice a year.

Call Spring 2020


MSc Education

MDC faculty contribute to the following MSc Programs of partner Universities:


Master Program Data Science

The Master Program Data Science is a new program offered by the Department of Mathematics and Computer Science of the Free University of Berlin. It is aimed at students who wish to specialize in the processing and analysis of large amounts of data.

MSc Data Science website


Master Program in Bioinformatics

Through training in the various sub-disciplines, this program gives students the knowledge they need to judge mathematical methods and models, to recognize relevant biological questions, and to correctly interpret the results of the models in a biological context.

MSc Bioinformatics website

Master Program in Biophysics

The Master Program in Biophysics of the Humboldt University in Berlin offers research-based teaching in the interdisciplinary field of experimental and theoretical biophysics.

MSc Biophysics website (in German)
