Data Science and Artificial Intelligence

Cross-cutting Focus Area at the MDC


Data Science, the extraction of relevant insights from data, has firmly positioned itself at the forefront of health and life science research. Machine Learning and Deep Learning are foreseen to become key disruptive technologies within Data Science in translational research and personalized medicine applications.

The cross-cutting focus area Data Science and Artificial Intelligence currently comprises 15 research groups spanning MDC’s different research areas, with an emphasis on:

  1. Omics and precision medicine, including single-cell technologies
  2. Data integration and disease modelling
  3. Biomedical image analysis and complex phenotyping
  4. Epidemiology and health data integration

Data Science at the MDC develops applications across its research areas to:

  • model biological processes from the cellular to the organismal level,
  • detect patterns of health/disease trajectories and
  • identify early warning biomarkers for drug development or repurposing.


Research Areas

Omics and precision medicine


Omics and precision medicine addresses models for gene regulatory networks to understand how they are encoded in the epigenome and non-coding regulatory sequences. This is complemented by machine learning to model cancer evolution, as well as to use gene regulation and expression data to characterize cancer subtypes at the molecular level.

Aided by high-resolution single-cell molecular data, we anticipate that our AI efforts will make it possible to interpret patient genomes in different contexts, such as cancer cohorts or patient-derived organoids, as well as in vitro differentiation systems that recapitulate the molecular disease phenotypes of specific patient variants.

Data integration and disease modelling


Data integration and disease modelling uses genomic, proteomic and phenotyping data in mechanistic, multiscale mathematical models to elucidate and describe disease mechanisms. The particular strength of this approach is the integration of heterogeneous data to establish the link to cell and organ function.

Biomedical image analysis and complex phenotyping


Biomedical image analysis and complex phenotyping platforms, which provide phenotypic readouts across multiple scales, form a key component of Data Science at the MDC. Our groups develop new algorithms to process, visualize, and analyze large-volume datasets, enabling integrative analyses and data-driven classification for more reliable diagnosis and personalized treatment. Our platforms range from in vivo live imaging of whole model organisms down to single-molecule resolution.

Epidemiology and health data integration


Epidemiology and health data integration studies connect the above approaches with large-scale data from clinical studies or insurance records that provide a wide array of phenotypic data for healthy and affected individuals. Harmonizing clinical data across study centers and patients will allow the bidirectional projection of findings between healthy cohorts and patient groups with severe diseases, as well as the study of treatment effects.


Data Science at the MDC coordinates efforts towards the development of platforms and bioinformatics tools for

  1. standardized data analysis and deep phenotyping,
  2. biosample data collection and
  3. microbiome analysis

with the overall aim to reveal new insights on disease progression and treatment.

Data Science Platforms

Data Science Platforms at the MDC support established and front-end application of technologies, provide instrumentation and methodologies and are engaged in collaborative research projects and technology developments.

Bioinformatics Tools

Below is a non-comprehensive list of Bioinformatics Tools developed by Data Science research groups at the MDC. For a complete list of Bioinformatics Tools, please visit the homepage of the respective research group.

Bioinformatics and Omics Data Science (A. Akalin)

  • genomation
    R package that contains a collection of tools for visualizing and analyzing genome-wide data sets.

  • janggu
    Deep learning infrastructure for bioinformatics.

  • maui
    Deep learning-based heterogeneous data analysis toolkit

  • methylKit
    R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing

  • netSmooth
    R/Bioconductor package for network smoothing of single cell RNA sequencing data

  • PiGx
    Collection of reproducible genomics pipelines for: 1. raw fastq read data of bisulfite experiments, 2. RNA-seq samples and single-cell RNA-seq analysis, 3. reads from ChIP-seq experiments, and 4. the analysis of sequence mutations in CRISPR-Cas9 targeted amplicon sequencing data.

  • RCAS
    R package that provides dynamic annotations with interactive figures and tables for custom input files that contain transcriptomic target regions

Core Unit Bioinformatics (D. Beule)

  • DigestiFlow
    Automated and reproducible conversion of base calls to sequences and the demultiplexing thereof including sample sheet management and comprehensive data and processing QC.

  • MeTaQuaC
    Comprehensive quality control, rich visualization, and summary statistics for large scale metabolomics studies.

  • SCelVis
    Cloud ready interactive visualization of single-cell RNA-seq data. It provides easy-to-use yet flexible means of scRNA-seq data exploration for researchers without computational background.
    SCelVis and SCelVis-demo

  • VarFish
    User-friendly web application for the quality control, filtering, prioritization, analysis, and user-based annotation of DNA variant data with focus on rare disease genetics. VarFish and VarFish server

Junker Lab

  • LINNAEUS
    Online tool and interactive portal for the lineage tracing by nuclease-activated editing of ubiquitous sequences.

Kainmueller Lab

  • PatchPerPix
    Automated segmentation of densely clustered and/or overlapping objects in microscopy data

Meyer Lab

  • BIQ
    Web server to query RNA-seq datasets for backsplice junctions

  • CoBold
    Visualization app of CoBold predictions of transient RNA structure features that could aid or hinder the formation of a given RNA secondary structure.

  • CoFold
    Web server for the prediction of RNA secondary structure that takes co-transcriptional folding into account.

  • R-chie and R4RNA
    Web server and R package for plotting RNA secondary structures, trans RNA-RNA interactions and genomic interactions.

  • RNA-Decoder
    Web based comparative tool for finding and folding RNA secondary structures within protein-coding regions.

  • Simufold
    Web server using a Bayesian MCMC framework for co-estimating an RNA structure including pseudo-knots, a multiple-sequence alignment and an evolutionary tree, given a set of evolutionarily related RNA sequences as input.

  • Transat
    Web server for detection of conserved helices of high statistical significance, including pseudo-knotted, transient and alternative structures.

Ohler Lab

  • DeepRipe
    Multitask and multimodal deep neural network for characterizing in vivo RBP binding preferences and interpreting RNA binding protein target preferences.

  • FootprintPipeline
    Pipeline to find transcription factor footprints in DNase-seq or ATAC-seq datasets.

  • JAMM Peak Finder
    Peak finder for NGS datasets (ChIP-Seq, ATAC-Seq, DNase-Seq, etc.) that can integrate replicates and assign peak boundaries accurately.

  • ORFquant
    R package for detecting and quantifying ORF translation on complex transcriptomes using Ribo-seq data.

  • PARpipe
    Complete analysis pipeline for PAR-CLIP data that provides: 1. Pre-processing and alignment of reads, 2. Definition of interaction sites, 3. Additional site-level metrics, 4. Annotation of reads, groups, and clusters, and 5. Meta-analysis of binding sites relative to important transcript features.

Preibisch Lab

  • BigStitcher
    Software package that allows simple and efficient alignment of multi-tile and multi-angle image datasets, for example acquired by lightsheet, widefield or confocal microscopes.

N. Rajewsky Lab

  • circBase
    Web-based tool for the exploration of public datasets of circular RNAs (circRNAs). Custom Python scripts can be downloaded and applied to the discovery of circRNAs in (ribominus) RNA-seq data.

  • DVEX
    Online resource tool for the exploration of the transcriptome of the stage 6 Drosophila embryo at the single cell level.

  • miRDeep2
    Software package for the discovery of known and novel miRNAs from deep sequencing data. Furthermore, it can be used for miRNA expression profiling across samples.

  • miReduce
    Perl script for Linux for correlating the logarithm of expression fold changes of a set of genes with the motif content of the regulatory sequences of these genes.

  • spacegerm
    Web app for the study of the spatial Caenorhabditis elegans germline expression of mRNA & miRNA.

  • SVZ Cell Atlas
    Web app for the exploration and visualization of single cells derived from the subventricular zone of the adult mouse brain.

Schwarz Lab

    Java-based software with a graphical user interface for the embedded analysis and visualisation of multiple-sequence alignments (MSAs).

    R package for whole-genome phasing of germline variants and haplotype reconstruction from Genome Architecture Mapping (GAM) data.

    Finite-state transducer-based framework and software tool for the reconstruction of phylogenetic trees from allele-specific copy-number profiles.



Jointly with Charité researchers within the Berlin Institute of Health (BIH), the Berlin Long term Observation of Cardiovascular Events (BeLOVE) follows circa 10,000 subjects with primary cardiovascular disease or key precursor type 2 diabetes. BeLOVE allows for direct observation of disease comorbidities, study of mechanisms and differential risk factors and determinants of treatment efficacy.

BeLOVE project page

Berlin Center for Machine Learning (BZML)

The Berlin Center for Machine Learning (BZML, Berliner Zentrum für Maschinelles Lernen) aims at the systematic and sustainable expansion of interdisciplinary machine learning research, both in proven research constellations and in new, highly topical scientific objectives that have not yet been jointly researched.

BZML website


The Berlin Institute for the Foundations of Learning and Data (BIFOLD) aims to conduct research into the scientific foundations of Big Data and Machine Learning, to advance AI application development, and to greatly increase the impact on society, the economy, and science.

BIFOLD website


The German Network for Bioinformatics Infrastructure (de.NBI) is a national, academic and non-profit infrastructure supported by the Federal Ministry of Education and Research providing bioinformatics services to users in life sciences research and biomedicine in Germany and Europe. The partners organize training events, courses and summer schools on tools, standards and compute services provided by de.NBI to assist researchers to more effectively exploit their data.

de.NBI website

Helmholtz AI

The Helmholtz Artificial Intelligence Cooperation Unit (Helmholtz AI) is one of five platforms initiated by the Helmholtz Information and Data Science Incubator. Its main goal is to become a driver for applied artificial intelligence (AI) through the development and distribution of AI methods across all Helmholtz centres, effectively combining AI-based analytics with Helmholtz' unique research questions and datasets.

Helmholtz AI website


The Helmholtz Information and Data Science Academy (HIDA) connects and serves as the umbrella for 6 newly founded data science research schools linked by a network of 14 national research centers and 17 top-tier universities across Germany. HIDA was developed by the Helmholtz Information and Data Science Incubator, which was founded in 2016. The Incubator is a body of 38 expert scientists from each of the Helmholtz Centers and industry experts.

HIDA website


The Helmholtz Imaging Platform (HIP) brings scientists and engineers in the Helmholtz Association together to promote and develop imaging science and to foster synergies across imaging modalities and applications within the Helmholtz Association.

HIP website


LifeTime is a pan-European initiative involving 50+ research institutes in 18 countries. Its goal is to track the molecular make-up of human cells in time and space at single-cell resolution in order to predict the onset and course of diseases. An entire work package is focused on “Data Science, Artificial Intelligence and Machine Learning”. The initiative is jointly coordinated by Nikolaus Rajewsky from the MDC and Geneviève Almouzni from the Institut Curie.



The MDC hosts a study center for the German National Cohort (NAKO), which tracks health trajectories on a population level over longer time scales.

NAKO Health study


The Pan-Cancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in more than 2,800 cancer whole genomes from the International Cancer Genome Consortium. The Schwarz Lab is part of PCAWG Working Group 3 (Interaction of Genome and Transcriptome) and is responsible for conducting allele-specific expression analyses to understand the impact of somatic genetic variation on gene expression in these 2,800 tumours.

PCAWG website


Doctoral Education

MDC faculty contribute to the following Data Science doctoral education programs, either as coordinators (HEIBRiDS, Regulatory Genome) or as partners (CompCancer):



The MDC is one of the six Helmholtz Centers that have joined forces with the Einstein Center Digital Future to create a new PhD program in data science. Established in 2018, the Helmholtz Einstein International Berlin Research School in Data Science (HEIBRiDS) is an interdisciplinary school that trains young scientists in Data Science applications within a broad range of natural science domains, spanning from Earth & Environment, Astronomy, Space & Planetary Research to Geosciences, Materials & Energy and Molecular Medicine.

HEIBRiDS website



CompCancer is a PhD programme (DFG-funded research training group) that focusses on computational aspects of cancer research. The goal of CompCancer is to develop computational methods, apply them to relevant questions of current cancer research, and thereby train the next generation of computational oncologists.

CompCancer website


Regulatory Genome

In an alliance between Berlin institutions (led by Humboldt University) and Duke University, the DFG-funded international research training group Dissecting and Reengineering the Regulatory Genome aims to teach the next generation of researchers a quantitative understanding of genome function and gene regulation within the context of biological systems.

Regulatory Genome website


MDC-funded PhD positions on Data Science

In addition to the above PhD Programs, Data Science group leaders participate in the MDC Graduate Program, which runs PhD Recruitment rounds twice a year.

Apply for a PhD


MSc Education

MDC faculty contribute to the following MSc Programs of partner Universities:


Master Program Data Science

The Master Program Data Science is a new program offered by the Department of Mathematics and Computer Science of the Free University of Berlin. It is aimed at students who wish to specialize in the processing and analysis of large amounts of data.

MSc Data Science website


Master Program in Bioinformatics

Through training in the various sub-disciplines, this program provides the knowledge students need to judge mathematical methods and models, to recognize relevant biological questions, and to correctly interpret the results of the models in a biological context.

MSc Bioinformatics website

Master Program in Biophysics

The Master Program in Biophysics of the Humboldt University in Berlin offers research-based teaching in the interdisciplinary field of experimental and theoretical biophysics.

MSc Biophysics website (in German)

News & Events

Regular lectures & seminars


News and press releases