Data Science, the extraction of relevant insights from data, has firmly positioned itself at the forefront of health and life science research. Machine Learning and Deep Learning are foreseen to become key disruptive technologies within Data Science in translational research and personalized medicine applications.
The cross-cutting focus area Data Science and Artificial Intelligence currently comprises 15 research groups spanning MDC’s different research areas, with an emphasis on:
- Omics and precision medicine, including single-cell technologies
- Data integration and disease modelling
- Biomedical image analysis and complex phenotyping
- Epidemiology and health data integration
Data Science at the MDC develops applications across its research areas to:
- model biological processes from the cellular to the organismal level,
- detect patterns of health/disease trajectories and
- identify early warning biomarkers for drug development or repurposing.
Omics and precision medicine addresses models for gene regulatory networks to understand how they are encoded in the epigenome and non-coding regulatory sequences. This is complemented by machine learning to model cancer evolution, as well as to use gene regulation and expression data to characterize cancer subtypes at the molecular level.
Aided by high-resolution single-cell molecular data, we anticipate that our AI efforts will enable the interpretation of patient genomes in different contexts, such as cancer cohorts or patient-derived organoids, as well as in vitro differentiation systems that recapitulate the molecular disease phenotypes of specific patient variants.
Data integration and disease modelling uses genomic, proteomic and phenotyping data in mechanistic mathematical multiscale models to elucidate and describe disease mechanisms. The particular strength of this approach is the integration of heterogeneous data to establish the link to cell and organ function.
Biomedical image analysis and complex phenotyping platforms, which provide phenotypic readouts across multiple scales, form a key component of Data Science at the MDC. Our groups develop new algorithms to process, visualize, and analyze large-volume datasets, enabling integrative analyses and data-driven classification for more reliable diagnosis and personalized treatment. Our platforms span scales from in vivo live imaging of whole model organisms down to single-molecule readouts.
Epidemiology and health data integration studies connect the above approaches with large-scale data from clinical studies or insurance records that provide a wide array of phenotypic data for healthy and affected individuals. Harmonizing clinical data across study centers and patients will allow the bidirectional projection of findings between healthy cohorts and patient groups with severe diseases, as well as the study of treatment effects.
Data Science at the MDC coordinates efforts towards the development of platforms and bioinformatics tools for
- standardized data analysis and deep phenotyping,
- biosample data collection and
- microbiome analysis
with the overall aim to reveal new insights on disease progression and treatment.
Data Science Platforms
Data Science Platforms at the MDC support established and front-end application of technologies, provide instrumentation and methodologies and are engaged in collaborative research projects and technology developments.
Below you can find a non-comprehensive list of bioinformatics tools developed by Data Science research groups at the MDC. For a complete list, please visit the homepage of the respective research group.
R package that contains a collection of tools for visualizing and analyzing genome-wide data sets.
Deep learning infrastructure for bioinformatics.
Deep learning-based heterogeneous data analysis toolkit
R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing
R/Bioconductor package for network smoothing of single cell RNA sequencing data
Collection of reproducible genomics pipelines for:
raw fastq read data of bisulfite experiments,
single cell RNA-seq analysis,
reads from ChIPseq experiments and
the analysis of sequence mutations in CRISPR-Cas9 targeted amplicon sequencing data
R package that provides dynamic annotations with interactive figures and tables for custom input files that contain transcriptomic target regions
Automated and reproducible conversion of base calls to sequences and the demultiplexing thereof including sample sheet management and comprehensive data and processing QC.
Comprehensive quality control, rich visualization, and summary statistics for large scale metabolomics studies.
Cloud ready interactive visualization of single-cell RNA-seq data. It provides easy-to-use yet flexible means of scRNA-seq data exploration for researchers without computational background.
User-friendly web application for the quality control, filtering, prioritization, analysis, and user-based annotation of DNA variant data with a focus on rare disease genetics.
Online tool and interactive portal for the lineage tracing by nuclease-activated editing of ubiquitous sequences.
Automated segmentation of densely clustered and/or overlapping objects in microscopy data
Web server to query RNA-seq datasets for backsplice junctions
Visualization app of CoBold predictions of transient RNA structure features that could aid or hinder the formation of a given RNA secondary structure.
Web server for the prediction of RNA secondary structure that takes co-transcriptional folding into account.
Web server and R package for plotting RNA secondary structures, trans RNA-RNA interactions and genomic interactions.
Web based comparative tool for finding and folding RNA secondary structures within protein-coding regions.
Web server using a Bayesian MCMC framework for co-estimating an RNA structure including pseudo-knots, a multiple-sequence alignment and an evolutionary tree, given a set of evolutionarily related RNA sequences as input.
Web server for detection of conserved helices of high statistical significance, including pseudo-knotted, transient and alternative structures.
Multitask and multimodal deep neural network for characterizing in vivo RBP binding preferences and interpreting RNA binding protein target preferences.
Pipeline to find transcription factor footprints in DNase-seq or ATAC-seq datasets.
Peak finder for NGS datasets (ChIP-Seq, ATAC-Seq, DNase-Seq, etc.) that can integrate replicates and assign peak boundaries accurately.
R package that aims at detecting and quantifying ORF translation on complex transcriptomes using Ribo-seq data.
Complete analysis pipeline for PAR-CLIP data that provides
Pre-processing and alignment of reads,
Definition of interaction sites,
Additional site-level metrics,
Annotation of reads, groups, and clusters,
Meta-analysis of binding sites relative to important transcript features
Web-based tool for the exploration of public datasets of circular RNAs (circRNAs). Custom Python scripts can be downloaded and applied to the discovery of circRNAs in (ribominus) RNA-seq data.
Online resource tool for the exploration of the transcriptome of the stage 6 Drosophila embryo at the single cell level.
Software package for the discovery of known and novel miRNAs from deep sequencing data. Furthermore, it can be used for miRNA expression profiling across samples.
Perl script for Linux for correlating the logarithm of expression fold changes of a set of genes with the motif content of the regulatory sequences of these genes.
Web app for the study of the Spatial Caenorhabditis elegans germline expression of mRNA & miRNA.
Web app for the exploration and visualization of single cells derived from the subventricular zone of the adult mouse brain.
Java-based software with a graphical user interface for the embedded analysis and visualisation of multiple-sequence alignments (MSAs).
R package for whole-genome phasing of germline variants and haplotype reconstruction from Genome Architecture Mapping (GAM) data.
Finite-state transducer-based framework and software tool for the reconstruction of phylogenetic trees from allele-specific copy-number profiles.
Jointly with Charité researchers within the, the Berlin Long term Observation of Cardiovascular Events (BeLOVE) follows circa 10,000 subjects with primary cardiovascular disease or key precursor type 2 diabetes. BeLOVE allows for direct observation of disease comorbidities, study of mechanisms and differential risk factors and determinants of treatment efficacy.
Berlin Center for Machine Learning (BZML)
The Berlin Center for Machine Learning (BZML, Berliner Zentrum für Maschinelles Lernen) aims at the systematic and sustainable expansion of interdisciplinary machine learning research, both in proven research constellations and in new, highly topical scientific objectives that have not yet been jointly researched.
The Berlin Institute for the Foundations of Learning and Data (BIFOLD) aims to conduct research into the scientific foundations of Big Data and Machine Learning, to advance AI application development, and greatly increase the impact to society, the economy, and science.
The German Network for Bioinformatics Infrastructure (de.NBI) is a national, academic and non-profit infrastructure supported by the Federal Ministry of Education and Research providing bioinformatics services to users in life sciences research and biomedicine in Germany and Europe. The partners organize training events, courses and summer schools on tools, standards and compute services provided by de.NBI to assist researchers to more effectively exploit their data.
The Helmholtz Artificial Intelligence Cooperation Unit (Helmholtz AI) is one of five platforms initiated by the. Its main goal is to become a driver for applied artificial intelligence (AI) through the development and distribution of AI methods across all Helmholtz centres, effectively combining AI-based analytics with Helmholtz' unique research questions and datasets.
The Helmholtz Information and Data Science Academy (HIDA) connects and serves as the roof to 6 newly founded data science research schools linked by a network of 14 national research centers and 17 top-tier universities across Germany. HIDA was developed by the, which was founded in 2016. The Incubator is a body of 38 expert scientists from each of the Helmholtz Centers and industry experts.
LifeTime is a pan-European initiative involving 50+ research institutes in 18 countries. Its goal is to track the molecular make-up of human cells in time and space at single-cell resolution in order to predict the onset and course of diseases. An entire work package is focused on “Data Science, Artificial Intelligence and Machine Learning”. The initiative is jointly coordinated by Nikolaus Rajewsky from the MDC and Geneviève Almouzni from the Institut Curie.
The MDC hosts a study center for the German National Cohort (NaKO), which tracks health trajectories on a population level over longer time scales.
The Pan-Cancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in more than 2,800 cancer whole genomes from the International Cancer Genome Consortium. The Schwarz Lab is part of PCAWG Working Group 3 (Interaction of Genome and Transcriptome) and is responsible for conducting allele-specific expression analyses to understand the impact of somatic genetic variation on gene expression in these 2,800 tumours.
MDC faculty contribute to the following Data Science doctoral education programs, either as coordinators (HEIBRiDS, Regulatory Genome) or as partners (CompCancer):
The MDC is one of the six Helmholtz Centers that have joined forces with the Einstein Center Digital Future to create a new PhD program in data science. Established in 2018, the Helmholtz Einstein International Berlin Research School in Data Science (HEIBRiDS) is an interdisciplinary school that trains young scientists in Data Science applications within a broad range of natural science domains, spanning from Earth & Environment, Astronomy, Space & Planetary Research to Geosciences, Materials & Energy and Molecular Medicine.
CompCancer is a PhD programme (DFG-funded research training group) that focusses on computational aspects of cancer research. The goal of CompCancer is to develop computational methods, apply them to relevant questions of current cancer research, and thereby train the next generation of computational oncologists.
In an alliance between Berlin institutions (led by Humboldt University) and Duke University, the DFG-funded international research training group Dissecting and Reengineering the Regulatory Genome aims to teach the next generation of researchers a quantitative understanding of genome function and gene regulation within the context of biological systems.
MDC-funded PhD positions on Data Science
In addition to the above PhD Programs, Data Science group leaders participate in the MDC Graduate Program, which runs PhD Recruitment rounds twice a year.
MDC faculty contribute to the following MSc Programs of partner Universities:
Master Program Data Science
The Master Program Data Science is a new program offered by the Department of Mathematics and Computer Science of the Free University of Berlin. It is aimed at students who wish to specialize in the processing and analysis of large amounts of data.
Master Program in Bioinformatics
Employing adequate training in the various sub-disciplines, this program provides the required knowledge for students to be able to judge mathematical methods and models, to recognize relevant biological questions, and to correctly interpret the results of the models in a biological context.
Master Program in Biophysics
The Master Program in Biophysics of the Humboldt University in Berlin offers research-based teaching in the interdisciplinary field of experimental and theoretical biophysics.
Regular lectures & seminars
News and press releases