Research Data Management
Organizing your data
This page collects best-practice guidelines for file organization. If you need more information, you can contact the RDM team at any time.
- Directory structure
A clear directory structure makes it easier to locate files and versions, which is particularly important when collaborating with others. Consider a hierarchical structure that starts from broad topics and nests more specific ones inside. Restrict the folder depth to 3 or 4 levels and limit the number of items in each folder (max. 50 items if possible).
Directory structure at the MDC
At the MDC, there are multiple locations for file storage. For an overview of the locations and how to access them, please refer to our guideline on the Intranet.
For more information regarding access, please refer to the IT Knowledgebase (only available internally).
Examples of directory structuring
Resources:
- The UK Data Service offers an example of directory structure and naming: https://ukdataservice.ac.uk/manage-data/format/organising.aspx
- Dryad Best Practices: https://datadryad.org/stash/best_practices
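The depth and item-count recommendations above can be checked automatically. The following is a minimal sketch, not an official MDC tool: the function name `check_tree` and the two limits are assumptions taken from the guidance on this page (at most 4 folder levels, at most 50 items per folder).

```python
from pathlib import Path

MAX_DEPTH = 4   # deepest folder level suggested above (3 or 4)
MAX_ITEMS = 50  # suggested upper bound of items per folder


def check_tree(root, max_depth=MAX_DEPTH, max_items=MAX_ITEMS):
    """Return a list of warnings for folders that are too deep or too crowded."""
    root = Path(root)
    warnings = []
    folders = [root, *sorted(p for p in root.rglob("*") if p.is_dir())]
    for folder in folders:
        # Depth is 0 for the root itself, 1 for its direct subfolders, and so on.
        depth = len(folder.relative_to(root).parts)
        n_items = sum(1 for _ in folder.iterdir())
        if depth > max_depth:
            warnings.append(f"{folder}: nested {depth} levels deep (max {max_depth})")
        if n_items > max_items:
            warnings.append(f"{folder}: contains {n_items} items (max {max_items})")
    return warnings
```

Calling `check_tree("my_project")` on a (hypothetical) project folder returns an empty list when the structure stays within both limits.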
- File naming
A file name should be unique, consistent and descriptive. This increases visibility and discoverability and makes it easy to classify and sort files. Remember, a file name is the primary identifier of the file and its contents.
Do’s
- Create descriptive, meaningful, easily understood names that are neither too short nor too long: aim for at least 12-14 characters, except for generic, well-defined names such as README.
- Use identifiers to make it easier to classify types of files, e.g., Int1 (interview 1).
- When combining elements in the file name, preferably use underscores (_) or hyphens (-) as element separators (see examples of commonly used special letter case patterns).
- Make sure the file format extension is present at the end of the name (e.g. .doc, .xls, .mov, .tif, .fasta, .html).
- If applicable, include versioning within file names.
- For dates, use the ISO 8601 standard (YYYY-MM-DD) and place the date at the end of the file name, unless you need to sort your files chronologically, in which case place it at the beginning.
- For experimental data files, consider using the project/experiment name and conditions in abbreviations.
- Add a README file in your top directory which details your naming convention, directory structure, and abbreviations.
Don’ts
- Avoid using capital letters to separate words (e.g., CamelCase); use underscores or hyphens instead.
- Avoid naming files/folders after individuals, as it impedes handover and data sharing.
- Avoid long names, e.g., longer than 35-40 characters.
- Avoid using spaces, dots (other than the one before the file extension), commas and special characters (e.g. " / \ ~ : ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ “ | ), or any non-ASCII (Unicode) characters, e.g. äöüß or カイダー字.
- Avoid repetition. For example, if the directory is named Electron_Microscopy_Images, you don’t need to name the files ELN_MI_Img_20200101.img.
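Several of the Do’s and Don’ts above are mechanical enough to check with a short script. The sketch below is illustrative only: the function name `check_name` and the thresholds (12 and 40 characters) are assumptions derived from the ranges given above, the length is measured without the extension, and everything after the first dot is treated as the extension so that names such as `.nii.gz` work. As a consequence, dots used as separators inside a name are not detected.

```python
import re
from datetime import date

MIN_LEN, MAX_LEN = 12, 40                 # rough bounds from the guidance above
GENERIC_NAMES = {"README"}                # well-defined names exempt from the minimum
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")   # letters, digits, underscore, hyphen
ISO_LIKE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_name(filename):
    """Return a list of problems found in a single file name."""
    problems = []
    # Assumption: everything after the first dot is the extension.
    stem, _, extension = filename.partition(".")
    if not extension:
        problems.append("missing file extension")
    if not ALLOWED.fullmatch(stem):
        problems.append("name uses characters other than letters, digits, '_' or '-'")
    if len(stem) < MIN_LEN and stem not in GENERIC_NAMES:
        problems.append(f"name is shorter than {MIN_LEN} characters")
    if len(stem) > MAX_LEN:
        problems.append(f"name is longer than {MAX_LEN} characters")
    # Anything that looks like YYYY-MM-DD must also be a real calendar date.
    for candidate in ISO_LIKE.findall(stem):
        try:
            date.fromisoformat(candidate)
        except ValueError:
            problems.append(f"'{candidate}' is not a valid YYYY-MM-DD date")
    return problems
```

A name that follows the conventions, such as the hypothetical `mouse_liver_rnaseq_2020-01-31.fasta`, yields an empty list; `my data.csv` is reported for both the space and the short name.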
Examples
- Stanford Libraries guidance on file naming is a great place to start
- Dryad example
- 1900-2000_sasquatch_migration_coordinates.csv
- Smith-fMRI-neural-response-to-cupcakes-vs-vegetables.nii.gz
Tools
- Bulk renaming tools such as Ant Renamer, RenameIT or Rename4Mac.
- File format
The choice of file format plays an essential role in long-term data storage and archiving, data sharing, searchability and accessibility, and has a significant impact on data reusability.
Whenever possible, choose open file formats, which allow the data to be imported and accessed by different tools and avoid vendor lock-in if a tool is no longer supported.
Consider the following:
- Choose standard file formats most commonly used in your field.
- Convert data to a standard format.
- Choose a format that is required for data deposition, e.g., to meet repository requirements or for archival compression.
- Consider exporting or converting from original format to a more open/preferred format but keep in mind that some data might be lost or altered during the process e.g., text formatting in documents, decimal point formatting, date and time values.
- Keep in mind there are no standard preferred file formats, and none are perfect, but consider choosing open formats that are most applicable for your use and field.
- When archiving data, combine the whole project (i.e., raw data, analysis, documentation, code and software) in one package.
- For software consider the use of containers to enable interoperability and long-term re-use.
Example
- Recommended file formats by the UK Data Archive: https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats.aspx
- Dryad guide: https://datadryad.org/stash/best_practices#accessible
- Australian National Data Service (ANDS) guide: https://www.ands.org.au/__data/assets/pdf_file/0003/731775/File-Formats.pdf
- JISC guide: https://rdmtoolkit.jisc.ac.uk/collect-and-capture/file-management-and-formats/
Tools
- MDC container: Singularity
- MDC Guix: https://guix.mdc-berlin.de/
For use of the MDC Singularity container, please attend the introductory session on the use of the Max Cluster, held bi-monthly (only available internally).
- Singularity: https://github.com/hpcng/singularity
- Docker: https://www.docker.com/resources/what-container
- Jupyter: https://jupyter.org/index.html
- Fido: https://github.com/openpreserve/fido
- Vagrant: https://www.vagrantup.com/intro/vs/docker.html
- Quality control
Quality control is a fundamental step in research: it ensures the integrity of the data, affects its use and reuse, and is required to identify potential problems.
It is therefore essential to outline how data collection will be controlled at various stages (data collection, digitisation or data entry, checking and analysis). Consider the following quality control measures:
Data collection
- Outline the number of measurements/samples/procedures repeated.
- Outline instrument calibration tests & data set or samples used for calibration.
- Outline standardized controls (e.g., sample controls).
- Use of standardized protocols and methods with clear instructions and documentation.
Data entry
- Decide on a method for documentation, e.g., electronic lab notebooks vs. paper.
- Outline the non-digital data structure and strategy for digitization.
- Collect and create metadata throughout the data collection and handling process.
- Use controlled vocabularies.
- Outline how the data/samples/variables are labelled.
- Document terminology used.
- Describe how to flag/tag questionable data.
- Ensure date and time are represented in a valid, machine-readable format.
- Set up validation rules or input masks in data entry software.
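Validation rules of the kind listed above can be sketched in a few lines. The example below is an assumption-laden illustration rather than a prescribed method: the field names (`measured_at`, `tissue`, `temperature_c`), the controlled vocabulary and the plausible range are all hypothetical. It checks that the timestamp is valid ISO 8601, that a term comes from the controlled vocabulary, and it flags values that are not numeric or fall outside the expected range.

```python
from datetime import datetime

# Hypothetical controlled vocabulary and plausible range, for illustration only.
ALLOWED_TISSUES = {"liver", "heart", "kidney"}
PLAUSIBLE_TEMP_C = (30.0, 45.0)


def validate_record(record):
    """Return a list of flags for one data-entry record (a dict of strings)."""
    flags = []
    try:
        datetime.fromisoformat(record["measured_at"])
    except ValueError:
        flags.append("measured_at is not a valid ISO 8601 date/time")
    if record["tissue"] not in ALLOWED_TISSUES:
        flags.append("tissue is not in the controlled vocabulary")
    try:
        temperature = float(record["temperature_c"])
    except ValueError:
        # Also catches decimal commas such as "37,5".
        flags.append("temperature_c is not a number")
    else:
        low, high = PLAUSIBLE_TEMP_C
        if not low <= temperature <= high:
            flags.append("temperature_c is outside the plausible range")
    return flags
```

Records with a non-empty list of flags can be tagged as questionable instead of being silently corrected, which keeps the original entry available for later review.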
Data Analysis and checking
- Outline software/code used for analysis.
- Outline strategy for data transfer and controls (e.g., checksum).
- Outline how the data will be cross-checked and validated.
- Assign a person/expert for quality assurance and data checks and/or peer review.
- Outline database structure to organize data and data files.
- Document any modifications and outline versioning strategy to avoid duplicate error checking.
- Check and flag questionable data.
- Verify your analysis by comparing a random data set/sample against the original data.
- Double-check the code for any errors and ensure appropriate documentation.
- Use statistical analysis to detect erroneous and/or anomalous values.
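The checksum control mentioned above for data transfer can be illustrated as follows. This is a minimal sketch using SHA-256 from the Python standard library; the function names `sha256_of` and `same_content` are hypothetical, and other hash algorithms may be equally suitable depending on what the receiving side expects.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use low for large data files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(original, copy):
    """Return True if both files have identical checksums."""
    return sha256_of(original) == sha256_of(copy)
```

In practice, the checksum is computed before the transfer and recorded (for example in the README file), then recomputed on the receiving side and compared.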
Qualitative data
For qualitative data such as interviews:
- Outline guided interview questions.
- Make use of software tools such as speech to text.
- Control the quality of audio, video and transcript files.
- Refer to the UK Data Archive guidelines.
Examples
- DataONE quality control: https://old.dataone.org/best-practices/ensure-basic-quality-control
- King’s College London quality control: https://www.kcl.ac.uk/researchsupport/managing/organise
- UK Data Archive quality control guidelines: https://www.ukdataservice.ac.uk/manage-data/format/transcription.aspx
- RDA metadata standards and collection tools: http://rd-alliance.github.io/metadata-directory/standards/
Tools
- OpenRefine for data quality control: https://openrefine.org/
- Numeric data anonymization R package: sdcMicro
- UK Data Archive tools list: https://www.data-archive.ac.uk/managing-data/digital-curation-and-data-publishing/tools-we-use/
File transfer MDC
To send data securely to external users:
- For data up to 10 GB, use https://filetransfer.mdc-berlin.de.
Please refer to the guidelines for more information on the use of file transfer at the MDC (only available internally).
- For data larger than 10 GB, please get in touch with the IT Helpdesk.
- Sensitive and/or personal data files should be encrypted (at rest) before upload.
- Consider the use of GPG (GNU Privacy Guard), which is freely available for both personal and commercial use.
- Pretty Good Privacy (PGP) is an industry-standard encryption tool; it is freely available for personal use only.
- Keep in mind that the data need to be encrypted locally and that the password must be sent through a different channel.
- For more details, please consult the Information Security Office and the Data Protection Office, especially when transferring genomics data.
- Please refrain from using general communication tools such as Slack, Skype, Mattermost or Dropbox for file transfer, especially for personal or sensitive data.
- Versioning
Versioning is an efficient way to keep track of changes made to a file/dataset and to see who did what and when, which is especially useful in collaborative work.
A version control strategy allows you to easily identify the most current/final version and to organize, manage and record any edits made while drafting, editing and analysing the document/data.
Consider the following practices:
- Outline the master file and identify major versions, for instance: original, pre-review, 1st revision, 2nd revision, final revision, submitted.
- Outline a strategy for archiving and storing: where to store the minor and major versions, and how long to retain each.
- Maintain a record of file locations; a good place for this is the README file.
- Record any related files and documents and any updates/changes made to them.
- Use a systematic and unique naming system to identify the different versions, e.g., numbers and/or dates.
- Include a version control table that outlines the file history: which version it is, where the other versions are located, all associated files with their versions and modifications, dates, authors, access rights, licensing, and details of changes made since the last version.
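With a systematic naming scheme, the most current version can be identified programmatically. The sketch below assumes a hypothetical convention in which file names end in `_v` followed by a number (for example `analysis_plan_v02.md`); this convention and the function names are illustrative, not an MDC requirement.

```python
import re

# Assumed convention: "<name>_v<number>" directly before the extension.
VERSION = re.compile(r"_v(\d+)(?=\.|$)")


def version_of(filename):
    """Return the version number encoded in a file name, or None if absent."""
    match = VERSION.search(filename)
    return int(match.group(1)) if match else None


def latest_version(filenames):
    """Return the file name with the highest version number, or None."""
    versioned = [(version_of(name), name) for name in filenames
                 if version_of(name) is not None]
    if not versioned:
        return None
    # Compare numerically so that v10 ranks above v9.
    return max(versioned)[1]
```

Comparing the numbers rather than the raw strings avoids the common pitfall where `v10` sorts before `v9` alphabetically; zero-padding the version (v01, v02, ...) avoids the same problem in file browsers.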
Example
- UK Data Service version control guide: https://www.ukdataservice.ac.uk/manage-data/format/versioning.aspx
Tools
- A Bitbucket server is available at git.mdc-berlin.de; please get in touch with the IT Helpdesk for access.
- SharePoint (not for personal or sensitive data)
- GitHub (not for personal or sensitive data)
- README files
The purpose of a README file is to give an overview of the content, helping others make sense of the enclosed data and thereby preserving its long-term value. This is helpful when sharing your data with others, when keeping track of content and changes across multiple projects, or when revisiting data after some time has passed.
- A README file is best suited for a collection of data, such as a directory for a specific project or experiment, a software tool, or any data that are logically related to each other.
- Place the README file in a parent directory associated with the content described.
- Use plain markdown or a simple text editor to create the README file in either .md or .txt file format.
- For dates use the ISO 8601 standard: YYYY-MM-DD.
- Whenever possible, use the standard vocabulary from your field; see the metadata standards directory by the RDA community.
Template README File
Title:
Abstract:
Keywords:
Funding/grant number:
Date project Started:
Date files were created:
Date the files were last updated:
Version:
Author information:
Full Name
Email
ORCID
Affiliation
Principal Investigator:
Full Name
Email
ORCID
Affiliation
Collaborators:
People involved in the project, for instance those involved in sample collection, processing, analysis and/or submission
Full Name
Email
ORCID
Affiliation
Access rights:
Can be viewed by X, Can be edited by X, Can be shared by X, Owned by X
License:
Sharing & attribution rights
Directory structure:
File Formats:
Links:
To related data or publications
Data Source:
Methods:
i.e. type of experiment, analysis
Naming conventions:
Abbreviations:
Units of measurement:
Codes or symbols:
Variable list:
Such as column headings for tabular data
Quality assurance:
Metadata:
How will it be generated and where will it be available
Data retention schedule:
Which data and file format will be kept, how long records should be kept, who is authorized to access it, who is responsible for disposal decision and performance.
Data sensitivity:
Does it contain private or sensitive data?