In this article, I'll be discussing some practices that were not part of my academic curriculum but have since become integral to my daily routine.
While these concepts may be familiar to those with a computer science background, I hold a Bachelor's degree in Biotechnology and a Master's in Genetics. I had to acquire these skills through hands-on experience in my job.
In this discussion, I won't delve into programming languages, file formats, or specific best practices for various types of analyses. Instead, I'll shed light on some tools and practices that are widely recognized among developers but may be unfamiliar to those from a biological science background.
Source version control
The most well-known services for managing version control in your projects are GitHub and GitLab, both of which utilize Git as their version control system. While Git is the most popular choice, there are alternative systems available, such as Mercurial and Subversion.
Git was developed by the same individual responsible for creating Linux, Linus Torvalds. The beauty of Git lies in its simplicity; users often need to be familiar with only a small fraction (less than 10%) of its features to effectively harness its capabilities.
A few notable advantages of employing source version control for your projects include:
Collaboration: multiple developers can work on the same project by opening separate branches, so one developer's work doesn't interfere with another's. When the work is done, the branch can be merged back into the main one (see the sketch after this list).
History and auditing: a complete record of changes lets developers understand how the project has evolved; every line of code carries the history of the authors who changed it.
Reverting changes: if a change introduces a bug or other issues, it's possible to revert to a previous version to restore the project to a stable state.
Backup and disaster recovery: you can always recover your project if something unexpected happens to your local machine.
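To make the branching flow above concrete, here is a minimal sketch of the everyday Git commands involved; the branch name, file name and commit hash are placeholders for illustration, not part of any real project.
git checkout -b add-qc-step        # open a new branch for your change
git add qc.sh                      # stage the new or modified files
git commit -m "Add QC step"        # record the change with a message
git checkout main                  # switch back to the main branch
git merge add-qc-step              # bring the finished work into main
git log --oneline                  # inspect the project history
git revert <commit-hash>           # undo a change that introduced a bug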
GitHub and GitLab provide web-based interfaces and additional features built on top of Git to enhance collaboration and project management.
Analysis workflow
A significant portion of my role involves developing and maintaining bioinformatics pipelines. To put it simply, we process terabytes of raw DNA sequencing data to extract meaningful biological insights. If you're interested in learning more about this process, please let me know in the comments, and I may write a future post on the topic.
Throughout my planning process, I typically adhere to the following steps:
Research the most widely-used tools available for addressing the specific problem at hand. The better you understand the characteristics of your data or problem, the more informed your choice will be.
Experiment with some of the tools interactively by manually installing them. If provided, use the examples to ensure your installation is functioning properly. This stage can be challenging for certain tools, so remember to take notes.
Create a subset of your data for hands-on testing. Depending on your experiment, this might involve isolating data from a smaller chromosome or downsampling the input data.
Use this subset to run the tools you've selected and familiarize yourself with the outputs. Sometimes many output files are created and proper documentation is hard to find. It's also a good idea to note how much computational resources (CPU, memory and disk) the analysis uses at this step.
(Optional) Consolidate all steps into a bash script, allowing for easy recollection and replication of your analysis.
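As an illustration of step 5, here is a minimal sketch of such a script. It assumes paired-end reads already aligned in a file called full.bam; the chr21 subset, file names and tool choices (samtools, bwa, GNU time) are examples, not a fixed recipe.
#!/usr/bin/env bash
set -euo pipefail

# Step 3: build a small test set by keeping only reads from one small chromosome
samtools view -b full.bam chr21 | samtools sort -n -o subset.bam -
samtools fastq -1 subset_R1.fq -2 subset_R2.fq subset.bam

# Step 4: run the candidate tool on the subset and record its resource usage
/usr/bin/time -v bwa mem -t 4 reference.fa subset_R1.fq subset_R2.fq > subset.sam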
Following this initial process, I begin creating a more automated approach to executing the analysis workflow. There are numerous tools available for orchestrating the various tasks within your pipeline. Here are some of them:
WDL - stands for "Workflow Description Language" (pronounced "widdle"). I'll probably write a dedicated post on this one, because it's the one I chose back in 2016. For now, you can find more details here.
Nextflow - has a more active community developing new pipelines and organizing events to promote the language. You can easily find good material on how to get started, including videos on YouTube. A good starting point is probably their official website (there's a quick example after this list).
Snakemake and Galaxy - I know a lot of people use them, but I haven't tried them yet.
I know that you can already achieve automation with your bash script (step 5), but these tools aim to offer the following for your analysis workflow:
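As a taste of what getting started looks like, using Nextflow as the example, the commands below install it and run its bundled "hello" demo; the pipeline and profile names in the last line are hypothetical placeholders.
curl -s https://get.nextflow.io | bash     # download the Nextflow launcher
./nextflow run hello                       # run the built-in demo pipeline

# Later, the same workflow can be pointed at another environment just by
# switching the configuration profile (profile names are defined per pipeline/site):
./nextflow run my-pipeline -profile cluster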
Reproducibility: enables researchers to clearly describe their workflows, including the tasks, inputs, outputs, and dependencies, making it easier for others to understand and reproduce the work.
Portability: execute your analysis in different platforms, without modifications. For example, you can develop using your local machine and launch the whole analysis in an HPC cluster or a Cloud provider.
Modularity: create reusable tasks, which can be shared and combined to create new workflows.
Scalability: processing one sample or thousands of samples takes the same "amount of work" on your side; all you need is enough machines available to run it.
From my experience, the initial benefits of utilizing workflow management systems include improved reproducibility and scalability. To fully leverage the additional advantages, a bit more effort may be required.
Project dependencies
Here I want to highlight two tools: Conda and Docker.
Beginning with Conda, this versatile tool helps manage packages and environments for various programming languages. Moreover, it provides access to numerous popular bioinformatics programs via the Bioconda channel. Getting started with Conda is straightforward.
Take a look at its official documentation. After installing it (usually into your home directory), you can get bioinformatics software by running, for example:
conda create -n my-env-for-project-a
conda activate my-env-for-project-a
conda install -c bioconda -c conda-forge bwa pandas matplotlib
By using this approach, you can maintain multiple versions of Python and other libraries on your system without running into the dreaded dependency nightmare. This issue often arises when you need to install several programs that share dependencies but require specific, conflicting versions.
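A related habit I find useful is exporting the environment so it can be recreated elsewhere; the environment name below matches the example above.
conda env export -n my-env-for-project-a > environment.yml   # record the exact package versions
conda env create -f environment.yml                          # recreate the environment on another machine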
In summary, Conda is an incredibly practical tool, particularly when the desired package is already available in the Conda channels.
Moving on to Docker: there's a wealth of information available online about this tool, as it's utilized in various contexts. It can quickly become complex, particularly when considering security and privileged access.
In our context, any container system, such as Docker, Podman, or Singularity, can be used to supply the necessary dependencies for our bioinformatics workflow. Typically, Docker isn't found in HPC clusters; instead, Singularity is installed. For our purposes, they can be considered equivalent (much like Microsoft Office and LibreOffice).
To simplify, users create instructions in files called Dockerfiles, specifying the base distribution (e.g., Ubuntu or CentOS) and the software to be included. These instructions generate images, with each instance of an image referred to as a container.
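Here is a minimal sketch of that cycle; the base image, image name and tools are arbitrary examples, not a recommendation, and assume the packages are available in the distribution's repositories.
cat > Dockerfile <<'EOF'
FROM ubuntu:22.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends bwa samtools \
    && rm -rf /var/lib/apt/lists/*
EOF

docker build -t my-aligner:0.1 .                      # the instructions become an image
docker run --rm my-aligner:0.1 samtools --version     # a running instance of the image is a container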
I've observed two common approaches for using containers in bioinformatics workflows:
Bundling all dependencies in a single image: This method can result in images larger than 5GB, containing all the necessary dependencies for your program. I have assisted in setting up this kind of container for the FunGAP pipeline.
Creating small, program-specific images: When utilizing a workflow language like Nextflow or WDL, it's possible to designate a different Docker image for each task. This approach allows you to leverage pre-existing images, such as those from the BioContainers project.
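For instance, a single task can pull a ready-made samtools image from the BioContainers registry; the version tag below is only an example and should be looked up on the registry before use.
docker pull quay.io/biocontainers/samtools:1.17--h00cdaf9_0
docker run --rm quay.io/biocontainers/samtools:1.17--h00cdaf9_0 samtools --version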
Important considerations: both Conda and the container approach come with their own set of challenges. You may occasionally run into compatibility issues when installing a package through Conda, or find that a program's Conda package has an issue, especially when Conda is not mentioned in the program's official documentation.
On the other hand, while containers offer more control, crafting a robust and comprehensive Dockerfile that meets all your requirements and withstands the test of time can be quite time-consuming. To adopt good Dockerfile practices, I find Hadolint, a handy linter, to be quite helpful.
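Running it is a one-liner, either with a local install or through its own container:
hadolint Dockerfile                                  # if Hadolint is installed locally
docker run --rm -i hadolint/hadolint < Dockerfile    # or via its official container image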
In future articles, I'll likely discuss implementing automatic tests for your bioinformatics workflows.