Generalized workflow after working with LLMs

For our bioinformatics analyses, we follow a standard procedure to ensure reproducibility, consistency, and clear tracking of results. The core components of this workflow standard are:

Containerization (using Docker)

We package the specific version of each software tool and all its dependencies into a Docker image. This guarantees that the software environment is exactly the same every time we run an analysis, regardless of the system it's run on, avoiding 'it works on my machine' problems.
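
As a sketch of this idea, a minimal Dockerfile might pin common alignment tools into one image. The base image, tool selection, and tag are illustrative assumptions, not our actual image:

```dockerfile
# Illustrative only: pin an OS release and install the tools the
# workflow needs, so every run sees the same software environment.
FROM ubuntu:22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends bwa samtools && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /data
```

Building this once (`docker build -t my-bioinfo-tools .`) and running all analyses through the resulting image is what removes the host system from the equation.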

Scripted Workflows (using shell scripts)

The entire analysis process for a given tool or step, from defining inputs and parameters to executing the core commands (e.g. indexing, alignment, post-processing), is written as a single shell script (.sh). This makes the analysis transparent, shareable, and easily repeatable; the script is our 'recipe' for the analysis.
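
A minimal skeleton of such a script might look like the following. The tool names, file names, and parameter values are placeholders, not a real pipeline:

```shell
#!/usr/bin/env bash
# Sketch of a workflow script: inputs and parameters up top,
# core commands below. All names here are hypothetical.
set -euo pipefail

# --- Inputs and parameters ---
REF="reference.fa"          # hypothetical reference genome
READS="sample_R1.fastq"     # hypothetical read file
THREADS=4

# --- Core commands (illustrative; real commands would go here) ---
# bwa index "$REF"
# bwa mem -t "$THREADS" "$REF" "$READS" > aln.sam
echo "Would align $READS against $REF using $THREADS threads"
```

Keeping every parameter in one place at the top of the script is what makes a rerun with different settings a one-line change.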

Dedicated, Timestamped Output Directories

Every time a workflow script is executed, it automatically creates a new, unique directory stamped with the date and time of the run. All the output files generated during that specific execution (e.g., alignment BAMs, index files, results files) are directed into this dedicated folder. This prevents results from different runs (even with slightly different parameters) from overwriting each other and keeps everything associated with a particular execution neatly organized.
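
The timestamped-directory step can be sketched in a few lines of shell; the `results_` prefix is an arbitrary choice for illustration:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a unique directory name from the current date and time,
# e.g. results_2024-05-01_14-30-05, and create it for this run.
RUN_DIR="results_$(date +%Y-%m-%d_%H-%M-%S)"
mkdir -p "$RUN_DIR"

echo "Outputs for this run will be written to: $RUN_DIR"
```

All subsequent commands in the script then write into `"$RUN_DIR"`, so two runs started even seconds apart never collide.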

Comprehensive Command Logging

Within each workflow script, we enable detailed logging (set -x and redirection to a log file). This means every single command executed by the script, along with its arguments, is recorded in a log file specific to that run's output directory. This log serves as an auditable record of exactly what happened, which is invaluable for troubleshooting and confirming the steps taken.
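
A minimal sketch of this logging setup: `set -x` writes its command trace to stderr, so redirecting stderr to a file captures every command with its arguments. The log file name is an assumption:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Redirect stderr (where set -x traces go) into a run-specific log,
# then enable tracing. Every command from here on is recorded.
LOGFILE="run_commands.log"
exec 2> "$LOGFILE"
set -x

echo "step 1: indexing (placeholder command)"
```

In the real scripts this log lives inside the run's timestamped output directory, so the audit trail stays with the results it describes.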

Background Execution & Persistence

We typically run these workflow scripts using tools like nohup or within persistent sessions like screen or tmux. This ensures that the analysis continues to run reliably in the background on the server, even if our SSH connection is lost or terminated.
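
A sketch of the `nohup` pattern, with a short `sleep` standing in for the real workflow script:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Launch the "workflow" in the background, immune to hangups (SSH
# disconnects), with all output captured in a log file.
nohup sh -c 'sleep 2; echo done' > job.log 2>&1 &
echo "background PID: $!"

# In real use we would log out here; for the sketch, wait for it.
wait $!
cat job.log
```

With `screen` or `tmux` the idea is the same: the session, and the workflow inside it, outlives the terminal that started it.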

By adhering to these principles, we ensure that any analysis can be reliably re-run, the exact steps taken are documented, and the output from each execution is clearly separated and traceable.
