BinAC2/Software/Alphafold

From bwHPC Wiki

The main documentation is available on the cluster via module help bio/alphafold. Most software modules for applications provide working example batch scripts.


Description   Content
module load   bio/alphafold
License       CC BY-NC-SA 4.0 - see [1]
Citing        See [2]
Links         DeepMind AlphaFold Website: [3]
              AlphaFold 3 Repository: [4]

Description

AlphaFold 3, developed by DeepMind, predicts the 3D structure and interactions of biological molecules: proteins, DNA, RNA, ligands, and other small molecules.

BinAC 2 provides almost everything you need for working with AlphaFold 3:

  • AlphaFold 3 installed in an Apptainer image
  • AlphaFold 3 database

However, you have to bring your own model parameter file. Due to licensing restrictions, we are not allowed to share the parameters publicly. You can find information on how to obtain them here.

Interpreting Predictions

When using AlphaFold 3, users should treat predicted structures as hypotheses rather than definitive representations of molecular reality. While the method often produces highly accurate models, confidence can vary substantially across regions, especially for flexible loops, disordered segments, or novel interactions not well represented in training data. Users should therefore critically assess the confidence metrics provided by the model and, where possible, validate predictions against experimental data or independent computational approaches. Biological plausibility, consistency with known functional or biochemical evidence, and sensitivity to alternative inputs should also be considered. Overall, AlphaFold 3 outputs are most powerful when used as guidance for downstream analysis and experimental design, not as final ground truth.

AlphaFold 3 supplies multiple confidence metrics to help you critically assess its predictions:

  • Predicted LDDT (pLDDT): predicted atomic coordinates are accompanied by pLDDT scores. These reflect AlphaFold 3’s local confidence in the prediction of the position of that particular atom.
  • Predicted Aligned Error (PAE) scores and a PAE plot: an indication of AlphaFold’s confidence in the packing and relative positions of domains, molecular chains such as proteins and DNA, and other entities like ligands and ions.
  • Predicted TM (pTM) score: a single-value metric reflecting the accuracy of the overall predicted structure.
  • Interface-predicted TM (ipTM) score: measures the accuracy of predictions of one component of the complex relative to the other components of the complex.
  • Per chain pTM and per-chain pair ipTM: confidence in individual chains or pairs of chains.
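
As a rough illustration of how these scores can be checked from the command line, the sketch below pulls scalar metrics out of a summary confidences file. It assumes the layout described in the official AlphaFold 3 output documentation (a <job_name>_summary_confidences.json file with top-level keys such as ptm, iptm and ranking_score). The helper name af3_metric and the example path are hypothetical, and the sed-based parsing only handles simple numeric values in indented JSON; a real JSON parser is the safer choice for anything beyond a quick look.

```shell
#!/bin/bash
# af3_metric FILE KEY
# Print the first numeric value stored under KEY in an indented JSON file.
# Assumption: one key per line, as in AlphaFold 3 summary confidence files.
# Keys holding arrays or null (e.g. per-chain values) print nothing.
af3_metric() {
  local file="$1" key="$2"
  sed -n "s/.*\"${key}\"[[:space:]]*:[[:space:]]*\([-0-9.eE+]\{1,\}\).*/\1/p" "$file" | head -n 1
}

# Example call (hypothetical path inside the output directory):
# af3_metric "$ALPHAFOLD_RESULTS_DIR/2PV7/2PV7_summary_confidences.json" ranking_score
```

Because the key is matched together with its surrounding quotes, asking for ptm does not accidentally pick up iptm or chain_ptm.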

Further Reading:

Usage

AlphaFold's algorithm can be divided into two steps:

  1. CPU-only part: Computation of several multiple sequence alignments (MSAs)
  2. GPU part: The MSAs are used as input for the neural network that infers the structure

The two steps therefore have different optimal resource profiles regarding the number of CPU cores and GPUs. For this reason we provide two template jobscripts, one for each step, at /opt/bwhpc/common/bio/alphafold/3.0.1/bwhpc-examples.

Input

You can provide inputs to run_alphafold.py in one of two ways:

  • Single input file: Use the --json_path flag followed by the path to a single JSON file.
  • Multiple input files: Use the --input_dir flag followed by the path to a directory of JSON files.
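
The choice between the two flags can be made automatically in a jobscript. The following is a minimal sketch, not part of the provided templates: the helper name af3_input_flag is hypothetical, and it simply echoes the flag that matches the kind of path it is given.

```shell
#!/bin/bash
# af3_input_flag PATH
# Echo the run_alphafold.py input flag that matches PATH:
#   regular file -> --json_path=PATH
#   directory    -> --input_dir=PATH
af3_input_flag() {
  local path="$1"
  if [ -f "$path" ]; then
    printf -- '--json_path=%s\n' "$path"
  elif [ -d "$path" ]; then
    printf -- '--input_dir=%s\n' "$path"
  else
    echo "af3_input_flag: '$path' is neither a file nor a directory" >&2
    return 1
  fi
}
```

The echoed flag can then be appended to the run_alphafold.py call in your copy of the template jobscript.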

Please see the official AlphaFold 3 documentation for more information regarding JSON file structure.

The example jobscript uses a very simple input JSON:

{
  "name": "2PV7",
  "sequences": [
    {
      "protein": {
        "id": ["A", "B"],
        "sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLHPMFGADIASMAKQVVVRCDGRFPERYEWLLEQIQIWGAKIYQTNATEHDHNMTYIQALRHFSTFANGLHLSKQPINLANLLALSSPIYRLELAMIGRLFAQDAELYADIIMDKSENLAVIETLKQTYDEALTFFENNDRQGFIDAFHKVRDWFGDYSEQFLKESRQLLQQANDLKQG"
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}
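
If you want to fold many sequences, one JSON file per sequence can be generated and the resulting directory passed via --input_dir. The sketch below mirrors the fields of the example above for a single protein chain "A". The function name write_af3_input is hypothetical, and it assumes that the job name and sequence contain no characters that need JSON escaping or that are invalid in file names.

```shell
#!/bin/bash
# write_af3_input NAME SEQUENCE OUTDIR
# Write OUTDIR/NAME.json describing one protein chain with id "A".
# Assumption: NAME and SEQUENCE need no JSON escaping.
write_af3_input() {
  local name="$1" sequence="$2" outdir="$3"
  mkdir -p "$outdir" || return 1
  cat > "$outdir/$name.json" <<EOF
{
  "name": "$name",
  "sequences": [
    {
      "protein": {
        "id": ["A"],
        "sequence": "$sequence"
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}
EOF
}

# Example call (hypothetical job name, sequence and directory):
# write_af3_input my_protein "GMRESYANENQFGFK" "$HOME/af3-inputs"
```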

AlphaFold 3 database

Don't change --db_dir=$ALPHAFOLD_DATABASES without a good reason. Upon loading the AlphaFold 3 module, the environment variable $ALPHAFOLD_DATABASES is set to the centrally stored database at /pfs/10/project/db/alphafold/3.0.1/databases/.

AlphaFold 3 model parameters

Set the path to the AlphaFold 3 model parameters that you received from DeepMind after applying for them. If you place the file af3.bin.zst in the directory $HOME/af3-models, the example scripts will work out of the box.

Please note: You can also store the parameters in your project directory or a workspace for slightly better performance. The template uses $HOME/af3-models because the $HOME environment variable is defined for every user.
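
Before submitting a job it can be worth checking that both the database directory and the parameter file are in place. The sketch below is not taken from the template jobscripts. The function name af3_preflight is hypothetical; it relies only on the $ALPHAFOLD_DATABASES variable and the af3.bin.zst file name mentioned above.

```shell
#!/bin/bash
# af3_preflight [MODEL_DIR]
# Return 0 if the database directory and the parameter file look usable,
# 1 otherwise. MODEL_DIR defaults to $HOME/af3-models.
af3_preflight() {
  local model_dir="${1:-$HOME/af3-models}"
  local status=0
  if [ -z "${ALPHAFOLD_DATABASES:-}" ] || [ ! -d "$ALPHAFOLD_DATABASES" ]; then
    echo "ALPHAFOLD_DATABASES is unset or not a directory (is the module loaded?)" >&2
    status=1
  fi
  # -s: the file exists and is not empty
  if [ ! -s "$model_dir/af3.bin.zst" ]; then
    echo "No parameter file found at $model_dir/af3.bin.zst" >&2
    status=1
  fi
  return "$status"
}
```

A typical use would be `module load bio/alphafold && af3_preflight || exit 1` near the top of a jobscript, so that the job stops early instead of failing after queueing.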

Output

You can set the output directory via the --output_dir=$ALPHAFOLD_RESULTS_DIR option. The template jobscript creates a workspace called alphafold as output directory.

For every input job, AlphaFold 3 writes all its outputs to a directory named after the sanitized version of the job name. For example, for the job name "My first fold (TEST)", AlphaFold 3 writes its outputs to a directory called My_first_fold_TEST (the case is preserved). If such a directory already exists, AlphaFold 3 appends a timestamp to the directory name to avoid overwriting existing data, unless --force_output_dir is passed.
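
To locate the results of a job from a script, the directory name can be approximated in the shell. The sketch below reproduces the documented example by replacing spaces with underscores and dropping every character that is not an ASCII letter, digit, underscore, hyphen or dot. This rule set is an assumption about AlphaFold 3's behaviour, and the function name af3_sanitised_name is hypothetical; check the actual output directory if your job names contain unusual characters.

```shell
#!/bin/bash
# af3_sanitised_name NAME
# Approximate the output directory name for a job called NAME:
# spaces become underscores, then everything except ASCII letters,
# digits, '_', '-' and '.' is removed. Case is left unchanged.
af3_sanitised_name() {
  printf '%s' "$1" | tr ' ' '_' | tr -cd 'A-Za-z0-9_.-'
  printf '\n'
}

# af3_sanitised_name "My first fold (TEST)"   # prints My_first_fold_TEST
```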

Please see the official AlphaFold 3 documentation for more information regarding output directory structure.

CPU-only: Multiple Sequence Alignment

At the beginning, AlphaFold 3 computes three multiple sequence alignments (MSAs). These MSAs are computed sequentially on the CPU, and the number of threads is hard-coded:

  • jackhmmer on UniRef90 using 8 threads
  • jackhmmer on MGnify using 8 threads
  • HHblits on BFD + Uniclust30 using 4 threads

We provide a template jobscript at: /opt/bwhpc/common/bio/alphafold/3.0.1/bwhpc-examples/binac2-af3-alignment.slurm

The values for --time and --mem are sensible defaults, but you can change them if your inputs require it.

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --nodes=1

# Alphafold creates alignments sequentially, using at max 8 cores.
#SBATCH --ntasks-per-node=8

#SBATCH --time=10:00:00
#SBATCH --mem=100gb

The MSAs are stored in the directory specified by --output_dir.

GPU-part: Model Inference

After computing the MSAs, AlphaFold performs model inference on the GPU. Only one GPU is used. The optimal resource profile for this step is:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:a100:1
#SBATCH --time=06:00:00
#SBATCH --mem=100gb

The template jobscript differs from the CPU-only jobscript only in the option --norun_data_pipeline. This means the MSAs are not recomputed but are taken from the previous CPU-only job.
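
The split into two stages can be sketched as a helper that assembles the run_alphafold.py arguments for each stage. This is an illustration rather than the content of the template jobscripts. The function name af3_stage_args is hypothetical, and the use of --norun_inference for the CPU-only stage is an assumption based on the upstream AlphaFold 3 flags; the other options are the ones named on this page. How run_alphafold.py is launched inside the Apptainer image is deliberately left out; see the templates for that.

```shell
#!/bin/bash
# af3_stage_args STAGE INPUT_JSON
# Print one run_alphafold.py argument per line for the given stage.
#   alignment -> skip inference (assumption: --norun_inference)
#   inference -> skip the data pipeline (--norun_data_pipeline)
af3_stage_args() {
  local stage="$1" input_json="$2" stage_flag
  case "$stage" in
    alignment) stage_flag="--norun_inference" ;;
    inference) stage_flag="--norun_data_pipeline" ;;
    *) echo "af3_stage_args: unknown stage '$stage'" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "--json_path=$input_json" \
    "--db_dir=$ALPHAFOLD_DATABASES" \
    "--output_dir=$ALPHAFOLD_RESULTS_DIR" \
    "$stage_flag"
}
```

Keeping the stage flag as the only difference makes it easy to see that both jobs must point at the same output directory, since the inference job reads what the alignment job wrote there.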

Chaining CPU- and GPU-Jobs

To run the data pipeline first and then start the inference job as soon as the first one is finished, you can chain them like this:

JOBID=$(sbatch --parsable binac2-af3-alignment.slurm)
sbatch --dependency=afterok:$JOBID binac2-af3-inference.slurm
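
A slightly more defensive variant of the two commands above is sketched below. It stops if the first submission fails, strips the cluster name that sbatch --parsable may append to the job ID, and asks Slurm to cancel the inference job if the alignment job fails instead of leaving it pending. The function name submit_af3_chain and the SBATCH_CMD override are hypothetical; the override exists only so that the function can be tried without a Slurm installation.

```shell
#!/bin/bash
# submit_af3_chain ALIGN_SCRIPT INFERENCE_SCRIPT
# Submit both jobs; the inference job starts only after the alignment
# job has finished successfully.
submit_af3_chain() {
  local sbatch_cmd="${SBATCH_CMD:-sbatch}"
  local align_id infer_id
  align_id=$("$sbatch_cmd" --parsable "$1") || {
    echo "alignment submission failed" >&2
    return 1
  }
  align_id="${align_id%%;*}"   # --parsable may print "jobid;cluster"
  infer_id=$("$sbatch_cmd" --parsable --dependency="afterok:$align_id" \
    --kill-on-invalid-dep=yes "$2") || {
    echo "inference submission failed (alignment job $align_id is still queued)" >&2
    return 1
  }
  echo "alignment=$align_id inference=${infer_id%%;*}"
}

# submit_af3_chain binac2-af3-alignment.slurm binac2-af3-inference.slurm
```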