BinAC2/Software/Alphafold
|
The main documentation is available on the cluster via |
| Description | Content |
|---|---|
| module load | bio/alphafold |
| License | CC BY-NC-SA 4.0 - see [1] |
| Citing | See [2] |
| Links | DeepMind AlphaFold Website: [3] Alphafold 3 Repository: [4] |
Description
AlphaFold 3 developed by DeepMind predicts the 3D structure and interactions of biological molecules: proteins, DNA, RNA, ligands, and other small molecules.
BinAC 2 provides almost everything you need for working with Alphafold 3:
- Alphafold 3 installed in an Apptainer image
- Alphafold 3 database
However, you have to bring your own model parameter file. Due to license restrictions, we are not allowed to share it publicly. You can find information on how to obtain it here.
Interpreting Predictions
When using AlphaFold3, users should treat predicted structures as hypotheses rather than definitive representations of molecular reality. While the method often produces highly accurate models, confidence can vary substantially across regions, especially for flexible loops, disordered segments, or novel interactions not well represented in training data. Users should therefore critically assess confidence metrics provided by the model and, where possible, validate predictions against experimental data or independent computational approaches. Biological plausibility, consistency with known functional or biochemical evidence, and sensitivity to alternative inputs should also be considered. Overall, AlphaFold3 outputs are most powerful when used as guidance for downstream analysis and experimental design, not as final ground truth.
AlphaFold 3 supplies multiple confidence metrics to help you critically assess its predictions:
- Predicted LDDT (pLDDT): predicted atomic coordinates are accompanied by pLDDT scores. These reflect AlphaFold 3’s local confidence in the prediction of the position of that particular atom.
- Predicted Aligned Error (PAE) scores and a PAE plot: an indication of AlphaFold’s confidence in the packing and relative positions of domains, molecular chains such as proteins and DNA, and other entities like ligands and ions.
- Predicted TM (pTM) score: a single-value metric reflecting the accuracy of the overall predicted structure.
- Interface-predicted TM (ipTM) score: measures the accuracy of predictions of one component of the complex relative to the other components of the complex.
- Per chain pTM and per-chain pair ipTM: confidence in individual chains or pairs of chains.
Further Reading:
Usage
AlphaFold's algorithm can be divided into two steps:
- CPU-only part: computation of several multiple sequence alignments (MSAs)
- GPU part: the MSAs are used as input to the neural network that infers the structure
The two steps have different optimal resource profiles regarding the number of CPU cores and GPUs.
Therefore, we provide one template jobscript for each step at /opt/bwhpc/common/bio/alphafold/3.0.1/bwhpc-examples.
Input
You can provide inputs to run_alphafold.py in one of two ways:
- Single input file: Use the --json_path flag followed by the path to a single JSON file.
- Multiple input files: Use the --input_dir flag followed by the path to a directory of JSON files.
Please see the official AlphaFold 3 documentation for more information regarding JSON file structure.
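If a jobscript should accept either a single JSON file or a directory of JSON files, the choice between the two flags can be made from the path itself. The following is a minimal sketch; the helper name af3_input_flag is hypothetical and not part of AlphaFold 3 or the BinAC 2 templates.

```shell
# Hypothetical helper (not part of AlphaFold 3 or the BinAC 2 templates):
# print the input flag that matches the given path.
af3_input_flag() {
    input_path="$1"
    if [ -d "$input_path" ]; then
        # A directory of JSON files is passed via --input_dir.
        printf '%s=%s\n' "--input_dir" "$input_path"
    elif [ -f "$input_path" ]; then
        # A single JSON file is passed via --json_path.
        printf '%s=%s\n' "--json_path" "$input_path"
    else
        echo "error: '$input_path' is neither a file nor a directory" >&2
        return 1
    fi
}
```

The printed flag can then be appended to the run_alphafold.py call in your jobscript, for example by storing it in a variable first.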
The example jobscript uses a very simple input JSON:
{
"name": "2PV7",
"sequences": [
{
"protein": {
"id": ["A", "B"],
"sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLHPMFGADIASMAKQVVVRCDGRFPERYEWLLEQIQIWGAKIYQTNATEHDHNMTYIQALRHFSTFANGLHLSKQPINLANLLALSSPIYRLELAMIGRLFAQDAELYADIIMDKSENLAVIETLKQTYDEALTFFENNDRQGFIDAFHKVRDWFGDYSEQFLKESRQLLQQANDLKQG"
}
}
],
"modelSeeds": [1],
"dialect": "alphafold3",
"version": 1
}
AlphaFold 3 database
Don't change --db_dir=$ALPHAFOLD_DATABASES without a good reason. Upon loading the AlphaFold 3 module, the environment variable $ALPHAFOLD_DATABASES is set to the centrally stored database at /pfs/10/project/db/alphafold/3.0.1/databases/.
AlphaFold 3 model parameters
Set the path to the AlphaFold 3 model parameters that you received from DeepMind after applying for them. If you place the file af3.bin.zst in the directory $HOME/af3-models, the example scripts will work out of the box.
Please note: You can also store the parameters in your project directory or a workspace for slightly better performance. The template uses $HOME/af3-models because $HOME is set for every user.
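Before submitting a job, it can be useful to check that the database directory and the parameter file are where the jobscript expects them. The following is a minimal sketch under the assumptions stated in this section (database path in $ALPHAFOLD_DATABASES, parameters in a file named af3.bin.zst); the function name af3_preflight is hypothetical.

```shell
# Hypothetical pre-flight check; the function name is illustrative only.
# Arguments: database directory, directory holding the model parameters.
af3_preflight() {
    db_dir="$1"
    model_dir="$2"
    status=0
    if [ -z "$db_dir" ] || [ ! -d "$db_dir" ]; then
        echo "error: database directory '$db_dir' not found (is the module loaded?)" >&2
        status=1
    fi
    if [ ! -f "$model_dir/af3.bin.zst" ]; then
        echo "error: '$model_dir/af3.bin.zst' not found" >&2
        status=1
    fi
    return "$status"
}
```

With the module loaded, a call such as af3_preflight "$ALPHAFOLD_DATABASES" "$HOME/af3-models" returns a non-zero exit status if either prerequisite is missing.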
Output
You can set the output directory via the --output_dir=$ALPHAFOLD_RESULTS_DIR option. The template jobscript creates a workspace called alphafold as output directory.
For every input job, AlphaFold 3 writes all its outputs to a directory named after the sanitized version of the job name. For example, for the job name "My first fold (TEST)", AlphaFold 3 writes its outputs to a directory called My_first_fold_TEST (the case is preserved). If such a directory already exists, AlphaFold 3 appends a timestamp to the directory name to avoid overwriting existing data, unless --force_output_dir is passed.
Please see the official AlphaFold 3 documentation for more information regarding output directory structure.
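To locate the results of a job from a script, the directory name can be approximated from the job name. The sketch below assumes that spaces become underscores and that every character other than ASCII letters, digits, underscore, hyphen, and dot is dropped. This assumption is consistent with the example above, but it is not taken from this page, so please verify it against the actual output. The function name af3_sanitised_name is hypothetical.

```shell
# Hypothetical helper: approximate the output directory name for a job name.
# Assumption: spaces become underscores; other disallowed characters are dropped.
af3_sanitised_name() {
    # Replace spaces with underscores, then delete every character that is
    # not an ASCII letter, digit, underscore, hyphen, or dot.
    printf '%s' "$1" | tr ' ' '_' | LC_ALL=C tr -cd 'A-Za-z0-9_.-'
}
```

For the example input shown earlier, the results would then be expected under "$ALPHAFOLD_RESULTS_DIR/$(af3_sanitised_name "2PV7")", unless a timestamp was appended because the directory already existed.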
CPU-only: Multiple Sequence Alignment
First, AlphaFold 3 computes three multiple sequence alignments (MSAs). These MSAs are computed sequentially on the CPU, and the number of threads is hard-coded:
- jackhmmer on UniRef90 using 8 threads
- jackhmmer on MGnify using 8 threads
- HHblits on BFD + Uniclust30 using 4 threads
We provide a template jobscript at: /opt/bwhpc/common/bio/alphafold/3.0.1/bwhpc-examples/binac2-af3-alignment.slurm
The values for --time and --mem are sensible defaults, but you can change them if needed.
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --nodes=1
# Alphafold creates alignments sequentially, using at max 8 cores.
#SBATCH --ntasks-per-node=8
#SBATCH --time=10:00:00
#SBATCH --mem=100gb
The MSAs are stored in the directory specified by --output_dir.
GPU-part: Model Inference
After computing the MSAs, AlphaFold performs model inference on the GPU. Only one GPU is used. The following resource profile is optimal for this step:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:a100:1
#SBATCH --time=06:00:00
#SBATCH --mem=100gb
The template jobscript differs from the CPU-only jobscript only in the option --norun_data_pipeline. This means the MSAs are not recomputed but are taken from the previous CPU-only job.
Chaining CPU- and GPU-Jobs
To run the data pipeline first and then start the inference job as soon as the first one has finished successfully, you can chain them like this:
JOBID=$(sbatch --parsable binac2-af3-alignment.slurm)
sbatch --dependency=afterok:$JOBID binac2-af3-inference.slurm
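The two commands above can also be wrapped in a small function that stops if the first submission fails. This is a minimal sketch; the function name submit_af3_chain is hypothetical. It also accounts for the fact that sbatch --parsable may print the job ID followed by a semicolon and the cluster name.

```shell
# Hypothetical wrapper: submit the alignment job, then the inference job
# that depends on it. Prints the job ID of the inference job.
submit_af3_chain() {
    align_script="$1"
    infer_script="$2"
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    align_id=$(sbatch --parsable "$align_script") || return 1
    align_id=${align_id%%;*}
    if [ -z "$align_id" ]; then
        echo "error: no job ID returned for $align_script" >&2
        return 1
    fi
    # afterok: start only if the alignment job finished successfully.
    sbatch --parsable --dependency=afterok:"$align_id" "$infer_script"
}
```

A call such as submit_af3_chain binac2-af3-alignment.slurm binac2-af3-inference.slurm then submits both jobs in order. If the alignment job fails, the dependency can never be satisfied, so the inference job does not start.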