NEMO2/SDS hd

Latest revision as of 18:49, 22 April 2026

WARNING! This feature is currently for testing purposes only!

SDS@hd is the scientific storage service of Heidelberg University. On NEMO2, you can mount your SDS@hd project directly into your jobs or interactive sessions via rclone. Data is cached locally on the node's NVMe drive and uploaded to SDS@hd in the background.

What you need to do:

  • Initial Setup — one-time configuration, takes about 5 minutes. You must complete this before using SDS@hd on NEMO2.
  • Advanced Configuration — optional. Tune cache size, bandwidth limits, and more. Skip this on your first use.

Initial Setup (Required)

Complete these four steps once. After that, SDS@hd is available in all your jobs and interactive sessions without any further configuration.

Prerequisites:

  • Your bwIDM username and SDS@hd password
  • Your SDS@hd project ID (e.g. sd14a001, find it at SDS@hd/Access)

Step 1: Create the rclone configuration file

Run this command to create the config file with the correct permissions:

rclone config touch

This creates ~/.config/rclone/rclone.conf with mode 600. Now open the file in an editor and add the following two sections, replacing the placeholder values with your own:

[sdshd-backend]
type = sftp
host = lsdf02-sshfs.urz.uni-heidelberg.de
user = YOUR_BWIDM_USERNAME
pass = PASTE_OBSCURED_PASSWORD_HERE
md5sum_command = none
sha1sum_command = none
shell_type = unix
set_modtime = false
idle_timeout = 0

[sdshd-sftp]
type = alias
remote = sdshd-backend:YOUR_PROJECT_ID

Replace:

  • YOUR_BWIDM_USERNAME — your bwIDM login name (same as on NEMO2)
  • YOUR_PROJECT_ID — your SDS@hd project ID, e.g. sd14a001
  • PASTE_OBSCURED_PASSWORD_HERE — leave this for now, you will fill it in Step 2

The [sdshd-sftp] section name must not be changed — all NEMO2 scripts mount sdshd-sftp: by name.

What the path in [sdshd-sftp] controls:

The path after the colon in remote = sdshd-backend:YOUR_PROJECT_ID becomes the root of your mount at /mnt/sdshd/$USER. If you set it correctly, your project files appear directly under /mnt/sdshd/$USER/.

If you leave the path empty (remote = sdshd-backend:), the mount shows the SFTP server's home directory. Your project then appears as a subdirectory at /mnt/sdshd/$USER/YOUR_PROJECT_ID/. Job scripts that reference /mnt/sdshd/$USER/ directly will not find your data.

Exception: If you are a member of multiple projects, you may intentionally omit the project ID so all projects appear as subdirectories. In that case, adjust your job scripts to include the project ID in the path.
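For that multi-project case, the alias section might look like this (a sketch; the project IDs in the comment are illustrative):

```
[sdshd-sftp]
type = alias
remote = sdshd-backend:

# All projects now appear as subdirectories of the mount, e.g.
#   /mnt/sdshd/$USER/sd14a001/
#   /mnt/sdshd/$USER/sd14b002/
```

Job scripts must then include the project ID in every path, e.g. INPUT=/mnt/sdshd/$USER/sd14a001/inputdata.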

Step 2: Set the password

rclone does not store passwords in plain text — it stores an obscured form. Generate the obscured password with the following command; it reads your password without echoing it to the terminal, so the password never appears in your shell history:

printf 'SDS@hd password:\n'; read -s p && printf '%s' "$p" | rclone obscure -; echo

Type your SDS@hd password and press Enter. The command prints a string like:

QKkf-abc123XYZetc

Copy that string and paste it as the pass = value in ~/.config/rclone/rclone.conf, replacing PASTE_OBSCURED_PASSWORD_HERE. Then clear the variable from memory:

unset p        # bash / zsh
# set -e p    # fish

Do not use echo "mypassword" | rclone obscure - — the password would appear in your shell history and in ps output. The read -s method above avoids both.

Step 3: Protect the config file

rclone config touch already creates the file with mode 600. If you created it manually or are unsure, set the permissions explicitly:

chmod 600 ~/.config/rclone/rclone.conf

Step 4: Verify the connection

rclone ls sdshd-sftp:

Expected output: a list of files and directories in your project. An empty listing (no output, no error) is also fine — it means the project exists but is empty.

Error message                   Likely cause
SSH authentication failed       Wrong username or password
no such file or directory       Project ID in [sdshd-sftp] is wrong
connection refused / timeout    Network issue or wrong hostname
Empty listing (no error)        Project exists but is empty — that is fine

Setup complete. See Basic Usage to start using SDS@hd.

Basic Usage

Interactive sessions on login nodes

mount-sdshd            # mount at /mnt/sdshd/$USER
umount-sdshd           # unmount and wait for all uploads to finish

The mount point /mnt/sdshd/$USER is created automatically when you log in, via a PAM hook. No manual admin step is needed.

Always use umount-sdshd to disconnect — never fusermount -u directly.

fusermount -u terminates rclone immediately, cutting off any uploads in progress and causing data loss. umount-sdshd waits until rclone confirms that all uploads are complete before sending the shutdown signal.

Run mount-sdshd --help for all options and examples.

Slurm jobs

Add the --gres=sdshd flag to your job script. The Slurm prolog mounts SDS@hd automatically at job start; the epilog waits for all uploads to complete before the job finishes:

#SBATCH --gres=sdshd

Your data is available at /mnt/sdshd/$USER inside the job, with no further commands needed.

The job shows CG (COMPLETING) in squeue during the upload phase. For large output files this can take minutes — this is expected and safe. The job will not exit until all data has been confirmed uploaded.

If you need data on SDS@hd before the job finishes (e.g. for a multi-step pipeline where the next job reads this job's output), call umount-sdshd explicitly at the end of your job script:

#!/bin/bash
#SBATCH --gres=sdshd

# ... your computation ...

umount-sdshd    # blocks here until all uploads are done, then the job exits
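The multi-step pattern can be sketched as two job scripts chained with a Slurm dependency. This is a sketch with hypothetical script and program names (produce_results, consume_results); the sbatch lines are shown commented out because they only work on the cluster:

```shell
# step1.sh: produces output and flushes it to SDS@hd before exiting
cat > step1.sh <<'EOF'
#!/bin/bash
#SBATCH --gres=sdshd
./produce_results /mnt/sdshd/$USER/stage1/   # hypothetical program
umount-sdshd   # block until stage1 output is confirmed on SDS@hd
EOF

# step2.sh: starts only after step1 succeeded, so the data is already there
cat > step2.sh <<'EOF'
#!/bin/bash
#SBATCH --gres=sdshd
./consume_results /mnt/sdshd/$USER/stage1/   # hypothetical program
EOF

# Chain them on a login node:
# jid=$(sbatch --parsable step1.sh)
# sbatch --dependency=afterok:$jid step2.sh
```

Because step1 calls umount-sdshd itself, the afterok dependency guarantees that step2 never starts before all of step1's uploads have finished.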

Advanced Configuration

This section is optional. The built-in defaults work well for most jobs. Come back here if you need to:

  • limit cache size or bandwidth
  • use a persistent (crash-safe) cache on Weka
  • tune performance for ML training or large job arrays

Per-user configuration file

Create ~/.config/sdshd/config to override the built-in defaults for all your jobs:

mkdir -p ~/.config/sdshd

The format is key = value, one per line. Lines starting with # are comments. The file is parsed without being executed — invalid or unknown keys are silently ignored.
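Putting both steps together, a minimal override file can be created like this (the two values are illustrative; any key you omit keeps its built-in default):

```shell
# Create a minimal per-user config; note this overwrites any existing file.
mkdir -p ~/.config/sdshd
cat > ~/.config/sdshd/config <<'EOF'
# Smaller cache for login-node use (illustrative values)
cache_size = 20G
dir_cache  = 5m
EOF

cat ~/.config/sdshd/config
```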

Configuration keys

cache_dir (default: /tmp/sdshd-<uid>/cache/)
    Local VFS cache directory. Node-local NVMe by default (cleaned on reboot).
    Set to a Weka workspace path for crash-safe persistent caching. See
    Persistent cache on Weka.

cache_mode (default: full)
    VFS cache mode: off / minimal / writes / full. full caches entire files
    for read and write. Required for arbitrary file access patterns. off reads
    directly from SDS@hd with no local cache.

cache_size (default: 50G)
    Maximum local cache size. rclone evicts least-recently-used files
    automatically when the limit is approached.

cache_age (default: 24h)
    How long to keep cached files before eviction, even if cache_size is not
    hit. Set to at least your job's wall time when using Weka as cache_dir.

sftp_concurrency (default: 16)
    Parallel SFTP requests per connection. Higher values fill the TCP window
    more aggressively over the 5 ms WAN link. Reduce to 8 for large job arrays
    (e.g. 100+ tasks) to avoid overloading the gateway.

transfers (default: 8)
    Number of parallel background upload workers.

buffer_size (default: 128M)
    In-memory read/write buffer per file. 128M is well-matched to the
    Freiburg–Heidelberg RTT. Increase for very large sequential files.

read_ahead (default: 64M)
    Sequential read prefetch window. Increase to 512M for large sequential reads.

write_back (default: 5s)
    Delay between file close and upload start. With the default 5s, close()
    returns immediately and the upload runs in the background — your job
    continues while data transfers to SDS@hd. Setting this to 0s makes uploads
    synchronous: your job stalls at every file write for the full upload
    duration. You almost certainly do not want 0s.

dir_cache (default: 5m)
    How long to cache directory listings. Use 2h for compute jobs (stable
    dataset), 5m for interactive use (picks up changes from other nodes quickly).

extra_opts (default: empty)
    Extra rclone flags appended verbatim. Values are split on spaces —
    arguments containing internal spaces (e.g. bwlimit timetables) cannot be
    expressed here; use rclone.conf or mount-sdshd command-line options instead.

Full ~/.config/sdshd/config example:

# ~/.config/sdshd/config

# Local VFS cache directory. Must be an absolute path.
# Leave empty to use /tmp/sdshd-<uid>/cache/ (node-local NVMe, cleaned on reboot).
# Use a Weka workspace path (ws_find <name>) for crash-safe output caching or
# for large datasets that exceed node NVMe capacity.
# A per-hostname subdirectory (rclone-<nodename>/) is always appended so that
# parallel jobs on different nodes each have their own independent cache.
cache_dir       =

# VFS cache mode: off | minimal | writes | full (default: full)
# 'full' caches entire files locally for read/write. Required for arbitrary
# file access patterns and write support. 'off' reads directly from SDS@hd
# with no local cache.
cache_mode      = full

# Maximum local cache size. rclone evicts LRU files automatically.
cache_size      = 50G

# How long to keep cached files before eviction (even if cache_size not hit).
# Set >= max job wall time when using Weka as cache_dir (not cleaned on reboot).
cache_age       = 24h

# Number of parallel SFTP requests per connection.
# Higher values fill the TCP window more aggressively over the 5 ms WAN link.
# For job arrays with many tasks, reduce to 8 to avoid overloading lsdf02-sshfs.
sftp_concurrency = 16

# Number of parallel file transfer workers (background uploads).
transfers       = 8

# In-memory read/write buffer per file. 128M fills the TCP window well at
# 5 ms RTT (Freiburg → Heidelberg). Increase for very large sequential files.
buffer_size     = 128M

# Sequential read prefetch window. Increase to 512M for large sequential reads.
read_ahead      = 64M

# Delay between file close and upload start.
# 5s (default): close() returns immediately; the upload runs asynchronously
# in the background. Your job continues while data is transferred to SDS@hd,
# and umount-sdshd (or the Slurm epilog) waits for all uploads to finish
# before disconnecting.
# 0s: upload runs synchronously inside close(), blocking your job for the
# entire upload duration of every file written. For large output files this
# means your job script stalls at each file write — you almost certainly
# do not want this.
# Note: value must be in seconds (e.g. 5s, 30s); other units are not supported.
write_back      = 5s

# How long directory listings are cached. 2h for compute jobs (stable),
# 5m for interactive use (pick up changes from other login nodes quickly).
dir_cache       = 5m

# Extra rclone flags appended verbatim after all other options.
# Command-line arguments (mount-sdshd only) still take final precedence.
#
# Values are split on spaces, so arguments that contain an internal space
# (e.g. bwlimit timetables like "08:00,100M 18:00,500M") cannot be expressed
# here. Set those via rclone.conf or pass them directly on the command line
# with mount-sdshd instead.
#
# Useful examples:
#
#   Limit upload/download bandwidth (per rclone process, not per file).
#   Useful when running multiple jobs simultaneously or to be a good
#   neighbour on a shared gateway. Value: bytes/s suffix M/G, or 'off'.
#   Per-direction: --bwlimit-file takes the same syntax.
#     extra_opts = --bwlimit 200M
#
#   Exclude patterns — keep temp/checkpoint files out of the VFS cache
#   so they don't consume cache space or trigger unnecessary uploads.
#   Do NOT use shell quotes here — they are passed literally to rclone,
#   not interpreted by the shell.  Write glob patterns unquoted:
#     extra_opts = --exclude *.tmp --exclude checkpoint_*.pt
#
#   Override SSH connection timeout (default 60s). Reduce if you want
#   faster failure detection on flaky connections:
#     extra_opts = --timeout 30s
extra_opts      =

Available disk space by node type:

Node type      Local disk   Recommended max cache_size
Login nodes    ~400 GB      20–50 G (shared with other users)
Milan nodes    ~1.9 TB      50 G (default)
Other nodes    ~3.8 TB      50 G (default)

Useful extra_opts examples:

# Limit upload/download bandwidth (per rclone process).
# Useful when running many jobs simultaneously or to be a good neighbour.
extra_opts = --bwlimit 200M

# Exclude temp and checkpoint files from the cache — saves cache space
# and avoids uploading files you don't need on SDS@hd.
# Do NOT use shell quotes here — write glob patterns unquoted.
extra_opts = --exclude *.tmp --exclude checkpoint_*.pt
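The space-splitting rule can be illustrated in plain shell. This sketch mimics the splitting described above (the actual mount-sdshd implementation may differ in detail); globbing is disabled so patterns like *.tmp reach rclone literally:

```shell
# The extra_opts value is split on whitespace only: each word becomes
# one argument. An argument with an internal space cannot survive this.
opts='--exclude *.tmp --exclude checkpoint_*.pt'
set -f          # disable globbing so *.tmp is not expanded by the shell
set -- $opts    # unquoted expansion: split on whitespace
echo "$#"       # 4 arguments
```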

Persistent cache on Weka

With the default /tmp cache, data written to SDS@hd is first held locally on the node's NVMe. If the node crashes, reboots, or is killed (OOM, walltime) before the upload completes, the cached data is lost.

A Weka workspace survives node failures. If a job is killed before uploads finish, the data is still in the cache — remount with the same cache_dir to resume, or copy directly:

# Create a workspace and configure it as the cache directory:
ws_allocate mywsname 30

# Edit ~/.config/sdshd/config, add:
cache_dir  = /path/from/ws_find
cache_size = 2T
cache_age  = 720h

Replace /path/from/ws_find with the output of ws_find mywsname.

After a failed job, recover the cached data:

# The cache subdirectory is named after the compute node hostname.
# Check: ls $(ws_find mywsname)/
rclone copy "$(ws_find mywsname)/rclone-<compute-node-name>/vfs/sdshd-sftp/" \
    sdshd-sftp:recovered/ --progress
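If the job ran on several nodes, each node left its own rclone-<nodename>/ subdirectory. The following dry-run sketch prints one recovery command per node directory (WS stands in for the ws_find output; the mkdir lines are demo scaffolding, and dropping the echo would actually copy):

```shell
WS=${WS:-/tmp/demo-ws}    # in practice: WS=$(ws_find mywsname)

# Demo scaffolding so the loop has something to find (remove in real use):
mkdir -p "$WS/rclone-node0815/vfs/sdshd-sftp" "$WS/rclone-node0816/vfs/sdshd-sftp"

for d in "$WS"/rclone-*/vfs/sdshd-sftp; do
    node=$(basename "${d%/vfs/sdshd-sftp}")   # e.g. rclone-node0815
    echo rclone copy "$d/" "sdshd-sftp:recovered/$node/" --progress
done
```

Recovering into a per-node subdirectory (recovered/$node/) keeps files from different nodes apart in case the same path was cached on more than one node.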

When are cached files removed?

Trigger                                         What happens
rclone running, cache_age exceeded              rclone evicts the file from cache (LRU)
rclone running, cache_size would be exceeded    rclone evicts least-recently-used files
rclone not running (between mounts)             nothing — files are not touched
Workspace expires (ws_extend not run)           restorable for 30 days, then auto-deleted permanently
Manual cleanup                                  remove the rclone-* subdirectory inside the workspace

See Workspaces for workspace management commands.

The same Weka cache_dir also benefits read-heavy jobs such as ML training or multi-epoch analysis: rclone downloads each input file on first access and serves all subsequent reads from Weka without any WAN traffic. Note that the cache is per-node — each node in a job array downloads its own copy independently. For job arrays where all tasks need the same files, use rclone copy to pre-stage once instead — see Shared dataset across many jobs.

Advanced Usage

Read-only access

For browsing or spot reads without writing, bypass the local cache entirely so that every read goes directly to SDS@hd:

mount-sdshd --vfs-cache-mode off --read-only

For repeated reads of the same files (e.g. checking a large dataset), use the default cache with --read-only. The VFS cache still operates — files are downloaded once on first access and served from local disk on all subsequent reads:

mount-sdshd --read-only

Shared dataset across many jobs

If many job array tasks need to read the same files, rclone VFS cannot share a cache — concurrent mounts against the same cache directory will corrupt it, even with --read-only.

The solution: copy the dataset once to a Weka workspace, then have all jobs read from the Weka path directly as a plain filesystem — no FUSE mount, no per-job downloads:

# Step 1 (once, on login node): copy dataset from SDS@hd to Weka
ws_allocate mydata 30
rclone copy sdshd-sftp:inputdata/ $(ws_find mydata)/inputdata/ \
    --progress --transfers 8 --sftp-concurrency 8 --sftp-chunk-size 255k

# Step 2: reference the Weka path in your job script (no --gres=sdshd needed)
INPUT_DIR=$(ws_find mydata)/inputdata

--sftp-chunk-size 255k increases the SFTP packet size from 32 KB (default) to 255 KB (OpenSSH's maximum), which reduces protocol overhead and significantly improves throughput for large files. Use this flag with explicit rclone copy commands — it has no effect on the FUSE mount and should not be placed in extra_opts.

All tasks read from Weka at full speed with zero WAN traffic and zero gateway load. Re-run rclone copy if the source data changes — it skips unchanged files.
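A job array task script for step 2 might look like this sketch (analyze is a hypothetical program, mydata the workspace allocated above; the sbatch line is commented out because it only works on the cluster):

```shell
cat > array_task.sh <<'EOF'
#!/bin/bash
#SBATCH --array=0-99
# Reads directly from Weka: no SDS@hd mount, no --gres=sdshd needed.
INPUT_DIR=$(ws_find mydata)/inputdata
./analyze "$INPUT_DIR/sample_${SLURM_ARRAY_TASK_ID}.dat"   # hypothetical program
EOF

# Submit on a login node:
# sbatch array_task.sh
```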

Known Limitations

Symlinks are not preserved

rclone's SFTP backend does not store symlinks on the remote. When rclone uploads a symlink (through the FUSE mount or via rclone copy), it follows the link and uploads the target's content as a regular file. The symlink itself is lost.

Situation                            What happens
Symlink to a regular file            Target file is uploaded; symlink metadata is lost
Symlink to a directory               Directory tree is traversed; contents are uploaded recursively
Dangling symlink (target missing)    Transfer error or silently skipped
Restore from SDS@hd                  Regular files/directories — original symlinks are gone

Workarounds:

  • Archive before upload: tar -czf env.tar.gz myenv/ — preserves all symlinks inside the archive. Unpack after download to restore the original structure.
  • rclone copy --links: encodes symlinks as .rclonelink text files on the remote. Only works for rclone copy/rclone sync workflows (not the FUSE VFS mount). Requires --links on both the copy and the restore.
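The tar workaround can be verified locally. This sketch (illustrative paths and file names) shows that a symlink survives the archive round trip:

```shell
# Build a small tree containing a relative symlink.
mkdir -p myenv
printf 'payload\n' > myenv/real.txt
ln -s real.txt myenv/link.txt     # the link SFTP upload would flatten

tar -czf env.tar.gz myenv/        # symlink is stored as a symlink
rm -rf myenv
tar -xzf env.tar.gz               # restore the tree from the archive

ls -l myenv/link.txt              # still a symlink pointing at real.txt
[ -L myenv/link.txt ] && echo "symlink preserved"
```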

Troubleshooting

SSH_FX_FAILURE during upload

Most common cause: storage quota exhausted.

SDS@hd enforces a hard quota per project. When full, the server rejects all writes. Check your quota first:

rclone about sdshd-sftp:

If the quota is full, delete or archive old data on SDS@hd before retrying.

Secondary cause: gateway overload from large job arrays.

100 tasks × 16 SFTP connections = 1600 simultaneous connections to lsdf02-sshfs. The gateway may reject new connections under this load. Reduce sftp_concurrency to 8 in ~/.config/sdshd/config, or bypass the FUSE mount entirely for bulk transfers:

rclone copy /local/results/ sdshd-sftp:results/ \
    --transfers 4 --sftp-concurrency 8 --sftp-chunk-size 255k --progress

With the default /tmp cache, resuming requires submitting a new job to the exact same compute node — not generally practical — and the cache is gone after a node reboot regardless. With a Weka workspace as cache_dir, the cache survives both job end and node reboots; the key advantage is that you can copy the data out manually without needing a job on that node at all, as described in Persistent cache on Weka.

Job stuck in CG (COMPLETING) for a long time

The Slurm epilog waits for rclone to finish uploading all pending data. For large output files this can take minutes — this is expected and safe. The job will exit on its own once all uploads are confirmed complete. There is nothing to do.