NEMO2/SDS@hd
WARNING! This feature is currently for testing purposes only!
SDS@hd is the scientific storage service of Heidelberg University. On NEMO2, you can mount your SDS@hd project directly into your jobs or interactive sessions via rclone. Data is cached locally on the node's NVMe drive and uploaded to SDS@hd in the background.
What you need to do:
- Initial Setup — one-time configuration, takes about 5 minutes. You must complete this before using SDS@hd on NEMO2.
- Advanced Configuration — optional. Tune cache size, bandwidth limits, and more. Skip this on your first use.
Initial Setup (Required)
Complete these four steps once. After that, SDS@hd is available in all your jobs and interactive sessions without any further configuration.
Prerequisites:
- Your bwIDM username and SDS@hd password
- Your SDS@hd project ID (e.g. sd14a001, find it at SDS@hd/Access)
Step 1: Create the rclone configuration file
Run this command to create the config file with the correct permissions:
rclone config touch
This creates ~/.config/rclone/rclone.conf with mode 600. Now open the file in an editor and add the following two sections, replacing the placeholder values with your own:
[sdshd-backend]
type = sftp
host = lsdf02-sshfs.urz.uni-heidelberg.de
user = YOUR_BWIDM_USERNAME
pass = PASTE_OBSCURED_PASSWORD_HERE
md5sum_command = none
sha1sum_command = none
shell_type = unix
set_modtime = false
idle_timeout = 0

[sdshd-sftp]
type = alias
remote = sdshd-backend:YOUR_PROJECT_ID
Replace:
- YOUR_BWIDM_USERNAME — your bwIDM login name (same as on NEMO2)
- YOUR_PROJECT_ID — your SDS@hd project ID, e.g. sd14a001
- PASTE_OBSCURED_PASSWORD_HERE — leave this placeholder for now; you will replace it in Step 2
The [sdshd-sftp] section name must not be changed — all NEMO2 scripts mount sdshd-sftp: by name.
What the path in [sdshd-sftp] controls:
The path after the colon in remote = sdshd-backend:YOUR_PROJECT_ID becomes the root of your mount at /mnt/sdshd/$USER. If you set it correctly, your project files appear directly under /mnt/sdshd/$USER/.
If you leave the path empty (remote = sdshd-backend:), the mount shows the SFTP server's home directory. Your project then appears as a subdirectory at /mnt/sdshd/$USER/YOUR_PROJECT_ID/. Job scripts that reference /mnt/sdshd/$USER/ directly will not find your data.
Exception: If you are a member of multiple projects, you may intentionally omit the project ID so all projects appear as subdirectories. In that case, adjust your job scripts to include the project ID in the path.
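As an illustration, using the example project ID sd14a001 from above, the two variants resolve like this:

```ini
# remote = sdshd-backend:sd14a001
#   → project files appear directly under /mnt/sdshd/$USER/
# remote = sdshd-backend:
#   → project appears one level down, at /mnt/sdshd/$USER/sd14a001/
```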
Step 2: Set the password
rclone does not store passwords in plain text — it stores an obscured form. Generate the obscured password with the following command. It reads your password without echoing it to the terminal, so the password never appears in your shell history:
printf 'SDS@hd password:\n'; read -s p && printf '%s' "$p" | rclone obscure -; echo
Type your SDS@hd password and press Enter. The command prints a string like:
QKkf-abc123XYZetc
Copy that string and paste it as the pass = value in ~/.config/rclone/rclone.conf, replacing PASTE_OBSCURED_PASSWORD_HERE. Then clear the variable from memory:
unset p     # bash / zsh
set -e p    # fish
Do not use echo "mypassword" | rclone obscure - — the password would appear in your shell history and in ps output. The read -s method above avoids both.
Step 3: Protect the config file
rclone config touch already creates the file with mode 600. If you created it manually or are unsure, set the permissions explicitly:
chmod 600 ~/.config/rclone/rclone.conf
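If you want to confirm the result, a quick check (assuming GNU coreutils `stat`, as on Linux nodes) looks like this:

```shell
# Create the file if it does not exist yet, then lock down and verify permissions.
mkdir -p ~/.config/rclone
touch ~/.config/rclone/rclone.conf
chmod 600 ~/.config/rclone/rclone.conf
stat -c '%a' ~/.config/rclone/rclone.conf   # prints: 600
```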
Step 4: Verify the connection
rclone ls sdshd-sftp:
Expected output: a list of files and directories in your project. An empty listing (no output, no error) is also fine — it means the project exists but is empty.
| Error message | Likely cause |
|---|---|
| SSH authentication failed | Wrong username or password |
| no such file or directory | Project ID in [sdshd-sftp] is wrong |
| connection refused / timeout | Network issue or wrong hostname |
| Empty listing (no error) | Project exists but is empty — that is fine |
Setup complete. See Basic Usage to start using SDS@hd.
Basic Usage
Interactive sessions on login nodes
mount-sdshd    # mount at /mnt/sdshd/$USER
umount-sdshd   # unmount and wait for all uploads to finish
The mount point /mnt/sdshd/$USER is created automatically when you log in, via a PAM hook. No manual admin step is needed.
Always use umount-sdshd to disconnect — never fusermount -u directly.
fusermount -u terminates rclone immediately, cutting off any uploads in progress and causing data loss. umount-sdshd waits until rclone confirms that all uploads are complete before sending the shutdown signal.
Run mount-sdshd --help for all options and examples.
Slurm jobs
Add the --gres=sdshd flag to your job script. The Slurm prolog mounts SDS@hd automatically at job start; the epilog waits for all uploads to complete before the job finishes:
#SBATCH --gres=sdshd
Your data is available at /mnt/sdshd/$USER inside the job, with no further commands needed.
The job shows CG (COMPLETING) in squeue during the upload phase. For large output files this can take minutes — this is expected and safe. The job will not exit until all data has been confirmed uploaded.
If you need data on SDS@hd before the job finishes (e.g. for a multi-step pipeline where the next job reads this job's output), call umount-sdshd explicitly at the end of your job script:
#!/bin/bash
#SBATCH --gres=sdshd

# ... your computation ...

umount-sdshd   # blocks here until all uploads are done, then the job exits
Advanced Configuration
This section is optional. The built-in defaults work well for most jobs. Come back here if you need to:
- limit cache size or bandwidth
- use a persistent (crash-safe) cache on Weka
- tune performance for ML training or large job arrays
Per-user configuration file
Create ~/.config/sdshd/config to override the built-in defaults for all your jobs:
mkdir -p ~/.config/sdshd
The format is key = value, one per line. Lines starting with # are comments. The file is parsed without being executed — invalid or unknown keys are silently ignored.
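A minimal override file can be created directly from the shell. The values below are illustrative, not recommendations:

```shell
# Write a minimal per-user config; everything not listed keeps its built-in default.
mkdir -p ~/.config/sdshd
cat > ~/.config/sdshd/config <<'EOF'
# Smaller cache for login-node use.
cache_size = 20G
cache_age = 48h
EOF
```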
Configuration keys
| Key | Default | Description |
|---|---|---|
| cache_dir | /tmp/sdshd-<uid>/cache/ | Local VFS cache directory. Node-local NVMe by default (cleaned on reboot). Set to a Weka workspace path for crash-safe persistent caching. See Persistent cache on Weka. |
| cache_mode | full | VFS cache mode: off / minimal / writes / full. full caches entire files for read and write. Required for arbitrary file access patterns. off reads directly from SDS@hd with no local cache. |
| cache_size | 50G | Maximum local cache size. rclone evicts least-recently-used files automatically when the limit is approached. |
| cache_age | 24h | How long to keep cached files before eviction, even if cache_size is not hit. Set to at least your job's wall time when using Weka as cache_dir. |
| sftp_concurrency | 16 | Parallel SFTP requests per connection. Higher values fill the TCP window more aggressively over the 5 ms WAN link. Reduce to 8 for large job arrays (e.g. 100+ tasks) to avoid overloading the gateway. |
| transfers | 8 | Number of parallel background upload workers. |
| buffer_size | 128M | In-memory read/write buffer per file. 128M is well-matched to the Freiburg–Heidelberg RTT. Increase for very large sequential files. |
| read_ahead | 64M | Sequential read prefetch window. Increase to 512M for large sequential reads. |
| write_back | 5s | Delay between file close and upload start. With the default 5s, close() returns immediately and the upload runs in the background — your job continues while data transfers to SDS@hd. Setting this to 0s makes uploads synchronous: your job stalls at every file write for the full upload duration. You almost certainly do not want 0s. |
| dir_cache | 5m | How long to cache directory listings. Use 2h for compute jobs (stable dataset), 5m for interactive use (picks up changes from other nodes quickly). |
| extra_opts | (empty) | Extra rclone flags appended verbatim. Values are split on spaces — arguments containing internal spaces (e.g. bwlimit timetables) cannot be expressed here; use rclone.conf or mount-sdshd command-line options instead. |
Full ~/.config/sdshd/config example:
# ~/.config/sdshd/config

# Local VFS cache directory. Must be an absolute path.
# Leave empty to use /tmp/sdshd-<uid>/cache/ (node-local NVMe, cleaned on reboot).
# Use a Weka workspace path (ws_find <name>) for crash-safe output caching or
# for large datasets that exceed node NVMe capacity.
# A per-hostname subdirectory (rclone-<nodename>/) is always appended so that
# parallel jobs on different nodes each have their own independent cache.
cache_dir =

# VFS cache mode: off | minimal | writes | full (default: full)
# 'full' caches entire files locally for read/write. Required for arbitrary
# file access patterns and write support. 'off' reads directly from SDS@hd
# with no local cache.
cache_mode = full

# Maximum local cache size. rclone evicts LRU files automatically.
cache_size = 50G

# How long to keep cached files before eviction (even if cache_size not hit).
# Set >= max job wall time when using Weka as cache_dir (not cleaned on reboot).
cache_age = 24h

# Number of parallel SFTP requests per connection.
# Higher values fill the TCP window more aggressively over the 5 ms WAN link.
# For job arrays with many tasks, reduce to 8 to avoid overloading lsdf02-sshfs.
sftp_concurrency = 16

# Number of parallel file transfer workers (background uploads).
transfers = 8

# In-memory read/write buffer per file. 128M fills the TCP window well at
# 5 ms RTT (Freiburg → Heidelberg). Increase for very large sequential files.
buffer_size = 128M

# Sequential read prefetch window. Increase to 512M for large sequential reads.
read_ahead = 64M

# Delay between file close and upload start.
# 5s (default): close() returns immediately; the upload runs asynchronously
# in the background. Your job continues while data is transferred to SDS@hd,
# and umount-sdshd (or the Slurm epilog) waits for all uploads to finish
# before disconnecting.
# 0s: upload runs synchronously inside close(), blocking your job for the
# entire upload duration of every file written. For large output files this
# means your job script stalls at each file write — you almost certainly
# do not want this.
write_back = 5s

# How long directory listings are cached. 2h for compute jobs (stable),
# 5m for interactive use (pick up changes from other login nodes quickly).
dir_cache = 5m

# Extra rclone flags appended verbatim after all other options.
# Command-line arguments (mount-sdshd only) still take final precedence.
#
# Values are split on spaces, so arguments that contain an internal space
# (e.g. bwlimit timetables like "08:00,100M 18:00,500M") cannot be expressed
# here. Set those via rclone.conf or pass them directly on the command line
# with mount-sdshd instead.
#
# Useful examples:
#
# Limit upload/download bandwidth (per rclone process, not per file).
# Useful when running multiple jobs simultaneously or to be a good
# neighbour on a shared gateway. Value: bytes/s, suffix M/G, or 'off'.
# Per-direction: --bwlimit-file takes the same syntax.
# extra_opts = --bwlimit 200M
#
# Exclude patterns — keep temp/checkpoint files out of the VFS cache
# so they don't consume cache space or trigger unnecessary uploads.
# Do NOT use shell quotes here — they are passed literally to rclone,
# not interpreted by the shell. Write glob patterns unquoted:
# extra_opts = --exclude *.tmp --exclude checkpoint_*.pt
#
# Override SSH connection timeout (default 60s). Reduce if you want
# faster failure detection on flaky connections:
# extra_opts = --timeout 30s
extra_opts =
Available disk space by node type:
| Node type | Local disk | Recommended max cache_size |
|---|---|---|
| Login nodes | ~400 GB | 20–50 G (shared with other users) |
| Milan nodes | ~1.9 TB | 50 G (default) |
| Other nodes | ~3.8 TB | 50 G (default) |
Useful extra_opts examples:
# Limit upload/download bandwidth (per rclone process).
# Useful when running many jobs simultaneously or to be a good neighbour.
extra_opts = --bwlimit 200M

# Exclude temp and checkpoint files from the cache — saves cache space
# and avoids uploading files you don't need on SDS@hd.
# Do NOT use shell quotes here — write glob patterns unquoted.
extra_opts = --exclude *.tmp --exclude checkpoint_*.pt
Persistent cache on Weka
With the default /tmp cache, data written to SDS@hd is first held locally on the node's NVMe. If the node crashes, reboots, or is killed (OOM, walltime) before the upload completes, the cached data is lost.
A Weka workspace survives node failures. If a job is killed before uploads finish, the data is still in the cache — remount with the same cache_dir to resume, or copy directly:
# Create a workspace and configure it as the cache directory:
ws_allocate mywsname 30

# Edit ~/.config/sdshd/config, add:
cache_dir = /path/from/ws_find
cache_size = 2T
cache_age = 720h
Replace /path/from/ws_find with the output of ws_find mywsname.
After a failed job, recover the cached data:
# The cache subdirectory is named after the compute node hostname.
# Check: ls $(ws_find mywsname)/
rclone copy "$(ws_find mywsname)/rclone-<compute-node-name>/vfs/sdshd-sftp/" \
sdshd-sftp:recovered/ --progress
When are cached files removed?
| Trigger | What happens |
|---|---|
| rclone running, cache_age exceeded | rclone evicts the file from cache (LRU) |
| rclone running, cache_size would be exceeded | rclone evicts least-recently-used files |
| rclone not running (between mounts) | nothing — files are not touched |
| Workspace expires (ws_extend not run) | restorable for 30 days, then auto-deleted permanently |
| Manual cleanup | remove the rclone-* subdirectory inside the workspace |
See Workspaces for workspace management commands.
Advanced Usage
Read-only access
For browsing or spot reads without writing — bypasses the local cache entirely, every read goes directly to SDS@hd:
mount-sdshd --vfs-cache-mode off --read-only
For repeated reads of the same files (e.g. checking a large dataset), use the default cache with --read-only. The VFS cache still operates — files are downloaded once on first access and served from local disk on all subsequent reads:
mount-sdshd --read-only
ML training and read-intensive jobs
If your job reads the same files many times (training epochs, iterative algorithms, multi-pass analysis), set a Weka workspace as cache_dir. rclone downloads each file on first access; every subsequent read is served from Weka without any WAN traffic:
ws_find mywsname # e.g. /work/classic/fr_abc123456-mywsname
# In ~/.config/sdshd/config:
cache_dir = /work/classic/fr_abc123456-mywsname
cache_size = 2T
cache_age = 120h
The first epoch downloads from SDS@hd. Every subsequent epoch reads from Weka — no WAN traffic, no gateway load.
Note: The cache is per-node. Each node in a job array downloads its own copy. For job arrays where all tasks need the same files, use rclone copy to pre-stage once instead (see below).
If many job array tasks need to read the same files, rclone VFS cannot share a cache — concurrent mounts against the same cache directory will corrupt it, even with --read-only.
The solution: copy the dataset once to a Weka workspace, then have all jobs read from the Weka path directly as a plain filesystem — no FUSE mount, no per-job downloads:
# Step 1 (once, on login node): copy dataset from SDS@hd to Weka
ws_allocate mydata 30
rclone copy sdshd-sftp:inputdata/ $(ws_find mydata)/inputdata/ \
--progress --transfers 8 --sftp-concurrency 8
# Step 2: reference the Weka path in your job script (no --gres=sdshd needed)
INPUT_DIR=$(ws_find mydata)/inputdata
All tasks read from Weka at full speed with zero WAN traffic and zero gateway load. Re-run rclone copy if the source data changes — it skips unchanged files.
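A job-array script consuming the pre-staged workspace might look like the following sketch. The program name analyze.sh and the per-task file layout are made up for illustration:

```shell
#!/bin/bash
#SBATCH --array=0-99
# No --gres=sdshd needed: data is read from Weka, not through the FUSE mount.

INPUT_DIR=$(ws_find mydata)/inputdata

# Each array task processes one input file from the shared Weka dataset.
./analyze.sh "$INPUT_DIR/sample_${SLURM_ARRAY_TASK_ID}.dat"
```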
Known Limitations
Symlinks are not preserved
rclone's SFTP backend does not store symlinks on the remote. When rclone uploads a symlink (through the FUSE mount or via rclone copy), it follows the link and uploads the target's content as a regular file. The symlink itself is lost.
| Situation | What happens |
|---|---|
| Symlink to a regular file | Target file is uploaded; symlink metadata is lost |
| Symlink to a directory | Directory tree is traversed; contents are uploaded recursively |
| Dangling symlink (target missing) | Transfer error or silently skipped |
| Restore from SDS@hd | Regular files/directories — original symlinks are gone |
Workarounds:
- Archive before upload: tar -czf env.tar.gz myenv/ — preserves all symlinks inside the archive. Unpack after download to restore the original structure.
- rclone copy --links: encodes symlinks as .rclonelink text files on the remote. Only works for rclone copy/rclone sync workflows (not the FUSE VFS mount). Requires --links on both the copy and the restore.
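The tar workaround can be verified locally, since symlinks survive the archive round trip:

```shell
# Build a tiny directory with a symlink, archive it, unpack it elsewhere.
mkdir -p demo/myenv restore
echo data > demo/myenv/real.txt
ln -s real.txt demo/myenv/link.txt

tar -czf env.tar.gz -C demo myenv
tar -xzf env.tar.gz -C restore

# The unpacked link.txt is still a symlink pointing at real.txt:
test -L restore/myenv/link.txt && readlink restore/myenv/link.txt   # prints: real.txt
```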
Troubleshooting
SSH_FX_FAILURE during upload
Most common cause: storage quota exhausted.
SDS@hd enforces a hard quota per project. When full, the server rejects all writes. Check your quota first:
rclone about sdshd-sftp:
If the quota is full, delete or archive old data on SDS@hd before retrying.
Secondary cause: gateway overload from large job arrays.
100 tasks × 16 SFTP connections = 1600 simultaneous connections to lsdf02-sshfs. The gateway may reject new connections under this load. Reduce sftp_concurrency to 8 in ~/.config/sdshd/config, or bypass the FUSE mount entirely for bulk transfers:
rclone copy /local/results/ sdshd-sftp:results/ \
--transfers 4 --sftp-concurrency 8 --progress
Cached data left behind after a failed job
With the default /tmp cache, resuming requires submitting a new job to the exact same compute node — not generally practical — and the cache is gone after a node reboot regardless. With a Weka workspace as cache_dir, the cache survives both job end and node reboots; the key advantage is that you can copy the data out manually without needing a job on that node at all, as described in Persistent cache on Weka.
Job stuck in CG (COMPLETING) for a long time
The Slurm epilog waits for rclone to finish uploading all pending data. For large output files this can take minutes — this is expected and safe. The job will exit on its own once all uploads are confirmed complete. There is nothing to do.