2026-04-22T15:54:24Z

M Janczyk: /* Persistent cache on Weka */

<div style="border: 3px solid #dc3545; padding: 15px; background-color: #f8d7da; margin: 10px 0;">
'''WARNING! This feature is currently for testing purposes only!'''
</div>

'''SDS@hd''' is the scientific storage service of Heidelberg University. On NEMO2, you can mount your SDS@hd project directly into your jobs or interactive sessions via rclone. Data is cached locally on the node's NVMe drive and uploaded to SDS@hd in the background.

<div style="border: 3px solid #28a745; padding: 15px; background-color: #d4edda; margin: 10px 0;">
'''What you need to do:'''

* '''[[#Initial_Setup_.28Required.29|Initial Setup]]''' — one-time configuration, takes about 5 minutes. '''You must complete this before using SDS@hd on NEMO2.'''
* '''[[#Advanced_Configuration|Advanced Configuration]]''' — optional. Tune cache size, bandwidth limits, and more. Skip this on your first use.
</div>

= Initial Setup (Required) =

Complete these four steps once. After that, SDS@hd is available in all your jobs and interactive sessions without any further configuration.

'''Prerequisites:'''
* Your bwIDM username and SDS@hd password
* Your SDS@hd project ID (e.g. <tt>sd14a001</tt>, find it at [[SDS@hd/Access]])

== Step 1: Create the rclone configuration file ==

Run this command to create the config file with the correct permissions:

<pre>
rclone config touch
</pre>

This creates <tt>~/.config/rclone/rclone.conf</tt> with mode 600. Now open the file in an editor and add the following two sections, replacing the placeholder values with your own:

<pre>
[sdshd-backend]
type = sftp
host = lsdf02-sshfs.urz.uni-heidelberg.de
user = YOUR_BWIDM_USERNAME
pass = PASTE_OBSCURED_PASSWORD_HERE
md5sum_command = none
sha1sum_command = none
shell_type = unix
set_modtime = false
idle_timeout = 0

[sdshd-sftp]
type = alias
remote = sdshd-backend:YOUR_PROJECT_ID
</pre>

Replace:
* <tt>YOUR_BWIDM_USERNAME</tt> — your bwIDM login name (same as on NEMO2)
* <tt>YOUR_PROJECT_ID</tt> — your SDS@hd project ID, e.g. <tt>sd14a001</tt>
* <tt>PASTE_OBSCURED_PASSWORD_HERE</tt> — leave this for now, you will fill it in Step 2

'''The <tt>[sdshd-sftp]</tt> section name must not be changed''' — all NEMO2 scripts mount <tt>sdshd-sftp:</tt> by name.

<div style="border: 3px solid #17a2b8; padding: 15px; background-color: #d1ecf1; margin: 10px 0;">
'''What the path in <tt>[sdshd-sftp]</tt> controls:'''

The path after the colon in <tt>remote = sdshd-backend:YOUR_PROJECT_ID</tt> becomes the root of your mount at <tt>/mnt/sdshd/$USER</tt>. If you set it correctly, your project files appear directly under <tt>/mnt/sdshd/$USER/</tt>.

If you leave the path empty (<tt>remote = sdshd-backend:</tt>), the mount shows the SFTP server's home directory. Your project then appears as a ''subdirectory'' at <tt>/mnt/sdshd/$USER/YOUR_PROJECT_ID/</tt>. Job scripts that reference <tt>/mnt/sdshd/$USER/</tt> directly will not find your data.

'''Exception:''' If you are a member of multiple projects, you may intentionally omit the project ID so all projects appear as subdirectories. In that case, adjust your job scripts to include the project ID in the path.
</div>

== Step 2: Set the password ==

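rclone does not store passwords in plain text; it stores an obscured form. Obscuring is reversible and is not encryption, so the config file must stay private (see Step 3). Generate the obscured password with the following command. It reads your password without echoing it to the terminal, and the password itself never appears on the command line or in your shell history: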

<pre>
printf 'SDS@hd password:\n'; read -s p && printf '%s' "$p" | rclone obscure -; echo
</pre>

Type your SDS@hd password and press Enter. The command prints a string like:

<pre>
QKkf-abc123XYZetc
</pre>

Copy that string and paste it as the <tt>pass =</tt> value in <tt>~/.config/rclone/rclone.conf</tt>, replacing <tt>PASTE_OBSCURED_PASSWORD_HERE</tt>. Then clear the variable from memory:

<pre>
unset p # bash / zsh
# set -e p # fish
</pre>

<div style="border: 3px solid #dc3545; padding: 15px; background-color: #f8d7da; margin: 10px 0;">
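'''Do not use <tt>echo "mypassword" | rclone obscure -</tt>''' or <tt>rclone obscure mypassword</tt>. Typing the password on the command line stores it in your shell history, and passing it as an argument to an external program also exposes it in <tt>ps</tt> output. The <tt>read -s</tt> method above avoids both.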
</div>

== Step 3: Protect the config file ==

<tt>rclone config touch</tt> already creates the file with mode 600. If you created it manually or are unsure, set the permissions explicitly:

<pre>
chmod 600 ~/.config/rclone/rclone.conf
</pre>
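Before moving on to Step 4, the result of Steps 1 to 3 can be checked with a small helper. This is a minimal sketch: <tt>check_rclone_conf</tt> is a hypothetical function name, not a NEMO2 command, and it assumes GNU <tt>stat</tt> as found on Linux. It only inspects the file and never prints the password.

```shell
# check_rclone_conf [FILE]
# Hypothetical preflight check for the rclone config from Steps 1-3.
# Prints "ok" and returns 0 if the file looks complete, otherwise prints
# the first problem found and returns 1.
check_rclone_conf() (
    conf=${1:-$HOME/.config/rclone/rclone.conf}

    [ -f "$conf" ] || { echo "missing: $conf"; exit 1; }

    # stat -c %a prints the octal permission bits (GNU coreutils).
    mode=$(stat -c %a "$conf")
    [ "$mode" = "600" ] || { echo "mode is $mode, expected 600"; exit 1; }

    # Both sections from Step 1 must be present.
    for section in sdshd-backend sdshd-sftp; do
        grep -q "^\[$section]" "$conf" || { echo "missing section [$section]"; exit 1; }
    done

    # No placeholder from the template may remain.
    if grep -Eq 'YOUR_BWIDM_USERNAME|YOUR_PROJECT_ID|PASTE_OBSCURED_PASSWORD_HERE' "$conf"; then
        echo "placeholder values still present"
        exit 1
    fi

    echo "ok"
)
```

Calling <tt>check_rclone_conf</tt> without arguments checks <tt>~/.config/rclone/rclone.conf</tt>. It does not contact SDS@hd; the connection itself is verified in Step 4.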

== Step 4: Verify the connection ==

<pre>
rclone ls sdshd-sftp:
</pre>

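Expected output: a recursive list of all files in your project, each with its size. For large projects this can take a while; <tt>rclone lsf sdshd-sftp:</tt> lists only the top-level files and directories. An empty listing (no output, no error) is also fine: it means the project exists but contains no files.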

{| class="wikitable"
|-
! Error message
! Likely cause
|-
| <tt>SSH authentication failed</tt>
| Wrong username or password
|-
| <tt>no such file or directory</tt>
| Project ID in <tt>[sdshd-sftp]</tt> is wrong
|-
| <tt>connection refused</tt> / timeout
| Network issue or wrong hostname
|-
| Empty listing (no error)
| Project exists but is empty — that is fine
|}

'''Setup complete.''' See [[#Basic_Usage|Basic Usage]] to start using SDS@hd.

= Basic Usage =

== Interactive sessions on login nodes ==

<pre>
mount-sdshd # mount at /mnt/sdshd/$USER
umount-sdshd # unmount and wait for all uploads to finish
</pre>

The mount point <tt>/mnt/sdshd/$USER</tt> is created automatically when you log in, via a PAM hook. No manual admin step is needed.

<div style="border: 3px solid #dc3545; padding: 15px; background-color: #f8d7da; margin: 10px 0;">
'''Always use <tt>umount-sdshd</tt> to disconnect — never <tt>fusermount -u</tt> directly.'''

<tt>fusermount -u</tt> terminates rclone immediately. Uploads in progress are cut off, and data that has not yet reached SDS@hd can be lost. <tt>umount-sdshd</tt> waits until rclone confirms that all uploads are complete before sending the shutdown signal.
</div>

Run <tt>mount-sdshd --help</tt> for all options and examples.

== Slurm jobs ==

Add the <tt>--gres=sdshd</tt> flag to your job script. The Slurm prolog mounts SDS@hd automatically at job start; the epilog waits for all uploads to complete before the job finishes:

<pre>
#SBATCH --gres=sdshd
</pre>

Your data is available at <tt>/mnt/sdshd/$USER</tt> inside the job, with no further commands needed.
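A job script can guard against a missing mount before it starts computing. The following is a minimal sketch, assuming a Linux node where the mount table is readable at <tt>/proc/self/mounts</tt>; <tt>is_mounted</tt> is a hypothetical helper, not part of the NEMO2 tooling.

```shell
# is_mounted DIR [MOUNTS_FILE]
# Succeeds if DIR is listed as a mount point in MOUNTS_FILE
# (default: /proc/self/mounts). Hypothetical helper for illustration.
is_mounted() (
    dir=$1
    mounts=${2:-/proc/self/mounts}
    # Field 2 of each line in the mount table is the mount point.
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"
)

# Possible use near the top of a job script submitted with --gres=sdshd:
#   is_mounted "/mnt/sdshd/$USER" || { echo "SDS@hd is not mounted" >&2; exit 1; }
```

Failing early this way avoids writing results into an empty local directory when the mount did not come up.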

'''The job shows <tt>CG</tt> (COMPLETING) in <tt>squeue</tt> during the upload phase.''' For large output files this can take minutes — this is expected and safe. The job will not exit until all data has been confirmed uploaded.

'''If you need data on SDS@hd before the job finishes''' (e.g. for a multi-step pipeline where the next job reads this job's output), call <tt>umount-sdshd</tt> explicitly at the end of your job script:

<pre>
#!/bin/bash
#SBATCH --gres=sdshd

# ... your computation ...

umount-sdshd # blocks here until all uploads are done, then the job exits
</pre>

= Advanced Configuration =

<div style="border: 3px solid #ffc107; padding: 15px; background-color: #fff3cd; margin: 10px 0;">
'''This section is optional.''' The built-in defaults work well for most jobs. Come back here if you need to:
* limit cache size or bandwidth
* use a persistent (crash-safe) cache on Weka
* tune performance for ML training or large job arrays
</div>

== Per-user configuration file ==

Create <tt>~/.config/sdshd/config</tt> to override the built-in defaults for all your jobs:

<pre>
mkdir -p ~/.config/sdshd
</pre>

The format is <tt>key = value</tt>, one per line. Lines starting with <tt>#</tt> are comments. The file is parsed without being executed — invalid or unknown keys are silently ignored.
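To illustrate the format, here is a minimal sketch of how such a file can be read as plain text without executing it. <tt>sdshd_config_get</tt> is a hypothetical function and not the parser used by the NEMO2 scripts; in particular, letting the last occurrence of a key win is an assumption.

```shell
# sdshd_config_get FILE KEY
# Prints the value of KEY from a "key = value" file, or nothing if the key
# is absent. The file is only read as text; it is never sourced or executed.
sdshd_config_get() (
    file=$1
    key=$2
    awk -v k="$key" '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            pos = index($0, "=")
            if (pos == 0) next               # not a key = value line
            name  = substr($0, 1, pos - 1)
            value = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim surrounding blanks
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == k) result = value    # last occurrence wins (assumption)
        }
        END { if (result != "") print result }
    ' "$file"
)
```

For example, <tt>sdshd_config_get ~/.config/sdshd/config cache_size</tt> prints the configured cache size, or nothing if the key is not set.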

=== Configuration keys ===

{| class="wikitable"
|-
! Key
! Default
! Description
|-
| <tt>cache_dir</tt>
| <tt>/tmp/sdshd-<uid>/cache/</tt>
| Local VFS cache directory. Node-local NVMe by default (cleaned on reboot). Set to a Weka workspace path for crash-safe persistent caching. See [[#Persistent_cache_on_Weka|Persistent cache on Weka]].
|-
| <tt>cache_mode</tt>
| <tt>full</tt>
| VFS cache mode: <tt>off</tt> / <tt>minimal</tt> / <tt>writes</tt> / <tt>full</tt>. <tt>full</tt> caches entire files for read and write. Required for arbitrary file access patterns. <tt>off</tt> reads directly from SDS@hd with no local cache.
|-
| <tt>cache_size</tt>
| <tt>50G</tt>
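| Maximum local cache size. rclone evicts least-recently-used files automatically once the limit is exceeded. Files that are open or still waiting to be uploaded are not evicted, so the cache can temporarily grow beyond this value.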
|-
| <tt>cache_age</tt>
| <tt>24h</tt>
| How long to keep cached files before eviction, even if <tt>cache_size</tt> is not hit. Set to at least your job's wall time when using Weka as <tt>cache_dir</tt>.
|-
| <tt>sftp_concurrency</tt>
| <tt>16</tt>
| Parallel SFTP requests per connection. Higher values fill the TCP window more aggressively over the 5 ms WAN link. Reduce to <tt>8</tt> for large job arrays (e.g. 100+ tasks) to avoid overloading the gateway.
|-
| <tt>transfers</tt>
| <tt>8</tt>
| Number of parallel background upload workers.
|-
| <tt>buffer_size</tt>
| <tt>128M</tt>
| In-memory read/write buffer per file. 128M is well-matched to the Freiburg–Heidelberg RTT. Increase for very large sequential files.
|-
| <tt>read_ahead</tt>
| <tt>64M</tt>
| Sequential read prefetch window. Increase to <tt>512M</tt> for large sequential reads.
|-
| <tt>write_back</tt>
| <tt>5s</tt>
| Delay between file close and upload start. With the default <tt>5s</tt>, <tt>close()</tt> returns immediately and the upload runs in the background — your job continues while data transfers to SDS@hd. Setting this to <tt>0s</tt> makes uploads synchronous: your job stalls at every file write for the full upload duration. You almost certainly do not want <tt>0s</tt>.
|-
| <tt>dir_cache</tt>
| <tt>5m</tt>
| How long to cache directory listings. Use <tt>2h</tt> for compute jobs (stable dataset), <tt>5m</tt> for interactive use (picks up changes from other nodes quickly).
|-
| <tt>extra_opts</tt>
| (empty)
| Extra rclone flags appended verbatim. Values are split on spaces — arguments containing internal spaces (e.g. bwlimit timetables) cannot be expressed here; use <tt>rclone.conf</tt> or <tt>mount-sdshd</tt> command-line options instead.
|}

<div class="mw-collapsible mw-collapsed">
Expand for full <tt>~/.config/sdshd/config</tt> example.
<div class="mw-collapsible-content">
<pre>
# ~/.config/sdshd/config

# Local VFS cache directory. Must be an absolute path.
# Leave empty to use /tmp/sdshd-<uid>/cache/ (node-local NVMe, cleaned on reboot).
# Use a Weka workspace path (ws_find <name>) for crash-safe output caching or
# for large datasets that exceed node NVMe capacity.
# A per-hostname subdirectory (rclone-<nodename>/) is always appended so that
# parallel jobs on different nodes each have their own independent cache.
cache_dir =

# VFS cache mode: off | minimal | writes | full (default: full)
# 'full' caches entire files locally for read/write. Required for arbitrary
# file access patterns and write support. 'off' reads directly from SDS@hd
# with no local cache.
cache_mode = full

# Maximum local cache size. rclone evicts LRU files automatically.
cache_size = 50G

# How long to keep cached files before eviction (even if cache_size not hit).
# Set >= max job wall time when using Weka as cache_dir (not cleaned on reboot).
cache_age = 24h

# Number of parallel SFTP requests per connection.
# Higher values fill the TCP window more aggressively over the 5 ms WAN link.
# For job arrays with many tasks, reduce to 8 to avoid overloading lsdf02-sshfs.
sftp_concurrency = 16

# Number of parallel file transfer workers (background uploads).
transfers = 8

# In-memory read/write buffer per file. 128M fills the TCP window well at
# 5 ms RTT (Freiburg → Heidelberg). Increase for very large sequential files.
buffer_size = 128M

# Sequential read prefetch window. Increase to 512M for large sequential reads.
read_ahead = 64M

# Delay between file close and upload start.
# 5s (default): close() returns immediately; the upload runs asynchronously
# in the background. Your job continues while data is transferred to SDS@hd,
# and umount-sdshd (or the Slurm epilog) waits for all uploads to finish
# before disconnecting.
# 0s: upload runs synchronously inside close(), blocking your job for the
# entire upload duration of every file written. For large output files this
# means your job script stalls at each file write — you almost certainly
# do not want this.
write_back = 5s

# How long directory listings are cached. 2h for compute jobs (stable),
# 5m for interactive use (pick up changes from other login nodes quickly).
dir_cache = 5m

# Extra rclone flags appended verbatim after all other options.
# Command-line arguments (mount-sdshd only) still take final precedence.
#
# Values are split on spaces, so arguments that contain an internal space
# (e.g. bwlimit timetables like "08:00,100M 18:00,500M") cannot be expressed
# here. Set those via rclone.conf or pass them directly on the command line
# with mount-sdshd instead.
#
# Useful examples:
#
# Limit upload/download bandwidth (per rclone process, not per file).
# Useful when running multiple jobs simultaneously or to be a good
# neighbour on a shared gateway. Value: bytes/s suffix M/G, or 'off'.
# Separate upload and download limits use UP:DOWN, e.g. --bwlimit 200M:off.
# --bwlimit-file takes the same syntax but applies per file.
# extra_opts = --bwlimit 200M
#
# Exclude patterns — keep temp/checkpoint files out of the VFS cache
# so they don't consume cache space or trigger unnecessary uploads.
# Do NOT use shell quotes here — they are passed literally to rclone,
# not interpreted by the shell. Write glob patterns unquoted:
# extra_opts = --exclude *.tmp --exclude checkpoint_*.pt
#
# Override the connection timeout (rclone default 1m). Reduce if you want
# faster failure detection on flaky connections. Note that --timeout is a
# different setting: rclone's I/O idle timeout (default 5m).
# extra_opts = --contimeout 30s
extra_opts =
</pre>
</div>
</div>
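The configuration keys correspond closely to standard rclone flags. The sketch below shows one plausible correspondence. The flag names are real rclone options, but the mapping itself is an assumption inferred from the key names; the translation actually performed by <tt>mount-sdshd</tt> may differ. <tt>sdshd_key_to_flag</tt> is a hypothetical name.

```shell
# sdshd_key_to_flag KEY VALUE
# Prints the rclone flag that KEY plausibly corresponds to, followed by VALUE.
# Unknown keys print nothing, mirroring the "silently ignored" rule.
# The mapping is an assumption based on the key names.
sdshd_key_to_flag() (
    key=$1
    value=$2
    case $key in
        cache_dir)        flag=--cache-dir ;;
        cache_mode)       flag=--vfs-cache-mode ;;
        cache_size)       flag=--vfs-cache-max-size ;;
        cache_age)        flag=--vfs-cache-max-age ;;
        sftp_concurrency) flag=--sftp-concurrency ;;
        transfers)        flag=--transfers ;;
        buffer_size)      flag=--buffer-size ;;
        read_ahead)       flag=--vfs-read-ahead ;;
        write_back)       flag=--vfs-write-back ;;
        dir_cache)        flag=--dir-cache-time ;;
        *)                exit 0 ;;          # unknown key: ignore
    esac
    printf '%s %s\n' "$flag" "$value"
)
```

Knowing the underlying flag makes it easier to look up the exact semantics of a key in the rclone documentation.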

'''Available disk space by node type:'''

{| class="wikitable"
|-
! Node type
! Local disk
! Recommended max <tt>cache_size</tt>
|-
| Login nodes
| ~400 GB
| 20–50 G (shared with other users)
|-
| Milan nodes
| ~1.9 TB
| 50 G (default)
|-
| Other nodes
| ~3.8 TB
| 50 G (default)
|}

'''Useful <tt>extra_opts</tt> examples:'''

<pre>
# Limit upload/download bandwidth (per rclone process).
# Useful when running many jobs simultaneously or to be a good neighbour.
extra_opts = --bwlimit 200M

# Exclude temp and checkpoint files from the cache — saves cache space
# and avoids uploading files you don't need on SDS@hd.
# Do NOT use shell quotes here — write glob patterns unquoted.
extra_opts = --exclude *.tmp --exclude checkpoint_*.pt
</pre>
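The reason quotes must be avoided becomes clear when the splitting is reproduced by hand. The following sketch assumes that <tt>extra_opts</tt> is split on spaces only, with no quote handling and no glob expansion, as described above; <tt>split_on_spaces</tt> is a hypothetical helper and not the code used by <tt>mount-sdshd</tt>.

```shell
# split_on_spaces VALUE
# Prints each resulting argument on its own line, wrapped in brackets.
# Splits on spaces only: quotes are not interpreted, globs are not expanded.
split_on_spaces() (
    set -f                 # disable glob expansion
    IFS=' '
    set -- $1              # word-split the value into positional parameters
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
)
```

With this splitting, <tt>split_on_spaces "--exclude '*.tmp'"</tt> prints <tt>[--exclude]</tt> and <tt>['*.tmp']</tt>: the quotes become part of the pattern, so it no longer matches <tt>.tmp</tt> files. A bwlimit timetable such as <tt>"08:00,100M 18:00,500M"</tt> is torn into two separate arguments for the same reason.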

== Persistent cache on Weka ==

With the default <tt>/tmp</tt> cache, data written to SDS@hd is first held locally on the node's NVMe. If the node crashes or reboots, or the job is killed (OOM, walltime), before the upload completes, the data that has not yet been uploaded is lost.

A Weka workspace survives node failures. If a job is killed before uploads finish, the data is still in the cache. You can then remount with the same <tt>cache_dir</tt> to resume the uploads, or copy the data to SDS@hd manually (see the recovery command further down). To set this up:

<pre>
# Create a workspace and configure it as the cache directory:
ws_allocate mywsname 30

# Edit ~/.config/sdshd/config, add:
cache_dir = /path/from/ws_find
cache_size = 2T
cache_age = 720h
</pre>

Replace <tt>/path/from/ws_find</tt> with the output of <tt>ws_find mywsname</tt>.

After a failed job, recover the cached data:

<pre>
# The cache subdirectory is named after the compute node hostname.
# Check: ls $(ws_find mywsname)/
rclone copy "$(ws_find mywsname)/rclone-<compute-node-name>/vfs/sdshd-sftp/" \
sdshd-sftp:recovered/ --progress
</pre>
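If several nodes have used the same workspace, there is one <tt>rclone-*</tt> subdirectory per node. The sketch below prints one recovery command per node cache without running anything, so the commands can be reviewed first. <tt>print_recovery_commands</tt> is a hypothetical helper, and writing each node's data to its own <tt>recovered/&lt;node&gt;/</tt> subdirectory is an assumption made here to keep files from different nodes apart.

```shell
# print_recovery_commands WORKSPACE_DIR
# Prints (does not run) one rclone copy command per node cache found under
# WORKSPACE_DIR/rclone-<node>/vfs/sdshd-sftp.
print_recovery_commands() (
    ws=$1
    for cache in "$ws"/rclone-*/vfs/sdshd-sftp; do
        [ -d "$cache" ] || continue          # glob matched nothing
        node=${cache#"$ws"/rclone-}          # strip the leading path
        node=${node%%/*}                     # keep only the node name
        printf 'rclone copy "%s/" sdshd-sftp:recovered/%s/ --progress\n' "$cache" "$node"
    done
)
```

A possible call is <tt>print_recovery_commands "$(ws_find mywsname)"</tt>. Review the printed commands, then run the ones you need.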

'''When are cached files removed?'''

{| class="wikitable"
|-
! Trigger
! What happens
|-
| rclone running, <tt>cache_age</tt> exceeded
| rclone evicts files that have not been accessed for longer than <tt>cache_age</tt>
|-
| rclone running, <tt>cache_size</tt> would be exceeded
| rclone evicts least-recently-used files
|-
| rclone not running (between mounts)
| nothing — files are not touched
|-
| Workspace expires (<tt>ws_extend</tt> not run)
| restorable for 30 days, then auto-deleted permanently
|-
| Manual cleanup
| remove the <tt>rclone-*</tt> subdirectory inside the workspace
|}

See [[Workspaces]] for workspace management commands.

The same Weka <tt>cache_dir</tt> also benefits read-heavy jobs such as ML training or multi-epoch analysis: rclone downloads each input file on first access and serves all subsequent reads from Weka without any WAN traffic. Note that the cache is per-node — each node in a job array downloads its own copy independently. For job arrays where all tasks need the same files, use <tt>rclone copy</tt> to pre-stage once instead — see [[#Shared_dataset_across_many_jobs|Shared dataset across many jobs]].

= Advanced Usage =

== Read-only access ==

For browsing or spot reads, mount read-only with the cache switched off. This bypasses the local cache entirely, so every read goes directly to SDS@hd:

<pre>
mount-sdshd --vfs-cache-mode off --read-only
</pre>

For '''repeated reads of the same files''' (e.g. checking a large dataset), use the default cache with <tt>--read-only</tt>. The VFS cache still operates — files are downloaded once on first access and served from local disk on all subsequent reads:

<pre>
mount-sdshd --read-only
</pre>

== ML training and read-intensive jobs ==

If your job reads the same files many times (training epochs, iterative algorithms, multi-pass analysis), set a Weka workspace as <tt>cache_dir</tt>. rclone downloads each file on first access; every subsequent read is served from Weka without any WAN traffic:

<pre>
ws_find mywsname # e.g. /work/classic/fr_abc123456-mywsname
</pre>

<pre>
# In ~/.config/sdshd/config:
cache_dir = /work/classic/fr_abc123456-mywsname
cache_size = 2T
cache_age = 120h
</pre>

The first epoch downloads from SDS@hd. Every subsequent epoch reads from Weka — no WAN traffic, no gateway load.

'''Note:''' The cache is per-node. Each node in a job array downloads its own copy. For job arrays where all tasks need the same files, use <tt>rclone copy</tt> to pre-stage once instead (see below).

== Shared dataset across many jobs ==

If many job array tasks need to read the same files, rclone VFS cannot share a cache — concurrent mounts against the same cache directory will corrupt it, even with <tt>--read-only</tt>.

The solution: copy the dataset once to a Weka workspace, then have all jobs read from the Weka path directly as a plain filesystem — no FUSE mount, no per-job downloads:

<pre>
# Step 1 (once, on login node): copy dataset from SDS@hd to Weka
ws_allocate mydata 30
rclone copy sdshd-sftp:inputdata/ "$(ws_find mydata)/inputdata/" \
--progress --transfers 8 --sftp-concurrency 8 --sftp-chunk-size 255k

# Step 2: reference the Weka path in your job script (no --gres=sdshd needed)
INPUT_DIR="$(ws_find mydata)/inputdata"
</pre>

<tt>--sftp-chunk-size 255k</tt> increases the SFTP packet size from 32 KB (default) to 255 KB (OpenSSH's maximum), which reduces protocol overhead and significantly improves throughput for large files. Use this flag with explicit <tt>rclone copy</tt> commands — it has no effect on the FUSE mount and should not be placed in <tt>extra_opts</tt>.

All tasks read from Weka at full speed with zero WAN traffic and zero gateway load. Re-run <tt>rclone copy</tt> if the source data changes — it skips unchanged files.

= Known Limitations =

== Symlinks are not preserved ==

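rclone's SFTP backend does not store symlinks on the remote. By default, <tt>rclone copy</tt> skips symlinks in a local source; with <tt>--copy-links</tt> (<tt>-L</tt>) it follows each link and uploads the target's content as a regular file. Data written through the FUSE mount is likewise stored without symlink information. In all cases the symlink itself is lost. The table below describes what happens when links are followed.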

{| class="wikitable"
|-
! Situation
! What happens
|-
| Symlink to a regular file
| Target file is uploaded; symlink metadata is lost
|-
| Symlink to a directory
| Directory tree is traversed; contents are uploaded recursively
|-
| Dangling symlink (target missing)
| Transfer error or silently skipped
|-
| Restore from SDS@hd
| Regular files/directories — original symlinks are gone
|}

'''Workarounds:'''

* '''Archive before upload:''' <tt>tar -czf env.tar.gz myenv/</tt> — preserves all symlinks inside the archive. Unpack after download to restore the original structure.
* '''<tt>rclone copy --links</tt>:''' encodes symlinks as <tt>.rclonelink</tt> text files on the remote. Only works for <tt>rclone copy</tt>/<tt>rclone sync</tt> workflows (not the FUSE VFS mount). Requires <tt>--links</tt> on both the copy and the restore.
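The archive workaround can be sketched as a pair of small helpers. The function names <tt>pack_tree</tt> and <tt>unpack_tree</tt> are hypothetical; the <tt>tar</tt> options are standard. <tt>tar</tt> stores symlinks as symlinks by default, so they come back intact when the archive is unpacked.

```shell
# pack_tree DIR ARCHIVE
# Archives DIR into ARCHIVE; symlinks are stored as symlinks, not followed.
pack_tree() (
    dir=$1
    archive=$2
    # -C switches to the parent directory so the archive holds relative paths.
    tar -czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")"
)

# unpack_tree ARCHIVE DEST
# Restores the archived tree, including its symlinks, below DEST.
unpack_tree() (
    archive=$1
    dest=$2
    mkdir -p "$dest"
    tar -xzf "$archive" -C "$dest"
)
```

A typical round trip is to run <tt>pack_tree myenv env.tar.gz</tt>, copy <tt>env.tar.gz</tt> to SDS@hd, and later run <tt>unpack_tree env.tar.gz restored</tt> after downloading it again.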

= Troubleshooting =

== <tt>SSH_FX_FAILURE</tt> during upload ==

'''Most common cause: storage quota exhausted.'''

SDS@hd enforces a hard quota per project. When full, the server rejects all writes. Check your quota first:

<pre>
rclone about sdshd-sftp:
</pre>

If the quota is full, delete or archive old data on SDS@hd before retrying.

'''Secondary cause: gateway overload from large job arrays.'''

The load on <tt>lsdf02-sshfs</tt> scales with the array size: 100 tasks × 16 parallel SFTP requests each (the <tt>sftp_concurrency</tt> default) = up to 1600 simultaneous requests. The gateway may start rejecting requests under this load. Reduce <tt>sftp_concurrency</tt> to <tt>8</tt> in <tt>~/.config/sdshd/config</tt>, or bypass the FUSE mount entirely for bulk transfers:

<pre>
rclone copy /local/results/ sdshd-sftp:results/ \
--transfers 4 --sftp-concurrency 8 --sftp-chunk-size 255k --progress
</pre>
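The gateway load estimate is simply the number of array tasks multiplied by <tt>sftp_concurrency</tt>. A minimal sketch for trying out other combinations follows; <tt>array_sftp_load</tt> is a hypothetical helper, and the result is a worst-case figure that assumes all tasks transfer data at the same time.

```shell
# array_sftp_load TASKS CONCURRENCY
# Prints the worst-case number of simultaneous SFTP operations that a job
# array of TASKS tasks can place on the gateway.
array_sftp_load() {
    echo $(( $1 * $2 ))
}
```

For the example above, <tt>array_sftp_load 100 16</tt> prints <tt>1600</tt>, and <tt>array_sftp_load 100 8</tt> prints <tt>800</tt>, so halving <tt>sftp_concurrency</tt> halves the worst-case load.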

'''Resuming after failed uploads:''' With the default <tt>/tmp</tt> cache, resuming requires submitting a new job to the exact same compute node, which is not generally practical, and the cache is gone after a node reboot regardless. With a Weka workspace as <tt>cache_dir</tt>, the cache survives both job end and node reboots. The key advantage is that you can copy the data out manually without needing a job on that node at all, as described in [[#Persistent_cache_on_Weka|Persistent cache on Weka]].

== Job stuck in <tt>CG</tt> (COMPLETING) for a long time ==

The Slurm epilog waits for rclone to finish uploading all pending data. For large output files this can take minutes — this is expected and safe. The job will exit on its own once all uploads are confirmed complete. There is nothing to do.

NEMO2/SDS hd

2026-04-22T15:46:47Z

M Janczyk: /* Shared dataset across many jobs */

<div class="mw-collapsible mw-collapsed">
Expand for full <tt>~/.config/sdshd/config</tt> example.
<div class="mw-collapsible-content">
<pre>
# ~/.config/sdshd/config

# Local VFS cache directory. Must be an absolute path.
# Leave empty to use /tmp/sdshd-<uid>/cache/ (node-local NVMe, cleaned on reboot).
# Use a Weka workspace path (ws_find <name>) for crash-safe output caching or
# for large datasets that exceed node NVMe capacity.
# A per-hostname subdirectory (rclone-<nodename>/) is always appended so that
# parallel jobs on different nodes each have their own independent cache.
cache_dir =

# VFS cache mode: off | minimal | writes | full (default: full)
# 'full' caches entire files locally for read/write. Required for arbitrary
# file access patterns and write support. 'off' reads directly from SDS@hd
# with no local cache.
cache_mode = full

# Maximum local cache size. rclone evicts LRU files automatically.
cache_size = 50G

# How long to keep cached files before eviction (even if cache_size not hit).
# Set >= max job wall time when using Weka as cache_dir (not cleaned on reboot).
cache_age = 24h

# Number of parallel SFTP requests per connection.
# Higher values fill the TCP window more aggressively over the 5 ms WAN link.
# For job arrays with many tasks, reduce to 8 to avoid overloading lsdf02-sshfs.
sftp_concurrency = 16

# Number of parallel file transfer workers (background uploads).
transfers = 8

# In-memory read/write buffer per file. 128M fills the TCP window well at
# 5 ms RTT (Freiburg → Heidelberg). Increase for very large sequential files.
buffer_size = 128M

# Sequential read prefetch window. Increase to 512M for large sequential reads.
read_ahead = 64M

# Delay between file close and upload start.
# 5s (default): close() returns immediately; the upload runs asynchronously
# in the background. Your job continues while data is transferred to SDS@hd,
# and umount-sdshd (or the Slurm epilog) waits for all uploads to finish
# before disconnecting.
# 0s: upload runs synchronously inside close(), blocking your job for the
# entire upload duration of every file written. For large output files this
# means your job script stalls at each file write — you almost certainly
# do not want this.
write_back = 5s

# How long directory listings are cached. 2h for compute jobs (stable),
# 5m for interactive use (pick up changes from other login nodes quickly).
dir_cache = 5m

# Extra rclone flags appended verbatim after all other options.
# Command-line arguments (mount-sdshd only) still take final precedence.
#
# Values are split on spaces, so arguments that contain an internal space
# (e.g. bwlimit timetables like "08:00,100M 18:00,500M") cannot be expressed
# here. Set those via rclone.conf or pass them directly on the command line
# with mount-sdshd instead.
#
# Useful examples:
#
# Limit upload/download bandwidth (per rclone process, not per file).
# Useful when running multiple jobs simultaneously or to be a good
# neighbour on a shared gateway. Value: bytes/s suffix M/G, or 'off'.
# Per-direction limits use UP:DOWN (e.g. --bwlimit 50M:200M);
# --bwlimit-file limits each file separately and takes the same syntax.
# extra_opts = --bwlimit 200M
#
# Exclude patterns — keep temp/checkpoint files out of the VFS cache
# so they don't consume cache space or trigger unnecessary uploads.
# Do NOT use shell quotes here — they are passed literally to rclone,
# not interpreted by the shell. Write glob patterns unquoted:
# extra_opts = --exclude *.tmp --exclude checkpoint_*.pt
#
# Override the connection timeout (rclone's --contimeout, default 60s).
# Reduce if you want faster failure detection on flaky connections:
# extra_opts = --contimeout 30s
extra_opts =
</pre>
</div>
</div>

'''Available disk space by node type:'''

{| class="wikitable"
|-
! Node type
! Local disk
! Recommended max <tt>cache_size</tt>
|-
| Login nodes
| ~400 GB
| 20–50 G (shared with other users)
|-
| Milan nodes
| ~1.9 TB
| 50 G (default)
|-
| Other nodes
| ~3.8 TB
| 50 G (default)
|}
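
Before raising <tt>cache_size</tt>, it can help to check how much space is actually free on the filesystem that holds the cache. The following is a minimal sketch, assuming the default cache location under <tt>/tmp</tt> and the default <tt>cache_size</tt> of 50G; the helper name <tt>avail_gib</tt> is hypothetical and not part of the SDS@hd tooling:

```shell
# Hypothetical helper: free space (whole GiB) on the filesystem holding a path.
avail_gib() {
    # -P forces one POSIX-format line per filesystem, -k reports 1 KiB blocks.
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

cache_size_gib=50   # assumed: the default cache_size (50G)
free_gib="$(avail_gib /tmp)"

if [ "$free_gib" -lt "$cache_size_gib" ]; then
    echo "Only ${free_gib} GiB free on /tmp, less than cache_size (${cache_size_gib}G)"
else
    echo "${free_gib} GiB free on /tmp, enough for cache_size (${cache_size_gib}G)"
fi
```

On login nodes the free space is shared with other users, so the result can change between the check and the mount.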

'''Useful <tt>extra_opts</tt> examples:'''

<pre>
# Limit upload/download bandwidth (per rclone process).
# Useful when running many jobs simultaneously or to be a good neighbour.
extra_opts = --bwlimit 200M

# Exclude temp and checkpoint files from the cache — saves cache space
# and avoids uploading files you don't need on SDS@hd.
# Do NOT use shell quotes here — write glob patterns unquoted.
extra_opts = --exclude *.tmp --exclude checkpoint_*.pt
</pre>
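
The reason quotes must be left out can be illustrated with a small sketch. It assumes that <tt>extra_opts</tt> is split on whitespace without any shell evaluation, as described above; the actual implementation inside <tt>mount-sdshd</tt> is not shown here and may differ in detail:

```shell
# Assumption: extra_opts is split on whitespace only, with no quote handling.
extra_opts="--exclude '*.tmp' --bwlimit 200M"

# Unquoted expansion splits on whitespace; set -f keeps globs unexpanded.
set -f
set -- $extra_opts
set +f

nargs=$#
second=$2

# Print each resulting argument in brackets.
printf '[%s]\n' "$@"
# Prints:
# [--exclude]
# ['*.tmp']
# [--bwlimit]
# [200M]
```

The quote characters survive as part of the second argument, so rclone would receive the pattern <tt>'*.tmp'</tt> including the quotes and match nothing useful.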

== Persistent cache on Weka ==

With the default <tt>/tmp</tt> cache, data written to SDS@hd is first held locally on the node's NVMe. If the node crashes or reboots, or the job is killed (OOM, walltime), before the upload completes, the cached data is lost.

A Weka workspace survives node failures. If a job is killed before uploads finish, the data is still in the cache: remount with the same <tt>cache_dir</tt> to resume the upload, or copy the files out directly. To set this up, create a workspace and point <tt>cache_dir</tt> at it:

<pre>
# Create a workspace and configure it as the cache directory:
ws_allocate mywsname 30

# Edit ~/.config/sdshd/config, add:
cache_dir = /path/from/ws_find
cache_size = 2T
cache_age = 720h
</pre>

Replace <tt>/path/from/ws_find</tt> with the output of <tt>ws_find mywsname</tt>.
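
If you prefer to script the change instead of editing the file by hand, something along these lines could work. This is only a sketch: the function <tt>set_sdshd_option</tt> is hypothetical, and it assumes the config file uses plain <tt>key = value</tt> lines as in the example above:

```shell
# Hypothetical helper: set or replace one "key = value" line in the sdshd config.
set_sdshd_option() {
    local key="$1" value="$2"
    local cfg="${3:-$HOME/.config/sdshd/config}"
    local tmp

    mkdir -p "$(dirname "$cfg")"
    touch "$cfg"

    # Keep every line except an existing assignment of this key ...
    tmp="$(mktemp)"
    grep -v -E "^[[:space:]]*${key}[[:space:]]*=" "$cfg" > "$tmp" || true
    # ... then append the new assignment.
    printf '%s = %s\n' "$key" "$value" >> "$tmp"

    # Write back in place so the file keeps its permissions.
    cat "$tmp" > "$cfg"
    rm -f "$tmp"
}

# Usage on NEMO2 (requires ws_find, so shown as comments):
#   set_sdshd_option cache_dir  "$(ws_find mywsname)"
#   set_sdshd_option cache_size 2T
#   set_sdshd_option cache_age  720h
```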

After a failed job, recover the cached data:

<pre>
# The cache subdirectory is named after the compute node hostname.
# Check: ls $(ws_find mywsname)/
rclone copy "$(ws_find mywsname)/rclone-<compute-node-name>/vfs/sdshd-sftp/" \
sdshd-sftp:recovered/ --progress
</pre>
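
If you are not sure which node the failed job ran on, you can list every per-node cache directory in the workspace first. A minimal sketch, assuming the <tt>rclone-&lt;nodename&gt;/vfs/sdshd-sftp/</tt> layout shown above; the function name <tt>list_cache_dirs</tt> is hypothetical:

```shell
# Hypothetical helper: list per-node cache directories and their file counts.
list_cache_dirs() {
    local ws="$1" d count
    for d in "$ws"/rclone-*/vfs/sdshd-sftp; do
        # Skip the unexpanded glob when no cache directory exists.
        [ -d "$d" ] || continue
        count="$(find "$d" -type f | wc -l | tr -d ' ')"
        printf '%s\t%s files\n' "$d" "$count"
    done
}

# Usage on NEMO2 (requires ws_find and rclone, so shown as comments):
#   list_cache_dirs "$(ws_find mywsname)"
#   rclone copy --dry-run "<one of the listed directories>/" sdshd-sftp:recovered/
```

Running <tt>rclone copy</tt> with <tt>--dry-run</tt> first shows what would be transferred without writing anything to SDS@hd.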

'''When are cached files removed?'''

{| class="wikitable"
|-
! Trigger
! What happens
|-
| rclone running, <tt>cache_age</tt> exceeded
| rclone evicts files that have not been accessed for longer than <tt>cache_age</tt>
|-
| rclone running, <tt>cache_size</tt> would be exceeded
| rclone evicts least-recently-used files
|-
| rclone not running (between mounts)
| nothing — files are not touched
|-
| Workspace expires (<tt>ws_extend</tt> not run)
| restorable for 30 days, then auto-deleted permanently
|-
| Manual cleanup
| remove the <tt>rclone-*</tt> subdirectory inside the workspace
|}

See [[Workspaces]] for workspace management commands.

= Advanced Usage =

== Read-only access ==

For browsing or spot reads without writing, turn the cache off. This bypasses the local cache entirely, so every read goes directly to SDS@hd:

<pre>
mount-sdshd --vfs-cache-mode off --read-only
</pre>

For '''repeated reads of the same files''' (e.g. checking a large dataset), use the default cache with <tt>--read-only</tt>. The VFS cache still operates — files are downloaded once on first access and served from local disk on all subsequent reads:

<pre>
mount-sdshd --read-only
</pre>

== ML training and read-intensive jobs ==

If your job reads the same files many times (training epochs, iterative algorithms, multi-pass analysis), set a Weka workspace as <tt>cache_dir</tt>. rclone downloads each file on first access; every subsequent read is served from Weka without any WAN traffic:

<pre>
ws_find mywsname # e.g. /work/classic/fr_abc123456-mywsname
</pre>

<pre>
# In ~/.config/sdshd/config:
cache_dir = /work/classic/fr_abc123456-mywsname
cache_size = 2T
cache_age = 120h
</pre>

The first epoch downloads from SDS@hd. Every subsequent epoch reads from Weka — no WAN traffic, no gateway load.

'''Note:''' The cache is per-node. Each node in a job array downloads its own copy. For job arrays where all tasks need the same files, use <tt>rclone copy</tt> to pre-stage once instead (see below).

== Shared dataset across many jobs ==

If many job array tasks need to read the same files, rclone VFS cannot share a cache — concurrent mounts against the same cache directory will corrupt it, even with <tt>--read-only</tt>.

The solution: copy the dataset once to a Weka workspace, then have all jobs read from the Weka path directly as a plain filesystem — no FUSE mount, no per-job downloads:

<pre>
# Step 1 (once, on login node): copy dataset from SDS@hd to Weka
ws_allocate mydata 30
rclone copy sdshd-sftp:inputdata/ $(ws_find mydata)/inputdata/ \
--progress --transfers 8 --sftp-concurrency 8 --sftp-chunk-size 255k

# Step 2: reference the Weka path in your job script (no --gres=sdshd needed)
INPUT_DIR=$(ws_find mydata)/inputdata
</pre>

<tt>--sftp-chunk-size 255k</tt> increases the SFTP packet size from 32 KB (default) to 255 KB (OpenSSH's maximum), which reduces protocol overhead and significantly improves throughput for large files. Use this flag with explicit <tt>rclone copy</tt> commands — it has no effect on the FUSE mount and should not be placed in <tt>extra_opts</tt>.

All tasks read from Weka at full speed with zero WAN traffic and zero gateway load. Re-run <tt>rclone copy</tt> if the source data changes — it skips unchanged files.
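
A job array script built on the pre-staged copy might look roughly like this. It is a sketch under assumptions: each task processes one file chosen by its array index, <tt>pick_input</tt> is a hypothetical helper, and <tt>my_analysis</tt> is a placeholder for your own program:

```shell
#!/bin/bash
#SBATCH --array=0-99
# No --gres=sdshd here: the tasks read plain files from Weka.

# Hypothetical helper: choose the N-th file (0-based, sorted by name) in a directory.
pick_input() {
    local dir="$1" idx="$2"
    find "$dir" -maxdepth 1 -type f | sort | sed -n "$((idx + 1))p"
}

# In the real job script (requires ws_find and Slurm, so shown as comments):
#   INPUT_DIR="$(ws_find mydata)/inputdata"
#   input_file="$(pick_input "$INPUT_DIR" "$SLURM_ARRAY_TASK_ID")"
#   my_analysis "$input_file"     # my_analysis is a placeholder
```

Because every task only reads from the workspace, no rclone process runs inside the job and there is no cache to corrupt.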

= Known Limitations =

== Symlinks are not preserved ==

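rclone's SFTP backend does not store symlinks on the remote. When rclone encounters a symlink during an upload (through the FUSE mount or via <tt>rclone copy</tt>), it does not recreate the link on SDS@hd. If links are followed (for example with <tt>--copy-links</tt>), the target's content is uploaded as a regular file; otherwise the link is skipped. In both cases the symlink itself is lost.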

{| class="wikitable"
|-
! Situation
! What happens
|-
| Symlink to a regular file
| Target file is uploaded; symlink metadata is lost
|-
| Symlink to a directory
| Directory tree is traversed; contents are uploaded recursively
|-
| Dangling symlink (target missing)
| Transfer error or silently skipped
|-
| Restore from SDS@hd
| Regular files/directories — original symlinks are gone
|}

'''Workarounds:'''

* '''Archive before upload:''' <tt>tar -czf env.tar.gz myenv/</tt> — preserves all symlinks inside the archive. Unpack after download to restore the original structure.
* '''<tt>rclone copy --links</tt>:''' encodes symlinks as <tt>.rclonelink</tt> text files on the remote. Only works for <tt>rclone copy</tt>/<tt>rclone sync</tt> workflows (not the FUSE VFS mount). Requires <tt>--links</tt> on both the copy and the restore.
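
The archive workaround can be tried locally before relying on it. The sketch below builds a throwaway directory containing one symlink (the file names are made up for illustration), archives it, unpacks it elsewhere, and checks that the link is still a link:

```shell
# Build a small tree containing a symlink (stand-in for e.g. a virtualenv).
work="$(mktemp -d)"
mkdir -p "$work/myenv/bin"
echo 'hello' > "$work/myenv/bin/python3.11"
ln -s python3.11 "$work/myenv/bin/python"

# Archive it; tar stores the link itself, not a copy of the target.
tar -czf "$work/env.tar.gz" -C "$work" myenv

# Unpack elsewhere, as you would after downloading the archive from SDS@hd.
mkdir "$work/restored"
tar -xzf "$work/env.tar.gz" -C "$work/restored"

# The restored entry is still a symlink pointing at the same target.
readlink "$work/restored/myenv/bin/python"
```

The archive itself is a regular file, so it can be copied to and from SDS@hd through the mount or with <tt>rclone copy</tt> without losing anything.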

= Troubleshooting =

== <tt>SSH_FX_FAILURE</tt> during upload ==

'''Most common cause: storage quota exhausted.'''

SDS@hd enforces a hard quota per project. When full, the server rejects all writes. Check your quota first:

<pre>
rclone about sdshd-sftp:
</pre>

If the quota is full, delete or archive old data on SDS@hd before retrying.

'''Secondary cause: gateway overload from large job arrays.'''

100 tasks × 16 parallel SFTP requests = up to 1600 requests in flight against <tt>lsdf02-sshfs</tt> at once. The gateway may reject requests or new connections under this load. Reduce <tt>sftp_concurrency</tt> to <tt>8</tt> in <tt>~/.config/sdshd/config</tt>, or bypass the FUSE mount entirely for bulk transfers:

<pre>
rclone copy /local/results/ sdshd-sftp:results/ \
--transfers 4 --sftp-concurrency 8 --sftp-chunk-size 255k --progress
</pre>
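
To get a feeling for the load a job array puts on the gateway, the 100 × 16 estimate can be repeated for other settings. This is a rough upper bound only, assuming all tasks run at the same time and each one uses its full <tt>sftp_concurrency</tt>; the function name <tt>gateway_load</tt> is hypothetical:

```shell
# Hypothetical helper: rough upper bound = tasks x sftp_concurrency.
gateway_load() {
    echo $(( $1 * $2 ))
}

gateway_load 100 16   # default sftp_concurrency, prints 1600
gateway_load 100 8    # reduced sftp_concurrency, prints 800
```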

If a job ended before its uploads completed, the pending data may still be in the local cache. With the default <tt>/tmp</tt> cache, resuming the upload requires a new job on exactly the same compute node, which is rarely practical, and the cache is gone after a node reboot in any case. With a Weka workspace as <tt>cache_dir</tt>, the cache survives both job end and node reboots, and you can copy the data out manually without a job on that node, as described in [[#Persistent_cache_on_Weka|Persistent cache on Weka]].

== Job stuck in <tt>CG</tt> (COMPLETING) for a long time ==

The Slurm epilog waits for rclone to finish uploading all pending data. For large output files this can take minutes; this is expected and safe. The job exits on its own once all uploads are confirmed complete. No action is needed.