Data Transfer

From bwHPC Wiki
Revision as of 00:05, 22 November 2024 by H Schumacher (talk | contribs) (formatting)
This page is a work in progress. More subpages will be added over time, so please check back later.

Overview

Data transfer is the exchange of files between two systems. Before data transfer can happen, you need to go through the following steps:

  1. Choose the two data storage systems that shall exchange data.
  2. Choose the top-level way of transfer (copy, sync or mount) that fits your use case.
  3. Choose a network protocol or transfer tool to use for the communication between the systems.

The recommended setup already covers these three steps. For a full overview, see the tables of all transfer routes. They list every possible combination of systems, top-level way of transfer and network protocol or transfer tool.

Data Storage Systems

Data transfer can happen between a variety of systems. For example:

  • Local computer or VM (virtual machine)
  • Data producing machine (sequencer, microscope, ...)
  • HPC system
  • Storage space (SDS@hd, institute server, ...)
  • Cloud resource

Ways of Transfer: Copy, Sync, Mount

The top-level ways of transfer are:

  • Copy: A simple copy command is the most basic way to transfer data. It is most efficient for very large files that are retrieved from or moved to a remote location. It is also convenient if you prefer moving files on the command line rather than in a file browser.
  • Sync: If the data is kept on both systems and changes on only one of them, use a synchronization command instead. Only the files that changed in one location are then updated in the other. Good use cases are backups and transfers that go mostly in one direction, such as moving data from a sequencer to a storage space. A disadvantage is that the data occupies storage space on both systems.
  • Mount: If the data changes on both systems or is too big to store locally, mounting is the most convenient solution. You can view and work with the data as if it were stored locally on your computer while it remains on the remote system. All changes you make are applied directly to the original data, so there is nothing to copy or synchronize. You also see changes made by others after only a very short delay.
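
The rules of thumb above can be written down as a small decision helper. This is only an illustrative sketch: the function name and its three yes/no parameters are made up for this example and are not part of any bwHPC tool.

```python
def choose_transfer_method(changes_on_both_systems: bool,
                           fits_on_local_storage: bool,
                           keep_copy_on_both_systems: bool) -> str:
    """Return 'mount', 'sync' or 'copy' following the rules of thumb above."""
    # Mount: the data changes on both systems or is too big to store locally.
    if changes_on_both_systems or not fits_on_local_storage:
        return "mount"
    # Sync: the data lives on both systems but changes on only one of them.
    if keep_copy_on_both_systems:
        return "sync"
    # Copy: files are simply retrieved from or moved to a remote location.
    return "copy"


# Example: sequencer output that is mirrored to a storage space.
print(choose_transfer_method(changes_on_both_systems=False,
                             fits_on_local_storage=True,
                             keep_copy_on_both_systems=True))
```

Real use cases are rarely this clear-cut, so treat the result as a starting point rather than a rule.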


Figure 1: Top level transfer routes


Network Protocols & Transfer Tools

Basic network protocol | Used by network protocol
ssh                    | scp, sftp, rsync
http(s)                | WebDAV
smb                    | -
NFS                    | -

For every data transfer, you must choose a network protocol for the communication between the systems. The table above lists the basic network protocols and the protocols that build directly on them. You can use these protocols more or less directly, or through tools that bundle a protocol with additional features. Such a tool can be a command-line tool or a tool with a graphical user interface.

A comprehensive overview of all transfer options (network protocols and tools) can be found on the page all transfer routes.
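
As an illustration, the table above can be expressed as a lookup from each option to the basic protocol it relies on. The sketch below uses it to list which options remain when only certain basic protocols are permitted, for example behind a restrictive firewall. The names `BASE_PROTOCOL` and `usable_options` are hypothetical and chosen for this example only.

```python
# Basic protocol behind each option, taken from the table above.
BASE_PROTOCOL = {
    "scp": "ssh",
    "sftp": "ssh",
    "rsync": "ssh",
    "WebDAV": "http(s)",
    "smb": "smb",
    "NFS": "NFS",
}


def usable_options(allowed_base_protocols):
    """Return the options whose basic protocol is allowed, sorted by name."""
    allowed = set(allowed_base_protocols)
    matches = (name for name, base in BASE_PROTOCOL.items() if base in allowed)
    # Sort case-insensitively so that "WebDAV" is not listed before lowercase names.
    return sorted(matches, key=str.lower)


print(usable_options({"http(s)"}))
print(usable_options({"ssh", "http(s)"}))
```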

Recommended Setup

The recommended tools for transferring data to a bwHPC cluster from your local machine are:

  • MobaXterm for Windows. It is a graphical user interface that lets you log in to the cluster via ssh and transfer data with a file browser.
  • sshfs for macOS and Linux. It is the quickest solution for mounting a folder. If you want to use copy and sync as well, Rclone is more convenient. Rclone provides copy, sync and mount functionality for various types of infrastructure.
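
The following sketch shows roughly what the corresponding command lines look like. It only builds and prints the commands and does not run them. All names are assumptions for illustration: `bwhpc` is a hypothetical Rclone remote that you would first have to set up with `rclone config`, and the paths, user name and host name are placeholders.

```python
import shlex

REMOTE = "bwhpc"              # hypothetical Rclone remote, created with `rclone config`
REMOTE_DIR = "project/data"   # placeholder path on the cluster
LOCAL_DIR = "data"            # placeholder local directory
MOUNT_POINT = "cluster_data"  # placeholder empty local directory used as mount point


def rclone_command(method):
    """Build the Rclone command line for 'copy', 'sync' or 'mount'."""
    remote = f"{REMOTE}:{REMOTE_DIR}"
    if method in ("copy", "sync"):
        # Source is the local directory, destination is the cluster.
        # Note: `rclone sync` makes the destination match the source and
        # deletes destination files that are missing in the source.
        return ["rclone", method, LOCAL_DIR, remote]
    if method == "mount":
        return ["rclone", "mount", remote, MOUNT_POINT]
    raise ValueError(f"unknown method: {method}")


def sshfs_command(user, host):
    """Build the sshfs command line that mounts REMOTE_DIR at MOUNT_POINT."""
    return ["sshfs", f"{user}@{host}:{REMOTE_DIR}", MOUNT_POINT]


for method in ("copy", "sync", "mount"):
    print(shlex.join(rclone_command(method)))
print(shlex.join(sshfs_command("alice", "cluster.example.org")))
```

Before running a real `rclone sync`, it can be worth adding `--dry-run` to see what would be changed.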



Figure 2: bwHPC main transfer routes

For SDS@hd you can find the main access options at the SDS@hd Access page.

Best Practices

  • Strong firewall restrictions
    -> Use ssh- or http(s)-based protocols, for example sftp and WebDAV. At facilities with very strict rules, even ssh-based protocols might not be allowed.
  • Share data with collaborators...
    • ...outside of Baden-Württemberg
      -> Use the SDS@hd storage.
    • ...that are less comfortable with the command line
      -> Let them mount the folder.
  • Transfer many small files
    -> Pack the files into a single compressed archive (for example with tar) before transferring them.
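
For the last point, a minimal sketch of packing a directory of small files into one archive using Python's standard `tarfile` module. The function name `pack_directory` and the example paths are made up for illustration. The archive should be written outside the directory that is being packed.

```python
import tarfile
from pathlib import Path


def pack_directory(source_dir, archive_path):
    """Pack source_dir into one gzip-compressed tar archive.

    Returns the number of regular files stored in the archive.
    """
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores paths relative to the directory name
        # instead of the full local path.
        archive.add(source, arcname=source.name)
    # Re-open the archive to count what was actually written.
    with tarfile.open(archive_path, "r:gz") as archive:
        return sum(1 for member in archive.getmembers() if member.isfile())
```

A call such as `pack_directory("results", "results.tar.gz")` would then produce a single file that can be transferred in one go and unpacked on the target system.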

For advanced topics see advanced-data-transfer.