Data Transfer: Difference between revisions

From bwHPC Wiki
No edit summary
(updated whole page)
 
(17 intermediate revisions by 3 users not shown)
== Transfer Tools ==
{| style=" background:#FEF4AB; width:100%;"
| style="padding:6px; font-size:100%;text-align:left" | This page is a work in progress. Please check back later for further subpages.
|-
|}
== Overview ==

Data transfer is the exchange of files between two systems. Before data transfer can happen, you need to go through the following steps:

# Choose the two [[Data_Storage_Systems|data storage systems]] that will exchange data.
# Choose the top-level way of transfer ([[Ways of Transfer:_Copy,_Sync,_Mount|copy, sync or mount]]) that fits your use case.
# Choose a [[Network_Protocols_&_Transfer_Tools|network protocol or transfer tool]] for the communication between the systems.

The recommended setup already covers these three steps. For a full overview, see the tables of [[All_Transfer_Routes|all transfer routes]], which list every combination of systems, top-level way of transfer and network protocol or transfer tool.

=== Data Storage Systems ===

Data transfer can happen between a variety of systems. For example:

* [[File:Notebook.svg|x20px]] Local computer or VM (virtual machine)
* [[File:Microscope.svg|x20px]] <span style="margin-left:10px;">Data-producing machine (sequencer, microscope, ...)</span>
* [[File:Clusternodes.svg|x20px]] <span style="margin-left:8px;">HPC system</span>
* [[File:Storage_small.svg|x15px]] <span style="margin-left:8px;">Storage space (SDS@hd, institute server, ...)</span>
* [[File:Cloud.svg|x15px]] <span style="margin-left:3px;">Cloud resource</span>

=== Ways of Transfer: Copy, Sync, Mount ===

The top-level ways of transfer are:

* '''Copy:''' A simple copy command is the most basic way to transfer data. It is most efficient for very large files that need to be retrieved from or moved to a remote location. It can also be the most convenient option if you prefer moving files on the command line rather than in a file browser.
* '''Sync:''' If the data is kept on both systems and changes on only one of them, a synchronization command makes more sense. Only the files that changed in one location are then updated in the other. Good use cases are backups or transfers that go mostly in one direction, such as moving data from a sequencer to a storage space. A disadvantage is that the data needs storage space on both systems.
* '''Mount:''' If the data changes on both systems or is too big to store locally, mounting is the most convenient solution. You can see and work with the data as if it were stored on your computer while it remains on the remote system. All changes you make are applied directly to the original data, so you do not need to copy or synchronize anything. You also see changes made by others after only a very short delay.
<p style="text-align: left;">[[File:CopySyncMount.png|x250px]]</p>
<p style="text-align: left; font-size: small; margin-top: 10px; margin-left: 200px;">Figure 1: Top level transfer routes</p>

<br />

=== Network Protocols & Transfer Tools ===


{| class="wikitable" style="vertical-align:middle;"
|- style="font-weight:bold;"
! Basic Network Protocol
! Used By Network Protocol
|-
| ssh
| scp, sftp, rsync
|-
| http(s)
| WebDAV
|-
| smb
| -
|-
| NFS
| -
|}
{|{{Table|width=99%}}
|-
! rowspan="2" | Type
! rowspan="2" | Software
! rowspan="2" | Remarks
! colspan="4" style="text-align:center" | Executable on
! colspan="3" style="text-align:center" | Transfer from/to
|-
!Local°
!bwUniCluster
!bwForCluster
!www
!bwHPC cluster
![[Sds-hd|SDS@hd]]
|-
| rowspan="5" | Command-line
| scp
| rowspan="3" | Throughput < 150 MB/s (depending on cipher)
| style="text-align:center" | +
| style="text-align:center" | +
| style="text-align:center" | +
|
| style="text-align:center" | +
| style="text-align:center" |
|-
| sftp
| style="text-align:center" | +
| style="text-align:center" | +
| style="text-align:center" | +
|
| style="text-align:center" | +
| style="text-align:center" | +
|-
| rsync
| style="text-align:center" | +
| style="text-align:center" | +
| style="text-align:center" | +
|
| style="text-align:center" | +
| style="text-align:center" |
|-
| rdata
| Throughput of 350-400 MB/s
|
| style="text-align:center" | +
|
|
|
| style="text-align:center" | +
|-
| wget
| Download from http/ftp address only
| style="text-align:center" | +
| style="text-align:center" | +
| style="text-align:center" | +
| style="text-align:center" | +
|
| style="text-align:center" |
|-
| rowspan="2" | Graphical
| [https://winscp.net/eng/download.php WinSCP]
| based on SCP/SFTP, Windows only
| style="text-align:center" | +
|
|
|
| style="text-align:center" | +
| style="text-align:center" | +
|-
| [https://filezilla-project.org/download.php?show_all=1 FileZilla]
| based on SFTP
| style="text-align:center" | +
|
|
|
| style="text-align:center" | +
| style="text-align:center" | +
|}
Every data transfer needs a network protocol for the communication between the systems. The table on the right shows the basic network protocols and the protocols that build directly on them. These protocols can be used either directly or through tools that bundle the protocol with additional features. Such a tool can be a command-line tool or have a graphical user interface.


A comprehensive overview of all transfer options (network protocols and tools) can be found on the page [[Data_Transfer/All_Transfer_Routes|all transfer routes]].
° Depending on the installed operating system (OS).



== Recommended Setup ==


The main tools for transferring data from your local machine to a bwHPC cluster are:

* '''[[mobaxterm|MobaXterm]]''' for Windows. It is a graphical user interface that lets you log in to the cluster via ssh and transfer data with a file browser.
* '''[[Data_Transfer/SSHFS|sshfs]]''' for macOS and Linux. It is the quickest way to mount a folder. If you also want to copy and sync, '''[[Data_Transfer/Rclone|Rclone]]''' is more convenient, because it provides copy, sync and mount functionality for various types of infrastructure.

<H1>Using SFTP from Unix client</H1>

'''Example:'''


<pre>
> sftp ka_xy1234@bwfilestorage.lsdf.kit.edu
Connecting to bwfilestorage.lsdf.kit.edu
ka_xy1234@bwfilestorage.lsdf.kit.edu's password:
sftp> ls
snapshots
temp test
sftp> help
...
sftp> put myfile
sftp> get myfile
</pre>


<p style="text-align: center; margin-top: 10px">[[File:Bwhpc diagram simplenobox.jpg|x150px]]</p>
<p style="text-align: center; font-size: small; margin-top: 10px">Figure 2: bwHPC main transfer routes</p>
For SDS@hd you can find the main access options at the [[SDS@hd/Access|SDS@hd Access]] page.


== Best practices ==

* '''Strong firewall restrictions'''<br />
-&gt; Use ssh- or http(s)-based protocols, for example [[Data_Transfer/WebDAV|'''WebDAV''']] and [[Data_Transfer/SFTP|sftp]]. Very strict facilities might not allow ssh-based protocols at all.
* '''Share data with collaborators...'''
** ...outside of Baden-Württemberg<br />
-&gt; Use the [[SDS@hd|SDS@hd]] storage.
** ...who are less comfortable with the command line<br />
-&gt; Let them mount the folder.
* '''Transfer many small files'''<br />
-&gt; Pack the files into a single archive before transferring.

For advanced topics see [[advanced-data-transfer|advanced-data-transfer]].

<H1>Using SFTP from Windows and Mac client</H1>

Windows clients do not have an SCP/SFTP client installed by default, so one needs to be installed before this protocol can be used.

'''Example tools:'''
* [https://www.openssh.com/ OpenSSH]
* [https://www.chiark.greenend.org.uk/~sgtatham/putty/download.html PuTTY suite] (for Windows and Unix)
* [https://winscp.net/eng/download.php WinSCP] (for Windows)
* [https://filezilla-project.org/download.php?show_all=1 FileZilla] (for Windows, Mac and Linux)
* [https://cygwin.com/install.html Cygwin] (for Windows)
<br>
'''Network drive over SFTP:'''
* [https://www.southrivertechnologies.com/download/downloadwd.html WebDrive] (for Windows and Mac)
* [https://www.eldos.com/sftp-net-drive/comparison.php SFTP Net Drive (ELDOS)] (for Windows)
* [https://www.netdrive.net/ NetDrive] (for Windows)
* [https://www.expandrive.com/expandrive ExpanDrive] (for Windows and Mac)
[[Category:bwFileStorage|SFTP]]

Latest revision as of 21:10, 21 November 2024

This page is a work in progress. Please check back later for further subpages.

Overview

Data transfer is the exchange of files between two systems. Before data transfer can happen, you need to go through the following steps:

  1. Choose the two data storage systems that will exchange data.
  2. Choose the top-level way of transfer (copy, sync or mount) that fits your use case.
  3. Choose a network protocol or transfer tool for the communication between the systems.

The recommended setup already covers these three steps. For a full overview, see the tables of all transfer routes, which list every combination of systems, top-level way of transfer and network protocol or transfer tool.

Data Storage Systems

Data transfer can happen between a variety of systems. For example:

  • Local computer or VM (virtual machine)
  • Data-producing machine (sequencer, microscope, ...)
  • HPC system
  • Storage space (SDS@hd, institute server, ...)
  • Cloud resource

Ways of Transfer: Copy, Sync, Mount

The top-level ways of transfer are:

  • Copy: A simple copy command is the most basic way to transfer data. It is most efficient for very large files that need to be retrieved from or moved to a remote location. It can also be the most convenient option if you prefer moving files on the command line rather than in a file browser.
  • Sync: If the data is kept on both systems and changes on only one of them, a synchronization command makes more sense. Only the files that changed in one location are then updated in the other. Good use cases are backups or transfers that go mostly in one direction, such as moving data from a sequencer to a storage space. A disadvantage is that the data needs storage space on both systems.
  • Mount: If the data changes on both systems or is too big to store locally, mounting is the most convenient solution. You can see and work with the data as if it were stored on your computer while it remains on the remote system. All changes you make are applied directly to the original data, so you do not need to copy or synchronize anything. You also see changes made by others after only a very short delay.

Figure 1: Top level transfer routes
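
The three ways of transfer can be sketched on the command line as follows. This is a minimal sketch, not a prescribed setup: the account xy1234, the host cluster.example.org and all paths are hypothetical placeholders, and scp, rsync and sshfs are only one common choice of tool for each way. By default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical placeholders: replace with your own account, host and paths.
REMOTE="xy1234@cluster.example.org"

# Dry run by default: commands are printed, not executed.
# Run the script with RUN= (empty) to execute them for real.
RUN="${RUN-echo}"

# Copy: transfer one large file to the remote system.
copy_example() {
    $RUN scp big_dataset.tar "$REMOTE:/path/to/target/"
}

# Sync: transfer only the files that changed since the last run
# (-a recurses and preserves permissions and timestamps, -v lists files).
sync_example() {
    $RUN rsync -av results/ "$REMOTE:/path/to/target/results/"
}

# Mount: make a remote folder appear as a local directory.
mount_example() {
    $RUN mkdir -p "$HOME/remote_data"
    $RUN sshfs "$REMOTE:/path/to/data" "$HOME/remote_data"
}

copy_example
sync_example
mount_example
```

Running the sync command a second time without local changes transfers no file contents again, which is what distinguishes it from a plain copy.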


Network Protocols & Transfer Tools

Basic Network Protocol | Used By Network Protocol
ssh                    | scp, sftp, rsync
http(s)                | WebDAV
smb                    | -
NFS                    | -

Every data transfer needs a network protocol for the communication between the systems. The table above shows the basic network protocols and the protocols that build directly on them. These protocols can be used either directly or through tools that bundle the protocol with additional features. Such a tool can be a command-line tool or have a graphical user interface.

A comprehensive overview of all transfer options (network protocols and tools) can be found on the page all transfer routes.
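
To make the protocol table more concrete, the sketch below prints one typical command per basic protocol. It is an illustration only: the host storage.example.org, the account xy1234, the share and export names and all paths are hypothetical, and which of these protocols a given storage system actually offers has to be checked on the page of that system.

```shell
#!/bin/sh
# Print one example command per basic protocol from the table above.
# Host name, user name, share names and paths are hypothetical placeholders.
example_cmd() {
    case "$1" in
        ssh)
            # ssh-based: scp, sftp and rsync all run over an ssh connection.
            echo "scp data.csv xy1234@storage.example.org:/path/to/target/" ;;
        "http(s)")
            # http(s)-based: WebDAV accepts uploads via HTTP PUT (curl -T).
            echo "curl -T data.csv -u xy1234 https://storage.example.org/webdav/target/" ;;
        smb)
            # smb shares are usually mounted (Linux: needs root and cifs-utils).
            echo "mount -t cifs //storage.example.org/share /mnt/share -o username=xy1234" ;;
        NFS)
            # NFS exports are mounted as well (Linux: needs root).
            echo "mount -t nfs storage.example.org:/export /mnt/export" ;;
        *)
            echo "unknown protocol: $1" >&2
            return 1 ;;
    esac
}

for protocol in ssh "http(s)" smb NFS; do
    example_cmd "$protocol"
done
```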

Recommended Setup

The main tools for transferring data from your local machine to a bwHPC cluster are:

  • MobaXterm for Windows. It is a graphical user interface that lets you log in to the cluster via ssh and transfer data with a file browser.
  • sshfs for macOS and Linux. It is the quickest way to mount a folder. If you also want to copy and sync, Rclone is more convenient, because it provides copy, sync and mount functionality for various types of infrastructure.
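
As a rough illustration of how Rclone covers copy, sync and mount with one tool, consider the sketch below. It assumes a remote named mycluster that has already been set up with `rclone config`; that name, the paths and the mount point are hypothetical. The commands are only printed unless the script is run with RUN= (empty).

```shell
#!/bin/sh
# Hypothetical placeholders: the remote name "mycluster" (it would have to
# be created beforehand with `rclone config`), all paths and the mount point.
RCLONE_REMOTE="mycluster"
MOUNTPOINT="$HOME/cluster_data"

# Dry run by default: commands are printed, not executed.
# Run the script with RUN= (empty) to execute them for real.
RUN="${RUN-echo}"

rclone_examples() {
    # Copy: upload a single file into a remote directory.
    $RUN rclone copy big_dataset.tar "$RCLONE_REMOTE:/path/to/target"
    # Sync: make the remote directory identical to the local one
    # (files that exist only on the remote side are deleted there).
    $RUN rclone sync results "$RCLONE_REMOTE:/path/to/target/results"
    # Mount: show a remote directory under a local mount point.
    $RUN rclone mount "$RCLONE_REMOTE:/path/to/data" "$MOUNTPOINT"
}

rclone_examples
```

Because `rclone sync` deletes files on the destination that are missing from the source, it is worth trying it with Rclone's `--dry-run` flag before the first real run.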


Figure 2: bwHPC main transfer routes

For SDS@hd you can find the main access options at the SDS@hd Access page.

Best practices

  • Strong firewall restrictions
    -> Use ssh- or http(s)-based protocols, for example WebDAV and sftp. Very strict facilities might not allow ssh-based protocols at all.
  • Share data with collaborators...
    • ...outside of Baden-Württemberg
      -> Use the SDS@hd storage.
    • ...who are less comfortable with the command line
      -> Let them mount the folder.
  • Transfer many small files
    -> Pack the files into a single archive before transferring.
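
For the last point, one possible way to bundle many small files is a gzip-compressed tar archive. The sketch below is only an example of that approach; the function names pack_dir and unpack_archive and the directory names are made up for illustration.

```shell
#!/bin/sh
# Bundle a directory of many small files into one compressed archive
# before transferring it. Function and directory names are hypothetical.
pack_dir() {
    src_dir="$1"
    archive="$2"
    # -c create, -z gzip-compress, -f archive file; -C switches to the
    # parent directory so the archive stores relative paths only.
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# Unpack the archive on the target system into the given directory.
unpack_archive() {
    archive="$1"
    target_dir="$2"
    mkdir -p "$target_dir"
    tar -xzf "$archive" -C "$target_dir"
}
```

A typical sequence would be `pack_dir measurements measurements.tar.gz`, then transferring the single archive with a copy tool of your choice, and finally `unpack_archive measurements.tar.gz /path/to/target` on the receiving system.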

For advanced topics see advanced-data-transfer.