Python Environments

From bwHPC Wiki
Revision as of 16:58, 4 November 2024 by H Winkhardt (talk | contribs) (Initial page)

Introduction

After installing Python, required packages are typically installed with pip install xyz. Without further setup, these packages end up in the global Python installation of the system. When working on multiple projects, many packages pile up over time, and sooner or later this leads to conflicts.

Say you have pandas 2.2.2 installed for project A. After a while you work on a new project B, which requires you to install auto-sklearn 0.15.0. This package, however, is only compatible with versions of pandas up to pandas 2.0.0. In this case, pip might uninstall the existing version of pandas and install a compatible one, namely pandas 1.5.3. This might break project A, because these versions of pandas are not entirely backwards-compatible.

This is one of the major reasons why it has become common practice to work in virtual environments. Support for them is built into a fresh Python installation. On HPC systems, virtual environments are even more necessary because many users share the system. Allowing every user to install the packages they need into a global Python environment would quickly end in a mess, which is why this is disabled. Instead, users create and configure environments within their own directories.

Pip / Venv

> Venv/Pip is not the recommended way to set up an environment, but might still be necessary in some cases. Skip to Conda if you want to get started in the recommended way.

Pip

Pip and Venv are two modules that usually come with Python by default. Pip is Python's default package manager and downloads packages from PyPI (the Python Package Index).

$ pip install pandas             # Installs the latest compatible version
$ pip install pandas==1.5.3      # Installs exactly version 1.5.3
$ pip install "pandas>=1.5.3"    # Installs version 1.5.3 or newer (quoted so the shell does not treat > as a redirect)

Pip is, however, not a standalone program but is tied to a specific installation of Python. This can lead to confusion, as systems often come with versions of Python preinstalled. When multiple Python installations are present, it is not always clear which one is currently active. The easiest remedy is to launch pip through a specific Python interpreter. This way it is clear which interpreter a package is installed for.

$ which python3                  # returns the path to the currently active Python interpreter
/usr/bin/python3
$ python3 -m pip install pandas  # -m mod : run library module as a script
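
The same question can be answered from inside Python with the standard library's sys module. The following is a minimal sketch; the function name describe_interpreter is only illustrative and not part of any tool mentioned here.

```python
import sys


def describe_interpreter() -> dict:
    """Collect basic facts about the interpreter running this script."""
    return {
        "executable": sys.executable,  # path of the running interpreter
        "prefix": sys.prefix,          # root of the active installation or environment
        "version": ".".join(str(part) for part in sys.version_info[:3]),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Running the script with different interpreters on the same system should report different paths, which makes it easy to see where python3 -m pip would install packages.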

requirements.txt

Because it's cumbersome to install each package of a project manually, a project can provide a file that lists its dependencies, typically named requirements.txt. It lists one package per line, optionally with a specific version.

pexpect
requests==2.32.3

This can then be used with

$ python3 -m pip install -r requirements.txt

Oftentimes, a project will accumulate packages over time. The following command exports all the currently installed package names and versions to a new requirements.txt which can be used later to rebuild the exact environment.

$ python3 -m pip freeze > requirements.txt
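
For illustration, a similar name==version listing can be produced from inside Python with the standard library's importlib.metadata. This is a rough sketch and not how pip freeze itself is implemented; its output can differ from pip's, for example for editable or URL-based installs. The function name installed_requirements is hypothetical.

```python
from importlib import metadata


def installed_requirements() -> list[str]:
    """Return sorted 'name==version' lines for the current environment."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken or missing metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    print("\n".join(installed_requirements()))
```

Because the listing reflects whichever interpreter runs the script, it is another way to see that installed packages belong to one specific Python installation or environment.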

Pip on its own is not capable of separating package installations for specific environments. For this, Venv is required.

Venv

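Venv creates a lightweight virtual environment in a local folder, e.g. a project folder. The environment has its own interpreter entry point and its own package directory, but is based on the Python installation that was used to create it.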

$ python3 -m venv <path_to_environment>
$ python3 -m venv .venv

Run from the project directory, this creates a .venv directory with a new interpreter inside. It can be activated with

$ source .venv/bin/activate

When the virtual environment is successfully activated, the terminal prompt reflects this:

(.venv) $
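
The prompt is only a visual hint. A Python script can check for itself: inside an environment created by venv, sys.prefix points to the environment, while sys.base_prefix points to the installation it was created from. A minimal sketch with an illustrative function name:

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
    print("environment prefix:", sys.prefix)
    print("base installation: ", sys.base_prefix)
```

This check targets environments created with venv; environments from other tools may not be detected this way.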

Using which also shows the path of the new interpreter:

(.venv) $ which python
</path/to/project>/.venv/bin/python

With this, we can then use Pip to install packages to the new environment:

(.venv) $ python -m pip install pandas
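
The creation step can also be scripted with the standard library's venv module, the same module that python3 -m venv invokes. The sketch below assumes a throwaway temporary directory and passes with_pip=False so that it runs quickly without bootstrapping pip; a real project environment would normally keep pip. The function name create_environment is illustrative.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(env_dir: Path) -> Path:
    """Create a virtual environment in env_dir and return the path of its config file."""
    venv.create(env_dir, with_pip=False)  # with_pip=False: skip installing pip (demo only)
    return env_dir / "pyvenv.cfg"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_environment(Path(tmp) / ".venv")
        print(cfg.read_text())  # records the base installation the environment is tied to
```

The printed pyvenv.cfg names the Python installation the environment was created from, which shows that the environment still depends on that installation.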

Limitations of Pip / Venv

While the combination of Pip and Venv is common for projects run on single-user systems, it can cause problems on HPC systems. The largest is that this solution always depends on a centrally installed version of Python. There are, however, too many Python versions to cater to the demands of the many users of an HPC system. Since users are not permitted to install new global versions of Python, a more sustainable approach is to let users install Python versions themselves within their home directory. This is why **Conda** is currently considered the go-to solution.

Conda

In the context of Python, Conda combines a package manager, a virtual environment creator, and a manager for Python versions. Instead of depending on an already available Python interpreter, a Conda environment is created with an explicit Python version, which is downloaded if it is not already available. This allows users to install their own versions as needed. On top of that, Conda comes with its own package repository solution.

Detailed instructions for Conda Environments can be found under Development/Conda.


Poetry

TBA