
Using Dac-Man on HPC Clusters

This section details the steps needed to run Dac-Man on HPC clusters. The instructions are based on our experience running Dac-Man on the NERSC Cori system, but they should also apply to other HPC systems.

Requirements

The mpi4py package is required to run Dac-Man with MPI. Steps to install Dac-Man's MPI dependencies are given below.

Installing dependencies for running Dac-Man with MPI

The mpi4py package is required to use Dac-Man with MPI.

If you have not already installed all of Dac-Man's optional dependencies, run this command from the root directory of your local copy of the Dac-Man repository:

conda env update --name dacman-env --file dependencies/conda/mpi.yml

Important

Different computing environments may have specific requirements for interfacing user applications with MPI, e.g. using custom versions of MPI libraries. This is especially true for HPC systems. In this case, refer to the system's documentation to find out how to enable MPI.
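
For example, a quick way to verify that mpi4py is correctly built against the system's MPI libraries is to run a minimal check under the system's MPI launcher. This is only a sketch: the launcher name (srun below) and the number of ranks are assumptions that vary between systems.

# minimal mpi4py sanity check: each rank should print its own rank number
srun -n 4 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), 'of', MPI.COMM_WORLD.Get_size())"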

Using MPI

Dac-Man allows you to parallelize two steps of the change analysis:

  • index
  • diff with the --datachange option

To parallelize on HPC clusters, enable MPI support by using the appropriate flags:

dacman index ... -m mpi
dacman diff ... -e mpi --datachange

To distribute tasks to multiple workers, use the MPI executable appropriate for the system in use, e.g. srun, mpiexec, or mpirun.

For example, to run Dac-Man on an HPC cluster with 8 nodes and 32 cores per node, where the MPI executable is srun, do the following:

srun -n 256 dacman index ... -m mpi
srun -n 256 dacman diff ... -e mpi --datachange

Batch Script

To submit a batch job to a cluster, include the Dac-Man command in your job script. The example below shows a batch script (hpcEx.batch) for the Slurm scheduler.

#!/bin/bash

#SBATCH -J example
#SBATCH -t 00:30:00
#SBATCH -N 8
#SBATCH -q myqueue

srun -n 256 dacman diff /old/data /new/data -e mpi --datachange

The script can then be submitted to the batch scheduler as:

sbatch hpcEx.batch
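
After submission, the job can be monitored with standard Slurm commands. As a rough guide (output file names may differ if configured in the batch script), Slurm writes the job's output to a slurm-<jobid>.out file in the submission directory:

squeue -u $USER         # check the state of your queued and running jobs
sacct -j <jobid>        # show accounting information once the job has finished
cat slurm-<jobid>.out   # inspect the job's output (default Slurm output file)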

Using Dac-Man on NERSC

This section explains how to run Dac-Man at scale using Cori at NERSC.

Installation

Enabling Conda

The Conda package manager is preinstalled at NERSC. To enable it on one of Cori's login nodes, load the Anaconda Python module:

module load python

Installing Dac-Man

Then, install Dac-Man with Conda following the same steps illustrated in the Installing Dac-Man section, including the installation of additional plug-in dependencies as needed:

git clone https://github.com/deduce-dev/dac-man
cd dac-man
conda env create --file environment.yml
# optional - install dependencies for included plug-ins
conda env update --name dacman-env --file dependencies/conda/builtin-plugins.yml

Enable MPI

On systems such as Cori, where the mpi4py packages installed via Pip or Conda are incompatible with the system's MPI libraries, build the mpi4py package manually.

First, activate the Dac-Man Conda environment:

conda activate dacman-env

Next, follow the steps described in the NERSC documentation to download, build, and install the mpi4py package.
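
As an illustration only, a manual mpi4py build against the Cray compiler wrappers has typically looked like the sketch below. The module names and flags are assumptions that change over time, so always defer to the current NERSC documentation.

# sketch of a manual mpi4py build against the Cray compiler wrappers (assumed module names)
module swap PrgEnv-intel PrgEnv-gnu
MPICC="cc -shared" pip install --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py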

Running a test job

After the installation is finished, test Dac-Man on one of Cori's compute nodes using an interactive session.

To access a node in an interactive session, run:

salloc -N 1 -C haswell -q interactive -t 30:00

Finally, once the interactive session on the compute node starts, invoke dacman on any two test data directories. On Cori, the MPI executable is srun:

srun -n 32 dacman diff <dir-1> <dir-2> -e mpi --datachange

Important

Make sure that you run this command on a Cori compute node, as opposed to a login node. Running MPI on login nodes is discouraged at NERSC, and the above command might fail if run on a login node.
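
A quick, rough check is to inspect the hostname: on Cori, compute node hostnames typically start with nid, while login nodes are named cori followed by a number. Naming conventions may change, so treat this only as a hint.

hostname   # e.g. nid00042 on a compute node, cori07 on a login node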