DGX Quantum Installation Guide¶
This page describes the installation procedure for a DGX Quantum (DGX-Q) server, including connectivity to the OPX1000, configuration, and initialization.
Components¶
DGX Quantum Physical Components¶
- GH200: The Grace Hopper Superchip-based high-performance computer driving the classical computation. Also referred to as the "DGX-Q server"
- OPX1000: Ultra-low-latency quantum control and readout controller
- OPNIC: OP Network Interface Card, installed in a GH200 PCIe slot
DGX Quantum Software & Firmware Components¶
The DGX-Q software on the server consists of three open-source components that must be installed on the server:
- OPNIC Driver: A kernel driver for the OPNIC PCIe card
- OPNIC SDK: A shared library used by the user's application on the server
- OPNIC CLI tool: A command-line interface for managing the OPNIC (for example, performing a one-time sync with the QOP, updating the card firmware, etc.)
OPNIC Installation in the DGX-Q Server¶
If the system was previously configured, you can skip directly to Step 4, which can also be used on its own to update the OPNIC firmware.
Step 1: OPNIC Mechanical Assembly
- Follow the mechanical assembly manual OPNIC Assembly Guide
Step 2: DGX-Q Connection Schema
The DGX-Q system requires an Ethernet connection between the OPX1000 chassis and the server and an optical connection between the OPNIC and the OPX1000 chassis. Please follow these guidelines:
- Make sure slot 1 is populated by an FEM or contact Quantum Machines support for an alternative connectivity configuration.
- Connect the 2 QSFP-MPO adapters to the relevant ports in the OPNIC.
- Connect the MPO optical cables from the OPNIC to the OPX1000 according to the diagram.
- Make sure to connect Comm 2 to the correct OPNIC port (orientation according to the sketch).
- Make sure both MPO optical cables are identical and of the same length.

Note

The sketch illustrates the connection to a Rev. C chassis.
If the OPNIC is connected to a Rev. B chassis, use the provided adapter kit, or use patched MPO optical cables, which are sometimes pink.

- Network Communication - Ethernet connections should be based on the specific site/IT connectivity guidelines. Make sure you can ping the OPX1000 from the server; the easiest way is to ensure they are on the same subnet. Alternatively, routing can be defined - contact your IT department for support. A quick connectivity check is sketched after this list.
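A minimal connectivity check from the server, assuming the OPX1000 answers ICMP pings and using 192.168.1.10 as a placeholder for its actual IP address (replace it with the address used at your site):

# show the server's network interfaces and their subnets
ip -brief addr
# replace 192.168.1.10 with the actual OPX1000 IP address
ping -c 4 192.168.1.10
# if the ping fails across subnets, inspect the routing table
ip route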
Step 3: Software Configuration
- Copy the OPNIC software package provided by Quantum Machines onto the server.
- Add execute permissions to the provided installers.
- Install the driver.
- Install the SDK.
- Verify the installation of the opnic libraries: check that the files libopnic.so and libopnic-cuda.so are present (see the sketch after this list).
- Install the CLI.
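The exact installer commands depend on the package contents delivered by Quantum Machines. A minimal sketch of the surrounding steps - granting execute permissions and verifying that the shared libraries are in place - assuming the package was copied to the hypothetical path ~/opnic-package:

# grant execute permissions to the provided installer scripts (hypothetical path)
chmod +x ~/opnic-package/*.sh
# after running the installers, confirm the shared libraries are registered
ldconfig -p | grep -i opnic
# or locate the files directly (the install prefix may differ on your system)
find /usr/lib /usr/local/lib -name "libopnic*.so*" 2>/dev/null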
Step 4: OPNIC Firmware Update
The OPNIC firmware consists of two separate images which can be updated using the OPNIC CLI tool:
- FPGA Image: The bitfile that is loaded into the OPNIC FPGA. This image is responsible for the PCIe interface and the communication with the OPX.
- PLL configuration: The OPNIC clock configuration, which rarely needs updating.
1. Check the currently installed versions using the OPNIC CLI tool.
2. Validate that the output indeed shows the latest FPGA and PLL images.
3. If any of the versions is wrong, flash the latest image.
4. Once the flash has ended, reset the card.
5. Restart the server (a post-flash sanity check is sketched after this list).
6. Repeat the validation - see point 2 above.
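The flash, reset, and version commands are provided with the OPNIC CLI tool and are not reproduced here. As a generic post-flash sanity check, you can confirm that the card re-enumerates on the PCIe bus after the reboot; a sketch, assuming you know the OPNIC's PCIe bus address (0000:01:00.0 below is a placeholder):

# list PCIe devices; the OPNIC should appear with the same bus address as before the flash
lspci
# check that the PCIe link is up and that the OPNIC driver is bound
# (replace 0000:01:00.0 with the OPNIC's actual bus address)
sudo lspci -vv -s 0000:01:00.0 | grep -iE "LnkSta|Kernel driver"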
Appendix 1: Server Installation¶
DGX-Q server minimal configuration:
- Ubuntu 22.04.5
- GCC 13
- CMake ≥ 3.25.5
- make
- NVIDIA optimized Ubuntu kernel: linux-nvidia-64k-hwe-22.04
- CUDA toolkit 12.8 (cuda-toolkit-12-8)
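A quick sketch for checking these prerequisites on an already-provisioned server (standard commands only; nvcc is found only after the CUDA toolkit binaries are on the PATH, see step 7 below):

# OS release and kernel variant
lsb_release -a
uname -r
# toolchain versions
gcc --version
cmake --version
make --version
# NVIDIA driver and CUDA toolkit
nvidia-smi
nvcc --version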
Recommended Installation Steps¶
Note that these steps are for a specific GH200 server by QCT; exact details may vary based on the server model and configuration.
1. Preparations
- Connect the GH200 with two power cables.
- Connect an Ethernet cable to the Baseboard Management Controller (BMC) panel. This is next to the front power button.
- Connect an Ethernet cable to a free Ethernet port; this should be on the same subnet as the OPX.
- Connect a screen and keyboard to find the BMC IP through the BIOS.
2. Firmware updates
Update the server's firmware; full steps (for a specific update) can be found in this guide.
3. Linux installation
Download and install Ubuntu 22.04.5 ARM64 Server (link):
- Load it onto a disk-on-key as a bootable drive.
- On macOS (replace # below with the disk number, found with diskutil list):
diskutil unmountDisk /dev/disk#
sudo dd if=~/Downloads/ubuntu-22.04.5-live-server-arm64.iso of=/dev/rdisk# bs=4m
- Insert the disk-on-key into the BMC panel USB port and power on the GH200.
- The installation will start automatically; follow the instructions on the screen.
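If you are preparing the bootable disk-on-key from a Linux machine instead of macOS, a similar approach works. A sketch, assuming the USB drive appears as /dev/sdX (check with lsblk and replace X accordingly; writing to the wrong device destroys its data):

# identify the USB drive (use the whole device, e.g. /dev/sdX, not a partition)
lsblk
# unmount any mounted partitions of the drive
sudo umount /dev/sdX?* 2>/dev/null || true
# write the installer image and flush buffers before removing the drive
sudo dd if=~/Downloads/ubuntu-22.04.5-live-server-arm64.iso of=/dev/sdX bs=4M status=progress conv=fsync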
4. Install gcc13
Run the following commands:
sudo apt install software-properties-common -y
sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y
sudo apt update
sudo apt install gcc-13 g++-13 -y
# make gcc13 the default version
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-13 100
# verify
gcc --version
# point cc to the new gcc (-f replaces an existing cc symlink if present)
sudo ln -sf /usr/bin/gcc /usr/bin/cc
Edit the ~/.bashrc file and add the required string at the end, then reload .bashrc and run the configuration (reloading is sketched below).
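A minimal sketch for reloading the shell configuration and confirming that GCC 13 is now the default compiler (the exact line added to ~/.bashrc depends on your setup and is not shown here):

# apply the updated ~/.bashrc to the current shell
source ~/.bashrc
# confirm that gcc and cc both report version 13
gcc --version
cc --version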
5. Install cmake
Run the following commands:
# download cmake installer
wget https://github.com/Kitware/CMake/releases/download/v3.31.6/cmake-3.31.6-linux-aarch64.sh
# grant execution permission
sudo chmod +x cmake-3.31.6-linux-aarch64.sh
# run it. agree to the license and type 'Y' when it asks if you want to install it in the default folder
./cmake-3.31.6-linux-aarch64.sh
# move it to /opt
sudo mv cmake-3.31.6-linux-aarch64/ /opt/cmake-3.31.6
# add symbolic links in /usr/local/bin to point to the cmake you just installed
sudo ln -s /opt/cmake-3.31.6/bin/ccmake /usr/local/bin/ccmake
sudo ln -s /opt/cmake-3.31.6/bin/cmake /usr/local/bin/cmake
sudo ln -s /opt/cmake-3.31.6/bin/cmake-gui /usr/local/bin/cmake-gui
sudo ln -s /opt/cmake-3.31.6/bin/cpack /usr/local/bin/cpack
sudo ln -s /opt/cmake-3.31.6/bin/ctest /usr/local/bin/ctest
# test
cmake --version
6. Install Ninja
Run the following command:
sudo apt install ninja-build -y
7. Install and update Nvidia driver
Run the following commands to update the system and install the NVIDIA optimized Ubuntu kernel variant and reboot:
sudo DEBIAN_FRONTEND=noninteractive apt purge linux-image-$(uname -r) linux-headers-$(uname -r) linux-modules-$(uname -r) -y
sudo apt update
sudo apt install linux-nvidia-64k-hwe-22.04 -y
sudo reboot now
Update the NVIDIA driver and install the CUDA toolkit:
sudo apt-get install linux-headers-$(uname -r)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring*.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-8 -y
sudo apt-get install nvidia-kernel-open-535 cuda-drivers-535 -y
sudo reboot
Check the installation, verify that the CPU and GPU memory subsystems are up and functional, and make nvcc available on the PATH, as sketched below.
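A minimal sketch of these checks, assuming the CUDA toolkit was installed to /usr/local/cuda-12.8 (the default location for the cuda-toolkit-12-8 package) and that numactl is available (sudo apt install numactl -y):

# confirm the NVIDIA driver sees the GH200 GPU
nvidia-smi
# inspect the NUMA topology; on GH200 the CPU (LPDDR) and GPU (HBM) memory
# should each appear as a NUMA node with a non-zero size
numactl -H
# make nvcc available by adding the CUDA toolkit to PATH and the library path
echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version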
8. Validate correct gcc version
This step ensures the correct GCC version is used, as certain installations may inadvertently trigger a rollback to an earlier version. A validation sketch follows.
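A minimal sketch, re-using the update-alternatives configuration from step 4 and assuming GCC 13 is installed as /usr/bin/gcc-13:

# check which gcc is currently the default
gcc --version
# if an earlier version is reported, re-select gcc-13
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-13 100
sudo update-alternatives --config gcc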