Distributed Neural Computation on a GPU Cluster
The compatibility of GIAANN Biological Artificial Intelligence with frontier‑scale GPU superclusters
Summary
This post describes a distributed GIAANN training and inference implementation on a large peer-to-peer GPU cluster. The design assumes 1M+ concept columns at ~1 GB to ~1 TB each, with most nodes dedicated to storing concept columns in RAM (not used for primary computation), and a smaller batch B of nodes dedicated to primary compute for training and inference.
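As a rough sketch of the node roles assumed above, the following hypothetical configuration (NodeRole, ClusterConfig, and assignRoles are illustrative names, not part of GIAANNpy) partitions the cluster into a small compute batch B and a storage-heavy majority holding concept columns in RAM:

```python
from dataclasses import dataclass
from enum import Enum, auto

class NodeRole(Enum):
    STORAGE = auto()  # holds concept columns in RAM and serves them to compute nodes
    COMPUTE = auto()  # runs per-sequence training/inference on GPU

@dataclass
class ClusterConfig:
    numNodes: int
    computeBatchSizeB: int   # nodes reserved for primary compute (batch B)
    numConceptColumns: int   # e.g. 1M+ concept columns
    avgColumnSizeBytes: int  # ~1 GB to ~1 TB per column

    def assignRoles(self):
        # First B nodes do primary compute; the remainder store concept columns.
        return {nodeId: NodeRole.COMPUTE if nodeId < self.computeBatchSizeB else NodeRole.STORAGE
                for nodeId in range(self.numNodes)}

config = ClusterConfig(numNodes=1024, computeBatchSizeB=32,
                       numConceptColumns=1_000_000, avgColumnSizeBytes=10**9)
roles = config.assignRoles()  # {0: COMPUTE, ..., 1023: STORAGE}
```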
Storage Layout Assumptions
- Most nodes store a subset (1+) of concept columns (in RAM) and primarily serve data to/from computation nodes.
- Concept columns are large (~1 GB to ~1 TB each), so the cluster is storage-heavy relative to compute.
- GIAANNpy currently stores entire concept columns in database/observedColumns files (typically on an Ext4 SSD filesystem). In a distributed implementation, concept column features (including their outgoing connections) may be split into separate RAM regions rather than a single large sparse PyTorch tensor, to speed up inference.
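The per-feature RAM split mentioned above could, for example, look like the following sketch; ConceptColumnStore and its methods are hypothetical (GIAANNpy's actual persistent layout remains the database/observedColumns files):

```python
import torch

class ConceptColumnStore:
    """Hypothetical per-feature split of one concept column held in a storage node's RAM.
    Each feature's outgoing connections live in their own tensor (their own RAM region)
    instead of one large sparse tensor covering the whole column."""
    def __init__(self, numTargets):
        self.numTargets = numTargets
        self.featureConnections = {}  # featureIndex -> 1D tensor of outgoing connection strengths

    def getFeature(self, featureIndex):
        # Serve a single feature (with its outgoing connections) to a compute node.
        return self.featureConnections.get(featureIndex, torch.zeros(self.numTargets))

    def updateFeature(self, featureIndex, delta):
        # Integrate an update received back from a compute node.
        self.featureConnections[featureIndex] = self.getFeature(featureIndex) + delta

store = ConceptColumnStore(numTargets=100_000)
store.updateFeature(42, torch.rand(100_000) * 0.01)
row = store.getFeature(42)
```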
Training Parallelism (Multiple Sequences)
Description
- Each training round (sequence) is assigned to a dedicated computation node.
- Sequences are either randomly selected or preselected to minimize overlap in concept columns, reducing interconnect contention.
- The computation node builds a dense PyTorch subnetwork for the sequence connectivity (see GIAANNproto_databaseNetworkTrainExcitation.py).
- The computation node sends per-sequence concept column data to the storage nodes responsible for those concept columns.
- Transferred data is typically limited to sequence features when trainSequenceObservedColumnsUseSequenceFeaturesOnly=True (a densified view of the sparse columns).
- Storage nodes independently integrate updates into their sparse concept column tensors (see updateObservedColumns() in GIAANNproto_sequenceObservedColumnsExcitation.py).
Diagram (Training)
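A minimal sketch of the per-sequence training flow described above is given below. The function names, the co-activation update rule, and the fetch/send callbacks are assumptions; the real excitation training and column-merge logic lives in GIAANNproto_databaseNetworkTrainExcitation.py and updateObservedColumns() in GIAANNproto_sequenceObservedColumnsExcitation.py.

```python
import torch

def trainSequenceOnComputeNode(sequenceColumnIds, fetchColumnFeatures, sendColumnUpdate, learningRate=1e-3):
    # 1. Build a dense subnetwork covering only the concept columns that appear in this sequence;
    #    each row is the densified sequence-feature view of one column served by a storage node.
    dense = torch.stack([fetchColumnFeatures(cid) for cid in sequenceColumnIds])  # (S, D)

    # 2. Compute per-column deltas; this simple co-activation rule is a stand-in for the
    #    actual excitation-based training update.
    coActivation = dense.sum(dim=0, keepdim=True)   # (1, D)
    updates = learningRate * dense * coActivation   # (S, D)

    # 3. Ship each column's per-sequence delta to the storage node responsible for that column.
    for row, cid in enumerate(sequenceColumnIds):
        sendColumnUpdate(cid, updates[row])

def integrateUpdateOnStorageNode(sparseColumn, delta):
    # Storage nodes merge the dense per-sequence delta back into their sparse column tensor.
    return (sparseColumn + delta.to_sparse()).coalesce()
```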
Inference Parallelism (Multiple Sequences)
Description
- Each sequence corresponds to a separate user prompt/prediction (inference for a single sequence is not parallelized; it executes token by token).
- Each computation node in batch B handles one prompt+prediction sequence.
- Each computation node keeps its own globalFeatureNeurons tensor in GPU memory (very large, 100 GB+ in production).
- For each token (inference round):
  - The computation node requests the full concept column data for the token (concept column feature), including outgoing connections.
  - The computation node computes target activations.
  - The computation node updates its local globalFeatureNeurons tensor.
  - The process repeats for the next token in the seed/prediction sequence.
Diagram (Inference)
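The per-token inference round on a single compute node might look like the following sketch; requestConceptColumn() stands in for the network call to the storage node owning the token's concept column, and the ReLU/decay activation rule is purely illustrative.

```python
import torch

def inferSequence(tokenConceptIds, requestConceptColumn, globalFeatureNeurons, decay=0.9):
    predictions = []
    for conceptId in tokenConceptIds:
        # 1. Request the full concept column (feature + outgoing connections) for this token
        #    from the storage node that owns it.
        columnConnections = requestConceptColumn(conceptId)  # dense (D,) tensor on the GPU

        # 2. Compute target activations from the column's outgoing connections.
        targetActivations = torch.relu(columnConnections)

        # 3. Update the node-local globalFeatureNeurons tensor held in GPU memory.
        globalFeatureNeurons.mul_(decay).add_(targetActivations)

        # 4. Take the most active global feature as this round's prediction.
        predictions.append(int(torch.argmax(globalFeatureNeurons)))
    return predictions
```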
Notes
- The training batch size B is bounded by interconnect bandwidth and storage node parallelism.
- Sequence selection that minimizes column overlap reduces contention and improves throughput.
- Splitting concept column features into multiple RAM regions could reduce sparse tensor overhead for inference.
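One way to realise the overlap-minimising sequence selection mentioned in the notes is a simple greedy heuristic; selectBatch and its arguments are hypothetical:

```python
def selectBatch(candidateSequences, columnsOfSequence, batchSizeB):
    """Greedily pick up to B sequences whose concept column sets overlap as little as possible."""
    selected, usedColumns = [], set()
    remaining = list(candidateSequences)
    while remaining and len(selected) < batchSizeB:
        # Pick the sequence sharing the fewest concept columns with those already selected.
        best = min(remaining, key=lambda s: len(columnsOfSequence[s] & usedColumns))
        selected.append(best)
        usedColumns |= columnsOfSequence[best]
        remaining.remove(best)
    return selected

batch = selectBatch(
    candidateSequences=[0, 1, 2, 3],
    columnsOfSequence={0: {1, 2}, 1: {2, 3}, 2: {7, 8}, 3: {1, 9}},
    batchSizeB=2)  # -> [0, 2]
```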