Type: New Feature
Priority: Not specified
Affects Version/s: None
Fix Version/s: None
Deliverables and scope of work for this internship (2 weeks each):
- Get familiar with GNN bundle
a. Get the bundle running on single-node and multi-node machines
b. Train a benchmark NN using GNN on single-node and multi-node clusters. Document the performance difference between single-node and multi-node training
- Get GPU-enabled AWS instances running with the GNN bundle. Investigate and develop whatever is required to get GNN running on GPU machines
a. Model parallelism across one or more GPUs (on a single node with multiple GPUs).
- R&D of multi-node (Thor nodes or otherwise) GPU training on a single physical computer with multiple GPUs, where each node is assigned a single GPU. This is an important building block for the next steps.
- R&D of multi-node GPU training across multiple physical computers. The result will be a flexible multi-node GPU scheme that works both on an individual physical computer and across many GPUs in a large cluster of GPU machines.
- Perform case study of the work to evaluate, validate, and benchmark the performance of the work.
- Update and create documentation of the work, including annotated example code and the system setup and requirements (this is a non-standard HPCC Systems configuration, i.e. one with fewer but more computationally powerful nodes).
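As a rough illustration of the one-GPU-per-node layout described in the list above, the sketch below maps a node number to a machine and a local GPU. It assumes nodes are numbered from zero, numbered contiguously per machine, and that each machine runs exactly one node per GPU. The function names `assign_gpu` and `pin_process_to_gpu` are hypothetical and not part of the GNN bundle. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for restricting which GPUs a process can see; it has to be set before the ML framework initializes CUDA.

```python
import os
from typing import Tuple


def assign_gpu(node_id: int, gpus_per_machine: int) -> Tuple[int, int]:
    """Map a zero-based node id to (machine index, local GPU index).

    Assumption: nodes are numbered contiguously per machine and each
    machine runs exactly one node per GPU.
    """
    if node_id < 0:
        raise ValueError("node_id must be non-negative")
    if gpus_per_machine < 1:
        raise ValueError("gpus_per_machine must be at least 1")
    return divmod(node_id, gpus_per_machine)


def pin_process_to_gpu(node_id: int, gpus_per_machine: int) -> int:
    """Restrict the current process to the GPU assigned to this node.

    Hypothetical helper: call it before the ML framework is imported,
    otherwise the setting may have no effect.
    """
    _, gpu = assign_gpu(node_id, gpus_per_machine)
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return gpu
```

With 4 GPUs per machine, node 5 would land on the second machine (index 1) and use local GPU 1. Whether the node numbering actually follows this layout depends on how the cluster is configured and would need to be confirmed as part of the R&D.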
1. R&D of generative adversarial networks (GANs) using GPU-accelerated training
2. R&D of asynchronous training across multiple GPUs. This requires logic to run only on nodes that are connected by a low-latency, high-bandwidth link, i.e. between multiple GPUs on a single machine.
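One possible starting point for the "only run on well-connected nodes" logic is to group nodes by the physical host they run on and treat only groups with more than one node as candidates for asynchronous training. The sketch below assumes a mapping from node id to hostname is available from somewhere. The function name `async_groups` and the input format are hypothetical, and sharing a host is used here only as a stand-in for "low latency and high bandwidth".

```python
from collections import defaultdict
from typing import Dict, List


def async_groups(node_hosts: Dict[int, str]) -> Dict[str, List[int]]:
    """Group node ids by host and keep only hosts with 2+ nodes.

    Assumption: nodes on the same host share a fast interconnect, so
    only those groups are eligible for asynchronous training.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for node_id in sorted(node_hosts):
        groups[node_hosts[node_id]].append(node_id)
    # Single-node hosts are dropped: they have no local peer to sync with.
    return {host: ids for host, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    layout = {0: "gpu-box-a", 1: "gpu-box-a", 2: "gpu-box-b"}
    print(async_groups(layout))
```

In this example only the two nodes on `gpu-box-a` would form an asynchronous group, and the lone node on `gpu-box-b` would be excluded.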