  Machine Learning Library / ML-459

Develop a multi-node, multi-GPU accelerated deep learning algorithm and runtime on the HPCC Systems Platform using the existing Generalized Neural Network (GNN) bundle

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Not specified
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Deliverables and scope of work for this internship (2 weeks each):

      1. Get familiar with the GNN bundle.
        a. Get the bundle running on single-node and multi-node machines.
        b. Train a benchmark NN using GNN on single-node and multi-node clusters. Document the performance difference between single-node and multi-node training.
      2. Get GPU-enabled AWS instances running with the GNN bundle. Investigate and develop whatever is required to get GNN running on GPU machines.
        a. Model parallelism across one or more GPUs (on a single node with multiple GPUs).
      3. R&D of multi-node (Thor nodes or otherwise) GPU training. Each node will use a single GPU, with all nodes on one physical computer that has multiple GPUs. This is an important building block for the next steps.
      4. R&D of multi-node GPU training across multiple physical computers. The result will be a flexible multi-node GPU scheme that works both on an individual physical computer and across many GPUs in a large cluster of GPU machines.
      5. Perform a case study to evaluate, validate, and benchmark the performance of the work.
      6. Update and create documentation of the work, including annotated example code and system setup and requirements (this is a non-standard HPCC Systems configuration, i.e. one with fewer but more computationally powerful nodes).
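
As an illustration of deliverables 3 and 4, the following is a minimal sketch of how worker nodes might be mapped one-to-one onto GPUs, first on one machine and then across several. The function names `assign_gpus` and `pin_process_to_gpu` and the host names are hypothetical and are not part of the GNN bundle; the only external convention relied on is the `CUDA_VISIBLE_DEVICES` environment variable, and the sketch assumes every host has the same number of GPUs.

```python
import os
from typing import Dict, List, Tuple


def assign_gpus(hosts: List[str], gpus_per_host: int) -> Dict[int, Tuple[str, int]]:
    """Map each worker node index to a (host, local GPU index) pair.

    Hypothetical helper: workers are packed onto hosts in order,
    one GPU per worker, assuming identical GPU counts on every host.
    """
    if gpus_per_host < 1:
        raise ValueError("gpus_per_host must be at least 1")
    mapping = {}
    for node_id in range(len(hosts) * gpus_per_host):
        host = hosts[node_id // gpus_per_host]   # which physical computer
        local_gpu = node_id % gpus_per_host      # which GPU on that computer
        mapping[node_id] = (host, local_gpu)
    return mapping


def pin_process_to_gpu(local_gpu: int) -> None:
    """Restrict the current process to one GPU.

    This has to run before the deep learning framework initializes CUDA.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_gpu)


if __name__ == "__main__":
    # One machine with 4 GPUs (deliverable 3).
    print(assign_gpus(["host-a"], 4))
    # Two such machines (deliverable 4); host names are placeholders.
    print(assign_gpus(["host-a", "host-b"], 4))
```

The same mapping covers both cases, which is one way to read the "flexible multi-node GPU scheme" in deliverable 4: a single machine is just the one-host special case.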

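For the performance comparisons in deliverables 1b and 5, one common way to summarize results is speedup and parallel efficiency relative to a single-worker run. The sketch below assumes wall-clock training time is the measured quantity; `scaling_report` is a hypothetical helper, and the ticket does not prescribe any particular metric.

```python
from typing import Dict


def scaling_report(train_seconds: Dict[int, float]) -> Dict[int, Dict[str, float]]:
    """Compute speedup and parallel efficiency against the 1-worker baseline.

    train_seconds maps worker count (nodes or GPUs) to wall-clock
    training time in seconds for the same model and dataset.
    """
    if 1 not in train_seconds:
        raise ValueError("a single-worker baseline (key 1) is required")
    baseline = train_seconds[1]
    report = {}
    for workers, seconds in sorted(train_seconds.items()):
        speedup = baseline / seconds
        report[workers] = {
            "speedup": speedup,
            "efficiency": speedup / workers,  # 1.0 would be ideal linear scaling
        }
    return report


if __name__ == "__main__":
    # Placeholder timings for illustration only, not measured results.
    for workers, row in scaling_report({1: 100.0, 2: 60.0, 4: 40.0}).items():
        print(workers, row)
```
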
      Wish list:
      1. R&D of generative adversarial networks (GANs) using GPU-accelerated training.
      2. R&D of asynchronous training across multiple GPUs. This requires logic to run only on nodes that share a low-latency, high-bandwidth connection, i.e. between multiple GPUs on a single machine.
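
The restriction in wish-list item 2 could be sketched as a grouping step: only workers that share a physical machine are eligible to train asynchronously with each other. The helper `async_groups` and the `(worker_id, host)` input shape are assumptions made for illustration; "same host" is used here as a stand-in for "low-latency, high-bandwidth connection", following the ticket's own wording.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def async_groups(workers: List[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Group worker ids by host and keep only hosts with two or more workers.

    Workers alone on their host have no same-machine peer, so they are
    left out of asynchronous training in this sketch.
    """
    by_host = defaultdict(list)
    for worker_id, host in workers:
        by_host[host].append(worker_id)
    return {host: ids for host, ids in by_host.items() if len(ids) > 1}


if __name__ == "__main__":
    # Placeholder layout: two GPUs on host-a, one on host-b.
    print(async_groups([(0, "host-a"), (1, "host-a"), (2, "host-b")]))
```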

            People

            • Assignee: Robert Kennedy (rkennedy)
            • Reporter: Lorraine Chapman (lorraineachapman)
