Benchmark Study by Imagga/L3C

Training Big on Images

Large Model Support for Semantic Segmentation of Cityscape and Waste Images

Download study here

This study benchmarks and compares the IBM Power AC922 server, the NVIDIA DGX Station, and the Amazon Web Services p3.8xlarge instance type for state-of-the-art deep learning training.

320,000+

We considered image classification with a very large number of categories and object recognition, including the PlantSnap plant recognition classifier with over 320K plant species. However, we wanted to use publicly available data sets for this benchmark, and 3D medical image processing had already been extensively benchmarked by other analysts.

Study overview

Semantic segmentation and Large model support (LMS)

Semantic Segmentation

We finally decided to focus this benchmarking study of the IBM Power AC922, the NVIDIA DGX Station, and the Amazon Web Services p3.8xlarge instance on semantic segmentation of (a) cityscape images, using the Cityscape data set, and (b) waste in the wild, using the TACO data set. Both tasks were trained with IBM’s Large Model Support.

Neural Network

For the neural network, we were looking for a hardware-demanding architecture that uses high-resolution input images for training. Gated Shape CNN (G-SCNN), a state-of-the-art CNN architecture and one of the top methods in the Cityscape benchmark, is a perfect candidate for benchmarking performance.

Cityscape Dataset

Semantic understanding of urban street scenes

Dataset overview:

+ Street photos from 50 cities (cityscapes)
+ Several months (spring, summer, fall), daytime
+ Good/medium weather conditions
+ Manually selected frames
+ Large number of dynamic objects
+ Varying scene layout
+ Varying background
+ 5000 annotated images with fine annotations
+ 20000 annotated images with coarse annotations
+ Very challenging data set for semantic segmentation
+ Various applications such as autonomous cars and driving

Semantic Segmentation

+ Semantic image segmentation is one of the most widely studied problems in computer vision and image analysis with applications in autonomous driving, 3D reconstruction, medical imaging, image generation, etc.

+ State-of-the-art approaches for semantic segmentation are predominantly based on Convolutional Neural Networks (CNN).

+ Recently, dramatic improvements in performance and inference speed have been driven by new architectural designs.

G-SCNN

Architecture overview

01

State-of-the-Art

State-of-the-art CNN architecture, achieving an 82.8% IoU score on the Cityscapes dataset

02

DGX/Tesla

Originally trained on an NVIDIA DGX Station 2 with 8 NVIDIA Tesla V100 GPUs

03

Batch

Trained with a batch size of 16, 2 per GPU

04

Resolution

Trained for 175 epochs with a high-resolution input size of 800×800
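
The IoU (intersection over union) score quoted above is computed per class from predicted and ground-truth label maps and then averaged. A minimal sketch in plain Python, assuming flat lists of integer class labels; the function names and the toy labels are illustrative and are not taken from the study or from the G-SCNN code:

```python
def per_class_iou(pred, target, num_classes):
    """Return the IoU for each class (None for classes absent from both maps)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel counts once towards both intersection and union.
            intersection[p] += 1
            union[p] += 1
        else:
            # Mismatch: the pixel belongs to the union of both classes.
            union[p] += 1
            union[t] += 1
    return [i / u if u else None for i, u in zip(intersection, union)]


def mean_iou(pred, target, num_classes):
    """Average the IoU over the classes that actually occur."""
    scores = [s for s in per_class_iou(pred, target, num_classes) if s is not None]
    return sum(scores) / len(scores)


# Toy example: 6 pixels, 3 classes (hypothetical labels, not study data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(per_class_iou(pred, target, 3))
print(mean_iou(pred, target, 3))
```

Real evaluations usually accumulate a confusion matrix over the whole validation set before dividing, which gives the same per-class result as this pixel loop.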

Hardware overview

IBM Power AC922

CPU: 32-Core IBM POWER9 Single Chip Module (SCM)

GPU: 4x 16GB SXM2 NVIDIA Tesla V100 with NVLink, Air Cooled

RAM: 512GB

NVIDIA DGX Station

CPU: 20-core Intel Xeon E5-2698 v4

GPU: 4x 16GB NVIDIA Tesla V100 with NVLink, Water Cooled

RAM: 256GB

AWS p3.8xlarge

CPU: 32-Core Intel Xeon E5-2686 v4

GPU: 4x 16GB NVIDIA Tesla V100 with NVLink

RAM: 244GB

Benchmark Study by Imagga/L3C
2019 Edition

NVLink Comparison

Intel-based NVLink architecture vs Power9-based architecture

NVIDIA DGX Station and AWS p3.8xlarge (left): The NVIDIA Tesla V100 GPUs are each connected with a single NVLink 2.0 brick capable of 50 GB/s of bidirectional bandwidth. The CPU and GPU communicate through PCIe Gen3.

IBM Power AC922 (right): The NVIDIA Tesla V100 GPUs are each connected with three NVLink 2.0 bricks for up to 150 GB/s of bidirectional bandwidth between GPUs. Three NVLink 2.0 bricks also connect each GPU with the IBM Power9 CPU, providing 150 GB/s of bidirectional bandwidth and enabling direct system memory access.
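
To get a feel for what these link speeds mean for moving tensors between CPU and GPU, the peak figures can be turned into transfer times. This is a back-of-the-envelope sketch only. It assumes an x16 PCIe Gen3 link at roughly 16 GB/s per direction (the study states only "PCIe Gen3"), assumes that half of the 150 GB/s bidirectional NVLink figure is available per direction, and uses a hypothetical 1 GB tensor. Sustained throughput in practice is lower than these peaks.

```python
GB = 1e9  # decimal gigabyte, matching the GB/s figures quoted above

# Peak one-direction CPU<->GPU bandwidth in bytes per second.
# ASSUMPTION: x16 PCIe Gen3 at roughly 16 GB/s per direction (not stated in the study).
PCIE_GEN3_X16 = 16 * GB
# ASSUMPTION: half of the 150 GB/s bidirectional NVLink figure per direction.
NVLINK_AC922 = 150 * GB / 2


def transfer_ms(num_bytes, bandwidth_bytes_per_s):
    """Ideal transfer time in milliseconds, ignoring latency and contention."""
    return num_bytes / bandwidth_bytes_per_s * 1e3


tensor_bytes = 1 * GB  # hypothetical 1 GB of activations swapped to host memory
pcie = transfer_ms(tensor_bytes, PCIE_GEN3_X16)
nvlink = transfer_ms(tensor_bytes, NVLINK_AC922)
print(f"PCIe Gen3 x16:  {pcie:.1f} ms")
print(f"NVLink (AC922): {nvlink:.1f} ms")
print(f"ratio: {pcie / nvlink:.1f}x")
```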

Large model support

Beyond GPU Memory

LMS and tensors

LMS allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with out-of-memory errors. LMS manages this oversubscription of GPU memory by temporarily swapping tensors to host memory when they are not needed.
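
The swapping idea can be illustrated with a toy model. The sketch below is not IBM's LMS implementation: it assumes a fixed "GPU" capacity in arbitrary units and a simple least-recently-used eviction rule, and the class and tensor names are hypothetical.

```python
from collections import OrderedDict


class ToySwapPool:
    """Toy model of tensor swapping; NOT IBM's LMS implementation."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # name -> size, ordered by last use
        self.host = {}            # name -> size, tensors swapped out
        self.swap_outs = 0
        self.swap_ins = 0

    def _make_room(self, size):
        if size > self.gpu_capacity:
            raise MemoryError("tensor larger than total GPU memory")
        # Evict least recently used tensors until the new one fits.
        while sum(self.gpu.values()) + size > self.gpu_capacity:
            name, evicted_size = self.gpu.popitem(last=False)
            self.host[name] = evicted_size
            self.swap_outs += 1

    def allocate(self, name, size):
        self._make_room(size)
        self.gpu[name] = size

    def use(self, name):
        if name in self.gpu:
            self.gpu.move_to_end(name)  # mark as most recently used
            return
        size = self.host.pop(name)      # KeyError for an unknown tensor
        self._make_room(size)
        self.gpu[name] = size
        self.swap_ins += 1


pool = ToySwapPool(gpu_capacity=10)
for i in range(4):            # forward pass: four activations of size 4
    pool.allocate(f"act{i}", 4)
for i in reversed(range(4)):  # backward pass touches them in reverse order
    pool.use(f"act{i}")
print(pool.swap_outs, pool.swap_ins)
```

Without swapping, the third allocation would already exceed the capacity of 10 units. With it, the run completes at the cost of extra transfers, and that cost is the LMS overhead.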

LMS and NVLink

IBM POWER Systems servers (Power8 and Power9 cores) with NVLink technology are especially well-suited to LMS because their hardware topology enables fast communication between the CPU and the GPUs. They also include high-speed I/O interfaces like CAPI and PCIe v4.

LMS and models

One or more elements of a deep learning model can lead to GPU memory exhaustion. These include:

+ Model depth and complexity
+ Input data size (e.g. high-resolution images)
+ Batch size
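
As a rough first-order illustration of how two of these factors interact, activation memory grows approximately with batch size × image height × image width. The sketch below compares the target configuration (batch 16 at 800x800) with the reduced one reported in Benchmark 2 as fitting on the four GPUs (batch 8 at 700x700). It is a unitless proxy under that scaling assumption, ignoring weights, optimizer state and framework workspace; the function name is hypothetical.

```python
def relative_activation_footprint(batch_size, height, width):
    """Unitless proxy: activations grow roughly with batch * H * W.

    ASSUMPTION: ignores weights, optimizer state and framework workspace.
    """
    return batch_size * height * width


full = relative_activation_footprint(16, 800, 800)    # target configuration
reduced = relative_activation_footprint(8, 700, 700)  # reduced configuration
print(f"full / reduced = {full / reduced:.2f}")
```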

Overview

Benchmarks

To showcase the benefits of using LMS and to benchmark the performance of IBM Power AC922 vs NVIDIA DGX Station vs AWS p3.8xlarge, the following benchmarks and tests were created:

01

Training time

Comparing the training time for 175 epochs on the Cityscape dataset on the AC922, the DGX and the AWS p3.8xlarge instance type

02

Use of LMS

Showcasing the benefits of Large Model Support by demonstrating “Out of Memory” situations using the G-SCNN architecture and the Cityscape dataset

03

LMS overhead

An overview of the training time overhead when using Large Model Support

04

GPU Profiling

A detailed comparison of the three systems during training using NVIDIA profiling data

Detailed results

Training time

Benchmark 1

Training on the IBM Power AC922 completed first, 3 days and 6 hours earlier than on the NVIDIA DGX Station, with almost no difference in accuracy. Training on the AWS p3.8xlarge finished last, 3 hours later than on the DGX.

Training parameters:

+ Input size: 800x800
+ Batch size: 16
+ Validation batch size: 2
+ Epochs: 175
+ Learning rate: 0.01
+ Learning rate policy: polynomial
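
A polynomial learning rate policy typically decays the rate as base_lr × (1 − progress)^power. The sketch below shows that schedule with the base rate and epoch count from the parameters above. The exponent is not given in the study, so `POWER = 1.0` is an assumption, as is updating once per epoch rather than once per iteration.

```python
def poly_lr(base_lr, epoch, max_epochs, power=1.0):
    """Polynomial decay: base_lr * (1 - epoch / max_epochs) ** power."""
    if not 0 <= epoch <= max_epochs:
        raise ValueError("epoch must lie in [0, max_epochs]")
    return base_lr * (1 - epoch / max_epochs) ** power


BASE_LR = 0.01  # from the training parameters above
EPOCHS = 175    # from the training parameters above
POWER = 1.0     # ASSUMPTION: the exponent is not given in the study

for epoch in (0, 50, 100, 174):
    print(epoch, f"{poly_lr(BASE_LR, epoch, EPOCHS, POWER):.6f}")
```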

Use of LMS

Benchmark 2

Without LMS activated, training with batch size 16 did not fit on the 4x NVIDIA Tesla V100 GPUs on any of the machines, resulting in an “Out of Memory” error.

Some explanations:
1) Semantic segmentation using G-SCNN has high memory requirements due to the large input size of 800x800 and the architecture design.
2) The neural network framework of our choice is the same framework used in the paper: PyTorch. The LMS integration was as easy as adding a single line of code: torch.cuda.set_enabled_lms(True)
3) A batch size of 8 fit on the four GPUs, but only with the input size reduced from 800x800 to 700x700.
4) Charts for epoch times are shown in the next section.
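
The one-line switch from item 2 can be wrapped in a small guard so that the same training script also runs on a PyTorch build without LMS. This is a minimal sketch, assuming `torch.cuda.set_enabled_lms` exists only in LMS-enabled builds; the helper name `enable_lms` is hypothetical.

```python
def enable_lms(cuda_module):
    """Call set_enabled_lms(True) if this build provides it; report success."""
    setter = getattr(cuda_module, "set_enabled_lms", None)
    if setter is None:
        return False  # ASSUMPTION: builds without LMS lack this attribute
    setter(True)
    return True


try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None

if torch is not None:
    print("LMS enabled:", enable_lms(torch.cuda))
```

Calling the helper once before the model and data are moved to the GPU mirrors where the single line from the study would go.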

LMS overhead

Benchmark 4.1

Benchmark 4.2

Benchmark 4.3

IBM Power AC922 shows significantly lower LMS overhead due to the NVLink connectivity between the CPU and the GPU.

Some explanations:
1) To calculate the LMS overhead we used input sizes of 600x600 and 700x700 and a fixed batch size of 8.
2) LMS overhead for 600x600 input size is 106% for the DGX, 105% for the AWS p3.8xlarge and 43% for the AC922.
3) LMS overhead for 700x700 input size is 72.8% for the DGX, 91% for the AWS p3.8xlarge and 30% for the AC922.
4) The AWS p3.8xlarge instance shows similar performance to the DGX due to the almost identical GPU and CPU system architecture.
5) The next slide shows exact epoch times for each machine and training type.
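
The study does not spell out how the overhead percentage is derived. A natural reading, assumed here, is the relative increase in epoch time with LMS over the same configuration without LMS. The sketch below applies that definition and converts the reported percentages into slowdown factors; the 100 s and 143 s epoch times are hypothetical and only exercise the formula.

```python
def lms_overhead_percent(epoch_time_without_lms, epoch_time_with_lms):
    """ASSUMED definition: relative epoch-time increase caused by LMS, in percent."""
    delta = epoch_time_with_lms - epoch_time_without_lms
    return delta / epoch_time_without_lms * 100.0


def slowdown_factor(overhead_percent):
    """An overhead of X percent means an epoch takes (1 + X/100) times as long."""
    return 1.0 + overhead_percent / 100.0


# Overheads reported above (percent), keyed by input size.
reported = {
    "600x600": {"DGX": 106.0, "AWS p3.8xlarge": 105.0, "AC922": 43.0},
    "700x700": {"DGX": 72.8, "AWS p3.8xlarge": 91.0, "AC922": 30.0},
}

for size, machines in reported.items():
    for name, pct in machines.items():
        print(f"{size} {name}: epoch takes {slowdown_factor(pct):.2f}x the non-LMS time")

# Hypothetical epoch times in seconds, only to exercise the formula.
print(lms_overhead_percent(100.0, 143.0))
```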

GPU profiling

Benchmark 5.1

Benchmark 5.2

The gaps in GPU utilisation on the AC922 machine are drastically smaller than on both the DGX and AWS, leading to higher utilisation and faster training.

Some explanations:

1) To investigate further where the difference in the numbers between the machines comes from, we used nvprof to profile the GPU activity during epoch 2, between the 40th and 60th iterations.

2) nvprof shows that the memory copies between the CPU and GPU for LMS tensor swapping take considerably longer on the NVIDIA DGX Station and the AWS p3.8xlarge instance type than on the IBM Power AC922, which leaves the GPUs idle.

3) The next graphic shows GPU usage on the 50th iteration on the three machines. The blue lines mark the relative locations of equivalent tensors on the machines.
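
The link between idle gaps and utilisation can be made concrete with a small calculation over a profiler timeline. The sketch below assumes the kernel activity has been exported as (start, end) intervals in milliseconds. The interval values are hypothetical, not measured data from the study, and the function names are illustrative.

```python
def merge_intervals(intervals, window_start, window_end):
    """Clip intervals to the window and merge any that overlap."""
    merged = []
    for start, end in sorted(intervals):
        start, end = max(start, window_start), min(end, window_end)
        if start >= end:
            continue  # lies entirely outside the window
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def utilisation(intervals, window_start, window_end):
    """Fraction of the window during which the GPU was busy."""
    merged = merge_intervals(intervals, window_start, window_end)
    busy = sum(end - start for start, end in merged)
    return busy / (window_end - window_start)


def idle_gaps(intervals, window_start, window_end):
    """List the (start, end) spans in the window with no GPU activity."""
    gaps = []
    cursor = window_start
    for start, end in merge_intervals(intervals, window_start, window_end):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = end
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Hypothetical kernel activity in milliseconds (not measured data).
kernels = [(0, 30), (25, 40), (70, 100)]
print(utilisation(kernels, 0, 100))
print(idle_gaps(kernels, 0, 100))
```

In this toy timeline the single gap is where the GPU would be waiting, for example on a tensor being copied back from host memory.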

Waste Segmentation

Littering

Humans have been littering the Earth from the bottom of the Mariana Trench to Mount Everest. Every minute, at least 15 tonnes of plastic waste leak into the ocean, which is equivalent to the capacity of one garbage truck.

Segmentation

One way to achieve automatic waste segmentation is to use semantic segmentation.

We used the TACO dataset for training our waste segmentation model based on the G-SCNN architecture.

Data and AC922

The dataset consists of 715 images and 2152 annotations, labeled in 60 categories of litter.

We trained the model exclusively on the IBM Power AC922 as it achieved the best performance in our benchmarks.

Conclusions

Faster Computing

IBM Power AC922 is significantly faster than NVIDIA DGX Station and the AWS p3.8xlarge instance type in such computationally demanding tasks as semantic segmentation.

Larger Batch

Large Model Support enables us to train the model with a larger batch size and larger input image dimensions, producing better overall results.

GPU Utilization

IBM’s Large Model Support technology has less overhead when used with the IBM Power AC922 hardware, leading to higher GPU utilisation and shorter training time.

Complex Tasks

IBM Power AC922 satisfies the hardware requirements for training on complex tasks such as automatic waste segmentation.
