GPU H100 - Transformer Engine - 03. Performance Optimization

버터젤리 2023. 7. 28. 18:17

Performance Optimizations

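Following the earlier post on how to use Transformer Engine (TE), this one covers how to optimize its performance. The walkthrough is based on a GPT encoder layer.

Let's apply it using the helper functions in quickstart_utils.py.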

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling
import quickstart_utils as utils

# Layer configuration
hidden_size = 4096
sequence_length = 2048
batch_size = 4
ffn_hidden_size = 16384
num_attention_heads = 32
dtype = torch.float16

# Synthetic data
x = torch.rand(sequence_length, batch_size, hidden_size).cuda().to(dtype=dtype)
dy = torch.rand(sequence_length, batch_size, hidden_size).cuda().to(dtype=dtype)

# Construct layer
basic_transformer = te.TransformerLayer(
    hidden_size,
    ffn_hidden_size,
    num_attention_heads,
)
basic_transformer.to(dtype=dtype).cuda()

fp8_format = Format.HYBRID
fp8_recipe = DelayedScaling(
    fp8_format=fp8_format,
    amax_history_len=16,
    amax_compute_algo="max",
)

# Training step
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = basic_transformer(x, attention_mask=None)
y.backward(dy)

# Measure step time
utils.speedometer(
    basic_transformer,
    x,
    dy,
    forward_kwargs = { "attention_mask": None },
    fp8_autocast_kwargs = { "enabled": True, "fp8_recipe": fp8_recipe },
)

Mean time: 27.82952880859375 ms

Multi-GPU Training

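The usual first approach to using multiple GPUs is data parallelism, which distributes the work by splitting the data.

The batch is divided evenly by the number of GPUs, and each GPU holds its own copy of the model.

Each GPU runs the forward and backward passes of a training step independently on its own slice of the batch. The gradients are then averaged across GPUs before the weight update, so that every copy of the model stays identical.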
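To make the data-parallel idea concrete, here is a minimal sketch that simulates it on the CPU. It is not how a real job is launched (that would normally go through something like torch.nn.parallel.DistributedDataParallel); the name num_gpus, the small torch.nn.Linear model, and the MSE loss are all assumptions chosen for illustration. It only shows that splitting the batch, computing gradients per replica, and averaging them gives the same gradient as one pass over the full batch.

```python
import copy
import torch

torch.manual_seed(0)

num_gpus = 4       # hypothetical number of data-parallel workers
batch_size = 8     # must be divisible by num_gpus for an even split
hidden_size = 16

# Small stand-in model (assumption: any differentiable module would do)
model = torch.nn.Linear(hidden_size, hidden_size).double()
x = torch.rand(batch_size, hidden_size, dtype=torch.float64)
target = torch.rand(batch_size, hidden_size, dtype=torch.float64)
loss_fn = torch.nn.MSELoss()

# Each "GPU" gets its own copy of the model and one shard of the batch
replicas = [copy.deepcopy(model) for _ in range(num_gpus)]
x_shards = x.chunk(num_gpus, dim=0)
target_shards = target.chunk(num_gpus, dim=0)

# Forward and backward run independently on every replica
replica_grads = []
for replica, x_shard, t_shard in zip(replicas, x_shards, target_shards):
    loss_fn(replica(x_shard), t_shard).backward()
    replica_grads.append(replica.weight.grad)

# Stand-in for the gradient all-reduce: average the per-replica gradients
averaged_grad = torch.stack(replica_grads).mean(dim=0)

# Reference: one forward/backward over the full batch on a single model
loss_fn(model(x), target).backward()
full_grad = model.weight.grad

grads_match = torch.allclose(full_grad, averaged_grad)
print("averaged shard gradients match full-batch gradient:", grads_match)
```

The match depends on the shards being equal in size and the loss being a mean over samples; with uneven shards the average would need to be weighted.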

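A step up from this is tensor parallelism, where the tensors themselves are split along the hidden size dimension and processed in parallel. Split the tensors?

How? And what exactly is the hidden size?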
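One way to answer these questions is with a small CPU simulation. The hidden size is the width of each token's feature vector (hidden_size = 4096 in the layer configuration above), and ffn_hidden_size is the wider inner dimension of the feed-forward block. The sketch below assumes a Megatron-style split of a two-layer MLP: the first weight is split by output features and the second by input features, so each worker holds only a slice of the weights. The names tp_size, w1, and w2 are hypothetical, and this is not necessarily how Transformer Engine implements it internally.

```python
import torch

torch.manual_seed(0)

tp_size = 4                            # hypothetical number of tensor-parallel workers
sequence_length, batch_size = 8, 2
hidden_size, ffn_hidden_size = 16, 64  # tiny stand-ins for 4096 and 16384

x = torch.rand(sequence_length, batch_size, hidden_size, dtype=torch.float64)
w1 = torch.rand(ffn_hidden_size, hidden_size, dtype=torch.float64)  # hidden -> ffn
w2 = torch.rand(hidden_size, ffn_hidden_size, dtype=torch.float64)  # ffn -> hidden

# Reference: the whole MLP on a single device
reference = torch.relu(x @ w1.t()) @ w2.t()

# Split both weights along the ffn dimension, one slice per worker
w1_shards = w1.chunk(tp_size, dim=0)  # each: [ffn_hidden_size / tp_size, hidden_size]
w2_shards = w2.chunk(tp_size, dim=1)  # each: [hidden_size, ffn_hidden_size / tp_size]

# Every worker sees the full input but computes with only its weight slices.
# ReLU is elementwise, so it can be applied to each slice independently.
partials = [torch.relu(x @ a.t()) @ b.t() for a, b in zip(w1_shards, w2_shards)]

# Stand-in for the all-reduce: sum the partial outputs from all workers
tp_output = torch.stack(partials).sum(dim=0)

outputs_match = torch.allclose(reference, tp_output)
print("tensor-parallel output matches single-device output:", outputs_match)
```

Unlike data parallelism, no worker here holds a full copy of the weights, which is why this approach helps when the model itself is too large for one GPU.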