Is it possible to leverage xformers optimizations in NeMo ASR? #5516
Replies: 2 comments 2 replies
-
There may be some way to shoehorn it into NeMo ASR, but we need to take a look. As you can see, we have plenty of code that overrides the requirements of basic attention blocks - the cache code is just one example. Further, ASR masking is more delicate than what transformers normally go through, and we use relpos MHA by default for Conformer (relpos encoding + relpos MHA), so the code you linked isn't called - it's this one instead: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/submodules/multi_head_attention.py#L208 As such, xformers won't really do much for Conformer. We can still take a look at shoehorning parts of it in, but ASR requirements for Transformer-like blocks are very different from regular NLP/LM requirements, so there probably is no simple integration.
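To make the mismatch concrete, here is a minimal sketch (not NeMo's actual code) of the Transformer-XL style score computation that relative-position MHA performs. The function name `rel_pos_scores` and the exact tensor shapes are illustrative assumptions; the point is the extra content-position term, which a fused kernel that only computes softmax(QKᵀ/√d)·V does not cover:

```python
import math
import torch

def rel_pos_scores(q, k, pos_emb_proj, bias_u, bias_v):
    """Illustrative relative-position attention scores (Transformer-XL style).

    q, k:           (batch, heads, time, d_k)
    pos_emb_proj:   (batch, heads, time, d_k) projected relative positional embeddings
                    (a real implementation covers 2*time-1 relative offsets and applies
                    a rel-shift to align them; omitted here for brevity)
    bias_u, bias_v: (heads, d_k) learned content/position biases
    """
    q_u = q + bias_u.unsqueeze(0).unsqueeze(2)   # query + content bias
    q_v = q + bias_v.unsqueeze(0).unsqueeze(2)   # query + position bias
    matrix_ac = torch.matmul(q_u, k.transpose(-2, -1))             # content-content term
    matrix_bd = torch.matmul(q_v, pos_emb_proj.transpose(-2, -1))  # content-position term
    return (matrix_ac + matrix_bd) / math.sqrt(q.size(-1))

# Usage with dummy tensors:
b, h, t, d = 2, 4, 50, 64
scores = rel_pos_scores(
    torch.randn(b, h, t, d), torch.randn(b, h, t, d),
    torch.randn(b, h, t, d), torch.randn(h, d), torch.randn(h, d),
)  # -> (batch, heads, time, time), then masked and softmaxed as usual
```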
-
IMO this is a great idea and would allow us to introduce optimisations leveraging fused computation/Triton. It would also help us move towards much faster training throughput for these models. I agree with @titu1994 that there is quite a bit of custom code to tackle. The relative positional encoding + relative MHA code would require some effort; I think we could start by creating an issue on the xFormers repo to see if there is any interest in helping to build these components.
-
I've been reading about how recent LDMs (aka Stable Diffusion) leverage FlashAttention and other optimizations found in `xformers`, and was wondering whether it's possible to leverage them in the Conformer models for ASR? Specifically in the computation of `MultiHeadAttention` (not sure if that's the one to update, or the one in `nemo.asr.parts.submodules`).

My proposal is to use a mechanism similar to what `diffusers` does in checking whether the xformers optimizations can be enabled or not here, and then use `xformers.ops.memory_efficient_attention` to compute the attention in `MultiHeadAttention` here.

Thoughts? This could also solve the numerical stability issues with fp16 and would make training faster and more memory efficient. LMK what the team's thoughts are on this!
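For illustration, here is a minimal sketch of the proposed guarded path, assuming plain scaled dot-product attention as the fallback. The `attention` wrapper and `use_xformers` flag are hypothetical names rather than NeMo or diffusers API, and mapping NeMo's padding/cache masks onto an xformers `attn_bias` is exactly the integration work discussed above:

```python
import math
import torch

try:
    import xformers.ops as xops
    _XFORMERS_AVAILABLE = True
except ImportError:
    _XFORMERS_AVAILABLE = False

def attention(q, k, v, mask=None, use_xformers=True):
    """Hypothetical drop-in attention: q, k, v are (batch, heads, time, d_k);
    mask is an optional additive bias broadcastable to (batch, heads, time, time)."""
    if use_xformers and _XFORMERS_AVAILABLE and mask is None:
        # xformers expects (batch, time, heads, d_k); its fused kernels run on CUDA,
        # with fp16/bf16 inputs taking the FlashAttention path.
        q_, k_, v_ = (t.transpose(1, 2).contiguous() for t in (q, k, v))
        out = xops.memory_efficient_attention(q_, k_, v_)
        return out.transpose(1, 2)
    # Reference path: plain scaled dot-product attention.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores + mask
    return torch.matmul(torch.softmax(scores, dim=-1), v)
```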