Is it possible to leverage xformers optimizations in NeMo ASR? #5516
Replies: 2 comments 2 replies
-
There may be some way to shoehorn it into NeMo ASR, but we need to take a look. As you can see, we have plenty of code that overrides the requirements of basic attention blocks - the cache code is just one example. Further, ASR masking is more delicate than what transformers normally go through, and we use relpos MHA by default for Conformer (relpos encoding + relpos MHA), so the code you linked isn't called - it's this one instead: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/submodules/multi_head_attention.py#L208 As such, xformers won't really do much for Conformer. We can still take a look at shoehorning parts of it in, but ASR requirements for Transformer-like blocks are very different from regular NLP/LM requirements, so there probably is no simple integration.
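To make the mismatch concrete, here is a minimal sketch (not NeMo's actual code) of the Transformer-XL style score computation that relative-position MHA performs. The function name `rel_pos_scores` and the exact tensor shapes are illustrative assumptions; the point is the extra content-position term, which a fused kernel that only computes softmax(QKᵀ/√d)·V does not cover:

```python
import math
import torch

def rel_pos_scores(q, k, pos_emb_proj, bias_u, bias_v):
    """Illustrative relative-position attention scores (Transformer-XL style).

    q, k:           (batch, heads, time, d_k)
    pos_emb_proj:   (batch, heads, time, d_k) projected relative positional embeddings
                    (a real implementation covers 2*time-1 relative offsets and applies
                    a rel-shift to align them; omitted here for brevity)
    bias_u, bias_v: (heads, d_k) learned content/position biases
    """
    q_u = q + bias_u.unsqueeze(0).unsqueeze(2)   # query + content bias
    q_v = q + bias_v.unsqueeze(0).unsqueeze(2)   # query + position bias
    matrix_ac = torch.matmul(q_u, k.transpose(-2, -1))             # content-content term
    matrix_bd = torch.matmul(q_v, pos_emb_proj.transpose(-2, -1))  # content-position term
    return (matrix_ac + matrix_bd) / math.sqrt(q.size(-1))

# Usage with dummy tensors:
b, h, t, d = 2, 4, 50, 64
scores = rel_pos_scores(
    torch.randn(b, h, t, d), torch.randn(b, h, t, d),
    torch.randn(b, h, t, d), torch.randn(h, d), torch.randn(h, d),
)  # -> (batch, heads, time, time), then masked and softmaxed as usual
```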
-
IMO this is a great idea and would allow us to introduce optimisations leveraging fused computation/Triton. It would also help us move towards much faster training throughput for these models. I agree with @titu1994 that there is quite a bit of custom code to tackle. The relative positional encoding + relative MHA code would require some effort; I think we could start by creating an issue on the xFormers repo to see if there is any interest in helping to build these components.
-
I've been reading about how recent LDMs (aka Stable Diffusion) leverage FlashAttention and other optimizations found in `xformers`, and was wondering whether it's possible to leverage them in the Conformer models for ASR? Specifically in the computation of `MultiHeadAttention` (not sure if that's the one to update, or the one in `nemo.asr.parts.submodules`).

My proposal is to use a mechanism similar to what `diffusers` does in checking whether the xformers optimizations can be enabled or not here, and then use `xformers.ops.memory_efficient_attention` to compute the attention in `MultiHeadAttention` here.

Thoughts? This could also solve the numerical stability issues with fp16 and would make training faster and more memory efficient. LMK what the team's thoughts are on this!
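For illustration, here is a minimal sketch of the proposed guarded path, assuming plain scaled dot-product attention as the fallback. The `attention` wrapper and `use_xformers` flag are hypothetical names rather than NeMo or diffusers API, and mapping NeMo's padding/cache masks onto an xformers `attn_bias` is exactly the integration work discussed above:

```python
import math
import torch

try:
    import xformers.ops as xops
    _XFORMERS_AVAILABLE = True
except ImportError:
    _XFORMERS_AVAILABLE = False

def attention(q, k, v, mask=None, use_xformers=True):
    """Hypothetical drop-in attention: q, k, v are (batch, heads, time, d_k);
    mask is an optional additive bias broadcastable to (batch, heads, time, time)."""
    if use_xformers and _XFORMERS_AVAILABLE and mask is None:
        # xformers expects (batch, time, heads, d_k); its fused kernels run on CUDA,
        # with fp16/bf16 inputs taking the FlashAttention path.
        q_, k_, v_ = (t.transpose(1, 2).contiguous() for t in (q, k, v))
        out = xops.memory_efficient_attention(q_, k_, v_)
        return out.transpose(1, 2)
    # Reference path: plain scaled dot-product attention.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores + mask
    return torch.matmul(torch.softmax(scores, dim=-1), v)
```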