[cuda.parallel]: CI testing should exclude tests making large GPU memory allocations #4722

Closed

Description

@oleksandr-pavlyk (Contributor)

We use pytest-xdist to speed up test execution, invoking pytest -n auto.

Running the suite in N worker processes also scales the GPU allocation footprint by up to a factor of N (in the worst case, when all workers allocate simultaneously). This risks spurious OutOfMemoryError exceptions of our own making.

A simple solution would be to introduce a pytest mark, say pytest.mark.large, to tag tests that make large GPU memory allocations, and to exclude them in CI jobs with the pytest command-line argument -m "not large".
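A minimal sketch of that approach (the marker name large comes from this issue; the test name and registration hook are illustrative):

```python
import pytest


# In conftest.py: register the marker so pytest does not emit
# PytestUnknownMarkWarning for it.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "large: test makes large GPU memory allocations"
    )


# In a test module: tag memory-hungry tests with the marker.
@pytest.mark.large
def test_huge_scan():
    ...  # would allocate a multi-GiB device buffer here
```

CI jobs would then deselect the tagged tests with:

pytest -n auto -m "not large"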

An alternative solution would be to introduce an exclusive_gpu_use_lock based on FileLock, as in https://pytest-xdist.readthedocs.io/en/latest/how-to.html#making-session-scoped-fixtures-execute-only-once

The lock would be held around blocks that make and use GPU allocations, and each test would need to release its GPU allocations before releasing the lock. This would let JIT-ting, allocation and execution, and host-side validation steps overlap across workers, while ensuring the GPU allocation/execution/validation steps themselves are serialized.

Metadata

Labels

cuda.parallel: For all items related to the cuda.parallel Python module

Status

Done



[cuda.parallel]: CI testing should exclude tests making large GPU memory allocations · Issue #4722 · NVIDIA/cccl