GH200 run on ALPS #7
I think @wsmoses fixed a very similar error but for "unsupported .version 8.2" compared to 8.1 rather than 8.6 and 8.4. |
Which version of Reactant are you using? We released the fix in the latest patch release. |
(GB-25) pkg> st
Status `/capstor/scratch/cscs/lraess/GB-25/Project.toml`
[6e4b80f9] BenchmarkTools v1.6.0
[9e8cae18] Oceananigans v0.95.7 `https://github.com/CliMA/Oceananigans.jl.git#main`
[3c362404] Reactant v0.2.21 `https://github.com/EnzymeAD/Reactant.jl.git#main`
[0192cb87] Reactant_jll v0.0.48+0 |
Can you update to the latest release (0.2.22)? FYI, though: if it's aarch64, you'll need the aarch64 CUDA JLL that @giordano is currently working on (not yet landed). The x86 CUDA JLL is already in the package manager. |
Thanks, will update. And yes, it's aarch64, so will have to wait until @giordano lands the jll. |
If/when JuliaPackaging/Yggdrasil#10313 is green. |
|
I ran the code https://github.com/PRONTOLab/GB-25/blob/glw/super-simple-distributed/oceananigans-dynamical-core/super_simple_simulation.jl with the distributed architecture commented out in favor of a single GPU:
# arch = Distributed(GPU(), partition=Partition(2, 2)) # distributed on 4 GPUs
arch = GPU()
Running it:
Env:
|
Can you retry on the latest main? We just landed what we hope is a fix for this. cc @giordano |
OK, testing it now. Besides the issue, there seems to be significant run-time overhead in the Reactant case. The new run reports the following issue(s):
julia> include("super_simple_simulation.jl")
┌ Debug: Detected CUDA Driver version 12.4.0
└ @ Reactant_jll /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant_jll/qvFaw/.pkg/platform_augmentation.jl:60
Reactant_jll.cuDriverGetVersion(dlopen("libcuda.so")) = v"12.4.0"
┌ Warning: `Adapt.parent_type` is not implemented for Field{Center, Center, Face, Nothing, LatitudeLongitudeGrid{Float64, Periodic, Bounded, Bounded, Oceananigans.Grids.StaticVerticalDiscretization{OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, Float64, Float64}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, GPU}, Tuple{Colon, Colon, UnitRange{Int64}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}}. 
Assuming Field{Center, Center, Face, Nothing, LatitudeLongitudeGrid{Float64, Periodic, Bounded, Bounded, Oceananigans.Grids.StaticVerticalDiscretization{OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, Float64, Float64}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, GPU}, Tuple{Colon, Colon, UnitRange{Int64}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}} isn't a wrapped array.
└ @ Reactant /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/Reactant.jl:39
[ Info: Initializing simulation...
[ Info: ... simulation initialization complete (9.342 seconds)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (12.374 seconds).
[ Info: Simulation is stopping after running for 21.765 seconds.
[ Info: Model iteration 2 equals or exceeds stop iteration 2.
[ Info: KA
[ Info: Initializing simulation...
[ Info: ... simulation initialization complete (1.207 ms)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (1.944 ms).
[ Info: Simulation is stopping after running for 0 seconds.
[ Info: Model iteration 3 equals or exceeds stop iteration 2.
[ Info: Initializing simulation...
[ Info: ... simulation initialization complete (1.939 minutes)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (2.021 minutes).
[ Info: Simulation is stopping after running for 0 seconds.
[ Info: Model iteration 4 equals or exceeds stop iteration 2.
ERROR: LoadError: UNAVAILABLE: No PTX compilation provider is available. Neither ptxas/nvlink nor nvjtlink is available. As a fallback you can enable JIT compilation in the CUDA driver via the flag `--xla_gpu_unsafe_fallback_to_driver_on_ptxas_not_found`. Details:
- Has NvJitLink support: LibNvJitLink is not supported (disabled during compilation).
- Has NvPtxCompiler support: LibNvPtxCompiler is not supported (disabled during compilation).
- Parallel compilation support is desired: 0
- ptxas_path: Couldn't find a suitable version of ptxas. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /user-environment/juhpc_setup/juliaup_wrapper/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/bin/ptxas, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/ptxas, /users/lraess/bin/ptxas, /usr/local/bin/ptxas, /usr/bin/ptxas, /bin/ptxas, /usr/lib/mit/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/ptxas, /capsto/cuda_nvcc/bin/ptxas, bin/ptxas, /usr/local/cuda/bin/ptxas, /opt/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/ptxas
- ptxas_version: Couldn't find a suitable version of ptxas. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /user-environment/juhpc_setup/juliaup_wrapper/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/bin/ptxas, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/ptxas, /users/lraess/bin/ptxas, /usr/local/bin/ptxas, /usr/bin/ptxas, /bin/ptxas, /usr/lib/mit/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/ptxas, /capsto/cuda_nvcc/bin/ptxas, bin/ptxas, /usr/local/cuda/bin/ptxas, /opt/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/ptxas
- nvlink_path: Couldn't find a suitable version of nvlink. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /user-environment/juhpc_setup/juliaup_wrapper/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/bin/nvlink, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/nvlink, /users/lraess/bin/nvlink, /usr/local/bin/nvlink, /usr/bin/nvlink, /bin/nvlink, /usr/lib/mit/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/nvlink, /capsto/cuda_nvcc/bin/nvlink, bin/nvlink, /usr/local/cuda/bin/nvlink, /opt/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/nvlink
- nvlink_version: Couldn't find a suitable version of nvlink. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /user-environment/juhpc_setup/juliaup_wrapper/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/bin/nvlink, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/nvlink, /users/lraess/bin/nvlink, /usr/local/bin/nvlink, /usr/bin/nvlink, /bin/nvlink, /usr/lib/mit/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/nvlink, /capsto/cuda_nvcc/bin/nvlink, bin/nvlink, /usr/local/cuda/bin/nvlink, /opt/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/nvlink
- Driver compilation is enabled: 0
Stacktrace:
[1] reactant_err(msg::Cstring)
@ Reactant.XLA /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/XLA.jl:164
[2] Compile(client::Reactant.XLA.Client, mod::Reactant.MLIR.IR.Module)
@ Reactant.XLA /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/XLA.jl:571
[3] compile_xla(f::Function, args::Tuple{Simulation{…}}; client::Nothing, optimize::Bool, no_nan::Bool, device::Nothing)
@ Reactant.Compiler /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/Compiler.jl:1034
[4] compile_xla
@ /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/Compiler.jl:983 [inlined]
[5] compile(f::Function, args::Tuple{…}; sync::Bool, kwargs::@Kwargs{…})
@ Reactant.Compiler /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/Compiler.jl:1052
[6] top-level scope
@ /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/Compiler.jl:706
[7] include(fname::String)
@ Main ./sysimg.jl:38
[8] top-level scope
@ REPL[3]:1
in expression starting at /capstor/scratch/cscs/lraess/GB-25/oceananigans-dynamical-core/super_simple_simulation.jl:42
Some type information was truncated. Use `show(err)` to see complete types.
julia> |
The supposed run time that Oceananigans prints is actually all compile time at the moment (and regardless, something seems to be going awry; @giordano, can you take a look?). |
@luraess have you ever tried this before? Seeing that ptxas can't be found anywhere (we only modified the first location searched) suggests to me that this would never have worked for you. Can you check whether ptxas is available at any of the printed locations? |
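One way to answer the question above is to test each candidate path directly. The following is a minimal sketch, not something posted in the thread: `existing_candidates` is a hypothetical helper name, and the list holds only a few of the system fallback locations copied from the error message, so it would need to be extended with the artifact paths printed on the machine in question.

```julia
# Hypothetical helper (not part of Reactant or XLA): keep only the
# candidate paths at which a regular file actually exists.
function existing_candidates(paths::AbstractVector{<:AbstractString})
    return [p for p in paths if isfile(p)]
end

# A few of the fallback locations listed in the error message above.
candidates = [
    "/usr/local/cuda/bin/ptxas",
    "/opt/cuda/bin/ptxas",
    "/usr/local/bin/ptxas",
    "/usr/bin/ptxas",
]

found = existing_candidates(candidates)
println(isempty(found) ? "no ptxas at the listed locations" : join(found, "\n"))

# Independently of the list, check whether a ptxas is reachable through PATH.
println("ptxas on PATH: ", something(Sys.which("ptxas"), "not found"))
```

Note that a file existing at one of these paths only shows that it is present; it says nothing about whether XLA accepts its version.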
Could be a module issue. Now, I am seeing
let me try again |
That's a different matter anyway, though, since we should be using the ptxas shipped with the JLL. If you look in the JLL artifact path, is there a ptxas there? |
But does the file |
Seems it does exist
|
What is Reactant.XLA.CUDA_DATA_DIR? And also, what is dirname(dirname(Reactant_jll.ptxas_path))? |
So the question is why XLA doesn't seem to like it. Also, is there a ptxas in any of the other fallback locations? I had it under /usr/local/cuda/bin, and that was a system version, which was correctly picked up. If you have it there (or in any of the other locations) but XLA decides it doesn't like it, then that's a problem someone will have to debug. |
julia> Reactant.XLA.CUDA_DATA_DIR
Base.RefValue{String}("/capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda")
julia> dirname(dirname(Reactant_jll.ptxas_path))
"/capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda" |
Oh wait. @luraess, can you use the latest Reactant release rather than a dev'd one? I bet you're accidentally on an old JLL (you can check by running st, for example). Julia won't necessarily update dependencies if you dev main. Edit: Enzyme->Reactant in the text |
Can you run
? And I insist: do you have ptxas in any of the fallback system locations? If so, this seems to me to be a problem of XLA being unable to accept ptxas anywhere.
Message above showed it's Reactant_jll.jl 0.0.60, and that matches the hash of the artifact. I don't think that's the issue. |
oh right.... |
|
Then what's wrong with XLA? 😄 |
CUDA 12.3 is explicitly excluded: https://github.com/openxla/xla/blob/0919d17b5e3de7d15f51fd89e190505cb22a88c4/xla/stream_executor/cuda/subprocess_compilation.cc#L221 ... now building a 12.4 |
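To illustrate the kind of check being described, here is a rough sketch that parses the version out of `ptxas --version` style output and tests it against the exclusion. The helper names `parse_ptxas_version` and `is_excluded` are hypothetical, the sample string is made up for illustration, and the rule "any 12.3.x is rejected" is an assumption taken from the comment above; the linked XLA source is the authority on which versions are actually refused.

```julia
# Hypothetical helper: extract the toolkit version from the text that
# `ptxas --version` prints (its last line has the form "... V12.4.131").
function parse_ptxas_version(output::AbstractString)
    m = match(r"V(\d+)\.(\d+)\.(\d+)", output)
    m === nothing && return nothing
    return VersionNumber(parse(Int, m[1]), parse(Int, m[2]), parse(Int, m[3]))
end

# Assumption based on the comment above: every 12.3.x ptxas is excluded.
# The real XLA check may be narrower (for example, a specific patch version).
is_excluded(v::VersionNumber) = v.major == 12 && v.minor == 3

# Illustrative sample text, not output captured from the machine in this issue.
sample = "Cuda compilation tools, release 12.3, V12.3.107"
v = parse_ptxas_version(sample)
println(v, " excluded? ", is_excluded(v))
```

On a real system the input string could come from something like `read(`$(path_to_ptxas) --version`, String)`, where `path_to_ptxas` stands for whichever binary is being inspected.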
Reporting here on an attempt to run https://github.com/PRONTOLab/GB-25/blob/main/oceananigans-dynamical-core/super_simple_simulation.jl on a single NVIDIA GH200 on the ALPS infrastructure.
Config
Output