cuda.bindings latency benchmarks - part 2 #1856

Open
danielfrg wants to merge 6 commits into main from cuda-bindings-bench-more

@danielfrg danielfrg commented Apr 3, 2026

Description

closes #1580

Follow-up to #1580

Adding a couple more benchmarks here and fixing a couple of issues with the pyperf JSON handling.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danielfrg danielfrg force-pushed the cuda-bindings-bench-more branch from 0cfea1d to a3f0678 Compare April 3, 2026 15:25
@danielfrg danielfrg commented

These are the results from a run on my dev machine (4090):


----------------------------------------------------------------------------------
Benchmark                                   C++ (mean)   Python (mean)    Overhead
----------------------------------------------------------------------------------
ctx_device.ctx_get_current                        6 ns          112 ns     +106 ns
ctx_device.ctx_get_device                         8 ns          122 ns     +113 ns
ctx_device.ctx_set_current                        8 ns          103 ns      +96 ns
ctx_device.device_get                             6 ns          126 ns     +120 ns
ctx_device.device_get_attribute                   9 ns          195 ns     +186 ns
event.event_create_destroy                       90 ns          307 ns     +218 ns
event.event_query                                74 ns          215 ns     +140 ns
event.event_record                               93 ns          229 ns     +136 ns
event.event_synchronize                          94 ns          239 ns     +145 ns
launch.launch_16_args                          1.57 us         3.12 us    +1545 ns
launch.launch_16_args_pre_packed               1.58 us         1.99 us     +409 ns
launch.launch_empty_kernel                     1.54 us         1.85 us     +302 ns
launch.launch_small_kernel                     1.54 us         2.23 us     +690 ns
pointer_attributes.pointer_get_attribute         29 ns          511 ns     +482 ns
stream.stream_create_destroy                   3.78 us         4.06 us     +274 ns
stream.stream_query                              86 ns          232 ns     +145 ns
stream.stream_synchronize                       111 ns          257 ns     +146 ns
----------------------------------------------------------------------------------
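
For readers who want to reproduce the Overhead column, here is a minimal sketch of how it can be derived from the two mean columns. It is not the code in this PR: `format_ns` and `overhead_rows` are hypothetical helper names, and the inputs are the rounded means copied from two rows of the table. Because those inputs are already rounded, the computed overhead can differ by a few nanoseconds from the table, which was presumably computed from unrounded means.

```python
def format_ns(seconds: float) -> str:
    """Render a duration as 'N ns' below 1 us, otherwise as 'X.XX us'."""
    ns = seconds * 1e9
    if ns < 1000:
        return f"{ns:.0f} ns"
    return f"{ns / 1000:.2f} us"


def overhead_rows(cpp_means: dict, py_means: dict) -> list:
    """Build (name, C++ mean, Python mean, overhead) rows.

    Both inputs map benchmark name -> mean time in seconds.
    Only benchmarks present on both sides are reported.
    """
    rows = []
    for name in sorted(cpp_means.keys() & py_means.keys()):
        overhead_ns = (py_means[name] - cpp_means[name]) * 1e9
        rows.append(
            (
                name,
                format_ns(cpp_means[name]),
                format_ns(py_means[name]),
                f"{overhead_ns:+.0f} ns",
            )
        )
    return rows


# Rounded means (in seconds) copied from two rows of the table above.
cpp = {"event.event_query": 74e-9, "launch.launch_16_args": 1.57e-6}
py = {"event.event_query": 215e-9, "launch.launch_16_args": 3.12e-6}

for name, cpp_mean, py_mean, overhead in overhead_rows(cpp, py):
    print(f"{name:<40} {cpp_mean:>12} {py_mean:>15} {overhead:>11}")
```

Reporting the overhead as an absolute difference rather than a ratio keeps the per-call cost of the Python layer visible even when the underlying driver call is very short, as with the `ctx_device` rows.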

@danielfrg danielfrg requested review from mdboom and rwgk April 3, 2026 16:52
@danielfrg danielfrg self-assigned this Apr 3, 2026

Development

Successfully merging this pull request may close these issues.

Python latency testing & benchmarking
