Skip to content

Commit 318408d

Browse files
authored
[libcu++] Add initial cccl-runtime docs for 3.1 (#6562)
* Add initial cccl-runtime docs for 3.1 * Review feedback
1 parent 069d5f5 commit 318408d

File tree

10 files changed

+385
-49
lines changed

10 files changed

+385
-49
lines changed

docs/libcudacxx/Doxyfile

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,10 @@ XML_OUTPUT = xml
1010
XML_PROGRAMLISTING = YES
1111

1212
INPUT = ../../libcudacxx/include/cuda/__iterator \
13+
../../libcudacxx/include/cuda/__stream \
14+
../../libcudacxx/include/cuda/__device \
15+
../../libcudacxx/include/cuda/__event \
16+
../../libcudacxx/include/cuda/__algorithm \
1317
../../libcudacxx/include/nv
1418

1519
RECURSIVE = YES

docs/libcudacxx/extended_api.rst

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,6 @@ Extended API
1818
extended_api/type_traits
1919
extended_api/numeric
2020
extended_api/memory
21-
extended_api/streams
2221
extended_api/memory_resource
2322
extended_api/math
2423
extended_api/mdspan

docs/libcudacxx/extended_api/streams.rst

Lines changed: 0 additions & 24 deletions
This file was deleted.

docs/libcudacxx/extended_api/streams/stream_ref.rst

Lines changed: 0 additions & 24 deletions
This file was deleted.

docs/libcudacxx/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ libcu++
1212
releases
1313
standard_api
1414
extended_api
15+
runtime
1516
ptx_api
1617
API reference <api/index>
1718

docs/libcudacxx/runtime.rst

Lines changed: 78 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,78 @@
1+
.. _cccl-runtime:
2+
3+
Runtime
4+
=======
5+
6+
.. toctree::
7+
:hidden:
8+
:maxdepth: 1
9+
10+
runtime/stream
11+
runtime/event
12+
runtime/algorithm
13+
runtime/device
14+
15+
.. list-table::
16+
:widths: 25 45 30 30
17+
:header-rows: 1
18+
19+
* - **Header**
20+
- **Content**
21+
- **CCCL Availability**
22+
- **CUDA Toolkit Availability**
23+
24+
* - :ref:`devices <cccl-runtime-device-devices>`
25+
- A range of all available CUDA devices
26+
- CCCL 3.1.0
27+
- CUDA 13.1
28+
29+
* - :ref:`device_ref <cccl-runtime-device-device-ref>`
30+
- A non-owning representation of a CUDA device
31+
- CCCL 3.1.0
32+
- CUDA 13.1
33+
34+
* - :ref:`arch_traits <cccl-runtime-device-arch-traits>`
35+
- Per-architecture trait accessors
36+
- CCCL 3.1.0
37+
- CUDA 13.1
38+
39+
40+
* - :ref:`stream_ref <cccl-runtime-stream-stream-ref>`
41+
- A non-owning wrapper around a ``cudaStream_t``
42+
- CCCL 2.2.0
43+
- CUDA 12.3
44+
45+
* - :ref:`stream <cccl-runtime-stream-stream>`
46+
- An owning wrapper around a ``cudaStream_t``
47+
- CCCL 3.1.0
48+
- CUDA 13.1
49+
50+
* - :ref:`event_ref <cccl-runtime-event-event-ref>`
51+
- A non-owning wrapper around a ``cudaEvent_t``
52+
- CCCL 3.1.0
53+
- CUDA 13.1
54+
55+
* - :ref:`event <cccl-runtime-event-event>`
56+
- An owning wrapper around a ``cudaEvent_t`` (timing disabled)
57+
- CCCL 3.1.0
58+
- CUDA 13.1
59+
60+
* - :ref:`timed_event <cccl-runtime-event-timed-event>`
61+
- An owning wrapper around a ``cudaEvent_t`` with timing enabled and elapsed-time queries
62+
- CCCL 3.1.0
63+
- CUDA 13.1
64+
65+
* - :ref:`copy_bytes <cccl-runtime-algorithm-copy_bytes>`
66+
- Byte-wise copy into a ``cuda::stream_ref`` for ``cuda::std::span``/``cuda::std::mdspan`` sources and destinations
67+
- CCCL 3.1.0
68+
- CUDA 13.1
69+
70+
* - :ref:`fill_bytes <cccl-runtime-algorithm-fill_bytes>`
71+
- Byte-wise fill into a ``cuda::stream_ref`` for ``cuda::std::span``/``cuda::std::mdspan`` destinations
72+
- CCCL 3.1.0
73+
- CUDA 13.1
74+
75+
* - :ref:`Memory Resources <libcudacxx-extended-api-memory-resources>`
76+
- ``cuda::mr`` interfaces (resources, wrappers, properties) usable with streams
77+
- CCCL 2.2.0 (experimental), CCCL 3.1.0 (stable)
78+
- CUDA 12.3 (experimental), CUDA 13.1 (stable)
Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
.. _cccl-runtime-algorithm:
2+
3+
Algorithm
4+
==========
5+
6+
The ``runtime`` part of the ``cuda/algorithm`` header provide stream-ordered, byte-wise primitives that operate on ``cuda::std::span`` and
7+
``cuda::std::mdspan``-compatible types. They require a ``cuda::stream_ref`` to enqueue work.
8+
9+
``cuda::copy_bytes``
10+
---------------------
11+
.. _cccl-runtime-algorithm-copy_bytes:
12+
13+
Launch a byte-wise copy from source to destination on the provided stream.
14+
15+
- Overloads accept ``cuda::std::span``-convertible contiguous ranges or ``cuda::std::mdspan``-convertible multi-dimensional views.
16+
- Elements must be trivially copyable
17+
- ``cuda::std::mdspan``-convertible types must convert to a mdspan that is exhaustive
18+
- Source access order (during the copy call or in stream order) can be configured with ``cuda::copy_configuration``
19+
20+
Availability: CCCL 3.1.0 / CUDA 13.1
21+
22+
.. code:: cpp
23+
24+
#include <cuda/algorithm>
25+
#include <cuda/stream>
26+
#include <cuda/std/span>
27+
28+
void copy_example(cuda::stream_ref s, int* d_dst, const int* d_src, std::size_t n) {
29+
cuda::std::span<const int> src{d_src, n};
30+
cuda::std::span<int> dst{d_dst, n};
31+
cuda::copy_bytes(s, src, dst); // enqueued on s
32+
}
33+
34+
35+
``cuda::fill_bytes``
36+
---------------------
37+
.. _cccl-runtime-algorithm-fill_bytes:
38+
39+
Launch a byte-wise fill of the destination on the provided stream.
40+
41+
- Overloads accept ``cuda::std::span``-convertible or ``cuda::std::mdspan``-convertible destinations.
42+
- Elements must be trivially copyable
43+
- ``cuda::std::mdspan``-convertible types must convert to a mdspan that is exhaustive
44+
45+
Availability: CCCL 3.1.0 / CUDA 13.1
46+
47+
.. code:: cpp
48+
49+
#include <cuda/algorithm>
50+
#include <cuda/stream>
51+
#include <cuda/std/span>
52+
53+
void fill_example(cuda::stream_ref s, int* d_dst, std::size_t n) {
54+
cuda::std::span<int> dst{d_dst, n};
55+
cuda::fill_bytes(s, dst, 0x00); // zero-fill device memory
56+
}

docs/libcudacxx/runtime/device.rst

Lines changed: 104 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,104 @@
1+
.. _cccl-runtime-device:
2+
3+
Devices
4+
=======
5+
6+
``cuda::device_ref``
7+
---------------------
8+
.. _cccl-runtime-device-device-ref:
9+
10+
``cuda::device_ref`` is a lightweight, non-owning handle to a CUDA device ordinal.
11+
It offers:
12+
13+
- ``get()``: native device ordinal
14+
- ``name()``: device name
15+
- ``init()``: initialize the device context
16+
- ``peers()``: list peers for which peer access can be enabled
17+
- ``has_peer_access_to(device_ref)``: query if peer access can be enabled to the given device
18+
- ``attribute(attr)`` / ``attribute<::cudaDeviceAttr>()``: attribute queries
19+
20+
Availability: CCCL 3.1.0 / CUDA 13.1
21+
22+
``cuda::devices``
23+
------------------
24+
.. _cccl-runtime-device-devices:
25+
26+
``cuda::devices`` is a random-access view of all available CUDA devices in form of ``cuda::device_ref`` objects`. It provides indexing, size, and iteration for use
27+
in range-based loops.
28+
29+
Availability: CCCL 3.1.0 / CUDA 13.1
30+
31+
Example:
32+
33+
.. code:: cpp
34+
35+
#include <cuda/devices>
36+
#include <iostream>
37+
38+
void print_devices() {
39+
for (auto& dev : cuda::devices) {
40+
std::cout << "Device " << dev.get() << ": " << dev.name() << std::endl;
41+
}
42+
}
43+
44+
Device attributes
45+
-----------------
46+
.. _cccl-runtime-device-attributes:
47+
48+
``cuda::device_attributes`` provides strongly-typed attribute query objects usable with
49+
``device_ref::attribute``. Selected examples:
50+
51+
- ``compute_capability``
52+
- ``multiprocessor_count``
53+
- ``concurrent_managed_access``
54+
- ``clock_rate``
55+
- ``numa_id``
56+
57+
Availability: CCCL 3.1.0 / CUDA 13.1
58+
59+
Example:
60+
61+
.. code:: cpp
62+
63+
#include <cuda/devices>
64+
65+
int get_max_blocks_on_device(cuda::device_ref dev) {
66+
return cuda::device_attributes::multiprocessor_count(dev) * cuda::device_attributes::blocks_per_multiprocessor(dev);
67+
}
68+
69+
``cuda::arch_traits``
70+
---------------------
71+
.. _cccl-runtime-device-arch-traits:
72+
73+
Per-architecture trait accessors providing limits and capabilities common to all devices of an architecture.
74+
Compared to ``device_attributes``, ``cuda::arch_traits`` provide a compile-time accessible structure that describes common characteristics of all devices of an architecture, while attributes are run-time queries of a single characteristic of a specific device.
75+
76+
- ``cuda::arch_traits<cuda::arch_id::sm_80>()`` (compile-time) or
77+
``cuda::arch_traits_for(cuda::arch_id)`` / ``cuda::arch_traits_for(cuda::compute_capability)`` (run-time).
78+
- Returns a ``cuda::arch_traits_t`` with fields like
79+
``max_threads_per_block``, ``max_shared_memory_per_block``, ``cluster_supported`` and other capability flags.
80+
- Traits for the current architecture can be accessed with ``cuda::device::current_arch_traits()``
81+
82+
Availability: CCCL 3.1.0 / CUDA 13.1
83+
84+
Example:
85+
86+
.. code:: cpp
87+
88+
#include <cuda/devices>
89+
90+
template <cuda::arch_id Arch>
91+
__device__ void fn() {
92+
auto traits = cuda::arch_traits<Arch>();
93+
if constexpr (traits.cluster_supported) {
94+
// cluster specific code
95+
}
96+
else {
97+
// non-cluster code
98+
}
99+
100+
}
101+
102+
__global__ void kernel() {
103+
fn<cuda::arch_id::sm_90>();
104+
}

docs/libcudacxx/runtime/event.rst

Lines changed: 78 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,78 @@
1+
.. _cccl-runtime-event:
2+
3+
Events
4+
======
5+
6+
``cuda::event_ref``
7+
--------------------------------------------------
8+
.. _cccl-runtime-event-event-ref:
9+
10+
``cuda::event_ref`` is a non-owning wrapper around a ``cudaEvent_t``. It prevents unsafe implicit constructions from
11+
``nullptr`` or integer literals and provides convenient helpers:
12+
13+
- ``record(cuda::stream_ref)``: record the event on a stream
14+
- ``sync()``: wait for the recorded work to complete
15+
- ``is_done()``: non-blocking completion query
16+
- comparison operators against other ``event_ref`` or ``cudaEvent_t``
17+
18+
Availability: CCCL 3.1.0 / CUDA 13.1
19+
20+
Example:
21+
22+
.. code:: cpp
23+
24+
#include <cuda/event>
25+
#include <cuda/stream>
26+
27+
void record_on_stream(cuda::stream_ref stream, cudaEvent_t raw_handle) {
28+
cuda::event_ref e{raw_handle};
29+
e.record(stream);
30+
}
31+
32+
.. _cccl-runtime-event-event:
33+
``cuda::event``
34+
--------------------------------------------
35+
36+
``cuda::event`` is an owning wrapper around a ``cudaEvent_t`` (with timing disabled). It inherits from ``event_ref`` and provides all of its functionality.
37+
It also creates and destroys the native event, can be moved (but not copied), and can release ownership via ``release()``. Construction can target a specific
38+
``cuda::device_ref`` or record immediately on a ``cuda::stream_ref``.
39+
40+
Availability: CCCL 3.1.0 / CUDA 13.1
41+
42+
.. code:: cpp
43+
44+
#include <cuda/event>
45+
#include <cuda/stream>
46+
#include <cuda/devices>
47+
48+
cuda::std::optional<cuda::event> query_and_record_on_stream(cuda::stream_ref stream) {
49+
if (stream.is_done()) {
50+
return std::nullopt;
51+
}
52+
else {
53+
return cuda::event{stream};
54+
}
55+
}
56+
57+
.. _cccl-runtime-event-timed-event:
58+
``cuda::timed_event``
59+
-----------------------------------------------------
60+
61+
``cuda::timed_event`` is an owning wrapper for a timed ``cudaEvent_t``. It inherits from ``event`` and provides all of its functionality.
62+
It also supports elapsed-time queries between two events via ``operator-``, returning ``cuda::std::chrono::nanoseconds``.
63+
64+
Availability: CCCL 3.1.0 / CUDA 13.1
65+
66+
.. code:: cpp
67+
68+
#include <cuda/event>
69+
#include <cuda/stream>
70+
#include <cuda/std/chrono>
71+
72+
template <typename F>
73+
cuda::std::chrono::nanoseconds measure_execution_time(cuda::stream_ref stream, F&& f) {
74+
cuda::timed_event start{stream};
75+
f(stream);
76+
cuda::timed_event end{stream};
77+
return end - start;
78+
}

0 commit comments

Comments
 (0)