
Lightning Talk: Crafting CUDA Compatible C++ Code - Jon White - CppCon 2025
Indexed: 2026-05-22

320 views · 155:08 · CppCon · Originally published: 2026-05-22

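The ideal design separates the mathematical operation from the parallelization method. The talk first tries passing the operation as a runtime parameter, with the parallelization method parameterized on the operation type, but that hands the CUDA kernel the host version of the function. The working design passes the operation as a non-type template parameter, so it is resolved at compile time and the same parallelization infrastructure works with different operations on both CPU and GPU.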

[00:00:00] Attending any conference, it's incredibly important to be there. That's kind of the only way to really dedicate your time, to be immersed in the whole thing and not distracted by other stuff going on, even if you do get distracted by interesting conversations in the hallway.

[00:00:30] >> Hi, my name is Jon and I'm going to be talking about crafting CUDA-compatible C++ code.

[00:00:36] So basically the problem is I'm trying to write a parallel math library that needs to run on CPU and GPU, but I don't want to write anything twice and I don't want CUDA features in my C++ (CPU) code. As a simple example of an operation you want to parallelize, take single-precision a*x + y (SAXPY). It's an embarrassingly parallel problem because every index is independent of all the other indices. If you want to parallelize this on a CPU, you get your vector of threads and then you distribute the work to each of the threads. If you want to parallelize it on a GPU, I've implemented a grid-stride loop in a CUDA kernel.
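The slides are not part of this transcript, so the following is only a minimal sketch of what such single-purpose functions might look like. The names `saxpy_cpu_threads` and `saxpy_kernel`, the chunking scheme, and the default thread count are assumptions for illustration. The CUDA kernel is guarded by `__CUDACC__` so the file still compiles with a plain C++ compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Single-purpose CPU version (hypothetical name): the operation a*x + y and
// the threading logic are fused into one function.
void saxpy_cpu_threads(float a, const std::vector<float>& x,
                       std::vector<float>& y, unsigned num_threads = 4) {
    assert(x.size() == y.size());
    assert(num_threads > 0);
    const std::size_t n = x.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        // Each thread gets one contiguous, non-overlapping range of indices.
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;
        workers.emplace_back([=, &x, &y] {
            for (std::size_t i = begin; i < end; ++i) y[i] = a * x[i] + y[i];
        });
    }
    for (auto& w : workers) w.join();
}

#ifdef __CUDACC__
// Single-purpose GPU version (hypothetical name): a grid-stride loop, so any
// launch configuration covers all n elements.
__global__ void saxpy_kernel(std::size_t n, float a, const float* x, float* y) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = a * x[i] + y[i];
    }
}
#endif
```

Note that `a * x[i] + y[i]` is written out in both functions, which is the duplication the rest of the talk sets out to remove.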

[00:01:23] But the issue is that those were both single-purpose functions: you had to write both the operation and the parallelization method every time. Ideally we want to separate concerns into the operation and the parallelization method.

[00:01:42] So step one: obligatory constexpr all the things.

[00:01:47] The reason this is important is that with NVCC, the NVIDIA compiler, you can pass the experimental relaxed constexpr flag (--expt-relaxed-constexpr), and that allows all of your constexpr functions to be used on both the host and the device. You write it once, and you don't actually have to use any CUDA keywords.

[00:02:13] So if you can read that: up at the top we have the single-precision a*x + y as a constexpr function, and that is being called from both of the parallelization methods, the one that's doing the CPU threading and the one that's doing the CUDA kernel.
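As a rough illustration of step one (not the speaker's actual slide), the operation can be pulled out into a `constexpr` function that both parallelization methods call. The names `saxpy_op`, `saxpy_threads`, and `saxpy_grid_stride` are hypothetical. The device call assumes the file is compiled with nvcc and `--expt-relaxed-constexpr`; the kernel is guarded so a host-only compiler skips it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// The operation, written once, with no CUDA keywords on it.
constexpr float saxpy_op(float a, float x, float y) { return a * x + y; }
static_assert(saxpy_op(2.0f, 3.0f, 1.0f) == 7.0f, "usable at compile time");

// CPU parallelization (hypothetical name) that calls the shared operation.
void saxpy_threads(float a, const std::vector<float>& x, std::vector<float>& y,
                   unsigned num_threads = 4) {
    assert(x.size() == y.size());
    assert(num_threads > 0);
    const std::size_t n = x.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;
        workers.emplace_back([=, &x, &y] {
            for (std::size_t i = begin; i < end; ++i) y[i] = saxpy_op(a, x[i], y[i]);
        });
    }
    for (auto& w : workers) w.join();
}

#ifdef __CUDACC__
// GPU parallelization (hypothetical name). Calling saxpy_op from device code
// assumes nvcc was invoked with --expt-relaxed-constexpr.
__global__ void saxpy_grid_stride(std::size_t n, float a, const float* x, float* y) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = saxpy_op(a, x[i], y[i]);
    }
}
#endif
```

The arithmetic now lives in one place, but each parallelization method is still hard-wired to `saxpy_op`.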

[00:02:38] Step two: basically you want to follow the example of the STL and pass in an operation instead of having to call it from each of your parallelization methods. So we're going to pass the operation as a runtime parameter, and the parallelization method is going to be parameterized on the type of the operation.

[00:03:04] So now we're able to pass in the SAXPY op to both of our parallelization methods.
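A sketch of what step two could look like, under the assumption that the parallelization methods take the operation as their first argument in the style of STL algorithms. The names `saxpy_op`, `run_on_threads`, and `run_on_grid` are hypothetical. The host path below behaves as expected; the guarded device path is included only to show the shape of the call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr float saxpy_op(float a, float x, float y) { return a * x + y; }

// CPU parallelization, templated on the operation's type; the operation
// itself arrives as an ordinary runtime argument.
template <typename Op>
void run_on_threads(Op op, float a, const std::vector<float>& x,
                    std::vector<float>& y, unsigned num_threads = 4) {
    assert(x.size() == y.size());
    assert(num_threads > 0);
    const std::size_t n = x.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;
        workers.emplace_back([=, &x, &y] {
            for (std::size_t i = begin; i < end; ++i) y[i] = op(a, x[i], y[i]);
        });
    }
    for (auto& w : workers) w.join();
}

#ifdef __CUDACC__
// Same idea for the GPU. When host code passes saxpy_op here, Op is deduced
// as a function pointer type and the value is a host address.
template <typename Op>
__global__ void run_on_grid(Op op, std::size_t n, float a, const float* x, float* y) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = op(a, x[i], y[i]);
    }
}
#endif
```

On the host, `run_on_threads(saxpy_op, 2.0f, x, y)` deduces `Op` as `float (*)(float, float, float)` and calls through that pointer.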

[00:03:14] And so now we have separation of concerns and everything is only written once, except that this doesn't work. So, anyone know what's wrong with this?

[00:03:29] The problem is that the operation is being passed at runtime, and so it doesn't actually resolve correctly as the host or device version of it.

[00:03:40] So the CUDA version is actually getting the host version of the function.

[00:03:45] And so the CUDA kernel is going to silently fail when you try to call it.

[00:03:50] So, the actual final step: make the operation a non-type template parameter. That's going to force resolution at compile time, and so the host code gets the host version and the CUDA code gets the CUDA version. This is what that looks like: still just a constexpr function defining the operation, but now we're passing in the operation as a template parameter, not as one of the runtime parameters. And so this works now.
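A minimal sketch of the final step, again with hypothetical names (`ElementOp`, `saxpy_op`, `run_on_threads`, `run_on_grid`) rather than the speaker's actual code. The operation is a non-type template parameter of function-pointer type, so the callee is known at compile time in each compilation pass. With C++17, `template <auto Op>` would be a more general spelling; the explicit pointer type is used here only to keep the sketch valid in older standards. The device path again assumes nvcc with `--expt-relaxed-constexpr`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr float saxpy_op(float a, float x, float y) { return a * x + y; }

// Signature shared by all element-wise operations in this sketch.
using ElementOp = float (*)(float, float, float);

// CPU parallelization: the operation is a template argument, not a runtime one.
template <ElementOp Op>
void run_on_threads(float a, const std::vector<float>& x, std::vector<float>& y,
                    unsigned num_threads = 4) {
    assert(x.size() == y.size());
    assert(num_threads > 0);
    const std::size_t n = x.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;
        workers.emplace_back([=, &x, &y] {
            for (std::size_t i = begin; i < end; ++i) y[i] = Op(a, x[i], y[i]);
        });
    }
    for (auto& w : workers) w.join();
}

#ifdef __CUDACC__
// GPU parallelization: Op names the function at compile time, so no host
// address is passed to the kernel as an argument.
template <ElementOp Op>
__global__ void run_on_grid(std::size_t n, float a, const float* x, float* y) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = Op(a, x[i], y[i]);
    }
}
#endif
```

A host call would read `run_on_threads<saxpy_op>(2.0f, x, y);`, and a device launch under the same assumptions would read `run_on_grid<saxpy_op><<<blocks, threads>>>(n, a, d_x, d_y);`, where `blocks`, `threads`, `d_x`, and `d_y` are placeholders for a launch configuration and device pointers.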

[00:04:27] We did it. We now have a way of executing parallel operations that work on either the CPU or the GPU without having to write anything twice and without using CUDA in code that's meant for the CPU.

[00:04:41] Thank you. [applause]

