The gpu
package manages all the details of WebGPU to provide a higher-level interface where you can specify the data variables and values, shader pipelines, and other parameters that tell the GPU what to do, without having to worry about all the lower-level implementational details. It maps directly onto the underlying WebGPU structure, and does not decrease performance in any way. It supports both graphics and compute functionality.
The main gpu code is in the top-level gpu
package, with the following sub-packages available:
-
phong is a Blinn-Phong lighting model implementation on top of
gpu
, which then serves as the basis for the higherlevel xyz 3D scenegraph system. -
shape generates standard 3D shapes (sphere, cylinder, box, etc), with all the normals and texture coordinates. You can compose shape elements into more complex groups of shapes, programmatically. It separates the calculation of the number of vertex and index elements from actually setting those elements, so you can allocate everything in one pass, and then configure the shape data in a second pass, consistent with the most efficient memory model provided by gpu. It only has a dependency on the math32 package and could be used for anything.
-
gpudraw implements GPU-accelerated texture-based versions of the Go image/draw api. This is used for compositing images in the
core
GUI to construct the final rendered scene, and for drawing that scene on the actual hardware window. -
gosl translates Go code into GPU shader language code for running compute shaders in
gpu
, playing the role of NVIDIA's "cuda" language in other frameworks.
- On desktop (mac, windows, linux), glfw is used for initializing the GPU.
- Mobile (android, ios)...
- When developing for Android on macOS, it is critical to set
Emulated Performance
->Graphics
toSoftware
in theAndroid Virtual Device Manager (AVD)
; otherwise, the app will crash on startup. This is because macOS does not support direct access to the underlying hardware GPU in the Android Emulator. You can see more information how to do this in the Android developer documentation. Please note that this issue will not affect end-users of your app, only you while you develop it. Also, due to the typically bad performance of the emulated device GPU on macOS, it is recommended that you use a more modern emulated device than the default Pixel 3a. Finally, you should always test your app on a real mobile device if possible to see what it is actually like.
- When developing for Android on macOS, it is critical to set
For systems with multiple GPU devices, by default the discrete device is selected, and if multiple of those are present, the one with the most RAM is used. To see what is available and their properties, use:
$ go run cogentcore.org/core/gpu/cmd/webgpuinfo@latest
(you can install
that tool for later use as well)
The following environment variables can be set to specifically select a particular device by device number or name (deviceName
):
-
GPU_DEVICE
, for GUI and compute usage. -
GPU_COMPUTE_DEVICE
, only used for compute, if present, will override above, so you can use different GPUs for graphics vs compute. -
GPU
represents the hardwareAdapter
and maintains global settings, info about the hardware. -
Device
is a logical device and associatedQueue
info. Each such device can function in parallel.
There are many distinct mechanisms for graphics vs. compute functionality, so we review the Graphics system first, then the Compute.
-
GraphicsSystem
manages multipleGraphicsPipeline
s and associated variables (Var
) andValue
s, to accomplish a complete overall rendering / computational job. TheVars
andValues
are shared across all pipelines within a System, which is more efficient and usually what you want. A given shader can simply ignore the variables it doesn't need.GraphicsPipeline
performs a specific chain of operations, usingShader
program(s). In the graphics context, each pipeline typically handles a different type of material or other variation in rendering (textured vs. not, transparent vs. solid, etc).Vars
has up to 4 (hard limit imposed by WebGPU)VarGroup
s which are referenced with the@group(n)
syntax in the WGSL shader, in addition to a specialVertexGroup
specifically for the special Vertex and Index variables. EachVarGroup
can have a number ofVar
variables, which occupy sequential@binding(n)
numbers within each group.Values
withinVar
manages the specific data values for each variable. For example, eachTexture
or vertex mesh is stored in its own separateValue
, with its ownwgpu.Buffer
that is used to transfer data from the CPU to the GPU device. TheSetCurrent
method selects whichValue
is currently used, for the nextBindPipeline
call that sets all the values to use for a given pipeline run. Critically, all values must be uploaded to the GPU in advance of a given GPU pass. For large numbers ofUniform
andStorage
values, aDynamicOffset
can be set so there is just a singleValue
but the specific data used is determined by theDynamicIndex
within the one big value buffer.
-
Texture
manages a WebGPU Texture and associatedTextureView
, along with an optionalSampler
that defines how pixels are accessed in a shader. The Texture can manage any kind of texture object, with different Config methods for the different types. -
Renderer
is an interface for the final render target, implemented by two types:Surface
represents the full hardware-managedTexture
s associated with an actual on-screen Window.RenderTexture
is an offscreen render target that renders directly to a Texture, which can then be downloaded from the GPU or used directly as an input to a shader.Render
is a helper type that is used by both of the above to manage the additional depth texture and multisampling texture target.
-
Unlike most game-oriented GPU setups,
gpu
is designed to be used in an event-driven manner where render updates arise from user input or other events, instead of requiring a constant render loop taking place at all times (which can optionally be established too). The event-driven model is vastly more energy efficient for non-game applications.
These are the basic steps for a render pass, using convenient methods on the sy = GraphicsSystem
, which then manages the rest of the underlying steps. pl
here is a GraphicsPipeline
.
rp, err := sy.BeginRenderPass()
if err != nil { // error has already been logged, as all errors are.
return
}
pl.BindPipeline(rp)
pl.BindDrawIndexed(rp)
rp.End() // note: could add stuff after End and before EndRenderPass
sy.EndRenderPass(rp)
Note that all errors are logged in the gpu system, because in general GPU-level code should not create errors once it has been debugged.
The single most important constraint in thinking about how the GPU works, is that all resources (data in buffers, textures) must be uploaded to the GPU at the start of the render pass.
Thus, you must configure all the vars and values prior to a render pass, and if anything changes, these need to be reconfigured.
Then, during the render pass, the BindPipeline
calls BindAllGroups
to select which of multiple possible Value
instances of each Var
is actually seen by the current GPU commands. After the initial BindPipeline
call, you can more selectively call BindGroup
on an individual group to update the bindings.
Furthermore if you change the DynamicOffset
for a variable configured with that property, you need to call BindGroup to update the offset into a larger shared value buffer, to determine which value is processed.
The Var.Values.Current
index determines which Value is used for the BindGroup call, and SetCurrent*
methods set this for you at various levels of the variable hierarchy. Likewise, the Value.DynamicIndex
determines the dynamic offset, and can be set with SetDynamicIndex*
calls.
Vars
variables define the Type
and Role
of data used in the shaders. There are 3 major categories of Var roles:
-
Vertex
andIndex
represent mesh points etc that provide input to Vertex shader -- these are handled very differently from the others, and must be located in aVertexSet
which has a set index of -2. The offsets into allocated Values are updated dynamically for each render Draw command, so you can Bind different Vertex Values as you iterate through objects within a single render pass (again, the underlying vals must be sync'd prior). -
PushConst
(not yet available in WebGPU) are push constants that can only be 128 bytes total that can be directly copied from CPU ram to the GPU via a command -- it is the most high-performance way to update dynamically changing content, such as view matricies or indexes into other data structures. Must be located inPushConstSet
set (index -1). -
Uniform
(read-only "constants") andStorage
(read-write) data that contain misc other data, e.g., transformation matricies. These are the only types that can optionally use theDynamicOffset
mechanism, which should generally be reserved for cases where there is a large and variable number of values that need to be selected among during a render pass. The phong system stores the object-specific "model matrix" and other object-specific data using this dynamic offset mechanism. -
Texture
vars that provide the rawTexture
data, theTextureView
through which that is accessed, and aSampler
that parametrizes how the pixels are mapped onto coordinates in the Fragment shader. Each texture object is managed as a distinct item in device memory.
The world and "normalized display coordinate" (NDC) system for gpu
is the following right-handed framework:
^
Y+ |
|
+-------->
/ X+
/ Z+
v
Which is consistent with the standard cartesian coordinate system, where everything is rotated 90 degrees along the X axis, so that Y+ now points into the depth plane, and Z+ points upward:
^ ^
Z+ | / Y+
| /
+-------->
/ X+
/ Y-
v
You can think of this as having vertical "stacks" of standard X-Y coordinates, stacked up along the Z axis, like a big book of graph paper. In some cases, e.g., neural network layers, where this "stack" analog is particularly relevant, it can be useful to adopt this version of the coordinate system.
However, the advantage of our "Y+ up" system is that the X-Y 2D cartesian plane then maps directly onto the actual 2D screen that the user is looking at, with Z being the "extra" depth axis. Given the primacy and universal standard way of understanding the 2D plane, this consistency seems like a nice advantage.
In this coordinate system, the standard front face winding order is clockwise (CW), so the default is set to: pl.SetFrontFace(wgpu.FrontFaceCW)
in the GraphicsPipeline
.
The above coordinate system is consistent with OpenGL, but other 3D rendering frameworks, including the default in WebGPU, have other systems, as documented here: gpuweb/gpuweb#416. WebGPU is consistent with DirectX and Metal (by design), and is a left handed coordinate system (using FrontFaceCCW
by default), which conflicts with the near-universal right-hand-rule used in physics and engineering. Vulkan has its own peculiar coordinate system, with the "up" Y direction being negative, which turns it into a right-handed system, but one that doesn't make a lot of intuitive sense.
For reference, this is the default WebGPU coordinate system:
^
Y+ |
|
+-------->
/ X+
/ Z-
v
Obviously every system can be converted into every other with the proper combination of camera projection matricies and winding order settings, so it isn't a problem that we use something different than WebGPU natively uses -- it just requires a different winding order setting.
See examples/compute1
for a very simple compute shader, and compute.go for the ComputeSystem
that manages compute-only use of the GPU.
See [gosl] for a tool that converts Go code into WGSL shader code, so you can effectively run Go on the GPU.
Here's how it works:
-
Each WebGPU
Pipeline
holds 1 computeshader
program, which is equivalent to akernel
in CUDA. This is the basic unit of computation, accomplishing one parallel sweep of processing across some number of identical data structures. -
The
Vars
andValues
in theSystem
hold all the data structures your shaders operate on, and must be configured and data uploaded before running. In general, it is best to have a single static set of Vars that cover everything you'll need, and different shaders can operate on different subsets of these, minimizing the amount of memory transfer. -
Because the
Queue.Submit
call is by far the most expensive call in WebGPU, it should be minimized. This means combining as much of your computation into one big Command sequence, with calls to various differentPipeline
shaders (which can all be put in one command buffer) that gets submitted once, rather than submitting separate commands for each shader. Ideally this also involves combining memory transfers to / from the GPU in the same command buffer as well. -
There are no explicit sync mechanisms on the command, CPU side WebGPU (they only exist in the WGSL shaders), but it is designed so that shader compute is automatically properly synced with prior and subsequent memory transfer commands, so it automatically does the right thing for most use cases.
-
Compute is particularly taxing on memory transfer in general, and overall the best strategy is to rely on the optimized
WriteBuffer
command to transfer from CPU to GPU, and then use a staging buffer to read data back from the GPU. E.g., see this reddit post. Critically, the write commands are queued and any staging buffers are managed internally, so it shouldn't be much slower than manually doing all the staging. For reading, we have to implement everything ourselves, and here it is critical to batch theReadSync
calls for all relevant values, so they all happen at once. Use ad-hocValueGroup
s to organize these batched read operations efficiently for the different groups of values that need to be read back in the different compute stages. -
For large numbers of items to compute, there is a strong constraint that only 65_536 (2^16) workgroups can be submitted, per dimension at a time. For unstructured 1D indexing, we typically use
[64,1,1]
for the workgroup size (which must be hard-coded into the shader and coordinated with the Go side code), which gives 64 * 65_536 = 4_194_304 max items. For more than that number, more than 1 needs to be used for the second dimension. The NumWorkgroups* functions return appropriate sizes with a minimum remainder. See examples/compute for the logic needed to get the overall global index from the workgroup sizes.
It is hard to find this info very clearly stated:
- All internal computation in shaders is done in a linear color space.
- Textures are assumed to be sRGB and are automatically converted to linear on upload.
- Other colors that are passed in should be converted from sRGB to linear (the phong shader does this for the PerVertex case).
- The
Surface
automatically converts from Linear to sRGB for actual rendering. - A
RenderTexture
for offscreen / headless rendering must usewgpu.TextureFormatRGBA8UnormSrgb
for the format, in order to get back an image that is automatically converted back to sRGB format.
New*
returns a new object.Config
operates on an existing object and settings, and does everything to get it configured for use.Release
releases allocated WebGPU objects. The usual Go simplicity of not having to worry about freeing memory does not apply to these objects.
See https://web3dsurvey.com/webgpu for a browser of limits across different platforms, for the web platform. Note that the native version typically will have higher limits for many things across these same platforms, but because we want to maintain full interoperability across web and native, it is the lower web limits that constrain.
- https://web3dsurvey.com/webgpu/limits/maxBindGroups only 4!
- https://web3dsurvey.com/webgpu/limits/maxBindingsPerBindGroup 640 low end: plenty of room for all your variables, you just have to put them in relatively few top-level groups.
- https://web3dsurvey.com/webgpu/limits/maxDynamicUniformBuffersPerPipelineLayout 8: should be plenty.
- https://web3dsurvey.com/webgpu/limits/maxVertexBuffers 8: can't stuff too many vars into the vertex group, but typically not a problem.