Building a fast single source GPGPU language in C++, and rendering black holes in it

Hi! Last episode we built a schwarzschild black hole raytracer together, and we learnt several things

  1. Maths is hard
  2. General relativity is slow to simulate

A lot of work in general relativity, and virtually all of numerical relativity, is written on the CPU. We’re going to skip all of that, and go right for super high performance stuff: GPU programming

Where we’re going, the black hole simulation we built together is the fastest possible thing to simulate on a CPU, and its still not exactly fast at 15 seconds per frame. While there are ways1 to speed up that simulation, we’re going to need a lot more compute power in general

Relativistic eularian hydrodynamics, on a GPU

Why roll our own language though? Isn’t that kind of complicated?

Its worth examining what our alternatives here are:

API Pros Cons
Cuda Good support, good feature set Nvidia only
OpenCL Cross platform, resonably well supported, good feature set Writing C99 is at best painful, slightly crusty, AMD’s implementation has issues, some api overhead
Vulkan Cross platform, well supported except on mac, low api overhead Writing vulkan is painful, poor shader languages, missing gpgpu features
HIP/ROCm ? AMD’s support, AMD-only, primarily linux, buggy implementation, performance issues, bad docs, no support for most GPUs
SYCL Single source Lower performance
webgl/webgpu very good portability Bad performance, serious limitations make it unusable for us, wgsl

We’re looking for a cross platform, cross vendor solution, which leaves Vulkan, OpenCL, and perhaps SYCL. Vulkan is increasingly a viable GPGPU solution for the future2, but is still missing features here and there, so the somewhat venerable OpenCL is the ‘best’ option out of a limited selection

The biggest issue for any choice here is that shader languages are, by and large, pretty unsuited for what we’re trying to do. Vulkan is often written in GLSL, and OpenCL is a restricted variant of C99 (though some version of C++ on OpenCL does exist), both of which are very challenging languages3 for trying to build a complex project in. I’d like to use OpenCL as a backend in this article here, and generate all of our kernels programmatically. If you’d like to follow along with source, you can find the code here

We’ve got a lot to get through in this article, so lets get into it. The topic doesn’t really leave much room for diagrams or anything visual to keep you engaged, so I’m going to substitute by decorating this article with informative pictures of my cat

Catty

She’s a good cat

Design

Lets imagine we want to write a function to copy a buffer in OpenCL. That looks like this:

__kernel
void copy_buffer(global float* in, global float* out) {
    int idx = get_global_id(0);
    out[idx] = in[idx];
}

This is about as basic as it gets on a GPU. idx is our thread id (think: std::thread::get_id(), though ids run from 0 -> work size - 1). global denotes the address space4 that your pointer points to

What we’d like to do is write this from C++ instead, and have it be translated into OpenCL. We want to write this:

void copy_buffer(buffer<float> in, buffer_mut<float> out) {
    valuei idx = get_global_id(0);
    as_ref(out[idx]) = in[idx];
}

in C++, and be able to run something like std::string my_kernel = kernel_to_string(copy_buffer) to produce a kernel that is equivalent to our original OpenCL version

To start off, we’re going to leverage the value type written in the automatic differentiation segment of this article series, and work on the bodies of these functions first. To recap, a value is a type that forms a tree of expressions, which we can evaluate at a later date:

valuef v1 = 1234.f;
valuef v2 = 2345.f;
valuef v3 = v1 + v2;
//v3 is a node of type 'op::ADD', and contains two arguments which are '1234.f' and '2345.f'

To achieve our goals of building a high performance GPGPU language, we need the following things

  1. A way to turn a value into a string
  2. Control flow, and side effects
  3. The ability to declare kernels in C++, that are translated to an equivalent declaration on the GPU
  4. Mutability
  5. Deterministic Optimisations

The purpose of this article is not to present a detailed analysis of a particular implementation, but instead to present the kinds of problems you’re likely to run into if you decide to build something like this - and how to solve them

Value to string

Previously, we designed a simple value type together - this is a tree. Each node is either a value, or an operation. First up, we need to extend this class to hold an abstract value, in the form of a std::string abstract_value member, holding something like "x" or "v1". This gives us a data layout that looks like this:

struct value_base {
    std::variant<double, float, float16, int> concrete;
    std::vector<value_base> args;
    op::type type = op::NONE;
    std::string name; //so that variables can be identified later. Clunky!
    std::string abstract_value; //if we are a named value, like eg a variable, instead of a concrete value
};

//Enforces strongly typed arithmetic operations
template<typename T>
struct value<T> : value_base {

};

using valuef = value<float>;
//etc

name and abstract_value serve two different purposes. name is for a user to stuff data into, so they can identify it themselves later5. abstract_value is where we’ll be storing the names of our variables

The next thing to do is to add a way to turn an expression stored in a value, into a string:

///handles function calls, and infix operators
std::string function_call_or_infix(const value_base& v)
{
    using namespace op;

    std::map<op::type, std::string> table {
        {PLUS, "+"},
        {MINUS, "-"},
        {UMINUS, "-"},
        {MULTIPLY, "*"},
        {DIVIDE, "/"},
        {MOD, "%"},

        {LT, "<"},
        {LTE, "<="},
        {EQ, "=="},
        {GT, ">"},
        {GTE, ">="},
        {NEQ, "!="},
        {NOT, "!"},
        {LOR, "||"},
        {LAND, "&&"},
        {SIN, "sin"},
        {COS, "cos"},
        {TAN, "tan"},
        {SQRT, "sqrt"},
        {FMOD, "fmod"},

        {GET_GLOBAL_ID, "get_global_id"}, //this maps to the thread id
    };

    std::set<op::type> infix{PLUS, MINUS, MULTIPLY, DIVIDE, MOD, LT, LTE, EQ, GT, GTE, NEQ, NOT, LOR, LAND};

    //generate (arg[0] op arg[1]) as a string
    if(infix.count(v.type)) {
        return "(" + value_to_string(v.args[0]) + table.at(v.type) + value_to_string(v.args[1]) + ")";
    }

    //otherwise, this is a function call
    std::string args;
    //join all the function arguments together separated by a comma
    for(int i=0; i < (int)v.args.size(); i++)
    {
        args += value_to_string(v.args[i]) + ",";
    }

    if(args.size() > 0)
        args.pop_back();

    //generates (funcname(args[0],args[1],args[2],etc))
    return "(" + table.at(v.type) + "(" + args + "))";
}

std::string value_to_string(const value_base& v) {
    if(v.op.type == value::VALUE) {
        ///variable, eg a string like "xyzw"
        if(v.abstract_value.size() != 0) {
            return v.abstract_value;
        }

        ///constant literal
        return std::visit([]<typename T>(const T& in)
        {
            std::string suffix;

            if constexpr(std::is_same_v<T, float16>)
                suffix = "h";
            if constexpr(std::is_same_v<T, float>)
                suffix = "f";

            //to_string_s is implemented in terms of std::to_chars, but always ends with a "." for floating point numbers, as 1234f is invalid syntax in OpenCL
            if(in < 0)
                return "(" + to_string_s(in) + suffix + ")";
            else
                return to_string_s(in) + suffix;
        }, v.concrete);
    }

    return function_call_or_infix(v);
};

Making sure you have a consistent bracketing scheme is important here, eg correctly bracketing negative values (-123.f) to make sure we correctly implement the order of operations. With the advent of widespread support for std::to_chars, we can also finally turn arbitrary literals into the shortest, roundtrippable floating point representation, which makes this segment a lot simpler

With all of this, we can now manipulate expressions with symbols in, like so

//note that we will never set abstract_value directly in user facing code, this is just illustrative
valuef x;
x.abstract_value = "x";

valuef y;
y.abstract_value = "y";

valuef z = (sin(x) * sin(y)) + 1234;

Where doing value_to_string(z) gives:

"(sin(x)*sin(y))+1234"

We’ll be extending this function shortly, to handle more cases

Side effects

This kind of expression based system is pretty cool. Its a pure functional language, albeit lacking a few features. There’s two issues

  1. We can’t actually do anything in it, only generate the right hand side of equations, and we clearly need to be able to do stuff
  2. There’s an exponential blowup in terms of the amount of code we generate for big equations

Fundamentally, we need side effects and to break the neat purity of our language. To handle this we’re going to introduce a new type that looks like this initially:

struct execution_context_base {
    std::vector<value_base> to_execute;

    virtual void add(const value_base& b) = 0;
    virtual void int next_id() = 0;

    virtual ~execution_context_base(){}
}

struct execution_context : execution_context_base {
    int id = 0;

    void add(const value_base& in) override {
        to_execute.push_back(in);
    }

    int next_id() override {
        return id++;
    }
};

This stores a linear6 sequence of commands to execute. To generate our final string, we’ll simply iterate over the list to_execute, and generate code sequentially via value_to_string. Once we do this, we’ll have generated our full program. The reason why we define a _base type, is because you may have your own strategy for how you want to build to_execute, or you might want to stuff extra data into this struct

One thing to note here, is that I’ve leaned towards making this global (per-thread). I’m generally against global state, but each kernel only ever needs to access one execution context - unless we want to generate functions without our kernels - which is straightforwardly modelled as a stack. In practice, making it global this simplifies a lot of code tremendously, and I haven’t run into any negatives yet

To implement this, looks like this:

std::vector<execution_context_base*>& context_stack() {
    static thread_local std::vector<execution_context_base*> ctx;
    return ctx;
}

template<typename T>
T& push_context()  {
    T* ptr = new T();
    context_stack().push_back(ptr);
    return *ptr;
}

void pop_context() {
    auto ptr = context_stack().back();
    delete ptr;
    context_stack().pop_back();
}

execution_context_base& get_context() {
    return *context_stack().back();
}

If you’re unconvinced, all functions come in both an implicit and explicit form, with the implicit forms living in a separate namespace, so global state is optional. Its way nicer though

Mutability

One key issue that crops up is the issue of mutability. In C++, everything we’re dealing with is a regular C++ value. If we write

valuef v1 = v2;
v1 = 3245;

We are overwriting v1, not creating a new node which assigns to it in the AST. Additionally, if we have an expression with a mutable variable in it, repeatedly evaluating it may lead to the result of the expression changing - which isn’t ideal. We do not want variables that can change to be hidden! Down below we’ll implement assignment, but before we get there we’ll want to discuss how to mark something as mutable

The best way I’ve found is to use a struct, which is defined like this

template<typename T>
struct mut : T {
    const T& as_constant() const {
        return *this;
    }

    void set_from_constant(const T& in) {
        static_cast<T&>(*this) = in;
    }
}

which we can use like mut<valuef> some_value, or tensor<mut<valuef>, 4> some_tensor. We’ll be using this primarily as a tag type to detect whether or not we intend something to be mutable. This helps tremendously with managing the mutability issues in what we’re implementing, as well as letting us implement assignment in a way that’s less awkward than a function

Back to side effects again

Feature Wishlist

Its time to define and implement all the impure parts of the language: control flow, assignment, declarations, and everything else. We need to support the following features that fall under the category of ‘side effect’:7

  1. Branches
  2. Variable declarations
  3. Assignment
  4. While loops
  5. For loops
  6. Return
  7. Break

This segment is a mixture of reference material, and implementation. You’ll likely get the gist of how this works fairly quickly, in which case you can skip to kernel declaration

Branches

Lets look at the implementation of a branch function. Initially, we’ll split this up into two parts:

  1. An expression representing a branch op
  2. An execution of that expression, ie a statement
template<typename T>
value_base if_b(const value<bool>& condition) {
    value_base decl;
    decl.type = op::IF;
    decl.args = {condition};
    return decl;
}

This still doesn’t actually do anything. Next up is where it gets interesting:

value_base block_start_b() {
    value_base base;
    base.type = op::BLOCK_START;
    return base;
}

value_base block_end_b() {
    value_base base;
    base.type = op::BLOCK_END;
    return base;
}

template<typename T>
value_base if_e(execution_context& ectx, const value<bool>& condition, T&& then) {
    ectx.add(if_b(condition));
    ectx.add(block_start_b());

    then();

    ectx.add(block_end_b());
}

template<typename T>
value_base if_e(const value<bool>& condition, T&& then) {
    return if_e(get_context(), condition, std::forward<T>(then));
}

block_start_b() and block_end_b() map to { and } respectively, making if_e translate to

if(condition) {
    then();
}

which is what we’re looking for. We use this in code like in the following example:

value<int> v1 = 1234;
value<int> v2 = 23456;

if_e(v1 < v2, [&]{
    //do stuff inside the branch
});

Code generation

std::string value_to_string(const value_base& v) {
    ///everything else
    if(v.type == op::BLOCK_START)
        return "{";

    if(v.type == op::BLOCK_END)
        return "}";

    if(v.type == op::IF)
        return "if(" + value_to_string(v.args.at(0)) + ")";
}

Variable declarations

This follows a very similar pattern to branching, though with an extra wrinkle in that we’d like to be able to give a variable a specific name

template<typename T>
value_base declare_b(const std::string& name, const value<T>& rhs) {
    value_base type = name_type(T());

    value<T> v;
    v.type = op::DECLARE;
    v.args = {type, name, rhs};

    return v;
}

name_type is one of those things thats very tedious to write, but non complicated. It’ll crop up repeatedly when we need to take a concrete type. To avoid dumping huge amounts of lightly relevant code, check it out here. It takes a type, like int or float16, and converts it as a string to the OpenCL equivalent: ie "int" or "half"8

Execution

We want different versions for declaring a mutable type, and a non mutable type here. This maps to eg float and const float, and helps with safety. We also need overloads for tensors as well, which makes all of this a little verbose:

template<typename T>
value<T> declare_e(execution_context_base& ectx, const std::string& name, const value<T>& rhs)
{
    ectx.add(declare_b(name, rhs));

    value<T> named;
    named.type = op::VALUE;
    named.abstract_value = name;

    return named;
}

template<typename T>
value<T> declare_e(execution_context_base& ectx, const value<T>& rhs){
    return declare_e(get_context(), "decl_" + std::to_string(get_context().next_id()), rhs);
}

template<typename T, int N>
auto declare_e(execution_context_base& ectx, const tensor<T, N>& rhs){
    tensor<decltype(declare_e(ectx, T())), N> ret;

    for(int i=0; i < N; i++){
        ret[i] = declare_e(ectx, rhs[i]);
    }

    return ret;
}

template<typename T>
mut<value<T>> declare_mut_e(execution_context_base& ectx, const std::string& name, const value<T>& rhs){
    mut<value<T>> val;
    val.set_from_constant(declare_e(ectx, name, rhs));
    return val;
}

template<typename T>
mut<value<T>> declare_mut_e(execution_context_base& ectx, const value<T>& rhs){
    mut<value<T>> val;
    val.set_from_constant(declare_e(ectx, rhs));
    return val;
}

template<typename T, int N>
auto declare_mut_e(execution_context_base& ectx, const tensor<T, N>& rhs){
    tensor<decltype(declare_mut_e(ectx, T())), N> ret;

    for(int i=0; i < N; i++) {
        ret[i] = declare_mut_e(ectx, rhs[i]);
    }

    return ret;
}


template<typename T>
value<T> declare_e(const std::string& name, const value<T>& rhs){
    return declare_e(get_context(), name, rhs);
}

template<typename T>
value<T> declare_e(const value<T>& rhs){
    return declare_e(get_context(), rhs);
}

template<typename T>
value<T> declare_e(const mut<value<T>>& rhs){
    return declare_e(get_context(), rhs);
}

template<typename T, int N>
auto declare_e(const tensor<T, N>& rhs){
    return declare_e(get_context(), rhs);
}

template<typename T>
mut<value<T>> declare_mut_e(const std::string& name, const value<T>& rhs){
    return declare_mut_e(get_context(), name, rhs);
}

template<typename T>
mut<value<T>> declare_mut_e(const value<T>& rhs){
    return declare_mut_e(get_context(), rhs);
}

template<typename T, int N>
tensor<mut<T>, N> declare_mut_e(const tensor<T, N>& rhs){
    return declare_mut_e(get_context(), rhs);
}

Code Generation

std::string value_to_string(const value_base& v) {
    if(v.type == op::DECLARE){
        //v1 v2 = v3;
        if(v.args.size() == 3)
            return value_to_string(v.args.at(0)) + " " + value_to_string(v.args.at(1)) + " = " + value_to_string(v.args.at(2));
        //v1 v2;
        else if(v.args.size() == 2)
            return value_to_string(v.args.at(0)) + " " + value_to_string(v.args.at(1));
    }
}

Example

With this, we can now write the following code:

valuef v1 = 1234;
valuef v2 = 2345;

valuef v = declare_e(v1 + v2);

And this generates:

float decl_0 = 1234 + 2345;

We can also write

mut<valuef> v1 = declare_mut_e(valuef(0.f));

To make a new mutable variable

Assignment

template<typename T>
value_base assign_b(const mut<value<T>>& v1, const value<T>& v2) {
    value_base result;
    result.type = op::ASSIGN;
    result.args = {v1, v2};

    return result;
}

template<typename T, int... N>
tensor<value_base, N...> assign_b(const tensor<mut<T>, N...>& v1, const tensor<T, N...>& v2) {
    return tensor_for_each_nary([&](const mut<T>& v1, const T& v2) {
        return assign_b(v1, v2);
    }, v1, v2);
}

tensor_for_each_nary is something that takes 1+ tensors, and applies a function to the components of each elementwise

Execution

template<typename T>
void assign_e(execution_context_base& e, const mut<T>& v1, const T& v2) {
    return e.add(assign_b(v1, v2));
}

template<typename T, int... N>
void assign_e(execution_context_base& e, const tensor<mut<T>, N...>& v1, const tensor<T, N...>& v2) {
    return e.add(assign_b(v1, v2));
}

template<typename T>
void assign_e(const mut<T>& v1, const T& v2) {
    return assign_e(get_context(), v1, v2);
}

template<typename T, int... N>
void assign_e(const tensor<mut<T>, N...>& v1, const tensor<T, N...>& v2) {
    return assign_e(get_context(), v1, v2);
}

Extending mut<>

One nice thing we do here is also extend mut<>, to allow for a clean assignment syntax, as well as implement an as_ref function

template<typename T, typename U>
struct mutable_proxy {
    const T& v;
    U& ctx;

    mutable_proxy(const T& _v, U& _ctx) : v(_v), ctx(_ctx){}

    template<typename V>
    void operator=(const V& to_set) {
        assign_e(ctx, v, to_set);
    }
};

template<typename T>
struct mut : T {
    /*NEW*/
    template<typename U>
    auto as_ref(U& executor) {
        return mutable_proxy(*this, executor);
    }
};

template<typename T>
auto as_ref(const T& in) {
    return in.as_ref(get_context());
}

This returns a reference to a type mutable_proxy, where operator= forwards to our assignment function, letting us use a more natural syntax for assignment

Code generation

std::string value_to_string(const value_base& v) {
    if(v.type == op::ASSIGN)
        return value_to_string(v.args.at(0)) + " = " + value_to_string(v.args.at(1));
}

Example

mut<valuef> v1 = declare_mut_e("var", valuef(0.f));

valuef v2 = 1234;
valuef v3 = 2345;

assign_e(v1, v2 + v3);

Produces

float var = 0.f;
var = 1234 + 2345;

Much more cleanly, we can write

mut<valuef> v1 = declare_mut_e("var", valuef(0.f));

valuef v2 = 1234;
valuef v3 = 2345;

as_ref(v1) = v2 + v3;

This might seem odd/hacky, but it results in a significant increase in usability

While loops

While loops are structurally identical to an if, so there’s nothing new in here

value_base while_b(const value<bool>& condition)
{
    value_base base;
    base.type = op::WHILE;
    base.args = {condition};

    return base;
}

Execution

template<typename Func>
void while_e(execution_context_base& ectx, const value<bool>& condition, Func&& then)
{
    ectx.add(while_b(condition));
    ectx.add(block_start_b());

    then();

    ectx.add(block_end_b());
}

template<typename Func>
void while_e(const value<bool>& condition, Func&& then) {
    return while_e(get_context(), condition, std::forward<Func>(then));
}

Code generation

std::string value_to_string(const value_base& v) {
    if(v.type == op::WHILE)
        return "while(" + value_to_string(v.args.at(0)) + ")";
}

Example

mut<valuef> v1 = declare_mut_e(valuef(0));

while_e(v1 < 10, [&]
{
    as_ref(v1) = v1 + 1;
});

Generates

float decl_0 = 0.f;

while(decl_0 < 10.f)
{
    decl_0 = decl_0 + 1.f;
}

For loops

There’s one wrinkle with for loops. Ideally we’d like to be able to write

for_e(context, init, condition, expr, body);

The only issue is that its very difficult in a function call to declare a variable, and then refer to it later. For this reason, for loops take the form

for_e(context, condition, expr, body);
value_base for_b(const value<bool>& condition, const value_base& execute)
{
    value_base type;
    type.type = op::FOR;
    type.args = {condition, execute};

    return type;
}

Execution

template<typename T>
void for_e(execution_context_base& ectx, const value<bool>& condition, const value_base& execute, T&& then)
{
    ectx.add(for_b(condition, execute));
    ectx.add(block_start_b());

    then();

    ectx.add(block_end_b());
}

template<typename T>
void for_e(const value<bool>& condition, const value_base& execute, T&& then) {
    for_e(get_context(), condition, execute, std::forward<T>(then));
}

Code Generation

    if(v.type == op::FOR)
        return "for(;" + value_to_string(v.args.at(0)) + ";" + value_to_string(v.args.at(1)) + ")";

Example

mut<valuei> idx = declare_mut_e("i", valuei(0));

//one of the very few times we use the non execution form of assign
for_e(idx < 1024, assign_b(idx, idx+1), [&]
{
    //do something in the loop
});

Generates

int i=0;

for(; i<1024; i = i+1)
{

}

Return

We’re not going to go into declaring your own functions yet, which makes return slightly non useful, but we might as well support it for the moment

template<typename T>
value_base return_b(const value<T>& in)
{
    value<T> v;
    v.type = op::RETURN;
    v.args = {in};
    return v;
}

value_base return_b()
{
    value_base v;
    v.type = op::RETURN;
    return v;
}

Execution

template<typename T>
void return_e(execution_context_base& ectx, const value<T>& v)
{
    ectx.add(return_b(v));
}

void return_e(execution_context_base& ectx)
{
    ectx.add(return_b());
}

template<typename T>
void return_e(const value<T>& v) {
    return_e(get_context(), v);
}

void return_e() {
    return_e(get_context());
}

Code Generation

    if(v.type == op::RETURN) {
        if(v.args.size() > 0)
            return "return " + value_to_string(v.args.at(0));
        else
            return "return";
    }

Break

value_base break_b()
{
    value_base v;
    v.type = op::BREAK;
    return v;
}

Execution

void break_e(execution_context_base& ectx)
{
    ectx.add(break_b());
}

void break_e()
{
    break_e(get_context());
}

Code Generation

    if(v.type == op::BREAK)
        return "break";

Unsprisingly, very unexciting

Side Effect Summary

To sum up, we now have all of the basic tools we need to write procedural code. We can write progams that look like this:

mut<valuef> my_var = declare_mut_e("var", valuef(-1));

if_e(my_var < 1.f, [&] {
    while_e(my_var < 1, [&] {
        as_ref(my_var) = my_var + 1;

        break_e();
    });
});

This, while not the prettiest code ever, is workable. The mutability and control flow story is by far the ugliest part of what we’re building today. Luckily, this tends to make up a smaller part of the simulations that we’re going to build eventually, so it fades a little into the background

its a cat

Kernel declaration

At this point we can handle everything inside a kernel, we just need to be able to declare a kernel itself. Here’s a concrete example of what a kernel in OpenCL looks like, showcasing the datatypes we’ll need to support:

__kernel
void my_kernel(global float* buf, const global int2* buf2, float4 lit, __read_only image2d_t some_image) {

}

And here’s what we’d like this to look like on the CPU side of things, ideally

void my_kernel(execution_context& ectx, buffer_mut<float> buf, buffer<tensor<int, 2>> buf2, literal<tensor<float, 4>> lit, read_only_image<2> some_image) {

}

Doing this is going to involve some fairly involved template metaprogramming. We want to

  1. Iterate over the arguments of our function
  2. Inspect the type of the argument
  3. Output some sort of parsed representation of each function argument
  4. Stringify the whole thing

While we’re here, we’re also going to spend the time generating the body of our function as well

Lets get to it

Argument Types

These are the following types we’ll need. The name is set by the kernel generator, and is what we use to actually access our kernel arguments

template<typename T>
struct buffer {
    std::string name;
    using value_type = T;

    value<T> operator[](const value<int>& index);
};

template<typename T>
struct buffer_mut {
    std::string name;
    using value_type = T;

    mut<value<T>> operator[](const value<int>& index);
};

template<typename T>
struct literal {
    std::string name;
    using value_type = T;

    value<T> get();
};

template<int N>
struct image {
    std::string name;
};

template<int N>
struct read_only_image : image<N> {
    template<typename T, int M>
    tensor<value<T>, M> read(const tensor<value<int>, N>& pos) const;
};

template<int N>
struct write_only_image : image<N> {
    template<typename T, int M>
    void write(const tensor<value<int>, N>& pos, const tensor<value<T>, M>& val) const;
};

Argument processing

Each function argument maps to an input, defined in the next section. We deduce some basic information (constness, is it a pointer, whats the type) about the argument based on its type, and stuff that into our input

Argument names are dynamically generated - and are set while iterating over the types

namespace impl {
    template<typename T>
    void add(buffer<T>& buf, type_storage& result)
    {
        input in;
        in.pointer = true;
        in.is_constant =  true;

        std::string name = "buf" + std::to_string(result.args.size());

        in.name = name;
        buf.name = name;

        result.args.push_back(in);
    }

    template<typename T>
    void add(buffer_mut<T>& buf, type_storage& result)
    {
        input in;
        in.pointer = true;
        in.is_constant =  false;

        std::string name = "buf" + std::to_string(result.args.size());

        in.name = name;
        buf.name = name;

        result.args.push_back(in);
    }

    template<typename T>
    void add(literal<T>& lit, type_storage& result)
    {
        input in;
        in.type = name_type(T());
        in.pointer = false;

        std::string name = "lit" + std::to_string(result.args.size());

        in.name = name;
        lit.name = name;

        result.args.push_back(in);
    }

    template<int N>
    void add(read_only_image<N>& img, type_storage& result)
    {
        input in;
        in.is_image = true;
        in.is_constant = true;
        in.image_N = N;

        std::string name = "img" + std::to_string(result.args.size());

        in.name = name;
        img.name = name;

        result.args.push_back(in);
    }

    template<int N>
    void add(write_only_image<N>& img, type_storage& result)
    {
        input in;
        in.is_image = true;
        in.image_N = N;

        std::string name = "img" + std::to_string(result.args.size());

        in.name = name;
        img.name = name;

        result.args.push_back(in);
    }
}

Argument to string

Each input contains the information necessary to actually turn one argument into a string that OpenCL can understand

struct input
{
    std::string type;
    bool pointer = false;
    bool is_constant = false;
    bool is_image = false;
    int image_N = 0;
    std::string name;

    std::string format() const
    {
        if(is_image)
        {
            if(is_constant)
                return "__read_only image" + std::to_string(image_N) + "d_t " + name;
            else
                return "__write_only image" + std::to_string(image_N) + "d_t " + name;
        }

        if(pointer)
        {
            std::string cst = is_constant ? "const " : "";

            return "__global " + cst + type + "* __restrict__ " + name;
        }
        else
        {
            return type + " " + name;
        }
    }
};

High level code generation

struct type_storage
{
    std::vector<input> args;
};

struct function_context
{
    type_storage inputs;
};

//Push a context, process the function arguments and convert them to inputs, then call our function
template<typename T, typename R, typename... Args>
void setup_kernel(R(*func)(T&, Args...), function_context& ctx)
{
    //create a new context, based on whatever the first argument is of our kernel
    T& ectx = push_context<T>();

    //Instantiate the list of arguments to our function, without the T& parameter
    std::tuple<std::remove_reference_t<Args>...> args;

    //for each argument
    std::apply([&](auto&&... expanded_args){
        //add it to our list of inputs
        (impl::add(expanded_args, ctx.inputs), ...);
    }, args);

    std::tuple<T&> a1 = {ectx};

    //before finally calling our argument with the new context we made, and the function arguments
    std::apply(func, std::tuple_cat(a1, args));
}

//this function is responsible for doing the final assembly on our code
//it takes our processed arguments and stringifies them, as well as stringifying the body of our kernels
std::string generate_kernel_string(function_context& kctx, const std::string& kernel_name)
{
    execution_context_base& ctx = get_context();

    std::string base;

    //in opencl, if you want to use the half datatype, you have to enable it manually via an extension. I do this unconditionally, because your gpu definitely supports it
    base += "#pragma OPENCL EXTENSION cl_khr_fp16 : enable\n\n";

    base += "__kernel void " + kernel_name + "(";

    //join together all the function arguments
    for(int i=0; i < (int)kctx.inputs.args.size(); i++)
    {
        base += kctx.inputs.args[i].format();

        if(i != (int)kctx.inputs.args.size() - 1)
            base += ",";
    }

    base += ")\n{\n";

    //most statements that are at the top level are going to end with a semicolon
    //but we don't want to write something like:
    /* if(x);
       {;
           some_statement();
       };
    */
    //Because that would be bad (tm), so we have to be careful
    auto is_semicolon_terminated = [](op::type t)
    {
        return t == op::ASSIGN || t == op::DECLARE || t == op::RETURN || t == op::BREAK || t == op::IMAGE_READ || t == op::IMAGE_WRITE;
    };

    int indentation = 1;

    int bid = 0;

    //iterate over the linear list of commands to execute, and generate the commands, with indentation to make the code more readable
    for(const value_base& v : ctx.to_execute)
    {
        if(v.type == op::BLOCK_END)
            indentation--;

        std::string prefix(indentation * 4, ' ');

        base += prefix + "//" + std::to_string(bid) + "\n";

        if(is_semicolon_terminated(v.type))
            base += prefix + value_to_string(v) + ";\n";
        else
            base += prefix + value_to_string(v) + "\n";

        if(v.type == op::BLOCK_START)
            indentation++;

        bid++;
    }

    base += "\n}\n";

    return base;
}

//tie it all together
template<typename T>
std::string make_function(T&& in, const std::string& kernel_name)
{
    function_context kctx;
    setup_kernel(in, kctx);

    std::string str = generate_kernel_string(kctx, kernel_name);

    pop_context();
    return str;
}

We can now make kernels like this:

//fills a buffer with a literal value
void test_kernel(buffer_mut<float> buf, lit<int> in) {
    valuei idx = get_global_id(0);

    as_ref(buf[idx]) = in.get();
}

//feed this to opencl, hooray!
std::string transpiled = make_function(test_kernel, "my_kernel");

At this point the language is ‘done’, and we could get up and running with this now. There’s one major thing we do want to address though, and that’s optimising the AST - in the most basic way

cat

Deterministic optimisations

The rules for ieee floating point do not allow for a wide class of optimisations. Eg for floats

a*b - a*b != 0

Because if either a, or b are NaN, then the result is NaN not 0. The compiler must be able to prove that this isn’t true, which in most cases it can’t do

When running visualisations on the CPU in C/C++, its very common to turn on -ffast-math. In OpenCL, the equivalent is -cl-fast-relaxed-math, and it does essentially the same thing - allow the compiler to perform technically invalid, but in practice very useful compiler optimisations like the above

There are a few issues with using these compiler flags

  1. The optimisations that the compiler performs are unreliable, and often very erratic under small code changes
  2. The optimisations make your results very hard/impossible to reproduce
  3. Compiler upgrades (which on a GPU especially, is essentially out of your control) will change your results

The big advantage we have when operating on our AST, is that we can deterministically optimise our expression tree, and the resulting code that we output will be reproducible by someone else. In fact, they don’t need to know anything about our technique, because we can give them the postprocessed source. Hooray!

There in general are a few major benefits to optimising the AST directly:

  1. We can work around compiler optimisation failures
  2. We can decide which floating point rules we want to ignore ourselves9
  3. It reduces the size of the generated code massively

Lets define the classes of operations we want to optimise

  1. Constant propagation
  2. Any optimisation that would be something that you would uncontroversially write by hand, but which may be invalid under ieee rules, eg x-x = 0
  3. A transform that will reduce the size of the code, eg 1.f * x -> x, or x + 0 -> x

Constant Propagation

This is very repetitive, so I’m not going to dump the full copypaste constant propagation to every function, but instead give an illustrative set of examples

struct value_base {
    /*new*/
    //must hold a constant that we can fetch
    bool is_concrete_type() const
    {
        return type == op::VALUE && abstract_value.size() == 0;
    }
}

//this function takes two values, and checks if they hold a constant. If they do, apply a function to those constants
//the return type is a very long winded way of saying that we return an optional of the return type of 'func' applied to our arguments, which was wrapped in a value
template<typename T, typename U>
auto replay_constant(const value<T>& v1, const value<T>& v2, U&& func) -> std::optional<value<std::remove_reference_t<decltype(func(T(), T()))>>>
{
    if(!v1.is_concrete_type() || !v2.is_concrete_type())
        return std::nullopt;

    if(v1.concrete.index() != v2.concrete.index())
        return std::nullopt;

    if(!std::holds_alternative<T>(v1.concrete))
        return std::nullopt;

    return func(std::get<T>(v1.concrete), std::get<T>(v2.concrete));
}

//replay_constant, but for unary functions
template<typename T, typename U>
auto replay_constant(const value<T>& v1, U&& func) -> std::optional<value<std::remove_reference_t<decltype(func(T()))>>>
{
    if(!v1.is_concrete_type())
        return std::nullopt;

    if(!std::holds_alternative<T>(v1.concrete))
        return std::nullopt;

    return func(std::get<T>(v1.concrete));
}

#define PROPAGATE_INFIX(x, y, op) if(auto it = replay_constant(x, y, [](auto&& a, auto&& b){return a op b;})){return it.value();}
#define PROPAGATE2(x, y, op) if(auto it = replay_constant(x, y, [](auto&& a, auto&& b){return op(a,b);})){return it.value();}
#define PROPAGATE1(x, op)  if(auto it = replay_constant(x, [](auto&& a){return op(a);})){return it.value();}

friend value<T> operator+(const value<T>& v1, const value<T>& v2) {
    PROPAGATE_INFIX(v1, v2, +);

    value<T> result;
    result.type = op::PLUS;
    result.args = {v1, v2};
    return result;
}

template<typename T>
value<T> sin(const value<T>& v1)
{
    using std::sin; //sometimes i wish floats were in namespace std, and we could use adl on them
    PROPAGATE1(v1, sin);

    value<T> ret;
    ret.type = op::SIN;
    ret.args = {v1};
    return ret;
}

Obvious IEEE Non Compliant wins

There are several major transforms here that we want to do

  1. 0 * x = 0

  2. x - x = 0

  3. 0 / x = 0

  4. x / x = 1

The second is particulary important, as its a compiler optimisation that is missed in our case with -cl-fast-relaxed-math turned on, and a big performance optimisation in general. The first is a major improvement when we’re not using compiler optimisations. 3 and 4 are less significant

To be able to do these optimisations, we need to be able to determine equivalence between two values

bool equivalent(const value_base& v1, const value_base& v2)
{
    //must be the same kind of op
    if(v1.type != v2.type)
        return false;

    //must have the same kind of abstract value name
    if(v1.abstract_value != v2.abstract_value)
        return false;

    //same type of variable
    if(v1.concrete.index() != v2.concrete.index())
        return false;

    bool invalid = true;

    //same value stored in concrete
    std::visit([&](auto&& i1, auto&& i2)
    {
        invalid = !(i1 == i2);
    }, v1.concrete, v2.concrete);

    if(invalid)
        return false;

    //same number of arguments
    if(v1.args.size() != v2.args.size())
        return false;

    ///use the associativity of +* and check for the reverse equivalence
    if(v1.type == op::PLUS || v1.type == op::MULTIPLY)
    {
        if(equivalent(v1.args[0], v2.args[1]) && equivalent(v1.args[1], v2.args[0]))
            return true;
    }

    //check that all our arguments are equivalent
    for(int i=0; i < (int)v1.args.size(); i++)
    {
        if(!equivalent(v1.args[i], v2.args[i]))
            return false;
    }

    return true;
}

This is the full spiel of what you need to do in order to check if two values are strictly equivalent. Note that we are not checking for identical, as this uses the associativity of * and + to search for equivalence in the reverse order of arguments as well

To use this, we modify our operators for value:

friend value<T> operator-(const value<T>& v1, const value<T>& v2) {
    if(equivalent(v1, v2))
        return (T)0;

    PROPAGATE_INFIX(v1, v2, -);

    value<T> result;
    result.type = op::MINUS;
    result.args = {v1, v2};
    return result;
}

And

friend value<T> operator*(const value<T>& v1, const value<T>& v2) {
    //0 * x
    if(equivalent(v1, value<T>(0)))
        return (T)0;

    //x * 0
    if(equivalent(v2, value<T>(0)))
        return (T)0;

    //1 * x
    if(equivalent(v1, value<T>(1)))
        return v2;

    //x * 1
    if(equivalent(v2, value<T>(1)))
        return v1;

    PROPAGATE_INFIX(v1, v2, *);

    value<T> result;
    result.type = op::MULTIPLY;
    result.args = {v1, v2};
    return result;
}

To take two examples. Check the source linked at the end if you want the full list of applied transforms

krat

Performance Benchmarks

The implementation is now done. Its time to take more cat pictures examine how this stacks up against the competition - a hand written OpenCL implementation that I built especially for this10 to compare against our single source version. We’re going to examine:

  1. Both programs without any ieee non compliant transforms applied, nor constant propagation
  2. Trig11 functions and sqrt set to use the native gpu functions12: native_sin + native_sqrt
  3. native + native_divide
  4. native_divide only
  5. -cl-fast-relaxed-math
  6. native + -cl-fast-relaxed-math
  7. native + native_divide + -cl-fast-relaxed-math

We’ll also compare with, and without the programmatic transforms applied, which includes the IEEE non compliant ones. A hand written version of the OpenCL implementation using native_divide is difficult to do by hand with any kind of fairness - to represent how I might actually write this code, divisions within the hot loop were transformed

Click here for the single source code, here for the hand written OpenCL version, and here for the original CPU code we wrote last time round. All times are in ms/frame, and were loosely eyeballed within 1-2ms

. Standard Native Native + native_divide native_divide only Relaxed Native + Relaxed Native + native_divide + Relaxed
Generated 185 16413 155 178 72 48 48
Hand Written 189 162 ~153 ~178 59 36 36
Generated + Transforms 76 50 40 66 60 37 38
CPU - - - - - - ~15 seconds

GPUs are really very fast

So, the results we get here are essentially what I expected. The differences between the generated-without-transforms code and hand written code is discussed in this footnote14. The performance uplift vs the CPU renderer absolutely massive - which is what we’d expect from a GPU - taking ~0.25% of the time to execute vs the multithreaded CPU code. This is really why we’re going through all this pain, and the wins here are absolutely enormous

The real benefit for us in this technique comes down to the fact that by applying a deterministic, easy and very justifiable set of basic optimisations, we’re able to cut our runtime by more than 50%, and approach the same speed of the native + relaxed case, without turning on -cl-fast-relaxed-math, or friends. In the scientific-computing-super-rigor case (no native functions, no relaxed math), we’re able to get a 2.5x performance increase while not altering our result whatsoever, as the eliminated expressions correspond only to expressions that can’t be eliminated due to nan propagation (and signed zeros). The thing is, while we could15 turn off nan/nonfinite/zero handling in OpenCL - we actively still do want nan handling (see this13 footnote for a practical example), we just don’t want it within our expressions

Now we have a way to handle this kind of thing correctly, while sacrificing neither performance nor reproducibility

Conclusions

Some of the way we have to write code now - branches, loops etc are definitely clunkier, but overall this is a huge win. There are some particularly large benefits writing in our single source language vs OpenCL

  1. We can automatically mark up our kernel arguments16 as being restrict - a big performance win
  2. We can automatically mark up our kernel arguments as being const, depending on if they produce mutable variables. This is important for being able to automatically generate dependency information later down the line17
  3. We can reuse CPU-side code - as long as we don’t use branches18 - then you can substitute a float for a valuef
  4. You don’t have to destroy your code’s readability to perform optimisations like 0*x. Trying to do this in the hand-written OpenCL version would require manually substituting all our equations forward, and cannot work in the generic case
  5. We can autodiff our code. This value type is compatible with the dual type we wrote in the first article. I use numerical differentiation here for comparison (because you cannot easily automatically differentiate things in OpenCL), but moving forwards we’ll be using automatic differentiation
  6. We can use templates on a GPU! Our code here is generic to any metric tensor, as all we have to do is plug in a new function. This is impossible in C/OpenCL
  7. Our shader code is insulated against the increasing fragmentation and vendor exclusivity of the GPGPU market
  8. The performance makes this extremely worthwhile, and we can work around compiler issues (which are numerous)

One major downside is that we’re still heavily relying on the compiler to do common subexpression elimination, and this can result in high compile times. On top of that, every time you want to use a feature of OpenCL, you’re going to have to either directly insert a raw string, or crack out your implementers hat. Its not difficult to implement features, but it does add friction

Source code

One of the things that I think has worked out fairly well is that the code we write for the GPU really isn’t all that different to the code we wrote initially for the CPU, whereas the OpenCL version necessarily has to change a lot. You may find it interesting to compare for yourself:

Original CPU

Our single source

Regular OpenCL

Full single source project

The end

That’s the end of this article. Next episode, we’re going to be using this new library to explore the interior of a spinning black hole, and you’re going to learn about tetrads

Appendix I: What have I left out of this article?

There’s been quite a few things omitted here that you may want to address if you roll an implementation of this yourself

  1. The ability to declare functions within OpenCL. We’ll need this when we start linearly interpolating things in 4d, to prevent us from generating megabytes of source code
  2. Kernels with a dynamic number of arguments, or ones defined by a struct. Arguments have a cost in GPU programming, so minimising the number of arguments you have is good. In some cases its much more convenient to programmatically drive this process rather than do it declaratively
  3. Declaring and using structs. This is of limited utility, unless you want to interop with existing OpenCL code
  4. Fixing the API performance of OpenCL on AMD. Politely working around AMDs shortcomings in the GPGPU space is likely going to be the subject of a future article
  5. Binary caching, and threaded compilation. Its easy to generate slow to compile kernels here, one thing that’s a big win is to compile them all asynchronously, and then cache the compiled kernels to disk automatically as a binary
  6. The mutability situation is still pretty sketchy. If you have thoughts, I would love to hear your feedback on improving this!
  7. Multiple backend support. This is very straightforward, you use a struct with virtual functions per-operation in value_to_string

Appendix II: Implementation details for types that go in kernel arguments

The details of this segment don’t really fit neatly in anywhere else

buffer<T>::operator[]

struct buffer {
    value<T> operator[](const value<int>& index)
    {
        value<T> op;
        op.type = op::BRACKET;
        op.args = {name, index};

        return op;
    }
}

Code Generation

    //v1[v2]
    if(v.type == op::BRACKET)
        return "(" + value_to_string(v.args.at(0)) + "[" + value_to_string(v.args.at(1)) + "])";

buffer_mut<T>::operator[]

template<typename T>
struct buffer_mut : buffer<T> {
    mut<value<T>> operator[](const value<int>& index)
    {
        mut<value<T>> res;
        res.set_from_constant(buffer<T>::operator[](index));
        return res;
    }
}

literal<T>::get()

template<typename T>
struct literal {
    value<T> get()
    {
        value<T> val;
        val.abstract_value = name;
        return val;
    }
};

read_only_image<N>::read()

This function is a bit of a pain. In OpenCL, an image read always returns a vector of 4 components. We have a bit of a formalism mismatch here, our tensors aren’t really equivalent to OpenCLs vectors - they aren’t a single object, but a collection of 4 completely unrelated variables grouped in a tensor. Eg where opencl has:

float4 val;

We have, effectively

std::array<float, 4> val

The naive expression approach would be to (conceptually) construct a tensor like follows:

tensor<valuef, 4> my_tensor = {read_imagef(args).x, read_imagef(args).y, read_imagef(args).z, read_imagef(args).w};

The issue with this strategy is that we invoke read_imagef 4 times, and hope that the compiler figures it out. Instead of doing this, we conceptually generate the following code

float4 result = read_imagef(args);

tensor<valuef, 4> our_real_tensor = {result.x, result.y, result.z result.w};
template<int N>
struct read_only_image : image<N> {
    ///this is pretty basic as a read and doesn't encompass the full set of functionality, eg samplers
    template<typename T, int M>
    tensor<value<T>, M> read(execution_context_base& ectx, const tensor<value<int>, N>& pos) const
    {
        value_base type = name_type(T());
        value_base pos_type = std::string("int");

        //an image read follows the protocol: name, dimensionality, read type, position type, position[0..n]
        ///eg read_imagef(img_2d, (int2)(x,y)) is generated from
        ///{img_2d, 2, float, int, x, y}
        value_base single_read;
        single_read.type = op::IMAGE_READ;
        single_read.args = {this->name, value<int>(N), type, pos_type};

        for(auto& i : pos)
            single_read.args.push_back(i);

        //declare a new T4, eg float4 or int4 to hold the return type
        value_base decl_type = name_type(tensor<T, 4>());
        value_base decl_name = "iv" + std::to_string(ectx.next_id());

        value_base decl;
        decl.type = op::DECLARE;
        decl.args = {decl_type, decl_name, single_read};

        ectx.add(decl);

        tensor<value<T>, M> ret;

        ///index the declared variable which holds the result of our image, and stuff it into our tensor
        std::array<std::string, 4> dots = {"x", "y", "z", "w"};

        for(int i=0; i < M; i++)
        {
            value_base component = dots[i];

            value<T> dot;
            dot.type = op::DOT;
            dot.args = {decl_name, component};

            ret[i] = dot;
        }

        return ret;
    }

    template<typename T, int M>
    tensor<value<T>, M> read(const tensor<value<int>, N>& pos) const
    {
        return read<T, M>(get_context(), pos);
    }
};

Code generation

The code generation side is simpler:

    if(v.type == op::IMAGE_READ)
    {
        std::string type = value_to_string(v.args[2]);
        std::string name = value_to_string(v.args[0]);

        std::string suffix = type_to_suffix(type);

        int num_args = std::get<int>(v.args[1].concrete);

        std::string pos_type = value_to_string(v.args.at(3));

        std::vector<value_base> pos;

        for(int i=0; i < num_args; i++)
        {
            pos.push_back(v.args.at(4 + i));
        }

        return "read_image" + suffix + "(" + name + ",(" + pos_type + std::to_string(num_args) + ")(" + join(pos) + "))";
    }

Note that type_to_suffix is not the same set of suffixes that we use for literals. We have: f for float, i for int, ui for unsigned int, and h for half

write_only_image<N>::write()

template<int N>
struct write_only_image : image<N> {
    template<typename T, int M>
    void write(execution_context_base& ectx, const tensor<value<int>, N>& pos, const tensor<value<T>, M>& val) const
    {
        //this uses the protocol: {name, N, position[0..N], write_data_type, M, write_data[0..M]}
        value_base write_op;
        write_op.type = value_impl::op::IMAGE_WRITE;

        write_op.args = {this->name};

        write_op.args.push_back(value<int>(N));

        for(int i=0; i < N; i++)
            write_op.args.push_back(pos[i]);

        value_base data_type = name_type(T());

        write_op.args.push_back(data_type);
        write_op.args.push_back(value<int>(val.size()));

        for(int i=0; i < M; i++)
        {
            write_op.args.push_back(val[i]);
        }

        ectx.add(write_op);
    }

    template<typename T, int M>
    void write(const tensor<value<int>, N>& pos, const tensor<value<T>, M>& val) const
    {
        return write(get_context(), pos, val);
    }
};

Code Generation

This is a bit more complicated

    ///name, N, position[0..N], write_data_type, M, write_data[0..M]
    if(v.type == op::IMAGE_WRITE)
    {
        int last_idx = 0;

        auto next = [&]()
        {
            return v.args.at(last_idx++);
        };

        std::string name = next().abstract_value;
        int dimensions = std::get<int>(next().concrete);

        std::vector<value_base> position;

        for(int i=0; i < dimensions; i++)
        {
            position.push_back(next());
        }

        std::string datatype = next().abstract_value;

        std::string suffix = type_to_suffix(datatype);

        int data_dimensions = std::get<int>(next().concrete);
        std::vector<value_base> data;

        for(int i=0; i < data_dimensions; i++)
        {
            data.push_back(next());
        }

        //so, the position for opencl image writes is either an int, int2, or an int4. There is no int3
        if(dimensions > 2)
        {
            for(int i=position.size(); i < 4; i++)
            {
                position.push_back(value<int>(0));
            }
        }

        //we must write 4 components to opencl
        for(int i=data.size(); i < 4; i++)
        {
            data.push_back(value<int>(0));
        }

        int opencl_N = dimensions > 2 ? 4 : dimensions;
        int opencl_D = 4;

        //produces eg (int2){x,y}
        std::string position_str = "(int" + std::to_string(opencl_N) + "){" + join(position) + "}";
        //produces eg (float4){x,y,z,0}
        std::string data_str = "(" + datatype + std::to_string(opencl_D) + "){" + join(data) + "}";

        return "write_image" + suffix + "(" + name + "," + position_str + "," + data_str + ")";
    }

The core issue here is that if you do write_imagef(img, pos, data), pos cannot be an int3, and data must be a 4-width vector

That’s the end for real this time

  1. Simd would help a lot, as well as a good dynamic timestep. Additionally, you can get away with sampling at a lower rate and then interpolating the final geodesic positions, because this solution is very smooth. On top of this, the schwarzschild spacetime does not have many degrees of freedom - rotating all rays into the theta = pi/2 plane allows you to not simulate one of the components 

  2. This footnote was originally a long, slightly unprofessional rant about graphics vendors’ behaviour in the GPU space. Suffice to say that Khronos and Vulkan have an uphill battle in reaching this goal 

  3. I have written a lot of OpenCL in my life, including a much more advanced GR raytracer in OpenCL - and it is not maintainable in the long term. C lacks any kind of good higher level abstractions, and on the GPU you’re borderline banned from using structs or even the limited abstractions C does provide for performance reasons. OpenCL/C also bans a lot of useful macro functionality as well, complicating many traditional techniques 

  4. global for vram essentially, local for shared memory (ie L2 cache), private (though you never write private) for pointers to any variables we declare ourselves, and constant for variables which live in a small constant cache. We additionally have images (ie textures) which live in their own conceptual address space. Older hardware used to have separate caches for textures, rendering them very important for performance, but I believe modern GPUs have had unified caches for a while rendering textures much less useful. OpenCL 2.0 also defines a generic address space, to make writing functions easier at a possible performance cost, but we won’t deal with this 

  5. One major reason is for differentiation. We’ll be using this functionality a lot in future articles, where we additionally leverage this for some cool performance optimisations 

  6. Though clearly, if you can prove that the statements don’t affect each other (which is pretty simple in our language), you can reorder statements. Doing that in a way that improves performance is above my paygrade19 currently 

  7. There might be a better name than side effect here. Its anything which can’t really be easily expressed as an expression, statement management perhaps? Impurity? I’d be interested to know if there’s a better term 

  8. Brief aside, but why has half precision lagged so far behind in the CPU space vs the GPU space? 

  9. Here, we’re generally dealing with very large expressions. Nan propagation in one component of that expression will not affect our result meaningfully, as in a simulation the appearance of a nan will break everything immediately whether or not we apply these transforms 

  10. I debugged the assembly here to make sure that this isn’t incredibly unrepresentative - this example actually compiles unusually well, and exceeded my expectations from my recent-prior experience of compiler optimisation failures 

  11. Tan was excluded, as it gives a black screen in the handwritten version 

  12. One of the very good things about OpenCL is that you can disambiguate between a library provided sin with strict precision requirements, and a hardware provided native_sin with undefined precision requirements, which many/most shader languages do not do. The latter is much faster, but produces non portable results. Today’s test is being performed on the RDNA2 ISA, which means our native_sin compiles down to v_sin_f32 on page 153. Unfortunately, AMD don’t specify the precision of this intrinsic, but it does not seem to be a particularly poor implementation

    Note that AMD do provide ulp requirements for many other functions here with acceptable precision, so some may simply be a win 

  13. This result was originally 290, in the course of debugging this I discovered that some of the rays were becoming nan and only terminating after the maximum number of iterations. The results provided are all with an extra check for finiteness, which - somewhat problematically - is technically invalid with -cl-fast-relaxed-math turned on

    Here’s what I got initially when debugging occupancy:

    Bad

    And here’s what it should look like:

    Bad

    This was a pretty extensive rabbithole that originally formed part of this article, but was cut because its not really that relevant  2

  14. The performance difference here is very odd. There are two ways of negating it:

    1. Enabling the optimisation x-x = 0 manually
    2. Disabling floating point contraction, via #pragma OPENCL FP_CONTRACT OFF

    FP_CONTRACT/floating point contractions are the replacement of a*b + c with a higher precision equivalent like fma(a,b,c), which is something inherited from C. So my guess is that the pass to fma-ify a*b - a*b takes place before the elimination of that expression, and then it gets kept in. This setting seems to give pretty random changes to performance, so its worth investigating for your use case further 

  15. -cl-finite-math-only and -cl-no-signed-zeros can be used in conjunction to allow the transforms we use on the AST to be performed 

  16. Some of our chunkiest kernels in the future will have over 100 arguments, making this slightly tedious to do by hand 

  17. Amd’s compiler used to do this automatically, and would eliminate unnecessary barriers between non dependent kernels. Then they removed this functionality when upgrading to ROCm leading to significant pipeline stalls, which is one of the reasons their GPGPU story is often surprisingly slow. This will become a fairly large bottleneck for us later, and we will dedicate some time on how to fix AMD’s OpenCL/GPGPU implementation. I do not use ROCm, but it suffers from the same issue 

  18. If there were a standardised way to overload branches, the ternary operator, or if there were a standardised select operator in C++, this would make our lives a lot easier 

  19. That’d be £0. I was seriously ill for a number of years, and this tutorial series is hopefully my way into doing physics research 

results matching ""

    No results matching ""