Optimization Tips

Compile-time #

The single easiest thing you can do to ensure your program runs fast is to turn on the appropriate compiler flags. This includes tuning the optimizer and providing instrumentation to permit the optimizer to make better assumptions.

-O #

The sample Makefiles provided with libfxcg optimize your binaries for size by passing -Os to gcc. This tells the optimizer to try to reduce code size, which in some cases produces code that is slower than an equivalent larger version, so using -O2 or -O3 may yield better performance. In programs that do floating-point arithmetic but do not depend on the precision and semantics guaranteed by IEEE 754, -ffast-math (implied by -Ofast, which also implies -O3) permits additional floating-point optimizations that do not strictly conform to the standard.
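
For example, a minimal sketch of the change, assuming your Makefile defines CFLAGS the same way the libfxcg samples do:

# Drop the default -Os and ask for speed instead; use -Ofast only if the
# floating-point caveats above are acceptable for your program.
CFLAGS := $(filter-out -Os,${CFLAGS}) -O2

As with any change of optimization level, measure the result: on a CPU with small caches, -Os sometimes wins precisely because the code stays smaller.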

-flto #

While not turned on by default in the libfxcg example Makefiles, link-time optimization (LTO) is an excellent way to improve the generated code with zero programmer effort. It is not enabled by default only because GCC before version 4.8 tends to behave poorly with it on, particularly on Windows where it seems to cause frequent crashes.

LTO permits the compiler to perform optimization across module boundaries. Normally each module (usually a single .c file) in a program is compiled independently of the others. This permits incremental recompilation, but also requires that code crossing module boundaries follow the platform ABI.

When LTO is enabled, the linker aggregates compiler IR from object files and does another pass of optimizations on the aggregate program before emitting any object code. In many cases this permits significant optimization since many more values become visible to constant folding and inlining passes. As a trivial example, calls to a small function like plot below are likely to be inlined and optimized heavily, perhaps like the following in pseudocode:

/* Without LTO, this can be expected to emit unspecialized code
   for plot() and simply call that. This entails function call
   overhead in spilling registers and loading parameters, in addition
   to the control flow transfer. */
plot(42, 215, 0xFFFF);
/* With LTO, it's likely to become more like this, folding the constants
   into a compile-time known value, making the code both smaller and
   faster. */
*((color_t *)0xA8028554) = 0xFFFF;

Put simply:

CFLAGS := -flto ${CFLAGS}
LDFLAGS := -flto ${LDFLAGS}

Do this unless you have a good reason not to.

PGO #

GCC is also capable of taking performance information captured by a running program and using that to further optimize the generated code. In particular, this data permits it to reason about the most likely paths the machine takes through the program and optimize the common cases over uncommon cases, yielding an overall speedup.

Exact methods for doing this on the Prizm have not been considered in depth, but it is likely that generating profiles (with -fprofile-generate) by compiling and running on a workstation, then cross-compiling with -fprofile-use is a usable workflow.
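
A rough sketch of that workflow, using hypothetical source and output names (sh3eb-elf-gcc is the usual cross-compiler prefix for the Prizm; substitute your own toolchain, and note that the native build needs stand-ins for any Prizm syscalls the code calls):

# 1. Build an instrumented native binary and run it on representative input;
#    each module gets a matching .gcda profile file when the run finishes.
gcc -O2 -fprofile-generate -c main.c -o main.o
gcc -O2 -fprofile-generate -c draw.c -o draw.o
gcc -fprofile-generate -o game-native main.o draw.o
./game-native

# 2. Rebuild with the cross-compiler, letting the profiles guide optimization.
sh3eb-elf-gcc -Os -fprofile-use -c main.c -o main.o
sh3eb-elf-gcc -Os -fprofile-use -c draw.c -o draw.o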

Source-level #

Compiler options to improve performance are the low-hanging fruit of optimization. They provide significant improvements with almost no effort, but there’s only so much the compiler can do on its own. This is where you, the programmer, need to be aware of the performance characteristics of your code and ways to improve it.

Better algorithms #

Before using any of the following (micro-)optimization techniques, first ask yourself “Are my algorithms good?”. If not, fix that first. In a game with an N x N play field where you perform an operation of time complexity O(N) for each position, improving that operation to O(log N) (a common speedup for certain classes of lookup operations) makes that code roughly N/log N times faster.
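
For instance (a hypothetical lookup, not taken from any particular add-in), replacing a linear scan of a sorted table with a binary search turns an O(N) operation into an O(log N) one:

#include <stddef.h>

/* O(N): examines every entry until it finds a match. */
int find_linear(const int *table, size_t n, int key) {
    for (size_t i = 0; i < n; i++)
        if (table[i] == key) return (int)i;
    return -1;
}

/* O(log N): requires the table to be sorted, but halves the remaining
   search space on every comparison. */
int find_binary(const int *table, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid] == key) return (int)mid;
        if (table[mid] < key) lo = mid + 1;
        else hi = mid;
    }
    return -1;
}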

Value qualifiers #

Qualifying values with const not only allows the compiler to reason better about their uses, but also helps ensure your program’s correctness. For example, if you needed a precomputed lookup table for some values:

int do_lookup(unsigned idx) {
    int my_lut[] = {0, 9, 8, 7, 6, 5, 4, 3, 2, 1};
    return my_lut[idx];
}

This code needlessly creates my_lut on the stack, wasting time and memory initializing it every time the function is called. If you qualify it with const (and declare it static, so that only one copy ever exists), the compiler can place it in read-only memory:

int do_lookup(unsigned idx) {
    static const int my_lut[] = {0, 9, 8, 7, 6, 5, 4, 3, 2, 1};
    return my_lut[idx];
}

Putting the values in read-only memory saves time in each function call and at program startup, and it also ensures that you cannot accidentally write to those values.

Const pointers #

const can also be applied to pointers, which is particularly useful when the pointers are passed as function parameters. When a pointer is known to point to memory that will not be modified, the compiler may assume the pointed-to values will not change, and is thus free to keep them in registers longer or to reorder loads around the pointer dereference.

A bad example:

void DrawSprite(int x, int y, struct sprite *data);

Good:

void DrawSprite(int x, int y, const struct sprite *data);

Not only does adding the const qualifier to data permit additional compiler optimizations, but it ensures that your code will not accidentally modify the sprite data within the function.

Restrict pointers #

Specifying restrict on pointers is less commonly useful, but can also enable more compiler optimization. restrict specifies that a pointer does not alias with any other accessible pointers (that is, no other pointer points to the same location), so some loads and stores may be avoided.

An example from libc:

// Bad
void *memcpy(void *dest, void *src, size_t n);
// Good
void *memcpy(void *restrict dest, void *restrict src, size_t n);

The memcpy function requires that the two pointers not overlap; marking them both restrict helps prevent some accidental misuses of the function, and tells the compiler that it is safe to assume that stores to dest cannot affect the values at src.

(Alert readers will note that src in the above should be marked const as well. That qualifier has been omitted only to simplify the example; for maximal correctness it should indeed be declared const void *restrict, as in the standard prototype.)
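
To see why restrict helps, consider a hypothetical routine (not from libfxcg) that scales a buffer by a value read through a pointer. Without restrict, the compiler must assume a store to out[i] might change *gain and reload it on every iteration; with restrict, it can load *gain once and keep it in a register:

#include <stddef.h>

void apply_gain(int *restrict out, const int *restrict in,
                const int *restrict gain, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * *gain; /* *gain is loaded once, outside the loop */
}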

Branch hints #

While it’s almost always preferable to use full profile data when you can, sometimes it’s not feasible. In those cases, you can use the intrinsic function __builtin_expect to provide hints as to what values are likely. For example, given this function:

int check_valid_ptr(void *x) {
    if (x) return 1;
    else return 0;
}

Adding a hint that x is usually non-NULL might be useful if you only use the function to detect NULL terminators or the like, allowing the compiler to optimize the non-NULL path better. Note that __builtin_expect only has an effect when its result is actually used, so it belongs inside the condition:

int check_valid_ptr(void *x) {
    if (__builtin_expect(x != NULL, 1)) return 1;
    else return 0;
}
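
A common convenience, borrowed from other C projects rather than provided by libfxcg, is to wrap the builtin in likely/unlikely macros so that call sites stay readable:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int check_valid_ptr(void *x) {
    if (likely(x != NULL)) return 1;
    else return 0;
}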

Working in indexed color mode #

If you do not need any colors beyond the eight available in 3-bit-per-pixel mode, avoid leaving the display in full-color (16-bit-per-pixel) mode, which increases its power usage. The display can be put into 8-color mode to reduce power consumption; this mode is used in most parts of the OS and is the default when your add-in is first launched. The switch is transparent to add-ins, and the format of the VRAM remains the same, so you can change between the two color modes as needed while your add-in is running. See Bdisp_EnableColor for more information.
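
A sketch of switching modes at runtime; the header name and the parameter convention shown here (1 for full 16-bit color, 0 for the 8-color indexed mode) are assumptions to verify against the Bdisp_EnableColor documentation:

#include <fxcg/display.h>

void show_title_screen(void) {
    Bdisp_EnableColor(1);   /* assumed: full-color mode for a detailed screen */
    /* ... draw the title screen into VRAM and push it to the display ... */
    Bdisp_EnableColor(0);   /* assumed: back to 8-color mode to save power */
}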

Hardware tricks #

TODO. Try these resources for hints: