Compile-time #
The single easiest thing you can do to ensure your program runs fast is to turn on the appropriate compiler flags. This includes tuning the optimizer and providing instrumentation to permit the optimizer to make better assumptions.
-O #
The sample Makefiles provided with libfxcg optimize your binaries for size by passing -Os to gcc. This sets the optimizer to attempt to reduce code size, which in some cases will be slower than equivalent larger code. Using -O2 or -O3 may yield better performance. In programs that do floating-point arithmetic but do not depend on precision to the lengths guaranteed by IEEE 754, -ffast-math (implied by -Ofast, which also implies -O3) permits additional optimizations to floating-point operations which do not strictly conform to the standard.
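For example, this might look like the following in a libfxcg-style Makefile (OTHERFLAGS here is a hypothetical stand-in for whatever other flags your Makefile already passes). Note that when gcc is given several -O options the last one wins, so replace the stock -Os rather than adding a second level:
# As shipped: optimize for size.
#CFLAGS := -Os ${OTHERFLAGS}
# Optimize for speed instead:
CFLAGS := -O2 ${OTHERFLAGS}
# Or, if your code does not depend on strict IEEE 754 semantics:
#CFLAGS := -Ofast ${OTHERFLAGS}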
-flto
#
While not turned on by default in the libfxcg example Makefiles, link-time optimization (LTO) is an excellent way to improve the generated code with zero programmer effort. It is not enabled by default only because GCC before version 4.8 tends to behave poorly with it on, particularly on Windows where it seems to cause frequent crashes.
LTO permits the compiler to perform optimization across module boundaries. Normally each module (usually a single .c file) in a program is compiled independently of the others. This permits incremental recompilation, but also requires that code crossing module boundaries follow the platform ABI.
When LTO is enabled, the linker aggregates compiler IR from object files and does another pass of optimizations on the aggregate program before emitting any object code. In many cases this permits significant optimization since many more values become visible to constant folding and inlining passes. As a trivial example, calls to a function like Plot are likely to be inlined and optimized heavily, perhaps like the following in pseudocode:
/* Without LTO, this can be expected to emit unspecialized code
   for plot() and simply call that. This entails function call
   overhead in spilling registers and loading parameters, in addition
   to the control flow transfer. */
plot(42, 215, 0xFFFF);

/* With LTO, it's likely to become more like this, folding the constants
   into a compile-time known value, making the code both smaller and
   faster. */
*((color_t *)0xA8028554) = 0xFFFF;
Put simply:
CFLAGS := -flto ${CFLAGS}
LDFLAGS := -flto ${LDFLAGS}
Do this unless you have a good reason not to.
PGO #
GCC is also capable of taking performance information captured by a running program and using that to further optimize the generated code. In particular, this data permits it to reason about the most likely paths the machine takes through the program and optimize the common cases over uncommon cases, yielding an overall speedup.
Exact methods for doing this on the Prizm have not been considered in depth, but it is likely that generating profiles (with -fprofile-generate) by compiling and running on a workstation, then cross-compiling with -fprofile-use, is a usable workflow.
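A rough sketch of what that workflow might look like; the cross-compiler name sh3eb-elf-gcc, the file names, and everything other than the -fprofile-* options are illustrative assumptions rather than a tested recipe:
# Stage 1: build an instrumented native binary and run it through
# representative workloads; this writes *.gcda profile files.
gcc -O2 -fprofile-generate demo.c -o demo-host
./demo-host

# Stage 2: cross-compile for the calculator, reusing the recorded profile.
# -fprofile-correction tolerates small inconsistencies in the counters.
sh3eb-elf-gcc -O2 -fprofile-use -fprofile-correction -c demo.c -o demo.o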
Source-level #
Compiler options to improve performance are the low-hanging fruit of optimization. They provide significant improvements with almost no effort, but there’s only so much the compiler can do on its own. This is where you, the programmer, need to be aware of the performance characteristics of your code and ways to improve it.
Better algorithms #
Before using any of the following (micro-)optimization techniques, first ask yourself "Are my algorithms good?" If not, fix those first. In a game with an N x N play field where you perform an operation of time complexity O(N) for each position, improving that operation to O(log N) (a common speedup for certain classes of lookup operations) will make that code nearly N times faster.
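As a hypothetical sketch (the names and data are made up for illustration), consider checking whether a tile ID belongs to a set of "solid" tiles: scanning an unsorted list is O(N) per query, while a binary search over a sorted list is O(log N):
#include <stddef.h>

/* O(N): scan the whole list of solid tile IDs on every query. */
int is_solid_linear(const int *solid, size_t n, int tile) {
    for (size_t i = 0; i < n; i++)
        if (solid[i] == tile)
            return 1;
    return 0;
}

/* O(log N): the same query against a sorted list, via binary search. */
int is_solid_sorted(const int *sorted_solid, size_t n, int tile) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted_solid[mid] < tile)
            lo = mid + 1;
        else if (sorted_solid[mid] > tile)
            hi = mid;
        else
            return 1;
    }
    return 0;
}
Run once per cell of an N x N field, the second version turns an O(N^3) pass into an O(N^2 log N) one.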
Value qualifiers #
Qualifying values with const not only allows the compiler to reason better about their uses, but also helps ensure your program's correctness. For example, if you needed a precomputed lookup table for some values:
int do_lookup(unsigned idx) {
    int my_lut[] = {0, 9, 8, 7, 6, 5, 4, 3, 2, 1};
    return my_lut[idx];
}
This code will needlessly create my_lut on the stack, wasting time and memory initializing it every time the function is called. If you qualify it with const, the compiler will be able to place it in read-only memory:
int do_lookup(unsigned idx) {
    const int my_lut[] = {0, 9, 8, 7, 6, 5, 4, 3, 2, 1};
    return my_lut[idx];
}
Putting the values in read-only memory saves time on each call as well as at program startup, and it ensures that you cannot accidentally write to those values.
Const pointers #
const can also be applied to pointers, which is particularly useful when the pointers are passed as function parameters. When a pointer is known to point to memory that will not be modified, the compiler may assume the values there will not change, and is thus free to keep values in registers longer or reorder loads around pointer dereferences.
A bad example:
void DrawSprite(int x, int y, struct sprite *data);
Good:
void DrawSprite(int x, int y, const struct sprite *data);
Not only does adding the const qualifier to data permit additional compiler optimizations, but it ensures that your code will not accidentally modify the sprite data within the function.
Restrict pointers #
Specifying restrict on pointers is less commonly useful, but can also enable more compiler optimization. restrict specifies that a pointer does not alias with any other accessible pointers (that is, no other pointer points to the same location), so some loads and stores may be avoided.
An example from libc:
// Bad
void *memcpy(void *dest, void *src, size_t n);
// Good
void *memcpy(void *restrict dest, void *restrict src, size_t n);
The memcpy function requires that the two pointers not overlap; marking them both restrict helps prevent some accidental misuses of the function, and tells the compiler that it is safe to assume that stores to dest cannot affect the values at src.
(Alert readers will note that src in the above should be marked const as well. The additional qualifier has been omitted only to simplify the example; for maximal correctness it should indeed be declared const void *restrict src.)
Branch hints #
While it’s almost always preferable to use full profile data when you can, sometimes it’s not feasible. In those cases, you can use the intrinsic function __builtin_expect to provide hints as to what values are likely. For example, given this function:
int check_valid_ptr(void *x) {
    if (x) return 1;
    else return 0;
}
Adding a hint that x is usually non-NULL might be useful if you only use the function to detect NULL terminators or such, allowing the compiler to optimize the non-NULL case better:
int check_valid_ptr(void *x) {
    /* The hint must wrap the condition itself so its result is actually
       used by the branch. */
    if (__builtin_expect(x != NULL, 1)) return 1;
    else return 0;
}
Working in indexed color mode #
Running the display in full color (16-bit-per-pixel) mode may draw more power than the 8-color (3-bit-per-pixel) indexed mode. If you do not need any colors beyond the eight available in indexed mode, you can put the display into 8-color mode to reduce its power usage. This mode is used in most parts of the OS and is the default when your add-in is first launched. The switch is transparent to add-ins, and the format of the VRAM remains the same. You can switch between the two color modes as needed while your add-in is running; see Bdisp_EnableColor for more information.
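A minimal sketch, assuming the usual libfxcg declaration of Bdisp_EnableColor in fxcg/display.h and the convention that 0 selects the 8-color mode while 1 selects full-color mode:
#include <fxcg/display.h>

/* Drop the display to 8-color (3-bit) mode while only the basic colors
   are needed, for example on a menu screen, to reduce power usage. */
void enter_menu(void) {
    Bdisp_EnableColor(0);
}

/* Switch back to full 16-bit color for scenes that use the whole
   palette; VRAM keeps the same layout in both modes. */
void enter_game(void) {
    Bdisp_EnableColor(1);
}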
Hardware tricks #
TODO. Try these resources for hints:
- Possible hardware features
- On-chip memory, part 1 & part 2
- DMA for efficient copying