Does Restrict Actually Improve Performance?

C99 has introduced a slew of new features into the language, among which many are aimed at improving the performance and flexibility of C in embedded programming. A prominent example would be the new keyword restrict. Its sole purpose is to enable low level optimization by the compiler. The standard states the following with regards to this keyword.

The intended use of the restrict qualifier is to promote optimization, and deleting all instances of the qualifier from all preprocessing translation units composing a conforming program does not change its meaning.

This post is to explore the various uses of restrict and attempt to answer the following questions.

What does it do? Does it actually help? Will it ever backfire? And, finally, is it worth the effort?

The use of restrict is closely associated with the strict aliasing rule. This rule states that two pointers of distinct types cannot point to the same address1. This enables the compiler to reduce the number of unnecessary load and store instructions. For example, consider the following code.

typedef struct {
    uint16_t num[32];
} Buffer;
void foo (Buffer *buffer, uint32_t *values) {
    uint32_t length = sizeof(Buffer) / sizeof(uint32_t);
    for (int i = 0; i < length; ++i)         values[i] = (uint32_t)buffer->num[0];
}

Without applying strict-aliasing, the compiler would have to assume that buffer and values might point to the same address or the memory content they point to might overlap. The compiler has to load num[0] into one of the processor registers each time during the loop because it is not certain that values does not overlap num[0] in memory. The compiler needs to make sure num[0] is always updated when values is modified, as indicated in the following assembly output.

foo:
    ; initialization stuff
.L2:
    movzwl (%rcx), %r8d        ; loads every time during the loop
    movl   %r8d, (%rdx, %rax)
    addq   $4, %rax
    cmpq   $64, %rax           ; compares %rax to length
    jne L2
    ret
    ; other stuff

With strict-aliasing, however, the compiler will treat num[0] and values as they are pointing to non-overlapping regions in memory. Therefore, one single load instruction to obtain num[0] before entering the loop is enough, as shown in the assembly output below.

foo:
    movzwl (%rcx), %r8d        ; loads once only before the loop
.L2:
    movl   %r8d, (%rdx, %rax)
    addq   $4, %rax
    cmpq   $64, %rax           ; compares %rax to length
    jne L2
    ret

If length is large enough, there will likely be a significant improvement in performance due to far fewer uses of load instruction. (For more information, refer to Mike Acton’s article on strict-aliasing.)

Restrict is to enable the compiler to apply the strict aliasing rule on pointers of the same data type. This is achieved by adding the restrict type qualifier to the declaration of a pointer. The reason that pointers of the same type are not assumed to point at distinct addresses by default is that it is often desired to have pointers of the same type overlap in memory. One example can be sorting an array of structures; another can be traversing through a linked list. On the other hand, it is also quite often that we do not intend pointers of the same type to point at the same address.

Take a look at the following function.

uint32_t bar (uint32_t *a, uint32_t *b) {
    *a = 5;
    *b = 10;           // address b is possibly the same as a
    return *a + *b;    // so, result can be either 20 or 15
}
// Assembly Output:
bar:
    movl    $5, (%rcx)
    movl    $10, (%rdx)
    movl    (%rcx), %eax    ; loads (%rcx) again, since rcx and rdx
                            ; can contain the same address
    addl    $10, %eax
    ret
Since a and b are both pointers to type uint32_t, they can potentially point at the address. The result of their addition cannot be determined statically, hence the second movl from (%rcx) and addl. But what if the compiler is allowed to assume a and b do not alias each other? The compiler can then leverage this information to more aggressively optimize the code, as below.
uint32_t bar (uint32_t *restrict a, uint32_t *restrict b) {
    *a = 5;
    *b = 10;           // address b is assumed to be different from a
    return *a + *b;    // so, result can only be 15
}
// Assembly Output:
bar:
    movl    $15, %eax    ; directly saves 15 into %eax
    movl    $5, (%rcx)
    movl    $10, (%rdx)
    ret

In this case, the compiler is able to determine that the addition will always produce 15. Thus, it can directly save 15 into resigter eax. Although the other two movl instructions are not optimized away by the compiler, the produced assembly now has one fewer instruction than before.

Will the second version yield much better performance than the first? The run time of the two variations of bar are as follows.

Looped for 1010 times
With Restrict: 4.8 seconds
Without Restrict: 4.9 seconds

Unfortunately, the difference does not appear to be significant. Does this represent the general usage of restrict? Or, does there exist more potential in this language feature?

Perhaps, an example2 resembling more closely to a real life application should suffice better. The function below takes in an array of StudentFee and assign them back to default termStandard.

typedef struct {
    uint32_t meal_plan;
    uint32_t tuition;
    uint32_t bus_pass;
    uint32_t activity_fee;
} StudentFee;
void resetAllFee (uint32_t n, StudentFee *restrict students,
                              StudentFee *restrict termStandard) {
    for (uint32_t i = 0; i < n; ++i) {         students[i].meal_plan    = termStandard->meal_plan;
        students[i].tuition      = termStandard->tuition;
        students[i].bus_pass     = termStandard->bus_pass;
        students[i].activity_fee = termStandard->activity_fee;
    }
}

Compiling this code without restrict gives the following assembly output.

resetAllFee:
    testl   %ecx, %ecx
    je .L2
    leal    -1(%rcx), %eax
    salq    $4, %rax
    leaq    16(%rdx,%rax), %rcx
.L4:                       ; loop for n times
    movl    (%r8), %eax    ; load meal_plan from memory
    movl    %eax, (%rdx)   ; save meal_plan into memory
    movl    4(%r8), %eax   ; same applies to the other three fields
    movl    %eax, 4(%rdx)
    movl    8(%r8), %eax
    movl    %eax, 8(%rdx)
    movl    12(%r8), %eax
    movl    %eax, 12(%rdx)
    addq    $16, %rdx      ; increment rdx to point to next in the array
    cmpq    %rcx, %rdx
    jne .L4
.L2:
    rep
    ret

Compiling with restrict produces a significantly different assembly output.

resetAllFee:
    testl   %ecx, %ecx
    je  .L2
    leal    -1(%rcx), %eax
    movl    (%r8), %r11d    ; load meal_plan
    movl    4(%r8), %r10d   ; load tuition
    movl    8(%r8), %r9d    ; load bus_pass
    movl    12(%r8), %r8d   ; load activity_fee
    salq    $4, %rax
    leaq    16(%rdx,%rax), %rax
.L4:                        ; loop for n times
    movl    %r11d, (%rdx)   ; save meal_plan
    movl    %r10d, 4(%rdx)  ; save the other three fields
    movl    %r9d, 8(%rdx)
    movl    %r8d, 12(%rdx)
    addq    $16, %rdx       ; increment rdx to point to next in the array
    cmpq    %rax, %rdx
    jne .L4
.L2:
    rep
    ret

When restrict is used, the compilers loads the fields of termStandard from the memory before the loop even starts. This means, the code will have n * movl instructions fewer to execute during run-time. So, will this make a more significant difference in performance? Feeding the function with an array of size 10,000,000 and looping for 10^3 times.

Looped for 103 times
With Restrict: 45.1 seconds
Without Restrict: 48.4 seconds

This is 7% of performance boost, achieved by merely introducing a new qualifier!

However, before one decides that using restrict in as much as possible is a no-brainer, it should be noted that, despite its ease of use, this qualifier imposes an invisible restriction on the variables to which it applies. The restriction, not checked by compiler and easily forgotten by users, is that the optimization is obtained by assuming a pointer marked by restrict is the only alias to its address within the scope. Carelessly adding restrict to both existing and new code may introduce bugs which will not appear until the final phase of optimization. In other words, careless uses of restrict may backfire at the eleventh hour.

The only possible misuse of restrict is to mistakenly assign pointers, which are marked restrict, overlapping memory addresses. For example, given the function bar mentioned earlier in this post, what would happen if the user passes it two pointers to uint32_t with identical address?

// main.c
uint32 number = 20;
printf("%d\n", bar(&number, &number));    // what will it print?

With restrict, the above code will print 15 when the output should really be 20. This is because the compiler, under the assumption that the two parameters of bar do not alias each other, will find that the result can only be 5 + 10 = 15. Thus, the compiler will perform optimization and directly assign 15 to the result register, similar to the assembly below.

bar:
; ... initialization
    movl    $15, %eax    ; directly saves 15 into %eax
; ... return
main:
; ... more stuff in between ...
    call    bar
    movl    %eax, %edx
    leaq    .LC0(%rip), %rcx
    call    printf
; ... more stuff follows ...

The problematic nature of such bugs is that they can be very hard to discover. They are only present if strict-aliasing optimization is applied to problematic region of code. This can be somewhat unpredictable since, by the standard, the compiler is in no way under obligation to perform strict-aliasing on every pointer marked restrict. Bugs may disappear or re-appear as the development continues. To eliminate such issues, one possible approach may involve thoroughly reviewing the code and ensuring the pointers that are marked restrict never alias to the same data. Vigorous testing needs to be done after to ensure correct behavior of affected code. Also, it is important to document code that uses restrict. At last, it is recommended to always enable the highest level of optimization in compilation throughout the development cycle, so the programmers will likely notice bugs the first time they are introduced. However, this will drastically increase the compilation time and may not be suitable for all projects.

So is it worth the effort to introduce restrict to existing or new code? The answer will depend on both the project and the implementer. For existing code, it is only worth the effort if the code in question is critical to the performance of the entire program. For example, a transaction system under heavy load may benefit greatly from a few restrict’s here and there. On the other hand, not so much if the code in question is some sort of one time initialization procedure. When working on new code, however, the use of restrict should always be considered. It is very beneficial to design a system that is able to accommodate and encourage compiler optimizations as much as possible.

[1] Pointers to any data type can be aliased by char*, but the reverse is not true.
[2] Full source code of this example can be downloaded here.
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s