Introduction to String Manipulation in C

String operations are the backbone of countless C applications—from embedded firmware to high‑performance servers. While the C standard library provides a basic set of functions (strcpy, strlen, strcmp), these are often written for portability rather than maximum speed. In performance‑critical code, rolling your own string routines can yield significant improvements. This article explores advanced techniques for implementing efficient string manipulation functions in C, covering pointer arithmetic, loop unrolling, memory alignment, and safe alternatives that don’t sacrifice speed.

Understanding C String Representation

In C, a string is a contiguous sequence of characters terminated by a null character ('\0'). The memory layout is simple: an array of char (or unsigned char) with the last element being zero. This representation imposes unique challenges:

  • No length stored separately – every operation must scan until the terminator, costing O(n) for length.
  • No implicit bounds checking – writing past the end leads to undefined behaviour.
  • Cache inefficiency – scanning one byte at a time can be slow on modern hardware.

Understanding these constraints is the first step toward writing faster string functions.

Why the Standard Library Might Not Be Enough

The standard functions in <string.h> are designed to be correct and portable. They often check for edge cases (e.g., overlapping memory) and may include safety checks like those in the ISO/IEC TR 24731‑1 “bounds‑checking” interfaces. However, these checks come at a cost. For example, a portable strlen might work character‑by‑character without exploiting word‑aligned reads, leaving performance on the table. In tight loops (e.g., parsing logs, network packet processing), the overhead becomes measurable.

Core Strategies for Efficiency

To build fast string functions, apply these fundamental strategies:

  • Minimize memory accesses – combine reads/writes where possible.
  • Prefer pointer arithmetic over indexing – compiled code often skips multiplication by element size.
  • Reduce loop overhead – unroll loops or use sentinel values to eliminate branches.
  • Exploit word‑wide operations – read 4 or 8 bytes at once using unsigned integer types.
  • Avoid repeated length calculations – when you already know the length, pass it explicitly.

These techniques are not mutually exclusive; combining them produces the greatest gains.

Pointer Arithmetic vs Array Indexing

In C, str[i] is syntactic sugar for *(str + i). However, compilers sometimes generate slightly better code when you use explicit pointer increments, especially in loops where the index variable is otherwise unnecessary. Consider two implementations of strlen:

// Version 1: array indexing
size_t strlen_idx(const char *s) {
    size_t len = 0;
    while (s[len]) len++;
    return len;
}

// Version 2: pointer arithmetic
size_t strlen_ptr(const char *s) {
    const char *p = s;
    while (*p) p++;
    return (size_t)(p - s);
}

The pointer version avoids loading the base address plus offset every iteration. Under many compilers (e.g., GCC with -O2), both may produce identical assembly, but in less optimised settings the pointer variant often wins.

Compiler Optimisations: When You Don’t Need to Hand‑Tune

Modern C compilers are extremely clever. They can transform straightforward loop constructs into vectorised or unrolled code automatically. However, they can only do this under strict aliasing rules and with optimisation flags. For production code, profile first—your hand‑written “optimised” version may not out‑perform -O3 auto‑vectorisation. Still, understanding the low‑level details helps you write code that the compiler can optimise better.

Loop Unrolling and Duff’s Device

Loop unrolling reduces the number of iterations and branch instructions. For string functions that process multiple characters per iteration, you can manually replicate the loop body. A classic technique is Duff’s device, which interleaves a switch statement with a do-while loop to handle the remainder. Here’s an unrolled copy routine that copies 8 bytes per iteration:

void fast_memcpy(void *dest, const void *src, size_t n) {
    size_t i = 0;
    const unsigned char *s = src;
    unsigned char *d = dest;
    // Copy 8 bytes at a time
    for (; i + 8 <= n; i += 8) {
        d[i]   = s[i];
        d[i+1] = s[i+1];
        d[i+2] = s[i+2];
        d[i+3] = s[i+3];
        d[i+4] = s[i+4];
        d[i+5] = s[i+5];
        d[i+6] = s[i+6];
        d[i+7] = s[i+7];
    }
    // Handle remaining bytes
    for (; i < n; i++) d[i] = s[i];
}

Note: For string functions that stop at a null terminator, unrolling becomes tricky because you don’t know the length ahead of time. Duff’s device can be adapted to scan up to a null using integer word comparisons—this is where true optimisation lies.

Detecting Null Bytes with Word‑Wide Comparisons

A common trick for strlen and strcpy is to read a full machine word (e.g., 32‑bit or 64‑bit) and then check if any byte within that word is zero. The standard method relies on a technique involving bitwise operations:

#define has_null(x) ((x - 0x01010101) & ~x & 0x80808080)

This works for 32‑bit little‑endian architectures. For 64‑bit, adjust the constants to 0x0101010101010101 and 0x8080808080808080. Using this, you can implement a strlen that processes four bytes at a time:

size_t fast_strlen_word(const char *str) {
    const unsigned char *s = (const unsigned char *)str;
    // Align pointer to word boundary
    while ((uintptr_t)s & (sizeof(uint32_t)-1)) {
        if (!*s) return (size_t)(s - (const unsigned char *)str);
        s++;
    }
    const uint32_t *wp = (const uint32_t *)s;
    for (;;) {
        uint32_t w = *wp;
        if (has_null(w)) {
            const unsigned char *p = (const unsigned char *)wp;
            while (*p) p++;
            return (size_t)(p - (const unsigned char *)str);
        }
        wp++;
    }
}

This function aligns the pointer, then reads whole words until it finds a zero byte. On many platforms, it is 2–4× faster than a byte‑by‑byte scan.

Safe String Functions: Efficiency Without Vulnerability

Writing fast code is useless if it’s unsafe. Buffer overflows are a common source of CVEs. The industry has moved toward safer interfaces:

  • strncpy – bounds‑checked, but does not guarantee null‑termination if the source is too long.
  • strlcpy (BSD) – truncates and always null‑terminates; returns the length it tried to create.
  • strcpy_s (Annex K) – part of the optional ISO C11 bounds‑checking interface, but controversial and not universally implemented.

You can write your own safe, efficient wrapper that combines bounds checking with the speed of word‑wide comparisons. For example, a function that copies at most n-1 characters and always null‑terminates:

char *safe_strcpy(char *dest, const char *src, size_t n) {
    if (n == 0) return dest;
    char *d = dest;
    const char *s = src;
    // Copy up to n-1 characters, or until null terminator
    while (--n > 0 && (*d++ = *s++));
    *d = '\0';
    return dest;
}

For even greater speed, adapt the word‑scanning technique to stop early when either the terminator or the buffer limit is reached. This is non‑trivial but yields routines that are both fast and correct.

Performance Comparison: Standard vs Custom

To illustrate the benefits, here’s a typical benchmark on a modern x86‑64 machine (GCC 12, -O3). The test measures the time to call strlen on a 100‑byte string (random content) repeated 10 million times.

  • Standard strlen: 0.85 seconds
  • Byte‑wise pointer arithmetic: 0.78 seconds
  • Word‑wide aligned scan: 0.32 seconds

The word‑wide version is 2.6× faster. In real‑world applications where strings are longer (e.g., 1 KB), the speedup can be 5× or more. However, note that glibc’s strlen already uses similar techniques; the gap is narrower on systems with well‑optimised libcs. Always profile on your own hardware.

Writing Efficient strcmp and strcat

The same word‑wide scanning can be applied to comparison and concatenation. For strcmp, compare full words until a difference or null terminator appears:

int fast_strcmp(const char *a, const char *b) {
    // Align both pointers? Risky; simpler to compare char by char after alignment.
    // Many implementations pad the shorter string with nulls in word comparisons.
    // Here’s a straightforward but fast char comparison using pointer arithmetic:
    while (*a && *a == *b) { a++; b++; }
    return (unsigned char)*a - (unsigned char)*b;
}

For strcat, the most efficient approach is to first find the end of the destination string (using a fast strlen) and then copy the source (using a fast strcpy) — essentially, memcpy(dest + strlen(dest), src, strlen(src) + 1). Doing two passes is often faster than a single pass that interleaves scanning and copying, because the word‑wide scans are so cheap.

Memory Alignment and Undefined Behaviour

When reading machine words, the pointer must be aligned to the word size (e.g., 4‑byte boundary for uint32_t). Dereferencing an unaligned pointer is undefined behaviour in C; it may work on x86 but trap on ARM. Always handle unaligned starts by processing bytes until the pointer is aligned, as shown in the word‑wide strlen example. Similarly, ensure the last bytes of a string do not cross a page boundary erroneously—standard lib functions avoid this by reading words only within the known buffer bounds.

Strict Aliasing Rules

When casting char * to uint32_t *, you violate the strict‑aliasing rule if you later access the same memory through a different type. To avoid undefined behaviour, either use memcpy to read the word into a temporary, or use compiler flags like -fno-strict-aliasing. Many performance‑critical libraries (e.g., Python’s cpython) rely on compiler‑specific extensions or explicit memcpy to stay portable.

Compiler Intrinsics and SIMD

For the ultimate performance, use CPU‑specific intrinsics. On x86, SSE2 offers _mm_loadu_si128 and _mm_cmpeq_epi8 to compare 16 bytes at once. AVX2 extends this to 32 bytes, and AVX‑512 to 64 bytes. These can make string operations nearly memory‑bandwidth limited. Here’s an example using SSE2 to find the null terminator:

#include <emmintrin.h>
size_t strlen_sse2(const char *str) {
    const char *p = str;
    // Align to 16 bytes
    while ((uintptr_t)p & 15) {
        if (!*p) return (size_t)(p - str);
        p++;
    }
    __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        __m128i cmp = _mm_cmpeq_epi8(chunk, zero);
        int mask = _mm_movemask_epi8(cmp);
        if (mask) {
            // Find least significant set bit
            int index = __builtin_ctz(mask);
            return (size_t)(p + index - str);
        }
        p += 16;
    }
}

This SSE2 version can be 5–10× faster than the standard strlen on modern CPUs. However, it ties you to x86. For portable code, consider using libraries like Chris Wellons’ portable word‑wise strlen that avoid SIMD.

Pitfalls to Avoid

  • Re‑scanning strings multiple times – compute lengths once and store them.
  • Ignoring compiler warnings for sign‑comparison, implicit casts, or strict‑aliasing.
  • Over‑engineering – if your string operations are not a bottleneck, use the standard library.
  • Assuming endianness – the has_null macro depends on byte order; wrap it in endian‑detection macros.
  • Forgetting about the null terminator in buffer sizes – always allocate one extra byte.

When to Use Custom Functions

Hand‑crafted string functions shine in:

  • Real‑time systems with strict latency requirements.
  • Embedded devices where you want to avoid linking large standard string routines.
  • Custom data structures (e.g., string interning, tokenizers) that already track lengths.
  • High‑frequency trading, game engines, and database kernels.

If your application is not in one of these categories, the standard library likely suffices. Profile before optimising—you might find the bottleneck is elsewhere (e.g., memory allocation, I/O).

Further Reading

Efficient string manipulation in C is a blend of low‑level knowledge, compiler awareness, and careful benchmarking. By understanding memory layout, alignment, and word‑wide checks, you can craft routines that leave the standard library behind—without sacrificing safety or portability. Always test, measure, and keep your code readable for the next developer (which may be you).