Shader code forms the backbone of real-time graphics rendering in mobile apps, games, and augmented reality experiences. Although mobile GPUs continue to evolve, they still operate under severe power and thermal constraints that demand highly optimized shader programs. A single poorly written pixel shader can halve the frame rate, drain the battery, and generate unacceptable heat. Understanding how to write efficient shaders for mobile targets is therefore a critical skill for any graphics developer targeting the widest possible device range.

Mobile GPU Architecture: Why Shaders Must Be Different

Desktop and console GPUs typically use an immediate-mode rendering (IMR) architecture, in which each primitive is rasterised and shaded as it is submitted and the results are written to a framebuffer in external memory. Mobile GPUs almost universally use tile-based rendering instead. Apple’s GPUs and Imagination PowerVR are tile-based deferred renderers (TBDR), while Qualcomm Adreno and ARM Mali are tile-based designs with their own mechanisms for rejecting hidden fragments. In a tile-based system, the framebuffer is divided into small tiles (often 16×16 or 32×32 pixels), and the GPU processes one tile at a time. This minimizes external memory bandwidth because color and depth values stay on-chip until the tile is finished.

This architectural difference has profound implications for shader optimization. For example, TBDR GPUs perform a hidden surface removal pass that discards invisible fragments before the fragment shader ever runs, and other tile-based GPUs rely on early depth testing to similar effect. Shaders that write depth or use discard instructions can defeat these optimizations, forcing the GPU to shade fragments that are later occluded. Additionally, mobile GPUs have a limited number of unified shader cores that share ALUs, texture units, and memory interfaces. Each core runs many threads at once, grouped into warps or wavefronts, and threads within the same warp must follow the same instruction path. Divergent branching within a warp serialises execution and can dramatically reduce throughput.

Foundational Techniques for Mobile Shader Optimization

1. Minimise Shader Instruction Count

The most direct path to faster shaders is to reduce the number of instructions executed per vertex or per fragment. Every instruction consumes energy and time, so trimming unnecessary operations is essential. Common strategies include:

  • Combine similar calculations: If two outputs use the same intermediate value, compute it once and reuse it.
  • Use swizzles and dot products: Built-in functions such as dot() let the shader compiler choose the best instruction sequence for the target GPU, and swizzles are typically free. A single built-in can replace several hand-written scalar operations.
  • Avoid matrix inversions inside shaders: Precompute inverse matrices on the CPU and pass them as uniforms.
  • Replace power functions with multiplications: for non-negative x, pow(x, 2.0) gives the same result as x * x but is often slower on mobile GPUs.

For example, a typical forward shading fragment shader might compute a normalised light vector, then use it in a diffuse term and a specular term. By consolidating the normalisation into a single step and reusing the vector, the instruction count can drop by 10–20%.
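The consolidation described above can be modelled on the CPU. The following is a minimal Python sketch, not shader code: the function names (normalize, dot, shade) and the fixed specular exponent of 8 are assumptions made for illustration. It normalises each vector once, reuses the light vector in both terms, and raises the specular term to a power by repeated multiplication rather than a pow call.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def shade(normal, light_vec, view_vec):
    """Diffuse plus Blinn-Phong specular, normalising each vector only once."""
    n = normalize(normal)
    l = normalize(light_vec)  # computed once, reused by both terms below
    v = normalize(view_vec)
    h = normalize(tuple(a + b for a, b in zip(l, v)))  # half vector

    diffuse = max(dot(n, l), 0.0)

    n_dot_h = max(dot(n, h), 0.0)
    s2 = n_dot_h * n_dot_h    # x^2 by multiplication instead of pow(x, 2.0)
    s4 = s2 * s2
    specular = s4 * s4        # n_dot_h to the 8th power with three multiplies
    return diffuse, specular
```

In a real shader the same structure applies: the normalised light vector is stored in a local variable and shared, instead of being recomputed inside each lighting term.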

2. Use Lower Precision Data Types

Shading languages for mobile GPUs expose three floating-point precision levels: float (32-bit), half (16-bit), and fixed (roughly 11-bit, with a small range) in Cg-style shader code, corresponding to highp, mediump, and lowp in GLSL ES. Many current GPUs treat the lowest level the same as half. Using half-precision instead of float can as much as double ALU throughput, because many mobile GPUs pack two half-float operations into a single 32-bit slot, and it also reduces register pressure. The gain is most noticeable in fragment shaders, where the workload is largest.

Critical guidelines for precision usage:

  • Use mediump or half for colour values, UV coordinates that do not require high precision, and small offsets.
  • Reserve full precision (highp or float) for world-space positions, depth values, and calculations that need high dynamic range (e.g., HDR lighting).
  • Be aware that precision conversion can also add cost. Cast values explicitly using constructors like half(myFloat) to avoid implicit conversions that may generate extra instructions.
  • Test on actual devices: some GPUs execute half-precision arithmetic at full 32-bit precision, which brings no speedup, and the added conversions can even make a shader slower. Measure on the devices you target rather than assuming a gain.
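To see why colour values tolerate 16-bit precision while world-space positions do not, the rounding error can be measured directly. This is a minimal Python sketch that assumes half precision behaves like IEEE 754 binary16, which a given GPU's mediump implementation is not guaranteed to match. The helper name to_half and the sample position value are illustrative.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Worst-case error when storing every 8-bit colour level (0..255 mapped to 0..1).
colour_error = max(abs(to_half(c / 255.0) - c / 255.0) for c in range(256))

# Error when storing a world-space coordinate a couple of thousand units out.
position = 2049.3
position_error = abs(to_half(position) - position)

print("largest colour error:", colour_error)
print("one 8-bit colour step:", 1.0 / 255.0)
print("position error:", position_error)
```

The colour error stays well below one 8-bit step, so nothing visible is lost, whereas the position error is a sizeable fraction of a world unit. That gap is the reason for reserving full precision for positions and depth.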

3. Optimise Texture Usage

Texture sampling is often the largest contributor to both instruction cycles and memory bandwidth consumption in mobile shaders. The following practices help reduce the impact:

  • Use compressed texture formats: ETC2, ASTC, and PVRTC typically use 2–8 bits per pixel with little visible loss at sensible quality settings. At 4–8 bits per pixel they reduce bandwidth by 75–87.5% compared to raw 32-bit RGBA8.
  • Minimise the number of texture samplers: Each sampler requires GPU resources. Combine multiple maps into a single RGBA texture where possible (e.g., packing roughness, metalness, and ambient occlusion into one channel each).
  • Texture atlas or array: Instead of switching textures between draw calls, use a larger texture atlas or a texture array. This avoids sampler state changes and keeps the texture cache warm.
  • Mipmaps: Always generate mipmaps for diffuse, specular, and normal maps. Mipmaps improve cache locality for minified textures and allow the hardware to sample from smaller, lower-bandwidth levels.
  • Anisotropic filtering: Use with caution. On mobile GPUs it can multiply texture fetch cost by 2–4x. Prefer bilinear or trilinear filtering for most surfaces.
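The bandwidth figures for compressed formats follow from simple arithmetic on bits per pixel. The Python sketch below uses hypothetical helper names (texture_bytes, bandwidth_saving) and ignores the block-size padding that real compressed formats apply to the smallest mip levels, so the results are approximations.

```python
def texture_bytes(width, height, bits_per_pixel, mipmaps=True):
    """Approximate storage for a 2D texture, optionally with a full mip chain."""
    total_bits = 0
    w, h = width, height
    while True:
        total_bits += w * h * bits_per_pixel
        if not mipmaps or (w == 1 and h == 1):
            break
        w, h = max(w // 2, 1), max(h // 2, 1)
    return total_bits // 8

def bandwidth_saving(bits_per_pixel, baseline_bits=32):
    """Fraction of data saved relative to an uncompressed baseline (RGBA8 = 32 bpp)."""
    return 1.0 - bits_per_pixel / baseline_bits

raw = texture_bytes(2048, 2048, 32, mipmaps=False)       # uncompressed RGBA8
compressed = texture_bytes(2048, 2048, 8, mipmaps=False)  # an 8 bpp compressed format
print(raw // (1024 * 1024), "MiB raw vs", compressed // (1024 * 1024), "MiB compressed")
print("saving at 8 bpp:", bandwidth_saving(8), "and at 4 bpp:", bandwidth_saving(4))
```

The same function shows that a full mip chain adds about one third to the storage, a modest price for the cache and bandwidth benefits of sampling smaller levels.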

4. Avoid or Mitigate Dynamic Branching

Branching instructions like if/else and switch can cause shader execution to diverge within a warp or wavefront. In the worst case, the GPU executes both branches for every pixel, and then masks out the results. This nullifies the performance benefit of skipping work. Strategies to handle branching:

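  • Use branch-free techniques: Replace conditionals with arithmetic or built-in functions. For example, max(0.0, sign(x)) evaluates to 1.0 when x > 0.0 and to 0.0 otherwise, so it can serve as a mask (for instance as the last argument of mix()) in place of an if (x > 0.0) test. This pays off only when both sides are cheap to compute, since both are always evaluated.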
  • Move branch decisions to vertex shader: If a conditional is based on per-vertex data (like material ID), compute the result per vertex and pass it as a varying. Fragment shaders then receive a varying that already contains the correct value, avoiding dynamic branching.
  • Use discard sparingly: discard removes a fragment, but the GPU must still run the shader up to that point, and the hardware can no longer finalise depth before shading, which defeats early-Z optimisations. Where multisampling is enabled, alpha-to-coverage (A2C) can replace discard-based alpha testing.
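The equivalence between a conditional and the max(0.0, sign(x)) mask can be checked with a small CPU-side model. This Python sketch mirrors the GLSL built-ins sign and mix with plain functions. The shade_branching and shade_branchless names and the lit/shadowed values are assumptions for illustration.

```python
def sign(x):
    """Model of GLSL sign(): -1.0, 0.0 or 1.0."""
    return float((x > 0) - (x < 0))

def mix(a, b, t):
    """Model of GLSL mix(): linear blend of a and b by t."""
    return a * (1.0 - t) + b * t

def shade_branching(x, lit, shadowed):
    """Reference version with an explicit conditional."""
    if x > 0.0:
        return lit
    return shadowed

def shade_branchless(x, lit, shadowed):
    """Same result using a 0/1 mask instead of a branch."""
    mask = max(0.0, sign(x))  # 1.0 when x > 0.0, otherwise 0.0
    return mix(shadowed, lit, mask)

for x in (-2.0, 0.0, 1e-9, 3.0):
    assert shade_branching(x, 0.8, 0.1) == shade_branchless(x, 0.8, 0.1)
```

Both inputs to mix are always computed, so the branch-free form only helps when neither side is expensive.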

Advanced Optimisation Techniques

5. Leverage Early-Z and Late-Z Optimisations

Mobile TBDR GPUs execute an early depth test before the fragment shader. If a fragment is occluded, the GPU skips shading entirely. To maximise this benefit:

  • Render opaque objects front-to-back so that early-Z quickly rejects hidden fragments.
  • Avoid writing to gl_FragDepth or using discard in the fragment shader, as these disable early-Z. If depth writes are necessary, consider a two-pass approach: first a depth-only pass, then a shading pass that uses the pre-filled depth buffer. The extra pass processes the geometry twice, so measure before adopting it.
  • Use conservative depth qualifiers such as layout(depth_unchanged) or layout(depth_greater) on the gl_FragDepth declaration, where the target supports them, to tell the GPU that depth tests can remain early.
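The effect of draw order on early-Z can be illustrated by counting how many fragments reach the shader at a single pixel. This Python sketch models only a plain less-than early depth test. It does not model the hidden surface removal of deferred GPUs, which reduces the dependence on ordering, and the function name and depth values are illustrative.

```python
def count_shaded_fragments(depths):
    """Count fragment-shader runs at one pixel, given fragments in draw order."""
    nearest = float("inf")
    shaded = 0
    for depth in depths:
        if depth < nearest:  # early-Z passes: shade the fragment and update depth
            nearest = depth
            shaded += 1
        # otherwise the fragment is rejected before the shader runs
    return shaded

front_to_back = [0.1, 0.3, 0.5, 0.9]
back_to_front = list(reversed(front_to_back))

print("front-to-back:", count_shaded_fragments(front_to_back))
print("back-to-front:", count_shaded_fragments(back_to_front))
```

With four overlapping opaque surfaces, front-to-back order shades one fragment while back-to-front order shades all four, which is the overdraw that sorting is meant to avoid.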

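6. Move Fragment Shader Work to the Vertex Shader

A frame usually contains far more fragments than vertices, so a calculation done per vertex and interpolated runs far less often than the same calculation done per fragment. Move work from the fragment shader to the vertex shader wherever the result interpolates acceptably. The trade-off is that a scene with many vertices (e.g., dense character meshes or tessellated geometry) can become vertex-bound, so check both stages when profiling: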

  • Compute lighting in world space per vertex and interpolate it to fragments.
  • Pre-transform normals and tangents to view space in the vertex shader.
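  • Compute texture-coordinate offsets (for example, blur tap positions) in the vertex shader and pass them as varyings, so the fragment shader samples with ready-made coordinates instead of computing them per pixel. Note that dFdx/dFdy themselves are available only in the fragment shader.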

7. Use Shader Variants and Quality Levels

One size does not fit all mobile GPUs. Provide multiple shader variants with different quality/performance trade-offs. Common variant axes include:

  • Diffuse-only, diffuse+specular, full PBR.
  • Simple vs. anisotropic specular.
  • No shadows vs. shadow maps with low-resolution sampling.
  • Single light vs. multiple lights (forward rendering).

Use preprocessor directives (#if, #ifdef) controlled by compile-time defines to generate the variants from a single source. At runtime, query the device's GPU capabilities and select the appropriate variant. This ensures high-end devices get full quality while low-end devices maintain playable frame rates.
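One common way to generate variants is to prepend #define lines to a shared shader source before compiling it. The Python sketch below shows that step on the application side. The tier names, the USE_SPECULAR and MAX_LIGHTS macros, and the build_variant helper are all hypothetical, and how a device is mapped to a tier is left out because it depends on the engine.

```python
# Hypothetical quality tiers; the macro names are illustrative only.
QUALITY_TIERS = {
    "low": {"USE_SPECULAR": 0, "MAX_LIGHTS": 1},
    "high": {"USE_SPECULAR": 1, "MAX_LIGHTS": 4},
}

FRAGMENT_SOURCE = """#version 300 es
precision mediump float;
out vec4 fragColor;
void main() {
#if USE_SPECULAR
    fragColor = vec4(1.0);
#else
    fragColor = vec4(0.5);
#endif
}"""

def build_variant(source, defines):
    """Insert #define lines directly after the #version line of a GLSL source."""
    lines = source.splitlines()
    preamble = [f"#define {name} {value}" for name, value in sorted(defines.items())]
    # GLSL requires #version to stay on the first line, so defines go after it.
    insert_at = 1 if lines and lines[0].startswith("#version") else 0
    return "\n".join(lines[:insert_at] + preamble + lines[insert_at:])

low_quality_source = build_variant(FRAGMENT_SOURCE, QUALITY_TIERS["low"])
print(low_quality_source)
```

Each distinct set of defines yields a separate compiled program, so the number of axes should stay small to keep compile times and memory use under control.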

8. Optimise Uniform and Varying Data

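Uniforms and varyings consume register space and memory bandwidth. Mobile GPUs have limited uniform storage (often as low as 256–512 floats); the actual limits can be queried at runtime, for example through GL_MAX_VERTEX_UNIFORM_VECTORS and GL_MAX_FRAGMENT_UNIFORM_VECTORS in OpenGL ES. Reduce their footprint by: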

  • Packing independent floats into vec4 containers.
  • Using mediump for varyings that don’t need high precision (e.g., UVs, small offsets).
  • Minimizing the number of varying vectors. Each additional varying increases the interpolation cost and, on tile-based GPUs, the bandwidth needed to write vertex outputs to memory for the tiling pass and read them back.

9. Avoid Expensive Math Functions

Trigonometric functions (sin, cos, tan), exponentials and logarithms (exp, log), and inverse trigonometric functions are far more expensive than basic arithmetic on mobile GPUs. A single sin can cost as much as 20–30 simple operations, depending on the hardware. When full precision is not needed:

  • Use lookup tables (LUTs) stored in small textures, provided the shader is not already limited by texture fetches.
  • Approximate with polynomial or rational approximations (e.g., Taylor series for small angles).
  • Replace pow with multiplications for integer exponents.
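As an illustration of the polynomial approach, the Python sketch below evaluates a fifth-order Taylor approximation of sin in Horner form and measures its error over the range −π/2 to π/2. The function name sin_approx and the choice of range are assumptions. Inputs outside that range would need to be reduced first, and a fitted (minimax) polynomial would give lower error for the same cost.

```python
import math

def sin_approx(x):
    """Fifth-order Taylor approximation of sin(x), intended for |x| <= pi/2."""
    x2 = x * x
    # x - x^3/6 + x^5/120, written in Horner form to minimise multiplications
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0)))

samples = [i / 100.0 * (math.pi / 2.0) for i in range(-100, 101)]
max_error = max(abs(sin_approx(x) - math.sin(x)) for x in samples)
print("largest error on [-pi/2, pi/2]:", max_error)
```

The error is largest at the ends of the range and shrinks rapidly towards zero, which is why such approximations suit small angles best.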

10. Manage Multiple Render Targets (MRT) Efficiently

Deferred rendering often writes several G-buffer attachments in a single pass (e.g., color, normal, position, roughness). Each MRT attachment consumes memory bandwidth, tile memory, and ALU operations. To keep performance acceptable:

  • Use the smallest possible per-channel bit depth (e.g., RGBA8 for albedo, RG16 for normal).
  • Pack several values into one attachment. For example, store roughness and metalness together in a single 8-bit channel of an RGBA texture, leaving the other channels free for data such as specular F0.
  • Reduce the number of MRT attachments to 2–3. On many mobile GPUs, a fourth attachment can push the per-pixel footprint past the tile memory capacity, forcing smaller tiles or off-chip writes and sharply increasing bandwidth.
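The packing idea can be sketched on the CPU. The Python example below assumes one possible layout, 4 bits for roughness and 4 bits for metalness in a single 8-bit channel. The layout and the helper names are illustrative, and a real shader would do the equivalent with multiplies, floor and fract or with integer operations. A second helper estimates how much tile memory a set of attachments needs.

```python
def pack_nibbles(roughness, metalness):
    """Pack two values in [0, 1] into one byte, 4 bits each (assumed layout)."""
    r = round(max(0.0, min(1.0, roughness)) * 15)
    m = round(max(0.0, min(1.0, metalness)) * 15)
    return (r << 4) | m

def unpack_nibbles(byte):
    """Recover the two values, each quantised to 16 levels."""
    return (byte >> 4) / 15.0, (byte & 0x0F) / 15.0

def tile_bytes(attachment_bits, tile_width=16, tile_height=16):
    """On-chip bytes needed to hold one tile of every attachment."""
    bytes_per_pixel = sum(attachment_bits) // 8
    return bytes_per_pixel * tile_width * tile_height

packed = pack_nibbles(0.2, 1.0)
print(packed, unpack_nibbles(packed))
print("three 32-bit attachments:", tile_bytes([32, 32, 32]), "bytes per 16x16 tile")
print("four 32-bit attachments:", tile_bytes([32, 32, 32, 32]), "bytes per 16x16 tile")
```

Four bits give only 16 levels per value, so this layout suits parameters where banding is not visible. Smoother data needs more bits or a different split.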

Testing and Profiling Workflow

Optimisation without measurement is guesswork. Mobile GPU profiling tools reveal exactly where shader instructions are spent, how much bandwidth is consumed, and whether the GPU is bottlenecked by shading or memory. Essential tools include:

  • Snapdragon Profiler (successor to the Qualcomm Adreno Profiler) – Provides detailed per-shader ALU, texture, and bandwidth counters for Adreno GPUs. Also supports frame capture and shader debugging.
  • Arm Mobile Studio (since renamed Arm Performance Studio) – Includes Streamline, which shows real-time performance counters for Mali GPUs such as shader core utilisation and memory transaction counts, and Graphics Analyzer (formerly Mali Graphics Debugger) for frame debugging.
  • Xcode GPU Frame Capture – For Apple devices, it gives a deep view into shader execution, tile memory usage, and bandwidth.
  • Android GPU Inspector (AGI) – Works across multiple GPU vendors and provides timeline-based profiling of GPU stages.

When profiling, focus on where GPU time actually goes. Look for shaders with high instruction counts, high register pressure, or frequent stalls while waiting on texture fetches. Prioritise the shaders that account for the most GPU time across the frame, not the ones that look most complex: a simple shader run on a full-screen quad (e.g., post-processing) may matter more than a complex shader that covers only a few pixels.

Practical Profiling Steps

  1. Capture a representative frame on the target device.
  2. Identify the 10 most expensive draw calls by GPU time.
  3. Examine the shader source or disassembly for each of those draw calls.
  4. Check the number of instructions each shader uses. Shader cores have a finite instruction cache (the figure cited for ARM Mali is on the order of 512 instructions for a pixel shader), and a shader that exceeds it incurs extra instruction fetches that can halve performance.
  5. Monitor memory bandwidth across the frame. If sustained bandwidth exceeds 10–15 GB/s, consider reducing texture resolution or using a more aggressive compression format.
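Steps 2 and 5 are easy to automate once a capture has been exported. The Python sketch below assumes the profiler data has already been converted into a list of dictionaries with hypothetical keys (id, shader, gpu_time_us). The sample numbers are made up for illustration and do not come from any real device.

```python
def top_draw_calls(draw_calls, count=10):
    """Return the `count` most expensive draw calls by GPU time."""
    return sorted(draw_calls, key=lambda d: d["gpu_time_us"], reverse=True)[:count]

def gpu_time_by_shader(draw_calls):
    """Sum GPU time per shader so that frequently used shaders stand out."""
    totals = {}
    for call in draw_calls:
        totals[call["shader"]] = totals.get(call["shader"], 0) + call["gpu_time_us"]
    return totals

def bandwidth_gb_per_s(bytes_per_frame, frames_per_second):
    """Convert measured traffic per frame into sustained bandwidth in GB/s."""
    return bytes_per_frame * frames_per_second / 1e9

# Made-up capture data, for illustration only.
capture = [
    {"id": 1, "shader": "terrain", "gpu_time_us": 900},
    {"id": 2, "shader": "bloom", "gpu_time_us": 1500},
    {"id": 3, "shader": "character", "gpu_time_us": 400},
    {"id": 4, "shader": "terrain", "gpu_time_us": 700},
]

print([call["id"] for call in top_draw_calls(capture, count=2)])
print(gpu_time_by_shader(capture))
print(bandwidth_gb_per_s(200_000_000, 60), "GB/s")
```

Summing by shader matters because a shader that is individually cheap can still dominate the frame if many draw calls use it.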

Conclusion

Optimizing shaders for mobile devices is a multi‑faceted discipline that goes beyond simply writing fewer lines of code. It requires a deep understanding of tile‑based GPU architectures, awareness of precision sensitivity, judicious use of texture resources, and disciplined management of branching and control flow. The techniques outlined here (minimizing instructions, using half-precision, optimizing texture usage, avoiding dynamic branching, leveraging early-Z, and offloading work to vertex shaders) form a toolkit that every mobile graphics developer should master. When combined with rigorous profiling on actual devices, these methods consistently deliver higher frame rates, lower battery consumption, and a better user experience across the vast and varied landscape of mobile hardware.

For further reading, consult the official vendor optimisation guides: ARM Mali GPU Shader Optimization Guide, Qualcomm Adreno SDK, and the OpenGL ES Specification. These resources provide platform‑specific details that complement the general principles discussed here.