The Use of Neural Networks in Approximating Value Functions in Optimal Control

Neural networks have emerged as a transformative tool for approximating value functions in optimal control, enabling tractable solutions to problems that were previously intractable due to high dimensionality and nonlinear dynamics. The value function—also known as the cost-to-go or the optimal cost—encapsulates the minimum cumulative cost from any state to a goal, serving as the foundation for deriving optimal policies in fields ranging from robotics and aerospace to economics and energy management. Traditional numerical methods like dynamic programming on discretized grids become computationally prohibitive as the number of state variables grows, a phenomenon known as the curse of dimensionality. Neural networks, with their ability to represent complex, nonlinear relationships from data, offer a scalable and flexible alternative that has driven significant progress over the past decade. This article expands on the role of neural networks in value function approximation, covering the theoretical underpinnings, training methodologies, practical applications, challenges, and promising research directions.

What Is a Value Function in Optimal Control?

In optimal control, a value function V(x) maps each state x in the state space to the minimal possible cost that can be accumulated from that state when following an optimal policy. For discrete-time systems, the Bellman equation captures this relation recursively:

V(x) = min_{u} [ c(x, u) + γ * V(f(x, u)) ]

Here c is the immediate cost, u is the control input, f is the system dynamics, and γ is a discount factor with 0 < γ ≤ 1. For continuous-time problems, the Hamilton-Jacobi-Bellman (HJB) partial differential equation plays an analogous role. The value function is intimately linked with the optimal policy: knowing V allows one to extract the optimal control by greedily choosing, in each state, the action that minimizes the right-hand side. Classical examples include the linear quadratic regulator (LQR), where the value function is quadratic and its coefficients are obtained by solving the Riccati equation, and shortest-path problems, where the value is the distance to the goal. However, for most real-world systems with nonlinear dynamics, high-dimensional states, or constraints, a closed-form solution is generally unavailable, and numerical approximation becomes necessary.
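
As a concrete illustration of the Bellman recursion and greedy policy extraction, the sketch below runs tabular value iteration on a small shortest-path problem. The six-state chain, the unit step cost, and the function names are assumptions chosen for illustration; they do not come from any particular library.

```python
N_STATES = 6      # states 0..5 on a line; state 0 is the goal (assumed toy problem)
GAMMA = 1.0       # undiscounted shortest path

def step(x, u):
    """Dynamics f(x, u): move one cell left or right, staying inside the chain."""
    return min(max(x + u, 0), N_STATES - 1)

def backup(V, x, u):
    """Right-hand side of the Bellman equation for one control: c(x, u) + gamma * V(f(x, u))."""
    return 1.0 + GAMMA * V[step(x, u)]

def value_iteration(sweeps=20):
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        # The goal keeps zero cost-to-go; every other state takes the min over controls.
        V = [0.0] + [min(backup(V, x, u) for u in (-1, 1)) for x in range(1, N_STATES)]
    return V

def greedy_control(V, x):
    """Extract the control that minimizes the Bellman right-hand side."""
    return min((-1, 1), key=lambda u: backup(V, x, u))

V = value_iteration()
print(V)                     # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(greedy_control(V, 3))  # -1, i.e. step toward the goal
```

The converged values equal the distance to the goal, which matches the shortest-path interpretation given above.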

Why Neural Networks for Value Function Approximation?

Traditional approaches to value function approximation rely on discretization (grid-based methods) or function approximation using fixed basis functions (e.g., polynomials, radial basis functions, splines). Grid methods scale exponentially with the state dimension: a 10-dimensional state space with 100 points per dimension requires 100¹⁰ = 10²⁰ grid points, which is infeasible to store or sweep. Basis function methods require careful handcrafting and may still scale poorly. Neural networks mitigate these limitations by learning representations directly from data, automatically extracting features that are relevant for predicting the value. The universal approximation theorem states that a feedforward network with enough hidden units can approximate any continuous function on a compact domain to arbitrary accuracy. This is an existence result: it does not say how many units are needed or whether training will find suitable weights. Still, combined with the practical success of deep learning in other domains, it makes neural networks a natural choice for value function approximation.
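
The scaling argument can be checked with a few lines of arithmetic. The network shape below (two hidden layers of 256 units) is an arbitrary assumption for comparison only; the parameter count says nothing about how hard the network is to train or how accurate it will be.

```python
def grid_points(points_per_dim, dims):
    """Number of nodes in a uniform grid over the state space."""
    return points_per_dim ** dims

def mlp_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

grid = grid_points(100, 10)                  # 100**10 == 10**20
mlp = mlp_parameters([10, 256, 256, 1])      # hypothetical 10-input value network

print(grid)   # 100000000000000000000
print(mlp)    # 68865
```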

Handling High Dimensionality and Nonlinearity

Neural networks excel in high-dimensional spaces where traditional methods fail. For example, in robotic manipulation with visual inputs, the state space includes images with thousands of pixels. Convolutional neural networks (CNNs) can process such raw data directly and learn to estimate the value of a scene. Similarly, in control of soft robots or deformable objects, the dynamics are highly nonlinear; neural networks can capture these complexities without requiring explicit physics models, as long as sufficient training data is available.

Training Neural Networks for Value Approximation

Training a neural network to approximate a value function involves generating a dataset of state-value pairs and optimizing the network parameters to minimize a loss function. The choice of data generation and optimization algorithm depends on the specific control setting—whether a model of the dynamics is known, whether interactions with the real system are allowed, and whether expert demonstrations exist.

Data Generation Strategies

Simulation-based rollouts: When a simulation model is available, one can sample initial states, simulate optimal or near-optimal trajectories, and compute the cumulative discounted cost from each visited state. These state-value pairs serve as supervised learning targets. To cover the state space thoroughly, exploration strategies (e.g., random perturbations, epsilon-greedy, or curiosity-driven exploration) are used.

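Expert demonstrations: In imitation learning, trajectories from an expert (a human operator or a pre-existing controller) provide states paired with the costs the expert incurred. This approach is useful when simulation is expensive or inaccurate. The resulting targets describe the expert’s cost-to-go rather than the optimal one, and they cover only the states the expert visited, so both sources of bias need to be accounted for.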

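Bootstrapping from self-generated experience: This is the core of reinforcement learning (RL) methods. The network’s own current value estimates at successor states are used to compute training targets, and the gap between a target and the current prediction is the temporal difference (TD) error. The most famous example is the Deep Q-Network (DQN), where a neural network approximates the action-value function Q(s, a). RL is usually written in terms of a reward r to be maximized rather than a cost to be minimized, so targets are computed as r + γ max_a' Q(s', a') (with min in place of max under the cost convention used above), and the network is trained to minimize the squared TD error. Experience replay and a separate, slowly updated target network used to evaluate Q(s', a') stabilize training.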
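
The two kinds of training target described above, cumulative discounted cost along a rollout and a bootstrapped TD target, can be sketched as follows. The function names and the `terminal` flag are illustrative assumptions; the TD target follows the reward-maximizing form r + γ max_a' Q(s', a') quoted in the text.

```python
def discounted_costs_to_go(costs, gamma):
    """Monte Carlo targets for one rollout: G_t = c_t + gamma * G_{t+1}, computed backward."""
    targets = [0.0] * len(costs)
    running = 0.0
    for t in reversed(range(len(costs))):
        running = costs[t] + gamma * running
        targets[t] = running
    return targets

def td_target(reward, next_q_values, gamma, terminal=False):
    """DQN-style target: r + gamma * max over next-state Q-values.

    next_q_values would come from the target network in DQN; no bootstrap
    term is added when the transition ends the episode.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

print(discounted_costs_to_go([1.0, 1.0, 1.0], gamma=0.5))   # [1.75, 1.5, 1.0]
print(td_target(1.0, [0.5, 2.0], gamma=0.5))                 # 2.0
```

Each visited state paired with its entry in the first list gives one supervised state-value example, whereas the TD target depends on the network's current estimates and therefore changes as training proceeds.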

Loss Functions and Optimization

Common loss functions include mean squared error (MSE) for regression against supervised targets, and the squared TD error for bootstrapped updates. For continuous action spaces, approaches like Deep Deterministic Policy Gradient (DDPG) or Soft Actor-Critic (SAC) use an actor-critic architecture in which the critic network approximates the action-value function Q(s, a) and the actor network represents the policy. Regularization techniques such as weight decay, dropout, and early stopping are applied to combat overfitting, especially when the training data is sparse.
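
To make the supervised MSE case concrete, here is a minimal sketch of a one-hidden-layer network fitted to state-value pairs by stochastic gradient descent, written in plain Python so that the gradient computation is visible. The target V(x) = x² (the quadratic shape of an LQR value function), the network size, the learning rate, and all names are assumptions for illustration; a practical implementation would use a deep learning framework.

```python
import math
import random

def init_mlp(hidden, rng):
    """One-hidden-layer tanh network with scalar input and scalar output."""
    return {
        "w1": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
        "b2": 0.0,
    }

def forward(p, x):
    h = [math.tanh(w * x + b) for w, b in zip(p["w1"], p["b1"])]
    y = sum(w * hi for w, hi in zip(p["w2"], h)) + p["b2"]
    return y, h

def sgd_step(p, x, target, lr):
    """One gradient step on the per-sample loss 0.5 * (y - target)**2."""
    y, h = forward(p, x)
    err = y - target
    for i, hi in enumerate(h):
        grad_pre = err * p["w2"][i] * (1.0 - hi * hi)   # backprop through tanh
        p["w2"][i] -= lr * err * hi
        p["w1"][i] -= lr * grad_pre * x
        p["b1"][i] -= lr * grad_pre
    p["b2"] -= lr * err

def mse(p, data):
    return sum((forward(p, x)[0] - v) ** 2 for x, v in data) / len(data)

rng = random.Random(0)
# Assumed dataset: 21 states in [-1, 1] with targets V(x) = x**2.
data = [(k / 10.0, (k / 10.0) ** 2) for k in range(-10, 11)]
params = init_mlp(hidden=8, rng=rng)

loss_before = mse(params, data)
for epoch in range(500):
    rng.shuffle(data)
    for x, v in data:
        sgd_step(params, x, v, lr=0.05)
loss_after = mse(params, data)

print(loss_before, loss_after)
```

Replacing the fixed targets with TD targets recomputed from the network's own predictions turns the same update into a bootstrapped one.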

Neural Network Architectures

Multilayer perceptrons (MLPs) are the default for low-dimensional state spaces (e.g., joint angles and velocities). Convolutional neural networks (CNNs) are used when states are images. Recurrent neural networks (RNNs) or transformers help when the state includes temporal history or partial observability. Physics-informed neural networks (PINNs) incorporate the HJB equation directly into the loss, enabling the network to respect known physical constraints without requiring full simulation data. Mixtures of experts and attention mechanisms have also been explored for problems with multiple regimes or local structure.
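
One way to write the physics-informed loss is sketched below. The setting is an assumption, not something specified in the text: deterministic dynamics \dot{x} = f(x, u), running cost c(x, u), an infinite-horizon problem with discount rate ρ > 0 (related to the discrete-time factor by γ = e^{-ρΔt} over a step of length Δt), a network V_θ, and collocation states x_1, ..., x_N sampled from the region of interest.

```latex
% Stationary discounted HJB equation and the corresponding residual loss.
\begin{align}
  \rho\, V(x) &= \min_{u}\,\bigl[\, c(x,u) + \nabla V(x)^{\top} f(x,u) \,\bigr], \\
  \mathcal{L}(\theta) &= \frac{1}{N}\sum_{i=1}^{N}
     \Bigl( \min_{u}\bigl[\, c(x_i,u) + \nabla_x V_\theta(x_i)^{\top} f(x_i,u) \,\bigr]
            - \rho\, V_\theta(x_i) \Bigr)^{2}.
\end{align}
```

The gradient ∇_x V_θ is obtained by automatic differentiation of the network with respect to its input, so the loss can be evaluated at sampled states without simulating trajectories.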

Integration with Optimal Control Frameworks

Neural network value function approximators are used in both model-based and model-free optimal control.

Model-Based Control

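In model-based approaches, a known or learned dynamics model is available. Neural network value iteration (fitted value iteration) repeatedly applies the Bellman operator to improve the value estimate: at each iteration, the network is trained on a batch of sampled states to predict the minimum over controls of immediate cost plus discounted next-state value, with the next-state value taken from the previous iterate. With nonlinear function approximators this iteration is not guaranteed to converge, so in practice it is run until the value estimates stop changing appreciably. The policy is then obtained by online optimization: at each state, the controller picks the control that minimizes immediate cost plus the learned value of the predicted next state. This method has been applied to nonlinear systems like quadrotor control and autonomous racing.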
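
The fitted value iteration loop can be sketched on a scalar linear-quadratic problem, where the result can be checked against the discounted Riccati recursion. To keep the example short, the one-parameter model V_p(x) = p·x² stands in for the neural network and the fit step is ordinary least squares; the system constants, sample grids, and function names are all assumptions for illustration.

```python
# Assumed toy problem: x' = A*x + B*u, cost Q*x**2 + R*u**2, discount GAMMA.
A, B, Q, R, GAMMA = 1.0, 0.5, 1.0, 0.1, 0.95

def bellman_target(x, p, controls):
    """Min over candidate controls of immediate cost plus discounted next-state value."""
    return min(Q * x * x + R * u * u + GAMMA * p * (A * x + B * u) ** 2 for u in controls)

def fitted_value_iteration(iterations=50):
    states = [i / 10.0 for i in range(-20, 21)]       # batch of sampled states in [-2, 2]
    controls = [i / 50.0 for i in range(-200, 201)]   # candidate controls in [-4, 4]
    p = 0.0                                           # V_p(x) = p * x**2, initialized to zero
    for _ in range(iterations):
        targets = [bellman_target(x, p, controls) for x in states]
        # Fit step: least-squares estimate of p from (state, target) pairs.
        p = sum(t * x * x for x, t in zip(states, targets)) / sum(x ** 4 for x in states)
    return p

def riccati_reference(iterations=200):
    """Scalar discounted Riccati recursion, used only to check the fitted result."""
    p = 0.0
    for _ in range(iterations):
        p = Q + GAMMA * A * A * p * R / (R + GAMMA * B * B * p)
    return p

p_fit = fitted_value_iteration()
p_ref = riccati_reference()
print(p_fit, p_ref)
```

The small remaining gap between the two numbers comes from minimizing over a finite control grid. With a neural network in place of the quadratic model, the fit step becomes a few epochs of gradient descent on the same targets.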

Model-Free Reinforcement Learning

Model-free RL methods learn the value function directly from interaction, without an explicit dynamics model. DQN and its variants (Double DQN, Dueling DQN, Rainbow) use neural networks to approximate Q-functions for discrete actions. Actor-critic algorithms like A3C, PPO, and SAC, which are widely used for continuous control, train a policy network (actor) alongside a value network (critic). In A3C and PPO, the critic’s state-value estimate is used to compute advantages that guide policy updates; in SAC, the critic estimates Q-values that the actor is trained to improve. These methods have reached human-level performance on games such as many Atari titles and have produced impressive results in robotic locomotion and manipulation.
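
How a critic's estimates turn into advantages can be sketched with generalized advantage estimation (GAE), one common choice in PPO implementations; the text does not say which estimator is meant, so this is an assumption. Setting `lam=0` reduces it to the one-step TD advantage r + γ V(s') − V(s). The function name and argument layout are illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    values must have len(rewards) + 1 entries: the critic's estimate for every
    visited state plus a bootstrap value for the final state (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0]
values = [0.5, 0.5, 0.0]          # hypothetical critic outputs; the episode ends after two steps
print(gae_advantages(rewards, values, gamma=1.0, lam=0.5))   # [1.25, 0.5]
print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))   # [1.0, 0.5]
```

Positive advantages mark actions that turned out better than the critic expected, and the policy update increases their probability.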

Applications of Neural Network Value Functions in Optimal Control

Robotics: Neural value functions enable complex tasks such as dexterous manipulation, bipedal walking, and drone racing. For instance, neural networks have been trained to predict the cost-to-go for in-hand object rotation with a robotic hand, allowing the robot to plan fine finger motions. In legged locomotion, value functions guide foot placement and body posture.

Aerospace: Optimal control of spacecraft for rendezvous, landing, and orbital maneuvers benefits from neural value function approximation. The high dimensionality of the spacecraft state (position, velocity, orientation, angular rates) and the nonlinear dynamics make grid methods impractical. Neural networks trained on batches of optimal trajectories can provide near-optimal feedback policies in real time on embedded hardware.

Autonomous driving: Value functions help plan safe and efficient trajectories. They evaluate the long-term risk and reward of states, enabling decision-making in intersection crossing, lane changes, and highway merging. Combined with perception from onboard sensors, neural value networks have been demonstrated in simulated and real autonomous vehicles.

Finance: Optimal portfolio allocation and option hedging can be viewed as optimal control problems. The value function represents the maximal expected utility of wealth. Neural networks can approximate this function for nonlinear market models with transaction costs or constraints, adapting to changing market conditions.

Energy systems: Control of microgrids, battery storage, and HVAC systems benefits from neural value functions. They provide approximately optimal policies that reduce operational costs while respecting system constraints, and they can outperform traditional rule-based or linear approximation methods.

Challenges and Limitations

Overfitting and generalization: Neural networks trained on limited state samples may fail to generalize to unseen regions, leading to poor performance when the real system drifts into states that were not covered during training. Regularization, domain randomization, and robust training techniques are active research areas.

Sample efficiency: Deep RL methods often require millions of environment interactions to learn a reliable value function. This is expensive or dangerous in real-world applications. Model-based approaches and offline RL aim to reduce sample requirements, but challenges remain.

Safety and robustness: A small error in the value function can lead to catastrophic control actions, especially near obstacles or system stability boundaries. Safety-critical applications require formal guarantees, which are hard to provide with neural networks. Barrier functions and explicit safety filters are being combined with learned value functions.

Interpretability: Neural network value functions are black boxes, making it difficult to understand why a particular control decision was made. This hinders certification in domains like aviation or medicine. Efforts to develop explainable neural networks (e.g., using attention maps, distillation into interpretable models) are ongoing.

Distribution shift: When the policy used to collect data differs from the policy being evaluated or deployed, the value function may be queried on out-of-distribution states and actions, leading to unreliable predictions. This is especially problematic in offline RL, where no new data can be collected to correct such errors. Conservative Q-learning, uncertainty estimation, and clipping of importance weights are partial remedies.

Current Research and Future Directions

Researchers are exploring several promising avenues to address these challenges. Physics-informed and structure-aware learning integrates known equations (e.g., HJB, dynamics) into the network or loss function, improving sample efficiency and generalization. Ensemble methods and uncertainty quantification provide confidence estimates for value predictions, enabling risk-aware decision-making. Generative models and world models learn the environment dynamics alongside the value function, allowing much of the value function training to take place on imagined rollouts in latent space and reducing the need for costly real interactions. Multi-agent optimal control extends value function approximation to systems with multiple interacting agents, where the combined state is even higher-dimensional; mean-field approximations and graph neural networks are promising tools. Safety-aware learning incorporates Lyapunov stability conditions or barrier certificates into neural network training, with the aim of certifying that closed-loop trajectories remain in, or converge to, a safe set. Neural ODEs and continuous-time methods directly model the HJB solution using neural networks, offering a unified framework for control and estimation.
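
The ensemble idea can be illustrated with a minimal sketch: several independently trained value networks score the same state, and their disagreement is used as a rough uncertainty signal. The numbers below are made-up ensemble outputs, and the rule "mean plus k standard deviations" is one simple pessimistic choice for cost-to-go values, assumed here for illustration rather than taken from a specific method.

```python
import statistics

def pessimistic_cost_to_go(predictions, k=1.0):
    """Return (mean + k * std, mean, std) across the ensemble's predictions for one state."""
    mean = statistics.mean(predictions)
    spread = statistics.pstdev(predictions)
    return mean + k * spread, mean, spread

# Hypothetical outputs of five value networks for two candidate next states.
ensemble_a = [2.0, 2.1, 1.9, 2.0, 2.0]   # members agree
ensemble_b = [1.0, 3.0, 0.5, 2.5, 2.0]   # lower mean cost, but large disagreement

score_a, mean_a, std_a = pessimistic_cost_to_go(ensemble_a)
score_b, mean_b, std_b = pessimistic_cost_to_go(ensemble_b)
print(mean_a, mean_b)      # the plain mean prefers state b
print(score_a, score_b)    # the pessimistic score prefers state a
```

A controller that minimizes the mean alone would pick the second state, while the uncertainty-aware score steers it toward the state on which the ensemble agrees.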

Conclusion

Neural networks have fundamentally changed the landscape of optimal control by providing a practical means to approximate value functions in high-dimensional, nonlinear systems. Their success stems from the ability to learn from data, adapt to complex dynamics, and compute value estimates at inference time with minimal overhead. While challenges like sample inefficiency, safety, and interpretability remain, ongoing research continues to narrow the gap between theory and practice. As algorithmic improvements and hardware capabilities advance, neural network value function approximators will likely become a standard component in real-world optimal control applications across engineering, science, and beyond.