Nonlinear dynamics with neural networks

Preliminary Examination

Anil Radhakrishnan

Nonlinear Artificial Intelligence Laboratory, North Carolina State University

May 8, 2024

Table of Contents

Motivation
Neural Networks and optimization
Background
Metalearning Activation functions
Control and Chaos
Background
Neural Network control of chaos
Dynamics and symmetries
Background
Symmetries of controllers
Conclusion
References
Acknowledgements

Motivation

Motivation: Why Neural Networks?

  • The preeminent computational tool of the contemporary world.
  • Differentiable, optimizable, and scalable.
  • Emergent uses in and out of Physics1.

ChatGPT | CoPilot | NetKet

Motivation: Why Nonlinear Dynamics?

  • Captures complexity.
  • Formal, well-established framework.
  • Emergent uses in and out of Physics.

Avalanche activity cascades in a sandpile automaton | Vortex street formed by flow past a cylinder | Turing patterns in a reaction-diffusion model

Animation by William Gilpin

Motivation: Why Nonlinear Dynamics for Neural Networks?

  • Optimization is inescapably nonlinear2.
  • Neural networks are inherently nonlinear.
  • Scope of nonlinearities in neural networks is underexplored.

Trainability fractal by Jascha Sohl-Dickstein

Motivation: Why Neural Networks for Nonlinear Dynamics?

  • Nonlinear dynamical systems are computationally expensive to solve.
  • The paradigm of treating a solution as an element of a distribution translates naturally to neural networks.
  • Data-driven methods accommodate realistic complexity.

Background: Neural Networks and optimization

Background: Differentiable Computing

  • Paradigm where programs can be differentiated end-to-end automatically, enabling optimization of the program's parameters3.
  • Techniques for differentiating through complex programs are more than just deep learning.
  • Can be interpreted probabilistically or as a dynamical system.

autodifferentiation diagram
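As a minimal sketch of this paradigm (assuming JAX; the damped-oscillator model, step count, and constants are illustrative only), an entire simulation loop can be differentiated end-to-end with respect to its parameters:

import jax

# A small program: integrate a damped oscillator with explicit Euler steps,
# then ask how the final position depends on the damping parameter.
def final_position(damping):
    x, v = 1.0, 0.0
    dt = 0.01
    for _ in range(100):
        x, v = x + dt * v, v - dt * (x + damping * v)
    return x

# jax.grad differentiates through the whole loop end-to-end.
sensitivity = jax.grad(final_position)(0.5)
print(sensitivity)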

Background: Neural Networks

Neural Network diagram

\[ \newcommand{\defeq}{\mathrel{\mathop:}=} \newcommand{\R}{\mathbb{R}} \newcommand{\Param}{\theta} \newcommand{\Loss}{\mathcal{L}} \]

\[\varphi: \R^d \rightarrow \R^{N_L} \qquad T^l: \R^{N_{l-1}}\rightarrow \R^{N_{l}} \qquad \sigma: \R\rightarrow \R\] \[x \in \R^d \qquad W^l \in \R^{N_l\times N_{l-1}} \qquad b^l \in \R^{N_l}\]
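A direct transcription of these maps as a sketch in JAX (the layer sizes, initialization scale, and the choice \(\sigma = \tanh\) are assumptions; the final layer is left linear):

import jax
import jax.numpy as jnp

def init_layer(key, n_in, n_out):
    # W^l in R^{N_l x N_{l-1}}, b^l in R^{N_l}
    wkey, _ = jax.random.split(key)
    W = jax.random.normal(wkey, (n_out, n_in)) / jnp.sqrt(n_in)
    b = jnp.zeros(n_out)
    return W, b

def phi(params, x):
    # Alternate the affine maps T^l(x) = W^l x + b^l with elementwise sigma.
    *hidden, last = params
    for W, b in hidden:
        x = jnp.tanh(W @ x + b)   # sigma = tanh (an assumption)
    W, b = last
    return W @ x + b              # final layer left linear

key = jax.random.PRNGKey(0)
sizes = [3, 16, 16, 2]            # d = 3 input, N_L = 2 output
keys = jax.random.split(key, len(sizes) - 1)
params = [init_layer(k, i, o) for k, i, o in zip(keys, sizes[:-1], sizes[1:])]
print(phi(params, jnp.ones(3)))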

Background: Backpropagation

Consider a differentiable loss function \(\Loss\),

\[ \begin{eqnarray*} \delta^L =& \nabla_{\varphi}\Loss\odot\sigma'(T^L)\\ \delta^l =& ((W^{l+1})^\intercal \delta^{l+1})\odot\sigma'(T^l)\\ \dfrac{\partial \Loss}{\partial b^l_j} =& \delta_j^l \\ \dfrac{\partial \Loss}{\partial w^l_{jk}} =& \sigma(T^{l-1})_k\,\delta_j^l \\ \end{eqnarray*} \]

backprop diagram

The tunable parameters can then be updated by: \(\Param\gets \Param -\eta \nabla_{\Param}\Loss\)
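A numerical sketch of these recursions for one hidden layer (NumPy; the architecture, \(\sigma=\tanh\), and the squared-error loss are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
sigma, dsigma = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

# One hidden layer; the z^l play the role of the pre-activations T^l above.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = rng.normal(size=3), rng.normal(size=2)

# Forward pass.
z1 = W1 @ x + b1; a1 = sigma(z1)
z2 = W2 @ a1 + b2; a2 = sigma(z2)
loss = 0.5 * np.sum((a2 - y) ** 2)

# Backward pass, following the delta recursions on the slide.
delta2 = (a2 - y) * dsigma(z2)            # delta^L = grad_phi(L) . sigma'(z^L)
delta1 = (W2.T @ delta2) * dsigma(z1)     # delta^l = (W^{l+1}.T delta^{l+1}) . sigma'(z^l)
grad_W2 = np.outer(delta2, a1)            # dL/dW^l = delta^l (sigma(z^{l-1}))^T
grad_W1 = np.outer(delta1, x)
grad_b2, grad_b1 = delta2, delta1

# One gradient-descent update: theta <- theta - eta * grad.
eta = 0.1
W2 -= eta * grad_W2; b2 -= eta * grad_b2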

Background: Loss Functions

  • Neural networks compute a probability distribution on the data space.
  • Constructing a suitable differentiable loss function gives the path to optimization of a neural network. Common loss functions include:
    • Mean Squared Error \(\Loss(\theta) = \sum_{i=1}^N (y_i - f(x_i;\theta))^2\)
    • Cross Entropy \(\Loss(\theta) = -\sum_{i=1}^N y_i \log(f(x_i;\theta))\)
    • Kullback-Leibler Divergence \(\Loss(\theta) = \sum_{i=1}^N y_i \log\left(\dfrac{y_i}{f(x_i;\theta)}\right)\)

Loss functions are combined and regularized to balance the tradeoff between model complexity and data fit.
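The three losses above, written out directly (NumPy; the eps guard against log(0) and the example vectors are illustrative):

import numpy as np

def mse(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    # y: one-hot (or soft) targets, y_hat: predicted probabilities.
    return -np.sum(y * np.log(y_hat + eps))

def kl_divergence(y, y_hat, eps=1e-12):
    # Only terms with y_i > 0 contribute; KL = cross-entropy - entropy(y).
    mask = y > 0
    return np.sum(y[mask] * np.log(y[mask] / (y_hat[mask] + eps)))

y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.1, 0.8, 0.1])
print(mse(y, y_hat), cross_entropy(y, y_hat), kl_divergence(y, y_hat))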

Background: Optimization

  • To find the best model parametrization, we minimize the loss function with respect to the model parameters.
  • That is, we compute \(\mathcal{L}^\star \defeq \inf_{\theta \in \Theta} \Loss(\theta)\), assuming an infimum exists.
  • To converge to a minimum, the optimizer needs an oracle \(\mathcal{O}\), i.e. evaluations of the loss function, its gradients, or higher-order derivatives. Then, for an algorithm \(\mathcal{A}\), \[ \theta ^ {t+1} \defeq \mathcal{A}(\theta ^ 0, \ldots, \theta ^ t, \mathcal{O}(\theta ^ 0), \ldots, \mathcal{O}(\theta ^ t), \lambda), \] where \(\lambda \in \Lambda\) is a hyperparameter.

Stochastic Gradient Descent, Adam, and RMSProp are common optimization algorithms for training neural networks.
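A sketch of this oracle picture with \(\mathcal{A}\) = plain gradient descent and hyperparameter \(\lambda = (\eta)\) (JAX; the linear-regression task is an illustrative stand-in):

import jax
import jax.numpy as jnp

def loss(theta, x, y):
    return jnp.mean((x @ theta - y) ** 2)

# First-order oracle O: returns the loss value and its gradient.
oracle = jax.value_and_grad(loss)

# Algorithm A: one gradient step with hyperparameter lambda = (eta,).
def sgd_step(theta, batch, eta=1e-2):
    value, grad = oracle(theta, *batch)
    return theta - eta * grad, value

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 5))
theta_true = jnp.arange(5.0)
y = x @ theta_true
theta = jnp.zeros(5)
for _ in range(500):
    theta, value = sgd_step(theta, (x, y))
print(theta)  # approaches theta_true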

Background: Meta-Learning

  • Improve the learning algorithm itself given the experience of multiple learning episodes.
  • Base learning: an inner learning algorithm solves a task defined by a dataset and objective.
  • Meta-learning: an outer algorithm updates the inner learning algorithm.

metalearning

Algorithms for meta-learning are still in a nascent stage with significant computational overhead.

Figure by John Lindner
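A minimal sketch of the two nested loops (JAX; the scalar meta-parameter, toy task, and step counts are illustrative assumptions, not the framework used in this work):

import jax
import jax.numpy as jnp

# Inner loss for one task; meta_param could parametrize e.g. an activation.
def inner_loss(theta, meta_param, task):
    x, y = task
    return jnp.mean((jnp.tanh(meta_param * x) @ theta - y) ** 2)

def inner_train(theta, meta_param, task, eta=0.1, steps=5):
    # Base learning: a few gradient steps on the task.
    for _ in range(steps):
        theta = theta - eta * jax.grad(inner_loss)(theta, meta_param, task)
    return inner_loss(theta, meta_param, task)

# Meta-learning: differentiate through the whole inner loop.
meta_grad = jax.grad(inner_train, argnums=1)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 4))
task = (x, jnp.sum(x, axis=1))
theta0, meta_param = jnp.zeros(4), 1.0
for _ in range(20):  # outer loop
    meta_param -= 0.05 * meta_grad(theta0, meta_param, task)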

Background: Physics-Informed Neural Networks (PINNs)

  • Synthesizing data with differential equation constraints.
  • Physics as a weak constraint in a composite loss function or a hard constraint with architectural choices.
  • Symplectic constraints in the loss function give Hamiltonian Neural Networks4.
HNN schema
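A sketch of the symplectic constraint as a weak (loss-based) constraint, for a single degree of freedom (JAX; the tiny network and harmonic-oscillator data are illustrative). The loss penalizes deviation of the observed flow from the symplectic gradient of the learned \(H\):

import jax
import jax.numpy as jnp

# H_theta: a small network approximating the Hamiltonian H(q, p).
def hamiltonian(params, qp):
    W1, b1, w2 = params
    return w2 @ jnp.tanh(W1 @ qp + b1)

def hnn_loss(params, qp, qp_dot):
    # Symplectic constraint: (q_dot, p_dot) = (dH/dp, -dH/dq).
    dH = jax.grad(hamiltonian, argnums=1)(params, qp)
    dq_dt_pred, dp_dt_pred = dH[1], -dH[0]
    return (qp_dot[0] - dq_dt_pred) ** 2 + (qp_dot[1] - dp_dt_pred) ** 2

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = (jax.random.normal(k1, (16, 2)), jnp.zeros(16), jax.random.normal(k2, (16,)))
qp = jnp.array([1.0, 0.0])       # harmonic-oscillator state (q, p)
qp_dot = jnp.array([0.0, -1.0])  # (q_dot, p_dot) = (p, -q) at this state
print(hnn_loss(params, qp, qp_dot))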

Background: Coordinates matter

Neural networks are coordinate dependent. The choice of coordinates can significantly affect the performance of the network.

Adapted from Heliocentrism and Geocentrism by Malin Christersson

Metalearning Activation functions

(Published, US and International Patent Pending)

Foundation

  • Most complex systems showcase diversity in populations.
  • Artificial neural network architectures are traditionally layerwise-homogeneous.
  • Will neurons diversify if given the opportunity?

Insight

  • Activation functions can themselves be modeled as neural networks.
  • Activation function subnetworks are optimizable via metalearning.
  • Multiple subnetwork initializations allow the activations to diversify and evolve into different communities.

metalearning

Figure from Publication
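A minimal sketch of an activation-function subnetwork (JAX; shapes and initializations are illustrative). Each neuron's nonlinearity is itself a tiny \(\mathbb{R}\rightarrow\mathbb{R}\) network, so differently initialized act_params can evolve into different activation "species":

import jax
import jax.numpy as jnp

def activation_subnet(act_params, z):
    # Tiny R -> R network used in place of a fixed nonlinearity sigma.
    w1, b1, w2 = act_params
    return w2 @ jnp.tanh(w1 * z + b1)

def layer(W, b, act_params, x):
    # Standard affine map, but each pre-activation passes through the
    # learnable activation subnetwork instead of a fixed sigma.
    z = W @ x + b
    return jax.vmap(lambda zi: activation_subnet(act_params, zi))(z)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
act_params = (jax.random.normal(k1, (8,)), jnp.zeros(8), jax.random.normal(k2, (8,)))
W, b = jax.random.normal(k3, (5, 3)), jnp.zeros(5)
print(layer(W, b, act_params, jnp.ones(3)))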

Methodology

  • Developed a metalearning framework to optimize activation functions.
  • Tested the algorithm on classification and regression tasks for conventional and physics-informed neural networks.
  • Showed a regime where learned diverse activations are superior.
  • Gave preliminary analysis to support diversity in activation functions improving performance.

2neuronmnist1d

Figure from Publication

Results: Scaling

scaling

Figure from Publication

Results: Real World Data

real world example

Figure from Publication

Analysis: Participation Ratio

Estimate change in dimensionality of network activations

\[ N r = \mathcal{R} = \frac{(\operatorname{tr}{\bf C})^2}{\operatorname{tr}{\bf C}^2} = \frac{\left(\sum_{n=1}^N\lambda_n \right)^2}{\sum_{n=1}^N \lambda_n^2} \]

where \(\lambda_n\) are the eigenvalues of the covariance matrix \(\bf C\) of the neuronal activity data matrix. The normalized participation ratio is \(r = \mathcal{R} / N\)5.

participation ratio

Diverse activation functions use more of the network’s capacity.

Figure from Publication
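A direct computation of \(\mathcal{R}\) and \(r\) from an activity matrix (NumPy; the synthetic data are illustrative):

import numpy as np

def participation_ratio(activity):
    # activity: (samples, N) matrix of neuronal activations.
    C = np.cov(activity, rowvar=False)        # N x N covariance matrix
    eig = np.linalg.eigvalsh(C)
    R = eig.sum() ** 2 / (eig ** 2).sum()     # (tr C)^2 / tr(C^2)
    return R / activity.shape[1]              # normalized r = R / N

rng = np.random.default_rng(0)
low_dim = rng.normal(size=(1000, 1)) @ rng.normal(size=(1, 50))
print(participation_ratio(low_dim))                      # ~ 1/50: rank-1 activity
print(participation_ratio(rng.normal(size=(1000, 50))))  # ~ 1: full capacity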

Conclusions

  • Learned Diversity Networks discover sets of activation functions that can outperform traditional activations.
  • These networks excel in the low-data, few-shot learning regime.
  • Due to the nature of metalearning, the memory overhead is a significant concern for scaling.

Speculations: Achieving stable minima

Noisy descent

 

Optimization of a neural network with shuffled data is a noisy descent.

This can be modeled with the Langevin equation:

\[ d\theta_{t} = - \nabla \Loss(\theta_{t})\, dt + \sqrt{2\mathbf{D}} \cdot d\mathcal{W}_{t} \] with noise intensity \(\mathbf{D} = \eta\, \Loss(\theta)\, \mathbf{H}(\theta^\star)\)6
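An Euler-Maruyama sketch of this Langevin dynamics for a quadratic stand-in loss (NumPy; the loss, noise intensity, and step counts are illustrative). The stationary density approaches \(\exp(-\Loss(\theta)/D)\), so for \(\Loss = \theta^2/2\) the trajectory variance should approach \(D\):

import numpy as np

rng = np.random.default_rng(0)

def grad_loss(theta):
    return theta  # quadratic stand-in loss L(theta) = theta^2 / 2

# Euler-Maruyama discretization of d(theta) = -grad L dt + sqrt(2D) dW.
theta, dt, D = 2.0, 1e-2, 0.1
trajectory = []
for _ in range(100_000):
    theta += -grad_loss(theta) * dt + np.sqrt(2 * D * dt) * rng.normal()
    trajectory.append(theta)

print(np.var(trajectory))  # ~ D for this quadratic loss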

Speculations: Structure and universality to diverse activations

  • Learned activation functions appear qualitatively independent of the base activation function.
  • The odd and even character of the learned functions suggests that the network is learning to span the space of possible activation functions.
spanning sets

Background: Control and Chaos

Background: Hamilton-Jacobi-Bellman (HJB) Equation

For a control system with state \(x(t)\) and control \(u(t)\), \[ \dfrac{dx}{dt} = f(x(t),u(t)) \]

\[ \mathcal{H}(x,u,t_0,t_f) = Q(x(t_f), t_f) + \int_{t_0}^{t_f} \Loss(x(\tau),u(\tau))d\tau\\ V(x(t_0),t_0,t_f) = \min_{u(t)} \mathcal{H}(x(t_0),u(t),t_0,t_f)\\ -\dfrac{\partial V}{\partial t} = \min_{u(t)} \left[ \Loss(x(t),u(t)) + \dfrac{\partial V}{\partial x}^T f(x(t),u(t)) \right] \]

Background: Model Predictive Control

Control scheme where a model is used to predict the future behavior of the system over a finite time window7.

Model Predictive control

Animations from do-mpc documentation
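A minimal receding-horizon sketch (NumPy; the scalar linear model, random-shooting optimizer, horizon, and quadratic cost are all illustrative assumptions, far simpler than do-mpc):

import numpy as np

def simulate(x, u_seq, dt=0.1):
    # Model rollout over the prediction window; a toy system x' = x + u.
    xs = []
    for u in u_seq:
        x = x + dt * (x + u)
        xs.append(x)
    return np.array(xs)

def mpc_step(x, horizon=10, candidates=256, rng=np.random.default_rng(0)):
    # Pick the control sequence with the lowest predicted cost over the
    # window, apply only its first element (random-shooting MPC).
    u_seqs = rng.uniform(-1, 1, size=(candidates, horizon))
    costs = [np.sum(simulate(x, u) ** 2) for u in u_seqs]
    return u_seqs[int(np.argmin(costs))][0]

x = 0.5
for _ in range(50):
    x += 0.1 * (x + mpc_step(x))  # apply the first control, then re-plan
print(x)  # driven toward the origin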

Background: Chaos

Let \(X\) be a metric space. A continuous map \(f: X \rightarrow X\) is said to be chaotic on \(X\) if:

  • \(f\) has sensitive dependence on initial conditions,

  • \(f\) is topologically transitive,

  • and \(f\) has dense periodic orbits8.

Background: Traditional Chaos Control

Relies on the ergodicity of chaotic systems9,10.

Ott-Grebogi-Yorke control

Unstable periodic orbits

Background: Chaotic Pendulum Array11

\[ l_n^2\ddot\theta_n = -\gamma\dot\theta_n - l_n\sin{\theta_n} + \tau_{0} + \tau_{1}\sin{\omega t} + \kappa(\theta_{n-1}+\theta_{n+1}-2\theta_n) \]

Background: Kuramoto Oscillator12

\[\dot{\theta}_i = \omega_i + \lambda\sum_{j=1}^N \sin(\theta_j - \theta_i)\]

Synchronizing fireflies video from Robin Meier
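A direct Euler simulation of this mean-field model (NumPy; step size, coupling, and frequency spread are illustrative). The order parameter \(r = |\langle e^{i\theta}\rangle|\) moves toward 1 as the population synchronizes:

import numpy as np

def kuramoto_step(theta, omega, lam, dt=0.01):
    # dtheta_i/dt = omega_i + lambda * sum_j sin(theta_j - theta_i)
    coupling = np.sin(theta[None, :] - theta[:, None]).sum(axis=1)
    return theta + dt * (omega + lam * coupling)

rng = np.random.default_rng(0)
N = 50
theta = rng.uniform(0, 2 * np.pi, N)
omega = rng.normal(0, 0.5, N)
for _ in range(5000):
    theta = kuramoto_step(theta, omega, lam=0.1)

# Order parameter r = |mean(exp(i*theta))|: near 1 means synchronized.
print(abs(np.exp(1j * theta).mean()))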

Neural Network control of chaos

Insight

  • Optimal control of network dynamics involves minimizing a cost functional.
  • Traditional control approaches like Pontryagin’s (maximum) principle or Hamilton-Jacobi-Bellman equations are analytically and computationally intractable for complex systems.
  • Neural networks based on neural ODEs can approximate the optimal control policy13.

Methodology

Model Predictive control

Results: Kuramoto Grid

Controlled kuramoto dynamics

Results: Pendulum Array

Controlled Pendulum dynamics

Future Work: Exotic Dynamics14

\[\dot{\theta}_i = \omega_i + \lambda\sum_{j=1}^N A_{ij}\sin(\theta_j - \theta_i-\alpha)\]

Future Work: Recurrence matrices in \(\mathcal{L}\)15

Recurrence matrix for lorenz attractor

Background: Dynamics and symmetries

Background: Group theory for Dynamical Systems

Let \(\Gamma\) act on \(\mathbb{R}^n\) and \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\) . Then \(f\) is said to be \(\Gamma\)-equivariant if \(f(\gamma x) = \gamma f(x)\) for all \(x \in \mathbb{R}^n\) and \(\gamma \in \Gamma\)16.

For dynamical systems: if \(f(x^\star)=0\), then \(\gamma x^\star\) is also a fixed point.

Isotropy subgroup: for \(v\in \mathbb{R}^n\), \(\Sigma_v = \{\gamma \in \Gamma: \gamma v = v\}\).

Fixed-point subspace: for \(\Sigma \subseteq \Gamma\), \(\text{Fix}(\Sigma)=\{v \in \mathbb{R}^n : \sigma v=v \ \forall\, \sigma \in \Sigma\}\).

Theorem: Let \(f: \mathbb{R}^n \rightarrow \mathbb{R}^n\) be a \(\Gamma\)-equivariant map and \(\Sigma \subseteq \Gamma\). Then \(f(\text{Fix}(\Sigma))\subseteq\text{Fix}(\Sigma)\).
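A numerical check of equivariance for \(\Gamma = SO(2)\) acting on the plane (NumPy; the radially symmetric vector field is an illustrative example):

import numpy as np

# Gamma = rotations of the plane; f is an equivariant map (a sketch).
def rotation(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def f(x):
    # Radially symmetric vector field f(x) = g(|x|) x, equivariant
    # under rotations since |gamma x| = |x|.
    return (1.0 - np.dot(x, x)) * x

gamma = rotation(0.7)
x = np.array([0.3, -1.2])
print(np.allclose(f(gamma @ x), gamma @ f(x)))  # True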

Background: Group Equivariant Neural Networks

Group equivariance works by lifting and projection

Group equivariant NNs work by lifting convolutions to the group space and then projecting back to the original space17.

Figure adapted from Bart Smets et al.

Symmetries of controllers

Insight

  • Dynamical systems often obey symmetries.
  • Controllers are essentially dynamical systems.
  • Is there a computationally viable mapping between the symmetries of the system and the controller?

Methodology

  • Analyze the controllers from the previous sections for symmetries using group equivariant autoencoders.
  • Construct controllers that respect the symmetries of the system.
  • Compare the performance of symmetric and conventional controllers.

Expectation

  • Viability test for group equivariant neural networks in control systems.
  • Mapping between the symmetries of the system and the controller.
  • Performance analysis of symmetric controllers.

Conclusion

  • Non-traditional treatments of neural networks let us better capture nonlinearity.
  • Standard paradigms of Geometry, Statistics, and Algebra for understanding nonlinear systems are augmented by neural networks.
  • The interplay of Physics, Mathematics, and Computer Science gives us the best shot at understanding complex systems.

Deliverables

● Publication and Patent on Metalearning Activation functions.

◕ Publishable article on Neural Network Control of Chaos.

◔ Thorough study of the symmetries of controllers.

◑ Codebase for the above studies.

◔ Dissertation detailing the discoveries and pitfalls found throughout.

Estimated time of completion: August 2025

Acknowledgements

Bill, John, Anshul

Background image from Jacqueline Doan

References

1.
Vicentini, F. et al. NetKet 3: Machine Learning Toolbox for Many-Body Quantum Systems. SciPost Physics Codebases 007 (2022) doi:10.21468/SciPostPhysCodeb.7.
2.
Sohl-Dickstein, J. The boundary of neural network trainability is fractal. (2024) doi:10.48550/arXiv.2402.06184.
3.
Blondel, M. & Roulet, V. The Elements of Differentiable Programming. (2024) doi:10.48550/arXiv.2403.14606.
4.
Greydanus, S., Dzamba, M. & Yosinski, J. Hamiltonian Neural Networks. (2019) doi:10.48550/arXiv.1906.01563.
5.
Gao, P. et al. A theory of multineuronal dimensionality, dynamics and measurement. (2017) doi:10.1101/214262.
6.
Mori, T., Ziyin, L., Liu, K. & Ueda, M. Power-law escape rate of SGD. (2022) doi:10.48550/arXiv.2105.09557.
7.
Fiedler, F. et al. Do-mpc: Towards FAIR nonlinear and robust model predictive control. Control Engineering Practice 140, 105676 (2023).
8.
Devaney, R. L. An Introduction to Chaotic Dynamical Systems. (Westview Press, Boulder, Colo, 2003).
9.
Ott, E., Grebogi, C. & Yorke, J. A. Controlling chaos. Physical Review Letters 64, 1196–1199 (1990).
10.
Ditto, W. L., Rauseo, S. N. & Spano, M. L. Experimental control of chaos. Physical Review Letters 65, 3211–3214 (1990).
11.
Braiman, Y., Lindner, J. F. & Ditto, W. L. Taming spatiotemporal chaos with disorder. Nature 378, 465–467 (1995).
12.
13.
Böttcher, L., Antulov-Fantulin, N. & Asikis, T. AI Pontryagin or how artificial neural networks learn to control dynamical systems. Nature Communications 13, 333 (2022).
14.
Abrams, D. M. & Strogatz, S. H. Chimera States for Coupled Oscillators. Physical Review Letters 93, 174102 (2004).
15.
Kennel, M. B., Brown, R. & Abarbanel, H. D. I. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A 45, 3403–3411 (1992).
16.
Zee, A. Group Theory in a Nutshell for Physicists. (Princeton University Press, Princeton, 2016).
17.
Cohen, T. & Welling, M. Group Equivariant Convolutional Networks. in Proceedings of The 33rd International Conference on Machine Learning 2990–2999 (PMLR, 2016).
18.
Leshno, M., Lin, V. Ya., Pinkus, A. & Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 861–867 (1993).
19.
Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 303–314 (1989).
20.
Choudhary, A., Radhakrishnan, A., Lindner, J. F., Sinha, S. & Ditto, W. L. Neuronal diversity can improve machine learning for physics and beyond. Scientific Reports 13, 13962 (2023).
21.
Smets, B., Portegies, J., Bekkers, E. & Duits, R. PDE-based Group Equivariant Convolutional Neural Networks. Journal of Mathematical Imaging and Vision 65, 209–239 (2023).

Backup slides

Hyperparameters, integrators, and tolerances

Metalearning Activation functions:

MNIST1D: 1 hidden layer of 100 neurons; activation subnetworks with 50 hidden neurons; 100 initializations averaged; RMSprop optimizer. Real Pendulum: same, except 50 initializations. PyTorch and Jax frameworks used.

Neural Network control of chaos:

Control and dynamics integration via the diffrax Jax library for neural ODEs. Tsit5 algorithm for ODE integration with a PID step-size controller (rtol 1e-7, atol 1e-9). Stratonovich Milstein solver for SDE integration with a PID step-size controller (rtol 1e-7, atol 1e-9). 1000 epochs of controller training, with only 1/5 of the data for the first half. Implemented in Jax.

Universal approximation theorem

\[M(\sigma) = \text{span}\{\sigma(w\cdot x-\theta):w\in \mathbb{R}^n, \theta\in\mathbb{R}\}\]

Theorem 1: Let \(\sigma \in C(\mathbb{R})\). Then \(M(\sigma)\) is dense in \(C(\mathbb{R}^n)\), in the topology of uniform convergence on compacta, if and only if \(\sigma\) is not a polynomial18.

For Neural Networks,

Theorem 2: Let \(\sigma\) be any continuous sigmoidal function. Then finite sums of the form \(G(x) = \sum_{i=1}^N \alpha_i \sigma(w_i^T x + b_i)\) are dense in \(C(I_n)\). i.e. \(\forall f\in C(I_n)\) and \(\epsilon>0\), there is a sum \(G(x)\) such that \(\max_{x\in I_n}|f(x)-G(x)|<\epsilon\)19.

Informally, at least one neural network exists that can approximate any continuous function on \(I_n=[0,1]^n\) with arbitrary precision.
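A crude constructive check of Theorem 2, not a proof (NumPy; random sigmoidal features with least-squares output weights, all constants illustrative):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Target: f(x) = sin(2*pi*x) on I_1 = [0, 1].
x = np.linspace(0, 1, 200)
f = np.sin(2 * np.pi * x)

# G(x) = sum_i alpha_i * sigma(w_i x + b_i): random (w_i, b_i),
# with the alpha_i fit by least squares.
N = 50
w, b = rng.normal(0, 20, N), rng.normal(0, 10, N)
features = sigmoid(np.outer(x, w) + b)       # 200 x N feature matrix
alpha, *_ = np.linalg.lstsq(features, f, rcond=None)
print(np.max(np.abs(features @ alpha - f)))  # small uniform error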

SGD via gradient flows

Gradient descent: \(x_{n+1} = x_n - \gamma\nabla f(x_n)\) where \(\gamma > 0\) is the step-size. Gradient flow is the limit where \(\gamma\rightarrow 0\).

There are two ways to treat SGD in this regime,

Consider fixed times \(t=n\gamma\) and \(s=m\gamma\).

  • Convergence to gradient flow

Given the recursion \(x_{n+1} = x_n-\gamma\nabla f(x_n) - \gamma\epsilon_n\),

applying this \(m\) times, we get:

\[ \begin{align*} X(t+s) - X(t) &= x_{n+m}-x_n\\ &= -\gamma \sum_{k=0}^{m-1}\nabla f\left(X\left(t+\tfrac{sk}{m}\right)\right) - \gamma\sum_{k=0}^{m-1}\epsilon_{k+n} \end{align*} \] in the limit, \[ \begin{align*} X(t+s) - X(t) &= -\int_t^{t+s}\nabla f(X(u))\,du + 0 \end{align*} \] which is just the gradient flow equation.

  • Convergence to Langevin diffusion

Given the recursion \(x_{n+1} = x_n-\gamma\nabla f(x_n) - \sqrt{2\gamma}\epsilon_n\),

applying this \(m\) times, we get:

\[ \begin{align*} X(t+s) - X(t) &= x_{n+m}-x_n\\ &= -\gamma \sum_{k=0}^{m-1}\nabla f\left(X\left(t+\tfrac{sk}{m}\right)\right)- \sqrt{2\gamma}\sum_{k=0}^{m-1}\epsilon_{k+n} \end{align*} \] The second term has finite variance \(\propto 2s\). When \(m\) tends to infinity,

\[ X(t+s) - X(t) = -\int_t^{t+s} \nabla f(X(u))\,du + \sqrt{2}\,[B(t+s)-B(t)] \]

This limiting process is the Langevin diffusion, whose stationary distribution is the Gibbs distribution with density \(\propto \exp{(-f(x))}\).

Argument from Francis Bach

MNIST1D

MNIST1D

Neural ODEs

  • Forward propagation: \[ x(t_1) = \text{ODESolve}(f(x(t),t,\theta), x(t_0), t_0, t_1)\\ \text{Compute loss: } \Loss(x(t_1))\\ a(t_1) = \frac{\partial \Loss}{\partial x(t_1)}\\ \]

  • Back propagation: \[ \begin{bmatrix} x(t_0) \\ \frac{\partial \Loss}{\partial x(t_0)}\\ \frac{\partial \Loss}{\partial \theta} \end{bmatrix} = \text{ODESolve}\left(\begin{bmatrix} f(x(t),t,\theta) \\ -a(t)^T \frac{\partial f(x(t),t,\theta)}{\partial x} \\ -a(t)^T \frac{\partial f(x(t),t,\theta)}{\partial \theta} \end{bmatrix} , \begin{bmatrix} x(t_1) \\ \frac{\partial \Loss}{\partial x(t_1)}\\ 0_{|\theta|} \end{bmatrix} , t_1, t_0 \right) \]
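A sketch of this forward/adjoint pattern using jax.experimental.ode.odeint, whose reverse pass integrates the adjoint ODE backwards from \(t_1\) to \(t_0\) (the vector field and loss here are illustrative):

import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint

def dynamics(x, t, theta):
    # f(x(t), t, theta): a tiny parametrized vector field.
    return jnp.tanh(theta @ x)

def loss(theta, x0):
    xs = odeint(dynamics, x0, jnp.array([0.0, 1.0]), theta)
    x1 = xs[-1]          # x(t_1) = ODESolve(f, x(t_0), t_0, t_1)
    return jnp.sum(x1 ** 2)

theta = jnp.eye(2) * 0.5
x0 = jnp.array([1.0, -1.0])
# The gradient is computed by solving the augmented adjoint ODE backwards.
print(jax.grad(loss)(theta, x0))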

HJB Derivation Sketch

Bellman optimality equation: \[ V(x(t_0),t_0,t_f) = V(x(t_0),t_0,t) + V(x(t),t,t_f) \] \[ \begin{align*} \dfrac{dV(x(t),t,t_f)}{dt} &= \dfrac{\partial V}{\partial t} + \dfrac{\partial V}{\partial x}^T \frac{dx}{dt}\\ &= \min_{u(t)} \dfrac{d}{dt} \left[ \int_t^{t_f}\Loss(x(\tau),u(\tau))d\tau + Q(x(t_f), t_f) \right]\\ &= \min_{u(t)} \left[ -\Loss(x(t),u(t)) \right] \\ \end{align*} \]

\[ \implies \boxed{-\dfrac{\partial V}{\partial t} = \min_{u(t)} \left[ \Loss(x(t),u(t)) + \dfrac{\partial V}{\partial x}^T f(x(t),u(t)) \right]} \]

Pontryagin’s Principle

Let \((x^\star,u^\star)\) be an optimal trajectory-control pair. Then there exists a costate \(\lambda^\star\) such that, \[ \dot{\lambda}^\star =-\dfrac{\partial \mathcal{H}}{\partial x} \] and \[ \mathcal{H}(x^\star,u^\star,\lambda^\star, t) = \min_{u} \mathcal{H}(x^\star,u,\lambda^\star, t) \] and by definition, \[ \dot{x}^\star = \dfrac{\partial \mathcal{H}}{\partial \lambda} \]

Stochasticity and noise

We can add intrinsic noise to the dynamics by adding a noise term \(\xi\) with strength \(\sigma\): \[ \frac{dx}{dt}=f(x,t)+ \sigma(x,t)\xi \]

Then the rectangular Riemann construction is equivalent to: \(f(t)\approx\sum_{i=1}^n f(\hat{x}_i)\chi_{\Delta x_i}\)

where the \(f(\hat{x}_i)\) are constants. This approximation is exact in the limit \(n\to\infty\).

For the stochastic case, the rectangular construction is equivalent to: \(\phi(t)\approx\sum_{i=1}^n \hat{e}_i\chi_{\Delta x_i}\) Then the integral is: \[ \int_0^T \phi(t)\, dW_t=\sum_{i=1}^n \hat{e}_i\, \Delta W_i \]

Here, the coefficients \(\hat{e}_i\) are not constants but random variables, since the integrand \(\sigma\) depends on \(X_t\).

Recurrence plots caveat emptor

Rössler Lyapunov exponent

Rössler recurrence plot

Lyapunov exponent and Recurrence plot from Recurrence Plot Pitfalls

Neural network architectures

Autoencoder

GRU

CNN

All images from associated wikipedia pages (Autoencoder, GRU, CNN)