Nonlinear dynamics with neural networks

Preliminary Examination

Anil Radhakrishnan

Nonlinear Artificial Intelligence Laboratory, North Carolina State University

May 8, 2024

Table of Contents

Motivation
Neural Networks and optimization
▹ Background
▹ Metalearning Activation functions
Control and Chaos
▹ Background
▹ Neural Network control of chaos
Dynamics and symmetries
▹ Background
▹ Symmetries of controllers
Conclusion
References
Acknowledgements

Motivation

Motivation: Why Neural Networks?

  • The preeminent computational tool of the contemporary world.
  • Differentiable, optimizable, and scalable.
  • Emergent uses in and out of Physics1.

ChatGPT | CoPilot | NetKet

  • Neural networks have established themselves as a versatile tool for a wide variety of tasks.
  • They are not an answer to everything, but their scalability makes them invaluable in the age of big compute, big data, and big models.
  • They have also found their place in fundamental science, both speeding up research and enabling new discoveries.

Motivation: Why Nonlinear Dynamics?

  • Captures complexity.
  • Formal, well-established framework.
  • Emergent uses in and out of Physics.

Avalanche activity cascades in a sandpile automaton | Vortex street formed by flow past a cylinder | Turing patterns in a reaction-diffusion model

Animation by William Gilpin

Nonlinear dynamics is a powerful paradigm for understanding complex systems, and the systems we study are only growing more complex.

Motivation: Why Nonlinear Dynamics for Neural Networks?

  • Optimization is inescapably nonlinear2.
  • Neural networks are inherently nonlinear.
  • The scope of nonlinearities in neural networks is underexplored.


Trainability fractal by Jascha Sohl-Dickstein

Each pixel corresponds to training the same neural network from the same initialization on the same data — but with different hyperparameters. Blue-green colors mean that training converged for those hyperparameters, and the network successfully trained. Red-yellow colors mean that training diverged for those hyperparameters. The paler the color, the faster the convergence or divergence.

The neural network consists of an input layer, a tanh nonlinearity, and an output layer. In the image, the x-coordinate changes the learning rate for the input layer’s parameters, and the y-coordinate changes the learning rate for the output layer’s parameters.

Motivation: Why Neural Networks for Nonlinear Dynamics?

  • Nonlinear dynamical systems are computationally expensive to solve.
  • The paradigm of a solution as an element of a distribution translates naturally to neural networks.
  • Data-driven methods accommodate realistic complexity.

Background Neural Networks and optimization

Background: Differentiable Computing

  • Paradigm where programs can be differentiated end-to-end automatically, enabling optimization of the program's parameters3.
  • Techniques for differentiating through complex programs go beyond deep learning.
  • Can be interpreted probabilistically or as a dynamical system.

autodifferentiation diagram

  • For the purposes of this study, it is better to think of these algorithms through the lens of differentiability.
  • The large parameter space afforded by network structures is ripe for gradient-based optimization.

Background: Neural Networks

Neural Network diagram

$\varphi : \mathbb{R}^d \to \mathbb{R}^{N_L}, \quad T_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}, \quad \sigma : \mathbb{R} \to \mathbb{R}$

$x \in \mathbb{R}^d, \quad W_l \in \mathbb{R}^{N_l \times N_{l-1}}, \quad b_l \in \mathbb{R}^{N_l}$

The neural network is a composition of functions. We can swap out any of these components, and as long as the result is composable and differentiable, we can optimize it and call it a neural network.

Background: Backpropagation

Consider a differentiable loss function L,

$$\begin{aligned} \delta^L &= \nabla_\varphi \mathcal{L} \odot \sigma'(T^L) \\ \delta^l &= \left((W^{l+1})^\intercal \delta^{l+1}\right) \odot \sigma'(T^l) \\ \frac{\partial \mathcal{L}}{\partial b_j^l} &= \delta_j^l \\ \frac{\partial \mathcal{L}}{\partial w_{jk}^l} &= T_k^{l-1} \delta_j^l \end{aligned}$$

backprop diagram

The tunable parameters can then be updated by $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$.

  • Computational efficiency aside, backpropagation is essentially the chain rule.
  • We take the derivative of the loss with respect to the output of the network and then backpropagate the gradient through the network.
  • In the simplest case, we just subtract the gradient from the parameters to update them.
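
As a concrete illustration, here is a minimal NumPy sketch of these update equations for a single hidden layer; the layer sizes, toy data, and learning rate are arbitrary choices for the example, not values from this work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: map 2 inputs to 1 output (dimensions are arbitrary for this sketch).
x, y = rng.normal(size=(2, 1)), rng.normal(size=(1, 1))

# One hidden layer: T1 = W1 x + b1, phi = W2 sigma(T1) + b2.
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))
sigma = np.tanh
dsigma = lambda z: 1 - np.tanh(z) ** 2
eta = 0.1  # learning rate

for _ in range(100):
    # Forward pass.
    T1 = W1 @ x + b1
    a1 = sigma(T1)
    phi = W2 @ a1 + b2
    # Backward pass: delta^L for a squared-error loss, then the chain rule.
    dL = 2 * (phi - y)            # output layer is linear, so sigma'(T^L) = 1
    d1 = (W2.T @ dL) * dsigma(T1)
    # Gradient step: theta <- theta - eta * grad.
    W2 -= eta * dL @ a1.T; b2 -= eta * dL
    W1 -= eta * d1 @ x.T;  b1 -= eta * d1
```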

Background: Loss Functions

  • Neural networks compute a probability distribution on the data space.
  • Constructing a suitable differentiable loss function gives the path to optimization of a neural network. Common loss functions include:
    • Mean Squared Error: $\mathcal{L}(\theta) = \sum_{i=1}^{N} \left(y_i - f(x_i;\theta)\right)^2$
    • Cross Entropy: $\mathcal{L}(\theta) = -\sum_{i=1}^{N} y_i \log f(x_i;\theta)$
    • Kullback-Leibler Divergence: $\mathcal{L}(\theta) = \sum_{i=1}^{N} y_i \log\left(\frac{y_i}{f(x_i;\theta)}\right)$

Loss functions are combined and regularized to balance the tradeoff between model complexity and data fit.

  • The art of crafting the loss function can make or break a model.
  • The loss needs to be computationally efficient, and smooth enough to allow for stable gradient based optimization.
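
For concreteness, here is a minimal NumPy sketch of the three losses above; the function names and the small epsilon guard against log(0) are illustrative choices:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: sum_i (y_i - f(x_i))^2, averaged here for scale."""
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross entropy: -sum_i y_i log f(x_i)."""
    return -np.sum(y * np.log(y_hat + eps))

def kl_divergence(y, y_hat, eps=1e-12):
    """KL divergence: sum_i y_i log(y_i / f(x_i))."""
    return np.sum(y * np.log((y + eps) / (y_hat + eps)))

y = np.array([1.0, 0.0])       # target distribution
y_hat = np.array([0.8, 0.2])   # model output
print(mse(y, y_hat), cross_entropy(y, y_hat), kl_divergence(y, y_hat))
```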

Background: Optimization

  • To find the best model parametrization, we minimize the loss function with respect to the model parameters.
  • That is, we compute $\mathcal{L}^\star := \inf_{\theta\in\Theta} \mathcal{L}(\theta)$, assuming an infimum exists.
  • To converge to a minimum the optimizer needs an oracle $O$, i.e. evaluations of the loss function, its gradients, or higher-order derivatives. Then, for an algorithm $\mathcal{A}$, $\theta_{t+1} := \mathcal{A}(\theta_0,\ldots,\theta_t,\, O(\theta_0),\ldots,O(\theta_t),\, \lambda)$, where $\lambda\in\Lambda$ is a hyperparameter.

Stochastic Gradient Descent, Adam, and RMSProp are common optimization algorithms for training neural networks.

  • The optimizer is an important choice in the training process.
  • Optimizer takes in loss related info and updates the parameters.

Adam stands for adaptive moment estimation and is a stochastic gradient descent optimization algorithm that uses an adaptive learning rate based on estimates of the first and second moments. It maintains exponential moving averages of the gradients and their squares, which it uses to scale the learning rate. In other words, Adam uses estimates of the mean and variance of the gradients to adaptively scale the learning rate during training, which can improve the speed and stability of the optimization process.

When you want a fast and efficient optimization algorithm: Adam requires relatively little memory and computation, making it a fast and efficient choice for training deep learning models. When you have noisy or sparse gradients: Adam is well-suited for optimizing models with noisy or sparse gradients, as it can use the additional information provided by the gradients to adapt the learning rate on the fly.

The algorithm is an extension of the famous stochastic gradient descent (SGD) algorithm. The key idea behind RMSProp is to scale the gradient of each weight in the model by dividing it by the root mean square (RMS) of the gradients of that weight. This helps prevent weights with high gradients from learning too quickly while allowing weights with low gradients to continue learning faster. If you are training a model with many parameters and are experiencing issues with the model diverging or oscillating during training, RMSProp can help stabilize the training process by adjusting to the gradient.
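
A minimal sketch of a single Adam update, assuming the default hyperparameters from the original paper; the toy quadratic objective is only for illustration:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and its square (v) adaptively rescale the step for each parameter."""
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 as a toy check; grad f = 2 theta.
theta, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, eta=0.05)
print(theta)  # approaches 0
```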

Background: Meta-Learning

  • Improve the learning algorithm itself given the experience of multiple learning episodes.
  • Base learning: an inner learning algorithm solves a task defined by a dataset and objective.
  • Meta-learning: an outer algorithm updates the inner learning algorithm.

metalearning

Algorithms for meta-learning are still in a nascent stage with significant computational overhead.

Meta-learning allows the learning algorithm to optimize itself by adding a further layer of abstraction.

Figure by John Lindner
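
To make the two-loop structure concrete, here is a deliberately simplified sketch: the inner learner is plain gradient descent on a toy quadratic task, and the outer loop meta-learns its learning rate. A finite-difference meta-gradient keeps the example dependency-free; real metalearned optimizers typically differentiate through the inner loop instead.

```python
def inner_learn(eta, steps=20):
    """Base learning: gradient descent on f(theta) = theta^2, returns final loss."""
    theta = 5.0
    for _ in range(steps):
        theta -= eta * 2 * theta      # gradient of theta^2
    return theta ** 2

# Meta-learning: the outer loop updates the inner algorithm's hyperparameter
# (here the learning rate) using a finite-difference meta-gradient.
eta, meta_lr, h = 0.01, 1e-4, 1e-4
for episode in range(50):
    meta_grad = (inner_learn(eta + h) - inner_learn(eta - h)) / (2 * h)
    eta -= meta_lr * meta_grad
print(eta, inner_learn(eta))  # learned learning rate and improved inner loss
```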

Background: Physics-Informed Neural Networks (PINNs)

  • Synthesizing data with differential equation constraints.
  • Physics as a weak constraint in a composite loss function or a hard constraint with architectural choices.
  • Symplectic constraints in the loss function give Hamiltonian Neural Networks4.
HNN schema

One way to integrate physics into the neural network is to add the physics as a constraint to the loss function. This can be done by adding a term to the loss function that penalizes the network for not satisfying the physics. This is known as a weak constraint because the network is not required to satisfy the physics exactly, but rather to minimize the difference between the physics and the network’s predictions.
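
A minimal JAX sketch of such a weak-constraint (composite) loss, here penalizing the residual of a harmonic oscillator x'' + ω²x = 0; the tiny network, its sizes, and the weighting are illustrative assumptions, not the HNN architecture:

```python
import jax
import jax.numpy as jnp

def net(params, t):
    """Tiny scalar network x(t; theta); sizes are arbitrary for this sketch."""
    (W1, b1), (W2, b2) = params
    return jnp.dot(W2, jnp.tanh(W1 * t + b1)) + b2

def pinn_loss(params, t_data, x_data, t_phys, omega=1.0, weight=1.0):
    x = lambda t: net(params, t)
    # Data term: fit the observations.
    data = jnp.mean((jax.vmap(x)(t_data) - x_data) ** 2)
    # Physics term (weak constraint): penalize the ODE residual x'' + omega^2 x.
    xdd = jax.vmap(jax.grad(jax.grad(x)))(t_phys)
    phys = jnp.mean((xdd + omega ** 2 * jax.vmap(x)(t_phys)) ** 2)
    return data + weight * phys

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = [(0.1 * jax.random.normal(k1, (16,)), jnp.zeros(16)),
          (0.1 * jax.random.normal(k2, (16,)), 0.0)]
t = jnp.linspace(0.0, 2 * jnp.pi, 32)
grads = jax.grad(pinn_loss)(params, t, jnp.cos(t), t)  # gradients for training
```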

Background: Coordinates matter

Your browser does not support the video tag.

Neural networks are coordinate dependent. The choice of coordinates can significantly affect the performance of the network.

Adapted from Heliocentrism and Geocentrism by Malin Christersson

As with any data-driven modeling, the way you represent the data can have a significant impact on the performance of the model.

Metalearning Activation functions

(Published, US and International Patent Pending)

Foundation

  • Most complex systems showcase diversity in populations.
  • Artificial neural network architectures are traditionally layerwise-homogeneous.
  • Will neurons diversify if given the opportunity?

Insight

  • Activation functions can themselves be modeled as neural networks.
  • Activation function subnetworks are optimizable via metalearning.
  • Multiple subnetwork initializations allow the activations to diversify and evolve into different communities.

metalearning

Figure from Publication
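
The core idea can be sketched in a few lines: the activation is itself a tiny one-input, one-output network applied elementwise, whose parameters can be metalearned. The sizes and initialization below are illustrative, not the published architecture:

```python
import jax
import jax.numpy as jnp

def learned_activation(act_params, z):
    """The activation is itself a tiny 1-in/1-out network applied to a scalar
    preactivation z; its parameters are optimized by the outer (meta) loop."""
    W1, b1, W2 = act_params
    return jnp.dot(W2, jnp.tanh(W1 * z + b1))

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
act_params = (jax.random.normal(k1, (8,)), jnp.zeros(8),
              jax.random.normal(k2, (8,)))
z = jnp.linspace(-3.0, 3.0, 5)   # a layer's preactivations
print(jax.vmap(lambda s: learned_activation(act_params, s))(z))
```

Several independently initialized copies of such a subnetwork, shared across groups of neurons, give the communities that can then diversify under metalearning.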

Methodology

  • Developed a metalearning framework to optimize activation functions.
  • Tested the algorithm on classification and regression tasks for conventional and physics-informed neural networks.
  • Showed a regime where learned diverse activations are superior.
  • Gave a preliminary analysis to support the claim that diversity in activation functions improves performance.

2neuronmnist1d

Figure from Publication

Results: Scaling

scaling

Figure from Publication

Using the MNIST1D dataset, we showed that the learned diverse activations outperform traditional activations in the low-data, few-shot regime of around 5 epochs. These results are averaged over 100 initializations.

Results: Real World Data

real world example

Figure from Publication

We also showed that the method works for real-world data, in this case a real pendulum. We used an HNN for this task to further illustrate the generalizability of the learned diverse activations.

Analysis: Participation Ratio

Estimates the change in dimensionality of the network's activations

$$N r = R = \frac{(\operatorname{tr} C)^2}{\operatorname{tr} C^2} = \frac{\left(\sum_{n=1}^{N} \lambda_n\right)^2}{\sum_{n=1}^{N} \lambda_n^2}$$

where $\lambda_n$ are the eigenvalues of the covariance matrix $C$ of the neuronal activity data matrix. The normalized participation ratio5 is $r = R/N$.

participation ratio

Diverse activation functions use more of the network’s capacity.

Figure from Publication

If all the variance is in one dimension, say $\lambda_n = \delta_{n1}$, then $R = 1$.

If the variance is evenly distributed across all dimensions, so $\lambda_n = \lambda_1$ for all $n$, then $R = N$.

Typically $1 < R < N$, and $R$ corresponds to the number of dimensions needed to explain most of the variance.
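
A short NumPy sketch of the participation ratio, checked on the two limiting cases above; the synthetic activity matrices are illustrative:

```python
import numpy as np

def participation_ratio(X):
    """R = (tr C)^2 / tr C^2 for the covariance C of a (samples x neurons)
    activity matrix X; equals the eigenvalue form since tr C = sum(lambda_n)."""
    C = np.cov(X, rowvar=False)
    return np.trace(C) ** 2 / np.trace(C @ C)

rng = np.random.default_rng(0)
N = 50
low_dim = rng.normal(size=(1000, 1)) @ rng.normal(size=(1, N))  # rank-1 activity
iso = rng.normal(size=(1000, N))                                # isotropic activity
print(participation_ratio(low_dim))  # ~1: all variance in one dimension
print(participation_ratio(iso))      # ~N: variance spread evenly
```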

Conclusions

  • Learned Diversity Networks discover sets of activation functions that can outperform traditional activations.
  • These networks excel in the low-data, few-shot learning regime.
  • Due to the nature of metalearning, the memory overhead is a significant concern for scaling.

Speculations: Achieving stable minima

Noisy descent

 

Optimization of a neural network with shuffled data is a noisy descent.

This can be modeled with the Langevin equation:

$$d\theta_t = -\nabla \mathcal{L}(\theta_t)\,dt + \sqrt{2D}\,dW_t$$ with noise intensity $D = \eta\,\mathcal{L}(\theta)\,H(\theta^\star)$6

Minima from diverse neurons are flatter than those from homogeneous ones, as measured by both the trace $\operatorname{Tr} H$ of the Hessian and the fraction $f$ of its eigenvalues near zero: $\operatorname{Tr} H_1 > \operatorname{Tr} H_2 > \operatorname{Tr} H_{12}$ and $f_1 < f_2 < f_{12}$.

Since the noise term aligns with the Hessian near the minima, this suggests that the minima found by diverse neurons are flatter and more stable.

Speculations: Structure and universality to diverse activations

  • Learned activation functions appear qualitatively independent of the base activation function.
  • The odd and even nature of the learned functions suggests that the network is learning to span the space of possible activation functions.
spanning sets

The learned activations seem to form spanning sets for 1-d activation functions. Future work could explore the universality of this with multidimensional activations and larger activation communities, but the computational overhead is a significant concern.

Background: Control and Chaos

Background: Hamilton Jacobi Bellman(HJB) Equation

For a control system with state $x(t)$ and control $u(t)$, $\frac{dx}{dt} = f(x(t), u(t))$

$$\begin{aligned} H(x,u,t_0,t_f) &= Q(x(t_f),t_f) + \int_{t_0}^{t_f} \mathcal{L}(x(\tau),u(\tau))\,d\tau \\ V(x(t_0),t_0,t_f) &= \min_{u(t)} H(x(t_0),u(t),t_0,t_f) \\ -\frac{\partial V}{\partial t} &= \min_{u(t)} \left[ \mathcal{L}(x(t),u(t)) + \frac{\partial V}{\partial x}^{T} f(x(t),u(t)) \right] \end{aligned}$$

$V(x,t)$ is the value function, the minimum cost-to-go from state $x$ at time $t$. If we can solve for $V$, we can recover from it a control $u$ that achieves the minimum cost.

An extension of the Hamilton-Jacobi equation and the continuous-time analog of dynamic programming.

Background: Model Predictive Control

Control scheme where a model is used to predict the future behavior of the system over a finite time window7.


Animations from do-mpc documentation

Based on these predictions and the current measured/estimated state of the system, the optimal control inputs with respect to a defined control objective and subject to system constraints are computed. After a certain time interval, the measurement, estimation, and computation process is repeated with a shifted horizon.

Proactive control action: The controller is anticipating future disturbances, set-points etc.

Non-linear control: MPC can explicitly consider non-linear systems without linearization

Arbitrary control objective: Traditional set-point tracking and regulation or economic MPC

Constrained formulation: Explicitly consider physical, safety, or operational system constraints.

The dotted line indicates the current prediction and the solid line represents the realized values.

This simulation uses moving horizon estimation with the problem discretized by orthogonal collocation.
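
A minimal sketch of the receding-horizon idea, using random-shooting optimization on a toy scalar plant. do-mpc itself solves a constrained nonlinear program; everything here, from the plant to the cost weights, is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, u):
    """Toy nonlinear scalar plant (a stand-in for the real system model)."""
    return x + 0.1 * np.sin(x) + 0.2 * u

def rollout_cost(x, us, target=0.0):
    """Predicted cost over the horizon for one candidate control sequence."""
    cost = 0.0
    for u in us:
        x = step(x, u)
        cost += (x - target) ** 2 + 0.01 * u ** 2
    return cost

x, horizon = 2.0, 10
for t in range(50):
    # Random-shooting MPC: sample control sequences, keep the cheapest...
    candidates = rng.uniform(-1, 1, size=(256, horizon))
    best = min(candidates, key=lambda us: rollout_cost(x, us))
    # ...apply only its first input, then re-plan with a shifted horizon.
    x = step(x, best[0])
print(x)  # should approach the target
```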

Background: Chaos

Let X be a metric space. A continuous map f:X→X is said to be chaotic on X if:

  • f has sensitive dependence on initial conditions

  • is topologically transitive

  • and has dense periodic orbits8


Devaney's definition of chaos is a general, useful definition for our purposes. In ℝ, any two of these conditions imply the third.

There is an equivalent definition where f is indecomposable and the periodic points are dense in X.

A topologically transitive dynamical system has points which eventually move under iteration from one arbitrarily small open set to any other. Consequently, such a dynamical system cannot be decomposed into two disjoint sets with nonempty interiors which do not interact under the transformation.

Background: Traditional Chaos Control

Relies on the ergodicity of chaotic systems9,10.

Ott-Grebogi-Yorke control

Unstable periodic orbits

OGY relies on stabilizing unstable periodic orbits (UPOs) in the chaotic system. When a trajectory is near a UPO, the OGY method uses a perturbation to move the trajectory off the unstable manifold of the UPO and onto the stable manifold.

In the neighborhood of the UPO, the next point of the system's Poincaré section can be approximated using the Jacobian, and the required shift can then be computed by linear approximation.

Shown: the attractor of the Duffing oscillator with the UPOs marked in purple.

Background: Chaotic Pendulum Array11


$$l_n^2 \ddot{\theta}_n = -\gamma \dot{\theta}_n - l_n \sin\theta_n + \tau_0 + \tau_1 \sin\omega t + \kappa\left(\theta_{n-1} + \theta_{n+1} - 2\theta_n\right)$$

This damped, driven pendulum array is a classic example of a chaotic system that was deeply explored by Y. Braiman, John, and Bill in the 90s, showing how disorder can tame chaos.

If we remove the coupling we get a simple driven damped pendulum. This system can itself be chaotic as well.

Background: Kuramoto Oscillator12

$$\dot{\theta}_i = \omega_i + \frac{\lambda}{N} \sum_{j=1}^{N} \sin(\theta_j - \theta_i)$$

Synchronizing fireflies video from Robin Meier

The Kuramoto model can be thought of as a first-order mean-field theory of the complex Ginzburg-Landau equation.

This model was initially proposed to capture synchronization in large, weakly-coupled oscillator systems. It has found many applications, such as Josephson junction arrays, power grids, and even modeling fireflies.
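
A minimal NumPy simulation of the Kuramoto model above; the system size, coupling, frequency spread, and forward-Euler integrator are illustrative choices:

```python
import numpy as np

def kuramoto_rhs(theta, omega, lam):
    """d theta_i/dt = omega_i + (lambda/N) sum_j sin(theta_j - theta_i)."""
    return omega + (lam / len(theta)) * np.sin(
        theta[None, :] - theta[:, None]).sum(axis=1)

rng = np.random.default_rng(0)
N, lam, dt = 100, 2.0, 0.01
theta = rng.uniform(0, 2 * np.pi, N)   # random initial phases
omega = rng.normal(0, 0.5, N)          # natural frequencies
for _ in range(5000):                  # forward Euler, adequate for a demo
    theta += dt * kuramoto_rhs(theta, omega, lam)
# Order parameter |r| -> 1 indicates synchronization.
print(abs(np.exp(1j * theta).mean()))
```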

Neural Network control of chaos

Insight

  • Optimal control of network dynamics involves minimizing a cost functional.
  • Traditional control approaches like Pontryagin's (maximum) principle or the Hamilton-Jacobi-Bellman equations are analytically and computationally intractable for complex systems.
  • Neural networks based on neural ODEs can approximate the optimal control policy13.

Methodology

Model Predictive control

Instead of training the neural network on a predefined dataset, we train it in situ: the network serves as the controller and is integrated into the dynamical system at each epoch.

The integration time horizon is increased fivefold between the first and second halves of training, to test the controller's ability to extrapolate.
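
A simplified sketch of this in-situ setup in JAX, using jax.experimental.ode.odeint rather than the diffrax stack used in the actual work; the toy pendulum-like plant, network sizes, and quadratic cost are illustrative assumptions:

```python
import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint

def controller(params, x):
    """Tiny feedback controller u = NN(x); sizes are illustrative."""
    W1, b1, W2 = params
    return jnp.dot(W2, jnp.tanh(W1 @ x + b1))

def dynamics(x, t, params):
    """Plant with the controller in the loop (a toy system, not the paper's)."""
    u = controller(params, x)
    return jnp.array([x[1], -jnp.sin(x[0]) + u])   # pendulum-like

def loss(params, x0, ts):
    xs = odeint(dynamics, x0, ts, params)           # integrate in situ
    return jnp.mean(xs[:, 0] ** 2 + xs[:, 1] ** 2)  # drive the state to rest

key = jax.random.PRNGKey(0)
params = (0.1 * jax.random.normal(key, (16, 2)), jnp.zeros(16),
          0.1 * jax.random.normal(key, (16,)))
x0, ts = jnp.array([2.0, 0.0]), jnp.linspace(0.0, 5.0, 50)
grads = jax.grad(loss)(params, x0, ts)              # backprop through the solver
params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
```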

Results: Kuramoto Grid

Controlled kuramoto dynamics

The simplest test for this method was a grid of Kuramoto oscillators. The network was trained to control the phases of the oscillators to a desired state. This works very well. Interestingly, the method also extends naturally to the noisy Kuramoto system, which is not guaranteed by HJB.

Results: Pendulum Array

Controlled Pendulum dynamics

We also tested the method by driving the chaotic pendulum array into synchrony while maintaining the mean of the pendulum lengths. The network was able to control this system as well, navigating out of the sliver of chaos.

Future Work: Exotic Dynamics14

$$\dot{\theta}_i = \omega_i + \frac{\lambda}{N} \sum_{j=1}^{N} A_{ij} \sin(\theta_j - \theta_i - \alpha)$$


This is an extension of the Kuramoto system where the coupling is nonlocal and the phase difference is shifted by a phase α.

It cannot be ascribed to a supercritical instability of the spatially uniform oscillation, because it occurs even if the uniform state is stable. Furthermore, it has nothing to do with the partially locked/partially incoherent states seen in populations of non-identical oscillators with distributed frequencies. There, the splitting of the population stems from the inhomogeneity of the oscillators themselves; the desynchronized oscillators are the intrinsically fastest or slowest ones. Here, all the oscillators are the same.

It would be a good test of the model to see if it can move a system into and out of chimera states.

Future Work: Recurrence matrices in $\mathcal{L}$15

Recurrence matrix for lorenz attractor

Recurrence matrices are a way to capture the dynamics of a system by looking at the recurrences of its states. This is similar to the notion of delay-coordinate embeddings used for approximating attractors. As you can see here, the recurrence matrices are invariant to the delays.

This can be used in the loss function to prescribe the exact dynamics to optimize towards.

Computed by subtracting the transpose of the time series from itself and then thresholding the absolute value.
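
That computation is a few lines of NumPy; the threshold eps and the test signal are illustrative:

```python
import numpy as np

def recurrence_matrix(x, eps=0.5):
    """Binary recurrence matrix: R_ij = 1 where |x_i - x_j| < eps, i.e.
    threshold the absolute value of the series minus its transpose."""
    d = np.abs(x[:, None] - x[None, :])
    return (d < eps).astype(int)

t = np.linspace(0, 8 * np.pi, 400)
R = recurrence_matrix(np.sin(t))   # periodic signal -> diagonal stripes
print(R.shape, R.mean())
```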

Background: Dynamics and symmetries

Background: Group theory for Dynamical Systems

Let $\Gamma$ act on $\mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}^m$. Then $f$ is said to be $\Gamma$-equivariant if $f(\gamma x) = \gamma f(x)$ for all $x \in \mathbb{R}^n$ and $\gamma \in \Gamma$16.

For dynamical systems, if $x^\star$ is a fixed point, $f(x^\star) = 0$, then $\gamma x^\star$ is also a fixed point.

Isotropy subgroup: for $v \in \mathbb{R}^n$, $\Sigma_v = \{\gamma \in \Gamma : \gamma v = v\}$

Fixed-point subspace: for $\Sigma \subseteq \Gamma$, $\operatorname{Fix}(\Sigma) = \{v \in \mathbb{R}^n : \sigma v = v \text{ for all } \sigma \in \Sigma\}$

Thm: Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be a $\Gamma$-equivariant map and $\Sigma \subseteq \Gamma$. Then $f(\operatorname{Fix}(\Sigma)) \subseteq \operatorname{Fix}(\Sigma)$.

Examples: the cyclic group; the dihedral group (cyclic group + flips); the symmetric group, of order $m!$; the circle group $S^1$; $O(n)$; measure-preserving transformations.

For ODEs that means if $x(t)$ is a solution, then so is $\gamma x(t)$ for all $\gamma \in \Gamma$. So we can classify solutions up to their symmetry.

If a system starts in a symmetric state, it will remain symmetric.

Isotropy subgroup: the set of all group elements which leave the solution unchanged, i.e., fixed-point subspaces are invariant under the flow generated by f.

Background: Group Equivariant Neural Networks

Group equivariance works by lifting and projection

Group equivariant NNs work by lifting convolutions to the group space and then projecting back to the original space17.

Figure adapted from Bart Smets et al.

One way that groups are integrated into neural networks is as an extension of the traditional convolutional neural network setup. Instead of just lifting to some real space which preserves translations, as in traditional CNNs, we lift to a group space which preserves the group symmetries.
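
For the simplest nontrivial case, a circular (cyclic-shift) convolution is already equivariant to the cyclic group, and the property can be checked numerically. This toy layer is an illustration of the equivariance condition $f(\gamma x) = \gamma f(x)$, not the lifting construction of the figure:

```python
import numpy as np

def c_n_layer(x, kernel):
    """Circular convolution: equivariant to cyclic shifts, since shifting
    the input shifts the output identically."""
    n = len(x)
    return np.array([sum(kernel[j] * x[(i - j) % n]
                         for j in range(len(kernel))) for i in range(n)])

rng = np.random.default_rng(0)
x, k = rng.normal(size=12), rng.normal(size=3)
shift = lambda v, s: np.roll(v, s)
# Equivariance check: f(gamma x) == gamma f(x) for a shift gamma.
print(np.allclose(c_n_layer(shift(x, 4), k), shift(c_n_layer(x, k), 4)))  # True
```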

Symmetries of controllers

Insight

  • Dynamical systems often obey symmetries.
  • Controllers are essentially dynamical systems.
  • Is there a computationally viable mapping between the symmetries of the system and the controller?

Methodology

  • Analyze the controllers from the previous sections for symmetries using group equivariant autoencoders.
  • Construct controllers that respect the symmetries of the system.
  • Compare the performance of symmetric and conventional controllers.

Expectation

  • Viability test for group equivariant neural networks in control systems.
  • Mapping between the symmetries of the system and the controller.
  • Performance analysis of symmetric controllers.

Conclusion

  • Non-traditional treatments of neural networks let us better capture nonlinearity.
  • Standard paradigms of Geometry, Statistics, and Algebra for understanding nonlinear systems are augmented by neural networks.
  • The interplay of Physics, Mathematics, and Computer Science gives us the best shot at understanding complex systems.

Deliverables

● Publication and Patent on Metalearning Activation functions.

◕ Publishable article on Neural Network Control of Chaos.

◔ Thorough study of the symmetries of controllers.

◑ Codebase for the above studies.

◔ Dissertation detailing the discoveries and pitfalls found throughout.

Estimated time of completion: August 2025

Acknowledgements

Bill, John, Anshul

Background image from Jacqueline Doan

References

1.
Vicentini, F. et al. NetKet 3: Machine Learning Toolbox for Many-Body Quantum Systems. SciPost Physics Codebases 007 (2022) doi:10.21468/SciPostPhysCodeb.7.
2.
Sohl-Dickstein, J. The boundary of neural network trainability is fractal. (2024) doi:10.48550/arXiv.2402.06184.
3.
Blondel, M. & Roulet, V. The Elements of Differentiable Programming. (2024) doi:10.48550/arXiv.2403.14606.
4.
Greydanus, S., Dzamba, M. & Yosinski, J. Hamiltonian Neural Networks. (2019) doi:10.48550/arXiv.1906.01563.
5.
Gao, P. et al. A theory of multineuronal dimensionality, dynamics and measurement. (2017) doi:10.1101/214262.
6.
Mori, T., Ziyin, L., Liu, K. & Ueda, M. Power-law escape rate of SGD. (2022) doi:10.48550/arXiv.2105.09557.
7.
Fiedler, F. et al. Do-mpc: Towards FAIR nonlinear and robust model predictive control. Control Engineering Practice 140, 105676 (2023).
8.
Devaney, R. L. An Introduction to Chaotic Dynamical Systems. (Westview Press, Boulder, Colo, 2003).
9.
Ott, E., Grebogi, C. & Yorke, J. A. Controlling chaos. Physical Review Letters 64, 1196–1199 (1990).
10.
Ditto, W. L., Rauseo, S. N. & Spano, M. L. Experimental control of chaos. Physical Review Letters 65, 3211–3214 (1990).
11.
Braiman, Y., Lindner, J. F. & Ditto, W. L. Taming spatiotemporal chaos with disorder. Nature 378, 465–467 (1995).
12.
Kuramoto, Y. Self-entrainment of a population of coupled non-linear oscillators. in International Symposium on Mathematical Problems in Theoretical Physics (Springer, 1975).
13.
Böttcher, L., Antulov-Fantulin, N. & Asikis, T. AI Pontryagin or how artificial neural networks learn to control dynamical systems. Nature Communications 13, 333 (2022).
14.
Abrams, D. M. & Strogatz, S. H. Chimera States for Coupled Oscillators. Physical Review Letters 93, 174102 (2004).
15.
Kennel, M. B., Brown, R. & Abarbanel, H. D. I. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A 45, 3403–3411 (1992).
16.
Zee, A. Group Theory in a Nutshell for Physicists. (Princeton University Press, Princeton, 2016).
17.
Cohen, T. & Welling, M. Group Equivariant Convolutional Networks. in Proceedings of The 33rd International Conference on Machine Learning 2990–2999 (PMLR, 2016).
18.
Leshno, M., Lin, V. Ya., Pinkus, A. & Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 861–867 (1993).
19.
Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 303–314 (1989).
20.
Choudhary, A., Radhakrishnan, A., Lindner, J. F., Sinha, S. & Ditto, W. L. Neuronal diversity can improve machine learning for physics and beyond. Scientific Reports 13, 13962 (2023).
21.
Smets, B., Portegies, J., Bekkers, E. & Duits, R. PDE-based Group Equivariant Convolutional Neural Networks. Journal of Mathematical Imaging and Vision 65, 209–239 (2023).

Backup slides

Hyperparameters, integrators, and tolerances

Metalearning Activation functions:

MNIST1D: 1 hidden layer of 100 neurons, activation subnetworks of 50 hidden neurons. 100 initializations averaged. RMSprop optimizer. Real pendulum: same, except 50 initializations. PyTorch and Jax frameworks used.

Neural Network control of chaos:

Control and dynamics integration via the diffrax Jax library for neural ODEs. Tsit5 algorithm for ODE integration with a PID step-size controller with rtol 1e-7 and atol 1e-9. Stratonovich Milstein solver for SDE integration with a PID step-size controller with rtol 1e-7 and atol 1e-9. 1000 epochs of controller training with only 1/5 of the data for the first half. Implemented in Jax.

RMSProp addresses the issue of a global learning rate by maintaining a moving average of the squares of gradients for each weight and dividing the learning rate by this average. This ensures that the learning rate is adapted for each weight in the model, allowing for more nuanced updates. The general idea is to dampen the oscillations in directions with steep gradients while allowing for faster movement in flat regions of the loss landscape.

Tsit5 - Tsitouras 5/4 Runge-Kutta method (free 4th-order interpolant). Essentially Dormand-Prince with better coefficients.

Stratonovich Milstein - order 1 strong Taylor scheme

Universal approximation theorem

$$M(\sigma) = \operatorname{span}\{\sigma(w \cdot x - \theta) : w \in \mathbb{R}^n, \theta \in \mathbb{R}\}$$

Theorem 1: Let $\sigma \in C(\mathbb{R})$. Then $M(\sigma)$ is dense in $C(\mathbb{R}^n)$, in the topology of uniform convergence on compacta, if and only if $\sigma$ is not a polynomial18.

For Neural Networks,

Theorem 2: Let $\sigma$ be any continuous sigmoidal function. Then finite sums of the form $G(x) = \sum_{i=1}^{N} \alpha_i \sigma(w_i^T x + b_i)$ are dense in $C(I_n)$. That is, for all $f \in C(I_n)$ and $\epsilon > 0$, there is a sum $G(x)$ such that $\max_{x \in I_n} |f(x) - G(x)| < \epsilon$19.

Informally, at least one neural network exists that can approximate any continuous function on In=[0,1]n with arbitrary precision.
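
A quick numerical illustration of Theorem 2: fix random features $\sigma(w_i x + b_i)$ and fit only the coefficients $\alpha_i$ by least squares; the uniform error shrinks as $N$ grows. The target function and feature distribution are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1 / (1 + np.exp(-z))   # a continuous sigmoidal function

x = np.linspace(0, 1, 200)
target = np.sin(2 * np.pi * x)           # an arbitrary continuous target on [0, 1]
N = 50
w, b = rng.normal(0, 10, N), rng.normal(0, 5, N)
Phi = sigma(np.outer(x, w) + b)          # 200 x N design matrix of features
alpha, *_ = np.linalg.lstsq(Phi, target, rcond=None)
print(np.max(np.abs(Phi @ alpha - target)))  # sup-norm error shrinks as N grows
```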

SGD via gradient flows

Gradient descent: $x_{n+1} = x_n - \gamma \nabla f(x_n)$, where $\gamma > 0$ is the step-size. Gradient flow is the limit where $\gamma \to 0$.

There are two ways to treat SGD in this regime. Consider fixed times $t = n\gamma$ and $s = m\gamma$.

  • Convergence to gradient flow

Given the recursion $x_{n+1} = x_n - \gamma \nabla f(x_n) - \gamma \epsilon_n$,

applying this $m$ times, we get:

$$X(t+s) - X(t) = x_{n+m} - x_n = -\gamma \sum_{k=0}^{m-1} \nabla f\!\left(X\!\left(t + \tfrac{sk}{m}\right)\right) - \gamma \sum_{k=0}^{m-1} \varepsilon_{k+n}$$ In the limit, $X(t+s) - X(t) = -\int_t^{t+s} \nabla f(X(u))\,du + 0$, which is just the gradient flow equation.

  • Convergence to Langevin diffusion

Given the recursion $x_{n+1} = x_n - \gamma \nabla f(x_n) - \sqrt{2\gamma}\,\epsilon_n$,

applying this $m$ times, we get:

$$X(t+s) - X(t) = x_{n+m} - x_n = -\gamma \sum_{k=0}^{m-1} \nabla f\!\left(X\!\left(t + \tfrac{sk}{m}\right)\right) - \sqrt{2\gamma} \sum_{k=0}^{m-1} \varepsilon_{k+n}$$ The second term has finite variance $\propto 2s$. When $m$ tends to infinity,

$$X(t+s) - X(t) = -\int_t^{t+s} \nabla f(X(u))\,du + \sqrt{2}\,[B(t+s) - B(t)]$$

This limiting distribution is the Gibbs distribution with density $\propto \exp(-f(x))$.

$\gamma \sum_{k=0}^{m-1} \varepsilon_{k+n}$ has zero expectation and variance equal to $\gamma^2 m = \gamma s$ times the variance of each $\varepsilon_{k+n}$, and thus tends to 0.

Argument from Francis Bach

1DMNIST

MNIST1D

Neural ODEs

  • Forward propagation: $x(t_1) = \text{ODESolve}(f(x(t), t, \theta), x(t_0), t_0, t_1)$. Compute the loss $\mathcal{L}(x(t_1))$ and the adjoint $a(t_1) = \frac{\partial \mathcal{L}}{\partial x(t_1)}$.

  • Back propagation: $$\begin{bmatrix} x(t_0) \\[2pt] \frac{\partial \mathcal{L}}{\partial x(t_0)} \\[2pt] \frac{\partial \mathcal{L}}{\partial \theta} \end{bmatrix} = \text{ODESolve}\left( \begin{bmatrix} f(x(t), t, \theta) \\[2pt] -a(t)^T \frac{\partial f(x(t), t, \theta)}{\partial x} \\[2pt] -a(t)^T \frac{\partial f(x(t), t, \theta)}{\partial \theta} \end{bmatrix}, \begin{bmatrix} x(t_1) \\[2pt] \frac{\partial \mathcal{L}}{\partial x(t_1)} \\[2pt] 0_{|\theta|} \end{bmatrix}, t_1, t_0 \right)$$

HJB Derivation Sketch

Bellman optimality equation: $V(x(t_0), t_0, t_f) = V(x(t_0), t_0, t) + V(x(t), t, t_f)$

$$\frac{dV(x(t), t, t_f)}{dt} = \frac{\partial V}{\partial t} + \frac{\partial V}{\partial x}^{T} \frac{dx}{dt} = \min_{u(t)} \frac{d}{dt}\left[ \int_{0}^{t_f} \mathcal{L}(x(\tau), u(\tau))\,d\tau + Q(x(t_f), t_f) \right] = \min_{u(t)} \left[ \frac{d}{dt} \int_{0}^{t_f} \mathcal{L}(x(\tau), u(\tau))\,d\tau \right]$$

$$\implies -\frac{\partial V}{\partial t} = \min_{u(t)} \left[ \mathcal{L}(x(t), u(t)) + \frac{\partial V}{\partial x}^{T} f(x(t), u(t)) \right]$$

Pontryagin’s Principle

Let $(x^\star, u^\star)$ be an optimal trajectory-control pair. Then there exists a costate $\lambda^\star$ such that $\dot{\lambda}^\star = -\frac{\partial H}{\partial x}$ and $H(x^\star, u^\star, \lambda^\star, t) = \min_u H(x^\star, u, \lambda^\star, t)$, and by definition, $\dot{x}^\star = \frac{\partial H}{\partial \lambda}$.

Stochasticity and noise

We can add intrinsic noise by adding a noise term $\xi$ with a strength $\sigma$: $\frac{dx}{dt} = f(x, t) + \sigma(x, t)\,\xi$

Then the rectangular Riemann construction is equivalent to: $f(t) \approx \sum_{i=1}^{n} f(\hat{x}_i)\,\chi_{\Delta x_i}$

where the $f(\hat{x}_i)$ are constants. This approximation is exact in the limit $n \to \infty$.

For the stochastic case, the rectangular construction is equivalent to: $\phi(t) \approx \sum_{i=1}^{n} \hat{e}_i\,\chi_{\Delta x_i}$. Then the integral is: $\int_0^T \phi(t)\,dW_t = \sum_{i=1}^{n} \hat{e}_i\,\Delta W_i$

Here, the coefficients $\hat{e}_i$ are not constants but random variables, since we are allowed to integrate $\sigma$, which depends on $X_t$.
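
The corresponding numerical scheme is Euler-Maruyama, where each step adds $\sigma \cdot \Delta W$ with $\Delta W \sim \mathcal{N}(0, dt)$. A minimal sketch on an Ornstein-Uhlenbeck toy problem (the drift, noise strength, and step count are illustrative):

```python
import numpy as np

def euler_maruyama(f, sigma, x0, T, n, seed=0):
    """Integrate dx = f(x,t) dt + sigma(x,t) dW with the rectangular (Ito)
    construction: each step adds sigma * dW, where dW ~ N(0, dt)."""
    rng = np.random.default_rng(seed)
    dt = T / n
    x = np.empty(n + 1)
    x[0] = x0
    for i in range(n):
        t = i * dt
        dW = rng.normal(0.0, np.sqrt(dt))
        x[i + 1] = x[i] + f(x[i], t) * dt + sigma(x[i], t) * dW
    return x

# Ornstein-Uhlenbeck as a toy example: f = -x, sigma = 0.5.
path = euler_maruyama(lambda x, t: -x, lambda x, t: 0.5, 1.0, 5.0, 1000)
print(path[-1])
```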

Recurrence plots caveat emptor

Rössler Lyapunov exponents | Rössler recurrence plot

Lyapunov exponent and Recurrence plot from Recurrence Plot Pitfalls

Neural network architectures

Autoencoder

Autoencoder

GRU

GRU

CNN

CNN

All images from the associated Wikipedia pages (Autoencoder, GRU, CNN)
