Preliminary Examination
Nonlinear Artificial Intelligence Laboratory, North Carolina State University
May 8, 2024
Avalanche activity cascades in a sandpile automaton | Vortex street formed by flow past a cylinder | Turing patterns in a reaction-diffusion model
Animation by William Gilpin
Trainability fractal by Jascha Sohl-Dickstein
\[ \newcommand{\defeq}{\mathrel{\mathop:}=} \newcommand{\R}{\mathbb{R}} \newcommand{\Param}{\theta} \newcommand{\Loss}{\mathcal{L}} \]
\[\varphi: \R^d \rightarrow \R^N_L \qquad T^l: \R^{N_{l-1}}\rightarrow \R^{N_{l}} \qquad \sigma: \R\rightarrow \R\] \[x \in \R^d \qquad W^l \in \R^{N_l\times N_{l-1}} \qquad b^l \in \R^{N_l}\]
Consider a differentiable loss function \(\Loss\),
\[ \begin{eqnarray*} \delta^L =& \nabla_{\varphi}\Loss\odot\sigma'(T^L)\\ \delta^l =& ((w^{l+1})^\intercal \delta^{l+1} ))\odot\sigma'(T^l)\\ \dfrac{\partial \Loss}{\partial b^l_j} =& \delta_j^l \\ \dfrac{\partial \Loss}{\partial w^l_{jk}} =& T^{l-1}_k\delta_j^l \\ \end{eqnarray*} \]
The tunable parameters can then be updated by: \(\Param\gets \Param -\eta \nabla_{\Param}\Loss\)
Loss functions are combined and regularized to balance the tradeoff between model complexity and data fit.
Stochastic Gradient Descent, Adam, and RMSProp are common optimization algorithms for training neural networks.
Algorithms for meta-learning are still in a nascent stage with significant computational overhead.
Figure by John Lindner
Neural networks are coordinate dependent. The choice of coordinates can significantly affect the performance of the network.
Adapted from Heliocentrism and Geocentrism by Malin Christersson
Figure from Publication
Figure from Publication
Figure from Publication
Figure from Publication
Estimate change in dimensionality of network activations
\[ Nr = \mathcal{R} = \frac{(\operatorname{tr}{\bf C})^2}{\operatorname{tr}{\bf C}^2} = \frac{\left(\sum_{n=1}^N\lambda_n \right)^2}{\sum_{n=1}^N \lambda_n^2} \]
where \(\lambda_n\) are the co-variance matrix \(\bf C\)’s eigenvalues for neuronal activity data matrix. The normalized participation ratio \(r = \mathcal{R} / N\)5.
Diverse activation functions use more of the network’s capacity.
Figure from Publication
Optimization of a neural network with shuffled data is a noisy descent.
This can be modeled with the Langevin equation:
\[ \DeclareMathOperator{\Tr}{Tr} d\theta_{t} = - \nabla \Loss(\theta_{t})\, dt + \sqrt{2\mathbf{D}} \cdot d\mathcal{W}_{t} \] with noise intensity \(\mathbf{D} = (\eta) \Loss(\theta) \mathbf{H}(\theta\star)\)6
For a control system with state \(x(t)\) and control \(u(t)\), \[ \dfrac{dx}{dt} = f(x(t),u(t))dt \\ \]
\[ \mathcal{H}(x,u,t_0,t_f) = Q(x(t_f), t_f) + \int_{t_0}^{t_f} \Loss(x(\tau),u(\tau))d\tau\\ V(x(t_0),t_0,t_f) = \min_{u(t)} \mathcal{H}(x(t_0),u(t),t_0,t_f)\\ -\dfrac{\partial V}{\partial t} = \min_{u(t)} \left[ \Loss(x(t),u(t)) + \dfrac{\partial V}{\partial x}^T f(x(t),u(t)) \right] \]
Control scheme where a model is used for predicting the future behavior of the system over finite time window7.
Animations from do-mpc documentation
Let \(X\) be a metric space. A continuous map \(f: X \rightarrow X\) is said to be chaotic on \(X\) if:
|
Relies on the ergodicity of chaotic systems9,10.
\[ l_n^2\ddot\theta_n =-\gamma\dot\theta_n\nonumber-l_n\sin{\theta_n}\nonumber +\tau_{0}+\tau_{1}\sin{\omega t}\nonumber +\kappa(\theta_{n-1}+\theta_{n+1}-2\theta_n) \]
\[\dot{\theta}_i = \omega_i + \lambda\sum_{j=1}^N \sin(\theta_j - \theta_i)\] |
Synchronizing fireflies video from Robin Meier
\[\dot{\theta}_i = \omega_i + \lambda\sum_{j=1}^N A_{ij}\sin(\theta_j - \theta_i-\alpha)\]
Let \(\Gamma\) act on \(\mathbb{R}^n\) and \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\) . Then \(f\) is said to be \(\Gamma\)-equivariant if \(f(\gamma x) = \gamma f(x)\) for all \(x \in \mathbb{R}^n\) and \(\gamma \in \Gamma\)16.
For dynamical systems, for a fixed point \(f(x\star)=0\), \(\gamma x\star\) is also a fixed point.
Isotropy subgroup: Let \(v\in \mathbb{R}^n: \Sigma_v = \{\gamma \in \Gamma: \gamma v = v\}\)
Fixed pt subspace: Let \(\sigma \in \Sigma \subseteq \Gamma\). \(\text{Fix}(\Sigma)=\{v \in \mathbb{R}^n:\sigma v=v\}\)
Thm: Let \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\) be a \(\Gamma\)-equivariant map and \(\Sigma \subseteq \Gamma\). Then \(f(\text{Fix}(\Sigma))\subseteq\text{Fix}(\Sigma)\).
Group equivariant NN work through lifting convolutions to the group space, and then projecting back to the original space17.
Figure adapted from Bart Smets et al.
● Publication and Patent on Metalearning Activation functions.
◕ Publishable article on Neural Network Control of Chaos.
◔ Thorough study of the symmetries of controllers.
◑ Codebase for the above studies.
◔ Dissertation detailing the discoveries and pitfalls found throughout.
Estimated time of completion: August 2025
Background image from Jacqueline Doan
Metalearning Activation functions:
MNIST1D: 1 hidden layer of 100 neurons, activation of 50 hidden neurons. 100 initializations averaged. RMSprop, Real Pendulum: same other than 50 initializations. Pytorch and Jax frameworks used.
Neural Network control of chaos:
Control and dynamics integration via diffrax Jax library for neural ODEs. TSit5 algorithm for ODE integration with PID controller with rtol 1e-7 and atol 1e-9. Stratanovich Milstein solver for SDE integration with PID controller with rtol 1e-7 and atol 1e-9. 1000 epochs Controller training with only 1/5 data for the first half. Implemented in Jax.
\[M(\sigma) = \text{span}\{\sigma(wx-\theta):w\in \mathbb{R}^n, \theta\in\mathbb{R}\}\]
Theorem 1: Let \(\sigma \in C(\mathbb{R})\). Then \(M(\sigma)\) is dense in \(C(\mathbb{R}^n)\), in the topology of uniform convergence on compacta, if and only if \(\sigma\) is not a polynomial18.
For Neural Networks,
Theorem 2: Let \(\sigma\) be any continuous sigmoidal function. Then finite sums of the form \(G(x) = \sum_{i=1}^N \alpha_i \sigma(w_i^T x + b_i)\) are dense in \(C(I_n)\). i.e. \(\forall f\in C(I_n)\) and \(\epsilon>0\), there is a sum \(G(x)\) such that \(\max_{x\in I_n}|f(x)-G(x)|<\epsilon\)19.
Informally, at least one neural network exists that can approximate any continuous function on \(I_n=[0,1]^n\) with arbitrary precision.
Gradient descent: \(x_{n+1} = x_n - \gamma\nabla f(x_n)\) where \(\gamma > 0\) is the step-size. Gradient flow is the limit where \(\gamma\rightarrow 0\).
There are two ways to treat SGD in this regime,
Consider fixed times \(t=n\gamma\) and \(s=m\gamma\).
Given the recursion \(x_{n+1} = x_n-\gamma\nabla f(x_n) - \gamma\epsilon_n\),
applying this \(m\) times, we get:
\[ \begin{align*} X(t+s) - X(t) &= x_{n+m}-x_n\\ &= -\gamma \sum_{k=0}^{m-1}\nabla f(X(t+\frac{sk}{m})) - \gamma\sum_{k=0}^{m-1}ε_{k+n} \end{align*} \] in the limit, \[ \begin{align*} X(t+s) - X(t) &= -\int_t^{t+s}\nabla f(X(u))du + 0 \end{align*} \] which is just the gradient flow equation.
Given the recursion \(x_{n+1} = x_n-\gamma\nabla f(x_n) - \sqrt{2\gamma}\epsilon_n\),
applying this \(m\) times, we get:
\[ \begin{align*} X(t+s) - X(t) &= x_{n+m}-x_n\\ &= -\gamma \sum_{k=0}^{m-1}\nabla f(X(t+\frac{sk}{m}))- \sqrt{2\gamma}\sum_{k=0}^{m-1}ε_{k+n} \end{align*} \] The second term has finite variance \(\propto 2s\). When \(m\) tends to infinity,
\[ X(t+s) - X(t) =\\ -\int_t^{t+s} \nabla f(X(u))du + \sqrt{2}[B(t+s)-B(t)] \]
This limiting distribution is the Gibbs distribution with density \(\exp{(-f(x))}\)
Argument from Francis Bach
Forward propagation: \[ x(t_1) = \text{ODESolve}(f(x(t),t,\theta), x(t_0), t_0, t_1)\\ \text{Compute loss: } \Loss(x(t_1))\\ a(t_1) = \frac{\partial \Loss}{\partial x(t_1)}\\ \]
Back propagation: \[ \begin{bmatrix} x(t_0) \\ \frac{\partial \Loss}{\partial x(t_0)}\\ \frac{\partial \Loss}{\partial \theta} \end{bmatrix} = \text{ODESolve}\left(\begin{bmatrix} f(x(t),t,\theta) \\ -a(t)^T \frac{\partial f(x(t),t,\theta)}{\partial x} \\ -a(t)^T \frac{\partial f(x(t),t,\theta)}{\partial \theta} \end{bmatrix} , \begin{bmatrix} x(t_1) \\ \frac{\partial \Loss}{\partial x(t_1)}\\ 0_{|\theta|} \end{bmatrix} , t_1, t_0 \right) \]
Bellman optimality equation: \[ V(x(t_0),t_0,t_f) = V(x(t_0),t_0,t) + V(x(t),t,t_f) \] \[ \begin{align*} \dfrac{dV(x(t),t,t_f)}{dt} &= \dfrac{\partial V}{\partial t} + \dfrac{\partial V}{\partial x}^T \frac{dx}{dt}\\ &= \min_{u(t)} \dfrac{d}{dt} \left[ \int_0^{t_f}\Loss(x(\tau),u(\tau))d\tau + Q(x(t_f), t_f) \right]\\ &= \min_{u(t)} \left[ \dfrac{d}{dt}\int_0^{t_f}\Loss(x(\tau),u(\tau))d\tau \right] \\ \end{align*} \]
\[ \implies \boxed{-\dfrac{\partial V}{\partial t} = \min_{u(t)} \left[ \Loss(x(t),u(t)) + \dfrac{\partial V}{\partial x}^T f(x(t),u(t)) \right]} \]
Let \((x\star,u\star)\) be an optimal trajectory-control pair. Then there exists a \(\lambda\star\) such that, \[ \lambda\star =-\dfrac{\partial \mathcal{H}}{\partial x} \] and \[ \mathcal{H}(x\star,u\star,\lambda\star, t) = \min_{u} \mathcal{H}(x\star,u,\lambda\star, t) \] and by definition, \[ x\star = \dfrac{\partial \mathcal{H}}{\partial \lambda} \]
We can add intrinsic noise to it by adding a noise term \(\xi\) with a strength \(\sigma\): \[ \frac{dx}{dt}=f(x,t)+ σ(x,t)ξ \]
Then the rectangular reimann construction is equivalent to: \(f(t)≈\sum_{i=1}^n f(\hat{x_i})χ_{Δx_i}\)
where \(f(\hat{x_i})\) are constants. This approximation is exact in the limit of \(n\to\infty\).
For the stochastic case, the rectangular construction is equivalent to: \(\phi(t)≈\sum_{i=1}^n \hat{e}_iχ_{Δx_i}\) Then the integral is: \[ \int_0^T \phi(t) dW_t=\sum_{i=1}^n \hat{e}_i ΔW_i \]
Here, the coefficients \(\hat{e}_i\) are not constants but random variables since we are allowed to integrate \(σ\) which depends on \(X_t\).
Lyapunov exponent and Recurrence plot from Recurrence Plot Pitfalls
All images from associated wikipedia pages (Autoencoder, GRU, CNN)