Parameter Inference in Dynamical Systems

6 minute read

Published:

Effect of topology of neutral diversity

Calibrating a system of differential equations on data for the project "What can Economics learn from Biology?".

One of the challenges modellers face in biological sciences is to calibrate models in order to match as closely as possible observations and gain predictive power. This can be done via direct measurements through experimental design, but this process is often costly, time consuming and even sometimes not possible. Scientific machine learning addresses this problem by applying optimisation techniques originally developed within the field of artificial intelligence to mechanistic models, allowing to infer parameters directly from observation data. In this blog post, I shall explain the basics of this approach, and how the Julia ecosystem has efficiently embedded such techniques into ready to use packages. This promises exciting perspectives for modellers in all areas of environmental sciences.

๐Ÿšง This is Work in progress ๐Ÿšง

Dynamical systems are models that allow to reproduce, understand and forecast systems. They connect the time variation of the state of the system to the fundamental processes we believe driving it, that is

\[\begin{equation*} \text{ time variation of } ๐ŸŒ_t = \sum \text{processes acting on } ๐ŸŒ_t \end{equation*}\]

where $๐ŸŒ_t$ denotes the state of the system at time $t$. This translates mathematically into

\[\begin{equation} \partial_t(๐ŸŒ_t) = f_\theta( ๐ŸŒ_t ) \end{equation}\]

where the function $f_\theta$ captures the ensembles of the processes considered, and depend on the parameters $\theta$.

Eq. (1) is a Differential Equation, that can be integrated with respect to time to obtain the state of the system at time $t$ given an initial state $๐ŸŒ_{t_0}$.

\[\begin{equation} ๐ŸŒ_t = ๐ŸŒ_{t_0} + \int_0^t f_\theta( ๐ŸŒ_s ) ds \end{equation}\]

Dynamical systems have been used for hundreds of years and have successfully captured e.g. the motion of planets (second law of Kepler), the voltage in an electrical circuit, population dynamics (Lotka Volterra equations) and morphogenesis (Turing patterns)โ€ฆ

Such models can be used to forecast the state of the system in the future, or can be used in the sense of virtual laboratories. In both cases, one of the requirement is that they reproduce patterns - at least at a qualitative level. To do so, the modeler needs to find the true parameter combination $\theta$ that correspond to the system under consideration. And this is tricky! In this post we adress this challenge.

Model calibration

How to determine $\theta$ so that $\text{simulations} \approx \text{empirical data}$?

The best way to do that is to design an experiment!

When possible, measuring directly the parameters in a controlled experiment with e.g. physical devices is a great approach. This is a very powerful scientific method, used e.g. in global circulation models where scientists can measure the water viscosity, the change in water density with respect to temperature, etcโ€ฆ Unfortunately, such direct methods are often not possible considering other systems!

An opposite approach is to infer the parameters undirectly with the empirical data available.

Parameter exploration

One way to find right parameters is to perform parameter exploration, that is, slicing the parameter space and running the model for all parameter combinations chosen. Comparing the simulation results to the empirical data available, one can elect the combination with the higher explanatory power.

But as the parameter space becomes larger (higher number of parameters) this becomes tricky. Such problem is often refered to as the curse of dimensionality

Feels very much like being lost in a giant maze. We need more clever technique to get out!

A Machine Learning problem

In artificial intelligence, people try to predict a variable $y$ from predictors $x$ by finding suitable parameters $\theta$ of a parametric function $F_\theta$ so that

\[\begin{equation} y = F_\theta(x) \end{equation}\]

For example, in computer vision, this function might be designed for the specific task of labelling images, such as for instance

$F_\theta ($ $) \to \{\text{cat}, \text{dog}\}$

Usually people use neural networks, as they are good approximators for high dimensional function (see the Universal approximation theorem).

\[\begin{equation*} F_\theta \equiv NN_\theta \end{equation*}\]

One should really see neural networks as functions ! For example, feed forward neural networks are mathematically described by matrix multiplications

\[\begin{equation*} NN_\theta (x) = A x + b \end{equation*}\]

(without activation function)

Notice that Eq. (2) is similar to Eq. (3)! Indeed one can think of $๐ŸŒ_0$ as the analogous to $x$ - i.e. the predictor - and $๐ŸŒ_t$ as the variable $y$ to predict.

\[\begin{equation*} ๐ŸŒ_t = F_\theta(๐ŸŒ_{t_0}) \end{equation*}\]

where \(F_\theta (๐ŸŒ_{t_0}) \equiv ๐ŸŒ_{t_0} + \int_0^t f_\theta( ๐ŸŒ_s ) ds .\)

With this perspective in mind, techniques developed within the field of Machine Learning - to find suitable parameters $\theta$ that best predict $y$ - become readily available to reach our specific needs: model calibration!

Parameter inference

The general strategy to find a suitable neural network that can perform the tasks required is to โ€œtrainโ€ it, that is, to find the parameters $\theta$ so that its predictions are accurate.

In order to train it, one โ€œscoresโ€ how good a combination of parameter $\theta$ performs. A way to do so is to introduce a โ€œLoss functionโ€

\[\begin{equation*} L(\theta) = (F_\theta(x) - y_{\text{empirical}})^2 \end{equation*}\]

One can then use an optimisation method to find a local minima (and in the best scenario, the global minima) for $L$.

Gradient descent

You ready?

Gradient descent and stochastic gradient descent are โ€œiterative optimisation methods that seek to find a local minimum of a differentiable functionโ€ (Wikipedia). Such methods have become widely used with the development of artifical intelligence.

Those methods are used to compute iteratively $\theta$ using the sensitivity of the loss function to changes in $\theta$, denoted by $\partial_\theta L(\theta)$

\[\begin{equation*} \theta^{i+1} = \theta^{(i)} - \lambda \partial_\theta L(\theta) \end{equation*}\]

where $\lambda$ is called the learning rate.

In practice

The sensitivity with respect to the parameters $\partial_\theta L(\theta)$ is in practice obtained by differentiating the code (Automatic Differentiation).

For some programming languages this can be done automatically, with low computational cost. In particular, Flux.jl allows to efficiently obtain the gradient of any function written in the wonderful language .

The library DiffEqFlux.jl based on Flux.jl implements differentiation rules (custom adjoints) to obtain even more efficiently the sensitivity of a loss function that depends on the numerical solution of a differential equation. That is, DiffEqFlux.jl precisely allows to do parameter inference in dynamical systems. Go and check it out!