Neural Ordinary Differential Equations, as presented in this paper by Chen, Rubanova, Bettencourt, and Duvenaud, may go down in history as a genuine breakthrough in the science of machine learning. The Ontario-based research group’s new approach to constructing a neural network can dramatically simplify the inner workings of what might otherwise become easily convoluted networks containing interconnections of, potentially, millions of computational nodes. As a replacement for these countless nodes, the group has found a novel way to map inputs to outputs through the use of a single ordinary differential equation ("ODE"). This approach is certainly state-of-the-art but, in all fairness, their breakthrough is not wholly new - ODEs have already played an important role in applied mathematics for well over one hundred years.
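For readers who prefer the idea in symbols, the paper's starting point can be summarized as follows. A residual network updates a hidden state h in discrete steps; taking the limit of ever smaller steps gives an ODE, and the solution at some end time T serves as the network's output. Here f is the learned function and θ its parameters, following the paper's notation:

```latex
\begin{aligned}
\text{ResNet layer:}\quad & h_{t+1} = h_t + f(h_t, \theta_t) \\
\text{Continuous limit:}\quad & \frac{dh(t)}{dt} = f(h(t), t, \theta) \\
\text{Network output:}\quad & h(T) = h(0) + \int_0^T f(h(t), t, \theta)\, dt
\end{aligned}
```

The integral in the last line is what gets handed to the ODE solver, which is why the choice of solver matters so much in the rest of this review.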
Machine learning can be thought of as another form of applied mathematics, and this idea alone should highlight the importance of the availability of ODEs in our scientists' toolkits. Whereas the fields of chemistry, physics, and finance, at least, have benefitted tremendously from these maths, our own field of computational science has mostly done without. But our community is not the only one to reap a potential benefit from this team's breakthrough. Just imagine anyone using an old tool, or set of tools, in a new, more powerful manner. In a way, it seems our needles have become sewing machines. Of course, this comparison conceals some caveats.
Duvenaud et al. quickly establish that their ODE solver acts as a black box. This is to say that the intermediate computations of the input-output system cannot be known without making notable changes to the system's operation. Importantly, however, there is no proprietary technology being used here. In fact, they announce their reliance on the open-source Python library scipy.integrate. And the authors' official repository even offers a variety of tailored ODE solvers (including Euler's Method) that can be read about in full detail, provided those readers are well versed in abstract methods and classes.
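To give a feel for what such solvers do, here is a minimal sketch of our own (not the authors' code) that compares a hand-written, fixed-step Euler's Method against the adaptive `solve_ivp` routine from scipy.integrate. The toy dynamics dh/dt = -h, the step count, and the tolerances are assumptions chosen only because the exact answer is known.

```python
import numpy as np
from scipy.integrate import solve_ivp


def dynamics(t, h):
    """Toy dynamics dh/dt = -h; the exact solution is h(t) = h(0) * exp(-t)."""
    return -h


def euler_solve(f, h0, t0, t1, num_steps):
    """Fixed-step Euler's Method: repeatedly apply h <- h + dt * f(t, h)."""
    h = np.asarray(h0, dtype=float)
    dt = (t1 - t0) / num_steps
    t = t0
    for _ in range(num_steps):
        h = h + dt * f(t, h)
        t += dt
    return h


h0 = np.array([1.0])
exact = h0 * np.exp(-1.0)

euler_result = euler_solve(dynamics, h0, 0.0, 1.0, num_steps=100)

# Adaptive solver from scipy.integrate, used here as a black box.
adaptive = solve_ivp(dynamics, (0.0, 1.0), h0, rtol=1e-8, atol=1e-10)
adaptive_result = adaptive.y[:, -1]  # state at the final time

euler_error = abs(euler_result[0] - exact[0])
adaptive_error = abs(adaptive_result[0] - exact[0])
print(f"Euler error:    {euler_error:.2e}")
print(f"Adaptive error: {adaptive_error:.2e}")
```

The adaptive solver chooses its own step sizes to meet the requested tolerance, which is the behaviour the "black box" description refers to: the caller supplies the dynamics and the endpoints, and the solver decides how much work to do.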
This black-box solver is not inherently any less capable than its more complicated cousins - even though it may appear greatly simplified. To demonstrate this, the authors employed their Ordinary Differential Equation-based Neural Network ("ODE-Net") alongside a widely respected Residual Neural Network ("ResNet") to predict labels on the same dataset in a supervised learning task. Their ODE-Net was able to reach a similar solution with significantly less memory usage and fewer parameters overall. These results were neatly presented in Table 1 of their paper (which is reprinted below with our added emphasis):
The authors later acknowledge that there have been recent advances in ResNets in light of which the results above may appear biased. They state that the same memory advantage can be had for ResNets, but that "these methods require restricted architectures, which partition the hidden units" (Duvenaud et al. 9), whereas their own approach is not restrictive in this way.
Another notable comparison was made between the researchers' ODE-Net and a Recurrent Neural Network ("RNN"). Here they show the predictions and extrapolations, built from each type of network, for two of the many spirals that they randomly generated for the experiment. Figure 8 below, also from the paper, shows that, given a random sampling of the dataset over time, the ODE-Net dramatically outperforms the RNN.
These two forms of neural networks, the RNN and the ResNet, are not the only computational tools worthy of comparison with the authors' ODE-Net. It almost goes without saying that traditional feed-forward or multi-layered perceptron ("MLP") networks, especially those with tremendous network depths, may find themselves less desirable than ever (this is also evidenced in Table 1 above). And there is the understated importance of modeling irregularly sampled data, a task that many of our machine learning tools today cannot manage without a good deal of effort. Finally, this paper shines light on the benefits of black-box ODE solvers as substitutes for the more complex computations involved in normalizing flows - but that is a topic for another day.
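The point about irregularly sampled data is easier to appreciate with a small sketch. Because an ODE defines the state at every instant, a solver can report it at any set of times, evenly spaced or not. The example below is our own illustration using scipy.integrate's `solve_ivp`; the linear spiral dynamics, the time range, and the number of samples are assumptions and are not taken from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp


def spiral_dynamics(t, state):
    """Linear 2-D dynamics that trace an inward spiral (illustrative only)."""
    matrix = np.array([[-0.1, 2.0],
                       [-2.0, -0.1]])
    return matrix @ state


rng = np.random.default_rng(seed=0)

# Irregular observation times: 20 sorted random draws from [0, 5].
observation_times = np.sort(rng.uniform(0.0, 5.0, size=20))

solution = solve_ivp(
    spiral_dynamics,
    t_span=(0.0, 5.0),
    y0=np.array([2.0, 0.0]),
    t_eval=observation_times,  # report the state at exactly these times
    rtol=1e-6,
    atol=1e-8,
)

states = solution.y.T  # shape (20, 2): one (x, y) pair per observation time
gaps = np.diff(observation_times)
print(f"smallest gap: {gaps.min():.3f}, largest gap: {gaps.max():.3f}")
```

A recurrent network, by contrast, advances one fixed step per input, so unevenly spaced observations typically have to be binned or padded before they can be used.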
In perhaps an unusual way, the authors of this particular paper regarding Neural ODEs did not explicitly announce the reproducibility of their specific results; instead, they rather broadly introduced their system as a viable, albeit experimental, option in comparison to a number of current methods. This is to say that we were not provided any specific datasets with which we could compare the error rates of the ODE-Net against those of the ResNet or the RK-Net (see Table 1), nor could we compare the normalizing flows ("NFs") to the continuous normalizing flows ("CNFs") (see Section 4). Frankly, there still seems to be a steep learning curve ahead of us before we can collect a viable dataset, formulate a sensible optimization problem, and perform a noteworthy analysis on such a dataset with both an ODE-Net and some other viable neural network. We would like to say that these tasks would not seem so insurmountable had we decided earlier on concrete, intermediate goals and perhaps been given more time to accomplish them.
Of those provided experiments which we were able to reproduce, we did find some impressive and some curious results. First, the impressive:
In each solver, the standard adaptive ODE solver and the adjoint solver, we can see the effects of compounding epochs; namely that, "the number of function evaluations increases throughout training, presumably adapting to increasing complexity of the model." This is plainly impressive in that the model is able to learn well enough to increase its accuracy so dramatically, more than ten-fold, by only its second epoch. This adaptation to complexity is nicely charted in the paper - although they seem to record a less extreme delta between the first epochs (Figure 3, reprinted but reduced for clarity):
As shown in Figure 2, we reached an incredible level of accuracy after only thirty epochs. That the model is able to achieve such results in so few passes is remarkably impressive. The authors demonstrate, similarly, during their comparisons of CNFs to NFs, that they can achieve results of equivalent accuracy in the CNF with only one-fiftieth the number of iterations (Duvenaud et al. 5). Again, the complexity involved in reproducing these results appeared daunting in relation to the amount of time that we had to analyze this paper in its entirety. For those interested, there is a separate code repository created by the authors solely dedicated to "tools for training, evaluating, and visualizing CNF for reversible generative modeling."
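The number of function evaluations discussed alongside Figure 3 is a quantity that adaptive solvers report directly. As a rough sketch of why that count grows with the difficulty of the dynamics, the example below uses scipy.integrate's `solve_ivp` (not the authors' solver) on a made-up oscillator whose frequency stands in for "model complexity"; the frequencies and tolerances are assumptions chosen for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp


def make_oscillator(frequency):
    """Return dynamics for a 2-D rotation; higher frequency is harder to integrate."""
    def dynamics(t, state):
        x, y = state
        return [frequency * y, -frequency * x]
    return dynamics


evaluation_counts = {}
for frequency in (1.0, 5.0, 25.0):
    result = solve_ivp(
        make_oscillator(frequency),
        (0.0, 5.0),
        [1.0, 0.0],
        rtol=1e-6,
        atol=1e-8,
    )
    # nfev is the number of times the solver called the dynamics function.
    evaluation_counts[frequency] = result.nfev

for frequency, count in evaluation_counts.items():
    print(f"frequency {frequency:>5}: {count} function evaluations")
```

The tolerance stays fixed while the dynamics become faster, so the solver has to take more, smaller steps. In a Neural ODE the dynamics are learned, so the same effect can appear as training progresses.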
Now for the curious results. We recreated one of the randomly generated spiral graphs (using irregularly sampled data points), but we wanted to catch the model still-at-work. So, we chose to interrupt its sampling and approximation at various iterations in an attempt to see how well it was doing. The next three figures document our findings.
In a way, these results are also impressive. Though it seems that there is some pretty frightful guessing going on in the sample at iteration 200, we thought that the sample at iteration 500 accurately anticipated the spiral's (future) trajectory. This is especially notable given that this model is predicting correctly well into the future, in other words - almost twice its own current lifetime. But this impressive feat is quickly negated by the wildly inaccurate estimation of events in the random spiral at iteration 1000.
Our interpretation of this inaccuracy is this: it seems that the dense inner coil of the spiral, along with the compact, inner placement of the sampled data-points, simply did not provide a diverse enough sampling for our ODE-Net to adequately estimate from. This tightly sampled region is contrasted by the well-spaced inner lines of the spirals generated in our first two graphs. Our best guess, then, is to say that a greater number of sampled data-points should improve the quality of this third graph's prediction. This is to say that the true curvature of the spiral, be it further inward or outward, could be better represented if only we had more data-points from which to draw our conclusions. And, at risk of burning up our personal computers, we chose to double the sampling in our randomly generated spiral and catch it after another one thousand iterations. Here’s what we saw:
Qualitatively, given that we don't have any concrete numbers to back up these graphs, we wish to say that this graph of two hundred data-points and one thousand iterations is only marginally more accurate than the similar spiral that was constructed of only one hundred data-points and five hundred iterations. This immediately tells us two things that should have been obvious before. One - that the number of iterations is representative of "distance traveled" and not some distance advanced beyond our sampled data-points. And two - that the accuracy of the fit is not necessarily influenced by the number of data-points. This led us to our final experiment, which was an attempt to create a learned trajectory of 2000 iterations on a spiral with only one hundred sampled data-points. While we expected our result to be a bit more accurate than the similar representation at only 500 iterations, we instead found this:
Which, surprisingly, appears less accurate than the sample taken with identical settings but at just one quarter the iterations. And this result also greatly contrasts the first one recorded at one thousand iterations - the result whose spiral and data samples appeared too dense to lead to an accurate calculation. It's difficult for us to say why these learned trajectories seem to have such low accuracy in comparison to those presented in the paper. Even without modifying these parameters, we would expect much greater fits. Perhaps we should not have expected to draw such meaningful conclusions from such a small number of iterations. This is to say that we may have been misled by the accuracy of the ODE solvers above (in Figures 1 & 2). Those results likely led us to hold unreasonable expectations of high accuracy - in a short time - when estimating these trajectories. It is most likely, then, that we did not have the computational resources (or, at least the computational time) necessary to reproduce the accurate fits shown in the original paper.
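Since our comparisons of these fits were purely visual, one way to put numbers behind them would be a simple distance metric between the true and learned trajectories. The sketch below is something we did not run as part of our experiments; the function name `trajectory_rmse` and the stand-in trajectories are hypothetical and only illustrate the calculation.

```python
import numpy as np


def trajectory_rmse(true_points, predicted_points):
    """Root-mean-square Euclidean distance between matching trajectory points."""
    true_points = np.asarray(true_points, dtype=float)
    predicted_points = np.asarray(predicted_points, dtype=float)
    if true_points.shape != predicted_points.shape:
        raise ValueError("trajectories must have the same shape")
    squared_distances = np.sum((true_points - predicted_points) ** 2, axis=1)
    return float(np.sqrt(np.mean(squared_distances)))


# Hypothetical stand-ins: a true spiral and two imperfect "learned" fits.
angles = np.linspace(0.0, 4.0 * np.pi, 200)
radii = 0.5 + 0.1 * angles
true_spiral = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

close_fit = true_spiral + 0.05  # small constant offset in x and y
loose_fit = true_spiral * 1.3   # radius overestimated by 30%

print(f"close fit RMSE: {trajectory_rmse(true_spiral, close_fit):.3f}")
print(f"loose fit RMSE: {trajectory_rmse(true_spiral, loose_fit):.3f}")
```

Evaluating a metric like this at each checkpoint (200, 500, 1000, and 2000 iterations) would show whether the fit is genuinely getting worse or only looks that way in a particular plot.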
In light of our own nuanced recordings, we continue to take the authors at their word. Their black-box Ordinary Differential Equations solver should soon become an unquestionable asset in the toolset of modern scientists, given that they show it to achieve approximately equivalent results with dramatically fewer computational resources [as compared to the popular solutions of today]. We have surely provided an oversimplified analysis of this team's complex creation. And it is quite unlikely that we could come to find either fault or room for improvement in the workings of this system that required the combined efforts of many researchers over many months. Without hesitation, we wish to show appreciation to the authors of Neural Ordinary Differential Equations for their assembly of this novel instrument. Stunning capabilities such as these will hopefully foster a new generation of machine learning applications.
We have to give special thanks to the senior author, David Duvenaud, for his contributions to the machine learning community overall and specifically to the Hacker News community; his answers to the many thoughtful and insightful questions on this HN post from January 2019 helped us better understand the nuances of the topic at hand. We also extend our appreciation to all of the original paper’s authors, contributors, and references, without whom this could not have been possible.
We’d also like to give a great thank you to the unparalleled writers at the MIT Technology Review for their short & sweet review of this very same paper.
Of course, there remains a great list of references that we cannot begin to call out individually, especially all of those countless StackOverflow contributors and Wikipedia authors who brought us up to speed on everything from RNNs to Euler’s Method. Here is a series of blog posts that helped us realize how much we both did and did not know about this topic on first approach: