<aside>
💡 Use cmd/ctrl + shift + L to switch to Dark Mode
</aside>
OOD (out-of-distribution) generalization has been an unsolved problem in deep learning. While self-attention-based variants have taken great strides in recovering the underlying structured representations in data, they lack two key components needed to achieve arbitrary generalization: adaptive computation and strong priors.
This project presents preliminary results from an attempt to generalize ideas from prior literature to attention-based variants. We present a modified (recurrent) architecture that is both effective and parallelizable, displays interesting properties, and can extrapolate OOD on toy tasks. We believe that incorporating strong inductive biases and priors into traditional DL models may go a long way towards eventually being able to extrapolate arbitrarily.
The biggest challenge this line of approach faces is stability. As the dynamics of the task become harder, the approximation learnt by the NN has to become more stable; otherwise it risks compounding errors over more and more iterations. This report highlights some of the ideas used in prior literature to combat this, and integrates some novel ones, in an attempt to make these models maximally stable, even if we fall short of completely solving the problem.
<aside>
💡 Note: further work needs to be done to better understand the viability and scalability of this direction.
Due to financial and time constraints (I'm broke), we couldn't do thorough ablations, so take all hypotheses with a grain of salt. That said, a lot of my opinions are guided by experiments that didn't make it here, so feel free to contact me. Details are given at the end of this document.
</aside>
This section is a rapid-fire primer on the work done by Bansal et al. and Kaiser et al. on embedding adaptive-computation properties in recurrent-style models.
In their paper, Bansal and Schwarzschild demonstrated how their architecture, dubbed the "Deep Thinking Network" (DTNet), was able to OOD generalize, solving 201x201 mazes despite being trained only on 9x9 ones. The network learnt the dead-end filling algorithm:

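For readers unfamiliar with it, dead-end filling is the classical maze-solving procedure that repeatedly walls off open cells with at most one open neighbour until only the solution path remains. Below is a minimal NumPy sketch for illustration (our own, not the paper's code; the function and argument names are ours):

```python
import numpy as np

def dead_end_fill(maze: np.ndarray, start: tuple, end: tuple) -> np.ndarray:
    """Classical dead-end filling on a grid maze.

    maze: 2D array where 1 = open cell, 0 = wall.
    Repeatedly fills in open cells with <= 1 open neighbour
    (dead ends), sparing start/end, until nothing changes.
    Returns a copy with only the solution path(s) left open.
    """
    maze = maze.copy()
    changed = True
    while changed:
        changed = False
        for (i, j), open_ in np.ndenumerate(maze):
            if not open_ or (i, j) in (start, end):
                continue
            # Count open orthogonal neighbours, respecting grid bounds.
            nbrs = sum(
                maze[x, y]
                for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                if 0 <= x < maze.shape[0] and 0 <= y < maze.shape[1]
            )
            if nbrs <= 1:  # dead end: wall it off
                maze[i, j] = 0
                changed = True
    return maze
```

Each sweep only needs local information, which is exactly the kind of computation a convolutional block applied iteratively can express.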
This was achieved by applying the same network multiple times, iteratively (sketched in code below). Such a formulation is similar to DEQs (Deep Equilibrium Models), but even simpler and more scalable. The recursion is performed for a bounded number of iterations, `max_iters: int`. The model is expected to extrapolate both iteration-wise and length-wise if it is to OOD generalize at all.

Fig 1. DTNet employs the same block recursively, as well as a long-skip connection called recall
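Concretely, the weight-tied recurrence looks something like the following minimal PyTorch sketch. Layer shapes, widths, and the two-channel per-pixel output head are illustrative assumptions, not the paper's exact code; the key points are that one shared block is applied `iters` times and that the raw input is concatenated back in at every step (the recall connection):

```python
import torch
import torch.nn as nn

class DTNetSketch(nn.Module):
    """Sketch of a DTNet-style weight-tied recurrent model.

    Names and sizes are illustrative, not the original implementation.
    """
    def __init__(self, in_ch: int = 3, width: int = 128):
        super().__init__()
        self.project = nn.Conv2d(in_ch, width, 3, padding=1)
        # The recurrent block sees the hidden state plus the raw input
        # (`recall`), concatenated along the channel dimension.
        self.block = nn.Sequential(
            nn.Conv2d(width + in_ch, width, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Conv2d(width, 2, 3, padding=1)  # per-pixel logits

    def forward(self, x: torch.Tensor, iters: int) -> torch.Tensor:
        h = self.project(x)
        for _ in range(iters):  # the *same* block, applied iteratively
            h = self.block(torch.cat([h, x], dim=1))
        return self.head(h)

# At test time, `iters` can exceed the training budget (max_iters),
# which is how iteration-wise extrapolation is probed, e.g.:
#   logits = DTNetSketch()(maze_batch, iters=300)
```

Because the block is shared across iterations, the parameter count is fixed while the effective depth (and hence compute) can grow with problem difficulty.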