Mathematical Aspects of Algorithmic Learning and Deep Neural Networks
Dates
October and November 2021 (six weeks total), Tuesdays and Thursdays. Each session will start at 16:00 CEST and end 18:30 CEST, with a 30 min break.
*** Oct 21 / Nov 16: 14:00 – 16:30
Location
Online via ZOOM.
Abstract
The main purpose of the course is to describe the nature of algorithmic learning (AL), of its most relevant modalities and most successful applications, together with a presentation of the main mathematical ingredients that provide the basis for both the definition and study of models and for the analysis of the algorithms. Along the way, open questions and problems will be highlighted.
The course is meant to enable students at the level of graduate or late undergraduate, with knowledge of basic mathematical topics (differential and integral calculus in several variables, probability theory, algebra and geometry), but with little or no knowledge of AL, to become reasonably at ease with its current trends, achievements, open problems, and publications.
Presentation
There will be twelve 2-hour sessions planned for the first three weeks of October and November 2021, with one session on Tuesday and one on Thursday for each of the six weeks. Each session will start at 16:00 CEST and end 18:30 CEST, with a 30 min break after the first hour.
The following table (Summary) provides a short description of each session, the day in which it will be delivered, and the name of the speaker.
Schedule
Session | Date | Topics | Speaker |
1 |
Tue, Oct 5 |
16:00: Course presentation by Lluís Alsedà, CRM Director, and Carme Cascante, BGSMath Director. Informal outline and some background topics |
SX |
2 |
Thu, Oct 7 |
The curse of dimensionality. NNs and approximation properties |
JB |
3 |
Wed, Oct 13 |
Reproducing kernel Hilbert spaces |
SX |
4 |
Thu, Oct 14 |
Gradient descent and stochastic approximation |
SX |
5 |
Tue, Oct 19 |
Training dynamics: lazy regime and Neural Tangent Kernel |
SX |
6 |
Thu, Oct 21 |
Training dynamics: active regime and mean-field description |
JB |
7 |
Tue, Nov 2 |
Group theory and differential geometry basics. Noether’s theorem |
SX |
8 |
Thu, Nov 4 |
Beyond Barron spaces: geometric stability | JB |
9 |
Tue, Nov 9 |
Harmonic Analysis: Fourier, Wavelets, Graph spectral transforms |
SX |
10 |
Thu, Nov 11 |
The Scattering Transform | JB |
11 |
Tue, Nov 16 |
Beyond Euclidean Domains: the 5G ***14:00h |
JB |
12 |
Thu, Nov 18 |
Open Problems and closing remarks | JB |
Syllabus and Basic References for Each Session
Informal outline. General references. Optimization techniques, with emphasis on the convex case. Examples. A model for inductive learning. Remarks on computational resources.
References: [1], [2], [3], [4], [5].
2 Neural networks and their approximation properties. The curse of dimensionality
The curse of dimensionality in statistical learning. Lipschitz and Sobolev hypothesis classes. From low-dimensional to high-dimensional function approximation. Polynomial approximation theorems. Universal approximation theorems. Shallow neural networks.
References: [1], [6].
3 Reproducing Kernel Hilbert Spaces
Kernels. Positive definite kernels. The kernel trick. Properties of kernels. The reproducing Hilbert space associated to a kernel. Examples of kernels. The representer theorem. Learning with kernels.
References: [7], [8], [9], [10], [11], [12], [13], [14], [15], [16].
4 Gradient descent and stochastic approximation
Optimization by gradient descent. GD with momentum. Stochastic versus batch optimization methods. Stochastic gradient approximation. Stochastic variance reduced gradient. Algorithms.
References: [17], [2], [4].
5 Training dynamics: Lazy regime and Neural TangentKernel (NTK)
Kernel gradient. Neural tangent kernel. Lazy training. Convergence of the SGD.
References: [18], [19], [20].
6 Training dynamics: active regime and mean-field description
Systems of interaction particles and thermodynamic limits. Overparameterised limits of shallow neural networks: The Barron Space. Wasserstein gradient flows and measure transport dynamics. Global convergence properties. Open questions.
References: [21, 22, 23].
7 Group theory and differential geometry basics. Noether’s theorem.
Differential manifolds. Lie groups and Lie algebras. Lagrangian and Hamiltonian mechanics. Symmetries in physical systems and conserved quantities. Noether’s theorem.
References: [24], [25], [26], [27], [28], [29], [30], [31], [32].
8 Beyond Barron spaces: geometric stability
Curse of dimensionality for Barron Spaces. Geometric Machine Learning, geometric domains and geometric priors. Examples: Grids, Graphs, Gauges, Groups, Geodesics. Invariance, Equivariance and Scale separation.
References: [31].
9 Harmonic Analysis: Fourier, Wavelets, Graph spectral transforms.
Summary of Fourier analysis. Gabor basis. The wavelet transform. Multiresolution analysis. The fast wavelet transform. Spectral techniques on graphs.
References: [33], [34], [35], [36], [31].
10 The Scattering Transform
Instability of Fourier Invariants. Stability of Wavelet equivariants. Putting everything together: wavelet scattering transform. Main mathematical properties: energy conservation and deformation stability. Examples. Open problems.
References: [37, 38].
11 Beyond Euclidean Domains
The Geometric Deep Learning Blueprint. Application to geometric domains: grids, groups, graphs, gauges, geodesics. Graph Neural Networks, Convolutional Neural Networks. Open problems.
References: [31].
12 Closing remarks
In this last lecture, we will wrap the course by tying together the open problems seen throughout the lectures, and outline key open research directions. Role of Depth. Computational Lower bounds.
References
[2] S. Bubeck, “Convex optimization: Algorithms and complexity,” Foundations and Trends® in Machine Learning, vol. 8, no. 3-4, pp. 231–358, 2015. arXiv:1405.4980.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016.
[4] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” Siam Review, vol. 60, no. 2, pp. 223–311, 2018. https://arxiv.org/pdf/1606.04838. pdf.
[5] J. Bruna and S. Xambó-Descamps, “Aprenentatge algorísmic i xarxes neuronals profundes,” BUTLLETÍ DE LA SCM, vol. 36, no. 1, pp. 5–67.
[6] M. Telgarsky, Deep Learning Theory. https://mjt.cs.illinois.edu/dlt/, 2020.
[7] T. Hofmann, B. Schölkopf, and A. J. Smola, “A review of kernel methods in machine learning,” Mac-Planck-Institute Technical Report, vol. 156, 2006.
[8] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, pp. 1171–1220, 2008.
[9] S. Haykin, Neural networks and learning machines. Pearson, 2009.
[10] S. Marsland, Machine learning: an algorithmic perspective (second edition). Machine Learning and Pattern Recognition, Chapman and Hall/CRC, 2015.
[11] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine learning and the biasvariance trade-off,” 2018. https://arxiv.org/pdf/1812.11118.pdf
[12] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of machine learning. MIT press, 2018.
[13] M. J.Wainwright, High-dimensional statistics: A non-asymptotic viewpoint, vol. 48 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
[14] M. M. Wolf, “Mathematical foundations of supervised learning,” 2020. “growing lecture notes”: https://www-m5.ma.tum.de/foswiki/pub/M5/Allgemeines/MA4801_2020S/ML_notes_main.pdf
[15] F. Bach, “Learning theory from first principles, draft,” 2021. https://www.di.ens.fr/~fbach/ltfp_book.pdf
[16] B. Ghojogh, A. Ghodsi, F. Karray, and M. Crowley, “Reproducing Kernel Hilbert Space, Mercer’s Theorem, Eigenfunctions, Nyström Method, and Use of Kernels in Machine Learning: Tutorial and Survey,” 2021. arXiv:2106.08443.
[17] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951.
[18] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” Advances in Neural Information Processing Systems, pp. 8571–8580, 2018. arXiv:1806.07572.
[19] L. Chizat, E. Oyallon, and F. Bach, “On lazy training in differentiable programming,” 2020. arXiv:1812.07956,v5.
[20] J. Berner, P. Grohs, G. Kutyniok, and P. Petersen, “The modern mathematics of deep learning,” 2021. arXiv:2105.04026.
[21] G. Rotskoff and E. Vanden-Eijnden, “Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks,” in Advances in Neural Information Processing Systems, pp. 7146–7155, 2018.
[22] S. Mei, A. Montanari, and P.-M. Nguyen, “A mean field view of the landscape of two-layer neural networks,” Proceedings of the National Academy of Sciences, vol. 115, no. 33, pp. E7665–E7671, 2018.
[23] L. Chizat and F. Bach, “On the global convergence of gradient descent for over-parameterized models using optimal transport,” in Advances in Neural Information Processing Systems, pp. 3036–3046, 2018.
[24] R. Abraham, J. E. Marsden, and T. Ratiu, Manifolds, tensor analysis, and applications, vol. 75 of Applied Mathematical Sciences. Springer, 1988.
[25] P. Bamberg and S. Sternberg, A Course in Mathematics for Students of Physics: Volume 2, vol. 2. Cambridge University Press, 1988.
[26] D. Bleecker, Gauge theory and variational principles. Dover, 2005.
[27] G. B. Folland, Quantum Field Theory: A tourist guide for mathematicians, vol. 149 of Mathematical Surveys and Monographs. American Mathematical Soc., 2008.
[28] T. Frankel, The geometry of physics: an introduction. Cambridge University Press, 2011.
[29] C. Lavor, S. Xambó-Descamps, and I. Zaplana, A Geometric Algebra Invitation to Space-Time Physics, Robotics and Molecular Geometry. SBMA/Springerbrief, Springer, 2018.
[30] S. Xambó-Descamps, Real spinorial groups—a short mathematical introduction. SBMA/Springerbrief, Springer, 2018.
[31] M. M. Bronstein, J. Bruna, T. Cohen, and P. Velickovi´c, “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges,” 2021. https://arxiv.org/abs/2104.13478
[32] T. Cohen, Equivariant Convolutional Networks. PhD thesis, 2021.
[33] M. J. Mohlenkamp and M. C. Pereyra, Wavelets, their friends, and what they can do for you, vol. 8 of Lectures in Mathemmatics. European Mathematical Society, 2008.
[34] M. Stephane, A wavelet tour of signal processing: The sparse way (third edition). Elsevier, 2009.
[35] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “The spectral graph wavelet transform: Fundamental theory and fast computation,” in Vertex-Frequency Analysis of Graph Signals, pp. 141–175, Springer, 2019.
[36] B. Ghojogh, A. Ghodsi, F. Karray, and M. Crowley, “Laplacian-based dimensionality reduction including spectral clustering, Laplacian eigenmap, locality preserving projection, graph embedding, and diffusion map: Tutorial and survey,” 2021. arXiv:2106.02154.
[37] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
[38] S. Mallat, “Group invariant scattering,” Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012.
Courant Institute and the Center for Data Science, New York University
Universitat Politècnica de Catalunya