Learning in the Age of LLMs: Theoretical Insights into Knowledge Distillation and Test-Time Training

Marco Mondelli, Institute of Science and Technology Austria
July 21, 2025, 2:00 pm, LSK 306

The availability of powerful models pre-trained on a vast corpus of data has spurred research on alternative training methods, and the overall goal of this talk is to give theoretical insights through the lens of high-dimensional regression.

The first part considers knowledge distillation, where the output of a surrogate model is used as labels to supervise the training of a target model, and, in particular, the phenomenon of weak-to-strong generalization, in which a strong student outperforms the weak teacher from which it learns the task. We provide a sharp characterization of the risk of the target model for ridgeless, high-dimensional regression under two settings: (i) model shift, where the surrogate model is arbitrary, and (ii) distribution shift, where the surrogate model is the solution of empirical risk minimization with out-of-distribution data. As a consequence, we identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion. The interpretation is that weak-to-strong training, with the surrogate as the weak model, can provably outperform training with strong labels under the same data budget, but it cannot improve the data scaling law.
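As a toy illustration of this setup (not the talk's exact model: the design, noise level, and the choice of a weak teacher restricted to a feature subset are all illustrative assumptions), the sketch below trains a ridgeless, minimum-norm student on labels produced by a weak surrogate and compares its test risk to training on the true labels:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 50, 40, 2000              # n < d: the ridgeless/interpolating regime
beta = rng.normal(size=d) / np.sqrt(d)   # ground-truth regression vector (isotropic Gaussian design)

X = rng.normal(size=(n, d))
y_true = X @ beta + 0.1 * rng.normal(size=n)

# Weak "surrogate" teacher: least-squares fit using only the first k features
# (an illustrative stand-in for a weak model).
k = 10
teacher = np.zeros(d)
teacher[:k] = np.linalg.pinv(X[:, :k]) @ y_true
y_surrogate = X @ teacher                # distillation labels from the weak teacher

def min_norm_fit(X, y):
    """Ridgeless (minimum-l2-norm) interpolator via the pseudoinverse."""
    return np.linalg.pinv(X) @ y

student_strong = min_norm_fit(X, y_true)       # trained on strong (true) labels
student_weak = min_norm_fit(X, y_surrogate)    # weak-to-strong: trained on teacher outputs

X_test = rng.normal(size=(n_test, d))

def risk(w):
    """Excess test risk relative to the noiseless ground truth."""
    return np.mean((X_test @ w - X_test @ beta) ** 2)

print(f"risk with true labels:      {risk(student_strong):.3f}")
print(f"risk with surrogate labels: {risk(student_weak):.3f}")
```

Whether the surrogate-trained student wins depends on the signal, noise, and overparameterization ratio; the talk's sharp risk characterization makes this trade-off precise.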

The second part of the talk considers test-time training (TTT), where one explicitly updates the weights of a model to adapt to the specific test instance. We investigate a gradient-based TTT algorithm for in-context learning, in which the transformer model is trained on the in-context demonstrations provided in the test prompt. Focusing on linear transformers and an update rule consisting of a single gradient step, our theory (i) delineates the role of alignment between the pre-training distribution and the target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT, including how it can significantly reduce the sample size required for in-context learning.
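A minimal sketch of the single-gradient-step update (assumptions: a plain linear predictor stands in for the linear transformer's effective weights, the demonstrations are noiseless, and the step size is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_demo = 5, 20

# "Pre-trained" weights (assumption: a linear predictor standing in for a
# linear transformer's effective weights after pre-training).
w_pre = rng.normal(size=d) * 0.1

# Test prompt: in-context demonstrations drawn from a shifted target task.
beta_task = rng.normal(size=d)
X_demo = rng.normal(size=(n_demo, d))
y_demo = X_demo @ beta_task

def ttt_one_step(w, X, y, lr=0.05):
    """One gradient step of the mean squared loss on the demonstrations."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

w_ttt = ttt_one_step(w_pre, X_demo, y_demo)

# Evaluate on a fresh query from the same target task.
x_query = rng.normal(size=d)
err_pre = (x_query @ w_pre - x_query @ beta_task) ** 2
err_ttt = (x_query @ w_ttt - x_query @ beta_task) ** 2
print(f"query error before TTT: {err_pre:.3f}, after one step: {err_ttt:.3f}")
```

With a small step size, the update contracts the weights toward the target task's parameter, which is the mechanism by which TTT mitigates the shift between pre-training and the test prompt.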

Refreshments will be served before the talk, beginning at 1:45 pm.