Our distillation strategy follows a compress-and-reconstruct pipeline. The student model uses a Dynamic Supertoken Optimization (DSO) module to compress the input tokens into a small set of learnable SuperTokens. After processing by a lightweight encoder, a Cross-Attention Upsampling (CAU) module reconstructs an approximation of the teacher's latent space. The entire student is trained to minimize the reconstruction error, forcing the SuperTokens to become a powerful, compact basis for the teacher's representations.
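To make the pipeline concrete, the following is a minimal PyTorch sketch of the compress-and-reconstruct loop. The cross-attention designs for the DSO and CAU modules, the dimensions, and the hyper-parameters shown here are illustrative assumptions, not the exact architecture used in the paper.

# Illustrative sketch of the compress-and-reconstruct pipeline.
# Module designs and hyper-parameters are assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn

class DSOCompressor(nn.Module):
    """Compress N input tokens into s learnable SuperTokens via cross-attention (assumed design)."""
    def __init__(self, dim=384, num_supertokens=32, num_heads=6):
        super().__init__()
        self.supertokens = nn.Parameter(torch.randn(1, num_supertokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        q = self.supertokens.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)        # SuperTokens attend to the input tokens
        return out                                   # (B, s, dim)

class CAUReconstructor(nn.Module):
    """Reconstruct N token features from the SuperTokens via cross-attention (assumed design)."""
    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries, supertokens):         # queries: (B, N, dim)
        out, _ = self.attn(queries, supertokens, supertokens)
        return out                                   # (B, N, dim), approximation of teacher tokens

class FoundryStudent(nn.Module):
    def __init__(self, dim=384, num_supertokens=32, depth=4):
        super().__init__()
        self.dso = DSOCompressor(dim, num_supertokens)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), depth)
        self.cau = CAUReconstructor(dim)

    def forward(self, tokens):
        s = self.encoder(self.dso(tokens))           # lightweight encoder runs on s SuperTokens only
        return self.cau(tokens, s)                   # reconstruct the teacher's token-level features

def distillation_loss(student_out, teacher_tokens):
    # Token-level reconstruction objective against the teacher's latent features.
    return nn.functional.mse_loss(student_out, teacher_tokens)

Training then reduces to minimizing this token-level reconstruction error against the teacher's features, which is what pushes the SuperTokens toward a compact basis of the teacher's latent space.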
For a small s, Foundry needs less computation for a forward pass than a regular Transformer. Foundry-Gate takes an additional parameter r, which can either be fixed by the user or set dynamically by the gate itself. In the worst case, r=0 (and hence s=0, since no tokens are selected for merging), the complexity matches that of a standard Transformer; in the best case, it matches the complexity of Foundry.
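As a rough illustration of why a small s lowers the cost, the snippet below compares the dominant attention term of a standard Transformer over N tokens with that of an encoder operating on s SuperTokens, plus the cross-attention terms for compression and reconstruction. Treating s as the number of SuperTokens and using the usual quadratic attention-cost estimate are assumptions made for this sketch; this is not the paper's exact FLOP accounting.

# Rough per-layer attention-cost estimate in multiply-accumulates (illustrative assumption).
def self_attention_cost(num_tokens, dim):
    return 2 * num_tokens * num_tokens * dim          # QK^T plus attention @ V

def cross_attention_cost(num_queries, num_keys, dim):
    return 2 * num_queries * num_keys * dim

N, s, d, depth = 1024, 32, 384, 12                    # example sizes, not the paper's settings

transformer = depth * self_attention_cost(N, d)       # full self-attention on all N tokens
foundry = (cross_attention_cost(s, N, d)              # compression: N tokens -> s SuperTokens
           + depth * self_attention_cost(s, d)        # encoder runs on s SuperTokens only
           + cross_attention_cost(N, s, d))           # reconstruction back to N tokens

print(f"standard: {transformer / 1e9:.2f} GMACs, foundry: {foundry / 1e9:.3f} GMACs")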
Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks—classification, part segmentation, and few-shot scenarios—approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.
This figure shows that by matching token representations, the compression modules learn a basis for the latent space similar to the teacher's.
We show examples where the PCA projection is well preserved.
We show examples where the PCA projection is degraded.
This figure shows the assignment of tokens to SuperTokens.
Foundry results represented graphically.
Foundry results compared with off-the-shelf methods and with models fine-tuned using FMD.
Even though Foundry compresses the input tokens into a much smaller set of SuperTokens, we observe acceptable few-shot performance, demonstrating the effectiveness of our method.
The addition of the selection gate (Foundry-Gate) allows the compression ratio to be changed dynamically at inference time without retraining. However, we observe large performance variations across compression ratios, which is one of the limitations of the presented approach.
@article{letellier2025foundry,
title={Foundry: Distilling 3D Foundation Models for the Edge},
author={Letellier, Guillaume and Srivastava, Siddharth and Jurie, Fr{\'e}d{\'e}ric and Sharma, Gaurav},
journal={arXiv preprint arXiv:2511.20721},
year={2025}
}