Our distillation strategy follows a compress-and-reconstruct pipeline. The student model uses a Dynamic Supertoken Optimization (DSO) module to compress the input tokens into a small set of learnable SuperTokens. After processing by a lightweight encoder, a Cross-Attention Upsampling (CAU) module reconstructs an approximation of the teacher's latent space. The entire student is trained to minimize the reconstruction error, forcing the SuperTokens to become a powerful, compact basis for the teacher's representations.
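To make the pipeline concrete, the following is a minimal PyTorch sketch of the compress-and-reconstruct loop. The cross-attention designs for the DSO and CAU modules, the dimensions, and the hyper-parameters shown here are illustrative assumptions, not the exact architecture used in the paper.

# Illustrative sketch of the compress-and-reconstruct pipeline.
# Module designs and hyper-parameters are assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn

class DSOCompressor(nn.Module):
    """Compress N input tokens into s learnable SuperTokens via cross-attention (assumed design)."""
    def __init__(self, dim=384, num_supertokens=32, num_heads=6):
        super().__init__()
        self.supertokens = nn.Parameter(torch.randn(1, num_supertokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        q = self.supertokens.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)        # SuperTokens attend to the input tokens
        return out                                   # (B, s, dim)

class CAUReconstructor(nn.Module):
    """Reconstruct N token features from the SuperTokens via cross-attention (assumed design)."""
    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries, supertokens):         # queries: (B, N, dim)
        out, _ = self.attn(queries, supertokens, supertokens)
        return out                                   # (B, N, dim), approximation of teacher tokens

class FoundryStudent(nn.Module):
    def __init__(self, dim=384, num_supertokens=32, depth=4):
        super().__init__()
        self.dso = DSOCompressor(dim, num_supertokens)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), depth)
        self.cau = CAUReconstructor(dim)

    def forward(self, tokens):
        s = self.encoder(self.dso(tokens))           # lightweight encoder runs on s SuperTokens only
        return self.cau(tokens, s)                   # reconstruct the teacher's token-level features

def distillation_loss(student_out, teacher_tokens):
    # Token-level reconstruction objective against the teacher's latent features.
    return nn.functional.mse_loss(student_out, teacher_tokens)

Training then reduces to minimizing this token-level reconstruction error against the teacher's features, which is what pushes the SuperTokens toward a compact basis of the teacher's latent space.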
For a small s, Foundry needs less computation for a forward pass than a regular Transformer. Foundry-Gate takes an additional parameter r, which can either be fixed by the user or set dynamically by the gate itself. In the worst case, r=0 (and hence s=0, since no tokens are selected for merging), the complexity matches that of a standard Transformer; in the best case, it matches the complexity of Foundry.
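As a rough illustration of why a small s lowers the cost, the snippet below compares the dominant attention term of a standard Transformer over N tokens with that of an encoder operating on s SuperTokens, plus the cross-attention terms for compression and reconstruction. Treating s as the number of SuperTokens and using the usual quadratic attention-cost estimate are assumptions made for this sketch; this is not the paper's exact FLOP accounting.

# Rough per-layer attention-cost estimate in multiply-accumulates (illustrative assumption).
def self_attention_cost(num_tokens, dim):
    return 2 * num_tokens * num_tokens * dim          # QK^T plus attention @ V

def cross_attention_cost(num_queries, num_keys, dim):
    return 2 * num_queries * num_keys * dim

N, s, d, depth = 1024, 32, 384, 12                    # example sizes, not the paper's settings

transformer = depth * self_attention_cost(N, d)       # full self-attention on all N tokens
foundry = (cross_attention_cost(s, N, d)              # compression: N tokens -> s SuperTokens
           + depth * self_attention_cost(s, d)        # encoder runs on s SuperTokens only
           + cross_attention_cost(N, s, d))           # reconstruction back to N tokens

print(f"standard: {transformer / 1e9:.2f} GMACs, foundry: {foundry / 1e9:.3f} GMACs")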
Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks—classification, part segmentation, and few-shot scenarios—approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.
This figure shows that by matching token representations, the compression modules learn a basis for the latent space similar to the teacher's.
We show examples where the PCA projection is well preserved.
We show examples where the PCA projection is degraded.
This figure shows the assignment of tokens to SuperTokens.
Foundry results represented graphically.
Foundry results compared with off-the-shelf methods and with models fine-tuned using FMD.
Even though Foundry compresses the input tokens into a much smaller set of SuperTokens, we observe acceptable few-shot performance, demonstrating the effectiveness of our method.
The addition of the selection gate (Foundry-Gate) allows the compression ratio to be changed dynamically at inference time without retraining. However, we observe large performance variations across compression ratios, which is one of the limitations of the presented approach.
@article{letellier2025foundry,
title={Foundry: Distilling 3D Foundation Models for the Edge},
author={Letellier, Guillaume and Srivastava, Siddharth and Jurie, Fr{\'e}d{\'e}ric and Sharma, Gaurav},
journal={arXiv preprint arXiv:2511.20721},
year={2025}
}