As AI workloads grow in scope, small task-specific models struggle to generalize, and their demand for large amounts of labeled training samples keeps increasing. In contrast, Foundation Models (FMs) are trained on internet-scale unlabeled data via self-supervised learning and have been shown to adapt to a wide range of tasks with minimal fine-tuning. Although large FMs have demonstrated significant impact in natural language processing and computer vision, efforts toward FMs for geospatial applications have been restricted to smaller models, as pretraining larger models requires very large computing resources equipped with state-of-the-art hardware accelerators. Current satellite constellations collect more than 100 TB of data per day, producing images that contain billions of pixels and are multimodal in nature. Such geospatial data poses unique challenges while opening new opportunities for developing FMs. We investigate billion-scale FMs and their HPC training profiles for geospatial applications by pretraining on publicly available data. We study, end to end, how scaling the model size affects the performance and impact of the solution. Our larger 3B-parameter model achieves up to a 30% improvement in top-1 scene classification accuracy compared with a 100M-parameter model. Moreover, we detail performance experiments on the Frontier supercomputer, America's first exascale system, where we study different model- and data-parallel approaches using PyTorch's Fully Sharded Data Parallel (FSDP) library. Specifically, we study variants of the Vision Transformer (ViT) architecture, conducting performance analysis for ViT models with up to 15B parameters. By discussing throughput and performance bottlenecks under different parallelism configurations, we offer insights into how to leverage such leadership-class HPC resources when developing large models for geospatial imagery applications.
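To give a sense of the model scales discussed above (100M to 15B parameters), the following sketch estimates ViT encoder parameter counts from depth and hidden width using the standard rule of thumb that each transformer block contributes roughly 12·d² parameters (4·d² for attention projections plus 8·d² for the MLP with expansion ratio 4). The specific configurations shown are the public ViT-Large and ViT-Huge variants as sanity checks, not the paper's actual model configurations, which are not specified in the abstract.

```python
def vit_encoder_params(depth: int, hidden: int, mlp_ratio: int = 4) -> int:
    """Rough parameter count for a ViT encoder stack.

    Ignores patch/positional embeddings, biases, layer norms, and the
    classification head, all of which are negligible at billion scale.
    """
    attn = 4 * hidden * hidden              # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * hidden * hidden   # two linear layers: d -> 4d -> d
    return depth * (attn + mlp)


# Sanity checks against published ViT variants:
print(vit_encoder_params(24, 1024))  # ViT-Large: 301,989,888 (~302M; actual ~307M)
print(vit_encoder_params(32, 1280))  # ViT-Huge:  629,145,600 (~629M; actual ~632M)
```

Inverting this estimate shows why multi-billion-parameter ViTs require sharded training: at 15B parameters, the weights alone occupy roughly 60 GB in FP32, exceeding the memory of a single accelerator and motivating FSDP-style parameter sharding across devices.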