HtFLlib: A Unified Benchmarking Library for Evaluating Heterogeneous Federated Learning Methods Across Modalities
Understanding the Target Audience
The target audience for HtFLlib primarily includes researchers, data scientists, and practitioners in the field of artificial intelligence and machine learning, particularly those focused on federated learning (FL). These individuals often work in academic institutions, research labs, and tech companies that develop AI solutions leveraging distributed data. Their pain points include:
- Data scarcity during model training.
- The limitation of traditional FL that it requires homogeneous model architectures across all clients.
- Concerns over intellectual property when sharing locally trained models.
Their primary goals are to:
- Improve model performance across diverse data types and modalities.
- Facilitate collaboration without compromising proprietary data or models.
- Benchmark the effectiveness of heterogeneous models in real-world scenarios.
Interests include advancements in federated learning methodologies, collaborative model training, and applications in various domains such as healthcare, finance, and natural language processing. They prefer clear, concise, and technical communication that provides actionable insights and data-driven results.
Background on Heterogeneous Federated Learning (HtFL)
AI institutions often develop heterogeneous models tailored for specific tasks but encounter data scarcity during training. Traditional Federated Learning (FL) typically supports only homogeneous model collaboration, requiring identical architectures across all clients. However, clients often develop model architectures that cater to their unique requirements. Additionally, sharing effort-intensive locally trained models raises concerns about intellectual property and decreases participants’ motivation to engage in collaborations. Heterogeneous Federated Learning (HtFL) addresses these challenges, yet the literature lacks a unified benchmark for evaluating HtFL across various domains and aspects.
Categories of HtFL Methods
Current FL benchmarks primarily focus on data heterogeneity using homogeneous client models but overlook real-world scenarios involving model heterogeneity. Representative HtFL methods can be categorized into three main groups:
- Partial parameter sharing methods (e.g., LG-FedAvg, FedGen, FedGH) maintain heterogeneous feature extractors while assuming homogeneous classifier heads for knowledge transfer.
- Mutual distillation methods (e.g., FML, FedKD, FedMRL) train and share small auxiliary models through distillation techniques.
- Prototype sharing methods (e.g., FedProto, FedTGP) transfer lightweight class-wise prototypes as global knowledge, aggregating local prototypes from clients to guide local training.
However, it remains unclear whether existing HtFL methods perform consistently across diverse scenarios.
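To make the prototype-sharing idea concrete, here is a minimal numpy sketch of the core mechanics: each client computes class-wise prototypes (the mean feature embedding per class), and the server aggregates them weighted by sample counts. The function names and structure are illustrative, not HtFLlib's actual API.

```python
import numpy as np

def local_prototypes(features, labels, num_classes):
    """Client side: class-wise prototypes = mean feature embedding per class."""
    protos, counts = {}, {}
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(axis=0)
            counts[c] = int(mask.sum())  # used as the aggregation weight
    return protos, counts

def aggregate_prototypes(client_protos, client_counts, num_classes):
    """Server side: count-weighted average of client prototypes per class."""
    global_protos = {}
    for c in range(num_classes):
        pairs = [(p[c], n[c]) for p, n in zip(client_protos, client_counts) if c in p]
        if pairs:
            total = sum(n for _, n in pairs)
            global_protos[c] = sum(p * n for p, n in pairs) / total
    return global_protos
```

Because only these small class-wise vectors cross the network, prototype sharing works regardless of each client's model architecture, which is why it tolerates full model heterogeneity.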
Introducing HtFLlib: A Unified Benchmark
A collaborative effort from researchers at Shanghai Jiao Tong University, Beihang University, Chongqing University, Tongji University, Hong Kong Polytechnic University, and The Queen’s University of Belfast has led to the development of HtFLlib, the first Heterogeneous Federated Learning Library. This library offers an easy and extensible approach to integrating multiple datasets and model heterogeneity scenarios. HtFLlib integrates:
- 12 datasets across various domains, modalities, and data heterogeneity scenarios.
- 40 model architectures ranging from small to large across three modalities.
- A modularized and easy-to-extend codebase with implementations of 10 representative HtFL methods.
- Systematic evaluations covering accuracy, convergence, computation costs, and communication costs.
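Among the implemented method families, mutual distillation (the idea behind FML and FedKD) trains a small auxiliary model alongside the local model, with each nudged toward the other's softened predictions via a symmetric KL objective. The sketch below shows that objective in numpy; the temperature value and function names are illustrative, not taken from HtFLlib's codebase.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; higher T flattens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mutual_distillation_losses(logits_local, logits_aux, T=2.0):
    """Symmetric KL terms: each model imitates the other's soft predictions.

    The T*T factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(logits_local, T)
    q = softmax(logits_aux, T)
    kl_pq = np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean() * T * T
    kl_qp = np.sum(q * (np.log(q) - np.log(p)), axis=-1).mean() * T * T
    return kl_pq, kl_qp
```

Only the compact auxiliary model is communicated between clients and server, so each participant's main architecture, and its intellectual property, stays local.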
Datasets and Modalities in HtFLlib
HtFLlib covers detailed data heterogeneity scenarios in three settings: Label Skew (with Pathological and Dirichlet subsettings), Feature Shift, and Real-World. It integrates 12 datasets: Cifar10, Cifar100, Flowers102, Tiny-ImageNet, KVASIR, COVIDx, DomainNet, Camelyon17, AG News, Shakespeare, HAR, and PAMAP2. These datasets vary considerably in domain, data volume, and number of classes, underscoring the benchmark's breadth. The evaluation emphasizes image data, especially in the label skew setting, since image tasks are the most widely used across fields, but HtFL methods are also assessed on text and sensor-signal tasks to expose their respective strengths and weaknesses.
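The Dirichlet subsetting of the label skew scenario is typically simulated by drawing, for each class, its split proportions across clients from a Dirichlet distribution. The following minimal sketch illustrates the standard technique; the concentration value and function names are illustrative rather than HtFLlib's exact implementation.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.1, seed=0):
    """Split sample indices across clients with Dirichlet label skew.

    Smaller alpha -> each class concentrates on fewer clients (more skew);
    large alpha approaches a uniform, IID-like split.
    """
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Draw this class's per-client proportions from Dir(alpha, ..., alpha).
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```

Each sample lands on exactly one client, and sweeping alpha yields a family of partitions from highly skewed to nearly balanced, which is what makes this subsetting useful for benchmarking.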
Performance Analysis: Image and Text Modalities
For image data, most HtFL methods lose accuracy as model heterogeneity increases. FedMRL performs best, thanks to its combination of auxiliary global and local models. When heterogeneous classifiers render partial parameter sharing methods inapplicable, FedTGP retains its advantage across diverse settings owing to its adaptive prototype refinement. Experiments on medical datasets with black-boxed pre-trained heterogeneous models show that HtFL improves model quality over the pre-trained baselines and yields larger gains than auxiliary-model approaches such as FML.
For text data, FedMRL's advantage in label skew settings diminishes in real-world scenarios, while FedProto and FedTGP perform relatively poorly compared to their results on image tasks.
Conclusion
In conclusion, HtFLlib provides a framework that addresses the critical gap in HtFL benchmarking by establishing unified evaluation standards across diverse domains and scenarios. Its modular design and extensible architecture offer a detailed benchmark for both research and practical applications in HtFL. Moreover, HtFLlib’s capability to support heterogeneous models in collaborative learning paves the way for future research into utilizing complex pre-trained large models, black-box systems, and varied architectures across different tasks and modalities.
For more information, check out the Paper and the GitHub Page. All credit for this research goes to the researchers of this project.