US-RSE Conference 2025
7 October 2025
Infectious disease modeling has scaled rapidly…
“Comparing the accuracy of forecasting applications is difficult because forecasting methods, forecast outcomes, and reported validation metrics varied widely.”
Modeling hubs coordinate collaborative forecasting:
Provide centralised location for effort coordination
Define data standards and modeling targets
Improve transparency and comparability
Aggregate forecasts enabling ensembles
Facilitate timely public health decision-making
Reich et al., “Collaborative Hubs: Making the Most of Predictive Epidemic Modeling”, American Journal of Public Health, 2022
➡️ Need for generalisation, modularity, and configurability
An open-source software ecosystem to power modeling hubs:
Modeling hubs are built around a shared data standard (e.g. mean, quantiles).
✅ Enables comparability, validation, and streamlined data access
Hub administrators configure hubs using structured JSON config files:
admin.json: hub-level metadata.
tasks.json: modeling task specification, including the supported output types (mean, median, quantiles, cdf, pmf, samples).
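These config files can be created and checked with the hubAdmin package; a minimal sketch of validating them, assuming the working directory is the root of a hub repository:

# Validate a hub's admin.json and tasks.json against the hubverse schemas
# (assumes "." is the root of a hub repository)
library(hubAdmin)
validate_hub_config(hub_path = ".")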
The hubverse package ecosystem is organized by role. Each tool is designed to support a particular group of users in the hub workflow.
Hub roles → Tools & packages
hubAdmin: config creation + validation 🛠️
hubValidations: submission checks (structure, schema, content) 🔬 🛠️
hubData (R) / hubdata (Python): access multi-file model output via Arrow 🛠️ 🔬 📊 🏛️
hubEvals: compute evaluation metrics 🛠️ 📊 🏛️
hubEnsembles: build weighted/unweighted ensembles 🛠️ 📊 🏛️
hubVis: visualise model outputs 🛠️ 📊 🏛️
hub-dashboard-template
Dashboards are crucial for communicating model outputs to a broader audience, including policy makers, public health officials, and even the general public, so we put effort into developing a plug-and-play dashboard template.
Our template is built in Quarto, so it’s easy to customise and extend.
What makes it powerful is that it’s fully static: no backend is needed.
JSON data are generated from hub data by GitHub workflows, and the JavaScript frontend fetches this data to create a dynamic, interactive experience for the user.
This design makes it fast, easy to host (e.g. on GitHub Pages), and simple to replicate.
Hub admins can launch new dashboards quickly by copying and configuring the hub-dashboard-template.
R 📦 hubData and Python 📦 hubdata
As a bonus, we’ve developed infrastructure to mirror validated hub data to AWS S3 buckets, making it publicly accessible as Arrow datasets.
This enables downstream users, especially analysts, to query the data without needing to clone the full repository.
They can access it via hubData in R or the equivalent Python package hubdata.
This cloud mirroring service is something the hubverse currently provides.
To enable it for a new hub:
Hub administrators need to contact us so we can provision a bucket.
They need to add a bit of lightweight configuration to their admin.json file to activate it and install the relevant action workflow.
We automate everything we can:
All hubverse actions are stored in the hubverse-actions repo and can be installed with hubCI::use_hub_github_action()
Automation is a cornerstone of the hubverse.
We use GitHub Actions to handle hub operations.
This includes:
Validating model output submissions and target data at the pull request level,
Validating hub configuration files,
Synchronising validated model output to cloud storage,
Generating dashboard data + computing model evaluation metrics.
All our workflows are stored in the hubverse-actions repo.
They’re easy to adopt: hub administrators can install them using the hubCI::use_hub_github_action() helper, as shown below.
This approach ensures consistency and keeps overhead for administrators relatively low.
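A minimal sketch of installing one of these workflows (the workflow name shown is an assumption; the available workflows are listed in the hubverse-actions repo):

# Install a hubverse GitHub Actions workflow into the current hub repository
# (workflow name is illustrative; see the hubverse-actions repo for the full list)
hubCI::use_hub_github_action(name = "validate-submission")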
https://hubverse.io/community/hubs.html
Now that we’ve seen the infrastructure, let’s look at adoption.
This page on our site showcases hubs using the hubverse stack, from long-standing hubs like the FluSight Forecast Hub to newer efforts like the COVID-19 Variant Nowcast or West Nile Virus hubs.
Each hub entry includes structured metadata, like a hub description, whether data are publicly available, and links to the relevant GitHub repos, dashboards, and cloud storage.
This kind of shared metadata and visibility wouldn’t be possible without the consistency enforced by the hubverse framework.
It also serves as a community resource, a growing catalog of open modeling hubs that are easy to discover and learn from.
Now let’s take a look at some of this infrastructure in action by focusing on the CDC FluSight Hub.
https://github.com/cdcepi/FluSight-forecast-hub
The CDC FluSight Hub is a flagship example of the hubverse stack in action.
It’s used by the US CDC to monitor influenza severity through weekly model outputs, submitted by over 40 teams, spanning 70 different models.
The hub has been fully managed using hubverse software, workflows, and S3 cloud mirroring since the 2023/24 season.
If we peek into the repository structure, you’ll see the key directories that define a hub:
hub-config: holds the JSON config files.
model-output: where teams submit their model output files.
target-data: stores the observed data used to evaluate forecasts, such as reported influenza-like illness rates.
Model outputs are committed by teams to versioned directories: one directory per model, one file per modeling round.
This slide breaks down a single model output file and its location in the CDC FluSight hub.
At the top, we see the file lives under the model-output/<model_id> directory.
Files are named using a round ID and model ID, e.g., 2023-10-14-CADPH-FluCAT_Ensemble.csv. Each file contains all predictions for that round by that model.
Inside the file:
- The Task ID columns, like target, horizon, and location, store information about individual predictions.
- The Output Type columns describe the prediction format (e.g., quantiles).
- And the value column holds the actual predicted values.
This file structure (task IDs, output types, values) is fully standardised across all hubs.
hubValidations
Model outputs submitted through PRs and validated through GitHub Actions
Every model output is submitted via GitHub pull requests, and each submission is validated using the hubValidations
package.
On the left, we see the latest submissions from different teams for a specific forecast round.
On the right is an example validation log from GitHub Actions. Here, hubValidations
checks that the submission:
Passes file-level checks, e.g.:
Follows the required file structure,
Is named and placed correctly,
And passes content-level checks:
valid task ID and output type value combinations,
required tasks submitted,
valid output types, with output-type-specific checks (e.g. quantile crossing)
Custom checks are also supported.
These automated checks catch errors early and keep hub data clean and trustworthy.
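Teams can also run the same checks locally before opening a PR; a minimal sketch with illustrative paths:

# Run hubverse submission checks locally before opening a PR (paths are illustrative)
library(hubValidations)
v <- validate_submission(
  hub_path = ".",
  file_path = "CADPH-FluCAT_Ensemble/2023-10-14-CADPH-FluCAT_Ensemble.csv"
)
check_for_errors(v)  # summarise any failed checks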
hubData
Connect to Arrow dataset of forecast submissions
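A minimal sketch of creating the hub_con connection used below (the local clone path is illustrative; the same call can instead point at the hub’s S3 mirror):

# Open the hub's model output files as a single Arrow dataset
library(hubData)
hub_con <- connect_hub("FluSight-forecast-hub")
hub_con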
hub_connection
9 columns
reference_date: date32[day]
target: string
horizon: int32
target_end_date: date32[day]
location: string
output_type: string
output_type_id: string
value: double
model_id: string
Query and collect data
# Filter for one model and target end date using dplyr
library(dplyr)
hub_con |>
  filter(
    model_id == "CADPH-FluCAT_Ensemble",
    target_end_date == "2023-10-28"
  ) |>
  collect_hub()
# A tibble: 92 × 9
model_id reference_date target horizon target_end_date location output_type
* <chr> <date> <chr> <int> <date> <chr> <chr>
1 CADPH-Flu… 2023-10-14 wk in… 2 2023-10-28 06 quantile
2 CADPH-Flu… 2023-10-14 wk in… 2 2023-10-28 06 quantile
3 CADPH-Flu… 2023-10-14 wk in… 2 2023-10-28 06 quantile
4 CADPH-Flu… 2023-10-14 wk in… 2 2023-10-28 06 quantile
5 CADPH-Flu… 2023-10-14 wk in… 2 2023-10-28 06 quantile
6 CADPH-Flu… 2023-10-14 wk in… 2 2023-10-28 06 quantile
7 CADPH-Flu… 2023-10-14 wk in… 2 2023-10-28 06 quantile
8 CADPH-Flu… 2023-10-14 wk in… 2 2023-10-28 06 quantile
9 CADPH-Flu… 2023-10-14 wk in… 2 2023-10-28 06 quantile
10 CADPH-Flu… 2023-10-14 wk in… 2 2023-10-28 06 quantile
# ℹ 82 more rows
# ℹ 2 more variables: output_type_id <chr>, value <dbl>
See more in Accessing data vignette.
Python analogue hubdata also available.
We’ve designed the hubverse format so that, while teams look after their own submission files, model outputs can be accessed as a single queryable Arrow dataset.
Here we show how to connect to the FluSight hub’s S3 mirror using hubData
.
We can then filter and collect data using standard dplyr
verbs.
This makes it easy for analysts, modelers, or public health teams to extract just the subsets of data they need.
The same functionality is available in Python via the hubdata package.
hubEnsembles
Combine models using simple or weighted rules
forecast_df <- hub_con |>
  filter(
    model_id %in%
      c(
        "CADPH-FluCAT_Ensemble",
        "CEPH-Rtrend_fluH",
        "CFA_Pyrenew-Pyrenew_HE_Flu"
      ),
    output_type == "quantile"
  ) |>
  collect_hub()

hubEnsembles::simple_ensemble(
  forecast_df,
  agg_fun = median,
  model_id = "simple-ensemble-median"
)
# A tibble: 282,716 × 9
model_id reference_date target horizon target_end_date location output_type
* <chr> <date> <chr> <int> <date> <chr> <chr>
1 simple-en… 2023-10-14 wk in… -1 2023-10-07 01 quantile
2 simple-en… 2023-10-14 wk in… -1 2023-10-07 01 quantile
3 simple-en… 2023-10-14 wk in… -1 2023-10-07 01 quantile
4 simple-en… 2023-10-14 wk in… -1 2023-10-07 01 quantile
5 simple-en… 2023-10-14 wk in… -1 2023-10-07 01 quantile
6 simple-en… 2023-10-14 wk in… -1 2023-10-07 01 quantile
7 simple-en… 2023-10-14 wk in… -1 2023-10-07 01 quantile
8 simple-en… 2023-10-14 wk in… -1 2023-10-07 01 quantile
9 simple-en… 2023-10-14 wk in… -1 2023-10-07 01 quantile
10 simple-en… 2023-10-14 wk in… -1 2023-10-07 01 quantile
# ℹ 282,706 more rows
# ℹ 2 more variables: output_type_id <chr>, value <dbl>
Ensembling is central to the hub idea as it helps improve prediction robustness by combining multiple model predictions into a single output.
hubEnsembles
provides a simple interface to build various types of ensembles from hubverse-formatted data.
This example shows how to pull data from a few different models and apply a simple ensemble method.
This is the FluSight dashboard and specifically the forecast page.
It enables users to explore predictions interactively: you can choose outcomes, forecast dates and compare multiple models visually.
Crucially, the dashboard also shows the target (observed) data alongside forecasts, which helps contextualise model behaviour and monitor real-world performance.
Clicking on any model name in the left-hand panel opens that model’s metadata page, which includes details about the model, contributors, assumptions, etc.
Evaluates forecasts against target (observed) data.
This is the evaluation tab of the FluSight dashboard, which helps users compare model performance using standard metrics.
It’s powered by the hubEvals
package, which computes evaluation scores by comparing model outputs to the target data, also stored in the hub.
The evaluation supports scoring rules such as:
WIS (Weighted Interval Score)
MAE (Mean Absolute Error)
Coverage of 50% and 95% prediction intervals
Evaluations are performed during the dashboard build and are summarised into an interactive table, allowing for quick comparison across models.
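As a rough sketch of that scoring step (assuming quantile forecasts collected with hubData into forecast_df and observed values available as an oracle-output table oracle_output; both names are placeholders):

# Score model outputs against observed target data, one set of scores per model
# (forecast_df and oracle_output are assumed inputs; for quantile forecasts the
#  default metrics include WIS, absolute error, and interval coverage)
library(hubEvals)
score_model_out(
  model_out_tbl = forecast_df,
  oracle_output = oracle_output,
  by = "model_id"
)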
To wrap up:
We hope this inspires other communities to adopt or adapt similar workflows.
Tip
Interested in getting involved in the community? Check out our Getting Involved page!
Thanks so much! I’ll leave you with some useful links, including information on how to get involved in the community, like our newsletter, monthly community meetings, or bi-weekly dev meetings.
I’m happy to chat about anything hubverse!