Using the API¶
This page contains information about using the hubdata API.
Note: This package is based on pyarrow, the python implementation of Apache's Arrow library.
Basic usage¶
1. Use connect_hub() to get a HubConnection object for a hub directory.
2. Call HubConnection.get_dataset() to get a pyarrow Dataset extracted from the hub's model output directory.
3. Work with the data by either calling functions directly on the dataset (not as common) or calling Dataset.to_table() to read the data into a pyarrow Table. You can use pyarrow's compute functions or convert the table to another format, such as Polars or pandas.
For example, here is code using native pyarrow commands to count the total number of rows in the test/hubs/flu-metrocast test hub and then get the unique locations in the dataset as a python list.
First, start a python interpreter with the required libraries:
Note: All shell examples assume you're using Bash and that you first cd into this repo's root directory, e.g., cd /<path_to_repos>/hub-data/.
python3
Then run the following Python code. (We’ve included Python output in comments.)
from pathlib import Path
from hubdata import connect_hub
import pyarrow.compute as pc
hub_connection = connect_hub(Path('test/hubs/flu-metrocast')) # relative Path is OK, but str would need to be absolute
hub_ds = hub_connection.get_dataset()
hub_ds.count_rows()
# 14895
pa_table = hub_ds.to_table() # load all hub data into memory as a pyarrow Table
pc.unique(pa_table['location']).to_pylist()
# ['Bronx', 'Brooklyn', 'Manhattan', 'NYC', 'Queens', 'Staten Island', 'Austin', 'Dallas', 'El Paso', 'Houston', 'San Antonio']
pc.unique(pa_table['target']).to_pylist()
# ['ILI ED visits', 'Flu ED visits pct']
Compute functions¶
As mentioned above, we use the pyarrow Dataset.to_table() function to load a dataset into a pyarrow Table. For example, continuing the above Python session:
# naive approach to getting a table: load entire dataset into memory
pa_table = hub_ds.to_table()
print(pa_table.column_names)
# ['reference_date', 'target', 'horizon', 'location', 'target_end_date', 'output_type', 'output_type_id', 'value', 'model_id']
print(pa_table.shape)
# (14895, 9)
However, that function reads the entire dataset into memory, which could be unnecessary or fail for large hubs. A more parsimonious approach is to use the Dataset.to_table() columns and filter arguments to select and filter only the information of interest and limit what data is pulled into memory:
# more parsimonious approach: load a subset of the data into memory (select only `target_end_date` and `value`
# associated with `Bronx` as location)
pa_table = hub_ds.to_table(columns=['target_end_date', 'value'],
filter=pc.field('location') == 'Bronx')
print(pa_table.shape)
# (1350, 2)
HubConnection.to_table() convenience function¶
If you just want the pyarrow Table and don't need the pyarrow Dataset returned by HubConnection.get_dataset(), you can use the HubConnection.to_table() convenience function, which calls HubConnection.get_dataset() for you and then passes its args through to the returned Dataset's to_table(). The above example in full would be:
from pathlib import Path
from hubdata import connect_hub
import pyarrow.compute as pc
hub_connection = connect_hub(Path('test/hubs/flu-metrocast'))
pa_table = hub_connection.to_table(columns=['target_end_date', 'value'],
filter=pc.field('location') == 'Bronx')
print(pa_table.shape)
# (1350, 2)
Working with a cloud-based hub¶
This package supports connecting to cloud-based hubs (primarily AWS S3 for the hubverse) via pyarrow's abstract filesystem interface, which works with both local file systems and those on the cloud. Here's an example of accessing the cloud-enabled example-complex-forecast-hub's S3 bucket via the S3 URI s3://example-complex-forecast-hub/, continuing the above Python session:
Note: An S3 URI (Uniform Resource Identifier) for Amazon S3 has the format s3://<bucket-name>/<key-name>. It uniquely identifies an object stored in an S3 bucket. For example, s3://my-bucket/data.txt refers to a file named data.txt within the bucket named my-bucket.
hub_connection = connect_hub('s3://example-complex-forecast-hub/')
print(hub_connection.to_table().shape)
# (553264, 9)
Note: This package’s performance with cloud-based hubs can be slow due to how pyarrow’s dataset scanning works.
Working with data outside pyarrow: A Polars example¶
As mentioned above, once you have a pyarrow Table you can convert it to work with dataframe packages like pandas and Polars. Here we give an example using the flu-metrocast test hub. For simplicity, we use uv, which lets us start a python session that installs the Polars package on the fly via uv run's --with argument:
uv run --with polars python3
Then run the following Python commands to see Polars integration in action:
from pathlib import Path
import polars as pl
import pyarrow.compute as pc
from hubdata import connect_hub
# connect to the hub and then get a pyarrow Table, limiting the columns and rows loaded into memory as described above
hub_connection = connect_hub(Path('test/hubs/flu-metrocast'))
pa_table = hub_connection.to_table(
columns=['target_end_date', 'value', 'output_type', 'output_type_id', 'reference_date'],
filter=(pc.field('location') == 'Bronx') & (pc.field('target') == 'ILI ED visits'))
pa_table.shape
# (1350, 5)
# convert to polars DataFrame
pl_df = pl.from_arrow(pa_table)
pl_df
# shape: (1_350, 5)
# ┌─────────────────┬─────────────┬─────────────┬────────────────┬────────────────┐
# │ target_end_date ┆ value ┆ output_type ┆ output_type_id ┆ reference_date │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ date ┆ f64 ┆ str ┆ f64 ┆ date │
# ╞═════════════════╪═════════════╪═════════════╪════════════════╪════════════════╡
# │ 2025-01-25 ┆ 1375.608634 ┆ quantile ┆ 0.025 ┆ 2025-01-25 │
# │ 2025-01-25 ┆ 1503.974675 ┆ quantile ┆ 0.05 ┆ 2025-01-25 │
# │ 2025-01-25 ┆ 1580.89009 ┆ quantile ┆ 0.1 ┆ 2025-01-25 │
# │ 2025-01-25 ┆ 1630.75 ┆ quantile ┆ 0.25 ┆ 2025-01-25 │
# │ 2025-01-25 ┆ 1664.0 ┆ quantile ┆ 0.5 ┆ 2025-01-25 │
# │ … ┆ … ┆ … ┆ … ┆ … │
# │ 2025-06-14 ┆ 386.850998 ┆ quantile ┆ 0.5 ┆ 2025-05-24 │
# │ 2025-06-14 ┆ 454.018488 ┆ quantile ┆ 0.75 ┆ 2025-05-24 │
# │ 2025-06-14 ┆ 538.585477 ┆ quantile ┆ 0.9 ┆ 2025-05-24 │
# │ 2025-06-14 ┆ 600.680743 ┆ quantile ┆ 0.95 ┆ 2025-05-24 │
# │ 2025-06-14 ┆ 658.922076 ┆ quantile ┆ 0.975 ┆ 2025-05-24 │
# └─────────────────┴─────────────┴─────────────┴────────────────┴────────────────┘
# it's also possible to convert to a polars DataFrame and do some operations
pl_df = (
pl.from_arrow(pa_table)
.group_by(pl.col('target_end_date'))
.agg(pl.col('value').count())
.sort('target_end_date')
)
pl_df
# shape: (22, 2)
# ┌─────────────────┬───────┐
# │ target_end_date ┆ value │
# │ --- ┆ --- │
# │ date ┆ u32 │
# ╞═════════════════╪═══════╡
# │ 2025-01-25 ┆ 9 │
# │ 2025-02-01 ┆ 18 │
# │ 2025-02-08 ┆ 27 │
# │ 2025-02-15 ┆ 36 │
# │ 2025-02-22 ┆ 45 │
# │ … ┆ … │
# │ 2025-05-24 ┆ 81 │
# │ 2025-05-31 ┆ 63 │
# │ 2025-06-07 ┆ 45 │
# │ 2025-06-14 ┆ 27 │
# │ 2025-06-21 ┆ 9 │
# └─────────────────┴───────┘
Working with target data¶
All of the above examples were concerned with model output data. In this section we focus on target (observed) data, in both its time-series and oracle-output forms. The API for both is similar to the model output API, with analogous create_target_data_schema() and connect_target_data() functions. Both accept a target_type enumeration argument (either TargetType.TIME_SERIES or TargetType.ORACLE_OUTPUT) that indicates which form of target data to work with.
Working again with the example-complex-forecast-hub, let’s first use the CLI to get an overview of its time-series and oracle-output data:
hubdata time-series s3://example-complex-forecast-hub/
╭─ target data ──────────────────────────╮
│ │
│ hub_path: │
│ - s3://example-complex-forecast-hub/ │
│ │
│ target type: │
│ - time-series │
│ │
│ schema: │
│ - location: string │
│ - observation: double │
│ - target: string │
│ - target_end_date: date32 │
│ │
│ dataset: │
│ - location: time-series.csv (file) │
│ - files: 1 │
│ - type: csv │
│ │
╰────────────────────────────── hubdata ─╯
hubdata oracle-output s3://example-complex-forecast-hub/
╭─ target data ──────────────────────────╮
│ │
│ hub_path: │
│ - s3://example-complex-forecast-hub/ │
│ │
│ target type: │
│ - oracle-output │
│ │
│ schema: │
│ - location: string │
│ - oracle_value: double │
│ - output_type: string │
│ - output_type_id: string │
│ - target: string │
│ - target_end_date: date32 │
│ │
│ dataset: │
│ - location: oracle-output.csv (file) │
│ - files: 1 │
│ - type: csv │
│ │
╰────────────────────────────── hubdata ─╯
Now let’s try out the API.
import pyarrow.compute as pc
from hubdata.connect_target_data import connect_target_data
from hubdata.create_target_data_schema import TargetType
# first we'll work with time-series target data - corresponds to
# https://github.com/hubverse-org/example-complex-forecast-hub/blob/main/target-data/time-series.csv
td_conn = connect_target_data('s3://example-complex-forecast-hub/', TargetType.TIME_SERIES)
# get the schema for the time-series data. this is set for you via `create_target_data_schema()`, which for
# this hub was determined automatically from the file
# https://github.com/hubverse-org/example-complex-forecast-hub/blob/main/hub-config/target-data.json
td_conn.schema
# target_end_date: date32[day]
# target: string
# location: string
# observation: double
# get a Dataset for the data
ts_ds = td_conn.get_dataset()
ts_ds.count_rows()
# 20510
# load all data into memory as a pyarrow Table
pa_table = ts_ds.to_table()
pa_table.shape
# (20510, 4)
pa_table
# pyarrow.Table
# target_end_date: date32[day]
# target: string
# location: string
# observation: double
# ----
# target_end_date: [[2020-01-11,2020-01-11,2020-01-11,2020-01-11,2020-01-11,...,2023-11-11,2023-11-11,2023-11-11,2023-11-11,2023-11-11]]
# target: [["wk inc flu hosp","wk inc flu hosp","wk inc flu hosp","wk inc flu hosp","wk inc flu hosp",...,"wk flu hosp rate","wk flu hosp rate","wk flu hosp rate","wk flu hosp rate","wk flu hosp rate"]]
# location: [["01","15","18","27","30",...,"50","53","55","54","56"]]
# observation: [[0,0,0,0,0,...,0.463743024532006,0.25853708856730895,0.3225246824744501,0.6760650983083161,1.2098523461629533]]
pc.unique(pa_table['location']).to_pylist()
# ['01', '15', ..., '49']
pc.unique(pa_table['target']).to_pylist()
# ['wk inc flu hosp', 'wk flu hosp rate']
# working with oracle-output target data is very similar - corresponds to
# https://github.com/hubverse-org/example-complex-forecast-hub/blob/main/target-data/oracle-output.csv
td_conn = connect_target_data('s3://example-complex-forecast-hub/', TargetType.ORACLE_OUTPUT)
td_conn.schema
# target_end_date: date32[day]
# target: string
# location: string
# output_type: string
# output_type_id: string
# oracle_value: double
oo_ds = td_conn.get_dataset()
oo_ds.count_rows()
# 200340
pa_table = oo_ds.to_table() # load all data into memory
pa_table.shape
# (200340, 6)
pc.unique(pa_table['output_type']).to_pylist()
# ['quantile', 'mean', 'median', 'sample', 'pmf', 'cdf']