An important function of hubUtils is connecting to data in a hub's model-output directory, facilitating the extraction, filtering, querying, exploration, and analysis of Hub data.
Structure of hubverse datasets
All data returned from connecting to and querying hubs can be read or validated as a model_out_tbl, an S3 class defined by the hubUtils package. A model_out_tbl is a long-form tibble designed to conform to the hubverse data specifications for model output data. In short, the columns of a valid model_out_tbl containing model output data from a hub are:
- model_id: the unique character identifier of a model.
- output_type: a character variable that defines the type of representation of model output in a given row.
- output_type_id: a variable that specifies additional identifying information specific to the output type in a given row, e.g. a numeric quantile level, a string giving the name of a possible category for a discrete outcome, or the index of a sample.
- value: a numeric variable that provides the information about the model's prediction.
- ...: other columns will be present depending on the modeling tasks defined by the individual modeling hub. These columns are referred to in hubverse terminology as the task-ID variables.
Other hubverse tools for data validation, ensemble building, visualization, and so on are all designed around the "promises" implicit in the data format specified by model_out_tbl. For example, the hubEnsembles::linear_pool() function both accepts as input and returns as output model_out_tbl objects.
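To make this layout concrete, here is a small sketch of a tibble conforming to the model_out_tbl column specification. The task-ID columns used here (origin_date and location) and the model names are purely hypothetical; a real hub defines its own task-ID variables in its tasks.json config file.
library(tibble)
# Hypothetical rows in the model_out_tbl layout: two quantile predictions
# and one mean prediction from two made-up models.
toy_model_out <- tribble(
  ~model_id, ~origin_date, ~location, ~output_type, ~output_type_id, ~value,
  "teamA-model", as.Date("2023-05-08"), "US", "quantile", "0.5", 1074,
  "teamA-model", as.Date("2023-05-08"), "US", "quantile", "0.9", 2391,
  "teamB-model", as.Date("2023-05-08"), "US", "mean", NA, 1130
)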
Hub connections
There are two functions for connecting to model-output
data:
- connect_hub() is used for connecting to fully configured hubs (i.e. those containing valid admin.json and tasks.json files in a hub-config directory). This function uses configurations defined in the config files in the hub-config/ directory and allows for connecting to hubs with files in multiple file formats (allowable formats are specified by the file_format property of admin.json).
- connect_model_output() allows for connecting directly to the contents of a model-output directory and is useful for connecting to appropriately organised files in an informal hub (i.e. one which has not been fully configured with appropriate hub-config/ files).
Both functions establish connections through the arrow package, specifically by opening datasets as FileSystemDatasets, one for each file format. Where multiple file formats are accepted in a single hub, the format-specific FileSystemDatasets are combined into a single UnionDataset for single-point access to the entire hub model-output dataset. This only applies to connect_hub() in fully configured hubs, where config files can be used to determine a unifying schema across all file formats. In contrast, connect_model_output() can only open a dataset in a single file format, defined explicitly through the file_format argument.
library(hubUtils)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Connecting to a configured hub
hub_path <- system.file("testhubs/flusight", package = "hubUtils")
hub_con <- connect_hub(hub_path)
hub_con
#>
#> ── <hub_connection/UnionDataset> ──
#>
#> • hub_name: "US CDC FluSight"
#> • hub_path: /home/runner/work/_temp/Library/hubUtils/testhubs/flusight
#> • file_format: "csv(5/5)", "parquet(2/2)", and "arrow(1/1)"
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#> "/home/runner/work/_temp/Library/hubUtils/testhubs/flusight/forecasts"
#> • config_admin: hub-config/admin.json
#> • config_tasks: hub-config/tasks.json
#>
#> ── Connection schema
#> hub_connection
#> forecast_date: date32[day]
#> horizon: int32
#> target: string
#> location: string
#> output_type: string
#> output_type_id: string
#> value: double
model_id: string
To access data from a hub connection, you can use dplyr verbs to construct querying pipelines.
hub_con %>%
filter(output_type == "quantile", location == "US") %>%
collect()
#> # A tibble: 276 × 8
#> forecast_date horizon target location output_type output_type_id value
#> <date> <int> <chr> <chr> <chr> <chr> <dbl>
#> 1 2023-04-24 1 wk ahead inc… US quantile 0.01 0
#> 2 2023-04-24 1 wk ahead inc… US quantile 0.025 0
#> 3 2023-04-24 1 wk ahead inc… US quantile 0.05 0
#> 4 2023-04-24 1 wk ahead inc… US quantile 0.1 281
#> 5 2023-04-24 1 wk ahead inc… US quantile 0.15 600
#> 6 2023-04-24 1 wk ahead inc… US quantile 0.2 717
#> 7 2023-04-24 1 wk ahead inc… US quantile 0.25 817
#> 8 2023-04-24 1 wk ahead inc… US quantile 0.3 877
#> 9 2023-04-24 1 wk ahead inc… US quantile 0.35 913
#> 10 2023-04-24 1 wk ahead inc… US quantile 0.4 965
#> # ℹ 266 more rows
#> # ℹ 1 more variable: model_id <chr>
Note however that not all dplyr filtering options are available for all data types yet.
You can see in the output above that the required model_id, output_type, output_type_id and value columns are all present, as required for a model_out_tbl object. However, the output of the above expression, while conforming to the model_out_tbl convention, is returned simply as a tbl_df (tibble) object.
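If you do want a formal model_out_tbl, the collected tibble should be convertible with hubUtils's as_model_out_tbl() function, as sketched below (reusing the connection from above; see the function's documentation for its full set of arguments):
hub_con %>%
  filter(output_type == "quantile", location == "US") %>%
  collect() %>%
  as_model_out_tbl() # validates required columns and adds the model_out_tbl class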
As an example of the dplyr filtering limitations noted above, if you wanted to get all quantile predictions for the last forecast date in the hub, you might try:
hub_con %>%
filter(output_type == "quantile", location == "US") %>%
filter(forecast_date == max(forecast_date)) %>%
collect()
#> Error: Filter expression not supported for Arrow Datasets: forecast_date == max(forecast_date)
#> Call collect() first to pull data into R.
This doesn't work, however, as arrow does not have an equivalent max method for Date[32] data types.
In such a situation, you could collect() after applying the first level of filtering (which does work in arrow) and then finish filtering on the in-memory data returned by collect().
hub_con %>%
filter(output_type == "quantile", location == "US") %>%
collect() %>%
filter(forecast_date == max(forecast_date))
#> # A tibble: 92 × 8
#> forecast_date horizon target location output_type output_type_id value
#> <date> <int> <chr> <chr> <chr> <chr> <dbl>
#> 1 2023-05-08 1 wk ahead inc… US quantile 0.01 0
#> 2 2023-05-08 1 wk ahead inc… US quantile 0.025 0
#> 3 2023-05-08 1 wk ahead inc… US quantile 0.05 0
#> 4 2023-05-08 1 wk ahead inc… US quantile 0.1 231
#> 5 2023-05-08 1 wk ahead inc… US quantile 0.15 517
#> 6 2023-05-08 1 wk ahead inc… US quantile 0.2 637
#> 7 2023-05-08 1 wk ahead inc… US quantile 0.25 741
#> 8 2023-05-08 1 wk ahead inc… US quantile 0.3 796
#> 9 2023-05-08 1 wk ahead inc… US quantile 0.35 847
#> 10 2023-05-08 1 wk ahead inc… US quantile 0.4 876
#> # ℹ 82 more rows
#> # ℹ 1 more variable: model_id <chr>
Alternatively, depending on the size of the data, it might be quicker to filter the data in two steps:
- get the last forecast date available for the filtered subset.
- use the last forecast date in the filtering query.
last_forecast <- hub_con %>%
filter(output_type == "quantile", location == "US") %>%
pull(forecast_date, as_vector = TRUE) %>%
max()
hub_con %>%
filter(
output_type == "quantile", location == "US",
forecast_date == last_forecast
) %>%
collect()
#> # A tibble: 92 × 8
#> forecast_date horizon target location output_type output_type_id value
#> <date> <int> <chr> <chr> <chr> <chr> <dbl>
#> 1 2023-05-08 1 wk ahead inc… US quantile 0.01 0
#> 2 2023-05-08 1 wk ahead inc… US quantile 0.025 0
#> 3 2023-05-08 1 wk ahead inc… US quantile 0.05 0
#> 4 2023-05-08 1 wk ahead inc… US quantile 0.1 231
#> 5 2023-05-08 1 wk ahead inc… US quantile 0.15 517
#> 6 2023-05-08 1 wk ahead inc… US quantile 0.2 637
#> 7 2023-05-08 1 wk ahead inc… US quantile 0.25 741
#> 8 2023-05-08 1 wk ahead inc… US quantile 0.3 796
#> 9 2023-05-08 1 wk ahead inc… US quantile 0.35 847
#> 10 2023-05-08 1 wk ahead inc… US quantile 0.4 876
#> # ℹ 82 more rows
#> # ℹ 1 more variable: model_id <chr>
Connecting to a model output directory
There is also an option to connect directly to a model output directory without using any metadata in a hub config file. This can be useful when a hub has not been fully configured yet.
The approach does have certain limitations, though. For example, an overall unifying schema cannot be determined from config files, so the ability of open_dataset() to connect to and parse data correctly cannot be guaranteed across files. In addition, only a single file_format dataset can be opened.
model_output_dir <- system.file("testhubs/simple/model-output", package = "hubUtils")
mod_out_con <- connect_model_output(model_output_dir, file_format = "csv")
mod_out_con
#>
#> ── <mod_out_connection/FileSystemDataset> ──
#>
#> • file_format: "csv(3/3)"
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#> "/home/runner/work/_temp/Library/hubUtils/testhubs/simple/model-output"
#>
#> ── Connection schema
#> mod_out_connection with 3 csv files
#> origin_date: date32[day]
#> target: string
#> horizon: int64
#> location: string
#> output_type: string
#> output_type_id: double
#> value: int64
#> model_id: string
mod_out_con %>%
filter(output_type == "quantile", location == "US") %>%
collect()
#> # A tibble: 138 × 8
#> origin_date target horizon location output_type output_type_id value model_id
#> <date> <chr> <int> <chr> <chr> <dbl> <int> <chr>
#> 1 2022-10-08 wk in… 1 US quantile 0.01 135 team1-g…
#> 2 2022-10-08 wk in… 1 US quantile 0.025 137 team1-g…
#> 3 2022-10-08 wk in… 1 US quantile 0.05 139 team1-g…
#> 4 2022-10-08 wk in… 1 US quantile 0.1 140 team1-g…
#> 5 2022-10-08 wk in… 1 US quantile 0.15 141 team1-g…
#> 6 2022-10-08 wk in… 1 US quantile 0.2 141 team1-g…
#> 7 2022-10-08 wk in… 1 US quantile 0.25 142 team1-g…
#> 8 2022-10-08 wk in… 1 US quantile 0.3 143 team1-g…
#> 9 2022-10-08 wk in… 1 US quantile 0.35 144 team1-g…
#> 10 2022-10-08 wk in… 1 US quantile 0.4 145 team1-g…
#> # ℹ 128 more rows
When connecting to a model output directory directly, you can also specify a schema to override the default arrow schema auto-detection. This can at times help to resolve conflicts in data types across different dataset files.
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
model_output_schema <- schema(
origin_date = date32(),
target = string(),
horizon = int32(),
location = string(),
output_type = string(),
output_type_id = string(),
value = int32(),
model_id = string()
)
mod_out_con <- connect_model_output(model_output_dir,
file_format = "csv",
schema = model_output_schema
)
mod_out_con
#>
#> ── <mod_out_connection/FileSystemDataset> ──
#>
#> • file_format: "csv(3/3)"
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#> "/home/runner/work/_temp/Library/hubUtils/testhubs/simple/model-output"
#>
#> ── Connection schema
#> mod_out_connection with 3 csv files
#> origin_date: date32[day]
#> target: string
#> horizon: int32
#> location: string
#> output_type: string
#> output_type_id: string
#> value: int32
model_id: string
Using a schema can, however, also produce new errors which can sometimes be hard to debug. For example, here we define a schema with the output_type field cast as the int32 data type. As the output_type column actually contains character data, which cannot be coerced to integer, connecting to the model output directory produces an arrow error.
model_output_schema <- schema(
origin_date = date32(),
target = string(),
horizon = int32(),
location = string(),
output_type = int32(),
output_type_id = string(),
value = int32(),
model_id = string()
)
mod_out_con <- connect_model_output(model_output_dir,
file_format = "csv",
schema = model_output_schema
)
#> Error in `arrow::open_dataset()`:
#> ! Invalid: No non-null segments were available for field 'model_id'; couldn't infer type
Beware that arrow errors can be somewhat misleading at times, so if you do get such a non-informative error, a good place to start is to check that your schema matches the columns and that your data can be coerced to the data types specified in the schema.
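One practical way to perform that check, sketched below under the assumption that the connection object exposes the underlying arrow dataset's $schema field, is to reopen the directory without a supplied schema and compare arrow's auto-detected types against your own:
# Reopen without a schema to see what arrow infers from the files,
# then compare field by field against the schema that errored.
auto_con <- connect_model_output(model_output_dir, file_format = "csv")
auto_con$schema # auto-detected data types
model_output_schema # the schema that triggered the error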
