Connect to data in a model output directory through a Modeling Hub or directly. Data can be stored in a local directory or in the cloud on AWS or GCS.
Usage
connect_hub(
  hub_path,
  file_format = c("csv", "parquet", "arrow"),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer",
    "logical", "Date"),
  partitions = list(model_id = arrow::utf8()),
  skip_checks = FALSE
)

connect_model_output(
  model_output_dir,
  file_format = c("csv", "parquet", "arrow"),
  partition_names = "model_id",
  schema = NULL,
  skip_checks = FALSE
)
Arguments
- hub_path
Either a character string path to a local Modeling Hub directory or an object of class <SubTreeFileSystem> created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the "Using cloud storage (S3, GCS)" article in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.
- file_format
The file format in which model output files are stored. When connecting to a fully configured hub through hub_path, file_format is inferred from the hub's file_format configuration in admin.json, so the argument can be left unset. If supplied, it overrides the hub configuration setting. Multiple formats can be supplied to connect_hub() but only a single file format can be supplied to connect_model_output().
- output_type_id_datatype
Character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config", which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto", which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When a hub collects only point estimate output types (where output_type_id values are NA), the output_type_id column is assigned a character data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care.
- partitions
A named list specifying the arrow data types of any partitioning columns.
- skip_checks
Logical. If FALSE (default), checks the file_format parameter against the hub's model output files and excludes invalid model output files when opening hub datasets. Setting to TRUE will improve performance but will result in an error if the model output directory includes invalid files. Cannot be TRUE when there are multiple file formats in the hub's model output directory or when the hub's model output directory contains files that are not model output data (for example, a README).
- model_output_dir
Either a character string path to a local directory containing model output data or an object of class <SubTreeFileSystem> created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a directory containing model output data stored in the cloud. For more details consult the "Using cloud storage (S3, GCS)" article in the arrow package.
- partition_names
Character vector that defines the field names to which the nested directory names correspond. Defaults to a single model_id field, which reflects the standard expected structure of a model-output directory.
- schema
An arrow::Schema object for the Dataset. If NULL (the default), the schema will be inferred from the data sources.
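As a rough illustration of the file_format override described above, the sketch below reuses the simple test hub shipped with hubUtils (the same one used in the Examples) and restricts the connection to CSV files. It assumes the hubData and hubUtils packages are installed; the object name csv_con is only illustrative.

```r
library(hubData)

# Path to the example hub bundled with hubUtils (it holds both CSV and parquet files).
hub_path <- system.file("testhubs/simple", package = "hubUtils")

# Supplying file_format overrides the formats configured in the hub's admin.json,
# so only CSV model output files should be opened by this connection.
csv_con <- connect_hub(hub_path, file_format = "csv")
csv_con
```

Leaving file_format unset is usually the simpler choice for a fully configured hub, since the formats are then read from the hub configuration.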
Value
connect_hub() returns an S3 object of class <hub_connection>.
connect_model_output() returns an S3 object of class <mod_out_connection>.
Both objects are connected to the data in the model-output directory via an Apache arrow FileSystemDataset connection. The connection can be used to extract data using custom dplyr queries. The <hub_connection> class also contains modeling hub metadata.
Functions
- connect_hub(): connect to a fully configured Modeling Hub directory.
- connect_model_output(): connect directly to a model-output directory. This function can be used to access data directly from an appropriately set up model output directory which is not part of a fully configured hub.
Examples
# Connect to a local simple forecasting Hub.
hub_path <- system.file("testhubs/simple", package = "hubUtils")
hub_con <- connect_hub(hub_path)
hub_con
#>
#> ── <hub_connection/UnionDataset> ──
#>
#> • hub_name: "Simple Forecast Hub"
#> • hub_path: /home/runner/work/_temp/Library/hubUtils/testhubs/simple
#> • file_format: "csv(3/3)" and "parquet(1/1)"
#> • checks: TRUE
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#> "/home/runner/work/_temp/Library/hubUtils/testhubs/simple/model-output"
#> • config_admin: hub-config/admin.json
#> • config_tasks: hub-config/tasks.json
#>
#> ── Connection schema
#> hub_connection
#> 9 columns
#> origin_date: date32[day]
#> target: string
#> horizon: int32
#> location: string
#> output_type: string
#> output_type_id: double
#> value: int32
#> model_id: string
#> age_group: string
hub_con <- connect_hub(hub_path, output_type_id_datatype = "character")
hub_con
#>
#> ── <hub_connection/UnionDataset> ──
#>
#> • hub_name: "Simple Forecast Hub"
#> • hub_path: /home/runner/work/_temp/Library/hubUtils/testhubs/simple
#> • file_format: "csv(3/3)" and "parquet(1/1)"
#> • checks: TRUE
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#> "/home/runner/work/_temp/Library/hubUtils/testhubs/simple/model-output"
#> • config_admin: hub-config/admin.json
#> • config_tasks: hub-config/tasks.json
#>
#> ── Connection schema
#> hub_connection
#> 9 columns
#> origin_date: date32[day]
#> target: string
#> horizon: int32
#> location: string
#> output_type: string
#> output_type_id: string
#> value: int32
#> model_id: string
#> age_group: string
# Connect directly to a local `model-output` directory
mod_out_path <- system.file("testhubs/simple/model-output", package = "hubUtils")
mod_out_con <- connect_model_output(mod_out_path)
mod_out_con
#>
#> ── <mod_out_connection/FileSystemDataset> ──
#>
#> • file_format: "csv(3/3)"
#> • checks: TRUE
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#> "/home/runner/work/_temp/Library/hubUtils/testhubs/simple/model-output"
#>
#> ── Connection schema
#> mod_out_connection with 3 csv files
#> 8 columns
#> origin_date: date32[day]
#> target: string
#> horizon: int64
#> location: string
#> output_type: string
#> output_type_id: double
#> value: int64
#> model_id: string
# Query hub_connection for data
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
hub_con %>%
  filter(
    origin_date == "2022-10-08",
    horizon == 2
  ) %>%
  collect_hub()
#> # A tibble: 69 × 9
#> model_id origin_date target horizon location age_group output_type
#> * <chr> <date> <chr> <int> <chr> <chr> <chr>
#> 1 hub-baseline 2022-10-08 wk inc flu h… 2 US NA quantile
#> 2 hub-baseline 2022-10-08 wk inc flu h… 2 US NA quantile
#> 3 hub-baseline 2022-10-08 wk inc flu h… 2 US NA quantile
#> 4 hub-baseline 2022-10-08 wk inc flu h… 2 US NA quantile
#> 5 hub-baseline 2022-10-08 wk inc flu h… 2 US NA quantile
#> 6 hub-baseline 2022-10-08 wk inc flu h… 2 US NA quantile
#> 7 hub-baseline 2022-10-08 wk inc flu h… 2 US NA quantile
#> 8 hub-baseline 2022-10-08 wk inc flu h… 2 US NA quantile
#> 9 hub-baseline 2022-10-08 wk inc flu h… 2 US NA quantile
#> 10 hub-baseline 2022-10-08 wk inc flu h… 2 US NA quantile
#> # ℹ 59 more rows
#> # ℹ 2 more variables: output_type_id <chr>, value <int>
mod_out_con %>%
  filter(
    origin_date == "2022-10-08",
    horizon == 2
  ) %>%
  collect_hub()
#> # A tibble: 69 × 8
#> model_id origin_date target horizon location output_type output_type_id value
#> * <chr> <date> <chr> <int> <chr> <chr> <dbl> <int>
#> 1 hub-bas… 2022-10-08 wk in… 2 US quantile 0.01 135
#> 2 hub-bas… 2022-10-08 wk in… 2 US quantile 0.025 137
#> 3 hub-bas… 2022-10-08 wk in… 2 US quantile 0.05 139
#> 4 hub-bas… 2022-10-08 wk in… 2 US quantile 0.1 140
#> 5 hub-bas… 2022-10-08 wk in… 2 US quantile 0.15 141
#> 6 hub-bas… 2022-10-08 wk in… 2 US quantile 0.2 141
#> 7 hub-bas… 2022-10-08 wk in… 2 US quantile 0.25 142
#> 8 hub-bas… 2022-10-08 wk in… 2 US quantile 0.3 143
#> 9 hub-bas… 2022-10-08 wk in… 2 US quantile 0.35 144
#> 10 hub-bas… 2022-10-08 wk in… 2 US quantile 0.4 145
#> # ℹ 59 more rows
# Connect to a simple forecasting Hub stored in an AWS S3 bucket.
if (FALSE) { # \dontrun{
  hub_path <- s3_bucket("hubverse/hubutils/testhubs/simple/")
  hub_con <- connect_hub(hub_path)
  hub_con
} # }
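The queries above filter on origin_date and horizon. As a further hedged sketch, the same pattern can be applied to other columns in the connection schema. The example below assumes the hubData, hubUtils and dplyr packages are installed; the object name us_quantiles is illustrative only.

```r
library(hubData)
library(dplyr)

# Connect to the example hub bundled with hubUtils.
hub_path <- system.file("testhubs/simple", package = "hubUtils")
hub_con <- connect_hub(hub_path)

# The filter is applied to the arrow dataset before any data are read into
# memory; collect_hub() then materialises only the matching rows.
us_quantiles <- hub_con %>%
  filter(
    output_type == "quantile",
    location == "US"
  ) %>%
  collect_hub()

# Models contributing rows that match the filter.
unique(us_quantiles$model_id)
```

Filtering before collecting keeps the amount of data pulled into the R session small, which matters most for hubs stored in the cloud.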