Connect to data in a model output directory through a Modeling Hub or directly. Data can be stored in a local directory or in the cloud on AWS or GCS.
Usage
connect_hub(
  hub_path,
  file_format = c("csv", "parquet", "arrow"),
  output_type_id_datatype = c("auto", "character", "double", "integer", "logical",
    "Date"),
  partitions = list(model_id = arrow::utf8())
)
connect_model_output(
  model_output_dir,
  file_format = c("csv", "parquet", "arrow"),
  partition_names = "model_id",
  schema = NULL
)
Arguments
- hub_path
Either a character string path to a local Modeling Hub directory or an object of class <SubTreeFileSystem> created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details, consult the "Using cloud storage (S3, GCS)" vignette in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.
- file_format
The file format model output files are stored in. For connection to a fully configured hub, accessed through hub_path, the format is inferred from the hub's file_format configuration in admin.json, so this argument is ignored by default. If supplied, it overrides the hub configuration setting. Multiple formats can be supplied to connect_hub() but only a single file format can be supplied to connect_model_output().
- output_type_id_datatype
Character string. One of "auto", "character", "double", "integer", "logical", "Date". Defaults to "auto", indicating that the data type of output_type_id will be determined automatically from the tasks.json config file. Other values override automatic determination. Note that attempting to coerce output_type_id to a data type that is not possible (e.g. trying to coerce to "double" when the data contains "character" values) will likely result in an error or unexpected behaviour, so use with care.
- partitions
A named list specifying the arrow data types of any partitioning columns.
- model_output_dir
Either a character string path to a local directory containing model output data or an object of class <SubTreeFileSystem> created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a directory containing model output data stored in the cloud. For more details, consult the "Using cloud storage (S3, GCS)" vignette in the arrow package.
- partition_names
Character vector that defines the field names to which recursive directory names correspond. Defaults to a single model_id field, which reflects the standard expected structure of a model-output directory.
- schema
An arrow::Schema object for the Dataset. If NULL (the default), the schema will be inferred from the data sources.
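To illustrate what partition_names describes, here is a minimal sketch that uses arrow directly rather than connect_model_output(). It assumes the arrow and dplyr packages are installed; the directory name, model names and values are hypothetical, and how connect_model_output() passes partition names to arrow internally is not shown on this page.

```r
library(arrow)
library(dplyr)

# Build a throwaway model-output style directory: one sub-directory per model.
# Directory, model names and values are hypothetical demo data.
model_output_dir <- file.path(tempdir(), "model-output-demo")
for (model in c("teamA-model1", "teamB-model2")) {
  dir.create(file.path(model_output_dir, model),
    recursive = TRUE, showWarnings = FALSE
  )
  write.csv(
    data.frame(
      origin_date = "2022-10-08",
      horizon = 1:2,
      output_type = "quantile",
      output_type_id = 0.5,
      value = c(10L, 12L)
    ),
    file.path(model_output_dir, model, "2022-10-08.csv"),
    row.names = FALSE
  )
}

# The directory level below `model_output_dir` is exposed as a `model_id` column.
ds <- open_dataset(model_output_dir, format = "csv", partitioning = "model_id")
res <- ds %>%
  filter(horizon == 2) %>%
  collect()
sort(unique(res$model_id))
```

The model_id column does not exist in any of the files; it is derived from the sub-directory names, which is why a model-output directory needs a consistent layout.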
Value
connect_hub() returns an S3 object of class <hub_connection>. connect_model_output() returns an S3 object of class <mod_out_connection>.
Both objects are connected to the data in the model-output directory via an
Apache arrow FileSystemDataset connection.
The connection can be used to extract data using custom dplyr queries. The
<hub_connection> class also contains modeling hub metadata.
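dplyr queries against an arrow dataset are lazy: verbs such as filter() only build a query, and data is read when collect() is called. The sketch below shows this with a plain arrow dataset rather than a hub connection; it assumes arrow and dplyr are installed, and the directory and values are hypothetical demo data.

```r
library(arrow)
library(dplyr)

# Write a small CSV file to a temporary directory (hypothetical demo data).
demo_dir <- file.path(tempdir(), "lazy-query-demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(
  data.frame(horizon = c(1L, 2L, 2L), value = c(135L, 137L, 139L)),
  file.path(demo_dir, "demo.csv"),
  row.names = FALSE
)

ds <- open_dataset(demo_dir, format = "csv")

# Building the query does not read any rows yet.
query <- ds %>%
  filter(horizon == 2) %>%
  select(horizon, value)
class(query)

# collect() runs the query and returns a tibble.
result <- collect(query)
result
```

Filtering before collect() keeps the amount of data loaded into memory small, which matters most for large or cloud-hosted hubs.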
Functions
- connect_hub(): connect to a fully configured Modeling Hub directory.
- connect_model_output(): connect directly to a model-output directory. This function can be used to access data directly from an appropriately set up model output directory which is not part of a fully configured hub.
Examples
# Connect to a local simple forecasting Hub.
hub_path <- system.file("testhubs/simple", package = "hubUtils")
hub_con <- connect_hub(hub_path)
hub_con
#>
#> ── <hub_connection/UnionDataset> ──
#>
#> • hub_name: "Simple Forecast Hub"
#> • hub_path: /home/runner/work/_temp/Library/hubUtils/testhubs/simple
#> • file_format: "csv(3/3)" and "parquet(1/1)"
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#> "/home/runner/work/_temp/Library/hubUtils/testhubs/simple/model-output"
#> • config_admin: hub-config/admin.json
#> • config_tasks: hub-config/tasks.json
#>
#> ── Connection schema
#> hub_connection
#> origin_date: date32[day]
#> target: string
#> horizon: int32
#> location: string
#> output_type: string
#> output_type_id: double
#> value: int32
#> model_id: string
#> age_group: string
hub_con <- connect_hub(hub_path, output_type_id_datatype = "character")
hub_con
#>
#> ── <hub_connection/UnionDataset> ──
#>
#> • hub_name: "Simple Forecast Hub"
#> • hub_path: /home/runner/work/_temp/Library/hubUtils/testhubs/simple
#> • file_format: "csv(3/3)" and "parquet(1/1)"
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#> "/home/runner/work/_temp/Library/hubUtils/testhubs/simple/model-output"
#> • config_admin: hub-config/admin.json
#> • config_tasks: hub-config/tasks.json
#>
#> ── Connection schema
#> hub_connection
#> origin_date: date32[day]
#> target: string
#> horizon: int32
#> location: string
#> output_type: string
#> output_type_id: string
#> value: int32
#> model_id: string
#> age_group: string
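The caution attached to output_type_id_datatype can be reproduced with arrow alone. This is a minimal sketch, assuming the arrow package is installed; it treats the override as an arrow type cast, which is an assumption about the internals rather than something this page states, and the values are hypothetical.

```r
library(arrow)

# Strings that all look like numbers cast cleanly to double.
numeric_like <- Array$create(c("0.5", "0.975"))
as.vector(numeric_like$cast(float64()))

# A non-numeric value such as "mean" makes the same cast fail;
# tryCatch() captures the error message instead of stopping.
mixed <- Array$create(c("0.5", "mean"))
cast_result <- tryCatch(
  mixed$cast(float64()),
  error = function(e) conditionMessage(e)
)
cast_result
```

When a hub mixes numeric and non-numeric output type IDs, "character" is the data type that can represent all of them.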
# Connect directly to a local `model-output` directory
mod_out_path <- system.file("testhubs/simple/model-output", package = "hubUtils")
mod_out_con <- connect_model_output(mod_out_path)
mod_out_con
#>
#> ── <mod_out_connection/FileSystemDataset> ──
#>
#> • file_format: "csv(3/3)"
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#> "/home/runner/work/_temp/Library/hubUtils/testhubs/simple/model-output"
#>
#> ── Connection schema
#> mod_out_connection with 3 csv files
#> origin_date: date32[day]
#> target: string
#> horizon: int64
#> location: string
#> output_type: string
#> output_type_id: double
#> value: int64
#> model_id: string
# Query hub_connection for data
library(dplyr)
hub_con %>%
  filter(
    origin_date == "2022-10-08",
    horizon == 2
  ) %>%
  collect()
#> # A tibble: 69 × 9
#> origin_date target horizon location output_type output_type_id value model_id
#> <date> <chr> <int> <chr> <chr> <chr> <int> <chr>
#> 1 2022-10-08 wk in… 2 US quantile 0.01 135 hub-bas…
#> 2 2022-10-08 wk in… 2 US quantile 0.025 137 hub-bas…
#> 3 2022-10-08 wk in… 2 US quantile 0.05 139 hub-bas…
#> 4 2022-10-08 wk in… 2 US quantile 0.1 140 hub-bas…
#> 5 2022-10-08 wk in… 2 US quantile 0.15 141 hub-bas…
#> 6 2022-10-08 wk in… 2 US quantile 0.2 141 hub-bas…
#> 7 2022-10-08 wk in… 2 US quantile 0.25 142 hub-bas…
#> 8 2022-10-08 wk in… 2 US quantile 0.3 143 hub-bas…
#> 9 2022-10-08 wk in… 2 US quantile 0.35 144 hub-bas…
#> 10 2022-10-08 wk in… 2 US quantile 0.4 145 hub-bas…
#> # ℹ 59 more rows
#> # ℹ 1 more variable: age_group <chr>
mod_out_con %>%
  filter(
    origin_date == "2022-10-08",
    horizon == 2
  ) %>%
  collect()
#> # A tibble: 69 × 8
#> origin_date target horizon location output_type output_type_id value model_id
#> <date> <chr> <int> <chr> <chr> <dbl> <int> <chr>
#> 1 2022-10-08 wk in… 2 US quantile 0.01 135 hub-bas…
#> 2 2022-10-08 wk in… 2 US quantile 0.025 137 hub-bas…
#> 3 2022-10-08 wk in… 2 US quantile 0.05 139 hub-bas…
#> 4 2022-10-08 wk in… 2 US quantile 0.1 140 hub-bas…
#> 5 2022-10-08 wk in… 2 US quantile 0.15 141 hub-bas…
#> 6 2022-10-08 wk in… 2 US quantile 0.2 141 hub-bas…
#> 7 2022-10-08 wk in… 2 US quantile 0.25 142 hub-bas…
#> 8 2022-10-08 wk in… 2 US quantile 0.3 143 hub-bas…
#> 9 2022-10-08 wk in… 2 US quantile 0.35 144 hub-bas…
#> 10 2022-10-08 wk in… 2 US quantile 0.4 145 hub-bas…
#> # ℹ 59 more rows
# Connect to a simple forecasting Hub stored in an AWS S3 bucket.
if (FALSE) {
  hub_path <- s3_bucket("hubverse/hubutils/testhubs/simple/")
  hub_con <- connect_hub(hub_path)
  hub_con
}
