
Open connection to oracle-output target data
Source: R/connect_target_oracle.R
Arguments
- hub_path
Either a character string path to a local Modeling Hub directory or an object of class <SubTreeFileSystem> created using the functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) article in the arrow package documentation. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.
- na
A character vector of strings to interpret as missing values. Only applies to CSV files. The default is c("NA", ""). Useful when actual character string "NA" values are used in the data. In such a case, use empty cells to indicate missing values in your files and set na = "".
.- ignore_files
A character vector of file names (not paths) or file prefixes to ignore when discovering model output files to include in dataset connections. Parent directory names should not be included. Common non-data files such as
"README"
and".DS_Store"
are ignored automatically, but additional files can be excluded by specifying them here.- output_type_id_datatype
A character string. One of "from_config", "auto", "character", "double", "integer", "logical" or "Date". Defaults to "from_config", which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto", which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_id values are NA) are being collected by a hub, the auto-determined output_type_id column is assigned a character data type. The other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care. A usage sketch combining these optional arguments follows the list.
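As a rough sketch, the optional arguments above can be combined in a single call. Here hub_path stands in for any fully configured hub path and the ignore_files value is a hypothetical file prefix:

hub_path <- "path/to/hub" # any local or cloud hub path
oo_con <- connect_target_oracle_output(
  hub_path,
  na = "", # only empty cells are treated as missing
  ignore_files = "scratch-notes", # hypothetical non-data file prefix to skip
  output_type_id_datatype = "character" # override automatic determination
)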
Details
If the target data is split across multiple files in an oracle-output directory, all files must share the same file format, either csv or parquet. No other file types are currently allowed in an oracle-output directory.
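For example, a valid split layout, typically found under the hub's target-data directory, might look like this (file names hypothetical, all sharing the parquet format):

target-data/oracle-output/
├── oracle-output-2022.parquet
└── oracle-output-2023.parquet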
Examples
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
# Connect to oracle-output data
oo_con <- connect_target_oracle_output(hub_path)
oo_con
#> target_oracle_output with 1 csv file
#> 6 columns
#> location: string
#> target_end_date: date32[day]
#> target: string
#> output_type: string
#> output_type_id: string
#> oracle_value: double
# Collect all oracle-output data
oo_con |> dplyr::collect()
#> # A tibble: 627 × 6
#> location target_end_date target output_type output_type_id oracle_value
#> <chr> <date> <chr> <chr> <chr> <dbl>
#> 1 US 2022-10-22 wk flu hosp… cdf 1 1
#> 2 US 2022-10-22 wk flu hosp… cdf 2 1
#> 3 US 2022-10-22 wk flu hosp… cdf 3 1
#> 4 US 2022-10-22 wk flu hosp… cdf 4 1
#> 5 US 2022-10-22 wk flu hosp… cdf 5 1
#> 6 US 2022-10-22 wk flu hosp… cdf 6 1
#> 7 US 2022-10-22 wk flu hosp… cdf 7 1
#> 8 US 2022-10-22 wk flu hosp… cdf 8 1
#> 9 US 2022-10-22 wk flu hosp… cdf 9 1
#> 10 US 2022-10-22 wk flu hosp… cdf 10 1
#> # ℹ 617 more rows
# Filter for a specific date before collecting
oo_con |>
  dplyr::filter(target_end_date == "2022-12-31") |>
  dplyr::collect()
#> # A tibble: 57 × 6
#> location target_end_date target output_type output_type_id oracle_value
#> <chr> <date> <chr> <chr> <chr> <dbl>
#> 1 US 2022-12-31 wk flu hosp… cdf 1 0
#> 2 US 2022-12-31 wk flu hosp… cdf 2 0
#> 3 US 2022-12-31 wk flu hosp… cdf 3 0
#> 4 US 2022-12-31 wk flu hosp… cdf 4 0
#> 5 US 2022-12-31 wk flu hosp… cdf 5 0
#> 6 US 2022-12-31 wk flu hosp… cdf 6 1
#> 7 US 2022-12-31 wk flu hosp… cdf 7 1
#> 8 US 2022-12-31 wk flu hosp… cdf 8 1
#> 9 US 2022-12-31 wk flu hosp… cdf 9 1
#> 10 US 2022-12-31 wk flu hosp… cdf 10 1
#> # ℹ 47 more rows
# Filter for a specific location before collecting
oo_con |>
  dplyr::filter(location == "US") |>
  dplyr::collect()
#> # A tibble: 209 × 6
#> location target_end_date target output_type output_type_id oracle_value
#> <chr> <date> <chr> <chr> <chr> <dbl>
#> 1 US 2022-10-22 wk flu hosp… cdf 1 1
#> 2 US 2022-10-22 wk flu hosp… cdf 2 1
#> 3 US 2022-10-22 wk flu hosp… cdf 3 1
#> 4 US 2022-10-22 wk flu hosp… cdf 4 1
#> 5 US 2022-10-22 wk flu hosp… cdf 5 1
#> 6 US 2022-10-22 wk flu hosp… cdf 6 1
#> 7 US 2022-10-22 wk flu hosp… cdf 7 1
#> 8 US 2022-10-22 wk flu hosp… cdf 8 1
#> 9 US 2022-10-22 wk flu hosp… cdf 9 1
#> 10 US 2022-10-22 wk flu hosp… cdf 10 1
#> # ℹ 199 more rows
# Get distinct target_end_date values
oo_con |>
  dplyr::distinct(target_end_date) |>
  dplyr::pull(as_vector = TRUE)
#> [1] "2022-10-22" "2022-10-29" "2022-11-05" "2022-11-12" "2022-11-19"
#> [6] "2022-11-26" "2022-12-03" "2022-12-10" "2022-12-17" "2022-12-24"
#> [11] "2022-12-31"
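# Any other dplyr verbs supported by arrow can be chained the same way before
# collecting. A sketch combining two filters (output not shown):
oo_con |>
  dplyr::filter(output_type == "cdf", location == "US") |>
  dplyr::collect()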
# Access target oracle-output data from a cloud hub
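# s3_bucket() comes from the arrow package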
s3_hub_path <- s3_bucket("example-complex-forecast-hub")
s3_con <- connect_target_oracle_output(s3_hub_path)
s3_con
#> target_oracle_output with 1 csv file
#> 6 columns
#> location: string
#> target_end_date: date32[day]
#> target: string
#> output_type: string
#> output_type_id: string
#> oracle_value: double
s3_con |> dplyr::collect()
#> # A tibble: 200,340 × 6
#> location target_end_date target output_type output_type_id oracle_value
#> <chr> <date> <chr> <chr> <chr> <dbl>
#> 1 US 2022-10-22 wk inc flu … quantile NA 2380
#> 2 01 2022-10-22 wk inc flu … quantile NA 141
#> 3 02 2022-10-22 wk inc flu … quantile NA 3
#> 4 04 2022-10-22 wk inc flu … quantile NA 22
#> 5 05 2022-10-22 wk inc flu … quantile NA 50
#> 6 06 2022-10-22 wk inc flu … quantile NA 124
#> 7 08 2022-10-22 wk inc flu … quantile NA 15
#> 8 09 2022-10-22 wk inc flu … quantile NA 9
#> 9 10 2022-10-22 wk inc flu … quantile NA 1
#> 10 11 2022-10-22 wk inc flu … quantile NA 8
#> # ℹ 200,330 more rows