[Experimental] Open the time-series target data file(s) in a hub as an arrow dataset.

Usage

connect_target_timeseries(
  hub_path = ".",
  date_col = NULL,
  na = c("NA", ""),
  ignore_files = NULL
)

Arguments

hub_path

Either a character string path to a local Modeling Hub directory or an object of class <SubTreeFileSystem> pointing to a Modeling Hub directory stored in the cloud. Create such an object with s3_bucket() or gs_bucket(), supplying an S3 or GCS bucket name or path as a string. For more details, consult the article Using cloud storage (S3, GCS) in the arrow package documentation. The hub must be fully configured, with valid admin.json and tasks.json files in the hub-config directory.

date_col

Optional name of a column to be interpreted as a date. Default is NULL. Useful when the required date column is a partitioning column in the target data and does not have the same name as a date-typed task ID variable in the config.

na

A character vector of strings to interpret as missing values. Only applies to CSV files. The default is c("NA", ""). Override it when the data contain the literal character string "NA" as a real value: in that case, use empty cells to indicate missing values in your files and set na = "".
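
The effect of the two settings can be sketched with base R alone. This is only an illustration under stated assumptions: read.csv() and its na.strings argument stand in for the CSV reader that connect_target_timeseries() uses internally, and the two-row table with a location code of "NA" is made up for the example.

```r
# Made-up CSV: the first location is the literal code "NA",
# and the second observation is an empty (missing) cell.
csv_text <- "location,observation\nNA,5\nUS,\n"

# Equivalent of the default na = c("NA", ""): the code "NA" is lost.
default_na <- read.csv(
  text = csv_text,
  na.strings = c("NA", ""),
  colClasses = c("character", "numeric")
)

# Equivalent of na = "": only empty cells count as missing,
# so the literal "NA" location survives as a string.
literal_na <- read.csv(
  text = csv_text,
  na.strings = "",
  colClasses = c("character", "numeric")
)

is.na(default_na$location[1])  # TRUE: "NA" was read as missing
literal_na$location            # "NA" "US"
is.na(literal_na$observation)  # FALSE TRUE: the empty cell is still missing
```

With a hub whose CSV files follow this convention, the corresponding call would be connect_target_timeseries(hub_path, na = "").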

ignore_files

A character vector of file names (not paths) or file prefixes to ignore when discovering target data files to include in dataset connections. Parent directory names should not be included. Common non-data files such as "README" and ".DS_Store" are ignored automatically, but additional files can be excluded by specifying them here.

Value

An arrow dataset object of subclass <target_timeseries>.

Details

If the target data is split across multiple files in a time-series directory, all files must share the same file format, either csv or parquet. No other types of files are currently allowed in a time-series directory.
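
The single-format rule can be checked before connecting. The sketch below is a hypothetical helper written in base R, not a hubData function, and the file names passed to it are invented. It only inspects file extensions, which is an assumption about how the format would be recognised.

```r
# Hypothetical helper (not part of hubData): return the shared format of a
# set of time-series files, or fail if the files mix formats or use an
# extension other than csv or parquet.
timeseries_format <- function(files) {
  ext <- unique(tolower(tools::file_ext(files)))
  if (length(ext) != 1L || !ext %in% c("csv", "parquet")) {
    stop(
      "time-series files must all be csv or all be parquet; found: ",
      paste(ext, collapse = ", ")
    )
  }
  ext
}

# All files share one allowed format, so the format is returned.
timeseries_format(c("part-1.parquet", "part-2.parquet"))

# Mixed formats are rejected; try() keeps the script running.
mixed <- try(timeseries_format(c("part-1.csv", "part-2.parquet")), silent = TRUE)
inherits(mixed, "try-error")
```

In practice the helper could be applied to the output of list.files() on the time-series directory, after dropping any files that would be passed to ignore_files.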

Examples

hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
# Connect to time-series data
ts_con <- connect_target_timeseries(hub_path)
ts_con
#> target_timeseries with 1 csv file
#> 4 columns
#> target_end_date: date32[day]
#> target: string
#> location: string
#> observation: double
# Collect all time-series data
ts_con |> dplyr::collect()
#> # A tibble: 66 × 4
#>    target_end_date target          location observation
#>    <date>          <chr>           <chr>          <dbl>
#>  1 2022-10-22      wk inc flu hosp 02                 3
#>  2 2022-10-22      wk inc flu hosp 01               141
#>  3 2022-10-22      wk inc flu hosp US              2380
#>  4 2022-10-29      wk inc flu hosp 02                14
#>  5 2022-10-29      wk inc flu hosp 01               262
#>  6 2022-10-29      wk inc flu hosp US              4353
#>  7 2022-11-05      wk inc flu hosp 02                10
#>  8 2022-11-05      wk inc flu hosp 01               360
#>  9 2022-11-05      wk inc flu hosp US              6571
#> 10 2022-11-12      wk inc flu hosp 02                20
#> # ℹ 56 more rows
# Filter for a specific date before collecting
ts_con |>
  dplyr::filter(target_end_date == "2022-12-31") |>
  dplyr::collect()
#> # A tibble: 6 × 4
#>   target_end_date target           location observation
#>   <date>          <chr>            <chr>          <dbl>
#> 1 2022-12-31      wk inc flu hosp  02             44   
#> 2 2022-12-31      wk inc flu hosp  01            140   
#> 3 2022-12-31      wk inc flu hosp  US          19369   
#> 4 2022-12-31      wk flu hosp rate 02              6.18
#> 5 2022-12-31      wk flu hosp rate 01              2.76
#> 6 2022-12-31      wk flu hosp rate US              5.83
# Filter for a specific location before collecting
ts_con |>
  dplyr::filter(location == "US") |>
  dplyr::collect()
#> # A tibble: 22 × 4
#>    target_end_date target          location observation
#>    <date>          <chr>           <chr>          <dbl>
#>  1 2022-10-22      wk inc flu hosp US              2380
#>  2 2022-10-29      wk inc flu hosp US              4353
#>  3 2022-11-05      wk inc flu hosp US              6571
#>  4 2022-11-12      wk inc flu hosp US              8848
#>  5 2022-11-19      wk inc flu hosp US             11427
#>  6 2022-11-26      wk inc flu hosp US             19846
#>  7 2022-12-03      wk inc flu hosp US             26333
#>  8 2022-12-10      wk inc flu hosp US             23851
#>  9 2022-12-17      wk inc flu hosp US             21435
#> 10 2022-12-24      wk inc flu hosp US             19286
#> # ℹ 12 more rows
# Access target time-series data from a cloud hub
s3_hub_path <- s3_bucket("example-complex-forecast-hub")
s3_con <- connect_target_timeseries(s3_hub_path)
s3_con
#> target_timeseries with 1 csv file
#> 4 columns
#> target_end_date: date32[day]
#> target: string
#> location: string
#> observation: double
s3_con |> dplyr::collect()
#> # A tibble: 20,510 × 4
#>    target_end_date target          location observation
#>    <date>          <chr>           <chr>          <dbl>
#>  1 2020-01-11      wk inc flu hosp 01                 0
#>  2 2020-01-11      wk inc flu hosp 15                 0
#>  3 2020-01-11      wk inc flu hosp 18                 0
#>  4 2020-01-11      wk inc flu hosp 27                 0
#>  5 2020-01-11      wk inc flu hosp 30                 0
#>  6 2020-01-11      wk inc flu hosp 37                 0
#>  7 2020-01-11      wk inc flu hosp 48                 0
#>  8 2020-01-11      wk inc flu hosp US                 1
#>  9 2020-01-18      wk inc flu hosp 01                 0
#> 10 2020-01-18      wk inc flu hosp 15                 0
#> # ℹ 20,500 more rows