
Open connection to time-series target data
Source: R/connect_target_timeseries.R
connect_target_timeseries.Rd
Usage
connect_target_timeseries(hub_path = ".", date_col = NULL, na = c("NA", ""))
Arguments
- hub_path
Either a character string path to a local Modeling Hub directory or an object of class <SubTreeFileSystem> created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details, consult "Using cloud storage (S3, GCS)" in the arrow package documentation. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.
- date_col
Optional column name to be interpreted as date. Default is NULL. Useful when the required date column is a partitioning column in the target data and does not have the same name as a date-typed task ID variable in the config.
- na
A character vector of strings to interpret as missing values. Only applies to CSV files. The default is c("NA", ""). Useful when actual character string "NA" values are used in the data. In such a case, use empty cells to indicate missing values in your files and set na = "".
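The effect of the na argument can be illustrated with base R alone. The sketch below does not call connect_target_timeseries(); it uses read.csv(), whose na.strings argument plays an analogous role, on a made-up two-row CSV in which "NA" is a genuine location code. Treat it as an illustration of the idea, not of how the package reads files internally.

```r
# Hypothetical CSV: "NA" is a real location code, the empty cell is missing.
csv_text <- "location,observation\nNA,5\nUS,\n"

# Default-style setting: both "NA" and "" are read as missing values.
default_na <- read.csv(
  text = csv_text,
  na.strings = c("NA", ""),
  colClasses = c("character", "numeric")
)

# Stricter setting: only empty cells are read as missing values.
empty_only <- read.csv(
  text = csv_text,
  na.strings = "",
  colClasses = c("character", "numeric")
)

is.na(default_na$location[1])     # the "NA" code has become a missing value
empty_only$location[1]            # the "NA" code is kept as a string
is.na(empty_only$observation[2])  # the empty cell is still missing
```

With the stricter setting, missing observations have to be written as empty cells in the file, which matches the advice given for na = "" above.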
Details
If the target data is split across multiple files in a time-series directory, all files must share the same file format, either csv or parquet. No other types of files are currently allowed in a time-series directory.
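Before connecting, it can be handy to confirm that a time-series directory satisfies this rule. The following is a minimal sketch in base R; has_single_format() is a hypothetical helper written for this illustration, not a function exported by the package, and the directories are throwaway ones created under a temporary path.

```r
# Hypothetical helper: TRUE when every file under `dir` (including files in
# partition sub-directories) has the same extension and that extension is
# csv or parquet.
has_single_format <- function(dir) {
  files <- list.files(dir, recursive = TRUE)
  exts <- unique(tolower(tools::file_ext(files)))
  length(exts) == 1 && exts %in% c("csv", "parquet")
}

# Build two throwaway directories to try it on.
ok_dir <- tempfile("ts-ok-")
mixed_dir <- tempfile("ts-mixed-")
dir.create(ok_dir)
dir.create(mixed_dir)
invisible(file.create(file.path(ok_dir, c("a.parquet", "b.parquet"))))
invisible(file.create(file.path(mixed_dir, c("a.csv", "b.parquet"))))

has_single_format(ok_dir)     # all parquet
has_single_format(mixed_dir)  # csv mixed with parquet
```

An empty directory also fails the check, because no single extension can be identified.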
Examples
# Clone example hub
tmp_hub_path <- withr::local_tempdir()
example_hub <- "https://github.com/hubverse-org/example-complex-forecast-hub.git"
gert::git_clone(url = example_hub, path = tmp_hub_path)
# Connect to time-series data
ts_con <- connect_target_timeseries(tmp_hub_path)
ts_con
#> target_timeseries with 1 csv file
#> 4 columns
#> date: date32[day]
#> target: string
#> location: string
#> observation: double
# Collect all time-series data
ts_con |> dplyr::collect()
#> # A tibble: 20,510 × 4
#>    date       target          location observation
#>    <date>     <chr>           <chr>          <dbl>
#>  1 2020-01-11 wk inc flu hosp 01                 0
#>  2 2020-01-11 wk inc flu hosp 15                 0
#>  3 2020-01-11 wk inc flu hosp 18                 0
#>  4 2020-01-11 wk inc flu hosp 27                 0
#>  5 2020-01-11 wk inc flu hosp 30                 0
#>  6 2020-01-11 wk inc flu hosp 37                 0
#>  7 2020-01-11 wk inc flu hosp 48                 0
#>  8 2020-01-11 wk inc flu hosp US                 1
#>  9 2020-01-18 wk inc flu hosp 01                 0
#> 10 2020-01-18 wk inc flu hosp 15                 0
#> # ℹ 20,500 more rows
# Filter for a specific date before collecting
ts_con |>
  dplyr::filter(date == "2020-01-11") |>
  dplyr::collect()
#> # A tibble: 16 × 4
#>    date       target           location observation
#>    <date>     <chr>            <chr>          <dbl>
#>  1 2020-01-11 wk inc flu hosp  01                 0
#>  2 2020-01-11 wk inc flu hosp  15                 0
#>  3 2020-01-11 wk inc flu hosp  18                 0
#>  4 2020-01-11 wk inc flu hosp  27                 0
#>  5 2020-01-11 wk inc flu hosp  30                 0
#>  6 2020-01-11 wk inc flu hosp  37                 0
#>  7 2020-01-11 wk inc flu hosp  48                 0
#>  8 2020-01-11 wk inc flu hosp  US                 1
#>  9 2020-01-11 wk flu hosp rate 01                 0
#> 10 2020-01-11 wk flu hosp rate 15                 0
#> 11 2020-01-11 wk flu hosp rate 18                 0
#> 12 2020-01-11 wk flu hosp rate 27                 0
#> 13 2020-01-11 wk flu hosp rate 30                 0
#> 14 2020-01-11 wk flu hosp rate 37                 0
#> 15 2020-01-11 wk flu hosp rate 48                 0
#> 16 2020-01-11 wk flu hosp rate US          0.000301
# Filter for a specific location before collecting
ts_con |>
  dplyr::filter(location == "US") |>
  dplyr::collect()
#> # A tibble: 402 × 4
#>    date       target          location observation
#>    <date>     <chr>           <chr>          <dbl>
#>  1 2020-01-11 wk inc flu hosp US                 1
#>  2 2020-01-18 wk inc flu hosp US                 0
#>  3 2020-01-25 wk inc flu hosp US                 0
#>  4 2020-02-01 wk inc flu hosp US                 0
#>  5 2020-02-08 wk inc flu hosp US                 0
#>  6 2020-02-15 wk inc flu hosp US                 0
#>  7 2020-02-22 wk inc flu hosp US                 0
#>  8 2020-02-29 wk inc flu hosp US                 0
#>  9 2020-03-07 wk inc flu hosp US                 0
#> 10 2020-03-14 wk inc flu hosp US                 0
#> # ℹ 392 more rows
# Access target time-series data from a cloud hub
s3_hub_path <- s3_bucket("example-complex-forecast-hub")
s3_con <- connect_target_timeseries(s3_hub_path)
s3_con
#> target_timeseries with 1 csv file
#> 4 columns
#> date: date32[day]
#> target: string
#> location: string
#> observation: double
s3_con |> dplyr::collect()
#> # A tibble: 20,510 × 4
#>    date       target          location observation
#>    <date>     <chr>           <chr>          <dbl>
#>  1 2020-01-11 wk inc flu hosp 01                 0
#>  2 2020-01-11 wk inc flu hosp 15                 0
#>  3 2020-01-11 wk inc flu hosp 18                 0
#>  4 2020-01-11 wk inc flu hosp 27                 0
#>  5 2020-01-11 wk inc flu hosp 30                 0
#>  6 2020-01-11 wk inc flu hosp 37                 0
#>  7 2020-01-11 wk inc flu hosp 48                 0
#>  8 2020-01-11 wk inc flu hosp US                 1
#>  9 2020-01-18 wk inc flu hosp 01                 0
#> 10 2020-01-18 wk inc flu hosp 15                 0
#> # ℹ 20,500 more rows