Read a single target data file
Arguments
- target_file_path
Character string. Path to the target data file being validated, relative to the hub's target-data directory.
- hub_path
Either a character string path to a local Modeling Hub directory or an object of class <SubTreeFileSystem> created using the functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details, consult the "Using cloud storage (S3, GCS)" article in the arrow package. The hub must be fully configured, with valid admin.json and tasks.json files within the hub-config directory.
- coerce_types
Character string. What to coerce column types to on read.
  - hub: (default) read in (csv) or coerce (parquet) to the schema defined by the hub config (see hubData::create_timeseries_schema() and hubData::create_oracle_output_schema() for details). When coercing data types using the hub schema, output_type_id_datatype can also be used to set the output_type_id column data type manually.
  - chr: read in (csv) or coerce (parquet) all columns to character.
  - none: No coercion. Use arrow read_* function defaults.
- date_col
Optional column name to be interpreted as a date. Default is NULL. Useful when the required date column is a partitioning column in the target data and does not have the same name as a date-typed task ID variable in the config.
- na
A character vector of strings to interpret as missing values. Only applies to CSV files. The default is c("NA", ""). Useful when actual character string "NA" values are used in the data. In such a case, use empty cells to indicate missing values in your files and set na = "".
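The effect of the na setting can be illustrated with a small base R analogy. This sketch does not call read_target_file() and makes no claim about how it is implemented. It assumes only that na behaves like the na.strings argument of read.csv(). The toy CSV text, including a location code that happens to be the literal string "NA", is made up for illustration.

```r
# Base R analogy only: read.csv()'s na.strings stands in for the `na` argument.
# The CSV content below is hypothetical toy data, not taken from any hub.
csv_text <- "location,observation\nNA,5\nUS,\n"
col_types <- c("character", "numeric")

# Default-like setting: both "NA" and empty cells are treated as missing.
default_na <- read.csv(
  text = csv_text, na.strings = c("NA", ""), colClasses = col_types
)

# Empty cells only: the literal string "NA" is kept as a real value.
empty_only <- read.csv(
  text = csv_text, na.strings = "", colClasses = col_types
)

is.na(default_na$location) # the "NA" location code is lost as missing
is.na(empty_only$location) # the "NA" location code survives as text
is.na(empty_only$observation) # the empty cell is still read as missing
```

With the second setting, missing observations are still detected through the empty cell, while the "NA" code is preserved, which is the situation the na = "" recommendation is meant for.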
Examples
# download example hub
hub_path <- withr::local_tempdir()
example_hub <- "https://github.com/hubverse-org/example-complex-forecast-hub.git"
gert::git_clone(url = example_hub, path = hub_path)
# read in time-series file
read_target_file("time-series.csv", hub_path)
#> # A tibble: 20,510 × 4
#> date target location observation
#> <date> <chr> <chr> <dbl>
#> 1 2020-01-11 wk inc flu hosp 01 0
#> 2 2020-01-11 wk inc flu hosp 15 0
#> 3 2020-01-11 wk inc flu hosp 18 0
#> 4 2020-01-11 wk inc flu hosp 27 0
#> 5 2020-01-11 wk inc flu hosp 30 0
#> 6 2020-01-11 wk inc flu hosp 37 0
#> 7 2020-01-11 wk inc flu hosp 48 0
#> 8 2020-01-11 wk inc flu hosp US 1
#> 9 2020-01-18 wk inc flu hosp 01 0
#> 10 2020-01-18 wk inc flu hosp 15 0
#> # ℹ 20,500 more rows
read_target_file("time-series.csv", hub_path, coerce_types = "chr")
#> # A tibble: 20,510 × 4
#> date target location observation
#> <chr> <chr> <chr> <chr>
#> 1 2020-01-11 wk inc flu hosp 01 0
#> 2 2020-01-11 wk inc flu hosp 15 0
#> 3 2020-01-11 wk inc flu hosp 18 0
#> 4 2020-01-11 wk inc flu hosp 27 0
#> 5 2020-01-11 wk inc flu hosp 30 0
#> 6 2020-01-11 wk inc flu hosp 37 0
#> 7 2020-01-11 wk inc flu hosp 48 0
#> 8 2020-01-11 wk inc flu hosp US 1
#> 9 2020-01-18 wk inc flu hosp 01 0
#> 10 2020-01-18 wk inc flu hosp 15 0
#> # ℹ 20,500 more rows
# read in oracle-output file
read_target_file("oracle-output.csv", hub_path)
#> # A tibble: 200,340 × 6
#> location target_end_date target output_type output_type_id oracle_value
#> <chr> <date> <chr> <chr> <chr> <dbl>
#> 1 US 2022-10-22 wk inc flu … quantile NA 2380
#> 2 01 2022-10-22 wk inc flu … quantile NA 141
#> 3 02 2022-10-22 wk inc flu … quantile NA 3
#> 4 04 2022-10-22 wk inc flu … quantile NA 22
#> 5 05 2022-10-22 wk inc flu … quantile NA 50
#> 6 06 2022-10-22 wk inc flu … quantile NA 124
#> 7 08 2022-10-22 wk inc flu … quantile NA 15
#> 8 09 2022-10-22 wk inc flu … quantile NA 9
#> 9 10 2022-10-22 wk inc flu … quantile NA 1
#> 10 11 2022-10-22 wk inc flu … quantile NA 8
#> # ℹ 200,330 more rows
read_target_file("oracle-output.csv", hub_path, coerce_types = "chr")
#> # A tibble: 200,340 × 6
#> location target_end_date target output_type output_type_id oracle_value
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 US 2022-10-22 wk inc flu … quantile NA 2380
#> 2 01 2022-10-22 wk inc flu … quantile NA 141
#> 3 02 2022-10-22 wk inc flu … quantile NA 3
#> 4 04 2022-10-22 wk inc flu … quantile NA 22
#> 5 05 2022-10-22 wk inc flu … quantile NA 50
#> 6 06 2022-10-22 wk inc flu … quantile NA 124
#> 7 08 2022-10-22 wk inc flu … quantile NA 15
#> 8 09 2022-10-22 wk inc flu … quantile NA 9
#> 9 10 2022-10-22 wk inc flu … quantile NA 1
#> 10 11 2022-10-22 wk inc flu … quantile NA 8
#> # ℹ 200,330 more rows