Read a single target data file

Usage

read_target_file(
  target_file_path,
  hub_path,
  coerce_types = c("hub", "chr", "none"),
  date_col = NULL,
  na = c("NA", "")
)

Arguments

target_file_path

Character string. Path to the target data file being validated, relative to the hub's target-data directory.

hub_path

Either a character string path to a local Modeling Hub directory, or an object of class <SubTreeFileSystem> created with s3_bucket() or gs_bucket() from a string S3 or GCS bucket name or a path to a Modeling Hub directory stored in the cloud. For more details, consult the "Using cloud storage (S3, GCS)" article in the arrow package documentation. The hub must be fully configured, with valid admin.json and tasks.json files in the hub-config directory.

coerce_types

Character string. The data types to coerce columns to on read.

  • hub: (default) read in (csv) or coerce (parquet) to the schema defined by the hub config (see hubData::create_timeseries_schema() and hubData::create_oracle_output_schema() for details). When coercing data types using the hub schema, the output_type_id_datatype can also be used to set the output_type_id column data type manually.

  • chr: read in (csv) or coerce (parquet) all columns to character.

  • none: No coercion. Use arrow read_* function defaults.
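
The vector default shown in Usage follows the common R idiom in which the first element is used when the argument is omitted, which matches "hub" being the documented default. The sketch below illustrates that idiom with a hypothetical helper, pick_coercion(), built on base R's match.arg(). Whether read_target_file() resolves the argument with match.arg() or an equivalent is an assumption, not something this page states.

```r
# Hypothetical helper (not part of any package) that mirrors the
# `coerce_types` default from the Usage block.
pick_coercion <- function(coerce_types = c("hub", "chr", "none")) {
  # match.arg() returns the first choice when the argument is missing
  # and raises an error for values outside the allowed set.
  match.arg(coerce_types)
}

default_choice <- pick_coercion()        # no argument supplied
explicit_choice <- pick_coercion("none") # explicit valid choice
invalid_choice <- try(pick_coercion("int"), silent = TRUE)
```

Under this idiom, omitting the argument and passing "hub" explicitly are equivalent, and a misspelled option fails early instead of silently falling back to a default.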

date_col

Optional name of a column to be interpreted as a date. Default is NULL. Useful when the required date column is a partitioning column in the target data and does not have the same name as a date-typed task ID variable in the config.

na

A character vector of strings to interpret as missing values. Only applies to CSV files. The default is c("NA", ""). Useful when actual character string "NA" values are used in the data. In such a case, use empty cells to indicate missing values in your files and set na = "".
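
To make the difference between the two settings concrete, the sketch below reads the same small CSV twice. It uses base R's read.csv() and its na.strings argument as a stand-in so that it runs without a hub; read_target_file() reads files through arrow, so this illustrates only the intended semantics of na, not the function's implementation. The CSV content is made up for illustration.

```r
# Made-up CSV: the first row uses the literal string "NA" as a location
# value, and the second row has an empty (missing) observation cell.
csv_text <- "location,observation\nNA,5\nUS,\n"

# Equivalent of the default na = c("NA", ""):
# both "NA" and empty cells are read as missing.
default_na <- read.csv(
  text = csv_text,
  na.strings = c("NA", ""),
  colClasses = c("character", "numeric")
)

# Equivalent of na = "":
# only empty cells are read as missing, so "NA" survives as a string.
empty_only <- read.csv(
  text = csv_text,
  na.strings = "",
  colClasses = c("character", "numeric")
)

is.na(default_na$location[1])  # the literal "NA" was turned into a missing value
empty_only$location[1]         # the literal "NA" is preserved
```

For a hub whose files follow this convention, the corresponding call would be read_target_file("time-series.csv", hub_path, na = "").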

Value

A tibble containing the contents of the target data file.

Examples

# download example hub
hub_path <- withr::local_tempdir()
example_hub <- "https://github.com/hubverse-org/example-complex-forecast-hub.git"
gert::git_clone(url = example_hub, path = hub_path)
# read in time-series file
read_target_file("time-series.csv", hub_path)
#> # A tibble: 20,510 × 4
#>    date       target          location observation
#>    <date>     <chr>           <chr>          <dbl>
#>  1 2020-01-11 wk inc flu hosp 01                 0
#>  2 2020-01-11 wk inc flu hosp 15                 0
#>  3 2020-01-11 wk inc flu hosp 18                 0
#>  4 2020-01-11 wk inc flu hosp 27                 0
#>  5 2020-01-11 wk inc flu hosp 30                 0
#>  6 2020-01-11 wk inc flu hosp 37                 0
#>  7 2020-01-11 wk inc flu hosp 48                 0
#>  8 2020-01-11 wk inc flu hosp US                 1
#>  9 2020-01-18 wk inc flu hosp 01                 0
#> 10 2020-01-18 wk inc flu hosp 15                 0
#> # ℹ 20,500 more rows
read_target_file("time-series.csv", hub_path, coerce_types = "chr")
#> # A tibble: 20,510 × 4
#>    date       target          location observation
#>    <chr>      <chr>           <chr>    <chr>      
#>  1 2020-01-11 wk inc flu hosp 01       0          
#>  2 2020-01-11 wk inc flu hosp 15       0          
#>  3 2020-01-11 wk inc flu hosp 18       0          
#>  4 2020-01-11 wk inc flu hosp 27       0          
#>  5 2020-01-11 wk inc flu hosp 30       0          
#>  6 2020-01-11 wk inc flu hosp 37       0          
#>  7 2020-01-11 wk inc flu hosp 48       0          
#>  8 2020-01-11 wk inc flu hosp US       1          
#>  9 2020-01-18 wk inc flu hosp 01       0          
#> 10 2020-01-18 wk inc flu hosp 15       0          
#> # ℹ 20,500 more rows
# read in oracle-output file
read_target_file("oracle-output.csv", hub_path)
#> # A tibble: 200,340 × 6
#>    location target_end_date target       output_type output_type_id oracle_value
#>    <chr>    <date>          <chr>        <chr>       <chr>                 <dbl>
#>  1 US       2022-10-22      wk inc flu … quantile    NA                     2380
#>  2 01       2022-10-22      wk inc flu … quantile    NA                      141
#>  3 02       2022-10-22      wk inc flu … quantile    NA                        3
#>  4 04       2022-10-22      wk inc flu … quantile    NA                       22
#>  5 05       2022-10-22      wk inc flu … quantile    NA                       50
#>  6 06       2022-10-22      wk inc flu … quantile    NA                      124
#>  7 08       2022-10-22      wk inc flu … quantile    NA                       15
#>  8 09       2022-10-22      wk inc flu … quantile    NA                        9
#>  9 10       2022-10-22      wk inc flu … quantile    NA                        1
#> 10 11       2022-10-22      wk inc flu … quantile    NA                        8
#> # ℹ 200,330 more rows
read_target_file("oracle-output.csv", hub_path, coerce_types = "chr")
#> # A tibble: 200,340 × 6
#>    location target_end_date target       output_type output_type_id oracle_value
#>    <chr>    <chr>           <chr>        <chr>       <chr>          <chr>       
#>  1 US       2022-10-22      wk inc flu … quantile    NA             2380        
#>  2 01       2022-10-22      wk inc flu … quantile    NA             141         
#>  3 02       2022-10-22      wk inc flu … quantile    NA             3           
#>  4 04       2022-10-22      wk inc flu … quantile    NA             22          
#>  5 05       2022-10-22      wk inc flu … quantile    NA             50          
#>  6 06       2022-10-22      wk inc flu … quantile    NA             124         
#>  7 08       2022-10-22      wk inc flu … quantile    NA             15          
#>  8 09       2022-10-22      wk inc flu … quantile    NA             9           
#>  9 10       2022-10-22      wk inc flu … quantile    NA             1           
#> 10 11       2022-10-22      wk inc flu … quantile    NA             8           
#> # ℹ 200,330 more rows