Read a single target data file
Arguments
- target_file_path
Character string. Path to the target data file being validated, relative to the hub's target-data directory.
- hub_path
Either a character string path to a local Modeling Hub directory or an object of class <SubTreeFileSystem> created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details, consult the "Using cloud storage (S3, GCS)" article in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.
- coerce_types
Character string. What to coerce column types to on read.
  - target: (default) read in (csv) or coerce (parquet) to the expected schema by target type (see hubData::create_timeseries_schema() and hubData::create_oracle_output_schema() for details). When coercing data types using the target schema, output_type_id_datatype can also be used to set the output_type_id column data type manually.
  - chr: read in (csv) or coerce (parquet) all columns to character.
  - none: No coercion. Use arrow read_* function defaults.
- date_col
Optional column name to be interpreted as date. Default is NULL. Useful when the required date column is a partitioning column in the target data and does not have the same name as a date-typed task ID variable in the config.
- na
A character vector of strings to interpret as missing values. Only applies to CSV files. The default is c("NA", ""). Useful when actual character string "NA" values are used in the data. In such a case, use empty cells to indicate missing values in your files and set na = "".
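The effect of the na setting can be illustrated outside of read_target_file(). The following is a minimal sketch that uses base R's read.csv() and its na.strings argument as a stand-in (an assumption: it is not the reader this function uses internally) on a made-up two-row CSV in which the literal string "NA" is a genuine location code and an empty cell marks a missing value.

```r
# Hypothetical CSV text: the first row uses the literal string "NA" as a
# location code; the second row has an empty location cell (truly missing).
csv_text <- "location,observation\nNA,3\n,5\n"

# Analogue of na = c("NA", ""): both location cells are read as missing.
default_na <- read.csv(
  text = csv_text,
  na.strings = c("NA", ""),
  colClasses = "character"
)

# Analogue of na = "": only the empty cell is read as missing, so the
# literal "NA" code survives as an ordinary character value.
empty_only <- read.csv(
  text = csv_text,
  na.strings = "",
  colClasses = "character"
)

# Compare which location values are flagged as missing in each case.
is.na(default_na$location)
is.na(empty_only$location)
```

With read_target_file(), the corresponding choice is made through the na argument described above, for example na = "".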
Examples
# locate the example hub bundled with hubUtils
hub_path <- system.file("testhubs/v5/target_file",
  package = "hubUtils"
)
# read in time-series file
read_target_file("time-series.csv", hub_path)
#> # A tibble: 66 × 4
#>    target_end_date target          location observation
#>    <date>          <chr>           <chr>          <dbl>
#>  1 2022-10-22      wk inc flu hosp 02                 3
#>  2 2022-10-22      wk inc flu hosp 01               141
#>  3 2022-10-22      wk inc flu hosp US              2380
#>  4 2022-10-29      wk inc flu hosp 02                14
#>  5 2022-10-29      wk inc flu hosp 01               262
#>  6 2022-10-29      wk inc flu hosp US              4353
#>  7 2022-11-05      wk inc flu hosp 02                10
#>  8 2022-11-05      wk inc flu hosp 01               360
#>  9 2022-11-05      wk inc flu hosp US              6571
#> 10 2022-11-12      wk inc flu hosp 02                20
#> # ℹ 56 more rows
read_target_file("time-series.csv", hub_path, coerce_types = "chr")
#> # A tibble: 66 × 4
#>    target_end_date target          location observation
#>    <chr>           <chr>           <chr>    <chr>
#>  1 2022-10-22      wk inc flu hosp 02       3
#>  2 2022-10-22      wk inc flu hosp 01       141
#>  3 2022-10-22      wk inc flu hosp US       2380
#>  4 2022-10-29      wk inc flu hosp 02       14
#>  5 2022-10-29      wk inc flu hosp 01       262
#>  6 2022-10-29      wk inc flu hosp US       4353
#>  7 2022-11-05      wk inc flu hosp 02       10
#>  8 2022-11-05      wk inc flu hosp 01       360
#>  9 2022-11-05      wk inc flu hosp US       6571
#> 10 2022-11-12      wk inc flu hosp 02       20
#> # ℹ 56 more rows
# read in oracle-output file
read_target_file("oracle-output.csv", hub_path)
#> # A tibble: 627 × 6
#>    location target_end_date target       output_type output_type_id oracle_value
#>    <chr>    <date>          <chr>        <chr>       <chr>                 <dbl>
#>  1 US       2022-10-22      wk flu hosp… cdf         1                         1
#>  2 US       2022-10-22      wk flu hosp… cdf         2                         1
#>  3 US       2022-10-22      wk flu hosp… cdf         3                         1
#>  4 US       2022-10-22      wk flu hosp… cdf         4                         1
#>  5 US       2022-10-22      wk flu hosp… cdf         5                         1
#>  6 US       2022-10-22      wk flu hosp… cdf         6                         1
#>  7 US       2022-10-22      wk flu hosp… cdf         7                         1
#>  8 US       2022-10-22      wk flu hosp… cdf         8                         1
#>  9 US       2022-10-22      wk flu hosp… cdf         9                         1
#> 10 US       2022-10-22      wk flu hosp… cdf         10                        1
#> # ℹ 617 more rows
read_target_file("oracle-output.csv", hub_path, coerce_types = "chr")
#> # A tibble: 627 × 6
#>    location target_end_date target       output_type output_type_id oracle_value
#>    <chr>    <chr>           <chr>        <chr>       <chr>          <chr>
#>  1 US       2022-10-22      wk flu hosp… cdf         1              1
#>  2 US       2022-10-22      wk flu hosp… cdf         2              1
#>  3 US       2022-10-22      wk flu hosp… cdf         3              1
#>  4 US       2022-10-22      wk flu hosp… cdf         4              1
#>  5 US       2022-10-22      wk flu hosp… cdf         5              1
#>  6 US       2022-10-22      wk flu hosp… cdf         6              1
#>  7 US       2022-10-22      wk flu hosp… cdf         7              1
#>  8 US       2022-10-22      wk flu hosp… cdf         8              1
#>  9 US       2022-10-22      wk flu hosp… cdf         9              1
#> 10 US       2022-10-22      wk flu hosp… cdf         10             1
#> # ℹ 617 more rows