
Validate the contents of a submitted target data file.
Source: R/validate_target_data.R
Arguments
- hub_path
Either a character string path to a local Modeling Hub directory or an object of class `<SubTreeFileSystem>` created using functions `s3_bucket()` or `gs_bucket()` by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) article in the `arrow` package. The hub must be fully configured with valid `admin.json` and `tasks.json` files within the `hub-config` directory.
- file_path
A character string representing the path to the target data file relative to the `target-data` directory.
- target_type
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series".
- date_col
Optional column name to be interpreted as date. Default is `NULL`. Useful when the required date column is a partitioning column in the target data and does not have the same name as a date-typed task ID variable in the config.
- na
A character vector of strings to interpret as missing values. Only applies to CSV files. The default is `c("NA", "")`. Useful when actual character string `"NA"` values are used in the data. In such a case, use empty cells to indicate missing values in your files and set `na = ""`.
- output_type_id_datatype
Character string. One of `"from_config"`, `"auto"`, `"character"`, `"double"`, `"integer"`, `"logical"`, `"Date"`. Defaults to `"from_config"`, which uses the setting in the `output_type_id_datatype` property in the `tasks.json` config file if available. If the property is not set in the config, the argument falls back to `"auto"`, which determines the `output_type_id` data type automatically from the `tasks.json` config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where `output_type_id`s are `NA`) are being collected by a hub, the `output_type_id` column is assigned a `character` data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce `output_type_id` to a data type that is not valid for the data (e.g. trying to coerce `"character"` values to `"double"`) will likely result in an error or potentially unexpected behaviour, so use with care.
- validations_cfg_path
Path to YAML file configuring custom validation checks. If `NULL`, defaults to the standard `hub-config/validations.yml` path. For more details see the article on custom validation checks.
- round_id
Character string. Not generally relevant to target datasets but can be used to specify a specific block of custom validation checks. Otherwise best set to `"default"`, which will deploy the default custom validation checks.
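The effect of the `na` argument can be illustrated with base R's `read.csv()`, whose `na.strings` argument plays an analogous role. This is only a sketch: it does not use the reader that `validate_target_data()` relies on, and the column names and values below are made up for illustration.

```r
# Hypothetical two-row CSV: the first location is the literal string "NA",
# and the second row has an empty value cell.
csv_text <- "location,value\nNA,10\nUS,\n"

# Analogue of the default na = c("NA", ""): the literal "NA" becomes missing.
default_na <- read.csv(
  text = csv_text,
  na.strings = c("NA", ""),
  stringsAsFactors = FALSE
)

# Analogue of na = "": only empty cells are treated as missing,
# so the literal "NA" string is preserved.
empty_only <- read.csv(
  text = csv_text,
  na.strings = "",
  stringsAsFactors = FALSE
)

is.na(default_na$location)
empty_only$location
is.na(empty_only$value)
```

With `validate_target_data()`, the second case corresponds to passing `na = ""` and using empty cells for missing values in the file.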
Value
An object of class `hub_validations`. Each named element contains a `hub_check` class object reflecting the result of a given check. The function will return early if a check returns an error.
For more details on the structure of `<hub_validations>` objects, including how to access more information on individual checks, see the article on `<hub_validations>` S3 class objects.
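As a minimal sketch of working with the returned object, the snippet below treats it as a named list of checks, as described above. It assumes the function is available after attaching hubValidations (the package name is an assumption here) and that the test hub used in the Examples section is installed with hubUtils.

```r
# Assumption: validate_target_data() is exported by the hubValidations package.
library(hubValidations)

hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
v <- validate_target_data(hub_path,
  file_path = "time-series.csv",
  target_type = "time-series"
)

# One named element per check that was run.
names(v)

# Extract a single check result by name and look at its classes.
file_read_check <- v[["target_file_read"]]
class(file_read_check)
```

Accessing elements by check name keeps downstream code independent of the order in which checks are run.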
Details
Details of checks performed by validate_target_data()
Name | Check | Early return | Fail output | Extra info |
---|---|---|---|---|
target_file_read | Target data file can be read successfully. | TRUE | check_error | |
target_tbl_colnames | Target data file has the correct column names according to target type. | TRUE | check_error | |
target_tbl_coltypes | Target data file has the correct column types according to target type. | TRUE | check_error | |
target_tbl_ts_targets | Targets in a time-series target data file are valid. Only performed on `time-series` data files. | TRUE | check_error | |
target_tbl_rows_unique | Target data file rows are all unique. | FALSE | check_failure | |
target_tbl_values | Task ID columns in a target data file have valid task ID values. | TRUE | check_error | |
target_tbl_output_type_ids | Output type ID values in a target data file are valid and complete. Only performed when the target data file contains an `output_type_id` column. | TRUE | check_error | |
target_tbl_oracle_value | Oracle values in a target data file are valid. Only performed on `oracle-output` data files. | FALSE | check_failure | |
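
The Early return and Fail output columns can be read as a simple control flow: checks run in order, and a failing check with Early return `TRUE` stops the run. The following base R sketch mimics that logic with hypothetical stand-in checks. `run_checks()`, the `checks` list, and the `"check_success"` label are illustrative names rather than part of the package, and this is not the package's actual implementation.

```r
# Run check specifications in order. Each spec is a list with:
#   fun          - a function returning TRUE (pass) or FALSE (fail)
#   early_return - whether a failure should stop further checks
#   fail_output  - label recorded when the check fails
run_checks <- function(checks) {
  results <- list()
  for (name in names(checks)) {
    spec <- checks[[name]]
    passed <- spec$fun()
    results[[name]] <- if (passed) "check_success" else spec$fail_output
    # Stop as soon as a failing check is flagged for early return.
    if (!passed && spec$early_return) break
  }
  results
}

# Hypothetical outcomes for four of the checks in the table, in table order.
checks <- list(
  target_file_read = list(
    fun = function() TRUE, early_return = TRUE, fail_output = "check_error"
  ),
  target_tbl_rows_unique = list(
    fun = function() FALSE, early_return = FALSE, fail_output = "check_failure"
  ),
  target_tbl_values = list(
    fun = function() FALSE, early_return = TRUE, fail_output = "check_error"
  ),
  target_tbl_oracle_value = list(
    fun = function() TRUE, early_return = FALSE, fail_output = "check_failure"
  )
)

results <- run_checks(checks)
unlist(results)
```

In this sketch the failing `target_tbl_rows_unique` check is recorded and the run continues, while the failing `target_tbl_values` check ends the run, so `target_tbl_oracle_value` is never evaluated.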
Examples
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
validate_target_data(hub_path,
  file_path = "time-series.csv",
  target_type = "time-series"
)
#>
#> ── time-series.csv ────
#>
#> ✔ [target_file_read]: target file could be read successfully.
#> ✔ [target_tbl_colnames]: Column names are consistent with expected column names
#> for time-series target type data.
#> ✔ [target_tbl_coltypes]: Column data types match time-series target schema.
#> ✔ [target_tbl_ts_targets]: time-series targets are all valid.
#> ✔ [target_tbl_rows_unique]: time-series target data rows are unique.
#> ✔ [target_tbl_values]: `target_tbl_chr` contains valid values/value
#> combinations.
#> ℹ [target_tbl_output_type_ids]: Check not applicable to time-series target
#> data. Skipped.
#> ℹ [target_tbl_oracle_value]: Check not applicable to time-series target data.
#> Skipped.
validate_target_data(hub_path,
  file_path = "oracle-output.csv",
  target_type = "oracle-output"
)
#>
#> ── oracle-output.csv ────
#>
#> ✔ [target_file_read]: target file could be read successfully.
#> ✔ [target_tbl_colnames]: Column names are consistent with expected column names
#> for oracle-output target type data.
#> ✔ [target_tbl_coltypes]: Column data types match oracle-output target schema.
#> ℹ [target_tbl_ts_targets]: Check not applicable to oracle-output target data.
#> Skipped.
#> ✔ [target_tbl_rows_unique]: oracle-output target data rows are unique.
#> ✔ [target_tbl_values]: `target_tbl_chr` contains valid values/value
#> combinations.
#> ✔ [target_tbl_output_type_ids]: oracle-output `target_tbl` contains valid
#> complete output_type_id values.
#> ✔ [target_tbl_oracle_value]: oracle-output `target_tbl` contains valid oracle
#> values.
hub_path <- system.file("testhubs/v5/target_dir", package = "hubUtils")
validate_target_data(hub_path,
  file_path = "time-series/target=wk%20flu%20hosp%20rate/part-0.parquet",
  target_type = "time-series"
)
#>
#> ── time-series/target=wk%20flu%20hosp%20rate/part-0.parquet ────
#>
#> ✔ [target_file_read]: target file could be read successfully.
#> ✔ [target_tbl_colnames]: Column names are consistent with expected column names
#> for time-series target type data.
#> ✔ [target_tbl_coltypes]: Column data types match time-series target schema.
#> ✔ [target_tbl_ts_targets]: time-series targets are all valid.
#> ✔ [target_tbl_rows_unique]: time-series target data rows are unique.
#> ✔ [target_tbl_values]: `target_tbl_chr` contains valid values/value
#> combinations.
#> ℹ [target_tbl_output_type_ids]: Check not applicable to time-series target
#> data. Skipped.
#> ℹ [target_tbl_oracle_value]: Check not applicable to time-series target data.
#> Skipped.
validate_target_data(hub_path,
  file_path = "oracle-output/output_type=pmf/part-0.parquet",
  target_type = "oracle-output"
)
#>
#> ── oracle-output/output_type=pmf/part-0.parquet ────
#>
#> ✔ [target_file_read]: target file could be read successfully.
#> ✔ [target_tbl_colnames]: Column names are consistent with expected column names
#> for oracle-output target type data.
#> ✔ [target_tbl_coltypes]: Column data types match oracle-output target schema.
#> ℹ [target_tbl_ts_targets]: Check not applicable to oracle-output target data.
#> Skipped.
#> ✔ [target_tbl_rows_unique]: oracle-output target data rows are unique.
#> ✔ [target_tbl_values]: `target_tbl_chr` contains valid values/value
#> combinations.
#> ✔ [target_tbl_output_type_ids]: oracle-output `target_tbl` contains valid
#> complete output_type_id values.
#> ✔ [target_tbl_oracle_value]: oracle-output `target_tbl` contains valid oracle
#> values.