Check that a target data file has the correct column types according to target type

Usage

check_target_tbl_coltypes(
  target_tbl,
  target_type = c("time-series", "oracle-output"),
  date_col = NULL,
  na = c("NA", ""),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer",
    "logical", "Date"),
  file_path,
  hub_path
)

Arguments

target_tbl: A tibble/data.frame of the contents of the target data file being validated.
target_type: Type of target data contained in the file being validated. One of "time-series" or "oracle-output". Defaults to "time-series".
date_col: Optional name of a column to be interpreted as a date during schema creation. Useful when the date column does not correspond to a valid task ID (e.g., when it is calculated from other task IDs, such as origin_date + horizon), particularly when it is also a partitioning column. Ignored when a target-data.json config is provided.
na: A character vector of strings to interpret as missing values. Only applies to CSV files. The default is c("NA", ""). Useful when actual character string "NA" values are used in the data. In such a case, use empty cells to indicate missing values in your files and set na = "".
output_type_id_datatype: Character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config", which uses the output_type_id_datatype property in the tasks.json config file if it is set. If the property is not set, the argument falls back to "auto", which determines the output_type_id data type from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When a hub collects only point estimate output types (where output_type_ids are NA), the auto-determined output_type_id column is assigned a character data type. The other values override automatic determination. Note that coercing output_type_id to a data type that is not valid for the data (e.g. coercing "character" values to "double") will likely result in an error or unexpected behaviour, so use with care.
file_path: A character string representing the path to the target data file relative to the target-data directory.
hub_path: Either a character string path to a local Modeling Hub directory, or an object of class <SubTreeFileSystem> pointing to a Modeling Hub directory stored in the cloud. Such an object is created with s3_bucket() or gs_bucket() by providing an S3 or GCS bucket name or path as a string. For more details, consult the "Using cloud storage (S3, GCS)" article in the arrow package documentation. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.
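
The effect of the na argument can be illustrated with a minimal base R sketch. It uses read.csv() and its na.strings argument purely as a stand-in for the same idea; it is not how this function reads target data files. The CSV contents and column names are hypothetical.

```r
# Hypothetical CSV contents in which "NA" is a genuine location value
# and the empty cell in the second row is a true missing observation.
csv_text <- "location,observation\nNA,10\nUS,\n"

# Treat both "NA" and "" as missing (analogous to the default na = c("NA", "")).
default_na <- read.csv(
  text = csv_text,
  na.strings = c("NA", ""),
  colClasses = c("character", "numeric")
)

# Treat only empty cells as missing (analogous to na = ""),
# so the literal string "NA" survives as a character value.
empty_only <- read.csv(
  text = csv_text,
  na.strings = "",
  colClasses = c("character", "numeric")
)

is.na(default_na$location[1])  # the "NA" location was read as missing
empty_only$location            # the "NA" location was kept as a string
```

With the second setting, the location value is preserved while the empty observation cell is still read as missing.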

Value

Depending on whether validation has succeeded, one of:

<message/check_success> condition class object.
<error/check_error> condition class object.

The returned object also inherits from class <hub_check>.
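
Because the result is a classed condition object, calling code can branch on it with inherits(). The sketch below builds stand-in objects by hand so that it runs without a hub; the message text, the exact class vectors, and the check_passed() helper are assumptions for illustration and are not part of the package.

```r
# Stand-in objects built by hand for illustration only. Real check objects
# are created by the package and may carry extra fields and classes.
fake_success <- structure(
  list(message = "hypothetical success message"),
  class = c("check_success", "hub_check", "message", "condition")
)
fake_error <- structure(
  list(message = "hypothetical error message"),
  class = c("check_error", "hub_check", "error", "condition")
)

# Hypothetical helper: TRUE when a check object is not an error.
check_passed <- function(check) {
  stopifnot(inherits(check, "hub_check"))
  !inherits(check, "check_error")
}

check_passed(fake_success)
check_passed(fake_error)
```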

Details

Column type validation depends on whether a target-data.json configuration file is provided:

With target-data.json config: Expected column types are determined directly from the schema defined in the configuration. Validation is performed against the schema specifications in target-data.json.

Without target-data.json config: Expected column types are determined from the dataset itself and validated for internal consistency across files, which mainly applies to partitioned datasets.
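
The idea of internal consistency across files can be sketched in base R as follows. This is only an illustration of comparing column types between two partitions, under the assumption that each partition has been read into a data.frame; it is not the package's implementation, and the data, column names, and col_types() helper are hypothetical.

```r
# Two hypothetical partitions of the same target dataset. In the second one,
# the date column was read as character and the observation as integer.
part_a <- data.frame(
  target_end_date = as.Date("2024-01-06"),
  observation = 12.5
)
part_b <- data.frame(
  target_end_date = "2024-01-13",
  observation = 14L,
  stringsAsFactors = FALSE
)

# Hypothetical helper: first class of each column, as a named character vector.
col_types <- function(df) vapply(df, function(x) class(x)[1], character(1))

types_a <- col_types(part_a)
types_b <- col_types(part_b)

# Columns whose type differs between the two partitions.
mismatched <- names(types_a)[types_a != types_b[names(types_a)]]
mismatched
```

Any column reported here would indicate that the partitions do not share a single consistent schema.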