Validating Target Data Pull Requests on GitHub

library(hubValidations)

Running validation checks on a Pull Request with `validate_target_pr()`

The validate_target_pr() function is designed to validate target data contributions through Pull Requests on GitHub.

Each target data file is validated individually with validate_target_submission(), and each affected target type dataset is validated as a whole with validate_target_dataset(). Hub config files are also validated as part of the checks. Any other files included in the PR are ignored but flagged in a message.

See the end of this article for details of the standard checks performed on each file. For more information on deploying optional or custom functions, see the article on deploying custom functions (vignette("articles/deploying-custom-functions")).

Deploying `validate_target_pr()` through a GitHub Action workflow

The most common way to deploy validate_target_pr() is through a GitHub Action that triggers when a pull request containing changes to target data files is opened. The hubverse maintains the validate-target-data.yaml GitHub Action workflow template for deploying validate_target_pr().

The latest release of the workflow can be added to a hub’s GitHub Action workflows using the hubCI package:

hubCI::use_hub_github_action("validate-target-data")

The pertinent section of the workflow is:

      - name: Run validations
        env:
          PR_NUMBER: ${{ github.event.number }}
        run: |
          library("hubValidations")
          v <- hubValidations::validate_target_pr(
              gh_repo = Sys.getenv("GITHUB_REPOSITORY"),
              pr_number = Sys.getenv("PR_NUMBER")
          )
          hubValidations::check_for_errors(v, verbose = TRUE)
        shell: Rscript {0}

Here, validate_target_pr() is called on the contents of the current Pull Request and the results (an S3 <target_validations> class object) are stored in v. check_for_errors() is then used to signal whether overall validations have passed or failed and to summarise any validation failures.

Example: Validating a Pull Request

Here’s an example of validating a PR that adds a valid oracle-output file:

tmp_dir <- withr::local_tempdir()
ci_target_hub_path <- fs::path(tmp_dir, "target")
gert::git_clone(
  url = "https://github.com/hubverse-org/ci-testhub-target.git",
  path = ci_target_hub_path
)
gert::git_branch_checkout("add-file-oracle-output", repo = ci_target_hub_path)
#> Creating local branch add-file-oracle-output from origin/add-file-oracle-output
#> <git repository>: /tmp/Rtmp2i20Ve/file21681dea3c58/target[@add-file-oracle-output]

v <- validate_target_pr(
  hub_path = ci_target_hub_path,
  gh_repo = "hubverse-org/ci-testhub-target",
  pr_number = 1
)
v
#> 
#> ── target ────
#> 
#> ✔ [valid_config]: All hub config files are valid.
#> 
#> 
#> ── oracle-output.csv ────
#> 
#> 
#> 
#> ✔ [target_dataset_exists]: oracle-output dataset detected.
#> ✔ [target_dataset_unique]: target-data directory contains single unique
#>   oracle-output dataset.
#> ✔ [target_dataset_file_ext_unique]: oracle-output dataset files share single
#>   unique file format.
#> ✔ [target_dataset_rows_unique]: oracle-output target dataset rows are unique.
#> ✔ [target_file_exists]: File exists at path target-data/oracle-output.csv.
#> ℹ [target_partition_file_name]: Target file path not hive-partitioned. Check
#>   skipped.
#> ✔ [target_file_ext]: Target data file extension is valid.
#> ✔ [target_file_read]: target file could be read successfully.
#> ✔ [target_tbl_colnames]: Column names are consistent with expected column names
#>   for oracle-output target type data.
#> ✔ [target_tbl_coltypes]: Column data types match oracle-output target schema.
#> ℹ [target_tbl_ts_targets]: Check not applicable to oracle-output target data.
#>   Skipped.
#> ✔ [target_tbl_rows_unique]: oracle-output target data rows are unique.
#> ✔ [target_tbl_values]: `target_tbl_chr` contains valid values/value
#>   combinations.
#> ✔ [target_tbl_output_type_ids]: oracle-output `target_tbl` contains valid
#>   complete output_type_id values.
#> ✔ [target_tbl_oracle_value]: oracle-output `target_tbl` contains valid oracle
#>   values.

check_for_errors(v)
#> ✔ All validation checks have been successful.

Configuring target data validation

For the most robust validation, hubs should include a target-data.json configuration file in their hub-config directory. When present, this config provides deterministic validation by explicitly defining the date column, column names and types, and observable unit structure.

For hubs without a target-data.json config, the date_col parameter can be used to specify the name of the column containing observation dates (e.g., "target_end_date"). This is important for correct schema creation, particularly when the date column is also used for partitioning. When target-data.json exists, any user-provided date_col is ignored in favour of the config value.
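The interaction between date_col and target-data.json can be sketched as follows. This is a minimal illustration, not part of hubValidations: the helper names has_target_data_config() and validate_pr_with_date_col() are hypothetical, and "target_end_date" is simply the example column name used above.

```r
# Check whether a hub ships a target-data.json file in its hub-config directory.
has_target_data_config <- function(hub_path) {
  file.exists(file.path(hub_path, "hub-config", "target-data.json"))
}

# Hypothetical wrapper (not a hubValidations function): passes `date_col`
# through to validate_target_pr() and notes when it is expected to be ignored.
validate_pr_with_date_col <- function(hub_path, gh_repo, pr_number,
                                      date_col = "target_end_date") {
  if (has_target_data_config(hub_path)) {
    message(
      "target-data.json found: `date_col` is ignored in favour of the config value."
    )
  }
  hubValidations::validate_target_pr(
    hub_path = hub_path,
    gh_repo = gh_repo,
    pr_number = pr_number,
    date_col = date_col
  )
}
```

Calling the wrapper requires hubValidations to be installed and access to the Pull Request on GitHub; defining it does not.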

Relaxed date validation for time-series data

By default, date values in time-series target data are validated strictly against the dates defined in tasks.json. However, target data often contains historical observations with dates that fall outside the hub’s configured modeling rounds.

Setting allow_extra_dates = TRUE relaxes date validation for time-series data, allowing historical observations while still strictly validating other task ID values. Oracle-output data always uses strict date validation regardless of this setting.

      - name: Run validations
        env:
          PR_NUMBER: ${{ github.event.number }}
        run: |
          library("hubValidations")
          v <- hubValidations::validate_target_pr(
              gh_repo = Sys.getenv("GITHUB_REPOSITORY"),
              pr_number = Sys.getenv("PR_NUMBER"),
              allow_extra_dates = TRUE
          )
          hubValidations::check_for_errors(v, verbose = TRUE)
        shell: Rscript {0}

Configuring file modification/deletion checks

By default, validate_target_pr() does not perform modification or deletion checks on target data files. This reflects the expectation that target data may be updated or corrected over time.

This behaviour can be modified through the file_modification_check argument, which controls whether modification/deletion checks are performed and what is returned if modifications/deletions are detected:

"none" (default): No modification/deletion checks performed.
"message": Appends a <message/check_info> condition class object for each applicable modified/deleted file. Will not result in validation workflow failure.
"failure": Appends a <error/check_failure> condition class object for each applicable modified/deleted file. Will result in validation workflow failure.
"error": Appends a <error/check_error> condition class object for each applicable modified/deleted file. Will result in validation workflow failure.

Here’s an example of a PR that deletes some target data files. With file_modification_check = "failure", this produces a validation failure:

gert::git_branch_checkout("delete-target-dir-files", repo = ci_target_hub_path)
#> Creating local branch delete-target-dir-files from origin/delete-target-dir-files
#> <git repository>: /tmp/Rtmp2i20Ve/file21681dea3c58/target[@delete-target-dir-files]

v_mod <- validate_target_pr(
  hub_path = ci_target_hub_path,
  gh_repo = "hubverse-org/ci-testhub-target",
  pr_number = 5,
  file_modification_check = "failure"
)
v_mod
#> 
#> ── target ────
#> 
#> ✔ [valid_config]: All hub config files are valid.
#> 
#> 
#> ── oracle-output/output_type=sample/part-0.parquet ────
#> 
#> 
#> 
#> ✖ [oracle_output_mod]: Previously submitted oracle output files must not be
#>   removed.  target-data/oracle-output/output_type=sample/part-0.parquet
#>   removed.
#> 
#> 
#> ── oracle-output ────
#> 
#> 
#> 
#> ✔ [target_dataset_exists]: oracle-output dataset detected.
#> ✔ [target_dataset_unique]: target-data directory contains single unique
#>   oracle-output dataset.
#> ✔ [target_dataset_file_ext_unique]: oracle-output dataset files share single
#>   unique file format.
#> ✔ [target_dataset_rows_unique]: oracle-output target dataset rows are unique.

check_for_errors(v_mod)
#> 
#> ── part-0.parquet ────
#> 
#> ✖ [oracle_output_mod]: Previously submitted oracle output files must not be
#>   removed.  target-data/oracle-output/output_type=sample/part-0.parquet
#>   removed.
#> Error in `check_for_errors()`:
#> ! 
#> The validation checks produced some failures/errors reported above.

Allowing deletion of entire target type datasets

By default, deletion of an entire target type dataset (i.e., all files of a target type) in a PR is not allowed. This can be changed by setting allow_target_type_deletion to TRUE.

Here’s an example of a PR that removes the time-series dataset and adds oracle-output data. With the default settings, this produces an error:

gert::git_branch_checkout("remove-ts-add-oo", repo = ci_target_hub_path)
#> Creating local branch remove-ts-add-oo from origin/remove-ts-add-oo
#> <git repository>: /tmp/Rtmp2i20Ve/file21681dea3c58/target[@remove-ts-add-oo]

v_del <- validate_target_pr(
  hub_path = ci_target_hub_path,
  gh_repo = "hubverse-org/ci-testhub-target",
  pr_number = 4
)
v_del
#> 
#> ── target ────
#> 
#> ✔ [valid_config]: All hub config files are valid.
#> 
#> 
#> ── time-series ────
#> 
#> 
#> 
#> ⓧ [target_dataset_exists]: time-series dataset not detected.
#> 
#> 
#> ── oracle-output.csv ────
#> 
#> 
#> 
#> ✔ [target_dataset_exists_1]: oracle-output dataset detected.
#> ✔ [target_dataset_unique]: target-data directory contains single unique
#>   oracle-output dataset.
#> ✔ [target_dataset_file_ext_unique]: oracle-output dataset files share single
#>   unique file format.
#> ✔ [target_dataset_rows_unique]: oracle-output target dataset rows are unique.
#> ✔ [target_file_exists]: File exists at path target-data/oracle-output.csv.
#> ℹ [target_partition_file_name]: Target file path not hive-partitioned. Check
#>   skipped.
#> ✔ [target_file_ext]: Target data file extension is valid.
#> ✔ [target_file_read]: target file could be read successfully.
#> ✔ [target_tbl_colnames]: Column names are consistent with expected column names
#>   for oracle-output target type data.
#> ✔ [target_tbl_coltypes]: Column data types match oracle-output target schema.
#> ℹ [target_tbl_ts_targets]: Check not applicable to oracle-output target data.
#>   Skipped.
#> ✔ [target_tbl_rows_unique]: oracle-output target data rows are unique.
#> ✔ [target_tbl_values]: `target_tbl_chr` contains valid values/value
#>   combinations.
#> ✔ [target_tbl_output_type_ids]: oracle-output `target_tbl` contains valid
#>   complete output_type_id values.
#> ✔ [target_tbl_oracle_value]: oracle-output `target_tbl` contains valid oracle
#>   values.

check_for_errors(v_del)
#> 
#> ── time-series ────
#> 
#> ⓧ [target_dataset_exists]: time-series dataset not detected.
#> Error in `check_for_errors()`:
#> ! 
#> The validation checks produced some failures/errors reported above.

Setting allow_target_type_deletion = TRUE allows the deletion to pass:

v_del_allowed <- validate_target_pr(
  hub_path = ci_target_hub_path,
  gh_repo = "hubverse-org/ci-testhub-target",
  pr_number = 4,
  allow_target_type_deletion = TRUE
)
check_for_errors(v_del_allowed)
#> ✔ All validation checks have been successful.

Checking for validation failures with `check_for_errors()`

check_for_errors() is used to inspect a target_validations class object, determine whether overall validations have passed or failed and summarise any detected errors/failures.

Validation failure

If any elements of the target_validations object contain <error/check_error>, <error/check_failure> or <error/check_exec_error> condition class objects, the function throws an error and prints the messages from the failing checks.
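To illustrate the idea of flagging failures by condition class, here is a small base R sketch. It uses mock condition objects rather than real validation results, and it is not the package's implementation: mock_check() is a hypothetical constructor, and the class names check_info, check_failure, check_error and check_exec_error are taken from those named in this article.

```r
# Hypothetical constructor for a mock check result (a plain condition object).
mock_check <- function(msg, class) {
  structure(
    list(message = msg, call = NULL),
    class = c(class, "condition")
  )
}

# Mock results keyed by check name, standing in for a validations object.
mock_results <- list(
  target_partition_file_name = mock_check(
    "Check skipped.", c("check_info", "message")
  ),
  target_tbl_rows_unique = mock_check(
    "Duplicate rows found.", c("check_failure", "error")
  ),
  target_file_read = mock_check(
    "File could not be read.", c("check_error", "error")
  )
)

# Classes treated as failing in this sketch.
failing_classes <- c("check_error", "check_failure", "check_exec_error")

# inherits() returns TRUE if the object has any of the listed classes.
is_failing <- vapply(
  mock_results,
  function(x) inherits(x, failing_classes),
  logical(1)
)

# Names of the checks that would be flagged.
names(mock_results)[is_failing]
```

With these mock inputs, the two flagged checks are target_tbl_rows_unique and target_file_read; the check_info result is not flagged.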

Validation success

If all validation checks pass, check_for_errors() returns TRUE invisibly and prints:

All validation checks have been successful.
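Because check_for_errors() signals failure by throwing an error and success by returning TRUE, a script that needs a plain logical pass/fail value can wrap the call in tryCatch(). The helper below is a hypothetical sketch, not part of hubValidations; its checker argument is an assumption added only so the helper can be tried with a stand-in function.

```r
# Hypothetical helper: turn the error-or-TRUE behaviour into a logical value.
# `checker` defaults to hubValidations::check_for_errors and is only evaluated
# when no alternative function is supplied.
validations_passed <- function(validations,
                               checker = hubValidations::check_for_errors) {
  tryCatch(
    {
      checker(validations)
      TRUE
    },
    error = function(e) {
      # Report the error message and return FALSE instead of stopping.
      message("Validation failed: ", conditionMessage(e))
      FALSE
    }
  )
}
```

With the objects created earlier in this article, validations_passed(v) would be expected to return TRUE and validations_passed(v_mod) to return FALSE.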

Verbose output

To print the results of all checks rather than only those that failed, set the verbose argument to TRUE.

`validate_target_pr` check details

For details on the structure of <hub_validations> objects, including on how to access more information about specific checks, see vignette("articles/hub-validations-class").

Checks on target datasets

Details of dataset-level checks performed by `validate_target_pr()`.
Name	Check	Early return	Fail output
valid_config	Hub config valid	TRUE	check_error
target_dataset_exists	Target dataset can be successfully detected for a given target type.	TRUE	check_error
target_dataset_unique	A single unique target dataset exists for a given target type.	TRUE	check_error
target_dataset_file_ext_unique	All files of a given target type share a single unique file format.	TRUE	check_error
target_dataset_rows_unique	Target dataset rows are all unique.	FALSE	check_failure

Checks on individual target files

Details of file-level checks performed by `validate_target_submission()`.
Name	Check	Early return	Fail output
target_file_exists	File exists at `file_path` provided.	TRUE	check_error
target_partition_file_name	Hive-style partition file path segments are valid and can be parsed successfully. Skipped if target dataset not hive-partitioned.	TRUE	check_error
target_file_ext	Target data file extension is valid.	TRUE	check_error
target_file_read	Target data file can be read successfully.	TRUE	check_error
target_tbl_colnames	Target data file has the correct column names according to target type.	TRUE	check_error
target_tbl_coltypes	Target data file has the correct column types according to target type.	TRUE	check_error
target_tbl_ts_targets	Targets in a time-series target data file are valid. Only performed on `time-series` data files.	TRUE	check_error
target_tbl_rows_unique	Target data file rows are all unique.	FALSE	check_failure
target_tbl_values	Task ID columns in a target data file have valid task ID values.	TRUE	check_error
target_tbl_output_type_ids	Output type ID values in a target data file are valid and complete. Only performed when the target data file contains an `output_type_id` column.	TRUE	check_error
target_tbl_oracle_value	Oracle values in a target data file are valid. Only performed on `oracle-output` data files.	FALSE	check_failure

Custom checks

The standard checks discussed here are those deployed by default by validate_target_pr(). For more information on deploying optional or custom functions, see the article on deploying custom functions (vignette("articles/deploying-custom-functions")).