
Check target dataset rows are all unique
Source: R/check_target_dataset_rows_unique.R
Check that there are no duplicate rows in a target dataset. This function is designed to be used as part of an overall target data integrity check.
Arguments
- target_type
 Type of target data for which to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series".
- na
 A character vector of strings to interpret as missing values. Only applies to CSV files. The default is c("NA", ""). Useful when actual character string "NA" values are used in the data. In such a case, use empty cells to indicate missing values in your files and set na = "".
- date_col
 Optional column name to be interpreted as date. Default is NULL. Useful when the required date column is a partitioning column in the target data and does not have the same name as a date typed task ID variable in the config.
- output_type_id_datatype
 Character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config", which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto", which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_ids are NA) are being collected by a hub, the output_type_id column is assigned a character data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care.
- hub_path
 Either a character string path to a local Modeling Hub directory or an object of class <SubTreeFileSystem> created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) article in the arrow package documentation. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.
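The caution about output_type_id_datatype above can be illustrated with a minimal base R sketch. It only shows what coercing unsuitable character values to double does in plain R; the example values are hypothetical, and the reader used by this function may raise an error instead of silently producing NA.

```r
# Hypothetical output type ID values: two quantile levels and a text label.
ids <- c("0.25", "0.75", "median")

# Base R coercion turns the non-numeric label into NA (with a warning,
# suppressed here), so information is lost rather than preserved.
coerced <- suppressWarnings(as.double(ids))

print(coerced)
```

Keeping "character" (or leaving the default in place) avoids this kind of silent loss when text-valued output type IDs are present.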
Value
Depending on whether validation has succeeded, one of:
- <message/check_success> condition class object.
- <error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Details
If datasets are versioned, multiple observations are allowed in time-series
target data, so long as they have different as_of values. The as_of column
is therefore included when determining duplicates.
In oracle-output data, there should be only a single observation,
regardless of the as_of value, so the column is not included when
determining duplicates.
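The two rules above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: find_duplicates() is a hypothetical helper, the toy table and its column names (other than as_of) are made up, and the sketch assumes duplicates are judged on identifying columns rather than on the observed value.

```r
# Hypothetical helper: flag rows whose key columns repeat an earlier row.
find_duplicates <- function(tbl, key_cols) {
  duplicated(tbl[, key_cols, drop = FALSE])
}

# Toy versioned target data (assumed column names).
tbl <- data.frame(
  target_end_date = as.Date(c("2024-01-06", "2024-01-06", "2024-01-06")),
  location = c("US", "US", "US"),
  as_of = as.Date(c("2024-01-08", "2024-01-15", "2024-01-15")),
  observation = c(10, 12, 12)
)

# time-series rule: as_of is part of the key, so rows 1 and 2 are
# distinct versions and only row 3 repeats an earlier row.
ts_dups <- find_duplicates(tbl, c("target_end_date", "location", "as_of"))

# oracle-output rule: as_of is ignored, so any second row for the same
# date and location counts as a duplicate.
oo_dups <- find_duplicates(tbl, c("target_end_date", "location"))

print(ts_dups)
print(oo_dups)
```

Under these assumptions, the same table passes more rows under the time-series rule than under the oracle-output rule, because versions distinguished only by as_of are tolerated in the former.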