
class hubdata.HubConnection(hub_path: str | Path)

Provides convenient access to various parts of a hub’s tasks.json file. Use the connect_hub() function to create instances of this class rather than instantiating it directly.

Instance variables:
  • hub_path – str pointing to a hub’s root directory, as passed to connect_hub()
  • schema – the pa.Schema used by HubConnection.get_dataset(). Created by the constructor via create_hub_schema()
  • admin – the hub’s admin.json contents as a dict
  • tasks – the hub’s tasks.json contents as a dict
  • model_output_dir – Path to the hub’s model output directory

get_dataset(exclude_invalid_files: bool = False, ignore_files: Iterable[str] = ('README', '.DS_Store')) → Dataset

Main entry point for getting a pyarrow dataset to work with. Prints a warning about any files that were skipped during dataset file discovery.

Parameters:
  • exclude_invalid_files – passed through to pyarrow’s dataset.dataset() function. Defaults to False, which works for most situations

  • ignore_files – a list of file names (not paths) or file prefixes to ignore when discovering model output files to include in dataset connections. Parent directory names should not be included. The default ignores the common files “README” and “.DS_Store”; additional files can be excluded by specifying them here.

Returns:

a pyarrow.dataset.Dataset for this connection’s model_output_dir
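A minimal usage sketch of the flow described above (the hub path is a placeholder; a real hub directory is required for this to run):

```python
import hubdata

# Connect to a hub; the path below is a placeholder.
hub = hubdata.connect_hub("/abs/path/to/hub")

# The schema was built by the constructor via create_hub_schema().
print(hub.schema)

# A pyarrow Dataset over the hub's model output directory; a warning is
# printed for any files skipped during dataset file discovery.
dataset = hub.get_dataset()
```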

to_table(*args, **kwargs) → Table

A convenience function that passes args and kwargs to pyarrow.dataset.Dataset.to_table() and returns the resulting pyarrow.Table.

class hubdata.TargetDataConnection(hub_path: str | Path, target_type: TargetType)

Returned by connect_target_data(); the primary way of interacting with a hub’s target data.

Instance variables:
  • target_type – the TargetType passed to the constructor
  • hub_conn – a HubConnection for the passed hub_path
  • found_file_info – a fs.FileInfo that’s the target data source, as returned by _validate_target_data()
  • schema – the pa.Schema for get_dataset(), as returned by create_target_data_schema(). Note that it is None if the schema is to be inferred from the data

get_dataset() → Dataset

Main entry point for getting a pyarrow dataset to work with.

Returns:

a ds.Dataset for the passed hub_path. Note that a dataset is returned even in the single-file case so that the user can control when data is materialized into memory. The returned Dataset’s schema is as returned by create_target_data_schema(), which returns None if hub_path has no hub-config/target-data.json file, in which case the schema is inferred from the data.

to_table(*args, **kwargs) → Table

A convenience function that passes args and kwargs to pyarrow.dataset.Dataset.to_table() and returns the resulting pyarrow.Table.

hubdata.connect_hub(hub_path: str | Path) → HubConnection

The main entry point for connecting to a hub. It provides access to the instance variables documented in HubConnection, including admin.json and tasks.json as dicts, and allows connecting to data in the hub’s model output directory for querying and filtering across all model files. The hub can be located in a local file system or in the cloud on AWS or GCS. Note: Calls create_hub_schema() to get the schema used by HubConnection.get_dataset(). See https://docs.hubverse.io/en/latest/user-guide/hub-structure.html for details on how hub directories are laid out.

Parameters:

hub_path – str (for local file system or cloud-based hubs) or Path (local file systems only) pointing to a hub’s root directory. It is passed to pyarrow.fs.FileSystem.from_uri (https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystem.html#pyarrow.fs.FileSystem.from_uri). From that page: recognized URI schemes are “file”, “mock”, “s3fs”, “gs”, “gcs”, “hdfs” and “viewfs”; in addition, the argument can be a local path, either a pathlib.Path object or a str. NB: a local path passed as a str must be an ABSOLUTE path, but a Path may be relative.

Returns:

a HubConnection

Raises:

RuntimeError if hub_path is invalid
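A sketch of the accepted hub_path forms (all paths and the bucket name below are placeholders):

```python
from pathlib import Path
import hubdata

# A local hub given as a str must be an absolute path ...
hub = hubdata.connect_hub("/abs/path/to/hub")

# ... but a pathlib.Path may be relative.
hub = hubdata.connect_hub(Path("relative/path/to/hub"))

# Cloud hubs use a URI recognized by pyarrow.fs.FileSystem.from_uri.
hub = hubdata.connect_hub("s3://example-bucket/example-hub")
```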

hubdata.connect_target_data(hub_path: str | Path, target_type: TargetType) → TargetDataConnection

Top-level function for accessing the time-series target data or oracle-output target data for the passed hub_path. Like connect_hub(), it returns a “connection” object (a TargetDataConnection in this case) that provides useful instance variables but is mainly used to get a Dataset via TargetDataConnection.get_dataset(), similar to HubConnection.get_dataset().

Parameters:
  • hub_path – str (for local file system or cloud-based hubs) or Path (local file systems only) pointing to a hub’s root directory. It is passed to pyarrow.fs.FileSystem.from_uri (https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystem.html#pyarrow.fs.FileSystem.from_uri). From that page: recognized URI schemes are “file”, “mock”, “s3fs”, “gs”, “gcs”, “hdfs” and “viewfs”; in addition, the argument can be a local path, either a pathlib.Path object or a str. NB: a local path passed as a str must be an ABSOLUTE path, but a Path may be relative.

  • target_type – a TargetType specifying the target data type

Returns:

a TargetDataConnection

Raises:

RuntimeError if hub_path is invalid

RuntimeError if the hub has no time-series target data or oracle-output target data, i.e., no target-data/time-series.csv, target-data/time-series.parquet, or target-data/time-series/ files/dir (for the time-series case), or target-data/oracle-output.csv, target-data/oracle-output.parquet, or target-data/oracle-output/ files/dir (for the oracle-output case)
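A usage sketch. The hub path is a placeholder, and the assumption that TargetType is importable from the hubdata package with a TIME_SERIES member is mine, not from the reference above:

```python
import hubdata
from hubdata import TargetType  # assumed import location

# TIME_SERIES is an assumed member name for the time-series target type.
conn = hubdata.connect_target_data("/abs/path/to/hub", TargetType.TIME_SERIES)

# conn.schema is None if the schema is to be inferred from the data.
dataset = conn.get_dataset()
table = conn.to_table()
```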

hubdata.create_hub_schema(tasks: dict, output_type_id_datatype: str = 'from_config', partitions: tuple[tuple[str, DataType]] | None = (('model_id', DataType(string)),)) → Schema

Top-level function for creating a schema for the passed tasks.

Parameters:
  • tasks – a hub’s tasks.json contents from which to create a schema - see HubConnection.tasks

  • output_type_id_datatype – a string that’s one of “from_config”, “auto”, “character”, “double”, “integer”, “logical”, or “Date”. Defaults to “from_config”, which uses the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to “auto”, which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When a hub collects only point estimate output types (where output_type_id values are NA), the auto-determined output_type_id column is assigned a character data type.

  • partitions – a tuple of 2-tuples (column_name, data_type) specifying the arrow data types of any partitioning columns. Pass None if there are no partitions

Returns:

a pyarrow.Schema for the passed tasks

hubdata.create_target_data_schema(hub_path: str | Path, target_type: TargetType) → Schema | None

Top-level function for creating a time-series target schema or oracle-output target schema for the passed hub_path.

Parameters:
  • hub_path – str (for local file system or cloud-based hubs) or Path (local file systems only) pointing to a hub’s root directory. It is passed to pyarrow.fs.FileSystem.from_uri (https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystem.html#pyarrow.fs.FileSystem.from_uri). From that page: recognized URI schemes are “file”, “mock”, “s3fs”, “gs”, “gcs”, “hdfs” and “viewfs”; in addition, the argument can be a local path, either a pathlib.Path object or a str. NB: a local path passed as a str must be an ABSOLUTE path, but a Path may be relative.

  • target_type – a TargetType specifying the target data schema type

Returns:

a pyarrow.Schema for the passed hub_path if a hub-config/target-data.json file is present; otherwise None

Raises:

RuntimeError if hub_path is invalid
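A usage sketch. The hub path is a placeholder, and the TargetType import location and ORACLE_OUTPUT member name are assumptions on my part:

```python
import hubdata
from hubdata import TargetType  # assumed import location

# None is returned when the hub has no hub-config/target-data.json file,
# in which case the schema is later inferred from the data itself.
schema = hubdata.create_target_data_schema("/abs/path/to/hub", TargetType.ORACLE_OUTPUT)
if schema is None:
    print("schema will be inferred from the data")
```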