HubConfig
- class hubdata.HubConnection(hub_path: str | Path)
Provides convenient access to various parts of a hub’s tasks.json file. Use the connect_hub function to create instances of this class rather than instantiating it directly.
Instance variables:
- hub_path: str pointing to a hub’s root directory, as passed to connect_hub()
- schema: the pa.Schema for HubConnection.get_dataset(), created by the constructor via create_hub_schema()
- admin: the hub’s admin.json contents as a dict
- tasks: the hub’s tasks.json contents as a dict
- model_output_dir: Path to the hub’s model output directory
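A minimal sketch of reading these instance variables, assuming the hubdata package is installed and hub_path points at a valid hub. summarize_hub is a hypothetical helper, not part of hubdata; it relies only on the attributes listed above and on the names property of a pyarrow Schema.

```python
def summarize_hub(hub_path):
    """Collect a few facts about a hub (hypothetical helper, not part of hubdata)."""
    # Imported here so the sketch can be defined even where hubdata is absent.
    from hubdata import connect_hub

    hub = connect_hub(hub_path)  # returns a HubConnection
    return {
        "hub_path": hub.hub_path,
        "model_output_dir": hub.model_output_dir,
        # Column names of the schema built via create_hub_schema()
        "schema_columns": list(hub.schema.names),
        # admin and tasks are plain dicts, so their top-level keys can be listed
        "admin_keys": sorted(hub.admin),
        "tasks_keys": sorted(hub.tasks),
    }
```

Returning a plain dict keeps the sketch independent of how the result is displayed or logged.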
- hubdata.connect_hub(hub_path: str | Path)
The main entry point for connecting to a hub, providing access to the instance variables documented in HubConnection, including admin.json and tasks.json as dicts. It also allows connecting to data in the hub’s model output directory for querying and filtering across all model files. The hub can be located in a local file system or in the cloud on AWS or GCS. Note: Calls create_hub_schema() to get the schema to use when calling HubConnection.get_dataset(). See https://docs.hubverse.io/en/latest/user-guide/hub-structure.html for details on how hub directories are laid out.
- Parameters:
hub_path – str (for local file system hubs or cloud-based ones) or Path (local file systems only) pointing to a hub’s root directory. It is passed to pyarrow.fs.FileSystem.from_uri (https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystem.html#pyarrow.fs.FileSystem.from_uri). From that page: Recognized URI schemes are “file”, “mock”, “s3fs”, “gs”, “gcs”, “hdfs” and “viewfs”. In addition, the argument can be a local path, either a pathlib.Path object or a str. NB: A local path passed as a str must be an ABSOLUTE path, whereas a local path passed as a Path may be relative.
- Returns:
a HubConnection
- Raises:
RuntimeError if hub_path is invalid
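The NB on hub_path above can be sketched as a small pre-processing step applied before calling connect_hub. normalize_hub_path is a hypothetical helper, not part of hubdata, and treating any string containing "://" as a URI is a simplifying assumption; the bucket name in the usage note is also made up.

```python
from pathlib import Path


def normalize_hub_path(hub_path):
    """Return a value that satisfies the hub_path rules (hypothetical helper).

    - Path objects are returned unchanged, since relative Paths are accepted.
    - Strings that look like URIs (assumed: they contain "://") are returned unchanged.
    - Any other string is treated as a local path and made absolute.
    """
    if isinstance(hub_path, Path):
        return hub_path
    if "://" in hub_path:
        return hub_path
    # A local path given as a str must be absolute, so resolve it
    # against the current working directory.
    return str(Path(hub_path).expanduser().resolve())
```

With this in place, a relative string such as "hubs/demo" becomes an absolute path, while Path("hubs/demo") and "s3://example-bucket/hub" pass through untouched. An invalid hub_path would still raise RuntimeError from connect_hub.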
- hubdata.create_hub_schema(tasks: dict, output_type_id_datatype: str = 'from_config', partitions: tuple[tuple[str, DataType]] | None = (('model_id', DataType(string)),)) → schema
Top-level function for creating a schema from a hub’s tasks.json contents.
- Parameters:
tasks – a hub’s tasks.json contents from which to create a schema - see HubConnection.tasks
output_type_id_datatype – a string that’s one of “from_config”, “auto”, “character”, “double”, “integer”, “logical”, “Date”. Defaults to “from_config”, which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to “auto”, which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_ids are NA) are being collected by a hub, the output_type_id column is assigned a character data type when auto-determined.
partitions – a tuple of 2-tuples (column_name, data_type) specifying the Arrow data types of any partitioning columns. Defaults to a single model_id column of string type. Pass None if there are no partitions.
- Returns:
a pyarrow.Schema built from the passed tasks
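The “auto” behaviour described for output_type_id_datatype can be illustrated with a rough sketch. This is not hubdata’s implementation: the function names, the classification of Python values, and the widening rule (integer and double combine to double, any other mix falls back to character) are assumptions made for illustration. Only the type names and the all-NA case come from the description above.

```python
from datetime import date


def classify(value):
    """Map one output type ID value to a type name (illustrative assumption)."""
    # bool is checked before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "logical"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)
            return "Date"
        except ValueError:
            return "character"
    raise TypeError(f"unsupported output type ID value: {value!r}")


def auto_output_type_id_datatype(values):
    """Pick the simplest type name that can represent every value (sketch)."""
    # None stands in for NA and carries no type information
    kinds = {classify(v) for v in values if v is not None}
    if not kinds:
        # Only NA values, as with point estimates: character
        return "character"
    if len(kinds) == 1:
        return next(iter(kinds))
    if kinds == {"integer", "double"}:
        # Assumed widening rule: integers fit in a double column
        return "double"
    # Assumed fallback: any other mix is stored as character
    return "character"
```

For example, quantile levels such as 0.025, 0.5 and 0.975 would yield “double” under these assumptions, while mixing them with category labels would yield “character”.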