Command-line interface
The package provides a command-line interface (CLI) called hubdata with the following subcommands:
- schema: Print a hub’s schema, i.e., the columns and data types inferred from the hub’s tasks.json file.
- dataset: Print summary information about the data in a hub’s model output directory, along with the same information as the schema subcommand. This command can take some time to run because it must scan all data files in the hub.
- time-series: Print information about a hub’s time series target data, including its schema.
- oracle-output: Print information about a hub’s oracle output target data, including its schema.
Note: This package is based on the Python version of the Apache Arrow library (pyarrow).
Note: To see command-line help, run the hubdata command with the --help option, with or without a subcommand. For example, hubdata --help or hubdata dataset --help.
Show the schema of a test hub (the schema subcommand)
Note: All shell examples assume you’re using Bash, and that you first cd into this repo’s root directory, e.g., cd /<path_to_repos>/hub-data/.
Here’s an example of running the schema subcommand on the flu-metrocast test hub included in this package. We use the pwd shell command to create the absolute path that the app requires.
hubdata schema "$(pwd)/test/hubs/flu-metrocast"
╭─ schema ─────────────────────────────────────────────────────────╮
│ │
│ hub_path: │
│ - /<path_to_repos>/hub-data/test/hubs/flu-metrocast │
│ │
│ schema: │
│ - horizon: int32 │
│ - location: string │
│ - model_id: string │
│ - output_type: string │
│ - output_type_id: double │
│ - reference_date: date32 │
│ - target: string │
│ - target_end_date: date32 │
│ - value: double │
│ │
╰──────────────────────────────────────────────────────── hubdata ─╯
Output explanation:
- hub_path: the argument passed to the app (shown here as /<path_to_repos>/, but your output will show the actual directory location)
- schema: the schema obtained via the API’s create_hub_schema() function
Show model output information of a test hub (the dataset subcommand)
Here’s the output from running the dataset subcommand on the same test hub:
hubdata dataset "$(pwd)/test/hubs/flu-metrocast"
╭─ dataset ────────────────────────────────────────────────────────╮
│ │
│ hub_path: │
│ - /<path_to_repos>/hub-data/test/hubs/flu-metrocast │
│ │
│ schema: │
│ - horizon: int32 │
│ - location: string │
│ - model_id: string │
│ - output_type: string │
│ - output_type_id: double │
│ - reference_date: date32 │
│ - target: string │
│ - target_end_date: date32 │
│ - value: double │
│ │
│ dataset: │
│ - files: 31 │
│ - types: csv (found) | csv (admin) │
│ │
╰──────────────────────────────────────────────────────── hubdata ─╯
Output explanation:
- hub_path: same as in the example above
- schema: same as in the example above
- dataset: information about the files in the hub’s model output directory:
  - files: the number of files in the dataset
  - types: the file types a) actually found in the dataset (found), and b) specified in the hub’s admin.json file (admin)
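To make the files and types (found) fields more concrete, here is a minimal sketch that counts data files by extension under a directory using only the Python standard library. It is an illustration, not the package's implementation: the helper name summarize_files, the set of recognized extensions, and the demo file names are all assumptions made for this example.

```python
import tempfile
from collections import Counter
from pathlib import Path


def summarize_files(root: Path, extensions=(".csv", ".parquet")) -> tuple[int, list[str]]:
    """Return (file count, sorted file types) for data files under root.

    Hypothetical helper for illustration only; hubdata itself may count
    files differently.
    """
    counts = Counter(
        path.suffix.lstrip(".")
        for path in root.rglob("*")  # walk the directory tree recursively
        if path.is_file() and path.suffix in extensions
    )
    return sum(counts.values()), sorted(counts)


# Demo on a throwaway directory laid out like a small model output directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "model-a").mkdir()
    (root / "model-a" / "2025-01-01-model-a.csv").write_text("value\n1\n")
    (root / "model-a" / "2025-01-08-model-a.csv").write_text("value\n2\n")
    n_files, types = summarize_files(root)
    print(f"files: {n_files}")
    print(f"types: {', '.join(types)}")
```

Files with other extensions are ignored in this sketch, which is one reason a count like this could differ from what the CLI reports.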
Show model output information of an S3-based hub (the dataset subcommand)
The CLI command also works with S3 URIs. Here we run it against the cloud-enabled example-complex-forecast-hub’s S3 bucket:
Note: An S3 URI (Uniform Resource Identifier) for Amazon S3 has the format s3://<bucket-name>/<key-name>. It uniquely identifies an object stored in an S3 bucket. For example, s3://my-bucket/data.txt refers to a file named data.txt within the bucket named my-bucket.
hubdata dataset s3://example-complex-forecast-hub/
╭─ dataset ────────────────────────────────╮
│ │
│ hub_path: │
│ - s3://example-complex-forecast-hub/ │
│ │
│ schema: │
│ - horizon: int32 │
│ - location: string │
│ - model_id: string │
│ - output_type: string │
│ - output_type_id: string │
│ - reference_date: date32 │
│ - target: string │
│ - target_end_date: date32 │
│ - value: double │
│ │
│ dataset: │
│ - files: 12 │
│ - types: parquet (found) | csv (admin) │
│ │
╰──────────────────────────────── hubdata ─╯
Note: This package can be slow with cloud-based hubs because of how pyarrow’s dataset scanning works.
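The s3://<bucket-name>/<key-name> format described in the note above can be split into its two parts with Python's standard urllib.parse module. The sketch below is only an illustration of the URI format: split_s3_uri is a hypothetical helper, not part of hubdata, and it does not contact S3.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an S3 URI into (bucket, key).

    Hypothetical helper for illustration; not part of the hubdata API.
    """
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the URI's network location; the key is the path
    # without its leading slash (empty when the URI names only a bucket).
    return parts.netloc, parts.path.lstrip("/")


bucket, key = split_s3_uri("s3://my-bucket/data.txt")
print(bucket)  # my-bucket
print(key)     # data.txt
```

For a bucket-level URI such as s3://example-complex-forecast-hub/, the key part comes back as an empty string.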
Show time series target data for flu-metrocast (the time-series subcommand)
Here we look at the time series target data for a local clone of the flu-metrocast hub:
hubdata time-series /<path_to_repos>/flu-metrocast
╭─ target data ─────────────────────────────────╮
│ │
│ hub_path: │
│ - /<path_to_repos>/flu-metrocast │
│ │
│ target type: │
│ - time-series │
│ │
│ schema: │
│ - None (inferred from data) │
│ │
│ dataset: │
│ - location: time-series.csv (file) │
│ - files: 1 │
│ - type: csv │
│ │
╰───────────────────────────────────── hubdata ─╯
Output explanation:
- hub_path: same as in the examples above
- target type: indicates which target data was obtained, either time-series or oracle-output
- schema: either column and type information as in the examples above or, as in this case, None (inferred from data) if no target data configuration (target-data.json file) was found
- dataset: information about the files in the hub’s target data, either time series (as in this case) or oracle output:
  - location: where the target data is stored in the hub (see File formats for details). Shows the file or directory name followed by an indication of its type, either (file) (as in this case) or (dir)
  - files: the number of files in the dataset
  - type: the file type found in the dataset
Show oracle output target data for the v6_target_dir test hub (the oracle-output subcommand)
Here’s an example of showing oracle-output target data from the v6_target_dir test hub included in this package.
hubdata oracle-output "$(pwd)/test/hubs/v6_target_dir"
╭─ target data ────────────────────────────────────────────────────╮
│ │
│ hub_path: │
│ - /<path_to_repos>/hub-data/test/hubs/v6_target_dir │
│ │
│ target type: │
│ - oracle-output │
│ │
│ schema: │
│ - location: string │
│ - oracle_value: double │
│ - output_type: string │
│ - output_type_id: string │
│ - target: string │
│ - target_end_date: date32 │
│ │
│ dataset: │
│ - location: oracle-output (dir) │
│ - files: 5 │
│ - type: parquet │
│ │
╰──────────────────────────────────────────────────────── hubdata ─╯
Output explanation:
- hub_path: same as in the examples above
- target type: same as in the example above
- schema: column and type information (a target-data.json file was found)
- dataset: information about the files in the hub’s target data (time series or oracle output):
  - location: same as in the example above, but (dir) in this case
  - files: same as in the example above
  - type: same as in the example above