Effective and Consistent Configuration via YAML & CLI with Hydra

A frequent requirement for production Python applications is that they are configurable via configuration files and/or the command-line interface (CLI). This allows you to change the behavior of your application without touching the source code, e.g. by configuring another database URL or the logging verbosity. For the CLI part, argparse or click is often used, and with PyYAML configuration files can be read easily, so where is the problem?

Configuration of a Python application via the CLI and configuration via a YAML file have many things in common, i.e., both

configure the runtime behaviour of your application,
need to implement validations, e.g. is the port an integer of at least 1024,
need to be consistent and mergeable, i.e. a CLI flag should be named like the YAML key, and if both are passed, the CLI value overrides the YAML value.

Thus, implementing configuration via a CLI and via a YAML file separately often leads to code duplication and inconsistent behavior, not to mention the considerable amount of work needed to get this right.
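
To make the "mergeable" requirement concrete, here is a minimal standard-library sketch of how dotted key=value overrides from the command line could be layered on top of values read from a YAML file. The function name apply_overrides and the sample values are assumptions for illustration only; this is not how Hydra implements it.

```python
from copy import deepcopy


def apply_overrides(config: dict, overrides: list) -> dict:
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    merged = deepcopy(config)
    for override in overrides:
        dotted_key, _, raw_value = override.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            # walk (or create) the nested dictionaries named by the dotted key
            node = node.setdefault(part, {})
        # CLI values arrive as strings; a real tool would also convert them
        # to the type declared in the schema.
        node[leaf] = raw_value
    return merged


# Stand-in for values that would normally be parsed from a YAML file.
yaml_config = {"db": {"host": "server_string", "port": "1028"}, "verbose": "false"}

merged = apply_overrides(yaml_config, ["db.port=1832", "verbose=true"])
print(merged)
```

Even this toy version has to decide on key naming, precedence and type conversion, which is exactly the kind of work you do not want to repeat in every project.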

With this in mind, Facebook implemented the Hydra library, which allows you to do hierarchical configuration by composition of config files and overrides via the command line. In this blog post, we demonstrate in an example project the most important features of Hydra and how it can be used in conjunction with pydantic, which extends its validation capabilities. To follow along, check out this repository that serves as a demonstration, but also as a playground for you.

Ok, so give me the gist of how Hydra works

Sure, just take a look at cli.py and config.py first, as these are the only files we added, roughly 70 lines of code. The hierarchical configuration can be found in the configs folder and looks like this:

├── configs
│   ├── main.yaml             <- entry point for configuration
│   ├── db                    <- database configuration group
│   │   ├── mysql.yaml        <- configuration for MySQL
│   │   └── postgresql.yaml   <- configuration for PostgreSQL
│   └── experiment            <- experiment configuration group
│       ├── exp1.yaml         <- configuration for experiment 1
│       ├── exp2.yaml         <- configuration for experiment 2
│       ├── missing_key.yaml  <- wrong configuration with missing key
│       └── wrong_type.yaml   <- wrong configuration with wrong type

Basically, this structure allows you to mix and match your configuration, for instance by combining the configuration for the MySQL database with the configuration for experiment 2. Hydra then creates one consistent configuration object for you, a kind of nested dictionary in which each configuration group is an attribute.
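
The mix-and-match idea can be sketched in a few lines of plain Python. The dictionary CONFIG_GROUPS is a hypothetical in-memory stand-in for the YAML files in the configs folder, and the values for exp2 are placeholders because its content is not shown in this post; Hydra itself does the composition from files and returns an OmegaConf object, not a SimpleNamespace.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the YAML files of each configuration group.
CONFIG_GROUPS = {
    "db": {
        "mysql": {"driver": "mysql", "host": "server_string"},
        "postgresql": {"driver": "postgresql", "host": "server_string"},
    },
    "experiment": {
        "exp1": {"model": "XGBoost", "l2": 0.01, "n_steps": 1000},
        "exp2": {"model": "placeholder_model", "n_steps": 1},  # placeholder values
    },
}


def compose(choices: dict) -> SimpleNamespace:
    """Pick one option per group and expose each group as an attribute."""
    groups = {
        group: SimpleNamespace(**CONFIG_GROUPS[group][option])
        for group, option in choices.items()
    }
    return SimpleNamespace(**groups)


cfg = compose({"db": "mysql", "experiment": "exp2"})
print(cfg.db.driver, cfg.experiment.model)
```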

In the repository of our example project, we defined the CLI command hydra-test by changing the following lines in setup.cfg:

# Add here console scripts like:
console_scripts =
     hydra-test = my_pkg.cli:main

We can thus invoke our application with the console command hydra-test and this will execute the main function in cli.py:

@hydra.main(config_path=None, config_name="main")
def main(cfg: Config) -> None:
    # this line actually runs the checks of pydantic
    OmegaConf.to_object(cfg)
    # log to console and into the `outputs` folder per default
    log.info(f"\n{OmegaConf.to_yaml(cfg)}")
    # note that IDEs allow auto-complete for accessing the attributes!
    time.sleep(cfg.main.sleep)

Looking at the actual code, we see that we only trigger some pydantic checks to see if the configuration and CLI parameters are correct, then we log the current configuration and sleep for the time defined in the configuration.

So executing just hydra-test results in:

Cannot find primary config 'main'. Check that it's in your config search path.

Config search path:
    provider=hydra, path=pkg://hydra.conf
    provider=main, path=pkg://my_pkg
    provider=schema, path=structured://

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

This is due to the fact that we set config_path=None, which is desirable for a production application. The application itself doesn’t know where it is going to be installed, so hard-coding a path to the configuration files doesn’t make sense. For this reason, we pass the configuration directory at execution time with -cd, the short form of --config-dir:

hydra-test -cd configs

This results in the error:

Error executing job with overrides: []
Traceback (most recent call last):
  File ".../hydra-example-project/src/my_pkg/cli.py", line 11, in main
    OmegaConf.to_object(cfg)
omegaconf.errors.MissingMandatoryValue: Structured config of type `Config` has missing mandatory value: experiment
    full_key: experiment
    object_type=Config

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

This behavior is exactly as we want it, because a look into config.py shows us that the schema of the main configuration is:

@dataclass
class Config:
    main: Main
    db: DataBase
    neptune: Neptune
    experiment: Experiment = MISSING

and the experiment is defined as MISSING. Therefore, experiment is a mandatory parameter that the user needs to provide via the CLI. Consequently, we add +experiment=exp1 to select the configuration from exp1.yaml and finally get what we would expect:

❯ hydra-test -cd configs +experiment=exp1
[2022-01-27 08:14:34,257][my_pkg.cli][INFO] -
main:
  sleep: 3
neptune:
  project: florian.wilhelm/my_expriments
  api_token: ~/.neptune_api_token
  tags:
  - run-1
  description: Experiment run on GCP
  mode: async
db:
  driver: mysql
  host: server_string
  port: ${oc.env:MYSQL_PORT,1028}
  username: myself
  password: secret
experiment:
  model: XGBoost
  l2: 0.01
  n_steps: 1000

Note the plus sign in the flag +experiment. This is needed since we add the mandatory experiment parameter. Conveniently, Hydra has also set up the logging for us and besides logging to the terminal, all output will also be collected in the ./outputs folder.
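
The mandatory-value mechanism is easy to picture with a small sketch. OmegaConf's MISSING is the marker string "???"; the simplified Config below and the helper missing_keys are assumptions made for illustration and only mimic the check that OmegaConf performs for us.

```python
from dataclasses import dataclass, fields

MISSING = "???"  # same marker string that OmegaConf uses for mandatory values


@dataclass
class Config:
    sleep: int = 3
    experiment: str = MISSING  # no default, so the user has to provide it


def missing_keys(cfg) -> list:
    """Return the names of all fields that still hold the MISSING marker."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) == MISSING]


print(missing_keys(Config()))
print(missing_keys(Config(experiment="exp1")))
```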

So the sections main and neptune are directly defined in main.yaml, but why did Hydra choose the MySQL database? This is due to the fact that we defined some defaults in main.yaml:

# hydra section to build up the config hierarchy with defaults
defaults:
  - _self_
  - base_config
  - db: mysql.yaml
  # experiment: is not mentioned here but in config.py to have a mandatory setting

Taking a look into mysql.yaml, we see that Hydra also allows accessing environment variables easily to help with configuration. As an example, we defined the database port to be whatever the environment variable MYSQL_PORT is set to or 1028 if undefined. So Hydra does not only unify the configuration via YAML and CLI but also via environment variables.

driver: mysql
host: server_string
port: ${oc.env:MYSQL_PORT,1028}
username: myself
password: secret
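
To see what the ${oc.env:MYSQL_PORT,1028} interpolation boils down to, here is a small standard-library sketch of the same fallback logic. The function resolve_env and its regular expression are assumptions for illustration; OmegaConf's real resolver is more general, handling nesting and quoting for example.

```python
import os
import re

# Matches ${oc.env:NAME} and ${oc.env:NAME,default}
_ENV_PATTERN = re.compile(r"\$\{oc\.env:(\w+)(?:,([^}]*))?\}")


def resolve_env(value: str) -> str:
    """Replace oc.env-style placeholders with environment values or defaults."""

    def _substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in os.environ:
            return os.environ[name]
        if default is None:
            raise KeyError(f"Environment variable {name} is not set")
        return default

    return _ENV_PATTERN.sub(_substitute, value)


# Prints the value of MYSQL_PORT if it is set, otherwise the default 1028.
print(resolve_env("${oc.env:MYSQL_PORT,1028}"))
```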

We can also override the default database by adding the flag db=postgresql. This time the flag has no + as we override a default:

❯ hydra-test -cd configs +experiment=exp1 db=postgresql
Error executing job with overrides: ['+experiment=exp1', 'db=postgresql']
Traceback (most recent call last):
  File ".../hydra-example-project/src/my_pkg/cli.py", line 11, in main
    OmegaConf.to_object(cfg)
pydantic.error_wrappers.ValidationError: 1 validation error for DataBase
port
  Choose a non-privileged port! (type=value_error)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Nice! This works just as expected by telling us that our port configuration is actually wrong, as we chose a privileged port! This is the magic of pydantic doing its validation work. Taking a look into config.py, we see the check that ensures the port is not smaller than 1024.

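from pydantic import validator
from pydantic.dataclasses import dataclass


@dataclass
class DataBase:
    driver: str
    host: str
    port: int
    username: str
    password: str

    @validator("port")
    def check_non_privileged_port(cls, port: int) -> int:
        # ports below 1024 are privileged and therefore rejected
        if port < 1024:
            raise ValueError("Choose a non-privileged port!")
        return port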

Good, we can now fix our configuration file or just pass an extra parameter if we are in a hurry, i.e.:

❯ hydra-test -cd configs +experiment=exp1 db=postgresql db.port=1832
[2022-01-27 08:13:52,148][my_pkg.cli][INFO] -
main:
  sleep: 3
neptune:
  project: florian.wilhelm/my_expriments
  api_token: ~/.neptune_api_token
  tags:
  - run-1
  description: Experiment run on GCP
  mode: async
db:
  driver: postgreqsql
  host: server_string
  port: 1832
  username: me
  password: birthday
experiment:
  model: XGBoost
  l2: 0.01
  n_steps: 1000

And this works! So much flexibility and robustness in just 70 lines of code, awesome! While you are at it, you can also run hydra-test -cd configs +experiment=missing_key and hydra-test -cd configs +experiment=wrong_type to see some nice errors from pydantic telling you about a missing key and a wrong type of a configuration value, respectively. By the way, passing an invalid port on the command line, e.g. db.port=72, would have triggered the same exception, so the configuration via the CLI and via YAML share the same checks and validations. Hydra and pydantic work nicely together to make this possible, and pydantic greatly extends the validation capabilities of OmegaConf, which Hydra uses by default. Just remember to use the dataclass from pydantic, not the one from the standard library, and to call OmegaConf.to_object(cfg) at the start of your application to fail as early as possible.
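
The point that both sources pass through the same check can be illustrated with a dependency-free sketch. It deliberately swaps the pydantic validator for a standard-library dataclass with __post_init__, purely so that it runs without extra packages; it is a stand-in for the idea, not a replacement for the pydantic setup of the example project, and the helper build_db is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataBase:
    host: str
    port: int

    def __post_init__(self) -> None:
        # the single place where the port rule lives
        if self.port < 1024:
            raise ValueError("Choose a non-privileged port!")


def build_db(yaml_values: dict, cli_overrides: dict) -> DataBase:
    """Merge both sources (CLI wins) and validate the result once."""
    return DataBase(**{**yaml_values, **cli_overrides})


# A privileged port from the file is fixed by an override on the command line.
db = build_db({"host": "server_string", "port": 72}, {"port": 1832})
print(db)
```

Because validation happens after merging, it does not matter whether a bad value came from the file or from the command line.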

Hydra has many more really nice features. Imagine you now want to run the experiments exp1 and exp2 consecutively; you can just use the --multirun feature, or -m for short:

hydra-test -m -cd configs +experiment=exp1,exp2

Or in case you have hundreds of experiments, you can also use globbing like:

hydra-test -m -cd configs "+experiment=glob(exp*)"
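
Conceptually, a multirun expands comma-separated values into one job per combination, and a glob selects matching option names. The following sketch mimics that with itertools and fnmatch; the function expand_sweep and the list of option names are assumptions for illustration, and Hydra's own sweeper supports a much richer override syntax.

```python
from fnmatch import fnmatch
from itertools import product


def expand_sweep(overrides: list) -> list:
    """Expand comma-separated override values into one override list per job."""
    keys, choices = [], []
    for override in overrides:
        key, _, values = override.partition("=")
        keys.append(key)
        choices.append(values.split(","))
    # one job for every combination of the listed values
    return [
        [f"{key}={value}" for key, value in zip(keys, combination)]
        for combination in product(*choices)
    ]


jobs = expand_sweep(["+experiment=exp1,exp2", "db=mysql,postgresql"])
for number, job in enumerate(jobs):
    print(number, job)

# Globbing over the option names of the experiment group.
options = ["exp1", "exp2", "missing_key", "wrong_type"]
selected = [name for name in options if fnmatch(name, "exp*")]
print(selected)
```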

There’s so much more to Hydra, and several plugins exist, even for hyperparameter optimization. Also note that with the flag --hydra-help, you can see the Hydra-specific parameters of your application. Using just --help returns automatically generated help based on the configuration of your application. This can of course be customized easily with the help of a powerful templating system, as described in the Customizing Application’s help docs.

Hydra makes configuration by CLI, YAML and environment variables a breeze, and the time spent learning Hydra is well invested, as your application’s codebase will be more flexibly configurable, less complex and therefore more robust.
