Support eval of LoRA adapters

Our `evaluation.py` scripts currently do not support evaluating LoRA adapters.

As a result, agents always have to store fully merged weights, even when they only train adapters.

We could change the evaluation to merge adapters automatically when a checkpoint is loaded. Agents would then only need to store the adapter weights, which would keep the disk footprint of the benchmark much lower.
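
A rough sketch of what the loading step could look like, assuming the adapters are saved with PEFT's `save_pretrained` (which writes an `adapter_config.json` into the checkpoint directory) and that the models are causal LMs loaded through `transformers`. The helper names `is_lora_checkpoint` and `load_for_evaluation` are hypothetical and not part of the current `evaluation.py`:

```python
import json
from pathlib import Path


def is_lora_checkpoint(checkpoint_dir):
    """Return True if the directory looks like a PEFT adapter checkpoint.

    Assumption: adapters were saved with PEFT's save_pretrained, which
    writes an adapter_config.json next to the adapter weights.
    """
    return (Path(checkpoint_dir) / "adapter_config.json").is_file()


def load_for_evaluation(checkpoint_dir):
    """Hypothetical helper: load a model, merging LoRA adapters in memory."""
    # Imported lazily so the detection helper above works without them.
    from transformers import AutoModelForCausalLM

    if not is_lora_checkpoint(checkpoint_dir):
        # Full (already merged) weights: load as before.
        return AutoModelForCausalLM.from_pretrained(checkpoint_dir)

    from peft import PeftModel

    # Assumption: the adapter config records which base model it was
    # trained on; this field can be empty if it was not set at save time.
    config_path = Path(checkpoint_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text())
    base = AutoModelForCausalLM.from_pretrained(config["base_model_name_or_path"])

    # Attach the adapter, then fold its weights into the base model so the
    # rest of the evaluation sees an ordinary model.
    model = PeftModel.from_pretrained(base, str(checkpoint_dir))
    return model.merge_and_unload()
```

With something like this, the merge happens only in memory at evaluation time, so the merged weights never need to be written to disk. Whether the base model name in `adapter_config.json` is always resolvable in the evaluation environment would need to be checked.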