Load individual elements if state dict load fails#5213

Merged: andrewcoh merged 22 commits into main from fix-resume-imi on Apr 6, 2021

andrewcoh (Contributor) commented Apr 1, 2021

Proposed change(s)

This addresses the inability to resume training with GAIL, and with reward providers in general. If loading a module's state dict fails, the saver now copies the matching elements individually and logs a warning for everything it could not load. Expect a lot of warnings, such as:

2021-04-01 12:29:22 WARNING [torch_model_saver.py:102] Did not expect these keys ['action_model._continuous_distribution.log_sigma', 'action_model._continuous_distribution.mu.weight', 'action_model._continuous_distribution.mu.bias'] in checkpoint. Initializing
2021-04-01 12:29:22 WARNING [torch_model_saver.py:106] Failed to load for module Optimizer:value_optimizer. Initializing
2021-04-01 12:29:22 WARNING [torch_model_saver.py:98] Did not find these keys ['value_heads.value_heads.curiosity.weight', 'value_heads.value_heads.curiosity.bias', 'value_heads.value_heads.gail.weight', 'value_heads.value_heads.gail.bias'] in checkpoint. Initializing
2021-04-01 12:29:22 WARNING [torch_model_saver.py:102] Did not expect these keys ['value_heads.value_heads.extrinsic.weight', 'value_heads.value_heads.extrinsic.bias', 'value_heads.value_heads.rnd.weight', 'value_heads.value_heads.rnd.bias'] in checkpoint. Initializing
2021-04-01 12:29:22 WARNING [torch_model_saver.py:106] Failed to load for module Module:Curiosity. Initializing
2021-04-01 12:29:22 WARNING [torch_model_saver.py:106] Failed to load for module Module:GAIL. Initializing
2021-04-01 12:29:22 INFO [torch_model_saver.py:116] Resuming training from step 42896.
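
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `load_modules`, `ValueHeads`, `build`, the layer sizes, and the log wording are all hypothetical. It relies only on documented PyTorch behaviour: `nn.Module.load_state_dict(..., strict=False)` reports missing and unexpected keys instead of raising, while `Optimizer.load_state_dict` raises `ValueError` when the parameter groups no longer match.

```python
import logging

import torch
from torch import nn

logger = logging.getLogger("partial_load_sketch")


class ValueHeads(nn.Module):
    """Hypothetical stand-in for a network with one value head per reward signal."""

    def __init__(self, signals):
        super().__init__()
        self.value_heads = nn.ModuleDict({s: nn.Linear(4, 1) for s in signals})


def load_modules(modules, saved):
    """Load each saved entry into the module of the same name, one at a time.

    Returns the names of modules that could not be loaded; those keep their
    fresh initialization.
    """
    failed = []
    for name, mod in modules.items():
        try:
            if isinstance(mod, nn.Module):
                # strict=False tolerates key differences and reports them.
                result = mod.load_state_dict(saved[name], strict=False)
                if result.missing_keys:
                    logger.warning("Not in checkpoint, initializing: %s", result.missing_keys)
                if result.unexpected_keys:
                    logger.warning("In checkpoint but unused: %s", result.unexpected_keys)
            else:
                # Optimizers are not nn.Modules and have no strict flag.
                mod.load_state_dict(saved[name])
        except (KeyError, ValueError, RuntimeError) as err:
            # KeyError: module absent from the checkpoint.
            # ValueError: optimizer parameter groups changed.
            # RuntimeError: tensor size mismatch inside an nn.Module.
            failed.append(name)
            logger.warning("Failed to load for module %s (%s). Initializing", name, err)
    return failed


def build(signals, with_gail):
    """Build a hypothetical set of named modules, as a saver might track them."""
    policy = ValueHeads(signals)
    modules = {
        "Policy": policy,
        "Optimizer:value_optimizer": torch.optim.Adam(policy.parameters()),
    }
    if with_gail:
        modules["Module:GAIL"] = nn.Linear(4, 2)
    return modules


# "Old" run: extrinsic reward only. "New" run: GAIL added on resume.
old = build(["extrinsic"], with_gail=False)
saved = {name: mod.state_dict() for name, mod in old.items()}
new = build(["extrinsic", "gail"], with_gail=True)
failed = load_modules(new, saved)
```

In this sketch the extrinsic value head is restored from the checkpoint, the new GAIL head and GAIL module stay freshly initialized, and the optimizer is reset because its parameter list changed.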

TODO:

  • tests

Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)

Types of change(s)

  • Bug fix
  • New feature
  • Code refactor
  • Breaking change
  • Documentation update
  • Other (please describe)

Checklist

  • Added tests that prove my fix is effective or that my feature works
  • Updated the changelog (if applicable)
  • Updated the documentation (if applicable)
  • Updated the migration guide (if applicable)

Other comments

ervteng (Contributor) commented Apr 1, 2021

I think this warrants a doc change. In Training-ML-Agents, under "Loading an existing model", we can say something like: "If the network architecture changes, you may still load an existing model, and ML-Agents will load only the parts of the model that haven't changed. For instance, if you add a new reward signal, the existing model will load, but the new reward signal will be initialized from scratch. If you have a model with a visual encoder (CNN) but change hidden_units, the CNN will be loaded but the body of the network will not be."
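
The `hidden_units` case in the suggested wording can be illustrated with a small sketch. It is an assumption-laden example rather than what ML-Agents does internally: `TinyNet` and `load_matching` are hypothetical names, and the checkpoint is filtered by tensor shape up front instead of reacting to a failed load. The point is only that entries whose shapes still match (the CNN encoder) can be restored while the resized body keeps its fresh initialization.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical network: a CNN encoder followed by a body sized by hidden_units."""

    def __init__(self, hidden_units):
        super().__init__()
        self.encoder = nn.Conv2d(3, 8, kernel_size=3)
        self.body = nn.Linear(8, hidden_units)


def load_matching(model, saved_state):
    """Copy only the checkpoint entries whose name and shape match the model.

    Returns the sorted names of model entries that were left untouched.
    """
    current = model.state_dict()
    compatible = {
        key: value
        for key, value in saved_state.items()
        if key in current and value.shape == current[key].shape
    }
    # strict=False because `compatible` deliberately omits some keys.
    model.load_state_dict(compatible, strict=False)
    return sorted(set(current) - set(compatible))


old = TinyNet(hidden_units=16)  # architecture at checkpoint time
new = TinyNet(hidden_units=32)  # hidden_units changed before resuming
body_before = new.body.weight.detach().clone()

skipped = load_matching(new, old.state_dict())
```

Here `skipped` lists the body's weight and bias, and the encoder tensors in `new` equal those in `old`.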

Comment thread ml-agents/mlagents/trainers/model_saver/torch_model_saver.py Outdated
Comment thread ml-agents/mlagents/trainers/model_saver/torch_model_saver.py Outdated
Comment thread ml-agents/mlagents/trainers/model_saver/torch_model_saver.py Outdated
Comment thread ml-agents/mlagents/trainers/model_saver/torch_model_saver.py Outdated
Comment thread ml-agents/mlagents/trainers/model_saver/torch_model_saver.py Outdated
Comment thread docs/Training-ML-Agents.md Outdated
Comment thread ml-agents/mlagents/trainers/model_saver/torch_model_saver.py Outdated
Comment thread ml-agents/mlagents/trainers/model_saver/torch_model_saver.py Outdated
Comment thread ml-agents/mlagents/trainers/tests/torch/saver/test_saver.py Outdated
Comment thread ml-agents/mlagents/trainers/tests/torch/saver/test_saver.py Outdated
andrewcoh and others added 5 commits April 5, 2021 18:24
Co-authored-by: Vincent-Pierre BERGES <vincentpierre@unity3d.com>
Co-authored-by: Vincent-Pierre BERGES <vincentpierre@unity3d.com>
Co-authored-by: Vincent-Pierre BERGES <vincentpierre@unity3d.com>
Co-authored-by: Vincent-Pierre BERGES <vincentpierre@unity3d.com>
Co-authored-by: Vincent-Pierre BERGES <vincentpierre@unity3d.com>
Comment thread ml-agents/mlagents/trainers/model_saver/torch_model_saver.py Outdated

ervteng (Contributor) left a comment

Add this to changelog.md. I'd say this qualifies as a "Major change" for the Python package.

andrewcoh merged commit ac4f43c into main Apr 6, 2021
delete-merged-branch bot deleted the fix-resume-imi branch April 6, 2021 17:13
ervteng pushed a commit that referenced this pull request Apr 8, 2021
Co-authored-by: Vincent-Pierre BERGES <vincentpierre@unity3d.com>
Co-authored-by: Ervin T. <ervin@unity3d.com>
(cherry picked from commit ac4f43c)
andrewcoh restored the fix-resume-imi branch June 14, 2021 22:06
github-actions bot locked as resolved and limited conversation to collaborators Jun 15, 2022