Hi all,
This is a follow-on to the XML API work, which is complete but not yet merged as of this writing. Using the new API, I've confirmed that scanning for children is killing our performance. Note the 46% of runtime spent in scan_children (the percentage would be far higher if loading environment modules weren't so slow on this machine):

The reason we need so many scans is that our file formats are so inconsistent: it's hard to rely on assumptions about file structure to do vastly more efficient direct-child searches.
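To make the cost difference concrete, here is a minimal sketch using the standard-library xml.etree.ElementTree rather than the actual CIME API. The XML snippet, the "value" attribute, and the function names scan_all and scan_direct are made up for illustration; only the file->group->entry layout comes from the analysis in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a compliant env file (file -> group -> entry).
XML = """
<file id="env_run.xml">
  <group id="run_begin_stop_restart">
    <entry id="STOP_N" value="5"/>
    <entry id="STOP_OPTION" value="ndays"/>
  </group>
  <group id="run_component_cpl">
    <entry id="CPL_EPBAL" value="off"/>
  </group>
</file>
"""

root = ET.fromstring(XML)


def scan_all(root, entry_id):
    """Structure-agnostic search: walks every descendant of root."""
    for node in root.iter("entry"):
        if node.get("id") == entry_id:
            return node
    return None


def scan_direct(root, entry_id):
    """Structure-aware search: only looks along root -> group -> entry."""
    for group in root.findall("group"):      # direct children only
        for node in group.findall("entry"):  # direct children only
            if node.get("id") == entry_id:
                return node
    return None


print(scan_all(root, "STOP_N").get("value"))
print(scan_direct(root, "CPL_EPBAL").get("value"))
```

The second function is only safe when the layout is guaranteed, which is exactly why the inconsistent formats force the slower, structure-agnostic scan today.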
To help inform the discussion, here's a class diagram of the Python classes in XML:

Black lines denote inheritance; red lines denote a "has-a" relationship with the critical Case class. Note that the Case class splits its has-a relationships to XML files between "entry-id" and "generic" files.
Diving deeper, here's an analysis of the files themselves:
Archive(GenericXML):
* CIME/config/$MODEL/config_archive.xml
* Format: custom
Batch(GenericXML):
* CIME/config/$MODEL/machines/config_batch.xml
* Format: custom
Compilers(GenericXML):
* CIME/config/$MODEL/machines/config_compilers.xml
* Format: custom, ACME and CESM have different formats (1.0 vs. 2.0)
Compsets(GenericXML):
* CIME/config/acme/allactive/config_compsets.xml
* CIME/src/drivers/mct/cime_config/config_compsets.xml
* components can add more
* Format: custom
EnvArchive(GenericXML):
* $CASE/env_archive.xml : Note, this is the ONLY direct subclass of GenericXML that lives in the CASE directory
* Format: custom, similar to Archive
Grids(GenericXML):
* CIME/config/$MODEL/config_grids.xml
* Format: custom, ACME and CESM have different formats (1.0 vs. 2.0)
Machines(GenericXML):
* CIME/config/$MODEL/machines/config_machines.xml
* Format: custom
Pes(GenericXML):
* CIME/config/acme/allactive/config_pesall.xml
* Format: custom
TestReporter(GenericXML):
* ? (CESM only)
* Format: custom
Tests(GenericXML):
* CIME/config/config_tests.xml
* Format: custom
TestSpec(GenericXML):
* ? (CESM only)
* Format: custom
Component(EntryId):
* CIME/src/drivers/mct/cime_config/config_component.xml + config_component_$MODEL.xml
* Format: entry_id->entry->(various entry info)
Headers(EntryId):
* CIME/config/config_headers.xml
* Format: custom, why is this a subclass of EntryId?
Files(EntryId):
* CIME/config/$MODEL/config_files.xml
* Format: entry_id->entry->(various entry info)
NamelistDefinitions(EntryId):
* Lots
* Format: entry_id->entry->(various entry info)
PIO(EntryId):
* CIME/config/$MODEL/machines/config_pio.xml
* Format: entries->entry->(values), why a different root element tag name?
EnvBatch(EnvBase):
* $CASE/env_batch.xml
* Format: file->group(job)->entry + noncompliant batch_system block
EnvBuild(EnvBase):
* $CASE/env_build.xml
* Format: file->group->entry fully compliant file!
EnvCase(EnvBase):
* $CASE/env_case.xml
* Format: file->group->entry fully compliant file!
EnvMachPes(EnvBase):
* $CASE/env_mach_pes.xml
* Format: file->group->entry fully compliant file!
EnvMachSpecific(EnvBase):
* $CASE/env_mach_specific.xml
* Format: custom
EnvRun(EnvBase):
* $CASE/env_run.xml
* Format: file->group->entry fully compliant file!
EnvTest(EnvBase):
* $CASE/env_test.xml
* Format: file->group(job)->entry + noncompliant TEST block
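For what it's worth, "compliant" as used above could be checked mechanically. The following is a rough sketch, assuming a compliant file is exactly file->group->entry; the function name noncompliant_nodes and the sample XML strings are hypothetical, and real env files may carry extra elements that such a check would need to allow.

```python
import xml.etree.ElementTree as ET


def noncompliant_nodes(root):
    """Return the elements that break the file -> group -> entry layout."""
    if root.tag != "file":
        return [root]
    bad = []
    for child in root:
        if child.tag != "group":
            bad.append(child)  # e.g. a stray batch_system or TEST block
            continue
        # Inside a group, anything that is not an entry is flagged.
        bad.extend(node for node in child if node.tag != "entry")
    return bad


# Hypothetical inputs: one fully compliant, one "almost compliant".
compliant = ET.fromstring(
    '<file><group id="g"><entry id="A" value="1"/></group></file>')
almost = ET.fromstring(
    '<file><group id="g"><entry id="A" value="1"/></group>'
    '<batch_system type="none"/></file>')

print([n.tag for n in noncompliant_nodes(compliant)])
print([n.tag for n in noncompliant_nodes(almost)])
```

A check like this would make it easy to see how far files such as env_batch.xml and env_test.xml are from the standard layout.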
The things that stick out to me:
- EnvArchive is strange. It's the only direct subclass of GenericXML that lives in the case directory, but it's totally noncompliant with our standard env file formatting.
- Headers seems like it should be a direct subclass of GenericXML, not an EntryId, since it has no entries.
- The root element of PIO is inconsistent with its siblings
- EnvBatch is almost compliant except for one block
- EnvTest is almost compliant except for one block
- EnvMachSpecific makes no effort at all to be compliant
Short-term path forward:
- Re-work class hierarchy to differentiate compliant files from noncompliant
- The performance of all compliant Env*.xml files can be drastically increased by leveraging the consistent file structure they all share when retrieving entry id values
- This would most-likely involve a cached entry_name -> (group or entry node) map
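The cached map idea might look something like the sketch below. It is not CIME code: the class name CachedEntryLookup, its methods, and the "value" attribute are assumptions made for illustration, and it relies on the file being fully compliant (file->group->entry).

```python
import xml.etree.ElementTree as ET


class CachedEntryLookup:
    """Hypothetical helper, not part of CIME: lazy entry-id -> node index."""

    def __init__(self, root):
        self._root = root
        self._index = None  # built on first lookup

    def _build_index(self):
        # One pass over root -> group -> entry; valid only for compliant files.
        index = {}
        for group in self._root.findall("group"):
            for entry in group.findall("entry"):
                index[entry.get("id")] = (group, entry)
        return index

    def get_value(self, entry_id):
        """Return the entry's value attribute, or None if the id is unknown."""
        if self._index is None:
            self._index = self._build_index()
        hit = self._index.get(entry_id)
        return None if hit is None else hit[1].get("value")

    def invalidate(self):
        """Drop the cache; call after adding or removing entries."""
        self._index = None


root = ET.fromstring(
    '<file><group id="g"><entry id="STOP_N" value="5"/></group></file>')
lookup = CachedEntryLookup(root)
print(lookup.get_value("STOP_N"))
```

After the first lookup, every later lookup is a dictionary access instead of a tree scan. The price is that anything that changes the tree structure has to invalidate the cache.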
Long-term path forward:
- It sure would be nice if there was more consistency all around
- break backwards compatibility by forcing more Env* files to be compliant
- more consistency between the direct subclasses of GenericXML would be nice (config_machines, config_grids), but is not a necessity since create_newcase is already pretty fast and the penalty for parsing these files is only incurred once (at create_newcase).