
Current state of CIME data files #2161

@jgfouca


Hi all,

This is a follow-on to the XML API discussion; that work is complete but not yet merged as of this writing. Using the new API, I've confirmed that scanning for children is killing our performance. Note the 46% of runtime spent in scan_children (the percentage would be far higher if loading environment modules weren't crazy slow on this machine):

[screenshot: runtime profile, 2017-12-19]

We need so many scans because our file formats are so inconsistent: it's hard to rely on assumptions about file structure to do vastly more efficient direct-child searches.
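As a rough illustration of the difference, here is a minimal sketch using the standard library's `xml.etree.ElementTree` (not the CIME XML API itself). The sample document is a hypothetical, simplified stand-in for an env file, and both function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a file->group->entry document.
SAMPLE = """
<file id="env_run.xml">
  <group id="run_begin_stop_restart">
    <entry id="STOP_N" value="5"/>
    <entry id="STOP_OPTION" value="ndays"/>
  </group>
  <group id="run_component_cpl">
    <entry id="CPL_SEQ_OPTION" value="CESM1_MOD"/>
  </group>
</file>
"""


def find_entry_by_scan(root, entry_id):
    """Walk every descendant of root until a matching <entry> turns up."""
    for node in root.iter("entry"):
        if node.get("id") == entry_id:
            return node
    return None


def find_entry_direct(root, entry_id):
    """Assume file->group->entry and look only at direct children."""
    for group in root.findall("group"):
        for node in group.findall("entry"):
            if node.get("id") == entry_id:
                return node
    return None


root = ET.fromstring(SAMPLE)
scanned = find_entry_by_scan(root, "STOP_N")
direct = find_entry_direct(root, "STOP_N")
```

Both calls return the same node here. The direct version never descends below the entry level, so it stays cheap when entries carry their own subtrees, but it only works if the file really follows the assumed layout.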

To help inform the discussion, here's a class diagram of the Python classes in XML:

[screenshot: class diagram, 2017-12-19]

Black lines denote inheritance; red lines denote a "has-a" relationship with the critical Case class. Note that the Case class splits its has-a relationships to XML files between "entry-id" and "generic" files.

Diving deeper, here's an analysis of the files themselves:

Archive(GenericXML):
  * CIME/config/$MODEL/config_archive.xml
  * Format: custom

Batch(GenericXML):
  * CIME/config/$MODEL/machines/config_batch.xml
  * Format: custom

Compilers(GenericXML):
  * CIME/config/$MODEL/machines/config_compilers.xml
  * Format: custom, ACME and CESM have different formats (1.0 vs. 2.0)

Compsets(GenericXML):
  * CIME/config/acme/allactive/config_compsets.xml
  * CIME/src/drivers/mct/cime_config/config_compsets.xml
  * components can add more
  * Format: custom

EnvArchive(GenericXML):
  * $CASE/env_archive.xml : Note, this is the ONLY direct subclass of GenericXML that lives in the CASE directory
  * Format: custom, similar to Archive

Grids(GenericXML):
  * CIME/config/$MODEL/config_grids.xml
  * Format: custom, ACME and CESM have different formats (1.0 vs. 2.0)

Machines(GenericXML):
  * CIME/config/$MODEL/machines/config_machines.xml
  * Format: custom

Pes(GenericXML):
  * CIME/config/acme/allactive/config_pesall.xml
  * Format: custom

TestReporter(GenericXML):
  * ? (CESM only)
  * Format: custom

Tests(GenericXML):
  * CIME/config/config_tests.xml
  * Format: custom

TestSpec(GenericXML):
  * ? (CESM only)
  * Format: custom



Component(EntryId):
  * CIME/src/drivers/mct/cime_config/config_component.xml + config_component_$MODEL.xml
  * Format: entry_id->entry->(various entry info)

Headers(EntryId):
  * CIME/config/config_headers.xml
  * Format: custom, why is this a subclass of EntryId?

Files(EntryId):
  * CIME/config/$MODEL/config_files.xml
  * Format: entry_id->entry->(various entry info)

NamelistDefinitions(EntryId):
  * Lots
  * Format: entry_id->entry->(various entry info)

PIO(EntryId):
  * CIME/config/$MODEL/machines/config_pio.xml
  * Format: entries->entry->(values) , why different root element tagname?



EnvBatch(EnvBase):
  * $CASE/env_batch.xml
  * Format: file->group(job)->entry + noncompliant batch_system block

EnvBuild(EnvBase):
  * $CASE/env_build.xml
  * Format: file->group->entry fully compliant file!

EnvCase(EnvBase):
  * $CASE/env_case.xml
  * Format: file->group->entry fully compliant file!

EnvMachPes(EnvBase):
  * $CASE/env_mach_pes.xml
  * Format: file->group->entry fully compliant file!

EnvMachSpecific(EnvBase):
  * $CASE/env_mach_specific.xml
  * Format: custom

EnvRun(EnvBase):
  * $CASE/env_run.xml
  * Format: file->group->entry fully compliant file!

EnvTest(EnvBase):
  * $CASE/env_test.xml
  * Format: file->group(job)->entry + noncompliant TEST block
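To make the "file->group->entry" shorthand concrete, here is a minimal sketch of a structural check, assuming "compliant" means the root is `file`, every child of the root is a `group`, and every child of a group is an `entry`. That strict reading is an assumption: real files may carry extra elements that a production check would need to allow. The function name `is_compliant` and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_compliant(root):
    """Return True if the tree is strictly file -> group -> entry."""
    if root.tag != "file":
        return False
    for group in root:
        if group.tag != "group":
            return False
        if any(child.tag != "entry" for child in group):
            return False
    return True


# Hypothetical samples: one fully compliant, one with a stray top-level block.
compliant = ET.fromstring(
    '<file><group id="g"><entry id="A" value="1"/></group></file>'
)
almost = ET.fromstring(
    '<file><group id="g"><entry id="A" value="1"/></group>'
    '<batch_system type="none"/></file>'
)

print(is_compliant(compliant))  # True
print(is_compliant(almost))     # False
```

The second sample mimics the "almost compliant" case: a single nonconforming block at the top level is enough to fail the check.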

The things that stick out to me:

  • EnvArchive is strange. It's the only direct subclass of GenericXML that lives in the case directory, but it's totally noncompliant with our standard env file formatting.
  • Headers seems like it should be a direct subclass of GenericXML, not an EntryId, since it has no entries.
  • The root element of PIO is inconsistent with its siblings
  • EnvBatch is almost compliant except for one block
  • EnvTest is almost compliant except for one block
  • EnvMachSpecific makes no effort at all to be compliant

Short-term path forward:

  • Re-work class hierarchy to differentiate compliant files from noncompliant
  • The performance of all compliant Env*.xml files can be drastically improved by leveraging the consistent file structure they all share when retrieving entry id values
  • This would most likely involve a cached entry_name -> (group or entry node) map
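The cached map could look something like the sketch below. It uses the standard library's `xml.etree.ElementTree` rather than the CIME XML API, and `CachedEnvFile` is a hypothetical class invented for illustration, not something that exists in CIME. It assumes a compliant file->group->entry layout and that entry ids are unique within the file.

```python
import xml.etree.ElementTree as ET


class CachedEnvFile:
    """Hypothetical wrapper around a compliant file->group->entry tree."""

    def __init__(self, root):
        self._root = root
        self._entries = None  # entry id -> (group node, entry node), built lazily

    def _build_cache(self):
        cache = {}
        # Direct-child searches only; assumes the file is compliant.
        for group in self._root.findall("group"):
            for entry in group.findall("entry"):
                cache[entry.get("id")] = (group, entry)
        return cache

    def _lookup(self, entry_id):
        if self._entries is None:
            self._entries = self._build_cache()
        return self._entries.get(entry_id)

    def get_value(self, entry_id):
        hit = self._lookup(entry_id)
        return None if hit is None else hit[1].get("value")

    def set_value(self, entry_id, value):
        hit = self._lookup(entry_id)
        if hit is None:
            raise KeyError(entry_id)
        # The node is mutated in place, so the cache stays valid.
        hit[1].set("value", str(value))


env = CachedEnvFile(ET.fromstring(
    '<file><group id="run_begin_stop_restart">'
    '<entry id="STOP_N" value="5"/></group></file>'
))
env.set_value("STOP_N", 10)
```

After the first lookup, every later get or set is a dictionary hit instead of a tree scan. The cache would need to be rebuilt or updated whenever entries are added or removed.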

Long-term path forward:

  • It sure would be nice if there were more consistency all around
  • Break backwards compatibility by forcing more Env* files to be compliant
  • More consistency between the direct subclasses of GenericXML (config_machines, config_grids) would be nice, but is not a necessity: create_newcase is already pretty fast, and the penalties for parsing these files are only incurred once (at create_newcase).
