Upgrade: when and why to check that box

Annotated screen capture showing the upgrade checkbox.

Recently I changed the Geospatial metadata validation service so that this checkbox is unchecked by default. This change makes the web service work just like the command-line mp does. With the upgrade box checked, mp silently fixes some problems in its output files, which can lead people to think, mistakenly, that their metadata has no errors.

If you get error messages saying that elements are misplaced, especially if those errors mention Enumerated_Domain, you might be able to solve those problems by using the upgrade checkbox.

If you use the upgrade function, you should save the XML that is output from mp and use that as your metadata.

That's probably all you need to know. But if you're curious about the details, read on.


Why is there an upgrade function?

The CSDGM was written in 1994 and revised in 1998. Apart from moving a few elements within the metadata, the main revision changed how elements are repeated. To help metadata writers adapt to the revised standard, I added a function to mp that looks for elements that might need to be modified and changes them.

The specific changes the upgrade function makes are described here.

Why should I use it?

At this point we expect most metadata records to be processed by software that expects strict conformance with the standard. If your records are arranged as in the 1994 standard, chances are good that they won't work in some software, such as the CKAN modules that import metadata into data.gov.

However, there remain a few variations in how some people interpret the CSDGM element Enumerated_Domain.

The CSDGM document says
  Enumerated_Domain = 
    1{Enumerated_Domain_Value
      + Enumerated_Domain_Value_Definition
      + Enumerated_Domain_Value_Definition_Source
      + 0{Attribute}n
      }n

I insist that this is an editorial mistake, because it was the intention of the revision committee, on which I sat, to eliminate all cases in which an element consists of repeated groups of elements. Instead our intention was that the container should repeat (in this case Enumerated_Domain), not the group.

Applying the standard as written would produce sequences like
  Enumerated_Domain:
    Enumerated_Domain_Value: A1
    Enumerated_Domain_Value_Definition: Steak sauce
    Enumerated_Domain_Value_Definition_Source: Heinz Company
    Enumerated_Domain_Value: B4
    Enumerated_Domain_Value_Definition: Previous time
    Enumerated_Domain_Value_Definition_Source: Twitter
    Enumerated_Domain_Value: CU
    Enumerated_Domain_Value_Definition: good bye
    Enumerated_Domain_Value_Definition_Source: Twitter
The problem with this, and the reason why we wanted to eliminate this structure, is that if software were to rearrange the elements in the order in which they appear in the standard, we would have
  Enumerated_Domain:
    Enumerated_Domain_Value: A1
    Enumerated_Domain_Value: B4
    Enumerated_Domain_Value: CU
    Enumerated_Domain_Value_Definition: Steak sauce
    Enumerated_Domain_Value_Definition: Previous time
    Enumerated_Domain_Value_Definition: good bye
    Enumerated_Domain_Value_Definition_Source: Heinz Company
    Enumerated_Domain_Value_Definition_Source: Twitter
    Enumerated_Domain_Value_Definition_Source: Twitter
which is very confusing and not what anybody wants, because it separates the abbreviations from their definitions. Note that the standard does not allow this:
  Enumerated_Domain:
    Enumerated_Domain_Value: A1
    Enumerated_Domain_Value_Definition: Steak sauce
    Enumerated_Domain_Value_Definition_Source: Heinz Company
  Enumerated_Domain:
    Enumerated_Domain_Value: B4
    Enumerated_Domain_Value_Definition: Previous time
    Enumerated_Domain_Value_Definition_Source: Twitter
  Enumerated_Domain:
    Enumerated_Domain_Value: CU
    Enumerated_Domain_Value_Definition: good bye
    Enumerated_Domain_Value_Definition_Source: Twitter
because in the definition of Attribute_Domain_Values, Enumerated_Domain is not repeatable:
  Attribute_Domain_Values = [ Enumerated_Domain | Range_Domain | Codeset_Domain | Unrepresentable_Domain ]

If Enumerated_Domain were repeatable, that would have been written

  Attribute_Domain_Values = [ 1{Enumerated_Domain}n | Range_Domain | Codeset_Domain | Unrepresentable_Domain ]
but that isn't what was written.

However, Attribute_Domain_Values is repeatable within Attribute, as it was in the 1994 standard. So the safest way to arrange these elements is to write
  Attribute_Domain_Values:
    Enumerated_Domain:
      Enumerated_Domain_Value: A1
      Enumerated_Domain_Value_Definition: Steak sauce
      Enumerated_Domain_Value_Definition_Source: Heinz Company
  Attribute_Domain_Values:
    Enumerated_Domain:
      Enumerated_Domain_Value: B4
      Enumerated_Domain_Value_Definition: Previous time
      Enumerated_Domain_Value_Definition_Source: Twitter
  Attribute_Domain_Values:
    Enumerated_Domain:
      Enumerated_Domain_Value: CU
      Enumerated_Domain_Value_Definition: good bye
      Enumerated_Domain_Value_Definition_Source: Twitter
which is only a little more cumbersome than allowing Enumerated_Domain to repeat.

And this is what mp's upgrade function does.
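
The regrouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not mp's actual code. The function names split_enumerated_domain and render are hypothetical, and the sketch assumes the children of one Enumerated_Domain are available as a flat list of (element name, text) pairs in document order.

```python
# Illustrative sketch only; not mp's implementation.
VALUE = "Enumerated_Domain_Value"


def split_enumerated_domain(children):
    """Split the flat children of one Enumerated_Domain into one group per value.

    A new group starts at each Enumerated_Domain_Value element, so each
    value stays with the definition and source that follow it.
    """
    groups = []
    for name, text in children:
        if name == VALUE or not groups:
            groups.append([])
        groups[-1].append((name, text))
    return groups


def render(groups):
    """Write each group inside its own Attribute_Domain_Values container."""
    lines = []
    for group in groups:
        lines.append("Attribute_Domain_Values:")
        lines.append("  Enumerated_Domain:")
        for name, text in group:
            lines.append(f"    {name}: {text}")
    return "\n".join(lines)


flat = [
    ("Enumerated_Domain_Value", "A1"),
    ("Enumerated_Domain_Value_Definition", "Steak sauce"),
    ("Enumerated_Domain_Value_Definition_Source", "Heinz Company"),
    ("Enumerated_Domain_Value", "B4"),
    ("Enumerated_Domain_Value_Definition", "Previous time"),
    ("Enumerated_Domain_Value_Definition_Source", "Twitter"),
    ("Enumerated_Domain_Value", "CU"),
    ("Enumerated_Domain_Value_Definition", "good bye"),
    ("Enumerated_Domain_Value_Definition_Source", "Twitter"),
]

groups = split_enumerated_domain(flat)
print(render(groups))
```

Because every value travels with its own definition and source, software that later reorders elements within a container cannot separate them.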

What could go wrong?

In rare circumstances a mis-ordering of elements will cause the upgrade function to make a change that you don't intend. Specifically, if you have a Theme section like
  Keywords:
    Theme:
      Theme_Keyword: first keyword
      Theme_Keyword: second keyword
      Theme_Keyword_Thesaurus: (some vocabulary name)
      Theme_Keyword: third keyword
      Theme_Keyword: fourth keyword
Using its upgrade function, mp will find the Theme_Keyword_Thesaurus element, assume that it's really supposed to be the start of a new Theme element, and separate the first and second keywords from the third and fourth, like this:
  Keywords:
    Theme:
      Theme_Keyword: first keyword
      Theme_Keyword: second keyword
    Theme:
      Theme_Keyword_Thesaurus: (some vocabulary name)
      Theme_Keyword: third keyword
      Theme_Keyword: fourth keyword
Then you'll get an error message saying the first Theme element doesn't have a Theme_Keyword_Thesaurus.
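
The splitting behavior just described can be modeled roughly as follows. This is a simplified sketch under my own assumptions, not mp's code. The names split_theme and themes_missing_thesaurus are hypothetical, and a Theme is represented as a flat list of (element name, text) pairs.

```python
# Simplified model of the behavior described above; not mp's implementation.
THESAURUS = "Theme_Keyword_Thesaurus"


def split_theme(children):
    """Start a new Theme at each thesaurus element that follows other children."""
    themes = [[]]
    for name, text in children:
        if name == THESAURUS and themes[-1]:
            themes.append([])
        themes[-1].append((name, text))
    return themes


def themes_missing_thesaurus(themes):
    """Return the indexes of Themes that contain no thesaurus element."""
    return [
        i for i, theme in enumerate(themes)
        if all(name != THESAURUS for name, _ in theme)
    ]


misordered = [
    ("Theme_Keyword", "first keyword"),
    ("Theme_Keyword", "second keyword"),
    ("Theme_Keyword_Thesaurus", "(some vocabulary name)"),
    ("Theme_Keyword", "third keyword"),
    ("Theme_Keyword", "fourth keyword"),
]

themes = split_theme(misordered)
print(len(themes))                       # 2
print(themes_missing_thesaurus(themes))  # [0]
```

When the thesaurus element comes first, the same function leaves the Theme in one piece, which is why ordering the elements as the standard lists them avoids the problem.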

The best solution, of course, is to keep your metadata elements in the order in which they're listed in the standard, even though the standard doesn't say they have to be in that order. Here, that means Theme_Keyword_Thesaurus occurs before any of the Theme_Keyword elements within any given Theme element.
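
If you want to look for this ordering problem before using the upgrade function, a check along the following lines might help. It is a hypothetical helper, not part of mp. It assumes you have already parsed one Theme element into a list of (element name, text) pairs.

```python
# Hypothetical pre-check; not part of mp.
THESAURUS = "Theme_Keyword_Thesaurus"


def misplaced_thesaurus(children):
    """Return positions of thesaurus elements that are not the first child."""
    return [
        i for i, (name, _) in enumerate(children)
        if name == THESAURUS and i > 0
    ]


theme = [
    ("Theme_Keyword", "first keyword"),
    ("Theme_Keyword", "second keyword"),
    ("Theme_Keyword_Thesaurus", "(some vocabulary name)"),
    ("Theme_Keyword", "third keyword"),
]

print(misplaced_thesaurus(theme))  # [2]
```

An empty list means the Theme already has its thesaurus first. Anything else points at an element to move by hand, since only you know which keywords belong to which thesaurus.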