Schemas#

Tip

Most users will never need to define their own schema. Check out the repository to find predefined schemas. Need something custom? Get in touch at support@geodatahub.dk.

GeoDataHub gives users a great degree of freedom in describing their data. The metadata attributes relevant to one dataset might not make sense for another, even when both datasets are of the same basic type. Take the example of a geophysical well. A well has generic metadata attributes such as Well name, Driller, Depth, Completed date, Owner and others. These apply to every well: each one was completed on a given date and was drilled by someone.

Note, schemas only apply to the metadata attributes and not the actual data.

However, wells can also have metadata attributes that are specific to particular situations. A company, GeoCorp, might have an internal reference number for each well it drills (a GeoCorp reference id). When the company defines its metadata attributes, that reference is key. It might also need to know the type of geophysical logs measured in the well. These are specific attributes that do not apply to all wells.

Using schemas, GeoDataHub empowers users to describe metadata in a way that makes sense to them and their data. Schemas are defined using the JSON Schema syntax.

All datasets are validated against their schema before being added or updated. This ensures consistency in the data provided by the user. The more detailed the schema, the more mistakes the validation can catch, which improves data quality.

Defining a new schema#

The key to defining a good schema is understanding which metadata attributes are important. Too many attributes will confuse the user and will likely never get filled out. Defining too few attributes might ignore important information required to find and understand the data.

Let's continue the previous well example. The example below shows the schema file that defines a well with 5 metadata attributes:

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "https://schemas.geodatahub.dk/standards/well.json",
    "description": "The standard representation of a geophysical or geological well",
    "datatype": ["Logs", "Borehole", "Well"],
    "uniqueProperties": ["Well name"],
    "type": "object",
    "properties": {
        "Driller": {
            "type": "string",
            "example": "The drilling company",
            "description": "Name of the company that performed the drilling work"
        },
        "Well name": {
            "type": "string",
            "example": "Well No #23",
            "description": "A descriptive name for the well",
            "required": true
        },
        "Owner": {
            "type": "string",
            "example": "Local geological survey",
            "description": "The entity that owns or operates the well"
        },
        "Depth": {
            "type": "number",
            "example": 32,
            "unit": "m",
            "description": "The total depth of the well"
        },
        "Completed date": {
            "type": "string",
            "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}$",
            "example": "2020-01-01",
            "description": "Date when the drilling work was completed"
        }
    }
}

The file breaks down into a header with general information and a properties section that describes each individual attribute. The header contains the following attributes:

  • $schema: Specifies which version of the JSON Schema definition this file uses
  • $id: The location where the schema is stored. Learn more in the schema repository section
  • description: Information about the schema in general
  • datatype: Keywords of the datatypes that are described by the schema
  • uniqueProperties: Any fields from the properties section that are unique to a single dataset (e.g. internal dataset identifiers) - see unique constraints

The rest of the file defines the schema itself. For the meaning of type and properties, see the JSON Schema documentation. Each metadata attribute is listed under the properties section. An attribute definition contains:

  • type: The data type (such as string or number)
  • example: An example of what the field might contain
  • description: Descriptive information about the specific attribute
  • unit: The abbreviated unit (e.g. m, s, gal, ft), following the definitions from the International Institute of Standards
  • required: Whether the attribute must be present in the dataset metadata

The attribute section supports many more keywords (for example minimum, maximum and uniqueItems). See all of them in Understanding JSON Schema.

Note

datatype and unit are specifically designed for GeoDataHub and not part of the JSON schema standard.
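
As a rough illustration of the header and properties split described above, the following Python sketch reads a trimmed copy of the well schema and lists each attribute with its type, unit and required flag. The helper names split_schema and describe_attributes are hypothetical and not part of any GeoDataHub API. Only the standard library is used.

```python
import json

# A trimmed copy of the well schema, inlined so the sketch is self-contained.
WELL_SCHEMA = json.loads("""
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "https://schemas.geodatahub.dk/standards/well.json",
    "description": "The standard representation of a geophysical or geological well",
    "datatype": ["Logs", "Borehole", "Well"],
    "uniqueProperties": ["Well name"],
    "type": "object",
    "properties": {
        "Well name": {"type": "string", "required": true},
        "Depth": {"type": "number", "unit": "m"}
    }
}
""")


def split_schema(schema):
    """Split a schema into its header fields and its per-attribute definitions."""
    header = {key: value for key, value in schema.items() if key != "properties"}
    attributes = schema.get("properties", {})
    return header, attributes


def describe_attributes(schema):
    """Return one (name, type, unit, required) row per metadata attribute."""
    _, attributes = split_schema(schema)
    rows = []
    for name, spec in attributes.items():
        # "unit" and "required" are optional, so fall back to None / False.
        rows.append((name, spec.get("type"), spec.get("unit"), bool(spec.get("required", False))))
    return rows


header, _ = split_schema(WELL_SCHEMA)
print(header["datatype"])
for row in describe_attributes(WELL_SCHEMA):
    print(row)
```

A listing like this can help when reviewing a schema for too many or too few attributes before submitting it.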

Extending an existing schema#

Often, it is easier to extend an existing schema rather than creating a new one from scratch.

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "https://schemas.geodatahub.dk/organizations/acmecorp/geophysics/acme_well.json",
    "description": "Description of a well from ACME Corp",
    "datatype": ["Logs", "Borehole", "Well"],
    "allOf": [
        {"$ref": "https://schemas.geodatahub.dk/schemas/standards/well.json"},
        {
            "properties": {
                "Acme id": {
                    "type": "number",
                    "example": 284,
                    "description": "Unique internal identifier for ACME Corp wells"
                }
            }
        }
    ]
}

The allOf keyword comes from the JSON Schema definition and requires a dataset to satisfy every listed schema, so this schema covers all attributes from standards/well.json as well as the Acme id property. The resulting schema ensures the well has a Well name (as it is required in the parent schema) and rejects Acme id = myWell, as only numbers are allowed.
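
The effect of allOf can be sketched with a toy validator in Python. This is an assumption-laden illustration, not GeoDataHub's validation code and not a full JSON Schema implementation: it only understands type, the per-attribute required flag used in this document, and allOf. The function find_errors is hypothetical, and the parent schema is inlined where the real file uses $ref.

```python
# Toy validator: covers only "type", the per-attribute "required" flag used in
# this document, and "allOf". It is NOT a full JSON Schema implementation.
TYPE_CHECKS = {
    "string": lambda value: isinstance(value, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda value: isinstance(value, (int, float)) and not isinstance(value, bool),
}


def find_errors(metadata, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    # allOf: the metadata must satisfy every listed subschema.
    for subschema in schema.get("allOf", []):
        errors.extend(find_errors(metadata, subschema))
    for name, spec in schema.get("properties", {}).items():
        if name not in metadata:
            if spec.get("required"):
                errors.append(f"missing required attribute: {name}")
            continue
        check = TYPE_CHECKS.get(spec.get("type"))
        if check is not None and not check(metadata[name]):
            errors.append(f"{name} must be of type {spec['type']}")
    return errors


# Trimmed parent schema, inlined instead of being resolved through "$ref".
PARENT_WELL = {
    "properties": {
        "Well name": {"type": "string", "required": True},
        "Depth": {"type": "number"},
    }
}
ACME_WELL = {
    "allOf": [
        PARENT_WELL,
        {"properties": {"Acme id": {"type": "number"}}},
    ]
}

print(find_errors({"Well name": "Well No #23", "Acme id": 284}, ACME_WELL))
print(find_errors({"Acme id": "myWell"}, ACME_WELL))
```

The first call reports no problems. The second reports both the missing Well name from the parent schema and the non-numeric Acme id from the extension.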

Unique constraints#

The uniqueProperties definition allows schema owners to define specific fields from the properties section that are unique. The fields are provided as a list:

{
    ...
    "uniqueProperties": ["wellID", "wellName"],
    ...
    "type": "object",
    "properties": {
        "wellID": {
            "type": "number",
            "description": "Internal unique identifier"
        },
        "wellName": {
            "type": "string",
            "description": "Unique name of the well"
        },
        "depth": {
            "type": "number",
            "description": "Total depth below mean sea-level of the well bottom"
        }
    }
}

In the above example, only a single dataset may contain a given combination of wellID and wellName. The constraint only applies to datasets that use the same schema and are owned by the same organization. Multiple datasets may define the same wellID if they follow different schema definitions. Adding a new dataset with the same unique properties will overwrite the existing dataset but will NOT change the unique identifier provided by GDH. Any existing link to the dataset will still work.

Two datasets are identical only if all uniqueProperties match. Otherwise a new dataset is created (e.g. creating two datasets with the same wellID but different wellNames will create two datasets).
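
The matching rule can be pictured with a small in-memory model in Python. This is only a sketch under stated assumptions: DatasetStore, unique_key and upsert are hypothetical names, the identifiers are random UUIDs chosen for illustration, and the schema URL is a placeholder. How GeoDataHub implements the constraint internally is not described here.

```python
import uuid


def unique_key(organization, schema_id, unique_properties, metadata):
    """Build a dataset's identity from its owner, its schema and its unique values."""
    values = tuple(metadata.get(name) for name in unique_properties)
    return (organization, schema_id, values)


class DatasetStore:
    """In-memory stand-in for a dataset catalogue (hypothetical)."""

    def __init__(self):
        self._ids = {}      # unique key -> dataset identifier
        self.datasets = {}  # dataset identifier -> metadata

    def upsert(self, organization, schema_id, unique_properties, metadata):
        key = unique_key(organization, schema_id, unique_properties, metadata)
        dataset_id = self._ids.get(key)
        if dataset_id is None:
            # No dataset matches on all unique properties: create a new one.
            dataset_id = str(uuid.uuid4())
            self._ids[key] = dataset_id
        # On a match the metadata is overwritten but the identifier is kept,
        # so existing links to the dataset keep working.
        self.datasets[dataset_id] = dict(metadata)
        return dataset_id


store = DatasetStore()
schema_id = "https://example.org/schemas/well.json"  # placeholder URL
unique = ["wellID", "wellName"]

first = store.upsert("acme", schema_id, unique, {"wellID": 1, "wellName": "A", "depth": 30})
second = store.upsert("acme", schema_id, unique, {"wellID": 1, "wellName": "A", "depth": 32})
third = store.upsert("acme", schema_id, unique, {"wellID": 1, "wellName": "B", "depth": 10})

print(first == second)      # same wellID and wellName: the dataset was overwritten
print(first == third)       # different wellName: a separate dataset
print(len(store.datasets))
```

Because the organization and the schema are part of the key, the same wellID and wellName under a different schema or owner would produce a separate dataset in this model.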

The unique constraint is important for automated data ingestion from external sources.

Important

Any changes to the uniqueProperties field, after the schema is added, will not automatically update the existing dataset values. Please contact support@geodatahub.dk to ensure the changes propagate to existing datasets. All future datasets, added after the change, will enforce the new constraints.

Using the open schema repository#

The open geoscientific schema repository stores all public schemas used in GeoDataHub. The project encourages companies and organizations to openly share how they model geodata. Using common metadata definitions helps users quickly discover and understand datasets.

The source schemas are stored in an online Git repository. The schemas are mirrored to the schemas.geodatahub.dk website, where users can access them directly.

Submit a schema#

Important

Currently, knowledge of git is required to submit new schemas. In the future users can submit schemas directly from the browser.

Using git#

The repository uses git to store and manage schemas. Before a new schema is accepted users must submit a merge request/pull request through the source repository.

  • Clone the repository: git clone https://gitlab.com/GeoDataHub/schemas.geodatahub.dk.git <your local path>
  • Create a new branch (branch names must start with schema/): git checkout -b schema/<my branch name>
  • Make the required changes and commit them
  • Push the branch to the repository: git push --set-upstream origin schema/<my branch name>
  • Finally, create a merge request from your branch
  • A developer will review your schema and help you get it into GeoDataHub