Submission Journey

Overview

The submission journey is a multi-stage process that moves data from a user's local environment into permanent, indexed storage in the NFDI4Immuno repository. It covers the complete process of uploading, validating, and storing research data.

The journey begins with a POST request to the API Gateway. Data is held in temporary storage until all validation gates are passed.

sequenceDiagram
    participant U as Front-end/CLI
    participant G as API Gateway
    participant P as Submission Pipeline
    participant MV as Metadata Validation
    participant PV as File Validation
    participant H as Housekeeping
    participant S as Permanent Storage

    U->>G: POST /dataset
    G->>P: Forward submission
    P->>P: Add to Temporary Storage
    P->>MV: Validate metadata
    P->>PV: Validate files (payload)
    P->>H: Housekeeping
    P->>S: Move dataset from temporary to permanent

Key features:

  • Temporary Storage: Safe staging area during validation
  • Atomic Operations: All-or-nothing submission process
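The staging pattern above can be sketched in a few lines: files are written to a temporary area and moved into permanent storage only after every validation gate passes, so a failed submission leaves no trace. This is a minimal illustration, not the pipeline's actual implementation; the function names and layout are assumptions.

```python
import os
import shutil
import tempfile

def submit(dataset_files, validate, permanent_root):
    """Stage files in a temp dir; move them to permanent storage only if
    validation succeeds, so the submission is all-or-nothing."""
    staging = tempfile.mkdtemp(prefix="submission-")  # temporary storage
    try:
        for name, data in dataset_files.items():
            with open(os.path.join(staging, name), "wb") as f:
                f.write(data)
        if not validate(staging):
            raise ValueError("validation failed; submission rejected")
        os.makedirs(permanent_root, exist_ok=True)
        dest = os.path.join(permanent_root, os.path.basename(staging))
        shutil.move(staging, dest)  # promote staged data to permanent storage
        return dest
    finally:
        # On any failure the staging area is discarded; on success it has
        # already been moved, so this is a no-op.
        shutil.rmtree(staging, ignore_errors=True)
```

On success the staged directory is promoted with a single move; on failure the `finally` block discards it, which is what makes the operation effectively atomic from the submitter's point of view.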

Validation Gates

To ensure data quality and findability, every submission must pass two primary validation engines:

  • Metadata Validation: Checks the JSON against metadata schemas, ensures logical consistency, and validates terms against specific ontologies or CURIEs.
  • File (Payload) Validation: Verifies MIME types, inspects file content against format-specific rules, and validates the integrity of the data via checksum comparison.
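The two gates can be orchestrated as a simple sequence: each gate reports its errors, and a submission is accepted only if the combined error list is empty. The gate functions below are illustrative stand-ins for the real validation engines.

```python
def run_gates(submission, gates):
    """Run each (name, gate) pair in order; an empty result means accepted."""
    errors = []
    for name, gate in gates:
        errors.extend(f"{name}: {e}" for e in gate(submission))
    return errors

def metadata_gate(sub):
    # Placeholder: the real engine checks schemas, semantics, and ontologies.
    return [] if "metadata" in sub else ["missing metadata"]

def payload_gate(sub):
    # Placeholder: the real engine checks MIME types, content, and checksums.
    return [] if sub.get("files") else ["no files in payload"]
```

For example, a submission that carries metadata but no files passes the first gate and is rejected by the second.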

Metadata Validation

sequenceDiagram
    participant P as Submission Pipeline
    participant MV as Metadata Validation Engine
    participant S as Schema Validator
    participant Sem as Semantic Validator
    participant Ont as Ontology/CURIE Validator

    P->>MV: Start metadata validation
    MV->>S: Validate against metadata schema
    S-->>MV: Pass/Fail

    MV->>Sem: Check logical consistency<br>(relationships, constraints)
    Sem-->>MV: Pass/Fail

    MV->>Sem: Check semantic consistency<br>(plausible values)
    Sem-->>MV: Pass/Fail

    MV->>Ont: Validate terms against ontologies
    Ont-->>MV: Pass/Fail

    MV-->>P: Validation results

Key features:

  • Schema Validation: Ensures metadata conforms to required structure
  • Semantic Validation: Checks logical consistency of metadata values
  • Ontology Validation: Verifies terms against controlled vocabularies
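A much-simplified sketch of two of these checks: structural validation against a set of required fields, and syntactic validation of CURIEs. The real pipeline validates against full metadata schemas and resolves terms in ontologies; the field names and the CURIE pattern here are assumptions for illustration.

```python
import re

# A CURIE is prefix:local_id, e.g. "NCBITaxon:9606" (pattern simplified).
CURIE_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9._-]*:\S+$")

def validate_metadata(meta, required_fields, curie_fields):
    """Return a list of error messages; an empty list means the metadata passed."""
    errors = []
    for field in required_fields:          # schema check: required structure
        if field not in meta:
            errors.append(f"missing required field: {field}")
    for field in curie_fields:             # ontology check: CURIE syntax only
        value = meta.get(field)
        if value is not None and not CURIE_RE.match(value):
            errors.append(f"{field!r} is not a valid CURIE: {value!r}")
    return errors
```

A free-text value like `"human"` would fail the CURIE check, while `"NCBITaxon:9606"` passes the syntax gate (the real engine would additionally confirm the term exists in the referenced ontology).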

File Validation

sequenceDiagram
    participant P as Submission Pipeline
    participant PV as File Validation Engine
    participant FT as File Type Checker
    participant FC as File Content Checker
    participant CS as Checksum Validator

    P->>PV: Start payload validation
    PV->>FT: Check file extensions and MIME types
    FT-->>PV: Pass/Fail

    PV->>FC: Inspect file content<br>(format-specific rules)
    FC-->>PV: Pass/Fail

    PV->>CS: Compare provided vs. computed checksums
    CS-->>PV: Integrity OK/Fail

    PV-->>P: Validation results
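The file-type step of this engine can be sketched with the standard library's extension-to-MIME mapping: reject files whose type is unknown or outside an accepted set. The accepted set below is illustrative, not the repository's actual allow-list.

```python
import mimetypes

# Hypothetical allow-list of accepted MIME types.
ACCEPTED = {"text/csv", "application/json", "text/tab-separated-values"}

def check_file_type(filename):
    """Return None if the file type is acceptable, else an error message."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        return f"unknown file type: {filename}"
    if mime not in ACCEPTED:
        return f"unsupported MIME type {mime} for {filename}"
    return None
```

Note that extension-based detection is only a first gate; the content checker still inspects the bytes against format-specific rules, since an extension alone proves nothing about the file's actual contents.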

Integrity Check

If the computed checksum does not match the user-provided checksum during the file validation phase, the submission is rejected immediately to prevent data corruption.

Key features:

  • File Validation: Confirms file integrity and format compliance
  • Checksum Verification: Ensures data hasn't been corrupted
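The integrity check amounts to recomputing the file's digest and comparing it to the value the submitter provided. The sketch below assumes SHA-256 and streams the file in chunks so large payloads never need to fit in memory; the actual algorithm used by the repository may differ.

```python
import hashlib

def verify_checksum(path, provided_hex, chunk_size=1 << 20):
    """Recompute the SHA-256 of the file at `path` and compare it to the
    submitter-provided hex digest. Returns True only on an exact match."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)  # stream in 1 MiB chunks
    return h.hexdigest() == provided_hex
```

Any mismatch means the bytes in temporary storage differ from what the user intended to upload, so the pipeline rejects the submission rather than promote possibly corrupted data.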

Housekeeping

Once validated, the system performs "Housekeeping" to prepare the dataset for the permanent repository. This includes:

  • Registering a persistent identifier (PID).
  • Calculating an annotation score to represent metadata richness.
  • Generating a final UUID for the physical storage location.
sequenceDiagram
    participant P as Submission Pipeline
    participant H as Housekeeping
    participant PID as PID Service
    participant AS as Annotation score
    participant C as Checksum

    P->>H: Start housekeeping

    H->>PID: Register PID
    H->>AS: Calculate annotation score
    H->>H: Add UUID (storage location)

    H->>C: Checksum final rich metadata

    H-->>P: Housekeeping done

Key features:

  • PID Registration: Assigns persistent identifiers to datasets
  • Annotation Scoring: Evaluates dataset quality and completeness
  • Metadata Enrichment: Adds contextual information
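The housekeeping steps above can be sketched end to end: mint a PID (stubbed here, where the real pipeline calls a PID service), score metadata richness, attach a storage UUID, and checksum the final enriched record. The PID format and the scoring rule (fraction of optional fields filled) are assumptions, not the repository's actual scheme.

```python
import hashlib
import json
import uuid

def housekeep(metadata, optional_fields):
    """Enrich validated metadata with a PID, annotation score, storage UUID,
    and a checksum over the final record."""
    enriched = dict(metadata)
    enriched["pid"] = f"pid:{uuid.uuid4()}"       # stand-in for the PID service
    filled = sum(1 for f in optional_fields if metadata.get(f))
    enriched["annotation_score"] = filled / len(optional_fields)
    enriched["storage_uuid"] = str(uuid.uuid4())  # physical storage location
    # Checksum the final rich metadata over a canonical serialization.
    canonical = json.dumps(enriched, sort_keys=True).encode()
    enriched["metadata_checksum"] = hashlib.sha256(canonical).hexdigest()
    return enriched
```

Checksumming a canonically serialized (key-sorted) form matters: the same metadata must always yield the same digest regardless of key order, or later integrity checks on the record would spuriously fail.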