From Knowledge to Code

Why Semantic Engineering Needs Its Own Language

The Missing Infrastructure for Semantic Knowledge Management

Every organization has knowledge. Product hierarchies, customer segments, regulatory categories, clinical terminology, financial instruments: structured understanding that lives in spreadsheets, wikis, tribal memory, and the heads of people who might leave next quarter.

The promise of knowledge graphs and semantic systems is that this understanding can be formalized, made queryable, and fed into AI systems that actually comprehend context. The reality is messier. Most semantic knowledge management initiatives either collapse under their own complexity, never secure sustained funding, or produce artifacts so opaque that only their original authors can maintain them.

The Ontology Pipeline®, a framework developed by Jessica Talisman, proposes a disciplined, sequential approach to building semantic knowledge systems. It draws on Library and Information Science principles to decompose the overwhelming challenge of "build a knowledge graph" into discrete, manageable stages. It is a serious and principled answer to a real problem.

But even the best framework is limited by the tools available to execute it. And the tools available for semantic knowledge engineering are, to put it diplomatically, stuck in a different era.

This post recaps the Ontology Pipeline® framework, identifies where its execution encounters friction, and proposes that a purpose-built, human-readable language (with the collaborative tooling developers already rely on) could be the missing infrastructure that makes semantic knowledge management scalable, sustainable, and organizationally viable.


The Ontology Pipeline®: A Recap

The Ontology Pipeline® structures semantic knowledge construction as a sequence of building blocks, each establishing the foundation for the next.

Stage 1: Controlled Vocabularies

Everything begins with naming things consistently. A controlled vocabulary establishes the canonical terms an organization uses to describe its domain. "Customer Churn" means one thing, defined once, used everywhere. Synonyms are captured. Ambiguities are resolved. This is the most fundamental and most frequently skipped stage. Organizations jump to building taxonomies with undefined terms, then wonder why nothing aligns.
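To make the idea concrete, here is a minimal sketch of a controlled vocabulary as a synonym-resolution table. The terms, field names, and structure are invented for illustration; they are not part of any standard or tool.

```python
# A controlled vocabulary reduced to its essence: one canonical term,
# every variant resolved to it, undefined terms rejected loudly.

CANONICAL = {
    "Customer Churn": {
        "definition": "Customers ceasing to do business with the organization.",
    },
}

SYNONYMS = {  # every known variant maps to exactly one canonical term
    "customer churn": "Customer Churn",
    "customer attrition": "Customer Churn",
    "churn": "Customer Churn",
}

def resolve(term: str) -> str:
    """Map any known variant to its canonical form, or raise if undefined."""
    key = term.strip().lower()
    if key not in SYNONYMS:
        raise KeyError(f"'{term}' is not in the controlled vocabulary")
    return SYNONYMS[key]

print(resolve("Customer Attrition"))  # -> Customer Churn
print(resolve("CHURN"))               # -> Customer Churn
```

The point of the sketch is the failure mode: a term that is not in the vocabulary is an error, not a silent new concept.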

Stage 2: Metadata Standards

With terms defined, the next stage establishes how those terms are described and documented. What attributes does every concept carry? A label, a definition, an owner, a status, a creation date? Metadata standards ensure that every entry in the system carries consistent, machine-readable documentation. Without this stage, vocabularies become unmaintainable as they grow.

Stage 3: Taxonomies

Taxonomies introduce hierarchy, i.e. parent-child relationships that organize concepts into navigable structures. "Defective Return" is a type of "Product Return," which is a type of "Transaction." This stage transforms a flat list of terms into a structured classification system. The Ontology Pipeline® emphasizes validation at this stage: hierarchies must be logically consistent, mutually exclusive at each level, and collectively exhaustive within their scope.

Stage 4: Thesauri

Thesauri add associative and equivalence relationships beyond strict hierarchy. "Product Return" is related to "Warranty Claim." "RMA" is equivalent to "Return Merchandise Authorization." This stage captures the lateral connections that reflect how practitioners actually navigate knowledge. Not only up and down a tree, but also across conceptual neighborhoods.

Stage 5: Ontologies

Ontologies introduce formal logic: classes, properties, constraints, and inference rules. A "Defective Return" requires an "Inspection Process." A "High-Value Return" is defined as any return exceeding a threshold. Ontologies enable machines to reason over the knowledge structure, drawing conclusions that aren't explicitly stated but are logically entailed.
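The core mechanic of entailment can be shown in a few lines: compute the transitive closure of subclass assertions, so conclusions that were never stated directly still fall out. Concept names follow the post's retail examples; the code itself is an illustrative sketch, not a real reasoner.

```python
# Explicit assertions: only direct (child, parent) subclass pairs.
explicit = {
    ("DefectiveReturn", "ProductReturn"),
    ("ProductReturn", "Transaction"),
}

def entailed_superclasses(cls, edges):
    """All superclasses reachable from cls: explicit plus inferred."""
    found, frontier = set(), {cls}
    while frontier:
        step = {parent for (child, parent) in edges if child in frontier}
        frontier = step - found
        found |= step
    return found

# "DefectiveReturn is a Transaction" was never asserted, but it is entailed.
print(entailed_superclasses("DefectiveReturn", explicit))
```

Real ontology reasoners handle far more than subclass transitivity (property chains, cardinality, disjointness), but the shape of the operation is the same: derive what the assertions logically commit you to.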

Stage 6: Knowledge Graphs

Knowledge graphs represent the synthesis: the assembled, queryable, visual representation of all preceding stages. They are the interface layer, where humans and machines interact with the semantic system through query languages like SPARQL, through visualizations, and through API integrations with downstream applications including RAG systems, entity management, and search.

The Framework's Strengths

The Ontology Pipeline® gets several things right that many competing approaches miss.

  • Sequencing is the core insight. By establishing that you cannot build a sound taxonomy without a controlled vocabulary, and you cannot build a reliable ontology without a validated taxonomy, the framework prevents the most common failure mode: building sophisticated structures on undefined foundations.

  • Investment legibility. By decomposing the work into stages, each with identifiable inputs, outputs, and effort requirements, the framework allows organizations to estimate costs, plan iterations, and demonstrate incremental value, addressing the chronic funding problem in knowledge management.

  • Data quality as byproduct. Every stage of the pipeline forces data cleaning, deduplication, and reconciliation. The semantic build inherently improves data quality, producing measurable returns even before the knowledge graph is complete.

  • Library science grounding. Rather than inventing new theory, the framework draws on decades of information science methodology, lending rigor and institutional credibility.

Where Friction Emerges

Despite these strengths, practical execution of the pipeline encounters persistent challenges.

  • The linearity trap. Real-world semantic engineering is iterative. Building an ontology frequently reveals that the taxonomy was wrong, which reveals gaps in the controlled vocabulary. The pipeline's sequential framing, while conceptually sound, doesn't naturally accommodate the feedback loops that dominate actual practice.

  • Stakeholder exclusion. The artifacts produced at each stage (vocabulary files, taxonomy structures, ontological axioms) are typically encoded in specialist formats (RDF/XML, OWL, Turtle) or managed in specialist tools (Protégé). Domain experts, who hold the knowledge being modeled, cannot directly read, review, or challenge these artifacts. Communication happens through meetings and documents about the artifacts rather than through the artifacts themselves.

  • Maintenance collapse. Even well-constructed semantic systems degrade without governance. Concepts are added without review. Definitions drift. Relationships conflict. The pipeline doesn't inherently include the enforcement mechanisms needed to prevent entropy.

  • Invisible integration. The connection between semantic system quality and downstream AI performance remains abstract. When a RAG system returns poor results, tracing that failure to a specific vocabulary gap or taxonomic error requires specialized investigation.

These aren't criticisms of the framework's logic. They're consequences of a tooling gap. The Ontology Pipeline® describes what to build and in what order. What's missing is infrastructure that makes the building process collaborative, transparent, and continuously validated.


A Language for Knowledge Engineering

The Proposal

What if the artifacts produced by the Ontology Pipeline® (controlled vocabularies, metadata schemas, taxonomies, thesauri, ontological rules) were expressed in a single, purpose-built language? A language designed to be:

  • Human-readable: a domain expert can read a concept definition and confirm or challenge it without training
  • Machine-parsable: compilers can transform it into OWL, SKOS, RDF, SHACL, or any target format
  • Developer-friendly: it lives in files, works with version control, and responds to linting, testing, and continuous integration

This isn't about replacing existing semantic web standards. It's about providing a source representation, i.e. the format in which humans author, review, and maintain semantic knowledge, that compiles to those standards for machine consumption.

What the Syntax Could Look Like

What follows is a purely aspirational sketch of such a language, not an existing implementation.

Controlled Vocabulary:

#> The process by which a customer sends back
#> a purchased product to the seller.
concept ProductReturn:
  #@ label "Product Return"
  #@ altLabel "RMA", "Merchandise Return"
  #@ domain RetailOperations
  #@ status approved
  #@ owner @merchandising-team
  #@ created 2025-01-15
  
  sub Transaction

  has which_product: some Product
  has which_customer: one Customer
  has value: float [unit currency]

  one of:
    DefectiveReturn
    BuyersRemorseReturn
    WarrantyReturn

The syntax is intentionally transparent. Indentation conveys structure. Keywords are plain English. Metadata is inline. Anyone, whether a developer, a merchandising director, or a data engineer, can read this and understand what is being asserted.

Taxonomy:

taxonomy RetailOperations:
  Commerce
    Transaction
      Purchase
      ProductReturn
        DefectiveReturn
        BuyersRemorseReturn
        WarrantyReturn
      Exchange
    Pricing
      Discount
      Promotion

Hierarchy is expressed through indentation, mirroring how people naturally represent tree structures. The visual layout is the logical structure.
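Parsing indentation into parent-child edges is mechanically simple, which is part of the appeal. The sketch below, using invented names and a fixed two-space indent, shows one way a tool could do it; it is illustrative only.

```python
# Turn an indentation-based taxonomy into (child, parent) edges.
TAXONOMY = """\
Commerce
  Transaction
    Purchase
    ProductReturn
      DefectiveReturn
      BuyersRemorseReturn
      WarrantyReturn
    Exchange
  Pricing
    Discount
    Promotion
"""

def parse_taxonomy(text, indent=2):
    edges = []   # (child, parent) pairs
    stack = []   # concepts on the path from the root to the current line
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip())) // indent
        name = line.strip()
        stack[depth:] = [name]          # truncate to this depth, then push
        if depth > 0:
            edges.append((name, stack[depth - 1]))
    return edges

edges = parse_taxonomy(TAXONOMY)
print(edges[2])  # -> ('ProductReturn', 'Transaction')
```

A real parser would also validate the indent (rejecting odd offsets or skipped levels), which is exactly the kind of structural check a linter can enforce.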

Thesaurus Relationships:

thesaurus RetailRelationships:
  ProductReturn
    related WarrantyClaim
    related CustomerComplaint
    broader Transaction
    narrower DefectiveReturn, BuyersRemorseReturn

  WarrantyClaim
    related ProductReturn
    related QualityInspection
    equivalent "Guarantee Claim"

Ontological Rules:

rule DefectiveReturnRequiresInspection:
  match:
    ?entity a DefectiveReturn
    ?entity value [> 50 USD]
  then:
    ?entity require InspectionProcess
    ?entity assign QualityTeam

class HighValueReturn:
  equivalentTo ProductReturn
    where value > 500 USD
  expect:
    ApprovalRecord
    ManagerSignoff

Metadata Schema:

schema ConceptMetadata:
  required:
    label        : string
    definition   : text, min 20 characters
    owner        : reference @team
    status       : enum [draft, review, approved, deprecated]
    created      : date
  optional:
    altLabel     : list of string
    source       : uri
    deprecatedBy : reference concept

Each stage of the Ontology Pipeline® could have a corresponding construct in the language. The pipeline's logical sequence becomes a compositional sequence: later constructs reference and build upon earlier ones, and the language enforces this.

The Tooling Ecosystem

A language without tools is just notation. The real leverage comes from the ecosystem that a well-defined syntax enables.

Version Control and Collaboration

Semantic artifacts as text files in Git repositories means every change is tracked, attributed, reviewable, and reversible. Pull requests become the mechanism for knowledge governance.

$ git diff main..feature/warranty-taxonomy

  taxonomy RetailOperations:
    Commerce
      Transaction
        ProductReturn
+         WarrantyReturn    # new: split from DefectiveReturn
          DefectiveReturn
-         ManufacturerDefect  # merged into DefectiveReturn
          BuyersRemorseReturn

(The inline comments above are mine; they are not part of the git output.) A merchandising manager reads this diff and says: "WarrantyReturn needs to be separate from DefectiveReturn because warranty claims involve third-party manufacturers." That feedback happens in a pull request comment, asynchronously, with full context.

Linting and Validation

Automated checks run on every commit, enforcing structural integrity and catching errors that currently go undetected until they cause downstream failures.

$ ontology lint ./retail-domain/

retail/vocab.ont:47    ERR  "CustomerReturn" has no definition (required by
                            ConceptMetadata schema)
retail/taxonomy.ont:23 WARN concept "Item" has 31 children; consider decomposition
retail/rules.ont:8     ERR  rule references "InspectionProcess" which is undefined
retail/thesaurus.ont:5 WARN "RMA" appears as altLabel and as thesaurus entry;
                            possible duplication

Testing

Semantic systems gain the same confidence that test suites provide to software systems.

# Product return taxonomy completeness
test ProductReturnDefinition:
  #> every child of ProductReturn has definition
  match:
    ?sub rdfs:subClassOf ProductReturn 
  expect:
    ?sub definition ?define

test LeafHasOneParent:
  #> no leaf concept has more than 1 parent
  match:
    ?leaf a rdfs:Class
    none:
      ?_ rdfs:subClassOf ?leaf
    ?leaf rdfs:subClassOf ?p1
    ?leaf rdfs:subClassOf ?p2
  expect:
    ?p1 == ?p2

test DeprecatedHaveReplacedBy:   
  #> all deprecated concepts have replacedBy
  match:
    ?concept a rdfs:Class
    ?concept a DeprecatedConcept
  expect:
    ?concept replacedBy ?_

When a vocabulary change breaks a retrieval test, the team sees exactly which modification caused which degradation. The abstract claim that "semantic quality improves AI" becomes a specific, testable assertion.

IDE Integration

Standard developer environment features (go to definition, find all references, autocomplete, outline views, inline documentation) can transform how semantic engineers navigate and construct knowledge systems.

Typing "related" triggers autocomplete suggestions for existing concepts. Hovering over DefectiveReturn shows its full definition, status, and owner inline. "Find all references" reveals every taxonomy, thesaurus entry, and rule that touches a concept, making impact analysis instant.

Live Visualization

Because the syntax is machine-parsable, graph visualizers render the current state of the semantic system in real time as authors edit. Clicking a node navigates to its source definition. Visual anomalies (disconnected clusters, unusual depth, missing relationships) are immediately apparent.

$ ontology visualize ./retail-domain/ \
    --focus ProductReturn \
    --depth 3 \
    --highlight status:draft \
    --highlight status:deprecated \
    --export interactive-html

The visualization is a development and communication tool, not a presentation artifact generated after the fact.

Refactoring

Large-scale structural changes, which are currently terrifying in semantic systems, become managed operations.

$ ontology refactor rename-concept \
    --from "CustomerReturn" \
    --to "ProductReturn" \
    --update-references \
    --update-tests \
    --preview

Preview: 47 references updated across 12 files
         3 rules affected
         2 tests require review
         0 breaking changes detected

Apply? [y/n]

Compilation and Export

The source language compiles to standard formats for machine consumption.

$ ontology compile ./retail-domain/ \
    --target owl2 \
    --output ./build/retail.owl

$ ontology compile ./retail-domain/ \
    --target skos \
    --output ./build/retail-vocab.ttl

$ ontology compile ./retail-domain/ \
    --target shacl \
    --output ./build/retail-shapes.ttl

$ ontology compile ./retail-domain/ \
    --target json-ld \
    --output ./build/retail.jsonld

The human-readable source is the single source of truth. Machine formats are build artifacts, generated and validated automatically.
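A compiler pass for the simplest target can be sketched directly: walk the in-memory concept table and emit SKOS Turtle as a build artifact. The concept table, function name, and base IRI are all invented for illustration; this is not a real toolchain.

```python
# Emit a SKOS Turtle build artifact from an in-memory concept table.
concepts = {
    "ProductReturn": {
        "prefLabel": "Product Return",
        "altLabels": ["RMA", "Merchandise Return"],
    },
}

def compile_skos(concepts, base="http://example.org/retail/"):
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."]
    for name, meta in concepts.items():
        lines.append(f'<{base}{name}> skos:prefLabel "{meta["prefLabel"]}"@en ;')
        for alt in meta["altLabels"]:
            lines.append(f'    skos:altLabel "{alt}"@en ;')
        # Close the final statement: swap the trailing ";" for "."
        lines[-1] = lines[-1].rstrip("; ") + " ."
    return "\n".join(lines)

print(compile_skos(concepts))
```

Because the output is a build artifact, it can be regenerated and re-validated (for example, by round-tripping it through an RDF parser) on every commit, never edited by hand.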

Dolfin: A Language and Ecosystem for Practical Ontology Engineering

All the syntax and ideas presented so far in this post are purely fictional. They represent a wish list, a sketch of what ontology engineering could feel like if the tooling caught up with the thinking. I took it upon myself to build a language that approaches this dream.

Dolfin (Descriptive Ontology Language, Formal yet INtuitive) is that attempt. And it is not just a language. It is an ecosystem designed from the ground up to empower knowledge engineers.

The Language

Dolfin is a human-readable, structured language for expressing ontological knowledge. It is designed so that a domain expert can read a Dolfin file and understand what it says, while a machine can parse it with full semantic precision. The syntax stays close to how people actually describe their domains: clear declarations, explicit relationships, readable constraints.

If your ontologist leaves next quarter, the person who replaces them can open a Dolfin file and orient themselves in minutes, not weeks.

The Tooling

Dolfin comes with an integrated ecosystem built around the workflows that ontology engineers actually need:

  • Validation and linting: catch inconsistencies, missing definitions, and structural issues before they propagate downstream
  • Compilation to standard formats: Dolfin compiles to OWL, SKOS, SHACL, and other semantic web standards, so your work integrates with existing triple stores, reasoners, and knowledge graph platforms
  • Version control compatibility: Dolfin files are plain text, which means they work natively with Git. Branches, pull requests, diffs, code review, all the collaborative infrastructure that software teams rely on becomes available to ontology teams
  • Modular architecture: build your ontology in composable pieces that map naturally to the stages of the Ontology Pipeline®, from glossaries and taxonomies through to full formal ontologies

Why This Matters

The gap between semantic knowledge management ambition and organizational reality has never been a gap in theory. The frameworks exist. The standards exist. What has been missing is a practical surface, a way for teams to author, review, iterate, and maintain ontological knowledge using tools and workflows that feel familiar rather than arcane.

Dolfin is built to close that gap, providing a humane interface to the semantic web standards. One where the formalism lives underneath, and what faces the team is something they can read, discuss, version, and sustain.

The Debate: One Language to Rule Them All?

The vision is compelling. A single authoring language spanning controlled vocabularies through ontologies, supported by developer-native tooling, enabling collaborative construction of knowledge graphs. But compelling visions deserve rigorous scrutiny. There are genuine strengths to this approach and genuine blind spots that must be confronted honestly.

The Strong Case

  • Communication becomes concrete. The single most valuable property of this approach is that it gives domain experts, engineers, managers, and semantic specialists a shared artifact to point at, discuss, and modify. Knowledge modeling decisions currently trapped in meetings, emails, and institutional memory become visible, reviewable, and traceable. This alone may justify the investment.

  • Quality becomes enforceable. Linting, testing, and CI pipelines transform quality from an aspiration ("we should validate the taxonomy") into an automated guarantee ("the build fails if any concept lacks a definition"). The stage-gate criteria missing from the Ontology Pipeline® emerge naturally from the tooling.

  • Iteration becomes cheap. Version control, branching, diffing, and refactoring tools make it safe to change things. The waterfall problem dissolves not because the pipeline's sequence changes, but because moving between stages (and revisiting earlier stages when later work reveals problems) carries minimal overhead.

  • Maintenance becomes sustainable. Deprecation workflows, automated reference tracking, ownership metadata, and change history address the governance gap directly. The system contains its own maintenance instructions.

  • AI integration becomes testable. Semantic test suites that verify downstream RAG and retrieval behavior create a measurable feedback loop between knowledge engineering and AI performance. Investment in semantic quality can be directly correlated with system behavior.

  • Onboarding accelerates. New team members can explore the semantic system using familiar IDE navigation, read definitions in plain language, and understand the system's structure through visualization. The knowledge base becomes self-documenting.

The Blind Spots

Expressivity versus readability is a real tension. OWL2 supports description logic constructs (existential and universal restrictions, cardinality constraints, property chains, class intersections and unions) that are genuinely difficult to represent in readable syntax without either losing precision or introducing notation that non-specialists can't follow. Consider:

class OrphanedReturn:
  equivalentTo ProductReturn
    where (not exists associatedTransaction)
    and (createdDate < 30 daysAgo)
    or (value intersect HighValueThreshold
        and not has ManagerReview)

Is this readable? To a developer, somewhat. To a domain expert, marginally. To a logician who needs to verify the semantics, probably not, because the readable syntax may obscure operator precedence, quantifier scope, and logical entailment in ways that formal notation makes explicit. There is an irreducible tension between accessibility and precision, and a single language must navigate it without pretending it doesn't exist.

One mitigation is stratified complexity. The language supports simple constructs readably and offers explicit formal notation for advanced logic, clearly marked as requiring specialist review. But this fragments the "single language" promise into layers that different audiences engage with differently, which is arguably the current situation with better syntax.

One language may embed one worldview. The Ontology Pipeline® progresses from vocabularies to ontologies, and the proposed language mirrors this progression. But different domains model knowledge differently. Biomedical ontologies rely heavily on formal axiomatization. Enterprise taxonomies prioritize navigational clarity. Legal knowledge systems emphasize provenance and authority. A language designed around a pipeline that privileges one progression may subtly disadvantage modeling approaches that don't follow that sequence.

An organization migrating an existing OWL ontology into this language might find that the source syntax is less expressive than what they already have. Not because the compilation target lacks features, but because the human-readable layer doesn't surface them.

The tooling ecosystem doesn't exist yet. The linters, IDE plugins, test frameworks, visualizers, compilers, and CI integrations described in this post are aspirational. Building them is a substantial engineering effort. There is a bootstrapping problem: the language's value depends on its tooling, but investment in tooling depends on the language having adoption, which depends on the language having tooling. Open-source community dynamics might solve this eventually, but "eventually" is not a timeline that organizations struggling with semantic systems today can rely on.

Cultural change is the hardest part. Asking ontologists to work in text files with Git instead of Protégé is a workflow change. Asking domain experts to review pull requests instead of attending taxonomy review meetings is a behavioral change. Asking managers to read linting reports instead of status decks is a cultural change. The tooling lowers barriers, but it doesn't eliminate the organizational effort required to adopt collaborative knowledge engineering practices. Many semantic system failures are people problems, not tool problems, and better syntax doesn't automatically fix team dynamics.

Not all knowledge is hierarchical or textual. Spatial relationships, temporal dynamics, probabilistic assertions, and multi-modal knowledge (images, sensor data, physical samples) are all part of real-world knowledge systems. A text-first, hierarchy-friendly language may work beautifully for product taxonomies and struggle with geospatial ontologies, temporal reasoning, or uncertainty quantification. The risk is that the language becomes excellent for the use cases that fit its paradigm and awkward for everything else, leading to either forced compromises or parallel systems.

Standards adoption and interoperability. The semantic web community has invested decades in RDF, OWL, SKOS, and SHACL. Introducing a new source language that compiles to these standards creates a layer of indirection. When something goes wrong in the compiled output, debugging requires understanding both the source language and the target standard. This is manageable, every compiled language faces this, but it means the language doesn't reduce the total knowledge required; it redistributes it.

The Middle Ground

Perhaps the most honest assessment is that this language would be transformatively valuable for 70% of semantic knowledge engineering work and inadequate for 30%. The 70% (controlled vocabularies, metadata schemas, taxonomies, thesaurus relationships, and straightforward ontological rules) is exactly the work that most organizations need to do and currently find prohibitively opaque. Making this work accessible, collaborative, and quality-controlled through readable syntax and developer tooling would be a genuine advancement.

The 30% (complex description logic, advanced reasoning, novel ontological patterns, edge-case expressivity) would still require specialist tools and formal notation. And that's acceptable, as long as the language is honest about its boundaries and provides clean interfaces to specialist tooling when needed.

The danger is not that the language can't do everything. The danger is marketing it as if it can, leading organizations to discover its limitations only after committing to it as their sole knowledge engineering platform.


Conclusion

The Ontology Pipeline® correctly identifies that semantic knowledge management needs structure, sequence, and organizational legibility. Its staged approach, from controlled vocabularies through knowledge graphs, provides the conceptual framework organizations need to invest confidently in semantic infrastructure.

What the framework needs is an execution layer that makes its stages collaborative, transparent, and continuously validated. A purpose-built, human-readable language with developer-native tooling offers exactly this: semantic artifacts that domain experts can review, that developers can version-control and test, that linters can validate, and that compilers can transform into standard machine-consumable formats.

The vision of a single language spanning the entire pipeline is powerful but imperfect. It will excel at making the bulk of knowledge engineering work accessible and maintainable. It will struggle with the edges, namely the most complex logical constructs, the most unusual modeling paradigms, the domains that don't fit its structural assumptions. The honest path forward is to build the language for the common case, design clean extension points for the complex case, and resist the temptation to claim universality.

Semantic knowledge engineering is infrastructure. Like all infrastructure, it succeeds not through brilliance of design but through reliability of execution. A readable language, a linter that catches errors, a test suite that verifies behavior, a pull request that invites review, these are mundane tools. They are also exactly the tools that have made software engineering collaborative, scalable, and sustainable over the past three decades.

Knowledge engineering deserves the same.


The Ontology Pipeline® is a framework developed by Jessica Talisman. The language concepts and tooling proposals discussed in this post are the author's independent analysis of how developer-native infrastructure could complement the Pipeline's methodology.


Your comments are welcome on LinkedIn.