Entities

Entities are knowledge graph annotations that provide context beyond the raw text, such as repository stats, author profiles, and language metadata. They are never in a final state: entity data evolves continuously, so the entities attached to a piece of content today may differ from those attached to it tomorrow.

Entities are attached to content via an entity overlay — a separate JSON document that sits alongside the PCF content. This separation means:

  1. The content is the source of truth. It can be rendered, edited, and versioned without any awareness of entities.
  2. Entities evolve independently. New entity types can be added, entity data can be enriched, and entity detection can be re-run — all without touching the content.
  3. Overlapping annotations are trivial. Multiple entities can reference the same or overlapping text ranges without structural conflicts.
  4. Consumers opt in. A rendering pipeline that does not need entity data simply ignores the overlay document.

Consider caching carefully

Content in the Content Lake is updated continuously as we track new entity types and add new attributes to existing entities. When content is first ingested, we may run a quick first-pass entity detection and then add more complex entities over time.

We recommend:

  • No caching for editing applications.
  • Short-lived caching (< 60 seconds) for internal apps.
  • Caches of up to 1 hour for end-user facing applications.

If an end-user facing application needs longer caches, apply them only to content that was ingested more than 24 hours ago. Even then, we caution against caching for longer than 1 hour.
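
The recommendations above can be captured in a small lookup. This is a minimal sketch, not part of any Packt client: the application labels (`editing`, `internal`, `end_user`) and both function names are hypothetical, and it assumes the caller already knows when the content was ingested.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical labels for the three application kinds listed above.
MAX_TTL_SECONDS = {
    "editing": 0,      # no caching
    "internal": 59,    # short-lived: under 60 seconds
    "end_user": 3600,  # up to 1 hour
}


def max_cache_ttl(app_kind: str) -> int:
    """Return the longest recommended cache lifetime, in seconds."""
    return MAX_TTL_SECONDS[app_kind]


def may_exceed_one_hour(app_kind: str, ingested_at: datetime, now: datetime) -> bool:
    """Longer caches are only an option for end-user facing applications
    serving content that was ingested more than 24 hours ago."""
    return app_kind == "end_user" and now - ingested_at > timedelta(hours=24)


now = datetime(2026, 3, 25, 12, 0, tzinfo=timezone.utc)
print(max_cache_ttl("internal"))                                      # 59
print(may_exceed_one_hour("end_user", now - timedelta(hours=2), now)) # False
```

Even where `may_exceed_one_hour` returns `True`, the 1 hour ceiling remains the safer default.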

Entity Overlay Structure

The overlay is a JSON document with the following shape:

{
  "$schema": "https://schema.packt.com/pcf/entity-overlay/v0.1.0",
  "contentRef": "contentlake://packt/<documentId>",
  "version": "2026-03-23T10:30:00Z",
  "entities": [ ... ]
}

| Field | Type | Required | Description |
|---|---|---|---|
| $schema | string | No | URI of the overlay schema version |
| contentRef | string | Yes | contentlake:// URI identifying the PCF document this overlay annotates |
| version | string | Yes | ISO 8601 timestamp of when the overlay was last generated |
| entities | array | Yes | Array of entity objects |
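
A consumer may want to check the required fields before using an overlay. The sketch below is one way to do that with the standard library only; `EntityOverlay`, `parse_overlay`, and the document ID `doc-123` are hypothetical names chosen for illustration, not part of a Packt client.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional

REQUIRED_FIELDS = ("contentRef", "version", "entities")


@dataclass
class EntityOverlay:
    content_ref: str
    version: datetime
    entities: list[dict[str, Any]]
    schema: Optional[str] = None  # "$schema" is optional


def parse_overlay(doc: dict[str, Any]) -> EntityOverlay:
    """Validate the top-level overlay fields and parse the version timestamp."""
    missing = [name for name in REQUIRED_FIELDS if name not in doc]
    if missing:
        raise ValueError(f"overlay is missing required fields: {missing}")
    if not doc["contentRef"].startswith("contentlake://"):
        raise ValueError("contentRef must be a contentlake:// URI")
    if not isinstance(doc["entities"], list):
        raise ValueError("entities must be an array")
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 onwards,
    # so normalise it to an explicit UTC offset first.
    version = datetime.fromisoformat(doc["version"].replace("Z", "+00:00"))
    return EntityOverlay(doc["contentRef"], version, doc["entities"], doc.get("$schema"))


overlay = parse_overlay({
    "contentRef": "contentlake://packt/doc-123",  # hypothetical document ID
    "version": "2026-03-23T10:30:00Z",
    "entities": [],
})
print(overlay.version.isoformat())  # 2026-03-23T10:30:00+00:00
```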

Addressing Model

The overlay uses a two-part addressing scheme to locate text within a PCF document:

  1. Block key (blockKey) — identifies which block in the PCF array the entity appears in.
  2. Character offset (startChar, endChar) — a zero-based, half-open character range within the block's plaintext.

The plaintext of a block is the concatenation of all text values from its children spans, in order, with no separators. For example:

Go is a popular language for building web services. The Chi router
^^                                                      ^^^^^^^^^^
|                                                       |
entity: language/Go (0-2)                               entity: github/chi (56-66)

Multiple entities can reference different (or overlapping) character ranges in the same block without conflict.
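
The addressing scheme can be resolved in a few lines. This is a sketch under two assumptions that the text above does not confirm: that each PCF block carries its key in a `_key` field, and that offsets count Unicode code points (which is what Python string slicing uses). The function names are hypothetical.

```python
from typing import Any


def block_plaintext(block: dict[str, Any]) -> str:
    """Concatenate the text of every child span, in order, with no separators."""
    return "".join(span.get("text", "") for span in block.get("children", []))


def resolve_mention(blocks: list[dict[str, Any]], mention: dict[str, Any]) -> str:
    """Return the text a mention points at, checked against mention['text']."""
    # Assumption: the block key is stored in a "_key" field on each block.
    block = next((b for b in blocks if b.get("_key") == mention["blockKey"]), None)
    if block is None:
        raise KeyError(f"no block with key {mention['blockKey']!r}")
    # Half-open range: startChar is included, endChar is excluded.
    found = block_plaintext(block)[mention["startChar"]:mention["endChar"]]
    if found != mention["text"]:
        raise ValueError(f"stale mention: expected {mention['text']!r}, found {found!r}")
    return found


blocks = [{
    "_key": "b1",
    "children": [
        {"text": "Go is a popular language "},
        {"text": "for building web services."},
    ],
}]
mention = {"blockKey": "b1", "startChar": 0, "endChar": 2, "text": "Go"}
print(resolve_mention(blocks, mention))  # Go
```

Comparing the slice with `mentions[].text` is a cheap way to detect an overlay that no longer matches the content it annotates.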

Entity Types

GitHub

The GitHub entity provides context about repositories referenced in the content.

{
  "_type": "github",
  "_id": "ent-gh-1",
  "pkg_id": "pkg-gh-nextjs",
  "name": "next.js",
  "confidence": 0.99,
  "salience": 0.75,
  "mentions": [
    {
      "blockKey": "b2",
      "startChar": 55,
      "endChar": 62,
      "text": "next.js"
    }
  ],
  "metadata": {
    "href": "https://github.com/vercel/next.js",
    "about": "The React Framework",
    "license": "MIT",
    "stats": {
      "stars": 128400,
      "contributors": 3200,
      "commits": 42000
    },
    "languages": [
      {
        "name": "JavaScript",
        "proportion": 0.373,
        "pkg_id": "pkg-lang-javascript"
      }
    ]
  }
}

| Metadata field | Type | Description |
|---|---|---|
| href | string | Full GitHub URL |
| about | string | Repository description |
| license | string | SPDX license identifier |
| stats | object | Star, contributor, and commit counts |
| languages | array | Language breakdown with proportions |

Entity extraction

Currently we only identify GitHub entities where there is a full URL within the body of ingested content (either a printed GitHub URL or a link in the original content).

Editors can manually add GitHub entity references within the Content Lake editor if the content makes a passing reference to a repository without a link. We plan to improve automatic entity recognition over time.
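
To illustrate why only full URLs are picked up, the sketch below finds repository URLs in a piece of text with a regular expression. It is not the Content Lake's actual detector; the pattern and the `find_github_repos` name are assumptions made for this example.

```python
import re

# Matches https://github.com/<owner>/<repo>; any path after the repo is ignored.
GITHUB_REPO_URL = re.compile(r"https?://github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9_.-]+)")


def find_github_repos(text: str) -> list[str]:
    """Return unique "owner/repo" slugs for every full GitHub URL in the text."""
    repos: list[str] = []
    for owner, repo in GITHUB_REPO_URL.findall(text):
        repo = repo.rstrip(".")  # drop sentence-ending punctuation
        if repo.endswith(".git"):
            repo = repo[: -len(".git")]
        slug = f"{owner}/{repo}"
        if slug not in repos:
            repos.append(slug)
    return repos


text = "We use https://github.com/vercel/next.js. The Chi router is also popular."
print(find_github_repos(text))  # ['vercel/next.js']
```

The passing reference to the Chi router has no URL, so it is not detected; that is the case an editor would add manually.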

Person

The Person entity represents a notable person relevant to some of the content in the Content Lake.

Not everyone will have a Person entity. Having, or not having, a Person entity associated with you is not an editorial decision made by Packt. New Person entities will be discovered and created over time as the Content Lake grows.

Traits which make you more likely to be included:

  • Author
  • Contributor (particularly to GitHub)
  • Speaker
  • Reviewer
  • Influencer (referenced by multiple published works)

{
  "_type": "person",
  "_id": "ent-person-1",
  "pkg_id": "pkg-person-guido-van-rossum",
  "name": "Guido van Rossum",
  "confidence": 0.95,
  "salience": 0.60,
  "mentions": [
    {
      "blockKey": "b3",
      "startChar": 12,
      "endChar": 28,
      "text": "Guido van Rossum"
    }
  ],
  "metadata": {
    "about": "Creator of the Python programming language",
    "profiles": [
      {
        "name": "github",
        "url": "https://github.com/gvanrossum"
      },
      {
        "name": "linkedin",
        "url": "https://www.linkedin.com/in/guido-van-rossum"
      }
    ],
    "languages": [
      {
        "name": "Python",
        "proportion": 0.85,
        "pkg_id": "pkg-lang-python"
      }
    ]
  }
}

| Metadata field | Type | Description |
|---|---|---|
| about | string | Short biography |
| profiles | array | Social and professional profiles |
| languages | array | Programming languages associated with the person |

Language

Represents a programming language.

{
  "_type": "language",
  "_id": "ent-lang-1",
  "pkg_id": "pkg-lang-go",
  "name": "Go",
  "confidence": 0.92,
  "salience": 0.80,
  "mentions": [
    {
      "blockKey": "b1",
      "startChar": 30,
      "endChar": 32,
      "text": "Go"
    }
  ],
  "metadata": {
    "about": "An open-source programming language",
    "paradigm": ["concurrent", "imperative", "compiled"],
    "firstAppeared": 2009,
    "website": "https://go.dev"
  }
}

Technology

Represents a technology, framework, library, or tool that is not a full programming language.

{
  "_type": "technology",
  "_id": "ent-tech-1",
  "pkg_id": "pkg-tech-kubernetes",
  "name": "Kubernetes",
  "confidence": 0.98,
  "salience": 0.90,
  "mentions": [
    {
      "blockKey": "b1",
      "startChar": 15,
      "endChar": 25,
      "text": "Kubernetes"
    }
  ],
  "metadata": {
    "about": "Open-source container orchestration platform",
    "category": "infrastructure",
    "website": "https://kubernetes.io",
    "github": "https://github.com/kubernetes/kubernetes",
    "relatedTechnologies": [
      { "name": "Docker", "pkg_id": "pkg-tech-docker" },
      { "name": "Helm", "pkg_id": "pkg-tech-helm" }
    ]
  }
}

Compatibility

New entity types will be added over time. Consumers of the Content Lake will periodically need to update their clients to take advantage of new entity types. Packt provides clients for common languages and we recommend using one of those where possible.

Save time

Not all clients will be interested in all entity types. You can make your parsers more efficient by only enabling entities that you are interested in.
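
If you parse overlays yourself, enabling a subset of entity types can be as simple as filtering on `_type` before doing any further work. This is a minimal sketch; `select_entities` is a hypothetical helper, not an option of a Packt client.

```python
from typing import Any, Iterable


def select_entities(overlay: dict[str, Any], enabled_types: Iterable[str]) -> list[dict[str, Any]]:
    """Keep only the entities whose _type is in enabled_types."""
    enabled = set(enabled_types)
    return [e for e in overlay.get("entities", []) if e.get("_type") in enabled]


overlay = {
    "entities": [
        {"_type": "language", "name": "Go"},
        {"_type": "person", "name": "Guido van Rossum"},
        {"_type": "technology", "name": "Kubernetes"},
    ]
}
wanted = select_entities(overlay, {"language", "technology"})
print([e["name"] for e in wanted])  # ['Go', 'Kubernetes']
```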

New entity types will not break compatibility. All entities have name and mentions[].text fields which you can rely on.

When building parsers, if you do not recognise the entity type, use the name field for display and the mentions[].text for the matched content, then move on. For example, if a new community entity type is added:

{
  "_type": "community",
  "_id": "ent-comm-1",
  "name": "r/golang",
  "mentions": [
    {
      "blockKey": "b6",
      "startChar": 22,
      "endChar": 30,
      "text": "r/golang"
    }
  ],
  "metadata": {
    "platform": "reddit",
    "url": "https://www.reddit.com/r/golang",
    "subscribers": 245000
  }
}

Consumers that do not understand community entities skip them and continue processing normally.
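
One way to structure a parser so that unknown types degrade gracefully is a handler table with a generic fallback. The sketch below relies only on the `name` and `mentions[].text` fields for unrecognised types; the handler names and the output format are hypothetical.

```python
from typing import Any, Callable

Entity = dict[str, Any]


def describe_github(entity: Entity) -> str:
    """Type-specific rendering that uses GitHub metadata."""
    return f"{entity['name']} ({entity['metadata']['href']})"


def describe_fallback(entity: Entity) -> str:
    """Generic rendering that relies only on name and mentions[].text."""
    matched = ", ".join(m["text"] for m in entity.get("mentions", []))
    return f"{entity['name']} (matched: {matched})"


# Register a handler for each entity type the client understands.
HANDLERS: dict[str, Callable[[Entity], str]] = {"github": describe_github}


def describe(entity: Entity) -> str:
    handler = HANDLERS.get(entity.get("_type", ""), describe_fallback)
    return handler(entity)


community = {
    "_type": "community",
    "name": "r/golang",
    "mentions": [{"blockKey": "b6", "startChar": 22, "endChar": 30, "text": "r/golang"}],
    "metadata": {"platform": "reddit"},
}
print(describe(community))  # r/golang (matched: r/golang)
```

A client that prefers to skip unknown types entirely can return `None` from the fallback instead.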

Roadmap

We plan to build the following entities:

  • Community (representing a group like a Reddit board)

If you are interested in any other entity types or would like us to prioritise one of those already planned, please reach out to a member of the Packt Tech team.