Entities
Entities are knowledge graph annotations that provide context beyond the raw text, such as repository stats, author profiles, and language metadata. They are never in a final state: entity data evolves continuously, so the entities you see for a piece of content today may differ from those you see tomorrow.
Entities are attached to content via an entity overlay, a separate JSON document that sits alongside the PCF content. This separation means:
- The content is the source of truth. It can be rendered, edited, and versioned without any awareness of entities.
- Entities evolve independently. New entity types can be added, entity data can be enriched, and entity detection can be re-run — all without touching the content.
- Overlapping annotations are trivial. Multiple entities can reference the same or overlapping text ranges without structural conflicts.
- Consumers opt in. A rendering pipeline that does not need entity data simply ignores the overlay document.
Consider caching carefully
Content in the Content Lake is updated continuously as we track new entity types and add new attributes to existing entities. When content is first ingested, we may run a quick first-pass entity detection and then add more complex entities over time.
We recommend:
- No caching for editing applications.
- Short-lived caching (< 60 seconds) for internal apps.
- Caches of up to 1 hour for end-user-facing applications.
If you require longer caches for end-user-facing applications, use them only for content that was ingested more than 24 hours ago. Even then, we caution against caching for longer than 1 hour.
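The recommendations above can be sketched as a small TTL policy. This is a minimal illustration only: the function name, the application labels (`editing`, `internal`, `end_user`), and the idea of passing in the ingestion time are assumptions made for the example, not part of any Packt client.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Ceilings taken from the recommendations above, in seconds.
# The keys are hypothetical labels chosen for this sketch.
MAX_TTL_SECONDS = {
    "editing": 0,       # no caching
    "internal": 59,     # strictly under 60 seconds
    "end_user": 3600,   # up to 1 hour
}

# Content older than this may use a longer end-user cache.
SETTLED_AFTER = timedelta(hours=24)


def overlay_cache_ttl(app_kind: str, ingested_at: datetime,
                      requested_seconds: int,
                      now: Optional[datetime] = None) -> int:
    """Return the cache lifetime to use for an overlay, in seconds."""
    now = now or datetime.now(timezone.utc)
    ceiling = MAX_TTL_SECONDS[app_kind]
    # Longer caches are only considered for end-user apps, and only for
    # content ingested more than 24 hours ago.
    if app_kind == "end_user" and now - ingested_at > SETTLED_AFTER:
        return max(requested_seconds, 0)
    return max(min(requested_seconds, ceiling), 0)
```

A caller would pass the TTL it would like and use whatever the function returns. Even where the longer value is allowed, the guidance above still favours staying at or below 1 hour.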
Entity Overlay Structure
The overlay is a JSON document with the following shape:
{
  "$schema": "https://schema.packt.com/pcf/entity-overlay/v0.1.0",
  "contentRef": "contentlake://packt/<documentId>",
  "version": "2026-03-23T10:30:00Z",
  "entities": [ ... ]
}
| Field | Type | Required | Description |
|---|---|---|---|
| `$schema` | string | No | URI of the overlay schema version |
| `contentRef` | string | Yes | `contentlake://` URI identifying the PCF document this overlay annotates |
| `version` | string | Yes | ISO 8601 timestamp of when the overlay was last generated |
| `entities` | array | Yes | Array of entity objects |
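As a rough sketch of how a consumer might check these fields before using an overlay, the following parses the JSON, verifies the three required fields, and converts `version` to a timestamp. The function name and the error messages are hypothetical and not taken from a Packt client.

```python
import json
from datetime import datetime
from typing import Tuple

REQUIRED_FIELDS = ("contentRef", "version", "entities")


def load_overlay(raw: str) -> Tuple[dict, datetime]:
    """Parse an overlay document and return it with its generation time."""
    overlay = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in overlay]
    if missing:
        raise ValueError("overlay is missing required fields: " + ", ".join(missing))
    if not str(overlay["contentRef"]).startswith("contentlake://"):
        raise ValueError("contentRef is not a contentlake:// URI")
    if not isinstance(overlay["entities"], list):
        raise ValueError("entities must be an array")
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so it is rewritten as an explicit UTC offset first.
    generated_at = datetime.fromisoformat(overlay["version"].replace("Z", "+00:00"))
    return overlay, generated_at
```

`$schema` is optional, so the sketch does not require it. A stricter consumer could compare it against the schema versions it knows how to read.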
Addressing Model
The overlay uses a two-part addressing scheme to locate text within a PCF document:
- Block key (`blockKey`): identifies which block in the PCF array the entity appears in.
- Character offset (`startChar`, `endChar`): a zero-based, half-open character range within the block's plaintext.
The plaintext of a block is the concatenation of all `text` values from its child spans, in order, with no separators.
For example:
Go is a popular language for building web services. The Chi router
^^                                                      ^^^^^^^^^^
|                                                       |
entity: language/Go (0-2)                               entity: github/chi (56-66)
Multiple entities can reference different (or overlapping) character ranges in the same block without conflict.
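The addressing model can be sketched in a few lines. This example assumes a PCF block is a dictionary with a `_key` identifier and a `children` list of spans that each carry a `text` value. Those two field names are assumptions made for illustration, as is the sample sentence. It also assumes offsets count Unicode code points, which is how Python indexes strings.

```python
from typing import Optional


def block_plaintext(block: dict) -> str:
    """Concatenate the text of every child span, in order, with no separators."""
    return "".join(span.get("text", "") for span in block.get("children", []))


def resolve_mention(blocks: list, mention: dict) -> Optional[str]:
    """Return the text a mention points at, or None if it no longer matches."""
    by_key = {block["_key"]: block for block in blocks}
    block = by_key.get(mention["blockKey"])
    if block is None:
        return None
    plaintext = block_plaintext(block)
    # Half-open range: startChar is included, endChar is excluded.
    found = plaintext[mention["startChar"]:mention["endChar"]]
    # The mention's own text field acts as a check that the overlay
    # still lines up with the content.
    return found if found == mention["text"] else None


# Hypothetical block whose sentence is split across three spans.
blocks = [{
    "_key": "b3",
    "children": [
        {"text": "Designed by "},
        {"text": "Guido van Rossum"},
        {"text": ", Python favours readability."},
    ],
}]
mention = {"blockKey": "b3", "startChar": 12, "endChar": 28,
           "text": "Guido van Rossum"}
print(resolve_mention(blocks, mention))  # prints "Guido van Rossum"
```

Because the range is resolved against the concatenated plaintext, a mention can start in one span and end in another without special handling.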
Entity Types
GitHub
The GitHub entity provides context about repositories referenced in the content.
{
  "_type": "github",
  "_id": "ent-gh-1",
  "pkg_id": "pkg-gh-nextjs",
  "name": "next.js",
  "confidence": 0.99,
  "salience": 0.75,
  "mentions": [
    {
      "blockKey": "b2",
      "startChar": 55,
      "endChar": 65,
      "text": "Chi router"
    }
  ],
  "metadata": {
    "href": "https://github.com/vercel/next.js",
    "about": "The React Framework",
    "license": "MIT",
    "stats": {
      "stars": 128400,
      "contributors": 3200,
      "commits": 42000
    },
    "languages": [
      {
        "name": "JavaScript",
        "proportion": 0.373,
        "pkg_id": "pkg-lang-javascript"
      }
    ]
  }
}
| Metadata field | Type | Description |
|---|---|---|
| `href` | string | Full GitHub URL |
| `about` | string | Repository description |
| `license` | string | SPDX license identifier |
| `stats` | object | Stars, contributors, commits |
| `languages` | array | Language breakdown with proportions |
Entity extraction
Currently, we identify GitHub entities only where the body of the ingested content contains a full GitHub URL, either printed in the text or present as a link in the original content.
Editors can manually add GitHub entity references within the Content Lake editor if the content makes a passing reference to a repository without a link. We plan to improve automatic entity recognition over time.
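To illustrate what URL-based detection means in terms of the overlay's addressing model, here is a minimal sketch that scans a block's plaintext for printed GitHub repository URLs and emits mention ranges. It is not Packt's detector. The regular expression, the function name, and the handling of a trailing full stop are all assumptions, and links that are not printed in the text are not covered.

```python
import re

# Hypothetical pattern for https://github.com/<owner>/<repo> in running text.
GITHUB_URL = re.compile(r"https?://github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9_.-]+)")


def find_github_mentions(block_key: str, plaintext: str) -> list:
    """Return mention objects for every GitHub repository URL in the text."""
    mentions = []
    for match in GITHUB_URL.finditer(plaintext):
        # A sentence-ending full stop is matched by the repo pattern,
        # so strip it before computing the range.
        url = match.group(0).rstrip(".")
        mentions.append({
            "blockKey": block_key,
            "startChar": match.start(),
            "endChar": match.start() + len(url),  # half-open range
            "text": url,
        })
    return mentions
```

Each returned range satisfies `plaintext[startChar:endChar] == text`, which is the same invariant a consumer can check when reading an overlay.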
Person
The Person entity represents a notable person relevant to some of the content in the Content Lake.
Not everyone will have a Person entity. Having, or not having, a Person entity associated with you is not an editorial decision made by Packt. New Person entities will be discovered and created over time as the Content Lake grows.
Traits which make you more likely to be included:
- Author
- Contributor (particularly to GitHub)
- Speaker
- Reviewer
- Influencer (referenced by multiple published works)
{
  "_type": "person",
  "_id": "ent-person-1",
  "pkg_id": "pkg-person-guido-van-rossum",
  "name": "Guido van Rossum",
  "confidence": 0.95,
  "salience": 0.60,
  "mentions": [
    {
      "blockKey": "b3",
      "startChar": 12,
      "endChar": 28,
      "text": "Guido van Rossum"
    }
  ],
  "metadata": {
    "about": "Creator of the Python programming language",
    "profiles": [
      {
        "name": "github",
        "url": "https://github.com/gvanrossum"
      },
      {
        "name": "linkedin",
        "url": "https://www.linkedin.com/in/guido-van-rossum"
      }
    ],
    "languages": [
      {
        "name": "Python",
        "proportion": 0.85,
        "pkg_id": "pkg-lang-python"
      }
    ]
  }
}
| Metadata field | Type | Description |
|---|---|---|
| `about` | string | Short biography |
| `profiles` | array | Social and professional profiles |
| `languages` | array | Programming languages associated |
Language
Represents a programming language.
{
  "_type": "language",
  "_id": "ent-lang-1",
  "pkg_id": "pkg-lang-go",
  "name": "Go",
  "confidence": 0.92,
  "salience": 0.80,
  "mentions": [
    {
      "blockKey": "b1",
      "startChar": 30,
      "endChar": 32,
      "text": "Go"
    }
  ],
  "metadata": {
    "about": "An open-source programming language",
    "paradigm": ["concurrent", "imperative", "compiled"],
    "firstAppeared": 2009,
    "website": "https://go.dev"
  }
}
Technology
Represents a technology, framework, library, or tool that is not a full programming language.
{
  "_type": "technology",
  "_id": "ent-tech-1",
  "pkg_id": "pkg-tech-kubernetes",
  "name": "Kubernetes",
  "confidence": 0.98,
  "salience": 0.90,
  "mentions": [
    {
      "blockKey": "b1",
      "startChar": 15,
      "endChar": 25,
      "text": "Kubernetes"
    }
  ],
  "metadata": {
    "about": "Open-source container orchestration platform",
    "category": "infrastructure",
    "website": "https://kubernetes.io",
    "github": "https://github.com/kubernetes/kubernetes",
    "relatedTechnologies": [
      { "name": "Docker", "pkg_id": "pkg-tech-docker" },
      { "name": "Helm", "pkg_id": "pkg-tech-helm" }
    ]
  }
}
Compatibility
New entity types will be added over time. Consumers of the Content Lake will periodically need to update their clients to support new entities. Packt provides clients for common languages and we recommend using one of those where possible.
Save time
Not all clients will be interested in all entity types. You can make your parsers more efficient by only enabling entities that you are interested in.
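A minimal sketch of that kind of filtering, assuming the overlay has already been parsed into a dictionary. The function name and the `enabled_types` parameter are hypothetical and not part of a Packt client.

```python
def select_entities(overlay: dict, enabled_types) -> list:
    """Keep only the entities whose _type the consumer has opted in to."""
    enabled = frozenset(enabled_types)
    return [
        entity
        for entity in overlay.get("entities", [])
        if entity.get("_type") in enabled
    ]
```

For example, `select_entities(overlay, {"github", "language"})` would drop Person and Technology entities before any further parsing takes place.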
New entity types will not break compatibility. All entities have `name` and `mentions[].text` fields which you can rely on.
When building parsers, if you do not recognise the entity type, use the `name` field for display and `mentions[].text` for the matched content, then move on. For example, if a new `community` entity type is added:
{
  "_type": "community",
  "_id": "ent-comm-1",
  "name": "r/golang",
  "mentions": [
    {
      "blockKey": "b6",
      "startChar": 22,
      "endChar": 30,
      "text": "r/golang"
    }
  ],
  "metadata": {
    "platform": "reddit",
    "url": "https://www.reddit.com/r/golang",
    "subscribers": 245000
  }
}
Consumers that do not understand `community` entities skip them and continue processing normally.
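One way to structure a parser around this guarantee is a dispatch table keyed on `_type`, with a generic fallback that relies only on `name` and `mentions[].text`. The handler names and the output strings below are hypothetical and chosen purely for illustration.

```python
def describe_github(entity: dict) -> str:
    """Type-specific handler that reads GitHub metadata when present."""
    stars = entity.get("metadata", {}).get("stats", {}).get("stars")
    return f"{entity['name']} (GitHub repository, {stars} stars)"


def describe_generic(entity: dict) -> str:
    """Fallback that uses only the fields every entity provides."""
    texts = [mention["text"] for mention in entity.get("mentions", [])]
    return f"{entity['name']} (mentioned as: {', '.join(texts)})"


# Entity types this consumer understands. Anything else falls through.
HANDLERS = {"github": describe_github}


def describe(entity: dict) -> str:
    handler = HANDLERS.get(entity.get("_type"), describe_generic)
    return handler(entity)


community = {
    "_type": "community",
    "name": "r/golang",
    "mentions": [{"blockKey": "b6", "text": "r/golang"}],
    "metadata": {"platform": "reddit"},
}
print(describe(community))  # prints "r/golang (mentioned as: r/golang)"
```

Adding support for a new entity type later means registering one more handler. Until then the fallback keeps the parser working on overlays that contain it.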
Roadmap
We plan to build the following entities:
- Community (representing a group like a Reddit board)
If you are interested in any other entity types or would like us to prioritise one of those already planned, please reach out to a member of the Packt Tech team.