Content Lake Use Cases
This guide walks through common developer scenarios for the Content Lake. Each section introduces a problem and explains how the Content Lake addresses it.
Ingesting Content from a New Source
A developer needs to bring content into the Content Lake from a platform that is not supported out of the box — perhaps an internal CMS, a partner's API, or a niche publishing tool.
The Content Lake separates sources (where documents live) from formatters (how document types are converted to PCF). A new connector typically needs to implement only one side: either a source adapter that walks a remote file system and hands files to existing formatters, or a new formatter that converts an unsupported document type to PCF.
Every document must be associated with a Contract. There are no exceptions. If your source does not have explicit licensing terms, you still need a contract (even if it is a permissive internal-use-only one). Plan for this as part of your ingestion pipeline, not as an afterthought.
Source connectors stamp source tags (source/) during ingestion. These tags carry metadata from the origin system. Downstream consumers can filter on these tags without needing to know anything about the source system's API.
Formatters add transcoder tags (transcoder/<name>, transcoder/version/<version>) so that when a formatter is improved, you can identify and reprocess all content that passed through the old version.
Building Product Experiences with Tags
A developer wants to power a product feature — a curated reading list, a topic page, a "what's new" feed.
Rather than maintaining a hand-curated list of content IDs (which becomes brittle), use tag-based composition. Instead of "this product shows documents A, B, and C", express the rule as "this product shows documents tagged topic/kubernetes and publishing/status/published". Content then flows in and out of the product as tags change, with no code changes.
A benefit of tag-based composition is that contractual permission tags participate in the same filtering. If your product should only show content that permits AI inference, filter on contract/allow/derive-content/ai/inference alongside your topic tags. Rights enforcement happens at query time through the same mechanism as content curation.
Tags are exact string matches, not hierarchical queries. The tag topic/kubernetes does not match a filter for topic/. If you need hierarchical filtering, use multiple tags on the same document (e.g. both topic/kubernetes and topic/container-orchestration).
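The exact-match semantics can be made concrete with a short sketch. The in-memory documents and the `select` helper are illustrative stand-ins for the real query API.

```python
def select(docs, required_tags):
    """Exact-match set filtering: a document qualifies only if it
    carries every required tag as a literal string."""
    required = set(required_tags)
    return [d["id"] for d in docs if required <= set(d["tags"])]

docs = [
    {"id": "a", "tags": ["topic/kubernetes", "publishing/status/published"]},
    {"id": "b", "tags": ["topic/kubernetes", "publishing/status/draft"]},
    {"id": "c", "tags": ["topic/container-orchestration",
                         "publishing/status/published"]},
]

feed = select(docs, ["topic/kubernetes", "publishing/status/published"])
# Only "a" qualifies; "topic/" would match nothing, since tags are
# plain strings, not prefix queries.
```

Adding a contract tag such as contract/allow/derive-content/ai/inference to the required list enforces rights through exactly the same mechanism.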
Searching Content
The Content Lake provides hybrid search that blends lexical matching, semantic search, and knowledge graph context.
- Hybrid (default) — combines precision with recall. Usually what you want.
- Strict keyword match — disables semantic components. Useful for finding specific code snippets or error messages.
- Similar document search — takes up to 10 source documents and finds semantically similar content. Good for "related content" features.
The most important performance consideration is tag filtering. Tag-filtered queries return in ~10-100ms. Keyword search without filtering runs in 1-5 seconds. Always combine search with tag filters when you know the rough scope of content.
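The reason tag filters help so much can be sketched as a two-stage query: narrow by tags first (cheap), then run the expensive text match over the smaller candidate set. The function and document shape are illustrative assumptions, not the search API.

```python
def search(docs, query, tag_filters=()):
    """Tag-narrow first, then text-match the remaining candidates."""
    required = set(tag_filters)
    # Stage 1: cheap, indexed tag filtering shrinks the corpus.
    candidates = [d for d in docs if required <= set(d["tags"])]
    # Stage 2: the expensive lexical scan runs on far fewer documents.
    return [d["id"] for d in candidates if query.lower() in d["text"].lower()]

docs = [
    {"id": "k1", "tags": ["topic/kubernetes"], "text": "Pod scheduling errors"},
    {"id": "k2", "tags": ["topic/kubernetes"], "text": "Service mesh basics"},
    {"id": "p1", "tags": ["topic/python"], "text": "Task scheduling with asyncio"},
]

hits = search(docs, "scheduling", tag_filters=["topic/kubernetes"])
```

Without the tag filter, "scheduling" would also scan (and match) the Python document; with it, only the Kubernetes-scoped corpus is searched.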
Streaming Content for Real-Time Pipelines
A developer building a text-to-speech pipeline, a live content renderer, or a real-time translation service needs content delivered with minimal latency.
The Content Lake offers a JSONL/NDJSON streaming endpoint that delivers each block as a separate newline-delimited JSON element, enabling sub-100ms time-to-first-byte. Your pipeline can start processing the first block while the rest of the document is still being read.
For streaming pipelines, the abridged endpoint (/v1/content/{id}) usually gives better throughput, since it returns entity IDs only, keeping block payloads small.
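A consumer of the NDJSON stream can be sketched in a few lines. In a real pipeline the `lines` iterable would be the streaming HTTP response body; a list stands in for it here, and the block field names are assumptions.

```python
import json

def iter_blocks(lines):
    """Yield one parsed block per non-empty NDJSON line, so downstream
    processing can begin before the whole document has arrived."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

stream = ['{"seq": 1, "text": "Hello"}', '', '{"seq": 2, "text": "world"}']
texts = [b["text"] for b in iter_blocks(stream)]
```

Because each block is a complete JSON value on its own line, a text-to-speech or rendering stage can start on block 1 while later blocks are still in flight, which is what makes the sub-100ms time-to-first-byte useful.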
Working with Knowledge Graph Entities
Entities live in a separate entity overlay alongside the PCF content. Each entity has a type (GitHub, Person, Language, Technology), an ID, and type-specific metadata. Entity mentions pinpoint exactly where in a document an entity appears — down to the block, start character, and end character.
All entities carry a name field that clients can fall back to if they do not recognise the entity type. Always handle unknown types gracefully — display the name field and continue.
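The fallback pattern looks like this in practice. The renderer strings and entity dicts are illustrative; only the type and name fields come from the description above.

```python
def render_entity(entity):
    """Render known entity types richly; fall back to the guaranteed
    name field for anything unrecognised."""
    if entity["type"] == "Person":
        return f"{entity['name']} (person)"
    if entity["type"] == "Technology":
        return f"{entity['name']} [technology]"
    # Unknown type: never fail, just show the name and continue.
    return entity["name"]

label = render_entity({"type": "SomeFutureType", "name": "Quarkus"})
```

New entity types can then ship on the platform side without breaking existing clients.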
Integrating AI Applications (RAG / MCP)
Before building an AI integration, check contractual permission tags. Content must be tagged contract/allow/derive-content/ai/inference to be used in RAG or MCP contexts. Filtering on this tag at query time ensures your application only retrieves permitted content.
The Content Lake's knowledge graph adds context that LLMs can use to give better answers: repository stats indicate whether a library is actively maintained, author profiles indicate expertise and credibility, and language metadata helps disambiguate terms.
For context window management:
- Abridged endpoint — entity IDs only, small payloads. Good when retrieving many documents for a limited context window.
- Complete endpoint — entities expanded inline. Better when retrieving fewer documents but wanting full context.
Attribution is a first-class concern. When your application generates a response based on Content Lake content, include source attribution.
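The two obligations above (permission filtering at retrieval time, attribution at response time) can be sketched together. The helper names, document shape, and citation format are assumptions for illustration.

```python
ALLOW_AI = "contract/allow/derive-content/ai/inference"

def rag_retrieve(docs, topic_tag):
    """Only contractually permitted, on-topic content enters the context."""
    return [d for d in docs if ALLOW_AI in d["tags"] and topic_tag in d["tags"]]

def attribution(docs):
    """Emit a source line for every document used in a response."""
    return [f"Source: {d['title']} ({d['id']})" for d in docs]

docs = [
    {"id": "d1", "title": "K8s Intro", "tags": [ALLOW_AI, "topic/kubernetes"]},
    {"id": "d2", "title": "K8s Deep Dive", "tags": ["topic/kubernetes"]},
]
context = rag_retrieve(docs, "topic/kubernetes")
citations = attribution(context)
# d2 is on topic but lacks the inference permission, so it never
# reaches the LLM context and never needs a citation.
```

Filtering before retrieval (rather than after generation) means content without the permission tag can never leak into a prompt.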
Creating Derived Content
The Content Lake supports three creation paths:
- System — automatic processing (image descriptions, table analysis, code block analysis).
- Workflows (Zapier) — managed automation for repetitive tasks like translation.
- Product Applications — full custom control for multi-source outputs.
All derived content must track lineage through the common/derived-from metadata tag. This is how the platform preserves attribution and enforces contractual restrictions on derivative works.
Derived content is itself a first-class document with its own contract, tags, and version history. The derivation relationship is metadata, not a structural dependency.
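A sketch of creating derived content under these rules, with the creation function and field names as assumptions; only the common/derived-from tag and the "first-class document" shape come from the text above.

```python
def create_derived(source_doc, body, contract, tags):
    """A derived document is a first-class document with its own
    contract, tags, and version history; the derivation relationship
    is carried only in metadata, not as a structural dependency."""
    return {
        "body": body,
        "contract": contract,       # its own contract, not inherited
        "tags": list(tags),
        "metadata": {"common/derived-from": source_doc["id"]},
    }

summary = create_derived(
    {"id": "doc-42"},
    body="One-paragraph summary of doc-42.",
    contract="internal-use",
    tags=["type/summary"],
)
```

Because the link is metadata, deleting or re-versioning the source does not structurally break the derived document, but the platform can still walk the lineage for attribution and rights checks.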
Managing Content Versions
The Content Lake uses a snapshot model: editing happens in native tools (Word Online, Google Docs), and the Content Lake periodically takes snapshots via source connectors.
For most use cases, fetch the latest version (/v1/content/{id}). Use specific version IDs when you need reproducibility. The Content Lake retains the last 10 unpublished versions plus all versions used in products (never deleted). Old versions not associated with products may be compacted.
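The retention rule can be modeled to make its consequences concrete. This is a simplified sketch (it treats all versions as unpublished and ignores compaction timing), and the function is an assumption, not a platform API.

```python
def versions_to_retain(version_ids, product_versions, keep=10):
    """Model the retention rule: the most recent `keep` versions survive
    compaction, and any version used in a product is never deleted."""
    recent = set(version_ids[-keep:])
    pinned = set(product_versions)
    return [v for v in version_ids if v in recent or v in pinned]

history = [f"v{i}" for i in range(1, 16)]   # v1 .. v15
kept = versions_to_retain(history, product_versions=["v2"])
# v2 survives despite its age because a product uses it; v1 and
# v3-v5 are eligible for compaction.
```

The practical takeaway: if reproducibility matters, pin the version in a product (or record the version ID externally) rather than relying on an old version still being around.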