Versioning

Documents within the Content Lake are versioned. By versions, we mean snapshots of the state of the content at different points in time.

Some users may expect versioning to be more of a live editing experience where each individual letter change can be attributed back to an individual editor (like Google Docs or Office 365). One goal of the Content Lake is to be agnostic and unopinionated about the editing tools that people use to create content; we want to support thousands of content sources. Building complicated, multi-user, concurrent WYSIWYG editors is therefore out of scope for the Content Lake.

For users who want a multi-user live editing experience, keep your editors in their native applications that integrate with the Content Lake. The ingestion tooling will then periodically snapshot the edits and update the Content Lake content to newer versions.

graph TD
    A[Word Online] <-- Near-real-time Sync --> D{Content Lake Sources}
    B[SharePoint] <-- Periodic Sync --> D
    C[Google Docs] <-- Real-time Sync --> D

    D --> E[Content Lake]

Some sources will be real-time, whereas others may be near-real-time or periodic, depending on the nature of the source. For example:

Google Docs provides webhooks, so we can get real-time notifications for changes. Google batches edits into 5-30 minute groups (otherwise every letter you typed would technically be a new edit), so as soon as you finish editing we'll receive a notification and process the updates immediately.
Substack doesn't provide an easy-to-consume API, so we'll have to periodically recrawl the Substack site to get the latest version of Content. This means that Content from Substack is updated periodically rather than in real time. Some sources will let users manually trigger a refresh where it's possible to support it.

Each version of a document has:

A version ID
A checksum
A timestamp
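
As an illustration of the three fields above, here is a minimal sketch of a version record. The class name, field names, and the choice of SHA-256 for the checksum are assumptions made for this example; the Content Lake's actual schema and checksum algorithm are not specified here.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DocumentVersion:
    """Hypothetical record holding the three attributes of a version."""
    version_id: str      # e.g. "d8017f1b7298"
    checksum: str        # digest of the content (SHA-256 is an assumption)
    timestamp: datetime  # when the snapshot was taken (UTC)


def snapshot(version_id: str, content: bytes) -> DocumentVersion:
    """Build a version record for a snapshot of the given content."""
    return DocumentVersion(
        version_id=version_id,
        checksum=hashlib.sha256(content).hexdigest(),
        timestamp=datetime.now(timezone.utc),
    )
```

Under these assumptions, two snapshots of identical content share a checksum, which is one way an ingestion job could detect that a source has not changed.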

All of the Content Lake API endpoints accept a version parameter whenever fetching Content. If you don't specify a version then the latest will be returned.

Take an example Document b5e3605e-9b67-405b-b481-ef6971f4ce03 with one version, d8017f1b7298. The request:

GET /v1/content/b5e3605e-9b67-405b-b481-ef6971f4ce03

Is equivalent to:

GET /v1/content/b5e3605e-9b67-405b-b481-ef6971f4ce03/latest

Or:

GET /v1/content/b5e3605e-9b67-405b-b481-ef6971f4ce03/d8017f1b7298
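
The equivalence above can be sketched as a small client-side helper. The function name `content_path` is hypothetical; only the `/v1/content/...` paths come from the examples above. The helper always emits an explicit version segment and falls back to `latest` when none is given.

```python
from typing import Optional

BASE_PATH = "/v1/content"


def content_path(document_id: str, version: Optional[str] = None) -> str:
    """Return the request path for a Document.

    When no version is given, the path targets "latest", which the
    examples above describe as equivalent to omitting the version.
    """
    return f"{BASE_PATH}/{document_id}/{version or 'latest'}"
```

For the example Document, `content_path("b5e3605e-9b67-405b-b481-ef6971f4ce03")` and `content_path("b5e3605e-9b67-405b-b481-ef6971f4ce03", "d8017f1b7298")` would address the same content while d8017f1b7298 is the only version.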

Compaction

The Content Lake doesn't support an unlimited number of versions. The Content Lake is designed to blend Publishing needs with Content discovery, and supporting an infinite number of versions of each Document would create a cost burden we can't support. We've therefore designed a two-pronged compaction system that causes zero disruption to any Products built using the Content Lake whilst maintaining a reasonable level of cost efficiency. Our goal is to preserve Content which has been used, or is likely to be used, in Products we publish.

Source Documents remain intact

This is only about our internal storage of Documents within the Content Lake. If you're importing Documents from other places, those original Documents remain available to you; for example you can still traverse the whole history of the Word document in Office 365 or Google Doc within those applications. The Content Lake never makes any changes to source Documents.

For each Document in the Content Lake we will retain:

The last 10 unpublished versions of the Document.
- If any of the prior versions have been used in a Product then they aren't counted towards this limit of 10.

Examples

Example One: A Document has 10 versions which have never been included in a Product. If you created an 11th version, then the first version would be cleaned up shortly after, as it's the oldest unused version.

Example Two: A Document has 15 versions, 6 of which have been used in Products. If you created a 16th version, none of the old versions would be affected: the 6 used versions don't count towards the cap, so only 10 unused versions remain and you're still within the 10-version limit.
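
The two examples can be checked with a minimal sketch of the cap rule. The `Version` class, the `versions_to_clean_up` function, and the oldest-first ordering of `history` are assumptions for illustration, not the Content Lake's actual implementation.

```python
from dataclasses import dataclass
from typing import List

UNUSED_VERSION_CAP = 10  # the 10-version limit described above


@dataclass(frozen=True)
class Version:
    """Hypothetical version record; only the 'used' flag matters here."""
    version_id: str
    used_in_product: bool


def versions_to_clean_up(
    history: List[Version], cap: int = UNUSED_VERSION_CAP
) -> List[Version]:
    """Return the unused versions that exceed the cap, oldest first.

    `history` is assumed to be ordered oldest to newest. Versions used
    in a Product are never returned and never count towards the cap.
    """
    unused = [v for v in history if not v.used_in_product]
    excess = len(unused) - cap
    return unused[:excess] if excess > 0 else []
```

With 11 unused versions the function returns only the first version (Example One). With 16 versions of which 6 are used it returns an empty list (Example Two).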

Additionally, from time to time, we may compact chains of unpublished versions that have not been used in Products, to optimise our storage.

graph LR
    A[Version A] --> B[Version B]
    B --> C[Version C]
    C --> D[Version D]
    D --> E[Version E]
    E --> F[Version F]

    style A fill:#C8E6C9,stroke:#00C853
    style F fill:#C8E6C9,stroke:#00C853

In the diagram above, Version A was the original version of the Content and was then used in a Product. The Content was then updated to versions B, C, D, E and eventually F. Version F was then used in a Product.

The Content Lake will periodically compact the history of unused versions between versions released in Products. In the above example, versions B, C, D and E weren't used and are now superseded by F, so they are liable to be cleaned up.

graph LR
    A[Version A] --> F[Version F]

    style A fill:#C8E6C9,stroke:#00C853
    style F fill:#C8E6C9,stroke:#00C853

We do not make any guarantees about how often these compactions will run as it is an internal maintenance event.
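
The chained compaction shown in the diagrams can be sketched as follows. This is a hedged illustration only: the `Version` class and `compact_chain` function are hypothetical, and the sketch assumes that only unused versions lying strictly between two Product-used versions are removed, leaving any others to the 10-version cap.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Version:
    """Hypothetical version record; only the 'used' flag matters here."""
    version_id: str
    used_in_product: bool


def compact_chain(history: List[Version]) -> List[Version]:
    """Drop unused versions that sit between Product-used versions.

    `history` is assumed to be ordered oldest to newest. Used versions
    are always kept. Unused versions before the first used version or
    after the last one are left untouched by this sketch.
    """
    used_positions = [i for i, v in enumerate(history) if v.used_in_product]
    if len(used_positions) < 2:
        return list(history)  # nothing lies between two used versions
    first, last = used_positions[0], used_positions[-1]
    return [
        v
        for i, v in enumerate(history)
        if v.used_in_product or i < first or i > last
    ]
```

Applied to the chain A to F above, with A and F used in Products, the sketch keeps only A and F.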

If a version of Content has been used in a Product, that version will never be deleted: it does not count towards the 10-version cap and is never removed by compaction.