Ingestion

The Packt Content Lake is designed to scale to billions of documents from thousands of sources¹. Documents can be as large as 10,000-page manuscripts or as small as individual GitHub issues. Even small pieces of context can add significant value; for example, a comment on a two-sentence GitHub issue can tell you whether a maintainer agrees or disagrees with the direction of thinking. That can then feed into their future recommended best practices.

Having said that, Packt cannot build and maintain an ingestion connector for every possible data source. We are therefore taking the following approaches:

We will build open formatters² which are source-agnostic and convert standard document formats into the Packt Content Format.
We will build and maintain connectors for sources¹ which are commercially significant.

There are no upper bounds on either of these dimensions and the numbers will grow over time. Our current beta support includes:

Sources: SharePoint, Google Docs, Medium, WordPress, Substack.
Formatters: PDF, Word, Markdown, ePub.

If you are a Packt employee, please let the tech team know about any other sources or formatters you would like, along with information on the potential commercial impact.

Every document must have a Contract

Every document ingested into the Content Lake must be associated with a Contract. There are no exceptions, because the Contract is what makes content usable downstream. If your source does not have explicit licensing terms, you still need a Contract (even if it is a permissive internal-use-only one). Plan for this as part of your ingestion pipeline, not as an afterthought.
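
As a minimal sketch of what "plan for this in your pipeline" could look like, the example below rejects any document that reaches ingestion without a Contract, unless a default is supplied. The `Contract` and `Document` classes, their fields, and the `require_contract` helper are hypothetical names chosen for illustration; this section does not define the real Contract schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contract:
    # Hypothetical fields; the real Contract schema is not defined here.
    contract_id: str
    terms: str  # e.g. "internal-use-only"


@dataclass
class Document:
    source: str
    body: str
    contract: Optional[Contract] = None


class MissingContractError(ValueError):
    """Raised when a document reaches ingestion without a Contract."""


def require_contract(doc: Document, default: Optional[Contract] = None) -> Document:
    """Return a document that carries a Contract, or raise if none is available."""
    if doc.contract is not None:
        return doc
    if default is None:
        raise MissingContractError(f"no contract for document from {doc.source!r}")
    # Attach the fallback Contract without mutating the caller's object.
    return Document(source=doc.source, body=doc.body, contract=default)


INTERNAL_ONLY = Contract(contract_id="internal-001", terms="internal-use-only")
doc = require_contract(Document(source="sharepoint", body="Draft chapter"), INTERNAL_ONLY)
print(doc.contract.terms)  # internal-use-only
```

Failing early like this keeps the "no exceptions" rule enforceable at the pipeline boundary rather than relying on downstream checks.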

We want to make it as easy as possible for third parties to build and maintain formatters or sources. Our commitment to you is that:

The Packt Content Format is stable and has compatibility guarantees.
The formatters which our tooling uses are the same ones that you can use in your own pipelines.
We are invested in making more content available to improve the quality of our products, so if you run into any problems with integrations, we are here to help.

The diagram below gives you an overview of how the ingestion process works with chained sources and formatters.

graph LR
    A[YouTube] & B[Udemy] --> C[Video]
    C -- Transcribe --> D[Markdown]
    D --> E[Packt API]

    J[SharePoint] --> D
    K[OneDrive] --> D
    L[Dropbox] --> D
    O[Google Drive] --> D

    F(ePub) --> J
    F --> K
    F --> L
    F --> O
    G(PDF) --> J
    G --> K
    G --> L
    G --> O
    H(Word) --> J
    H --> K
    H --> L
    H --> O

    I(Google Docs) --> E
    M[Medium] --> E
    N[WordPress] --> E
    P[Substack] --> E

    style I fill:#C8E6C9,stroke:#00C853
    style M fill:#C8E6C9,stroke:#00C853
    style N fill:#C8E6C9,stroke:#00C853
    style P fill:#C8E6C9,stroke:#00C853

    style J fill:#FFE0B2,stroke:#FF6D00
    style K fill:#FFE0B2,stroke:#FF6D00
    style L fill:#FFE0B2,stroke:#FF6D00
    style O fill:#FFE0B2,stroke:#FF6D00

Content ingested into the Content Lake will ultimately be text in the Packt Content Format (PCF).

Some sources (shown in green) give us content in a defined format, e.g. the Google Docs API. With these sources there is no intermediary format such as Word, PDF or Markdown; we can reliably extract content from an existing, well-defined schema and write it to the Packt Content Format on the fly.
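
To illustrate the direct path, here is a rough sketch that maps a document shaped like a Google Docs API response (`body.content[].paragraph.elements[].textRun.content`) straight into a record. The output shape (`title` plus a list of `blocks`) and the function name `google_doc_to_pcf` are assumptions made for this example only; the actual Packt Content Format is not specified in this section.

```python
from typing import Any


def google_doc_to_pcf(doc: dict[str, Any]) -> dict[str, Any]:
    """Map a Google Docs API-style document to a hypothetical PCF-like record."""
    blocks = []
    for element in doc.get("body", {}).get("content", []):
        paragraph = element.get("paragraph")
        if paragraph is None:
            continue  # skip non-paragraph elements such as section breaks
        # A paragraph is split into runs; join them back into one string.
        text = "".join(
            run.get("textRun", {}).get("content", "")
            for run in paragraph.get("elements", [])
        ).strip()
        if text:
            blocks.append(text)
    return {"title": doc.get("title", ""), "blocks": blocks}


sample = {
    "title": "Release notes",
    "body": {
        "content": [
            {"sectionBreak": {}},
            {"paragraph": {"elements": [
                {"textRun": {"content": "Hello, "}},
                {"textRun": {"content": "world\n"}},
            ]}},
        ]
    },
}
record = google_doc_to_pcf(sample)
print(record)  # {'title': 'Release notes', 'blocks': ['Hello, world']}
```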

Other sources (shown in orange) are more like third-party file systems which can contain arbitrary document formats. With these sources our first job is to integrate with and walk their file systems. Our second job is to take the documents that are in a format we support and ingest them into the Content Lake. That is why there is a first-stage source integration (in orange) before documents are sent to one of the formatters.
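
The two stages can be sketched roughly as follows: a listing of `(path, bytes)` pairs stands in for walking a remote file system, and each file is dispatched to a formatter by its extension. The `FORMATTERS` registry, the `ingest` function, and the trivial Markdown formatter are all hypothetical; they are not the real Packt connectors or formatters.

```python
from pathlib import PurePosixPath
from typing import Callable, Iterable, Iterator

# A formatter turns raw file bytes into text (a stand-in for PCF output).
Formatter = Callable[[bytes], str]


def markdown_formatter(raw: bytes) -> str:
    """Trivial placeholder: Markdown is already text, so just decode it."""
    return raw.decode("utf-8")


# Hypothetical registry keyed by file extension.
FORMATTERS: dict[str, Formatter] = {".md": markdown_formatter}


def ingest(files: Iterable[tuple[str, bytes]]) -> Iterator[tuple[str, str]]:
    """Stage 2: route each walked file to a formatter, skipping unsupported types."""
    for path, raw in files:
        formatter = FORMATTERS.get(PurePosixPath(path).suffix.lower())
        if formatter is None:
            continue  # no formatter for this format yet
        yield path, formatter(raw)


# Stage 1 would produce a listing like this by walking the source's file system.
listing = [("/books/ch1.md", b"# Chapter 1"), ("/books/cover.psd", b"\x00\x01")]
results = list(ingest(listing))
print(results)  # [('/books/ch1.md', '# Chapter 1')]
```

Keeping the walk separate from the formatter lookup is what lets the same formatter serve SharePoint, OneDrive, Dropbox and Google Drive alike.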

Lastly, we have included examples of how YouTube and Udemy content would be treated: the video would go through a transcription process to turn the video (and/or audio) into text, which is then formatted as Markdown.

Transcoder Tags

When content passes through a formatter, transcoder tags are added (transcoder/<name>, transcoder/version/<version>) so content can be identified for reprocessing when improved formatters are released. For example, if an early PDF formatter had edge cases with table extraction, transcoder tags make it possible to find and reprocess all affected content with a simple tag-filtered query.
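
A minimal sketch of such a tag-filtered query is shown below, using an in-memory list in place of the Content Lake. The tag strings follow the `transcoder/<name>` and `transcoder/version/<version>` pattern described above; the document records, version numbers, and helper names are made up for illustration and do not reflect the real query API.

```python
def transcoder_tags(name: str, version: str) -> set[str]:
    """Build the two transcoder tags a formatter would attach."""
    return {f"transcoder/{name}", f"transcoder/version/{version}"}


# Hypothetical documents, each carrying the tags added at ingestion time.
docs = [
    {"id": "doc-1", "tags": transcoder_tags("pdf", "0.1.0")},
    {"id": "doc-2", "tags": transcoder_tags("pdf", "0.2.0")},
    {"id": "doc-3", "tags": transcoder_tags("word", "0.1.0")},
]


def needs_reprocessing(documents: list[dict], name: str, version: str) -> list[str]:
    """Return ids of documents produced by the given formatter name and version."""
    wanted = transcoder_tags(name, version)
    # A document matches only if it carries both tags.
    return [d["id"] for d in documents if wanted <= d["tags"]]


affected = needs_reprocessing(docs, "pdf", "0.1.0")
print(affected)  # ['doc-1']
```

Matching on both tags matters here: filtering on the version tag alone would also pick up `doc-3`, which came from a different formatter.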

See Tagging for more details on tag levels.

¹ Sources: Where documents can be found (e.g. SharePoint, Dropbox, OneDrive, Google Docs, Medium, WordPress, Substack).
² Formatters: Something which converts one document type (e.g. PDF, Markdown or Word) into another, such as the Packt Content Format (PCF).