Ingestion
The Packt Content Lake is designed to scale to billions of documents from thousands of sources1. Documents can be as big as 10,000-page manuscripts or as small as individual GitHub issues. Even small bits of context can add significant value; for example, a reaction on a two-sentence GitHub issue can tell you whether a maintainer agrees or disagrees with the direction of thinking. That can then feed into their future recommended best practices.
Having said that, Packt cannot build and maintain an ingestion connector for every possible data source. We are therefore taking the following approaches:
- We will build open formatters2 which are source agnostic and convert standard document formats into the Packt Content Format.
- We will build and maintain connectors for sources1 which are commercially significant.
There are no upper bounds on either of these dimensions and the numbers will grow over time. Our current beta support includes:
- Sources: SharePoint, Google Docs, Medium, WordPress, Substack.
- Formatters: PDF, Word, Markdown, ePub.
If you are a Packt employee, please let the tech team know about any other sources or formatters you would like supported, along with information on the potential commercial impact.
Every document must have a Contract
Every document ingested into the Content Lake must be associated with a Contract. There are no exceptions — the contract is what makes content usable downstream. If your source does not have explicit licensing terms, you still need a contract (even if it is a permissive internal-use-only one). Plan for this as part of your ingestion pipeline, not as an afterthought.
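One way to plan for this in a pipeline is to refuse any document that reaches the ingestion step without a contract, and to attach a default contract only when one has been chosen deliberately. The sketch below is illustrative: the `Contract` and `Document` classes, their fields, and the `require_contract` helper are hypothetical names, not part of any Packt API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contract:
    """Hypothetical contract record; the real contract schema is not shown here."""
    contract_id: str
    terms: str  # for example "internal-use-only"


@dataclass
class Document:
    """Hypothetical document waiting to be ingested."""
    source: str
    title: str
    contract: Optional[Contract] = None


class MissingContractError(ValueError):
    """Raised when a document reaches ingestion without a contract."""


def require_contract(doc: Document, default: Optional[Contract] = None) -> Document:
    """Return the document with a contract attached, or refuse it.

    If the document has no contract and a default is supplied (for example a
    permissive internal-use-only contract), the default is attached.
    """
    if doc.contract is None:
        if default is None:
            raise MissingContractError(
                f"{doc.source}:{doc.title} has no contract and no default was given"
            )
        doc.contract = default
    return doc


# Assumed default for sources without explicit licensing terms.
INTERNAL_ONLY = Contract(contract_id="internal-default", terms="internal-use-only")
```

Running this check before any formatter is invoked keeps the contract decision at the start of the pipeline rather than leaving it to be patched in later.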
We want to make it as easy as possible for third parties to build and maintain formatters or sources. Our commitment to you is that:
- The Packt Content Format is stable and has compatibility guarantees.
- The formatters which our tooling uses are the same ones that you can use in your own pipelines.
- We are invested in making more content available to improve the quality of our products, so if you run into any problems with integrations, we are here to help.
The diagram below gives you an overview of how the ingestion process works with chained sources and formatters.
```mermaid
graph LR
A[YouTube] & B[Udemy] --> C[Video]
C -- Transcribe --> D[Markdown]
D --> E[Packt API]
J[SharePoint] --> D
K[OneDrive] --> D
L[Dropbox] --> D
O[Google Drive] --> D
F(ePub) --> J
F --> K
F --> L
F --> O
G(PDF) --> J
G --> K
G --> L
G --> O
H(Word) --> J
H --> K
H --> L
H --> O
I(Google Docs) --> E
M[Medium] --> E
N[WordPress] --> E
P[Substack] --> E
style I fill:#C8E6C9,stroke:#00C853
style M fill:#C8E6C9,stroke:#00C853
style N fill:#C8E6C9,stroke:#00C853
style P fill:#C8E6C9,stroke:#00C853
style J fill:#FFE0B2,stroke:#FF6D00
style K fill:#FFE0B2,stroke:#FF6D00
style L fill:#FFE0B2,stroke:#FF6D00
style O fill:#FFE0B2,stroke:#FF6D00
```
Content ingested into the Content Lake will ultimately be text in the Packt Content Format (PCF).
Some sources (shown in green) give us content in a defined format, e.g. the Google Docs API. With these sources we can reliably extract content in an existing, well-defined schema and write it to the Packt Content Format on the fly, with no intermediary format such as Word, PDF or Markdown.
Other sources (shown in orange) are more like third-party file systems, which can contain arbitrary document formats. With these sources our first job is to integrate with the source and walk its file system. Our second job is to take the documents that are in a format we support and ingest them into the Content Lake. That is why the diagram shows a source integration stage (in orange) before documents are sent to one of the formatters.
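The two jobs for file-system-like sources can be sketched as a walk followed by a dispatch on file extension. This is a minimal illustration under stated assumptions: a local directory stands in for the remote file system, the `FORMATTERS` registry and `walk_and_format` function are hypothetical names, and only a trivial Markdown formatter is registered because real PDF, Word and ePub formatters need dedicated parsers.

```python
from pathlib import Path
from typing import Callable, Dict, Iterator, Tuple

# A formatter takes raw file bytes and returns text. Real formatters would
# emit the Packt Content Format; this stand-in simply decodes the bytes.
Formatter = Callable[[bytes], str]


def markdown_formatter(data: bytes) -> str:
    """Placeholder formatter: returns the Markdown source unchanged."""
    return data.decode("utf-8")


# Hypothetical registry keyed by lowercase file extension.
FORMATTERS: Dict[str, Formatter] = {
    ".md": markdown_formatter,
}


def walk_and_format(
    root: Path, formatters: Dict[str, Formatter]
) -> Iterator[Tuple[Path, str]]:
    """Walk a file tree and format every file that has a registered formatter.

    Job one: enumerate the files under the source root.
    Job two: hand supported files to the matching formatter and skip the rest.
    """
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        formatter = formatters.get(path.suffix.lower())
        if formatter is None:
            continue  # unsupported format: nothing to ingest
        yield path, formatter(path.read_bytes())
```

Keeping the walk separate from the registry means a new formatter can be supported by adding one entry, without touching the source integration.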
Lastly, we have included examples of how YouTube and Udemy content would be treated: the video would go through a transcription process that turns the video (and/or audio) into text, which is then formatted as Markdown.
Transcoder Tags
When content passes through a formatter, transcoder tags are
added (transcoder/<name>, transcoder/version/<version>) so
content can be identified for reprocessing when improved
formatters are released. For example, if an early PDF formatter
had edge cases with table extraction, transcoder tags make it
possible to find and reprocess all affected content with a simple
tag-filtered query.
See Tagging for more details on tag levels.
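As a rough illustration of the tag-filtered query described above, the sketch below builds the two transcoder tags for a formatter and selects the documents produced by affected versions. The tag shapes come from this section; the in-memory `catalog` dictionary, the function names, and the version strings are assumptions made for the example and do not represent the Content Lake query API.

```python
from typing import Dict, Iterable, List, Set


def transcoder_tags(name: str, version: str) -> Set[str]:
    """Build the two tags added when content passes through a formatter."""
    return {f"transcoder/{name}", f"transcoder/version/{version}"}


def needs_reprocessing(
    tags: Set[str], name: str, affected_versions: Iterable[str]
) -> bool:
    """True if the content came from the named formatter at an affected version."""
    if f"transcoder/{name}" not in tags:
        return False
    return any(f"transcoder/version/{v}" in tags for v in affected_versions)


def find_for_reprocessing(
    catalog: Dict[str, Set[str]], name: str, affected_versions: Iterable[str]
) -> List[str]:
    """Return the sorted ids of documents that should be reprocessed."""
    versions = list(affected_versions)  # allow a generator to be passed in
    return sorted(
        doc_id
        for doc_id, tags in catalog.items()
        if needs_reprocessing(tags, name, versions)
    )


# Hypothetical catalog mapping document ids to their tags.
catalog = {
    "doc-a": transcoder_tags("pdf", "1.0.0"),
    "doc-b": transcoder_tags("pdf", "1.1.0"),
    "doc-c": transcoder_tags("word", "1.0.0"),
}
stale = find_for_reprocessing(catalog, "pdf", ["1.0.0"])
```

Here `stale` contains only `doc-a`: `doc-b` was produced by a later PDF formatter version, and `doc-c` shares the version string but came from a different formatter.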