January 18: Support for content publishing, LLM tools, CLIP image embeddings, bug fixes
New Features
Added support for CLIP image embeddings using Roboflow, which can be used for similar image search. When you search for contents by similar contents, we now use the content's text and/or image embeddings to find similar content.
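As a rough illustration of how embedding-based similar content search works in general, the sketch below ranks stored embeddings against a query embedding. It assumes cosine similarity as the metric, which these notes do not specify, and the function names and sample vectors are hypothetical rather than part of the Graphlit API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def rank_similar(query_embedding, candidates, top_k=3):
    """Return up to top_k (content_id, score) pairs, most similar first.

    candidates maps a hypothetical content ID to its embedding vector.
    """
    scored = [
        (content_id, cosine_similarity(query_embedding, embedding))
        for content_id, embedding in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Real CLIP embeddings have hundreds of dimensions, and a production system would typically use a vector index rather than a linear scan.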
Added support for dynamic web page ingestion. Graphlit now navigates to and automatically scrolls web pages using Browserless.io, so we capture the fully rendered HTML before extracting text. We also now support web page screenshots, if enabled with the enableImageAnalysis property in the preparation workflow. These screenshots can be analyzed with multimodal models, such as GPT-4 Vision, or can be used to create image embeddings for similar image search.
Added table parsing when preparing documents. We now store structured (tab-delimited) text, extracted from documents in the preparation workflow, in the JSON text mezzanine.
Added reverse geocoding of lat/long locations found in image or other content metadata. We now store the real-world address with the content metadata, for use in conversations.
Added assistant messages to the conversation message history provided to the LLM. Originally we had included only user messages, but now we are formatting both user and assistant messages into the LLM prompt for conversations.
Added new chunking algorithm for text embeddings. We support semantic chunking at the page or transcript segment level, and now create embeddings from smaller text chunks within each page or segment.
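To make the idea of smaller chunks per page concrete, here is a minimal sketch that splits each page into fixed-size word windows. The word-count limit, the function name, and the output shape are assumptions for illustration only; the actual chunking algorithm is not described in these notes and may split on different boundaries.

```python
def chunk_pages(pages, max_words=50):
    """Split each page's text into chunks of at most max_words words.

    pages is a list of page texts; chunks never span a page boundary,
    so each chunk keeps a reference to the page it came from.
    """
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        words = text.split()
        for start in range(0, len(words), max_words):
            chunk_text = " ".join(words[start:start + max_words])
            chunks.append({"page": page_number, "text": chunk_text})
    return chunks
```

Keeping the page number on each chunk means a match on a small chunk can still be traced back to its page or segment.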
Added content metadata to text and image embeddings. To provide better context for the text embeddings, we now include formatted content metadata, which includes fields like title, subject, author, or description. For emails, we include to, from, cc, and bcc fields.
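One way to picture this is as a short metadata header prepended to each chunk before it is embedded. The sketch below is a hypothetical formatter: the field names follow the list above, but the "field: value" layout and the function name are assumptions, not the format Graphlit actually uses.

```python
def format_embedding_text(chunk_text, metadata):
    """Prepend non-empty metadata fields to the chunk text, one per line."""
    # Field order is an assumption; email fields come after the general ones.
    field_order = ["title", "subject", "author", "description",
                   "to", "from", "cc", "bcc"]
    lines = []
    for field in field_order:
        value = metadata.get(field)
        if not value:
            continue  # skip missing or empty fields
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)  # e.g. multiple email recipients
        lines.append(f"{field}: {value}")
    lines.append(chunk_text)
    return "\n".join(lines)
```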
Added helper mutations isContentDone and isFeedDone, which can be used for polling completion of ingested content, or all content ingested by a feed.
Added richer image descriptions generated by the GPT-4 Vision model. Now these provide more useful detail.
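The polling pattern that isContentDone and isFeedDone are meant for can be sketched as a generic wait loop. The helper below is hypothetical and takes the check as a callable, since the exact request and response shape of these operations is not given in these notes; the timeout and interval defaults are arbitrary.

```python
import time


def wait_until_done(is_done, timeout_seconds=300.0, interval_seconds=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll is_done() until it returns True or the timeout elapses.

    is_done is a caller-supplied function, for example one that calls
    isContentDone for a content ID through your own GraphQL client.
    Returns True if the work finished, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

Passing the check in as a callable keeps the loop independent of any particular client library, and the same helper works for a feed by wrapping isFeedDone instead.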
Added validation of extracted hyperlinks. Now we test the URIs and remove any inaccessible links during content enrichment.
Added deleteContents, deleteFeeds, and deleteConversations mutations for multi-deletion of contents, feeds or conversations.
Added deleteAllContents, deleteAllFeeds, and deleteAllConversations mutations for bulk, filtered deletion of entities. You can delete all your contents, feeds, or conversations in your project, or a filtered subset of those entities.
Bugs Fixed
GPLA-1846: Parse Markdown headings into mezzanine JSON
GPLA-1779: Not returning SAS token with mezzanine, master URIs
GPLA-1348: Summarize text content, not just file content
GPLA-1297: Not assigning content error message on preparation workflow failure