January 18: Support for content publishing, LLM tools, CLIP image embeddings, bug fixes

New Features

  • Added support for CLIP image embeddings using Roboflow, which can be used for similar image search. When you search for content by similarity, we now use the content's text and/or image embeddings to find similar content.

  • Added support for dynamic web page ingestion. Graphlit now navigates to and automatically scrolls web pages using Browserless.io, so we capture the fully rendered HTML before extracting text. We also now support web page screenshots, if enabled with the enableImageAnalysis property in the preparation workflow (see the configuration sketch after this list). These screenshots can be analyzed with multimodal models, such as GPT-4 Vision, or used to create image embeddings for similar image search.

  • Added table parsing when preparing documents. We now store structured (tab-delimited) table text, extracted from documents during the preparation workflow, in the JSON text mezzanine.

  • Added reverse geocoding of lat/long locations found in image or other content metadata. We now store the real-world address with the content metadata, for use in conversations.

  • Added assistant messages to the conversation message history provided to the LLM. Previously we included only user messages; now we format both user and assistant messages into the LLM prompt for conversations.

  • Added a new chunking algorithm for text embeddings. We support semantic chunking at the page or transcript segment level, and now create embeddings from smaller text chunks within each page or segment.

  • Added content metadata to text and image embeddings. To provide better context for the text embeddings, we now include formatted content metadata, with fields such as title, subject, author, and description. For emails, we also include the to, from, cc, and bcc fields.

  • Added helper mutations isContentDone and isFeedDone, which can be used to poll for completion of a single ingested content item or of all content ingested by a feed (see the polling sketch after this list).

  • Added richer image descriptions generated by the GPT-4 Vision model, which now provide more useful detail.

  • Added validation of extracted hyperlinks. Now we test the URIs and remove any inaccessible links during content enrichment.

  • Added deleteContents, deleteFeeds, and deleteConversations mutations for multi-deletion of contents, feeds, or conversations.

  • Added deleteAllContents, deleteAllFeeds, and deleteAllConversations mutations for bulk, filtered deletion of entities. You can delete all the contents, feeds, or conversations in your project, or a filtered subset of those entities (see the deletion sketch after this list).
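
As an illustration of the screenshot option mentioned above, here is a minimal configuration sketch. Only the enableImageAnalysis property name comes from this changelog; the surrounding workflow shape is a hypothetical placeholder, so consult the API reference for the real schema.

```typescript
// Hypothetical sketch: only `enableImageAnalysis` is named in this changelog;
// the workflow object shape around it is illustrative, not the actual schema.
const preparationWorkflow = {
  name: "Web ingestion with screenshots",
  preparation: {
    // Capture a screenshot of the fully rendered page so it can be analyzed
    // by a multimodal model or embedded for similar image search.
    enableImageAnalysis: true,
  },
};
```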
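
The completion helpers lend themselves to a simple polling sketch, shown below. The changelog names the operations but not their schema, so the endpoint URL, auth scheme, $id argument, and result response field are all assumptions here.

```typescript
// Polling sketch for isContentDone. Assumed (not from this changelog): the
// endpoint URL, the bearer-token auth, the `$id` argument, and the `result`
// response field.
async function graphql<T>(query: string, variables: object): Promise<T> {
  const response = await fetch(process.env.GRAPHLIT_URL!, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.GRAPHLIT_TOKEN!}`,
    },
    body: JSON.stringify({ query, variables }),
  });
  const { data } = await response.json();
  return data as T;
}

// Wait until the given content has finished ingestion and preparation.
async function waitForContent(id: string): Promise<void> {
  const operation = `
    mutation IsContentDone($id: ID!) {
      isContentDone(id: $id) { result }
    }`;
  for (;;) {
    const done = await graphql<{ isContentDone: { result: boolean } }>(operation, { id });
    if (done.isContentDone.result) return;
    await new Promise((resolve) => setTimeout(resolve, 2000)); // poll every 2s
  }
}
```

isFeedDone would follow the same pattern, keyed by a feed identifier instead of a content identifier.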
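
Finally, a deletion sketch for the filtered bulk mutations. Only the deleteAllContents mutation name comes from this changelog; the ContentFilter type, the feeds filter field, and the returned id field are hypothetical placeholders.

```typescript
// Filtered bulk deletion sketch. Assumed (not from this changelog): the
// `ContentFilter` type, the `feeds` filter field, and the returned `id` field.
const operation = `
  mutation DeleteAllContents($filter: ContentFilter) {
    deleteAllContents(filter: $filter) { id }
  }`;

// Restrict deletion to the contents ingested by a single feed; an empty
// filter would select every content in the project.
const variables = { filter: { feeds: [{ id: "HYPOTHETICAL-FEED-ID" }] } };

const response = await fetch(process.env.GRAPHLIT_URL!, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.GRAPHLIT_TOKEN!}`,
  },
  body: JSON.stringify({ query: operation, variables }),
});
console.log(await response.json()); // expect the ids of the deleted contents
```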

Bugs Fixed

  • GPLA-1846: Parse Markdown headings into mezzanine JSON

  • GPLA-1779: Not returning SAS token with mezzanine, master URIs

  • GPLA-1348: Summarize text content, not just file content

  • GPLA-1297: Not assigning content error message on preparation workflow failure
