April 7: Support for Discord feeds, Cohere reranking, section-aware chunking and retrieval
New Features
- 💡 Graphlit now supports Discord feeds. By connecting to a Discord channel and providing a bot token, you can ingest all Discord messages and file attachments. 
- 💡 Graphlit now supports Cohere reranking after content retrieval in the RAG pipeline. You can optionally use the Cohere rerank model to rerank the semantic search results before they are provided as context to the LLM. 
- Added support for section-aware text chunking and retrieval. Now, when using section-aware document preparation, such as Azure AI Document Intelligence, Graphlit will store the extracted text according to the semantic chunks (i.e. sections). The text for each section will be individually chunked and embedded into the vector index. 
- Added support for `retrievalStrategy` in the Specification type. Graphlit now supports `CHUNK`, `SECTION` and `CONTENT` retrieval strategies. Chunk retrieval will use the search hit chunk; section retrieval will expand the search hit chunk to the containing section (or page, if not using section-aware preparation); content retrieval will expand the search hit chunk to the text of the entire document.
- Added support for `rerankingStrategy` in the Specification type. You can now configure the reranking of content sources, using the Cohere reranking model, by assigning `serviceType` to `COHERE`. More reranking models are planned for the future.
- Added `isSynchronous` flag to content ingestion mutations, such as `ingestUri`, so the mutation will wait for the content to complete the ingestion workflow (or error) before returning. This is useful for utilizing the API in a Jupyter notebook or Streamlit application, in a synchronous manner without polling.
- Added `includeAttachments` flag to SlackFeedProperties. When enabled, Graphlit will automatically ingest any attachments within Slack messages.
- ⚡ Added `ingestUri` mutation to replace the now-deprecated `ingestPage` and `ingestFile` mutations. We had seen confusion about when to use one vs. the other; now, for any URI, whether it is a web page or a hosted PDF, you can pass it to `ingestUri`, and we will infer the correct content ingestion workflow.
- ⚡ Removed `includeSummaries` from the ConversationStrategyInput type. This will be re-added in the future as part of the retrieval strategy.
- ⚡ Deprecated `enableExpandedRetrieval` in the ConversationStrategyInput type. This is now handled by setting `strategyType` to `SECTION` or `CONTENT` in the RetrievalStrategyInput type.
- ⚡ Moved `contentLimit` from the ConversationStrategyInput type to the RetrievalStrategyInput type. You can optionally assign `contentLimit` on `retrievalStrategy`, which limits the number of content sources leveraged in the LLM prompt context. (Default is 100.)
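To make the new retrieval and reranking settings concrete, here is a minimal sketch of the `variables` payload you might send to a specification mutation. The field names (`retrievalStrategy`, `strategyType`, `contentLimit`, `rerankingStrategy`, `serviceType`) come from these notes; everything else (the specification name, the exact mutation and payload shape) is an illustrative assumption, so check the Graphlit API reference for the authoritative schema.

```python
import json

# Hypothetical variables payload for a specification mutation, combining the
# new fields described above. Field names are from these release notes; the
# surrounding structure is an assumption for illustration only.
specification = {
    "name": "Section-aware RAG",        # illustrative name
    "retrievalStrategy": {
        "strategyType": "SECTION",      # CHUNK | SECTION | CONTENT
        "contentLimit": 25,             # optional; defaults to 100
    },
    "rerankingStrategy": {
        "serviceType": "COHERE",        # rerank retrieved sources with Cohere
    },
}

payload = json.dumps({"specification": specification})
```

With `SECTION` retrieval, each search hit is expanded to its containing section before being passed (and optionally reranked) as LLM context.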
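Similarly, a synchronous ingest call might be built like the sketch below. The `ingestUri` mutation name and the `isSynchronous` flag come from these notes; the GraphQL variable types and selected response fields are assumptions, so verify them against the Graphlit schema before use.

```python
# Sketch of a synchronous ingestUri request. Only the mutation name and the
# isSynchronous flag are confirmed by the release notes; variable types and
# response fields here are illustrative assumptions.
INGEST_URI = """
mutation IngestUri($uri: String!, $isSynchronous: Boolean) {
  ingestUri(uri: $uri, isSynchronous: $isSynchronous) {
    id
    state
  }
}
"""

def build_ingest_request(uri: str) -> dict:
    # With isSynchronous: true, the mutation returns only after the ingestion
    # workflow completes (or errors), so no polling loop is needed, which is
    # convenient in a Jupyter notebook or Streamlit app.
    return {
        "query": INGEST_URI,
        "variables": {"uri": uri, "isSynchronous": True},
    }

request = build_ingest_request("https://example.com/whitepaper.pdf")
```

The resulting dict can be POSTed as JSON to the Graphlit GraphQL endpoint with your usual HTTP client and auth headers.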
Bugs Fixed
- GPLA-2469: Failed to ingest PDF hosted on GitHub 
- GPLA-2390: Claude 3 Haiku not adhering to JSON schema 
- GPLA-2474: Prompt rewriting should ignore formatting instructions in prompt 
- GPLA-2462: Missing line break after table rows 
- GPLA-2417: Not extracting images from PPTX correctly 