Real-Time AI and RAG with Ververica
Introduction
Ververica provides a framework for integrating Retrieval-Augmented Generation (RAG) workflows and other machine learning models directly into its stream processing engine. This capability allows stateful stream processing applications written in Flink SQL to enrich data streams by querying external knowledge sources. The system can invoke Large Language Models (LLMs) to perform tasks such as data classification, transformation, and generation based on both the streaming data and the retrieved external context.
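As a first illustration, the following is a minimal sketch of such an invocation in Flink SQL, assuming FLIP-437-style model DDL (CREATE MODEL) and the ML_PREDICT table function; the model name, the option keys, and the support_tickets source table are illustrative placeholders, and the exact syntax depends on the platform release.

```sql
-- Hypothetical sketch: register an external LLM as a catalog object
-- (FLIP-437-style DDL; option keys vary by provider and release).
CREATE MODEL ticket_classifier
INPUT (`text` STRING)
OUTPUT (category STRING)
WITH (
  'provider' = 'openai',          -- assumed provider option
  'task' = 'text_generation'      -- assumed task option
);

-- Classify every event on the stream by passing its text to the model.
SELECT id, category
FROM ML_PREDICT(
  TABLE support_tickets,          -- assumed Kafka-backed source table
  MODEL ticket_classifier,
  DESCRIPTOR(`text`)              -- column fed to the model input
);
```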
Core Concepts
- Retrieval-Augmented Generation (RAG): RAG is an architecture that improves the output of LLMs by grounding it in documents or data retrieved from an external knowledge source. Before the LLM generates a response, a retriever component first searches a knowledge source (e.g., a vector database) for relevant information. The retrieved data is then passed as context, along with the user's original prompt, to the LLM. This process helps reduce factual inaccuracies (hallucinations) and allows the model to respond with more current or domain-specific information than its training data contains.
- Data Streams and Events: The system operates on continuous, unbounded data streams from sources like Apache Kafka. Each record in the stream is treated as an event that can trigger the RAG workflow.
- Vector Embeddings: To perform semantic retrieval, unstructured text from an event is converted into a numerical vector representation known as an embedding. This conversion is performed by an embedding model and allows for quantitative comparison of semantic similarity between different pieces of text.
- Vector Database: An external database, such as Milvus, that stores pre-calculated vector embeddings of a knowledge corpus. During the retrieval step, the system queries this database to find vectors that are semantically similar to the vector of the incoming event data.
- LLM Invocation: Flink SQL jobs can call external LLMs through user-defined functions or other integrations. These calls serve two purposes in the pipeline: generating vector embeddings, and producing the final output after the relevant context has been retrieved (see the sketch after this list).
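The embedding and generation steps referenced above can both be expressed as registered models. The sketch below reuses the FLIP-437-style DDL from the introduction and is equally hypothetical: doc_embedder, answer_generator, the events table, and the option keys are placeholder names.

```sql
-- Hypothetical: an embedding model mapping event text to a float vector.
CREATE MODEL doc_embedder
INPUT (`text` STRING)
OUTPUT (embedding ARRAY<FLOAT>)
WITH (
  'provider' = 'openai',     -- assumed option keys
  'task' = 'embedding'
);

-- Hypothetical: the generative model used in the Augment & Generate step.
CREATE MODEL answer_generator
INPUT (prompt STRING)
OUTPUT (response STRING)
WITH (
  'provider' = 'openai',
  'task' = 'text_generation'
);

-- Embed each incoming event; the resulting vector drives the retrieval
-- step against the vector database.
SELECT id, embedding
FROM ML_PREDICT(TABLE events, MODEL doc_embedder, DESCRIPTOR(`text`));
```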
Architecture
The RAG workflow in Ververica is implemented as a continuous data pipeline within a Flink job. The pipeline coordinates the stream processor's interaction with external AI services across five steps; a Flink SQL sketch of the full pipeline follows the list.
- Ingest: The Flink job consumes a real-time data stream.
- Embed: The streaming data is passed to a registered embedding model, which returns a vector embedding.
- Retrieve: The job uses the embedding to query the external Vector DB for relevant documents.
- Augment & Generate: The original event data and the retrieved context are concatenated into a prompt. This prompt is sent to a generative LLM to produce a final output (e.g., a classification, summary, or response).
- Egress: The generated output is sent to a downstream sink system.
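Put together, the five steps can be sketched as a single Flink SQL pipeline. The Kafka tables use standard Flink DDL, the models come from the sketch in Core Concepts, and VECTOR_SEARCH is a stand-in for whatever vector-database lookup the platform actually provides (e.g., a table function or lookup connector backed by a Milvus collection); all names are illustrative.

```sql
-- 1. Ingest: Kafka-backed source of user questions (standard Flink DDL).
CREATE TABLE questions (
  id STRING,
  `text` STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'questions',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

-- 5. Egress: Kafka sink for the generated answers.
CREATE TABLE answers (
  id STRING,
  response STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'answers',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);

-- 2. Embed: attach a vector embedding to each question.
CREATE TEMPORARY VIEW embedded AS
SELECT id, `text`, embedding
FROM ML_PREDICT(TABLE questions, MODEL doc_embedder, DESCRIPTOR(`text`));

-- 3. Retrieve: VECTOR_SEARCH is a placeholder table function for the
--    vector-database lookup (top-1 hit for simplicity; a real pipeline
--    would aggregate the top-k hits into one context string).
CREATE TEMPORARY VIEW retrieved AS
SELECT e.id, e.`text`, r.doc AS context
FROM embedded AS e,
     LATERAL TABLE(VECTOR_SEARCH(e.embedding, 'knowledge_base', 1)) AS r(doc);

-- 4. Augment: concatenate the event and retrieved context into a prompt.
CREATE TEMPORARY VIEW prompts AS
SELECT id,
       CONCAT('Context: ', context, ' Question: ', `text`) AS prompt
FROM retrieved;

-- 4/5. Generate the final answer and write it to the sink.
INSERT INTO answers
SELECT id, response
FROM ML_PREDICT(TABLE prompts, MODEL answer_generator, DESCRIPTOR(prompt));
```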
Use Cases
- Customer Support Chatbots: A chatbot application can use RAG to query a customer’s order history from a database and technical documentation from a knowledge base to provide specific answers to customer inquiries.
- Real-Time Recommendation Engines: An e-commerce platform can process a stream of user clicks. For each click, the system can retrieve descriptions of semantically similar products from a catalog to display as recommendations.
- Financial Anomaly Detection: A system processing a stream of financial transactions can enrich each transaction with historical data for the associated account. This enriched data can then be passed to a model to better identify anomalous patterns that may indicate fraud.
- Log Analysis and Alerting: A monitoring system can process application logs. When an error log is detected, it can retrieve relevant entries from a technical knowledge base to include in an alert notification sent to an engineer (sketched below).
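As a concrete illustration of the last use case, the following hedged sketch reuses the hypothetical models and the VECTOR_SEARCH placeholder from the architecture example; app_logs, alerts, and the runbook collection are invented names.

```sql
-- app_logs: an assumed Kafka-backed log source with (id, level, message);
-- alerts: an assumed sink table with (id, response).
-- Only error-level events trigger the RAG workflow.
CREATE TEMPORARY VIEW error_logs AS
SELECT id, message AS `text` FROM app_logs WHERE level = 'ERROR';

-- Embed the error message.
CREATE TEMPORARY VIEW error_embedded AS
SELECT id, `text`, embedding
FROM ML_PREDICT(TABLE error_logs, MODEL doc_embedder, DESCRIPTOR(`text`));

-- Retrieve the closest runbook entry and draft the alert prompt.
CREATE TEMPORARY VIEW alert_prompts AS
SELECT e.id,
       CONCAT('Known issue: ', r.doc, ' Error: ', e.`text`,
              ' Summarize the likely cause and next steps.') AS prompt
FROM error_embedded AS e,
     LATERAL TABLE(VECTOR_SEARCH(e.embedding, 'runbook', 1)) AS r(doc);

-- Generate the alert text and emit it to the notification sink.
INSERT INTO alerts
SELECT id, response
FROM ML_PREDICT(TABLE alert_prompts, MODEL answer_generator, DESCRIPTOR(prompt));
```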