RNA Lab Navigator: Detailed RAG Pipeline

The RNA Lab Navigator is a private, retrieval-augmented assistant built for an RNA biology research lab. This document gives a detailed technical overview of the system's architecture, focusing on the implementation of its RAG (Retrieval-Augmented Generation) pipeline.


Document Ingestion Pipeline

1. Document Processing
Converting source documents into processable text (a dispatch sketch follows this list)
  • Thesis Documents: Detected by CHAPTER regex, processed page by page with PyMuPDF
  • Protocols: Extracted with pdfplumber for handling complex layouts
  • Research Papers: Parsed using PyPDF2 with layout preservation
  • BioRxiv Preprints: Fetched daily via API with Celery Beat scheduler
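The extraction layer can be summarized as a small dispatcher. The sketch below is illustrative rather than the project's actual module layout (function and variable names are assumptions); it routes each document type to the parser named above:
import fitz  # PyMuPDF
import pdfplumber
from PyPDF2 import PdfReader


def extract_text(path, doc_type):
    """Route a source PDF to the parser used for its document type (illustrative)."""
    if doc_type == "thesis":
        # Page-by-page extraction with PyMuPDF; chapters are detected later
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    elif doc_type == "protocol":
        # pdfplumber copes better with complex, table-heavy layouts
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    else:
        # Research papers and preprints via PyPDF2
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)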
2. Chunking Strategy
Splitting documents into optimal-sized chunks
  • Chunk Size: 400±50 words for optimal context retention
  • Overlap: 100-word overlap to maintain cross-chunk coherence
  • Special Case - Theses: Split by chapter first, then chunked (see the sketch after the code below)
  • Special Case - Figures: Extracted and linked to related text chunks
def chunk_document(text, chunk_size=400, overlap=100):
    """
    Split document text into overlapping chunks.
    
    Args:
        text: Document text to chunk
        chunk_size: Target word count per chunk
        overlap: Word overlap between chunks
        
    Returns:
        List of text chunks
    """
    words = text.split()
    chunks = []
    
    for i in range(0, len(words), chunk_size - overlap):
        chunk = words[i:i + chunk_size]
        if len(chunk) < 50:  # Skip tiny chunks at the end
            continue
        chunks.append(" ".join(chunk))
        
    return chunks
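Thesis documents are pre-split on chapter headings (the CHAPTER regex mentioned above) before each chapter is chunked separately. A minimal sketch, with the exact pattern as an assumption:
import re

def split_thesis_into_chapters(text):
    """
    Split thesis text on chapter headings so each chapter is chunked independently.
    The regex is illustrative; the real pattern may differ.
    """
    parts = re.split(r'(?=^CHAPTER\s+\d+)', text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]

# chunks = [c for chapter in split_thesis_into_chapters(text) for c in chunk_document(chapter)]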
3. Metadata Extraction
Enhancing chunks with rich contextual metadata (a metadata-builder sketch follows this list)
  • Common Fields: doc_type, source_file, chunk_index, word_count
  • Thesis-specific: author, year, chapter, department
  • Protocol-specific: category, last_updated, reagents
  • Paper-specific: authors, publication_date, journal, doi
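A sketch of how per-chunk metadata could be assembled; the field names follow the lists above, while the helper itself and the shape of the doc record are assumptions:
def build_chunk_metadata(chunk, chunk_index, doc):
    """Combine common fields with doc-type-specific fields for one chunk (illustrative)."""
    metadata = {
        "doc_type": doc["doc_type"],
        "source_file": doc["source_file"],
        "chunk_index": chunk_index,
        "word_count": len(chunk.split()),
    }
    if doc["doc_type"] == "thesis":
        metadata.update({
            "author": doc.get("author"),
            "year": doc.get("year"),
            "chapter": doc.get("chapter"),
            "department": doc.get("department"),
        })
    elif doc["doc_type"] == "protocol":
        metadata.update({
            "category": doc.get("category"),
            "last_updated": doc.get("last_updated"),
            "reagents": doc.get("reagents", []),
        })
    elif doc["doc_type"] == "paper":
        metadata.update({
            "authors": doc.get("authors", []),
            "publication_date": doc.get("publication_date"),
            "journal": doc.get("journal"),
            "doi": doc.get("doi"),
        })
    return metadata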
4. Vector Embedding Generation
Creating dense vector representations of text chunks
  • Model: OpenAI text-embedding-ada-002 (1536 dimensions)
  • Batching: Chunks processed in batches of 20 to optimize API calls
  • Caching: SHA-256 hash-based embedding cache in Redis
  • Fallback: Local SentenceTransformers model for offline operation
import hashlib
import json

import openai
import redis

# Assumed app-level Redis connection; in the real project this is configured elsewhere
redis_client = redis.Redis()


def generate_embeddings(chunks, use_cache=True):
    """
    Generate embeddings for text chunks with caching.
    
    Args:
        chunks: List of text chunks
        use_cache: Whether to use the embedding cache
        
    Returns:
        List of embedding vectors, in the same order as the input chunks
    """
    embeddings = [None] * len(chunks)
    batch_size = 20
    
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        
        # Resolve cache hits first; remember which chunks still need the API
        pending = []
        for offset, chunk in enumerate(batch):
            chunk_hash = hashlib.sha256(chunk.encode()).hexdigest()
            cached = redis_client.get(f"emb:{chunk_hash}") if use_cache else None
            if cached is not None:
                embeddings[start + offset] = json.loads(cached)
            else:
                pending.append(offset)
        
        if not pending:
            continue
        
        # One API call per batch of uncached chunks (this is what keeps API calls low)
        response = openai.Embedding.create(
            input=[batch[offset] for offset in pending],
            model="text-embedding-ada-002",
        )
        
        for item in sorted(response["data"], key=lambda d: d["index"]):
            offset = pending[item["index"]]
            embedding = item["embedding"]
            embeddings[start + offset] = embedding
            
            # Cache the embedding with a 30-day TTL
            if use_cache:
                chunk_hash = hashlib.sha256(batch[offset].encode()).hexdigest()
                redis_client.setex(
                    f"emb:{chunk_hash}",
                    60 * 60 * 24 * 30,
                    json.dumps(embedding),
                )
    
    return embeddings
5. Vector Database Storage
Storing vectors and metadata for efficient retrieval
  • Database: Weaviate Cloud (HNSW vector index)
  • Schema: Document class with normalized properties
  • Indexing: HNSW with ef=256, maxConnections=64
  • Hybrid Search: BM25 enabled (0.75 vector, 0.25 keyword)
class_obj = {
    "class": "Document",
    "vectorizer": "none",  # We provide vectors directly
    "vectorIndexConfig": {
        "distance": "cosine",
        "ef": 256,
        "efConstruction": 256,
        "maxConnections": 64,
    },
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "doc_type", "dataType": ["text"], "indexFilterable": True},
        {"name": "source_file", "dataType": ["text"], "indexFilterable": True},
        {"name": "chunk_index", "dataType": ["int"]},
        {"name": "author", "dataType": ["text"], "indexFilterable": True},
        {"name": "year", "dataType": ["int"], "indexFilterable": True},
        {"name": "chapter", "dataType": ["text"]},
        {"name": "department", "dataType": ["text"]},
        {"name": "category", "dataType": ["text"], "indexFilterable": True},
        {"name": "last_updated", "dataType": ["date"]},
        {"name": "reagents", "dataType": ["text[]"]},
        {"name": "authors", "dataType": ["text[]"]},
        {"name": "publication_date", "dataType": ["date"]},
        {"name": "journal", "dataType": ["text"], "indexFilterable": True},
        {"name": "doi", "dataType": ["text"]},
    ],
    "moduleConfig": {
        "text2vec-contextionary": {
            "skip": True  # Skip built-in vectorization
        },
        "text2vec-transformers": {
            "skip": True  # Skip built-in vectorization
        }
    }
}

Query Processing Pipeline

1. Query Expansion & Preprocessing
Enhancing the query for better retrieval
  • Query Cleaning: Removing special characters, normalizing whitespace
  • Entity Extraction: Identifying key entities (genes, proteins, techniques)
  • Stopword Removal: Filtering common words for keyword search
  • Automatic Expansion: Adding related terms for broader coverage
def preprocess_query(query):
    """
    Clean and normalize the query text
    """
    # Remove special characters
    query = re.sub(r'[^\w\s]', ' ', query)
    
    # Normalize whitespace
    query = re.sub(r'\s+', ' ', query).strip()
    
    return query

def expand_query(query):
    """
    Expand query with related terms
    """
    # Extract key entities
    entities = extract_entities(query)
    
    # Add related terms based on domain knowledge
    expanded_terms = []
    for entity in entities:
        if entity in domain_knowledge:
            expanded_terms.extend(domain_knowledge[entity]['synonyms'])
    
    # Combine original query with expanded terms
    expanded_query = query
    if expanded_terms:
        expanded_query = f"{query} {' '.join(expanded_terms)}"
    
    return expanded_query
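expand_query above depends on extract_entities and domain_knowledge, which are not shown. A minimal, purely illustrative sketch of their shape (the lexicon entries here are invented examples):
# Illustrative domain lexicon; the real one would be much larger and curated
domain_knowledge = {
    "sgRNA": {"synonyms": ["guide RNA", "gRNA", "single-guide RNA"]},
    "RT-qPCR": {"synonyms": ["quantitative RT-PCR", "real-time PCR"]},
}

def extract_entities(query):
    """Naive entity extraction: keep query terms that appear in the lexicon."""
    return [term for term in query.split() if term in domain_knowledge]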
2. Two-Stage Vector Search
Finding relevant document chunks
  • Stage 1: Hybrid Search
    • Vector similarity (HNSW index): 0.75 weight
    • Keyword matching (BM25): 0.25 weight
    • Initial k=10 results retrieved
  • Filtering: Apply doc_type and other filters from user request
  • Result Schema: content, metadata, distance/similarity scores
def hybrid_search(query, filters=None, limit=10):
    """
    Perform hybrid vector + keyword search against Weaviate.
    """
    # Generate embedding for the query (weaviate_client is the app's configured client)
    query_embedding = generate_embedding(query)
    
    # Hybrid search fuses vector similarity with BM25 keyword matching;
    # alpha=0.75 weights the vector score at 75% and keywords at 25%
    search = weaviate_client.query.get(
        "Document", ["content", "doc_type", "source_file", "chunk_index", 
                     "author", "year", "chapter", "category"]
    ).with_additional(["distance", "score"]).with_hybrid(
        query=query,
        alpha=0.75,
        vector=query_embedding,
    ).with_limit(limit)
    
    # Apply doc_type and other filters from the user request, if any
    if filters:
        search = search.with_where(filters)
    
    results = search.do()
    return results["data"]["Get"]["Document"]
3. Cross-Encoder Reranking
Improving search precision with pair-wise relevance
  • Model: MiniLM cross-encoder for precise relevance scoring
  • Input: (query, chunk) pairs for each retrieved result
  • Output: Relevance scores (0-1) for precise ranking
  • Optimization: Model kept in memory for fast inference
  • Cutoff: Results below score threshold (0.45) are discarded
def rerank_results(query, results, top_k=3):
    """
    Rerank search results using cross-encoder
    """
    # If we have no results, return empty list
    if not results:
        return []
    
    # Prepare pairs for cross-encoder
    pairs = [(query, result["content"]) for result in results]
    
    # Get relevance scores
    relevance_scores = cross_encoder.predict(pairs)
    
    # Add scores to results
    for i, result in enumerate(results):
        result["relevance_score"] = float(relevance_scores[i])
    
    # Sort by relevance score
    reranked_results = sorted(results, key=lambda x: x["relevance_score"], reverse=True)
    
    # Apply confidence threshold
    filtered_results = [r for r in reranked_results if r["relevance_score"] >= 0.45]
    
    # Return top k results
    return filtered_results[:top_k]
4. Context Preparation
Formatting retrieved context for optimal LLM use
  • Content Selection: Top-N chunks from reranker (typically 3)
  • Context Formatting: Adding citation tokens and metadata
  • Length Control: Truncation to fit token limits while preserving citations
  • Source Management: Creating source lookup table for citations
def prepare_context(results):
    """
    Format search results into LLM-ready context with citations
    """
    context_parts = []
    source_map = {}
    
    for i, result in enumerate(results):
        # Create citation token
        citation = f"[{i+1}]"
        source_id = f"source_{i+1}"
        
        # Add to source map for later reference
        source_map[source_id] = {
            "doc_type": result.get("doc_type", ""),
            "title": result.get("source_file", "").split("/")[-1],
            "author": result.get("author", ""),
            "year": result.get("year", ""),
            "chapter": result.get("chapter", ""),
            "score": result.get("relevance_score", 0)
        }
        
        # Format the context with citation
        context_part = f"{result['content']} {citation}"
        context_parts.append(context_part)
    
    # Combine all contexts
    context = "\n\n".join(context_parts)
    
    return context, source_map
5. LLM Prompt Construction
Building the optimal prompt for accurate answers
  • System Message: Instructions enforcing citation requirements
  • Golden Rule: "Answer only from provided sources; if unsure, say 'I don't know.'"
  • Context Management: Strategic placement of retrieved content
  • Citation Format: Clear instructions for citation formatting
def construct_prompt(query, context, source_map):
    """
    Construct the LLM prompt with system message, query, and context
    """
    system_message = """
    You are RNA Lab Navigator, a specialized assistant for an RNA biology research lab.
    Answer only from the provided sources; if unsure, say 'I don't know.'
    
    Important rules:
    1. Include citations for all factual statements using the [X] format
    2. Citations must appear at the end of the sentence containing the information
    3. Only reference information explicitly stated in the provided context
    4. Maintain scientific accuracy and precision
    5. If multiple sources confirm the same information, cite all of them
    6. If the query cannot be answered from the provided context, say 'I don't know'
    7. Never make up information or citations
    """
    
    # Construct the full prompt
    prompt = f"""
    Context:
    {context}
    
    Sources:
    {json.dumps(source_map, indent=2)}
    
    Question:
    {query}
    """
    
    return {
        "system": system_message,
        "prompt": prompt
    }
6. Generation & Post-processing
Creating and validating the final response
  • LLM Call: OpenAI GPT-4o with appropriate temperature (0.1-0.3)
  • Stream Processing: Real-time token streaming for fast response
  • Citation Validation: Verifying all citations match source materials
  • Confidence Score: Computing overall confidence based on source relevance
  • Link Generation: Creating clickable source links for citations (see the sketch after the code below)
def generate_answer(prompt_data, stream=True):
    """
    Generate the answer from the LLM using the constructed prompt.
    
    This is a generator: it yields content strings token-by-token when
    stream=True, or the complete answer in a single yield when stream=False,
    so callers can treat both modes uniformly.
    """
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt_data["system"]},
            {"role": "user", "content": prompt_data["prompt"]}
        ],
        temperature=0.2,
        max_tokens=1000,
        stream=stream
    )
    
    if stream:
        # Yield each token as it arrives for real-time streaming to the client
        for chunk in response:
            yield chunk.choices[0].delta.get("content", "")
    else:
        # Yield the complete answer in one piece
        yield response.choices[0].message.content

def validate_citations(answer, source_map):
    """
    Validate all citations in the answer
    """
    citation_pattern = r'\[(\d+)\]'
    citations = re.findall(citation_pattern, answer)
    
    valid_citations = []
    for citation in citations:
        source_id = f"source_{citation}"
        if source_id in source_map:
            valid_citations.append(source_id)
    
    # If no valid citations found, mark low confidence
    if not valid_citations:
        return False, 0.0
    
    # Calculate overall confidence based on source relevance scores
    confidence = sum(source_map[source_id]["score"] for source_id in valid_citations) / len(valid_citations)
    
    return True, confidence
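The link-generation step mentioned above is not shown in the code. A sketch follows; the /sources/ URL scheme is an assumption:
import re

def link_citations(answer, source_map):
    """Turn [N] citation tokens into clickable links (illustrative URL scheme)."""
    def to_link(match):
        source_id = f"source_{match.group(1)}"
        source = source_map.get(source_id)
        if not source:
            return match.group(0)  # leave unknown citations untouched
        return f'<a href="/sources/{source_id}" title="{source["title"]}">{match.group(0)}</a>'
    
    return re.sub(r'\[(\d+)\]', to_link, answer)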

System Performance Metrics

  • End-to-end latency: 4.7s (target: ≤5.0s)
  • Answer quality: 87% (target: ≥85%)
  • Vector search time: 180ms (~4% of total latency)
  • Reranking time: 90ms (~2% of total latency)
  • LLM generation time: 4.1s (~87% of total latency)

Sequence Flow Diagram

The end-to-end request flows through the following steps:

  1. User (React) → API (Django): POST /api/query/ {query, doc_type?}
  2. API (Django): preprocesses the query and prepares the search request
  3. API (Django) → Weaviate: hybrid (near-vector + BM25) search with filters
  4. Weaviate → API (Django): returns the top-10 matches with metadata
  5. API (Django) → Cross-Encoder: reranks results with pair-wise scoring
  6. Cross-Encoder → API (Django): returns relevance scores and the top-3 results
  7. API (Django) → LLM (GPT-4o): sends the prompt with context and system message
  8. LLM (GPT-4o) → API (Django): streams answer tokens with citations
  9. API (Django): validates citations and calculates confidence
  10. User (React): displays the answer with highlighted citations and sources

Data Flow Optimizations

Caching Strategy

Query Results Cache

Frequently asked questions are cached to eliminate redundant processing (a caching sketch follows this list). Cache entries include:

  • Query hash as key (normalized query text)
  • Search results with metadata
  • Final response with citation mapping
  • TTL: 24 hours (configurable)
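A minimal sketch of the query-result cache; the key scheme and helper names are assumptions, and preprocess_query is the normalization function shown earlier:
import hashlib
import json

QUERY_CACHE_TTL = 60 * 60 * 24  # 24 hours, configurable

def get_cached_answer(redis_client, query):
    """Look up a previously computed answer for a normalized query."""
    key = "qcache:" + hashlib.sha256(preprocess_query(query).encode()).hexdigest()
    cached = redis_client.get(key)
    return json.loads(cached) if cached else None

def cache_answer(redis_client, query, results, answer, source_map):
    """Store search results, the final answer, and its citation mapping."""
    key = "qcache:" + hashlib.sha256(preprocess_query(query).encode()).hexdigest()
    payload = {"results": results, "answer": answer, "source_map": source_map}
    redis_client.setex(key, QUERY_CACHE_TTL, json.dumps(payload))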
Embedding Cache

SHA-256 hash-based caching system for vector embeddings:

  • Content hash as key
  • 1536-dimension vector as value
  • TTL: 30 days
  • Estimated ~40% API cost savings

Memory Optimization

Cross-Encoder Management

The MiniLM cross-encoder model is kept in memory for fast inference (a loading sketch follows this list):

  • Loaded once at application startup
  • Shared across worker processes
  • Batch prediction for multiple chunks
  • ~90ms average inference time for 10 results
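A sketch of the load-once pattern using the sentence-transformers CrossEncoder API; the exact checkpoint name is an assumption:
from sentence_transformers import CrossEncoder

# Loaded once at application startup and reused by every request
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def score_pairs(query, chunks):
    """Batch-score (query, chunk) pairs in a single forward pass."""
    return cross_encoder.predict([(query, chunk) for chunk in chunks])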

Stream Processing

Token Streaming

Real-time streaming of LLM response tokens (a consumer sketch follows this list):

  • Django channels for WebSocket communication
  • Token-by-token streaming to frontend
  • Reduces perceived latency by ~70%
  • First token appears in ~1.5s
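A minimal Django Channels consumer sketch for the token stream. The consumer name, message format, and the build_prompt_for helper are assumptions; in a real deployment the blocking LLM generator would be offloaded to a worker or wrapped for async use rather than iterated directly:
import json
from channels.generic.websocket import AsyncWebsocketConsumer


class QueryStreamConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        await self.accept()

    async def receive(self, text_data=None, bytes_data=None):
        payload = json.loads(text_data)
        prompt_data = build_prompt_for(payload["query"])  # hypothetical helper

        # Push each token to the browser as soon as it arrives
        for token in generate_answer(prompt_data, stream=True):
            await self.send(text_data=json.dumps({"token": token}))

        await self.send(text_data=json.dumps({"done": True}))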
Citation Processing

Front-end processing of citation tokens:

  • React-based token parsing
  • Citation highlighting during streaming
  • Dynamic source panel updating
  • Preview thumbnails for cited documents

Hybrid Search Tuning

Vector + BM25 Balance

Optimized hybrid search parameters:

  • Vector search weight: 0.75
  • Keyword search weight: 0.25
  • Optimized for RNA biology terminology
  • 12% improvement in recall over vector-only

Advanced Implementations

Tiered Model Selection

To optimize cost and performance, the system implements intelligent model routing based on query complexity:

Simple Queries
Protocol locator, reagent lookup, straightforward questions
  • Model: GPT-3.5 Turbo
  • Criteria:
    • Single-intent queries
    • Token count < 250
    • High confidence results (>0.8)
  • Examples: "Where can I find the RNA extraction protocol?", "What's the concentration of Trizol in the RNA isolation buffer?"
Complex Queries
Technical explanations, multi-step processes, synthesis
  • Model: GPT-4o
  • Criteria:
    • Multi-intent queries
    • Technical complexity score > 0.6
    • Requires synthesis across sources
  • Examples: "Explain the key differences between the CRISPR protocols from Chakraborty and Miyamoto labs and when each would be preferred", "What's the evidence for RNA modification in regulating p53 expression from our lab's papers?"
def analyze_query_complexity(query, search_results):
    """
    Analyze query complexity to determine appropriate LLM
    """
    # Simple properties
    token_count = len(query.split())
    query_length = len(query)
    
    # Check for technical terms
    technical_terms_count = sum(1 for term in query.split() if term.lower() in TECHNICAL_TERMS)
    technical_density = technical_terms_count / max(1, token_count)
    
    # Check number of intents
    intents = identify_intents(query)
    multi_intent = len(intents) > 1
    
    # Check confidence of top result
    top_confidence = search_results[0]["relevance_score"] if search_results else 0
    
    # Determine if this needs the more powerful model
    needs_powerful_model = (
        multi_intent or 
        technical_density > 0.3 or 
        token_count > 250 or
        top_confidence < 0.8
    )
    
    # Select appropriate model
    model = "gpt-4o" if needs_powerful_model else "gpt-3.5-turbo"
    
    return {
        "model": model,
        "complexity_score": 0.2 * multi_intent + 0.5 * technical_density + 
                            0.1 * min(1, token_count/300) + 0.2 * (1 - top_confidence)
    }
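As a usage sketch, the complexity analysis can drive which model the generation call uses. The glue function below is an assumption rather than code from the project:
def answer_with_routing(query, reranked_results, prompt_data):
    """
    Hypothetical glue code: pick the model from the complexity analysis,
    then generate the answer with that model.
    """
    decision = analyze_query_complexity(query, reranked_results)

    response = openai.ChatCompletion.create(
        model=decision["model"],  # "gpt-3.5-turbo" or "gpt-4o"
        messages=[
            {"role": "system", "content": prompt_data["system"]},
            {"role": "user", "content": prompt_data["prompt"]},
        ],
        temperature=0.2,
        max_tokens=1000,
    )
    return response.choices[0].message.content, decision["complexity_score"]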

Future Enhancements

The following features are planned for future development cycles:

  • Q2 2025: Knowledge graph integration for complex entity relationships
  • Q2 2025: Figure extraction and multimodal search capabilities
  • Q3 2025: Feedback loop for continuous learning and model fine-tuning
  • Q3 2025: Local model fallback for reduced latency and improved privacy
  • Q4 2025: Multi-modal support for visual protocol steps and diagrams

The following optimizations are planned to further improve system performance:

  • Streaming Improvements: Enhanced token-by-token processing for perceived latency reduction
  • Caching Enhancements: More sophisticated cache invalidation strategies
  • Vector Index Tuning: Periodic reindexing with optimized parameters based on usage patterns
  • Query Planning: Adaptive query plans based on query complexity and historical performance
  • Hardware Acceleration: Evaluation of GPU acceleration for cross-encoder inference
Knowledge Graph Integration
Entity-relationship modeling for enhanced reasoning
  • Entities: Genes, proteins, reagents, techniques, protocols
  • Relationships: Used-in, interacts-with, regulates, requires
  • Implementation: Neo4j graph database with Weaviate integration
  • Value: Multi-hop reasoning for complex queries connecting multiple documents
Figure Extraction & Retrieval
Visual content analysis and presentation
  • Extraction: PDF figure isolation with PyMuPDF
  • Caption Analysis: NLP-based caption parsing for metadata
  • Storage: Vector embeddings for figure+caption pairs
  • Retrieval: Multimodal search combining text and visual elements
  • Presentation: Integrated figures in text responses
Continuous Learning Loop
Improving system quality through feedback
  • Feedback Collection: User ratings and corrections
  • Dataset Creation: Building labeled training examples
  • Model Fine-tuning: Weekly retraining of cross-encoder
  • Performance Tracking: Ongoing metrics visualization
  • A/B Testing: Controlled rollout of improvements

Architecture Approach Comparison

Each retrieval approach is compared below on latency, answer quality, cost, advantages, and disadvantages.

Vector-only (HNSW)
  • Latency: Fast (~120ms)
  • Quality: Moderate (76%)
  • Cost: Low
  • Advantages: Very fast retrieval; good for semantic similarity; low infrastructure cost
  • Disadvantages: Misses exact keyword matches; poor with technical terms; limited for RNA domain terms

BM25 Keyword
  • Latency: Very Fast (~80ms)
  • Quality: Low (65%)
  • Cost: Very Low
  • Advantages: Excellent for exact matches; no embedding costs; simple implementation
  • Disadvantages: Poor semantic understanding; misses related concepts; requires exact terminology

Hybrid (Current)
  • Latency: Moderate (~180ms)
  • Quality: High (87%)
  • Cost: Moderate
  • Advantages: Best of both approaches; handles domain terminology; good balance of precision/recall
  • Disadvantages: More complex implementation; requires tuning of weights; slightly higher latency

Hybrid + Cross-encoder
  • Latency: Slow (~270ms)
  • Quality: Very High (92%)
  • Cost: High
  • Advantages: Superior relevance ranking; excellent for complex queries; high precision at top results
  • Disadvantages: Added computational cost; higher latency; requires more infrastructure
