RNA Lab Navigator: Document Ingestion Pipeline

The document ingestion pipeline is a critical component of the RNA Lab Navigator, responsible for processing various types of documents (protocols, theses, papers) and preparing them for efficient retrieval. This document provides a detailed technical overview of the ingestion process, focusing on the specialized handling of different document types and optimization strategies.

Document Ingestion Workflow

  1. Input Sources: Protocols, Theses, Papers, BioRxiv preprints
  2. Text Extraction: PyMuPDF, pdfplumber
  3. Document Chunking: 400±50 words per chunk, 100-word overlap
  4. Metadata Extraction: source, author, date, document type
  5. Citation Generation: source IDs, page numbers
  6. Embedding Generation: Ada-002 (1536 dimensions)
  7. Weaviate Storage: HNSW + BM25 hybrid index
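
Taken together, these stages form a linear pipeline. The sketch below is illustrative only: extract_text, chunk_document, and generate_embedding_batch correspond to functions shown later in this document, while detect_doc_type, extract_metadata, and store_in_weaviate are hypothetical placeholders for the remaining steps.

def ingest_document(pdf_path):
    """
    Illustrative end-to-end ingestion flow for a single document.
    """
    # 1-2. Identify the source type and extract raw text
    doc_type = detect_doc_type(pdf_path)          # hypothetical helper
    text = extract_text(pdf_path)

    # 3-5. Extract document-level metadata, then chunk with citation info
    metadata = extract_metadata(text, doc_type)   # hypothetical helper
    chunks = chunk_document(text, doc_type, metadata)

    # 6. Generate embeddings (cached where possible)
    embedded = generate_embedding_batch(chunks)

    # 7. Persist chunks and vectors to Weaviate
    for item in embedded:
        store_in_weaviate(item["chunk"], item["embedding"])  # hypothetical helper

    return len(chunks)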

Document Source Handling

The ingestion pipeline employs specialized handling for different document types:

Lab Protocols
  • Extraction Method: pdfplumber for complex lab document layouts
  • Chunking Strategy: Standard 400±50 words with reagent section detection
  • Metadata: Protocol ID, category, last modified date, author, reagents list
  • Special Handling: Table extraction for reagent concentrations and steps
  • Version Tracking: Maintains protocol versioning history
PhD Theses
  • Extraction Method: PyMuPDF with OCR fallback for scanned sections
  • Chunking Strategy: Chapter-based primary split, then 400±50 word chunks
  • Metadata: Author, year, department, chapter titles, page numbers
  • Special Handling: Figure extraction and reference linking
  • Structure Preservation: Chapter/section hierarchy maintained in metadata
Research Papers
  • Extraction Method: PyPDF2 with column detection for academic layouts
  • Chunking Strategy: Section-aware chunking (abstract, methods, results, discussion)
  • Metadata: Authors, journal, publication date, DOI, keywords
  • Special Handling: Table and figure extraction with caption linking
  • Citation Linking: Cross-reference resolution for cited works
BioRxiv Preprints
  • Extraction Method: API-based fetching via Celery Beat on a daily schedule (see the schedule sketch below)
  • Filtering: RNA biology keyword matching (configurable)
  • Content: Abstract + title as initial content, with full-text on demand
  • Metadata: Authors, submission date, category, DOI, version
  • Update Strategy: Incremental update for revised preprints
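
The daily bioRxiv fetch described above can be scheduled with Celery Beat. The following is a minimal sketch, assuming a task named fetch_biorxiv_preprints registered under api.ingestion.tasks and a keyword filter passed as task kwargs; the actual task path, timing, and keyword list may differ.

from celery.schedules import crontab

# Minimal sketch of a daily bioRxiv fetch schedule (task name and kwargs are assumptions)
CELERY_BEAT_SCHEDULE = {
    "fetch-biorxiv-preprints-daily": {
        "task": "api.ingestion.tasks.fetch_biorxiv_preprints",
        "schedule": crontab(hour=2, minute=0),  # once a day at 02:00
        "kwargs": {"keywords": ["RNA", "ribozyme", "CRISPR"]},  # configurable keyword filter
    },
}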

Technical Implementation

1. Text Extraction

The extraction module uses a multi-tool approach with fallback mechanisms:

  • Primary Tools: PyMuPDF, pdfplumber, PyPDF2
  • OCR Integration: Tesseract for scanned documents
  • Layout Analysis: Page segmentation for columns and regions
  • Error Handling: Graceful fallback for corrupted pages

The tool selection is automatic based on document analysis:

  • pdfplumber: Complex layouts, tables, forms
  • PyMuPDF: General-purpose extraction with good performance
  • Tesseract OCR: When digital text extraction fails

import fitz  # PyMuPDF, used by analyze_pdf_structure below

def extract_text(pdf_path):
    """
    Extract text from a PDF using the extraction method chosen by
    analyze_pdf_structure.
    """
    # Analyze PDF to determine best extraction method
    extraction_method = analyze_pdf_structure(pdf_path)
    
    if extraction_method == "pdfplumber":
        return extract_with_pdfplumber(pdf_path)
    elif extraction_method == "pymupdf":
        return extract_with_pymupdf(pdf_path)
    elif extraction_method == "ocr":
        return extract_with_ocr(pdf_path)
    else:
        # Fallback to combined approach
        return extract_combined(pdf_path)

def analyze_pdf_structure(pdf_path):
    """
    Determine the appropriate extraction method
    based on PDF structure analysis
    """
    with fitz.open(pdf_path) as doc:
        # Check if document has selectable text
        has_text = False
        for page in doc:
            if page.get_text():
                has_text = True
                break
        
        # If no selectable text, use OCR
        if not has_text:
            return "ocr"
        
        # Check for complex layouts (tables, multi-column)
        has_complex_layout = False
        sample_page = doc[0]
        blocks = sample_page.get_text("blocks")
        if len(blocks) > 10:  # Heuristic for complex layout
            has_complex_layout = True
        
        # Return appropriate method
        if has_complex_layout:
            return "pdfplumber"
        else:
            return "pymupdf"
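
The extract_with_ocr branch referenced above is not included in this excerpt. A minimal sketch, assuming pages are rasterized with PyMuPDF and passed to Tesseract via pytesseract, might look like this:

import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_with_ocr(pdf_path, dpi=300):
    """
    Sketch of the OCR fallback: rasterize each page and run Tesseract
    over the resulting image.
    """
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Render the page to an image at the requested resolution
            pix = page.get_pixmap(dpi=dpi)
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            pages.append(pytesseract.image_to_string(img))
    return "\n".join(pages)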

2. Intelligent Chunking

The chunking system balances several factors:

  • Size: Target 400±50 words per chunk
  • Overlap: 100 words between chunks to maintain context
  • Semantic Boundaries: Preference for section/paragraph breaks
  • Special Case - Theses: Chapter detection with regex patterns
  • Special Case - Papers: Section awareness (Methods, Results, etc.)

Key features include:

  • Preservation of hierarchical information (chapter → section → paragraph)
  • Table and figure handling to avoid mid-content breaks
  • Reference preservation to maintain citation context
  • Reagent list completeness in protocol chunks

import re

def chunk_document(text, doc_type, metadata=None):
    """
    Chunk a document according to its type: chapter-aware for theses,
    section-aware for protocols, fixed-size windows otherwise.
    """
    if doc_type == "thesis":
        # For theses, first split by chapters
        chapter_pattern = r'CHAPTER\s+\d+[\s\-:]*([^\n]+)'
        chapters = re.split(chapter_pattern, text)
        chunks = []
        
        for i, chapter in enumerate(chapters):
            if i % 2 == 1:  # Chapter titles are at odd indices
                chapter_title = chapter.strip()
                chapter_content = chapters[i+1] if i+1 < len(chapters) else ""
                
                # For each chapter, create word chunks
                chapter_chunks = create_word_chunks(
                    chapter_content, 
                    chunk_size=400, 
                    overlap=100
                )
                
                # Add chapter metadata to each chunk
                for j, chunk in enumerate(chapter_chunks):
                    chunk_metadata = metadata.copy() if metadata else {}
                    chunk_metadata.update({
                        "chapter": chapter_title,
                        "chunk_index": j,
                        "chapter_index": i // 2
                    })
                    chunks.append({
                        "content": chunk,
                        "metadata": chunk_metadata
                    })
        
        return chunks
    
    elif doc_type == "protocol":
        # For protocols, preserve reagent sections
        sections = split_by_protocol_sections(text)
        chunks = []
        
        for section_name, section_content in sections:
            # Create word chunks within each section
            section_chunks = create_word_chunks(
                section_content,
                chunk_size=400,
                overlap=100
            )
            
            # Add section metadata to each chunk
            for j, chunk in enumerate(section_chunks):
                chunk_metadata = metadata.copy() if metadata else {}
                chunk_metadata.update({
                    "section": section_name,
                    "chunk_index": j
                })
                chunks.append({
                    "content": chunk,
                    "metadata": chunk_metadata
                })
        
        return chunks
    
    else:  # Default for papers and other types
        standard_chunks = create_word_chunks(
            text,
            chunk_size=400,
            overlap=100
        )
        
        return [
            {
                "content": chunk,
                "metadata": {
                    **(metadata or {}),
                    "chunk_index": i
                }
            }
            for i, chunk in enumerate(standard_chunks)
        ]
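
The create_word_chunks helper used throughout chunk_document is not shown in this excerpt. Below is a minimal sketch of the fixed-size windowing it is described as performing (roughly 400 words per chunk with a 100-word overlap); the real implementation also prefers section and paragraph boundaries, which this sketch omits.

def create_word_chunks(text, chunk_size=400, overlap=100):
    """
    Sketch of word-window chunking: emit chunks of about `chunk_size`
    words, each sharing `overlap` words with the previous chunk.
    """
    words = text.split()
    if not words:
        return []

    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks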

3. Metadata Extraction

The metadata extraction module employs a combination of rule-based and ML approaches:

Core Metadata Fields

  • Common Fields:
    • doc_type: "thesis", "protocol", "paper", "preprint"
    • source_file: Original filename/path
    • ingestion_date: ISO timestamp of processing
    • word_count: Total words in chunk
    • chunk_index: Position in sequence
  • Type-Specific Fields:
    • Theses: author, year, department, institution, advisor
    • Protocols: author, version, creation_date, update_date, category
    • Papers: authors, journal, publication_date, doi, keywords

Extraction Techniques

  • Rule-based Extraction:
    • Regular expressions for structured fields
    • Pattern matching for dates, DOIs, author names
    • Title page parsing for thesis metadata
  • ML-based Extraction:
    • Named Entity Recognition for authors and institutions
    • Document classification for protocol categories
    • Keyword extraction for paper topics
  • Manual Override:
    • Web UI for metadata verification and correction
    • Batch update functionality for curators
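
As an illustration of the rule-based layer, the snippet below uses generic regular expressions for DOIs and ISO-style dates. These patterns are illustrative and not necessarily the exact expressions used in the pipeline.

import re

# Generic patterns for rule-based metadata extraction (illustrative only)
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
ISO_DATE_PATTERN = re.compile(r"\b(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b")

def extract_basic_metadata(text):
    """
    Pull DOI and date candidates out of raw text with regular expressions.
    """
    doi_match = DOI_PATTERN.search(text)
    date_match = ISO_DATE_PATTERN.search(text)
    return {
        "doi": doi_match.group(0) if doi_match else None,
        "date": date_match.group(0) if date_match else None,
    }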

4. Embedding Generation

The embedding generation system prioritizes quality and efficiency:

  • Primary Model: OpenAI text-embedding-ada-002 (1536 dimensions)
  • Fallback Model: SentenceTransformers all-MiniLM-L6-v2 (384 dimensions)
  • Batch Processing: 20 chunks per API call to minimize overhead
  • Caching Strategy: SHA-256 hash-based with 30-day TTL

Optimization techniques include:

  • Intelligent batching to maximize API efficiency
  • Concurrent processing for large document sets
  • Redis-based caching to prevent redundant embedding generation
  • Rate limiting and retry logic for API stability

Typical performance figures:

  • Embedding time: 0.5 s per chunk (average)
  • Cache hit rate: 42% for new documents
  • API cost: $0.06 per MB of text

import hashlib
import json
import logging

import openai
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)
# redis_client is assumed to be a module-level Redis connection

def generate_embedding_batch(chunks, use_cache=True):
    """
    Generate embeddings for a batch of chunks, reusing cached vectors
    where possible and falling back to a local model on API failure.
    """
    # Check cache for existing embeddings
    cache_hits = []
    chunks_to_embed = []
    chunk_hashes = []   # hashes for all chunks, in input order
    miss_hashes = []    # hashes for chunks that still need embedding
    
    for chunk in chunks:
        chunk_text = chunk["content"]
        chunk_hash = hashlib.sha256(chunk_text.encode()).hexdigest()
        chunk_hashes.append(chunk_hash)
        
        if use_cache and redis_client.exists(f"emb:{chunk_hash}"):
            # Retrieve from cache
            embedding = json.loads(redis_client.get(f"emb:{chunk_hash}"))
            cache_hits.append({
                "chunk": chunk,
                "embedding": embedding,
                "source": "cache"
            })
        else:
            chunks_to_embed.append(chunk)
            miss_hashes.append(chunk_hash)
    
    # If all chunks were in cache, return early
    if not chunks_to_embed:
        return cache_hits
    
    # Generate embeddings for remaining chunks
    try:
        response = openai.Embedding.create(
            input=[c["content"] for c in chunks_to_embed],
            model="text-embedding-ada-002"
        )
        
        # Process response and update cache
        api_results = []
        for i, chunk in enumerate(chunks_to_embed):
            embedding = response["data"][i]["embedding"]
            # Use the hash recorded for this cache miss (hits and misses can
            # interleave, so offsetting by len(cache_hits) would be wrong)
            chunk_hash = miss_hashes[i]
            
            # Cache the embedding
            if use_cache:
                redis_client.setex(
                    f"emb:{chunk_hash}",
                    60 * 60 * 24 * 30,  # 30 day TTL
                    json.dumps(embedding)
                )
            
            api_results.append({
                "chunk": chunk,
                "embedding": embedding,
                "source": "api"
            })
        
        # Combine cache hits and API results
        return cache_hits + api_results
        
    except Exception as e:
        # Fallback to local model on API failure
        logger.warning(f"OpenAI API failed, falling back to local model: {e}")
        
        local_model = SentenceTransformer("all-MiniLM-L6-v2")
        embeddings = local_model.encode(
            [c["content"] for c in chunks_to_embed],
            batch_size=32,
            show_progress_bar=False
        )
        
        fallback_results = []
        for i, chunk in enumerate(chunks_to_embed):
            embedding = embeddings[i].tolist()
            fallback_results.append({
                "chunk": chunk,
                "embedding": embedding,
                "source": "local_model"
            })
        
        return cache_hits + fallback_results
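
For example, chunks produced by chunk_document can be fed to the batch embedder in groups of 20, matching the batch size noted above (the file path here is hypothetical):

# Illustrative usage: extract, chunk, then embed in batches of 20
text = extract_text("protocols/rna_extraction.pdf")
chunks = chunk_document(text, doc_type="protocol", metadata={"source_file": "rna_extraction.pdf"})

embedded = []
for start in range(0, len(chunks), 20):
    embedded.extend(generate_embedding_batch(chunks[start:start + 20]))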

5. Vector Database Storage

The Weaviate configuration is optimized for RNA biology domain:

  • Index Type: HNSW (Hierarchical Navigable Small World)
  • Distance Metric: Cosine similarity
  • HNSW Parameters:
    • ef: 256 (search complexity)
    • efConstruction: 256 (index build complexity)
    • maxConnections: 64 (nodes per level)
  • Hybrid Search: BM25 enabled with vector:keyword ratio of 0.75:0.25

Schema design includes:

  • Document class with indexFilterable properties
  • Text analyzers for biomedical terminology
  • Cross-references between related chunks
  • Custom tokenization for RNA biology terms

class_obj = {
    "class": "Document",
    "vectorizer": "none",  # We provide vectors directly
    "vectorIndexConfig": {
        "distance": "cosine",
        "ef": 256,
        "efConstruction": 256,
        "maxConnections": 64,
    },
    "moduleConfig": {
        "text2vec-contextionary": {
            "skip": True  # Skip built-in vectorization
        },
        "text2vec-transformers": {
            "skip": True  # Skip built-in vectorization
        }
    },
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "doc_type", "dataType": ["text"], "indexFilterable": True},
        {"name": "source_file", "dataType": ["text"], "indexFilterable": True},
        {"name": "chunk_index", "dataType": ["int"]},
        {"name": "word_count", "dataType": ["int"]},
        {"name": "ingestion_date", "dataType": ["date"]},
        
        # Thesis-specific properties
        {"name": "author", "dataType": ["text"], "indexFilterable": True},
        {"name": "year", "dataType": ["int"], "indexFilterable": True},
        {"name": "department", "dataType": ["text"], "indexFilterable": True},
        {"name": "institution", "dataType": ["text"], "indexFilterable": True},
        {"name": "advisor", "dataType": ["text"]},
        {"name": "chapter", "dataType": ["text"], "indexFilterable": True},
        
        # Protocol-specific properties
        {"name": "version", "dataType": ["string"]},
        {"name": "creation_date", "dataType": ["date"]},
        {"name": "update_date", "dataType": ["date"]},
        {"name": "category", "dataType": ["text"], "indexFilterable": True},
        {"name": "reagents", "dataType": ["text[]"]},
        
        # Paper-specific properties
        {"name": "authors", "dataType": ["text[]"]},
        {"name": "journal", "dataType": ["text"], "indexFilterable": True},
        {"name": "publication_date", "dataType": ["date"]},
        {"name": "doi", "dataType": ["text"]},
        {"name": "keywords", "dataType": ["text[]"], "indexFilterable": True},
        
        # Cross-references
        {"name": "references", "dataType": ["Document"]}  # cross-reference to other Document objects
    ]
}
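
At query time, the 0.75:0.25 vector:keyword weighting corresponds to the alpha parameter of Weaviate's hybrid search. A minimal sketch using the v3-style Python client, assuming a local Weaviate instance and the Document class defined above:

import weaviate

client = weaviate.Client("http://localhost:8080")  # assumed local instance

# Hybrid search: alpha=0.75 weights the vector score at 75%, BM25 at 25%
results = (
    client.query
    .get("Document", ["content", "doc_type", "source_file", "chunk_index"])
    .with_hybrid(query="RNA extraction protocol", alpha=0.75)
    .with_limit(10)
    .do()
)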

Pipeline Performance Metrics

  • Processing speed: 1.2 MB/min (end-to-end pipeline)
  • Average chunks per PhD thesis: 120
  • Average chunks per protocol document: 15
  • Average chunks per research paper: 25
  • Memory usage: 2-4 GB peak RAM during processing

Storage requirements: ~2 KB per chunk metadata + ~12 KB per vector embedding = ~14 KB per chunk

In-Progress Enhancements

Figure & Image Extraction

Automated extraction and embedding of figures from documents:

  • PDF page image extraction with context
  • CLIP-based embeddings for image-text similarity
  • Figure caption parsing and linking
  • Multimodal search capabilities

This enhancement will enable the system to include relevant figures in responses.

Reagent Entity Recognition

Specialized extraction of reagent information from protocols:

  • Named entity recognition for chemicals and reagents
  • Concentration and amount extraction
  • Cross-protocol reagent linking
  • Integration with lab inventory system

This will enable precise reagent lookup and inventory integration.

Hierarchical Document Representation

Enhanced document structure preservation:

  • Parent-child relationships between chunks
  • Document tree visualization
  • Contextual expansion during retrieval
  • Section-aware query routing

This will improve navigation within large documents like theses.

Incremental Update System

Efficient handling of document updates:

  • Differential document comparison
  • Selective chunk reprocessing
  • Version history maintenance
  • Change notification system

This will optimize processing for frequently updated protocols.

RNA Lab Navigator Technical Documentation
