Building RAG Applications in Rails 7.1+ with pgvector: Complete Guide

Traditional keyword search struggles to understand user intent. When someone searches “best pizza spots” versus “top-rated pizzerias,” your app should know these mean the same thing. That’s where Retrieval Augmented Generation (RAG) comes in.

RAG combines semantic search with AI-generated responses, letting your Rails app understand meaning, not just match keywords. This guide shows you how to build a production-ready RAG system using pgvector—a PostgreSQL extension that brings vector similarity search directly into your database.

By the end of this tutorial, you’ll have a working document Q&A system that understands natural language queries and generates accurate, context-aware answers using your own data.

Prerequisites #

Before starting, ensure you have:

  • Rails 7.1+ application with PostgreSQL 15+ database
  • Ruby 3.2+ installed
  • OpenAI API key (create one from the OpenAI dashboard)
  • Basic understanding of embeddings (text converted to numerical vectors)
  • Familiarity with ActiveRecord and Rails services

This tutorial assumes intermediate Rails knowledge. If you’re new to AI concepts, check out What is RAG? first.

What You’ll Build #

We’re building a document Q&A system where users can ask questions in natural language and receive AI-generated answers based on your documentation. Think of it as ChatGPT trained specifically on your company’s knowledge base.

The complete code examples are available on GitHub.

Part 1: Setup & Fundamentals 🟢 #

Installing the pgvector Extension #

pgvector is a PostgreSQL extension that adds vector similarity search capabilities. Unlike external vector databases (Pinecone, Weaviate), pgvector runs inside your existing PostgreSQL database—no additional infrastructure needed.

First, install the pgvector extension on your PostgreSQL server. On macOS with Homebrew:

brew install pgvector

On Linux (Ubuntu/Debian):

sudo apt-get install postgresql-15-pgvector

Now create a migration to enable pgvector in your Rails database:

# db/migrate/20251016000001_enable_pgvector.rb
class EnablePgvector < ActiveRecord::Migration[7.1]
  def change
    enable_extension 'vector'
  end
end

Next, add a vector column to store document embeddings. OpenAI’s text-embedding-3-small model generates 1536-dimensional vectors. Rails doesn’t understand the vector column type on its own, so add the neighbor gem, which registers it with ActiveRecord:
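
bundle add neighbor

With the gem in place, create the migration: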

# db/migrate/20251016000002_add_embedding_to_documents.rb
class AddEmbeddingToDocuments < ActiveRecord::Migration[7.1]
  def change
    add_column :documents, :embedding, :vector, limit: 1536
  end
end

Run the migrations:

rails db:migrate

Your database now supports vector storage and similarity search operations.
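
You can confirm the extension is active from the Rails console:

# Returns true once pgvector is enabled for this database
ActiveRecord::Base.connection.extension_enabled?('vector')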

Generating Embeddings with OpenAI #

Embeddings are numerical representations of text that capture semantic meaning. Similar concepts have similar vectors, even if they use different words.

Install the OpenAI Ruby client:

bundle add ruby-openai

Create a service to generate embeddings:

# app/services/embedding_service.rb
class EmbeddingService
  def initialize
    @client = OpenAI::Client.new(
      access_token: ENV['OPENAI_API_KEY']
    )
  end

  def generate(text)
    response = @client.embeddings(
      parameters: {
        model: 'text-embedding-3-small',
        input: text
      }
    )
    response.dig('data', 0, 'embedding')
  end
end

The text-embedding-3-small model is cost-effective and performs well for most use cases. For production apps, consider batch processing multiple texts in a single API call to reduce latency.
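
As a sketch of what batching looks like, the embeddings endpoint accepts an array of inputs; generate_batch below is a hypothetical addition to the service above, not part of it:

# app/services/embedding_service.rb (hypothetical batch variant)
def generate_batch(texts)
  response = @client.embeddings(
    parameters: {
      model: 'text-embedding-3-small',
      input: texts # an array of strings is embedded in a single request
    }
  )
  response['data'].map { |item| item['embedding'] }
end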

Storing Document Vectors #

Create a Document model that automatically generates and stores embeddings:

# app/models/document.rb
class Document < ApplicationRecord
  validates :title, :content, presence: true

  after_save :generate_embedding, if: :saved_change_to_content?

  private

  def generate_embedding
    embedding = EmbeddingService.new.generate(content)
    update_column(:embedding, embedding)
  end
end

When you save a document, Rails automatically generates its embedding:

Document.create!(
  title: 'Rails Performance Guide',
  content: 'Use database indexes to speed up queries...'
)
# Embedding automatically generated and stored

The after_save callback ensures embeddings stay synchronized with content changes. For large documents, consider moving this to a background job (covered in Part 3).

pgvector provides several distance operators for measuring similarity. We’ll use cosine distance (<=>), which works well for normalized embeddings:

# app/models/document.rb
class Document < ApplicationRecord
  # Previous code...

  def self.search_similar(query, limit: 5)
    query_embedding = EmbeddingService.new.generate(query)
    vector = "[#{query_embedding.join(',')}]" # pgvector literal; floats only, safe to interpolate

    # Find documents with the closest vector distance
    where.not(embedding: nil)
      .order(Arel.sql("embedding <=> '#{vector}'"))
      .limit(limit)
  end
end

Now you can search documents semantically:

Document.search_similar('How do I make my Rails app faster?')
# Returns documents about performance, optimization, caching, etc.
# Even if they don't contain the exact words "faster" or "performance"

Cosine distance returns values between 0 (identical) and 2 (opposite). Smaller distances indicate higher similarity.
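
You can verify the operator directly in SQL; for example, two perpendicular vectors have a cosine distance of exactly 1:

ActiveRecord::Base.connection.select_value(
  "SELECT '[1, 0]'::vector <=> '[0, 1]'::vector"
)
# => 1.0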

Part 2: Building the RAG Pipeline 🟡 #

Document Chunking Strategy #

Large documents create problems for RAG systems. When you retrieve a 10,000-word document, most of it won’t be relevant to the user’s question. Chunking splits documents into smaller, focused segments that improve retrieval precision.

Optimal chunk sizes depend on your content, but 200-500 tokens (roughly 150-375 words) works well for most cases. Smaller chunks provide more precise retrieval, but you might miss context. Larger chunks preserve context but reduce precision.

# app/services/chunk_service.rb
class ChunkService
  CHUNK_SIZE = 300 # tokens
  CHUNK_OVERLAP = 50 # tokens for context continuity

  def initialize(tokenizer: Tokenizers.from_pretrained('gpt2'))
    @tokenizer = tokenizer
  end

  def chunk_text(text)
    tokens = @tokenizer.encode(text).ids
    chunks = []

    (0...tokens.length).step(CHUNK_SIZE - CHUNK_OVERLAP) do |i|
      chunk_tokens = tokens[i, CHUNK_SIZE]
      chunk_text = @tokenizer.decode(chunk_tokens)
      chunks << chunk_text
    end

    chunks
  end
end

This creates overlapping chunks to preserve context across boundaries. If a sentence is split, the overlap ensures both chunks contain the full sentence.
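
A quick usage check from the console (document here is any record with a long content field):

chunks = ChunkService.new.chunk_text(document.content)
chunks.size   # e.g. 12 chunks for a ~3,000-token document
chunks.first  # roughly the first 300 tokens of the text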

Update your Document model to store chunks:

# db/migrate/20251016000003_create_document_chunks.rb
class CreateDocumentChunks < ActiveRecord::Migration[7.1]
  def change
    create_table :document_chunks do |t|
      t.references :document, foreign_key: true
      t.text :content
      t.vector :embedding, limit: 1536
      t.integer :position
      t.timestamps
    end
  end
end
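
The new table needs a model and an association so DocumentChunk can be used in the services below; a minimal version:

# app/models/document_chunk.rb
class DocumentChunk < ApplicationRecord
  belongs_to :document
  validates :content, presence: true
end

# app/models/document.rb
class Document < ApplicationRecord
  has_many :document_chunks, dependent: :destroy
  # previous code...
end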

Semantic Search Implementation #

Build a service that searches across document chunks and returns the most relevant content:

# app/services/vector_search_service.rb
class VectorSearchService
  SIMILARITY_THRESHOLD = 0.7

  def search(query, limit: 5)
    query_embedding = EmbeddingService.new.generate(query)
    vector = "[#{query_embedding.join(',')}]" # pgvector literal; floats only, safe to interpolate

    # Calculate cosine similarity (1 - cosine distance)
    # Higher scores = more similar
    DocumentChunk
      .select(
        'document_chunks.*',
        "1 - (embedding <=> '#{vector}') AS similarity_score"
      )
      .where('1 - (embedding <=> ?) > ?', vector, SIMILARITY_THRESHOLD)
      .order('similarity_score DESC')
      .limit(limit)
      .includes(:document)
  end
end

The similarity threshold (0.7) filters out low-quality matches. Tune this value based on your data—too high and you might miss relevant results, too low and you’ll include irrelevant content.

Test your semantic search:

results = VectorSearchService.new.search('database optimization techniques')
results.each do |chunk|
  puts "Score: #{chunk.similarity_score}"
  puts "Content: #{chunk.content[0..100]}..."
  puts "---"
end

RAG Query Pipeline #

The RAG pipeline combines vector search with AI generation. First, retrieve relevant context, then use it to generate an accurate answer:

# app/services/rag_service.rb
class RagService
  def initialize
    @vector_search = VectorSearchService.new
    @openai = OpenAI::Client.new(access_token: ENV['OPENAI_API_KEY'])
  end

  def query(question)
    # Step 1: Retrieve relevant chunks
    chunks = @vector_search.search(question, limit: 3)

    return no_context_response if chunks.empty?

    # Step 2: Build context from chunks
    context = chunks.map { |c| c.content }.join("\n\n")

    # Step 3: Generate answer using OpenAI
    response = @openai.chat(
      parameters: {
        model: 'gpt-4-turbo-preview',
        messages: build_messages(context, question),
        temperature: 0.3 # Lower = more factual
      }
    )

    response.dig('choices', 0, 'message', 'content')
  end

  private

  def build_messages(context, question)
    [
      {
        role: 'system',
        content: "You are a helpful assistant. Answer questions based ONLY on the provided context. If the context doesn't contain relevant information, say so."
      },
      {
        role: 'user',
        content: "Context:\n#{context}\n\nQuestion: #{question}"
      }
    ]
  end

  def no_context_response
    "I couldn't find relevant information to answer your question."
  end
end

The low temperature (0.3) makes responses more deterministic and factual. Higher values increase creativity but might introduce hallucinations.

Rails Controller Integration #

Expose the RAG system through a REST API:

# app/controllers/api/rag_controller.rb
module Api
  class RagController < ApplicationController
    def query
      question = params.require(:question)

      result = RagService.new.query(question)

      render json: {
        answer: result,
        timestamp: Time.current
      }
    rescue ActionController::ParameterMissing => e
      render json: { error: e.message }, status: :bad_request
    rescue StandardError => e
      Rails.logger.error("RAG query failed: #{e.message}")
      render json: { error: 'Query processing failed' }, status: :internal_server_error
    end
  end
end

Add the route:

# config/routes.rb
Rails.application.routes.draw do
  namespace :api do
    post 'rag/query', to: 'rag#query'
  end
end

Test your endpoint:

curl -X POST http://localhost:3000/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I optimize database queries in Rails?"}'

Part 3: Production Optimization 🔴 #

Caching Embeddings #

Generating embeddings costs money and time (50-200ms per request). For frequently queried content, implement caching:

# app/services/cached_embedding_service.rb
class CachedEmbeddingService
  CACHE_TTL = 24.hours

  def initialize
    @openai = OpenAI::Client.new(access_token: ENV['OPENAI_API_KEY'])
    @redis = Redis.new
  end

  def generate(text)
    cache_key = cache_key_for(text)

    # Try cache first
    cached = @redis.get(cache_key)
    return JSON.parse(cached) if cached

    # Generate fresh embedding
    response = @openai.embeddings(
      parameters: {
        model: 'text-embedding-3-small',
        input: text
      }
    )

    embedding = response.dig('data', 0, 'embedding')

    # Cache the result
    @redis.setex(
      cache_key,
      CACHE_TTL.to_i,
      embedding.to_json
    )

    embedding
  end

  def invalidate(text)
    @redis.del(cache_key_for(text))
  end

  private

  def cache_key_for(text)
    # Use SHA256 to handle long texts
    digest = Digest::SHA256.hexdigest(text)
    "embedding:#{digest}"
  end
end

Update the query path to use the cached version. Query embeddings are generated inside VectorSearchService, so that is where the swap belongs; RagService picks it up automatically because the cached service is the default dependency:

# app/services/vector_search_service.rb
class VectorSearchService
  SIMILARITY_THRESHOLD = 0.7

  def initialize(embedding_service: CachedEmbeddingService.new)
    @embedding_service = embedding_service
  end

  def search(query, limit: 5)
    query_embedding = @embedding_service.generate(query)
    # rest of the search implementation...
  end
end

Cache invalidation happens automatically after 24 hours, or manually when content changes:

class Document < ApplicationRecord
  after_update :invalidate_cache, if: :saved_change_to_content?

  private

  def invalidate_cache
    # Drop the cached embedding for the old version of the content;
    # the new content has no cache entry yet
    CachedEmbeddingService.new.invalidate(content_previously_was)
  end
end

Batch Processing & Background Jobs #

Generating embeddings synchronously blocks your request-response cycle. For bulk document imports or large documents, use background jobs with Sidekiq:

# app/jobs/embedding_job.rb
class EmbeddingJob < ApplicationJob
  queue_as :default

  # Track progress for UI feedback
  def perform(document_id)
    document = Document.find(document_id)
    chunks = ChunkService.new.chunk_text(document.content)

    # Process chunks in batches to reduce API calls
    chunks.each_slice(20).with_index do |chunk_batch, batch_index|
      embeddings = generate_embeddings_batch(chunk_batch)

      chunk_batch.each_with_index do |content, index|
        DocumentChunk.create!(
          document: document,
          content: content,
          embedding: embeddings[index],
          position: (batch_index * 20) + index
        )
      end

      # Update progress (cap at 100 for the final, possibly partial batch)
      progress = [((batch_index + 1) * 20 / chunks.size.to_f * 100).round, 100].min
      update_progress(document, progress)

      # Brief pause between batches to stay under your OpenAI requests-per-minute quota
      sleep(0.02)
    end

    document.update!(indexed_at: Time.current)
  end

  private

  def generate_embeddings_batch(texts)
    client = OpenAI::Client.new(access_token: ENV['OPENAI_API_KEY'])

    response = client.embeddings(
      parameters: {
        model: 'text-embedding-3-small',
        input: texts
      }
    )

    response['data'].map { |d| d['embedding'] }
  end

  def update_progress(document, progress)
    # Store in Redis for real-time UI updates
    Redis.new.setex(
      "document:#{document.id}:progress",
      1.hour.to_i,
      progress
    )
  end
end

Queue the job when creating documents:

class Document < ApplicationRecord
  after_create :enqueue_embedding_job

  private

  def enqueue_embedding_job
    EmbeddingJob.perform_later(id)
  end
end

Monitor job progress via API:

# app/controllers/api/documents_controller.rb
def indexing_status
  document = Document.find(params[:id])
  progress = Redis.new.get("document:#{document.id}:progress") || 0

  render json: {
    indexed: document.indexed_at.present?,
    progress: progress.to_i
  }
end
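
The action above assumes an Api::DocumentsController; a matching member route (the naming is illustrative) keeps it under the existing namespace:

# config/routes.rb
namespace :api do
  resources :documents, only: [] do
    get :indexing_status, on: :member
  end
end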

Performance Tuning #

Add indexes to speed up vector similarity queries:

# db/migrate/20251016000004_add_vector_indexes.rb
class AddVectorIndexes < ActiveRecord::Migration[7.1]
  def up
    # IVFFlat index for approximate nearest neighbor search
    # Lists: sqrt(total_rows) is a good starting point
    execute <<-SQL
      CREATE INDEX idx_document_chunks_embedding ON document_chunks
      USING ivfflat (embedding vector_cosine_ops)
      WITH (lists = 100);
    SQL

    # Analyze the table to populate index statistics
    execute 'ANALYZE document_chunks;'
  end

  def down
    execute 'DROP INDEX idx_document_chunks_embedding;'
  end
end

IVFFlat indexes trade accuracy for speed. They partition vectors into clusters (lists), then search only relevant clusters. More lists = better accuracy but slower search.
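
Recall can also be tuned at query time with pgvector’s ivfflat.probes setting (the default scans a single cluster per query); raising it per session trades speed for accuracy:

# Scan more clusters per query for better recall (per-connection setting)
ActiveRecord::Base.connection.execute('SET ivfflat.probes = 10')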

Monitor query performance:

# app/services/vector_search_service.rb
class VectorSearchService
  def search(query, limit: 5)
    start_time = Time.current

    results = perform_search(query, limit)

    duration_ms = ((Time.current - start_time) * 1000).round
    Rails.logger.info("Vector search completed in #{duration_ms}ms")

    results
  end

  private

  def perform_search(query, limit)
    # Previous search implementation
  end
end

Optimize chunk retrieval counts based on your use case (a tuning sketch follows the list):

  • High precision needed: Retrieve fewer chunks (3-5), use higher similarity threshold (0.8+)
  • High recall needed: Retrieve more chunks (10-15), use a lower similarity threshold (around 0.6)
  • Balanced approach: 5-7 chunks, threshold 0.7
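
One way to make these knobs easy to experiment with is to expose the threshold as a keyword argument. A minimal sketch, assuming the VectorSearchService from above (the earlier version hardcodes SIMILARITY_THRESHOLD):

# app/services/vector_search_service.rb (sketch: make the threshold tunable)
def search(query, limit: 5, threshold: SIMILARITY_THRESHOLD)
  query_embedding = @embedding_service.generate(query)
  vector = "[#{query_embedding.join(',')}]"

  DocumentChunk
    .select('document_chunks.*', "1 - (embedding <=> '#{vector}') AS similarity_score")
    .where('1 - (embedding <=> ?) > ?', vector, threshold)
    .order('similarity_score DESC')
    .limit(limit)
    .includes(:document)
end

question = 'How do I speed up slow queries?'

# High precision: fewer chunks, stricter threshold
VectorSearchService.new.search(question, limit: 4, threshold: 0.8)

# High recall: more chunks, looser threshold
VectorSearchService.new.search(question, limit: 12, threshold: 0.6)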

Real-World Example: Document Q&A System #

Let’s walk through a complete example using JetThoughts’ development documentation:

# Seed development documentation
Document.create!(
  title: 'Rails Performance Best Practices',
  content: 'Use database indexes for frequently queried columns. Implement caching with Redis. Optimize N+1 queries with includes()...'
)

Document.create!(
  title: 'Testing Strategy',
  content: 'Write integration tests first. Use RSpec for behavior-driven development. Mock external APIs in tests...'
)

# Wait for background jobs to process
sleep(5)

# Query the system
rag = RagService.new
answer = rag.query('How should I improve my Rails application performance?')

puts answer
# Output: "To improve Rails application performance, consider these approaches:
# 1. Add database indexes to frequently queried columns
# 2. Implement caching using Redis for expensive operations
# 3. Optimize N+1 queries using includes() or preload()
# These techniques are mentioned in the Rails Performance Best Practices documentation."

Performance metrics for this system:

  • Embedding generation: ~100ms per chunk
  • Vector search: 10-50ms (with IVFFlat index)
  • LLM generation: 1-3 seconds
  • Total query time: 1.5-4 seconds
  • Cost per query: ~$0.002 (OpenAI API pricing)

Troubleshooting Common Issues #

Embeddings returning null vectors #

Cause: Empty or whitespace-only text.

Solution: Validate content before generating embeddings:

def generate_embedding
  return if content.blank?
  embedding = EmbeddingService.new.generate(content)
  update_column(:embedding, embedding)
end

Vector search returns no results #

Cause: Similarity threshold too high or no indexed documents.

Solution: Lower threshold or check document indexing status:

# Check indexing
Document.where(indexed_at: nil).count

# Test with lower threshold
VectorSearchService::SIMILARITY_THRESHOLD = 0.5

Slow query performance #

Cause: Missing indexes or too many chunks.

Solution: Add IVFFlat index and adjust chunk size:

# Check if index exists
ActiveRecord::Base.connection.execute(
  "SELECT * FROM pg_indexes WHERE tablename = 'document_chunks'"
)

# Reduce chunk count
ChunkService::CHUNK_SIZE = 500 # Larger chunks = fewer total chunks

OpenAI API rate limits #

Cause: Too many concurrent requests.

Solution: Throttle or back off in background jobs:

class EmbeddingJob < ApplicationJob
  # ActiveJob classes don't accept sidekiq_options or a :rate key. Throttle at
  # the queue level (Sidekiq Enterprise rate limiting or the sidekiq-throttled
  # gem), or back off and retry on HTTP 429 responses, assuming the OpenAI
  # client is configured to raise Faraday errors:
  retry_on Faraday::TooManyRequestsError, wait: :polynomially_longer, attempts: 5
end

Conclusion #

You’ve built a production-ready RAG system that combines PostgreSQL’s vector capabilities with OpenAI’s language models. Your Rails app can now understand semantic queries, retrieve relevant context, and generate accurate answers based on your documentation.

Key takeaways:

  • pgvector eliminates the need for external vector databases
  • Chunking documents improves retrieval precision
  • Caching and background jobs optimize performance
  • Proper indexing makes vector search fast enough for production

Next Steps #

Enhance retrieval quality:

Improve answer generation:

  • Stream responses with ActionCable for better UX
  • Add citation tracking (which chunks informed the answer); a starting-point sketch follows this list
  • Implement conversation history for multi-turn dialogues
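
For citation tracking, one approach is to return the retrieved chunks alongside the generated answer so the UI can link back to source documents. A minimal sketch (query_with_sources is a hypothetical addition to the RagService from Part 2):

# app/services/rag_service.rb (hypothetical extension)
def query_with_sources(question)
  chunks = @vector_search.search(question, limit: 3)
  return { answer: no_context_response, sources: [] } if chunks.empty?

  context = chunks.map(&:content).join("\n\n")
  response = @openai.chat(
    parameters: {
      model: 'gpt-4-turbo-preview',
      messages: build_messages(context, question),
      temperature: 0.3
    }
  )

  {
    answer: response.dig('choices', 0, 'message', 'content'),
    sources: chunks.map { |c| { document_id: c.document_id, position: c.position } }
  }
end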

Scale for production:

  • Monitor vector index performance with pganalyze
  • Implement query analytics with Ahoy
  • Add Sidekiq Pro for better job management

Have questions about implementing RAG in your Rails app? Contact JetThoughts for consulting and development services.


Source code: Complete working examples available at github.com/jetthoughts/rails-rag-pgvector-example