PostgreSQL, vectors, and Clojure

Like every other startup, I'm adding vectors and embeddings to all of my products to capitalise on the cumulative creative efforts of humanity.

As a lover of simple, boring technology, PostgreSQL often appears in my products, and thankfully, some very generous people have donated their time to adding support for vectors to PostgreSQL via the excellent pgvector extension.

Installing extensions can be tricky on hosted platforms, but communities are asking for their outsourced ops teams to oblige, and some have already done so.

When working with PostgreSQL locally, more and more I find myself using devenv, which makes installing extensions and running a project-specific database instance a breeze. The following snippet is all it takes to get an embedding-ready instance running in your shell.

services.postgres.enable = true;

services.postgres.extensions = extensions: [
  extensions.pgvector
];

With that aside aside, we can get down to talking to our database from next.jdbc and HoneySQL. To make this work smoothly, we’ll need to complete the following steps:

  1. Teach the underlying JDBC layer about the new PGVector type
  2. Add vector operators to HoneySQL
  3. Write gnarly SQL queries with Clojure data structures

Teaching JDBC about vectors

This is relatively easy thanks to the pgvector team again, who provide a Java project that adds support for vectors to JDBC, Spring JDBC, Groovy SQL, and Scala’s Slick. In my case, I added the dependency to my project via tools.deps before extending my JDCB connection.

{:deps
 {com.github.seancorfield/honeysql  {:mvn/version "2.4.9999-SNAPSHOT"}
  com.github.seancorfield/next.jdbc {:mvn/version "1.3.894"}
  com.pgvector/pgvector             {:mvn/version "0.1.3"}}

You might be wondering why the bizarre snapshot version of HoneySQL. I ran into an issue with next.jdbc.prepare/SettableParameter when doing some interop with Python via libpython-clj. Sean — the absolute legend that he is — jumped on the problem after Tom helped me track the issue down to the use of contains? on a value that doesn’t support clojure.lang.RT/contains. That problem was fixed in version 2.5.1091 of HoneySQL back in October 2023.

With that second aside aside, we can extend our JDBC connection like so:

(ns jcf.embeds.your.work
  (:require
   [next.jdbc :as jdbc])
  (:import
   (org.pgvector PGVector)))

(PGvector/addVectorType
 (jdbc/get-connection
  (jdbc/get-datasource {:dbtype "postgres"})))

Adding vector operators to HoneySQL

As we’ll need the <-> and <#> operators, we use HoneySQL’s register-op! to grow a private map of tricks.

(require '[honey.sql :as sql])

(sql/register-op! :<->)
(sql/register-op! :<#>)

All of the stateful setup logic above is something I’d normally take care of via Alessandra Sierra’s Component. This is possible when extending the JDBC connection, but as there’s no way to remove registered operators, I opted to add my keywords at compile time.

SQL strings from data

Assuming you have your operators in place, it should now be possible to generate a bit of SQL using a query and embeddings.

(require '[honey.sql :as sql])

(def query
  {:select [:content]
   :from   :documents
   :where  [[[:<-> :documents.embedding :?embedding] :asc]]})

(sql/format query {:params {:embedding embdding}})

I used the following bit of SQL to setup my documents table.

CREATE TABLE IF NOT EXISTS documents (
    id bigserial PRIMARY KEY,

    -- This could be the primary key, but any self-respecting software engineer
    -- knows to optimise for scale one will never reach.
    hash uuid NOT NULL UNIQUE,

    content text NOT NULL,

    -- Our sentence-transformers model outputs a 384-dimensional vector.
    --
    -- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
    embedding vector(384)
);

CREATE INDEX IF NOT EXISTS documents_content_idx ON documents USING GIN (to_tsvector('english', content));
CREATE INDEX IF NOT EXISTS documents_hash_idx ON documents USING BRIN (hash);

The reason for the additional columns is self-evident to me, but as you can’t read my mind, I store both the original content and a cryptographic hash of the content to ease retrieval and ongoing maintenance. This is a toy system, of course, and with paying customers a little more thought is likely required but you’re not paying me to design your database schema, so enjoy getting what you pay for!

Until next time, keep on calculating similarity using trigonometric functions!