Embedding Python embeddings

While Python sees the most significant investment by the machine-learning community in the 2020s, the foundations of AI were symbolic and largely Lisp-based. Many have attempted to explain this progression, and while history is always fascinating, today’s learning comes from wanting a better language than Python that can benefit from all the brilliant work being done by the snake-loving masochists.

I recently needed to produce embeddings and wanted to go the open-source route as opposed to building another closed wrapper on top of OpenAI’s API. For this, I opted for all-MiniLM-L6-v2, a Sentence Transformers model fine-tuned on more than a billion sentence pairs, including Reddit comments, WikiAnswers, and Stack Exchange title-answer pairs.

Interacting with the Python ecosystem from Clojure is trivial, thanks to the formidable work of the venerable Chris Nuernberger. Much as adding Clojure to a Java project is a mere Maven dependency away, Chris has made Python a dependency of our Clojure projects via libpython-clj. What a time to be alive!

{:deps {clj-python/libpython-clj {:mvn/version "2.025"}}}

This also works from plain old Java, so now we enterprise JVM-sorts can send HTTP requests to OpenAI, too!

Using Sentence Transformers means adding a Python dependency, which libpython-clj cannot help us with, so I naturally leant on Nix and Poetry to take the Python dependency pain away. And with devenv, it’s even easier.

languages.python.enable = true;
languages.python.poetry.enable = true;
languages.python.poetry.activate.enable = true;
languages.python.poetry.install.enable = true;
languages.python.version = "3.11";

Note I had to stick to Python 3.11 because at the time pyarrow didn’t work with 3.12, which prevented me from smuggling eggs.

My pyproject.toml included a lot more than just Sentence Transformers.

[tool.poetry]
name = "tragedy"
version = "9.9.9"
description = "AI and the tragedy of the commons."
authors = ["James Conroy-Finn <github@invetica.co.uk>"]

[tool.poetry.dependencies]
python = ">=3.11,<3.12"
pandas = "^2.1.1"
pyarrow = "^13.0.0"
faker = "^19.9.0"
scipy = "^1.11.3"
sentence-transformers = "^2.2.2"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

After some time warming eggs to the point of hatching, my laptop was becoming more intelligent, able to convert words into long lists of numbers.

A little glue code in Clojure made interacting with Python a doddle, after a brief pause to download the MiniLM model.

(ns tragedy.sentence
  (:require
   [libpython-clj2.python :as py]
   [libpython-clj2.require :refer [require-python]]))

(require-python
 '[numpy :as np]
 '[sentence_transformers :refer [SentenceTransformer]])

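;; Downloads the model on first use, then loads it from the local cache.
(def model
  (SentenceTransformer "sentence-transformers/all-MiniLM-L6-v2"))

(defn encode-one
  "Returns the embedding for a single string. The model's encode method
  takes a batch, so wrap the string in a vector and take the first row."
  [string]
  (first (py/py. model encode [string])))

(defn encode
  "Returns a map from each string to its embedding."
  [strings]
  (zipmap strings (py/py. model encode strings)))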
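
For anyone reading from the snake side of the fence, the glue above amounts to roughly the following Python. Treat it as a sketch rather than code from this project: `load_model` is a hypothetical helper that defers the import so the rest can run without the package installed, and the two functions take the model as an argument (an assumption made here so they work with any object exposing an `encode` method).

```python
from typing import Any, Sequence


def load_model(name: str = "sentence-transformers/all-MiniLM-L6-v2") -> Any:
    """Load a Sentence Transformers model (hypothetical helper for this sketch)."""
    # Imported lazily so the functions below can be used without the package.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def encode_one(model: Any, string: str) -> Any:
    """Embed one string by encoding a one-element batch and taking its only row."""
    return model.encode([string])[0]


def encode(model: Any, strings: Sequence[str]) -> dict:
    """Map each string to its embedding, mirroring zipmap in the Clojure version."""
    return dict(zip(strings, model.encode(list(strings))))
```

Calling `encode(load_model(), ["hello", "world"])` needs the `sentence-transformers` package installed and will pause to download the model the first time, just as the Clojure version does.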

The MiniLM model truncates any input beyond a fixed number of tokens (256 word pieces by default), and the vectors it produces have 384 dimensions, but you can trivially swap in other models. All of this comes with the combined power of Java, Python, and even JavaScript from a beautifully designed Lisp.
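
This post stops at producing the vectors, but a common next step is to compare them. A minimal sketch, assuming cosine similarity is the measure you want: the function name, the `EMBEDDING_DIM` constant, and the made-up vectors are all illustrative, and nothing here touches the model.

```python
import math
from typing import Sequence

EMBEDDING_DIM = 384  # the vector dimension quoted for the MiniLM model


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings (illustration only).
u = [1.0] * EMBEDDING_DIM
v = [1.0] * (EMBEDDING_DIM // 2) + [0.0] * (EMBEDDING_DIM // 2)
print(round(cosine_similarity(u, u), 6))
print(round(cosine_similarity(u, v), 6))
```

The function only iterates over its arguments, so the rows returned by the model's `encode` call should work as inputs too.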

Why learn one language and one ecosystem when you can learn them all‽