Ancient Greek Variant SBERTs

Illustration: two Greek-BERT characters in togas discussing Bible manuscripts spread across a table with scrolls.

A foundation for deeper understanding of Ancient Greek biblical verses, where meaning and variation can be traced with care.

Overview

This project explores sentence-transformer embeddings for Ancient Greek biblical verses, enabling semantic similarity, clustering, and search across verse variants. The Variant SBERT is fine-tuned from Ancient-Greek-BERT, and the S3BERT variant builds on the Variant SBERT. Both models were trained on a single NVIDIA A100 80GB PCIe GPU using data from the Ancient Greek New Testament corpora on Zenodo. This dataset supports distant-reading analyses of textual variation by providing a corpus of biblical names with spelling variants and inflections, alongside their manuscript mentions.
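As a minimal usage sketch, the sentence-transformers library can load the model and score verse similarity. The model id below is a placeholder, not the released checkpoint name; see the Hugging Face release for the actual identifier.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model id -- substitute the released checkpoint from Hugging Face.
model = SentenceTransformer("path/to/ancient-greek-variant-sbert")

# Two normalized (accent-stripped, lowercased) variants of the same verse.
verses = [
    "εν αρχη ην ο λογος",
    "εν αρχηι ην ο λογος",
]
embeddings = model.encode(verses, convert_to_tensor=True)

# Cosine similarity close to 1.0 indicates the variants are near-duplicates.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```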

Why This Matters

Biblical manuscripts exist in many variant forms, with differences in spelling, word order, omissions, and additions that accumulated over centuries of copying. For scholars studying textual transmission, manually comparing thousands of verse variants is impractical. These models aim to enable computational approaches: clustering related variants, detecting near-duplicates across manuscripts, and searching for semantically similar passages even when surface forms differ.
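As an illustration of the clustering use case, the following sketch groups verse variants by embedding distance. The model id, example verses, and distance threshold are illustrative assumptions, and the `metric` argument requires scikit-learn 1.2 or newer.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

model = SentenceTransformer("path/to/ancient-greek-variant-sbert")  # placeholder id

verse_texts = [
    "εν αρχη ην ο λογος",
    "εν αρχηι ην ο λογος",       # spelling variant of the first verse
    "και ο λογος σαρξ εγενετο",  # a different verse
]
embeddings = model.encode(verse_texts, normalize_embeddings=True)

# Group variants whose cosine distance falls under an illustrative threshold.
clusterer = AgglomerativeClustering(
    n_clusters=None, metric="cosine", linkage="average", distance_threshold=0.3
)
labels = clusterer.fit_predict(embeddings)
print(labels)  # variants of the same verse should share a cluster label
```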

Ancient Greek Variant SBERT

The Variant SBERT is trained with Multiple Negatives Ranking Loss, mapping verses into a 768-dimensional space. During training, verses sharing the same NKV identifier (indicating the same underlying verse) form positive pairs, while other verses in the batch serve as in-batch negatives. This approach requires no explicit negative pairs. The model learns to pull variants together and push unrelated verses apart, making it robust to spelling variation in biblical Greek. The released checkpoint is available on Hugging Face and in the TextVariant Explorer.
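A simplified training sketch under these assumptions follows; the base checkpoint id and pairing logic are illustrative, and the classic sentence-transformers fit API is shown rather than the project's actual training script.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative data: verses sharing an NKV identifier form (anchor, positive) pairs;
# every other verse in the batch serves as an in-batch negative.
train_examples = [
    InputExample(texts=["εν αρχη ην ο λογος", "εν αρχηι ην ο λογος"]),
    # ... one InputExample per pair of variants grouped by NKV identifier
]

model = SentenceTransformer("pranaydeeps/Ancient-Greek-BERT")  # assumed base checkpoint
train_loader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)
```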

Ancient Greek Variant S3BERT (Gender)

The S3BERT variant enhances the Variant SBERT with interpretable sub-embeddings following the S3BERT approach. It uses a teacher-student distillation setup: the Variant SBERT acts as teacher, and the student learns to preserve overall similarity while dedicating a small subspace to an interpretable feature. The gender sub-embedding is trained on scores derived from word-level gender annotations, reflecting gender similarity between sentences: sentences with similar gendered language score higher, while mixed or opposing gendered phrasing scores lower. There is no model release yet, but results can be reproduced from the repository.
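A minimal sketch of the distillation objective, assuming 768-dimensional student embeddings and a hypothetical 16-dimensional gender subspace; the real S3BERT decomposition and loss weighting are more involved, so consult the original implementation for the exact formulation.

```python
import torch
import torch.nn.functional as F

FEAT_DIM = 16  # hypothetical width of the gender sub-embedding

def s3bert_loss(student_a, student_b, teacher_sim, gender_score):
    """Combine a consistency term and a feature term for a batch of sentence pairs.

    student_a, student_b: student embeddings, shape (batch, 768)
    teacher_sim:          cosine similarities from the frozen Variant SBERT teacher
    gender_score:         target similarities derived from gender annotations
    """
    # Consistency: the full student embedding should reproduce the teacher's similarity.
    full_sim = F.cosine_similarity(student_a, student_b)
    consistency = F.mse_loss(full_sim, teacher_sim)

    # Feature: the dedicated subspace should track the gender similarity scores.
    feat_sim = F.cosine_similarity(student_a[:, :FEAT_DIM], student_b[:, :FEAT_DIM])
    feature = F.mse_loss(feat_sim, gender_score)

    return consistency + feature
```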

Limitations

Both models are optimized specifically for biblical Greek and may perform less well on other genres. Input text should be preprocessed (accents stripped, text lowercased) for best results, as sketched below.
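A minimal preprocessing helper using Unicode decomposition; this mirrors the stated accent stripping and lowercasing, though the exact normalization used in training may differ.

```python
import unicodedata

def normalize_greek(text: str) -> str:
    # Decompose characters so diacritics become combining marks (NFD),
    # drop the combining marks, then lowercase the result.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

print(normalize_greek("Ἐν ἀρχῇ ἦν ὁ λόγος"))  # -> εν αρχη ην ο λογος
```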

Future Work

The S3BERT model is experimental and serves as a proof of concept for interpretable biblical verse embeddings. Future work could improve the gender score supervision and add interpretable sub-embeddings beyond gender, such as sentiment. See the original S3BERT implementation and the base Ancient-Greek-BERT model for details.