Overview
This project explores sentence-transformer embeddings for Ancient
Greek biblical verses, enabling semantic similarity, clustering, and
search across verse variants. The Variant SBERT is fine-tuned from
Ancient-Greek-BERT, and the S3BERT variant in turn builds on the
Variant SBERT. Both models were trained on a single NVIDIA A100 80GB
PCIe GPU using data from the Ancient Greek New Testament corpora on
Zenodo. These corpora support distant-reading analyses of textual
variation by providing a corpus of biblical names with their spelling
variants and inflections, alongside their manuscript mentions.
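As a quick illustration, here is a minimal usage sketch with the
sentence-transformers library; the model ID below is a placeholder,
since the released checkpoint name is not given here.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model ID: substitute the released Variant SBERT checkpoint.
model = SentenceTransformer("your-org/ancient-greek-variant-sbert")

# Two spelling variants of the same verse opening (preprocessed:
# accents stripped, lowercased).
verses = ["εν αρχη ην ο λογος", "εν αρχηι ην ο λογος"]

embeddings = model.encode(verses, convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```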
Why This Matters
Biblical manuscripts exist in many variant forms, with differences in
spelling and word order, as well as omissions and additions, that
accumulated over centuries of copying. For scholars studying textual
transmission, manually comparing thousands of verse variants is
impractical. These models aim to enable computational approaches:
clustering related variants, detecting near-duplicates across
manuscripts, and searching for semantically similar passages even
when surface forms differ.
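As a sketch of the near-duplicate use case, the community detection
utility in sentence-transformers can group verses whose embeddings
exceed a similarity threshold; the model ID and threshold below are
illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model ID: substitute the released Variant SBERT checkpoint.
model = SentenceTransformer("your-org/ancient-greek-variant-sbert")

# Illustrative, preprocessed verse variants.
variants = [
    "εν αρχη ην ο λογος",            # John 1:1 opening
    "εν αρχηι ην ο λογος",           # spelling variant of the same verse
    "και ο λογος ην προς τον θεον",  # a different clause
]

embeddings = model.encode(variants, convert_to_tensor=True)

# Clusters of indices whose pairwise cosine similarity exceeds the
# threshold; each cluster collects likely variants of the same verse.
clusters = util.community_detection(embeddings, threshold=0.9, min_community_size=2)
for cluster in clusters:
    print([variants[i] for i in cluster])
```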
Ancient Greek Variant SBERT
The Variant SBERT is trained with Multiple Negatives Ranking Loss, mapping verses into
a 768-dimensional space. During training, verses sharing the same NKV
identifier (indicating the same underlying verse) form positive pairs,
while other verses in the batch serve as in-batch negatives. This approach
requires no explicit negative pairs. The model learns to pull variants
together and push unrelated verses apart, making it robust to spelling
variation in biblical Greek. The released checkpoint is available on Hugging Face and in the TextVariant Explorer.
Ancient Greek Variant S3BERT (Gender)
This model augments the Variant SBERT with interpretable
sub-embeddings, following the S3BERT approach. It uses a
teacher-student distillation setup: the Variant SBERT
acts as teacher, and the student learns to preserve overall
similarity while dedicating a small subspace to an interpretable
feature. The gender sub-embedding is trained on scores derived from
word-level gender annotations, reflecting gender similarity between
sentences: sentences with similar gendered language score higher,
while mixed or opposing gendered phrasing scores lower. There is no
model release yet, but results can be reproduced from the
repository.
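A simplified sketch of the two training signals is shown below; the
subspace size, the equal weighting, and the tensor shapes are
assumptions for illustration, not the exact objective used in the
repository.

```python
import torch
import torch.nn.functional as F

SUB_DIM = 16  # assumed size of the interpretable gender subspace

def distillation_loss(student_a, student_b, teacher_a, teacher_b, gender_score):
    """student_*/teacher_*: (batch, 768) embeddings for a sentence pair;
    gender_score: (batch,) target similarity from word-level gender annotations."""
    # Consistency: the student's overall similarity should match the teacher's.
    consistency = F.mse_loss(
        F.cosine_similarity(student_a, student_b),
        F.cosine_similarity(teacher_a, teacher_b),
    )
    # Feature: the gender sub-embedding's similarity should track the
    # annotation-derived gender score.
    feature = F.mse_loss(
        F.cosine_similarity(student_a[:, :SUB_DIM], student_b[:, :SUB_DIM]),
        gender_score,
    )
    return consistency + feature  # equal weighting assumed

# Smoke test with random tensors.
a, b = torch.randn(8, 768), torch.randn(8, 768)
print(distillation_loss(a, b, torch.randn(8, 768), torch.randn(8, 768), torch.rand(8)))
```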
Limitations
Both models are optimized specifically for biblical Greek and may
have reduced performance on other genres. Input text requires
preprocessing (accent stripping and lowercasing) for best results.
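One common normalization is sketched below with Python's unicodedata
module; the exact preprocessing used in training may differ.

```python
import unicodedata

def preprocess(text: str) -> str:
    """Strip diacritics and lowercase polytonic Greek."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()

print(preprocess("Ἐν ἀρχῇ ἦν ὁ Λόγος"))  # εν αρχη ην ο λογος
```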
Future Work
The S3BERT model is experimental and serves as a proof of concept
for interpretable biblical verse embeddings. Future work could
improve gender score supervision and add feature vectors beyond
gender, such as sentiment.
See the original S3BERT implementation and the base Ancient-Greek-BERT model.