Exploring new documents and open-weight LLMs

Working with ledgers, transitioning to an open-weight model, pushing Gemini 3.5

After a few weeks off, we are back at it with some new historical materials! This week, I started working with historical ledgers. As an undergraduate, I specialized in colonial American newspapers, court records, and government documents, so this is a big change! I’ve been studying the format and reading about historical bookkeeping methods (pictured). This research is crucial for our evaluation process. To check the AI’s work, I manually transcribe a subset of documents. Then we compute the Levenshtein distance between the AI and human transcriptions: the number of single-character insertions, deletions, and substitutions needed to turn one into the other. The results show us where the AI is weak, so we can iterate on our prompts and improve its transcriptions. But first, I have to understand how ledgers are formatted in order to write an accurate transcription myself. More updates to come on this exciting new project!
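
For readers curious about the comparison step, here is a minimal sketch of Levenshtein distance in plain Python. It is not our production code: the function names, the character error rate helper, and the sample ledger line are all made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # Keep only the previous row of the edit-distance table in memory.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(human: str, machine: str) -> float:
    """Edit distance divided by the length of the human transcription.
    (A hypothetical helper; one common way to normalize the distance.)"""
    if not human:
        return 0.0 if not machine else 1.0
    return levenshtein(human, machine) / len(human)


# Invented example line, not taken from a real ledger.
human = "To cash paid for sundries"
machine = "To cash payd for sundrys"
print(levenshtein(human, machine))
print(character_error_rate(human, machine))
```

A lower score means the AI transcription is closer to the human one. Dividing by the length of the human transcription makes scores comparable across short and long ledger entries.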

We are moving away from proprietary LLMs and toward open-weight models. In the past few weeks, we began training an open-weight model for full-text and semantic search.

A proprietary LLM is closed and not customizable. We cannot access a proprietary LLM’s training data, add our own, or refine its capabilities. It’s just not a good fit!

All LLMs have a present-day bias because they are trained mostly on modern data. As a result, they are not equipped for historical documents that contain different word spellings, meanings, and relationships. Open-weight models accommodate our specific needs more flexibly. With an open-weight model, we can change the embeddings: the numerical representations of word meanings and of the relations between words. We fine-tune these embeddings to encode historical data directly into the LLM. Equipped with this knowledge, the AI will interface with documents like a trained historian! Fine-tuning an open-weight model increases search accuracy and decreases irrelevant results (another common problem with proprietary LLMs).

But how do we know that fine-tuning is improving search results? That is where the judgements come in! Judgements evaluate search results and give the AI feedback. Was that result relevant or irrelevant? Why? Fine-tuning and judgements help narrow and sharpen Videlicet into a useful scholarly tool.
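
To make the idea concrete, here is a small sketch of one common way to turn relevance judgements into a score: precision at k, the share of the top k results judged relevant. This is an illustration, not a description of our actual pipeline. The function name and the judgement lists are invented for the example.

```python
def precision_at_k(judgements: list[bool], k: int) -> float:
    """Fraction of the top k search results judged relevant.

    judgements[i] is True if the result at rank i + 1 was judged relevant.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Ranks beyond the end of the list count as not relevant.
    return sum(judgements[:k]) / k


# Made-up judgements for the same query before and after fine-tuning.
before_fine_tuning = [True, False, False, True, False]
after_fine_tuning = [True, True, False, True, True]

print(precision_at_k(before_fine_tuning, k=5))
print(precision_at_k(after_fine_tuning, k=5))
```

If the score rises across a set of test queries after fine-tuning, that is evidence the model is returning more relevant results.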

However, we haven’t completely moved away from proprietary LLMs. This week, we tuned and pushed the new thinking model, Gemini 3.5 Flash, into production for transcriptions. We’re already seeing improved results! We work to stay on the cutting edge of technological advancement and to make AI tools accessible and useful for working scholars.
