How do we measure the accuracy of search results?

starting judgements list

By Bonny Broom Research | Friday, May 29, 2026

This week, I started building our judgements list for Videlicet. A judgement list records which pages are relevant or irrelevant to a given search. We will soon use it to test the accuracy of our semantic search engine. In our move to open-weight models, judgements allow us to track the refinement of our training. It’s a tedious task! I evaluated some basic search terms that a historian of 18th-century North America would use. When evaluating the search results, I marked pages containing synonyms and different spellings as relevant and marked noise as irrelevant. In the coming weeks, I will increase the scope to phrases and natural-language search terms.

As part of the process, I grappled with how wide to cast the net of semantic search for each of the terms. This is where my history and digital humanities training kicks in! I spent a lot of my week thinking about what scholars expect when interacting with historical databases and how the flow of research works.

For the search term “Chickasaw,” Videlicet pulled documents containing the word and its various spellings. It also pulled documents containing names of Chickasaw chiefs, which was pretty cool too! However, it also brought up pages for the Cherokees, Catawbas, Creeks, and other Indigenous nations. I determined that these pages were not relevant to a “Chickasaw” search because of the term’s specificity. If a historian uses that term, chances are they exclusively want results for Chickasaw peoples.
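To make the idea concrete, here is a minimal Python sketch of how judgements like the ones above could be recorded and then used to score a set of search results. It is only an illustration, not Videlicet's actual format: the `Judgement` class, the page IDs, and the choice of precision at k (the share of the top k results judged relevant) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """One human decision: is this page relevant to this search term?"""
    query: str
    page_id: str   # hypothetical identifier, not a real Videlicet ID
    relevant: bool


# Hypothetical entries mirroring the "Chickasaw" decisions described above.
JUDGEMENTS = [
    Judgement("Chickasaw", "page-001", True),   # contains the word itself
    Judgement("Chickasaw", "page-002", True),   # contains a variant spelling
    Judgement("Chickasaw", "page-003", True),   # names a Chickasaw chief
    Judgement("Chickasaw", "page-004", False),  # about the Cherokees
    Judgement("Chickasaw", "page-005", False),  # about the Creeks
]


def precision_at_k(query, ranked_page_ids, judgements, k):
    """Fraction of the top-k results judged relevant.

    Pages with no judgement are counted as not relevant.
    """
    relevant = {j.page_id for j in judgements if j.query == query and j.relevant}
    top_k = ranked_page_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for page in top_k if page in relevant) / len(top_k)


# A made-up ranking returned by the search engine for "Chickasaw".
results = ["page-001", "page-004", "page-003", "page-005"]
score = precision_at_k("Chickasaw", results, JUDGEMENTS, k=4)
print(score)  # 2 of the 4 results are judged relevant, so this prints 0.5
```

Running the same judgements against the results before and after a round of training would give a simple number to compare, which is the kind of tracking the list is meant to support.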

Looking past judgements, our next step is to build a fine-tuning list. We’re training the open-weight model to think like a historian!