
What does quality actually mean for automated patent search?

When agentic AI tools entered the patent search market, quality became the topic everyone wanted to talk about, and one of the hardest to discuss honestly. Impressive-looking outputs are easy to generate, but outputs you can actually trust take a different kind of work. IPRally's co-founder and CPO Sakari Arvela breaks down what quality means for Agent, and shares the data to back it up.

Quality is multi-dimensional, by definition

Quality is all about solving the task at hand: getting the information that allows you to make the decision you are facing. That decision can be, for example, whether to file a patent, oppose a patent, or change R&D or business plans.

Each use case comes with its own requirements and its own level of confidence needed. The investment decisions behind filing a patent and building a new factory, for example, are hugely different. The former can be made on the basis of a quick-and-dirty novelty search, whereas the latter requires a Freedom-to-Operate study with extra-high confidence. In an invalidation case, a single killer hit missed by the examiners may make all the difference – but you don't know beforehand whether such a hit even exists.

When you combine the available resources and time window, the professional knowledge and intuition involved in decision-making, and the organization's risk tolerance, it becomes clear that quality is a multi-dimensional concept.

The fundamental question then becomes: How do we achieve and consistently maintain the level of quality that effectively solves the problem and supports sound decision-making? This is the core challenge we address in the rest of this blog post.

Three dimensions we keep separating

When we talk internally about Agent’s quality, we keep splitting it into three things that are easy to mix up:

  • Invention comprehensiveness. How much of the invention has actually been searched for? A human searcher under time pressure has to choose a few angles and compromise on the rest, so there is always a risk of “wasting the search” on the wrong focus. An AI Agent can take in all the details and run targeted searches for each, so the compromise shrinks.
  • Search quality. Of everything in the database, how much of the truly relevant prior art ends up in the result set? This is where retrieval performance lives.
  • Analysis quality. For each retrieved document, how reliable is the relevance assessment? LLM analysis is not perfect, but unlike a human report, every document has been analysed, and every assessment is open to human verification.

Aiming for perfection on all three is a fine ambition, but 100% is rarely required for informed decision-making. What you need is a level of quality calibrated to the use case, and the ability to see where that level sits.

Verifiability is the part nobody talks about

When patent professionals ask me whether Agent is “good enough”, they’re usually asking three questions stacked on top of each other. Will it miss important prior art? Is the relevance assessment accurate, or just plausible-sounding? And – the one I find most interesting – if it gets something wrong, how would I know?

That third question can’t be answered with a benchmark number. It has to be answered by how the system is built.

This is the distinction we kept coming back to while building Agent: quality you are told about versus quality you can verify for yourself. For people who evaluate evidence for a living, only the second kind builds trust.

So we made a deliberate choice early on: full transparency into every step of the reasoning is a requirement, not a feature. Every search case Agent creates can be opened and inspected. Every claim-derived feature comes with a YES, MAYBE, or NO and the AI’s explanation for that answer. Nothing collapses into a confidence score or a generated summary. The output gives you the material to reach your own conclusion.

We sometimes call this a glass box rather than a black box. It’s not a new conviction. The search engine inside Agent is the same proprietary AI we have been training on patent examiner citations and refining in real-world use for the past eight years – the technology that hundreds of organisations already rely on every day. Transparency has been part of the system from the beginning.

What the performance data shows

With that framing out of the way, let me get specific.

We evaluated Agent on real patent search cases across a range of patent classes, using examiner novelty citations as ground truth. This is the industry standard for this kind of evaluation.
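To make those metrics concrete, the sketch below scores a single case against its examiner citations. The publication numbers and the helper function are invented for illustration; this is not IPRally's evaluation code, just the standard recall and precision calculation it refers to.

```python
# Minimal sketch: scoring one search case against examiner novelty citations.
# Publication numbers and helper names are invented for illustration only.

def recall_and_precision(retrieved: list[str], citations: set[str]) -> tuple[float, float]:
    """Recall: share of examiner citations retrieved. Precision: share of results that are citations."""
    hits = [doc for doc in retrieved if doc in citations]
    recall = len(hits) / len(citations) if citations else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# A top-25 result list checked against three known novelty citations.
results = ["EP1234567", "US2019123456", "WO2020111222"] + [f"D{i:02d}" for i in range(22)]
known = {"EP1234567", "WO2020111222", "US2015999888"}
print(recall_and_precision(results, known))  # (0.666..., 0.08)
```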

Two layers of Agent’s architecture each measurably improve on a standard single search.

The first is retrieval. Instead of running one search with the full patent text, Agent extracts individual claims, runs targeted searches for each, and combines the results. On a 50-case multi-claim evaluation, this gave us +14% recall and +11% precision over a regular search, and retrieved roughly 2.7× more unique documents at the same search depth. More of the right documents, not just more documents.
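The retrieval idea can be sketched in a few lines: one targeted search per claim, with the results merged and deduplicated before ranking. The search callable and the toy corpus below are stand-ins for the sketch, not Agent's actual pipeline.

```python
# Sketch of multi-claim retrieval: run one search per claim, then combine the
# unique documents. The search backend here is a toy stand-in, not Agent's engine.

from typing import Callable

def multi_claim_retrieve(claims: list[str],
                         search: Callable[[str, int], list[str]],
                         depth: int = 25) -> list[str]:
    """Combine per-claim result lists, keeping each document once in first-seen order."""
    seen: dict[str, None] = {}
    for claim in claims:
        for doc in search(claim, depth):
            seen.setdefault(doc, None)  # dedupe across the per-claim searches
    return list(seen)

# Toy demo: a three-document "corpus" and a keyword stand-in for the search engine.
corpus = {"D1": "valve seal", "D2": "valve actuator", "D3": "seal coating"}
stub = lambda q, n: [d for d, text in corpus.items() if q in text][:n]
print(multi_claim_retrieve(["valve", "seal"], stub))  # ['D1', 'D2', 'D3']
```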

The second is ranking. After retrieval, Agent evaluates each document against AI-generated relevance questions derived from the claims – the same feature-level review IPRally users already rely on every day – and sorts results into High, Medium, and Low relevance tiers. In the High Relevance tier, typically around eight documents, Agent achieves +16% recall, +18% precision, and +35% better ranking compared to a single search returning the same number of results.
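Conceptually, the tiering step works on those per-feature answers. The sketch below is only an illustration of the idea: the weights and thresholds are invented for the example and are not Agent's actual ranking logic.

```python
# Illustration of sorting documents into relevance tiers from per-feature
# YES / MAYBE / NO answers. The weights and thresholds are invented for the example.

def tier(feature_answers: list[str]) -> str:
    """Assign a tier from the AI's answers to the claim-derived relevance questions."""
    yes = feature_answers.count("YES")
    maybe = feature_answers.count("MAYBE")
    score = (yes + 0.5 * maybe) / len(feature_answers)
    if score >= 0.8:
        return "High"
    if score >= 0.4:
        return "Medium"
    return "Low"

print(tier(["YES", "YES", "YES", "MAYBE"]))  # High   (score 0.875)
print(tier(["YES", "YES", "NO", "NO"]))      # Medium (score 0.5)
print(tier(["YES", "MAYBE", "NO", "NO"]))    # Low    (score 0.375)
```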

In practical terms: the High Relevance tier alone captures 55% of the relevant documents you would find in a top-25 single search, in roughly a third of the documents. High and Medium together capture 82% of them in roughly half the documents. A reviewer working through the High Relevance tier reaches the most important prior art earlier, with less noise around it, and with a clear explanation attached to each result. Fully automated, in less than ten minutes.

The ceiling every retrieval system has

The 50-case multi-claim evaluation above is one slice; a larger evaluation is in progress and we will share those results when they are ready.

There is also a ceiling built into any retrieval-based system, ours included. Agent can only re-rank what it first retrieves. If a relevant document does not make it into the initial retrieval set, no amount of ranking will surface it. Manual Boolean search has the same constraint: it is bounded by the queries the searcher constructs. Agent's multi-claim strategy raises that ceiling considerably, but no approach removes it entirely.

That is also why I keep saying Agent is a first-pass tool. Depending on the problem at hand, it is sometimes also the final answer, and you can tell which when you look at the results. The feature chart, the relevance tiers, the written report – they are designed as a structured starting point that hands off cleanly into a workflow where a human expert continues the work. To me, that is the honest framing of where automated patent search currently sits.

Even when you decide to continue manually, you always get a consistent starting point with a lot of the heavy lifting done for you, and a Project to continue working on.

What beta users told us

Agent was tested in closed beta with patent professionals at organisations across automotive, medtech, energy, and consumer goods, as well as at patent offices. The tests were carried out on real cases, not test data.

The feedback was encouraging. Perceived value averaged 4.5 out of 5. Time savings were described as “85% time saved” and “enormous.” Beta users who ran Agent next to the tools they were already paying for consistently rated IPRally favourably on search quality.

What I find most telling is not the score itself, but the pattern behind it: experienced reviewers arrived at the same conclusions they had reached manually, in a fraction of the time, with the reasoning right there to verify.

Is this the future of search?

I believe so. Agent makes the expert better and the novice an expert, and there is no good reason to begin a novelty or invalidity search from a blank Boolean query when an Agent can hand you a structured, explainable first pass in ten minutes. Over time, I expect the natural starting point for most searches to shift, and the interesting work – judgement, strategy, decisions – to move to where it belongs.

A more useful question to ask

“Is Agent good enough?” is hard to answer in the abstract. The better question is: can I verify whether Agent is good enough for this specific matter?

The most direct way to answer that is to run Agent on a case you have already completed. Open the search cases it creates. Read the reasoning behind the features it flagged. Check whether the High Relevance documents align with what your own search produced.
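If you want to make that comparison systematic, a few lines of scripting are enough. The publication numbers below are placeholders for your own case data, not real results.

```python
# Sketch of comparing Agent's High Relevance tier with a search you already completed.
# The publication numbers are placeholders; substitute your own case data.

agent_high = {"EP1111111", "US2018222333", "WO2019000444", "EP5555555"}
my_search = {"EP1111111", "WO2019000444", "US2012777888"}

print("Found by both:       ", sorted(agent_high & my_search))
print("Only in Agent's tier:", sorted(agent_high - my_search))
print("Only in my search:   ", sorted(my_search - agent_high))
```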

The quality is in the data. What Agent found, and why it found it, is always available to the person reviewing the output. That, to us, is what verified novelty intelligence means in practice – a workflow where the professional is always in a position to see the work and make their own judgement about it.

If you want to put that to the test, run Agent on a case you have already completed and compare the results yourself. You can get started at iprally.com/agent.

Sakari Arvela
April 28, 2026
5 min read