There are tens of millions of English-language patent publications. You have an invention, and you need to know whether a patent publication already covers it. To compare your invention against all the existing patents, you clearly need computers. But how do you explain your invention to the machine?
With Google Patents and the like, you type in a few keywords. Generally, you get far too many irrelevant results, because a few keywords cannot define the invention completely. Oh, and you forgot a synonym or two, so you also missed half of the relevant hits.
Then there are tools where you simply give the computer a description of your invention in natural text. This is more reasonable: you can define the invention completely, and the text may be useful elsewhere as well. Recent progress in natural language processing has produced some amazing tools, and we are getting closer to the point where computers truly understand language.
Yet we do patent search with graphs. Why bother, when the fully textual approach is natural for humans and quickly becoming easier for computers? For the answer, we need to consider both UI and AI.
When you define an invention as a graph, you see the same structure the computer sees. With text, it is very hard to know what the clearest way to write for a computer is. Graphs also offer nice visualisation opportunities. Soon we will be able to pinpoint the relevant parts of the result graphs, so you can trust the relevance of a search result quickly, without reading 30 pages of patent text.
At the end of the day, the actual results are what will make the difference. In 2015, Stanford researchers got better results with a graph-based approach than the best LSTM benchmarks in two NLP tasks. It could have become a new general direction for NLP research, but the models were difficult to make fast enough. The situation has changed, though: tools have improved, and last autumn we solved the Tree-LSTM performance issues.
The graph format makes even more sense for patents than for general text. Research combining graphs and text has focused on dependency trees, where all the words are kept but the computer sees them in a semantic order. With patents, we don't need all the words. We parse only the technical core of the invention into the graph, and 30 pages of text becomes a graph of about 1,000 items.
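To make the idea concrete, here is a minimal sketch of what such an invention graph could look like. The node names, relation labels, and dict layout are purely illustrative assumptions, not the output of our actual parser:

```python
# A toy invention graph: nodes are technical features, edges are
# labelled relations between them. (Illustrative structure only.)
invention_graph = {
    "nodes": {
        1: "bicycle",
        2: "frame",
        3: "carbon fibre",
        4: "electric motor",
    },
    "edges": [
        (1, "has-part", 2),   # the bicycle has a frame
        (2, "made-of", 3),    # the frame is made of carbon fibre
        (1, "has-part", 4),   # the bicycle has an electric motor
    ],
}

# A 30-page document collapses into a compact structure like this:
print(len(invention_graph["nodes"]), "nodes,",
      len(invention_graph["edges"]), "edges")
```

The point is that only the technical core survives the parsing; boilerplate phrasing, legal language, and repetition never enter the graph.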
In machine learning, data quality and quantity are usually the most important factors, and graphs give us a few tricks for enhancing the data. We can create new training samples: if we remove items from the graph, the invention becomes more general, and anything that defined a similar invention before is still relevant.
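This augmentation trick can be sketched in a few lines. The helper below, with the same toy graph representation as above, drops a detail node and all edges touching it; the function name and graph layout are assumptions for illustration:

```python
def drop_nodes(graph, removed):
    """Removing detail nodes yields a more general invention graph."""
    keep = set(graph["nodes"]) - set(removed)
    return {
        "nodes": {i: n for i, n in graph["nodes"].items() if i in keep},
        "edges": [(a, r, b) for a, r, b in graph["edges"]
                  if a in keep and b in keep],
    }

specific = {
    "nodes": {1: "bicycle", 2: "frame", 3: "carbon fibre"},
    "edges": [(1, "has-part", 2), (2, "made-of", 3)],
}

# Drop the material node: now a frame of any material matches,
# giving a new, broader training sample from the same labelled pair.
general = drop_nodes(specific, {3})
print(general["nodes"])   # {1: 'bicycle', 2: 'frame'}
```

Each removal produces a valid new sample, so one labelled document pair can be multiplied into many.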
Graphs can also be split. If all the pieces of your invention are found in a document, that document is a search result you want. This changes the game for two reasons. First, we get to solve an easier problem: it is much easier to answer whether a patent document contains one specific part of the invention. Second, even if we could only answer the easier question as well as other patent search providers answer the harder one, we would still win. Why? Because we have split the problem, we get more data about the match. If these individual answers are merely as accurate as the competitor's single answer, our accuracy is higher after we combine them.
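A toy calculation shows why combining several equally accurate answers wins. Assume (a simplification; the real combination can be learned rather than voted) that each sub-question is answered independently with 80% accuracy and that five such answers are merged by majority vote:

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority vote of n independent answers,
    each individually correct with probability p, is correct."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

single = 0.8                           # one answer to the hard question
combined = majority_accuracy(0.8, 5)   # five answers to easier sub-questions
print(round(combined, 3))              # 0.942 -- clearly above 0.8
```

Five answers at 80% accuracy combine to about 94%: the split itself buys accuracy, before any per-piece improvement.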
Because of all this, we believe the graph-based approach is simply superior to traditional patent searches. There is, however, quite a lot of extra work. For the full benefits, we have had to convert all patent documents into graphs. It is also easier to make existing machine learning solutions work with raw text than with graphs, as text is always text, while graph structures vary a lot. Nevertheless, graphs are eventually going to take over, as the benefits are too great. We already believe our search is the state of the art for some cases.