June 26, 2020

Why do machine translations matter?

The number of non-English patent documents is growing rapidly. To tackle the searchability problem, we are using machine translations to create English knowledge graphs from non-English texts.

During spring 2020 we expanded our data sources to include tens of national patent office databases, including multiple non-English databases. I took some time to analyze the newly imported data. The main focus was to understand the proportions between the major national patent offices and the importance of non-English documents. The following charts and document counts were collected and calculated from our search database using only documents that contain searchable specifications. This filtering provides a realistic view of our search space.
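The filtering step described above can be sketched roughly as follows. This is only a minimal illustration, not our actual pipeline: the record layout and the field names `source`, `year` and `has_specification` are hypothetical, and the sample records are toy values rather than real data.

```python
from collections import Counter

# Toy records standing in for rows of a search database.
# The field names are hypothetical and the values are made up.
documents = [
    {"source": "CN", "year": 2018, "has_specification": True},
    {"source": "CN", "year": 2019, "has_specification": True},
    {"source": "JP", "year": 2019, "has_specification": False},
    {"source": "US", "year": 2019, "has_specification": True},
    {"source": "US", "year": 1999, "has_specification": True},
]


def count_by_source(docs, first_year=2000, last_year=2019):
    """Count documents per data source, keeping only documents that
    have a searchable specification and fall inside the year range."""
    searchable = (
        d for d in docs
        if d["has_specification"] and first_year <= d["year"] <= last_year
    )
    return Counter(d["source"] for d in searchable)


counts = count_by_source(documents)
print(counts.most_common())
```

Documents without a searchable specification are dropped before counting, so the totals describe what can actually be found in a search rather than everything that was imported.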

China leads in patent filing

Chart 1: Number of documents per major data source during 2000-2019. The chart shows mostly stable growth, except for a huge leap in documents from China and a mild decrease in documents from Japan.

As Chart 1 shows, the number of Chinese publications grew strongly during the 2010s. This is part of China’s industrial strategy, in which intellectual property is a major pillar. According to the strategy, companies qualify for government subsidies upon filing a patent application. But a closer look at the mass of Chinese publications in Chart 2 shows that the major part of the growth comes from unexamined patent applications (A) and granted utility models (U), leaving granted patent publications (B) behind. The subsidy policy has been criticized by legal experts, who warn that it could flood the system with cheap patents, but the increase in granted publications seems to be moderate.

Chart 2: Chinese documents in the 2010s by kind code. The growth relies on unexamined patent applications (A) and granted utility models (U).

The number of non-English publications is growing

On a global scale, the number of non-English publications has grown, according to the World Intellectual Property Organization (WIPO). In 2019, China passed the United States of America as the top source of international patent applications filed with WIPO. At the same time, international patent applications filed via WIPO’s Patent Cooperation Treaty System (PCT) grew by 5.2%, with Asia-based applicants accounting for more than half of the applications.

Chart 3: PCT applications by original language. English leads as the most used language, but the number of documents in Asian languages keeps rising.

When inspecting PCT applications by original language, English is still the most used language with a 46% share. The top 5 contains three big Asian languages: Chinese, Korean and Japanese. This makes precise, high-quality translations important when working with PCT applications.

New machine translations to the rescue!

Since the addition of national data sources, the number of searchable patent publications has increased by over 240%. Most of the national sources are from non-English-speaking countries, making non-English documents the majority in our search space, with a 1.25:1 ratio compared to English documents. Because our graph-based approach to patent search currently works only with English patent texts, we required translations for all non-English documents. Luckily our data provider, IFI CLAIMS Patent Services, started a huge project in 2019 to translate all non-English documents to English with Google’s state-of-the-art translator. As I write this post, the last newly translated documents have finally been imported into our patent database and are ready to be parsed into our knowledge graph format, which has gone through some major improvements during the last few weeks thanks to our NLP experts Sakke & Sebastian.
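To put the two figures above into more familiar terms, here is a small back-of-the-envelope sketch. The helper names `share_from_ratio` and `growth_multiplier` are made up for this illustration; only the 1.25:1 ratio and the 240% increase come from the text, and since the increase is "over 240%", the multiplier is a lower bound.

```python
def share_from_ratio(part: float, other: float) -> float:
    """Return the fraction of the total held by the first side of a part:other ratio."""
    return part / (part + other)


def growth_multiplier(increase_percent: float) -> float:
    """Convert a percentage increase into a size multiplier (new size / old size)."""
    return 1 + increase_percent / 100


# 1.25 non-English documents for every 1 English document.
non_english_share = share_from_ratio(1.25, 1.0)

# An increase of 240% means the collection is 3.4 times its earlier size.
multiplier = growth_multiplier(240)

print(f"Non-English share of search space: {non_english_share:.1%}")
print(f"Search space size vs. before: at least {multiplier:.1f}x")
```

In other words, a 1.25:1 ratio means that roughly 56% of the searchable documents were originally written in a language other than English.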

Written by

Juuso Piskonen

Lead Engineer & Co-Founder