
Google Cloud Pub/Sub with Clojure

September 11, 2019

Google Cloud Pub/Sub is simple to use from Python, but what about Clojure? There is no official support, and the Java interop is not straightforward. There are a few libraries, but none is actively maintained. We ended up building our internal mini-library jonotin, and today we have published the code.

Cheaper infrastructure with Pub/Sub

We process a lot of data. A big part of this is transforming all the patent text into knowledge graphs. That means that every time we improve the code, we need to reprocess the text documents into ~30 million graphs to get the benefits. With 300 machines this takes around one week. In the beginning we split the data into monthly batches, which the machines went through one by one. This was fine, but as the data grew, the expenses grew too.

We optimised the parsing to the point that the next step could have been a rewrite in C. The problem with the batches became apparent when we wanted to switch to the much cheaper preemptible machines. They can be shut down at any minute, and as processing a single batch could take an hour, a shutdown would cause serious data loss. Our data guru Juuso came up with a plan: process one document at a time instead of a batch. This required new tools.

Today we are serious about queues.

Message queues are the standard way of solving such a problem. You publish messages to a queue, which can then be processed by multiple subscribers. Pub/Sub is Google's version of this pattern. Ideally, we would have used cloud functions, but the parsing required more memory than the 2GB maximum that cloud functions offer. Also, the parser machine startup time is never going to be super fast, as just loading the required 1GB language model into memory takes a few seconds. So, we needed machines that could process a queue. This was straightforward in Python with the google-cloud-pubsub library. With one week of effort, Juuso had cut the graph-parsing expenses by 75%, and all was great.

Pub/Sub from Clojure

We do everything except AI in Clojure, mostly because I love the language. We did the earliest data imports with monthly batches, and it worked well. Not fancy or optimal, but saving 100€ on such a one-time job was worth neither the effort nor adding a new tool to our system. As we started getting new data sources in, and some of the data needed to be processed again, queues became attractive. We already had them in use, too, so it wouldn't add much extra mental load.

There are a few Pub/Sub implementations for Clojure. clj-gcp seems to do the trick, but its Pub/Sub part is written to be used with Integrant and is outdated. google-cloud is even more outdated, but the main problem for us was that it didn't allow setting the maximum number of pulled messages.

So why not use Java interop? With the good experiences from Python and examples from the old libraries, this should have been easy. Except that it wasn't. Clojure interop with Java is decent, but Google has injected some heavy OOP into the Java google-cloud-pubsub library. We needed to read through not only the google-cloud-pubsub code but also some code of its dependencies.

Our Pub/Sub wrapper jonotin

We endured more pain than we expected, and that is why we have published the code (GitHub). Hopefully it helps someone out there. It's 70 lines of Java interop with just two functions: publish! and subscribe!. There will be use cases it doesn't fit right away, but for us this was the simplest solution that gets the job done.

Publishing messages is minimal. Messages are plain strings, and jonotin takes only the parameters needed to identify the Pub/Sub topic.

(jonotin/publish! {:project-name "my-gcloud-project"
                   :topic-name "my-pubsub-topic"
                   :messages ["message to queue"
                              "another message"]})

Subscribing is simple as well. Messages are read for as long as there are any in the queue. We only use pull subscriptions. We needed to specify a batch size for the number of messages fetched at once from the queue, so that it is small enough for us to process the messages before they time out. Every error is caught, and handle-error-fn can do whatever is needed with it. Whatever happens, the message is acked in the end.

(jonotin/subscribe! {:project-name "my-gcloud-project"
                     :subscription-name "my-pubsub-subscription"
                     :batch-size 10
                     :handle-msg-fn (fn [msg]
                                      (println "Handling" msg))
                     :handle-error-fn (fn [e]
                                        (println "Oops!" e))})

The name jonotin means “a thing that queues” in Finnish, a word that nobody ever uses. Most of the Clojure open source libraries from Finland are made by Metosin, a software house from Tampere. They usually come with funky Finnish names, like reitit, muuntaja or kekkonen. We wanted to follow this tradition.

jonotin from the inside

This section might be of interest to you if you plan to modify jonotin, or if you are simply curious about the trouble we had to go through.

One detail to remember about publishing to Google Pub/Sub is to convert the string into a byte string before creating the message object.

(-> (PubsubMessage/newBuilder)
    (.setData (ByteString/copyFromUtf8 "message to queue"))
    .build)
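For context, a full publish round trip with the Java SDK looks roughly like the sketch below. The project and topic names are hypothetical, and this is an outline of the underlying API rather than jonotin's exact implementation:

```clojure
(import '(com.google.cloud.pubsub.v1 Publisher)
        '(com.google.protobuf ByteString)
        '(com.google.pubsub.v1 ProjectTopicName PubsubMessage))

(let [topic     (ProjectTopicName/of "my-gcloud-project" "my-pubsub-topic")
      publisher (-> (Publisher/newBuilder topic) .build)
      message   (-> (PubsubMessage/newBuilder)
                    (.setData (ByteString/copyFromUtf8 "message to queue"))
                    .build)]
  ;; publish returns an ApiFuture; .get blocks until the service
  ;; responds with the message id.
  (-> (.publish publisher message) .get)
  ;; Shut the publisher down so any pending messages are flushed.
  (.shutdown publisher))
```

Running this requires Google Cloud credentials and an existing topic, so it is best read as a map of which objects are involved.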

The subscribe side contains most of the real tricks. To get messages from the queue, you need to create a PullRequest object. With (.setReturnImmediately true), the pull request returns 0-10 messages straight away instead of waiting until 10 are available.

(-> (PullRequest/newBuilder)
    (.setReturnImmediately true)
    (.setSubscription
      (ProjectSubscriptionName/format "my-gcloud-project"
                                      "my-pubsub-subscription"))
    (.setMaxMessages 10)
    .build)
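To complete the picture, here is a hedged sketch of how such a pull request can be executed and the messages acknowledged with the synchronous SubscriberStub API. The names are hypothetical and jonotin's actual code differs in its details:

```clojure
(import '(com.google.cloud.pubsub.v1.stub GrpcSubscriberStub
                                          SubscriberStubSettings)
        '(com.google.pubsub.v1 AcknowledgeRequest
                               ProjectSubscriptionName
                               PullRequest))

(let [stub         (GrpcSubscriberStub/create
                     (-> (SubscriberStubSettings/newBuilder) .build))
      subscription (ProjectSubscriptionName/format "my-gcloud-project"
                                                   "my-pubsub-subscription")
      request      (-> (PullRequest/newBuilder)
                       (.setReturnImmediately true)
                       (.setSubscription subscription)
                       (.setMaxMessages 10)
                       .build)
      ;; Synchronous pull: blocks until the request returns.
      response     (-> (.pullCallable stub) (.call request))]
  (doseq [received (.getReceivedMessagesList response)]
    ;; Process the payload, then ack so the message is not redelivered.
    (println (-> received .getMessage .getData .toStringUtf8))
    (-> (.acknowledgeCallable stub)
        (.call (-> (AcknowledgeRequest/newBuilder)
                   (.setSubscription subscription)
                   (.addAckIds (.getAckId received))
                   .build)))))
```

Acking one message at a time keeps the example simple; collecting the ack ids and sending a single AcknowledgeRequest per pull is the cheaper choice in practice.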

The biggest challenge was to actually get the Google Pub/Sub Java SDK to work. Back then we couldn't make the newest version work at all, and the older ones needed an updated gRPC. There must have been something exotic about our solution, maybe the all-synchronous approach.


:dependencies [[org.clojure/clojure "1.10.0"]
               [com.google.cloud/google-cloud-pubsub "1.69.0"]
               [io.grpc/grpc-core "1.20.0"]
               [io.grpc/grpc-netty-shaded "1.20.0" :exclusions [io.grpc/grpc-core]]]

Today we tried updating the version. The newest, 1.87.0, caused an error, but 1.71.0 seemed to work just fine, without needing any special gRPC dependencies.

jonotin on GitHub

Written by
Juho Kallio
CTO, Co-Founder
