Long blocking on event `:kv-cluster/get-val` without timeout/retry/failover semantics?

Question

Long blocking on event `:kv-cluster/get-val` without timeout/retry/failover semantics?

asked Jun 23, 2021 in On-Prem by Norbert Wójtowicz

On a test environment, we're seeing this kind of log from the Datomic Peer:

;; ~5 hours CI inactive (if this is relevant)

DEBUG 2021-06-18T13:32:37.702 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60a4cc9e-ac11-40fe-9bd0-582f8209a403", :phase :begin, :pid 7, :tid 225}

;; 15 minutes pass....

DEBUG 2021-06-18T13:48:25.091 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60a4cc9e-ac11-40fe-9bd0-582f8209a403", :msec 947000.0, :phase :end, :pid 7, :tid 225}

;; This next one finishes in 21ms:

DEBUG 2021-06-18T13:48:25.094 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60128f6d-25fb-41f0-a6c2-9c5e73267da7", :phase :begin, :pid 7, :tid 225}
DEBUG 2021-06-18T13:48:25.115 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60128f6d-25fb-41f0-a6c2-9c5e73267da7", :msec 21.0, :phase :end, :pid 7, :tid 225}

;; and now we start processing the transacts that queued up when the tests started...

INFO  2021-06-18T13:48:25.117 [manifold-execute-43]: {:event :peer/transact, :uuid #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :start, :pid 7, :tid 282}
DEBUG 2021-06-18T13:48:25.234 [clojure-agent-send-off-pool-24]: {:event :peer/accept-new, :id #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :begin, :pid 7, :tid 226}
DEBUG 2021-06-18T13:48:25.234 [clojure-agent-send-off-pool-24]: {:event :peer/accept-new, :id #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :msec 0.462, :phase :end, :pid 7, :tid 226}
INFO  2021-06-18T13:48:25.235 [clojure-agent-send-off-pool-24]: {:event :peer/transact, :uuid #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :end, :pid 7, :tid 226}

After those 15 minutes of waiting, the peer moved on and transacted all the transactions that happened to queue in the meantime, as if nothing happened. There do not seem to be any interesting or anomalous logs in the Transactor for this time period.

Datomic Peer (1.0.6269) with PostgreSQL storage. The peer and transactor are both running on Kubernetes, but Postgres is hosted outside of the k8s cluster.

I think my followup questions are:

Any idea what could be going on with the :kv-cluster/get-val and how to go about debugging this further?
Is there some timeout we can configure to avoid this kind of situation in a production environment?
Is the probable culprit Kubernetes? If so, has someone seen or fixed similar behavior before?
These logs are happening at the DEBUG level, otherwise we would not even know that the Peer was blocking/hanging/waiting. Are there other metrics or logs we should be monitoring to identify this kind of problem?

2 Answers

Francis Avila · Answer 1 · 2021-06-23T12:14:30+0000

The first step is to look up that uuid in Postgres and see what it is. Is it an especially large value? (It should be <60kb) If you ungzip and defressian it, what kind of segment is it?

If it’s a large value, you need to find the source of it and stop producing it. Common things people do which datomic doesn’t like is transact a large number of datoms at once (aim for 1000-2000 max per tx. Remember retractions are datoms too) or large string values (keep them under 4K chars).

If it is not a large value, then this may just be cpu or IO scheduling latency after your instance being quiet for so long, or it may be Postgres. How long does a raw select of this row take?

Ivar Refsdal · Answer 2 · 2022-01-27T17:28:31+0000

I think the most probably culprit for this is a lack of socketTimeout property on the PostgreSQL connection string. Here is a quote from HikariCP's Rapid Recovery page:

When TCP communication is abruptly interrupted, the client or server
can be left awaiting the acknowledgement of a packet that will never
come. The connection is therefore "stuck", until an operating system
level TCP timeout occurs. This can be as long as several hours,
depending on the operating system TCP stack tuning.

And it continues:

In order to avoid this condition, it is imperative that the
application configures the driver-level TCP socket timeout. Each
driver differs in how this timeout is set, but nearly all drivers
support it.

HikariCP recommends that the driver-level socket timeout be set to (at
least) 2-3x the longest running SQL transaction, or 30 seconds,
whichever is longer. However, your own recovery time targets should
determine the appropriate timeout for your application.

https://github.com/brettwooldridge/HikariCP/wiki/Rapid-Recovery

We have been facing a similar situation, and after setting socketTimeout=30, :kv-cluster/get-val will recover after 30s instead of 16 minutes.

Here is an example Datomic and PostgreSQL connection string:

datomic:sql://somedb?jdbc:postgresql://somehost:5432/datomic?user=datomic@somedb&password=xxx&sslmode=require&socketTimeout=30

Notice &socketTimeout=30 at the end.

Please note that the timeout value of 15 or 16 minutes is something that the OS decides. Running on Azure ("our OS"), we have been waiting for a TCP packet for about 24 hours.

I think the Datomic setup instructions should definitely recommend setting the socketTimeout parameter.

Best regards.

Long blocking on event `:kv-cluster/get-val` without timeout/retry/failover semantics?

Please log in or register to answer this question.

2 Answers

Please log in or register to add a comment.

Please log in or register to add a comment.

Categories

Long blocking on event `:kv-cluster/get-val` without timeout/retry/failover semantics?

Please log in or register to answer this question.

2 Answers

Please log in or register to add a comment.

Please log in or register to add a comment.

Related questions

Categories