
0 votes
in On-Prem by

In a test environment, we're seeing this kind of log from the Datomic Peer:

;; ~5 hours CI inactive (if this is relevant)

DEBUG 2021-06-18T13:32:37.702 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60a4cc9e-ac11-40fe-9bd0-582f8209a403", :phase :begin, :pid 7, :tid 225}

;; 15 minutes pass....

DEBUG 2021-06-18T13:48:25.091 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60a4cc9e-ac11-40fe-9bd0-582f8209a403", :msec 947000.0, :phase :end, :pid 7, :tid 225}

;; This next one finishes in 21ms:

DEBUG 2021-06-18T13:48:25.094 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60128f6d-25fb-41f0-a6c2-9c5e73267da7", :phase :begin, :pid 7, :tid 225}
DEBUG 2021-06-18T13:48:25.115 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60128f6d-25fb-41f0-a6c2-9c5e73267da7", :msec 21.0, :phase :end, :pid 7, :tid 225}

;; and now we start processing the transacts that queued up when the tests started...

INFO  2021-06-18T13:48:25.117 [manifold-execute-43]: {:event :peer/transact, :uuid #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :start, :pid 7, :tid 282}
DEBUG 2021-06-18T13:48:25.234 [clojure-agent-send-off-pool-24]: {:event :peer/accept-new, :id #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :begin, :pid 7, :tid 226}
DEBUG 2021-06-18T13:48:25.234 [clojure-agent-send-off-pool-24]: {:event :peer/accept-new, :id #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :msec 0.462, :phase :end, :pid 7, :tid 226}
INFO  2021-06-18T13:48:25.235 [clojure-agent-send-off-pool-24]: {:event :peer/transact, :uuid #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :end, :pid 7, :tid 226}

After those 15 minutes of waiting, the Peer moved on and transacted all the transactions that had queued up in the meantime, as if nothing had happened. There do not seem to be any interesting or anomalous logs in the Transactor for this time period.

Datomic Peer (1.0.6269) with PostgreSQL storage. The peer and transactor are both running on Kubernetes, but Postgres is hosted outside of the k8s cluster.

I think my follow-up questions are:

  1. Any idea what could be going on with the :kv-cluster/get-val and how to go about debugging this further?
  2. Is there some timeout we can configure to avoid this kind of situation in a production environment?
  3. Is the probable culprit Kubernetes? If so, has someone seen or fixed similar behavior before?
  4. These logs are happening at the DEBUG level, otherwise we would not even know that the Peer was blocking/hanging/waiting. Are there other metrics or logs we should be monitoring to identify this kind of problem?

2 Answers

+1 vote
by

The first step is to look up that UUID in Postgres and see what it is. Is it an especially large value? (It should be under 60 KB.) If you ungzip and defressian it, what kind of segment is it?
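
For example, a query along these lines shows the stored size and lets you gunzip the payload. This is only a sketch: it assumes the default datomic_kvs table created by Datomic's SQL setup scripts and uses the next.jdbc library, and the host/credentials are placeholders.

;; Sketch: look up the value key in SQL storage and check its size.
(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs]
         '[clojure.java.io :as io])

(def ds (jdbc/get-datasource
          {:dbtype "postgresql" :host "somehost" :dbname "datomic"
           :user "datomic" :password "xxx"}))

(def row
  (jdbc/execute-one! ds
    ["select id, rev, octet_length(val) as bytes, val
      from datomic_kvs where id = ?"
     "60a4cc9e-ac11-40fe-9bd0-582f8209a403"]
    {:builder-fn rs/as-unqualified-lower-maps}))

(:bytes row) ;; stored (gzipped) size in bytes

;; Gunzip to get at the uncompressed payload; fully decoding the Fressian
;; data additionally needs Datomic's own read handlers.
(defn gunzip ^bytes [^bytes bs]
  (with-open [in  (java.util.zip.GZIPInputStream.
                    (java.io.ByteArrayInputStream. bs))
              out (java.io.ByteArrayOutputStream.)]
    (io/copy in out)
    (.toByteArray out)))

(alength (gunzip (:val row))) ;; uncompressed size in bytes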

If it's a large value, you need to find the source of it and stop producing it. Common things people do that Datomic doesn't like are transacting a large number of datoms at once (aim for 1000-2000 max per tx; remember that retractions are datoms too) or transacting large string values (keep them under 4K characters).
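
If you need to split a big import into safer chunks, here is a minimal sketch of the batching idea, assuming an existing Peer connection conn; note that each tx map can expand to several datoms, so pick the chunk size conservatively.

;; Sketch: submit tx-data in chunks so no single transaction carries
;; thousands of datoms. `conn` is an existing Peer connection.
(require '[datomic.api :as d])

(defn transact-in-batches!
  [conn tx-data chunk-size]
  (doseq [chunk (partition-all chunk-size tx-data)]
    @(d/transact conn chunk)))

;; e.g. (transact-in-batches! conn all-tx-maps 1000)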

If it is not a large value, then this may just be CPU or IO scheduling latency after your instance has been quiet for so long, or it may be Postgres. How long does a raw select of this row take?
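
To get a feel for the raw read latency from the application side, you could time the select directly (again only a sketch, reusing the ds datasource from the earlier snippet):

;; Sketch: time a raw select of the same row and compare it with the
;; :msec reported by :kv-cluster/get-val.
(time
  (jdbc/execute-one! ds
    ["select val from datomic_kvs where id = ?"
     "60a4cc9e-ac11-40fe-9bd0-582f8209a403"]))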

by
Thanks for following up! The row in question was ~42 KB of binary data. After ungzipping and defressianing, it turned out to be an `index-tdata` segment with 1551 datoms. We're going to take a deeper look at how big other rows are to make sure we are not producing any large transactions or string values, but in this particular case that did not seem to be the root cause (in SQL the row can be queried quickly).

Should I raise a separate issue to request some kind of timeout/retry on this kind of `:kv-cluster/get-val` in the future? And perhaps some kind of logging/monitoring of this scenario that is not DEBUG? I'm not sure if that's appropriate, but if there are networking issues it seems the Peer can hang for quite a long time without visibility into the problem.
0 votes
by

I think the most probable culprit for this is the lack of a socketTimeout property on the PostgreSQL connection string. Here is a quote from HikariCP's Rapid Recovery page:

When TCP communication is abruptly interrupted, the client or server can be left awaiting the acknowledgement of a packet that will never come. The connection is therefore "stuck", until an operating system level TCP timeout occurs. This can be as long as several hours, depending on the operating system TCP stack tuning.

And it continues:

In order to avoid this condition, it is imperative that the application configures the driver-level TCP socket timeout. Each driver differs in how this timeout is set, but nearly all drivers support it.

HikariCP recommends that the driver-level socket timeout be set to (at least) 2-3x the longest running SQL transaction, or 30 seconds, whichever is longer. However, your own recovery time targets should determine the appropriate timeout for your application.

https://github.com/brettwooldridge/HikariCP/wiki/Rapid-Recovery

We have been facing a similar situation, and after setting socketTimeout=30, :kv-cluster/get-val recovers after 30 seconds instead of 16 minutes.

Here is an example Datomic and PostgreSQL connection string:

datomic:sql://somedb?jdbc:postgresql://somehost:5432/datomic?user=datomic@somedb&password=xxx&sslmode=require&socketTimeout=30

Notice &socketTimeout=30 at the end.
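
For completeness, a small sketch of a Peer connecting with such a URI (host, database name, and credentials are the placeholders from the example above):

;; Sketch: connect a Peer using the socketTimeout-bearing URI.
(require '[datomic.api :as d])

(def uri
  (str "datomic:sql://somedb?"
       "jdbc:postgresql://somehost:5432/datomic"
       "?user=datomic@somedb&password=xxx&sslmode=require&socketTimeout=30"))

(def conn (d/connect uri))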

Please note that the timeout value of 15 or 16 minutes is something that the OS decides. In our case, running on Azure, we have seen the Peer wait for a TCP packet for about 24 hours.

I think the Datomic setup instructions should definitely recommend setting the socketTimeout parameter.

Best regards.
