Welcome! Please see the About page for a little more info on how this works.

0 votes
in On-Prem by

On a test environment, we're seeing this kind of log from the Datomic Peer:

;; ~5 hours CI inactive (if this is relevant)

DEBUG 2021-06-18T13:32:37.702 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60a4cc9e-ac11-40fe-9bd0-582f8209a403", :phase :begin, :pid 7, :tid 225}

;; 15 minutes pass....

DEBUG 2021-06-18T13:48:25.091 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60a4cc9e-ac11-40fe-9bd0-582f8209a403", :msec 947000.0, :phase :end, :pid 7, :tid 225}

;; This next one finishes in 21ms:

DEBUG 2021-06-18T13:48:25.094 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60128f6d-25fb-41f0-a6c2-9c5e73267da7", :phase :begin, :pid 7, :tid 225}
DEBUG 2021-06-18T13:48:25.115 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60128f6d-25fb-41f0-a6c2-9c5e73267da7", :msec 21.0, :phase :end, :pid 7, :tid 225}

;; and now we start processing the transacts that queued up when the tests started...

INFO  2021-06-18T13:48:25.117 [manifold-execute-43]: {:event :peer/transact, :uuid #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :start, :pid 7, :tid 282}
DEBUG 2021-06-18T13:48:25.234 [clojure-agent-send-off-pool-24]: {:event :peer/accept-new, :id #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :begin, :pid 7, :tid 226}
DEBUG 2021-06-18T13:48:25.234 [clojure-agent-send-off-pool-24]: {:event :peer/accept-new, :id #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :msec 0.462, :phase :end, :pid 7, :tid 226}
INFO  2021-06-18T13:48:25.235 [clojure-agent-send-off-pool-24]: {:event :peer/transact, :uuid #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :end, :pid 7, :tid 226}

After those 15 minutes of waiting, the peer moved on and transacted all the transactions that happened to queue in the meantime, as if nothing happened. There do not seem to be any interesting or anomalous logs in the Transactor for this time period.

Datomic Peer (1.0.6269) with PostgreSQL storage. The peer and transactor are both running on Kubernetes, but Postgres is hosted outside of the k8s cluster.

I think my followup questions are:

  1. Any idea what could be going on with the :kv-cluster/get-val and how to go about debugging this further?
  2. Is there some timeout we can configure to avoid this kind of situation in a production environment?
  3. Is the probable culprit Kubernetes? If so, has someone seen or fixed similar behavior before?
  4. These logs are happening at the DEBUG level, otherwise we would not even know that the Peer was blocking/hanging/waiting. Are there other metrics or logs we should be monitoring to identify this kind of problem?

1 Answer

+1 vote
by

The first step is to look up that uuid in Postgres and see what it is. Is it an especially large value? (It should be <60kb) If you ungzip and defressian it, what kind of segment is it?

If it’s a large value, you need to find the source of it and stop producing it. Common things people do which datomic doesn’t like is transact a large number of datoms at once (aim for 1000-2000 max per tx. Remember retractions are datoms too) or large string values (keep them under 4K chars).

If it is not a large value, then this may just be cpu or IO scheduling latency after your instance being quiet for so long, or it may be Postgres. How long does a raw select of this row take?

by
Thanks for following up! The row in question was ~42kb of binary data. After ungzip and defressian, it turned out to be a `index-tdata` with 1551 datoms. We're going to do a deeper dive into how big other rows are to make sure we are not producing any large transactions or string values, but in this particular case this did not seem to be the root cause (in SQL the row can be queried quickly).

Should I raise a separate issue to request some kind of timeout/retry on this kind of `:kv-cluster/get-val` in the future? And perhaps some kind of logging/monitoring of this scenario that is not DEBUG? I'm not sure if that's appropriate, but if there are networking issues it seems the Peer can hang for quite a long time without visibility into the problem.
Welcome to the Datomic Knowledgebase, where you can make features requests, ask questions and receive answers from other members of the community.
...