On a test environment, we're seeing this kind of log from the Datomic Peer:
;; ~5 hours CI inactive (if this is relevant)
DEBUG 2021-06-18T13:32:37.702 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60a4cc9e-ac11-40fe-9bd0-582f8209a403", :phase :begin, :pid 7, :tid 225}
;; 15 minutes pass....
DEBUG 2021-06-18T13:48:25.091 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60a4cc9e-ac11-40fe-9bd0-582f8209a403", :msec 947000.0, :phase :end, :pid 7, :tid 225}
;; This next one finishes in 21ms:
DEBUG 2021-06-18T13:48:25.094 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60128f6d-25fb-41f0-a6c2-9c5e73267da7", :phase :begin, :pid 7, :tid 225}
DEBUG 2021-06-18T13:48:25.115 [clojure-agent-send-off-pool-23]: {:event :kv-cluster/get-val, :val-key "60128f6d-25fb-41f0-a6c2-9c5e73267da7", :msec 21.0, :phase :end, :pid 7, :tid 225}
;; and now we start processing the transacts that queued up when the tests started...
INFO 2021-06-18T13:48:25.117 [manifold-execute-43]: {:event :peer/transact, :uuid #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :start, :pid 7, :tid 282}
DEBUG 2021-06-18T13:48:25.234 [clojure-agent-send-off-pool-24]: {:event :peer/accept-new, :id #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :begin, :pid 7, :tid 226}
DEBUG 2021-06-18T13:48:25.234 [clojure-agent-send-off-pool-24]: {:event :peer/accept-new, :id #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :msec 0.462, :phase :end, :pid 7, :tid 226}
INFO 2021-06-18T13:48:25.235 [clojure-agent-send-off-pool-24]: {:event :peer/transact, :uuid #uuid "60cca429-7908-49b1-82c8-56b50effb4ce", :phase :end, :pid 7, :tid 226}
After those 15 minutes of waiting, the peer moved on and transacted all the transactions that happened to queue in the meantime, as if nothing happened. There do not seem to be any interesting or anomalous logs in the Transactor for this time period.
Datomic Peer (1.0.6269) with PostgreSQL storage. The peer and transactor are both running on Kubernetes, but Postgres is hosted outside of the k8s cluster.
I think my followup questions are:
- Any idea what could be going on with the
:kv-cluster/get-val
and how to go about debugging this further?
- Is there some timeout we can configure to avoid this kind of situation in a production environment?
- Is the probable culprit Kubernetes? If so, has someone seen or fixed similar behavior before?
- These logs are happening at the DEBUG level, otherwise we would not even know that the Peer was blocking/hanging/waiting. Are there other metrics or logs we should be monitoring to identify this kind of problem?