This is my continuous journal for the Threshold Network diary study on setting up a testnet node in the Threshold Network. I was briefed by a member of the Threshold team and my questions were quickly resolved. Let's get to work!
I chose Hetzner as my cloud hosting provider: it has good tooling, fair prices, and I'm generally quite happy with its hosting capabilities. It also offers rudimentary monitoring (e.g. notification channels for when a host goes down or a node resource exceeds a customized threshold), but software-level monitoring I will set up at a more granular level later, post-setup.
According to the node requirements, I will go for a server that covers the containerized component's needs of 2 (shared) vCPUs / 2 GiB memory / >1 GiB storage. In Hetzner's case, an ideal candidate is `cpx11`.
I checked the available Hetzner server types directly from the command line with their handy CLI tool:
--- ~ » hcloud server-type list
ID NAME CORES CPU TYPE MEMORY DISK STORAGE TYPE
1 cx11 1 shared 2.0 GB 20 GB local
3 cx21 2 shared 4.0 GB 40 GB local
5 cx31 2 shared 8.0 GB 80 GB local
7 cx41 4 shared 16.0 GB 160 GB local
9 cx51 8 shared 32.0 GB 240 GB local
11 ccx11 2 dedicated 8.0 GB 80 GB local
12 ccx21 4 dedicated 16.0 GB 160 GB local
13 ccx31 8 dedicated 32.0 GB 240 GB local
14 ccx41 16 dedicated 64.0 GB 360 GB local
15 ccx51 32 dedicated 128.0 GB 600 GB local
22 cpx11 2 shared 2.0 GB 40 GB local
23 cpx21 3 shared 4.0 GB 80 GB local
24 cpx31 4 shared 8.0 GB 160 GB local
25 cpx41 8 shared 16.0 GB 240 GB local
26 cpx51 16 shared 32.0 GB 360 GB local
33 ccx12 2 dedicated 8.0 GB 80 GB local
34 ccx22 4 dedicated 16.0 GB 160 GB local
35 ccx32 8 dedicated 32.0 GB 240 GB local
36 ccx42 16 dedicated 64.0 GB 360 GB local
37 ccx52 32 dedicated 128.0 GB 600 GB local
38 ccx62 48 dedicated 192.0 GB 960 GB local
Regarding the OS, I go with the latest LTS, Ubuntu 22.04. Again, checking availability via the CLI:
--- ~ » hcloud image list -t system -s name
ID TYPE NAME DESCRIPTION IMAGE SIZE DISK SIZE CREATED DEPRECATED
3 system centos-7 CentOS 7 - 5 GB Mon Jan 15 12:34:45 CET 2018 -
45778012 system centos-stream-8 CentOS Stream 8 - 5 GB Thu Aug 5 07:07:23 CEST 2021 -
59752342 system centos-stream-9 CentOS Stream 9 - 5 GB Thu Jan 27 08:52:03 CET 2022 -
5924233 system debian-10 Debian 10 - 5 GB Mon Jul 8 08:35:48 CEST 2019 -
45557056 system debian-11 Debian 11 - 5 GB Mon Aug 16 13:12:01 CEST 2021 -
69726282 system fedora-36 Fedora 36 - 5 GB Wed May 11 07:50:00 CEST 2022 -
45780948 system rocky-8 Rocky Linux 8 - 5 GB Thu Aug 19 08:30:23 CEST 2021 -
76766499 system rocky-9 Rocky Linux 9 - 5 GB Wed Jul 20 15:55:52 CEST 2022 -
168855 system ubuntu-18.04 Ubuntu 18.04 - 5 GB Wed May 2 13:02:30 CEST 2018 -
15512617 system ubuntu-20.04 Ubuntu 20.04 - 5 GB Thu Apr 23 19:55:14 CEST 2020 -
67794396 system ubuntu-22.04 Ubuntu 22.04 - 5 GB Thu Apr 21 15:32:38 CEST 2022 -
Next, bootstrapping the server (PII redacted):
hcloud server create --ssh-key redacted --image ubuntu-22.04 --name apeiratos --type cpx11 --location hel1
The node was ready in under a minute. I tried to log in:
hcloud server ssh apeiratos
It worked immediately: a fresh Ubuntu 22.04 LTS, ready for testnet deployment.
Because this is going to run in Docker containers and I'm not sure the setup script covers Docker Engine installation, I did this first after node setup via the official setup guide at https://docs.docker.com/engine/install/ubuntu/.
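The guide's apt-repository method boils down to roughly the following (condensed from the linked page; check it for the current commands before copy-pasting):

```bash
# Add Docker's official GPG key and apt repository, then install the engine
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
```

It worked without issues and the docker service is running: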
root@apeiratos:~# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2022-10-18 22:20:22 UTC; 15h ago
TriggeredBy: ● docker.socket
Docs: https://docs.docker.com
Main PID: 1648 (dockerd)
Tasks: 8
Memory: 21.9M
CPU: 5.665s
CGroup: /system.slice/docker.service
└─1648 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.147434124Z" level=info msg="scheme \"unix\" not registered>
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.147446146Z" level=info msg="ccResolverWrapper: sending upd>
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.147453099Z" level=info msg="ClientConn switching balancer >
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.181848845Z" level=info msg="Loading containers: start."
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.318011100Z" level=info msg="Default bridge (docker0) is as>
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.377792435Z" level=info msg="Loading containers: done."
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.402092437Z" level=info msg="Docker daemon" commit=03df974 >
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.402215928Z" level=info msg="Daemon has completed initializ>
Oct 18 22:20:22 apeiratos systemd[1]: Started Docker Application Container Engine.
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.427780499Z" level=info msg="API listen on /run/docker.sock"
I then also updated the packages of the base Ubuntu 22.04 image provided by Hetzner to get the latest security fixes and base packages:
apt update && apt upgrade -y
Because this installed a later Linux kernel than the one the host was booted on, I also rebooted the machine to run on the newest Ubuntu-supported kernel.
reboot
After checking that I was on the updated kernel (`Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-52-generic x86_64)`), I removed the no-longer-required old kernel images & modules: `apt autoremove`
From the documentation, I went straight for the launch script: https://thresholdnetwork.notion.site/thresholdnetwork/Docker-Launch-Script-4d304d61be6941d78e450a79406f0403.
The first thing I needed was my Goerli WebSocket URL. I have an Infura account, so I logged in there and created a new project. In the project's network endpoints section, I switched the Ethereum endpoint to Goerli Testnet, copied the URL, and inserted it in the script where `ETHEREUM_WS_URL` is set.
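For reference, Infura's Goerli WebSocket endpoints have the following shape (the project ID is of course a placeholder):

```bash
ETHEREUM_WS_URL="wss://goerli.infura.io/ws/v3/<YOUR_INFURA_PROJECT_ID>"
```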
Next are the operator keys:
As https://docs.threshold.network/staking-and-running-a-node/running-a-node/self-managed/tbtc-v2-node-setup/operator-account describes, I generated my key & password with `geth account new --keystore ./operator-key`. Because I already had the latest geth installed on my local machine, I did it from there and then copied the keys to the server via `scp -P 22 -ri path/to/redacted/key operator-key redacted:`
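As an aside, geth names the keystore file after the creation timestamp and account address, so the result looks something like this (timestamp/address below are made up for illustration):

```bash
ls operator-key/
# UTC--2022-10-19T12-00-00.000000000Z--0123456789abcdef0123456789abcdef01234567
```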
After reading the script I figured the keyfile needs to go into `CONFIG_DIR`, so I moved it there: `mkdir config storage; cp operator-key/keyfile config/`. I then filled `OPERATOR_KEY_FILE_NAME` and `OPERATOR_KEY_FILE_PASSWORD` with my specific inputs.
Finally, I inserted my `PUBLIC_IP` in the script. Then I started the script to test my configuration (and to see what happens / what errors are generated when I don't have any GoerliETH/GoerliT yet). I expected to run into issues at this point, but wanted to see whether the container generally runs and whether the binary finds my keys and can unlock the keyfile with my password. The output:
Unable to find image 'us-docker.pkg.dev/keep-test-f3e0/public/keep-client:latest' locally
latest: Pulling from keep-test-f3e0/public/keep-client
213ec9aee27d: Pull complete
17c536332366: Pull complete
Digest: sha256:81b47295efa93dc4acaf74b011630bd775e6ffeb2dcda4d7b6f2eeed380b320e
Status: Downloaded newer image for us-docker.pkg.dev/keep-test-f3e0/public/keep-client:latest
51db1142c31e2e1d2f0fd4af7a31afddd3708abdb10d4a4d65d8e978ce05a9d4
It doesn't throw an error, so far so good. Let's see whether the container crashed or is still running:
docker ps
CONTAINER ID   IMAGE                                                 COMMAND                  CREATED         STATUS         PORTS                                                                                  NAMES
51db1142c31e   us-docker.pkg.dev/keep-test-f3e0/public/keep-client   "keep-client start -…"   5 seconds ago   Up 4 seconds   0.0.0.0:3919->3919/tcp, :::3919->3919/tcp, 0.0.0.0:9601->9601/tcp, :::9601->9601/tcp   naughty_mendel
This is good! It didn't crash, so it seems to be waiting in standby. Let's check the logs:
2022-10-19T14:59:03.951Z INFO keep-cmd cmd/start.go:61 Starting the client against [goerli] ethereum network.
2022-10-19T14:59:05.142Z INFO keep-ethereum ethereum/ethereum.go:294 enabled ethereum client request rate limiter; rps limit [150]; concurrency limit [30]
2022-10-19T14:59:05.528Z WARN keep-ethereum [email protected]/log.go:180 could not create subscription to new blocks: [notifications not supported]
2022-10-19T14:59:06.656Z WARN keep-libp2p [email protected]/log.go:180 could not establish connection with bootstrap peer [16Uiu2HAky2Y4Tyq5vTA1CxikcDes6o5EH11i2qcg5dBV9W3Lks5c]: [failed to dial 16Uiu2HAky2Y4Tyq5vTA1CxikcDes6o5EH11i2qcg5dBV9W3Lks5c:
* [/ip4/34.141.9.57/tcp/3919] failed to negotiate stream multiplexer: EOF]
2022-10-19T14:59:06.957Z WARN keep-libp2p [email protected]/log.go:180 could not establish connection with bootstrap peer [16Uiu2HAmMosdpAuRSw1ahNhqFq8e3Y4d4c5WZkjW1FGQi5WJwWZ7]: [failed to dial 16Uiu2HAmMosdpAuRSw1ahNhqFq8e3Y4d4c5WZkjW1FGQi5WJwWZ7:
* [/ip4/20.81.168.158/tcp/4001] failed to negotiate stream multiplexer: EOF]
2022-10-19T14:59:07.065Z WARN keep-libp2p [email protected]/log.go:180 could not establish connection with bootstrap peer [16Uiu2HAmCcfVpHwfBKNFbQuhvGuFXHVLQ65gB4sJm7HyrcZuLttH]: [failed to dial 16Uiu2HAmCcfVpHwfBKNFbQuhvGuFXHVLQ65gB4sJm7HyrcZuLttH:
* [/ip4/104.154.61.116/tcp/3919] failed to negotiate stream multiplexer: EOF]
2022-10-19T14:59:07.115Z WARN keep-libp2p [email protected]/log.go:180 could not establish connection with bootstrap peer [16Uiu2HAm3eJtyFKAttzJ85NLMromHuRg4yyum3CREMf6CHBBV6KY]: [failed to dial 16Uiu2HAm3eJtyFKAttzJ85NLMromHuRg4yyum3CREMf6CHBBV6KY:
* [/ip4/35.223.100.87/tcp/3919] failed to negotiate stream multiplexer: EOF]
2022-10-19T14:59:08.371Z WARN keep-libp2p [email protected]/log.go:180 could not establish connection with bootstrap peer [16Uiu2HAm77eSvRq5ioD4J8VFPkq3bJHBEHkssCuiFkgAoABwjo2S]: [failed to dial 16Uiu2HAm77eSvRq5ioD4J8VFPkq3bJHBEHkssCuiFkgAoABwjo2S:
* [/ip4/52.79.203.57/tcp/3919] failed to negotiate stream multiplexer: EOF]
2022-10-19T14:59:08.371Z WARN keep-libp2p [email protected]/log.go:180 bootstrap round error: [all bootstrap attempts failed]
2022-10-19T14:59:08.371Z INFO keep-clientinfo clientinfo/metrics.go:144 observing connected_peers_count with [1m0s] tick
▓▓▌ ▓▓ ▐▓▓ ▓▓▓▓▓▓▓▓▓▓▌▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▄
▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▌▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
▓▓▓▓▓▓ ▓▓▓▓▓▓▓▀ ▐▓▓▓▓▓▓ ▐▓▓▓▓▓ ▓▓▓▓▓▓ ▓▓▓▓▓ ▐▓▓▓▓▓▌ ▐▓▓▓▓▓▓
▓▓▓▓▓▓▄▄▓▓▓▓▓▓▓▀ ▐▓▓▓▓▓▓▄▄▄▄ ▓▓▓▓▓▓▄▄▄▄ ▐▓▓▓▓▓▌ ▐▓▓▓▓▓▓
▓▓▓▓▓▓▓▓▓▓▓▓▓▀ ▐▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▌ ▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
▓▓▓▓▓▓▀▀▓▓▓▓▓▓▄ ▐▓▓▓▓▓▓▀▀▀▀ ▓▓▓▓▓▓▀▀▀▀ ▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▀
▓▓▓▓▓▓ ▀▓▓▓▓▓▓▄ ▐▓▓▓▓▓▓ ▓▓▓▓▓ ▓▓▓▓▓▓ ▓▓▓▓▓ ▐▓▓▓▓▓▌
▓▓▓▓▓▓▓▓▓▓ █▓▓▓▓▓▓▓▓▓ ▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓
▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓ ▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓
Trust math, not hardware.
--------------------------------------------------------------------------------------------------
| Keep Client Node |
| |
| Version: v1.3.1-6462-gf3e894fa7 (f3e894fa7) |
| |
...
All looking quite good! The last log line tells me what to do next:
2022-10-19T15:10:11.048Z FATAL keep-cmd cmd/start.go:36 error initializing beacon: [could not set up sortition pool monitoring: [operator not registered for the staking provider, check Threshold dashboard]]
I'll therefore wait until I've got the Goerli tokens and then continue from the Threshold dashboard according to https://docs.threshold.network/staking-and-running-a-node/running-a-node/self-managed/tbtc-v2-node-setup/application-authorization-and-operator-registration.
This guide, of course, does not link to the testnet dashboard, but I found the correct link in the mail: https://dashboard.test.threshold.network/overview/network
So according to the dashboard, I have 3 steps to complete:
- STAKE TOKENS
  - Staked T via https://dashboard.test.threshold.network/staking/. Confirmed in 2 TX for a 50k T stake.
- AUTHORIZE APPS
  - Authorized the tBTC & Random Beacon apps for 50k T. Confirmed in 2 TX (one per app).
- SET UP NODE
  - The software setup for this was already covered on Day 1. It looks like I just need to map the operator address to my stake.
  - The operator registration / mapping guide leads me through the next 2 TX. For the operator address, I use the same address as the provider address. The dashboard confirms that I have successfully mapped the operator <-> provider addresses 🔥
I like that the dashboard leads the user through the 3 steps sequentially. I didn't have to search around the dashboard to stake / authorize the tBTC/Random Beacon apps, which saved me time and avoided confusion about which apps I would need to authorize. Also a plus: I can go directly to the setup docs for the different apps from the applications page.
After checking via `docker logs` on my server, it turns out the docker container is constantly restarting after FATALing during initialization. I still see the same message I did before I went through the dashboard staking/authorization/mapping steps:
2022-10-19T22:43:55.750Z FATAL keep-cmd cmd/start.go:36 error initializing beacon: [could not set up sortition pool monitoring: [operator not registered for the staking provider, check Threshold dashboard]]
I'm getting the impression I got the order of the setup steps fundamentally wrong (#1 basically test-running the node without properly doing #2, the staking/authorization/operator mapping). I also now see this remark in the setup guide "Application Authorization & Operator Registration":
Don't forget: a tBTC v2 node will not be able to be deployed without successfully authorizing both the tBTC and Random Beacon applications, and registering the node's operator address FIRST.
This does not sound good... I probably have to contact support about what I can do to register my node now, because I DID run the node before registering the node's operator address. At this point I don't know why this order is fundamentally not possible and would need some explanation from the team.
I do see the comment in the guide, but if it has such a fundamental impact, it should at least be mentioned in bold in the top section (https://docs.threshold.network/staking-and-running-a-node/running-a-node/self-managed/tbtc-v2-node-setup).
On second look into https://docs.threshold.network/extras/contract-addresses/goerli-testnet, I see that those contract addresses don't match what I see in my node client logs:
--------------------------------------------------------------------------------------------------
| Keep Client Node |
| |
| Version: v1.3.1-6462-gf3e894fa7 (f3e894fa7) |
| |
| Operator: redacted |
| |
| Port: 3919 |
| IPs : /ip4/redacted/tcp/3919/ipfs/redacted |
| |
| Contracts: |
| RandomBeacon : 0x0Acf872ea89b73E70Aa1d1b8F46dC01bB71f4B03 |
| WalletRegistry : 0x2d51348b1a903aAECF266d0844dA69a178fC1dC7 |
| TokenStaking : 0x1da5d88C26EA4f87b5e09C3452eE2384Ee20DC75 |
--------------------------------------------------------------------------------------------------
Looks to me like there is a mismatch between the tbtc docker container (fetched via the setup script) and what the test dashboard is using. The client also seems to be the outdated side, as I don't see any recent registration tx for e.g. the WalletRegistry used by the container (https://goerli.etherscan.io/address/0x2d51348b1a903aAECF266d0844dA69a178fC1dC7). Will clarify with support what to do, because at this point I don't know which contracts should be called for registration.
Support answered and updated the docker images. I then pulled the latest with `docker pull us-docker.pkg.dev/keep-test-f3e0/public/keep-client`, removed the old container running the old image with `docker rm naughty_mendel`, and restarted the launch script: `./keep.sh`. I can immediately see the corrected contract addresses in the log:
| |
| Contracts: |
| RandomBeacon : 0xaFfCD4734eEa140Ba5666Bf60541CCAFfa74F4be |
| WalletRegistry : 0x82BE0F8C8d43fAC584B03f4b782370E202A34527 |
| TokenStaking : 0x1da5d88C26EA4f87b5e09C3452eE2384Ee20DC75 |
--------------------------------------------------------------------------------------------------
This looks good 😄 Next, I let the node run for a few hours and then checked again. Via `docker ps` I can see that it didn't crash, but I'm not sure whether my node is active, because I still see recurring warnings in the logs like:
2022-10-20T16:22:44.117Z WARN keep-ethereum [email protected]/log.go:180 could not create subscription to new blocks: [notifications not supported]
and
2022-10-20T19:20:41.182Z WARN keep-libp2p [email protected]/log.go:180 bootstrap round error: [all bootstrap attempts failed]
I do have connected peers, though:
2022-10-20T19:20:42.904Z INFO keep-libp2p libp2p/libp2p.go:241 number of connected peers: [11]
Maybe the bootstrap nodes are currently down, or these warnings can be ignored... I need to check the documentation again to see how to verify a successfully operating node.
Update: OK, through the Discord forum, thanks to Vict0r, I got the answer to my first issue, the block subscription warning. On Day 1 I copied the HTTPS endpoint from Infura instead of the WebSocket RPC endpoint, no wonder! I have now replaced it with the correct WS endpoint. The only warnings still occurring regularly are the bootstrap round errors. I don't think this is a connection issue on my side, as Hetzner generally doesn't block incoming/outgoing traffic, and I didn't create application-based firewall rules that would limit port connectivity as documented in https://docs.threshold.network/staking-and-running-a-node/running-a-node/self-managed/tbtc-v2-node-setup/network-configuration.
For now I think the setup part has been successful, and I will continue building out the infrastructure for monitoring & failover service recovery in the post-setup part of my journal.
After searching for available monitoring info (see Docs Day 3), I checked if the metrics endpoint was working for me. It was:
➜ grafana curl http://localhost:9601/metrics
client_info{version="v2.0.0-m1-9-g4049dc015"} 1
# TYPE connected_bootstrap_count gauge
connected_bootstrap_count 2 1666534806116
# TYPE connected_peers_count gauge
connected_peers_count 13 1666534806116
# TYPE eth_connectivity gauge
eth_connectivity 1 1666534746116
# TYPE tbtc_pre_params_count gauge
tbtc_pre_params_count 1000 1666534806577
I then copied my docker-compose monitoring stack conf, which includes:
- Grafana
- Prometheus
- Prometheus Node-Exporter
- Traefik (for domain forwarding / automatic Let's Encrypt certificate generation)
I re-used the same setup as I did for my PRE/nucypher node.
My docker-compose conf (again with PII redacted):
name: grafana
services:
  grafana:
    depends_on:
      prometheus:
        condition: service_started
    environment:
      GF_SECURITY_ADMIN_PASSWORD: redacted
      GF_SECURITY_ADMIN_USER: redacted
      GF_SERVER_DOMAIN: redacted
      GF_SERVER_ROOT_URL: https://
      GF_USERS_ALLOW_SIGN_UP: "false"
    extra_hosts:
      host.docker.internal: host-gateway
    image: grafana/grafana:latest
    labels:
      traefik.http.middlewares.some-name-redirect.redirectScheme.permanent: "true"
      traefik.http.middlewares.some-name-redirect.redirectScheme.scheme: https
      traefik.http.routers.some-name-ssl.entryPoints: port443
      traefik.http.routers.some-name-ssl.rule: host(`redacted`)
      traefik.http.routers.some-name-ssl.service: some-name-ssl
      traefik.http.routers.some-name-ssl.tls: "true"
      traefik.http.routers.some-name-ssl.tls.certResolver: le-ssl
      traefik.http.routers.some-name.entryPoints: port80
      traefik.http.routers.some-name.middlewares: some-name-redirect
      traefik.http.routers.some-name.rule: host(`redacted`)
      traefik.http.services.some-name-ssl.loadBalancer.server.port: "3000"
    logging:
      driver: journald
    networks:
      default: null
    restart: always
    volumes:
      - type: volume
        source: grafana-data
        target: /var/lib/grafana
        volume: {}
  node-exporter:
    container_name: node-exporter
    depends_on:
      prometheus:
        condition: service_started
    hostname: node-exporter
    image: prom/node-exporter
    network_mode: host
    pid: host
    restart: always
  prometheus:
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=4w
      - --web.console.libraries=/etc/prometheus/console_libraries
      - --web.console.templates=/etc/prometheus/consoles
      - --web.listen-address=0.0.0.0:9090
    image: prom/prometheus:v2.22.0
    network_mode: host
    restart: always
    volumes:
      - type: bind
        source: /root/grafana/prometheus
        target: /etc/prometheus
        bind:
          create_host_path: true
      - type: volume
        # the snippet was cut off here; reconstructed as the named volume backing --storage.tsdb.path
        source: prometheus-data
        target: /prometheus
        volume: {}
volumes:
  grafana-data: {}
  prometheus-data: {}
My Prometheus config is quite simple and resides in `prometheus/prometheus.yml`:
# my global config
global:
  scrape_interval: 10s     # scrape targets every 10 seconds (the default would be 15s)
  evaluation_interval: 10s # evaluate rules every 10 seconds (the default would be 15s)
scrape_configs:
  # Scrape Prometheus itself (it listens on 9090, as configured in the compose file)
  - job_name: 'prometheus'
    scrape_interval: 10s
    scrape_timeout: 10s
    static_configs:
      - targets: ['localhost:9090']
  # Scrape the Node Exporter
  - job_name: 'node'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9100']
  # Scrape tbtc
  - job_name: 'tbtc'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9601']
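To sanity-check that all three scrape jobs come up healthy, the standard Prometheus HTTP API can be queried directly (requires jq; this is my own quick check, not something from the docs):

```bash
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'
# expected: prometheus: up / node: up / tbtc: up
```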
These are the only two files I need to deploy a self-contained monitoring instance on this node, which I can securely connect to remotely.
I then added a new DNS A record for this node to my domain for this purpose.
Tomorrow I will continue with adding a Grafana dashboard and connecting the Prometheus data source to it.
First I checked whether Grafana was correctly answering under the new domain name.
This worked, so after logging in I continued with registering the Prometheus data source.
A note here: in my stack setup, Prometheus is configured in host networking mode. In order to reach this container from other services like Grafana (which resides in the `grafana_default` network), we need to use the special Docker-internal host-resolving address `host.docker.internal`. If you don't have this requirement (i.e. Prometheus joined to the docker-compose network instead), you could address it from Grafana normally under its service name (`prometheus`).
To connect the Prometheus data source to Grafana:
- Configuration -> Data Sources -> Add Data Source
- Select "Prometheus".
- Go to the HTTP configuration section. Specify `http://host.docker.internal:9090` as the URL.
- At the bottom, click on "Save & test". This should work. If not, check whether you can access `http://host.docker.internal:9090` from the Grafana container, e.g. via `docker-compose exec grafana bash -c "nc host.docker.internal 9090 -v"`. This should respond with `host.docker.internal ([internal-ip]:9090) open`. If you don't see this, you probably need to use `http://prometheus:9090` instead (again, this can be tested both from the Grafana UI and from the container with the `nc` command).
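As an aside: instead of clicking through the UI, the data source can also be provisioned from a file, which survives container re-creation. A minimal sketch, assuming Grafana's standard provisioning directory is mounted into the container:

```yaml
# e.g. /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://host.docker.internal:9090
    isDefault: true
```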
Next, I created a new tBTC v2 dashboard. I attached the dashboard as exported JSON to this gist. Here's a screenshot:
I added a Grafana alerting rule & notification policy to notify me via TG if my tbtc v2 node's peer count drops below a certain number of peers. I tested the alert rule and notification policy by setting the peer threshold extra high so it would fire. After the evaluation period (5m) was over, I was alerted via TG, and I then updated the threshold to a non-firing sane default.
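The condition behind the rule is just a PromQL comparison, so it can be dry-run by hand against Prometheus before wiring it into the alert (the threshold of 5 below is my own arbitrary sane default, not an official recommendation):

```bash
# Returns a non-empty result vector exactly when the alert should fire
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=connected_peers_count < 5'
```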
Going over the launch script guide https://thresholdnetwork.notion.site/thresholdnetwork/Docker-Launch-Script-4d304d61be6941d78e450a79406f0403, I would make two remarks:
- There are several parts of the script (`ETHEREUM_WS_URL="<Ethereum API WS URL>"`, `OPERATOR_KEY_FILE_NAME="<Operator Account keyfile name>"`, `PUBLIC_IP="<PUBLIC_IP_OF_MACHINE>"`) where you need to replace template placeholders with data specific to each operator. If you run the script as is, the `docker run` will fail. For some novice node operators this could be the first hurdle that ends in a support contact: they go by the guide and don't understand that they need to change parts of the script, because there's no step in the guide that mentions replacing the data. Such a step should be included so novice operators know what they need to change in order to `docker run` successfully.
- `/home/keep` path requirements: there are hardcoded paths to `/home/keep/config` & `/home/keep/storage`. I would replace these with a more generic approach for operators that don't run the node as the `keep` user, using `$HOME/config` & `$HOME/storage` instead (see the sketch after this list).
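A minimal sketch of what I mean for the second point (`CONFIG_DIR` is the name used in the script; `STORAGE_DIR` is my guess for its storage counterpart, and the fallback defaults are my suggestion):

```bash
# Default to the invoking user's home directory, but let the operator override the paths
CONFIG_DIR="${CONFIG_DIR:-$HOME/config}"
STORAGE_DIR="${STORAGE_DIR:-$HOME/storage}"
mkdir -p "$CONFIG_DIR" "$STORAGE_DIR"
```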
The documentation on docs.threshold.network for the tBTC setup is quite straightforward. As already remarked regarding the setup guide, I would make the important disclaimers more prominent.
After starting the container and seeing a bunch of warnings in the logs, I would have wished for more detail in the documentation on which log events to watch for in order to check that my node has successfully joined as a new tbtc staking node.
Simply stating that the operator seeing the KEEP logo + operator/contract addresses means the node is up (as both the Docker & Binary Installation docs do) gives a false sense of safety that the node is active.
Instead, I would tell operators to look for specific activation events of the services the node should provide (e.g. number of connected peers, successful bootstrapping messages, successful subscriptions, protocol exchanges with peers, etc.).
I'd also recommend putting up some kind of rudimentary monitoring guide in the future, but as was explained to me in yesterday's intro interview, this can be sourced by working together with the beta testers, who will try different methods. I'll therefore check the code / tbtc API docs to see what can be fetched, try out what can be done, and note my results in the post-setup section.
Checked the documentation for how to best export tbtc v2 node metrics. TBH the docs did not help me in this endeavour and should surely get a sub-page under https://docs.threshold.network/staking-and-running-a-node/running-a-node/self-managed/tbtc-v2-node-setup. So next I searched the Threshold Discord for tbtc metrics and found great info in Vict0r's answers to the community. Apparently I can check `http://localhost:9601/metrics`, and, for more detailed information regarding the client & connected peers, `http://localhost:9601/diagnostic`. The former is Prometheus-compatible, the latter is not. So my plan is to copy my working Prometheus monitoring stack onto this node and create a simple tbtc v2 dashboard, and/or copy an already existing one from the Threshold community and customize it (if available).
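To quickly eyeball both endpoints (I'm only assuming from the Discord answers that `/diagnostic` returns a structured document; the docs don't describe it):

```bash
curl -s http://localhost:9601/metrics      # Prometheus text exposition format
curl -s http://localhost:9601/diagnostic   # detailed client & connected-peer information
```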
During my post-setup work with Grafana (see the Day 4 section above) I searched the Discord community once again, as I already knew the docs website does not contain anything regarding tbtc monitoring. I didn't find anything there either, so next I checked the tbtc repositories. I found two Grafana dashboards by searching for one of the metrics (`tbtc_pre_params_count`):
- https://github.com/keep-network/keep-core/blob/357243671dfc734181143ccb1c70e825598a3fc9/infrastructure/kube/keep-test/monitoring/grafana/dashboards/keep/keep-network-nodes.json
- https://github.com/keep-network/keep-core/blob/357243671dfc734181143ccb1c70e825598a3fc9/infrastructure/kube/keep-test/monitoring/grafana/dashboards/keep/keep-network-nodes-public.json
I imported those as well and checked them against my manually created dashboard. They are quite similar, as there aren't that many metrics exported for now.
In an upcoming docs section on monitoring I would probably include one of those dashboards, plus a note on how the job name should be set in the Prometheus config so it automatically links with the name used in the dashboards, or make the dashboards more configurable to allow independent job names for the tbtc part.
I would also add a sub-section on alerting: what should be tracked (probably connectivity metrics to eth / bootstrap nodes, low peer count) and how to check whether your node is still online (e.g. a node network metrics explorer like the one that was done for the nucypher network).
One could also recommend an availability monitoring provider like https://Pingdom.com, https://betteruptime.com/, etc. (or guides on how to do this with popular hosting providers like AWS, GCloud, Azure, DigitalOcean, etc.).
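As a self-hosted poor man's variant of such an availability check, a cron-driven probe against the metrics endpoint already goes a long way. A sketch, assuming a configured MTA for `mail` and a placeholder address:

```bash
# crontab entry (single line): probe every 5 minutes, send a mail when the endpoint is down
*/5 * * * * curl -fsS --max-time 10 http://localhost:9601/metrics > /dev/null || echo "tbtc metrics endpoint down on $(hostname)" | mail -s "node alert" ops@example.com
```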