This is my continuous journal for the Threshold Network diary study on setting up a testnet node in the Threshold Network. I was briefed by a member of the Threshold team and my questions were quickly resolved. Let's get to work!
I chose Hetzner as my cloud hosting provider: it has good tooling, fair prices, and I'm generally quite happy with its hosting capabilities. It also offers rudimentary monitoring (e.g. notification channels for when a host goes down or a node resource exceeds a customized threshold), but software-level monitoring I will set up at a more granular level later, post-setup.
According to the node requirements, I will go for a server that covers the containerized component's needs of 2 (shared) vCPUs / 2 GiB memory / >1 GiB storage. In Hetzner's case, an ideal candidate is `cpx11`.
I checked the available Hetzner server types directly from the command line with their handy CLI tool:
--- ~ » hcloud server-type list
ID NAME CORES CPU TYPE MEMORY DISK STORAGE TYPE
1 cx11 1 shared 2.0 GB 20 GB local
3 cx21 2 shared 4.0 GB 40 GB local
5 cx31 2 shared 8.0 GB 80 GB local
7 cx41 4 shared 16.0 GB 160 GB local
9 cx51 8 shared 32.0 GB 240 GB local
11 ccx11 2 dedicated 8.0 GB 80 GB local
12 ccx21 4 dedicated 16.0 GB 160 GB local
13 ccx31 8 dedicated 32.0 GB 240 GB local
14 ccx41 16 dedicated 64.0 GB 360 GB local
15 ccx51 32 dedicated 128.0 GB 600 GB local
22 cpx11 2 shared 2.0 GB 40 GB local
23 cpx21 3 shared 4.0 GB 80 GB local
24 cpx31 4 shared 8.0 GB 160 GB local
25 cpx41 8 shared 16.0 GB 240 GB local
26 cpx51 16 shared 32.0 GB 360 GB local
33 ccx12 2 dedicated 8.0 GB 80 GB local
34 ccx22 4 dedicated 16.0 GB 160 GB local
35 ccx32 8 dedicated 32.0 GB 240 GB local
36 ccx42 16 dedicated 64.0 GB 360 GB local
37 ccx52 32 dedicated 128.0 GB 600 GB local
38 ccx62 48 dedicated 192.0 GB 960 GB local
Regarding the OS, I go with the latest LTS, Ubuntu 22.04. Again, checking availability via the CLI:
--- ~ » hcloud image list -t system -s name
ID TYPE NAME DESCRIPTION IMAGE SIZE DISK SIZE CREATED DEPRECATED
3 system centos-7 CentOS 7 - 5 GB Mon Jan 15 12:34:45 CET 2018 -
45778012 system centos-stream-8 CentOS Stream 8 - 5 GB Thu Aug 5 07:07:23 CEST 2021 -
59752342 system centos-stream-9 CentOS Stream 9 - 5 GB Thu Jan 27 08:52:03 CET 2022 -
5924233 system debian-10 Debian 10 - 5 GB Mon Jul 8 08:35:48 CEST 2019 -
45557056 system debian-11 Debian 11 - 5 GB Mon Aug 16 13:12:01 CEST 2021 -
69726282 system fedora-36 Fedora 36 - 5 GB Wed May 11 07:50:00 CEST 2022 -
45780948 system rocky-8 Rocky Linux 8 - 5 GB Thu Aug 19 08:30:23 CEST 2021 -
76766499 system rocky-9 Rocky Linux 9 - 5 GB Wed Jul 20 15:55:52 CEST 2022 -
168855 system ubuntu-18.04 Ubuntu 18.04 - 5 GB Wed May 2 13:02:30 CEST 2018 -
15512617 system ubuntu-20.04 Ubuntu 20.04 - 5 GB Thu Apr 23 19:55:14 CEST 2020 -
67794396 system ubuntu-22.04 Ubuntu 22.04 - 5 GB Thu Apr 21 15:32:38 CEST 2022 -
Next, bootstrapping the server (PII redacted):
hcloud server create --ssh-key redacted --image ubuntu-22.04 --name apeiratos --type cpx11 --location hel1
The node was ready in under a minute. I tried to log in:
hcloud server ssh apeiratos
It worked immediately: a fresh Ubuntu 22.04 LTS, ready for testnet deployment.
Because this is going to run in Docker containers and I'm not sure the setup script covers Docker Engine installation, I did this first after node setup via the official setup guide at https://docs.docker.com/engine/install/ubuntu/.
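The guide's apt-repository method boils down to roughly the following (condensed from the linked page; check it for the current commands before copy-pasting):

```bash
# Add Docker's official GPG key and apt repository, then install the engine
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
```

It worked without issues and the docker service is running: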
root@apeiratos:~# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2022-10-18 22:20:22 UTC; 15h ago
TriggeredBy: ● docker.socket
Docs: https://docs.docker.com
Main PID: 1648 (dockerd)
Tasks: 8
Memory: 21.9M
CPU: 5.665s
CGroup: /system.slice/docker.service
└─1648 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.147434124Z" level=info msg="scheme \"unix\" not registered>
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.147446146Z" level=info msg="ccResolverWrapper: sending upd>
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.147453099Z" level=info msg="ClientConn switching balancer >
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.181848845Z" level=info msg="Loading containers: start."
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.318011100Z" level=info msg="Default bridge (docker0) is as>
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.377792435Z" level=info msg="Loading containers: done."
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.402092437Z" level=info msg="Docker daemon" commit=03df974 >
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.402215928Z" level=info msg="Daemon has completed initializ>
Oct 18 22:20:22 apeiratos systemd[1]: Started Docker Application Container Engine.
Oct 18 22:20:22 apeiratos dockerd[1648]: time="2022-10-18T22:20:22.427780499Z" level=info msg="API listen on /run/docker.sock"
I then also updated the packages of the base Ubuntu 22.04 image provided by Hetzner to get the latest security fixes and base packages:
apt update && apt upgrade -y
Because this installed a later Linux kernel than the one the host was booted on, I also rebooted the machine to run on the newest Ubuntu-supported kernel.
reboot
After checking that I was on the updated kernel (`Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-52-generic x86_64)`), I removed the no-longer-required old kernel images & modules: `apt autoremove`
From the documentation, I went straight for the launch script: https://thresholdnetwork.notion.site/thresholdnetwork/Docker-Launch-Script-4d304d61be6941d78e450a79406f0403.
The first thing I needed was my Goerli WebSocket URL. I have an Infura account, so I logged in there and created a new project. In the project's network endpoints section, I switched the Ethereum endpoint to Goerli Testnet, copied the URL, and inserted it in the script where `ETHEREUM_WS_URL` is set.
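For reference, Infura's Goerli WebSocket endpoints have the following shape (the project ID is of course a placeholder):

```bash
ETHEREUM_WS_URL="wss://goerli.infura.io/ws/v3/<YOUR_INFURA_PROJECT_ID>"
```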
Next are the operator keys:
As https://docs.threshold.network/staking-and-running-a-node/running-a-node/self-managed/tbtc-v2-node-setup/operator-account describes, I generated my key & password with `geth account new --keystore ./operator-key`. Because I already had the latest geth installed on my local machine, I did it from there and then copied the keys to the server via `scp -P 22 -ri path/to/redacted/key operator-key redacted:`
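As an aside, geth names the keystore file after the creation timestamp and account address, so the result looks something like this (timestamp/address below are made up for illustration):

```bash
ls operator-key/
# UTC--2022-10-19T12-00-00.000000000Z--0123456789abcdef0123456789abcdef01234567
```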
After reading the script I figured the keyfile needs to go into `CONFIG_DIR`, so I moved it there: `mkdir config storage; cp operator-key/keyfile config/`. I then filled `OPERATOR_KEY_FILE_NAME` and `OPERATOR_KEY_FILE_PASSWORD` with my specific inputs.
Finally, I inserted my `PUBLIC_IP` in the script. Then I started the script to test my configuration (and to see what happens / what errors are generated when I don't have any GoerliETH/GoerliT yet). I expected to run into issues at this point, but wanted to see whether the container generally runs and whether the binary finds my keys and can unlock the keyfile with my password. The output:
Unable to find image 'us-docker.pkg.dev/keep-test-f3e0/public/keep-client:latest' locally
latest: Pulling from keep-test-f3e0/public/keep-client
213ec9aee27d: Pull complete
17c536332366: Pull complete
Digest: sha256:81b47295efa93dc4acaf74b011630bd775e6ffeb2dcda4d7b6f2eeed380b320e
Status: Downloaded newer image for us-docker.pkg.dev/keep-test-f3e0/public/keep-client:latest
51db1142c31e2e1d2f0fd4af7a31afddd3708abdb10d4a4d65d8e978ce05a9d4
It doesn't throw an error, so far so good. Let's see whether the container crashed or is still running:
docker ps
CONTAINER ID   IMAGE                                                 COMMAND                  CREATED         STATUS         PORTS                                                                                  NAMES
51db1142c31e   us-docker.pkg.dev/keep-test-f3e0/public/keep-client   "keep-client start -…"   5 seconds ago   Up 4 seconds   0.0.0.0:3919->3919/tcp, :::3919->3919/tcp, 0.0.0.0:9601->9601/tcp, :::9601->9601/tcp   naughty_mendel
This is good! It didn't crash, so it seems to be waiting in standby. Let's check the logs:
2022-10-19T14:59:03.951Z INFO keep-cmd cmd/start.go:61 Starting the client against [goerli] ethereum network.
2022-10-19T14:59:05.142Z INFO keep-ethereum ethereum/ethereum.go:294 enabled ethereum client request rate limiter; rps limit [150]; concurrency limit [30]
2022-10-19T14:59:05.528Z WARN keep-ethereum [email protected]/log.go:180 could not create subscription to new blocks: [notifications not supported]
2022-10-19T14:59:06.656Z WARN keep-libp2p [email protected]/log.go:180 could not establish connection with bootstrap peer [16Uiu2HAky2Y4Tyq5vTA1CxikcDes6o5EH11i2qcg5dBV9W3Lks5c]: [failed to dial 16Uiu2HAky2Y4Tyq5vTA1CxikcDes6o5EH11i2qcg5dBV9W3Lks5c:
* [/ip4/34.141.9.57/tcp/3919] failed to negotiate stream multiplexer: EOF]
2022-10-19T14:59:06.957Z WARN keep-libp2p [email protected]/log.go:180 could not establish connection with bootstrap peer [16Uiu2HAmMosdpAuRSw1ahNhqFq8e3Y4d4c5WZkjW1FGQi5WJwWZ7]: [failed to dial 16Uiu2HAmMosdpAuRSw1ahNhqFq8e3Y4d4c5WZkjW1FGQi5WJwWZ7:
* [/ip4/20.81.168.158/tcp/4001] failed to negotiate stream multiplexer: EOF]
2022-10-19T14:59:07.065Z WARN keep-libp2p [email protected]/log.go:180 could not establish connection with bootstrap peer [16Uiu2HAmCcfVpHwfBKNFbQuhvGuFXHVLQ65gB4sJm7HyrcZuLttH]: [failed to dial 16Uiu2HAmCcfVpHwfBKNFbQuhvGuFXHVLQ65gB4sJm7HyrcZuLttH:
* [/ip4/104.154.61.116/tcp/3919] failed to negotiate stream multiplexer: EOF]
2022-10-19T14:59:07.115Z WARN keep-libp2p [email protected]/log.go:180 could not establish connection with bootstrap peer [16Uiu2HAm3eJtyFKAttzJ85NLMromHuRg4yyum3CREMf6CHBBV6KY]: [failed to dial 16Uiu2HAm3eJtyFKAttzJ85NLMromHuRg4yyum3CREMf6CHBBV6KY:
* [/ip4/35.223.100.87/tcp/3919] failed to negotiate stream multiplexer: EOF]
2022-10-19T14:59:08.371Z WARN keep-libp2p [email protected]/log.go:180 could not establish connection with bootstrap peer [16Uiu2HAm77eSvRq5ioD4J8VFPkq3bJHBEHkssCuiFkgAoABwjo2S]: [failed to dial 16Uiu2HAm77eSvRq5ioD4J8VFPkq3bJHBEHkssCuiFkgAoABwjo2S:
* [/ip4/52.79.203.57/tcp/3919] failed to negotiate stream multiplexer: EOF]
2022-10-19T14:59:08.371Z WARN keep-libp2p [email protected]/log.go:180 bootstrap round error: [all bootstrap attempts failed]
2022-10-19T14:59:08.371Z INFO keep-clientinfo clientinfo/metrics.go:144 observing connected_peers_count with [1m0s] tick
▓▓▌ ▓▓ ▐▓▓ ▓▓▓▓▓▓▓▓▓▓▌▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▄
▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▌▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
▓▓▓▓▓▓ ▓▓▓▓▓▓▓▀ ▐▓▓▓▓▓▓ ▐▓▓▓▓▓ ▓▓▓▓▓▓ ▓▓▓▓▓ ▐▓▓▓▓▓▌ ▐▓▓▓▓▓▓
▓▓▓▓▓▓▄▄▓▓▓▓▓▓▓▀ ▐▓▓▓▓▓▓▄▄▄▄ ▓▓▓▓▓▓▄▄▄▄ ▐▓▓▓▓▓▌ ▐▓▓▓▓▓▓
▓▓▓▓▓▓▓▓▓▓▓▓▓▀ ▐▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▌ ▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
▓▓▓▓▓▓▀▀▓▓▓▓▓▓▄ ▐▓▓▓▓▓▓▀▀▀▀ ▓▓▓▓▓▓▀▀▀▀ ▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▀
▓▓▓▓▓▓ ▀▓▓▓▓▓▓▄ ▐▓▓▓▓▓▓ ▓▓▓▓▓ ▓▓▓▓▓▓ ▓▓▓▓▓ ▐▓▓▓▓▓▌
▓▓▓▓▓▓▓▓▓▓ █▓▓▓▓▓▓▓▓▓ ▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓
▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓ ▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓
Trust math, not hardware.
--------------------------------------------------------------------------------------------------
| Keep Client Node |
| |
| Version: v1.3.1-6462-gf3e894fa7 (f3e894fa7) |
| |
...
All looking quite good! The last log line tells me what to do next:
2022-10-19T15:10:11.048Z FATAL keep-cmd cmd/start.go:36 error initializing beacon: [could not set up sortition pool monitoring: [operator not registered for the staking provider, check Threshold dashboard]]
I'll therefore wait until I've got the Goerli tokens and then continue from the Threshold dashboard according to https://docs.threshold.network/staking-and-running-a-node/running-a-node/self-managed/tbtc-v2-node-setup/application-authorization-and-operator-registration.
This guide, of course, does not link to the testnet dashboard, but I found the correct link in the mail: https://dashboard.test.threshold.network/overview/network
So according to the dashboard, I have 3 steps to complete:
- STAKE TOKENS
  - Staked T via https://dashboard.test.threshold.network/staking/. Confirmed in 2 TX for a 50k T stake.
- AUTHORIZE APPS
  - Authorized the tBTC & Random Beacon apps for 50k T. Confirmed in 2 TX (one per app).
- SET UP NODE
  - The software setup for this was already covered on Day 1. It looks like I just need to map the operator address to my stake.
  - The operator registration / mapping guide leads me through the next 2 TX. For the operator address, I use the same address as the provider address. The dashboard confirms that I have successfully mapped the operator <-> provider addresses 🔥
I like that the dashboard leads the user through the 3 steps sequentially. I didn't have to search around the dashboard to stake / authorize the tBTC/Random Beacon apps, which saved me time and avoided confusion about which apps I would need to authorize. Also a plus: I can go directly to the setup docs for the different apps from the applications page.
After checking via `docker logs` on my server, it turns out the docker container is constantly restarting after FATALing during initialization. I still see the same message I did before I went through the dashboard staking/authorization/mapping steps:
2022-10-19T22:43:55.750Z FATAL keep-cmd cmd/start.go:36 error initializing beacon: [could not set up sortition pool monitoring: [operator not registered for the staking provider, check Threshold dashboard]]
I'm getting the impression I got the order of the setup steps fundamentally wrong (#1 basically test-running the node without properly doing #2, the staking/authorization/operator mapping). I also now see this remark in the setup guide "Application Authorization & Operator Registration":
Don't forget: a tBTC v2 node will not be able to be deployed without successfully authorizing both the tBTC and Random Beacon applications, and registering the node's operator address FIRST.
This does not sound good... I probably have to contact support about what I can do to register my node now, because I DID run the node before registering the node's operator address. At this point I don't know why this order is fundamentally not possible and would need some explanation from the team.
I do see the comment in the guide, but if it has such a fundamental impact, it should at least be mentioned in bold in the top section (https://docs.threshold.network/staking-and-running-a-node/running-a-node/self-managed/tbtc-v2-node-setup).
On second look into https://docs.threshold.network/extras/contract-addresses/goerli-testnet, I see that those contract addresses don't match what I see in my node client logs:
--------------------------------------------------------------------------------------------------
| Keep Client Node |
| |
| Version: v1.3.1-6462-gf3e894fa7 (f3e894fa7) |
| |
| Operator: redacted |
| |
| Port: 3919 |
| IPs : /ip4/redacted/tcp/3919/ipfs/redacted |
| |
| Contracts: |
| RandomBeacon : 0x0Acf872ea89b73E70Aa1d1b8F46dC01bB71f4B03 |
| WalletRegistry : 0x2d51348b1a903aAECF266d0844dA69a178fC1dC7 |
| TokenStaking : 0x1da5d88C26EA4f87b5e09C3452eE2384Ee20DC75 |
--------------------------------------------------------------------------------------------------
Looks to me like there is a mismatch between the tbtc docker container (fetched via the setup script) and what the test dashboard is using. The client also seems to be the outdated side, as I don't see any recent registration tx for e.g. the WalletRegistry used by the container (https://goerli.etherscan.io/address/0x2d51348b1a903aAECF266d0844dA69a178fC1dC7). Will clarify with support what to do, because at this point I don't know which contracts should be called for registration.
Support answered and updated the docker images. I then pulled the latest with `docker pull us-docker.pkg.dev/keep-test-f3e0/public/keep-client`, removed the old container running the old image with `docker rm naughty_mendel`, and restarted the launch script: `./keep.sh`. I can immediately see the corrected contract addresses in the log:
| |
| Contracts: |
| RandomBeacon : 0xaFfCD4734eEa140Ba5666Bf60541CCAFfa74F4be |
| WalletRegistry : 0x82BE0F8C8d43fAC584B03f4b782370E202A34527 |
| TokenStaking : 0x1da5d88C26EA4f87b5e09C3452eE2384Ee20DC75 |
--------------------------------------------------------------------------------------------------
This looks good 😄 Next, I let the node run for a few hours and then checked again. Via `docker ps` I can see that it didn't crash, but I'm not sure whether my node is active, because I still see recurring warnings in the logs like:
2022-10-20T16:22:44.117Z WARN keep-ethereum [email protected]/log.go:180 could not create subscription to new blocks: [notifications not supported]
and
2022-10-20T19:20:41.182Z WARN keep-libp2p [email protected]/log.go:180 bootstrap round error: [all bootstrap attempts failed]
I do have connected peers, though:
2022-10-20T19:20:42.904Z INFO keep-libp2p libp2p/libp2p.go:241 number of connected peers: [11]
Maybe the bootstrap nodes are currently down, or these warnings can be ignored... I need to check the documentation again to see how to verify a successfully operating node.
Update: OK, through the Discord forum, thanks to Vict0r, I got the answer to my first issue, the block subscription warning. On Day 1 I copied the HTTPS endpoint from Infura instead of the WebSocket RPC endpoint, no wonder! I have now replaced it with the correct WS endpoint. The only warnings still occurring regularly are the bootstrap round errors. I don't think this is a connection issue on my side, as Hetzner generally doesn't block incoming/outgoing traffic, and I didn't create application-based firewall rules that would limit port connectivity as documented in https://docs.threshold.network/staking-and-running-a-node/running-a-node/self-managed/tbtc-v2-node-setup/network-configuration.
For now I think the setup part has been successful, and I will continue building out the infrastructure for monitoring & failover service recovery in the post-setup part of my journal.
After searching for available monitoring info (see Docs Day 3), I checked if the metrics endpoint was working for me. It was:
➜ grafana curl http://localhost:9601/metrics
client_info{version="v2.0.0-m1-9-g4049dc015"} 1
# TYPE connected_bootstrap_count gauge
connected_bootstrap_count 2 1666534806116
# TYPE connected_peers_count gauge
connected_peers_count 13 1666534806116
# TYPE eth_connectivity gauge
eth_connectivity 1 1666534746116
# TYPE tbtc_pre_params_count gauge
tbtc_pre_params_count 1000 1666534806577
I then copied my docker-compose monitoring stack conf, which includes:
- Grafana
- Prometheus
- Prometheus Node-Exporter
- Traefik (for domain forwarding / automatic Let's Encrypt certificate generation)
I re-used the same setup as I did for my PRE/nucypher node.
My docker-compose conf (again with PII redacted):
name: grafana
services:
  grafana:
    depends_on:
      prometheus:
        condition: service_started
    environment:
      GF_SECURITY_ADMIN_PASSWORD: redacted
      GF_SECURITY_ADMIN_USER: redacted
      GF_SERVER_DOMAIN: redacted
      GF_SERVER_ROOT_URL: https://
      GF_USERS_ALLOW_SIGN_UP: "false"
    extra_hosts:
      host.docker.internal: host-gateway
    image: grafana/grafana:latest
    labels:
      traefik.http.middlewares.some-name-redirect.redirectScheme.permanent: "true"
      traefik.http.middlewares.some-name-redirect.redirectScheme.scheme: https
      traefik.http.routers.some-name-ssl.entryPoints: port443
      traefik.http.routers.some-name-ssl.rule: host(`redacted`)
      traefik.http.routers.some-name-ssl.service: some-name-ssl
      traefik.http.routers.some-name-ssl.tls: "true"
      traefik.http.routers.some-name-ssl.tls.certResolver: le-ssl
      traefik.http.routers.some-name.entryPoints: port80
      traefik.http.routers.some-name.middlewares: some-name-redirect
      traefik.http.routers.some-name.rule: host(`redacted`)
      traefik.http.services.some-name-ssl.loadBalancer.server.port: "3000"
    logging:
      driver: journald
    networks:
      default: null
    restart: always
    volumes:
      - type: volume
        source: grafana-data
        target: /var/lib/grafana
        volume: {}
  node-exporter:
    container_name: node-exporter
    depends_on:
      prometheus:
        condition: service_started
    hostname: node-exporter
    image: prom/node-exporter
    network_mode: host
    pid: host
    restart: always
  prometheus:
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=4w
      - --web.console.libraries=/etc/prometheus/console_libraries
      - --web.console.templates=/etc/prometheus/consoles
      - --web.listen-address=0.0.0.0:9090
    image: prom/prometheus:v2.22.0
    network_mode: host
    restart: always
    volumes:
      - type: bind
        source: /root/grafana/prometheus
        target: /etc/prometheus
        bind:
          create_host_path: true
      - type: volume
        # the snippet was cut off here; reconstructed as the named volume backing --storage.tsdb.path
        source: prometheus-data
        target: /prometheus
        volume: {}
volumes:
  grafana-data: {}
  prometheus-data: {}
My Prometheus config is quite simple and resides in `prometheus/prometheus.yml`:
# my global config
global:
  scrape_interval: 10s     # scrape targets every 10 seconds (the default would be 15s)
  evaluation_interval: 10s # evaluate rules every 10 seconds (the default would be 15s)
scrape_configs:
  # Scrape Prometheus itself (it listens on 9090, as configured in the compose file)
  - job_name: 'prometheus'
    scrape_interval: 10s
    scrape_timeout: 10s
    static_configs:
      - targets: ['localhost:9090']
  # Scrape the Node Exporter
  - job_name: 'node'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9100']
  # Scrape tbtc
  - job_name: 'tbtc'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9601']
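To sanity-check that all three scrape jobs come up healthy, the standard Prometheus HTTP API can be queried directly (requires jq; this is my own quick check, not something from the docs):

```bash
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'
# expected: prometheus: up / node: up / tbtc: up
```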
These are the only two files I need to deploy a self-contained monitoring instance on this node, which I can securely connect to remotely.
I then added a new DNS A record for this node to my domain for this purpose.
Tomorrow I will continue with adding a Grafana dashboard and connecting the Prometheus data source to it.
First I checked whether Grafana was correctly answering under the new domain name.
This worked, so after logging in I continued with registering the Prometheus data source.
A note here: in my stack setup, Prometheus is configured in host networking mode. In order to reach this container from other services like Grafana (which resides in the `grafana_default` network), we need to use the special Docker-internal host-resolving address `host.docker.internal`. If you don't have this requirement (i.e. Prometheus joined to the docker-compose network instead), you could address it from Grafana normally under its service name (`prometheus`).
To connect the Prometheus data source to Grafana:
- Configuration -> Data Sources -> Add Data Source
- Select "Prometheus".
- Go to the HTTP configuration section. Specify `http://host.docker.internal:9090` as the URL.
- At the bottom, click on "Save & test". This should work. If not, check whether you can access `http://host.docker.internal:9090` from the Grafana container, e.g. via `docker-compose exec grafana bash -c "nc host.docker.internal 9090 -v"`. This should respond with `host.docker.internal ([internal-ip]:9090) open`. If you don't see this, you probably need to use `http://prometheus:9090` instead (again, this can be tested both from the Grafana UI and from the container with the `nc` command).
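As an aside: instead of clicking through the UI, the data source can also be provisioned from a file, which survives container re-creation. A minimal sketch, assuming Grafana's standard provisioning directory is mounted into the container:

```yaml
# e.g. /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://host.docker.internal:9090
    isDefault: true
```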
Next, I created a new tBTC v2 dashboard. I attached the dashboard as exported JSON to this gist. Here's a screenshot:
I added a Grafana alerting rule & notification policy to notify me via TG if my tbtc v2 node's peer count drops below a certain number of peers. I tested the alert rule and notification policy by setting the peer threshold extra high so it would fire. After the evaluation period (5m) was over, I was alerted via TG, and I then updated the threshold to a non-firing sane default.
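The condition behind the rule is just a PromQL comparison, so it can be dry-run by hand against Prometheus before wiring it into the alert (the threshold of 5 below is my own arbitrary sane default, not an official recommendation):

```bash
# Returns a non-empty result vector exactly when the alert should fire
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=connected_peers_count < 5'
```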
Going over the launch script guide https://thresholdnetwork.notion.site/thresholdnetwork/Docker-Launch-Script-4d304d61be6941d78e450a79406f0403, I would make two remarks:
- There are several parts of the script (`ETHEREUM_WS_URL="<Ethereum API WS URL>"`, `OPERATOR_KEY_FILE_NAME="<Operator Account keyfile name>"`, `PUBLIC_IP="<PUBLIC_IP_OF_MACHINE>"`) where you need to replace template placeholders with data specific to each operator. If you run the script as is, the `docker run` will fail. For some novice node operators this could be the first hurdle that ends in a support contact: they go by the guide and don't understand that they need to change parts of the script, because there's no step in the guide that mentions replacing the data. Such a step should be included so novice operators know what they need to change in order to `docker run` successfully.
- `/home/keep` path requirements: there are hardcoded paths to `/home/keep/config` & `/home/keep/storage`. I would replace these with a more generic approach for operators that don't run the node as the `keep` user, using `$HOME/config` & `$HOME/storage` instead (see the sketch after this list).
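A minimal sketch of what I mean for the second point (`CONFIG_DIR` is the name used in the script; `STORAGE_DIR` is my guess for its storage counterpart, and the fallback defaults are my suggestion):

```bash
# Default to the invoking user's home directory, but let the operator override the paths
CONFIG_DIR="${CONFIG_DIR:-$HOME/config}"
STORAGE_DIR="${STORAGE_DIR:-$HOME/storage}"
mkdir -p "$CONFIG_DIR" "$STORAGE_DIR"
```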
The documentation on docs.threshold.network for the tBTC setup is quite straightforward. As already remarked regarding the setup guide, I would make the important disclaimers more prominent.
After starting the container and seeing a bunch of warnings in the logs, I would have wished for more detail in the documentation on which log events to watch for in order to check that my node has successfully joined as a new tbtc staking node.
Simply stating that the operator seeing the KEEP logo + operator/contract addresses means the node is up (as both the Docker & Binary Installation docs do) gives a false sense of safety that the node is active.
Instead, I would tell operators to look for specific activation events of the services the node should provide (e.g. number of connected peers, successful bootstrapping messages, successful subscriptions, protocol exchanges with peers, etc.).
I'd also recommend putting up some kind of rudimentary monitoring guide in the future, but as was explained to me in yesterday's intro interview, this can be sourced by working together with the beta testers, who will try different methods. I'll therefore check the code / tbtc API docs to see what can be fetched, try out what can be done, and note my results in the post-setup section.
Checked the documentation for how to best export tbtc v2 node metrics. TBH the docs did not help me in this endeavour and should surely get a sub-page under https://docs.threshold.network/staking-and-running-a-node/running-a-node/self-managed/tbtc-v2-node-setup. So next I searched the Threshold Discord for tbtc metrics and found great info in Vict0r's answers to the community. Apparently I can check `http://localhost:9601/metrics`, and, for more detailed information regarding the client & connected peers, `http://localhost:9601/diagnostic`. The former is Prometheus-compatible, the latter is not. So my plan is to copy my working Prometheus monitoring stack onto this node and create a simple tbtc v2 dashboard, and/or copy an already existing one from the Threshold community and customize it (if available).
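To quickly eyeball both endpoints (I'm only assuming from the Discord answers that `/diagnostic` returns a structured document; the docs don't describe it):

```bash
curl -s http://localhost:9601/metrics      # Prometheus text exposition format
curl -s http://localhost:9601/diagnostic   # detailed client & connected-peer information
```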
During my post-setup work with Grafana (see the Day 4 section above) I searched the Discord community once again, as I already knew the docs website does not contain anything regarding tbtc monitoring. I didn't find anything there either, so next I checked the tbtc repositories. I found two Grafana dashboards by searching for one of the metrics (`tbtc_pre_params_count`):
- https://github.com/keep-network/keep-core/blob/357243671dfc734181143ccb1c70e825598a3fc9/infrastructure/kube/keep-test/monitoring/grafana/dashboards/keep/keep-network-nodes.json
- https://github.com/keep-network/keep-core/blob/357243671dfc734181143ccb1c70e825598a3fc9/infrastructure/kube/keep-test/monitoring/grafana/dashboards/keep/keep-network-nodes-public.json
I imported those as well and checked them against my manually created dashboard. They are quite similar, as there aren't that many metrics exported for now.
In an upcoming docs section on monitoring I would probably include one of those dashboards, plus a note on how the job name should be set in the Prometheus config so it automatically links with the name used in the dashboards, or make the dashboards more configurable to allow independent job names for the tbtc part.
I would also add a sub-section on alerting: what should be tracked (probably connectivity metrics to eth / bootstrap nodes, low peer count) and how to check whether your node is still online (e.g. a node network metrics explorer like the one that was done for the nucypher network).
One could also recommend an availability monitoring provider like https://Pingdom.com, https://betteruptime.com/, etc. (or guides on how to do this with popular hosting providers like AWS, GCloud, Azure, DigitalOcean, etc.).
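As a self-hosted poor man's variant of such an availability check, a cron-driven probe against the metrics endpoint already goes a long way. A sketch, assuming a configured MTA for `mail` and a placeholder address:

```bash
# crontab entry (single line): probe every 5 minutes, send a mail when the endpoint is down
*/5 * * * * curl -fsS --max-time 10 http://localhost:9601/metrics > /dev/null || echo "tbtc metrics endpoint down on $(hostname)" | mail -s "node alert" ops@example.com
```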