I experienced something similar when building a temporary ECS/Fargate cluster via dask-cloudprovider. The cause ultimately turned out to be network architecture. Here are some recommendations:
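- Make sure you have firewall rules that let your scheduler, workers, and client talk to each other. In AWS this is a "Security Group", which is attached to the tasks/network interfaces rather than to the IAM roles; I'm not positive what the equivalent is on other platforms.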
- Ensure your network routing tables are correctly set up for your internet gateways and allow ingress and egress for your nodes. Misconfiguration here can also be a security risk, particularly with a private subnet. If you are trying to run in a private subnet, definitely check whether the NAT gateway is set up properly, as well as any load balancers you may have.
- I see that your system is looking on port 2323. As far as I know, the Dask scheduler listens on 8786 by default and 8787 is the dashboard port, so I'd look into whether 2323 is really the port you configured and opened if you're unsure.
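To narrow down whether it is a security group / routing issue or a port mismatch, a plain TCP reachability check from the client machine (or from inside a worker container) can help. This is only a rough sketch using the standard library: the host `10.0.0.12` is a made-up placeholder for your scheduler's address, and the port list just combines the 2323 from your logs with the Dask defaults I mentioned.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_ports(host: str, ports, timeout: float = 3.0) -> dict:
    """Map each port to whether a TCP connection to it could be opened."""
    return {port: can_connect(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Hypothetical scheduler address: replace with your scheduler's IP or DNS name.
    scheduler_host = "10.0.0.12"
    # 2323 is the port from the question; 8786/8787 are the usual Dask defaults.
    for port, ok in check_ports(scheduler_host, [2323, 8786, 8787]).items():
        print(f"{scheduler_host}:{port} -> {'reachable' if ok else 'NOT reachable'}")
```

If a port hangs until the timeout rather than failing right away, that usually points at a firewall or routing rule silently dropping packets; an immediate failure more often means nothing is listening on that port.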
This problem is pretty hard to nail down, so I'd recommend a fair amount of trial and error. Check the logs on each worker and on the scheduler, and look for other hints about what could be causing the issue.