Salt Minion Disconnecting Abnormally on Azure

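On several Azure machines I found the salt minion constantly switching between master nodes. I assumed it was jitter on the route to Alibaba Cloud Hong Kong, so I added an Alibaba Cloud Silicon Valley node and pointed all the Azure nodes inside the US at it. Still broken.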

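Around the same time, monitoring also caught several episodes of ping loss from Azure to Alibaba Cloud in mainland China. I assumed it was a network problem, captured MTR traces, and filed a ticket...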

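After the network recovered, the minion still switched masters frequently. Switching by itself would be fine; the problem is that after a switch it seemed to enter a broken state that only restarting the salt minion would fix.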

I vaguely remembered finding an issue that blamed the master list, so I switched to a single master and watched it overnight. Still broken.

Then I suspected zeromq and switched to the tcp transport. The problem was still there.

Feb 27 14:51:07 ... salt-minion[18665]: AttributeError: 'NoneType' object has no attribute 'add_callback'
Feb 27 15:07:35 ... salt-minion[18665]: [WARNING ] Master ip address changed from 47.8 to 18.1
Feb 27 15:07:35 ... salt-minion[18665]: [WARNING ] Master ip address changed from 47.8 to 18.1
Feb 27 15:07:36 ... salt-minion[18665]: [WARNING ] Master ip address changed from 47.8 to 18.1
Feb 27 15:07:36 ... salt-minion[18665]: [ERROR   ] Exception in callback <function SaltMessageClient.connect.<locals>.handle_future at 0x7f6283dac1e0> for <salt.ext.
Feb 27 15:07:36 ... salt-minion[18665]: Traceback (most recent call last):
Feb 27 15:07:36 ... salt-minion[18665]: File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 326, in _set_done
Feb 27 15:07:36 ... salt-minion[18665]: cb(self)
Feb 27 15:07:36 ... salt-minion[18665]: File "/usr/lib/python3.6/site-packages/salt/transport/tcp.py", line 1043, in handle_future
Feb 27 15:07:36 ... salt-minion[18665]: self.io_loop.add_callback(self.connect_callback, response)
Feb 27 15:07:36 ... salt-minion[18665]: AttributeError: 'NoneType' object has no attribute 'add_callback'
Feb 27 15:23:55 ... salt-minion[18665]: [WARNING ] Master ip address changed from 18.1 to 47.8
Feb 27 15:23:56 ... salt-minion[18665]: [ERROR   ] Exception in callback <function SaltMessageClient.connect.<locals>.handle_future at 0x7f6284bf1b70> for <salt.ext.
Feb 27 15:23:56 ... salt-minion[18665]: Traceback (most recent call last):
Feb 27 15:23:56 ... salt-minion[18665]: File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 326, in _set_done
Feb 27 15:23:56 ... salt-minion[18665]: cb(self)
Feb 27 15:23:56 ... salt-minion[18665]: File "/usr/lib/python3.6/site-packages/salt/transport/tcp.py", line 1043, in handle_future
Feb 27 15:23:56 ... salt-minion[18665]: self.io_loop.add_callback(self.connect_callback, response)
Feb 27 15:23:56 ... salt-minion[18665]: AttributeError: 'NoneType' object has no attribute 'add_callback'
Feb 27 15:40:15 ... salt-minion[18665]: [WARNING ] Master ip address changed from ... to ...
Feb 27 15:40:16 ... salt-minion[18665]: [WARNING ] Master ip address changed from ... to ...
Feb 27 15:40:16 ... salt-minion[18665]: [ERROR   ] Exception in callback <function SaltMessageClient.connect.<locals>.handle_future at 0x7f6280447e18> for <salt.ext.
Feb 27 15:40:16 ... salt-minion[18665]: Traceback (most recent call last):
Feb 27 15:40:16 ... salt-minion[18665]: File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 326, in _set_done
Feb 27 15:40:16 ... salt-minion[18665]: cb(self)
Feb 27 15:40:16 ... salt-minion[18665]: File "/usr/lib/python3.6/site-packages/salt/transport/tcp.py", line 1043, in handle_future
Feb 27 15:40:16 ... salt-minion[18665]: self.io_loop.add_callback(self.connect_callback, response)
Feb 27 15:40:16 ... salt-minion[18665]: AttributeError: 'NoneType' object has no attribute 'add_callback'

I dug through a few more issues:

If Azure isn't respecting keepalive, that could definitely be causing your problems. As of right now, the minions will not attempt to reconnect outside of the ZMQ keepalive routines. (We recognize that this is a problem -- the biggest blocker is the fact that ZMQ is not very good at reporting that connections are dead. We've been trying to find a good way around this problem)
(basepi commented on Dec 6, 2013)

It turns out Azure has a default idle timeout for network connections; see "Use keepalives to reset the outbound idle timeout":

Outbound connections have a 4-minute idle timeout. This timeout is adjustable via Outbound rules. You can also use transport (for example, TCP keepalives) or application-layer keepalives to refresh an idle flow and reset this idle timeout if necessary.

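Azure's default is 240s, shorter than the salt minion's default of 300s, so a long-lived connection that sits idle gets cut off in the middle by Azure. My guess is that salt's code does not handle this closed state very well.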
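The comparison can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not anything taken from salt; the function name and constants are made up for the example, and it assumes an otherwise silent connection where the first keepalive probe is the only traffic.

```python
# Illustrative constants: the values come from the text above,
# the names are hypothetical.
AZURE_IDLE_TIMEOUT_S = 240           # Azure's 4-minute outbound idle timeout
SALT_DEFAULT_KEEPALIVE_IDLE_S = 300  # salt minion default tcp_keepalive_idle


def keepalive_beats_idle_timeout(keepalive_idle_s,
                                 idle_timeout_s=AZURE_IDLE_TIMEOUT_S):
    """Return True if the first keepalive probe goes out before the
    idle flow would be dropped by the middlebox."""
    return keepalive_idle_s < idle_timeout_s


for idle in (SALT_DEFAULT_KEEPALIVE_IDLE_S, 60):
    if keepalive_beats_idle_timeout(idle):
        verdict = "probe is sent in time"
    else:
        verdict = "flow is dropped before the first probe"
    print(f"tcp_keepalive_idle={idle}s: {verdict}")
```

With the default of 300s the probe arrives 60 seconds too late; any value comfortably below 240s keeps the flow refreshed.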

Set tcp_keepalive_idle explicitly (the default is 300s):

tcp_keepalive_idle: 60

Restarted the minion, and the problem was finally solved.
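For reference, a setting like this roughly corresponds to the TCP keepalive options on the underlying socket. The sketch below shows those options using Python's standard socket module; it is not salt's actual code, the helper name is hypothetical, and the probe interval and count values are arbitrary illustrations. The TCP_KEEP* constants are platform-dependent (they exist on Linux), so they are guarded.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, count=3):
    """Turn on TCP keepalive for sock (hypothetical helper).

    idle_s mirrors tcp_keepalive_idle: seconds of silence before the
    first probe. interval_s and count are illustrative values only.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print("keepalive enabled:",
      sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

With idle_s well under 240 seconds, the kernel sends a probe on a quiet connection before Azure's idle timer expires, which resets that timer.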

fangpsh's blog