统一告警服务
指标告警在大数据平台运维是很重要的,告警分多种类
1)服务内部指标,通过jmx暴露出来,再用prometheus去采集,在grafana中展示,grafana中有告警配置,根据指标做各种配置告警。
2)shell脚本,监控进程和服务端口
grafana告警
grafana告警配置
例:统计datandoe变动告警:
-- psql:
Hadoop_NameNode_NumLiveDataNodes{cluster=~"yycluster0.*",name="FSNamesystem"} -Hadoop_NameNode_NumLiveDataNodes{cluster=~"yycluster0.*",name="FSNamesystem"} offset 1m
告警配置如下:
通过webhook把消息发送到后端接口:

然后后端接口接收到消息,再解析进行发送到打电话或发邮件或钉钉,如流等。下面是一个通用消息解析,打电话的参考,可根据情况自己实现。
显示/隐藏:告警解析示例(Python,默认收起,点击展示开)
#!/usr/bin/env python
# coding=utf-8
# encoding=utf8
# -*-coding:utf-8-*-
from flask import Flask, jsonify, request, json
import requests
import sys
app = Flask(__name__)
@app.route('/')
def hello_world():
return 'Hello, World!'
@app.route('/pushPhone', methods=['GET', 'POST'])
def pushPhone():
data = request.get_data()
parm_data = json.loads(data)
print type(parm_data)
print parm_data
message_context = parm_data['message']
# print type(message_context)
# print message_context
# message_context = json.loads(eval(json.dumps(message_context, ensure_ascii=False).encode('utf-8')))
message_context = json.loads(eval(json.dumps(message_context)))
print type(message_context)
print(message_context)
# parm_message_context = json.loads(message_context)
message = message_context['message']
i_id = message_context["iid"]
sid = message_context['sid']
msgCustom = ""
if 'msgCustom' in message_context:
msgCustom = message_context['msgCustom']
op_admin_dw = "dw_liangrui"
if 'op_admin_dw' in message_context:
op_admin_dw = message_context['op_admin_dw']
msg_key = message_context['msg_key']
# 告警值
if op_admin_dw in (None, ""):
op_admin_dw = "dw_liangrui"
evalMatchesStr = ""
# print(parm_data)
if 'evalMatches' in parm_data:
evalMatches = parm_data['evalMatches']
for eva in evalMatches:
if msgCustom is None or msgCustom == "":
evalMatchesStr = evalMatchesStr + (eva['metric'] + ":" + str(round(eva['value'], 2))) + "%"
else:
# tags.deployment_name
# metric.ods-turnover-order-id-mapping
if "." in msgCustom:
msgCustomArray = msgCustom.split(".")
tagsOrMetric = msgCustomArray[0]
metricMsgCustomKey = msgCustomArray[1]
if tagsOrMetric == "tags":
tags = eva['tags']
tagsKeys = metricMsgCustomKey.split(",")
for tag in tagsKeys:
sendMsg = tags[tag]
evalMatchesStr = evalMatchesStr + (tag + ":" + sendMsg+"_")
# sendMsg = tags[metricMsgCustomKey]
# evalMatchesStr = evalMatchesStr + (metricMsgCustomKey + ":" + sendMsg) + "%"
elif tagsOrMetric == "metric":
metric = eva['metric']
keyArray = metricMsgCustomKey.split(",")
for key in keyArray:
if key == metric:
evalMatchesStr = evalMatchesStr + (metric + ":" + str(round(eva['value'], 2))+"_")
else:
print("metric not:" + msgCustom)
print "evalMatchesStr=" + evalMatchesStr
import os
reload(sys)
sys.setdefaultencoding('utf-8')
if sid is not None and sid != '':
# get app
parm_str = "id=%s&sid=%s&msg=%s&msg_key=%s" % (
i_id, sid, message + "-" + evalMatchesStr.replace("=", "-"), msg_key)
print "parm_str=%s" % parm_str.encode('utf-8')
# 这里是服务器上的打电话告警脚本,每个公司的接入都不一样,需要自己去实现打电话的逻辑即可
sys_str = (
'python /home/dspeak/yyms/yymp/yymp_report_script/yymp_report_alarm.py '
'"op_admin_dw=%s&%s"' % (op_admin_dw, parm_str.encode('utf-8')))
else:
print 'other===='
sys_str = (
'python /home/dspeak/yyms/yymp/yymp_report_script/yymp_report_alarm.py "id=45496&sid=358762&op_admin_dw=dw_liangrui&msg_key=dw236&msg=YARN调度异常"')
print "python====%s" % sys_str
p = os.system(sys_str)
import time
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
print("retrun=========" + str(p))
return "ok"
if __name__ == '__main__':
app.run(host="0.0.0.0", port=9863)shell脚本告警
shell脚本告警,是写个定时任务,检查端口或主从切换或进程的告警
hdfs HA切换示例
比如hdfs发生主从切换,需要告警出来,可以立马去看日志,排查为什么要切换
check_nn_state.sh 脚本代码如下:
#!/bin/bash
active=""
c_time=`date +'%F %T'`
active_nn=`hdfs haadmin -ns yycluster06 -getAllServiceState | grep active | awk -F ':' '{print $1}'`
# 这个是记录上一次的主节点
old_state=`cat /data/hdfs/check_state/active_nn.state`
# yymp_report_alarm.py 是每个服务器自带的打电话逻辑,每个公司不一样,需要自己实现即可
if [[ $old_state != $active_nn ]];then
python /home/dspeak/yyms/yymp/yymp_report_script/yymp_report_alarm.py "op_admin_dw=xx&id=111&sid=111&msg=nn_active_transform_Current_$active_nn&msg_key=778"
echo "$c_time" nn_active_transform_Current_"$active_nn" >> /data/hdfs/check_state/nn_state.log
fi
c_hour=$(echo $c_time | awk -F ' ' '{print $2}' | awk -F ':' '{print $1}')
if [[ $c_hour == 10 ]];then
echo "$c_time" active "$active_nn" > /data/hdfs/check_state/nn_state.log
else
echo "$c_time" active "$active_nn" >> /data/hdfs/check_state/nn_state.log
fi
#check_current_rm_wirte state
echo $active_nn > /data/hdfs/check_state/active_nn.state然后在corntab里增加每1分钟执行一次即可,共它的检查也可以按这个逻辑去改写
*/1 * * * * hdfs /data/hdfs/check_state/check_nn_state.sh >/dev/null 2>&1
发布:
大纲: