# Cerberus
Guardian of Kubernetes and OpenShift Clusters

![Cerberus logo](media/logo_assets/full_color/over_light_background/cerberus-logo_small-color-light-full-horizontal.png)

Cerberus watches Kubernetes/OpenShift clusters for dead nodes and system component failures, and exposes a go or no-go signal that other workload generators or applications in the cluster can consume and act on.

### Workflow
![Cerberus workflow](media/cerberus-workflow.png)

### Installation
Instructions on how to set up, configure and run Cerberus can be found at [Installation](docs/installation.md).

### What Kubernetes/OpenShift components can Cerberus monitor?
The following are the components of Kubernetes/OpenShift that Cerberus can monitor today; we will be adding more soon.

Component | Description | Working
--- | --- | ---
Nodes | Watches all the nodes, including masters, workers and nodes created using custom MachineSets | :heavy_check_mark:
Namespaces | Watches all the pods, including the containers running inside them, in the namespaces specified in the config | :heavy_check_mark:
Cluster Operators | Watches all Cluster Operators | :heavy_check_mark:
Masters Schedulability | Watches and warns if master nodes are marked as schedulable | :heavy_check_mark:
Routes | Watches specified routes | :heavy_check_mark:
CSRs | Warns if any CSRs are not approved | :heavy_check_mark:
Critical Alerts | Warns the user on observing abnormal behavior which might affect the health of the cluster | :heavy_check_mark:
Bring your own checks | Users can bring their own checks; Cerberus runs them and includes them in the reporting as well as in the go/no-go signal | :heavy_check_mark:

All the components that Cerberus can monitor are explained [here](docs/config.md).

### How does Cerberus report cluster health?
Cerberus exposes the cluster health and failures through a go/no-go signal, a report and a metrics API.

#### Go or no-go signal
When Cerberus is configured to run in daemon mode, it continuously monitors the specified components, runs a lightweight HTTP server at http://0.0.0.0:8080 and publishes the signal, i.e. True or False, depending on the status of the components. Tools can consume the signal and act accordingly.

#### Report
The report is generated in the run directory. It contains the status of each check/monitored component per iteration, with timestamps, and displays information about the components in case of failure. Refer to [report](docs/example_report.md) for an example.

#### Metrics API
Cerberus exposes metrics, including the failures observed during the run, through an API. Tools consuming Cerberus can query the API to get a blob of JSON with the observed failures to scrape and act accordingly. For example, we can query for etcd failures between a start and an end time, and use the result to determine pass/fail for test cases or to report whether the cluster was healthy or unhealthy for that duration.

- The failures in the past 1 hour can be retrieved in JSON format by visiting http://0.0.0.0:8080/history.
- The failures in a specific time window can be retrieved in JSON format by visiting http://0.0.0.0:8080/history?loopback=<interval>.
- The failures between two timestamps, the failures of specific issue types and the failures related to specific components can be retrieved in JSON format by visiting the http://0.0.0.0:8080/analyze URL. Filters have to be applied to scrape the failures accordingly.

### Slack integration
Cerberus supports reporting failures in Slack. Refer to [slack integration](docs/slack.md) for information on how to set it up.
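As an illustration of how a tool might consume the go/no-go signal and the `/history` endpoint described under "How does Cerberus report cluster health?", here is a minimal sketch using only the Python standard library. It assumes the server answers with a plain-text body of `True` or `False`; the helper names (`parse_signal`, `history_url`, `cluster_is_healthy`) are hypothetical and not part of Cerberus, and the unit of the `loopback` interval is not specified here, so check the docs before relying on a value.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Address taken from the text above; adjust it to your own deployment.
CERBERUS_URL = "http://0.0.0.0:8080"


def parse_signal(body: str) -> bool:
    """Interpret the published signal; anything other than 'True' counts as no-go.

    Assumption: the response body is the plain text 'True' or 'False'.
    """
    return body.strip() == "True"


def history_url(base_url: str, loopback=None) -> str:
    """Build the /history URL, optionally with the loopback query parameter."""
    url = base_url.rstrip("/") + "/history"
    if loopback is not None:
        url += "?" + urlencode({"loopback": loopback})
    return url


def cluster_is_healthy(base_url: str = CERBERUS_URL, timeout: float = 5.0) -> bool:
    """Fetch the go/no-go signal; an unreachable server is treated as no-go."""
    try:
        with urlopen(base_url, timeout=timeout) as response:
            return parse_signal(response.read().decode("utf-8"))
    except OSError:  # URLError and socket timeouts are subclasses of OSError
        return False


if __name__ == "__main__":
    print("go" if cluster_is_healthy() else "no-go")
    print("history endpoint:", history_url(CERBERUS_URL))
```

Treating an unreachable server as no-go is a conservative design choice for this sketch: a workload generator that cannot confirm cluster health stops rather than continuing to push the cluster.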
### Node Problem Detector
Cerberus also consumes [node-problem-detector](https://github.com/kubernetes/node-problem-detector) to detect various failures in Kubernetes/OpenShift nodes. More information on setting it up can be found at [node-problem-detector](docs/node-problem-detector.md).

### Bring your own checks
Users can add checks to monitor components that Cerberus does not monitor and have them count towards the go/no-go signal. To do so, place the relative paths of the files containing the additional checks under custom_checks in the config file. All the checks should be placed within the main function of the file. If the additional checks need to be considered in determining the go/no-go signal of Cerberus, the main function can return a boolean value. Returning a dict of the format {'status':status, 'message':message} sends the signal to Cerberus along with a message to be displayed in the Slack notification. Returning a value is optional. Refer to [example_check](https://github.com/openshift-scale/cerberus/blob/master/custom_checks/custom_check_sample.py) for an example custom check file.

### Alerts
Monitoring metrics and alerting on abnormal behavior is critical, as they are the indicators of cluster health. Information on supported alerts can be found at [alerts](docs/alerts.md).

### Use cases
There are a number of use cases; here are some of them:
- We run tools to push the limits of Kubernetes/OpenShift to look at performance and scalability. There are a number of instances where system components or nodes start to degrade, which invalidates the results, while the workload generator continues to push the cluster until it is unrecoverable.
- Chaos experiments on a Kubernetes/OpenShift cluster can break components unrelated to the targeted ones, and the chaos experiment itself won't detect that. The go/no-go signal can be used here to decide whether the cluster recovered from the failure injection, as well as whether to continue with the next chaos scenario.

### Tools consuming Cerberus
- [Benchmark Operator](https://github.com/cloud-bulldozer/benchmark-operator): The intent of this Operator is to deploy common workloads to establish a performance baseline of a Kubernetes cluster on your provider. Benchmark Operator consumes Cerberus to determine if the cluster was healthy during the benchmark run. More information can be found at [cerberus-integration](https://github.com/cloud-bulldozer/benchmark-operator#cerberus-integration).
- [Kraken](https://github.com/openshift-scale/kraken/): Tool to inject deliberate failures into Kubernetes/OpenShift clusters to check whether they are resilient. Kraken consumes Cerberus to determine if the cluster is healthy as a whole, in addition to the targeted component, during chaos testing. More information can be found at [cerberus-integration](https://github.com/openshift-scale/kraken#kraken-scenario-passfail-criteria-and-report).

### Blogs and other useful resources
- https://www.openshift.com/blog/openshift-scale-ci-part-4-introduction-to-cerberus-guardian-of-kubernetes/openshift-clouds
- https://www.openshift.com/blog/reinforcing-cerberus-guardian-of-openshift/kubernetes-clusters

### Contributions
We are always looking for enhancements and fixes to make Cerberus better; any contributions are most welcome.
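### Example custom check (sketch)
To make the "Bring your own checks" contract concrete, here is a minimal sketch of a custom check file: all logic sits in `main`, which returns a dict of the form {'status':status, 'message':message}. The check itself (free disk space on the machine running the check), the threshold and the path are hypothetical choices for illustration only; this is not the sample file shipped with Cerberus.

```python
import shutil

# Hypothetical values chosen only for illustration.
MIN_FREE_FRACTION = 0.10
PATH_TO_CHECK = "/"


def free_fraction(path: str) -> float:
    """Return the fraction of disk space that is free at the given path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def main():
    """Entry point for the check, as described in 'Bring your own checks'.

    Returns a dict whose 'status' feeds the go/no-go signal and whose
    'message' is the text intended for the Slack notification.
    """
    fraction = free_fraction(PATH_TO_CHECK)
    status = fraction >= MIN_FREE_FRACTION
    message = f"{fraction:.0%} of disk space free on {PATH_TO_CHECK}"
    return {"status": status, "message": message}


if __name__ == "__main__":
    print(main())
```

Returning a plain boolean instead of the dict would also feed the go/no-go signal, just without a message for Slack.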
