Kubernetes Prometheus

Introduction

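Prometheus is an open-source systems monitoring and alerting toolkit. It was originally developed at SoundCloud and is now a graduated project of the Cloud Native Computing Foundation (CNCF). It is well suited to Kubernetes because it can automatically discover and monitor the services running in a cluster.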

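The core features of Prometheus include:

Multi-dimensional data model: each time series is identified by a metric name and a set of key-value labels.
Powerful query language (PromQL): used to select and aggregate time series data.
Efficient storage: time series are stored on local disk in a compressed format that supports fast queries.
Flexible alerting: alert rules are written in PromQL, and Alertmanager delivers the resulting notifications.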
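To make the multi-dimensional data model concrete, here is a minimal sketch in Python. It is a toy in-memory store, not how Prometheus stores data internally; the names `TinyStore`, `series_key`, and `select` are made up for illustration. It only shows the idea that a metric name plus a full label set identifies one time series.

```python
from collections import defaultdict


def series_key(name, labels):
    """Identify a time series by its metric name plus its full label set."""
    return (name, frozenset(labels.items()))


class TinyStore:
    """Toy store (hypothetical): one list of (timestamp, value) samples per series."""

    def __init__(self):
        self.series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        self.series[series_key(name, labels)].append((timestamp, value))

    def select(self, name, **matchers):
        """Return the series with this name whose labels equal every matcher."""
        selected = {}
        for (metric, labelset), samples in self.series.items():
            labels = dict(labelset)
            if metric == name and all(labels.get(k) == v for k, v in matchers.items()):
                selected[(metric, labelset)] = samples
        return selected


store = TinyStore()
# Same name and labels: both samples land in the same series.
store.append("http_requests_total", {"job": "web-app", "method": "GET"}, 0, 10.0)
store.append("http_requests_total", {"job": "web-app", "method": "GET"}, 15, 25.0)
# A different label value creates a separate series.
store.append("http_requests_total", {"job": "web-app", "method": "POST"}, 0, 3.0)

get_series = store.select("http_requests_total", method="GET")
all_series = store.select("http_requests_total", job="web-app")
```

Selecting with `method="GET"` is roughly what the PromQL selector `http_requests_total{method="GET"}` expresses, limited here to exact-match labels.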

In Kubernetes, Prometheus is typically used to monitor cluster health, resource usage, and application performance.

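Installing Prometheus

Prometheus is usually installed on Kubernetes with the Helm package manager. The steps are:

Install Helm: if Helm is not installed yet, follow the official Helm documentation.

Add the Prometheus Helm repository: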

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Install Prometheus:

helm install prometheus prometheus-community/prometheus

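This command installs a release named prometheus into the current namespace. Once it finishes, Prometheus is running in the Kubernetes cluster and starts collecting metrics.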

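Configuring Prometheus

Prometheus is configured mainly through the prometheus.yml file. When Prometheus is installed with Helm, this file is typically generated from the chart's values rather than edited by hand. A simple configuration example: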

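global:
  scrape_interval: 15s  # scrape every target every 15 seconds

scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node  # discover one target per cluster node
    relabel_configs:
      # Node discovery defaults to the kubelet port (10250);
      # rewrite the address to node_exporter's default port (9100).
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__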

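In this configuration:

scrape_interval sets how often Prometheus scrapes metrics from its targets.
scrape_configs defines the targets to monitor. Here Prometheus discovers every node in the cluster and rewrites each target address from port 10250 (the kubelet) to port 9100, the default node_exporter port. This assumes node_exporter is running on every node.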
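The relabel rule in the configuration above can be hard to read at first, so here is a rough Python sketch of what a replace rule does to one discovered target. It is a simplified model, not Prometheus's implementation: the function name `relabel_replace` is hypothetical, and only numbered references such as `${1}` are handled. It does follow two documented behaviours: the regex must match the whole value, and a target whose value does not match is left unchanged.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Apply a simplified 'replace' relabel rule and return a new label dict."""
    # Source label values are joined with the separator before matching.
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex is anchored at both ends
    if match is None:
        return dict(labels)  # no match: labels stay as they are
    # Substitute ${N} references with the corresponding capture group.
    new_value = re.sub(
        r"\$\{(\d+)\}",
        lambda ref: match.group(int(ref.group(1))),
        replacement,
    )
    updated = dict(labels)
    updated[target_label] = new_value
    return updated


# The IP addresses below are made-up example values.
kubelet_target = {"__address__": "10.0.0.5:10250"}
rewritten = relabel_replace(
    kubelet_target, ["__address__"], r"(.*):10250", "${1}:9100", "__address__"
)

other_target = {"__address__": "10.0.0.5:8080"}
untouched = relabel_replace(
    other_target, ["__address__"], r"(.*):10250", "${1}:9100", "__address__"
)
```

With these inputs, `rewritten["__address__"]` becomes `10.0.0.5:9100`, while the target on port 8080 keeps its original address because the regex does not match it.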

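Querying Data with PromQL

PromQL is the Prometheus query language, used to select and aggregate time series data. Some common queries follow.

CPU usage (CPU seconds consumed per second, averaged over the last minute):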

rate(container_cpu_usage_seconds_total{job="kubernetes-nodes"}[1m])

Memory usage:

container_memory_usage_bytes{job="kubernetes-nodes"}

HTTP request rate (requests per second, averaged over the last 5 minutes):

rate(http_requests_total{job="api-server"}[5m])

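Queries like these show how the cluster and its applications are performing. The job label in each selector must match the job_name of the scrape job that actually collects the metric. The container_* metrics, for example, are exposed by cAdvisor through the kubelet, not by node_exporter.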
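Two of the queries above rely on rate(), so a small Python sketch of the idea may help. This is a deliberate simplification, not the PromQL implementation: the hypothetical `simple_rate` divides the total increase by the time between the first and last sample and skips the extrapolation to the window boundaries that Prometheus performs. It does model counter resets, which is the main reason rate() is preferred over subtracting raw values.

```python
def simple_rate(samples):
    """Approximate the per-second rate of a counter.

    `samples` is a list of (timestamp_seconds, value) pairs inside the window,
    ordered by time. Returns None when a rate cannot be computed.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The counter dropped, so it was reset (for example by a restart);
            # count the new value as the increase since the reset.
            increase += curr
    return increase / elapsed


# Four scrapes, 15 seconds apart, of a steadily growing counter.
steady = simple_rate([(0, 100.0), (15, 130.0), (30, 160.0), (45, 190.0)])

# The same window with a reset between the second and third scrape.
with_reset = simple_rate([(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0)])
```

In the first window the counter grows by 90 over 45 seconds, giving 2.0 per second. In the second, the increases are 30, 10, and 30, so the result is 70/45, not a negative number.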

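Practical Example: Monitoring a Kubernetes Cluster

Suppose you run a web application on Kubernetes and want to monitor its performance and health. The steps with Prometheus are:

Deploy Prometheus: follow the earlier steps to deploy Prometheus in the cluster.

Configure scrape targets: add the targets to prometheus.yml, for example: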

scrape_configs:
  - job_name: 'web-app'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: 'web-app'
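The keep action in the scrape configuration above filters the discovered pods. As a rough Python sketch, with the hypothetical function name `keep_targets` and made-up pod names: a target survives only if the value of its source labels fully matches the regex, so a pod labelled web-app-canary or a pod without an app label is dropped.

```python
import re


def keep_targets(targets, source_labels, regex, separator=";"):
    """Simplified 'keep' relabel action.

    Retains only targets whose joined source label values fully match the
    regex; every other target is dropped from the scrape job.
    """
    kept = []
    for labels in targets:
        value = separator.join(labels.get(name, "") for name in source_labels)
        if re.fullmatch(regex, value):
            kept.append(labels)
    return kept


# Discovered pods, reduced to the two meta labels this sketch needs.
pods = [
    {"__meta_kubernetes_pod_name": "web-app-7d9f",
     "__meta_kubernetes_pod_label_app": "web-app"},
    {"__meta_kubernetes_pod_name": "web-app-canary-1",
     "__meta_kubernetes_pod_label_app": "web-app-canary"},
    {"__meta_kubernetes_pod_name": "db-0",
     "__meta_kubernetes_pod_label_app": "postgres"},
    {"__meta_kubernetes_pod_name": "job-runner"},  # no app label at all
]

kept = keep_targets(pods, ["__meta_kubernetes_pod_label_app"], "web-app")
kept_names = [pod["__meta_kubernetes_pod_name"] for pod in kept]
```

Only the pod whose app label is exactly web-app remains in `kept_names`.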

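Create alert rules: define alerting rules for Prometheus, for example one that fires when the 99th-percentile request latency stays above 1 second for 10 minutes: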

groups:
- name: web-app-alerts
  rules:
  - alert: HighRequestLatency
    expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="web-app"}[5m])) > 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High request latency detected"
      description: "The 99th percentile request latency is above 1 second."
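The alert expression above depends on histogram_quantile(), which estimates a quantile from cumulative bucket counters. The following Python sketch shows the general idea under simplifying assumptions: 0 < q <= 1, the lowest bucket starts at 0, and observations are spread evenly inside each bucket. The bucket bounds and counts are made-up example data, and the real function handles more edge cases.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` maps each upper bound (the `le` label) to a cumulative count or
    rate and must contain math.inf as the last bound.
    """
    total = buckets[math.inf]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound in sorted(buckets):
        count = buckets[bound]
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate inside the +Inf bucket;
                # fall back to the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative counts per upper bound in seconds (example data).
latency_buckets = {0.1: 50.0, 0.5: 90.0, 1.0: 98.0, 2.5: 100.0, math.inf: 100.0}

p99 = histogram_quantile(0.99, latency_buckets)
p50 = histogram_quantile(0.50, latency_buckets)
alert_condition = p99 > 1  # mirrors the "> 1" comparison in the rule
```

With this data the 99th observation falls halfway into the 1.0 to 2.5 second bucket, so the estimate is 1.75 seconds and the condition holds. The result is only as precise as the bucket boundaries, so it helps to place a boundary near the alert threshold.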

Configure Alertmanager: set up Alertmanager to send alert notifications, for example by email or Slack.
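Besides email and Slack, Alertmanager can also post alerts as JSON to a webhook receiver. The sketch below assumes that setup and shows how such a payload might be turned into a short message. The function name `format_notification` is hypothetical, and the sample payload contains only a subset of fields (`alerts`, `status`, `labels`, `annotations`); check the Alertmanager documentation for the full format before relying on it.

```python
import json


def format_notification(payload):
    """Turn an Alertmanager-style webhook payload into short text lines."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} ({severity}): {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "unnamed"),
                severity=labels.get("severity", "none"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


# Trimmed example payload built from the HighRequestLatency rule above.
sample_payload = json.loads("""
{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "HighRequestLatency",
        "severity": "critical",
        "job": "web-app"
      },
      "annotations": {
        "summary": "High request latency detected",
        "description": "The 99th percentile request latency is above 1 second."
      }
    }
  ]
}
""")

messages = format_notification(sample_payload)
```

The labels and annotations defined in the alert rule travel with the alert, which is why the severity label is useful for choosing where a notification goes.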

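Summary

Prometheus is a powerful tool for Kubernetes monitoring that gives you insight into cluster and application performance. This article covered how to install, configure, and use Prometheus, how to query data with PromQL, and how to create alert rules.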

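Additional Resources

Exercises

Install Prometheus in your Kubernetes cluster and configure scrape targets.
Use PromQL to query the cluster's CPU and memory usage.
Create an alert rule that fires when a service's request latency exceeds 1 second.

Completing these exercises will deepen your understanding of how Prometheus is used with Kubernetes.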
