Observability Engineer

Why We Need This Role:
- The platform requires comprehensive observability dashboards built from multiple sources (platform health, use case performance, cost, security)
- The platform needs advanced monitoring beyond GCP native tools for production operations
- Prometheus and Grafana expertise is required for custom metrics, alerting, and dashboards
- OpenTelemetry instrumentation across all use cases requires dedicated focus
- No current team member has deep Prometheus/Grafana expertise

Job Description: Observability Engineer

Reports To: (Chief Engineer)

About the Role:
Our GenAI platform requires comprehensive observability to ensure production reliability, performance optimisation, and cost management. As our Observability Engineer, you will design and implement the monitoring, alerting, and dashboarding infrastructure that gives teams visibility into platform health, use case performance, and operational costs.

Key Responsibilities:
- Design and implement observability architecture using Prometheus and Grafana
- Deploy and manage the Prometheus stack on GKE with appropriate retention and HA configuration
- Create comprehensive Grafana dashboards for platform health, API performance, and use case metrics
- Implement custom metrics collection for CrewAI agents, Kong API Gateway, and LLM usage
- Configure OpenTelemetry instrumentation across all platform services
- Design alerting rules and notification channels for P0-P3 incident severity levels
- Build cost and usage dashboards for LLM token consumption and infrastructure spend
- Integrate with Cloud Monitoring and Cloud Logging for unified observability
- Establish SLI/SLO frameworks for platform and use case services
- Create runbooks for common alerting scenarios and incident response

Required Skills:
- 4+ years of experience in observability and monitoring engineering
- Strong expertise in Prometheus (PromQL, recording rules, alerting rules)
- Proficiency in Grafana (dashboard design, variables, annotations, alerting)
- Experience with OpenTelemetry for distributed tracing and metrics
- Knowledge of Kubernetes monitoring patterns and kube-state-metrics
- Understanding of SRE principles (SLIs, SLOs, error budgets)
- Experience with log aggregation and analysis (Loki, ELK, or similar)
- Familiarity with alerting best practices and on-call workflows

Desirable Skills:
- Experience with GCP Cloud Monitoring and Cloud Trace integration
- Knowledge of AI/ML observability patterns (model latency, token usage, drift detection)
- Background in API gateway monitoring (Kong, Envoy, or similar)
- Experience with long-term Prometheus storage
- Familiarity with FinOps and cost observability dashboards