Introduction

Monitoring & Ops Introduction

Monitoring & Ops is a core module of the Alauda AI platform designed specifically for AI inference service operations. It provides comprehensive observability and operational capabilities across the full lifecycle of inference services, enabling unified management of logs and multi-dimensional metrics through integrated monitoring dashboards. As a critical component of Alauda AI's MLOps/LLMOps/GenOps solutions, it empowers teams to ensure service reliability, optimize resource utilization, and accelerate incident response.

This module focuses on two key operational aspects:

  • Logging: Real-time streaming of inference service replica pod logs
  • Monitor: Multi-dimensional performance dashboards covering infrastructure, GPU resources, and API traffic

Note: HAMi GPU dashboards are supported only in AML version 1.4 and later.