Role Consulting
Overview
Herdora is an advanced AI performance monitoring system that provides real-time insights and continuous optimization to ensure peak performance of your models.
Key Features:
- Intelligent performance monitoring
- Continuous optimization
- Alerts with answers
- Automated GPU profiling
- Root cause analysis
- Layer-by-layer visualization
Use Cases:
- Monitoring production traffic
- Identifying performance bottlenecks
- Simulating workloads before deployment
- Visualizing model performance
- Optimizing model updates
Benefits:
- Improved model performance
- Faster issue resolution
- Informed decision-making
- Reduced downtime
- Enhanced competitive advantage
Capabilities
- Profiles PyTorch code using kandc.ProfilerWrapper and kandc.ProfilerDecorator to capture CPU/CUDA timing and memory
- Captures model execution traces with kandc.capture_model_class, kandc.capture_model_instance, and kandc.capture_trace
- Parses and analyzes trace files via kandc.parse_model_trace to produce structured trace analysis outputs
- Initializes and tracks experiment runs with kandc.init, kandc.log, kandc.finish and Run.log/log_artifact/run.finish
- Uploads metrics and artifacts via kandc.APIClient (create_run, log_metrics, create_artifact) and returns dashboard URLs
- Automatically captures source code snapshots (respects .gitignore) and exposes env vars to configure code capture
- Operates in online, offline, or disabled run modes for cloud streaming or local-only profiling storage
- Ingests Weights & Biases runs via W&B API key for no-code analysis and crash detection
- Generates AI-powered crash reports and optimization suggestions for training runs (Keys & Caches)
- Sends real-time alerts and Slack notifications for training/inference issues (Keys & Caches)
- Offloads Python functions to cloud GPUs using Chisel decorators and the chisel CLI
- Runs cloud GPU jobs on A100 variants with automatic GPU detection and multi-GPU configurations
- Streams live job output and provides web UI job tracking for cloud GPU executions
- Authenticates via browser-based browser auth and supports secure credential storage for cloud job access
- Provides timing helpers kandc.timed and kandc.timed_call to record function execution times programmatically