Machine Learning Pipeline - Multi-Agent MLOps Orchestration
Design and implement a complete ML pipeline for: $ARGUMENTS
Use this skill when
●Designing or implementing end-to-end machine learning pipelines that coordinate multiple specialized agents (data, training, deployment, monitoring)
●Needing guidance, best practices, or checklists for multi-agent MLOps orchestration
Do not use this skill when
●The task is unrelated to machine learning pipelines or MLOps orchestration
●You need a different domain or tool outside this scope
Instructions
●Clarify goals, constraints, and required inputs.
●Apply relevant best practices and validate outcomes.
●Provide actionable steps and verification.
●If detailed examples are required, open resources/implementation-playbook.md.
Thinking
This workflow orchestrates multiple specialized agents to build a production-ready ML pipeline following modern MLOps best practices. The approach emphasizes:
●Phase-based coordination: Each phase builds upon previous outputs, with clear handoffs between agents
●Modern tooling integration: MLflow/W&B for experiments, Feast/Tecton for features, KServe/Seldon for serving
●Production-first mindset: Every component designed for scale, monitoring, and reliability
●Reproducibility: Version control for data, models, and infrastructure
●Continuous improvement: Automated retraining, A/B testing, and drift detection
The multi-agent approach ensures each aspect is handled by domain experts:
●Data engineers handle ingestion and quality
●Data scientists design features and experiments
●ML engineers implement training pipelines
●MLOps engineers handle production deployment
●Observability engineers ensure monitoring
Phase 1: Data & Requirements Analysis
subagent_type: data-engineer
prompt: |
Analyze and design data pipeline for ML system with requirements: $ARGUMENTS
Deliverables:
1.Data source audit and ingestion strategy:
●Source systems and connection patterns
●Schema validation using Pydantic/Great Expectations
●Data versioning with DVC or lakeFS
●Incremental loading and CDC strategies
2.Data quality framework:
●Profiling and statistics generation
●Anomaly detection rules
●Data lineage tracking
●Quality gates and SLAs
3.Storage architecture:
●Raw/processed/feature layers
●Partitioning strategy
●Retention policies
●Cost optimization
Provide implementation code for critical components and integration patterns.
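As one illustration of the schema-validation gate described above, here is a minimal sketch using pydantic (v2). The `EventRecord` fields are hypothetical placeholders, not prescribed by the pipeline; a real ingestion job would define schemas per source system and route rejected rows to a quarantine table:

```python
from datetime import datetime
from pydantic import BaseModel, Field, ValidationError

class EventRecord(BaseModel):
    """Hypothetical schema for one incoming record."""
    user_id: str = Field(min_length=1)
    amount: float = Field(ge=0)
    event_time: datetime

def validate_batch(rows: list[dict]) -> tuple[list[EventRecord], list[dict]]:
    """Split a raw batch into validated records and rejected rows with reasons."""
    valid, rejected = [], []
    for row in rows:
        try:
            valid.append(EventRecord(**row))
        except ValidationError as exc:
            rejected.append({"row": row, "errors": exc.errors()})
    return valid, rejected

good, bad = validate_batch([
    {"user_id": "u1", "amount": 9.5, "event_time": "2024-01-01T00:00:00"},
    {"user_id": "", "amount": -3, "event_time": "not-a-date"},
])
```

The same split-and-quarantine pattern applies with Great Expectations suites; the quality gate simply swaps the validator.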
subagent_type: data-scientist
prompt: |
Design feature engineering and model requirements for: $ARGUMENTS
Using data architecture from: {phase1.data-engineer.output}
Deliverables:
1.Feature engineering pipeline:
●Transformation specifications
●Feature store schema (Feast/Tecton)
●Statistical validation rules
●Handling strategies for missing data/outliers
2.Model requirements:
●Algorithm selection rationale
●Performance metrics and baselines
●Training data requirements
●Evaluation criteria and thresholds
3.Experiment design:
●Hypothesis and success metrics
●A/B testing methodology
●Sample size calculations
●Bias detection approach
Include feature transformation code and statistical validation logic.
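A minimal sketch of the missing-data and outlier handling called for above, using pandas. The `amount` column and the median/IQR choices are illustrative assumptions; the production version would be generated per feature from the transformation specifications:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Median-impute missing values, clip outliers to the 1.5*IQR fence,
    and add a log-transformed variant of a hypothetical `amount` column."""
    out = df.copy()
    out["amount"] = out["amount"].fillna(out["amount"].median())
    q1, q3 = out["amount"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out["amount"] = out["amount"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out["amount_log1p"] = np.log1p(out["amount"])
    return out
```

Keeping each transformation as a pure DataFrame-in/DataFrame-out function makes it straightforward to register in a feature store and to unit-test in Phase 2.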
Phase 2: Model Development & Training
subagent_type: ml-engineer
prompt: |
Implement training pipeline based on requirements: {phase1.data-scientist.output}
Using data pipeline: {phase1.data-engineer.output}
Build comprehensive training system:
1.Training pipeline implementation:
●Modular training code with clear interfaces
●Hyperparameter optimization (Optuna/Ray Tune)
●Distributed training support (Horovod/PyTorch DDP)
●Cross-validation and ensemble strategies
2.Experiment tracking setup:
●MLflow/Weights & Biases integration
●Metric logging and visualization
●Artifact management (models, plots, data samples)
●Experiment comparison and analysis tools
3.Model registry integration:
●Version control and tagging strategy
●Model metadata and lineage
●Promotion workflows (dev -> staging -> prod)
●Rollback procedures
Provide complete training code with configuration management.
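To make the hyperparameter-optimization structure concrete, here is a stdlib-only random-search sketch. The objective is a toy stand-in; in the real pipeline it would train a model and return a validation metric, and Optuna or Ray Tune would replace the loop with smarter sampling and pruning. The search space (`lr`, `depth`) is an illustrative assumption:

```python
import random

def objective(params: dict) -> float:
    """Toy objective with its optimum near lr=0.1, depth=5.
    A real objective would train and evaluate a model here."""
    return (params["lr"] - 0.1) ** 2 + (params["depth"] - 5) ** 2 * 0.01

def random_search(n_trials: int, seed: int = 0) -> dict:
    """Minimize the objective over a small hypothetical search space."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {"lr": rng.uniform(0.001, 0.3), "depth": rng.randint(2, 10)}
        score = objective(params)
        if best is None or score < best["score"]:
            best = {"params": params, "score": score}
    return best

best = random_search(200)
```

Whatever the optimizer, logging each trial's params and score to the experiment tracker is what makes runs comparable and reproducible.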
subagent_type: python-pro
prompt: |
Optimize and productionize ML code from: {phase2.ml-engineer.output}
Focus areas:
1.Code quality and structure:
●Refactor for production standards
●Add comprehensive error handling
●Implement proper logging with structured formats
●Create reusable components and utilities
2.Performance optimization:
●Profile and optimize bottlenecks
●Implement caching strategies
●Optimize data loading and preprocessing
●Memory management for large-scale training
3.Testing framework:
●Unit tests for data transformations
●Integration tests for pipeline components
●Model quality tests (invariance, directional)
●Performance regression tests
Deliver production-ready, maintainable code with full test coverage.
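The model quality tests mentioned above (invariance, directional) can be sketched as plain pytest-style functions. `predict_risk` is a toy stand-in; the real tests would load the candidate model from the registry, and the feature names here are hypothetical:

```python
def predict_risk(features: dict) -> float:
    """Toy scoring function standing in for the trained model."""
    score = 0.3 + 0.05 * features["num_late_payments"] - 0.001 * features["years_history"]
    return min(max(score, 0.0), 1.0)

def test_directional_late_payments():
    # Directional expectation: more late payments must never lower risk.
    base = predict_risk({"num_late_payments": 1, "years_history": 5})
    worse = predict_risk({"num_late_payments": 4, "years_history": 5})
    assert worse >= base

def test_invariance_to_id():
    # Invariance: an irrelevant identifier must not change the prediction.
    a = predict_risk({"num_late_payments": 2, "years_history": 3, "customer_id": "a"})
    b = predict_risk({"num_late_payments": 2, "years_history": 3, "customer_id": "b"})
    assert a == b
```

Running these as a CI gate catches behavioral regressions that aggregate metrics like AUC can miss.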
Phase 3: Production Deployment & Serving
subagent_type: mlops-engineer
prompt: |
Design production deployment for models from: {phase2.ml-engineer.output}
With optimized code from: {phase2.python-pro.output}
Implementation requirements:
1.Model serving infrastructure:
●REST/gRPC APIs with FastAPI/TorchServe
●Batch prediction pipelines (Airflow/Kubeflow)
●Stream processing (Kafka/Kinesis integration)
●Model serving platforms (KServe/Seldon Core)
2.Deployment strategies:
●Blue-green deployments for zero downtime
●Canary releases with traffic splitting
●Shadow deployments for validation
●A/B testing infrastructure
3.CI/CD pipeline:
●GitHub Actions/GitLab CI workflows
●Automated testing gates
●Model validation before deployment
●ArgoCD for GitOps deployment
4.Infrastructure as Code:
●Terraform modules for cloud resources
●Helm charts for Kubernetes deployments
●Docker multi-stage builds for optimization
●Secret management with Vault/Secrets Manager
Provide complete deployment configuration and automation scripts.
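The canary traffic-splitting logic above is normally handled by the serving platform or a mesh such as Istio, but the core idea fits in a few lines. This sketch hash-buckets users so each one consistently sees the same variant (the 10% canary fraction is an example value):

```python
import hashlib

def route_model(user_id: str, canary_fraction: float = 0.1) -> str:
    """Deterministically route a request to 'canary' or 'stable'.
    Hashing the user id keeps assignment sticky across requests,
    which matters for both UX and clean A/B measurement."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "stable"
```

Raising `canary_fraction` in steps (1% → 10% → 50% → 100%) while monitoring the Phase 4 metrics is the usual promotion path; rollback is just setting it back to zero.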
subagent_type: kubernetes-architect
prompt: |
Design Kubernetes infrastructure for ML workloads from: {phase3.mlops-engineer.output}
Kubernetes-specific requirements:
1.Workload orchestration:
●Training job scheduling with Kubeflow
●GPU resource allocation and sharing
●Spot/preemptible instance integration
●Priority classes and resource quotas
2.Serving infrastructure:
●HPA/VPA for autoscaling
●KEDA for event-driven scaling
●Istio service mesh for traffic management
●Model caching and warm-up strategies
3.Storage and data access:
●PVC strategies for training data
●Model artifact storage with CSI drivers
●Distributed storage for feature stores
●Cache layers for inference optimization
Provide Kubernetes manifests and Helm charts for the entire ML platform.

Phase 4: Monitoring & Continuous Improvement
subagent_type: observability-engineer
prompt: |
Implement comprehensive monitoring for ML system deployed in: {phase3.mlops-engineer.output}
Using Kubernetes infrastructure: {phase3.kubernetes-architect.output}
Monitoring framework:
1.Model performance monitoring:
●Prediction accuracy tracking
●Latency and throughput metrics
●Feature importance shifts
●Business KPI correlation
2.Data and model drift detection:
●Statistical drift detection (KS test, PSI)
●Concept drift monitoring
●Feature distribution tracking
●Automated drift alerts and reports
3.System observability:
●Prometheus metrics for all components
●Grafana dashboards for visualization
●Distributed tracing with Jaeger/Zipkin
●Log aggregation with ELK/Loki
4.Alerting and automation:
●PagerDuty/Opsgenie integration
●Automated retraining triggers
●Performance degradation workflows
●Incident response runbooks
5.Cost tracking:
●Resource utilization metrics
●Cost allocation by model/experiment
●Optimization recommendations
●Budget alerts and controls
Deliver monitoring configuration, dashboards, and alert rules.
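As a concrete example of the statistical drift checks above, here is a Population Stability Index (PSI) implementation in NumPy. The common rule of thumb (PSI < 0.1 stable, 0.1–0.25 moderate shift, > 0.25 significant drift) is a convention, not a standard, so treat the thresholds as tunable alert levels:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a reference (training) and a live feature distribution.
    Bin edges come from the reference quantiles; outer edges are opened
    to +/-inf so out-of-range live values are still counted."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    eps = 1e-6  # avoid log(0) for empty bins
    e_frac, a_frac = e_frac + eps, a_frac + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Computed per feature on a schedule and exported as a Prometheus gauge, this gives the automated drift alerts and retraining triggers described above a single scalar to threshold on.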
Configuration Options
●experiment_tracking: mlflow | wandb | neptune | clearml
●feature_store: feast | tecton | databricks | custom
●serving_platform: kserve | seldon | torchserve | triton
●orchestration: kubeflow | airflow | prefect | dagster
●cloud_provider: aws | azure | gcp | multi-cloud
●deployment_mode: realtime | batch | streaming | hybrid
●monitoring_stack: prometheus | datadog | newrelic | custom
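The options above can be enforced with a small validation helper so an orchestration run fails fast on an unsupported combination. This is one possible shape for such a check, not a defined interface of any of the listed tools:

```python
ALLOWED = {
    "experiment_tracking": {"mlflow", "wandb", "neptune", "clearml"},
    "feature_store": {"feast", "tecton", "databricks", "custom"},
    "serving_platform": {"kserve", "seldon", "torchserve", "triton"},
    "orchestration": {"kubeflow", "airflow", "prefect", "dagster"},
    "cloud_provider": {"aws", "azure", "gcp", "multi-cloud"},
    "deployment_mode": {"realtime", "batch", "streaming", "hybrid"},
    "monitoring_stack": {"prometheus", "datadog", "newrelic", "custom"},
}

def validate_config(config: dict) -> list[str]:
    """Return human-readable errors; an empty list means the config is valid."""
    errors = []
    for key, allowed in ALLOWED.items():
        value = config.get(key)
        if value not in allowed:
            errors.append(f"{key}={value!r} not in {sorted(allowed)}")
    return errors
```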
Success Criteria
1.Data Pipeline Success:
●< 0.1% data quality issues in production
●Automated data validation passing 99.9% of time
●Complete data lineage tracking
●Sub-second feature serving latency
2.Model Performance:
●Meeting or exceeding baseline metrics
●< 5% performance degradation before retraining
●Successful A/B tests with statistical significance
●Model drift detected and surfaced within 24 hours of onset
3.Operational Excellence:
●99.9% uptime for model serving
●< 200ms p99 inference latency
●Automated rollback within 5 minutes
●Complete observability with alert latency under 1 minute
4.Development Velocity:
●< 1 hour from commit to production
●Parallel experiment execution
●Reproducible training runs
●Self-service model deployment
5.Cost Efficiency:
●< 20% infrastructure waste
●Optimized resource allocation
●Automatic scaling based on load
●Spot instance utilization > 60%
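To make the 99.9% uptime target concrete, it helps to translate the SLO into an explicit monthly error budget:

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per month for a given availability SLO."""
    return (1 - slo) * days * 24 * 60

budget = monthly_error_budget_minutes(0.999)  # roughly 43 minutes/month
```

With the automated rollback target of 5 minutes per incident, the budget allows on the order of eight serving incidents per month before the SLO is breached — a useful sanity check when setting alert thresholds.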
Final Deliverables
Upon completion, the orchestrated pipeline will provide:
●End-to-end ML pipeline with full automation
●Comprehensive documentation and runbooks
●Production-ready infrastructure as code
●Complete monitoring and alerting system
●CI/CD pipelines for continuous improvement
●Cost optimization and scaling strategies
●Disaster recovery and rollback procedures