server-management
Cloud, DevOps & SystèmesServer management principles and decision-making. Process management, monitoring strategy, and scaling decisions. Teaches thinking, not commands.
Documentation
Server Management
> Server management principles for production operations.
> Learn to THINK, not memorize commands.
---
1. Process Management Principles
Tool Selection
| Scenario | Tool |
|----------|------|
| Node.js app | PM2 (clustering, reload) |
| Any app | systemd (Linux native) |
| Containers | Docker/Podman |
| Orchestration | Kubernetes, Docker Swarm |
Process Management Goals
| Goal | What It Means |
|------|---------------|
| Restart on crash | Auto-recovery |
| Zero-downtime reload | No service interruption |
| Clustering | Use all CPU cores |
| Persistence | Survive server reboot |
---
2. Monitoring Principles
What to Monitor
| Category | Key Metrics |
|----------|-------------|
| Availability | Uptime, health checks |
| Performance | Response time, throughput |
| Errors | Error rate, types |
| Resources | CPU, memory, disk |
Alert Severity Strategy
| Level | Response |
|-------|----------|
| Critical | Immediate action |
| Warning | Investigate soon |
| Info | Review daily |
Monitoring Tool Selection
| Need | Options |
|------|---------|
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |
---
3. Log Management Principles
Log Strategy
| Log Type | Purpose |
|----------|---------|
| Application logs | Debug, audit |
| Access logs | Traffic analysis |
| Error logs | Issue detection |
Log Principles
---
4. Scaling Decisions
When to Scale
| Symptom | Solution |
|---------|----------|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |
Scaling Strategy
| Type | When to Use |
|------|-------------|
| Vertical | Quick fix, single instance |
| Horizontal | Sustainable, distributed |
| Auto | Variable traffic |
---
5. Health Check Principles
What Constitutes Healthy
| Check | Meaning |
|-------|---------|
| HTTP 200 | Service responding |
| Database connected | Data accessible |
| Dependencies OK | External services reachable |
| Resources OK | CPU/memory not exhausted |
Health Check Implementation
---
6. Security Principles
| Area | Principle |
|------|-----------|
| Access | SSH keys only, no passwords |
| Firewall | Only needed ports open |
| Updates | Regular security patches |
| Secrets | Environment vars, not files |
| Audit | Log access and changes |
---
7. Troubleshooting Priority
When something's wrong:
---
8. Anti-Patterns
| ❌ Don't | ✅ Do |
|----------|-------|
| Run as root | Use non-root user |
| Ignore logs | Set up log rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |
---
> Remember: A well-managed server is boring. That's the goal.
Compétences similaires
Explorez d'autres agents de la catégorie Cloud, DevOps & Systèmes
azure-ai-voicelive-py
Build real-time voice AI applications using Azure AI Voice Live SDK (azure-ai-voicelive). Use this skill when creating Python applications that need real-time bidirectional audio communication with Azure AI, including voice assistants, voice-enabled chatbots, real-time speech-to-speech translation, voice-driven avatars, or any WebSocket-based audio streaming with AI models. Supports Server VAD (Voice Activity Detection), turn-based conversation, function calling, MCP tools, avatar integration, and transcription.
azure-ai-translation-text-py
|
hugging-face-jobs
"This skill should be used when users want to run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks. Should be invoked for tasks involving cloud compute, GPU workloads, or when users mention running jobs on Hugging Face infrastructure without local setup."