
Develop Implementation Plan for evaluating downstream energy impacts of scalable workloads #151

Open · SRF-Audio opened this issue Dec 12, 2024 · 1 comment
SRF-Audio commented Dec 12, 2024

Objective:

Create a set of tests, metrics, and outputs that represent the energy delta between:

  • a baseline minimum deployment test workload
  • a production-level HA version of that workload
  • Optional: define a metric, analogous to big-O notation, that represents how an HA workload's energy delta from baseline scales with the size of the workload (sketched below)
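
As a rough illustration of that optional metric, here is a minimal Python sketch that computes the HA-vs-baseline energy delta and fits a power-law exponent to it. The replica counts and joule figures are hypothetical, and the power-law model itself is an assumption, not a settled choice:

```python
"""Sketch of the proposed metrics, using made-up numbers.

Assumes per-run energy totals (in joules) have already been collected for
the baseline and HA deployments; the values below are illustrative, not
measured data.
"""
import math

def energy_delta(baseline_joules: float, ha_joules: float) -> float:
    """Absolute energy overhead of the HA deployment over the baseline."""
    return ha_joules - baseline_joules

def scaling_exponent(replicas: list[int], deltas: list[float]) -> float:
    """Fit delta ~ C * replicas**k by least squares in log-log space.

    The exponent k is the big-O-style figure: k close to 1 means the HA
    overhead grows roughly linearly with scale, k > 1 superlinearly.
    """
    xs = [math.log(r) for r in replicas]
    ys = [math.log(d) for d in deltas]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )

# Hypothetical measurements: HA energy delta (J) at increasing replica counts.
replicas = [2, 4, 8, 16]
deltas = [120.0, 260.0, 540.0, 1150.0]
print(f"scaling exponent k = {scaling_exponent(replicas, deltas):.2f}")  # ~1.08, roughly linear
```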

Example Scenario:

Suppose I want to deploy a test instance of Grafana. A basic/naive implementation might be:

  • Deployment (just pointing to the public Docker image)
  • ClusterIP Service
  • PVC

But for production-like environments, a team typically uses the Grafana Helm chart, which adds (see the sketch after this list):

  • ReplicaSets
  • ConfigMaps
  • RBAC
  • Endpoints
  • Secrets
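
A quick way to see this resource delta in practice is to render the chart with `helm template` and count objects per kind. This is a sketch, not part of the benchmark itself: it assumes `helm` is on PATH and that the real `grafana/grafana` chart repo has been added (`helm repo add grafana https://grafana.github.io/helm-charts`); exact counts will vary by chart version and values:

```python
"""Count the Kubernetes objects a naive manifest vs. the rendered
Grafana Helm chart would create."""
import subprocess
from collections import Counter

import yaml  # PyYAML

def rendered_kinds(release: str, chart: str) -> Counter:
    """Render a chart with `helm template` and tally objects per kind."""
    manifest = subprocess.run(
        ["helm", "template", release, chart],
        check=True, capture_output=True, text=True,
    ).stdout
    docs = [d for d in yaml.safe_load_all(manifest) if d]
    return Counter(d.get("kind", "Unknown") for d in docs)

# The basic/naive implementation from the scenario above.
naive = Counter({"Deployment": 1, "Service": 1, "PersistentVolumeClaim": 1})
chart = rendered_kinds("grafana", "grafana/grafana")
print("extra objects introduced by the chart:", chart - naive)
```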

Perhaps the team also uses additional tooling on top of this, such as auto-scaling tools.

If the benchmarks we collect only measure a basic/naive implementation, we risk underestimating a workload's total energy impact when it is used at scale, because of those additional supporting compute/memory/storage resources. An HA configuration carries some additional compute/memory/network overhead to manage data consistency, queuing, load balancing, and so on.

Even if these additional Kubernetes resource differences are locally small on a single node, they aggregate at large scale: an overhead of just a few watts per replica becomes kilowatts across thousands of replicas. Each auto-scaling tool also consumes resources of its own while monitoring scaling triggers and executing scaling events.

Required Research:

  • For a given tool, use the tool's official documentation to determine its recommended deployment models, specifically distinguishing HA/production paradigms from single-node/local/test deployments
    • Create an HA-specific benchmark evaluation for that tool
    • If possible, identify common HA paradigms across CNCF ecosystem tools, so that we have something generic enough to account for many workloads' common HA configurations
  • Determine if we are able to use the Power Capping Framework for control plane and/or worker nodes that we are running benchmarks on.
    • If yes, create a list of required outputs from PCF to gather for tests
  • Create a list of Prometheus node metrics that would provide the data needed for this evaluation (see the sketch after this list)
  • Create/evaluate Kubernetes control plane and worker node baselines to compare against the HA workload delta
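
As a starting point for the Prometheus item above, here is a minimal sketch that pulls a per-run energy total over the Prometheus HTTP API (`/api/v1/query`). The endpoint, the run-window timestamps, and the `kepler_node_package_joules_total` metric name are assumptions; Kepler exposes node energy counters along these lines, but verify the exact name against whatever energy exporter the benchmark nodes actually run:

```python
"""Pull per-run energy totals from Prometheus and compute the HA delta."""
import requests

PROM_URL = "http://localhost:9090"  # placeholder endpoint, adjust to your setup

def joules_over_window(promql: str, end_unix: float) -> float:
    """Evaluate an instant PromQL query at `end_unix`, return the first sample."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": promql, "time": end_unix},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Total node package energy over a 30m benchmark window ending at the given time.
QUERY = "sum(increase(kepler_node_package_joules_total[30m]))"

baseline = joules_over_window(QUERY, end_unix=1734000000)  # hypothetical baseline run
ha = joules_over_window(QUERY, end_unix=1734010000)        # hypothetical HA run
print(f"HA energy delta: {ha - baseline:.1f} J")
```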

Desired Outcome:

Any tool that goes through our benchmarking can see both its core workload's energy performance and the delta showing how its recommended deployment paradigms and auto-scaling settings affect its energy footprint.
