In our environment, nodes are provisioned using infrastructure-as-code and grouped into dedicated node pools by workload type: general services, high-memory applications, or batch processing jobs. Each node is configured with a standardized base image, security hardening policies, and monitoring agents. We define resource requests and limits carefully so Pods are scheduled efficiently without overcommitting CPU or memory, and auto-scaling groups add or remove nodes based on demand, giving us both cost control and performance stability.

One of the main challenges we faced was right-sizing nodes to prevent resource fragmentation and avoid issues like OOM kills or CPU throttling. We also had to strengthen observability to detect node-level bottlenecks quickly, and improve patch management so we could stay current on security without causing downtime.

Overall, automation and clear capacity planning have been key to managing nodes reliably at scale.
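As a minimal sketch of the requests/limits and node-pool targeting described above — the pool label (`workload-type`), taint key (`dedicated`), Pod name, and image are hypothetical placeholders, and real values will differ per environment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker            # hypothetical name for illustration
spec:
  nodeSelector:
    workload-type: batch        # assumed node-pool label; actual labels vary
  tolerations:
  - key: dedicated              # assumes pools are tainted for dedicated use
    operator: Equal
    value: batch
    effect: NoSchedule
  containers:
  - name: worker
    image: registry.example.com/batch-worker:1.4   # hypothetical image
    resources:
      requests:
        cpu: "500m"             # share the scheduler reserves on the node
        memory: 512Mi
      limits:
        cpu: "1"                # container is throttled above this
        memory: 1Gi             # container is OOM-killed above this
```

Setting requests close to observed usage (rather than copying limits) is what keeps the scheduler from fragmenting node capacity, while the memory limit is the boundary tied to the OOM kills mentioned above.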