We are seeking a highly skilled Senior Staff Engineer to join our team, focusing on the management and optimization of our hybrid cloud infrastructure. This role requires deep expertise in managing resources across leading public cloud platforms, ensuring system performance, security, and compliance at an enterprise level. The ideal candidate will be instrumental in orchestrating unified operations, driving cost-efficiency, and maintaining high availability across our global cloud environments.
Core Responsibilities
Cloud Platform Architecture and Operations
Design, deploy, and maintain robust cloud infrastructures utilizing key services from AWS and Alibaba Cloud. This includes compute instances (EC2, ECS), object storage (S3, OSS), networking (VPC, CEN), serverless functions (Lambda, Function Compute), and managed Kubernetes services (EKS, ACK). A critical part of this role is architecting highly available systems that can scale automatically based on demand while optimizing network topologies and resource configurations for peak performance.
Monitoring and Incident Management
Implement comprehensive, full-stack monitoring solutions using a combination of cloud-native tools like AWS CloudWatch and Alibaba Cloud CloudMonitor, alongside open-source stacks such as Prometheus with Grafana and the ELK stack. You will lead the response to critical incidents, perform thorough root cause analyses, and establish preventive measures to address issues like resource contention, configuration errors, and network latency.
Cost Optimization and Resource Management
Analyze cloud spending and resource utilization patterns to identify cost-saving opportunities and act on them. Strategies include purchasing reserved instances, configuring auto-scaling policies, and implementing intelligent storage lifecycle management. You will also establish and enforce resource quota frameworks to prevent waste and control expenditures across departments.
Security and Compliance
Implement and enforce stringent cloud security baselines. This involves managing security groups and identity and access management policies (AWS IAM, Alibaba Cloud RAM), and using security services like AWS Security Hub. You will conduct regular security audits, remediate vulnerabilities, and design granular access controls. Ensuring comprehensive audit logging via tools like AWS CloudTrail and Alibaba Cloud ActionTrail is also a key requirement.
Collaboration and Knowledge Sharing
Work closely with development teams to optimize application architectures for the cloud, advocating for modern approaches like microservices and serverless computing. A significant part of your role will be to document standard operating procedures and lead internal technical training sessions to elevate the entire team's cloud capabilities.
Required Qualifications and Skills
Technical Expertise
- Mastery of core services across compute, storage, networking, and security on either AWS or Alibaba Cloud, with working knowledge of the other platform.
- Strong proficiency in Linux and Windows system administration, and in automation with scripting languages like Shell or Python and tools like Ansible.
- Hands-on experience running containerized workloads on Kubernetes and managed container services like EKS, ECS, or ACK, plus familiarity with cloud-native technologies such as service mesh.
Professional Experience
- A minimum of 5 years of operations experience, with at least 3 years dedicated to managing public cloud environments (AWS/Alibaba Cloud) supporting over 100 instances.
- Proven experience in building cloud platforms from the ground up, designing hybrid cloud architectures, or leading large-scale migration projects (e.g., from on-premise data centers to the cloud).
Soft Skills and Education
- Exceptional problem-solving abilities and a proven capacity to perform under pressure during operational incidents.
- Excellent communication and collaboration skills to work effectively with development, testing, and security teams.
- A Bachelor’s degree or higher in Computer Science, Network Engineering, or a related field is required. Professional certifications such as AWS Certified SysOps Administrator or Alibaba Cloud ACP/ACE are highly preferred.
Additional Valued Skills
- Experience with multi-cloud management platforms (e.g., platforms spanning AWS, Alibaba Cloud, and Azure) or with FinOps methodologies for cross-cloud financial management.
- Knowledge of advanced cloud security practices, including the configuration of Web Application Firewalls (WAF) and DDoS protection services (e.g., AWS Shield, Alibaba Cloud Anti-DDoS Premium).
- Familiarity with operating big data or AI/ML platforms on the cloud, such as Alibaba Cloud MaxCompute or AWS EMR.
- Previous team leadership or tech lead experience is a significant advantage.
What We Offer
We provide a competitive total compensation package designed to attract top talent. Our benefits include extensive Learning & Development programs with education subsidies to support your continuous growth. We foster a strong community through various team-building programs and company events. Employees also enjoy wellness and meal allowances, alongside comprehensive healthcare schemes that extend to dependants.
We are committed to creating a rewarding and diverse environment, united by a culture that emphasizes teamwork, integrity, and results. For a deeper look at how you can contribute to our forward-thinking team, we encourage you to explore this career opportunity further.
Frequently Asked Questions
What is the primary focus of a Senior Staff Engineer in Public Cloud Operations?
This role focuses on the end-to-end lifecycle management of a large-scale hybrid cloud infrastructure. The engineer is responsible for ensuring high availability, optimizing performance and costs, and maintaining strict security and compliance standards across AWS and Alibaba Cloud environments.
What are the key technical skills required for this position?
Key skills include mastery of core AWS or Alibaba Cloud services, proficiency in automation with Shell/Python/Ansible, and hands-on experience with container orchestration using Kubernetes and its managed services (EKS, ACK). Experience designing highly available and auto-scaling architectures is crucial.
How does this role contribute to cost management?
The engineer analyzes cloud resource usage to identify savings opportunities through strategies like reserved instances, auto-scaling, and intelligent storage tiering. They also establish resource quota management systems to prevent overspending and ensure efficient use of cloud budgets.
What is the importance of security in this cloud operations role?
Security is paramount. The engineer implements security baselines, manages IAM/RAM permissions, conducts regular audits, and remediates vulnerabilities. They design granular access controls and ensure comprehensive auditing is in place to protect enterprise assets and data.
Does this role require collaboration with other teams?
Yes, extensive collaboration is essential. The engineer works closely with development teams to optimize application architectures for the cloud and provides internal training and documentation to share knowledge and standardize operational best practices across the organization.
What kind of experience is preferred for candidates?
Beyond the 5+ years in operations, preference is given to candidates with experience building cloud platforms from scratch, designing hybrid cloud solutions, or leading large-scale migration projects. Familiarity with multi-cloud management and FinOps is also highly valued.