Site Reliability Engineer (SRE)
2K develops and publishes interactive entertainment globally for console systems, handheld gaming devices, and personal computers, including smartphones and tablets. 2K is a leading publisher in today’s most popular gaming genres and is best known for critically acclaimed game franchises like NBA 2K, WWE 2K, BioShock, Borderlands, Evolve, XCOM, and the beloved Sid Meier’s Civilization.
About the Team: Site Reliability Engineering (SRE)
The 2K Site Reliability team is responsible for the operations and infrastructure of all consumer-facing production systems and developer-facing systems at 2K Games, including NBA 2K game services, customer-facing account services, and websites. This team handles systems and services spanning multiple data centers, both terrestrial and cloud-based.
What We Need:
We are looking for an engineer who is passionate about building multi-datacenter infrastructure and services. Strong systems knowledge and problem-solving skills are required, as you will develop solutions for game studios and support data centers around the world alongside a group of outstanding engineers.
In this role, you will collaborate with network engineers, systems architects, and development staff to support our gamers and the needs of the business.
What We Do:
- Build and automate scaled service infrastructure
- Own and operate monitoring and alerting services across multiple regions
- Define and implement standards that will impact systems, services, and multiple software environments
- Diagnose and resolve technical issues raised by both internal and external customers
- Remove infrastructure toil through automation
- Spread SRE and operational best practices to customers and the greater organization
- Participate in Site Reliability Engineering’s on-call rotation
Who We Believe Will Be an Outstanding Fit:
You are eager to work in a fast-paced environment with other highly skilled engineers who are passionate about service availability and health! The idea of building data center infrastructure services from greenfield to implementation moves you!
- Expertise in scalable production services (config management, monitoring, infrastructure as code, load balancing, distributed systems)
- Experience with Systems Infrastructure, Virtualization, Kubernetes, and many of the following technologies: Helm, Docker, Terraform, Elasticsearch, Prometheus, Puppet, Git, Jenkins
- Strong understanding of the SLI, SLO, and SLA concepts
- A passion for service health and reliability
- Demonstrated ability to decompose sophisticated problems and engage in lateral investigations
- Strong coding experience in at least one of Python, Ruby, Java, or Go, and a good understanding of code management
- Experience with Unix/Linux operating systems (tuning and system internals) and TCP/IP networking fundamentals
- Prior hands-on experience working in a highly available environment, scaling to thousands of nodes
- Experience mentoring other specialists
- Experience working with product owners on service levels