Site Reliability Engineer (SRE)
RangeForce
Description:
Cyberbit Range is a cloud-based simulation that enables cybersecurity teams to experience real-life cyber-attacks before they encounter them. Our cloud-based Cyber Range provides instant access to real-world networks and security tools, so SOC teams and students can battle real-world malware and cyber-attacks and prepare for the inevitable attack.
At Cyberbit, we’re passionate about building software that solves problems. We count on our site reliability engineers (SREs) to give our users the rich feature set, high availability, and stellar performance they need to pursue their missions.
If you are looking for a job that lets you work with cutting-edge technologies on a cloud-based cyber training service at a super cool and successful company, come and join us!
Key Responsibilities:
- Oversee the production environment by continuously monitoring system availability and ensuring comprehensive system health.
- Troubleshoot and implement rapid, effective solutions for production incidents.
- Deploy innovative tools and methodologies to enhance scalability and system performance.
- Design and develop automation solutions to improve reliability and operational efficiency.
- Utilize cloud technologies to address client technical requirements and business needs.
- Maintain and optimize existing workflows in alignment with microservices and serverless architectures.
- Offer technical leadership and provide training on development and operational best practices to team members and colleagues.
- Deliver primary operational support for multiple distributed software applications.
Requirements:
- Minimum 2 years of hands-on experience as a Site Reliability Engineer (SRE) in the cybersecurity industry.
- Proven expertise managing production workloads on public cloud platforms (AWS and/or Azure), including infrastructure-as-code, scaling, and high-availability deployments.
- Advanced proficiency in developing and delivering complex automation and orchestration solutions using scripting and programming languages such as Python, Bash, and PowerShell.
- Deep experience designing, maintaining, and optimizing Jenkins pipelines, with strong knowledge of Groovy scripting.
- Solid background in configuration management, preferably with Ansible, for automating system configuration and application deployment.
- In-depth understanding of microservices architecture and containerization technologies, including Docker, Kubernetes, ECS, and EKS.
- Expertise in monitoring and observability using Prometheus and Grafana as primary monitoring systems, including metric collection, alerting, and dashboard creation. Experience with additional tools such as Elastic and CloudWatch is a plus.
- Demonstrated experience supporting serverless and containerized production workloads.
- Strong grasp of standard networking and security protocols, with the ability to troubleshoot and secure distributed systems.
- Experience managing and operating virtualized environments, such as VMware ESXi and VMware vCenter.
- Proactive mindset for identifying system issues, performance bottlenecks, and opportunities for reliability improvements.