#vacancy #iGaming #limassol #SRE #sitereliabilityengineer
🎯 SRE - Role Summary:
As a Site Reliability Engineer (SRE), you will play a critical role in maintaining the reliability, performance, and scalability of our systems and applications. You will design and manage monitoring infrastructure, respond to incidents, automate processes, and support system improvements—all with the aim of ensuring exceptional service for our users worldwide.
💼 Key Responsibilities
🔍 Monitoring & Observability
Design and implement proactive monitoring and alerting solutions using tools like Prometheus, Grafana, Loki, CloudWatch, and Datadog
Analyze telemetry data to identify anomalies and root causes before they affect end users
Maintain high visibility into system health and performance through dashboards and SLO/SLI tracking
🚨 Incident & Problem Management
Respond to incidents, perform thorough analysis, and contribute to postmortems
Collaborate with Engineering, DevOps, and QA teams to reduce MTTR and eliminate recurring issues
Use tools like Opsgenie, Jira, and Slack for efficient alerting, coordination, and documentation
🤖 Automation & Efficiency
Automate deployment, configuration, and monitoring tasks using Bash, Python, and Ansible
Monitor and maintain self-healing infrastructure using Kubernetes, Docker, and IaC tooling
🤝 Collaboration & Communication
Serve as a reliability champion across technical teams
Lead and document incident response practices and cross-functional drills
Partner with Data, Product, and Engineering teams to optimize end-to-end delivery
Fluent English is required; Russian is a plus
🧩 Key Requirements
✔️ Must-Have Skills
Proficiency in observability tools: Prometheus, Grafana, Loki (including LogQL), cloud-native monitoring (AWS/GCP), Jaeger, Datadog, Checkly
Strong scripting abilities in Bash, Python, and/or PowerShell
Hands-on experience with Kubernetes, Docker, IaC and version-control tooling (e.g., Ansible, Git), and CI/CD flows
Fluency in incident lifecycle management (ITIL familiarity is a plus)
Knowledge of SLIs/SLOs/SLAs and how to manage them effectively
🌐 Preferred Skills
Familiarity with Opsgenie, Playwright (for automated testing), and Checkly
Working knowledge of databases: SQL Server, BigQuery, MySQL
Linux system administration experience
Security awareness (basic understanding of tools like Kali Linux, Wireshark, nmap, etc.)
Experience with monitoring as code, specifically using Checkly’s CLI, JavaScript/TypeScript SDKs, or Terraform
Strong ability to debug complex user flows with Playwright trace viewer, DOM snapshots, and network waterfalls for failing browser checks in distributed environments
🎓 Courses and Certifications (a plus)
AWS Cloud Practitioner
Google Cloud Associate or Professional certifications, or Azure Fundamentals
LPIC-1 Linux Administrator
Fundamentals of Infrastructure as Code (IaC)
Datadog Fundamentals, APM & Distributed Tracing Fundamentals, and Log Management Fundamentals
Getting Started with Synthetic Monitoring and Browser Testing by Datadog Learning
🌟 What We Offer
Lunch allowance
Hybrid work format as per WFH internal policy
Health insurance from Day 1
Birthday and anniversary gifts
Inclusive, dynamic, and innovative work culture
Interested candidates, please send your CV to
[email protected] or PM me for more details