Geidea

Established in 2008, Geidea epitomizes customer-focused empowerment and commercial success through continuous innovation.
Geidea makes best-in-class digital payment solutions available for all by attracting and leveraging the best creative & entrepreneurial talent in the market.
Our solutions give any business the chance to get ahead and reach for more, no matter their size or maturity.
Our technology mirrors our people - Smart, Innovative & Forward Thinking
www.geidea.net

To maintain a competitive advantage as we grow, we are currently looking for a new "Senior Site Reliability Engineer".

Job purpose:
The Senior Specialist, Site Reliability Engineering (SRE) is responsible for ensuring the reliability, availability, scalability, and performance of critical production systems. This role combines software engineering and systems engineering to build and maintain highly resilient platforms while driving automation, monitoring, and continuous improvement across infrastructure and applications.
The position plays a key role in reducing operational risk, improving system observability, and enhancing service stability in a 24/7 environment.
Design proactive alerting strategies.
Build dashboards for infrastructure, applications, and business KPIs.
Analyze performance bottlenecks and system anomalies.

Responsibilities:

1. Reliability & Availability
Ensure high availability and performance of production systems.
Define and manage SLAs, SLOs, and SLIs.
Lead incident management and root cause analysis (RCA).
Implement proactive measures to prevent recurring incidents.

2. Monitoring & Observability
Design and maintain monitoring solutions (Infrastructure, Application, Database).
Develop dashboards and alerts using tools such as CloudWatch, Grafana, Prometheus, ELK, etc.
Improve logging, tracing, and metrics collection.
Reduce alert noise and improve actionable monitoring.

3. Automation & DevOps Practices
Automate operational tasks using scripting (PowerShell, Bash, Python).
Implement CI/CD pipelines and deployment automation.
Apply Infrastructure as Code (IaC) using Terraform, Ansible, or similar tools.
Improve release reliability and reduce deployment risks.

4. Incident & Problem Management
Participate in 24/7 on-call rotation.
Lead Major Incident handling and communication.
Conduct post-incident reviews and drive corrective actions.
Collaborate with application and infrastructure teams for permanent fixes.

5. Performance & Capacity Management
Conduct system performance tuning.
Monitor capacity trends and forecast scaling needs.
Optimize resource utilization across environments.

6. Security & Compliance
Support security hardening initiatives.
Ensure compliance with IT governance and audit requirements.
Implement secure configuration standards.

Technical Requirements:
Strong knowledge of Linux and/or Windows Server environments.
Experience with cloud platforms (AWS, Azure, or GCP).
Hands-on experience with monitoring tools (Grafana, CloudWatch, Prometheus, Zabbix, etc.).
Experience with containerization (Docker, Kubernetes).
Knowledge of networking fundamentals (TCP/IP, DNS, Load Balancers).
Experience with scripting and automation.
Understanding of database systems (SQL Server, MySQL, PostgreSQL).

Qualifications:
5 years of experience
Bachelor’s degree in IT or engineering

Our values guide how we think and act - They describe what we care about the most
Customer first - It’s embedded in our design thinking and customer service approach
Open - Openness allows us to constantly improve and evolve
Real - No jargon and no excuses!
Bold - Constantly challenging ourselves and our way of thinking.
Resilient - If we fail, we bounce back stronger than before.
Collaborative - We know that we can achieve a lot more as a team.

We are changing lives by constantly striving for a better solution.
We are on a mission to help merchants start, run and grow their businesses.
What you should know
Dominant Market Share: Captured 50% of Saudi Arabia's point-of-sale market within just two years of launching its first certified terminal
Massive Payment Network: Operates a network of approximately 700,000 payment terminals and ATMs across the region
2 First Licenses: Became the first fintech in Saudi Arabia to obtain a payment institution license and a non-bank merchant acquiring license
How they work
Infrastructure means reliability first: Payment systems can't be interesting at the cost of being unreliable, so engineering and product decisions are made with uptime and trust as the primary constraints
Merchant churn is the failure metric: Acquiring a merchant matters less than keeping them, since the business model only works when merchants see real value and stay