- Create and improve metrics and monitoring initiatives to gain more insight into the behavior of our applications and products.
- Install “smoke detectors” to catch fires before they happen (proactive monitoring).
- Build automation to alert firefighter teams.
- Create and maintain continuous integration processes, deployment scripts, server spin-up/terminate scripts and scripts for basic server operations.
- Operational experience with Linux-based applications
- Experience working with diverse infrastructure options such as cloud, container and bare metal servers
- Strong expertise in one (or more) of the following technologies: Prometheus, NewRelic, collectd, InfluxData, Docker, OpsGenie, Slack App, Webhooks, TeamCity
- Experience with software configuration management systems and source code version control systems (Puppet, Chef, Salt, Ansible, Git)
- Ability to rapidly learn new development languages (Go, Bash, PHP and Python are all in heavy use)

Our Stack
We monitor 1000+ servers in 3 data centers, handling around 200k RPM. Our servers run PHP-FPM, Go, Memcached, Elasticsearch, MySQL, RabbitMQ, etc.