Overview
How do you keep a data-intensive, real-time service that monitors tens of thousands of servers up and running around the clock?
How do you respond to infrastructure failures or performance issues in a high-volume, low-latency computing environment?
What should the infrastructure look like when Datadog monitors a million servers? If you think you have the answers, join us as a Site Reliability Engineer.
What you will do
- Keep our service reliable, available and fast as a member of the operations team.
- Respond to, investigate and fix service issues, whether they be deep in the OS kernel or in the application code.
- Design, build and maintain the infrastructure we need to support orders of magnitude more customers.
Who you must be
- You have a BS/MS/PhD in a scientific field.
- You have a track record as an engineer running operations for a large site.
- You value correctness and efficiency; you leave no stone unturned when diagnosing production issues.
- You handle infrastructure with code because automation lets you focus on the more difficult and rewarding problems.
- You have production experience with distributed compute/storage tools, e.g. ZooKeeper, Cassandra, Postgres, Kafka, Elasticsearch, Redis.
Bonus Points
- You have submitted bug fixes to the aforementioned projects.
- You are fully fluent in Python, Ruby and Go.
Is this you? Tell us why, and apply now. Include links to your GitHub, Stack Overflow or other online projects.