Merchant Mentors

Overview

How do you keep a data-intensive, real-time service that monitors tens of thousands of servers up and running around the clock?

How do you respond to infrastructure failures or performance issues in a high-volume, low-latency computing environment?

What should the infrastructure look like when Datadog monitors a million servers? If you think you have the answers, join us as a Site Reliability Engineer.

What you will do

  • Keep our service reliable, available and fast as a member of the operations team.
  • Respond to, investigate and fix service issues, whether they be deep in the OS kernel or in the application code.
  • Design, build and maintain the infrastructure we need to support orders of magnitude more customers.

Who you must be

  • You have a BS/MS/PhD in a scientific field
  • You have a track record as an engineer in the operations of a large site
  • You value correctness and efficiency; you leave no stone unturned when diagnosing production issues
  • You handle infrastructure with code because automation lets you focus on the more difficult and rewarding problems
  • You have production experience with distributed compute/storage tools, e.g. ZooKeeper, Cassandra, Postgres, Kafka, Elasticsearch, Redis

Bonus Points

  • You have submitted bug fixes to the aforementioned projects
  • You are fully fluent in Python, Ruby and Go

Is this you? Tell us why, and apply now. Include links to your GitHub, Stack Overflow or other online projects.
