Overview
Reliability Engineer
As a Reliability Engineer, you would measure application behavior and improve systems in the face of the continual failures and degradations present in a distributed service environment. You would have a passion for identifying bottlenecks and failure patterns, and the ability to fix them at any level of the stack. You would be responsible for staying ahead of the curve as Bitly scales its systems to meet demand.
Responsibilities
- Instrument and troubleshoot large-scale distributed systems
- Monitor and maintain critical systems
- Participate in periodic on-call duties
- Measure and improve system stability in the face of distributed systems failures
- Collaborate with Application Development teams to solve scalability issues due to user and dataset growth
- Develop a platform for collecting and analyzing real-time application health data… at scale
Qualifications
- Bachelor’s degree in Computer Science or a related field (or 4+ years of equivalent experience)
- Fluency in Python or Go
- Experience in C, C++ or Bash a plus
- Experience with I/O loops and concurrent programming (Go, Tornado, libevent)
- Strong knowledge of Nginx, the HTTP stack, and approaches to load balancing traffic (HAProxy, LVS, ELB)
- Strong knowledge of the network stack, including analysis and troubleshooting (tcpdump, iperf, iostat)
- Experience with distributed service architecture
- Experience instrumenting and measuring applications with Graphite
- Systems trending experience with Munin, Cacti, Ganglia, or similar
- Configuration management experience with Chef, Puppet, CFEngine, or similar
- Experience with distributed message handling at scale ([NSQ](http://nsq.io/) a plus)
- Strong familiarity with Linux environments
What The Bitly Infrastructure Team Says
“Working on the infrastructure team is fun because you get a grab bag of challenging problems that don’t get solved by other teams. We are relied on as the last layer of architecting systems, solving scaling problems and keeping things running.” – Justin Hines, Infrastructure Engineer
“Getting to work with world-class people on complex and challenging problems at scale. It’s just been amazing.” – Dan Lotterman, Operations Engineer