

Site Reliability Engineer

Netgate is looking for a Site Reliability Engineer (SRE) to focus on the automation, connectivity, availability, reliability, security, and performance of internal and customer-facing resources.

Apply Now

If the "Apply Now" button above does not open an email form, the cause may be your browser settings. Alternatively, send us an email at hr@netgate.com. Thank you!

Department: Engineering     Type: Full Time     Location: Austin, Texas

The successful candidate will work to ensure the availability of both internal and customer-facing services, identify and automate manual processes within the organization, solve problems relating to mission-critical services, build automation to prevent problems from recurring, and bridge the gaps between the product development teams and operations. You will be responsible for implementing operational improvements, improving the way our systems talk to each other, assisting in growth and capacity planning, and more. The ultimate goal is continuous improvement through code, the introduction of modern tools, and better, more efficient processes.

 

DUTIES AND RESPONSIBILITIES

WHAT YOU WILL DO:

  • Collaborate with other engineers to help solve problems ranging from systems security to building automation
  • Combine software and systems knowledge to engineer high-volume distributed systems in a reliable, scalable, and fault-tolerant manner
  • Continually optimize systems and workflows by improving the architecture, infrastructure, automation, and observability
  • Build tools that help developers manage applications throughout the software development life cycle (SDLC)
  • Work closely with other engineers to solve technical challenges and ensure continued application scalability
  • Conduct Root Cause Analysis (RCA) following critical production incidents and drive mitigation strategies
  • Seek and implement opportunities to automate routine maintenance tasks and resolve common issues
  • Build systems and tools to automate deployment pipelines
  • Define and own best practices for our engineering teams and assist them in adopting these practices
  • Influence our infrastructure direction with your ideas
  • Stay current with industry trends, systems, and practices and teach others to help them level up

 

REQUIREMENTS

WHAT YOU WILL NEED:

  • 5+ years of production system administration and web operations experience
  • 5+ years of experience with Linux operating system internals and administration (e.g., filesystems, inodes, system calls)
  • 3+ years of programming experience in Java, Go, Python, Ruby, or some combination of these
  • 3+ years of experience with automation and configuration management tools like Jenkins (preferred), Chef, Puppet, Salt, or equivalent
  • Experience with a broad range of concepts, applications, and languages, including application containerization (Docker), Jenkins, Git/GitHub/GitLab, Infrastructure as Code, Ruby, Java, Apache NetBeans, Go, Python, Sphinx/reST, PostScript, Markdown, HTML, or some combination of the above
  • A strong desire to innovate, experiment, collaborate, and learn
  • A strong desire to innovate, experiment, collaborate and learn
  • High standards for quality and attention to detail
  • Excellent problem-solving and analytical skills
  • Excellent oral and written communication skills
  • Strong interest in analyzing and troubleshooting highly available services with a distributed architecture
  • Strong understanding of Linux systems, networking, and security fundamentals
  • Experience with major cloud providers like AWS and Azure
  • The ability to explain your ideas clearly, give and receive feedback, and work well with team members
  • BS degree in Computer Science, related technical field, or equivalent practical experience
  • Experience in data structures, algorithms, and software design

 

