Site Reliability Engineer

Netgate is looking for a Site Reliability Engineer (SRE) to focus on the automation, connectivity, availability, reliability, security, and performance of internal and customer-facing resources.

Apply Now

If the "Apply Now" button above does not present an email form, the cause may be your browser settings. Alternatively, send us an email at hr@netgate.com. Thank you!

Department: Engineering
Type: Full Time
Location: Austin, Texas

The successful candidate will ensure the availability of internal and customer-facing services, identify and automate manual processes within the organization, solve problems relating to mission-critical services, build automation to prevent those problems from recurring, and bridge the gaps between the product development teams and operations. You will be responsible for implementing operational improvements, improving the way our systems talk to each other, assisting in growth and capacity planning, and more. The ultimate goal is continuous improvement through code, the introduction of modern tools, and more efficient processes.

DUTIES AND RESPONSIBILITIES

WHAT YOU WILL DO:

Collaborate with other engineers to help solve problems ranging from systems security to building automation
Combine software and systems knowledge to engineer high-volume distributed systems in a reliable, scalable, and fault-tolerant manner
Continually optimize systems and workflows by improving the architecture, infrastructure, automation, and observability
Build tools that help developers manage applications throughout the software development life cycle (SDLC)
Work closely with other engineers to solve technical challenges and ensure continued application scalability
Conduct Root Cause Analysis (RCA) following critical production incidents and drive mitigation strategies
Seek and implement opportunities to automate routine maintenance tasks and resolve common issues
Build systems and tools to automate deployment pipelines
Define and own best practices for our engineering teams and assist them in adopting these practices
Influence our infrastructure direction with your ideas
Stay current with industry trends, systems, and practices and teach others to help them level up

REQUIREMENTS

WHAT YOU WILL NEED:

5+ years of production system administration and web operations experience
5+ years of experience with Linux operating system internals and administration (e.g., filesystems, inodes, system calls)
3+ years of programming experience in Java, Go, Python, Ruby, or some combination of the above
3+ years of experience with configuration management and automation tools like Jenkins (preferred), Chef, Puppet, Salt, or equivalent
Experience with a broad range of concepts, applications, and languages, including application containerization (Docker), Jenkins, Git/GitHub/GitLab, Infrastructure as Code, Ruby, Java, Apache NetBeans, Go, Python, Sphinx/reST, PostScript, Markdown, HTML, or some combination of the above
A strong desire to innovate, experiment, collaborate, and learn
High standards for quality and attention to detail
Excellent problem-solving and analytical skills
Excellent oral and written communication skills
Strong interest in analyzing and troubleshooting highly available services with a distributed architecture
Strong understanding of Linux systems, networking, and security fundamentals
Experience with major cloud providers like AWS and Azure
The ability to explain your ideas clearly, give and receive feedback, and work well with team members
BS degree in Computer Science or a related technical field, or equivalent practical experience
Experience in data structures, algorithms, and software design
