Heidi Health

Senior SRE

Reposted 13 Hours Ago

Remote

3 Locations

Senior level

As a Senior Site Reliability Engineer, you will establish and scale reliability practices, manage incidents, and optimize observability strategies for AI-powered healthcare solutions.

Who are Heidi?

Heidi is on a mission to halve the time it takes to deliver world-class care.

We believe that by 2050, every clinician will practice with AI systems that free them from administrative burdens and increase the quality and accessibility of care to patients across the world.

Built for clinicians, by clinicians, at the core of Heidi is its people. We are an eclectic bunch of inventors, builders, scientists, nurses, doctors, mathematicians, designers, creatives, and high-agency executors.

We achieve in 6 months what it takes our competitors 4 years to do. In just 12 months, 20 million patient consults were supported by Heidi, and we’re now powering more than 1 million consults every week.

With our most recent $16.6MM round of funding from leading VC firms, we’re geared up to supercharge our ambitious global growth, starting with the US, Canada, UK and Europe - and we need great people like you to get there. Ready for the challenge?

The Role

As a Senior Site Reliability Engineer at Heidi, you'll be instrumental in establishing and scaling our reliability practices while ensuring robust, secure, and observable systems.

You'll work closely with our engineering team to implement comprehensive monitoring, incident management, and reliability processes for our AI-powered healthcare solutions.

Primary Responsibilities:

Observability & Monitoring

Design and implement comprehensive observability strategies using Datadog, or other tooling you can convince us to adopt!
Implement OpenTelemetry instrumentation across our backend and frontend services
Set up real user monitoring (RUM) and application performance monitoring (APM) to ensure end-to-end visibility
Create and maintain dashboards that provide meaningful insights for different stakeholders (technical teams, support, management)
Monitor and optimise third-party service integrations, particularly for critical services

Incident Management & Response

Establish and implement incident management processes from the ground up
Evaluate and implement appropriate incident management tools that integrate with our observability stack
Create and maintain incident response playbooks and automated runbooks
Lead post-incident reviews and foster a blameless culture
Implement and maintain on-call rotations and escalation policies

SLA & SLO Management

Define and implement SLOs that align with business requirements and customer expectations
Set up error budgets and tracking mechanisms
Create comprehensive SLA reporting for enterprise customers
Design and implement SLI metrics that provide meaningful insights into service health

Cost Optimisation & Efficiency

Optimise observability costs through efficient logging and metrics collection
Implement log management and retention strategies
Fine-tune alerting to minimise alert fatigue while maintaining service reliability
Evaluate and recommend cost-effective tooling solutions

Key Requirements:

Extensive experience with observability platforms (Datadog preferred) and understanding of observability architecture
Strong knowledge of OpenTelemetry and modern instrumentation practices
Experience implementing APM and RUM in Python and React/React Native environments
Track record of establishing incident management processes and fostering a blameless culture
Experience defining and implementing SLAs/SLOs for enterprise customers
Strong background in monitoring distributed systems and third-party service integrations
Experience with cloud infrastructure (AWS required, Azure and GCP beneficial)
Proven track record in implementing SRE practices and reliability improvements

Preferred Qualifications:

Experience with chaos engineering practices
Knowledge of automated runbook implementation
Healthcare industry experience
Understanding of HIPAA or similar healthcare compliance frameworks

What we will look for:

Problem-solving mindset with a focus on reliability and scalability
Strong communication skills to work with cross-functional teams
Ability to balance technical requirements with business needs
Experience in fast-paced startup environments
Dedication to maintaining high standards in a regulated environment

What do we believe in?

We create unconventional solutions to difficult problems and we build them fast. We want you to set impossible goals and make them happen: think landing a rocket, but the medical version.
You'll be surrounded by a world-class team of engineers, medicos and designers to do your best work, inspired by our shared beliefs:
- We will stop at nothing to improve patient care across the world.
- We design user experiences for joy and ship them fast.
- We make decisions in a flat hierarchy that prioritises the truth over rank.
- We provide the resources for people to succeed and give them the freedom to do it.

Why you will flourish with us 🚀?

Flexible hybrid working environment, with 3 days in the office.
Additional paid day off for your birthday and wellness days
Special corporate rates at Anytime Fitness in Melbourne (Sydney TBC)
A generous personal development budget of $500 per annum
Learn from some of the best engineers and creatives, joining a diverse team
Become an owner with shares (equity) in the company: if Heidi wins, we all win
The rare chance to create a global impact as you immerse yourself in one of Australia’s leading healthtech startups
The opportunity to fast-track your startup career if you make an impact quickly!

Top Skills

AWS

Azure

Datadog

GCP

OpenTelemetry

Python

React

React Native
