Full-time - Latin America
Nas Company is a media and tech company on a mission to help people feel more connected, both online and offline.
With a team of 100 amazing people from 30 countries, we’re leading the way in the creator economy, reaching 300 million people worldwide every month.
We believe in making real connections at a global scale, bringing people together no matter where they are.
Why should you be part of our success story?
At Nas Company, we help some of the world's biggest brands level up their social media and reach millions of people. We’re not just creating content—we’re creating the next wave of storytellers. We've trained companies, governments, and organizations on how to make viral content, empowering their employees, customers, executives, and even citizens to connect and share stories.
Our focus is on mastering storytelling, building communities, and running powerful campaigns. We help the world’s top brands share their messages, create impact, and build communities that last.
Our partners include Google, Facebook, the Bill & Melinda Gates Foundation, AppsFlyer, Canon, Grab, eToro, Coinbase, Solana, DW Bonn, University of Maryland, and many more.
We've raised $23 million so far, backed by top VC investors like Lightspeed Venture Partners, Pitango, and 500 Global, to help make even bigger stories and bring people together.
Position Overview
We are looking for a Senior Site Reliability Engineer (SRE) to join our distributed engineering team and lead our reliability, observability, and infrastructure initiatives. In this remote role (Latin America timezone), you will be the primary on-call engineer during Asia-based off-hours (approximately 8:00 PM – 9:00 AM GMT+8) to ensure our platform remains stable and performant. The ideal candidate is a seasoned, autonomous engineer who can maintain platform stability with minimal direct oversight. You will work mostly asynchronously, collaborating with the team via documentation and chat, with one weekly synchronous meeting to align with the broader engineering group.
Team Setup & Reporting: You will report directly to the Head of Engineering (Edwin Candinegara). Given the time difference, you will have minimal working-hour overlap with our core engineering team in Singapore/India, so strong communication and independent decision-making are crucial. Expect primarily asynchronous communication, with weekly sync-ups for team meetings or critical discussions.
Key Responsibilities
- Infrastructure Monitoring & Maintenance: Independently monitor, maintain, and improve our AWS infrastructure and deployment pipelines during off-hours to ensure smooth operations even when others are offline.
- Platform Reliability: Ensure high availability, reliability, and uptime of all platform services (web, backend, and mobile) by proactively managing system health and responding to incidents swiftly.
- Observability & Alerting: Implement robust observability solutions – set up monitoring dashboards, logging, and real-time alerting across all systems (web applications, backend services, mobile API) using tools like Prometheus, Grafana, Datadog, AWS CloudWatch, etc.
- Performance & Cost Optimization: Continuously monitor AWS and related infrastructure performance. Optimize resource usage and configurations for improved performance and cost efficiency (e.g., right-sizing instances, caching improvements, query optimization).
- Asynchronous Collaboration: Work closely with product and engineering teams in an asynchronous manner. Document your insights, decisions, and progress clearly so team members in other timezones can follow along and contribute.
- Incident Management: Proactively identify and resolve production issues. Act as the first responder to any system incidents during your shift, performing root cause analysis and restoring service. Communicate incidents and fixes to the team, and update runbooks for future reference.
- Documentation & Playbooks: Develop and maintain internal SRE documentation, runbooks, and playbooks. Ensure that troubleshooting guides, deployment processes, and escalation protocols are well-documented and easy to follow for the entire engineering team.
Qualifications & Skills
- Experience: 4+ years in a Site Reliability Engineer, DevOps, or similar role, with a track record of maintaining and scaling web infrastructure.
- Observability Tools: Proficiency with monitoring and observability tools such as Prometheus, Grafana, Datadog, and AWS CloudWatch. You know how to instrument applications and set up alerts that catch issues early.
- Cloud & DevOps: Strong hands-on experience with Amazon Web Services (AWS) and managing cloud resources. Familiarity with MongoDB Atlas (managed MongoDB) and deployment platforms like Vercel. Comfortable automating infrastructure (Infrastructure as Code, CI/CD pipelines) and managing deployments.
- Tech Stack Familiarity: Exposure to modern web development stacks. Our environment includes Node.js/Python backends, Next.js frontends, Redis caching, and a Flutter mobile app. Direct coding in these is not mandatory, but understanding how these components work is a plus.
- CI/CD & Automation: Excellent grasp of CI/CD concepts and tools. Experience implementing build pipelines, continuous integration, and automated deployments. Knowledge of Docker, container orchestration, and version control workflows.
- Problem Solving: Strong analytical and problem-solving skills. Able to debug complex issues across distributed systems and find root causes. Experience with incident response and post-mortem analysis is highly valued.
- Communication & Autonomy: Outstanding communication skills in a remote, asynchronous setting. You can document your work and decisions clearly. Highly self-driven and able to make sound decisions independently, especially during the hours when other team members are offline.
Special Inquiries for the Hiring Process
The hiring process we have in mind:
- At least one technical interview (1.5 hours) with someone on the team.
- We may schedule additional technical interviews if the first one does not result in a clear yes.
- The technical interview will cover past experience, fundamental coding skills, computer science fundamentals, and system design skills as they relate to SRE work.
- Interview with the Head of Engineering.
- Interview with the CEO (if needed).
Specific details or unique aspects related to the position
- Independent Impact: Because this role covers hours when the Asia-based team is offline, you’ll often be the point person for urgent decisions. You should be comfortable making critical calls independently to keep systems running, with the trust and empowerment of the team behind you.
- Async Communication: Our workflow is primarily asynchronous. Outside of a weekly team meeting, you’ll communicate through tools like Slack, documentation, and pull requests. This means fewer interruptions and the freedom to structure your work, but it also requires discipline in keeping the team informed through writing.
- Define SRE Practices: As the dedicated SRE, you will play a key role in defining and refining Nas.io’s reliability and infrastructure practices. You’ll have the opportunity to influence tooling choices, establish best practices for monitoring/alerting, and shape incident management and response processes. Your work will lay the foundation for how we maintain and scale our systems reliably as we grow.
- Global Team & Culture: You’ll be joining a diverse, distributed team spanning Asia and other regions. We pride ourselves on a culture of mutual respect, continuous learning, and bias for action. Even though you’ll operate with a lot of autonomy, you’re never truly alone – the team is always a message away, and we make sure to celebrate successes and learn from failures together.
Our Company Values