AI Systems Engineer (3+ years), Onsite
Minimum qualifications:
Bachelor's degree in Computer Science, Engineering, Artificial Intelligence, or a related field, or equivalent practical experience.
3+ years of backend, platform, infrastructure, AI engineering, or distributed systems experience.
Experience designing and operating production services, APIs, databases, queues, or workflow systems.
Comfort with reviewing architecture, debugging production issues, writing tests, and documenting system behavior.
Interest in building AI infrastructure that is reliable, observable, secure, and scalable.
Job Description:
As an AI Systems Engineer at Aiotrix, you will work on the core technical systems that make AI products dependable. This role focuses on runtime behavior, state, memory, tool execution, traces, evaluations, security, and reliability.
You will help build the infrastructure layer behind ART and Aiotrix AI systems, where AI workflows must run predictably, recover from failures, and remain observable in production.
Responsibilities:
Design AI runtime components for tool execution, memory, state management, queues, retries, and long-running workflows.
Build observability systems for AI workflows, including traces, logs, tool calls, latency, cost, outputs, and failures.
Develop evaluation and regression systems to measure quality, reliability, and behavior across AI workflows.
Implement secure execution boundaries, permission checks, rate limits, and safety controls.
Work with vector databases, retrieval systems, event streams, caches, and backend services.
Optimize performance, reliability, scalability, and operational cost of AI infrastructure.
Collaborate with product and AI engineers to expose system capabilities through simple developer and user workflows.
Preferred qualifications:
Strong backend engineering experience with Python, Go, TypeScript, Node.js, or similar technologies.
Experience with distributed systems concepts such as queues, retries, idempotency, concurrency, state, and observability.
Hands-on exposure to LLM APIs, RAG, vector databases, tool calling, or AI workflow orchestration.
Experience with tracing, monitoring, logging, metrics, testing, and debugging production systems.
Understanding of security, permissions, authentication, and safe execution patterns.
Ability to reason deeply about reliability, failure modes, and platform behavior.
