AI SRE Agent OS

Your Infrastructure.
Its Own Operator.

An autonomous AI agent that investigates incidents, writes and deploys tools, runs scheduled operations, and generates rich interactive reports—all from a single interface.

See How It Works · Get Started
Investigate with full tool access
Create tools on the fly
Schedule autonomous operations
Visualize everything interactively
01 / Coding CLI

Full Shell Power.
Zero Constraints.

A WebGL-accelerated terminal running a real PTY session in your browser. The agent has full shell access—it reads files, runs commands, executes scripts, pipes output, and navigates your codebase exactly like a senior engineer sitting at the keyboard.

Full interactive PTY via xterm.js with WebGL rendering
Persistent sessions—switch tabs, come back, output is still there
Run any command: git, docker, kubectl, curl, custom scripts
Multi-provider: Claude Code, Cursor CLI, or Codex—toggle in one click
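How might persistent sessions work under the hood? A minimal sketch, assuming the server keeps a capped scrollback buffer per session and replays it when a tab reconnects. The names (`SessionBuffer`, `MAX_LINES`) are hypothetical, not the product's actual API.

```javascript
// Illustrative: server-side scrollback so a reconnecting tab can
// replay terminal output it missed. Cap keeps memory bounded.
const MAX_LINES = 10000;

class SessionBuffer {
  constructor() {
    this.lines = [];
  }
  // Append a line of PTY output; evict the oldest past the cap.
  push(line) {
    this.lines.push(line);
    if (this.lines.length > MAX_LINES) {
      this.lines.splice(0, this.lines.length - MAX_LINES);
    }
  }
  // On reconnect, replay the full retained scrollback.
  replay() {
    return this.lines.join("\n");
  }
}

const session = new SessionBuffer();
session.push("$ kubectl get pods -n payments");
session.push("payment-api-7d4f8b-x2k9z   1/1   Running");
console.log(session.replay());
```

The real system attaches this to a PTY stream; the eviction policy is one reasonable choice, not the only one.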
omlet — production-api
omlet> The payment service is returning 504s. Investigate and fix it.
Starting investigation. Let me check the service status and recent logs.
$ kubectl get pods -n payments
$ kubectl logs payment-api-7d4f8b-x2k9z --tail=200
$ curl -s http://payment-api:8080/healthz
Found it. The connection pool to Postgres is exhausted—max connections hit at 14:23 UTC after a deploy introduced an N+1 query in the order lookup path.
✓ Identified root cause: N+1 query in OrderService.getByUser()
✓ Applied fix: batch query with WHERE user_id IN (...)
✓ Connection pool recovering: 2/50 → 47/50 available
✓ 504 rate: 23% → 0.1%
omlet> _
02 / Deep Investigation

Watch the Agent Think

Every file read, grep, command, and sub-agent call is visible in real time. The agent dynamically chains tools—spawning sub-agents for parallel investigation, correlating metrics, logs, and traces—and the system automatically extracts key insights, errors, and root causes from the full investigation trail.
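The "session intelligence" idea above can be sketched as a fold over the investigation trail: tally tool calls by name and surface events flagged as insights. The event shape (`{ type, tool, text }`) is an assumption for illustration, not the real schema.

```javascript
// Illustrative: summarize an investigation trail into tool counts
// and extracted insights, like the session panel shown above.
function summarizeTrail(events) {
  const toolCounts = {};
  const insights = [];
  for (const ev of events) {
    if (ev.type === "tool_call") {
      toolCounts[ev.tool] = (toolCounts[ev.tool] || 0) + 1;
    } else if (ev.type === "insight") {
      insights.push(ev.text);
    }
  }
  const totalCalls = Object.values(toolCounts).reduce((a, b) => a + b, 0);
  return { toolCounts, insights, totalCalls };
}

const trail = [
  { type: "tool_call", tool: "Bash" },
  { type: "tool_call", tool: "Bash" },
  { type: "tool_call", tool: "Read" },
  { type: "insight", text: "Redis cache hit ratio dropped from 94% to 12%" },
];
console.log(summarizeTrail(trail).totalCalls); // 3
```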

User
"Latency on checkout-service spiked 3x in the last hour. Find the root cause."
Tool Group — 14 calls
Bash ×6 Read ×4 Grep ×3 Sub-Agent ×1
Queried Prometheus metrics, tailed service logs, traced slow requests, spawned sub-agent for DB analysis
Insight Detected
"Redis cache hit ratio dropped from 94% to 12% at 14:47 UTC after config deploy removed TTL settings"
Tool Group — 8 calls
Edit ×3 Bash ×5
Restored TTL config, restarted cache, verified hit ratio recovery
Resolved
"Cache hit ratio restored to 91%. P99 latency: 2.4s → 180ms. All alerts cleared."
Session Intelligence
22 Tool Calls · 3 Outputs · 1 Sub-Agent · 0 Errors
Work Phases
Investigation ×2 Code Change ×1 Verification ×1
Root Cause
Config deploy at 14:47 UTC removed Redis TTL settings, causing cache miss storm. All requests fell through to Postgres, exhausting the connection pool and spiking P99 latency to 2.4s.
pagerduty-mcp
stdio
npx @pagerduty/mcp-server --api-key $PD_KEY
list_incidents acknowledge resolve create_note
k8s-operator
http
http://k8s-mcp.internal:8080/mcp
get_pods scale_deployment rollback get_events exec_pod
runbook-agent
sse
Custom serverless function: auto-triage alerts using runbook knowledge base
triage_alert search_runbooks suggest_fix
03 / Tool Creation

Builds Its Own Tools

The agent isn't limited to built-in commands. Connect any MCP server—PagerDuty, Kubernetes, Datadog, your internal APIs—or write custom serverless functions that the agent can invoke on demand. Test connectivity, discover available tools, and configure granular permissions, all from the UI.

Add MCP servers via form builder or JSON import—stdio, HTTP, SSE
Discover tools: see every function an MCP server exposes before deploying
Write serverless JS/AI functions with cron triggers and OTEL tracing
Granular permissions: allow kubectl get but block kubectl delete
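The granular-permission bullet can be sketched as an ordered prefix-rule list where the first match wins and the default is deny. The rule shape and matching logic here are illustrative assumptions, not the product's actual policy engine.

```javascript
// Illustrative: allow `kubectl get` but block `kubectl delete`.
// More specific rules are listed first; first match wins.
const rules = [
  { pattern: "kubectl delete", action: "deny" },
  { pattern: "kubectl", action: "allow" },
  { pattern: "git", action: "allow" },
];

// Default-deny: a command with no matching rule is blocked.
function isAllowed(command) {
  for (const rule of rules) {
    if (command.startsWith(rule.pattern)) {
      return rule.action === "allow";
    }
  }
  return false;
}

console.log(isAllowed("kubectl get pods -n payments"));   // true
console.log(isAllowed("kubectl delete pod payment-api")); // false
console.log(isAllowed("rm -rf /"));                       // false
```

Ordering matters: if the bare `kubectl` rule came first, the delete rule would never fire.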
04 / Task System

Scheduled Autonomous Ops

Define agent tasks in plain markdown. Schedule them on any cron cadence—hourly health checks, daily incident summaries, weekly security audits. The agent runs autonomously, resumes from previous sessions to maintain context, and auto-expires when the job is done.

tasks.md
## Task: Morning Health Check
**Prompt:** "Check all production services,
report any anomalies, and generate
an HTML status dashboard"
**Folder:** /ops/health-checks
**Schedule:** 0 8 * * *
**Timezone:** America/Los_Angeles
**SessionId:** a1b2c3d4
## Task: Security Scan
**Prompt:** "Audit dependencies for
CVEs and open fix PRs"
**Schedule:** weekly
**MaxRuns:** 12
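A tasks.md file like the one above could be parsed into task objects with a few regexes: `## Task:` headings start a task, `**Key:** value` lines set fields, and bare lines continue a multi-line value. This parser is a hedged sketch of the format shown, not the product's actual loader.

```javascript
// Illustrative: parse the tasks.md format into plain task objects.
// Field keys are lowercased (e.g. **MaxRuns:** becomes `maxruns`).
function parseTasks(markdown) {
  const tasks = [];
  let current = null;
  let lastKey = null;
  for (const raw of markdown.split("\n")) {
    const line = raw.trim();
    const heading = line.match(/^## Task: (.+)$/);
    const field = line.match(/^\*\*(\w+):\*\* (.+)$/);
    if (heading) {
      current = { name: heading[1] };
      tasks.push(current);
      lastKey = null;
    } else if (field && current) {
      lastKey = field[1].toLowerCase();
      current[lastKey] = field[2];
    } else if (line && current && lastKey) {
      current[lastKey] += " " + line; // continuation of a wrapped value
    }
  }
  return tasks;
}

const md = `## Task: Security Scan
**Prompt:** "Audit dependencies for
CVEs and open fix PRs"
**Schedule:** weekly
**MaxRuns:** 12`;

console.log(parseTasks(md)[0].schedule); // "weekly"
```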
Morning Health Check
Running
⏱ Started 2m ago ⚙ 8 tool calls ↻ Run #47
Security Scan
Next: Mon 9am
↻ 4 of 12 runs ⏱ Last: 3 days ago
Nightly DB Backup Verify
Completed
✓ 3m 12s ⚙ 14 tool calls ↻ Run #182
Incident Postmortem Generator
Trigger: on-alert
↻ Event-driven ⏱ Last: 12 hours ago
05 / Interactive Views

Reports That Come Alive

The agent doesn't just output text—it generates full interactive HTML dashboards, charts, and reports rendered live in a sandboxed browser view. Share them with expiring public links, edit them in a built-in code editor, or let the agent iterate on the design.

Agent-generated HTML/JS/CSS rendered live in sandboxed iframe
WebGL and Canvas for high-performance data visualization
Built-in CodeMirror editor—tweak the agent's output directly
Public share links with configurable expiration (1h to 30 days)
Spreadsheet viewer for CSV/Excel, image viewer, syntax-highlighted code
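Expiring share links can be modeled as an absolute expiry timestamp clamped to the 1-hour-to-30-day range the UI offers. `createShareLink` and `isExpired` are hypothetical names for illustration.

```javascript
// Illustrative: share links carry an expiry timestamp; requested TTLs
// are clamped to the advertised 1h..30d range.
const HOUR = 60 * 60 * 1000;
const MIN_TTL = 1 * HOUR;
const MAX_TTL = 30 * 24 * HOUR;

function createShareLink(viewId, ttlMs, now = Date.now()) {
  const ttl = Math.min(Math.max(ttlMs, MIN_TTL), MAX_TTL);
  return { viewId, expiresAt: now + ttl };
}

function isExpired(link, now = Date.now()) {
  return now >= link.expiresAt;
}

const link = createShareLink("infra-health-2026-02-19", 24 * HOUR, 0);
console.log(isExpired(link, 23 * HOUR)); // false
console.log(isExpired(link, 25 * HOUR)); // true
```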
omlet.internal/views/infra-health-2026-02-19.html
99.97% Uptime · 142ms P50 Latency · 3 Active Alerts
Request Rate (24h)
06 / View Modes

Three Ways to See the Work

Every session can be viewed in three modes: full detail for engineers, a pipeline timeline for investigations, and an executive summary for stakeholders. Switch instantly—same session, different lens.
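One way to think about the three modes: a single session object rendered through different lenses. A sketch under assumed field names (`events`, `milestone`, `label`), purely to illustrate the one-session-many-views idea.

```javascript
// Illustrative: the same session, projected three ways.
// "summary" keeps totals, "pipeline" keeps milestones, "list" keeps all.
function renderSession(session, mode) {
  switch (mode) {
    case "summary":
      return {
        tools: session.events.filter((e) => e.type === "tool").length,
        errors: session.events.filter((e) => e.type === "error").length,
      };
    case "pipeline":
      return session.events.filter((e) => e.milestone).map((e) => e.label);
    default: // "list": full detail for engineers
      return session.events;
  }
}

const session = {
  events: [
    { type: "tool", label: "Bash: kubectl top pods -n checkout" },
    { type: "tool", label: "Edit: redis-config.yaml", milestone: true },
    { type: "error", label: "grep: no matches in app.log" },
  ],
};
console.log(renderSession(session, "summary")); // { tools: 2, errors: 1 }
```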

List
Pipeline
Summary
Focus Mode
Detail
Why is checkout-service slow?
Reading checkout-service/config.yaml
Running: kubectl top pods -n checkout
Searching for "timeout" in logs
Redis cache TTLs were removed in the last deploy. Cache hit ratio dropped from 94% to 12%, causing all requests to hit Postgres directly.
Editing redis-config.yaml
Fixed. TTLs restored, cache recovering. P99 dropping.
Pipeline
Timeline
User prompt
Checkout latency investigation
14 tool calls
Bash ×6, Read ×4, Grep ×3, Task ×1
Insight
Cache TTL removal caused miss storm
8 tool calls
Edit ×3, Bash ×5
Resolved
P99: 2.4s → 180ms. All alerts cleared.
Summary
Executive
22 Tools · 3 Outputs · 0 Errors
Top Tools
Bash ×11 Read ×4 Grep ×3 Edit ×3
Phases
Investigation Code Change Verification
Root Cause
Config deploy removed Redis TTL. Cache miss rate spiked, exhausting DB pool.

Put Your Infrastructure
on Autopilot

An AI agent that doesn't just alert you—it investigates, fixes, reports, and learns. Start running autonomous SRE operations today.