Blog
All posts — newest first.
-
Observability and incident response — the SRE basics
A primer on the two operational disciplines every SRE team needs to run: observability (logs, metrics, traces) and incident response (roles, severities, blameless postmortems). Includes the practical shape of an incident and how AI is starting to absorb the lower rungs of both.
-
Toil and the 50% rule — what it is, how to measure it, and how to kill it
A primer on toil — the manual, repetitive, automatable work that quietly eats SRE teams. Covers Google's six-part definition, the 50% cap, how to measure toil honestly, and how the 2026 generation of AI agents changes the toil-elimination playbook.
-
SLI, SLO, SLA, and error budgets — the reliability contract explained
A primer on the four numbers every SRE team needs to agree on: Service Level Indicators, Objectives, Agreements, and the error budget that falls out of them. Includes concrete examples, the math behind 'nines,' and what the contract looks like once AI agents start contributing to the burn rate.
-
What is Site Reliability Engineering (SRE)?
A primer on Site Reliability Engineering — what SRE is, where it came from at Google, how it differs from DevOps and Platform Engineering, and the core principles that make it work. Includes a short note on what changes in 2026 as AI moves into the on-call seat.
-
What are vector embeddings?
A short primer on vector embeddings — the numerical representation that lets a computer treat 'the meaning of this text' as something it can search, cluster, and compare. Covers what an embedding actually is, how similarity works, why model choice matters more than retrieval quality, and the production failure modes you only see in evaluation.
-
What is function calling (tool use)?
A short primer on function calling — the mechanism that lets an LLM decide to invoke an external function and let your code do the actual work. Covers the JSON-schema contract, the request/response loop, parallel and forced tool calls, and why every production AI agent in 2026 is built on this primitive.
-
What is prompt caching?
A short primer on prompt caching — the LLM-provider feature that drops the cost of a repeated long prompt by 50–90% and the latency by half. Covers how prefix matching works, the TTL economics across Anthropic / OpenAI / Google, where caching helps and where it quietly does not, and the operational gotchas that determine whether your hit rate is 90% or 9%.
-
The CAP theorem in AI-native distributed systems
CAP didn't get repealed when LLMs showed up. But the costs of choosing C, A, or P shift when the datastore behind the system is a vector index, a context graph, or a model-served retrieval layer. A short revisit of the trade-offs, framed for teams building AI-enabled infrastructure.
-
NAS vs SAN for GPU workloads — what changed when AI showed up
The classical NAS-vs-SAN decision was about file vs block, ethernet vs fibre, and how much you wanted to pay. GPU training and inference rewrote the question. Here's how the calculus shifts when your storage has to keep an A100 or H100 cluster fed.
-
What is an AI agent? A primer for cloud engineers
A short primer on AI agents — the perceive-reason-act loop, what separates an agent from a one-shot LLM call, the classical agent types (reflex, model-based, goal-based, utility-based, learning) and how they map onto the agents running in modern SRE and platform tooling.
-
What is Model Context Protocol (MCP)?
A short primer on Model Context Protocol — the open standard that lets AI applications talk to tools and data sources through a uniform interface. Covers the host/client/server architecture, the data layer (JSON-RPC) and transport layer split, and why it matters for cloud and platform teams.
-
What is Retrieval-Augmented Generation (RAG)?
A short primer on Retrieval-Augmented Generation — the pattern that grounds an LLM's answer in documents you actually trust. Covers the indexing and serving paths, the role of the embedding model and vector index, and the failure modes that catch teams off guard in production.
-
Mental models for applying AI to infrastructure
Most writing about AI in infrastructure is tutorials. Tutorials answer how. Mental models answer whether. Here are seven I use as the front gate before any LLM goes near a production system — recoverability, reversibility, per-call economics, the autonomy ladder as a risk function, tools-not-chat, context as substrate, and identity that travels with the action.
-
Prompt engineering for SRE: patterns that actually work in production
Most prompt-engineering advice is written for chatbots. SRE workloads are different — the input is messy, the output has to be machine-readable, and there's no human to gracefully handle a wrong answer. Here are six patterns I've shipped to production for SRE LLM tools, and why each one earned its place.
-
The MCP gateway pattern: five jobs your agent runtime can't skip
Letting agents call MCP servers directly is the same mistake as letting microservices call each other without an API gateway. Here are the five jobs an MCP gateway has to do, and reproducible patterns for each — scope-token exchange, schema firewall, quarantine queue, provenance ledger, and a catalog/broker split.
-
Skills for AI agents that do SRE work
Most agent skills are chatbot prompts in disguise. The ones I just published are operator tools — opinionated, output-contracted, with mandatory discipline sections that say what the skill won't do. Three skills, portable across Claude Code, Claude Desktop, Codex CLI, and any markdown-prompt runtime.
-
Alert fatigue? Let AI triage.
How I built alert-explainer — an open-source service that sits between Alertmanager and your on-call routing and turns every Prometheus alert into a plain-English brief in 1–4 seconds for under a cent. Design, reliability patterns, and production tradeoffs.
-
When NOT to Use AI in Production SRE
Most AI-for-SRE writing tells you where AI helps. Here are seven places it actively hurts — and the operational rule of thumb I use to decide.
-
Building incident-scribe: Slack Thread to Incident Report with Claude
How I built an open-source tool that turns messy Slack incident threads into blameless, structured incident reports in under 30 seconds — design, reliability patterns, and production tradeoffs.
-
Why AI is the Next SRE Superpower
After 15 years in cloud infrastructure and SRE — including 8+ years building safety-critical systems at a global aviation-SaaS platform — here's why I believe AI is the most significant shift in how we operate systems since Kubernetes.
-
REST vs. GraphQL APIs – how to choose?
REST APIs have been around for a while now, while GraphQL is relatively new to the game. While both are used for data transfer - sending HTTP requests and receiving HTTP responses, they both have th
-
GraphQL for API Development
What is it? GraphQL is a query language for APIs and a runtime for fulfilling those queries with your existing data. It was originally created at Facebook in 2012 for describing the capabilities and r
-
What is Address Resolution Protocol (ARP)?
ARP is a communication protocol used for discovering the link-layer address associated with a given internet-layer address. E.g., finding the MAC (media access control) addresses associated with IPv4 addresses
-
The CAP theorem
Let's look at the different system design options (databases particularly) in detail: 1. Consistency and Availability over Partition tolerance. Here we prefer having some data and the same data for ev
-
The 7 layers of the ISO OSI model
The International Organization for Standardization came up with the Open Systems Interconnection (OSI) conceptual model which provides a standard for diverse computer systems to be able to communicate
-
NAS vs SAN – A brief comparison.
Network Attached Storage [NAS]: NAS is a specialized data storage device connected to a network, providing data access to other machines in the network over Ethernet. Its hardware, software, or sp
-
What is RAID?
RAID [Redundant Array of Independent Disks] is a way of adding redundancy to stored data, protecting it from disk failures. RAID makes it possible to use lower-priced disks in large
-
Amazon Leadership Principles – my thoughts.
Leadership principles are important for Amazon - when they hire, when they do evaluations on the job, etc. If you are preparing for an interview with Amazon, you should be expecting a lot of behavioral
-
Getting started with Ansible.
Ansible is an open-source tool that enables the automation, configuration, and orchestration of infrastructure. It fully embraces the concept of Infrastructure as Code. We can build out our entire sys
-
Introduction to YAML!
YAML Ain't Markup Language! It is designed with a focus on human-readable formatting. The creators of YAML wanted it to be easily readable by humans. It is portable, easily extendable, and suppor
-
Preparing for AWS Solution Architect (Professional) Certification.
Backstory: I cleared the AWS Solution Architect - Associate level certification in 2019. It wasn't too hard; I could clear it with a couple of years of on-the-job experience and almost a month of read