Blog

All posts — newest first.

May 24, 2026 | SRE, Observability, Incident Response

Observability and incident response — the SRE basics

A primer on the two operational disciplines every SRE team needs to run: observability (logs, metrics, traces) and incident response (roles, severities, blameless postmortems). Includes the practical shape of an incident and how AI is starting to absorb the lower rungs of both.
May 23, 2026 | SRE, Reliability, Fundamentals

Toil and the 50% rule — what it is, how to measure it, and how to kill it

A primer on toil — the manual, repetitive, automatable work that quietly eats SRE teams. Covers Google's six-part definition, the 50% cap, how to measure toil honestly, and how the 2026 generation of AI agents changes the toil-elimination playbook.
May 22, 2026 | SRE, Reliability, Fundamentals

SLI, SLO, SLA, and error budgets — the reliability contract explained

A primer on the four numbers every SRE team needs to agree on: Service Level Indicators, Objectives, Agreements, and the error budget that falls out of them. Includes concrete examples, the math behind 'nines,' and what the contract looks like once AI agents start contributing to the burn rate.
May 21, 2026 | SRE, Platform Engineering, Fundamentals

What is Site Reliability Engineering (SRE)?

A primer on Site Reliability Engineering — what SRE is, where it came from at Google, how it differs from DevOps and Platform Engineering, and the core principles that make it work. Includes a short note on what changes in 2026 as AI moves into the on-call seat.
May 15, 2026 | AI, Architecture, Platform Engineering

What are vector embeddings?

A short primer on vector embeddings — the numerical representation that lets a computer treat 'the meaning of this text' as something it can search, cluster, and compare. Covers what an embedding actually is, how similarity works, why model choice matters more than retrieval quality, and the production failure modes you only see in evaluation.
May 15, 2026 | AI, Architecture, Platform Engineering

What is function calling (tool use)?

A short primer on function calling — the mechanism that lets an LLM decide to invoke an external function and let your code do the actual work. Covers the JSON-schema contract, the request/response loop, parallel and forced tool calls, and why every production AI agent in 2026 is built on this primitive.
May 15, 2026 | AI, Architecture, Platform Engineering

What is prompt caching?

A short primer on prompt caching — the LLM-provider feature that drops the cost of a repeated long prompt by 50–90% and the latency by half. Covers how prefix matching works, the TTL economics across Anthropic / OpenAI / Google, where caching helps and where it quietly does not, and the operational gotchas that determine whether your hit rate is 90% or 9%.
May 12, 2026 | AI, Architecture, Distributed Systems

The CAP theorem in AI-native distributed systems

CAP didn't get repealed when LLMs showed up. But the costs of choosing C, A, or P shift when the datastore behind the system is a vector index, a context graph, or a model-served retrieval layer. A short revisit of the trade-offs, framed for teams building AI-enabled infrastructure.
May 12, 2026 | AI, Storage, Infrastructure

NAS vs SAN for GPU workloads — what changed when AI showed up

The classical NAS-vs-SAN decision was about file vs block, Ethernet vs Fibre Channel, and how much you wanted to pay. GPU training and inference rewrote the question. Here's how the calculus shifts when your storage has to keep an A100 or H100 cluster fed.
May 11, 2026 | AI, Agents, Platform Engineering

What is an AI agent? A primer for cloud engineers

A short primer on AI agents — the perceive-reason-act loop, what separates an agent from a one-shot LLM call, the classical agent types (reflex, model-based, goal-based, utility-based, learning) and how they map onto the agents running in modern SRE and platform tooling.
May 11, 2026 | AI, MCP, Platform Engineering

What is Model Context Protocol (MCP)?

A short primer on Model Context Protocol — the open standard that lets AI applications talk to tools and data sources through a uniform interface. Covers the host/client/server architecture, the data layer (JSON-RPC) and transport layer split, and why it matters for cloud and platform teams.
May 11, 2026 | AI, Architecture, Platform Engineering

What is Retrieval-Augmented Generation (RAG)?

A short primer on Retrieval-Augmented Generation — the pattern that grounds an LLM's answer in documents you actually trust. Covers the indexing and serving paths, the role of the embedding model and vector index, and the failure modes that catch teams off guard in production.
May 6, 2026 | AI, SRE, Architecture

Mental models for applying AI to infrastructure

Most writing about AI in infrastructure is tutorials. Tutorials answer how. Mental models answer whether. Here are seven I use as the front gate before any LLM goes near a production system — recoverability, reversibility, per-call economics, the autonomy ladder as a risk function, tools-not-chat, context as substrate, and identity that travels with the action.
May 5, 2026 | AI, SRE, Prompt Engineering

Prompt engineering for SRE: patterns that actually work in production

Most prompt-engineering advice is written for chatbots. SRE workloads are different — the input is messy, the output has to be machine-readable, and there's no human to gracefully handle a wrong answer. Here are six patterns I've shipped to production for SRE LLM tools, and why each one earned its place.
May 2, 2026 | AI, MCP, Platform Engineering

The MCP gateway pattern: five jobs your agent runtime can't skip

Letting agents call MCP servers directly is the same mistake as letting microservices call each other without an API gateway. Here are the five jobs an MCP gateway has to do, and reproducible patterns for each — scope-token exchange, schema firewall, quarantine queue, provenance ledger, and a catalog/broker split.
Apr 30, 2026 | AI, SRE, Open Source

Skills for AI agents that do SRE work

Most agent skills are chatbot prompts in disguise. The ones I just published are operator tools — opinionated, output-contracted, with mandatory discipline sections that say what the skill won't do. Three skills, portable across Claude Code, Claude Desktop, Codex CLI, and any markdown-prompt runtime.
Apr 29, 2026 | AI, SRE, Open Source

Alert fatigue? Let AI triage.

How I built alert-explainer — an open-source service that sits between Alertmanager and your on-call routing and turns every Prometheus alert into a plain-English brief in 1–4 seconds for under a cent. Design, reliability patterns, and production tradeoffs.
Apr 25, 2026 | AI, SRE, Reliability

When NOT to Use AI in Production SRE

Most AI-for-SRE writing tells you where AI helps. Here are seven places it actively hurts — and the operational rule of thumb I use to decide.
Apr 21, 2026 | AI, SRE, Open Source

Building incident-scribe: Slack Thread to Incident Report with Claude

How I built an open-source tool that turns messy Slack incident threads into blameless, structured incident reports in under 30 seconds — design, reliability patterns, and production tradeoffs.
Apr 20, 2026 | AI, SRE, Cloud

Why AI is the Next SRE Superpower

After 15 years in cloud infrastructure and SRE — including 8+ years building safety-critical systems at a global aviation-SaaS platform — here's why I believe AI is the most significant shift in how we operate systems since Kubernetes.
Dec 8, 2022 | API, Dev

REST vs. GraphQL APIs – how to choose?

REST APIs have been around for a while now, while GraphQL is relatively new to the game. While both are used for data transfer - sending HTTP requests and receiving HTTP responses, they both have th
Nov 19, 2022 | API, Fundamentals

GraphQL for API Development

What is it? GraphQL is a query language for APIs and a runtime for fulfilling those queries with your existing data. It was originally created at Facebook in 2012 for describing the capabilities and r
Nov 18, 2022 | General, Networks

What is Address Resolution Protocol (ARP)?

ARP is a communication protocol used for discovering the link-layer address associated with a given internet-layer address. E.g., finding the MAC (media access control) addresses associated with IPv4 addresses
Jun 2, 2021 | Fundamentals, General, Storage

The CAP theorem

Let's look at the different system design options (databases particularly) in detail: 1. Consistency and Availability over Partition tolerance Here we prefer having some data and same data for ev
Jun 1, 2021 | Fundamentals, General, Networks

The 7 layers of ISO OSI model

The International Organization for Standardization came up with the Open Systems Interconnection (OSI) conceptual model which provides a standard for diverse computer systems to be able to communicate
May 30, 2021 | General, Linux, SRE

NAS vs SAN – A brief comparison.

Network Attached Storage [NAS] NAS is a specialized data storage device connected to a network providing data access to other machines in the network over ethernet. Its hardware, software, or sp
May 30, 2021 | Linux, Storage

What is RAID?

RAID [Redundant Array of Independent Disks] is a way of providing redundancy to the stored data, providing protection from disk failures. RAID makes it possible to use lower-priced disks in large
May 29, 2021 | AWS, General, Leadership

Amazon Leadership Principles – my thoughts.

Leadership principles are important for Amazon - when they hire, when they do evaluations on the job, etc. If you are preparing for an interview with Amazon, you should be expecting a lot of behavioral
May 26, 2021 | Ansible, General

Getting started with Ansible.

Ansible is an open-source tool that enables the automation, configuration, and orchestration of infrastructure. It fully embraces the concept of Infrastructure as Code. We can build out our entire sys
May 10, 2021 | General

Introduction to YAML!

YAML Ain't Markup Language! It is designed with a focus on human-readable formatting. The creators of YAML wanted it to be easily readable by humans. It is portable, easily extendable, and suppor
Mar 30, 2021 | AWS, Certification

Preparing for AWS Solution Architect (Professional) Certification.

Backstory I cleared the AWS Solution Architect - Associate level certification in 2019. It wasn't too hard, I could clear it with a couple of years of on-job experience and almost a month of read
