David Cockson

Hope is not a strategy.

Systems thinker with a focus on identifying constraints and designing robust controls within complex environments. With 8 years of experience analyzing high-stakes regulatory systems (AML, gambling, and data protection), I bring a T-shaped generalist perspective to technical problem-solving. I am now applying this analytical rigor to hands-on engineering, building observable and resilient infrastructure across cloud and AI systems.

Technical Projects

control — Self-Hosted LLM Platform [case study]

  • Solo-built v2 in 10 days, spec → live: 12 Docker services across 2 hosts plus cloud, 100% IaC (Terraform + Ansible, S3-backed state), 518 pytest tests at cutover
  • Crash-safe filesystem job queue via atomic shutil.move transitions (_queue → _active → _completed) — no broker, no message loss; worker re-queues stranded jobs on restart
  • End-to-end SSE streaming from FastAPI through the worker to a TypeScript React 18 UI; hybrid RAG over Qdrant vectors + a hand-rolled Neo4j knowledge graph
  • Multi-machine Ollama routing (Contabo VPS + homelab GPU) with explicit cloud fallback (Groq, Gemini, Anthropic) via FastMCP; routes by model: field
  • Zero open inbound ports — Cloudflare Tunnel outbound-only + Tailscale mesh, secrets resolved at runtime from Infisical; full OpenTelemetry → Grafana Cloud (Loki/Mimir/Tempo) telemetry with Discord alerting

maps to → FastAPI, React/TypeScript, SSE, Terraform, Ansible, hybrid RAG, Qdrant, Neo4j, MCP, Cloudflare Zero Trust, OpenTelemetry, Grafana
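
A minimal sketch of the queue transitions described above, assuming all three state directories sit on one filesystem so each move is a single rename. The function names (init_queue, claim_next, complete, requeue_stranded) and the filename-ordered claim policy are illustrative assumptions, not the project's actual code.

```python
import shutil
from pathlib import Path
from typing import List, Optional

STATES = ("_queue", "_active", "_completed")


def init_queue(root: Path) -> None:
    """Create the three state directories if they are missing."""
    for state in STATES:
        (root / state).mkdir(parents=True, exist_ok=True)


def claim_next(root: Path) -> Optional[Path]:
    """Move the first queued job (by filename) into _active and return it."""
    for job in sorted((root / "_queue").iterdir()):
        target = root / "_active" / job.name
        try:
            # shutil.move tries os.rename first, which is atomic when source
            # and destination are on the same filesystem (POSIX).
            shutil.move(str(job), str(target))
        except FileNotFoundError:
            continue  # another worker claimed this file first
        return target
    return None


def complete(root: Path, active_job: Path) -> Path:
    """Move a finished job from _active to _completed."""
    target = root / "_completed" / active_job.name
    shutil.move(str(active_job), str(target))
    return target


def requeue_stranded(root: Path) -> List[Path]:
    """On worker start, push anything left in _active back to _queue."""
    recovered = []
    for job in sorted((root / "_active").iterdir()):
        target = root / "_queue" / job.name
        shutil.move(str(job), str(target))
        recovered.append(target)
    return recovered
```

Calling requeue_stranded once at worker start-up, before the first claim_next, is what lets a job that was mid-flight during a crash run again. This sketch assumes a single worker at start-up; with several workers the recovery step would need an ownership check.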

EvalUI — Dual-Backend LLM Observability [live] [repo]

  • Next.js 16 dashboard on Vercel fanning a single OpenTelemetry source out to Langfuse and Arize Phoenix, normalised to a shared 4-stage blueprint and raced side-by-side
  • Independent Claude Haiku 4.5 judge scoring Gemini 2.5 Flash generations on DeepEval Faithfulness, Contextual Precision, Answer Relevancy, and Hallucination
  • ISR (revalidate=60) + tag-based cache invalidation to stay inside Hobby-tier rate limits; a <canvas> latency replay driven by real span timings
  • Built solo in a single day; deployed behind Cloudflare Zero Trust

maps to → Next.js 16, OpenTelemetry, Langfuse, Arize Phoenix, DeepEval, LLM-as-judge, Vercel, TypeScript

EARS Specs — VS Code Extension [marketplace] [repo]

  • Published extension for writing requirements in EARS notation for spec-driven development — live on the VS Code Marketplace and Open VSX (installs in VS Code, Cursor, VSCodium)
  • Syntax highlighting, an auto-classifying sidebar across the 5 EARS archetypes, a spec scaffolder, and Tab-completion snippets; JavaScript, ~30 KB, CI release pipeline via GitHub Actions

maps to → JavaScript, VS Code extension API, CI/CD, GitHub Actions, spec-driven development
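
For illustration, a rough sketch of how a sentence could be sorted into the five EARS archetypes by its leading keyword. The extension itself is written in JavaScript; this Python version, the classify function, and the "not-ears" label are assumptions made for the example, and it ignores compound requirements that combine several keywords.

```python
import re

# Leading keyword -> archetype, following the standard EARS templates.
_KEYWORDS = (
    ("while", "state-driven"),
    ("when", "event-driven"),
    ("where", "optional-feature"),
    ("if", "unwanted-behaviour"),
)


def classify(requirement: str) -> str:
    """Return the EARS archetype for one requirement sentence."""
    text = requirement.strip().lower()
    # Every EARS template contains "the <system> shall <response>".
    if " shall " not in f" {text} ":
        return "not-ears"
    for keyword, archetype in _KEYWORDS:
        if re.match(rf"{keyword}\b", text):
            return archetype
    # No leading keyword: the requirement always applies.
    return "ubiquitous"
```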

Double Diamond — VS Code Extension [marketplace]

  • Published extension bringing the four-phase Double Diamond design process into the editor: an idea state machine, a Kanban webview, and Obsidian library export

maps to → TypeScript, VS Code extension API, webviews, design-process tooling

Terminalz — Terminal Multiplexer for AI Agents

  • Desktop multiplexer for running multiple coding agents at once: a cover-flow layout that keeps every session live — Tauri 2 / Rust, xterm.js v6, portable-pty, TypeScript
  • Process-type detection via /proc colour-codes Claude Code, Gemini CLI, and SSH panes at a glance; built EARS-spec-first with dedicated QA passes

maps to → Rust, Tauri 2, TypeScript, xterm.js, systems programming, IPC

MapIt + MappitHills — Geospatial Rendering

  • MapIt: Python CLI + web app rendering OpenStreetMap data (Overpass API) as animated SVG/HTML across 4 aesthetic modes including laser/G-code output; Overpass caching/retry, SSE progress streaming, result caching, Docker, 105 tests — built, deployed, and documented in a day
  • MappitHills: GPX walking-route renderer over real 3D terrain (MapLibre-GL + Mapzen Terrarium tiles), gradient-coloured by ascent rate with a vertical-exaggeration slider; Flask backend, Docker; built solo in half a day, the day after MapIt

maps to → Python, geospatial, Overpass API, SVG/Canvas, MapLibre-GL, Flask, Docker
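
A sketch of one way to colour route segments by ascent rate, assuming "ascent rate" means rise over horizontal run between consecutive GPX points. The function names, the 25% clamp, and the green-to-red ramp are illustrative choices, not taken from MappitHills.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon pairs."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def grade(p, q):
    """Rise over run between two (lat, lon, elevation_m) points."""
    run = haversine_m(p[0], p[1], q[0], q[1])
    return 0.0 if run == 0 else (q[2] - p[2]) / run


def grade_to_hex(g, max_grade=0.25):
    """Green on flat or descending ground, red at max_grade and above."""
    t = min(max(g / max_grade, 0.0), 1.0)
    red, green = round(255 * t), round(255 * (1 - t))
    return f"#{red:02x}{green:02x}00"


def colour_segments(points):
    """One hex colour per segment between consecutive track points."""
    return [grade_to_hex(grade(p, q)) for p, q in zip(points, points[1:])]
```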

vault-runner — Self-Hosted LLM Job Runner

  • File-driven job queue: Markdown files with YAML frontmatter move through _queue → _active → _completed — Syncthing propagates between laptop and VPS, results land in Obsidian automatically
  • Multi-machine model routing: Contabo VPS (Qwen2.5 14B local) + Windows gaming laptop (Gemma 4:26b via authenticated Cloudflare Tunnel) — jobs route by model: field; cloud fallback via Groq, Gemini, and OpenRouter
  • Five job types: text, vision, staged checklist, chain pipeline, chain_planner (LLM generates and executes its own step sequence)
  • Chain actions: Tavily web search, URL fetch + defuddle, GitLab push + MR creation, CI pipeline polling
  • FastAPI + HTMX web UI with live SSE output streaming, job cancellation, template picker, and vault search, served via Cloudflare Tunnel
  • MCP / MemPalace semantic memory: every past job output is indexed; new jobs inject relevant context with one YAML flag (use_memory: true)
  • 76+ pytest tests, GitLab CI/CD pipeline (Bandit SAST, pip-audit), auto-deploy on merge
  • Full observability: OpenTelemetry spans per job and per LLM call → Tempo; Langfuse @observe() decorators for LLM-specific tracing

maps to → Python, FastAPI, HTMX, SSE, LLM orchestration, multi-machine routing, MCP, OpenTelemetry, Langfuse, pytest, GitLab CI/CD, agentic systems
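
A simplified sketch of routing a job by its model: field with cloud fallback. The model tags, runner names, and the idea of passing in a set of currently healthy runners are hypothetical, and the tiny frontmatter reader only handles flat key: value pairs where a real YAML parser would normally be used.

```python
# Hypothetical model tags and runner names, for illustration only.
ROUTES = {
    "qwen2.5:14b": "contabo-ollama",
    "gemma-local": "laptop-ollama",
}
CLOUD_FALLBACK = ("groq", "gemini", "openrouter")


def read_frontmatter(markdown: str) -> dict:
    """Parse flat `key: value` pairs between the leading '---' fences."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing fence, so treat the file as having no frontmatter


def pick_runner(markdown: str, healthy: set) -> str:
    """Prefer the mapped local runner, else the first healthy cloud provider."""
    model = read_frontmatter(markdown).get("model", "")
    local = ROUTES.get(model)
    if local in healthy:
        return local
    for provider in CLOUD_FALLBACK:
        if provider in healthy:
            return provider
    raise RuntimeError(f"no runner available for model {model!r}")
```

Keeping the routing table as plain data means adding a machine is a one-line change, which is the main appeal of routing on a single frontmatter field.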

Monitoring & Observability

  • Dual observability pipeline: OpenTelemetry → Tempo (infrastructure traces, job duration, token counts) and Langfuse (LLM-specific traces — per-model token usage, chain sessions, per-call latency)
  • Grafana Alloy → Grafana Cloud (eu-west-2) for metrics shipping; Grafana dashboards across homelab and Contabo VPS
  • Full stack: Prometheus + Node Exporter + Grafana across homelab and Contabo VPS
  • Diagnosed a silent cAdvisor failure (cgroup v2 + non-standard Docker root) and pivoted to Telegraf via the Docker socket API
  • Published Grafana community dashboard (ID 25012)
  • Discord alerting via webhook — job completion and failure notifications from the LLM runner

maps to → monitoring, observability, OpenTelemetry, Tempo, Langfuse, Prometheus, Grafana, incident diagnosis, distributed tracing
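
A small sketch of the Discord webhook alerting mentioned above, using only the standard library. The message layout, the build_alert and send_alert names, and the User-Agent string are assumptions; the webhook URL would come from a secret store rather than source code.

```python
import json
import urllib.request

DISCORD_CONTENT_LIMIT = 2000  # Discord caps message content at 2000 characters


def build_alert(job_id: str, status: str, detail: str = "") -> dict:
    """Build the JSON body for a job completion or failure message."""
    text = f"[{status.upper()}] job {job_id}"
    if detail:
        text += f": {detail}"
    return {"content": text[:DISCORD_CONTENT_LIMIT]}


def send_alert(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "User-Agent": "job-alerts/0.1",  # hypothetical client name
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating payload construction from the network call keeps the formatting testable without touching Discord.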

Linux & Containerisation

  • Proxmox hypervisor running Ubuntu VM with 20+ Docker containers
  • Services: Plex, Immich, Gitea, SearXNG, Nginx Proxy Manager, Portainer, Syncthing, Langfuse stack (ClickHouse, MinIO, PostgreSQL)
  • Multi-stage Docker builds (Node/Vite → nginx:alpine) for project deployments
  • systemd service authoring for production services: LLM runner poller (runbook.py) and web UI (web.py) with auto-restart; Tailscale metrics proxy
  • Syncthing as a file-sync transport layer — propagates job queue files between laptop and VPS, results land in Obsidian automatically
  • Linux Mint as primary dev environment

maps to → Linux, Docker, containerisation, virtualisation, systemd, process management, Syncthing

Cloud Infrastructure

  • AWS: self-taught runbook methodology through 23 reps — click-ops → CLI → scripted automation → self-healing fleet (3 EC2 instances)
  • Static and containerised app deployments to EC2 via multi-stage Docker builds (Node/Vite → nginx:alpine)
  • Contabo VPS (Ubuntu 24.04, 8 vCPU, 24GB): production LLM runner stack — Ollama Qwen2.5 14B, FastAPI web UI, OTel agent, systemd-managed services, Cloudflare Tunnel (zero-trust public endpoints without open ports)
  • Multi-machine Ollama routing: Contabo (Qwen2.5 14B) + Windows gaming laptop (Gemma 4:26b via authenticated Cloudflare Tunnel) — jobs route automatically by model field
  • Cloud API routing: Groq (Llama 3.3 70B), Gemini 2.5 Flash, OpenRouter (Qwen3 32B) as fallback/specialist runners
  • Multi-provider experience: AWS, Contabo, Cloudflare, and GCP (via Skills Boost)

maps to → AWS, EC2, CLI, multi-cloud, VPS management, Cloudflare Tunnel, zero-trust networking, multi-machine orchestration

Python Development

  • vault-runner: production LLM job runner — FastAPI + HTMX web UI with live SSE streaming, file-based queue (no broker), multi-machine model routing, job cancellation
  • Multi-job-type system: text, vision, staged checklist, chain pipeline, chain_planner (LLM generates and executes its own steps)
  • Chain actions: Tavily web search, URL fetch + defuddle, GitLab push + MR creation, CI pipeline polling
  • 76+ pytest tests with GitLab CI/CD pipeline (Bandit SAST, pip-audit); auto-deploy on merge
  • OpenTelemetry instrumentation — every job is a trace, every LLM call is a span; Langfuse @observe() decorators for LLM-specific tracing
  • Multi-provider LLM client: Ollama (local), Groq, Gemini, OpenRouter — unified routing via config
  • MCP client integration (MemPalace) — semantic memory over all past job outputs injected into prompts
  • Code kata system: self-designed, multi-domain reps (AWS, Linux, networking); scripting across homelab and cloud automation

maps to → Python, FastAPI, HTMX, SSE, pytest, CI/CD, OpenTelemetry, LLM orchestration, MCP, agentic systems, automation

Terraform & IaC

  • Runbook self-learning methodology: 4 reps completed, committed to pinned infra-practice repo
  • Progression: single EC2 → User Data + IAM + CloudWatch → ALB + HTTPS → full production stack (ASG + Launch Template) in a single terraform apply
  • GCP Skills Boost badges (Terraform, Kubernetes, Google Cloud Network)

maps to → Terraform, IaC, HCL, ALB, ASG, GCP exposure

Networking & Access Control

  • Tailscale mesh networking across homelab, laptop, phone, and Contabo VPS
  • Nginx Proxy Manager as reverse proxy for all services
  • DNS management across multiple custom domains (Cloudflare)
  • Labelled physical network infrastructure with structured cabling and surge protection
  • Tailscale metrics proxy: custom Python service exposing network metrics to Prometheus

maps to → networking, DNS, reverse proxy, VPN/zero-trust, infrastructure documentation

Volunteer — Geeks for Social Change (GFSC)

  • Drafted initial observability strategy (Uptime Kuma → Node Exporter → Grafana Cloud) collaboratively in HedgeDoc
  • Iterated infrastructure review based on community maintainer feedback
  • First external contribution: committed final technical proposals and system documentation to the organisation's public GitHub repo

maps to → infrastructure mapping, Git collaboration, technical documentation, systems discovery, open-source community

Professional Experience

Compliance Manager — Gambling Commission of Great Britain

July 2024 – Present

  • Assess regulated gambling companies, both in person (casinos, betting shops, bingo, AGCs) and online (remote casinos, sportsbooks, bingo, and B2B software development companies)
  • Led all 2025 Software Development Licence assessments, including ISO 27001 reviews, Change Management evaluations, policy and procedure assessments, and governance interviews with C-level executives (CTO/CEO), Heads of Departments and SMEs
  • Managed end-to-end incident response for complex software failures, from real-time triage to root-cause investigation; evaluated active incident reports and sensitive disclosures, gathered critical system context, and escalated high-severity findings to executive committees to enforce necessary technical remediations
  • Supported cross-functional technical initiatives and unplanned operational workstreams, managing highly sensitive data escalations and critical internal risk disclosures
  • Utilise Microsoft Copilot, SharePoint and proprietary software to streamline regulatory reporting and maintain audit trails for complex assessment workflows

Regulatory Compliance Assurance Manager — William Hill (888 Holdings)

May 2023 – December 2023

  • Recruited, trained and directed an Assurance team across Marketing Compliance, Technical Compliance and VIP/HVC schemes for 3 business units (UK & Ireland, International, US) spanning 22 regulated markets
  • Established procedural frameworks and authored the Risk Matrix for the Group Assurance department; planned and executed all non-AML and SG assurance testing for 2023
  • Coordinated operations for a 5-person international data and monitoring team and a 19-person testing team (Manila), providing task allocation and performance feedback
  • Integrated Assurance into group regulatory and controls mapping with external entities including KPMG; authored the framework and supported delivery of RTS testing framework automation and the GB annual assurance statement
  • Developed and delivered specialist training (Assurance Testing, GB Regulatory Actions, 3rd Party Risk Analysis, Marketing Compliance)

Compliance Officer — Allwyn UK (National Lottery)

September 2022 – March 2023

  • Authored technical Business Requirement Documents (BRDs) and facilitated critical operational handover workshops with legacy service providers
  • Architected and deployed the foundational system monitoring frameworks and operational control registers for the platform transition
  • Advised Marketing, Retail and Product Development teams on technical implementation and regulatory alignment as a Subject Matter Expert
  • Managed tier-one technology vendor relationships (Scientific Games) and coordinated international infrastructure efficiency (Net Zero) initiatives
  • Directed a joint Machine Learning development project with Oxford University, bridging the gap between academic AI models and commercial product deployment

Previous Roles