Available for Work

Based in La Jolla, California

Reliable AI systems, recovery, and infrastructure

Reliable AI infrastructure and dependable systems.

I am Angting Cai, a software engineer and systems researcher working on LLM infrastructure, recovery, distributed systems, and verification.

Research

PILOT

Recovery simulation for distributed systems, accepted at NSDI '26.

Systems

Phoenix

OS-supported high-availability restart framework, accepted at SOSP '25.

Industry

Amazon

Customer-facing SDE internship project shipped through production-safe processes.

Selected Projects

Technical work that shows how I think and build.

These projects emphasize systems implementation, fault tolerance, and practical AI infrastructure rather than broad product surface area.

Sept 2025 - Dec 2025

Edge Computer-Use Agent (ECUA)

Built a fully local computer-use agent on Ubuntu using a perception-plan-execute-reflect loop around Qwen3-VL-8B served with vLLM.

  • Hybrid perception with screenshots and accessibility trees
  • Deterministic grounding and constrained action schema
  • FP8, prefix caching, and CUDA graphs reduced time-to-first-token by about 65%

Nov 2023 - Dec 2023

Sharded Key/Value Service with Paxos Groups

Designed a fault-tolerant sharded key/value storage system in Go with a shard master and Paxos-backed replica groups.

  • Configuration management with a shard master
  • Replica-group coordination using Paxos
  • RPC design for smooth server-to-server communication

Mar 2023 - Apr 2023

Pager

Implemented a pager that manages virtual memory for application processes with a simulated MMU and fixed-size page table.

  • Address space creation, faults, destruction, and mapping
  • Swap-backed and file-backed paging support
  • FSM design for multicore scenarios

May 2024 - Aug 2024

Intel P6-style Out-of-Order Pipeline

Implemented a scalar out-of-order pipeline in SystemVerilog with Tomasulo scheduling and a reorder buffer for VeriSimpleV.

  • In-order commit with out-of-order execution
  • Instruction prefetching and load-store queue enhancements
  • Docker-based benchmarking against a baseline pipeline

Experience

Research depth with production-minded execution.

Dec 2024 - Mar 2026

Research Assistant

Pilot Execution, University of Virginia

NSDI '26

Studied recovery failures where the recovery action itself can trigger irreversible or cascading failures in distributed systems. Implemented core components of PILOT, an execution model for dry-run recovery simulation, and applied it to systems including Hadoop YARN.

June 2025 - Sept 2025

Software Development Engineer Intern

Amazon

Production

Completed a customer-facing internship project from design choices through implementation, team communication, testing, and safe production deployment. Worked inside standardized CI/CD, A/B testing, and integration-test pipelines while using internal AI tooling to accelerate learning and execution.

May 2023 - May 2024

Research Assistant

Phoenix Project, University of Michigan

SOSP '25

Built Phoenix, an OS-supported recovery framework that accelerates high-availability restarts by preserving long-lived process state while discarding transient state. Worked on Linux syscalls, allocator metadata, memory mapping management, runtime APIs, and fault-injection infrastructure for Redis and LevelDB validation.

May 2022 - Aug 2022

Instructional Aide

Introduction to Logic Design, UM-SJTU

Teaching

Held weekly labs, discussions, and office hours for a core ECE course with more than 100 students, helping improve hands-on implementation and debugging ability in digital logic labs.

Education

Strong academic grounding across systems and computer engineering.

Sept 2024 - Dec 2025

University of California San Diego

M.S. Computer Science

GPA: 3.97 / 4.0

Sept 2022 - May 2024

University of Michigan

B.S. Computer Science and Engineering

GPA: 3.88 / 4.0, Dean's Honor List 2022

Sept 2020 - Aug 2024

Shanghai Jiao Tong University

B.Eng., UM Joint Institute

GPA: 3.53 / 4.0, Undergraduate Excellent Scholarship 2021

Research Highlights

High-signal work on recovery and dependable systems.

Accepted at NSDI '26

PILOT

Co-authored work on dry-run execution for recovery actions in production distributed systems, making it possible to observe consequences before commitment.

Accepted at SOSP '25

Phoenix

Co-authored work on OS-supported recovery for high-availability restarts.


Skills and Systems Stack

Research interests, languages, and infrastructure I work with.

Research interests

Reliable AI/ML systems, LLM inference, Agent infrastructure, Distributed systems, Recovery and fault tolerance, Systems verification

Languages

Python, C/C++, Go, Java, JavaScript, Shell

Systems and AI infra

Linux, Docker, AWS, Redis, LevelDB, Hadoop YARN, Paxos, vLLM, RAG, CI/CD, Fault injection, Prefix caching, CUDA graphs, FP8 optimization

Contact

Open to conversations about research, engineering, and dependable AI systems.

The fastest way to reach me is by email. If you are hiring for software engineering, systems, infrastructure, or research-adjacent roles, I would be glad to talk.