Research
PILOT
Recovery simulation for distributed systems, accepted at NSDI '26.
Reliable AI systems, recovery, and infrastructure
I am Angting Cai, a software engineer and systems researcher working on LLM infrastructure, recovery, distributed systems, and verification.
Research
Recovery simulation for distributed systems, accepted at NSDI '26.
Systems
OS-supported high-availability restart framework, accepted at SOSP '25.
Industry
Customer-facing SDE internship project shipped through production-safe processes.
Selected Projects
These projects emphasize systems implementation, fault tolerance, and practical AI infrastructure rather than broad product surface area.
Built a fully local computer-use agent on Ubuntu using a perception-plan-execute-reflect loop around Qwen3-VL-8B served with vLLM.
Designed a fault-tolerant sharded key/value storage system in Go with a shard master and Paxos-backed replica groups.
Implemented a pager that manages virtual memory for application processes with a simulated MMU and fixed-size page table.
Implemented a scalar out-of-order pipeline in SystemVerilog with Tomasulo scheduling and a reorder buffer for VeriSimpleV.
Experience
Pilot Execution, University of Virginia
Studied cases in distributed systems where the recovery action itself triggers irreversible or cascading failures. Implemented core components of PILOT, an execution model for dry-run recovery simulation, and applied it to systems including Hadoop YARN.
Amazon
Completed a customer-facing internship project from design choices through implementation, team communication, testing, and safe production deployment. Worked inside standardized CI/CD, A/B testing, and integration-test pipelines while using internal AI tooling to accelerate learning and execution.
Phoenix Project, University of Michigan
Built Phoenix, an OS-supported recovery framework that accelerates high-availability restarts by preserving long-lived process state while discarding transient state. Worked on Linux syscalls, allocator metadata, memory mapping management, runtime APIs, and fault-injection infrastructure for Redis and LevelDB validation.
Introduction to Logic Design, UM-SJTU
Held weekly labs, discussions, and office hours for a core ECE course with more than 100 students, helping them build hands-on implementation and debugging skills in digital logic labs.
Education
M.S. Computer Science
GPA: 3.97 / 4.0
B.S. Computer Science and Engineering
GPA: 3.88 / 4.0, Dean's Honor List 2022
B.Eng., UM Joint Institute
GPA: 3.53 / 4.0, Undergraduate Excellent Scholarship 2021
Research Highlights
Co-authored work on dry-run execution for recovery actions in production distributed systems, making it possible to observe an action's consequences before committing to it.
Co-authored work on OS-supported recovery for high-availability restarts.
Skills and Systems Stack
Contact
The fastest way to reach me is by email. If you are hiring for software engineering, systems, infrastructure, or research-adjacent roles, I would be glad to talk.