Job Description
We are currently partnered with a globally leading research centre in the UK that is looking to expand its team with a Systems Research Engineer. This is a high-calibre team reshaping how large-scale models are trained and served through next-generation AI-native infrastructure and "super-node" clusters.
This is a permanent opportunity based onsite in Edinburgh.
Key responsibilities for this Systems Research Engineer position:
- Architect and implement distributed system components for AI workloads across CPU, GPU, and NPU clusters.
- Conduct in-depth profiling and performance tuning of inference pipelines, focusing on KV cache management.
- Develop low-latency, fault-tolerant AI serving frameworks using vLLM, Ray Serve, and PyTorch Distributed.
- Research and prototype novel techniques for cache sharing, data locality, and resource orchestration.
- Translate innovative designs into publishable contributions at top-tier venues (e.g., OSDI, NSDI, MLSys).
- Collaborate with global research teams to drive the internal adoption of novel system architectures.
Key Requirements:
- Preferably a PhD, or at minimum a master's degree, in Computer Science, distributed systems, or a related field.
- Strong knowledge of distributed systems, OS internals, and machine learning systems architecture.
- Hands-on experience with LLM serving frameworks (vLLM, Ray Serve, TensorRT-LLM, or TGI).
- Proficiency in C/C++ for systems development and Python for research prototyping.
- Solid grounding in distributed algorithms, load balancing, and state management.
- Proven ability to conduct systems research, ideally evidenced by publications in top-tier conferences.