Lead Engineer, Platform Engineering - AI
The AI Platform Engineering Lead drives the AI Platform Operations team, guiding platform strategy, governance, and stakeholder engagement. They align technical execution with business goals, ensuring cost-effective, secure, and scalable AI/ML solutions. The lead defines the architecture, standards, and governance across AI/ML infrastructure, workflow automation, and agentic AI capabilities, including RAG pipelines, vector store infrastructure, agent memory frameworks, and MCP server strategy. They establish design patterns and best practices that span LLM, MCP, and agentic capabilities, ensuring the platform scales securely and operates with consistency across the enterprise. The lead collaborates across teams to set security standards, manage resources, maintain compliance, and align platform capabilities with enterprise architecture standards.
Responsibilities
- Define platform strategy, roadmap, and capability evolution
- Establish governance frameworks, policies, and exception processes
- Manage team budget, CapEx planning, and vendor relationships
- Build and lead the AI Platform Operations team
- Define the architecture, standards, and governance for AI/ML infrastructure, including GPU cluster design, compute resource planning, security controls, and observability across the platform
- Drive the strategy, standards, and governance for AI-enabled workflow automation across LLM, MCP, and agentic capabilities, ensuring the platform scales securely and operates with consistency across the enterprise
- Define the architecture and design standards for vector store infrastructure supporting RAG pipelines, agent memory, and semantic search across the enterprise
- Establish design patterns and best practices for RAG workflow implementation, including ingestion strategies, chunking approaches, embedding model selection, and retrieval optimization
- Architect agent memory frameworks, defining standards for short-term context, long-term persistent memory, and episodic memory patterns across AI platform workloads
- Drive the architecture and governance of Agentic AI systems, including multi-agent orchestration design and tool-use pipeline standards
- Define the strategy and architecture for hosting and managing MCP servers across the platform, including deployment topology, security boundaries, and integration standards
- Establish governance frameworks and policies for MCP server lifecycle management, versioning, and access control
- Evaluate and select MCP server tooling and vendors; manage relationships and roadmap alignment
- Serve as executive liaison for platform matters
- Own major incident management and executive communication
- Drive continuous improvement and platform maturity initiatives
- Align platform capabilities with enterprise architecture standards
- Respond to and assist in production operations in a 24/7 environment
- Provide technical analysis, resolve problems, and propose solutions
- Provide support to, and coordinate with, developers, operations staff, release engineers, and end-users
- Educate and mentor team members and operations staff
Qualifications
- 8+ years in IT infrastructure or platform engineering roles
- 3+ years in technical leadership or management positions
- 1+ years hands-on experience with Kubernetes in production
- Direct experience with GPU infrastructure (NVIDIA preferred)
- 2+ years experience using CUDA
- 1+ years experience with MCP (Model Context Protocol) servers
- 2+ years experience with vector databases and embedding infrastructure
- 2+ years experience with RAG pipeline design and deployment
- 2+ years experience with agent memory patterns (in-context, external stores, retrieval-augmented memory)
- 1+ years experience with agentic AI systems using orchestration frameworks
- 2+ years experience with semantic search, embedding models, and ANN search techniques
- 3+ years working with workflow/orchestration automation tools
- Experience managing teams of 5+ technical staff
- Demonstrated success in vendor management and contract negotiation
- Strong executive communication and presentation skills
- Understanding of AI/ML workloads and infrastructure requirements
- Experience with enterprise monitoring and observability tools
- Ability to work in a service-oriented team environment
- Project management, organization, and time management skills
- Customer-focused and dedicated to the best possible user experience
- Communicate effectively with both technical and business resources
- Fluent speaking, reading, and writing in English
New York Base Salary Range
The expected base salary for this role, if located in New York, is between $149,400 and $180,000 USD. The base salary range does not include Intercontinental Exchange’s incentive compensation. While we provide this range as general guidance, at ICE we compensate employees based on the skillset and experience of the individual. Regular full-time ICE employees are eligible for a suite of competitive employee benefits, including healthcare coverage (medical, dental and vision), a 401(k) plan, life insurance, time off, and paid leave for qualifying circumstances.
California Base Salary Range
The expected base salary for this role, if located in California, is between $149,400 and $180,000 USD. The base salary range does not include Intercontinental Exchange’s incentive compensation. While we provide this range as general guidance, at ICE we compensate employees based on the skillset and experience of the individual. Regular full-time ICE employees are eligible for a suite of competitive employee benefits, including healthcare coverage (medical, dental and vision), a 401(k) plan, life insurance, time off, and paid leave for qualifying circumstances.
Intercontinental Exchange, Inc. is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to legally protected characteristics.