AI Platform Engineer (OneAI)
For over a decade now, OpenNebula Systems has been building the open source technology that helps organizations around the world manage their corporate data centers and build Enterprise Clouds with unique, innovative features.
Born as an open source platform for Private Clouds, OpenNebula is a powerful yet easy-to-use open source Cloud & Edge Computing Platform whose community now includes leading companies and public agencies across a wide range of industries and countries.
If you want to join an established leader in the cloud infrastructure industry and the global open source community, keep reading: you can now work alongside exceptionally passionate, talented colleagues and help the world's leading enterprises implement their next-generation edge and cloud strategies. We are hiring!
Come join our distributed, fully remote, international team, where we promote Creativity, Innovation, Collaboration, Open Communication and Iterative Work. We're seeking an AI Engineer to work on the development of the OpenNebula open-source cloud management platform and its new strategic project in AI.
Since 2019, thanks to the support of the European Commission, OpenNebula Systems has been leading edge computing innovation in Europe, investing heavily in research and development and playing a central role in strategic initiatives of the European Union. https://opennebula.io/innovation/
Job Description
We are seeking an AI Platform Engineer to contribute to the design, implementation, and deployment of advanced AI capabilities within OneAI, our platform for operating high-performance inference and training workloads across diverse cloud and edge environments leveraging OpenNebula.
In this role, you will be responsible for integrating cutting-edge AI frameworks, tools and engines (such as vLLM, PyTorch and Unsloth) into a secure, scalable, and highly observable AI platform that can provision compute, schedule workloads, and expose reliable APIs for the entire AI model lifecycle, including serving and fine-tuning.
Beyond the backend infrastructure, you will help shape the OneAI user experience by designing intuitive workflows for managing models, configuring deployments, and operating inference and training jobs. A key focus will be integrating with public model repositories like Hugging Face to streamline model and dataset discovery, import, versioning, and deployment directly into OpenNebula cloud deployments.
Working at the intersection of applied AI, systems engineering and product design, you will ensure that inference and training are efficient at scale. Key focus areas include establishing deployment strategies, optimizing performance and cost efficiency, implementing GPU-aware operations, and building comprehensive observability tools to track latency, throughput, utilization, and failure rates.
As we are an international team, please submit your CV in English.
Core Responsibilities
Design, implement, and deploy advanced AI capabilities within the OneAI platform.
Shape the end-user experience by designing intuitive workflows for model management, deployment configuration, and job operation.
Streamline the model lifecycle by integrating public repositories (e.g., Hugging Face) for seamless discovery, import, versioning, and deployment.
Bridge the gap between systems engineering and product design to ensure a seamless transition from backend infrastructure to user features.
AI Platform & Systems Engineering
Integrate cutting-edge AI frameworks and engines, such as vLLM, NVIDIA Dynamo and Unsloth, into a secure and scalable environment.
Leverage OpenNebula to orchestrate high-performance inference and training workloads across diverse cloud and edge environments.
Develop and maintain reliable APIs for compute provisioning and workload scheduling.
Implement GPU-aware operations to ensure optimal resource allocation and hardware utilization.
Build comprehensive observability suites to monitor and track critical metrics, including latency, throughput, utilization, and failure rates.
Research, Optimization & Innovation
Establish and refine deployment and workflow strategies to ensure AI workloads remain efficient and stable at scale.
Optimize system architecture to balance high performance with cost efficiency.
Research and integrate emerging AI tools and engines to keep the OneAI platform at the forefront of the industry.
Analyze performance bottlenecks to iterate on the efficiency of both training and inference processes.
Experience Required
Academic Background and Certifications
Bachelor’s or Master’s degree in Computer Science, Information Technology, or Engineering.
Professional Experience
3+ years of experience in applied AI, machine learning, or software engineering, with hands-on delivery of AI/ML solutions in production environments.
Demonstrated experience designing and deploying high-performance AI infrastructure, specifically focusing on the scalability and reliability of inference and training workloads.
Proven track record of deploying Large Language Models (LLMs) at scale, with deep knowledge of serving engines (e.g., vLLM) and fine-tuning tools (e.g., Unsloth).
Experience building AI-centric platforms or toolchains that manage the model lifecycle (versioning, deployment, and discovery).
Experience with GPU orchestration and optimizing workloads for cloud, distributed or large-scale environments and collaborating with platform or infrastructure teams.
Technical Experience
Hands-on experience with high-throughput inference engines (e.g., vLLM) and fine-tuning tools (e.g., Unsloth).
Proficiency in integrating with the Hugging Face ecosystem (Transformers, Hub, Datasets) for model and data management.
Experience implementing monitoring tools to track system-level AI metrics such as token throughput, latency, GPU utilization, and failure rates.
Experience designing and implementing scalable, reliable APIs for compute provisioning and workload scheduling.
Experience working with cloud platforms and containerized environments (e.g., OpenNebula, Kubernetes).
Language Skills
Advanced English level (B2 or higher) is required.
Soft Skills & Collaboration
Strong analytical and problem-solving skills with a practical, experimental mindset.
Ability to work independently in complex, fast-moving technical environments.
Comfortable collaborating in distributed teams and engaging with open-source communities.
What's in it for me?
Some of our benefits and perks vary depending on location and employment type, but we are proud to provide employees with the following:
Competitive compensation package and flexible remuneration: Meals, Transport, Nursery/Childcare
Customized workstation (macOS, Windows, Linux)
Private health insurance
Paid time off: Holidays, Personal Time, Sick Time, Parental leave
Afternoon-off working day every Friday and during the summer
Remote company with a bright HQ centrally located in Madrid; offices in Boston (USA), Brussels (Belgium) and Brno (Czech Republic); and access to office space near your location when needed. During the first year, for onboarding purposes and for participation in certain projects, employees should be able to attend events and face-to-face meetings in our Madrid offices and other European cities. All employees are also required to attend our company-wide face-to-face all-hands meetings twice a year
Healthy work-life balance: We uphold the right to digital disconnection and promote harmony between employees' personal and professional lives
Flexible hiring options: Full Time/Part Time, Employee (Spain/USA) / Contractor (other locations)
We are building an awesome, Engineering-First Culture and your opinion matters: thrive in the high-energy environment of a young company where openness, collaboration, risk-taking, and continuous growth are valued
Be exposed to a broad technology ecosystem. We encourage learning and researching new technologies and methods as part of your everyday duties
Department: Innovation
Locations: Headquarters
Remote status: Fully Remote