Hugging Face
Remote
Here at Hugging Face, we’re on a journey to advance good Machine Learning and make it more accessible. Along the way, we contribute to the development of technology for the better. We have built the fastest-growing, open-source, library of pre-trained models in the world. With more than 500K+ models and 250K+ stars on GitHub, over 15.000 companies are using HF technology in production, including leading AI organizations such as Google, Elastic, Salesforce, Algolia, and Grammarly. About the role: We are looking for a Site Reliability Engineer responsible for maintaining and scaling our product infrastructure. The ideal candidate will have experience maintaining large infrastructure for AI workflows and strong experience supporting teams to create best practices for reliability and scalability. Responsabilities: Design, develop, deploy, and maintain reliable and scalable infrastructure. Manage large Kubernetes clusters. Measure and optimize system performance....