Data ML Engineer

About the Company:

We're working with a social AI company building a suite of advanced real-time models that push the boundaries of expression, personality, and realism. Their work brings characters to life, transforming how people tell stories, connect, and create. Their flagship social AI platform is just the beginning, with a broader ecosystem to follow.

The company is now hiring an Applied ML Engineer (Data) to build and scale the data pipelines behind large video generation models. This role is focused on collecting and curating large volumes of relevant video data, producing high-quality training samples, and developing robust workflows for preprocessing, filtering, and parsing.

You’ll own the full lifecycle of training data: from raw ingestion to clean, model-ready datasets that directly improve model quality, working at the intersection of data engineering and ML research.

Key Responsibilities:

Build and maintain large-scale pipelines for video datasets: ingestion, parsing, filtering, preprocessing, and curation (AWS S3, DynamoDB).
Design and run annotation workflows (e.g., MTurk/Prolific): task design, quality control, label validation.
Train/evaluate smaller supporting models for filtering, quality assessment, preprocessing, or other pipeline stages.
Partner with research and engineering to turn experimental workflows into scalable, repeatable systems for training and evaluation.
Improve data quality end-to-end by identifying bottlenecks, failure modes, and weak data sources; build tooling and automation to streamline dataset creation.
Work within a Kubernetes-based training environment; optimize model inference scripts used in preprocessing for speed and cost at scale.

Requirements:

Must-have:

3+ years of experience in ML engineering / applied ML / data pipelines (or related engineering roles).
Strong Python skills and hands-on experience building reliable data processing pipelines for ML workflows.
Experience preparing training data at scale (e.g., on S3/DynamoDB): parsing, filtering, dataset curation, and quality control.
Experience with video, vision, multimodal systems, or generative video.
Experience working with Kubernetes for orchestrating distributed workloads and delivering datasets to training clusters.
Working knowledge of PyTorch and ability to read/debug/optimize research inference code used in preprocessing.

Nice-to-have:

Annotation/labeling workflow experience (crowd platforms or vendors).
Experience training or fine-tuning smaller models used for filtering/ranking/quality assessment.

Conditions and Compensation:

Remote: U.S. or Europe
Full-time role in the Engineering team
Competitive compensation
Benefits: near-full medical/dental/vision coverage, 42 days of paid time off, parental leave and fertility support, 401(k), $500/month lifestyle spending account, and more

Send your CV via Telegram: @dariiyah