Strategic data selection and curation practices significantly reduce annotation costs and drive development productivity.

Computer vision teams face an uncomfortable reality. Even as annotation costs continue to rise, research consistently shows that teams annotate far more data than they actually need. Sometimes teams annotate the wrong data entirely, contributing little to model improvements. In fact, by some estimates, 95% of data annotations go to waste.

The problem extends beyond cost. As I explored in my previous article on annotation quality, error rates average 10% in production machine learning (ML) applications. But there’s a deeper issue that precedes annotation quality: Most teams never develop systematic approaches to selecting which data needs annotation in the first place. This is largely because annotation often remains siloed from data curation and model evaluation, making it impossible to act on the full picture.

Safety-critical models, such as models for autonomous vehicles (AV) with multi-sensor perception stacks, require highly accurate 2D bounding boxes and 3D cuboid annotations. Without intelligent data selection, teams find themselves not only collecting vast amounts of data but also labeling millions of redundant samples while missing the edge cases that actually improve model performance.

When tools become barriers

The conventional approach treats annotation as an isolated workflow: Collect data, export to a labeling platform, wait for humans to label data, import labels, discover problems, go back to the annotation vendor, and repeat. This fragmentation creates two critical gaps that turn annotation into a development bottleneck rather than an enabling capability.

No systematic data selection

Random sampling and “label everything” approaches waste annotation budgets on redundant samples.
Teams annotating AV datasets might label 100,000 highway cruise images that provide minimal new information while missing rare scenarios like emergency vehicle encounters or unusual weather conditions.

Lost context across tool boundaries

When annotation lives in one platform, curation in another, and model evaluation in a third, teams lose critical context at each handoff. Data scientists spend 80% of their time curating data, yet most of this effort happens in ad hoc, disconnected ways that don’t inform downstream annotation decisions. Some estimates indicate that ~45% of companies now use four or more tools simultaneously, cobbling together partial solutions that impact budgets and timelines.

Curate first: A paradigm shift in ML workflows

The “curate first, then annotate” approach inverts the conventional wisdom. Instead of treating data curation as a second step in development, curation becomes the foundation that drives intelligent annotation decisions. This methodology recognizes that annotation isn’t primarily a labeling problem—it’s a data understanding problem.

Strategic data selection focuses annotation where it matters

Zero-shot coreset selection represents a breakthrough in pre-annotation intelligence. Using pre-trained foundation models to analyze unlabeled data, this technique scores each sample based on its unique information contribution, automatically filtering redundant examples. The methodology works through iterative subspace sampling:

- Embedding computation: Foundation models generate high-dimensional representations capturing semantic content.
- Uniqueness scoring: Each sample receives a score indicating information diversity relative to existing selections.
- Iterative selection: Samples with the highest uniqueness scores enter the training set.
- Redundancy elimination: Visually similar samples get deprioritized automatically.
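To make the loop concrete, here is a minimal sketch of uniqueness-driven selection using a greedy farthest-point heuristic over embeddings. This is an illustration of the general idea, not the exact zero-shot coreset algorithm; the random embeddings stand in for real foundation-model outputs.

```python
import numpy as np

def greedy_coreset(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k samples that maximize diversity in embedding space."""
    # Normalize so distances reflect semantic (cosine-style) similarity
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [0]  # seed with an arbitrary first sample
    # Distance from every sample to its nearest already-selected sample
    min_dist = np.linalg.norm(emb - emb[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(min_dist))  # most "unique" remaining sample
        selected.append(idx)
        dist = np.linalg.norm(emb - emb[idx], axis=1)
        # Samples similar to anything already selected stay deprioritized
        min_dist = np.minimum(min_dist, dist)
    return selected

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))  # stand-in for CLIP-style embeddings
coreset = greedy_coreset(embeddings, k=100)  # keep 10% of the data
```

Each iteration adds the sample farthest from everything already selected, which is why redundant near-duplicates are naturally skipped.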
Benchmarks on ImageNet demonstrate that this approach achieves the same model accuracy with just 10% of training data, eliminating annotation costs for over 1.15 million images.

Zero-shot coreset selection process to prioritize the right data for model training. Voxel51

To put it in perspective, for a 100,000-image dataset at typical rates of $0.05 to $0.09 per object, strategic selection can save ~$81K in annotation costs while improving model generalization on edge cases. Programmatically:

```python
import fiftyone.zoo as foz
from zcore import zcore_scores, select_coreset

dataset = foz.load_zoo_dataset("quickstart")
model = foz.load_zoo_model("clip-vit-base32-torch")

embeddings = dataset.compute_embeddings(model, batch_size=2)
scores = zcore_scores(embeddings, use_multiprocessing=True, num_workers=4)
coreset = select_coreset(dataset, scores, coreset_size=int(0.1 * len(dataset)))
```

Embedding-based curation

This approach surfaces the samples that will contribute most to model learning, transforming annotation from a volume game into a strategic exercise. Modern platforms enable embedding-based curation through straightforward workflows. For example, you can leverage computed embeddings to identify the most unique samples in the embedding space using a k-nearest-neighbors calculation. Those samples are then prioritized for annotation.
```python
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Load your unlabeled dataset
dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/images",
    dataset_type=fo.types.ImageDirectory,
)

# Generate embeddings using a pre-trained model
model = foz.load_zoo_model("clip-vit-base32-torch")
dataset.compute_embeddings(model, embeddings_field="embeddings")

# Perform uniqueness-based selection
fob.compute_uniqueness(dataset, embeddings="embeddings")

# Sort by uniqueness score to prioritize diverse samples
unique_view = dataset.sort_by("uniqueness", reverse=True)

# Select the top 10% most informative samples for annotation
samples_to_annotate = unique_view.take(len(dataset) // 10)
```

Embedding-based curation surfaces the samples that will contribute most to model learning. Voxel51

Model analysis results feed into prioritizing what to label

Once you have trained a baseline model on your initial curated subset, you can shift from pure data exploration to targeted improvement. Instead of randomly selecting the next batch, use the model’s own predictions to identify “hard” samples where the model is confused or uncertain.

The most effective workflow intersects uncertainty with uniqueness. This ensures you prioritize valid edge cases that drive better model understanding, rather than just noise (for example, blurry images, which are inherently low-confidence). We can filter programmatically for this “Goldilocks zone” of high uniqueness and low confidence.
```python
from fiftyone import ViewField as F

# Filter for samples where model confidence is low
hard_samples = dataset.match(F("predictions.confidence") < 0.5)

# Sort the hard samples by uniqueness
valuable_hard_samples = hard_samples.sort_by("uniqueness", reverse=True)

# Select the top 5% that will actually improve the model
next_batch = valuable_hard_samples.take(int(0.05 * len(dataset)))

# Tag for annotation
next_batch.tag_samples("active_learning_batch_1")
```

Quantifying the curation advantage

The financial impact of curation-first workflows manifests across multiple dimensions, with organizations reporting cost and efficiency improvements.

- Reduced annotation volume: Curation achieves equivalent model performance with 60% to 80% less annotated data.
- Lower error correction costs: Finding and fixing labeling mistakes early reduces expensive rework cycles that typically add 20% to 40% to project budgets.
- Minimized tool licensing and coordination overhead: Unified workflows eliminate redundant platform costs that average $50K annually per tool and minimize handoffs.
- Faster iteration cycles: Targeted annotation and validation eliminate weeks of review cycles.

A mid-sized AV team annotating 500K samples monthly at $0.07 per object can cut its monthly annotation spend from $35K to $14K through intelligent selection, an annual savings of ~$252K (up to ~$336K at the high end of the reduction range).

Impact on development teams: From reactive to strategic

The shift to curation-first methodologies fundamentally changes how ML engineering teams operate, moving them from reactive problem-solving to proactive dataset optimization.

Workflow transformation

Traditional workflow: Data collection → Data annotation → Model training → Discover failures → Debug → Reannotate → Retrain

Curation-first workflow: Data collection → Intelligent curation → Targeted annotation → Continuous validation → Model training → Strategic expansion

This reordering frontloads data understanding, helping identify issues when they’re cheapest to fix.
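The budget figures in the section above follow directly from the quoted volumes and the 60% to 80% reduction range; a back-of-the-envelope check:

```python
monthly_samples = 500_000
cost_per_object = 0.07  # dollars per object, rate quoted above

baseline = monthly_samples * cost_per_object  # $35,000 per month

for reduction in (0.60, 0.80):  # the 60%-80% range cited above
    curated = baseline * (1 - reduction)
    annual_savings = (baseline - curated) * 12
    print(f"{reduction:.0%} less data: "
          f"${curated:,.0f}/month, ${annual_savings:,.0f}/year saved")
```

At a 60% reduction, monthly spend drops from $35K to $14K ($252K saved annually); at 80%, it drops to $7K ($336K saved annually).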
Teams report productivity gains as engineers shift their focus from tedious quality firefighting to strategic model improvement.

Best practices: Implementing curation-driven annotation

Successful implementations follow established patterns that balance automation with human expertise.

Start with embedding-based exploration

Before annotating anything, generate embeddings and visualize your dataset’s distribution. This reveals the structure of your dataset: tight clusters indicate redundancy, while sparse regions suggest rare scenarios worth targeted collection or synthetic augmentation.

```python
import fiftyone as fo
import fiftyone.brain as fob

# Compute embeddings
dataset.compute_embeddings(model, embeddings_field="embeddings")

# Generate a 2D visualization using UMAP
results = fob.compute_visualization(
    dataset,
    embeddings="embeddings",
    brain_key="img_viz",
)

# Launch interactive exploration
session = fo.launch_app(dataset)
```

Implement progressive annotation strategies

Rather than annotating entire datasets up front, adopt iterative expansion:

1. Initial selection: Curate 10% to 20% of the most unique/representative samples with coreset selection, mistakenness computation, or another algorithmic tool.
2. Auto labeling and training: Annotate quickly with foundation models and train your initial model from those labels.
3. Failure analysis: Identify prediction errors and edge case gaps.
4. Targeted expansion: Collect or annotate specific scenarios addressing weaknesses.
5. Iterate: Repeat the cycle, focusing resources on high-impact improvements.

This approach mirrors active learning, but with explicit curation intelligence guiding selection.

Automate quality gates

Replace subjective manual review with deterministic quality gates. Automated checks are the only way to catch systematic errors like schema violations or class imbalance that human reviewers inevitably miss at scale.
```python
from fiftyone import ViewField as F

# Find bounding boxes that are impossibly small
tiny_boxes = dataset.filter_labels(
    "ground_truth",
    (F("bounding_box")[2] * F("bounding_box")[3]) < 0.01,
)

# Find samples where the model disagrees with ground truth
possible_errors = dataset.match(F("mistakenness") > 0.8)

# Schema validation: Find detections missing required attributes
incomplete_labels = dataset.filter_labels(
    "ground_truth",
    F("occluded") == None,
)
```

Maintain annotation provenance

Track curation decisions and annotation metadata to support iterative improvement. This provenance enables sophisticated analysis of which curation strategies yield the best model improvements and supports continuous workflow optimization.

```python
# Grab the most unique sample from a curated view of unique samples
most_unique_sample = unique_view.first()

# Add sample-level provenance
most_unique_sample.tags.append("curated_for_review")

# Set metadata on the specific labels (detections)
if most_unique_sample.detections:
    for det in most_unique_sample.detections.detections:
        det["annotator"] = "expert_reviewer"
        det["review_status"] = "validated"

most_unique_sample.save()
```

A unified platform for curation-driven workflows

Voxel51’s flagship open source computer vision platform, FiftyOne, provides the necessary tools to curate, annotate, and evaluate AI models. It provides a unified interface for data selection, QA, and iteration.

Architecture advantages

Open-source foundations provide transparency into data processing while enabling customization for specific workflows. FiftyOne has millions of community users and an extensive integrations framework that lets you connect FiftyOne to any workflow or external tool. The design recognizes that curation, annotation, and evaluation are interconnected activities requiring shared context rather than isolated tools.
This architectural philosophy enables the feedback loops that make curation-first workflows effective: Evaluation insights immediately inform curation priorities, which drive targeted annotation, which in turn feeds back into refined models.

- Data-centric selection: Zero-shot coreset selection, uniqueness scoring, and embedding-based exploration enable intelligent prioritization before any annotation investment.
- Unified annotation: Create and modify 2D bounding boxes, 3D cuboids, and polylines directly within the platform where you already curate and evaluate. Annotate and QA 2D and 3D annotations in a single interface to maintain spatial context across modalities.
- ML-powered quality control: Mistakenness scoring, similarity search, and embedding visualization surface labeling errors systematically rather than through random sampling.
- Production-grade features: Dataset versioning captures state at each training iteration, annotation schemas enforce consistency, and programmatic quality gates prevent drift.

Getting started

Teams can implement curation-first workflows incrementally:

```shell
pip install fiftyone
```

```python
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Load an existing dataset
dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/data",
    dataset_type=fo.types.ImageDirectory,
)

# Generate embeddings
model = foz.load_zoo_model("clip-vit-base32-torch")
dataset.compute_embeddings(model, embeddings_field="embeddings")

# Compute a 2D visualization
fob.compute_visualization(
    dataset,
    embeddings="embeddings",
    brain_key="clip_viz",
)

# Visualize and curate your data
session = fo.launch_app(dataset)
```

Future outlook: From reactive labeling to proactive intelligence

Three technical shifts are accelerating the move to curation-first workflows.

Foundation models as curators: Pre-trained vision-language models (VLMs) can now describe and filter images semantically without task-specific training.
Instead of waiting for human review, teams can use multi-modal models to auto-tag complex sensor data (LiDAR/camera) and prioritize scenarios based on deployment needs.

Active learning meets intelligent curation: Standard active learning can waste budget by blindly flagging “low-confidence” predictions that are really just noisy or redundant frames. Next-generation pipelines now filter these requests through a uniqueness check. By prioritizing samples that are both confusing to the model and unique in the dataset, teams maximize the learning value of every labeled image.

Continuous curation in production: As models deploy to production, curation intelligence will extend to monitoring and maintenance. Embedding analysis of production data will detect distribution drift, trigger targeted data collection for new scenarios, and prioritize annotation of examples where models fail. This closes the loop from deployment back to development, enabling continuous model improvement grounded in real-world performance data.

Make your annotation investments count

Curation-first workflows coupled with smart labeling fundamentally transform how teams develop computer vision systems. Progressive annotation strategies that focus on high-impact data help teams achieve better model performance with 60% to 80% less labeling effort. For teams ready to make that shift, the path forward starts with understanding your data before you label it.

—

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.