AI Glossary
Data labeling
data labeling, data annotation, labeling, annotation
Data labeling is the practice of attaching labels or annotations to raw data to describe the correct answer, so the data can train or evaluate a model. It is the basis of supervised learning and reliable evaluation.
- Labeling means adding information about the correct outcome to each example: a category, a sentiment, a marked object, or a reference answer.
- Without labels, data is just a set of examples; the label tells the model what to learn from them, or what to be tested against.
- The quality and consistency of labels directly cap the quality of the trained model — wrong labels teach wrong answers.
Data labeling is the process of annotating raw examples with information about the correct outcome for each one. That might mean assigning a category to a piece of text, marking the sentiment of a review, outlining an object in an image, or recording the reference answer to a question. A dataset annotated this way becomes training data for supervised learning — the model learns to map an input to the label assigned by a human or another trusted process.
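As a minimal sketch of what "annotated data" can look like in practice, the Python snippet below pairs raw review texts with sentiment labels. The class name `LabeledExample`, the sample reviews, and the label values are illustrative assumptions, not part of any particular tool or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    """One raw input paired with the outcome an annotator assigned to it."""
    text: str   # the raw example
    label: str  # the annotation describing the correct outcome


# Raw, unlabeled examples (made up for illustration).
raw_reviews = [
    "The battery lasts all day.",
    "Stopped working after a week.",
    "Does exactly what it promises.",
]

# Hypothetical annotator decisions, in the same order as raw_reviews.
annotator_labels = ["positive", "negative", "positive"]

# Pairing each input with its label turns raw data into supervised training data.
dataset = [
    LabeledExample(text=text, label=label)
    for text, label in zip(raw_reviews, annotator_labels)
]
```

A supervised model trained on `dataset` would learn to map each `text` to its `label`.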
Labeling is not the same thing as training data: training data is the whole body of material a model learns from, whereas labeling is the specific act of adding the correct answers to it. Labels are also used beyond training. In model evaluation, the model's answers are compared against a previously labeled reference set. In fine-tuning, a pretrained model is adapted on a smaller, carefully labeled set for a specific task.
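The evaluation use of labels can be sketched as a simple accuracy check: the share of model answers that match the labeled reference set. The function name `accuracy` and the sample values below are assumptions made for illustration. Real evaluations often use additional metrics.

```python
def accuracy(predictions, reference_labels):
    """Return the fraction of predictions that match the reference labels."""
    if len(predictions) != len(reference_labels):
        raise ValueError("predictions and reference_labels must have the same length")
    if not reference_labels:
        raise ValueError("reference set must not be empty")
    correct = sum(p == r for p, r in zip(predictions, reference_labels))
    return correct / len(reference_labels)


# Hypothetical labeled reference set and model outputs for the same four inputs.
reference_labels = ["positive", "negative", "positive", "negative"]
model_predictions = ["positive", "negative", "negative", "negative"]

score = accuracy(model_predictions, reference_labels)  # 3 of 4 match: 0.75
```

The score is only as trustworthy as the reference labels: a mislabeled reference example penalizes a correct model answer or rewards a wrong one.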
Labeling can be expensive and labor-intensive, because it usually requires people and clear instructions, and inconsistent or wrong labels carry straight through into model errors. That is why organizations measure agreement between annotators and run quality control. Automatically generated synthetic data can partly supplement labeled data, but where fidelity to reality matters, manual or human-verified labeling remains the reference standard.
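One common way to quantify agreement between two annotators is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The text above does not name a specific measure, so treat the sketch below as one possible choice. The function name and sample labels are illustrative assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both annotators must label the same non-empty set of items")

    # p_o: observed share of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere; kappa is
        # undefined (0/0), and this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for the same six items.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]

kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 4 of 6 items (about 0.67), chance agreement is 0.5, and kappa comes out at roughly 0.33. A value of 1 means perfect agreement and 0 means agreement no better than chance.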