What is a data pipeline?

Data pipeline

data pipeline, data flow, data pipe

A data pipeline is an ordered sequence of steps through which data flows from its source, via ingestion, cleaning, and processing, to its target: a model or the database behind a retrieval-augmented generation (RAG) system. Each stage passes its result to the next, making the flow repeatable.

An ordered flow of data from the source, through ingestion, cleaning, and transformation, to a target — a model or a vector database.
Each stage takes the output of the previous one and passes its own to the next, making the flow repeatable and able to run automatically.
In RAG, a typical pipeline covers fetching documents, splitting them into chunks, computing embeddings, and writing to a vector database.

A data pipeline is an ordered sequence of steps that carries data from where it originates to where it is used. Typically it covers ingesting data from a source, cleaning and standardizing it, transforming it into the required format, and writing it to its destination: a model, a warehouse, or a database that feeds an AI system. What matters is order and repeatability: each stage takes the output of the previous one, so the whole flow can run repeatedly and automatically, the same way each time.
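The idea of stages feeding one another can be illustrated with a minimal sketch. It assumes a toy input of "name,email" lines; the function names (ingest, clean, transform, run_pipeline) and the sample data are hypothetical and chosen only for illustration, not taken from any particular tool.

```python
from typing import Any, Callable


def ingest(raw_lines: list[str]) -> list[dict]:
    # Ingestion: parse hypothetical "name,email" lines into records.
    records = []
    for line in raw_lines:
        name, email = line.split(",")
        records.append({"name": name, "email": email})
    return records


def clean(records: list[dict]) -> list[dict]:
    # Cleaning: trim whitespace and drop records with no email.
    cleaned = []
    for record in records:
        name, email = record["name"].strip(), record["email"].strip()
        if email:
            cleaned.append({"name": name, "email": email})
    return cleaned


def transform(records: list[dict]) -> list[dict]:
    # Transformation: bring emails into one standard (lowercase) form.
    return [{"name": r["name"], "email": r["email"].lower()} for r in records]


def run_pipeline(data: Any, stages: list[Callable[[Any], Any]]) -> Any:
    # Each stage receives the output of the previous one, in a fixed order.
    for stage in stages:
        data = stage(data)
    return data


STAGES = [ingest, clean, transform]
raw = ["  Alice ,ALICE@EXAMPLE.COM", "Bob,"]
result = run_pipeline(raw, STAGES)
print(result)
```

Because the stages are plain functions held in an ordered list, running the same list on the same input gives the same result every time. A final stage that writes to a destination would be just one more function appended to the list.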

In the context of AI systems, a data pipeline is the layer that prepares the material before it reaches the model. For a RAG architecture, a typical pipeline fetches documents from their sources, splits them into fragments through chunking, computes an embedding for each fragment, and stores them in a vector database. Only a database prepared this way can serve user questions, so the quality and completeness of the pipeline translate directly into what the model receives as context.
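The RAG steps described above (fetch, chunk, embed, store) might be sketched as follows. This is a simplified illustration under loud assumptions: the documents are hard-coded, chunking is a fixed word count, the "embedding" is a toy hashed bag of words rather than a real embedding model, and the "vector database" is an in-memory list. All function names are hypothetical.

```python
import hashlib
import math


def fetch_documents() -> dict[str, str]:
    # Hypothetical source: in practice this would read files, a wiki, an API.
    return {
        "handbook.txt": "Employees may work remotely two days per week. "
                        "Requests go to the team lead.",
        "security.txt": "Passwords must be rotated every ninety days. "
                        "Use the company password manager.",
    }


def chunk(text: str, max_words: int = 8) -> list[str]:
    # Naive chunking: fixed-size groups of words.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def embed(text: str, dim: int = 1024) -> list[float]:
    # Toy stand-in for an embedding model: hashed bag of words, L2-normalised.
    vector = [0.0] * dim
    for word in text.lower().split():
        token = word.strip(".,;:!?")
        if token:
            digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
            vector[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def build_index() -> list[dict]:
    # The pipeline itself: fetch -> chunk -> embed -> store, in that order.
    index = []
    for source, text in fetch_documents().items():
        for position, piece in enumerate(chunk(text)):
            index.append({"source": source, "position": position,
                          "text": piece, "vector": embed(piece)})
    return index


def search(index: list[dict], question: str, top_k: int = 1) -> list[dict]:
    # Rank stored chunks by cosine similarity (vectors are already normalised).
    query = embed(question)
    ranked = sorted(
        index,
        key=lambda entry: sum(a * b for a, b in zip(query, entry["vector"])),
        reverse=True,
    )
    return ranked[:top_k]


index = build_index()
best = search(index, "How often must passwords be rotated?")[0]
print(best["source"], "->", best["text"])
```

The search step is included only to show why the preparation matters: a question can be matched only against chunks the pipeline has already stored, so a document that never passed through it cannot appear in the model's context.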

A data pipeline should not be confused with a single transformation: an individual step, such as chunking alone or computing embeddings, is just one link, whereas the pipeline binds these links into a whole and ensures the data passes through them in a fixed order. In enterprise deployments it is the pipeline that decides whether a new or changed document reaches the system quickly and without manual handling — which is why its stability and monitoring are treated on par with the quality of the model itself.
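One possible way for a pipeline to pick up new or changed documents without manual handling is to compare content fingerprints between runs. The sketch below is an assumption-laden illustration, not a description of how any particular system does it: the helper names fingerprint and sync, and the in-memory dictionary standing in for persisted state, are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    # Content hash used to tell whether a document changed since the last run.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync(documents: dict[str, str], seen: dict[str, str]) -> list[str]:
    # Return the names of new or changed documents and remember their hashes.
    # `seen` stands in for state that a real pipeline would persist.
    to_process = []
    for name, text in documents.items():
        digest = fingerprint(text)
        if seen.get(name) != digest:
            to_process.append(name)
            seen[name] = digest
    return to_process


seen: dict[str, str] = {}
first_run = sync({"a.txt": "v1", "b.txt": "v1"}, seen)   # both are new
second_run = sync({"a.txt": "v1", "b.txt": "v2"}, seen)  # only b.txt changed
print(first_run, second_run)
```

Only the documents returned by sync would need to be re-chunked and re-embedded, and the size of that list on each run is also a simple signal to monitor.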
