Split Text
Before loading text into a vector index, you often need to split it into smaller chunks. Smaller chunks make it easier to find the relevant passages when querying the index, for example for retrieval-augmented generation.
Split Functions
Split functions take a text string as input and return an array of text strings.
type SplitFunction = ({ text }: { text: string }) => string[];
The implementations provided by ModelFusion are factory functions that create a split function for a given configuration.
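Because a split function is just a function of this shape, you can also write your own factory. The following is a minimal sketch that uses the type exactly as shown above; `splitEveryNCharacters` and its `size` option are hypothetical names for illustration and are not part of ModelFusion.

```typescript
type SplitFunction = ({ text }: { text: string }) => string[];

// Hypothetical factory (not a ModelFusion export): cuts the text into
// consecutive windows of at most `size` characters.
function splitEveryNCharacters({ size }: { size: number }): SplitFunction {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("size must be a positive integer");
  }
  return ({ text }) => {
    const windows: string[] = [];
    for (let start = 0; start < text.length; start += size) {
      windows.push(text.slice(start, start + size));
    }
    return windows;
  };
}

// The factory is configured once; the returned function can be reused.
const splitEvery4 = splitEveryNCharacters({ size: 4 });
const fixedWindows = splitEvery4({ text: "abcdefghij" });
// fixedWindows is ["abcd", "efgh", "ij"]
```

Separating configuration (the factory call) from execution (the returned function) is what lets the same split function be passed around and applied to many texts.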
splitOnSeparator
Splits text on a separator string. The separator is omitted from the resulting chunks.
const split = splitOnSeparator({ separator: "\n" });
const result = await split({ text });
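To illustrate what "omitted" means here, the sketch below assumes the behavior matches JavaScript's built-in `String.prototype.split`. That equivalence is an assumption, and `splitOnSeparatorSketch` is a hypothetical stand-in rather than the library's implementation.

```typescript
// Hypothetical stand-in for illustration only; assumes the split behaves
// like String.prototype.split with a string separator.
function splitOnSeparatorSketch({ separator }: { separator: string }) {
  return ({ text }: { text: string }): string[] => text.split(separator);
}

const splitLines = splitOnSeparatorSketch({ separator: "\n" });
const lineChunks = splitLines({ text: "alpha\nbeta\n\ngamma" });
// lineChunks is ["alpha", "beta", "", "gamma"]:
// no chunk contains "\n", and the two adjacent separators yield an empty chunk.
```

Under this assumption, consecutive separators produce empty strings, so you may want to filter those out before indexing.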
splitAtCharacter
Splits text recursively until the resulting chunks are smaller than maxCharactersPerChunk. The text is recursively split in the middle, so that all chunks are roughly the same size.
const split = splitAtCharacter({ maxCharactersPerChunk: 1000 });
const result = await split({ text });
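The recursive halving described above can be sketched as follows. This is an illustrative reimplementation, not ModelFusion's code: `splitInMiddle` is a hypothetical name, and treating the limit as inclusive (a chunk of exactly `maxCharactersPerChunk` characters is accepted) is an assumption.

```typescript
// Hypothetical sketch: halve the text until every piece fits the limit.
function splitInMiddle(text: string, maxCharactersPerChunk: number): string[] {
  if (maxCharactersPerChunk < 1) {
    throw new Error("maxCharactersPerChunk must be at least 1");
  }
  // Assumption: a piece whose length equals the limit is small enough.
  if (text.length <= maxCharactersPerChunk) {
    return [text];
  }
  const middle = Math.ceil(text.length / 2);
  return [
    ...splitInMiddle(text.slice(0, middle), maxCharactersPerChunk),
    ...splitInMiddle(text.slice(middle), maxCharactersPerChunk),
  ];
}

const halves = splitInMiddle("abcdefghij", 3);
// halves is ["abc", "de", "fgh", "ij"]
```

Splitting in the middle instead of cutting fixed-size windows avoids a short leftover chunk at the end: ten characters with a limit of three become chunks of 3, 2, 3, and 2 characters rather than 3, 3, 3, and 1.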
splitAtToken
Splits text recursively until the resulting chunks are smaller than maxTokensPerChunk, while respecting the token boundaries. The text is recursively split in the middle, so that all chunks are roughly the same size.
const split = splitAtToken({
  maxTokensPerChunk: 256,
  // You can get a tokenizer from a model or create it explicitly.
  // The tokenizer must support getting the text for a single token.
  tokenizer: openai.Tokenizer({ model: "gpt-4" }),
});
const result = await split({ text });
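To show why the tokenizer needs to return the text of individual tokens, here is a small sketch of halving on token boundaries. It uses a toy whitespace tokenizer in place of a real model tokenizer; `toyTokenize` and `splitTokensInMiddle` are hypothetical names, and the inclusive limit is an assumption.

```typescript
// Toy tokenizer (assumption, not a real model tokenizer): each word together
// with its trailing whitespace is one token. Leading whitespace is dropped.
function toyTokenize(text: string): string[] {
  return text.match(/\S+\s*/g) ?? [];
}

// Hypothetical sketch: halve the token list until each piece fits the limit,
// then rebuild each chunk's text from the text of its tokens.
function splitTokensInMiddle(
  tokens: string[],
  maxTokensPerChunk: number
): string[] {
  if (maxTokensPerChunk < 1) {
    throw new Error("maxTokensPerChunk must be at least 1");
  }
  if (tokens.length <= maxTokensPerChunk) {
    return [tokens.join("")];
  }
  const middle = Math.ceil(tokens.length / 2);
  return [
    ...splitTokensInMiddle(tokens.slice(0, middle), maxTokensPerChunk),
    ...splitTokensInMiddle(tokens.slice(middle), maxTokensPerChunk),
  ];
}

const tokenChunks = splitTokensInMiddle(
  toyTokenize("one two three four five"),
  2
);
// tokenChunks is ["one two ", "three ", "four five"]
```

Because the cut positions are indexes into the token list, a chunk never ends in the middle of a token, which a character-based split cannot guarantee.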
Splitting Text Chunks
splitTextChunk
The splitTextChunk function splits a single text chunk into multiple smaller text chunks using a split function. Every property of the input chunk other than text is copied unchanged to each output chunk.
This is helpful when you want to retain metadata, e.g. to identify the original source, when ingesting text chunks into a vector index.
Example
// chunks will be of type Array<{ text: string; source: string; }>
const chunks = await splitTextChunk(
  // split function:
  splitAtCharacter({ maxCharactersPerChunk: 1000 }),
  {
    // text property (string) = input to split:
    text: sanFranciscoWikipediaText,
    // other properties are replicated in the output chunks:
    source: "data/san-francisco-wikipedia.json",
  }
);
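The way metadata is carried over can be sketched with an object spread. This is an illustrative reimplementation under the synchronous `SplitFunction` type shown earlier, not the library's code; `splitChunkSketch` and the sample data are hypothetical.

```typescript
type SplitFunction = ({ text }: { text: string }) => string[];

// Hypothetical sketch: copy every property of the input chunk,
// then overwrite `text` with one of the split pieces.
function splitChunkSketch<CHUNK extends { text: string }>(
  split: SplitFunction,
  input: CHUNK
): CHUNK[] {
  return split({ text: input.text }).map((text) => ({ ...input, text }));
}

const splitOnNewline: SplitFunction = ({ text }) => text.split("\n");

const noteChunks = splitChunkSketch(splitOnNewline, {
  text: "first line\nsecond line",
  source: "notes.txt", // made-up metadata for the example
});
// noteChunks is:
// [
//   { text: "first line", source: "notes.txt" },
//   { text: "second line", source: "notes.txt" },
// ]
```

The generic parameter keeps the output type equal to the input type, so properties such as `source` stay typed in the result.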
splitTextChunks
The splitTextChunks function splits many text chunks into multiple smaller text chunks and flattens the result. Otherwise it behaves the same as splitTextChunk.
Example
const inputChunks: Array<{ text: string; source: string }> = [
  // ...
];
// outputChunks will be of type Array<{ text: string; source: string; }>
const outputChunks = await splitTextChunks(
  splitAtCharacter({ maxCharactersPerChunk: 1000 }),
  inputChunks
);
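"Flattens the result" means the output is a single array of chunks, not one array per input chunk. The sketch below illustrates this with a hypothetical `splitChunksSketch` helper and made-up sample data; it is not the library's implementation.

```typescript
type SplitFunction = ({ text }: { text: string }) => string[];

// Hypothetical sketch: split every input chunk and collect all pieces
// in one flat array, keeping each piece's original metadata.
function splitChunksSketch<CHUNK extends { text: string }>(
  split: SplitFunction,
  inputs: CHUNK[]
): CHUNK[] {
  const outputs: CHUNK[] = [];
  for (const input of inputs) {
    for (const text of split({ text: input.text })) {
      outputs.push({ ...input, text });
    }
  }
  return outputs;
}

const splitOnSemicolon: SplitFunction = ({ text }) => text.split(";");

const flatChunks = splitChunksSketch(splitOnSemicolon, [
  { text: "a;b", source: "first.txt" },
  { text: "c", source: "second.txt" },
]);
// flatChunks is:
// [
//   { text: "a", source: "first.txt" },
//   { text: "b", source: "first.txt" },
//   { text: "c", source: "second.txt" },
// ]
```

Input order is preserved in this sketch, so pieces from the same source stay adjacent in the output.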