Split strategies
In some cases, default chunking methods aren't enough to handle the complexity of your data. With dense technical documentation, for example, how the text is split can make a big difference in how well the information is indexed and retrieved.
To address this, you can define a split strategy that controls how content is broken into chunks during ingestion. Split strategies are reusable and configurable, allowing you to tailor the chunking process to your specific needs.
Once defined, these strategies can be applied across multiple ingestion jobs, giving you flexibility and consistency in how your data is processed.
What is a split strategy?
A split strategy is a reusable configuration that defines how your data is divided into chunks during ingestion. By customizing a split strategy, you can optimize chunking for different types of content and use cases.
Key components of a split strategy include:
- Max paragraph size: Sets the maximum size (in characters or tokens) for each chunk.
- Custom split: Determines the method used for splitting:
  - Manual splitting: Splits content based on a specified delimiter (default is "\n").
  - LLM splitting: Uses a language model to segment text intelligently, with optional rules to guide the process.
Note that you can only choose one splitting method per strategy, and that you will have to set custom_split to 1 for manual splitting or 2 for LLM splitting.
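The one-method-per-strategy rule can be checked before a configuration is submitted. The following is a minimal sketch, not part of the Nuclia SDK: build_split_strategy is a hypothetical helper, and it assumes that the manual_split and llm_split sections may be omitted when the defaults are wanted. The field names are the ones used in the example configurations on this page.

```python
import json

# Values taken from the note above: 1 selects manual splitting, 2 selects LLM splitting.
MANUAL_SPLIT = 1
LLM_SPLIT = 2


def build_split_strategy(name, custom_split, max_paragraph=None,
                         splitter=None, generative_model=None, rules=None):
    """Return a split strategy configuration as a plain dict (hypothetical helper)."""
    if custom_split not in (MANUAL_SPLIT, LLM_SPLIT):
        raise ValueError("custom_split must be 1 (manual) or 2 (LLM)")

    config = {"name": name, "custom_split": custom_split}
    if max_paragraph is not None:
        config["max_paragraph"] = max_paragraph

    if custom_split == MANUAL_SPLIT:
        # A manual strategy must not carry LLM settings.
        if generative_model is not None or rules is not None:
            raise ValueError("a manual strategy cannot carry LLM settings")
        if splitter is not None:
            config["manual_split"] = {"splitter": splitter}
    else:
        # An LLM strategy must not carry a manual delimiter.
        if splitter is not None:
            raise ValueError("an LLM strategy cannot carry a manual splitter")
        llm_split = {}
        if generative_model is not None:
            llm_split["llm"] = {"generative_model": generative_model}
        if rules:
            llm_split["rules"] = list(rules)
        if llm_split:
            config["llm_split"] = llm_split
    return config


manual = build_split_strategy("manual", MANUAL_SPLIT, splitter="\n")
print(json.dumps(manual))
```

Mixing settings from both methods, or passing a custom_split value other than 1 or 2, raises a ValueError locally instead of producing a configuration that combines the two methods.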
LLM Splitting
LLM splitting leverages a language model to intelligently segment text into meaningful chunks, going beyond simple character or line-based splitting. This approach is ideal when you need context-aware chunking, such as dividing content by sections, topics, or other logical boundaries.
You can enable LLM splitting with default settings, or customize it for your specific needs using these parameters:
- LLM: Specify the language model in the generative_model field. Optionally, provide credentials for your chosen LLM provider in user_keys.
- Rules: Define natural language instructions to guide how the text should be split. For example, you might use rules like "Split at each section heading" or "Separate lists into individual items".
Example configuration:
{
  "name": "llm",
  "max_paragraph": 1000,
  "custom_split": 2,
  "llm_split": {
    "llm": {
      "generative_model": "chatgpt-azure-4o-mini"
    },
    "rules": [
      "Split at each section heading"
    ]
  }
}
Experiment with different rules and models to optimize chunking for your use case. Note that using an LLM may increase processing time and cost, and could slightly alter the extracted text.
Manual Splitting
Manual splitting divides text based on a specified delimiter, making it straightforward and efficient for simpler chunking needs. This method is particularly useful when your content has clear, consistent separators, such as paragraphs or bullet points. You can customize manual splitting using the following parameters:
- Splitter: Choose the character or string that indicates where to split the text. The default is a double newline character ("\n\n"), but you can set it to any delimiter that suits your content structure.
Example configuration:
{
  "name": "manual",
  "custom_split": 1,
  "manual_split": {
    "splitter": "\n"
  }
}
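To get a feel for how the choice of delimiter changes the resulting chunks, the effect can be previewed locally on a sample of your content. This is only an illustration of delimiter-based splitting in plain Python, not the service's actual implementation; split_on_delimiter is a hypothetical helper, and dropping empty pieces is an assumption made for readability.

```python
def split_on_delimiter(text, splitter="\n\n"):
    """Split text on a delimiter and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in text.split(splitter) if piece.strip()]


sample = "Install\nRun the installer.\n\nConfigure\nEdit the settings file."

by_paragraph = split_on_delimiter(sample, "\n\n")
by_line = split_on_delimiter(sample, "\n")

print(len(by_paragraph))  # 2
print(len(by_line))       # 4
```

With the double newline, each heading stays attached to its body text. With the single newline, every line becomes its own chunk, which separates headings from the sentences they introduce.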
Strategy creation and management
Before you can use a split strategy, you need to create it. You can create as many strategies as you want and pick the right one for each processing job. Once created, a strategy cannot be modified, but you can delete it and inspect the strategies that exist for a given KB.
Dashboard
To create a strategy, go to the AI Models section and then to Extract & split. Click the Create configuration button next to Split configurations, then fill in the fields with the desired configuration. The same section lists the strategies you have created and lets you delete the ones you no longer need.
CLI
To create a strategy using the CLI, use the command nuclia kb split_strategies add with the desired configuration in JSON format. You can also list the strategies you have created with nuclia kb split_strategies list, and delete them with nuclia kb split_strategies delete.
nuclia kb split_strategies add --config='{"name":"strategy1","custom_split": 1,"manual_split": {"splitter": "\n"}}'
nuclia kb split_strategies list
nuclia kb split_strategies delete --id=1361c0c7-918a-4a7f-b44b-ba37437619fb
SDK
To create a strategy using the SDK, use the add method of the split_strategies object, passing the desired configuration in JSON format. You can also list the strategies you have created with the list method, and delete them with the delete method.
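from nuclia import sdk

split_strategies = sdk.NucliaKB().split_strategies
print(split_strategies.list())
# Raw string: keeps \n as a JSON escape sequence instead of a literal newline
strategy_id = split_strategies.add(config=r'{"name":"strategy1","custom_split": 1,"manual_split": {"splitter": "\n"}}')
# strategy_id avoids shadowing the built-in id()
split_strategies.delete(id=strategy_id)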
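Hand-written JSON inside a Python string literal is easy to get wrong when the delimiter contains escape sequences. One way to sidestep this, sketched below with the standard json module only, is to build the configuration as a dict and serialize it. The comparison with a hand-typed string is there to show the failure mode; no SDK call is made in this sketch.

```python
import json

# Build the configuration as a dict and let json.dumps handle the escaping.
config = json.dumps({
    "name": "strategy1",
    "custom_split": 1,
    "manual_split": {"splitter": "\n"},
})
print(config)

# The same configuration typed by hand in a regular (non-raw) string literal:
# Python turns \n into a real newline, which is not allowed inside a JSON string.
hand_written = '{"name":"strategy1","custom_split": 1,"manual_split": {"splitter": "\n"}}'
try:
    json.loads(hand_written)
    hand_written_is_valid = True
except json.JSONDecodeError:
    hand_written_is_valid = False

print(hand_written_is_valid)  # False
```

The string held in config is valid JSON and can be passed as the config argument of the add method.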
Use split strategies for processing
Dashboard
To use a split strategy for processing documents, just upload the document normally, enable Customize data extraction and select the strategy you want to use in the dropdown menu. Once the document is uploaded, it will be processed using the selected strategy.
CLI
To use a split strategy for processing documents using the CLI, use the command nuclia kb upload file with the --split_strategy option, passing the ID of the strategy you want to use.
nuclia kb upload file --path=FILE_PATH --split_strategy=1361c0c7-918a-4a7f-b44b-ba37437619fb
SDK
To use a split strategy for processing documents using the SDK, use the file method of the NucliaUpload object, passing the path to the file and the ID of the strategy you want to use.
from nuclia import sdk
upload = sdk.NucliaUpload()
upload.file(path=FILE_PATH, split_strategy="1361c0c7-918a-4a7f-b44b-ba37437619fb")