Split strategies

In some cases, default chunking methods can't handle the complexity of your data. With dense technical documentation, for example, how you split the content can make a big difference in how well the information is indexed and retrieved.

To address this, you can define a split strategy that controls how content is broken into chunks during ingestion. Split strategies are reusable and configurable, allowing you to tailor the chunking process to your specific needs.

Once defined, these strategies can be applied across multiple ingestion jobs, giving you flexibility and consistency in how your data is processed.

What is a split strategy?

A split strategy is a reusable configuration that defines how your data is divided into chunks during ingestion. By customizing a split strategy, you can optimize chunking for different types of content and use cases.

Key components of a split strategy include:

Max paragraph size: Sets the maximum size (in characters or tokens) for each chunk.
Custom split: Determines the method used for splitting:
- Manual splitting: Splits content based on a specified delimiter (default is "\n").
- LLM splitting: Uses a language model to segment text intelligently, with optional rules to guide the process.

Each strategy uses exactly one splitting method: set custom_split to 1 for manual splitting or 2 for LLM splitting.
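The constraint of one method per strategy can be captured in a small helper. This is a minimal sketch, assuming the field names used in this page's example configurations (name, max_paragraph, custom_split, manual_split, llm_split). The function build_split_strategy is hypothetical and not part of the Nuclia SDK, and sending an empty rules list to request default LLM settings is also an assumption.

```python
import json

MANUAL_SPLIT = 1  # custom_split value for manual splitting
LLM_SPLIT = 2     # custom_split value for LLM splitting


def build_split_strategy(name, max_paragraph=None, splitter=None, rules=None):
    """Return a split strategy config dict that uses exactly one method.

    Hypothetical helper: pass `splitter` for manual splitting, or `rules`
    (a list, possibly empty) for LLM splitting, never both.
    """
    if (splitter is None) == (rules is None):
        raise ValueError("choose exactly one of splitter (manual) or rules (LLM)")
    config = {"name": name}
    if max_paragraph is not None:
        config["max_paragraph"] = max_paragraph
    if splitter is not None:
        config["custom_split"] = MANUAL_SPLIT
        config["manual_split"] = {"splitter": splitter}
    else:
        config["custom_split"] = LLM_SPLIT
        config["llm_split"] = {"rules": list(rules)}
    return config


manual = build_split_strategy("manual", splitter="\n")
print(json.dumps(manual))
```

Raising an error when both or neither method is given keeps an invalid combination from ever reaching the ingestion service.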

LLM splitting

LLM splitting leverages a language model to intelligently segment text into meaningful chunks, going beyond simple character or line-based splitting. This approach is ideal when you need context-aware chunking, such as dividing content by sections, topics, or other logical boundaries.

You can enable LLM splitting with default settings, or customize it for your specific needs using these parameters:

LLM: Specify the language model in the generative_model field. Optionally, provide credentials for your chosen LLM provider in user_keys.
Rules: Define natural language instructions to guide how the text should be split. For example, you might use rules like "Split at each section heading" or "Separate lists into individual items".

Example configuration:

{
    "name": "llm",
    "max_paragraph": 1000,
    "custom_split": 2,
    "llm_split": {
        "llm": {
            "generative_model": "chatgpt-azure-4o-mini"
        },
        "rules": [
            "Split at each section heading"
        ]
    }
}

Experiment with different rules and models to optimize chunking for your use case. Note that using an LLM may increase processing time and cost, and could slightly alter the extracted text.
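When experimenting with several rules, it can help to generate the configuration programmatically rather than writing JSON by hand, since json.dumps never emits trailing commas or unbalanced braces. The following sketch only builds and prints the payload. It reuses the field names and model name from the example configuration, and the second rule is one of the sample rules mentioned earlier in this section.

```python
import json

llm_strategy = {
    "name": "llm",
    "max_paragraph": 1000,
    "custom_split": 2,  # 2 selects LLM splitting
    "llm_split": {
        "llm": {"generative_model": "chatgpt-azure-4o-mini"},
        "rules": [
            "Split at each section heading",
            "Separate lists into individual items",
        ],
    },
}

# Serialize to strict JSON, ready to pass as a strategy configuration.
payload = json.dumps(llm_strategy, indent=2)
print(payload)
```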

Manual splitting

Manual splitting divides text based on a specified delimiter, making it straightforward and efficient for simpler chunking needs. This method is particularly useful when your content has clear, consistent separators, such as paragraphs or bullet points. You can customize manual splitting using the following parameters:

Splitter: Choose the character or string that indicates where to split the text. The default is a double newline ("\n\n"), but you can set it to any delimiter that suits your content structure.

Example configuration:

{
    "name": "manual",
    "custom_split": 1,
    "manual_split": {
        "splitter": "\n"
    }
}
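To get a feel for how the choice of delimiter changes the resulting chunks, you can try it locally before creating a strategy. The function below is only an illustration of delimiter-based splitting with an optional character limit. It is not the service's implementation, and how the service treats blank pieces or oversized chunks is an assumption here.

```python
def manual_split(text, splitter="\n", max_paragraph=None):
    """Split text on splitter, dropping empty pieces.

    If max_paragraph is set, pieces longer than that many characters
    are cut into consecutive slices of at most max_paragraph characters.
    """
    chunks = []
    for piece in text.split(splitter):
        piece = piece.strip()
        if not piece:
            continue
        if max_paragraph is None:
            chunks.append(piece)
        else:
            for start in range(0, len(piece), max_paragraph):
                chunks.append(piece[start:start + max_paragraph])
    return chunks


sample = "Install the package.\nConfigure the API key.\n\nRun the first ingestion job."
print(manual_split(sample, splitter="\n"))    # one chunk per line
print(manual_split(sample, splitter="\n\n"))  # one chunk per blank-line-separated block
```

With "\n" every line becomes its own chunk, while "\n\n" keeps the first two lines together, which is usually closer to paragraph-level chunking.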

Strategy creation and management

Before you can use a split strategy, you need to create it. You can create as many strategies as you want and pick the right one for each processing job. Once created, a strategy cannot be modified, but you can list the strategies in a given Knowledge Box and delete the ones you no longer need.

Dashboard

To create a strategy, go to the AI Models section and open Extract & split. Click the Create configuration button next to Split configurations, then fill in the fields with the desired configuration. The same section lists the strategies you have created and lets you delete the ones you no longer need.

CLI

To create a strategy using the CLI, you can use the command nuclia kb split_strategies add with the desired configuration in JSON format. You can also list the strategies you have created with nuclia kb split_strategies list, and delete them with nuclia kb split_strategies delete.

nuclia kb split_strategies add --config='{"name":"strategy1","custom_split": 1,"manual_split": {"splitter": "\n"}}'
nuclia kb split_strategies list
nuclia kb split_strategies delete --id=1361c0c7-918a-4a7f-b44b-ba37437619fb

SDK

To create a strategy using the SDK, you can use the add method of the split_strategies object, passing the desired configuration in JSON format. You can also list the strategies you have created with the list method, and delete them with the delete method.

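from nuclia import sdk

split_strategies = sdk.NucliaKB().split_strategies
# List the existing strategies
print(split_strategies.list())
# Raw string so that \n stays a JSON escape sequence instead of becoming a literal newline
strategy_id = split_strategies.add(config=r'{"name":"strategy1","custom_split": 1,"manual_split": {"splitter": "\n"}}')
# Delete the strategy by ID
split_strategies.delete(id=strategy_id)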
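One detail worth checking when you pass the configuration as a JSON string from Python: in a regular (non-raw) string literal, "\n" is a real newline character, and strict JSON parsers reject raw control characters inside string values. Building the string with json.dumps sidesteps the issue. The sketch below uses only the standard library. Whether the SDK parses the string strictly is an assumption.

```python
import json

config = {"name": "strategy1", "custom_split": 1, "manual_split": {"splitter": "\n"}}

# json.dumps escapes the newline as the two characters backslash + n.
config_json = json.dumps(config)
print(config_json)

# A regular string literal holds a real newline instead, which
# json.loads rejects inside a string value (strict mode is the default).
unescaped = '{"manual_split": {"splitter": "\n"}}'
try:
    json.loads(unescaped)
except json.JSONDecodeError as err:
    print("invalid JSON:", err.msg)
```

The resulting config_json string can then be passed as the config argument in the same way as the hand-written string in the example above.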

Use split strategies for processing

Dashboard

To use a split strategy for processing documents, just upload the document normally, enable Customize data extraction and select the strategy you want to use in the dropdown menu. Once the document is uploaded, it will be processed using the selected strategy.

CLI

To use a split strategy for processing documents using the CLI, you can use the command nuclia kb upload file with the --split_strategy option, passing the ID of the strategy you want to use.

nuclia kb upload file --path=FILE_PATH --split_strategy=1361c0c7-918a-4a7f-b44b-ba37437619fb

SDK

To use a split strategy for processing documents using the SDK, you can use the file method of the NucliaUpload object, passing the path to the file and the ID of the strategy you want to use.

from nuclia import sdk

FILE_PATH = "path/to/your/file.pdf"  # placeholder: path of the file to upload
upload = sdk.NucliaUpload()
upload.file(path=FILE_PATH, split_strategy="1361c0c7-918a-4a7f-b44b-ba37437619fb")
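Because a strategy is reusable, the same ID can be applied to a whole batch of documents. Here is a minimal sketch, assuming only the file(path=..., split_strategy=...) call shown above. The helper upload_folder and its pattern parameter are hypothetical, and the uploader is passed in as an argument so the function can be tried with a stand-in object.

```python
from pathlib import Path


def upload_folder(upload, folder, split_strategy, pattern="*.pdf"):
    """Upload every file in folder matching pattern with one split strategy.

    `upload` is any object exposing file(path=..., split_strategy=...),
    for example sdk.NucliaUpload(). Returns the uploaded file names.
    """
    uploaded = []
    for path in sorted(Path(folder).glob(pattern)):
        upload.file(path=str(path), split_strategy=split_strategy)
        uploaded.append(path.name)
    return uploaded
```

A call such as upload_folder(sdk.NucliaUpload(), "docs", "1361c0c7-918a-4a7f-b44b-ba37437619fb") would then process every PDF in the docs folder with the same strategy.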
