This version is still in development and is not yet considered stable. For the latest stable version, use Spring AI 1.1.2.

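ETL Pipeline

The Extract, Transform, and Load (ETL) framework serves as the backbone of data processing within the Retrieval Augmented Generation (RAG) use case.

The ETL pipeline orchestrates the flow from raw data sources to a structured vector store, ensuring that data is in the optimal format for retrieval by the AI model.

The RAG use case augments the capabilities of generative models by retrieving relevant information from a body of data, which improves the quality and relevance of the generated output.

API Overview

The ETL pipeline creates, transforms, and stores Document instances.

The Document class contains text, metadata, and optionally additional media types such as images, audio, and video.

The ETL pipeline has three main components:

DocumentReader, which implements Supplier<List<Document>>
DocumentTransformer, which implements Function<List<Document>, List<Document>>
DocumentWriter, which implements Consumer<List<Document>>

The content of a Document is created from PDFs, text files, and other document types with the help of a DocumentReader.

To construct a simple ETL pipeline, chain together an instance of each type.

Suppose we have the following instances of these three ETL types:

PagePdfDocumentReader: an implementation of DocumentReader
TokenTextSplitter: an implementation of DocumentTransformer
VectorStore: an implementation of DocumentWriter

To perform a basic load of data into a vector database for use with the Retrieval Augmented Generation pattern, use the following code in Java functional-style syntax: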

vectorStore.accept(tokenTextSplitter.apply(pdfReader.get()));

Alternatively, you can use method names that express the domain more naturally:

vectorStore.write(tokenTextSplitter.split(pdfReader.read()));
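The chaining works because the three ETL types are ordinary java.util.function interfaces. The following is a minimal sketch of that idea using only the JDK: plain strings stand in for Document instances, and the class EtlChainSketch and its word-splitting transformer are hypothetical names invented for illustration, not part of Spring AI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Stand-in pipeline: plain strings play the role of Document instances.
final class EtlChainSketch {

    // Runs reader -> transformer -> writer, mirroring
    // writer.accept(transformer.apply(reader.get())).
    static void run(Supplier<List<String>> reader,
                    Function<List<String>, List<String>> transformer,
                    Consumer<List<String>> writer) {
        writer.accept(transformer.apply(reader.get()));
    }

    // Hypothetical transformer: splits every input string on single spaces.
    static List<String> splitWords(List<String> docs) {
        List<String> out = new ArrayList<>();
        for (String doc : docs) {
            out.addAll(List.of(doc.split(" ")));
        }
        return out;
    }

    // Convenience for the demo: collects everything the writer receives.
    static List<String> runAndCollect(List<String> input) {
        List<String> store = new ArrayList<>();
        run(() -> input, EtlChainSketch::splitWords, store::addAll);
        return store;
    }

    public static void main(String[] args) {
        System.out.println(runAndCollect(List.of("hello world", "spring ai")));
    }
}
```

Because each stage only depends on the functional interface, any stage can be swapped for a lambda or method reference in tests.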

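ETL Interfaces

The ETL pipeline is composed of the following interfaces and implementations. A detailed ETL class diagram is shown in the ETL Class Diagram section.

DocumentReader

Provides a source of documents from diverse origins.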

public interface DocumentReader extends Supplier<List<Document>> {

    default List<Document> read() {
		return get();
	}
}

DocumentTransformer

Transforms a batch of documents as part of a processing workflow.

public interface DocumentTransformer extends Function<List<Document>, List<Document>> {

    default List<Document> transform(List<Document> transform) {
		return apply(transform);
	}
}

DocumentWriter

Manages the final stage of the ETL process, preparing documents for storage.

public interface DocumentWriter extends Consumer<List<Document>> {

    default void write(List<Document> documents) {
		accept(documents);
	}
}

ETL Class Diagram

The following class diagram illustrates the ETL interfaces and implementations.

DocumentReaders

JSON

The JsonReader processes JSON documents, converting them into a list of Document objects.

Example

@Component
class MyJsonReader {

	private final Resource resource;

    MyJsonReader(@Value("classpath:bikes.json") Resource resource) {
        this.resource = resource;
    }

	List<Document> loadJsonAsDocuments() {
        JsonReader jsonReader = new JsonReader(this.resource, "description", "content");
        return jsonReader.get();
	}
}

Constructor Options

The JsonReader provides several constructor options:

JsonReader(Resource resource)
JsonReader(Resource resource, String... jsonKeysToUse)
JsonReader(Resource resource, JsonMetadataGenerator jsonMetadataGenerator, String... jsonKeysToUse)

Parameters

resource: A Spring Resource object pointing to the JSON file.
jsonKeysToUse: An array of keys from the JSON whose values are used as the text content of the resulting Document objects.
jsonMetadataGenerator: An optional JsonMetadataGenerator to create metadata for each Document.

Behavior

The JsonReader processes JSON content as follows:

It can handle both JSON arrays and single JSON objects.
For each JSON object (either an array element or a single object):
- It extracts the content based on the specified jsonKeysToUse.
- If no keys are specified, it uses the entire JSON object as content.
- It generates metadata using the provided JsonMetadataGenerator (or an empty one if none is provided).
- It creates a Document object with the extracted content and metadata.

Using JSON Pointers

The JsonReader supports retrieving specific parts of a JSON document using JSON Pointers. This feature makes it easy to extract nested data from complex JSON structures.

The `get(String pointer)` method

public List<Document> get(String pointer)

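This method lets you use a JSON Pointer to retrieve a specific part of the JSON document.

Parameters

pointer: A JSON Pointer string (as defined in RFC 6901) that locates the desired element within the JSON structure.

Return Value

Returns a List<Document> containing the documents parsed from the JSON element located by the pointer.

Behavior

The method uses the provided JSON Pointer to navigate to a specific location in the JSON structure.
If the pointer is valid and points to an existing element:
- For a JSON object: it returns a list with a single Document.
- For a JSON array: it returns a list of Documents, one for each element in the array.
If the pointer is invalid or points to a non-existent element, it throws an IllegalArgumentException.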
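To make the pointer syntax concrete, here is a minimal sketch of how an RFC 6901 pointer string breaks down into reference tokens. It uses only the JDK and is not how JsonReader is implemented (the reader relies on Jackson); the class JsonPointerSketch is a hypothetical name used for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class JsonPointerSketch {

    // Splits an RFC 6901 pointer into reference tokens,
    // decoding "~1" to "/" and "~0" to "~".
    static List<String> tokens(String pointer) {
        List<String> result = new ArrayList<>();
        if (pointer.isEmpty()) {
            return result; // the empty pointer refers to the whole document
        }
        if (!pointer.startsWith("/")) {
            throw new IllegalArgumentException("Invalid JSON Pointer: " + pointer);
        }
        for (String raw : pointer.substring(1).split("/", -1)) {
            // Order matters: decode "~1" first, then "~0" (RFC 6901, section 4).
            result.add(raw.replace("~1", "/").replace("~0", "~"));
        }
        return result;
    }

    public static void main(String[] args) {
        // "store" and "books" are object keys; "0" selects the first array element.
        System.out.println(tokens("/store/books/0"));
    }
}
```

Each token is applied in turn: a token names a key when the current element is an object and an index when it is an array.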

Example

JsonReader jsonReader = new JsonReader(resource, "description");
List<Document> documents = jsonReader.get("/store/books/0");

Example JSON Structure

[
  {
    "id": 1,
    "brand": "Trek",
    "description": "A high-performance mountain bike for trail riding."
  },
  {
    "id": 2,
    "brand": "Cannondale",
    "description": "An aerodynamic road bike for racing enthusiasts."
  }
]

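In this example, if the JsonReader is configured with "description" as the jsonKeysToUse, it creates Document objects whose content is the value of the "description" field of each bike in the array.

Notes

The JsonReader uses Jackson for JSON parsing.
It can handle large JSON files efficiently by streaming arrays.
If multiple keys are specified in jsonKeysToUse, the content is the concatenation of the values for those keys.
The reader is flexible and can be adapted to various JSON structures by customizing the jsonKeysToUse and the JsonMetadataGenerator.

Text

The TextReader processes plain text documents, converting them into a list of Document objects.

Example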

@Component
class MyTextReader {

    private final Resource resource;

    MyTextReader(@Value("classpath:text-source.txt") Resource resource) {
        this.resource = resource;
    }

	List<Document> loadText() {
		TextReader textReader = new TextReader(this.resource);
		textReader.getCustomMetadata().put("filename", "text-source.txt");

		return textReader.read();
    }
}

Constructor Options

The TextReader provides two constructor options:

TextReader(String resourceUrl)
TextReader(Resource resource)

Parameters

resourceUrl: A string representing the URL of the resource to read.
resource: A Spring Resource object pointing to the text file.

Configuration

setCharset(Charset charset): Sets the character set used to read the text file. The default is UTF-8.
getCustomMetadata(): Returns a mutable map to which you can add custom metadata for the documents.

Behavior

The TextReader processes text content as follows:

It reads the entire content of the text file into a single Document object.
The content of the file becomes the content of the Document.
Metadata is automatically added to the Document:
- charset: The character set used to read the file (default: "UTF-8").
- source: The filename of the source text file.
Any custom metadata added via getCustomMetadata() is included in the Document.

Notes

The TextReader reads the entire file content into memory, so it may not be suitable for very large files.
If you need to split the text into smaller chunks, you can use a text splitter such as TokenTextSplitter after reading the document:

List<Document> documents = textReader.get();
List<Document> splitDocuments = new TokenTextSplitter().apply(documents);

The reader uses Spring's Resource abstraction, so it can read from various sources (classpath, file system, URL, and so on).
Custom metadata can be added to all documents created by the reader using the getCustomMetadata() method.

HTML (JSoup)

The JsoupDocumentReader processes HTML documents, converting them into a list of Document objects using the JSoup library.

Example

@Component
class MyHtmlReader {

    private final Resource resource;

    MyHtmlReader(@Value("classpath:/my-page.html") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadHtml() {
        JsoupDocumentReaderConfig config = JsoupDocumentReaderConfig.builder()
            .selector("article p") // Extract paragraphs within <article> tags
            .charset("ISO-8859-1")  // Use ISO-8859-1 encoding
            .includeLinkUrls(true) // Include link URLs in metadata
            .metadataTags(List.of("author", "date")) // Extract author and date meta tags
            .additionalMetadata("source", "my-page.html") // Add custom metadata
            .build();

        JsoupDocumentReader reader = new JsoupDocumentReader(this.resource, config);
        return reader.get();
    }
}

The JsoupDocumentReaderConfig lets you customize the behavior of the JsoupDocumentReader:

charset: Specifies the character encoding of the HTML document (defaults to "UTF-8").
selector: A JSoup CSS selector that specifies which elements to extract text from (defaults to "body").
separator: The string used to join text from multiple selected elements (defaults to "\n").
allElements: If true, extracts all text from the <body> element, ignoring the selector (defaults to false).
groupByElement: If true, creates a separate Document for each element matched by the selector (defaults to false).
includeLinkUrls: If true, extracts absolute link URLs and adds them to the metadata (defaults to false).
metadataTags: A list of <meta> tag names to extract content from (defaults to ["description", "keywords"]).
additionalMetadata: Lets you add custom metadata to all created Document objects.

Sample Document: my-page.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>My Web Page</title>
    <meta name="description" content="A sample web page for Spring AI">
    <meta name="keywords" content="spring, ai, html, example">
    <meta name="author" content="John Doe">
    <meta name="date" content="2024-01-15">
    <link rel="stylesheet" href="style.css">
</head>
<body>
    <header>
        <h1>Welcome to My Page</h1>
    </header>
    <nav>
        <ul>
            <li><a href="/">Home</a></li>
            <li><a href="/about">About</a></li>
        </ul>
    </nav>
    <article>
        <h2>Main Content</h2>
        <p>This is the main content of my web page.</p>
        <p>It contains multiple paragraphs.</p>
        <a href="https://www.example.com">External Link</a>
    </article>
    <footer>
        <p>&copy; 2024 John Doe</p>
    </footer>
</body>
</html>

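Behavior:

The JsoupDocumentReader processes the HTML content and creates Document objects based on the configuration:

The selector determines which elements are used for text extraction.
If allElements is true, all text within the <body> is extracted into a single Document.
If groupByElement is true, each element matching the selector creates a separate Document.
If neither allElements nor groupByElement is true, the text from all elements matching the selector is joined using the separator.
The document title, the content of the specified <meta> tags, and (optionally) link URLs are added to the Document metadata.
The base URI used to resolve relative links is extracted from URL resources.

The reader preserves the text content of the selected elements but removes any HTML tags within them.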
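The three extraction modes described above can be summarized as a small decision rule. The sketch below works on already-extracted strings instead of parsed HTML, so it does not use JSoup at all; the class HtmlGroupingSketch is hypothetical, and checking allElements before groupByElement is an assumption about precedence when both flags are set.

```java
import java.util.List;

final class HtmlGroupingSketch {

    // Returns the text of each resulting document.
    // bodyText: all text inside <body>; selectedTexts: text of each element matched by the selector.
    static List<String> documentTexts(String bodyText, List<String> selectedTexts,
                                      boolean allElements, boolean groupByElement,
                                      String separator) {
        if (allElements) {
            // Assumed precedence: allElements wins and the selector is ignored.
            return List.of(bodyText);
        }
        if (groupByElement) {
            // One document per matched element.
            return List.copyOf(selectedTexts);
        }
        // Default: a single document with the matched texts joined by the separator.
        return List.of(String.join(separator, selectedTexts));
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of(
                "This is the main content of my web page.",
                "It contains multiple paragraphs.");
        System.out.println(documentTexts("(body text)", paragraphs, false, false, "\n"));
    }
}
```

With the sample page and the selector "article p", the default mode would yield one document containing both paragraphs, while groupByElement would yield two.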

Markdown

The MarkdownDocumentReader processes Markdown documents, converting them into a list of Document objects.

Example

@Component
class MyMarkdownReader {

    private final Resource resource;

    MyMarkdownReader(@Value("classpath:code.md") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadMarkdown() {
        MarkdownDocumentReaderConfig config = MarkdownDocumentReaderConfig.builder()
            .withHorizontalRuleCreateDocument(true)
            .withIncludeCodeBlock(false)
            .withIncludeBlockquote(false)
            .withAdditionalMetadata("filename", "code.md")
            .build();

        MarkdownDocumentReader reader = new MarkdownDocumentReader(this.resource, config);
        return reader.get();
    }
}

The MarkdownDocumentReaderConfig lets you customize the behavior of the MarkdownDocumentReader:

horizontalRuleCreateDocument: When set to true, horizontal rules in the Markdown create new Document objects.
includeCodeBlock: When set to true, code blocks are included in the same Document as the surrounding text. When false, code blocks create separate Document objects.
includeBlockquote: When set to true, blockquotes are included in the same Document as the surrounding text. When false, blockquotes create separate Document objects.
additionalMetadata: Lets you add custom metadata to all created Document objects.

Sample Document: code.md

This is a Java sample application:

```java
package com.example.demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}
```

Markdown also provides the possibility to `use inline code formatting throughout` the entire sentence.

---

Another possibility is to set block code without specific highlighting:

```
./mvnw spring-javaformat:apply
```

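Behavior: The MarkdownDocumentReader processes the Markdown content and creates Document objects based on the configuration:

Headers become metadata in the Document objects.
Paragraphs become the content of Document objects.
Code blocks can be separated into their own Document objects or included with the surrounding text.
Blockquotes can be separated into their own Document objects or included with the surrounding text.
Horizontal rules can be used to split the content into separate Document objects.

The reader preserves formatting such as inline code, lists, and text styling within the content of the Document objects.

PDF Page

The PagePdfDocumentReader uses the Apache PdfBox library to parse PDF documents.

Add the dependency to your project using Maven or Gradle.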

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>

Or add it to your Gradle build.gradle build file:

dependencies {
    implementation 'org.springframework.ai:spring-ai-pdf-document-reader'
}

Example

@Component
public class MyPagePdfDocumentReader {

	List<Document> getDocsFromPdf() {

		PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample1.pdf",
				PdfDocumentReaderConfig.builder()
					.withPageTopMargin(0)
					.withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
						.withNumberOfTopTextLinesToDelete(0)
						.build())
					.withPagesPerDocument(1)
					.build());

		return pdfReader.read();
    }

}

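PDF Paragraph

The ParagraphPdfDocumentReader uses PDF catalog (for example, TOC) information to split the input PDF into text paragraphs and output a single Document per paragraph. Note: Not all PDF documents contain a PDF catalog.

Dependencies

Add the dependency to your project using Maven or Gradle.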

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>

Or add it to your Gradle build.gradle build file:

dependencies {
    implementation 'org.springframework.ai:spring-ai-pdf-document-reader'
}

Example

@Component
public class MyPagePdfDocumentReader {

	List<Document> getDocsFromPdfWithCatalog() {

        ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader("classpath:/sample1.pdf",
                PdfDocumentReaderConfig.builder()
                    .withPageTopMargin(0)
                    .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                        .withNumberOfTopTextLinesToDelete(0)
                        .build())
                    .withPagesPerDocument(1)
                    .build());

	    return pdfReader.read();
    }
}

Tika (DOCX, PPTX, HTML…)

The TikaDocumentReader uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of supported formats, refer to the Apache Tika documentation.

Dependencies

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>

Or add it to your Gradle build.gradle build file:

dependencies {
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'
}

Example

@Component
class MyTikaDocumentReader {

    private final Resource resource;

    MyTikaDocumentReader(@Value("classpath:/word-sample.docx")
                            Resource resource) {
        this.resource = resource;
    }

    List<Document> loadText() {
        TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(this.resource);
        return tikaDocumentReader.read();
    }
}

Transformers

TextSplitter

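The TextSplitter is an abstract base class that helps divide documents so that they fit the AI model's context window.

TokenTextSplitter

The TokenTextSplitter is an implementation of TextSplitter that splits text into chunks based on token count, using the CL100K_BASE encoding.

Usage

Basic Usage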

@Component
class MyTokenTextSplitter {

    public List<Document> splitDocuments(List<Document> documents) {
        TokenTextSplitter splitter = new TokenTextSplitter();
        return splitter.apply(documents);
    }

    public List<Document> splitCustomized(List<Document> documents) {
        TokenTextSplitter splitter = new TokenTextSplitter(1000, 400, 10, 5000, true, List.of('.', '?', '!', '\n'));
        return splitter.apply(documents);
    }
}

Using the Builder Pattern

The recommended way to create a TokenTextSplitter is the builder pattern, which provides a more readable and flexible API:

@Component
class MyTokenTextSplitter {

    public List<Document> splitWithBuilder(List<Document> documents) {
        TokenTextSplitter splitter = TokenTextSplitter.builder()
            .withChunkSize(1000)
            .withMinChunkSizeChars(400)
            .withMinChunkLengthToEmbed(10)
            .withMaxNumChunks(5000)
            .withKeepSeparator(true)
            .build();

        return splitter.apply(documents);
    }
}

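Custom Punctuation Marks

You can customize the punctuation marks used to split text into semantically meaningful chunks. This is particularly useful for internationalization: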

@Component
class MyInternationalTextSplitter {

    public List<Document> splitChineseText(List<Document> documents) {
        // Use Chinese punctuation marks
        TokenTextSplitter splitter = TokenTextSplitter.builder()
            .withChunkSize(800)
            .withMinChunkSizeChars(350)
            .withPunctuationMarks(List.of('。', '？', '！', '；'))  // Chinese punctuation
            .build();

        return splitter.apply(documents);
    }

    public List<Document> splitWithCustomMarks(List<Document> documents) {
        // Mix of English and other punctuation marks
        TokenTextSplitter splitter = TokenTextSplitter.builder()
            .withChunkSize(800)
            .withPunctuationMarks(List.of('.', '?', '!', '\n', ';', ':', '。'))
            .build();

        return splitter.apply(documents);
    }
}

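Constructor Options

The TokenTextSplitter provides three constructor options:

TokenTextSplitter(): Creates a splitter with default settings.
TokenTextSplitter(boolean keepSeparator): Creates a splitter with custom separator behavior.
TokenTextSplitter(int chunkSize, int minChunkSizeChars, int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator, List<Character> punctuationMarks): The full constructor with all customization options.

The builder pattern (shown above) is the recommended approach for creating instances with custom configurations.

Parameters

chunkSize: The target size of each text chunk in tokens (default: 800).
minChunkSizeChars: The minimum size of each text chunk in characters (default: 350).
minChunkLengthToEmbed: The minimum length of a chunk to be included (default: 5).
maxNumChunks: The maximum number of chunks to generate from a text (default: 10000).
keepSeparator: Whether to keep separators (such as newlines) in the chunks (default: true).
punctuationMarks: The list of characters to use as sentence boundaries when splitting (default: ., ?, !, \n).

Behavior

The TokenTextSplitter processes text content as follows:

It encodes the input text into tokens using the CL100K_BASE encoding.
It splits the encoded text into chunks based on the chunkSize.
For each chunk:
1. It decodes the chunk back into text.
2. Only if the total token count exceeds the chunk size, it attempts to find a suitable break point (using the configured punctuationMarks) after the minChunkSizeChars.
3. If a break point is found, it truncates the chunk at that point.
4. It trims the chunk and optionally removes newline characters based on the keepSeparator setting.
5. If the resulting chunk is longer than minChunkLengthToEmbed, it is added to the output.
This process continues until all tokens are processed or maxNumChunks is reached.
Any remaining text is added as a final chunk if it is longer than minChunkLengthToEmbed.

Punctuation-based splitting only applies when the token count exceeds the chunk size. Text that exactly matches the chunk size or is smaller is returned as a single chunk without punctuation-based truncation. This prevents small texts from being split unnecessarily.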
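Steps 2 and 3 above can be sketched with plain string operations. This is a simplified illustration under stated assumptions, not the library's code: it skips tokenization entirely and operates on an already-decoded chunk, and the class PunctuationBreakSketch and its method names are hypothetical.

```java
import java.util.List;

final class PunctuationBreakSketch {

    // Index of the last occurrence of any configured punctuation mark, or -1 if none is present.
    static int lastPunctuationIndex(String text, List<Character> marks) {
        int last = -1;
        for (char mark : marks) {
            last = Math.max(last, text.lastIndexOf(mark));
        }
        return last;
    }

    // Truncates a decoded chunk at the last punctuation mark found after minChunkSizeChars.
    static String truncate(String chunkText, int minChunkSizeChars, List<Character> marks) {
        int index = lastPunctuationIndex(chunkText, marks);
        if (index != -1 && index > minChunkSizeChars) {
            // Keep the punctuation mark itself at the end of the chunk.
            return chunkText.substring(0, index + 1);
        }
        // No suitable break point: keep the chunk as decoded.
        return chunkText;
    }

    public static void main(String[] args) {
        List<Character> marks = List.of('.', '?', '!', '\n');
        System.out.println(truncate("First sentence. Second sentence without end", 5, marks));
    }
}
```

The text cut off after the break point is not lost in the real splitter; it is picked up again when the next chunk is built from the remaining tokens.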

Example

Document doc1 = new Document("This is a long piece of text that needs to be split into smaller chunks for processing.",
        Map.of("source", "example.txt"));
Document doc2 = new Document("Another document with content that will be split based on token count.",
        Map.of("source", "example2.txt"));

TokenTextSplitter splitter = new TokenTextSplitter();
List<Document> splitDocuments = splitter.apply(List.of(doc1, doc2));

for (Document doc : splitDocuments) {
    System.out.println("Chunk: " + doc.getContent());
    System.out.println("Metadata: " + doc.getMetadata());
}

Notes

The TokenTextSplitter uses the CL100K_BASE encoding from the jtokkit library, which is compatible with newer OpenAI models.
The splitter attempts to create semantically meaningful chunks by breaking at sentence boundaries where possible.
Metadata from the original documents is preserved and copied to all chunks derived from that document.
If copyContentFormatter is set to true (the default behavior), the content formatter of the original document (if set) is also copied to the derived chunks.
This splitter is particularly useful for preparing text for large language models that have token limits, ensuring that each chunk fits within the model's processing capacity.
Custom Punctuation: The default punctuation marks (., ?, !, \n) work well for English text. For other languages or specialized content, customize the punctuation marks using the builder's withPunctuationMarks() method.
Performance Consideration: While the splitter can handle any number of punctuation marks, it is recommended to keep the list reasonably small (under 20 characters) for optimal performance, as each mark is checked for every chunk.
Extensibility: The getLastPunctuationIndex(String) method is protected, allowing subclasses to override the punctuation detection logic for specialized use cases.
Small Text Handling: As of version 2.0, small texts (with a token count at or below the chunk size) are no longer split at punctuation marks, preventing unnecessary fragmentation of content that already fits within the size limits.

ContentFormatTransformer

Ensures uniform content formats across all documents.

KeywordMetadataEnricher

The KeywordMetadataEnricher is a DocumentTransformer that uses a generative AI model to extract keywords from document content and add them as metadata.

Usage

@Component
class MyKeywordEnricher {

    private final ChatModel chatModel;

    MyKeywordEnricher(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    List<Document> enrichDocuments(List<Document> documents) {
        KeywordMetadataEnricher enricher = KeywordMetadataEnricher.builder(chatModel)
                .keywordCount(5)
                .build();

        // Or use a custom template instead (YOUR_CUSTOM_TEMPLATE is a placeholder):
        // KeywordMetadataEnricher enricher = KeywordMetadataEnricher.builder(chatModel)
        //         .keywordsTemplate(YOUR_CUSTOM_TEMPLATE)
        //         .build();

        return enricher.apply(documents);
    }
}

Constructor Options

The KeywordMetadataEnricher provides two constructor options:

KeywordMetadataEnricher(ChatModel chatModel, int keywordCount): Uses the default template and extracts the specified number of keywords.
KeywordMetadataEnricher(ChatModel chatModel, PromptTemplate keywordsTemplate): Uses a custom template for keyword extraction.

Behavior

The KeywordMetadataEnricher processes documents as follows:

For each input document, it creates a prompt using the document's content.
It sends this prompt to the provided ChatModel to generate keywords.
The generated keywords are added to the document's metadata under the key "excerpt_keywords".
The enriched documents are returned.

Customization

You can use the default template or customize the template through the keywordsTemplate parameter. The default template is:

{context_str}. Give %s unique keywords for this document. Format as comma separated. Keywords:

Here, {context_str} is replaced with the document content, and %s is replaced with the specified keyword count.
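As a rough illustration of the two substitutions, the sketch below fills the default template with plain string operations. It does not use Spring AI's PromptTemplate, and the class KeywordTemplateSketch and its render method are hypothetical names chosen for this example.

```java
final class KeywordTemplateSketch {

    // The default template text quoted above.
    static final String TEMPLATE =
            "{context_str}. Give %s unique keywords for this document. Format as comma separated. Keywords:";

    // Fills in the keyword count first and the document content second, so that a '%'
    // inside the document text is never interpreted as a format specifier.
    static String render(String documentText, int keywordCount) {
        return String.format(TEMPLATE, keywordCount).replace("{context_str}", documentText);
    }

    public static void main(String[] args) {
        System.out.println(render("AI in healthcare", 5));
    }
}
```

A custom template only needs to keep the {context_str} placeholder; the keyword count can then be written directly into the template text.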

Example

ChatModel chatModel = // initialize your chat model
KeywordMetadataEnricher enricher = KeywordMetadataEnricher.builder(chatModel)
                .keywordCount(5)
                .build();

// Or use a custom template instead:
// KeywordMetadataEnricher enricher = KeywordMetadataEnricher.builder(chatModel)
//                 .keywordsTemplate(new PromptTemplate("Extract 5 important keywords from the following text and separate them with commas:\n{context_str}"))
//                 .build();

Document doc = new Document("This is a document about artificial intelligence and its applications in modern technology.");

List<Document> enrichedDocs = enricher.apply(List.of(doc));

Document enrichedDoc = enrichedDocs.get(0);
String keywords = (String) enrichedDoc.getMetadata().get("excerpt_keywords");
System.out.println("Extracted keywords: " + keywords);

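Notes

The KeywordMetadataEnricher requires a functioning ChatModel to generate keywords.
The keyword count must be 1 or greater.
The enricher adds the "excerpt_keywords" metadata field to each processed document.
The generated keywords are returned as a comma-separated string.
This enricher is particularly useful for improving document searchability and for generating tags or categories for documents.
In the builder pattern, if the keywordsTemplate parameter is set, the keywordCount parameter is ignored.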

SummaryMetadataEnricher

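The SummaryMetadataEnricher is a DocumentTransformer that uses a generative AI model to create summaries for documents and add them as metadata. It can generate summaries for the current document as well as for adjacent documents (previous and next).

Usage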

@Configuration
class EnricherConfig {

    @Bean
    public SummaryMetadataEnricher summaryMetadata(OpenAiChatModel aiClient) {
        return new SummaryMetadataEnricher(aiClient,
            List.of(SummaryType.PREVIOUS, SummaryType.CURRENT, SummaryType.NEXT));
    }
}

@Component
class MySummaryEnricher {

    private final SummaryMetadataEnricher enricher;

    MySummaryEnricher(SummaryMetadataEnricher enricher) {
        this.enricher = enricher;
    }

    List<Document> enrichDocuments(List<Document> documents) {
        return this.enricher.apply(documents);
    }
}

Constructors

The SummaryMetadataEnricher provides two constructors:

SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes)
SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes, String summaryTemplate, MetadataMode metadataMode)

Parameters

chatModel: The AI model used to generate summaries.
summaryTypes: A list of SummaryType enum values indicating which summaries to generate (PREVIOUS, CURRENT, NEXT).
summaryTemplate: A custom template for summary generation (optional).
metadataMode: Specifies how to handle document metadata when generating summaries (optional).

Behavior

The SummaryMetadataEnricher processes documents as follows:

For each input document, it creates a prompt using the document's content and the specified summary template.
It sends this prompt to the provided ChatModel to generate a summary.
Depending on the specified summaryTypes, it adds the following metadata to each document:
- section_summary: Summary of the current document.
- prev_section_summary: Summary of the previous document (if available and requested).
- next_section_summary: Summary of the next document (if available and requested).
The enriched documents are returned.

Customization

The summary generation prompt can be customized by providing a custom summaryTemplate. The default template is:

"""
Here is the content of the section:
{context_str}

Summarize the key topics and entities of the section.

Summary:
"""

Example

ChatModel chatModel = // initialize your chat model
SummaryMetadataEnricher enricher = new SummaryMetadataEnricher(chatModel,
    List.of(SummaryType.PREVIOUS, SummaryType.CURRENT, SummaryType.NEXT));

Document doc1 = new Document("Content of document 1");
Document doc2 = new Document("Content of document 2");

List<Document> enrichedDocs = enricher.apply(List.of(doc1, doc2));

// Check the metadata of the enriched documents
for (Document doc : enrichedDocs) {
    System.out.println("Current summary: " + doc.getMetadata().get("section_summary"));
    System.out.println("Previous summary: " + doc.getMetadata().get("prev_section_summary"));
    System.out.println("Next summary: " + doc.getMetadata().get("next_section_summary"));
}

The example above demonstrates the expected behavior:

For a list of two documents, both documents receive a section_summary.
The first document receives a next_section_summary but no prev_section_summary.
The second document receives a prev_section_summary but no next_section_summary.
The section_summary of the first document matches the prev_section_summary of the second document.
The next_section_summary of the first document matches the section_summary of the second document.
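The neighbor relationships listed above follow from a simple index-based rule. The sketch below shows that rule with precomputed summary strings and no chat model; it assumes all three summary types were requested, and the class NeighborSummarySketch is a hypothetical name, not part of Spring AI.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NeighborSummarySketch {

    // Builds one metadata map per document from summaries that were already generated,
    // assuming PREVIOUS, CURRENT, and NEXT were all requested.
    static List<Map<String, String>> attach(List<String> summaries) {
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < summaries.size(); i++) {
            Map<String, String> metadata = new LinkedHashMap<>();
            if (i > 0) {
                // The first document has no predecessor, so it gets no previous summary.
                metadata.put("prev_section_summary", summaries.get(i - 1));
            }
            metadata.put("section_summary", summaries.get(i));
            if (i < summaries.size() - 1) {
                // The last document has no successor, so it gets no next summary.
                metadata.put("next_section_summary", summaries.get(i + 1));
            }
            result.add(metadata);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(attach(List.of("summary of doc 1", "summary of doc 2")));
    }
}
```

Generating each summary once and reusing it for the neighbors keeps the number of model calls equal to the number of documents.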

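Notes

The SummaryMetadataEnricher requires a functioning ChatModel to generate summaries.
The enricher can handle document lists of any size and properly handles the edge cases for the first and last documents.
This enricher is particularly useful for creating context-aware summaries, giving a better understanding of how documents in a sequence relate to each other.
The MetadataMode parameter lets you control how existing metadata is incorporated into the summary generation process.

Writers

File

The FileDocumentWriter is a DocumentWriter implementation that writes the content of a list of Document objects to a file.

Usage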

@Component
class MyDocumentWriter {

    public void writeDocuments(List<Document> documents) {
        FileDocumentWriter writer = new FileDocumentWriter("output.txt", true, MetadataMode.ALL, false);
        writer.accept(documents);
    }
}

Constructors

The FileDocumentWriter provides three constructors:

FileDocumentWriter(String fileName)
FileDocumentWriter(String fileName, boolean withDocumentMarkers)
FileDocumentWriter(String fileName, boolean withDocumentMarkers, MetadataMode metadataMode, boolean append)

Parameters

fileName: The name of the file to write the documents to.
withDocumentMarkers: Whether to include document markers in the output (default: false).
metadataMode: Specifies which document content is written to the file (default: MetadataMode.NONE).
append: If true, data is written to the end of the file rather than the beginning (default: false).

Behavior

The FileDocumentWriter processes documents as follows:

It opens a FileWriter for the specified file name.
For each document in the input list:
1. If withDocumentMarkers is true, it writes a document marker that includes the document index and page numbers.
2. It writes the formatted content of the document based on the specified metadataMode.
The file is closed after all documents have been written.

Document Markers

When withDocumentMarkers is set to true, the writer includes a marker for each document in the following format:

### Doc: [index], pages:[start_page_number,end_page_number]

Metadata Handling

The writer uses two specific metadata keys:

page_number: Represents the starting page number of the document.
end_page_number: Represents the ending page number of the document.

These are used when writing document markers.
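As a small illustration of how the two metadata keys feed the marker format shown above, the following sketch builds one marker line with String.format. The class DocumentMarkerSketch is a hypothetical name, and how the real writer renders missing page numbers is not covered here.

```java
import java.util.Map;

final class DocumentMarkerSketch {

    // Formats a marker line from the document index and the two page-number metadata keys.
    static String marker(int index, Map<String, ?> metadata) {
        return String.format("### Doc: %s, pages:[%s,%s]",
                index, metadata.get("page_number"), metadata.get("end_page_number"));
    }

    public static void main(String[] args) {
        System.out.println(marker(0, Map.of("page_number", 1, "end_page_number", 2)));
    }
}
```

Readers that set page_number and end_page_number on their documents make these markers useful for tracing a chunk back to its source pages.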

Example

List<Document> documents = // initialize your documents
FileDocumentWriter writer = new FileDocumentWriter("output.txt", true, MetadataMode.ALL, true);
writer.accept(documents);

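This writes all documents to "output.txt", including document markers, using all available metadata, and appends to the file if it already exists.

Notes

The writer uses FileWriter, so it writes text files with the default character encoding of the operating system.
If an error occurs during writing, a RuntimeException is thrown with the original exception as its cause.
The metadataMode parameter lets you control how existing metadata is incorporated into the written content.
This writer is particularly useful for debugging or for creating human-readable output of document collections.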

VectorStore

Provides integration with various vector stores. See the Vector DB documentation for a full list.
