テキスト読み上げ (TTS) API

Spring AI は、TextToSpeechModel および StreamingTextToSpeechModel インターフェースを通じて、音声合成（TTS）用の統合 API を提供します。これにより、異なる TTS プロバイダー間で動作する移植性の高いコードを作成できます。

サポートされているプロバイダー

共通インターフェース

すべての TTS プロバイダーは、次の共有インターフェースを実装します。

TextToSpeechModel

TextToSpeechModel インターフェースは、テキストを音声に変換するためのメソッドを提供します。

public interface TextToSpeechModel extends Model<TextToSpeechPrompt, TextToSpeechResponse>, StreamingTextToSpeechModel {

    /**
     * Converts text to speech with default options.
     */
    default byte[] call(String text) {
        // Default implementation
    }

    /**
     * Converts text to speech with custom options.
     */
    TextToSpeechResponse call(TextToSpeechPrompt prompt);

    /**
     * Returns the default options for this model.
     */
    default TextToSpeechOptions getDefaultOptions() {
        // Default implementation
    }
}

StreamingTextToSpeechModel

StreamingTextToSpeechModel インターフェースは、リアルタイムでオーディオをストリーミングするためのメソッドを提供します。

@FunctionalInterface
public interface StreamingTextToSpeechModel extends StreamingModel<TextToSpeechPrompt, TextToSpeechResponse> {

    /**
     * Streams text-to-speech responses with metadata.
     */
    Flux<TextToSpeechResponse> stream(TextToSpeechPrompt prompt);

    /**
     * Streams audio bytes for the given text.
     */
    default Flux<byte[]> stream(String text) {
        // Default implementation
    }
}

TextToSpeechPrompt

TextToSpeechPrompt クラスは入力テキストとオプションをカプセル化します。

TextToSpeechPrompt prompt = new TextToSpeechPrompt(
    "Hello, this is a text-to-speech example.",
    options
);

TextToSpeechResponse

The TextToSpeechResponse class contains the generated audio and metadata:

TextToSpeechResponse response = model.call(prompt);
byte[] audioBytes = response.getResult().getOutput();
TextToSpeechResponseMetadata metadata = response.getMetadata();

Writing Provider-Agnostic Code

One of the key benefits of the shared TTS interfaces is the ability to write code that works with any TTS provider without modification. The actual provider (OpenAI, ElevenLabs, etc.) is determined by your Spring Boot configuration, allowing you to switch providers without changing application code.

Basic Service Example

The shared interfaces allow you to write code that works with any TTS provider:

@Service
public class NarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public NarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public byte[] narrate(String text) {
        // Works with any TTS provider
        return textToSpeechModel.call(text);
    }

    public byte[] narrateWithOptions(String text, TextToSpeechOptions options) {
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, options);
        TextToSpeechResponse response = textToSpeechModel.call(prompt);
        return response.getResult().getOutput();
    }
}

This service works seamlessly with OpenAI, ElevenLabs, or any other TTS provider, with the actual implementation determined by your Spring Boot configuration.

高度な例: Multi-Provider Support

You can build applications that support multiple TTS providers simultaneously:

@Service
public class MultiProviderNarrationService {

    private final Map<String, TextToSpeechModel> providers;

    public MultiProviderNarrationService(List<TextToSpeechModel> models) {
        // Spring will inject all available TextToSpeechModel beans
        this.providers = models.stream()
            .collect(Collectors.toMap(
                model -> model.getClass().getSimpleName(),
                model -> model
            ));
    }

    public byte[] narrateWithProvider(String text, String providerName) {
        TextToSpeechModel model = providers.get(providerName);
        if (model == null) {
            throw new IllegalArgumentException("Unknown provider: " + providerName);
        }
        return model.call(text);
    }

    public Set<String> getAvailableProviders() {
        return providers.keySet();
    }
}

Streaming Audio Example

The shared interfaces also support streaming for real-time audio generation:

@Service
public class StreamingNarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public StreamingNarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public Flux<byte[]> streamNarration(String text) {
        // TextToSpeechModel extends StreamingTextToSpeechModel
        return textToSpeechModel.stream(text);
    }

    public Flux<TextToSpeechResponse> streamWithMetadata(String text, TextToSpeechOptions options) {
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, options);
        return textToSpeechModel.stream(prompt);
    }
}

REST Controller Example

Building a REST API with provider-agnostic TTS:

@RestController
@RequestMapping("/api/tts")
public class TextToSpeechController {

    private final TextToSpeechModel textToSpeechModel;

    public TextToSpeechController(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    @PostMapping(value = "/synthesize", produces = "audio/mpeg")
    public ResponseEntity<byte[]> synthesize(@RequestBody SynthesisRequest request) {
        byte[] audio = textToSpeechModel.call(request.text());
        return ResponseEntity.ok()
            .contentType(MediaType.parseMediaType("audio/mpeg"))
            .header("Content-Disposition", "attachment; filename=\"speech.mp3\"")
            .body(audio);
    }

    @GetMapping(value = "/stream", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
    public Flux<byte[]> streamSynthesis(@RequestParam String text) {
        return textToSpeechModel.stream(text);
    }

    record SynthesisRequest(String text) {}
}

Configuration-Based Provider Selection

Switch between providers using Spring profiles or properties:

# application-openai.yml
spring:
  ai:
    model:
      audio:
        speech: openai
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        speech:
          options:
            model: gpt-4o-mini-tts
            voice: alloy

# application-elevenlabs.yml
spring:
  ai:
    model:
      audio:
        speech: elevenlabs
    elevenlabs:
      api-key: ${ELEVENLABS_API_KEY}
      tts:
        options:
          model-id: eleven_turbo_v2_5
          voice-id: your_voice_id

Then activate the desired provider:

# Use OpenAI
java -jar app.jar --spring.profiles.active=openai

# Use ElevenLabs
java -jar app.jar --spring.profiles.active=elevenlabs

Using Portable Options

For maximum portability, use only the common TextToSpeechOptions interface methods:

@Service
public class PortableNarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public PortableNarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public byte[] createPortableNarration(String text) {
        // Use provider's default options for maximum portability
        TextToSpeechOptions defaultOptions = textToSpeechModel.getDefaultOptions();
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, defaultOptions);
        TextToSpeechResponse response = textToSpeechModel.call(prompt);
        return response.getResult().getOutput();
    }
}

Working with Provider-Specific Features

When you need provider-specific features, you can still use them while maintaining a portable codebase:

@Service
public class FlexibleNarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public FlexibleNarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public byte[] narrate(String text, TextToSpeechOptions baseOptions) {
        TextToSpeechOptions options = baseOptions;

        // Apply provider-specific optimizations if available
        if (textToSpeechModel instanceof OpenAiAudioSpeechModel) {
            options = OpenAiAudioSpeechOptions.builder()
                .from(baseOptions)
                .model("gpt-4o-tts")  // OpenAI-specific: use high-quality model
                .speed(1.0)
                .build();
        } else if (textToSpeechModel instanceof ElevenLabsTextToSpeechModel) {
            // ElevenLabs-specific options could go here
        }

        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, options);
        TextToSpeechResponse response = textToSpeechModel.call(prompt);
        return response.getResult().getOutput();
    }
}

Best Practices for Portable Code

Depend on Interfaces : Always inject TextToSpeechModel rather than concrete implementations
Use Common Options : Stick to TextToSpeechOptions interface methods for maximum portability
Handle Metadata Gracefully : Different providers return different metadata; handle it generically
Test with Multiple Providers : Ensure your code works with at least two TTS providers
Document Provider Assumptions : If you rely on specific provider behavior, document it clearly

Provider-Specific Features

While the shared interfaces provide portability, each provider also offers specific features through provider-specific options classes (e.g., OpenAiAudioSpeechOptions, ElevenLabsSpeechOptions). These classes implement the TextToSpeechOptions interface while adding provider-specific capabilities.