Text-to-Speech (TTS) API
Spring AI provides a unified API for speech synthesis (TTS) through the TextToSpeechModel and StreamingTextToSpeechModel interfaces. This lets you write portable code that works across different TTS providers.
Common Interfaces
All TTS providers implement the following shared interfaces.
TextToSpeechModel
The TextToSpeechModel interface provides methods for converting text to speech.
public interface TextToSpeechModel extends Model<TextToSpeechPrompt, TextToSpeechResponse>, StreamingTextToSpeechModel {

    /**
     * Converts text to speech with default options.
     */
    default byte[] call(String text) {
        // Default implementation
    }

    /**
     * Converts text to speech with custom options.
     */
    TextToSpeechResponse call(TextToSpeechPrompt prompt);

    /**
     * Returns the default options for this model.
     */
    default TextToSpeechOptions getDefaultOptions() {
        // Default implementation
    }
}

StreamingTextToSpeechModel
The StreamingTextToSpeechModel interface provides methods for streaming audio in real time.
@FunctionalInterface
public interface StreamingTextToSpeechModel extends StreamingModel<TextToSpeechPrompt, TextToSpeechResponse> {

    /**
     * Streams text-to-speech responses with metadata.
     */
    Flux<TextToSpeechResponse> stream(TextToSpeechPrompt prompt);

    /**
     * Streams audio bytes for the given text.
     */
    default Flux<byte[]> stream(String text) {
        // Default implementation
    }
}

Writing Provider-Agnostic Code
One of the key benefits of the shared TTS interfaces is the ability to write code that works with any TTS provider without modification. The actual provider (OpenAI, ElevenLabs, etc.) is determined by your Spring Boot configuration, allowing you to switch providers without changing application code.
Basic Service Example
The shared interfaces allow you to write code that works with any TTS provider:
@Service
public class NarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public NarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public byte[] narrate(String text) {
        // Works with any TTS provider
        return textToSpeechModel.call(text);
    }

    public byte[] narrateWithOptions(String text, TextToSpeechOptions options) {
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, options);
        TextToSpeechResponse response = textToSpeechModel.call(prompt);
        return response.getResult().getOutput();
    }
}

This service works seamlessly with OpenAI, ElevenLabs, or any other TTS provider, with the actual implementation determined by your Spring Boot configuration.
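As a usage sketch, the audio bytes returned by a service like NarrationService can be written straight to a file. The example below is a minimal, JDK-only illustration rather than part of Spring AI: the AudioFileWriter class and its save method are hypothetical names, and the synthesizer is passed in as a Function<String, byte[]> so the snippet runs without Spring AI on the classpath. In an application you could pass narrationService::narrate or textToSpeechModel::call instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

/** Hypothetical helper: synthesizes text and writes the audio bytes to a file. */
final class AudioFileWriter {

    // Stand-in for a TTS call such as textToSpeechModel::call (assumption)
    private final Function<String, byte[]> synthesizer;

    AudioFileWriter(Function<String, byte[]> synthesizer) {
        this.synthesizer = synthesizer;
    }

    /** Writes the synthesized audio to target and returns the number of bytes written. */
    int save(String text, Path target) throws IOException {
        byte[] audio = synthesizer.apply(text);
        if (audio == null || audio.length == 0) {
            throw new IllegalStateException("Synthesizer returned no audio for: " + text);
        }
        Files.write(target, audio);
        return audio.length;
    }
}
```

The actual audio format depends on the provider and its configured output format, so choose the file extension (for example .mp3) to match your configuration.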
Advanced Example: Multi-Provider Support
You can build applications that support multiple TTS providers simultaneously:
@Service
public class MultiProviderNarrationService {

    private final Map<String, TextToSpeechModel> providers;

    public MultiProviderNarrationService(List<TextToSpeechModel> models) {
        // Spring will inject all available TextToSpeechModel beans
        this.providers = models.stream()
            .collect(Collectors.toMap(
                model -> model.getClass().getSimpleName(),
                model -> model
            ));
    }

    public byte[] narrateWithProvider(String text, String providerName) {
        TextToSpeechModel model = providers.get(providerName);
        if (model == null) {
            throw new IllegalArgumentException("Unknown provider: " + providerName);
        }
        return model.call(text);
    }

    public Set<String> getAvailableProviders() {
        return providers.keySet();
    }
}

Streaming Audio Example
The shared interfaces also support streaming for real-time audio generation:
@Service
public class StreamingNarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public StreamingNarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public Flux<byte[]> streamNarration(String text) {
        // TextToSpeechModel extends StreamingTextToSpeechModel
        return textToSpeechModel.stream(text);
    }

    public Flux<TextToSpeechResponse> streamWithMetadata(String text, TextToSpeechOptions options) {
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, options);
        return textToSpeechModel.stream(prompt);
    }
}

REST Controller Example
Building a REST API with provider-agnostic TTS:
@RestController
@RequestMapping("/api/tts")
public class TextToSpeechController {

    private final TextToSpeechModel textToSpeechModel;

    public TextToSpeechController(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    @PostMapping(value = "/synthesize", produces = "audio/mpeg")
    public ResponseEntity<byte[]> synthesize(@RequestBody SynthesisRequest request) {
        byte[] audio = textToSpeechModel.call(request.text());
        return ResponseEntity.ok()
            .contentType(MediaType.parseMediaType("audio/mpeg"))
            .header("Content-Disposition", "attachment; filename=\"speech.mp3\"")
            .body(audio);
    }

    @GetMapping(value = "/stream", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
    public Flux<byte[]> streamSynthesis(@RequestParam String text) {
        return textToSpeechModel.stream(text);
    }

    record SynthesisRequest(String text) {}
}

Configuration-Based Provider Selection
Switch between providers using Spring profiles or properties:
# application-openai.yml
spring:
  ai:
    model:
      audio:
        speech: openai
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        speech:
          options:
            model: gpt-4o-mini-tts
            voice: alloy

# application-elevenlabs.yml
spring:
  ai:
    model:
      audio:
        speech: elevenlabs
    elevenlabs:
      api-key: ${ELEVENLABS_API_KEY}
      tts:
        options:
          model-id: eleven_turbo_v2_5
          voice-id: your_voice_id

Then activate the desired provider:

# Use OpenAI
java -jar app.jar --spring.profiles.active=openai

# Use ElevenLabs
java -jar app.jar --spring.profiles.active=elevenlabs

Using Portable Options
For maximum portability, use only the common TextToSpeechOptions interface methods:
@Service
public class PortableNarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public PortableNarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public byte[] createPortableNarration(String text) {
        // Use provider's default options for maximum portability
        TextToSpeechOptions defaultOptions = textToSpeechModel.getDefaultOptions();
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, defaultOptions);
        TextToSpeechResponse response = textToSpeechModel.call(prompt);
        return response.getResult().getOutput();
    }
}

Working with Provider-Specific Features
When you need provider-specific features, you can still use them while maintaining a portable codebase:
@Service
public class FlexibleNarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public FlexibleNarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public byte[] narrate(String text, TextToSpeechOptions baseOptions) {
        TextToSpeechOptions options = baseOptions;

        // Apply provider-specific optimizations if available
        if (textToSpeechModel instanceof OpenAiAudioSpeechModel) {
            options = OpenAiAudioSpeechOptions.builder()
                .from(baseOptions)
                .model("gpt-4o-tts") // OpenAI-specific: use high-quality model
                .speed(1.0)
                .build();
        } else if (textToSpeechModel instanceof ElevenLabsTextToSpeechModel) {
            // ElevenLabs-specific options could go here
        }

        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, options);
        TextToSpeechResponse response = textToSpeechModel.call(prompt);
        return response.getResult().getOutput();
    }
}

Best Practices for Portable Code
Depend on Interfaces: Always inject TextToSpeechModel rather than concrete implementations.
Use Common Options: Stick to TextToSpeechOptions interface methods for maximum portability.
Handle Metadata Gracefully: Different providers return different metadata; handle it generically.
Test with Multiple Providers: Ensure your code works with at least two TTS providers.
Document Provider Assumptions: If you rely on specific provider behavior, document it clearly.
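The "Test with Multiple Providers" practice can be sketched as a small smoke check that runs the same assertion against every configured provider. This is a hedged, JDK-only illustration, not a Spring AI facility: ProviderSmokeCheck and failingProviders are hypothetical names, and each provider is represented by a Function<String, byte[]> (for example model::call on a TextToSpeechModel) so the snippet does not need Spring AI on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical smoke check: every provider must return non-empty audio for the same text. */
final class ProviderSmokeCheck {

    private ProviderSmokeCheck() {
    }

    /** Returns the names of providers that threw or returned no audio, in map iteration order. */
    static List<String> failingProviders(Map<String, Function<String, byte[]>> providers, String text) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, Function<String, byte[]>> entry : providers.entrySet()) {
            try {
                byte[] audio = entry.getValue().apply(text);
                if (audio == null || audio.length == 0) {
                    failures.add(entry.getKey());
                }
            } catch (RuntimeException e) {
                // One provider failing should not hide the results of the others
                failures.add(entry.getKey());
            }
        }
        return failures;
    }
}
```

In a Spring test, the map could be built from the injected List<TextToSpeechModel>, much like MultiProviderNarrationService does; an empty result means every provider produced audio for the sample text.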
Provider-Specific Features
While the shared interfaces provide portability, each provider also offers specific features through provider-specific options classes (e.g., OpenAiAudioSpeechOptions, ElevenLabsTextToSpeechOptions). These classes implement the TextToSpeechOptions interface while adding provider-specific capabilities.