This guide demonstrates how to use Arize and Coval together to evaluate voice AI applications, combining Arize's deep system-level observability with Coval's conversation-level simulation and evaluation capabilities.
Overview
Arize provides comprehensive observability for voice AI applications, capturing detailed traces of internal system calls, audio processing events, and performance metrics. It allows you to deep dive into the technical implementation and troubleshoot issues at the system level.
Coval pulls traces from Arize and provides conversation-level simulation and evaluation capabilities. With just your API key, Coval can access your Arize traces and enable higher-level testing, simulation, and evaluation of entire voice conversations.
Architecture
Voice AI Application sends detailed traces to Arize
Arize captures system calls, API events, and technical metrics
Coval pulls traces from Arize for conversation-level analysis and simulation




Setting Up Arize for Voice AI Tracing
1. Instrument Your Voice AI Application
First, set up comprehensive tracing in your voice AI application to send detailed system traces to Arize.
from opentelemetry import trace
from arize.opentelemetry import register
tracer = register(
space_id="your_space_id",
api_key="your_arize_api_key",
model_id="your_voice_ai_model",
model_version="v1.0"
)from opentelemetry import trace
from arize.opentelemetry import register
tracer = register(
space_id="your_space_id",
api_key="your_arize_api_key",
model_id="your_voice_ai_model",
model_version="v1.0"
)from opentelemetry import trace
from arize.opentelemetry import register
tracer = register(
space_id="your_space_id",
api_key="your_arize_api_key",
model_id="your_voice_ai_model",
model_version="v1.0"
)2. Key Events for Voice AI Instrumentation
Arize captures detailed system-level events from OpenAI Realtime API's WebSocket:
Session Events
session.created: New session initialization with system parameters
session.updated: Session configuration changes and system state updates
Audio Input Events
input_audio_buffer.speech_started: Speech detection algorithms triggered
input_audio_buffer.speech_stopped: End-of-speech detection completed
input_audio_buffer.committed: Audio buffer processing pipeline initiated
Conversation Events
conversation.item.created: Message processing and context management
Response Events
response.audio_transcript.delta: Real-time transcription processing
response.audio_transcript.done: Transcription pipeline completion
response.done: Complete response generation cycle
response.audio.delta: Audio synthesis and streaming
Error Events
error: System failures, API errors, and processing exceptions
3. Detailed Span Creation for System Observability
if event.get("type") == "session.created":
with tracer.start_as_current_span("session.lifecycle") as parent_span:
parent_span.set_attribute("session.id", event["session"]["id"])
parent_span.set_attribute("system.model", "gpt-4o-realtime-preview")
parent_span.set_attribute("system.voice", event["session"]["voice"])
parent_span.set_attribute("system.input_audio_format", event["session"]["input_audio_format"])
if event.get("type") == "input_audio_buffer.speech_started":
with tracer.start_as_current_span("audio.input.processing") as audio_span:
audio_span.set_attribute("audio.processing.stage", "speech_detection")
audio_span.set_attribute("audio.buffer.size", len(audio_buffer))
if event.get("type") == "response.done":
resp = event["response"]
with tracer.start_as_current_span("response.generation") as response_span:
response_span.set_attributes({
"llm.token_count.prompt": resp["usage"]["input_tokens"],
"llm.token_count.completion": resp["usage"]["output_tokens"],
"system.processing_time_ms": resp.get("processing_time"),
"system.model_temperature": resp.get("temperature", 0.8),
"metadata.status_details": resp["status_details"],
})
if event.get("type") == "session.created":
with tracer.start_as_current_span("session.lifecycle") as parent_span:
parent_span.set_attribute("session.id", event["session"]["id"])
parent_span.set_attribute("system.model", "gpt-4o-realtime-preview")
parent_span.set_attribute("system.voice", event["session"]["voice"])
parent_span.set_attribute("system.input_audio_format", event["session"]["input_audio_format"])
if event.get("type") == "input_audio_buffer.speech_started":
with tracer.start_as_current_span("audio.input.processing") as audio_span:
audio_span.set_attribute("audio.processing.stage", "speech_detection")
audio_span.set_attribute("audio.buffer.size", len(audio_buffer))
if event.get("type") == "response.done":
resp = event["response"]
with tracer.start_as_current_span("response.generation") as response_span:
response_span.set_attributes({
"llm.token_count.prompt": resp["usage"]["input_tokens"],
"llm.token_count.completion": resp["usage"]["output_tokens"],
"system.processing_time_ms": resp.get("processing_time"),
"system.model_temperature": resp.get("temperature", 0.8),
"metadata.status_details": resp["status_details"],
})
if event.get("type") == "session.created":
with tracer.start_as_current_span("session.lifecycle") as parent_span:
parent_span.set_attribute("session.id", event["session"]["id"])
parent_span.set_attribute("system.model", "gpt-4o-realtime-preview")
parent_span.set_attribute("system.voice", event["session"]["voice"])
parent_span.set_attribute("system.input_audio_format", event["session"]["input_audio_format"])
if event.get("type") == "input_audio_buffer.speech_started":
with tracer.start_as_current_span("audio.input.processing") as audio_span:
audio_span.set_attribute("audio.processing.stage", "speech_detection")
audio_span.set_attribute("audio.buffer.size", len(audio_buffer))
if event.get("type") == "response.done":
resp = event["response"]
with tracer.start_as_current_span("response.generation") as response_span:
response_span.set_attributes({
"llm.token_count.prompt": resp["usage"]["input_tokens"],
"llm.token_count.completion": resp["usage"]["output_tokens"],
"system.processing_time_ms": resp.get("processing_time"),
"system.model_temperature": resp.get("temperature", 0.8),
"metadata.status_details": resp["status_details"],
})4. Audio File Management and URLs
def upload_to_gcs(file_path, bucket_name, destination_blob_name, make_public=False):
"""Uploads audio files to Google Cloud Storage for Arize tracing."""
try:
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(file_path)
if make_public:
blob.make_public()
return blob.public_url
else:
return destination_blob_name
except Exception as e:
raise RuntimeError(f"Failed to upload {file_path} to GCS: {e}")
def process_audio_and_upload(pcm16_audio, span):
"""Processes audio, uploads to storage, and adds URL to Arize span."""
timestamp = time.strftime("%Y%m%d_%H%M%S")
file_name = f"audio_{timestamp}.wav"
file_path = file_name
bucket_name = "your-audio-bucket"
try:
save_audio_to_wav(pcm16_audio, file_path)
gcs_url = upload_to_gcs(file_path, bucket_name, f"voice-ai/audio/{file_name}")
span.set_attribute("input.audio.url", gcs_url)
span.set_attribute("input.audio.mime_type", "audio/wav")
span.set_attribute("input.audio.duration_seconds", get_audio_duration(file_path))
finally:
if os.path.exists(file_path):
os.remove(file_path)
return gcs_urldef upload_to_gcs(file_path, bucket_name, destination_blob_name, make_public=False):
"""Uploads audio files to Google Cloud Storage for Arize tracing."""
try:
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(file_path)
if make_public:
blob.make_public()
return blob.public_url
else:
return destination_blob_name
except Exception as e:
raise RuntimeError(f"Failed to upload {file_path} to GCS: {e}")
def process_audio_and_upload(pcm16_audio, span):
"""Processes audio, uploads to storage, and adds URL to Arize span."""
timestamp = time.strftime("%Y%m%d_%H%M%S")
file_name = f"audio_{timestamp}.wav"
file_path = file_name
bucket_name = "your-audio-bucket"
try:
save_audio_to_wav(pcm16_audio, file_path)
gcs_url = upload_to_gcs(file_path, bucket_name, f"voice-ai/audio/{file_name}")
span.set_attribute("input.audio.url", gcs_url)
span.set_attribute("input.audio.mime_type", "audio/wav")
span.set_attribute("input.audio.duration_seconds", get_audio_duration(file_path))
finally:
if os.path.exists(file_path):
os.remove(file_path)
return gcs_urldef upload_to_gcs(file_path, bucket_name, destination_blob_name, make_public=False):
"""Uploads audio files to Google Cloud Storage for Arize tracing."""
try:
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(file_path)
if make_public:
blob.make_public()
return blob.public_url
else:
return destination_blob_name
except Exception as e:
raise RuntimeError(f"Failed to upload {file_path} to GCS: {e}")
def process_audio_and_upload(pcm16_audio, span):
"""Processes audio, uploads to storage, and adds URL to Arize span."""
timestamp = time.strftime("%Y%m%d_%H%M%S")
file_name = f"audio_{timestamp}.wav"
file_path = file_name
bucket_name = "your-audio-bucket"
try:
save_audio_to_wav(pcm16_audio, file_path)
gcs_url = upload_to_gcs(file_path, bucket_name, f"voice-ai/audio/{file_name}")
span.set_attribute("input.audio.url", gcs_url)
span.set_attribute("input.audio.mime_type", "audio/wav")
span.set_attribute("input.audio.duration_seconds", get_audio_duration(file_path))
finally:
if os.path.exists(file_path):
os.remove(file_path)
return gcs_urlSetting Up Coval for Conversation-Level Evaluation
1. Add Your API Keys
Configure Coval to pull traces from your Arize instance by adding your API keys in the Coval dashboard.
2. Conversation-Level Simulation & Evaluation
Once connected, Coval can:
Pull conversation data from Arize traces
Run automated conversation simulations
Evaluate conversation quality metrics
Generate comprehensive performance reports
Arize Deep Dive Capabilities
System-Level Monitoring
Use Arize to analyze:
Technical Performance
API response times and latencies
Audio processing pipeline performance
Token usage and costs
Error rates by system component
Audio Processing Metrics
Speech-to-text accuracy
Audio quality scores
Processing buffer sizes
Compression and encoding efficiency
Model Performance
Response generation times
Context window utilization
Temperature and parameter effects
Function calling success rates
Debugging with Arize Traces
Coval Conversation Analysis
Conversation Metrics
Tool call evaluation
Conversation flow analysis
User satisfaction scoring
Response quality assessment
Arize Prompt and Tool Evaluation
Unit-level testing of prompts
Tool calling accuracy
Context management evaluation
Integration Workflow
Daily Monitoring Workflow
System Monitoring in Arize
Monitor technical performance metrics
Track error rates and system health
Analyze API usage and costs
Debug technical issues in real-time
Conversation Analysis in Coval
Pull daily conversation data from Arize
Evaluate conversation quality metrics
Run automated conversation simulations
Generate conversation performance reports
Combined Insights
Correlate system performance with conversation quality
Identify technical issues affecting user experience
Optimize both system parameters and conversation flows
Continuous Improvement Process
def weekly_analysis():
system_metrics = arize.get_system_metrics(period="week")
conversations = coval.pull_conversations(period="week")
correlation_analysis = coval.analyze_system_conversation_correlation(
system_metrics=system_metrics,
conversations=conversations
)
recommendations = coval.generate_recommendations(
analysis_results=correlation_analysis,
improvement_areas=["latency", "accuracy", "user_satisfaction"]
)
return recommendations
def weekly_analysis():
system_metrics = arize.get_system_metrics(period="week")
conversations = coval.pull_conversations(period="week")
correlation_analysis = coval.analyze_system_conversation_correlation(
system_metrics=system_metrics,
conversations=conversations
)
recommendations = coval.generate_recommendations(
analysis_results=correlation_analysis,
improvement_areas=["latency", "accuracy", "user_satisfaction"]
)
return recommendations
def weekly_analysis():
system_metrics = arize.get_system_metrics(period="week")
conversations = coval.pull_conversations(period="week")
correlation_analysis = coval.analyze_system_conversation_correlation(
system_metrics=system_metrics,
conversations=conversations
)
recommendations = coval.generate_recommendations(
analysis_results=correlation_analysis,
improvement_areas=["latency", "accuracy", "user_satisfaction"]
)
return recommendationsBest Practices
Arize Configuration
Instrument all critical system events
Include comprehensive span attributes
Store audio files in accessible cloud storage
Set up alerting for system anomalies
Use proper error handling and logging
Coval Usage
Regular conversation pulls for fresh data
Define clear evaluation criteria
Use representative conversation samples
Set up automated evaluation pipelines
Compare performance across time periods
Data Management
Maintain consistent audio file naming conventions
Implement proper access controls for sensitive conversations
Archive old conversation data appropriately
Ensure GDPR/privacy compliance for voice data
Regular backup of evaluation results
Troubleshooting
Common Integration Issues
Coval Cannot Pull Traces from Arize
Verify API key permissions and space access
Check that traces exist in the specified time range
Ensure model IDs match between Arize and Coval
Validate network connectivity and firewall settings
Missing Conversation Data
Confirm that conversation spans are properly structured in Arize
Check that audio URLs are accessible to Coval
Verify conversation identification logic
Review trace aggregation settings
Evaluation Failures
Validate conversation data format and completeness
Check evaluation template syntax and criteria
Ensure sufficient conversation samples for analysis
Monitor API rate limits for evaluation models
Conclusion
The combination of Arize and Coval provides a complete voice AI evaluation solution:
Arize gives you deep technical observability into your voice AI system's internal operations, allowing you to monitor performance, debug issues, and optimize system-level components
Coval leverages this detailed trace data to provide conversation-level insights, simulation capabilities, and comprehensive evaluation of user experiences
This two-tier approach ensures you can maintain both technical excellence and conversation quality in your voice AI applications. Start by implementing comprehensive Arize tracing, then use Coval to pull this data for higher-level conversation analysis and optimization.