Realtime transcription enables live speech-to-text from microphone input with automatic audio slicing, VAD-based speech detection, and memory management.

Overview

The RealtimeTranscriber provides:
  • Live microphone capture and transcription
  • Voice Activity Detection (VAD) for speech/silence detection
  • Automatic audio slicing at configurable intervals
  • Memory-efficient circular buffer management
  • Event-based architecture for transcription updates
  • Audio recording to WAV files
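
To make the slicing and memory-management points above concrete, here is a minimal sketch of a bounded slice store that evicts the oldest slice once a limit is reached. It is an illustration only, not the library's implementation: `AudioSlice`, `SliceStore`, and the 16-bit PCM sample format are assumptions introduced for this example.

```typescript
// Hypothetical sketch: keep only the newest `maxSlices` audio slices in memory.
interface AudioSlice {
  index: number;       // monotonically increasing slice number
  samples: Int16Array; // 16-bit PCM samples (assumed format)
}

class SliceStore {
  private slices: AudioSlice[] = [];
  private nextIndex = 0;
  private maxSlices: number;

  constructor(maxSlices: number) {
    this.maxSlices = maxSlices;
  }

  // Store a new slice, then evict the oldest slices beyond the limit.
  push(samples: Int16Array): AudioSlice {
    const slice: AudioSlice = { index: this.nextIndex++, samples };
    this.slices.push(slice);
    while (this.slices.length > this.maxSlices) {
      this.slices.shift();
    }
    return slice;
  }

  // Indices of the slices still held in memory, oldest first.
  indices(): number[] {
    return this.slices.map((s) => s.index);
  }
}

const store = new SliceStore(3);
for (let i = 0; i < 5; i++) {
  store.push(new Int16Array(4));
}
console.log(store.indices()); // [2, 3, 4]
```

With a limit of 3, pushing five slices leaves only the last three, which mirrors the role the maxSlicesInMemory option plays later on this page.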

Quick Start

1. Initialize Contexts

First, initialize both Whisper and VAD contexts:
import { initWhisper, initWhisperVad } from 'whisper.rn';

// Initialize Whisper context
const whisperContext = await initWhisper({
  filePath: require('../assets/ggml-base.bin'),
});

// Initialize VAD context
const vadContext = await initWhisperVad({
  filePath: require('../assets/ggml-silero-v6.2.0.bin'),
  useGpu: true,
  nThreads: 4,
});

console.log('Contexts initialized');

2. Create RealtimeTranscriber

Set up the transcriber with dependencies, options, and callbacks:
import RNFS from 'react-native-fs';
import {
  RealtimeTranscriber,
  RingBufferVad,
  VAD_PRESETS,
  AudioPcmStreamAdapter,
} from 'whisper.rn/realtime-transcription';

// Create VAD wrapper with preset
const vadWrapper = new RingBufferVad(vadContext, {
  vadOptions: VAD_PRESETS.default,
  vadPreset: 'default',
  logger: (msg) => console.log(msg),
});

// Create audio stream adapter
const audioStream = new AudioPcmStreamAdapter();

// Create transcriber
const transcriber = new RealtimeTranscriber(
  // Dependencies
  {
    whisperContext,
    vadContext: vadWrapper,
    audioStream,
    fs: RNFS,
  },
  // Options
  {
    logger: (msg) => console.log(msg),
    audioSliceSec: 30,
    audioMinSec: 0.5,
    maxSlicesInMemory: 3,
    transcribeOptions: {
      language: 'en',
      maxLen: 1,
    },
    audioOutputPath: `${RNFS.DocumentDirectoryPath}/recording.wav`,
  },
  // Callbacks
  {
    onTranscribe: (event) => {
      console.log('Transcription:', event.data?.result);
    },
    onVad: (event) => {
      console.log('VAD:', event.type, event.confidence);
    },
    onError: (error) => {
      console.error('Error:', error);
    },
    onStatusChange: (isActive) => {
      console.log('Status:', isActive ? 'ACTIVE' : 'INACTIVE');
    },
    onStatsUpdate: (stats) => {
      console.log('Stats:', stats.data);
    },
  }
);

3. Start Transcription

Start realtime transcription:
await transcriber.start();
console.log('Realtime transcription started');

4. Stop and Cleanup

Stop transcription and release resources:
await transcriber.stop();
await transcriber.release();

// Release contexts
await whisperContext.release();
await vadContext.release();
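
If one release call throws, the calls after it never run. The sketch below shows one way to release resources in a fixed order while collecting errors instead of stopping at the first failure. `Releasable` and `releaseInOrder` are hypothetical helpers written for this example, not part of whisper.rn; the only assumption about the library is that each object exposes an async `release()`, as the snippet above shows.

```typescript
// Hypothetical helper: anything with an async release() method.
interface Releasable {
  release(): Promise<void>;
}

// Release each resource in the given order. A failure is recorded and the
// remaining resources are still released.
async function releaseInOrder(
  resources: Array<Releasable | null | undefined>
): Promise<unknown[]> {
  const errors: unknown[] = [];
  for (const resource of resources) {
    if (!resource) continue; // skip resources that were never created
    try {
      await resource.release();
    } catch (error) {
      errors.push(error);
    }
  }
  return errors;
}

// Demo with fake resources that record the order of release calls.
const order: string[] = [];
const fake = (name: string, fail = false): Releasable => ({
  release: async () => {
    order.push(name);
    if (fail) throw new Error(`${name} failed`);
  },
});

const demo = releaseInOrder([
  fake('transcriber'),
  null,
  fake('whisper', true),
  fake('vad'),
]);
demo.then((errors) => console.log(order.join(' -> '), `errors: ${errors.length}`));
```

In an app, the call would look like `await releaseInOrder([transcriber, whisperContext, vadContext])`, keeping the transcriber first because it uses both contexts.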

VAD Presets

The library includes pre-configured VAD presets for different use cases.
The default preset provides balanced settings for general use:
VAD_PRESETS.default = {
  threshold: 0.5,
  minSpeechDurationMs: 250,
  minSilenceDurationMs: 100,
  maxSpeechDurationS: 30,
  speechPadMs: 30,
  samplesOverlap: 0.1,
}
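
To give a feel for what the first three preset fields control, the sketch below applies a threshold, a minimum speech duration, and a minimum silence duration to a list of per-frame speech probabilities. This is a simplified illustration, not the Silero VAD or whisper.rn algorithm: `segmentSpeech`, the 50 ms frame size, and the sample probabilities are assumptions, and maxSpeechDurationS, speechPadMs, and samplesOverlap are not modeled.

```typescript
interface VadOptionsSketch {
  threshold: number;            // probability at or above which a frame counts as speech
  minSpeechDurationMs: number;  // shorter speech runs are dropped
  minSilenceDurationMs: number; // silence must last this long to end a segment
}

interface SpeechSegment {
  startMs: number;
  endMs: number;
}

// probs holds one speech probability per frame; frameMs is the frame length.
function segmentSpeech(
  probs: number[],
  frameMs: number,
  opts: VadOptionsSketch
): SpeechSegment[] {
  const segments: SpeechSegment[] = [];
  let startMs = -1; // -1 means "not currently inside speech"
  let silenceMs = 0;

  const closeSegment = (endMs: number) => {
    if (endMs - startMs >= opts.minSpeechDurationMs) {
      segments.push({ startMs, endMs });
    }
    startMs = -1;
    silenceMs = 0;
  };

  probs.forEach((p, i) => {
    const frameStart = i * frameMs;
    if (p >= opts.threshold) {
      if (startMs < 0) startMs = frameStart;
      silenceMs = 0; // a speech frame cancels any short pause
    } else if (startMs >= 0) {
      silenceMs += frameMs;
      if (silenceMs >= opts.minSilenceDurationMs) {
        // The segment ends where the trailing silence began.
        closeSegment(frameStart + frameMs - silenceMs);
      }
    }
  });

  // Close a segment that is still open when the input ends.
  if (startMs >= 0) closeSegment(probs.length * frameMs - silenceMs);
  return segments;
}

const opts = { threshold: 0.5, minSpeechDurationMs: 250, minSilenceDurationMs: 100 };
const speech = segmentSpeech(
  [0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.2, 0.9, 0.1, 0.1, 0.1],
  50,
  opts
);
const blip = segmentSpeech([0.9, 0.9, 0.1, 0.1, 0.1], 50, opts);
console.log(speech); // [ { startMs: 50, endMs: 450 } ]
console.log(blip);   // []
```

The single 50 ms dip in the first input is shorter than minSilenceDurationMs, so it does not split the segment, while the 100 ms blip in the second input is shorter than minSpeechDurationMs and is discarded.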

Event Callbacks

The transcriber provides several event callbacks:

onTranscribe

Receives transcription results:
onTranscribe: (event: RealtimeTranscribeEvent) => {
  const { data, sliceIndex, processTime } = event;
  
  if (data?.result) {
    console.log(`Slice ${sliceIndex}: ${data.result}`);
    console.log(`Processed in ${processTime}ms`);
    
    // Access all segments
    data.segments.forEach((segment) => {
      console.log(`[${segment.t0} --> ${segment.t1}] ${segment.text}`);
    });
  }
}
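
The t0 and t1 values logged above are raw numbers. Assuming they follow the whisper.cpp convention of 10 ms units (worth confirming against the API reference), a small formatter like the hypothetical `formatTimestamp` below can turn them into readable timestamps.

```typescript
// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = '0' + s;
  return s;
}

// Assumption: t is expressed in units of 10 ms, as whisper.cpp reports segment times.
function formatTimestamp(t: number): string {
  const totalMs = Math.round(t * 10);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const sec = totalSec % 60;
  const min = Math.floor(totalSec / 60) % 60;
  const hr = Math.floor(totalSec / 3600);
  return `${pad(hr, 2)}:${pad(min, 2)}:${pad(sec, 2)}.${pad(ms, 3)}`;
}

console.log(formatTimestamp(150));   // 00:00:01.500
console.log(formatTimestamp(36615)); // 00:06:06.150
```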

onVad

Receives VAD events (speech start, speech end, silence):
onVad: (event: RealtimeVadEvent) => {
  console.log(`VAD: ${event.type} (confidence: ${event.confidence.toFixed(2)})`);
  
  if (event.type === 'speech_start') {
    console.log('Speech detected!');
  } else if (event.type === 'speech_end') {
    console.log('Speech ended, triggering slice...');
  }
}

onSliceTranscriptionStabilized

Receives the most recent stabilized transcription:
onSliceTranscriptionStabilized: (text: string) => {
  console.log('Stabilized text:', text);
  // Update UI with current transcription
  setCurrentTranscription(text);
}

onStatsUpdate

Receives statistics about memory usage and processing:
onStatsUpdate: (stats: RealtimeStatsEvent) => {
  const { data } = stats;
  console.log('Slices in memory:', data.sliceStats?.memoryUsage?.slicesInMemory);
  console.log('Memory usage:', data.sliceStats?.memoryUsage?.estimatedMB, 'MB');
  console.log('Is transcribing:', data.isTranscribing);
}

Complete Example

Here’s a complete React Native component with realtime transcription:
import React, { useCallback, useEffect, useRef, useState } from 'react';
import { View, Text, Button, ScrollView, Switch } from 'react-native';
import RNFS from 'react-native-fs';
import { initWhisper, initWhisperVad } from 'whisper.rn';
import type { WhisperContext, WhisperVadContext } from 'whisper.rn';
import {
  RealtimeTranscriber,
  RingBufferVad,
  VAD_PRESETS,
  AudioPcmStreamAdapter,
  type RealtimeTranscribeEvent,
  type RealtimeVadEvent,
} from 'whisper.rn/realtime-transcription';

export default function RealtimeTranscription() {
  const whisperContextRef = useRef<WhisperContext | null>(null);
  const vadContextRef = useRef<WhisperVadContext | null>(null);
  const transcriberRef = useRef<RealtimeTranscriber | null>(null);

  const [logs, setLogs] = useState<string[]>([]);
  const [currentText, setCurrentText] = useState<string>('');
  const [isTranscribing, setIsTranscribing] = useState(false);
  const [vadPreset, setVadPreset] = useState<keyof typeof VAD_PRESETS>('default');

  const log = useCallback((...messages: any[]) => {
    const timestamp = new Date().toLocaleTimeString();
    setLogs((prev) => [...prev, `${timestamp}: ${messages.join(' ')}`]);
  }, []);

  useEffect(() => {
    return () => {
      // On unmount, release the transcriber first because it uses both contexts
      transcriberRef.current?.release();
      whisperContextRef.current?.release();
      vadContextRef.current?.release();
    };
  }, []);

  const initialize = async () => {
    try {
      log('Initializing contexts...');
      
      // Initialize Whisper
      const whisperCtx = await initWhisper({
        filePath: require('../assets/ggml-base.bin'),
      });
      whisperContextRef.current = whisperCtx;
      log('Whisper initialized');

      // Initialize VAD
      const vadCtx = await initWhisperVad({
        filePath: require('../assets/ggml-silero-v6.2.0.bin'),
        useGpu: true,
        nThreads: 4,
      });
      vadContextRef.current = vadCtx;
      log('VAD initialized');
    } catch (error) {
      log('Error initializing:', error);
    }
  };

  const startTranscription = async () => {
    if (!whisperContextRef.current || !vadContextRef.current) {
      log('Contexts not initialized');
      return;
    }

    try {
      const audioStream = new AudioPcmStreamAdapter();
      
      const vadWrapper = new RingBufferVad(vadContextRef.current, {
        vadOptions: VAD_PRESETS[vadPreset],
        vadPreset,
        logger: (msg) => console.log(msg),
      });

      const transcriber = new RealtimeTranscriber(
        {
          whisperContext: whisperContextRef.current,
          vadContext: vadWrapper,
          audioStream,
          fs: RNFS,
        },
        {
          logger: (msg) => log(msg),
          audioSliceSec: 30,
          audioMinSec: 0.5,
          maxSlicesInMemory: 3,
          transcribeOptions: {
            language: 'en',
            maxLen: 1,
          },
          audioOutputPath: `${RNFS.DocumentDirectoryPath}/realtime.wav`,
        },
        {
          onTranscribe: (event: RealtimeTranscribeEvent) => {
            if (event.data?.result) {
              log(`Transcribed: "${event.data.result.substring(0, 50)}..."`);
            }
          },
          onVad: (event: RealtimeVadEvent) => {
            if (event.type !== 'silence') {
              log(`VAD: ${event.type}`);
            }
          },
          onError: (error) => log('Error:', error),
          onStatusChange: (isActive) => setIsTranscribing(isActive),
          onSliceTranscriptionStabilized: (text) => setCurrentText(text),
        }
      );

      transcriberRef.current = transcriber;
      await transcriber.start();
      log('Realtime transcription started');
    } catch (error) {
      log('Error starting transcription:', error);
    }
  };

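  const stopTranscription = async () => {
    const transcriber = transcriberRef.current;
    if (!transcriber) return;

    try {
      await transcriber.stop();
      // startTranscription creates a new instance each time, so release this one
      await transcriber.release();
      transcriberRef.current = null;
      log('Transcription stopped');
    } catch (error) {
      log('Error stopping:', error);
    }
  };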

  return (
    <ScrollView style={{ padding: 20 }}>
      <Button title="Initialize" onPress={initialize} />
      
      <View style={{ marginTop: 10 }}>
        <Text>VAD Preset: {vadPreset}</Text>
        <Button
          title="Change VAD Preset"
          onPress={() => {
            const presets = Object.keys(VAD_PRESETS) as Array<keyof typeof VAD_PRESETS>;
            const currentIndex = presets.indexOf(vadPreset);
            const nextPreset = presets[(currentIndex + 1) % presets.length];
            setVadPreset(nextPreset);
            log(`Changed VAD preset to: ${nextPreset}`);
          }}
        />
      </View>

      <View style={{ marginTop: 10 }}>
        <Button
          title={isTranscribing ? 'Stop' : 'Start Realtime'}
          onPress={isTranscribing ? stopTranscription : startTranscription}
          disabled={!whisperContextRef.current}
        />
      </View>

      {currentText.length > 0 && (
        <View style={{ marginTop: 20, padding: 10, backgroundColor: '#e8f5e8' }}>
          <Text style={{ fontWeight: 'bold' }}>Current Transcription:</Text>
          <Text>{currentText}</Text>
        </View>
      )}

      <View style={{ marginTop: 20 }}>
        <Text style={{ fontWeight: 'bold' }}>Logs:</Text>
        {logs.slice(-10).map((line, i) => (
          <Text key={i} style={{ fontSize: 12 }}>{line}</Text>
        ))}
      </View>
    </ScrollView>
  );
}

File Simulation Mode

Test realtime transcription using pre-recorded audio files:
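import RNFS from 'react-native-fs';
import { RealtimeTranscriber } from 'whisper.rn/realtime-transcription';
import { SimulateFileAudioStreamAdapter } from 'whisper.rn/realtime-transcription/adapters';

// whisperContext and vadWrapper are created as shown in the Quick Start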

const audioStream = new SimulateFileAudioStreamAdapter({
  fs: RNFS,
  filePath: '/path/to/audio.wav',
  playbackSpeed: 1.0, // 1x speed, can go faster for testing
  chunkDurationMs: 100,
  loop: false,
  onEndOfFile: () => {
    console.log('File playback complete');
  },
  logger: (msg) => console.log(msg),
});

// Use with RealtimeTranscriber
const transcriber = new RealtimeTranscriber(
  {
    whisperContext,
    vadContext: vadWrapper,
    audioStream, // File simulation adapter
    fs: RNFS,
  },
  { /* options */ },
  { /* callbacks */ }
);
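
The interaction between chunkDurationMs and playbackSpeed can be worked out with simple arithmetic. The sketch below assumes the adapter emits chunks holding chunkDurationMs of audio and divides the wall-clock delay between chunks by playbackSpeed; it also assumes 16 kHz audio, the sample rate Whisper models expect. `chunkPlan` is a hypothetical helper, and the adapter's actual scheduling may differ.

```typescript
interface ChunkPlan {
  samplesPerChunk: number; // audio samples delivered in each chunk
  delayMs: number;         // assumed wall-clock delay between chunks
}

// Assumptions: fixed-size chunks, delay scaled by playback speed, 16 kHz audio.
function chunkPlan(
  chunkDurationMs: number,
  playbackSpeed: number,
  sampleRate: number = 16000
): ChunkPlan {
  return {
    samplesPerChunk: Math.round((sampleRate * chunkDurationMs) / 1000),
    delayMs: chunkDurationMs / playbackSpeed,
  };
}

console.log(chunkPlan(100, 1.0)); // { samplesPerChunk: 1600, delayMs: 100 }
console.log(chunkPlan(100, 4.0)); // { samplesPerChunk: 1600, delayMs: 25 }
```

Under these assumptions, a playbackSpeed of 4 feeds a 60-second file through the pipeline in about 15 seconds, which is useful for quick test runs.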

Advanced Features

Force Next Slice

Manually trigger a slice during transcription:
await transcriber.nextSlice();
console.log('Forced next slice');

Update VAD Options

Change VAD settings during transcription:
transcriber.updateVadOptions(VAD_PRESETS.sensitive);

Reset Transcriber

Clear all state without stopping:
transcriber.reset();
console.log('Transcriber reset');

Get Transcription Results

Retrieve all transcription results:
const results = transcriber.getTranscriptionResults();
results.forEach(({ slice, transcribeEvent }) => {
  console.log(`Slice ${slice.index}: ${transcribeEvent.data?.result}`);
});
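
A common follow-up is to combine the per-slice results into one transcript. The sketch below assumes only the shape used in the snippet above (slice.index and transcribeEvent.data?.result); the interfaces and `joinTranscript` are hypothetical stand-ins, not the library's exported types.

```typescript
// Minimal stand-ins for the result shape used above (not the library's types).
interface SliceInfo {
  index: number;
}
interface TranscribeEventSketch {
  data?: { result: string };
}
interface SliceResult {
  slice: SliceInfo;
  transcribeEvent: TranscribeEventSketch;
}

// Sort by slice index, drop empty results, and join the text with spaces.
function joinTranscript(results: SliceResult[]): string {
  return results
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.slice.index - b.slice.index)
    .map((r) => (r.transcribeEvent.data?.result ?? '').trim())
    .filter((text) => text.length > 0)
    .join(' ');
}

const transcript = joinTranscript([
  { slice: { index: 1 }, transcribeEvent: { data: { result: ' world ' } } },
  { slice: { index: 0 }, transcribeEvent: { data: { result: 'Hello' } } },
  { slice: { index: 2 }, transcribeEvent: {} },
]);
console.log(transcript); // Hello world
```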

Performance Tips

Slice Duration: 30 seconds is a reasonable default for most cases. Shorter slices are processed more often, which costs more CPU; longer slices hold more audio in memory.
Memory Management: Set maxSlicesInMemory: 3 to keep memory usage low. Older slices are discarded automatically.
VAD Preset: Start with ‘default’, then switch to ‘sensitive’ for quiet environments or ‘conservative’ for noisy environments.
Model Selection: Use the ‘tiny’ or ‘base’ models for realtime use. Larger models may cause lag on lower-end devices.
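
The trade-off between slice duration and memory can be estimated with a quick calculation. The sketch below assumes raw 16 kHz, mono, 16-bit PCM audio (2 bytes per sample) and counts only the audio buffers, not model weights or other overhead. `estimateSliceMemoryMB` is a hypothetical helper, and the estimatedMB value reported by onStatsUpdate may be computed differently.

```typescript
// Assumptions: 16 kHz mono audio stored as 16-bit PCM (2 bytes per sample).
function estimateSliceMemoryMB(
  audioSliceSec: number,
  maxSlicesInMemory: number,
  sampleRate: number = 16000,
  bytesPerSample: number = 2
): number {
  const bytes = audioSliceSec * sampleRate * bytesPerSample * maxSlicesInMemory;
  return bytes / (1024 * 1024);
}

console.log(estimateSliceMemoryMB(30, 3).toFixed(2)); // 2.75
console.log(estimateSliceMemoryMB(60, 3).toFixed(2)); // 5.49
```

Under these assumptions, three 30-second slices hold about 2.75 MB of audio, and doubling the slice duration doubles that figure.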

Next Steps

  • Basic Transcription: Learn basic audio file transcription
  • VAD Detection: Understand Voice Activity Detection
  • File Handling: Work with different audio formats
  • API Reference: Full RealtimeTranscriber API documentation