From Sound Waves to Scripture on Screen

When a minister says “as we read in First Corinthians chapter thirteen verse four, love is patient, love is kind,” a lot happens inside Vies in the span of a couple of seconds. Audio becomes text. Text becomes candidate matches. Candidates get scored. The winning verse lands on your projection screen.

This article walks through each stage of that process.

Stage 1: Hearing What Is Being Said

The first job is turning spoken audio into written text. Vies runs AI-powered speech recognition directly on your computer. In the default Local mode, no audio leaves the machine, no cloud service is involved, and no internet connection is required.

Before audio reaches the speech engine, Voice Activity Detection (Silero VAD) filters out silence, hymns, and ambient noise. This prevents the transcriber from processing dead air, which would otherwise produce garbled text and false verse matches. VAD is enabled by default and can be toggled in settings.
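The gating idea can be sketched in a few lines. This is a minimal illustration, not Vies's implementation: it assumes a per-frame speech probability is already available (in Vies that would come from Silero VAD; here the caller supplies the numbers), and the function name, the 0.5 threshold, and the minimum run length are all hypothetical.

```python
from typing import List, Sequence, Tuple


def speech_segments(
    probs: Sequence[float],
    threshold: float = 0.5,   # hypothetical cut-off, not a Vies setting
    min_frames: int = 3,      # hypothetical minimum run length
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, that look like speech.

    A run counts as speech when at least `min_frames` consecutive frames
    have a speech probability at or above `threshold`.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i                      # a run of speech begins
        elif start is not None:
            if i - start >= min_frames:
                segments.append((start, i))    # keep runs that are long enough
            start = None
    # Close a run that is still open at the end of the input.
    if start is not None and len(probs) - start >= min_frames:
        segments.append((start, len(probs)))
    return segments
```

Only the frames inside the returned ranges would be handed to the transcriber. Short blips, like a single loud frame, are dropped because they fall below the minimum run length.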

The speech recognition engine handles accents, background noise, and theological vocabulary well. It processes audio in chunks and streams partial transcripts back to Vies every few seconds, giving the detection pipeline fresh text continuously.

Vies offers three quality tiers for the speech engine: Best (1.5 GB, GPU required), Standard (834 MB), and Lite (466 MB, no GPU needed). Higher tiers produce more accurate transcription, which directly improves verse detection downstream.

In the default Local mode, everything runs on-device and your sermon audio stays completely private. There is nothing to configure and no account to create. Plug in a microphone and go.

Stage 2: Finding the Right Verse

Once Vies has a transcript fragment, it runs two detection strategies in parallel. This dual approach is key to catching verses whether the minister quotes the text word-for-word or simply names the reference.

Intelligent Text Matching

Vies maintains the full Bible text indexed and ready for comparison. When a new transcript fragment arrives, it scores the spoken words against every passage and returns the best matches.

In practice: if the preacher says “for God so loved the world that He gave His only begotten Son,” Vies recognizes those words match John 3:16 with high confidence, even if the minister never says “John three sixteen” out loud. The matching system weighs word frequency, passage length, and term importance to produce a relevance score.
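Word frequency, passage length, and term importance are the same ingredients a BM25-style ranking function uses. Vies does not say which algorithm it uses, so the sketch below is an assumption: a small BM25 scorer over a three-verse sample (KJV wording), with hypothetical class and parameter names.

```python
import math
import re
from collections import Counter

# Tiny sample corpus; a real index would hold the whole Bible.
VERSES = {
    "John 3:16": "For God so loved the world, that he gave his only begotten "
                 "Son, that whosoever believeth in him should not perish, "
                 "but have everlasting life.",
    "Romans 8:28": "And we know that all things work together for good to "
                   "them that love God, to them who are the called according "
                   "to his purpose.",
    "Psalm 23:1": "The LORD is my shepherd; I shall not want.",
}


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class Bm25Index:
    def __init__(self, passages, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = {ref: Counter(tokenize(t)) for ref, t in passages.items()}
        self.lengths = {ref: sum(c.values()) for ref, c in self.docs.items()}
        self.avg_len = sum(self.lengths.values()) / len(self.docs)
        doc_freq = Counter()
        for counts in self.docs.values():
            doc_freq.update(counts.keys())
        n = len(self.docs)
        # Term importance: rarer words get a larger weight.
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5))
                    for t, d in doc_freq.items()}

    def score(self, query, ref):
        counts = self.docs[ref]
        # Passage length normalisation: long passages are not favoured
        # just because they contain more words.
        norm = self.k1 * (1 - self.b + self.b * self.lengths[ref] / self.avg_len)
        total = 0.0
        for term in set(tokenize(query)):
            tf = counts.get(term, 0)   # word frequency inside the passage
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def search(self, query, top_k=3):
        ranked = sorted(((self.score(query, r), r) for r in self.docs),
                        reverse=True)
        return [(ref, s) for s, ref in ranked[:top_k]]
```

Calling `Bm25Index(VERSES).search("for God so loved the world that He gave His only begotten Son")` ranks John 3:16 first, because it is the only passage containing the rarer words such as “loved,” “world,” and “begotten.”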

This approach excels when the minister quotes or closely paraphrases scripture. It struggles with short, generic phrases that could match many passages. That is why Vies uses a confidence threshold.

Spoken Reference Parsing

The second strategy handles the case where the minister names the reference directly: “Turn with me to Psalm twenty-three” or “look at chapter five, from verse twelve to fifteen.”

Vies has a purpose-built parser that understands the patterns preachers use when citing scripture. It handles:

  • Standard references like “Romans 8:28” or “First John chapter 3 verse 16”
  • Spoken number words like “twenty-three” and ordinal book names like “Second Corinthians”
  • Range expressions like “verses 4 through 7” or “from verse 1 to verse 5”
  • Contextual continuations where the minister says “verse 10” and Vies knows to apply it to the book and chapter from the previous reference

The parser ignores surrounding words and extracts just the structural reference pattern. When the minister says “chapter fourteen verse six,” it knows from context that they are still in the same book they referenced a minute ago.
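A parser along these lines could be sketched as follows. It is a simplified illustration rather than Vies's actual parser: the book list is a small sample, and the names `ReferenceParser`, `Reference`, and `words_to_digits` are hypothetical. It converts number words and ordinals to digits, extracts the reference pattern with a regular expression, and fills in a missing book or chapter from the previous reference.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

UNITS = {w: i for i, w in enumerate(
    "one two three four five six seven eight nine ten eleven twelve thirteen "
    "fourteen fifteen sixteen seventeen eighteen nineteen".split(), start=1)}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
ORDINALS = {"first": "1", "second": "2", "third": "3"}
BOOKS = ["1 corinthians", "2 corinthians", "1 john", "john", "romans", "psalm"]


def words_to_digits(text: str) -> str:
    """Lowercase the text and turn 'twenty-three' into '23', 'First' into '1'."""
    tokens = re.findall(r"[a-z]+|\d+|:", text.lower().replace("-", " "))
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in TENS:
            value = TENS[tok]
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if UNITS.get(nxt, 10) < 10:      # "twenty" followed by "three"
                value += UNITS[nxt]
                i += 1
            out.append(str(value))
        elif tok in UNITS:
            out.append(str(UNITS[tok]))
        else:
            out.append(ORDINALS.get(tok, tok))
        i += 1
    return " ".join(out)


BOOK_ALT = "|".join(sorted(BOOKS, key=len, reverse=True))  # longest name first
REF = re.compile(
    rf"(?:\b(?P<book>{BOOK_ALT})\s+)?"
    r"(?:(?:(?P<kw>chapter)\s+)?(?P<chapter>\d+)\s*)?"
    r"(?:(?::|(?:from\s+)?verses?)\s*(?P<start>\d+)"
    r"(?:\s+(?:through|to)\s+(?:verse\s+)?(?P<end>\d+))?)?"
)


@dataclass
class Reference:
    book: str
    chapter: int
    verse_start: Optional[int] = None
    verse_end: Optional[int] = None


class ReferenceParser:
    def __init__(self) -> None:
        self.last_book: Optional[str] = None
        self.last_chapter: Optional[int] = None

    def parse(self, spoken: str) -> List[Reference]:
        refs = []
        for m in REF.finditer(words_to_digits(spoken)):
            start, end = m.group("start"), m.group("end")
            # A bare number is not a reference: without a verse we require
            # a chapter introduced by a book name or the word "chapter".
            if start is None and not (
                    m.group("chapter") and (m.group("book") or m.group("kw"))):
                continue
            # Contextual continuation: reuse the previous book and chapter.
            book = m.group("book") or self.last_book
            chapter = (int(m.group("chapter")) if m.group("chapter")
                       else self.last_chapter)
            if book is None or chapter is None:
                continue                     # nothing to continue from
            self.last_book, self.last_chapter = book, chapter
            refs.append(Reference(book, chapter,
                                  int(start) if start else None,
                                  int(end) if end else None))
        return refs
```

With one parser instance, “First Corinthians chapter thirteen verse four” followed later by “verses 4 through 7” yields a second reference that inherits the book and chapter from the first. A fresh parser that hears only “verse 10” returns nothing, since there is no earlier reference to attach it to.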

Cloud AI Enhancement

If Cloud AI mode is enabled, transcript text is also sent to Gemini for an additional layer of verse identification. Only the transcribed words are sent; no audio leaves the machine. This requires a Gemini API key (bring your own key), and it works alongside the local detection strategies rather than replacing them.

Speaker Diarization

Vies tracks who is speaking so it can focus on the minister and ignore other voices in the congregation. If someone in the front row reads along aloud, their words do not trigger a verse detection.

Stage 3: Filtering and Deduplication

Both strategies produce candidates, each with a confidence score. Vies applies a minimum threshold (configurable in settings) and discards anything below it. The surviving candidates go through a deduplication layer.

Why deduplication? Preachers repeat themselves. A minister might say “Romans 8:28,” read the verse, then say “Romans 8:28 tells us…” within thirty seconds. Vies prevents the same verse from firing multiple times in quick succession. There is also a cap on verses per transcript update, so a rapid-fire list of cross-references does not cause the screen to flicker.
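The three rules in this stage (minimum confidence, no repeats in quick succession, a cap per update) fit in one small filter. The sketch below is illustrative only: the class name and the default values (0.6 threshold, 30-second cooldown, two verses per update) are assumptions, not Vies's real settings.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Candidate:
    reference: str
    confidence: float


@dataclass
class VerseGate:
    min_confidence: float = 0.6     # hypothetical default threshold
    cooldown_seconds: float = 30.0  # hypothetical repeat window
    max_per_update: int = 2         # hypothetical cap per transcript update
    _last_fired: Dict[str, float] = field(default_factory=dict)

    def filter(self, candidates: List[Candidate], now: float) -> List[Candidate]:
        """Return the candidates allowed through for this transcript update."""
        accepted: List[Candidate] = []
        # Highest confidence first, so the cap keeps the best matches.
        for cand in sorted(candidates, key=lambda c: c.confidence, reverse=True):
            if len(accepted) >= self.max_per_update:
                break
            if cand.confidence < self.min_confidence:
                continue                      # below the threshold
            last = self._last_fired.get(cand.reference)
            if last is not None and now - last < self.cooldown_seconds:
                continue                      # same verse fired too recently
            self._last_fired[cand.reference] = now
            accepted.append(cand)
        return accepted
```

In use, a gate that lets “Romans 8:28” through at time 0 rejects the same reference at 10 seconds and accepts it again at 45 seconds, once the cooldown has passed.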

A profanity filter also runs at this stage. If the transcript contains inappropriate language (sometimes hallucinated by the speech engine during silence), the segment is redacted and blocked from verse matching.
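As a rough sketch of the redact-and-block step, assuming a simple word blocklist (the source does not describe how Vies detects inappropriate language, and the placeholder words and function name below are hypothetical):

```python
import re
from typing import Set, Tuple

# Placeholder entries; a real blocklist would be much longer.
BLOCKLIST: Set[str] = {"darn", "heck"}


def redact(segment: str, blocklist: Set[str] = BLOCKLIST) -> Tuple[str, bool]:
    """Mask blocklisted words and report whether the segment should be blocked."""
    blocked = False

    def mask(match: "re.Match[str]") -> str:
        nonlocal blocked
        word = match.group(0)
        if word.lower() in blocklist:
            blocked = True
            return "*" * len(word)   # keep the length, hide the word
        return word

    cleaned = re.sub(r"[A-Za-z']+", mask, segment)
    return cleaned, blocked
```

A segment that comes back with the flag set to `True` would be shown in redacted form and skipped by the verse matchers.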

Stage 4: Output

The winning verse goes to whichever outputs you have enabled. That could be any combination of:

  • The NDI canvas, for streaming via OBS or vMix
  • The floating overlay window on your projection display
  • EasyWorship’s scripture search, for direct projection control
  • FreeShow via its REST API (supports remote hosts, not just localhost)
  • OpenLP via its REST API (port 4316)
  • Other Vies instances on the local network via WebSocket sharing

All outputs receive the verse simultaneously. The pipeline does not favor one over another.
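One way to send a verse to every output at the same time is to run the sends concurrently and collect the results. The sketch below uses Python's `asyncio.gather`; the `broadcast` function and the stand-in outputs in `demo` are hypothetical and do not reflect how Vies talks to NDI, FreeShow, or OpenLP.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Output = Callable[[str], Awaitable[None]]


async def broadcast(verse: str, outputs: Dict[str, Output]) -> Dict[str, bool]:
    """Send `verse` to every output concurrently; report success per output."""
    names = list(outputs)
    # return_exceptions=True keeps one failing output from cancelling the rest.
    results = await asyncio.gather(
        *(outputs[name](verse) for name in names),
        return_exceptions=True,
    )
    return {name: not isinstance(result, Exception)
            for name, result in zip(names, results)}


async def demo():
    received = []

    async def overlay(verse: str) -> None:      # stand-in for a real output
        received.append(("overlay", verse))

    async def ndi(verse: str) -> None:          # stand-in for a real output
        received.append(("ndi", verse))

    async def unreachable(verse: str) -> None:  # simulates a failed output
        raise ConnectionError("host unreachable")

    status = await broadcast(
        "John 3:16",
        {"overlay": overlay, "ndi": ndi, "unreachable": unreachable},
    )
    return status, received
```

Running `asyncio.run(demo())` delivers the verse to both working outputs even though the third one fails, so an unreachable remote host does not hold up the projection screen.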

What About Accuracy?

No speech recognition system is perfect. Background noise, strong accents, and crosstalk all reduce accuracy. But in a typical church setting with a decent microphone feed from the mixing desk, Vies catches the large majority of verse references.

The text matching and spoken reference parser together cover the two main ways preachers reference scripture: quoting the text and naming the location. The one gap is vague allusions where someone references a verse indirectly without quoting or naming it. That is genuinely hard for any system, and Vies does not try to guess.

To see this in action, install Vies and try it with a recorded sermon. The Getting Started guide covers the setup.