VoxAccalia

Cue breakdown & scoring notes

Technical detail for speech pathologists, voice coaches, and researchers—not part of the everyday user guide.

This section explains how cue breakdown scores and the practice index are produced (measurement, references, weighting, and limits) so you can answer client or colleague questions without treating the app as a listener test or clinical instrument.

Summary. Each clip is analyzed with Praat-class algorithms (Parselmouth). Pitch, formants, and intonation are each mapped to a 0–100% position between two acoustic anchors (population references or the user’s saved starting voice and training target). Practice index is a weighted blend of those three cues (currently 60% formants, 30% pitch, 10% intonation). It tracks acoustic change over time—not perceived gender, pass/fail, or diagnostic status.

Pipeline

  1. Acoustic extraction — median F0 on voiced frames; F1–F3 as IQR-trimmed medians from Burg tracking (implausible frames dropped); pitch σ and range for intonation.
  2. Per-cue mapping — each measurement is compared to a low and high endpoint (Hz), then converted with a logistic curve to 0–100% on an internal lower-to-higher acoustic axis.
  3. Practice index — weighted average of the three cue percentages; formants are omitted from the blend when automatic tracking is deemed unreliable.
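
Step 2 can be sketched as follows. This is a minimal illustration, not the app's actual code: the function name `logistic_percent`, the centering of the curve on the midpoint between the anchors, and the `steepness` value of 6 are all assumptions, since the exact curve parameters are not stated here.

```python
import math


def logistic_percent(value_hz: float, low_hz: float, high_hz: float,
                     steepness: float = 6.0) -> float:
    """Map a measurement to 0-100% between two anchors with a logistic curve.

    Hypothetical helper: the midpoint centering and the steepness default
    are illustrative assumptions, not documented app settings.
    """
    if high_hz <= low_hz:
        raise ValueError("high_hz must be greater than low_hz")
    # Normalized position: 0.0 at the low anchor, 1.0 at the high anchor.
    position = (value_hz - low_hz) / (high_hz - low_hz)
    # Logistic centered on the midpoint; clamp the exponent to avoid overflow.
    exponent = max(-60.0, min(60.0, -steepness * (position - 0.5)))
    return 100.0 / (1.0 + math.exp(exponent))
```

With these assumed settings, a value exactly halfway between the anchors maps to 50%, and the anchors themselves land near 5% and 95% rather than at 0% and 100%, which is the usual reason for choosing a logistic over a hard linear clamp.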

Cue breakdown: the three cards

Pitch

Measure: median fundamental frequency (F0) on voiced speech frames.
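
As a rough sketch of this measure, assuming the pitch track arrives as a list of per-frame F0 values in which unvoiced frames are marked as 0 or None (a common convention for Praat-style pitch tracks, but an assumption about this app), the median could be taken like this. The function name `median_f0_voiced` is hypothetical.

```python
from statistics import median


def median_f0_voiced(f0_frames_hz):
    """Median F0 over voiced frames only.

    Assumes unvoiced frames are encoded as 0 or None (illustrative
    convention; the app's internal representation is not documented here).
    """
    voiced = [f for f in f0_frames_hz if f is not None and f > 0]
    if not voiced:
        return None  # no voiced speech detected in the clip
    return median(voiced)
```

For example, `median_f0_voiced([0, 180.0, 0, 200.0, 190.0])` ignores the two unvoiced frames and returns 190.0.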

Endpoints: population means from a labeled pitch dataset (the primaryobjects voice-gender CSV, 3,168 samples; see references below), or the user’s personal starting / training target recordings when set.

Research basis: median F0 is a well-studied acoustic dimension in voice perception research, but it is not the whole story—listeners also use resonance (formants), and the two can move independently on a given clip. Some lab tasks rank F0 first when cues are isolated (e.g. Skuk & Schweinberger, 2014); in connected speech, formants often stay informative even when F0 sits in an ambiguous range (see Gelfer & Bennett, 2013, in references below).

In this app: pitch is 30% of practice index—important, but weighted below resonance (60%) because sustainable voice training here targets pitch and mouth/throat shape together, not laryngeal height alone. That matches our FAQ and the formants guide on How it works.

Resonance (formants)

Measure: F1, F2, and F3 as IQR-trimmed medians; each formant gets its own logistic %, then combines as 50% F1 + 30% F2 + 20% F3 for the formants cue.
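
The within-cue blend can be sketched as below. The 50/30/20 weights come from the text above; the names `FORMANT_WEIGHTS` and `formants_cue_percent` are hypothetical, and the inputs are assumed to be per-formant percentages that have already been mapped to 0-100%.

```python
# Within-cue weights as stated in this section (50% F1, 30% F2, 20% F3).
FORMANT_WEIGHTS = {"F1": 0.5, "F2": 0.3, "F3": 0.2}


def formants_cue_percent(percent_by_formant):
    """Blend per-formant percentages (0-100) into one formants cue percentage."""
    return sum(weight * percent_by_formant[name]
               for name, weight in FORMANT_WEIGHTS.items())
```

As a worked case, per-formant scores of 60% (F1), 40% (F2), and 50% (F3) give 30 + 12 + 10 = 52% for the formants cue.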

Population anchors: Hillenbrand et al. (1995) corner-vowel averages for American English (e.g. lower-reference F1 ~519 Hz vs higher-reference ~625 Hz for the four-vowel average). See formants & resonance on How it works; the Benchmarks table there requires logging in.

Research basis: formant frequencies reflect vocal-tract resonance (general acoustics). Listener gender-judgment studies find F0 and formants both matter: Skuk & Schweinberger (2014) ranked F0 strongest among isolated syllable cues but showed F0 + formants together can equal a full morph; Gelfer & Bennett (2013) found vowel formants especially informative when speaking F0 sits around 145–165 Hz in connected speech.

In this app: resonance is 60% of practice index (F1 weighted highest within that cue). Passage-level medians compared to Hillenbrand vowel anchors are a training convention, not identical to isolated-vowel clinical protocols.

Intonation

Measure: pitch standard deviation (σ, computed on IQR-trimmed frames) and pitch range (max − min F0) within the clip.
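
A minimal sketch of these two measures follows. Several details are assumptions made for illustration: the trimming rule (Tukey-style fences at 1.5 × IQR), the use of the population standard deviation, and taking the range over untrimmed voiced frames. The names `iqr_trim` and `intonation_measures` are hypothetical.

```python
from statistics import pstdev, quantiles


def iqr_trim(values, fence=1.5):
    """Drop values outside Q1 - fence*IQR .. Q3 + fence*IQR (assumed rule)."""
    if len(values) < 4:
        return list(values)  # too few frames to estimate quartiles sensibly
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - fence * spread, q3 + fence * spread
    return [v for v in values if low <= v <= high]


def intonation_measures(voiced_f0_hz):
    """Return (sigma, pitch_range) in Hz for one clip's voiced F0 frames."""
    if not voiced_f0_hz:
        raise ValueError("need at least one voiced frame")
    sigma = pstdev(iqr_trim(voiced_f0_hz))  # spread after outlier trimming
    pitch_range = max(voiced_f0_hz) - min(voiced_f0_hz)  # untrimmed max - min
    return sigma, pitch_range
```

The trimming matters mostly for σ: a single octave-jump tracking error (say one 600 Hz frame in a clip centered near 200 Hz) would otherwise dominate the standard deviation.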

Endpoints: configured typical values for σ and range (documented in app config); personal baselines can override when saved.

Research basis: prosody and pitch dynamics add secondary information beyond mean F0. The cue is weighted lightly (10%) because the product prioritizes sustainable pitch and resonance training over expressiveness alone.

Practice index

When all cues are trustworthy: practice index ≈ (60% × formants) + (30% × pitch) + (10% × intonation), with weights renormalized if formants are excluded. The displayed label is practice index, not a listener judgment. Users who train toward a lower voice see the same internal math reflected toward their chosen target direction.
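
The blend and its renormalization can be sketched as below. The 60/30/10 weights are taken from this section; the names `HEADLINE_WEIGHTS` and `practice_index` and the `formants_trusted` flag are hypothetical stand-ins for however the app signals unreliable formant tracking.

```python
# Headline weights as stated in this section.
HEADLINE_WEIGHTS = {"formants": 0.6, "pitch": 0.3, "intonation": 0.1}


def practice_index(cue_percent, formants_trusted=True):
    """Weighted blend of cue percentages (0-100).

    When formants are not trusted, they are dropped and the remaining
    weights are rescaled so they still sum to 1.
    """
    used = {name: weight for name, weight in HEADLINE_WEIGHTS.items()
            if name != "formants" or formants_trusted}
    total_weight = sum(used.values())
    blended = sum(weight * cue_percent[name] for name, weight in used.items())
    return blended / total_weight
```

For cue scores of 50% (formants), 70% (pitch), and 40% (intonation), the full blend is 30 + 21 + 4 = 55%. With formants excluded, pitch and intonation are effectively weighted 75% and 25%, giving 62.5%.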

Personal baselines (on How it works) replace population endpoints when set; re-scoring uses stored metrics without re-running analysis unless the user re-analyzes audio.
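
One way to picture the override, as a sketch only: endpoints are kept per measure, and any personal pair replaces the population pair for that measure before stored metrics are re-scored. The function name `resolve_anchors`, the dictionary layout, and the per-measure fallback are assumptions; the population F1 pair is the Hillenbrand four-vowel average quoted above, while the personal pair in the usage note is a made-up example.

```python
from typing import Dict, Optional, Tuple

Anchors = Tuple[float, float]  # (low_hz, high_hz)


def resolve_anchors(population: Dict[str, Anchors],
                    personal: Optional[Dict[str, Anchors]] = None
                    ) -> Dict[str, Anchors]:
    """Return the endpoints to score against, preferring personal baselines.

    Measures without a saved personal pair fall back to the population
    pair (assumed behavior). The input dictionaries are not modified.
    """
    resolved = dict(population)
    if personal:
        resolved.update(personal)
    return resolved
```

For example, `resolve_anchors({"F1": (519.0, 625.0)}, {"F1": (540.0, 610.0)})` returns the personal pair for F1, and calling it without a second argument returns the population pair unchanged. Because only the endpoints change, previously stored measurements can be mapped again without touching the audio.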

What research supports vs what we chose in software

  • From literature: published vowel anchors (Hillenbrand 1995); listener gender-perception work on F0 and formants (Skuk & Schweinberger, 2014—syllable morphing; Gelfer & Bennett, 2013—connected speech). Our 10% intonation weight is a product choice, not taken from those papers (they did not test pitch σ/range the way this app does).
  • Product choices (transparent, not from one paper’s formula): logistic mapping between anchors; 60/30/10 headline weights; 50/30/20 within formants; IQR trimming and formant-trust gating.

Selected references

  • Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5), 3099–3111. (Corner-vowel formant anchors used in VoxAccalia.)
  • Skuk, V. G., & Schweinberger, S. R. (2014). Influences of fundamental frequency, formant frequencies, aperiodicity, and spectrum level on the perception of voice gender. Journal of Speech, Language, and Hearing Research, 57(1), 285–296. doi:10.1044/1092-4388(2013/12-0314) (Isolated syllable tokens, parameter morphing; DOI year 2013, print Feb 2014. F0 strongest single cue; F0 + formants comparable to full morph.)
  • Gelfer, M. P., & Bennett, Q. E. (2013). Speaking fundamental frequency and vowel formant frequencies: Effects on perception of gender. Journal of Voice, 27(5), 556–566. doi:10.1016/j.jvoice.2012.11.008 (Connected speech; formants important especially when SFF is ~145–165 Hz; compared with their isolated-vowel work.)
  • Pitch population means: primaryobjects/voice-gender dataset (engineering reference, not clinical norm data).