How it works — VoxAccalia

What this app is — and what it is not

VoxAccalia is not a passability detector. It does not judge how others will perceive you, whether you “pass,” or whether a voice sounds convincing. No percentage here can answer that—and we will not pretend it can.

This does not replace listening. Scores are no substitute for hearing your own recordings, getting feedback from a trusted listener, or working with a speech clinician or voice coach when you have one. Use the numbers to track practice—not instead of real listening.

This app is an acoustic measurement tool. It extracts pitch (how high or low), formants F1–F3 (vowel resonance and mouth shape), and intonation (pitch movement) from a clip, then compares those numbers to reference ranges or your own baselines. The practice index is a weighted blend of those measurements. It is useful for tracking your practice over time, not for labeling you.

Pitch and formants move independently. A very high, cartoony falsetto can raise pitch scores while formants stay unchanged—so the overall % may rise even when the voice does not sound natural or aligned with your goal. That is expected math, not a bug. Sustainable training usually shifts pitch and resonance together; watch the breakdown lines, not just practice index.

Garbage in, garbage out. Whispering, mumbling, background noise, or clips so quiet the analyzer barely finds speech will produce unreliable numbers. Record a few seconds of clear, natural speech at a normal volume, or upload a clean file with Upload file (found alongside Record under New clip in the navigation bar).

How to use VoxAccalia

Everything happens from the top navigation after you log in. You do not need hidden settings or developer tools.

Record (recommended)

Menu → New clip → Record passage. Pick a standard read-aloud passage (Rainbow, Grandfather, etc.), press record, speak clearly at a normal volume for a few seconds, then save. Passages keep your takes comparable over time.

Upload

Menu → New clip → Upload file (or “Upload file instead” on the Record page). Choose a WAV, MP3, M4A, or similar file up to the size limit. Name the clip, submit, and wait for analysis—same results page as a browser recording.

Read your results

Menu → Recordings → open a clip. Check the pitch / formant / intonation breakdown—not only practice index. Compare takes on the same passage when you have two or more.

Track progress

Menu → Progress for charts and streaks. Optional: use Menu → Baselines so scores reflect your starting voice vs your target, not only population averages.

The big picture

You record a short clip (or upload one). The app listens to stable parts of your speech and measures a few acoustic features: how high your voice is (pitch), how your mouth shapes vowels (resonance / formants), and how much your pitch moves up and down (intonation).

Those measurements are compared to reference points—either typical ranges from speech research, or clips you saved as your own “starting voice” and “training target.” The result is a set of percentages and charts meant to help you practice, not to label you.

Step by step: one recording

You record or upload. Give the clip a name so you can find it later. Audio stays tied to your account.
The app finds voiced speech. It ignores long silences and focuses on parts where you are actually speaking, so one number represents your voice rather than background noise.
It measures pitch. Think of this as “how high or low” the vibration of your vocal folds is—the familiar idea of a higher or lower speaking voice.
It measures resonance (formants). These are bands of energy shaped by your tongue, lips, and jaw. They are why the same pitch can sound “brighter,” “darker,” or more “front” / “back” in the mouth. The app tracks the first three formant averages (F1, F2, F3) as simple Hz numbers.
It looks at intonation. How steady your pitch is and how wide it swings (minimum to maximum) both feed a smaller part of the overall score.
It compares you to references. For each feature, the app asks: “Where does this clip sit between a lower reference and a higher reference?” That position becomes a cue score, and the cue scores are then combined into the practice index.
You see results. A practice index, breakdowns for pitch / resonance (formants) / intonation, charts over time within the clip, and links to manage your personal baselines.
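One way to picture the comparison step is as a position between two reference points. The sketch below is only an illustration under stated assumptions: the function name `cue_score`, the linear scaling, the clamping to the 0–1 range, and the example Hz values are all hypothetical, not VoxAccalia's actual code or reference data.

```python
def cue_score(value: float, low_ref: float, high_ref: float) -> float:
    """Return where `value` sits between two references, clamped to 0..1.

    Assumption: a straight-line (linear) scale between the references.
    The app's real mapping is not documented on this page.
    """
    if high_ref == low_ref:
        raise ValueError("the two references must differ")
    position = (value - low_ref) / (high_ref - low_ref)
    # Values outside the references are pinned to the nearest end.
    return max(0.0, min(1.0, position))


# Hypothetical numbers: a clip with a 165 Hz median pitch, scored
# between a 120 Hz lower reference and a 210 Hz higher reference.
print(cue_score(165.0, 120.0, 210.0))  # 0.5 (halfway between the references)
print(cue_score(100.0, 120.0, 210.0))  # 0.0 (below the lower reference)
```

A score of 0.5 here only says the measurement sits halfway between the two chosen references. It says nothing about how the clip sounds to a listener.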

What the numbers on a recording mean

Each completed recording shows a few summary measurements. They are computed from voiced speech frames—pauses and silence are skipped, so a long gap between phrases does not pull pitch or formants toward zero.

Pitch (median) — the middle pitch across all voiced frames in the clip. This value is not IQR-trimmed; it reflects your speech as tracked, minus silent gaps.
Pitch variability (σ) — how much your pitch moves during speech. We trim extreme frame-to-frame outliers (1.5× the interquartile range) before calculating σ so one bad tracker spike does not dominate. When raw and trimmed σ differ a lot, the page shows both.
Formants F1–F3 (trimmed medians) — resonance bands shaped by your mouth and throat. These are IQR-trimmed medians: frames far outside the typical spread for that clip are dropped, then we take the median of what remains. That is what feeds practice index.
Raw formant medians — when trimming changes the number by more than a couple of Hz, the recording page also shows the plain median of every valid tracked frame. A big gap usually means noisy audio or shaky formant tracking, not that you “really” sound that different.
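The trimming described above can be sketched in a few lines. This is a rough illustration, not the app's implementation. The function names, the use of `None` for unvoiced frames, the quartile method, the population form of σ, and the reuse of the 1.5× factor for formants are all assumptions made for the example.

```python
import statistics


def iqr_trim(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:
        return list(values)  # too few frames to estimate quartiles sensibly
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def summarize_pitch(f0_frames):
    """Median over all voiced frames; sigma both raw and IQR-trimmed."""
    voiced = [f for f in f0_frames if f is not None]  # skip silence
    if not voiced:
        raise ValueError("no voiced frames found")
    return {
        "median_hz": statistics.median(voiced),  # not trimmed
        "sigma_raw": statistics.pstdev(voiced),
        "sigma_trimmed": statistics.pstdev(iqr_trim(voiced)),
    }


def trimmed_median(frames):
    """IQR-trimmed median, as described for the F1-F3 summaries."""
    valid = [f for f in frames if f is not None]
    return statistics.median(iqr_trim(valid))


# Made-up pitch track in Hz; None marks silence, 400.0 is a tracker spike.
frames = [None, 180.0, 182.0, 185.0, 183.0, 181.0, 400.0, None, 184.0]
summary = summarize_pitch(frames)
print(summary["median_hz"])   # 183.0
print(summary["sigma_trimmed"] < summary["sigma_raw"])  # True
```

In this made-up track the single 400 Hz spike inflates the raw σ but barely moves the median, which is the reason for showing both raw and trimmed values when they disagree.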

Re-analyzing an older clip after an app update may add raw median fields; clips analyzed before that will only show trimmed values until you run analysis again.

Surprised by high Hz values or how scores line up with what you hear? See the FAQ.

Formants & resonance in voice training

If you are working on how your voice is perceived—especially as part of transgender voice feminization or masculinization—pitch alone is not the whole story. Resonance (formants) is often what makes a voice sound brighter, darker, more “front,” or more “back” in the mouth, even when pitch stays similar.

What are formants?

When you speak, your vocal folds create a buzz. Your throat, tongue, lips, and jaw shape that buzz into vowels and timbre. Formants are the main resonant peaks in that shaped sound—think of them as “where the energy piles up” in Hz, not as a second pitch.

F1 — strongly tied to how open your mouth is (tongue height). It moves a lot between vowels and is the main resonance cue in our scoring.
F2 — tied to tongue front vs back (“ee” vs “oo” brightness). Important for vowel color and perceived brightness.
F3 — a higher band that can nudge timbre; we track it with less weight than F1 and F2.

On a recording, VoxAccalia reports trimmed medians of F1–F3 across voiced speech in the clip (outlier frames removed). They are useful for comparing your own takes over time, not for matching a single “correct” Hz on a chart.

Why they matter for many trans voice goals

Listeners often weigh resonance heavily when they describe a voice as more feminine- or masculine-leaning in everyday speech. Training that only raises or lowers pitch can sound thin, cartoony, or “stuck” if mouth shape and throat space do not change too. Sustainable practice usually shifts pitch and resonance together—exactly why practice index weights formants heavily (especially F1).

This app does not decide your identity or whether you “pass.” It gives acoustic feedback so you can hear yourself, adjust articulation and resonance drills, and see whether measurements move in the direction you chose (population anchors or your own baselines).

How to use formants in VoxAccalia

Open a completed recording and read the Resonance (formants) card (F1–F3 in Hz). Compare the pitch breakdown and the formant breakdown—not only practice index.
Listen first. Play the clip. If it sounds right but numbers look odd, trust your ears and check common questions about formants and scores.
Use comparable takes. Same passage, similar mic and volume, a few seconds of clear speech. Trends on Progress matter more than one session.
Set personal baselines when you can. Voice baselines (starting voice + training target) score you between your endpoints instead of only population vowel averages.
Watch F1 in particular. Many training paths focus on raising or lowering F1 relative to your references; F2 supports brightness. If tracking is marked unreliable, fix audio quality before chasing Hz targets.
Explore the formant chart on the recording page (when available) to see how F1–F3 move through the clip—not just one summary number.

All three formants are measured in Hz, but they are not pitch. F2 and F3 often fall around 1,500–2,800 Hz in research averages; an F3 around 2,500 Hz is normal. F1 on vowel charts is often around 500–650 Hz, but sentence-wide tracking can read higher. Do not “check” formants with a tone generator: a pure sine at that frequency is not your resonance. See the FAQ for F1–F3 ranges and why Hz numbers can look surprising.

What practice index means

The practice index is a training estimate based on acoustic patterns that, in research studies, tend to shift listener perception along a lower-to-higher axis for pitch and resonance. It is a weighted blend:

60% from resonance (formants), with the first formant (F1) counting the most
30% from average pitch
10% from intonation (pitch variability and range)

For what these scores do not measure, see the section at the top of this page on what this app is and is not. In short: training feedback from acoustics, not listener judgments or identity labels.
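As a rough worked example of the 60 / 30 / 10 blend, the sketch below combines three cue scores into one percentage. The names `WEIGHTS` and `practice_index` and the cue values are hypothetical, and each cue is assumed to be already scaled to 0–1. How F1, F2, and F3 combine into the resonance cue is not specified on this page, so the example treats that cue as a given number.

```python
# Weights taken from the list above; everything else is illustrative.
WEIGHTS = {"resonance": 0.60, "pitch": 0.30, "intonation": 0.10}


def practice_index(cues):
    """Blend 0..1 cue scores into a 0..100 practice index."""
    return 100.0 * sum(WEIGHTS[name] * cues[name] for name in WEIGHTS)


# A hypothetical take: modest resonance shift, mid pitch, some intonation.
natural = {"resonance": 0.2, "pitch": 0.5, "intonation": 0.4}

# The same take in a high falsetto: pitch maxes out, resonance is unchanged.
falsetto = {"resonance": 0.2, "pitch": 1.0, "intonation": 0.4}

print(round(practice_index(natural), 1))   # 31.0
print(round(practice_index(falsetto), 1))  # 46.0
```

The falsetto take gains 15 points from pitch alone while the resonance cue stays at 0.2, which is the "expected math" behind watching the breakdown rather than the single index.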

What we can’t tell you

VoxAccalia measures acoustics in a clip. It cannot answer questions that need a listener, a room, or a life context:

Whether someone will perceive your voice a certain way on a phone call, in person, or on video.
Whether you “pass,” sound convincing, or meet any social or clinical bar—those are human judgments, not Hz numbers.
Who you are or how you should describe your voice; scores describe training direction, not identity.
Medical, surgical, or clinical outcomes—this is practice feedback, not healthcare.
Proof from a single clip; trends across comparable takes and your cue breakdown matter more than one practice index.

Population benchmarks vs your personal baselines

On the Benchmarks page you can see reference values drawn from published vowel and pitch studies (for example, reference formant averages from vowel datasets). Those are useful defaults when you have not set anything personal yet.

On Voice baselines you can go further:

Starting voice — a recording of how you usually sound today (the low end of your personal scale).
Training target — a recording of a sound you are working toward (the high end of your personal scale).

When both are set, each new clip is scored as progress between your own endpoints. If you only set one, the app fills the other end from population benchmarks. Changing baselines updates scores on your existing recordings automatically—you do not need to re-record.
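The fallback and re-scoring behavior described above might look something like the sketch below. It is a guess at the shape of the logic, not the app's code. `resolve_endpoints`, `rescore`, the linear scale, and every Hz value (including the population placeholders) are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def resolve_endpoints(
    starting: Optional[float],
    target: Optional[float],
    population_low: float,
    population_high: float,
) -> Tuple[float, float]:
    """Use personal baselines where set; fall back to population values."""
    low = starting if starting is not None else population_low
    high = target if target is not None else population_high
    return low, high


def rescore(stored: Dict[str, float], low: float, high: float) -> Dict[str, float]:
    """Re-score saved measurements against new endpoints (no re-recording)."""
    span = high - low
    if span == 0:
        raise ValueError("endpoints must differ")
    return {clip: max(0.0, min(1.0, (hz - low) / span)) for clip, hz in stored.items()}


# Hypothetical stored median pitch per clip, in Hz.
stored = {"monday": 140.0, "friday": 170.0}

# Only a starting voice (130 Hz) is set; the high end falls back to a
# placeholder population value of 210 Hz.
low, high = resolve_endpoints(130.0, None, 120.0, 210.0)
print(low, high)                   # 130.0 210.0
print(rescore(stored, low, high))  # {'monday': 0.125, 'friday': 0.5}
```

Because the scores are recomputed from measurements already stored with each clip, changing a baseline only changes the endpoints passed to the scoring step.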

Progress page and recording list

The Progress dashboard charts completed recordings over time: average pitch, formants, and practice index for the period you choose.

The Recordings page lists everything you have saved. Search by title to jump back to an older session, open the full analysis, or set a clip as a baseline from there or from the recording detail page.

You can also share a completed recording with a private link (audio + scores). Links expire after about 30 days unless you extend them or turn sharing off sooner.

Tips for trustworthy numbers

Use the same microphone and distance when you can.
Record a few seconds of natural speech—not whispering, not shouting.
Reduce background noise and echo if possible.
Re-save baselines if you change mic or room.
If a score looks wrong, try another take; resonance tracking sometimes struggles on very short or noisy clips.
