Why I Built SottoASR
I liked cloud dictation. I needed private dictation. Then typing became the bottleneck for agent-heavy work, so I built a local macOS app and trained a tiny cleanup model to keep up.
I built SottoASR because I wanted something that already existed, except for the one part that made it unusable for me.
Open-source, local speech-to-text for Apple Silicon Macs. Try the latest release or inspect the source before trusting the privacy claim.
I had been using Wispr Flow, and I want to be fair about that: it is a genuinely good product. It feels fast, the pricing is reasonable, and the writing cleanup is useful enough that I missed it immediately when I stopped using it. The problem was not quality. The problem was privacy.
For my work, the privacy boundary matters. I increasingly use agents as thought partners, code reviewers, research assistants, and occasionally as patient recipients of “here is a very messy idea, please help me turn this into something useful.” A lot of that starts as spoken stream-of-consciousness. Some of it contains project details, business context, unfinished thoughts, and the kind of half-formed technical nonsense that should be allowed to exist privately before it becomes a Git commit.
Cloud dictation did not fit that threat model. The words were mine, but the processing path was not.
So I built the version I wanted: a local, open-source macOS dictation app where the audio stays on my Mac, the text stays on my Mac, and the code is public enough that the privacy claim can be inspected instead of merely trusted.
Typing Became the Bottleneck
The practical reason was speed.
As I started orchestrating more agents at the same time, typing became the narrow pipe in the system. I could think faster than I could type. I could definitely talk faster than I could type. This matters more when the input is not a polished instruction, but a messy bundle of intent:
- what I think the problem is,
- what I tried already,
- what I am worried the model will miss,
- what constraints matter,
- and which rabbit holes should be treated as decorative and ignored.
About six months before SottoASR, I learned that agents were surprisingly good at turning that kind of spoken mess into useful work. I could dump a chaotic idea into an agent, then use research tools and refinement loops to turn it into a plan, a spec, a test matrix, or a piece of writing. The workflow was not “dictate perfect prose.” It was closer to “empty the junk drawer onto the bench and make the useful parts visible.”
That workflow only works if capture is effortless. If the friction is high, I edit myself too early. If I edit myself too early, the useful weird detail often gets cut. This is my most professional endorsement of rambling.
The Contract
SottoASR had a simple product contract:
| Requirement | What it means |
|---|---|
| Local by default | Speech recognition and cleanup should run on-device, not through a cloud API. |
| System-wide | Press a hotkey, speak, and paste into whatever app already has focus. |
| Fast enough | Dictation should feel like a shortcut, not a batch job. |
| Transparent | The app and the model artifacts should be public. |
| Honest cleanup | Remove disfluencies and restore formatting without stealing meaning. |
The result is a Tauri v2 app with a Rust backend and Svelte frontend. The public source lives on GitHub, and the app site states the same privacy boundary directly: audio and text are processed on-device. Press Cmd+Shift+Space for push-to-talk, or Cmd+Shift+D for longer toggle recording, and SottoASR captures microphone audio, transcribes it locally, optionally runs local cleanup, copies the result to the clipboard, and simulates paste at the cursor.
The ASR path uses NVIDIA Parakeet TDT v3, accelerated on Apple Silicon through FluidAudio, CoreML, and the Apple Neural Engine. Parakeet has been excellent for my dictation use. In my day-to-day use it has been better than Whisper, which is not what I expected when I started.
SottoASR also has history, a floating recording overlay, settings, onboarding, and first-run model downloads. Those downloads are for local inference artifacts, not remote transcription. The soul of the project is this: when I say something into SottoASR, I do not want to wonder which server heard it.
Raw Transcripts Are Not Enough
ASR gives you words. That is not the same as usable text.
Spoken thoughts contain filler, false starts, corrections, missing punctuation, run-on paragraphs, and sentences that change direction halfway through. A raw transcript can be accurate and still annoying to use.
The cleanup idea was straightforward: when the feature is enabled, put a small local language model after the ASR model. The less obvious part was the contract. I did not need a summarizer. I did not want a ghostwriter. I wanted a narrow post-processor:
- remove “um”, “uh”, and the harmless verbal scaffolding,
- restore punctuation and capitalization,
- format long dictation into paragraphs,
- convert spoken numbers when appropriate,
- and preserve what I actually said.
That last bullet became the project.
There is a very real difference between an editor and a liar. Transcript cleanup has to know that difference.
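The editor-versus-liar gap is easy to see with a toy example. A purely rule-based pass (a hypothetical sketch, not SottoASR's actual cleanup code) can strip the obvious fillers, but it has no idea what to do with self-corrections or false starts, which is exactly where a small model earns its keep:

```python
import re

# Naive rule-based filler stripping: handles the easy scaffolding only.
# The filler list here is illustrative, not the model's training taxonomy.
FILLERS = re.compile(r"\b(um+|uh+|you know|i mean)\b[,\s]*", re.IGNORECASE)

def strip_fillers(text: str) -> str:
    """Remove obvious verbal scaffolding; punctuation and casing untouched."""
    return re.sub(r"\s{2,}", " ", FILLERS.sub("", text)).strip()

strip_fillers("um so the, uh, deploy script is broken")
# -> "so the, deploy script is broken"
```

What this sketch cannot do is resolve "send it to Bob, no wait, to Alice" into "send it to Alice": that requires modeling the speaker's intent, not pattern-matching their fillers.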
The Small Model
The cleanup model is based on LiquidAI/LFM2.5-350M-Base, a 350M-parameter model small enough to ship locally and run on Apple Silicon after MLX quantization. The deployed Apple Silicon artifact is the 5-bit MLX model, and the training/full-precision artifact is also public on Hugging Face.
The prompt stayed deliberately plain:

```
### Input:
{raw transcript}
### Output:
{clean transcript}
```
That simplicity was useful. It kept the model focused on the transformation instead of asking a tiny model to act like a general-purpose writing coach.
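Concretely, the wrapper around the model is just string templating. A minimal sketch (the names here are mine, not the app's actual code):

```python
# The model is fine-tuned to complete everything after "### Output:",
# so decoding the completion directly yields the cleaned transcript.
PROMPT_TEMPLATE = "### Input:\n{raw}\n### Output:\n"

def build_prompt(raw_transcript: str) -> str:
    return PROMPT_TEMPLATE.format(raw=raw_transcript)
```

A fixed, trivial template keeps the learned behavior in the weights rather than in prompt engineering, which matters when the model is only 350M parameters.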
The current cleanup model is not as good as Wispr Flow’s polishing. I would rather say that plainly than pretend otherwise. But it is local, inspectable, good enough that I use it every day, and getting it there taught me more than a bigger hosted model would have.
The Training Loop
The training work became its own project.
I used an automated multi-day research loop to generate data, fine-tune models, run evaluations, inspect failures, generate gap data, and try again. The later loop ran on my local AI workstation with 2x RTX 6000 Pro Workstation Edition GPUs, which is a very fast way to discover that your metric is lying to you.
The dataset is public: juanquivilla/sotto-transcript-cleanup. It contains transcript-cleanup pairs for filler removal, self-correction, false starts, dictation commands, number handling, long-form formatting, and adversarial cases where the model must not over-edit.
The early models got better at looking clean. That was not the same as being trustworthy. One branch improved filler removal while deleting more substantive content. A model can win a surface metric and still lose the user’s trust. That is a deeply irritating sentence because it is also a good debugging principle.
So the question changed from “Did the output look cleaner?” to “Did the output preserve the user’s actual meaning while cleaning the transcript?”
That forced the eval to measure substantive deletion: strip filler from the input, count real words with multiplicity, and punish the model when too many meaningful words disappeared.
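That check is cheap to implement. A sketch of the multiset version (the filler list is illustrative, not the eval's exact configuration):

```python
from collections import Counter
import re

FILLERS = {"um", "uh", "like", "okay"}  # illustrative, not the real list

def content_words(text: str) -> Counter:
    """Lowercased content words with multiplicity, fillers dropped."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in FILLERS)

def substantive_deletion(raw: str, cleaned: str) -> float:
    """Fraction of meaningful input words (counted with multiplicity)
    missing from the output. High values mean the model stole content."""
    src, out = content_words(raw), content_words(cleaned)
    missing = src - out  # multiset difference keeps only positive counts
    total = sum(src.values())
    return (sum(missing.values()) / total) if total else 0.0
```

A model that drops "big red" from "um the big red server" scores 0.5 here no matter how polished the remaining text looks, which is the whole point: surface cleanliness can no longer hide deletion.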
The Breakthroughs Were Mostly Corrections
The model got better when the loop stopped asking the wrong questions.
The exact metrics moved as the evaluation suite changed, so these rows are a timeline, not a single apples-to-apples leaderboard:
| Stage | What changed | Why it mattered |
|---|---|---|
| v23 paragraph pipeline | Added paragraph-formatting data and GRPO refinement | Long dictation stopped turning into one wall of prose; paragraph emission landed around 90% on the paragraph holdout. |
| v36 full fine-tune | Moved all 354M parameters instead of using LoRA | Substantive deletion finally dropped sharply while filler removal improved. |
| v45 number campaign | Added grounded inverse-text-normalization data and digit-signature rewards | “server three sixty” stopped becoming the wrong number. Number accuracy reached 95.9% on the stratified validation set. |
| v51 production-gap loop | Trained against adversarial cases from real failures | The model improved on year drift, phone hallucination, repeated numbers, content merging, and long-form preservation. |
| soup_30 | Averaged nearby v55 and v51 checkpoints | Kept the best tradeoff: stronger numbers and filler handling without giving up sampled adversarial performance. |
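My reading of the digit-signature reward from the v45 row: after inverse text normalization, the digits in the output, taken in order, must match the reference exactly, which catches wrong conversions and dropped numbers alike. A minimal sketch (my reconstruction, not the training code):

```python
import re

def digit_signature(text: str) -> str:
    """All digits in order of appearance: 'server 360 and port 8080'
    -> '3608080'. A cheap, order-sensitive check that spoken-number
    conversion produced exactly the digits the reference expects."""
    return "".join(re.findall(r"\d", text))

def numbers_match(model_output: str, reference: str) -> bool:
    return digit_signature(model_output) == digit_signature(reference)

numbers_match("restart server 360", "restart server 360")  # True
numbers_match("restart server 316", "restart server 360")  # False
```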
Each correction followed the same pattern: a metric that looked healthy while hiding a real failure.

| Problem | What looked right | What was actually wrong | Fix |
|---|---|---|---|
| Over-cleaning | Filler-free rate improved | The model deleted meaningful words | Added substantive-deletion evaluation and rewards |
| Paragraphs | Short cleanup was strong | Long dictation came out as flat prose | Added paragraph-formatting data |
| Spoken numbers | The model preserved wording | “server three sixty” could become the wrong number | Built grounded inverse-text-normalization data |
| Long-input deletion | Eval showed deletion | max_new_tokens=512 was truncating outputs | Raised output headroom for production eval |
| Repetition loops | Looked like a model bug | Inference config allowed rare loops | Used repetition_penalty=1.05 |
The full fine-tune was the first major quality jump. LoRA was useful for exploration, but the preservation behavior needed all 354M parameters to move. Later, targeted number handling fixed the embarrassing cases where spoken numbers were either left untouched or converted incorrectly. Then production-style adversarial cases exposed gaps the normal validation set had politely ignored.
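The repetition_penalty=1.05 fix is the standard logit transform from sampling-time decoding (the textbook version, not necessarily the exact code path in the MLX inference stack):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.05):
    """Make already-emitted tokens slightly less likely each step:
    positive logits are divided by the penalty, negative ones
    multiplied, so the transform always pushes probability down."""
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits
```

A penalty as mild as 1.05 barely perturbs normal decoding, but it is enough to break the rare degenerate loops where the model re-emits the same span forever.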
The final move was almost comically simple: model souping. The shipped soup_30 model is a weight-space average of two nearby checkpoints from the same lineage:
```
theta = 0.3 * theta_v55 + 0.7 * theta_v51
```
That kept v55’s number and filler gains while recovering v51’s sampled adversarial strength. Further training from the soup made things worse, which is a useful reminder that “one more training run” is not a strategy. It is sometimes just a very expensive way to move away from the answer.
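Model souping here is literally a per-parameter weighted average of two checkpoints. A sketch with plain floats standing in for tensors:

```python
def soup(state_a, state_b, alpha=0.3):
    """Weight-space average of two checkpoints:
    theta = alpha * theta_a + (1 - alpha) * theta_b.
    Only meaningful for checkpoints from the same fine-tuning lineage,
    so the weights sit in a compatible loss basin."""
    assert state_a.keys() == state_b.keys()
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k]
            for k in state_a}
```

With real checkpoints the values would be MLX or PyTorch tensors rather than floats, but the arithmetic is identical, and no further training is involved.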
The current public model card reports these production-mode numbers:
| Capability | soup_30 |
|---|---|
| Number accuracy, 171-sample stratified validation | 96.5% |
| 66-case adversarial benchmark, greedy | 86.4% |
| 66-case adversarial benchmark, sampled | 86.0% |
| Loops on 264 sampled probes | 0 |
| Filler-free on 241 long inputs | 71.8% |
| Substantive deletion >15% on 241 long inputs | 5.0% |
| Composite score | 89.51 |
Those numbers are not the point by themselves. The point is the shape of the contract: clean the transcript, convert obvious spoken numbers, format the text, and do not steal the user’s words.
What Shipped
SottoASR shipped as a local macOS app, open source under MIT, with public model and dataset artifacts:
- SottoASR app
- Source code
- Cleanup model, MLX 5-bit
- Cleanup model, full precision
- Cleanup dataset
- NVIDIA Parakeet TDT v3
- LiquidAI LFM2.5-350M-Base
- Apple MLX
- mlx-lm
There are limitations. SottoASR currently targets Apple Silicon Macs. The cleanup model is optimized for English conversational and technical dictation. The polishing is useful, but I still think Wispr Flow’s cloud cleanup has an edge. Privacy has a cost: you carry the models locally, and you do not get to borrow someone else’s giant inference stack every time you speak.
For me, that trade is worth it.
The Real Reason
The real reason I wrote SottoASR is not that I wanted to win a benchmark. I wanted a tool that fit how I work.
I talk to agents all day. I use them to turn vague thinking into concrete work, to pressure-test ideas, to write specs, to debug firmware, to plan hardware changes, and to turn notes into something that can survive contact with tomorrow morning. Typing was slowing that down. Cloud dictation made the privacy boundary too fuzzy. Raw local ASR was not enough.
SottoASR is my answer to that: press a hotkey, say the messy thought, keep it local, clean it just enough, and move on.