SottoASR

Why I Built SottoASR

I liked cloud dictation. I needed private dictation. Then typing became the bottleneck for agent-heavy work, so I built a local macOS app and trained a tiny cleanup model to keep up.

I built SottoASR because I wanted something that already existed, except for the one part that made it unusable for me.

Download SottoASR

Open-source, local speech-to-text for Apple Silicon Macs. Try the latest release or inspect the source before trusting the privacy claim.


I had been using Wispr Flow, and I want to be fair about that: it is a genuinely good product. It feels fast, the pricing is reasonable, and the writing cleanup is useful enough that I missed it immediately when I stopped using it. The problem was not quality. The problem was privacy.

For my work, the privacy boundary matters. I increasingly use agents as thought partners, code reviewers, research assistants, and occasionally as patient recipients of “here is a very messy idea, please help me turn this into something useful.” A lot of that starts as spoken stream-of-consciousness. Some of it contains project details, business context, unfinished thoughts, and the kind of half-formed technical nonsense that should be allowed to exist privately before it becomes a Git commit.

Cloud dictation did not fit that threat model. The words were mine, but the processing path was not.

So I built the version I wanted: a local, open-source macOS dictation app where the audio stays on my Mac, the text stays on my Mac, and the code is public enough that the privacy claim can be inspected instead of merely trusted.

Typing Became the Bottleneck

The practical reason was speed.

As I started orchestrating more agents at the same time, typing became the narrow pipe in the system. I could think faster than I could type. I could definitely talk faster than I could type. This matters more when the input is not a polished instruction, but a messy bundle of intent:

  • what I think the problem is,
  • what I tried already,
  • what I am worried the model will miss,
  • what constraints matter,
  • and which rabbit holes should be treated as decorative and ignored.

About six months before SottoASR, I learned that agents were surprisingly good at turning that kind of spoken mess into useful work. I could dump a chaotic idea into an agent, then use research tools and refinement loops to turn it into a plan, a spec, a test matrix, or a piece of writing. The workflow was not “dictate perfect prose.” It was closer to “empty the junk drawer onto the bench and make the useful parts visible.”

That workflow only works if capture is effortless. If the friction is high, I edit myself too early. If I edit myself too early, the useful weird detail often gets cut. This is my most professional endorsement of rambling.

The Contract

SottoASR had a simple product contract:

  • Local by default: speech recognition and cleanup should run on-device, not through a cloud API.
  • System-wide: press a hotkey, speak, and paste into whatever app already has focus.
  • Fast enough: dictation should feel like a shortcut, not a batch job.
  • Transparent: the app and the model artifacts should be public.
  • Honest cleanup: remove disfluencies and restore formatting without stealing meaning.

The result is a Tauri v2 app with a Rust backend and a Svelte frontend. The public source lives on GitHub, and the app site states the same privacy boundary directly: audio and text are processed on-device. Press Cmd+Shift+Space for push-to-talk, or Cmd+Shift+D to toggle recording for longer sessions, and SottoASR captures microphone audio, transcribes it locally, optionally runs local cleanup, copies the result to the clipboard, and simulates paste at the cursor.
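The end-to-end flow is simple enough to sketch. This is a hypothetical Python sketch with stub functions standing in for the real Rust/Tauri pipeline; `transcribe`, `cleanup`, and `dictate` are illustrative names, not the app's actual API:

```python
def transcribe(audio: bytes) -> str:
    """Stand-in for the local Parakeet ASR call (returns a canned transcript here)."""
    return "um so the server is uh returning five hundreds"

def cleanup(text: str) -> str:
    """Stand-in for the local cleanup model: here, just drop obvious fillers."""
    fillers = {"um", "uh"}
    return " ".join(w for w in text.split() if w not in fillers)

def dictate(audio: bytes, cleanup_enabled: bool = True) -> str:
    """Capture -> transcribe -> optional cleanup; the app then copies the
    result to the clipboard and simulates paste at the cursor."""
    text = transcribe(audio)
    if cleanup_enabled:
        text = cleanup(text)
    return text
```

The point of the shape is that cleanup is an optional stage bolted after ASR, not a rewrite of the transcription path.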

The ASR path uses NVIDIA Parakeet TDT v3, accelerated on Apple Silicon through FluidAudio, CoreML, and the Apple Neural Engine. Parakeet has been excellent for dictation: in day-to-day use it has been better than Whisper, which is not what I expected when I started.

SottoASR also has history, a floating recording overlay, settings, onboarding, and first-run model downloads. Those downloads are for local inference artifacts, not remote transcription. The soul of the project is this: when I say something into SottoASR, I do not want to wonder which server heard it.

Raw Transcripts Are Not Enough

ASR gives you words. That is not the same as usable text.

Spoken thoughts contain filler, false starts, corrections, missing punctuation, run-on paragraphs, and sentences that change direction halfway through. A raw transcript can be accurate and still annoying to use.

The cleanup idea was straightforward: when the feature is enabled, put a small local language model after the ASR model. The less obvious part was the contract. I did not need a summarizer. I did not want a ghostwriter. I wanted a narrow post-processor:

  • remove “um”, “uh”, and the harmless verbal scaffolding,
  • restore punctuation and capitalization,
  • format long dictation into paragraphs,
  • convert spoken numbers when appropriate,
  • and preserve what I actually said.

That last bullet became the project.

There is a very real difference between an editor and a liar. Transcript cleanup has to know that difference.

The Small Model

The cleanup model is based on LiquidAI/LFM2.5-350M-Base, a 350M-parameter model small enough to ship locally and run on Apple Silicon after MLX quantization. The deployed Apple Silicon artifact is the 5-bit MLX model, and the training/full-precision artifact is also public on Hugging Face.

The prompt stayed deliberately plain:

### Input:
{raw transcript}

### Output:
{clean transcript}

That simplicity was useful. It kept the model focused on the transformation instead of asking a tiny model to act like a general-purpose writing coach.
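As a sketch, building that prompt is a one-liner; `build_prompt` is my name for it, not a function from the repo:

```python
def build_prompt(raw_transcript: str) -> str:
    """Mirror the plain Input/Output template shown above; the model is
    expected to continue the text after '### Output:'."""
    return f"### Input:\n{raw_transcript}\n\n### Output:\n"
```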

The current cleanup model is not as good as Wispr Flow’s polishing. I would rather say that plainly than pretend otherwise. But it is local, inspectable, good enough that I use it every day, and getting it there taught me more than a bigger hosted model would have.

The Training Loop

The training work became its own project.

I used an automated multi-day research loop to generate data, fine-tune models, run evaluations, inspect failures, generate gap data, and try again. The later loop ran on my local AI workstation with 2x RTX 6000 Pro Workstation Edition GPUs, which is a very fast way to discover that your metric is lying to you.

The dataset is public: juanquivilla/sotto-transcript-cleanup. It contains transcript-cleanup pairs for filler removal, self-correction, false starts, dictation commands, number handling, long-form formatting, and adversarial cases where the model must not over-edit.

The early models got better at looking clean. That was not the same as being trustworthy. One branch improved filler removal while deleting more substantive content. A model can win a surface metric and still lose the user’s trust. That is a deeply irritating sentence because it is also a good debugging principle.

So the question changed from:

Did the output look cleaner?

to:

Did the output preserve the user’s actual meaning while cleaning the transcript?

That forced the eval to measure substantive deletion: strip filler from the input, count real words with multiplicity, and punish the model when too many meaningful words disappeared.
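A minimal sketch of that metric, assuming a toy filler list and naive word tokenization (the shipped evaluation is presumably more careful on both counts):

```python
import re
from collections import Counter

FILLERS = {"um", "uh", "er", "ah"}  # illustrative, not the actual filler list

def content_words(text: str) -> Counter:
    """Lowercased word counts with multiplicity, fillers stripped."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in FILLERS)

def substantive_deletion(raw: str, cleaned: str) -> float:
    """Fraction of meaningful input words missing from the output,
    counted with multiplicity: the number the eval punishes."""
    want, got = content_words(raw), content_words(cleaned)
    missing = sum((want - got).values())
    total = sum(want.values())
    return missing / total if total else 0.0
```

A model that drops one of two occurrences of a repeated word still gets charged for it, which is exactly the failure surface metrics miss.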

The Breakthroughs Were Mostly Corrections

The model got better when the loop stopped asking the wrong questions.

The exact metrics moved as the evaluation suite changed, so these entries are a timeline, not a single apples-to-apples leaderboard:

  • v23 paragraph pipeline: added paragraph-formatting data and GRPO refinement. Long dictation stopped turning into one wall of prose; paragraph emission landed around 90% on the paragraph holdout.
  • v36 full fine-tune: moved all 354M parameters instead of using LoRA. Substantive deletion finally dropped sharply while filler removal improved.
  • v45 number campaign: added grounded inverse-text-normalization data and digit-signature rewards. “Server three sixty” stopped becoming the wrong number; number accuracy reached 95.9% on the stratified validation set.
  • v51 production-gap loop: trained against adversarial cases from real failures. The model improved on year drift, phone hallucination, repeated numbers, content merging, and long-form preservation.
  • soup_30: averaged nearby v55 and v51 checkpoints. Kept the best tradeoff: stronger numbers and filler handling without giving up sampled adversarial performance.

Behind each stage was a metric that looked right while something else was wrong:

  • Over-cleaning: the filler-free rate improved, but the model was deleting meaningful words. Fix: substantive-deletion evaluation and rewards.
  • Paragraphs: short cleanup was strong, but long dictation came out as flat prose. Fix: paragraph-formatting data.
  • Spoken numbers: the model preserved wording, but “server three sixty” could become the wrong number. Fix: grounded inverse-text-normalization data.
  • Long-input deletion: the eval showed deletion, but max_new_tokens=512 was truncating outputs. Fix: raised output headroom for the production eval.
  • Repetition loops: it looked like a model bug, but the inference config allowed rare loops. Fix: repetition_penalty=1.05.
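The last two fixes above amount to a small decoding configuration. This sketch uses transformers-style keyword names; the app's on-device MLX path may spell them differently:

```python
# Hypothetical decoding settings implied by the fixes above; the exact
# values beyond repetition_penalty are illustrative.
generation_config = {
    "max_new_tokens": 1024,      # raised from 512 so long inputs aren't truncated
    "repetition_penalty": 1.05,  # suppresses the rare repetition loops
    "do_sample": False,          # greedy decoding for the deterministic benchmark rows
}
```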

The full fine-tune was the first major quality jump. LoRA was useful for exploration, but the preservation behavior needed all 354M parameters to move. Later, targeted number handling fixed the embarrassing cases where spoken numbers were either left untouched or converted incorrectly. Then production-style adversarial cases exposed gaps the normal validation set had politely ignored.
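The digit-signature reward can be sketched in a few lines; `digit_signature` and `number_reward` are my names, and the real reward presumably handles more than a binary match:

```python
import re

def digit_signature(text: str) -> str:
    """Concatenate every digit in order: 'server 360 on port 8080' -> '3608080'."""
    return "".join(re.findall(r"\d", text))

def number_reward(model_output: str, reference: str) -> float:
    """Hypothetical binary reward: score 1.0 only when the output's digits
    match the grounded reference exactly, so converting 'three sixty' to
    316 instead of 360 earns nothing."""
    return 1.0 if digit_signature(model_output) == digit_signature(reference) else 0.0
```

Grounding the reward in a reference digit string is what makes it cheap to check and hard to game.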

The final move was almost comically simple: model souping. The shipped soup_30 model is a weight-space average of two nearby checkpoints from the same lineage:

theta = 0.3 * theta_v55 + 0.7 * theta_v51

That kept v55’s number and filler gains while recovering v51’s sampled adversarial strength. Further training from the soup made things worse, which is a useful reminder that “one more training run” is not a strategy. It is sometimes just a very expensive way to move away from the answer.
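The souping step itself is just a per-parameter weighted average. A minimal sketch, with plain floats standing in for tensors:

```python
def soup(theta_a: dict, theta_b: dict, alpha: float) -> dict:
    """Weight-space average: alpha * theta_a + (1 - alpha) * theta_b,
    applied key by key across matching checkpoints."""
    assert theta_a.keys() == theta_b.keys(), "checkpoints must share a parameterization"
    return {k: alpha * theta_a[k] + (1 - alpha) * theta_b[k] for k in theta_a}

# soup_30 takes 30% of v55 and 70% of v51:
#   merged = soup(theta_v55, theta_v51, alpha=0.3)
```

This only works because v55 and v51 come from the same lineage; averaging unrelated checkpoints is not the same trick.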

The current public model card reports these production-mode numbers for soup_30:

  • Number accuracy, 171-sample stratified validation: 96.5%
  • 66-case adversarial benchmark, greedy: 86.4%
  • 66-case adversarial benchmark, sampled: 86.0%
  • Loops on 264 sampled probes: 0
  • Filler-free on 241 long inputs: 71.8%
  • Substantive deletion >15% on 241 long inputs: 5.0%
  • Composite score: 89.51

Those numbers are not the point by themselves. The point is the shape of the contract: clean the transcript, convert obvious spoken numbers, format the text, and do not steal the user’s words.

What Shipped

SottoASR shipped as a local macOS app, open source under MIT, with public model and dataset artifacts.

There are limitations. SottoASR currently targets Apple Silicon Macs. The cleanup model is optimized for English conversational and technical dictation. The polishing is useful, but I still think Wispr Flow’s cloud cleanup has an edge. Privacy has a cost: you carry the models locally, and you do not get to borrow someone else’s giant inference stack every time you speak.

For me, that trade is worth it.

The Real Reason

The real reason I wrote SottoASR is not that I wanted to win a benchmark. I wanted a tool that fit how I work.

I talk to agents all day. I use them to turn vague thinking into concrete work, to pressure-test ideas, to write specs, to debug firmware, to plan hardware changes, and to turn notes into something that can survive contact with tomorrow morning. Typing was slowing that down. Cloud dictation made the privacy boundary too fuzzy. Raw local ASR was not enough.

SottoASR is my answer to that: press a hotkey, say the messy thought, keep it local, clean it just enough, and move on.