Context

I've been learning Indian classical music for about eight years and I train as a vocalist. I take online lessons every week usually on Sundays from my teacher Sudarshan. For the first several years, my lessons were in person and the lessons moved online during the COVID pandemic. As I moved cities from Washington, D.C. to Nashville, my classes have now primarily become online.

A couple of years ago, it occurred to me that it would be extremely beneficial for me to have snippets from these classes where I just have the voice of my teacher demonstrating how these lines are sung perfectly stitched together. Cliff notes to memorize . Cliff notes of voice; of a singer, demonstrating a song . Snippets I could listen to while walking or transiting through an airport or at my own leisure and perfect over time.

This post shares my experience at trying to create an app for that and lessons I learnt along the way.

Structure of a Lesson

My lessons usually last an hour. We pick up a piece of music, typically called a keerthana (a poetic song set to a musical scale called a Raaga), and I learn one Keerthana over maybe two or three classes. A keerthana is usually structured as a chorus followed by stanzas. I go through learning lines of the chorus and once I've learnt it well enough, move to the stanza and the next and so on.

My teacher sings a phrase; I sing it back. He teaches me the various nuances, corrects me where I am wrong, I sing back, and we move ahead once I can reproduce a line ~75% accurate. The process is iterative. After I have learnt a few lines in succession, my teacher revisits those lines and sings back those lines in a way that is concert ready. The intent is that when I revisit those lines after my class, I have a sample of what the perfect rendition of those few lines is for posterity. 10000 hours. Practice makes one perfect. These lines are intended for rigorous practice.

My Ask of Claude

The ask was simple — or so I thought. To write code to isolate my teacher's voice from mine, and identify snippets in the one hour where my teacher sings uninterrupted for 30 seconds or longer, since those segments are most likely what I am looking for.

The workflow is straightforward in principle, but as I sat to write code for this, I realized there were so many edge cases that derailed simple efforts at producing code for this.

Lessons Learnt

  1. Providing what a good output looks like upfront is very useful. It forces one to front load work, but dramatically improves the odds of success.
  2. Understanding the input and articulating it clearly — in this case, it occurred to me after a few iterations that specifying exactly what is present in the recording could help. For example, there is a tanpura playing throughout the class in the background. That was distorting the sound sampling and made it harder to isolate sounds. Also, there is rhythm (taala) as we sing which sounds like a hand clap, adding to distortion in the signal. These hints were critical in finding coding approaches that work.
  3. When in doubt, overcommunicate what you need, getting into as many specifics and nuanced details as possible. It helps building a product that does not break at edge cases.

The Solution: Journey of Getting There

Version 1 did the obvious thing and was surprisingly good. Got me all excited. It studied the music lesson, fed it through Resemblyzer's neural voice embeddings, clustered the output into two voices with KMeans, and extracted long segments of each singer (my teacher and I). It worked, technically — two separate audio files came out, each dominated by one voice. But "dominated" was doing a lot of heavy lifting. The algorithm had no concept of teacher versus student. It just saw Voice A and Voice B, extracted both, and left me to figure out which was which. The segments themselves were rough: teacher demonstrations bleeding into student repetitions, brief silences splitting what should have been continuous phrases, and — worst of all — no way to separate the moments where both voices sang together from the stretches where the teacher sang alone.

Version 2 swung hard in the other direction — more signals, more sophistication, more ways to fail. As I started to iterate, I came upon an insight that in my music lessons, the teacher speaks a line or two before demonstrating, effectively stating he's going to recap or something like that. That speech-to-singing transition is a clean anchor point. So v2 classified every audio window as speech, singing, or silence using zero-crossing rate and spectral flatness, found those transitional anchors, built a voice profile from them, and then scored every window on three overlap signals: cosine similarity drop (the strongest), energy spike (two voices are 3–6 dB louder than one), and spectral bandwidth increase. Five complementary detection strategies in a 975-line script. It was clever, but it was fragile in its own unique way. Resemblyzer treats a speaking voice and a singing voice from the same person as substantially different embeddings. The teacher profile built from speech anchors didn't reliably match his singing segments. The overlap detector worked well when both voices were present, but the false positive rate on solo segments was high enough to make the output unreliable without manual cleanup.

Version 3 abandoned the multi-signal architecture and went back to clustering — but with two ideas that made all the difference. First, a domain-specific heuristic: in the final 40% of a lesson recording, the teacher dominates. He's recapping, demonstrating full compositions, running through phrases one last time. Clustering the windows into two voices and then checking which cluster owns the late session reliably identifies the teacher without needing speech anchors at all. Second, a label-smoothing pass that fixed the single biggest failure mode. Resemblyzer would occasionally assign a window of the teacher's speaking voice to the student cluster, creating tiny holes in otherwise continuous teacher segments. The smooth_labels() function runs a thirty-second rolling window and flips any isolated minority label when the surrounding context is eighty percent or more from the other cluster. Simple, but it eliminated the most common source of fragmented output. The script then tags surviving segments as either LONGRUN (three minutes or longer, always included) or CLEAN (shorter but ninety percent or more teacher), producing two output files: one with everything, one with only the best material.

The scripts alone weren't enough for a usable workflow. Automated extraction gets you eighty percent of the way; the remaining twenty percent is judgment calls — a recap that starts a few seconds earlier than the algorithm detected, two demonstrations that belong together but got split by a brief dialogue, etc.

So I built a Tkinter applet with two tabs: one to stitch multiple audio files together in any order (drag, reorder, concatenate), and one to snip specific time ranges from a single file and stitch them into a new output. All operations re-encode to uniform AAC at 192 kbps through ffmpeg before concatenating, which sidesteps the codec mismatch issues that plagued early attempts at lossless copy. A companion diagnostic tool, inspect_window.py, dumps a per-window table for any time range — cluster assignment, voiced status, RMS energy — so when a known recap isn't being captured, I can see exactly which windows are being misclassified and why, without re-running the full extraction.

The ultimate piece that tied it together was ground truth validation. I went back through several lesson recordings and manually timestamped the teacher's solo demonstrations into companion text files. Version three reads these, compares them against its automated segments, and reports Intersection over Union scores — a single number that tells you how well the algorithm's boundaries match reality. That closed the feedback loop. Instead of listening to output files and guessing whether a parameter change helped, I let Claude Code tune merge-gap (how many seconds of silence to bridge), smoothing context window, and purity thresholds against actual measured accuracy. The current defaults — four-second merge gap, sixty-second smooth context, seventy-five percent majority threshold — are the product of that tuning, not guesswork.