The Convolutional Neural Network (CNN) Onset Detector Inside the TremoloMASTER Wizard

Tremolo presents unique challenges for onset detection:

Notes are very close together (fast repetition).
The tone is similar note-to-note (same string area, similar timbre).
Dynamics vary (thumb vs fingers; accents; uneven volume).
Real recordings include room reflections, noise, and mic differences.
Players often have uneven spacing and "ghost" attacks while learning.

To improve reliability---especially across different guitars, tempos, and recording conditions---Doug added a CNN (Convolutional Neural Network) to detect onsets in tremolo samples.

This article explains:

What a CNN is (in plain English),
How Doug's CNN is used in the TremoloMASTER Wizard,
What the user gains from it,
What it can and cannot do, and
Practical questions users (and technical readers) typically ask.

What a CNN is (without the math)

A Convolutional Neural Network is a type of machine-learning model that is especially good at finding patterns in things that look like images. It is the "brain" of the TremoloMASTER Wizard.

Even though we're working with audio, we can convert audio into a picture-like representation---most commonly a spectrogram. A spectrogram shows:

Time (left to right)
Frequency (low to high)
Energy (how strong a frequency is at a moment)

In that "picture," a plucked note creates recognizable visual patterns: a sharp transient, harmonic stripes, and energy changes across frequency bands. A CNN is very good at learning those patterns---even when they vary from one guitar, player, mic, or room to another.
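For technical readers, here is a minimal, dependency-free sketch of how audio samples can be turned into that picture-like grid. The frame size, hop size, and sample rate are illustrative assumptions, not the Wizard's actual settings, and a real pipeline would normally use an optimized FFT library rather than this slow direct transform.

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of time frames; each frame lists magnitudes per frequency bin."""
    # A Hann window tapers each frame so its edges do not create false energy.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        mags = []
        # Direct DFT for bins 0 .. frame_size/2 (low to high frequency).
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

# Toy input: a steady 1000 Hz tone at an assumed 8000 Hz sample rate.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = spectrogram(tone)

# The strongest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), "frames,", len(spec[0]), "bins, peak near", peak_hz, "Hz")
```

Each inner list is one vertical slice of the picture (one moment in time), and each number in it is the energy at one frequency.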

A helpful analogy
Traditional onset detection is like using a ruler and a threshold: "If the energy jumps above X, call it an onset."
A CNN is like training an expert listener/observer: "This looks like a real note attack, and that one looks like noise or a false peak."

What "convolution" really means here

A CNN slides small "pattern detectors" across the spectrogram and asks:

"Do I see an onset-like shape here?"
"Does this look like a guitar transient?"
"Is the harmonic structure consistent with a pluck?"
"Is this just a bump in noise, or a real note?"

Those pattern detectors are learned from training examples (more on that below).
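To make the "sliding" idea concrete, here is a toy sketch of one pattern detector moving across a tiny spectrogram. The detector values below are written by hand purely for illustration; in a real CNN they are learned, and the Wizard's actual detectors are not described in this article.

```python
def slide_detector(spec, kernel):
    """Slide a small kernel over a 2D grid (rows = frequency, columns = time)
    and return the match score at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(spec) - kh + 1):
        row = []
        for c in range(len(spec[0]) - kw + 1):
            score = sum(kernel[i][j] * spec[r + i][c + j]
                        for i in range(kh) for j in range(kw))
            row.append(score)
        out.append(row)
    return out

# Toy spectrogram: 3 frequency rows x 6 time columns.
# Energy appears in every row starting at column 3 (a sudden attack).
toy_spec = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-written "quiet, then loud" detector (an assumption for illustration).
edge_kernel = [
    [-1, 1],
    [-1, 1],
    [-1, 1],
]

response = slide_detector(toy_spec, edge_kernel)
best_col = max(range(len(response[0])), key=lambda c: response[0][c])
print(response)   # the score is highest where quiet turns into loud
print(best_col)
```

The detector responds only at the position where silence is followed by energy across all frequency rows, and stays at zero over both the silent and the sustained regions.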

Why tremolo onset detection is tricky

Tremolo is not four random notes---it is a repeated four-note cycle. In classical technique terms the pattern is typically p--a--m--i, but in the TremoloMASTER Wizard we label them as:

Note #1
Note #2
Note #3
Note #4

That is intentional: it keeps the app flexible even when the musical context changes (thumb on different bass strings, fingers on different treble strings, alternating bass patterns, etc.). Because the app treats tremolo as a repeating four-note cycle and makes no assumptions about fingering or exact string choices, it can also analyze non-traditional patterns such as p--i--m--a, p--i--m--i, and p--m--i--m.

But tremolo remains hard for any detector because:

Notes can be almost evenly spaced (good), or very uneven (learning stage).
Finger attacks can blur together at higher speeds.
Accents can "fool" simpler energy-based logic.
Some recordings contain extra noises that resemble attacks.

The CNN is designed to be more robust under these real-world conditions.

How the CNN is used inside the TremoloMASTER Wizard

High-level flow (what happens when you analyze a segment)

You upload a recording (phone recording is fine).
You select a region in the waveform/timeline for analysis.
The app sends that audio segment to the server.
The server prepares the audio (format consistency matters for reliable detection).
The CNN scans the audio and predicts where the true note onsets are.
The app groups detected onsets into repeating 4-note cycles:
- Note #1 → Note #2 → Note #3 → Note #4 → (back to Note #1)
The Wizard calculates timing metrics (and other analysis) based on those onset times.
Results are displayed to the user: timing evenness, finger-pair spacing patterns, trends, and any warnings if detection confidence is low.

What the CNN outputs (conceptually)

Depending on implementation, CNN onset models typically output something like:

A probability curve over time (0.0 to 1.0) indicating "how onset-like" each moment is, or
A list of predicted onset timestamps directly.

Either way, the app then converts those predictions into:

A clean list of onset times
A cycle grouping into Note #1--#4
Metrics computed from the spacing between those times

What the user gains from the CNN

1) More consistent onset detection across real recordings

Compared to older threshold-style methods, a CNN can be less fragile when:

the recording level is low or high,
the player accents certain notes strongly,
the microphone or phone compresses the audio,
the room adds reflections,
the guitar tone is dark/bright,
the tempo changes.

2) Better handling of accents and unevenness

Tremolo students often have:

a "punchy" thumb note,
weaker finger notes,
uneven spacing between certain finger pairs.

Those conditions can cause traditional detectors to miss quiet notes or invent false ones. A CNN can learn what real plucks look like even when they're quiet or slightly masked.

3) Better user feedback (because timing data is more trustworthy)

Most of the Wizard's value depends on accurate onset times:

finger-to-finger spacing (Note #1→#2, #2→#3, #3→#4, #4→next #1)
overall evenness
consistency over time (trends)
detecting "rushing" or "dragging" patterns

If onset detection improves, user feedback improves.
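As a rough illustration of how onset times become an evenness number, the sketch below measures the gaps between consecutive onsets and summarizes how much they vary. The coefficient of variation used here is one common choice and is an assumption; the article does not specify the Wizard's actual formula.

```python
import statistics

def inter_onset_intervals(onsets_ms):
    """Gaps between consecutive onsets, in milliseconds."""
    return [b - a for a, b in zip(onsets_ms, onsets_ms[1:])]

def evenness_cv(onsets_ms):
    """Coefficient of variation of the gaps: 0.0 means perfectly even spacing,
    larger values mean more uneven spacing."""
    gaps = inter_onset_intervals(onsets_ms)
    return statistics.pstdev(gaps) / statistics.mean(gaps)

# Hypothetical onset times in milliseconds.
even_take = [0, 100, 200, 300, 400]
uneven_take = [0, 140, 200, 300, 400]   # second note arrives late

print(evenness_cv(even_take))
print(round(evenness_cv(uneven_take), 3))
```

A single onset that is detected 40 ms off changes two neighboring gaps at once, which is why small detection errors have an outsized effect on the feedback.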

4) More helpful warnings when the performance is extremely uneven

When tremolo is very uneven, no detector is perfect. A well-designed CNN pipeline can still provide useful results, and it also lets the app detect when it is operating near its limits and display a warning such as:

"Detection confidence is low... slow down, make spacing more even, then re-analyze."

That kind of transparency helps users trust the tool.

Training the CNN (how it learns)

A CNN is not "born" knowing tremolo. It learns from examples.

What the training data looks like

For audio onset detection, training data typically includes:

Audio clips of tremolo playing
A "ground truth" set of onset timestamps (the correct note attack times)
Variations in tempo, dynamics, strings, tone, and recording conditions

The model repeatedly sees examples and adjusts itself to reduce errors.
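One common way to present "ground truth" to an onset model is to convert the annotated timestamps into a per-frame target of 0s and 1s that the model's output is compared against. The sketch below shows that conversion; the frame rate and tolerance are assumed values, and this is not a description of how Doug's model was actually trained.

```python
def onset_targets(onset_times_s, n_frames, frame_rate, tolerance_frames=1):
    """Build a per-frame target list: 1.0 near an annotated onset, else 0.0."""
    targets = [0.0] * n_frames
    for t in onset_times_s:
        center = round(t * frame_rate)
        # Marking a few neighboring frames tolerates small annotation errors.
        for f in range(center - tolerance_frames, center + tolerance_frames + 1):
            if 0 <= f < n_frames:
                targets[f] = 1.0
    return targets

# Hypothetical annotations: onsets at 0.05 s and 0.15 s, 100 frames per second.
strict = onset_targets([0.05, 0.15], n_frames=20, frame_rate=100, tolerance_frames=0)
relaxed = onset_targets([0.05, 0.15], n_frames=20, frame_rate=100, tolerance_frames=1)
print(strict)
print(relaxed)
```

During training, the model's predicted curve is pushed toward this target, so the quality of the annotations directly limits the quality of the detector.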

Why variety matters

If a model only sees one guitar and one recording setup, it may struggle with others. So good training data includes variety:

different tempos (slow to fast),
different dynamic patterns (even vs accented),
different bass-string choices for Note #1,
different treble strings for Notes #2--#4,
phone recordings and studio recordings,
quiet rooms and normal rooms.

What the CNN learns that's hard to hand-code

Humans can write rules like "energy spikes often indicate onsets," but tremolo is full of exceptions. CNNs can learn subtle cues such as:

the specific transient shape of a pluck,
how harmonics appear right after an attack,
the difference between a real note attack and a noise bump,
onset patterns at different speeds and timbres.

How the Wizard labels notes: "Note #1--#4" instead of "p--a--m--i"

Many players think in p--a--m--i, so why does the app use Note #1--#4?

Because the app analyzes the cycle structure first. In most tremolo playing:

Note #1 is typically the thumb (p).
Notes #2--#4 are typically the fingers (a--m--i).

Numbering the notes #1--#4 also lets the app analyze non-traditional tremolo patterns such as p--i--m--a, p--i--m--i, and p--m--i--m. Even when the musical context varies (string choices, alternation, and technique variations), the Note #1--#4 approach stays stable and avoids hard-coding assumptions that could break in edge cases.

What the CNN does NOT do (important limits)

1) It does not "grade musicianship"

It does not judge tone quality, expression, or artistry. It measures timing and related performance signals.

2) It is not magic in extreme conditions

Like any onset detector, it can struggle if:

the tremolo is extremely uneven,
notes are heavily smeared together at very high speed,
the recording is distorted/clipping,
a metronome is audible in the recording,
there is strong background noise.

When the audio makes the onsets ambiguous even to a human listener, the model may be uncertain too.

3) It relies on the selected segment being tremolo

If the chosen segment includes silence, unrelated notes, or large non-tremolo gestures, results may degrade. The best analyses come from a clean 5--10 second section of steady tremolo.

Practical recording tips (to help the CNN help you)

These are not "requirements," but they improve analysis quality:

Record in a quiet room.
Use a silent metronome (blink/vibration only; no audible ticks).
Avoid clipping/distortion (if the phone is too close, it can distort).
Keep the phone 1--3 feet away, aimed toward the guitar.
Make the tremolo fairly steady for at least several seconds.
If results seem odd, try analyzing a slightly different section, or slow down.

FAQ for users

"Why did you switch from traditional detection to a CNN?"

Traditional onset detection can work well, but tremolo exposes edge cases: accents, very close note spacing, variable dynamics, and phone recordings. A CNN can be more robust across those conditions.

"Will the CNN always be correct?"

No onset detector is perfect on every possible recording. The goal is to be correct far more often across typical real-world tremolo samples, and to warn you when the analysis is likely less reliable.

"Does this replace the need for good technique?"

It supports your practice by giving precise feedback, but you still must improve your tremolo the old-fashioned way: slow work, evenness, relaxation, and gradual tempo increases.

"What should I do if the results don't match what I hear?"

Try these steps:

Select a slightly different section (still steady tremolo).
Choose a segment where Note #1 is clearly present and consistent.
Slow down and re-record (extreme unevenness reduces detection confidence).
Ensure no audible metronome ticks and no clipping.

"Does the app store my audio?"

No. The system uses your audio only for as long as it takes to process it.

FAQ for technical readers

"What does the CNN 'look at' in the audio?"

Most onset CNNs operate on short time windows of a time--frequency representation (like a spectrogram or mel-spectrogram). The model learns patterns associated with note attacks and outputs a probability of onset over time.

"How does it become a list of onsets?"

A post-processing step typically:

smooths the probability curve,
finds peaks above a threshold,
enforces a minimum separation between onsets,
then outputs timestamps.
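Those four steps can be sketched as follows. The threshold, smoothing width, minimum gap, and frame rate are illustrative assumptions, not the Wizard's actual settings.

```python
def pick_onsets(probs, frame_rate, threshold=0.5, min_gap_s=0.03, smooth=3):
    """Turn a per-frame onset probability curve into onset times in seconds."""
    # 1) Smooth with a short moving average.
    half = smooth // 2
    smoothed = []
    for i in range(len(probs)):
        window = probs[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))

    # 2) and 3) Keep local peaks above the threshold, at least min_gap apart.
    min_gap_frames = min_gap_s * frame_rate
    onsets = []
    last = None
    for i in range(1, len(smoothed) - 1):
        is_peak = smoothed[i] > smoothed[i - 1] and smoothed[i] >= smoothed[i + 1]
        if is_peak and smoothed[i] >= threshold:
            if last is None or (i - last) >= min_gap_frames:
                # 4) Convert the frame index to a timestamp.
                onsets.append(i / frame_rate)
                last = i
    return onsets

# Toy probability curve at an assumed 100 frames per second:
# two clear onsets (frames 5 and 20) and one weak bump (frame 12).
probs = [0.0] * 30
probs[4:7] = [0.6, 0.9, 0.6]
probs[12] = 0.3
probs[19:22] = [0.6, 0.9, 0.6]

print(pick_onsets(probs, frame_rate=100))
```

The weak bump at frame 12 is rejected by the threshold, while the minimum-gap rule prevents one attack from being reported twice.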

"How do you get Note #1--#4 from onsets?"

Once the onset list is obtained, the app groups the onsets into sets of four in time order. It then computes timing intervals and statistics between:

1→2, 2→3, 3→4, 4→(next 1)

This cycle logic is central to tremolo analysis.
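A minimal sketch of that grouping is shown below. It assumes the first detected onset is Note #1, which is a simplification for illustration; how the app actually locates Note #1 is not described here.

```python
import statistics

TRANSITIONS = ["1->2", "2->3", "3->4", "4->1"]

def cycle_intervals(onsets_ms):
    """Sort each gap between consecutive onsets into its place in the
    four-note cycle. Assumes onsets_ms[0] is Note #1."""
    by_transition = {name: [] for name in TRANSITIONS}
    for idx in range(len(onsets_ms) - 1):
        position = idx % 4   # 0 = Note #1, 1 = Note #2, 2 = Note #3, 3 = Note #4
        gap = onsets_ms[idx + 1] - onsets_ms[idx]
        by_transition[TRANSITIONS[position]].append(gap)
    return by_transition

# Hypothetical onsets (ms): two full cycles in which Note #2 is always late.
onsets_ms = [0, 110, 200, 300, 400, 510, 600, 700, 800]
grouped = cycle_intervals(onsets_ms)
averages = {name: statistics.mean(gaps) for name, gaps in grouped.items()}
print(averages)
```

Averaging per transition is what turns a general impression of unevenness into a specific diagnosis, such as a consistently long gap before Note #2.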

"What causes false positives or missed onsets?"

Common causes:

strong noise bursts that resemble transients,
extremely weak finger notes (quiet attacks),
room reflections producing secondary transients,
clipping distortion that reshapes transients,
very uneven timing that breaks expected patterns.

"Why not just use one universal threshold?"

Because thresholds break across:

different recording levels,
different rooms,
different guitars,
different accents,
different tempos,
different players.

A CNN can learn invariances that hand-tuned thresholds struggle to capture.
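The toy example below shows the problem with a single fixed threshold. The envelope values and the threshold are made up for illustration: the same playing, recorded at half the level, loses every finger note.

```python
def threshold_onsets(envelope, threshold):
    """Report frame indices where the energy envelope rises above a fixed threshold."""
    return [i for i in range(1, len(envelope))
            if envelope[i] >= threshold and envelope[i - 1] < threshold]

# Toy energy envelope: a strong thumb note, then three weaker finger notes.
loud_take = [0.0, 0.9, 0.1, 0.5, 0.1, 0.5, 0.1, 0.5, 0.1]
# The same performance recorded at half the level.
quiet_take = [x * 0.5 for x in loud_take]

loud_hits = threshold_onsets(loud_take, threshold=0.4)
quiet_hits = threshold_onsets(quiet_take, threshold=0.4)
print(loud_hits)    # all four notes found
print(quiet_hits)   # only the thumb note survives
```

Lowering the threshold would recover the quiet take but would start admitting noise bumps in the loud one, which is the trade-off a learned detector is meant to avoid.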

Interpreting "confidence" and "detection warnings"

When the app warns that detection confidence is low, it usually means one or more of the following:

the tremolo is highly uneven,
the audio makes onsets ambiguous,
the model's internal scoring doesn't show a clean pattern.

This is not "scolding"---it's a helpful signal that your best next move is usually:

slow down,
aim for more even spacing,
then re-analyze.
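For technical readers, a warning like this could be produced by a simple check on the model's peak scores and on the spacing of the detected onsets. Everything in the sketch below (the function name, both cutoffs, and the choice of signals) is hypothetical; the Wizard's actual confidence logic is not documented in this article.

```python
import statistics

def detection_warning(peak_probs, onsets_ms, min_mean_prob=0.6, max_cv=0.35):
    """Return a warning string if detection looks unreliable, else None.
    Both cutoffs are assumed values chosen only for illustration."""
    reasons = []
    # Signal 1: the model's scores at the detected onsets are weak overall.
    if statistics.mean(peak_probs) < min_mean_prob:
        reasons.append("weak onset scores")
    # Signal 2: the gaps between onsets vary a great deal.
    gaps = [b - a for a, b in zip(onsets_ms, onsets_ms[1:])]
    if statistics.pstdev(gaps) / statistics.mean(gaps) > max_cv:
        reasons.append("very uneven spacing")
    if reasons:
        return ("Detection confidence is low (" + ", ".join(reasons) +
                "): slow down, make spacing more even, then re-analyze.")
    return None

steady = detection_warning([0.9, 0.8, 0.9, 0.85], [0, 100, 200, 300])
shaky = detection_warning([0.4, 0.5, 0.3, 0.6], [0, 40, 200, 260])
print(steady)
print(shaky)
```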

Glossary

Onset: The moment a note begins (the "attack").
Spectrogram: A visual representation of sound over time and frequency.
CNN: Convolutional Neural Network; a pattern-recognition model often used for images (and audio-as-image).
Post-processing: Turning model output (probabilities) into clean timestamps.
4-note cycle: In tremolo, the repeating pattern: Note #1, #2, #3, #4.
Evenness: Consistency of time spacing between notes.

Summary

Adding a CNN onset detector makes the TremoloMASTER Wizard more robust and more useful in real practice conditions. It improves the reliability of the most important input to tremolo analysis---the timing of each pluck---which improves the accuracy of everything built on top of it: evenness metrics, finger-pair spacing diagnosis, trend tracking, and practice guidance.