transcription.py
'''
The aim of a transcription algorithm is to produce a symbolic representation of
a recorded piece of music in the form of a set of discrete notes. There are
different ways to represent notes symbolically. Here we use the piano-roll
convention, meaning each note has a start time, a duration (or end time), and
a single, constant, pitch value. Pitch values can be quantized (e.g. to a
semitone grid tuned to 440 Hz), but do not have to be. Also, the transcription
can contain the notes of a single instrument or voice (for example the melody),
or the notes of all instruments/voices in the recording. This module is
instrument agnostic: all notes in the estimate are compared against all notes
in the reference.
There are many metrics for evaluating transcription algorithms. Here we limit
ourselves to the most simple and commonly used: given two sets of notes, we
count how many estimated notes match the reference, and how many do not. Based
on these counts we compute the precision, recall, f-measure and overlap ratio
of the estimate given the reference. The default criteria for considering two
notes to be a match are adopted from the `MIREX Multiple fundamental frequency
estimation and tracking, Note Tracking subtask (task 2)
<http://www.music-ir.org/mirex/wiki/2015:Multiple_Fundamental_Frequency_\
Estimation_%26_Tracking_Results_-_MIREX_Dataset#Task_2:Note_Tracking_\
.28NT.29>`_:
"This subtask is evaluated in two different ways. In the first setup , a
returned note is assumed correct if its onset is within +-50ms of a reference
note and its F0 is within +- quarter tone of the corresponding reference note,
ignoring the returned offset values. In the second setup, on top of the above
requirements, a correct returned note is required to have an offset value
within 20% of the reference note's duration around the reference note's
offset, or within 50ms whichever is larger."
In short, we compute precision, recall, f-measure and overlap ratio, once
without taking offsets into account, and the second time with.
For further details see Salamon, 2013 (page 186), and references therein:
Salamon, J. (2013). Melody Extraction from Polyphonic Music Signals.
Ph.D. thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2013.
IMPORTANT NOTE: the evaluation code in ``mir_eval`` contains several important
differences with respect to the code used in MIREX 2015 for the Note Tracking
subtask on the Su dataset (henceforth "MIREX"):
1. ``mir_eval`` uses bipartite graph matching to find the optimal pairing of
   reference notes to estimated notes. MIREX uses a greedy matching algorithm,
   which can produce sub-optimal note matching. This will result in
   ``mir_eval``'s metrics being slightly higher compared to MIREX.
2. MIREX rounds down the onset and offset times of each note to 2 decimal
   places using ``new_time = 0.01 * floor(time*100)``. ``mir_eval`` rounds
   the onset and offset distances between notes to 4 decimal places (see
   ``N_DECIMALS``). This will bring our metrics down a notch compared to the
   MIREX results.
3. In the MIREX wiki, the criterion for matching offsets is that they must be
   within ``0.2 * ref_duration`` **or 0.05 seconds from each other, whichever
   is greater** (i.e. ``offset_dif <= max(0.2 * ref_duration, 0.05)``). The
   MIREX code however only uses a threshold of ``0.2 * ref_duration``, without
   the 0.05 second minimum. Since ``mir_eval`` does include this minimum, it
   might produce slightly higher results compared to MIREX.
This means that differences 1 and 3 bring ``mir_eval``'s metrics up compared to
MIREX, whilst 2 brings them down. Based on internal testing, overall the effect
of these three differences is that the Precision, Recall and F-measure returned
by ``mir_eval`` will be higher compared to MIREX by about 1%-2%.
Finally, note that different evaluation scripts have been used for the Multi-F0
Note Tracking task in MIREX over the years. In particular, some scripts used
``<`` for matching onsets, offsets, and pitch values, whilst the others used
``<=`` for these checks. ``mir_eval`` provides both options: by default the
latter (``<=``) is used, but you can set ``strict=True`` when calling
:func:`mir_eval.transcription.precision_recall_f1_overlap()` in which case
``<`` will be used. The default value (``strict=False``) is the same as that
used in MIREX 2015 for the Note Tracking subtask on the Su dataset.
Conventions
-----------
Notes should be provided in the form of an interval array and a pitch array.
The interval array contains two columns, one for note onsets and the second
for note offsets (each row represents a single note). The pitch array contains
one column with the corresponding note pitch values (one value per note),
represented by their fundamental frequency (f0) in Hertz.
Metrics
-------
* :func:`mir_eval.transcription.precision_recall_f1_overlap`: The precision,
  recall, F-measure, and Average Overlap Ratio of the note transcription,
  where an estimated note is considered correct if its pitch, onset and
  (optionally) offset are sufficiently close to a reference note.
* :func:`mir_eval.transcription.onset_precision_recall_f1`: The precision,
  recall and F-measure of the note transcription, where an estimated note is
  considered correct if its onset is sufficiently close to a reference note's
  onset. That is, these metrics are computed taking only note onsets into
  account, meaning two notes could be matched even if they have very different
  pitch values.
* :func:`mir_eval.transcription.offset_precision_recall_f1`: The precision,
  recall and F-measure of the note transcription, where an estimated note is
  considered correct if its offset is sufficiently close to a reference note's
  offset. That is, these metrics are computed taking only note offsets into
  account, meaning two notes could be matched even if they have very different
  pitch values.
'''
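Difference 1 in the module docstring (greedy versus optimal matching) can be illustrated with a toy example. This is only a sketch: ``greedy_match`` and ``maximum_match`` are hypothetical helpers written for illustration, the candidate lists are made up, and the simple augmenting-path routine is a stand-in for ``util._bipartite_match``, whose implementation is not shown in this file.

```python
def greedy_match(candidates):
    """Pair each reference with the first still-free estimate it can match."""
    used, pairs = set(), []
    for ref_i in sorted(candidates):
        for est_j in candidates[ref_i]:
            if est_j not in used:
                used.add(est_j)
                pairs.append((ref_i, est_j))
                break
    return pairs


def maximum_match(candidates):
    """Maximum bipartite matching via augmenting paths."""
    est_to_ref = {}

    def try_assign(ref_i, seen):
        for est_j in candidates[ref_i]:
            if est_j in seen:
                continue
            seen.add(est_j)
            # Take est_j if it is free, or if its current partner can move.
            if est_j not in est_to_ref or try_assign(est_to_ref[est_j], seen):
                est_to_ref[est_j] = ref_i
                return True
        return False

    for ref_i in candidates:
        try_assign(ref_i, set())
    return sorted((ref_i, est_j) for est_j, ref_i in est_to_ref.items())


# Hypothetical candidate lists: reference 0 could match estimate 0 or 1,
# while reference 1 could only match estimate 0.
candidates = {0: [0, 1], 1: [0]}
greedy_pairs = greedy_match(candidates)     # reference 1 is left unmatched
optimal_pairs = maximum_match(candidates)   # both references are matched
```

In this toy case the greedy pass finds one match and the maximum matching finds two, which is why an optimal pairing can only raise, never lower, the match count.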
import numpy as np
import collections
from . import util
import warnings
# The number of decimals to keep for onset/offset threshold checks
N_DECIMALS = 4
def validate(ref_intervals, ref_pitches, est_intervals, est_pitches):
    """Checks that the input annotations to a metric look like time intervals
    and a pitch list, and throws helpful errors if not.

    Parameters
    ----------
    ref_intervals : np.ndarray, shape=(n,2)
        Array of reference notes time intervals (onset and offset times)
    ref_pitches : np.ndarray, shape=(n,)
        Array of reference pitch values in Hertz
    est_intervals : np.ndarray, shape=(m,2)
        Array of estimated notes time intervals (onset and offset times)
    est_pitches : np.ndarray, shape=(m,)
        Array of estimated pitch values in Hertz
    """
    # Validate intervals
    validate_intervals(ref_intervals, est_intervals)

    # Make sure intervals and pitches match in length
    if not ref_intervals.shape[0] == ref_pitches.shape[0]:
        raise ValueError('Reference intervals and pitches have different '
                         'lengths.')
    if not est_intervals.shape[0] == est_pitches.shape[0]:
        raise ValueError('Estimated intervals and pitches have different '
                         'lengths.')

    # Make sure all pitch values are positive
    if ref_pitches.size > 0 and np.min(ref_pitches) <= 0:
        raise ValueError("Reference contains at least one non-positive pitch "
                         "value")
    if est_pitches.size > 0 and np.min(est_pitches) <= 0:
        raise ValueError("Estimate contains at least one non-positive pitch "
                         "value")
def validate_intervals(ref_intervals, est_intervals):
    """Checks that the input annotations to a metric look like time intervals,
    and throws helpful errors if not.

    Parameters
    ----------
    ref_intervals : np.ndarray, shape=(n,2)
        Array of reference notes time intervals (onset and offset times)
    est_intervals : np.ndarray, shape=(m,2)
        Array of estimated notes time intervals (onset and offset times)
    """
    # If reference or estimated notes are empty, warn
    if ref_intervals.size == 0:
        warnings.warn("Reference notes are empty.")
    if est_intervals.size == 0:
        warnings.warn("Estimated notes are empty.")

    # Validate intervals
    util.validate_intervals(ref_intervals)
    util.validate_intervals(est_intervals)
def match_note_offsets(ref_intervals, est_intervals, offset_ratio=0.2,
                       offset_min_tolerance=0.05, strict=False):
    """Compute a maximum matching between reference and estimated notes,
    only taking note offsets into account.

    Given two note sequences represented by ``ref_intervals`` and
    ``est_intervals`` (see :func:`mir_eval.io.load_valued_intervals`), we seek
    the largest set of correspondences ``(i, j)`` such that the offset of
    reference note ``i`` is within ``offset_tolerance`` of the offset of
    estimated note ``j``, where ``offset_tolerance`` is equal to
    ``offset_ratio`` times the reference note's duration, i.e. ``offset_ratio
    * ref_duration[i]`` where ``ref_duration[i] = ref_intervals[i, 1] -
    ref_intervals[i, 0]``. If the resulting ``offset_tolerance`` is less than
    ``offset_min_tolerance`` (50 ms by default) then ``offset_min_tolerance``
    is used instead.

    Every reference note is matched against at most one estimated note.

    Note there are separate functions :func:`match_note_onsets` and
    :func:`match_notes` for matching notes based on onsets only or based on
    onset, offset, and pitch, respectively. This is because the rules for
    matching note onsets and matching note offsets are different.

    Parameters
    ----------
    ref_intervals : np.ndarray, shape=(n,2)
        Array of reference notes time intervals (onset and offset times)
    est_intervals : np.ndarray, shape=(m,2)
        Array of estimated notes time intervals (onset and offset times)
    offset_ratio : float > 0
        The ratio of the reference note's duration used to define the
        ``offset_tolerance``. Default is 0.2 (20%), meaning the
        ``offset_tolerance`` will equal the ``ref_duration * 0.2``, or
        ``offset_min_tolerance`` (0.05 by default, i.e. 50 ms), whichever is
        greater.
    offset_min_tolerance : float > 0
        The minimum tolerance for offset matching. See ``offset_ratio``
        description for an explanation of how the offset tolerance is
        determined.
    strict : bool
        If ``strict=False`` (the default), threshold checks for offset
        matching are performed using ``<=`` (less than or equal). If
        ``strict=True``, the threshold checks are performed using ``<`` (less
        than).

    Returns
    -------
    matching : list of tuples
        A list of matched reference and estimated notes.
        ``matching[i] == (i, j)`` where reference note ``i`` matches estimated
        note ``j``.
    """
    # set the comparison function
    if strict:
        cmp_func = np.less
    else:
        cmp_func = np.less_equal

    # check for offset matches
    offset_distances = np.abs(np.subtract.outer(ref_intervals[:, 1],
                                                est_intervals[:, 1]))
    # Round distances to a target precision to avoid the situation where
    # if the distance is exactly 50ms (and strict=False) it erroneously
    # doesn't match the notes because of precision issues.
    offset_distances = np.around(offset_distances, decimals=N_DECIMALS)
    ref_durations = util.intervals_to_durations(ref_intervals)
    offset_tolerances = np.maximum(offset_ratio * ref_durations,
                                   offset_min_tolerance)
    offset_hit_matrix = (
        cmp_func(offset_distances, offset_tolerances.reshape(-1, 1)))

    # check for hits
    hits = np.where(offset_hit_matrix)

    # Construct the graph input
    # Flip graph so that 'matching' is a list of tuples where the first item
    # in each tuple is the reference note index, and the second item is the
    # estimated note index.
    G = {}
    for ref_i, est_i in zip(*hits):
        if est_i not in G:
            G[est_i] = []
        G[est_i].append(ref_i)

    # Compute the maximum matching
    matching = sorted(util._bipartite_match(G).items())

    return matching
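The per-note offset tolerance described in the docstring of ``match_note_offsets`` can be worked through with plain numbers. This is a minimal sketch in pure Python rather than NumPy; ``offset_tolerance`` is a hypothetical helper and the note times are made up for illustration.

```python
def offset_tolerance(ref_onset, ref_offset, offset_ratio=0.2,
                     offset_min_tolerance=0.05):
    """Tolerance (in seconds) allowed around a reference note's offset."""
    ref_duration = ref_offset - ref_onset
    return max(offset_ratio * ref_duration, offset_min_tolerance)


# A 100 ms note: 20% of the duration is only 20 ms, so the 50 ms floor wins.
short_note_tol = offset_tolerance(1.0, 1.1)
# A 2 s note: 20% of the duration is 400 ms, which exceeds the floor.
long_note_tol = offset_tolerance(1.0, 3.0)
```

Short notes are therefore judged against the fixed 50 ms floor, while long notes get a tolerance that grows with their duration.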
def match_note_onsets(ref_intervals, est_intervals, onset_tolerance=0.05,
                      strict=False):
    """Compute a maximum matching between reference and estimated notes,
    only taking note onsets into account.

    Given two note sequences represented by ``ref_intervals`` and
    ``est_intervals`` (see :func:`mir_eval.io.load_valued_intervals`), we seek
    the largest set of correspondences ``(i, j)`` such that the onset of
    reference note ``i`` is within ``onset_tolerance`` of the onset of
    estimated note ``j``.

    Every reference note is matched against at most one estimated note.

    Note there are separate functions :func:`match_note_offsets` and
    :func:`match_notes` for matching notes based on offsets only or based on
    onset, offset, and pitch, respectively. This is because the rules for
    matching note onsets and matching note offsets are different.

    Parameters
    ----------
    ref_intervals : np.ndarray, shape=(n,2)
        Array of reference notes time intervals (onset and offset times)
    est_intervals : np.ndarray, shape=(m,2)
        Array of estimated notes time intervals (onset and offset times)
    onset_tolerance : float > 0
        The tolerance for an estimated note's onset deviating from the
        reference note's onset, in seconds. Default is 0.05 (50 ms).
    strict : bool
        If ``strict=False`` (the default), threshold checks for onset matching
        are performed using ``<=`` (less than or equal). If ``strict=True``,
        the threshold checks are performed using ``<`` (less than).

    Returns
    -------
    matching : list of tuples
        A list of matched reference and estimated notes.
        ``matching[i] == (i, j)`` where reference note ``i`` matches estimated
        note ``j``.
    """
    # set the comparison function
    if strict:
        cmp_func = np.less
    else:
        cmp_func = np.less_equal

    # check for onset matches
    onset_distances = np.abs(np.subtract.outer(ref_intervals[:, 0],
                                               est_intervals[:, 0]))
    # Round distances to a target precision to avoid the situation where
    # if the distance is exactly 50ms (and strict=False) it erroneously
    # doesn't match the notes because of precision issues.
    onset_distances = np.around(onset_distances, decimals=N_DECIMALS)
    onset_hit_matrix = cmp_func(onset_distances, onset_tolerance)

    # find hits
    hits = np.where(onset_hit_matrix)

    # Construct the graph input
    # Flip graph so that 'matching' is a list of tuples where the first item
    # in each tuple is the reference note index, and the second item is the
    # estimated note index.
    G = {}
    for ref_i, est_i in zip(*hits):
        if est_i not in G:
            G[est_i] = []
        G[est_i].append(ref_i)

    # Compute the maximum matching
    matching = sorted(util._bipartite_match(G).items())

    return matching
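The rounding step in ``match_note_onsets`` guards against floating-point error at the tolerance boundary. The sketch below shows the effect on one made-up pair of onsets; it uses Python's built-in ``round`` as a stand-in for ``np.around``, and the variable names are illustrative only.

```python
N_DECIMALS = 4          # same value as the module-level constant
onset_tolerance = 0.05  # default tolerance, in seconds

# Two onsets that are nominally exactly 50 ms apart.
raw_distance = abs(1.05 - 1.0)
rounded_distance = round(raw_distance, N_DECIMALS)

# Without rounding, binary floating point makes the distance slightly
# larger than 0.05, so the non-strict check fails.
raw_hit = raw_distance <= onset_tolerance
# After rounding to 4 decimals the comparison behaves as intended.
rounded_hit = rounded_distance <= onset_tolerance
```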
def match_notes(ref_intervals, ref_pitches, est_intervals, est_pitches,
                onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2,
                offset_min_tolerance=0.05, strict=False):
    """Compute a maximum matching between reference and estimated notes,
    subject to onset, pitch and (optionally) offset constraints.

    Given two note sequences represented by ``ref_intervals``, ``ref_pitches``,
    ``est_intervals`` and ``est_pitches``
    (see :func:`mir_eval.io.load_valued_intervals`), we seek the largest set
    of correspondences ``(i, j)`` such that:

    1. The onset of reference note ``i`` is within ``onset_tolerance`` of the
       onset of estimated note ``j``.
    2. The pitch of reference note ``i`` is within ``pitch_tolerance`` of the
       pitch of estimated note ``j``.
    3. If ``offset_ratio`` is not ``None``, the offset of reference note ``i``
       has to be within ``offset_tolerance`` of the offset of estimated note
       ``j``, where ``offset_tolerance`` is equal to ``offset_ratio`` times the
       reference note's duration, i.e. ``offset_ratio * ref_duration[i]`` where
       ``ref_duration[i] = ref_intervals[i, 1] - ref_intervals[i, 0]``. If the
       resulting ``offset_tolerance`` is less than ``offset_min_tolerance``
       (50 ms by default), ``offset_min_tolerance`` is used instead.
    4. If ``offset_ratio`` is ``None``, note offsets are ignored, and only
       criteria 1 and 2 are taken into consideration.

    Every reference note is matched against at most one estimated note.

    This is useful for computing precision/recall metrics for note
    transcription.

    Note there are separate functions :func:`match_note_onsets` and
    :func:`match_note_offsets` for matching notes based on onsets only or based
    on offsets only, respectively.

    Parameters
    ----------
    ref_intervals : np.ndarray, shape=(n,2)
        Array of reference notes time intervals (onset and offset times)
    ref_pitches : np.ndarray, shape=(n,)
        Array of reference pitch values in Hertz
    est_intervals : np.ndarray, shape=(m,2)
        Array of estimated notes time intervals (onset and offset times)
    est_pitches : np.ndarray, shape=(m,)
        Array of estimated pitch values in Hertz
    onset_tolerance : float > 0
        The tolerance for an estimated note's onset deviating from the
        reference note's onset, in seconds. Default is 0.05 (50 ms).
    pitch_tolerance : float > 0
        The tolerance for an estimated note's pitch deviating from the
        reference note's pitch, in cents. Default is 50.0 (50 cents).
    offset_ratio : float > 0 or None
        The ratio of the reference note's duration used to define the
        ``offset_tolerance``. Default is 0.2 (20%), meaning the
        ``offset_tolerance`` will equal the ``ref_duration * 0.2``, or
        ``offset_min_tolerance`` (0.05 by default, i.e. 50 ms), whichever is
        greater. If ``offset_ratio`` is set to ``None``, offsets are ignored
        in the matching.
    offset_min_tolerance : float > 0
        The minimum tolerance for offset matching. See ``offset_ratio``
        description for an explanation of how the offset tolerance is
        determined. Note: this parameter only influences the results if
        ``offset_ratio`` is not ``None``.
    strict : bool
        If ``strict=False`` (the default), threshold checks for onset, offset,
        and pitch matching are performed using ``<=`` (less than or equal). If
        ``strict=True``, the threshold checks are performed using ``<`` (less
        than).

    Returns
    -------
    matching : list of tuples
        A list of matched reference and estimated notes.
        ``matching[i] == (i, j)`` where reference note ``i`` matches estimated
        note ``j``.
    """
    # set the comparison function
    if strict:
        cmp_func = np.less
    else:
        cmp_func = np.less_equal

    # check for onset matches
    onset_distances = np.abs(np.subtract.outer(ref_intervals[:, 0],
                                               est_intervals[:, 0]))
    # Round distances to a target precision to avoid the situation where
    # if the distance is exactly 50ms (and strict=False) it erroneously
    # doesn't match the notes because of precision issues.
    onset_distances = np.around(onset_distances, decimals=N_DECIMALS)
    onset_hit_matrix = cmp_func(onset_distances, onset_tolerance)

    # check for pitch matches (distances are expressed in cents)
    pitch_distances = np.abs(1200*np.subtract.outer(np.log2(ref_pitches),
                                                    np.log2(est_pitches)))
    pitch_hit_matrix = cmp_func(pitch_distances, pitch_tolerance)

    # check for offset matches if offset_ratio is not None
    if offset_ratio is not None:
        offset_distances = np.abs(np.subtract.outer(ref_intervals[:, 1],
                                                    est_intervals[:, 1]))
        # Round distances to a target precision to avoid the situation where
        # if the distance is exactly 50ms (and strict=False) it erroneously
        # doesn't match the notes because of precision issues.
        offset_distances = np.around(offset_distances, decimals=N_DECIMALS)
        ref_durations = util.intervals_to_durations(ref_intervals)
        offset_tolerances = np.maximum(offset_ratio * ref_durations,
                                       offset_min_tolerance)
        offset_hit_matrix = (
            cmp_func(offset_distances, offset_tolerances.reshape(-1, 1)))
    else:
        offset_hit_matrix = True

    # check for overall matches
    note_hit_matrix = onset_hit_matrix * pitch_hit_matrix * offset_hit_matrix
    hits = np.where(note_hit_matrix)

    # Construct the graph input
    # Flip graph so that 'matching' is a list of tuples where the first item
    # in each tuple is the reference note index, and the second item is the
    # estimated note index.
    G = {}
    for ref_i, est_i in zip(*hits):
        if est_i not in G:
            G[est_i] = []
        G[est_i].append(ref_i)

    # Compute the maximum matching
    matching = sorted(util._bipartite_match(G).items())

    return matching
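The pitch check in ``match_notes`` compares frequencies on a logarithmic scale, in cents (1200 cents per octave). A minimal scalar sketch of the same formula follows; ``cents_between`` is a hypothetical helper and the frequencies are chosen only for illustration.

```python
import math


def cents_between(ref_hz, est_hz):
    """Absolute pitch distance between two frequencies, in cents."""
    return abs(1200.0 * (math.log2(ref_hz) - math.log2(est_hz)))


# One equal-tempered semitone above A4 is about 100 cents away.
semitone = cents_between(440.0, 440.0 * 2 ** (1 / 12))
# 445 Hz against 440 Hz is roughly 19.6 cents sharp.
slightly_sharp = cents_between(440.0, 445.0)
# With the default pitch_tolerance of 50 cents this would count as a hit.
within_quarter_tone = slightly_sharp <= 50.0
```

Because the distance is computed on log-frequencies, the same tolerance in cents applies equally to low and high notes.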
def precision_recall_f1_overlap(ref_intervals, ref_pitches, est_intervals,
                                est_pitches, onset_tolerance=0.05,
                                pitch_tolerance=50.0, offset_ratio=0.2,
                                offset_min_tolerance=0.05, strict=False,
                                beta=1.0):
    """Compute the Precision, Recall and F-measure of correct vs incorrectly
    transcribed notes, and the Average Overlap Ratio for correctly transcribed
    notes (see :func:`average_overlap_ratio`). "Correctness" is determined
    based on note onset, pitch and (optionally) offset: an estimated note is
    assumed correct if its onset is within +-50ms of a reference note and its
    pitch (F0) is within +- quarter tone (50 cents) of the corresponding
    reference note. If ``offset_ratio`` is ``None``, note offsets are ignored
    in the comparison. Otherwise, on top of the above requirements, a correct
    returned note is required to have an offset value within 20% (by default,
    adjustable via the ``offset_ratio`` parameter) of the reference note's
    duration around the reference note's offset, or within
    ``offset_min_tolerance`` (50 ms by default), whichever is larger.

    Examples
    --------
    >>> ref_intervals, ref_pitches = mir_eval.io.load_valued_intervals(
    ...     'reference.txt')
    >>> est_intervals, est_pitches = mir_eval.io.load_valued_intervals(
    ...     'estimated.txt')
    >>> (precision,
    ...  recall,
    ...  f_measure,
    ...  avg_overlap_ratio) = (
    ...      mir_eval.transcription.precision_recall_f1_overlap(
    ...          ref_intervals, ref_pitches, est_intervals, est_pitches))
    >>> (precision_no_offset,
    ...  recall_no_offset,
    ...  f_measure_no_offset,
    ...  avg_overlap_ratio_no_offset) = (
    ...      mir_eval.transcription.precision_recall_f1_overlap(
    ...          ref_intervals, ref_pitches, est_intervals, est_pitches,
    ...          offset_ratio=None))

    Parameters
    ----------
    ref_intervals : np.ndarray, shape=(n,2)
        Array of reference notes time intervals (onset and offset times)
    ref_pitches : np.ndarray, shape=(n,)
        Array of reference pitch values in Hertz
    est_intervals : np.ndarray, shape=(m,2)
        Array of estimated notes time intervals (onset and offset times)
    est_pitches : np.ndarray, shape=(m,)
        Array of estimated pitch values in Hertz
    onset_tolerance : float > 0
        The tolerance for an estimated note's onset deviating from the
        reference note's onset, in seconds. Default is 0.05 (50 ms).
    pitch_tolerance : float > 0
        The tolerance for an estimated note's pitch deviating from the
        reference note's pitch, in cents. Default is 50.0 (50 cents).
    offset_ratio : float > 0 or None
        The ratio of the reference note's duration used to define the
        offset_tolerance. Default is 0.2 (20%), meaning the
        ``offset_tolerance`` will equal the ``ref_duration * 0.2``, or
        ``offset_min_tolerance`` (0.05 by default, i.e. 50 ms), whichever is
        greater. If ``offset_ratio`` is set to ``None``, offsets are ignored in
        the evaluation.
    offset_min_tolerance : float > 0
        The minimum tolerance for offset matching. See ``offset_ratio``
        description for an explanation of how the offset tolerance is
        determined. Note: this parameter only influences the results if
        ``offset_ratio`` is not ``None``.
    strict : bool
        If ``strict=False`` (the default), threshold checks for onset, offset,
        and pitch matching are performed using ``<=`` (less than or equal). If
        ``strict=True``, the threshold checks are performed using ``<`` (less
        than).
    beta : float > 0
        Weighting factor for f-measure (default value = 1.0).

    Returns
    -------
    precision : float
        The computed precision score
    recall : float
        The computed recall score
    f_measure : float
        The computed F-measure score
    avg_overlap_ratio : float
        The computed Average Overlap Ratio score
    """
    validate(ref_intervals, ref_pitches, est_intervals, est_pitches)
    # When reference or estimated notes are empty, metrics are undefined,
    # return 0's
    if len(ref_pitches) == 0 or len(est_pitches) == 0:
        return 0., 0., 0., 0.

    matching = match_notes(ref_intervals, ref_pitches, est_intervals,
                           est_pitches, onset_tolerance=onset_tolerance,
                           pitch_tolerance=pitch_tolerance,
                           offset_ratio=offset_ratio,
                           offset_min_tolerance=offset_min_tolerance,
                           strict=strict)

    precision = float(len(matching))/len(est_pitches)
    recall = float(len(matching))/len(ref_pitches)
    f_measure = util.f_measure(precision, recall, beta=beta)

    avg_overlap_ratio = average_overlap_ratio(ref_intervals, est_intervals,
                                              matching)

    return precision, recall, f_measure, avg_overlap_ratio
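The scores returned by ``precision_recall_f1_overlap`` follow directly from three counts: matched notes, estimated notes and reference notes. The sketch below works through one made-up case. ``f_beta`` is a hypothetical helper implementing the standard F-beta formula; it is assumed, not confirmed by this file, that ``util.f_measure`` behaves the same way.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; returns 0.0 when both inputs are zero."""
    if precision == 0 and recall == 0:
        return 0.0
    return ((1 + beta ** 2) * precision * recall /
            (beta ** 2 * precision + recall))


# Hypothetical counts: 10 reference notes, 8 estimated notes, 6 matched.
n_ref, n_est, n_matched = 10, 8, 6
precision = n_matched / n_est           # share of estimates that are correct
recall = n_matched / n_ref              # share of references that were found
f_measure = f_beta(precision, recall)   # harmonic mean when beta == 1
```

With ``beta=1.0`` precision and recall are weighted equally; larger values of ``beta`` weight recall more heavily.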
def average_overlap_ratio(ref_intervals, est_intervals, matching):
    """Compute the Average Overlap Ratio between a reference and estimated
    note transcription. Given a reference and corresponding estimated note,
    their overlap ratio (OR) is defined as the ratio between the duration of
    the time segment in which the two notes overlap and the time segment
    spanned by the two notes combined (earliest onset to latest offset):

    >>> OR = ((min(ref_offset, est_offset) - max(ref_onset, est_onset)) /
    ...     (max(ref_offset, est_offset) - min(ref_onset, est_onset)))

    The Average Overlap Ratio (AOR) is given by the mean OR computed over all
    matching reference and estimated notes. The metric goes from 0 (worst) to 1
    (best).

    Note: this function assumes the matching of reference and estimated notes
    (see :func:`match_notes`) has already been performed and is provided by the
    ``matching`` parameter. Furthermore, it is highly recommended to validate
    the intervals (see :func:`validate_intervals`) before calling this
    function, otherwise it is possible (though unlikely) for this function to
    attempt a divide-by-zero operation.

    Parameters
    ----------
    ref_intervals : np.ndarray, shape=(n,2)
        Array of reference notes time intervals (onset and offset times)
    est_intervals : np.ndarray, shape=(m,2)
        Array of estimated notes time intervals (onset and offset times)
    matching : list of tuples
        A list of matched reference and estimated notes.
        ``matching[i] == (i, j)`` where reference note ``i`` matches estimated
        note ``j``.

    Returns
    -------
    avg_overlap_ratio : float
        The computed Average Overlap Ratio score
    """
    ratios = []
    for match in matching:
        ref_int = ref_intervals[match[0]]
        est_int = est_intervals[match[1]]
        overlap_ratio = (
            (min(ref_int[1], est_int[1]) - max(ref_int[0], est_int[0])) /
            (max(ref_int[1], est_int[1]) - min(ref_int[0], est_int[0])))
        ratios.append(overlap_ratio)

    if len(ratios) == 0:
        return 0
    else:
        return np.mean(ratios)
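The overlap ratio defined in ``average_overlap_ratio`` is easy to check by hand for a single pair of notes. A minimal sketch follows; ``overlap_ratio`` here is a standalone, hypothetical helper and the intervals are made up.

```python
def overlap_ratio(ref_interval, est_interval):
    """Overlap duration divided by the total span of the two notes."""
    ref_on, ref_off = ref_interval
    est_on, est_off = est_interval
    overlap = min(ref_off, est_off) - max(ref_on, est_on)
    span = max(ref_off, est_off) - min(ref_on, est_on)
    return overlap / span


# Reference note 0.0-2.0 s, estimate 0.5-2.5 s:
# they overlap for 1.5 s and together span 2.5 s, so OR = 1.5 / 2.5 = 0.6.
ratio = overlap_ratio((0.0, 2.0), (0.5, 2.5))
# Identical intervals give the best possible ratio.
perfect = overlap_ratio((1.0, 2.0), (1.0, 2.0))
```

The Average Overlap Ratio is then the mean of these per-pair ratios over all matched notes.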
def onset_precision_recall_f1(ref_intervals, est_intervals,
                              onset_tolerance=0.05, strict=False, beta=1.0):
    """Compute the Precision, Recall and F-measure of note onsets: an estimated
    onset is considered correct if it is within +-50ms of a reference onset.
    Note that this metric completely ignores note offset and note pitch. This
    means an estimated onset will be considered correct if it matches a
    reference onset, even if the onsets come from notes with completely
    different pitches (i.e. notes that would not match with
    :func:`match_notes`).

    Examples
    --------
    >>> ref_intervals, _ = mir_eval.io.load_valued_intervals(
    ...     'reference.txt')
    >>> est_intervals, _ = mir_eval.io.load_valued_intervals(
    ...     'estimated.txt')
    >>> (onset_precision,
    ...  onset_recall,
    ...  onset_f_measure) = mir_eval.transcription.onset_precision_recall_f1(
    ...      ref_intervals, est_intervals)

    Parameters
    ----------
    ref_intervals : np.ndarray, shape=(n,2)
        Array of reference notes time intervals (onset and offset times)
    est_intervals : np.ndarray, shape=(m,2)
        Array of estimated notes time intervals (onset and offset times)
    onset_tolerance : float > 0
        The tolerance for an estimated note's onset deviating from the
        reference note's onset, in seconds. Default is 0.05 (50 ms).
    strict : bool
        If ``strict=False`` (the default), threshold checks for onset matching
        are performed using ``<=`` (less than or equal). If ``strict=True``,
        the threshold checks are performed using ``<`` (less than).
    beta : float > 0
        Weighting factor for f-measure (default value = 1.0).

    Returns
    -------
    precision : float
        The computed precision score
    recall : float
        The computed recall score
    f_measure : float
        The computed F-measure score
    """
    validate_intervals(ref_intervals, est_intervals)
    # When reference or estimated notes are empty, metrics are undefined,
    # return 0's
    if len(ref_intervals) == 0 or len(est_intervals) == 0:
        return 0., 0., 0.

    matching = match_note_onsets(ref_intervals, est_intervals,
                                 onset_tolerance=onset_tolerance,
                                 strict=strict)

    onset_precision = float(len(matching))/len(est_intervals)
    onset_recall = float(len(matching))/len(ref_intervals)
    onset_f_measure = util.f_measure(onset_precision, onset_recall, beta=beta)

    return onset_precision, onset_recall, onset_f_measure
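The ``strict`` flag only matters for distances that land exactly on the tolerance. The sketch below mirrors the choice between ``np.less_equal`` and ``np.less`` with scalar operators from the standard library; ``onset_hit`` is a hypothetical helper and the distances are assumed to be already rounded.

```python
import operator


def onset_hit(distance, tolerance=0.05, strict=False):
    """Return True if an (already rounded) onset distance passes the check."""
    compare = operator.lt if strict else operator.le
    return compare(distance, tolerance)


boundary_default = onset_hit(0.05)               # <= accepts the boundary
boundary_strict = onset_hit(0.05, strict=True)   # <  rejects the boundary
inside_strict = onset_hit(0.049, strict=True)    # inside the tolerance
```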
def offset_precision_recall_f1(ref_intervals, est_intervals, offset_ratio=0.2,
                               offset_min_tolerance=0.05, strict=False,
                               beta=1.0):
    """Compute the Precision, Recall and F-measure of note offsets: an
    estimated offset is considered correct if it is within +-50ms (or 20% of
    the ref note duration, whichever is greater) of a reference offset. Note
    that this metric completely ignores note onsets and note pitch. This means
    an estimated offset will be considered correct if it matches a
    reference offset, even if the offsets come from notes with completely
    different pitches (i.e. notes that would not match with
    :func:`match_notes`).

    Examples
    --------
    >>> ref_intervals, _ = mir_eval.io.load_valued_intervals(
    ...     'reference.txt')
    >>> est_intervals, _ = mir_eval.io.load_valued_intervals(
    ...     'estimated.txt')
    >>> (offset_precision,
    ...  offset_recall,
    ...  offset_f_measure) = mir_eval.transcription.offset_precision_recall_f1(
    ...      ref_intervals, est_intervals)

    Parameters
    ----------
    ref_intervals : np.ndarray, shape=(n,2)
        Array of reference notes time intervals (onset and offset times)
    est_intervals : np.ndarray, shape=(m,2)
        Array of estimated notes time intervals (onset and offset times)
    offset_ratio : float > 0
        The ratio of the reference note's duration used to define the
        offset_tolerance. Default is 0.2 (20%), meaning the
        ``offset_tolerance`` will equal the ``ref_duration * 0.2``, or
        ``offset_min_tolerance`` (0.05 by default, i.e. 50 ms), whichever is
        greater. Unlike in :func:`precision_recall_f1_overlap`, ``None`` is
        not a usable value here, because the ratio is multiplied directly
        with the reference durations.
    offset_min_tolerance : float > 0
        The minimum tolerance for offset matching. See ``offset_ratio``
        description for an explanation of how the offset tolerance is
        determined.
    strict : bool
        If ``strict=False`` (the default), threshold checks for offset
        matching are performed using ``<=`` (less than or equal). If
        ``strict=True``, the threshold checks are performed using ``<`` (less
        than).
    beta : float > 0
        Weighting factor for f-measure (default value = 1.0).

    Returns
    -------
    precision : float
        The computed precision score
    recall : float
        The computed recall score
    f_measure : float
        The computed F-measure score
    """
    validate_intervals(ref_intervals, est_intervals)
    # When reference or estimated notes are empty, metrics are undefined,
    # return 0's
    if len(ref_intervals) == 0 or len(est_intervals) == 0:
        return 0., 0., 0.

    matching = match_note_offsets(ref_intervals, est_intervals,
                                  offset_ratio=offset_ratio,
                                  offset_min_tolerance=offset_min_tolerance,
                                  strict=strict)

    offset_precision = float(len(matching))/len(est_intervals)
    offset_recall = float(len(matching))/len(ref_intervals)
    offset_f_measure = util.f_measure(offset_precision, offset_recall,
                                      beta=beta)

    return offset_precision, offset_recall, offset_f_measure
def evaluate(ref_intervals, ref_pitches, est_intervals, est_pitches, **kwargs):
    """Compute all metrics for the given reference and estimated annotations.

    Examples
    --------
    >>> ref_intervals, ref_pitches = mir_eval.io.load_valued_intervals(
    ...     'reference.txt')
    >>> est_intervals, est_pitches = mir_eval.io.load_valued_intervals(
    ...     'estimated.txt')
    >>> scores = mir_eval.transcription.evaluate(ref_intervals, ref_pitches,
    ...     est_intervals, est_pitches)

    Parameters
    ----------
    ref_intervals : np.ndarray, shape=(n,2)
        Array of reference notes time intervals (onset and offset times)
    ref_pitches : np.ndarray, shape=(n,)
        Array of reference pitch values in Hertz
    est_intervals : np.ndarray, shape=(m,2)
        Array of estimated notes time intervals (onset and offset times)
    est_pitches : np.ndarray, shape=(m,)
        Array of estimated pitch values in Hertz
    kwargs
        Additional keyword arguments which will be passed to the
        appropriate metric or preprocessing functions.

    Returns
    -------
    scores : dict
        Dictionary of scores, where the key is the metric name (str) and
        the value is the (float) score achieved.
    """
    # Compute all the metrics
    scores = collections.OrderedDict()

    # Precision, recall and f-measure taking note offsets into account
    kwargs.setdefault('offset_ratio', 0.2)
    orig_offset_ratio = kwargs['offset_ratio']
    if kwargs['offset_ratio'] is not None:
        (scores['Precision'],
         scores['Recall'],
         scores['F-measure'],
         scores['Average_Overlap_Ratio']) = util.filter_kwargs(
            precision_recall_f1_overlap, ref_intervals, ref_pitches,
            est_intervals, est_pitches, **kwargs)

    # Precision, recall and f-measure NOT taking note offsets into account
    kwargs['offset_ratio'] = None
    (scores['Precision_no_offset'],
     scores['Recall_no_offset'],
     scores['F-measure_no_offset'],
     scores['Average_Overlap_Ratio_no_offset']) = (
        util.filter_kwargs(precision_recall_f1_overlap,
                           ref_intervals, ref_pitches,
                           est_intervals, est_pitches, **kwargs))

    # onset-only metrics
    (scores['Onset_Precision'],
     scores['Onset_Recall'],
     scores['Onset_F-measure']) = (
        util.filter_kwargs(onset_precision_recall_f1,
                           ref_intervals, est_intervals, **kwargs))

    # offset-only metrics (skipped when the caller disabled offsets)
    kwargs['offset_ratio'] = orig_offset_ratio
    if kwargs['offset_ratio'] is not None:
        (scores['Offset_Precision'],
         scores['Offset_Recall'],
         scores['Offset_F-measure']) = (
            util.filter_kwargs(offset_precision_recall_f1,
                               ref_intervals, est_intervals, **kwargs))

    return scores
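``evaluate`` forwards one shared ``kwargs`` dictionary to metrics with different signatures, relying on ``util.filter_kwargs`` to drop the arguments a given metric does not accept. The implementation of ``util.filter_kwargs`` is not shown in this file, so the sketch below is only an assumption about that behaviour, written with the standard library; ``call_with_supported_kwargs`` and ``onset_metric`` are hypothetical names.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call ``func`` with only the keyword arguments it declares."""
    accepted = inspect.signature(func).parameters
    supported = {k: v for k, v in kwargs.items() if k in accepted}
    return func(*args, **supported)


def onset_metric(ref, est, onset_tolerance=0.05, strict=False):
    # Hypothetical stand-in for a metric that knows nothing about offsets.
    return onset_tolerance, strict


# offset_ratio is silently dropped because onset_metric does not accept it.
result = call_with_supported_kwargs(
    onset_metric, [], [], onset_tolerance=0.1, offset_ratio=None)
```

This is why a single call such as ``evaluate(..., onset_tolerance=0.1, offset_ratio=0.3)`` can configure every metric at once without raising ``TypeError`` in the onset-only functions.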