Organization of This Document
The SDIF standard includes an extensible collection of standard frame and matrix
types, listed in this document.
Each standard matrix type exists independently of the standard frame types
that must include it; any matrix may appear in a frame of any type. However,
for clarity, this document describes each matrix type in the context of the
frame type for which it was invented, with a special section at the end for
matrix types invented to be a part of any frame.
SDIF Standard Frame Types
The following frame types have been defined as part of the SDIF standard.
Each of these frame types has one or more corresponding matrix types. To
give a sense of what kind of data is in each frame type, this table also
lists the columns of the main matrix type for each frame type. Each frame
type is described in detail later in this document.
Frame Type ID   Frame Type                                      Columns of Main Matrix
1FQ0            Fundamental Frequency Estimates                 Fundamental frequency, confidence
1STF            Discrete Short-Term Fourier Transform           Real & imaginary bin values
1PIC            Picked Spectral Peaks                           Freq, amp, phase, confidence
1TRC            Sinusoidal Tracks                               Index, freq, amp, phase
1HRM            Pseudo-harmonic Sinusoidal Tracks               Harmonic partial #, freq, amp, phase
1RES            Resonances / Exponentially Decaying Sinusoids   Freq, amp, decay rate, phase
1TDS            Time Domain Samples                             Channels of sample data
The following sound descriptions should eventually have standard SDIF
frame types. We have decided to delay the definition of these types until
the base SDIF standard has been accepted by the community. We welcome any
ideas or proposals about how to represent this data in SDIF frames.
Time Domain Samples (1TDS)
Time-domain samples are the typical representation for digitally sampled sound,
used by common sound file formats such as WAV and AIFF. The goal of SDIF's time-domain
samples frame type is to provide a uniform representation and the convenience of
having time domain samples in the same SDIF file or stream as other sound descriptions,
not to codify every ingenious scheme for representing audio in the minimum number
of bits. Therefore we restrict this type to linearly quantized samples with
no compression.
1TDS frames must contain a 1TDS matrix to hold the samples and an ITDS "time
domain info" matrix that says how to interpret the samples:
1TDS matrix:
ITDS matrix:
More columns may be added to the ITDS matrix in the future, including the following:
Unlike most other SDIF frame types, a frame of 1TDS data represents an interval
of time (equal to the number of rows in the 1TDS matrix divided by the sampling
rate) rather than an instant of time. The time tag of a 1TDS frame represents
the beginning of this interval.
Most SDIF streams containing 1TDS data will consist of a single large frame
at time zero with all of the samples for the stream in a single matrix. The
same data could be represented equivalently in a series of shorter frames,
for example, a series of frames containing one-second intervals of sample
data at times 0, 1, 2, 3, 4, etc., or unequal-sized frames, e.g., 1.5
seconds at time 0, 2 seconds at time 1.5, 0.7 seconds at time 3.5, 1 second
at time 4.2, etc. Note that at a 96 kHz sampling rate, the limit of 2^32
rows in a matrix imposes a limit of about 12.4 hours of sound in a single
frame.
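The figure of about 12.4 hours can be checked with a few lines of arithmetic. This is a minimal sketch that assumes one matrix row per sample time, as the text describes; the function name `frame_duration_seconds` is hypothetical and not part of SDIF.

```python
# Row limit for a single matrix, as stated in the text.
MAX_ROWS = 2 ** 32

def frame_duration_seconds(num_rows: int, sampling_rate: float) -> float:
    """Interval of time spanned by a 1TDS frame: rows / sampling rate."""
    return num_rows / sampling_rate

# Longest single frame at a 96 kHz sampling rate, in hours.
max_hours_at_96k = frame_duration_seconds(MAX_ROWS, 96_000) / 3600
print(f"{max_hours_at_96k:.1f} hours")  # prints "12.4 hours"
```

The same function gives the interval covered by any frame, which is also the offset from the frame's time tag to the end of its data.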
There is also the possibility of "gaps" in the time axis, for example,
one second of sound in a frame at time 0 followed by more sound in a frame at
time 10. In these cases, the stream implicitly contains zero-valued samples
in any intervals of time not spanned by sample data in frames. So, in this example,
there would be one second of sound, followed by 9 seconds of silence, followed
by more sound.
There is also the possibility of 1TDS frames that overlap in the time axis,
for example, a frame at time zero with 2 seconds of samples, followed by a frame
at time 1 with more samples. In these cases, the semantics are that the sample
values are added together.
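The gap and overlap rules can be sketched as a small mixing routine. This is an illustration only: it assumes mono data, non-negative time tags, and a hypothetical in-memory layout of `(time_tag_seconds, samples)` pairs rather than the SDIF byte format.

```python
def render_stream(frames, sampling_rate):
    """Mix mono 1TDS-style frames into one list of samples.

    `frames` is a list of (time_tag_seconds, samples) pairs (hypothetical layout).
    """
    if not frames:
        return []
    # The output ends where the last-ending frame ends.
    end = max(round(t * sampling_rate) + len(s) for t, s in frames)
    out = [0.0] * end                      # gaps stay zero-valued
    for time_tag, samples in frames:
        start = round(time_tag * sampling_rate)
        for offset, value in enumerate(samples):
            out[start + offset] += value   # overlapping samples are summed
    return out

# Tiny example at 4 samples per second: a gap, then an overlap.
mixed = render_stream(
    [(0.0, [1.0, 1.0]), (1.0, [2.0, 2.0, 2.0]), (1.5, [0.5, 0.5])],
    sampling_rate=4,
)
print(mixed)  # [1.0, 1.0, 0.0, 0.0, 2.0, 2.0, 2.5, 0.5]
```

Starting from a zero-filled buffer and accumulating into it covers both cases with one loop: untouched positions remain silent, and positions written twice hold the sum.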
Rather than define some fixed interpretation of multi-channel data like "1
is front left, 2 is front right, 3 is rear left, 4 is rear right", we propose
to invent an SDIF matrix type specifically for describing multi-channel data.
This would allow simple textual labels like those above, but also precise geometric
measurements about exact microphone placement, speaker placement, etc. It would
also support textual annotations about the content of each channel, e.g., the
name of an instrument on a particular channel of a multi-track recording.
This matrix type would be optional in 1TDS frames, or any other frame type
with multi-channel data.
Fundamental Frequency Estimates (1FQ0)
Not all sounds have a definite fundamental frequency; some have multiple possible
fundamental frequencies. Note that we use the term "fundamental frequency" or
"f0" rather than "pitch"; this is because pitch is a perceptual phenomenon while
fundamental frequency is a signal processing quantity. We might invent a new
SDIF frame and matrix type for pitch to represent the result of a true pitch
estimator that applied a model based on human perception.
1FQ0 frames consist of a single 1FQ0 matrix:
Note that this format accommodates estimators that vote amongst fundamental
frequency candidates. Each row in the data vector is an estimated fundamental
frequency.
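A reader of 1FQ0 data that wants a single estimate per frame might pick among the candidate rows as follows. This sketch assumes each row is a `(frequency, confidence)` pair, matching the column summary given earlier, and that a larger confidence value means a more trusted estimate; the document does not define the confidence scale, so treat that ordering as an assumption. The name `best_f0` is hypothetical.

```python
def best_f0(rows):
    """Return the (frequency_hz, confidence) row with the highest confidence.

    Returns None for an empty matrix, i.e. when no candidate is reported.
    """
    if not rows:
        return None
    return max(rows, key=lambda row: row[1])

candidates = [(220.0, 0.6), (110.0, 0.3), (440.0, 0.1)]
print(best_f0(candidates))  # (220.0, 0.6)
```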
Note that this format does not support the notion of "tracking"
various fundamental frequency estimates over time. In this respect it is
more like the 1PIC frame type than the 1TRC frame type. We are considering adding another frame
type for "tracked fundamental frequency estimates" that would
include an index for each fundamental frequency.
Discrete Short-Term Fourier Transform (1STF)
1STF frames represent the data that come out of a discrete short-term time-domain
to frequency-domain transform such as an FFT.
Here is a precise mathematical definition of this frame type:
We define the input to the transform, x(n), as follows. Note that the windowed
signal is 'put' at the beginning of the vector x(n).
Let
    x(n) = s(i+n) * w(n)    for 0 <= n <= M-1
    x(n) = 0                for M <= n <= N-1
(This is slightly redundant, since we define w(m)=0 when m>=M.)
The 1STF matrix data is the Discrete Fourier Transform (DFT) of size N, i.e.
the X(k) as follows.
The DFT is a length N vector X, with these elements:
           N-1
    X(k) = sum  x(n) * exp(-j * 2 * pi * k * n / N)      0 <= k <= N-1
           n=0
The time tag in a 1STF frame is the time of the center of the
window, i.e., (i + M/2)/SR, not the beginning.
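The definition can be written out directly as a slow reference computation. This sketch reads the symbols as follows, which is an assumption: s is the input signal, i is the index of the first sample under the window, w is a window of length M, N is the transform size with N >= M, and SR is the sampling rate. The function name `stf_frame` is hypothetical.

```python
import cmath

def stf_frame(s, i, w, N, SR):
    """Compute the DFT bins X(k) and the frame time tag for one window position."""
    M = len(w)
    # x(n) = s(i+n) * w(n) for n < M, then zeros up to length N.
    x = [s[i + n] * w[n] for n in range(M)] + [0.0] * (N - M)
    # X(k) = sum over n of x(n) * exp(-j * 2 * pi * k * n / N)
    X = [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]
    time_tag = (i + M / 2) / SR   # center of the window, not its beginning
    return X, time_tag

# Rectangular window of 4 samples starting at index 1 of a short test signal.
X, t = stf_frame(s=[0.0, 1.0, 0.0, -1.0, 0.0, 1.0], i=1,
                 w=[1.0, 1.0, 1.0, 1.0], N=4, SR=4.0)
print(t)  # 0.75
```

The double loop costs O(N^2) and is meant only to mirror the formula; an FFT produces the same bins far more quickly.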
Notes:
1STF frames consist of an ISTF "info" matrix to record overall
information about the transform, plus a 1STF matrix that contains the actual
bin data.
STFT info matrix:
Each 1STF frame must also contain a 1WIN matrix specifying the window function.
STFT data matrix:
You can convert these complex numbers into polar form to get magnitude
and phase.
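As a small illustration of the polar conversion, the sketch below uses Python's standard `cmath.polar`. It assumes one bin is available as a pair of real and imaginary values; the helper name `to_polar` is hypothetical.

```python
import cmath

def to_polar(real, imag):
    """Convert one bin (real, imaginary) to (magnitude, phase in radians)."""
    return cmath.polar(complex(real, imag))

magnitude, phase = to_polar(0.0, -2.0)
print(magnitude, phase)  # magnitude 2.0, phase -pi/2
```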
Picked Spectral Peaks (1PIC)
Picked spectral peaks represent peaks (local maxima) in a spectrum at
a given time. Peak pickers typically fit some kind of curve to 1STF data,
providing frequency, amplitude, and phase estimates that are more accurate
than the bins themselves.
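The text does not say which curve a peak picker should fit. As one common choice, and purely as an assumption here, the sketch below fits a parabola through a local maximum and its two neighbors to refine the peak position and height. The names `parabolic_peak` and `bin_to_hz` are hypothetical.

```python
def parabolic_peak(mags, k):
    """Refine a peak at bin k (1 <= k <= len(mags) - 2) with a 3-point parabola.

    Returns (fractional_bin, interpolated_magnitude).
    """
    a, b, c = mags[k - 1], mags[k], mags[k + 1]
    denom = a - 2 * b + c
    if denom == 0:                 # three collinear points: no refinement possible
        return float(k), b
    offset = 0.5 * (a - c) / denom
    height = b - 0.25 * (a - c) * offset
    return k + offset, height

def bin_to_hz(fractional_bin, N, SR):
    """Convert a (possibly fractional) bin position to Hz for a size-N transform."""
    return fractional_bin * SR / N

# Samples of a parabola whose true maximum of 10 lies at bin 2.3.
mags = [10 - (x - 2.3) ** 2 for x in range(5)]
position, height = parabolic_peak(mags, 2)
print(position, height)
```

Peak pickers often apply this fit to log magnitudes rather than linear ones; that choice is outside what the text specifies.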
1PIC frames consist of a single 1PIC matrix:
The confidence factor might be used to indicate how much of the energy
around this peak was from a sinusoid or how well the energy around this
peak matches a sinusoid.
Sinusoidal Tracks (1TRC)
Sinusoidal tracks represent sinusoids that maintain their continuity
over time as their frequencies, amplitudes, and phases evolve. Sinusoidal
tracks are the standard data format used as the input to classical additive
synthesis.
1TRC frames consist of a single 1TRC matrix:
Synthesizers of 1TRC frames are expected to match the data for each sinusoid
from frame to frame using the index numbers. Values for amplitude and frequency
should somehow be interpolated so that they change smoothly between each
frame.
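The text leaves the interpolation method open. A minimal sketch, assuming plain linear interpolation and a hypothetical in-memory layout in which each frame is a dict mapping index to `(freq, amp, phase)`, could match partials by index like this:

```python
def interpolate_tracks(frame_a, frame_b, fraction):
    """Linearly interpolate freq and amp for partials present in both frames.

    fraction runs from 0.0 (at frame_a) to 1.0 (at frame_b).
    Returns {index: (freq, amp)}; phase is deliberately left out.
    """
    result = {}
    for index in frame_a.keys() & frame_b.keys():   # match partials by index
        freq_a, amp_a, _ = frame_a[index]
        freq_b, amp_b, _ = frame_b[index]
        result[index] = (
            freq_a + fraction * (freq_b - freq_a),
            amp_a + fraction * (amp_b - amp_a),
        )
    return result

frame_a = {1: (440.0, 0.5, 0.0), 2: (880.0, 0.25, 0.0)}
frame_b = {1: (450.0, 0.25, 0.1), 3: (1320.0, 0.1, 0.0)}
halfway = interpolate_tracks(frame_a, frame_b, 0.5)
print(halfway)  # {1: (445.0, 0.375)}
```

Partials 2 and 3 are skipped because each appears in only one of the two frames; this sketch does not handle them.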
As phase is the integral of the instantaneous frequency over time, the
phase values in each frame may not necessarily match a synthesizer's concept
of what the phase should be based on the previous phase and the frequency
trajectory since the previous phase. Some synthesizers will ignore the phase
field or use it only for the initial phase. Others will take the phases
into account when interpolating frequencies from frame to frame. Others
will "cheat" the desired frequencies to produce the desired phases.
We imagine SDIF utilities that
would check the "reasonableness" of phase values based on the
frequencies.
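One way such a "reasonableness" check might look is sketched below. It assumes frequencies in Hz, phases in radians, and a frequency that changes linearly between two frames, so that the integrated frequency equals the average frequency times the elapsed time. None of these choices is mandated by the text, and the name `phase_error` is hypothetical.

```python
import math

def phase_error(freq_a, phase_a, freq_b, phase_b, dt):
    """Wrapped difference between the stored phase at frame b and the phase
    predicted by integrating a linearly changing frequency from frame a."""
    predicted = phase_a + 2 * math.pi * 0.5 * (freq_a + freq_b) * dt
    error = phase_b - predicted
    # Wrap into the interval [-pi, pi).
    return (error + math.pi) % (2 * math.pi) - math.pi

# A steady 100 Hz partial completes exactly one cycle in 10 ms,
# so an unchanged stored phase is consistent with the frequencies.
print(abs(phase_error(100.0, 0.0, 100.0, 0.0, 0.01)) < 1e-9)  # True
```

A utility could flag partials whose wrapped error exceeds some tolerance; the right tolerance depends on the analysis method that produced the data.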
There is no guarantee that a partial appearing in one frame will also
appear in the next frame. The situation where a partial appears in one frame
but not the next is called a "death", and when a partial does
not appear in one frame but does appear in the next frame it's called a
"birth". These cases are challenging when writing a synthesizer.
It's recommended that partials appearing for the first or last time in a
series of frames should have amplitudes of zero, so that the semantics of
fading in and out are explicitly in the SDIF data rather than needing to
be added on by the synthesizer.
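Births and deaths can be found by comparing the sets of indices in consecutive frames, and the zero-amplitude recommendation can then be checked. The sketch assumes a hypothetical layout in which each frame is a dict mapping index to `(freq, amp, phase)`; the function names are illustrative.

```python
def births_and_deaths(prev_frame, next_frame):
    """Return (births, deaths) as sorted lists of partial indices."""
    births = sorted(next_frame.keys() - prev_frame.keys())
    deaths = sorted(prev_frame.keys() - next_frame.keys())
    return births, deaths

def unfaded_partials(prev_frame, next_frame):
    """Indices that are born or die between the two frames with nonzero amplitude."""
    births, deaths = births_and_deaths(prev_frame, next_frame)
    bad = [i for i in births if next_frame[i][1] != 0.0]
    bad += [i for i in deaths if prev_frame[i][1] != 0.0]
    return sorted(bad)

prev_frame = {1: (440.0, 0.5, 0.0), 2: (880.0, 0.0, 0.0)}
next_frame = {1: (441.0, 0.5, 0.1), 3: (1320.0, 0.2, 0.0)}
births, deaths = births_and_deaths(prev_frame, next_frame)
unfaded = unfaded_partials(prev_frame, next_frame)
print(births, deaths, unfaded)  # [3] [2] [3]
```

Here partial 2 dies with amplitude zero, as recommended, while partial 3 is born at amplitude 0.2 and would need a fade-in supplied by the synthesizer.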
Pseudo-harmonic Sinusoidal Tracks (1HRM)
Pseudo-harmonic sinusoidal tracks frames are exactly like sinusoidal
track frames except that the partials are understood to lie on or close
to a harmonic series. Thus, the index column of the 1HRM matrix represents harmonic
partial number rather than an arbitrary index. Partial numbers start from 1,
so the frequency of each pseudo-harmonic sinusoid should be close to the partial
number times the fundamental frequency.
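How close a partial lies to its ideal harmonic can be expressed as a relative deviation. This is a minimal sketch; the function name `harmonic_deviation` is hypothetical, and the text sets no threshold for what counts as "close".

```python
def harmonic_deviation(partial_number, freq_hz, f0_hz):
    """Relative deviation of a partial from partial_number * f0 (0.0 = exactly harmonic)."""
    ideal = partial_number * f0_hz
    return (freq_hz - ideal) / ideal

# Third partial of a 220 Hz fundamental, measured slightly sharp at 663 Hz.
deviation = harmonic_deviation(3, 663.0, 220.0)
print(f"{deviation:.4f}")  # 0.0045
```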
Resonances / Exponentially Decaying Sinusoids (1RES)
Resonance data can describe the characteristics of a resonant system
like a group of tuned filter banks, or can specify parameters for a model
of sinusoids with fixed frequencies and exponentially decaying amplitudes.
(If you put an impulse into such a group of filter banks, the output should
be a sum of sinusoids with fixed frequencies and exponentially decaying
amplitudes, so these two situations are in a certain sense the same.)
1RES frames consist of a single 1RES matrix:
The decay curve of a resonance should be the same as that of a two-pole filter
with bandwidth equal to decay rate divided by pi. This formula gives the amplitude
of each sinusoid over time:
amp(t) = initial_amp * e ^ (- decay_rate * t)
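The amplitude formula and the bandwidth relation translate directly into code. This sketch assumes t in seconds and a decay rate in units of 1/second, which would put the bandwidth in Hz; those units and the function names are assumptions, not stated in the text.

```python
import math

def resonance_amp(initial_amp, decay_rate, t):
    """amp(t) = initial_amp * e^(-decay_rate * t)"""
    return initial_amp * math.exp(-decay_rate * t)

def bandwidth(decay_rate):
    """Bandwidth of the equivalent two-pole filter: decay rate divided by pi."""
    return decay_rate / math.pi

# With decay_rate = ln(2), the amplitude halves every unit of time.
half = resonance_amp(1.0, math.log(2), 1.0)
print(round(half, 6))  # 0.5
```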
The phase of a resonance specifies the initial phase of each decaying
sinusoid. (Thanks to Jean Laroche for
suggesting that we include phase in this frame type.)
The original SDIF spec included some interesting extra
columns for resonances.