SDIF Standard Frame Types

Organization of This Document

The SDIF standard includes an extensible collection of standard frame and matrix
types, listed in this document.

Each standard matrix type exists independently of the standard frame types
that must include it; any matrix may appear in a frame of any type. However,
for clarity, this document describes each matrix type in the context of the
frame type for which it was invented, with a special section at the end for
matrix types invented to be a part of any frame.

The following frame types have been defined as part of the SDIF standard.
Each of these frame types has one or more corresponding matrix types. To
give a sense of what kind of data is in each frame type, this table also
lists the columns of the main matrix type for each frame type. Each frame
type is described in detail in its own section below.

Frame Type ID  Frame Type                                     Columns of Main Matrix
-------------  ---------------------------------------------  ------------------------------------
1FQ0           Fundamental Frequency Estimates                Fundamental frequency, confidence
1STF           Discrete Short-Term Fourier Transform          Real & imaginary bin values
1PIC           Picked Spectral Peaks                          Freq, amp, phase, confidence
1TRC           Sinusoidal Tracks                              Index, freq, amp, phase
1HRM           Pseudo-harmonic Sinusoidal Tracks              Harmonic partial #, freq, amp, phase
1RES           Resonances / Exponentially Decaying Sinusoids  Freq, amp, decay rate, phase
1TDS           Time Domain Samples                            Channels of sample data

Frame Types to be Standardized

The following sound descriptions should eventually have standard SDIF
frame types. We have decided to delay the definition of these types until
the base SDIF standard has been accepted by the community. We welcome any
ideas or proposals about how to represent this data in SDIF frames.

Spectral envelopes (sampled and parametric)
Cepstral coefficients
LPC coefficients
Formants
Wavelets
Diphones
"Note lists"

Conventions Followed By SDIF's Standard Frame and Matrix types

- Amplitude is always linear, never in dB or any other scale.
- When a matrix has both frequency and amplitude columns, frequency always
  comes first.
- The "main" matrix required by a frame type will have a MatrixTypeID
  equal to the FrameTypeID.
- Some frame types consist of a main matrix of data plus a few extra fields
  in a secondary 1D matrix, e.g., time-domain-sample frames must include the
  sampling rate as well as the actual sample values. In these cases the naming
  convention is for the info matrix's MatrixTypeID to begin with the character
  "I" (for "info") and to have the same 3 final characters as the FrameTypeID.
- In general, we try to encode information without reference to a particular
  sampling rate, and even to allow for non-isochronous sampling methods.
- In general, we try to define the semantics of frames to be as stateless
  as possible: it should be possible to interpret the contents of a frame
  without reference to any other frames. When this is not possible, e.g.,
  when the frames contain data for a custom synthesis method that needs to
  be configured, the second best alternative is to put all "initialization"
  information in a single frame that allows all the data frames to be
  interpreted.
- SDIF frames should describe "what they are" rather than "what they came
  from."

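The two naming conventions above can be sketched in a few lines of Python. The
function names are hypothetical and do not come from any SDIF library; the
sketch also assumes that type IDs are four-character strings, as every ID in
this document is.

```python
def main_matrix_type_id(frame_type_id):
    """The main matrix shares its type ID with the frame type."""
    return frame_type_id


def info_matrix_type_id(frame_type_id):
    """Info matrix ID: "I" followed by the last 3 characters of the frame type ID."""
    if len(frame_type_id) != 4:
        # Assumption: SDIF type IDs are exactly four characters long.
        raise ValueError("expected a four-character type ID")
    return "I" + frame_type_id[1:]


print(info_matrix_type_id("1TDS"))  # ITDS
print(info_matrix_type_id("1STF"))  # ISTF
```
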
Time-Domain Samples

Time-domain samples are the typical representation for digitally sampled sound,
used by common sound file formats such as WAV and AIFF. The goal of SDIF's time-domain
samples frame type is to provide a uniform representation and the convenience of
having time domain samples in the same SDIF file or stream as other sound descriptions,
not to codify every ingenious scheme for representing audio in the minimum number
of bits. Therefore we restrict this type to linearly quantized samples with
no compression.

1TDS frames must contain a 1TDS matrix to hold the samples and an ITDS "time
domain info" matrix that says how to interpret the samples:

1TDS matrix:

Matrix Type: "1TDS"
Rows: Sample frames
Allowed MatrixDataTypes: float32, float64, int32, int64
Columns: Amplitudes in each channel. Linear. All but the first are optional.

ITDS matrix:

Matrix type: "ITDS"
Rows: Always exactly one row
Allowed MatrixDataTypes: float64
Columns:
- The sampling rate. Required.

More columns may be added to the ITDS matrix in the future, including the following:

- The nominal number of bits of precision of the A/D converter.
- The nominal noise floor of the converter, in dB.
- The noise floor of the converter as computed/estimated by examining some
  digitized "open channel" signal.
- The DC offset of the sample amplitudes (i.e., the average sample amplitude).
- The location and magnitude of the most positive sample.
- The location and magnitude of the most negative sample.

Unlike most other SDIF frame types, a frame of 1TDS data represents an interval
of time (equal to the number of rows in the 1TDS matrix divided by the sampling
rate) rather than an instant of time. The time tag of a 1TDS frame represents
the beginning of this interval.
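
As a small illustration of this rule, the interval covered by a 1TDS frame
can be computed from its time tag, its row count, and the sampling rate from
the ITDS matrix. The function name and argument names below are hypothetical.

```python
def tds_frame_interval(time_tag, num_rows, sampling_rate):
    """Return (start, end) in seconds for a 1TDS frame.

    The time tag marks the beginning of the interval; the duration is the
    number of rows (sample frames) divided by the sampling rate.
    """
    duration = num_rows / sampling_rate
    return time_tag, time_tag + duration


start, end = tds_frame_interval(1.5, 88200, 44100.0)
print(start, end)  # 1.5 3.5
```
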

Most SDIF streams containing 1TDS data will consist of a single large frame
at time zero with all of the samples for the stream in a single matrix. The
same data could be represented equivalently in a series of shorter frames,
for example, a series of frames containing one-second intervals of sample
data at times 0, 1, 2, 3, 4, etc., or unequal-sized frames, e.g., 1.5
seconds at time 0, 2 seconds at time 1.5, 0.7 seconds at time 3.5, 1 second
at time 4.2, etc. Note that at a 96 kHz sampling rate, the limit of 2^32
rows in a matrix imposes a limit of about 12.4 hours of sound in a single
frame.
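
The 12.4-hour figure follows directly from dividing the row limit by the
sampling rate. A quick check in Python (the names are only for illustration):

```python
MAX_ROWS = 2 ** 32  # row limit per matrix, as stated above


def max_frame_hours(sampling_rate):
    """Longest duration, in hours, that fits in one 1TDS matrix."""
    seconds = MAX_ROWS / sampling_rate
    return seconds / 3600.0


print(round(max_frame_hours(96000.0), 1))  # 12.4
```
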

There is also the possibility of "gaps" in the time axis, for example,
one second of sound in a frame at time 0 followed by more sound in a frame at
time 10. In these cases, the stream implicitly contains zero-valued samples
in any intervals of time not spanned by sample data in frames. So, in this example,
there would be one second of sound, followed by 9 seconds of silence, followed
by more sound.

There is also the possibility of 1TDS frames that overlap in the time axis,
for example, a frame at time zero with 2 seconds of samples, followed by a frame
at time 1 with more samples. In these cases, the semantics are that the sample
values are added together.
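
The gap and overlap semantics can be sketched as a renderer that turns a list
of single-channel 1TDS frames into one sample buffer. This is only an
illustration under stated assumptions: the function name is hypothetical,
frames are given as (time tag, samples) pairs, and each time tag is rounded to
the nearest sample index, which the standard does not prescribe.

```python
def render_tds_stream(frames, sampling_rate):
    """Mix (time_tag_seconds, samples) frames into one list of samples."""
    if not frames:
        return []
    # The buffer must reach the end of the latest-ending frame.
    length = max(round(t * sampling_rate) + len(s) for t, s in frames)
    out = [0.0] * length  # intervals not covered by any frame stay at zero
    for time_tag, samples in frames:
        start = round(time_tag * sampling_rate)  # assumption: nearest sample
        for i, value in enumerate(samples):
            out[start + i] += value  # overlapping frames are summed
    return out


# 4 Hz sampling rate: an overlap at 0.5 s and a gap between 1 s and 2 s.
frames = [(0.0, [1, 1, 1, 1]), (0.5, [1, 1]), (2.0, [2])]
print(render_tds_stream(frames, 4))
```
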

Separate Matrix Type for Annotating Multi-Channel Data

Rather than define some fixed interpretation of multi-channel data like "1
is front left, 2 is front right, 3 is rear left, 4 is rear right", we propose
to invent an SDIF matrix type specifically for describing multi-channel data.
This would allow simple textual labels like those above, but also precise geometric
measurements about exact microphone placement, speaker placement, etc. It would
also support textual annotations about the content of each channel, e.g., the
name of an instrument on a particular channel of a multi-track recording.

This matrix type would be optional in 1TDS frames, or any other frame type
with multi-channel data.

Fundamental Frequency Estimates

Not all sounds have a definite fundamental frequency; some have multiple possible
fundamental frequencies. Note that we use the term "fundamental frequency" or
"f0" rather than "pitch"; this is because pitch is a perceptual phenomenon while
fundamental frequency is a signal processing quantity. We might invent a new
SDIF frame and matrix type for pitch to represent the result of a true pitch
estimator that applied a model based on human perception.

1FQ0 frames consist of a single 1FQ0 matrix:

Matrix Type: "1FQ0"
Allowed MatrixDataTypes: float32, float64
Rows: Candidate fundamental frequencies suggested by the estimator.
Columns:
- Fundamental frequency (Hertz). Required.
- Confidence (0 = none, 1=completely sure). Optional, default is 1.

Note that this format accommodates estimators that vote amongst fundamental
frequency candidates. Each row of the matrix is one estimated fundamental
frequency.
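
A reader of 1FQ0 data might pick the most confident candidate as follows.
This is a sketch, not a prescribed algorithm: the function name is
hypothetical, the matrix is represented as a list of rows, and returning None
for an empty matrix is an assumption made here for sounds with no definite
fundamental frequency.

```python
def best_f0(rows):
    """Return (frequency_hz, confidence) of the most confident candidate.

    Each row is [frequency] or [frequency, confidence]; a missing
    confidence column takes the default value of 1.
    """
    best = None
    for row in rows:
        freq = row[0]
        conf = row[1] if len(row) > 1 else 1.0
        if best is None or conf > best[1]:
            best = (freq, conf)  # ties keep the earlier row
    return best


print(best_f0([[220.0, 0.3], [440.0, 0.9], [110.0, 0.5]]))  # (440.0, 0.9)
```
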

Note that this format does not support the notion of "tracking"
various fundamental frequency estimates over time. In this respect it is
more like the 1PIC frame type than the 1TRC frame type. We are considering adding another frame
type for "tracked fundamental frequency estimates" that would
include an index for each fundamental frequency.

Discrete Short-Term Fourier Transform/Phase Vocoder

1STF frames represent the data that come out of a discrete short-term time-domain
to frequency-domain transform such as an FFT.

Here is a precise mathematical definition of this frame type:

Let s(i) be a discrete signal with sampling rate SR Hertz
Let w(m) be a window defined with the support [0, M-1], i.e., w(m)=0 for
m<0 and m>=M
Let N be the size of the transform

We define the input to the transform, x(n), as follows. Note that the windowed
signal is 'put' at the beginning of the vector x(n).

Let x(n) =   s(i+n) * w(n)  for  0 <= n <= M-1
    x(n) =   0              for  M <= n <= N-1

(This is slightly redundant, since we define w(m)=0 when m>=M.)

The 1STF matrix data is the Discrete Fourier Transform (DFT) of size N, i.e.
the X(k) as follows.

The DFT is a length N vector X, with these elements:

              N-1
       X(k) = sum  x(n) * exp(-j * 2 * pi * k * n/N)
              n=0

       0 <= k <=N-1

The time tag in a 1STF frame is the time of the center of the
window, i.e., (i + M/2)/SR, not the beginning.
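
The definition above can be written out directly in Python. The sketch below
uses a plain O(N^2) DFT from the standard library so that each step matches
the formulas; a real implementation would use an FFT. The function name
stf_frame is hypothetical, and the sketch assumes N >= M.

```python
import cmath


def stf_frame(s, i, w, N, SR):
    """Return (time_tag, X) for the window starting at sample i.

    s: signal samples, w: window of length M, N: transform size,
    SR: sampling rate in Hertz.
    """
    M = len(w)
    if N < M:
        raise ValueError("transform size N must be at least the window size M")
    # The windowed signal goes at the beginning of x; zero padding follows.
    x = [s[i + n] * w[n] for n in range(M)] + [0.0] * (N - M)
    # X(k) = sum over n of x(n) * exp(-j * 2 * pi * k * n / N)
    X = [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]
    # The time tag is the center of the window, not its beginning.
    time_tag = (i + M / 2) / SR
    return time_tag, X


time_tag, X = stf_frame([1.0, 0.0, 0.0, 0.0], 0, [1.0] * 4, 8, 8.0)
print(time_tag)  # 0.25
```

In this usage example the input is a unit impulse at the start of the window,
so every bin X(k) equals 1.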

Notes:

- This definition corresponds to the output of Matlab's (and UDI's) FFT
  function.
- The real and imaginary parts come directly from this formula; therefore,
  if you compute a phase as atan2(imaginary, real), it is the phase of the
  corresponding COSINUSOID (and not sinusoid, as we are used to in additive
  synthesis) at time (i)/SR.
- Note that the windowed signal is 'put' at the beginning of the vector x(n)
  (then zero padding follows), and this is crucial for the phase definition.
- Because of aliasing and foldover above the Nyquist frequency (and below
  the negative Nyquist frequency), the output of the DFT can be thought of as
  a periodic function of frequency over the range -infinity to infinity. The
  period of this function is the range from the negative Nyquist frequency to
  the positive Nyquist frequency, in other words, the sampling rate of the
  input signal.

1STF frames consist of an ISTF "info" matrix to record overall
information about the transform, plus a 1STF matrix that contains the actual
bin data.

STFT info matrix:

Matrix type: "ISTF"
Allowed MatrixDataTypes: float32, float64
Rows: always exactly one row
Columns (all required)
- period of the DFT (i.e., SR): Hertz
- Frame size (i.e., size of the windowed signal) M/SR: seconds
- Size of the transform N (The data matrix does not necessarily represent
  all N bins)

Each 1STF frame must also contain a 1WIN matrix specifying the window function.

STFT data matrix:

Matrix type: "1STF"
Allowed MatrixDataTypes: float32, float64, int32, int64
Rows: Frequency bins output by the transform
Columns (both required)
- Real part (unitless, as it comes out of the STFT)
- Imaginary part (unitless, as it comes out of the STFT)

You can convert these complex numbers into polar form to get magnitude
and phase.
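
A minimal sketch of that conversion for one row of the 1STF matrix follows.
The function name is hypothetical. Wrapping the phase into the 0 to 2*pi range
is a choice made here to match the phase columns of the other frame types; as
noted above, the resulting phase is that of a cosine.

```python
import math


def to_polar(real, imag):
    """Convert one (real, imaginary) bin value to (magnitude, phase)."""
    magnitude = math.hypot(real, imag)
    # atan2 returns a value in [-pi, pi]; wrap it into the 0 to 2*pi range.
    phase = math.atan2(imag, real) % (2 * math.pi)
    return magnitude, phase


print(to_polar(0.0, 1.0))
```
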

Picked Spectral Peaks

Picked spectral peaks represent peaks (local maxima) in a spectrum at
a given time. Peak pickers typically fit some kind of curve to 1STF data,
providing frequency, amplitude, and phase estimates that are more accurate
than the bins themselves.
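
As one example of such curve fitting, a peak picker could fit a parabola
through a local maximum and its two neighboring bins. Parabolic interpolation
is a common technique, chosen here only for illustration; the standard does
not prescribe any particular method. The function name and arguments are
hypothetical, and mags is assumed to hold bin magnitudes derived from 1STF
data.

```python
def parabolic_peak(mags, k, SR, N):
    """Refine the peak at bin k using its two neighbors.

    Returns (frequency_hz, peak_height), where peak_height is in the same
    units as mags. SR is the sampling rate and N the transform size.
    """
    if k < 1 or k > len(mags) - 2:
        raise ValueError("bin k needs a neighbor on each side")
    a, b, c = mags[k - 1], mags[k], mags[k + 1]
    denom = a - 2 * b + c
    if denom == 0:
        # The three points are collinear; keep the bin itself.
        return k * SR / N, b
    p = 0.5 * (a - c) / denom  # offset of the vertex, in bins
    height = b - 0.25 * (a - c) * p
    return (k + p) * SR / N, height


# Samples of a parabola whose vertex is at bin 5.25 with height 10.
mags = [10.0 - (x - 5.25) ** 2 for x in range(8)]
print(parabolic_peak(mags, 5, 16.0, 16))
```
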

1PIC frames consist of a single 1PIC matrix:

Matrix type: "1PIC"
Allowed MatrixDataTypes: float32, float64
Rows: Spectral peaks
Columns
- Frequency (Hertz). Required.
- Amplitude (linear). Optional; default is 1.0.
- Phase (Radians: from 0 to 2*pi). Optional, no default.
- Confidence (1.0 = 100%, 0.0 = 0%). Optional, default is 1.0.

The confidence factor might be used to indicate how much of the energy
around this peak was from a sinusoid or how well the energy around this
peak matches a sinusoid.

Sinusoidal Tracks

Sinusoidal tracks represent sinusoids that maintain their continuity
over time as their frequencies, amplitudes, and phases evolve. Sinusoidal
tracks are the standard data format used as the input to classical additive
synthesis.

1TRC frames consist of a single 1TRC matrix:

Matrix type: "1TRC"
Allowed MatrixDataTypes: float32, float64
Rows: Sinusoidal tracks
Columns
- Index (a unique integer >= 1 identifying this track and allowing
  it to be matched with 1TRC data in other frames. This is similar in concept
  to a Stream ID.) Required.
- Frequency (Hertz). Required.
- Amplitude (linear). Optional; default is 1.0.
- Phase (Radians: must be between 0 and 2*pi). Optional, no default.

Synthesizers of 1TRC frames are expected to match the data for each sinusoid
from frame to frame using the index numbers. Values for amplitude and frequency
should somehow be interpolated so that they change smoothly between each
frame.
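
The matching and interpolation step might look like the sketch below. Linear
interpolation is only one possible choice, since the text above leaves the
method open. The function name is hypothetical, and each frame is represented
as a dictionary from track index to a (frequency, amplitude) pair.

```python
def interpolate_tracks(frame_a, frame_b, t_a, t_b, t):
    """Interpolate matched tracks between two 1TRC frames at time t.

    Only tracks whose index appears in both frames are returned.
    """
    if t_b == t_a:
        raise ValueError("the two frames must have different time tags")
    u = (t - t_a) / (t_b - t_a)  # 0 at frame_a, 1 at frame_b
    result = {}
    for index, (freq_a, amp_a) in frame_a.items():
        if index in frame_b:  # match tracks by index
            freq_b, amp_b = frame_b[index]
            result[index] = (
                freq_a + u * (freq_b - freq_a),
                amp_a + u * (amp_b - amp_a),
            )
    return result


frame_a = {1: (100.0, 0.5), 2: (200.0, 1.0)}
frame_b = {1: (110.0, 0.7), 3: (300.0, 0.2)}
print(interpolate_tracks(frame_a, frame_b, 0.0, 1.0, 0.5))
```
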

As phase is the integral of the instantaneous frequency over time, the
phase values in each frame may not necessarily match a synthesizer's concept
of what the phase should be based on the previous phase and the frequency
trajectory since the previous phase. Some synthesizers will ignore the phase
field or use it only for the initial phase. Others will take the phases
into account when interpolating frequencies from frame to frame. Others
will "cheat" the desired frequencies to produce the desired phases.

We imagine SDIF utilities that
would check the "reasonableness" of phase values based on the
frequencies.

There is no guarantee that a partial appearing in one frame will also
appear in the next frame. The situation where a partial appears in one frame
but not the next is called a "death", and when a partial does
not appear in one frame but does appear in the next frame it's called a
"birth". These cases are challenging when writing a synthesizer.
It's recommended that partials appearing for the first or last time in a
series of frames should have amplitudes of zero, so that the semantics of
fading in and out are explicitly in the SDIF data rather than needing to
be added on by the synthesizer.
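
A utility could detect births and deaths and check the zero-amplitude
recommendation along these lines. Both function names are hypothetical. The
check assumes that each track index occupies one contiguous run of frames, and
it treats tracks present in the first or last frame of the data like any
other.

```python
def births_and_deaths(prev_indices, next_indices):
    """Return (births, deaths) between two consecutive 1TRC frames."""
    prev, nxt = set(prev_indices), set(next_indices)
    return sorted(nxt - prev), sorted(prev - nxt)


def fade_violations(frames):
    """List track indices whose first or last amplitude is not zero.

    frames: list of dicts mapping track index to amplitude, in time order.
    """
    first, last = {}, {}
    for frame in frames:
        for index, amp in frame.items():
            first.setdefault(index, amp)  # amplitude at first appearance
            last[index] = amp  # amplitude at latest appearance
    return sorted(i for i in first if first[i] != 0.0 or last[i] != 0.0)


print(births_and_deaths([1, 2], [2, 3]))  # ([3], [1])
```
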

Pseudo-harmonic Sinusoidal Tracks

Pseudo-harmonic sinusoidal tracks frames are exactly like sinusoidal
track frames except that the partials are understood to lie on or close
to a harmonic series. Thus, the index column of the 1HRM matrix represents harmonic
partial number rather than an arbitrary index. Partial numbers start from 1,
so the frequency of each pseudo-harmonic sinusoid should be close to the partial
number times the fundamental frequency.
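
To illustrate, the sketch below measures how far each partial in a 1HRM
matrix lies from its ideal harmonic position. The function name is
hypothetical, and the fundamental frequency f0 is assumed to be known from
elsewhere (for example from 1FQ0 data), since the 1HRM matrix itself does not
carry it.

```python
def harmonic_deviation(rows, f0):
    """Return (partial_number, deviation_hz) for each row.

    rows: (partial_number, frequency_hz) pairs; the deviation is the
    measured frequency minus partial_number * f0.
    """
    return [(n, freq - n * f0) for n, freq in rows]


rows = [(1, 220.0), (2, 441.0), (3, 659.0)]
print(harmonic_deviation(rows, 220.0))  # [(1, 0.0), (2, 1.0), (3, -1.0)]
```
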

Exponentially Decaying Sinusoids/Resonances

Resonance data can describe the characteristics of a resonant system
like a bank of tuned filters, or can specify parameters for a model
of sinusoids with fixed frequencies and exponentially decaying amplitudes.
(If you put an impulse into such a filter bank, the output should
be a sum of sinusoids with fixed frequencies and exponentially decaying
amplitudes, so these two situations are in a certain sense the same.)

1RES frames consist of a single 1RES matrix:

Matrix type: "1RES"
Allowed MatrixDataTypes: float32, float64
Rows: resonances
Columns:
- Frequency (Hertz). Required.
- Amplitude (linear). Optional, default is 1.0.
- Decay Rate (Hertz). Optional, no default.
- Phase (Radians: from 0 to 2*pi). Optional, no default.

The decay curve of a resonance should be the same as that of a two-pole filter
with bandwidth equal to decay rate divided by pi. This formula gives the amplitude
of each sinusoid over time:

	amp(t) = initial_amp * e ^ (- decay_rate * t)
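
The envelope formula and the bandwidth relation translate directly into code.
The function names below are hypothetical. The t60 helper is an addition for
illustration: it solves the same formula for the time at which the amplitude
has fallen by 60 dB, i.e., to 1/1000 of its initial value.

```python
import math


def resonance_amplitude(initial_amp, decay_rate, t):
    """amp(t) = initial_amp * e ^ (-decay_rate * t)"""
    return initial_amp * math.exp(-decay_rate * t)


def bandwidth_hz(decay_rate):
    """Bandwidth of the equivalent two-pole filter: decay rate divided by pi."""
    return decay_rate / math.pi


def t60(decay_rate):
    """Time in seconds for the amplitude to drop to 1/1000 of its start."""
    return math.log(1000.0) / decay_rate


print(resonance_amplitude(1.0, 2.0, 0.0))  # 1.0
```
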

The phase of a resonance specifies the initial phase of each decaying
sinusoid. (Thanks to Jean Laroche for
suggesting that we include phase in this frame type.)

The original SDIF spec included some interesting extra
columns for resonances.