Latency: Myths and Facts. Part 3: A look at a quantitative study.

In the previous posts about latency (Part 1 and Part 2) we talked informally about latency and its perception. We mostly relied on “rule of thumb” reasoning to arrive at reasonable conclusions about latency, its perception, and good latency thresholds.

In this post we will look instead at a scientific paper published on the topic: The Effects of Latency on Live Sound Monitoring by Michael Lester and Jon Boley. Perception of sound is a very counter-intuitive and complex phenomenon, so it will be nice to see how our rule of thumb compares to a proper scientific study of the matter. It is not uncommon, especially in this field, to see results completely different from what simple reasoning would lead us to expect…

So let’s not lose any more time and let’s dig in!

What actually is latency perception?

This is a very tough question to answer, and we have to look at it before we actually look at the paper.

Latency by itself is a time lag. Usually, we mean the lag between the instant at which we trigger a note on our instrument (which could just as well be our voice) and the instant at which we hear the sound as produced by some sound altering or sound producing system.

For example, we might call latency the time between singing a well defined note and the time at which we hear our voice in a monitoring system, after being processed by a digital compressor.

Or we might call latency the time between the instant at which we press an organ key and the instant at which we hear the sound produced by the organ pipes.

Most often we think about latency in relation to digital systems, especially computers. But how do we perceive latency? Well, there isn’t just one answer, and the matter is actually not totally understood.

Latency as a time domain effect

Sometimes we hear latency as a clear delay: we literally hear that the sound takes time to be produced after we trigger it. Or, if we happen to hear both the original sound and the delayed one, we hear a sort of “echo”. This kind of perception is mostly associated with long latencies.

Latency as a spatial audio effect

This effect is not discussed in the paper, but one important cue we use to locate a sound source in space is the inter-aural delay: the sound coming from a source needs different times to reach the two ears. These times depend on the source location relative to the head, so our brain can extract directional information from the inter-aural delay. According to Fastl and Zwicker, inter-aural delays of 50\,{\textstyle \mu s} are perceptible in this sense (Third Edition, see Section 15.1 Just-Noticeable Interaural Delay, at page 293). It is clear then that if one or more audio sources are affected by latency, the perceived position of the overall sound source could change.

Latency as a frequency domain effect

This happens when the primary signal and the one affected by latency coexist at the same time and location.

For example, maybe we are playing an acoustic guitar, but we are also capturing it with a mic, applying a digital amplifier, and sending the signal to a speaker. This means that at our listening position we are exposed to the direct pressure wave from the guitar (primary signal) and the wave from the speaker (delayed signal). If the levels are such that we can hear both signals then we will have a frequency domain alteration known as comb filtering.


Let’s suppose that this is our primary signal:

p\left(f,\,t\right)=\cos\left(2\pi ft\right)

A simple cosine wave at some frequency f, which oscillates in time t as represented below, where one period, the time needed for the wave to complete one cycle, is given by \frac{1}{f} (in seconds, if the frequency is expressed in Hertz). To keep things general, let’s imagine the equation above is the result of dividing the actual physical signal by its peak amplitude, so that the signal is expressed in normalized dimensionless units, taking values between -1 and 1.


This is the most basic kind of signal in every domain (electronics, acoustics, etc.). When it is a pressure wave it relates very simply to perception, its amplitude being strongly correlated to loudness and its frequency to pitch. Also, signals of arbitrary complexity can be decomposed into a superposition of different signals of this kind.

Now, let’s try to describe an identical signal, but delayed by \delta seconds:

p_{\delta}\left(f,\,t\right)=\cos\left(2\pi f\left[t-\delta\right]\right)

When primary and delayed signals are present at the same time the resulting signal will be their sum. But the result depends on the frequency!
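As a sketch of why the result depends on the frequency, we can apply the sum-to-product identity for cosines to the primary and delayed signals (the symbol s for their sum is introduced here only for illustration):

s\left(f,\,t\right)=\cos\left(2\pi ft\right)+\cos\left(2\pi f\left[t-\delta\right]\right)=2\cos\left(\pi f\delta\right)\cos\left(2\pi f\left[t-\frac{\delta}{2}\right]\right)

The sum is still a cosine at frequency f, but its amplitude is \left|2\cos\left(\pi f\delta\right)\right|, which depends on the product of frequency and delay. It equals 2 when f\delta is an integer and 0 when f\delta is an odd multiple of \frac{1}{2}.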

Let’s look at this animation for cosine signals delayed by \delta=1\,{\textstyle ms}:


The two signals are shown on the left panel together with their sum. We can see that at the starting frequency of 20\,\textrm{Hz} the waves are very similar, hence their sum is close to the primary wave, but doubled in amplitude, as we are summing nearly identical values at each time instant. However, the closer we get to 500\,\textrm{Hz}, the more the peaks of one wave line up with the troughs of the other, making the waves cancel each other when summed, as at each instant they take values of opposite sign. This happens because changing the frequency, which is the number of cycles per unit time, alters the number of wave cycles we pack into a slice of time, while the delay stays fixed. Hence we keep on changing the arrangement of the peaks.

On the right panel the amplitude of the sum is recorded as the frequency changes. It is clear that the phenomenon is cyclic: the amplitude of the sum can never be bigger than 2 or smaller than 0. As we keep changing the frequency there will be more points of cancellation and more points of summation, as the peaks are constantly rearranged. In fact, we can copy and paste the plot an infinite number of times: the plot for frequencies between 1000\,\textrm{Hz} and 2000\,\textrm{Hz} will look just like the one between 0\,\textrm{Hz} and 1000\,\textrm{Hz}.

This regular repetition of maximal and minimal values at certain frequencies, which are determined by the amount of latency, is called a comb filter, due to its shape being reminiscent of a comb. But why filter? Well, we know that signals of arbitrary complexity can be expressed as a superposition of simple sinusoids (possibly an infinite number of them). Imagine now that the primary signal was not a cosine wave, but some recorded music. Then every signal component around 500\,\textrm{Hz}, 1500\,\textrm{Hz}, 2500\,\textrm{Hz}… would be strongly attenuated, as the sinusoids that make up our music and are unlucky enough to sit in those regions would be cancelled. Similarly, sinusoid components around 1000\,\textrm{Hz}, 2000\,\textrm{Hz}, 3000\,\textrm{Hz}… would be boosted. This is exactly like playing our music through a filter, which is a device altering the frequency content of a signal.

Of course, if the two waves don’t have the same amplitude the combing will not be as strong, but this shows that a latency of even just 1\,\textrm{ms} can have important and clearly perceptible frequency domain effects, since superposing direct and delayed signals results in comb filtering. As noted by the paper authors, in these conditions even a latency of 50\,{\textstyle \mu s} can be perceptible as a frequency domain effect, being associated with a comb minimum at 10000\,\textrm{Hz}. In fact, the smaller the latency, the more likely it is that latency artifacts are perceived as frequency domain effects, if at all.
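To make the right panel concrete, here is a minimal Python sketch that computes the amplitude of the sum of a unit cosine and its delayed copy. It assumes the closed form |2 cos(π f δ)|, which follows from the sum-to-product identity for cosines, and cross-checks it against a brute-force sampling of the sum. The function names are made up for this illustration.

```python
import math


def sum_amplitude(freq_hz, delay_s):
    """Amplitude of cos(2*pi*f*t) + cos(2*pi*f*(t - delay)), closed form."""
    return abs(2.0 * math.cos(math.pi * freq_hz * delay_s))


def sum_amplitude_sampled(freq_hz, delay_s, samples=10000):
    """Same amplitude, estimated by sampling the sum over one period."""
    period = 1.0 / freq_hz
    peak = 0.0
    for i in range(samples):
        t = i * period / samples
        primary = math.cos(2.0 * math.pi * freq_hz * t)
        delayed = math.cos(2.0 * math.pi * freq_hz * (t - delay_s))
        peak = max(peak, abs(primary + delayed))
    return peak


DELAY_S = 1e-3  # the 1 ms delay used in the animation

for freq in (20, 250, 500, 750, 1000, 1500, 2000):
    print(f"{freq:5d} Hz -> amplitude {sum_amplitude(freq, DELAY_S):.3f}")
```

Running it shows the amplitude starting close to 2 at 20 Hz, dropping to (numerically almost) 0 at 500 Hz, and repeating the same pattern every 1000 Hz.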
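The numbers in this paragraph can be reproduced with a short Python sketch. It assumes the usual comb filter model, a direct signal plus one delayed copy scaled by a factor delayed_level, so that the minima fall at odd multiples of 1/(2δ). The function names and the 20 kHz upper limit are choices made only for this example.

```python
import math


def notch_frequencies(delay_s, max_hz=20000.0):
    """Comb minima in Hz: odd multiples of 1 / (2 * delay), up to max_hz."""
    first = 1.0 / (2.0 * delay_s)
    notches = []
    k = 0
    while (2 * k + 1) * first <= max_hz:
        notches.append((2 * k + 1) * first)
        k += 1
    return notches


def comb_gain_db(freq_hz, delay_s, delayed_level=1.0):
    """Level of direct + scaled delayed sinusoid, in dB re the direct one."""
    phase = 2.0 * math.pi * freq_hz * delay_s
    real = 1.0 + delayed_level * math.cos(phase)
    imag = -delayed_level * math.sin(phase)
    magnitude = math.hypot(real, imag)
    return 20.0 * math.log10(magnitude) if magnitude > 0 else float("-inf")


print(notch_frequencies(1e-3, max_hz=3000.0))  # minima for a 1 ms delay
print(notch_frequencies(50e-6))                # minima for a 50 microsecond delay

# Equal levels: near-total cancellation at 500 Hz, about +6 dB at 1000 Hz.
print(comb_gain_db(500.0, 1e-3), comb_gain_db(1000.0, 1e-3))
# Delayed copy at half amplitude: the comb is much shallower.
print(comb_gain_db(500.0, 1e-3, 0.5), comb_gain_db(1000.0, 1e-3, 0.5))
```

With equal levels the dip at 500 Hz is limited only by floating point rounding, while with the delayed copy at half amplitude the dip is about 6 dB and the boost about 3.5 dB: the combing is still there, just less strong.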

How should we then think about latency perception?

Up to now we mostly discussed physical aspects of latency, that is: how latency affects physical signals. However, we hear the physical signals through our hearing system, a complex system that involves both the ears and the brain. What we perceive ultimately depends on the interaction between the physical stimuli and the reaction of our hearing system to them.

It is clear that we cannot perceive latency in a single way. Depending on conditions we might hear a time lag, or a filtering effect. Moreover, there isn’t a clear separation between the two: we might as well hear both a time lag and a filtering effect, depending on many factors, the most important of which is the amount of latency itself. Because of how our ear works, the perception of one or the other effect might be more or less pronounced due to fundamental hearing phenomena such as masking (the ability of one sound to impede the perception of another), which has well defined patterns in both time and frequency.

Also, we didn’t discuss the level of concentration of the listener and the task being performed. Are we just listening casually? What are our critical listening skills? Is our ear trained enough to capture minute differences? Maybe we are actually playing an instrument. What instrument? Organ players are quite accustomed to latency, as the sound from an organ needs time to develop in the pipes. How will their latency perception be shaped by their training on this instrument? Will they be more or less sensitive to latency?

There are many more similar questions that can be asked, all about the psychophysiological side of things, and each one of them would require a separate study to get some form of answer. But we can simplify things in a way:

What amount of latency will cause a test subject in a particular condition to have a different perception of what she/he is listening to?

Most of the study we are going to look at revolves around a similar question, branching out in a few more directions.

A brief study review

Now that we have developed a feeling for the complexity of latency perception, let’s look at what the researchers investigated in their study. Here is the list of research questions, taken directly from the study, together with a few comments expanding on why these are legitimate questions.

What are the different factors that affect latency perception in live monitoring?

By factor we mean any variable that can end up affecting the phenomenon being studied, in this case latency perception. For example, we have seen that to have comb filtering we need the primary and delayed signals to coexist, for example as acoustic waves at our listening position. If we are not able to hear the primary sound (maybe we are just monitoring a MIDI keyboard output from a computer speaker) then we will not have comb filtering at all, ruling out the most important physical frequency domain effect of latency. Clearly, the presence of the primary sound is a factor altering latency perception. There might of course be many more.

What are the differences in latency perception between two different monitor situations: Wedge Monitors 4-6ft from the ear and In-Ear Monitors (IEM)?

It is reasonable to imagine that the kind of monitoring system might also be a factor. Wedge monitors produce sound that propagates in an environment with some reverberation. This introduces many different frequency and time domain effects that can make latency perception more or less evident. Also, when listening to wedge monitors the left and right sources are not separated from each other: our right ear is able to hear the left audio as well, filtered in a way determined by the shape of our head and torso, which could in turn shape latency perception. In contrast, IEMs produce well separated audio which propagates only through our ear canals: a clearly different situation.

(Unless I missed it, the researchers did not report what kind of room they used for the experiment. It would have been appropriate to explain what room they used and why they chose it.)
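One more physical difference between the two situations can be put into numbers: sound needs time to travel from a wedge monitor to the ear. The sketch below estimates this propagation delay for the 4-6 ft distances of the research question, assuming a speed of sound of 343 m/s (dry air at about 20 °C). Whether the latency values reported in the study include this acoustic path is not something discussed here, so take it only as an order of magnitude.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at about 20 degrees Celsius
FOOT_M = 0.3048             # one foot in meters


def propagation_delay_ms(distance_ft, speed_m_s=SPEED_OF_SOUND_M_S):
    """Time in milliseconds for sound to travel distance_ft feet in air."""
    return distance_ft * FOOT_M / speed_m_s * 1000.0


for distance_ft in (4, 6):
    delay = propagation_delay_ms(distance_ft)
    print(f"{distance_ft} ft -> {delay:.2f} ms")
```

Under this assumption the acoustic path alone is worth a few milliseconds, while for IEMs it is negligible.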

What are the differences in latency perception among different instrumentalists? Which musicians are more sensitive than others?

Think about the organ players of the example above. They are very well adapted to the latency of their instrument. Does that mean they become less sensitive? Or maybe they will be extremely sensitive to any latency that deviates from what they are so used to? What will happen instead to singers, who have had a completely different training? And what about instrumentalists playing instruments that give immediate sensory feedback, for example wind instruments, which vibrate and alter the pressure at the instrumentalist’s mouth (and maybe even at the teeth and hence the bones, allowing them to hear their instrument through bone conduction)? Will their latency sensitivity be higher due to the instantaneous sensory feedback from the instrument, which contradicts the delayed sound?

(Organ players are not included in the study, the most similar musician included being perhaps the keyboard player, but they are perfect for building this example.)

Is there a difference between solo delayed monitoring and monitoring one’s own delayed instrument while playing with a group of non-delayed musicians?

When playing with other musicians one of the main focuses is playing on time (or slightly off, depending on the intention). Now, if one musician is affected by latency this might clearly affect his/her capacity to keep time with the others. The level of concentration required to keep time might raise latency sensitivity. This is clearly a different situation with respect to playing solo.

How much latency can be present in a signal path before a musician will perceive an artifact in the audio signal?

An artifact of any kind. We know that latency might be perceived in different ways, many of which are not well understood due to how our hearing system works. What is the smallest latency that produces a different perception? Of course, as we noted above, it is reasonable to imagine that each musician will have a different threshold, both because every human being is different and because of the different coupling with the instrument and the different training.

How much latency can be present in a signal path before a musician will perceive an actual delay in the signal?

Similar question as above, but focusing on the time domain effects only.

How to answer these questions?

Appropriate experiment design is needed to be able to arrive at an answer. There isn’t a single way an experiment of this kind can be designed. Let’s have a very brief look at the researchers’ choices, with a critical eye.

The natural way to answer the research questions above involves creating a panel of subjects with different musicians. The researchers included 19 practicing musicians in their experiment. Various instruments were played (not necessarily by different musicians): Vocals, Saxophone, Electric Guitar, Keyboard with a piano patch, Electric Bass, and Drums. Apparently, these instruments were selected as they are representative of typical rock/pop band layouts. It isn’t stated clearly why, but perhaps it is because rock/pop bands are the most common (or seen as the most common by the researchers). This perhaps needed some more justification by the authors.

As often happens in psycho-acoustics experiments, the researchers decided to include trained and critical subjects in the experiment. They were 11 of the 19 subjects. Trained subjects are most likely to have lower perception thresholds, increasing the sensitivity of the experiment and reducing the spread of the results, thus increasing the confidence in the conclusions. Also, critical and trained subjects can give better information about human perception thresholds: with untrained subjects we would be asking ourselves whether what we are observing is normal for healthy human beings or the result of having used subjects that are not trained enough to use their auditory system to the top of its capabilities. If that happens, we would end up underestimating the capabilities of the human hearing system, our measurement being polluted by confounding factors like the amount of training. However, less critical subjects are useful too as they allow us, by comparison with the results from critical subjects, to assess the impact of less training or weaker critical skills on the phenomenon under study.

To understand whether a subject was critical, an initial boundary test (more below) was performed by the researchers to coarsely determine latency thresholds. If the threshold was higher than the one due to the Haas effect (mentioned here) the subject would be considered not critical. This also made it possible to choose latency test values better suited to the different coarse thresholds of the different instrumentalists, so as to find accurate results for each. These all appear to be good choices, as they personalize the test for each subject, making it possible to put each subject in the condition of maximal sensitivity.

Even though all of the above makes sense, the panel of subjects is the most unsatisfactory part of the experiment. In my opinion, it is too small and uneven. This leads, as the researchers admit, to poor confidence in some of the experimental results.

The actual test and the initial boundary test made use of the same technique: the subjects were able to route their instruments through an 8-channel, zero-latency analog switch-box connected to various linearly spaced digital delays unknown to the subjects. The musician would operate the switch and monitor through wedge monitors first and IEM second (the latency sets were different in each case, as the thresholds found from the boundary test were different). To simulate playing with a band, a zero-latency metronome at 120 BPM was used to avoid the complexity of another human musician. The musicians would then rate the channels on a scale from Horrible (0) to Excellent (100) based on how perceptible and disruptive the artifacts they heard were. In more detail, the scale was defined as follows:

Excellent: Artifacts are imperceptible. Delay as well as artifacts cannot be identified.
Good: Some artifacts are perceptible, but not necessarily delay. The artifacts, though perceptible, are not annoying and do not contribute badly to musician’s performance.
Fair: Delay and/or artifacts are perceptible. The delay and/or artifacts are slightly annoying, but in most cases would not affect musician’s performance.
Bad: A considerable amount of delay is perceptible. The delay is annoying and is detrimental to musician performance.
Horrible: A musician can’t work under these conditions!

Using these anchor descriptions is good practice in these experiments, as it gives the subjects an unambiguous way to quantify the quality of what they are hearing.

The boundary test differed by being simpler: using a scale of thumbs up and thumbs down, the value that most consistently scored the lowest quality judgment was set as the highest latency to test for. This means that the latency test values were different for each subject. This is a good choice as well, as it makes it possible to maximize the sensitivity of each subject, which must depend on their ear and their training, especially the training on their particular instrument.

To sum up, the experimental procedure looks, in the most simplified way, as follows:

  1. For each subject, determine the highest latency to test for when using IEM.
  2. Space 8 values of latency linearly from 0 to the highest, and test latency perception for IEM.
  3. For each subject, determine the highest latency to test for when using wedge monitors.
  4. Space 8 values of latency linearly from 0 to the highest, and test latency perception for wedge monitors.
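The spacing step of the procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both endpoints (0 and the highest latency) are taken to be among the 8 values, and the 35 ms maximum is a made up number, not one from the paper.

```python
def latency_test_values(max_latency_ms, channels=8):
    """Linearly spaced latencies from 0 to max_latency_ms, endpoints included."""
    step = max_latency_ms / (channels - 1)
    return [i * step for i in range(channels)]


# Hypothetical subject whose boundary test gave 35 ms as the highest latency.
print(latency_test_values(35.0))
```

A subject with a lower boundary gets a finer grid over a shorter range, which is how the test adapts its resolution to each musician.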

The conclusions

Let’s then briefly look at what the experimenters concluded. The table below summarizes the amounts of latency that led to Good and Fair judgments from the subjects. We see that sax players are very sensitive to latency. This agrees with our expectations about sensory feedback from the instrument. Other instruments have larger values, the highest belonging to keyboard players. Why this happens was not determined and is subject to speculation. Keyboards don’t give much sensory feedback to the musician and they are often used with digital systems. Perhaps this makes keyboardists “human latency compensation devices”, like organ players. As such, they could be far less disturbed by latency.

Latency [ms]                       | Saxophone | Vocals | Guitar | Drums | Bass | Keyboards
IEM Good (artifacts perceptible)   | 0         | 1      | 4.5    | 8     | 4.5  | 27
Wedge Good (artifacts perceptible) | 1.5       | 10     | 6.5    | 9     | 8    | 22
IEM Fair (delay perceptible)       | 3         | 6.5    | 14.5   | 24.5  | 25.5 | 43
Wedge Fair (delay perceptible)     | 10        | 26     | 16     | 25    | 30   | 40.5
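For readers who want to play with these numbers, here is a small Python sketch that stores the table and looks up the lowest threshold for a given monitoring system and rating. The values are copied from the table above. The data layout, the function name and the option to exclude instruments are choices made only for this example.

```python
INSTRUMENTS = ("Saxophone", "Vocals", "Guitar", "Drums", "Bass", "Keyboards")

# Latency in ms, keyed by (monitoring system, rating), in INSTRUMENTS order.
THRESHOLDS_MS = {
    ("IEM", "Good"): (0, 1, 4.5, 8, 4.5, 27),
    ("Wedge", "Good"): (1.5, 10, 6.5, 9, 8, 22),
    ("IEM", "Fair"): (3, 6.5, 14.5, 24.5, 25.5, 43),
    ("Wedge", "Fair"): (10, 26, 16, 25, 30, 40.5),
}


def lowest_threshold(monitor, rating, exclude=()):
    """Return (latency_ms, instrument) with the lowest latency in one row."""
    row = THRESHOLDS_MS[(monitor, rating)]
    candidates = [
        (value, name)
        for name, value in zip(INSTRUMENTS, row)
        if name not in exclude
    ]
    return min(candidates)


for key in THRESHOLDS_MS:
    print(key, lowest_threshold(*key, exclude=("Saxophone",)))
```

Excluding the saxophone row entries, the lowest values are 1 ms (IEM) and 6.5 ms (wedge) for Good, and 6.5 ms (IEM) and 16 ms (wedge) for Fair.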

With reference to this table, let’s look at how the research questions are answered.

What are the different factors that affect latency perception in live monitoring?

What are the differences in latency perception between two different monitor situations: Wedge Monitors 4-6ft from the ear and In-Ear Monitors (IEM)?

What are the differences in latency perception among different instrumentalists? Which musicians are more sensitive than others?

We can answer all of these in one go. The latencies associated with the two judgments clearly change between IEM and wedge monitors, so the monitoring system is an important factor. Drums appear to be associated with a less significant change. We can also see that IEM is typically associated with lower latency values, apart from keyboards. The latency values are also very different between the various instruments. This clearly shows that the instruments and their associated training are a major factor. In fact, the researchers found (by analyzing the results from subjects playing more than one instrument) that the instrument is much more important than the subject. Saxophonists appear to be the most sensitive, keyboardists the least. However, the researchers advise taking the saxophone results with a grain of salt: there were too few saxophonists to have good statistical confidence in their results.

Is there a difference between solo delayed monitoring and monitoring one’s own delayed instrument while playing with a group of non-delayed musicians?

A difference was appreciable only for low latencies, and even there it was very small. The experimenters advise designing a new experiment to address this question better.

How much latency can be present in a signal path before a musician will perceive an artifact in the audio signal?

It really depends on monitoring and instruments. Setting the saxophone data aside, we can expect to hear some artifacts for latencies as low as 1\,\textrm{ms} for IEM and 6.5\,\textrm{ms} for wedge monitors.

How much latency can be present in a signal path before a musician will perceive an actual delay in the signal?

Similarly to above, but with 6.5\,\textrm{ms} for IEM and 16\,\textrm{ms} for wedge monitors.

How does this compare to our rule of thumb?

We concluded that we would use 7\,\textrm{ms} as the threshold for our systems. That appears to correlate rather well with the lowest latency that can produce delay perception (apart from the saxophone data): 6.5\,\textrm{ms}, which is also the lowest latency (still apart from saxophones) that will produce artifacts when using wedge monitors. So, excluding the saxophone data (which might be unreliable), apparently we did not go too far off: our value is just bigger than the Good threshold for wedge monitors and the Fair threshold for IEM. But we learned that an absolute latency threshold cannot really be drawn, as latency perception depends on many things at once. Also, we learned that defining the low latency requirement based on whether or not we can hear a delay is not very good, as latency can affect perception even when it is not heard as a delay.

On this note, no value appears to contradict the time resolution of the ear reported by Fastl and Zwicker (2\,\textrm{ms}). The time resolution is, in simple terms, the time gap between two stimuli that is just sufficient to make them perceptible as two different signals in time. It means that two identical signals closer together than the resolution cannot be perceived as separated in time. In other words, the smallest audible delay cannot be smaller than the resolution. The smallest recorded value for delay perception is 3\,\textrm{ms}, indeed bigger than the resolution. The values smaller than the resolution do not contradict it, as they are all related to the perception of artifacts other than delay.

We found that the usually quoted 20\,\textrm{ms} threshold for latency can indeed hold for a few instruments, but we also found confirmation that latencies much shorter than that (but also larger than that) can lead to audible artifacts.
