Perceptual coding
Perceptual Audio Coding
Perceptual audio coding is a method of encoding audio that uses psychoacoustic models to discard data that human listeners cannot perceive.
- Why do we want to do this? What are the advantages?
- As digital audio has become increasingly popular, delivering high-quality audio at low bit rates has become a focus of research and development, and perceptual audio coding is the answer. Essentially, the goal of perceptual audio coding is to remove the sounds and frequencies that the human ear cannot perceive. This allows for smaller file sizes, since less data has to be reproduced. Additionally, perceptual coding can improve digital audio through more efficient bit allocation.
- How is this achieved?
- In most cases this is achieved by using one of two types of audio coding scheme, referred to as lossless and lossy coding. Lossless coding requires larger files and is generally found on media such as the Digital Versatile Disc (DVD). Lossy coding reproduces a sound further from the original recording, but has the advantage of producing a smaller file. In either case, the coding scheme takes advantage of some sounds in a recording being drowned out, or masked, by others, and avoids encoding those masked sounds. Several standard coding schemes are used today to achieve this, such as the MPEG audio layers 1-3 and AAC. The most popular is MPEG-1 Audio Layer 3, better known as MP3, which is a lossy coding scheme. Schemes like this operate on audio represented with Pulse Code Modulation, which is discussed next.
Pulse Code Modulation
Basically, pulse code modulation (PCM) is a form of digital modulation used to transmit analog data across a digital stream. The analog signal is represented by a series of pulses and non-pulses (1 and 0 respectively, i.e. binary code), no matter how complex the analog waveform happens to be. The waveform must first be converted to a pulse-amplitude modulated (PAM) signal by regular sampling of its amplitude; the value of each amplitude is then converted into a binary code for transmission on the carrier wave.
PCM is so widely used because a stream of pulses and non-pulses, of 1s and 0s, is not easily affected by interference and noise: even in a noisy channel, the presence or absence of a pulse can be determined reliably. More generally, since PCM is digital, the signal can be processed with cheap, standard digital techniques, which makes it easier to implement complicated communication systems such as telephone networks.
Using PCM, it is possible to digitize all forms of analog data, including full-motion video, voices, music, telemetry, and virtual reality. PCM is the primary way analog waves are converted into digital form for voice conversations as well as music. Codecs such as MP3 and AAC, which compress the digital data further, apply algorithms to the PCM samples to eliminate overlapping frequencies as well as sounds deemed inaudible to the human ear. PCM is also a standard in digital video; however, uncompressed PCM is not typically used in consumer applications such as DVD or DVR because it would take too much bandwidth, so compressed forms of digital audio are normally employed instead.
The PCM process uses filtering, sampling, quantizing, and encoding. The filtering stage removes frequencies above the highest signal frequency; if not removed, these frequencies may cause aliasing problems during the sampling stage. Sampling a waveform means determining the instantaneous amplitudes of the signal at fixed intervals.
Earlier we described PCM as the process of converting a signal from one form (analog) to the other (digital). Sampling is the first part of how this conversion happens. Sampling is the reduction of a continuous signal to a discrete signal; a common example is the conversion of a sound wave (a continuous-time signal) to a sequence of samples (a discrete-time signal). Audio waveforms are commonly sampled at 44.1k samples/s (CD) or 48k samples/s (professional audio). This is sufficient for practically any purpose, since the human auditory system can only discern sounds up to about 15-20 kHz.
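The sampling step can be sketched in Python (a toy illustration with illustrative parameter values, not production audio code):

```python
import math

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Reduce a continuous sine wave to a discrete signal by reading
    its instantaneous amplitude at fixed intervals of 1/sample_rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

# A 1 kHz tone sampled at the CD rate of 44.1 kHz. By the Nyquist
# criterion this rate can represent components up to 22.05 kHz,
# comfortably above the ~20 kHz limit of human hearing.
samples = sample_sine(1000, 44100, 5)
```

Each entry of `samples` is one instantaneous amplitude; the remaining PCM stages quantize and encode these values.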
Quantization is the process of mapping the infinite range of amplitudes taken by the samples of the analog signal onto a finite set of discrete levels.
We now move on to the last process, encoding. During encoding, a signal (such as a bitstream) or data is changed into a code. The code may serve any of a number of purposes, such as compressing information for transmission or storage, encrypting or adding redundancy to the input, or translating from one code to another. This is usually done by means of a programmed algorithm, especially if any part of the system is digital, while most analog encoding is done with analog circuitry. During the encoding stage of PCM, each quantization level is assigned a number, starting with zero at the lowest level; these assigned numbers are then expressed in binary form.
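A minimal sketch of the quantizing and encoding steps described above, assuming samples are already normalized to the range -1.0 to 1.0 and using a uniform 4-bit quantizer (real systems use far more levels, e.g. 65,536 for 16-bit audio):

```python
def quantize_and_encode(samples, bits):
    """Quantize each sample in [-1.0, 1.0] to one of 2**bits uniform
    levels, then encode: the levels are numbered starting with zero at
    the lowest, and each number is expressed in binary form."""
    levels = 2 ** bits
    codes = []
    for s in samples:
        # Clamp, then find the step level this amplitude falls into.
        s = max(-1.0, min(s, 1.0 - 1e-9))
        index = int((s + 1.0) / 2.0 * levels)  # 0 .. levels-1
        codes.append(format(index, '0{}b'.format(bits)))
    return codes

codes = quantize_and_encode([-1.0, 0.0, 0.999], 4)  # ['0000', '1000', '1111']
```

The lowest level maps to all zeros and the highest to all ones, matching the numbering scheme described in the text.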
Masking
Critical Bands
- A critical band is measured with a masking experiment in which a target sound is covered by a band of noise until the target can no longer be heard. There are two methods of measuring critical bands.
- I. The first method starts by centering the band of noise frequency onto the target sound. The masker intensity is increased until the target can no longer be heard. This point is then recorded as the masked threshold.
- x. Masked Threshold - The level at which a previously indistinguishable signal of interest becomes distinguishable from other signals or noise.
- II. The second method involves slowly widening the noise bandwidth without adding energy to the original band. The bandwidth eventually reaches a point beyond which widening it further produces no additional masking.
- This bandwidth is defined as the critical band. Even when the masker is extended to full-bandwidth white noise, the effect remains the same as within the critical band. Critical bands grow wider as frequency increases; because they are narrower at low frequencies, more critical bands fit into the lower frequency ranges.
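The growth of critical-band width with frequency is often approximated with Zwicker's Bark-scale formula, where one whole Bark spans one critical band. The sketch below uses that standard approximation (the specific frequencies are just illustrative choices):

```python
import math

def hz_to_bark(f_hz):
    """Zwicker's approximation mapping frequency in Hz to critical-band
    number (Bark); each whole Bark spans one critical band."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

# Bands are narrow at low frequencies and wide at high ones: the
# 100 Hz step from 100 Hz to 200 Hz covers about one full Bark,
# while the 3500 Hz step from 12 kHz to 15.5 kHz covers less.
low_span = hz_to_bark(200) - hz_to_bark(100)
high_span = hz_to_bark(15500) - hz_to_bark(12000)
```

This is why, as the text says, far more critical bands fit into the lower frequency ranges than the upper ones.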
Simultaneous Masking
- Simultaneous masking occurs when some sounds vanish in the presence of other sounds with certain characteristics. Simultaneous masking is analyzed in terms of critical bands.
- a. Spreading Functions
- A spreading function SF(z, a), where z is the frequency and a is the amplitude, describes how masking spreads from the masker's critical band to neighboring bands.
- b. Tonality
- The tonality of a masker affects the quality and efficiency of masking.
- Spreading functions and tonality together are used to create the masking threshold.
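One widely cited approximation of a spreading function is Schroeder's, which, unlike the more general amplitude-dependent SF(z, a) above, depends only on the Bark distance between masker and maskee. A sketch:

```python
import math

def spreading_db(dz_bark):
    """Schroeder's level-independent spreading function: attenuation in
    dB (relative to the masker's peak) at a distance of dz_bark critical
    bands; positive dz means the maskee lies above the masker."""
    return (15.81 + 7.5 * (dz_bark + 0.474)
            - 17.5 * math.sqrt(1.0 + (dz_bark + 0.474) ** 2))

# Masking is strongest in the masker's own band and spreads more
# readily upward in frequency than downward.
```

An encoder evaluates a curve like this around each masker and combines the results with a tonality estimate to build the masking threshold.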
Temporal Masking
- A characteristic of the auditory system in which sounds are hidden by maskers that have just ended, or by maskers that are about to begin.
- a. Post-Masking
- The masking effect that persists after a strong sound ends, lasting up to 200 milliseconds.
- The post-masking effect can vary based on the length of the masker burst in milliseconds.
- b. Pre-Masking
- A sound masked by something appearing after it, usually lasting 20 milliseconds or less.
- Pre-masking can hide the effects of pre-echoes, in which coding noise is produced before a transient. By shortening the signal block around such a sound, an encoder allows pre-masking to effectively cover the noise before the transient.
- I. Transient
- A short-duration signal that represents the nonharmonic attack phase of a musical sound or spoken word.
Lossless and Lossy Compression
Coding algorithms can be broken down into two categories, lossless and lossy.
The loss that both of these names refer to is loss of waveform information. Both kinds of algorithm reduce the amount of information used to describe the audio data and reduce information redundancy. Encoding audio with a lossy algorithm results in a highly compressed file compared to the original; the downside of such high compression is the loss of reconstructable information. Lossy compression is achieved by coding information that the codec deems perceptually irrelevant with less accuracy, or by not coding it at all. Lossless audio compression, by contrast, can be decoded without losing any information in the process, which limits the amount of compression the encoding can achieve. Audio, whether captured naturally or generated synthetically, is too unpredictable for the types of algorithms used to code other kinds of media. Lossless encoding therefore uses a mathematical process to whiten the audio signal during encoding, resulting in a smaller file. Losslessly encoded audio files can not only be decoded into an exact reproduction of the original file, but can also be transcoded to other lossless formats without degradation.
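The whitening idea can be illustrated with the simplest possible predictor, which guesses that each sample equals the previous one and stores only the residual. Real lossless codecs such as FLAC use higher-order predictors plus entropy coding, but the principle, smaller residuals that decode back exactly, is the same:

```python
def encode_residuals(samples):
    """First-order prediction: store each sample's difference from the
    previous one. For smooth audio the residuals are much smaller than
    the raw samples, making the whitened stream cheaper to entropy-code."""
    prev, residuals = 0, []
    for s in samples:
        residuals.append(s - prev)
        prev = s
    return residuals

def decode_residuals(residuals):
    """Invert the prediction exactly: lossless reconstruction."""
    prev, samples = 0, []
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples

pcm = [0, 3, 7, 12, 14, 13, 10]
res = encode_residuals(pcm)          # [0, 3, 4, 5, 2, -1, -3]
assert decode_residuals(res) == pcm  # exact reproduction of the original
```

Because decoding inverts encoding exactly, the same residual stream could also be transcoded into any other lossless representation without degradation.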
External links
- Masking
- Hydrogenaudio - Lossless
- Firstpr - Lossless
- George - Critical Bands
- Minidisc - Masking
- Bosse Lincoln - Masking
- Hydrogenaudio - Hardware / Software List
- Perceptual Audio Coding
- Audio Codec
- PCM charts and sound samples
- AudioNoise Software Noise Generator