What Is a Microphone Array?
A microphone array consists of a group of microphones arranged in specific geometric patterns, typically linear or circular. These arrays perform space-time processing on sound signals collected from various spatial directions, enabling advanced functions such as noise suppression, reverberation removal, interference reduction, sound source localization, sound source tracking, and array gain. These capabilities significantly enhance the quality of speech signal processing and improve speech recognition accuracy in real-world environments.
Microphone arrays can be classified into different shapes, including linear, circular, and spherical, though they can also take on more complex forms like cross, planar, spiral, or irregular configurations. The number of microphones in an array can range from just two to several thousand, making them versatile but complex. While intricate arrays are primarily used in industrial and defense applications, simpler configurations are more common in consumer electronics due to cost considerations.
Why Do We Need Microphone Arrays, Especially Dual Array Microphones?
The growing popularity of microphone arrays in consumer devices is largely driven by the booming voice interaction market. These arrays are crucial for improving long-distance voice recognition, ensuring accuracy in real-world scenarios. As voice interaction technology moves from mobile phones to devices like Echo smart speakers or robots, the challenges faced by microphones change dramatically—comparable to the difference between whispering and shouting.
Smartphones, like those equipped with Siri, typically use a single microphone system. This setup works well in low-noise environments, with no reverberation, and when the sound source is very close. However, when the sound source is farther away, and there’s significant noise, multipath reflection, or reverberation, the quality of the captured signal deteriorates, severely impacting voice recognition accuracy. Single microphones struggle to achieve sound source localization and separation under these conditions. This is where microphone arrays come into play, offering a solution for these limitations.
That said, a microphone array alone isn’t enough to guarantee high voice recognition rates. While the array serves as the physical gateway, handling sound signal processing in the real world, the ultimate recognition rate depends on cloud-based processing. For optimal results, the physical microphone array and the cloud-based recognition system must work in harmony.
Moreover, the quality of the signal processed by the microphone array is critical. Modern speech recognition systems rely heavily on deep learning, which is constrained by the quality of its training data. If the processed sound doesn’t closely match the characteristics of the training samples, recognition accuracy can suffer. Interestingly, the goal isn’t to produce the purest signal possible, but rather one that closely mirrors the characteristics of the training data—even if that data is less than ideal.
Near-field vs Far-field Microphone Array
Microphone arrays can be categorized based on the distance between the sound source and the array itself, leading to two distinct sound field models: the near-field model and the far-field model.
In the near-field model, sound waves are treated as spherical waves. This is because sound waves, as a type of vibration wave, spread outward in all directions after being generated by a vibrating sound source, making them inherently spherical. The near-field model accounts for the amplitude differences in the signals received by each microphone in the array.
On the other hand, the far-field model simplifies the situation by treating sound waves as plane waves, ignoring amplitude differences across the microphones. Instead, it assumes that the relationship between the signals received by each microphone is purely a matter of time delay. This simplification makes the far-field model easier to process and is the basis for most general speech enhancement techniques.
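The contrast between the two models can be made concrete with a small sketch. It is illustrative only: the two-microphone geometry, the 30° arrival angle, the speed of sound, and the function names are all assumptions, not values from any particular product. The spherical model yields both a delay and a 1/r amplitude for each microphone, while the plane-wave model yields a delay only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 °C

def spherical_wave(mic_xs, distance_m, angle_deg, c=SPEED_OF_SOUND):
    """Near-field model: per-microphone (delay_s, relative_amplitude).

    Microphones sit on the x-axis at positions mic_xs (metres), with the array
    centre at the origin. The angle is measured from broadside (the y-axis).
    Delays and amplitudes are relative to what the array centre would receive.
    """
    theta = math.radians(angle_deg)
    sx, sy = distance_m * math.sin(theta), distance_m * math.cos(theta)
    result = []
    for x in mic_xs:
        r = math.hypot(sx - x, sy)  # exact source-to-microphone distance
        result.append(((r - distance_m) / c, distance_m / r))
    return result

def plane_wave_delays(mic_xs, angle_deg, c=SPEED_OF_SOUND):
    """Far-field model: delay only, no amplitude difference between microphones."""
    theta = math.radians(angle_deg)
    return [-x * math.sin(theta) / c for x in mic_xs]

mics = [-0.05, 0.05]  # hypothetical two-microphone array, 10 cm apart
for dist in (0.2, 3.0):
    near = spherical_wave(mics, dist, 30.0)
    amp_ratio = max(a for _, a in near) / min(a for _, a in near)
    print(dist, "m: amplitude ratio between microphones =", round(amp_ratio, 3))
```

With these assumed numbers, the amplitude ratio is clearly above 1 for the source at 0.2 m and very close to 1 at 3 m, which is why the far-field model can ignore amplitude differences.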
There isn’t an absolute rule to distinguish between near-field and far-field models. However, it is generally accepted that when the distance between the sound source and the central reference point of the microphone array is significantly greater than the signal wavelength, the far-field model applies. Conversely, if this distance is shorter, the near-field model is more appropriate.
For example, let d denote the array aperture (the overall length of a uniform linear array, which for a two-microphone array equals the spacing between the microphones) and let λ_min denote the wavelength of the highest-frequency sound from the source (the minimum wavelength). If the distance from the sound source to the center of the array is greater than 2d²/λ_min, the far-field model applies; otherwise, the near-field model applies, as illustrated in Figure 1.
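As a rough sketch, the 2d²/λ_min criterion can be evaluated in a few lines. The speed of sound (343 m/s), the 10 cm array dimension, and the 8 kHz upper frequency are assumed example values, and the function names are hypothetical.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 °C

def far_field_threshold(d_m, max_freq_hz, c=SPEED_OF_SOUND):
    """Return 2*d^2/lambda_min, the distance beyond which the far-field model is used."""
    lambda_min = c / max_freq_hz  # shortest wavelength corresponds to the highest frequency
    return 2.0 * d_m ** 2 / lambda_min

def sound_field_model(source_distance_m, d_m, max_freq_hz):
    """Classify a source as 'far-field' or 'near-field' using the criterion above."""
    threshold = far_field_threshold(d_m, max_freq_hz)
    return "far-field" if source_distance_m > threshold else "near-field"

# Hypothetical example: d = 10 cm, speech content up to 8 kHz.
print(round(far_field_threshold(0.10, 8000.0), 2), "m")  # about 0.47 m
print(sound_field_model(1.0, 0.10, 8000.0))
print(sound_field_model(0.3, 0.10, 8000.0))
```

Under these assumptions, a talker 1 m away falls under the far-field model, while a source 30 cm away is treated as near-field.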
Key Technologies of Microphone Array
Consumer-grade microphone arrays face a variety of challenges, including environmental noise, room reverberation, overlapping human voices, model noise, and array structure limitations. When used in speech recognition applications, additional optimization and alignment for speech recognition accuracy must be considered. To address these challenges, especially in specialized consumer applications, certain key technologies play a crucial role:
Noise Suppression
In speech recognition, it isn’t necessary to completely eliminate noise, unlike in call systems where full noise removal is often required. The noise in question here typically refers to environmental sounds like air conditioning noise, which lacks spatial directionality and has low energy levels. While this noise doesn’t overwhelm normal speech, it can reduce clarity and intelligibility. Noise suppression of this kind is not suited to environments with high noise levels, but it is adequate for everyday voice interactions.
Reverberation Elimination
Reverberation is a particularly troublesome factor in speech recognition, significantly impacting the system’s performance. After the sound source stops producing sound, the sound waves continue to reflect and get absorbed within a room, creating a mix of sound waves for a brief period—this is reverberation. Reverberation can severely affect speech signal processing, reducing the accuracy of direction finding and impairing functions like cross-correlation or beamforming.
Echo Cancellation
This is more accurately described as removing “self-noise” than echo: the voice interaction device picks up its own sound output. A true echo is an extension of reverberation with a longer delay. A delay of more than 100 milliseconds, for instance, makes a sound seem to repeat itself as a distinct echo. In this context, echo cancellation means eliminating the sounds emitted by the device itself, such as music or the voice of Alexa from an Echo speaker, so that only the user’s voice is recognized.
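The article does not name a specific algorithm. One common approach, shown here purely as an assumed illustration, is an adaptive filter such as normalized least mean squares (NLMS): because the device knows what it is playing, it can learn how that playback appears at the microphone and subtract the estimate. The function name, tap count, step size, and simulated echo path below are hypothetical.

```python
import random

def nlms_echo_cancel(mic, playback, num_taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated copy of the playback from the mic signal.

    Returns (residual, weights). The residual is what remains after removing the
    estimated device output; in real use it would contain the user's voice.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent playback samples, newest first
    residual = []
    for m, p in zip(mic, playback):
        history = [p] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = m - echo_estimate
        norm = sum(h * h for h in history) + eps  # normalisation keeps the step stable
        weights = [w + step * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights

# Simulated test signal: the speaker output reaches the mic through a short, made-up path.
rng = random.Random(0)
playback = [rng.gauss(0.0, 1.0) for _ in range(2000)]
mic = [0.6 * (playback[i - 1] if i >= 1 else 0.0) +
       0.3 * (playback[i - 2] if i >= 2 else 0.0) for i in range(2000)]
residual, weights = nlms_echo_cancel(mic, playback)
print([round(w, 3) for w in weights[:3]])  # learned path should approach 0, 0.6, 0.3
```

This simulation has no near-end speech, so the residual decays toward zero. Production systems typically add double-talk handling, which is outside the scope of this sketch.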
Sound Source Direction Finding
Consumer-grade microphone arrays are designed primarily for direction finding rather than for full sound source positioning, which is more complex. Direction finding means detecting the direction of the person speaking, which is essential for subsequent beamforming. It can be achieved using energy methods, spectrum estimation, or Time Difference of Arrival (TDOA) technology, and is typically performed during the voice wake-up phase.
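To illustrate the TDOA idea, the sketch below estimates the delay between two microphones with a plain time-domain cross-correlation and converts it to an arrival angle under the far-field model. It is a simplified stand-in rather than any vendor's implementation: practical systems often use frequency-domain variants, and the sample rate, microphone spacing, and function names are assumptions.

```python
import math
import random

def estimate_tdoa_samples(x_ref, x_other, max_lag):
    """Return the lag (in samples) at which x_other best matches x_ref.

    A positive lag means the sound reaches x_other later than x_ref.
    """
    n = len(x_ref)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += x_ref[i] * x_other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def tdoa_to_angle_deg(lag_samples, sample_rate_hz, spacing_m, c=343.0):
    """Convert a delay into an arrival angle measured from broadside (far-field model)."""
    sin_theta = c * lag_samples / (sample_rate_hz * spacing_m)
    sin_theta = max(-1.0, min(1.0, sin_theta))  # guard against rounding past +/-1
    return math.degrees(math.asin(sin_theta))

# Simulated example: the second microphone hears the same signal 3 samples later.
rng = random.Random(0)
ref = [rng.gauss(0.0, 1.0) for _ in range(400)]
other = [ref[i - 3] if i >= 3 else 0.0 for i in range(400)]
lag = estimate_tdoa_samples(ref, other, max_lag=10)
print(lag, round(tdoa_to_angle_deg(lag, 16000, 0.10), 1))  # lag 3, roughly 40 degrees
```

The angle resolution is limited by the sample rate and the microphone spacing, since the delay is only resolved to whole samples here.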
Beamforming
Beamforming is a common signal processing technique that involves manipulating the output signals of each microphone in the array—through weighting, delay, summation, and other methods—to create spatial directivity. This technique is used to suppress sound interference outside the main lobe, including human voices. For instance, when multiple people are speaking around an Echo device, beamforming allows it to focus on and recognize the voice of a single person.
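The simplest form of this weighting, delay, and summation is the delay-and-sum beamformer, sketched below under simplifying assumptions: delays are whole numbers of samples, all weights are equal, and the steering delays are already known (for example, from a direction-finding step). The function name and test signal are hypothetical.

```python
import math

def delay_and_sum(channels, delays_samples):
    """Align each channel by its integer delay (in samples) and average the result.

    delays_samples[m] is how many samples later channel m receives the target
    sound compared with the reference channel.
    """
    n = len(channels[0])
    out = [0.0] * n
    for ch, delay in zip(channels, delays_samples):
        for i in range(n):
            j = i + delay  # where this channel holds what the reference has at i
            if 0 <= j < n:
                out[i] += ch[j]
    return [v / len(channels) for v in out]

# Simulated example: three microphones, each hearing the target 2 samples later than the last.
target = [math.sin(0.3 * i) for i in range(100)]
channels = [[target[i - 2 * m] if i - 2 * m >= 0 else 0.0 for i in range(100)]
            for m in range(3)]
steered = delay_and_sum(channels, [0, 2, 4])    # steered at the target
unsteered = delay_and_sum(channels, [0, 0, 0])  # no alignment
print(round(max(abs(v) for v in steered[:90]), 3),
      round(max(abs(v) for v in unsteered[10:90]), 3))
```

When the delays match the target direction, the channels add coherently and the target passes through unchanged. Sound from other directions stays misaligned and is partially cancelled, which is the source of the spatial directivity.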
Array Gain
This concept addresses the issue of pickup distance. If the captured signal is too weak, it can undermine speech recognition accuracy. Array gain involves enhancing the energy of the speech signal through array processing to ensure it is strong enough for reliable recognition.
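A small simulation can make array gain tangible. It rests on idealized assumptions that real rooms only approximate: the speech is perfectly aligned across microphones, and each microphone adds independent noise of equal power. Under those assumptions, averaging N channels improves the signal-to-noise ratio by about 10·log10(N) dB. All names and values below are hypothetical.

```python
import math
import random

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from separate signal and noise sequences."""
    ps = sum(s * s for s in signal) / len(signal)
    pn = sum(v * v for v in noise) / len(noise)
    return 10.0 * math.log10(ps / pn)

def simulate_array_gain_db(num_mics, num_samples=20000, seed=0):
    """Measured SNR improvement of an N-microphone average over a single microphone."""
    rng = random.Random(seed)
    signal = [math.sin(2.0 * math.pi * 0.01 * i) for i in range(num_samples)]
    noises = [[rng.gauss(0.0, 1.0) for _ in range(num_samples)]
              for _ in range(num_mics)]
    # Averaging aligned channels leaves the signal unchanged and averages the noise.
    noise_out = [sum(ch[i] for ch in noises) / num_mics for i in range(num_samples)]
    return snr_db(signal, noise_out) - snr_db(signal, noises[0])

for n in (2, 4, 8):
    print(n, "mics: simulated", round(simulate_array_gain_db(n), 1),
          "dB, ideal", round(10.0 * math.log10(n), 1), "dB")
```

Diffuse room noise is partly correlated between closely spaced microphones, so gains in practice are usually lower than this ideal figure.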
Model Matching
This involves aligning the microphone array with speech recognition and semantic understanding models. Voice interaction is a complete signal chain that starts with the microphone array, and the model must be matched throughout the process. Effective microphone arrays designed for voice interaction typically use two sets of algorithms: one embedded in the hardware for real-time processing and another for cloud-based voice processing.
Future Trends in Microphone Arrays
Miniaturization
The trend toward smaller microphone arrays is gaining momentum. While many products currently use two microphones, this choice is often driven by industrial design considerations rather than cost. Microphone arrays can be made more compact, and compact designs have already been proven effective. It’s only a matter of time before they are widely adopted in consumer electronics.
Low Cost
The high cost of microphone arrays, whether they have 2, 4, or 6 microphones, remains a barrier to widespread adoption. Reducing these costs isn’t simply about substituting cheaper components; it requires a complete redesign of the entire system, including the devices, chips, algorithms, and cloud infrastructure.
It’s important to note that even a 2-microphone array is not particularly cheap. In fact, the cost difference between 2- and 4-microphone arrays is minimal, although this comparison doesn’t account for the additional hardware required for echo cancellation. When considering the overall system, the cost differences between these configurations are not as significant as one might expect.
Processing and Recognizing Multiple Voices
The “cocktail party effect” refers to the human ability to focus on a single conversation in a noisy environment, even when multiple people are speaking simultaneously. Current microphone array and speech recognition technologies are still primarily designed for single-speaker scenarios. Achieving reliable multi-speaker recognition is a challenging goal that remains on the horizon, but it represents a significant area for future development in voice technology.
How to Choose the Right Microphone Array?
Choosing the right microphone array for your product involves understanding the balance between hardware solutions, algorithmic optimization, and cloud recognition capabilities. While the hardware for microphone arrays is fairly advanced, the front-end algorithms and cloud recognition are still evolving. The specific algorithmic approaches vary by company, with some solutions allowing users to select the central microphone independently, which is beneficial for design flexibility.
Microphone arrays with more than two microphones are generally organized in linear or ring structures, while 2-microphone arrays typically come in Broadside or Endfire configurations. With these options available, how should manufacturers decide which solution is best? The answer lies in product positioning and the intended user scenarios.
For Cost-Effective Solutions
If your product aims to be budget-friendly, there’s often no need for a complex microphone array. A single microphone, coupled with the right algorithms, can still achieve noise suppression and echo cancellation, ensuring adequate voice recognition in near-field environments at a much lower cost.
For Improved Noise Reduction
If your product requires better noise reduction, a 2-microphone solution might be more suitable. This configuration simplifies design and can effectively reduce noise within a certain range during calls. However, it doesn’t offer a significant improvement in voice recognition compared to a single microphone, and the cost is relatively high. Additionally, when factoring in the necessary echo cancellation features for voice interaction, costs can escalate further.
One major drawback of the 2-microphone solution is its limited ability to locate sound sources, making it more suitable for mobile phones and headphones where the focus is on call noise reduction. Its behavior can also be approximated by a single directional microphone, which works much like the Endfire configuration of a 2-microphone array. However, this approach requires dual openings in the design, which can complicate the industrial design process.
For Versatile and High-Performance Products
If your product needs to handle more diverse user scenarios, a microphone array with four or more microphones is recommended. For example, Amazon Echo uses a configuration with more than six microphones to enhance voice recognition and noise handling. Robots generally perform well with four microphones, while speakers may benefit from six or more. In automotive applications, distributed arrays or other specialized structures may be the best choice.
Final Words: Why Dual Microphone Arrays Are Leading the Smart Home Revolution
While the tech world buzzes with advancements in multi-microphone arrays, the dual microphone solution has quietly become the workhorse of the smart home appliance control field. Based on extensive experience at Dusun IoT, it’s clear that dual microphones are not just a compromise—they’re the optimal choice for many smart home applications.
Since 2012, the home appliance industry has sought to seamlessly integrate voice interaction technology into everyday products. The key requirements are straightforward yet challenging: enable direct voice control unaffected by the appliance’s own noise, achieve reliable far-field voice interaction, and ensure the solution is both mature and cost-effective. Far-field voice interaction, in particular, stands out as the critical factor.
Although many might think more microphones equal better performance, reality paints a different picture. While an eight-microphone array might offer higher voice recognition accuracy, it also introduces a host of challenges—higher costs, more complex structures, and greater difficulties in production and installation. Moreover, for appliances like air conditioners and TVs, which are typically placed against walls, the extra microphones add little practical value.
In contrast, the dual microphone array shines in these scenarios. With its straightforward design, lower cost, easier implementation, and lower power consumption, it’s no surprise that dual microphones are poised to become the standard in smart home products. As we look to the future, it’s clear that simplicity and efficiency will continue to drive innovation in the smart home space.
FAQs on Microphone Arrays
What are the types of microphone arrays?
Linear Microphone Array
A linear microphone array has its elements aligned along a single straight line. There are two main types:
Uniform Linear Array (ULA): In a ULA, the spacing between adjacent microphones is constant, and the microphones are matched in phase and sensitivity. It is the simplest and most common array topology.
Nested Linear Array: This type is essentially a combination of multiple ULAs, stacked or nested together. It’s a specialized form of a non-uniform array, providing flexibility while retaining some of the simplicity of the ULA.
Because all of their elements lie on one line, linear arrays of either type can capture only the horizontal azimuth information of the sound signal.
Planar Microphone Array
A planar microphone array has its elements arranged across a flat surface, rather than a straight line. Depending on the geometric pattern, planar arrays can be classified into several subtypes, including:
Equilateral Triangle Array
T-Array
Uniform Circular Array
Uniform Square Array
Coaxial Circular Array
Circular or Rectangular Array
Planar arrays are advantageous because they can capture both the horizontal and vertical azimuth information of a sound signal, providing more comprehensive spatial information compared to linear arrays.
Three-Dimensional (Stereo) Microphone Array
Three-dimensional microphone arrays, sometimes called stereo arrays, extend into three-dimensional space, with their elements arranged in various 3D geometric shapes. Common configurations include:
Tetrahedron Array
Cube Array
Cuboid Array
Spherical Array
Three-dimensional arrays offer the most complete spatial information. They can detect the horizontal and vertical azimuth, as well as the distance between the sound source and the reference point within the array, making them ideal for applications requiring precise 3D sound localization.
What is a beamforming microphone?
Beamforming is a technique used in microphone arrays to focus on sound from a specific direction. It works by:
Delaying and Phase Compensating: Adjusting the timing and phase of each microphone’s signal so that sound arriving from the chosen direction is aligned across all channels.
Amplitude Weighting: Assigning different weights to each microphone’s signal to enhance the desired direction while suppressing noise from other directions.
Key Beam Pattern Parameters:
3 dB Beamwidth: The angular width of the main beam, measured between the points where the gain falls 3 decibels below its maximum.
Distance to the First Null: The angular distance from the peak of the main beam to the first direction in which the gain drops to zero.
First Sidelobe Height: The height of the first secondary peak outside the main beam, relative to the peak of the main beam.
Sidelobe Attenuation Rate: How quickly the sidelobe peaks decay with increasing angular distance from the main beam.
The power pattern, which is the squared magnitude of the amplitude pattern, is used to measure overall performance. Beamforming microphones are ideal for applications needing precise sound directionality and noise suppression.
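The beam pattern parameters listed above can be explored numerically. The sketch below computes the power pattern of a uniform linear array with equal weights (a plain delay-and-sum beamformer) at a single frequency, then scans it for the width of the main beam and the first null. The 4-microphone, 4 cm, 4 kHz configuration and the function names are assumptions chosen only for illustration.

```python
import cmath
import math

def ula_beam_pattern_db(num_mics, spacing_m, freq_hz, angle_deg,
                        steer_deg=0.0, c=343.0):
    """Power pattern in dB of an equally weighted ULA for one arrival angle.

    Angles are measured from broadside. 0 dB is the gain in the steering direction.
    """
    k = 2.0 * math.pi * freq_hz / c  # wavenumber
    diff = math.sin(math.radians(angle_deg)) - math.sin(math.radians(steer_deg))
    response = sum(cmath.exp(1j * k * spacing_m * m * diff)
                   for m in range(num_mics)) / num_mics
    power = abs(response) ** 2  # power pattern = squared magnitude
    return 10.0 * math.log10(max(power, 1e-12))

def main_beam_width_deg(num_mics, spacing_m, freq_hz, step_deg=0.01):
    """Full width of the broadside main beam between its -3 dB points."""
    angle = 0.0
    while angle < 90.0 and ula_beam_pattern_db(num_mics, spacing_m, freq_hz, angle) > -3.0:
        angle += step_deg
    return 2.0 * angle

def first_null_deg(num_mics, spacing_m, freq_hz, step_deg=0.01):
    """Angle of the first null: the first point where the gain starts rising again."""
    prev = ula_beam_pattern_db(num_mics, spacing_m, freq_hz, 0.0)
    angle = step_deg
    while angle < 90.0:
        cur = ula_beam_pattern_db(num_mics, spacing_m, freq_hz, angle)
        if cur > prev:
            return angle - step_deg
        prev = cur
        angle += step_deg
    return None  # no null inside the visible range

print(round(main_beam_width_deg(4, 0.04, 4000.0), 1), "degrees wide main beam")
print(round(first_null_deg(4, 0.04, 4000.0), 1), "degrees to the first null")
```

Rerunning the scan with more microphones or a higher frequency narrows the main beam, which is one way to see why larger arrays offer sharper directivity.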
What is the difference between a multi-microphone array and a dual-microphone array?
Different Costs
Dual-microphone arrays cost much less than multi-microphone arrays. Beyond the obvious difference in the number of microphones, a multi-microphone array needs extra hardware circuitry to support the additional channels and extra computing power to process the additional signal data, and both widen the cost gap.
Technical Differences
Although dual-microphone and multi-microphone arrays use similar technologies, their algorithms differ significantly. The more microphones there are, the easier it is to achieve good noise reduction and voice enhancement, so reaching the same or similar results with a dual-microphone array is technically more demanding. Because of its lower cost, however, the dual-microphone array is more widely used.
Voice Positioning and Recognition
With sufficiently well-optimized algorithms, a dual microphone array can achieve almost the same noise reduction and voice enhancement as a multi-microphone array in a home environment at distances of 3 to 5 meters. One disadvantage is that dual microphones can locate a sound source only within a 180° range, while a circular microphone array can do so across the full 360°. This difference does not matter for devices that are placed against a wall, such as air conditioners and TVs. For products like robots that sit in the center of a room and need to locate the speaker, however, a multi-microphone solution is required.
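The 180° limit follows from geometry, as the sketch below suggests. It assumes a far-field source and two microphones on a common axis, with a hypothetical 6 cm spacing. The only quantity the pair can measure is the arrival delay, and two mirror-image directions on either side of the microphone axis produce exactly the same delay.

```python
import math

def two_mic_delay_s(azimuth_deg, spacing_m=0.06, c=343.0):
    """Far-field delay between two microphones for a source at the given azimuth.

    Azimuth is measured in the horizontal plane, with 0° along the microphone axis.
    """
    return spacing_m * math.cos(math.radians(azimuth_deg)) / c

def azimuth_candidates_deg(delay_s, spacing_m=0.06, c=343.0):
    """Return both directions consistent with a measured delay."""
    cos_az = max(-1.0, min(1.0, c * delay_s / spacing_m))
    az = math.degrees(math.acos(cos_az))
    return az, 360.0 - az  # mirror images across the microphone axis

# A source at 60° and its mirror image at 300° are indistinguishable to the pair.
print(two_mic_delay_s(60.0), two_mic_delay_s(300.0))
print(azimuth_candidates_deg(two_mic_delay_s(60.0)))
```

For a device mounted against a wall, one of the two candidates lies behind the wall and can be discarded, which is why the limitation matters little for such products.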
Implementation
Finally, from the perspective of the final product form, the dual-microphone solution is simpler and easier to implement. The biggest problem with multi-microphone arrays is that, whether linear arrays or circular arrays, they have extremely strict requirements on the appearance and structural design of the product, because the microphones must be evenly distributed in space. Dual microphones obviously do not have to consider these factors.
What microphone array should robots or AIoT products use?
For robots or AIoT products, the choice of microphone array depends on the application requirements:
Robots: Robots require precise sound source localization, so a circular multi-microphone array is typically used. This type of array provides 360° sound source positioning, which is essential for accurate localization and interaction.
AIoT Products: The choice can be more flexible. Dual microphones are simpler to integrate and faster to implement, which suits quick deployment and varied design forms. Multi-microphone arrays can also be used if high-precision sound source localization is needed, but they are generally more complex and costly.
Overall, while multi-microphone arrays are ideal for precise localization, dual microphones are often preferred for their ease of implementation and versatility in building AIoT ecosystems.