In this article I want to talk about my long-time hobby: studying and working with far-field microphones, better known as microphone arrays.
This article will be of interest to those who are passionate about building their own voice assistants, to people who see engineering as an art, and to those who want to try their hand at being Q (the one from the Bond films). My humble tale will hopefully help you understand why a smart-speaker assistant built strictly by the tutorial works well only when there is no noise at all, and so badly where there is, such as in the kitchen.
Many years ago I got into programming: I started writing code simply because my wise teachers only allowed me to play games I had written myself. That was around 1987, on a Yamaha MSX. It was also the time of my first startup. All strictly according to the proverb: "Choose a job you love, and you will never have to work a day in your life" (Confucius).
And so the years have passed and I am still writing code. Even my hobby involves code: apart from skating, my way of warming up the brain and keeping the math from rusting is working with far-field microphones (mic arrays). I don't know whether my teachers wasted their time on me.
What it is and where it is used
A voice assistant that listens to you usually carries an array of microphones. We also find them in video-conferencing systems: in a group conversation the lion's share of attention goes to speech, we naturally don't stare at the speaker the whole time, and having to talk straight into a microphone or wear a headset is constraining and uncomfortable.
Almost every self-respecting cell-phone manufacturer puts two or more microphones into its creations (yes, there really are microphones behind those little holes at the top, bottom, and back). The iPhone 3G/3GS had only one, the fourth-generation iPhone had two, and the fifth had three microphones. That, too, is a microphone array, and it is all there for better sound pickup.
But back to our voice assistants.
So how do you increase your hearing range?
"We need big ears"
A simple idea: if one microphone is enough to hear whoever is nearby, then to hear from afar you need a more expensive microphone with a reflector, similar to the ears of a fennec fox:
It's not a part of a fursuit, but a serious device for hunters and scouts.
The same, only built on resonator tubes
(Taken from https://forum.guns.ru )
Mirror diameter from 200 mm to 1.5 m
(see more of this http://elektronicspy.narod.ru/next.html )
"We need more microphones"
Or maybe, if you put up a lot of cheap microphones, quantity will turn into quality and everything will work out? A Zerg rush, only with microphones.
Strangely enough, in real life it works. True, with a lot of math, but it works. We'll get to that in the next section.
And how do you learn to hear farther without beautiful horns?
One of the problems with horn systems is that you only hear what is in focus. If you need to hear something from another direction, you have to do a "feint with your ears" and physically point the system the other way.
And somehow the signal-to-noise ratio of microphone-array systems is better than that of a conventional microphone.
With microphone arrays, as with their closest relatives, phased antenna arrays, you don't need to rotate anything; see the section on beamforming for details. It is easy to see:
The unfocused microphone (left picture) records all sounds from all directions, not just the one you want.
So where does the greater range come from? In the right picture the microphone listens attentively to a single source. Being focused, it picks up the signal of the selected source only, rather than a mush of every possible noise source, and that clean signal is simply amplified (made louder) without resorting to complex noise-cancelling techniques. Like a horn, but powered by math.
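To see where that gain comes from, here is a minimal toy sketch of my own (not code from this project): if N microphones hear the same time-aligned signal plus independent noise, summing them grows the signal amplitude by N but the noise amplitude only by √N, for roughly 10·log10(N) dB of SNR improvement.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics, n_samples = 8, 48000
t = np.arange(n_samples) / 48000.0
signal = np.sin(2 * np.pi * 440.0 * t)   # the source, already time-aligned at every mic

# each mic hears the same aligned signal plus its own independent noise
mics = signal + rng.normal(0.0, 1.0, size=(n_mics, n_samples))

def snr_db(x, s):
    noise = x - s
    return 10 * np.log10(np.mean(s**2) / np.mean(noise**2))

single = snr_db(mics[0], signal)
summed = snr_db(mics.mean(axis=0), signal)   # delay-and-sum with zero delays

print(f"one mic: {single:5.1f} dB, eight mics: {summed:5.1f} dB "
      f"(theory predicts +{10 * np.log10(n_mics):.1f} dB)")
```

With eight microphones the measured gain comes out close to the theoretical 9 dB; real arrays fall short of this because noise between nearby mics is never fully uncorrelated.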
What’s wrong with noise cancellation?
Heavy noise reduction has plenty of downsides. Part of the signal disappears along with the noise, the sound changes, and to the ear this shows up as the characteristic coloring of noise-suppressed audio; as a result, speech becomes unintelligible. This loss is especially noticeable to Russian speakers, who need to hear the other party's hissing sibilants. On top of that, noise reduction strips out the identifying cues that connect the listener with the speaker: breathing, sniffing, and the other small noises that accompany live speech. In conversation you hear all of this, and it helps you judge the state of your interlocutor and their attitude toward you; hearing the voice without those noises feels unpleasant and lowers perception, understanding, and identification. And if a voice assistant is doing the listening, noise reduction makes it harder to recognize both the key phrase and the speech that follows. True, there is a loophole: train the recognizer on samples recorded with the distortions of the very noise reduction you use.
Those familiar with the phrase cocktail party problem may step out for coffee or a cocktail and run a field experiment; those in the mood to read, carry on.
Briefly, the math it all runs on:
DOA Estimation and beamforming
DOA (direction-of-arrival estimation and, if possible, localization of the source):
I will be brief, for the topic is vast; it is done with white, gray, or dark magic (depending on your preferred IDE theme) and math.
The most common way to estimate DOA is to analyze correlations (and assorted other statistics) between pairs of microphones, usually diametrically opposed ones.
Lifehack: for research, it is better to choose an array with a circular microphone layout. The benefit is that it is easy to gather statistics from pairs with different inter-microphone distances, from the maximum along a diameter down to the minimum between neighbors if you take pairs along chords, and with different azimuths (directions) to the source.
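As an illustration of the pair-correlation idea (a sketch of my own, not this project's code): the delay between two microphones shows up as the peak of their cross-correlation, and with the pair spacing d you can turn that delay into an angle via arcsin(τ·c/d).

```python
import numpy as np

fs = 48000
c = 343.0      # speed of sound, m/s
d = 0.1        # spacing of the mic pair, m

rng = np.random.default_rng(0)
src = rng.normal(size=4096)        # a broadband source correlates best

true_delay = 7                     # samples: the source sits off to one side
mic_a = src
mic_b = np.roll(src, true_delay)   # mic_b hears the same thing 7 samples later

# plain cross-correlation; GCC-PHAT would additionally whiten the spectrum
corr = np.correlate(mic_b, mic_a, mode="full")
lag = np.argmax(corr) - (len(src) - 1)   # lag (in samples) at the peak

tau = lag / fs
angle = np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
print(f"estimated lag: {lag} samples, angle of arrival: {angle:.1f} deg")
```

With these numbers the 7-sample lag maps to roughly a 30° arrival angle; a circular array simply gives you many such pairs, with different d and different orientations, to average over.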
Beamforming – the simplest and easiest method to understand is delay-and-sum (DAS and FDAS): beamforming based on delaying and summing the signals.
For the visually inclined:
Lifehack: do not forget that wavelength varies with frequency, so compute a separate phase shift for each frequency.
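To make that lifehack concrete, here is a frequency-domain delay-and-sum sketch of my own (assuming a uniform linear array, which is my invention for the example): every FFT bin gets its own steering phase 2π·f·τ, so all frequencies are shifted by the same time rather than the same phase.

```python
import numpy as np

fs = 16000
c = 343.0                  # speed of sound, m/s
n_mics = 4
spacing = 0.05             # uniform linear array, 5 cm between mics
theta = np.radians(30)     # direction we steer the beam toward

# plane-wave arrival delay at each mic for angle theta
delays = np.arange(n_mics) * spacing * np.sin(theta) / c

def das_freq(frames):
    """frames: (n_mics, n_samples) -> beamformed mono signal."""
    n = frames.shape[1]
    freqs = np.fft.rfftfreq(n, 1 / fs)
    spectra = np.fft.rfft(frames, axis=1)
    # the lifehack in one line: each bin gets its own phase 2*pi*f*tau
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft((spectra * steering).mean(axis=0), n)

# simulate a 500 Hz tone arriving from 30 degrees and steer at it
t = np.arange(1024) / fs
mics = np.array([np.sin(2 * np.pi * 500 * (t - d)) for d in delays])
out = das_freq(mics)
print(np.max(np.abs(out - np.sin(2 * np.pi * 500 * t))))   # ~0: beam on target
```

When the steering angle matches the source, the per-mic phase shifts undo the arrival delays exactly and the tone is recovered; steer elsewhere and the mics sum incoherently, which is exactly the attenuation the directivity pattern below shows.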
A rough directivity pattern would look something like this
Those who still remember how to grind through math can look into JIO-RLS (joint iterative subspace adaptive reduced-rank least squares). It's very similar to gradient descent, you know.
So, to summarize: with conventional methods it is hard to reach quality comparable to a microphone array. After direction finding we hear only the source we need, and we get rid of the noise and the reverberation of the room, even the parts that are barely audible (the Haas effect).
Voice assistant – what it looks like from the inside
So what does the sound-processing pipeline of a typical voice assistant look like?
The signal from the microphone array arrives at the device, where we form a beam toward the sound source (beamforming), thereby removing interference. Then the sound within that beam goes off to be recognized; the device usually lacks the resources for high-quality recognition, so most often the signal is sent to the cloud (Microsoft, Google, Amazon).
The attentive reader will ask: what is that extra "hotword" box in the diagram, and why not go straight to recognition, as promised?

Why is that extra box probably drawn on the diagram?

Because no resources would suffice to constantly stream audio from every noise source to the internet for recognition. So we start recognizing only once we know we really want to, after a special spell has been spoken: "OK Google", "Hey Siri", "Alexa", or "Cortana". The hotword classifier is most often a neural network and runs directly on the device. Building such a classifier is a fascinating topic of its own, but not today's.
And here is what the pipeline actually looks like:
Several beams can be formed toward different signal sources, and we look for the special word in each of them; we then continue processing the beam in which the right word was said.
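A hypothetical sketch of that gating logic (function names and the threshold are invented for illustration, not taken from any real assistant): the cheap on-device hotword score runs on every beam, and only the winning beam, if it clears the threshold, is streamed to the expensive cloud recognizer.

```python
from typing import Callable, Optional, Sequence

def process_frame(
    beams: Sequence[bytes],                   # one audio frame per formed beam
    hotword_score: Callable[[bytes], float],  # small classifier, runs on-device
    send_to_cloud: Callable[[bytes], None],   # expensive: network + cloud ASR
    threshold: float = 0.8,
) -> Optional[int]:
    """Return the index of the beam that triggered, or None."""
    # score every beam, but only ever stream the best one
    scores = [hotword_score(b) for b in beams]
    best = max(range(len(beams)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None                           # nobody said the magic word
    send_to_cloud(beams[best])                # recognition continues in the cloud
    return best

# toy usage: beam 1 "contains" the hotword
sent = []
idx = process_frame(
    beams=[b"noise", b"ok-google", b"hum"],
    hotword_score=lambda b: 0.95 if b == b"ok-google" else 0.1,
    send_to_cloud=sent.append,
)
print(idx, sent)   # 1 [b'ok-google']
```

The point of the structure is the asymmetry of costs: scoring is cheap enough to run on every beam of every frame, while streaming happens at most once per frame.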
The next step, recognition in the cloud, has been covered many times on the internet, and there are plenty of tutorials on it.
How to join this celebration of math
The easiest way is to buy a dev board. As for reviews of existing dev boards, one of the most comprehensive can be found by following this link
The most beginner-friendly :
based on XMOS XVF-3000.
Built the way I like it: an FPGA with an open interface controls the microphone array, and you communicate with it via SDA.
My exploits in crossing Android Things with the Mic Array:
There are quite a few examples for this board (the Voice), but I simply find it comfortable to use with Things.
Arguments for Things:
You can build a flexible and powerful tool :
- it is convenient that you can use the screen as a separate device
- it can be used as a headless device, i.e. you can hand data off over the network (create an API to transfer it to another device)
- convenient debugging
- many libraries, including networking ones
- analysis tools are plentiful
- and if that is not enough, you can hook up C libraries
For example, I use it for:
- audio-file analysis,
- training and building classifiers.
And later, if you have to port or rewrite the code for some embedded target, it is easier to do so from Java code.
Unfortunately, the board authors' example for Things was a bit off, so I made my own demo project (of course I can do that).
In a nutshell: all the black magic of fast microphone polling and FFT is done in C++, while visualization, analysis, and networking are done in Java.
Plans for future development
Source of plans and inspiration: ODAS
I want to do the same thing, only on Things and without the glitches.
- because ODAS is a bit awkward to use;
- I need a proper tool to work with;
- because I can, and I like the topic;
- the hardware and software involved match the complexity of the task.
"If you have anything to add or criticize, feel free to write it in the comments, for one head is worse than two, two are worse than three, and n−1 are worse than n" nikitasius