Antenna Beamforming using Support Vector Machines
<Contributed to MeLi Wiki by Professor Christos Christodoulou, Department of Electrical and Computer Engineering, University of New Mexico>
Abstract
Support Vector Machines (SVM) have improved generalization performance over other classical optimization techniques. Here, we introduce an SVM-based approach for linear array processing and beamforming. The development of a modified cost function is presented, and it is shown how it can be applied to the problem of linear beamforming. Finally, comparison examples are included to show the validity of the new minimization approach.
Introduction
Support Vector Machines (SVM) have shown clear advantages in prediction, regression and estimation over classical approaches in a wide range of applications, due to their improved generalization performance. We introduce here a framework to apply the SVM approach to linear antenna array processing.
Antenna array signal processing involves complex signals, for which a complex formulation of the SVM is needed. We derive this formulation by including the real and imaginary parts of the error in the primal optimization and then proceeding as usual to solve a complex-valued constrained optimization problem. We formulate the algorithm using a modified cost function which includes the numerical regularization that is unavoidable in the optimization. The adjustment of the parameters of this cost function improves the robustness of the method against thermal and multiuser noise.
The resulting algorithm is a natural counterpart of the real valued Support Vector Regressor, which can be immediately applied to array signal processing.
We introduce the derived formulation as the optimizer for a linear beamformer. Several experiments illustrate the advantage of the SVM over minimum mean squared error (MMSE) based algorithms, due to its robustness against outliers and its improved generalization ability. The first experiments compare the behavior of both algorithms in an environment in which interfering signals are close to the desired ones, thus producing nongaussian noise. The last experiment illustrates the improved generalization ability of the SVM when small data sets are used for training, which is common in communications applications.
The Support Vector Approach and the Cost Function
Let <math>\mathbf{x}[n]</math> be spatially sampled data. A linear beamformer can be written as
<math>d[n]={\mathbf{w}}^T {\mathbf{x}}[n]+\epsilon[n]</math>
where <math>{\mathbf{x}}[n]</math> is the vector of the <math>M</math> array element outputs and <math>\epsilon[n]</math> is the output error.
Coefficients <math>\mathbf{w}</math> are usually estimated through the minimization of a certain cost function on <math>\epsilon[n]</math>
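As a point of reference, the classical least-squares solution of this model can be sketched in a few lines of NumPy (the array size, snapshot count, and noise level below are our own illustrative choices, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 50                                   # array elements, training snapshots
# Columns of X are the snapshots x[n]; the model is d[n] = w^T x[n] + eps[n]
X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
w_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
noise = 0.01 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
d = X.T @ w_true + noise

# Least-squares estimate: minimize sum_n |d[n] - w^T x[n]|^2
w_ls, *_ = np.linalg.lstsq(X.T, d, rcond=None)
eps_n = d - X.T @ w_ls                          # residual output errors
```

The SVM approach below replaces this quadratic criterion with a regularized, outlier-robust one.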
The SVM approach can be applied to the adjustment of this model. The main idea of the SVM is to obtain the solution with the minimum norm of <math>\mathbf{w}</math>. Due to the minimization of the weight vector norm, the solution is regularized in the sense of Tikhonov [1], improving the generalization performance. In order not to be trivial, the minimization has to be subject to the constraints
<math>\begin{align} &d[n]-{\mathbf{w}}^T {\mathbf{x}}[n] \leq \varepsilon + \xi_n\\ &-d[n]+{\mathbf{w}}^T {\mathbf{x}}[n] \leq \varepsilon + \xi'_n \end{align}</math>
<math>\xi_n</math> and <math>\xi'_n</math> are the "slack" variables or losses. The optimization is intended to minimize a cost function over these variables. The parameter <math>\varepsilon</math> allows those <math>\xi_n</math> or <math>\xi'_n</math> for which the error is less than <math>\varepsilon</math> to be zero. This is equivalent to the minimization of the so-called <math>\varepsilon</math>-insensitive or Vapnik loss function [2], given by
<math>L_\varepsilon(e)=\begin{cases} |e|-\varepsilon & |e| \geq \varepsilon\\ 0 & |e|<\varepsilon \end{cases}</math>
The functional to be minimized is then
<math>\frac{1}{2}||{\mathbf{w}}||^2 + C\sum_{n=1}^N (\xi_n+\xi'_n)</math>
subject to <math>\xi_n, \xi'_n \geq 0</math>, where <math>C</math> is the trade-off between the minimization of the norm (to improve the generalization ability) and the minimization of the errors [2].
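A direct transcription of this loss (the function name is ours):

```python
import numpy as np

def vapnik_loss(e, eps):
    """epsilon-insensitive (Vapnik) loss: zero inside the tube |e| < eps,
    growing linearly as |e| - eps outside it."""
    a = np.abs(e)
    return np.where(a > eps, a - eps, 0.0)

losses = vapnik_loss(np.array([0.05, -0.08, 0.3, -0.5]), 0.1)
```

Errors inside the tube cost nothing; only the excess over <math>\varepsilon</math> is penalized.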
The optimization of the above constrained problem through Lagrange multipliers <math>\alpha_n</math>, <math>\alpha'_n</math> leads to the dual formulation [3]
<math>L_d=-\frac{1}{2}({\boldsymbol \alpha}-{\boldsymbol \alpha'})^T{\mathbf{R}}({\boldsymbol \alpha}-{\boldsymbol \alpha'})+({\boldsymbol \alpha}-{\boldsymbol \alpha'})^T{\mathbf{y}}-({\boldsymbol \alpha}+{\boldsymbol \alpha'})^T{\mathbf{1}}\varepsilon</math>
to be maximized with respect to <math>({\alpha}_n-{\alpha'}_n)</math>.
It involves the Gram matrix <math>\mathbf{R}</math> of the dot products of the data vectors <math>\mathbf{x}[n]</math>. This matrix may be singular, producing an ill-conditioned problem. To avoid this numerical inconvenience, a small diagonal matrix <math>\gamma {\mathbf{I}}</math> is added to <math>\mathbf{R}</math> prior to the numerical optimization.
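A small numerical sketch of why the loading is needed (sizes and <math>\gamma</math> are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 10                                  # more snapshots than array elements
X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
R = X.conj().T @ X                            # N x N Gram matrix of dot products
# rank(R) <= M < N, so R is singular; adding gamma*I restores invertibility
gamma = 1e-6
R_loaded = R + gamma * np.eye(N)
```

Whenever the number of training snapshots exceeds the number of array elements, this rank deficiency is guaranteed, so the loading is not optional.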
We present here a modified derivation of the SVM regressor which leads to a more convenient equivalent cost function (Figure 1)
<math>L_R(e)=\begin{cases} 0 & |e|<\varepsilon\\ \frac{1}{2\gamma}(|e|-\varepsilon)^2 & \varepsilon \leq |e| \leq e_C\\ C(|e|-\varepsilon)-\frac{1}{2}\gamma C^2 & e_C \leq |e| \end{cases}</math>
where <math>e_C=\varepsilon+\gamma C</math>.
This cost function provides a functional that is numerically regularized by the matrix <math>\gamma {\mathbf{I}}</math>. As can be seen, the cost function is quadratic for the data which produce errors between <math>\varepsilon</math> and <math>e_C</math>, and linear for errors above <math>e_C</math>. Thus, one can adjust the parameter <math>e_C</math> to apply a quadratic cost to the samples which are mainly affected by thermal noise (i.e., for which the quadratic cost is maximum likelihood). The linear cost is then applied to the samples which are outliers [4], [5]. With a linear cost function, the contribution of an outlier to the solution does not depend on its error value, but only on its sign, thus avoiding the bias that a quadratic cost function produces.
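The piecewise cost can be transcribed directly; the following sketch (function name ours) also makes it easy to verify that the three segments join continuously at <math>\varepsilon</math> and <math>e_C</math>:

```python
import numpy as np

def robust_cost(e, eps, gamma, C):
    """Zero tube, quadratic zone for moderate errors, linear zone for outliers."""
    a = np.abs(e)
    e_C = eps + gamma * C
    quad = (a - eps) ** 2 / (2.0 * gamma)
    lin = C * (a - eps) - gamma * C**2 / 2.0
    return np.where(a < eps, 0.0, np.where(a <= e_C, quad, lin))

# Example with eps = 0.1, gamma = 0.5, C = 2 (so e_C = 1.1)
vals = robust_cost(np.array([0.05, 0.6, 1.1, 2.0]), eps=0.1, gamma=0.5, C=2.0)
```

The constant <math>-\frac{1}{2}\gamma C^2</math> in the linear branch is exactly what makes the cost and its slope continuous at <math>e_C</math>.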
Finally, we generalize the derivation to the complex-valued case, as it is necessary for array processing.
Support Vector Machine Beamformer
The signal received by an antenna array consists of a number of desired or friendly signals, which may come from one or more directions of arrival with different amplitudes and phases, and of interfering or undesired signals. The idea here is to use the SVM to find the optimum weights <math>w</math> that steer the antenna array so that its radiation pattern points its beams towards the directions of the desired signals and places nulls in the directions from which the interfering signals arrive. Next, we show how this optimization can be achieved using SVMs on a linear antenna array. The output vector of an <math>M</math>-element array receiving <math>K</math> signals, shown in Figure 2, can be written in matrix notation as
where
are respectively the <math>M \times K</math> steering matrix and vectors, the received signals and the thermal noise present at the output of each array element. The spatial correlation matrix of the received signals is:
<math>\begin{align} {\mathbf{R}}&=\frac{1}{N}{\mathbf{X}}{\mathbf{X}}^H \approx E[({\mathbf{A}}{\mathbf{s}}[n]+{\mathbf{g}}[n])({\mathbf{A}}{\mathbf{s}}[n]+{\mathbf{g}}[n])^H]\\ &={\mathbf{A}}{\mathbf{P}}{\mathbf{A}}^H+\sigma_g^2{\mathbf{I}} \end{align}</math>
provided that the signal and noise are independent, where <math>{\mathbf{P}}</math> is the autocorrelation of <math>{\mathbf{s}}[n]</math> and <math>\sigma_g^2</math> is the noise power.
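This structure can be checked numerically; the sketch below assumes a half-wavelength uniform linear array and unit-power uncorrelated QPSK-like signals (our own assumptions):

```python
import numpy as np

def steering(M, theta):
    # ULA steering vector, half-wavelength element spacing
    return np.exp(1j * np.pi * np.arange(M) * np.sin(theta))

rng = np.random.default_rng(2)
M, N, sigma_g = 6, 20000, 0.1
A = np.stack([steering(M, th) for th in (-0.1 * np.pi, 0.25 * np.pi)], axis=1)
# Unit-power QPSK-like signals (so P = I) and white thermal noise
S = (rng.choice([-1.0, 1.0], (2, N)) + 1j * rng.choice([-1.0, 1.0], (2, N))) / np.sqrt(2)
G = sigma_g * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
X = A @ S + G
R_hat = X @ X.conj().T / N                            # sample correlation matrix
R_model = A @ A.conj().T + sigma_g**2 * np.eye(M)     # A P A^H + sigma_g^2 I
```

With enough snapshots, the sample estimate converges to the model <math>{\mathbf{A}}{\mathbf{P}}{\mathbf{A}}^H+\sigma_g^2{\mathbf{I}}</math>.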
The optimal beamformer needs to satisfy
subject to the constraint
That is, it minimizes the output power subject to the condition that the desired signal <math>{\mathbf{s}}_d[n]</math> produces a given output <math>y[n]</math>. Applying Lagrange multipliers [6] to that constrained optimization problem leads to the solution
where <math>{\mathbf{S}}_d</math> is the matrix of all desired input vectors, and <math>{\mathbf{y}}</math> is the vector of all desired output signals.
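A sketch of this linearly constrained minimum-power solution, using the standard closed form <math>{\mathbf{w}}={\mathbf{R}}^{-1}{\mathbf{S}}_d({\mathbf{S}}_d^H{\mathbf{R}}^{-1}{\mathbf{S}}_d)^{-1}{\mathbf{y}}</math> (we assume the usual <math>{\mathbf{w}}^H</math> inner-product convention; function and variable names are ours):

```python
import numpy as np

def lcmv_weights(R, Sd, y):
    """Minimize w^H R w subject to Sd^H w = y (standard closed form)."""
    Ri_Sd = np.linalg.solve(R, Sd)                       # R^{-1} Sd
    return Ri_Sd @ np.linalg.solve(Sd.conj().T @ Ri_Sd, y)

# Toy check: with R = I and one constraint, the constraint is met exactly
M = 4
Sd = np.ones((M, 1), dtype=complex)
w = lcmv_weights(np.eye(M), Sd, np.array([1.0 + 0j]))
```

Using `solve` rather than an explicit matrix inverse is the numerically preferred way to evaluate this expression.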
How do we apply SVMs?
The output vector <math>{\mathbf{x}}[n]</math> is linearly processed to estimate the desired output <math>d[n]</math> [7], [8]. The expression for the output of the array processor is
where <math>{\mathbf{w}} = [w_1 \cdots w_M]</math> is the weight vector of the array and <math>{\epsilon[n]}</math> is the estimation error.
For a set of <math>N</math> observed samples of <math>\{x[n]\}</math>, and when nonzero empirical errors are expected, the functional to be minimized is
<math>\frac{1}{2}||{\mathbf{w}}||^2 + \sum_{n} L_R\left(e[n],\varepsilon,\gamma,C\right)</math>
Therefore, particularizing to the error cost function (6), we have to minimize
subject to
<math>\begin{align} &Re\left(d[n]-{\mathbf{w}}^T {\mathbf{x}}[n]\right) \leq \varepsilon + \xi_n\\ &Re\left(-d[n]+{\mathbf{w}}^T {\mathbf{x}}[n]\right) \leq \varepsilon + \xi'_n\\ &Im\left(d[n]-{\mathbf{w}}^T {\mathbf{x}}[n]\right) \leq \varepsilon + \zeta_n\\ &Im\left(-d[n]+{\mathbf{w}}^T {\mathbf{x}}[n]\right) \leq \varepsilon + \zeta'_n\\ &\xi_n, \xi'_n, \zeta_n, \zeta'_n \geq 0 \end{align}</math>
where <math>\xi_n</math> (<math>\xi'_n</math>) stand for positive (negative) errors in the real part of the output, and analogously <math>\zeta_n</math> (<math>\zeta'_n</math>) in the imaginary part. Note that errors are either negative or positive, so only one of the losses in each pair takes a nonzero value; that is, either <math>\xi_n</math> or <math>\xi'_n</math> (either <math>\zeta_n</math> or <math>\zeta'_n</math>) is null. This constraint can be written as <math>\xi_n \xi'_n = 0</math> (<math>\zeta_n \zeta'_n = 0</math>). Finally, as in other SVM formulations, the parameter <math>C</math> can be seen as a trade-off factor between the empirical risk and the structural risk.
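The slack pairs can be computed directly from a complex residual; the sketch below (names ours) shows that at most one slack per pair is nonzero:

```python
def slacks(e, eps):
    """Positive/negative slack pairs for the real and imaginary parts
    of a complex residual e, given tube width eps."""
    xi     = max(e.real - eps, 0.0)
    xi_p   = max(-e.real - eps, 0.0)
    zeta   = max(e.imag - eps, 0.0)
    zeta_p = max(-e.imag - eps, 0.0)
    return xi, xi_p, zeta, zeta_p

# Residual 0.5 - 0.3j with eps = 0.1: positive real slack, negative imaginary slack
vals = slacks(0.5 - 0.3j, 0.1)
```

For any residual, the products <math>\xi_n \xi'_n</math> and <math>\zeta_n \zeta'_n</math> computed this way are identically zero, matching the constraint in the text.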
It is possible to transform the minimization of primal functional (15) subject to constraints in (16), into the optimization of the dual functional or Lagrange functional. Firstly, we introduce the constraints into the primal functional by means of Lagrange Multipliers, obtaining the following primal-dual functional:
<math>\begin{align} L_{pd}=&\frac{1}{2} ||{\mathbf{w}}||^2 + C\sum_{n \in I_1}(\xi_n+\xi'_n) + C \sum_{n \in I_1}(\zeta_n+\zeta'_n) \\ &+\frac{1}{2\gamma}\sum_{n \in I_2} \left(\xi_n^2 + {\xi'}_n^2\right) +\frac{1}{2\gamma} \sum_{n \in I_2} \left(\zeta_n^2 +{\zeta'}_n^2\right)\\ &-\sum_{n=n_0}^N(\lambda_n\xi_n+\lambda'_n\xi'_n)-\sum_{n=n_0}^N(\eta_n\zeta_n+\eta'_n\zeta'_n)\\ &+\sum_{n=n_0}^N \alpha_n \left[Re\left(d[n] -{\mathbf{w}}^T {\mathbf{x}}[n] \right) -\varepsilon -\xi_n\right]\\ &+\sum_{n=n_0}^N \alpha'_n \left[Re\left(-d[n] +{\mathbf{w}}^T {\mathbf{x}}[n] \right) -\varepsilon -\xi'_n\right]\\ &+\sum_{n=n_0}^N \beta_n \left[Im\left(d[n] -{\mathbf{w}}^T {\mathbf{x}}[n] \right) -\varepsilon -\zeta_n\right]\\ &+\sum_{n=n_0}^N \beta'_n \left[Im\left(-d[n]+{\mathbf{w}}^T {\mathbf{x}}[n]\right) -\varepsilon -\zeta'_n\right] \end{align}</math>
with the dual variables or Lagrange multipliers constrained to <math>\alpha_n, \beta_n, \lambda_n, \eta_n, \alpha'_n, \beta'_n, \lambda'_n, \eta'_n \geq 0</math> and with <math>\xi_n, \zeta_n, \xi'_n, \zeta'_n \geq 0</math>. Note that the robust cost function has two active segments, a quadratic one and a linear one. Thus, the sets <math>n\in I_2</math> and <math>n \in I_1</math> in (17) contain respectively the slacks <math>\xi_n</math>, <math>\xi'_n</math> which lie in the quadratic and the linear segments of the cost function.
The following constraints must also be fulfilled:
<math>\begin{align} \alpha_n \alpha'_n &=0\\ \beta_n \beta'_n &=0 \end{align}</math>
Besides, the Karush-Kuhn-Tucker (KKT) conditions [2] force <math>\lambda_n \xi_n=0</math>, <math>\lambda'_n \xi'_n=0</math> and <math>\eta_n\zeta_n=0</math>, <math>\eta'_n \zeta'_n=0</math>. Functional (17) has to be minimized with respect to the primal variables and maximized with respect to the dual variables. By minimizing <math>L_{pd}</math> with respect to the <math>w_i</math> we obtain an optimal solution for the weights
<math>{\mathbf{w}}=\sum_{n=n_0}^N \psi_n {\mathbf{x}}^*[n]</math>
where <math>\psi_n=\alpha_n - \alpha'_n +j(\beta_n - \beta'_n)</math>. This result is analogous to the one for the real-valued SVM problem, except that now the Lagrange multipliers for both the real (<math>\alpha_n</math>, <math>\alpha'_n</math>) and the imaginary (<math>\beta_n</math>, <math>\beta'_n</math>) components have been considered. Optimizing <math>L_{pd}</math> with respect to <math>\xi_n</math> and <math>\zeta_n</math> and applying the KKT conditions leads to an analytical relationship between the residuals and the Lagrange multipliers, given by
<math>\begin{align} (\alpha_n-\alpha'_n)& = \begin{cases} -C, & Re(e[n]) \leq -e_C\\ \frac{1}{\gamma}(Re(e[n])+\varepsilon), & -e_C \leq Re(e[n]) \leq -\varepsilon\\ 0, & -\varepsilon \leq Re(e[n]) \leq \varepsilon\\ \frac{1}{\gamma}(Re(e[n])-\varepsilon), & \varepsilon \leq Re(e[n]) \leq e_C\\ C, & e_C \leq Re(e[n]) \end{cases}\\ (\beta_n-\beta'_n) & = \begin{cases} -C, & Im(e[n]) \leq -e_C\\ \frac{1}{\gamma}(Im(e[n])+\varepsilon), & -e_C \leq Im(e[n]) \leq -\varepsilon\\ 0, & -\varepsilon \leq Im(e[n]) \leq \varepsilon\\ \frac{1}{\gamma}(Im(e[n])-\varepsilon), & \varepsilon \leq Im(e[n]) \leq e_C\\ C, & e_C \leq Im(e[n]) \end{cases} \end{align}</math>
As remarked earlier, it is possible to continue towards the purely dual formulation of the problem, solvable using Quadratic Programming (QP) approaches as usually done in the literature, but this approach is computationally intensive and lacks the required flexibility for the communications problem at hand. Alternative optimization methods relying on Iterative Re-Weighted Least Squares (IRWLS) have been introduced in [9], [10], with clear advantages in terms of computational cost and flexibility of operation.
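This five-segment relation collapses to a sign-and-clip rule, which also gives <math>\psi_n</math> from a complex residual (a sketch; function names are ours):

```python
import numpy as np

def net_multiplier(u, eps, gamma, C):
    """(alpha - alpha') as a function of a real residual component u:
    zero tube, slope-1/gamma zone, saturation at +/- C."""
    return np.sign(u) * np.clip((np.abs(u) - eps) / gamma, 0.0, C)

def psi(e, eps, gamma, C):
    # psi_n = (alpha_n - alpha'_n) + j (beta_n - beta'_n)
    return net_multiplier(e.real, eps, gamma, C) + 1j * net_multiplier(e.imag, eps, gamma, C)

# Example with eps = 0.1, gamma = 0.5, C = 2 (so e_C = 1.1)
vals = net_multiplier(np.array([0.05, 0.6, 2.0, -2.0]), eps=0.1, gamma=0.5, C=2.0)
```

The saturation at <math>\pm C</math> is exactly the robustness mechanism: outliers beyond <math>e_C</math> contribute only their sign, not their magnitude.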
Applying (19), the norm of the complex coefficients can be written as
At this point, it is worth using matrix notation again and storing the partial correlations in (21) as matrices
so that the norm of the coefficients can be written
<math>{\mathbf{R}}</math> being the matrix with elements <math>R[j,i]</math>, and <math>{\boldsymbol \psi} =(\psi_{n_0} \cdots \psi_{N})^T</math>. By substituting (19) into functional (17), the dual functional to be maximized is as follows:
<math>\begin{align} L_d=&\frac{1}{2}{\boldsymbol{\psi}}^H {\mathbf{R}} {\boldsymbol{\psi}} - Re[{\boldsymbol{\psi}}^H {\mathbf{R}}({\boldsymbol{\alpha}} - {\boldsymbol{\alpha}}')] + Im[{\boldsymbol{\psi}}^H {\mathbf{R}}({\boldsymbol{\beta}} - {\boldsymbol{\beta}}')] \\ &+ Re [({\boldsymbol{\alpha}} - {\boldsymbol{\alpha}}')^T {\mathbf{y}}] - Im[({\boldsymbol{\beta}} - {\boldsymbol{\beta}}')^T {\mathbf{y}}] \\ &- ({\boldsymbol{\alpha}} +{\boldsymbol{\alpha}}')^T{\mathbf{1}}\varepsilon - ({\boldsymbol{\beta}} +{\boldsymbol{\beta}}')^T{\mathbf{1}}\varepsilon + L_C \end{align}</math>
<math>L_C</math> being a function of <math>\psi</math>. In order to find <math>L_C</math> in (24), note that in both intervals <math>\xi_m^{(')}, \zeta_m^{(')} \neq 0</math>; then, by the KKT conditions, <math>\lambda_m^{(')}, \eta_m^{(')} = 0</math>. Intervals <math>I_1</math> and <math>I_2</math> must be treated separately:
- Taking (20) into account in interval <math>I_1</math>, <math>\alpha_m^{(')}=\beta_m^{(')}=C</math>. Then, the last term of the functional for <math>I_1</math> is
- <math>{\mathbf{I}}</math> being the identity.
- Regarding (20) in interval <math>I_2</math>, <math>\alpha_m^{(')}=\frac{1}{\gamma}\xi_m^{(')}</math> and <math>\beta_m^{(')}=\frac{1}{\gamma}\zeta_m^{(')}</math>. The last term for this interval is
- <math>\psi_2</math> being the vector of elements of <math>\psi</math> belonging to interval <math>I_2</math>.
Both terms can be grouped as
<math>D_{I_2}</math> being a diagonal matrix whose entries corresponding to <math>I_1</math> are set to 1 and the rest to 0. As the last term of <math>L_C</math> is just a constant, it can be removed from the optimization.
By regrouping terms and taking into account that <math>{\mathbf{\psi}}^H{\mathbf{R}} {\mathbf{\psi}} = {\mathbf{\psi}}^H Re({\mathbf{R}}) {\mathbf{\psi}}</math>, the functional (24) can be written in a more compact form
Expression (28) is formally identical to the real-valued SVM functional.
Examples
Bit Error Rate performance
We first compared the algorithm with the regular least squares (LS) approach for an antenna array of 6 elements. The desired signals come from the angles of arrival <math>-0.1 \pi</math> and <math>0.25\pi</math>, with amplitudes <math>1</math> and <math>0.3</math>, and the interfering signals come from the angles of arrival <math>-0.05\pi</math>, <math>0.1\pi</math> and <math>0.3\pi</math>, each with amplitude <math>1</math>.
In order to train the beamformer, a burst of 50 known symbols is sent. Then, the Bit Error Rate (BER) is measured with bursts of 10,000 unknown symbols.
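The training/testing protocol can be sketched for an LS-trained beamformer alone (a hedged toy setup with a single BPSK source and our own noise level; SVM-trained weights would simply replace <math>{\mathbf{w}}</math> here):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N_train, N_test = 6, 50, 10000
a = np.exp(1j * np.pi * np.arange(M) * np.sin(-0.1 * np.pi))   # desired steering vector

def burst(n):
    s = rng.choice([-1.0, 1.0], n)                              # BPSK symbols
    g = 0.3 * (rng.standard_normal((M, n)) + 1j * rng.standard_normal((M, n))) / np.sqrt(2)
    return np.outer(a, s) + g, s

X_tr, s_tr = burst(N_train)                                     # 50 known training symbols
w, *_ = np.linalg.lstsq(X_tr.T, s_tr.astype(complex), rcond=None)
X_te, s_te = burst(N_test)                                      # 10,000 unknown symbols
ber = np.mean(np.sign((X_te.T @ w).real) != s_te)               # bit error rate
```

The same measurement loop is then repeated over many independent trials and noise levels to produce the BER curves.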
In the first two examples, we fix the <math>\gamma</math> of eq. (6) to <math>10^{-6}</math>, and the product <math>\gamma C</math> is then adjusted to the noise standard deviation. That way, most of the samples which are corrupted only by (thermal) noise fall in the quadratic area, whereas the outliers produced by interfering signals fall in the linear area.
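In code, this parameter choice is a one-liner (the noise estimate below is an assumed input):

```python
gamma = 1e-6                    # fixed small regularizer, as in the text
sigma_n = 0.1                   # assumed (estimated) thermal-noise standard deviation
C = sigma_n / gamma             # chosen so that gamma * C matches the noise std
quad_zone_width = gamma * C     # errors up to sigma_n beyond eps get the quadratic cost
```
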
We calculated the BER performance of LS and SVM for different noise levels from <math>0</math> dB to <math>-15</math> dB. Each BER has been evaluated by averaging the results of 100 independent trials. The results can be seen in Figure 3.
In the next case, the desired signals come from the angles of arrival (AOA) <math>-0.1 \pi</math> and <math>0.25 \pi</math>, with amplitudes <math>1</math> and <math>0.3</math>, while the interfering signals come from the AOAs <math>-0.02\pi</math>, <math>0.2\pi</math> and <math>0.3\pi</math> with amplitude <math>1</math> (see Figure 4).
In this last example, the interfering signals are much closer to the desired ones, thus biasing the LS algorithm. The better performance of the SVM is due to its robustness against the nongaussian outliers produced by the interfering signals.
Robustness against overfitting
One advantage of the SVM is that its generalization ability is controlled by the regularization imposed by the minimization of the weight vector norm. We highlight this fact by calculating the Bit Error Rate for different numbers of training samples. The results are shown in Figure 5.
Conclusions
We introduce in this work a way to apply the SVM framework to linear array beamforming. SVMs have a clear advantage over minimum mean square error based algorithms in those cases in which small data sets are available for training and nongaussian noise is present, because the generalization ability of the machine is controlled. In order to make the algorithm adequate for array processing purposes, we first apply an alternative cost function which is suitable for problems in which there is gaussian noise together with other nongaussian sources, such as multiuser interference, which may produce outliers in the signal. This cost function also provides a natural way to explain the numerical regularization present in any SVM.
Ongoing work addresses the application of nonlinear Support Vector Machines to beamforming and to the detection of the angle of arrival.
Student Project
Consider two antenna arrays: one composed of 6 elements and the other of 10 elements. The elements are isotropic in both cases. Assume the desired signals come from the angles of arrival (AOA) <math>-0.1 \pi</math> and <math>0.25 \pi</math>, with amplitudes <math>0.5</math> and <math>0.3</math>, and the interfering signals come from the AOAs <math>-0.05\pi</math>, <math>0.1\pi</math> and <math>0.35\pi</math> with amplitude <math>1</math>.
Can you train the beamformer to maximize its reception at the desired signals for both the 6 and the 10 element arrays?
What are the differences between the two cases in terms of training, accuracy and BER?
Wiki Assessment
- What cost function is being used here?
- Why are we using the complex formulation of the SVM instead of the real-valued one?
- What happens if the desired signals to be detected are very close to the interfering signals?
- What happens if the interfering signals have a higher amplitude than the desired signals to be detected?
- What do we mean by primal optimization?
- What do we mean by robustness?
- What is the Gram matrix?
- What is Tikhonov regularization?
- What other experiments could we have done with this particular algorithm?
References
- A. Tikhonov, V. Arsenin, Solution to Ill-Posed Problems, V.H. Winston & Sons, 1977.
- V. Vapnik, Statistical Learning Theory, Adaptive and Learning Systems for Signal Processing, Communications, and Control, John Wiley & Sons, 1998.
- A. Smola, B. Scholkopf, A Tutorial on Support Vector Regression, NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.
- P. J. Huber, Robust Statistics: A Review, Ann. Math. Statistics, Vol. 43, 1972.
- K.-R. Muller, A. Smola, G. Ratsch, B. Scholkopf, J. Kohlmorgen, V. Vapnik, Predicting Time Series with Support Vector Machines, in: B. Scholkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 243-254.
- S. Haykin, Adaptive Filter Theory, 4th Edition, Prentice-Hall, Englewood Cliffs, N.J., 2001.
- M. Martínez-Ramón, N. Xu, C. Christodoulou, Beamforming Using Support Vector Machines, IEEE Antennas and Wireless Propagation Letters, Vol. 4, 2005, pp. 439-442.
- M. Martínez-Ramón, C. Christodoulou, Support Vector Array Processing, IEEE Transactions on Antennas and Propagation, submitted.
- A. Navia-Vázquez, F. Pérez-Cruz, A. Artés-Rodríguez, A. Figueiras-Vidal, Weighted Least Squares Training of Support Vector Classifiers Leading to Compact and Adaptive Schemes, IEEE Trans. Neural Networks, Vol. 12, No. 5, pp. 1047-1059.
- F. Pérez-Cruz, P. L. Alarcón-Diana, A. Navia-Vázquez, A. Artés-Rodríguez, Fast Training of Support Vector Classifiers, in: T. K. Leen, T. G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, MIT Press, 2001.