Confidence measures of the FIR methodology

FIR deals with multi-input/single-output systems. Each state consists of a number of mask inputs (the so-called m-inputs) and a single mask output, called the m-output. In the forecasting process, FIR compares the current values of the set of m-inputs (the so-called ‘input state’) with all the input states stored in the experience database that was constructed during the training phase, i.e. in the modelling process. It determines the five nearest neighbours in terms of their input states in the experience database and estimates the new m-output value as a weighted sum of the m-output values of those five nearest neighbours. In other words, proximity to the nearest neighbours is established in the input space, leading to a set of weight factors that are then used for interpolation in the output space.

There are two separate sources of uncertainty in making predictions that need to be taken into account. The first source of uncertainty is related to the proximity or similarity of the current (testing) input state to the input states of the training data in the experience database. If the previously observed training patterns are similar to the current testing pattern in the input space, it is more likely that a prediction made by interpolating between the observed m-outputs of the training data-sets will be correct. The second source of uncertainty has to do with the dispersion among the m-outputs of the five nearest neighbours in the experience database. If the m-output values are almost identical, i.e. the dispersion between the m-outputs is small, then it is more likely that the prediction will be accurate.

In order to create a meaningful metric of proximity in the input space, it is necessary to normalise the variables. This is accomplished using a normalised pseudo-regeneration of the previously fuzzified variables. A ‘position value’, $pos_i$, of the $i$th m-input, $var_i$, can be computed as follows:

$$pos_i = class_i + side_i\cdot (1.0-Memb_i)$$

where $class_i$, $Memb_i$ and $side_i$ are the qualitative triple representing the $i$th m-input, obtained in the fuzzification process (Cellier et al. 1996b). In the above formula, the linguistic variables $class_i$ and $side_i$ assume numerical (integer) values. The class values range from 1 to $n_i$, where $n_i$ is the number of discrete classes attributed to $var_i$, and the side values are taken from the set $\{-1, 0, +1\}$, representing the linguistic values ‘left’, ‘centre’ and ‘right’ of the fuzzy membership function (Cellier et al. 1996b). The position value, $pos_i$, can be viewed as a normalised pseudo-regeneration of the $i$th m-input. Irrespective of the original values of the variable, $pos_i$ assumes values in the range [1.0, 1.5] for the lowest class, [1.5, 2.5] for the next higher class, etc.
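The formula above is a one-line computation. The following sketch shows it in Python; the function name and the example values are illustrative, not part of the FIR specification:

```python
def position_value(class_i: int, side_i: int, memb_i: float) -> float:
    """Normalised pseudo-regeneration of a fuzzified variable.

    class_i: discrete class index (1 .. n_i)
    side_i:  -1 ('left'), 0 ('centre'), +1 ('right')
    memb_i:  fuzzy membership value of the observed data point
    """
    return class_i + side_i * (1.0 - memb_i)

# A value at the centre of class 2 sits exactly at 2.0;
# one at the right edge of class 1 (membership 0.5) reaches 1.5.
print(position_value(2, 0, 1.0))   # 2.0
print(position_value(1, +1, 0.5))  # 1.5
```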

The data in the experience database can be characterised in the same fashion.
$$pos_{i}^{j}=class_{i}^{j}+side_{i}^{j}\cdot (1.0-Memb_{i}^{j})$$

represents the normalised pseudo-regenerated value of $var_{i}^{j}$, the $i$th m-input of the $j$th neighbour in the experience database, and
$$pos^{j} = class^{j} + side^{j}\cdot (1.0-Memb^{j})$$
is the position value of the single output variable of the $j$th neighbour in the database. The position values of the m-inputs can be grouped into a position vector, $pos_{in}$:
$$pos_{in}=[pos_1, pos_2, \ldots, pos_n]$$
where $n$ represents the number of m-inputs. Similarly,
$$pos_{in}^{j}=[pos_{1}^{j}, pos_{2}^{j}, \ldots, pos_{n}^{j}]$$
represents the corresponding position vector of the $j$th nearest neighbour in the experience database.
The position vectors of the five nearest neighbours are the starting point for computing both types of confidence measures.
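Putting the pieces together, the forecast described above interpolates between the m-outputs of the nearest neighbours in position space. The sketch below assumes inverse-distance weights purely for illustration; the document states only that a weighted sum is used, so the specific weighting scheme here is an assumption:

```python
import math

def predict_output(pos_in, neighbours):
    """Sketch of FIR forecasting in position space.

    pos_in:     position vector [pos_1, ..., pos_n] of the testing input state
    neighbours: list of (pos_in_j, pos_out_j) pairs for the (typically five)
                nearest neighbours: the neighbour's input position vector and
                the position value of its m-output.

    Inverse-distance weighting is an illustrative assumption, not
    necessarily the weighting FIR itself employs.
    """
    dists = [math.dist(pos_in, p_j) for p_j, _ in neighbours]
    # An exact match in the input space is returned directly.
    if any(d == 0.0 for d in dists):
        return next(out for (p_j, out), d in zip(neighbours, dists) if d == 0.0)
    weights = [1.0 / d for d in dists]
    return sum(w * out for w, (_, out) in zip(weights, neighbours)) / sum(weights)
```

Equidistant neighbours contribute equally, and closer neighbours dominate the estimate, mirroring the idea that proximity in the input space determines the interpolation weights in the output space.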

Definition

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.

Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function $k(x,y)$ selected to suit the problem. The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations with parameters $\alpha_i$ of images of feature vectors $x_i$ that occur in the database. With this choice of a hyperplane, the points $x$ in the feature space that are mapped into the hyperplane are defined by the relation: $\textstyle\sum_i \alpha_i k(x_i,x) = \mathrm{constant}$. Note that if $k(x,y)$ becomes small as $y$ grows further away from $x$, each term in the sum measures the degree of closeness of the test point $x$ to the corresponding database point $x_i$. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note that the set of points $x$ mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets that are not convex at all in the original space.
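The relation $\sum_i \alpha_i k(x_i,x) = \mathrm{constant}$ is straightforward to evaluate once the support vectors and coefficients are known. A minimal sketch, assuming a Gaussian RBF kernel and treating the coefficients, bias and data as given inputs (they would normally come from training):

```python
import math

def rbf(x, y, gamma=1.0):
    # Gaussian RBF kernel: close to 1 when y is near x, small when far away.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def decision(x, support_vectors, alphas, b=0.0, kernel=rbf):
    """Evaluate sum_i alpha_i * k(x_i, x) + b for a test point x."""
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, support_vectors)) + b
```

Each term in the sum measures the closeness of the test point to one stored point, exactly as described above; the sign (or level set) of `decision` then discriminates the two classes.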

Nonlinear Classification

The original optimal hyperplane algorithm proposed by Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high dimensional; thus though the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space.

If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimension. Maximum margin classifiers are well regularized, and previously it was widely believed that the infinite dimensions do not spoil the results. However, it has been shown that higher dimensions do increase the generalization error, although the amount is bounded.

• Polynomial (homogeneous): $k(\mathbf{x_i},\mathbf{x_j})=(\mathbf{x_i} \cdot \mathbf{x_j})^d$
• Polynomial (inhomogeneous): $k(\mathbf{x_i},\mathbf{x_j})=(\mathbf{x_i} \cdot \mathbf{x_j} + 1)^d$
• Gaussian radial basis function: $k(\mathbf{x_i},\mathbf{x_j})=\exp(-\gamma \|\mathbf{x_i}-\mathbf{x_j}\|^2)$, for $\gamma > 0$. Sometimes parametrized using $\gamma=1/{2 \sigma^2}$
• Hyperbolic tangent: $k(\mathbf{x_i},\mathbf{x_j})=\tanh(\kappa \mathbf{x_i} \cdot \mathbf{x_j}+c)$, for some (not every) $\kappa > 0$ and $c < 0$
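The four kernels above translate directly into code. The default parameter values below ($d$, $\gamma$, $\kappa$, $c$) are arbitrary illustrations; in practice they are tuned to the problem:

```python
import math

def dot(xi, xj):
    return sum(a * b for a, b in zip(xi, xj))

def poly_homogeneous(xi, xj, d=2):
    # (x_i . x_j)^d
    return dot(xi, xj) ** d

def poly_inhomogeneous(xi, xj, d=2):
    # (x_i . x_j + 1)^d
    return (dot(xi, xj) + 1) ** d

def gaussian_rbf(xi, xj, gamma=0.5):
    # exp(-gamma * ||x_i - x_j||^2), gamma > 0
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def hyperbolic_tangent(xi, xj, kappa=1.0, c=-1.0):
    # tanh(kappa * x_i . x_j + c)
    return math.tanh(kappa * dot(xi, xj) + c)
```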
Algorithm

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the classification phase, $k$ is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the $k$ training samples nearest to that query point.
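The two phases above can be sketched in a few lines; "training" is just storing the samples, and classification is a majority vote among the $k$ nearest. The Euclidean metric here is one common choice, as the next paragraph notes:

```python
import math
from collections import Counter

def knn_classify(query, train, k=3):
    """k-NN majority vote.

    train: list of (feature_vector, label) pairs -- the stored training set.
    query: unlabeled feature vector to classify.
    """
    # Sort stored samples by Euclidean distance to the query, keep the k nearest.
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    # Assign the most frequent label among those k neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```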

A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, $k$-NN has also been employed with correlation coefficients such as Pearson and Spearman. Often, the classification accuracy of $k$-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components analysis.

A drawback of the basic “majority voting” classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the $k$ nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its $k$ nearest neighbors. The class (or value, in regression problems) of each of the $k$ nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. $k$-NN can then be applied to the SOM.

Parameter Selection

The best choice of $k$ depends upon the data; generally, larger values of $k$ reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good $k$ can be selected by various heuristic techniques (see hyperparameter optimization). The special case where the class is predicted to be the class of the closest training sample (i.e. when $k = 1$) is called the nearest neighbor algorithm.

The accuracy of the $k$-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.

In binary (two-class) classification problems, it is helpful to choose $k$ to be an odd number, as this avoids tied votes. One popular way of choosing the empirically optimal $k$ in this setting is via the bootstrap method.
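One simple realisation of the bootstrap idea is to resample the training set, evaluate each candidate $k$ on the out-of-bag samples, and keep the $k$ with the best average accuracy. The scheme below is an illustrative sketch, not the only (or canonical) bootstrap procedure:

```python
import math
import random
from collections import Counter

def knn_classify(query, train, k):
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def bootstrap_select_k(train, candidate_ks, n_rounds=20, seed=0):
    """Pick the candidate k with the best average out-of-bag accuracy
    over bootstrap resamples of the training set (illustrative sketch)."""
    rng = random.Random(seed)
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        accs = []
        for _ in range(n_rounds):
            # Resample with replacement; unused samples form the OOB set.
            sample = [rng.choice(train) for _ in train]
            chosen = {id(s) for s in sample}
            oob = [s for s in train if id(s) not in chosen]
            if not oob:
                continue
            hits = sum(knn_classify(x, sample, k) == y for x, y in oob)
            accs.append(hits / len(oob))
        acc = sum(accs) / len(accs) if accs else 0.0
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```

Restricting `candidate_ks` to odd values enforces the tie-avoidance advice above for binary problems.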

Bearings Fault Detection Using Inference Tools

Electric motors are nowadays widely used in all kinds of industrial applications due to their robustness and ease of control through inverters. Therefore, any effort aimed at improving the condition monitoring techniques applied to them will reduce overall production costs, by decreasing production-line stoppages and increasing industrial efficiency. In this context, the most widely used electric machine in industry is the Induction Motor (IM), due to its simplicity and reduced cost. Analysis of the origin of IM failures shows that the bearings are the major source of faults (Singh et al., 2003), and even a common cause of degradation in other kinds of motors, such as Permanent Magnet Synchronous Machines.