Mining linguistic associations from numerical data

By | 6.1.2016

1. Introduction

One of the data mining methods under development at our institute is the mining of linguistic associations from numerical data [3],[4]. Data mining is regarded as a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable knowledge in large-scale data sets [1]. Particularly interesting are associations that reflect relationships among items in data sets. Recall that, in general, an association expresses a specific semantic link between data items: if X ~ Y is such an association, then “occurrence of X is associated with occurrence of Y”, where X and Y are attributes of data items.

2. Focus of our research

It should be emphasized that in many situations discovering associations involves uncertainty and imprecision (vagueness), so tools that capture this feature become necessary. Vagueness is inherent in many problems of knowledge representation and discovery; moreover, high-level complex decision processes often deal with generalized concepts and linguistic expressions, which are inherently vague.

We present a direct method for mining associations that characterize relations among attributes using natural language. Since the mined associations have the form of natural language sentences, we call them linguistic associations. A typical linguistic association is

IF number of cars per hour is very big AND wind speed is small

THEN concentration of NO2 is more or less big.

Our method captures the genuine linguistic meaning of such associations. This is achieved using a formal logical theory which provides a mathematical model of the meaning of a special but very important class of natural language expressions, which are called evaluative linguistic expressions.

3. Description of main results

We developed a method for mining the associations described above from data. The method has two phases. First, we replace the numerical data by appropriate evaluative expressions (such as very small, roughly big, etc.) according to their meaning; this procedure is analogous to what people do when they assign perceptions to observed phenomena. Second, we apply a data mining procedure, which can be any suitable classical method that works with logical or categorical data. We have used the classical GUHA method [2].

Classical mining methods also work for linguistic associations because the assigned perceptions behave as if they were logical (categorical) data, so the mining procedure may treat them formally without considering their original meaning. The linguistic meaning of the found associations, together with its vagueness, is nevertheless preserved, and the associations can be treated accordingly. We see the main benefits of linguistic associations in three points: they are easy (or at least easier) for the user to understand; their logical properties can be used to reduce their number significantly; and their vague meaning permits a less strict interpretation, which matches the uncertainty about whether the relation they describe really exists.
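The two phases described above can be illustrated with a minimal Python sketch. It is not the LAM implementation and does not reproduce the formal theory of evaluative expressions or the GUHA quantifiers. The piecewise-linear shapes for "small", "medium" and "big", the contexts (value ranges) of the attributes, the toy records and all function names are assumptions made only for this illustration, and hedges such as "very" or "more or less" are left out. Phase two is reduced to counting support and confidence of one candidate association over the labelled data.

```python
def normalize(value, low, high):
    """Map a raw value into [0, 1] relative to an assumed context (low, high)."""
    return min(1.0, max(0.0, (value - low) / (high - low)))


def memberships(u):
    """Assumed piecewise-linear meanings of three expressions on [0, 1]."""
    return {
        "small": max(0.0, 1.0 - 2.0 * u),
        "medium": max(0.0, 1.0 - abs(2.0 * u - 1.0)),
        "big": max(0.0, 2.0 * u - 1.0),
    }


def to_expression(value, low, high):
    """Phase 1: replace a number by the expression that fits it best."""
    degrees = memberships(normalize(value, low, high))
    return max(degrees, key=degrees.get)


def support_and_confidence(rows, antecedent, consequent):
    """Phase 2 (simplified): treat the labels as plain categorical data."""
    matching = [r for r in rows
                if all(r[k] == v for k, v in antecedent.items())]
    both = [r for r in matching
            if all(r[k] == v for k, v in consequent.items())]
    confidence = len(both) / len(matching) if matching else 0.0
    return len(both), confidence


# Hypothetical contexts and measurements, invented for the example.
contexts = {"cars": (0, 2000), "wind": (0, 10), "no2": (0, 200)}
records = [
    {"cars": 1900, "wind": 1.0, "no2": 180},
    {"cars": 1800, "wind": 2.0, "no2": 170},
    {"cars": 1700, "wind": 1.5, "no2": 60},
    {"cars": 200, "wind": 9.0, "no2": 20},
]

labelled = [
    {name: to_expression(value, *contexts[name]) for name, value in r.items()}
    for r in records
]

support, confidence = support_and_confidence(
    labelled,
    antecedent={"cars": "big", "wind": "small"},
    consequent={"no2": "big"},
)
print(labelled)
print(support, confidence)
```

Once the data are labelled, the second function never looks at the original numbers. This mirrors the point made above that the mining step can handle the assigned expressions formally, while their linguistic meaning stays available for interpreting the result.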

We have implemented this method in special experimental software called LAM (Linguistic Associations Mining), see Section 4. We tested the method on several standard data sets, such as the Boston Housing dataset from the StatLib library. The obtained associations are formulated in natural language. Hence, they can help experts from various fields discover new dependencies in a form that is much closer to their own knowledge and way of thinking. Moreover, the discovered associations that characterize real dependencies can be taken directly as fuzzy IF-THEN rules and used as expert knowledge about the problem.

We also developed and implemented a second method, which uses the fuzzy transform [4]. The antecedent of the found associations consists of expressions of the form “X is Fn(y)”, where X is an attribute and Fn(y) is a fuzzy number that represents the meaning of the linguistic expression “approximately y”, where y is a real number. The consequent is a linguistic expression “B average Z”, where Z is an attribute and B is an evaluative linguistic expression (such as “big”, “roughly medium”, “extremely small”), for example “very small average concentration of gas”. A typical example of such a linguistic association is

IF number of cars per hour is approximately 1000

AND wind speed is approximately 5 m per second

THEN average concentration of NO2 is more or less big.
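To make the form of such an association more concrete, the following Python sketch computes the weighted average that stands behind a consequent of the kind "average Z". It assumes triangular fuzzy numbers for "approximately y" and combines the antecedent conditions by product, in the spirit of a discrete fuzzy transform component. The spreads, the toy records and the function names are hypothetical, and the final step of describing the average by an evaluative expression (e.g. "more or less big") is omitted.

```python
def approximately(center, spread):
    """Assumed triangular fuzzy number for 'approximately center'."""
    def membership(x):
        return max(0.0, 1.0 - abs(x - center) / spread)
    return membership


def weighted_average(rows, antecedent, target):
    """Average of `target`, each row weighted by how well it fits the antecedent."""
    numerator = 0.0
    denominator = 0.0
    for row in rows:
        weight = 1.0
        for attribute, fuzzy_number in antecedent.items():
            weight *= fuzzy_number(row[attribute])  # AND modelled by product
        numerator += weight * row[target]
        denominator += weight
    return numerator / denominator if denominator > 0 else None


# Hypothetical measurements, invented for the example.
records = [
    {"cars": 1000, "wind": 5.0, "no2": 150.0},  # fits fully, weight 1.0
    {"cars": 900, "wind": 5.0, "no2": 130.0},   # fits partly, weight 0.5
    {"cars": 1000, "wind": 9.0, "no2": 40.0},   # wind too far off, weight 0.0
]

antecedent = {
    "cars": approximately(1000, spread=200),
    "wind": approximately(5.0, spread=2.0),
}

average_no2 = weighted_average(records, antecedent, target="no2")
print(average_no2)
```

Records close to the centres (1000 cars per hour, 5 m per second) dominate the average, and records outside the spreads do not contribute at all.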

4. Demonstration

In the following Flash presentation you can see the LAM program in operation. First, the data file and the file defining the antecedent and consequent variables are loaded. Then, after optionally changing several parameters, the mining of linguistic associations is started. The found associations are displayed and can be saved in the form of a linguistic description (extension .rb), which can then be loaded into the LFLC software.

REFERENCES:

[1] FAYYAD, U., PIATETSKY-SHAPIRO, G. AND SMYTH, P.: From data mining to knowledge discovery: An overview. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pp. 1-30. AAAI Press/The MIT Press, MA, U.S.A., 1996.

[2] HÁJEK, P., HAVRÁNEK, T.: Mechanizing Hypothesis Formation (Mathematical Foundations for a General Theory). Springer-Verlag, Berlin-Heidelberg-New York, 1978.

[3] NOVÁK, V., PERFILIEVA, I., DVOŘÁK, A., CHEN, Q., WEI, Q., YAN, P. Mining pure linguistic associations from numerical data. In International Journal of Approximate Reasoning, 48, 2008, pp. 4-22, ISSN 0888-613X.

[4] PERFILIEVA, I., NOVÁK, V., DVOŘÁK, A. Fuzzy transform in the analysis of data. In International Journal of Approximate Reasoning, 48, 2008, pp. 36-46, ISSN 0888-613X.

Flash presentation illustrating LAM software in use
