This is the authors' working translation. The professional translation was published in the journal Pattern Recognition and Image Analysis, Vol. 15, No. 2, 2005, pp. 362-364.

Image Analysis Domain Ontology’s Representation for Information Retrieval Optimization

V.N. Beloozerov, I.B. Gurevich, Yu.O. Trusova

Scientific Council on the Complex Problem “Cybernetics”, Russian Academy of Sciences

119991 Moscow, GSP-1, ul. Vavilova, 40

igourevi@ccas.ru, lcmi@ccas.ru

The problem of information retrieval in the domain of operations on image data is considered. A lexical thesaurus is proposed as a model of the subject domain ontology, with the aim of improving search engine output.

Introduction

Information support is of paramount importance for research in the topical and rapidly developing field of image data analysis and understanding. A search for information in modern global networks and large universal data banks cannot succeed if the mere coincidence of words in the request and in a document serves as the relevance criterion. Search engines use various complex criteria that take into account the statistical properties of words and their grammatical variability. However, experience in developing information retrieval systems for scientific and technological information has shown that high retrieval quality can be achieved only on the basis of semantic analysis of the request and of the texts of the information resources. An ontological model of the subject domain of the search should therefore be incorporated into the system. Such a model would permit analysis of meaning by assigning each text its proper place among the real objects denoted by the notions treated in the text. The proximity of these ontological representations of the texts would serve as the criterion for the relevance decision. The human cognitive sphere contains such an ontological model in the form of a network of mental gestalts (ideas) and logical notions. Systems dealing with natural language texts could contain an analogous network in the form of a dictionary that lists semantic characteristics and cross-references for each term denoting ontological realia.

The need to develop universal and global dictionaries impedes the full-scale use of the ontological dictionary concept for searching universal multi-branch documentation resources and global networks. Nevertheless, a restricted representation of the ontology of an isolated subject domain is sufficient for a search engine to detect the relevant information in that domain successfully. Thus, for information retrieval in the domain of image analysis, processing, and understanding, only domain-specific notions have to be provided (together with a limited number of general terms). Developing such a restricted dictionary is a quite feasible task.

Ontology representation by classification systems

A search engine may contain an ontology representation with different degrees of completeness. In the simplest case, the ontology may be represented by a classification scheme of notions. The notions themselves are a kind of generalized classes of mental gestalts that mirror the objects of reality. In an automated system, these classes are to be realized by software links between the terms naming the notions.

The features that indicate whether a notion belongs to one class or another arise as signs of its interaction with an investigator. In the domain of image data handling, four main classes of notions are readily distinguished: images (objects of handling), processes of handling, instruments of handling, and handling tasks. Qualitative words make up two more classes: properties (of objects, processes, instruments, and tasks) and other words of a general character. Finally, the system must reflect its own nature, in which all objects are represented by texts; hence, the names of the texts form a category of description terms.

According to the classical theory of classification, the upper-level categories should be divided successively into smaller classes, using a single common feature at each step of division. The resulting tree structure of classes exists in almost every information system as its catalogue. However, searching in the catalogue restricts the search field to the resources that were previously classified and included in some catalogue division. A classification system of notions makes it possible to organize effective retrieval not only in previously developed catalogues but also in a free navigation mode across the space of information resources. In response to a request for any object, a search instruction is to be formed automatically that includes not only the name of the search object but also all terms of the subclasses of the search object class. As a result, the calculation of the relevance criterion will take into account all kinds of the search object.
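The automatic formation of a search instruction from a class and all of its subclasses can be illustrated with a minimal Python sketch. The class names, the `SUBCLASSES` table, both function names, and the quoted `OR` syntax of the instruction are assumptions made for illustration only; they are not taken from the authors' classification or from any particular search engine.

```python
from typing import Dict, List

# Hypothetical fragment of a classification tree:
# class name -> names of its immediate subclasses.
SUBCLASSES: Dict[str, List[str]] = {
    "filter": ["linear filter", "nonlinear filter"],
    "linear filter": ["Gaussian filter"],
    "nonlinear filter": ["median filter"],
}


def expand_with_subclasses(term: str, tree: Dict[str, List[str]]) -> List[str]:
    """Return the term followed by the terms of all its subclasses (depth-first)."""
    result = [term]
    for child in tree.get(term, []):
        result.extend(expand_with_subclasses(child, tree))
    return result


def build_search_instruction(term: str, tree: Dict[str, List[str]]) -> str:
    """Join the expanded terms into one query string (assumed OR syntax)."""
    return " OR ".join(f'"{t}"' for t in expand_with_subclasses(term, tree))


if __name__ == "__main__":
    print(build_search_instruction("filter", SUBCLASSES))
```

A request for "filter" is thus expanded to cover every kind of filter recorded in the tree, while a request for a leaf class such as "median filter" stays unchanged.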

Ontology representation by a thesaurus

Tree structures do not fully reflect the relations between ontologically and pragmatically significant notions. One and the same class may be divided on different bases, producing different intersecting series of subclasses. (For example, the subclasses image description and structural description of the description class intersect in the notion structural image description.) Many notions may be treated, in different aspects, as subclasses of two or more broader notions. These phenomena show that the structure of relations between notions is not a tree but, in the general case, a directed graph without cycles. Such a network of notions can be represented as a tree structure of terms if identical terms are allowed to appear in different nodes. This results in a “tree with glued-together nodes”. For instance, the notion high-pass linear filter occupies two nodes: as a subclass of high-pass filter and as a subclass of linear filter. A list of class names, each with its immediately broader superclasses and narrower subclasses, is an adequate representation of the structure of notions. This leads us to represent the classification system as a dictionary whose lexical links indicate the interconnections between notions arising from the inclusion relations between notion extensions. An immediate generalization of this idea is to take into account terms with coinciding notion extensions (synonyms) as well as terms with partly coinciding extensions. Special links are to be established in the dictionary between such terms.

A synonymy link indicates that the linked terms have the same ontological denotation. These links permit retrieval of the search objects under alternative names and evidently raise search recall. In particular, synonymy links must be registered between different orthographic variants, which is of great importance for English, where British and American norms differ; for instance, color (Am) = colour (Brit). If notion extensions intersect only partly, a lexical link is expedient when the notions consist mainly of objects from the intersection. Such associative links make it possible to retrieve documents that are highly likely to contain the needed information. This may be very valuable when the direct search yields poor recall.

Additional information can also be obtained by searching on terms whose notions do not intersect, provided they are connected through ontological substances or processes. The most obvious substantial interconnection is the link between a whole object and its parts; this link should therefore enter the dictionary structure. The operational interconnections that have a great influence on image analysis processes are the following:
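A dictionary that lists the immediately broader terms for each class name can be sketched as follows. The example uses the filter terms mentioned in the text; the common superclass "filter", the table name `BROADER`, and the helper functions are assumptions introduced for illustration. The sketch derives the narrower-term links by inverting the broader-term links and shows that a term with two superclasses is still reached only once.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Each term lists its immediately broader terms.
# "filter" as a shared superclass is a hypothetical addition.
BROADER: Dict[str, List[str]] = {
    "filter": [],
    "high-pass filter": ["filter"],
    "linear filter": ["filter"],
    "high-pass linear filter": ["high-pass filter", "linear filter"],
}


def invert(broader: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Derive the narrower-term links from the broader-term links."""
    narrower: Dict[str, List[str]] = defaultdict(list)
    for term, parents in broader.items():
        for parent in parents:
            narrower[parent].append(term)
    return dict(narrower)


def all_narrower(term: str, narrower: Dict[str, List[str]]) -> Set[str]:
    """Collect every term below `term` in the acyclic graph, without duplicates."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in narrower.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


if __name__ == "__main__":
    NARROWER = invert(BROADER)
    print(sorted(all_narrower("filter", NARROWER)))
```

Storing only the immediate links keeps each dictionary entry short; the full set of subclasses is recovered by traversal when a search instruction is built.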
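Synonymy links, including links between orthographic variants, can be modelled as sets of equivalent terms. In the minimal sketch below, the pair color/colour comes from the text, while the second pair, the names `SYNONYM_SETS`, `build_index`, and `equivalents` are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List

# Sets of terms with the same ontological denotation.
SYNONYM_SETS: List[FrozenSet[str]] = [
    frozenset({"color", "colour"}),                     # Am / Brit, from the text
    frozenset({"grayscale image", "greyscale image"}),  # hypothetical example
]


def build_index(sets: List[FrozenSet[str]]) -> Dict[str, FrozenSet[str]]:
    """Map every term to the synonym set that contains it."""
    index: Dict[str, FrozenSet[str]] = {}
    for synonym_set in sets:
        for term in synonym_set:
            index[term] = synonym_set
    return index


def equivalents(term: str, index: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return all alternative names of a term, or the term alone if none are known."""
    return sorted(index.get(term, frozenset({term})))


if __name__ == "__main__":
    index = build_index(SYNONYM_SETS)
    print(equivalents("colour", index))
```

Adding every equivalent name to the search instruction is what raises recall: a document that uses only the British spelling is still found by a request written in the American one.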

Kind of image - Processing method

Processing method - Result

Instrument - Processing method

Property - Bearer of the property
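One possible way to store the whole-part link and the four operational links listed above is as typed links between terms. The following Python sketch is only an illustration: the type names, the `Link` record, the `related_terms` helper, and all example terms are hypothetical and do not reproduce the authors' thesaurus.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class LinkType(Enum):
    WHOLE_PART = "whole - part"
    IMAGE_KIND_METHOD = "kind of image - processing method"
    METHOD_RESULT = "processing method - result"
    INSTRUMENT_METHOD = "instrument - processing method"
    PROPERTY_BEARER = "property - bearer of the property"


@dataclass(frozen=True)
class Link:
    source: str
    link_type: LinkType
    target: str


# Hypothetical example links, one per link type.
LINKS: List[Link] = [
    Link("noisy image", LinkType.IMAGE_KIND_METHOD, "median filtering"),
    Link("median filtering", LinkType.METHOD_RESULT, "denoised image"),
    Link("median filter", LinkType.INSTRUMENT_METHOD, "median filtering"),
    Link("contrast", LinkType.PROPERTY_BEARER, "image"),
    Link("image", LinkType.WHOLE_PART, "pixel"),
]


def related_terms(term: str, links: List[Link]) -> List[Tuple[LinkType, str]]:
    """Return the terms linked to `term` in either direction, with the link type."""
    found: List[Tuple[LinkType, str]] = []
    for link in links:
        if link.source == term:
            found.append((link.link_type, link.target))
        elif link.target == term:
            found.append((link.link_type, link.source))
    return found


if __name__ == "__main__":
    for link_type, other in related_terms("median filtering", LINKS):
        print(f"{link_type.value}: {other}")
```

Keeping the link type with each pair lets the search software treat, for example, instrument terms differently from result terms when it extends a request.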

Including all these interconnections in a dictionary brings us to the concept of the information retrieval thesaurus (IRT). The IRT is a conventional retrieval tool in systems of scientific and technological information, as confirmed by a series of national and international standards. An IRT makes it possible to classify information resources during the search process, on the basis of the requirements of an individual request. If the search instruction contains the ontological links, not only will new documents appear in the output, but the first rows of the relevance-ordered output will be occupied by documents that do not mention the search object in passing but treat it in detail, considering its properties, parts, functions in supersystems, and processing operations. This is achieved by accounting for the linked terms in the relevance calculation. A statistical investigation of Internet search engines showed that the share of pertinent resources on the first pages of output doubled. Even more effective retrieval will be achieved when specially developed software analyses thesaurus links in the texts of information resources.

A more precise simulation of the domain ontology might include a procedural part that transforms the thesaurus links into activity corresponding to ontological processes. This role is performed by the information search software, which extends the search operations from general terms to specific ones and from the whole to its parts, proceeds from a cause to its effects and from input material to output data, brings in instrument terms for executing given processes, etc. A system with a different purpose might be a different model of the same ontology with a different procedural part, for instance, one with functions for automatic planning of image processing.

Use of a thesaurus for information retrieval

The spontaneous way in which Internet resources are created rules out any hope of preliminary classification of the materials or of indexing with IRT terms, as is done in the sphere of scientific and technological information. On the whole, the procedure for applying a thesaurus to intelligent search in global networks and large documentation banks may be described as follows. A software interface will be developed to generate a search instruction from the text of the end user's information request. The search instruction will be designed so as to optimize the retrieval performance of the search engine used. The instruction will include the request terms and all the synonyms, specific terms, and terms of immediately connected notions indicated in the thesaurus. The output will consist of the first pages of the search engine results, ordered in accordance with the engine's relevance criterion. The items of this output will be reordered by a new relevance criterion, whose calculation accounts for the different weights of the request terms given by the end user and of the terms drawn from the thesaurus for the search. The items in the first rows of this list will be, with the greatest probability, the documents pertinent to the user.
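The procedure described above can be sketched in Python under strong simplifying assumptions. The thesaurus fragment, the weight values, and the scoring rule (the sum of the weights of the terms that occur in a document) are all hypothetical; the text does not specify the actual relevance formula, and the search engine output is represented here simply by a list of document texts.

```python
from typing import Dict, List

# Hypothetical thesaurus fragment: term -> linked terms by link kind.
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "edge detection": {
        "synonyms": ["edge extraction"],
        "narrower": ["Canny edge detector"],
        "related": ["gradient operator"],
    },
}

# Assumed weights: terms given by the end user count most, associated terms least.
WEIGHTS: Dict[str, float] = {
    "request": 1.0, "synonyms": 1.0, "narrower": 0.7, "related": 0.4,
}


def expand_request(terms: List[str],
                   thesaurus: Dict[str, Dict[str, List[str]]]) -> Dict[str, float]:
    """Return the request terms plus the linked thesaurus terms, each with a weight."""
    weighted: Dict[str, float] = {term: WEIGHTS["request"] for term in terms}
    for term in terms:
        entry = thesaurus.get(term, {})
        for kind in ("synonyms", "narrower", "related"):
            for linked in entry.get(kind, []):
                # Keep the highest weight if a term is reached in several ways.
                weighted[linked] = max(weighted.get(linked, 0.0), WEIGHTS[kind])
    return weighted


def score(text: str, weighted_terms: Dict[str, float]) -> float:
    """Sum the weights of the terms that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(w for term, w in weighted_terms.items() if term.lower() in lowered)


def rerank(documents: List[str], weighted_terms: Dict[str, float]) -> List[str]:
    """Reorder the search engine output by the new relevance criterion."""
    return sorted(documents, key=lambda d: score(d, weighted_terms), reverse=True)


if __name__ == "__main__":
    output = [
        "A survey of image compression",
        "Edge detection with the Canny edge detector and a gradient operator",
        "Edge extraction basics",
    ]
    for doc in rerank(output, expand_request(["edge detection"], THESAURUS)):
        print(doc)
```

Documents that mention several linked terms rise to the top of the list, which corresponds to the idea that a document treating the search object in detail is more likely to be pertinent than one that merely names it.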

Internet resource on image analysis

The Scientific Council on Cybernetics of the Russian Academy of Sciences (RAN) is now developing an Internet resource on image processing, analysis, and recognition that will concentrate access to information sources [1]. Information access will be supported by a semantic interface based on the Thesaurus for Image Analysis (TIA). The resource will contain: 1) a bibliographic database of papers, monographs, and electronic publications; 2) a catalogue of Internet resources with the relevant data; 3) a terminological reference base on image processing, analysis, and recognition; 4) means of semantic access to the above-mentioned components and to external Internet resources. All these components rely on the semantic TIA links to discover the needed data on the basis of conformity between the meanings of the request and the document.

We now have a TIA version that indicates the links defined above among about 1340 terms (in English and Russian forms), including 230 in the image category, 535 in image processing, 165 in image analysis, and 110 in pattern recognition. Because the conceptual system of our subject domain is insufficiently structured and rather unstable, TIA will be continuously replenished with new terms and alternative names for objects, and it will incorporate new semantic relations as experience of the practical use of TIA accumulates.

Acknowledgement

This work was partially supported by the Russian Foundation for Basic Research, grant No. 04-07-90187.

References

1. V. N. Beloozerov, I. B. Gurevich, D. M. Murashov, Yu. O. Trusova. The concept of creating an Internet resource on image processing, analysis, and recognition // Abstracts of the 7th All-Russian Conference with Participation of CIS Countries “Methods and Means of Processing Complex Graphic Information”, Nizhny Novgorod, September 15-18, 2003. - pp. 11-12 (in Russian).