Sharable instructable agent for information filtering
Abstract
This thesis presents a sharable instructable agent for information filtering. It also proposes the use of network-based collaboration in large distributed communities of personalized agents. These agents carry out the tasks related to the access and dissemination of information on the Internet.
A significant part of the activity on the Internet consists of accessing and disseminating information. Information dissemination involves tasks such as document generation, audience discovery, document translation, selection of the dissemination platform, and document dissemination. Information access involves tasks such as document retrieval, document filtering, document translation, and information extraction. An information access and dissemination environment as diverse, dynamic, and large as the Internet (composed of users, documents, and needs) requires the agents and resources that carry out these tasks to be personalized to meet user needs. They must be adaptive because the environment is dynamic. They should also be distributed (residing on several servers), since it may not be possible to accommodate all personalizations of all users for a task on a few servers.
The distributed personalized agents for a given task may overlap significantly in the inputs they receive and the sub-tasks they perform. This overlap, together with an open platform like the Internet, permits collaboration among them during their construction, maintenance, and operation. However, because of the large number of distributed agents, a centralized system cannot manage the collaboration. Networked Distributed Collaboration (NDC) is therefore proposed in this thesis. In NDC, the distributed collaborators performing a task are connected by a semantic network that helps the collaboration flow. This thesis also presents the generic steps to be followed to realize NDC for a given task, discusses the agents needed to manage the massive distributed collaboration in NDC, and proposes NDC schemes for information filtering, information dissemination, web personalization, and lexicons.
Of the tasks presented above, information filtering is particularly important for the following reasons:
(1) Internet users face information overload, and information that conflicts with the values or policies of a community or an organization is also disseminated; because of the borderless nature of the Internet, such information can be stopped only by filtering it at the receiving end.
(2) The personalized agents need to filter out documents that do not belong to their domain of operation.
(3) A user or a group of users can be characterized by a model developed for the documents they receive, and this model may be exploited in document access and dissemination on the Internet.
Hence, an interactive information filtering agent called the Sharable Instructable Information Filtering Agent (SIIFA) has been designed and developed. SIIFA has a modular and comprehensible structure, which is useful in NDC. The operation of SIIFA is centered on a new document analysis and representation scheme based on a directed acyclic graphical structure called a concept network (CNW). SIIFA provides a comprehensible and flexible interaction through which the user constructs a personalized CNW and filters documents with it.
SIIFA uses both sample-based learning (SBL) and dialogue-based learning (DBL) to construct the CNW from the several types of feedback that SIIFA allows. The evaluation results on documents from the comp.ai.neural-networks Usenet newsgroup show that SIIFA performs very well while rendering a comprehensible analysis of documents. This performance comes at the cost of a high initial overhead incurred in constructing the CNW, but the overhead is shown to be worth the investment for long-term and community gains in filtering performance.
The CNW is a new directed acyclic graphical structure whose nodes are called concepts. Concepts represent phrases, topics of discourse, modes of discourse, or other features of documents. The CNW employs filters, called context filters, on its edges (links). Context filters reduce the ambiguity present in their inputs by handling the synonymy and polysemy associated with various concepts. They further reduce ambiguity by handling the local spread properties of the concepts recognized in the given CNW. The important features of the document analysis scheme based on the CNW are: automatic concept-dependent passage extraction from a document; sharability, because the CNW is largely independent of the user profile while remaining dependent on the stream characteristics; and glass-box comprehensibility in its analysis of documents, because of the hierarchical arrangement of concepts, the context filters, and the use of decision lists in context filters and concepts for decision making. Modularity is obtained from the various stages in context filters and from the ability to separate a concept from the CNW.
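The structure described above can be illustrated with a minimal sketch. It is not the CNW implementation from this thesis: the class names `Concept` and `Link`, the `recognized` method, and the representation of a context filter as a simple predicate over the terms of a document are all assumptions made for illustration, and the sketch does not model local spread properties, passage extraction, or decision lists.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Link:
    """Edge from a parent concept to a child concept (hypothetical layout)."""
    child: "Concept"
    # Assumed stand-in for a context filter: returns True when the child's
    # evidence should be admitted for the parent in this document.
    context_filter: Callable[[Set[str]], bool]


@dataclass
class Concept:
    """Node of the directed acyclic graph."""
    name: str
    phrases: Set[str] = field(default_factory=set)  # synonymous surface forms
    links: List[Link] = field(default_factory=list)

    def recognized(self, terms: Set[str]) -> bool:
        # A concept is recognized directly through one of its phrases, or
        # through a child whose evidence passes the filter on the link.
        if self.phrases & terms:
            return True
        return any(
            link.context_filter(terms) and link.child.recognized(terms)
            for link in self.links
        )


# "network" and "net" are treated as synonyms; the word is polysemous, so the
# link admits it only when a neural-network cue word is also present.
network_term = Concept("network_term", phrases={"network", "net"})
cues = {"neural", "neuron", "backpropagation"}
neural_networks = Concept(
    "neural_networks",
    links=[Link(network_term, lambda terms: bool(cues & terms))],
)

on_topic = neural_networks.recognized({"net", "backpropagation", "training"})
off_topic = neural_networks.recognized({"network", "router", "cable"})
```

In this sketch the first document is recognized under `neural_networks` and the second is not, even though both contain a form of "network". Because each link carries its own filter, a concept and its sub-graph can be detached and reused elsewhere, which is the kind of modularity the paragraph above refers to.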
Feedback is supplied on various aspects of the given CNW and on its performance on the documents received by the user. Learning is carried out mainly for the decision lists of the various concepts of the CNW. In DBL mode, SIIFA queries the user about the feedback to improve the learning of decision lists and the construction of the CNW. SIIFA adapts various parameters related to decision list learning and to the interaction in DBL according to the user's responses to the queries. In SBL mode, learning is carried out without any interaction with the user. SBL uses the parameters output by the most recent DBL session.
Decision list learning in SIIFA is facilitated by the decision list algorithms proposed in this thesis: IDLLA for incremental learning of decision lists, DLPrune for pruning decision lists, and MergeDL for combining given decision lists into one decision list. IDLLA performs better than the known incremental learning algorithm CDLA in terms of classification accuracy and induced decision list length on the benchmark datasets used for evaluation. The complex merging algorithms derived in MergeDL are found to perform better than simple merging strategies. DLPrune is an important step in both IDLLA and MergeDL.
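For readers unfamiliar with decision lists, the following minimal sketch shows the general data structure only: an ordered list of rules in which the first matching rule decides the label, with a default label when no rule matches. It does not implement IDLLA, DLPrune, or MergeDL. The names `Rule`, `DecisionList`, and `classify`, and the choice of term-set conjunctions as rule tests, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Rule:
    """A test and a label; the test fires when all required terms are present."""
    required_terms: FrozenSet[str]
    label: str

    def matches(self, terms: Set[str]) -> bool:
        return self.required_terms <= terms


@dataclass
class DecisionList:
    rules: List[Rule]
    default: str

    def classify(self, terms: Set[str]) -> Tuple[str, Optional[int]]:
        """Return the label and the index of the rule that fired (None = default)."""
        for index, rule in enumerate(self.rules):
            if rule.matches(terms):
                return rule.label, index
        return self.default, None


# Hypothetical example data: rule order matters, earlier rules take precedence.
dl = DecisionList(
    rules=[
        Rule(frozenset({"neural", "network"}), "relevant"),
        Rule(frozenset({"sale"}), "irrelevant"),
    ],
    default="unknown",
)

first = dl.classify({"neural", "network", "sale"})
second = dl.classify({"sale", "discount"})
third = dl.classify({"perceptron"})
```

Returning the index of the rule that fired makes each decision traceable to a single rule, which is one reason decision lists suit a glass-box analysis of documents.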
The main contributions of the thesis are:
(1) Development of CNW and a comprehensible document analysis and representation scheme based on it;
(2) Postulation of the characteristics of the information filtering domain;
(3) Development and evaluation of SIIFA;
(4) Development of various feedback handling procedures for SIIFA;
(5) Proposal, development, and evaluation of various decision list algorithms (IDLLA, MergeDL, and DLPrune);
(6) Proposal of NDC; and
(7) Proposal of NDC schemes for information filtering, information dissemination, web personalization, and lexicons.

