Finding Optimal Rank for LSI Models
Abstract: Latent Semantic Indexing (LSI) is a powerful linear-algebraic method for dimension reduction. It is also very useful in addressing the synonymy problem in textual corpora. A corpus of documents, each represented as a bag of features, is expressed as a Term-Document Matrix (TDM), where a term represents a feature. A TDM can also be viewed as a record of an experiment repeated several times on an unknown system, where a term of the TDM represents an unknown variable of the system and a document of the TDM represents one iteration of the experiment. Using LSI, the large hyperspace of a corpus or a system can be decomposed into three smaller matrices (the left singular matrix 'U', the right singular matrix 'V', and the diagonal matrix of singular values 'S') as a function of the rank 'K', a scalar value. The rank should be as small as possible while still allowing the hyperspace to be represented in a subspace without much data loss. The choice of rank 'K' is critical: if the value chosen is smaller than optimal, the derived subspace representation is rendered useless because the data loss can become high. We propose a method to mathematically derive the optimal rank, which ensures the best subspace representation of a large hyperspace TDM in reduced dimension. We demonstrate the efficiency of our method by comparing the accuracy of synonymy measurements made on reduced-dimension subspaces that are cut at different 'K' values.
Index: Accuracy Measurement, Diagonal Matrix, Dimension Reduction, Hyperspace, LSI, Optimal Rank, Singular Matrix, Singular values, Sub-space, Synonymy, Term Document Matrix
Reference: Sudarsun S, Venkatesh Prabhu, "Finding Optimal Rank for LSI Models", Proceedings of ICAET 2010, pp. 2010
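To illustrate why the choice of 'K' matters, the sketch below applies a common energy-retention heuristic to a list of singular values. It is not the derivation proposed in the paper, which the abstract does not spell out. The function name `rank_for_energy`, the 90% threshold, and the sample singular values are assumptions made purely for illustration.

```python
def rank_for_energy(singular_values, threshold=0.9):
    """Return the smallest K whose leading singular values retain at least
    `threshold` of the total squared singular-value energy.

    Illustrative heuristic only; not the method proposed in the paper.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    # Work on the values in descending order, as an SVD would return them.
    values = sorted(singular_values, reverse=True)
    total = sum(s * s for s in values)
    if total == 0:
        return 0  # An all-zero (or empty) spectrum has no energy to retain.
    cumulative = 0.0
    for k, s in enumerate(values, start=1):
        cumulative += s * s
        if cumulative >= threshold * total:
            return k
    return len(values)


# Hypothetical singular values of a small term-document matrix.
example_singular_values = [10.0, 5.0, 2.0, 1.0]
print(rank_for_energy(example_singular_values, threshold=0.9))
```

Lowering the threshold yields a smaller 'K' and a more compact subspace at the cost of more data loss, which is the trade-off the abstract describes.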
Topic Models based Personalized Spam Filter
Abstract: Spam filtering poses a critical problem in text categorization because the features of the text are continuously changing. Spam evolves continuously, which makes it difficult for a filter to classify new, evasive feature patterns. Because most practical applications are based on online user feedback, the task calls for fast, incremental, and robust learning algorithms. This paper presents a system for the automatic detection and filtering of unsolicited electronic messages. We have developed a content-based classifier that uses two topic models, LSI and PLSA, complemented by a natural language approach based on text pattern matching. By combining these powerful statistical and NLP techniques, we obtained a parallel content-based spam filter that performs the filtration in two stages. In the first stage, each model generates its individual predictions; in the second stage, these predictions are combined by a voting mechanism.
Index: Dimension Reduction, LSA, N-Gram, PCA, PLSA, Spam Filter, Topic Models, Vectorization
Reference: Sudarsun S, Venkatesh Prabhu, Valarmathi B, "Topic Models based Personalized Spam Filter", Proceedings of ISCF 2006, pp. 199-203, 2006
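The two-stage design in the abstract above can be sketched as a simple majority vote over per-model labels. The abstract does not state the exact voting rule, so plain majority voting is an assumption here, as are the function name `majority_vote`, the model keys, and the choice to return `None` on a tie.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model labels by simple majority.

    `predictions` maps a model name to the label that model predicted in
    the first stage. Returns the winning label, or None when the top two
    labels tie (an assumed convention, not taken from the paper).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions.values())
    (top_label, top_count), *rest = counts.most_common()
    # A tie for first place is treated as undecided.
    if rest and rest[0][1] == top_count:
        return None
    return top_label


# Hypothetical first-stage outputs from the three classifiers.
stage_one = {"lsi": "spam", "plsa": "ham", "pattern_matcher": "spam"}
print(majority_vote(stage_one))
```

With three binary voters a tie cannot occur, so the `None` branch only matters if the number of models or labels changes.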
Unsupervised Contextual Keyword Relevance Learning and Measurement using PLSA
Abstract: We have developed a probabilistic approach using PLSA for the discovery and analysis of contextual keyword relevance based on the distribution of keywords across a training text corpus. We have used probabilistic inference to perform adaptive document segmentation, keyword classification, and collaborative recommendation. We show experimentally the flexibility of this approach in classifying keywords into different domains based on their context. We also discuss the parameters that control PLSA performance, including 1) the number of aspects, 2) the number of EM iterations, and 3) the weighting functions applied to the TDM (pre-weighting), and their role in the quality of relevance estimates. We estimated the quality by computing precision-recall (P-R) scores. We present our experiments on PLSA models built with varying corpus sizes and varying numbers of document domains.
Index: SVD, Synonymy, Polysemy, Unsupervised Clustering, PLSA, Aspect model, Keyword Relevance
Reference: Sudarsun S, Dalou Kalaivendhan, Venkateswarlu M, "Unsupervised Contextual Keyword Relevance Learning and Measurement using PLSA", Proceedings of IEEE INDICON 2006, 2006
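As a reminder of how the P-R scores mentioned in the abstract above are typically computed, here is a minimal sketch using the standard set-based definitions of precision and recall. The function name `precision_recall` and the sample keywords are hypothetical; the paper's actual evaluation protocol is not described in the abstract.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    An empty denominator yields 0.0 (an assumed convention).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical keywords returned by a model versus a hand-labelled set.
returned_keywords = ["engine", "wheel", "brake", "guitar"]
relevant_keywords = ["engine", "wheel", "clutch"]
print(precision_recall(returned_keywords, relevant_keywords))
```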
Role of Weighting on TDM in Improvising Performance of LSA on Text Data
Abstract: In this paper, we show that the efficiency of LSA is significantly controlled by the choice of the weighting algorithm applied. These weighting algorithms assign relative importance to the document attributes (e.g., keywords) based on their occurrences in the corpus. We evaluated various weighting algorithms and studied their effects as measured by P-R values. Our experiments include applying weighting functions to the TDM (pre-weighting) in order to increase or decrease the relative importance of words based on their occurrence. We also evaluated the application of weighting functions to the projected query (post-weighting). Post-weighted keyword queries were projected onto an LSA model built on a pre-weighted TDM to obtain closely correlated keywords or a document (keyword collection).
Index: Information Retrieval, Weighting Functions, IDF, IWF, WIDF, NDV, LSA, SVD, Precision, Recall, TDM
Reference: Sudarsun S, Venkatesh Prabhu G, Sathishkumar V, "Role of Weighting on TDM in Improvising Performance of LSA on Text Data", Proc of IEEE INDICON 2006, 2006
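Pre-weighting, as described in the abstract above, can be illustrated with IDF, one of the weighting functions listed in the index terms. The sketch below assumes the common variant idf = log(N / df) and a TDM stored as one row per term; the exact formulas evaluated in the paper are not given in the abstract, and the function name `idf_preweight` is hypothetical.

```python
import math


def idf_preweight(tdm):
    """Scale each term row of a term-document matrix by its IDF.

    `tdm` is a list of rows, one per term, each holding one raw count
    per document. Uses idf = log(N / df), where N is the number of
    documents and df is the number of documents containing the term.
    Terms that occur in no document get a weight of 0.0.
    """
    n_docs = len(tdm[0]) if tdm else 0
    weighted = []
    for row in tdm:
        doc_freq = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        weighted.append([count * idf for count in row])
    return weighted


# Hypothetical counts: term 0 occurs in every document, term 1 in only one.
raw_tdm = [
    [1, 1, 1],
    [2, 0, 0],
]
for row in idf_preweight(raw_tdm):
    print(row)
```

A term that appears in every document receives zero weight, while a rare term is amplified, which is the increase or decrease in relative importance that the abstract refers to.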
MANET: An Alternative Approach to Reduce Flooding by Propagating Neighborhood Information
Abstract: Mobile Ad Hoc Networks (MANETs) exemplify a complex distributed network characterized by the lack of any infrastructure. This lack of infrastructure makes connection establishment costly in terms of time and resources, and the network is mostly affected by connection-request flooding. The proposed approach presents a way to reduce flooding in MANETs, built on the concept of sharing neighborhood information. A node exposes its neighborhood peers to another node, referred to as its friend node, that has requested or forwarded a connection request. If there is a high probability that the friend node will communicate through the exposed routes, this could improve the efficiency of bandwidth utilization by reducing flooding, as the routes have been acquired without any broadcasts. The nodes store the neighborhood information in their cache, which is periodically verified for consistency. Inconsistent routes are erased, rather than updated, after a record-validity period. The vicinity information is tracked based on an "I'm alive" signal sent to other nodes. These broadcasts are limited to a hop count of one and are executed when network activity is low.
Index: MANET, Flooding, Friend node, routing, I'm alive signal, broadcasting, connection-request
Reference: Balakrishan C, Sudarsun S, Bharathi Mani, Srinivasa Raghavan, "MANET: An Alternative Approach to Reduce Flooding by Propagating Neighborhood Information", Proc of INCON-CCC 2004 International Conference on Computers, Control and Communication, Aug 2004.
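The neighborhood cache described in the abstract above can be sketched as a table of last-heard timestamps that is purged once entries outlive the record-validity period. This is one possible reading of the abstract, not the paper's implementation; the class name `NeighborCache`, its methods, and the time units are all hypothetical.

```python
class NeighborCache:
    """Tracks one-hop neighbors from their "I'm alive" signals.

    Entries that have not been refreshed within `validity_period` time
    units are erased on the next purge, rather than being repaired.
    """

    def __init__(self, validity_period):
        self.validity_period = validity_period
        self._last_heard = {}  # node id -> time of the last alive signal

    def record_alive(self, node_id, now):
        """Note that an alive signal from `node_id` arrived at time `now`."""
        self._last_heard[node_id] = now

    def purge_stale(self, now):
        """Erase neighbors whose record has expired; return their ids."""
        stale = [
            node_id
            for node_id, heard_at in self._last_heard.items()
            if now - heard_at > self.validity_period
        ]
        for node_id in stale:
            del self._last_heard[node_id]
        return stale

    def neighbors(self):
        """Return the ids of the neighbors currently held in the cache."""
        return sorted(self._last_heard)


# Hypothetical timeline: B was last heard at t=0, C at t=20.
cache = NeighborCache(validity_period=30)
cache.record_alive("B", now=0)
cache.record_alive("C", now=20)
print(cache.purge_stale(now=40), cache.neighbors())
```

Passing the current time in explicitly keeps the sketch deterministic; a real node would read it from a clock.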
Adaptive Inferential Control in Distillation Columns
Reference: Sudarsun S and Vasudevan, "Adaptive Inferential Control in Distillation Columns", Proceedings of the International Conference on Trends in Industrial Measurements and Automation, Jan 1999.
Using Behaviour Patterns in Treating the Autistic
Abstract: The development of behavioral therapy regimens for autistic patients is challenging because these patients may not be able to express feedback on the applied treatment. The response to a treatment course is mostly estimated qualitatively, with little systematic feedback between therapy and response. Collecting and analyzing data about a patient's daily activities could yield patterns linking these activities, thereby providing therapists with some foreknowledge of the likely behavioral outcomes of their therapies. We propose an anomaly detection system that can monitor the behavior patterns of a patient based on data collected on a daily basis. The knowledge gathered about the patient could prove suggestive of the patient's feedback on the applied therapy. Upon mining the behavioral patterns, the system could predict the response of a patient to a stimulus, given a list of recently displayed behaviors and/or completed activities. The knowledge thus gathered could also be used to treat other patients with a similar disability.
Index: Anomaly Detection, Autistic, Behavioral Patterns, Data Mining, Sequence Analysis, Statistical Modeling
Reference: Sudarsun S, Varun Kant Vashistha, Avijit Naik, "Using Behaviour Patterns in Treating the Autistic", Proc of the National Conference on AI, Robotics and Vision, October 2007.
An Humane Approach for Contiguous Stream Audio Service
Reference: Sudarsun S, Srivatsan, Vijay Bhaskar, "An Humane Approach for Contiguous Stream Audio Service", Proceedings of the National Conference on Communication and Networking, Feb 2003, Pg. 200-207.
Algorithmic Source Input Interface for a compiler
Reference: Sudarsun S, "NLP: Algorithmic Source Input Interface for a compiler", Proceedings of the CONMICRO 2002, National Conference on Bio-Informatics, Dec 2002.
Content Based Automatic Title Generator
Reference: Sudarsun S, "Natural Language Processing: Content Based Automatic Title Generator", Proceedings of the National Seminar on Convergence of Technologies by 2010, Jan 2001, Pg. 137-144.