Tuesday, December 4, 2012

Tutorial on text mining - Introduction Part 1


Here we look at how text mining works. As most people say, text mining is analogous to finding a needle (your search keywords) in a haystack (the World Wide Web). I have followed the book Matrix Methods in Data Mining and Pattern Recognition by Prof. Lars Eldén, and the concepts in this tutorial are mainly based on it. Problems in data mining can be approached with either statistical methods or matrix factorization. I was fortunate to take the course under Lars; he is the most inspiring teacher I have had.
    Here we will deal with the matrix factorization technique. Retrieving a small, relevant piece of information (small relative to the entire domain) from large and possibly unstructured data is a tricky task. Tasks of this kind belong to information retrieval (IR). Apart from popular search engines such as Google, Bing and Yahoo, IR also plays a role in medicine. The medical community produces biomedical research articles at a prolific rate, resulting in a very large knowledge bank. As this body of data has grown, the need of medical doctors and others to extract relevant information has triggered research in text mining, particularly for the biomedical knowledge domain. Biomedical text mining differs substantially from ordinary World Wide Web search in that it involves semantics and natural language understanding. I will cover the NLP aspects in Part 2 on text mining.
         Before going ahead, here are the topics we will be learning:
Preprocessing of the documents : Analysis and purification of documents, Indexing, Inverted indexing, Item normalization, Stop words, Stemming, Text parser.
Vector Space Model : TDM - Term Document Matrix, term weighting scheme, Sparse matrix storage, Query matching - cosine distance measure, Performance modeling - tolerance, Precision, Recall.
Latent Semantic Indexing : Matrix factorization, Low rank approximation, Singular Value Decomposition.
Clustering, NMF - Nonnegative Matrix Factorization, LGK Bidiagonalization.
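
To give a feel for where these topics are heading, here is a minimal sketch of a term-document matrix with cosine query matching, written in plain Python. Everything in it is an assumption made for illustration: the stop-word list, the three toy documents, the raw term counts used as weights (no weighting scheme or stemming), and the helper names tokenize, build_tdm, cosine and match. It is not the book's implementation, only a small instance of the vector space idea.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list chosen only for this sketch.
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_tdm(docs):
    """Return (vocab, tdm) where tdm[i][j] counts term i in document j."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    counts = [Counter(tokenize(d)) for d in docs]
    tdm = [[c[t] for c in counts] for t in vocab]
    return vocab, tdm

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def match(query, docs):
    """Score every document (a column of the matrix) against the query."""
    vocab, tdm = build_tdm(docs)
    q_counts = Counter(tokenize(query))
    q_vec = [q_counts[t] for t in vocab]
    return [cosine(q_vec, [row[j] for row in tdm]) for j in range(len(docs))]

# Toy documents, invented for this example.
docs = [
    "text mining of biomedical articles",
    "search engines rank web pages",
    "matrix factorization in data mining",
]
scores = match("biomedical text mining", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The first document shares three terms with the query and scores highest, the third shares only "mining", and the second shares nothing and scores zero. In practice the matrix is very sparse and the counts are replaced by a term weighting scheme.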

Will be back!! 


