Tuesday, December 4, 2012

Tutorial on text mining - Introduction Part 1


Here we will look into how text mining works. As most people say, text mining is analogous to finding a needle (your search keywords) in a haystack (the World Wide Web). I have followed the book Matrix Methods in Data Mining and Pattern Recognition by Prof. Lars Eldén, and the concepts in this tutorial are mainly based on that book. Problems in data mining can be approached either statistically or through matrix factorization. I was fortunate to take the course under Lars; he is the most inspiring teacher I have had.
    Here we will be dealing with the matrix factorization technique. Retrieving a relevant, tiny (relative to the entire domain) piece of information from large and possibly unstructured data is a tricky task. Tasks of this kind belong to information retrieval (IR). Apart from popular search engines like Google, Bing and Yahoo, IR also plays a role in the field of medicine. The medical community produces biomedical research articles at a prolific rate, resulting in a very large knowledge bank. With the growth of such important data, the need of medical doctors and others to extract relevant information has triggered research in text mining, particularly for the biomedical knowledge domain. Biomedical text mining differs considerably from a normal World Wide Web search in that it includes semantics and natural language understanding. I will cover the NLP aspects in Part 2 on text mining.
         Before going ahead, here are the topics we will be learning.
Preprocessing of the documents : Analysis and purification of documents, Indexing, Inverted Indexing, Item Normalization, Stop words, Stemming, Text parser.
Vector Space Model : TDM - Term Document Matrix, term weighting scheme, Sparse matrix storage, Query matching - cosine distance measure, Performance modeling - tolerance, Precision, Recall.
Latent Semantic Indexing : Matrix factorization, Low rank approximation, Singular Value Decomposition.
Clustering, NMF - Nonnegative Matrix Factorization, LGK Bidiagonalization.
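
To give a feel for the vector space model before the detailed posts, here is a minimal sketch of a term-document matrix, cosine query matching with a tolerance, and precision/recall. It is only an illustration under my own assumptions: the toy documents, the stop-word list, the raw-count weighting and the function names (tokenize, build_tdm, cosine, match, precision_recall) are all hypothetical and not taken from the book. Stemming and sparse storage are left out to keep it short.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, chosen only for illustration.
DOCS = [
    "Text mining finds relevant information in large text collections",
    "Matrix factorization gives a low rank approximation of a matrix",
    "Search engines rank documents against the query of the user",
]
STOP_WORDS = {"a", "of", "the", "in", "against", "gives", "finds"}


def tokenize(text):
    """Lowercase, split on non-letters and drop stop words (no stemming)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_tdm(docs):
    """Return (terms, matrix), where matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match(query, docs, tol=0.1):
    """Return indices of documents whose cosine with the query exceeds tol,
    together with the full list of cosine scores."""
    terms, matrix = build_tdm(docs)
    q_counts = Counter(tokenize(query))
    q = [q_counts[t] for t in terms]  # query as a vector over the same terms
    scores = []
    for j in range(len(docs)):
        column = [row[j] for row in matrix]  # document j is column j of the TDM
        scores.append(cosine(q, column))
    retrieved = [j for j, s in enumerate(scores) if s > tol]
    return retrieved, scores


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    retrieved, scores = match("low rank matrix approximation", DOCS, tol=0.1)
    print("retrieved:", retrieved)
    print("scores:", [round(s, 3) for s in scores])
    # Assume, for the sake of the example, that only document 1 is relevant.
    print("precision, recall:", precision_recall(retrieved, {1}))
```

Raising the tolerance returns fewer documents, which tends to raise precision at the cost of recall; lowering it does the opposite. A real system would use a proper term weighting scheme and a sparse matrix instead of raw counts in nested lists.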

Will be back!! 




Thursday, November 29, 2012

Apache Mahout - Befitting of the name!!

Well, it was 7 days back that I posted my first elementary question to the Mahout user group, and today I was able to successfully test my first program. Being a first-timer with (Java+Eclipse+Maven+Mahout) = JEMM, it took me almost a month to get things running. I would like to thank kuba pawloch@interia.pl, Sean Owen and Julian Ortega for their support.
  Well, I don't think there could have been anyone as ignorant as I was with JEMM, as I couldn't find queries as basic as mine. Anyway, I have now moved on and hope to use Mahout on a more serious note. However, in case anyone has or will have similar issues, I would like to share the issues I had with getting Mahout running.

Wednesday, November 14, 2012

My experiments with the NETFLIX problem

Evolution has been the main defining quality of us humans right from the start. Following Heraclitus's doctrine that change is central to the universe, we can observe temporal dynamics in almost everything. The conception of a 'tool' during the Stone Age compared to that of today reflects the advancements we have succeeded in making.

 My interest lies in playing a small role in modernizing the Web, aided by machine learning algorithms and, of course, computer/internet technologies. With the Web getting smarter and closer to us, I would love to be a part of the Connected World.

 
This is the latest project I have been working on. The NETFLIX problem seems to have attracted a great deal of attention from data scientists the world over. This is the problem I have enjoyed solving the most. Berkant Savas, who is great fun to work with, has been supervising me. Initially my attempt was to model the temporal aspects involved in the NETFLIX data. The main article I was referring to is "Collaborative Filtering with Temporal Dynamics" by Yehuda Koren. Yehuda Koren was part of the team BellKor's Pragmatic Chaos, which won the NETFLIX prize in September 2009.

 The concept of evolution applied to 'the way people conceive of movies' is what I would like to emphasize and capitalize on.
    

Wednesday, September 19, 2012

Journals in Datamining

  1. IPL - Information Processing Letters
  2. VLDB - The VLDB Journal
  3. DATAMINE - Data Mining and Knowledge Discovery
  4. SIGKDD Explorations
  5. CS&DA - Computational Statistics & Data Analysis
  6. Journal of Knowledge Management
  7. WWW - World Wide Web
  8. INFFUS - Information Fusion
  9. Journal of Classification
  10. KAIS - Knowledge and Information Systems
  11. TKDE - IEEE Transactions on Knowledge and Data Engineering
  12. IDA - Intelligent Data Analysis
  13. Transactions on Rough Sets
  14. JECR - Journal of Electronic Commerce Research
  15. TKDD - ACM Transactions on Knowledge Discovery From Data
  16. IJDMB - International Journal of Data Mining and Bioinformatics
  17. IJDWM - International Journal of Data Warehousing and Mining
  18. IJBIDM - International Journal of Business Intelligence and Data Mining
  19. Statistical Analysis and Data Mining
  20. IJICT - International Journal of Information and Communication Technology
  21. Advances in Data Analysis and Classification
  22. OIR - Online Information Review
  23. MLDM - Transactions on Machine Learning and Data Mining
  24. DQ - Data Quality Journal
  25. TGIS - Transactions in GIS
  26. ISJ-GP - Information Security Journal: A Global Perspective 

Retrieved from http://academic.research.microsoft.com/RankList?entitytype=4&topDomainID=2&subDomainID=7&last=0&start=1&end=100