Feature Extraction and Duplicate Detection for Text Mining: A Survey

Ramya R S (University Visvesvaraya College of Engineering, UVCE)
Venugopal K R
Iyengar S S
Patnaik L M

GJCST Volume 16 Issue C5

Article Fingerprint

ResearchID

CSTSDEZ7U17


Text mining, also known as Intelligent Text Analysis, is an important research area. The high dimensionality of text data makes it difficult to focus on the most relevant information. Feature extraction is one of the key data-reduction techniques for discovering the most important features. Processing massive amounts of data stored in unstructured form is a challenging task, and several pre-processing methods and algorithms are needed to extract useful features from such data. This survey covers text summarization, classification, and clustering methods for discovering useful features. It also covers the discovery of query facets, which are multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user.
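As a rough illustration of the two ideas named in the title, the sketch below extracts TF-IDF features from a few short documents and flags near-duplicate pairs by cosine similarity. It is not taken from the survey: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `near_duplicates`), the smoothed IDF formula, and the 0.8 threshold are all assumptions chosen for the example, and TF-IDF is only one of many feature-extraction schemes a survey of this kind might discuss.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def near_duplicates(docs, threshold=0.8):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    vectors = tfidf_vectors(docs)
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]
```

Calling `near_duplicates` on a small list of strings returns the index pairs judged to be duplicates. Comparing every pair is quadratic in the number of documents, so this form is only suitable for small collections.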

Funding

No external funding was declared for this work.

Conflict of Interest

The authors declare no conflict of interest.

Ethical Approval

No ethics committee approval was required for this article type.

Data Availability

Not applicable for this article.

Ramya R S. 2017. "Feature Extraction and Duplicate Detection for Text Mining: A Survey". Global Journal of Computer Science and Technology - C: Software & Data Engineering GJCST-C Volume 16 (GJCST Volume 16 Issue C5).

Journal Specifications

Crossref Journal DOI 10.17406/gjcst

Print ISSN 0975-4350

e-ISSN 0975-4172

Classification

C.2.1, C.2.4, H.2.8

Version of record

v1.2

Issue date

January 27, 2017

Language

English

Article Metrics
Total Views: 7062
Total Downloads: 1698
