Feature Extraction and Duplicate Detection for Text Mining: A Survey

Ramya R S

Venugopal K R

Iyengar S S

Patnaik L M

Bangalore University

Article Fingerprint

ResearchID

CSTSDEZ7U17

Abstract

Text mining, also known as Intelligent Text Analysis, is an important research area. The high dimensionality of text data makes it very difficult to focus on the most relevant information. Feature extraction is one of the important data reduction techniques for discovering the most important features. Processing massive amounts of data stored in unstructured form is a challenging task, and several pre-processing methods and algorithms are needed to extract useful features from such large volumes of data. The survey covers different text summarization, classification, and clustering methods for discovering useful features, as well as the discovery of query facets: multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user.
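As a rough illustration of feature extraction used for data reduction, the sketch below weights terms with TF-IDF and keeps only the highest-scoring ones. TF-IDF is an assumption chosen for illustration (the abstract does not name a specific weighting scheme), and the function names `tokenize`, `tfidf_scores`, and `top_features` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately simple pre-processing step)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(docs):
    """Return one {term: tf-idf weight} dict per document.

    Assumption: tf = count / document length, idf = ln(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_features(docs, k):
    """Reduce the vocabulary to the k terms with the highest summed tf-idf weight."""
    totals = Counter()
    for doc_scores in tfidf_scores(docs):
        totals.update(doc_scores)  # adds the weights term by term
    return [term for term, _ in totals.most_common(k)]
```

Under this weighting, a term that occurs in every document receives a weight of zero, so calling `top_features(docs, k)` with a small `k` discards such terms first and keeps the terms that best distinguish individual documents.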

Funding

No external funding was declared for this work.

Conflict of Interest

The authors declare no conflict of interest.

Ethical Approval

No ethics committee approval was required for this article type.

Data Availability

Not applicable for this article.

How to Cite This Article

Ramya R S. 2017. "Feature Extraction and Duplicate Detection for Text Mining: A Survey". Global Journal of Computer Science and Technology - C: Software & Data Engineering GJCST-C Volume 16 (GJCST Volume 16 Issue C5): .

GJCST Volume 16 Issue C5
Pg. 1-20

Journal Specifications

Crossref Journal DOI 10.17406/gjcst

Print ISSN 0975-4350

e-ISSN 0975-4172

Classification

C.2.1,C.2.4,H.2.8

Submission Received December 12, 2015
Peer Review Double Blind
Handling Editor
Accepted December 31, 2015
Published January 15, 2016

Version of record

v1.2

Issue date

January 27, 2017

Article Metrics

Total Score: 104

Country: India

Subject: Global Journal of Computer Science and Technology - C: Software & Data Engineering

Authors: Ramya R S, Venugopal K R, Iyengar S S, Patnaik L M (PhD/Dr. count: 0)
