Annotated Bangla News Corpus and Lexicon Development with POS Tagging and Stemming

Article ID

027C9

Tasnim Haider Chaudhury
Abdul Matin (Shahjalal University of Science & Technology, Cox's Bazar International University)
M.S. Hossain
Asie Uzzaman
Md. Masum
Abstract

In this paper, we develop a monolingual Bengali news corpus from widely read Bengali newspapers using a knowledge-based artificial intelligence (AI) technique. It is intended to serve as a reference corpus for lexicon development, morphological analysis, and automatic part-of-speech detection. The corpus contains 74,698 word forms. The words in the lexicon are annotated with a combination of manual tags covering parts of speech, stemming, morphemes, and other grammatical features, which are important for almost all Natural Language Processing (NLP) applications. The lexicon contains around 14 thousand entries. We also present a statistical analysis of four of the most popular Bengali newspapers in Bangladesh, namely Prothom-Alo, Daily Janakantha, Daily Kalerkantho, and Amardesh online, covering 1 January 2012 to 31 January 2012. Finally, we propose a user-friendly software interface for annotating a large existing Bengali word set as part of the lexicon-building process.
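As a rough illustration of the quantities named in the abstract (word forms in a corpus, and lexicon entries carrying part-of-speech, stem, and morpheme tags), the following minimal Python sketch may help. It is not the authors' tool: the regular-expression tokenizer, the `LexiconEntry` fields, and the function names are all assumptions made for this example, and the tags are left empty because the paper describes them as being filled in manually.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Assumption: a token is a run of characters from the Bengali Unicode
# block (U+0980 to U+09FF). This also matches Bengali digits and is only
# a simple stand-in for whatever tokenization the authors used.
BENGALI_TOKEN = re.compile(r"[\u0980-\u09FF]+")


def count_word_forms(texts):
    """Count how often each distinct word form occurs in the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(BENGALI_TOKEN.findall(text))
    return counts


@dataclass
class LexiconEntry:
    """Hypothetical lexicon record; fields mirror the tags named in the abstract."""
    word: str
    pos: str = ""                      # part-of-speech tag, assigned manually
    stem: str = ""                     # stem, assigned manually
    morphemes: list = field(default_factory=list)


def build_lexicon(counts):
    """Create one untagged entry per distinct word form, ready for annotation."""
    return {word: LexiconEntry(word=word) for word in counts}


if __name__ == "__main__":
    sample = ["আমি ভাত খাই।", "আমি বই পড়ি।"]
    counts = count_word_forms(sample)
    lexicon = build_lexicon(counts)
    print(len(counts), "distinct word forms")
    print(lexicon["আমি"])
```

An annotation interface of the kind the abstract proposes would then fill in `pos`, `stem`, and `morphemes` for each entry.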


Abdul Matin. 2017. “Annotated Bangla News Corpus and Lexicon Development with POS Tagging and Stemming.” Global Journal of Research in Engineering – J: General Engineering (GJRE-J), Volume 17, Issue J1.

Journal Specifications

Crossref Journal DOI 10.17406/gjre

Print ISSN 0975-5861

e-ISSN 2249-4596

Classification
GJRE-J Classification: FOR Code: 200402, 170203

