Annotated Bangla News Corpus and Lexicon Development with POS Tagging and Stemming

Abdul Matin α
Tasnim Haider Chaudhury σ
M.S. Hossain ρ
Asie Uzzaman Ѡ
Md. Masum ¥

α Shahjalal University of Science and Technology

Article Fingerprint

ResearchID: 027C9

Abstract

In this paper, we develop a monolingual Bengali news corpus from several widely read Bengali newspapers using a knowledge-based AI (Artificial Intelligence) technique. The corpus is intended as a reference corpus for lexicon development, morphological analysis, and automatic part-of-speech detection. It contains 74,698 word forms. The words in the lexicon are annotated with a combination of manually assigned tags covering part of speech, stem, morphemes, and other grammatical features, which are important for almost all Natural Language Processing (NLP) applications. The lexicon contains around 14 thousand entries.
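To make the kind of annotation described above concrete, the following is a minimal sketch of how a lexicon entry carrying a part-of-speech tag, a stem, and a morpheme split might be represented, and how distinct word forms in a tokenized text could be counted. The abstract does not specify the authors' data format or tag set, so the class name `LexiconEntry`, its field names, the `NOUN` label, and the two sample entries are all assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class LexiconEntry:
    """One annotated lexicon entry. Field names are hypothetical."""
    word_form: str                   # surface form as it appears in the text
    pos: str                         # part-of-speech tag (tag set is assumed)
    stem: str                        # stem of the word form
    morphemes: Tuple[str, ...] = ()  # stem followed by any suffixes


def build_lexicon(entries: Iterable[LexiconEntry]) -> Dict[str, LexiconEntry]:
    """Index entries by surface form for direct lookup."""
    return {entry.word_form: entry for entry in entries}


def count_word_forms(tokens: Iterable[str]) -> Counter:
    """Count how often each distinct word form occurs in a token stream."""
    return Counter(tokens)


# Illustrative entries only; they are not taken from the authors' lexicon.
entries = [
    LexiconEntry("ছেলেরা", "NOUN", "ছেলে", ("ছেলে", "রা")),  # "boys": stem + plural suffix
    LexiconEntry("ছেলে", "NOUN", "ছেলে", ("ছেলে",)),         # "boy": bare stem
]
lexicon = build_lexicon(entries)

# Whitespace tokenization is a simplification chosen for this sketch.
tokens = "ছেলেরা ছেলে ছেলেরা".split()
counts = count_word_forms(tokens)

print(lexicon["ছেলেরা"].stem)
print(len(counts), "distinct word forms")
```

Keying the lexicon by surface form keeps lookup simple; a real lexicon would likely need to allow several entries per form, since one word form can carry more than one part of speech.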

Funding

No external funding was declared for this work.

Conflict of Interest

The authors declare no conflict of interest.

Ethical Approval

No ethics committee approval was required for this article type.

Data Availability

Not applicable for this article.

How to Cite This Article

Abdul Matin. 2017. “Annotated Bangla News Corpus and Lexicon Development with POS Tagging and Stemming”. Global Journal of Research in Engineering - J: General Engineering GJRE-J Volume 17 (GJRE Volume 17 Issue J1): .

Journal Specifications

Crossref Journal DOI 10.17406/gjre

Print ISSN 0975-5861

e-ISSN 2249-4596

Classification

GJRE-J Classification: FOR Code: 200402, 170203

Version of record

v1.2

Issue date

May 18, 2017

Language

en