A Hybrid Framework for Arabic Extractive Document Summarization Using Pre-Trained Language Models and Topic Modeling
Abstract
Automatic text summarization remains one of the most challenging tasks in Natural Language Processing (NLP), and it is especially difficult for Arabic, a language with rich morphology, dense semantic content, and syntactic ambiguity. This study proposes a hybrid approach to Arabic extractive summarization that combines the strengths of pre-trained language models (PLMs) with topic modeling techniques to produce summaries that are accurate, comprehensive, and semantically coherent. The proposed model uses an Arabic pre-trained language model, such as AraBERT, to obtain deep contextual sentence representations and to assess how each sentence relates to the document as a whole, while BERTopic or Latent Dirichlet Allocation (LDA) uncovers the document's latent topics, ensuring that the summary covers all of its key points. The system selects the most representative sentences by combining the semantic and topical signals without sacrificing readability. We evaluate the proposed approach using both standard automatic metrics (ROUGE-N and ROUGE-L) and human judgments of content quality and coherence. The results show that the hybrid system substantially outperforms purely statistical and purely deep learning baselines for Arabic summarization, improving information access and advancing Arabic NLP applications. On an Arabic news dataset, the proposed hybrid approach achieves superior ROUGE-1, ROUGE-2, and ROUGE-L scores compared to baseline extractive methods, indicating summaries that are both more coherent and more comprehensive.
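The sentence-selection idea described above can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: toy bag-of-words vectors stand in for AraBERT sentence embeddings, a keyword set stands in for the topics that BERTopic or LDA would discover, and the function names and the mixing weight `alpha` are hypothetical. The sketch only shows the combination step, i.e. scoring each sentence by a weighted sum of a semantic score (similarity to a document-level representation) and a topical score (coverage of topic terms), then extracting the top-k sentences in their original order.

```python
# Illustrative sketch of hybrid extractive scoring (assumed design, not the
# paper's code): semantic relevance + topic coverage -> top-k sentence selection.
from collections import Counter
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors (Counters).
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(sentences, topic_keywords, k=2, alpha=0.7):
    # Toy stand-in for PLM embeddings: term-count vectors per sentence.
    vecs = [Counter(s.split()) for s in sentences]
    centroid = sum(vecs, Counter())  # document-level representation
    scores = []
    for i, (s, v) in enumerate(zip(sentences, vecs)):
        semantic = cosine(v, centroid)  # how central the sentence is
        tokens = s.split()
        # Fraction of the sentence's tokens that are topic terms.
        topical = len(set(tokens) & topic_keywords) / max(len(tokens), 1)
        scores.append((alpha * semantic + (1 - alpha) * topical, i))
    # Take the k highest-scoring sentences, then restore document order.
    chosen = sorted(i for _, i in sorted(scores, reverse=True)[:k])
    return [sentences[i] for i in chosen]

sents = [
    "the model encodes each sentence",
    "topic modeling finds latent themes",
    "the weather was pleasant today",
]
topics = {"model", "topic", "sentence", "themes"}
summary = summarize(sents, topics, k=2)
```

In this toy example the off-topic third sentence is as "central" as the first by the bag-of-words measure, and it is the topical term that tips selection toward the two on-topic sentences — the coverage benefit the hybrid design aims for. In the actual system the semantic score would come from contextual AraBERT representations and the topical score from LDA/BERTopic topic assignments.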
Article Details
This work is licensed under a Creative Commons Attribution 4.0 International License.