MULTIDOCUMENT HINDI TEXT SUMMARIZATION USING BACKPROPAGATION NETWORK
ABSTRACT
Nowadays it is a difficult and laborious task to find exactly what we want on the internet. To ease this task, many summarization techniques have been developed. For the English language there are multiple options available, but very little work has been done for the Hindi language. The proposed system summarizes multiple Hindi documents. In this summarization technique, features extracted from the documents, such as sentence length and sentence position, are used to decide which sentences should be included in the summary.
Keywords: Hindi text summarization, backpropagation network (BP)
INTRODUCTION
These days text summarization has become more popular among…
3.2 Generic vs. Query-Based Summary
A generic summary does not target any particular group; it addresses a broad community of readers. Query-based or topic-focused summaries, in contrast, are tailored to the specific needs of an individual or a particular group and represent a particular topic.
3.3 Single vs. Multi-Document Summary
A single-document summary provides the most relevant information contained in a single document, helping the user decide whether the document is related to the topic of interest. A multi-document summary helps to identify redundancy across documents and computes the summary of a set of related documents of a corpus so that it covers the major details of the events in the documents. It must take into account several major issues: the need to carefully eliminate redundant information from multiple documents and achieve high compression ratios; information about document and passage similarities, and weighting different passages accordingly; the importance of temporal information; and co-reference among entities and facts occurring across documents…
This preparation is performed in four steps: sentence segmentation, tokenization, stop word removal and stemming.
Sentence segmentation
In the sentence segmentation step, the given text document is divided into sentences, and the word count of each sentence is recorded. In Hindi, sentence boundaries are identified by the purna viram (।).
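The segmentation step can be illustrated with a minimal Python sketch. It assumes sentences end with the purna viram (।, U+0964) and also accepts the ASCII pipe (|), which often stands in for it in typed text. The function name `segment_sentences` and the sample text are hypothetical, chosen only for illustration; they are not taken from the proposed system.

```python
import re

def segment_sentences(text):
    """Split Hindi text into (sentence, word_count) pairs.

    Assumption: a sentence ends at the purna viram or at an ASCII pipe.
    """
    pairs = []
    for part in re.split(r"[।|]", text):
        sentence = part.strip()
        if sentence:  # skip the empty piece after the final delimiter
            pairs.append((sentence, len(sentence.split())))
    return pairs

# Hypothetical sample text with two sentences.
sample = "भारत एक विशाल देश है। यहाँ कई भाषाएँ बोली जाती हैं।"
for sentence, count in segment_sentences(sample):
    print(count, sentence)
```

Word counts here are taken by whitespace splitting only; a fuller tokenizer would also separate punctuation.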
Tokenization
In the tokenization step, sentences are divided into words by identifying spaces, commas and special symbols between words. At this point there is a list of sentences, each with its word count, ready for further processing.
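A minimal Python sketch of such a tokenizer follows. The set of separator symbols in the regular expression is an assumption made for illustration, since the text does not list the exact special symbols handled, and the function name `tokenize` is hypothetical.

```python
import re

# Assumed separators: whitespace, comma, common punctuation, purna viram, pipe.
SEPARATORS = re.compile(r"[\s,;:!?()\[\]।|-]+")

def tokenize(sentence):
    """Split a sentence into words, discarding separators and empty strings."""
    return [token for token in SEPARATORS.split(sentence) if token]

# Hypothetical sample sentence.
print(tokenize("राम, श्याम और मोहन स्कूल गए।"))
```

Devanagari vowel signs and other combining marks are not in the separator set, so each word stays intact.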
Stop word removal
In the stop word removal step, common words that contribute no relevant information to the task are removed, so that feature computation considers only the more important words in the document. Stop words are common words that carry less meaning than keywords; eliminating them leads to better summary generation.
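Stop word removal amounts to filtering tokens against a fixed list, as in the Python sketch below. The small `STOP_WORDS` set is a hypothetical sample for illustration only; the actual stop word list used by the proposed system is not given in the text.

```python
# Hypothetical sample of common Hindi stop words; a real list would be much longer.
STOP_WORDS = {"है", "हैं", "और", "का", "की", "के", "में", "से", "को", "पर", "यह", "एक"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop word set, preserving order."""
    return [token for token in tokens if token not in stop_words]

# Hypothetical tokenized sentence.
tokens = ["भारत", "एक", "विशाल", "देश", "है"]
print(remove_stop_words(tokens))
```

Using a set keeps each membership check constant-time, which matters when many documents are processed.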
Stemming
Stemming is the process of obtaining the root of each word, which emphasizes its semantics. Through this procedure, syntactically similar words, such as plurals and verbal variations, are treated as the same word. Stemming is used to match words across sentences when computing the similarity feature. The stemmer used is developed by IIT
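To illustrate the idea of stemming only, the Python sketch below strips a few common Hindi inflectional suffixes. It is not the stemmer referred to above: the suffix list, the minimum stem length, and the function name `stem` are all assumptions made for this example, and a rule this simple will over-stem or under-stem many words.

```python
# Hypothetical suffix list for illustration; not the stemmer used by the authors.
SUFFIXES = ["ियों", "ियाँ", "ाओं", "ों", "ें", "ाँ", "ी", "े", "ा", "ो"]

def stem(word, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# Plural and oblique forms of a sample word reduce to the same stem.
for word in ["किताब", "किताबें", "किताबों"]:
    print(word, "->", stem(word))
```

Mapping inflected forms to one stem is what lets word matching between sentences count them as the same term.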