Urdu Language Analysis

Urdu is a widely spoken language of South Asia and the national language of Pakistan. It is spoken by about 250 million people across the world, including around 100 to 130 million native speakers in Pakistan and India [1]. It has a vast and rich literature, written by a large number of famous poets and writers of the old Indian subcontinent [2]. Urdu combines the writing styles, features, scripts and properties of two languages, Persian and Arabic [3]. The Persian script is itself derived from the Arabic script, and both Arabic and Persian serve as parents of written Urdu. Arabic is the underlying base for Urdu writing, but Persian also contributes some of its features. The most…
Cursiveness means that joined characters take different shapes and forms depending on the sequence in which they are joined [4]. This cursive nature of Urdu text makes the correct detection and recognition of Urdu words difficult and challenging for image processing tasks. Moreover, diacritics such as zer, zabar, paish and madd are also used in Urdu text. A ligature may be formed from characters that join with their neighbors on both sides together with characters that join from one side only. The shape of a letter in Urdu text depends on its position in the word and on its neighboring letters. A letter may appear in one of four positions: initial, medial, final and isolated [4]. Of the 37 letters, nine letters of the standard Urdu character set do not join with other characters to form a single ligature and always remain separate [5]. Another 26 Urdu letters can be joined from both sides. Of the 37 characters of the basic Urdu alphabet, 17 carry dots, and diacritics such as hamza (ء) and toy (ط) are quite often used to differentiate characters from each other [6]. Moreover, within a ligature, horizontal overlapping of different characters is also possible in Urdu text. These different writing styles, cursive nature and overlapping of different…
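The positional behaviour described above can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the set RIGHT_JOINING lists letters that the Unicode standard classifies as joining only to the preceding letter, which is not necessarily the list of nine letters given in [5], and the function names positional_forms and split_ligatures are hypothetical. The input is assumed to be a single word of base letters with no diacritics, spaces or stand-alone hamza.

```python
# Letters that join only to the preceding letter (Unicode joining type "R").
# ASSUMPTION: illustrative set, not necessarily the nine letters cited in [5].
RIGHT_JOINING = set(
    "\u0627\u0622\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2"
)


def positional_forms(word):
    """Return (letter, position) pairs for a word of base letters only."""
    forms = []
    for i, letter in enumerate(word):
        # A letter connects backwards only if the previous letter joins forwards.
        joins_prev = i > 0 and word[i - 1] not in RIGHT_JOINING
        # A letter connects forwards only if it joins on both sides and is not last.
        joins_next = letter not in RIGHT_JOINING and i < len(word) - 1
        if joins_prev and joins_next:
            position = "medial"
        elif joins_prev:
            position = "final"
        elif joins_next:
            position = "initial"
        else:
            position = "isolated"
        forms.append((letter, position))
    return forms


def split_ligatures(word):
    """Split a word into ligatures: a ligature ends at a letter that cannot join forwards."""
    ligatures, current = [], ""
    for i, letter in enumerate(word):
        current += letter
        if letter in RIGHT_JOINING or i == len(word) - 1:
            ligatures.append(current)
            current = ""
    return ligatures


if __name__ == "__main__":
    kitab = "\u06a9\u062a\u0627\u0628"  # the word "kitab" (book)
    for letter, position in positional_forms(kitab):
        print(f"U+{ord(letter):04X} {position}")
    print(len(split_ligatures(kitab)), "ligatures")
```

For the word kitab, the sketch assigns the four letters the initial, medial, final and isolated positions in turn and splits the word into two ligatures. A real recognizer works on glyph images rather than code points, so this only mirrors the joining rules at the text level.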
This file is then available for further processing and information retrieval. An OCR system makes it possible to utilize printed data using a scanner and a computer with minimum time and effort. It transforms the scanned image into a text document, which can then be used and processed easily by a text editor or a word processor [7]. Various works on the conversion of images containing non-editable Urdu text into editable text files using OCR are available in the literature. Some of the reviewed work in this regard is summarized below. Baseline detection is considered one of the most important steps of any Urdu OCR system. Baseline detection is used to identify the primary and secondary strokes. The baseline is a horizontal line made up of a number of points, containing the maximum number of pixels [3]. S. Naz et al. provide a detailed review of the most frequently used baseline detection methods for Urdu OCR systems. The paper also discusses the challenges of baseline detection in cursive script languages for Nastalique and Naskh font
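The definition of the baseline as the horizontal line containing the maximum number of pixels can be sketched with a horizontal projection profile. The following Python example is a minimal illustration, not the method of any particular paper reviewed here. It assumes a binarized text-line image stored as a list of rows, with 1 for a foreground (ink) pixel and 0 for background; the function name detect_baseline is hypothetical, and ties are resolved in favour of the topmost row.

```python
def detect_baseline(image):
    """Return the index of the row holding the most foreground pixels.

    ASSUMPTION: `image` is a binarized text line given as a list of rows,
    where 1 marks a foreground (ink) pixel and 0 marks background.
    """
    if not image:
        raise ValueError("image must contain at least one row")
    # Horizontal projection profile: count the foreground pixels in each row.
    row_counts = [sum(row) for row in image]
    # The first row with the maximum count is taken as the baseline.
    return row_counts.index(max(row_counts))


if __name__ == "__main__":
    line_image = [
        [0, 0, 1, 0, 0, 0],
        [0, 1, 1, 0, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1, 0],  # densest row
        [0, 0, 1, 0, 0, 0],
    ]
    print("baseline row:", detect_baseline(line_image))
```

A single global maximum is a simplification. In slanted or multi-line text the profile is usually computed per text line after skew correction, which this sketch leaves out.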
