NLP for Burmese/Myanmar

Several decades ago, Myanmar was widely expected to become a rising star in industrialization and development. As mobile phones, the internet, websites, social media, blogs, and news have become part of people’s everyday lives, it has become clear that businesses need a more efficient way to cross the language barrier when entering the country. This in turn has created a need for machine translation (MT) and natural language processing (NLP) for Burmese. Yet even the best algorithm requires data to produce solid machine translation, and this was, and still is, one of the major issues for Burmese. If you ask people who work with MT engines about the key challenges of translating Burmese, or many other Asian languages for that matter, they will tell you that there’s simply not enough data to deliver top-notch results.

Complications when working on Burmese NLP and MT

According to Wikipedia, Burmese was the fourth of the Sino-Tibetan languages to develop a writing system, after Chinese, Tibetan, and Tangut. The language therefore has a long written history and some very specific features and peculiarities. In NLP terms it is a low-resource language: little annotated or parallel text is available, which makes it very difficult to build, train, and improve MT systems. Beyond this, the language itself poses further challenges, which we’ve summarized below.

  • Word segmentation: Burmese uses spaces differently from English. Spaces are not placed between individual words; short words may be written together, and there are no fixed spacing rules. This can make NLP and MT extremely challenging, because a system must first decide where each word begins and ends. To address this, manual word segmentation has been organized into six rules. These cover: the combination of root word, prefix, and suffix; plural nouns; possessive words; a particle attached to a verb or adjective as a way of identifying the noun; a particle that states the type of noun and is used after a number; and breakpoints or pipe characters used for compound words.
     
  • Part-of-speech (POS) tagging: in addition, there are 15 POS tags that need to be factored in when working with MT and NLP: abbreviation, adjective, adverb, conjunction, foreign word, interjection, noun, number, particle, post-positional marker, pronoun, punctuation, symbol, text number, and verb. When doing POS tagging for MT and NLP, it’s vital that each word and its specific function in the sentence are correctly determined. As mentioned above, the spacing (or lack thereof) in Burmese can make this a challenging task.
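
The segmentation problem described above can be illustrated with a minimal Python sketch. It is not the method behind the six manual rules; it swaps in a generic technique, greedy longest-match lookup against a word list, applied to syllables found with a simple regular expression. The names `split_syllables`, `segment_words`, and the one-entry `LEXICON` are hypothetical, and the syllable rule used here (a new syllable starts at a consonant that is neither stacked nor followed by asat) is a simplifying assumption that ignores independent vowels, digits, and punctuation.

```python
import re

# Assumed character classes (Unicode Myanmar block):
CONSONANT = "\u1000-\u1021"  # consonant letters KA..A
ASAT = "\u103A"              # "killer" mark on a syllable-final consonant
VIRAMA = "\u1039"            # stacker that subjoins the next consonant

# Simplified rule: a syllable starts at a consonant that is not stacked
# (no virama before it) and not syllable-final (no asat/virama after it).
SYLLABLE_START = re.compile(
    rf"(?<!{VIRAMA})[{CONSONANT}](?![{ASAT}{VIRAMA}])"
)


def split_syllables(text: str) -> list[str]:
    """Split a run of Burmese text into approximate syllables."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]


def segment_words(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy longest-match segmentation over syllables.

    Unknown material falls back to single syllables.
    """
    words = []
    for chunk in text.split():  # spaces separate phrases, not words
        syllables = split_syllables(chunk)
        i = 0
        while i < len(syllables):
            # Try the longest candidate first, shrinking down to 1 syllable.
            for n in range(min(max_len, len(syllables) - i), 0, -1):
                candidate = "".join(syllables[i:i + n])
                if n == 1 or candidate in lexicon:
                    words.append(candidate)
                    i += n
                    break
    return words


# Hypothetical one-entry lexicon: "Myanmar" (two syllables).
MYANMAR = "\u1019\u103C\u1014\u103A\u1019\u102C"
LEXICON = {MYANMAR}

# "Myanmar" followed by the syllable for "writing", with no space between.
PHRASE = MYANMAR + "\u1005\u102C"

print(PHRASE.split())                  # whitespace alone finds one chunk
print(split_syllables(PHRASE))         # three syllables
print(segment_words(PHRASE, LEXICON))  # two words
```

Splitting on whitespace returns the phrase unchanged, while the lexicon lookup recovers two words from three syllables. A tiny word list like this breaks down quickly on real text, which is one reason segmented training data matters so much for Burmese.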

Unicode to Zawgyi and vice versa

Apart from the challenges outlined above, there is a more pressing issue to address: how Burmese text is encoded on electronic devices. While Unicode is the international standard used for languages across the world, in Myanmar the Zawgyi encoding is widely used as well. The result is two incompatible encodings for the same script, and text can be unreadable for Myanmar users whose device is not set up for both. Text written in one encoding therefore needs to be converted before it displays accurately under the other, be it on a website, in an app, or on a mobile phone. Within this dual system, it is also important to note that over 90% of devices in Myanmar use Zawgyi. This poses even more challenges, including the following:

  • Data validation: user-entered data, anything from names and phone numbers to email addresses and other identifying details, is at risk if the user enters Zawgyi characters where Unicode is expected, and vice versa.
     
  • Misinterpretations: information may also be misinterpreted depending on how an internet search is performed. For example, a search engine might assume Unicode by default, whereas Zawgyi users type their search terms in Zawgyi. They then either get inaccurate results, or the search engine cannot interpret the text encoding at all, leaving the user with no useful data.
     
  • Conversions: it may be difficult for users to convert seamlessly from one font or encoding to the other. This adds hassle, takes more time than necessary, and can even yield irrelevant or useless results.
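
To make the validation problem above more concrete, here is a rough Python sketch of how an input form might flag text that looks like Zawgyi before storing it as Unicode. It rests on two assumptions that are heuristics, not a specification: Zawgyi stores the vowel sign E (U+1031) before its consonant, whereas Unicode stores it after, and Zawgyi reuses code points U+1060 to U+1097 for glyph variants, a range that standard Unicode assigns to other languages written in the Myanmar script. The function name `looks_like_zawgyi` is hypothetical, and a production system would more likely rely on a trained detector and a dedicated converter.

```python
import re

# Heuristic 1 (assumption): vowel sign E at the start of the text or right
# after whitespace, i.e. before any consonant, suggests Zawgyi visual order.
LEADING_E = re.compile(r"(?:^|\s)\u1031")

# Heuristic 2 (assumption): these code points are rare in standard Burmese
# Unicode text but are commonly used by Zawgyi for glyph variants.
ZAWGYI_RANGE = re.compile(r"[\u1060-\u1097]")


def looks_like_zawgyi(text: str) -> bool:
    """Return True if the text shows typical signs of Zawgyi encoding.

    This can misclassify short strings and text in other languages that
    use the Myanmar script; treat the result as a hint only.
    """
    return bool(LEADING_E.search(text) or ZAWGYI_RANGE.search(text))


# The same syllable in both encodings: consonant NA plus vowel sign E.
unicode_ne = "\u1014\u1031"  # Unicode: consonant first, then the vowel sign
zawgyi_ne = "\u1031\u1014"   # Zawgyi: vowel sign first, then the consonant

for label, sample in [("unicode", unicode_ne), ("zawgyi", zawgyi_ne)]:
    print(label, looks_like_zawgyi(sample))
```

A form handler could use such a check to route suspicious input to a converter or ask the user to confirm, instead of silently storing text that will not match Unicode search terms later.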

Conclusion

With so many aspects of our lives taking place online, and modern technologies entering our everyday lives, users are changing too. Bringing a product experience to their doorstep, in their own native language, is no longer just a wish but a demand and an expectation. For countries like Myanmar especially, language plays an even bigger role as a differentiator if you are planning to be successful. Thus the need for good translation arises, and with it the need to apply the latest technologies efficiently. This is why modern technologies need to factor in the challenges posed by the language to ensure that a translation is accurate and professional.

Despite this need, Burmese remains a challenging language to translate using MT, and many of these challenges come from the language itself. We’ve written this article to raise awareness of the state of MT and of the way NLP is applied to low-resource languages like Burmese. Enthusiasm for the latest technologies doesn’t always match the speed at which they are adopted at different levels of the industry, and it is important to recognize this. Sometimes the gap is related to customers’ expectations, which, especially for some Asian languages, are much higher than what is realistically achievable.

 
Desi Tzoneva

Content Writer