Department of Science and Technology

DOST leads charge for natural language research roadmap



As the Philippines commemorates Buwan ng Wika, the Department of Science and Technology (DOST) is leading the way in coming up with a research and development roadmap that can help strengthen natural language research in the country.

This was highlighted during the multi-sectoral meeting among DOST, Ateneo Social Computing Science Laboratory, Ateneo Center for Computing Competency and Research (ACCCRe), AI Singapore (AISG), and Komisyon sa Wikang Filipino (KWF) to discuss efforts in the Philippines on Natural Language Processing (NLP) and Large Language Models (LLMs).

With the theme “Towards Developing Large Language Models for Filipino Languages,” the gathering was attended by NLP, LLM, and Artificial Intelligence (AI) experts from academic institutions, government, and industry. The program comprised presentations of current work and discussions of the elements needed to build culturally appropriate LLMs for Philippine languages.

“Different Filipino languages provide connections to our cultural heritage and enables us to have a deeper understanding about our identity and historical roots.  We thank our researchers and partners from the academe, government, and industry in making solid efforts in providing solutions and opening opportunities for the Filipino people to utilize science, technology, and innovation in this initiative,” said DOST Secretary Renato U. Solidum, Jr.

The S&T initiatives funded by DOST for the development of LLMs for Philippine languages were presented.  One of the main DOST projects to preserve different languages of the Philippines is Project Marayum.  This project is a collaboratively built, desktop- and mobile phone-based, online dictionary platform for Philippine languages. Its goal is to empower native language speakers to create and curate an online dictionary of their language without needing to have technical expertise in website design, implementation, and maintenance.

A project funded to answer the need of a Filipino company is FilWordNet. Funded through the DOST-Collaborative Research and Development to Leverage Philippine Economy (CRADLE) Program, FilWordNet is a Philippine language resource built by DLSU-CCS researchers in collaboration with Senti Techlabs. The project tracks how word senses change in the digital realm through natural language processing and network science.

The DOST-Philippine Council for Industry, Energy, and Emerging Technology Research and Development (PCIEERD) funded the Interdisciplinary Signal Processing for Pinoys (ISIP) Program, which consists of seven (7) component projects implemented by the University of the Philippines Electrical and Electronics Engineering Institute (UP-EEEI). It also funded the subsequent Interdisciplinary Signal Processing for Pinoys (ISIP): Software Applications for Education (SAFE) Program, which has three (3) component projects implemented by UP-EEEI and De La Salle University (DLSU).

The ISIP Program developed the Philippine Languages Corpora, a collection of text and speech data for ten (10) spoken languages.

From these databases, several language models were created for Automatic Speech Recognition, Filipino Speech Synthesizer, Code Switching Detector, Essay Grader and English Proficiency training program applications.

The ISIP: SAFE Program developed an automated reading tutor for elementary students in Filipino, closed captioning systems for Philippine languages, and a Filipino language writing tool.

Other projects include Project iWag, a mobile web bidirectional neural machine translation system for Filipino and Cebuano by the University of the Immaculate Conception, and the Multi-Lingual Chatbot for Health Monitoring by De La Salle University, a healthcare chatbot capable of interpreting audio input and conversing in Filipino and Bisaya.

DOST-PCIEERD also funded the MinNa-LProc (Mindanao Natural Language Processing) Research and Development Laboratory and Senti Techlabs, a start-up company focusing on AI technologies.  MinNa-LProc Research and Development Laboratory intends to make use of NLP to aid in the protection of endangered languages in the Mindanao region such as Manobo, Mansaka, and Kalagan.

Senti Techlabs, for its part, devised a machine learning-based language classifier that demonstrates automatic detection of a document’s language.

“This is an opportunity for us to maximize the use of technologies available to us in preserving and propagating our different languages.  Through this roadmap, we can identify research gaps as well as the possible solutions in terms of technology, human resource, and policy,” said DOST-PCIEERD Executive Director Enrico C. Paringit.

The gathering was graced by Dr. William Tjhi, Head of Applied Research for Foundation Models at AISG. He presented SEA-LION (Southeast Asian Languages In One Network), a project started in 2023 to build a family of Southeast Asian LLMs and to construct well-represented language datasets that can support a wide range of AI applications in both research and industry. AISG offered its assistance and expressed willingness to share its work to facilitate collaboration with the Philippines.

KWF and the academic institutions present at the forum expressed support for this collaboration. KWF stressed the importance of making linguistic resources available to the public and likewise underscored the importance of a national dictionary and grammar guidelines.

Attention was also drawn to the issues, needs, and challenges in developing large language models, which must be considered in the formulation and implementation of the roadmap.