About The Corpus Project

This is a database of various texts in 10 Filipino languages. The database has a variety of purposes, namely, to:

These texts are intended for teaching and research. Instruction in the mother tongue (L1) is part of the new (2012) K-12 education curriculum, so teachers need texts that are appropriate for their students' reading level. This project uses readability algorithms to help rank the texts. Furthermore, the new curriculum includes L1 grammar instruction. Currently there are few grammar resources in these languages. This corpus is a starting point for researchers trying to formulate grammar instruction.

What the Corpus Does

The corpus is simply a list of language texts. There are news articles, blogs, poems, essays, stories, and song lyrics. But in addition to these texts, the corpus provides data on the individual passages, as well as the language as a whole. It identifies the following things about each entry:

It can also find the following things about the language as a whole:

A Functional Diagram of how this web software works

Every time a new entry is added, the corpus learns more about the language. Each word from the entry is added to a database of most commonly used words. This in turn is used to help determine which texts are easy to read and which are difficult: if a text has many common words, it will be easier to read; if it has many uncommon words, it will be more difficult.
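The update step described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's actual code: the names `tokenize`, `add_entry`, and `common_words`, the regular-expression tokenizer, and the `top_n` cutoff are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (letters, with inner hyphens or apostrophes)."""
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())


def add_entry(frequencies, text):
    """Add every word of a new entry to the running word-frequency table."""
    frequencies.update(tokenize(text))
    return frequencies


def common_words(frequencies, top_n=1000):
    """Return the set of the top_n most frequent words seen so far (cutoff is an assumption)."""
    return {word for word, _ in frequencies.most_common(top_n)}


# Placeholder entry text, used only to show the table growing.
frequencies = Counter()
add_entry(frequencies, "an bata an bata an balay")
print(frequencies.most_common(2))
```

Each new entry only increments counts, so the set of common words can be recomputed at any time without re-reading older entries.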


The corpus analyzes texts according to readability. Put simply, readability is a measure of how easy a text is to comprehend. The basic theory is that longer sentences are harder to read than shorter ones, and longer words are harder to comprehend than shorter ones. Most readability formulas combine the average sentence length and the average number of syllables per word into a readability score. This method tracks comprehension closely. For example, the Flesch-Kincaid Readability Index predicts comprehension level with a correlation of 0.91 when compared to comprehension tests. This is the Flesch Reading Ease formula:

Score = 206.835 - (1.015 × ASL) - (84.6 × ASW)

Where: ASL = average sentence length (number of words divided by number of sentences)

ASW = average word length in syllables (number of syllables divided by number of words)
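As a worked illustration of the formula above, here is a minimal Python sketch. The function name `flesch_reading_ease` and the sample counts are assumptions made for this example; how words, sentences, and syllables get counted is left to the caller.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease score computed from raw counts for one text."""
    asl = total_words / total_sentences   # average sentence length, in words
    asw = total_syllables / total_words   # average word length, in syllables
    return 206.835 - (1.015 * asl) - (84.6 * asw)


# Made-up counts: 100 words, 5 sentences, 150 syllables (ASL = 20, ASW = 1.5).
print(round(flesch_reading_ease(100, 5, 150), 3))  # 59.635
```

Higher scores mean easier text, and both longer sentences and longer words pull the score down.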

An alternative way to calculate readability, also highly correlated with comprehension, is to analyze the frequency of "hard" or "easy" words in a text. One common method is the Dale-Chall formula, described below:

  1. Select several 100-word samples throughout the text.

  2. Compute the average sentence length in words (divide the number of words by the number of sentences).

  3. Compute the percentage of words NOT on the Dale–Chall word list of 3,000 easy words.

  4. Compute this equation:

Raw Score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365


Raw Score = uncorrected reading grade of a student who can answer one-half of the test questions on a passage.

PDW = Percentage of Difficult Words not on the Dale–Chall word list.
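The steps above can be sketched in Python for a single sample. This is an illustration under stated assumptions: the function name `dale_chall_raw_score` is hypothetical, the `easy_words` set in the example is a tiny placeholder rather than the real 3,000-word list, and averaging over several 100-word samples is left out.

```python
def dale_chall_raw_score(words, sentence_count, easy_words):
    """Dale-Chall raw score for one sample, given its words and sentence count."""
    difficult = [w for w in words if w.lower() not in easy_words]
    pdw = 100.0 * len(difficult) / len(words)  # percentage of difficult words
    asl = len(words) / sentence_count          # average sentence length, in words
    return 0.1579 * pdw + 0.0496 * asl + 3.6365


# Placeholder data: 10 words in 2 sentences, 2 of them not on the easy list.
easy_words = {"the", "cat", "sat", "on", "a", "mat", "and", "ran"}
sample = ["The", "cat", "sat", "on", "a", "mat", "and", "ran",
          "ostentatiously", "thereafter"]
print(round(dale_chall_raw_score(sample, 2, easy_words), 4))  # 7.0425
```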

Challenges to Determining Readability in Waray

Readability algorithms are highly accurate measures of comprehension level, but the original formulas were developed for English, a Germanic language with many single-syllable words. In contrast, Waray has very few single-syllable words. Almost no nouns, verbs, or adjectives are monosyllabic. Filipino languages also have very few prepositions, in comparison to the many monosyllabic prepositions in English. In short, Filipino languages are typically polysyllabic, and furthermore, their grammar is agglutinative. Prefixes and suffixes change parts of speech. Other parts of speech are formed through affixes:

Therefore, the original readability formulas would categorize texts in these languages as much more difficult to comprehend than they really are, simply because their words have more syllables.

Furthermore, syllabification works differently in Filipino languages than in English. Adjacent vowels are never combined into one syllable (in English, "too" is one syllable; in Waray, "tuod" is two syllables).
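Because each vowel forms its own syllable, a syllable count for these languages can be approximated by counting vowel letters. The sketch below assumes exactly that rule; the names `VOWELS`, `count_syllables`, and `average_syllables_per_word` are hypothetical, and accented vowels or other orthographic marks are not handled.

```python
VOWELS = set("aeiou")


def count_syllables(word):
    """Count one syllable per vowel letter (assumed rule; adjacent vowels never merge)."""
    return sum(ch in VOWELS for ch in word.lower())


def average_syllables_per_word(words):
    """Average word length in syllables (the ASW input of the Flesch formula)."""
    return sum(count_syllables(w) for w in words) / len(words)


print(count_syllables("tuod"))  # 2
```

The same rule would wrongly give two syllables for English "too", which is one reason a syllable counter built for English cannot simply be reused.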

The Modified Spache Readability Formula

We therefore made a readability formula tailored for these languages: (a) sentence length and (b) frequency of common words determine readability; word length in syllables is disregarded. These criteria are the same as those of the Spache formula, which, like the Dale-Chall formula described above, combines average sentence length with the share of unfamiliar words. The result can be referred to as the Modified Spache Readability Formula.
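The two inputs named above can be computed as in the following Python sketch. The weights that combine them into a final score are not given here, so the sketch stops at the inputs. The function names, the punctuation-based sentence splitter, and the `common_words` set are assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Split on runs of '.', '!' or '?' (a rough splitter, assumed for this sketch)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def readability_inputs(text, common_words):
    """Return (average sentence length, percentage of words not in common_words)."""
    sentences = split_sentences(text)
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    asl = len(words) / len(sentences)
    uncommon = sum(w not in common_words for w in words)
    return asl, 100.0 * uncommon / len(words)


# Placeholder tokens stand in for real words; "aa" and "bb" play the common words.
print(readability_inputs("aa bb cc. aa dd.", {"aa", "bb"}))  # (2.5, 40.0)
```

No syllable count appears anywhere in the calculation, which is what sets this approach apart from the Flesch formula.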

A second challenge: many of the studied languages are predominantly oral, and there is no standardized orthography. A word might be spelled "diritso" or "derecho", "diri" or "dire", "damo" or "damu". The corpus project therefore checks words with multiple spellings and consolidates them according to guidelines created by Voltaire Oyzon et al. (2011) of Leyte Normal University in Tacloban City and Ricardo Ma. D. Nolasco of UP Diliman.
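Consolidating spellings can be pictured as a lookup from each variant to one chosen form, applied before words are counted. The sketch below is illustrative only: the `VARIANTS` table, the choice of which spelling counts as canonical, and the function names are assumptions, and the actual guidelines are not reproduced here.

```python
from collections import Counter

# Hypothetical variant table built from the example pairs above.
# Which spelling is treated as canonical is an arbitrary assumption.
VARIANTS = {
    "derecho": "diritso",
    "dire": "diri",
    "damu": "damo",
}


def consolidate(words, variants=VARIANTS):
    """Lowercase each word and map known variant spellings to one form."""
    return [variants.get(w.lower(), w.lower()) for w in words]


def count_consolidated(words):
    """Count words after consolidation, so variant spellings share one tally."""
    return Counter(consolidate(words))


print(count_consolidated(["damo", "damu", "Damu", "diri"]))
```

Without this step, "damo" and "damu" would each look rarer than the word really is, which would distort the common-word ranking.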

A third challenge: like any language, the ones collected here have regional dialects, which means different vocabulary is used in different locations, although the grammar is basically the same.


