About The Corpus Project
This is a database of various texts in 10 Filipino languages. The database has a variety of purposes, namely, to:
- create an archive of language texts
- analyze the structure of these languages, specifically word frequency
- rank texts according to comprehension level
These texts are intended for teaching and research. Instruction in the mother tongue (L1) is part of the new (2012) K-12 education curriculum, so teachers need texts that are appropriate for their students' reading level. This project uses readability algorithms to help rank the texts. Furthermore, the new curriculum includes L1 grammar instruction. Currently there are few grammar resources in these languages. This corpus is a starting point for researchers trying to formulate grammar instruction.
What the Corpus Does
The corpus is simply a list of language texts. There are news articles, blogs, poems, essays, stories, and song lyrics. But in addition to these texts, the corpus provides data on the individual passages, as well as the language as a whole. It identifies the following things about each entry:
- word count
- sentence count
- words per sentence
- syllable count
- average number of syllables per word
- percentage of words that appear on the list of most frequent words
- readability metrics (Flesch-Kincaid, Coleman-Liau, SMOG, Gunning Fog, Flesch Grade Level)
It can also find the following things about the language as a whole:
- most frequent words
- usage patterns of individual words (such as "an" vs. "it" in Winaray, for instance)
- conjugation patterns (e.g., when to use the affixes "um", "in", and "nag")
- rules for standard spelling (orthography)
A Functional Diagram of how this web software works
Every time a new entry is added, the corpus learns more about the language. Each word from the entry is added to a database of most commonly used words. This in turn is used to help determine which texts are easy to read and which are difficult: if a text has many common words, it will be easier to read; if it has many uncommon words, it will be more difficult.
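As a rough illustration of the update step described above, the sketch below keeps a running word count with Python's collections.Counter. The tokenizer, the add_entry helper, and the sample strings are assumptions made for this example (the strings are built from Waray words mentioned in this article and are not real corpus entries); the project's actual database code is not shown here.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters, optionally joined by a hyphen or apostrophe.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [w.lower() for w in WORD_RE.findall(text)]


def add_entry(frequencies, text):
    """Add every word of a new entry to the running frequency table."""
    frequencies.update(tokenize(text))
    return frequencies


# Placeholder entries, not real corpus texts.
frequencies = Counter()
add_entry(frequencies, "An sumat tuod. An sumat damo.")
add_entry(frequencies, "Damo an tuod.")

# The most frequent word so far, with its count.
print(frequencies.most_common(1))
```

A set of the top N words taken from such a table could then serve as the "common words" list when judging how difficult a new text is.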
Readability
The corpus analyzes texts according to readability. Put simply, readability measures how easy a text is to comprehend. The basic theory is that longer sentences are harder to read than shorter ones, and longer words are harder to comprehend than shorter ones. Most readability formulas combine the average sentence length and the average number of syllables per word into a single score. This approach correlates strongly with measured comprehension. For example, the Flesch-Kincaid Readability Index is reported to correlate at 0.91 with the results of comprehension tests. This is the Flesch Reading Ease formula:
Score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
Where: ASL = average sentence length (number of words divided by number of sentences)
ASW = average word length in syllables (number of syllables divided by number of words)
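The formula above can be written as a small function. This is a minimal sketch: the function name and the sample counts are made up for illustration, and the word, sentence, and syllable counts are assumed to have been computed elsewhere.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Return the Flesch Reading Ease score from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    asl = words / sentences   # average sentence length in words
    asw = syllables / words   # average syllables per word
    return 206.835 - (1.015 * asl) - (84.6 * asw)


# Illustrative counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
print(round(score, 3))  # 59.635
```

Higher scores indicate easier text. Because ASW is multiplied by 84.6, even a modest rise in syllables per word lowers the score sharply.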
An alternative way to calculate readability, also highly correlated with comprehension, is to analyze the frequency of "hard" or "easy" words in a text. One common method is the Dale-Chall formula, applied as follows:
- Select several 100-word samples throughout the text.
- Compute the average sentence length in words (divide the number of words by the number of sentences).
- Compute the percentage of words NOT on the Dale–Chall word list of 3,000 easy words.
- Compute the raw score with this equation:
Raw Score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
Where:
Raw Score = uncorrected reading grade of a student who can answer one-half of the test questions on a passage.
PDW = percentage of difficult words, that is, words not on the Dale–Chall word list.
ASL = average sentence length in words.
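The steps above can be sketched in a few lines of Python. The function name is hypothetical, and the easy-word set here is a tiny stand-in for the real 3,000-word Dale–Chall list, so the resulting number is only meaningful as a demonstration of the arithmetic.

```python
def dale_chall_raw_score(words, sentence_count, easy_words):
    """Return the Dale-Chall raw score for one sample of text.

    words          -- list of word tokens in the sample
    sentence_count -- number of sentences in the sample
    easy_words     -- set of lowercase words counted as "easy"
    """
    if not words or sentence_count <= 0:
        raise ValueError("sample must contain words and sentences")
    difficult = sum(1 for w in words if w.lower() not in easy_words)
    pdw = 100 * difficult / len(words)   # percentage of difficult words
    asl = len(words) / sentence_count    # average sentence length
    return 0.1579 * pdw + 0.0496 * asl + 3.6365


# Stand-in easy-word list (NOT the real Dale-Chall list).
easy = {"the", "cat", "sat", "on", "mat", "with", "a"}
sample = ["The", "cat", "sat", "on", "the", "mat",
          "with", "a", "ubiquitous", "marsupial"]

# 2 of 10 words are "difficult" (PDW = 20) and ASL = 10 / 2 = 5.
raw = dale_chall_raw_score(sample, sentence_count=2, easy_words=easy)
print(round(raw, 4))  # 7.0425
```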
Challenges to Determining Readability in Waray
Readability formulas correlate well with comprehension level, but the original formulas were developed for English, a Germanic language with many single-syllable words. In contrast, Waray has very few single-syllable words. Almost no nouns, verbs, or adjectives are monosyllabic. Filipino languages also have very few prepositions, whereas English has many monosyllabic ones. In short, Filipino languages are typically polysyllabic, and their grammar is agglutinative: prefixes and suffixes change a word's tense or part of speech, so new forms are built by adding affixes to a root:
- sumat (tell) --> magsumat (telling), magsusumat (will tell), susumaton (tales)
- ada (there) --> mayada (has/have), magkamayada (let there be)
Therefore, the original readability formulas would rate texts in these languages as much more difficult to comprehend than they are, simply because the words have more syllables.
Furthermore, syllabification is different in Filipino languages than in English. Vowels are never combined into one syllable (in English, "too" is one syllable; in Waray, "tuod" is two syllables).
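Under the rule just stated, a first approximation of a Waray syllable counter is simply a vowel counter. The sketch below assumes the five vowel letters a, e, i, o, u and treats "y" and "w" as consonants; it is an illustration of the rule, not the project's actual syllabifier, and borrowed words with foreign spellings may not follow it.

```python
# Assumption: each of these letters forms its own syllable nucleus.
VOWELS = set("aeiou")


def count_syllables(word):
    """Count syllables by counting vowel letters, one syllable per vowel."""
    return sum(1 for ch in word.lower() if ch in VOWELS)


for word in ["tuod", "sumat", "magsumat", "magkamayada"]:
    print(word, count_syllables(word))
# tuod 2
# sumat 2
# magsumat 3
# magkamayada 5
```

An English-oriented counter would typically merge adjacent vowels such as "uo" into one syllable, which is one reason such counters cannot be reused unchanged.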
The Modified Spache Readability Formula
We therefore made a readability formula tailored to these languages: (a) sentence length and (b) frequency of common words determine readability; syllable count is disregarded. These are the same criteria the Spache formula uses, so the result can be referred to as the Modified Spache Readability Formula.
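The two inputs named above can be computed as follows. This sketch stops at the inputs because the coefficients of the modified formula are not given here; the function name, the sentence-splitting rule, and the sample data are all assumptions for illustration.

```python
import re

# Assumed rules: sentences end at ., ! or ?; words are runs of letters.
SENTENCE_END = re.compile(r"[.!?]+")
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def readability_inputs(text, common_words):
    """Return (average sentence length, percent of words not in common_words)."""
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)
    uncommon = sum(1 for w in words if w not in common_words)
    return asl, 100 * uncommon / len(words)


# Placeholder text and common-word list, not real corpus data.
asl, pct_uncommon = readability_inputs(
    "An sumat tuod. An sumat damo.", common_words={"an", "sumat"}
)
print(asl, round(pct_uncommon, 1))  # 3.0 33.3
```

Note that no syllable count appears anywhere in the function, which is the point of the modification.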
A second challenge: many of the studied languages are predominantly oral, and there is no standardized orthography. A word might be spelled "diritso" or "derecho", "diri" or "dire", "damo" or "damu". The corpus project therefore checks words with multiple spellings and consolidates them according to guidelines created by Voltaire Oyzon et al. (2011) of Leyte Normal University in Tacloban City and Ricardo Ma. D. Nolasco of UP Diliman.
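One simple way to consolidate variant spellings is a lookup table that maps each variant to a single preferred form before counting. The sketch below is hypothetical: the table, the function names, and in particular the choice of which spelling counts as preferred are arbitrary picks for illustration and do not reproduce the cited guidelines.

```python
from collections import Counter

# Hypothetical variant -> preferred spelling table (arbitrary choices).
CANONICAL_SPELLING = {
    "derecho": "diritso",
    "dire": "diri",
    "damu": "damo",
}


def normalize_spelling(word):
    """Map a word to its preferred spelling, or return it unchanged."""
    word = word.lower()
    return CANONICAL_SPELLING.get(word, word)


def consolidated_counts(words):
    """Count words after merging variant spellings."""
    return Counter(normalize_spelling(w) for w in words)


counts = consolidated_counts(["diri", "dire", "Damu", "damo", "derecho"])
print(counts["diri"], counts["damo"], counts["diritso"])  # 2 2 1
```

Merging variants before counting keeps a common word from being split into several rarer-looking entries in the frequency list.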
A third challenge: like any language, the ones collected here have regional dialects, which means different vocabulary is used in different locations, although the grammar is basically the same.
References
de Veyra, V. I. Ortograpiya han Binisaya. (A. K. de Veyra, Trans.). In Luangco, G.C. (Ed.), Kandabao: Essays on Waray language, literature, and culture. Tacloban City, PH: Divine Word University Press
Godin, E.S. 2007. "Mga Batakan Sa Panitik Sa Binisaya-Sinugboanon," gipatik alang sa Pasinati sa Panitik ug Batadila sa Binisaya-Sinugboanon MSU-IIT, Iligan City Peb. 22-23, 2007 tinambayayongan sa Komisyon sa Wikang Filipino ug BATHALAD-Mindanao
Lobel, J. W. 2009. Samar-Leyte. In Brown, K. and Ogilvie, S. Concise encyclopedia of languages of the world. (pp. 914-916) Oxford, UK: Elsevier, Ltd.
Luangco, G. C. (Ed.) 1982. Kandabao: Essays on Waray language, literature, and culture. Tacloban City, PH: Divine Word University Press.
Makabenta, Eduardo A. (2004). "Binisaya-English; English-Binisaya Dictionary", Adbox: Quezon City, Philippines.
Romualdez, N. L. Orthography and Prosody. In Luangco, G.C. (Ed.), Kandabao: Essays on Waray language, literature, and culture. Tacloban City, PH: Divine Word University Press.
Rubino, C. 2001. Waray Waray. In Garry, J. and Rubino, C. Facts about the world's languages: An encyclopedia of the world's major languages, past and present. New York, NY: Wilson Press
Tramp, G.D. (1997). Waray-English Dictionary. Dunwoody Press: Maryland, USA
The UCLA Phonetic Lab Archives. 2007. Retrieved from http://archive.phonetics.ucla.edu/
Wolff, J. U. 1968. The Historical Development of the Leyte-Samar Bisayan Vowel System. Leyte-Samar Studies 1 (1), 19-25