The Hawaiian Corpus Project: Data from a corpus of written Hawaiian

This repository contains data based on a corpus of texts written in the Hawaiian language (ʻŌlelo Hawaiʻi). The data includes frequency lists, stopwords, and lists of the most common n-grams. The text in the corpus was obtained from Ulukau, the Hawaiian Electronic Library.

The corpus contains 10.7 million words and is restricted to modern (post-20th century), non-scriptural text. An overview of corpus statistics, including the most common words and n-grams, can be seen here.

The corpus was processed using scripts from the corpus-tools project.

Data

You can download all of the data by clicking here. Note that the individual files linked below are quite large and may take some time to load if you try to view them directly in your browser.

Files included in this repository:

Hawaiian frequency list: A list of all the words in the corpus, arranged by frequency
Hawaiian stopwords list: A list of stopwords derived from the frequency file (this is being actively verified and updated for eventual inclusion in the stopwords-json project)
List of Hawaiian bigrams: A list of the most common sequences of two words, arranged by frequency
List of Hawaiian 3-grams: A list of the most common sequences of three words, arranged by frequency
List of Hawaiian 4-grams: A list of the most common sequences of four words, arranged by frequency
Statistics for the Hawaiian corpus
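
The file formats and the tokenization rules used by the corpus-tools scripts are not described in this README, so the following is only a minimal Python sketch of how frequency lists, n-gram lists, and stopword candidates of this kind can be built from plain text. The token pattern, the function names, and the sample sentences are assumptions made for illustration, not the project's actual method.

```python
import re
import unicodedata
from collections import Counter

# Assumed token pattern: ASCII letters, vowels with kahakō, and the ʻokina (U+02BB).
# The real corpus-tools tokenizer may use different rules.
TOKEN_RE = re.compile(r"[a-zāēīōūʻ]+")


def tokenize(text):
    """Normalize to NFC, lowercase, and split text into word tokens."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return TOKEN_RE.findall(normalized)


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens.

    Note: this simple version lets n-grams span sentence boundaries.
    """
    return Counter(zip(*(tokens[i:] for i in range(n))))


def stopword_candidates(frequencies, top_n):
    """Return the top_n most frequent words as candidates for manual review."""
    return [word for word, _ in frequencies.most_common(top_n)]


# Hypothetical sample text, not taken from the corpus.
sample = "Ua hele ke keiki i ke kula. Ua hele ke kumu i ke kula."
tokens = tokenize(sample)
freq = word_frequencies(tokens)
bigrams = ngram_frequencies(tokens, 2)

print(freq.most_common(1))  # [('ke', 4)]
print(bigrams.most_common(3))
print(stopword_candidates(freq, 3))
```

Taking the most frequent words is only a starting point for a stopword list. As noted above, the published list is being verified by hand, which a purely frequency-based cutoff cannot replace.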

Caveats

Although only Hawaiian texts were included, the nature of the source material made it difficult to avoid some English (for example, copyright notices and bilingual text), as well as orthographic and OCR errors. Every effort has been made to remove and correct these, but the corpus is far from perfect and errors inevitably remain. However, to the best of my knowledge no similar electronic corpus of Hawaiian texts is available, and I hope this data will still be useful in supporting language revitalization efforts in Hawaiʻi.
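
For anyone who wants to screen the frequency list for leftover English or OCR noise, one rough heuristic is to flag tokens containing characters outside the modern Hawaiian alphabet (a, e, i, o, u, their kahakō forms, h, k, l, m, n, p, w, and the ʻokina). The Python sketch below illustrates that idea only. It is not the cleaning procedure used for this corpus, and the function names and sample tokens are hypothetical.

```python
import unicodedata

# Letters of the modern Hawaiian alphabet, long vowels with kahakō,
# and the ʻokina (U+02BB).
HAWAIIAN_LETTERS = set("aeiouāēīōūhklmnpwʻ")


def looks_non_hawaiian(token):
    """Return True if the token has any character outside HAWAIIAN_LETTERS."""
    normalized = unicodedata.normalize("NFC", token).lower()
    return any(ch not in HAWAIIAN_LETTERS for ch in normalized)


def flag_for_review(tokens):
    """Return the tokens that should be checked by a human reader."""
    return [t for t in tokens if looks_non_hawaiian(t)]


# Hypothetical sample tokens, not taken from the corpus.
sample_tokens = ["aloha", "copyright", "hawaiʻi", "reserved", "kula"]
print(flag_for_review(sample_tokens))  # ['copyright', 'reserved']
```

This check is deliberately conservative in what it claims. English words spelled only with these letters (such as "he" or "no") pass unnoticed, while loanwords, proper names, and ʻokina written with an apostrophe are flagged even when legitimate, so flagged tokens are candidates for review rather than for automatic removal.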

Acknowledgements

Many thanks to Dr. Candace Kaleimamoowahinekapu Galla of the University of British Columbia for her suggestions and support.

Thanks also to Ulukau.org for making such a rich collection of Hawaiian texts available online, without which the formation of this corpus would not have been possible.

License

CC0.