Corpus Statistics
- Total number of files in the corpus: 266
Words
- Total number of words in the corpus: 10715321
- Average number of words per file: 40283
- Longest file in corpus (words): 390161 (keku4.txt)
- Shortest file in corpus (words): 74 (aplc07.txt)
- Top 5 most common words in corpus: i (322041 - 3.00%), ka (310362 - 2.89%), ke (132437 - 1.23%), ia (102421 - 0.95%), ma (83400 - 0.77%)
- Top 5 least common words in corpus: ponaponawaikuamoo (1), ponetine (1), poneeaku (1), poneaikai (1), pone-to (1)
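
The word figures above can be checked with a short Python sketch. The function names here are hypothetical and the tool that produced this report is not known. The reported percentages and the per-file average are consistent with truncation (not rounding), which is what the sketch assumes.

```python
from collections import Counter


def truncated_percent(count, total):
    """Share of `total` as a percentage, truncated (not rounded) to two decimals.

    Truncation is an assumption inferred from the reported figures.
    """
    hundredths = count * 10000 // total  # percentage in hundredths of a percent
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


def word_summary(tokens, top_n=5):
    """Return (word, count, percentage) for the top_n most frequent tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(w, c, truncated_percent(c, total)) for w, c in counts.most_common(top_n)]


# Reproduce two of the reported values from the totals listed above.
total_words = 10715321
print(truncated_percent(322041, total_words))  # 3.00%
print(truncated_percent(310362, total_words))  # 2.89%
print(total_words // 266)                      # 40283 (average words per file)
```

How the original counts were tokenized (case folding, treatment of hyphens and the ʻokina) is not stated, so `word_summary` takes an already tokenized list.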
N-grams
- Top 5 most common 2-grams: o,ka (4877), i,ka (4119), me,ka (2949), a,me (2876), o,ke (1913)
- Top 5 most common 3-grams: a,me,ka (12209), no,ka,mea (7566), i,loko,o (5234), a,me,nā (4826), ʻo,ia,i (4389)
- Top 5 most common 4-grams: no,ka,mea,ua (2270), i,loko,o,ka (1919), he,wahi,moʻolelo,no (1297), moʻolelo,no,kauaʻula,a (1281), no,kauaʻula,a,me (1281)
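
A minimal sketch of how n-gram counts in this comma-joined format could be produced, assuming lowercase whitespace tokenization and n-grams that do not cross line boundaries. Both are assumptions, and `ngrams` and `top_ngrams` are hypothetical names. Under this scheme every 3-gram occurrence contains a 2-gram occurrence, so the 2-gram totals listed above (which are lower than the 3-gram totals) were presumably computed with a different scheme or on a different subset. The sketch does not attempt to reproduce that.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield consecutive n-token tuples from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def top_ngrams(lines, n, k=5):
    """Count n-grams line by line and return the k most common as (ngram, count)."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()  # assumed tokenization: lowercase, split on whitespace
        counts.update(",".join(gram) for gram in ngrams(tokens, n))
    return counts.most_common(k)


# Toy input, not taken from the corpus files.
sample = ["a me ka", "a me ka hale"]
print(top_ngrams(sample, 2))  # [('a,me', 2), ('me,ka', 2), ('ka,hale', 1)]
print(top_ngrams(sample, 3))  # [('a,me,ka', 2), ('me,ka,hale', 1)]
```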
Lines
- Total number of lines in the corpus: 1118176
- Average number of lines per file: 4203
- Longest file in corpus (lines): 37281 (kam.txt)
- Shortest file in corpus (lines): 26 (aplc07.txt)
- Total number of lines in the frequency file: 96140
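
The per-file line figures can be derived from a mapping of file name to line count, as in the sketch below. The function name and the toy file names are hypothetical. The average uses integer division, which matches the reported value (1118176 / 266 is about 4203.67, reported as 4203).

```python
def line_summary(line_counts):
    """Summarize a {filename: line_count} mapping."""
    total = sum(line_counts.values())
    longest = max(line_counts, key=line_counts.get)
    shortest = min(line_counts, key=line_counts.get)
    return {
        "total": total,
        "average": total // len(line_counts),  # truncated, as the report appears to do
        "longest": (longest, line_counts[longest]),
        "shortest": (shortest, line_counts[shortest]),
    }


# Toy mapping with made-up file names and counts.
toy = {"a.txt": 10, "b.txt": 3, "c.txt": 8}
print(line_summary(toy))
# {'total': 21, 'average': 7, 'longest': ('a.txt', 10), 'shortest': ('b.txt', 3)}

# Check the reported average against the reported totals.
print(1118176 // 266)  # 4203
```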