Sunday, June 5, 2011

So how many words do I know?

I've been wondering for some time now how many Mandarin words I know. It's funny though, because I have no idea what my wordcount is in English (and frankly I don't care) but somehow this number for Mandarin is a measure of progress, and I do care.

So here are two estimates of my Mandarin wordcount - which unfortunately do not tie up at all :-)

Method 1 - flashcards
My self-estimate was around 5000 words, based on almost no science whatsoever.

I noted that in the Anki flashcard deck I have built up over the last couple of years, about 1500 'facts' are mature (ones I know well), and another 500 are in progress. In the early days each fact was just one word per card, but as I've progressed I've mainly been adding sentences - each of which probably contains somewhere between 1-6 new words. So if I assume that on average a 'fact' contains two words, that's 4000 I probably know. Then add another 1000 for words which I know but aren't in my Anki set, and that's 5000.
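For the record, the arithmetic behind that guess is simple enough to write down (the numbers below are just the rough estimates above, nothing more scientific):

```python
# Back-of-the-envelope flashcard estimate (numbers from the post).
mature_facts = 1500      # Anki facts that are mature, i.e. well known
in_progress = 500        # facts still being learned
words_per_fact = 2       # assumed average (cards range from 1-6 new words)
outside_anki = 1000      # rough guess for words I know that aren't in the deck

estimate = (mature_facts + in_progress) * words_per_fact + outside_anki
print(estimate)  # 5000
```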

No science, but I believed it to be true.

Method 2 - statistical sampling
I came across another method on Hugh Grigg's excellent East Asia Student blog, which uses the Known Chinese Words Test at zhtoolkit. Naturally, I quickly cleared some time this afternoon and ran through the test.

Basically it splits "the entire" Chinese vocabulary (36,000 words) into a number of groups, from very common words to very uncommon words. It then samples 165 words (15 per group) and notes how many you get right in each group.

The conclusion of the test is that I know just over 11,000 words. Wow - that's more than double what I thought. So either my own estimate was wrong, or Chad's method produces very large over-estimates.

Obviously the more words the test samples, the better the estimate would be. I think users should be given the option of taking a quick test (currently 165 questions) or a slow test (say, 500 questions), which would improve the estimate dramatically. Let me show you what I mean ...
  • The first group is made up of the most common 125 words, where I got 100% (of the 15 questions) right. This starts my total words known at 125. Good.
  • The same applies to the second group of 125 words - taking me to a total of 250. Still good.
  • In the third group I only got 80% of the 15 questions right, and since this group has 250 words, it adds just 200 (80% of 250) to my total. Still OK, taking me to 450 words so far.
  • This continues through more groups, each getting bigger, although each still only gets 15 sample questions. Naturally, since the words become progressively less common, you would expect your hit rate to fall - and mine did.
  • In the last group (words 24,001-36,000) I got 5/15 right, so this is extrapolated across these 12,000 words to deduce that I know 4000 words in this class. I'm flattered.
That 4000 makes a massive contribution to my total of 11,050 - but it is based on my knowing just 5 words out of the 15 tested. Of course his test shows a huge standard deviation in that group (1460!), but given how important this contribution is to the total (because it's such a large group) there really ought to be many more questions to get the deviation down to a reasonable number.
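To make the extrapolation concrete, here's a quick sketch of how the last group is counted, assuming the test uses simple binomial sampling (the group size and my score are from above; the standard-deviation formula is the textbook binomial one, and it lands close to the 1460 his test quotes):

```python
import math

# Extrapolating one group: 5 of 15 sampled words known,
# in a group of 12,000 words (words #24,001-36,000).
group_size = 12_000
sampled, known = 15, 5

p = known / sampled              # estimated fraction of the group known
contribution = p * group_size    # words credited from this group
print(round(contribution))       # 4000

# Standard deviation of that contribution under binomial sampling:
# sd = N * sqrt(p * (1 - p) / n)
sd = group_size * math.sqrt(p * (1 - p) / sampled)
print(round(sd))                 # ~1461, close to the quoted 1460
```

Note how the deviation shrinks with the square root of the sample size: quadrupling the 15 questions in that group would roughly halve the uncertainty.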

Looking at it in a different way, if you split all words into just two groups: the most common 10,000, and the least common 26,000, I find it odd that I appear to know 4000 in the first group, and 7000 in the least common group. Instinctively that doesn't make sense, and yet you can't argue with his method (other than over the small sample size).

Anyway ... 
I don't want to get carried away with detail (and part of me wants to believe my total really is 11,000 :-) ). Maybe the real answer lies somewhere between 5000 and 11,000 - I don't know. But I would be interested to redo this test every now and then, and see how I progress.

Just to clarify in closing, I'm really supportive of the zhtoolkit wordcount tool, and although I think it can still be improved, it was a really interesting exercise and well worth doing.

So if you're feeling brave, do the test (it really doesn't take long) and let us know your score.

Be counted.


  1. Thanks for the mention :) the zhtoolkit stuff is great, but it does seem like the known words test isn't all that accurate.

  2. Truly impressive. Pleased that it's been such a fruitful study.

  3. Thanks for writing about your experience with my study! Yes, I had been postulating that the estimates can be high, even when I was counting for myself. I believe what is happening is that one can mark something as "known" when it's never been encountered, but is so obvious that it can be easily guessed. This can include things like place names, names of organizations, phonetic transliterations, or straightforward compound words. As an analogy, consider that spelling bee contestants will know the spelling for a large number of words they have never studied, simply because they have studied Latin and Greek roots that tend to be used in English words. Biologists also can guess the meaning of many words in their field for the same reason, even as fast as new words get invented. In both these examples, the chance of knowing the word is nearly independent of its frequency in the language. Thus, the "number of words one knows" becomes highly dependent on the word base you are testing on, and gets inflated proportionally as the word field increases (e.g., from 36,000 to 72,000).

    My model for Chinese word knowledge is a first attempt to tackle this, by quantitatively identifying this baseline knowledge. Simply subtracting it presumes that the subject doesn't know *any* rare words, and is thus highly pessimistic. But at least it can represent the low end of the estimated known words.

    As for counting mature words in Anki, keep in mind the percent of mature words that you mark with Ease = 1. These are words you really don't know, so you should discount the total count by that percentage.

  4. Hugh, I agree with the concern about accuracy, but it is a new model ... I'm sure it will be refined extensively over time.

    Ruth, thanks - now if I could just get those words to fall as easily out of my mouth while I'm talking as I would want, then I'd be sorted ... :-)

  5. Great job on your word count! I'm going to try it myself one of these days :)

    Hope all is well, Greg!

  6. Chad, thanks for stopping by. Frustratingly, I wrote an awesome reply to you two days ago, but my browser locked up, and now I have to rewrite it. Trust me, my previous reply was really good! :-)

    My starting point is that at a calculated 11,000 estimated words, I should know nearly one-third of the total vocab that you postulate. I can assure you, based merely on the conversation I had this evening with a Chinese friend, that my vocab is nowhere near 1/3 of "reasonably possible" words.

    OK, so here are some rather detailed thoughts - I hope they're useful as you play around with your model.

    1. When I did the test, if I correctly guessed a word, I took it as 'known', but when I knew that I usually know the word yet couldn't quite remember it at that moment, I counted it as 'unknown'. Since it's your test, *you* can set the rules! I can think of two ways of dealing with this ...
    a. Tell people what they should do, i.e. "Count words as 'known' if XXX, and count them as 'unknown' if YYY." (Some guidance would have been appreciated.)
    b. Instead of just [known] and [unknown] buttons, you could add two more: [guessed correctly] and [usually know but can't remember now]. Not only will that help you make sure the user does it right, but it will also give you more information, allowing you to do all sorts of other cool things with the data.

    2. On the statistical side:
    * As you move from word #1 to word #36,000 they become less likely to be known. In the final group (24,001-36,000) #24,001 is *massively* more likely to be known than #36,000. This drop-off will be quite steep, so hopefully when you sample 15 words out of this group, #24,001 is massively more likely to be sampled than #36,000. Otherwise you'll be assuming that I know 12,000 extra words simply because I know the more common #24,001-24,005.
    * Your samples are too small, and this is reflected in the massive standard deviations as you work your way down the group. I think most people doing your test would rather get an accurate test even if it means 500 words are tested.
    * Similarly your groups are too large. I got 5/15 in the last group - and this was extrapolated to assume I know 4000 words in the group - which I can't imagine is true for words which are amongst the 12,000 least frequent words in the dictionary.

    3. And while I'm at it, allow me the cheek of respectfully making some suggestions.

    * The easiest is of course to test an increasing number of words as the group size increases and the probability of knowing the word decreases. The standard deviation will still increase unacceptably, but it will at least give more accuracy in the tail.
    * There are however ways of using the same number of questions but being more 'efficient' in where they're allocated. For example, how about testing 10 words for each of the first 5 groups, and on that basis determining in rough terms whether the learner is [beginner, intermediate, advanced]. Then use the remaining questions around the 'hot area' to get a more solid estimate. So, if they get less than 10% known around the 5000-word mark, then assume they know nothing beyond 10,000 and use the remaining samples to determine as accurately as possible where in the 0-10,000 range they lie. Similarly, if they have very high known-word scores in the first 5 groups, it's more sensible to skip sampling the 10-20,000 range and focus on the 20k+ range.
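    To illustrate, here's a hypothetical sketch of that two-phase idea. The function name, the thresholds and the 'hot' ranges are all my own invention for illustration - nothing here comes from Chad's actual test:

```python
# Hypothetical two-phase allocation: screen with a few questions on the
# most common words, then spend the remaining budget on the 'hot area'.
def allocate_questions(screen_hits, total_budget=165, screen_per_group=10):
    """screen_hits: hits out of `screen_per_group` for each of the
    first 5 (most common) groups. Returns the word-rank range to
    concentrate on, and the number of questions left to spend there."""
    used = 5 * screen_per_group
    remaining = total_budget - used
    avg = sum(screen_hits) / used       # overall hit rate in the screen
    if avg < 0.5:
        hot_range = (0, 10_000)         # beginner: stay below word #10,000
    elif avg < 0.9:
        hot_range = (5_000, 20_000)     # intermediate: probe the middle band
    else:
        hot_range = (20_000, 36_000)    # advanced: skip straight to the tail
    return hot_range, remaining

print(allocate_questions([10, 9, 8, 4, 1]))  # ((5000, 20000), 115)
```

    The point isn't the particular thresholds, which would need tuning against real data, but that the same 165 questions buy much more precision when they're concentrated where the learner's knowledge actually tails off.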

    Chad, I really like the idea of your test, and certainly am keen to use it every now and then to chart my progress. I hope these comments are useful. (And apologies to anyone who read this far, waiting for some kind of punchline ...)

  7. Kara, one of these days? Nooooooo - do it soon, it doesn't take long :-)

  8. "27467". I also feel that the estimates are a little high at the top end.

    It raises the question of how many words the average (educated?) Chinese speaker knows. I feel that to make a more accurate test, we would need a really good measure, i.e. a few books (a corpus) containing a lot of unique words. Have some natives & non-natives read these, pick out any words they don't know, note the frequency and total ratio, then compare with each person's averaged results on this test.

    Props to the ZHtoolkit guy, it was fun and I came across a few new words to add to my list.

  9. Mine was slightly below 10,000 and I know it's too high for me.