Sunday, June 5, 2011

So how many words do I know?

I've been wondering for some time now how many Mandarin words I know. It's funny though, because I have no idea what my wordcount is in English (and frankly I don't care) but somehow this number for Mandarin is a measure of progress, and I do care.

So here are two estimates of my Mandarin wordcount - which unfortunately do not tie up at all :-)


Method 1 - flashcards
My self-estimate was around 5000 words, based on almost no science whatsoever.

I noted that in the Anki flashcard deck I have created over the last couple of years, about 1500 'facts' are mature (ones I know well), and another 500 are in progress. In the early days the facts were just one word per card, but as I've progressed I've mainly been adding sentences - each of which probably contains somewhere between 1 and 6 new words. So if I assume that on average a 'fact' contains two words, that's 4000 I probably know. Then add another 1000 for words which I know but which aren't in my Anki set, and that's 5000.
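
A minimal sketch of that back-of-the-envelope sum, in Python. The variable names are made up for illustration, and the two-words-per-fact average and the extra 1000 words are the guesses described above, not measurements.

```python
# Rough flashcard-based estimate; every input is a guess, not a measurement.
mature_facts = 1500        # Anki 'facts' known well
in_progress_facts = 500    # 'facts' still being learned
words_per_fact = 2         # assumed average (cards hold roughly 1-6 new words)
words_outside_anki = 1000  # assumed words known but never added to the deck

words_in_anki = (mature_facts + in_progress_facts) * words_per_fact
estimate = words_in_anki + words_outside_anki

print(f"Words via Anki: {words_in_anki}")
print(f"Total estimate: {estimate}")
```

Changing the assumed average to 3 words per fact would already push the total to 7000, which shows how soft this number is.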

No science, but I believed it to be true.


Method 2 - statistical sampling
I came across another method on Hugh Grigg's excellent East Asia Student blog, which uses the Known Chinese Words Test of zhtoolkit.com. Naturally, I quickly cleared some time this afternoon, and ran through the test.

Basically it splits "the entire" Chinese vocabulary (36,000 words) into a number of groups, from very common words to very uncommon words. It then samples 165 words (15 per group) and notes how many you get right in each group.

The conclusion of the test is that I know just over 11,000 words. Wow - that's more than double what I thought. So either my own guess was wrong, or Chad's method produces very large over-estimates.

Obviously the more words he tests you against, the better the estimate would be. I think users should be given the option of taking a quick test (currently 165 questions) or a slow test (say, 500 questions), which would improve the estimate dramatically. Let me show you what I mean ...
  • The first group is made up of the most common 125 words, where I got 100% (of the 15 questions) right. This starts my total words known at 125. Good.
  • The same applies to the second group of 125 words - taking me to a total of 250. Still good.
  • In the third group I only got 80% of the 15 questions, and since this group has 250 words, it adds just 200 (80% of 250) to my total. Still OK, taking me to 450 words so far.
  • This continues through more groups, each bigger than the last, although each still gets only 15 sample questions. Naturally, since the words become progressively less common, you would expect your hit rate to fall - and mine did.
  • In the last group (words 24,001-36,000) I got 5/15 right, so this is extrapolated to these 12,000 words to deduce that I know 4000 words in this class. I'm flattered.
That 4000 from the last group makes a massive contribution to my total of 11,050 - but it is based on my knowing just 5 words out of the 15 tested. Of course his test shows a huge standard deviation in that group (1460!), but given how important this contribution is to the total (because it's such a large group), there really ought to be many more questions to bring the deviation down to a reasonable number.
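
The per-group extrapolation can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the binomial standard-error formula is an assumption about how the test works, although it does reproduce the figure of about 1460 reported for the last group.

```python
import math

def group_estimate(group_size, correct, asked):
    """Extrapolate a sample result to a whole frequency group.

    Returns (estimated words known, standard deviation of that estimate).
    Assumes a simple binomial model; the real test may differ.
    """
    p = correct / asked
    known = group_size * correct / asked
    sd = group_size * math.sqrt(p * (1 - p) / asked)
    return known, sd

# Figures quoted in the post: (group size, correct answers, questions asked)
first_known, first_sd = group_estimate(125, 15, 15)    # most common 125 words
third_known, third_sd = group_estimate(250, 12, 15)    # 80% of 15 questions
last_known, last_sd = group_estimate(12_000, 5, 15)    # words 24,001-36,000

print(f"Last group: {last_known:.0f} words, give or take {last_sd:.0f}")

# Same hit rate (one in three), but with 50 questions instead of 15
bigger_known, bigger_sd = group_estimate(12_000, 50 / 3, 50)
print(f"With 50 questions: {bigger_known:.0f} words, give or take {bigger_sd:.0f}")
```

Under this model the uncertainty shrinks with the square root of the number of questions, so it takes roughly four times as many questions in a group to halve the deviation.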

Looking at it in a different way, if you split all words into just two groups, the most common 10,000 and the least common 26,000, I find it odd that I appear to know 4000 in the most common group and 7000 in the least common group. Instinctively that doesn't make sense, and yet you can't argue with his method (other than the small sample size).


Anyway ...
I don't want to get carried away with detail (and part of me wants to believe my total really is 11,000 :-) ). Maybe the real answer lies somewhere between 5000 and 11,000 - I don't know. But I would be interested to redo this test every now and then, and see how I progress.

Just to clarify in closing, I'm really supportive of the zhtoolkit wordcount tool, and although I think it can still be improved, it was a really interesting exercise and well worth doing.

So if you're feeling brave, do the test (it really doesn't take long) and let us know your score.

Be counted.