Some topics are specific, while others aren't really "topics" but language that comes up because we are writing in a certain context. Academic writing will talk about "paper abstract data", and a Wikipedia article will talk about "list links history". The difference is often measurable in terms of burstiness. A content-ful topic will occur in relatively few documents, but when it does, it will produce a lot of tokens. A "background" topic will occur in many documents and have a high overall token count, but never produce many tokens in any single document.

The rank-1 metric counts how often a given topic is the single most frequent topic in a document. Specific topics like "music album band song" or "cell cells disease dna blood treatment" are the "rank 1" topic in many documents. High token-count topics often have few rank-1 documents, so this metric is often useful as a way to identify corpus-specific stopwords. But rarer topics can also have few rank-1 documents: "day year calendar days years month" is a representative example. Topics for days of the week and units of measurement often appear in documents as a distinct discourse, but they are rarely the focus of a document.

On the metric that compares a word's token count with its document count within a topic, the highest ranked topic is the "polish poland danish denmark sweden swedish na norway norwegian sk red" topic, suggesting that those ill-fitting words may be isolated in a few documents. Although this metric has the same goal as coherence, the two don't appear to correlate well: bursty words aren't necessarily unrelated to the topic, they're just unusually frequent in certain contexts.
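As a rough illustration of the rank-1 count described above, the sketch below tallies, for each topic, the number of documents in which it is the single most frequent topic. The function names, the dictionary layout, and the toy counts are all assumptions made for this example; they do not come from any particular toolkit. Documents with a tie for first place are skipped, which is one possible reading of "single most frequent".

```python
from collections import Counter


def rank1_doc_counts(doc_topic_counts):
    """Count, per topic, the documents where it is the sole top topic.

    doc_topic_counts: list of dicts mapping topic label -> number of
    tokens in that document assigned to the topic (hypothetical layout).
    """
    counts = Counter()
    for topic_counts in doc_topic_counts:
        if not topic_counts:
            continue
        top = max(topic_counts.values())
        winners = [t for t, n in topic_counts.items() if n == top]
        # Assumption: a tie means no topic is the *single* most frequent.
        if len(winners) == 1:
            counts[winners[0]] += 1
    return counts


def token_totals(doc_topic_counts):
    """Total tokens per topic across all documents."""
    totals = Counter()
    for topic_counts in doc_topic_counts:
        totals.update(topic_counts)
    return totals


# Toy data, invented for illustration only.
example_docs = [
    {"music": 40, "calendar": 5, "background": 20},
    {"music": 30, "background": 25},
    {"cells": 50, "calendar": 8, "background": 22},
    {"cells": 35, "background": 30, "calendar": 4},
]

rank1 = rank1_doc_counts(example_docs)
totals = token_totals(example_docs)
for topic in sorted(totals):
    print(topic, "tokens:", totals[topic], "rank-1 docs:", rank1[topic])
```

In this toy data the "background" topic has the highest token total but is never rank 1, while the rare "calendar" topic also has no rank-1 documents, mirroring the two cases in the text.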
The token-versus-document metric compares the number of times a word occurs in a topic (measured in tokens) and the number of documents the word occurs in as that topic (instances of the word assigned to other topics are not counted). The document-side quantity is proportional to the number of documents that contain at least one token of type \(w\) that is assigned to the topic. A word that occurs many times in only a few documents may appear prominently in the sorted list of words, but may not be a good representative word for the topic.

We usually think of the probability of a topic given a document. For this metric we calculate the probability of documents given a topic:

\[P(d | k) = \frac{N(d, k)}{\sum_{d'} N(d', k)}\]

where \(N(d, k)\) is the number of tokens in document \(d\) that are assigned to topic \(k\).
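The probability of documents given a topic can be sketched in a few lines. The example below assumes that \(P(d | k)\) is the share of a topic's tokens that fall in each document, and that the resulting distribution is summarized by its entropy; both the entropy summary and the function names are assumptions for illustration, not something stated in the text above.

```python
import math


def doc_given_topic(tokens_per_doc):
    """Normalize one topic's per-document token counts into P(d | k).

    tokens_per_doc: list where entry d is the number of tokens in
    document d assigned to the topic (hypothetical input layout).
    """
    total = sum(tokens_per_doc)
    if total == 0:
        raise ValueError("topic has no tokens in any document")
    return [n / total for n in tokens_per_doc]


def distribution_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy counts, invented for illustration only.
spread_out = [10, 10, 10, 10]    # topic spread evenly over documents
concentrated = [40, 0, 0, 0]     # topic confined to one document

print(distribution_entropy(doc_given_topic(spread_out)))
print(distribution_entropy(doc_given_topic(concentrated)))
```

Under these assumptions a topic spread evenly over many documents gets a high entropy, and a topic concentrated in a few documents gets a low one.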