From Mefi Wiki
The Metafilter Corpus project is a collection of language data about Metafilter. It's intended as an adjunct to the Metafilter Infodump, focusing on the actual language content of the site, in contrast with the Infodump's focus on site activity numbers.
- The content of the project at this point is a series of frequency tables located here.
- Announcement in MetaTalk.
Some brief notes on the file format and the collection methodology follow.
Each table contains word-frequency information for comments posted to a specific subsite or set of subsites within a specific time period. The files have four header lines followed by a series of rows of tab-separated values, sorted in descending order of frequency.
The header lines are:
- 1. Date-range (and in some cases subsite) information about the file
- 2. Count of total words analyzed for the file, and count of distinct tokens in the file (i.e. number of rows after the header)
- 3. Column headers for count (number of times the given word appeared in comments), parts per million (normalized value for comparison with other frequency tables), and the word in question
- 4. Blank line for readability.
An example of the beginning of one such file:

 2010-01-01 to 2011-01-01
 86883789 total words, 567139 unique words
 count    PPM                 word
 
 3676618  42316.5016433618    the
 2469774  28426.1774080778    to
 2258729  25997.1281869395    a
 2075948  23893.3870621135    and
 1842864  21210.67717247      of
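For anyone wanting to work with these files programmatically, a minimal reader might look like the following. This is a sketch in Python, assuming the layout described above (three header lines, a blank line, then tab-separated rows of count, parts-per-million, and word); the function name is hypothetical.

```python
def read_frequency_table(path):
    """Parse a frequency-table file into its headers and data rows."""
    rows = []
    with open(path, encoding="utf-8") as f:
        date_range = f.readline().strip()  # header 1: date range (and subsite)
        totals = f.readline().strip()      # header 2: total and unique word counts
        f.readline()                       # header 3: column labels
        f.readline()                       # header 4: blank line for readability
        for line in f:
            count, ppm, word = line.rstrip("\n").split("\t")
            rows.append((int(count), float(ppm), word))
    return date_range, totals, rows
```

Since the rows are already sorted in descending order of frequency, the first tuple returned is the most common word in the table.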
The frequency tables are generated by a Perl script running with local access to the Metafilter database. For any given table, the script queries for the comment text fitting the subsite and date-range constraints, and then takes each individual comment through a tokenizing process to break the text down into individual words for counting.
The current tokenizing process involves:
- converting the source comment to all lower case
- removing HTML tag content to reduce the comment to plain text
- removing line breaks
- removing HTML named entities, e.g. &amp;amp; (&) or &amp;gt; (>)
- stripping most punctuation and replacing it with white space to prevent accidental concatenation of adjacent words
- splitting the resulting cleaned-up comment into individual word tokens wherever whitespace is found
- for each token, stripping any remaining character other than letters, numerals, hyphens, single-quotes (as apostrophes), and underscores
- further stripping any leading or trailing hyphens, underscores, and single-quotes, leaving only intra-word occurrences of those characters
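The steps above can be sketched roughly as follows. The actual script is Perl and its exact regular expressions aren't published, so this Python version is an approximation of the described behavior, not the real implementation:

```python
import re

def tokenize(comment):
    """Approximate the corpus tokenizing steps described above."""
    text = comment.lower()                    # convert to all lower case
    text = re.sub(r"<[^>]*>", " ", text)      # strip HTML tags
    text = re.sub(r"[\r\n]+", " ", text)      # remove line breaks
    text = re.sub(r"&[a-z]+;", " ", text)     # remove HTML named entities
    text = re.sub(r"[^\w'\-]+", " ", text)    # punctuation -> whitespace
    tokens = []
    for token in text.split():                # split on whitespace
        # keep only ASCII letters, numerals, hyphens, apostrophes, underscores
        token = re.sub(r"[^a-z0-9'\-_]", "", token)
        # trim leading/trailing hyphens, underscores, single-quotes
        token = token.strip("-'_")
        if token:
            tokens.append(token)
    return tokens
```

Note that this reproduces the quirks described in the notes below: non-ASCII characters are dropped ("año" becomes "ao"), and a trailing possessive apostrophe is stripped ("mefites'" becomes "mefites").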
The resulting tokens are counted with a hash and sorted before being written out to the resulting file.
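The counting and output stage might look something like this sketch, assuming a list of per-comment token lists as input; the function name and header argument are hypothetical, and a Counter stands in for the Perl hash:

```python
from collections import Counter

def write_frequency_table(comments_tokens, path, header_line):
    """Count tokens across comments and write a frequency-table file."""
    counts = Counter()
    for tokens in comments_tokens:
        counts.update(tokens)                 # hash-based counting
    total = sum(counts.values())
    with open(path, "w", encoding="utf-8") as out:
        out.write(header_line + "\n")
        out.write(f"{total} total words, {len(counts)} unique words\n")
        out.write("count\tPPM\tword\n\n")
        for word, n in counts.most_common():  # descending order of frequency
            ppm = n / total * 1_000_000       # normalize to parts per million
            out.write(f"{n}\t{ppm}\t{word}\n")
```

The parts-per-million column is what makes tables of different sizes comparable: a word at 42316.5 PPM accounts for about 4.2% of all words counted, regardless of the table's total.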
- The sanitizing steps in the tokenizing process (stripping punctuation, etc.) were designed to create a relatively clean word list for frequency calculation purposes, but they lead to small discrepancies between the literal strings occurring in some comments and the data in the frequency table. Words containing intentional non-alphabetic, non-numeric characters will have been transformed somewhat in the process; non-ASCII Unicode characters will be absent (so "año" becomes "ao"); possessive forms with a trailing apostrophe will be missing that apostrophe (so "mefite's" appears in the word list correctly, but "mefites'" will be counted as "mefites"); and so on.
- Deleted comments are not excluded from the corpus, so counts derived from a normal Metafilter site search may be lower for some tokens than the counts in these files.
- No attempt is currently made to identify and exclude those portions of a given comment consisting of quotation of some earlier comment in a thread or quotation from some external source. Accordingly, some tokens are arguably over-represented in the counts because they are reiterated in one or more replies to a given source comment or other text.
- Because HTML tags are simply stripped and ignored, text inside tag attributes such as title= fields is not captured in these files.