From Mefi Wiki
The infodump is a collection of information about Metafilter, for the number crunchers. See also the Metafilter Corpus project for more language-focused data.
- As of January 2012, the zipped-up data files themselves are hosted on the mefi.us domain.
- The location of the downloads: as of August 2009, active again and automatically updated every Saturday at 2 am PDT.
- Announcement in MetaTalk.
- The analyses are mostly listed in MetaAnalysis.
Some brief notes on the file formats follow.
The first lines of postdata_mefi.txt.zip (the Mefi file), slightly reformatted: I include the date at the beginning here, but I'll remove it in the following examples.
Wed Oct 1 06:39:16 2008
postid  userid  datestamp                category  comments  favorites  deleted  reason
19      1       1999-07-14 15:03:04.930  0         116       35         0        [NULL]
24      15      1999-07-14 21:58:18.327  0         1         0          0        [NULL]
25      1       1999-07-15 09:37:51.770  0         6         0          0        [NULL]
26      16      1999-07-15 09:54:26.280  0         4         0          0        [NULL]
27      16      1999-07-15 09:57:54.160  0         1         0          0        [NULL]
The AskMe, MeTa, & Music headers are the same. The category field values differ for each of the four files: askme and meta have several topical categories each, represented by values 1-n; music uses the category value to distinguish song posts from "Music Talk" posts; and mefi posts have no category information and so list a zero in every row. See below for a key to each category.
Deleted posts are included in these files, along with a deletion reason where one was provided; a value of 1 in the deleted column indicates a deleted post. Deleted comments are not counted toward the comment totals for each thread.
For Metatalk posts, thread-closure status is also captured in the deleted column; a value of 2 indicates the thread was closed, and a value of 3 indicates the post was both closed and deleted (presumably in that order).
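A minimal sketch of reading one of these post data files in Python and interpreting the deleted column as described above. It assumes the file is tab-separated and that the dump timestamp line is followed by a header row; neither the delimiter nor the column order in the sample is stated on this page, so treat both as assumptions. The helper names `read_postdata` and `decode_deleted` are hypothetical.

```python
import csv
import io

def decode_deleted(value):
    """Map the deleted column to (is_deleted, is_closed) flags.

    0 = live, 1 = deleted, 2 = closed (Metatalk only),
    3 = closed and deleted (Metatalk only).
    """
    code = int(value)
    if code not in (0, 1, 2, 3):
        raise ValueError(f"unexpected deleted value: {value!r}")
    return code in (1, 3), code in (2, 3)

def read_postdata(handle, delimiter="\t"):
    """Yield one dict per post; assumes a timestamp line, then a header row."""
    next(handle)  # skip the dump timestamp on the first line
    yield from csv.DictReader(handle, delimiter=delimiter)

# Tiny in-memory sample in the assumed layout (column order is an assumption).
sample = (
    "Wed Oct 1 06:39:16 2008\n"
    "postid\tuserid\tdatestamp\tcategory\tcomments\tfavorites\tdeleted\treason\n"
    "19\t1\t1999-07-14 15:03:04.930\t0\t116\t35\t0\t[NULL]\n"
)
rows = list(read_postdata(io.StringIO(sample)))
is_deleted, is_closed = decode_deleted(rows[0]["deleted"])
```

Because the rows are keyed by header name, the sketch does not depend on the column order of the real file, only on the header names matching.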
Category value key
Metatalk categories:
Cat ID  Description             URL stub
1       bugs                    bugs
2       feature requests        feature-requests
3       etiquette/policy        policy
4       uptime                  uptime
5       MetaFilter-related      meta-meta
6       general weblog-related  weblogs
7       ticketstub project      ticketstub
8       MetaFilter gatherings   meetups
9       MetaFilter Music        music
10      Ask MetaFilter          ask
11      MeFi Podcast            mefi-podcast
The URL stub is the string by which to access a by-category-only index of Metatalk. A url of the form http://metatalk.metafilter.com/feature-requests would, for example, provide a list of only those posts filed as "feature requests".
- Only six of these categories are still available as category selections by users making new Metatalk posts: 1, 2, 3, 4, 5, and 8.
- Category 11, "MeFi Podcast", is usable only by admins to denote official podcast posts (which are also indexed on the Podcast subsite).
- Category 10, "Ask Metafilter", was used early in the life of the AskMe subsite, before it was ported to its own database table. It was never a selectable option for Metatalk posts.
- Categories 6 and 7 have both been removed as obsolete. The "general weblog-related" category was more relevant early in the site's history, when the crowd and much of the site content were more explicitly and narrowly blog-centric. The "ticketstub project" category was used for posts about a ticketstub-memories idea Matt had been working on at the time, which has since been set aside.
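As a rough illustration of the Metatalk key, the sketch below maps a category value from the meta post data file to its by-category index URL and reports whether the category is still selectable. The dictionary and function names are hypothetical; the values come from the table and notes above.

```python
# Category id -> (description, URL stub), copied from the Metatalk key.
METATALK_CATEGORIES = {
    1: ("bugs", "bugs"),
    2: ("feature requests", "feature-requests"),
    3: ("etiquette/policy", "policy"),
    4: ("uptime", "uptime"),
    5: ("MetaFilter-related", "meta-meta"),
    6: ("general weblog-related", "weblogs"),
    7: ("ticketstub project", "ticketstub"),
    8: ("MetaFilter gatherings", "meetups"),
    9: ("MetaFilter Music", "music"),
    10: ("Ask MetaFilter", "ask"),
    11: ("MeFi Podcast", "mefi-podcast"),
}

# Categories that users can still pick when making a new Metatalk post.
USER_SELECTABLE = {1, 2, 3, 4, 5, 8}

def metatalk_category_url(cat_id):
    """Return the by-category index URL for a Metatalk category id."""
    _description, stub = METATALK_CATEGORIES[cat_id]
    return f"http://metatalk.metafilter.com/{stub}"

def is_user_selectable(cat_id):
    """True if the category is still offered on the posting form."""
    return cat_id in USER_SELECTABLE
```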
AskMe categories:
Cat ID  Description                   URL stub
1       computers & internet          computers-internet
2       technology                    technology
3       home & garden                 home-garden
4       work & money                  work-money
5       sports, hobbies, & recreation sports-hobbies-recreation
6       society & culture             society-culture
7       travel & transportation       travel-transportation
8       science & nature              science-nature
9       education                     education
10      health & fitness              health
11      shopping                      shopping
12      food & drink                  food-drink
13      writing & language            writing-language
14      human relations               human-relations
15      media & arts                  media-arts
16      pets & animals                pets-animals
17      religion & philosophy         religion-philosophy
18      clothing, beauty, & fashion   clothing-beauty-fashion
19      law & government              law-government
20      grab bag                      grab-bag
The URL stub values work just as with Metatalk above: http://ask.metafilter.com/home-garden will, for example, display an index of only recent "home & garden" questions.
Music categories:
1  song
2  talk post
When Music was first launched, all posts were songs uploaded by users. A new section, Music Talk, was added July 2nd, 2008. Posts in the talk section have category_id 2; songs have category_id 1.
Music posts are not sortable by category via any URL stub; songs are listed on the front page of music.metafilter.com while talk posts are listed at music.metafilter.com/home/talk . Users cannot select a category per se at post time, but are presented with an initial page asking whether they intend to post a song or a talk post.
Metafilter has no category information associated with posts. All values in the category column of postdata_mefi.txt are 0.
postid  title
21616   Lord of the Peeps!
21617   boot bus
21618   Fireworks in England
21619   The Teddy bear turns 100
Note that titles were not initially part of the posting form for any subsite other than Music. Titles have since been added to some (or in the case of AskMe all or very nearly so) posts created before the introduction of the title field, by the backtagging crew (most likely) or by an admin.
- Post titles were added to Metafilter on November 12th, 2002.
- Post titles were added to AskMe on February 17, 2005.
- Post titles were added to Metatalk on February 13th, 2007.
- Post titles were present from launch day for Music.
postid  title  above  below  url  urldesc
19      12     136    0      24   12
25      0      223    0      0    0
26      16     62     0      70   16
...
108426  19     256    0      35   49
108427  40     368    130    0    0
108428  10     253    0      25   35
Length is a raw character count of each field, including white space and html.
Title is the thread title; above is the above-the-fold text area; below is the below-the-fold "more inside" area. url and urldesc have non-zero values only in the postlength_mefi file, and correspond to the dedicated link and linktext fields users can use when posting. The fields are included across all postlength files to keep the file format consistent across subsites.
comment-id  postid  userid  datestamp                faves  best answer?
1           1       1       1999-06-13 17:48:00.000  0      0
2           24      1       1999-07-15 01:21:06.213  0      0
4           2       1       1999-07-15 01:58:52.340  0      0
5           26      1       1999-07-15 10:00:12.850  0      0
6           25      16      1999-07-15 10:04:48.563  0      0
The best answer column lists a value of 1 for askme answers that have been marked "best" by the asker, and a 0 for all other answers. The column lists zeroes for all rows in the non-askme comment files.
Deleted comments are not included in these files.
comment-id  length
1           40
2           209
4           96
5           92
6           132
Length is a raw character count of the comment, including whitespace and html.
tag_id  link_id  link_date                tag_name
1       38715    2005-01-18 01:06:16.560  testing
3       38715    2005-01-18 01:06:16.560  metafilter
4       38733    2005-01-18 15:23:29.233  silly
5       38733    2005-01-18 15:23:29.233  recursion
6       38733    2005-01-18 15:23:29.233  photo
Tags can be added to a post in three different cases:
- 1. By the poster at post creation time.
- 2. By the poster, a mutual contact of the poster, or an admin, at some point after post creation.
- 3. By a member of the backtagging crew at some point after post creation, if the post received no tags at the time it was created.
Tags whose link_date values are equal to the datestamp of their corresponding (by link_id) post were added by the poster at post creation time (case 1). There is no simple way to distinguish, with the data available in the database, between tags added in (case 2) or (case 3), though it's very likely that tags added to posts created before the original introduction of tagging to the various subsites were added by the backtagging crew.
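The comparison described above can be sketched as follows: a tag counts as case 1 when its link_date equals the datestamp of the post it is attached to, and as case 2 or 3 otherwise. The function names are hypothetical, and the post datestamps in the sample are invented purely for illustration.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

def parse_stamp(text):
    """Parse an Infodump-style timestamp such as '2005-01-18 01:06:16.560'."""
    return datetime.strptime(text, STAMP_FORMAT)

def split_tags_by_case(tags, post_datestamps):
    """Split tag ids into added-at-creation (case 1) and added-later (case 2 or 3).

    tags: iterable of (tag_id, link_id, link_date) tuples.
    post_datestamps: dict mapping link_id to the post's datestamp string.
    Tags whose post is missing from post_datestamps are returned separately.
    """
    at_creation, later, unknown = [], [], []
    for tag_id, link_id, link_date in tags:
        post_stamp = post_datestamps.get(link_id)
        if post_stamp is None:
            unknown.append(tag_id)
        elif parse_stamp(link_date) == parse_stamp(post_stamp):
            at_creation.append(tag_id)
        else:
            later.append(tag_id)
    return at_creation, later, unknown

tags = [
    (1, 38715, "2005-01-18 01:06:16.560"),
    (4, 38733, "2005-01-18 15:23:29.233"),
    (7, 99999, "2005-01-19 08:00:00.000"),  # made-up row with no matching post
]
# Post datestamps below are invented for illustration only.
posts = {
    38715: "2005-01-18 01:06:16.560",
    38733: "2005-01-17 09:00:00.000",
}
at_creation, later, unknown = split_tags_by_case(tags, posts)
```

The sketch cannot separate case 2 from case 3; as noted above, the available data does not support that distinction.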
Tags are automatically removed from deleted posts at deletion time, though that may not always have been the case.
Tag creation date approximation
Tag creation time is approximate for all tags that fall into cases 2 and 3, above, and exact for case 1.
When a tag is created within the database, its link_date is set equal to the creation date of the post it is attached to, regardless of when the tag is added or by whom. To compensate for this, the Infodump uses a simple heuristic to provide a more correct approximation of the tag's creation time:
- Given that tag_ids are created in chronological order, and so no tag record n+1 could be created earlier than tag record n
For each record,
- 1. Let LASTDATE be the approximate date of the prior record.
- 2. For each tag record, compare two dates:
  - the link_date for this tag record stored in the database
  - LASTDATE.
- 3. Record the most recent of those two dates as the approximate date of this tag record's creation (and ergo the new LASTDATE).
Therefore, for any tag whose link_date is equal to that of the corresponding post, the link_date is exact; for any other tag, the link_date is an approximation in the form of the earliest date at which that tag could possibly have been added. The margin for error on this estimate is exactly the difference between a tag's listed link_date and the next newer link_date present in the tagdata file.
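The heuristic amounts to a running maximum over the stored link_date values, taken in tag_id order. The sketch below is one way to express it, not the Infodump's actual code; the function name and the sample dates are made up.

```python
from datetime import datetime

def approximate_tag_dates(records):
    """Apply the LASTDATE heuristic to tag records.

    records: list of (tag_id, stored_link_date) pairs, sorted by tag_id,
    where stored_link_date is a datetime (the attached post's creation date).
    Returns a list of (tag_id, approximate_creation_date) pairs.
    """
    approximated = []
    last_date = None  # no prior record before the first tag
    for tag_id, link_date in records:
        # Keep the more recent of the stored date and LASTDATE.
        if last_date is None or link_date > last_date:
            last_date = link_date
        approximated.append((tag_id, last_date))
    return approximated

# Made-up records: tag 2 was backtagged onto an older post.
records = [
    (1, datetime(2005, 1, 18)),
    (2, datetime(2001, 3, 4)),
    (3, datetime(2005, 1, 20)),
]
result = approximate_tag_dates(records)
```

In the sample, tag 2 inherits January 18th, 2005 from tag 1, since it cannot have been created before the record that precedes it.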
A similar technique is used in calculating Contact creation dates; see below.
Some example lines, the first valid ones of each type:
faveid   faver  favee  type  target   parent  datestamp
1        1      23470  1     51485    0       2006-05-09 10:40:43.467
376      33779  17897  2     1304343  51504   2006-05-10 17:37:38.530
12       1      30348  3     37730    0       2006-05-10 09:00:50.297
343      1      36467  4     584780   37790   2006-05-10 16:59:02.670
3        1      19832  5     11837    0       2006-05-09 14:12:52.670
418      1490   20496  6     311412   11855   2006-05-10 18:07:51.670
276      1      14928  7     317      0       2006-05-10 14:22:31.060
18386    1      1983   8     9        0       2006-06-29 20:57:24.060
18662    29872  30452  9     61       0       2006-06-30 11:04:06.983
77718    1      22242  9     2557     618     2006-10-22 11:16:09.553
47272    4741   508    10    10       0       2006-08-25 21:51:45.807
346380   52871  35136  11    7072     0       2007-07-06 17:41:58.100
283418   191    191    12    3        1       2007-05-24 16:03:46.233
1052954  1      191    13    1        1511    2008-05-27 09:38:47.633
- 1 & 2 - Metafilter post & comment
- 3 & 4 - Ask Metafilter post & comment
- 5 & 6 - MetaTalk post & comment
- 7 - Projects post
- 8 - Music post
- 9 - Music comment - if the parent is 0, this is broken.
- 10 - Jobs post
- 11 & 12 - Travel post & comment
- 13 - Projects comment
For post-type favorites, "target" is the link_id of the post being favorited.
For comment-type favorites, "target" is the comment_id of the comment being favorited, and "parent" is the link_id of the thread in which the comment resides.
Note that the case of a fave of type 9, a Music comment, with a parent of 0, is the result of a sporadic bug introduced at the launch of Music on June 29th, 2006 and present until October 21st, 2006. There are approximately 30 degenerate favorites of this sort in the database, and they may be either repaired or removed in the future.
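A short sketch of decoding a favorites row using the type key and the target/parent rules above. The names are hypothetical. Flagging any comment-type favorite with a parent of 0 as degenerate is a generalization made here; the page only documents that case for type 9.

```python
# Favorite type -> (subsite, kind), from the type key above.
FAVE_TYPES = {
    1: ("Metafilter", "post"), 2: ("Metafilter", "comment"),
    3: ("Ask Metafilter", "post"), 4: ("Ask Metafilter", "comment"),
    5: ("MetaTalk", "post"), 6: ("MetaTalk", "comment"),
    7: ("Projects", "post"), 8: ("Music", "post"),
    9: ("Music", "comment"), 10: ("Jobs", "post"),
    11: ("Travel", "post"), 12: ("Travel", "comment"),
    13: ("Projects", "comment"),
}

def describe_favorite(fave_type, target, parent):
    """Resolve a favorite's subsite, kind, and the ids it points at."""
    subsite, kind = FAVE_TYPES[fave_type]
    if kind == "post":
        # For post favorites, target is the link_id of the post itself.
        return {"subsite": subsite, "kind": kind, "post_id": target,
                "comment_id": None, "degenerate": False}
    # For comment favorites, target is the comment_id and parent the thread.
    return {"subsite": subsite, "kind": kind,
            "post_id": parent if parent != 0 else None,
            "comment_id": target, "degenerate": parent == 0}

mefi_comment = describe_favorite(2, 1304343, 51504)
broken_music = describe_favorite(9, 61, 0)
```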
Favorites that have been removed by the favoriting user are deleted from the database; accordingly, the faveid values present in this file are not strictly sequential.
contactee  contacter  date
1          14155      2004-06-15 12:00:00.000
1          2238       2004-06-15 12:00:00.000
1          14275      2004-06-15 12:00:00.000
...
13099      7683       2004-06-17 16:31:51.040
15231      14752      2004-06-17 16:31:51.040
...
45087      7610       2007-10-31 12:23:15.683
16719      61         2007-10-31 13:28:38.670
48758      1          2007-10-31 13:47:16.843
Contacter is the id of the user creating the contact; contactee is the user added as a contact.
All dates on or after 2007-10-31 11:55:38.840 are exact; all dates prior to that are approximate.
Contact creation date approximation
Because date-of-creation information for contacts did not exist in the database originally, the date provided for all records before 10/31/07 is an approximation, best formally described as "the earliest date at which the contact could possibly have been created". The algorithm for determining the date is as follows:
- Given that the earliest date any contact could have been added was June 15th, 2004 (the date the feature was launched), and
- Given that contacts are created in chronological order, and so no contact record n+1 could be created earlier than contact record n
For each record,
- 1. Let LASTDATE be the approximate date of the prior record.
- 2. For each contact record, compare three dates:
  - the date the contacter joined mefi
  - the contactee's join date
  - LASTDATE.
- 3. Record the most recent of those three dates as the approximate date of this contact record's creation.
The approximation relies on the following assumptions: Alice cannot create a contact before she has joined the site; Bob cannot be the target of a contact before he has joined the site; and no contact can be added before the date a previous contact was added. The date each record was created therefore cannot be earlier than the most recent of those three checkpoints.
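The contact-date approximation can be sketched as a running maximum seeded with the feature's launch date. This is an illustration under stated assumptions, not the Infodump's actual code: the function name is hypothetical, the join dates are invented, and the noon launch time is taken from the earliest date in the sample rows.

```python
from datetime import datetime

# Launch date of the contacts feature; the 12:00 time of day is an assumption.
FEATURE_LAUNCH = datetime(2004, 6, 15, 12, 0)

def approximate_contact_dates(contacts, join_dates):
    """Return the earliest possible creation date for each contact record.

    contacts: list of (contacter, contactee) userid pairs in record order.
    join_dates: dict mapping userid to that user's join datetime.
    """
    last_date = FEATURE_LAUNCH
    dates = []
    for contacter, contactee in contacts:
        # Most recent of: contacter's join date, contactee's join date, LASTDATE.
        last_date = max(join_dates[contacter], join_dates[contactee], last_date)
        dates.append(last_date)
    return dates

# Invented join dates for three users.
join_dates = {
    1: datetime(2000, 1, 27),
    2: datetime(2004, 11, 20),
    3: datetime(2001, 5, 1),
}
contacts = [(1, 3), (2, 1), (3, 1)]
dates = approximate_contact_dates(contacts, join_dates)
```

In the sample, the third record keeps the November 20th, 2004 date even though both of its users joined earlier, which is the static LASTDATE behaviour described in the text.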
This functionally limits the accuracy of the approximate dates to however frequently brand-new users either create or are the target of contact records; LASTDATE will remain static for stretches between these events.
- Signups were closed at the time the feature was launched, which means that the appearance of contacts involving newer-than-LASTDATE users is very infrequent up until November of 2004 when signups reopened. Consequently, the approximations for the first several months of the feature's use are exceptionally poor.
- The approximate date will never be correct; it will always be too early an estimate, as the time between when any new user joins the site and when they either create their first contact or become a contact of someone else is clearly non-zero.
- While it is apparently impossible to know what degree of error is involved in any approximate date, it may be possible to at least approximate the upper bound of the degree of error by comparing any given approximate date to the next distinct date.
userid  joindate                 name
1       2000-01-27 20:16:57.367  mathowie
8       2000-01-27 20:16:57.367  OneBallJay
13      2000-01-27 20:16:57.367  jeffp
16      2000-01-27 20:16:57.367  jjg
17      2000-01-27 20:16:57.367  honkzilla
This list includes all accounts that have at some point been active, with the sole exception (to the best of cortex's knowledge) of early experimental accounts removed from the db by Matt near the original launch of the site.
Gaps in the userid sequence before November 18th, 2004 are the result of bugs/testing work on the site (cortex doesn't have as much detail about this as he'd like, yet!); gaps on and after that date (when paid $5 signups began) most likely correspond to accounts for which a potential new user began the signup process (thus reserving the username and hence a db row) but did not complete it.
This file includes accounts which were once active but have subsequently been closed by their owners or by an admin.
Signup dates weren't tracked until January 27, 2000. Users who registered before that get placeholders for signup date; the Infodump lists "2000-01-27 20:16:57.367", and profile pages list "sometime in 1999".
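When analyzing join dates, it may help to treat the placeholder as "unknown" rather than as a real timestamp. A minimal sketch, with a hypothetical function name, assuming join dates are compared as the strings found in the file:

```python
# Placeholder the Infodump lists for users who registered before tracking began.
PLACEHOLDER_JOINDATE = "2000-01-27 20:16:57.367"

def joindate_is_known(joindate):
    """False when the join date is the pre-tracking placeholder."""
    return joindate.strip() != PLACEHOLDER_JOINDATE
```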
Tools for working with the Infodump
- If you want to get started with the files in Excel, see Infodump and Excel for some steps. Note the restrictions below.
- FishBike has posted SQL scripts to import the Infodump (announcement); tested with SQL Server 2005. Kadin2048 has provided MySQL versions of these scripts, which are available from the same page.
- Pronoiac's beanplate lets you make simple queries to filter the Infodump.
- Combustible Edison Lighthouse assembled the infodumpster (announcement), which lets you quickly put together complicated queries to the Infodump.
- Python scripts for working with Infodump files.
Upon request, a user can have their id obscured in Infodump contents. In all instances where their userid would normally appear, a unique 7-digit fake id is listed instead, such that the user's activity/id connection is broken while analytical views of the data can still function normally.
Any analysis that makes assumptions about the meaningfulness of a usernumber should take this into account and sanity check for munged ids.
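One possible sanity check is to flag any userid that is seven digits long. This sketch assumes that genuine userids have fewer than seven digits, which this page does not state; if real ids ever reach that range the check would need another signal. The function name is hypothetical.

```python
def looks_munged(userid):
    """True if the id is seven digits long, the shape used for obscured ids.

    Assumes genuine userids are shorter than seven digits.
    """
    return 1_000_000 <= int(userid) <= 9_999_999

# Example: keep only rows whose userid appears to be genuine.
userids = ["1", "16", "4823901", "52871"]
genuine = [u for u in userids if not looks_munged(u)]
```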
- Excel 95 supports 16384 rows
- Excel 97 to 2003 support 65536 rows
- Excel 2007 supports 1048576 rows
- OpenOffice 1.1 & 2 support 32000 rows
- OpenOffice 3 supports 65536 rows
- MS Access supports a total of 2 gigabytes, also limiting temporary storage
Metafilter Corpus project
A younger sibling to the Infodump, the Metafilter Corpus project is focused more on actual language use on the site.