Madsci Network Dataset

What's available online?

The Madsci Network maintains an online archive of more than 40,000 answered questions. These question/answer pairs cover 1996 to 2011 and are organized in directories named YYYY-MM (where YYYY is the four-digit year and MM is the two-digit month), available as .zip or .tgz archives.
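For readers working from a downloaded copy, the following is a minimal Python sketch of walking the YYYY-MM directories and unpacking their archives. It assumes a hypothetical local layout in which each month directory sits under one root folder and holds its .zip or .tgz file(s). The function names month_dirs and extract_month are illustrative only and are not part of the Madsci Network distribution.

```python
import re
import tarfile
import zipfile
from pathlib import Path

# Directory names follow the YYYY-MM pattern described above.
MONTH_DIR = re.compile(r"^(\d{4})-(0[1-9]|1[0-2])$")


def month_dirs(root):
    """Return the YYYY-MM subdirectories of `root` in chronological order.

    Assumption: `root` is a local folder holding the downloaded month
    directories; anything not named YYYY-MM is ignored.
    """
    return sorted(
        p for p in Path(root).iterdir()
        if p.is_dir() and MONTH_DIR.match(p.name)
    )


def extract_month(month_dir, dest):
    """Unpack every .zip or .tgz archive in one month directory.

    Files land in dest/YYYY-MM. Returns the names of the archives that
    were unpacked. Only run this on archives from a source you trust.
    """
    target = Path(dest) / Path(month_dir).name
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(month_dir).iterdir()):
        if archive.suffix == ".zip":
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(target)
        elif archive.suffix == ".tgz":
            with tarfile.open(archive, "r:gz") as tf:
                tf.extractall(target)
        else:
            continue  # not an archive format named in the text
        extracted.append(archive.name)
    return extracted
```

Because YYYY-MM names sort lexicographically in date order, a plain sort is enough to process the months chronologically. If a month directory contains both a .zip and a .tgz of the same content, this sketch unpacks both, so you may prefer to keep only one format locally.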

The archive is released for research purposes only and all rights are reserved by The Madsci Network. All use of the archives should be acknowledged by using the following citation:

  Sethi, Ricky J. and Bry, Lynn, "The Madsci Network: Direct Communication of Science from Scientist to Layperson," 21st International Conference on Computers in Education (ICCE), 2013.

An additional 110,000+ submitted questions from 1995 to the present are maintained offline with associated metadata. Full names and email addresses are not included. Access to part or all of the 150,000+ question dataset may be provided on request, for research purposes only.

Code

In addition, we release the following code under the Common Public License. You are welcome to use the code under the terms of the license for research or commercial purposes; if you do, please acknowledge its use with the citation above.

  • clean.pl: A Perl script to clean up the .html files in the archive.
  • csv_export.pl: A Perl script to export the archives into CSV files suitable for use with topic-modeling programs like MALLET.
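
To illustrate the kind of processing these scripts are described as doing, here is a rough Python sketch that strips markup from an HTML page and writes one CSV row per document. It is not a port of clean.pl or csv_export.pl, whose exact behavior is not documented here. The three-column layout (id, label, text) and the names html_to_text and write_rows are assumptions made for illustration, and the output may need adjusting for a particular topic-modeling tool.

```python
import csv
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def write_rows(docs, out):
    """Write one CSV row per document to the open text file `out`.

    `docs` is an iterable of (doc_id, label, html) tuples. The column
    layout is an assumption, not the format produced by csv_export.pl.
    """
    writer = csv.writer(out)
    for doc_id, label, html in docs:
        writer.writerow([doc_id, label, html_to_text(html)])
```

Using the standard csv module means commas and quotes inside the question text are escaped correctly, which matters for free-form text like this.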