Data
Federated Web Search
We have released several datasets to support research on federated web search. The datasets contain samples from real search engines.
- FedWeb Greatest Hits (WWW 2015)
Recommended. This dataset combines FedWeb 2013 and 2014 and contains additional data.
- FedWeb 2014 dataset (TREC 2014)
- FedWeb 2013 dataset (TREC 2013)
- FedWeb 2012 dataset (CIKM 2012)
Movember motivations
Motivations provided in Movember profiles, annotated according to the Social Identity Model of Collective Action (van Zomeren et al., 2008).
Download: [zip file]
D. Nguyen, T. van den Broek, C. Hauff, D. Hiemstra and M. Ehrenhard: #SupportTheCause: Identifying Motivations to Participate in Online Health Campaigns at EMNLP 2015. [pdf]
NL-TR word level language identification
Because the forum is no longer online, we have removed the dataset from the web. If you are interested in using it, please send us an email.
D. Nguyen, A.S. Doğruöz: Word Level Language Identification in Online Multilingual Communication at EMNLP 2013. [pdf]
Kernel independence testing
Synthetic datasets: [zip file] (262 MB)
Code: [Github]
D. Nguyen and J. Eisenstein. A Kernel Independence Test for Geographical Language Variation. Computational Linguistics, Volume 43, Issue 3. 2017. [pdf]
Evaluating local explanations for text classification
Data: [zip file] (75.1 MB)
Code: [Github]
D. Nguyen. Comparing automatic and human evaluation of local explanations for text classification. NAACL 2018. [pdf]
Urban Dictionary
Annotations: [Github]
D. Nguyen, B. McGillivray, T. Yasseri. Emo, love, and god: Making sense of Urban Dictionary, a crowd-sourced online dictionary. Royal Society Open Science. [link]