***************************
Datasets used in "Large-margin Predictive Latent Subspace Learning for Multi-view Data Analysis"
***************************

Ning Chen and Jun Zhu
chenn07[at]mails.tsinghua.edu.cn, junzhu[at]cs.cmu.edu

(C) Copyright 2010, Ning Chen (chenn07 [at] mails [dot] tsinghua [dot] edu [dot] cn) and Jun Zhu (junzhu [at] cs [dot] cmu [dot] edu)

These datasets and features were used to evaluate the Multi-view Latent Space Markov Network. They comprise the TRECVID 2003 data, the 13class animal Flickr image data, and the hotel review data.

They are for journal review only. 

------------------------------------------------------------------------

TABLE OF CONTENTS

A. TRECVID 2003 Dataset

B. 13class animal Flickr image data

C. Hotel Review Data for Regression

D. Hotel Review Data with Paragraph Information

------------------------------------------------------------------------

A. TRECVID 2003 Dataset

The TRECVID 2003 dataset contains 1078 manually labeled video shots that belong to 5 categories. Each shot is represented as a 1894-dim binary vector of text features and a 165-dim HSV color histogram extracted from the associated keyframe. We evenly split this dataset into 539 training samples and 539 testing samples.

1. Feature Files

The file "ImageData.dat" contains the text and image features of the samples. The format is shown as follows:

[Sample ID] [Label] [Number of Non-Zero text features] [text_term_id1:count1] ... [text_term_idN:countN] [165-dim real HSV color features]
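
A minimal Python sketch of reading one record in this format is given below. It assumes that fields are separated by whitespace and that each sample occupies a single line; neither is stated above. The names Shot and parse_shot are illustrative and are not part of the dataset. The label is kept as a string because its encoding is not specified here.

```python
from typing import Dict, List, NamedTuple

HSV_DIM = 165  # length of the dense HSV color histogram, as described above


class Shot(NamedTuple):
    sample_id: str
    label: str
    text: Dict[int, float]  # sparse text features: term id -> value
    hsv: List[float]        # dense HSV color histogram


def parse_shot(line: str) -> Shot:
    """Parse one record, assuming whitespace-separated fields on one line."""
    tokens = line.split()
    sample_id, label, n_text = tokens[0], tokens[1], int(tokens[2])
    sparse = tokens[3:3 + n_text]   # the id:count pairs
    dense = tokens[3 + n_text:]     # everything after the sparse block
    if len(dense) != HSV_DIM:
        raise ValueError(f"expected {HSV_DIM} HSV values, got {len(dense)}")
    text = {}
    for tok in sparse:
        term_id, value = tok.split(":")
        text[int(term_id)] = float(value)
    return Shot(sample_id, label, text, [float(x) for x in dense])
```

Checking the length of the dense block is a cheap way to detect a record whose non-zero count does not match the number of id:count pairs that follow it.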

------------------------------------------------------------------------

B. 13class animal Flickr image data

The 13class animal dataset is a subset of NUS-WIDE (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm), a large image dataset constructed from Flickr web images. The subset contains 3411 images in 13 categories: squirrel, cow, cat, zebra, tiger, lion, elephant, whales, rabbit, snake, antlers, hawk and wolf. Of these, 2054 images are randomly selected for training and the remaining 1357 are used for testing.

1. Training and Testing Images

The folders "TrainFlickrImage" and "TestFlickrImage" contain the web images for training and testing, respectively. Each image is named by its sample ID, which matches the sample ID in the corresponding training or testing feature file.

2. Feature Files

For each image, six types of low-level features are extracted: five real-valued descriptors totaling 634 dimensions (64-dim color histogram, 144-dim color correlogram, 73-dim edge direction histogram, 128-dim wavelet texture and 225-dim block-wise color moments) and a 500-dim bag-of-word representation based on SIFT features. The 1000-dim online tags are also downloaded for evaluating image annotation.

1) 2054 Training Image Features

The file "TrainTagSift_13_2054_allFeature.dat" contains the text and image features of the training samples. The format is shown as follows:

[Sample ID] [Label] [Number of Non-zero Tag Features] [Number of Non-zero SIFT Features] [tag_term_id1:count1] ... [tag_term_idN:countN] [SIFT_term_id1:count1] ... [SIFT_term_idM:countM] [634-dim real-valued low-level features]
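
As an illustration, the record layout above could be split into its three views as sketched below. The sketch assumes whitespace-separated fields with one sample per line, which is not stated above. The function names parse_sparse and parse_flickr_line are hypothetical helpers, not part of the dataset.

```python
from typing import Dict, List, Tuple

LOW_LEVEL_DIM = 634  # 64 + 144 + 73 + 128 + 225, as listed above

FlickrSample = Tuple[str, str, Dict[int, float], Dict[int, float], List[float]]


def parse_sparse(tokens: List[str]) -> Dict[int, float]:
    """Turn a list of 'id:count' tokens into a {id: count} mapping."""
    out = {}
    for tok in tokens:
        term_id, count = tok.split(":")
        out[int(term_id)] = float(count)
    return out


def parse_flickr_line(line: str) -> FlickrSample:
    """Parse one record, assuming whitespace-separated fields on one line."""
    tokens = line.split()
    sample_id, label = tokens[0], tokens[1]
    n_tag, n_sift = int(tokens[2]), int(tokens[3])
    tag_end = 4 + n_tag            # tag pairs start right after the two counts
    sift_end = tag_end + n_sift    # SIFT pairs follow the tag pairs
    tags = parse_sparse(tokens[4:tag_end])
    sift = parse_sparse(tokens[tag_end:sift_end])
    dense = [float(x) for x in tokens[sift_end:]]
    if len(dense) != LOW_LEVEL_DIM:
        raise ValueError(f"expected {LOW_LEVEL_DIM} dense values, got {len(dense)}")
    return sample_id, label, tags, sift, dense
```

The two counts at the start of the record are what separate the tag pairs from the SIFT pairs, since both blocks use the same id:count syntax.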

2) 1357 Testing Image Features

The file "TestTagSift_13_1357_allFeature.dat" contains the text and image features of the testing samples. The format is the same as in the training data.

------------------------------------------------------------------------

C. Hotel Review Data for Regression

The hotel review dataset consists of 5000 hotel reviews randomly collected from TripAdvisor (http://www.tripadvisor.com). Each review document is associated with two-view features (i.e., 12000-dim bag-of-word features and 14-dim contextual features) as well as a global rating score. We uniformly partition the dataset into training and testing sets.

1. Bag-of-word Dictionary

The file "hotReviewRegressDictionary.dat" is the 12000-dim bag-of-word dictionary of the hotel review dataset for regression.

2. Hotel Review Feature Files

The file "hotelReviewRegression.dat" contains the two-view features of all the review samples (both training and testing). The format is shown as follows:

[Sample ID] [Rating Score] [Number of Non-zero Bag-of-word Features] [term_id1:count1] ... [term_idN:countN] [14-dim real valued contextual features]
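
A short Python sketch of reading one review record follows. It assumes whitespace-separated fields, one review per line, and a numeric rating score; none of these is confirmed above. The name parse_review_line is illustrative only.

```python
from typing import Dict, List, Tuple

CONTEXT_DIM = 14  # length of the dense contextual feature vector, as described above


def parse_review_line(line: str) -> Tuple[str, float, Dict[int, float], List[float]]:
    """Return (sample id, rating, sparse bag-of-word view, contextual view)."""
    tokens = line.split()
    sample_id = tokens[0]
    rating = float(tokens[1])   # regression target; assumed to be numeric
    n_terms = int(tokens[2])
    words = {}
    for tok in tokens[3:3 + n_terms]:
        term_id, count = tok.split(":")
        words[int(term_id)] = float(count)
    context = [float(x) for x in tokens[3 + n_terms:]]
    if len(context) != CONTEXT_DIM:
        raise ValueError(f"expected {CONTEXT_DIM} contextual values, got {len(context)}")
    return sample_id, rating, words, context
```

The bag-of-word view is kept sparse because it is not stated whether term ids start at 0 or 1, and a dense 12000-dim vector would require fixing that convention.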

------------------------------------------------------------------------

D. Hotel Review Data with Paragraph Information

To evaluate the structured model, we build another dataset from the hotel reviews on TripAdvisor, which contains 3000 reviews (600 reviews for each of the 5 rating scores). We evenly split the reviews into 1500 for training and 1500 for testing. 

Each review document consists of several paragraphs and has a global rating score (treated as a discrete category label ranging from 1 to 5). Every paragraph is represented by bag-of-word features over the same 9815-dim dictionary.

1. Bag-of-word Dictionary

The file "review_300_dictionary.dat" is the 9815-dim bag-of-word dictionary for the hotel reviews with paragraph information.

2. Hotel Review Feature Files

1) 1500 Training Review Features

The file "review_300_par4.dat_train" contains the features of the training review samples. The format is shown as follows:

[Sample Label] [Number of Paragraphs]
Paragraph 1: [Number of Non-zero Bag-of-word features] [text_id1:count1] ... [text_idN1:countN1]
Paragraph 2: [Number of Non-zero Bag-of-word features] [text_id1:count1] ... [text_idN2:countN2]
.
.
.
.
Paragraph n: [Number of Non-zero Bag-of-word features] [text_id1:count1] ... [text_idNn:countNn]
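
Because each review spans several lines, a reader for this format has to consume the header line first and then the stated number of paragraph lines. The Python sketch below shows one way to do this. It assumes whitespace-separated fields and an integer label, and it skips blank lines. Whether the "Paragraph k:" prefix appears literally in the files is not stated above, so the sketch accepts lines with or without it. The names parse_paragraph and read_reviews are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Review = Tuple[int, List[Dict[int, float]]]  # (label, one sparse vector per paragraph)


def parse_paragraph(line: str) -> Dict[int, float]:
    """Parse '[count] id:count ...', tolerating a leading 'Paragraph k:' prefix."""
    if line.lstrip().lower().startswith("paragraph"):
        line = line.split(":", 1)[1]  # drop the prefix up to the first colon
    tokens = line.split()
    n_terms, pairs = int(tokens[0]), tokens[1:]
    if len(pairs) != n_terms:
        raise ValueError(f"expected {n_terms} id:count pairs, got {len(pairs)}")
    features = {}
    for pair in pairs:
        term_id, count = pair.split(":")
        features[int(term_id)] = float(count)
    return features


def read_reviews(lines: Iterable[str]) -> Iterator[Review]:
    """Yield (label, paragraphs) for each review found in an iterable of lines."""
    it = (ln for ln in lines if ln.strip())  # ignore blank lines
    for header in it:
        fields = header.split()
        label, n_par = int(fields[0]), int(fields[1])
        paragraphs = []
        for _ in range(n_par):
            par_line = next(it, None)
            if par_line is None:
                raise ValueError("input ended in the middle of a review")
            paragraphs.append(parse_paragraph(par_line))
        yield label, paragraphs
```

Since read_reviews accepts any iterable of lines, an open file object can be passed to it directly, for example the file handle obtained from opening "review_300_par4.dat_train".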

2) 1500 Testing Review Features

The file "review_300_par4.dat_test" contains the features of the testing review samples. The format is the same as in the training data.

------------------------------------------------------------------------