Multiple Instance Learning

In multiple instance learning (MIL), data are organized into bags rather than individual instances, and each bag contains a certain number of instances. Given bags with class labels, the aim of MIL is to classify bags whose instances are potentially unlabelled. To learn a classifier at the bag level, bags can be encoded by their instance frequencies in specific regions of the data space.

Bag encoding algorithms for multiple instance classification, and their results on well-known MIL datasets, are introduced in:

Emel Seyma Kucukasci and Mustafa Gokce Baydogan, “Bag Encoding Strategies in Multiple Instance Learning Problems,” submitted to Pattern Recognition on March 27th, 2017.


Common MIL datasets:

The bag encoding algorithms are tested on 71 MIL benchmark datasets, the largest MIL repository used in experiments for algorithm comparison. The application areas of the datasets are molecular activity prediction, image annotation, text categorization, webpage classification and audio-recording classification (.mat files of the datasets are provided on

Each dataset file is in comma-separated values (CSV) format. It has one row per instance and one column per feature, plus two additional columns placed first. The first column holds the bag class label, which is propagated to the bag's instances. The second column holds the bag ID, so each instance carries the ID of the bag it belongs to. The remaining columns store the feature values of the instances.
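As an illustration of the layout described above, the following minimal sketch groups the rows of such a file into bags using only the Python standard library. The function name load_mil_csv, the absence of a header row, and the sample values are assumptions made for this example; they are not taken from the provided files or codes.

```python
import csv
import io
from collections import defaultdict


def load_mil_csv(handle):
    """Group the rows of a MIL CSV file into bags.

    Assumed layout (no header row): label, bag ID, feature_1, ..., feature_d.
    """
    bags = defaultdict(list)  # bag ID -> list of instance feature vectors
    labels = {}               # bag ID -> bag class label
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        label, bag_id = row[0].strip(), row[1].strip()
        features = [float(value) for value in row[2:]]
        bags[bag_id].append(features)
        labels[bag_id] = label  # identical for every instance of a bag
    return dict(bags), labels


# Made-up sample: bag "1" has two instances, bag "2" has one.
sample = io.StringIO("1,1,0.5,1.2\n1,1,0.7,0.9\n0,2,3.1,4.0\n")
bags, labels = load_mil_csv(sample)
```

For a file on disk, the same function can be called as load_mil_csv(open(path, newline="")), where path is the location of one of the downloaded CSV files.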

Full table of the datasets: [dataset_descriptions]

Link for the datasets: [Real_world_datasets]

PASCAL VOC 2007 dataset in MIL format:

The original natural image classification and object detection dataset can be downloaded from the original source page:

This dataset was formulated as a MIL problem by Dr. Melih Kandemir and Manuel Haussmann. Their corresponding paper for citation is:

M. Haußmann, F.A. Hamprecht, M. Kandemir, “Variational Bayesian multiple instance learning with Gaussian processes”, CVPR, (2017).

Link for the PASCAL VOC 2007 MIL dataset:  [pvoc_2007_dataset]

Synthetic datasets:

These datasets are randomly generated under four different MI settings. They can be used with bag encoding algorithms for MIL to measure the effects of the number of bags, the average number of instances per bag, and the number of features.

Link for the datasets: [Synthetic_datasets]

Pseudo-synthetic datasets:

Based on the Elephant dataset, datasets with different numbers of bags and features are generated to test the bag encoding algorithms.

Link for the datasets: [Pseudo-synthetic_datasets]


Bag encoding for MIL:

Several bag encoding algorithms are developed to represent bags in MIL. The encoded bags are then classified using random forests.
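To make the idea of encoding a bag by its instance frequencies concrete, here is a minimal sketch in plain Python. It is not one of the algorithms from the paper: the regions are assumed to be nearest-neighbor cells around a set of reference points, and the helper names (squared_distance, pick_reference_points, encode_bag) and the data are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pick_reference_points(bags, n_regions, seed=0):
    """Sample pooled instances to serve as region centers (an assumption)."""
    pooled = [inst for bag_id in sorted(bags) for inst in bags[bag_id]]
    return random.Random(seed).sample(pooled, n_regions)


def encode_bag(instances, references):
    """Return the fraction of the bag's instances that fall in each region."""
    counts = [0] * len(references)
    for inst in instances:
        nearest = min(range(len(references)),
                      key=lambda r: squared_distance(inst, references[r]))
        counts[nearest] += 1
    return [count / len(instances) for count in counts]


# Made-up data: two bags in a two-dimensional feature space.
bags = {
    "a": [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1]],
    "b": [[5.2, 4.9], [4.8, 5.0]],
}
references = [[0.0, 0.0], [5.0, 5.0]]  # fixed here so the result is easy to read
encoded = {bag_id: encode_bag(insts, references) for bag_id, insts in bags.items()}
sampled = pick_reference_points(bags, n_regions=2)
```

Each bag becomes a fixed-length vector regardless of how many instances it holds, so the encoded bags can be passed to any standard classifier, for example a random forest such as scikit-learn's RandomForestClassifier.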

Link for the R codes: [R_codes]

Link for the Python codes: [Python_codes]

Other MIL methods:

Standard MIL methods and bag dissimilarity-based methods (MInD) are included in the MIL Toolbox, which can be downloaded from the Pattern Recognition Laboratory's website.

Link for MIL toolbox: [Matlab_codes]

Another bag representation algorithm for MIL, based on Fisher vectors (miFV), can be downloaded from the link: [miFV_code]


In the paper, each method is tested by repeating 10-fold cross-validation five times. The randomly generated cross-validation indices for the datasets used in the experiments are provided below; they can be used to reproduce the results and to compare with other methods.
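For readers unfamiliar with the protocol, the sketch below shows one way to generate five repetitions of 10-fold cross-validation over bag IDs. It assumes that folds are formed at the bag level and without stratification, and the function name repeated_kfold is hypothetical. It will not reproduce the provided indices, which should be used for any comparison with the reported results.

```python
import random


def repeated_kfold(bag_ids, n_folds=10, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_ids, test_ids) for repeated k-fold CV."""
    rng = random.Random(seed)
    ids = list(bag_ids)
    for repeat in range(n_repeats):
        shuffled = ids[:]
        rng.shuffle(shuffled)  # new random partition for every repetition
        for fold in range(n_folds):
            test = shuffled[fold::n_folds]  # every n_folds-th bag goes to this fold
            test_set = set(test)
            train = [b for b in shuffled if b not in test_set]
            yield repeat, fold, train, test


# 20 made-up bag IDs give 5 x 10 = 50 train/test splits.
splits = list(repeated_kfold(range(20)))
```

Splitting on bag IDs rather than on instances keeps all instances of a bag on the same side of each split.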

Real world datasets: [CV_indices]