For our work on machine learning for the annotation of web services, we gathered WSDL files from SALCentral and XMethods and organized them in a hierarchy.



The web services are hierarchically classified. The directory structure serves as the label, i.e. a WSDL file in the communication\mail directory is classified as a "mail" web service, where "mail" is a subclass of "communication".
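
As an illustration of reading the label off the directory structure, here is a minimal Python sketch. It assumes that paths are given relative to the dataset root and that the "unlabelled" directory sits directly under that root; the function name and the example file name are hypothetical.

```python
def label_from_path(relative_path):
    """Return the class path of a file, most general class first.

    Assumption: relative_path is relative to the dataset root, so every
    directory component is part of the label.
    """
    # Accept both Windows-style and POSIX-style separators.
    parts = [p for p in relative_path.replace("\\", "/").split("/") if p]
    directories = parts[:-1]  # drop the file name itself
    # Assumption: files under a top-level "unlabelled" directory have no label.
    if directories and directories[0] == "unlabelled":
        return ()
    return tuple(directories)


# Hypothetical file name, used only to show the call.
print(label_from_path("communication\\mail\\service12.Example.wsdl"))
```

The last element of the returned tuple is the most specific class, and the elements before it are its superclasses.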

The labeled instances were crawled from the SALCentral website; the unlabeled instances (in the directory "unlabelled") come from the XMethods website.

Each .wsdl file is accompanied by a .txt file with the following structure:

  Line 1 = service provider
  Line 2 = original classification by SALCentral
  Line 3 = service name
  Line 4 = URL of the original WSDL file on the Web
  Line 5 = plain text description of the service, crawled from the SALCentral/XMethods web page
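
The layout above could be read with a few lines of Python. This is only a sketch under stated assumptions: the field names are our own invention, anything from the fifth line onward is treated as part of the description (in case a description spans several lines), and missing lines are filled with empty strings. The file encoding is not documented here, so the sketch works on text that has already been decoded.

```python
from typing import NamedTuple


class ServiceInfo(NamedTuple):
    # Field names are hypothetical; they mirror the five lines of the .txt file.
    provider: str
    original_class: str
    name: str
    wsdl_url: str
    description: str


def parse_service_info(text):
    """Parse the content of one .txt file into a ServiceInfo record."""
    lines = text.splitlines()
    # Pad with empty strings so that short files do not raise an error.
    head = (lines[:4] + [""] * 4)[:4]
    # Assumption: everything after the fourth line belongs to the description.
    description = "\n".join(lines[4:]).strip()
    return ServiceInfo(*(h.strip() for h in head), description)


# Made-up example content, not taken from the dataset.
sample = "Acme\nUtilities\nZipLookup\nhttp://example.com/zip.wsdl\nLooks up ZIP codes.\n"
print(parse_service_info(sample))
```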

Note that the SALCentral classification is not very useful, which is why we created our own.

The filenames are serviceNN.OriginalClassification.[txt|wsdl], where OriginalClassification refers to the label assigned by SALCentral. The XMethods website does not categorize web services, so line 2 of the .txt files for the unlabeled instances is always "XMethods".
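
A filename following this pattern can be split into its parts with a regular expression. The sketch below assumes that NN stands for one or more digits and that the original classification may itself contain dots or spaces; both are guesses about the naming scheme, and the function name is hypothetical.

```python
import re

# Assumption: "NN" is one or more digits; the middle part may contain dots.
FILENAME_RE = re.compile(r"^service(\d+)\.(.+)\.(txt|wsdl)$")


def parse_filename(filename):
    """Return (number, original_classification, extension), or None if no match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    number, original_class, extension = match.groups()
    return int(number), original_class, extension


print(parse_filename("service12.XMethods.wsdl"))
```

Files that do not follow the pattern yield None, so they can be skipped or reported separately.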

The classes are highly unbalanced and unfortunately not noise-free.

