GridWeka2

GridWeka2 is a modified version of Weka, the well-known Java machine learning and data mining software written by Eibe Frank and others at the University of Waikato in New Zealand.

GridWeka2 can run cross-validations in the Weka Explorer and Experimenter in parallel, distributed over several machines, and is easy to configure. GridWeka2 is currently based on Weka 3.4.3.

History and Related Work

I wrote GridWeka2 for two reasons:

  1. I had to run experiments that took a long time and wanted to speed them up.
  2. Other projects such as Weka Parallel and GridWeka work only with the command-line version, not with the Explorer or the Experimenter.

Note that GridWeka2 works in the Explorer and the Experimenter, but not in the CLI. GridWeka2 is a completely new project and is not based on Weka Parallel or GridWeka.

Download GridWeka2

Like Weka itself, GridWeka2 is licensed under the terms of the GPL.

Configuration and Usage

GridWeka2 is very easy to configure and usage is transparent. You can run Weka as normal by typing:

java -cp GridWeka2.jar weka.gui.GUIChooser

However, to make use of the parallelisation, you need two more things. First, you will have to start at least one Weka server, either on the local machine, on some other machine that is reachable over the net, or both! The following command starts a server on port 6714 that allows up to 3 concurrent requests:

java -cp GridWeka2.jar weka.ucd.WekaServer 6714 3

The last thing you have to do is tell your client about the Weka servers that you want to use. To do that, create a file called servers.csv and place it in the directory from which you start your Weka client (i.e. the GUI Chooser, the Explorer or the Experimenter). A simple servers.csv file looks like this:

localhost,6714,-,-
myothermachine,6714,-,-

This tells Weka that there is a server running on localhost and another server running on myothermachine, and that both listen on port 6714. The two minus signs are reserved entries: in a future version they will be used for user name and password based authentication.

After you have created the servers.csv file, just start the GUIChooser as shown above. If there is no servers.csv file, computation will be done on the client as normal.
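
The file format and the fallback to local computation can be illustrated with a short Java sketch. This is not the GridWeka2 source code: the class ServerList and its methods parse and load are hypothetical names made up for this example, and the handling of blank lines and malformed lines is an assumption. It only shows how a file with lines of the form host,port,-,- could be read, and how a missing file could be treated as "no servers".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of GridWeka2: reads a servers.csv file
// in the "host,port,-,-" layout described above.
class ServerList {

    // One entry of servers.csv. The two reserved fields are ignored.
    static final class Server {
        final String host;
        final int port;

        Server(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    // Parses lines of the form "host,port,-,-". Blank lines are skipped (an assumption).
    static List<Server> parse(List<String> lines) {
        List<Server> servers = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] fields = trimmed.split(",");
            if (fields.length < 2) {
                throw new IllegalArgumentException("Expected host,port,-,- but got: " + line);
            }
            servers.add(new Server(fields[0].trim(), Integer.parseInt(fields[1].trim())));
        }
        return servers;
    }

    // Returns an empty list when the file does not exist, which here stands
    // for "compute everything on the client as normal".
    static List<Server> load(Path file) throws IOException {
        if (!Files.exists(file)) {
            return Collections.emptyList();
        }
        return parse(Files.readAllLines(file));
    }

    public static void main(String[] args) throws IOException {
        List<Server> servers = load(Paths.get("servers.csv"));
        if (servers.isEmpty()) {
            System.out.println("No servers configured, computing locally");
        } else {
            for (Server s : servers) {
                System.out.println(s.host + ":" + s.port);
            }
        }
    }
}
```

Run from the directory that contains servers.csv, the sketch prints one host:port pair per configured server, or the local-computation message if the file is absent.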

Limitations

There are a lot of functions that would be desirable in GridWeka2, but are not implemented. Some of these functions are missing because GridWeka2 is a research prototype and not finished software. Therefore there is also no warranty of any kind. This is a list of known limitations:

  1. There is currently no authentication whatsoever! If you run a Weka server, everyone with a Weka client will be able to access it. To work around that, you should use firewall software on your machine that allows access to the Weka server port only from your own machines. If you want to run GridWeka2 securely over the Internet, you should consider using ssh port forwarding for authentication and security.
  2. Parallelisation works only in the Explorer and the Experimenter, not from the CLI or when you run a classifier directly from the command line.
  3. In the Experimenter, only folds are parallelised, but not runs. That means if you run a 10-fold cross-validation repeated 10 times for statistical significance testing, GridWeka2 will spawn at most 10 parallel processes rather than 100, and furthermore it will synchronise after each run.
  4. In the Explorer, if you tick the "Output Model" box, the model is computed on the local machine and is not parallelised. Only the cross-validation that follows is distributed.
  5. Since computation is distributed over multiple machines, timing in the Experimenter no longer makes sense. No valid timing information will be stored in your experiment output files.
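
Limitation 3 can be made concrete with a small Java sketch. It is not how GridWeka2 is implemented: it uses a local thread pool in place of remote Weka servers, and the class FoldScheduling and the methods evaluateFold and runExperiment are hypothetical names for this illustration. The point is only the scheduling pattern: the folds of one run are submitted together, and the next run starts only after all of them have finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of "folds in parallel, runs in sequence".
// A local thread pool stands in for the remote servers.
class FoldScheduling {

    // Placeholder for training and testing one fold; returns a dummy identifier.
    static int evaluateFold(int run, int fold) {
        return run * 100 + fold;
    }

    static List<Integer> runExperiment(int runs, int folds) {
        // At most `folds` tasks are ever active at once, never runs * folds.
        ExecutorService pool = Executors.newFixedThreadPool(folds);
        List<Integer> results = new ArrayList<>();
        try {
            for (int run = 0; run < runs; run++) {
                List<Callable<Integer>> tasks = new ArrayList<>();
                for (int fold = 0; fold < folds; fold++) {
                    final int r = run;
                    final int f = fold;
                    tasks.add(() -> evaluateFold(r, f));
                }
                // invokeAll blocks until every fold of this run is done,
                // which is the synchronisation point after each run.
                for (Future<Integer> future : pool.invokeAll(tasks)) {
                    results.add(future.get());
                }
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Fold evaluation failed", e);
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        // 10 runs of 10 folds: 100 fold evaluations, 10 at a time.
        System.out.println(runExperiment(10, 10).size() + " folds evaluated");
    }
}
```

With 10 runs of 10-fold cross-validation, this pattern performs 100 fold evaluations in total but never has more than 10 in flight.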

Other Changes in GridWeka2

There are a few other changes in GridWeka2 compared with the standard Weka 3.4.3:

  1. GridWeka2 includes the reference implementation of the Triskel algorithm.
  2. In the Experimenter, precision, recall and f-measure are microaveraged by default, not for one particular class.
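
For readers unfamiliar with microaveraging, the following Java sketch shows the standard definition: true positives, false positives and false negatives are summed over all classes before precision, recall and f-measure are computed. It is an illustration only and not taken from the GridWeka2 source. The class MicroAverage, the method microPRF and the confusion-matrix layout (rows are actual classes, columns are predicted classes) are assumptions of this example.

```java
// Hypothetical illustration of microaveraged precision, recall and f-measure.
class MicroAverage {

    // confusion[actual][predicted] holds instance counts.
    // Returns {precision, recall, fMeasure}.
    static double[] microPRF(int[][] confusion) {
        long tp = 0;
        long fp = 0;
        long fn = 0;
        int n = confusion.length;
        for (int c = 0; c < n; c++) {
            tp += confusion[c][c];
            for (int other = 0; other < n; other++) {
                if (other != c) {
                    fn += confusion[c][other]; // actual c, predicted as another class
                    fp += confusion[other][c]; // another class, predicted as c
                }
            }
        }
        // Counts are pooled over all classes before dividing.
        double precision = tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
        double fMeasure = precision + recall == 0.0
                ? 0.0
                : 2.0 * precision * recall / (precision + recall);
        return new double[] {precision, recall, fMeasure};
    }

    public static void main(String[] args) {
        int[][] confusion = {
            {5, 1},
            {2, 2}
        };
        double[] prf = microPRF(confusion);
        System.out.printf("precision=%.3f recall=%.3f f=%.3f%n", prf[0], prf[1], prf[2]);
    }
}
```

Note that when every instance receives exactly one predicted class, each misclassification counts once as a false positive and once as a false negative, so microaveraged precision and recall coincide.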

26 Mar 2007, Andreas Hess, andreas at idirlion dot de