VHamMLL

A machine learning (ML) library for classification using a nearest neighbor algorithm based on Hamming distances.
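
As a rough illustration of the idea only (this is not the library's own code; the function and data below are invented for the example), a nearest-neighbour classification based on Hamming distance can be sketched in V as follows:

module main

// Count the attribute positions at which two discretized instances differ.
fn hamming_distance(a []int, b []int) int {
    mut dist := 0
    for i in 0 .. a.len {
        if a[i] != b[i] {
            dist++
        }
    }
    return dist
}

fn main() {
    train := [[1, 0, 2], [2, 1, 2], [0, 0, 1]] // discretized training instances
    labels := ['a', 'b', 'a'] // class label for each training instance
    sample := [1, 0, 1] // instance to classify
    mut best := 0
    for i in 1 .. train.len {
        if hamming_distance(train[i], sample) < hamming_distance(train[best], sample) {
            best = i
        }
    }
    println('nearest class: ${labels[best]}')
}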

You can incorporate the VHamMLL functions into your own code, or use the included Command Line Interface app ( cli.v ).

Link to HTML documentation for the library functions and structs

You can use VHamMLL with your own datasets, or with a selection of publicly available datasets that are widely used for demonstrating and testing ML classifiers, in the datasets directory. These files are mostly in Orange file format; there are also datasets in ARFF (Attribute-Relation File Format) and in comma-separated values (CSV) as used on Kaggle.
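
For orientation, a minimal Orange-format file has three header rows (attribute names, attribute types, and a flags row marking the class attribute), followed by tab-separated data rows. The snippet below is illustrative only, with columns widened for readability:

sepal length   sepal width   petal length   petal width   iris
c              c             c             c             d
                                                          class
5.1            3.5           1.4           0.2           Iris-setosa
4.9            3.0           1.4           0.2           Iris-setosa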

Classification accuracy with datasets in the datasets directory: see this table. Please note that these results were generated using an earlier version, which had separate HamNN and VHamNN modules.

What, another AI package? Is that necessary? Have a look here for a more complete description and potential use cases.

Glossary of terms

Usage:

To use the VHamMLL library in an existing Vlang project:

v install holder66.vhammll

You may also need to install its dependencies, if not automatically installed:

v install vsl
v install Mewzax.chalk

In your V code, add: import holder66.vhammll

To use the library with the Command Line Interface (CLI):

First, install V, if not already installed. On macOS, Linux, etc., you need git and a C compiler (for Windows or Android environments, see the V documentation).

In a terminal:

git clone https://github.com/vlang/v
cd v
make
sudo ./v symlink	# add v to your PATH
v install holder66.vhammll

See above re needed dependencies.

In a folder or directory that you want to use for your project, create a file containing module main and a function main(). You can do this in the terminal, or with a text editor. The file should contain:

module main
import holder66.vhammll

fn main() {
    // run the vhammll command-line interface, propagating any error
    vhammll.cli()!
}

Assuming you've named the directory or folder vhammll and the file within it main.v, in the terminal type v run . followed by the command line arguments, e.g. v run . --help or v run . analyze <path_to_dataset_file>. Command-specific help is available, like so: v run . explore --help or v run . explore -h.
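
For example:

v run . --help             # general help
v run . explore --help     # help specific to the explore command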

Note that the publicly available datasets included with the VHamMLL distribution can be found at ~/.vmodules/holder66/vhammll/datasets.
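
For example, to analyze one of the included datasets:

v run . analyze ~/.vmodules/holder66/vhammll/datasets/iris.tab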

That's it!

Tutorial:

v run . examples go

Updating:

v up        # installs the latest release of V
v update    # get the latest version of the libraries, including holder66.vhammll
v .         # recompile

Getting help:

The V lang community meets on Discord

For bug reports, feature requests, etc., please raise an issue on GitHub.

Speed things up:

Use the -c (--concurrent) argument (in the CLI) to make use of available CPU cores for some vhammll functions; this may speed things up. (The timings below are on a 2019 MacBook Pro.)

v main.v
./main explore ~/.vmodules/holder66/vhammll/datasets/iris.tab  # 10.157 sec
./main explore -c  ~/.vmodules/holder66/vhammll/datasets/iris.tab   # 4.910 sec

A huge speedup usually happens if you compile using the -prod (for production) option. The compilation itself takes longer, but the resulting code is highly optimized.

v -prod main.v
./main explore ~/.vmodules/holder66/vhammll/datasets/iris.tab  # 3.899 sec
./main explore -c  ~/.vmodules/holder66/vhammll/datasets/iris.tab   # 4.849 sec!!

Note that in this case, there is no speedup for -prod when the -c argument is used.

Examples showing use of the Command Line Interface

Please see examples_of_command_line_usage.md

Example: typical use case, a clinical risk calculator

Health care professionals frequently make use of calculators to inform clinical decision-making. Data regarding symptoms, findings on physical examination, laboratory and imaging results, and outcome information (such as diagnosis, risk of developing a condition, or response to specific treatments) are collected for a sample of patients. These data then form the basis of a formula that can predict the outcome of interest for a new patient, based on how their symptoms, findings, etc. compare with those in the dataset.

Please see clinical_calculator_example.md .

Example: finding useful information embedded in noise

Please see a worked example here: noisy_data.md

MNIST dataset

The mnist_train.tab file is too large to keep in the repository. If you wish to experiment with it, download it by right-clicking on this link in a web browser, or via the command line:

wget https://henry.olders.ca/datasets/mnist_train.tab

The process of development in its early stages is described in this essay written in 1989.

Copyright (c) 2017, 2023: Henry Olders.
