Usage: crossbow [OPTION...] [ARG...]
Crossbow -- a document clustering front-end to libbow

For building data structures from text files:
  --build-hier-from-dir   When indexing a single directory, use the directory structure to build a class hierarchy
  -c, --cluster   Cluster the documents, and write the results to disk
  --classify   Split the data into train/test, and classify the test data, outputting results in rainbow format
  --classify-files=DIRNAME   Classify documents in DIRNAME, outputting `filename classname' pairs on each line
  --cluster-output-dir=DIR   After clustering is finished, write the clusters to directory DIR
  --index-multiclass-list=FILE   Index the files listed in FILE. Each line of FILE should contain a filename followed by a list of classnames to which that file belongs.
  -i, --index   Tokenize training documents found under ARG..., build weight vectors, and save them to disk
  --print-doc-names[=TAG]   Print the filenames of documents contained in the model. If the optional TAG argument is given, print only the documents that have the specified tag.
  --print-matrix   Print the word/document count matrix in an awk- or perl-accessible format. The format is sparse and includes the words and the counts.
  --print-word-probabilities=FILEPREFIX   Print the word probability distribution in each leaf to files named FILEPREFIX-classname
  --query-server=PORTNUM   Run crossbow in server mode, listening on socket number PORTNUM. You can try it by executing this command, then, in a different shell window on the same machine, typing `telnet localhost PORTNUM'.
  --use-vocab-in-file=FILENAME   Limit the vocabulary to just those words read as space-separated strings from FILENAME

Splitting options:
  --ignore-set=SOURCE   How to select the ignored documents. Same format as --test-set. Default is `0'.
  --set-files-use-basename[=N]   When using files to specify doc types, compare only the last N components of the doc's pathname; that is, use the filename and the last N-1 directory names.
        If N is not specified, it defaults to 1.
  --test-set=SOURCE   How to select the testing documents. A number between 0 and 1 inclusive, with a decimal point, indicates a random fraction of all documents; the number of documents selected from each class is determined by attempting to match the proportions of the non-ignored documents. A number with no decimal point indicates the number of documents to select randomly. Alternatively, a suffix of `pc' indicates the number of documents per class to tag. The suffix `t' on a number or proportion indicates tagging documents from the pool of training documents, not the untagged documents. `remaining' selects all documents that remain untagged at the end. Anything else is interpreted as a filename listing the documents to select. Default is `0.0'.
  --train-set=SOURCE   How to select the training documents. Same format as --test-set. Default is `remaining'.
  --unlabeled-set=SOURCE   How to select the unlabeled documents. Same format as --test-set. Default is `0'.
  --validation-set=SOURCE   How to select the validation documents. Same format as --test-set. Default is `0'.

Hierarchical EM Clustering options:
  --hem-branching-factor=NUM   Number of clusters to create. Default is 2.
  --hem-deterministic-horizontal   In the horizontal E-step for a document, set to zero the membership probabilities of all leaves except the one matching the document's filename
  --hem-garbage-collection   Add extra /Misc/ children to every internal node of the hierarchy, and keep their local word distributions flat
  --hem-incremental-labeling   Instead of using all unlabeled documents in the M-step, use only the labeled documents, and incrementally label those unlabeled documents that are most confidently classified in the E-step
  --hem-lambdas-from-validation=NUM   Instead of setting the lambdas from the labeled/unlabeled data (possibly with LOO), set the lambdas using held-out validation data. Default is 0.
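The SOURCE grammar shared by the --test-set family of options can be summarized in a short sketch. This is illustrative Python, not crossbow source; the function name and the returned tuples are inventions for the sake of the example, and edge cases (e.g. a filename that happens to end in `t') are not handled the way crossbow itself would handle them.

```python
# Illustrative sketch (not crossbow code): one way to interpret the SOURCE
# strings accepted by --test-set, --train-set, --unlabeled-set, etc.
import re

def parse_set_source(source):
    """Classify a SOURCE string into (kind, value, from_training_pool)."""
    if source == "remaining":
        return ("remaining", None, False)
    from_train = source.endswith("t")          # `t' suffix: draw from training pool
    body = source[:-1] if from_train else source
    if body.endswith("pc"):                    # `pc' suffix: count per class
        return ("per-class-count", int(body[:-2]), from_train)
    if re.fullmatch(r"\d+\.\d*|\.\d+", body):  # decimal point: random fraction
        return ("fraction", float(body), from_train)
    if re.fullmatch(r"\d+", body):             # plain integer: absolute count
        return ("count", int(body), from_train)
    return ("filename", source, False)         # anything else names a file list

print(parse_set_source("0.3"))    # ('fraction', 0.3, False)
print(parse_set_source("50t"))    # ('count', 50, True)
print(parse_set_source("10pc"))   # ('per-class-count', 10, False)
```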
  --random-seed=NUM   The non-negative integer to use for seeding the random number generator
  --score-precision=NUM   The number of decimal digits to print when displaying document scores
  -v, --verbosity=LEVEL   Set the amount of info printed while running (0=silent, 1=quiet, 2=show-progress, ... 5=max)

Lexing options:
  --append-stoplist-file=FILE   Add words in FILE to the stoplist
  --exclude-filename=FILENAME   When scanning directories for text files, skip files with names matching FILENAME
  -g, --gram-size=N   Create tokens for all 1-grams, ..., N-grams
  -h, --skip-header   Avoid lexing news/mail headers by scanning forward until two newlines
  --istext-avoid-uuencode   Check for uuencoded blocks before saying that the file is text, and say no if there are many lines of the same length
  --lex-pipe-command=SHELLCMD   Pipe files through this shell command before lexing them
  --max-num-words-per-document=N   Only tokenize the first N words in each document
  --no-stemming   Do not modify lexed words with a stemming function (usually the default, depending on the lexer)
  --replace-stoplist-file=FILE   Empty the default stoplist, and add the space-delimited words from FILE
  -s, --no-stoplist   Do not toss lexed words that appear in the stoplist
  --shortest-word=LENGTH   Toss lexed words that are shorter than LENGTH. Default is usually 2.
  -S, --use-stemming   Modify lexed words with the `Porter' stemming function
  --use-stoplist   Toss lexed words that appear in the stoplist (usually the default SMART stoplist, depending on the lexer)
  --use-unknown-word   When used in conjunction with -O or -D, capture all words with occurrence counts below the threshold as the `' token
  --xxx-words-only   Only tokenize words with `xxx' in them

Mutually exclusive choice of lexers:
  --flex-mail   Use a mail-specific flex lexer
  --flex-tagged   Use a tagged flex lexer
  -H, --skip-html   Skip HTML tokens when lexing
  --lex-alphanum   Use a special lexer that includes digits in tokens, delimiting tokens only by non-alphanumeric characters
  --lex-infix-string=ARG   Use only the characters after ARG in each word for stoplisting and stemming. If a word does not contain ARG, the entire word is used.
  --lex-suffixing   Use a special lexer that adds suffixes depending on Email-style headers
  --lex-white   Use a special lexer that delimits tokens by whitespace only, and does not change the contents of the token at all: no downcasing, no stemming, no stoplist, nothing. Ideal for use with an externally-written lexer interfaced to rainbow with --lex-pipe-cmd.

Feature-selection options:
  -D, --prune-vocab-by-doc-count=N   Remove words that occur in N or fewer documents
  -O, --prune-vocab-by-occur-count=N   Remove words that occur fewer than N times
  -T, --prune-vocab-by-infogain=N   Remove all but the top N words, selecting the words with the highest information gain

Weight-vector setting/scoring method options:
  --binary-word-counts   Instead of using integer occurrence counts of words to set weights, use binary absence/presence
  --event-document-then-word-document-length=NUM   Set the normalized length of documents when --event-model=document-then-word
  --event-model=EVENTNAME   Set what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word, document, document-then-word. Default is `word'.
  --infogain-event-model=EVENTNAME   Set what objects will be considered the `events' when information gain is calculated. EVENTNAME can be one of: word, document, document-then-word. Default is `document'.
  -m, --method=METHOD   Set the word weight-setting method; METHOD may be one of: fienberg-classify, hem-classify, hem-cluster, multiclass, naivebayes. Default is `naivebayes'.
  --print-word-scores   During scoring, print the contribution of each word to each class
  --smoothing-dirichlet-filename=FILE   The file containing the alphas for Dirichlet smoothing
  --smoothing-dirichlet-weight=NUM   The weighting factor by which to multiply the alphas for Dirichlet smoothing
  --smoothing-goodturing-k=NUM   Smooth word probabilities for words that occur NUM or fewer times. The default is 7.
  --smoothing-method=METHOD   Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell
  --uniform-class-priors   When setting weights, calculating infogain, and scoring, use equal prior probabilities on classes

  -?, --help   Give this help list
  --usage   Give a short usage message
  -V, --version   Print program version

Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options.

Report bugs to .
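The default --method=naivebayes combined with --smoothing-method=laplace corresponds to the standard multinomial naive Bayes classifier with add-one smoothing. The following is a minimal illustrative sketch in Python, not crossbow's implementation; the function names and the toy data are inventions for the example.

```python
# Illustrative sketch (not crossbow code): multinomial naive Bayes with
# Laplace (add-one) smoothing, the kind of model selected by
# --method=naivebayes --smoothing-method=laplace.
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class maps a classname to a list of tokenized documents."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for cls, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        model[cls] = {
            "prior": math.log(len(docs) / total_docs),
            # Laplace smoothing: add 1 to every count so unseen words get a
            # small nonzero probability instead of zeroing out the product.
            "logp": {w: math.log((counts[w] + 1) / (total + len(vocab)))
                     for w in vocab},
        }
    return model

def classify(model, tokens):
    def score(cls):
        m = model[cls]
        return m["prior"] + sum(m["logp"].get(w, 0.0) for w in tokens)
    return max(model, key=score)

model = train({
    "sports": [["ball", "goal", "team"], ["team", "win"]],
    "tech":   [["code", "compiler"], ["code", "bug", "patch"]],
})
print(classify(model, ["team", "goal"]))   # sports
```

With --uniform-class-priors, the `prior` term above would simply be equal for every class, so only the word likelihoods decide the score.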