Usage: crossbow [OPTION...] [ARG...]
Crossbow -- a document clustering front-end to libbow

For building data structures from text files:
  --build-hier-from-dir   When indexing a single directory, use the directory structure to build a class hierarchy
  -c, --cluster   Cluster the documents, and write the results to disk
  --classify   Split the data into train/test, and classify the test data, outputting results in rainbow format
  --classify-files=DIRNAME   Classify documents in DIRNAME, outputting `filename classname' pairs on each line
  --cluster-output-dir=DIR   After clustering is finished, write the clusters to directory DIR
  --index-multiclass-list=FILE   Index the files listed in FILE. Each line of FILE should contain a filename followed by a list of classnames to which that file belongs.
  -i, --index   Tokenize training documents found under ARG..., build weight vectors, and save them to disk
  --print-doc-names[=TAG]   Print the filenames of documents contained in the model. If the optional TAG argument is given, print only the documents that have the specified tag.
  --print-matrix   Print the word/document count matrix in an awk- or perl-accessible format. The format is sparse and includes the words and the counts.
  --print-word-probabilities=FILEPREFIX   Print the word probability distribution in each leaf to files named FILEPREFIX-classname
  --query-server=PORTNUM   Run crossbow in server mode, listening on socket number PORTNUM. You can try it by executing this command, then, in a different shell window on the same machine, typing `telnet localhost PORTNUM'.
  --use-vocab-in-file=FILENAME   Limit the vocabulary to just those words read as space-separated strings from FILENAME

Splitting options:
  --ignore-set=SOURCE   How to select the ignored documents. Same format as --test-set. Default is `0'.
  --set-files-use-basename[=N]   When using files to specify doc types, compare only the last N components of the doc's pathname; that is, use the filename and the last N-1 directory names.
        If N is not specified, it defaults to 1.
  --test-set=SOURCE   How to select the testing documents. A number between 0 and 1 inclusive, with a decimal point, indicates a random fraction of all documents; the number of documents selected from each class is determined by attempting to match the proportions of the non-ignored documents. A number with no decimal point indicates the number of documents to select randomly. Alternatively, a suffix of `pc' indicates the number of documents per class to tag. The suffix `t' on a number or proportion indicates tagging documents from the pool of training documents, not the untagged documents. `remaining' selects all documents that remain untagged at the end. Anything else is interpreted as a filename listing the documents to select. Default is `0.0'.
  --train-set=SOURCE   How to select the training documents. Same format as --test-set. Default is `remaining'.
  --unlabeled-set=SOURCE   How to select the unlabeled documents. Same format as --test-set. Default is `0'.
  --validation-set=SOURCE   How to select the validation documents. Same format as --test-set. Default is `0'.

Hierarchical EM Clustering options:
  --hem-branching-factor=NUM   Number of clusters to create. Default is 2.
  --hem-deterministic-horizontal   In the horizontal E-step for a document, set to zero the membership probabilities of all leaves except the one matching the document's filename
  --hem-garbage-collection   Add extra /Misc/ children to every internal node of the hierarchy, and keep their local word distributions flat
  --hem-incremental-labeling   Instead of using all unlabeled documents in the M-step, use only the labeled documents, and incrementally label those unlabeled documents that are most confidently classified in the E-step
  --hem-lambdas-from-validation=NUM   Instead of setting the lambdas from the labeled/unlabeled data (possibly with LOO), set the lambdas using held-out validation data. Default is 0.
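The SOURCE grammar shared by the --test-set family of options can be summarized in a short sketch. This is illustrative Python, not crossbow source; the function name and the returned tuples are inventions for the sake of the example, and edge cases (e.g. a filename that happens to end in `t') are not handled the way crossbow itself would handle them.

```python
# Illustrative sketch (not crossbow code): one way to interpret the SOURCE
# strings accepted by --test-set, --train-set, --unlabeled-set, etc.
import re

def parse_set_source(source):
    """Classify a SOURCE string into (kind, value, from_training_pool)."""
    if source == "remaining":
        return ("remaining", None, False)
    from_train = source.endswith("t")          # `t' suffix: draw from training pool
    body = source[:-1] if from_train else source
    if body.endswith("pc"):                    # `pc' suffix: count per class
        return ("per-class-count", int(body[:-2]), from_train)
    if re.fullmatch(r"\d+\.\d*|\.\d+", body):  # decimal point: random fraction
        return ("fraction", float(body), from_train)
    if re.fullmatch(r"\d+", body):             # plain integer: absolute count
        return ("count", int(body), from_train)
    return ("filename", source, False)         # anything else names a file list

print(parse_set_source("0.3"))    # ('fraction', 0.3, False)
print(parse_set_source("50t"))    # ('count', 50, True)
print(parse_set_source("10pc"))   # ('per-class-count', 10, False)
```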
  --random-seed=NUM   The non-negative integer to use for seeding the random number generator
  --score-precision=NUM   The number of decimal digits to print when displaying document scores
  -v, --verbosity=LEVEL   Set the amount of info printed while running (0=silent, 1=quiet, 2=show-progress, ... 5=max)

Lexing options:
  --append-stoplist-file=FILE   Add words in FILE to the stoplist
  --exclude-filename=FILENAME   When scanning directories for text files, skip files with names matching FILENAME
  -g, --gram-size=N   Create tokens for all 1-grams, ..., N-grams
  -h, --skip-header   Avoid lexing news/mail headers by scanning forward until two newlines
  --istext-avoid-uuencode   Check for uuencoded blocks before saying that the file is text, and say no if there are many lines of the same length
  --lex-pipe-command=SHELLCMD   Pipe files through this shell command before lexing them
  --max-num-words-per-document=N   Only tokenize the first N words in each document
  --no-stemming   Do not modify lexed words with a stemming function (usually the default, depending on the lexer)
  --replace-stoplist-file=FILE   Empty the default stoplist, and add the space-delimited words from FILE
  -s, --no-stoplist   Do not toss lexed words that appear in the stoplist
  --shortest-word=LENGTH   Toss lexed words that are shorter than LENGTH. Default is usually 2.
  -S, --use-stemming   Modify lexed words with the `Porter' stemming function
  --use-stoplist   Toss lexed words that appear in the stoplist (usually the default SMART stoplist, depending on the lexer)
  --use-unknown-word   When used in conjunction with -O or -D, capture all words with occurrence counts below the threshold as the `' token
  --xxx-words-only   Only tokenize words with `xxx' in them

Mutually exclusive choice of lexers:
  --flex-mail   Use a mail-specific flex lexer
  --flex-tagged   Use a tagged flex lexer
  -H, --skip-html   Skip HTML tokens when lexing
  --lex-alphanum   Use a special lexer that includes digits in tokens, delimiting tokens only by non-alphanumeric characters
  --lex-infix-string=ARG   Use only the characters after ARG in each word for stoplisting and stemming. If a word does not contain ARG, the entire word is used.
  --lex-suffixing   Use a special lexer that adds suffixes depending on Email-style headers
  --lex-white   Use a special lexer that delimits tokens by whitespace only, and does not change the contents of the token at all: no downcasing, no stemming, no stoplist, nothing. Ideal for use with an externally-written lexer interfaced to rainbow with --lex-pipe-cmd.

Feature-selection options:
  -D, --prune-vocab-by-doc-count=N   Remove words that occur in N or fewer documents
  -O, --prune-vocab-by-occur-count=N   Remove words that occur fewer than N times
  -T, --prune-vocab-by-infogain=N   Remove all but the top N words, selecting the words with the highest information gain

Weight-vector setting/scoring method options:
  --binary-word-counts   Instead of using integer occurrence counts of words to set weights, use binary absence/presence
  --event-document-then-word-document-length=NUM   Set the normalized length of documents when --event-model=document-then-word
  --event-model=EVENTNAME   Set what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word, document, document-then-word. Default is `word'.
  --infogain-event-model=EVENTNAME   Set what objects will be considered the `events' when information gain is calculated. EVENTNAME can be one of: word, document, document-then-word. Default is `document'.
  -m, --method=METHOD   Set the word weight-setting method; METHOD may be one of: fienberg-classify, hem-classify, hem-cluster, multiclass, naivebayes. Default is `naivebayes'.
  --print-word-scores   During scoring, print the contribution of each word to each class
  --smoothing-dirichlet-filename=FILE   The file containing the alphas for Dirichlet smoothing
  --smoothing-dirichlet-weight=NUM   The weighting factor by which to multiply the alphas for Dirichlet smoothing
  --smoothing-goodturing-k=NUM   Smooth word probabilities for words that occur NUM or fewer times. The default is 7.
  --smoothing-method=METHOD   Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell
  --uniform-class-priors   When setting weights, calculating infogain, and scoring, use equal prior probabilities on classes

  -?, --help   Give this help list
  --usage   Give a short usage message
  -V, --version   Print program version

Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options.

Report bugs to .
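The default --method=naivebayes combined with --smoothing-method=laplace corresponds to the standard multinomial naive Bayes classifier with add-one smoothing. The following is a minimal illustrative sketch in Python, not crossbow's implementation; the function names and the toy data are inventions for the example.

```python
# Illustrative sketch (not crossbow code): multinomial naive Bayes with
# Laplace (add-one) smoothing, the kind of model selected by
# --method=naivebayes --smoothing-method=laplace.
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class maps a classname to a list of tokenized documents."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for cls, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        model[cls] = {
            "prior": math.log(len(docs) / total_docs),
            # Laplace smoothing: add 1 to every count so unseen words get a
            # small nonzero probability instead of zeroing out the product.
            "logp": {w: math.log((counts[w] + 1) / (total + len(vocab)))
                     for w in vocab},
        }
    return model

def classify(model, tokens):
    def score(cls):
        m = model[cls]
        return m["prior"] + sum(m["logp"].get(w, 0.0) for w in tokens)
    return max(model, key=score)

model = train({
    "sports": [["ball", "goal", "team"], ["team", "win"]],
    "tech":   [["code", "compiler"], ["code", "bug", "patch"]],
})
print(classify(model, ["team", "goal"]))   # sports
```

With --uniform-class-priors, the `prior` term above would simply be equal for every class, so only the word likelihoods decide the score.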