elib1_indexer

A full-text indexing engine.

The functions in this library are modeled after the algorithms in the book "Managing Gigabytes" by I. H. Witten, A. Moffat and T. C. Bell, 2nd edition, Morgan Kaufmann Publishing, 1999.

Unix Commands

eindex

There are a number of top-level UNIX commands to perform indexing.

  $ eindex -crawl Name
      Input: Name.in     Output: Name.files and Name.crawl
  $ eindex -index Name
      Input: Name.crawl Output: Name.index
  $ eindex -dump Name
      Input: Name     Output: Name.tmp
      example: eindex -dump test2.crawl produces test2.crawl.tmp
  $ eindex -dumpIndex Name
      Input: Name.index Output: Name.index.tmp
  $ eindex -statstics Name
      Input: Name.index Output: Statistics
  $ eindex -search Name "String"
      Input: Name.in
      Output: matching objects
      example: eindex -search test2 "property lists"
  $ eindex -help
      prints usage
  

-crawl is used to gather a collection of files together prior to indexing. The file Name.in contains a list of directories and file extensions which control the gathering phase. All matching files specified in the input file are read, compressed and appended into a single file called a crawl. This file can be rather large, depending on the size of the scan, but it contains all the data needed for the subsequent indexing phases. In a typical gather phase, I collect 46,000 Erlang files on my disk and compress them into a single 120 MByte file.

File Formats

Crawl files

A file with the extension .crawl is a crawl file. A crawl file is a set of BTFs (binary tuple format records, described under Binary Tuple Format) containing

  {Term::any(), Extension::string(), Md5Content::binary(), CompressedContent::binary()}
  

The indexer uses a number of different file formats. These are contained in files with the extensions .in, .crawl and .index.

Input Data

Files with the extension .in contain a list of tuples, each with a start directory and a file extension. For example, the file test1.in in the examples/indexer directory contains the following:

  {"/Users/joe/code/elib2-1", ".erl"}.
  {"/Users/joe/msi/2005/erl/projects/supported", ".erl"}.
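Because each entry in a .in file is an Erlang term terminated by a period, such a file can be read with the standard file:consult/1. The following is a minimal sketch, not the indexer's own code; the module name in_file_sketch, the function names and the demo file name are hypothetical.

```erlang
-module(in_file_sketch).
-export([read_spec/1, demo/0]).

%% Read a .in file: a sequence of {Dir, Ext} terms, each ending with a period.
%% Entries that are not a pair of strings are silently dropped.
read_spec(File) ->
    {ok, Terms} = file:consult(File),
    [{Dir, Ext} || {Dir, Ext} <- Terms, is_list(Dir), is_list(Ext)].

%% Write a small throwaway .in file (hypothetical name), read it back,
%% and return the parsed list of {Dir, Ext} pairs.
demo() ->
    File = "sketch_test.in",
    Data = <<"{\"/tmp/a\", \".erl\"}.\n{\"/tmp/b\", \".hrl\"}.\n">>,
    ok = file:write_file(File, Data),
    Spec = read_spec(File),
    ok = file:delete(File),
    Spec.
```

Each returned pair gives one root directory and the extension to look for under it.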
  

When the first phase of indexing occurs, we crawl through the file system looking for all files under the root /Users/joe/code/elib2-1 with the file extension .erl. Symbolic links are followed if they point within the root directory, but circular links are not followed. All the Erlang code found is compressed and appended to a so-called "crawl" file. The crawl file is a sequence of tuples of the form

  {FileName::string(), Md5::binary(), CompressedContent::binary()}
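One way such tuples could be produced is sketched below. It is an illustration under assumptions, not the indexer's implementation: the module crawl_sketch and its functions are hypothetical, the MD5 is assumed to be taken over the uncompressed content, zlib:compress/1 is assumed as the compressor, and the symbolic-link rules described above are not handled.

```erlang
-module(crawl_sketch).
-export([entries/2, entry/1]).

%% Build one crawl tuple {FileName, Md5, CompressedContent} for a file.
%% Assumption: the MD5 is computed over the uncompressed content.
entry(FileName) ->
    {ok, Bin} = file:read_file(FileName),
    {FileName, erlang:md5(Bin), zlib:compress(Bin)}.

%% Walk Dir recursively and collect a crawl tuple for every file
%% whose extension (for example ".erl") equals Ext.
entries(Dir, Ext) ->
    filelib:fold_files(Dir, ".*", true,
        fun(F, Acc) ->
            case filename:extension(F) of
                Ext -> [entry(F) | Acc];   % Ext is bound, so this is a match test
                _   -> Acc
            end
        end, []).
```

Calling crawl_sketch:entries(Dir, Ext) for each pair in the .in file yields the tuples that would then be appended to the crawl file.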
  

Binary Tuple Format

Binary tuple format (BTF) occurs a lot. If T is a tuple, then the following binary written to disk is in BTF:

  <<(size(tuple_to_binary(T))):32/unsigned-big-integer,
    (tuple_to_binary(T))/binary>>
  

Tuples are stored on disk in binary tuple format (i.e. as a 4-byte length header followed by the encoded tuple).
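The framing can be sketched as a pair of encode and decode functions. This is only an illustration: the module btf_sketch is hypothetical, and the built-in term_to_binary/1 and binary_to_term/1 are assumed as stand-ins for tuple_to_binary/1, whose definition is not shown here.

```erlang
-module(btf_sketch).
-export([encode/1, decode_all/1]).

%% Frame one tuple: a 4-byte big-endian length, then the encoded tuple.
%% Assumption: term_to_binary/1 stands in for tuple_to_binary/1.
encode(T) when is_tuple(T) ->
    B = term_to_binary(T),
    <<(byte_size(B)):32/unsigned-big-integer, B/binary>>.

%% Split a binary holding consecutive BTF records back into tuples.
%% Len is bound by the header and then used as the size of the payload.
decode_all(<<Len:32/unsigned-big-integer, B:Len/binary, Rest/binary>>) ->
    [binary_to_term(B) | decode_all(Rest)];
decode_all(<<>>) ->
    [].
```

Because every record carries its own length, a reader can skip from record to record without decoding the payloads, which is what makes a byte offset into the file usable as a pointer to a record.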

Functions


crawl_files(Name) -> term()

Reads Name.in, finds all the files specified in the input, and packs them into a file called Name.crawl. Calls init:stop() when it has finished.

do(Cmd) -> term()

dump_index(Name) -> term()

extract(Name, I) -> term()

Given the name of a crawl file and a pointer into the crawl, returns the filename and content of the file stored in the crawl.

list_files(Name, L) -> term()

lookup(Name, Str) -> term()

make_index(Name, Top) -> term()

q(Name, Str) -> term()

q(Dir, Name, Str) -> term()

Joe Armstrong erlang@gmail.com