When a new server comes to our attention, we crawl that server, use the FIRSTINDEX procedure to build the three index files, and use the COPYTHEM perpetual batch job to move them to production. Then we create and optimize an FDL file for each of the three index files and place those FDL files where the REINDEX procedure expects them, so that future indexing can be done with REINDEX. This provides better performance, both for the indexing process and for the searching process.
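In outline, the sequence for a new server can look like the following sketch; the file names FIRSTINDEX.COM and COPYTHEM.COM, their location in WWW_ROOT:[INDEX], and the SYS$BATCH queue are assumptions for illustration, so substitute whatever your installation actually uses.
$ set default www_root:[index]
$! Build the three index files for the new server.
$ @firstindex
$! If the perpetual COPYTHEM batch job is not already running,
$! submit it once; it moves the finished files to production.
$ submit/queue=sys$batch copythem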
The various steps include:
Even if you crawl using "-realm", LYNX will still leave you with disk files for external pages if server-configuration redirects point to pages on another server. By sorting in URL order, however, those entries all end up at the top or bottom of the file, so they are quick for a person to identify and remove, even for a large site.
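For example, if each record of the summary file begins with the URL, a single pass through the SORT utility is enough to bring the external-server entries together; the file names below are placeholders, and you would add a /KEY qualifier if your summary format puts the URL elsewhere in the record.
$! Sort the summary file into URL order so external servers cluster.
$ sort summary.dat summary.srt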
A second batch job, DEDUP2, preserves the LNKnnnnnnnn.DAT files whose names remain in the edited, sorted summary file, and deletes the duplicates and other rejected files.
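To make the mechanics concrete, here is a minimal sketch of what a DEDUP2-style pass can look like; the WWW_ROOT:[CRAWL] directory, the SUMMARY.SRT file name, and the severity test are assumptions for illustration, not a listing of the actual DEDUP2 job.
$! Keep each LNKnnnnnnnn.DAT whose name appears in the edited,
$! sorted summary file; delete the rest.
$ loop:
$   file = f$search("www_root:[crawl]lnk*.dat")
$   if file .eqs. "" then goto done
$   name = f$parse(file,,,"NAME") + f$parse(file,,,"TYPE")
$   search/nooutput www_root:[crawl]summary.srt "''name'"
$   if $severity .ne. 1 then delete 'file'
$   goto loop
$ done:
$   exit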
At this time, the workaround I use is to
Make sure that the PID you specify is NOT that of the controlling process you are typing in.
Doing this every few hours significantly speeds up the crawl, but large sites can still take days because REJECT.DAT grows too large overnight.
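Before stopping or suspending anything, it is worth confirming which PID belongs to the process you are typing in; the F$GETJPI lexical function reports it directly, and SHOW SYSTEM lists every process with its PID so you can pick out the crawl.
$! This PID is the controlling process; do not aim the workaround at it.
$ write sys$output "This terminal session is PID ", f$getjpi("","PID")
$! List all processes and their PIDs to find the LYNX crawl.
$ show system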
$! Analyze the live index file, write an FDL file describing it,
$! and let the EDIT/FDL OPTIMIZE script tune that FDL.
$ analyze/rms/fdl/output=ohiou-sel.fdl www_root:[index]ohiou.sel
$ edit/fdl/script=optimize ohiou-sel.fdl
$! Keep only the newest version of each file here, then move every
$! file into the neighboring [-.FDL] directory (version .0 assigns
$! the next version number there) and make that directory the default.
$ purge *.*
$ rename *.* [-.fdl]*.*.0
$ set default [-.fdl]
$! Check the dates on the OHIOU FDL files, then discard older versions.
$ dir/dat *ohiou*.*.*
$ purge *ohiou*.*
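The optimized FDL pays off the next time the index file is rebuilt; one way to apply it is with CONVERT/FDL, as sketched below, although the REINDEX procedure may handle this step in its own way.
$! Sketch only: rebuild the index file according to the tuned FDL,
$! creating a new version of it.  REINDEX may do this differently.
$ convert/fdl=ohiou-sel.fdl www_root:[index]ohiou.sel www_root:[index]ohiou.sel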
Dick Piccard revised this file (http://ouvaxa.cats.ohiou.edu/vmsindex/examples/ohiocookbook.html) on September 29, 2000.
Please E-mail comments or suggestions to piccard@ohio.edu