FRC

FRC: Sample Whoosh Search

This returns a list of relevancy-ranked citations with links to the PhiloLogic4-powered version of the FRC.

Topic Model and Time Series Search Table.

By default, search terms are joined with AND, so a document must contain at least one instance of every search term to match; otherwise the search returns 0 results. Checking the "OR search" checkbox changes the operator to OR. This typically returns many more documents, listed starting with those most relevant to your query. You can supply your own operators by leaving the OR box unchecked and using upper-case operators, such as (no quotes): "conspirateurs AND aristocrates OR étrangères AND royalistes". Limit sets the number of hits displayed. Setting it to 0 returns all results, which is not advised for OR searches given how many results they can produce.
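The difference between the AND and OR modes, and the meaning of a limit of 0, can be illustrated with a small sketch. This is not how Whoosh evaluates queries internally; the toy corpus, the search function, and the ranking by number of matched terms are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Toy corpus: document id -> set of already-normalized terms (hypothetical data).
DOCS: Dict[str, Set[str]] = {
    "doc1": {"grains", "farine", "ble"},
    "doc2": {"grains", "brigands"},
    "doc3": {"nobles", "emigres"},
}


def search(terms: List[str], use_or: bool = False, limit: int = 10) -> List[str]:
    """Return ids of matching documents, best match first.

    AND (default): every term must occur in the document.
    OR: at least one term must occur; documents matching more terms rank first.
    limit=0 means "return everything".
    """
    wanted = set(terms)
    scored = []
    for doc_id, words in DOCS.items():
        matched = len(wanted & words)
        if (use_or and matched > 0) or (not use_or and matched == len(wanted)):
            scored.append((matched, doc_id))
    # Sort by number of matched terms (descending), then by id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    hits = [doc_id for _, doc_id in scored]
    return hits if limit == 0 else hits[:limit]
```

With this toy data, an AND search for "grains farine" matches only doc1, while the same terms under OR also bring in doc2, ranked after doc1 because it matches fewer terms.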

The results show the author, title, and year, followed by the score (see below); the ID number, which links to the PhiloLogic4 table of contents; the word Search, a link that searches the document for the words you put in the query box (using an OR operator); and an Internet Archive identifier, which links to the first page of the document. Note that the IA does support word and phrase searching in documents, so you can use that to find instances of your words in the page images.

Here are some more sample searches: "grains farine blé"; "brigands émigrés nobles". If you find any good ones, let me know.

This uses Python Whoosh. The index is generated from the words indexed by the PhiloLogic4 load, which have been stemmed and had accents removed. Because the index does not include all of the "words", I decided not to display snippets, which Whoosh can do. Results are sorted by the default score (BM25).
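To give a sense of what a BM25 score measures, here is a minimal sketch of the textbook Okapi BM25 formula. It is not Whoosh's own code: the function names are hypothetical, the k1 and b values are commonly used settings taken here as assumptions, and Whoosh's exact IDF formula may differ.

```python
import math
from typing import List


def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float,
                    n_docs: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Contribution of one query term to one document's score."""
    # Rare terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates with k1; b controls length normalization.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


def bm25_score(query: List[str], doc: List[str], corpus: List[List[str]]) -> float:
    """Sum the per-term scores of a query for one tokenized document."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    total = 0.0
    for term in query:
        doc_freq = sum(1 for d in corpus if term in d)
        total += bm25_term_score(doc.count(term), len(doc), avg_len,
                                 len(corpus), doc_freq)
    return total
```

The score rises with the number of times a term occurs in a document, rises for terms that are rare across the collection, and falls for documents that are longer than average.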

The Whoosh index is generated from the PhiloLogic4 data files in words_and_philo_ids, which are generated at load time using Clovis' Text PreProcessing Library. A simple script is run from the data directory of the PhiloLogic4 database you want to use under Whoosh. It reads all of the files in words_and_philo_ids and outputs structured data, which are then fed to the Whoosh indexer. Sample text preprocessing arguments:

from text_preprocessing import PreProcessor  # Clovis' Text PreProcessing Library

preproc = PreProcessor(
    modernize=True,
    is_philo_db=True,
    text_object_type="doc",
    language="french",
    ascii=True,
    min_word_length=2,
    workers=16)
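The effect of settings such as ascii=True and min_word_length=2 can be approximated with the standard library alone. The sketch below is not the Text PreProcessing Library's implementation: the function names are hypothetical, and lowercasing as well as the exact handling of the length threshold are assumptions.

```python
import unicodedata
from typing import List


def strip_accents(word: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_tokens(tokens: List[str], min_word_length: int = 2) -> List[str]:
    """Lowercase, remove accents, and drop tokens shorter than min_word_length."""
    cleaned = (strip_accents(tok.lower()) for tok in tokens)
    return [tok for tok in cleaned if len(tok) >= min_word_length]
```

Because the indexed words have had their accents removed, a query term such as "étrangères" presumably needs the same normalization before it can match the index.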

The preprocessor is then applied to generate index data using a standard schema:

from whoosh.fields import Schema, TEXT

# myan is a custom analyzer defined elsewhere in the script (not shown here)
schema = Schema(filename=TEXT(stored=True), author=TEXT(stored=True),
            title=TEXT(stored=True), date=TEXT(stored=True), year=TEXT(stored=True),
            philoid=TEXT(stored=True), divdate=TEXT(stored=True), divhead=TEXT(stored=True),
            philodbname=TEXT(stored=True), content=TEXT(stored=True, analyzer=myan))

We extract the metadata and text content from the returned object and load them into the Whoosh index.

for text_object in preproc.process_texts(files2process):
    # Join the preprocessed tokens into a single string for the content field
    textobject = " ".join(text_object)
    # Pull the document-level metadata carried by the returned object
    philoid = text_object.metadata.get("philo_id")
    author = text_object.metadata.get("author")
    title = text_object.metadata.get("title")
    date = text_object.metadata.get("create_date")
    # Note: divdate is read from the same "create_date" field as date
    divdate = text_object.metadata.get("create_date")
    divhead = text_object.metadata.get("head")
    year = text_object.metadata.get("year")
    filename = text_object.metadata.get("filename")
    # index_writer is the Whoosh index writer created earlier in the script
    index_writer.add_document(
        philoid=philoid, author=author, title=title, date=date, year=year,
        filename=filename, divdate=divdate, divhead=divhead, content=textobject)

This process can be iterated across multiple PhiloLogic4 instances, since we are using standard data structures.