Rev Law All 102 volumes: Sample Whoosh Search
This returns a list of ranked relevancy citations with links to the PhiloLogic4-powered version of all 102 volumes of RevLaw.
The default search binds search terms with AND, meaning a matching document must contain at least one instance of each search term, otherwise the search returns 0 results. Checking the "OR search" checkbox changes the operator to OR. This typically returns many more documents, but the list starts with those most relevant to your query. You can also supply your own operators by leaving the OR box unchecked and using upper-case operators, such as (no quotes): "conspirateurs AND aristocrates OR étrangères AND royalistes". Limit sets the number of hits displayed. Setting it to 0 returns all results, which is not advised for OR searches given how many results they can produce.
The results show the author, title, and year, followed by the score (see below); the ID number, which links to the PhiloLogic4 table of contents; the word Search, which links to a search of that document for the words in your query box (using an OR operator); and an Internet Archive identifier, which goes to the first page of the document. Note that the IA does support word and phrase searching within documents, so you can use that to find instances of your words in the page images.
Here are some more sample searches: "grains farine blé"; "brigands émigrés nobles". If you find any good ones, let me know.
This is built with Python Whoosh. The index is generated from the words indexed by the PhiloLogic4 load, which have been stemmed and had accents removed. Since not all of the "words" are included, I decided not to display snippets, which Whoosh can do. Results are sorted by the default score (BM25).

The Whoosh index is generated from the PhiloLogic4 data files in words_and_philo_ids, which are produced at load time using Clovis' Text PreProcessing Library. A simple script is run from the data directory of the PhiloLogic4 database you want to use under Whoosh; in this instance, the script loads from a list of two databases. It reads all of the files in words_and_philo_ids and outputs structured data, which is then fed to the Whoosh indexer. Sample text preprocessing arguments:
    preproc = PreProcessor(
        modernize=True,
        is_philo_db=True,
        text_object_type="doc",
        language="french",
        ascii=True,
        min_word_length=2,
        workers=16,
    )

This is then applied to generate index data using a standard schema:
    schema = Schema(
        filename=TEXT(stored=True),
        author=TEXT(stored=True),
        title=TEXT(stored=True),
        date=TEXT(stored=True),
        year=TEXT(stored=True),
        philoid=TEXT(stored=True),
        divdate=TEXT(stored=True),
        divhead=TEXT(stored=True),
        philodbname=TEXT(stored=True),
        content=TEXT(stored=True, analyzer=myan),
    )

From a list of PhiloLogic4 databases
    philodirs = [
        "/var/www/html/intertextual_hub/journaux_de_marat/",
        "/var/www/html/philologic/frc1787-99rev2",
    ]

we extract the metadata and text content from the returned objects and load them into the Whoosh index:
    for text_object in preproc.process_texts(files2process):
        textobject = " ".join(text_object)
        philoid = text_object.metadata.get("philo_id")
        author = text_object.metadata.get("author")
        title = text_object.metadata.get("title")
        date = text_object.metadata.get("create_date")
        divdate = text_object.metadata.get("create_date")
        divhead = text_object.metadata.get("head")
        year = text_object.metadata.get("year")
        filename = text_object.metadata.get("filename")
        index_writer.add_document(
            philoid=philoid,
            author=author,
            title=title,
            date=date,
            year=year,
            filename=filename,
            divdate=divdate,
            divhead=divhead,
            content=textobject,
        )

We will replicate this basic procedure for all of the operations in the Intertextual Hub.