Lemmatization and word frequency count

conducted by: Eugen Stroh

 

This work aims to lemmatize the French translation of the Chronicle of Michael the Syrian and to count its word frequencies. The results will support an exploratory analysis of the text, from which research hypotheses can be derived.

 

For this purpose, a tool is to be developed that supports I/O operations, data preprocessing, and the lemmatization itself. KNIME was used for prototyping; the implementation will be done entirely in Python. French is lemmatized with the Python module “Pattern”, developed by the CLiPS research center at the University of Antwerp.
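The steps named above (preprocessing, lemmatizing, counting) can be sketched roughly as follows. This is only an illustrative outline, not the tool itself: the function names, the sample sentence, and the small lemma table are assumptions made for the example. In the actual tool the lemmas would come from Pattern rather than from a hand-written table.

```python
import re
from collections import Counter

# Hypothetical stand-in lemma table for illustration only;
# the real tool would obtain lemmas from the Pattern module.
TOY_LEMMAS = {
    "rois": "roi",
    "villes": "ville",
    "furent": "être",
    "fut": "être",
    "prise": "prendre",
    "prises": "prendre",
}


def tokenize(text):
    """Preprocessing: lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lemmatize(tokens, lemma_table):
    """Map each token to its lemma; unknown tokens are kept unchanged."""
    return [lemma_table.get(token, token) for token in tokens]


def lemma_frequencies(text, lemma_table):
    """Count how often each lemma occurs in the text."""
    return Counter(lemmatize(tokenize(text), lemma_table))


# Made-up sample sentence, not taken from the chronicle.
sample = "Les rois furent vaincus. La ville fut prise et les villes furent prises."
freqs = lemma_frequencies(sample, TOY_LEMMAS)

for lemma, count in freqs.most_common(3):
    print(lemma, count)
```

Counting lemmas instead of surface forms merges inflected variants such as "fut" and "furent" into one entry, which is what makes the frequency list useful for exploration.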

 

Pattern can also lemmatize German. A further goal of this work is therefore to integrate a German lemmatization module into the tool. The tool and its modules should be usable and extensible beyond the lemmatization of the chronicle.