1. State of progress since last meeting
Jaspret and Roar: Worked on their anonymization system. They use Jena to read entries from ontologies for entity matching, and have also included Thomas' PoS tagger in the system.
Ingrid: Looked at ways to do phrases/sentence matching.
Hans: Received a collection of standoff files from Christine's annotation. Looked at some existing document representations for Java (Apache Lucene Document and Apache JCas). Tested some phrase/sentence matching methods.
2. Production of a normalized, common sentence format and numbering. It should also be compatible with the (Brat) standoff output format from Christine.
Everyone will look for existing frameworks for representing documents in Java, e.g. the Document class used in Lucene.
Also, everyone will make a list of the properties required of such a representation or framework. This is homework until the next meeting.
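To make the compatibility requirement concrete, here is a minimal Java sketch of how numbered sentences and Brat text-bound annotations could be tied together through character offsets. It is not an agreed format. The class and method names are hypothetical, and it assumes one sentence per line in the text file, 1-based sentence numbers, and contiguous annotation spans only. The annotation line layout (ID, tab, type and offsets, tab, covered text, with the end offset exclusive) follows the Brat standoff format.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: numbered sentences plus Brat text-bound annotations. */
final class StandoffSketch {

    /** A sentence with a 1-based number and character offsets (end is exclusive). */
    static final class Sentence {
        final int number;
        final int begin;
        final int end;

        Sentence(int number, int begin, int end) {
            this.number = number;
            this.begin = begin;
            this.end = end;
        }
    }

    /** A contiguous text-bound annotation, e.g. "T1<TAB>Profession 29 35<TAB>doctor". */
    static final class Annotation {
        final String id;
        final String type;
        final int begin;
        final int end;
        final String text;

        Annotation(String id, String type, int begin, int end, String text) {
            this.id = id;
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** Assumption: one sentence per line. Empty lines are skipped and get no number. */
    static List<Sentence> numberSentences(String document) {
        List<Sentence> sentences = new ArrayList<>();
        int begin = 0;
        int number = 1;
        while (begin <= document.length()) {
            int newline = document.indexOf('\n', begin);
            int end = (newline < 0) ? document.length() : newline;
            if (end > begin) {
                sentences.add(new Sentence(number++, begin, end));
            }
            if (newline < 0) {
                break;
            }
            begin = newline + 1;
        }
        return sentences;
    }

    /** Parses one text-bound ("T") line; discontinuous spans are rejected. */
    static Annotation parseTextBound(String line) {
        String[] fields = line.split("\t", 3);
        if (fields.length < 3 || !fields[0].startsWith("T")) {
            throw new IllegalArgumentException("Not a text-bound annotation: " + line);
        }
        String[] head = fields[1].split(" ");
        if (head.length != 3) {
            throw new IllegalArgumentException("Only contiguous spans are handled: " + line);
        }
        return new Annotation(fields[0], head[0],
                Integer.parseInt(head[1]), Integer.parseInt(head[2]), fields[2]);
    }

    /** Returns the number of the sentence that fully contains the annotation, or -1. */
    static int sentenceNumberOf(List<Sentence> sentences, Annotation annotation) {
        for (Sentence s : sentences) {
            if (s.begin <= annotation.begin && annotation.end <= s.end) {
                return s.number;
            }
        }
        return -1;
    }
}
```

Keeping absolute character offsets on each sentence means an annotation from a standoff file can be mapped to a sentence number without re-tokenizing the text. Whether this belongs in the list of required properties is for the group to decide.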
3. Common dictionary format
Flat dictionaries are stored as .txt files (UTF-8), one entry per line (including any metadata).
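As an illustration of the flat format, a reader in Java might look roughly like the sketch below. The minutes do not say how metadata is separated from the entry, so the tab separator is an assumption, as are the class and method names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader for a flat dictionary: UTF-8 text, one entry per line. */
final class FlatDictionary {

    /** One dictionary line: the entry itself plus any metadata fields after it. */
    static final class Entry {
        final String term;
        final List<String> metadata;

        Entry(String term, List<String> metadata) {
            this.term = term;
            this.metadata = metadata;
        }
    }

    /** Assumption: the entry and its metadata fields are tab-separated. */
    static Entry parseLine(String line) {
        String[] fields = line.split("\t", -1);
        List<String> metadata =
                new ArrayList<>(Arrays.asList(fields).subList(1, fields.length));
        return new Entry(fields[0].trim(), metadata);
    }

    /** Parses already-loaded lines, skipping blank ones. */
    static List<Entry> parseLines(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                entries.add(parseLine(line));
            }
        }
        return entries;
    }

    /** Reads a .txt dictionary file, decoding it explicitly as UTF-8. */
    static List<Entry> read(Path file) throws IOException {
        return parseLines(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding, which matters for entries containing characters such as æ, ø and å.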
We have decided to use Jena (http://jena.apache.org/) for reading from ontologies.
4. Other issues
Øystein suggests that we have a look at "SimString" (from the people behind Brat) for phrase matching. It is used for normalization in Brat 1.3 (maybe something to help Christine).
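For those who have not seen this kind of matching: SimString retrieves strings by character n-gram similarity. The Java sketch below shows the basic idea with a brute-force comparison against every dictionary entry. It is not SimString itself and does not use its index. The class name, the trigram size, the padding and the threshold are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical brute-force matcher based on character trigram cosine similarity. */
final class NgramMatcher {

    /** Lower-cases the string, pads it with two spaces on each side, and collects its trigrams. */
    static Set<String> trigrams(String s) {
        String padded = "  " + s.toLowerCase(Locale.ROOT) + "  ";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }

    /** Set-based cosine similarity: |X and Y| / sqrt(|X| * |Y|). */
    static double cosine(String a, String b) {
        Set<String> x = trigrams(a);
        Set<String> y = trigrams(b);
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return common.size() / Math.sqrt((double) x.size() * y.size());
    }

    /** Returns every dictionary entry whose similarity to the query reaches the threshold. */
    static List<String> match(String query, List<String> dictionary, double threshold) {
        List<String> hits = new ArrayList<>();
        for (String entry : dictionary) {
            if (cosine(query, entry) >= threshold) {
                hits.add(entry);
            }
        }
        return hits;
    }
}
```

With this sketch, the misspelling "diabetis" shares 7 of 10 trigrams with "diabetes", giving a similarity of 0.7, so it would be matched at a threshold of 0.6. A linear scan like this only suits small dictionaries.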
5. TODO till next meeting
Everyone: See 2.
Jaspret & Roar: Create a preprocessed version of the data that includes PoS tags.
Ingrid & Hans: Derive a 'CVC dictionary' from Christine's offset files.