
1. State of progress since last meeting

Jaspret and Roar: Worked on their anonymization system. They use Jena to read entries from ontologies for entity matching, and they have also integrated Thomas' PoS tagger into their system.

Ingrid: Looked at ways to do phrase/sentence matching.

Hans: Received a collection of standoff files from Christine's annotation. Looked at some existing document representations for Java (Apache Lucene Document and Apache JCas). Tested some phrase/sentence matching methods.

2. Production of a normalized, common sentence format and numbering. The format should also be compatible with the (Brat) standoff output format from Christine.

Everyone will look for existing frameworks for representing documents in Java, e.g. the Document class used in Lucene.
Everyone will also make a list of the properties required of such a representation or framework. This is homework until the next meeting.
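To make the compatibility requirement concrete, here is a minimal Java sketch of reading text-bound annotations from Brat standoff (.ann) lines, which have the form ID, tab, type and character offsets, tab, covered text. The class and method names (BratStandoff, TextBound, parseLine) are hypothetical, and the entity types in the usage note are illustrative rather than taken from Christine's annotation. Discontinuous spans are collapsed to their outer offsets here, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal reader for text-bound ("T") lines of a Brat standoff file (hypothetical helper). */
class BratStandoff {

    /** One text-bound annotation: id, type, character offsets, covered text. */
    static final class TextBound {
        final String id;
        final String type;
        final int start;
        final int end;
        final String text;

        TextBound(String id, String type, int start, int end, String text) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Parses one text-bound line; returns null for any other kind of line. */
    static TextBound parseLine(String line) {
        if (!line.startsWith("T")) {
            return null; // relations, events, attributes, notes etc. are skipped
        }
        String[] cols = line.split("\t", 3);
        if (cols.length < 3) {
            return null;
        }
        // Middle column: type, then offsets; ';' separates fragments of a discontinuous span.
        String[] head = cols[1].split(" ", 2);
        if (head.length < 2) {
            return null;
        }
        String[] fragments = head[1].split(";");
        String[] first = fragments[0].trim().split(" ");
        String[] last = fragments[fragments.length - 1].trim().split(" ");
        int start = Integer.parseInt(first[0]);
        int end = Integer.parseInt(last[last.length - 1]);
        return new TextBound(cols[0], head[0], start, end, cols[2]);
    }

    /** Parses all text-bound annotations from the lines of one .ann file. */
    static List<TextBound> parse(List<String> lines) {
        List<TextBound> result = new ArrayList<>();
        for (String line : lines) {
            TextBound t = parseLine(line);
            if (t != null) {
                result.add(t);
            }
        }
        return result;
    }
}
```

For example, BratStandoff.parseLine("T1\tOrganization 0 4\tSony") yields an annotation of type Organization covering characters 0 to 4. Any common sentence format would need to keep character offsets into the original text so that annotations like this can still be aligned after sentence splitting and numbering.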

3. Common dictionary format

Flat dictionaries are stored as .txt files (UTF-8), one entry per line (including any metadata).
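A minimal Java sketch of loading such a flat dictionary is shown below. The minutes do not say how metadata is separated from the entry, so the tab separator used here is an assumption, and the names FlatDictionary, parse and load are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Loader for flat dictionaries: UTF-8 text, one entry per line (hypothetical helper). */
class FlatDictionary {

    /**
     * Maps each entry to its metadata string (empty if the line has none).
     * ASSUMPTION: metadata follows the entry after the first tab character.
     */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> entries = new LinkedHashMap<>(); // keeps file order
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            int tab = line.indexOf('\t');
            String term = tab < 0 ? line : line.substring(0, tab);
            String meta = tab < 0 ? "" : line.substring(tab + 1);
            // A repeated entry overwrites the metadata of the earlier one.
            entries.put(term.trim(), meta.trim());
        }
        return entries;
    }

    /** Reads a dictionary file, decoding it explicitly as UTF-8. */
    static Map<String, String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

Decoding explicitly as UTF-8, rather than relying on the platform default, keeps the dictionaries portable between machines.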

We have decided to use Jena (http://jena.apache.org/) for reading from ontologies.

4. Other issues

Øystein suggests that we have a look at "SimString" (from the people behind Brat) for phrase matching. It is used for normalization in Brat 1.3, so it may also be something to help Christine.
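To illustrate the kind of approximate matching meant here, the following Java sketch compares strings by the cosine similarity of their character trigram sets. This is not the SimString library or its indexing algorithm, only a brute-force illustration of the same similarity idea. The class name NgramMatcher, the padding scheme and the threshold parameter are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Brute-force approximate string matching on character trigrams (illustrative only). */
class NgramMatcher {

    /** Returns the set of lower-cased character trigrams, padded with two spaces on each side. */
    static Set<String> trigrams(String s) {
        String padded = "  " + s.toLowerCase(Locale.ROOT) + "  ";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }

    /** Set-based cosine similarity: |X ∩ Y| / sqrt(|X| * |Y|). */
    static double cosine(String a, String b) {
        Set<String> x = trigrams(a);
        Set<String> y = trigrams(b);
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return common.size() / Math.sqrt((double) x.size() * y.size());
    }

    /** Returns all dictionary entries whose similarity to the query reaches the threshold. */
    static List<String> match(String query, Collection<String> dictionary, double threshold) {
        List<String> hits = new ArrayList<>();
        for (String entry : dictionary) {
            if (cosine(query, entry) >= threshold) {
                hits.add(entry);
            }
        }
        return hits;
    }
}
```

This version scans every dictionary entry for each query, so it only suits small dictionaries. Avoiding that full scan is the reason to look at a dedicated tool.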

5. TODO until next meeting

Everyone: See 2.

Jaspret & Roar: Create a preprocessed version of the data that includes PoS tags.

Ingrid & Hans: Derive a 'CVC dictionary' from Christine's offset files.
