Øystein
To start working on the new data from Haldor, we need...
- Rune: Tell everyone about stand-off-management?
- Look into BRAT
- Examples from the BioNLP 2013 competition
- Laura: Add a timestamp (down to the minute) and the author to every text file.
- The files are in RTF, and they are very messy
- Haldor can convert them to plain text using DIPS scripts
- Create a 10-fold split of the dataset
- Use 8 folds for training and 1 for development testing; keep the last fold secret for FINAL testing
- File formats
- Øystein: We should use the I2B2 format
- Rune: I recommend the BioNLP (txt/a1/a2) file format: http://2013.bionlp-st.org/file-formats
- Sentence Separation
- Hans: Using available Java code from the web.
- Is it necessary? Yes, the data from Haldor has no sentence boundaries.
- Stoplist, stemlist,
- Upload this on the
- Check Thomas' sentence splitting
- Tools: http://2013.bionlp-st.org/supporting-resources
- Contains tokenisation, lemmatisation, sentence splitting, chunking and PoS-tagging in a unified BioC XML format
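
Sketch of the 10-fold split discussed above (8 folds for training, 1 for development testing, 1 kept secret for FINAL testing). This is only a possible starting point, not an agreed implementation. The function names, the fixed seed, and the example document IDs are all assumptions made for illustration; the real IDs would come from the converted files. Splitting is done per document, which assumes documents are the right unit to keep apart.

```python
import random


def make_folds(doc_ids, n_folds=10, seed=0):
    """Shuffle document IDs reproducibly and deal them into n_folds disjoint folds.

    Sorting first makes the result independent of the input order, so the
    same seed always gives the same folds for the same set of documents.
    """
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    # Fold k receives every n_folds-th document, starting at position k.
    return [ids[k::n_folds] for k in range(n_folds)]


def split_roles(folds):
    """Assign roles: first 8 folds train, 9th development, 10th FINAL test."""
    if len(folds) != 10:
        raise ValueError("expected exactly 10 folds")
    train = [doc for fold in folds[:8] for doc in fold]
    dev = list(folds[8])
    final_test = list(folds[9])  # keep secret until the final evaluation
    return train, dev, final_test


if __name__ == "__main__":
    # Hypothetical document IDs, for illustration only.
    doc_ids = [f"doc{i:03d}" for i in range(100)]
    train, dev, final_test = split_roles(make_folds(doc_ids))
    print(len(train), len(dev), len(final_test))  # 80 10 10
```

Fixing the seed and storing the resulting fold lists once means everyone trains and tests on the same documents, and the FINAL fold can be held by one person only.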