15
Personally I would advocate using something that is both not-specific to the NLP field, and something that is sufficiently general that it can still be used as a tool even when you've started moving beyond this level of metadata. I would especially pick a format that can be used regardless of development environment and one that can keep some basic structure if that becomes relevant (like tokenization)
It might seem strange, but I would honestly suggest JSON. It's extremely well supported, supports a lot of structure, and is flexible enough that you shouldn't have to move from it for not being powerful enough. For your example, something like this:
{'text': 'I saw the company's manager last day.", {'Person': [{'name': 'John'}, {'indices': [0:1]}, etc...]}
The one big advantage you've got over any NLP-specific formats here is that JSON can be parsed in any environment, and since you'll probably have to edit your format anyway, JSON lends itself to very simple edits that give you a short distance to other formats.
You can also implicitly store tokenization information if you want:
{"text": ["I", "saw", "the", "company's", "manager", "last", "day."]}
EDIT: To clarify the mapping of metadata is pretty open, but here's an example:
{'body': '',
'metadata':
{'':
{'': '',
'location': [, ]
}
}
}
Hope that helps, let me know if you've got any more questions.
This post was edited by Shivakumar Kota at June 11, 2019 4:52 PM IST