How to annotate text documents with meta-data?

  • Given a large number of unstructured, natural-language text documents, what are the possible ways of annotating them with semantic meta-data? For example, consider a short document:

    I saw the company's manager last day.

    To extract information from it, the text must be annotated with additional data that makes it less ambiguous. How that meta-data is found is not in question, so assume it is done manually. The question is how to store this data so that further analysis can be done more conveniently and efficiently.

    A possible approach is to use XML tags (see below), but it seems too verbose, and maybe there are better approaches/guidelines for storing such meta-data on text documents.

    I saw the company's
    manager last day.
      June 11, 2019 4:48 PM IST
  • The brat annotation tool might be useful for you as per my comment. I have tried many such tools and this is the best I have found. It has a nice user interface and supports a number of different annotation types. The annotations are stored in a separate .ann file, which contains each annotation along with its location within the original document. A word of warning, though: if you ultimately want to feed the annotations into a classifier such as the Stanford NER tool, you will have to convert the data into a format that it accepts.
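    The conversion step mentioned above could look roughly like the sketch below. It assumes brat's text-bound annotation lines (an ID, then a label with start and end offsets, then the covered text, separated by tabs) and produces one token/label pair per token, which is the general shape many sequence taggers train on. The sample annotations, the label names and the whitespace tokenizer are illustrative assumptions; check the target tool's documentation for the exact input format it expects.

```python
import re

text = "I saw the company's manager last day."

# Hypothetical .ann content: ID <TAB> label start end <TAB> covered text
ann = "T1\tPerson 0 1\tI\nT2\tOrganization 10 17\tcompany\n"


def parse_ann(ann_text):
    """Return (label, start, end) for each contiguous text-bound annotation."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, notes and other line types
        _id, info, _covered = line.split("\t")
        if ";" in info:
            continue  # discontinuous spans are ignored in this sketch
        label, start, end = info.split(" ")
        spans.append((label, int(start), int(end)))
    return spans


def to_token_labels(text, spans):
    """Label each whitespace token with the first span it overlaps, else 'O'."""
    rows = []
    for match in re.finditer(r"\S+", text):
        label = "O"
        for span_label, start, end in spans:
            if match.start() < end and start < match.end():
                label = span_label
                break
        rows.append((match.group(), label))
    return rows


rows = to_token_labels(text, parse_ann(ann))
for token, label in rows:
    print(f"{token}\t{label}")
```

    Note that the whitespace tokenizer labels the whole token "company's" as Organization, although the annotation only covers "company"; a finer tokenizer would be needed to keep that distinction.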
      August 25, 2021 6:19 PM IST
  • In general, you don't want to use XML tags to tag documents in this way because tags may overlap.

    UIMA, GATE and similar NLP frameworks store the tags separately from the text (often called standoff annotation). Each tag, such as Person, ACME, John etc., is stored with the position where it begins and the position where it ends. So the tag ACME would be stored as starting at position 11 and ending at position 17.
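    As a minimal sketch of this idea (not the actual UIMA or GATE data model), the spans can be kept in a list next to the untouched text. The sketch uses 0-based, end-exclusive offsets, so positions 11 to 17 above become 10 to 17 here. The Mention and Role spans are hypothetical and are only there to show two annotations that cross each other, which inline XML tags cannot express.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    label: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


text = "I saw the company's manager last day."

spans = [
    Span("Person", 0, 1),
    Span("Organization", 10, 17),
    Span("Mention", 6, 17),   # hypothetical: "the company"
    Span("Role", 10, 27),     # hypothetical: "company's manager"
]


def covered_text(text, span):
    """Return the slice of the text that a span refers to."""
    return text[span.start:span.end]


def crosses(a, b):
    """True if the spans partially overlap without one containing the other."""
    return (a.start < b.start < a.end < b.end) or (b.start < a.start < b.end < a.end)


def crossing_pairs(spans):
    """Return all pairs of spans that cross each other."""
    return [
        (a, b)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
        if crosses(a, b)
    ]


for span in spans:
    print(span.label, repr(covered_text(text, span)))
print(crossing_pairs(spans))
```

    Because the text itself is never modified, any number of annotation layers can be added or removed without re-parsing the document.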
      August 26, 2021 1:57 PM IST
  • Personally I would advocate using something that is not specific to the NLP field, and that is sufficiently general that it can still be used as a tool even when you've started moving beyond this level of metadata. I would especially pick a format that can be used regardless of development environment and one that can keep some basic structure if that becomes relevant (like tokenization).

    It might seem strange, but I would honestly suggest JSON. It's extremely well supported, supports a lot of structure, and is flexible enough that you shouldn't have to move from it for not being powerful enough. For your example, something like this:

    {"text": "I saw the company's manager last day.", "Person": [{"name": "John", "indices": [0, 1]}]}

    (and so on for any further annotations)

    The one big advantage you've got over any NLP-specific formats here is that JSON can be parsed in any environment, and since you'll probably have to edit your format anyway, JSON lends itself to very simple edits that give you a short distance to other formats.
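    To illustrate, here is a small sketch that writes such a record to JSON, reads it back and looks up the text each annotation points at. The key names follow the example above; treating "indices" as a 0-based [start, end) pair of character offsets is my assumption, and the helper name is made up for the illustration.

```python
import json

record = {
    "text": "I saw the company's manager last day.",
    "Person": [{"name": "John", "indices": [0, 1]}],
}

# Round-trip through a JSON string, as any other environment could do.
serialized = json.dumps(record)
restored = json.loads(serialized)


def mentions(rec, label):
    """Return (name, covered text) for every annotation stored under a label."""
    found = []
    for ann in rec.get(label, []):
        start, end = ann["indices"]  # assumed: 0-based, end-exclusive
        found.append((ann["name"], rec["text"][start:end]))
    return found


print(mentions(restored, "Person"))
```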

    You can also implicitly store tokenization information if you want:

    {"text": ["I", "saw", "the", "company's", "manager", "last", "day."]}
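    With a token list like this, annotations can point at token positions instead of character offsets. The sketch below is one way to do that; the "annotations", "label" and "tokens" keys are hypothetical names, and the character offsets are recovered under the assumption that tokens were separated by single spaces.

```python
import json

doc = {
    "text": ["I", "saw", "the", "company's", "manager", "last", "day."],
    # Hypothetical layout: token ranges are 0-based and end-exclusive.
    "annotations": [{"label": "Organization", "tokens": [3, 4]}],
}


def token_offsets(tokens):
    """Character (start, end) of each token, assuming single-space joins."""
    offsets = []
    position = 0
    for token in tokens:
        offsets.append((position, position + len(token)))
        position += len(token) + 1  # skip the separating space
    return offsets


def annotated_tokens(doc, annotation):
    """Return the tokens that an annotation covers."""
    start, end = annotation["tokens"]
    return doc["text"][start:end]


offsets = token_offsets(doc["text"])
print(json.dumps(doc))
print(annotated_tokens(doc, doc["annotations"][0]), offsets[3])
```

    The trade-off is granularity: at this tokenization the annotation can only cover "company's" as a whole, not "company" alone.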

    EDIT: To clarify, the mapping of metadata is pretty open, but here's an example:

    {"body": "<text>",
     "metadata":
       {"<TYPE>":
         {"<attribute>": "<value>",
          "location": [<start_index>, <end_index>]
         }
       }
    }

    Hope that helps, let me know if you've got any more questions.

    This post was edited by Shivakumar Kota at June 11, 2019 4:52 PM IST
      June 11, 2019 4:50 PM IST
  • Describing all existing data is a difficult task, but you can reuse an existing data model such as http://schema.org/, which defines structured types of information. It was originally designed for markup, so it may be useful for your task.
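    As a rough sketch of how that vocabulary could be combined with offset-based annotations, the example below describes an entity with schema.org types in JSON-LD style and attaches it to a character span. Person, Organization, name and worksFor are schema.org terms; the surrounding "spans", "start", "end" and "entity" keys are hypothetical, and the names John and ACME are taken from the examples in this thread.

```python
import json

# Entity description using schema.org types, written in JSON-LD style.
entity = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "John",
    "worksFor": {"@type": "Organization", "name": "ACME"},
}

# Hypothetical wrapper linking the entity to a 0-based [start, end) span.
annotation = {
    "text": "I saw the company's manager last day.",
    "spans": [{"start": 0, "end": 1, "entity": entity}],
}

serialized = json.dumps(annotation, indent=2)
restored = json.loads(serialized)

first = restored["spans"][0]
covered = restored["text"][first["start"]:first["end"]]
print(covered, "->", first["entity"]["@type"], first["entity"]["name"])
```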

     
      November 18, 2021 12:16 PM IST