Terminology Extraction

  1. Given

    • a set of documents about a topic (foreground)

    • a set of documents about diverse topics (background)

  2. Find ranked list of terms

  3. Uses

    • terms for previously described tasks, search terms, forecasting, etc.

The Termolator

  • In-line term system

    • Finds instances of terms (tokens)

  • Distributional Term system

    • Finds term types

    • Ranks term types by how characteristic they are of a particular topic

    • Top n term types are kept; the rest are discarded

    • Uses metrics similar to TF-IDF
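
The distributional ranking described above can be sketched roughly as follows. This is only a minimal illustration of a TF-IDF-like score, not the system's actual metric: the function name `rank_terms`, the smoothing in the logarithm, and the toy documents are all assumptions made for the example. Each document is represented as a list of already-extracted candidate terms.

```python
import math
from collections import Counter

def rank_terms(foreground, background, n=3):
    """Rank candidate terms by a TF-IDF-like score and keep the top n.

    foreground, background: lists of documents, each a list of term strings.
    The scoring formula is an assumed stand-in for the real metric.
    """
    # Term frequency: total occurrences across the foreground documents.
    tf = Counter(term for doc in foreground for term in doc)
    # Document frequency: number of background documents containing the term.
    df = Counter(term for doc in background for term in set(doc))
    n_bg = len(background)
    scores = {
        term: count * math.log((1 + n_bg) / (1 + df[term]))
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

foreground = [["gene expression", "cell", "gene expression"],
              ["cell", "protein folding"]]
background = [["cell", "market"], ["cell", "election"], ["market"]]
print([term for term, _ in rank_terms(foreground, background, n=2)])
# prints ['gene expression', 'protein folding']
```

Here "cell" is frequent in the foreground but also common in the background, so it ranks below the two topic-specific terms and is discarded when n=2.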

In-line term extraction system

  • Manual rule-based chunker

    • Identifies sequences of nouns and adjectives using POS tags

    • Identifies technical words

  • Well-formedness filter

    • Eliminates ill-formed terms, terms without OOV (out-of-vocabulary) or technical words, or names

  • Supplementary patterns

    • abbreviation patterns

    • terms matching regex patterns
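
As a rough illustration of the chunking step, the sketch below collects maximal runs of adjectives and nouns from POS-tagged input. It assumes Penn Treebank-style tags (`JJ*` for adjectives, `NN*` for nouns) and assumes that a candidate must end in a noun; neither detail is stated in the outline, and `chunk_candidates` is a hypothetical name. The real chunker's hand-written rules are not reproduced here.

```python
def chunk_candidates(tagged):
    """Return maximal adjective/noun runs that end in a noun.

    tagged: list of (word, pos_tag) pairs with Penn Treebank-style tags
    (an assumption; any tagger's output could be mapped to this form).
    """
    candidates, run = [], []
    # A sentinel token with a non-matching tag flushes the final run.
    for word, tag in list(tagged) + [("", "END")]:
        if tag.startswith("NN") or tag.startswith("JJ"):
            run.append((word, tag))
            continue
        # Trim trailing adjectives so that the chunk ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            candidates.append(" ".join(w for w, _ in run))
        run = []
    return candidates

sentence = [("The", "DT"), ("recurrent", "JJ"), ("neural", "JJ"),
            ("network", "NN"), ("is", "VBZ"), ("fast", "JJ"), (".", ".")]
print(chunk_candidates(sentence))
# prints ['recurrent neural network']
```

The lone adjective "fast" is dropped because it contains no noun; the candidates that survive would then go on to the well-formedness filter.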

Well-formedness filter

  • A term is well formed if it:

    • is an abbreviation

    • is a single OOV (out-of-vocabulary) word

    • matches a regex pattern
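
The three conditions above could be checked along these lines. Everything concrete in this sketch is a placeholder: the tiny `VOCABULARY` set, the `ABBREVIATION` regex, and the single entry in `PATTERNS` are hypothetical stand-ins, since the outline does not say which dictionary or patterns the system uses.

```python
import re

# Hypothetical stand-ins; the real word list and patterns are not given.
VOCABULARY = {"the", "cell", "network", "of", "protein", "vitamin"}
ABBREVIATION = re.compile(r"^[A-Z]{2,}s?$")       # e.g. "DNA", "RNAs"
PATTERNS = [re.compile(r"^[A-Za-z]+ [A-Z0-9]+$")]  # e.g. "vitamin B12"

def is_well_formed(term):
    """Return True if the term passes any of the three listed tests."""
    words = term.split()
    # 1. The term is an abbreviation.
    if len(words) == 1 and ABBREVIATION.match(term):
        return True
    # 2. The term is a single out-of-vocabulary word.
    if len(words) == 1 and term.lower() not in VOCABULARY:
        return True
    # 3. The term matches one of the regex patterns.
    return any(p.match(term) for p in PATTERNS)

for term in ["DNA", "phosphorylation", "vitamin B12", "the cell"]:
    print(term, is_well_formed(term))
```

With these placeholder resources, the first three terms pass (abbreviation, single OOV word, regex match) and "the cell" is rejected.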

Relevance Filter

  • Time consuming

  • Yahoo search for an exact match of the term

  • Calculate relevance

  • H^2 * T

    • H = 0-1 score based on a log function

    • T = percentage of top 10 hits that are articles or patents

      • Based on keyword search
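
A worked sketch of the H^2 * T score follows. It takes the 0-1 log-based score to be H, which the outline implies but does not state outright. The exact log function is not given, so `h_score` assumes a base-10 log of the hit count scaled by a hypothetical ceiling `MAX_HITS`; the `KEYWORDS` list used to recognize articles and patents is likewise invented for illustration. Search results are passed in as plain values, so no search API is called.

```python
import math

# Hypothetical keyword list and hit-count ceiling; the source gives neither.
KEYWORDS = ("article", "journal", "patent", "proceedings")
MAX_HITS = 1e9

def h_score(hit_count):
    """Map a raw hit count to the 0-1 range with a log function (assumed form)."""
    if hit_count < 1:
        return 0.0
    return min(1.0, math.log10(hit_count) / math.log10(MAX_HITS))

def t_score(top_hits):
    """Fraction of the top 10 hits whose text contains one of the keywords."""
    top = top_hits[:10]
    if not top:
        return 0.0
    matches = sum(any(k in hit.lower() for k in KEYWORDS) for hit in top)
    return matches / len(top)

def relevance(hit_count, top_hits):
    """Combine the two factors as H^2 * T."""
    return h_score(hit_count) ** 2 * t_score(top_hits)

hits = ["Patent US123", "Journal article on X", "Shopping page", "Blog"]
print(round(relevance(1000, hits), 4))
# prints 0.0556
```

When fewer than 10 hits come back, this sketch divides by the number of hits actually returned; dividing by 10 instead would be an equally plausible reading of "percentage of top 10 hits".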