This method of classification provides greater flexibility when classifying text and doesn’t rely on a particular taxonomy to understand and categorize a piece of text.
Explicit Semantic Analysis (ESA) works at the level of meaning rather than on the surface form vocabulary of a word or document. ESA represents the meaning of a piece text, as a combination of the concepts found in the text and is used in document classification, semantic relatedness calculation (i.e. how similar in meaning two words or pieces of text are to each other), and information retrieval.
In document classification, for example, documents are tagged to make them easier to manage and sort. Tagging a document with keywords makes it easier to find. However, keyword tagging alone has it’s limitations; searches carried out using vocabulary with similar meaning, but different actual words may not uncover relevant documents. However classifying text semantically i.e. representing the document as concepts and lowering the dependence on specific keywords can greatly improve a machine’s understanding of text.
How is Explicit Semantic Analysis achieved?
Wikipedia is a large and diverse knowledge base where each article can be considered a distinct concept. In Wikipedia based ESA, a concept is generated for each article. Each concept is then represented as a vector of the words which occur in the article, weighted by their tf-idf score.
The meaning of any given word can then be represented as a vector of that word’s relatedness, or “association weighting” to the Wikipedia based concepts.
“word” -> , , - - -
A trivial example might be:
Comparing two word vectors (using cosine similarity) we can get a numerical value for the semantic relatedness of words i.e. we can quantify how similar the words are to each other based on their association weighting to the various concepts.
Note: In Text Analysis a vector is simply a numerical representation of a word or document. It is easier for algorithms to work with numbers than with characters. Additionally, vectors can be plotted graphically and the “distance” between them is a visual representation of how closely related in terms of meaning words and documents are to each other.