

The short answer is "no, it is not possible to do that in a principled way that works even remotely well". It is an unsolved problem in natural language processing research and also happens to be the subject of my doctoral work. I'll very briefly summarize where we are and point you to a few publications:

The most important assumption here is that it is possible to obtain a vector that represents each word in the sentence in question. This vector is usually chosen to capture the contexts the word can appear in. For example, if we only consider the three contexts "eat", "red" and "fluffy", the word "cat" might be represented by a vector of counts over those contexts, because if you were to read a very, very long piece of text (a few billion words is not uncommon by today's standards), the word "cat" would appear very often in the context of "fluffy" and "eat", but not that often in the context of "red". In the same way, "dog" and "umbrella" would each get their own count vectors. Imagining these vectors as points in 3D space, "cat" is clearly closer to "dog" than it is to "umbrella", and therefore "cat" also means something more similar to "dog" than to an "umbrella". This line of work has been investigated since the early 90s.
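The counts in such a vector come from a real corpus, but a minimal sketch of the idea is below. The toy corpus, the fixed set of three context words and the co-occurrence window size are all assumptions made for illustration, not part of any particular published method.

    from collections import Counter

    # Assumed toy setup: three context words and a tiny corpus.
    # A real system would use billions of words and many thousands of contexts.
    CONTEXTS = ["eat", "red", "fluffy"]
    corpus = ("the fluffy cat likes to eat . the dog likes to eat . "
              "the red umbrella is wet").split()

    def context_vector(target, tokens, window=3):
        """Count how often each context word occurs within `window` tokens
        of an occurrence of `target`."""
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
                if neighbour in CONTEXTS:
                    counts[neighbour] += 1
        return [counts[c] for c in CONTEXTS]

    for word in ("cat", "dog", "umbrella"):
        print(word, context_vector(word, corpus))

Even on this toy corpus, "cat" ends up sharing a context with "dog" ("eat") but none with "umbrella", which is exactly the pattern the 3D-space picture above relies on.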

A simple pure-Python implementation of the cosine similarity between two such count vectors would be:

    import math

    def get_cosine(vec1, vec2):
        # dot product over the words the two vectors share
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum(vec1[x] * vec2[x] for x in intersection)
        # Euclidean norms of the two vectors
        sum1 = sum(vec1[x] ** 2 for x in vec1)
        sum2 = sum(vec2[x] ** 2 for x in vec2)
        denominator = math.sqrt(sum1) * math.sqrt(sum2)
        return numerator / denominator if denominator else 0.0

The cosine formula used here is the standard one: the dot product of the two vectors divided by the product of their Euclidean norms.
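To run the function on two sentences, each sentence first has to be turned into a bag-of-words count vector. The text_to_vector helper, the lowercasing and the first example sentence below are assumptions added for illustration (the second sentence is the one used above); a plain collections.Counter over a regex tokenizer is enough for a sketch.

    import re
    from collections import Counter

    WORD = re.compile(r"\w+")

    def text_to_vector(text):
        # bag of words: map each token to the number of times it occurs
        return Counter(WORD.findall(text.lower()))

    text1 = "This is a foo bar sentence."                      # assumed example
    text2 = "This sentence is similar to a foo bar sentence."

    print(get_cosine(text_to_vector(text1), text_to_vector(text2)))
    # roughly 0.86: most words are shared, so the vectors point in similar directions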
This does not include weighting of the words by tf-idf, but in order to use tf-idf, you need to have a reasonably large corpus from which to estimate the tf-idf weights. You can also develop it further by using a more sophisticated way to extract words from a piece of text, stemming or lemmatising it, and so on.
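If a larger document collection is available, one way to add that weighting is sketched below. Using scikit-learn here is an assumption made for the example, not a requirement of the approach above; TfidfVectorizer estimates the weights from the corpus and the cosine is then computed on the weighted vectors.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # `documents` should really be a large corpus; two sentences are used
    # here only to keep the sketch self-contained.
    documents = [
        "This is a foo bar sentence.",
        "This sentence is similar to a foo bar sentence.",
    ]

    vectorizer = TfidfVectorizer()               # tokenisation, counting and tf-idf weighting
    tfidf = vectorizer.fit_transform(documents)  # one weighted vector per document
    print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

TfidfVectorizer also accepts a custom tokenizer and preprocessor, which is one natural place to plug in stemming or lemmatisation if you want to develop the pipeline further.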