Artificial Intelligence and Machine Text Analysis


What does Google RankBrain use to analyze the content of pages? What follows is more of a reflection on the criteria Google likely uses, and on the models we know well, for example TF-IDF (term frequency / inverse document frequency).
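For reference, here is a minimal sketch of the textbook TF-IDF weighting (one common variant among several; what Google actually computes is, of course, unknown). The corpus and tokenization are deliberately toy-sized:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a small corpus (list of token lists)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "seo is about relevance".split(),
]
# "cat" scores lower than "mat": it appears in two documents, not one.
print(tf_idf(docs)[0])
```

The intuition: a word weighs more in a page if it is frequent there but rare elsewhere; a word that appears everywhere tells you nothing about the page's topic.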

The problem is this: is smearing a page with a broad lexical spectrum enough to make it relevant for a given topic? Probably not, even though some black hats have made a specialty of building sites out of word mush.

We will therefore try to reflect a little on a few analysis models that combine syntax and semantics.

Semantic analysis or syntactic analysis?

First of all, let's make a little distinction between syntax and semantics; the two terms indeed each have a very precise definition.

Semantics takes care of content, and syntax takes care of form. In other words, semantics is concerned with the meaning of words, and syntax with their combination within sentences.

So what about semantic and syntactic analysis? Here again, the line is easy to draw: semantic analysis deals exclusively with the meaning formed by a word or a combination of words, while syntactic analysis focuses on the position and relations of a word with respect to another word or group of words. Content and form, as we said!

You see, the web is not only semantic; it is also married to syntax. It must be said that the two friends share an important point in common: the characterization of a statement as a whole. Basically, and without speaking strictly of the web, each lexical unit has a semantic potential (modulated by the surrounding lexical and syntactic elements) that contributes to the overall meaning of the sentence. Nothing complicated there; to use a metaphor, several pearls strung end to end form a necklace… but a pearl alone does not make a necklace (although it remains very pretty).

Does Google take into account in its calculations the syntactic relations between words and expressions? You guessed it: that's the whole point.

[Diagram: LSA - Latent Semantic Analysis]

Latent semantic analysis and Kintsch

If you have taken any interest in this subject, you must certainly know about LSA, "Latent Semantic Analysis". This is called vector semantics: each word is associated with a vector in a multidimensional space. If that doesn't speak to you, the little diagram above should be fairly self-explanatory.

As a result, calculating the proximity of meaning between different words becomes very easy (and above all… very automatic): it suffices to compute the cosine of the angle between their vectors. However, LSA suffers from a major problem, namely that the different meanings of a word are not taken into account at all. Difficult to build an intelligent semantic web under these conditions…
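To make this concrete, here is a minimal sketch of the LSA pipeline on a hand-made toy term-document matrix (purely illustrative counts): a truncated SVD projects each term into a low-dimensional "latent semantic" space, and cosine measures proximity of meaning.

```python
import numpy as np

terms = ["cat", "dog", "pet", "seo", "ranking"]
# Toy term-document counts: rows = terms, columns = 4 documents.
# "cat" and "dog" never appear in the same document, but both
# co-occur with "pet".
X = np.array([
    [2, 0, 0, 0],  # cat
    [0, 2, 0, 0],  # dog
    [1, 1, 1, 0],  # pet
    [0, 0, 0, 2],  # seo
    [0, 0, 0, 1],  # ranking
], dtype=float)

# LSA: a truncated SVD projects each term onto the top-k latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]   # one k-dimensional vector per term

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(term_vectors[0], term_vectors[1]))  # cat vs dog: ~1.0
print(cosine(term_vectors[0], term_vectors[3]))  # cat vs seo: ~0.0
```

"cat" and "dog" end up nearly identical even though they never co-occur, because they share the neighbor "pet": that is the "latent" part. But notice that a polysemous word still gets a single vector, which is exactly the weakness described above.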

Let's make it a little more complicated. In 2001, Kintsch (a gentleman from the Institute of Cognitive Science at the University of Colorado) decided to enrich the formula with vectors no longer reserved for single words, but for Noun + Verb pairs (and here we see the notion of syntax appear). In other words, the precise meaning of a verb depends on the noun to which it is attached. To polish the whole thing, Kintsch uses "markers" (chosen by a human, not by a machine) which, attached to a word or an expression, make it possible to interpret its meaning in relation to other markers.
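This is not Kintsch's actual predication algorithm (which spreads activation among the LSA neighbors of the predicate), but a minimal sketch of the underlying intuition, with hand-made vectors: among several candidate senses of a verb, keep the one closest to its noun argument.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-d vectors, hand-made for illustration (not a trained space).
senses_of_run = {
    "move_fast": np.array([0.9, 0.1, 0.0]),  # "the horse runs"
    "operate":   np.array([0.1, 0.9, 0.2]),  # "the program runs"
}
nouns = {
    "horse":   np.array([0.8, 0.0, 0.3]),
    "program": np.array([0.0, 1.0, 0.1]),
}

def contextual_sense(verb_senses, noun_vec):
    """Pick the verb sense whose vector is closest to the noun argument."""
    return max(verb_senses, key=lambda s: cosine(verb_senses[s], noun_vec))

print(contextual_sense(senses_of_run, nouns["horse"]))    # move_fast
print(contextual_sense(senses_of_run, nouns["program"]))  # operate
```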

The good thing is that Kintsch's algorithm addresses four major semantic problems: metaphors, causal inference, similarity judgments, and disambiguation. What is less good is that part of the work is done by hand… and that it requires considerable resources.

Pustejovsky's generative lexicon

Pustejovsky is an American computer science professor whose hobbyhorse is natural language processing. He is the originator of the Generative Lexicon, whose objective is to address, among other things, the problems of interpreting words in context (ambiguity of meaning, polysemy).

Pustejovsky started from the principle that lexicons, usually enumerative, cannot account for the meaning of words in a given context. To overcome this problem, our friend suggests specifying, for each lexical unit, four different structures (which can themselves have several meanings…):

The argument structure, which specifies the number and type of arguments.
The event structure, which attempts to describe verbs in terms of processes, states or transitions.
The qualia structure, which specifies the attributes of the word: its origin, its function, what distinguishes it within a wider domain…
The lexical inheritance structure. The latter is interesting because it takes into account the position of the word within a lexical network! So we are still talking about syntax. (A toy rendering of such an entry follows below.)
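As an illustration, here is a toy Python rendering of what such an entry might look like for the word "novel" (a classic example from the generative-lexicon literature); the field names are my own, mapped onto the four structures above:

```python
from dataclasses import dataclass

@dataclass
class LexicalEntry:
    """A simplified generative-lexicon entry (illustrative schema only)."""
    word: str
    argument_structure: list   # number and type of arguments
    event_structure: str       # process, state or transition
    qualia: dict               # formal, constitutive, telic, agentive roles
    inherits_from: list        # position in the lexical network

novel = LexicalEntry(
    word="novel",
    argument_structure=["x: physical_object . information"],
    event_structure="state",
    qualia={
        "formal": "book(x)",             # what it is
        "constitutive": "narrative(x)",  # what it is made of
        "telic": "read(y, x)",           # its purpose
        "agentive": "write(z, x)",       # how it comes into being
    },
    inherits_from=["book", "artifact", "information"],
)
```

The telic and agentive roles are what let the lexicon "generate" readings in context: "begin a novel" can mean begin reading it or begin writing it, depending on which role is activated.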
Unfortunately, the model proposed by Pustejovsky runs into two big difficulties: not only is there no methodology for constructing the various lexical entries above, but building a large-scale generative lexicon is as time-consuming as it is expensive.

Remove the punctuation, and the meaning changes!

Good old punctuation. Very often omitted, mocked, humiliated (De Gaulle, get out of here), it is nevertheless a primordial element that affects both syntax and semantics. And for good reason: add or move a comma or a period and the meaning changes completely. As Wikipedia puts it, punctuation has three functions:

Prosodic indications (relation to accents and intonations of oral language).
Syntactic relations (how are the elements of speech linked and subordinated to each other?).
Semantic information (what logical sense links these elements?).
Come on, let's take a simple example: "Spammer says Matt Cutts is a disbeliever". There are two possible readings for this sentence (so here we are playing with semantics):

"The spammer," says Matt Cutts, "is a disbeliever."
“The spammer says, 'Matt Cutts is a disbeliever.'
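To see why this matters for the models above: strip the punctuation and the two readings become literally indistinguishable to any bag-of-words approach (TF-IDF, LSA…). A quick sketch:

```python
# The same eight words, punctuated two ways, attribute the insult to two
# different people. A token-based view cannot tell the readings apart.
v1 = '"The spammer," says Matt Cutts, "is a disbeliever."'
v2 = 'The spammer says, "Matt Cutts is a disbeliever."'

def strip_punctuation(text):
    """Keep only letters, digits and spaces, then tokenize."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace()).split()

print(strip_punctuation(v1) == strip_punctuation(v2))  # True: identical tokens
```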

Here we reach the limits of the machine. No computer (not even Google's) is yet smart enough to grasp the semantic subtleties of this kind of sentence, and we will probably have to wait for an embryo of artificial intelligence for that to happen.

But let's trust Google on this point: there is no doubt that they will be able to surprise us in the years to come… And perhaps not pleasantly.

Text written by Axel of the site "balisage sémantique" in 2014
