WordLift 3.0: A brief semantic story – part 2

Classifications help us find the material we are looking for.

Here is part 1 of this article.

By now, the web holds so much content (more than 2.5 million new articles are published every day) that it has become impossible to apply homogeneous classification schemes to organize knowledge and make it available, unless only a specific domain is considered.

Classification schemes are structures that use results and relations as information to be added to content. Four types can be identified: hierarchical, tree, faceted, and those based on reference models (or paradigms).

Structured information storage is ultimately aimed at improving human knowledge.

Our goal with WordLift is to develop an application that structures content so that it represents several classification methods to machines at the same time, enabling them to organize the content published on digital networks and make it usable in different ways.

Because of the impasse reached by semantic technologies, described in part 1, in the first phase of our analysis we excluded the digital world as the mandatory recipient of our solution.

Therefore, during the first phase we looked at the classification systems that mankind used to organize its knowledge before the computing era. We then considered the evolution of faceted interfaces, the technologies that connect different web environments to one another, and what has become established practice on the web for these topics (interlinking with dbpedia, freebase and geonames, and the methodologies that search engines require to classify and publish content).

It’s not easy to identify the answers, especially because the essential technological component is continually evolving. Chapter 2 of the book “Organizzare la Conoscenza” (in Italian), already mentioned in the previous post, introduces the essential categories: those that have various facets in common and are valid for all disciplines.

They were defined by the Indian mathematician Shiyali Ramamrita Ranganathan, who around 1930 was the first to talk about this kind of analysis, which consists of breaking a topic down into components and then building it up again based on a code. He chose five essential categories: space and time, on which everyone agrees; energy, referring to activities or dynamism and indicating the ‘action’ in semantics; matter, for example a material and its properties; and personality, indicating the main subject of the context, even when it is not a human being.

These categories are considered abstract, but nevertheless we used them to design the back-end interface for the editors, and mapped them to the corresponding types in the schema.org vocabulary.
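The faceted analysis described above, breaking a topic into components and rebuilding it from a code, can be sketched roughly as follows. This is only an illustration: the class name `FacetedTopic`, the single-letter labels and the `/` separator are assumptions made for this example, not Ranganathan's actual notation and not WordLift's internal data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FacetedTopic:
    """A topic broken down into the five essential categories (hypothetical model)."""
    personality: Optional[str] = None  # main subject of the context
    matter: Optional[str] = None       # material or property
    energy: Optional[str] = None       # action or activity
    space: Optional[str] = None        # where
    time: Optional[str] = None         # when

    def code(self) -> str:
        """Rebuild the topic as one string, keeping only the facets that are set."""
        labelled = (
            ("P", self.personality),
            ("M", self.matter),
            ("E", self.energy),
            ("S", self.space),
            ("T", self.time),
        )
        # The "X=value" form joined by "/" is an invented notation for this sketch.
        return "/".join(f"{label}={value}" for label, value in labelled if value)


if __name__ == "__main__":
    topic = FacetedTopic(personality="wheat", energy="harvesting",
                         space="Italy", time="1950s")
    print(topic.code())
```

Because every facet is optional, the same structure can describe a topic that has, say, only a subject and a place, which is what makes faceted schemes flexible across disciplines.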

WordLift is indeed an editor built on top of the universally recognised vocabulary of concepts published by http://schema.org/, which so far consists of more than 1,200 items divided into nine essential categories: Action, CreativeWork, Event, Intangible, MedicalEntity, Organization, Person, Place, Product.

As of November 2015, the schema.org vocabulary is used on over 217 million pages (URLs), containing a total of more than six billion triples.

WordLift 3.0 is a semantic editor that analyses content and automatically suggests metadata according to the schema.org vocabulary categories. We have somewhat simplified these for users, dividing them in this first experimental phase into four essential categories: Who (Person, Organization), Where (Place), When (Event), What (CreativeWork, Product, Intangible). Users can also add any number of results to those suggested by the application, thus creating a personal vocabulary within the application.
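The grouping of schema.org types into the four editor categories can be written down as a simple lookup table. The sketch below takes the grouping directly from the paragraph above. The names `CATEGORY_TYPES` and `category_for` are hypothetical and do not come from the WordLift code base.

```python
from typing import Optional

# The four simplified categories and the schema.org types each one covers,
# as listed in the text. Action and MedicalEntity are not mapped in this phase.
CATEGORY_TYPES = {
    "Who": ("Person", "Organization"),
    "Where": ("Place",),
    "When": ("Event",),
    "What": ("CreativeWork", "Product", "Intangible"),
}

# Inverse index: schema.org type name -> editor category.
TYPE_TO_CATEGORY = {
    schema_type: category
    for category, schema_types in CATEGORY_TYPES.items()
    for schema_type in schema_types
}


def category_for(schema_type: str) -> Optional[str]:
    """Return the editor category for a schema.org type, or None if unmapped."""
    return TYPE_TO_CATEGORY.get(schema_type)


if __name__ == "__main__":
    for name in ("Person", "Place", "Event", "Product", "Action"):
        print(name, "->", category_for(name))
```

Returning `None` for unmapped types is a design choice of this sketch. It keeps the two schema.org categories that the simplified scheme leaves out visible instead of forcing them into one of the four groups.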

The next release, which will complete the experimental phase in January 2016, will make it possible to assign different levels of importance to the results, creating a hierarchical and tree classification (by using the mainEntity property that schema.org has created to mark up articles).
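As a rough illustration of how levels of importance could translate into markup, the sketch below builds a JSON-LD description of an article in which the highest-ranked item becomes `mainEntity` and the others are listed under `mentions`. Both are real schema.org properties, but the ranking rule, the numeric `importance` field and the function name `build_article_markup` are assumptions for this example, not a description of what WordLift actually emits.

```python
import json


def build_article_markup(headline, entities):
    """Build a JSON-LD dict for an article.

    `entities` is a list of dicts with "name", "type" (a schema.org type name)
    and a numeric "importance" (a hypothetical score used only for ranking).
    """
    # Highest importance first; sorted() is stable, so ties keep input order.
    ranked = sorted(entities, key=lambda e: e["importance"], reverse=True)

    def as_node(entity):
        return {"@type": entity["type"], "name": entity["name"]}

    markup = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
    }
    if ranked:
        # The top-ranked item is the primary subject of the article.
        markup["mainEntity"] = as_node(ranked[0])
    if len(ranked) > 1:
        # Everything else is recorded as a secondary mention.
        markup["mentions"] = [as_node(e) for e in ranked[1:]]
    return markup


if __name__ == "__main__":
    example = build_article_markup(
        "A brief semantic story",
        [
            {"name": "Rome", "type": "Place", "importance": 0.4},
            {"name": "WordLift", "type": "Organization", "importance": 0.9},
        ],
    )
    print(json.dumps(example, indent=2))
```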

For the future we are considering the Dewey Decimal Classification, the hierarchical classification used in libraries across the world.

This is the general process that has led us to design a solution in which semantic technologies work jointly with relational technologies to automatically associate a set of metadata, or a semantic graph, with a specific piece of content.

Identifying the right technological developments and services for users was not simple. On the other hand, the maturing and consolidation of the Linked Open Data cloud and of dbpedia (as well as freebase and geonames) was essential to enable the WordLift 3.0 editor to generate reusable datasets.

[the first blog post of this brief Semantic Story is here]