WordLift 3.0: A brief semantic story – part 2

Classifications help us find the material we are looking for.

Here is part 1 of this article.

By now, the web holds so much content (more than 2.5 million new articles are published each and every day) that it has become impossible to apply homogeneous classification schemes to organize knowledge and make it available, unless only a specific domain is considered.

Classification schemes are structures that use results and relations as information to be added to content. Four types can be identified: hierarchical, tree, faceted, and based on reference models (or paradigms).

Structured information storage is ultimately aimed at improving human knowledge.

Our goal with WordLift is to develop an application that structures content so that it represents several classification methods simultaneously for machines, enabling them to organize the content published on digital networks and make it usable in different ways.

Because of the impasse reached by semantic technologies, introduced in part 1, in the first phase of our analysis we did not treat the digital world as the mandatory recipient of our solution.

Therefore, during the first phase we looked at the classification systems that mankind used to organize its knowledge before the computing era. We then considered the evolution of faceted interfaces, the technologies that link the different web environments to each other, and what has become established on the web regarding these topics (interlinking with DBpedia, Freebase and GeoNames, and the methodologies that search engines require to classify and publish content).

It’s not easy to identify the answers, especially because the essential technological component is continually evolving. In the book “Organizzare la Conoscenza…” (in Italian), already mentioned in the previous post, Chapter 2 introduces the essential categories: those that have various facets in common and are valid for all disciplines.

They were introduced by the Indian mathematician Shiyali Ramamrita Ranganathan, who around 1930 was the first to talk about this analysis, which consists in breaking a topic down into components and then building it up again based on a code. He chose five essential categories: space and time, on which everyone agrees; energy, referring to activities or dynamism and indicating the ‘action’ in semantics; matter, for example a material and its properties; and personality, to indicate the main subject of the context, even when it’s not a human being.

These categories are considered abstract, but nevertheless we used them to design the back-end interface for the editors, and mapped them to the corresponding types in the schema.org vocabulary.

WordLift is indeed an editor built on top of the universally recognised vocabulary of concepts published by http://schema.org/, which so far consists of more than 1,200 items divided into nine essential categories: Action, CreativeWork, Event, Intangible, MedicalEntity, Organization, Person, Place, Product.

As of November 2015, the schema.org vocabulary is used on over 217 million pages (URLs), containing a total of more than six billion triples.

WordLift 3.0 is a semantic editor that analyses content and automatically suggests metadata according to schema.org vocabulary categories. We have somewhat simplified these for users, dividing them, in this first experimental phase, into four essential categories: Who (Person, Organization), Where (Place), When (Event), What (CreativeWork, Product, Intangible). However, users can add any number of results to those suggested by the application, thus creating a personal vocabulary within the application.
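The grouping described above can be sketched as a simple lookup table. This is only an illustration of the four categories listed in the text, not WordLift's actual code; the names `FOUR_W`, `categorize` and `group_entities` are hypothetical, and the sample entities in the usage note are made up.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical table: the four editor categories and the schema.org
# types grouped under each, as listed in the text.
FOUR_W: Dict[str, Tuple[str, ...]] = {
    "Who": ("Person", "Organization"),
    "Where": ("Place",),
    "When": ("Event",),
    "What": ("CreativeWork", "Product", "Intangible"),
}

# Reverse index: schema.org type -> editor category.
TYPE_TO_CATEGORY: Dict[str, str] = {
    schema_type: category
    for category, schema_types in FOUR_W.items()
    for schema_type in schema_types
}


def categorize(schema_type: str) -> Optional[str]:
    """Return the editor category for a schema.org type, or None if unmapped."""
    return TYPE_TO_CATEGORY.get(schema_type)


def group_entities(entities: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (label, schema.org type) pairs under Who/Where/When/What."""
    grouped: Dict[str, List[str]] = {category: [] for category in FOUR_W}
    for label, schema_type in entities:
        category = categorize(schema_type)
        if category is not None:  # types outside the four categories are skipped
            grouped[category].append(label)
    return grouped
```

For instance, `group_entities([("Tim Berners-Lee", "Person"), ("W3C", "Organization")])` would place both labels under "Who", while a type such as Action, which the text does not assign to any of the four categories, is left out.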

The next release, which will complete the experimental phase in January 2016, will allow users to assign different levels of importance to the results, creating a hierarchical and tree classification (by using the mainEntity property that schema.org has created to mark up articles).
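As a rough sketch of how different levels of importance might be expressed, the snippet below builds JSON-LD for an article with one primary entity. It is not WordLift's actual output: the function name `article_markup` is hypothetical, the sample values are invented, and using the schema.org `mentions` property for the secondary entities is an assumption of this example.

```python
import json


def article_markup(headline, main_entity, other_entities):
    """Build a JSON-LD dictionary for an article with one primary entity.

    `mainEntity` marks the most important entity; `mentions` (an assumption
    of this sketch) lists the less important ones.
    """
    return {
        "@context": "http://schema.org",
        "@type": "Article",
        "headline": headline,
        "mainEntity": main_entity,
        "mentions": other_entities,
    }


# Illustrative values only.
markup = article_markup(
    "WordLift 3.0: A brief semantic story",
    {"@type": "Person", "name": "Shiyali Ramamrita Ranganathan"},
    [{"@type": "Organization", "name": "W3C"}],
)
print(json.dumps(markup, indent=2))
```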

For the future we are considering the Dewey Decimal Classification, a hierarchical classification used in libraries across the world.

This is the general process that has led us to design a solution in which semantic technologies work jointly with relational technologies to automatically associate a set of metadata, or a semantic graph, with a specific piece of content.

Identifying the technological developments and services for users was not simple; on the other hand, the maturation and establishment of the Linked Open Data cloud and of DBpedia (Freebase, GeoNames) was essential to enable the WordLift 3.0 editor to generate reusable datasets.

The first blog post of this brief Semantic Story is here.


WordLift 3.0: A brief semantic story – part 1

In the world of digital networks, the term knowledge is generically used to identify and justify all activities aimed at improving data collection and organization of all sorts.

Knowledge can be improved when information is made available for a variety of readings and reports aimed at interpreting reality and imagining trends, evolution and a possible future, in order to somehow control or dominate it.

Every project program includes a necessary preparatory activity, called identification of the reference scenario. In short, it consists in discovering and assimilating the background contexts, those that set the scene in which the subject of the study, like an actor, takes its place, in order to explain what happens in the foreground.

In computing, knowledge is part of artificial intelligence. In this field the aim is (was) to achieve automation through strategies based on trial and error. This way of sketching a scenario is called Knowledge Representation. This symbolic representation was limited by the difficulty of relating different scenarios to each other. The usual Tim Berners-Lee, still a WWW leader, is the one responsible for its evolution. In 1996, through the W3C, he launched the XML standard, which made it possible to add semantic information to content so that it could be related. It was the beginning of the Semantic Web, which made it possible to publish, alongside documents, information and data in a format that machines can process automatically.

“Most of the information content in today’s web is designed to be read only by human beings …” (Tim Berners-Lee again) “computers cannot process the language in web pages”.

Semantic web means a web whose content is structured so that software can read it, answer questions and interact with users.

Introduction freely adapted from .. and for whoever wants to know the whole story.

Having introduced the value of any effort to automatically set or suggest the metadata to be attached to content in order to make it readable by machines, one still has to understand and define the following: what are the components of this structure, or metadata? How can the significant elements be extracted uniformly, regardless of language? Which types of ontological categorisation and which relations must be activated in a piece of content for it to become part of a semantic web for all? And above all: how can all this be done simultaneously?

And this is where the whole research and development area that revolves around semantic technologies got stuck. We believe that this impasse was also caused by the lack of agreement among the various scientific paths necessary to achieve any kind of standardization, and by language and lexical differences, which the web itself and the technologies that are distributed push towards a kind of ‘local’ multi-language system.

Considering the topic and the context of this post, we should leap from 1986, when the first markup languages were born, to 1998, when the XML standard was defined, and finally to today, November 2015. We have performed this leap, at least partially, by means of a query on Wikidata (described below).

The path we have followed (considering that our group does not have scientific expertise across all the fields of knowledge involved) involves:

  • accepting that semantic technologies, as they had been conceived and applied, could not fully meet our need to make machines understand and order content;
  • redefining the context after the cultural and economic establishment of the open data world and the data structure of Linked Open Data.

Therefore, we remembered what was stated by the Austrian logician, mathematician and philosopher Gödel (also loved by the computing world): a world cannot be understood from inside the system itself; in order to understand any of it, we have to go out and observe it from the outside. We initially deconstructed the problem by enclosing in sets everything that would necessarily be part of the final solution, and then we turned to the world that preceded the current one: the analog world, and how it had tackled and answered the problems arising from the organization and classification of large amounts of “knowledge”.

A study/guide was very useful to us (and we therefore thank its authors): Organizzare la conoscenza: dalle biblioteche all’architettura dell’informazione per il web (Claudio Gnoli, Vittorio Marino and Luca Rosati).

The query on Wikidata to reconstruct the story of markup languages

Below is the query, which you can run with a click (the results are incomplete because we only included languages whose creation date has a value in Wikidata; this value is expressed by Property:P571).

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?entity ?ml ?sl WHERE {
 ?entity wdt:P31 wd:Q37045 . # ?entity is a markup language
 ?entity wdt:P571 ?sl . # ?sl is the inception date of ?entity
 ?entity rdfs:label ?ml . # ?entity name is ?ml
 FILTER(LANG(?ml) = "it") # ?ml is in Italian
}
ORDER BY ?sl
LIMIT 100
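
The same query could also be run from a script rather than with a click. The following is a minimal sketch, assuming the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and its JSON result format; the function names and the User-Agent string are hypothetical, and the PREFIX declarations are left out on the assumption that the service predefines `wd:`, `wdt:` and `rdfs:`.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the public Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT DISTINCT ?entity ?ml ?sl WHERE {
  ?entity wdt:P31 wd:Q37045 .   # ?entity is a markup language
  ?entity wdt:P571 ?sl .        # ?sl is the inception date of ?entity
  ?entity rdfs:label ?ml .      # ?entity name is ?ml
  FILTER(LANG(?ml) = "it")      # ?ml is in Italian
}
ORDER BY ?sl
LIMIT 100
"""


def build_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request URL asking for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return endpoint + "?" + params


def run_query(query):
    """Send the query (requires network access) and return (name, date) pairs."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "markup-history-example/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [
        (row["ml"]["value"], row["sl"]["value"])
        for row in data["results"]["bindings"]
    ]
```

Calling `run_query(QUERY)` would then return the Italian names of the markup languages together with their inception dates, ordered by date.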

… continues, and here is part 2 of this article.