How To Grow Your Travel Business With The Help Of Open Data

In this blog post we will show that open data, once translated into insights, can effectively shape your customers' decisions and energize your business, especially in the travel industry, which is ever-changing and deeply tied to customer engagement.

“There’s no substitute for just going there,” wrote Yvon Chouinard, an American climber, environmentalist and businessman who dedicated his life to nature and traveling. A lot has changed since Yvon traveled around the U.S.A. to climb in Yosemite; technology and the internet have radically transformed the travel industry, to the point that now holidays find people, not the other way round.

In fact, with the advent of the internet and smartphones, travelers have changed their habits and needs: they find all the information they need before, during and after the trip they have been thinking about. This is called Travel 2.0, a travel-based extension of Web 2.0, and it refers to the growing presence and influence of the user in the traveling process. Along with the industry, marketing in the tourism sector is changing as well and must revolve around customers, so companies and consultants need to know their target perfectly and who they are talking to.

The core characteristics of tourism activity still remain the same, though:

  • tourism is perishable, which is typical of services, as they are consumed while they are produced: a room that stays vacant today cannot be sold tomorrow, and this is where overbooking comes from;
  • tourism is also inconsistent, as the conditions that apply today may not apply tomorrow: if it rains, the facility may be perceived by customers differently than if it had been sunny, and this type of offer can’t be standardized;
  • intangibility: how many times have you shown pictures of a hotel room to your friends saying “well, I guess it looked better when I was there”? This is because you can’t take home the hotel room in Paris or the view from a chalet in the Alps; tourism is just memories and images depicted in the customer’s mind.

The travel industry is mainly people-oriented, so what better way to develop, for example, an effective campaign than building it directly from your customers’ data?

Travelers carry and leave behind a large amount of data: inquiries, questions on forums, bookings, searches for accommodation, transport and itineraries, feedback and so on, at every stage from the first idea of getting away to the feedback left for the airline, the hotel and so forth. As a matter of fact, the process customers go through when planning a trip is made of a dizzying number of moments rich in intent, even after the trip has been booked.

This kind of data is mostly raw and unstructured and comes from different sources, but when brought together and analyzed it portrays travelers’ behaviors and shapes personas from which you can start building a marketing campaign focused on your target and its needs. Once extracted and processed, open data have become a tool that can help businesses grow and acquire new customers; they can also help predict the future, hinting at where the business is heading.

In the past few years open data have become a major tool for businesses: they can be leveraged and shaped into solutions delivered to the final user. Along with the growing importance given to open data, there has also been a proliferation of free tools, open-source software and case studies that make working with data easier.

SalzburgerLand, the Salzburg State Board of Tourism, provided us with open data coming from one of their booking channels. It contained information about the flow of visitors in the Salzburg State: for each country of origin, the table reported the daily number of Inquiries, Bookings, Arrivals and Nights.

We analyzed the data and selected the nations with the largest number of entries, which presumably were the ones with the largest tourist flow towards Austria: in this case Germany, Austria and the Netherlands.
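As a rough sketch of this first step, such a ranking can be computed with a few lines of pandas; the file name and column names below are illustrative, not the actual format of the export we received.

import pandas as pd

# Load the daily booking-channel export (hypothetical file and column names).
# Expected columns: date, country, inquiries, bookings, arrivals, nights
df = pd.read_csv("salzburgerland_booking_channel.csv", parse_dates=["date"])

# Rank the countries of origin by total volume to find the largest tourist flows.
totals = (df.groupby("country")[["inquiries", "bookings", "arrivals", "nights"]]
            .sum()
            .sort_values("nights", ascending=False))
print(totals.head(3))  # e.g. Germany, Austria, the Netherlands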

[Image: tourist flow by country of origin]

We were also able to report in which months visitors from each of these countries travel to Austria; as expected, the tourist flow peaks in February and March and in the summer months (June, July and August).

[Image: who comes and when – monthly tourist flow by country]

Once we came up with these macro-groups, we examined their correlations with patterns of searches on Google and noticed that people looked for Salzburg-themed queries in every moment the table was divided into. Moreover, analyzing the queries we found out, for example, that Germans at the time of booking looked for hotels in specific small towns in the Salzburg area, or that Dutch visitors, while staying in the Salzburg region, looked for information about the Dolomites. We assumed they were traveling with a camper van, and further research backed our assumption, along with related queries (Camping Italy) that confirmed the intent of continuing the trip driving around Europe. At this point we had enough information to define the various targets and identify their needs. We created six personas that resembled real people with real needs, around which we built the set of queries extracted from the data.

Google published a research study on Travel and Hospitality which divided the customer journey into four micro-moments: I-want-to-get-away moments, time-to-make-a-plan moments, let’s-book-it moments, and can’t-wait-to-explore moments. In each of these moments, customers or potential customers can be taken by the hand and influenced by effective content and campaigns.

Those moments start when people dream of a getaway and continue until they are on holiday, enjoying the vacation they have carefully planned.

We found out that the data we had extracted for our client were highly correlated with relevant patterns of searches on Google: for example, a general query like “ski holiday deals” is more likely to be searched during the I-want-to-get-away moment, when people are browsing and dreaming of their future holiday. The correlation between a query and the searches on Google in a certain nation is an opportunity to influence people’s preferences and purchases and to reach customers in all the micro-moments a trip is made of.

[Image: the Stefan persona]

This is Stefan: he is from Germany and has already been to the Salzburg region, so he’s looking for hotels in a specific town and he also wants to know where he can go skiing in Flachau (Skigebiet meaning ski area). We also charted each query’s trend.

This way we can build a content campaign centered on the client’s customers and aimed at winning over and engaging users. We determined what travelers want, and when, in order to be there with the right offer at the right time.

The creation of the personas also worked as a segmentation based on nationality, travel reason and loyalty, making it easy to understand the customers’ values and what to create for them. The aim of marketing and content marketing, in the age of smartphones, is to connect with the person you want to reach, deliver the answers they are looking for, be relevant to the user in the moment of need and be on time.
When examining the data we discovered that Inquiries and Bookings are strongly correlated (Pearson correlation = 0.70), as we expected, which suggests that Inquiries drive Bookings. This allowed us to predict the future trend of bookings, as the graphic below shows, and this information can be used to plan and deliver campaigns to users on time.
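For readers who want to reproduce this kind of check, here is a minimal sketch (not our actual analysis code) of how the correlation and a naive inquiry-based forecast could be computed; the lead time and column names are assumptions.

import numpy as np
import pandas as pd

# Hypothetical daily series aggregated from the booking-channel export.
data = pd.read_csv("salzburgerland_booking_channel.csv", parse_dates=["date"])
daily = data.groupby("date")[["inquiries", "bookings"]].sum()

# Pearson correlation between inquiries and bookings (the analysis above reports ~0.70).
print("Pearson correlation:", daily["inquiries"].corr(daily["bookings"]))

# Naive forecast: fit bookings against inquiries observed a few weeks earlier,
# so that today's inquiries hint at the bookings to come.
lead = 21  # assumed lead time in days, purely illustrative
x = daily["inquiries"].shift(lead).dropna()
y = daily["bookings"].loc[x.index]
slope, intercept = np.polyfit(x, y, 1)

# Project bookings for the next `lead` days from the most recent inquiries.
forecast = slope * daily["inquiries"].tail(lead) + intercept
print(forecast.round())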

[Image: predicted booking trend]

A data-backed marketing campaign can fuel your business growth by centering it around your customers and adapting it to the modern standards of connectivity and sharing; we can transform plain data and grey spreadsheets into beautiful, compelling stories that will draw your target’s attention.

In the end, everything will be at hand, even Yosemite, at least online!

Find out more about how this analysis helped SLT, after the first 6 months, outperform the competition by bringing 92.65% more users via organic search than all other travel websites in Austria (from the WordLift website)!

 

 

 

WordLift and the Tourism Industry. Let the users talk.

Huge investments made by Google, Facebook, BBC, and other big players on graph databases, linked data, and semantic search are shaping the online world, but mid/small size content owners, bloggers and news publishers struggle to adjust their existing information systems to take full advantage of these technologies.

With WordLift we want to democratize this field, offering an easy-to-use plug-in and the power of Artificial Intelligence to every blogger on WordPress.

WordLift Information Architecture Increases Unique Pageviews

As seen on both this blog and our tester’s blog, WordLift’s smart permalinks increase monthly unique pageviews.

Since last October we’ve been testing WordLift’s value proposition and results with a set of over 20 beta-testers, who have been crucial in informing the latest product improvements and confirming the value that the plug-in brings to its users. This post is a direct translation of an article posted by one of our testers, sharing his WordLift experience with his community. Hearing from our beta-testers gave us the confidence to open WordLift to the whole WordPress community and to get it ready for its market launch, soon to be announced.

Seeing a small hiking organization take advantage of, and contribute to, the web of data makes us proud of what we have accomplished so far. Helping this association promote its strategy alongside much bigger institutions for the development of the whole area gives us great confidence in our higher mission: organizing internet knowledge from the bottom up, leaving the full value of content production in the hands of those who produce it.

We want to thank the team of CamminandoCon for their contribution to the development of WordLift and for sharing their journey with all of us.

If you’re in the tourism industry as well and want to know more about what WordLift can do for you, contact us anytime! For now, let the users talk!


CommunicandoCon: WordLift for Tourism in the Turano Valley

We’re happy to witness many efforts around us to initiate communication strategies to promote the Turano Valley. A few weeks ago we gladly participated in a workshop organized by the Riserva Naturale Monte Navegna e Monte Cervia, together with Life Go Park, to give participants basic notions of digital communication, placing training and education first among the many activities necessary to develop a touristic offer. With many other associations and commercial venues, we began a collaborative path with the editorial team of ConfineLive.it to open the attractions of the Turano Valley to all the provinces nearby.

Since the beginning of our journey, we’ve been building our online presence, leveraging a new website and our Facebook page, according to an editorial calendar, and creating partnerships with similar sites to share our weekly events with the broadest possible audience.

To offer a wider contribution to the development of new communication strategies which could benefit the Turano valley as a whole, the team of Camminando Con is proud to announce that, since the launch of our website camminandocon.org, we’ve been part of the testing group of WordLift.io, a platform for editing and publishing content online leveraging artificial intelligence technologies.

Without requiring any particular technical skills, this technology allowed us:

  • to create interactive pages, rich in images and links to contextual info about topics, places, and people of the valley, and interactive widgets to connect our stories with yours
  • to optimize our content for search engines like Google, where in just a few months, and with no SEO efforts, we reached page 1 for searches of our interest (such as Escursioni Turano, Borgo Antuni) and for the promotion of our weekly events.
  • to implement a smart navigation system, aggregating our content dynamically on pages dedicated to events and to the territory and disseminating it on third-party communication platforms such as Telegram and Facebook Messenger
  • above all, to create, curate and publish a system of open data mapping everything related to the Turano valley, available in the near future to anyone who wants to promote their business on the web and to wider projects such as the Open Data portal of the Lazio Region.

WordLift Graph - Turano Valley

At the end of this test phase, we want to thank the WordLift team for letting us use this technology and for the guidance we needed to extend this approach to, and collaborate with, players bigger than us.

In the wake of a project being implemented by the Tourism Agency of Salzburg, we have proposed the adoption of this technology, integrated with a multi-year communication plan, to a group of stakeholders interested in the development of the Turano valley, including the administrators of Castel di Tora, Colle di Tora, Rocca Sinibalda and Paganico, and representatives of the Riserva Naturale dei Monti Navegna e Cervia, the Lega Navale of the Turano Lake, the Pro Loco of Colle di Tora and the association Andar per Lago, Monti e Castelli. It was a very fruitful meeting, with which we started a process aimed at enriching the communication of everything that the Turano Valley has to offer.

In the next few months, we would like to expand this project to organizations promoting naturalistic and eco-friendly activities and tourism, beginning with the FederTrek community, which we’re happy to be part of, with the aim to include, in a structural way, the hiking offer of the Turano valley within a national circuit.

It’s an ambitious project: to develop our core mission as part of a larger network, enriching, while being enriched by, the whole touristic offering of the Turano Valley and of hiking organizations on a national scale.

Networking means being available to the network and increasing its value with every activity, aware that the success of all coincides with, and is indeed instrumental to, our own success. Transforming this network into a platform allows everyone to become an effective part of it and to enjoy the benefits that new technologies and online communication strategies offer today. If you want to be part of this project, help us with the editorial effort it entails, or enrich it with your business, please contact us. We will be happy to meet you, perhaps during one of our excursions. See you soon!

 

Reimagining the Video Player with the help of MICO – Part 1

How more context can help us revamp HelixWare’s Video player to boost user engagement on news sites.

originally posted on http://www.mico-project.eu/reimagining-the-video-player/

Our use case in MICO is focused on news and media organisations willing to offer more context and better navigation to users who visit their news outlets. A well-known (in the media industry at least) report from Cisco that came out this last February predicts that nearly three-fourths (75 percent) of the world’s mobile data traffic will be attributed to video by 2020.

The latest video content meetup – organized in Cairo this March with the Helixware’s team at Injaz. (Images courtesy of Insideout Today)

While working with our fellow bloggers and editorial teams we’ve been studying how these news organizations, particularly those with text at their core, can be helped in crafting their stories with high-quality videos.

Moreover, the question we want to answer is: can videos become a pathway to deeper engagement?

With the help of MICO’s media extractors we can add semantic annotations to videos on demand. These annotations take the form of media fragments that can be used as input for both the embeddable video player of HelixWare and the video thumbnails created by the HelixWare WordPress plugin. Media fragments, in our showcase, are generated by the face detection extractor and are both temporal (there is a face at this time in the video) and spatial (the face within these frames is located at the coordinates given by an xywh fragment).
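To make the idea concrete, here is a minimal sketch (not the actual MICO or HelixWare API) of how a face-detection annotation can be expressed as a W3C Media Fragments URI combining the temporal and spatial dimensions; the video URL is illustrative.

# Build a media fragment URI for a detected face: temporal (t=start,end) plus spatial (xywh=x,y,w,h).
def face_fragment(video_url, t_start, t_end, x, y, w, h):
    return f"{video_url}#t={t_start},{t_end}&xywh={x},{y},{w},{h}"

# Example: a face detected between seconds 12 and 18, inside a 320x240 box at (160, 90).
print(face_fragment("https://example.org/videos/interview.mp4", 12, 18, 160, 90, 320, 240))
# -> https://example.org/videos/interview.mp4#t=12,18&xywh=160,90,320,240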

The new HelixWare video player that we’re developing as part of our effort in MICO aims at creating an immersive experience for end users. The validation of both the video player and the video thumbnail will be done using A/B testing against a set of metrics our publishers are focused on: time per session, number of videos played per session, and number of shares of the videos over social media (content virality).

Now let’s review the design assumptions we’ve worked on so far to reimagine the video player; in the next post we will present the first results.

1. Use faces to connect with users

Thumbnails, when done right, are generally key to ensuring a high level of engagement on web pages. This seems to be particularly true when thumbnails feature human faces, which are considered “powerful channels of non-verbal communication” in social networks. With MICO we can now offer the editor a better tool to engage audiences by integrating a new set of UI elements that use human faces in the video thumbnail. The study documenting this point is “Faces Engage Us: Photos with Faces Attract More Likes and Comments on Instagram”, authored by S. Bakhshi, D. A. Shamma and E. Gilbert in 2014.

2. Increase the saturation by 20%-30% to boost engagement

Another interesting finding backing up our work is that filtered photos are 21% more likely to be viewed and 45% more likely to be commented on by consumers of photographs. Specifically, filters that increase warmth, exposure and contrast boost engagement the most.
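As an illustration only (our thumbnail pipeline may differ), a filter of this kind can be sketched in a few lines with Pillow; the file names are placeholders.

from PIL import Image, ImageEnhance

# Boost saturation by ~25%, with a light touch on contrast and brightness,
# before publishing the thumbnail.
img = Image.open("thumbnail.jpg")
img = ImageEnhance.Color(img).enhance(1.25)       # +25% saturation
img = ImageEnhance.Contrast(img).enhance(1.10)    # slightly more contrast
img = ImageEnhance.Brightness(img).enhance(1.05)  # slightly more exposure
img.save("thumbnail_filtered.jpg")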

3. Repeat text elements

As seen in most of the custom thumbnail tutorials for YouTube available online, adding some elements of the title, or the entire title, using a bold font over a clear background can make the video more compelling and, according to some, significantly increase the click-through rate. One of the goals of the demo will be to provide a simple and appealing UI where text and image cooperate to offer a more engaging user experience, removing any external information that could distract the viewer.

4. Always keep the editor in full control

We firmly believe machines will help journalists and bloggers focus on what matters most: writing stories that people want to read. This means that, whatever workflow we implement, there shall always be a human (the editor) behind the scenes validating the content produced by technologies such as MICO.

This is particularly true when dealing with sensitive materials such as human faces depicted in videos. There might be obvious privacy concerns for which an editor might choose to use a landscape rather than a face for his video thumbnail and we shall make sure this option always remains available.

We will continue documenting this work in the next blog post and as usual we look forward to hearing your thoughts and ideas – please email us anytime.

 

Build Your Knowledge Graph with WordLift. Introducing version 3.4

We love the Web. We’ve been using the Internet in various forms since the ’90s. We believe that the increasing amount of information should be structured beforehand by the same individuals who create the content.

With this idea in mind and willing to empower journalists and bloggers we’ve created WordLift: a semantic editor for WordPress.

With the latest release (version 3.4) we are introducing a Dashboard to provide a quick overview of the website’s Knowledge Graph.

[Image: the WordLift dashboard]

What the heck is a Knowledge Graph?

Knowledge Graphs are all around us. Internet giants like Google, Facebook, LinkedIn and Amazon are all running their own Knowledge Graphs, and, willingly or not, we are all contributing to them.

Knowledge Graphs are networks of all kinds of things relevant to a specific knowledge domain or organization. They contain abstract concepts and relations as well as instances of all sorts of ‘things’, including persons, companies, documents and datasets.

Knowledge Graphs are intelligent data models made of descriptive metadata that can be used to categorise content.

Why should You bring all your content and data in a Knowledge Graph?

The answer is simple. You want to be relevant for your audience. A Knowledge Graph allows machines (including voice-enabled assistants, smartphone apps and search crawlers) to make complex queries over the entirety of your content. Let’s break this down into benefits:

  • Facilitate smarter and more relevant search results and recommendations
  • Support intelligent personal assistants like Apple Siri and Google Now in understanding natural language requests, by providing the vocabulary needed to identify content
  • Get richer insights on how content is performing and how it is being received by the audience. Some call it Semantic Analytics (more on this topic soon)
  • Sell advertising more wisely by providing in-depth context to advertising networks
  • Create new services that drive reader engagement
  • Share (or sell) metadata to the rest of the World

So what makes WordLift special

WordLift allows anyone to build his/her own Knowledge Graph. The plugin adds a layer of structured metadata to the content you write on WordPress. Every article is classified with named entities, and these classifications are used to provide relevant recommendations that boost the articles of your site through widgets like the navigator and the faceted search. There is more.

The deep vocabulary can be used to understand natural language requests like “Who is the founder of company [X]?”. Let’s dig deeper. Here is an example that uses a generalist question-answering tool called Platypus.

[Image: asking Platypus a question]

Platypus leverages the Wikidata Knowledge Graph. Now, if I asked “Who is the founder of Insideout10?”, Wikidata would probably politely answer “I’m sorry but I don’t have this information“.

Now, the interesting part is that, for this specific question, this same blog holds the correct answer.

As named entities are described along with their properties, I can consult the metadata about Insideout10 and eventually have applications like Platypus run a SPARQL query on my graph.
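The query we actually ran is shown in the image below; as a hedged sketch of what such a query could look like, assuming the graph is exposed at a SPARQL endpoint under data.wordlift.it and uses the schema.org founder property, an application could do something like this:

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint URL and graph layout: the post only shows the query result.
sparql = SPARQLWrapper("https://data.wordlift.it/sparql")
sparql.setQuery("""
    PREFIX schema: <http://schema.org/>
    SELECT ?founder WHERE {
      ?company schema:name "Insideout10" .
      ?company schema:founder ?founder .
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["founder"]["value"])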

[Image: the SPARQL query run on the graph]

This query returns two entities:

Who owns the data?

The site owner does. Every website has its own graph published on data.wordlift.it (or any custom domain name you might like), and the creator of the website holds all licensing rights to his/her data. In the upcoming release a dedicated rights statement will be added to all graphs published with WordLift (here you’ll find the details of this issue).

So how big is this graph?

If we combine all existing beta testers we reach a total of 37,714 triples (the triple being the unit of measurement of information stored in a Knowledge Graph). Here is a chart representing this data.

While this is a very tiny fraction of the world’s knowledge (Wikidata holds 975,989,631 triples; here is a simple query to check this information on their systems), it is relevant for the readers of this blog and contributes to the veracity of big data (“Is this data accurate?”).

Happy blogging!

 

Innovation in Digital Journalism – a report from the second year of FP7-MICO

This blog post summarises the work done in the context of the EU project FP7-MICO in the area of digital journalism.

This last December we attended one of the most exciting events in the Middle East and North Africa region on entrepreneurship and hi-tech startups: RiseUp Summit 2015 in Cairo. We engaged with the overwhelming crowd of startuppers and geeks at our HelixWare booth and in a separate Meetup organised at the Greek Campus.

We had the opportunity, during these two hectic days, to share the research work done in MICO for extending the publishing workflows of independent news organizations with cross-media analysis, natural language processing and linked data querying tools.

… 

 

Introducing WordLift new Vocabulary

When we first started WordLift, we envisioned a simple way for people to structure their content using Semantic Fingerprinting and named entities. Over the last few weeks we’ve come to see the Vocabulary as the central place to manage all named entities on a website. Moreover, we’ve started to see named entities playing an important role in making the site more visible to both humans and bots (mainly search engines at this stage).

Here is an overview of the number of weekly organic search visits from Google on this blog (while the numbers are still small, we’ve seen 110% growth).

[Image: weekly organic search visits from Google]

To help editors increase the quality of their entity pages, today, we are launching our new Vocabulary along with version 3.3.

[Image: the new WordLift Vocabulary]

The Vocabulary can be used as a directory for entities. Entities can now be filtered using the “Who“, “Where“, “When” and “What” categories and, most importantly, entities have a rating and a traffic light to quickly see where to start improving.

Until now it was hard to get a clear overview (thumbnails have also been introduced); it was also hard to see what was missing, and where. The principles for creating successful entity pages can be summarised as follows:

  1. Every entity should be linked to one or more related posts. Every entity has a corresponding web page. This web page acts as a content hub (here is what we have to say about AI on this blog for example) – this means that we shall always have articles linked to our entities. This is not a strict rule though as we might also use the entity pages to build our website (rather than to organise our blog posts).
  2. Every entity should have its own description. And this description shall express the editor’s own vision on a given topic. 
  3. Every entity should link to other entities. When we choose other entities to enrich the description of an entity, we create relationships within our knowledge graph; these relationships are precious and can be used in many different ways (the entity on AI on this blog is connected, for instance, with the entity John McCarthy, who was the first to coin the term in 1955)
  4. Entities, just like any post in WordPress, can be kept as drafts. Only when we publish them do they become available in the analysis and can we use them to classify our content.
  5. Images are worth a thousand words, as someone used to say. When we add a featured image to an entity we’re also adding the schema.org image attribute to the entity.
  6. Every entity (unless you’re creating something completely new) should be interlinked with the same entity on at least one other dataset. This is called data interlinking and can be done by adding a link to the equivalent entity using the sameAs attribute (here we have for instance the same John McCarthy in the Yago Knowledge Base).
  7. Every entity has a type (i.e. Person, Place, Organization, …) and every type has its own set of properties. When we complete all the properties of an entity we increase its visibility.  
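To give an idea of what lies behind a well-curated entity page, here is a minimal sketch of the structured description produced by following the checklist above; the property names come from schema.org, while the URLs are purely illustrative.

# A Person entity following the checklist: type, description, featured image, sameAs link.
john_mccarthy = {
    "@context": "http://schema.org",
    "@type": "Person",                                             # point 7: the entity type
    "name": "John McCarthy",
    "description": "The computer scientist who was the first to "
                   "coin the term 'artificial intelligence' in 1955.",  # point 2: the editor's own description
    "image": "https://example.org/uploads/john-mccarthy.jpg",      # point 5: the featured image (illustrative URL)
    "sameAs": [
        "http://yago-knowledge.org/resource/John_McCarthy"         # point 6: data interlinking (illustrative URL)
    ],
}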

Happy blogging!

 

WordLift 3.0: A brief semantic story – part 2

Classifications help us find the material we are looking for.

Here is the part 1 of this article.


By now, the web has such a great amount of content that it has become impossible to apply homogeneous classification schemes to organize knowledge and make it available, unless only a specific domain is considered (more than 2.5 million new articles are published each and every day).

Classification schemes are structures that use results and relations as information to be added to content. Four types can be identified: hierarchical, tree, faceted, and classification according to reference models (or paradigms).

Structured information storage is ultimately aimed at improving human knowledge.

Our goal with WordLift is to develop an application that structures content so as to simultaneously represent various classification methods for machines, enabling the latter to organize the content published on digital networks and make it usable in different ways.

Due to the impasse met by semantic technologies, introduced in part 1, in the first phase of our analysis we excluded the digital world as the mandatory recipient of our solution.

Therefore, during the first phase we looked at the classification systems that mankind used to organize its knowledge before the computing era; then we considered the evolution of faceted interfaces; the technologies that relate the different web environments to each other; and what is already consolidated on the web regarding these topics (interlinking with dbpedia, freebase and geonames, and the methodologies required by the search engines to classify and publish content).

It’s not easy to identify the answers, especially because the essential technological component is continually evolving. In the book “Organizzare la Conoscenza…” (in Italian), already mentioned in the previous post, at a certain point in Chapter 2 the essential categories – those having various facets in common and valid for all disciplines – are introduced.

They were introduced by the Indian mathematician Shiyali Ramamrita Ranganathan, who was the first, around 1930, to talk about this kind of analysis, which consists of breaking down a topic into components and then building it up again based on a code. He chose five essential categories: space and time, on which everyone agrees; energy, referring to activities or dynamism and indicating the ‘action’ in semantics; matter, for example a material and its properties; and personality, to indicate the main subject of that context, even when it’s not a human being.

These categories are considered abstract, but nevertheless we used them to design the back-end interface for the editors, and mapped them to the corresponding types in the schema.org vocabulary.

WordLift is indeed an editor built on top of the universally recognised vocabulary of concepts published by http://schema.org/, consisting so far of more than 1,200 items divided into nine essential categories: Action, CreativeWork, Event, Intangible, Medical Entity, Organization, Person, Place, Product.

As of November 2015, the schema.org vocabulary is used on over 217 million pages (URLs), containing a total of more than six billion triples.

WordLift 3.0 is a semantic editor that analyses content and automatically suggests metadata according to schema.org vocabulary categories, which we have somewhat simplified for users, dividing them in this first experimental phase into four essential categories: Who (Person, Organization), Where (Place), When (Event), What (CreativeWork, Product, Intangible). However, users can add any number of results to those suggested by the application, thus creating a personal vocabulary within the application.

The next release, which will complete the experimental phase in January 2016, will allow users to assign different levels of importance to the results, creating a hierarchical and tree classification (by using the mainEntity property that schema.org has created to mark up articles).

For the future we are considering the Dewey Decimal Classification, the hierarchical classification used in libraries across the world.

This is the general process that has led us to design a solution in which semantic technologies work jointly with relational technologies to automatically associate a set of metadata, or a semantic graph, to a specific content.

Identifying the technological development and services for users was not simple, but on the other hand the maturation and affirmation of the Linked Open Data cloud and of dbpedia (freebase, geonames) was essential to enable the WordLift 3.0 editor to generate reusable datasets.

[The first blog post of this brief Semantic Story is here.]

 

WordLift 3.0: A brief semantic story – part 1

In the world of digital networks, the term knowledge is generically used to identify and justify all activities aimed at improving data collection and organization. Of all sorts.

Knowledge can be improved when information is made available for a variety of readings and reports aimed at interpreting reality and fantasizing about trends, evolution and a possible future, in order to somehow control or dominate it.

Project processes have a necessary preparatory activity, within a project program, called identification of the reference scenario. In short, it consists of discovering and assimilating background contexts, those that set the scene into which the subject of the study, as if it were an actor, inserts itself to explain the reasons for the foreground.

In computing, knowledge is part of artificial intelligence. In this field the aim is (or was) to achieve automation through trial-and-error strategies. This way of sketching a scenario is called Knowledge Representation. This symbolic representation was limited by the difficulty of relating the various scenarios. The usual Tim Berners-Lee, still a WWW leader, is the one responsible for its evolution: through the W3C he launched in 1996 the XML standard, which allows semantic information to be added to content so that it can be related. It was the beginning of the Semantic Web, which made it possible to publish, alongside documents, information and data in a format that allows machines to process them automatically.

“Most of the information content in today’s web is designed to be read only by human beings …” (Tim Berners-Lee again) “computers cannot process the language in web pages”.

Semantic web means a web whose content is structured so that software can read it, answer questions about it and interact with users.

Introduction freely adapted from .. and for whoever wants to know the whole story.

Having introduced the value of any operation aimed at developing what will automatically set or suggest the metadata to be attached to content in order to make it readable by machines, one still has to understand and define the following: what are the components of this structure, or metadata? How can the significant elements be extracted uniformly, regardless of the language? Which types of ontological categorisation and which relations must be activated in a piece of content for it to become part of a semantic web for all? And especially: how can all this be done simultaneously?

And this is where the whole research and development area that revolves around semantic technologies got stuck. We believe this impasse was also caused by the lack of agreement among the various scientific paths necessary to achieve any kind of standardization, and by language and lexical differences, which the web itself and the technologies being distributed push towards a kind of ‘local’ multi-language system.

Considering the topic and the context of this post, we should leap from 1986, when the first markup languages were born, to 1998, when the XML standard was defined, and finally to today, November 2015. We have performed this leap, at least partially, by means of a query (described below) on Wikidata.

The path we have followed (considering that our group lacks scientific skills distributed among all the included fields of knowledge) involves:

  • accepting that semantic technologies as they had been conceived and applied could not fully meet our need to make the machines understand and order content;
  • redefining the context after the cultural and economic affirmation of the open data world and the data structure of the Linked Open Data.

Therefore, remembering what was dictated by the Austrian logician, mathematician and philosopher Gödel (also loved by the computing world), who stated that a world cannot be understood from inside the system itself (in order to understand any of it, we have to step out and observe it from the outside), we initially deconstructed the problem by enclosing in sets everything that would necessarily be part of the final solution, and then we turned to the world that preceded the current one: the analogical world, and how it had tackled and answered the problems arising from the organization and classification of large amounts of “knowledge”.

A study/guide was very useful to us (and we therefore thank its authors): Organizzare la conoscenza: dalle biblioteche all’architettura dell’informazione per il web (Claudio Gnoli, Vittorio Marino and Luca Rosati).

The query on Wikidata to reconstruct the story of markup languages

Below is the query you can run with a click (the results are incomplete because we only included languages whose creation date has a value in Wikidata; this value is expressed by Property:P571).

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?entity ?ml ?sl WHERE {
 ?entity wdt:P31 wd:Q37045 . # ?entity is a markup language
 ?entity wdt:P571 ?sl . # ?sl is the inception date of ?entity
 ?entity rdfs:label ?ml . # ?entity name is ?ml
 FILTER(LANG(?ml) = "it") # ?ml is in Italian
 }
 ORDER by ?sl
 LIMIT 100

…. continues and here is the part 2 of this article.

 

Looking for…science fiction movies on the Linked Data Cloud

When working in technology, sometimes you find yourself day-dreaming about Artificial Intelligence, Space Ships, Aliens. The trip can be intensified by a legacy of Science Fiction movies that inspire you and give you motivation to work harder on the actual project. The bad part about being a science fiction junkie is that you always search for new movies worth watching. Over the years you become pickier and pickier; it’s never enough.

After re-reading the Foundation series by Isaac Asimov, you crave more and have the urge to find available movies with a solid background in literature. That seems to be a good filter for quality:

You want to watch all sci-fi movies inspired by books.

Before watching them of course you need to list them. There are many resources on the web to accomplish the task:

1 – IMDb, Rotten Tomatoes: they offer detailed information about a movie, e.g. what the movie is about, who the actors are, some reviews. There are interesting user-curated lists that partially satisfy your requirements, for example a list of the best sci-fi movies from 2000 to now. These websites are good resources to get started, but they don’t offer a search for the details you care about.

2 – individual blogs: you may be lucky if a search engine indexed an article that answers exactly your question. The web is huge, and someone might have been brave enough to do the research himself and generous enough to put it online. That is not always the case, and it is absolutely not reliable.

3 – Linked Data Cloud: the web of data comes as a powerful resource to query the web at atomic detail. Both dbPedia and Wikidata, the LOD versions of Wikipedia, contain thousands of movies and plenty of details for each. Since the LOD cloud is a graph database hosted by the public web, you can freely ask very specific, domain-crossing questions, obtaining as a result pure, diamond data. Technically this is more challenging, some would say “developer only”, but at InsideOut we are working to democratize this opportunity.

From the title of the post you may already know what option we like most, so let’s get into the “how to”.
We need to send a query to the Wikidata public SPARQL endpoint. Let’s start from a visual depiction of our query, expressing concepts (circles) and the relations between them (arrows).

[Image: visual depiction of the query]

Let’s color in white what we want in output from the query and in blue what we already know.

[Image: the same query, with known values in blue and outputs in white]

– Why is it necessary to specify that m is a Movie?
Writings can inspire many things, for example a song or a political movement, but we only want Movies.

– Why is it not also specified that w is a Writing and p is a Person?
Movies come out of books, short stories and sometimes science essays. We want our movies to be inspired by something that was written, and this relation is implied by the “author” relation. The fact that p is a person is implied by the fact that only persons write science fiction (at least as of 2015).

Let’s reframe the picture as a set of triples (subject-predicate-object statements), the kind of language a graph database can understand. We call m (movie), w (writing) and p (person) our unknowns, then we define the properties and relations they must match.

  • m is a Movie
  • m is based on w
  • w is written by p
  • p is a science fiction writer

Since the graph we are querying is the LOD cloud, the components of our triples are internet addresses and the query language we use is SPARQL. See below how we translate the triples above into actual Wikidata classes and properties. Keep in mind that movies, persons and writings are unknowns, so they are expressed with the ?x syntax. Everything after # is a comment.

?m <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q11424> .
# ?m is a Movie
?m <http://www.wikidata.org/prop/direct/P144> ?w .
# ?m is based on ?w
?w <http://www.wikidata.org/prop/direct/P50> ?p .
# ?w written by ?p
?p <http://www.wikidata.org/prop/direct/P106> <http://www.wikidata.org/entity/Q18844224> .
# ?p is a science fiction writer

As you can see, the triples’ components are links, and if you point your browser there you can fetch the triples in which the link itself is the subject. That’s the major innovation of the semantic web compared to any other kind of graph database: it is as huge and distributed as the web. Take a few moments to appreciate the idea and send your love to Tim Berners-Lee.

Similarly to SQL, we can express in SPARQL that we are selecting data with the SELECT…WHERE keywords. The PREFIX syntax makes our query more readable by making the URIs shorter:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?w ?m ?p WHERE {
?m wdt:P31 wd:Q11424 . # ?m is a Movie
?m wdt:P144 ?w . # ?m is based on ?w
?w wdt:P50 ?p . # ?w written by ?p
?p wdt:P106 wd:Q18844224 . # ?p is a science fiction writer
}

If you run the query above you will get as a result a set of addresses: the URIs of the movies, writings and persons we searched for. We should query directly for the names, so let’s introduce ml (a label for the movie), wl (a label for the writing) and pl (a label for the person). We also require the label language to be English, via the FILTER command.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?pl ?wl ?ml WHERE {
?m wdt:P31 wd:Q11424 . # ?m is a Movie
?m wdt:P144 ?w . # ?m is based on ?w
?w wdt:P50 ?p . # ?w written by ?p
?p wdt:P106 wd:Q18844224 . # ?p is a science fiction writer
?p rdfs:label ?pl . # ?p name is ?pl
?w rdfs:label ?wl . # ?w name is ?wl
?m rdfs:label ?ml . # ?m name is ?ml
FILTER(LANG(?pl) = "en") # ?pl is in english
FILTER(LANG(?wl) = "en") # ?wl is in english
FILTER(LANG(?ml) = "en") # ?ml is in english
}
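If you prefer to run the query programmatically rather than from the web interface, here is a small sketch using Python and the public Wikidata SPARQL endpoint; the query string is the same as above.

import requests

query = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?pl ?wl ?ml WHERE {
  ?m wdt:P31 wd:Q11424 .           # ?m is a Movie
  ?m wdt:P144 ?w .                 # ?m is based on ?w
  ?w wdt:P50 ?p .                  # ?w written by ?p
  ?p wdt:P106 wd:Q18844224 .       # ?p is a science fiction writer
  ?p rdfs:label ?pl . ?w rdfs:label ?wl . ?m rdfs:label ?ml .
  FILTER(LANG(?pl) = "en") FILTER(LANG(?wl) = "en") FILTER(LANG(?ml) = "en")
}
"""

# The Wikidata endpoint accepts the query as a URL parameter and can return JSON.
resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": query, "format": "json"})
for row in resp.json()["results"]["bindings"]:
    print(row["pl"]["value"], "|", row["wl"]["value"], "|", row["ml"]["value"])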

Let’s run the query on the dedicated Wikidata service. You can imagine this process as SPARQL trying to match the pattern in our picture against all its data, giving back as a result only the values of our unknowns that satisfy the constraints. The results are:

pl (writer) | wl (writing) | ml (movie)
Carl Sagan | Contact | Contact
Philip K. Dick | Do Androids Dream of Electric Sheep? | Blade Runner
Philip K. Dick | Paycheck | Paycheck
Philip K. Dick | The Golden Man | Next
H. G. Wells | The War of the Worlds | War of the Worlds
H. G. Wells | The War of the Worlds | The War of the Worlds
H. G. Wells | The War of the Worlds | War of the Worlds 2: The Next Wave
H. G. Wells | The War of the Worlds | H. G. Wells’ War of the Worlds
Mikhail Bulgakov | Ivan Vasilievich | Ivan Vasilievich: Back to the Future
Mikhail Bulgakov | Heart of a Dog | Cuore di cane
Mikhail Bulgakov | The Master and Margarita | Pilate and Others
H. G. Wells | The Shape of Things to Come | Things to Come
H. G. Wells | The Time Machine | The Time Machine
H. G. Wells | The Island of Doctor Moreau | The Island of Dr. Moreau
H. G. Wells | The Time Machine | The Time Machine
H. G. Wells | The Invisible Man | The Invisible Man
H. G. Wells | The First Men in the Moon | First Men in the Moon
H. G. Wells | The Invisible Man | The Invisible Woman
Isaac Asimov | The Bicentennial Man | Bicentennial Man
Isaac Asimov | I, Robot | I, Robot
Isaac Asimov | The Caves of Steel | I, Robot
Philip K. Dick | Adjustment Team | The Adjustment Bureau
Philip K. Dick | Second Variety | Screamers
Philip K. Dick | Impostor | Impostor
Philip K. Dick | Radio Free Albemuth | Radio Free Albemuth
Philip K. Dick | We Can Remember It for You Wholesale | Total Recall
Philip K. Dick | The Minority Report | Minority Report
Philip K. Dick | A Scanner Darkly | A Scanner Darkly
Daniel Keyes | Flowers for Algernon | Charly
Kingsley Amis | Lucky Jim | Lucky Jim (1957 film)
Kingsley Amis | That Uncertain Feeling | Only Two Can Play
John Wyndham | The Midwich Cuckoos | Village of the Damned
Fritz Leiber | Conjure Wife | Night of the Eagle
Brian Aldiss | Super-Toys Last All Summer Long | A.I. Artificial Intelligence
John Steakley | Vampire$ | Vampires
Iain Banks | Complicity | Complicity

You’ve got new quality movies to buy and watch to satisfy your sci-fi addiction. Our query is just a hint of the immense power unleashed by linked data. Stay tuned to get more tutorials, and check out WordLift, the plugin we are launching to manage and produce linked data directly from WordPress.

Some fun exercises:

  • EASY: get the movies inspired by the writings of Isaac Asimov
  • MEDIUM: get all the movies inspired by women writers
  • HARD: get all music artists whose songs were featured in a TV series

 

 

MICO Testing: One, Two, Three…

We’ve finally reached an important milestone in our validation work in the MICO project…we can begin testing and integrating our toolset with the first release of the platform to evaluate the initial set of media extractors. 

This blog post is more or less a diary of our first attempts in using MICO in conjunction with our toolset that includes:

  • HelixWare – the Video Hosting Platform (our online video platform that allows publishers and content providers to ingest, encode and distribute videos across multiple screens)
  • WordLift – the Semantic Editor for WordPress (assisting the editors writing a blog post and organising the website’s contents using semantic fingerprints)
  • Shoof – a UGC video recording application (this is an Android native app providing instant video-recording for people living in Cairo)

The workflow we’re planning to implement aims at improving content creation, content management and content delivery phases. 

Combined Deliverable 7.2.1 & 8.2.1 Use Cases – First Prototype

The diagram describes the various steps involved in the implementation of the scenarios we will use to run the tests. At this stage the main goal is to:

  • a) ingest videos in HelixWare,
  • b) process these videos with MICO and
  • c) add relevant metadata that will be further used by the client applications WordLift and Shoof.  

While we’re working to see MICO in action in real-world environments, the tests we’ve designed aim at providing valuable feedback for the developers of each specific module in the platform.

These low-level components (called Technology Enablers or simply TE) include the extractors to analyse and annotate media files as well as modules for data querying and content recommendation. We’re planning to evaluate the TEs that are significant for our user stories and we have designed the tests around three core objectives:

  1. output accuracy – how accurate, detailed and meaningful each single response is when compared to other available tools;
  2. technical performance – how much time each task requires and how scalable the solution is when we increase the volume of content being analysed;
  3. usability – evaluated in terms of integration, modularity and usefulness.

As of today, with everything still being extremely experimental, we’re using a dedicated MICO platform running in a protected and centralised cloud environment. This machine has been installed directly by the technology partners of the project: this makes it easier for us to test and simpler for them to keep on developing, hot-fixing and stabilising the platform.

Let’s start

By accessing the MICO Admin UI (available from the `/mico-configuration` directory), we’ve been able to select the analysis pipeline. MICO orchestrates different extractors and combines them into pipelines. At this stage the developer shall choose one pipeline at a time.

[Image: MICO Admin UI – pipeline selection]

Upon startup we can see the status of the platform by reading the command output window; while not standardised, this already provides an overview of the startup of each media extractor in the pipeline.

[Image: MICO platform startup output]

For installing and configuring the MICO platform you can read the end-user documentation; at this stage, though, I would recommend waiting until everything becomes more stable (here is a link to the MICO end-user documentation)!

After starting up the system, using the platform’s REST APIs we’ve been able to successfully send the first video files and request their processing. This is done mainly in three steps:

1. Create a Content Item

Request:
curl -X POST http://<mico_platform>/broker/inject/create

Response:
{"uri":"http://<mico_platform>/marmotta/322e04a3-33e9-4e80-8780-254ddc542661"}

2. Create a Content Part

Request:
curl -X POST "http://<mico_platform>/broker/inject/add?ci=http%3A%2F%2Fdemo2.mico-project.eu%3A8080%2Fmarmotta%2F322e04a3-33e9-4e80-8780-254ddc542661&type=video%2Fmp4&name=horses.mp4" --data-binary @Bates_2045015110_512kb.mp4

Response:
{"uri":"http://<mico_platform>/marmotta/322e04a3-33e9-4e80-8780-254ddc542661/8755967a-6e1d-4f5e-a85d-4e692f774f76"}

3. Submit for processing

Request:
curl -v -X POST "http://<mico_platform>/broker/inject/submit?ci=http%3A%2F%2Fdemo2.mico-project.eu%3A8080%2Fmarmotta%2F322e04a3-33e9-4e80-8780-254ddc542661"

Response:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Length: 0
Date: Wed, 08 Jul 2015 08:08:11 GMT
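For reference, the same three calls can be sketched with Python's requests library; <mico_platform> remains a placeholder for the host of the protected instance, and the file name is just an example.

import requests

base = "http://<mico_platform>/broker/inject"  # placeholder host, as in the curl examples

# 1. Create a content item.
item_uri = requests.post(f"{base}/create").json()["uri"]

# 2. Attach a video file as a content part.
with open("horses.mp4", "rb") as video:
    part_uri = requests.post(f"{base}/add",
                             params={"ci": item_uri, "type": "video/mp4", "name": "horses.mp4"},
                             data=video).json()["uri"]

# 3. Submit the content item for processing by the selected pipeline.
requests.post(f"{base}/submit", params={"ci": item_uri})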

In the next blog posts we will see how to consume the data coming from MICO and how this data will be integrated in our application workflows.

In the meantime, if you’re interested in knowing more about MICO and how it could benefit your existing applications you can read:

Stay tuned for the next blog post!