Innovation in Digital Journalism - a report from the second year of FP7-MICO

This blog post summarizes the work done in the context of the EU project FP7-MICO in the area of digital journalism.

This past December we attended one of the most exciting events on entrepreneurship and hi-tech startups in the Middle East and North Africa region: RiseUp Summit 2015 in Cairo. We engaged with the overwhelming crowd of startuppers and geeks at our HelixWare booth and in a separate Meetup organised at the Greek Campus.

We had the opportunity, during these two hectic days, to share the research work done in MICO for extending the publishing workflows of independent news organizations with cross-media analysis, natural language processing and linked data querying tools.

What have we done – the summary of MICO year two in slides

“New Thinking in the Practice of Digital Journalism”

We’re primarily working with two partners: Greenpeace Italy and Shoof (a startup developing an Android app for user generated content). Our focus is on the following aspects:

- organizing the news desk of small and medium editorial teams by creating a flexible network of metadata around both text and media,
- reducing the complexity of content management operations,
- extending readers’ dwell time by repurposing matching content (namely content recommendations).

The results of our work are delivered in two forms:

- things worth sharing that we learn along the way,
- enhancements to our software products, WordLift and HelixWare, that benefit from the integration with the MICO platform (this part of the work can potentially help other companies extend their products and services with the same stack of technologies).

What we have learned

Evolutions in the news sector are constant as the web has brought us an unprecedented growth of information. At the same time the Internet has dramatically destabilized the existing business models that supported journalism for a long time.

We’re now facing a struggle for survival and a pressing need for radical changes throughout the entire industry.

One thing that we learned is that throwing news online without context and analysis simply doesn’t work when the focus for digital news is on interactivity, engagement and community.

The other important aspect is that algorithms are reshaping the reader’s experience, and news content is constantly republished in different forms (think of Facebook Instant Articles or the Google Accelerated Mobile Pages project) and across different devices. Discovery tools like Taboola and Outbrain analyze news content with their proprietary algorithms and leverage their large distribution networks to monetize it via native advertising (possibly returning a cut of the revenues to the content owners).

In this context it is clear that knowledge structuring and organization have business value. Moreover, data ownership plays a crucial role: if the publisher retains full ownership of its own metadata, it can potentially create new revenue streams directly and has a stronger position when negotiating with third parties.

Last but not least, we learned that journalists, in this ever-evolving scenario, need to keep their attention on writing great stories and on creating meaningful relationships with the members of their target groups and communities. Technology has to be there to assist their work but shouldn’t require too much of their attention (at least in most cases).

Organizing Knowledge

As we progressed with our work and kept on developing our semantic editor WordLift (a plugin for WordPress being used by Greenpeace on their magazine website) we realised that structuring contents with a classification scheme would provide the needed context; we also realised that by structuring contents and creating multiple access points (in the form of web pages) we can increase the overall content discoverability over social networks and search engines.

While we’re still gathering data, it is clear that classifying news content with a clear scheme (Who: people and organizations involved, What: key topics, Where: places and When: events) is increasing both engagement and traffic.
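As a rough illustration of this scheme, the sketch below models the four facets as a small Python data structure and derives one entity page per concept, following the http://your-website/entity/lego pattern that appears in the workflow later in this post. The class, the helper names and the sample entities "Oil drilling" and "Arctic" are assumptions made for the example; this is not how WordLift stores its vocabulary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ArticleClassification:
    """Entities attached to one article, grouped by the four W's."""
    who: set[str] = field(default_factory=set)    # people and organizations
    what: set[str] = field(default_factory=set)   # key topics
    where: set[str] = field(default_factory=set)  # places
    when: set[str] = field(default_factory=set)   # events

    def access_points(self, base_url: str) -> list[str]:
        """Return one entity-page URL per tagged concept, sorted for stable output."""
        names = self.who | self.what | self.where | self.when
        return sorted(f"{base_url}/entity/{slugify(name)}" for name in names)


def slugify(name: str) -> str:
    """Lowercase the name and join its words with hyphens."""
    return "-".join(name.lower().split())


# Hypothetical tagging of a single article.
article = ArticleClassification(
    who={"Greenpeace", "Lego"},
    what={"Oil drilling"},
    where={"Arctic"},
)
```

Calling article.access_points("http://your-website") yields one URL per concept, which is the sense in which each tagged entity becomes an extra access point to the content.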

Improving the web publishing workflow with MICO

Adding MICO to this scenario means two things, specifically for news sites:

- we can unveil the hidden semantics of media content; this translates into better search for images and videos within the CMS’s media library, and consequently less time spent by the editor searching for media files that can complement an article,
- we can create new valuable links between articles and videos using the cross-media content recommendations (this work is still in progress).

Takeaways from the testing and validation

Semantic tagging, as implemented in our MICO showcase, means that every publisher starts curating a set of concepts (expressed as named entities) that emerge from the contents being produced and analyzed.

In WordLift these concepts are gathered within an internal vocabulary.

This vocabulary brought the organization (Greenpeace Italy in our validation tests) a new level of self-awareness. The editorial team has begun studying more carefully the relationship between the organization, the concepts they use for tagging and their target audience. This process is helping them make strategic editorial decisions about what connects to what and why.

Next step…Finding all videos relevant to an article

At the technical level, one of the most exciting achievements has been building a first prototype that connects videos uploaded in HelixWare and analyzed by MICO (using the Automatic Speech Recognition, or ASR, extractor) with an article on the same topic written with WordLift.

Here is the workflow:

- the editor creates an article with WordLift on the #SaveTheArctic campaign,
- while tagging the article, WordLift creates a named entity for “Lego” and a corresponding web page (i.e. http://your-website/entity/lego),
- MICO analyzes a list of videos being uploaded with HelixWare, and several processes take place in the background:
  - the audio part is split from the video,
  - the audio is analyzed with the ASR extractor,
  - the text derived from the ASR is analyzed with an NLP extractor,
  - the NLP extractor uses both DBpedia entities and the WordLift custom vocabulary that contains the entity “Lego”,
- we can now look for videos containing concepts created with WordLift and stored in the user’s vocabulary.
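To make the last two steps of the workflow more concrete, here is a toy sketch of matching a custom vocabulary against speech transcripts and then looking up the videos that mention a given entity. It is only an illustration under stated assumptions: the function names, the in-memory vocabulary and the sample transcripts are made up, and the real MICO extractors do far more than whole-word matching.

```python
from __future__ import annotations

import re

# Hypothetical custom vocabulary: entity label -> entity page URL.
VOCABULARY = {
    "Lego": "http://your-website/entity/lego",
}


def find_vocabulary_entities(transcript: str, vocabulary: dict[str, str]) -> list[str]:
    """Return the URLs of vocabulary entities whose label occurs in the transcript."""
    found = []
    for label, url in vocabulary.items():
        # Whole-word, case-insensitive match, so "Lego" does not match "Legoland".
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, transcript, flags=re.IGNORECASE):
            found.append(url)
    return found


def videos_for_entity(transcripts: dict[str, str], entity_url: str) -> list[str]:
    """Return the ids of videos whose transcript mentions the given entity."""
    return [
        video_id
        for video_id, text in transcripts.items()
        if entity_url in find_vocabulary_entities(text, VOCABULARY)
    ]


# Made-up transcripts standing in for ASR output.
transcripts = {
    "video-1": "greenpeace asks lego to end its partnership",
    "video-2": "a report on renewable energy",
}
print(videos_for_entity(transcripts, VOCABULARY["Lego"]))  # prints ['video-1']
```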

Here is the query in SPARQL:

PREFIX fam: <http://vocab.fusepool.info/fam#>
PREFIX p: <http://www.mico-project.eu/ns/platform/1.0/schema#>

# Entity annotations found on the parts of one content item.
SELECT DISTINCT ?anno ?entity ?source ?confidence
WHERE {
    ?anno a fam:LinkedEntity ;
          fam:entity-reference ?entity ;
          fam:confidence ?confidence ;
          fam:extracted-from ?source .
    <http://our-mico-demo-server:8080/marmotta/d7eea551-266d-4f93-97c7-42ad2bbc6df1> p:hasContentPart ?source .
    # Keep only entities whose URI contains "wordlift" (the custom vocabulary).
    FILTER regex(str(?entity), "wordlift")
}
# Ascending order; use DESC(?confidence) to list the most confident matches first.
ORDER BY ?confidence
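A query like this typically comes back in the standard SPARQL JSON results format, where each row is a set of bindings keyed by variable name. The following sketch shows one way to flatten those rows and rank them by confidence on the client side. The function name and the sample response are assumptions for illustration; the sample is not real MICO output.

```python
def matches_from_results(results: dict) -> list[dict]:
    """Flatten SPARQL JSON results into dicts, highest confidence first."""
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append({
            "entity": binding["entity"]["value"],
            "source": binding["source"]["value"],
            # Literal values arrive as strings, so convert before sorting.
            "confidence": float(binding["confidence"]["value"]),
        })
    return sorted(rows, key=lambda row: row["confidence"], reverse=True)


# Made-up sample response, for illustration only.
sample = {
    "head": {"vars": ["anno", "entity", "source", "confidence"]},
    "results": {"bindings": [
        {
            "entity": {"type": "uri", "value": "http://your-website/entity/lego"},
            "source": {"type": "uri", "value": "urn:example:part-1"},
            "confidence": {"type": "literal", "value": "0.42"},
        },
        {
            "entity": {"type": "uri", "value": "http://your-website/entity/lego"},
            "source": {"type": "uri", "value": "urn:example:part-2"},
            "confidence": {"type": "literal", "value": "0.91"},
        },
    ]},
}

ranked = matches_from_results(sample)
```

Here ranked[0] is the annotation on urn:example:part-2, since it carries the higher confidence value.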

The advantage of using the metadata generated with WordLift to filter videos analyzed with MICO is that the selection can become very relevant (and of course the filters in the query can be much more specific than the one in this example). It also improves the signal-to-noise ratio of the Automatic Speech Recognition output, which is otherwise far too noisy to bring any value in this specific context.
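As an example of a more specific filter, the sketch below builds a query that asks for one exact entity and drops annotations below a confidence threshold. The prefixes and properties are the ones used in the query above; the function name, the 0.5 default threshold and the IRI check are assumptions for the sake of the example, not part of the MICO platform.

```python
def build_entity_query(content_item: str, entity: str, min_confidence: float = 0.5) -> str:
    """Build a SPARQL query for one exact entity above a confidence threshold."""
    for uri in (content_item, entity):
        # Reject characters that would break out of the <...> IRI syntax.
        if any(ch in uri for ch in '<>" \n'):
            raise ValueError(f"not a valid IRI: {uri!r}")
    return f"""PREFIX fam: <http://vocab.fusepool.info/fam#>
PREFIX p: <http://www.mico-project.eu/ns/platform/1.0/schema#>
SELECT DISTINCT ?anno ?source ?confidence
WHERE {{
    ?anno a fam:LinkedEntity ;
          fam:entity-reference <{entity}> ;
          fam:confidence ?confidence ;
          fam:extracted-from ?source .
    <{content_item}> p:hasContentPart ?source .
    FILTER(?confidence >= {float(min_confidence)})
}}
ORDER BY DESC(?confidence)"""


# Hypothetical content item URI; the entity URL follows the pattern from the workflow.
query = build_entity_query(
    "http://example.org/content-item/1",
    "http://your-website/entity/lego",
    min_confidence=0.7,
)
```

Matching on the exact entity URI instead of a regular expression keeps the result set limited to the concept the editor actually tagged, and the threshold is a simple knob for trading recall against noise.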

If you’re from the news and media industry and you’re interested in this work or willing to understand how this approach could benefit your business, drop us an email!
