Analysis of ITAR-TASS wiki

My last project at WikiVote was to create a knowledge base for the main Russian news agency, ITAR-TASS. In the next posts I’m going to describe the project and analyse its results. This post is an introduction to the task we had.

For me it all started when I received a message from TASS saying how awesome the Semantic Web and semantic wikis are, and how much TASS needs these technologies. The head of computing in their monitoring department had read my articles on Habrahabr and got very excited about the dynamic way of creating content and data that semantic wikis provide.

I wondered: why would a news agency need a wiki? How would they use it? Some time later I realized that the answer is pretty straightforward: they will use it as a homemade Wikipedia, that is, they will store facts about people, events, places etc.

You may ask, “Why not use the actual Wikipedia for that?” Well, they can’t do that for the following reasons:

  • Inaccuracy. The main news agency that represents Russia can’t rely on a crowdsourced encyclopedia. They’re afraid that the facts in Wikipedia will be inaccurate.
  • Position. Of course both Wikipedia and the news agencies try to remain as neutral as possible. Sometimes, however, it’s hard to describe events and things in a neutral way, so TASS can’t rely on the crowdsourced position of Wikipedia.
  • Proper emphasis. For example, how are we going to describe Arnold Schwarzenegger: as a politician, a bodybuilder or an actor?

Besides, I did my best to explain that non-semantic wikis suck when the amount of facts is big enough and there are not a lot of editors working on the content.

So, the project began. Our system was meant to replace this knowledge base:



Yeah, there is a whole floor in TASS full of these card indexes. The smell of paper and dust took me back to my childhood, when I spent days in the public library where my mother worked. The API of the knowledge base is a telephone and an old lady who knows how the indexes are organized.


Using Semantic MediaWiki for creating standards for professional activity

I think that 97 percent of users underestimate the power of Semantic MediaWiki. It’s not just a tool that will help you manage your encyclopedia. In fact, using wikis only for encyclopedias or for documentation is just one dumb, straightforward use case.

At SMWCon I described how we’ve used SMW to let professional communities create standards. The result of their effort is one document: the standard of the professional activity. It describes the stuff you have to know, the actions you have to perform on the job and the qualities you have to have in order to be a professional.

In order to produce a good document we have to structure the process of creation. In other words, we have to follow a specific methodology that ensures that all skills, knowledge and qualities are properly described. The methodology states how to structure the building blocks of the standard and their relations and dependencies. Of course we also need to develop the other part of the methodology, the social part, which describes the process of creation itself. Just imagine: to create the standard that describes a manager you have to involve various levels of managers, from line managers to CEOs. These guys will disagree with each other and you need to somehow create a document that reflects their consensus.

That’s what Semantic MediaWiki is good for: SUPPORTING COMPLICATED METHODOLOGIES. That’s what my talk is about:

There was a great guy at the conference, Alexander Gesinn. He is in the 3% of people who understand what SMW can be used for. And he’s making a business out of business process management. What amazed me is how he managed to package it: his semantic::apps is really a tool for BPM, it’s no bloody general-purpose wiki. It’s a product!!!

ChapTimeline Result Format is now open source!

Timeline is one of the most impressive result formats in Semantic MediaWiki. But the SIMILE timeline looks very old-fashioned and its code is a great mess. Moreover, many of our clients find it too complicated and I totally agree with them:

So I decided to implement a new timeline result format. There are quite a few JavaScript libraries out there, but only a few of them are as powerful as the SIMILE widget. I’ve chosen CHAP-timeline from Almende. Here is what I like about it:


  • It’s open source
  • It’s zoomable
  • It’s clean and simple and it really feels interactive to me. It seems to invite you to click and zoom
  • It allows you to insert complex HTML into event boxes
  • It’s damn powerful: they have clustering, a mobile version, jQuery themes support, skin support, grouping and editing (I’m thrilled to support that in SMW)
  • It has beautiful documentation for programmers, some of the best I’ve seen since my days of programming with Qt


  • The downside: it uses the Google API and thus it’s illegal to use it without internet (though technically possible)

Go to the extension page, download this nice result format and try it in your projects! If you need some specific features added, just ask me. As always, if I can implement them in my free time, I’ll gladly do so. Otherwise, funding is also possible.


Results of SMWCon and possible future of Semantic MediaWiki

SMWCon Fall logo. SMW logo blended with Berlin TV Tower

The SMWCon conference has just ended and I’ve successfully recovered from it. It was the biggest conference ever: we had almost 90 participants. It was also a very interesting one: we had business talks and scientific talks, talks about open government and talks about enterprise wikis, and a lot more.

Semantic MediaWiki remains one of the rare projects that are related to the Semantic Web and at the same time don’t rely solely upon grants to stay alive. In fact, most of the core developers of the platform have nothing to do with linked data and the Semantic Web, and I think that’s good.

I was the Program Chair, in other words I was responsible for the content of the conference. Unfortunately I couldn’t reach anyone from the DataRangers team: these dudes would have been most welcome guests at the conference because they can really talk about how to turn a semantic wiki into a focused business solution. Still, Alexander Gesinn was amazing too: you could never guess that his semantic::apps solution is SMW inside.

I think several things are quite important for the future of Semantic MediaWiki, and I’m going to concentrate on them:

  • proper positioning. What is SMW good for? It’s not clear from the website.
  • outreach. MORE PEOPLE. We need a community as big as those of Joomla/Drupal/WordPress
  • funding of the development. It’s an old observation that if you put some money into a project, it may become more successful

About the positioning I’d say that we have to distance ourselves from the Semantic Web. It was good to have big grants back in 2006, but now disappointment in the Semantic Web is growing, despite the fact that many SW technologies have become part of the everyday Web. In my opinion we have to focus on the following audiences:

  • open government data people. There are some good use cases of that, most notably NYCPedia
  • enterprise wikis. Most of them are behind firewalls and it’s sometimes not easy to contact the people there. But they really have proper requirements, and funding too.
  • Documentation projects. Webplatform is one of them, and recently Parson Communications started to use SMW for technical writing.
  • research projects outside the Semantic Web, in bioinformatics, neuroscience and engineering. For example I’ve just stumbled upon the Texas Instruments wiki. The BlueBrain project also wants to use us.
  • consultancies that use SMW for supporting some particular methodology. Examples are WikiVote with standards and roadmaps, with Business Process Modeling

The funding part is especially interesting. So far I see the following ways to bring money and/or workforce into Semantic MediaWiki:

  • If you’re a professor, you can have your bachelor students do projects that contribute to Semantic MediaWiki. An ideal project here would be to measure the efficiency, latency and speed of different storage backends.
  • If you’re using SMW for your research and have a grant, you can use the grant money to sponsor features you need.
  • If you’re a developer, you can try applying for a Wikimedia Individual Engagement Grant or be a mentor in Google Summer of Code. This is how Stephan and I pushed Semantic Glossary a little bit forward.
  • If you can’t spend money but have qualified people who are ready to make improvements in SMW, please have them commit those improvements back to core! The whole Linux kernel works this way: a lot of people from Intel, NVIDIA and IBM write open source code during their working hours.
  • If you’re a person or a company and you need a small feature (e.g. a new result format), hire one of the developers: they will write it for you and make it open source, thus providing support
  • If you’re a company and you need a BIG feature (e.g. speeding up, support for a new store, new parameters for #ask, etc.), you can try to ask how many more companies need it. If it’s a common demand, we can create a fund into which every company puts some money.

During the conference I proposed this last model of groupfunding and I’m very eager to try it in action. I haven’t heard of this model being used in any other project, but I think that it could work in our case. Here is the poster about it that I presented at the conference:


What I’m thinking about now is the pilot project that will be supported with this groupfunding. It should be something medium-sized and long-awaited, something that will interest many companies. Some candidates for the pilot feature I’ve come up with so far:

  • measuring performance. That is, load testing. Many parties want to know how the number of properties and subobjects affects latency/response time. How does it work with an RDF store? Is it quicker? How much quicker? How about memory consumption? With and without cache?
  • Developer documentation. This is tricky: every time somebody asks Jeroen about the proper way to do anything, he answers that currently the code is a big mess and SMW will have a cool new developer API soon. But having a description of anything at all would already be good.
  • fully-fledged SPARQL support. That is, making inline SPARQL queries work with all kinds of result formats
  • Stuff from our questionnaires, for example:
    • support displaying linked properties like “?a.b”,
    • greater support for ORs and ANDs,
    • more up-to-date display of data,
    • free-text search features in queries
    • brackets/braces for complex queries
    • things for forms and page schemas, for example visual editing of forms, WYSIWYG support in textareas, automatic escaping for form field content
    • Access fucking control. I know, it depends on many other factors. 🙁
  • Support for a new storage that will boost the performance. Maybe MongoDB?
  • Semantification
  • Custom datatypes

Any other ideas?

Helping SEMAT community to grow and be more transparent

SEMAT stands for Software Engineering Method and Theory. It was designed by Ivar Jacobson, a co-author of UML, the Use Case methodology and the Rational Unified Process. The project is pretty big and has many parts, and I’m sure it will influence the software engineering world a great deal.

Here are the parts I can understand:

  • (academic part) there is an effort in SEMAT to create a solid grounding for software engineering. For example, there are dozens of processes and methodologies for software engineering: what do they have in common?
  • (practical part) SEMAT provides a framework to quickly analyse a software project from the product point of view. What is the current state of the requirements: are they formalized, or maybe already satisfied? Do we know all the stakeholders of the product? What is the current state of the engineering team? And so on.

Of course I like the practical part more. In fact I’ve totally fallen in love with the thing and want the whole world to know about it. Because of that, when I met Ivar Jacobson at a presentation in Moscow, I immediately asked him how they wanted to advertise SEMAT.

It turned out that they wanted to do the following:

  • present SEMAT at academic events and publish papers to reach the academic world. That way parts of SEMAT will have a chance to be taught at more universities.
  • contact big enterprises and propose that they try SEMAT in practice. These case studies will show how efficient the methodology is and bring more practitioners to the SEMAT camp
  • publish in good practical journals like the ACM and IEEE journals to reach some part of the engineering community

… and that’s it. “But hey, listen, what about reaching the simple folks? They don’t work at IBM and don’t read IEEE journals. They don’t know about CMMI or RUP and really think that it’s Alan Cooper who brought the most powerful ideas into software design. They read Joel on Software and Wired and Hacker News and Reddit. How do you plan to reach them?” Well, there wasn’t much of a plan, mostly because of the generation gap. And here it seems I can help.

I’ve interviewed the most active members of the SEMAT community and together we’ve created this list of goals. So with the SEMAT community we want to:

  • become more transparent
  • become more open to new ideas
  • turn the potential of our supporters into an actual force
  • reach more people and advertise ourselves in academia, enterprise AND AMONG SIMPLE FOLKS
  • reach more young people
  • organize those people into a self-supporting community that can do things useful for SEMAT

To reach some of these goals I decided to start with actions that require almost no effort but will produce a big effect:

Semat community actions

  • create a medium where people can ask questions and communicate. After some trials I’ve chosen a Nabble mailing list for that
  • another option is the LinkedIn group, which already existed but was private
  • start working on the guidelines for transparency and openness for the community members
  • gather all the existing materials such as presentations, papers, articles, book chapters

Moving on! This post is only a start, I’ll try to describe more of my adventures as a community manager. Let’s see if I can improve the SEMAT community by working on that for 1-2 hours a day!

Developing new glossary tools in MediaWiki

Good news for me! My protégé Zhenya has been accepted for Google Summer of Code 2013! Stephan Gambke and I are the Wikimedia Foundation mentors for the project, which aims at improving two MediaWiki extensions: Lingo and Semantic Glossary.

Here is the description of the project.

We will add many interesting and promising features to Lingo. The most exciting new features are word-form support and the ability to create new terms on the fly.

I love Semantic Glossary because it allows you to link pages without creating hyperlinks by hand. Zhenya has already started the work, but if you have good ideas, feel free to make suggestions!

Wikiapiary – a useful website for MediaWiki admins

The fairly simple idea of gathering statistics from all the MediaWiki installations they can find has resulted in a surprisingly useful website with huge potential. Wikiapiary knows which software your wiki uses: the skin, the extensions, the version of MediaWiki itself.
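How Wikiapiary actually collects its data is not described here. As a minimal sketch of the idea, and assuming the wiki exposes the standard MediaWiki API, the same kind of information can be requested with a siteinfo query. The function names below are hypothetical helpers, and the network call itself is left out so that the example stays self-contained.

```python
import json
from urllib.parse import urlencode


def siteinfo_url(api_url):
    """Build a siteinfo query URL for a wiki's api.php endpoint."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general|extensions|skins",
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def summarize_siteinfo(response_text):
    """Pick the MediaWiki version, extensions and skins out of a JSON reply."""
    query = json.loads(response_text)["query"]
    return {
        # e.g. "MediaWiki 1.21.1"
        "generator": query["general"]["generator"],
        # Not every extension reports a version, hence the fallback.
        "extensions": {
            ext["name"]: ext.get("version", "unknown")
            for ext in query.get("extensions", [])
        },
        "skins": [skin["code"] for skin in query.get("skins", [])],
    }
```

Fetching the URL returned by siteinfo_url for each known wiki and storing the summaries would give roughly the per-wiki picture described above. Ranking extensions by popularity is then a matter of counting names across wikis.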

More importantly, Wikiapiary can show the stats from the point of view of the software: for example, here is the page about Semantic Result Formats:

Plot of Semantic Result Formats usage statistics

Some interesting facts that can be found here:

  • some of the extensions are not as popular as we thought them to be. For example, Semantic MediaWiki is installed in only 10 percent of the wikis
  • Many wiki admins don’t upgrade their software very often: a pretty interesting fact in the world of WMF, where nobody gives a damn about being compatible with as many versions of MediaWiki as possible.


Why do I claim that Wikiapiary is a super useful website?

  • For a start, it provides a list of extensions and skins ranked by popularity, which has never been done before. It’s important for me as a wiki admin
  • Secondly, it creates a connection between the developers of the extensions and their users. As a programmer I can monitor how popular my extension is, ask my users to upgrade their versions in case of security risks, and maybe propose custom services to them
  • Thirdly, it creates a repository of MediaWiki websites, based on which it’s possible to organize contests, write case studies, do research and share experience about proper wiki gardening.

I think that Jamie Thingelstad, the author of Wikiapiary, is doing a great job and I’ll try to help him in any way I can.

Export and import uploads in MediaWiki

I’ve been searching for a way to export all the uploaded files in MediaWiki for about two days and now I know how to do it!

Here is the problem: the files in MediaWiki have their pages in the File: namespace. All the files are stored in the images directory. If you simply copy that directory from one MediaWiki installation to another, you will not see those files in the latter. A bigger problem is the error messages you’ll get when you try to insert a file that is present in the directory but not in the MediaWiki database.

Here is the right way to dump MediaWiki uploads:

1. In the SOURCE wiki, build the list of files you want to dump. It’s usually done with the Special:AllPages page, where you pick the File namespace. Copy the list and use some Vim magic to deal with whitespace and tabs, turning them into line breaks.


2. Suppose that the list of uploads is in the file ~/filelist on your server

3. Exporting:

php maintenance/dumpBackup.php --current --pagelist="$HOME/filelist" --uploads > ~/dmp

4. Importing:

php maintenance/importDump.php ~/dmp --uploads
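If Vim is not your thing, the list clean-up from step 1 can also be scripted. This is only a sketch under a few assumptions: the text copied from Special:AllPages separates titles with tabs or line breaks, the list should hold full titles including the File: prefix, and single spaces belong to the titles themselves. The function names are made up for this example.

```python
import re


def titles_from_allpages(copied_text, namespace="File"):
    """Turn text copied from Special:AllPages into a list of page titles."""
    prefix = namespace + ":"
    titles = []
    seen = set()
    # Assumption: titles are separated by tabs or line breaks. Single spaces
    # are kept, because a title such as "My photo.jpg" may contain them.
    for chunk in re.split(r"[\t\r\n]+", copied_text):
        title = chunk.strip()
        if not title:
            continue
        # Assumption: the list needs the namespace prefix on every title.
        if not title.startswith(prefix):
            title = prefix + title
        if title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


def write_filelist(copied_text, path):
    """Write one title per line to path and return the number of titles."""
    titles = titles_from_allpages(copied_text)
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(titles) + "\n")
    return len(titles)
```

Paste the copied list into a string (or read it from a temporary file), call write_filelist with the path of your file list, and continue with the export step.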

I hope that was useful!

Page Schemas screencast

Page Schemas is a great extension for Semantic MediaWiki that allows us to generate the structure of our websites from an XML description called a schema. I fell in love with this extension long before it was implemented (back then Yaron wanted to call it Semantic Schemas).

So, Page Schemas works like this:

  1. you go to a category page;
  2. you define all the properties, templates and forms that will be used for the pages of this category – this is called the schema;
  3. you push the Generate button and enjoy the result;
  4. if you need to make improvements in the structure of your schema (like adding some fields, changing the allowed values for properties or modifying the template), you edit the previously defined schema and re-generate all the pages.

Page Schemas looks like the Create a class special page, but:
  • It’s more powerful and beautiful: you can define multiple templates for forms (remember the ‘Add another’ button?) and define the parameters of the fields (like ‘values from category’)
  • It allows you to add and remove fields and templates without all that tedious mucking about in the MediaWiki editor
  • It has a very nice pluggable architecture and you can connect your own extension to it


Fighting spam in MediaWiki

Cleaning up spam over and over has resulted in a little tutorial and a presentation I want to show.

Ok, so here is some advice that alone can improve your spam situation.

  1. Healthy community. If your wiki has enough users who care about the content, they will clean up the spam manually.
  2. Heuristics and little tricks. For example, you can set a waiting period for all newly registered users. That will help because the majority of bots register and immediately write something.
  3. QuestyCaptcha instead of any other captcha.
  4. If nothing helps, use the behavior analysis tool called AbuseFilter.