This Dashboard is a part of the Wikidata Concepts Monitor (WDCM). The WDCM system provides analytics on Wikidata usage across the Wikimedia sister projects. The WDCM Semantics Dashboard is probably the central and analytically the most complicated of all WDCM Dashboards. Here we provide only the basics of distributional semantics needed to understand the results of semantic topic modeling presented on this WDCM dashboard. A user who needs to dive deep into the similarity structures between the Wikimedia sister projects with respect to their Wikidata usage patterns will most probably have to do some additional reading first. However, the Dashboard simplifies the presentation of the results as much as possible to make them accessible to any Wikidata user or Wikipedia editor who is not necessarily involved in Data or Cognitive Science. Reading through the WDCM Semantic Topic Models section on this page is highly advised for anyone who has never met semantic topic models or distributional semantics before. Before that, our next stop: Definitions.
N.B. The current Wikidata item usage statistic is defined as the number of pages in a particular client project where the respective Wikidata item is used. Thus, the current definition ignores the usage aspects completely. This definition is motivated by the present constraints in Wikidata usage tracking across the client projects (see Wikibase/Schema/wbc entity usage). As Wikidata usage tracking systems mature, the definition will be subject to change. The term Wikidata usage volume is reserved for total Wikidata usage (i.e. the sum of usage statistics) in a particular client project, group of client projects, or semantic category. By a Wikidata semantic category we mean a selection of Wikidata items that is operationally defined by a respective SPARQL query returning a selection of items that intuitively match a human, natural semantic category. The structure of Wikidata does not necessarily match any intuitive human semantics. In WDCM, an effort is made to select the semantic categories so that they match the intuitive, everyday semantics as much as possible, in order to assist anyone involved in analytical work with this system. However, the choice of semantic categories in WDCM is not necessarily exhaustive (i.e. the categories do not necessarily cover all Wikidata items), nor are the categories necessarily mutually exclusive. The Wikidata ontology is very complex and the product of the work of many people, so there is an optimization price to be paid in every attempt to adapt or simplify its present structure to the needs of a statistical analytical system such as WDCM. The current set of WDCM semantic categories is thus not normative in any sense and is subject to change at any moment, depending upon the analytical needs of the community.
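The two definitions above can be illustrated with a small sketch. The records, page names, and function names below are hypothetical and exist only for illustration; they are not the WDCM implementation or its data schema. The sketch assumes that usage is available as (project, page, item) records, counts distinct pages per item to get the usage statistic, and sums those counts to get the usage volume of a project.

```python
from collections import defaultdict

# Hypothetical usage records: (client project, page, Wikidata item).
# Each record says that the item is used on that page of that project.
records = [
    ("enwiki", "Page_A", "Q42"),
    ("enwiki", "Page_A", "Q42"),  # a repeated use on the same page counts once
    ("enwiki", "Page_B", "Q42"),
    ("enwiki", "Page_B", "Q5"),
    ("frwiki", "Page_C", "Q42"),
]


def usage_statistics(records):
    """Return {(project, item): number of distinct pages using the item}."""
    pages = defaultdict(set)
    for project, page, item in records:
        pages[(project, item)].add(page)
    return {key: len(page_set) for key, page_set in pages.items()}


def usage_volume(stats, project):
    """Sum the usage statistics of all items in one client project."""
    return sum(count for (proj, _item), count in stats.items() if proj == project)


stats = usage_statistics(records)
print(stats[("enwiki", "Q42")])       # usage statistic of Q42 in enwiki
print(usage_volume(stats, "enwiki"))  # usage volume of enwiki
```

Because only distinct pages are counted, how an item is used on a page (the usage aspect) plays no role, which mirrors the limitation noted in the definition.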
The currently used WDCM Taxonomy of Wikidata items encompasses the following 14 semantic categories: Geographical Object, Organization, Architectural Structure, Human, Wikimedia, Work of Art, Book, Gene, Scientific Article, Chemical Entities, Astronomical Object, Thoroughfare, Event, and Taxon.
While Wikidata itself is a semantic ontology with pre-defined and evolving normative rules of description and inference, Wikidata usage is essentially a social, behavioral phenomenon, suitable for study by means of machine learning in the field of distributional semantics: the analysis and modeling of statistical patterns of occurrence and co-occurrence of Wikidata item and property usage across the client projects (e.g. enwiki, frwiki, ruwiki, etc). WDCM thus employs various statistical approaches (e.g. topic modeling, clustering, and dimensionality reduction, all beyond elementary descriptive statistics of Wikidata usage, of course) in an attempt to describe the observable Wikidata usage statistics and provide insights from them.
Wikidata Usage Patterns. The “golden line” that connects the reasoning behind all WDCM functions can be described non-technically as follows. Imagine observing the number of times each item in a set of N particular Wikidata items was used across some project (enwiki, for example). Imagine having the same data for other projects as well: for example, if 200 projects are under analysis, then we have 200 counts for each of the N items in the set, and the data can be described by an N x 200 matrix (items x projects). Each column of counts, a vector giving the frequency of occurrence of all Wikidata entities under consideration in one of the 200 projects, represents a particular Wikidata usage pattern. By inspecting and statistically modeling the usage pattern matrix (a matrix that encompasses all such usage patterns across the projects), or the derived covariance/correlation matrix, many insights into the similarities between Wikimedia projects (or, more precisely, the similarities between their usage patterns) can be found.
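As a minimal sketch of the idea, the following example builds a tiny items x projects matrix and derives the correlation between the usage patterns (columns) of each pair of projects. The counts are made up, the item and project lists are arbitrary, and the helper names are hypothetical; Pearson correlation is used here only as one plausible way to obtain the derived correlation matrix mentioned above.

```python
from math import sqrt

# Made-up usage counts: rows are items, columns are projects.
items = ["Q42", "Q5", "Q64", "Q90"]
projects = ["enwiki", "frwiki", "ruwiki"]
usage = [
    [10, 20, 1],
    [20, 40, 3],
    [30, 60, 2],
    [40, 80, 0],
]


def column(matrix, j):
    """Extract column j: the usage pattern of one project."""
    return [row[j] for row in matrix]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors.

    Assumes both vectors have non-zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# One usage pattern (vector of item counts) per project.
patterns = {p: column(usage, j) for j, p in enumerate(projects)}

# Project x project correlation matrix, stored as a dictionary.
corr = {(p, q): pearson(patterns[p], patterns[q])
        for p in projects for q in projects}

for p in projects:
    print(p, [round(corr[(p, q)], 2) for q in projects])
```

In this toy data frwiki uses every item exactly twice as often as enwiki, so their usage patterns correlate perfectly even though their usage volumes differ; correlation compares the shape of the patterns, not their size.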
In essence, the technology and mathematics behind WDCM rely on the same set of practical tools and ideas that support the development of semantic search engines and recommendation systems, only applied to a specific dataset that encompasses the usage patterns for tens of millions of Wikidata entities across its client projects.
Each of the 14 currently used semantic categories in the WDCM Taxonomy of Wikidata items receives a separate topic model. Each topic model encompasses two or more topics, or semantic themes. Here you can select a semantic category (e.g. "Geographical Object", "Human") and a particular topic from its model. The page will produce three outputs: (1) the Top 50 items in this topic chart, which presents the 50 most important items in the selected topic of the selected category's topic model, (2) the Topic similarity network, which presents the similarity structure among the 50 most important items in the selected topic, and (3) the Top 50 projects in this topic chart, which presents the 50 Wikimedia projects in which the selected topic plays a prominent role in the selected semantic category.
Make a selection of Wikimedia projects here and hit Apply Selection. The Dashboard will produce a series of charts, one for each Wikidata semantic category that is present in your selection of projects, and compute the relative importance (%) of each topic in the given selection for each semantic category. Do not forget that category-specific semantic models do not necessarily encompass the same number of topics (in fact, they rarely do); also, Topic n in one category is obviously not the same thing as Topic n in some other category.
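The text does not spell out how the relative importance of a topic is computed, so the following is only one plausible sketch under stated assumptions: each project has a vector of topic weights for a given semantic category (the values below are made up), the weights are summed over the selected projects, and the sums are rescaled to percentages. The function and variable names are hypothetical.

```python
# Hypothetical project-by-topic weights for ONE semantic category.
# Each list holds the weights of Topic 1, Topic 2, Topic 3 in that project.
topic_weights = {
    "enwiki": [0.6, 0.3, 0.1],
    "frwiki": [0.2, 0.5, 0.3],
    "ruwiki": [0.1, 0.1, 0.8],
}


def relative_topic_importance(topic_weights, selection):
    """Return the share (%) of each topic within the selected projects.

    Assumes a non-empty selection and that all projects in the category
    have the same number of topics.
    """
    n_topics = len(topic_weights[selection[0]])
    totals = [0.0] * n_topics
    for project in selection:
        for k, weight in enumerate(topic_weights[project]):
            totals[k] += weight
    grand_total = sum(totals)
    return [100.0 * total / grand_total for total in totals]


shares = relative_topic_importance(topic_weights, ["enwiki", "frwiki"])
for k, share in enumerate(shares, start=1):
    print(f"Topic {k}: {share:.1f}%")
```

The computation is carried out separately for each semantic category, since each category has its own topic model and the topic indices are not comparable across categories.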
Upon the selection of a semantic category, the Dashboard will present a 2D map that represents the similarities between the Wikimedia projects, computed from the selected category's semantic model only. Here you can learn how similar or dissimilar the sister projects are with respect to their usage of Wikidata items from a single semantic category.
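A 2D map of this kind is typically drawn from pairwise distances between projects. The sketch below shows only that first step, under loud assumptions: each project is described by a made-up distribution over the topics of one category, and the Hellinger distance is chosen purely for illustration. The text does not state which distance or which dimensionality reduction method the Dashboard actually uses.

```python
from math import sqrt

# Hypothetical topic distributions of three projects in ONE semantic category.
topic_mix = {
    "enwiki": [0.7, 0.2, 0.1],
    "dewiki": [0.7, 0.2, 0.1],
    "ruwiki": [0.1, 0.2, 0.7],
}


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Ranges from 0 (identical) to 1 (no overlap at all).
    """
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)


projects = sorted(topic_mix)

# Symmetric project x project distance matrix, stored as a dictionary.
distance = {(p, q): hellinger(topic_mix[p], topic_mix[q])
            for p in projects for q in projects}

for p in projects:
    print(p, [round(distance[(p, q)], 3) for q in projects])
```

A dimensionality reduction method would then place the projects on a plane so that projects with small distances appear close together; in this toy example enwiki and dewiki would coincide, with ruwiki far from both.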