Wikidata SPARQL Logs

From International Center for Computational Logic

Access logs from the Wikidata SPARQL Query Service

This page describes multiple files with anonymised logs of several hundred million SPARQL queries from the Wikidata SPARQL endpoint that accompany the publication

Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt:
Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph.
In Proceedings of the 17th International Semantic Web Conference (ISWC-18), Springer 2018.

Further related publications can be found in the publications tab. The following datasets are currently available. Details on how this data was created are explained below. We also offer a sample snippet that illustrates the structure of the files.

Interval    First day   Last day    Queries     Download (tsv.gz)                          Size
Interval 1  2017-06-12  2017-07-09  59,554,358  All queries, success (HTTP code 200)       2.7G
                                    191,295     Organic queries, success (HTTP code 200)   5.7M
Interval 2  2017-07-10  2017-08-06  70,338,733  All queries, success (HTTP code 200)       2.7G
                                    196,601     Organic queries, success (HTTP code 200)   5.7M
Interval 3  2017-08-07  2017-09-03  78,273,973  All queries, success (HTTP code 200)       2.9G
                                    250,037     Organic queries, success (HTTP code 200)   7.7M

All of the above are published under License CC-0, which minimises legal obstacles in re-use. The authors believe that the good scientific practice of acknowledging related work with a citation does not need to be enforced legally.

What is in this dataset?

The dataset consists of several files, each containing SPARQL logs from a specific time interval, complete with SPARQL query, timestamp, and user agent information. Queries are anonymised as described below, but remain valid SPARQL queries. Files are in gzipped tab-separated values format, each containing the following columns:

  1. Anonymised query: The original query, reformatted and processed for reducing identifiability. This string is URL-encoded.
  2. Timestamp: The exact time (timezone GMT) of the request, in ISO format.
  3. Source category: This field indicates whether we believe that the query was issued by an automated process. This is true for all queries that came from non-browser agents, and in addition for some queries that used a browser-like agent. The field specifies the classification into robotic and organic traffic as explained in the paper.
  4. User agent: A simplified/anonymised version of the user agent string that was used with the request. It is simply "browser" for all browser-like agents, and might be slightly more specific for bot-like agents (e.g. "PHP" or "curl"). See below.
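
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the names `LogRecord`, `read_records` and `read_log_file` are hypothetical, and it assumes that the URL-encoding encodes spaces as "+" (hence `unquote_plus`) and that a header row, if the files have one, is skipped via `has_header`.

```python
import csv
import gzip
from typing import Iterable, Iterator, NamedTuple
from urllib.parse import unquote_plus


class LogRecord(NamedTuple):
    query: str            # decoded, anonymised SPARQL query
    timestamp: str        # ISO timestamp, kept as a string
    source_category: str  # classification into robotic and organic traffic
    user_agent: str       # simplified agent, e.g. "browser"


def read_records(lines: Iterable[str], has_header: bool = False) -> Iterator[LogRecord]:
    """Parse tab-separated log lines into records, decoding the query column."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # assumption: first line names the columns
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed lines instead of failing
        query, timestamp, category, agent = row
        yield LogRecord(unquote_plus(query), timestamp, category, agent)


def read_log_file(path: str, has_header: bool = False) -> Iterator[LogRecord]:
    """Stream records from one gzipped TSV file without loading it fully."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from read_records(handle, has_header)
```

Because the full files are several gigabytes each, the sketch streams records one at a time rather than building a list.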

Overall, the data amounts to around 200 million requests. Even after removing all queries that we believe were sent by bots, more than 650,000 queries remain (labelled "organic" above; these files are excerpts from the complete set). The queries are very diverse in terms of size and structure.

Where does this data come from?

Wikidata, the knowledge base of Wikimedia, collects a large amount of structured knowledge across all Wikimedia projects and languages. Since 2015, a query service has been available to retrieve and analyze this data using the SPARQL query language. Since the data is rich and the query language is powerful, many complex questions can be asked and answered in this way. The service is used not only by individual power users but also by applications inside and outside of Wikimedia, which issue a large number of queries to provide users with the information they request.

How was the data created?

All source code used for generating the data is published in a dedicated git repository.

Anonymised query

The query strings were processed to remove potentially identifying information as far as possible, and to reduce spurious signals that could be used to reconstruct user traces. The following steps were performed:

  • Stage 1: A SPARQL programming library (OpenRDF) was used to transform the original query string into an object model. If this fails (invalid query), the query is dropped completely. We do not publish any information about invalid requests.
  • Stage 2: The structure of the parsed SPARQL query was modified:
    • All comments were removed
    • All string literals in the query were replaced by placeholders of the form "stringnumber" that have no relationship to the original string (we simply enumerate the strings as found in the query).
      • The same string was uniformly replaced by the same placeholder within each query, but the same string occurring in different queries was usually not replaced by the same placeholder.
      • The only exceptions are very short strings of at most 10 characters, strings that represent a number, lists of language tags in language service calls (e.g., "en,de,fr"), and a small number of explicitly whitelisted strings that are used to configure the query service (e.g., the string "" that instructs BlazeGraph to do a breadth-first search). These strings were preserved.
    • All variable names were replaced by generated variable names "varnumber" or "varnumberLabel"
      • As with strings, replacement was uniform within each query.
      • The ending "Label" was preserved, since BlazeGraph has a special handling for such variables.
    • All geographic coordinates were rounded to the next full degree (latitude and longitude). This was also done with coordinates in the alternative, more detailed format, where latitude and longitude are separate numerical values.
  • Stage 3: The modified query was converted back into a string
    • All formatting details (whitespace, indentation, ...) were standardized in this process
    • No namespace abbreviations are used in the generated query, and no namespace declarations are given.

Example: The well-known example query for the 10 largest cities with a female mayor:

#Largest cities with female mayor
#added before 2016-10
#TEMPLATE={"template":"Largest ?c with ?sex head of government","variables":{"?sex":{"query":" SELECT ?id WHERE { ?id wdt:P31 wd:Q48264 .  } "},"?c":{"query":"SELECT DISTINCT ?id WHERE {  ?c wdt:P31 ?id.  ?c p:P6 ?mayor. }"} } }
SELECT DISTINCT ?city ?cityLabel ?mayor ?mayorLabel
WHERE
{
 BIND(wd:Q6581072 AS ?sex)
 BIND(wd:Q515 AS ?c)
 ?city wdt:P31/wdt:P279* ?c .  # find instances of subclasses of city
 ?city p:P6 ?statement .            # with a P6 (head of government) statement
 ?statement ps:P6 ?mayor .          # ... that has the value ?mayor
 ?mayor wdt:P21 ?sex .       # ... where the ?mayor has P21 (sex or gender) female
 FILTER NOT EXISTS { ?statement pq:P582 ?x }  # ... but the statement has no P582 (end date) qualifier
 # Now select the population value of the ?city
 # (wdt: properties use only statements of "preferred" rank if any, usually meaning "current population")
 ?city wdt:P1082 ?population .
 # Optionally, find English labels for city and mayor:
 SERVICE wikibase:label {
   bd:serviceParam wikibase:language "en" .
 }
}
ORDER BY DESC(?population)
LIMIT 10

turns into the following normalized query, which yields the same results:

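SELECT DISTINCT ?var1  ?var1Label  ?var2  ?var2Label 
WHERE {
  BIND (  <http://www.wikidata.org/entity/Q6581072>  AS  ?var3 ).
  BIND (  <http://www.wikidata.org/entity/Q515>  AS  ?var4 ).
  ?var1 ( <http://www.wikidata.org/prop/direct/P31> / <http://www.wikidata.org/prop/direct/P279> *) ?var4 .
  ?var1  <http://www.wikidata.org/prop/P6>  ?var5 .
  ?var5  <http://www.wikidata.org/prop/statement/P6>  ?var2 .
  ?var2  <http://www.wikidata.org/prop/direct/P21>  ?var3 .
 FILTER (  NOT EXISTS  {
   ?var5  <http://www.wikidata.org/prop/qualifier/P582>  ?var6 .
 }
) .
  ?var1  <http://www.wikidata.org/prop/direct/P1082>  ?var7 .
 SERVICE  <http://wikiba.se/ontology#label>   {
    <http://www.bigdata.com/rdf#serviceParam>  <http://wikiba.se/ontology#language>  "en".
  }
}
ORDER BY  DESC( ?var7 )
LIMIT 10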
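
The uniform renaming of variables and string literals can be illustrated with a small Python sketch. This is not the published implementation, which works on the parsed object model rather than on text. The function name `anonymise`, the regular expressions, the `keep` parameter standing in for the whitelist, and the choice to start numbering at 1 are all assumptions made for illustration. The sketch only handles double-quoted literals and does not remove comments, expand namespaces, or round coordinates.

```python
import re

# Matches either a double-quoted string literal or a variable (?name / $name).
# Literals are matched first so that a "?x" inside a string is left alone.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[?$][A-Za-z_][A-Za-z0-9_]*')
NUMBER = re.compile(r'[+-]?\d+(?:\.\d+)?')


def anonymise(query: str, keep: frozenset = frozenset()) -> str:
    """Replace variables and long string literals by per-query placeholders."""
    strings = {}    # original literal -> "stringN"
    variables = {}  # original base name -> "varN"

    def replace(match):
        token = match.group(0)
        if token.startswith('"'):
            value = token[1:-1]
            # Short strings, numbers, and whitelisted strings are preserved.
            if len(value) <= 10 or NUMBER.fullmatch(value) or value in keep:
                return token
            if value not in strings:
                strings[value] = f"string{len(strings) + 1}"
            return f'"{strings[value]}"'
        name, suffix = token[1:], ""
        # Keep the "Label" ending so that ?cityLabel still pairs with ?city.
        if name.endswith("Label") and name != "Label":
            name, suffix = name[:-5], "Label"
        if name not in variables:
            variables[name] = f"var{len(variables) + 1}"
        return f"?{variables[name]}{suffix}"

    return TOKEN.sub(replace, query)
```

Because the two dictionaries are created anew on every call, the same string or variable name in two different queries is not guaranteed to receive the same placeholder, which mirrors the per-query uniformity described above.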

User agent

The user agent was set to be "browser" for all user agents that start with "Mozilla" (for example "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; chromeframe/12.0.742.100)" would be considered a "browser" for this purpose). Some additional browser-like strings were substituted manually with "browser" as well.

For requests that do not originate from browsers, the "user agent" is a coarse description of the software or tool that was used in making the request. All agent strings have been stripped of system information and overly detailed version information. Agent strings that occurred fewer than 10,000 times in a twelve-week window, or that occurred in only a single week, were always replaced by "other". A manually checked whitelist was used to decide which strings to keep.
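
The rules above can be summarised in a short Python sketch. It is an illustration under stated assumptions, not the code used to produce the data: the function `simplify_agent` and its parameters are hypothetical, the agent string is assumed to be already stripped of system and version details, `weekly_counts` is assumed to hold the number of occurrences in each week of the twelve-week window, and agents are assumed to be kept only if they appear in the whitelist. The manual substitution of additional browser-like strings is not modelled.

```python
from typing import Collection, Sequence


def simplify_agent(agent: str,
                   weekly_counts: Sequence[int],
                   whitelist: Collection[str]) -> str:
    """Map a (pre-stripped) user agent string to its published form."""
    # Everything that starts with "Mozilla" is treated as a browser.
    if agent.startswith("Mozilla"):
        return "browser"
    total = sum(weekly_counts)
    active_weeks = sum(1 for count in weekly_counts if count > 0)
    # Rare agents, and agents seen in a single week only, become "other".
    if total < 10_000 or active_weeks <= 1:
        return "other"
    # Assumption: frequent agents are kept only if manually whitelisted.
    return agent if agent in whitelist else "other"
```

For example, under these assumptions a whitelisted agent "curl" seen 5,000 times in each of twelve weeks stays "curl", whereas an agent that appears 20,000 times in one week and never again becomes "other".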

Source category

The source category field is "robotic" if we believe that the source of the query was a bot (i.e., some automated software tool issuing large numbers of queries without human intervention). This is the case if the user agent was not a browser, or if the query traffic pattern was very unnatural (e.g., millions of similar queries in one hour). This corresponds to the classification into robotic and organic traffic as explained in the paper.

This field is there for convenience and only makes explicit how we interpreted the logs. As shown in our publications, organic queries are only a tiny fraction of all queries, but at the same time are structurally more diverse. In contrast, robotic queries contain many trivial queries generated automatically. For some research, the organic queries might therefore be of special interest.
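
To see how the two categories are distributed in a given file, one can tally the third column. The sketch below is a hedged illustration: `count_by_category` is a hypothetical helper, and it assumes four tab-separated columns per line with the category values spelled "robotic" and "organic".

```python
from collections import Counter
from typing import Iterable


def count_by_category(lines: Iterable[str]) -> Counter:
    """Count log lines per source category (third tab-separated column)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4:  # ignore malformed lines
            counts[fields[2]] += 1
    return counts
```

Passing an open text handle of a decompressed log file as `lines` gives the per-category totals, from which the organic share follows as `counts["organic"] / sum(counts.values())`.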

Proceedings Articles

Adrian Bielefeldt, Julius Gonsior, Markus Krötzsch
Practical Linked Data Access via SPARQL: The Case of Wikidata
In Tim Berners-Lee, Sarven Capadisli, Stefan Dietze, Aidan Hogan, Krzysztof Janowicz, Jens Lehmann, eds., Proceedings of the WWW2018 Workshop on Linked Data on the Web (LDOW-18), volume 2073 of CEUR Workshop Proceedings, 2018.

Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt
Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph
In Denny Vrandečić, Kalina Bontcheva, Mari Carmen Suárez-Figueroa, Valentina Presutti, Irene Celino, Marta Sabou, Lucie-Aimée Kaffee, Elena Simperl, eds., Proceedings of the 17th International Semantic Web Conference (ISWC'18), LNCS, to appear. Springer

Talks and Miscellaneous

Markus Krötzsch
Getting the most out of Wikidata
Invited presentation at Wiki Workshop 2018, 2018