Silk-book

 

Book assembled from https://www.assembla.com/spaces/silk/wiki/ on 07-Feb-2013 by vladimir.alexiev@ontotext.com

About Silk

The Silk Link Discovery Framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web.

The official homepage of Silk can be found at http://www4.wiwiss.fu-berlin.de/bizer/silk/.

Silk Workbench

Introduction

Silk Workbench is a web application which guides the user through the process of interlinking different data sources.

The Silk Workbench offers the following features:

  • It enables the user to manage different sets of data sources and linking tasks.
  • It offers a graphical editor which enables the user to easily create and edit link specifications.
  • As finding good linking heuristics is usually an iterative process, the Silk Workbench makes it possible for the user to quickly evaluate the links which are generated by the current link specification.
  • It allows the user to create and edit a set of reference links used to evaluate the current link specification.

The Silk Workbench provides the following components:

  • Workspace Browser: Enables the user to browse the projects in the workspace. Linking Tasks can be loaded from a project and committed back to it later.
  • Linkage Rule Editor: A graphical editor which enables the user to easily create and edit link specifications. The widget shows the current link specification in a tree view and allows editing using drag-and-drop.
  • Evaluation: Allows the user to execute the current Link Specification. The links are displayed on the fly while they are generated. For generated links whose correctness is not specified by the reference link set, the user may confirm or decline their correctness. The user may also request a detailed summary of how the similarity score of a specific link is composed.

The typical workflow of creating a new Link Specification consists of:

  1. The user opens an existing Linking Task from the Workspace or creates a new Linking Task.
  2. The user uses the Linkage Rule Editor to refine the current Linkage Rule.
  3. The output of the Link Specification is evaluated against the reference links using the Evaluation component.
  4. If all links are correct, the user commits the Link Specification to the Workspace. If some links are wrong, the user proceeds with the next step.
  5. The user confirms or declines the correctness of a number of links.

Workspace

The workspace allows users to manage the data sources and linking tasks defined for each project. The Workspace Browser shows a tree view of all Projects in the current Workspace.

Projects

A project holds the following information:

  1. All URI prefixes which are used in the project.
  2. A list of data sources.
  3. A list of linking tasks.

Users are able to create new projects or import existing ones. Existing projects can be deleted or exported to a single file.

Data sources

A data source holds all information that is needed by Silk to retrieve entities from it. Users can add new data sources and edit their properties.

The following properties can be edited:

  • Endpoint URI: The URI of the SPARQL endpoint.
  • Graph URI: Only retrieve instances from a specific graph.
  • Retry count: To recover from intermittent SPARQL endpoint connection failures, the ‘retryCount’ parameter specifies the number of times to retry connecting.
  • Retry pause: To recover from intermittent SPARQL endpoint connection failures, the ‘retryPause’ parameter specifies how long to wait between retries.
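
How the two retry settings interact can be sketched as follows. This is only an illustration, not Silk's implementation: `with_retries` and `flaky_query` are hypothetical names, and the pause is assumed to be given in milliseconds.

```python
import time

def with_retries(query, retry_count=3, retry_pause_ms=1000, sleep=time.sleep):
    """Call query(); on a connection failure retry up to retry_count times."""
    attempt = 0
    while True:
        try:
            return query()
        except ConnectionError:
            if attempt >= retry_count:
                raise  # all retries are used up
            attempt += 1
            sleep(retry_pause_ms / 1000.0)  # wait before the next attempt

# Hypothetical endpoint call that fails twice before it succeeds.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("endpoint unavailable")
    return ["result"]

pauses = []  # records the pauses instead of really sleeping
rows = with_retries(flaky_query, retry_count=3, retry_pause_ms=10, sleep=pauses.append)
```

With a retry count of 3 the query is attempted at most four times in total; here it succeeds on the third attempt after two pauses.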

Linking Tasks

A linking task consists of the following elements:

  1. Metadata
  2. A link specification
  3. Positive and negative reference links

Linking Tasks can be added to an existing project and removed from it. Clicking on Metadata opens a dialog to edit the metadata of a linking task.

The following properties can be edited:

  • Name: The unique name of the linking task.
  • Source: The source data set.
  • Source restriction: Restricts the source data set using SPARQL clauses.
  • Target: The target data set.
  • Target restriction: Restricts the target data set using SPARQL clauses.
  • Link type: The type of the generated links, e.g. owl:sameAs.

Clicking on the open button opens the Link Specification Editor.

Linkage Rule Editor

The Linkage Rule Editor allows users to edit linkage rules graphically. Linkage rules are created as a tree, resembling the Silk-LSL, by dragging and dropping the rule elements.

The editor is divided into two parts:

  • The left pane contains the most frequently used property paths for the given data sets and restrictions. It also contains a list of all Silk operators (transformations, comparators and aggregators) as draggable elements.
  • The right part (editor pane) allows for drawing the flow chart by combining the chosen elements.

Editing

  • Drag elements from the left pane to the editor pane
  • Connect the elements by drawing connections from and to the element endpoints (dots to the left and right of the element box)
  • Build a flow chart by connecting the elements, ending in one single element (either a comparison or aggregation)

The editor will guide the user in building the flow chart by highlighting connectable elements when drawing a new connection line.

Property Paths

Property paths for both data sources to be interlinked are loaded on the left pane and added in the order of their frequency in the data source.

Users can also add custom paths by dragging the (custom path) element to the editor pane and editing the path.

Operators

Operator panes for the transformations, comparators and aggregators are shown below the property paths. Hovering over an operator element shows more information about it.

Threshold

Threshold defines the minimum similarity of two data items which is required to generate a link between them. Please provide values between 0 and 1.

Link Limit defines the number of links originating from a single data item. Please choose between 1 and n (unlimited).
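
The combined effect of the two settings can be sketched as below. This is a minimal illustration under stated assumptions: `filter_links` is a hypothetical helper, links at exactly the threshold are kept, and when the limit applies the highest-scoring links of each data item are retained.

```python
from collections import defaultdict

def filter_links(links, threshold, link_limit=None):
    """links: iterable of (source, target, similarity) tuples."""
    by_source = defaultdict(list)
    for source, target, similarity in links:
        if similarity >= threshold:  # drop links below the minimum similarity
            by_source[source].append((source, target, similarity))
    kept = []
    for source in sorted(by_source):
        best_first = sorted(by_source[source], key=lambda link: link[2], reverse=True)
        kept.extend(best_first[:link_limit])  # link_limit=None means unlimited
    return kept

candidates = [
    ("a1", "b1", 0.95), ("a1", "b2", 0.92), ("a1", "b3", 0.40),
    ("a2", "b4", 0.91),
]
result = filter_links(candidates, threshold=0.9, link_limit=1)
```

With a threshold of 0.9 and a link limit of 1, only the best link of `a1` and the single link of `a2` remain.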

As finding good linking heuristics is usually an iterative process, the Silk Workbench allows the user to quickly evaluate the links which are generated by the current link specification.

After clicking on the Start button, the linking engine starts to generate links in the background. The view is updated whenever new links have been found, so that it always shows all generated links. For further examination, a drill-down view can be shown by clicking on a link.

The drill-down shows a detailed summary of how the individual comparisons and aggregations contribute to the overall similarity of the two linked entities. This information allows the user to spot parts of the similarity evaluation which did not behave as expected.

Based on its correctness, each link is assigned to one of the following three categories:

  • correct Confirms the link as correct. Confirmed links are part of the positive reference link set.
  • correctness Contains links whose correctness is not decided, i.e. which are not contained in the reference link sets.
  • incorrect Confirms the link as incorrect. Incorrect links are part of the negative reference link set.

In order to evaluate the correctness of a link specification, the user typically wants to inspect links which are close to the similarity threshold. Clicking on the confidence header sorts all links by their confidence, which makes these links easy to find. In addition, a filter can be added so that only links which contain a specific string are shown.

In order to iteratively increase the quality of a link specification, the workbench holds a set of reference links whose correctness has been confirmed or declined by the user.

Reference Links may be imported and exported in the Ontology Alignment Format specified at http://alignapi.gforge.inria.fr/format.html.

Learning Linkage Rules

The Silk Workbench supports learning linkage rules. In order to be able to learn linkage rules, the current linking task must contain reference links.

Usually reference links can be added in two ways:

After reference links have been added, the learning can be started by pressing the start button. While learning, the current population of candidate linkage rules is displayed. At any time, the learning can be stopped by pressing the stop button. The learning will stop automatically as soon as either the full f-Measure is reached or the maximum number of iterations has been exceeded.

As soon as the learning has finished, the user can view the learned linkage rules and select a linkage rule for loading it into the editor.

Basic Concepts

This section introduces the basic concepts in link discovery such as data sources, linkage rules and reference alignments.

Data Sources

Overview

Data sources hold the access parameters to local or remote SPARQL endpoints or RDF files. The defined data sources may later be referred to and used by their ID. Data sources can be defined using either the API or XML.

Available Data Sources

SPARQL Endpoint Data Source Definitions

For SPARQL endpoints (dataSource type: sparqlEndpoint) the following parameters exist:

Parameter | Description | Default
endpointURI | The URI of the SPARQL endpoint. |
login | Login required for authentication. | No login
password | Password required for authentication. | No password
instanceList | A list of instances to be retrieved. If not given, all instances will be retrieved. Multiple instances can be separated by a space. | Retrieve all instances
pageSize | Limits each SPARQL query to a fixed number of results. Silk implements a paging mechanism which translates the pageSize parameter into SPARQL LIMIT and OFFSET clauses. | 1000
graph | Only retrieve instances from a specific graph. |
pauseTime | To allow rate-limiting of queries to public SPARQL servers, the pauseTime parameter specifies the number of milliseconds to wait between subsequent queries. | 0
retryCount | To recover from intermittent SPARQL endpoint connection failures, the retryCount parameter specifies the number of times to retry connecting. | 3
retryPause | Specifies how long to wait between retries. | 1000
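
The paging described for pageSize can be sketched roughly as follows. This is a simplified illustration rather than Silk's code: `fetch_all` and `fake_endpoint` are hypothetical, and the rule "stop at the first page shorter than the page size" is an assumption.

```python
import time

def fetch_all(run_query, base_query, page_size=1000, pause_time_ms=0, sleep=time.sleep):
    """Retrieve all results page by page using LIMIT and OFFSET clauses."""
    results, offset = [], 0
    while True:
        page = run_query(f"{base_query} LIMIT {page_size} OFFSET {offset}")
        results.extend(page)
        if len(page) < page_size:  # a short page means nothing is left
            return results
        offset += page_size
        sleep(pause_time_ms / 1000.0)  # rate limiting between queries

# Hypothetical stand-in for a SPARQL endpoint holding 25 results.
DATA = [f"http://example.org/item/{i}" for i in range(25)]
issued = []

def fake_endpoint(query):
    issued.append(query)
    words = query.split()
    limit, offset = int(words[-3]), int(words[-1])
    return DATA[offset:offset + limit]

items = fetch_all(fake_endpoint, "SELECT ?s WHERE { ?s ?p ?o }", page_size=10)
```

With a page size of 10, the 25 results are fetched with three queries (offsets 0, 10 and 20).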

Example (XML)

<DataSource id="dbpedia" type="sparqlEndpoint">
    <Param name="endpointURI" value="http://dbpedia.org/sparql" />
    <Param name="retryCount" value="100" />
</DataSource>

Example (Scala API)

Note that all parameters except the endpoint URI are optional and can be left out.

Source("dbpedia",
  SparqlDataSource(
    endpointURI = "http://dbpedia.org/sparql",
    login = "user",
    password = "password",
    graph = "http://dbpedia.org",
    pageSize = 1000,
    pauseTime = 0,
    retryCount = 3,
    retryPause = 1000
  )
)

RDF File Data Source Definitions

For RDF files (dataSource type: file) the following parameters exist:

Parameter | Description | Default
file (mandatory) | The location of the RDF file. |
format (mandatory) | The format of the RDF file. Allowed values: “RDF/XML”, “N-TRIPLE”, “TURTLE”, “TTL”, “N3” |

Currently the data set is held in memory.

Example (XML)

<DataSource id="musicbrainz" type="file">
	<Param name="file" value="musicbrainz_dump.nt" />
    <Param name="format" value="N-TRIPLE" />
</DataSource>

Example (Scala API)

Source("musicbrainz",
  FileDataSource(
    file = "musicbrainz_dump.nt",
    format = "N-TRIPLE"
  )
)

Linkage Rule

A linkage rule specifies how two data items are compared for similarity. A linkage rule consists of four basic components:

  • Path Input: Retrieves values from an entity by a given RDF path, e.g. ?movie/dbpedia:director/rdfs:label
  • Transformation: Applies a data transformation to all values, e.g. lowerCase
  • Comparison: Evaluates the similarity of two inputs based on a user-defined distance measure and returns a confidence.
  • Aggregation: Aggregates multiple confidence values.

Path Input

Overview

An input retrieves all values which are connected to the entities by a specific path.

Every path statement begins with a variable (as defined in the datasets), which may be followed by a series of path elements. If a path cannot be resolved due to a missing property or a too restrictive filter, an empty result set is returned.

The following operators can be used to traverse the graph:

Operator | Name | Use | Description
/ | forward operator | <path_segment>/<property> | Moves forward from a subject resource (set) through a property to its object resource (set).
\ | reverse operator | <path_segment>\<property> | Moves backward from an object resource (set) through a property to its subject resource (set).
[ ] | filter operator | <path_segment>[<property> <comp_operator> <value>] or <path_segment>[@lang <comp_operator> <value>] | Reduces the currently selected set of resources to the ones matching the filter expression. comp_operator may be one of >, <, >=, <=, =, !=
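
The forward and reverse operators can be illustrated with a small sketch over an in-memory set of triples. Everything here is hypothetical (the `evaluate` helper, the `ex:` resources and the data); filters are left out, and a path is given as a list of steps rather than parsed from a string.

```python
# Hypothetical example data: (subject, property, object)
TRIPLES = {
    ("ex:movie1", "dbpedia:director", "ex:person1"),
    ("ex:person1", "rdfs:label", "Jane Doe"),
    ("ex:album1", "dbpedia:artist", "ex:artist1"),
    ("ex:album2", "dbpedia:artist", "ex:artist1"),
}

def evaluate(start, steps):
    """Follow steps such as [("/", "dbpedia:director"), ("/", "rdfs:label")]."""
    current = {start}
    for op, prop in steps:
        if op == "/":     # forward: from subjects to their objects
            current = {o for s, p, o in TRIPLES if p == prop and s in current}
        elif op == "\\":  # reverse: from objects back to their subjects
            current = {s for s, p, o in TRIPLES if p == prop and o in current}
        else:
            raise ValueError(f"unsupported operator: {op}")
    return current

labels = evaluate("ex:movie1", [("/", "dbpedia:director"), ("/", "rdfs:label")])
albums = evaluate("ex:artist1", [("\\", "dbpedia:artist")])
missing = evaluate("ex:movie1", [("/", "dbpedia:producer")])  # property not present
```

A path that cannot be resolved, such as the last one, yields an empty set.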

Example (XML)

<!-- Select the English label of a movie -->
<Input path="?movie/rdfs:label[@lang='en']" />

<!-- Select the label (set) of the director(s) of a movie -->
<Input path="?movie/dbpedia:director/rdfs:label" />

<!-- Select the albums of a given artist (albums have a dbpedia:artist property) -->
<Input path="?artist\dbpedia:artist[rdf:type=dbpedia:Album]" />

Example (Scala API)

// Select the English label of a movie
Path.parse("?movie/rdfs:label[@lang='en']")

// Select the label (set) of the director(s) of a movie
Path.parse("?movie/dbpedia:director/rdfs:label")

// Select the albums of a given artist (albums have a dbpedia:artist property).
// The backslash has to be escaped inside a Scala string literal.
Path.parse("?artist\\dbpedia:artist[rdf:type=dbpedia:Album]")

Transformation

Overview

As different datasets usually use different data formats, a transformation can be used to normalize the values prior to comparison.
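
What such a normalization achieves can be sketched with two simple functions applied to a set of values. The helpers below are hypothetical stand-ins named after the replace and lowerCase functions; they are not Silk's implementation.

```python
def replace(values, search, replacement):
    """Replace all occurrences of `search` in every value."""
    return [value.replace(search, replacement) for value in values]

def lower_case(values):
    """Convert every value to lower case."""
    return [value.lower() for value in values]

# Two spellings of the same label as they might appear in different datasets.
source_labels = ["Berlin_Mitte"]
target_labels = ["BERLIN Mitte"]

normalized_source = lower_case(replace(source_labels, "_", " "))
normalized_target = lower_case(replace(target_labels, "_", " "))
```

After normalization both labels are identical, so a subsequent comparison sees a perfect match.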

Parameters

TODO

Example (XML)

<TransformInput function="lowerCase">
    <TransformInput function="replace">
        <Input path="?a/rdfs:label" />
        <Param name="search" value="_" />
        <Param name="replace" value=" " />
    </TransformInput>
</TransformInput>

Example (Scala API)

TransformInput(
  id = "ReplaceUnderscores",
  transformer = ReplaceTransformer("_", " "),
  inputs = PathInput(path = Path.parse("?a/rdfs:label"))
)

Transformations

Silk provides the following transformation and normalization functions:

Function and parameters | Description
removeBlanks | Remove whitespace from a string.
removeSpecialChars | Remove special characters (including punctuation) from a string.
lowerCase | Convert a string to lower case.
upperCase | Convert a string to upper case.
capitalize(allWords) | Capitalizes the string, i.e. converts the first character to upper case. If ‘allWords’ is set to true, every word is capitalized, not only the first one. By default ‘allWords’ is set to false.
stem | Apply word stemming to the string.
alphaReduce | Strip all non-alphabetic characters from a string.
numReduce | Strip all non-numeric characters from a string.
replace(string search, string replace) | Replace all occurrences of “search” with “replace” in a string.
regexReplace(string regex, string replace) | Replace all occurrences of a regex “regex” with “replace” in a string.
stripPrefix | Strip the prefix from a string.
stripPostfix | Strip the postfix from a string.
stripUriPrefix | Strip the URI prefix (e.g. http://dbpedia.org/resource/) from a string.
concat | Concatenates strings from two inputs.
logarithm([base]) | Transforms all numbers by applying the logarithm function. Non-numeric values are left unchanged. If base is not defined, it defaults to 10.
convert(string sourceCharset, string targetCharset) | Converts the string from “sourceCharset” to “targetCharset”.
tokenize([regex]) | Splits the string into tokens. Splits at all matches of “regex” if provided and at whitespace otherwise.
removeValues(blacklist) | Removes specific values (e.g. stop words) from the value set. ‘blacklist’ is a comma-separated list of words.

Comparison

Overview

A comparison operator evaluates two inputs and computes the similarity based on a user-defined distance measure and a user-defined threshold.

The distance measure always outputs 0 for a perfect match and a higher value for an imperfect match. Only distance values between 0 and the threshold result in a positive similarity score. It is therefore important to know how the distance measures work and what the range of their output values is in order to set the threshold sensibly.

Parameters

Parameter | Description
required (optional) | If required is true, the parent aggregation only yields a confidence value if the given inputs have values for both instances.
weight (optional) | Weight of this comparison. The weight is used by some aggregations such as the weighted average aggregation.
threshold | The maximum distance. For normalized distance measures, the threshold should be between 0.0 and 1.0.
distanceMeasure | The distance measure to use. For a list of available distance measures see below.
Inputs | The two inputs for the comparison.

Example (XML)

<Compare metric="levenshteinDistance" threshold="2.0" required="true">
	<TransformInput function="lowerCase">
    	<Input path="?a/rdfs:label" />
    </TransformInput>
    <TransformInput function="lowerCase">
    	<Input path="?b/rdfs:label" />
    </TransformInput>
</Compare>

Example (Scala API)

Comparison(
  id = "labels",
  required = false,
  weight = 1,
  threshold = 2.0,
  metric = LevenshteinDistance(),
  inputs = PathInput(path = Path.parse("?a/rdfs:label")) ::
           PathInput(path = Path.parse("?b/rdfs:label")) :: Nil
)

Threshold

The threshold is used to convert the computed distance to a confidence between -1.0 and 1.0. Links will be generated for confidences above 0 while higher confidence values imply a higher similarity between the compared entities.
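
The text does not spell out the conversion formula. One mapping that is consistent with the description (distance 0 gives confidence 1, a distance equal to the threshold gives 0, larger distances become negative and are capped at -1) is the linear one sketched below. Both the formula and the handling of a zero threshold are assumptions, not a statement about Silk's implementation.

```python
def confidence(distance, threshold):
    """Map a distance to a confidence in [-1, 1] (assumed linear mapping)."""
    if threshold <= 0:
        # Assumption: with a zero threshold only perfect matches are accepted.
        return 1.0 if distance == 0 else -1.0
    return max(-1.0, 1.0 - distance / threshold)

perfect = confidence(0, 2.0)      # identical values
halfway = confidence(1, 2.0)      # half of the allowed distance
borderline = confidence(2, 2.0)   # exactly at the threshold
far_off = confidence(10, 2.0)     # far beyond the threshold
```

Under this mapping a Levenshtein distance of 1 with a threshold of 2.0 yields a confidence of 0.5.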

Distance Measures

Character-Based Distance Measures

Character-based distance measures compare strings on the character level. They are well suited for handling typographical errors.

Measure | Description | Normalized
levenshteinDistance | Levenshtein distance. The minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. | No
levenshtein | The Levenshtein distance normalized to the interval [0, 1]. | Yes
jaro | Jaro distance metric. Simple distance metric originally developed to compare person names. | Yes
jaroWinkler | Jaro-Winkler distance measure. The Jaro-Winkler distance metric is designed and best suited for short strings such as person names. | Yes
equality | 0 if strings are equal, 1 otherwise. | Yes
inequality | 1 if strings are equal, 0 otherwise. | Yes
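
As an illustration of the difference between the unnormalized and the normalized measure, the following sketch computes the Levenshtein distance with the standard dynamic-programming algorithm. Dividing by the length of the longer string is one common normalization and is assumed here; the text does not say which normalization Silk applies.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution, free if equal
            ))
        previous = current
    return previous[-1]

def normalized_levenshtein(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length (assumption)."""
    longest = max(len(a), len(b))
    return levenshtein_distance(a, b) / longest if longest else 0.0

edits = levenshtein_distance("kitten", "sitting")
ratio = normalized_levenshtein("abcd", "abcf")
```

The unnormalized value is a number of edits, so its threshold is an edit count such as 2; the normalized value lies between 0 and 1 and needs a threshold in that range.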

Example (XML)

<Compare metric="levenshteinDistance" threshold="2">
	<Input path="?a/rdfs:label" />
    <Input path="?b/gn:name" />
</Compare>

Token-Based Distance Measures

While character-based distance measures work well for typographical errors, there are a number of tasks where token-based distance measures are better suited:

  • Strings where parts are reordered, e.g. “John Doe” and “Doe, John”
  • Texts consisting of multiple words

Measure | Description | Normalized
jaccard | Jaccard distance coefficient. | Yes
dice | Dice distance coefficient. | Yes
softjaccard | Soft Jaccard similarity coefficient. Same as the Jaccard distance, but values within a Levenshtein distance of ‘maxDistance’ are considered equivalent. | Yes
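
The following sketch shows why token-based measures handle reordered parts well. It uses the textbook definitions of the Jaccard and Dice distances on token sets; that Silk computes them in exactly this way is an assumption, and the `tokens` helper is hypothetical.

```python
import re

def tokens(text):
    """Lower-case the text and split it at anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def jaccard_distance(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def dice_distance(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))

reordered = jaccard_distance(tokens("John Doe"), tokens("Doe, John"))
partial_jaccard = jaccard_distance(["a", "b", "c"], ["b", "c", "d"])
partial_dice = dice_distance(["a", "b", "c"], ["b", "c", "d"])
```

“John Doe” and “Doe, John” produce the same token set, so their distance is 0 although the strings differ character by character.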

Example (XML)

<Compare metric="jaccard" threshold="0.2">
	<TransformInput function="tokenize">
    	<Input path="?a/rdfs:label" />
    </TransformInput>
    <TransformInput function="tokenize">
    	<Input path="?b/gn:name" />
    </TransformInput>
</Compare>

Special Purpose Distance Measures

Silk offers a number of distance measures which are designed to compare specific types of data e.g. numeric values.

Measure | Description | Normalized
num(float minValue, float maxValue) | Computes the numeric difference between two numbers. Parameters: minValue, maxValue: the minimum and maximum values which occur in the data source. | No
date | Computes the distance between two dates (“YYYY-MM-DD” format). Returns the difference in days. | No
dateTime | Computes the distance between two date time values (xsd:dateTime format). Returns the difference in seconds. | No
wgs84(string unit, string curveStyle) | Computes the geographical distance between two points. Parameters: unit: the unit in which the distance is measured. Allowed values: “meter” or “m” (default), “kilometer” or “km”. | No
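
Because these measures are not normalized, the threshold has to be chosen in the unit of the measure (days, seconds, or a numeric difference). A rough sketch of the num, date and dateTime measures is given below; the function names are hypothetical, taking the absolute difference is an assumption, and the dateTime example deliberately uses values without a time zone.

```python
from datetime import date, datetime

def num_distance(a, b):
    """Absolute numeric difference between two values."""
    return abs(float(a) - float(b))

def date_distance(a, b):
    """Absolute difference in days between two "YYYY-MM-DD" strings."""
    return abs((date.fromisoformat(a) - date.fromisoformat(b)).days)

def date_time_distance(a, b):
    """Absolute difference in seconds between two date-time strings."""
    delta = datetime.fromisoformat(a) - datetime.fromisoformat(b)
    return abs(delta.total_seconds())

population_gap = num_distance("3500000", "3499000")
days = date_distance("2013-02-07", "2013-01-31")
seconds = date_time_distance("2013-02-07T10:00:00", "2013-02-07T10:01:30")
```

For the date measure, a threshold of 7 would therefore mean "at most one week apart".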

Example (XML)

<Compare metric="wgs84" threshold="50">
    <Input path="?a/wgs84:geometry" />
    <Input path="?b/wgs84:geometry" />
    <Param name="unit" value="km" />
</Compare>

Aggregation

Overview

An aggregation combines multiple confidence values into a single value. In order to determine if two entities are duplicates it is usually not sufficient to compare a single property. For instance, when comparing geographic entities, an aggregation may combine the similarity of the entities' names with the similarity based on the geographical distance between the entities.

Parameters

Required (Optional)

The required attribute can be set if the aggregation should only generate a result if a specific suboperator returns a value.

Weights (Optional)

Some comparison operators might be more relevant for the correct establishment of a link between two resources than others. For example, depending on data formats/quality, matching labels might be considered less important than matching geocoordinates when linking cities. If this modifier is not supplied, a default weight of 1 will be assumed. The weight is only considered in the aggregation types average, quadraticMean and geometricMean.

Type

The function according to which the similarity values are aggregated. The following functions are included in Silk:

Id | Name | Description
average | AverageAggregator | Evaluate to the (weighted) average of confidence values.
max | MaximumAggregator | Evaluate to the highest confidence in the group.
min | MinimumAggregator | Evaluate to the lowest confidence in the group.
quadraticMean | QuadraticMeanAggregator | Apply Euclidean distance aggregation.
geometricMean | GeometricMeanAggregator | Compute the (weighted) geometric mean of a group of confidence values.
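
How the aggregation functions differ can be seen in a small numeric sketch. The formulas below are the usual weighted means and are an assumption about Silk's behaviour; in particular, reading "Euclidean distance aggregation" as a weighted quadratic mean is an interpretation. The geometric mean is only meaningful for non-negative confidences.

```python
import math

def average(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def quadratic_mean(values, weights):
    return math.sqrt(sum(w * v * v for v, w in zip(values, weights)) / sum(weights))

def geometric_mean(values, weights):
    # weighted geometric mean; assumes all values are non-negative
    return math.prod(v ** w for v, w in zip(values, weights)) ** (1.0 / sum(weights))

confidences = [0.8, 0.4]  # e.g. a label comparison and a population comparison
weights = [3, 1]          # the first comparison counts three times as much

avg = average(confidences, weights)
quad = quadratic_mean(confidences, weights)
geo = geometric_mean(confidences, weights)
highest, lowest = max(confidences), min(confidences)
```

With these inputs the weighted average is 0.7, while max and min simply pick 0.8 and 0.4 and ignore the weights.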

Example (XML)

<Aggregate type="average">
	<Compare metric="jaro" required="true">
    	<Input path="?a/rdfs:label" />
        <Input path="?b/gn:name" />
    </Compare>
    <Compare metric="num">
    	<Input path="?a/dbpedia:populationTotal" />
        <Input path="?b/gn:population" />
    </Compare>
</Aggregate>

Example (Scala)

Aggregation(
  id = "id1",
  required = false,
  weight = 1,
  operators = operators,
  aggregator = MaximumAggregator()
)

Output

Overview

An output represents a destination to which the generated links are written. Outputs can have an acceptance window (defined by minConfidence and maxConfidence), e.g. for separating accepted links from links with a lower confidence which need to be verified before being accepted.
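
The effect of acceptance windows can be sketched as a routing step. The `route` helper and the output names are hypothetical, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption; the text does not define the boundary behaviour.

```python
def route(links, outputs):
    """links: (source, target, confidence); outputs: name -> (min_conf, max_conf)."""
    routed = {name: [] for name in outputs}
    for link in links:
        conf = link[2]
        for name, (min_conf, max_conf) in outputs.items():
            above_min = min_conf is None or conf >= min_conf
            below_max = max_conf is None or conf < max_conf
            if above_min and below_max:
                routed[name].append(link)
    return routed

# One output for accepted links, one for links that still need verification.
outputs = {"accepted_links": (0.95, None), "verify_links": (None, 0.95)}
links = [("a1", "b1", 0.99), ("a2", "b2", 0.93), ("a3", "b3", 0.95)]
routed = route(links, outputs)
```

Here the two windows do not overlap, so every link ends up in exactly one output.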

Example (XML)

<Outputs>
	<Output type="output type" minConfidence="lower threshold" maxConfidence="upper threshold">
    	...
    </Output>
</Outputs>

Example (Scala API)

Output(
  id = "MyOutput",
  writer = FileWriter(file = "output.nt", format = "ntriples")
)

Available Output Types

File Output

Parameter | Description
file | The file to which the links are written. Links are written to {user.dir}/.silk/output by default.
format | The output format. Available formats are “ntriples” and “alignment”.

Example (XML)

<Outputs>
	<Output type="file" minConfidence="0.1">
    	<Param name="file" value="accept_links.nt" />
        <Param name="format" value="ntriples" />
    </Output>
</Outputs>

Formats

N-Triples

Writes the links as N-Triples statements.

Alignment

Writes the links in the OAEI Alignment Format. This includes not only the URIs of the source and target entities, but also the confidence of each link.

SPARQL/Update Output

Parameter | Description
uri | The URI of the SPARQL/Update endpoint, e.g. http://localhost:8090/virtuoso/sparql
login | Login required for authentication.
password | Password required for authentication.
parameter | The HTTP parameter used to submit queries. Defaults to “query”, which works for most endpoints. Some endpoints require different parameters, e.g. Sesame expects “update” and Joseki expects “request”.
graphUri | The URI of the graph into which the links are written.

Example (XML)

<Outputs>
	<Output type="sparul">
    	<Param name="uri" value="http://localhost:8080/query" />
    </Output>
</Outputs>

Detailed Alignment (Work in Progress)

Writes the links in a detailed alignment format.

Example (XML)

<DetailedAlignment>
    <Cell>
        <Entity1 rdf:resource="http://dbpedia.org/resource/Hydroflumethiazide" />
        <Entity2 rdf:resource="http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00774" />
        <Aggregate similarity="1.0">
            <Compare similarity="1.0">
                <Input path="?a/rdfs:label">
                    <Value>Idroflumetiazide</Value>
                    <Value>Hydroflumethiazide</Value>
                </Input>
                <Input path="?b/rdfs:label">
                    <Value>Hydroflumethiazide</Value>
                </Input>
            </Compare>
            <Compare similarity="1.0">
                <Input path="?a/rdfs:label">
                    <Value>Idroflumetiazide</Value>
                    <Value>Hydroflumethiazide</Value>
                </Input>
                <Input path="?b/drugbank:synonym">
                    <Value>Idroflumetiazide</Value>
                </Input>
            </Compare>
        </Aggregate>
    </Cell>
    ...
</DetailedAlignment>

Alternative (as RDF/XML)

<?xml version='1.0' encoding='utf-8' standalone='no'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<DetailedAlignment>
  <map>
    <Cell>
      <entity1 rdf:resource="http://dbpedia.org/resource/Hydroflumethiazide"/>
      <entity2 rdf:resource="http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00774"/>
      <aggregate>
        <similarity>1.0</similarity>
        <compare>
          <similarity>1.0</similarity>
          <input>
            <path>?a/rdfs:label</path>
            <value>Idroflumetiazide</value> 
            <value>Hydroflumethiazide</value>  
          </input>
          <input>
            <path>?b/rdfs:label</path>
            <value>Hydroflumethiazide</value> 
         </input>
        </compare>
        <compare>
          <similarity>1.0</similarity>
          <input>
            <path>?a/rdfs:label</path>
            <value>Idroflumetiazide</value> 
            <value>Hydroflumethiazide</value>  
          </input>
          <input>
            <path>?b/drugbank:synonym</path>
            <value>Idroflumetiazide</value> 
         </input>
        </compare>
      </aggregate>
    </Cell>
  </map>
  ...
</DetailedAlignment>
</rdf:RDF>

Overview

Reference Links are a set of links whose correctness has either been confirmed or declined by the user. Reference links can be used to evaluate the completeness and correctness of a linkage rule.

We distinguish between positive and negative reference links:

  • Positive reference links represent definitive matches
  • Negative reference links represent definitive non-matches
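
One common way to quantify completeness and correctness against such reference links is recall, precision and their harmonic mean (the F-measure). The sketch below is an illustration with hypothetical names and data; ignoring generated links that appear in neither reference set is an assumption.

```python
def evaluate(generated, positive, negative):
    """Score generated links against positive and negative reference links."""
    generated, positive, negative = set(generated), set(positive), set(negative)
    true_pos = len(generated & positive)   # confirmed matches that were found
    false_pos = len(generated & negative)  # confirmed non-matches that were generated
    false_neg = len(positive - generated)  # confirmed matches that were missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

positive = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
negative = {("a1", "b9")}
generated = {("a1", "b1"), ("a2", "b2"), ("a1", "b9"), ("a7", "b7")}

precision, recall, f_measure = evaluate(generated, positive, negative)
```

In this example two of the three positive links are found and one negative link is generated, so precision, recall and F-measure are all 2/3.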

The Silk framework provides a declarative language for specifying which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. This section describes the language constructs of the Silk Link Specification Language (Silk-LSL).

The example below gives an overview of the main language constructs of Silk-LSL:

<Silk>
	<Prefixes>
    	<Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
        <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
        <Prefix id="gn" namespace="http://www.geonames.org/ontology#" />
    </Prefixes>
    <DataSources>
        <DataSource id="dbpedia" type="sparqlEndpoint">
            <Param name="endpointURI" value="http://demo_sparql_server1/sparql" />
            <Param name="graph" value="http://dbpedia.org" />
        </DataSource>
        <DataSource id="geonames" type="sparqlEndpoint">
            <Param name="endpointURI" value="http://demo_sparql_server2/sparql" />
            <Param name="graph" value="http://sws.geonames.org/" />
        </DataSource>
    </DataSources>
    
    [<Blocking blocks="100" />]
    
    <Interlinks>
    	<Interlink id="cities">
        	<LinkType>owl:sameAs</LinkType>
            <SourceDataset dataSource="dbpedia" var="a">
            	<RestrictTo>
                    ?a rdf:type dbpedia:City
            	</RestrictTo>
            </SourceDataset>
            <TargetDataset dataSource="geonames" var="b">
            	<RestrictTo>
                	?b rdf:type gn:P
                </RestrictTo>
            </TargetDataset>
            <LinkageRule>
            	<Aggregate type="average">
                	<Compare metric="jaro">
                    	<Input path="?a/rdfs:label" />
                        <Input path="?b/gn:name" />
                    </Compare>
                    <Compare metric="num">
                        <Input path="?a/dbpedia:populationTotal" />
                        <Input path="?b/gn:population" />
                    </Compare>
                </Aggregate>
            </LinkageRule>
            
            <Filter threshold="0.9" />
            
            <Outputs>
            	<Output type="file" minConfidence="0.95">
                	<Param name="file" value="accepted_links.nt" />
                    <Param name="format" value="ntriples" />
                </Output>
                <Output type="file" maxConfidence="0.95">
                	<Param name="file" value="verify_links.nt" />
                    <Param name="format" value="alignment" />
                </Output>
            </Outputs>
        </Interlink>
    </Interlinks>
</Silk>

Structure and Elements

The Silk-LSL is expressed in XML as specified by the corresponding Silk XML Schema. The root tag name is <Silk>. A valid document may contain four types of top-level statements beneath the root element:

  • prefix definitions
  • data source definitions
  • link specifications
  • output definitions

<?xml version="1.0" encoding="utf-8" ?>
<Silk>
	<Prefixes ... />
    	...
    <DataSources ... />
    	...
    [<Blocking ... />]
    	...
    <Interlinks ... />
    	...
    [<Outputs ... />]
    	...
</Silk>

The Blocking and Outputs statements are optional.

Prefix Definitions

Prefix definitions are top-level statements that allow the binding of a prefix to a namespace:

<Prefixes>
	<Prefix id="prefix id" namespace="namespace URI" />
</Prefixes>

Example (XML)

<Prefixes>
	<Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
</Prefixes>
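
What a prefix binding does can be shown with a small sketch that expands prefixed names into full URIs. The `expand` helper is hypothetical; the two namespaces are the ones used in the examples of this document.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def expand(qname, prefixes=PREFIXES):
    """Expand a prefixed name such as rdf:type into a full URI."""
    prefix, _, local = qname.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undefined prefix: {prefix}")
    return prefixes[prefix] + local

type_uri = expand("rdf:type")
label_uri = expand("rdfs:label")
```

A prefix that has not been defined, such as `owl` in this sketch, cannot be expanded, which is why every prefix used in paths, restrictions and link types needs a definition.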

Data Source Definitions

Data source definitions are top-level statements that allow the specification of access parameters to local or remote SPARQL endpoints. The defined data sources may later be referred to and used by their ID within link specification statements.

<DataSources>
	<DataSource id="data source ID" type="dataSource type">
    	<Param name="parameter name" value="parameter value" />
    </DataSource>
</DataSources>

For details see Data Sources.

Blocking Data Items

Since comparing every source resource to every single target resource results in a number of n * m comparisons (n being the number of source resources, m the number of target resources) which might be too time consuming, blocking can be used to reduce the number of comparisons. Blocking partitions similar data items into clusters reducing the comparisons to items in the same cluster.

For example, given two datasets describing books, we could block the books by publisher in order to reduce the number of comparisons. In this case only books from the same publisher will be compared. Given 40,000 books in the first dataset and 30,000 in the second dataset, evaluating the full Cartesian product requires 1.2 billion comparisons.

If we block these datasets by publisher, each book will be allocated to a block based on its publisher. Using 100 blocks, if the books are uniformly distributed, there will be 400 and 300 books per block, respectively, which reduces the number of comparisons to 12 million.
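
The arithmetic of this example can be checked with a few lines of code. The `comparisons` helper is hypothetical and assumes, as in the text, that both datasets are distributed uniformly over the blocks.

```python
def comparisons(n_source, n_target, blocks=1):
    """Comparisons needed when both datasets are spread evenly over `blocks` blocks."""
    per_block = (n_source // blocks) * (n_target // blocks)
    return per_block * blocks

full = comparisons(40_000, 30_000)                 # full Cartesian product
blocked = comparisons(40_000, 30_000, blocks=100)  # 100 blocks of 400 x 300 books
speedup = full // blocked
```

With 100 uniformly filled blocks the number of comparisons drops by a factor of 100; in general the reduction equals the number of blocks only if the data is evenly distributed.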