[vset VERSION 1.2.7] [manpage_begin uri n [vset VERSION]] [keywords {fetching information}] [keywords file] [keywords ftp] [keywords gopher] [keywords http] [keywords https] [keywords ldap] [keywords mailto] [keywords news] [keywords prospero] [keywords {rfc 1630}] [keywords {rfc 2255}] [keywords {rfc 2396}] [keywords {rfc 3986}] [keywords uri] [keywords url] [keywords wais] [keywords www] [moddesc {Tcl Uniform Resource Identifier Management}] [titledesc {URI utilities}] [category Networking] [require Tcl 8.2] [require uri [opt [vset VERSION]]] [description] This package does two things. [para] First, it provides a number of commands for manipulating URLs/URIs and fetching data specified by them. For fetching data this package analyses the requested URL/URI and then dispatches it to the appropriate package ([package http], [package ftp], ...) for actual retrieval. Currently these commands are defined for the schemes [term http], [term https], [term ftp], [term mailto], [term news], [term ldap], [term ldaps] and [term file]. The package [package uri::urn] adds scheme [term urn]. [para] Second, it provides regular expressions for a number of [const registered] URL/URI schemes. Registered schemes are currently [term ftp], [term ldap], [term ldaps], [term file], [term http], [term https], [term gopher], [term mailto], [term news], [term wais] and [term prospero]. The package [package uri::urn] adds scheme [term urn]. [para] The commands of the package conform to RFC 3986 ([uri https://www.rfc-editor.org/rfc/rfc3986.txt]), with the exception of a loophole arising from RFC 1630 and described in RFC 3986 Sections 5.2.2 and 5.4.2. The loophole allows a relative URI to include a scheme if it is the same as the scheme of the base URI against which it is resolved. RFC 3986 recommends avoiding this usage. 
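[para]

The following is a minimal sketch of the first group of commands, not a
definitive usage guide. The host [const www.example.com] and the paths are
placeholder values chosen for illustration. The exact set of keys returned by
[cmd uri::split] depends on the scheme and on the quirk options described in
section [sectref {QUIRK OPTIONS}], so the sketch only reads the keys
[const scheme] and [const host].

[example {
package require uri

# Placeholder URL, used for illustration only.
set url http://www.example.com/docs/index.html

# Split the URL into a key/value list and load it into an array.
array set parts [uri::split $url]
puts "scheme: $parts(scheme)"
puts "host:   $parts(host)"

# Reassemble a URL from the same key/value pairs.
# eval with linsert is used so that the sketch also runs on older Tcl versions.
set rebuilt [eval [linsert [array get parts] 0 uri::join]]
puts "rebuilt: $rebuilt"

# Resolve a relative reference against the URL, which serves as the base.
puts "resolved: [uri::resolve $url ../images/logo.png]"
}]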
[section COMMANDS]

[list_begin definitions]

[call [cmd uri::setQuirkOption] [arg option] [opt [arg value]]]

[cmd uri::setQuirkOption] is an accessor command for a number of "quirk
options". The command has the same semantics as the command [cmd set]: when
called with one argument it reads an existing value; with two arguments it
writes a new value. The value of a "quirk option" is boolean: the value
[const false] requests conformance with RFC 3986, while [const true] requests
use of the quirk. See section [sectref {QUIRK OPTIONS}] for discussion of the
different options and their purpose.

[call [cmd uri::split] [arg url] [opt [arg defaultscheme]]]

[cmd uri::split] takes a [arg url], decodes it and then returns a list of
key/value pairs suitable for [cmd "array set"] containing the constituents of
the [arg url]. If the scheme is missing from the [arg url], it defaults to the
value of [arg defaultscheme] if that was specified, and to [term http]
otherwise. Currently the schemes [term http], [term https], [term ftp],
[term mailto], [term news], [term ldap], [term ldaps] and [term file] are
supported by the package itself. See section [sectref EXTENDING] on how to
expand that range.

[para]

The set of constituents of a URL (that is, the set of keys in the returned
dictionary) depends on the scheme of the URL. The only key that is always
present is therefore [const scheme]. For the following schemes the
constituents and their keys are known:

[list_begin definitions]

[def ftp]
[const user], [const pwd], [const host], [const port], [const path],
[const type], [const pbare]. The key [const pbare] is optional.

[def http(s)]
[const user], [const pwd], [const host], [const port], [const path],
[const query], [const fragment], [const pbare]. The key [const pbare] is
optional.

[def file]
[const path], [const host]. The key [const host] is optional.

[def mailto]
[const user], [const host]. The key [const host] is optional.
[def ldap(s)]
[const host], [const port], [const dn], [const attrs], [const scope],
[const filter], [const extensions]

[def news]
Either [const message-id] or [const newsgroup-name].

[list_end]

For discussion of the boolean [const pbare] see options [emph NoInitialSlash]
and [emph NoExtraKeys] in [sectref {QUIRK OPTIONS}].

[para]

The constituents are returned as slices of the argument [arg url], without
removal of percent-encoding ("url-encoding") or other adaptations. Notably, on
Windows® the [const path] in scheme [term file] is not a valid local filename.
See [sectref EXAMPLES] for more information.

[call [cmd uri::join] [opt "[arg key] [arg value]"]...]

[cmd uri::join] takes a list of key/value pairs (generated by
[cmd uri::split], for example) and returns the canonical URL they represent.
Currently the schemes [term http], [term https], [term ftp], [term mailto],
[term news], [term ldap], [term ldaps] and [term file] are supported by the
package itself. See section [sectref EXTENDING] on how to expand that range.

[para]

The arguments are expected to be slices of a valid URL, with percent-encoding
("url-encoding") and any other necessary adaptations. Notably, on Windows the
[const path] in scheme [term file] is not a valid local filename. See
[sectref EXAMPLES] for more information.

[call [cmd uri::resolve] [arg base] [arg url]]

[cmd uri::resolve] resolves the specified [arg url] relative to [arg base], in
conformance with RFC 3986. In other words: a non-relative [arg url] is
returned unchanged, whereas for a relative [arg url] the missing parts are
taken from [arg base] and prepended to it. The result of this operation is
returned. For an empty [arg url] the result is [arg base], without its URI
fragment (if any). The command is available for schemes [term http],
[term https], [term ftp], and [term file].

[call [cmd uri::isrelative] [arg url]]

[cmd uri::isrelative] determines whether the specified [arg url] is absolute
or relative.
The command is available for a [arg url] of any scheme. [call [cmd uri::geturl] [arg url] [opt "[arg options]..."]] [cmd uri::geturl] decodes the specified [arg url] and then dispatches the request to the package appropriate for the scheme found in the URL. The command assumes that the package to handle the given scheme either has the same name as the scheme itself (including possible capitalization) followed by [cmd ::geturl], or, in case of this failing, has the same name as the scheme itself (including possible capitalization). It further assumes that whatever package was loaded provides a [cmd geturl]-command in the namespace of the same name as the package itself. This command is called with the given [arg url] and all given [arg options]. Currently [cmd geturl] does not handle any options itself. [para] [emph Note:] [term file]-URLs are an exception to the rule described above. They are handled internally. [para] It is not possible to specify results of the command. They depend on the [cmd geturl]-command for the scheme the request was dispatched to. [call [cmd uri::canonicalize] [arg uri]] [cmd uri::canonicalize] returns the canonical form of a URI. The canonical form of a URI is one where relative path specifications, i.e. "." and "..", have been resolved. The command is available for all URI schemes that have [cmd uri::split] and [cmd uri::join] commands. The command returns a canonicalized URI if the URI scheme has a [const path] component (i.e. [term http], [term https], [term ftp], and [term file]). For schemes that have [cmd uri::split] and [cmd uri::join] commands but no [const path] component (i.e. [term mailto], [term news], [term ldap], and [term ldaps]), the command returns the [arg uri] unchanged. [call [cmd uri::register] [arg schemeList] [arg script]] [cmd uri::register] registers the first element of [arg schemeList] as a new scheme and the remaining elements as aliases for this scheme. 
It creates the namespace for the scheme and executes the [arg script] in the
new namespace. The script has to declare variables containing regular
expressions relevant to the scheme. At a minimum, the variable
[var schemepart] must be declared, because it is used to extend the variables
that keep track of the registered schemes.

[list_end]

[section SCHEMES]

In addition to the commands mentioned above, this package provides regular
expressions to recognize URLs for a number of URL schemes.

[para]

For each supported scheme, a namespace of the same name as the scheme itself
is provided inside the namespace [emph uri]. It contains the variable
[var url], whose contents are a regular expression to recognize URLs of that
scheme. Additional variables may contain regular expressions for parts of URLs
for that scheme.

[para]

The variable [var uri::schemes] contains a list of all registered schemes.
Currently these are [term ftp], [term ldap], [term ldaps], [term file],
[term http], [term https], [term gopher], [term mailto], [term news],
[term wais] and [term prospero].

[section EXTENDING]

Extending the range of schemes supported by [cmd uri::split] and
[cmd uri::join] is easy, because neither command handles the request itself.
Each dispatches it to another command in the [emph uri] namespace, using the
scheme of the URL as the criterion.

[para]

[cmd uri::split] and [cmd uri::join] call
[cmd "Split[lb]string totitle <scheme>[rb]"] and
[cmd "Join[lb]string totitle <scheme>[rb]"] respectively.

[para]

The provision of split and join commands is sufficient to extend the commands
[cmd uri::canonicalize] and [cmd uri::geturl] (the latter subject to the
availability of a suitable package with a [cmd geturl] command). In contrast,
to extend the command [cmd uri::resolve] to a new scheme, the command itself
must be modified.

[para]

To extend the range of schemes for which pattern information is available, use
the command [cmd uri::register].
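[para]

As an illustration of [cmd uri::register], the following sketch registers a
hypothetical scheme [term demo] with an alias [term demos]. Both scheme names
and the deliberately simplistic pattern stored in [var schemepart] are
assumptions made for this example; they are not part of the package. A real
scheme would need patterns derived from its specification.

[example {
package require uri

# "demo" and "demos" are hypothetical scheme names used for illustration.
uri::register {demo demos} {
    # Required: pattern for everything that follows "demo:".
    variable schemepart {//[a-z0-9.-]+(/[a-z0-9./-]*)?}
    # Pattern for a complete URL of this scheme.
    variable url "demo:$schemepart"
}

# The scheme is now listed among the registered schemes
# (lsearch yields an index >= 0 when the name is found).
puts [lsearch -exact $uri::schemes demo]

# The pattern can be used to recognize URLs of the new scheme.
puts [regexp -- "^(?:$uri::demo::url)\$" demo://example.org/some/file]
}]

Note that this only provides pattern information. Commands such as
[cmd uri::split] and [cmd uri::join] would still need their own split and join
commands for the new scheme.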
[para] An example of a package that provides both commands and pattern information for a new scheme is [package uri::urn], which adds scheme [term urn]. [section {QUIRK OPTIONS}] The value of a "quirk option" is boolean: the value [const false] requests conformance with RFC 3986, while [const true] requests use of the quirk. Use command [cmd uri::setQuirkOption] to access the values of quirk options. [para] Quirk options are useful both for allowing backwards compatibility when a command specification changes, and for adding useful features that are not included in RFC specifications. The following quirk options are currently defined: [list_begin definitions] [def [emph NoInitialSlash]] This quirk option concerns the leading character of [const path] (if non-empty) in the schemes [term http], [term https], and [term ftp]. [para] RFC 3986 defines [const path] in an absolute URI to have an initial "/", unless the value of [const path] is the empty string. For the scheme [term file], all versions of package [package uri] follow this rule. The quirk option [emph NoInitialSlash] does not apply to scheme [term file]. [para] For the schemes [term http], [term https], and [term ftp], versions of [package uri] before 1.2.7 define the [const path] [emph NOT] to include an initial "/". When the quirk option [emph NoInitialSlash] is [const true] (the default), this behavior is also used in version 1.2.7. To use instead values of [const path] as defined by RFC 3986, set this quirk option to [const false]. [para] This setting does not affect RFC 3986 conformance. If [emph NoInitialSlash] is [const true], then the value of [const path] in the schemes [term http], [term https], or [term ftp], cannot distinguish between URIs in which the full "RFC 3986 path" is the empty string "" or a single slash "/" respectively. The missing information is recorded in an additional [cmd uri::split] key [const pbare]. 
[para] The boolean [const pbare] is defined when quirk options [emph NoInitialSlash] and [emph NoExtraKeys] have values [const true] and [const false] respectively. In this case, if the value of [const path] is the empty string "", [const pbare] is [const true] if the full "RFC 3986 path" is "", and [const pbare] is [const false] if the full "RFC 3986 path" is "/". [para] Using this quirk option [emph NoInitialSlash] is a matter of preference. [def [emph NoExtraKeys]] This quirk option permits full backward compatibility with versions of [package uri] before 1.2.7, by omitting the [cmd uri::split] key [const pbare] described above (see quirk option [emph NoInitialSlash]). The outcome is greater backward compatibility of the [cmd uri::split] command, but an inability to distinguish between URIs in which the full "RFC 3986 path" is the empty string "" or a single slash "/" respectively - i.e. a minor non-conformance with RFC 3986. [para] If the quirk option [emph NoExtraKeys] is [const false] (the default), command [cmd uri::split] returns an additional key [const pbare], and the commands comply with RFC 3986. If the quirk option [emph NoExtraKeys] is [const true], the key [const pbare] is not defined and there is not full conformance with RFC 3986. [para] Using the quirk option [emph NoExtraKeys] is [emph NOT] recommended, because if set to [const true] it will reduce conformance with RFC 3986. The option is included only for compatibility with code, written for earlier versions of [package uri], that needs values of [const path] without a leading "/", [emph {AND ALSO}] cannot tolerate unexpected keys in the results of [cmd uri::split]. [def [emph HostAsDriveLetter]] When handling the scheme [term file] on the Windows platform, versions of [package uri] before 1.2.7 use the [const host] field to represent a Windows drive letter and the colon that follows it, and the [const path] field to represent the filename path after the colon. 
Such URIs are invalid, and are not recognized by any RFC. When the quirk
option [emph HostAsDriveLetter] is [const true], this behavior is also used in
version 1.2.7. To use [term file] URIs on Windows that conform to RFC 3986,
set this quirk option to [const false] (the default).

[para]

Using this quirk is [emph NOT] recommended, because if set to [const true] it
will cause the [package uri] commands to expect and produce invalid URIs. The
option is included only for compatibility with legacy code.

[def [emph RemoveDoubleSlashes]]

When a URI is canonicalized by [cmd uri::canonicalize], its [const path] is
normalized by removal of segments "." and "..". RFC 3986 does not mandate the
removal of empty segments "" (i.e. the merger of double slashes, which is a
feature of filename normalization but not of URI [const path] normalization):
it treats URIs with excess slashes as referring to different resources. When
the quirk option [emph RemoveDoubleSlashes] is [const true] (the default),
empty segments will be removed from [const path]. To prevent removal, and
thereby conform to RFC 3986, set this quirk option to [const false].

[para]

Using this quirk is a matter of preference. A URI with double slashes in its
path was most likely generated in error, certainly so if it has a
straightforward mapping to a file on a server. In some cases it may be better
to sanitize the URI; in others, to keep the URI and let the server handle the
possible error.
[list_end] [para] [subsection {BACKWARD COMPATIBILITY}] To behave as similarly as possible to versions of [package uri] earlier than 1.2.7, set the following quirk options: [list_begin itemized] [item] [cmd uri::setQuirkOption] [arg NoInitialSlash] 1 [item] [cmd uri::setQuirkOption] [arg NoExtraKeys] 1 [item] [cmd uri::setQuirkOption] [arg HostAsDriveLetter] 1 [item] [cmd uri::setQuirkOption] [arg RemoveDoubleSlashes] 0 [list_end] In code that can tolerate the return by [cmd uri::split] of an additional key [const pbare], set [list_begin itemized] [item] [cmd uri::setQuirkOption] [arg NoExtraKeys] 0 [list_end] in order to achieve greater compliance with RFC 3986. [subsection {NEW DESIGNS}] For new projects, the following settings are recommended: [list_begin itemized] [item] [cmd uri::setQuirkOption] [arg NoInitialSlash] 0 [item] [cmd uri::setQuirkOption] [arg NoExtraKeys] 0 [item] [cmd uri::setQuirkOption] [arg HostAsDriveLetter] 0 [item] [cmd uri::setQuirkOption] [arg RemoveDoubleSlashes] 0|1 [list_end] [subsection {DEFAULT VALUES}] The default values for package [package uri] version 1.2.7 are intended to be a compromise between backwards compatibility and improved features. Different default values may be chosen in future versions of package [package uri]. [list_begin itemized] [item] [cmd uri::setQuirkOption] [arg NoInitialSlash] 1 [item] [cmd uri::setQuirkOption] [arg NoExtraKeys] 0 [item] [cmd uri::setQuirkOption] [arg HostAsDriveLetter] 0 [item] [cmd uri::setQuirkOption] [arg RemoveDoubleSlashes] 1 [list_end] [section EXAMPLES] A Windows® local filename such as "[const {C:\Other Files\startup.txt}]" is not suitable for use as the [const path] element of a URI in the scheme [term file]. [para] The Tcl command [cmd {file normalize}] will convert the backslashes to forward slashes. 
To generate a valid [const path] for the scheme [term file], a "[const /]"
must be prepended to the normalized filename, and then any characters that do
not match the [cmd regexp] bracket expression

[example {
    [a-zA-Z0-9$_.+!*'(,)?:@&=-]
}]

must be percent-encoded.

[para]

The result in this example is "[const /C:/Other%20Files/startup.txt]" which is
a valid value for [const path].

[example {
% uri::join path /C:/Other%20Files/startup.txt scheme file
file:///C:/Other%20Files/startup.txt

% uri::split file:///C:/Other%20Files/startup.txt
path /C:/Other%20Files/startup.txt scheme file
}]

On UNIX® systems filenames begin with "[const /]" which is also used as the
directory separator. The only action needed to convert a filename to a valid
[const path] is percent-encoding.

[section CREDITS]

[para]

Original code (regular expressions) by Andreas Kupries.
Modularisation by Steve Ball, also the split/join/resolve functionality.
RFC 3986 conformance by Keith Nash.

[vset CATEGORY uri]
[include ../common-text/feedback.inc]
[manpage_end]