\documentclass[acmtois]{acmtrans2m}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amssymb,amsmath}
\usepackage{amstext}
\usepackage{subfigure}
\newcommand{\B}{\mathbb{B}}
\newcommand{\A}{\mathbb{A}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\eg}{\textit{e.g.}}
\newcommand{\ie}{\textit{i.e.}}
\newcommand{\etc}{\textit{etc.}}
\newtheorem{sdef}{Definition}
\acmVolume{}
\acmNumber{}
\acmYear{04}
\acmMonth{}
\markboth{M. A. Gon{\c c}alves et al.}{Streams, Structures, Spaces, Scenarios, Societies (5S): A Formal Model for Digital Libraries}
\title{Streams, Structures, Spaces, Scenarios, Societies (5S):
A Formal Model for Digital Libraries}
\author{MARCOS ANDR{\'E} GON{\c C}ALVES, EDWARD A. FOX, LAYNE T. WATSON, NEILL A. KIPP \\
Virginia Polytechnic Institute and State University \\
}
\begin{abstract}
Digital libraries (DLs) are complex information systems and therefore demand formal foundations lest development efforts diverge and interoperability suffers. In this paper, we propose the fundamental abstractions of Streams, Structures, Spaces, Scenarios, and Societies (5S), which allow us to define digital libraries rigorously and usefully. Streams are sequences of arbitrary items used to describe both static and dynamic (e.g., video) content. Structures can be viewed as labeled directed graphs, which impose organization. Spaces are sets with operations on those sets that obey certain constraints. Scenarios consist of sequences of events or actions that modify states of a computation in order to accomplish a functional requirement. Societies are sets of entities and activities and the relationships among them. Together these abstractions provide a formal foundation to define, relate, and unify concepts -- among others, of digital objects, metadata, collections, and services -- required to formalize and elucidate ``digital libraries''. The applicability, versatility, and unifying power of the 5S model are demonstrated through its use in three distinct applications: building and interpretation of a DL taxonomy, informal and formal analysis of case studies of digital libraries (NDLTD and OAI), and utilization as a formal basis for a DL description language.
\end{abstract}
\category{H.3.7}{Information Systems}{Information Storage and Retrieval}[Digital Libraries]
\terms{Theory}
\keywords{foundations, taxonomy, definitions, applications.}
\begin{document}
\begin{bottomstuff}
Authors' adresses: Department of Computer Science, Virginia Tech, Blacksburg, Va 24061, USA. Email:\{mgoncalv,fox,ltw,kipp\}@vt.edu. This
work has been partially supported by the National Science Foundation under grants No. IIS-9986089, IIS-0002935, IIS-0080748, IIS-0086227,
DUE-0121679, DUE-0121741, and DUE-0136690. The first author also is supported by CAPES, process 1702-980, and a fellowship by America OnLine (AOL).
\end{bottomstuff}
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section 1
\section{Motivation}
Digital libraries are extremely complex information systems. The proper concept of a digital library seems hard to completely understand and evades definitional consensus. Different views (\eg, historical, technological) and perspectives (\eg, from the library and information science, information
retrieval, or human-computer interaction communities) have led to a
myriad of differing definitions. Licklider, in his seminal
work~\cite[pp. 36--39]{Licklider65}, visualized a collection of digital versions of the worldwide corpus of published literature and its availability through interconnected computers. More recently, Levy and Marshall gave a view of digital libraries as a polygamy of documents, technology, and work~\cite{LevyMarshall95}. Lesk analyzed the relative weights of the words \textit{digital} and \textit{library} in recent efforts in the field, and concluded that many of those efforts are dissociated from an understanding of users' needs and their use of the resources being provided~\cite{Lesk99}. Borgman explicitly explored the competing visions of the digital library field, both from research and from practitioner communities, and showed the difficulty that this conflict imposes on activities like defining terms, characterizing terminologies, and establishing contexts~\cite{Borgman99}. A Delphi study of digital libraries coalesced a broad definition: organized collection of resources, mechanisms for browsing and searching, distributed networked environments, and sets of services objectified to meet users' needs~\cite{KochtanekHein99}. The President's Information Technology Advisory Committee (PITAC) Panel on Digital Libraries discusses ``digital libraries -- the networked collections of digital text, documents, images, sounds, scientific data, and software that are the core of today's Internet and tomorrow's universally accessible digital repositories of all human knowledge'' \cite{PITAC01}. Underlying all of these is the consensus agreement that digital libraries are fundamentally complex.
Such complexity is due to the inherently interdisciplinary nature of this kind of system. Digital libraries integrate findings from disciplines such as hypertext, information retrieval, multimedia
services, database management, and human-computer
interaction~\cite{FoxMarchionini98}. The need to accommodate all
these characteristics complicates the understanding of the underlying
concepts and functionalities of digital libraries, thus making it
difficult and expensive to construct new digital library systems.
Designers of digital libraries are most often library technical staff,
with little to no formal training in software engineering, or computer
scientists with little background in the research findings about
information retrieval or hypertext. Thus, digital library systems are
usually built from scratch using home-grown architectures that do not
benefit from digital library and software design experience. Wasted
effort and poor interoperability can therefore ensue, raising the
costs of digital libraries and risking the fluidity of information
assets in the future.
The broad and deep requirements of digital libraries demand new models
and theories in order to understand better the complex interactions among their components~\cite{Gladney+94}. Supporting this claim, the summary
report of the Joint NSF-European Union (EU) Working Groups on Future
Directions of Digital Libraries Research recommended that ``new models
and theories be developed in order to understand the complex
interactions between the various components in a globally distributed
digital library''~\cite{SchaubleSmeaton98}. However, though the necessity for such an underlying theory has long been perceived and advocated, little if any progress has been made towards a formal model or theory for digital libraries.
Formal models and theories are crucial to specify and understand
clearly and unambiguously the characteristics, structure, and behavior
of complex information systems. It is not surprising that most of the disciplines related to digital libraries have underlying formal models that have steered them well: databases \cite{Codd70,Ullman88,Beeri,Cat94,Abiteboul+97}, information retrieval \cite{SaltonLesk65,Robertson76,Turtle91,Tague91,Sparck-Jones97,BYRN99}, and hypertext and multimedia \cite{LucarellaZanzi96,Masiero}. A formal model abstracts the general characteristics and common features of a set of systems developed for
similar problems, explains their structures and processes, and strengthens common practice. Furthermore, formal models for information systems can be
used for the design of a real system, providing a precise specification of requirements against which the implementation can be compared for
correctness. The lack of formal models leads to diverging efforts and has made interoperability one of the most important problems faced by the field.
In this paper we introduce the 5S model and formalisms for streams, structures, spaces, scenarios, and societies--- as a framework for providing theoretical and practical unification of digital libraries. These formalisms are important for making sense of complexity and can ultimately serve as an aid for designers, implementers, and evaluators of digital libraries. These abstractions work with other known and derived definitions to yield a formal, rigorous model of digital libraries.
This paper is organized as follows. Section 2 presents an overview of the 5S model, including definitions and examples. Section 3 discusses three applications of the 5S model including: a) construction and interpretation of a DL taxonomy; b) informal analysis of case studies of digital libraries; and
3) utilization of 5S as a basis for a DL description language. Sections 2 and 3 are purposely informal and introduce most of the key concepts in an intuitive manner without complete precision. The role of Section 4
is to formally define key information constructs that were introduced in the previous sections. Section 5 then builds on this framework to formally describe several DL higher level
constructs and settings. Section 6 illustrates the application of the formal model. Section 7 discusses related work and Section 8 concludes the
paper with some insights into future work.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section 2
\section{5S Overview: Informal Definitions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section 3
\subsection{Streams}
Streams are sequences of elements of an arbitrary type (e.g., bits, characters, images, etc.). In this sense, they can model both static and dynamic content. The first includes, for example, textual material, while the later might be, for example, a presentation of a digital video, or a sequence of time and positional data (e.g., from a GPS) for a moving object.
A dynamic stream can represent an information flow---a sequence of messages encoded by the sender and communicated using a transmission channel possibly distorted with noise, to a receiver whose goal is to reconstruct the sender's messages and interpret message semantics~\cite{Shannon48}. Dynamic streams are thus important for representing whatever communications take place in the digital library. Examples of dynamic streams include video-on-demand delivered to a viewer, a timed sequence of news sent to a client, a timed sequence of frames that allows the assembly of a virtual reality scenario, etc. Typically, a dynamic stream is understood through its temporal nature. A dynamic stream then can be interpreted as a finite sequence of clock times and associated values{\footnote{These values are undefined or a value of type $T$, e.g., boolean, integer, text, or image.} that can be used to define a stream algebra, allowing operations on diverse kinds of multimedia streams~\cite{MackayBeaudouinLafon98}. The synchronization of streams can be specified with Petri Nets \cite{Oberweis96} or other approaches.
In the static interpretation, the temporal nature is generally ignored or is irrelevant, and a stream corresponds to some information content that is interpreted as a sequence of basic elements, often of the same type. A popular type of static stream according to this view is text (sequence of characters). The type of the stream defines its semantics and area of application. For example, any text representation can be seen as a stream of characters, so that text documents, such as scientific articles and books, can be considered as structured streams.
\subsection{Structures}
A structure specifies the way in which parts of a whole are
arranged or organized. In digital libraries, structures can represent
hypertexts, taxonomies, system connections, user relationships, and
containment -- to cite a few. Books, for example can be
structured logically into chapters, sections, subsections, and
paragraphs; or physically into cover, pages, line groups (paragraphs),
and lines~\cite{Furuta94}. Structuring orients readers within a
document's information.
Markup languages (e.g., SGML, XML, HTML) have been the primary form of exposing the internal structure of digital documents for retrieval and/or presentation purposes ~\cite{Andre+89,Coombs+88,GoldfarbPrescod98}. Relational and object-oriented databases impose strict structures on data, typically using tables or graphs as units of structuring~\cite{Beeri}. Indexing in information retrieval systems serves to categorize and support future requests, generating an organizational structure for the document space.
With the increase in heterogeneity of material continually being added
to digital libraries, we find that much of this material is
called ``semistructured'' or ``unstructured''. These terms refer to data that may have some structure, where the structure is not as
rigid, regular, explicit, or complete as the structure used by structured
documents or traditional database management
systems~\cite{Abiteboul+99}. Query languages and algorithms can
extract structure from these data~\cite{Debye,Abiteboul+97,Nestorov97}. Although most of those efforts
have a ``data-centric'' view of semi-structured data, works with a more ``document-centric view'' have emerged \cite{Yates2000,Fuhr2000,Fuhr04}. In
general, humans and natural language processing systems can expend considerable effort to unlock the interwoven structures found in texts at
syntactic, semantic, pragmatic, and discourse levels.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Spaces}
A space is a set of objects together with operations on those objects that obey certain constraints. The combination of operations on objects with the set of objects is what distinguishes spaces from streams and structures. Since this is such a powerful construct, when a part of a DL cannot be described well using another of the Ss, a space may well be applicable. Despite the generality of this definition, spaces are extremely important mathematical constructs. The operations and constraints associated with a space define its properties. For example, in mathematics, affine, linear, metric, and topological spaces define the basis for algebra and analysis ~\cite{Godement69}. In the context of digital libraries, Licklider discusses spaces for information~\cite[p. 62]{Licklider65}. In the information retrieval discipline, Salton and Lesk formulated an algebraic theory based on vector spaces and implemented it in the SMART system~\cite{SaltonLesk65}.
``Feature spaces'' are sometimes used with image and document collections and are suitable for clustering or probabilistic
retrieval~\cite{Robertson77}. Spaces also can be defined by a regular language applied to a collection of documents. Document spaces are a key concept in many digital libraries.
Human understanding can be described using conceptual spaces. Multimedia systems must represent real as well as synthetic spaces in one or several
dimensions, limited by some metric or presentational space (windows, views,
projections) and transformed to other spaces to facilitate processing (such as compression \cite{Edleno00,Ziviani00}). Many of the synthetic spaces represented in virtual reality systems try to emulate physical spaces. Digital libraries may model traditional libraries by using virtual reality spaces or environments~\cite{Bayraktar+98,DasNeves00}. Also,
spaces for computer-supported cooperative work provide a context for
virtual meetings and collaborations \cite{Crabtree97,Prince99}.
Again, spaces are distinguished by the operations on their elements. Digital libraries can use many types of spaces for indexing, visualizing, and other services they perform. The most prominent of these for digital libraries
are measurable spaces, measure spaces, probability spaces, vector spaces, and topological spaces.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Scenarios}
One important type of scenario is a story that describes possible ways to use a system to accomplish some function that a user desires. Scenarios are useful as part of the process of designing information systems. Scenarios can be used to describe external system behavior from the user's point of view \cite{Kying95}; provide guidelines to build a cost-effective prototype \cite{Sutcliffe97}; or help to validate, infer, and support requirements specifications and provide acceptance criteria for testing \cite{Hsia94,Sutcliffe98,Lams98}. Developers can quickly grasp the potentials and complexities of digital libraries through scenarios. Scenarios tell what happens to the streams, in the spaces, and through the structures. Taken together the scenarios describe services, activities, tasks, and operations, and those ultimately specify the functionalities of a digital library.
For example, user scenarios describe one or more users engaged in some meaningful activity with an existing or envisioned system. This approach has been used as a design model for hypermedia applications~\cite{Ogawa+90}. Human information needs, and the processes of satisfying them in the context of digital libraries, are well suited to description with scenarios, including these key types: fact-finding, learning, gathering, and exploring~\cite{Wilkison96}. Additionally, scenarios can aid
understanding of how digital libraries affect organizations and
societies, and how challenges to support social needs relate to
underlying assumptions of digital libraries~\cite{LevyMarshall95}. Scenarios also may help us understand the complexities of current publishing methods,
as well as how they may be reshaped in the era of digital libraries,
by considering publishing paths, associated participants, and publication
functions~\cite{Wiederhold95}.
The concepts of state and event are fundamental to understanding
scenarios. Broadly speaking, a state is determined by what contents are in specified locations, as, for example, in a computer memory, disk storage,
visualization, or the real world. The nature of the values and state locations related to contents in a system are granularity-dependent and their
formal definitions and interpretations are out of the scope of this paper; the reader is referred to \cite{Winskel93} for a lengthy discussion. An
event denotes a transition or change between states, for example, executing a command in a program. Scenarios specify sequences of events, which
involve actions that modify states of a computation and influence the occurrence and outcome of future events. Dataflow and workflow in digital
libraries can be modeled using scenarios.
%% ////////////////////////////////////////////////////////////////
\subsection{Societies}
A society is a set of entities and the relationships between them. The entities include humans as well as hardware and software components, which either use or support digital library services.
Societal relationships make connections between and among the entities and activities.
Examples of specific human societies in digital libraries include patrons,
authors, publishers, editors, maintainers, developers, and the library
staff. There are also societies of learners and teachers. In a human society, people have roles, purposes, and relationships. Societies follow certain rules and their members play different roles---participants, managers, leaders, contributors, or users. Members of societies have activities and relationships. During their activities, society members often create information artifacts---art, history, images, data---that can be managed by the library. Societies
are holistic---substantially more than the sums of their constituents
and the relationships between them. Electronic members of digital library societies, i.e., hardware and software components, are normally engaged in supporting and managing services used by humans.
A society is the highest-level component of a digital library, which exists to serve the information needs of its societies and to describe the context
of its use. Digital libraries are used for collecting, preserving, and sharing information artifacts between society members. Cognitive models for
information retrieval \cite{Belkin82,Ellis92,Ingwersen97}, for example, focus on user's information-seeking behavior (i.e., formation, nature, and
properties of a user's information need) and on the ways in which information retrieval systems are used in operational environments.
Several societal issues arise when we consider them in the digital
library context. These include policies for information use,
reuse, privacy, ownership, intellectual property rights, access management, security, etc. \cite{PITAC01}. Therefore, societal governance (law and its enforcement) is a fundamental concern in digital libraries. Language barriers are also an essential concern in information systems and internationalization of online materials is an important issue in digital libraries, given their globally distributed nature~\cite{Oard+99}.
Economics, a critical societal concern, is also key for digital
libraries \cite{Economics00}. Collections that were ``born electronic'' are cheaper to house and maintain, while scanning paper documents to be used online
can be relatively expensive. Internet access is widely available
and in many settings is inexpensive. Online materials are seeing more use, including from
distant locations. Since distribution costs of electronic
materials are very low, digital delivery makes economic sense. However, it brings the problem of long-term storage and preservation, which must be
adequately addressed if the information being produced today is to be accessible to future generations \cite{Raymond01}.
\section{Applications of 5S}
In this section, we illustrate the expressiveness and unifying power of 5S as a theory for digital libraries through three different example applications. In the first, we build a taxonomy of DL concepts derived from the literature and characterize the result in light of the theory. The second application uses 5S as an analytical tool to understand and dissect a DL instance and a DL protocol for interoperability. Third, we present a brief description of a declarative language based on 5S for the specification and automatic generation of DL applications.
\subsection{Digital Library Taxonomy}
A taxonomy is a classification system of empirical entities
with the goal of classifying cases according to their
measured similarity on several variables \cite{Bayley94}. Classifications are a premier descriptive tool and as such, they give a foundation towards an explanation for a phenomena. Classifications provide a terminology and vocabulary for a field and help to reduce complexity and
achieve parsimony of description by logically arranging concepts through
the identification of similarities and differences. We have built a taxonomy for digital libraries as a classification system of terms involved with the field. Our taxonomy describes the digital library field in conceptual terms and therefore its organization is amenable to be interpreted in the light of our 5S theory. This interpretation aims toward a more informal conceptual understanding of the `Ss' and corresponding DL components.
In the process of building such a taxonomy, we have considered the principles of taxonomies in social sciences, notably cluster analysis, and faceted
classification schemes \cite{Vickery65}. In particular we were guided by the idea that writing about a subject unequivocally reveals the appropriate facets for that subject \cite{Fosket80}, and that those facets are enough to describe the phenomenon \cite{Ranganathan65}. We followed an agglomerative strategy using subjective
relational concepts like association and correlation.
During the construction of the taxonomy we tried to
accommodate all the terms found in the literature and
marginal fields, guarantee mutual exclusivity,
and ensure consistency and clarity.
%To collect the unstructured list of concepts, we went
%through the early DL literature to find all features, issues, and
%roles utilized and identified specific terms \cite{5STechRep}.
%In particular, we explored relevant contributions from a number of the following
%literature sources:
%\begin{itemize}
%\item ACM DL conferences (1995-2000),
%\item ACM Transactions on Information System,
%\item Communications of the ACM (particularly 4/95, 4/98, 5/2001),
%\item D-Lib Magazine,
%\item European Conference on Digital Libraries (1997-2000),
%\item IEEE Computer DL Issue (4/97),
%\item IEEE-CS International Conference - Advances in Digital Libraries (1996-2000),
%\item Independent (Texas) DL Conferences 94, 95,
%\item International Journal on Digital Libraries (Springer),
%\item Journal of the American Society for Information Science (and Technology),
%\item Web in general.
%\end{itemize}
To collect the unstructured list of concepts, we went
through the early literature to find all features, issues, and
roles utilized, and identified specific terms \cite{5STechRep}.
As a starting point, we used an initial set of terms and
phrases listed alphabetically in \cite{Foxetal94}. To this
list we added other terms from various articles. When
this was reasonably voluminous, we produced a grouping of
terms of similar or related meaning into ``notational
families'' known as facets. Each group was given a label
that described the idea behind the homogeneity of the group
or the main variable considered. From there, we grouped the
clusters, and so on, until we achieved convergence into one
unique facet called ``digital library.''
Once the initial taxonomy was complete, we noticed certain
terms were missing or ambiguous, so we added terms and
qualified them in each context. After several iterations of
successive clustering, declustering, and reclustering, we
released a more concrete and consistent working set for peer
review and then improved the taxonomy based on comments received. The resulting taxonomy is shown in Fig.1.
\begin{figure}[hp]
\centering
\hspace{-1in}
\includegraphics[scale=0.75]{tax5.eps}
%\caption{Taxonomy of Digital Libraries Terms}
\end{figure}
We must point out that, as with any classification system, our taxonomy must evolve to accommodate changes in the digital library field. However, two factors should contribute to the stability of the taxonomy, and therefore to its relative longevity. First the taxonomy was derived from a significant corpus of digital library literature; therefore it is more stable than personal opinions. Second, the higher-level groupings are significantly abstract so that they may be applied to many fields, with possible additions or changes necessary only at the level of specific categories. Clearly, such changes are likely due to the youth and rapid development of the field. In the following we describe the main facets and sub-facets of the taxonomy, making use of 5S as an analytical tool.
\paragraph*{Actors: Who interacts with/within DLs?}
In our context, actors are the users of a digital library. Actors interact with the DL through an interface design that is (or should be) affected by
the actors' preferences and needs. Actors who have preferences and needs in common, display similar behavior in terms of services they use and interactions
they practice. We say these actors form a \textit{digital community}, the building blocks of a digital library society\footnote{Digital communities
are formed by actors who interact with a DL possibly through the same interface paradigm. The actors might belong to distinct social communities of
the real world. For instance, a digital community might be instantiated by the adoption of a particular architecture and interface for a DL (e.g., a chat room or MOO). This instantiation is somewhat arbitrary and artificial. Social communities, on the other hand, appear much more naturally as a result of complex social interactions.}. Communities---of students, teachers, librarians---interact with digital libraries and use digital libraries to interact, following pre-specified scenarios. Communities can act as a query-generator service, from the point of view of the library, and as a teaching, learning, and working service, from the point of view of other humans and organizations. Communications between actors and among the same and different communities occur through the exchange of streams. Communities of autonomous agents and computers also play roles in digital libraries. They instantiate scenarios upon requests by the actors of a DL. To operate, they need structures of vocabulary and protocols. They act by sending (possibly structured) streams of queries and retrieving streams of results.
\paragraph*{Activities: What happens in DLs?}
Activities of digital libraries --- abstracting, collecting, creating, disseminating, evaluating, modeling, organizing, personalizing, preserving, requesting, and selecting --- all can be described and implemented using scenarios and occur in the DL setting as a result of actors using services. Furthermore, these activities make and characterize relationships within and between societies, streams, and structures. Each activity happens in a setting, arena, or space. The relationships developed can be seen in the context of larger structures (e.g., social networks \cite{Schwartz93,ReferralWeb}).
\paragraph*{Components: What constitutes DLs?}
Digital libraries can contain repositories of knowledge, information, data, metadata, relationships, logs, annotations, user profiles, and documents,
all of which can be interpreted as distinct forms of digital objects, according to their particular structures, metadata, and streams. They can be
associated with higher-level structuring and organizational materials: term lists (e.g., authority files, dictionaries), classification tools (e.g., subject headings and taxonomies), thesauri, ontologies, and metadata catalogs. These knowledge organization sources are normally applied to collections of digital objects and support a number of services such as metadata-based resource discovery, query expansion with thesauri, hierarchical browsing with classification systems, and ontology-based crosswalks among disparate metadata formats and vocabularies. Finally, DLs are served by a substrate---a foundational complex amalgamation of different combinations of Ss that involves computers, network connections, file and operating systems, user interfaces, communication links, and protocols.
\paragraph*{Socio-economic, Legal Aspects: What surrounds the DL?}
This facet is mainly related to the societal aspects of the DL and their relationships and interactions, including regulations, measures, and derivatives. It abstracts aspects surrounding the other DL issues and involves policies, economic issues, standards, and qualities. For example, policies may dictate that only certain communities have the right to use specific portions of a collection. Some of these DL issues can be established regarding normative structured documents. Policies and quality control also can be enforced by specific services, for example, authentication, authorization \cite{Gladney01}, encryption, and specific practices (scenarios) or protocols, which can involve other communication services and serialized streams.
\paragraph*{Environment: In what contexts are DLs embedded?}
The environment involves a set of spaces (e.g., the physical space, or a concept space defined by the words of a natural language) that defines the use and the context of a DL. The environment also involves the society that sets up the DL and uses it. But the environment is also how the DL fits into the structure of community and its organization and dictates the scenarios by which its activities are performed.
\textit{Academic Disciplines} define a problem area ``per se'' and build a rational consensus of ideas and information about the problem that leads to a solution \cite{Sarasevic75}. Thus they carve out a space for their approaches (e.g., in terms of concepts in a domain language, etc.), and structure some subject knowledge jointly with specific scenarios that define the methods or activities used to solve their specific problems.
\textit{Purposes} and \textit{Scope} define types of societies served by the DL and determine a specific library structure.
\subsection{DL Case Studies with 5S -- Case Study 1: Networked Digital Library of Theses and Dissertations (NDLTD)}
In the last section, 5S was used to provide a better understanding of the DL field as a whole. The goals of this and the next section are
threefold: 1) to show the
use of 5S as an analytical tool helpful in comprehending specific DL phenomena; 2) to present the complex interplays that occur among 5S
components and DL concepts in real DL applications; and 3) to illustrate the possibility of using 5S as an instrument for requirements analysis in DL
development.
The Networked Digital Library of Theses and Dissertations (NDLTD) \cite{NDLTD01,NDLTDDLIB97} is an international federation of universities,
libraries, and other supporting institutions focused on efforts related to electronic theses and dissertations (ETDs). Many libraries and universities
run their own programs and services, but consortial activities at the state (e.g., OhioLINK), regional (e.g., Catalunya, Spain), and national (e.g.,
Australia, Brazil, China, Germany, India, Korea, Portugal) levels exist. NDLTD allows institutions to cooperate and collaborate in a federated fashion,
in a scalable and sustainable effort, especially since automation affords savings to both students and their universities relative to old paper-based approaches. As the distributed collection grows, and ultimately achieves critical mass, NDLTD has the potential to become one of the largest and most active digital libraries supporting education and research.
\subsubsection{Societies}
The primary community addressed through the NDLTD society is graduate students. The project aims to enhance graduate education, particularly of those students who prepare either a thesis or dissertation. Consequently, a second community is implicated, namely those involved in administering graduate programs. Those who are deans or associate deans of graduate schools, and their supervisors (e.g., associate provosts or associate chancellors) and staff, as well as the members of related associations (e.g., Council of Graduate Schools in USA, or the Canadian Association of Graduate Schools), are key members of this important community, that often decides if a university will join NDLTD. Because some universities have distributed these responsibilities to colleges or faculties, or because some involved in graduate program administration are too busy to carefully study NDLTD, we expanded this second community to include those in colleges or departments that administer graduate programs, allowing them to have their respective units join NDLTD prior to an action by the entire university.
The third community related to the NDLTD society includes those involved in related activities in university libraries. This often involves the director or dean of the university library, as well as those involved in automation, support of multimedia development, training, cataloging, preservation, or other similar roles.
A fourth community involved in NDLTD is that of faculty. They may encourage students to start early to experiment with electronic theses and dissertations (ETDs), and to prepare expressive works, using multimedia. They may assist by providing tools in their laboratories that help with production of an ETD. They may guide students to produce high-quality works, that, in turn, may encourage and help large numbers of interested readers. Faculty also assist students to grasp key issues regarding intellectual property and copyright, and to make their research results available to the widest community of readers possible given constraints relating to patents or publishers (see next paragraph).
The fifth, whose importance to the project became obvious early in 1997, is that of publishers. Though NDLTD was developed as a university effort,
there is linkage with scholarly publishers because thesis and dissertation work often relates to other writings involving those students, such as
conference papers, journal articles, and monographs. Because of copyright laws and publisher policies, that may force editors to make judgements
regarding prior publication, this important community must be considered. In cases like ACM, IEEE-CS, and Elsevier, there is strong support by way of policies encouraging ETDs, which has been highly beneficial.
\subsubsection{Scenarios}
Each of the communities involved in the NDLTD society needs particular services from the digital library. They engage in various tasks and activities
related to ETDs - each with corresponding scenarios. The NDLTD team has focused on training (through workshops, online materials, and help in media
centers or library sites) to assist students with the authoring or creation of ETDs. Next, there is the process of submission, supported by workflow
software to help students enter and edit the metadata about their ETDs. Staff in the graduate school and library also use other
parts of the workflow software as they check, approve, catalog, and archive new ETDs. Library staff ensure that new works are added to the collection,
and that the system affords access almost all the time. In terms of volume, the most active scenarios relate to use of the DL. First,
there are simple (running) and advanced (prototype) interfaces that support accessing individual university sites (searching or browsing), federated
search across multiple sites, and access to a union archive collection through MARIAN \cite{Gonc2001c} and Virtua \cite{VTLS} systems. There is
experimental software to add annotation capabilities (the service selected as most important to add, based on focus groups
to determine what other scenarios apply) \cite{Todd99}. There is also experimental software, extending SIFT (Stanford Information Filtering
Tool) \cite{SIFT} to provide routing services based on stored user profiles, for those who
wish to be notified whenever an interesting ETD arrives. As time proceeds, our work in interoperability with other digital library software like
Greenstone \cite{Greenstone2} and Phronesis \cite{Phronesis} may allow us to support other universities that
choose to use those packages to provide access services for their local ETDs.
\subsubsection{Spaces}
One space-related aspect of NDLTD is the physical location of members (a metric space) --- now spread over parts of Africa, Asia, Australia, and Europe, as well as North, Central, and South America. The Internet provides the name space of machines, while the WWW provides the name space of servers. Vocabulary used in different NDLTD services like searching relates to the conceptual space used in indexing. This will become more disciplined, as members use some version of MARC, Dublin Core, or the ETD-MS thesis and dissertations metadata standard \cite{ETDMS}, which is likely to provide the basic conceptual space for accessing the NDLTD collection. In addition, manual, semi-automatic, and automatic indexing and classification methods can be applied to place ETDs into conceptual spaces that relate to the Library of Congress or Dewey classifications, as well as discipline-specific thesauri (e.g., ACM's category system for computing) \cite{Gonc2001d}.
Another major space-related aspect of NDLTD deals with user interfaces. There are multiple graphical user interfaces that relate to our various software routines, including the ENVISION interface \cite{Envision}. In addition, we have investigated how the library metaphor applies to using our CAVE (virtual reality environment) \cite{DasNeves00}.
\subsubsection{Streams}
NDLTD deals with a variety of streams. At the simplest level are streams of characters for text, and streams of pixels for images. Some students have included audio files, or digital video, with their ETDs. These present challenges regarding quality of service if played back in real time, or alternatively storage problems if downloaded and then played back from a local system. On the one hand, using standards like MPEG will make it easier to prolong the useful life of multimedia-rich ETDs, but on the other hand the representations that allow streaming of audio and video tend to be proprietary. This suggests that students probably should store both types of representation. The other class of streams related to NDLTD is that of network protocols. Those involve transmissions of serialized streams over the network. Federated search, harvesting, and hybrid services, using a number of protocols, like Dienst, Z39.50, the Harvest system, and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), have been developed in the context of NDLTD \cite{Gonc2001c}.
\subsubsection{Structures}
Structure plays many roles in NDLTD. A database management system is at the heart of the software for submission and workflow management developed at Virginia Tech. XML and SGML are ways to describe the structure of metadata, or of ETDs themselves. While only a small number of submissions at Virginia Tech have used such markup approaches, larger numbers are being collected in Germany. Moreover, NDLTD has developed and is promoting the Interoperability Metadata Standard for Electronic Theses and Dissertations (ETD-MS) as a standard descriptive metadata set for describing electronic theses and dissertations \cite{ETDMS}. Structures in the form of \textit{semantic networks} are used inside MARIAN to represent ETD collections and metadata and are explored through the services provided.
\subsection{Case Study 2: Open Archives Initiative}
The Open Archives Initiative (OAI) \cite{lagoze01} is not a digital library by itself but a multi-institutional project to address interoperability of
archives and digital libraries by defining simple protocols for the exchange of metadata. The current OAI technical infrastructure is defined by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) \cite{OAI}, which defines mechanisms for archives to expose and export their metadata. In the following, this technical infrastructure is analyzed from the 5S point of view.
\subsubsection{Societies}
The main communities designed for the OAI society are electronic, namely active agents called harvesters and repositories, which interact through OAI-PMH. The other two kinds of communities emphasized by the initiative are the so-called \textit{data providers} and \textit{service providers}. The former may be the manager of an archive, acting on behalf of the authors submitting documents to the archive. The latter is a third party, creating end-user services based on data harvested from archives. Ultimately, we have those communities constituted by the final users of the services and those involved with administrative aspects of repositories/archives.
\subsubsection{Streams}
The main streams associated with the OAI are dynamic and include communications between harvester agents and the repository server. Those communications are organized as \textit{requests} from the agent to the server, which occur through specific verbs (see Open Archives Scenarios) embedded in HTTP requests, and \textit{responses} that are textual metadata, which must be encoded and serialized in XML streams. The Open Archives Initiative so far has not considered multimedia streams, except when they are encoded in XML as part of the metadata.
\subsubsection{Structures}
Major structures of OAI are involved with \textit{records}, \textit{sets}, and \textit{metadata formats}. OAI records can be considered containers \cite{Warwick}, which encapsulate several kinds of descriptive metadata. Thus, OAI records obey a structure organized into:
\begin{itemize}
\item \textit{Header}, which corresponds to information that is common to all records and includes a unique identifier and a datestamp -- the date of creation, deletion, or latest date of modification of an item, the effect of which is a change in the metadata of a record disseminated from that item.
\item A single manifestation of the metadata from an item. The OAI protocol supports multiple manifestations (structures) of metadata for any single item. At a minimum, repositories must be able to return records with metadata expressed in the Dublin Core format, without any qualification. Optionally, a repository also may be capable of disseminating other formats of metadata.
\item \textit{About}, an optional container to hold data about the metadata record itself, as opposed to the digital object associated with the metadata. Typically, this container is used to hold rights information regarding the metadata record, terms and conditions for usage, etc.
\end{itemize}
\textit{Sets} are optional hierarchical structures for grouping items in a repository for the purpose of selective harvesting of records. Membership of records in \textit{sets} is not mandatory, but \textit{sets} can share common records.
\textit{Registries}, with data about various OAI-compliant repositories, also are provided. This allows users or harvesters or service providers to
find suitable collections.
\subsubsection{Scenarios}
Regarding OAI repositories and the harvesting protocol, there is a fixed set of scenarios, namely those involved with requests and responses in the protocol conversations between harvesters and OAI archives. In a 5S analysis, we can associate each request-response pair with a scenario, involving an interaction between harvester and repository. Thus, in the OAI harvesting protocol there are scenarios for retrieving the identifiers of records in the repository restricted to specific sets (ListIdentifiers verb); to retrieve a particular record given an identifier and metadata format (GetRecord verb); to retrieve information about the repository, including administrative information (Identify verb); and to list all supported metadata formats, records and sets in the repository (respectively, ListMetadataFormats, ListRecords, and ListSets verbs)
Another extremely important set of services, which is not part of the OAI technical specifications itself, but is essential to its functionality, is provided by mediation middleware. This layer, which is placed between the repository and the OAI protocol itself, provides vertical communications, conversions, and translations from the OAI verbs and metadata organization to specific internal queries and operations on the underlying data representations of the repository. For example, if the repository is built upon a relational database, the mediation middleware is responsible for translating OAI requests to corresponding SQL queries.
\subsubsection{Spaces}
The OAI framework is naturally distributed along the physical space. Service providers can build indexing spaces on the top of metadata spaces, a kind of document space, and make use of vector or probabilistic spaces for building services like searching and filtering.
\subsection{Declarative Generation of DLs}
As a third application of the 5S framework, we have designed 5SL, a domain-specific, declarative language for conceptual modeling and generation of
digital library applications \cite{5SL}. 5SL is an XML serialization of 5S and has a formal semantics, which can be understood in terms of a
translation of the language constructs and primitives into the 5S formalisms. Its formal basis provides an unambiguous and precise DL specification
tool, which facilitates prototyping, allows proofs of assertions, and aids validation of implementations.
In 5SL, the specification of a digital library encompasses five complementary dimensions, including: the kinds of multimedia information the DL supports (Stream Model); how that information is structured and organized (Structural Model); different logical and presentational properties and operations of DL components (Spatial Model); the behavior of the DL (Scenario Model); and the different societies of actors and managers of services that act together to carry out the DL behavior (Societal Model).
To improve acceptability and interoperability, 5SL makes extensible use of existing standard specification sublanguages for representing DL concepts,
when possible. This possibility is defined by the ability to formally map those standards and sublanguages to 5S formal specifications. Moreover,
the need for the integration of multiple languages is a key aspect of the domain-specific language approach \cite{Mawl99}. A domain typically consists of multiple subdomains, each of which may require its own particular language. This is particularly true for digital libraries, but the aggregative nature of 5S matches this requirement especially well. 5SL utilizes an XML syntax, whose abundance of supporting software tools facilitates the construction of DL generators. Most of the 5SL model primitives are defined as XML elements, which can enclose other sublanguages that help to define DL concepts. In more detail, MIME types constitute the basis for encoding streams. XML Schema \cite{XMLSchema} and/or RDF Schema \cite{RDFSchema} are the primary tools for describing structures. User Interface Markup Language (UIML) \cite{UIML} and MathML \cite{MathML} are used to represent some aspects of spaces. And finally, an adapted and extended XML serialization of UML \cite{UML} is used with the Societal and Scenario Models.
\setcounter{figure}{1}
\begin{figure}[htbp]
\begin{center}
\includegraphics[angle=270,width=0.75\textwidth]{Generation.eps}
\caption{DL Generation Process with 5SL}
\end{center}
\end{figure}
The general process of automatic creation of DLs for a particular application is shown in Figure 2. Initially a DL designer is responsible for
formalizing a conceptual description of the digital library using the language concepts. This phase is normally preceded by a 5S analysis of the DL requirements and characteristics as in the previous subsections. Declarative specifications in 5SL are then fed into a DL generator, to produce tailored DLs, suitable for specific platforms and requirements. These are built upon a collection of stock parts and configurable components that provide the infrastructure for the new DL. This infrastructure includes the (e.g., MARIAN \cite{MARIAN1,MARIAN2}) classes of objects and relationships that make up the DL, and processing tools to create/load the actual library collection from raw documents, as well as services for searching, browsing, and collection maintenance.
5SL is in its infancy but we already have used it to build pilot systems and prototypes. In one of these, we built a 5SL generator for the MARIAN
system and used 5SL to design a union archive for a federation of ETD sites in NDLTD. In this union archive, metadata in ETD-MS
format is periodically harvested from ETD sites using the Open Archives Initiative Protocol for Metadata Harvesting. The MARIAN system works as a portal for accessing the collection. In this particular application, the component pool includes XML parsers, an OAI harvester and the MARIAN digital library API. MARIAN is built around a semantic network model, involving labeled digraphs or structures in 5S terminology, improved with weights and a hierarchy of classes. Any collection of nodes or links in a network can be weighted to represent how well they suit some description or fulfill some role.
The MARIAN DL generator, which is based on a DOM \cite{DOM} XML parser, automatically generates four kinds of output for the 5SL model of the NDLTD union archive (Figure 2):
\begin{enumerate}
\item Class managers and indexing classes for the NDLTD application
This includes class managers to represent the MARIAN semantic network view of the ETD-MS descriptive metadata standard. Class managers define the logical schema of the DL application, which in MARIAN corresponds to a set of Java classes that represent digital objects, their component parts, and linking information. Class managers also store and maintain instance objects of their class, and function as the search engines for the system. Indexing classes also are generated and are represented as sets of bipartite weighted semantic networks involving document parts and document features.
\item Collection Loader
The loader is an automatically generated SAX event handler that checks incoming XML documents against specifications, extracts structuring and indexing information from valid documents, including controlled authorities like person's names and subject headings, and invokes corresponding class manager loading methods which materialize and incrementally update structures and indexes, and manage underlying databases.
\item User interfaces
For user interfaces there are HTML web query forms for structured searches based on document and metadata structures, and classes for flat
representations of documents/metadata with methods for presenting different views of them using generated XSL stylesheets.
\item Tailored Databases
Finally, a set of customized tables are created which tailor the MARIAN general database schema for the specific structures of the DL applications at hand.
\end{enumerate}
%The amount of code in lines generated for each of these components is given in %Table 2.
%\begin{center}
%\begin{tabular}{|p{2.50in}| p{0.5in}| }
%\hline
%Indexing classes & 187\\ \hline
%Class managers & 424\\ \hline
%Analyzer & 361 \\ \hline
%Document representation and user interfaces & 853\\ \hline
%\end{tabular}
%\end{center}
%\begin{center}
%Table 2. MARIAN NDLTD Union Archive generated code
%\end{center}
%MARIAN architecture and features as well as a complete specification and generation of a digital library with 5SL \cite{5SL} are out of the scope of
%this paper. But to illustrate Figure 3, however helps to give an overview of the generation process, by showing a portion of the 5SL description for
%ETD-MS, the ETD
%metadata standard. In this particular case, we use an XML Schema for describing and generating the MARIAN semantic network representation of the
%descriptive metadata.
%\begin{figure}[htbp]
%\begin{center}
%\includegraphics[angle=270,width=0.80\textwidth]{etdmscode.eps}
%\caption{Portion of an XML schema defining the structure of ETD-MS, the electronic thesis and dissertation descriptive metadata standard}
%\end{center}
%\end{figure}
MARIAN architecture and features as well as a complete specification and generation of a digital library with 5SL \cite{5SL} are out of the scope of
this paper. However it is worth mentioning that one important feature of the MARIAN generator is its use of XML namespaces and the MARIAN API. The
MARIAN hierarchy of classes (or API) defines
a set of basic types for semantic networks (e.g., nodes, unweighted links), information retrieval (e.g., weighted links, weighted sets), and digital
library systems (e.g., controlled strings like personal names and subject headings, English and non-English terms, phrases, etc.). 5SL descriptions use
namespaces to import MARIAN types to specify properties of the many different parts of documents and metadata records. These properties include
specific matching methods, as well as methods for management of indexes, databases, and sets of instances of the particular class/type. Therefore,
in MARIAN, structure, content, and behavior are all represented by the use of weighted semantic networks in conjunction with a powerful API. These features tremendously facilitate the process of DL construction and maintenance.
%To be more specific, in the example of generation of the NDLTD Union Archive,
%the following steps were performed to represent ETD-MS records by mapping an XML schema to the mARIAN semantic networks: XML references are mapped to
%MARIAN unweighted link class managers (e.g., hasCreator, hasDescription), XML complex types are mapped to general MARIAN network nodes, and typed
%elements are mapped to sink nodes that inherit behavior (including loading and
%match) from the corresponding MARIAN class/type. Weighted links classes (e.g., occursInCreator, occursInDescription) for indexing element contents in
%actual XML documents also are generated. Therefore, in MARIAN, structure, content, and behavior are all represented by the use of weighted semantic
%networks in conjunction with a hierarchy of classes and a powerful API.
By using these techniques, we have automatically created several digital library applications. Most developed are those for the NDLTD union archive and
the National Library of Medicine DIRLINE collection. Work on these applications successfully demonstrated the feasibility of the 5SL
generation process in conjunction with MARIAN and its component pool. However the current 5SL design has its limitations. First, XML is very verbose
and 5SL design of complex digital libraries can be very cumbersome. We have been working on a user interface for graphical manipulation of 5S
constructs that automatically generates 5SL code \cite{Qinwei03}. Second, the current XML schema implementation is awkward for representing
structural metadata other than containment relationships. We expect that the use of RDF in specific cases or the use of recent standards for represent digital
objects such as METS \cite{METS} will help to alleviate the problem. Finally,
searching and browsing services were taken for granted due to the powerful MARIAN digital library API. More work is necessary to investigate the
process of generating DL prototypes and implementations for different and more complex scenarios/services using different component pools. First
results towards this goal can be found in \cite{Rohit03}. Nevertheless, it seems clear that 5S has intuitive appeal and practical application. In the next sections, we continue to explore 5S further with a more formal treatment.
\section{The 5S Formal Framework}
In this section, we precisely and unambiguously formalize most of the informal digital library concepts introduced in previous
sections. Figure 3 shows a map of the most important concepts and formal definitions. Each concept is associated with the corresponding
definition number
of its formal definition; arrows indicate that a concept is formally defined in terms of previously defined concepts that point to it \footnote{The
notion of a tuple (def. A.4) is used in most definitions, so, for simplicity, we are not showing arrows coming out of that concept in Fig.3. Other popular definitions are treated likewise.}. The mathematical preliminaries (Defs. A1--A14) are found in the Appendix.
\begin{figure}
\begin{center}
\includegraphics[angle=270,scale=0.4]{5Soverviewc.eps}
\caption{5S map of formal definitions}
\end{center}
\end{figure}
\subsection{5S Formalisms}
%%%%%%%%%%%%%%%%
%\setcounter{sdef}{8}
\begin{sdef}
A \textbf{stream} is a $sequence$ whose codomain is a nonempty set.
\end{sdef}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{sdef}
A \textbf{structure} is a tuple $(G,L,{\cal F})$, where $G=(V,E)$ is a directed graph with vertex set $V$ and edge set $E$, $L$ is a set of label values, and ${\cal F}$ is a labeling
function ${\cal F} : ( V \cup E ) \to L$.
\end{sdef}
As a derivative of this definition, the next one follows.
\begin{sdef}
A \textbf{substructure} of a structure $(G,L,{\cal F})$ is another structure $(G',L',{\cal F'})$ where $G'= (V',E')$ is a subgraph of G, $L' \subseteq L$ and ${\cal F'}: (V' \cup E') \rightarrow L'$.
\end{sdef}
\begin{sdef}
A \textbf{space} is a
measurable space, measure space, probability space, vector space, topological space, or a metric space \footnote{See Appendix definitions 9-14 for
formal definitions of each of these spaces.}.
\end{sdef}
\textbf{Probability} studies the possible outcomes of given events (or experiments) together with their relative likelihood and distributions. Probability is defined in terms of a \textbf{sample space} $S$, which is a set whose elements are called \textbf{elementary events}. More formally, in terms of a probability space, the set of possible events for an experiment consists of the $\sigma$-algebra $\B$ and a sample space is defined as the largest set $S \in \B$. The measure $\mu$ is called a probability distribution.
\textbf{Probabilistic information retrieval} (PIR) takes a more subjective interpretation of probability, called the \textit{bayesian} interpretation, which sees probability as a statistical procedure which endeavors to estimate parameters of an underlying probability distribution based on the observed distribution. In PIR the sample space is the set $Q \times D$ of all possible queries and documents and the probability distribution tries to estimate, given a query $q \in Q$ the probability that a document $d \in D$ will be \textbf{relevant} to the query, using any evidence at hand. Normally the words in the documents and in the query are the major sources of evidence. A precise definition of probability of relevance is dependent on the definition of relevance and different PIR models have different interpretations \cite{Crestani98}.
Vector spaces are the basis for a widely used information retrieval model, the Vector Space Model (VSM) ~\cite{Salton+75}. In this model, a document space $D$ is a vector space where a document $d_i \in D$ is represented by a $t$-dimensional vector $d_i=(w_{i1}, w_{i2}, ..., w_{it})$, $w_{ij}$ being the weight (a numerical value) of the $j$th index term $t_j$ of $d_i$, $ w_{ij} \geq 0$. An \textit{index term} is normally a word (or variant), occurring in the text of the document, whose semantics helps in defining the document's main themes. However, in general, an index term may be any value describing some aspect of the document, such as a feature value (e.g., color, shape, elevation, temperature) or descriptor (e.g., element in a thesaurus or classification system), or concept, or complex linguistic expression (e.g., phrase, entry in a gazetteer). Furthermore, it is possible to use their representation vectors, i.e., their terms and term weights, to define a number of functions such as \textit{degree of similarity} $s: D \times D \to \R$ between documents.
Vector spaces and measure spaces are often built on top of topological spaces, the latter being the more basic concept. Any use of the concept of distance implies an underlying \textbf{metric space}, which is a topological space whose open sets are defined by $\{y \mid d(x,y) < r\}$, where $d(x,y)$ is the distance between $x$ and $y$.
%%%%%%%%%%%%%%%%
\begin{sdef}
A \textbf{system state} (from now on, just state) is a function $s:L \to V$, from labels $L$ to values $V.$ A \textbf{state set} $S$ consists of a set of state functions $s:L\to V.$
\end{sdef}
Labels represent a logical \textit{location} associated with some value in a particular state. Thus $s_i(X)$ is the value, or the contents, of location X in state $s_i \in S$. The nature of the values related to contents in a system is granularity-dependent and its definition is out of the scope of this paper. Normally there are simple values of basic datatypes such as strings and numbers or higher-level DL objects such as digital objects and metadata specifications.
%%%%%%%%%%%%%%%%
\begin{sdef}
A \textbf{transition event} (or simply \textbf{event}) on a state set S is an element $e=(s_i,s_j) \in ( S \times S )$ of a
binary relation on state set $S$ that signifies the transition from
one state to another. An event $e$ is defined by a \textit{condition} function $c(s_i)$ which evaluates a Boolean function in state $s_i$ and by an \textit{action} function \textbf{$p$}.
\end{sdef}
This transition event is not a \textit{probabilistic} event~\cite{CLR90}.
Rather, it is more like the events in networked operating systems
theory~\cite{SinghalShivaratri94}, transitions in finite state
machines~\cite{DavisWeyuker}, those modeled by the Unified
Modeling Language (UML)~\cite{UML}, or transitions between places in Petri Nets \cite{Oberweis96}.
The condition is used to describe circumstances under which a state transition can take place. An action models a reference to an operator, command, subprogram or method, responsible to perform the actual state transition. Events and actions can have parameters that abstract data items associated with attributes (labels) of a state.
%%%%%%%%%%%%%%%%
\begin{sdef}
\label{def:scenario}
A \textbf{scenario} is a sequence of related transition events
$\langle e_1, e_2, ..., e_n \rangle$
on state set~$S$
such that
$e_k = (s_k, s_{k+1}) $,
for $ 1 \leq k \leq n$.
\end{sdef}
We also can interpret a scenario as a path in a directed graph $G=(S,{\Sigma}_e)$, where vertices correspond to states in the state set $S$ and
directed edges are equivalent to events in a set of events ${\Sigma}_e$ (and correspond to transitions between states). (Technically, $G$ is a
pseudodigraph \footnote{A digraph which permits both loops and multiple edges between nodes.}, since loops $(s_i,s_i)$ are possible as events.)
%%%%%%%%%%%%%%%%
\begin{sdef}
\label{def:activity}
A \textbf{service}, \textbf{activity}, \textbf{task},
or \textbf{procedure} is a set of scenarios.
\end{sdef}
Note that the scenarios defining a service can have shared states. Such a set of related scenarios has been called a ``scenario view''~\cite{Hsia94} and a ``use case'' in the UML~\cite{UML}. In this framework, a simple transmission service of streams can be formally specified as:
%%%%%%%%%%%%%%%%
\begin{sdef}
Let $T=\langle t_1,t_2,...,t_n \rangle$ be a stream. Let event $e_{t_i} = (s_{t_i},d_{t_i}$\footnote{ $d_{t_i}$ is the state that indicates that the destination has received stream item $t_i$}) and event $a_{t_i} = (d_{t_i},s_{t_{i+1}})$. A transmission of stream $T$ is the scenario (sequence of related events) $e_T = \langle e_{t_1}, a_{t_1}, e_{t_2}, a_{t_2},... e_{t_n} \rangle$
%Let $T=\langle t_1,t_2,...,t_n \rangle$ be a stream. Let $S$ be the initial %$state $ set of the source before the transmission. Let $D$ be the final state %set of the destination after the transmission. Let $s_{t_i}$ be the state in %$S
%\cup D$ that indicates that the source is ready to transmit stream
%item $t_i$. Let $d_{t_i}$ be the state in $S \cup D$ that indicates
%that the destination has just received stream item $t_i$. Let event
%$e_{t_i} = \langle s_{t_i},d_{t_i} \rangle$. A
%\textbf{transmission} of stream $T$ is the scenario (sequence of events)
% $e_T = \langle e_{t_1}, e_{t_2}, ... e_{t_n} \rangle$.
\end{sdef}
%\begin{figure}[htbp]
%\begin{center}
%\includegraphics[scale=0.4]{trans.eps}
%\caption{A scenario for the transmission of a stream}
%\end{center}
%\end{figure}
Scenarios are \textit{implemented} to make a working system and the
so-called ``specifi\-ca\-tion-im\-ple\-men\-ta\-tion'' gap must be
overcome~\cite{RossonCarroll96}. Formally, the implementation of scenarios can be mapped to an abstract machine represented by a deterministic finite automaton (DFA). This automaton $M=(Q,\Sigma_e,\delta,q_0,F)$ is such that M is the user-perceived conceptual state machine of the system and accepts a language $L(M)$ over the set of events ${\Sigma}_e$. A grammar $G = (V,{\Sigma}_e,R,s_0)$ for the language $L(M)$ is such that the non-terminals set V corresponds to the state set $S$, the terminals are the finite set of events ${\Sigma}_e$, $s_0$ is a distinguished initial state initializing all locations in that state, and $R$ is a finite set of rules. Each rule in R is of the form $s_i \rightarrow e s_j$ and conveys the system from state $s_i$ to $s_j$ as a consequence of event $e$, or is of the form $s_i \rightarrow e$ when $s_j \in F$ is a final state. The grammar and the corresponding conceptual state machine make up the abstract formal model which the analyst uses to capture, represent, and display system behavior in terms of scenarios. Alternatively, denotational semantics \cite{Winskel93} and object-oriented abstractions \cite{Rosson99} offer a programming language perspective for the question of formal scenario implementation.
%%%%%%%%%%%%%%%%
\begin{sdef}
A \textbf{society} is a tuple $(C,R)$, where
\begin{enumerate}
\item $C=\{c_1,c_2,...,c_n\}$ is a set of conceptual communities, each community referring to a set of individuals of the same class or type (e.g., actors, activities, components, hardware, software, data);
\item $R=\{r_1,r_2,...,r_m\}$ is a set of relationships, each relationship being a tuple $r_j=(e_j, i_j)$, where $e_j$ is a Cartesian product $c_{k_1}\times c_{k_2}\times\cdots\times c_{k_{n_j}}$, $1\le k_1 < k_2 <
\cdots < k_{n_j}\le n$, which specifies the communities involved in the relationship and $i_j$ is an activity (cf. Def. 8) that describes the
interactions or communications among individuals.
\end{enumerate}
\end{sdef}
The second part of the definition emphasizes the collaborative nature of societies as in the case of users and service managers engaged in performing
DL services. Scenarios describe the service behavior exactly in terms of interactions among the involved societies. For example, an ETD submission service involves interactions between graduate students and an ETD submission workflow manager (an electronic member of a service managers society). (cf. Section 6 for more formal examples of Societies.)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{5S Formal Definition of Digital Library}
As pointed out in previous sections, there is no consensual definition of a digital library. This makes the task of formally defining this kind of application and its components extremely difficult. In this section, we approach this problem by constructively defining a ``core'' or a ``minimal'' digital library, i.e., the minimal set of components that make a digital library, without which, in our view, a system/application cannot be considered a digital library. Each component (e.g., collections, services) is formally defined in terms of an S construct or as combinations or compositions of two or more of them. The set-oriented and functional mathematical formal basis of 5S allows us to precisely define those components as functional compositions or set-based combinations of the formal Ss.
Informally, a digital library involves a managed \textit{collection} of information with associated \textit{services} involving \textit{communities}
where information is stored in digital formats and accessible over a network. Information in digital libraries is manifest in terms of \textit{digital
objects}, which can contain textual or multimedia content (e.g., images, audio, video), and \textit{metadata}. Although the distinction between data
and metadata often depends on the context, metadata commonly appears in a structured way and covering different categories of information
\textit{about} a digital object. The most common kind of metadata is \textit{descriptive metadata}, which occurs in catalogs and indexes and includes
summary information used to describe objects in a DL. Another common characteristic of digital objects and metadata is the presence of
some internal structure, which can be explicitly represented and explored to provide better DL services. Basic services provided by digital libraries are indexing, searching, and browsing. Those services can be tailored to different communities depending on their roles, for example, creators of material, librarians, patrons, etc.
In the following we formally define those concepts of \textit{metadata (structural and descriptive), digital object, collection, catalog, repository, indexing service, searching service, browsing service}, and finally \textit{digital library}.
%%%%%%%%%%%%%%%%
\begin{sdef}
A \textbf{structural metadata} specification is a structure.
\end{sdef}
%%%%%%%%%%%%%%%
This simple definition emphasizes the role of structural metadata as a representation or abstraction of relationships between digital objects and their
component parts (cf. Def. 16). The graph-based representation of this type of metadata can be explicitly expressed, as in the case of markup
\cite{Coombs+88}, or implicitly computed \cite{Navarro97,Clarke95}.
The next definition, for \textbf{descriptive metadata specifications}, is inspired by new developments in the metadata area, mainly those related to
the \textit {Semantic Web} \cite{SemanticWeb} and the Resource Description Framework (RDF) \cite{RDF}, and emphasizes the semantic relationships
implied by the labeling function in a structure. Figure 4a illustrates the basic constructs. Statements, which are triples corresponding to a specific resource (the thing being described) together with a named property about the resource plus the value of that property for that resource, are promoted to first-class concepts. Figure 4b shows an example of an instantiation of the construct for a descriptive metadata specification about an electronic thesis with four statements: Statement1 = (Thesis1, `author', `M.A.Goncalves'), Statement2 = (Thesis1, `degree', Degree1), Statement3 = (Degree1, `level', `doctoral'), and Statement4 = (Degree1, `grantor', `Virginia Tech'). Below we define the notions of \textbf{descriptive metadata specification} and \textbf{metadata format} more formally.
\begin{figure}
\begin{center}
\subfigure[]{
%\label{FIG02a}
\includegraphics[height=25cm,angle=270,scale=0.24]{RDFa.eps}}
\subfigure[]{
\label{FIG02b}
\includegraphics[height=25cm,angle=270,scale=0.24]{RDFexamplea.eps}}
\caption{Overview of the descriptive metadata model with example}
\label{FIG02}
\end{center}
\end{figure}
%%%%%%%%%%%%%%%%
\begin{sdef}
Let ${\cal L} = \bigcup{D_k}$ be a set of literals defined as the union of domains $D_k$ of simple datatypes (e.g., strings, numbers, dates, etc.). Let also ${\cal R}$ and ${\cal P}$ represent sets of labels for resources and properties respectively. A \textbf{descriptive metadata specification} is a structure $(G, {\cal R} \cup {\cal L} \cup {\cal P},{\cal F})$, where:
\begin{enumerate}
\item ${\cal F}:(V \cup E) \rightarrow ({\cal R} \cup {\cal L} \cup {\cal P})$ can assign general labels ${\cal R} \cup {\cal P}$ and literals from ${\cal L}$ to nodes of the graph structure;
\item for each directed edge $e=(v_i,v_j)$ of $G$, ${\cal F}(v_i) \in {\cal R} \cup {\cal L}$, ${\cal F}(v_j) \in {\cal R} \cup {\cal L}$ and ${\cal F}(e) \in {\cal P}$;
\item ${\cal F}(v_k) \in {\cal L}$ if and only if node $v_k$ has outdegree $0$.
\end{enumerate}
The triple $st=({\cal F}(v_i), {\cal F}(e), {\cal F}(v_j))$ is called a \textbf{statement} (derived from the descriptive metadata specification), meaning that the resource labeled ${\cal F}(v_i)$ has property ${\cal F}(e)$ with value ${\cal F}(v_j)$ (which can be designated as another resource or literal).
\end{sdef}
%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%
\begin{sdef}
Let $D_{{\cal L}_{MF}} = \{D_1,D_2,...,D_i\}$ be the set of domains that make up a set of literals ${\cal L}_{MF} = \bigcup^{i}_{j=1}{D_j}$. As for metadata specifications, let ${\cal R}_{MF}$ and ${\cal P}_{MF}$ represent sets of labels for resources and properties, respectively. A \textbf{metadata format} for descriptive metadata specifications is a tuple $MF=(V_{MF},{\rm def}_{MF})$ with $V_{MF} = \{{\cal R}_1,{\cal R}_2,...,{\cal R}_k\} \subset 2^{{\cal R}_{MF}}$ a family of subsets of the resources labels ${\cal R}_{MF}$ and ${\rm def}_{MF}:V_{MF} \times {\cal P}_{MF} \rightarrow V_{MF} \cup D_{{\cal L}_{MF}}$ is a property definition function.
\end{sdef}
%%%%%%%%%%%%%%%%
Therefore a metadata format, through the property definition function, constrains the kinds of resources that can be associated together in statements
of a metadata specification as well as the basic datatype domains, which are associated with pairs (resource-property) related to literals \cite{CastelliECDL02}. For example, for any set of labels ${\cal R}$ for resources, the Dublin Core metadata format defines that ${\rm def}_{DC}({\cal R},`title')= String$ and ${\rm def}_{DC}({\cal R},`subject')=SubjectTerms$ where $SubjectTerms$ is a finite set of labels for Resources corresponding to controlled terms. The following definition follows from the previous two definitions:
%%%%%%%%%%%%%%%%
\begin{sdef}
A descriptive metadata specification $MS = (G_{MS}, {\cal R}_{MS} \cup {\cal L}_{MS} \cup {\cal P}_{MS}, {\cal F}_{MS})$ \textbf{conforms with} a metadata format $MF = (V_{MF},{\rm def}_{MF})$ if ${\cal R}_{MS} \subseteq {\cal R}_{MF}$, ${\cal L}_{MS} \subseteq {\cal L}_{MF}$, ${\cal P}_{MS} \subseteq {\cal P}_{MF}$, and for every statement $st=(r,p,l)$ derived from $MS$, $r \in {\cal R}_k$ for some ${\cal R}_k \in V_{MF}$ and $p \in {\cal P}_{MS}$ implies $l \in {\rm def}_{MF}({\cal R}_k,p).$
\end{sdef}
%%%%%%%%%%%%%%%%
\begin{figure}
\begin{center}
\includegraphics[angle=270,scale=0.35]{structstream.eps}
\caption{A StructuredStream for an ETD (adapted from [Navarro and Baeza-Yates 1997])}
\end{center}
\end{figure}
%%%%%%%%%%%%%%%%
\begin{sdef}
Given a structure $(G, L, {\cal F})$, $G=(V,E)$ and a stream $S$, a \textbf{StructuredStream} is a function $V \rightarrow ({\N} \times {\N})$ that associates each node $v_k \in V$ with a pair of natural numbers $(a,b)$, $a < b$, corresponding to a contiguous subsequence $[S_a, S_b]$ (segment) of the Stream $S$.
\end{sdef}
%%%%%%%%%%%%%%%%
Therefore, a StructuredStream defines a mapping from nodes of a structure to segments of a stream. An example in a textual stream can be seen in Figure
5. From the example, it can be deduced that several structures can be imposed over one stream and vice-versa. Also, it can be seen that segments associated with a node should include the segments of its children (in the case of a hierarchical tree), although it is not equal to the union of those, as ``gaps'' or ``holes'' can occur between child segments \cite{Navarro97}. Finally, it should be noted that this definition works also for multimedia streams like audio, video, and images.
\begin{sdef}
A \textbf{digital object} is a tuple $do = (h, SM, ST, StructuredStreams)$ where
\begin{enumerate}
\item $h \in H$, where H is a set of universally unique handles (labels);
\item $SM = \{sm_1, sm_2, \dots, sm_n\}$ is a set of streams;
\item $ST = \{st_1,st_2, \dots, st_m\}$ is a set of structural metadata specifications;
\item $StructuredStreams = \{stsm_1, stsm_2, \dots, stsm_p\}$ is a set of StructuredStream functions defined from the streams in the $SM$ set (the second component) of the digital object and from the structures in the $ST$ set (the third component).
\end{enumerate}
\end{sdef}
Fig. 6 shows an example of a very simple digital object with one structure and several streams. Two important aspects must be pointed out about this
formal definition of a digital object:
\begin{enumerate}
\item Any real implementation does not need to enforce physical containment of the several component parts of a digital object; for example, we could have pointers to external streams.
\item The definition does not consider active behavior of digital objects \cite{Fedora,Nelson01} which supports operations like
different disseminations or exporting of subparts. While there is no explicit restriction regarding this, the definition conforms to our minimalist
approach.
\end{enumerate}
\begin{figure}
\begin{center}
\includegraphics[angle=270,scale=0.4]{do2.eps}
\caption{A simple digital object}
\end{center}
\end{figure}
%%%%%%%%%%%%%%%%
\begin{sdef}
A \textbf{collection} $C=\{do_1,do_2, \dots, do_k\}$ is a set of digital objects.
\end{sdef}
\begin{sdef}
Let C be a collection with k handles in H. A \textbf{metadata catalog} $DM_C$ for C is a set of pairs $\{(h,\{dm_1, \dots, dm_{k_h}\})\}$, where $h \in H$ and the $dm_i$ are descriptive metadata specifications.
\end{sdef}
%%%%%%%%%%%%%%%%
\begin{sdef}
Let $C$ be a collection with handles H. A \textbf{repository} is a tuple $(R, get,store,del)$, where $R \subset 2^C$ is a family of collections and the functions ``get'', ``store,'' and ``del'' satisfy:
\begin{enumerate}
\item $get:H \rightarrow C$ maps a handle $h$ to a digital object \textit{get(h)}.
\item $store: C \times R \rightarrow R$ maps $(do, \tilde{C})$ to the augmented collection $\{do\} \cup \tilde{C}$.
\item $del: H \times R \rightarrow R$ maps $(h, \tilde{C})$ to the smaller collection $\tilde{C}- \{get(h)\}$.
\end{enumerate}
\end{sdef}
Thus a repository encapsulates a set of collections and specific services to manage and access the collections.
%%%%%%%%%%%%%%%%
\begin{sdef}
Let $I : 2^{\cal T} \to 2^H$ be an index function where ${\cal T}$ is a set of indexing features and $H$ is a set of handles. An \textbf{index} is a
set of index functions. An \textbf{indexing service} is a single scenario $\{\langle is_1,is_2,...,is_n \rangle \}$ comprised of pipelined scenarios $is_1,is_2,...,is_n$ in which the starting state $s_{k_0}$ of the first event of the initial scenario $is_1$ has a collection $s_{k_0}(K)=C$ and/or a metadata catalog $s_{k_0}(Y)=DM_C$ for collection $C$ as its values and the final state $s_{k_f}$ of the final scenario $is_n$ has an index $I_C = s_{k_f}(Z)$ as its value (K, Y, and Z being labels of the respective states).
\end{sdef}
The interpretation of the index and the indexing service is dependent upon the underlying indexing space. Features of an indexing space can be words, phrases, concepts, or multimedia characteristics, like shape or color, appearing or associated with the content of a digital object (in its descriptive and structural metadata or streams). Normally, if a vector space is considered, terms are treated as unrelated, therefore defining orthogonal vectors that span a space ${\cal T}$ with dimension $m$. If a probabilistic space $p=(X,\B,\mu)$ is used, ${\cal T} = X$ is the set of distinct terms and is called a \textit{sample space}. Also an index can be thought of as a mapping from an indexing space to a \textit{document (digital object) space} defined by the collection.
The indexing service normally takes the shape of a \textit{pipeline service} where scenarios themselves are executed in sequence and the final state of
a scenario is the starting state of the next one. A very simple instance of such an indexing service is shown in Figure 7 for indexing of textual material. The indexing service is composed of three scenarios organized as a pipeline of the following scenarios: 1) tokenization, which identifies unique terms inside the textual streams; 2) stopword removal, which filters out terms not useful for retrieval; and 3) stemming, which removes affixes and allows retrieval of syntactic variations of query terms \cite{BYRN99}. Each one of the scenarios can be thought of as doing some transformation in the representations of digital objects in order to produce the index function. Note again that we are making use our minimalist approach by not considering complex indexes, for example, defining locations inside streams of digital objects for phrase, proximity, or structural queries.
% also fit the basic definition by taking some parts of the digital objects, such as the set of structural metadata, to be the empty set e.g., in the case of a digital object made of a unique textual stream.
\begin{figure}
\begin{center}
\includegraphics[clip,angle=270,scale=0.5]{indexFigurec.eps}
\caption{Simple indexing service}
\end{center}
\end{figure}
%%%%%%%%%%%%%%%%
\begin{sdef}
Let $Q$ be a set of conceptual representations for user information needs, collectively called \textit{queries}. Let $M_{I_C}: Q \times (C \times DM_C) \rightarrow \R$ be a matching function, associated with an index $I_C$, that associates a real number with a query $q \in Q$ and a digital object $do \in C$ and possibly its descriptive metadata specifications $ms \in DM_C$, indicating how well the query representation matches with the digital object, structurally, by content, or regarding the descriptive metadata specifications. A \textbf{searching service} is a set of searching scenarios $\{sc_1,sc_2, \dots, sc_t\}$, where for each query $q \in Q$ there is a searching scenario $sc_k=\langle e_0, \dots, e_n \rangle$ such that $e_0$ is the start event triggered by a query q and event $e_n$ is the final event of returning the matching function values $M_I(q, d)$ for all $d \in C$.
\end{sdef}
%%%%%%%%%%%%%%%%
%As event $e_n$, most probably just the documents with higher scores will be %presented to the user in a list ordered by value.
The components of a digital object $do$, are denoted by $do(1)$, $do(2)$, etc. Therefore, $do_k(2)$ denote the second component, i.e., the stream set component of a digital object $do_k$, $do_k(3)$ its structural metadata set component (third component), and $do_k(4)$ its set of StructuredStreams functions (fourth component). Let also $G[v]$ denote the subgraph of a directed graph $G$ containing node $v$ and all points and edges reachable starting from $v$. A substructure defined by $G[v]$ inherits the labeling of the structure defined with G. Finally, let $f:A \rightarrow B$ and let ${\cal D}$ be any non-empty subset of A. The \textbf{restriction} of $f$ to ${\cal D}$, denoted by $f\vert_{\cal D}$, is a subset of $f$ and is a function from ${\cal D}$ to $B$. Then, for a collection $C$:
\begin{enumerate}
\item $AllStreams$ = $({\displaystyle \cup_{do_k \in C} do_k(2)})$ and \textit{AllSubStreams} = ${\displaystyle \cup_{sm_t \in AllStreams} }\{sm_t[i,j] \mid sm_t = \langle a_{0}, a_1, \dots, a_n \rangle, 0 \leq i \leq j \leq n)\}$ will be the set of all streams and substreams (segments of streams) of all digital objects in the collection $C$;
\item \textit{AllSubStructuredStreams} = $\bigcup_{k,j}(SubStructuredStream_{k_j})$ where:
\begin{enumerate}
\item $d_k \in C$;
\item $G_{k_j} = (V_{k_j},E_{k_j})$ is the first component of some structure $st_{k_j} \in d_k(3)$;
\item ${\cal H}_{k_j} = \{G_{k_j}[v_t] \mid vt \in V_{k_j}\}$ corresponds to the set of all substructures of $st_{k_j}$;
\item $SubStructuredStream_{k_j} = \{ {\cal S}\vert_{V^{'}} \mid (V^{'},E^{'}) \in {\cal H}_{k_j}$, ${\cal S} \in d_k(4)$ is a StructuredStream function defined from the structure $st_{k_j}$, and ${\cal S}\vert_{V^{'}}$ is the restrition of ${\cal S}$ to ${V^{'}}$$\}$.
\end{enumerate}
Therefore, $AllSubStructuredStreams$ corresponds to the set of all possible substructures and their corresponding connections to streams inside digital objects of the collection.
\end{enumerate}
%%%%%%%%%%%%%%%%
\begin{sdef}
Let $H=((V_H,E_H),L_H,{\cal F}_H)$ be a structure and $C$ be a collection. A \textbf{hypertext} $HT = (H, Contents, {\cal P})$ is a triple such that:
\begin{enumerate}
\item $Contents \subseteq C \cup AllSubStreams \cup AllSubStructuredStreams$ is a set of contents that can include digital objects of a collection $C$, all of their streams (and substreams) and all possible \textit{restrictions} of the StructuredStream functions of digital objects.
\item ${\cal P}: V_H \rightarrow Contents$ is a function which associates a node of the hypertext with the node content.
\end{enumerate}
\end{sdef}
%%%%%%%%%%%%%%%%
A hyperlink is an edge in the hypertext graph. Source nodes of a hyperlink are called ``anchors'' and are generally associated via function ${\cal P}$ with segments of streams. Also, in this definition, two basic types of hyperlinks can be identified: \textit{structural} and \textit{referential} \cite{Wang98}. Structural hyperlinks allow navigation inside internal structures and across streams of digital objects. Referential hyperlinks usually have their target nodes associated with different digital objects or their subcomponents.
Figure 8 illustrates the definition. The hypertext is made by structural hyperlinks that follow the structural metadata and external referential
links. Links originate from (segments of) streams. Link targets for, respectively, links 1, 2, and 3, are an entire digital object, a portion of its StructuredStream function (in the figure, represented by the subgraph pointed to by the link and the associated streams) and one of its streams, in this case an image.
An example of such a hypertext is the Web. The Web is a structure where hypertext links connect nodes that can be associated with: 1) complete HTML pages that can be considered digital objects; 2) substructures of a HTML page, for example, a section of the page; and 3) links to streams, e.g., images, audios, or text. The Distributed Graph Storage (DGS) system also implements similar ideas with structural and hyper-structural links representing, respectively, the internal structures of digital objects and hypertext constructs \cite{DGS}. It should be noted that for the sake of brevity we are not describing links to services, for example, external plugins that can be invoked by browsers or Web forms.
\begin{figure}
\begin{center}
\includegraphics[angle=270,scale=0.5]{hypermedia5.eps}
\caption{A simple hypertext}
\end{center}
\end{figure}
%%%%%%%%%%%%%%%%
\begin{sdef}
A \textbf{browsing service} is a set of scenarios $\{ sc_1, \dots, sc_n \}$ over a hypertext (meaning that events are defined by edges of the hypertext graph $(V_H, E_H)$), such that traverse link events $e_i$ are associated with a function $TraverseLink: V_H \times E_H \rightarrow Contents$, which given a node and a link retrieves the content of the target node, i.e., $TraverseLink(v_k, e_{k_i}) = {\cal P}(v_t)$ for $e_{k_i} = (v_k,v_t) \in E_H$.
\end{sdef}
%%%%%%%%%%%%%%%%
Therefore, by this definition, every browsing service is associated with an underlying hypertext construct. This view unifies the three modes of
browsing defined by Baeza-Yates and Ribeiro-Neto \cite{BYRN99}: flat browsing, structured guided, and navigational mode. The third one is the most
general case and fits exactly our model. The first two can be considered special cases. In flat browsing the hypertext has a flat organization, for example, an ordered list of documents or a set of points in an image, and the graph structure of the hypertext corresponds to a disconnected bipartite graph. In the second one, which includes classification hierarchies and directories, the hypertext graph is a tree. Many semi-structured wrapper algorithms disclose this hypertext ``hidden'' structure in the Web. Once revealed, this structure can be recorded in databases or represented in other semi-structured models to allow queries or transformations. Methodologies like PIPE \cite{Ramakrishnan} make use of this information to personalize Web sites.
Note also that more sophisticated kinds of hypertext can be defined by extending the current definition. For example, we could relax the function
${\cal P}$ to be a relation and associate different contents with the same node, which could be achieved by having different modes of traversing the
same link in an extension of the \textit{TraverseLink} function \footnote{This extended approach also generalizes the notion of link directionality
where bi-directional links or non-directional links correspond just to different ways of traversing the link (e.g., SOURCE\_TO\_SINK, SINK\_TO\_SOURCE,
BOTH).}. However, the present definition is simpler and serves well our minimalist approach \footnote{Note also that libraries can support
\textit{serendipity} or `random links'.}.
%%%%%%%%%%%%%%%%
%\begin{sdef}
%A \textbf{user interface} is a service that connects user states with
%machine states into sets of scenarios. A user interface implements [WHAT %FUNCTION?].
%\end{sdef}
%The state of wetware includes positions of fingers, hands, arms, and
%frame of mind. Device I/O states on users' machines include pixel
%illuminations on video display devices, keyboard key positions, and
%quantized sound pressures in microphones and speakers.
%%%%%%%%%%%%%%%%
%\begin{sdef}
%An \textbf{information warehouse} is a set of repositories, metadata
%services, search services, indexing services, user interface services,
%and transmission services%.
%\end{sdef}
%An information warehouse is for transforming raw information into
%information products.
%%%%%%%%%%%%%%%%
\begin{sdef}
A \textbf{digital library} is a 4-tuple $({\cal R}, DM, Serv, Soc)$, where
\begin{itemize}
\item ${\cal R}$ is a repository;
\item $DM = \{DM_{C_1}, DM_{C_2}, ..., DM_{C_K}\}$ is a set of metadata catalogs for all collections $\{C_1, C_2, ..., C_K\}$ in the repository;
\item Serv is a set of services containing at least services for indexing, searching, and browsing;
\item Soc is a society.
\end{itemize}
\end{sdef}
We should stress that the above definition only captures the syntax of a digital library, i.e., what a digital library is. Many semantic constraints
and consistency rules regarding the relationships among the DL components (e.g., how the scenarios in $Serv$ should be built from ${\cal R}$ and $DM$
and from the relationships among communities inside the society $Soc$, or what the consistency rules are among digital objects in collections of ${\cal
R}$ and metadata records in $DM$) are not specified here. Those will be a subject of future research.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Example: Formal Treatment of Open Archives and the NDLTD Union Archive}
\subsection{Open Archives Initiative}
The following formalizes the Open Archives Intitiative Protocol for Metadata Harvesting \cite{lagoze01} (OAI-PMH).
%%%%%%%%%%%%%%%%
\begin{sdef}
Let $dl = ({\cal R}, DM, Serv, Soc)$\ be a digital library. The digital library dl can be considered \textbf{OAI complaint} if:
\begin{enumerate}
\item there are two electronic members of the dl society, $\{dp,hvt\} \subset Soc(1), Soc= dl(4)$, called the data provider manager and the harvester;
\item there is a service $OAI\_Harvesting \in dl(4) = Soc$ whose behavior is defined below; and
\item $(\{dp\} \times \{hvt\},OAI\_Harvesting) \in Soc(2) $
\end{enumerate}
\end{sdef}
%%%%%%%%%%%%%%%%
The data provider manager $dp$ responds to requests of the harvester $hvt$. Conversations between the harvester and the data provider manager constitute the OAI harvesting service $OAI\_Harvesting$. $OAI\_Harvesting = \{$\textit{Identify},
\textit{ListMetadataFormats}, \textit{ListSets}, \textit{ListIdentifiers}, \textit{ListRecords}, \textit{GetRecord}$\} \in Serv$ is a service
formally defined below as 6 scenarios:
\begin{enumerate}
\item Identify
\textbf{Goal}: Returns general information about the archive (what in OAI terms corresponds to the repository ${\cal R}$ along with a metadata catalog $DM_{C_k} \in DM$ for some $C_k$ in the repository.
\textbf{Scenario}: $\langle e_1:p=$\textit{identify}, $e_2:p=response($\textit{identification}$)\rangle$,
where $e_1$ is an event generated by the harvester $hvt$ invoking an action in $dp$ of $dl$, $e_2$ is the event corresponding to the response from the
data provider $dp$, $p:$ specifies the corresponding action that is being invoked, and \textit{identification} is a parameter of the response
action. The \textit{identification} parameter is a descriptive metadata specification $= ({G}_{Ident}, {\cal R}_{Ident} \cup {\cal L}_{Ident} \cup
{\cal P}_{Ident},{\cal F}_{Ident})$ about the archive, where:
\begin{enumerate}
\item resource ${\cal R}_{Ident}= \{id\}$ is a unique identifier for the archive; and
\item properties ${\cal P}_{Ident} = \{$repositoryName, baseURL, protocolVersion, earliestDatestamp, deletedRecord, granularity$\}$
\end{enumerate}
\item ListMetaFormats
\textbf{Goal}: Lists metadata formats supported by the archive as well as their schema location.
\textbf{Scenario}: $\langle e_1:p=ListMetadataFormats,e_2:p=response({metadata\_formats}) \rangle$ and $oai\_{dc} \in {metadata\_formats}$, meaning the Dublin Core metadata format is mandatory.
\item ListSets
\textbf{Goal}: Provides a hierarchical listing of sets in which records may be organized.
\textbf{Scenario}: $\langle e_1:p=ListSets(resumptionToken)$, $e_2:p$$=response$($archive\_sets$, $resumptionToken_i) \rangle $ and $archive\_sets$ $=$
$\{set_1$, $set_2$, $...$, $set_k\}$ where each $set_i$ is a 3-tuple $(setSpec_i$, $setName_i$, $setDescription_i)$ and:
\begin{enumerate}
\item $setSpec_i$ is a colon [:] separated sequence of strings $\langle str_{1i}:str_{2i}:...:str_{ni}\rangle$ indicating the path from the root
of
the set hierarchy to the respective node. Each string in the sequence must not contain any colons [:]. Since a setSpec forms a unique identifier for the set within the repository, it must be unique for each set. Flat set organizations have only sets with setSpec that do not contain any colons [:].
\item $setName_{i}$ -- a short human-readable string naming $set_{i}$
\item $setDescription_{i}$ - an set of descriptive metadata specifications about $set_{i}$ (metadata format not specified; Dublin Core suggested).
\end{enumerate}
The resumptionToken is a mechanism for flow control when returning an incomplete list of sets. Its exact format is not defined by the protocol. The only defined use of resumptionToken is as follows \cite{OAI}:
\newline
\newline
$
\left\{
\begin{array}{l}
resumptionToken \neq \emptyset, \text{if archives\_sets list is incomplete}; \\
resumptionToken = \emptyset, \text{if archives\_sets list completes a previously received list}; \\
resumptionToken_i = resumptionToken_{i+1}, \text{ where } resumptionToken_{i +1}\\ \text{is the resumptionToken used in the next
ListSets request and } resumptionToken_i \\ \text{is the resumptionToken received in the response of the previous request.}
\end{array}
\right.
$
\newline
%\begin{itemize}
%\item an archive must include a resumptionToken element as part of each %response that includes an incomplete list;
%\item in order to retrieve the next portion of the complete list, the next %request must use the value of that resumptionToken element as the value of the %resumptionToken argument of the request;
%\item the response containing the incomplete list that completes the list must %include an empty resumptionToken element;
%\end{itemize}
\item ListRecords
\textbf{Goal}: Retrieves metadata for multiple records.
\textbf{Scenario}: $\langle e_1:p=$\textit{ListRecords(from,until,set,metadataPrefix, resumptionToken)}, $e_2:p=response(\{oai\_record_1,...,
oai\_record_k\}, resumptionToken_i)\rangle $. Each $ oai\_record_i$ is a 3-tuple $(header_{i}$, $metadata_{i}$, $about_{i}$, $status_{i})$ where:
\begin{enumerate}
\item $header_{i}$ is a 4-tuple $(record\_id_{i},datestamp_{i},sets_{i})$:
\begin{enumerate}
\item $record\_id_{i}$ being a unique identifier for the $oai\_record_{i}$, \item $datestamp_{i}$, the date/time of creation, modification or deletion of the record for the purpose of selective harvesting;
\item $sets_{i} \subset archive\_sets$, the set membership of the item for the purpose of selective harvesting.
\end{enumerate}
\item $metadata_{i} \in dm_{j}(2)$ for some $dm_j \in DM_{{C_k}}$;
\item $about_{i}$ is a descriptive metadata specification about the $oai\_record_{i}$; metadata format not specified. Common examples of properties
include \textit{rights statements} and \textit{provenance} information about the metadata record itself.
\item $status_{i}$ -- an optional status attribute with a value of `deleted' -- indicates the withdrawal of availability of the specified metadata format for the item, dependent on the repository support for deletions.
\end{enumerate}
For every $oai\_record_{i}$ in the response set, the following set of constraints follows:
\begin{enumerate}
\item $ from \leq datestamp _{i} \leq until$, i.e., datestamp corresponding to the record creation or modification is within the specified date range.
If omitted the request parameter $from$ takes the value associated with the earliestDatestamp property of \textit{identification} of the archive;
\item $set \in sets_{i}$;
\item \textit{metadataPrefix} $\in$ \textit{metadata\_formats}; $metadata_{i}$ \textbf{conforms with} the metadata format defined in
\textit{metadataPrefix};
\item and $resumptionToken$ fits within the sequence limits related to the flow control implemented by $dp$ as discussed above.
\end{enumerate}
\item ListIdentifiers
\textbf{Goal}: Lists all unique handles (in OAI terms, identifiers) corresponding to digital objects in the repository.
\textbf{Scenario}: $\langle e_1:p=$\textit{ListIdentifiers(from,until,set,resumptionToken)}, $e_2:p=response(\{$ $record\_id_{i}$, $...$,
$record\_id_{l}\}$,
$resumptionToken_i)\rangle $, where $\{record\_id_{i}$, ..., $record\_id_{l}\}$ is a set of identifiers (or handles) for OAI records
$\{oai\_record_{i}$, ..., $oai\_record_{l}\}$. The same set of constraints for \textit{ListRecords} apply to the \textit{ListIdentifiers} response.
\item $GetRecord$
\textbf{Goal}: Returns the metadata for a single identifier in the form of an OAI record.
\textbf{Scenario}: $\langle e_1:p=$\textit{GetRecord(id,metadataPrefix)}, $e_2:p=response(oai\_record_{i}) \rangle $, $id = record\_id_{i}$; other
constraints apply as above.
\end{enumerate}
\subsection{NDLTD Union Archive}
\begin{itemize}
\item A digital library federation is a set $DLF=\{dl_1,dl_2, ...,dl_k\}$ of independent and possibly heterogeneous digital libraries (DLs). NDLTD is a
digital library federation where each independent DL $dl_k =$ $(ETD\_{\cal R}_k$, $ETD\_DM_k$, $ETD\_Serv_k$, $ETD\_Soc_k)$. $ETD\_{\cal R}_k$ is a
repository having a collection $ETD\_{Coll}_k = \{$$etd_{1k}$, $etd_{2k}$, $...,$ $etd_{jk}\}$ composed of a set of digital objects $etd_{ik}$ corresponding to
electronic theses and/or dissertations (ETDs). The possible set of streams of an ETD, $etd_{ik}(2)$, is normally limited to a small number of standard types (e.g., Unicode encoding for the character set, MPEG for videos) due to preservation concerns and technological limitations. NDLTD currently does not enforce (yet) any specific structural metadata for ETDs, but several projects for standardizing such a structure with XML Schemas and DTDs are under development in many locations including Finland, Germany, and USA. For each ETD $etd_{ik} \in ETD\_{Coll}_k$ there should be at least one $etd\_dm_{ik} \in ETD\_DM_{k_{ETD\_Coll_k}}$, $ETD\_DM_{k_{ETD\_Coll_k}} \in ETD\_DM_k$, $ETD\_DM_{k_{ETD\_Coll_k}}$ being a metadata catalog for the ETD collection $ETD\_Coll_k$.
\item NDLTD promotes ETD-MS as the metadata format for ETD descriptive metadata specifications. For each $dl_k$ in NDLTD, let:
\begin{itemize}
\item $ETD\_{IDs}_k = \{h_{ik}| h_{ik} = etd_{ik}(1), etd_{ik} \in ETD\_Coll_k \}$, $NDLTD\_ETD\_IDs = \bigcup_{dl_{k} \in NDLTD}{ ETD\_{IDs}_k}$ be the set of the handles of all the ETDs in the NDLTD federation collections;
\item $ETD\_{Properties}$ = \{`title', `creator', `person', `subject', `description', `publisher', `contributor', `date', 'type', `format', `identifier', `language', `coverage', `rights', `degree'\};
\item $Degree =\{dg_1,dg_2,...,dg_x\}$ a set of unique labels representing the degree portion of an ETD;
\item and $Degree\_{Properties}$ = \{`name', `level', `discipline', `grantor'\}, a set of properties about the degree portion of an ETD.
\end{itemize}
In formal terms, ETD-MS is a metadata format $(V_{ETD-MS},def_{ETD-MS})$ for descriptive metadata specifications in ETD-MS $= (G_{ETD}, {\cal R}_{ETD}
\cup {\cal L}_{ETD} \cup {\cal P}_{ETD}$, ${\cal F}_{ETD})$, where resources ${\cal R}_{ETD} = (NDLTD\_ETD\_{IDs}$ $\cup$ $Degree)$,
$V_{ETD-MS}=$\\$\{
ETD\_{IDs}$, $Degree\}$, properties ${\cal P}_{ETD} = (ETD\_{Properties}$ $\cup$ $Degree\_{Properties})$ and for all triples $(r,p,z) \in
def_{ETD}$:
\begin{enumerate}
\item $r = NDLTD\_ETD\_{IDs}$ iff $p \in ETD\_{Properties}$,
\item $r = Degree$ iff $p \in Degree\_{Properties}$, and
\item $def_{ETD}(NDLTD\_ETD\_{IDs},`degree')= Degree$.
\end{enumerate}
\item Society $ETD\_Soc_k$ of $dl_k$ is such that $\{$Patron, Student, ETDReviewer, ETDCataloguer, ETDSearchManager, ETDWorkflowManager,...$\}$ $\subset ETD\_Soc_k(1)$ and $\{creates = (Student \times ETDWorkflow, ETDCreation)$, $searches = (Patron \times ETDSearchManager, searching)$, $is\_a = (Student \times Patron, \emptyset)\} \subset ETD\_Soc_k(2)$.
\item The NDLTD Union Archive is a tuple $(NDLTD\_Union,UA\_Harvester)$ where $NDLTD\_Union$ = $\bigcup_{dl_k \in NDLTD}$ ${ETD\_DM_{k_{ETD\_Coll}},
ETD\_DM_k = dl_k(2)}$,\\ $ETD\_DM_{k_{ETD\_Coll}}$ $\in$ $ETD\_DM_k$, is the union of the metadata catalogs for the ETD collections of all NDLTD
members and $UA\_Harvester$ is a manager, an electronic member of the NDLTD society, which participates in an OAI harvesting service that periodically harvests metadata records from the NDLTD members.
\item Each DL $dl_k$ in the union archive includes a data provider manager, $dp_k \in dl_k(4) = ETD\_Soc_k$, which responds to requests from the NDLTD
$UA\_{Harvester}$. Conversations between the $UA\_Harvester$ and $dp_k$ are governed by the OAI-PMH and constitute an
OAI harvesting service as defined in the previous section. \end{itemize}
\section{Related Work}
Formal models, which have supported research and development in most computer science subfields (e.g., programming languages, databases, information
retrieval, hypermedia), are surprisingly missing in the digital library literature. One could conjecture that is due to the previously argued
complexity of the field. Wang ~\cite{Wang99} provides one first attempt to fill this gap. His so-called ``hybrid approach'' defines a digital library
as a combination of a special purpose database and a hypermedia-based user interface and builds upon this combination to formalize digital libraries in
terms of the language Z \cite{Spivey88}. Kalinichenko \textit{et al.} \cite{Kalinichenko00} presented a canonical model for information systems
and a compositional approach that they applied to provide a partial solution for interoperability in DLs. Castelli \textit{et al.}
\cite{CastelliECDL02} have presented the closest work to ours so far. In the context of a multidimensional query language for digital libraries they
have formalized the concepts of documents, based on the notions of views and versions, metadata formats and specifications, and a first-order logic based language. These approaches, clearly incomplete, are, as far as we know, the only attempts to provide some formalization for the digital libraries field.
Formal models precisely and unambiguously define the semantics of specific abstractions of a knowledge field. In the case of computer science (CS),
this allows for the exploitation and development of declarative approaches in design and development. Not surprisingly, each of the cited CS sub-fields
has proposals to investigate declarative approaches. Accordingly, we have proposed 5SL for declarative specification and generation of digital library
applications. Closely related works, which are not supported by a rigorous underlying formal theory, include the Digital Library Definition Language
\cite{DLDL}, DSpace's data model (which includes communities and bitstreams) and fixed workflows \cite{DSpace}, the METIS Workflow framework
\cite{METIS},
and FEDORA's structoid approach
\cite{structoid}. All of these deal with small parts of the whole
problem (e.g., federated search, digital objects rendering) and are not as comprehensive as the 5S model in dealing with almost all aspects of DL design and construction.
The flexibility of the 5S theory has been further demonstrated as an instrument for requirements analysis in DL development and as a basis for
organizing a digital library taxonomy. While research in DL requirements analysis has been underrepresented with only small isolated case studies
(e.g., \cite{collaborativeDesign,Dong01,GohLeggett,LevyMarshall95}), to the best of our knowledge there is no other comprehensive DL taxonomy
published in the literature, other than that presented in \cite{Foxetal94}, which is a basis for ours.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusions}
Motivated by the challenge of Licklider \cite{Licklider65} to construct a theory for digital libraries, we have developed 5S. We show that formal definitions allow the 5S framework to be fully described and make it possible to clearly and formally define a minimal digital library. Using that framework we demonstrate its utility: to discuss the terminology found in the digital library literature, to describe a representative digital library and the Open Archives Initiative, to construct 5SL -- a declarative specification language from which digital libraries can be generated, and to formally define a set of DL constructs and settings in the context of the NDLTD Union Archive.
Future work with the 5S framework will proceed in several directions. We will use our framework to help guide further development of the OAI and the
NDLTD, as well as other digital library applications such as the NSDL \cite{Zia01} and the ETANA archaeological DL \cite{ETANA}. We will extend 5SL to
be more complete, and to enable generation of
personalized digital libraries in connection with PIPE \cite{Ramakrishnan,Gonc2001d}. Further, we will encourage and assist others to adopt and adapt 5S and 5SL.
Finally, we plan to continue our work on the theory of digital libraries \cite{Gonc03}. We will explore qualitative aspects of the model including
consistency, completeness, correctness, and evaluation. We think that the formal definitions can serve as a basis to design quantitative
measures for several dimensions of quality in DLs. We also intend to use 5S to help with formal analysis of interoperability issues in digital
libraries. It can serve
as a canonical DL model to allow us to go beyond the current shallow descriptions of DL systems for federated search or harvesting purposes. Finally, the formal definitions given here can be used to prove
helpful lemmas and theorems, as a tool to precisely compare different DL architectures, and to guide future work in the field.
\begin{small}
\section*{APPENDIX}
\setcounter{section}{1}
\subsection*{Mathematical Preliminaries}
Here, we briefly review the mathematical foundations necessary for the development of the following discussion. Since the goal is complete precision, all terms used in the other definitions must be carefully and unambiguously defined. Authors' definitions of terms even as basic as ``function'' often disagree, so (for completeness) we begin at the most fundamental level, with set notations, relations, functions, sequences, tuples, strings, graphs, and grammars ~\cite{CLR90}. Readers familiar with these concepts can skip this section or simply refer to it as needed when some of the concepts are used in other definitions.
\newtheorem{app-sdef}{Appendix Definition}
Formally, \textit{set} and $\in$ (``element of'') are taken as undefined terms in the axioms of set theory. We remark that a set cannot contain itself
and the ``set of all sets'' does not exist. That $x$ is an element of set $S$ is denoted $x \in S$. There is an ``empty'' set ($\emptyset$). The notation $S = \{x | P(x)\}$ defines a set $S$ of precisely those
objects $x$ for which the logical proposition $P(x)$ is true.
Standard operations between sets $A$ and $B$ include union: $ A \cup B = \{x | x
\in A$ or $x \in B\}$; intersection: $ A \cap B = \{x | x \in A$ and $x \in B\}$;
and Cartesian product: $ A \times B = \{(a,b) | a \in A$ and $b \in B\}$
where $(a,b)$ is called an \textit{ordered pair}. A is called a
\textit{subset} of B, denoted by $A \subset B$, if $x \in A$ implies
$x \in B$. The set of all subsets of set $S$ (including $\emptyset$)
exists, is called the \textit{power set} of $S$, and is denoted~$2^S.$
%%%%%%%%%%%%%%%%
\begin{app-sdef}
A binary \textbf{relation} $R$ on sets $A$ and $B$ is a subset of $A
\times B$. We sometimes write $(a,b) \in R$ as $a$R$b$. An $n$-ary
relation $R$ on sets $A_1, A_2,...,A_n$ is a subset of the Cartesian
product $A_1 \times A_2 \times ... \times A_n$.
\end{app-sdef}
%%%%%%%%%%%%%%%%
\begin{app-sdef}
Given two sets $A$ and $B$, a \textbf{function} $f$ is a binary
relation on $A \times B$ such that for each a $\in A$ there exists $b \in B$ such that $(a,b) \in f$, and if $(a,b) \in f$ and $(a,c) \in f$
then $b = c$. The set $A$ is called the domain of $f$ and the set $B$
is called the codomain of $f$. This is shown as $f: A \to B$. We write $b = f(a)$ as a common notation for $(a,b) \in f$. The set $\{f(a)|a \in A\}$ is called the range of f.
\end{app-sdef}
%%%%%%%%%%%%%%%%
\begin{app-sdef}
A \textbf{sequence} is a function $f$ whose domain is the set of natural
numbers or some initial subset $\{1,2,...,n\}$ of the natural
numbers and whose codomain is any set.
\end{app-sdef}
%%%%%%%%%%%%%%%%
\begin{app-sdef}
A \textbf{tuple} is a finite sequence that is often denoted by listing the
range values of the function as $\langle f(1),f(2),...,f(n) \rangle$.
\end{app-sdef}
%%%%%%%%%%%%%%%%
\begin{app-sdef}
A \textbf{string} is a finite sequence of characters or symbols drawn from a finite set with at least two elements, called an \textbf{alphabet}. A string is often denoted by concatenating range values without punctuation. Let $\Sigma$ be an alphabet. $\Sigma^*$ denotes the set of all strings from $\Sigma$, including the empty string (an empty sequence $\epsilon$). A \textbf{language} is a subset of $\Sigma^*$.
\end{app-sdef}
%%%%%%%%%%%%%%%%
\begin{app-sdef}
A \textbf{graph} $G$ is a pair $(V,E)$, where $V$ is a nonempty set (whose elements are called \textbf{vertices}) and $E$ is a set of two-item sets of vertices,
$\{u,v\}$, $u,v \in V$, called \textbf{edges}. A \textbf{directed graph} (or digraph) $G$ is a pair $(V,E)$, where $V$ is a nonempty set of vertices (or nodes) and $E$ is a set of edges (or arcs) where each edge is an ordered pair of distinct vertices $(v_i, v_j)$, with $v_i,v_j \in V$ and $v_i \ne v_j$. The edge $(v_i,v_j)$ is said to be \textbf{incident} on vertices $v_i$ and $v_j$, in which case $v_i$ is \textbf{adjacent to} $v_j$, and $v_j$ is \textbf{adjacent from} $v_i$.
\end{app-sdef}
%%%%%%%%%%%%%%%%
Several additional concepts are associated with graphs. A \textbf{walk} in graph $G$ is a sequence of not-necessarily distinct vertices such that for every adjacent pair $v_i,v_{i+1}$, $1\le i < n$, in the sequence, $(v_i,v_{i+1}) \in E$. We call $v_1$ the origin of the walk and $v_n$ the terminus. The \textbf{length} of the walk is the number of edges that it contains. If the edges of the walk are distinct, the walk is a \textbf{trail}. If the vertices are distinct, the walk is a \textbf{path}. A walk is \textbf{closed} if $v_1=v_n$ and the walk has positive length. A \textbf{cycle} is a closed walk where the origin and non-terminal vertices are distinct. A graph is \textbf{acyclic} if it has no cycles. A graph is \textbf{connected} if there is a path from any vertex to any other vertex in the graph. A \textbf{tree} is a connected, acyclic graph. A \textbf{directed tree} or (DAG) is a connected, directed graph where one vertex - called the root - is adjacent from no vertices and all other vertices are adjacent from exactly one vertex. A graph $G'=(V',E')$ is a \textbf{subgraph} of $G=(V,E)$, if $V' \subseteq V$ and $E' \subseteq E$.
\begin{app-sdef}
A \textbf{context-free grammar} is a quadruple $(V,\Sigma,R,s_0)$ where V is a finite set of symbols called non-terminals, $\Sigma$ is an alphabet of terminal symbols, R is a finite set of rules and $s_0$ is a distinguished element of V called the \textbf{start} symbol.
\end{app-sdef}
A \textbf{rule}, also called a production, is an element of the set $ V \times {(V \cup \Sigma)}^*$. Each production is of the form $ A \rightarrow \alpha$ where $A$ is a non-terminal and $\alpha$ is a string of symbols (terminals and/or non-terminals).
\begin{app-sdef}
A \textbf{deterministic finite automaton} is a 5-tuple $(Q, q_0, A,\Sigma,\delta)$ where $Q$ is a finite set of symbols called \textit{states}, $q_0 \in Q$ is the \textbf{start} automaton state, $A \subseteq Q$ is a distinguished set of accepting states, $\Sigma$ is an alphabet (defining what set of input strings the automaton operates on), and $\delta$ is a function from $Q \times \Sigma$ into $Q$, called the transition function of the automaton.
\end{app-sdef}
The finite automaton begins in state $q_0$ and reads characters of an input string one at a time. If after reading the string the automaton is in a state $q \in A$ the string is \textbf{accepted}.
\begin{app-sdef}
Let $X$ be a set. A \textbf{$\sigma$-algebra} is a collection $ \B$ of subsets of $X$
that satisfies the following conditions:
\begin{enumerate}
\item every union of a countable collection of sets in $\B$ is again
in $\B$, \ie, if $A_i \in \B$ ($i=1,2,3,\dots$), then $\bigcup^{\infty}_{i=1}
A_i\in \B$;
\item if $A \in \B$, then $\tilde{A} \in \B$, where $\tilde{A}$ is
the complement of $A$ with respect to $X$.
\end{enumerate}
\end{app-sdef}
One consequence of the definition of $\sigma$-algebra is that the
intersection of a countable collection of sets in $\B$ is again in
$\B$.
%%%%%%%%%%%%%%%%
\begin{app-sdef}
A \textbf{measurable space} is a tuple $(X,\B)$ consisting of a set
$X$ and a $\sigma$-algebra $\B$ of subsets of $X$.
\end{app-sdef}
A subset $A$ of $X$ is called \textit{measurable} (or
\textit{measurable with respect to $\B$}) if $A \in \B$. A \textit{measure} $\mu$ on measurable space $(X,\B)$ is a nonnegative real-valued function defined for all sets of $\B$ such that the following conditions are satisfied:
\begin{enumerate}
\item $\begin{array}{cl}
\mu(\emptyset) = 0 &
\mbox{where $\emptyset$ is the empty set, and} \\
\end{array}$
\item $\begin{array}{cl}
\mu\left( \bigcup^{\infty}_{i=1} A_i \right) = \Sigma^{\infty}_{i=1} \mu (A_i) &
\mbox{ for any sequence $A_i$ of pairwise disjoint measurable sets.}
%& \mbox{ of disjoint measurable sets.}
\end{array}$
\end{enumerate}
%%%%%%%%%%%%%%%%
\begin{app-sdef}
A \textbf{measure space} $(X,\B,\mu)$ is a measurable space $(X,\B)$,
with measure $\mu$ defined on $\B$.
\end{app-sdef}
%%%%%%%%%%%%%%%%
\begin{app-sdef}
A \textbf{probability space} is a
measure space $(X,\B,\mu)$, such that measure $\mu(X)=1$.
\end{app-sdef}
%%%%%%%%%%%%%%%%
\begin{app-sdef}
A \textbf{vector space} is a set $V$ (whose elements are called \textbf{vectors})
together with a field of ``scalars'' \footnote{In this context, the field of real numbers.}
with an addition operation $+ : V \times V \to V$ and a multiplication
operation $* : S \times V \to V$ such that if $x,y,z$ are in $V$ and
$\alpha$ and $\beta$ are in $S$ then:
\begin{enumerate}
\item there is a unique vector $0 \in V$
such that $x + 0 = x$ for all $x \in V$ (additive identity);
\item for each vector $x \in V$ there exists a vector $-x \in V$ such that $x+(-x)=0$ (additive inverse);
\item $(x + y) + z = x + (y + z) $ (associativity of $+$);
\item $x + y = y + x $ (commutativity of $+$);
\item $1 * x = x$ (identity);
\item $(\alpha*\beta)*x = \alpha*(\beta*x)$ (associativity of $*$);
\item $ (\alpha + \beta)*x = \alpha*x + \beta*x$ (distributivity of $*$ over $+$, right); and
\item $ \alpha*(x + y) = \alpha*x + \alpha*y $ (distributivity of $*$ over $+$, left).
\end{enumerate}
\end{app-sdef}
\begin{app-sdef}
A \textbf{topological space} is a pair $(X,{\cal T})$ consisting of a set $X$ and a family ${\cal T} \subset 2^X$ of subsets of X such that:
\begin{enumerate}
\item $\emptyset$ (the empty set) $\in {\cal T}$ and $X \in {\cal T}$;
\item for any collection of sets in ${\cal T}$, $\{A_i \in {\cal T}| i \in I\}$, $\cup_{i \in I}A_i$ is also in ${\cal T}$, and if the \textit{index set} $I$ is finite, $\cap_{i \in I} A_i$ is in ${\cal T}$.
\end{enumerate}
\end{app-sdef}
${\cal T}$ is said to be a topology for X, and elements of ${\cal T}$ are called \textbf{open} sets. The complement of an open set is called a \textbf{closed} set.
\end{small}
\begin{acks}
\small{We are indebted to Robert K. France for many constructive discussions. The authors wish to thank the Digital Library Research Laboratory team
members that assisted with 5S. Thanks also go to the OCKHAM group for contributions to the taxonomy. Anonymous reviewers of earlier drafts
helped shape this work significantly.}
\end{acks}
\bibliographystyle{acmtrans}
\bibliography{5s}
\begin{received}
Received July 2001;
revised Dec 2002; accepted Oct 2003
\end{received}
\end{document}