While the client consists of a single role (the Web Client role), server-side
components include three roles: the Web Service role (implements a secure XML
Web API), the Back-End role (implements distributed computations), and the
Database role (implements a SQL-based storage system). All roles can be collapsed
into a single machine, or assigned to separate physical or even virtual machines.
Web Client Role
Clients implementing the Web Client role run within a browser. Web clients send
XML requests to Web Service machines to perform all data-related operations. They
are also responsible for implementing data visualization and user interaction
capabilities (using Microsoft Silverlight technology). To reduce traffic and latency,
Web clients also support local data caching and data compression.
Web Service Role
Servers implementing the Web Service role are stateless Web servers which implement
the secure XML-based Web API. Web Service servers are responsible for accepting and processing XML requests,
while enforcing security. Web Service servers may be used to create and start new tasks.
However, such tasks are processed in the background by Back-End servers.
Multiple Web Service servers can be deployed to scale out.
Back-End Role
Servers implementing the Back-End role form a computing mesh responsible for scheduling and executing tasks.
Distributed computing is used to increase scalability, while allowing
users to obtain results more quickly. The distributed computing infrastructure is
able to manage task priorities, detect abandoned tasks, restart failed tasks, terminate
long-running tasks, and synchronize task execution between nodes. Multiple Back-End
servers can be deployed to scale out.
Database Role
Servers implementing the Database role provide a SQL-based storage system.
The system uses multiple databases to manage data. A single master database acts as a resource
directory, while data databases are used to store uploaded data and analysis results.
For example, when new data is uploaded, it is stored in a new table within a data database. However,
statistics about this data (as well as directory information about the new table) are stored in the master database.
Users, Workspaces, Rights
To create a new account, users must first register. For security reasons, users must solve a visual CAPTCHA challenge.
If the system has been configured to require e-mail confirmation, users must click on an activation
link before they can log in. Registered users can create workspaces, which are essentially
secure containers encapsulating objects such as data sets, analysis results, images,
or comments. Registered users can also share workspaces with other users by granting
them specific rights (e.g., read, write, manage). However, users must confirm they
accept such invitations by approving the granted right before it becomes effective.
Databases, Tables, Fields
The master database does not permanently store uploaded data. Uploaded data (along with
analysis results) is stored in separate data databases. The master database instead keeps
an inventory of all data databases, the tables within these data databases, and even
the fields within these tables. For example, for each data table, the master database
keeps track of associated field names, field types, and field statistics. The inventory
mechanism allows the system to find where data is stored, and to quickly
access summary information about the data. While an upload is in progress, incoming data
is held temporarily in a staging area of the master database, in the form of chunk records.
Once the upload is complete, a new table is created in a data database, and the chunk
records are deleted.
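The upload flow described above (chunk records staged in the master database, then moved into a new table in a data database) can be sketched as follows. This is only an illustration under stated assumptions: SQLite stands in for the actual SQL engine, and every table, column, and function name (chunk, table_inventory, field_inventory, stage_chunk, complete_upload, "data_db_1") is hypothetical, not taken from the product.

```python
import sqlite3

# Two in-memory databases stand in for the master database and one data database.
master = sqlite3.connect(":memory:")
data_db = sqlite3.connect(":memory:")

# Hypothetical master schema: a staging area plus a table/field inventory.
master.executescript("""
    CREATE TABLE chunk (upload_id TEXT, seq INTEGER, payload TEXT);
    CREATE TABLE table_inventory (table_name TEXT, database_name TEXT, row_count INTEGER);
    CREATE TABLE field_inventory (table_name TEXT, field_name TEXT, field_type TEXT);
""")


def stage_chunk(upload_id, seq, payload):
    """Store one uploaded chunk (comma-separated text here) in the staging area."""
    master.execute("INSERT INTO chunk VALUES (?, ?, ?)", (upload_id, seq, payload))
    master.commit()


def complete_upload(upload_id, table_name, fields):
    """Create the data table, copy staged rows into it, record inventory, drop chunks."""
    # Assumption: table and field names are trusted identifiers in this sketch.
    columns = ", ".join(f"{name} {ftype}" for name, ftype in fields)
    data_db.execute(f"CREATE TABLE {table_name} ({columns})")

    chunks = master.execute(
        "SELECT payload FROM chunk WHERE upload_id = ? ORDER BY seq", (upload_id,)
    ).fetchall()
    rows = [line.split(",") for (payload,) in chunks for line in payload.splitlines()]

    placeholders = ", ".join("?" for _ in fields)
    data_db.executemany(f"INSERT INTO {table_name} VALUES ({placeholders})", rows)

    # The master database keeps only directory information and statistics.
    master.execute(
        "INSERT INTO table_inventory VALUES (?, ?, ?)", (table_name, "data_db_1", len(rows))
    )
    master.executemany(
        "INSERT INTO field_inventory VALUES (?, ?, ?)",
        [(table_name, name, ftype) for name, ftype in fields],
    )
    master.execute("DELETE FROM chunk WHERE upload_id = ?", (upload_id,))
    master.commit()
    data_db.commit()
    return len(rows)
```

After complete_upload returns, the staging area holds no chunks for that upload, and the master database retains only the inventory rows describing the new table.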
Nodes, Jobs, Tasks
The master database contains records which participate in distributed computing. When
a server is added to the grid, a new processing node record is registered in SQL.
When a new task is started, a new job record (marked as scheduled) is created. The
distributed system ensures that nodes compete for job execution. Ultimately, each
job is assigned to a single node, which becomes responsible for executing the task.
Upon completion (or failure), the job record is updated, so as to signal its new
status to all nodes. Each job is able to spawn child jobs, which may execute on
different nodes. The distributed computing infrastructure is able to manage task
priorities, detect abandoned tasks, restart failed tasks, terminate long-running
tasks, and synchronize task execution between nodes.
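One common way to let nodes compete for a scheduled job so that exactly one of them wins is a conditional UPDATE on the job record. The sketch below illustrates that idea only; the text does not say how the system actually arbitrates, and the job table layout, the status values, and the function names are assumptions made for the example (SQLite stands in for the SQL server).

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical job table; parent_id allows a job to spawn child jobs.
db.execute(
    "CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, node_id TEXT, parent_id INTEGER)"
)


def schedule_job(parent_id=None):
    """Create a job record marked as scheduled and return its id."""
    cur = db.execute(
        "INSERT INTO job (status, parent_id) VALUES ('scheduled', ?)", (parent_id,)
    )
    db.commit()
    return cur.lastrowid


def try_claim(job_id, node_id):
    """Return True if this node won the job.

    The status check in the WHERE clause means only the first UPDATE
    changes a row; later attempts match nothing.
    """
    cur = db.execute(
        "UPDATE job SET status = 'running', node_id = ? "
        "WHERE id = ? AND status = 'scheduled'",
        (node_id, job_id),
    )
    db.commit()
    return cur.rowcount == 1


def finish(job_id, succeeded):
    """Update the job record so that other nodes can see its new status."""
    status = "completed" if succeeded else "failed"
    db.execute("UPDATE job SET status = ? WHERE id = ?", (status, job_id))
    db.commit()
```

If two nodes call try_claim for the same job, the first call returns True and the second returns False, so the job ends up assigned to a single node.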
Keys, Licenses, Logs
The master database contains records used to implement and enforce system security.
The organization license table stores digitally signed license keys, which impose
specific usage restrictions on the organization and its users. When authentication
tickets, CAPTCHA challenges, or license keys must be verified, records stored in
the key table are used to perform cryptographic operations. When a user makes a
change to a workspace or to the objects it contains, a record containing event information
is stored in the log table.
Comments, Downloads, Images, Settings
The master database contains tables used to implement collaboration features. When
a user creates a comment, a record containing comment details is stored in the comment
table. When a user requests data download, a record specifying how data should be
filtered before it is streamed out is stored in the download table. When a user
exports or publishes content as an image, a record containing bitmap information
is stored in the image table.
To secure operations which read data, the XML Web API performs a SQL join between
the target table, the workspace table, and the right table. For example, consider
the case of a caller retrieving all comments he or she has access to. The system
will find all workspaces the user has read (or better) rights to, and perform a
join with comments under these workspaces. To secure operations which modify
data, the XML Web API simply checks if the caller has rights to perform the operation.
For example, consider the case of a caller deleting a comment. The system will find
under which workspace the comment is located, and check if the caller has been granted
write (or better) rights to this workspace.
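The two checks described above can be sketched as follows: a join for reads, and a point lookup for modifications. The schema is hypothetical (the table is named access_right here because RIGHT is a SQL keyword), the numeric ordering read < write < manage is an assumption used to express "or better", and SQLite stands in for the actual SQL engine.

```python
import sqlite3

READ, WRITE, MANAGE = 1, 2, 3  # assumed ordering of rights

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE workspace (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE access_right (user_id INTEGER, workspace_id INTEGER, level INTEGER);
    CREATE TABLE comment (id INTEGER PRIMARY KEY, workspace_id INTEGER, body TEXT);
""")


def readable_comments(user_id):
    """Read path: join the target table with the workspace and right tables."""
    return db.execute(
        """
        SELECT c.id, c.body
        FROM comment AS c
        JOIN workspace AS w ON w.id = c.workspace_id
        JOIN access_right AS r ON r.workspace_id = w.id
        WHERE r.user_id = ? AND r.level >= ?
        ORDER BY c.id
        """,
        (user_id, READ),
    ).fetchall()


def delete_comment(user_id, comment_id):
    """Write path: locate the comment's workspace, then check for write or better."""
    allowed = db.execute(
        """
        SELECT 1
        FROM comment AS c
        JOIN access_right AS r ON r.workspace_id = c.workspace_id
        WHERE c.id = ? AND r.user_id = ? AND r.level >= ?
        """,
        (comment_id, user_id, WRITE),
    ).fetchone()
    if allowed is None:
        raise PermissionError("write (or better) rights are required")
    db.execute("DELETE FROM comment WHERE id = ?", (comment_id,))
    db.commit()
```

Because the read path filters through the join, a caller never sees comments in workspaces where no right has been granted, without any per-row check in application code.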
Licenses determine which restrictions are applied to the organization and its users.
Each license key is digitally signed, and specifies limits, such as an expiry date,
how many rows can be imported, how many data sets can be created, etc. Data Applied
uses a public/private key pair mechanism to generate license keys. This choice explains
why our license keys are much longer than those used by many products. Organization
licenses are cumulative: each added organization license unlocks additional capabilities
by lifting restrictions. User licenses, however, are restrictive: they restrict
default capabilities granted by organization licenses. The system validates entered
license information, and enforces such restrictions.
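The interaction between cumulative organization licenses and restrictive user licenses can be illustrated with a small sketch. The combination rule used here is an assumption, since the text does not spell it out: among unexpired organization licenses the most permissive value of each limit wins, and a user license can only lower the result. The License fields and the effective_limits function are hypothetical, and signature verification is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class License:
    """Hypothetical license payload; real keys are digitally signed."""
    expires: date
    max_rows: int
    max_data_sets: int


def effective_limits(
    org_licenses: List[License], user_license: Optional[License], today: date
) -> Dict[str, int]:
    """Combine organization licenses, then apply the user license as a cap."""
    active = [lic for lic in org_licenses if lic.expires >= today]
    if not active:
        # Assumption: with no valid organization license, nothing is allowed.
        return {"max_rows": 0, "max_data_sets": 0}

    # Cumulative: each added organization license can only lift a restriction.
    limits = {
        "max_rows": max(lic.max_rows for lic in active),
        "max_data_sets": max(lic.max_data_sets for lic in active),
    }

    # Restrictive: a user license can only lower what the organization grants.
    if user_license is not None:
        limits["max_rows"] = min(limits["max_rows"], user_license.max_rows)
        limits["max_data_sets"] = min(limits["max_data_sets"], user_license.max_data_sets)
    return limits
```

Under this rule, adding an organization license never reduces a limit, and a user license with larger values than the organization's has no effect.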
Data Applied includes powerful secure delegation capabilities: users can allow third parties to securely perform operations on their behalf.
Authenticated users can issue restricted tickets which allow authentication but also impose a set of usage restrictions.
For example, a user could obtain a restricted ticket which is only valid for 5 minutes, and only allows the caller to retrieve a specific entity type from a specific workspace.
The restricted ticket could then be sent to a third-party application, allowing it to securely perform certain operations on behalf of the user.
When a restricted ticket is generated, a record is created in the database to keep track of the list of restrictions to apply.
When the XML Web API receives a request which includes a restricted ticket, it loads the list of restrictions, and applies them to the execution pipeline.
As with non-restricted tickets, cryptographically signed hashes are used to verify tickets.
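The restricted-ticket flow above (record the restrictions when the ticket is issued, then load and apply them on each request) might look roughly like the sketch below. Everything specific is an assumption: an in-memory dictionary stands in for the database table, HMAC-SHA256 stands in for the unspecified signing scheme, and the ticket layout, field names, and function names are hypothetical.

```python
import hashlib
import hmac
import secrets

SERVER_KEY = secrets.token_bytes(32)  # hypothetical server-side signing key
RESTRICTIONS = {}  # ticket id -> restriction record; stands in for the database


def _sign(text):
    """Keyed hash of the ticket id; only the server can produce it."""
    return hmac.new(SERVER_KEY, text.encode(), hashlib.sha256).hexdigest()


def issue_restricted_ticket(user_id, operation, workspace_id, entity_type, ttl_seconds, now):
    """Record the restrictions, then return a signed ticket referring to them."""
    ticket_id = secrets.token_hex(8)
    RESTRICTIONS[ticket_id] = {
        "user_id": user_id,
        "operation": operation,
        "workspace_id": workspace_id,
        "entity_type": entity_type,
        "expires_at": now + ttl_seconds,
    }
    return f"{ticket_id}.{_sign(ticket_id)}"


def authorize(ticket, operation, workspace_id, entity_type, now):
    """Return the delegating user's id if the request is allowed, else None."""
    ticket_id, _, signature = ticket.partition(".")
    if not hmac.compare_digest(signature.encode(), _sign(ticket_id).encode()):
        return None  # forged or corrupted ticket
    record = RESTRICTIONS.get(ticket_id)
    if record is None or now > record["expires_at"]:
        return None  # unknown or expired ticket
    requested = (operation, workspace_id, entity_type)
    granted = (record["operation"], record["workspace_id"], record["entity_type"])
    return record["user_id"] if requested == granted else None
```

For the example in the text, a user would call issue_restricted_ticket with ttl_seconds=300 and a single entity type and workspace, then hand the resulting string to the third-party application.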
Users who register must supply a user name and password. The system computes and
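stores a cryptographic hash of the user name + password combination. This means
that actual passwords are never stored. Because the user name is part of the hashed input,
two users who choose the same password still get different hashes, so stored hashes cannot
be compared across accounts.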
When authenticating using a login name and password, the system again computes
a cryptographic hash of the user name + password combination, and checks whether
it matches the stored hash. If so, the system issues an authentication ticket, composed
of a user ID followed by a cryptographically signed hash of this user ID.
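A minimal sketch of the registration, login, and ticket steps described in this section follows. The text does not name the hash or signing algorithms, so SHA-256 and HMAC-SHA256 are assumptions chosen for illustration, as are the "userID.signature" ticket layout, the in-memory user store, and all function names.

```python
import hashlib
import hmac
import secrets

SERVER_KEY = secrets.token_bytes(32)  # hypothetical server-side signing key
USERS = {}  # user name -> (user id, stored hash); stands in for the user table


def credential_hash(user_name, password):
    """Hash the user name + password combination; the password is never stored."""
    # Assumption: plain SHA-256 for brevity; a deliberately slow
    # password-hashing function would be preferable in practice.
    return hashlib.sha256(f"{user_name}:{password}".encode()).hexdigest()


def register(user_name, password):
    user_id = len(USERS) + 1
    USERS[user_name] = (user_id, credential_hash(user_name, password))
    return user_id


def _sign(user_id):
    """Signed hash of the user ID, computable only with the server key."""
    return hmac.new(SERVER_KEY, str(user_id).encode(), hashlib.sha256).hexdigest()


def log_in(user_name, password):
    """Return an authentication ticket, or None if the credentials do not match."""
    record = USERS.get(user_name)
    if record is None:
        return None
    user_id, stored = record
    candidate = credential_hash(user_name, password)
    if not hmac.compare_digest(stored.encode(), candidate.encode()):
        return None
    return f"{user_id}.{_sign(user_id)}"


def verify_ticket(ticket):
    """Recompute the signed hash in memory; return the user id if it matches."""
    user_id, _, signature = ticket.partition(".")
    if hmac.compare_digest(signature.encode(), _sign(user_id).encode()):
        return int(user_id)
    return None
```

Verifying a ticket needs no database lookup: the server only recomputes the signed hash from the user ID and compares it with the one carried by the ticket.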
When presented with a ticket, the system extracts the user ID, recomputes the cryptographically
signed hash of this user ID, and checks whether it matches the hash specified by the ticket (all in memory).
If valid, the system accepts the authentication claim, and performs
further security checks using this identity. Finally, when processing license keys,
the system uses a well-known public key to verify the digital signature. The
private key, however, is kept confidential, because it is used to generate valid license