Architecture before version 4
OpenRefine is a web application, but is designed to be run locally on your own machine. The server-side maintains the state of the data (undo/redo history, long-running processes, etc.) while the client-side maintains the state of the user interface (facets and their selections, view pagination, etc.). The client-side makes GET and POST AJAX calls to cause changes to the data and to fetch data and data-related state from the server-side.
This architecture provides a good separation of concerns (data vs. UI); allows the use of familiar web technologies (HTML, CSS, Javascript) to implement user interface features; and enables the server side to be called by third-party software through standard GET and POST requests.
Technology stack
The server-side (back-end) part of OpenRefine is implemented in Java as one single servlet which is executed by the Jetty web server and servlet container. The use of Java strikes a balance between performance and portability across operating systems (there is very little OS-specific code, and it mostly has to do with starting the application).
The functional extensibility of OpenRefine is provided by a fork of the SIMILE Butterfly modular web application framework. With this framework, extensions are able to provide new functionality both in the server- and client-side. A list of known extensions is maintained on our website and we have specific documentation for extension developers.
The client-side part of OpenRefine is implemented in HTML, CSS and plain Javascript. It primarily uses the following libraries:
- jQuery
- Wikimedia's jQuery.i18n
The front-end dependencies are fetched at build time via NPM.
The server-side part of OpenRefine relies on many libraries, for instance to implement import and export in many different formats. Those are fetched at build time via Apache Maven.
The data storage and processing architecture is being transformed. Up to version 3.x, OpenRefine uses an in-memory storage, where the entire project grid is loaded in the Java heap, with operations mutating that state. From 4.x on, OpenRefine uses a different architecture, where data is stored on disk by default and cached in memory if the project is small enough.
Server-side architecture
OpenRefine's server-side is written entirely in Java (main/src/) and its entry point is the Java servlet com.google.refine.RefineServlet. By default, the servlet is hosted in the lightweight Jetty web server instantiated by server/src/com.google.refine.Refine. Note that the server class itself is under server/src/, not main/src/; this separation leaves the possibility of hosting RefineServlet in a different servlet container.
The web server configuration is in main/webapp/WEB-INF/web.xml; that's where RefineServlet is hooked up. RefineServlet itself is simple: it just reacts to requests from the client-side by routing them to the right Command class in the packages com.google.refine.commands.**.
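The routing idea can be sketched as a lookup from a command path to a handler. This is only an illustration written in Javascript, not the servlet's Java code: the names commandRegistry, routeCommand and core/example-echo are hypothetical, and the handlers are stand-ins for real Command classes.

```javascript
// Hypothetical registry mapping "<module>/<command-name>" to a handler.
// The handlers are stand-ins; real commands are Java classes.
const commandRegistry = new Map([
  ["core/create-importing-job", () => ({ code: "ok" })],
  ["core/example-echo", (params) => ({ code: "ok", echo: params })],
]);

// Route a request path such as "/command/core/example-echo" to its handler.
function routeCommand(path, params) {
  const prefix = "/command/";
  if (!path.startsWith(prefix)) {
    return { code: "error", message: "not a command path: " + path };
  }
  const handler = commandRegistry.get(path.slice(prefix.length));
  if (handler === undefined) {
    return { code: "error", message: "unknown command: " + path };
  }
  return handler(params);
}
```

Keeping the routing table separate from the handlers is what lets the servlet itself stay simple: adding a command means adding one entry, not changing the routing logic.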
As mentioned before, the server-side maintains states of the data, and the primary class involved is com.google.refine.ProjectManager.
Projects
In OpenRefine there's the concept of a workspace similar to that in Eclipse IDE. When you run OpenRefine it manages projects within a single workspace, and the workspace is organized in a file directory with sub-directories. The default workspace directories are listed in the manual and it also explains how to change them.
The class ProjectManager is what manages the workspace. It keeps in memory the metadata of every project (in the class ProjectMetadata). This metadata includes the project's name and last modified date, and any other information necessary to present the project and let the user interact with it as a whole. Only when the user decides to look at the project's data does ProjectManager load the project's actual data. Separating project metadata from project data minimizes the amount of data loaded into memory.
A project's actual data includes the columns, rows, cells, reconciliation records, and history entries.
A project is loaded when it needs to be displayed or modified, and it remains in memory until 1 hour after the last time it gets modified. Periodically the project manager tries to save modified projects, and it saves as many modified projects as possible within 30 seconds.
Data Model
A project's data consists of
- raw data: a list of rows, each row consisting of a list of cells
- models on top of that raw data that give high-level presentation or interpretation of that data.
This design lets the same raw data be viewed in different ways by different models, and lets the models be changed without costly changes to the raw data.
Column Model
Cells in rows are not named and can only be addressed by their list position indices. So, a column model is needed to give a name to each list position. The column model also stores other metadata for each column, including the type that cells in the column have been reconciled to and the overall reconciliation statistics of those cells.
Each column also acts as a cache for data computed from the raw data related to that column.
Columns in the column model can be removed and re-ordered without changing the raw data--the cells in the rows. This makes column removal and ordering operations really quick.
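A minimal sketch of this separation, in Javascript for brevity: the field names name and cellIndex, the helper getCell and the sample rows are assumptions made for illustration, not the actual Java model classes.

```javascript
// Raw data: each row is a list of cells addressed only by position.
const rows = [
  ["Austin Powers", "USA", 1997],
  ["Dr. No", "UK", 1962],
];

// Column model: gives each cell position a name (field names are hypothetical).
let columns = [
  { name: "title", cellIndex: 0 },
  { name: "country", cellIndex: 1 },
  { name: "year", cellIndex: 2 },
];

// Read a cell by column name by looking up its position in the column model.
function getCell(row, columnName) {
  const column = columns.find((c) => c.name === columnName);
  return column === undefined ? undefined : row[column.cellIndex];
}

// Removing and re-ordering columns only touches the column model;
// the cells in the rows stay exactly where they were.
columns = columns.filter((c) => c.name !== "country").reverse();
const visible = rows.map((row) => columns.map((c) => row[c.cellIndex]));
```

After the last two statements, visible shows the year before the title and no country, while every row in rows still holds all three cells.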
Column Groups
This feature is partially implemented, buggy and deprecated. It will be removed in OpenRefine 4.0. See the following links for details:
- Issue#5122 that first argues it's a useful feature, but then agrees with its deprecation
- Discussion Who uses column groups?
- Discussion The future of the records mode about better ways of implementing grouping and a hierarchical model in OpenRefine.
This feature is related to Rows vs Records, which however continues to be supported.
Consider the following data:
Although the data is in a grid, we humans can understand that it is a tree. First of all, all rows contain data ultimately linked to the movie Austin Powers, although only one row contains the text "Austin Powers" in the "movie title" column. We also know that "USA" and "Germany" are not related to Elizabeth Hurley and Mike Myers respectively (say, as their nationality), but rather, "USA" and "Germany" are related to the movie (where it was released). We know that Mike Myers played both the character "Austin Powers" and the character "Dr. Evil"; and for the latter he received 2 awards. We humans can understand how to interpret the grid as a tree based on its visual layout as well as some knowledge we have about the movie domain that is not encoded in the table.
OpenRefine can capture our knowledge of this transformation from grid to tree using column groups, also stored in the column model. Each column group, illustrated as a blue bracket above, specifies which columns are grouped together, as well as which of those columns is the key column in that group (blue triangle). One column group can span over columns grouped by another column group, and in this way, column groups form a hierarchy determined by which column group envelops another. This hierarchy of column groups allows the 2-dimensional (grid-shaped) table of rows and cells to be interpreted as a list of hierarchical (tree-shaped) data records.
Blank cells play a very important role. The blank cell in a key column of a row (e.g., cell "character" on row 4) makes that row (row 4) depend on the first preceding row with that column filled in (row 3). This means that "Best Comedy Perf" on row 4 applies to "Dr. Evil" on row 3. Row 3 is said to be a context row for row 4. Similarly, since rows 2 - 6 all have blank cells in the first column, they all depend on row 1, and all their data ultimately applies to the movie Austin Powers. Row 1 depends on no other row and is said to be a record row. Rows 1 - 6 together form one record.
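The top level of this rule (which rows belong to which record) can be sketched as follows. This is a simplified illustration in Javascript that only looks at one key column and ignores nested column groups; isBlank, groupIntoRecords and the sample rows are hypothetical.

```javascript
// Treat null, undefined and empty strings as blank cells.
function isBlank(cell) {
  return cell === null || cell === undefined || cell === "";
}

// Group row indices into records: a row whose key cell is filled starts a
// new record, and rows with a blank key cell attach to the preceding record.
function groupIntoRecords(rows, keyCellIndex) {
  const records = [];
  rows.forEach((row, rowIndex) => {
    if (!isBlank(row[keyCellIndex]) || records.length === 0) {
      records.push([rowIndex]);
    } else {
      records[records.length - 1].push(rowIndex);
    }
  });
  return records;
}

const sampleRows = [
  ["Austin Powers", "Mike Myers"],
  ["", "Elizabeth Hurley"],
  ["", ""],
  ["Another Movie", "Someone"],
];
const records = groupIntoRecords(sampleRows, 0);
```

Here the first three rows form one record and the fourth row starts a second one, because only rows 0 and 3 have the key column filled in.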
Currently (as of 12th December 2017) only the XML and JSON importers create column groups, and while the data table view does display column groups, it doesn't support modifying them.
Changes, History, Processes, and Operations
All changes to the project's data are tracked (N.B. this does not include changes to a project's metadata - such as the project name.)
Changes are stored as com.google.refine.history.Change objects. com.google.refine.history.Change is an interface, and implementing classes are in com.google.refine.model.changes.**. Each change object stores enough data to modify the project's data when its apply() method is called, and enough data to revert its effect when its revert() method is called. It's only supposed to store data, not to actually compute the change. In this way, it's like a .diff patch file for a code base.
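The "stores the diff, does not compute it" idea can be sketched like this. The example is in Javascript rather than Java, and SketchCellChange and the project object are hypothetical stand-ins, not the real classes.

```javascript
// A hypothetical change that records the old and the new value of one cell.
// It only stores data; whatever computed newValue ran before it was created.
class SketchCellChange {
  constructor(rowIndex, cellIndex, oldValue, newValue) {
    this.rowIndex = rowIndex;
    this.cellIndex = cellIndex;
    this.oldValue = oldValue;
    this.newValue = newValue;
  }

  // Write the stored new value into the project's raw data.
  apply(project) {
    project.rows[this.rowIndex][this.cellIndex] = this.newValue;
  }

  // Restore the stored old value, undoing apply().
  revert(project) {
    project.rows[this.rowIndex][this.cellIndex] = this.oldValue;
  }
}

const sketchProject = { rows: [["austin powers", "USA"]] };
const cellChange = new SketchCellChange(0, 0, "austin powers", "Austin Powers");
cellChange.apply(sketchProject);  // the cell now holds "Austin Powers"
cellChange.revert(sketchProject); // the cell holds "austin powers" again
```

Because both values are stored, applying and reverting need no knowledge of the expression or operation that produced the new value.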
Some change objects can be huge, as huge as the project itself. So change objects are not kept in memory except when they are to be applied or reverted. However, since we still need to show the user some information about changes (as displayed in the History panel in the UI), we keep metadata of changes separate from the change objects. For each change object there is one corresponding com.google.refine.history.HistoryEntry for storing its metadata, such as the change's human-friendly description and timestamp.
Each project has a com.google.refine.history.History object that contains an ordered list of all HistoryEntry objects storing metadata for all changes that have been done since the project was created. Actually, there are 2 ordered lists: one for done changes that can be reverted (undone), and one for undone changes that can be re-applied (redone). Changes must be undone or redone in their exact order in these lists because each change makes certain assumptions about the state of the project before and after it is applied. As changes cannot be undone/redone out of order, when one change fails to revert, it blocks the whole history from being reverted to any state preceding that change (as happened in Issue #2).
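The two-list structure can be sketched as below, in Javascript for brevity. SketchHistory, its method names and the entry shape are hypothetical. For simplicity each entry holds its change directly, whereas the real history keeps only metadata in memory. Discarding the redo list when a new change is added is an assumption of this sketch.

```javascript
// A minimal undo/redo history over entries of the form
// { description, change }, where change has apply(project) and revert(project).
class SketchHistory {
  constructor() {
    this.pastEntries = [];   // done changes, oldest first
    this.futureEntries = []; // undone changes, next one to redo first
  }

  // Apply a new change and record it. Assumption of this sketch: adding a
  // new change discards the changes that could have been redone.
  addEntry(entry, project) {
    entry.change.apply(project);
    this.pastEntries.push(entry);
    this.futureEntries = [];
  }

  // Revert the most recent done change; only the last one may be undone.
  undo(project) {
    const entry = this.pastEntries.pop();
    if (entry === undefined) return false;
    entry.change.revert(project);
    this.futureEntries.unshift(entry);
    return true;
  }

  // Re-apply the most recently undone change.
  redo(project) {
    const entry = this.futureEntries.shift();
    if (entry === undefined) return false;
    entry.change.apply(project);
    this.pastEntries.push(entry);
    return true;
  }
}

// Example change: append a value to a list, and remove it again on revert.
const appendChange = (value) => ({
  apply: (p) => p.values.push(value),
  revert: (p) => p.values.pop(),
});
```

The example change shows why order matters: its revert() simply pops the last value, which is only correct if every later change has already been reverted.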
As mentioned before, a change contains only the diff and does not actually compute that diff. The computation is performed by a com.google.refine.process.Process object--every change object is created by a process object. A process can be immediate, producing its change object synchronously within a very short period of time (e.g., starring one row); or a process can be long-running, producing its change object after a long time and a lot of computation, including network calls (e.g., reconciling a column).
As the user interacts with the UI on the client-side, their interactions trigger AJAX calls to the server-side. Some calls are meant to modify the project. Those are handled by commands that instantiate processes. Processes are queued on a first-in-first-out basis. The first process in the queue gets run, and until it is done all the other processes wait in the queue.
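The queueing behaviour can be sketched as follows. This is a synchronous simplification in Javascript; SketchProcessQueue and its methods are hypothetical, and a real long-running process would run in the background rather than block the caller.

```javascript
// A first-in-first-out queue of processes. Each process is a function that
// computes and returns a change description (a string in this sketch).
class SketchProcessQueue {
  constructor() {
    this.pending = [];
    this.finished = [];
  }

  // Add a process to the back of the queue.
  enqueue(description, compute) {
    this.pending.push({ description, compute });
  }

  // Run queued processes one at a time, strictly in arrival order. A later
  // process never starts before every earlier one has finished.
  runAll() {
    while (this.pending.length > 0) {
      const next = this.pending.shift();
      this.finished.push({ description: next.description, change: next.compute() });
    }
  }
}
```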
A process can effect a change in one thing in the project (e.g., edit one particular cell, star one particular row), or a process can effect changes in potentially many things in the project (e.g., edit zero or more cells sharing the same content, starring all rows filtered by some facets). The latter kind of process is generalizable: it is meaningful to apply it to another similar project. Such a process is associated with an abstract operation com.google.refine.model.AbstractOperation that encodes the information necessary to create another instance of that process, but potentially for a different project. When you click "extract" in the History panel, these abstract operations are called to serialize their information to JSON; and when you click "apply" in the History panel, the JSON you paste in is used to re-construct these abstract operations, which in turn create processes, which get run sequentially in a queue to generate change object and history entry pairs.
In summary,
- change objects store diffs
- history entries store metadata of change objects
- processes compute diffs and create change object and history entry pairs
- some processes are long-running and some are immediate; processes are run sequentially in a queue
- generalizable processes can be re-constructed from abstract operations
Client-side architecture
The client-side part of OpenRefine is implemented in HTML, CSS and Javascript; the main Javascript libraries it uses are listed in the Technology stack section above.
Importing architecture
OpenRefine has a sophisticated architecture for accommodating a diverse and extensible set of importable file formats and workflows. The formats range from simple CSV, TSV to fixed-width fields to line-based records to hierarchical XML and JSON. The workflows allow the user to preview and tweak many different import settings before creating the project. In some cases, such as XML and JSON, the user also has to select which elements in the data file to import. Additionally, a data file can also be an archive file (e.g., .zip) that contains many files inside; the user can select which of those files to import. Finally, extensions to OpenRefine can inject functionalities into any part of this architecture.
The Index Page and Action Areas
The opening screen of OpenRefine is implemented by the file main/webapp/modules/core/index.vt and will be referred to here as the index page. Its default implementation contains 3 finger tabs labeled Create Project, Open Project, and Import Project. Each tab selects an "action area". The 3 default action areas are for, obviously, creating a new project, opening an existing project, and importing a project .tar file.
Extensions can add more action areas in Javascript. For example, this is how the Create Project action area is added (main/webapp/modules/core/scripts/index/create-project-ui.js):
Refine.actionAreas.push({
  id: "create-project",
  label: "Create Project",
  uiClass: Refine.CreateProjectUI
});
The UI class is a constructor function that takes one argument, a jQuery-wrapped HTML element where the tab body of the action area should be rendered.
If your extension requires a very unique importing work flow, or a very novel feature that should be exposed on the index page, then add a new action area. Otherwise, try to use the existing work flows as much as possible.
The Create Project Action Area
The Create Project action area is itself extensible. Initially, it embeds a set of finger tabs corresponding to a variety of "source selection UIs": you can select a source of data by specifying a file on your computer, or you can specify the URL to a publicly accessible data file or data feed, or you can paste in from the clipboard a chunk of data.
There are actually 3 points of extension in the Create Project action area, and the first is invisible.
Importing Controllers
The Create Project action area manages a list of "importing controllers". Each controller follows a particular work flow (in UI terms, think "wizard"). Refine comes with a "default importing controller" (refine/main/webapp/modules/core/scripts/index/default-importing-controller/controller.js) and its work flow assumes that the data can be retrieved and cached in whole before getting processed in order to generate a preview for the user to inspect. (If the data cannot be retrieved and cached in whole before previewing, then another importing controller is needed.)
An importing controller is just programming logic, but it can manifest itself visually by registering one or more data source UIs and one or more custom panels in the Create Project action area. The default importing controller registers 3 such custom panels, which act like pages of a wizard.
An extension can register any number of importing controllers. Each controller has a client-side part and a server-side part. Its client-side part is just a constructor function that takes an object representing the Create Project action area (usually named createProjectUI). The controller (client-side) is expected to use that object to register data source UIs and/or create custom panels. The controller is not expected to have any particular interface method. The default importing controller's client-side code looks like this (main/webapp/modules/core/scripts/index/default-importing-controller/controller.js):
Refine.DefaultImportingController = function(createProjectUI) {
  this._createProjectUI = createProjectUI; // save a reference to the create project action area
  this._progressPanel = createProjectUI.addCustomPanel(); // create a custom panel
  this._progressPanel.html('...'); // render the custom panel
  // ... do other stuff ...
};
Refine.CreateProjectUI.controllers.push(Refine.DefaultImportingController); // register the controller
We will cover the server-side code below.
Data Source Selection UIs
Data source selection UIs are another point of extensibility in the Create Project action area. As mentioned previously, by default there are 3 data source UIs. Those are added by the default importing controller.
Extensions can also add their own data source UIs. A data source selection UI object can be registered like so:
createProjectUI.addSourceSelectionUI({
  label: "This Computer",
  id: "local-computer-source",
  ui: theDataSourceSelectionUIObject
});
theDataSourceSelectionUIObject is an object that has the following member methods:
- attachUI(bodyDiv)
- focus()
If you want to install a data source selection UI that is managed by the default importing controller, then register its UI class with the default importing controller, like so (main/webapp/modules/core/scripts/index/default-importing-sources/sources.js):
Refine.DefaultImportingController.sources.push({
  "label": "This Computer",
  "id": "upload",
  "uiClass": ThisComputerImportingSourceUI
});
The default importing controller will assume that the uiClass field is a constructor function and call it with one argument--the controller object itself. That constructor function should save the controller object for later use. More specifically, data source UIs that use the default importing controller can call the controller to kickstart the process that retrieves and caches the data to import:
controller.startImportJob(form, "... status message ...");
The argument form is a jQuery-wrapped FORM element that will get submitted to the server side at the command /command/core/create-importing-job. That command and the default importing controller will take care of uploading or downloading the data, caching it, updating the client side's progress display, and then showing the next importing step when the data is fully cached.
See main/webapp/modules/core/scripts/index/default-importing-sources/sources.js for examples of such source selection UIs. While we write about source selection UIs managed by the default importing controller here, chances are your own extension will not be adding such a new source selection UI. Your extension will probably add a new importing controller as well as a new source selection UI that work together.
File Selection Panel
This screen is shown when there are multiple files to choose from when creating a project, for instance after uploading a zip file with multiple files in it. This interface lets the user choose which files to import to create a new project. Although OpenRefine only supports one table per project so far, it is possible to select multiple files to import. Their contents will be concatenated into a single table.
Parsing UI Panel
The parsing UI panel is shown when importing data into a new project. Primarily, it lets the user select which format the data is in, which determines how it is read and transformed into an OpenRefine project. The back-end will try to supply an informed guess for the format using the format guesser, but it is not uncommon that this initial choice must be overridden by the user.
Beyond this choice of format, the parsing UI panel offers a configuration panel for the chosen importer. This part of the UI can be defined independently for each input format, given that not all options are relevant for all formats. For instance, when selecting the "Text file" option, the specific UI of the LineBasedImporter will be shown. This UI is defined in:
- main/webapp/modules/core/scripts/index/parser-interfaces/line-based-parser-ui.html
- main/webapp/modules/core/scripts/index/parser-interfaces/line-based-parser-ui.js
Other importers generally define their own parsing configuration panel as well.
The link between the format's identifier (MIME type), importer (Java class which defines the parsing logic) and parsing options UI (Javascript class that defines the rendering of this options area) is made in the main/webapp/modules/core/MOD-INF/controller.js file, where those components are registered together in the ImportingManager.
Server-side Components
ImportingController
An importing controller is a component of the back-end which is in charge of the entire importing workflow, from the initial transfer of the raw data to be imported to the created project, with all the configuration steps in between, as described in the earlier section. OpenRefine comes with a default importing controller which implements this for data coming from:
- file upload by the user via the web interface
- upload of textual information using the clipboard import form
- download of a file by supplying a URL
For all of these data sources, the first step consists of storing the corresponding input files in a temporary directory inside the workspace. The default importing controller provides an HTTP API used by the front-end to select which files to import, predict the format they are in, provide default importing options for the selected format, preview the project's first few rows with the given options, and finally create the project.
The importing controller is not used for loading existing projects or importing OpenRefine project archives: the project manager is responsible for both of those.
Extensions can define other importing controllers to implement other importing flows depending on the data source. For instance, importing data from a SQL database requires different steps such as selecting the database and providing a SQL query. The database extension implements such a workflow by providing its own importing controller.
FormatGuesser
A format guesser is a class that tries to determine the MIME type of a file, considering its contents.
The FormatGuesser interface has multiple implementations, which can be used to determine the format depending on its basic type (binary, text based).
For text files, this relies on heuristics which are quite ad-hoc and brittle. For binary files, we do not currently try to do anything, while one would at least expect that we check for some so-called magic numbers at the beginning of files, which can be used to detect many file formats.
Worse, our format guessing logic does not actually attempt to parse the files with the guessed formats, so it is not uncommon that the user is directly presented with a parsing error (in the form of a javascript alert) upon importing files.
This could be avoided by trying to read the given files with the predicted importer before suggesting the format to the user, making sure that at least that does not throw an exception.
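The improvement suggested above (validating a guess by attempting a parse) could look roughly like this. It is a sketch of the proposal in Javascript, not existing OpenRefine code: trialParsers, firstParsableFormat and the two format identifiers are used here purely for illustration.

```javascript
// Hypothetical trial parsers keyed by format identifier. Each one throws
// when the text cannot be read in that format.
const trialParsers = {
  "text/json": (text) => JSON.parse(text),
  "text/line-based": (text) => text.split("\n"),
};

// Walk through the guessed formats in order of preference and return the
// first one whose parser accepts the text, or null if none does.
function firstParsableFormat(guesses, text) {
  for (const format of guesses) {
    const parse = trialParsers[format];
    if (parse === undefined) continue; // no parser known for this guess
    try {
      parse(text);
      return format;
    } catch (error) {
      // This guess was wrong; fall through to the next candidate.
    }
  }
  return null;
}
```

For example, firstParsableFormat(["text/json", "text/line-based"], "a,b\nc,d") falls back to the line-based format because JSON.parse rejects the text, so the user would never be shown the JSON parsing error.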
ImportingParser
An ImportingParser is a class that is responsible for parsing a file into OpenRefine's project model.
It takes a range of importing options passed on from the frontend, input by the user into a dedicated UI, specific to the format being parsed.
When possible, parsers are designed so that they can import the first few rows of the project without reading the entire input file into memory. This helps provide fast previews of the project to be created when the user changes importing options. Every change in the importing options triggers a new parse of the source files (unless the user has disabled the auto-preview option in the parsing configuration panel).
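The preview idea, stopping after a limited number of rows instead of consuming the whole input, can be sketched with a lazy line reader. This Javascript sketch works on an in-memory string for simplicity; readLines, parseRows and the limit convention (a negative limit means no limit) are assumptions of the example.

```javascript
// Lazily yield lines from a text, one at a time, without splitting the
// whole input up front.
function* readLines(text) {
  let start = 0;
  while (start < text.length) {
    let end = text.indexOf("\n", start);
    if (end === -1) end = text.length;
    yield text.slice(start, end);
    start = end + 1;
  }
}

// Parse at most `limit` rows; a negative limit means "parse everything".
function parseRows(text, separator, limit) {
  const rows = [];
  for (const line of readLines(text)) {
    if (limit >= 0 && rows.length >= limit) break; // enough rows for a preview
    rows.push(line.split(separator));
  }
  return rows;
}
```

A preview would call parseRows with a small limit each time an option such as the separator changes, and project creation would call it once without a limit.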
Faceted browsing architecture
Faceted browsing support is core to OpenRefine as it is the primary and only mechanism for filtering to a subset of rows on which to do something en masse (i.e., in bulk). Without faceted browsing or an equivalent querying/browsing mechanism, you can only change one thing at a time (one cell or row) or else change everything all at once; both kinds of editing are practically useless when dealing with large data sets.
In OpenRefine, different components of the code need to know which rows to process from the faceted browsing state (how the facets are constrained). For example, when the user applies some facet selections and then exports the data, the exporter serializes only the matching rows, not all rows in the project. Thus, faceted browsing isn't only hooked up to the data view for displaying data to the user, but it is also hooked up to almost all other parts of the system.
Engine Configuration
As OpenRefine is a web app, there might be several browser windows opened on the same project, each in a different faceted browsing state. It is best to maintain the faceted browsing state in each browser window while keeping the server side completely stateless with regard to faceted browsing. Whenever the client-side needs something done by the server, it transfers the entire faceted browsing state over to the server-side. The faceted browsing state behaves much like the WHERE clause in a SQL query, telling the server-side how to select the rows to process.
In fact, it is best to think of the faceted browsing state as just a database query much like a SQL query. It can be passed around the whole system, to any component needing to know which rows to process. It is serialized into JSON to pass between the client-side and the server side, or to save in an abstract operation's specification. The job of the faceted browsing subsystem on the client-side is to let the user interactively modify this "faceted browsing query", and the job of the faceted browsing subsystem on the server side is to resolve that query.
In the code, the faceted browsing state, or faceted browsing query, is actually called the engine configuration, or engine config for short. It consists mostly of an array of facet configurations. For each facet, it stores the name of the column on which the facet is based (or an empty string if there is no base column). Each type of facet has a different configuration. Text search facets have queries and flags for case-sensitivity mode and regular expression mode. Text facets (aka list facets) and numeric range facets have expressions. Each list facet also has an array of selected choices, an invert flag, and flags for whether blank and error cells are selected. Each numeric range facet has, among other things, "from" and "to" values. If you trace the AJAX calls, you'd see the engine configs being shuttled, e.g.,
{
  "mode": "rows",
  "facets": [
    {
      "type": "text",
      "name": "Shrt_Desc",
      "columnName": "Shrt_Desc",
      "mode": "text",
      "caseSensitive": false,
      "query": "cheese"
    },
    {
      "type": "list",
      "name": "Shrt_Desc",
      "columnName": "Shrt_Desc",
      "expression": "grel:value.toLowercase().split(\",\")",
      "omitBlank": false,
      "omitError": false,
      "selection": [],
      "selectBlank": false,
      "selectError": false,
      "invert": false
    },
    {
      "type": "range",
      "name": "Water",
      "expression": "value",
      "columnName": "Water",
      "selectNumeric": true,
      "selectNonNumeric": true,
      "selectBlank": true,
      "selectError": true,
      "from": 0,
      "to": 53
    }
  ]
}
Server-Side Subsystem
From an engine configuration like the one above, the server-side faceted browsing subsystem is capable of producing:
- an iteration over the rows matching the facets' constraints
- information on how to render the facets (e.g., choice and count pairs for a list facet, histogram for a numeric range facet)
When the engine config JSON arrives in an HTTP request on the server-side, a com.google.refine.browsing.Engine object is constructed and initialized with that JSON. It in turn constructs zero or more com.google.refine.browsing.facets.Facet objects. Then for each facet, the engine calls its getRowFilter() method, which returns null if the facet isn't constrained in any way, or a com.google.refine.browsing.filters.RowFilter object otherwise. Then, when iterating over a project's rows, the engine calls the filterRow() method of every row filter. If and only if all row filters return true, the row is considered to match the facets' constraints. How each row filter works depends on the corresponding type of facet.
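The filtering rule (skip unconstrained facets, require every remaining filter to accept the row) can be sketched as follows. The sketch is in Javascript rather than Java and borrows only the method names getRowFilter and filterRow from the text; the facet constructors, their parameters, the inclusive range bounds and the sample rows are assumptions.

```javascript
// Hypothetical text search facet: unconstrained when the query is empty.
function textSearchFacet(cellIndex, query) {
  return {
    getRowFilter() {
      if (query === "") return null;
      const needle = query.toLowerCase();
      return {
        filterRow: (row) => String(row[cellIndex]).toLowerCase().includes(needle),
      };
    },
  };
}

// Hypothetical numeric range facet; both bounds are inclusive in this sketch.
function rangeFacet(cellIndex, from, to) {
  return {
    getRowFilter() {
      return {
        filterRow: (row) =>
          typeof row[cellIndex] === "number" && row[cellIndex] >= from && row[cellIndex] <= to,
      };
    },
  };
}

// Keep a row if and only if every non-null row filter returns true for it.
function filteredRows(rows, facets) {
  const filters = facets
    .map((facet) => facet.getRowFilter())
    .filter((filter) => filter !== null);
  return rows.filter((row) => filters.every((filter) => filter.filterRow(row)));
}

// Made-up sample rows: [description, water content].
const foodRows = [
  ["CHEESE,BLUE", 42],
  ["BUTTER", 16],
  ["CHEESE,BRIE", 48],
];
const matching = filteredRows(foodRows, [textSearchFacet(0, "cheese"), rangeFacet(1, 0, 45)]);
```

With these sample rows, matching contains only the blue cheese row: butter fails the text filter and brie fails the range filter.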
To produce information on how to render a particular facet in the UI, the engine follows the same procedure described in the previous paragraph, except that it skips over the facet in question. In other words, it produces an iteration over all rows constrained by the other facets. Then it feeds that iteration to the facet in question by calling the facet's computeChoices() method. This gives the method a chance to compute the rendering information for its UI counterpart on the client-side. When all facets have been given a chance to compute their rendering information, the engine calls all facets to serialize their information as JSON and returns the JSON to the client-side. Only one HTTP call is needed to compute all facets.
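The "constrain by all other facets" step is what keeps unselected choices visible with meaningful counts. A minimal Javascript sketch, assuming a hypothetical listFacet helper and made-up rows (only the method names getRowFilter, filterRow and computeChoices come from the text):

```javascript
// Hypothetical list facet over one cell position with a list of selected values.
function listFacet(cellIndex, selection) {
  return {
    getRowFilter() {
      if (selection.length === 0) return null; // nothing selected: unconstrained
      return { filterRow: (row) => selection.includes(row[cellIndex]) };
    },
    // Count how often each value occurs among the rows it is given.
    computeChoices(rows) {
      const counts = {};
      for (const row of rows) {
        const value = row[cellIndex];
        counts[value] = (counts[value] || 0) + 1;
      }
      return counts;
    },
  };
}

// For each facet, count choices over the rows constrained by all OTHER facets.
function computeFacets(rows, facets) {
  return facets.map((facet) => {
    const filters = facets
      .filter((other) => other !== facet)
      .map((other) => other.getRowFilter())
      .filter((filter) => filter !== null);
    const constrained = rows.filter((row) => filters.every((f) => f.filterRow(row)));
    return facet.computeChoices(constrained);
  });
}

const facetRows = [
  ["cheese", "FR"],
  ["cheese", "IT"],
  ["bread", "FR"],
];
const facetResults = computeFacets(facetRows, [listFacet(0, ["cheese"]), listFacet(1, [])]);
```

The first facet still reports "bread" with a count of 1 even though only "cheese" is selected, because its own selection is ignored when computing its choices. The second facet counts only the two cheese rows.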
Client-side subsystem
On the client-side there is also an engine object (implemented in Javascript rather than Java) and zero or more facet objects (also in Javascript, obviously). The engine is responsible for distributing the rendering information computed on the server-side to the right facets, and when the user interacts with a facet, the facet tells the engine to update the whole UI. To do so, the engine gathers the configuration of each facet and composes the whole engine config as a single JSON object. Two separate AJAX calls are made with that engine config, one to retrieve the rows to render, and one to re-compute the rendering information for the facets because changing one facet does affect all the other facets.