Parallel Data Model

Data Types

SECONDO is an extensible database framework that can be extended with various data models, such as relational, temporal, and moving-objects models, implemented as algebras. Each algebra contains its specific data types and the operators on them. Most data types can be kept as attributes in relations, while others, like indices, are stored as individual database objects. Following this storage classification, a new data type named flist is implemented in an algebra named Hadoop, which is specially prepared for Parallel SECONDO. Essentially, flist is designed as a wrapper structure that can encapsulate all existing and future SECONDO data types.

An flist partitions a SECONDO object into an N x 1 matrix, where N is the number of Data Servers in Parallel SECONDO. Each Data Server keeps one piece of the object. If, on the other hand, the objects are kept in a relation, the relation can be partitioned into an R x C matrix. A Data Server may then hold one, several, or none of the R rows of the matrix, and each row can be further divided into C columns. Each piece of the matrix can be exported into PSFS (Parallel SECONDO File System), a distributed file system prepared for exchanging data among Data Servers. Such a matrix can also be represented as an flist object. The first kind of flist is named DLO (Distributed Local Objects), and the second kind is named DLF (Distributed Local Files). In either case, the flist itself is kept only on the master Data Server and describes the partition scheme, while the data are actually kept on the slave Data Servers.
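The two partition schemes can be sketched as simple data structures. The following Python sketch is purely illustrative: the class and field names (DLO, DLF, row_to_server, and the PSFS file-naming pattern) are assumptions for exposition, not the actual SECONDO implementation.

```python
# Illustrative model of the two flist kinds; all names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DLO:
    """Distributed Local Objects: an object split into an N x 1 matrix,
    one piece kept as a local database object on each Data Server."""
    object_name: str
    num_servers: int  # N: number of Data Servers

    def location(self, row):
        # Piece i of the object lives as a local object on Data Server i.
        assert 0 <= row < self.num_servers
        return f"server {row}: local object '{self.object_name}'"

@dataclass
class DLF:
    """Distributed Local Files: a relation partitioned into an R x C matrix,
    each cell exported as a file in PSFS."""
    relation_name: str
    rows: int     # R
    columns: int  # C
    # Which Data Server holds each row; a server may hold several rows or none.
    row_to_server: dict = field(default_factory=dict)

    def location(self, row, col):
        assert 0 <= row < self.rows and 0 <= col < self.columns
        server = self.row_to_server[row]
        return f"server {server}: psfs file '{self.relation_name}_{row}_{col}'"
```

In both cases only this small descriptor (the partition scheme) would live on the master Data Server, while the pieces it points to reside on the slaves.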

Parallel Operators

Furthermore, a set of parallel operators is provided to process flist objects, so that parallel queries can be described like sequential queries in the SECONDO executable language. These operators are divided into four kinds: Hadoop, flow, assistant, and PSFS, as shown in the table below.

Hadoop          Flow          Assistant   PSFS
hadoopMap       spread        para        fconsume
hadoopReduce    collect                   fdistribute
hadoopReduce2   createFlist               ffeed

Hadoop operators convert parallel queries into Hadoop jobs. Each operator contains a template Hadoop job, from which job instances are created according to the input arguments. Every Hadoop operator requires a function parameter describing a sequential query that is performed in one stage of the created job instance. In hadoopMap, the query described by the argument function is processed in the Map stage of the created Hadoop job, while the other two operators, hadoopReduce and hadoopReduce2, process their argument functions in the Reduce stage.
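The division of labor between the two stages can be illustrated with a toy simulation. This is not the real SECONDO/Hadoop code; run_job and its parameters are invented here solely to show where a map-side argument function (as in hadoopMap) and a reduce-side argument function (as in hadoopReduce/hadoopReduce2) would be applied.

```python
# Toy MapReduce model: map_fn plays the role of hadoopMap's argument
# function, reduce_fn that of hadoopReduce's argument function.
def run_job(partitions, map_fn=None, reduce_fn=None):
    # Map stage: each data partition is processed independently.
    mapped = [map_fn(p) if map_fn else p for p in partitions]
    # Shuffle: regroup intermediate (key, value) pairs by key.
    groups = {}
    for part in mapped:
        for key, value in part:
            groups.setdefault(key, []).append(value)
    # Reduce stage: the reduce-side function sees one group per key.
    if reduce_fn:
        return {k: reduce_fn(vs) for k, vs in groups.items()}
    return groups
```

A query passed to hadoopMap would, in this analogy, run as map_fn over every partition; a query passed to hadoopReduce or hadoopReduce2 would run as reduce_fn over the regrouped data.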

Flow operators connect sequential and parallel queries. The spread operator partitions a tuple stream over the cluster and returns a DLF flist object. In contrast, collect gathers all data of a DLF flist from the cluster and returns them as a stream of tuples. The createFlist operator is specially prepared for Parallel BerlinMOD; it returns a DLF flist object whose data were created independently in the cluster. The assistant operator para marks flist objects used in the argument functions of Hadoop operators. The PSFS operators are invisible to the user; they handle the data stored in PSFS.
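The complementary roles of spread and collect can be sketched as a partition/gather round trip. The Python below is a hedged analogue, not SECONDO syntax; the hash-based key partitioning is an assumption chosen only to show that spread distributes tuples while collect recovers the full stream.

```python
# Illustrative analogue of the flow operators (hypothetical, not SECONDO code).
def spread(tuples, num_partitions, key=lambda t: t[0]):
    """Partition a tuple stream by a key attribute, mimicking how
    spread distributes data over the cluster as a DLF flist."""
    parts = [[] for _ in range(num_partitions)]
    for t in tuples:
        parts[hash(key(t)) % num_partitions].append(t)
    return parts

def collect(parts):
    """Gather all pieces of a DLF-like partition back into one stream."""
    return [t for part in parts for t in part]
```

Collecting immediately after spreading returns the same set of tuples, which is the sense in which the two operators bridge parallel and sequential queries.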

More details about the parallel data model are described in the User Guide for Parallel SECONDO, and the technical report Simple and Efficient Coupling of Hadoop With a Database Engine.