Parallel SECONDO Overview
Parallel SECONDO scales up the SECONDO database system's capability of processing extensible data models to a cluster of computers. It combines Hadoop 0.20.2 with a set of SECONDO databases and provides almost all existing SECONDO data types and operators. All parallel queries are expressed in the SECONDO executable language, just like common sequential queries. The user can therefore describe parallel queries without learning many details about the underlying Hadoop platform.
SECONDO has been developed with a focus on supporting spatial and spatio-temporal data management. It offers a comprehensive set of data types and operators for representing and querying moving objects (e.g. GPS-captured tracks of people, vehicles, or animals, also called trajectories). Through the coupling of Hadoop and SECONDO, it is now possible to perform queries and analyses on massive sets of trajectory data in the cloud.
For installation, a set of bash scripts is provided that makes the installation relatively easy and makes Parallel SECONDO usable not only on a single computer, but also on clusters containing tens or even hundreds of computers. In addition, a VMware image containing a single-computer Parallel SECONDO lets the user explore the system by simply starting a virtual machine. Finally, an Amazon Machine Image (AMI) containing a Parallel SECONDO Data Server is also published, with which the user can quickly set up Parallel SECONDO on an EC2 virtual cluster.
Furthermore, the BerlinMOD benchmark has also been revised for Parallel SECONDO. In the past, the data generation and certain example queries were quite expensive in BerlinMOD, even with small scale factors like 1.0. At present, Parallel SECONDO can create a large data set with a scale factor of 30 on an Amazon cluster consisting of 110 large-type instances in only five hours.
Install Parallel SECONDO
- At present, Parallel SECONDO is well supported on all Unix-based systems; Ubuntu is highly recommended.
- A SECONDO database system of a version higher than 3.3.2, together with all its required libraries, should be installed on all involved computers first. The SECONDO installation guides for the different platforms, including Linux, Mac OS X and Windows, help the user to complete this step easily.
- A Hadoop 0.20.2 archive is needed. We are aware that this release is out of date nowadays, and we are working on adapting Parallel SECONDO to the latest Hadoop distribution.
- Java™ 1.6.x.
- The programs ssh and screen should be installed.
For example, on an Ubuntu system, they can be installed with the commands:
$ sudo apt-get install ssh
$ sudo apt-get install screen
- A passphraseless SSH connection is also required. On Ubuntu, it can be set up with:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
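The prerequisites listed above can be verified before the installation scripts are run. The following is only a minimal sketch and is not part of Parallel SECONDO: the function names missing_tools and can_ssh_without_password are hypothetical, and the sketch assumes that the Java launcher is called java and is found on the PATH.

```shell
#!/bin/sh
# Hypothetical pre-installation check; not shipped with Parallel SECONDO.

# Print every tool from the argument list that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed only if a non-interactive SSH login to the given host works.
# BatchMode=yes makes ssh fail instead of prompting for a password.
can_ssh_without_password() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}
```

Calling missing_tools ssh screen java prints nothing when all three programs are available. Calling can_ssh_without_password localhost succeeds only if the key generated above is accepted without a prompt; on a cluster, the same call can be repeated for every involved computer.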
A configuration file named ParallelSecondoConfig.ini is provided for setting the various parameters of the system. Within the file, two parameters must be set before the system is installed.
- JAVA_HOME: indicates the Java installation that Hadoop uses.
- Cluster: describes how the system is distributed across the involved computers.
More details about this file are described in the User Guide for Parallel SECONDO.
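Since the installation depends on these two parameters, it can help to confirm that they appear in the file before starting. The sketch below is an assumption-laden illustration, not a tool from Parallel SECONDO: the function name missing_parameters is hypothetical, lines starting with # or ; are assumed to be comments, and the check only looks for each name on a non-comment line without validating its value. The authoritative file layout is the one given in the User Guide.

```shell
#!/bin/sh
# Hypothetical helper; the real file layout is defined in the User Guide.

# Usage: missing_parameters FILE NAME...
# Print each NAME that does not occur on any non-comment line of FILE.
# Assumption: lines whose first non-blank character is # or ; are comments.
missing_parameters() {
  config_file=$1
  shift
  for name in "$@"; do
    awk -v name="$name" '
      /^[ \t]*[#;]/   { next }        # skip assumed comment lines
      index($0, name) { found = 1 }   # literal, case-sensitive match
      END             { exit found ? 0 : 1 }
    ' "$config_file" || printf '%s\n' "$name"
  done
}
```

For example, missing_parameters ParallelSecondoConfig.ini JAVA_HOME Cluster prints nothing if both names are present, and otherwise lists the ones that still need to be set.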
Set Up Data Servers
The installation of Parallel SECONDO is largely automated by two bash scripts: ps-cluster-format and ps-secondo-buildMini. The first sets up Data Servers, the basic processing units in Parallel SECONDO, on all involved computers. The second installs SECONDO on all Data Servers. The basic installation can be executed by the command:
After initializing all Data Servers, start a new console and type the command:
Detailed installation instructions, for both a single computer and a computer cluster, are described in: