Cloudy Journey: VMware vFabric Blog: Part 1: The Value, Architecture, & Code for Building Geography-Based Apps


Will machine-generated data be larger than mobile and tablet-generated data?
No matter where you might place your chips on that bet, they both rely on geographic data for quite a number of business applications. These geographic data applications stand to release a tremendous amount of business value, and, in this two-part series, we will explain:

How geographic data can release business value through applications
Where technology overcomes big data barriers to release the business value
The concepts behind vFabric GemFire’s data fabric, as well as an object model and data architecture for software services connecting to geographic data fabrics
How to use an open source quadtree index and the related Java code for interacting with geographic data in vFabric GemFire

How Geographic Data Releases Business Value

Many early versions of geography, location, or proximity-based applications can be found in the market. Recently, we published a few examples of these types of applications in articles about ocean sensor data and mobile applications, but there are more:

Mobile phone apps that use your location to find the best hotels, restaurants, attractions, or any local product/service
Mobile phone apps that use your location to get coupons or provide advertisements
Mobile phone apps that use your location to share information with other people
Mobile phone apps that provide multi-player games based on virtual or real location
Sensor networks on any moving vehicle – planes, trains, automobiles, motorcycles, boats, trucks, or equipment
Sensor networks that relay RFID information on any moving object (like a pallet of food)
Sensors located on robots
Sensors located on humans or wherever humans are moving around

Many companies and investors are headed on the journey towards these types of geography-based applications. For example, Gigaom.com noted that a Boeing jet generates 10 terabytes of information per engine for every 30 minutes of flight. At that rate, the 6-hour “New York to LA” trip you took recently would generate 240 terabytes on a twin-engine jet (12 half-hour intervals x 10 terabytes x 2 engines), and there are over 20,000 US flights per day. Can we use this type of data to optimize flight patterns with weather, fuel savings, and maintenance costs? Yes.

Overcoming Big Data Barriers with Technology

When we look at the geographic scenarios above, we see big data problems. Analyzing large data sets, or accessing data within them in real time, poses problems at significant scale; mainly, the operations are slow. This is where virtualized, dynamically scalable, in-memory data grids or fabrics like vFabric GemFire 7 prevail and monolithic, traditional databases run into walls. In addition, GemFire lets you use any indexing scheme you need to run as fast as possible, and it offers simple APIs, as familiar as a hash map, that make it easier for Java developers to work with.
Given the need for speed, we are always looking for new, better, and faster ways to access our data, so we have real and valuable information coming out of it in a timely manner. A whole field of research in computer science exists around data structures, and how to make searching them more efficient. We are constantly evolving our indexes and search structures, trying to make them go faster, and optimizing them to meet the needs of our specific data set. Think of a Business Intelligence system. It is built to accommodate many indexes for the slicing and dicing of data in multiple ways, yet always has additional requirements.

Some data, like spatial (i.e. geographic) data, requires special indexing in order to access it efficiently. When using spatial data, you are generally more interested in where the data is than in what its primary key is. Specialized indexes, like quadtrees, accommodate accessing data in this manner, and allow for very fast searching of two-dimensional data via a tree structure rather than a table scan. A quadtree is a hierarchical data structure in which each node has four children, derived by recursively partitioning two-dimensional space into four quadrants.
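To make the recursive partitioning concrete, here is a minimal point quadtree sketch. It is not the open source implementation used later in this post; the class name SimpleQuadTree, the leaf capacity of 4, and the depth limit of 16 are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative point quadtree; names, capacity, and depth limit are assumptions. */
class SimpleQuadTree {
    private static final int CAPACITY = 4;   // points a leaf holds before it splits
    private static final int MAX_DEPTH = 16; // stops endless splitting on duplicate points

    private final double minX, minY, maxX, maxY; // bounds: [minX, maxX) x [minY, maxY)
    private final int depth;
    private final List<double[]> points = new ArrayList<>();
    private SimpleQuadTree[] children;           // null while this node is a leaf

    SimpleQuadTree(double minX, double minY, double maxX, double maxY) {
        this(minX, minY, maxX, maxY, 0);
    }

    private SimpleQuadTree(double minX, double minY, double maxX, double maxY, int depth) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        this.depth = depth;
    }

    /** Inserts a point; returns false if it lies outside this node's bounds. */
    boolean insert(double x, double y) {
        if (x < minX || x >= maxX || y < minY || y >= maxY) return false;
        if (children == null) {
            if (points.size() < CAPACITY || depth >= MAX_DEPTH) {
                points.add(new double[] {x, y});
                return true;
            }
            split();
        }
        for (SimpleQuadTree child : children) {
            if (child.insert(x, y)) return true;
        }
        return false; // not reached: the four children tile this node exactly
    }

    /** Turns a full leaf into an inner node with four quadrant children. */
    private void split() {
        double midX = (minX + maxX) / 2, midY = (minY + maxY) / 2;
        children = new SimpleQuadTree[] {
            new SimpleQuadTree(minX, minY, midX, midY, depth + 1),
            new SimpleQuadTree(midX, minY, maxX, midY, depth + 1),
            new SimpleQuadTree(minX, midY, midX, maxY, depth + 1),
            new SimpleQuadTree(midX, midY, maxX, maxY, depth + 1)
        };
        for (double[] p : points) {
            for (SimpleQuadTree child : children) {
                if (child.insert(p[0], p[1])) break;
            }
        }
        points.clear();
    }

    /** Returns every stored point inside the inclusive query rectangle. */
    List<double[]> search(double qMinX, double qMinY, double qMaxX, double qMaxY) {
        List<double[]> out = new ArrayList<>();
        collect(qMinX, qMinY, qMaxX, qMaxY, out);
        return out;
    }

    private void collect(double qMinX, double qMinY, double qMaxX, double qMaxY, List<double[]> out) {
        // Prune: skip the whole subtree when its bounds miss the query rectangle
        if (qMaxX < minX || qMinX >= maxX || qMaxY < minY || qMinY >= maxY) return;
        for (double[] p : points) {
            if (p[0] >= qMinX && p[0] <= qMaxX && p[1] >= qMinY && p[1] <= qMaxY) out.add(p);
        }
        if (children != null) {
            for (SimpleQuadTree child : children) child.collect(qMinX, qMinY, qMaxX, qMaxY, out);
        }
    }
}
```

The pruning test in collect is what replaces the table scan: whole quadrants that cannot overlap the query rectangle are never visited. Note that this sketch treats each node's bounds as half-open, so a point exactly on the root's maximum edge is rejected.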
vFabric GemFire accommodates indexes on in-memory data sets based on the primary key, a field, or a function via simple configuration changes. However, it is also built from the ground up to allow alternative indexes (e.g. the type used with geography-based data) to participate in the system, giving users fast access to data in any indexing scheme they choose.

An Object Model and Data Architecture for Geographic Data

Before we can explain the object model and data architecture, it is important to understand the concepts and terms within the vFabric GemFire Enterprise Data Fabric:
Cache/Data Fabric – A set of Regions.
Region – A Region is the core of GemFire, and is the main API that is used when developing with GemFire. It is a logical grouping within the data fabric for a single data set. You can define any number of regions within your fabric. Each region has its own configurable settings governing such things as the data storage model, local data storage and management, partitioning, data event distribution, and data persistence.
A Region implements java.util.concurrent.ConcurrentMap, which extends java.util.Map. This means that if your developers can use a hash map, they can use GemFire.
GemFire Server – This is what provides all of the form and function that makes up GemFire.
Cache.xml – This allows system architects to declaratively create the cache through XML. While the file can have any name, it is commonly referred to as a “cache.xml” file.
Client – The client enables any Java, C++, or C# application to access, interact with, and register interest in data managed by GemFire. Clients can also be embedded within a Java application server’s process to provide HTTP session replication or Hibernate L2 object caching with eviction, expiry, and overflow to disk.
Locator – A GemFire locator is a registry where servers can advertise their location so clients and other servers can discover the active servers. Clients are aware of the locators of a system, but not the specific servers.
Bucket – A unit of storage for distribution and replication, which is distributed in accordance with the region’s attribute settings. If you are familiar with the implementation of HashMaps, this is the same concept. Multiple buckets are usually held on a single server.
Partitioning – A logical division of a region that is distributed over members of the data fabric. For high availability, configure redundant copies so that each data bucket is stored in more than one member, with one member holding the primary copy. The underlying implementation of a partition is a bucket.
Redundancy – Highly available partitioned regions provide reliability by maintaining redundant data. When you configure a partitioned region for redundancy, each bucket, and therefore entry, in the region is stored in at least two members’ caches. If one of its members fails, operations continue on the partitioned region with no interruption of service. Recovering redundancy can be configured to take place immediately, or delayed for a configurable amount of time until a replacement system is started.
Without redundancy, failure of any of the members hosting the partitioned region results in partial loss of data. Typically redundancy is not used when applications can directly read from another data source, or when write performance outweighs read performance. The addition of a redundant copy ensures high availability of your data in the partitioned region.
Failover – When a server hosting the primary copy of data fails, the data responsibilities pass to another server. How this happens depends on whether the new server is a secondary server. In any case, all failover activities are carried out automatically by the GemFire system without any intervention needed by the client.
Serialization – The process in which an object state can be saved, transmitted, received and restored. The process of saving or transmitting state is called serializing. The process of receiving and restoring state is called deserialization.
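Because a Region is a ConcurrentMap, code written against that interface carries over directly. The sketch below is only an illustration of that point: a ConcurrentHashMap stands in for a Region (obtaining a real Region requires a running cache), and the class name, keys, and values are made up for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class RegionAsMapSketch {
    /** Uses only ConcurrentMap operations, so any implementation of it could be passed in. */
    static String demo(ConcurrentMap<String, String> region) {
        region.put("person-1", "Alice");
        region.putIfAbsent("person-1", "Bob");       // no effect: the key is already present
        region.putIfAbsent("person-2", "Carol");
        region.replace("person-2", "Carol", "Dave"); // atomic compare-and-replace
        region.remove("person-1");
        return region.get("person-2");
    }

    /** Stand-in used here in place of a Region. */
    static ConcurrentMap<String, String> standIn() {
        return new ConcurrentHashMap<>();
    }
}
```

Calling RegionAsMapSketch.demo(RegionAsMapSketch.standIn()) exercises the same put, get, and remove calls a developer would make against a Region.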

A Background on Indexing Geographic Data

The Map data structure is fantastic for managing a set of data. The data all exists in one place, keys determine the uniqueness of an entry, and the underlying mechanism for the storage of data (a set of Lists, called Buckets), makes searching the set, based on key, very straightforward and fast. GemFire provides mechanisms to also index this data in a declarative way, not based on the key, to provide faster access with queries against the Map. You can index data in a GemFire region based on the primary key, a field, or a function on the data. This is a configurable attribute that requires no code changes.
In some cases, however, a Map and some accompanying indexes are not enough. For example, in the case of spatial data, queries become more complex than just equality. With spatial data, you are generally doing queries such as:

Find all data points inside of a rectangle
Find all data points near some specific point
Find all data points that are a given distance from a location
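
The predicates behind these three query types can be sketched in a few lines. This is an illustration only: the class and method names are ours, the rectangle test ignores rectangles that cross the antimeridian, and the distance test uses the haversine formula with a mean Earth radius of 6371 km, which is one common approximation among several.

```java
class SpatialPredicates {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius (approximation)

    /** True if (lat, lon) lies inside the rectangle, bounds inclusive. */
    static boolean inRectangle(double lat, double lon,
                               double minLat, double minLon, double maxLat, double maxLon) {
        return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
    }

    /** Great-circle distance in kilometers between two points, via the haversine formula. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** True if the two points are no more than radiusKm apart. */
    static boolean withinKm(double lat1, double lon1, double lat2, double lon2, double radiusKm) {
        return haversineKm(lat1, lon1, lat2, lon2) <= radiusKm;
    }
}
```

None of these checks is an equality test on a key or field, which is why evaluating them without a spatial index means visiting every entry.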

Traditional database indexes, which are what most systems offer today, SQL or NoSQL, do not handle these queries efficiently. Yet companies whose data is geographical in nature are always looking for ways to improve their access to data.
The example below was built to show how to use a specialized index with GemFire, specifically a spatial index. The same methodology can be applied to any specialized indexing structure that might be required, such as bi-temporal.
The premise of this approach is that myRegion.get(key) calls are extremely fast. We want to reduce any given query to a set of “get” calls as quickly as possible to retrieve the data of interest in the shortest possible time, without iterating over the entire data set via a function, or OQL (Object Query Language).

The Object Model

GemFire is a Key/Value store, which means we need to come up with something that is a good representation of the “uniqueness” of a value (The Key), and the structure of the Value itself.
Let’s say we are tracking current location information for people, and have been given a class “Person” that will represent each person. For each person, we would have a unique identifier, a current latitude, and a current longitude. We can assume that we might want to do spatial operations on either the key set or value set when we are working iteratively with the data; however, we don’t have a way to modify the base object “Person”.
Our “Key” class should consist of the unique identifier, the latitude, and the longitude; however, the uniqueness of the key will be based only on the unique identifier. See com.vmware.gemfire.example.domain.SpatialKey for an example. The latitude and longitude are included in the key because they are small, and we want access to those values without having to access the Value object directly.
The Person class can be wrapped in an object that contains the coordinates (in case we want to do spatial operations on the object), the unique identifier (to do equals), and the Person object. See com.vmware.gemfire.example.domain.MyData for an example.
Both the key and the value classes in this example implement the DataSerializable interface for best performance. Please refer to the GemFire Documentation to learn more about different Serialization options and their impact on performance.
In general, there will be nothing special about how we are storing this data in GemFire. It is in a Region, which could be replicated or partitioned. We can access and query this data as we would any other data in a region; however, we always want to ensure that we are using the most efficient methods.
Key Class

public class SpatialKey implements DataSerializable
{
  private float lat, lon;
  private String id;
  …
}
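
The listing above elides the body of the key class. As a hedged illustration of the two properties described earlier (uniqueness based on the identifier alone, and compact field-by-field serialization), here is a stand-alone sketch. It is not the SpatialKey from the example project: the class name SpatialKeySketch and the roundTrip helper are ours, and the class does not implement DataSerializable so that it compiles without GemFire on the classpath. Its toData and fromData methods follow the same pattern of writing and reading fields in a fixed order through java.io.DataOutput and java.io.DataInput.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical key sketch: identity comes from id only; the coordinates ride along. */
class SpatialKeySketch {
    private String id;
    private float lat, lon;

    SpatialKeySketch() { } // no-arg constructor, used when rebuilding an instance from bytes

    SpatialKeySketch(String id, float lat, float lon) {
        this.id = id; this.lat = lat; this.lon = lon;
    }

    String getId() { return id; }
    float getLat() { return lat; }
    float getLon() { return lon; }

    // Equality and hashing deliberately ignore lat and lon
    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SpatialKeySketch)) return false;
        return id.equals(((SpatialKeySketch) o).id);
    }

    @Override public int hashCode() { return id.hashCode(); }

    /** Writes the fields in a fixed order. */
    void toData(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeFloat(lat);
        out.writeFloat(lon);
    }

    /** Reads the fields back in the same order they were written. */
    void fromData(DataInput in) throws IOException {
        id = in.readUTF();
        lat = in.readFloat();
        lon = in.readFloat();
    }

    /** Serializes a key to bytes and rebuilds a copy, to show the pair is symmetric. */
    static SpatialKeySketch roundTrip(SpatialKeySketch key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.toData(new DataOutputStream(bytes));
            SpatialKeySketch copy = new SpatialKeySketch();
            copy.fromData(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this definition, two keys that share an identifier but carry different coordinates are equal and hash alike, so a lookup by identifier finds the entry regardless of where the person currently is.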

The Data Architecture

If we are working with a large amount of data, we will want to set this up as a partitioned data set. You will need to analyze your access and usage patterns to determine the best partitioning scheme for your needs. For this example, we will use default partitioning based on primary key (which is fairly random) to try and get a nice, even distribution across the nodes in the system. Please refer to the documentation to learn more about partitioning schemes in GemFire.
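As a rough illustration of why key-based placement tends to spread entries evenly, the sketch below assigns keys to buckets by hash code. It is an assumption-laden stand-in, not GemFire's internal routing: the class, the bucketFor helper, and the bucket count of 113 used in the usage note are chosen only for the example.

```java
class BucketSketch {
    /** Maps a key to a bucket in [0, totalBuckets) using its hashCode. */
    static int bucketFor(Object key, int totalBuckets) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), totalBuckets);
    }

    /** Counts how many of the keys "person-0" .. "person-(n-1)" land in each bucket. */
    static int[] histogram(int keys, int totalBuckets) {
        int[] counts = new int[totalBuckets];
        for (int i = 0; i < keys; i++) {
            counts[bucketFor("person-" + i, totalBuckets)]++;
        }
        return counts;
    }
}
```

Running BucketSketch.histogram(10000, 113) and inspecting the counts is a quick way to check how evenly a given key format would spread; a skewed histogram would suggest looking at a different partitioning scheme.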

Using a Spatial Index with GemFire

In this example, we will be creating an in-memory Spatial Index on each server in the data fabric. The index on each server will hold the keys to the data whose primary copy resides on that server. This does two things:

1. Reduces the memory footprint required for each node. In a GemFire System, the data may be distributed across several nodes. It is not practical to keep a full index of the entire data set on each node. In large data sets, the keyset alone could possibly fill up the memory of a server. To alleviate this, each server holds its own “partition” of the Index, which includes only elements where the primary copy of the data is held on that node. The partitioning of the Index is completely dependent on the partitioning scheme used for the data.

2. Streamlines the Index for scatter-gather functions. When a spatial query is executed on the server side, it will be distributed to all nodes that hold relevant data. Having each node index only its local data makes the tree lighter to scan for the relevant keys, giving better performance than holding the entire index locally on each node, or in a central location.

The Index can be rebuilt at any time from the data that it references; therefore, it is not held in GemFire. Rather, it is a singleton in each JVM that exists only in memory and is referenced by functions and listeners that run internally on the server.
The “entries” in the Index, or the leaf elements, are Keys to data that exists in a GemFire Region, in our example, the SpatialKey. Any operation on the QuadTree will return a SpatialKey, or set of SpatialKeys, that can be used to do a “get” on the GemFire Region.
The index itself is not held in a Region. All access against the actual data points is a series of “gets” on the keys retrieved from the spatial index, making retrieval of data very fast.
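The flow described above (each node consults its own index for keys, does a “get” per key, and the partial results are merged) can be sketched without a cluster. Everything here is a stand-in: Node models one server, a HashMap plays the part of that server's primary Region data, a linear scan over a key-to-coordinates map takes the place of the quadtree search to keep the sketch short, and routing by hash code is a simplification.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ScatterGatherSketch {
    /** Stand-in for one server: its primary data plus an index over those keys only. */
    static class Node {
        final Map<String, String> primaryData = new HashMap<>();  // stand-in for local Region data
        final Map<String, double[]> localIndex = new HashMap<>(); // stand-in index: key -> {lat, lon}

        /** Local part of a rectangle query: find keys in the index, then "get" each value. */
        List<String> queryLocal(double minLat, double minLon, double maxLat, double maxLon) {
            List<String> values = new ArrayList<>();
            for (Map.Entry<String, double[]> e : localIndex.entrySet()) {
                double[] p = e.getValue();
                if (p[0] >= minLat && p[0] <= maxLat && p[1] >= minLon && p[1] <= maxLon) {
                    values.add(primaryData.get(e.getKey())); // the "get" on the key
                }
            }
            return values;
        }
    }

    /** Stores an entry on the node chosen by the key's hash and indexes it there only. */
    static void put(List<Node> nodes, String key, double lat, double lon, String value) {
        Node owner = nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
        owner.primaryData.put(key, value);
        owner.localIndex.put(key, new double[] {lat, lon});
    }

    /** Scatter the query to every node, then gather the partial results. */
    static List<String> query(List<Node> nodes,
                              double minLat, double minLon, double maxLat, double maxLon) {
        List<String> merged = new ArrayList<>();
        for (Node node : nodes) {
            merged.addAll(node.queryLocal(minLat, minLon, maxLat, maxLon));
        }
        return merged;
    }
}
```

Each key appears in exactly one node's index, so the merged result needs no de-duplication, and no node ever holds index entries for data it does not own.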
In this example, an open source implementation of a quadtree Index was used. To ensure a singleton on each node, the constructor was made private and a class level method was implemented to return the only instance.
Index Singleton

public class QuadTreeIndex
{
  // Created once when the class loads, so each JVM holds exactly one instance
  private static final QuadTreeIndex myTree = new QuadTreeIndex();

  private QuadTree tree;

  // Private constructor: callers must go through getSingleton()
  private QuadTreeIndex()
  {
    tree = new QuadTree();
  }

  public static QuadTreeIndex getSingleton()
  {
    return myTree;
  }
}

All spatial operations on the tree are done via this façade, so it implements each method of the quadtree that needs to be supported. For example:
Wrapping an Indexed Method

public void put(float lat, float lon, Object obj)
{
  // Delegate to the wrapped quadtree
  tree.put(lat, lon, obj);
}

Adding and removing data from the Region

To keep the tree up to date, and as streamlined as possible, we need to make sure that as data points are added or removed, their keys are also added to or removed from the Index. We don’t care about updates to a value since any access to the value will always be via the Region API, not the index.
We will be discussing the following Region configuration to detail how these updates and changes will be handled in the system.
Region Configuration

<region name="MyData">
  <region-attributes refid="PARTITION">
    <partition-attributes redundant-copies="1">
      <partition-listener>
        <class-name>com.vmware.gemfire.example.spatial.listener.QuadTreePartitionListener</class-name>
      </partition-listener>
    </partition-attributes>
    <cache-listener>
      <class-name>com.vmware.gemfire.example.spatial.listener.QuadTreeCacheListener</class-name>
    </cache-listener>
  </region-attributes>
</region>

A CacheListener is used to “do something” after data has been added or updated in GemFire. In this example, we have configured a QuadTreeCacheListener to listen for “after” events in GemFire in our spatial data region. The Listener will fire on nodes where the data is a primary copy. Each time a new value is added to the Region holding the spatial data, the QuadTree Index in our example will be updated.
The code below is not complete, but shows the pertinent methods. All we are doing is putting and removing data from the Index on the local node. To install the CacheListener on a node, refer to Table 4: Region Configuration.
CacheListener

public class QuadTreeCacheListener implements
    CacheListener<SpatialKey, MyData>, Declarable
{
  QuadTreeIndex tree = QuadTreeIndex.getSingleton();

  // A new entry was created: add its key to the local index
  public void afterCreate(EntryEvent<SpatialKey, MyData> ev)
  {
    SpatialKey key = ev.getKey();
    tree.put(key.getLat(), key.getLon(), key);
  }

  // The entry was destroyed: remove its key from the local index
  public void afterDestroy(EntryEvent<SpatialKey, MyData> ev)
  {
    SpatialKey key = ev.getKey();
    tree.remove(key.getLat(), key.getLon(), key);
  }

  // The entry was invalidated: treat it like a removal
  public void afterInvalidate(EntryEvent<SpatialKey, MyData> ev)
  {
    SpatialKey key = ev.getKey();
    tree.remove(key.getLat(), key.getLon(), key);
  }
}
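
To see the synchronization rule in isolation (creates and removals touch the index, updates do not), here is a small stand-alone sketch. None of these types are GemFire classes: EntryListener, KeyIndex, and ListenedMap are hypothetical stand-ins for the CacheListener, the spatial index, and the Region respectively, and a plain set of keys replaces the quadtree.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class IndexSyncSketch {
    /** Hypothetical stand-in for the listener callbacks. */
    interface EntryListener<K> {
        void afterCreate(K key);
        void afterUpdate(K key);
        void afterDestroy(K key);
    }

    /** Stand-in index: tracks which keys exist and counts the updates it ignored. */
    static class KeyIndex<K> implements EntryListener<K> {
        final Set<K> keys = new HashSet<>();
        int updatesIgnored = 0;

        public void afterCreate(K key) { keys.add(key); }
        public void afterUpdate(K key) { updatesIgnored++; } // the index itself is left untouched
        public void afterDestroy(K key) { keys.remove(key); }
    }

    /** Stand-in region: a map that notifies its listener after each change. */
    static class ListenedMap<K, V> {
        private final Map<K, V> data = new HashMap<>();
        private final EntryListener<K> listener;

        ListenedMap(EntryListener<K> listener) { this.listener = listener; }

        void put(K key, V value) {
            boolean existed = data.containsKey(key);
            data.put(key, value);
            if (existed) listener.afterUpdate(key); else listener.afterCreate(key);
        }

        void remove(K key) {
            if (data.containsKey(key)) {
                data.remove(key);
                listener.afterDestroy(key);
            }
        }

        V get(K key) { return data.get(key); }
    }
}
```

After putting a key twice and removing a different key, the index holds only the surviving key and has seen one ignored update, while the latest value is still read through the map itself.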

In the next post, we will continue the example, discuss making the Index HA, and use a Function to access the data via the Index. Let us know if you have questions or would like example code!

About the Author: Catherine Johnson is a strategist with VMware and works with customers who are implementing Fast Data solutions. She has more than 15 years of experience in distributed systems and holds a Master’s degree in Computer Science focused on distributed systems design. Her grad school research focused on spatial data in distributed systems. Catherine has spent most of her career at Oracle and VMware, working across most organizations, including consulting, engineering, education, and pre-sales.


Friday, December 14, 2012
