Data Integration Across Private / Cached Public Sources
Several issues arise when creating an integrated view across public and private sources. Some organizations are concerned that focused mining of public sources in an ‘open’ environment may allow a potential competitor to gain an advantage from knowing ‘who accessed what’. To keep potentially revealing traffic from being traced between two sensitive points on the internet, they collect large data sets from various sources in bulk, so that to an outside observer each download appears as an anonymous chunk, and hold the results in large, local silos of cached public data. This approach creates several problems of its own.
When local silos combine public and private data sources, keeping the cached copy current with respect to the public sources becomes a significant issue. It is also unlikely that a combined public / private data collection will remain structurally synchronized throughout its entire life cycle. Structural changes in a public source make re-synchronizing a private copy problematic; each change typically requires a large effort to absorb.
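To make the structural synchronization problem concrete, the following is a minimal sketch of how a cached copy might be checked against a fresh extract of its public source. The function name `detect_structural_drift` and the field names are hypothetical, chosen only for illustration; this is not part of discoveryHub.

```python
from typing import Any


def detect_structural_drift(
    cached_records: list[dict[str, Any]],
    fresh_records: list[dict[str, Any]],
) -> dict[str, list[str]]:
    """Compare the field names seen in a cached copy with those in a fresh extract.

    Hypothetical helper: reports fields that appeared or disappeared upstream.
    """
    cached_fields = set().union(*(r.keys() for r in cached_records))
    fresh_fields = set().union(*(r.keys() for r in fresh_records))
    return {
        "added": sorted(fresh_fields - cached_fields),
        "removed": sorted(cached_fields - fresh_fields),
    }


# Illustrative data: the public source renamed "acc" and added "updated".
cached = [{"acc": "E1", "title": "Alpha"}]
fresh = [{"accession": "E1", "title": "Alpha", "updated": "2020-01-01"}]
drift = detect_structural_drift(cached, fresh)
```

A check of this kind only detects that the structure has changed; adapting the local silo to the change is the costly step described above.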
We can address these concerns by designing into our architecture a layer that seamlessly integrates multiple, mutually inconsistent data sets into one common view and readily adjusts to changes in the environment. This layer of abstraction sits within the query mediation level. Query mediation achieves integration at both the attribute and the entity level, and the key to its success is transparency for the end user. discoveryHub uniquely provides the platform and the tools to quickly create an efficient mediation level.
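The idea of mediating at both the attribute and the entity level can be sketched in a few lines. The example below is an illustration under stated assumptions, not the discoveryHub API: the source names, the mapping table `ATTRIBUTE_MAPS`, and the functions `to_common_view` and `mediate` are all hypothetical.

```python
from typing import Any

# Hypothetical mappings: source-specific field name -> name in the common view.
ATTRIBUTE_MAPS = {
    "public_cache": {"acc": "entity_id", "title": "name"},
    "private_db": {"id": "entity_id", "score": "internal_score"},
}


def to_common_view(source: str, record: dict[str, Any]) -> dict[str, Any]:
    """Attribute-level mediation: rename mapped fields, drop unmapped ones."""
    mapping = ATTRIBUTE_MAPS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def mediate(
    sources: dict[str, list[dict[str, Any]]], key: str = "entity_id"
) -> dict[Any, dict[str, Any]]:
    """Entity-level mediation: merge records that share the same key value."""
    merged: dict[Any, dict[str, Any]] = {}
    for source, records in sources.items():
        for record in records:
            common = to_common_view(source, record)
            if key not in common:
                continue  # cannot be matched to an entity; skipped in this sketch
            merged.setdefault(common[key], {}).update(common)
    return merged


# Illustrative data only.
SAMPLE = {
    "public_cache": [{"acc": "E1", "title": "Alpha", "extra": 1}],
    "private_db": [{"id": "E1", "score": 0.9}, {"score": 0.1}],
}
view = mediate(SAMPLE)
```

Because the mappings live in one table, a structural change in a source would, in this sketch, be absorbed by editing that table rather than the code that queries the common view, which is the kind of transparency to the end user that the text describes.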
Using discoveryHub, we can create an architecture to integrate new data from both pub