IRI
Last Updated:
Analyst Coverage: Philip Howard and Daniel Howard
IRI is a privately owned ISV founded in 1978. Its offices are in Florida, and it relies on a partner network of resellers for international coverage, with partners in 40 locations around the world.
The company’s first product, CoSort, is a high-performance data transformation utility that was originally designed to offload JCL sort/merge steps to CP/M. Needless to say, it has since been extended and ported to other environments, but it remains at the heart of IRI’s offerings, including Voracity.
Voracity is a data management platform designed to perform and consolidate common work in Data Discovery, Data Integration, Data Migration, Data Governance and Analytics.
IRI FieldShield
Last Updated: 7th January 2025
IRI offers several products that are capable of data masking, namely FieldShield, DarkShield, and CellShield. Of these, CellShield is the most specific: it is designed to operate exclusively on Excel spreadsheets. DarkShield, conversely, is the most general, capable of searching for and masking PII and other sensitive data in a variety of formats, including structured, unstructured, and semi-structured. FieldShield sits somewhere in the middle, focusing on structured data (primarily relational databases and flat files, but also including spreadsheets, though it is less specialised than CellShield with regard to the latter). All three products are available individually, as a service, or as part of the IRI Data Protector Suite within the Voracity platform. Each offers data discovery capabilities in addition to data masking, and these can be used for a variety of purposes, not the least of which is to identify sensitive data to mask.
FieldShield, in particular, was first released in 2011 as a data masking solution that would “shield” sensitive data at the field level (hence the name). Historically, one of its biggest use cases has been anonymising data sets for use in test environments (which is to say, test data management), and this continues today. Moreover, FieldShield is built on the SortCL engine that also drives several other IRI products, such as CoSort. This gives FieldShield a range of advanced functionality, including input-phase filtering and complex field logic that can combine data masking with data cleansing, joining, reformatting, transforming, and more.
FieldShield can also be leveraged in real-time database replication scenarios via IRI Ripcurrent, a CDC (Change Data Capture) facility included in Voracity that refreshes target table rows whenever there are changes in the source table. By applying FieldShield’s masking functionality within Ripcurrent, sensitive data being refreshed or replicated to target tables is masked automatically and immediately.
FieldShield works on relational databases and MongoDB, ASN.1 CDRs, Excel sheets, and fixed and delimited files – typically on-premises, but increasingly often in AWS, Azure, and GCP environments. The product itself can be deployed on-premises or in the cloud, and can be consumed via the IRI Workbench or an API. It readily integrates with other IRI products, including DarkShield for masking unstructured data and RowGen for generating synthetic data. In addition, FieldShield metadata is interoperable with other SortCL-driven products, notably Voracity for ETL and related work, plus its RowGen, NextForm, and CoSort components. The latest version of FieldShield, version 6, was announced in December 2024 and includes several new or updated features.
Masking in FieldShield is rule-based and powered by the aforementioned SortCL engine. A variety of out-of-the-box masking methods are available, including several dozen static masking functions that range across deterministic, non-deterministic, reversible, non-reversible, format-preserving, and other behaviours (as shown in Figure 1). Both string and numeric manipulations are available, and nested functions can combine string manipulations and/or value lookups with other masking rules. Moreover, you can invoke DarkShield on free-text fields from within FieldShield, scanning the contents of those fields and masking any sensitive text they contain. Masked data is always kept consistent across multiple data sources, meaning that structural and referential integrity are maintained. As mentioned above, you can also leverage SortCL to combine masking with other data manipulations: for example, you can run ETL processes alongside data masking, which is useful for, say, creating protected data sets for analytics.
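To make the idea of deterministic, format-preserving masking concrete, the minimal Python sketch below shows why the same input always maps to the same masked output, so joins across tables still line up. It illustrates the general technique only, not FieldShield’s actual SortCL functions; the field name and key are ours.

```python
import hashlib

def mask_ssn(ssn: str, secret: str = "demo-key") -> str:
    """Deterministically mask a US SSN while preserving its format.

    The same input always yields the same output, so lookups and joins
    across tables and files remain consistent (referential integrity),
    and the ###-##-#### shape survives downstream format validation.
    """
    digits = "".join(c for c in ssn if c.isdigit())
    digest = hashlib.sha256((secret + digits).encode()).digest()
    masked_digits = (str(b % 10) for b in digest)
    return "".join(next(masked_digits) if c.isdigit() else c for c in ssn)

# Identical inputs produce identical masked values on every run.
assert mask_ssn("123-45-6789") == mask_ssn("123-45-6789")
print(mask_ssn("123-45-6789"))
```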
FieldShield also provides a substantial data discovery and profiling capability, although unlike in DarkShield this is a separate process from its data masking. This enables you to search through and classify your data against a centralised library of either pre-configured or bespoke data classes shared between all of your IRI masking jobs and products, which can in turn be married to masking rules when they correspond to sensitive data. These rules are acted on at execution time, ensuring that the associated sensitive data is protected.
Performance has also been considered: for instance, tables that have already been scanned are skipped during repeated discovery phases, and you can choose to exclude specific tables or data classes from the process entirely. In addition, data classes can be grouped together at either a global or a project level, and these data class groups can be categorised by assigning them one or more sensitivity levels and/or applicable compliance regulations.
A range of discovery methods can be used as part of this process, including lookup value or pattern matching, column name matching, and dictionary matching. Moreover, any number of these methods can be used in concert to improve the accuracy of your results, albeit at the cost of performance. There is a configurable matching threshold for discovery, allowing you to specify how sure you want to be before settling on a result. Predefined methods for finding data protected by GDPR and HIPAA are available out of the box.
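As a rough illustration of how several discovery methods might be scored against a configurable threshold, consider the conceptual Python sketch below. The method weights, hint lists, and threshold value are entirely ours, not IRI’s.

```python
import re

# Illustrative inputs for three discovery methods.
PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")               # value/pattern matching
COLUMN_HINTS = {"ssn", "social_security", "natl_id"}      # column name matching
DICTIONARY = {"123-45-6789"}                               # lookup/dictionary matching

def ssn_confidence(column_name: str, sample_values: list[str]) -> float:
    """Combine the evidence from each method into one confidence score."""
    score = 0.0
    if column_name.lower() in COLUMN_HINTS:
        score += 0.4
    hits = sum(bool(PATTERN.fullmatch(v)) for v in sample_values)
    score += 0.5 * (hits / max(len(sample_values), 1))
    if any(v in DICTIONARY for v in sample_values):
        score += 0.1
    return score

THRESHOLD = 0.7   # the "how sure do I want to be" knob
samples = ["123-45-6789", "987-65-4321", "n/a"]
if ssn_confidence("ssn", samples) >= THRESHOLD:
    print("Column classified as SSN; associate it with a masking rule")
```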
FieldShield also comes equipped for handling quasi- or indirectly identifiable demographic data: data that does not identify an individual by itself, but that can be combined with other, similar data to do so. The product can score the re-identification risk of this kind of data and anonymise it, keeping it compliant while remaining accurate enough for analytic or marketing purposes.
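Re-identification risk over quasi-identifiers is commonly reasoned about in terms of measures such as k-anonymity. The short sketch below shows that general idea; it is not FieldShield’s own scoring model, and the records are invented.

```python
from collections import Counter

# Records reduced to quasi-identifiers: (age band, postcode prefix, gender)
records = [
    ("30-39", "FL32", "F"),
    ("30-39", "FL32", "F"),
    ("30-39", "FL32", "M"),
    ("60-69", "FL34", "M"),   # an equivalence class of size 1: high re-ID risk
]

class_sizes = Counter(records)
k = min(class_sizes.values())     # k-anonymity of the data set
worst_risk = 1 / k                # naive worst-case per-record re-ID probability
print(f"k = {k}, worst-case re-identification risk = {worst_risk:.0%}")
```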
The results from FieldShield’s discovery process are summarised within an HTML report (a small sample of which is shown in Figure 2). This report provides a range of visualised information, including (but far from limited to) job performance (such as how long it took to run) and configuration (such as which data sources and classes of sensitive information were used). This can be used to copy the configuration and constraints present during a previous job in order to quickly reproduce it.
FieldShield is a robust solution for data masking. It is one of the most mature products currently available on the market, and while its focus on structured data is somewhat old-fashioned, the sophisticated capabilities that this focus – along with SortCL and Ripcurrent – provides are difficult to ignore. Indeed, SortCL is a significant differentiator for FieldShield, both within and without the IRI product catalogue. In particular, it is one of the primary reasons to use it over DarkShield, which can mask both structured and unstructured data but cannot leverage SortCL in the same way. That said, given the abundance of unstructured data available in many enterprise environments, we would expect that most organisations will want to leverage both products together rather than either individually. Indeed, the fact that you can easily do this is a major strength of IRI’s data masking offering.
The bottom line
Whether for test data management or more general data masking applications, FieldShield is a formidable product for finding and protecting the sensitive data that is hidden in your structured sources.
IRI Voracity
Last Updated: 9th November 2020
Powered by CoSort (or Hadoop), and built on Eclipse, IRI Voracity is a multi-purpose data management platform designed to perform, speed up, and consolidate common work in five general areas:
- Data Discovery – data profiling, classification, search, and metadata redefinition
- Data Integration – high volume ETL, change data capture, slowly changing dimensions
- Data Migration – file/data/database type conversion, replication, and federation
- Data Governance – data quality, PII masking, re-ID risk scoring, test data synthesis
- Analytics – embedded BI, integrations with Datadog, KNIME and Splunk, and data wrangling for other analytics tools
As can be seen in Figure 1, Voracity drives solution depth by including standalone products in both the IRI Data Manager Suite and the IRI Data Protector Suite, each of which has various sub-components that support multiple capabilities.
Voracity is an integrated platform with metadata shared across the whole environment, which supports the provision of data lineage. A formal data catalogue is missing, though the product does have inherent data classification capabilities, and its central metadata stores are easy to understand, share, and use across the above applications; metadata can also be created for Collibra.
A similar consideration applies to data governance: some capabilities are provided, mostly related to data privacy and quality, but there is no general-purpose governance capability, for which the company relies on integration with partners like Erwin. The most notable of Voracity’s ancillary governance capabilities is test data management, with options for synthetic data generation, database subsetting, and static and dynamic data masking (with the option to combine the two).
Illustrated in Figure 1 but not discussed above is the IRI Workbench IDE, which supports graphical metadata creation, conversion, and discovery, along with application wizards to create, deploy, and manage data rules, job scripts, data definition files (DDF), and the XML workflows common to all IRI software. In the same pane of glass, you can also administer your databases and develop or use applications in other languages and any plug-in supported in Eclipse. As an alternative to the wizards, you can also develop jobs using diagrams, dialogs, or IRI’s domain-specific language (a 4GL) called SortCL.
Customer Quotes
“We sought a reliable tool that would quickly sort and transform very large files… we see the Voracity platform as a much more cost-effective (and higher-performing) alternative to legacy ETL tools.”
Optum
“CoSort accurately and quickly processes billions of rows of data and allows us to join and analyze this information in connection with our other data warehouse processes. No other tool gives us this much speed and flexibility.”
Comcast
IRI CoSort is the default Voracity data integration engine. Unlike other such products, it is not confined to ETL (extract, transform and load) operations, but also performs data replication (change data capture), federation, masking, cleansing, and reporting. Another key point is that it does not have to transform data in separate steps. You can define jobs that way, but at run time the engine consolidates multiple steps to reduce I/O. Add to this the fact that the run-time engine is a 2MB, multi-threaded C executable that loads only the libraries it requires, and you will appreciate why CoSort has a performance advantage over its competitors.
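The value of consolidating steps is easy to see in miniature: rather than staging intermediate files between a filter, a transform, and a sort, the work is done in one pass over the data. The Python sketch below is a conceptual analogue only (CoSort’s engine works very differently internally); the file and column names are invented.

```python
import csv

def single_pass(in_path: str, out_path: str) -> None:
    """Filter, transform, and sort in one pass, with no intermediate files on disk."""
    with open(in_path, newline="") as f:
        rows = [r for r in csv.DictReader(f) if r["status"] == "active"]   # filter
    if not rows:
        return
    for r in rows:
        r["name"] = r["name"].upper()                                      # transform
    rows.sort(key=lambda r: r["name"])                                     # sort
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)                                             # single load step
```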
Note that IRI also offers a Hadoop-based option that does not have the same footprint advantages as CoSort but otherwise runs in a similar fashion. Moreover, many jobs developed for native CoSort implementations will run without change in MapReduce 2, Spark, Spark Streams, Storm or Tez. Dataflows are stored in files and can be executed from anywhere.
The company offers an extensive range of native connectors (including MQTT and Kafka) plus JDBC support. Not surprisingly given its heritage, it also supports mainframe sources that use COBOL copybooks, EBCDIC and so on. While it does not run on z/OS it does support mainframe databases as sources and will itself run on z/Linux.
While IRI Voracity does not offer a module called “data quality”, it does provide substantial relevant capability, as illustrated in Figure 2.
A major strength of IRI Voracity is clearly its Data Protector Suite. To begin with, IRI has deployed machine learning (including within IRI DarkShield) to support the identification of sensitive data (though we are disappointed that machine learning has not yet been implemented more widely across the platform). It also uses natural language processing for this purpose. Once sensitive data has been discovered, as mentioned, the company offers significant masking capabilities. In particular, dynamic data masking may be proxy-based, run in situ, or driven by APIs, and can be mixed and matched with static masking. It is also worth mentioning that Voracity supports the ability to search, parse and protect multiple sources containing semi- and unstructured data.
Finally, given the current trend of companies migrating from on-premises data warehouses to cloud-native data warehouses such as Snowflake or Google BigQuery, it is worth noting the availability of IRI FACT and IRI NextForm, which bolster high-volume database migration operations.
IRI Voracity is close to being a complete data management platform. It only lacks a formal data catalogue and some extensions to its policy and governance capabilities, which are in development. On the other hand, it is much more advanced when it comes to ETL performance and sensitive data protection than many of its competitors. The company’s data migration capabilities will also be a boon in the current environment, as will its relatively attractive price points and licensing options.
The Bottom Line
The key features of IRI Voracity are the performance that the CoSort engine offers, and the depth of capability it provides in extending its data management platform into the identification and management of sensitive data. If these are important issues for you, then you should seriously consider IRI Voracity.
Sensitive Data Discovery and Masking in IRI Voracity
Last Updated: 3rd March 2022
IRI Voracity is a data management platform that offers its core capabilities through two product suites: IRI Data Manager Suite, and IRI Data Protector Suite. In particular, the latter provides a selection of data masking products (namely IRI FieldShield, CellShield EE, and DarkShield, plus a services option that leverages them called DMaaS) that also come equipped with significant data discovery capabilities. This functionality can be used for a variety of purposes, not the least of which is to find and protect your sensitive data.
The Voracity platform, including the above products, can be accessed through either IRI Workbench, a largely wizard-driven Eclipse interface backed by graphical modelling, or via APIs. Licensing is flexible, with options available for Voracity as a whole as well as for individual products and APIs. IRI also partners (and integrates) with a number of other vendors, which variously add capabilities to the IRI offering and provide enhanced support for provisioning and CI/CD pipelines.
Customer Quotes
“Our experience with millions of unstructured files confirms the need to identify and mitigate the data privacy risks within them. Standalone and embedded spreadsheets, Word and PDF documents, image files in multiple formats, as well as logs and emails, are strewn with PII unknown to our customers. These needles in historical or operational customer haystacks need to be found and blunted. Fortunately, the search methods and masking functions in IRI DarkShield specifically and Voracity generally help us get control of these hidden risks.”
GDPR Tech
Masking in Voracity is rule-based and powered by the CoSort engine. FieldShield masks structured databases and flat files, CellShield masks Excel sheets, and DarkShield can search and mask structured, semi-structured and unstructured data sources simultaneously. Several dozen static masking functions are available for FieldShield and DarkShield, and about half of those are available in CellShield as well. In static operations, masked data is kept consistent across multiple data sources so that referential integrity is always maintained. Dynamic data masking is also available.
In addition to data masking, the various Data Protector Suite products provide data discovery and profiling capabilities. These enable you to classify your data against a centralised library of either pre-configured or bespoke data classes shared between all of the shield products, which can in turn be married to masking rules when they correspond to sensitive data (see Figure 1). These rules are acted on at execution time, ensuring that the associated sensitive data is protected. Each data class can also be equipped with a search methodology that is used to locate matching data in your system. This means that, when set up correctly, IRI can effectively automate the process of finding and anonymising your sensitive data: it will discover your sensitive data using the aforementioned search, associate it with the appropriate data class, and mask it at execution time. Performance considerations have also been built in. For instance, tables that have already been scanned are skipped during repeated discovery phases, and you can choose to exclude specific tables or data classes from the process entirely.
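The sketch below illustrates the shape of this arrangement: a central library of data classes, each carrying a search method and a masking rule that is applied at execution time. It is a toy model of the concept, not Voracity’s implementation; the class names and rules are ours.

```python
from dataclasses import dataclass
from typing import Callable
import re

@dataclass
class DataClass:
    name: str
    search: Callable[[str], bool]   # how to find matching values
    mask: Callable[[str], str]      # what to do with them at execution time

# A tiny shared "library" of data classes (illustrative only).
REGISTRY = [
    DataClass("EMAIL",
              search=lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v)),
              mask=lambda v: "***@" + v.split("@", 1)[1]),
    DataClass("US_PHONE",
              search=lambda v: bool(re.fullmatch(r"\d{3}-\d{3}-\d{4}", v)),
              mask=lambda v: "XXX-XXX-" + v[-4:]),
]

def protect(value: str) -> str:
    """Apply the masking rule of the first data class whose search matches."""
    for dc in REGISTRY:
        if dc.search(value):
            return dc.mask(value)
    return value   # not sensitive under any registered class

print(protect("jane.doe@example.com"))   # ***@example.com
print(protect("555-867-5309"))           # XXX-XXX-5309
```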
An impressive range of discovery methods can be used as part of these capabilities, including lookup value or pattern matching, NER (Named Entity Recognition), column name matching, fuzzy or exact dictionary matching, path searching, facial recognition matching, font matching, character recognition, and coordinate matching (the latter two mostly for images). NER in particular uses semi-supervised machine learning to enable more sophisticated and effective language analysis of highly unstructured data. In addition, any number of these methods can be used in concert to improve the accuracy of your results. There is also a configurable matching threshold for discovery, allowing you to specify how sure you want to be before settling on a result.
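To show what named-entity recognition over free text looks like in general, the snippet below uses the open-source spaCy library. This is a generic illustration of the technique, not DarkShield’s own NER models or training pipeline.

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Jane Doe transferred $4,200 to ACME Corp from her London branch on 3 May."

doc = nlp(text)
for ent in doc.ents:
    # Entity labels such as PERSON, ORG, GPE, MONEY, DATE are candidates
    # for association with data classes and masking rules.
    print(ent.text, ent.label_)
```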
Moreover, there are two ways to consume Voracity’s discovery and masking capabilities. You can go through Workbench – which has the advantage of a relatively friendly, wizard-driven user interface coupled with visualised reporting, as shown in Figure 2 – or you can leverage them directly through an API. In the latter case, this essentially allows you to use Voracity as a discovery and masking engine that underpins your other data pipelines. This has obvious (and positive) implications for integration and automation.
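As a purely hypothetical illustration of the API-driven pattern, a pipeline step might post a batch of records to a masking service and consume the protected output, as below. The endpoint URL, payload shape, and response format are invented for this sketch; the vendor’s API documentation defines the real interface.

```python
import requests

# Hypothetical endpoint and payload shape, assuming a locally deployed service.
MASKING_SERVICE = "http://localhost:8080/api/mask"

def mask_batch(rows: list[dict]) -> list[dict]:
    """Send a batch of records to a masking service and return protected rows."""
    response = requests.post(MASKING_SERVICE, json={"records": rows}, timeout=30)
    response.raise_for_status()
    return response.json()["records"]

protected = mask_batch([{"name": "Jane Doe", "ssn": "123-45-6789"}])
print(protected)
```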
IRI Voracity provides a robust architecture for managing your data classes, defining them centrally and assigning discovery and masking methods to them. It offers a healthy range of discovery methods running from the simple to the sophisticated, and its applicability to highly unstructured data, such as image files, is particularly notable.
Moreover, Voracity is billed as a total data management platform, and to that end it offers a wealth of additional capabilities – data integration, governance, quality, and so on – that will frequently tie into, and either augment or be augmented by, data discovery (and, to a lesser extent, masking) in one way or another. These capabilities are offered through a unified and user-friendly interface, complete with wizards, visual programming, and so on. This makes it easy to use each individual product and to shift your attention from one product to another. These advantages carry over to data discovery and data masking, at least if you plan to leverage these technologies through Workbench. That said, even if you don’t, you will simply benefit from the flexibility, integration and automation offered by an API-driven approach instead. By way of example, data discovery through the DarkShield API can be coupled with test data generation using IRI RowGen to replace values in images and documents with synthetic, but realistic data and fonts – providing more safety for applications and processes that handle these sorts of files.
The bottom line
IRI justifiably positions Voracity as a total data management platform. As a solution for data masking and data discovery, for sensitive data or otherwise, it is both highly competent and flexible in how you can interact with it. In short, whether you want a solution that comes integrated into a larger platform, or one that works as a standalone engine, IRI Voracity should satisfy.
Test Data Management in IRI Voracity
Last Updated: 6th December 2023
IRI Voracity contains two product suites that are relevant to test data management (TDM): the IRI Data Manager Suite and the IRI Data Protector Suite. The Data Protector Suite provides a selection of masking products (IRI FieldShield, CellShield EE, and DarkShield) suitable for various use cases, including TDM, that also come equipped with significant data discovery and classification capabilities. It also offers data classification, discovery, and masking as a professional service, aptly named Data Masking as a Service, or DMaaS. The Data Manager Suite, on the other hand, contains IRI RowGen, which can be used to generate synthetic test data. In principle it also provides data subsetting, but in practice this is more typically delivered as part of the platform’s broader data integration capabilities.
The Voracity platform, including the above products, is accessed through either IRI Workbench, a largely wizard-driven Eclipse interface backed by graphical modelling (displayed in Figure 1), or via APIs. Licensing is flexible, with options available for Voracity as a whole as well as for individual products and APIs. Database virtualisation is not offered directly, but is provided through integration with partner vendors Windocks and Actifio. Other partnerships support integration with provisioning and CI/CD pipelines – among other things – and recent collaborative efforts with Cigniti and ValueLabs are resulting in those companies’ more workflow-oriented front-ends being applied to the core Voracity engine, creating a smoother experience when they are deployed together (at least for organisations that require extensive approval processes as part of their data access).
Customer quotes
“Test data management (TDM) is a critical part of our agile SDLC, and is subject to data privacy regulations. Integrated data classification, discovery, anonymization, subsetting, and synthesis functions in Voracity improve our time-to-market delivery strategy, and help us comply with GDPR and similar laws.”
Capgemini Technology Services
The Data Protector Suite provides (sensitive) data discovery and classification facilities in support of data profiling and masking operations (and thence TDM). It categorises your data against an extensible library of pre-configured or bespoke data classes, which can be tagged with varying levels of sensitivity and then married to appropriate masking or test data generation rules that are acted on at execution time. In this way, you can use Voracity to find and protect your sensitive data, allowing it to be used for testing.
Various discovery methods are available, including pattern matching, named entity recognition (which in turn leverages semi-supervised machine learning), column name matching, fuzzy and exact dictionary matching, path searching, font matching, character recognition, and coordinate matching. Any number of these methods can be used together for additional accuracy, and validation scripts can be employed to reduce false positives. Discovery results can be rendered as graphical reports; an example of this is shown in Figure 2.
Masking is powered by the CoSort engine. FieldShield masks relational databases and flat files, CellShield masks Microsoft Excel data, and DarkShield masks structured, semi-structured and unstructured data (including images and documents) simultaneously and consistently. Static and dynamic masking are available, as is support for a variety of data sources. Various masking functions are provided out of the box, and you can build your own functions externally and integrate them via an API. You can also combine multiple discovery methods and/or masking functions together and apply them simultaneously. Masked data is consistent across all sources, while referential integrity is always maintained.
RowGen provides synthetic data generation. It emphasises the customisation of test data, giving you fine-grained control over what, how and where your data is generated. For instance, it can generate test data using parameters you provide to it (including which class of data it should belong to) or select data randomly from one or more “set files” that have been prepared ahead of time, creating a holistic data profile for a person or other entity that does not exist but that has realistic attributes drawn directly from your data. Moreover, this extends past just what data you are generating and also encompasses how and where you are generating it (which means that you could, say, generate test data within a CI/CD pipeline).
Various generation functions are available for creating test data sets, ranging from the specific – such as national ID number generation – to the generic – generating data according to a predefined, weighted statistical distribution. There are multiple ways to customise the end results of these functions: test data can be generated in such a way that each value is unique, each value in a set file can be mandated to be used exactly once, and so on. You can even define your own compound data formats. Regardless, the characteristics of your production data – including original data formats and sizes, value ranges, key relationships, and frequency distributions – are preserved. You can also generate test data in a variety of unstructured formats, including images and PDFs, based on predefined templates.
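A minimal sketch of the underlying techniques – prepared value lists ("set files") plus weighted distributions – is shown below. It illustrates the general approach rather than RowGen’s actual syntax; the field names, value lists, and weights are invented.

```python
import random

# "Set files" prepared ahead of time: realistic values to draw from.
first_names = ["Amelia", "Brian", "Chloe", "Dev", "Elena"]
cities      = ["Orlando", "Tampa", "Miami"]

# A predefined, weighted statistical distribution for account status.
statuses, weights = ["active", "dormant", "closed"], [0.7, 0.2, 0.1]

def synthetic_customer(customer_id: int) -> dict:
    """Compose a realistic-looking but entirely fictitious customer record."""
    return {
        "id": customer_id,                                   # unique per record
        "name": random.choice(first_names),
        "city": random.choice(cities),
        "status": random.choices(statuses, weights=weights, k=1)[0],
        "balance": round(random.lognormvariate(7, 1), 2),    # skewed, realistic range
    }

for row in (synthetic_customer(i) for i in range(1, 6)):
    print(row)
```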
Subsetting is delivered via either RowGen or Voracity’s data integration capabilities. In either case, you can specify a driver table and trace its foreign key relationships to create a self-contained subset. Voracity gives you the option to follow these relationships “downhill” – only moving from parent to child – or to move through them in either direction. The former is faster, but the latter is more comprehensive. In addition to quantitative subsetting based purely on volume, you can also employ more qualitative methods that apply conditions to the initial data set in order to create a coherent subset (which will, again, be self-contained).

All of this functionality can be executed as individual scripts or batch jobs, which can be created using various wizards, form editors, and mapping diagrams. They can then be executed from within IRI Workbench, the command line, or a partnered database virtualisation environment such as Windocks. In the latter case in particular, Voracity and Windocks can be used to create sanitised clones of your production data in on-demand, self-service, containerised, and virtualised repositories.
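To make the "driver table plus foreign-key traversal" idea above concrete, here is a toy, in-memory sketch of "downhill" subsetting (parent to child only). It is not Voracity’s implementation; the tables and keys are invented.

```python
# Toy schema: customers (driver table) -> orders -> order_items
customers   = [{"id": 1}, {"id": 2}, {"id": 3}]
orders      = [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 3}]
order_items = [{"id": 100, "order_id": 10}, {"id": 101, "order_id": 11}]

def subset_downhill(driver_ids: set[int]) -> dict[str, list[dict]]:
    """Follow foreign keys parent-to-child only, yielding a self-contained subset."""
    kept_customers = [c for c in customers if c["id"] in driver_ids]
    kept_orders = [o for o in orders if o["customer_id"] in driver_ids]
    kept_order_ids = {o["id"] for o in kept_orders}
    kept_items = [i for i in order_items if i["order_id"] in kept_order_ids]
    return {"customers": kept_customers, "orders": kept_orders, "order_items": kept_items}

# One driver customer, its orders, and their items -- nothing left dangling.
print(subset_downhill({1}))
```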
APIs are provided, meaning that Voracity TDM functions are also operable as part of an external pipeline, and can be invoked directly from within your CI/CD platform processes, either on-premises or in the cloud. The test data created by Voracity’s processes can be exported to many databases and file formats, including spreadsheets, PDFs and images.
Finally, IRI Ripcurrent, a real-time database event processing module, was recently added to Voracity. Ripcurrent offers incremental data replication by detecting and acting on changes to relational database tables in real-time. This works by monitoring log events for inserts, updates, deletes and schema structural changes, then mapping the data on-the-fly and/or issuing alerts. Applied to TDM, it can be used to refresh your test data environment by carrying out both data replication and masking processes automatically as soon as a corresponding production environment is changed.
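Conceptually, CDC-driven refresh with masking looks like the sketch below: each change event is masked before it is applied to the target, so the test environment is refreshed with protected data as soon as production changes. The event shapes and masking rule are illustrative only; Ripcurrent itself works from database log events.

```python
from typing import Callable

# A stream of change events as they might arrive from a database log reader.
events = [
    {"op": "insert", "table": "customers", "row": {"id": 7, "email": "jane@example.com"}},
    {"op": "update", "table": "customers", "row": {"id": 7, "email": "jane.d@example.com"}},
    {"op": "delete", "table": "customers", "row": {"id": 7}},
]

def mask_row(row: dict) -> dict:
    """Apply the masking rule bound to sensitive columns before replication."""
    masked = dict(row)
    if "email" in masked:
        masked["email"] = "***@" + masked["email"].split("@", 1)[1]
    return masked

def replicate(event: dict, apply: Callable[[dict], None]) -> None:
    """Mask, then apply each change to the target, keeping test data fresh and safe."""
    if event["op"] in ("insert", "update"):
        apply({**event, "row": mask_row(event["row"])})
    else:
        apply(event)

for e in events:
    replicate(e, apply=print)   # stand-in for writing to the target database
```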
IRI’s subsetting, masking and synthetic data generation capabilities are all highly competent. The ability to create representative synthetic data sets via analysis is particularly notable and useful, as is Ripcurrent’s automatic and real-time refresh of your test data. That said, in this paper we have only been able to scratch the surface of the product’s capabilities. There is a significant depth of functionality here: IRI has been organically growing its technology for over 40 years, and it shows. If you would like to learn more, we refer you to our recent series of articles on IRI and Voracity, which explores several of the topics touched on in this report in greater detail. We are also told that IRI is working on implementing generative AI as part of its sensitive data discovery and synthetic data generation capabilities, although the details of this have yet to be announced.
What is more, TDM is only one aspect of Voracity. It is billed as a total data management platform, and to that end it offers a wealth of other capabilities – data integration, governance, quality, and so on – that stretch beyond just TDM. Moreover, these capabilities (including TDM) are offered through a unified and user-friendly interface, complete with wizards, visual programming and so on. This makes it easy to use each individual product and to shift your attention from one product to another. Integration with CI/CD pipelines is also a useful feature, enabling Voracity to automate both the production and consumption of test data.
The company’s partnership with containerised database virtualisation vendor Windocks is particularly notable, and its other relevant partners, including Actifio, CommVault, Cigniti, and ValueLabs, should be considered as well.
The bottom line
IRI Voracity is both a data management platform and TDM solution, with many elements of the former being highly applicable to the latter. Ripcurrent is a particularly compelling example of this kind of applicability. The end result is an effective and versatile solution for TDM.