Trifacta
Analyst Coverage: Philip Howard
Trifacta was founded in 2012, having started out as a joint research project between professors and doctoral students at Stanford University and the University of California, Berkeley. The company has its headquarters in San Francisco and also has offices in Palo Alto, Boston, London and Berlin.
Trifacta Wrangler and Wrangler Enterprise
Last Updated: 24th November 2016
Trifacta is a self-service data preparation platform that was originally designed primarily for use by data scientists. However, the company has gradually evolved the product to target business analysts as well, while still maintaining the flexibility and power required by data scientists. This makes it unique in the marketplace: other products are (currently) aimed only at business analysts and lack specific features to support the work of data scientists, a group for whom Trifacta is the clear market leader. Nevertheless, Trifacta tells us that it has more business analysts as users than it does data scientists.
Trifacta offers both Trifacta Wrangler Enterprise and Trifacta Wrangler, with the former being Hadoop-based and the latter being a desktop product that does not require Hadoop. Wrangler Enterprise can be deployed in the Cloud or on-premises.
Trifacta primarily uses a direct sales model, though it also partners with resellers and systems integrators (Infosys is an example). Its technical partnerships include Hadoop distributors such as Cloudera, Hortonworks and MapR, as well as business intelligence vendors such as Tableau, Qlik and ZoomData. Especially notable partnerships, where the respective products have been closely integrated, are with data cataloguing and governance products such as Waterline and Cloudera Navigator.
Trifacta has a substantial user base for both the enterprise and desktop products. Many of the company's users are household names: for example, UnitedHealth Group, Santander, Zurich Insurance, GoPro, LinkedIn, Dish Networks, Lockheed Martin, Royal Bank of Scotland, and many others. As can be seen from this list, the use of Trifacta is not limited to any particular verticals or industry sectors.
Trifacta offers its capabilities for a variety of users. In the latest release (4.0) the company has introduced its Builder capability, which provides a menu-driven, workflow-based approach for defining wrangling steps. This is, in our opinion, a significant step forward in making the product more intuitive and easier to use for non-technical users. At a lower level, data scientists can script directly in Trifacta's scripting language (using the advanced mode editor environment). This language is called Wrangle and it will build regular expressions for you, which are then compiled to MapReduce, Spark or Trifacta's in-memory engine Photon. In this context, it is worth remarking that Trifacta uses the term "wrangling" in a broader sense than some other vendors: encompassing discovery, structuring, cleaning, enrichment, validation, and the publishing of data. New features include pattern profiling and fuzzy joins, amongst others. Wrangle itself is delivered with several hundred pre-built functions (for example, changing case from upper to lower). There is also support for User Defined Functions (UDFs) that can be written in Python or Java.
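To give a feel for what pattern profiling means in general, the following is a minimal Python sketch. It is not Trifacta's implementation and it is not written in Wrangle: the function names (`value_pattern`, `pattern_profile`), the character classes and the sample data are all assumptions made for illustration. Each value is reduced to a structural pattern, and the patterns are then counted so that the dominant format, and any outliers, become visible.

```python
from collections import Counter


def value_pattern(value: str) -> str:
    """Reduce a value to its structure: digits become '9', letters become 'A'.

    Any other character (punctuation, spaces) is kept as-is.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def pattern_profile(values):
    """Count how often each structural pattern occurs in a column."""
    return Counter(value_pattern(v) for v in values)


# Hypothetical column of phone numbers in inconsistent formats.
phones = ["415-555-0100", "415-555-0101", "(415) 555 0102", "n/a"]
profile = pattern_profile(phones)

for pattern, count in profile.most_common():
    print(f"{pattern!r}: {count}")
```

A profile of this kind shows at a glance which format dominates a column and which values would need to be standardised before a join or an export.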
The platform is highly available and there are built-in connections to the NameNode (in Hadoop environments) and ZooKeeper to ensure this. The company is a partner of both Cloudera and Hortonworks (as well as MapR) and the product supports both Sentry and Ranger for security purposes. Other notable Hadoop-based capabilities include support for HCatalog and the Hive MetaStore. While Trifacta will work with traditional relational and file-based data (xlsx, CSV), it also supports cloud sources in AWS, Microsoft Azure and Google Cloud Platform as well as more modern file formats such as JSON, Parquet, ORC and Avro. Lastly, Trifacta supports publishing of data in specific file formats for downstream use in business intelligence products such as Tableau and Qlik.
In terms of underlying features and functions Trifacta has the sorts of capabilities that one might expect: profiling that allows you to see type-specific histograms of values; automatic identification of data quality issues such as missing or mismatched values; automatic parsing of nested data formats and structures such as JSON (important when Trifacta is used in conjunction with business intelligence products such as Tableau); data enrichment; task orchestration and scheduling; and machine learning capabilities that will progressively improve its recommendations with respect to appropriate transformations. These capabilities are accessible through the browser, can run at scale on Hadoop, and meet industry standards for security, with support for Kerberos, Secure Impersonation, Sentry and Ranger.
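The identification of missing and mismatched values can be illustrated with a small, generic Python sketch. This is an assumption-laden illustration rather than a description of how Trifacta does it: the `PARSERS` table, the `classify` and `quality_summary` helpers, the two supported types and the sample column are all hypothetical. A value counts as missing when it is empty, as mismatched when it cannot be parsed as the expected type, and as valid otherwise.

```python
from collections import Counter
from datetime import datetime

# Hypothetical parsers for two expected column types (an assumption for
# this sketch; a real tool would infer and support many more types).
PARSERS = {
    "integer": int,
    "date": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}


def classify(value, expected_type):
    """Label one raw value as 'missing', 'mismatched' or 'valid'."""
    parser = PARSERS[expected_type]  # raises KeyError for unsupported types
    if value is None or value.strip() == "":
        return "missing"
    try:
        parser(value.strip())
    except ValueError:
        return "mismatched"
    return "valid"


def quality_summary(values, expected_type):
    """Count valid, missing and mismatched values in a column."""
    return Counter(classify(v, expected_type) for v in values)


# Hypothetical column that is expected to hold integers.
ages = ["34", "", "forty", "27", None]
summary = quality_summary(ages, "integer")
print(dict(summary))
```

The three counts are the kind of information that a data quality indicator on a column could summarise, so that a user sees where cleaning effort is needed before defining any transformation.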
One notable feature is what the company calls "Interactive Data Exploration", which is a form of data visualisation. This is not visualisation in the traditional sense, for end-consumption in analysis; instead, it gives users information about the data they are working with in order to jump-start or guide the process of transformation. The system presents the user with automated visual representations of the data based upon the inferred data type of each attribute. These profiles require no specification by the user and automatically present each data type using the most compelling visual representation: geographic elements are presented as maps, time-oriented elements are presented via common hierarchies such as day, month, year, and so on. Every Trifacta profile is completely interactive, allowing the user simply to select certain elements of the profile to prompt transformation suggestions.
Trifacta provides the sort of training, consulting services and support that you would expect.