Like many other folks out there who are discovering the power of data compute grids, we recently transitioned part of our traditional database-based ETL system to a Hadoop-based processing system. Being in the digital advertising field, we receive a large volume of impression data, both from our own tracking systems and from external sources. Our existing ETL system mainly comprises four components: cleansing, standardization, dimensioning, and aggregation.
The cleansing and standardization components involve a lot of text parsing and mapping, and they bear the brunt of the load because they deal with the raw volume of incoming data. As part of the transition we moved these two components to the new system.
We have a dedicated product support team that handles most of the daily user data queries. These include issues like missing or incorrect data, re-running jobs due to changed dimensions or external data outages, configuring and testing new data fields and sources, generating ad-hoc reports, configuring new clients, and so on. These folks have thorough domain knowledge and are well versed in this data. They also fully understand the current ETL data flow and the various business rules that get applied as part of the data processing. Technology-wise they are comfortable with databases, SQL, basic scripting, and Excel, and are usually enthusiastic about learning new technologies as needed.
To handle the issues mentioned above, they essentially need a way to slice and dice raw and stage-level data, and with the data now residing in HDFS this becomes a problem. We have been brainstorming ways to expose the new system to the support folks, and below are some of the options.
Apache Hive: Facebook ran into this same problem when it transitioned its analysts from an RDBMS-based warehouse to a more scalable Hadoop system. They developed Hive, which is essentially a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL is very similar to SQL, although it does not support the full SQL-92 specification. From what I have read, it most closely resembles MySQL's SQL dialect, which makes sense because Facebook is a MySQL shop and the similarity would make the transition easier for its folks. This option is at the top of our list of possibilities given its similarity to SQL.
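To make the idea of projecting structure onto raw data a bit more concrete, here is a minimal sketch in plain Python. It is not Hive and does not use any Hive API; the sample log lines, the `SCHEMA` list, and the function names are all made up for illustration. It only shows the general pattern: the data stays as delimited text, column names and types are applied at read time, and a SQL-style aggregation runs on top.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw impression log, tab-delimited, as it might sit in a file.
RAW = (
    "2012-01-05\tclient_a\tbanner\t120\n"
    "2012-01-05\tclient_b\tvideo\t45\n"
    "2012-01-06\tclient_a\tbanner\t80\n"
)

# Hypothetical schema projected onto the raw text at read time:
# each entry is (column name, type converter).
SCHEMA = [("day", str), ("client", str), ("ad_type", str), ("impressions", int)]


def read_rows(text, schema):
    """Yield one dict per line, applying the schema's names and types."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for fields in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def impressions_by_client(rows):
    """Roughly what a GROUP BY client with SUM(impressions) would compute."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["client"]] += row["impressions"]
    return dict(totals)


totals = impressions_by_client(read_rows(RAW, SCHEMA))
print(totals)
```

The appeal for our support folks is that with Hive they would only write the query part in a familiar SQL-like form, while the schema projection and the distributed execution are handled for them.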
Apache Pig: Apache Pig is also a higher-level abstraction over map/reduce. Pig uses the Pig Latin language to express data flows. Although this is a powerful tool, it would require the support folks to learn an entirely new language.
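The difference in style is easier to see with an example. Where a SQL-like query describes the result in one statement, a data flow is written as a sequence of named steps, each one transforming the output of the previous step. The sketch below imitates that style in plain Python rather than Pig Latin; the records and variable names are hypothetical, and the comments only point to the Pig Latin operators (FILTER, GROUP, FOREACH ... GENERATE) that play a similar role.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (client, ad_type, impressions).
records = [
    ("client_a", "banner", 120),
    ("client_b", "video", 45),
    ("client_a", "video", 0),
    ("client_a", "banner", 80),
]

# Step 1 (similar in spirit to FILTER): drop records with no impressions.
non_empty = [r for r in records if r[2] > 0]

# Step 2 (similar in spirit to GROUP ... BY): collect records per client.
# itertools.groupby only groups adjacent items, so sort on the key first.
by_client = {
    client: list(group)
    for client, group in groupby(
        sorted(non_empty, key=itemgetter(0)), key=itemgetter(0)
    )
}

# Step 3 (similar in spirit to FOREACH ... GENERATE): one total per group.
totals = {client: sum(r[2] for r in rows) for client, rows in by_client.items()}
print(totals)
```

This step-by-step form is flexible, but it is a different way of thinking from the SQL our support folks already know, which is the learning cost mentioned above.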
Commercial tools like Datameer and IBM BigSheets: Microsoft Excel is widely regarded as one of the most versatile analytic tools. Analysts love its ease of use and the tools it provides to slice, dice, and graph datasets. Imagine an Excel-like tool with the power of Hadoop. That is essentially what these commercial tools are. We recently received a great demo from the Datameer folks and were impressed by its ease of use and especially its pluggable architecture. It offers business users an easy, familiar spreadsheet-like interface with a complete set of data integration, transformation, analytics, and visualization tools. It also has a neat scheduler for cron-based job scheduling. This option is also a strong contender given its ease of use and spreadsheet-like look and feel.
We haven't decided on an option yet. The next couple of weeks will involve working closely with the support folks to evaluate these options and ensure a smooth transition.