This is an excellent comment about Hadoop and data warehouses.
… There are three main use cases for Hadoop with a data warehouse; the picture above is an example of use case 3:
- Archiving data warehouse data to Hadoop (move)
- Exporting relational data to Hadoop (copy)
- Importing Hadoop data into data warehouse (copy)
… Here are some of the reasons why it is not a good idea to have only Hadoop as your data warehouse:
- Hadoop is slow for read queries. HDP 2.0 today will not perform anywhere near PDW for interactive querying. This is why PolyBase is so important: it bridges the gap between the two technologies, so customers can take advantage of the unique features of Hadoop and still realize the benefits of an EDW. Truth be told, users won’t want to wait 20+ seconds for a MapReduce job to start up just to execute a Hive query.
- Hadoop is not relational. All the data sits in files in HDFS, so there is always a conversion step to get the data into a relational format
- There is no metadata stored in HDFS, so another tool needs to be used to store that, adding complexity and slowing performance
- Finding expertise in Hadoop is very difficult: only a small number of people understand Hadoop and all its various versions and products, compared with the large number of people who know SQL
- Super complex, with lots of integration across multiple technologies needed to make it work
- Many tools/technologies/versions/vendors, no standards
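To make the points about files and metadata concrete, here is a minimal sketch of what "not relational" means in practice. It assumes a small comma-delimited file and a hypothetical `SCHEMA` list standing in for whatever separate tool holds the metadata; neither comes from a real Hadoop API. The raw text carries no column names or types, so they have to be supplied from somewhere else every time the data is read.

```python
import csv
import io

# Raw file contents as they might sit in a file system: delimited text only,
# with no column names and no types. (Illustrative data, not from the post.)
RAW_FILE = "1001,2014-01-15,19.99\n1002,2014-01-16,5.00\n"

# Hypothetical external schema, standing in for a separate metadata tool.
SCHEMA = [("order_id", int), ("order_date", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Apply column names and types to each raw line at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        # Pair each raw string field with its (name, type) entry and convert it.
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


rows = read_with_schema(RAW_FILE, SCHEMA)
```

Without `SCHEMA`, the same file is just lines of strings, which is the conversion cost the list above describes.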
… I also wanted to mention that “unstructured” data is a bit of a misnomer. Just about all data has at least some structure, so it is better to call it “semi-structured”. I like to think of data in a text file as semi-structured until someone adds structure to it, for example by importing it into a SQL Server table. Or think of structured data as relational and unstructured data as non-relational.
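As a rough illustration of adding structure to a text file, the sketch below loads a few log-style lines into a table. It uses Python's built-in `sqlite3` as a stand-in for SQL Server, and the sample lines, the `log_entries` table, and its column names are all made up for the example.

```python
import sqlite3

# Semi-structured input: each line has an implied layout (date, level, message),
# but nothing in the text declares it. (Illustrative data.)
LOG_LINES = [
    "2014-01-15 ERROR disk full",
    "2014-01-16 INFO backup completed",
]


def load_into_table(lines):
    """Split each line into columns and insert it into a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log_entries (log_date TEXT, level TEXT, message TEXT)")
    for line in lines:
        # The first two spaces separate date and level; the rest is the message.
        log_date, level, message = line.split(" ", 2)
        conn.execute(
            "INSERT INTO log_entries VALUES (?, ?, ?)",
            (log_date, level, message),
        )
    conn.commit()
    return conn


conn = load_into_table(LOG_LINES)
errors = conn.execute(
    "SELECT log_date, message FROM log_entries WHERE level = 'ERROR'"
).fetchall()
```

Once the lines are in the table, a question like "which entries are errors" becomes an ordinary SQL query instead of ad hoc text parsing.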
Reference: Hadoop and Data Warehouses