How I passed CCD-410 (Cloudera Certified Developer for Apache Hadoop (CCDH))

Mark Gschwind

Getting experience in new technologies is a chicken-and-egg problem: no one will put you on a project until you have experience, but it’s difficult to get experience without being on a project.  This was my conundrum with Big Data, and I chose to “hatch” some experience by earning the CCDH credential.  My mindset was to truly learn the technologies around Hadoop, focusing less on the test requirements.  Of course I also wanted to pass the test, but treating the test preparation as a guide that I occasionally strayed from made the process fun as well as rewarding.

Here is my advice on how to prepare for CCD-410:

      • Start by going through all the Hortonworks tutorials.  Hadoop is the same regardless of its distributor, and you should take advantage of training materials from all of the distributors.  You will need to install the Hortonworks Sandbox first.  These tutorials give you a great introduction, as well as real experience using Hive, Pig, the CLI, etc.  They also cover much of the Hadoop ecosystem, which Cloudera will test you on.
      • Download and install the Cloudera QuickStart VM.  I installed it on VMware Player hosted by Windows 7, and it ran fine with 4 GB of RAM.
      • Get Tom White’s book, “Hadoop: The Definitive Guide” and make sure to understand chapters 1-3 in particular.  The rest of the book is useful, but it is pretty dry reading.  Try switching to Alex Holmes’ book “Hadoop in Practice” when you feel you need a change.
      • Hire a tutor to teach you MapReduce.  I found an excellent one on eLance and had about 10 sessions with him.  We went into great depth on the “WordCount” program from Tom White’s book.  These sessions helped me understand the Writable and WritableComparable interfaces of MapReduce, which was invaluable for passing the test.  And much more of Tom White’s book became understandable.
      • If you need some help with programming Java, I found this series by thenewboston on YouTube very useful.  And the presenter has a great sense of humor.
      • Once you have some background with HDFS and the MapReduce API, you will be ready for these two excellent videos by Tom White.  They show the old API, but the concepts are spot-on for many test questions:
        • Tom White, Devoxx ’10: Hadoop Fundamentals: HDFS, MapReduce, Pig and Hive – Part 1
        • Tom White, Devoxx ’10: Hadoop Fundamentals: HDFS, MapReduce, Pig and Hive – Part 2
      • There will be questions on YARN and MRv2, and you will need to find resources to learn these important changes to Hadoop.  This article from Apache explains it generally, but you should find additional resources to learn it in greater detail.
      • Go through the Cloudera study guide for the exam, watch all the videos it provides, and be sure to brush up on topics you do not fully understand.
      • As I went through the study guide from Cloudera, I watched many YouTube videos from Hadoop Summit 2013.
      • This article, “24 Interview Questions & Answers for Hadoop MapReduce developers,” was a great study guide, and served as a good practice test.
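To make the WordCount discussion above concrete, here is a minimal plain-Java sketch of the map, shuffle, and reduce phases.  It deliberately skips the real Hadoop API (the Mapper and Reducer base classes, Writable types, and job configuration from Tom White’s book) and simulates the three phases in memory, so the class and method names below are illustrative only, not Hadoop’s.

```java
import java.util.*;

public class WordCountSketch {

    // Map phase: for each input line, emit a (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group emitted values by key, sorted, as the framework
    // does between the map and reduce tasks.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped counts for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = { "the quick brown fox", "the lazy dog" };
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : input) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = reduce(shuffle(emitted));
        System.out.println(counts); // prints {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

Once this in-memory version makes sense, the real program in the book is the same logic with the map and reduce methods moved into Hadoop’s Mapper and Reducer classes and the shuffle handled by the framework.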

I hope this helps you learn Hadoop and pass CCD-410.


Enterprise Information Management with SQL Server 2012

I recorded three videos that demonstrate Microsoft technologies working with Melissa Data to address what a recent survey calls the “greatest barrier to adopting analytics or BI products enterprise-wide.” All three follow the same customer data set, showing how to build a process that corrects and standardizes customer data, integrates it with ERP and MDM systems, and augments it with master data. Together, they show how to create a business process that enables data governance.

The first explains Microsoft’s three Enterprise Information Management (EIM) technologies and how they work together. Then I demo Data Quality Services (DQS) working with Melissa Data to correct and standardize address and other data.

Enterprise Information Management: Intro and DQS (1 of 3) from Mark Gschwind

The second takes it a step further, showing how DQS cleansing and matching can be performed using SQL Server Integration Services (SSIS). It also shows how to integrate the data flow with Master Data Services (MDS) within a typical workflow.

The third continues where the second left off, augmenting the customer data with master data using Master Data Services (MDS), Microsoft’s MDM tool. I discuss the architecture of the technology before demoing how to add master data with validations and workflow notifications.


The AMB Data Warehouse: A Case Study

In this presentation to my local PASS chapter, I show one of the leading analytic platforms in the Real Estate industry, AMB Property Corporation’s data warehouse.  I give the attendees a tour of the infrastructure, explaining the challenges faced and the ways I solved them.  I discuss how I achieved near-real-time data latency while integrating six source systems (Yardi, CTI, MRI, FAS, MDS and Dyna).  I demo the cubes that drove heavy user adoption, as well as an innovative custom application called MyData that gave users useful information on data quality, data latency, and the business rules applied to the source data.  The presentation is a good example of how one organization achieved success using BI.