How an online real estate company optimized its Hadoop clusters
Acquired by online real estate database company Zillow in 2014 for $3.5 billion, Trulia is one of the largest online residential real estate marketplaces around, with more than 55 million unique site visitors each month.
With so much data to store and process, the company adopted Hadoop in 2008, and it has since become the heart of Trulia's data infrastructure. Hadoop usage now spans an entire data engineering department, with several teams running jobs on multiple clusters. This allows Trulia to deliver personalized recommendations to customers based on sophisticated data science models that analyze more than a terabyte of data daily. That data is drawn from new listings, public records and user behavior, all of which is then cross-referenced with search criteria to alert customers quickly when new properties become available.
To make it all work, the company must complete dozens of workflows and hundreds of complex jobs on time each night. With many teams writing Hadoop jobs or using Hive or Spark concurrently, Trulia has to ensure reliability in its multi-tenant, multi-workload environment. Delayed or unpredictable jobs throw a wrench in the works and can seriously affect the bottom line. Until recently, that meant Trulia had to intentionally underutilize its Hadoop clusters to ensure jobs completed on time.
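At the platform level, keeping tabs on whether nightly jobs will finish on time typically starts with the data YARN already exposes. The sketch below polls the ResourceManager's REST API and flags long-running applications; the hostname, port and one-hour threshold are illustrative assumptions, not details from Trulia's environment.

```python
# Minimal sketch: poll YARN's ResourceManager REST API and flag applications
# that have been running longer than an SLA threshold. The host, port and
# one-hour cutoff are illustrative assumptions, not Trulia's actual setup.
import requests

RM_URL = "http://resourcemanager.example.com:8088/ws/v1/cluster/apps"
SLA_MS = 60 * 60 * 1000  # flag anything running longer than one hour

def overdue_apps():
    resp = requests.get(RM_URL, params={"states": "RUNNING"}, timeout=10)
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app") or []
    # Each entry reports its queue, user and elapsed runtime in milliseconds.
    return [(a["id"], a["name"], a["queue"], a["elapsedTime"])
            for a in apps if a["elapsedTime"] > SLA_MS]

if __name__ == "__main__":
    for app_id, name, queue, elapsed in overdue_apps():
        print(f"SLA risk: {name} ({app_id}) in queue {queue}, "
              f"{elapsed / 60000:.0f} minutes elapsed")
```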
"We process, on a daily basis, over a terabyte of new information: public records, listings, user activity," says Zane Williamson, senior director of DevOps at Trulia. "We process this data across multiple Hadoop clusters and use the information to send out email and push notifications to our users. That's the lead driver to get users back to the site and interacting. It's very important that it gets done in a daily fashion. Reliability and uptime for the workflows is essential."
"It's been a pretty painful process, I think," Williamson adds, noting that he joined Trulia relatively recently. "It's been a pretty big challenge to reliably run this data cycle, maintain uptime and troubleshoot issues. Troubleshooting issues could sometimes take days to dial in on."
To ease that pain and achieve more reliable Hadoop job completion, Trulia turned to Pepperdata, a specialist in adaptive performance management whose software guarantees quality of service on Hadoop.
Pepperdata provides a granular view of everything happening across a customer's Hadoop clusters, actively governing use of CPU, memory, disk I/O and network for every task, job, user and group. For Trulia, the pièce de résistance was Pepperdata's newest feature: the ability to turn any trackable metric into an alert defined at any level of granularity, from cluster to node, user, queue, job or task.
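To make that granularity concrete, here is a minimal sketch of what a scoped metric-to-alert rule might look like. The rule format, field names and sample values are hypothetical illustrations, not Pepperdata's actual API or configuration syntax.

```python
# Hypothetical sketch of a scoped metric alert; the rule format and names
# here are illustrative, not Pepperdata's actual API.
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str       # e.g. "disk_io_mb_per_s" or "container_memory_mb"
    scope: str        # "cluster", "node", "user", "queue", "job" or "task"
    threshold: float  # fire when the latest sample exceeds this value

def evaluate(rule: AlertRule, samples: dict[str, float]) -> list[str]:
    """samples maps each entity at rule.scope to its latest value of rule.metric."""
    return [entity for entity, value in samples.items() if value > rule.threshold]

# Example: alert when any single user's aggregate disk I/O exceeds 800 MB/s.
rule = AlertRule(metric="disk_io_mb_per_s", scope="user", threshold=800.0)
latest = {"etl_nightly": 920.0, "adhoc_analytics": 310.0}
print(evaluate(rule, latest))  # -> ['etl_nightly']
```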
"We're watching how every application on the cluster is actually using the hardware," says Sean Suchter, co-founder and CEO of Pepperdata. "If there is any contention between some high priority thing and some computationally expensive ad-hoc thing, we'll detect that and slow down or otherwise affect the low priority thing just enough to give a consistent, high quality of service to the high priority thing."
"The performance gains we get scale pretty well with the chaos of the cluster," he adds. "The more chaos you have, the more applications you run, the more different tenants you have, the better we can do. We're able to react in a second-by-second fashion and do a lot of optimization. The opportunity goes higher the more complex the environment is."
Trulia has used the alerting feature to create detailed notifications that proactively track performance metrics across its Hadoop environment. Between the dashboards and the alerting functions, Trulia is now able to identify problems much more easily and quickly. With that new visibility, the company has been able to optimize its Hadoop usage and maximize utilization.
"We rolled out Pepperdata last year," Williamson says. "It's been an amazing tool for us to diagnose problems. Within hours, rather than days, we could zoom in on what was going on and make changes."
He notes that Trulia now uses Pepperdata to manage five different Hadoop clusters, ranging from a dozen nodes to more than 40, with about 2 petabytes of data across them. The company also has a number of clusters on AWS that are not yet managed by Pepperdata because they're used for batch-driven EMR workloads that aren't persistent. He's working with the Pepperdata team to bring those clusters under Pepperdata management too.
"It's definitely on my roadmap," he says. "I feel like I'm running blind here."