This week I read upon Fast Data in the Era of Big Data: Twitter’s
Real-Time Related Query Suggestion Architecture. The paper tells the story behind architecture to support Twitter’s real-time related query suggestion, why the architecture had to be designed twice and what lessons can be learned from this exercise. It does not talk much about the algorithms, rather it talks about the different design decisions that lead to the current architecture — the focus is not on how things were done but why they were done a certain way.
Twitter has an interesting use case — search assistance — which boils down to things like a user searching for “Obama” and getting results for related queries like “White House” as well. Spelling correction is also a part of search assistance. The problem is quite well researched from volume perspective of data but in Twitter’s context, velocity mattered as much as volume. The results had to adapt to rapidly evolving global conversations in real-time — where real-time loosely translates to a target latency of 10 minutes. The real-time sense is important in Twitter’s context where “relevance” has a temporal aspect as well. For example, after the Nepal Earthquake in 2015, the query “Nepal” led to results related to “earthquake” as well. Then when the new constitution was passed in Nepal, the query “Nepal” led to results related to “constitution”. The time frame in which suggestions have maximum impact is very narrow. A suggestion made too early would seem irrelevant while a suggestion made too late would seem obvious and hence less impactful. These fast moving signals have to be mixed with slow moving ones for insightful results. Sometimes the query volume is very low in which case longer observation period is needed before suggestions can be made. Twitter noticed that 17% of top 1000 query terms churn over an hourly basis which means they are no longer in top 1000 after an hour. Similarly, around 13% of top 1000 query terms are churned out every day. This suggested that a fine-grained tracking of search terms was needed.
Twitter started with a basic idea: if two queries are seen in the same context, then they are related. Now they had a large open design space. For example, context can be defined by user’s search session or tweet or both. Measures like log likelihood, chi-square test etc can be used to quantify how often 2 queries appear together. To consider the temporal effect, counts are decayed time. Finally, Twitter has to combine these factors, and some more factors, together to come up with a ranking mechanism. This paper does not focus on what algorithms were chosen for these tasks, it focuses on how an end-to-end system was created.
Twitter has a powerful petabyte-scale Hadoop-based analytics platform. Both real-time and batch processes write data to the Hadoop Distributed File System (HDFS). These include bulk exports from databases, application logs, and many other sources. Contents are serialized using either Protocol Buffersor Thrift, and LZOcompressed. There is a work-flow manager, called Oink, which schedules recurring jobs and handles dataflow dependencies between jobs. For example, if job B requires data generated by job A, A will be scheduled first.
Twitter wanted to take advantage of this stack and the first version was deployed in form of a Pig script that aggregated user search sessions to compute term and cooccurrence statistics and ranked related queries on top of the existing stack. While the results were pretty good, the latency was too high and results were not available until several hours.
Twitter uses Scribe to aggregate streaming log data in an efficient manner. These logs are rich with user interaction and are used by the search assistant. A Scribe daemon is running on each production server where it collects and sends local log data (consisting of category and message) to a cluster of aggregators which are co-located with a staging Hadoop cluster. This cluster merges per-category streams from the server daemons and writes the results to HDFS of the staging cluster. These logs are then transformed and moved to the main Hadoop data warehouse in chunks of data for an hour. These log messages are put in per-category, per-hour directories and are bundled in a small number of large files. Only now can the search assistant start its computations. The hierarchical aggregation is required to “roll up” data into few, large files as HDFS is not good at handling large numbers of small files. As a result, there is a huge delay from when the logs are generated to when they are available for processing. Twitter estimated that they could bring down the latency to tens of minutes by re-engineering their stack though even that would be too high.
Hadoop is not meant for latency sensitive jobs. For example, a large job could take tens of seconds to just startup — irrespective of the amount of data crunched. Moreover, the Hadoop cluster was a shared resource across Twitter. Using a scheduler (in this case, FairScheduler) is not the ideal solution as the focus is on predictable end-to-end latency bound and not resource allocation. Lastly, the job completion time depending on stragglers. For some scenarios, a simple hash partitioning scheme created chunks of “work” with varying size. This lead to large varying running times for different map-reduce jobs. For scripts that chain together Hadoop jobs, the slowest task becomes the bottleneck. Just like with log imports, Twitter estimated the best case scenario for computing query suggestions to be of the order of ten minutes.
Starting with the Hadoop stack had many advantages like a working prototype was built quickly and ad hoc analysis could be easily done. This also helped them to understand the query churn and make some important observations about factors to use in search assistant. For example, Twitter discovered that only 2 sources of context — search sessions and tweets — were good enough for an initial implementation.But due to high latency, Twitter had to restrict this solution to the experimental stage itself.
Firehose is the streaming API that provides access to all tweets in real time and the frontend, called Blender, brokers all requests and provides a streaming API for queries — also called query hose. These two streams are used by EarlyBird, the inverted indexing engine, and search assistant engine. Now client logs are not needed as Blender has all search sessions. Twitter search assistance is an in-memory processing engine comprising of two decoupled components:
- Frontend Nodes — These are lightweight in-memory caches which periodically read fresh results from HDFS. They are implemented as a Thrift service, and can be scaled out to handle increased query load.
- Backend Nodes — These nodes perform the real computations. The backend processing engine is replicated but not sharded. Every five minutes, computed results are persisted to HDFS and every minute, the frontend caches poll a known HDFS location for updated results.
Request routing to the replicas is handled by a ServerSet, which provides client-side load-balanced access to a replicated service, coordinated by
ZooKeeper for automatic resource discovery and robust failover.
Each backend instance is a multi-threaded application that consisting of:
- Stats collector: Reads the firehose and query hose
- In-memory stores: Hold the most up-to-date statistics
- Rankers: Periodically execute one or more ranking algorithm by getting raw features from the in-memory stores.
There are three separate in-memory stores to keep track
of relevant statistics:
- Sessions store: Keeps track of (anonymized) user sessions observed in the query hose, and for each session, the history of the queries issued in a linked list. Sessions older than a threshold are discarded. Metadata is also tracked separately.
- Query statistics store: Retains up-to-date statistics, like session count, about individual queries. These also include a weighted count based on a custom scoring function. This function captures things like association is more between 2 consecutively typed queries vs 2 consecutively clicked hash-tags. These weights are periodically decayed to reflect decreasing importance over time. It also keeps additional metadata about the query like its language.
- Query cooccurrence statistics store: Holds data about
pairs of co-occurring queries. Weighting and decaying are applied like in the case of query statistics store.
Query Flow — As a user query flows through the query hose, query statistics are updated in the query statistics store, it is added to the sessions store and some old queries may be removed. For each previous query in the session, a query cooccurrence is formed with the new query and statistics in the query cooccurrence statistics store are also updated.
Tweet Flow — As a tweet flows through the firehose, its n-grams are checked to determine whether they are query-like or not. All matching n-grams are processed just like the query above except that the “session” is the tweet itself.
Decay/Prune cycles — Periodically, all weights are decayed and queries or co-occurrences with scores below predefined thresholds are removed
to control the overall memory footprint of the service. Even user sessions with no recent activity are pruned.
Ranking cycles — Rankers are triggered periodically to generates suggestions for each query based on the various accumulated statistics. Top results are then persisted to HDFS.
- Since there is no sharding, each instance of the backend processing engine must consume the entire firehose and query hose to keep up with the upcoming data.
- The memory footprint for retaining various statistics, without any pruning, is very large. But if the footprint is reduced, by say pruning, the quality and coverage of results may be affected. Another approach could be to store less session history and decay the weights more aggressively though it may again affect the quality of the results.
Twitter managed to solve the problem of fast moving big data but their solution is far from ideal. It works well but only for scenario it is fine-tuned for. What is needed is a unified data platform to process for big and fast moving data with varying latency requirements. Twitter’s current implementation is an in-memory engine which mostly uses tweets and search sessions to build the context. Rich parameters like clicks, impressions etc are left out for now to keep the latency under check. Twitter described it as “a patchwork of different processing paradigms”.
Though not-so-complete, it is still an important step in the direction of unifying big data and fast data processing systems. A lot of systems exists which solve the problem is pieces eg message queues like Kafka for moving data in real time and Facebook’s ptail for running Hadoop operations in real time but there is no end-to-end general data platform which can adapt itself to perform analytics on both short and long term data and combine their results as per latency bound in different contexts.