Every 10 years, the United States Census Bureau sets out on the formidable task of taking stock of the nation's population of roughly 330 million, generating countless data sets. Searching through that data then becomes a task nearly as arduous as collecting it.
But being able to make sense of the data collected is key. Census data is used to inform decisions on how funding is allocated to communities. It also determines improvements to public infrastructure, such as where to build new hospitals or schools.
Data scientists from IT consulting firm Booz Allen Hamilton built a prototype using Elasticsearch to make US census data more accessible and relevant to its users.
Massive and complicated data sets
In the 2010 census, the Bureau generated some 7.8 billion statistics about the 308 million people counted. Census data is also spread across different categories such as age, industry or geography. There are over 130 different data environments within the Bureau today, says Jesus Jackson, Chief Data Scientist at Booz Allen Hamilton.
With such complexity, the most important objective of the prototype was “making it simple for users to access the data that they wanted, not just all of the data but the data that is the most relevant to them,” says Jackson. This way, policymakers and businesses alike would be able to make decisions that benefit the community most.
Tackling dead ends
The original Census search engine did not warn users when the categories they had selected did not overlap, leaving them at what is known as a ‘dead end’, says Daewoo Chong, Data Scientist at Booz Allen Hamilton.
“What if you were trying to find something that was really complex that took you 15 minutes, and then you got to a dead end? That’s not a good use of anyone’s time,” says Jackson.
To tackle this issue, Jackson and Chong’s team tested different indexing prototypes. The eventual prototype brought average response times down to 100 milliseconds or less, says Chong.
Elastic technology also stops users in real time if there is no information relevant to their search, says Jackson. The Elasticsearch engine builds an index and drives the user interface, disabling categories with no relevant results within 25 to 50 milliseconds, Chong says.
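The dead-end check described above can be pictured with a short Python sketch. It is an illustration only, not the team's implementation: the sample records, the facet names and the helper functions are all hypothetical. The last function builds an Elasticsearch request body that asks for comparable counts with a terms aggregation, assuming each facet is mapped as a keyword field.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
RECORDS = [
    {"geography": "Virginia", "industry": "Healthcare", "age_group": "18-34"},
    {"geography": "Virginia", "industry": "Education", "age_group": "35-64"},
    {"geography": "Maryland", "industry": "Healthcare", "age_group": "65+"},
]

FACETS = ("geography", "industry", "age_group")


def facet_counts(records, selected):
    """Count the values left in each facet for records matching every selection."""
    matching = [r for r in records if all(r[f] == v for f, v in selected.items())]
    return {f: Counter(r[f] for r in matching) for f in FACETS}


def disabled_options(records, selected):
    """Return, per facet, the values that would lead to a dead end."""
    all_values = {f: {r[f] for r in records} for f in FACETS}
    counts = facet_counts(records, selected)
    return {f: sorted(all_values[f] - set(counts[f])) for f in FACETS}


def aggregation_request(selected):
    """Build an Elasticsearch request body that asks for the same counts."""
    return {
        "size": 0,  # only the bucket counts are needed, not the documents
        "query": {
            "bool": {"filter": [{"term": {f: v}} for f, v in selected.items()]}
        },
        "aggs": {f: {"terms": {"field": f}} for f in FACETS},
    }
```

With Virginia selected, `disabled_options(RECORDS, {"geography": "Virginia"})` reports that Maryland and the 65+ age group have no matching records, so a user interface could grey those options out before the user clicks them.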
Using the Elastic Stack
The data science team also uses Logstash to parse and index logs, says Chong. Logstash ingests data from disparate sources and converts it into a common format for easier storage and analysis. It also anonymises personally identifiable information and excludes sensitive fields completely.
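To make the anonymising step concrete, here is a minimal Python sketch of the kind of transformation involved. It is not Logstash configuration and not the team's pipeline: the field names, the `scrub_event` helper and the choice of a keyed SHA-256 hash are assumptions made for illustration. Strictly speaking, keyed hashing is pseudonymisation, since the same input always maps to the same digest.

```python
import hashlib
import hmac

# Hypothetical field names; the article does not list the real ones.
PII_FIELDS = {"user_id", "client_ip"}
SENSITIVE_FIELDS = {"password", "session_token"}


def scrub_event(event, key):
    """Return a copy of a log event with PII hashed and sensitive fields dropped."""
    cleaned = {}
    for name, value in event.items():
        if name in SENSITIVE_FIELDS:
            continue  # excluded completely, never stored
        if name in PII_FIELDS:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[name] = digest.hexdigest()  # stable pseudonym for the value
        else:
            cleaned[name] = value
    return cleaned
```

Because the digest is stable, analysts can still count distinct users or group searches by user without ever seeing the original identifier.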
Kibana, on the other hand, helps the team visualise trending topics, industries, geographies, and performance metrics, Chong says. With Kibana, data scientists can use interactive pie charts, histograms, or time series analysis to make sense of highly complex population data.
Lastly, the team also uses Elastic Security (formerly known as Shield) to protect the data across the Elastic Stack. The security technology authenticates users and then grants them access based on their role, Chong says. This way, users only have access to data that is appropriate for them.
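Role-based access of this kind can be sketched in a few lines of Python. This shows the idea only and does not use the security product's API: the role names, the index patterns and the `can_read` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical roles and index patterns, for illustration only.
ROLE_INDICES = {
    "analyst": ["census-public-*"],
    "admin": ["census-public-*", "search-logs-*"],
}


def can_read(roles, index_name):
    """Return True if any of the user's roles grants access to the index."""
    return any(
        fnmatchcase(index_name, pattern)
        for role in roles
        for pattern in ROLE_INDICES.get(role, [])
    )
```

An authenticated user holding only the analyst role could read `census-public-2010` but not `search-logs-2016`, while an unknown role grants nothing.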
Ensuring search relevancy
Data scientists also need to ensure that the system can tell whether a query such as ‘VA’ means Virginia or Veterans Affairs, says Jackson. Search relevancy has been a “really tough problem”.
The team is still exploring options to address this issue. It is testing out rank boosting, where the system uses machine learning to determine which results would be most relevant to a user’s search and ranks those first, Chong says.
For example, when someone searches for geographies, the system boosts results related to states, says Chong. “We want to make sure that states bubble up and they should take precedence over cities and counties.”
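A simplified Python sketch shows how such boosting changes the order of results. It uses fixed, hand-picked weights rather than the machine learning the team is testing, and the `geo_level` field, the weights and the sample hits are all hypothetical. The second function builds a comparable Elasticsearch `function_score` query body, assuming the documents carry `name` and `geo_level` fields.

```python
# Hypothetical weights: states outrank cities and counties.
LEVEL_WEIGHTS = {"state": 3.0, "county": 1.0, "city": 1.0}


def rerank(hits):
    """Sort hits by base relevance score multiplied by the weight of their level."""
    def boosted(hit):
        return hit["score"] * LEVEL_WEIGHTS.get(hit["geo_level"], 1.0)
    return sorted(hits, key=boosted, reverse=True)


def boosted_query(text):
    """Build a function_score query body that applies the same weights."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"name": text}},
                "functions": [
                    {"filter": {"term": {"geo_level": level}}, "weight": weight}
                    for level, weight in LEVEL_WEIGHTS.items()
                ],
                "score_mode": "max",       # use the largest matching weight
                "boost_mode": "multiply",  # multiply it into the text score
            }
        }
    }


# Illustrative hits for a search on "Washington"; scores are made up.
HITS = [
    {"name": "Washington County", "geo_level": "county", "score": 2.0},
    {"name": "Washington", "geo_level": "state", "score": 1.5},
    {"name": "Washington", "geo_level": "city", "score": 1.8},
]
```

Here `rerank(HITS)` moves the state to the top even though its base score is the lowest of the three, which is the "bubble up" behaviour Chong describes.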
The Elasticsearch prototype has drawn positive feedback so far, says Jackson. Users are able to get the data they want faster, and the community is happier, he adds.
Census data plays an important role in determining future US policies. With an effective search and filtering system in place, the Bureau is better equipped to make informed decisions for the welfare of Americans.