At last week’s DeveloperWeek Enterprise 2022 conference, Victor Shilo, CTO of EastBanc Technologies, gave a keynote that aimed to clear up some of the confusion that can come with trying to make soup out of huge datasets.
“In many cases, big data is a big data swamp,” he said in his presentation, “The Big Data Delusion - How to Identify the Right Data to Power AI Systems.” The problem, he said, comes from traditional analytical systems and approaches being applied to outsized amounts of data.
For example, an unnamed fintech company that was an EastBanc customer had huge sets of customer, transactional, and behavioral data that were cleaned by one team, then transferred to another team that enhanced the data. While such an approach may be sufficient, Shilo said it can also slow things down.
The fintech company, he said, wanted a way to use its data to predict which of its customers would be receptive to contact. The trouble was it seemed to be a herculean task under traditional processes. “Their current team looked at the task and estimated the effort would take four, five months to complete,” Shilo said. “That’s a lot of time.”
EastBanc sought to tackle the problem within six weeks, he said. Turning huge data into assets that Shilo called “minimal viable predictions” required working backward from the operational needs for that data. “You want to focus on the business outcome,” he said. “You really want to work with the team facing the customer or who’s making the decisions, like sales, and ask them, ‘How can we help?’”
The problem the fintech company had was that the calls it had been making to potential customers were unproductive, Shilo said. “Either the customer didn’t pick up the phone, or they objected to doing anything.” He called it a waste of time and money in the long run.
EastBanc’s approach was not to look at all of the data, but to cherry-pick only the necessary transactional and behavioral data. “All others were like white noise in this particular case,” Shilo said. After the minimum viable prediction was identified from the data through that approach, the next step was to make it work.
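Shilo didn’t show code, but a minimal sketch of that narrowing step, assuming a hypothetical CSV export with illustrative column names, might look like this in Python:

```python
import pandas as pd

# Hypothetical wide export mixing customer-profile, transactional,
# and behavioral columns; all names below are illustrative only.
raw = pd.read_csv("customer_export.csv")

# Keep only the transactional and behavioral signals; everything else
# is treated as white noise for this particular prediction.
FEATURES = [
    "avg_monthly_transactions",      # transactional
    "days_since_last_transaction",   # transactional
    "logins_last_30d",               # behavioral
    "support_calls_last_90d",        # behavioral
]
LABEL = "answered_previous_call"  # stand-in outcome for "receptive to contact"

dataset = raw[FEATURES + [LABEL]].dropna()
print(dataset.shape)
```

The point is less the specific columns than the discipline of dropping everything that doesn’t feed the one prediction being built.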
The way data traditionally moves from one stage to another, Shilo said, can mean each team holds responsibility for certain tasks, which slows the process. Rather than continue such a horizontal approach, he recommended building each team vertically. That allowed for more flexibility and granted teams the leeway to accomplish tasks as they needed, Shilo said. “We wanted to get answers as fast as possible.”
This process helped when EastBanc was called upon to assist Houston Metro. The task was to improve ridership on the transit system’s buses and included access to GPS data from all the buses.
Shilo said EastBanc started off with a focus on predicting where buses might be in the next five or 20 minutes by using GPS coordinates. The effort began with just one bus to prove the efficacy of the approach.
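The talk did not describe the model itself, but a hedged sketch of the idea, pairing each GPS ping with the same bus’s position five minutes later and fitting an off-the-shelf regressor, could look like the following (the file names and feature set are assumptions, not EastBanc’s actual pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Each row is one GPS ping from a single bus; the target is that bus's
# position five minutes later. Columns (assumed): lat, lon, speed,
# heading, minutes_since_midnight.
X = np.load("bus_pings.npy")          # shape (n_samples, 5), hypothetical file
y = np.load("bus_pos_plus_5min.npy")  # shape (n_samples, 2): lat, lon 5 min ahead

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Predict where the bus will be five minutes after its latest ping.
latest_ping = X[-1].reshape(1, -1)
predicted_lat, predicted_lon = model.predict(latest_ping)[0]
print(predicted_lat, predicted_lon)
```

Starting with a single bus, as Shilo described, keeps the training data small enough to prove or disprove the approach quickly before scaling it across the fleet.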
Working with GPS data, however, meant dealing with fluctuations in coordinates as the bus moved through the city, he said. Shilo said EastBanc applied the Snap to Roads API to make the data cleaner and easier to visualize, but came to realize this may have confused their algorithms and model. “Eventually, we decided to remove Snap to Roads and instead train the model using raw data,” he said. “The quality of the predictions became way higher.” The processing time also decreased when they used raw data, Shilo said.
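For readers unfamiliar with that step, a rough sketch of snapping raw pings with Google’s Roads API, the cleaning pass EastBanc ultimately dropped, might look like this (the API key and coordinates are placeholders):

```python
import requests

API_KEY = "YOUR_GOOGLE_MAPS_API_KEY"  # placeholder

def snap_to_roads(points, api_key=API_KEY):
    """Snap up to 100 raw GPS pings to the road network via the Roads API."""
    path = "|".join(f"{lat},{lng}" for lat, lng in points)
    resp = requests.get(
        "https://roads.googleapis.com/v1/snapToRoads",
        params={"path": path, "interpolate": "true", "key": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        (p["location"]["latitude"], p["location"]["longitude"])
        for p in resp.json().get("snappedPoints", [])
    ]

# Example: a few raw pings from one bus (coordinates are made up).
raw_pings = [(29.7604, -95.3698), (29.7611, -95.3702), (29.7620, -95.3710)]
cleaned = snap_to_roads(raw_pings)
```

Removing this call and feeding the raw coordinates straight into training is the change Shilo credited with both better predictions and faster processing.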
Ultimately, EastBanc found that, at least for its purposes, focusing on just the raw data it determined to be relevant to operational needs was more efficient than getting bogged down in impenetrable mountains of data. “The next step is always to move further with your findings, to move closer to the end users, to the business end, to make more complex predictions along the way,” Shilo said.