Data harvesting superapp says LLM tamed its unruly data

Asia's answer to Uber, Singaporean superapp Grab, has admitted it gathered more data than it could easily analyze – until a large language model and generative AI turned things around.

Grab offers ride-share services, food delivery, and even some financial services. In 2021 the biz revealed it collects 40TB of data every day. Execs have bragged that its fintech arm knows enough about its drivers that it can rate their suitability for a loan before they even bother applying.

In a Thursday blog post, the developer admitted it has sometimes struggled to make sense of all that data.

"Companies are drowning in a sea of information, struggling to navigate through countless datasets to uncover valuable insights," the org wrote, before admitting it was no exception. "At Grab, we faced a similar challenge. With over 200,000 tables in our data lake, along with numerous Kafka streams, production databases, and ML features, locating the most suitable dataset for our Grabber's use cases promptly has historically been a significant hurdle."

Prior to mid-2024, Grab used an in-house tool called Hubble – built on top of the popular open source platform DataHub and utilizing open source search and analytics engine Elasticsearch – to sort through its giant data pile.

"While it excelled at providing metadata for known datasets, it struggled with true data discovery due to its reliance on Elasticsearch, which performs well for keyword searches but cannot accept and use user-provided context (ie it can't perform semantic search, at least in its vanilla form)," Grab's engineering blog explains.

Eighteen percent of searches were abandoned by staff users. Grab guessed the searches were abandoned because the Elasticsearch parameters provided by DataHub were not yielding helpful results.

But Elasticsearch wasn't the only problem to blame for laborious data discovery – oodles of documentation was missing. Only 20 percent of the most frequently queried tables had any descriptions.

The developer's data analysts and engineers were forced to rely on internal tribal knowledge in order to find the datasets they needed. Most reported it took days to find the right dataset.

Grab sought to rectify this through three initiatives: enhancing Elasticsearch; improving documentation; and creating an LLM-powered chatbot to catalog its datasets.

The Singaporean superapp enhanced Elasticsearch by boosting relevant datasets, hiding irrelevant ones, and simplifying the user interface.
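As a rough illustration of what "boosting" and "hiding" can mean in a search ranking, here is a minimal Python sketch. It is not Grab's implementation and does not use Elasticsearch: the `Dataset` fields, the `usage_weight` parameter, the `deprecated` flag, and the table names are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    description: str
    monthly_queries: int      # hypothetical usage signal used for boosting
    deprecated: bool = False  # hypothetical marker for "irrelevant" datasets


def search(datasets, query, usage_weight=0.01):
    """Rank datasets by keyword overlap plus a usage boost; hide deprecated ones."""
    terms = set(query.lower().split())
    scored = []
    for ds in datasets:
        if ds.deprecated:
            continue  # hidden datasets never reach the results
        text = f"{ds.name} {ds.description}".lower().replace("_", " ")
        keyword_score = len(terms & set(text.split()))
        if keyword_score == 0:
            continue  # no keyword match at all
        # Boost: frequently queried datasets rank above rarely used ones.
        scored.append((keyword_score + usage_weight * ds.monthly_queries, ds.name))
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [name for _, name in ranked]


# Made-up catalog entries for illustration only.
catalog = [
    Dataset("ride_bookings_raw", "raw ride bookings events", 5),
    Dataset("ride_bookings_daily", "daily ride bookings summary", 400),
    Dataset("ride_bookings_old", "legacy ride bookings table", 900, deprecated=True),
]
results = search(catalog, "ride bookings")
print(results)
```

In this toy catalog the heavily used daily table ranks ahead of the raw one, and the deprecated table is dropped even though it matches the keywords and has the most queries.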

Eventually it brought the share of abandoned searches down to just six percent. It also built a documentation generation engine that used GPT-4 to produce labels based on table schemas and sample data. That effort increased the share of datasets with thorough descriptions from 20 to 70 percent.
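To make "labels based on table schemas and sample data" concrete, the sketch below assembles the kind of prompt such an engine might send to a model. It is an assumption-laden illustration, not Grab's code: the function name, prompt wording, table name, and sample rows are hypothetical, and the model call itself is deliberately left out.

```python
def build_description_prompt(table_name, schema, sample_rows, max_rows=3):
    """Assemble a text prompt asking a model to describe a table.

    schema is a dict of column name -> type; sample_rows is a list of dicts.
    Only the first max_rows samples are included to keep the prompt short.
    """
    columns = "\n".join(f"- {col}: {dtype}" for col, dtype in schema.items())
    samples = "\n".join(str(row) for row in sample_rows[:max_rows])
    return (
        f"Write a one-paragraph description of the table '{table_name}'.\n"
        f"Columns:\n{columns}\n"
        f"Sample rows:\n{samples}\n"
    )


# Made-up table and rows for illustration only.
prompt = build_description_prompt(
    "ride_bookings_daily",
    {"booking_date": "date", "city": "string", "rides": "int"},
    [{"booking_date": "2024-01-01", "city": "Singapore", "rides": 1200}] * 5,
)
print(prompt)
```

The resulting string would then be passed to whichever model generates the description; capping the number of sample rows is one simple way to keep prompts small when tables are wide.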

And then it built the pièce de résistance: its own LLM-powered chatbot. Called HubbleIQ, the chatbot uses an off-the-shelf search tool called Glean to draw on its newly expanded descriptions and recommend datasets to its employees.

"We aimed to reduce the time taken for data discovery from multiple days to mere seconds, eliminating the need for anyone to ask their colleagues data discovery questions ever again," the superapp techies blogged.

The upgrades are a work in progress. Grab intends to improve the accuracy of its documentation and incorporate more dataset types into HubbleIQ, in addition to other initiatives.

Grab's hyperlocalization strategy, which is enabled by its massive quantities of data, has given it the edge to know the ins and outs of Asia's people and roads – and frankly kept the business alive.

While its 2021 IPO results were unquestionably disappointing, it did run Uber out of town.

In Grab's Q2 2024 earnings, it reported a record high of 41 million monthly transacting users, narrowing losses and 17 percent revenue growth.

"Features like mapping, hyper batching and just-in-time allocation, they're all unique to Grab and none of our competitors have that and we believe that makes us consistently more reliable as well as more affordable," explained CEO Anthony Tan.

Consistently reliable, affordable … and drowning in datasets. ®

Source: theregister.com
