Who doesn’t love crystals? They’re solid, regular, predictable. Some crystals are sweet (sugar), some are strong (diamond), and as fans of Breaking Bad know, some are very profitable. Some people even think crystals have mystical powers.
So it’s no surprise that we like our data in crystalline form: regular, highly structured SQL tables, SAS datasets, rows and columns. These structures are strong, transparent, and organized.
Ask your CIO ‘How’s our data?’ and you’ll likely hear ‘Fine. It’s accurate, accessible, safe, reliable.’ And for most companies, it is all those things – but it’s not complete. If you don’t believe that – if you think your enterprise data captures all the nuances of your company – ask if you can get at the customer sentiment that’s driving the latest satisfaction survey results. If you’re a risk manager, try to access the video monitoring feed that shows whether workers are following safety procedures. Or if you’re an underwriter, ask to see image data from claim files showing the difference between hailstone damage to metal roofs versus asphalt shingles. (Cue the crickets chirping.)
That’s because most companies use only traditional structured data in production applications: unstructured data such as images, videos, audio recordings, and scanned documents isn’t viewed as ‘real’ data. In fact, even firms that purport to assemble “all” of a company’s data into a coherent, easily accessible data fabric typically focus only on structured data (apparently there are different interpretations of the phrase ‘all of your data’). But there’s a tremendous amount of value trapped inside those amorphous file types – it’s just a little harder to get at.
Embrace The Unstructured
Nowhere is this more true than with today’s artificial intelligence, machine learning, and business intelligence platforms. Far too often, companies that have the vision and foresight to bring one of these advanced technologies into their business process suddenly suffer from acute nearsightedness when it comes to their data. They make only structured data available to the new tools – and then question why the initiative doesn’t deliver against expectations.
The answer is simple: these powerful tools are only as good as the data they’re given. Of course, every technology professional is keenly aware of ‘garbage in, garbage out’ – and those same professionals will vigorously defend the data they’re feeding to the machine learning model or the BI tool. And they’re right – the data is fine (where have we heard that before?). But what they don’t realize is that the data is incomplete. They’re not exposing the new tool to a veritable treasure trove of unstructured data.
Which raises the question: ‘Why not?’
The answers reflect conventional wisdom. Fear of the unknown. Time constraints. Budget constraints. And the ever-popular ‘no one is doing that.’
The real reason is much simpler: unstructured data wasn’t included because no one considered it important. Technologists have been constrained to structured data for so long that nothing else enters their field of view. And their constituents – business managers and users – have gone without access to unstructured data for so long that they don’t even request it.
The net effect is that clients and platform providers are unknowingly leaving a great deal of value on the table by omitting unstructured data from their initiatives. Despite all the hype around AI, machine learning, large language models, and other technological magic, no one is talking about the basic necessity of including all of a company’s data in these projects (and ‘all’ here really means all of it – not just the crystalline data that’s readily available in data warehouses, data lakes, lakehouses, and other architectural wonders).
So How Do You Do It?
The first step is to acknowledge that you shouldn’t do it yourself: turning unstructured data into structured data – at scale – requires not only experienced data labelers, but a complement of data engineers, QA staff, process engineers, and project managers. Companies that attempt this work in-house typically press their data scientists and data engineers into service; not only is that a very expensive way to get the work done, it also creates opportunity costs from all the work those individuals could have been doing.
The key is to establish a relationship with a trusted resource to label your images and videos, annotate text, calibrate computer vision models, correct ‘automated’ audio transcriptions, and provide other services like curating the input data and formatting the output to your specifications. All these functions demand both attention to detail and high-volume throughput, and the project plan for your AI/ML/BI initiative probably didn’t include the fairly large staff addition that they require.
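Whoever does the labeling, the deliverable you want back is crystalline: rows and columns your models and BI tools can consume directly. As a purely illustrative sketch (the file layout, field names, and label values below are hypothetical assumptions, not a prescription for any particular vendor or tool), flattening per-image annotation output into a structured table might look roughly like this:

```python
import csv
import json
from pathlib import Path

# Illustrative sketch only: flatten per-image annotation files (one JSON per
# labeled image, as delivered by a labeling team) into a single structured CSV.
# The directory name, field names, and label values are assumptions.
ANNOTATION_DIR = Path("annotations")
OUTPUT_CSV = Path("roof_damage_labels.csv")

rows = []
for path in sorted(ANNOTATION_DIR.glob("*.json")):
    record = json.loads(path.read_text())
    # Each bounding box drawn by a labeler becomes one row in the output table.
    for box in record.get("boxes", []):
        rows.append({
            "image_id": record["image_id"],
            "roof_material": record.get("roof_material", "unknown"),  # e.g. metal, asphalt
            "damage_type": box["label"],                              # e.g. hail_dent
            "x_min": box["x_min"],
            "y_min": box["y_min"],
            "x_max": box["x_max"],
            "y_max": box["y_max"],
        })

if rows:
    with OUTPUT_CSV.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

print(f"Wrote {len(rows)} structured rows to {OUTPUT_CSV}")
```

The point isn’t the specific code – it’s that the unstructured sources end up represented in the same tabular form the rest of your stack already understands.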
Unless you’re absolutely convinced that your unstructured data represents very little value, this is not the assignment for a low-cost provider. ‘Crowd-sourced’ labeling and providers that trade solely on price will deliver a level of data quality that’s unlikely to yield much value – and you’ll continue to see lower-than-expected returns on your advanced technology investment.
Start leveraging the power of unstructured data – and break free of the crystal myth.