We believe that a data cloud should allow people to work with all kinds of data, no matter its storage format or location. To do this, we’re adding several exciting new capabilities to Google’s Data Cloud.
First, we’re adding support for unstructured data in BigQuery, which significantly expands the range of data people can work with. Until now, data teams have mostly worked with structured data, using BigQuery to analyze data from operational databases and SaaS applications like Adobe, SAP, ServiceNow, and Workday, as well as semi-structured data such as JSON log files.
But this represents a small portion of an organization’s information. Unstructured data, such as video from television archives, audio from call centers and radio, and documents in many formats, may account for up to 90 percent of all data today. Beginning now, data teams can manage, secure, and analyze structured and unstructured data in BigQuery, with easy access to many of Google Cloud’s capabilities in ML, speech recognition, computer vision, translation, and text processing, using BigQuery’s familiar SQL interface.
Second, we’re adding support for major data formats in use today. Our storage engine, BigLake, adds support for Apache Iceberg and Linux Foundation Delta Lake, and support for Apache Hudi will be added soon. By supporting these widely adopted data formats, we can help organizations gain the full value from their data faster.
"Google Cloud's support for Delta is a testament to the demand for an open, multicloud lakehouse that gives customers the flexibility to leverage all of their data, regardless of where it resides,” said David Meyer, senior vice president of products at Databricks. "This partnership further exemplifies our joint commitment to open data sharing and the advancement of open standards like Delta Lake that make data more accessible, portable, and collaborative across teams and organizations."
Third, we’re announcing a new integrated experience in BigQuery for Apache Spark, a leading open-source analytics engine for large-scale data processing. This new Spark integration, launching in preview today, allows data practitioners to create procedures in BigQuery, using Apache Spark, that integrate with their SQL pipelines. Organizations like Walmart use Google Cloud to improve Spark processing times by 23% and have reduced time to close financial books from five days to three.
In addition, we’ve launched Datastream for BigQuery, which will help organizations more effectively replicate data in real time from sources including AlloyDB, PostgreSQL, MySQL, and third-party databases like Oracle, directly into BigQuery. By helping to accelerate the ability to bring data from an array of sources into BigQuery, we enable you to get more insights from your data in real time. To learn more about these announcements, read our dedicated post about key innovations with Google Databases.
Finally, a data cloud should enable organizations to manage, secure, and observe their data, which helps ensure their data is high quality and supports strong, flexible data management and governance capabilities. To address data management, we’re announcing updates to Dataplex that will automate common processes associated with data quality. For instance, users will now be able to easily understand data lineage (where data originates and how it has transformed and moved over time), which can reduce the need for manual, time-consuming processes.
The ability to let our customers work with all kinds of data, in the formats they choose, is the hallmark of an open data cloud. We’re committed to delivering the support and integrations that customers need to remove limits from their data and avoid data lock-in across clouds.