
Analytics on Big Data

Christopher Wagner • August 11, 2019

Why you need to use Power BI Aggregations

Over the last 20 years, data sizes have exploded, and technology has continually evolved to deal with these large data sets. Databases (SQL Server, Oracle, Sybase, DB2) holding hundreds of gigabytes of data were augmented by Data Warehouses (Netezza, Teradata, SAP HANA, Redshift, Snowflake, SQL DW) holding hundreds of terabytes. These Data Warehouses then added Data Lakes (ADLS Gen2, S3) with petabytes of data to the ecosystem.


Power BI reports and dashboards are designed to handle data of any size. You just have to know when and where to apply certain techniques to make them work. To do that, you need to understand how Power BI works.


Power BI Engine Basics


Power BI Import Mode

Power BI was designed to run off of in-memory cached data. The in-memory engine can execute tens of thousands of queries across hundreds of millions of records with sub-second response times. This allows report users to truly interact with the data in real time. Because the data sits in the cache, the engine can move through it insanely fast.


Import Mode PROS:

  • Super Fast Interactions


Import Mode CONS:

  • Data must be cached, so model size is constrained by the size of the cache


Power BI Direct Query Mode

Direct Query addresses the size issue by not caching the data at all. This enables Power BI to report on data sets that exceed what can be cached.


Direct Query Mode PROS:

  • Size can exceed the size of the cache

Direct Query Mode CONS:

  • Performance is poor because the source system cannot keep up with the tens of thousands of queries that Power BI will fire at it.

Even high-performing platforms (Redshift, Snowflake, SQL DW, SAP HANA, Netezza, and Teradata) can really struggle under the number of queries Power BI generates. Poorly written queries and poorly designed data sets can crash Netezza, make Redshift go non-responsive, and fail to return data from SAP HANA.



The Challenge

Well, shoot. We are often asked to deal with data sets that are much larger than what we can cram into Power BI. At the same time, NO ONE wants a slow-performing report.


So how the heck are you going to deal with these massive amounts of data AND make sure your reports still perform?



Power BI Aggregations the C-C-C-COMBO BREAKER!!!

The best solution for large data sets is to leverage Power BI's new Aggregations feature with Dual storage mode, where some data is imported and detailed data remains in Direct Query. You import aggregated data at a grain that allows interaction within the dashboard, while queries for detailed data go directly to the source.


Example:

Imported Aggregate Data :

SalesPersonID, Date, ClientID, ProductType, SUM(Amt), SUM(QTY)


Direct Query Detail Data:

SalesPersonID, Date, ClientID, OrderNumber, ProductType, ProductID, Product Description, Amt, QTY


This example would allow you to build interactive reports off of all of the Sales Person, Client, Product Type, and Date attributes.
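To make the grain difference concrete, here is a minimal sketch (using pandas, with made-up sample rows; the column names follow the example above) of how the detail-grain data rolls up into the aggregate table that would be imported:

```python
import pandas as pd

# Hypothetical detail-grain rows, one per order line (sample data for illustration).
detail = pd.DataFrame({
    "SalesPersonID": [1, 1, 2, 2],
    "Date": ["2019-07-01", "2019-07-01", "2019-07-02", "2019-07-02"],
    "ClientID": [10, 10, 20, 20],
    "OrderNumber": ["A1", "A2", "B1", "B2"],
    "ProductType": ["Widget", "Widget", "Gadget", "Gadget"],
    "ProductID": [100, 101, 200, 201],
    "Amt": [50.0, 25.0, 80.0, 20.0],
    "QTY": [5, 2, 8, 1],
})

# Roll up to the aggregate grain: one row per
# SalesPersonID / Date / ClientID / ProductType, with SUM(Amt) and SUM(QTY).
agg = (
    detail
    .groupby(["SalesPersonID", "Date", "ClientID", "ProductType"], as_index=False)
    .agg(SumAmt=("Amt", "sum"), SumQty=("QTY", "sum"))
)
print(agg)
```

The aggregate table has far fewer rows than the detail table (here 2 instead of 4; at real-world scale the reduction is orders of magnitude), which is what makes it small enough to cache in memory.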


For details, you then drill into subsets of the data on another tab of the report.


This really changes the way we look at building out our reports and dashboards. No longer do we have to do ALL IMPORT or ALL DIRECT QUERY.
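Conceptually, the routing works like this: a query can be answered from the cached aggregation table only when every column it references exists at the aggregate grain; otherwise it falls through to Direct Query. The sketch below is a simplified, hypothetical model of that decision (the real Power BI engine does this internally based on the aggregation mappings you configure):

```python
# Columns available at the aggregate grain (matching the example tables above).
AGG_GRAIN = {"SalesPersonID", "Date", "ClientID", "ProductType", "SumAmt", "SumQty"}

def pick_table(query_columns):
    """Decide which table a query hits: the cached aggregate or Direct Query detail.

    A query is satisfiable from the aggregate only if every column it
    references is part of the aggregate grain.
    """
    if set(query_columns) <= AGG_GRAIN:
        return "import-aggregate"    # fast, answered from the in-memory cache
    return "direct-query-detail"     # falls through to the source system

# A dashboard visual grouped by ProductType can be answered from the cache...
assert pick_table({"ProductType", "SumAmt"}) == "import-aggregate"
# ...while a drill-down to OrderNumber must go to the source.
assert pick_table({"OrderNumber", "Amt"}) == "direct-query-detail"
```

The payoff is that the vast majority of dashboard interactions hit the small cached table, and only genuine detail drill-downs pay the Direct Query cost.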


More Information

Josh Caplan (Power BI Principal Group Program Manager) announced the general availability of these features on July 18, 2019.

https://powerbi.microsoft.com/en-us/blog/aggregations-for-petabyte-scale-bi-is-generally-available/


Technical Details:

https://docs.microsoft.com/en-us/power-bi/desktop-aggregations



CHRIS WAGNER, MBA MVP

Analytics Architect, Mentor, Leader, and Visionary

Chris has been working in the Data and Analytics space for nearly 20 years. Chris has dedicated his professional career to making data and information accessible to the masses. A significant component in making data available is continually learning new things and teaching others from these experiences. To help people keep up with this ever-changing landscape, Chris frequently posts on LinkedIn and to this blog.