Performance Tips For Large Datasets - Knowledge Base
Abstract
The purpose of this brief blog is to provide some guidance on performance, especially in light of TIBCO Spotfire (aka TS) version 3.0. Note: The content of this posting will be modified and will evolve over time to adapt to newer Spotfire versions.
About performance
A high-performing system is no stronger than its weakest link. You should optimize the performance of every link in the chain: Hardware, Operating System, Database, TS Server, TS Professional, TS Web Player.
Preface
The purpose is to discuss how to improve total systemic performance. We're not going to cover tips and tricks already explained elsewhere, for example:
spotfirecommunity.tibco.com/community/blogs/stn/archive/2009/10/26/Performance-tips-for-large-datasets.aspx 1/8
3/11/13
Can I load only a fraction of the data? Specifically, can I load data based on criteria or conditions?
Yes! You can use a Hard Filter in your Information Link. You can create a Prompted Information Link to prompt users for specific values, intervals, etc., and only retrieve data relevant to that user and moment in time. You can use Parameterized Information Links to retrieve only data that matches one or more conditions. You can use Personalized Information Links to retrieve data only pertinent to a specific user or group.
OK, we're done with what we won't be covering, so let's get started.
Keep it clean
It goes without saying, but you should disable unnecessary software, services, daemons, etc. You'll want every bit of performance dedicated to the core analytic duties.
Microsoft OS
Microsoft file systems fragment badly (as does the Registry), which in turn affects dB performance (scattered files) and applications needing I/O access.
Use, e.g., the free MyDefrag for the file system.
Google for free Registry defrag tools.
Windows Services take up precious resources. Turn off all services that are strictly unnecessary for running your apps. Google this topic for your particular OS, client or server.
Linux systems
File systems don't fragment as badly. Ext4 file systems have excellent performance. XFS and JFS are good choices too.
Solaris
ZFS (perhaps the best file system ever) will give you top performance
Database performance
How to see the time required by the dB to actually serve data
You can do that in a number of ways. For instance: while TS Pro loads data, expand the progress dialog and uncheck the check box at the bottom, so the dialog stays up even after the data has been loaded. That lets you study what is happening, for systemic performance tweaking purposes. In the logged output in that dialog, look for three lines, one containing "Reading data from data source...", another containing "Reading data", and another containing "Creating columns", all prefixed by a timestamp, as in the example below (all other log lines removed):
[...]
17:51:35 Reading data from data source...
17:51:36 Reading data.
[...]
17:54:36 Creating columns.
[...]
17:54:51 Done
"Reading data from data source" means that Spotfire has asked the dB to start processing a query. If the query is very complex and the dB hasn't been tuned, the dB may take quite some time to start producing records. So, this tells you the time the dB requires to process the SQL query.
"Reading data" means that data is flowing in from the dB and, if needed, being stored in a temporary local file cache for later use. So, this tells you the speed at which data flows from the dB machine, over the network, into the temporary cache being built.
"Creating columns" means that the internal Spotfire in-memory data engine and filter engine are being created, column by column. So, this tells you the speed at which the destination machine is able to process data, once the dB has already done all its work.
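The three timestamps in the example log can be turned into per-phase durations with a few lines of Python. This is only a sketch: it assumes the lines were copied by hand from the progress dialog, that each relevant line starts with an HH:MM:SS timestamp as in the example, and the function name phase_durations is made up for illustration.

```python
from datetime import datetime, timedelta

# Log lines copied from the example in this article (other lines removed).
LOG = """\
[...]
17:51:35 Reading data from data source...
17:51:36 Reading data.
[...]
17:54:36 Creating columns.
[...]
17:54:51 Done
"""


def phase_durations(log_text):
    """Map each phase message to the seconds elapsed until the next timestamped line."""
    entries = []
    for line in log_text.splitlines():
        stamp, _, message = line.strip().partition(" ")
        try:
            when = datetime.strptime(stamp, "%H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with HH:MM:SS
        entries.append((when, message))

    durations = {}
    for (start, message), (end, _) in zip(entries, entries[1:]):
        delta = end - start
        if delta < timedelta(0):
            delta += timedelta(days=1)  # assumption: the load crossed midnight
        durations[message] = int(delta.total_seconds())
    return durations


if __name__ == "__main__":
    for phase, seconds in phase_durations(LOG).items():
        print(f"{seconds:>5} s  {phase}")
```

For the example log, most of the time falls in the "Reading data" phase, which points at the dB-to-client data transfer rather than at query processing or column creation.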
Basic dB recommendations
Use normalized tables (star or snowflake schemas).
Use minimal types where possible, e.g., store a BIT instead of the strings true and false, and create a lookup table.
For all columns involved in JOINs, or columns later used for filtering in the Spotfire IM, create PKs (Primary Keys), FKs (Foreign Keys) and INDEXes. Also create multi-column indexes for better joins and compound queries:
CREATE INDEX ix_1 ON table (col1, col2, ..., coln)
Run/create statistics for the relevant tables only. Stats for dB system tables can actually worsen dB performance.
Shrink/defrag the tables involved in the analytic process. This is generally needed if you got a production dB export.
Convert subqueries to JOINs for better performance.
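As a rough illustration of two of the recommendations above (a multi-column index, and rewriting a subquery as a JOIN), here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in chosen so the example runs anywhere; the tables, columns and data are invented, and statistics and defrag commands are dB-specific and therefore not shown. The example checks that both query forms return the same rows and that the planner picks up the index. It does not measure speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        status TEXT,
        amount REAL
    );
    -- Multi-column index on the columns used for joining and filtering.
    CREATE INDEX ix_1 ON orders (customer_id, status);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "EMEA"), (2, "APAC"), (3, "EMEA")])
conn.executemany(
    "INSERT INTO orders (customer_id, status, amount) VALUES (?, ?, ?)",
    [(1, "open", 10.0), (2, "open", 20.0), (3, "closed", 30.0), (3, "open", 40.0)],
)

# Original form: filter through a subquery.
SUBQUERY = """
    SELECT o.id, o.amount FROM orders o
    WHERE o.status = 'open'
      AND o.customer_id IN (SELECT c.id FROM customers c WHERE c.region = 'EMEA')
    ORDER BY o.id
"""
# Rewritten form: the same filter expressed as a JOIN on the primary key.
JOIN = """
    SELECT o.id, o.amount FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.status = 'open' AND c.region = 'EMEA'
    ORDER BY o.id
"""
subquery_rows = conn.execute(SUBQUERY).fetchall()
join_rows = conn.execute(JOIN).fetchall()

# Ask SQLite how it would run a lookup on both indexed columns.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT amount FROM orders WHERE customer_id = 3 AND status = 'open'"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)

if __name__ == "__main__":
    print(subquery_rows == join_rows)
    print(plan_text)
```

The rewrite is safe here because the join is on a primary key, so it cannot duplicate order rows. On your own dB, compare the execution plans of both forms before and after the change.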
MS SQL:
SELECT TOP (1000) col1, col2 FROM table ORDER BY col2 DESC OPTION (FAST 1000)
Use the techniques above from within Information Model Designer while creating Information Links, by simply editing the SQL query and modifying it accordingly. This is also useful when querying data for the first time, to avoid surprises.
Use index-organized tables (for read-only dBs, e.g., Data Warehouses). These are self-contained: the data is stored in the index table. They are Oracle-specific but can be mimicked in all other dBs, and they are very fast to read.
Use partitioned tables and indexes. These are recommended by all major dB makers for higher performance with very large tables!
Bitmap indexes are better when data changes less frequently, as in a typical DW. Here's an article about Bitmap indexes. Just Google for more.
For serious dB tuning I'd recommend reading SQL Tuning (O'Reilly). It is a very theoretical book, though.
Anti-Virus
White-list everything under the TSS and Web Player installation directories in the Antivirus. AV slows down our servers, and neither TSS nor TSWP is a security threat since both run in sandboxes (Java and .Net, respectively).
At the bare minimum, edit web.config to re-point the WP Temp Dir elsewhere, and white-list that directory:
<log4net>
<appender name="FileAppender" type="log4net.Appender.RollingFileAppender">
<file value="D:\SpotfireTmp\WP_logs\Spotfire.Dxp.Web.log" />
Use optimizer hints for dB queries, as described earlier.
Include as few columns as possible in top views. Drill down to fetch more detail, even from the same table.
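The "few columns in the top view, drill down for detail" idea can be sketched at the query level as follows. This is an illustration only, using Python's built-in sqlite3 module with an invented sales table; the function names top_view and drill_down are hypothetical, and the aggregation in the top view is an assumption of this sketch, not something Spotfire requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, customer TEXT, amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("EMEA", "A", "c1", 10.0),
    ("EMEA", "B", "c2", 5.0),
    ("APAC", "A", "c3", 7.5),
])


def top_view(conn):
    """Fetch only the two columns the overview needs, one row per region."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()


def drill_down(conn, region):
    """Fetch the detail columns from the same table, but only for one region."""
    return conn.execute(
        "SELECT product, customer, amount FROM sales "
        "WHERE region = ? ORDER BY product",
        (region,),
    ).fetchall()


if __name__ == "__main__":
    print(top_view(conn))
    print(drill_down(conn, "EMEA"))
```

The overview query moves two columns and one row per region over the network. The detail columns are fetched only for the region the user actually selects.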
TS Pro performance
Antivirus
If possible, white-list the entire folder where TS Pro is installed.
Numerical columns
Numerical columns (displayed as Range Sliders in the Filter Panel) aren't indexed by default for large datasets. That's perfectly fine, but in some cases you can improve filtering performance for those columns. If the column contains a large number of rows combined with a small number of unique values, then you can increase performance by forcing an indexing (described below). Typical examples would be cases where you have millions of rows containing integers like customer ages, zip3 codes, a set of a few hundred unique IDs, etc. In those cases you'll get a performance improvement when filtering. Otherwise this trick can potentially make Spotfire run slower.
How to do it: for cases fitting the description above, if you're going to filter or join by a non-indexed column (i.e., use Add Columns or Add Rows), first force the index creation simply by changing the slider to an Item Slider and back to a Range Slider.
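To judge whether a column fits the "many rows, few unique values" description before trying the slider trick, you could check it outside Spotfire with something like the sketch below. The function name and both thresholds are assumptions made up for illustration; they are not Spotfire settings, and you should tune them against your own measurements.

```python
def worth_forcing_index(values, min_rows=1_000_000, max_unique_ratio=0.001):
    """Hypothetical rule of thumb: many rows combined with few distinct values.

    min_rows and max_unique_ratio are illustrative defaults, not Spotfire limits.
    """
    rows = len(values)
    if rows < min_rows:
        return False  # too small for the trick to matter
    unique_ratio = len(set(values)) / rows
    return unique_ratio <= max_unique_ratio


if __name__ == "__main__":
    ages = [i % 90 for i in range(2_000_000)]  # 90 distinct values
    ids = list(range(2_000_000))               # every value distinct
    print(worth_forcing_index(ages))
    print(worth_forcing_index(ids))
```

A column of customer ages passes the check, while a column of unique row IDs does not, which matches the cases described above.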
Implicit Joins
Sometimes implicit joins, i.e., side-by-side tables, are much more memory efficient than regular joins (Add Rows and Add Columns in Spotfire terminology).
Implicit joins are especially useful when you're only looking for the proper text value of a property to filter by.
More about side-by-side tables here.
Also, I'd recommend starting to explore the data using aggregated plots, e.g., Tree Maps, Pie Charts or Bar Charts. Scatter Plots (2D or 3D), Network Graphs (aka NGs) and Box Plots with Tukey-Kramer comparison circles will all be more computationally intensive.
Scatters have to comb through every row. Aggregated Scatter Plots (commonly referred to as Bubble Charts by other vendors) are OK though.
NGs can have almost exponential complexity.
Anyway, aggregated visualizations are going to give you the fastest rendering times.
Measure rendering time (Easter egg, see further down in this article)
Eventually you'll see a box in the top left corner of the plot indicating the graphic rendering time and the total time (i.e., the chain of events leading to actually being able to render the plot, usually the aggregations, etc.). The colors of those boxes indicate:
Green: Hardware acceleration is available and used
Yellow: Hardware acceleration not needed for that plot
Red: Hardware acceleration needed but not available