Tech FAQ

SparkBeyond Discovery has become increasingly popular as analytics leaders look to scale out SparkBeyond to tackle bigger projects and larger datasets, and to put self-service data analytics into the hands of more decision makers.

Here are the top FAQs about best practices for deploying SparkBeyond Discovery, how it scales, governance, and more.

Data Preparation

How do I set or change Column Identifiers, and why should I use them?

Column Identifiers are a tool to link data columns to the entities they represent in the real world, such as a company name, ID, or address. They enable the mapping of data onto the real world, and hence the identification of relations in your problem space.

The assignment of a Column Identifier has the same effect as changing the data type of the data column. To revert to the original column data type, select Auto-detect Type.

When column identifiers are set, identifier-specific functions are considered, yielding deeper and more interpretable insights in a shorter time.
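
Since identifiers are assigned in the UI rather than through code, the effect is easiest to picture by analogy with casting a column's type in pandas (the column names below are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "signup": ["2023-01-05", "2023-03-17"],  # auto-detected as plain strings
    "company": ["Acme Ltd", "Globex Corp"],
})

# Treating "signup" as a date rather than a string unlocks date-specific
# operations, much like assigning a Date identifier to the column.
df["signup"] = pd.to_datetime(df["signup"])
print(df["signup"].dt.day_name())  # Thursday, Friday
```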

How should datasets be prepared for utilizing SparkBeyond’s time series functionality?

Time Series Context is used to connect additional data relative to a timestamp in the primary table. The related datasets should therefore be as raw and granular as possible; no pre-aggregation is needed. The related time series data requires a timestamp column to connect back to the timestamp in the primary table.

If you are interested in time series features for a specific entity in your data, e.g. a customer or store, you will need to create a “Keyed time series”. The related dataset must include a column (the Key Column) whose values (Keys) link back to the primary dataset.
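
For illustration (the dataset and column names here are hypothetical), a keyed time series setup looks like a primary table with a timestamp, plus a raw, granular related table carrying its own timestamp and the Key Column:

```python
import pandas as pd

# Primary table: one row per prediction point, keyed by customer_id.
primary = pd.DataFrame({
    "customer_id": [1, 2],
    "purchase_ts": pd.to_datetime(["2023-03-01", "2023-03-02"]),
})

# Related keyed time series: raw, granular events; no pre-aggregation.
# customer_id is the Key Column; event_ts connects back to purchase_ts.
events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2023-01-10", "2023-02-20", "2023-02-28"]),
    "amount": [30.0, 45.0, 12.5],
})
```

Given inputs like these, the platform can derive window aggregates (e.g. total amount in the 30 days before each purchase) on its own, which is why pre-aggregating the related data would only hide signal.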

Does the product support many-to-many relationships? 

Yes. Our As Lookup connection allows this; simply specify a single column in the dataset you’d like to connect. You can add as many connections as you like.
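
For intuition, the effect resembles a many-to-many join on the specified column (a pandas sketch with invented data, not the product's internals):

```python
import pandas as pd

orders = pd.DataFrame({"product": ["A", "A", "B"], "qty": [1, 2, 5]})
tags = pd.DataFrame({"product": ["A", "A", "B"],
                     "tag": ["sale", "new", "sale"]})

# Many-to-many: each order row matches every tag row for its product,
# so the two orders for product "A" each expand into two tag rows.
linked = orders.merge(tags, on="product", how="left")
print(linked)  # 5 rows: 2 orders x 2 tags for "A", plus 1 x 1 for "B"
```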

Is it possible to limit data joins so that they do not introduce temporal/future leakage?
(For example, if a purchase happened on March 1st, only orders from February or January would be considered, since they are in the past.)

Yes, our Time Series connection allows you to specify a time window of a given size; only rows that fall within this window will be connected. You can also add a specific Key, such as user ID, to connect only relevant rows (other rows are filtered out rather than connected).
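
Conceptually (a pandas sketch under assumed column names, not the product's actual join logic), a leakage-safe window join keeps only related rows that fall before the primary timestamp and inside the window:

```python
import pandas as pd

purchases = pd.DataFrame({
    "user_id": [7],
    "purchase_ts": pd.to_datetime(["2023-03-01"]),
})
orders = pd.DataFrame({
    "user_id": [7, 7, 7],
    "order_ts": pd.to_datetime(["2023-01-15", "2023-02-10", "2023-03-05"]),
})

window = pd.Timedelta(days=60)  # only look this far back in time
joined = purchases.merge(orders, on="user_id")
in_window = (joined["order_ts"] < joined["purchase_ts"]) & (
    joined["purchase_ts"] - joined["order_ts"] <= window
)
print(joined[in_window])  # keeps the January and February orders only
```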

Hypothesis Generation

How can I “squeeze more juice” out of the platform?

There are multiple ways to configure the product to search deeper and generate more hypotheses. Here are the most common options. Please note that most of these configurations will significantly increase learning time.

  • Bring in more data
    The more data you bring in, the wider your search space and the more hypotheses the product can generate.

  • Increase "Global column interaction size" this will configure the system to look for features that are based on more than one column.
  • Enhance "Max complexity" or do this for specific columns using "Custom column interactions". This will allow more complex hypotheses to be evaluated.
Is there a way to hint that two columns in the data are closely related and should be used to create a new feature?

Yes, we have an advanced capability called Custom Column Interactions. Users can define a subset of columns (from the primary or related datasets) and direct the product to invest more resources in finding features that include those columns.
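
To illustrate the kind of hypothesis such a hint encourages, a feature built from two related columns might be a simple ratio (a hand-rolled pandas example with invented columns, not actual product output):

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 250.0, 90.0],
                   "visits": [20, 125, 30]})

# A two-column interaction feature over the hinted pair of columns.
df["revenue_per_visit"] = df["revenue"] / df["visits"]
print(df)
```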

Feature Evaluation

How should features be analyzed? 

Start with Lift / Target Mean Shift, then move to support, then consider overfitting/robustness, and finally think about explainability.
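
For a binary target, the first two quantities can be computed directly; here is a self-contained sketch with made-up data, mirroring what the feature screens report:

```python
import pandas as pd

df = pd.DataFrame({
    "feature": [True, True, True, False, False, False, False, False],
    "target":  [1,    1,    0,    0,     1,     0,     0,     0],
})

base_rate = df["target"].mean()             # overall P(target) = 0.375
subpop = df[df["feature"]]                  # rows where the feature holds
support = len(subpop) / len(df)             # share of rows matched = 0.375
lift = subpop["target"].mean() / base_rate  # (2/3) / (3/8) ~= 1.78
print(f"support={support:.2f}, lift={lift:.2f}")
```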

Where can I find a description of the functions that appear in the feature? 

You can see a description of most functions as follows:

  • Open the feature drill-down page (double-click the feature)
  • Scroll down to “Under the hood” and toggle the Feature Description option; this lets you see the source code of the feature
  • Hover over the specific function

Feature Selection

What is the meaning of the various Feature Selection Methods?

You can choose the method used to evaluate your hypotheses and select the best features. This is done by changing the feature selection method under Feature Selection in Settings:

  • Pairwise Information Gain
    Calculates the conditional mutual information between each feature and the target, conditioned on the set of already selected features. On each iteration, select the feature that contributes the most information to the already selected set. This is the default setting; a rough sketch of this greedy procedure appears after this list.
  • Simple By Rig
    Selects the top features based on RIG score. Usually results in many correlated features.
  • Robust Info Gain
    This measure decreases the importance of features whose information gain fluctuates across different folds of the data. It expects to receive a column to sort by before taking folds; otherwise, it assumes the data is presorted. Useful for fluctuating time series data, because it assumes that features that are consistent over all time periods are more likely to be robust on test data.

    For instance, for one year’s worth of hourly store-level retail sales data, a feature about it being Christmas eve would be removed (as it appears only in one fold of time/data), while a feature about “average sales yesterday” would likely remain stable.
  • Simple Linear Selection
    Scores each feature according to its coefficient in a simple linear regression model that includes only that feature. 
  • Pairwise Linear
    Computes pairwise scores using a multivariate linear regression that includes pairs of features. Relevant only for regression problems. It can reduce collinearity and redundancy of features.
  • Pairwise Information Gain With Linear Regularization 
    Calculates the conditional mutual information between each feature and the target conditioned on the set of already selected features. On each iteration, select the feature that contributes the most information to the already selected set. Includes a stronger regularization parameter to penalize complex features.
  • Simple By Rig And Segmentation 
    Intended to be used only with unsupervised problems, and in particular segmentation. The method looks for features that both characterize the data and divide it well (e.g. if half of the population is below a height of 1.77 and half is above it, the method will give a high score to the feature height < 1.77, and even more so to height in range (0, 1.77)).
  • SemiSupervised Anomaly Detection
    Intended to be used only with anomaly detection problems. Method seeks features that identify anomalies. 
  • SemiSupervised Anomaly Detection Relaxed (based on rare event column)
    Intended to be used only with anomaly detection problems. Method seeks features that identify anomalies, but with decreased importance for anomalies relative to SemiSupervised Anomaly Detection.
  • HighLiftSelection 
    Selects features with the highest lift that characterize more homogeneous sub-populations. It naturally identifies smaller sub-samples; therefore, if you use it in your analysis, we recommend increasing the default value of the Support Threshold.
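
For intuition, here is a rough sketch of the greedy procedure behind the default Pairwise Information Gain method, written for small, discrete datasets (an illustration only, not SparkBeyond's implementation):

```python
import numpy as np
import pandas as pd

def cond_mutual_info(x: pd.Series, y: pd.Series, z: pd.Series) -> float:
    """Empirical I(X;Y|Z) in bits, for discrete-valued series."""
    df = pd.DataFrame({"x": x, "y": y, "z": z})
    n = len(df)
    p_xyz = df.value_counts(normalize=True)
    p_xz = df.groupby(["x", "z"]).size() / n
    p_yz = df.groupby(["y", "z"]).size() / n
    p_z = df["z"].value_counts(normalize=True)
    total = 0.0
    for (xv, yv, zv), p in p_xyz.items():
        total += p * np.log2(p * p_z[zv] / (p_xz[(xv, zv)] * p_yz[(yv, zv)]))
    return total

def greedy_select(features: pd.DataFrame, target: pd.Series, k: int) -> list[str]:
    """On each iteration, pick the feature that adds the most information
    about the target, conditioned on the features already selected."""
    selected: list[str] = []
    for _ in range(k):
        if selected:
            # Collapse the already-selected features into one joint variable.
            joint = features[selected].astype(str).agg("|".join, axis=1)
        else:
            joint = pd.Series("", index=features.index)
        scores = {c: cond_mutual_info(features[c], target, joint)
                  for c in features.columns if c not in selected}
        selected.append(max(scores, key=scores.get))
    return selected
```

Because each round conditions on everything chosen so far, a feature that merely duplicates an already-selected one scores near zero, which is what suppresses the correlated features that Simple By Rig tends to return.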

Models

Is it possible to export the actual predictions of the model? Can this be done for the “Test” portion of the primary dataset? 

Yes. This can be done from the Models tab: click “Predictions” and select the desired portion.

Deployment / Infrastructure

Is internet access required during installation?

Internet access is required to download the SB install packages (~13 GB) from our S3 repository. This greatly simplifies the installation process.

If the server is not connected to the Internet, the SB install packages must be downloaded manually (e.g. a user downloads the packages to their laptop/PC and then moves them to the server where SB will be installed).


Why is there a need for multiple workers?

There are multiple types of “jobs” that run on our system. Each worker can handle a single job at any given time. A single user can create many jobs (for example, many iterations in a single project, or many different projects).


Does SparkBeyond telemetry data include any sensitive information or personal Identifiable information (PII)?

No. Here’s the relevant paragraph from our agreement:

“Supplier’s Software sends telemetry. Telemetry does not contain any Data nor information that allows the direct identification of any user. Telemetry will not be shared with any third party and is only used for internal product development purposes. All telemetry is stored on Supplier’s securely designed AWS servers. The telemetry includes: Login/Logout activity, learning executions and failures, UI usage such as button clicks page opens and application errors and failures.”

