Bitcoin is renowned for its security, but not all digital currencies share the same reliability. Despite some high-profile success stories, many tokens ultimately turn out to be fraudulent.
These scams are commonly referred to as “rug pulls”: the token’s creators abruptly disappear with investor funds, leaving holders with significant losses.
Solana, a Layer 1 blockchain, hosts a vast array of tokens that can be created quickly and at almost zero cost. While this fosters innovation, it also creates opportunities for malicious actors to launch tokens solely to execute rug pulls after attracting investment.
In this article, we explore the inner workings of a machine learning API designed to predict token risk on Solana. We also guide you through building a simplified version of such a model yourself.
Understanding Token Risk
Tokens on Solana vary widely in legitimacy and intent. While many projects are legitimate, others are created with fraudulent purposes.
What Is a Rug Pull?
A rug pull is a type of scam where developers abandon a project and withdraw all invested funds, causing the token’s value to crash. Investors are left with tokens that have little to no value.
Why Solana?
Solana’s high throughput and low transaction costs make it an attractive network for token creation. Unfortunately, these same features also make it a hotspot for fraudulent activities due to the ease of launching new tokens.
Setting Up the Environment
Before diving in, let’s prepare the development environment. We'll install necessary dependencies and gather data required to build the model. For convenience, consider using Google Colab to set up your notebook.
Install the dependencies:
# Install the required modules.
!pip install -U numpy
!pip install -U matplotlib
!pip install -U scikit-learn
!pip install pandas==2.2.1
!pip install xgboost==2.0.3
!pip install joblib==1.3.2

Note: If you are using a Jupyter Notebook in VSCode, simply replace ! with %.
Here’s a brief overview of the libraries we are using:
- Pandas: Used for data manipulation and analysis, offering data structures like DataFrames for handling structured data.
- Numpy: Provides support for numerical computations, efficient data storage, and optimized mathematical functions.
- SciKit-Learn: A comprehensive machine learning library with a consistent API and extensive documentation for various algorithms.
- XGBoost: An efficient and scalable implementation of gradient boosting, ideal for structured data and regression tasks.
- Joblib: A tool for saving and loading Python objects, particularly useful for machine learning models.
With the dependencies installed, we can proceed to data collection.
Collecting the Data
To train an effective machine learning model, a substantial amount of data is essential. Since we are building a supervised learning model, we need a labeled dataset. A labeled dataset consists of data instances where each instance is associated with one or more categories or labels.
A balanced dataset is critical. If the data is imbalanced—for example, containing more high-risk tokens than safe ones—the model may become biased and predict all tokens as high-risk, reducing its ability to identify legitimate tokens.
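The article does not show the data-loading step that the later main function relies on, so here is a minimal sketch of what a load_data helper and a class-balance check might look like. It assumes the dataset is a JSON file of token records carrying a 'Risk' label; the file name and field values below are illustrative, not real token data.

```python
import json
import pandas as pd

def load_data(file_path):
    # Read the labeled token records into a DataFrame.
    with open(file_path) as f:
        records = json.load(f)
    return pd.DataFrame(records)

# Toy records standing in for real token data.
sample = [
    {"liquidity": 50000.0, "Volatility": 12.5, "Risk": "Good"},
    {"liquidity": 120.0, "Volatility": 88.0, "Risk": "Danger"},
    {"liquidity": 900.0, "Volatility": 70.0, "Risk": "Warning"},
    {"liquidity": 75000.0, "Volatility": 9.0, "Risk": "Good"},
]
with open("sampleTokens.json", "w") as f:
    json.dump(sample, f)

df = load_data("sampleTokens.json")
# Inspect the class balance before training: heavily skewed
# proportions here would signal the bias problem described above.
print(df["Risk"].value_counts(normalize=True))
```

If the proportions are far from even, consider rebalancing (e.g., collecting more of the minority class) before training.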
Data Preprocessing
Raw data often contains irrelevant columns and attributes that need to be removed before training. We cannot feed raw data directly into the model.
from sklearn.model_selection import train_test_split

def preprocess_data(df):
    # Drop identifier and metadata columns that carry no predictive signal.
    df = df.drop(['address', 'lastTradeUnixTime', 'mc'], axis=1)
    X = df.drop('Risk', axis=1)
    # Collapse the labels into a binary target: risky (1) vs. safe (0).
    y = df['Risk'].map({'Danger': 1, 'Warning': 1, 'Good': 0}).astype(int)
    return train_test_split(X, y, test_size=0.4, random_state=42)

In the code above, we remove the ‘address’, ‘lastTradeUnixTime’, and ‘mc’ columns because they are identifiers and metadata that do not contribute to the model’s predictive performance. The ‘Risk’ column is separated out rather than discarded: it becomes the target label, with ‘Danger’ and ‘Warning’ mapped to 1 (risky) and ‘Good’ mapped to 0 (safe).
Building the Preprocessor
Data preprocessing is essential for preparing raw data for machine learning tasks. It involves handling missing values, categorical variables, and feature scaling.
The function build_preprocessor encapsulates preprocessing steps into pipelines for numerical and categorical features:
- Numerical features are handled with mean imputation for missing values and standardized using StandardScaler.
- Categorical features are processed with most-frequent imputation and one-hot encoding.
💡 What is one-hot encoding? It is a technique that converts categorical data into binary vectors where only one element is active.
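As a quick illustration of one-hot encoding, here is a toy example (the color data is purely illustrative, not from the article’s dataset):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; convert it for display.
encoded = encoder.fit_transform(colors).toarray()
print(encoder.categories_[0])  # categories are sorted alphabetically
print(encoded)
```

Each row becomes a binary vector with exactly one active element, e.g. "red" maps to [0, 0, 1] because the learned categories are ['blue', 'green', 'red'].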
These pipelines are combined using ColumnTransformer, which specifies the features to transform and allows remaining columns to pass through unchanged.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

def build_preprocessor(X_train):
    numeric_features = ['decimals', 'liquidity', 'v24hChangePercent', 'v24hUSD', 'Volatility', 'holders_count']
    categorical_features = ['logoURI', 'name', 'symbol']
    # Numerical features: fill missing values with the mean, then standardize.
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])
    # Categorical features: fill missing values with the mode, then one-hot encode.
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ],
        remainder='passthrough'
    )
    return preprocessor

Training the Model
The train_model function encapsulates the training process within a pipeline that integrates the preprocessor:
import xgboost as xgb

def train_model(X_train, y_train, preprocessor):
    # Chain the preprocessor and classifier so both run as one unit.
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42))
    ])
    model.fit(X_train, y_train)
    return model

This pipeline ensures seamless integration between preprocessing and model training, enhancing reproducibility and efficiency.
Why XGBoost?
XGBoost was chosen for this project due to its versatility, robustness, and superior performance across diverse data types and complexities. Compared to simpler models like logistic regression, XGBoost excels in handling complex datasets with varied feature types.
Key advantages of XGBoost include:
- Regularization: Built-in L1 and L2 regularization techniques prevent overfitting.
- Ensemble Learning: It combines multiple weak learners (decision trees) to produce robust predictions.
- Gradient Boosting: The model iteratively improves upon previous models by focusing on inaccurately predicted instances.
💡 Decision trees use a tree-like structure of true/false feature questions to predict labels and estimate the minimum questions needed for accurate decisions. They can be used for classification (predicting categories) or regression (predicting continuous values).
Similarly, our model uses true/false feature questions based on attributes like volatility, liquidity, and others.
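To make the idea concrete, here is a minimal decision-tree sketch on toy liquidity/volatility data (the values and threshold behavior are illustrative, not from the article’s dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy features: [liquidity, volatility] — hypothetical values for illustration.
X = np.array([[50000, 10], [80000, 5], [200, 95],
              [150, 90], [60000, 8], [300, 85]])
y = np.array([0, 0, 1, 1, 0, 1])  # 0 = safe, 1 = risky

# A shallow tree learns true/false threshold questions on the features.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)
preds = tree.predict([[70000, 7], [100, 99]])
print(preds)
```

Gradient boosting, as in XGBoost, builds many such shallow trees in sequence, each one correcting the errors of the previous ensemble.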
Evaluating the Model
To evaluate the model’s effectiveness, we use a confusion matrix to visualize its performance on test data:
Actual \ Predicted | Positive | Negative
-------------------+----------+---------
Positive           |    TP    |    FN
Negative           |    FP    |    TN

- True Positive (TP): The model correctly predicts the positive class.
- False Positive (FP): The model incorrectly predicts the positive class.
- False Negative (FN): The model incorrectly predicts the negative class.
- True Negative (TN): The model correctly predicts the negative class.
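These four cells can be turned into summary metrics such as precision and recall. Here is a quick worked example on toy predictions (not output from the article’s model); note that for binary labels {0, 1}, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# ravel() flattens [[TN, FP], [FN, TP]] into the four counts.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of tokens flagged risky, how many truly were
recall = tp / (tp + fn)     # of truly risky tokens, how many were caught
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For rug-pull detection, recall on the risky class is often the metric to watch: a false negative (a scam labeled safe) is usually costlier than a false positive.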
Let’s analyze the results:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    classification_report_result = classification_report(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    print(f'Model Accuracy: {accuracy}')
    print('Classification Report:\n', classification_report_result)
    print("Confusion Matrix:\n", conf_matrix)

Putting It All Together
We now assemble all functions into a main function to run the entire process and save the model for future use:
import joblib
import pandas as pd

def main():
    file_path = 'preProcessedTokens.json'  # Update this path
    df = load_data(file_path)
    X_train, X_test, y_train, y_test = preprocess_data(df)
    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
    preprocessor = build_preprocessor(X_train)
    model = train_model(X_train, y_train, preprocessor)
    evaluate_model(model, X_test, y_test)

    # Save the model and preprocessor
    joblib.dump(model, "predictModel.pkl")
    joblib.dump(preprocessor, "mainPreprocessor.pkl")

    # Example prediction for a single token
    single_item_corrected = {
        "decimals": 6,
        "liquidity": 62215.15524335994,
        "logoURI": "https://example.com/token-image",
        "name": "Example Token",
        "symbol": "EXT",
        "v24hChangePercent": -49.17844813082829,
        "v24hUSD": 18220.724466666383,
        "Volatility": 76.06539722778419,
        "holders_count": 0
    }
    single_item_df = pd.DataFrame(single_item_corrected, index=[0])
    prediction = model.predict(single_item_df)
    print(f'Single Item Prediction: {prediction}')

if __name__ == "__main__":
    main()

Congratulations! Your machine learning model is now ready to analyze token risk. The saved files with .pkl extensions can be used to run the model and integrate it into applications via FastAPI endpoints.
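To illustrate how a saved .pkl file is reused later (for example, inside an API endpoint), here is a minimal save-and-reload sketch. It uses a trivial stand-in model on toy data; the real pipeline saved as predictModel.pkl would be loaded the same way, and the file name here is illustrative.

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a trivial stand-in model: two clearly separated clusters.
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

# Persist the model to disk, then reload it as a separate object.
joblib.dump(clf, "demoModel.pkl")
loaded = joblib.load("demoModel.pkl")
preds = loaded.predict([[0.05], [0.95]])
print(preds)
```

In a FastAPI service, the joblib.load call would typically run once at startup, with each request building a one-row DataFrame (as in the single-token example above) and passing it to the loaded model’s predict method.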
Frequently Asked Questions
What is a rug pull?
A rug pull is a type of exit scam where developers abandon a project and withdraw all funds, causing the token to become worthless. This often happens shortly after liquidity is added.
Why is Solana prone to rug pulls?
Solana’s low transaction costs and high throughput make it easy and inexpensive to create tokens. While this encourages innovation, it also allows scammers to launch tokens with malicious intent.
How does the model predict risk?
The model uses features such as liquidity, volatility, holder distribution, and trading activity to assess the likelihood of a token being high-risk. It is trained on historical data to recognize patterns associated with fraudulent tokens.
Can this model be used for other blockchains?
While the model is built for Solana, the methodology can be adapted to other blockchains by retraining it on relevant data from networks like Ethereum or BNB Chain.
What are the limitations of this approach?
No model is 100% accurate. False positives and false negatives can occur. It’s essential to use this tool as part of a broader risk assessment strategy rather than relying on it exclusively.
How often should the model be retrained?
To maintain accuracy, the model should be retrained periodically with new data. The frequency depends on market dynamics but quarterly updates are a good starting point.
Conclusion
Building a token risk prediction model involves data collection, preprocessing, model training, and evaluation. Using XGBoost, we can create a robust tool to identify potentially fraudulent tokens on Solana.
This simplified version provides a foundation for further refinement and integration into broader risk management systems. By understanding the process and components involved, you can adapt and improve the model to suit specific needs.