
Machine Learning Applications: Harness the Power of Algorithms for Data Classification, Regression, and Clustering


MachineLearningApp.py

MachineLearningApp.py is a Python script demonstrating an end-to-end use of XGBoost, a gradient-boosted tree library, for regression tasks. The script walks through the main stages of a machine learning pipeline: data preparation, hyperparameter tuning, model training, evaluation, and model serialization.

The script begins by importing essential libraries such as pandas, numpy, scikit-learn, XGBoost, and pickle, ensuring all the necessary dependencies are available.
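The page does not reproduce the import block itself, but based on the description it would look roughly like this (a sketch, assuming scikit-learn and XGBoost are installed):

import pickle

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler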

One of the key functionalities of MachineLearningApp.py is the generation of a sample CSV file. The `generate_csv` function creates a synthetic dataset by leveraging NumPy's random data generation capabilities. The generated data is stored in a pandas DataFrame and saved as a CSV file, providing a starting point for regression analysis.
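A minimal sketch of what `generate_csv` could look like; the column names, row count, and target formula here are illustrative assumptions rather than the script's exact values:

def generate_csv(path, n_rows=1000, seed=42):
    # Draw random features and build a noisy linear target (illustrative).
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(n_rows, 3))
    noise = rng.normal(scale=0.1, size=n_rows)
    target = 2.0 * features[:, 0] - 1.5 * features[:, 1] + 0.5 * features[:, 2] + noise

    # Store everything in a DataFrame and save it as the starting CSV.
    df = pd.DataFrame(features, columns=["feature1", "feature2", "feature3"])
    df["target"] = target
    df.to_csv(path, index=False)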

Data preparation is a critical step in any machine learning task, and MachineLearningApp.py handles it in the `prepare_data` function. The function loads the CSV file and separates the features (X) from the target variable (y). It then performs simple feature engineering, adding a new feature that sums all the existing features, and scales the features with scikit-learn's StandardScaler so that all features share a comparable range. The data is split into training, validation, and test sets using `train_test_split`. Finally, XGBoost's feature importances are used to select the most informative features against a specified threshold, and the data is filtered to retain only the selected features.
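A sketch of how `prepare_data` could be written; the target column name, split ratios, and importance threshold are assumptions for illustration:

def prepare_data(path, importance_threshold=0.05):
    # Load the CSV and split off the target column (assumed to be named "target").
    df = pd.read_csv(path)
    X = df.drop(columns=["target"])
    y = df["target"]

    # Feature engineering: add a feature that sums all existing features.
    X["feature_sum"] = X.sum(axis=1)

    # Scale features so they share a comparable range.
    X_scaled = StandardScaler().fit_transform(X)

    # 60/20/20 split into training, validation, and test sets (ratios assumed).
    X_train, X_tmp, y_train, y_tmp = train_test_split(X_scaled, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

    # Feature selection: keep columns whose XGBoost importance clears the threshold.
    selector = xgb.XGBRegressor(n_estimators=100).fit(X_train, y_train)
    keep = selector.feature_importances_ >= importance_threshold
    return X_train[:, keep], X_val[:, keep], X_test[:, keep], y_train, y_val, y_test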

Model training happens in the `train_model` function. It sets up an XGBoost regression model and defines a hyperparameter grid for tuning. Using scikit-learn's GridSearchCV, the script performs a grid search with cross-validation and identifies the best hyperparameters by negative mean squared error. The best model and its hyperparameters are then retrieved, and the model is retrained with early stopping on the validation set and verbose logging.
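A sketch of how `train_model` could be structured, assuming XGBoost 1.6 or later (where `early_stopping_rounds` is a constructor argument); the grid values are illustrative:

def train_model(X_train, y_train, X_val, y_val):
    # Hyperparameter grid for tuning (values are illustrative assumptions).
    param_grid = {
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.1],
        "n_estimators": [100, 300],
    }

    # Grid search with cross-validation, scored by negative mean squared error.
    search = GridSearchCV(
        xgb.XGBRegressor(objective="reg:squarederror"),
        param_grid,
        scoring="neg_mean_squared_error",
        cv=5,
    )
    search.fit(X_train, y_train)
    print("Best hyperparameters:", search.best_params_)

    # Retrain the best configuration with early stopping on the validation set.
    best = xgb.XGBRegressor(
        objective="reg:squarederror",
        early_stopping_rounds=20,
        **search.best_params_,
    )
    best.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=True)
    return best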

The effectiveness of the trained model is evaluated in the `evaluate_model` function. It makes predictions on the test set and calculates the mean squared error as a performance metric. Additionally, the function retrieves the feature importances from the model and prints them, providing insights into the significance of each feature in the regression task.
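In code, the evaluation step might look like this (a sketch; the print formatting is an assumption):

def evaluate_model(model, X_test, y_test):
    # Predict on held-out data and report mean squared error.
    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    print(f"Test MSE: {mse:.4f}")

    # Feature importances show each feature's contribution to the model's splits.
    for i, score in enumerate(model.feature_importances_):
        print(f"feature {i}: importance {score:.3f}")
    return mse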

To ensure the trained model can be utilized later, MachineLearningApp.py incorporates model serialization. The trained model is saved to a file using the pickle module, enabling easy storage and retrieval for future use.
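Serialization itself is brief; the filename below matches the one Predictions.py later loads, while the surrounding variable names and CSV path come from the sketches above:

# Reusing the sketches above; the CSV path is illustrative.
X_train, X_val, X_test, y_train, y_val, y_test = prepare_data("data.csv")
model = train_model(X_train, y_train, X_val, y_val)

with open("xgboost_model.pkl", "wb") as f:
    pickle.dump(model, f)  # serialize the trained model for later reuse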

In summary, MachineLearningApp.py serves as a comprehensive example of utilizing XGBoost for regression tasks. Its functionalities encompass data preparation, hyperparameter tuning, model training, evaluation, and model serialization, offering a holistic view of the machine learning pipeline.

Predictions.py

Predictions.py is a Python script demonstrating how to use a pre-trained XGBoost model to make predictions on new data. The script follows a step-by-step process: importing the necessary libraries, defining a preprocessing function, generating random data to predict on, loading a serialized model, preprocessing the input data, making predictions, and printing the results.

The script starts by importing the required libraries, including pickle, pandas, and numpy, ensuring all dependencies are available for the subsequent steps.

Next, the `preprocess_data` function is defined. This function applies the necessary preprocessing steps to the input data. In this example, it performs one-hot encoding on a categorical feature named "feature3". The function ensures the preprocessed data has the same features as the trained model by adding any missing columns and reordering the columns to match.
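A sketch of such a function; the list of trained-model columns is passed in as an argument here, which is an assumption, since the page does not show how the script obtains it:

def preprocess_data(df, model_features):
    # One-hot encode the categorical column "feature3".
    df = pd.get_dummies(df, columns=["feature3"])

    # Add any columns the model saw during training that are absent here...
    for col in model_features:
        if col not in df.columns:
            df[col] = 0

    # ...then reorder so the columns match the training layout exactly.
    return df[model_features]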

To generate random data for predictions, the script uses NumPy to create a DataFrame called `new_data`. This data represents the new input for which predictions will be made.

The serialized model is then loaded from a file named 'xgboost_model.pkl' using `pickle.load`. The loaded model is stored in the `loaded_model` variable.

The `preprocess_data` function is called to preprocess `new_data` according to the defined steps. This ensures that the input data is in the same format as the data used to train the model.

The script then calls the loaded model's `predict` method on the preprocessed input data. The resulting predictions are stored in the `predictions` variable.

Finally, the script prints the original `new_data` alongside the corresponding predictions.
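Putting those steps together, a minimal end-to-end sketch of Predictions.py, reusing the `preprocess_data` sketch above; the feature names, value ranges, and trained-column list are all assumptions:

import pickle

import numpy as np
import pandas as pd

# Random rows to score (feature names and distributions are illustrative).
rng = np.random.default_rng(0)
new_data = pd.DataFrame({
    "feature1": rng.normal(size=5),
    "feature2": rng.normal(size=5),
    "feature3": rng.choice(["a", "b", "c"], size=5),  # categorical column
})

# Load the model serialized by MachineLearningApp.py.
with open("xgboost_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

# Columns the model was trained on (assumed for this sketch).
model_features = ["feature1", "feature2", "feature3_a", "feature3_b", "feature3_c"]

inputs = preprocess_data(new_data, model_features)
predictions = loaded_model.predict(inputs)

print(new_data)
print("Predictions:", predictions)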

Overall, Predictions.py showcases how to preprocess new data to match the format expected by a pre-trained XGBoost model and utilize the model to generate predictions.

The outline below summarizes the control flow of MachineLearningApp.py:

Start
|
|--- Generate CSV file
|       |
|       |--- Generate random data
|       |--- Save data to CSV file
|
|--- Prepare data
|       |
|       |--- Load CSV file
|       |--- Separate features and target variable
|       |--- Perform feature engineering
|       |--- Scale features
|       |--- Split data into training, validation, and test sets
|       |--- Perform feature selection
|
|--- Train model
|       |
|       |--- Set up XGBoost model
|       |--- Define hyperparameter grid
|       |--- Perform grid search with cross-validation
|       |--- Get best model and hyperparameters
|       |--- Train the best model
|
|--- Evaluate model
|       |
|       |--- Make predictions on test set
|       |--- Calculate mean squared error
|       |--- Get feature importance scores
|       |--- Print evaluation results
|
|--- Save trained model
        |
        |--- Save the model to a file using pickle
End

A second outline gives a function-level view of the same script:

Start
|
|__ Import necessary libraries
|
|__ Define `load_csv` function
|    |
|    |__ Read CSV file
|    |__ Return data
|
|__ Define `prepare_data` function
|    |
|    |__ Separate features and target variable
|    |__ Perform feature scaling
|    |__ Split the data into training, validation, and test sets
|    |__ Return the split data
|
|__ Define `train_model` function
|    |
|    |__ Set up the XGBoost model
|    |__ Define hyperparameter grid for tuning
|    |__ Perform grid search with cross-validation
|    |__ Get the best model and hyperparameters
|    |__ Train the best model
|    |__ Return the trained model
|
|__ Define `evaluate_model` function
|    |
|    |__ Make predictions on the test set
|    |__ Evaluate the model using mean squared error
|
|__ Define `make_predictions` function
|    |
|    |__ Load the CSV file
|    |__ Prepare the data
|    |__ Train the XGBoost model with hyperparameter tuning
|    |__ Evaluate the model on the test set
|    |__ Save the trained model
|
|__ Specify the path to the CSV file
|
|__ Make predictions using the provided CSV file
|
End
