RAPIDS

RAPIDS is a data science framework that bundles a collection of libraries for running end-to-end data science pipelines entirely on GPUs. It uses optimized NVIDIA CUDA® primitives and high-bandwidth GPU memory to accelerate data preparation and machine learning tasks. For example, it can be used for ETL and preprocessing in deep learning workflows.

This content is based on the original documentation. Below is a step-by-step guide to running its quick-start examples.

Documentation


Usage

To use RAPIDS on Bridges, you will need to start an interactive session on a GPU node with EGRESS access, load the RAPIDS module, and run the activate command to enable all of the libraries and environment variables. Use commands like:

interact --gpu --egress
module load rapids
activate 
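
Once the environment is activated, a quick sanity check is to confirm that the RAPIDS Python packages can be found. The following is a minimal sketch using only the standard library; the file name check_rapids.py and the helper find_missing are hypothetical and not part of RAPIDS or the Bridges module.

```python
# check_rapids.py (hypothetical helper, not part of RAPIDS)
import importlib.util


def find_missing(modules=("cudf", "cuml")):
    """Return the names in `modules` that cannot be found on the import path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("cudf and cuml are importable")
```

If any package is reported as not importable, the module may not be loaded or the activate command may not have been run in the current session.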

Examples

Descriptive statistics

This example reads in a .csv file and prints the mean of each column.

Create a Python file with this content:

# statistics.py
import cudf

# Read the CSV file into a GPU DataFrame
gdf = cudf.read_csv('/pylon5/datasets/user-guide/rapids/sample_file.csv')

# Print the mean of each column
for column in gdf.columns:
    print(gdf[column].mean())

Run the program.

$ python statistics.py

The output should look like:

5000.5
36.2544
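
For comparison, the same per-column mean can be reproduced on the CPU with the standard library, which may help when checking GPU results against a small file. This is a sketch only: it assumes every column is numeric, and the sample data and column names below are made up for illustration, not taken from sample_file.csv.

```python
import csv
import io

# Hypothetical stand-in for a small CSV file with two numeric columns
SAMPLE_CSV = "id,age\n1,30\n2,40\n3,50\n"


def column_means(text):
    """Return {column name: mean} for CSV text whose columns are all numeric."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {
        name: sum(float(row[name]) for row in rows) / len(rows)
        for name in rows[0]
    }


print(column_means(SAMPLE_CSV))  # {'id': 2.0, 'age': 40.0}
```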


cuDF statistics

The original can be found here: https://github.com/rapidsai/cudf

This example loads a public dataset, from a CSV file on GitHub, into a GPU memory-resident DataFrame and performs a basic calculation. All of the CSV parsing and operations for calculating the tip percentage and average are done on the GPU.

Create a Python file with the following content:

# cuDF.py
import cudf
import io, requests

# Download the CSV file from GitHub
url = "https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')

# Read the CSV into a GPU DataFrame and compute the tip percentage
tips_df = cudf.read_csv(io.StringIO(content))
tips_df['tip_percentage'] = tips_df['tip']/tips_df['total_bill']*100

# Display the average tip percentage by dining party size
print(tips_df.groupby('size').tip_percentage.mean())

Run the program:

$ python cuDF.py

The output will look like:

size
1 21.729202
2 16.571919
3 15.215685
4 14.594901
5 14.149549
6 15.622920
Name: tip_percentage, dtype: float64
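
To make the calculation concrete, the group-by step can be sketched in plain Python: compute a tip percentage per row, collect the values by party size, then average each group. The rows below are hypothetical and are not taken from tips.csv; the function name is also an illustration only, and cuDF performs the equivalent work on the GPU.

```python
from collections import defaultdict

# Hypothetical rows of (total_bill, tip, size); not taken from tips.csv
ROWS = [(8.0, 2.0, 2), (16.0, 2.0, 2), (32.0, 4.0, 4)]


def mean_tip_percentage_by_size(rows):
    """Group tip percentages by party size and return the mean of each group."""
    groups = defaultdict(list)
    for total_bill, tip, size in rows:
        groups[size].append(tip / total_bill * 100)
    return {size: sum(vals) / len(vals) for size, vals in sorted(groups.items())}


print(mean_tip_percentage_by_size(ROWS))  # {2: 18.75, 4: 12.5}
```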


cuML example

The original can be found here: https://github.com/rapidsai/cuml. This example loads a small sample data frame and computes DBSCAN clusters.

Create a Python file with the following content:

# cuML.py
import cudf
import cuml

# Create and populate a GPU DataFrame
df_float = cudf.DataFrame()
df_float['0'] = [1.0, 2.0, 5.0]
df_float['1'] = [4.0, 2.0, 1.0]
df_float['2'] = [4.0, 2.0, 1.0]

# Set up and fit clusters
dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(df_float)

print(dbscan_float.labels_)

Run the program:

$ python cuML.py

The output will look like:

0    0
1    1
2    2
dtype: int32
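
Each row ends up in its own cluster because no two rows lie within eps of each other, and with min_samples=1 every point counts as a core point on its own. The sketch below checks this with the standard library (math.dist requires Python 3.8 or later); it assumes the Euclidean distance metric, which is believed to be the cuML DBSCAN default but should be confirmed against the cuML documentation.

```python
import math
from itertools import combinations

# Rows of df_float from the example above, one tuple per row
points = [(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)]
eps = 1.0


def pairwise_distances(pts):
    """Return {(i, j): Euclidean distance} for every pair of row indices i < j."""
    return {
        (i, j): math.dist(pts[i], pts[j])
        for i, j in combinations(range(len(pts)), 2)
    }


distances = pairwise_distances(points)
for pair, d in distances.items():
    print(pair, round(d, 3), "neighbors" if d <= eps else "not neighbors")
```

All three distances (3.0, about 5.831, and about 3.317) exceed eps=1.0, so no points are merged and the labels are 0, 1, and 2.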