Was this page helpful?
Price prediction inference app¶
In this tutorial you’ll build a simplified feature store and real-time ML application with ScyllaDB. App predicts used car prices based on the car’s features.
Create a ScyllaDB cluster¶
In production, a ScyllaDB cluster should include at least three nodes. In this example, you’ll start with a single-node setup to keep things simple.
Start up a ScyllaDB node:
docker run --name node1 --network scylla -p "9042:9042" -d scylladb/scylla-enterprise:2024.2 \
--overprovisioned 1 \
--smp 1
Check the status of your node:
docker exec -it node1 nodetool status
The response you get back should be UN (Up and Normal). If it is not then wait a few seconds and try again as the node is not ready yet. You should see something that looks like this:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 172.17.0.2 204 KB 256 ? b26ab7f8-17ee-4e51-b28a-523697d16a60 rack1
Create a feature store keyspace and tables¶
In this part, you’ll create a new keyspace and a new table.
Use cqlsh to interact with ScyllaDB:
docker exec -it node1 cqlsh
Create a keyspace called feature_store
:
CREATE KEYSPACE feature_store
WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'replication_factor' : 1
};
With this CQL command, you set the following replication strategy:
NetworkTopologyStrategy
: A production ready replication strategy that allows to set the replication factor independently for each data-centerreplication_factor:
: The number of nodes that will store the same data. In production you’d set this to 3 or higher, to ensure that if one node becomes unavailable, you can still query your data using another node.
Activate the newly created keyspace:
use feature_store;
Next, create a table used_cars
with the following columns: model, year, price, fuel_type, brand:
CREATE TABLE used_cars (
car_id UUID,
model TEXT,
year INT,
price FLOAT,
brand TEXT,
PRIMARY KEY (car_id)
);
You can confirm that the table has been created by using the following CQL command:
DESC SCHEMA;
You can notice, that the table definiton has car_id
as the PARTITION KEY. This allows efficient queries by the car_id
column.
Partition key, filtering¶
Insert two rows into the newly created table:
INSERT INTO used_cars(car_id, model, year, price, brand)
VALUES (70f6d200-e9fb-4765-af05-f5ef668e82ee, 'A6', 2022, 43950, 'Audi');
INSERT INTO used_cars(car_id, model, year, price, brand)
VALUES (9a0807eb-05df-47dc-9081-095f7881b470, 'Q5', 2024, 45300, 'Audi');
Now, read the table:
SELECT * FROM used_cars;
Query the table and add a WHERE
clause:
SELECT * FROM used_cars WHERE car_id = 70f6d200-e9fb-4765-af05-f5ef668e82ee;
This query is using the car_id
column in the WHERE
clause. This query is efficient and scalable because car_id
is a PARTITION KEY.
Let’s see what happens if you query by a non-PK column, model
:
SELECT * FROM used_cars WHERE model = 'Q5';
You see an error message: InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
.
What this means is that you tried to query by a non-PK column which would require a table scan and have unpredictable performance.
If you are willing to run a table scan, you can still execute the query using the ALLOW FILTERING
extension:
SELECT * FROM used_cars WHERE model = 'Q5' ALLOW FILTERING;
What happens if you try to INSERT
a new product with the same ID but different price?
INSERT INTO used_cars(car_id, model, year, price, brand)
VALUES (70f6d200-e9fb-4765-af05-f5ef668e82ee, 'A6', 2022, 39000, 'Audi');
SELECT * FROM used_cars WHERE car_id = 70f6d200-e9fb-4765-af05-f5ef668e82ee;
In this case, because the car_id already exists in the table, ScyllaDB will perform UPDATE
instead of inserting a new row (executing an UPSERT
, in other words).
Clustering key¶
Now, let’s see how clustering keys help you access your data.
Create a new table, called car_features
:
CREATE TABLE car_features (
car_id uuid,
feature_name text,
feature_value text,
PRIMARY KEY (car_id, feature_name)
);
Insert some sample data:
INSERT INTO car_features(car_id, feature_name, feature_value)
VALUES (8fb321c6-8ee0-4cd8-8568-3f149b7c298c, 'brand', 'Audi');
INSERT INTO car_features(car_id, feature_name, feature_value)
VALUES (8fb321c6-8ee0-4cd8-8568-3f149b7c298c, 'model', 'A1');
INSERT INTO car_features(car_id, feature_name, feature_value)
VALUES (8fb321c6-8ee0-4cd8-8568-3f149b7c298c, 'fuel_type', 'Diesel');
INSERT INTO car_features(car_id, feature_name, feature_value)
VALUES (8fb321c6-8ee0-4cd8-8568-3f149b7c298c, 'price', '14950');
This table contains features of used cars. Let’s breakdown the PRIMARY KEY:
car_id
: PARTITION KEY - allows efficient queries using thecar_id
columnfeature_name
: CLUSTERING KEY - allows efficient queries using thecar_id
AND thefeature_name
columns
Query by car_id
to fetch all the features for one car:
SELECT * FROM feature_store.car_features WHERE car_id = 8fb321c6-8ee0-4cd8-8568-3f149b7c298c;
By adding a CLUSTERING KEY, you can query one specific feature value, e.g. model
:
SELECT * FROM car_features WHERE car_id = 8fb321c6-8ee0-4cd8-8568-3f149b7c298c AND feature_name = 'model';
Close cqlsh:
exit
Real-time machine learning app example¶
In this part, you’ll set up a simplified machine learning app that predicts used car prices based on user input and a pre-trained model.
Connect to your ScyllaDB cluster using cqlsh and create the schema:
docker exec -it node1 cqlsh
CREATE KEYSPACE IF NOT EXISTS feature_store WITH replication = {
'class': 'NetworkTopologyStrategy',
'replication_factor': 1
};
CREATE TABLE IF NOT EXISTS feature_store.car_features (
car_id uuid,
feature_name text,
feature_value text,
PRIMARY KEY (car_id, feature_name)
);
CREATE TABLE IF NOT EXISTS feature_store.raw_car_features (
car_id UUID,
brand TEXT,
model TEXT,
year INT,
transmission TEXT,
fuel_type TEXT,
mpg FLOAT,
engine_size FLOAT,
mileage INT,
tax FLOAT,
PRIMARY KEY (car_id)
);
Exit cqlsh:
exit
Test the app¶
Create a new virtual environment in the app’s folder:
cd labs/used-cars-feature-store/
virtualenv env --python=python3.11 && source env/bin/activate
Install Python requirements (scikit-learn, scylla-driver, streamlit):
pip install scikit-learn scylla-driver streamlit
Use the docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' node1
command to get the ScyllaDB HOST address.
NODE1=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' node1)
echo $NODE1
Find the config.py
file:
SCYLLA_HOSTS = ["172.18.0.4"]
SCYLLA_KEYSPACE = 'feature_store'
SCYLLA_USER = 'scylla'
SCYLLA_PASS = ''
SCYLLA_DC = 'dc1'
Make sure the HOST
value contains the output of:
echo $NODE1
If it doesn’t, fix it.
Run streamlit:
streamlit run app/app.py
Visit APP UI: http://localhost:8501
Hit CTRL + C
to stop the server.