Using Azure OpenAI with Microsoft Sentinel Part 2 - Converting Data to JSONL
Baby-stepping into the abyss
The data model file type that Azure OpenAI supports for importing (to do customized modeling and tuning) is JSONL, or JSON Lines. JSONL holds the same data as JSON, just in a different format: it takes JSON’s structured datasets and puts each record on its own single line (hence the ‘L’).
Additionally, the JSONL file only needs to include ‘prompt’ and ‘completion’ fields. I’ve posted a sample file of what this looks like, as I have begun to develop a training model specifically for KQL.
You can find that file here: https://rodtrent.com/tj9
But it looks something like this:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}{"prompt": "<prompt text>", "completion": "<ideal generated text>"}{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
Instead of creating your own JSONL file manually from scratch, you can use the following tools to perform the conversion.
Here’s a simple PowerShell script to convert JSON to JSONL for quick tests: https://rodtrent.com/xa2
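If you just want to see the shape of that conversion without downloading anything, here’s a minimal sketch of my own (not the linked script). It assumes the source file is a JSON array of objects that already have prompt and completion properties:

# Read the whole JSON array, then write each record as one compressed JSON object per line
$records = Get-Content -Raw -Path .\sample_data.json | ConvertFrom-Json
$records |
    ForEach-Object { $_ | ConvertTo-Json -Compress -Depth 10 } |
    Set-Content -Path .\sample_data.jsonl -Encoding utf8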
For larger tasks, or when your source data isn’t already JSON, use the OpenAI CLI data preparation tool.
The OpenAI CLI data preparation tool supports comma-separated values (CSV), tab-separated values (TSV), Microsoft Excel workbooks (XLSX), JavaScript Object Notation (JSON), and JSON Lines (JSONL).
There are a few ways to run the OpenAI CLI data preparation tool, but here’s what I’m using: Python and Visual Studio Code.
Once you have those installed, run the OpenAI installation from the Visual Studio Code terminal:
pip install --upgrade openai
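The base package is all these steps need, but if prepare_data later complains about missing data libraries (such as pandas) when reading CSV or XLSX files, the OpenAI Python package offers an optional extra that installs them; treat this as a troubleshooting step rather than part of the walkthrough:
pip install "openai[datalib]"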
With the OpenAI CLI data preparation tool installed, you can now run the converter against your CSV, TSV, XLSX, or JSON files in the Visual Studio Code terminal, using a command line similar to the following:
C:\<python_location>\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\Scripts\openai tools fine_tunes.prepare_data -f C:\<file_location>\filename.csv
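The long path above is simply where pip placed the openai console script in this setup. If that Scripts folder is already on your PATH, the shorter form of the same command works from any directory:
openai tools fine_tunes.prepare_data -f C:\<file_location>\filename.csv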
The OpenAI CLI data preparation tool converts the data file and writes the output to the original file location with the same name, but with ‘_prepared’ appended to the filename and a .jsonl extension.
For example, sample_data.csv becomes sample_data_prepared.jsonl.
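Before uploading it anywhere, you can sanity-check the first record of the prepared file from the same terminal (file names assumed from the example above):
Get-Content .\sample_data_prepared.jsonl -TotalCount 1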
This data file can be imported into the File Management section of Azure OpenAI Studio…
…which can then be used to train a customized model.
NOTE: I learned the hard way today (I didn’t read the Docs) that you should deploy your Azure OpenAI instances in the South Central US region; otherwise, no base model types are available for customization, which is part of the process for creating a customized model. If you deploy an instance into another region, you won’t be able to work with customized models at all for now.
P.S. There’s no mention of Microsoft Sentinel in this post (until now), but I promise customized models and data conversion will play into the larger scope as part of this series. Stay tuned…
[Want to discuss this further? Hit me up on Twitter or LinkedIn]
[Subscribe to the RSS feed for this blog]
[Subscribe to the Weekly Microsoft Sentinel Newsletter]
[Subscribe to the Weekly Microsoft Defender Newsletter]
[Learn KQL with the Must Learn KQL series and book]