Synthetic data offers a multitude of benefits for organizations across various domains. Primarily, it serves as a robust privacy-protection tool: it enables the generation of artificial data that closely mimics the original dataset, making it safe for testing and analysis without exposing real user details. This safeguards sensitive information and strengthens data security, since synthetic data contains no actual user or proprietary records, greatly reducing the impact of a breach. It is also cost-effective, reducing the need for expensive and time-consuming real-data acquisition and management. Its scalability makes it valuable for training machine learning models and conducting large-scale analyses, and because it contains no personal data it simplifies compliance with data protection regulations such as GDPR or HIPAA. Moreover, synthetic data promotes data diversity, can reduce biases, and facilitates rapid prototyping, accelerating the development of data-driven applications and fostering cross-organization collaboration by enabling data sharing while preserving anonymity and confidentiality.
GenAI for Synthetic Data
Generative AI, such as Generative Adversarial Networks (GANs), can address various challenges through its ability to create synthetic data or generate content. It can solve data scarcity issues by producing realistic synthetic data, enabling better model training and reducing overfitting. In fields like art and design, generative AI can inspire creativity and generate unique, novel content. It can also automate content creation tasks, such as text generation or image synthesis, streamlining content production processes. Moreover, generative AI can be used for anomaly detection, generating data points that highlight unusual patterns or outliers in datasets, aiding in fraud detection or quality control.
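To make the idea concrete, here is a minimal, illustrative sketch of one GAN training step for tabular data. It assumes PyTorch and hypothetical layer sizes; it is not the model behind our API, just the standard adversarial recipe: a generator maps random noise to synthetic rows, and a discriminator learns to tell real rows from generated ones.
### A minimal GAN training step (illustrative sketch, not our production model)
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 8   # hypothetical sizes for a small tabular dataset

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    ones = torch.ones(real_batch.size(0), 1)
    zeros = torch.zeros(real_batch.size(0), 1)
    ## Discriminator: push real rows towards 1, generated rows towards 0
    fake = G(torch.randn(real_batch.size(0), latent_dim)).detach()
    d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    ## Generator: try to make the discriminator output 1 for generated rows
    g_loss = bce(D(G(torch.randn(real_batch.size(0), latent_dim))), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()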
Our Synthetic-Data Curation API
Our API provides an enticing solution to your privacy and security concerns. It offers the following benefits:
- Customizable Data Generation
The API offers highly customizable synthetic data generation. Users can specify the exact data distributions, relationships, and characteristics they need, tailoring the synthetic data to their specific use cases and requirements.
- Realistic Data Fidelity
Our API stands out for its ability to create synthetic data that closely mirrors real-world datasets. It captures nuances, variability, and statistical properties of actual data, providing users with a more accurate representation for testing and analysis.
- Dynamic Data Evolution
Unlike static synthetic datasets, our API allows for dynamic data evolution. Users can simulate how data changes over time or in response to different scenarios, enabling them to test the resilience and adaptability of their systems.
- Privacy-Preserving Techniques
We employ advanced techniques such as differential privacy and secure multiparty computation to ensure that sensitive information remains confidential while still enabling meaningful analysis (see the sketch after this list).
- Seamless Integration and Scalability
The API is designed for scalability, making it suitable for projects of all sizes, from small-scale testing to enterprise-level data-driven applications, without compromising performance.
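As a flavour of what differential privacy means in practice, here is the textbook Laplace mechanism for releasing a numeric query privately. This is a standard illustration, not the mechanism our API uses internally:
### The Laplace mechanism: a textbook example of differential privacy
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=np.random.default_rng()):
    ## Adding Laplace(sensitivity / epsilon) noise makes the released value
    ## epsilon-differentially private.
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

## e.g. a count query: adding or removing one person changes the count
## by at most 1, so the sensitivity is 1
private_count = laplace_mechanism(true_value=1042, sensitivity=1, epsilon=0.5)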
Hands-on Example
First things first: get your access key from the contact page. Add a short description of your business and the purpose of using the API. We will get back to you with the access key and the documentation.
The API provides several end-points to help you generate synthetic data. Let us start by defining some helper functions to make API calls easier.
### Imports and the API base address (replace the placeholder with the
### server address you received with your access key)
import requests
import pandas as pd

addr = 'https://<api-server-address>'

### Helper function to submit post requests
def api_post_request(end_point, jsonData=None, formData=None, files=None, header=None):
    url = addr + end_point
    resp = requests.post(
        url,
        json=jsonData,
        data=formData,
        files=files,
        headers=header,
        verify=True
    )
    return resp

### Helper function to submit get requests
def api_get_request(end_point, header=None):
    url = addr + end_point
    resp = requests.get(
        url,
        headers=header,
        verify=True
    )
    return resp
Let's start testing the end-points. Run the command below to check connectivity with the server.
### Run the following end-point to confirm connectivity.
res = api_get_request('/test')
res.json()
We start by creating a new user. To get your **customer_key**, please send a request to contact@empirischtech.at.
username = "test"
json = {
    "name": "wasif",
    "lastname": "masood",
    "username": username,
    "jobtitle": "pending",
    "role": "pending",
    "email": "est@nothing.com",
    "customer_key": "your key"
}
end_point = '/user/register'
res = api_post_request(end_point, jsonData=json)
if res.status_code == 200:
    usersecret = res.json()['usersecret']
else:
    print(str(res.status_code) + ' ' + res.json()['detail'])
Keep your secret key safe, as you will need it to authenticate yourself in all subsequent calls.
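Rather than hard-coding the secret in a notebook, you may want to keep it in an environment variable. A small, optional sketch (the variable name is our own choice, not part of the API):
### Optional: load the user secret from an environment variable
import os

usersecret = os.environ.get('SYNTH_API_USERSECRET', usersecret)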
### reset password as follows
end_point = f'/user/reset/{username}'
res = api_get_request(end_point)
if res.status_code == 200:
    usersecret = res.json()['usersecret']
else:
    print(str(res.status_code) + ' ' + res.json()['detail'])
### save the usersecret for future use
Use the following end-point to login and get your access token.
data = {
    'username': username,
    'password': usersecret
}
## Authentication request
response = requests.post(addr + '/token', data=data, verify=True)
if response.status_code != 200:
    print('Authentication failed')
    print(response.status_code)
    print(response.json())
    exit()
token = response.json()
headersAPI = {
    'accept': 'application/json',
    'Authorization': 'Bearer ' + token['access_token'],
}
Now, let's read the file containing the telecom customer data, available on Kaggle. Details on the data can be found on the Kaggle page.
df = pd.read_csv('../data/telecom.csv', sep=',', header=0)
df.head()
A quick glance at the data looks as follows:
|   | cat_2 | cat_3 | cat_4 | num_4 | ... | cat_5 | cat_6 | num_17 | num_18 | target |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Female | 0 | Yes | 1 | ... | No | One year | 108.15 | 306.60 | False |
| 2 | Male | 0 | No | 34 | ... | Yes | Month-to-month | 83807.86 | 29.85 | False |
| 3 | Male | 0 | Yes | 2 | ... | Yes | Two year | 159660.80 | 1889.50 | True |
Upload the above file to the server using the following end-point.
### now upload the file to the server
file_path = "../data/telecom.csv"
files = {'file': open(file_path, 'rb')}
form_data = {
'sep': ',',
'header': 0
}
end_point = "/files/upload"
res = api_post_request(end_point, formData=form_data, files=files, header=headersAPI)
if res.status_code == 200:
file_id = res.json()
print(f'file uploaded successfully. file_id: {file_id}')
else:
print( str(res.status_code) + ' ' + res.json()['detail'])
Verify that the file was uploaded successfully by calling the following end-point.
### file uploaded successfully. Now lets check the uploaded files on the server.
res = api_get_request(end_point='/files', header=headersAPI)
if res.status_code == 200:
    print(res.json())
else:
    print(str(res.status_code) + ' ' + res.json()['detail'])
The call returns a list of all uploaded files. Use the file_id to address a file in subsequent calls.
Users are assigned a time quota for using the API. The following end-point returns the remaining quota, along with other user details.
### Before submitting a training job, let's see how much quota we have left.
res = api_get_request(end_point='/user/info', header=headersAPI)
if res.status_code == 200:
    print(res.json())
else:
    print(str(res.status_code) + ' ' + res.json()['detail'])
Next, we call the train end-point to train a GAN on this data. For the free/basic licence, the maximum number of training iterations is 500.
json = {
    "file_id": file_id,
    "training_iterations": 100,
    "brief_label": "Telecom dataset from kaggle",
    "desc": "First test dataset to generate synthetic data for.",
    "target_col": "target",
}
end_point = "/task/train"
res = api_post_request(end_point, jsonData=json, header=headersAPI)
if res.status_code == 200:
    task_uuid = res.json()
    print(f'task submitted successfully. task_uuid: {task_uuid}')
else:
    print(str(res.status_code) + ' ' + res.json()['detail'])
The above code returns a task_uuid, which can be used to check the status of the job, as follows:
### check the status of the training job
end_point = f"/task/status/{task_uuid}"
res = api_get_request(end_point, header=headersAPI)
if res.status_code == 200:
    print(res.json())
else:
    print(str(res.status_code) + ' ' + res.json()['detail'])
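Training can take a while, so instead of re-running the status call by hand you can poll it in a loop. The sketch below is our own convenience wrapper, not part of the API; the exact status payload is API-specific, so adjust the terminal-state check to what /task/status actually returns:
### Hypothetical polling helper; adapt the terminal-state check to the
### actual /task/status response schema
import time

def wait_for_task(task_uuid, poll_seconds=30):
    while True:
        res = api_get_request(f"/task/status/{task_uuid}", header=headersAPI)
        status = res.json()
        print(status)
        ## Assumed terminal states; replace with the real ones
        if any(s in str(status).lower() for s in ('finished', 'failed', 'success', 'error')):
            return status
        time.sleep(poll_seconds)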
In case a task fails, check out the detailed logs as follows:
### in case a task fails, check out the detailed logs
end_point = f"/task/details/{task_uuid}"
res = api_get_request(end_point, header=headersAPI)
if res.status_code == 200:
    print(res.json())
else:
    print(str(res.status_code) + ' ' + res.json()['detail'])
You can also check which parameters a task was trained with by calling the following end-point.
### check the parameters the task was trained with
end_point = f"/train/details/{task_uuid}"
res = api_get_request(end_point, header=headersAPI)
if res.status_code == 200:
    print(res.json())
else:
    print(str(res.status_code) + ' ' + res.json()['detail'])
Now download the generated synthetic data using the following end-point and save it as a CSV.
### get synthetic dataset
end_point = f"/task/generate/{task_uuid}"
res = api_get_request(end_point, header=headersAPI)
with open(f'synthetic_{task_uuid}.csv', 'w') as f:
    f.write(res.text)
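To take a quick look at the result, load the downloaded file back into pandas (synth_df is just a local name we chose):
### quick sanity check of the downloaded synthetic data
synth_df = pd.read_csv(f'synthetic_{task_uuid}.csv')
synth_df.head()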
Also check the relevance of the synthetic data to the actual data using the following end-point. Currently, we have defined 10 different statistical measures to capture the closeness between the two datasets.
### get the goodness of fit stats
end_point = f"/task/collective-efficacy/{task_uuid}"
res = api_get_request(end_point, header=headersAPI)
if res.status_code == 200:
stats = pd.DataFrame(res.json())
else:
print( str(res.status_code) + ' ' + res.json()['detail'])
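You can also run your own quick comparison alongside the API's measures. For example, a two-sample Kolmogorov-Smirnov test on one of the numeric columns; this uses scipy and is independent of the API's 10 measures:
### independent sanity check: compare one numeric column's distribution
### between the real and synthetic data
from scipy import stats as sps

ks = sps.ks_2samp(df['num_17'].dropna(), synth_df['num_17'].dropna())
print(f'KS statistic: {ks.statistic:.4f}, p-value: {ks.pvalue:.4f}')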
If the above stats are not satisfactory, you can retrain the model with different parameters; for example, increase the number of training iterations. For professional consultation, please contact us at contact@empirischtech.at.
There are several other useful end-points. Please visit the API Docs for a complete list.