LAB 4 - Amazon Comprehend
In input_data_config:
S3Uri: Replace <S3_INPUT_GOES_HERE> with the test_url that was defined previously.
InputFormat: Replace <INPUT_FORMAT_GOES_HERE> with ONE_DOC_PER_LINE.
In output_data_config:
S3Uri: Replace <S3_OUTPUT_GOES_HERE> with the s3_output_location.
For data_access_role_arn: Replace arn:aws:iam::899882598055:role/service-role/c67833a1330685l3262871t1w-ComprehendDataAccessRole-1EAP15HQRX9QE with the Amazon Resource Name (ARN) from the Lab Details file.
input_data_config = {
    'S3Uri': 'S3_INPUT_GOES_HERE',
    'InputFormat': 'INPUT_FORMAT_GOES_HERE'
}
output_data_config = {
    'S3Uri': 'S3_OUTPUT_GOES_HERE'
}
data_access_role_arn = 'arn:aws:iam::899882598055:role/service-role/c67833a1330685l3262871t1w-ComprehendDataAccessRole-1EAP15HQRX9QE'
### BEGIN_SOLUTION
input_data_config = {
    'S3Uri': test_url,
    'InputFormat': 'ONE_DOC_PER_LINE'
}
output_data_config = {
    'S3Uri': s3_output_location
}
data_access_role_arn = 'arn:aws:iam::899882598055:role/service-role/c67833a1330685l3262871t1w-ComprehendDataAccessRole-1EAP15HQRX9QE'
### END_SOLUTION
Now that you have defined the job parameters, start the sentiment detection job.
response = comprehend.start_sentiment_detection_job(
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    DataAccessRoleArn=data_access_role_arn,
    JobName='movie_sentiment',
    LanguageCode='en'
)
print(response['JobStatus'])
The following cell will loop until the job is completed. (This step might take a
few minutes to complete.)
%%time
import time

job_id = response['JobId']
while True:
    job_status = comprehend.describe_sentiment_detection_job(JobId=job_id)
    if job_status['SentimentDetectionJobProperties']['JobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print('.', end='')
    time.sleep(15)

print(job_status['SentimentDetectionJobProperties']['JobStatus'])
When the job is complete, you can retrieve the job details by calling the
describe_sentiment_detection_job function.
output = comprehend.describe_sentiment_detection_job(JobId=job_id)
print(output)
In the OutputDataConfig section, you should see the S3Uri. Extracting that URI
gives you the location of the results file, which you must download from Amazon S3.
You can then use the results to calculate metrics in the same way that you did for
the batch transform results earlier in this course.
import boto3

comprehend_output_file = output['SentimentDetectionJobProperties']['OutputDataConfig']['S3Uri']
comprehend_bucket, comprehend_key = comprehend_output_file.replace("s3://", "").split("/", 1)

s3r = boto3.resource('s3')
s3r.meta.client.download_file(comprehend_bucket, comprehend_key, 'output.tar.gz')
import json
import tarfile

# The downloaded file is a tar.gz archive that contains a file named 'output';
# extract it before reading
with tarfile.open('output.tar.gz') as tar:
    tar.extractall()

with open('output', 'r') as myfile:
    data = myfile.readlines()
Parse each line as JSON and add the results to a list.
results = []
for line in data:
    json_data = json.loads(line)
    results.append([json_data['Line'], json_data['Sentiment']])
Convert the list to a pandas DataFrame.
import pandas as pd

c = pd.DataFrame.from_records(results, index='index', columns=['index', 'sentiment'])
c.head()
The results contain NEGATIVE, POSITIVE, NEUTRAL, and MIXED labels instead of
numerical values. To compare these results to your test data, map them to the
numerical values that are used in the test labels. The index in the returned results
is also out of order; the sort_index function fixes this issue.
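The mapping cell itself does not appear in this excerpt. The following is a minimal sketch of the idea, using hypothetical results and an assumed label convention (1 = positive, 0 = negative; NEUTRAL and MIXED folded into 0 for illustration only). Adjust the mapping to match how your test labels were actually encoded.

```python
import pandas as pd

# Hypothetical results in the same shape that the earlier loop produces:
# [line number, sentiment label], possibly out of order
results = [[2, 'NEGATIVE'], [0, 'POSITIVE'], [1, 'POSITIVE']]
c = pd.DataFrame.from_records(results, index='index', columns=['index', 'sentiment'])

# Map the string labels to numbers (assumed convention: 1 = positive, 0 = negative)
c['sentiment'] = c['sentiment'].map({'POSITIVE': 1, 'NEGATIVE': 0,
                                     'NEUTRAL': 0, 'MIXED': 0})

# Restore the original line order so the predictions align with the test labels
c = c.sort_index()
print(c['sentiment'].tolist())  # [1, 1, 0]
```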
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(test_labels, c['sentiment'])
TN = cm[0, 0]
FP = cm[0, 1]
FN = cm[1, 0]
TP = cm[1, 1]
# Sensitivity, hit rate, recall, or true positive rate
Sensitivity = float(TP)/(TP+FN)*100
# Specificity or true negative rate
Specificity = float(TN)/(TN+FP)*100
# Precision or positive predictive value
Precision = float(TP)/(TP+FP)*100
# Negative predictive value
NPV = float(TN)/(TN+FN)*100
# Fall out or false positive rate
FPR = float(FP)/(FP+TN)*100
# False negative rate
FNR = float(FN)/(TP+FN)*100
# False discovery rate
FDR = float(FP)/(TP+FP)*100
# Overall accuracy
ACC = float(TP+TN)/(TP+FP+FN+TN)*100
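As a quick sanity check on the formulas above, the following self-contained sketch computes the same quantities on toy labels (illustrative data only, not the lab output) by counting outcomes directly, which is what confusion_matrix does in the binary case:

```python
# Toy labels for illustration only (not from the lab data)
test_labels = [0, 0, 0, 1, 1, 1]
predictions = [0, 1, 0, 1, 1, 0]

pairs = list(zip(test_labels, predictions))
# These counts correspond to cm[0,0], cm[0,1], cm[1,0], and cm[1,1]
TN = sum(1 for t, p in pairs if t == 0 and p == 0)
FP = sum(1 for t, p in pairs if t == 0 and p == 1)
FN = sum(1 for t, p in pairs if t == 1 and p == 0)
TP = sum(1 for t, p in pairs if t == 1 and p == 1)

Sensitivity = float(TP) / (TP + FN) * 100
Specificity = float(TN) / (TN + FP) * 100
ACC = float(TP + TN) / (TP + FP + FN + TN) * 100
print(TN, FP, FN, TP)    # 2 1 1 2
print(round(ACC, 2))     # 66.67
```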