This repo contains instructions for performing interactive search on the YFCC100M (Yahoo Flickr Creative Commons 100 Million) image dataset using the Eureka/OpenDiamond (paper) software stack, with AWS EC2 as the back-end.
We have provided:
- A public Amazon Machine Image (AMI) containing an installed Eureka back-end with pre-configured YFCC100M metadata.
- A VirtualBox image and a KVM image containing the pre-configured front-end GUI.
ToC:
- Launching the Eureka Back-ends on AWS EC2
- Starting the Front-end GUI
- Built-in Predicates
- Security and Privacy Risk
- FAQ
- Contact
- Region: US West (Oregon), us-west-2
- AMI ID: ami-078829174439aee2c
- Instance type (recommended): g3.4xlarge
- Public IP enabled
- Security group
  - Inbound: TCP 22, TCP 5872
  - Outbound: all
- Create a security group eureka-sg with inbound rules TCP 22 and TCP 5872, and outbound rules all.
- Create a launch template using the aforementioned AMI ID, security group, and recommended instance type.
- Use the launch template to create subsequent EC2 instances.
- You can create as many instances as you need.
- Wait for the launched instances to show "running" in Instance State and "2/2 checks passed" in Status Checks before starting the front-end GUI.
- Stop or terminate the EC2 instances when you are done.
- You can use a non-GPU instance type, but GPU filters (e.g., DNN image classification) will be unusable. You can still use other filters. Instance types with >= 16 vCPUs and >= 64 GiB RAM are recommended.
- You should use US West (Oregon), us-west-2, because the YFCC100M S3 bucket is in the same region.
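The console steps above can also be scripted with the AWS CLI. The following is a minimal sketch and not part of this repo: the function names, the key pair argument, and the default 0.0.0.0/0 source range are assumptions for illustration. It launches instances directly instead of going through a launch template, and it assumes a default VPC, where new security groups allow all outbound traffic and instances receive a public IP.

```shell
#!/usr/bin/env bash
# Sketch only: scripted equivalent of the console steps, using the AWS CLI.
# Function names, the key pair name, and the default CIDR are illustrative
# assumptions; adjust them to your own account.

REGION="us-west-2"
AMI_ID="ami-078829174439aee2c"
INSTANCE_TYPE="g3.4xlarge"
SG_NAME="eureka-sg"

# Create the security group and open TCP 22 and TCP 5872 to the given CIDR.
create_eureka_sg() {
    local cidr="${1:-0.0.0.0/0}" port
    aws ec2 create-security-group --region "$REGION" \
        --group-name "$SG_NAME" --description "Eureka back-end" || return 1
    for port in 22 5872; do
        aws ec2 authorize-security-group-ingress --region "$REGION" \
            --group-name "$SG_NAME" --protocol tcp --port "$port" \
            --cidr "$cidr" || return 1
    done
}

# Launch COUNT back-end instances (default 1) using an existing EC2 key pair.
launch_eureka_backends() {
    local count="${1:-1}" key_name="$2"
    aws ec2 run-instances --region "$REGION" \
        --image-id "$AMI_ID" --instance-type "$INSTANCE_TYPE" \
        --security-groups "$SG_NAME" --key-name "$key_name" \
        --count "$count"
}
```

For example, `create_eureka_sg` followed by `launch_eureka_backends 2 my-key` would start two back-ends, where `my-key` stands for a key pair you have already created.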
You need an AWS Access Key ID and Secret Access Key pair.
They may look like AKIAIOSFODNN7EXAMPLE and wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY.
Whether you run the front-end in a VM or natively, you must configure your AWS credentials so that the scripts can obtain the public IPs of your launched EC2 instances.
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json
And, of course, you will pay your own AWS bill.
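To check that the configured credentials work before starting the GUI, you can query AWS directly. This is a hedged sketch: the function names are hypothetical, and the query is only one plausible way to list the public IPs of running back-ends. It is not necessarily how start-search.sh obtains them.

```shell
#!/usr/bin/env bash
# Sketch only: sanity-check AWS credentials and list back-end public IPs.
# Function names are illustrative; start-search.sh may work differently.

# Print the account and user that the configured credentials resolve to.
check_aws_credentials() {
    aws sts get-caller-identity --output json
}

# Print the public IPs of running instances launched from the Eureka AMI.
list_eureka_public_ips() {
    local region="${1:-us-west-2}"
    aws ec2 describe-instances --region "$region" \
        --filters "Name=instance-state-name,Values=running" \
                  "Name=image-id,Values=ami-078829174439aee2c" \
        --query "Reservations[].Instances[].PublicIpAddress" \
        --output text
}
```

If `check_aws_credentials` fails, re-run `aws configure`. If `list_eureka_public_ips` prints nothing, no instance from the AMI is currently running in that region.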
Download VirtualBox image (v19.02)* or KVM image and xml (v19.02)
Login: ubuntu / Password: ubuntu
# Configure AWS credentials as shown above
cd /home/ubuntu/hyperfind/eureka-yfcc100m
./start-search.sh
*Tested host: macOS 10.13.6 + VirtualBox 6.0; Ubuntu 18.04 + VirtualBox 6.0
This is basically how the VM image is created.
- Install OpenDiamond. You need to at least have the executable cookiecutter from OpenDiamond functioning.
- Download and compile HyperFind. This is the front-end GUI.
- Install the AWS Command Line Interface
- Configure AWS credentials as mentioned above.
- Clone this repo in the directory where hyperfind.jar is located.
cd /path/to/hyperfind/dir/
git clone https://github.com/fzqneo/eureka-yfcc100m.git
As a result, the directory structure looks like:
/path/to/hyperfind/dir/
|-- build.xml
|-- bin/
|-- edu/
|   |-- cmu/
|       |-- ...
|-- hyperfind.jar
|-- eureka-yfcc100m/ <------ this repo
    |-- README.md <------ this file
    |-- ...
    |-- start-search.sh
    |-- ...
- Start the front-end GUI after you launch the EC2 back-ends
cd /path/to/hyperfind/dir/eureka-yfcc100m
./start-search.sh
See Brief Descriptions of Built-in Predicates
The pre-configured Eureka back-end in the AMI has ScopeCookie verification turned off. This means that anyone who knows the IP addresses of your EC2 instances can use the GUI to connect to your machines and run searches on them. Since YFCC100M is a public data set, the privacy risk should be minimal. To further reduce the risk, you can:
- Stop or Terminate your EC2 instances as soon as you are done with your search.
- Configure your inbound rules to only accept connections from your IP address/range.
- Turn on ScopeCookie verification. This requires a private key and a certificate to be set up on the front-end and the back-end, respectively. Contact me for instructions.
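The second mitigation, restricting inbound rules, can be applied with the AWS CLI. The sketch below assumes the security group is named eureka-sg, lives in a default VPC, and was originally opened to 0.0.0.0/0. The function name and the example address are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: limit inbound access on eureka-sg to a single CIDR.
# Assumes the group was previously open to 0.0.0.0/0 on TCP 22 and 5872.

restrict_eureka_inbound() {
    local my_cidr="$1" region="${2:-us-west-2}" port
    if [ -z "$my_cidr" ]; then
        echo "usage: restrict_eureka_inbound CIDR [REGION]" >&2
        return 2
    fi
    for port in 22 5872; do
        # Drop the open-to-the-world rule; ignore the error if it is absent.
        aws ec2 revoke-security-group-ingress --region "$region" \
            --group-name eureka-sg --protocol tcp --port "$port" \
            --cidr 0.0.0.0/0 >/dev/null 2>&1 || true
        # Allow only the caller's address or range.
        aws ec2 authorize-security-group-ingress --region "$region" \
            --group-name eureka-sg --protocol tcp --port "$port" \
            --cidr "$my_cidr" || return 1
    done
}
```

A call such as `restrict_eureka_inbound 203.0.113.4/32` would leave both ports reachable only from that one address (a documentation placeholder here; substitute your own public IP).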
The progress hangs, not moving forward.
Be patient!
There are several cases when this can happen:
- The first search session after the VMs start. The system may still be starting up, or the redis cache is loading from the disk.
- The first time you use a GPU-involving filter. It can take a long time to activate the GPU on EC2 on its first use.
- You use just-in-time (JIT) machine learning filters that train an ML model before filtering images. Depending on the algorithm and the training set size, the JIT training time can be considerable.
The GUI errors with SocketException
Make sure you have opened the necessary port (5872) on the EC2 instances.
Wait for the VMs to be in the "running" Instance State and "2/2 checks passed" in Status Checks.
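Both checks can be done from a terminal. This is a rough sketch with hypothetical function names. The port probe relies on bash's /dev/tcp feature and the GNU `timeout` command, which may not be installed on every host (macOS, for instance, does not ship it by default).

```shell
#!/usr/bin/env bash
# Sketch only: wait for an instance's status checks and probe port 5872.
# Function names are illustrative; port_open needs bash and GNU timeout.

# Block until the instance passes its EC2 status checks.
wait_for_backend() {
    local instance_id="$1" region="${2:-us-west-2}"
    aws ec2 wait instance-status-ok --region "$region" \
        --instance-ids "$instance_id"
}

# Return 0 if a TCP connection to HOST:PORT succeeds within 5 seconds.
port_open() {
    local host="$1" port="${2:-5872}"
    timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

If `port_open <public-ip>` fails after `wait_for_backend <instance-id>` has returned, the security group's inbound rules are the most likely cause.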
I can't create GPU instances on EC2.
By default, AWS may only allow users to create 0 or 1 GPU instances. You may need to ask Amazon to increase your limit.
Ziqiang Feng (Carnegie Mellon University)
zf at cs dot cmu dot edu