So you want to set up Jupyter Notebooks and Run Python in the Cloud, then actually get the Data?
This is a comprehensive set of steps. In Part 1, we set up an account and reserve a virtual machine using Google's Compute Engine. In Part 2, we install and run a Python 3 Jupyter Notebook. In Part 3, we set up a Google Cloud Storage bucket to simplify the process of transferring files to and from the VM.
First, a quick disclaimer - I'm not an Ubuntu expert, and I can't claim this setup is any more secure than the basics of not handing out a password. For the time being, it's best not to share any of your specifics here with anyone.
Second, this write-up assumes you have basic familiarity with command line navigation. If not, here is a quick primer:
Move into a subfolder from the current path: cd ./[folder name]
Move up a folder level: cd ../
Home directory: ~
Create a subfolder in the current working directory: mkdir [folder name]
Create a text file in the current directory: touch [filename].txt
List contents of current directory (subfolders and files): ls
Display full path of current working directory: pwd
Finally, note that in the browser SSH window we'll use later, any time you highlight anything it will be automatically copied. Be careful about pasting things in without first de-selecting what's in the window.
Step 1: The Setup
Step 1A: Create a project
Search for Google Cloud Compute Engine, which should take you to a page with a console link.
In the upper left, there should be an option to start a new project. (If you've used anything with an API in the past, you can see it there as well.) Create a new project and write down the name that Google assigns it; you can see it from the projects drop-down later if you forget. If you haven't already, you will need to attach a payment account, but Google gives you $300 over 365 days for free.
Important: Stop your instance when you're done to save money! If you select Static instead of Ephemeral for the external IP (below) and keep the boot disk, you'll be charged a little more, but all your data will stay saved when you shut the VM down. Running CPUs are more expensive by orders of magnitude, so this is more effective in the long run.
Step 1B: Creating a VM
Here is a summary of the salient points:
- Select a region - the video used us-west1-b, but I used us-east1-b and it was fine
- Select machine type - I used 8 vCPUs, but this can be changed later
- Boot disk -
- The video used Ubuntu 16.04 LTS (Xenial); I used 18.04 LTS (Bionic). I recommend the latter for Python 3 compatibility.
- At the bottom of this page, you can select more than 10 GB of persistent disk if you think it will be necessary, but this could incur additional charges.
- Firewall
- Check allow HTTP and HTTPS traffic
- click show more, then Disks, then uncheck “delete boot disk when instance is deleted”
- Continue - this may take a moment
- Once it is created, copy down the External IP on the instance (something like ###.###.###.###)
- Important! When you are done with the session, check the instance and click Stop
- Your static storage will be saved, so no worries there
At this point the VM is established, but we need some additional configuration.
- Click on the horizontal bars in the upper left
- VPC Network -> VPC Network -> External IP addresses
- On the instance, change Type Ephemeral to Static, and record the name (I used blogexample)
- From the left bar, select Firewall rules, then from the top, Create firewall rule
- Give it a name and record it (e.g. myfirewallrule)
- Change Targets from Specified target tags to All instances in the network
- Source IP Ranges: 0.0.0.0/0
- Under Specified protocols and ports, check "TCP" and give it a port number (I used 5000; if you're following along with the video, he first uses 1000, which is reserved and will not work)
From the horizontal bars, head back to Compute Engine - VM Instance
Step 2: Configuring the VM
Step 2A: Installing python3, pip3, and jupyter notebooks
Skip to the summary below for a list of these steps. If you are following along with the video, installing Python 3 instead of Python 2 is the main point of divergence here. To save some space: anywhere it asks you to continue, enter 'y'; any time you are given two options, enter [1].
- From the VM instance screen, click SSH
- Record user name and instance, for example username@instance-1. This will be needed for copying files
- In order:
- sudo apt-get update
- sudo apt-get --assume-yes upgrade (keep local version)
- sudo apt-get --assume-yes install software-properties-common
- sudo apt-get install python-setuptools python-dev build-essential
- From the newer Ubuntu: sudo apt install python3-pip
- For the example in the video: sudo easy_install pip
- From my example: sudo pip3 install jupyter
- From older (video example): sudo pip install jupyter
- jupyter notebook --generate-config
- sudo nano ~/.jupyter/jupyter_notebook_config.py
- Press the down arrow to navigate to immediately under the line # Configuration file for jupyter-notebook., then paste in the block shown just after this list:
- Save and exit: press Ctrl + O (the letter O), then Enter; then press Ctrl + X
- Enter: jupyter notebook password, then type your password. The cursor will not move, but copy and paste works fine
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 5000
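One hedge worth noting: on newer Jupyter installations, the server reads its settings from a ServerApp class instead of NotebookApp. If the block above seems to be ignored on your version, the equivalent settings would look like this (same idea, just a different class name):
c = get_config()
c.ServerApp.ip = '0.0.0.0'          # listen on all interfaces
c.ServerApp.open_browser = False    # don't try to open a browser on the VM
c.ServerApp.port = 5000             # match the firewall port you opened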
Step 2B: Running Jupyter Notebook
These steps stand alone from the ones above and are all that you need to do to run Jupyter notebooks. Once set up, this can be done from any point in the command line. The first couple of steps deal with setting up subfolders to work in, which I like to do to avoid writing things in the home directory's root. You don't have to, but the rest of the cloud steps assume you have these subfolders set up.
Note that in the browser window, Jupyter will launch from whatever directory you’re in when you type the open command. From the SSH connection, do the following
- To set up project paths:
- from command line: mkdir projects
- cd ./projects
- Launch Jupyter notebook:
- jupyter notebook --ip=0.0.0.0 --port=5000 (or whatever port you chose)
- Go to [External IP]:[port number] (e.g. ###.###.###.###:5000, using the external IP you recorded earlier) in a new browser tab
- Enter password and run Jupyter
- Installing packages
- pip3 install can be run directly from a second SSH window (or a second tmux pane, if you use tmux)
- Or, from Jupyter notebook you can use !sudo pip3 install pandas (or whatever) and it will work fine
- Note: there is one difference. With my installation (Python 3), you will need to use pip3 instead of pip, which otherwise works the same
- Closing Jupyter
- Quit as normal
- From command line, Ctrl + C, then y
- Close everything by typing exit
At this point, you are all set up, but of course you want to actually use data from somewhere.
Note: When setting up a new notebook, you may need to use !pip3 install instead of !pip install
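For example, a typical first cell in a fresh notebook might look like the following (pandas here is just a placeholder for whichever package you actually need):
!pip3 install pandas     # shell-escape install from inside the notebook; prefix with sudo if you hit permission errors
import pandas as pd
print(pd.__version__)    # quick sanity check that the package is importable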
Step 3: Setting Up Cloud Storage as an intermediary
Now the good stuff. By 'logging in' with Google's cloud tools, we can use the gsutil utility to access files in Cloud Storage buckets. This makes it much easier to get files onto your VM. Naturally, you'll want to pull them back down when you're finished.
Step 3A: Set Up a Google Cloud Storage Bucket
- From The Google Cloud Platform console, click the horizontal bars and go to Storage, then Browser (Make sure you are still in the same project you used to create the VM)
- Create a bucket, record its name, and use the Regional configuration. In this example, I use sample-blog-bucket
- Add some data and a directory
- Upload files by clicking button or dragging
- For this example, I’m creating a subfolder called data
- Clicking on data, I’m uploading a sample file called description_wordvec.csv
Step 3B: Allowing Communication Between the Project VM and Storage by Establishing an API Key
This is the crucial step!
- From the main menu, go to IAM & admin -> Service Accounts
- From Actions on the right, select Create key, and use the default JSON
- Save this file somewhere secure!
- You will not need the file in this configuration
Step 3C: Telling your VM how to find it
- Back to Compute Engine VM Instances
- SSH to your instance
- Type: gcloud init
- Choice: 1
- Project ID: use project ID from above (e.g. blogexample)
- Press enter one more time
- Secure login to account by typing and entering: gcloud auth login
- Do you want to continue: y
- Copy the full URL into a browser, log in with the active account, and click Allow
- Copy the generated verification string back into the SSH window (make sure you no longer have the URL copied in the window at this point, or the command-line app will paste that instead)
Step 3D: Moving Data Between Cloud Storage and VM
Once you are logged in with gcloud auth login from the step above, type "gsutil cp [filepath from] [filepath to]" in the SSH command line to copy files back and forth. Essentially, your bucket is at a location prefixed with "gs://". A few worked examples follow, and a short sketch after the list shows how to open a copied file from a notebook.
- Copy from bucket (to a folder with the VM path ~/projects/):
- The general string looks like this: gsutil cp gs://[bucket-name]/[filepath] [VM path to copy to]
- Single file from bucket to projects folder: gsutil cp gs://sample-blog-bucket/data/description_wordvec.csv ~/projects/
- Careful with the spaces
- Whole directory: gsutil cp -r gs://sample-blog-bucket/data ~/projects/
- Copy to bucket
- You can create a file as normal from Jupyter, but for this example I’m creating a folder under projects called dl_files, and adding two files (test1.txt and test2.txt) for the example
- The general string looks like this: gsutil cp [file] gs://[bucket-name]/[filepath]
- Single file: gsutil cp ~/projects/dl_files/test1.txt gs://sample-blog-bucket/data/
- Whole directory: gsutil cp -r ~/projects/dl_files gs://sample-blog-bucket/data/
- Contents of directory only: gsutil cp -r ~/projects/dl_files/* gs://sample-blog-bucket/data/
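Once a file is on the VM, you can open it from a notebook as usual. A minimal sketch, assuming the notebook was launched from ~/projects/ and that description_wordvec.csv has a header row:
import pandas as pd

# the file copied down from the bucket in the example above
df = pd.read_csv('description_wordvec.csv')
print(df.shape)    # quick sanity check on rows and columns
df.head()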
Step 3E: Copying Data from the Bucket to Your Local Machine
In the Cloud Storage browser, right-click on a file name, then choose "Save link as..." to download :)
On Pickles...
This seems like a good place, so I’ll include it here. Often, the biggest memory hog for data science is a trained model that we want to store.
Using the above steps, we can also easily move a model trained on a VM back down to our local drive for export to Kaggle kernels and the like.
Assuming you have used sklearn to train a model called “my_model”,
First: import pickle
Save the model: pickle.dump(my_model, open('ultimate_model.sav', 'wb'))
Open it back up later: my_model = pickle.load(open('ultimate_model.sav', 'rb'))
As usual, your filename will include the directory path where you would like it to be located.
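Putting that together, here is a minimal end-to-end sketch. The iris data and logistic regression are just stand-ins for whatever model you actually trained, and ultimate_model.sav is an arbitrary filename:
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# stand-in for your real training step
X, y = load_iris(return_X_y=True)
my_model = LogisticRegression(max_iter=500).fit(X, y)

# save the fitted model to disk
with open('ultimate_model.sav', 'wb') as f:
    pickle.dump(my_model, f)

# ...later, load it back up and confirm it still works
with open('ultimate_model.sav', 'rb') as f:
    restored_model = pickle.load(f)

print(restored_model.score(X, y))    # should match the original model's score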
Summary
Configuring VM code
Copy and paste your specifics here to help follow along:
[firewall]
0.0.0.0/0
[port number]
external IP: [IP number]
user_name@instance-1
sudo apt-get update
sudo apt-get --assume-yes upgrade
sudo apt-get --assume-yes install software-properties-common
sudo apt-get install python-setuptools python-dev build-essential
[y]
sudo apt install python3-pip
sudo pip3 install jupyter
jupyter notebook --generate-config
sudo nano ~/.jupyter/jupyter_notebook_config.py
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 5000
jupyter notebook password
ctrl+o [enter]
ctrl+x
password:
jupyter notebook --ip=0.0.0.0 --port=5000
Per Session
What you actually need to do to run Jupyter once set up:
- Click SSH, run jupyter notebook --ip=0.0.0.0 --port=[your port number]
- In a browser window, go to [VM IP Address]:[Port Number], enter your password
- If desired, copy files to and from the cloud bucket with gsutil
- If desired, install new Python3 packages using !pip3 install from inside a Jupyter Notebook
What you don’t need to do:
- Run config files
- Re-authorize your VM using gcloud auth login
- Re-load files from cloud bucket that have already been stored
Please let me know if you see any issues with this setup.