1. TL;DR
Here’s a collection of useful Google Colab tips/code snippets.
2. Resources
- Neptune.ai article for dealing with files in Colab (TODO: check this thing out!)
- Colab’s official documentation about dealing with files
- PyDrive Documentation
3. Working with files
Reading from a shared Google Drive Link
Scenario: Someone shared a Google Drive link to a dataset (.csv) with you, and you want to read it into Colab for processing.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate with PyDrive
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# Read file into wrapper class
downloaded = drive.CreateFile({"id": "INSERT_GOOGLE_DRIVE_FILE_ID_HERE"})
# Get the actual content of the file
downloaded.GetContentFile("FILE_NAME.csv")
# Check that file is actually present in root directory
!ls
Note: the file ID is the string between https://drive.google.com/file/d/ and /view?usp=sharing in the shared link.
You will be asked to enter an authorization code from your Google Account. This does NOT copy the file onto your Google Drive.
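Once GetContentFile has saved the CSV locally, you can load it with pandas. A minimal sketch — the sample file written here just stands in for the downloaded FILE_NAME.csv, which would already exist in your Colab working directory:

```python
import pandas as pd

# Stand-in for the CSV saved by GetContentFile above; in Colab,
# FILE_NAME.csv is already in the working directory, so skip this part.
with open("FILE_NAME.csv", "w") as f:
    f.write("id,value\n1,10\n2,20\n")

# Read the downloaded CSV into a DataFrame for processing
df = pd.read_csv("FILE_NAME.csv")
print(df.shape)  # (2, 2)
```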
Download a file to your computer
(TO DO)
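Until this section is filled in properly, here is a minimal sketch using google.colab.files, assuming the file you want already exists on the Colab VM's filesystem. The module is only importable inside Colab, so the import is guarded:

```python
# Colab-only: send a file from the VM to your local machine via the browser.
try:
    from google.colab import files
    files.download("FILE_NAME.csv")  # triggers a browser download prompt
    in_colab = True
except ImportError:
    in_colab = False  # not running inside Colab; nothing to download
```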
4. Imports/Libraries
Installing an older version of pytorch
!pip install torch==1.3.0 torchvision==0.4.1
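After pinning a version, it is worth confirming what actually got installed. A generic sketch using the standard library (shown here with `pip` as the package name so it runs anywhere, but the same call works for `torch` or `torchvision`):

```python
from importlib.metadata import version, PackageNotFoundError
from typing import Optional

def installed_version(pkg: str) -> Optional[str]:
    """Return the installed version of a package, or None if absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

# e.g. installed_version("torch") after the pip install above
print(installed_version("pip"))
```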
Importing an older version of TensorFlow
Colab has both TensorFlow 1.x and 2.x installed, with 2.x as the default. To choose a version, use the %tensorflow_version magic command.
%tensorflow_version 1.x
import tensorflow
print(tensorflow.__version__)
Do NOT use !pip install to specify a particular TensorFlow version for GPU/TPU backends. Colab builds TensorFlow from source to ensure compatibility with its fleet of accelerators; versions of TensorFlow fetched from PyPI by pip may suffer from performance problems or may not work at all.
Running FastAI
# fastbook is not preinstalled in Colab
!pip install -Uqq fastbook
from fastbook import *
from fastai.vision.widgets import *
Using Pyspark
Note: still to finish and check; the Spark/Hadoop versions below may be outdated.
See tutorial here for setup + a quick regression example.
# get pyspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Note, check https://spark.apache.org/downloads.html for version/URL changes
!wget https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xf spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark
# Configure paths for Java/Spark (SPARK_HOME must match the version downloaded above)
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"
#
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()