Imports

from fastai.data.all import *
from fastai.vision.all import *

The first step is to download and decompress our data (if it’s not already done) and get its location:

path = untar_data(URLs.PETS)
Path.BASE_PATH = path
path.ls()
(#2) [Path('images'),Path('annotations')]

The filenames are in the “images” folder. The get_image_files function collects all the image files in a folder and its subfolders:

fnames = get_image_files(path/"images")
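
To give an idea of what such a helper does, here is a minimal standard-library sketch. It is not fastai's implementation: the name find_image_files and the temporary folder layout are made up for illustration, and only the two beagle/boxer filenames are borrowed from the dataset.

```python
import mimetypes
import tempfile
from pathlib import Path

# Every extension that the standard library maps to an image MIME type.
IMAGE_EXTS = {ext for ext, kind in mimetypes.types_map.items() if kind.startswith("image/")}

def find_image_files(root):
    """Hypothetical helper: recursively collect image files under root."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix.lower() in IMAGE_EXTS)

# Build a throwaway folder so the example does not depend on downloaded data.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images" / "extra").mkdir(parents=True)
    for rel in ["images/beagle_115.jpg", "images/extra/boxer_18.jpg", "images/notes.txt"]:
        (root / rel).touch()
    found = [p.relative_to(root).as_posix() for p in find_image_files(root / "images")]

print(found)
```

The text file is skipped and the image in the nested folder is picked up, which is the behaviour we rely on when passing a folder as the source.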

We start with an empty DataBlock:

dblock = DataBlock()

By itself, a DataBlock is just a blueprint for how to assemble your data. It does not do anything until you pass it a source. You can then choose to convert that source into a Datasets or a DataLoaders by using the DataBlock.datasets or DataBlock.dataloaders method. Since we haven’t done anything to get our data ready for batches, the dataloaders method will fail here, but we can have a look at how it gets converted into a Datasets. This is where we pass the source of our data, here all our filenames:

dsets = dblock.datasets(fnames)
dsets.train[0]
(Path('images/scottish_terrier_99.jpg'),
 Path('images/scottish_terrier_99.jpg'))
dsets
(#7390) [(Path('images/beagle_115.jpg'), Path('images/beagle_115.jpg')),(Path('images/boxer_18.jpg'), Path('images/boxer_18.jpg')),(Path('images/Maine_Coon_157.jpg'), Path('images/Maine_Coon_157.jpg')),(Path('images/scottish_terrier_28.jpg'), Path('images/scottish_terrier_28.jpg')),(Path('images/english_setter_6.jpg'), Path('images/english_setter_6.jpg')),(Path('images/american_pit_bull_terrier_79.jpg'), Path('images/american_pit_bull_terrier_79.jpg')),(Path('images/boxer_128.jpg'), Path('images/boxer_128.jpg')),(Path('images/Persian_265.jpg'), Path('images/Persian_265.jpg')),(Path('images/Maine_Coon_182.jpg'), Path('images/Maine_Coon_182.jpg')),(Path('images/keeshond_89.jpg'), Path('images/keeshond_89.jpg'))...]

By default, the data block API assumes we have an input and a target, which is why we see our filename repeated twice.

The first thing we can do is use a get_items function to actually assemble our items inside the data block:

dblock = DataBlock(get_items = get_image_files)
get_image_files
<function fastai.data.transforms.get_image_files(path, recurse=True, folders=None)>

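Since the data block now collects the items itself, we pass the folder name as the source instead of the list of filenames: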

dsets = dblock.datasets(path/"images")
dsets.valid[0]
(Path('images/Siamese_158.jpg'), Path('images/Siamese_158.jpg'))

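The next step is to label the data. In this dataset, the filename of a cat image starts with an uppercase letter and the filename of a dog image starts with a lowercase letter, so we label each file as a cat or a dog accordingly: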

def label_func(fname):
    return "cat" if fname.name[0].isupper() else "dog"
dblock = DataBlock(get_items = get_image_files,
                   get_y     = label_func)
dsets = dblock.datasets(path/"images")
dsets.train[0]
(Path('images/newfoundland_74.jpg'), 'dog')
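
The labelling rule can be checked on its own, without loading any data. The sketch below repeats label_func and applies it to two filenames that appear in the outputs above.

```python
from pathlib import Path

def label_func(fname):
    # Cat breeds are capitalised in this dataset, dog breeds are lowercase.
    return "cat" if fname.name[0].isupper() else "dog"

samples = [Path("images/Siamese_158.jpg"), Path("images/newfoundland_74.jpg")]
labels = [label_func(p) for p in samples]
print(labels)
```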

Now that our inputs and targets are ready, we can specify types to tell the data block API that our inputs are images and our targets are categories. Types are represented by blocks in the data block API; here we use ImageBlock and CategoryBlock:

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func)
dsets = dblock.datasets(path/"images")
dsets.train[0]
(PILImage mode=RGB size=500x333, TensorCategory(0))
dsets.vocab
['cat', 'dog']

The next step is to control how our validation set is created. We do this by passing a splitter to DataBlock. For instance, here is how to do a random split:

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter())

dsets = dblock.datasets(path/"images")
dsets.train[0]
(PILImage mode=RGB size=500x399, TensorCategory(0))
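
The idea behind a random split can be sketched in plain Python. This is only an illustration under our own assumptions, not fastai's code: the function name random_split is hypothetical, and the valid_pct and seed parameters are modelled on the arguments of RandomSplitter. A splitter takes the items and returns two lists of indices, one for training and one for validation.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Hypothetical helper: shuffle the indices and hold out valid_pct of them."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    # The first `cut` shuffled indices form the validation set, the rest the training set.
    return idxs[cut:], idxs[:cut]

train_idxs, valid_idxs = random_split(10, valid_pct=0.2, seed=42)
print(len(train_idxs), len(valid_idxs))
```

Fixing the seed makes the split reproducible, which is useful when comparing several training runs.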

We also resize the images with item_tfms so that they all have the same size and can be collated into batches:

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter(),
                   item_tfms = Resize(224))
dls = dblock.dataloaders(path/"images")
dls.show_batch()

The way we usually build the data block in one go is by answering a list of questions:

  • what are the types of your inputs/targets? Here images and categories
  • where is your data? Here in filenames in subfolders
  • does something need to be applied to inputs? Here no
  • does something need to be applied to the target? Here the label_func function
  • how to split the data? Here randomly
  • do we need to apply something on formed items? Here a resize
  • do we need to apply something on formed batches? Here no
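
To make the role of each answer concrete, here is a plain-Python sketch of how the pieces fit together. It is a conceptual illustration only and not how fastai is implemented: build_datasets, the lambda-based get_items, and the fixed splitter are all hypothetical stand-ins.

```python
from pathlib import Path

def build_datasets(source, get_items, get_y, splitter):
    """Hypothetical helper: collect items, split them, and pair each with its label."""
    items = get_items(source)
    train_idxs, valid_idxs = splitter(items)
    def make(idxs):
        return [(items[i], get_y(items[i])) for i in idxs]
    return make(train_idxs), make(valid_idxs)

# Where is the data? Here a hard-coded list of filenames stands in for a folder.
source = ["images/beagle_115.jpg", "images/Persian_265.jpg", "images/boxer_18.jpg"]
get_items = lambda src: [Path(s) for s in src]
# What is applied to the target? The capitalisation rule from label_func.
get_y = lambda p: "cat" if p.name[0].isupper() else "dog"
# How to split? For a deterministic example, the last item goes to validation.
splitter = lambda items: (list(range(len(items) - 1)), [len(items) - 1])

train, valid = build_datasets(source, get_items, get_y, splitter)
print(train)
print(valid)
```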

Image classification

GrandparentSplitter splits the items based on the name of their grandparent folder (train_name or valid_name):

mnist = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock), 
                  get_items=get_image_files, 
                  splitter=GrandparentSplitter(),
                  get_y=parent_label)
dls = mnist.dataloaders(untar_data(URLs.MNIST_TINY))
dls.show_batch(max_n=9, figsize=(4,4))
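
The folder logic used here can be sketched with pathlib alone. The functions grandparent_split and folder_label are hypothetical stand-ins for GrandparentSplitter and parent_label, and the file paths are made up to resemble a train/valid layout with one subfolder per class.

```python
from pathlib import Path

def grandparent_split(items, train_name="train", valid_name="valid"):
    """Hypothetical helper: split by the name of the folder two levels up."""
    train = [i for i, p in enumerate(items) if p.parent.parent.name == train_name]
    valid = [i for i, p in enumerate(items) if p.parent.parent.name == valid_name]
    return train, valid

def folder_label(p):
    """Hypothetical helper: the label is the name of the containing folder."""
    return p.parent.name

# Illustrative paths only; the real filenames in the dataset differ.
items = [Path("mnist_tiny/train/3/a.png"),
         Path("mnist_tiny/train/7/b.png"),
         Path("mnist_tiny/valid/3/c.png")]

train_idxs, valid_idxs = grandparent_split(items)
labels = [folder_label(p) for p in items]
print(train_idxs, valid_idxs, labels)
```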

A RandomSplitter can be used as well, here on the Pets dataset with the breed extracted from each filename by a regular expression:

pets = DataBlock(blocks=(ImageBlock, CategoryBlock), 
                 get_items=get_image_files, 
                 splitter=RandomSplitter(),
                 get_y=Pipeline([attrgetter("name"), RegexLabeller(pat = r'^(.*)_\d+.jpg$')]),
                 item_tfms=Resize(128),
                 batch_tfms=aug_transforms())
dls = pets.dataloaders(untar_data(URLs.PETS)/"images")
dls.show_batch(max_n=9)
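
To see what the regular expression extracts, we can apply the same pattern with the standard re module. The helper breed_from_name is hypothetical and only approximates what the labelling pipeline does; the filenames are taken from the outputs earlier in this notebook.

```python
import re

# Same pattern as in the data block: everything before the final "_<digits>.jpg".
PAT = re.compile(r'^(.*)_\d+.jpg$')

def breed_from_name(name):
    """Hypothetical helper: return the breed part of a Pets filename."""
    match = PAT.search(name)
    if match is None:
        raise ValueError(f"no label found in {name!r}")
    return match.group(1)

names = ["scottish_terrier_99.jpg", "Maine_Coon_157.jpg", "boxer_18.jpg"]
breeds = [breed_from_name(n) for n in names]
print(breeds)
```

Because the first group is greedy, breed names that themselves contain underscores are kept whole.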

The Pascal dataset is originally an object detection dataset (we have to predict where some objects are in pictures). But it contains lots of pictures with various objects in them, so it gives a great example for a multi-label problem. Let’s download it and have a look at the data:

pascal_source = untar_data(URLs.PASCAL_2007)
df = pd.read_csv(pascal_source/"train.csv")
df.head(5)
        fname        labels  is_valid
0  000005.jpg         chair      True
1  000007.jpg           car      True
2  000009.jpg  horse person      True
3  000012.jpg           car     False
4  000016.jpg       bicycle      True
pascal = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=ColSplitter(),
                   get_x=ColReader(0, pref=pascal_source/"train"),
                   get_y=ColReader(1, label_delim=' '),
                   item_tfms=Resize(224),
                   batch_tfms=aug_transforms())
dls = pascal.dataloaders(df)
dls.show_batch()

An alternative way to write the same data block is to use lambda functions instead of ColReader:

pascal = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=ColSplitter(),
                   get_x=lambda x:pascal_source/"train"/f'{x[0]}',
                   get_y=lambda x:x[1].split(' '),
                   item_tfms=Resize(224),
                   batch_tfms=aug_transforms())

dls = pascal.dataloaders(df)
dls.show_batch()
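
What happens to the labels and the is_valid column can be sketched without fastai or pandas. The rows below are the five shown by df.head(5); the multi-hot encoding is our own simplified illustration of how a multi-label target can be represented, not the library's exact output.

```python
# (fname, labels, is_valid) rows copied from the head of train.csv shown above.
rows = [("000005.jpg", "chair", True),
        ("000007.jpg", "car", True),
        ("000009.jpg", "horse person", True),
        ("000012.jpg", "car", False),
        ("000016.jpg", "bicycle", True)]

# get_y: split the space-delimited label string into a list of labels.
labels = [lbl.split(" ") for _, lbl, _ in rows]

# Build a sorted vocabulary and encode each item as one 0/1 value per class.
vocab = sorted({l for ls in labels for l in ls})
encoded = [[1.0 if c in ls else 0.0 for c in vocab] for ls in labels]

# Splitting on the is_valid column: True rows go to validation.
valid_idxs = [i for i, r in enumerate(rows) if r[2]]
train_idxs = [i for i, r in enumerate(rows) if not r[2]]

print(vocab)
print(encoded[2])
```

The third image carries two labels, so its encoding has two active entries.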

Image Localization

There are various problems that fall in the image localization category: image segmentation (which is a task where you have to predict the class of each pixel of an image), coordinate predictions (predict one or several key points on an image) and object detection (draw a box around objects to detect).

path = untar_data(URLs.CAMVID_TINY)
path.ls()
(#3) [Path('/root/.fastai/data/camvid_tiny/images'),Path('/root/.fastai/data/camvid_tiny/codes.txt'),Path('/root/.fastai/data/camvid_tiny/labels')]

The MaskBlock is generated with the codes that give the correspondence between the pixel values of the masks and the objects they correspond to (like car, road, pedestrian…).

camvid = DataBlock(blocks=(ImageBlock, MaskBlock(codes = np.loadtxt(path/'codes.txt', dtype=str))),
    get_items=get_image_files,
    splitter=RandomSplitter(),
    get_y=lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
    batch_tfms=aug_transforms())
dls = camvid.dataloaders(path/"images")
dls.show_batch()
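
Two details of this data block can be illustrated in isolation: how the mask filename is derived from the image filename, and how codes turn pixel values into class names. The codes list, the tiny mask, and the filename below are made up for the example; the real codes.txt contains more classes.

```python
from pathlib import Path

# Hypothetical codes: the index in the list is the pixel value used in the mask.
codes = ["road", "car", "pedestrian"]

# A made-up 2x3 mask, one integer class index per pixel.
mask = [[0, 0, 1],
        [0, 2, 1]]
names = [[codes[v] for v in row] for row in mask]

# Same naming rule as get_y above: insert "_P" between the stem and the suffix.
image = Path("images/frame_0001.png")  # illustrative filename
mask_path = Path("labels") / f"{image.stem}_P{image.suffix}"

print(names)
print(mask_path.name)
```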

Points

biwi_source = untar_data(URLs.BIWI_SAMPLE)
fn2ctr = load_pickle(biwi_source/'centers.pkl')
biwi = DataBlock(blocks=(ImageBlock, PointBlock),
                 get_items=get_image_files,
                 splitter=RandomSplitter(),
                 get_y=lambda o:fn2ctr[o.name].flip(0),
                 batch_tfms=aug_transforms())
dls = biwi.dataloaders(biwi_source)
dls.show_batch(max_n=9)

Bounding Boxes

coco_source = untar_data(URLs.COCO_TINY)
images, lbl_bbox = get_annotations(coco_source/'train.json')
img2bbox = dict(zip(images, lbl_bbox))

We provide three types because we have one input and two targets: the bounding boxes and the labels. That’s why we pass n_inp=1 at the end, to tell the library where the inputs stop and the targets begin.

coco = DataBlock(blocks=(ImageBlock, BBoxBlock, BBoxLblBlock),
                 get_items=get_image_files,
                 splitter=RandomSplitter(),
                 get_y=[lambda o: img2bbox[o.name][0], lambda o: img2bbox[o.name][1]], 
                 item_tfms=Resize(128),
                 batch_tfms=aug_transforms(),
                 n_inp=1)
dls = coco.dataloaders(coco_source)
dls.show_batch(max_n=9)
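
The effect of n_inp can be sketched in plain Python: each item is a tuple with one element per block, and n_inp says how many leading elements are inputs. The annotations and the make_item helper below are hypothetical, with the lookup dictionary built the same way as img2bbox above.

```python
# Made-up annotations in the same shape as above: one (boxes, labels) pair per image.
images = ["a.jpg", "b.jpg"]
lbl_bbox = [([[10, 20, 50, 60]], ["chair"]),
            ([[5, 5, 30, 40], [0, 0, 8, 8]], ["vase", "book"])]
img2bbox = dict(zip(images, lbl_bbox))

# One getter per block: the image itself, its boxes, and its box labels.
getters = [lambda name: name,
           lambda name: img2bbox[name][0],
           lambda name: img2bbox[name][1]]

def make_item(name, n_inp=1):
    """Hypothetical helper: build the full tuple, then cut it at n_inp."""
    item = tuple(g(name) for g in getters)
    return item[:n_inp], item[n_inp:]

inputs, targets = make_item("b.jpg")
print(inputs)
print(targets)
```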

Text

from fastai.text.all import *
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
df.head()
label text is_valid
0 negative Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! False
1 positive This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... False
2 negative Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li... False
3 positive Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic... False
4 negative This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... False
imdb_lm = DataBlock(blocks=TextBlock.from_df('text', is_lm=True),
                    get_x=ColReader('text'),
                    splitter=ColSplitter())
dls = imdb_lm.dataloaders(df, bs=64, seq_len=72)
dls.show_batch(max_n=6)
text text_
0 xxbos xxmaj my kids recently started watching the xxunk of this show - both the early episodes on the xxup n , and the later ones on xxup abc xxmaj family - and they love it . ( i was n't aware the show had even lasted past the first or second season ) xxmaj i 'm curious as to what xxunk all of the cast changes - xxmaj i 've seen xxmaj my kids recently started watching the xxunk of this show - both the early episodes on the xxup n , and the later ones on xxup abc xxmaj family - and they love it . ( i was n't aware the show had even lasted past the first or second season ) xxmaj i 'm curious as to what xxunk all of the cast changes - xxmaj i 've seen them
1 and junior had n't xxunk her wings . xxmaj xxunk gene , i suppose . xxmaj by the way , we can now make an educated guess that xxmaj grendel 's pop was probably xxmaj xxunk xxmaj thing . \n\n - xxmaj grendel and mom chose to randomly kill , fly away with or drag away their prey based only on a close reading of the next few xxunk of the script junior had n't xxunk her wings . xxmaj xxunk gene , i suppose . xxmaj by the way , we can now make an educated guess that xxmaj grendel 's pop was probably xxmaj xxunk xxmaj thing . \n\n - xxmaj grendel and mom chose to randomly kill , fly away with or drag away their prey based only on a close reading of the next few xxunk of the script .
2 a very funny show . xxmaj let 's hope more episodes turn up on youtube and lets hope that someone will release " the xxmaj fosters " on xxup dvd in xxmaj england . \n\n xxmaj best xxmaj episode : xxmaj sex and the xxmaj evans xxunk xxmaj series 1 episode 6 . xxmaj the xxmaj foster 's episode of it was called xxmaj sex in the xxmaj black xxmaj community . very funny show . xxmaj let 's hope more episodes turn up on youtube and lets hope that someone will release " the xxmaj fosters " on xxup dvd in xxmaj england . \n\n xxmaj best xxmaj episode : xxmaj sex and the xxmaj evans xxunk xxmaj series 1 episode 6 . xxmaj the xxmaj foster 's episode of it was called xxmaj sex in the xxmaj black xxmaj community . xxmaj
3 forces ( who , amusingly , are made to speak in xxunk - up xxunk ! ) are xxunk by our heroic trio alone , much to the king 's xxunk who , as portrayed by xxmaj marcel xxmaj xxunk – best - known for his role of leader of the xxmaj parisian xxunk in xxmaj marcel xxmaj xxunk ' 's xxup children xxup of xxup paradise ( xxunk ) – is ( who , amusingly , are made to speak in xxunk - up xxunk ! ) are xxunk by our heroic trio alone , much to the king 's xxunk who , as portrayed by xxmaj marcel xxmaj xxunk – best - known for his role of leader of the xxmaj parisian xxunk in xxmaj marcel xxmaj xxunk ' 's xxup children xxup of xxup paradise ( xxunk ) – is himself
4 cost , because it does n't project the true image of xxmaj batman . xxmaj this cartoon is more like a xxunk xxmaj kung xxmaj fu xxmaj flick and if you really wanna see a classic xxmaj batman cartoon i strongly recommend xxmaj batman the xxmaj animated xxmaj series , but this cartoon is nothing more than a piece of s xxrep 3 - xxup t ! xxmaj get xxmaj batman : , because it does n't project the true image of xxmaj batman . xxmaj this cartoon is more like a xxunk xxmaj kung xxmaj fu xxmaj flick and if you really wanna see a classic xxmaj batman cartoon i strongly recommend xxmaj batman the xxmaj animated xxmaj series , but this cartoon is nothing more than a piece of s xxrep 3 - xxup t ! xxmaj get xxmaj batman : xxmaj
5 said that the book is better . xxmaj i 'm sure it 's not and i do n't care anyway i loved the movie . xxmaj as in all of xxmaj arnold 's films the acting is what you would expect with classic one liners from xxmaj arnold and even xxmaj xxunk gets a couple in . xxmaj but without a doubt xxmaj richard xxmaj dawson is the standout in this film that the book is better . xxmaj i 'm sure it 's not and i do n't care anyway i loved the movie . xxmaj as in all of xxmaj arnold 's films the acting is what you would expect with classic one liners from xxmaj arnold and even xxmaj xxunk gets a couple in . xxmaj but without a doubt xxmaj richard xxmaj dawson is the standout in this film .
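
In the batch shown above, the text_ column is the text column shifted by one token, because a language model is trained to predict the next token. A minimal sketch of that idea follows. The helper lm_pairs is hypothetical, the sequence length is shortened to 5 for readability, and the real dataloader handles batching and concatenation of documents in a more elaborate way.

```python
def lm_pairs(tokens, seq_len):
    """Hypothetical helper: cut a token stream into (input, target) windows,
    where each target is its input shifted one token to the right."""
    pairs = []
    for start in range(0, len(tokens) - 1, seq_len):
        x = tokens[start:start + seq_len]
        y = tokens[start + 1:start + seq_len + 1]
        if len(x) == len(y):  # drop the incomplete window at the end
            pairs.append((x, y))
    return pairs

# The first tokens of the first row of the batch shown above.
tokens = "xxbos xxmaj my kids recently started watching the xxunk of this show".split()
pairs = lm_pairs(tokens, seq_len=5)
for x, y in pairs:
    print(x, "->", y)
```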

Tabular

from fastai.tabular.core import *
adult_source = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(adult_source/'adult.csv')
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']

For standard preprocessing in fastai, we use these pre-processors:

procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df))

to = TabularPandas(df, procs, cat_names, cont_names, y_names="salary", splits=splits, y_block=CategoryBlock)
dls = to.dataloaders()
dls.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private 10th Never-married Machine-op-inspct Not-in-family White False 33.0 67005.996152 6.0 <50k
1 Private 12th Married-civ-spouse Craft-repair Husband White False 21.0 83703.995755 8.0 <50k
2 Private Preschool Married-civ-spouse Other-service Not-in-family White False 52.0 416129.003576 1.0 <50k
3 Local-gov HS-grad Married-civ-spouse Protective-serv Husband White False 34.0 155780.998653 9.0 <50k
4 Private HS-grad Never-married Adm-clerical Not-in-family White False 19.0 184758.999919 9.0 <50k
5 ? Bachelors Never-married ? Own-child White False 25.0 47010.997235 13.0 <50k
6 Private Masters Never-married Prof-specialty Not-in-family White False 30.0 196342.000092 14.0 <50k
7 Private HS-grad Married-civ-spouse Handlers-cleaners Husband Black False 27.0 275110.002518 9.0 >=50k
8 ? Bachelors Never-married ? Unmarried Asian-Pac-Islander False 27.0 190650.000040 13.0 <50k
9 Private Assoc-voc Never-married Craft-repair Not-in-family White False 32.0 38797.002375 11.0 <50k
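
The three pre-processors can be illustrated on two small columns taken from the first five rows of the table above. This is a simplified sketch rather than fastai's implementation: reserving index 0 for a "#na#" category, filling with the median, and normalizing with the population standard deviation are assumptions made for the example.

```python
import math
import statistics

workclass = ["Private", "Private", "Private", "Self-emp-inc", "Self-emp-not-inc"]
education_num = [12.0, 14.0, math.nan, 15.0, math.nan]

# Categorify: replace each category by an integer code (0 is kept for missing values).
vocab = ["#na#"] + sorted(set(workclass))
codes = [vocab.index(v) for v in workclass]

# FillMissing: remember which values were missing, then fill them with the median.
is_na = [math.isnan(v) for v in education_num]
median = statistics.median(v for v in education_num if not math.isnan(v))
filled = [median if na else v for v, na in zip(education_num, is_na)]

# Normalize: subtract the mean and divide by the standard deviation.
mean = statistics.fmean(filled)
std = statistics.pstdev(filled)
normalized = [(v - mean) / std for v in filled]

print(codes)
print(is_na)
print(filled)
```

The is_na flags play the same role as the education-num_na column that appears in the batch above.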

End of Notebook