Python data structures and sequences: a pragmatic approach

Tl;dr: bismillah. Many of my folks who are not CS student, also applied to me, still stumbled in the concept of the data structures and algorithms. Even though, that is the core subject if we want to be a proficient programmer or maybe become data scientist/analysis. Also in my experience, during interview with some tech company, the more question I got was about data structure and algorithms, but mostly the problem is we know the algorithms to solve it, but how we actually implement the right data structure. Even, until now I still lear and re(learn) about this subject for particularly 6 moths to deeply understanding and reshape my sense.

Here we are, most of the materials I got is from Python For Data Analaysis by Wes McKinney and I also cover from Introduction to Programming Using Python by Y. Daniel Liang.

I’d love to quote what Wes McKinney said about data structures in Python.

Python’s data structures are simple, but powerful. Mastering their use is a critical part of becoming a proficient Python programmer.”

So I will cover the subject about the data structures using example in Python language.

List

Lists are variable length and their elements can be modified (mutable). They can be defined using square brackets [] or using the list function. Lists and tuples are semantically similar as one-dimensional sequences of objects and thus can be used interchangeably in many functions.

Example:

lst_1 = ["red", "blue", "green", "brown"] # initialize a list
lst_1.append('yellow') # adding an element
lst_1.insert(1, 'white') # you can insert an element at a spesific location in the list. But, insert is computationally expensive compared with append as references to subsequent elements have to be shifted internally to make room for the new element.
lst_1.pop(2) # removing an alement by index order and returns an element at a particular index.
lst_1.remove("brown") # removing an element by value del lst_1[1:3] # removing elements by their index order

Concatenating and Combining List

adding two list together with + concatenates them:

[1,2,3,4,5,6] + [7,8,9,10,11,12]

If you have a list already defined, you can append multiple elements using the extend method:

a = [1,2,3,4,5,6]
a.extend([7,8,9,10,11,12])

list concatenation is a comparatively expensive operation since new list must be created and the objects copied over. Using extend method to append elements to an existing list, especially if you are building up a large list, is usually preferable. Thus,

everything = []
for chunk in list_of_list:
	everything.extend(chunk)

is faster that the concatinative alternative:

everything = []
for chunk in list_of_list:
	everything = everything + chunk

Sorting in List

A list can be sorted-in place (without creating a new object) by calling its short function. Sort has a few options that will occasionally come in handy. One is the ability to pass a secondary sort key, i.e. a function that produces a value to use to sort the objects. For example, we could sort a collection of string by their lengths:

b = ['saw', 'small', 'tall', 'seven']
b.sort(key=len)

Binary search and maintaining a sorted list

The built-in bisect module implements binary-search and insertion into a sorted list. bisect.bisect finds the location where an element should be interested to keep it sorted, while bisect.insort actually inserts the element into that location.

import bisect
lst_2 = [0, 1, 2, 3, 4, 5, 7, 9]
bisect.bisect(lst_2, 6) 	#will return the location should be inserted for element 6
bisect.insort(lst_2, 6)		#inserts the element 6 into the exact sorted location

Note: the bisect module functions do not check wether the list is sorted as doing so and would be computationally expensive. Thus, using them with an unsorted list will succeed without error but may lead to incorrect results.

Slicing

You can select sections of list-like types (arrays, tuples, …) by using slice notation, which in basic form consist of start:stop passed to the indexing operator []:

seq = [0, 1, 2, 3, 4, 5, 7, 9]
seq = [1:5] # will return [1,2,3,4]

Slices can also be assigned to with a sequence:

seq [3:4] = [6,3] # will return [0, 1, 2, 6, 3, 4, 5, 7, 9]

While element at the start index is included, the stop index is not included, so that the number of elements in the result is stop-start. Either the start or stop can be ommited in which case they default to the start of the sequence and the end of seqeunce, respectively:

seq = [:5] # will return [0,1,2,3,4]
seq = [3:] # will return [3, 4, 5, 7, 9]

Negative indices slice the sequence relative to the end:

seq = [-4:] # will return [4, 5, 7, 9]
seq = [-6:-2] # will return [2, 3, 4, 5]

A step can also be used after a second colon to, say, take every other elemet:

seq = [::2] # it will return [0, 2, 4, 7]
seq = [::-2] # it will return [9, 5, 3, 1]

A clever use of this is to pass -1 which has the useful effect of reversing the list or tuple:

seq = [::-1] # it will return [9, 7, 5, 4, 3, 2, 1, 0]

 

Built-in sequence functions

Enumerate

It’s common when iterating over a sequence to want to keep track of the index of the current item. A do-it yourself would look like:

i = 0
for value in collection:
# do something with the value
i += 1

Since this is so common, Python has a built-in function enumerate which returns a sequence of (i, value) tuples:

for i, value in enumerate(collection):
# do something with the value

some_list = ['sony', 'samsung', 'apple']
for i, value in enumerate(some_list):
print (i, value)

# it will return
0 sony
1 samsung
2 apple

When indexing data, a useful pattern that uses enumerate is computing a dict mapping the values of sequence (which are assumed to be unique) to their locations in the sequence.

some_list = ['sony', 'samsung', 'apple']
mapping = dict((i, v) for i, v in enumerate(some_list)) # it will return {0: 'sony', 1: 'samsung', 2: 'apple'}

Sorted

The sorted function returns a new sorted list from the elements of any sequence. A commong pattern for getting a sorted list of the unique elements in a sequence is to combine sorted with set:

my_name = "sony wicaksono"
my_name = sorted(set(my_name))
my_name = [' ', 'a', 'c', 'i', 'k', 'n', 'o', 's', 'w', 'y'] # the return

Zip

Zip pairs up the elements of a number of listst, tuples, or other sequences, to create a list of tuples:

seq1 = ['man united', 'chelsea', 'liverpool']
seq2 = ['one', 'two', 'three']

seq3 = zip(seq2, seq1)

If you are trying to print the zipped lists directly then that won’t work. zip returns an object and so when you try to point it you just get the object method. If you want to see as a list, then apply an operation that returns a list.

Some of the method to print zip object

print ([i for i in seq])
print (list(zip(seq2, seq1)))

Both of them will return

[('one', 'man united'), ('two', 'chelsea'), ('three', 'liverpool')]

Also note that zip can take an arbitrary number of sequences, and the number of elements it produces is determined by the shortest sequence. A very common use of zip is for simultaneously iterating over multiple sequences, possibly also combined with enumerate:

for i, (x, y) in enumerate(zip(seq2, seq1)):
print ('%d: %s, %s' % (i, x, y))

# the return is :
0: one, man united
1: two, chelsea
2: three, liverpool

Given a “zipped” sequence, zip can be applied in a clever way to “unzip” the sequence. Another way to think about this is converting a list of rows into a list of columns.

kumat_ui = [('sony', 'wicaksono'), ('ilham', 'aufadhuha'), ('fuad', 'wafa'), ('rizky', 'ariefian')]
first_name, last_name = zip(*kumat_ui)

first_name = ('sony', 'ilham', 'fuad', 'rizky')
last_name = ('wicaksono', 'aufadhuha', 'wafa', 'ariefian')

The use of * in function call is equivalent to the following:

zip(seq[0], seq[1], ..., seq[len(seq)-1])

Reversed

Reversed iterates over the elements of a sequence in reverse order:

list(reversed(range(11)))
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0] # the return

 

Tuples

A tuple is a one-dimensional, fixed-length, immutable sequence of python objects. Tuples are like lists, but their elements are fixed; that is, once tuple is created, you cannot add new elements, delete elements, replace elements, or reorder the elements in the tuple.

tup = () # create an empty tuple

tup_2 = (1,3,5) # create a tuple with three elements

tup_3 = tupple([2*x for x in range(1,5)])

tup_4 = tupple("abcdef")

Any function that can applied onto tuple are: len, min, max, sum, for loop, in, not in, count, and many more.

Dictionaries

Dict is likely the most important built-in python data structure. A more common name for it is hash map or associative array. It is flexible-sized collection of key-value pairs, where key and value are Python objects. One way to create one is by using curly braces {} and using colons to separate keys and values:

d1 = {1: "united", 2: "chelsea", 3: "liverpool", 4: ["city", "hotspur"]}

Elements can be accessed and inserted or set using the same syntax as with checking whether a list or tuple contains a value:

d1 [5] = "arsenal"

{1: 'united', 2: 'chelsea', 3: 'liverpool', 4: ['city', 'hotspur'], 5: 'arsenal'} # the return

You can check if a dict contains a key using the same syntax as with checking whether a list or tuple contains a value:

"united" in d1 # it will return True

Values can be deleted either using the del keyword or the pop method (which simultaneously returns the value and deletes the key). The keys and values method give you lists of the keys and values, respectively. While the key-value pairs are not in any particular order, these functions output the keys and values in the same order.


d1.keys()
dict_keys([1, 2, 3, 4]) # the return

d1.value()
dict_values(['united', 'chelsea', 'liverpool', ['city', 'hotspur']]) # the return

One dict can be merged into another using the update method:

d1.update({5: "arsenal", 6: "everton", 7: "west ham"})
{1: 'united', 2: 'chelsea', 3: 'liverpool', 4: ['city', 'hotspur'], 5: 'arsenal', 6: 'everton', 7: 'west ham'} # the return

Creating dicts from sequences
It’s common to occasionally end up with two sequences that you want to pair up element-wise in a dict. As a first cut, you might write code like this:

mapping = {}
for key, value in zip(key_list, value_list):
mapping[key] = value

Since a dict is essentially a collection of 2 tuples, it should be no shock that the dict type function accepts a list of 2-tuples:

mapping = dict(zip(range(5), reversed(range(5))))
mapping = {0: 4, 1: 3, 2: 2, 3: 1, 4: 0}

Default values
It’s very common to have logic like:
if key in some_dict:

value = some_dict[key]
else:
value = default_value

Thus the dict method get and pop can take a default value to be returned, so that the above if-else block can be written simply as:

value = some_dict.get(key, default_value)

Get by default will return None if the key is not present, while pop will raise an exception. With setting values, a common case is for the values in a dict to be other collection, like lists. For example, you could imagine categorizing a list of words by their first letters as a dict of lists:

words = ["apple", "avocados", "banana", "blueberries" "chocolate", "durian"]
by_letter = {}
for word in words:
letter = word[0]
if letter not in by_letter:
by_letter[letter] = [word]
else:
by_letter[letter].append(word)

by_letter = {'a': ['apple', 'avocados'], 'b': ['banana', 'blueberries'], 'c': ['chocolate'], 'd': ['durian']} #the return

The setdefault dict method is for precisely this purpose. The if-else block above can be written as:

by_letter.setdefault(letter, []).append(word)

The built-in collection modules has a useful class, defaultdict, which makes this even easier. One is created by passing a type or function for generating the default value for each slot in the dict:

from collections import defaultdict
by_letter = defaultdict(list)
for word in words:
by_letter[word[0]].append(word)

defaultdict(<class 'list'>, {'a': ['apple', 'avocados'], 'b': ['banana', 'blueberries'], 'c': ['chocolate'], 'd': ['durian']} # the return

Set

Sets are like list in that you use them for storing a collection of elements. Unlike lists, however, the elements in a set are non-duplicates and are not placed in any particular order. You can think of them like dicts, but keys only, no values. A set can be created in two ways: via the set function or using a set literal with curly braces:

set([2,2,2,1,3,3])
{1, 2, 3} # the ouput

Sets support mathematical set operations like union, intersection, difference, and symmetric difference. You can also check if a set is a subset of (is contained in) or a superset of (contain all elements of) another set.

 

So, let’s wrap it up. How we can determine which one data structures will be use in some instances. So, here’s some usage and occasions I think it will be good to know as our perspective the data structure that should be use, maybe you got another, it’s okay too. Thanks to Rajkumal Rampali, I cover this from him.

List: Use list if you have a collection data that doesn’t need random access. Use list when you need a simple, iterable collection that is modified frequently.

Tuples: Use tuples when your data cannot change. A tuples is used in combination with a dictionary, for example, a tuple might represent a key, because its immutable.

Dictionaries: When you need a logical association between key:value pair. When you need fast lookup for your data, based on a custom key. When your data is being constantly modified.

Sets: Membership tesing and the elimination of duplicate entries. When you need uniqueness for the elements.

So that’s all, on next week I’ll cover about another topics in Data Structure like valid dict key types, iterable vs iterators, and list-set-dict comprehensions insyaallah.