Skip to content

prefect_google_sheets.tasks

prefect-google-sheets tasks

Classes

Functions

read_google_sheet_as_data_frame

This task leverages the Google Sheets API v4 through the gspread library in order to read the content of a Google Sheet and return it as a pandas Dataframe.

Parameters:

Name Type Description Default
is_public_sheet bool

Whether the Google Sheet is public or not. If True, the google_service_account param will be ignored.

False
google_service_account Union[Dict, str]

The Service Account to be used in order to interact with the Google Sheet. This can be a dict or a string representing the JSON Service Account body.

None
google_sheet_key Optional[str]

The key of the Google Sheet to read data from.

None
google_sheet_name Optional[str]

The name of the Sheet to read data from.

None
first_row_header Optional[bool]

Whether the first row is the header. If True, the first row will be used as header and data won't be read. Otherwise, if set to False. Default set to True

True
on_bad_lines Optional[str]

What to do if bad lines are discovered: 'error': An Exception is raised 'warn': A warning is printed and the line is skipped 'skip': The line is skipped with no warnings Default set to 'error'

'error'
clean Optional[bool]

Used in order to remove blank columns and rows left in the Google Sheet.

False

Returns:

Type Description
DataFrame

The content of a specific Sheet as a pandas dataframe.

Source code in prefect_google_sheets/tasks.py
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
@task
def read_google_sheet_as_data_frame(
    is_public_sheet: bool = False,
    google_service_account: Union[Dict, str] = None,
    google_sheet_key: Optional[str] = None,
    google_sheet_name: Optional[str] = None,
    first_row_header: Optional[bool] = True,
    on_bad_lines: Optional[str] = "error",
    clean: Optional[bool] = False,
) -> DataFrame:
    """
    This task leverages the Google Sheets API v4 through the gspread library
    in order to read the content of a Google Sheet and return it as a pandas Dataframe.
    Args:
        is_public_sheet: Whether the Google Sheet is public or not.
            If True, the google_service_account param will be ignored.
        google_service_account: The Service Account to be used
            in order to interact with the Google Sheet.
            This can be a dict or a string representing the JSON
            Service Account body.
        google_sheet_key: The key of the Google Sheet to read data from.
        google_sheet_name: The name of the Sheet to read data from.
        first_row_header: Whether the first row is the header.
            If True, the first row will be used as header and data won't
            be read. Otherwise, if set to False. Default set to True
        on_bad_lines: What to do if bad lines are discovered:
            'error': An Exception is raised
            'warn': A warning is printed and the line  is skipped
            'skip': The line is skipped with no warnings
            Default set to 'error'
        clean: Used in order to remove blank columns and rows left
            in the Google Sheet.
    Raises:
        - `GoogleSheetsConfigurationException`
            if google_service_account not provided or None
        - `GoogleSheetsConfigurationException`
            if google_sheet_key not provided or None
        - `GoogleSheetsConfigurationException`
            if google_sheet_name not provided or None
        - `GoogleSheetServiceAccountError`
            if the google_service_account is not valid or malformed
    Returns:
        The content of a specific Sheet as a pandas dataframe.
    """

    if not is_public_sheet and not google_service_account:
        exc_message = "Missing Google Service Account information."
        raise GoogleSheetsConfigurationException(exc_message)

    if not google_sheet_key:
        exc_message = "Missing the Google Sheet key identifier."
        raise GoogleSheetsConfigurationException(exc_message)

    if not google_sheet_name:
        exc_message = "Missing the Google Sheet name identifier."
        raise GoogleSheetsConfigurationException(exc_message)

    if is_public_sheet:
        sheet = f"""https://docs.google.com/spreadsheets/d/{google_sheet_key}/
            export?format=csv&sheet={google_sheet_name}"""
    else:
        google_credentials = generate_google_credentials(
            google_service_account=google_service_account
        )
        gspread_client = gspread.authorize(google_credentials)
        sheet = gspread_client.open_by_key(google_sheet_key).worksheet(
            google_sheet_name
        )

    sheet_df = get_sheet_dataframe(
        sheet,
        header=0 if first_row_header is True else None,
        parse_dates=True,
        on_bad_lines=on_bad_lines,
        clean=clean,
    )
    return sheet_df

read_google_sheet_as_dict_of_lists

This task leverages the Google Sheets API v4 through the gspread library in order to read the content of a Google Sheet and return it as a dict of lists.

Parameters:

Name Type Description Default
is_public_sheet bool

Whether the Google Sheet is public or not. If True, the google_service_account param will be ignored.

False
google_service_account Union[Dict, str]

The Service Account to be used in order to interact with the Google Sheet. This can be a dict or a string representing the JSON Service Account body.

None
google_sheet_key Optional[str]

The key of the Google Sheet to read data from.

None
google_sheet_name Optional[str]

The name of the Sheet to read data from.

None
first_row_header Optional[bool]

Whether the first row is the header. If True, the first row will be used as header and data won't be read. Otherwise, if set to False. Default set to True

True
on_bad_lines Optional[str]

What to do if bad lines are discovered: 'error': An Exception is raised 'warn': A warning is printed and the line is skipped 'skip': The line is skipped with no warnings Default set to 'error'

'error'
clean Optional[bool]

Used in order to remove blank columns and rows left in the Google Sheet.

False

Returns:

Type Description
List[Dict]

The content of a specific Sheet as a dict of lists.

Source code in prefect_google_sheets/tasks.py
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
@task
def read_google_sheet_as_dict_of_lists(
    is_public_sheet: bool = False,
    google_service_account: Union[Dict, str] = None,
    google_sheet_key: Optional[str] = None,
    google_sheet_name: Optional[str] = None,
    first_row_header: Optional[bool] = True,
    on_bad_lines: Optional[str] = "error",
    clean: Optional[bool] = False,
) -> List[Dict]:
    """
    This task leverages the Google Sheets API v4 through the gspread library
    in order to read the content of a Google Sheet and return it as a dict of lists.
    Args:
        is_public_sheet: Whether the Google Sheet is public or not.
            If True, the google_service_account param will be ignored.
        google_service_account: The Service Account to be used in order to interact with
            the Google Sheet. This can be a dict or a string representing the JSON
            Service Account body.
        google_sheet_key: The key of the Google Sheet to read data from.
        google_sheet_name: The name of the Sheet to read data from.
        first_row_header: Whether the first row is the header.
            If True, the first row will be used as header and data won't be read.
            Otherwise, if set to False. Default set to True
        on_bad_lines: What to do if bad lines are discovered:
            'error': An Exception is raised
            'warn': A warning is printed and the line  is skipped
            'skip': The line is skipped with no warnings
            Default set to 'error'
        clean: Used in order to remove blank columns and rows left in the Google Sheet.
    Raises:
        - `GoogleSheetsConfigurationException`
            if google_service_account not provided or None
        - `GoogleSheetsConfigurationException`
            if google_sheet_key not provided or None
        - `GoogleSheetsConfigurationException`
            if google_sheet_name not provided or None
        - `GoogleSheetServiceAccountError`
            if the google_service_account is not valid or malformed
    Returns:
        The content of a specific Sheet as a dict of lists.
    """

    if not is_public_sheet and not google_service_account:
        exc_message = "Missing Google Service Account information."
        raise GoogleSheetsConfigurationException(exc_message)

    if not google_sheet_key:
        exc_message = "Missing the Google Sheet key identifier."
        raise GoogleSheetsConfigurationException(exc_message)

    if not google_sheet_name:
        exc_message = "Missing the Google Sheet name identifier."
        raise GoogleSheetsConfigurationException(exc_message)

    if is_public_sheet:
        sheet = f"""https://docs.google.com/spreadsheets/d/{google_sheet_key}/
            export?format=csv&sheet={google_sheet_name}"""
    else:
        google_credentials = generate_google_credentials(
            google_service_account=google_service_account
        )
        gspread_client = gspread.authorize(google_credentials)
        sheet = gspread_client.open_by_key(google_sheet_key).worksheet(
            google_sheet_name
        )

    sheet_df = get_sheet_dataframe(
        sheet,
        header=0 if first_row_header is True else None,
        parse_dates=True,
        on_bad_lines=on_bad_lines,
        clean=clean,
    )
    return {
        column_name: sheet_df[column_name].values.tolist()
        for column_name in sheet_df.columns.values
    }

read_google_sheet_as_list_of_lists

This task leverages the Google Sheets API v4 through the gspread library in order to read the content of a Google Sheet and return it as a list of lists.

Parameters:

Name Type Description Default
is_public_sheet bool

Whether the Google Sheet is public or not. If True, the google_service_account param will be ignored.

False
google_service_account Union[Dict, str]

The Service Account to be used in order to interact with the Google Sheet. This can be a dict or a string representing the JSON Service Account body.

None
google_sheet_key Optional[str]

The key of the Google Sheet to read data from.

None
google_sheet_name Optional[str]

The name of the Sheet to read data from.

None
first_row_header Optional[bool]

Whether the first row is the header. If True, the first row will be used as header and data won't be read. Otherwise, if set to False. Default set to True

True
on_bad_lines Optional[str]

What to do if bad lines are discovered: 'error': An Exception is raised 'warn': A warning is printed and the line is skipped 'skip': The line is skipped with no warnings Default set to 'error'

'error'
clean Optional[bool]

Used in order to remove blank columns and rows left in the Google Sheet.

False

Returns:

Type Description
List[List]

The content of a specific Sheet as a list of lists.

Source code in prefect_google_sheets/tasks.py
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
@task
def read_google_sheet_as_list_of_lists(
    is_public_sheet: bool = False,
    google_service_account: Union[Dict, str] = None,
    google_sheet_key: Optional[str] = None,
    google_sheet_name: Optional[str] = None,
    first_row_header: Optional[bool] = True,
    on_bad_lines: Optional[str] = "error",
    clean: Optional[bool] = False,
) -> List[List]:
    """
    This task leverages the Google Sheets API v4 through the gspread library
    in order to read the content of a Google Sheet and return it as a list of lists.
    Args:
        is_public_sheet: Whether the Google Sheet is public or not.
            If True, the google_service_account param will be ignored.
        google_service_account: The Service Account to be used in order to interact with
            the Google Sheet. This can be a dict or a string representing the JSON
            Service Account body.
        google_sheet_key: The key of the Google Sheet to read data from.
        google_sheet_name: The name of the Sheet to read data from.
        first_row_header: Whether the first row is the header.
            If True, the first row will be used as header and data won't be read.
            Otherwise, if set to False. Default set to True
        on_bad_lines: What to do if bad lines are discovered:
            'error': An Exception is raised
            'warn': A warning is printed and the line  is skipped
            'skip': The line is skipped with no warnings
            Default set to 'error'
        clean: Used in order to remove blank columns and rows left in the Google Sheet.
    Raises:
        - `GoogleSheetsConfigurationException`
            if google_service_account not provided or None
        - `GoogleSheetsConfigurationException`
            if google_sheet_key not provided or None
        - `GoogleSheetsConfigurationException`
            if google_sheet_name not provided or None
        - `GoogleSheetServiceAccountError`
            if the google_service_account is not valid or malformed
    Returns:
        The content of a specific Sheet as a list of lists.
    """

    if not is_public_sheet and not google_service_account:
        exc_message = "Missing Google Service Account information."
        raise GoogleSheetsConfigurationException(exc_message)

    if not google_sheet_key:
        exc_message = "Missing the Google Sheet key identifier."
        raise GoogleSheetsConfigurationException(exc_message)

    if not google_sheet_name:
        exc_message = "Missing the Google Sheet name identifier."
        raise GoogleSheetsConfigurationException(exc_message)

    if is_public_sheet:
        sheet = f"""https://docs.google.com/spreadsheets/d/{google_sheet_key}/
            export?format=csv&sheet={google_sheet_name}"""
    else:
        google_credentials = generate_google_credentials(
            google_service_account=google_service_account
        )
        gspread_client = gspread.authorize(google_credentials)
        sheet = gspread_client.open_by_key(google_sheet_key).worksheet(
            google_sheet_name
        )

    sheet_df = get_sheet_dataframe(
        sheet,
        header=0 if first_row_header is True else None,
        parse_dates=True,
        on_bad_lines=on_bad_lines,
        clean=clean,
    )
    return sheet_df.values.tolist()