DATA ANALYSIS
When the domain from
which data are harvested is a science or engineering field, the terms data processing and information systems are considered
too broad, and the more specialized term data analysis is typically used. Data
analysis, arguably a special kind of data processing, focuses on
highly specialized and highly accurate algorithmic derivations and statistical
calculations that are less often observed in the typical general business
environment.
A divergence of culture
between data processing in general and data analysis is exhibited in the
numerical representations generally used: in data processing, measurements are
typically stored as integer, fixed-point, or binary-coded decimal representations of numbers, whereas the majority of
measurements in data analysis are stored as floating-point representations of
rational numbers.
Data Processing
Basically, data are nothing
but facts (organized or unorganized) that can be converted into
other forms to make them useful, clear, and practical. This process of
converting facts into information is processing. Practically all naturally
occurring processes can be viewed as examples of data processing systems where
"observable" information in the form of pressure, light, etc. are converted by human observers into electrical signals in the nervous system as the senses we recognize as touch, sound, and vision. Even the interaction of non-living systems may be
viewed in this way as rudimentary information processing systems. Conventional usage of the terms data processing and information
systems restricts their use to refer to the algorithmic derivations,
logical deductions, and statistical calculations that recur perennially in
general business environments, rather than in the more expansive sense of all
conversions of real-world measurements into real-world information in, say, an
organic biological system or even a scientific or engineering system.
Editing
Data have to be edited,
especially when they relate to responses to open-ended questions of interviews
and questionnaires, or unstructured observations. In other words, information
that may have been noted down by the interviewer, observer, or researcher in a
hurry must be clearly deciphered so that it may be coded systematically in its
entirety. Lack of clarity at this stage will result later in confusion. The
edited data should be identifiable through the use of a different color pencil
or ink so that original information is still available in case of further
doubts.
Incoming mailed
questionnaire data have to be checked for incompleteness and inconsistencies,
if any, by designated members of research staff. Inconsistencies that can be
logically corrected should be rectified and edited at this stage. Much of the
editing is automatically taken care of in the case of computer-assisted
telephone interviews and electronically administered questionnaires, even as
the respondent is answering the question.
Handling
Blank Responses
Not all respondents
answer every item in the questionnaire. Answers may have been left blank
because the respondent did not understand the question, did not know the
answer, was not willing to answer, or was simply indifferent to the need to
respond to the entire questionnaire. If a substantial number of questions – say
25% of the items in the questionnaire – have been left unanswered, it may be a
good idea to throw out the questionnaire and not include it in the data set for
analysis. In this event, it is important to mention the number of returned but
unused responses due to excessive missing data in the final report submitted to
the sponsor of the study. If, however, only two or three items are left blank
in a questionnaire with, say, 30 or more items, we need to decide how these
blank responses are to be handled. One way to handle a blank response to an
interval-scaled item with a midpoint would be to assign the midpoint in the scale
as the response to the particular item. An alternative way is to allow the
computer to ignore the blank responses when the analyses are done. There are
several ways of handling blank responses; a common approach, however, is either
to assign the midpoint in the scale as the value or to ignore the particular item
during the analysis.
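As a minimal sketch of these two options in Python (the response values and the 7-point scale are illustrative, not from the text):

```python
# Sketch: two common ways to handle blank responses (None) on a
# 7-point interval scale -- assign the scale midpoint, or drop the
# blank from the analysis. All values are illustrative.

responses = [5, None, 7, 3, None, 6, 4]  # None marks a blank response
MIDPOINT = 4  # midpoint of a 1-7 scale

# Option 1: substitute the scale midpoint for each blank
imputed = [r if r is not None else MIDPOINT for r in responses]

# Option 2: ignore blanks and analyze only the answered items
answered = [r for r in responses if r is not None]

print(imputed)                        # [5, 4, 7, 3, 4, 6, 4]
print(sum(answered) / len(answered))  # mean of the answered items only
```

Midpoint substitution keeps the case in the data set at the cost of shrinking the item's variance, while ignoring the blank preserves the observed values but reduces the sample size for that item.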
Coding
The next step is to code
the responses. Scanner sheets can be used for collecting questionnaire data;
such sheets facilitate the entry of the responses directly into the computer
without manual keying in of the data. However, if for whatever reason this
cannot be done, then it is perhaps better to use a coding sheet first to
transcribe the data from the questionnaire and then key in the data. This
method, in contrast to flipping through each questionnaire for each item,
avoids confusion, especially when there are many questions and a large number
of questionnaires.
It is possible to key in
the data directly from the questionnaires, but that would require flipping through
several questionnaires, page by page, resulting in possible errors and
omissions of items. Transferring the data first onto a code sheet thus
helps.
Human errors can occur
while coding. At least 10% of the coded questionnaires should therefore be
checked for coding accuracy.
Their selection may
follow a systematic sampling procedure. That is, every nth form coded could be
verified for accuracy. If many errors are found in the sample, all items may
have to be checked.
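The every-nth verification check can be sketched as follows (the batch size and the 10% check rate are illustrative):

```python
# Sketch: systematic sampling of coded forms for verification --
# pick every nth form so that at least 10% of the batch is checked
# for coding accuracy. The form identifiers are illustrative.

forms = list(range(1, 101))   # 100 coded questionnaires, numbered 1-100
check_fraction = 0.10
n = int(1 / check_fraction)   # check every 10th form

sample = forms[n - 1::n]      # forms 10, 20, ..., 100

print(len(sample))  # 10 forms, i.e. 10% of the batch
```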
Categorizing
At this point it is
useful to set up a scheme for categorizing the variables such that the several
items measuring a concept are all grouped together.
Responses to some of the
negatively worded questions also have to be reversed so that all answers run in
the same direction.
If the questions
measuring a concept are not contiguous but scattered over various parts of the
questionnaire, care has to be taken to include all the items without any
omission or wrong inclusion.
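Reverse-scoring a negatively worded item can be sketched in Python as follows (a 1–5 Likert scale is assumed for illustration):

```python
# Sketch: reversing negatively worded items on a 1-5 Likert scale so
# all answers run in the same direction. On a scale from LO to HI,
# the reversed score is (LO + HI) - score.

LO, HI = 1, 5

def reverse(score):
    """Reverse-score a single item on the LO..HI scale."""
    return LO + HI - score

negatively_worded = [1, 5, 2, 4]
print([reverse(s) for s in negatively_worded])  # [5, 1, 4, 2]
```

Note that the midpoint of the scale maps to itself, which is exactly what is wanted: a neutral answer stays neutral after reversal.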
Entering
Data
If questionnaire data
are not collected on scanner answer sheets, which can be directly entered into
the computer as a data file, the raw data will have to be manually keyed into
the computer. Raw data can be entered through any software program. For
instance, in the SPSS Data Editor, which looks like a spreadsheet, each row represents a
case, and each column represents a variable. All missing values will appear
with a period (dot) in the cell. It is possible to add, change, or delete
values after the data have been entered.
It is also easy to
compute the new variables that have been categorized earlier, using the Compute
dialog box, which opens from the Transform menu. Once the missing
values, the recodes, and the computing of new variables are taken care of, the
data are ready for analysis.
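Outside SPSS, the same computation of a new variable can be sketched in plain Python (the item names q1, q5, q9 and the values are hypothetical):

```python
# Sketch: computing a new variable from the items grouped under one
# concept -- the rough analogue of SPSS's Transform > Compute.
# The item names (q1, q5, q9) and values are illustrative.

respondents = [
    {"q1": 4, "q5": 3, "q9": 5},
    {"q1": 2, "q5": 2, "q9": 3},
]

# New variable: mean of the three items measuring one concept
for r in respondents:
    items = [r["q1"], r["q5"], r["q9"]]
    r["concept_score"] = round(sum(items) / len(items), 2)

print([r["concept_score"] for r in respondents])  # [4.0, 2.33]
```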
Interpretation
of Data Analyzed
After the data have been
completely analyzed, the results have to be properly interpreted, so that they
are meaningful to the organization.
Classification and Tabulation of Data in
Research
Classification is the way of arranging the data in
different classes in order to give a definite form and a coherent structure to
the data collected, facilitating their use in the most systematic and effective
manner.
It is the process
of grouping the statistical data under various understandable homogeneous
groups for the purpose of convenient interpretation. A uniformity of attributes
is the basic criterion for classification;
and the grouping
of data is made according to similarity. Classification becomes necessary when
there is diversity in the data collected for meaningful presentation and
analysis. However, in respect of homogeneous presentation of data,
classification may be unnecessary.
Objectives of
classification of data:
- To group
heterogeneous data under the homogeneous group of common characteristics;
- To bring out
similarity among the various groups;
- To facilitate
effective comparison;
- To present complex,
haphazard, and scattered data in a concise, logical, homogeneous, and
intelligible form;
- To maintain
clarity and simplicity of complex data;
- To identify
independent and dependent variables and establish their relationship;
- To establish a
cohesive nature for the diverse data for effective and logical analysis;
- To make
quantification logical and effective.
A good classification should have the characteristics of
clarity, homogeneity, equality of scale, purposefulness, accuracy,
stability, flexibility, and unambiguity.
Classification is of two types, viz., quantitative
classification, which is on the basis of variables or quantity; and qualitative
classification (classification according to attributes). The former is the way
of grouping the variables, say quantifying the variables in cohesive groups,
while the latter groups the data on the basis of attributes or qualities. Again,
it may be multiple classification or dichotomous classification. The former is
the way of making many (more than two) groups on the basis of some quality or
attributes, while the latter is the classification into two groups on the basis
of the presence or absence of a certain quality. Grouping the workers of a
factory under various income (class intervals) groups comes under multiple
classifications; dividing the workers into two groups of skilled and unskilled
workers is dichotomous classification. The tabular form of such classification
is known as statistical series, which may be inclusive or exclusive. The
classified data may be arranged in tabular forms (tables) in columns and rows.
Tabulation is the simplest way of arranging the data, so that anybody can
understand it in the easiest way. It is the most systematic way of presenting
numerical data in an easily understandable form. It facilitates a clear and
simple presentation of the data, a clear expression of the implication, and an
easier and more convenient comparison. There can be simple or complex tables,
and general purpose or summary tables. Classification and tabulation are
interdependent activities in research.
FREQUENCY DISTRIBUTIONS
INTRODUCTION
The next step after the completion of data collection is
to organize the data into a meaningful form so that a trend, if any, emerging
out of the data can be seen easily. One of the common methods for organizing
data is to construct a frequency distribution. A frequency distribution is an
organized tabulation/graphical representation of the number of individuals in
each category on the scale of measurement.[1] It allows the researcher to have a
glance at the entire data conveniently.
It shows whether the observations are high or low and
also whether they are concentrated in one area or spread out across the entire
scale.
Thus, frequency distribution presents a picture of how
the individual observations are distributed in the measurement scale.
DISPLAYING
FREQUENCY DISTRIBUTIONS
Frequency tables
A frequency (distribution) table shows the different
measurement categories and the number of observations in each category.
Before constructing a frequency table, one should have an
idea about the range (minimum and maximum values). The range is divided into
arbitrary intervals called “class intervals.”
If the class
intervals are too many, then there will be no reduction in the bulkiness of
data and minor deviations also become noticeable.
On the other hand, if they are very few, then the shape
of the distribution itself cannot be determined. Generally, 6–14 intervals are
adequate.
The width of the class can be
determined by dividing the range of observations by the number of classes. The
following are some guidelines regarding class widths:[1]
- It is advisable to
have equal class widths. Unequal class widths should be used only when
large gaps exist in data.
- The class
intervals should be mutually exclusive and nonoverlapping.
- Open-ended classes
at the lower and upper side (e.g., <10, >100) should be avoided.
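A minimal sketch of these steps in Python follows; the pulse-rate values are invented for illustration and are not the data behind Table 1:

```python
# Sketch: building a frequency table with equal, non-overlapping
# class widths, plus cumulative and relative frequency.
# width = range / number of classes (rounded up to cover the range).
import math

pulse = [62, 70, 75, 68, 80, 72, 66, 74, 78, 84, 69, 73, 77, 71, 65, 88]

classes = 6
lo, hi = min(pulse), max(pulse)
width = math.ceil((hi - lo + 1) / classes)   # here: ceil(27 / 6) = 5

table = []
cumulative = 0
for k in range(classes):
    lower = lo + k * width
    upper = lower + width            # upper bound exclusive: mutually exclusive classes
    freq = sum(lower <= x < upper for x in pulse)
    cumulative += freq
    rel = freq / len(pulse)
    table.append((lower, upper - 1, freq, cumulative, round(rel, 3)))

for row in table:                     # (lower, upper, freq, cum. freq, rel. freq)
    print(row)
```

Every observation falls in exactly one class, so the last cumulative frequency equals the total number of observations.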
The frequency distribution table of the resting pulse
rate in healthy individuals is given in Table
1. It also gives the
cumulative and relative frequency that helps to interpret the data more easily.
Frequency distribution of the resting pulse
rate in healthy volunteers (N = 63)
Frequency distribution
graphs
A frequency distribution graph is a diagrammatic
illustration of the information in the frequency table.
Histogram
A histogram is a graphical representation of the variable
of interest in the X axis and the number of observations
(frequency) in the Y axis. Percentages can be used if the
objective is to compare two histograms having different numbers of subjects. A
histogram is used to depict the frequency when data are measured on an interval
or a ratio scale. Figure 1 depicts a histogram constructed for the data given
in Table
1.
Histogram of the resting pulse rate in
healthy volunteers (N = 63)
A bar diagram and a histogram may look the same but there
are three important differences between them:
In a histogram, there is no gap between the bars as the
variable is continuous. A bar diagram will have space between the bars.
Not all the bars need be of equal width in a histogram
(this depends on the class intervals), whereas they are equal in a bar diagram.
The area of each bar corresponds to the frequency in a
histogram whereas in a bar diagram, it is the height [Figure 1].
Frequency polygon
A frequency polygon is constructed by connecting all
midpoints of the top of the bars in a histogram by a straight line without
displaying the bars. A frequency polygon aids in the easy comparison of two
frequency distributions. When the total frequency is large and the class
intervals are narrow, the frequency polygon becomes a smooth curve known as the
frequency curve. A frequency polygon illustrating the data in Table
1 is shown in Figure
2.
Frequency polygon of the resting pulse rate
in healthy volunteers (N = 63)
Box and
whisker plot
This graph, first described by Tukey in 1977, can also be
used to illustrate the distribution of data. There is a vertical or horizontal
rectangle (box), the ends of which correspond to the upper and lower quartiles
(75th and 25th percentile, respectively). Hence the
middle 50% of observations are represented by the box.
The length of the box indicates the variability of the
data. The line inside the box denotes the median (sometimes marked as a plus
sign). The position of the median indicates whether the data are skewed or not.
If the median is closer to the upper quartile, then they
are negatively skewed and if it is near the lower quartile, then positively
skewed.
The lines outside the box on either side are known as
whiskers [Figure 3]. These whiskers extend up to 1.5 times the length of the box, i.e.,
the interquartile range (IQR). The ends of the whiskers are called the inner fences,
and any value outside them is an outlier. If the distribution is symmetrical,
then the whiskers are of equal length.
If the data are
sparse on one side, the corresponding side whisker will be short. The outer
fence (usually not marked) is at a distance of three times the IQR on either
side of the box.
The reason behind
having the inner and outer fence at 1.5 and 3 times the IQR, respectively, is
the fact that 95% of observations fall within 1.5 times the IQR, and it is 99%
for 3 times the IQR.
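These quantities can be sketched with the Python standard library (the data are invented; `statistics.quantiles` with n=4 returns the three quartiles):

```python
# Sketch: computing the box-and-whisker quantities -- quartiles, IQR,
# inner and outer fences, and outliers. The data are illustrative.
import statistics

data = [55, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 110]

q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # inner fences
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # outer fences

# Any value beyond the inner fences counts as an outlier
outliers = [x for x in data if x < inner_low or x > inner_high]
print(outliers)  # [110]
```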
Figure 3
Schematic diagram of a “box and whisker plot”
CHARACTERISTICS OF FREQUENCY
DISTRIBUTION
There are four important characteristics of frequency
distribution.
They are as
follows:
- Measures of
central tendency and location (mean, median, mode)
- Measures of
dispersion (range, variance, standard deviation)
- The extent of
symmetry/asymmetry (skewness)
- The flatness or
peakedness (kurtosis).
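As a rough sketch, all four characteristics can be computed with the Python standard library (the data are invented; skewness and excess kurtosis are calculated here from the central moments):

```python
# Sketch: the four characteristics of a frequency distribution --
# central tendency, dispersion, skewness, and kurtosis.
# The data are illustrative; one high value (90) makes them skewed.
import statistics

data = [64, 66, 68, 68, 70, 70, 70, 72, 72, 74, 76, 90]

# 1. Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# 2. Dispersion
rng = max(data) - min(data)
var = statistics.pvariance(data)
sd = statistics.pstdev(data)

# 3 & 4. Moment-based skewness and excess kurtosis
m2 = sum((x - mean) ** 2 for x in data) / len(data)
m3 = sum((x - mean) ** 3 for x in data) / len(data)
m4 = sum((x - mean) ** 4 for x in data) / len(data)
skewness = m3 / m2 ** 1.5   # 0 for a symmetrical distribution
kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution

print(mean, median, mode)
print(round(skewness, 2))  # positive: the long tail is to the right
```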
Diagrammatic Presentation of Data
Introduction
Although tabulation is
a very good technique for presenting data, diagrams are a more advanced technique
for representing them.
A layman cannot
understand tabulated data easily, but with only a single glance at a
diagram, one gets a complete picture of the data presented.
According to M.J.
Moroney, "diagrams register a meaningful impression almost before we
think."
Importance
or utility of Diagrams
- Diagrams give a very clear picture of the data. Even a layman can understand them easily and in a short time.
- Comparisons between different samples can be made very easily, without applying any further statistical technique.
- This technique can be used universally, at any place and at any time, and in almost all subjects and fields.
- Diagrams also have impressive value. Tabulated data does not make as strong an impression as a diagram; a common man is easily impressed by a good diagram.
- This technique can be used for numerical types of statistical analysis, e.g., to locate the mean, mode, median, or other statistical values.
- It not only saves time and energy but is also economical; not much money is needed to prepare even good diagrams.
- Diagrams give much more information than tabulation, which has its own limits.
- Diagrammed data are easily remembered; the diagrams we see leave a more lasting impression than other data techniques.
- Data can be condensed with diagrams. A simple diagram can present what even 10,000 words cannot.
General
Guidelines for Diagrammatic presentation
- The diagram should be properly drawn at the outset. The pith and substance of the subject matter must be made clear under a broad heading that properly conveys the purpose of the diagram.
- The size of the scale should be neither too big nor too small. If it is too big, the diagram may look ugly; if it is too small, it may not convey the meaning. In each case, the size of the paper must be taken note of, as this helps to determine the size of the diagram.
- To clarify certain ambiguities, notes should be added at the foot of the diagram. These provide visual insight into the diagram.
- Diagrams should be absolutely neat and clean. There should be no vagueness or overwriting on the diagram.
- Simplicity means that the diagram should convey its meaning clearly and easily, at first sight.
- The scale must be presented along with the diagram.
- The diagram must be self-explanatory, indicating the nature, place, and source of the data presented.
- Different shades and colors can be used to make diagrams more easily understandable.
- Vertical diagrams should be preferred to horizontal ones.
- The diagram must be accurate. Accuracy must not be sacrificed to make it attractive or impressive.
Limitations
of Diagrammatic Presentation
- Diagrams do not present small differences properly.
- They can easily be misused.
- Only an artist can draw multi-dimensional diagrams.
- In statistical analysis, diagrams are of no use.
- Diagrams are just a supplement to tabulation.
- Only a limited set of data can be presented in the form of a diagram.
- Diagrammatic presentation of data is a more time-consuming process.
- Diagrams present only preliminary conclusions.
- Diagrammatic presentation shows only an estimate of the actual behavior of the variables.
Types
of Diagrams
(a) Line Diagrams
In these diagrams, only a
line is drawn to represent each variable. These lines may be vertical or
horizontal. The lines are drawn such that their lengths are in proportion to the
values of the terms or items, so that comparisons may be made easily.
(b) Simple Bar Diagram
Like line diagrams, these
figures are used where only a single dimension, i.e., length, can present the
data.
The procedure is almost the
same, except that the lines have a measurable thickness. These can also be drawn either
vertically or horizontally.
The breadth of these lines
or bars should be equal, and similarly the distance between the bars should be equal.
The breadth and the distance between bars should be chosen according to the space
available on the paper.
(c) Multiple Bar Diagrams
This diagram is used
when we have to make comparisons between two or more variables. The number of
variables may be 2, 3, 4, or more. In the case of 2 variables, pairs of bars are
drawn; similarly, in the case of 3 variables, we draw triple bars.
The bars are drawn on
the same proportionate basis as in the case of simple bars. The same shade is given
to the same item, and the distance between groups is kept constant.
(d) Sub-divided Bar Diagram
The data which can be
presented by a multiple bar diagram can also be presented by this diagram. In this
case we add the different variables for a period and draw them on a single bar.
The components must be kept in the same order in
each bar. This diagram is more efficient if the number of components is small, i.e., 3
to 5.
(e) Percentage Bar Diagram
Like the sub-divided bar
diagram, in this case also the data of one particular period or variable are put on a
single bar, but in terms of percentages. Components are kept in the same order
in each bar for easy comparison.
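The percentage conversion behind such a bar can be sketched in a few lines of Python (the component names and values are illustrative):

```python
# Sketch: converting a bar's components to percentages, as used in a
# percentage bar diagram. The components and values are illustrative.

components = {"skilled": 120, "semi-skilled": 60, "unskilled": 20}

total = sum(components.values())
percentages = {k: round(100 * v / total, 1) for k, v in components.items()}

print(percentages)  # {'skilled': 60.0, 'semi-skilled': 30.0, 'unskilled': 10.0}
```

The percentages always sum to 100, so every bar in the diagram has the same total height and only the composition varies.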
(f) Duo-directional Bar Diagram
In this case the diagram
is drawn on both sides of the base line, i.e., to the left and right, or above and
below.
(g) Broken Bar Diagram
This diagram is used
when the value of some variable is very high or low compared with the others. In this
case the bars with the bigger terms or items may be shown broken.
Graphic Presentation of Data
Introduction
A graph refers to the
plotting of different values of a variable on graph paper, which shows the
movement of, or change in, the variable over a period of time.
Diagrams can present the
data in an attractive style but still there is a method more reliable than
this.
Diagrams are often used for publicity purposes
but are not of much use in statistical analysis. Hence graphic presentation is
more effective and result oriented.
According to A. L.
Boddington, "The wandering of a line is more powerful in its effect on the
mind than a tabulated statement;
it shows what is happening and what is likely
to take place, just as quickly as the eye is capable of working."
Advantages
of Graphs
The presentation of
statistics in the form of graphs facilitates many processes in economics. The
main uses of graphs are as under:
- Attractive and Effective Presentation of Data: Statistics can be presented in an attractive and effective way by graphs. A fact that an ordinary man cannot understand easily can be understood better through a graph; hence it is said that a picture is worth a thousand words.
- Simple and Understandable Presentation of Data: Graphs help to present complex data in a simple and understandable way, and thus help to remove the complexity of statistics.
- Useful in Comparison: Graphs also help to compare statistics. If the investments made in two different ventures are presented through graphs, it becomes easy to understand the difference between the two.
- Useful for Interpretation: Graphs also help to interpret conclusions, saving time as well as labour.
- Remembered for a Long Period: Graphs help us remember facts for a long time; they are not easily forgotten.
- Helpful in Predictions: Through graphs, tendencies that may occur in the near future can be predicted better.
- Universal Utility: In the modern era, graphs can be used in all spheres such as trade, economics, government departments, advertisement, etc.
- Information as well as Entertainment: Graphs provide entertainment as well as information, and they pose no hindrance to the deep analysis of any information.
- Helpful in Transmission of Information: Graphs help in the transmission and communication of facts.
- No Need for Training: When facts are presented through graphs, no special training is needed to interpret them.
Rules
for the construction of Graph
The following are the
main rules to construct a graph:
- Every graph must have a suitable title which clearly conveys the main idea the graph intends to portray.
- The graph must suit the size of the paper.
- The scale of the graph should be in even numbers or in multiples.
- Footnotes should be given at the bottom to illustrate the main points about the graph.
- A graph should be as simple as possible.
- In order to show many items in a graph, an index for identification should be given.
- A graph should be neat and clean, and it should be appealing to the eyes.
- Every graph should be accompanied by a table so that it can be checked whether the data have been presented accurately.
- The test of a good graph is the ease with which the observer can interpret it. Economy of cost and energy should also be exercised in drawing the graph.
Limitations
Following are the main
drawbacks/ limitations of graphs.
Limited Application: Graphic representation
is useful for a common man, but for an expert its utility is limited.
Lack of Accuracy: Graphs do not measure
the magnitude of the data; they only depict the fluctuations in them.
Subjective: Graphs are subjective in character, and their interpretation
varies from person to person.
Misleading Conclusions: A person who has no knowledge of the subject
can draw misleading conclusions from graphs.
How to
choose a scale for a graph?
The scale indicates the
unit of a variable that a fixed length of axis would represent. Scale may be
different for both the axes.
It should be chosen in such a way as to
accommodate the whole of the data on the given graph paper in a lucid and attractive
style.
Sometimes the data to be presented
contain very large values; the scale must still be chosen so that the graph
presents the given data and allows comparison.
Types
of Graphs
There are two types of
graphs.
- Time series Graphs or
Historigrams.
- Frequency Distribution Graphs.
Time series graphs may
be of one variable, two variables or more variables graph. Frequency
distribution graphs present (a) histograms (b) Frequency Polygons (c) Frequency
Curves and (d) Ogives.



